Efficient Marginalization of Discrete and Structured Latent Variables via Sparsity

07/03/2020 ∙ by Gonçalo M. Correia, et al.

Training neural network models with discrete (categorical or structured) latent variables can be computationally challenging, due to the need for marginalization over large or combinatorial sets. To circumvent this issue, one typically resorts to sampling-based approximations of the true marginal, requiring noisy gradient estimators (e.g., score function estimator) or continuous relaxations with lower-variance reparameterized gradients (e.g., Gumbel-Softmax). In this paper, we propose a new training strategy which replaces these estimators by an exact yet efficient marginalization. To achieve this, we parameterize discrete distributions over latent assignments using differentiable sparse mappings: sparsemax and its structured counterparts. In effect, the support of these distributions is greatly reduced, which enables efficient marginalization. We report successful results in three tasks covering a range of latent variable modeling applications: a semisupervised deep generative model, a latent communication game, and a generative model with a bit vector latent representation. In all cases, we obtain good performance while still achieving the practicality of sampling-based approximations.




1 Introduction

Neural latent variable models are powerful and expressive tools for finding patterns in high-dimensional data, such as images or text [16, 17, 39]. Of particular interest are discrete latent variables, which can recover categorical and structured encodings of hidden aspects of the data, leading to compact representations and, in some cases, superior explanatory power [46, 1]. However, with discrete variables, training can become challenging, due to the need to compute a gradient of a large sum over all possible latent variable assignments, with each term itself being potentially expensive. This challenge is typically tackled by estimating the gradient with Monte Carlo methods [MC; 31], which rely on sampling estimates. The two most common strategies for MC gradient estimation are the score function estimator [SFE; 41, 35], which suffers from high variance, and surrogate methods that rely on a continuous relaxation of the latent variable, like straight-through [2] or Gumbel-Softmax [26, 12], which potentially reduce variance but introduce bias and modeling assumptions.

In this work, we take a step back and ask: Can we avoid sampling entirely, and instead deterministically evaluate the sum with less computation? To answer affirmatively, we propose an alternative method to train these models by parameterizing the discrete distribution with sparse mappings: sparsemax [28] and two structured counterparts, SparseMAP [33] and a novel mapping, top-$k$ sparsemax. Sparsity implies that some assignments of the latent variable are entirely ruled out. This leads to the corresponding terms in the sum evaluating trivially to zero, allowing us to disregard potentially expensive computations.


We introduce a general strategy for learning deep models with discrete latent variables that hinges on learning a sparse distribution over the possible assignments. In the unstructured categorical case our strategy relies on the sparsemax activation function, presented in §3, while in the structured case we propose two strategies, SparseMAP and top-$k$ sparsemax, presented in §4. Unlike existing approaches, our strategies involve neither MC estimation nor any relaxation of the discrete latent variable to the continuous space. We demonstrate our strategy on three different applications: a semisupervised generative model, an emergent communication game, and a bit-vector variational autoencoder. We provide a thorough analysis and comparison to MC methods, and, when feasible, to exact marginalization. Our approach is consistently a top performer, combining the accuracy and robustness of exact marginalization with the efficiency of single-sample estimators.


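We denote scalars, vectors, matrices, and sets as $a$, $\mathbf{a}$, $\mathbf{A}$, and $\mathcal{A}$, respectively. The indicator vector is denoted by $\mathbf{e}_i$, for which every entry is zero, except the $i$th, which is 1. The simplex is denoted $\triangle^K := \{\boldsymbol{\xi} \in \mathbb{R}^K : \langle \mathbf{1}, \boldsymbol{\xi} \rangle = 1, \boldsymbol{\xi} \geq \mathbf{0}\}$. $\mathrm{H}(p)$ denotes the Shannon entropy of a distribution $p(z)$, and $\mathrm{KL}[p \,\|\, q]$ denotes the Kullback-Leibler divergence of $p(z)$ from $q(z)$. The number of nonzeros of a sequence $z$ is denoted $\|z\|_0 := |\{t : z_t \neq 0\}|$. Letting $z \in \mathcal{Z}$ be a random variable, we write the expectation of a function $f(z)$ under distribution $p(z)$ as $\mathbb{E}_{p(z)}[f(z)]$.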

2 Background

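We assume throughout a latent variable model with observed variables $x \in \mathcal{X}$ and latent stochastic variables $z \in \mathcal{Z}$. The overall fit to a dataset $\mathcal{D}$ is $\sum_{x \in \mathcal{D}} \mathcal{L}_x(\theta)$, where the loss of each observation,

$$\mathcal{L}_x(\theta) = \mathbb{E}_{\pi(z \mid x, \theta)}\big[\ell(x, z; \theta)\big] = \sum_{z \in \mathcal{Z}} \pi(z \mid x, \theta)\, \ell(x, z; \theta), \tag{1}$$

is the expected value of a downstream loss $\ell(x, z; \theta)$ under a probability model $\pi(z \mid x, \theta)$ of the latent variable. To model complex data, one parameterizes both the downstream loss and the distribution over latent assignments using neural networks, due to their flexibility and capacity [17].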

In this work, we study discrete latent variables, where $|\mathcal{Z}|$ is finite, but possibly very large. One example is when $\pi(z \mid x, \theta)$ is a categorical distribution, parametrized by a vector $\boldsymbol{\xi} \in \triangle^{|\mathcal{Z}|}$. To obtain $\boldsymbol{\xi}$, a neural network computes a vector of scores $\mathbf{s} \in \mathbb{R}^{|\mathcal{Z}|}$, one score for each assignment, which is then mapped to the probability simplex, typically via $\boldsymbol{\xi} = \mathrm{softmax}(\mathbf{s})$. Another example is when $\mathcal{Z}$ is a structured (combinatorial) set, such as $\mathcal{Z} = \{0, 1\}^D$. In this case, $|\mathcal{Z}|$ grows exponentially with $D$ and it is infeasible to enumerate and score all possible assignments. For this structured case, scoring assignments involves a decomposition into parts, which we describe in §4.

Training such models requires summing the contributions of all assignments of the latent variable, which involves as many as $|\mathcal{Z}|$ evaluations of the downstream loss. When $|\mathcal{Z}|$ is not too large, the expectation may be evaluated explicitly, and learning can proceed with exact gradient updates. If $|\mathcal{Z}|$ is large, and/or if $\ell$ is an expensive computation, evaluating the expectation becomes prohibitive. In such cases, practitioners typically turn to MC estimates of $\nabla_\theta \mathcal{L}_x(\theta)$ derived from latent assignments sampled from $\pi(z \mid x, \theta)$. Under an appropriate learning rate schedule, this procedure converges to a local optimum of $\mathcal{L}_x(\theta)$ as long as gradient estimates are unbiased [40]. Next, we describe the two current main strategies for MC estimation of this gradient. Later, in §3 and §4, we propose our deterministic alternative, based on sparsifying $\pi(z \mid x, \theta)$.

Monte Carlo gradient estimates.

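Let $\theta = (\theta_\pi, \theta_\ell)$, where $\theta_\pi$ is the subset of weights that $\pi$ depends on, and $\theta_\ell$ the subset of weights that $\ell$ depends on. Given a sample $z \sim \pi(z \mid x, \theta_\pi)$, an unbiased estimator of the gradient of Eq. 1 w.r.t. $\theta_\ell$ is $\nabla_{\theta_\ell} \mathcal{L}_x(\theta) \approx \nabla_{\theta_\ell} \ell(x, z; \theta_\ell)$. Unbiased estimation of $\nabla_{\theta_\pi} \mathcal{L}_x(\theta)$ is less trivial, since $\theta_\pi$ is involved in the sampling of $z$, but can be done with SFE [41, 35]: $\nabla_{\theta_\pi} \mathcal{L}_x(\theta) \approx \ell(x, z; \theta_\ell)\, \nabla_{\theta_\pi} \log \pi(z \mid x, \theta_\pi)$, also known as REINFORCE [50]. The SFE is powerful and general, making no assumptions on the form of $z$ or $\ell$, requiring only a sampling oracle and a way to assess gradients of $\log \pi(z \mid x, \theta_\pi)$. However, it comes with the cost of high variance. Making the estimator practically useful requires variance reduction techniques such as baselines [50, 9] and control variates [49, 47, 8]. Variance reduction can also be achieved with Rao-Blackwellization techniques such as sum and sample [5, 37, 25], which marginalizes an expectation over the top-$k$ elements of $\pi(z \mid x, \theta_\pi)$ and takes a sample estimate from the complement set.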
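To make the estimator concrete, the following minimal sketch compares the single-sample SFE with the exact gradient for a small softmax-parametrized categorical variable. It is an illustration under stated assumptions, not the paper's implementation: the gradient is taken with respect to the scores rather than network weights, the loss is a fixed table of values, and all function names are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def exact_gradient(scores, losses):
    """Gradient of sum_z pi_z * loss_z w.r.t. the scores, by full enumeration."""
    pi = softmax(scores)
    expected = sum(p * l for p, l in zip(pi, losses))
    # For a softmax parametrization, d/ds_j sum_z pi_z l_z = pi_j * (l_j - E[l]).
    return [p * (l - expected) for p, l in zip(pi, losses)]


def sfe_gradient(scores, losses, rng):
    """Single-sample score function estimate: loss(z) * grad_s log pi(z)."""
    pi = softmax(scores)
    z = rng.choices(range(len(pi)), weights=pi)[0]
    # For a softmax parametrization, grad_s log pi(z) = e_z - pi.
    return [losses[z] * ((1.0 if j == z else 0.0) - pi[j]) for j in range(len(pi))]


def averaged_sfe(scores, losses, n_samples, seed=0):
    """Average many single-sample estimates (assumed setup to show unbiasedness)."""
    rng = random.Random(seed)
    total = [0.0] * len(scores)
    for _ in range(n_samples):
        g = sfe_gradient(scores, losses, rng)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / n_samples for t in total]
```

A single call to `sfe_gradient` needs only one loss evaluation but is noisy; averaging many calls approaches `exact_gradient`, which needs one loss evaluation per assignment.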

Reparametrization trick.

For continuous latent variables, low-variance pathwise gradient estimators can be obtained by separating the source of stochasticity from the sampling parameters, using the so-called reparametrization trick [17, 39]. For discrete latent variables, reparametrizations can only be obtained by introducing a step function like $\arg\max$, with null gradients almost everywhere. Replacing the gradient of $\arg\max$ with a non-flat surrogate like the identity function, known as Straight-Through [2], or with the gradient of $\mathrm{softmax}$, known as Gumbel-Softmax [26, 12], leads to a biased estimator that can still perform well in practice. Continuous relaxations like Straight-Through and Gumbel-Softmax are only possible under a further modeling assumption that $\ell$ is defined continuously (thus differentiably) in a neighbourhood of the indicator vector $\mathbf{e}_z$ for every $z \in \mathcal{Z}$. In contrast, SFE-based methods and our approach make no such assumption.
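As a rough illustration of the relaxation mentioned above, the sketch below draws a Gumbel-Softmax sample: Gumbel noise is added to the log-probabilities and the hard $\arg\max$ is replaced by a temperature-controlled softmax. The function names and the clamping constant are assumptions made for this example.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of values."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax_sample(log_probs, temperature, rng):
    """Relaxed one-hot sample: softmax((log_probs + Gumbel noise) / temperature)."""
    noisy = []
    for lp in log_probs:
        # Clamp away from 0 so that log(u) is always defined.
        u = max(rng.random(), 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((lp + gumbel) / temperature)
    return softmax(noisy)
```

The output is a dense point in the interior of the simplex, which is why the downstream loss must accept such continuous inputs; lowering `temperature` pushes the sample closer to a one-hot vector for the same noise.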

3 Efficient Marginalization via Sparsity

The challenge of computing the exact expectation in Eq. 1 is linked to the need to compute a sum with a large number of terms. This holds when the probability distribution over latent assignments is dense (i.e., every assignment has non-zero probability), which is indeed the case for most parameterizations of discrete distributions. Our proposed methods hinge on sparsifying this sum.

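Take the example where $\mathcal{Z} = \{1, \dots, K\}$, with a neural network predicting from $x$ a $K$-dimensional vector of real-valued scores $\mathbf{s}$, such that $s_z$ is the score of $z$. (This score is not to be confused with the “score function” of SFE, which refers to the gradient of the log-likelihood.) The traditional way to obtain the vector $\boldsymbol{\xi}$ parametrizing $\pi(z \mid x, \theta)$ is with the softmax transform, i.e., $\boldsymbol{\xi} = \mathrm{softmax}(\mathbf{s})$. Since this gives $\pi(z \mid x, \theta) \propto \exp(s_z)$, the expectation in Eq. 1 depends on $\ell(x, z; \theta)$ for every possible $z$.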

We rethink this standard parametrization, proposing a sparse mapping from scores to the simplex. In particular, we substitute softmax by sparsemax [28]:

$$\mathrm{sparsemax}(\mathbf{s}) := \arg\min_{\boldsymbol{\xi} \in \triangle^{K}} \|\boldsymbol{\xi} - \mathbf{s}\|_2^2. \tag{2}$$

Like softmax, sparsemax is differentiable and has efficient forward and backward passes [11, 28]. However, since Eq. 2 is the Euclidean projection operator onto the probability simplex, sparsemax can assign exactly zero probabilities whenever it hits the simplex boundary, unlike softmax.

Our main insight is that with a sparse parametrization of $\pi(z \mid x, \theta)$, we can compute the expectation in Eq. 1 evaluating $\ell(x, z; \theta)$ only for assignments $z$ with $\pi(z \mid x, \theta) > 0$. This leads to a powerful alternative to MC estimation, which requires fewer than $|\mathcal{Z}|$ evaluations of $\ell$, and which strategically, yet deterministically, selects which assignments to evaluate $\ell$ on. Empirically, our analysis in §5 reveals an adaptive behavior of this sparsity-inducing mechanism, performing more loss evaluations in early iterations while the model is uncertain, and quickly reducing the number of evaluations, especially for unambiguous data points. This is a notable property of our learning strategy: in contrast, MC estimation cannot decide when an ambiguous data point may require more sampling for accurate estimation, and directly evaluating Eq. 1 with the dense $\boldsymbol{\xi}$ resulting from a softmax parametrization never reduces the number of evaluations required, even for simple instances.
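The idea can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' code: it computes sparsemax with the standard sort-and-threshold procedure and then evaluates the expectation only over the support, counting how many loss evaluations were needed. The names `sparse_expectation` and `loss_fn` are hypothetical.

```python
def sparsemax(scores):
    """Euclidean projection of `scores` onto the probability simplex."""
    sorted_scores = sorted(scores, reverse=True)
    cumulative = 0.0
    threshold = 0.0
    for k, value in enumerate(sorted_scores, start=1):
        cumulative += value
        # The k-th largest score is in the support while
        # 1 + k * value exceeds the sum of the k largest scores.
        if 1.0 + k * value > cumulative:
            threshold = (cumulative - 1.0) / k
    return [max(s - threshold, 0.0) for s in scores]


def sparse_expectation(scores, loss_fn):
    """Exact expectation of loss_fn under sparsemax(scores).

    Returns the expectation and the number of loss evaluations,
    skipping every assignment with exactly zero probability.
    """
    probs = sparsemax(scores)
    total, calls = 0.0, 0
    for z, p in enumerate(probs):
        if p > 0.0:
            total += p * loss_fn(z)
            calls += 1
    return total, calls
```

For example, with scores `[2.0, 1.5, 0.1, -1.0]` the support has two elements, so the expectation is exact after two loss evaluations instead of four.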

4 Structured Latent Variables

While the approach described in §3 theoretically applies to any discrete distribution, many models of interest involve structured (or combinatorial) latent variables. In this section, we assume $z$ can be represented as a bit-vector, i.e., a random vector of discrete binary variables $\mathbf{a}_z \in \{0, 1\}^D$. This assignment of binary variables may involve global factors and constraints (e.g., tree constraints, or budget constraints on the number of active variables, i.e., $\sum_{i=1}^{D} [\mathbf{a}_z]_i \leq B$, where $B$ is the maximum number of variables allowed to activate at the same time). In such structured problems, $|\mathcal{Z}|$ increases exponentially with $D$, making exact evaluation of $\mathcal{L}_x(\theta)$ prohibitive, even with sparsemax.

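Structured prediction typically handles this combinatorial explosion by parametrizing scores for individual binary variables and interactions within the global structured configuration, yielding a compact vector of variable scores $\mathbf{t} \in \mathbb{R}^D$ (e.g., log-potentials for binary attributes), with $D \ll |\mathcal{Z}|$. Then, the score of some global configuration $z$ is $s_z := \langle \mathbf{a}_z, \mathbf{t} \rangle$. The variable scores induce a unique Gibbs distribution over structures, given by $\pi(z \mid x, \theta) \propto \exp(s_z)$. Equivalently, defining $\mathbf{A} \in \mathbb{R}^{D \times |\mathcal{Z}|}$ as the matrix with columns $\mathbf{a}_z$ for all $z \in \mathcal{Z}$, we consider the discrete distribution parameterized by $\mathrm{softmax}(\mathbf{s})$, where $\mathbf{s} = \mathbf{A}^\top \mathbf{t}$. (In the unstructured case, $\mathbf{A} = \mathbf{I}$.)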

In practice, however, we cannot materialize the matrix $\mathbf{A}$ or the global score vector $\mathbf{s}$, let alone compute the softmax and the sum in Eq. 1. The SFE, however, can still be used, provided that exact sampling of $z \sim \pi(z \mid x, \theta)$ is feasible, and efficient algorithms exist for computing the normalizing constant $\sum_{z' \in \mathcal{Z}} \exp(s_{z'})$ [48], needed to compute the probability of a given sample.

While it may be tempting to consider using sparsemax to avoid the expensive marginalization, this is prohibitive too: solving the problem in Eq. 2 still requires explicit manipulation of the large vector $\mathbf{s} \in \mathbb{R}^{|\mathcal{Z}|}$, and even if we could avoid this, in the worst case ($\mathbf{s} = \mathbf{0}$) the resulting sparsemax distribution would still have exponentially large support. Fortunately, we show next that it is still possible to develop sparsification strategies to handle the combinatorial explosion of $\mathcal{Z}$ in the structured case. We propose two different methods to obtain a sparse distribution supported only over a bounded-size subset of $\mathcal{Z}$: top-$k$ sparsemax (§4.1) and SparseMAP (§4.2).

4.1 Top-$k$ Sparsemax

Recall that the sparsemax operator (Eq. 2) is simply the Euclidean projection onto the $|\mathcal{Z}|$-dimensional probability simplex. While this projection has a propensity to be sparse, there is no upper bound on the number of non-zeros of the resulting distribution. When $|\mathcal{Z}|$ is large, one possibility is to add a cardinality constraint $\|\boldsymbol{\xi}\|_0 \leq k$ for some prescribed $k$ [4]. The resulting problem becomes

$$\text{top-}k\text{-sparsemax}(\mathbf{s}) := \arg\min_{\boldsymbol{\xi} \in \triangle^{|\mathcal{Z}|},\ \|\boldsymbol{\xi}\|_0 \leq k} \|\boldsymbol{\xi} - \mathbf{s}\|_2^2, \tag{3}$$

which is known as a sparse projection onto the simplex and studied in detail by Kyrillidis et al. [21]. Remarkably, while this is a non-convex problem, its solution can be written as a composition of two functions: a top-$k$ operator $\mathrm{top}_k$, which returns a vector identical to its input but where all the entries not among the $k$ largest ones are masked out (set to $-\infty$), and the $k$-dimensional sparsemax operator. Formally, $\text{top-}k\text{-sparsemax}(\mathbf{s}) = \mathrm{sparsemax}(\mathrm{top}_k(\mathbf{s}))$. Being a composition of operators, its Jacobian becomes a product of matrices and hence simple to compute (the Jacobian of $\mathrm{top}_k$ is a diagonal matrix whose diagonal is a multi-hot vector indicating the top-$k$ elements of $\mathbf{s}$).

To apply the top-$k$ sparsemax to a large or combinatorial set $\mathcal{Z}$, all we need is a primitive to compute the top-$k$ entries of $\mathbf{s}$; this is available for many structured problems (for example, sequential models via $k$-best dynamic programming) and, when $\mathcal{Z}$ is the set of joint assignments of $D$ discrete binary variables, it can be done efficiently. After enumerating this set, we parameterize $\pi(z \mid x, \theta)$ by applying the sparsemax transformation to that top-$k$. Note that this method is identical to sparsemax whenever $\|\mathrm{sparsemax}(\mathbf{s})\|_0 \leq k$: if during training the model learns to assign a sparse distribution to the latent variable, we are effectively using a sparsemax parametrization as presented in §3 with cheap computation. In fact, the solution of Eq. 3 gives us a certificate of optimality whenever it has fewer than $k$ non-zeros.
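The composition described in this subsection can be sketched as follows, for the simple case where the score vector is small enough to hold in memory (in the structured case the top-$k$ entries would instead come from a $k$-best primitive). This is an illustrative sketch; the function names and the returned `is_exact` flag are assumptions for this example.

```python
def sparsemax(scores):
    """Euclidean projection of `scores` onto the probability simplex."""
    sorted_scores = sorted(scores, reverse=True)
    cumulative = 0.0
    threshold = 0.0
    for k, value in enumerate(sorted_scores, start=1):
        cumulative += value
        if 1.0 + k * value > cumulative:
            threshold = (cumulative - 1.0) / k
    return [max(s - threshold, 0.0) for s in scores]


def topk_sparsemax(scores, k):
    """Sparsemax restricted to the k highest-scoring assignments.

    Returns the full probability vector and a flag that is True when
    fewer than k entries are non-zero, in which case the result
    coincides with plain sparsemax (the certificate of optimality).
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    top_probs = sparsemax([scores[i] for i in top])
    probs = [0.0] * len(scores)
    for i, p in zip(top, top_probs):
        probs[i] = p
    is_exact = sum(1 for p in top_probs if p > 0.0) < k
    return probs, is_exact
```

When `is_exact` is False, some probability mass that sparsemax would have placed outside the top $k$ may have been discarded, so the result is only an approximation of the unconstrained projection.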

4.2 SparseMAP

A second possibility to obtain efficient summation over a combinatorial space without imposing any constraints on $\|\boldsymbol{\xi}\|_0$ is to use SparseMAP [33, 34], a structured extension of sparsemax:

$$\mathrm{SparseMAP}(\mathbf{t}) := \arg\min_{\boldsymbol{\xi} \in \triangle^{|\mathcal{Z}|}} \|\mathbf{A}\boldsymbol{\xi} - \mathbf{t}\|_2^2. \tag{4}$$

SparseMAP has been used successfully in discriminative latent models to model structures such as trees and matchings, and Niculae et al. [33] propose an active set algorithm for evaluating it and computing gradients efficiently, requiring only a primitive for computing $\arg\max_{z \in \mathcal{Z}} \langle \mathbf{a}_z, \mathbf{t} \rangle$. While the $\arg\min$ in (4) is generally not unique, solutions with support size at most $D + 1$ are guaranteed to exist by Carathéodory’s theorem, and the active set algorithm of [33] enjoys linear and finite convergence to a very sparse optimal distribution. Crucially, (4) has a solution $\boldsymbol{\xi}^\star$ such that the set of assignments with $\xi^\star_z > 0$ grows only linearly with $D$, and is therefore much smaller than $\mathcal{Z}$. Therefore, assessing the expectation in Eq. 1 only requires evaluating $O(D)$ terms.
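For intuition, the special case of unconstrained bit-vectors admits a closed-form sparse solution, sketched below. This is not the active set algorithm of [33]; it relies on the assumption that every bit-vector in $\{0,1\}^D$ is allowed, so that the set of achievable marginals is the hypercube $[0,1]^D$ and the optimal marginals are the scores clipped to $[0,1]$. A sorting-based decomposition then yields a distribution with at most $D+1$ bit-vectors in its support. The function name is hypothetical.

```python
def sparsemap_bits(t):
    """One sparse solution of Eq. 4 for unconstrained bit-vectors.

    Returns the optimal marginals and a list of (bit_vector, weight)
    pairs with positive weights that sum to one.
    """
    d = len(t)
    # Projection of the scores onto the hypercube [0, 1]^D.
    marginals = [min(max(ti, 0.0), 1.0) for ti in t]
    # Sort coordinates by decreasing marginal; the j-th vertex activates
    # the j coordinates with the largest marginals.
    order = sorted(range(d), key=lambda i: marginals[i], reverse=True)
    levels = [1.0] + [marginals[i] for i in order] + [0.0]
    support = []
    for j in range(d + 1):
        weight = levels[j] - levels[j + 1]
        if weight > 0.0:
            active = set(order[:j])
            bits = tuple(1 if i in active else 0 for i in range(d))
            support.append((bits, weight))
    return marginals, support
```

Summing `weight * bits` over the returned support recovers the marginals, so the expectation in Eq. 1 under this distribution needs at most $D+1$ loss evaluations, in line with the Carathéodory bound quoted above.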

5 Experimental Analysis

We next demonstrate the applicability of our proposed strategies by tackling three tasks: a deep generative model with semisupervision (§5.1), an emergent communication two-player game over a discrete channel (§5.2), and a variational autoencoder with latent binary factors (§5.3). We describe any further architecture and hyperparameter details in the appendix.

5.1 Semisupervised Variational Auto-encoder (VAE)

Figure 1: Semisupervised VAE on MNIST. Left: Learning curves (test). Right: Average test results and standard errors over 10 runs, reporting Accuracy (%) and Dec. calls for the Monte Carlo methods (SFE, SFE+, Gumbel, Sum&Sample) and the marginalization methods (Dense, Sparse (proposed)).

We consider the semisupervised VAE of [18], which models the joint probability $p(x, h, z) = p(x \mid z, h)\, p(z)\, p(h)$, where $x$ is an observation (an MNIST image), $h$ is a continuous latent variable with a standard Gaussian prior, and $z$ is a discrete random variable with a uniform prior over $K$ categories. The marginal $p(x)$ is intractable, due to marginalization of $h$. For a fixed $h$ (e.g., sampled), marginalizing $z$ requires $K$ calls to the decoder, which can be costly depending on the decoder’s architecture.

To circumvent the need for the marginal likelihood, Kingma et al. [18] use variational inference [13], with an approximate posterior $q(h, z \mid x)$ with parameters $\lambda$. This trains a classifier $q(z \mid x)$ along with the generative model. In [18], $h$ is sampled with a reparameterization, and the expectation over $z$ is computed in closed form, that is, assessing all $K$ terms of the sum for a sampled $h$. Under the notation in §2, we let the classifier $q(z \mid x)$ play the role of $\pi(z \mid x, \theta)$ and choose $\ell$ so that Eq. 1 becomes the (negative) evidence lower bound (ELBO). To update the parameters of the continuous posterior, we use the reparameterization trick to obtain gradients through a sampled $h$. For the classifier, we may still explicitly marginalize over each possible assignment of $z$, but this has a multiplicative cost on $K$. As an alternative, we experiment with parameterizing $q(z \mid x)$ with a sparse mapping, comparing it to the original formulation and to stochastic gradients based on SFE and continuous relaxations of $z$.

Data and architecture.

We evaluate this model on the MNIST dataset [23], using 10% of labeled data, treating the remaining data as unlabeled. For the parameterization of the model components we follow the architecture and training procedure used in [25]. Each model was trained for 200 epochs.


Our proposal’s key ingredient is sparsity, which permits exact marginalization and a deterministic gradient. To investigate the impact of sparsity alone, we report a comparison against the exact marginalization over the entire support using a dense softmax parameterization. To investigate the impact of deterministic gradients, we compare to stochastic gradient strategies: (i) unbiased SFE with a moving average baseline; (ii) SFE with a self-critic baseline [SFE+; 38] (that is, the baseline corresponds to the log-likelihood assessed at an independent sample and treated as independent of the parameters of the generative model); (iii) sum-and-sample, a Rao-Blackwellized version of SFE [25]; and (iv) Gumbel-Softmax.

Results and discussion.

In Fig. 1, we see that our proposed sparse marginalization approach performs just as well as its dense counterpart, both in terms of ELBO and accuracy. However, by inspecting the number of times each method calls the decoder, we can see that the effective support of our method is much smaller: sparsemax-parameterized posteriors get very confident, and mostly require one, and sometimes two, calls to the decoder. Regarding the Monte Carlo methods, the continuous relaxation done by Gumbel-Softmax underperformed all the other methods, with the exception of SFE with a moving average baseline. While SFE+ and Sum&Sample are very strong performers, they always require the same number of calls to the decoder throughout training (in this case, two). On the other hand, sparsemax makes a small number of decoder calls not due to a choice of hyperparameters but thanks to the model converging to a small support, so that this method needs fewer computations as it becomes more confident.

5.2 Emergent Communication Game

Emergent communication studies how two agents can develop a communication protocol to solve a task collaboratively [19]. Recent work used neural latent variable models to train these agents via a “collaborative game” between them [24, 22, 10, 14, 7, 45]. In [22], one of the agents (the sender) sees an image and sends a single symbol message $z$ chosen from a set $\mathcal{Z}$ (the vocabulary) to the other agent (the receiver), who needs to choose the correct image out of a collection of images. (Lazaridou et al. [22] let the sender see the full collection. In contrast, we follow [10] in showing only the correct image to the sender. This makes the game harder, as the message needs to encode a good “description” of the correct image instead of encoding only its differences from the other images.) They found that the messages communicated this way can be correlated with broad object properties amenable to interpretation by humans. In our framework, $\pi$ corresponds to the sender and $\ell$ to the receiver.

Data and architecture.

We follow the architecture described in [22]. However, to make the game harder, we increase the number of images in the collection, as suggested by [10], from 2 to 16. All methods are trained for 500 epochs.


We compare our method to SFE with a moving average baseline trained with 0/1 loss and negative log-likelihood loss, Gumbel-Softmax, Straight-Through Gumbel-Softmax, and exact marginalization under a dense softmax parameterization of $\pi(z \mid x, \theta)$.

Results and discussion.

Table 1: Emergent communication success test results, averaged across 10 runs. Random guess baseline: 6.25%. Columns: Method, Comm. succ. (%), Dec. calls. Rows: Monte Carlo (SFE (0/1), ST Gumbel) and Sparse (proposed).

Table 1 shows the communication success (accuracy of the receiver at picking the correct image). While the communication success for a collection of 2 images in [22] was close to perfect, we see that increasing the collection to 16 images makes this game much harder for sampling-based approaches. In fact, only the models that do explicit marginalization achieve close to perfect communication in the test set. However, as the vocabulary size $|\mathcal{Z}|$ increases, marginalizing with a softmax parameterization gets computationally more expensive, as it requires $|\mathcal{Z}|$ forward and backward passes on the receiver. Unlike softmax, the model trained with sparsemax outputs a very small support, requiring only 3 times more receiver calls, on average, than sampling-based approaches. In fact, sparsemax begins quite dense, but its support quickly falls to being close to 1 (see App. C).

5.3 Bit-Vector Variational Autoencoder

Figure 2: Test results for Fashion-MNIST. Left and middle: RD plots. Right: NLL in bits/dim for the Monte Carlo method (SFE) and the marginalization methods (top-$k$ sparsemax, SparseMAP, SparseMAP (w/ budget)).

As described in §4, in many interesting problems, combinatorial interactions and constraints make $\mathcal{Z}$ exponentially large. In this section, we study the illustrative case of encoding (compressing) images into a binary codeword $z$, by training a latent bit-vector variational autoencoder [12, 30]. One approach for parameterizing the approximate posterior is to use a Gibbs distribution, decomposable as a product of independent Bernoullis, $q(z \mid x) = \prod_{i=1}^{D} q(z_i \mid x)$, with $D$ being the number of latent variables. While marginalizing over all the possible $z \in \mathcal{Z}$ is intractable, drawing samples can be done efficiently by sampling each component independently, and the entropy has a closed-form expression. This efficient sampling and entropy computation relies on an independence assumption; in general, we may not have access to such efficient computation.

Training this VAE amounts to minimizing the negative ELBO; we use a uniform prior over $z$. This objective does not constrain the approximate posterior to the Gibbs parameterization, and thus to apply our methods we depart from it, as described next.

Top-$k$ sparsemax parametrization.

As pointed out in §4, we cannot explicitly handle the structured sparsemax distribution $\mathrm{sparsemax}(\mathbf{s})$, as it involves a vector of dimension $2^D$. However, given $\mathbf{t}$, we can efficiently find the $k$ largest configurations with the procedure described in §4.1, and thus we can evaluate top-$k$ sparsemax efficiently.

SparseMAP parametrization.

Another sparse alternative to the intractable structured sparsemax, as discussed in §4, is SparseMAP. In this case, we compute an optimal distribution using the active set algorithm of [33], by using a maximization oracle which can be computed in $O(D)$:

$$\arg\max_{z \in \mathcal{Z}} \langle \mathbf{a}_z, \mathbf{t} \rangle, \quad \text{attained by activating bit } i \text{ whenever } t_i > 0.$$

Since SparseMAP can handle structured problems, we also experimented with adding a budget constraint to SparseMAP: this is done by constraining the number of active bits to be at most a budget $B$. The budget constraint ensures the images are represented with sparse codes, and the maximization oracle can be computed in $O(D \log D)$ as described in App. A. With both top-$k$ sparsemax and these two variants of SparseMAP, the approximate posterior is very sparse, so we may compute the terms of the ELBO explicitly.

Data and architecture.

We use Fashion-MNIST [51], consisting of 256-level grayscale images $x \in \{0, \dots, 255\}^{28 \times 28}$. The decoder uses an independent categorical distribution for each pixel. For top-$k$ sparsemax, $k$ is fixed in advance. We compare our methods to SFE with a moving average baseline and train all models for 100 epochs.

Results and discussion.

Fig. 2 shows an importance sampling estimate (1024 samples per test example were taken) of the negative log-likelihood for the several methods in bits per dimension of $x$, together with the converged values of each method in the rate-distortion (RD) plane. (Distortion is the expected value of the reconstruction negative log-likelihood, while rate is the average KL divergence between the prior and the approximate posterior.) Both show results for which the bit-vector has dimensionality $D = 32$ and $D = 128$. It is evident that the learned representation of our methods has increased performance when compared to SFE: not only is the estimated negative log-likelihood significantly lower, but our methods also have a higher rate and lower distortion, suggesting a better fit of the data. In Fig. 3, we can observe the training progress in number of calls to the decoder for the models with 32 and 128 latent bits, respectively. While top-$k$ sparsemax introduces bias towards the most probable assignments and may discard outcomes that sparsemax would assign non-zero probability to, as training progresses distributions may (or tend to) be sufficiently sparse and this mismatch disappears, making the gradient computation exact. Remarkably, this happens for $D = 32$: the support of top-$k$ sparsemax is smaller than $k$, giving the true gradient for most of the training. This no longer happens for $D = 128$, for which it remains with full support throughout, due to the model being harder. On the other hand, SparseMAP solutions become very sparse from the start in both cases, while still obtaining good performance. There is, therefore, a trade-off between the solutions we propose: on one hand, top-$k$ sparsemax can become exact with respect to the expectation in Eq. 1, but it only does so if the chosen $k$ is suitable to the difficulty of the model; on the other hand, SparseMAP may not offer an exact gradient, but its performance is very close to that of top-$k$ sparsemax and its higher propensity for sparsity means it requires less computation.

Figure 3: Bit vector VAE median and quartile decoder calls per epoch, $D = 32$ (top) / $D = 128$ (bottom).

6 Related Work

Sparse mappings.

There has been recent interest in applying sparse mappings of discrete distributions in deep models [28, 33, 32, 36], attention mechanisms [27, 42, 29, 6], and as a part of discriminative models [34]. Our work focuses instead on the parameterization of distributions over latent variables with these sparse mappings, and on contrasting this novel training method for discrete latent variables with common sampling-based ones.

Reducing sampling noise.

The sampling procedure found in SFE is a great source of variance in models that use this method. To reduce this variance, many works have proposed baselines [50, 9] or control variates [49, 47, 8]. Our method contrasts with these approaches by using an exact gradient that does not use any sampling of the latent variable at training time. Furthermore, we do this while not introducing any new parameters in the network. Closer to our work are variance reduction techniques that rely on partial marginalization, typically of the top-$k$ scores of the assignments to the latent variable [25, 20]. These methods show improved performance and variance reduction but still rely on a noisy estimate of the set outside the top-$k$, and require a choice of $k$. Our methods do not require any sampling, nor are they fixed to a particular choice of $k$: the sparse mappings we use adapt their support as training progresses.

7 Conclusion

We described a novel training strategy for discrete latent variable models, eschewing the common approach based on MC gradient estimation in favor of deterministic, exact marginalization under a sparse distribution. Sparsity leads to a powerful adaptive method, which can investigate fewer or more latent assignments depending on the ambiguity of a training instance $x$, as well as on the stage in training. We showcase the performance and flexibility of our method by investigating a variety of applications, with both discrete and structured latent variables, with positive results. Our models very quickly approach a small number of latent assignment evaluations per sample, but make progress much faster and overall lead to superior results. Our proposed method thus offers the accuracy and robustness of exact marginalization while meeting the efficiency and flexibility of score function estimator methods, providing a promising alternative.

Broader Impact

We discuss here the broader impact of our work. Discussion in this section is predominantly speculative, as the methods described in this work are not yet tested in broader applications. However, we do think that the methods described here can be applied to many applications — as this work is applicable to any model that contains discrete latent variables, even of combinatorial type.

Currently, the solutions available to train discrete latent variable models greatly rely on MC sampling, which can have high variance. Methods that aim to decrease this variance are often not trivial to train and to implement and may disincentivize practitioners from using this class of models. However, we believe that discrete latent variable models have, in many cases, more interpretable and intuitive latent representations. Our methods offer: a simple approach in implementation to train these models; no addition in the number of parameters; low increase in computational overhead (especially when compared to more sophisticated methods of variance reduction [25]); and improved performance.

As we have already pointed out, oftentimes latent variable models have superior explanatory power and so can aid in understanding cases in which the model failed the downstream task. Interpretability of deep neural models can be essential to better discover any ethically harmful biases that exist in the data or in the model itself.

On the other hand, the generative models discussed in this work may also pave the way for malicious use cases, such as is the case with Deepfakes, fake human avatars used by malevolent Twitter users, and automatically generated fraudulent news. Generative models are remarkable at sampling new instances of fake data and, with the power of latent variables, the interpretability discussed before can be used maliciously to further push harmful biases instead of removing them. Furthermore, our work is promising in improving the performance of latent variable models with several discrete variables, that can be trained as attributes to control the sample generation. Attributes that can be activated or deactivated at will to generate fake data can both help beneficial and malignant users to finely control the generated sample. Our work may be currently agnostic to this, but we recognize the dangers and dedicate effort to combating any malicious applications.

Energy-wise, latent variable models often require less data and computation than other models that rely on a massive amount of data and infrastructure. This makes latent variable models ideal for situations where data is scarce, or where there are few computational resources to train large models. We believe that better latent variable modeling is a step forward in the direction of alleviating environmental concerns of deep learning research [44]. However, the models proposed in this work tend to use more resources earlier on in training than standard methods, and even though in the applications shown they consume much less as training progresses, it’s not clear if that trend is still observed in all potential applications.

In data science, latent variable models (LVMs), such as mixed-membership models [3], can be used to uncover correlations in large amounts of data, for example, by clustering observations. Training these models requires various degrees of approximation, which are not without consequence: they may impact the quality of our conclusions and their fidelity to the data. For example, variational inference tends to underestimate uncertainty and give very little support to certain less-represented groups of variables. Where such a model informs decision-makers on matters that affect lives, these decisions may be based on an incomplete view of the correlations in the data, and these correlations may be exaggerated in harmful ways. On the one hand, our work contributes to more stable training of LVMs, and thus it is a step towards addressing some of the many approximations that can blur the picture. On the other hand, sparsity may carry a larger risk of disregarding certain correlations or groups of observations, and thus contribute to misinforming the data scientist. At this point it is unclear to what extent the latter happens and, if it does, whether it is consistent across LVMs and their various uses. We aim to study this issue further and work with practitioners to identify failure cases.


This work was supported by the European Research Council (ERC StG DeepSPIN 758969), by the Fundação para a Ciência e Tecnologia through contracts UID/EEA/50008/2019 and CMUPERI/TIC/0046/2014 (GoLocal), and by the MAIA project, funded by the P2020 program under contract number 045909. This project also received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement 825299 (GoURMET). Top- sparsemax is due in great part to initial work and ideas of Mathieu Blondel.


Appendix A Budget Constraint

The maximization oracle for the budget constraint described in §5.3 can be computed in . This is done by sorting the Bernoulli scores and selecting the entries among the top- which have a positive score.
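The sort-and-select procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `budget_oracle` and the argument name `budget` are hypothetical, and the oracle is assumed to return a 0/1 vector with at most `budget` active entries that maximizes the total score.

```python
def budget_oracle(scores, budget):
    """Pick at most `budget` entries so that the sum of selected scores is maximal.

    `scores` are the Bernoulli scores; `budget` is an assumed name for the
    maximum number of active entries.
    """
    # Sort indices by score, highest first; the sort dominates the cost.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    z = [0] * len(scores)
    # Among the highest-scoring `budget` entries, keep only the positive ones:
    # activating a non-positive score can never increase the objective.
    for i in order[:budget]:
        if scores[i] > 0:
            z[i] = 1
    return z


print(budget_oracle([0.5, -1.0, 2.0, 0.1, -0.3], budget=2))  # [1, 0, 1, 0, 0]
```

With a budget larger than the number of positive scores, the constraint is inactive and the result is simply the indicator of the positive entries.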

Appendix B Training Details

In our applications, we follow the experimental procedures described in [25] and [22] for §5.1 and §5.2, respectively. We describe below the most relevant training details and key differences in architectures when applicable. For other implementation details that we do not mention here, we refer the reader to the works referenced above.

Semisupervised Variational Autoencoder.

In this experiment, the classification network consists of three fully connected hidden layers of size 256, using ReLU activations. The generative and inference networks each consist of one hidden layer of size 128, also with ReLU activations. The multivariate Gaussian has 8 dimensions and its covariance is diagonal. For all models, we chose the learning rate by grid search over (5e-5, 1e-4, 5e-4, 1e-3, 5e-3), selecting the value with the best ELBO on the validation set. Optimization was done with Adam.

Emergent communication game.

In this application, we closely followed the experimental procedure described by Lazaridou et al. [22], with a few key differences. The architecture of the sender and the receiver is identical, with the exception that the sender takes only the correct image as input, without the distractor images. The collection of images shown to the receiver was increased from 2 to 16 and the vocabulary of the sender was increased to 256. The hidden size and embedding size were also increased to 512 and 256, respectively. We did a grid search on the learning rate (0.01, 0.005, 0.001) and entropy regularizer (0.1, 0.05, 0.01) and chose the best configuration for each model based on communication success on the validation set. The Gumbel models were trained with a temperature of 1 throughout. All models were implemented in EGG [15] and trained with the Adam optimizer, with a batch size of 64, for 200 epochs.
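The hyperparameter selection described above can be sketched as below. This is an illustration under assumptions: the text does not state whether the two hyperparameters were searched jointly, so the sketch assumes a full 3 × 3 grid; `select_config` and `toy_evaluate` are hypothetical names, and `toy_evaluate` is a placeholder scoring function, not a real training run or a reported result.

```python
from itertools import product

LEARNING_RATES = (0.01, 0.005, 0.001)
ENTROPY_REGULARIZERS = (0.1, 0.05, 0.01)


def select_config(evaluate):
    """Return the (learning_rate, entropy_coef) pair with the highest score.

    `evaluate` stands for "train one model, then measure communication
    success on the validation set" (hypothetical callable).
    """
    grid = list(product(LEARNING_RATES, ENTROPY_REGULARIZERS))  # 9 configurations
    return max(grid, key=lambda cfg: evaluate(*cfg))


def toy_evaluate(learning_rate, entropy_coef):
    # Placeholder score that peaks at (0.005, 0.05); purely illustrative.
    return -abs(learning_rate - 0.005) - abs(entropy_coef - 0.05)


best = select_config(toy_evaluate)
print(best)  # (0.005, 0.05)
```

Running the selection once per model, as the text describes, means each method is compared at its own best configuration.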

Bit-Vector Variational Autoencoder.

In this experiment, the generative and inference networks each consist of one hidden layer with 128 nodes, using ReLU activations. We chose the learning rate by grid search over (0.0005, 0.001, 0.002), selecting the model with the best ELBO on the validation set. We used the Adam optimizer.

B.1 Datasets

Semisupervised Variational Autoencoder.

MNIST consists of gray-scale images of hand-written digits. It contains 60,000 datapoints for training and 10,000 datapoints for testing. We perform model selection on the last 10,000 datapoints of the training split.

Emergent communication game.

The data used by Lazaridou et al. [22] is a subset of ImageNet containing 46,300 images, chosen by sampling 100 images from each of 463 base-level concepts. The images are passed through the pre-trained VGG ConvNet [43], and the representations at the second-to-last fully connected layer are saved to use as input to the sender and receiver.

Bit-Vector Variational Autoencoder.

Fashion-MNIST consists of gray-scale images of clothes. It contains 60,000 datapoints for training and 10,000 datapoints for testing. We perform model selection on the last 10,000 datapoints of the training split.
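The hold-out scheme used for both MNIST and Fashion-MNIST (the last 10,000 training datapoints reserved for model selection) amounts to the index split below. This is a minimal sketch; the function name `train_validation_split` is hypothetical, and the dataset is assumed to be indexed in its standard order.

```python
def train_validation_split(num_train=60000, num_validation=10000):
    """Split training indices, holding out the last `num_validation` for model selection."""
    split = num_train - num_validation
    return range(0, split), range(split, num_train)


train_idx, val_idx = train_validation_split()
print(len(train_idx), len(val_idx), val_idx[0])  # 50000 10000 50000
```

The 10,000 test datapoints are left untouched by this split and are used only for final evaluation.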

Appendix C Performance in Decoder Calls

Fig. 4 shows the number of decoder calls with percentiles for the experiment in §5.2.

Figure 4: Median number of decoder calls per epoch during training for sparsemax in §5.2, with the 10th and 90th percentiles shown as dotted lines.
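The summary statistics plotted in Fig. 4 can be computed as in the sketch below. It assumes that a list of decoder-call counts is available for each epoch; what the percentiles range over is not specified here, so the input is a hypothetical list of counts, and `summarize_decoder_calls` is an illustrative name.

```python
import statistics


def summarize_decoder_calls(calls):
    """Return (10th percentile, median, 90th percentile) of a list of counts."""
    # quantiles with n=10 returns the nine decile cut points (Python 3.8+).
    deciles = statistics.quantiles(calls, n=10, method="inclusive")
    return deciles[0], statistics.median(calls), deciles[-1]


# Hypothetical counts for one epoch, for illustration only.
p10, median, p90 = summarize_decoder_calls([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
print(p10, median, p90)
```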

Appendix D Computing infrastructure

Our infrastructure consists of 4 machines with the specifications shown in Table 2. The machines were used interchangeably, and all experiments were executed on a single GPU. Despite the machines having different specifications, we did not observe large differences in the execution time of our models across them.

#    GPU                       CPU
1.   4 Titan Xp - 12GB         16 AMD Ryzen 1950X @ 3.40GHz - 128GB
2.   4 GTX 1080 Ti - 12GB      8 Intel i7-9800X @ 3.80GHz - 128GB
3.   3 RTX 2080 Ti - 12GB      12 AMD Ryzen 2920X @ 3.50GHz - 128GB
4.   3 RTX 2080 Ti - 12GB      12 AMD Ryzen 2920X @ 3.50GHz - 128GB
Table 2: Computing infrastructure.