Neural latent variable models are powerful and expressive tools for finding patterns in high-dimensional data, such as images or text [16, 17, 39]. Of particular interest are discrete latent variables, which can recover categorical and structured encodings of hidden aspects of the data, leading to compact representations and, in some cases, superior explanatory power [46, 1]. However, with discrete variables, training can become challenging, due to the need to compute a gradient of a large sum over all possible latent variable assignments, with each term itself being potentially expensive. This challenge is typically tackled by estimating the gradient with Monte Carlo methods [MC; 31], which rely on sampling. The two most common strategies for MC gradient estimation are the score function estimator [SFE; 41, 35], which suffers from high variance, and surrogate methods that rely on a continuous relaxation of the latent variable, like straight-through or Gumbel-Softmax [26, 12], which potentially reduce variance but introduce bias and modeling assumptions.
In this work, we take a step back and ask: Can we avoid sampling entirely, and instead deterministically evaluate the sum with less computation? To answer affirmatively, we propose an alternative method to train these models by parameterizing the discrete distribution with sparse mappings: sparsemax and two structured counterparts, SparseMAP and a novel mapping, top-$k$ sparsemax. Sparsity implies that some assignments of the latent variable are entirely ruled out; the corresponding terms in the sum evaluate trivially to zero, allowing us to disregard potentially expensive computations.
We introduce a general strategy for learning deep models with discrete latent variables that hinges on learning a sparse distribution over the possible assignments. In the unstructured categorical case, our strategy relies on the sparsemax activation function, presented in §3, while in the structured case we propose two strategies, SparseMAP and top-$k$ sparsemax, presented in §4. Unlike existing approaches, our strategies involve neither MC estimation nor any relaxation of the discrete latent variable to the continuous space. We demonstrate our strategy on three different applications: a semisupervised generative model, an emergent communication game, and a bit-vector variational autoencoder. We provide a thorough analysis and comparison to MC methods and, when feasible, to exact marginalization. Our approach is consistently a top performer, combining the accuracy and robustness of exact marginalization with the efficiency of single-sample estimators.
We denote scalars, vectors, matrices, and sets as $a$, $\boldsymbol{a}$, $\boldsymbol{A}$, and $\mathcal{A}$, respectively. The indicator vector is denoted by $\boldsymbol{e}_i$, for which every entry is zero, except the $i$th, which is 1. The simplex is denoted $\triangle^D := \{\boldsymbol{p} \in \mathbb{R}^D : \boldsymbol{p} \geq \boldsymbol{0},\ \textstyle\sum_i p_i = 1\}$. $\mathrm{H}(\pi)$ denotes the Shannon entropy of a distribution $\pi$, and $\mathrm{KL}(\pi \,\|\, \rho)$ denotes the Kullback-Leibler divergence of $\pi$ from $\rho$. The number of nonzeros of a sequence is denoted $\|\cdot\|_0$. Letting $Z$ be a random variable, we write the expectation of a function $f(Z)$ under distribution $\pi$ as $\mathbb{E}_{Z \sim \pi}[f(Z)]$.
We assume throughout a latent variable model with observed variables $x$ and latent stochastic variables $z \in \mathcal{Z}$. The overall fit to a dataset $\mathcal{D}$ is $L(\theta) = \sum_{x \in \mathcal{D}} L_x(\theta)$, where the loss of each observation,

$$L_x(\theta) = \mathbb{E}_{\pi(z \mid x, \theta)}\left[\ell(x, z, \theta)\right], \qquad (1)$$

is the expected value of a downstream loss $\ell(x, z, \theta)$ under a probability model $\pi(z \mid x, \theta)$ of the latent variable. To model complex data, one parameterizes both the downstream loss and the distribution over latent assignments using neural networks, due to their flexibility and capacity.
In this work, we study discrete latent variables, where $\mathcal{Z}$ is finite, but possibly very large. One example is when $\pi(z \mid x, \theta)$ is a categorical distribution, parametrized by a vector $\boldsymbol{\pi} \in \triangle^{|\mathcal{Z}|}$. To obtain $\boldsymbol{\pi}$, a neural network computes a vector of scores $\boldsymbol{s} \in \mathbb{R}^{|\mathcal{Z}|}$, one score for each assignment, which is then mapped to the probability simplex, typically via $\boldsymbol{\pi} = \mathrm{softmax}(\boldsymbol{s})$. Another example is when $\mathcal{Z}$ is a structured (combinatorial) set, such as $\mathcal{Z} = \{0, 1\}^D$. In this case, $|\mathcal{Z}|$ grows exponentially with $D$ and it is infeasible to enumerate and score all possible assignments. For this structured case, scoring assignments involves a decomposition into parts, which we describe in §4.
Training such models requires summing the contributions of all assignments of the latent variable, which involves as many as $|\mathcal{Z}|$ evaluations of the downstream loss. When $|\mathcal{Z}|$ is not too large, the expectation may be evaluated explicitly, and learning can proceed with exact gradient updates. If $|\mathcal{Z}|$ is large, and/or if $\ell$ is an expensive computation, evaluating the expectation becomes prohibitive. In such cases, practitioners typically turn to MC estimates of $\nabla_\theta L_x(\theta)$ derived from latent assignments sampled from $\pi(z \mid x, \theta)$. Under an appropriate learning rate schedule, this procedure converges to a local optimum of $L(\theta)$ as long as gradient estimates are unbiased. Next, we describe the two current main strategies for MC estimation of this gradient. Later, in §3–4, we propose our deterministic alternative, based on sparsifying $\pi(z \mid x, \theta)$.
Monte Carlo gradient estimates.
Let $\theta = (\theta_\pi, \theta_\ell)$, where $\theta_\pi$ is the subset of weights that $\pi$ depends on, and $\theta_\ell$ the subset of weights that $\ell$ depends on. Given a sample $z \sim \pi(z \mid x, \theta_\pi)$, an unbiased estimator of the gradient for Eq. 1 w.r.t. $\theta_\ell$ is $\nabla_{\theta_\ell} L_x(\theta) \approx \nabla_{\theta_\ell} \ell(x, z, \theta)$. Unbiased estimation of $\nabla_{\theta_\pi} L_x(\theta)$ is less trivial, since $\theta_\pi$ is involved in the sampling of $z$, but can be done with SFE [41, 35]: $\nabla_{\theta_\pi} L_x(\theta) = \mathbb{E}_{\pi(z \mid x, \theta_\pi)}\left[\ell(x, z, \theta)\, \nabla_{\theta_\pi} \log \pi(z \mid x, \theta_\pi)\right]$, also known as REINFORCE. The SFE is powerful and general, making no assumptions on the form of $\ell$ or $\pi$, requiring only a sampling oracle and a way to assess gradients of $\log \pi$. However, it comes with the cost of high variance. Making the estimator practically useful requires variance reduction techniques such as baselines [50, 9] and control variates [49, 47, 8]. Variance reduction can also be achieved with Rao-Blackwellization techniques such as sum and sample [5, 37, 25], which marginalizes an expectation over the top-$k$ elements of $\pi$ and takes a sample estimate from the complement set.
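To make the estimator concrete, here is a small self-contained sketch (ours, not the paper's code) comparing the SFE sample average against the exact gradient for a toy categorical distribution; the scores and per-assignment losses are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

s = np.array([0.2, -0.1, 0.3, 0.0])       # scores (hypothetical)
losses = np.array([1.0, 2.0, 0.5, 3.0])   # downstream loss per assignment
p = softmax(s)

# Exact gradient of E_p[loss] w.r.t. the scores:
# d/ds_j sum_i p_i L_i = p_j * (L_j - E_p[L]).
exact = p * (losses - losses @ p)

# SFE: average of L(z) * grad_s log p(z), with grad_s log p(z=i) = e_i - p.
z = rng.choice(len(p), size=200_000, p=p)
sfe = (losses[z][:, None] * (np.eye(len(p))[z] - p)).mean(axis=0)

# Unbiased, but each single-sample term has high variance.
assert np.allclose(sfe, exact, atol=2e-2)
```

The large sample size is what makes the average close to the exact gradient here; a single-sample estimate would be far noisier, which is the motivation for the variance reduction techniques mentioned above.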
For continuous latent variables, low-variance pathwise gradient estimators can be obtained by separating the source of stochasticity from the sampling parameters, using the so-called reparametrization trick [17, 39]. For discrete latent variables, reparametrizations can only be obtained by introducing a step function like $\mathrm{argmax}$, with null gradients almost everywhere. Replacing the step function with a non-flat surrogate, like the identity function, known as straight-through, or $\mathrm{softmax}$, known as Gumbel-Softmax [26, 12], leads to a biased estimator that can still perform well in practice. Continuous relaxations like straight-through and Gumbel-Softmax are only possible under a further modeling assumption: that $\ell$ is defined continuously (thus differentiably) in a neighbourhood of the indicator vector $\boldsymbol{e}_z$ for every $z \in \mathcal{Z}$. In contrast, both SFE-based methods as well as our approach make no such assumption.
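As a brief sketch of the relaxation (our illustration, with made-up scores), a Gumbel-Softmax sample perturbs the scores with Gumbel noise and applies a temperature-controlled softmax; the output lies in the interior of the simplex rather than at a vertex.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def gumbel_softmax_sample(s, tau, rng):
    """Relaxed sample: softmax((s + Gumbel noise) / tau).
    As tau -> 0 this approaches a one-hot argmax sample."""
    g = -np.log(-np.log(rng.random(s.shape)))  # standard Gumbel noise
    return softmax((s + g) / tau)

s = np.array([1.0, 0.0, -1.0])     # hypothetical scores
y = gumbel_softmax_sample(s, tau=0.5, rng=rng)

assert np.isclose(y.sum(), 1.0)    # on the simplex, but not a vertex
```

The temperature `tau` trades off bias against gradient variance: low temperatures give nearly one-hot samples but sharper, noisier gradients.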
3 Efficient Marginalization via Sparsity
The challenge of computing the exact expectation in Eq. 1
is linked to the need to compute a sum with a large number of terms. This holds when the probability distribution over latent assignments is dense (i.e., every assignment has non-zero probability), which is indeed the case for most parameterizations of discrete distributions. Our proposed methods hinge on sparsifying this sum.
Take the example where $\mathcal{Z} = \{1, \ldots, K\}$, with a neural network predicting from $x$ a $K$-dimensional vector of real-valued scores $\boldsymbol{s}$, such that $s_z$ is the score of $z$ (not to be confused with “score function,” as in SFE, which refers to the gradient of the log-likelihood). The traditional way to obtain the vector $\boldsymbol{\pi}$ parametrizing $\pi(z \mid x, \theta)$ is with the softmax transform, i.e., $\boldsymbol{\pi} = \mathrm{softmax}(\boldsymbol{s})$. Since this gives $\pi(z \mid x, \theta) > 0$ for all $z$, the expectation in Eq. 1 depends on $\ell(x, z, \theta)$ for every possible $z \in \mathcal{Z}$.
We rethink this standard parametrization, proposing a sparse mapping from scores to the simplex. In particular, we substitute softmax by sparsemax:

$$\mathrm{sparsemax}(\boldsymbol{s}) := \operatorname*{argmin}_{\boldsymbol{\pi} \in \triangle^K} \|\boldsymbol{\pi} - \boldsymbol{s}\|_2^2. \qquad (2)$$
Like softmax, sparsemax is differentiable and has efficient forward and backward passes [11, 28]. However, since Eq. 2 is the Euclidean projection operator onto the probability simplex, sparsemax can assign exactly zero probabilities whenever it hits the simplex boundary—unlike softmax.
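For concreteness, here is a minimal NumPy sketch of the sparsemax projection, following the standard sort-and-threshold algorithm (not the authors' implementation):

```python
import numpy as np

def sparsemax(s):
    """Euclidean projection of a score vector onto the probability simplex
    (Eq. 2); unlike softmax, the result can have exact zeros."""
    s = np.asarray(s, dtype=float)
    u = np.sort(s)[::-1]                    # scores sorted in descending order
    css = np.cumsum(u)
    k = np.arange(1, len(s) + 1)
    support = u + (1 - css) / k > 0         # which sorted entries stay positive
    k_star = k[support][-1]                 # support size
    tau = (css[k_star - 1] - 1) / k_star    # threshold
    return np.maximum(s - tau, 0.0)

assert np.allclose(sparsemax([3.0, 0.0, 0.0]), [1.0, 0.0, 0.0])  # one-hot
assert np.allclose(sparsemax([0.0, 0.0]), [0.5, 0.5])            # uniform ties
```

Note how a clear score gap yields an exactly one-hot output, whereas softmax of the same scores would remain dense.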
Our main insight is that with a sparse parametrization of $\pi(z \mid x, \theta)$, we can compute the expectation in Eq. 1 by evaluating $\ell(x, z, \theta)$ only for assignments $z$ with $\pi(z \mid x, \theta) > 0$. This leads to a powerful alternative to MC estimation, which requires fewer than $|\mathcal{Z}|$ evaluations of $\ell$, and which strategically, yet deterministically, selects which assignments to evaluate $\ell$ on. Empirically, our analysis in §5 reveals an adaptive behavior of this sparsity-inducing mechanism, performing more loss evaluations in early iterations while the model is uncertain, and quickly reducing the number of evaluations, especially for unambiguous data points. This is a notable property of our learning strategy: in contrast, MC estimation cannot decide when an ambiguous data point may require more sampling for accurate estimation; and directly evaluating Eq. 1 with the dense $\boldsymbol{\pi}$ resulting from a softmax parametrization never reduces the number of evaluations required, even for simple instances.
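The following toy sketch (hypothetical scores and a stand-in loss) illustrates the point: once the distribution is sparse, the expectation in Eq. 1 touches only the support, so an expensive loss is never evaluated on zero-probability assignments.

```python
import numpy as np

def sparsemax(s):
    # Sort-and-threshold simplex projection (Eq. 2), as sketched in §3.
    u = np.sort(s)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, len(s) + 1)
    k_star = k[u + (1 - css) / k > 0][-1]
    return np.maximum(s - (css[k_star - 1] - 1) / k_star, 0.0)

calls = 0
def expensive_loss(z):          # stand-in for, e.g., a decoder evaluation
    global calls
    calls += 1
    return float(z) ** 2

s = np.array([2.0, 1.9, -1.0, -3.0, -5.0])  # hypothetical scores
p = sparsemax(s)

# Evaluate the expectation only on the support: zero-probability
# assignments contribute nothing and are never computed.
support = np.flatnonzero(p)
expectation = sum(p[z] * expensive_loss(z) for z in support)

assert calls == len(support) < len(s)   # fewer loss calls than |Z|
```

With a softmax parametrization of the same scores, all five assignments would need evaluating; here only the two on the support are.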
4 Structured Latent Variables
While the approach described in §3 theoretically applies to any discrete distribution, many models of interest involve structured (or combinatorial) latent variables. In this section, we assume $z$ can be represented as a bit-vector, i.e., $\boldsymbol{z} \in \{0, 1\}^D$, a random vector of $D$ discrete binary variables. This assignment of binary variables may involve global factors and constraints (e.g., tree constraints, or budget constraints on the number of active variables, i.e., $\sum_i z_i \leq B$, where $B$ is the maximum number of variables allowed to activate at the same time). In such structured problems, $|\mathcal{Z}|$ increases exponentially with $D$, making exact evaluation of Eq. 1 prohibitive, even with sparsemax.
Structured prediction typically handles this combinatorial explosion by parametrizing scores for individual binary variables and interactions within the global structured configuration, yielding a compact vector of variable scores $\boldsymbol{t} \in \mathbb{R}^D$ (e.g., log-potentials for binary attributes), with $D \ll |\mathcal{Z}|$. Then, the score of some global configuration $z \in \mathcal{Z}$ is $\theta_z = \boldsymbol{a}_z^\top \boldsymbol{t}$, where $\boldsymbol{a}_z \in \{0, 1\}^D$ is the bit-vector representation of $z$. The variable scores induce a unique Gibbs distribution over structures, given by $\pi(z \mid x, \theta) \propto \exp(\theta_z)$. Equivalently, defining $\boldsymbol{A}$ as the matrix with columns $\boldsymbol{a}_z$ for all $z \in \mathcal{Z}$, we consider the discrete distribution parameterized by $\mathrm{softmax}(\boldsymbol{\theta})$, where $\boldsymbol{\theta} = \boldsymbol{A}^\top \boldsymbol{t}$. (In the unstructured case, $\boldsymbol{A} = \boldsymbol{I}$.)
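A tiny enumeration sketch of this parameterization (illustrative scores, $D = 3$): global scores are inner products between structure embeddings and variable scores, and softmax over them yields the Gibbs distribution. With only per-bit scores and no interactions, this distribution factorizes into independent Bernoullis, which we verify below.

```python
import numpy as np
from itertools import product

t = np.array([0.5, -1.0, 2.0])   # per-bit variable scores (hypothetical)
# Rows of A are all 2^3 bit-vectors a_z (transposed w.r.t. the text's columns).
A = np.array(list(product([0, 1], repeat=3)), dtype=float)
theta = A @ t                    # global scores: theta_z = a_z . t
pi = np.exp(theta - theta.max())
pi /= pi.sum()                   # Gibbs distribution over the 8 structures

# Without interaction scores, the Gibbs distribution factorizes into
# independent Bernoullis with mean sigmoid(t_i):
assert np.allclose(pi @ A, 1 / (1 + np.exp(-t)))
```

Such explicit enumeration only works for tiny $D$; the rest of this section is about avoiding it.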
In practice, however, we cannot materialize the matrix $\boldsymbol{A}$ or the global score vector $\boldsymbol{\theta}$, let alone compute the softmax and the sum in Eq. 1. The SFE, however, can still be used, provided that exact sampling of $z \sim \pi(z \mid x, \theta)$ is feasible, and efficient algorithms exist for computing the normalizing constant $\sum_{z' \in \mathcal{Z}} \exp(\theta_{z'})$, needed to compute the probability of a given sample.
While it may be tempting to consider using sparsemax to avoid the expensive marginalization, this is prohibitive too: solving the problem in Eq. 2 still requires explicit manipulation of the large vector $\boldsymbol{\theta}$, and even if we could avoid this, in the worst case ($\boldsymbol{\theta} = \boldsymbol{0}$) the resulting sparsemax distribution would still have exponentially large support. Fortunately, we show next that it is still possible to develop sparsification strategies to handle the combinatorial explosion of $|\mathcal{Z}|$ in the structured case. We propose two different methods to obtain a sparse distribution supported only over a bounded-size subset of $\mathcal{Z}$: top-$k$ sparsemax (§4.1) and SparseMAP (§4.2).
4.1 Top-$k$ Sparsemax
Recall that the sparsemax operator (Eq. 2) is simply the Euclidean projection onto the $|\mathcal{Z}|$-dimensional probability simplex. While this projection has a propensity to be sparse, there is no upper bound on the number of non-zeros of the resulting distribution. When $|\mathcal{Z}|$ is large, one possibility is to add a cardinality constraint $\|\boldsymbol{\pi}\|_0 \leq k$ for some prescribed $k$. The resulting problem becomes

$$\operatorname*{argmin}_{\boldsymbol{\pi} \in \triangle^{|\mathcal{Z}|},\ \|\boldsymbol{\pi}\|_0 \leq k} \|\boldsymbol{\pi} - \boldsymbol{\theta}\|_2^2, \qquad (3)$$
which is known as a sparse projection onto the simplex and studied in detail by Kyrillidis et al. Remarkably, while this is a non-convex problem, its solution can be written as a composition of two functions: a top-$k$ operator $\mathrm{top}_k$, which returns a vector identical to its input but where all the entries not among the $k$ largest ones are masked out (set to $-\infty$), and the sparsemax operator. Formally, $\mathrm{sparsemax}_k(\boldsymbol{\theta}) = \mathrm{sparsemax}(\mathrm{top}_k(\boldsymbol{\theta}))$. Being a composition of operators, its Jacobian becomes a product of matrices and hence simple to compute (the Jacobian of $\mathrm{top}_k$ is a diagonal matrix whose diagonal is a multi-hot vector indicating the top-$k$ elements of $\boldsymbol{\theta}$).
To apply top-$k$ sparsemax to a large or combinatorial set $\mathcal{Z}$, all we need is a primitive to compute the top-$k$ entries of $\boldsymbol{\theta}$; this is available for many structured problems (for example, sequential models via $k$-best dynamic programming) and, when $\mathcal{Z}$ is the set of joint assignments of $D$ discrete binary variables, it can also be done efficiently. After enumerating this set, we parameterize $\pi(z \mid x, \theta)$ by applying the sparsemax transformation to that top-$k$, with a computational cost that depends on $k$ rather than on $|\mathcal{Z}|$. Note that this method is identical to sparsemax whenever the sparsemax solution has at most $k$ nonzeros: if during training the model learns to assign a sparse distribution to the latent variable, we are effectively using a sparsemax parametrization as presented in §3 with cheap computation. In fact, the solution of Eq. 3 gives us a certificate of optimality whenever its support size is strictly smaller than $k$.
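A sketch of the composition (our illustration, reusing a minimal sparsemax helper): mask everything outside the top-$k$, project the surviving scores, and check the support-size certificate.

```python
import numpy as np

def sparsemax(s):
    # Sort-and-threshold simplex projection (Eq. 2).
    u = np.sort(s)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, len(s) + 1)
    k_star = k[u + (1 - css) / k > 0][-1]
    return np.maximum(s - (css[k_star - 1] - 1) / k_star, 0.0)

def topk_sparsemax(theta, k):
    """Eq. 3 as a composition: sparsemax restricted to the k largest scores;
    every other assignment gets exactly zero probability."""
    top = np.argsort(-theta)[:k]
    p = np.zeros_like(theta)
    p[top] = sparsemax(theta[top])
    return p

theta = np.array([4.0, 3.9, 1.0, 0.5, -2.0])  # hypothetical global scores
p = topk_sparsemax(theta, k=3)

# Certificate: the solution uses fewer than k nonzeros, so it coincides
# with the full (unconstrained) sparsemax.
assert np.count_nonzero(p) < 3
assert np.allclose(p, sparsemax(theta))
```

In a real structured model, `np.argsort(-theta)[:k]` would be replaced by a $k$-best oracle, since $\boldsymbol{\theta}$ cannot be materialized.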
4.2 SparseMAP

SparseMAP seeks a distribution over structures whose marginals are closest to the variable scores:

$$\operatorname*{argmin}_{\boldsymbol{p} \in \triangle^{|\mathcal{Z}|}} \|\boldsymbol{A}\boldsymbol{p} - \boldsymbol{t}\|_2^2. \qquad (4)$$

SparseMAP has been used successfully in discriminative latent models to model structures such as trees and matchings, and Niculae et al. propose an active set algorithm for evaluating it and computing gradients efficiently, requiring only a primitive for computing $\operatorname*{argmax}_{z \in \mathcal{Z}} \boldsymbol{a}_z^\top \boldsymbol{t}$ (a MAP oracle). While the $\operatorname{argmin}$ in Eq. 4 is generally not unique, solutions with support size at most $D + 1$ are guaranteed to exist by Carathéodory’s theorem, and the active set algorithm enjoys linear and finite convergence to a very sparse optimal distribution. Crucially, Eq. 4 has a solution $\boldsymbol{p}$ such that the support $\bar{\mathcal{Z}} = \{z : p_z > 0\}$ grows only linearly with $D$, and therefore $|\bar{\mathcal{Z}}| \ll |\mathcal{Z}|$. Therefore, assessing the expectation in Eq. 1 only requires evaluating $O(D)$ terms.
5 Experimental Analysis
We next demonstrate the applicability of our proposed strategies by tackling three tasks: a deep generative model with semisupervision (§5.1), an emergent communication two-player game over a discrete channel (§5.2), and a variational autoencoder with latent binary factors (§5.3). We describe further architecture and hyperparameter details in App. B.
5.1 Semisupervised Variational Auto-encoder (VAE)
We consider the semisupervised VAE of Kingma et al. (2014), which models the joint probability $p(x, y, z) = p(y)\, p(z)\, p(x \mid y, z)$, where $x$ is an observation (an MNIST image), $z$ is a continuous latent variable with a standard Gaussian prior, and $y$ is a discrete random variable with a uniform prior over $K$ categories. The marginal $p(x)$ is intractable, due to the marginalization of both $y$ and $z$. For a fixed $z$ (e.g., sampled), marginalizing over $y$ requires $K$ calls to the decoder, which can be costly depending on the decoder's architecture.
Training is done by maximizing the ELBO under an approximate posterior $q(y \mid x)\, q(z \mid x, y)$. This trains a classifier $q(y \mid x)$ along with the generative model. In Kingma et al. (2014), $z$ is sampled with a reparameterization, and the expectation over $y$ is computed in closed form, that is, assessing all $K$ terms of the sum for a sampled $z$. Under the notation in §2, we set $\pi(y \mid x, \theta) = q(y \mid x)$ and let $\ell(x, y, \theta)$ collect the remaining ELBO terms for a given $y$, which turns Eq. 1 into the (negative) evidence lower bound (ELBO). To update $\theta_\ell$, we use the reparameterization trick to obtain gradients through a sampled $z$. For $\theta_\pi$, we may still explicitly marginalize over each possible assignment of $y$, but this has a multiplicative cost of $K$ on the number of decoder calls. As an alternative, we experiment with parameterizing $q(y \mid x)$ with a sparse mapping, comparing it to the original formulation and to stochastic gradients based on SFE and continuous relaxations of $y$.
Data and architecture.
Our proposal’s key ingredient is sparsity, which permits exact marginalization and a deterministic gradient. To investigate the impact of sparsity alone, we report a comparison against exact marginalization over the entire support using a dense softmax parameterization. To investigate the impact of deterministic gradients, we compare to stochastic gradient strategies: (i) unbiased SFE with a moving average baseline; (ii) SFE with a self-critic baseline [SFE+; 38], where the baseline corresponds to the log-likelihood assessed at an independent sample and treated as independent of the parameters of the generative model; (iii) sum-and-sample, a Rao-Blackwellized version of SFE; and (iv) Gumbel-Softmax.
Results and discussion.
In Fig. 1, we see that our proposed sparse marginalization approach performs just as well as its dense counterpart, both in terms of ELBO and accuracy. However, by inspecting the number of times each method calls the decoder to assess $\ell$, we can see that the effective support of our method is much smaller: sparsemax-parameterized posteriors get very confident, mostly requiring one, and sometimes two, calls to the decoder. Regarding the Monte Carlo methods, the continuous relaxation done by Gumbel-Softmax underperformed all the other methods except SFE with a moving average. While SFE+ and sum-and-sample are very strong performers, they always require the same number of decoder calls throughout training (in this case, two). In contrast, sparsemax makes few decoder calls not because of a hyperparameter choice but because the model converges to a small support, requiring less computation as it becomes more confident.
5.2 Emergent Communication Game
Emergent communication studies how two agents can develop a communication protocol to solve a task collaboratively. Recent work used neural latent variable models to train these agents via a “collaborative game” between them [24, 22, 10, 14, 7, 45]. In the game of Lazaridou et al. (2017), one of the agents (the sender) sees an image and sends a single symbol message, chosen from a vocabulary $\mathcal{V}$, to the other agent (the receiver), who needs to choose the correct image out of a collection of images $C$. (Lazaridou et al. let the sender see the full set $C$; in contrast, we follow Havrylov and Titov (2017) in showing only the correct image to the sender. This makes the game harder, as the message needs to encode a good “description” of the correct image instead of encoding only its differences from the other images.) They found that the messages communicated this way can be correlated with broad object properties amenable to interpretation by humans. In our framework, we let $\mathcal{Z} = \mathcal{V}$ and define $\pi(z \mid x, \theta_\pi)$ and $\ell(x, z, \theta_\ell)$ such that $\pi$ corresponds to the sender and $\ell$ to the receiver.
Data and architecture.
We compare our method to SFE with a moving average baseline trained with 0/1 loss and negative log-likelihood loss, Gumbel-Softmax, straight-through Gumbel-Softmax, and exact marginalization under a dense softmax parameterization of $\pi$.
Results and discussion.
Table 1: Communication success (%) and number of decoder calls, per method.
Table 1 shows the communication success (accuracy of the receiver at picking the correct image). While communication success in Lazaridou et al. (2017) was close to perfect, we see that increasing the number of candidate images to 16 makes this game much harder for sampling-based approaches. In fact, only the models that do explicit marginalization achieve close-to-perfect communication on the test set. However, as $|\mathcal{V}|$ increases, marginalizing with a softmax parameterization gets computationally more expensive, as it requires $|\mathcal{V}|$ forward and backward passes on the receiver. Unlike softmax, the model trained with sparsemax outputs a very small support, requiring only 3 times more receiver calls, on average, than sampling-based approaches. In fact, sparsemax begins quite dense, but its support size quickly falls to close to 1 (see App. C).
5.3 Bit-Vector Variational Autoencoder
As described in §4, in many interesting problems, combinatorial interactions and constraints make $|\mathcal{Z}|$ exponentially large. In this section, we study the illustrative case of encoding (compressing) images into a binary codeword $\boldsymbol{z} \in \{0, 1\}^D$, by training a latent bit-vector variational autoencoder [12, 30]. One approach for parameterizing the approximate posterior is to use a Gibbs distribution, decomposable as a product of independent Bernoullis, $\pi(\boldsymbol{z} \mid x, \theta) = \prod_{i=1}^{D} \pi(z_i \mid x, \theta)$, with $D$ being the number of latent variables. While marginalizing over all $2^D$ possible assignments is intractable, drawing samples can be done efficiently by sampling each component independently, and the entropy has a closed-form expression. This efficient sampling and entropy computation relies on an independence assumption; in general, we may not have access to such efficient computation.
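A quick sketch of why this parameterization is convenient (made-up log-odds): bits are sampled independently, and the entropy is a sum of binary entropies, which we verify here against brute-force enumeration over all $2^D$ assignments.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

t = np.array([0.3, -0.7, 1.5, 0.0])   # hypothetical per-bit log-odds
p = 1 / (1 + np.exp(-t))              # Bernoulli means

# Sampling is cheap: each bit is drawn independently.
z = (rng.random(p.shape) < p).astype(int)

# Closed-form entropy: sum of the binary entropies of the independent bits.
H = -(p * np.log(p) + (1 - p) * np.log(1 - p)).sum()

# Sanity check against brute-force entropy over all 2^4 assignments.
probs = [np.prod(np.where(np.array(b) == 1, p, 1 - p))
         for b in product([0, 1], repeat=len(p))]
assert np.isclose(H, -sum(q * np.log(q) for q in probs))
```

Once global factors or budget constraints couple the bits, neither the factorized sampling nor the closed-form entropy above applies, which is the setting our sparse methods target.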
We train this VAE by minimizing the negative ELBO; we use a uniform prior $p(\boldsymbol{z}) = 2^{-D}$. This objective does not constrain $\pi$ to the Gibbs parameterization, and thus, to apply our methods, we depart from it, as described next.
Top-$k$ sparsemax parametrization. To apply top-$k$ sparsemax (§4.1), we enumerate the $k$ highest-scoring bit-vector assignments and parameterize $\pi$ with a sparsemax over them.

SparseMAP parametrization. Another sparse alternative to the intractable structured sparsemax, as discussed in §4, is SparseMAP. In this case, we compute an optimal distribution using the active set algorithm of Niculae et al., by using a maximization oracle which can be computed in $O(D)$: $\hat{z} = \operatorname*{argmax}_{z \in \mathcal{Z}} \boldsymbol{a}_z^\top \boldsymbol{t}$.
Since SparseMAP can handle structured problems, we also experimented with adding a budget constraint to SparseMAP: this is done by adding a constraint $\sum_{i=1}^{D} z_i \leq B$, where $B < D$ is a fixed budget. The budget constraint ensures the images are represented with sparse codes, and the maximization oracle can be computed efficiently as described in App. A. With both top-$k$ sparsemax and these two variants of SparseMAP, the approximate posterior is very sparse, so we may compute the expectation in Eq. 1 and its gradient explicitly.
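For the budget-constrained variant, the maximization oracle has a simple greedy form; this is our sketch of the idea (the oracle in App. A may differ in details): activate the highest-scoring coordinates while their scores are positive, up to the budget.

```python
import numpy as np
from itertools import product

def budget_map(t, B):
    """Maximization oracle under a budget: argmax_z t.z s.t. sum(z) <= B.
    Activate the highest-scoring coordinates, but only while positive."""
    z = np.zeros_like(t)
    for i in np.argsort(-t)[:B]:
        if t[i] > 0:
            z[i] = 1.0
    return z

t = np.array([1.2, -0.3, 0.8, 2.0, -1.5])   # hypothetical log-potentials
z_hat = budget_map(t, B=2)

# Brute-force check over all feasible bit-vectors.
best = max((z for z in product([0.0, 1.0], repeat=len(t)) if sum(z) <= 2),
           key=lambda z: np.dot(t, z))
assert np.isclose(np.dot(t, z_hat), np.dot(t, np.array(best)))
```

The greedy choice is optimal here because the objective is additive over coordinates; the sort dominates the cost.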
Data and architecture.
We use Fashion-MNIST, consisting of 256-level grayscale $28 \times 28$ images. The decoder uses an independent categorical distribution for each pixel. For top-$k$ sparsemax, we use a fixed $k$. We compare our methods to SFE with a moving average baseline and train all models for 100 epochs.
Results and discussion.
Fig. 2 shows an importance sampling estimate (1024 samples per test example) of the negative log-likelihood for the several methods, in bits per dimension, together with the converged values of each method in the rate-distortion (RD) plane. (Distortion is the expected value of the reconstruction negative log-likelihood, while rate is the average KL divergence between the prior and the approximate posterior.) Both show results for a bit-vector of dimensionality $D = 32$ and $D = 128$. The representations learned by our methods clearly perform better when compared to SFE: not only is the estimated negative log-likelihood significantly lower, but our methods also have a higher rate and lower distortion, suggesting a better fit. In Fig. 3, we can observe the training progress in number of calls to $\ell$ for the models with 32 and 128 latent bits, respectively. While top-$k$ sparsemax introduces bias towards the most probable assignments and may discard outcomes that the exact distribution would assign non-zero probability to, as training progresses the distributions tend to become sufficiently sparse and this mismatch disappears, making the gradient computation exact. Remarkably, this happens for $D = 32$: the support of the posterior becomes smaller than $k$, giving the true gradient for most of the training. This no longer happens for $D = 128$, for which the distribution retains full top-$k$ support throughout, due to the model being harder. On the other hand, SparseMAP solutions become very sparse from the start in both cases, while still obtaining good performance. There is, therefore, a trade-off between the solutions we propose: on one hand, top-$k$ sparsemax can become exact with respect to the expectation in Eq. 1, but it only does so if the chosen $k$ is suited to the difficulty of the model; on the other hand, SparseMAP may not offer an exact gradient, but its performance is very close, and its higher propensity for sparsity grants it less computation.
6 Related Work
There has been recent interest in applying sparse mappings of discrete distributions in deep models [28, 33, 32, 36], attention mechanisms [27, 42, 29, 6], and as part of discriminative models. Our work focuses instead on the parameterization of distributions over latent variables with these sparse mappings, and on contrasting this novel training method for discrete latent variables with common sampling-based ones.
Reducing sampling noise.
The sampling procedure found in SFE is a great source of variance in models that use this method. To reduce this variance, many works have proposed baselines [50, 9] or control variates [49, 47, 8]. Our method contrasts with these approaches by using an exact gradient that does not require any sampling of the latent variable at training time. Furthermore, we do this without introducing any new parameters in the network. Closer to our work are variance reduction techniques that rely on partial marginalization, typically of the top-$k$ scores of the assignments to the latent variable [25, 20]. These methods show improved performance and variance reduction but still rely on a noisy estimate of the set outside the top-$k$, and require a choice of $k$. Our methods do not require any sampling, nor are they fixed to a particular choice of $k$: the sparse mappings we use adapt their support as training progresses.
We described a novel training strategy for discrete latent variable models, eschewing the common approach based on MC gradient estimation in favor of deterministic, exact marginalization under a sparse distribution. Sparsity leads to a powerful adaptive method, which can investigate fewer or more latent assignments depending on the ambiguity of a training instance, as well as on the stage of training. We showcase the performance and flexibility of our method on a variety of applications, with both discrete and structured latent variables, with positive results. Our models very quickly approach a small number of latent assignment evaluations per sample, while making progress much faster and overall leading to superior results. Our proposed method thus offers the accuracy and robustness of exact marginalization with the efficiency and flexibility of score function estimator methods, providing a promising alternative.
We discuss here the broader impact of our work. Discussion in this section is predominantly speculative, as the methods described in this work are not yet tested in broader applications. However, we do think that the methods described here can be applied to many applications — as this work is applicable to any model that contains discrete latent variables, even of combinatorial type.
Currently, the solutions available to train discrete latent variable models rely greatly on MC sampling, which can have high variance. Methods that aim to decrease this variance are often non-trivial to train and implement, and may disincentivize practitioners from using this class of models. However, we believe that discrete latent variable models have, in many cases, more interpretable and intuitive latent representations. Our methods offer: a simple-to-implement approach to training these models; no additional parameters; a low increase in computational overhead (especially when compared to more sophisticated methods of variance reduction); and improved performance.
As we have already pointed out, oftentimes latent variable models have superior explanatory power and so can aid in understanding cases in which the model failed the downstream task. Interpretability of deep neural models can be essential to better discover any ethically harmful biases that exist in the data or in the model itself.
On the other hand, the generative models discussed in this work may also pave the way for malicious use cases, as is the case with deepfakes, fake human avatars used by malevolent Twitter users, and automatically generated fraudulent news. Generative models are remarkable at sampling new instances of fake data and, with the power of latent variables, the interpretability discussed before can be used maliciously to further push harmful biases instead of removing them. Furthermore, our work is promising for improving the performance of latent variable models with several discrete variables, which can be trained as attributes to control sample generation. Attributes that can be activated or deactivated at will to generate fake data can help both beneficial and malignant users to finely control the generated sample. Our work may currently be agnostic to this, but we recognize the dangers and dedicate effort to combating any malicious applications.
Energy-wise, latent variable models often require less data and computation than models that rely on massive amounts of data and infrastructure. This makes latent variable models ideal for situations where data is scarce, or where there are few computational resources to train large models. We believe that better latent variable modeling is a step forward in the direction of alleviating environmental concerns of deep learning research. However, the models proposed in this work tend to use more resources early in training than standard methods, and even though in the applications shown they consume much less as training progresses, it is not clear whether that trend holds in all potential applications.
In data science, latent variable models (LVMs), such as mixed-membership models, can be used to uncover correlations in large amounts of data, for example, by clustering observations. Training these models requires various degrees of approximation, which are not without consequence: they may impact the quality of our conclusions and their fealty to the data. For example, variational inference tends to underestimate uncertainty and give very little support to certain less-represented groups of variables. Where such a model informs decision-makers on matters that affect lives, these decisions may be based on an incomplete view of the correlations in the data, and/or these correlations may be exaggerated in harmful ways. On the one hand, our work contributes to more stable training of LVMs, and thus is a step towards addressing some of the many approximations that can blur the picture. On the other hand, sparsity may carry a larger risk of disregarding certain correlations or groups of observations, and thus contribute to misinforming the data scientist. At this point it is unclear to what extent the latter happens and, if it does, whether it is consistent across LVMs and their various uses. We aim to study this issue further and work with practitioners to identify failure cases.
This work was supported by the European Research Council (ERC StG DeepSPIN 758969), by the Fundação para a Ciência e Tecnologia through contracts UID/EEA/50008/2019 and CMUPERI/TIC/0046/2014 (GoLocal), and by the MAIA project, funded by the P2020 program under contract number 045909. This project also received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement 825299 (GoURMET). Top-$k$ sparsemax is due in great part to initial work and ideas of Mathieu Blondel.
-  (2019) Interpretable neural predictions with differentiable binary variables. In Proc. of ACL, Cited by: §1.
-  (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. preprint arXiv:1308.3432. Cited by: §1, §2.
-  (2014) Build, compute, critique, repeat: Data analysis with latent variable models. Annual Review of Statistics and Its Application 1, pp. 203–232. Cited by: Broader Impact.
-  (2020) Learning with Fenchel-Young losses. Journal of Machine Learning Research 21 (35), pp. 1–69. Cited by: §4.1.
-  (1996) Rao-Blackwellisation of sampling schemes. Biometrika 83 (1), pp. 81–94. Cited by: §2.
-  (2019) Adaptively sparse transformers. In Proc. EMNLP, Cited by: §6.
-  (2016) Learning to communicate with deep multi-agent reinforcement learning. In Proc. NeurIPS, Cited by: §5.2.
-  (2018) Backpropagation through the void: optimizing control variates for black-box gradient estimation. In Proc. ICLR, Cited by: §2, §6.
-  (2016) MuProp: unbiased backpropagation for stochastic neural networks. In Proc. ICLR, Cited by: §2, §6.
-  (2017) Emergence of language with multi-agent games: Learning to communicate with sequences of symbols. In Proc. NeurIPS, Cited by: §5.2, §5.2, footnote 3.
-  (1974) Validation of subgradient optimization. Mathematical Programming 6 (1), pp. 62–88. Cited by: §3.
-  (2017) Categorical reparameterization with Gumbel-Softmax. In Proc. ICLR, Cited by: §1, §2, §5.3.
-  (1999) An introduction to variational methods for graphical models. Machine Learning 37 (2), pp. 183–233. Cited by: §5.1.
-  (2016) Learning to play guess who? and inventing a grounded language as a consequence. In Proc. NeurIPS Workshop on Deep Reinforcement Learning, Cited by: §5.2.
-  (2019) EGG: a toolkit for research on Emergence of lanGuage in Games. In Proc. EMNLP, Cited by: Appendix B.
-  (2018) A tutorial on deep latent variable models of natural language. preprint arXiv:1812.06834. Cited by: §1.
-  (2014) Auto-encoding variational Bayes. In Proc. ICLR, Cited by: §1, §2, §2.
-  (2014) Semi-supervised learning with deep generative models. In Proc. NeurIPS, Cited by: §5.1, §5.1.
-  (2002) Natural language from artificial life. Artificial life 8 (2), pp. 185–215. Cited by: §5.2.
-  (2020) Estimating gradients for discrete random variables by sampling without replacement. In Proc. ICLR, Cited by: §6.
-  (2013) Sparse projections onto the simplex. In Proc. ICML, Cited by: §4.1.
-  (2017) Multi-agent cooperation and the emergence of (natural) language. In Proc. ICLR, Cited by: Appendix B, §B.1, Appendix B, §5.2, §5.2, §5.2, footnote 3.
-  (1998) Gradient-based learning applied to document recognition. Proc. IEEE 86 (11), pp. 2278–2324. Cited by: §5.1.
-  (1969) Convention: a philosophical study. Cited by: §5.2.
-  (2019) Rao-Blackwellized stochastic gradients for discrete distributions. In Proc. ICML, Cited by: Appendix B, §2, §5.1, §5.1, §6, Broader Impact.
-  (2017) The Concrete distribution: a continuous relaxation of discrete random variables. In Proc. ICLR, Cited by: §1, §2.
-  (2018) Sparse and constrained attention for neural machine translation. In Proc. ACL, Cited by: §6.
-  (2016) From softmax to sparsemax: a sparse model of attention and multi-label classification. In Proc. ICML, Cited by: §1, §3, §6.
-  (2019) Selective attention for context-aware neural machine translation. preprint arXiv:1903.08788. Cited by: §6.
-  (2014) Neural variational inference and learning in belief networks. In Proc. ICML, Cited by: §5.3.
-  (2019) Monte Carlo gradient estimation in machine learning. preprint arXiv:1906.10652. Cited by: §1.
-  (2017) A regularized framework for sparse and structured neural attention. In Proc. NeurIPS, Cited by: §6.
-  (2018) SparseMAP: differentiable sparse structured inference. In Proc. ICML, Cited by: §1, §4.2, §5.3, §6.
-  (2018) Towards dynamic computation graphs via sparse latent structure. In Proc. EMNLP, Cited by: §4.2, §6.
-  (2012) Variational bayesian inference with stochastic search. In Proc. ICML, Cited by: §1, §2.
-  (2019) Sparse sequence-to-sequence models. In Proc. ACL, Cited by: §6.
-  (2014) Black box variational inference. In Proc. AISTATS, Cited by: §2.
-  (2017) Self-critical sequence training for image captioning. In Proc. CVPR, Cited by: §5.1.
-  (2014) Stochastic backpropagation and approximate inference in deep generative models. In Proc. ICML, Cited by: §1, §2.
-  (1951) A stochastic approximation method. The Annals of Mathematical Statistics 22 (3), pp. 400–407. Cited by: §2.
-  (1976) A Monte Carlo method for estimating the gradient in a stochastic network. Unpublished manuscript, Technion, Haifa, Israel. Cited by: §1, §2.
-  (2019) SSN: Learning sparse switchable normalization via SparsestMax. In Proc. CVPR, Cited by: §6.
-  (2015) Very deep convolutional networks for large-scale image recognition. In Proc. ICLR, Cited by: §B.1.
-  (2019) Energy and policy considerations for deep learning in NLP. preprint arXiv:1906.02243. Cited by: Broader Impact.
-  (2016) Learning multiagent communication with backpropagation. In Proc. NeurIPS, Cited by: §5.2.
-  (2008) A joint model of text and aspect ratings for sentiment summarization. In Proc. ACL, Cited by: §1.
-  (2017) REBAR: low-variance, unbiased gradient estimates for discrete latent variable models. In Proc. NeurIPS, Cited by: §2, §6.
-  (2008) Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning 1 (1-2), pp. 1–305. Cited by: §4.
-  (2013) Variance reduction for stochastic gradient optimization. In Proc. NeurIPS, Cited by: §2, §6.
-  (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3-4), pp. 229–256. Cited by: §2, §6.
-  (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. preprint arXiv:1708.07747. Cited by: §5.3.
Appendix A Budget Constraint
The maximization oracle for the budget constraint described in §5.3 can be computed in O(n log n) time: sort the Bernoulli scores and select the entries among the top-k that have a positive score.
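As a sketch, this oracle can be written in a few lines of numpy; the function name and the dense 0/1 mask output format are our own illustrative choices, not part of the paper's implementation:

```python
import numpy as np

def budget_oracle(scores, k):
    """Budget-constrained maximization oracle: among the k
    highest-scoring coordinates, keep those with a positive score.
    The sort makes this O(n log n)."""
    order = np.argsort(-scores)            # indices by descending score
    chosen = [i for i in order[:k] if scores[i] > 0]
    mask = np.zeros_like(scores)
    mask[chosen] = 1.0
    return mask

scores = np.array([2.0, -1.0, 0.5, 3.0, -0.2])
print(budget_oracle(scores, k=3))  # [1. 0. 1. 1. 0.]
```

Note that fewer than k entries may be selected when some of the top-k scores are negative, in which case those entries are left out of the mask.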
Appendix B Training Details
In our applications, we follow the experimental procedures described by Liu et al. (2019) for §5.1 and by Lazaridou et al. (2017) for §5.2. We describe below the most relevant training details and key architectural differences where applicable. For implementation details not mentioned here, we refer the reader to those works.
Semisupervised Variational Autoencoder.
In this experiment, the classification network consists of three fully connected hidden layers of size 256 with ReLU activations. The generative and inference networks each consist of one hidden layer of size 128, also with ReLU activations. The multivariate Gaussian has 8 dimensions and a diagonal covariance. For all models, we chose the learning rate by grid search over (5e-5, 1e-4, 5e-4, 1e-3, 5e-3), selecting the value with the best ELBO on the validation set. Optimization was done with Adam.
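For concreteness, the shape of the classification network can be sketched as a numpy forward pass; the 784-dimensional flattened-MNIST input, the 10 output classes, and the random placeholder weights are our assumptions for illustration, not details stated above:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, sizes):
    """Fully connected network with ReLU on every hidden layer;
    weights are random placeholders standing in for trained ones."""
    h = x
    n_layers = len(sizes) - 1
    for i in range(n_layers):
        W = rng.normal(0.0, 0.05, size=(sizes[i], sizes[i + 1]))
        h = h @ W
        if i < n_layers - 1:          # no activation on the output layer
            h = np.maximum(h, 0.0)    # ReLU
    return h

# classification network: three hidden layers of size 256,
# mapping a flattened 28x28 image to 10 class scores
x = rng.normal(size=(1, 784))
logits = mlp_forward(x, [784, 256, 256, 256, 10])
print(logits.shape)  # (1, 10)
```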
Emergent communication game.
In this application, we closely followed the experimental procedure described by Lazaridou et al. , with a few key differences. The architecture of the sender and the receiver is identical, except that the sender does not take the distractor images as input along with the correct image — only the correct image. The collection of images shown to the receiver was increased from 2 to 16, and the sender's vocabulary was increased to 256. The hidden size and embedding size were also increased, to 512 and 256, respectively. We did a grid search over the learning rate (0.01, 0.005, 0.001) and the entropy regularizer (0.1, 0.05, 0.01), choosing the best configuration for each model based on communication success on the validation set. The Gumbel models were trained with a temperature of 1 throughout. All models were implemented in EGG  and trained with the Adam optimizer, with a batch size of 64, for 200 epochs.
Bit-Vector Variational Autoencoder.
In this experiment, the generative and inference networks each consist of one hidden layer with 128 units and ReLU activations. We searched over the learning rate by grid search (0.0005, 0.001, 0.002) and chose the model with the best ELBO on the validation set. We used the Adam optimizer.
Semisupervised Variational Autoencoder.
MNIST consists of gray-scale images of hand-written digits. It contains 60,000 datapoints for training and 10,000 datapoints for testing. We perform model selection on the last 10,000 datapoints of the training split.
Emergent communication game.
The data used by Lazaridou et al.  is a subset of ImageNet containing 463,000 images, obtained by sampling 100 images from each of 463 base-level concepts. Each image is passed through the pre-trained VGG ConvNet, and the representations at the second-to-last fully connected layer are saved for use as input to the sender/receiver.
Bit-Vector Variational Autoencoder.
Fashion-MNIST consists of gray-scale images of clothes. It contains 60,000 datapoints for training and 10,000 datapoints for testing. We perform model selection on the last 10,000 datapoints of the training split.
Appendix C Performance in Decoder Calls
Appendix D Computing infrastructure
Our infrastructure consists of 4 machines with the specifications shown in Table 2. The machines were used interchangeably, and all experiments were executed in a single GPU. Despite having machines with different specifications, we did not observe large differences in the execution time of our models across different machines.
Table 2: Specifications of the machines used in our experiments.

| # | GPUs                   | CPU and RAM                                 |
|---|------------------------|---------------------------------------------|
| 1 | 4× Titan Xp (12GB)     | AMD Ryzen 1950X @ 3.40GHz, 16 cores, 128GB  |
| 2 | 4× GTX 1080 Ti (12GB)  | Intel i7-9800X @ 3.80GHz, 8 cores, 128GB    |
| 3 | 3× RTX 2080 Ti (12GB)  | AMD Ryzen 2920X @ 3.50GHz, 12 cores, 128GB  |
| 4 | 3× RTX 2080 Ti (12GB)  | AMD Ryzen 2920X @ 3.50GHz, 12 cores, 128GB  |