1 Introduction
Neural latent variable models are powerful and expressive tools for finding patterns in high-dimensional data, such as images or text [16, 17, 39]. Of particular interest are discrete latent variables, which can recover categorical and structured encodings of hidden aspects of the data, leading to compact representations and, in some cases, superior explanatory power [46, 1]. However, with discrete variables, training can become challenging, due to the need to compute a gradient of a large sum over all possible latent variable assignments, with each term itself being potentially expensive. This challenge is typically tackled by estimating the gradient with Monte Carlo methods [MC; 31], which rely on sampling. The two most common strategies for MC gradient estimation are the score function estimator [SFE; 41, 35], which suffers from high variance, and surrogate methods that rely on a continuous relaxation of the latent variable, like straight-through [2] or Gumbel-Softmax [26, 12], which potentially reduce variance but introduce bias and modeling assumptions.

In this work, we take a step back and ask: can we avoid sampling entirely, and instead deterministically evaluate the sum with less computation? To answer affirmatively, we propose an alternative method to train these models by parameterizing the discrete distribution with sparse mappings: sparsemax [28] and two structured counterparts, SparseMAP [33] and a novel mapping, top-$k$ sparsemax. Sparsity implies that some assignments of the latent variable are entirely ruled out. The corresponding terms in the sum then evaluate trivially to zero, allowing us to disregard potentially expensive computations.
Contributions.
We introduce a general strategy for learning deep models with discrete latent variables that hinges on learning a sparse distribution over the possible assignments. In the unstructured categorical case, our strategy relies on the sparsemax activation function, presented in §3, while in the structured case we propose two strategies, SparseMAP and top-$k$ sparsemax, presented in §4. Unlike existing approaches, our strategies involve neither MC estimation nor any relaxation of the discrete latent variable to the continuous space. We demonstrate our strategy on three different applications: a semi-supervised generative model, an emergent communication game, and a bit-vector variational autoencoder. We provide a thorough analysis and comparison to MC methods and, when feasible, to exact marginalization. Our approach is consistently a top performer, combining the accuracy and robustness of exact marginalization with the efficiency of single-sample estimators.
Notation.
We denote scalars, vectors, matrices, and sets as $a$, $\boldsymbol{a}$, $\boldsymbol{A}$, and $\mathcal{A}$, respectively. The indicator vector is denoted by $\boldsymbol{e}_i$, for which every entry is zero, except the $i$th, which is 1. The simplex is denoted $\triangle$. $\mathbb{H}[\pi]$ denotes the Shannon entropy of a distribution $\pi$, and $\mathrm{KL}[\pi \,\|\, p]$ denotes the Kullback–Leibler divergence of $\pi$ from $p$. The number of nonzeros of a sequence is denoted $\|\cdot\|_0$. Letting $Z$ be a random variable, we write the expectation of a function $f$ under distribution $\pi$ as $\mathbb{E}_{\pi}[f(Z)]$.

2 Background
We assume throughout a latent variable model with observed variables $x$ and latent stochastic variables $z \in \mathcal{Z}$. The overall fit to a dataset $\{x_1, \ldots, x_N\}$ is $\mathcal{L}(\theta, \lambda) = \sum_{n=1}^{N} \mathcal{L}_{x_n}(\theta, \lambda)$, where the loss of each observation,

$$\mathcal{L}_x(\theta, \lambda) = \mathbb{E}_{\pi(z \mid x, \lambda)}\left[\ell(x, z; \theta)\right] = \sum_{z \in \mathcal{Z}} \pi(z \mid x, \lambda)\, \ell(x, z; \theta), \qquad (1)$$

is the expected value of a downstream loss $\ell(x, z; \theta)$ under a probability model $\pi(z \mid x, \lambda)$
of the latent variable. To model complex data, one parameterizes both the downstream loss and the distribution over latent assignments using neural networks, due to their flexibility and capacity [17]. In this work, we study discrete latent variables, where $\mathcal{Z}$ is finite, but possibly very large. One example is when $\pi$ is a categorical distribution, parametrized by a vector $\boldsymbol{\pi} \in \triangle$. To obtain $\boldsymbol{\pi}$, a neural network computes a vector of scores $\boldsymbol{s} \in \mathbb{R}^{|\mathcal{Z}|}$, one score for each assignment, which is then mapped to the probability simplex, typically via softmax. Another example is when $\mathcal{Z}$ is a structured (combinatorial) set, such as $\mathcal{Z} \subseteq \{0, 1\}^D$. In this case, $|\mathcal{Z}|$ grows exponentially with $D$ and it is infeasible to enumerate and score all possible assignments. For this structured case, scoring assignments involves a decomposition into parts, which we describe in §4.
Training such models requires summing the contributions of all assignments of the latent variable, which involves as many as $|\mathcal{Z}|$ evaluations of the downstream loss. When $|\mathcal{Z}|$ is not too large, the expectation may be evaluated explicitly, and learning can proceed with exact gradient updates. If $|\mathcal{Z}|$ is large, and/or if $\ell$ is an expensive computation, evaluating the expectation becomes prohibitive. In such cases, practitioners typically turn to MC estimates of the gradient, derived from latent assignments sampled from $\pi(z \mid x, \lambda)$. Under an appropriate learning rate schedule, this procedure converges to a local optimum of $\mathcal{L}$ as long as gradient estimates are unbiased [40]. Next, we describe the two current main strategies for MC estimation of this gradient. Later, in §3–4, we propose our deterministic alternative, based on sparsifying $\pi$.
Monte Carlo gradient estimates.
Let $\omega = (\theta, \lambda)$, where $\theta$ is the subset of weights that $\ell$ depends on, and $\lambda$ the subset of weights that $\pi$ depends on. Given a sample $z \sim \pi(z \mid x, \lambda)$, an unbiased estimator of the gradient of Eq. 1 w.r.t. $\theta$ is $\nabla_\theta \ell(x, z; \theta)$. Unbiased estimation of the gradient w.r.t. $\lambda$ is less trivial, since $\lambda$ is involved in the sampling of $z$, but can be done with the SFE [41, 35]: $\nabla_\lambda \mathcal{L}_x = \mathbb{E}_{\pi(z \mid x, \lambda)}\left[\ell(x, z; \theta)\, \nabla_\lambda \log \pi(z \mid x, \lambda)\right]$, also known as reinforce [50]. The SFE is powerful and general, making no assumptions on the form of $\ell$ or $\pi$, requiring only a sampling oracle and a way to assess the gradients of $\log \pi$. However, it comes with the cost of high variance. Making the estimator practically useful requires variance reduction techniques such as baselines [50, 9] and control variates [49, 47, 8]. Variance reduction can also be achieved with Rao-Blackwellization techniques such as sum-and-sample [5, 37, 25], which marginalizes an expectation over the top-$k$ elements of $\pi$ and takes a sample estimate from the complement set.

Reparametrization trick.
For continuous latent variables, low-variance pathwise gradient estimators can be obtained by separating the source of stochasticity from the sampling parameters, using the so-called reparametrization trick [17, 39]. For discrete latent variables, reparametrizations can only be obtained by introducing a step function such as $\arg\max$, with null gradients almost everywhere. Replacing the step function with a non-flat surrogate, like the identity function, known as Straight-Through [2], or softmax, known as Gumbel-Softmax [26, 12], leads to a biased estimator that can still perform well in practice. Continuous relaxations like Straight-Through and Gumbel-Softmax are only possible under a further modeling assumption: that $\ell(x, z; \theta)$ is defined continuously (thus differentiably) in a neighbourhood of the indicator vector $\boldsymbol{e}_z$ for every $z$. In contrast, both SFE-based methods and our approach make no such assumption.
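To make the two MC strategies above concrete, here is a minimal NumPy sketch (our illustration, not the paper's code; all function names are hypothetical) of the one-sample SFE with a constant baseline for a small categorical latent variable, checked against the exact gradient obtained by full marginalization, alongside a Gumbel-Softmax relaxed sample:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def sfe_grad(s, loss_fn, rng, baseline=0.0):
    """One-sample SFE (reinforce) estimate of the gradient of E_pi[loss]
    w.r.t. the scores s, where pi = softmax(s)."""
    p = softmax(s)
    z = rng.choice(len(p), p=p)      # sample a latent assignment
    glog = -p                        # grad of log pi(z) w.r.t. s is e_z - p
    glog[z] += 1.0
    return (loss_fn(z) - baseline) * glog

def exact_grad(s, loss_fn):
    """Exact gradient by full marginalization: grad_j = pi_j (l_j - E[l])."""
    p = softmax(s)
    l = np.array([loss_fn(z) for z in range(len(p))])
    return p * (l - p @ l)

def gumbel_softmax(s, tau, rng):
    """Relaxed sample softmax((s + Gumbel noise) / tau): reparameterizable but biased."""
    g = -np.log(-np.log(rng.uniform(size=s.shape)))
    return softmax((s + g) / tau)

rng = np.random.default_rng(0)
s = np.array([0.5, -0.2, 1.0])
loss = lambda z: float((z - 2) ** 2)
sfe_avg = np.mean([sfe_grad(s, loss, rng, baseline=1.0) for _ in range(20000)], axis=0)
relaxed = gumbel_softmax(s, tau=0.5, rng=rng)
```

Averaged over many samples, the SFE estimate approaches `exact_grad(s, loss)`; the variance of the individual samples is what baselines and control variates aim to reduce.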
3 Efficient Marginalization via Sparsity
The challenge of computing the exact expectation in Eq. 1 is linked to the need to compute a sum with a large number of terms. This holds when the probability distribution over latent assignments is dense (i.e., every assignment has nonzero probability), which is indeed the case for most parameterizations of discrete distributions. Our proposed methods hinge on sparsifying this sum.

Take the example where $\pi$ is categorical, with a neural network predicting a $|\mathcal{Z}|$-dimensional vector of real-valued scores $\boldsymbol{s}$, such that $s_z$ is the score of $z$.¹ The traditional way to obtain the vector $\boldsymbol{\pi}$ parametrizing $\pi$ is with the softmax transform, i.e., $\boldsymbol{\pi} = \operatorname{softmax}(\boldsymbol{s})$. Since this gives $\pi(z \mid x, \lambda) > 0$ for every $z$, the expectation in Eq. 1 depends on $\ell(x, z; \theta)$ for every possible $z \in \mathcal{Z}$.

¹Not to be confused with “score function,” as in SFE, which refers to the gradient of the log-likelihood.
We rethink this standard parametrization, proposing a sparse mapping from scores to the simplex. In particular, we substitute softmax by sparsemax [28]:

$$\operatorname{sparsemax}(\boldsymbol{s}) := \arg\min_{\boldsymbol{p} \in \triangle} \|\boldsymbol{p} - \boldsymbol{s}\|_2^2. \qquad (2)$$
Like softmax, sparsemax is differentiable and has efficient forward and backward passes [11, 28]. However, since Eq. 2 is the Euclidean projection operator onto the probability simplex, sparsemax can assign exactly zero probabilities whenever the projection hits the simplex boundary, unlike softmax.
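For concreteness, the projection in Eq. 2 can be computed with the standard sort-based threshold algorithm; the following is a small NumPy sketch (our illustration, not the paper's implementation):

```python
import numpy as np

def sparsemax(s):
    """Euclidean projection of the score vector s onto the probability
    simplex (Eq. 2), via the O(K log K) sort-based threshold algorithm."""
    srt = np.sort(s)[::-1]                 # scores sorted in descending order
    cssv = np.cumsum(srt)
    k = np.arange(1, len(s) + 1)
    support = 1.0 + k * srt > cssv         # coordinates kept by the projection
    rho = k[support][-1]                   # support size
    tau = (cssv[rho - 1] - 1.0) / rho      # threshold
    return np.maximum(s - tau, 0.0)
```

Unlike softmax, scores sufficiently below the threshold receive exactly zero probability: for example, `sparsemax([2.0, 0.0, -1.0])` puts all mass on the first assignment.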
Our main insight is that, with a sparse parametrization of $\pi$, we can compute the expectation in Eq. 1 by evaluating $\ell(x, z; \theta)$ only for the assignments $z$ with $\pi(z \mid x, \lambda) > 0$. This leads to a powerful alternative to MC estimation, which requires fewer than $|\mathcal{Z}|$ evaluations of $\ell$, and which strategically, yet deterministically, selects which assignments to evaluate $\ell$ on. Empirically, our analysis in §5 reveals an adaptive behavior of this sparsity-inducing mechanism: it performs more loss evaluations in early iterations, while the model is uncertain, and quickly reduces the number of evaluations, especially for unambiguous data points. This is a notable property of our learning strategy. In contrast, MC estimation cannot decide when an ambiguous data point may require more sampling for accurate estimation, and directly evaluating Eq. 1 with the dense $\boldsymbol{\pi}$ resulting from a softmax parametrization never reduces the number of evaluations required, even for simple instances.
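The exact-expectation computation over a sparse support can be sketched as follows (illustrative code; the call counter stands in for an expensive downstream loss such as a decoder pass):

```python
import numpy as np

def sparse_expectation(pi, loss_fn):
    """Compute E_pi[loss] exactly, calling loss_fn only where pi(z) > 0.

    With a sparsemax-parameterized pi, the number of calls equals the
    support size, which is typically much smaller than |Z|."""
    support = np.flatnonzero(pi)
    total, calls = 0.0, 0
    for z in support:
        total += pi[z] * loss_fn(z)        # expensive loss evaluated here
        calls += 1
    return total, calls

pi = np.array([0.8, 0.2, 0.0, 0.0])        # a sparse distribution over 4 assignments
value, calls = sparse_expectation(pi, lambda z: float(z ** 2))
```

Here only 2 of the 4 assignments are evaluated, yet the result equals the full sum; with a dense softmax distribution, all 4 calls would be unavoidable.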
4 Structured Latent Variables
While the approach described in §3 theoretically applies to any discrete distribution, many models of interest involve structured (or combinatorial) latent variables. In this section, we assume $z$ can be represented as a bit-vector, i.e., a random vector of $D$ discrete binary variables, $z \in \mathcal{Z} \subseteq \{0, 1\}^D$. This assignment of binary variables may involve global factors and constraints (e.g., tree constraints, or budget constraints on the number of active variables, i.e., $\sum_i z_i \leq B$, where $B$ is the maximum number of variables allowed to activate at the same time). In such structured problems, $|\mathcal{Z}|$ increases exponentially with $D$, making exact evaluation of Eq. 1 prohibitive, even with sparsemax.

Structured prediction typically handles this combinatorial explosion by parametrizing scores for individual binary variables and interactions within the global structured configuration, yielding a compact vector of variable scores $\boldsymbol{t} \in \mathbb{R}^D$ (e.g., log-potentials for binary attributes), with $D \ll |\mathcal{Z}|$. Then, the score of a global configuration $z$ is $s_z = \boldsymbol{t}^\top \boldsymbol{a}_z$, where $\boldsymbol{a}_z$ is the vector encoding of $z$. The variable scores induce a unique Gibbs distribution over structures, given by $\pi(z \mid x, \lambda) \propto \exp(s_z)$. Equivalently, defining $\boldsymbol{A}$ as the matrix with columns $\boldsymbol{a}_z$ for all $z \in \mathcal{Z}$, we consider the discrete distribution parameterized by $\boldsymbol{\pi} = \operatorname{softmax}(\boldsymbol{s})$, where $\boldsymbol{s} = \boldsymbol{A}^\top \boldsymbol{t}$. (In the unstructured case, $\boldsymbol{A} = \boldsymbol{I}$.)
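As a small sanity check of this parameterization (illustrative code, not from the paper), consider bit-vectors with encoding $\boldsymbol{a}_z = z$ and no interaction terms: the Gibbs distribution over all $2^D$ structures then factorizes into independent Bernoullis with success probabilities $\sigma(t_i)$:

```python
import itertools
import numpy as np

D = 4
rng = np.random.default_rng(0)
t = rng.normal(size=D)                     # variable scores (log-potentials)

# Enumerate Z = {0,1}^D and score each global configuration: s_z = t . a_z
Z = np.array(list(itertools.product([0, 1], repeat=D)), dtype=float)
s = Z @ t
pi = np.exp(s - s.max())
pi /= pi.sum()                             # Gibbs distribution over 2^D structures

# Without interaction terms, the same distribution factorizes into
# independent Bernoullis with success probability sigmoid(t_i).
q = 1.0 / (1.0 + np.exp(-t))
pi_factored = np.prod(np.where(Z == 1, q, 1.0 - q), axis=1)
```

The enumeration is only feasible for tiny $D$; the point of the methods below is precisely to avoid materializing these $2^D$ configurations.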
In practice, however, we cannot materialize the matrix $\boldsymbol{A}$ or the global score vector $\boldsymbol{s}$, let alone compute the softmax and the sum in Eq. 1. The SFE, however, can still be used, provided that exact sampling of $z \sim \pi$ is feasible, and efficient algorithms exist for computing the normalizing constant [48], needed to compute the probability of a given sample.
While it may be tempting to use sparsemax to avoid the expensive marginalization, this is prohibitive too: solving the problem in Eq. 2 still requires explicit manipulation of the large vector $\boldsymbol{s}$, and even if we could avoid this, in the worst case the resulting sparsemax distribution would still have exponentially large support. Fortunately, we show next that it is still possible to develop sparsification strategies that handle the combinatorial explosion of the structured case. We propose two different methods to obtain a sparse distribution supported only over a bounded-size subset of $\mathcal{Z}$: top-$k$ sparsemax (§4.1) and SparseMAP (§4.2).
4.1 Top-k Sparsemax
Recall that the sparsemax operator (Eq. 2) is simply the Euclidean projection onto the probability simplex. While this projection has a propensity to be sparse, there is no upper bound on the number of nonzeros of the resulting distribution. When $|\mathcal{Z}|$ is large, one possibility is to add a cardinality constraint $\|\boldsymbol{p}\|_0 \leq k$ for some prescribed $k$ [4]. The resulting problem becomes

$$\arg\min_{\boldsymbol{p} \in \triangle,\; \|\boldsymbol{p}\|_0 \leq k} \|\boldsymbol{p} - \boldsymbol{s}\|_2^2, \qquad (3)$$

which is known as a sparse projection onto the simplex and studied in detail by Kyrillidis et al. [21]. Remarkably, while this is a nonconvex problem, its solution can be written as a composition of two functions: a top-$k$ operator, which returns a vector identical to its input but where all the entries not among the $k$ largest ones are masked out, and the sparsemax operator applied to the surviving entries. Formally, $\operatorname{sparsemax}_k(\boldsymbol{s}) = \operatorname{sparsemax}(\operatorname{top}_k(\boldsymbol{s}))$. Being a composition of operators, its Jacobian is a product of matrices and hence simple to compute (the Jacobian of $\operatorname{top}_k$ is a diagonal matrix whose diagonal is a multi-hot vector indicating the top-$k$ elements of $\boldsymbol{s}$).
To apply the top-$k$ sparsemax to a large or combinatorial set $\mathcal{Z}$, all we need is a primitive to compute the top-$k$ entries of $\boldsymbol{s}$. This is available for many structured problems (for example, sequential models via $k$-best dynamic programming) and, when $\mathcal{Z}$ is the set of joint assignments of $D$ discrete binary variables, it can be done efficiently. After enumerating this set, we parameterize $\boldsymbol{\pi}$ by applying the sparsemax transformation to that top-$k$, at a computational cost that depends only on $k$, not on $|\mathcal{Z}|$. Note that this method is identical to sparsemax whenever the sparsemax solution has fewer than $k$ nonzeros: if during training the model learns to assign a sparse distribution to the latent variable, we are effectively using a sparsemax parametrization as presented in §3 with cheap computation. In fact, the solution of Eq. 3 gives us a certificate of optimality whenever $\|\boldsymbol{p}\|_0 < k$.
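The resulting operator can be sketched as follows for an explicit score vector (our illustration; for a combinatorial $\mathcal{Z}$ the top-$k$ indices would come from a $k$-best oracle rather than `argpartition`):

```python
import numpy as np

def sparsemax(s):
    """Euclidean projection onto the simplex (Eq. 2), sort-based."""
    srt = np.sort(s)[::-1]
    cssv = np.cumsum(srt)
    k = np.arange(1, len(s) + 1)
    rho = k[1.0 + k * srt > cssv][-1]
    tau = (cssv[rho - 1] - 1.0) / rho
    return np.maximum(s - tau, 0.0)

def topk_sparsemax(s, k):
    """Sparse projection with at most k nonzeros (Eq. 3): restrict to the
    top-k coordinates, project that subvector, leave zeros elsewhere."""
    top = np.argpartition(s, -k)[-k:]      # indices of the k largest scores
    p = np.zeros_like(s)
    p[top] = sparsemax(s[top])
    return p
```

When the unconstrained sparsemax already has fewer than $k$ nonzeros, the two operators coincide, which is exactly the certificate of optimality mentioned above.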
4.2 SparseMAP
A second possibility to obtain efficient summation over a combinatorial space, without imposing any cardinality constraint, is to use SparseMAP [33, 34], a structured extension of sparsemax:

$$\operatorname{SparseMAP}(\boldsymbol{t}) := \arg\max_{\boldsymbol{\mu} \in \operatorname{conv}\{\boldsymbol{a}_z : z \in \mathcal{Z}\}} \boldsymbol{t}^\top \boldsymbol{\mu} - \tfrac{1}{2}\|\boldsymbol{\mu}\|_2^2. \qquad (4)$$
SparseMAP has been used successfully in discriminative latent variable models for structures such as trees and matchings, and Niculae et al. [33] propose an active set algorithm for evaluating it and computing its gradients efficiently, requiring only a primitive for computing $\arg\max_{z \in \mathcal{Z}} \boldsymbol{t}^\top \boldsymbol{a}_z$ (a MAP oracle). While the solution of Eq. 4 is generally not unique, solutions with support size at most $D + 1$ are guaranteed to exist by Carathéodory’s theorem, and the active set algorithm of [33] enjoys linear and finite convergence to a very sparse optimal distribution. Crucially, Eq. 4 has a solution whose support grows only linearly with $D$, and is therefore much smaller than $|\mathcal{Z}|$. Assessing the expectation in Eq. 1 thus only requires evaluating $\mathcal{O}(D)$ terms.
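The paper relies on the active set algorithm of [33]; purely as an illustration of how a MAP oracle alone suffices, here is a plain Frank-Wolfe (conditional gradient) sketch of Eq. 4, viewing SparseMAP as the projection of $\boldsymbol{t}$ onto $\operatorname{conv}\{\boldsymbol{a}_z\}$. This is our own simplification: vanilla Frank-Wolfe converges more slowly than the active set method and does not guarantee a minimal support.

```python
import numpy as np

def map_oracle(w):
    """MAP oracle for unconstrained bit-vectors: argmax_z w . z over {0,1}^D."""
    return (w > 0).astype(float)

def sparsemap_fw(t, iters=2000):
    """Frank-Wolfe sketch of SparseMAP (Eq. 4): project t onto conv{a_z}
    using only the MAP oracle. Returns the mean vector mu and convex
    weights over the visited structures."""
    v = map_oracle(t)
    mu = v.copy()
    weights = {tuple(v): 1.0}
    for _ in range(iters):
        v = map_oracle(t - mu)             # best vertex along the descent direction
        d = v - mu
        dd = d @ d
        if dd < 1e-12:
            break
        gamma = np.clip((t - mu) @ d / dd, 0.0, 1.0)   # exact line search
        if gamma == 0.0:
            break
        for key in weights:
            weights[key] *= 1.0 - gamma
        weights[tuple(v)] = weights.get(tuple(v), 0.0) + gamma
        mu = (1.0 - gamma) * mu + gamma * v
    return mu, weights
```

For this unconstrained bit-vector case the polytope is the unit cube, so the solution is simply `clip(t, 0, 1)`; structured cases differ only in the oracle.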
5 Experimental Analysis
We next demonstrate the applicability of our proposed strategies by tackling three tasks: a deep generative model with semi-supervision (§5.1), an emergent communication two-player game over a discrete channel (§5.2), and a variational autoencoder with latent binary factors (§5.3). We describe further architecture and hyperparameter details in App. B.

5.1 Semi-supervised Variational Autoencoder (VAE)
We consider the semi-supervised VAE of [18], which models the joint probability $p(x, y, z) = p(x \mid y, z)\, p(y)\, p(z)$, where $x$ is an observation (an MNIST image), $z$ is a continuous latent variable with a standard Gaussian prior, and $y$ is a discrete random variable with a uniform prior over the label categories. The marginal likelihood is intractable, due to the marginalization over $z$. For a fixed $z$ (e.g., sampled), marginalizing over $y$ requires one call to the decoder per category, which can be costly depending on the decoder’s architecture.

To circumvent the need for the marginal likelihood, Kingma et al. [18] use variational inference [13], with an approximate posterior $q(y, z \mid x)$ with parameters $\lambda$. This trains a classifier $q(y \mid x)$ along with the generative model. In [18], $z$ is sampled with a reparameterization, and the expectation over $y$ is computed in closed form, that is, assessing all terms of the sum for a sampled $z$. Under the notation in §2, we set $\pi(y \mid x, \lambda) = q(y \mid x)$ and

$$\ell(x, y; \theta) = \mathrm{KL}\left[q(z \mid x, y) \,\|\, p(z)\right] - \mathbb{E}_{q(z \mid x, y)}\left[\log p(x \mid y, z)\right] - \log p(y), \qquad (5)$$
which turns Eq. 1 into the (negative) evidence lower bound (ELBO). To update the parameters of $q(z \mid x, y)$, we use the reparameterization trick to obtain gradients through a sampled $z$. For $y$, we may still explicitly marginalize over each possible assignment, but this incurs a multiplicative cost of one decoder call per category. As an alternative, we experiment with parameterizing $q(y \mid x)$ with a sparse mapping, comparing it to the original formulation and to stochastic gradients based on the SFE and on continuous relaxations of $y$.
Data and architecture.
Comparisons.
Our proposal’s key ingredient is sparsity, which permits exact marginalization and a deterministic gradient. To investigate the impact of sparsity alone, we report a comparison against exact marginalization over the entire support using a dense softmax parameterization. To investigate the impact of deterministic gradients, we compare to stochastic gradient strategies: (i) unbiased SFE with a moving average baseline; (ii) SFE with a self-critic baseline [SFE+; 38], where the baseline is the log-likelihood assessed at an independent sample and treated as independent of the parameters of the generative model; (iii) sum-and-sample, a Rao-Blackwellized version of SFE [25]; and (iv) Gumbel-Softmax.
Results and discussion.
In Fig. 1, we see that our proposed sparse marginalization approach performs just as well as its dense counterpart, both in terms of ELBO and accuracy. However, by inspecting the number of times each method calls the decoder, we can see that the effective support of our method is much smaller: sparsemax-parameterized posteriors become very confident, mostly requiring one, and sometimes two, calls to the decoder. Among the Monte Carlo methods, the continuous relaxation of Gumbel-Softmax underperformed all other methods except SFE with a moving average baseline. While SFE+ and sum-and-sample are very strong performers, they require the same number of decoder calls (here, two) throughout training. In contrast, sparsemax makes a small number of decoder calls not by a choice of hyperparameters, but because the model converges to a small support, requiring less computation as it becomes more confident.
5.2 Emergent Communication Game
Emergent communication studies how two agents can develop a communication protocol to solve a task collaboratively [19]. Recent work used neural latent variable models to train these agents via a “collaborative game” between them [24, 22, 10, 14, 7, 45]. In [22], one of the agents (the sender) sees an image and sends a single-symbol message, chosen from a finite vocabulary, to the other agent (the receiver), who needs to choose the correct image out of a collection of candidate images.³ They found that the messages communicated this way can be correlated with broad object properties amenable to interpretation by humans. In our framework, the message plays the role of the discrete latent variable: the sender parameterizes $\pi$, and the receiver’s loss is the downstream loss $\ell$.

³Lazaridou et al. [22] let the sender see the full image collection. In contrast, we follow [10] in showing only the correct image to the sender. This makes the game harder, as the message needs to encode a good “description” of the correct image instead of encoding only its differences from the distractors.
Data and architecture.
Comparisons.
We compare our method to SFE with a moving average baseline, trained with 0/1 loss and with negative log-likelihood loss, Gumbel-Softmax, Straight-Through Gumbel-Softmax, and exact marginalization under a dense softmax parameterization of $\pi$.
Results and discussion.
Method                Comm. succ. (%)    Dec. calls
Monte Carlo
  SFE (0/1)
  SFE (NLL)
  Gumbel
  ST Gumbel
Marginalization
  Dense
  Sparse (proposed)
Table 1 shows the communication success (the accuracy of the receiver at picking the correct image). While communication success in [22] was close to perfect, we see that increasing the number of candidate images to 16 makes this game much harder for sampling-based approaches. In fact, only the models that perform explicit marginalization achieve close-to-perfect communication on the test set. However, as the vocabulary grows, marginalizing with a softmax parameterization gets computationally more expensive, as it requires one forward and backward pass on the receiver per symbol. Unlike softmax, the model trained with sparsemax converges to a very small support, requiring only 3 times more receiver calls, on average, than sampling-based approaches. In fact, sparsemax starts out quite dense, but its support size quickly falls to close to 1 (see App. C).
5.3 BitVector Variational Autoencoder
As described in §4, in many interesting problems, combinatorial interactions and constraints make $|\mathcal{Z}|$ exponentially large. In this section, we study the illustrative case of encoding (compressing) images into a binary codeword $z \in \{0, 1\}^D$, by training a latent bit-vector variational autoencoder [12, 30]. One approach for parameterizing the approximate posterior is to use a Gibbs distribution, decomposable as a product of independent Bernoullis, $q(z \mid x, \lambda) = \prod_{i=1}^{D} q(z_i \mid x, \lambda)$, with $D$ the number of latent variables. While marginalizing over all possible $z$ is intractable, drawing samples can be done efficiently by sampling each component independently, and the entropy has a closed-form expression. This efficient sampling and entropy computation relies on an independence assumption; in general, we may not have access to such efficient computation.
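For the factorized case, both operations are straightforward; a small sketch (illustrative names) of the independent sampling and the closed-form entropy $\mathbb{H}[q] = -\sum_i \left[q_i \log q_i + (1 - q_i)\log(1 - q_i)\right]$:

```python
import numpy as np

def bernoulli_sample(q, rng):
    """Sample a bit-vector from the factorized posterior prod_i Bern(q_i)."""
    return (rng.uniform(size=q.shape) < q).astype(float)

def bernoulli_entropy(q):
    """Closed-form Shannon entropy of a product of independent Bernoullis."""
    eps = 1e-12                              # guard against log(0)
    q = np.clip(q, eps, 1.0 - eps)
    return float(-(q * np.log(q) + (1 - q) * np.log(1 - q)).sum())
```

For instance, a uniform posterior over $D$ bits ($q_i = 0.5$) has entropy $D \log 2$, the maximum possible.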
We train this VAE by minimizing the negative ELBO, $\mathbb{E}_{q(z \mid x, \lambda)}\left[-\log p(x \mid z, \theta)\right] + \mathrm{KL}\left[q(z \mid x, \lambda) \,\|\, p(z)\right]$; we use a uniform prior $p(z) = 2^{-D}$. This objective does not constrain $q(z \mid x, \lambda)$ to the Gibbs parameterization, and thus, to apply our methods, we depart from it, as described next.
Top-k sparsemax parametrization.
SparseMAP parametrization.
Another sparse alternative to the intractable structured sparsemax, as discussed in §4, is SparseMAP. In this case, we compute an optimal distribution using the active set algorithm of [33], with a maximization oracle that can be computed in $\mathcal{O}(D)$:

$$\hat{z} = \arg\max_{z \in \{0,1\}^D} \boldsymbol{t}^\top z, \quad \text{i.e.,} \quad \hat{z}_i = 1 \text{ iff } t_i > 0. \qquad (6)$$
Since SparseMAP can handle structured problems, we also experiment with adding a budget constraint to SparseMAP: this is done by adding a constraint $\sum_{i=1}^{D} z_i \leq B$, with $B < D$. The budget constraint ensures the images are represented with sparse codes, and the corresponding maximization oracle can still be computed efficiently, as described in App. A. With both top-$k$ sparsemax and these two variants of SparseMAP, the approximate posterior is very sparse, so we may compute the terms of Eq. 1 explicitly.
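A sketch of the budget-constrained oracle (our illustration; App. A of the paper gives the actual construction): with at most $B$ active bits, the maximizer activates the highest-scoring coordinates whose scores are positive.

```python
import itertools
import numpy as np

def budget_map_oracle(t, B):
    """argmax of t . z over z in {0,1}^D subject to sum(z) <= B: take the
    (at most B) coordinates with the largest positive scores. A full sort
    is used here for clarity; a partial selection suffices."""
    z = np.zeros_like(t)
    top = np.argsort(-t)[:B]          # B highest-scoring coordinates
    top = top[t[top] > 0.0]           # activating a negative score only hurts
    z[top] = 1.0
    return z
```

For small $D$ the result can be checked against brute-force enumeration of all feasible bit-vectors.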
Data and architecture.
We use FashionMNIST [51], consisting of 256-level grayscale images. The decoder uses an independent categorical distribution for each pixel. For top-$k$ sparsemax, the value of $k$ is a fixed hyperparameter. We compare our methods to SFE with a moving average baseline and train all models for 100 epochs.
Results and discussion.
Fig. 2 shows an importance sampling estimate (with 1024 samples per test example) of the negative log-likelihood for the several methods, in bits per dimension of $x$, together with the converged values of each method in the rate-distortion (RD) plane.⁴ Both show results for bit-vectors with dimensionality 32 and 128. The learned representations of our methods show increased performance compared to SFE: not only is the estimated negative log-likelihood significantly lower, but our methods also have a higher rate and lower distortion, suggesting a better fit. In Fig. 3, we observe the training progress in number of calls to $\ell$ for the models with 32 and 128 latent bits, respectively. While top-$k$ sparsemax introduces a bias towards the most probable assignments, and may discard outcomes to which the unconstrained sparsemax would assign nonzero probability, as training progresses the distributions may become sufficiently sparse that this mismatch disappears, making the gradient computation exact. Remarkably, this happens for 32 latent bits: the support of the posterior becomes smaller than $k$, giving the true gradient for most of the training. This no longer happens for 128 latent bits, for which the posterior retains full top-$k$ support throughout, as this model is harder to fit. On the other hand, SparseMAP solutions become very sparse from the start in both cases, while still obtaining good performance. There is, therefore, a trade-off between the solutions we propose: top-$k$ sparsemax can become exact with respect to the expectation in Eq. 1, but only if the chosen $k$ is suited to the difficulty of the model; SparseMAP may not offer an exact gradient, but its performance is very close, and its higher propensity for sparsity grants it cheaper computation.

⁴Distortion is the expected value of the reconstruction negative log-likelihood, while rate is the average KL divergence between the approximate posterior and the prior.
6 Related Work
Sparse mappings.
There has been recent interest in applying sparse mappings of discrete distributions in deep models [28, 33, 32, 36], attention mechanisms [27, 42, 29, 6], and as part of discriminative models [34]. Our work focuses instead on parameterizing distributions over latent variables with these sparse mappings, and on contrasting this training method for discrete latent variables with common sampling-based ones.
Reducing sampling noise.
The sampling procedure in SFE is a major source of variance in models that use this method. To reduce this variance, many works have proposed baselines [50, 9] or control variates [49, 47, 8]. Our method contrasts with these approaches by using an exact gradient that requires no sampling of the latent variable at training time. Furthermore, we do this while introducing no new parameters into the network. Closer to our work are variance reduction techniques that rely on partial marginalization, typically over the top-$k$ highest-scoring assignments of the latent variable [25, 20]. These methods show improved performance and variance reduction, but still rely on a noisy estimate over the set outside the top-$k$, and require a choice of $k$. Our methods require no sampling, nor are they tied to a particular choice of $k$: the sparse mappings we use adapt their support as training progresses.
7 Conclusion
We described a novel training strategy for discrete latent variable models, eschewing the common approach based on MC gradient estimation in favor of deterministic, exact marginalization under a sparse distribution. Sparsity leads to a powerful adaptive method, which can investigate fewer or more latent assignments depending on the ambiguity of a training instance, as well as on the stage of training. We showcase the performance and flexibility of our method on a variety of applications, with both categorical and structured latent variables, with positive results. Our models very quickly settle on a small number of latent assignment evaluations per sample, yet make progress much faster and overall lead to superior results. Our proposed method thus offers the accuracy and robustness of exact marginalization with the efficiency and flexibility of score function estimator methods, providing a promising alternative.
Broader Impact
We discuss here the broader impact of our work. Discussion in this section is predominantly speculative, as the methods described in this work have not yet been tested in broader applications. However, we do think that these methods can be applied widely, as they are applicable to any model that contains discrete latent variables, even of combinatorial type.
Currently, the solutions available to train discrete latent variable models rely heavily on MC sampling, which can have high variance. Methods that aim to decrease this variance are often non-trivial to train and to implement, and may disincentivize practitioners from using this class of models. However, we believe that discrete latent variable models have, in many cases, more interpretable and intuitive latent representations. Our methods offer: an approach to training these models that is simple to implement; no additional parameters; low computational overhead (especially when compared to more sophisticated methods of variance reduction [25]); and improved performance.
As we have already pointed out, latent variable models often have superior explanatory power, and so can aid in understanding cases in which the model failed the downstream task. Interpretability of deep neural models can be essential to discovering ethically harmful biases that exist in the data or in the model itself.
On the other hand, the generative models discussed in this work may also pave the way for malicious use cases, as with Deepfakes, fake human avatars used by malevolent Twitter users, and automatically generated fraudulent news. Generative models are remarkable at sampling new instances of fake data and, with the power of latent variables, the interpretability discussed before can be used maliciously to further push harmful biases instead of removing them. Furthermore, our work is promising for improving the performance of latent variable models with several discrete variables, which can be trained as attributes to control sample generation. Attributes that can be activated or deactivated at will to generate fake data can help both beneficial and malicious users to finely control the generated sample. Our work may currently be agnostic to this, but we recognize the dangers and dedicate effort to combating any malicious applications.
Energywise, latent variable models often require less data and computation than other models that rely on a massive amount of data and infrastructure. This makes latent variable models ideal for situations where data is scarce, or where there are few computational resources to train large models. We believe that better latent variable modeling is a step forward in the direction of alleviating environmental concerns of deep learning research
[44]. However, the models proposed in this work tend to use more resources early in training than standard methods, and even though in the applications shown they consume much less as training progresses, it is not clear whether that trend holds in all potential applications.

In data science, latent variable models (LVMs), such as mixed-membership models [3], can be used to uncover correlations in large amounts of data, for example, by clustering observations. Training these models requires various degrees of approximation, which are not without consequence: they may impact the quality of our conclusions and their fidelity to the data. For example, variational inference tends to underestimate uncertainty and give very little support to certain less-represented groups of variables. Where such a model informs decision-makers on matters that affect lives, these decisions may be based on an incomplete view of the correlations in the data, and/or these correlations may be exaggerated in harmful ways. On the one hand, our work contributes to more stable training of LVMs, and thus is a step towards addressing some of the many approximations that can blur the picture. On the other hand, sparsity may carry a larger risk of disregarding certain correlations or groups of observations, and thus contribute to misinforming the data scientist. At this point, it is unclear to what extent the latter happens and, if it does, whether it is consistent across LVMs and their various uses. We aim to study this issue further and work with practitioners to identify failure cases.

Acknowledgments
This work was supported by the European Research Council (ERC StG DeepSPIN 758969), by the Fundação para a Ciência e Tecnologia through contracts UID/EEA/50008/2019 and CMUPERI/TIC/0046/2014 (GoLocal), and by the MAIA project, funded by the P2020 program under contract number 045909. This project also received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement 825299 (GoURMET). Top-$k$ sparsemax is due in great part to initial work and ideas of Mathieu Blondel.
References
[1] (2019) Interpretable neural predictions with differentiable binary variables. In Proc. ACL.
[2] (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. Preprint arXiv:1308.3432.
[3] (2014) Build, compute, critique, repeat: data analysis with latent variable models. Annual Review of Statistics and Its Application 1, pp. 203–232.
[4] (2020) Learning with Fenchel-Young losses. Journal of Machine Learning Research 21 (35), pp. 1–69.
[5] (1996) Rao-Blackwellisation of sampling schemes. Biometrika 83 (1), pp. 81–94.
 [6] (2019) Adaptively sparse transformers. In Proc. EMNLP, Cited by: §6.
 [7] (2016) Learning to communicate with deep multi-agent reinforcement learning. In Proc. NeurIPS, Cited by: §5.2.
 [8] (2018) Backpropagation through the void: optimizing control variates for black-box gradient estimation. In Proc. ICLR, Cited by: §2, §6.
 [9] (2016) MuProp: unbiased backpropagation for stochastic neural networks. In Proc. ICLR, Cited by: §2, §6.
 [10] (2017) Emergence of language with multi-agent games: learning to communicate with sequences of symbols. In Proc. NeurIPS, Cited by: §5.2, §5.2, footnote 3.
 [11] (1974) Validation of subgradient optimization. Mathematical Programming 6 (1), pp. 62–88. Cited by: §3.
 [12] (2017) Categorical reparameterization with Gumbel-Softmax. In Proc. ICLR, Cited by: §1, §2, §5.3.
 [13] (1999) An introduction to variational methods for graphical models. Machine Learning 37 (2), pp. 183–233. Cited by: §5.1.

[14] (2016) Learning to play Guess Who? and inventing a grounded language as a consequence. In Proc. NeurIPS Workshop on Deep Reinforcement Learning, Cited by: §5.2.
 [15] (2019) EGG: a toolkit for research on Emergence of lanGuage in Games. In Proc. EMNLP, Cited by: Appendix B.
 [16] (2018) A tutorial on deep latent variable models of natural language. preprint arXiv:1812.06834. Cited by: §1.
 [17] (2014) Auto-encoding variational Bayes. In Proc. ICLR, Cited by: §1, §2, §2.
 [18] (2014) Semi-supervised learning with deep generative models. In Proc. NeurIPS, Cited by: §5.1, §5.1.
 [19] (2002) Natural language from artificial life. Artificial life 8 (2), pp. 185–215. Cited by: §5.2.
 [20] (2020) Estimating gradients for discrete random variables by sampling without replacement. In Proc. ICLR, Cited by: §6.
 [21] (2013) Sparse projections onto the simplex. In Proc. ICML, Cited by: §4.1.
 [22] (2017) Multi-agent cooperation and the emergence of (natural) language. In Proc. ICLR, Cited by: Appendix B, §B.1, Appendix B, §5.2, §5.2, §5.2, footnote 3.
 [23] (1998) Gradient-based learning applied to document recognition. Proc. IEEE 86 (11), pp. 2278–2324. Cited by: §5.1.
 [24] (1969) Convention: a philosophical study. Cited by: §5.2.
 [25] (2019) Rao-Blackwellized stochastic gradients for discrete distributions. In Proc. ICML, Cited by: Appendix B, §2, §5.1, §5.1, §6, Broader Impact.
 [26] (2017) The Concrete distribution: a continuous relaxation of discrete random variables. In Proc. ICLR, Cited by: §1, §2.
 [27] (2018) Sparse and constrained attention for neural machine translation. In Proc. ACL, Cited by: §6.
 [28] (2016) From softmax to sparsemax: a sparse model of attention and multilabel classification. In Proc. ICML, Cited by: §1, §3, §6.
 [29] (2019) Selective attention for context-aware neural machine translation. preprint arXiv:1903.08788. Cited by: §6.
 [30] (2014) Neural variational inference and learning in belief networks. In Proc. ICML, Cited by: §5.3.
 [31] (2019) Monte Carlo gradient estimation in machine learning. preprint arXiv:1906.10652. Cited by: §1.
 [32] (2017) A regularized framework for sparse and structured neural attention. In Proc. NeurIPS, Cited by: §6.
 [33] (2018) SparseMAP: differentiable sparse structured inference. In Proc. ICML, Cited by: §1, §4.2, §5.3, §6.
 [34] (2018) Towards dynamic computation graphs via sparse latent structure. In Proc. EMNLP, Cited by: §4.2, §6.
 [35] (2012) Variational bayesian inference with stochastic search. In Proc. ICML, Cited by: §1, §2.
 [36] (2019) Sparse sequence-to-sequence models. In Proc. ACL, Cited by: §6.
 [37] (2014) Black box variational inference. In Proc. AISTATS, Cited by: §2.
 [38] (2017) Self-critical sequence training for image captioning. In Proc. CVPR, Cited by: §5.1.
 [39] (2014) Stochastic backpropagation and approximate inference in deep generative models. In Proc. ICML, Cited by: §1, §2.
 [40] (1951) A stochastic approximation method. The Annals of Mathematical Statistics 22 (3), pp. 400–407. Cited by: §2.
 [41] (1976) A Monte Carlo method for estimating the gradient in a stochastic network. Unpublished manuscript, Technion, Haifa, Israel. Cited by: §1, §2.
 [42] (2019) SSN: Learning sparse switchable normalization via SparsestMax. In Proc. CVPR, Cited by: §6.
 [43] (2015) Very deep convolutional networks for large-scale image recognition. In Proc. ICLR, Cited by: §B.1.
 [44] (2019) Energy and policy considerations for deep learning in NLP. preprint arXiv:1906.02243. Cited by: Broader Impact.
 [45] (2016) Learning multi-agent communication with backpropagation. In Proc. NeurIPS, Cited by: §5.2.
 [46] (2008) A joint model of text and aspect ratings for sentiment summarization. In Proc. ACL, Cited by: §1.
 [47] (2017) REBAR: lowvariance, unbiased gradient estimates for discrete latent variable models. In Proc. NeurIPS, Cited by: §2, §6.
 [48] (2008) Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning 1 (1–2), pp. 1–305. Cited by: §4.
 [49] (2013) Variance reduction for stochastic gradient optimization. In Proc. NeurIPS, Cited by: §2, §6.
 [50] (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3–4), pp. 229–256. Cited by: §2, §6.
 [51] (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. preprint arXiv:1708.07747. Cited by: §5.3.
Appendix A Budget Constraint
The maximization oracle for the budget constraint described in §5.3 can be computed in O(D log D) time, where D is the number of Bernoulli variables: sort the Bernoulli scores in decreasing order and select, among the top k, the entries with a positive score.
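As an illustrative sketch (not the paper's implementation), the oracle can be written as follows, where `scores` is a vector of Bernoulli scores and `k` is the budget:

```python
import numpy as np

def budget_oracle(scores, k):
    """Maximization oracle for a budget constraint of at most k active bits.

    Sorting dominates the cost, giving O(D log D) for D scores.
    """
    order = np.argsort(-scores)   # indices sorted by decreasing score
    top = order[:k]               # candidates within the budget
    z = np.zeros_like(scores)
    z[top[scores[top] > 0]] = 1.0 # keep only the positive-scoring entries
    return z

scores = np.array([0.7, -0.2, 1.5, 0.1, -0.9])
print(budget_oracle(scores, k=3))  # [1. 0. 1. 1. 0.]
```

Entries outside the top k stay off even if positive, and negative-scoring entries stay off even if they fit within the budget.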
Appendix B Training Details
In our applications, we follow the experimental procedures described in [25] and [22] for §5.1 and §5.2, respectively. We describe below the most relevant training details and key differences in architectures when applicable. For other implementation details that we do not mention here, we refer the reader to the works referenced above.
Semisupervised Variational Autoencoder.
In this experiment, the classification network consists of three fully connected hidden layers of size 256, using ReLU activations. The generative and inference networks both consist of one hidden layer of size 128, also with ReLU activations. The multivariate Gaussian has 8 dimensions and its covariance is diagonal. For all models we chose the learning rate based on the best ELBO on the validation set, using a grid search over (5e-5, 1e-4, 5e-4, 1e-3, 5e-3). Optimization was done with Adam.
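For concreteness, a minimal NumPy sketch of the layer shapes implied by these settings; the 784-dimensional input and 10-class output are our assumptions for MNIST, and the random weights are placeholders rather than trained parameters:

```python
import numpy as np

def mlp_forward(x, layer_sizes, rng):
    """Forward pass through fully connected layers with placeholder weights.

    ReLU is applied on hidden layers only; the final layer is linear.
    """
    for i, (d_in, d_out) in enumerate(zip(layer_sizes[:-1], layer_sizes[1:])):
        W = rng.normal(scale=0.01, size=(d_in, d_out))
        x = x @ W
        if i < len(layer_sizes) - 2:
            x = np.maximum(x, 0)  # ReLU on hidden layers
    return x

rng = np.random.default_rng(0)
# Classification network: three hidden layers of size 256
# (784-dim MNIST input, 10-class output assumed).
logits = mlp_forward(np.zeros((1, 784)), [784, 256, 256, 256, 10], rng)
print(logits.shape)  # (1, 10)
```

The generative and inference networks follow the same pattern with a single hidden layer of size 128.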
Emergent communication game.
In this application, we closely followed the experimental procedure described by Lazaridou et al. [22], with a few key differences. The architecture of the sender and the receiver is identical, with the exception that the sender does not take the distractor images as input along with the correct image — only the correct image. The collection of images shown to the receiver was increased from 2 to 16, and the vocabulary of the sender was increased to 256. The hidden size and embedding size were also increased, to 512 and 256, respectively. We did a grid search on the learning rate (0.01, 0.005, 0.001) and the entropy regularizer (0.1, 0.05, 0.01), and chose the best configuration for each model on the validation set based on communication success. The Gumbel models were trained with a temperature of 1 throughout. All models were implemented in EGG [15] and trained with the Adam optimizer, with a batch size of 64, for 200 epochs.
Bit-Vector Variational Autoencoder.
In this experiment, the generative and inference networks consist of one hidden layer with 128 nodes, using ReLU activations. We searched over learning rates (0.0005, 0.001, 0.002) via grid search and chose the model based on ELBO performance on the validation set. We used the Adam optimizer.
b.1 Datasets
Semisupervised Variational Autoencoder.
MNIST consists of grayscale images of handwritten digits. It contains 60,000 datapoints for training and 10,000 datapoints for testing. We perform model selection on the last 10,000 datapoints of the training split.
Emergent communication game.
The data used by Lazaridou et al. [22] is a subset of ImageNet containing 463,000 images, obtained by sampling 100 images from 463 base-level concepts. Each image is then passed through the pre-trained VGG ConvNet [43], and the representations at the second-to-last fully connected layer are saved to use as input to the sender/receiver.
Bit-Vector Variational Autoencoder.
Fashion-MNIST consists of grayscale images of clothes. It contains 60,000 datapoints for training and 10,000 datapoints for testing. We perform model selection on the last 10,000 datapoints of the training split.
Appendix C Performance in Decoder Calls
Appendix D Computing infrastructure
Our infrastructure consists of 4 machines with the specifications shown in Table 2. The machines were used interchangeably, and all experiments were executed on a single GPU. Despite the machines having different specifications, we did not observe large differences in the execution time of our models across them.
Table 2: Specifications of the machines used in our experiments.

#   GPU                        CPU
1.  4 × Titan Xp (12GB)        16-core AMD Ryzen 1950X @ 3.40GHz, 128GB RAM
2.  4 × GTX 1080 Ti (12GB)     8-core Intel i7-9800X @ 3.80GHz, 128GB RAM
3.  3 × RTX 2080 Ti (12GB)     12-core AMD Ryzen 2920X @ 3.50GHz, 128GB RAM
4.  3 × RTX 2080 Ti (12GB)     12-core AMD Ryzen 2920X @ 3.50GHz, 128GB RAM