Latent variable models, and probabilistic graphical models more generally, provide a declarative language for specifying prior knowledge and structural relationships in complex datasets. They have a long and rich history in natural language processing, having contributed to fundamental advances such as statistical alignment for translation (Brown et al., 1993), topic modeling (Blei et al., 2003), unsupervised part-of-speech tagging (Brown et al., 1992), and grammar induction (Klein and Manning, 2004), among others. Deep learning, broadly construed, is a toolbox for learning rich representations (i.e., features) of data through numerical optimization. Deep learning is the current dominant paradigm in natural language processing, and some of its major successes include language modeling (Bengio et al., 2003; Mikolov et al., 2010; Zaremba et al., 2014), machine translation (Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017), and natural language understanding tasks such as question answering and natural language inference.
There has been much recent, exciting work on combining the complementary strengths of latent variable models and deep learning. Latent variable modeling makes it easy to explicitly specify model constraints through conditional independence properties, while deep learning makes it possible to parameterize these conditional likelihoods with powerful function approximators. While these “deep latent variable” models provide a rich, flexible framework for modeling many real-world phenomena, difficulties exist: deep parameterizations of conditional likelihoods usually make posterior inference intractable, and latent variable objectives often complicate backpropagation by introducing points of non-differentiability. This tutorial explores these issues in depth through the lens of variational inference (Jordan et al., 1999; Wainwright and Jordan, 2008), a key technique for performing approximate posterior inference.
The term “deep latent variable” models can also refer to the use of neural networks to perform latent variable inference (“deep inference”). In the context of variational inference, this means that we train an inference network to output the parameters of an approximate posterior distribution given the set of variables to be conditioned upon (Kingma and Welling, 2014; Rezende et al., 2014; Mnih and Gregor, 2014). We will devote a significant portion of the tutorial to this setting.
The tutorial will be organized as follows: section 2 introduces notation and briefly describes the basics of neural networks; section 3 presents several archetypal latent variable models of text and their “deep” variants; section 4 surveys applications of the latent variable models introduced in the previous section; section 5 deals with learning and performing posterior inference in latent variable models, both in the case where exact inference over the latent variables is tractable and when it is not; section 6 focuses on amortized variational inference and variational autoencoders, a central framework for learning deep generative models; section 7 briefly touches on other methods for learning latent variable models; and finally, section 8 concludes.
While our target audience is the community of natural language processing (NLP) researchers and practitioners, we hope that this tutorial will also be of use to researchers in other areas. We have therefore organized the tutorial to be modular: sections 3 and 4, which are more specific to NLP, are largely independent from sections 5 and 6, which mostly deal with the general problem of learning deep generative models.
This tutorial will focus on learning latent variable models whose joint distribution can be expressed as a directed graphical model (DGM), and we will mostly do this through variational inference. (Footnote 1: Directed graphical models are also sometimes referred to as generative models, since their directed nature makes it easy to generate data through ancestral sampling.) Specifically, we will not cover (or only briefly touch on) undirected graphical models such as restricted Boltzmann Machines (and more broadly, Markov random fields), posterior inference based on Markov chain Monte Carlo sampling, spectral learning of latent variable models, and non-likelihood-based approaches such as generative adversarial networks (Goodfellow et al., 2014). While each of these topics is a rich area of active research on its own, we have chosen to limit the scope of this tutorial to directed graphical models and variational inference in order to provide a more thorough background on fundamental ideas and key techniques, as well as a detailed survey of recent advances.
Throughout, we will assume the following notation:
$x$: an observation, usually a sequence of tokens, $x_1, \ldots, x_T$, sometimes denoted with $x_{1:T}$
$x^{(1:N)}$: a dataset of $N$ i.i.d. observations, $x^{(1)}, \ldots, x^{(N)}$
$z$: an unobserved latent variable, which may be a sequence or other structured object
$\mathbf{z}$: a latent vector (we will use $z$ to denote a general latent variable, and $\mathbf{z}$ in the specific case where the latent variable is a continuous vector, e.g. $\mathbf{z} \in \mathbb{R}^d$)
$\theta$: generative model parameters
$\lambda$: variational parameters
$p_{x,z}(x, z; \theta)$, or $p(x, z; \theta)$: generative model parameterized by $\theta$
$p_{x|z}(x \mid z; \theta)$, or $p(x \mid z; \theta)$: likelihood model parameterized by $\theta$
$q(z; \lambda)$: variational distribution parameterized by $\lambda$
$\Delta^{K-1}$: the standard $(K-1)$-simplex, i.e. the set $\{\pi \in \mathbb{R}^K_{+} : \sum_{k=1}^K \pi_k = 1\}$
We will write a density (or mass function) as $p(x; \theta)$, where the density is parameterized by $\theta$. While the random variable over which the distribution is induced will often be clear from context, when it is not clear we will either use subscripts to identify the random variable, e.g. $p_z(z; \theta)$ and $p_{x|z}(x \mid z; \theta)$, or a different letter, e.g. $q(z; \lambda)$. We will sometimes overload notation and use $p(x; \theta)$ with the same $x$ to refer to the probability of the event that the random variable $x$ takes on the value $x$, where the probability distribution is again parameterized by $\theta$, i.e. $p(x; \theta) = P(x = x; \theta)$. For brevity of notation, and since the distinction will encumber rather than elucidate at the level of abstraction this tutorial is aiming for, we will also overload variables (e.g. $x$) to both refer to a random variable (a measurable function from a sample space $\Omega$ to $\mathbb{R}^n$) and its realization (an element of $\mathbb{R}^n$).
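We now briefly introduce the neural network machinery to be used in this tutorial. Deep networks are parameterized non-linear functions, which transform an input $\mathbf{z}$ into features $\mathbf{h}$ using parameters $\pi$. We will in particular make use of the multilayer perceptron (MLP), which computes features as follows:
$$\mathbf{h} = \mathrm{MLP}(\mathbf{z}; \pi) = \mathbf{V}\,\sigma(\mathbf{W}\mathbf{z} + \mathbf{b}) + \mathbf{a},$$
where $\sigma$ is an element-wise nonlinearity, such as $\tanh$, $\mathrm{ReLU}$, or the logistic sigmoid, and the set of neural network parameters is $\pi = \{\mathbf{V}, \mathbf{W}, \mathbf{a}, \mathbf{b}\}$.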
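As a concrete illustration, the following is a minimal sketch of a one-hidden-layer MLP of the form $\mathbf{V}\,\sigma(\mathbf{W}\mathbf{z} + \mathbf{b}) + \mathbf{a}$ with $\sigma = \tanh$. The helper names (`matvec`, `mlp`) and the toy parameter values are assumptions made for this example only, and plain Python lists stand in for the tensors a deep learning library would provide.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def mlp(z, W, b, V, a, sigma=math.tanh):
    """One-hidden-layer MLP: V * sigma(W z + b) + a."""
    hidden = [sigma(u + b_i) for u, b_i in zip(matvec(W, z), b)]
    return [o + a_i for o, a_i in zip(matvec(V, hidden), a)]

# Toy parameters (illustrative values only): 2 inputs, 3 hidden units, 2 outputs.
W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]
b = [0.0, 0.1, -0.1]
V = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
a = [0.0, 0.2]

h = mlp([1.0, 2.0], W, b, V, a)  # feature vector with 2 entries
```

Swapping `sigma` for another element-wise function (for example a ReLU) changes the nonlinearity without touching the rest of the computation.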
We will also make use of the recurrent neural network (RNN), which maps a sequence of inputs $\mathbf{x}_{1:T}$ into a sequence of features $\mathbf{h}_{1:T}$, as follows:
$$\mathbf{h}_t = \mathrm{RNN}(\mathbf{h}_{t-1}, \mathbf{x}_t; \pi) = \sigma(\mathbf{U}\mathbf{x}_t + \mathbf{V}\mathbf{h}_{t-1} + \mathbf{b}),$$
where $\pi = \{\mathbf{V}, \mathbf{U}, \mathbf{b}\}$, and where for concreteness we have parameterized the above RNN as an Elman RNN (Elman, 1990). While modern architectures typically use the slightly more complicated long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) or gated recurrent unit (GRU) (Cho et al., 2014) RNNs instead, we will remain agnostic with regard to the specific RNN parameterization, and simply use $\mathrm{RNN}(\cdot, \cdot; \pi)$ to encompass the LSTM/GRU parameterizations as well.
RNN Language Models
We will shortly introduce several archetypal latent variable models of a sentence $x = x_1, \ldots, x_T$ consisting of $T$ words. Before we do so, however, we briefly introduce the RNN language model, an incredibly effective model of a sentence that uses no latent variables. Indeed, because RNN language models are such good sentence models, we will need a very good reason to make use of latent variables, which as we will see complicate learning and inference. Sections 3 and 4 attempt to motivate the need for introducing latent variables into our sentence models, and the remaining sections discuss how to learn and do inference with them.
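Using the notation $x_{<t}$ as shorthand for the sequence $x_1, \ldots, x_{t-1}$, RNN language models define a distribution over a sentence $x$ as follows
$$p(x; \theta) = \prod_{t=1}^T p(x_t \mid x_{<t}; \theta) = \prod_{t=1}^T \mathrm{softmax}(\mathbf{W}\mathbf{h}_t)_{x_t}, \quad \text{where } \mathbf{h}_t = \mathrm{RNN}(\mathbf{h}_{t-1}, \mathbf{x}_{t-1}; \theta),$$
and $\mathbf{x}_t$ is the word embedding corresponding to the $t$-th word in the sequence. The function $\mathrm{softmax}$ over a vector $\mathbf{s}$ applies an element-wise exponentiation to $\mathbf{s}$ and renormalizes to obtain a distribution, i.e. $\mathbf{y} = \mathrm{softmax}(\mathbf{s})$ means that each element of $\mathbf{y}$ is given by
$$y_k = \frac{\exp(s_k)}{\sum_{v} \exp(s_v)}.$$
Because RNN language models are a central tool in deep NLP, for the remainder of the tutorial we will use the notation
$$x \sim \mathrm{RNNLM}(x; \theta)$$
to mean that $x$ is distributed according to an RNN language model (as above) with parameters $\theta$, where $\theta$ includes the parameters of the RNN itself as well as the matrix $\mathbf{W}$ used in defining the per-word distributions at each step $t$.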
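To make the factorization concrete, the sketch below scores a sentence under a tiny Elman-RNN language model: each step updates the hidden state from the previous word's embedding, applies a softmax over the vocabulary, and accumulates the log-probability of the observed word. All names (`rnnlm_log_prob`, `E`, `U`, `Vh`, `W`, `bos`) and parameter values are assumptions for illustration; in particular, treating vocabulary index 0 as a start-of-sentence symbol is a convention chosen here, not one taken from the text.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(s):
    """Exponentiate element-wise and renormalize (shifted by max for stability)."""
    m = max(s)
    exps = [math.exp(s_k - m) for s_k in s]
    total = sum(exps)
    return [e / total for e in exps]

def rnnlm_log_prob(tokens, E, U, Vh, b, W, bos=0):
    """log p(x_1..x_T) = sum_t log softmax(W h_t)[x_t], using an Elman RNN."""
    h = [0.0] * len(b)   # initial hidden state h_0
    prev = bos           # assumed start-of-sentence token index
    log_p = 0.0
    for x_t in tokens:
        pre = [u + v + b_i
               for u, v, b_i in zip(matvec(U, E[prev]), matvec(Vh, h), b)]
        h = [math.tanh(p) for p in pre]   # h_t from previous embedding and state
        probs = softmax(matvec(W, h))     # distribution over the vocabulary
        log_p += math.log(probs[x_t])
        prev = x_t
    return log_p

# Toy parameters: vocabulary of 3 words, 2-dim embeddings, 2 hidden units.
E = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]      # word embeddings
U = [[0.5, -0.5], [0.3, 0.8]]                 # input-to-hidden weights
Vh = [[0.1, 0.0], [0.0, 0.1]]                 # hidden-to-hidden weights
b = [0.0, 0.0]
W = [[0.2, 0.1], [1.0, -1.0], [-1.0, 1.0]]    # hidden-to-vocabulary weights

lp = rnnlm_log_prob([1, 2, 1], E, U, Vh, b, W)
```

Because every step normalizes over the vocabulary, the probabilities of all sentences of a fixed length sum to one, which is a convenient sanity check for an implementation like this.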
3 Archetypal Latent Variable Models of Sentences
We now present several archetypal latent variable models of sentences $x = x_1, \ldots, x_T$, which will serve to ground the discussion. All of the models we introduce will provide us with a joint distribution $p(x, z; \theta)$ over the observed words $x$ and unobserved latent variables $z$. These models will differ in terms of whether the latent variables are discrete, continuous, or structured, and we will provide both a shallow and deep variant of each model. We will largely defer the motivating applications for these models to the next section, but we note that on an intuitive level, the latent variables in these three types of models have slightly different interpretations. In particular, discrete latent variables can be interpreted as inducing a clustering over data points, continuous latent variables can be interpreted as providing a dimensionality reduction of data points, and structured latent variables can be interpreted as unannotated structured objects (i.e., objects with interdependent pieces or parts) in the data; these interpretations will be expanded upon below. (Footnote 2: Because discrete structured latent variables consist of interdependent discrete latent variables, we can also think of them as inducing interdependent clusterings of the parts of each data point.)
In addition to describing models in each of the three categories listed above, we will briefly sketch out what inference – that is, calculating the posterior over latent variables – in these models looks like, as a way of motivating the later sections of the tutorial.
3.1 A Discrete Latent Variable Model
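We begin with a simple discrete latent variable model, namely, a latent-variable Naive Bayes model (i.e., a mixture of categoricals model). This model assumes a sentence $x = x_1, \ldots, x_T$ is generated according to the following process:
Draw latent variable $z \in \{1, \ldots, K\}$ from a Categorical prior $p(z; \mu)$ with parameter $\mu \in \Delta^{K-1}$. That is, $p(z = k; \mu) = \mu_k$.
Given $z$, draw each token $x_t \in \{1, \ldots, V\}$ in $x$ independently from a Categorical distribution with parameter $\pi_z \in \Delta^{V-1}$. That is, $p(x_t = v \mid z = k; \pi) = \pi_{k,v}$, where $\pi_{k,v}$ is the probability of drawing word index $v$ given the latent variable $z = k$. Thus, the probability of the sequence $x$ given $z = k$ is
$$p(x \mid z = k; \pi) = \prod_{t=1}^T \pi_{k, x_t}.$$
Letting $\theta = \{\mu, \pi\}$ be all the parameters of our model, the full joint distribution is
$$p(x, z = k; \theta) = \mu_k \prod_{t=1}^T \pi_{k, x_t}. \tag{3}$$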
This model assumes that each token in $x$ is generated independently, conditioned on $z$. This assumption is clearly naive (hence the name) but greatly reduces the number of parameters that need to be estimated. (Footnote 3: In our formulation we model text as bag-of-words and thus ignore position information. It is also possible to model position-specific probabilities within Naive Bayes with additional parameters $\pi_{t,k,v}$ for $t = 1, \ldots, T$. This would result in $KVT$ parameters.) The total number of parameters in this generative model is $K + KV$, where we have $K$ parameters for $\mu$ and $V$ parameters in $\pi_z$ for each of the $K$ values of $z$. (Footnote 4: The model is overparameterized since we only need $V - 1$ parameters for a Categorical distribution over a set of size $V$. This is rarely an issue in practice.)
Despite the Naive Bayes assumption, the above model becomes interesting when we have $N$ sentences $\{x^{(n)}\}_{n=1}^N$ that we assume to be generated according to the above process. Indeed, since each sentence $x^{(n)}$ comes with a corresponding latent variable $z^{(n)}$ governing its generation, we can see the $z^{(n)}$ values as inducing a clustering over the sentences $\{x^{(n)}\}_{n=1}^N$; sentences generated by the same value of $z^{(n)}$ belong to the same cluster. We show a graphical model depicting this scenario in Figure 2.
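The generative process of the latent-variable Naive Bayes model can be sketched in a few lines: sample a cluster index from the prior, then sample each token independently from that cluster's word distribution. The function names, the 0-based indexing of clusters and words, and the toy values of `mu` and `pi` are assumptions made for this example.

```python
import math
import random

def sample_sentence(mu, pi, T, rng):
    """Ancestral sampling: z ~ Cat(mu), then x_t ~ Cat(pi[z]) independently."""
    z = rng.choices(range(len(mu)), weights=mu)[0]
    x = rng.choices(range(len(pi[z])), weights=pi[z], k=T)
    return z, x

def log_joint(x, z, mu, pi):
    """log p(x, z) = log mu[z] + sum_t log pi[z][x_t]."""
    return math.log(mu[z]) + sum(math.log(pi[z][x_t]) for x_t in x)

mu = [0.6, 0.4]            # prior over K = 2 clusters
pi = [[0.7, 0.2, 0.1],     # word distribution for z = 0 (V = 3 words)
      [0.1, 0.2, 0.7]]     # word distribution for z = 1

rng = random.Random(0)     # seeded generator for reproducibility
z, x = sample_sentence(mu, pi, T=5, rng=rng)
score = log_joint(x, z, mu, pi)
```

Sampling many sentences this way and grouping them by their sampled `z` gives exactly the clustering view described above: sentences sharing a value of `z` share a word distribution.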
Making the Model “Deep”
One of the reasons we are interested in deep latent variable models is that neural networks make it simple to define flexible distributions without using too many parameters. As an example, we can formulate a sentence model similar to the Naive Bayes model, but which avoids the Naive Bayes assumption above (whereby each token is generated independently given $z$) using an RNN. An RNN will allow the probability of $x_t$ to depend on the entire history $x_{<t}$ of tokens preceding $x_t$. In this deep variant, we might then define the probability of $x$ given latent variable $z$ as
$$p(x \mid z = k; \theta) = \mathrm{RNNLM}(x; \pi_k). \tag{4}$$
That is, the probability of $x$ given $z = k$ is given by an $\mathrm{RNNLM}$ with parameters $\pi_k$ that are specific to the $z$ that is drawn. We then obtain the joint distribution by substituting (4) into the term for $p(x \mid z)$ in (3). We show the corresponding graphical model in Figure 3.
Note that this deep model allows for avoiding the Naive Bayes assumption while still using only parameters, assuming . The Naive Bayes model, on the other hand, requires parameters, and parameters if we use position-specific distributions. Thus, as long as is not too large, we can save parameters under the deep model as and get large.
For discrete latent variable models, inference – that is, calculating the posterior $p(z \mid x; \theta)$ – can typically be performed by enumeration. In particular, using Bayes’s rule we have
$$p(z = k \mid x; \theta) = \frac{p(z = k; \mu)\, p(x \mid z = k; \pi)}{\sum_{j=1}^K p(z = j; \mu)\, p(x \mid z = j; \pi)},$$
where the latent variable is assumed to take on one of $K$ values. Calculating the denominator is clearly the most computationally expensive part of inference, but is generally considered to be tractable as long as $K$ is not too big. Note, however, that the model’s parameterization can affect how quickly we can calculate this denominator: under the Naive Bayes model we can accumulate $x$’s word counts once, and evaluate their probability under the $K$ categorical distributions, whereas for the RNN-based model we need to run $K$ different RNNs over $x$.
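Posterior inference by enumeration for the Naive Bayes model can be sketched as follows: accumulate the sentence's word counts once, score them under each of the K categorical distributions, and normalize with a log-sum-exp. The function name `posterior` and the toy parameters are assumptions for illustration, with clusters and words indexed from 0.

```python
import math
from collections import Counter

def posterior(x, mu, pi):
    """Return [p(z = k | x) for k in range(K)] by enumerating all K values."""
    counts = Counter(x)   # word counts, accumulated once per sentence
    log_joint = [
        math.log(mu[k]) + sum(c * math.log(pi[k][v]) for v, c in counts.items())
        for k in range(len(mu))
    ]
    # Denominator of Bayes's rule, computed stably in log space.
    m = max(log_joint)
    log_marginal = m + math.log(sum(math.exp(l - m) for l in log_joint))
    return [math.exp(l - log_marginal) for l in log_joint]

mu = [0.6, 0.4]            # prior over K = 2 clusters
pi = [[0.7, 0.2, 0.1],     # word distribution for z = 0
      [0.1, 0.2, 0.7]]     # word distribution for z = 1

post = posterior([0, 0, 2], mu, pi)
```

For this toy sentence the joint probabilities are 0.6 · 0.7 · 0.7 · 0.1 = 0.0294 and 0.4 · 0.1 · 0.1 · 0.7 = 0.0028, so the posterior places roughly 0.91 of its mass on the first cluster.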
3.2 A Continuous Latent Variable Model
We now consider models in which the latent variables are vectors in $\mathbb{R}^d$, rather than integer-valued. We begin with a continuous analog of the Naive Bayes model in the previous subsection. In particular, we will assume a sequence $x$ is generated according to the following process:
Draw latent variable $\mathbf{z}$ from $\mathcal{N}(\mu, \mathbf{I})$, that is, from a Normal distribution with mean $\mu$ and identity covariance matrix $\mathbf{I}$.
Given $\mathbf{z}$, draw each token $x_t$ in $x$ independently from a Categorical distribution with $\mathrm{softmax}(\mathbf{W}\mathbf{z})$ as its parameters.
Note that the above model closely resembles the shallow model of Section 3.1, except that instead of using the latent variable to index into a set of Categorical distribution parameters, we let the parameters of a Categorical distribution be a function of the latent $\mathbf{z}$. We can thus view the latent variable $\mathbf{z}$ as a lower dimensional representation of our original sentence $x$. Letting $\theta = \{\mu, \mathbf{W}\}$, we have the joint density
$$p(x, \mathbf{z}; \theta) = \mathcal{N}(\mathbf{z}; \mu, \mathbf{I}) \prod_{t=1}^T \mathrm{softmax}(\mathbf{W}\mathbf{z})_{x_t}.$$
We show a corresponding graphical model in Figure 4. Note that the dependence structure in Figure 4 is identical to that depicted in Figure 2; we have, however, changed the type of latent variable (from discrete to continuous) and the parameterizations of the corresponding distributions.
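The shallow continuous model can be sketched in the same style as its discrete counterpart. The code below assumes that the Categorical parameters are obtained as a softmax of a linear map of the latent vector, written `softmax(W z)`; this specific functional form, along with the function names and toy values, is an assumption made for the example.

```python
import math
import random

def softmax(s):
    """Exponentiate element-wise and renormalize."""
    m = max(s)
    exps = [math.exp(s_k - m) for s_k in s]
    total = sum(exps)
    return [e / total for e in exps]

def word_probs(z, W):
    """Categorical parameters as a function of the latent vector z."""
    return softmax([sum(w * z_j for w, z_j in zip(row, z)) for row in W])

def sample_sentence(mu, W, T, rng):
    """z ~ N(mu, I), then each x_t ~ Cat(softmax(W z)) independently."""
    z = [rng.gauss(m, 1.0) for m in mu]   # identity covariance: unit variances
    probs = word_probs(z, W)
    x = rng.choices(range(len(probs)), weights=probs, k=T)
    return z, x

def log_joint(x, z, mu, W):
    """log N(z; mu, I) + sum_t log softmax(W z)[x_t]."""
    d = len(mu)
    sq_dist = sum((z_j - m) ** 2 for z_j, m in zip(z, mu))
    log_prior = -0.5 * d * math.log(2 * math.pi) - 0.5 * sq_dist
    probs = word_probs(z, W)
    return log_prior + sum(math.log(probs[x_t]) for x_t in x)

mu = [0.0, 0.0]                                # d = 2
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]     # V = 3 words

rng = random.Random(0)
z, x = sample_sentence(mu, W, T=4, rng=rng)
```

Evaluating the joint density at a given pair is cheap; the difficulty discussed later arises only when the latent vector has to be integrated out.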
Making the Model “Deep”
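As in Section 3.1, we may replace the Naive Bayes distribution over tokens with one parameterized by an RNN. We thus have the generative process:
Draw latent variable $\mathbf{z}$ from $\mathcal{N}(\mu, \mathbf{I})$.
Given $\mathbf{z}$, draw each token $x_t$ in $x$ from a conditional RNN, $\mathrm{CRNNLM}(x; \theta, \mathbf{z})$.
We use the notation $\mathrm{CRNNLM}(x; \theta, \mathbf{z})$ to refer to the distribution over sentences induced by conditioning an ordinary $\mathrm{RNNLM}$ on some vector $\mathbf{z}$, by concatenating $\mathbf{z}$ onto the RNN’s input at each time-step. In particular, we define
$$\mathrm{CRNNLM}(x; \theta, \mathbf{z}) = \prod_{t=1}^T \mathrm{softmax}(\mathbf{W}\mathbf{h}_t)_{x_t}, \quad \text{where } \mathbf{h}_t = \mathrm{RNN}(\mathbf{h}_{t-1}, [\mathbf{x}_{t-1}; \mathbf{z}]; \theta).$$
Again using Bayes’s rule, the posterior under continuous latent variable models is given by
$$p(\mathbf{z} \mid x; \theta) = \frac{p(x \mid \mathbf{z}; \theta)\, p(\mathbf{z}; \theta)}{\int p(x \mid \mathbf{z}'; \theta)\, p(\mathbf{z}'; \theta)\, d\mathbf{z}'}.$$
Unlike with discrete latent variables, however, calculating the denominator above will in general be intractable. Indeed, using deep models will generally prevent us from exactly calculating the denominator above even in relatively simple cases. This concern motivates many of the methods we discuss in later sections.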
3.3 A Structured, Discrete Latent Variable Model
Finally, we will consider a model with multiple interrelated discrete latent variables per data-point. In particular, we will consider the Hidden Markov Model (HMM) (Rabiner, 1989), which models a sentence $x = x_1, \ldots, x_T$ by assuming there is a discrete latent variable responsible for generating each word in the sentence, and that each of these latent variables depends only on the latent variable responsible for generating the previous word. Concretely, an HMM assumes that a sentence $x$ is generated according to the following process:
First, begin with variable $z_0$, which is always equal to the special “start” state (i.e. $z_0 = 0$). Then, for $t = 1, \ldots, T$:
Draw latent variable $z_t \in \{1, \ldots, K\}$ from a Categorical distribution with parameter $\mu_{z_{t-1}}$.
Draw observed token $x_t \in \{1, \ldots, V\}$ from a Categorical distribution with parameter $\pi_{z_t}$.
The above generative process gives rise to the following joint distribution over the sentence $x = x_1, \ldots, x_T$ and latent variables $z = z_1, \ldots, z_T$:
$$p(x, z; \theta) = \prod_{t=1}^T p(z_t \mid z_{<t}, x_{<t})\, p(x_t \mid z_{\le t}, x_{<t}) = \prod_{t=1}^T p(z_t \mid z_{t-1}; \mu)\, p(x_t \mid z_t; \pi) = \prod_{t=1}^T \mu_{z_{t-1}, z_t}\, \pi_{z_t, x_t}.$$
We show the corresponding graphical model in Figure 6.
It is important to note that the second equality above makes some significant independence assumptions: it assumes that the probability of $z_t$ depends only on $z_{t-1}$ (and not on $z_{<t-1}$ or $x_{<t}$), and it assumes that the probability of $x_t$ depends only on $z_t$ (and not on $z_{<t}$ or $x_{<t}$). These “Markov” (i.e., independence) assumptions are what give the Hidden Markov Model its name. We also note that we have referred to an HMM as a “structured” latent variable model because the latent sequence $z$ is structured in the sense that it contains multiple components that are interdependent, as governed by $\mu$.
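Under these Markov assumptions the HMM joint probability is a product of one transition and one emission term per position, which the sketch below evaluates in log space. The names (`hmm_log_joint`, `mu0` for the transition distribution out of the start state, `mu` for state-to-state transitions, `pi` for emissions), the 0-based state and word indices, and the toy values are assumptions for this example.

```python
import math

def hmm_log_joint(x, z, mu0, mu, pi):
    """log p(x, z) = sum_t [log p(z_t | z_{t-1}) + log p(x_t | z_t)]."""
    log_p = 0.0
    prev = None   # None stands for the special start state
    for x_t, z_t in zip(x, z):
        trans = mu0 if prev is None else mu[prev]
        log_p += math.log(trans[z_t]) + math.log(pi[z_t][x_t])
        prev = z_t
    return log_p

mu0 = [0.5, 0.5]                  # transitions out of the start state
mu = [[0.7, 0.3], [0.3, 0.7]]     # mu[j][k] = p(z_t = k | z_{t-1} = j)
pi = [[0.9, 0.1], [0.1, 0.9]]     # pi[k][v] = p(x_t = v | z_t = k)

lp = hmm_log_joint([0, 1], [0, 1], mu0, mu, pi)
# p = 0.5 * 0.9 * 0.3 * 0.9 = 0.1215
```

Only the previous state and the current state enter each factor, which is what later makes dynamic programming over the latent sequence possible.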
Making the Model “Deep”
We can create a “deep” HMM (cf. Tran et al. (2016b); Johnson et al. (2016)) by viewing the $\mu$ and $\pi$ categorical parameters as being parameterizable in their own right, and parameterizing them with neural network components. For example, we might parameterize an HMM’s categorical distributions as follows:
where and now refer to the parameters of the corresponding MLPs. Note that the graphical model in Figure 6 remains correct for the deep HMM; indeed, graphical models only show the dependency structure of the model (which has not been changed), and not the particular parameterization chosen. We also note that a deep parameterization may allow us to use fewer parameters as and get large. In particular, a standard HMM requires parameters, whereas if the s above have hidden units we require only parameters.
For structured, discrete latent variable models we have
$$p(z \mid x; \theta) = \frac{p(x, z; \theta)}{\sum_{z'} p(x, z'; \theta)},$$
where we index all possible latent structures (e.g., all sequences of discrete latent variables of length $T$) with $z'$. It is possible to compute this sum over structures with a dynamic program (e.g., the forward or backward algorithms (Rabiner, 1989)), and for certain models, like HMMs, it will be tractable to do so. For other, more complicated structured latent variable models the dynamic program will be intractable to compute, necessitating the use of approximate inference methods. For instance, for the Factorial HMM (FHMM) (Ghahramani and Jordan, 1996) depicted in Figure 7, which generates each word by conditioning on the current state of several independent first-order Markov chains, calculation of the denominator above will be exponential in depth and therefore intractable even with a dynamic program.
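For an HMM, the sum over all latent sequences in the denominator can be computed with the forward algorithm in time linear in the sentence length, rather than by enumerating every sequence. The sketch below implements both and can be used to check that they agree; the parameter layout (`mu0`, `mu`, `pi`) and toy values are assumptions for this example, and probabilities are kept in the linear domain for readability, which would underflow on long sentences.

```python
from itertools import product

def forward_marginal(x, mu0, mu, pi):
    """p(x) via the forward algorithm: O(T * K^2) operations."""
    K = len(mu0)
    # alpha[k] = p(x_1..x_t, z_t = k)
    alpha = [mu0[k] * pi[k][x[0]] for k in range(K)]
    for x_t in x[1:]:
        alpha = [sum(alpha[j] * mu[j][k] for j in range(K)) * pi[k][x_t]
                 for k in range(K)]
    return sum(alpha)

def brute_force_marginal(x, mu0, mu, pi):
    """p(x) by summing the joint over all K^T latent sequences."""
    K = len(mu0)
    total = 0.0
    for z in product(range(K), repeat=len(x)):
        p = mu0[z[0]] * pi[z[0]][x[0]]
        for t in range(1, len(x)):
            p *= mu[z[t - 1]][z[t]] * pi[z[t]][x[t]]
        total += p
    return total

mu0 = [0.5, 0.5]                  # transitions out of the start state
mu = [[0.7, 0.3], [0.3, 0.7]]     # state-to-state transitions
pi = [[0.9, 0.1], [0.1, 0.9]]     # emissions

x = [0, 1, 1, 0]
p_forward = forward_marginal(x, mu0, mu, pi)
p_brute = brute_force_marginal(x, mu0, mu, pi)
```

The brute-force version touches K^T sequences and is only feasible for toy inputs; it is included purely as a correctness check for the dynamic program.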
4 Motivation and Examples
We now motivate the archetypal models we have introduced in the previous section by providing some examples of where similar models have been used in the literature. In general, we tend to be interested in latent variable models for any of the following interrelated reasons:
(a) we have no or only partial supervision;
(b) we want to make our model more interpretable through $z$;
(c) we wish to model a multimodal distribution over $x$;
(d) we would like to use latent variables to control our predictions and generations;
(e) we are interested in learning some underlying structure or representation;
(f) we want to better model the observed data;
and we will see that many of these apply to the models we will be discussing.
4.1 Examples of Discrete Latent Variable Models
As noted in Section 3, it is common to interpret discrete latent variable models as inducing a clustering over data. We now discuss several prominent NLP examples where discrete latent variable models can be interpreted in this way.
The paradigmatic example application of categorical latent variable models to text is document clustering (i.e., unsupervised document classification). In this setup we are given a set of $N$ documents $x^{(1)}, \ldots, x^{(N)}$, which we would like to partition into $K$ clusters such that intra-cluster documents are maximally similar. This clustering can be useful for retrieving or recommending unlabeled documents similar to some query document. Seen from the generative modeling perspective described in Section 3.1, we view each document $x^{(n)}$ as being generated by its corresponding latent cluster index, $z^{(n)}$, which gives us a model of $p(x, z; \theta)$. Since we are ultimately interested in obtaining the label (i.e., cluster index) of a document, however, we would then form the posterior over labels $p(z \mid x; \theta)$ in order to determine the likely label for $x$. (Footnote 5: Note that if we are just interested in finding the most likely label for $x$ we can simply evaluate $\arg\max_z p(x, z; \theta)$, since $\arg\max_z p(z \mid x; \theta) = \arg\max_z p(x, z; \theta) / p(x; \theta) = \arg\max_z p(x, z; \theta)$.)
Work on document clustering goes back decades (see Willett (1988) and Aggarwal and Zhai (2012) for surveys), and many authors take the latent variable perspective described above. In terms of incorporating deep models into document clustering, it is possible to make use of neural components (such as RNNs) in modeling the generation of words in each document, as described in Section 3.1. It has been more common recently, however, to attempt to cluster real-valued vector representations (i.e., embeddings) of documents. For instance, it is common to pre-compute document or paragraph embeddings with an unsupervised objective and then cluster these embeddings with K-Means (MacQueen et al., 1967); see Le and Mikolov (2014) and Xu et al. (2015a) for different approaches to obtaining these embeddings before clustering. Xie et al. (2016) continue to update document embeddings as they cluster, using an auxiliary loss derived from confidently clustered examples. Whereas none of these document-embedding based approaches are presented explicitly within the generative modeling framework above, they can all be viewed as identifying parameterized distributions over real vectors (e.g., Gaussians with particular means and covariances) with each cluster index, which in turn generate the embeddings of each document (rather than the words of each document). This sort of generative model is then (at least in principle) amenable to forming posteriors over cluster assignments, as above.
Mixtures of Experts
Consider a supervised image captioning scenario, where we wish to produce a textual caption $x$ in response to an image $c$, and we are given a training set of $N$ aligned image-caption pairs $\{(c^{(n)}, x^{(n)})\}_{n=1}^N$. When training a model to generate captions similar to those in the training set, it may be useful to posit that there are in fact several “expert” captioning models represented in the training data, each of which is capable of generating a slightly different caption $x$ for the same image $c$. For instance, experts might differ in terms of the kind of language used in the caption (e.g., active vs. passive) or even in terms of what part of the image they focus on (e.g., foreground vs. whatever the human in the image is doing). Of course, we generally don’t know beforehand which expert is responsible for generating which caption in the training data, but we can aim to capture the variance in captioning induced by these posited experts by identifying each value of a categorical latent variable $z$ with a different posited expert, and assuming that a caption $x$ is generated by first choosing one of these experts, which in turn generates $x$ conditioned on $c$. This type of model is known as a Mixture of Experts (MoE) model (Jacobs et al., 1991), and it gives rise to a joint distribution over $x$ and $z$, only this time we will also condition on $c$: $p(x, z \mid c; \theta)$. As in the previous example, the discrete latents induce a clustering, where examples are clustered by which expert generated them.
MoE models are widely used, and they are particularly suited to ensembling scenarios, where we wish to ensemble several different prediction models, which may have different areas of expertise. Garmash and Monz (2016) use an MoE to ensemble several expert neural machine translation models, and Lee et al. (2016) use a related approach, called “diverse ensembling,” in training a neural image captioning system, showing that their model is able to generate more diverse captions as a result; He et al. (2018b) find that MoEs also lead to more diverse responses in machine translation, and Gehrmann et al. (2018) use a similar technique for generating descriptions of restaurant databases. There has also been work in text generation that uses an MoE model per token, rather than per sentence (Yin et al., 2016; Le et al., 2016; Yang et al., 2018b). (Footnote 6: Although these models have multiple discrete latents per data point, which suggests that we should perhaps consider them to be structured latent variable models, we will consider them unstructured since the interdependence between token-level latents is not made explicit in the probability model; correlations between these latents, however, are undoubtedly modeled to some extent by the associated RNNs.) See also Eigen et al. (2013) and Shazeer et al. (2017), who use MoE layers in constructing neural network architectures, and in the latter case for the purpose of language modeling.
Note that in the case of MoE models, we are interested in using a latent variable model not just because the training examples are not labeled with which experts generated them (reason (a) above), but also for reasons (c) and (d). That is, we attempt to capture different modes in the caption distribution (perhaps corresponding to different styles of caption, for example) by identifying a latent variable with each of these experts or modes. Similarly, we might hope to control the style of the caption by restricting our captioning system to one expert. When it comes to document clustering, on the other hand, we are interested in latent variables primarily because we don’t have labeled clusters for our documents (reason (a)). Indeed, since we are interested primarily in the posterior over $z$, rather than the joint distribution, reasons (c) and (d) are not applicable.
4.2 Examples of Continuous Latent Variable Models
Continuous latent variable models can often be interpreted as performing a dimensionality reduction of data, by associating a low-dimensional vector in $\mathbb{R}^d$ with a data point. One of the advantages of this sort of dimensionality reduction is that it becomes easier to have a finer-grained notion of similarity between data points, by calculating their distance, for example, in the dimensionally reduced space. We now discuss two prominent examples.
Topic Models (Blei et al., 2003) are an enormously influential family of latent variable models, which posit a set of latent “topics,” realized as distributions over words, which in turn generate each document in a collection. Each document in the collection moreover has a latent distribution over these topics, governing how often it chooses to discuss a particular topic. Note that the per-document distribution over topics in a topic model plays a similar role to the per-sentence vector $\mathbf{z}$ in the shallow model of Section 3.2. Thus, in both cases we might, for instance, determine how similar two data points are by measuring the distance between their corresponding $\mathbf{z}$’s.
Whereas, with the exception of Topic Models, NLP has historically preferred discrete latent variables, the tendency of deep models to deal with continuous embeddings of objects has spurred interest in continuous latent variable models for NLP. For example, it has recently become quite popular to view a sentence $x$ as being generated by a latent vector in $\mathbb{R}^d$, rather than by a latent label in $\{1, \ldots, K\}$ as in the previous subsection. Thus, Bowman et al. (2016) develop a latent vector model of sentence generation, where sentences are generated with a $\mathrm{CRNNLM}$, in a very similar way to that presented in Section 3.2. This model and its extensions (Yang et al., 2017; Hu et al., 2017; Kim et al., 2018) represent the dominant approach to neural latent variable modeling of sentence generation at this time.
As suggested above, viewing sentences as being generated by real vectors gives us a fine-grained notion of similarity between sentences, namely, the distance between the corresponding latent representations. The desire to compute similarities between sentences has also motivated extensive work on obtaining sentence embeddings (Le and Mikolov, 2014; Kiros et al., 2015; Joulin et al., 2016; Conneau et al., 2017; Peters et al., 2018; Pagliardini et al., 2018; Rücklé et al., 2018), which can often be interpreted as continuous latent variables, though they need not be.
4.3 Examples of Structured, Discrete Latent Variable Models
A major motivation for using structured, discrete latent variables in NLP is the desire to infer structured discrete objects, such as parse trees or sequences or segmentations, which we believe to be represented in the data, even when they are unannotated.
Unsupervised Tagging and Parsing
The simplest example application of structured, discrete latent variable modeling in NLP involves inducing the part-of-speech (POS) tags for a sentence. More formally, we are given a sentence $x = x_1, \ldots, x_T$, and we wish to arrive at a sequence $z = z_1, \ldots, z_T$ of POS tags, one for each word in the sentence. HMMs, as described in Section 3.3, and their variants have historically been the dominant approach to arriving at a joint distribution over words and tags (Brown et al., 1992; Merialdo, 1994; Smith and Eisner, 2005; Haghighi and Klein, 2006; Johnson, 2007; Toutanova and Johnson, 2008; Berg-Kirkpatrick et al., 2010; Christodoulopoulos et al., 2010; Blunsom and Cohn, 2011; Stratos et al., 2016). We may then predict the POS tags for a new sentence by calculating
$$\arg\max_{z} p(z \mid x; \theta).$$
There is now a growing body of work involving deep parameterizations of structured discrete latent variable models for unsupervised parsing and tagging. For instance, Tran et al. (2016b) obtain good results on unsupervised POS tagging by parameterizing the transition and emission probabilities of an HMM with neural components, as described in Section 3.3. In addition, just as recent approaches to neural document clustering have defined models that generate document embeddings rather than the documents themselves, there has been recent work in neural, unsupervised POS tagging based on defining neural HMM-style models that emit word embeddings, rather than words themselves (Lin et al., 2015; He et al., 2018a).
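For an HMM tagger, the prediction step of finding the most likely tag sequence for a sentence is commonly carried out with the Viterbi algorithm, a dynamic program that is standard for HMMs but not described in the text above. The following is a minimal sketch under assumed names and toy parameters (`mu0` for transitions out of the start state, `mu` for tag-to-tag transitions, `pi` for emissions, all 0-indexed).

```python
import math

def viterbi(x, mu0, mu, pi):
    """Return the tag sequence z maximizing p(x, z), and hence p(z | x)."""
    K = len(mu0)
    # score[k] = best log-probability of any tag prefix ending in tag k
    score = [math.log(mu0[k]) + math.log(pi[k][x[0]]) for k in range(K)]
    back = []   # back[t][k] = best previous tag when position t+1 has tag k
    for x_t in x[1:]:
        new_score, pointers = [], []
        for k in range(K):
            j_best = max(range(K), key=lambda j: score[j] + math.log(mu[j][k]))
            pointers.append(j_best)
            new_score.append(score[j_best] + math.log(mu[j_best][k])
                             + math.log(pi[k][x_t]))
        score = new_score
        back.append(pointers)
    # Follow back-pointers from the best final tag.
    path = [max(range(K), key=lambda k: score[k])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

mu0 = [0.5, 0.5]                  # transitions out of the start state
mu = [[0.7, 0.3], [0.3, 0.7]]     # tag-to-tag transitions
pi = [[0.9, 0.1], [0.1, 0.9]]     # emissions: tag 0 prefers word 0, tag 1 word 1

tags = viterbi([0, 0, 1, 1], mu0, mu, pi)   # -> [0, 0, 1, 1]
```

Maximizing the joint over tag sequences is enough here because the marginal probability of the sentence does not depend on the tags.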
Unsupervised dependency parsing represents an additional, fairly simple example application of neural models with structured latent variables. In unsupervised dependency parsing we attempt to induce a sentence’s dependency tree without any labeled training data. Much recent unsupervised dependency parsing is based on the DMV model of Klein and Manning (2004) and its variants (Headden III et al., 2009; Spitkovsky et al., 2010, 2011), where there are multiple discrete latent variables per word, rather than one, as in POS tagging. In particular, the DMV model can be viewed as providing a joint distribution over the words in a sentence and discrete latent variables representing each left and right dependent of each word, as well as a final empty left and right dependent for each word. As in the case of unsupervised tagging, we are primarily interested in predicting the most likely dependency tree $z$ given a sentence $x$:
$$\arg\max_{z} p(z \mid x; \theta).$$
Again, neural approaches divide between those that parameterize DMV-like models, which jointly generate the dependency tree and its words, with neural components (Jiang et al., 2016; Cai et al., 2017; Han et al., 2017), and those which define DMV-models that generate embeddings (He et al., 2018a).
Structured latent variables are also widely used in text generation models and applications. For instance,Miao and Blunsom (2016) learn a summarization model by positing the following generative process for a document: we first sample a condensed, summary version of a document from an . Then, conditioned on , we generate the full version of the document: by sampling from a . This model gives us a joint distribution over documents and their corresponding summaries . Moreover, a summarization system is then merely a system that infers the posterior over summaries given a document .
Whereas in the model of Miao and Blunsom (2016) a document is generated by conditioning solely on latent variables $z$, it is also common to posit generative models in which text is generated by conditioning both on some latent variables $z$ as well as on some additional contextual information $c$. In particular, $c$ might represent an image for an image captioning model, a sentence in a different language for a machine translation model, or a database for a data-to-document generation model. In this setting, the latent variables $z$ would then represent some additional unobserved structure that, together with $c$, accounts for the observed text $x$. Some notable recent examples in this vein include viewing text as being generated by both $c$ as well as a shorter sequence of latent variables (Kaiser et al., 2018; Roy et al., 2018), a sequence of fertility latent variables (Gu et al., 2018a), a sequence of iteratively refined sequences of latent variables (Lee et al., 2018), or by a latent template or plan (Wiseman et al., 2018).
5 Learning and Inference
After defining a latent-variable model, we are typically interested in two related tasks: (1) learning the parameters $\theta$ of the model, and (2) once the model is trained, performing inference over it. That is, we would like to be able to compute the posterior distribution $p(z \mid x; \theta)$ (or approximations thereof) over the latent variables, given some data $x$. As we will see, these two tasks are intimately connected because learning often uses inference as a subroutine. On an intuitive level, this is because if we knew, for instance, the value of $z$ given $x$, learning would be simple: we would simply maximize $\log p(x, z; \theta)$. Thus, as we will see, learning often involves alternately inferring likely $z$ values, and optimizing the model assuming these inferred $z$'s.
The dominant approach to learning latent variable models in a probabilistic setting is to maximize the log marginal likelihood. This is equivalent to minimizing $\mathrm{KL}[p_\star(x) \,\|\, p(x; \theta)]$, the KL-divergence between the true data distribution $p_\star(x)$ and the model distribution $p(x; \theta)$, where the latent variable $z$ has been marginalized out. It is also possible to approximately minimize other divergences between $p_\star(x)$ and $p(x; \theta)$, e.g. the Jensen-Shannon divergence or the Wasserstein distance. In the context of deep latent variable models, such methods often utilize a separate model (discriminator/critic) which learns to distinguish samples from $p_\star(x)$ from samples from $p(x; \theta)$. The generative model is trained “adversarially” to fool the discriminator. This gives rise to a family of models known as Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). While not the main focus of this tutorial, we review GANs and their applications to text modeling in section 7.3.
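The equivalence between maximizing the log marginal likelihood and minimizing the KL-divergence can be spelled out in one line. The sketch below assumes the symbols $p_\star(x)$ for the true data distribution and $p(x; \theta)$ for the model's marginal distribution, and approximates the expectation under $p_\star$ with $N$ training examples $x^{(1)}, \dots, x^{(N)}$:

```latex
\begin{aligned}
\mathrm{KL}[p_\star(x) \,\|\, p(x; \theta)]
  &= \mathbb{E}_{p_\star(x)}[\log p_\star(x)] - \mathbb{E}_{p_\star(x)}[\log p(x; \theta)] \\
\arg\min_\theta \, \mathrm{KL}[p_\star(x) \,\|\, p(x; \theta)]
  &= \arg\max_\theta \, \mathbb{E}_{p_\star(x)}[\log p(x; \theta)]
  \;\approx\; \arg\max_\theta \, \frac{1}{N} \sum_{n=1}^{N} \log p(x^{(n)}; \theta)
\end{aligned}
```

The first term on the right-hand side of the first line does not depend on $\theta$, which is why it drops out of the optimization.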
5.1 Directly Maximizing the Log Marginal Likelihood
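We begin with cases where the log marginal likelihood, i.e.
$$\log p(x; \theta) = \log \sum_{z} p(x, z; \theta),$$
is tractable to evaluate. (The sum should be replaced with an integral if $z$ is continuous.) This is equivalent to assuming posterior inference is tractable, since
$$p(z \mid x; \theta) = \frac{p(x, z; \theta)}{p(x; \theta)}.$$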
Calculating the log marginal likelihood is indeed tractable in some of the models that we have seen so far, such as categorical latent variable models where the number of categories $K$ is not too big, or certain structured latent variable models (like HMMs) where dynamic programs allow us to efficiently sum over all the $z$ assignments. In such cases, maximum likelihood training of our parameters $\theta$ then corresponds to solving the following maximization problem:
$$\arg\max_{\theta} \sum_{n=1}^{N} \log p(x^{(n)}; \theta),$$
where we have assumed $N$ examples in our training set.
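As an illustration of how a dynamic program makes the sum over latent assignments tractable, here is a minimal NumPy sketch of the forward algorithm for a discrete HMM. The parameter names (`log_pi`, `log_A`, `log_B`), the toy sizes, and the random parameters are assumptions made for this example rather than notation from the tutorial; the brute-force enumeration is included only to check the result on a tiny sequence.

```python
import itertools
import numpy as np

def hmm_log_marginal(log_pi, log_A, log_B, x):
    """Forward algorithm: log p(x) = log sum_z p(x, z) in O(T K^2) time.

    log_pi[k]   : log p(z_1 = k)
    log_A[j, k] : log p(z_t = k | z_{t-1} = j)
    log_B[k, v] : log p(x_t = v | z_t = k)
    """
    alpha = log_pi + log_B[:, x[0]]  # alpha[k] = log p(x_1, z_1 = k)
    for t in range(1, len(x)):
        # Sum out the previous state j, then emit x_t from the current state k.
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[:, x[t]]
    return np.logaddexp.reduce(alpha)

def brute_force_log_marginal(log_pi, log_A, log_B, x):
    """Enumerate all K^T state sequences; only feasible for tiny examples."""
    K, T = len(log_pi), len(x)
    terms = []
    for z in itertools.product(range(K), repeat=T):
        lp = log_pi[z[0]] + log_B[z[0], x[0]]
        for t in range(1, T):
            lp += log_A[z[t - 1], z[t]] + log_B[z[t], x[t]]
        terms.append(lp)
    return np.logaddexp.reduce(np.array(terms))

rng = np.random.default_rng(0)

def random_log_probs(*shape):
    """Random log-probabilities, normalized along the last axis."""
    p = rng.random(shape)
    return np.log(p / p.sum(axis=-1, keepdims=True))

K, V = 3, 4                      # hypothetical number of states and vocabulary size
log_pi = random_log_probs(K)
log_A = random_log_probs(K, K)
log_B = random_log_probs(K, V)
x = [0, 3, 1, 2, 2]              # a toy observation sequence of word ids

fast = hmm_log_marginal(log_pi, log_A, log_B, x)
slow = brute_force_log_marginal(log_pi, log_A, log_B, x)
print(fast, slow)                # the two values should agree
```

The forward recursion touches each of the $T$ positions once and does $K^2$ work per position, whereas the enumeration visits all $K^T$ assignments.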
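In cases where $p(x, z; \theta)$ is parameterized by a deep model, the above maximization problem is not tractable to solve exactly. We will assume, however, that $p(x, z; \theta)$ is differentiable with respect to $\theta$. The main tool for optimizing such models, then, is gradient-based optimization. In particular, define the log marginal likelihood over the training set as
$$L(\theta) = \sum_{n=1}^{N} \log p(x^{(n)}; \theta) = \sum_{n=1}^{N} \log \sum_{z} p(x^{(n)}, z; \theta).$$
The gradient is given by
$$\nabla_\theta L(\theta) = \sum_{n=1}^{N} \frac{\nabla_\theta \, p(x^{(n)}; \theta)}{p(x^{(n)}; \theta)} = \sum_{n=1}^{N} \sum_{z} \frac{p(x^{(n)}, z; \theta)}{p(x^{(n)}; \theta)} \nabla_\theta \log p(x^{(n)}, z; \theta) = \sum_{n=1}^{N} \mathbb{E}_{p(z \mid x^{(n)}; \theta)}\left[\nabla_\theta \log p(x^{(n)}, z; \theta)\right].$$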
Note that the above gradient expression involves an expectation over the posterior $p(z \mid x^{(n)}; \theta)$, and is therefore an example of how inference is used as a subroutine in learning. With this expression for the gradient in hand, we may then learn by updating the parameters as
$$\theta^{(i+1)} = \theta^{(i)} + \eta \nabla_\theta L(\theta^{(i)}),$$
where $\eta$ is the learning rate and $\theta^{(0)}$ is initialized randomly. In practice the gradient is calculated over mini-batches (i.e. random subsamples of the training set), and adaptive algorithms (Duchi et al., 2011; Zeiler, 2012; Kingma and Ba, 2015) are often used.
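To make the role of the posterior in the gradient concrete, the following NumPy sketch computes the gradient of the log marginal likelihood as a posterior-weighted sum of complete-data gradients, checks it against finite differences, and then runs plain gradient ascent. The model is an assumed toy example, not one taken from the text: a categorical latent variable with prior logits `a` and per-class word logits `B` over bag-of-words count vectors. All names, sizes, and the step size `eta` are hypothetical.

```python
import numpy as np

def log_softmax(u):
    u = u - u.max(axis=-1, keepdims=True)
    return u - np.log(np.exp(u).sum(axis=-1, keepdims=True))

def log_joint(a, B, X):
    """log p(x^(n), z = k; theta) for every n and k; result has shape (N, K)."""
    return log_softmax(a)[None, :] + X @ log_softmax(B).T

def log_marginal(a, B, X):
    """L(theta) = sum_n log sum_k p(x^(n), z = k; theta)."""
    return np.logaddexp.reduce(log_joint(a, B, X), axis=1).sum()

def grad_log_marginal(a, B, X):
    """sum_n E_{p(z | x^(n); theta)}[grad_theta log p(x^(n), z; theta)]."""
    lj = log_joint(a, B, X)
    post = np.exp(lj - np.logaddexp.reduce(lj, axis=1, keepdims=True))  # (N, K)
    # grad_a log p(x, z=k) = onehot(k) - softmax(a), weighted by the posterior.
    grad_a = (post - np.exp(log_softmax(a))[None, :]).sum(axis=0)
    # grad_{B_k} log p(x, z=k) = x - (sum_v x_v) softmax(B_k); zero for other rows.
    expected_counts = post.T @ X               # (K, V)
    expected_totals = post.T @ X.sum(axis=1)   # (K,)
    grad_B = expected_counts - expected_totals[:, None] * np.exp(log_softmax(B))
    return grad_a, grad_B

def finite_difference(f, theta, eps=1e-5):
    """Central-difference gradient, used only as a numerical check."""
    g = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        step = np.zeros_like(theta)
        step[idx] = eps
        g[idx] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
N, K, V = 20, 2, 5                                   # toy sizes (assumed)
X = rng.integers(0, 4, size=(N, V)).astype(float)    # synthetic word counts
a, B = rng.normal(size=K), rng.normal(size=(K, V))   # random initialization

# Check the posterior-expectation form of the gradient numerically.
g_a, g_B = grad_log_marginal(a, B, X)
fd_a = finite_difference(lambda a_: log_marginal(a_, B, X), a)
fd_B = finite_difference(lambda B_: log_marginal(a, B_, X), B)

# Plain gradient ascent: theta <- theta + eta * grad L(theta).
eta = 0.005
history = [log_marginal(a, B, X)]
for _ in range(100):
    step_a, step_B = grad_log_marginal(a, B, X)
    a, B = a + eta * step_a, B + eta * step_B
    history.append(log_marginal(a, B, X))
print(history[0], history[-1])
```

For a deep parameterization one would normally obtain the same gradient by automatic differentiation of the log marginal likelihood; the hand-written version here is only meant to expose the posterior expectation.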
5.2 Expectation Maximization (EM) Algorithm
The Expectation Maximization (EM) algorithm (Dempster et al., 1977) is an iterative method for learning latent variable models with tractable posterior inference. It maximizes a lower bound on the log marginal likelihood at each iteration. Given randomly-initialized starting parameters $\theta^{(0)}$, the algorithm updates the parameters via the following alternating procedure:
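E-step: Derive the posterior under current parameters $\theta^{(i)}$, i.e., $p(z \mid x^{(n)}; \theta^{(i)})$ for all $n = 1, \dots, N$.
M-step: Define the expected complete data likelihood as
$$Q(\theta, \theta^{(i)}) = \sum_{n=1}^{N} \mathbb{E}_{p(z \mid x^{(n)}; \theta^{(i)})}\left[\log p(x^{(n)}, z; \theta)\right].$$
Maximize this with respect to $\theta$, holding $\theta^{(i)}$ fixed:
$$\theta^{(i+1)} = \arg\max_{\theta} Q(\theta, \theta^{(i)}).$$
It can be shown that EM improves (or at least does not decrease) the log marginal likelihood at each iteration, i.e.
$$\sum_{n=1}^{N} \log p(x^{(n)}; \theta^{(i+1)}) \geq \sum_{n=1}^{N} \log p(x^{(n)}; \theta^{(i)}).$$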
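The monotonicity claim can be sketched for a single example $x$ as follows (the argument for the full training set sums over $n$). The derivation assumes the notation $\theta^{(i)}$ for the current parameters, $q(z) = p(z \mid x; \theta^{(i)})$ for the E-step posterior, $\mathbb{H}[q]$ for its entropy, and $Q(\theta, \theta^{(i)}) = \mathbb{E}_{q(z)}[\log p(x, z; \theta)]$ for the expected complete data likelihood:

```latex
\begin{aligned}
\log p(x; \theta)
  &= \mathbb{E}_{q(z)}\!\left[\log \frac{p(x, z; \theta)}{q(z)}\right]
     + \mathrm{KL}[q(z) \,\|\, p(z \mid x; \theta)]
   \;\geq\; Q(\theta, \theta^{(i)}) + \mathbb{H}[q], \\
\log p(x; \theta^{(i+1)})
  &\geq Q(\theta^{(i+1)}, \theta^{(i)}) + \mathbb{H}[q]
   \;\geq\; Q(\theta^{(i)}, \theta^{(i)}) + \mathbb{H}[q]
   \;=\; \log p(x; \theta^{(i)}).
\end{aligned}
```

The first inequality holds because the KL term is non-negative, the middle inequality because the M-step maximizes $Q(\cdot, \theta^{(i)})$, and the final equality because the KL term vanishes when $\theta = \theta^{(i)}$.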
As a simple example, let us apply the above recipe to the Naive Bayes model in section 3.1, with $K = 2$ classes:
E-step: for each , calculate
Note that above we have written the prior over in terms of a single parameter , since we must have .
M-step: The expected complete data likelihood is given by
To maximize the above with respect to , we can differentiate and set the resulting expression to zero. Using the indicator notation to refer to a function that returns 1 if the condition in the bracket holds and 0 otherwise, we have
where . (We can verify that the above is indeed the maximum since ). A similar derivation for yields
Note that these updates are analogous to the maximum likelihood parameters of a Naive Bayes model in the supervised case, except that the empirical counts have been replaced with the expected counts under the posterior distribution.
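The expected-count updates described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the tutorial's own implementation: it assumes a two-class Naive Bayes model over bag-of-words count vectors with categorical word distributions, uses synthetic data, and introduces hypothetical names (`log_prior`, `log_word`, `e_step`, `m_step`).

```python
import numpy as np

def e_step(log_prior, log_word, X):
    """Posterior p(z = k | x^(n)) for every n, plus the log marginal likelihood."""
    lj = log_prior[None, :] + X @ log_word.T            # log p(x^(n), z = k)
    log_px = np.logaddexp.reduce(lj, axis=1, keepdims=True)
    return np.exp(lj - log_px), log_px.sum()

def m_step(post, X):
    """Closed-form maximizer of the expected complete data likelihood."""
    prior = post.sum(axis=0) / post.shape[0]            # expected class frequencies
    counts = post.T @ X                                 # expected word counts, (K, V)
    word = counts / counts.sum(axis=1, keepdims=True)
    return np.log(prior), np.log(word)

rng = np.random.default_rng(0)
N, K, V = 50, 2, 6                                      # toy sizes (assumed)

# Synthetic documents drawn from two classes with different word preferences.
true_word = np.array([[0.30, 0.30, 0.30, 0.04, 0.03, 0.03],
                      [0.03, 0.03, 0.04, 0.30, 0.30, 0.30]])
z_true = rng.integers(0, K, size=N)
X = np.stack([rng.multinomial(10, true_word[z]) for z in z_true]).astype(float)

# Random (strictly positive) initialization of the parameters.
log_prior = np.log(np.full(K, 1.0 / K))
log_word = np.log(rng.dirichlet(np.ones(V), size=K))

log_liks = []
for _ in range(25):
    post, ll = e_step(log_prior, log_word, X)           # E-step under current parameters
    log_liks.append(ll)
    log_prior, log_word = m_step(post, X)               # M-step with expected counts
print(log_liks[0], log_liks[-1])
```

Replacing `post` with one-hot labels in `m_step` recovers the usual supervised maximum likelihood estimates, which is the analogy drawn in the text. The recorded log marginal likelihoods should be non-decreasing across iterations.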
Let us now consider using EM to learn the parameters of the RNN model introduced at the end of Section 3.1. Here, the E-step is similar to the Naive Bayes model and follows straightforwardly from Bayes’ rule: