The Variational Autoencoder (VAE) Kingma and Welling (2013); Rezende et al. (2014) is a generative model widely applied to language-generation tasks, which propagates latent codes drawn from a simple prior through a decoder to manifest data samples. The generative model is augmented by an inference network, which feeds observed data samples through an encoder to yield a distribution over the corresponding latent codes. Since natural language often possesses a latent hierarchical structure, it is desirable for the latent code in a VAE to reflect this inherent structure, so that the generated text can be more natural and expressive. An example of language structure is illustrated in Figure 1, where sentences are organized into a tree. The root node corresponds to simple sentences (e.g., “Yes”), while nodes on outer leaves represent sentences with more complex syntactic structure and richer, more specific semantic meaning (e.g., “The food in the restaurant is awesome”). (Another possible way to organize sentences is a hierarchy of topics: e.g., a parent node can be a sentence on “sports”, while its children are sentences on “basketball” and “skiing”.)
In existing VAE-based generative models, such structures are not explicitly considered. The latent code often employs a simple Gaussian prior, and the posterior is approximated as a Gaussian with diagonal covariance matrix. Such embeddings assume a Euclidean structure, which is inadequate for capturing the geometric structures illustrated in Figure 1. While some variants have been proposed to enrich the prior distributions Xu and Durrett (2018); Wang et al. (2019); Shi et al. (2019), there is no affirmative evidence that structural information in language can be recovered effectively by these models.
Hyperbolic geometry has recently emerged as an effective method for representation learning from data with hierarchical structure Mathieu et al. (2019); Nickel and Kiela (2017). Informally, hyperbolic space can be considered as a continuous analogue of trees. For example, a Poincaré disk (a hyperbolic space with two dimensions) can represent any tree with arbitrarily low distortion De Sa et al. (2018); Sarkar (2011). In Euclidean space, however, it is difficult to learn such structural representations, even with an infinite number of dimensions Linial et al. (1995).
Motivated by these observations, we propose the Adversarial Poincaré Variational Autoencoder (APo-VAE), a text embedding and generation model based on hyperbolic representations, where the latent code is encouraged to capture the underlying tree-like structure in language. Such a latent structure gives us more control over the sentences we want to generate, e.g., an increase of sentence complexity and diversity can be achieved along a trajectory from a root to its children. In practice, we define both the prior and the variational posterior of the latent code over a Poincaré ball, via the use of a wrapped normal distribution Nagano et al. (2019). To obtain more stable model training and learn more flexible representations of the latent code, we exploit the primal-dual formulation of the KL divergence Dai et al. (2018) based on Fenchel duality Rockafellar and others (1966), to adversarially optimize the variational bound. Unlike the primal form that relies on Monte Carlo approximation Mathieu et al. (2019), our dual formulation bypasses the need for tractable posterior likelihoods via the introduction of an auxiliary dual function.
We apply the proposed approach to language modeling and dialog-response generation tasks. For language modeling, in order to enhance the distribution complexity of the prior, we use an additional “variational mixture of posteriors” prior (VampPrior) design Tomczak and Welling (2018) for the wrapped normal distribution. Specifically, VampPrior uses a mixture distribution with components from variational posteriors, coupling the parameters of the prior and variational posterior together. For dialog response generation, a conditional model variant of APo-VAE is designed to take into account the dialog context.
Experiments show that the proposed model also addresses posterior collapse Bowman et al. (2016), a major obstacle to efficient learning of VAEs on text data, in which the encoder learns an approximate posterior close to the prior and the decoder tends to ignore the latent code during generation. We hypothesize that APo-VAE avoids this failure mode due to the use of a more informative prior in hyperbolic space that enhances the complexity of the latent representation, which aligns well with previous work Tomczak and Welling (2018); Wang et al. (2019) advocating better prior design.
Our main contributions are summarized as follows. (i) We present the Adversarial Poincaré Variational Autoencoder (APo-VAE), a novel approach to text embedding and generation based on hyperbolic latent representations. (ii) In addition to the use of a wrapped normal distribution, an adversarial learning procedure and a VampPrior design are incorporated for robust model training. (iii) Experiments on language modeling and dialog-response generation benchmarks demonstrate the superiority of the proposed approach over Euclidean VAEs, benefiting from capturing informative latent hierarchies in natural language.
2.1 Variational Autoencoder
Let $\mathcal{D} = \{x_i\}_{i=1}^{N}$ be a dataset of sentences, where each $x = [w_1, \dots, w_T]$ is a sequence of tokens of length $T$. Our goal is to learn $p_\theta(x)$ that best models the observed sentences so that the expected log-likelihood is maximized, i.e., $\theta^{*} = \arg\max_\theta \sum_{i=1}^{N} \log p_\theta(x_i)$.
The variational autoencoder (VAE) considers a latent-variable model to represent sentences, with an auxiliary encoder that draws samples of the latent code $z$ from the conditional density $q_\phi(z|x)$, known as the approximate posterior. Given a latent code $z$, the decoder samples a sentence $x$ from the conditional density $p_\theta(x|z)$, where the “decoding” pass takes an auto-regressive form. Together with the prior $p(z)$, the model is given by the joint $p_\theta(x, z) = p_\theta(x|z)\,p(z)$. The VAE leverages the approximate posterior to derive an evidence lower bound (ELBO) on the (intractable) marginal log-likelihood $\log p_\theta(x)$:
$$\log p_\theta(x) \geq \mathcal{L}_{\mathrm{ELBO}}(x; \theta, \phi) = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x, z) - \log q_\phi(z|x)\big],$$
where $(\theta, \phi)$ are jointly optimized during training, and the gap is given by the decomposition
$$\log p_\theta(x) - \mathcal{L}_{\mathrm{ELBO}}(x; \theta, \phi) = \mathrm{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z|x)\big),$$
where $\mathrm{KL}(\cdot\|\cdot)$ denotes the Kullback-Leibler divergence. Alternatively, the ELBO can also be written as:
$$\mathcal{L}_{\mathrm{ELBO}}(x; \theta, \phi) = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - \mathrm{KL}\big(q_\phi(z|x)\,\|\,p(z)\big),$$
where the first (conditional likelihood) term and the second (KL) term characterize reconstruction and generalization capabilities, respectively. Intuitively, a good model is expected to strike a balance between good reconstruction and generalization. In most cases, both the prior and the variational posterior are assumed to be Gaussian for computational convenience. However, such over-simplified assumptions may not be ideal for capturing the intrinsic characteristics of data with unique geometric structure, such as natural language.
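To make the two ELBO terms concrete, the following is a minimal numpy sketch under the standard diagonal-Gaussian assumption discussed above; the function names are illustrative, not part of the paper's implementation.

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dims."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def elbo(log_px_given_z, mu, logvar):
    """ELBO = reconstruction term - KL term, with a single-sample Monte Carlo
    estimate of the reconstruction expectation E_q[log p(x|z)]."""
    return log_px_given_z - gaussian_kl(mu, logvar)
```

When the posterior matches the prior exactly (mu = 0, logvar = 0), the KL term vanishes; this is precisely the degenerate solution reached under posterior collapse.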
2.2 Hyperbolic Space
Riemannian manifolds can provide a more powerful and meaningful embedding space for complex data with highly non-Euclidean structure that cannot be effectively captured in vectorial form (e.g., social networks, biology and computer graphics). Of particular interest is the hyperbolic space (Ganea et al., 2018), where (i) the relatively simple geometry allows tractable computations, and (ii) the exponential growth of distance in finite dimensions naturally embeds rich hierarchical structures in a compact form.
An $n$-dimensional Riemannian manifold $\mathcal{M}$ is a set of points locally similar to a linear space $\mathbb{R}^n$. At each point $z$ of the manifold $\mathcal{M}$, we can define a real vector space $\mathcal{T}_z\mathcal{M}$ that is tangent to $\mathcal{M}$, along with an associated metric tensor $g_z$, which is an inner product on $\mathcal{T}_z\mathcal{M}$. Intuitively, a Riemannian manifold behaves like a vector space only in its infinitesimal neighborhood, allowing the generalization of common notions like angle, straight line and distance to a smooth manifold. For each tangent space $\mathcal{T}_z\mathcal{M}$, there exists a specific one-to-one map $\exp_z$ from an $\varepsilon$-ball at the origin of $\mathcal{T}_z\mathcal{M}$ to a neighborhood of $z$ on $\mathcal{M}$, called the exponential map. We refer to the inverse of an exponential map as the logarithm map, denoted as $\log_z$. In addition, a parallel transport $\mathrm{PT}_{z \to z'}$ intuitively transports tangent vectors along a “straight” line between $z$ and $z'$, so that they remain “parallel”. This is the basic machinery that allows us to generalize distributions and computations to the hyperbolic space, as detailed in later sections.
Poincaré Ball Model.
Hyperbolic geometry is a non-Euclidean geometry with constant negative curvature. As a classical example of hyperbolic space, an $n$-dimensional Poincaré ball with curvature $-c$ ($c > 0$, i.e., radius $1/\sqrt{c}$) can be denoted as $\mathbb{B}_c^n = \{z \in \mathbb{R}^n : c\|z\|^2 < 1\}$, with its metric tensor given by $g_z^c = (\lambda_z^c)^2 g^E$, where $\lambda_z^c = 2/(1 - c\|z\|^2)$ and $g^E = \mathbf{I}_n$ denotes the regular Euclidean metric tensor. Intuitively, as $z$ moves closer to the boundary of the ball, the hyperbolic distance between $z$ and a nearby $z'$ diverges at a rate proportional to $\lambda_z^c$. This implies significant representation capacity, as very dissimilar objects can be encoded on a compact domain.
We review the closed-form mathematical operations that enable differentiable training for hyperbolic space models, namely the hyperbolic algebra (vector addition) and tangent-space computations (exponential/logarithm map and parallel transport). The hyperbolic algebra is formulated under the framework of gyrovector spaces (Ungar, 2008), with the addition of two points $x, y \in \mathbb{B}_c^n$ given by the Möbius addition:
$$x \oplus_c y = \frac{(1 + 2c\langle x, y\rangle + c\|y\|^2)\,x + (1 - c\|x\|^2)\,y}{1 + 2c\langle x, y\rangle + c^2\|x\|^2\|y\|^2}.$$
For any point $z \in \mathbb{B}_c^n$, the exponential map and the logarithm map are given for $v \in \mathcal{T}_z\mathbb{B}_c^n$ ($v \neq 0$) and $y \in \mathbb{B}_c^n$ ($y \neq z$) by
$$\exp_z^c(v) = z \oplus_c \left(\tanh\!\left(\frac{\sqrt{c}\,\lambda_z^c\,\|v\|}{2}\right)\frac{v}{\sqrt{c}\,\|v\|}\right), \qquad \log_z^c(y) = \frac{2}{\sqrt{c}\,\lambda_z^c}\,\tanh^{-1}\!\big(\sqrt{c}\,\|u\|\big)\,\frac{u}{\|u\|},$$
where $u = -z \oplus_c y$ and $\lambda_z^c = 2/(1 - c\|z\|^2)$. Note that the Poincaré ball model is geodesically complete, in the sense that $\exp_z^c$ is well-defined on the full tangent space $\mathcal{T}_z\mathbb{B}_c^n$. The parallel transport map from a vector $v \in \mathcal{T}_0\mathbb{B}_c^n$ at the origin to another tangent space $\mathcal{T}_z\mathbb{B}_c^n$ is given by
$$\mathrm{PT}_{0 \to z}^c(v) = \frac{\lambda_0^c}{\lambda_z^c}\,v.$$
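These operations admit direct numerical implementations. Below is an illustrative numpy sketch of the Möbius addition and the exponential/logarithm maps for a single point (batching, clamping near the boundary, and autograd support are omitted for clarity); all function names are our own.

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    """Mobius addition x (+)_c y on the Poincare ball with curvature -c."""
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    return num / (1 + 2 * c * xy + c**2 * x2 * y2)

def lam(z, c=1.0):
    """Conformal factor lambda_z^c = 2 / (1 - c ||z||^2)."""
    return 2.0 / (1.0 - c * np.dot(z, z))

def exp_map(z, v, c=1.0):
    """Exponential map exp_z^c(v): tangent vector v at z -> point on the ball."""
    vn = np.linalg.norm(v)
    if vn == 0:
        return z.copy()
    t = np.tanh(np.sqrt(c) * lam(z, c) * vn / 2.0)
    return mobius_add(z, t * v / (np.sqrt(c) * vn), c)

def log_map(z, y, c=1.0):
    """Logarithm map log_z^c(y), the inverse of the exponential map at z."""
    u = mobius_add(-z, y, c)
    un = np.linalg.norm(u)
    return (2.0 / (np.sqrt(c) * lam(z, c))) * np.arctanh(np.sqrt(c) * un) * u / un
```

A quick check of the defining property is that `log_map(z, exp_map(z, v))` recovers `v`, and that the exponential map never leaves the unit ball.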
3 Adversarial Poincaré VAE
In this section, we first introduce our hyperbolic encoder and decoder, and show how to apply the reparametrization trick. We then provide a detailed description of the model implementation, explaining how the primal-dual form of the KL divergence helps stabilize training. Finally, we describe how we adopt the VampPrior Tomczak and Welling (2018) to enhance performance. A summary of the model is provided in Figure 2.
3.1 Flexible Wrapped Distribution Encoder
We begin by generalizing the standard normal distribution to a Poincaré ball (Ganea et al., 2018). While there are a few competing definitions of the hyperbolic normal, we choose the wrapped normal as our prior and variational posterior, largely due to its flexibility for more expressive generalization. A wrapped normal distribution is sampled as follows: (i) sample a vector $v$ from $\mathcal{N}(0, \Sigma)$ in the tangent space at the origin; (ii) parallel transport $v$ to the tangent space at $\mu$; (iii) use the exponential map to project the result back to $\mathbb{B}_c^n$. Putting these together, a latent sample has the following reparametrizable form:
$$z = \exp_\mu^c\!\big(\mathrm{PT}_{0 \to \mu}^c(v)\big), \qquad v \sim \mathcal{N}(0, \Sigma).$$
For approximate posteriors, $(\mu, \Sigma)$ depend on $x$. We further generalize the (restrictive) hyperbolic wrapped normal by acknowledging that, under the implicit VAE framework Fang et al. (2019), one does not need the approximate posterior to be analytically tractable. This allows us to replace the tangent-space sampling step above with a more flexible implicit distribution, from which we draw samples as $v = g_\phi(\epsilon, x)$ for $\epsilon \sim \mathcal{N}(0, \mathbf{I})$. Note that $\mu = \mu_\phi(x)$ can now be regarded as a deterministic displacement vector that anchors embeddings to the correct semantic neighborhood, allowing the stochastic $v$ to focus only on modeling the local uncertainty of the semantic embedding. The synergy between the deterministic and stochastic parts enables efficient representation learning relative to existing alternatives. For simplicity, we denote the encoder network, which contains $\mu_\phi$ and $g_\phi$ with parameters $\phi$, as $q_\phi(z|x)$.
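The three sampling steps above can be sketched numerically. The following self-contained numpy example (repeating the Möbius/exponential-map formulas from Section 2.2; the function names are our own) draws one wrapped normal sample with an isotropic tangent covariance:

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    return num / (1 + 2 * c * xy + c**2 * x2 * y2)

def lam(z, c=1.0):
    return 2.0 / (1.0 - c * np.dot(z, z))

def exp_map(z, v, c=1.0):
    vn = np.linalg.norm(v)
    if vn == 0:
        return z.copy()
    t = np.tanh(np.sqrt(c) * lam(z, c) * vn / 2.0)
    return mobius_add(z, t * v / (np.sqrt(c) * vn), c)

def sample_wrapped_normal(mu, sigma, c=1.0, rng=None):
    """Reparametrized wrapped normal sample on the Poincare ball:
    (i)  v ~ N(0, sigma^2 I) in the tangent space at the origin;
    (ii) parallel transport to T_mu via PT_{0->mu}(v) = (lambda_0/lambda_mu) v;
    (iii) push forward with the exponential map at mu."""
    rng = rng or np.random.default_rng(0)
    v = sigma * rng.standard_normal(mu.shape)           # step (i)
    u = (lam(np.zeros_like(mu), c) / lam(mu, c)) * v    # step (ii)
    return exp_map(mu, u, c)                            # step (iii)
```

In the implicit-posterior variant, step (i) is simply replaced by a neural sampler $v = g_\phi(\epsilon, x)$; steps (ii) and (iii) are unchanged.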
3.2 Poincaré Decoder
To build a geometry-aware decoder for a hyperbolic latent code, we follow Ganea et al. (2018) and use a generalized linear function analogously defined in the hyperbolic space. A Euclidean linear function takes the form $f_{a,b}(z) = \langle a, z - b\rangle = \mathrm{sign}(\langle a, z - b\rangle)\,\|a\|\,d(z, H_{a,b})$, where $a$ is the coefficient, $b$ is the intercept, $H_{a,b} = \{z : \langle a, z - b\rangle = 0\}$ is a hyperplane passing through $b$ with $a$ as the normal direction, and $d(z, H_{a,b})$ is the distance between $z$ and the hyperplane $H_{a,b}$. The counterpart in the Poincaré ball analogously writes
$$f_{a,b}^c(z) = \mathrm{sign}\big(\langle a, \log_b^c(z)\rangle_b\big)\,\|a\|_b\,d_c\big(z, H_{a,b}^c\big),$$
where $H_{a,b}^c$ and $d_c(z, H_{a,b}^c)$ are the gyroplane and the distance between $z$ and the gyroplane, respectively. Specifically, we use this hyperbolic linear function to extract features from the Poincaré embedding $z$. The feature is the input to the RNN decoder. We denote the combined network of the hyperbolic feature extractor and the RNN decoder as $p_\theta(x|z)$, where the parameters $\theta$ contain $\{a, b\}$ and the RNN weights.
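The gyroplane distance admits a closed form (Ganea et al., 2018), which we sketch below in numpy; the helper names are our own, and boundary clamping is omitted:

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    return num / (1 + 2 * c * xy + c**2 * x2 * y2)

def gyroplane_distance(z, a, b, c=1.0):
    """Distance from z to the gyroplane through b with normal direction a,
    using the closed form d_c(z, H_{a,b}) =
    (1/sqrt(c)) * asinh( 2 sqrt(c) |<-b (+)_c z, a>| /
                         ((1 - c ||-b (+)_c z||^2) ||a||) )."""
    u = mobius_add(-b, z, c)
    num = 2.0 * np.sqrt(c) * abs(np.dot(u, a))
    den = (1.0 - c * np.dot(u, u)) * np.linalg.norm(a)
    return np.arcsinh(num / den) / np.sqrt(c)
```

A decoder feature is then a (signed, scaled) stack of such distances, one per gyroplane, playing the role of the pre-activation of a Euclidean linear layer.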
3.3 Implementing APo-VAE
While it is straightforward to compute the ELBO via Monte Carlo estimates using the explicit wrapped normal density Mathieu et al. (2019), we empirically observe that: (i) the normal assumption restricts the expressiveness of the model; and (ii) the wrapped normal likelihood makes training severely unstable. Therefore, we appeal to a primal-dual view of VAE training to overcome such difficulties Rockafellar and others (1966); Dai et al. (2018); Fang et al. (2019). Specifically, based on Fenchel duality, the KL term in the ELBO can be reformulated as:
$$\mathrm{KL}\big(q_\phi(z|x)\,\|\,p(z)\big) = \max_{\nu_\psi}\; \mathbb{E}_{q_\phi(z|x)}\big[\nu_\psi(x, z)\big] - \mathbb{E}_{p(z)}\big[\exp\big(\nu_\psi(x, z)\big)\big] + 1,$$
where $\nu_\psi(x, z)$ is the (auxiliary) dual function (i.e., a neural network) with parameters $\psi$. The primal-dual view of the KL term enhances the approximation ability, while also being computationally tractable. Meanwhile, since the density function $q_\phi(z|x)$ in the original KL term is replaced by the dual function $\nu_\psi(x, z)$, we avoid direct computation with the probability density function of the wrapped normal distribution.
To train our proposed APo-VAE with the primal-dual form of the VAE objective, we follow the training schemes of coupled variational Bayes (CVB) Dai et al. (2018) and implicit VAE Fang et al. (2019), which optimize the objective adversarially. Specifically, we update the parameters $\psi$ of the dual function to maximize:
$$\mathcal{J}_{\psi} = \mathbb{E}_{p_d(x)}\Big[\mathbb{E}_{q_\phi(z|x)}\big[\nu_\psi(x, z)\big] - \mathbb{E}_{p(z)}\big[\exp\big(\nu_\psi(x, z)\big)\big]\Big],$$
where $\mathbb{E}_{p_d(x)}$ denotes the expectation over the empirical distribution on observations. Accordingly, the decoder and encoder parameters $\theta$ and $\phi$ are updated to maximize:
$$\mathcal{J}_{\theta,\phi} = \mathbb{E}_{p_d(x)}\Big[\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z) - \nu_\psi(x, z)\big]\Big].$$
Note that the term $\mathbb{E}_{q_\phi(z|x)}[\nu_\psi(x, z)]$ is maximized in the first objective while minimized in the second, i.e., adversarial learning. In other words, one can consider the dual function $\nu_\psi$ as a discriminative network that distinguishes between the prior $p(z)$ and the variational posterior $q_\phi(z|x)$, both of which are paired with the input data $x$.
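The dual representation of the KL term can be sanity-checked numerically: plugging the optimal dual function $\nu^{*}(z) = \log q(z) - \log p(z)$ into the objective recovers the exact KL divergence. A purely illustrative 1-D Monte Carlo check with Gaussians (standing in for $q_\phi(z|x)$ and $p(z)$; not the paper's hyperbolic setup):

```python
import numpy as np

# Check KL(q || p) = max_nu E_q[nu(z)] - E_p[exp(nu(z))] + 1
# for q = N(m, s^2), p = N(0, 1), at the optimal nu*(z) = log q(z) - log p(z).
rng = np.random.default_rng(0)
m, s = 0.5, 0.8

def log_normal(z, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (z - mu)**2 / (2 * sigma**2)

def nu_star(z):
    """Optimal dual function: the log density ratio log q(z) - log p(z)."""
    return log_normal(z, m, s) - log_normal(z, 0.0, 1.0)

zq = m + s * rng.standard_normal(200_000)  # Monte Carlo samples from q
zp = rng.standard_normal(200_000)          # Monte Carlo samples from p

dual_value = nu_star(zq).mean() - np.exp(nu_star(zp)).mean() + 1.0
closed_form = np.log(1.0 / s) + (s**2 + m**2 - 1.0) / 2.0  # exact Gaussian KL
```

In training, $\nu_\psi$ is a neural network ascending toward this optimum, so only samples from $q_\phi(z|x)$ and $p(z)$ are ever needed, never their densities.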
3.4 Data-driven Prior
While the use of a standard normal prior is a natural choice in Euclidean space, we argue that it induces bias in the hyperbolic setup. This is because natural sentences carry specific meanings, and it is unrealistic for the bulk of the probability mass to concentrate near the center of the ball. (This concentration at the center occurs in low dimensions; in high dimensions, a standard normal concentrates near the surface of a sphere, which may partly explain why cosine similarity compares favorably with Euclidean distance in NLP applications.)
To reduce the bias induced by a pre-fixed prior, we consider a data-driven alternative that also serves to close the variational gap. In this work, we adopt the VampPrior framework for this purpose Tomczak and Welling (2018), which is a mixture of variational posteriors conditioned on learnable pseudo-data points. Specifically, we consider the prior as a learnable distribution given by
$$p_\lambda(z) = \frac{1}{K}\sum_{k=1}^{K} q_\phi(z|u_k),$$
where $q_\phi(z|\cdot)$ is the learned approximate posterior, and we call the learnable parameters $\{u_k\}_{k=1}^{K}$ pseudo-inputs. Intuitively, $p_\lambda(z)$ seeks to match the aggregated posterior Makhzani et al. (2015), $q(z) = \frac{1}{N}\sum_{i=1}^{N} q_\phi(z|x_i)$, in a cost-efficient manner via parameterizing the pseudo-inputs. By replacing the prior distribution in the training objectives with $p_\lambda(z)$, we complete the final objective of the proposed APo-VAE. The detailed training procedure is summarized in Algorithm 1.
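The mixture-of-posteriors prior is straightforward to evaluate. The sketch below uses a diagonal Gaussian as a stand-in for the posterior family (the paper's actual posterior is a wrapped normal on the ball); `encoder` is a hypothetical function mapping a pseudo-input to posterior parameters:

```python
import numpy as np

def log_normal(z, mu, sigma):
    """Diagonal-Gaussian log-density, summed over dimensions."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (z - mu)**2 / (2 * sigma**2))

def vamp_prior_logpdf(z, pseudo_inputs, encoder):
    """log p(z) = log (1/K) sum_k q(z | u_k): a mixture of posteriors
    evaluated at the K learnable pseudo-inputs u_k."""
    K = len(pseudo_inputs)
    log_components = [log_normal(z, *encoder(u)) for u in pseudo_inputs]
    m = max(log_components)  # log-sum-exp for numerical stability
    return m + np.log(sum(np.exp(lc - m) for lc in log_components)) - np.log(K)
```

Because the mixture components reuse the encoder, gradients with respect to both $\phi$ and the pseudo-inputs flow through the prior, coupling prior and posterior as described above.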
4 Related Work
VAE for Text Generation.
Many VAE models have been proposed for text generation, most of which focus on solving the posterior collapse issue. The most popular strategy is to alter the training dynamics, keeping the encoder away from bad local optima. For example, variants of KL annealing Bowman et al. (2016); Zhao et al. (2018); Fu (2019) dynamically adjust the weight on the KL penalty term as training progresses; lagging VAE He et al. (2019)
aggressively optimizes the encoder before each decoder update, to overcome the imbalanced training between encoder and decoder. Alternative strategies have also been proposed based on competing theories or heuristics. δ-VAE Razavi et al. (2019) tackles this issue by enforcing a minimum KL divergence between the posterior and the prior. Yang et al. (2017) attributes posterior collapse to the auto-regressive design of the decoder and advocates alternative architectures. A semi-amortized inference network is considered by Kim et al. (2018) to reduce the amortization gap between the log-likelihood and the ELBO.
Recent work has also shown that posterior collapse can be ameliorated by using priors and variational posteriors more expressive than the Gaussian. A flow-based VAE is considered in Ziegler and Rush (2019) to enhance the flexibility of prior distributions. A topic-guided prior is proposed in Wang et al. (2019) to achieve more controllable text generation. Fang et al. (2019) explores implicit sample-based representations, without requiring an explicit density form for the approximate posterior. Xu and Durrett (2018) considers replacing the Gaussian with the spherical von Mises-Fisher (vMF) distribution. Compared to this prior art, our model features structured representation in hyperbolic space, which not only captures latent hierarchies but also combats posterior collapse.
Hyperbolic Space Representation Learning.
There has been a recent surge of interest in representation learning in hyperbolic space, largely due to its exceptional effectiveness in modeling data with underlying graph structure (Chamberlain et al., 2017), such as relation nets (Nickel and Kiela, 2017). In the context of NLP, hyperbolic geometry has been considered for word embeddings (Tifrea et al., 2018). A popular vehicle for hyperbolic representation learning is the auto-encoder (AE) framework (Grattarola et al., 2019; Ovinnikov, 2019), where the decoders are built to efficiently exploit the hyperbolic geometry (Ganea et al., 2018). Closest to our APo-VAE are the hyperbolic VAEs (Mathieu et al., 2019; Nagano et al., 2019), where wrapped normal distributions have been used. Drawing power from the dual form of the KL divergence, the proposed APo-VAE features an implicit posterior and a data-driven prior, showing improved training stability.
We evaluate the proposed model on two tasks: (i) language modeling, and (ii) dialog-response generation, with quantitative results, human evaluation and qualitative analysis.
5.1 Experimental Setup
We use three datasets for language modeling: Penn Treebank (PTB) Marcus et al. (1993), Yahoo and Yelp corpora Yang et al. (2017). PTB contains one million words of 1989 Wall Street Journal material annotated in Treebank II style, with 42k sentences of varying lengths. Yahoo and Yelp are much larger datasets, each containing 100k sentences with greater average length.
Following Gu et al. (2019), we consider two datasets for dialogue-response generation: Switchboard Godfrey and Holliman (1997) and DailyDialog Li et al. (2017). The former contains 2.4k two-sided telephone conversations, manually transcribed and aligned. The latter contains 13k multi-turn conversations about daily life, collected from websites for English learners. We split the data into training, validation and test sets following the protocol described in Zhao et al. (2017); Shen et al. (2018).
To benchmark language modeling performance, we report the ELBO and perplexity (PPL) of APo-VAE and baselines. To verify that our proposed APo-VAE is more resistant to posterior collapse, we also report the KL divergence $\mathrm{KL}(q_\phi(z|x)\,\|\,p(z))$ and the mutual information (MI) between $x$ and $z$ He et al. (2019). The number of active units (AU) of the latent code is also reported, where the activity of a latent dimension $u$ is measured as $A_u = \mathrm{Cov}_x\big(\mathbb{E}_{z \sim q_\phi(z|x)}[z_u]\big)$, and the dimension is defined as active if $A_u$ exceeds a small threshold.
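The AU metric can be computed directly from the per-example posterior means. A minimal numpy sketch (the threshold value is a configurable assumption, following common practice in the VAE literature):

```python
import numpy as np

def active_units(posterior_means, threshold=0.01):
    """Count active latent dimensions: a dimension u is active if the
    variance across the dataset of its posterior mean E_{q(z|x)}[z_u]
    exceeds the threshold."""
    activity = np.var(posterior_means, axis=0)  # Cov_x( E_{z~q(z|x)}[z_u] )
    return int(np.sum(activity > threshold))
```

Under posterior collapse, the posterior mean is nearly constant across inputs, so the per-dimension variance, and hence the AU count, drops toward zero.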
For dialogue response generation, we adopt the evaluation metrics used in previous studies Zhao et al. (2017); Gu et al. (2019), including BLEU Papineni et al. (2002), BOW Liu et al. (2016), and intra/inter-dist values Gu et al. (2019). The first two metrics assess the relevance of the generated responses, and the third evaluates their diversity. Detailed definitions of these metrics are provided in the Appendix.
For language modeling, we adopt an LSTM Hochreiter and Schmidhuber (1997) for both the encoder and decoder. Following Mathieu et al. (2019), the curvature hyper-parameter $c$ is fixed. For dialogue response generation, we modify our APo-VAE following the conditional VAE framework Zhao et al. (2017). Specifically, an extra input of context embedding is supplied to the encoder and decoder. The prior is a wrapped normal conditioned on the context embedding, learned together with the variational posterior. Additional details are provided in the Appendix.
5.2 Experimental Results
Table 1 shows results on language modeling. We mainly compare APo-VAE with other VAE-based solutions, including β-VAE Higgins et al. (2017), SA-VAE Kim et al. (2018), lagging VAE (LAG-VAE) He et al. (2019), vMF-VAE Xu and Durrett (2018) and iVAE Fang et al. (2019). On all three datasets, our model achieves lower negative ELBO and PPL than the other models, demonstrating its stronger ability to model sequential text data. Meanwhile, the larger KL term and higher mutual information (between $x$ and $z$) of APo-VAE indicate its robustness in handling posterior collapse. In addition, the introduction of the data-driven prior (denoted as APo-VAE+VP) further boosts performance, especially on negative ELBO and PPL.
To verify our hypothesis that the proposed model is capable of learning latent tree structure in text data, we visualize the two-dimensional projection of the learned latent code in Figure 3. For visualization, we randomly draw 5k samples from PTB-test, and encode them to the latent space using APo-VAE encoder. We color-code each sentence based on its length (i.e., blue for long sentences and red for short sentences). Note that only a small portion of data have a length longer than (), and human inspection verified that most of them are composed of multiple sub-sentences. As such, we exclude these samples from our analysis.
As shown in Figure 3, longer sentences (dark blue) tend to occupy the outer rim of the Poincaré ball, while shorter ones (dark red) are concentrated in the inner area. In addition, we draw a long sentence (dark blue) from the test set and manually shorten it to create several variants of different lengths (ranging from 6 to 27), which are related in a hierarchical manner based on human judgment. We visualize their latent codes projected by the trained APo-VAE. The resulting plot is consistent with a hierarchical structure: as the sentence becomes more specific, the embedding moves outward. We also decode from the neighbors of these latent codes; the outputs (see the Appendix) demonstrate a similar hierarchical structure to the input sentences.
Dialogue Response Generation.
Results on SwitchBoard and DailyDialog are summarized in Table 2. Our proposed model generates comparable or better responses than the baseline models in terms of both relevance (BLEU and BOW) and diversity (intra/inter-dist). For SwitchBoard, APo-VAE improves the average recall from (by iVAE) to , while significantly enhancing generation diversity (e.g., from to for intra-dist-2). For DailyDialog, similar trends can be observed. Three examples are provided in Figure 4. More examples can be found in the Appendix.
We further perform human evaluation via Amazon Mechanical Turk. We asked the turkers to compare generated responses from two models, and assess each model’s informativeness, relevance to the dialog context (coherence), and diversity. We use randomly sampled contexts from the test set, each assessed by three judges. In order to evaluate diversity, 5 responses are generated for each dialog context. For quality control, only workers with a lifetime task approval rating greater than 98% were allowed to participate in our study. Table 3 summarizes the human evaluation results. The responses generated by our model are clearly preferred by the judges compared with other competing methods.
We present APo-VAE, a novel model for text generation in hyperbolic space. Our model can learn latent hierarchies in natural language via the use of wrapped normals for the prior. A primal-dual view of KL divergence is adopted for robust model training. Experiments on language modeling and dialog response generation demonstrate the superiority of the model. For future work, we plan to apply APo-VAE to other text generation tasks such as image captioning.
References
- Bowman et al. (2016). Generating sentences from a continuous space. In CoNLL.
- Chamberlain et al. (2017). Neural embeddings of graphs in hyperbolic space. arXiv preprint arXiv:1705.10359.
- Dai et al. (2018). Coupled variational Bayes via optimization embedding. In NeurIPS.
- De Sa et al. (2018). Representation tradeoffs for hyperbolic embeddings. In Proceedings of Machine Learning Research.
- Fang et al. (2019). Implicit deep latent variable models for text generation. In EMNLP.
- Fu et al. (2019). Cyclical annealing schedule: a simple approach to mitigating KL vanishing. In NAACL.
- Ganea et al. (2018). Hyperbolic neural networks. In NeurIPS.
- Godfrey and Holliman (1997). Switchboard-1 release 2. Linguistic Data Consortium, Philadelphia.
- Grattarola et al. (2019). Adversarial autoencoders with constant-curvature latent manifolds. Applied Soft Computing 81, pp. 105511.
- Gu et al. (2019). DialogWAE: multimodal response generation with conditional Wasserstein auto-encoder. In ICLR.
- He et al. (2019). Lagging inference networks and posterior collapse in variational autoencoders. In ICLR.
- Higgins et al. (2017). beta-VAE: learning basic visual concepts with a constrained variational framework. In ICLR.
- Hochreiter and Schmidhuber (1997). Long short-term memory. Neural Computation.
- Kim et al. (2018). Semi-amortized variational autoencoders. In ICML.
- Kingma and Welling (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
- Li et al. (2017). DailyDialog: a manually labelled multi-turn dialogue dataset. arXiv preprint arXiv:1710.03957.
- Linial et al. (1995). The geometry of graphs and some of its algorithmic applications. Combinatorica 15(2), pp. 215–245.
- Liu et al. (2016). How not to evaluate your dialogue system: an empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023.
- Makhzani et al. (2015). Adversarial autoencoders. arXiv preprint arXiv:1511.05644.
- Marcus et al. (1993). Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics.
- Mathieu et al. (2019). Continuous hierarchical representations with Poincaré variational auto-encoders. In NeurIPS.
- Nagano et al. (2019). A wrapped normal distribution on hyperbolic space for gradient-based learning. In ICML.
- Nickel and Kiela (2017). Poincaré embeddings for learning hierarchical representations. In NeurIPS.
- Ovinnikov (2019). Poincaré Wasserstein autoencoder. arXiv preprint arXiv:1901.01427.
- Papineni et al. (2002). BLEU: a method for automatic evaluation of machine translation. In ACL.
- Razavi et al. (2019). Preventing posterior collapse with delta-VAEs. In ICLR.
- Rezende et al. (2014). Stochastic backpropagation and approximate inference in deep generative models. In ICML.
- Rockafellar (1966). Extension of Fenchel's duality theorem for convex functions. Duke Mathematical Journal.
- Sarkar (2011). Low distortion Delaunay embedding of trees in hyperbolic plane. In International Symposium on Graph Drawing, pp. 355–366.
- Shen et al. (2018). Improving variational encoder-decoders in dialogue generation. In AAAI.
- Shi et al. (2019). Fixing Gaussian mixture VAEs for interpretable text generation. arXiv preprint arXiv:1906.06719.
- Tifrea et al. (2018). Poincaré GloVe: hyperbolic word embeddings. arXiv preprint arXiv:1810.06546.
- Tomczak and Welling (2018). VAE with a VampPrior. In AISTATS.
- Ungar (2008). A gyrovector space approach to hyperbolic geometry. Synthesis Lectures on Mathematics and Statistics 1(1), pp. 1–194.
- Wang et al. (2019). Topic-guided variational autoencoders for text generation. arXiv preprint arXiv:1903.07137.
- Xu and Durrett (2018). Spherical latent spaces for stable variational autoencoders. In EMNLP.
- Yang et al. (2017). Improved variational autoencoders for text modeling using dilated convolutions. In ICML.
- Zhao et al. (2018). InfoVAE: information maximizing variational autoencoders. In AAAI.
- Zhao et al. (2017). Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In ACL.
- Ziegler and Rush (2019). Latent normalizing flows for discrete sequences. In ICML.