1 Introduction
The Variational Autoencoder (VAE) Kingma and Welling (2013); Rezende et al. (2014) is a generative model widely applied to language-generation tasks, which passes latent codes drawn from a simple prior through a decoder to produce data samples. The generative model is augmented by an inference network, which feeds observable data samples through an encoder to yield a distribution on the corresponding latent codes. Since natural language often contains a latent hierarchical structure, it is desirable for the latent code in a VAE to reflect this inherent structure, so that the generated text can be more natural and expressive. An example of language structure is illustrated in Figure 1, where sentences are organized into a tree structure. The root node corresponds to simple sentences (e.g., “Yes”), while nodes on outer leaves represent sentences with more complex syntactic structure and richer, more specific semantic meaning (e.g., “The food in the restaurant is awesome”).¹
¹ Another possible way to organize sentences is a hierarchy of topics, e.g., a parent node can be a sentence on “sports”, while its children are sentences on “basketball” and “skiing”.
In existing VAE-based generative models, such structures are not explicitly considered. The latent code often employs a simple Gaussian prior, and the posterior is approximated as a Gaussian with diagonal covariance matrix. Such embeddings assume a Euclidean structure, which is inadequate for capturing geometric structures like the one illustrated in Figure 1. While some variants have been proposed to enrich the prior distributions Xu and Durrett (2018); Wang et al. (2019); Shi et al. (2019), there is no affirmative evidence that structural information in language can be recovered effectively by these models.
Hyperbolic geometry has recently emerged as an effective method for representation learning from data with hierarchical structure Mathieu et al. (2019); Nickel and Kiela (2017). Informally, hyperbolic space can be considered as a continuous analogue of trees. For example, a Poincaré disk (a hyperbolic space with two dimensions) can represent any tree with arbitrarily low distortion De Sa et al. (2018); Sarkar (2011). In Euclidean space, however, it is difficult to learn such structural representations even with infinitely many dimensions Linial et al. (1995).
Motivated by these observations, we propose the Adversarial Poincaré Variational Autoencoder (APoVAE), a text embedding and generation model based on hyperbolic representations, where the latent code is encouraged to capture the underlying tree-like structure in language. Such a latent structure gives us more control over the sentences we want to generate, e.g., an increase in sentence complexity and diversity can be achieved along a trajectory from a root toward its children. In practice, we define both the prior and the variational posterior of the latent code over a Poincaré ball, via the use of a wrapped normal distribution Nagano et al. (2019). To obtain more stable model training and learn a more flexible representation of the latent code, we exploit the primal-dual formulation of the KL divergence Dai et al. (2018) based on Fenchel duality Rockafellar and others (1966), to adversarially optimize the variational bound. Unlike the primal form, which relies on Monte Carlo approximation Mathieu et al. (2019), our dual formulation bypasses the need for tractable posterior likelihoods via the introduction of an auxiliary dual function.
We apply the proposed approach to language modeling and dialog-response generation tasks. For language modeling, in order to enhance the distributional complexity of the prior, we use an additional “variational mixture of posteriors” prior (VampPrior) Tomczak and Welling (2018) for the wrapped normal distribution. Specifically, the VampPrior is a mixture distribution whose components are variational posteriors, coupling the parameters of the prior and the variational posterior. For dialog-response generation, a conditional variant of APoVAE is designed to take the dialog context into account.
Experiments also show that the proposed model addresses posterior collapse Bowman et al. (2016), a major obstacle to efficient learning of VAEs on text data, in which the encoder learns an approximate posterior close to the prior and the decoder tends to ignore the latent code during generation. We hypothesize that our model's robustness is due to the use of a more informative prior in hyperbolic space, which enhances the complexity of the latent representation; this aligns well with previous work Tomczak and Welling (2018); Wang et al. (2019) advocating better prior design.
Our main contributions are summarized as follows. (i) We present the Adversarial Poincaré Variational Autoencoder (APoVAE), a novel approach to text embedding and generation based on hyperbolic latent representations. (ii) In addition to the use of the wrapped normal distribution, an adversarial learning procedure and a VampPrior design are incorporated for robust model training. (iii) Experiments on language-modeling and dialog-response generation benchmarks demonstrate the superiority of the proposed approach over Euclidean VAEs, benefiting from capturing informative latent hierarchies in natural language.
2 Preliminaries
2.1 Variational Autoencoder
Let D = {x_i}_{i=1}^N be a dataset of sentences, where each x = [w_1, …, w_T] is a sequence of tokens of length T. Our goal is to learn a model p_θ(x) that best describes the observed sentences, i.e., that maximizes the expected log-likelihood E_{x∼D}[log p_θ(x)].
The variational autoencoder (VAE) considers a latent-variable model to represent sentences, with an auxiliary encoder that draws samples of the latent code z from the conditional density q_φ(z|x), known as the approximate posterior. Given a latent code z, the decoder samples a sentence from the conditional density p_θ(x|z), where the “decoding” pass takes an autoregressive form. Together with the prior p(z), the model is given by the joint p_θ(x, z) = p_θ(x|z) p(z). The VAE leverages the approximate posterior to derive an evidence lower bound (ELBO) to the (intractable) marginal likelihood p_θ(x):
log p_θ(x) ≥ L(θ, φ; x) = E_{q_φ(z|x)}[ log p_θ(x, z) − log q_φ(z|x) ],   (1)

where θ and φ are jointly optimized during training, and the gap is given by the decomposition

log p_θ(x) = L(θ, φ; x) + KL( q_φ(z|x) ‖ p_θ(z|x) ),   (2)

where KL(·‖·) denotes the Kullback–Leibler divergence. Alternatively, the ELBO can also be written as:

L(θ, φ; x) = E_{q_φ(z|x)}[ log p_θ(x|z) ] − KL( q_φ(z|x) ‖ p(z) ),   (3)
where the first conditional likelihood and second KL terms respectively characterize reconstruction and generalization capabilities. Intuitively, a good model is expected to strike a balance between good reconstruction and generalization. In most cases, both the prior and the variational posterior are assumed to be Gaussian for computational convenience. However, such oversimplified assumptions may not be ideal for capturing the intrinsic characteristics of data that have unique geometrical structure, such as natural language.
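To make the two terms of the ELBO in (3) concrete, the following NumPy sketch computes the closed-form KL term for the common diagonal-Gaussian case (an illustrative example, not part of the proposed model, which replaces this Gaussian assumption; function names are ours):

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ),
    the regularization term in Eq. (3) for a diagonal-Gaussian posterior."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)

def elbo(recon_loglik, mu, logvar):
    """ELBO = E_q[log p(x|z)] - KL(q || p), as in Eq. (3)."""
    return recon_loglik - gaussian_kl(mu, logvar)
```

When the posterior collapses onto the prior (mu = 0, logvar = 0), the KL term vanishes and the latent code carries no information about x.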
2.2 Hyperbolic Space
Riemannian manifolds can provide a more powerful and meaningful embedding space for complex data with highly non-Euclidean structure that cannot be effectively captured in vectorial form (e.g., social networks, biology and computer graphics). Of particular interest is the hyperbolic space (Ganea et al., 2018), where (i) the relatively simple geometry allows tractable computations, and (ii) the exponential growth of distance in finite dimensions naturally embeds rich hierarchical structures in a compact form.
Riemannian Geometry.
An n-dimensional Riemannian manifold M is a set of points locally similar to the linear space R^n. At each point x of the manifold M, we can define a real vector space T_x M that is tangent to M, along with an associated metric tensor g_x, which is an inner product on T_x M. Intuitively, a Riemannian manifold behaves like a vector space only in its infinitesimal neighborhood, allowing the generalization of common notions like angle, straight line and distance to a smooth manifold. For each tangent space T_x M, there exists a specific one-to-one map exp_x from a ball at the origin of T_x M to a neighborhood of x on M, called the exponential map. We refer to the inverse of an exponential map as the logarithm map, denoted as log_x. In addition, a parallel transport P_{x→y} intuitively transports tangent vectors along a “straight” line between x and y, so that they remain “parallel”. This is the basic machinery that allows us to generalize distributions and computations to the hyperbolic space, as detailed in later sections.

Poincaré Ball Model.
Hyperbolic geometry is a non-Euclidean geometry with constant negative curvature. As a classical example of hyperbolic space, an n-dimensional Poincaré ball with curvature −c (i.e., radius 1/√c) can be denoted as B_c^n = {x ∈ R^n : c‖x‖² < 1}, with its metric tensor given by g_x^c = (λ_x^c)² g^E, where λ_x^c = 2/(1 − c‖x‖²) and g^E denotes the regular Euclidean metric tensor. Intuitively, as x moves closer to the boundary (‖x‖ → 1/√c), the hyperbolic distance between x and a nearby y diverges at the rate λ_x^c. This implies significant representation capacity, as very dissimilar objects can be encoded on a compact domain.
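As a concrete illustration of this boundary behavior, the following NumPy sketch (illustrative only; curvature c = 1 and function names are our assumptions) computes the Poincaré distance via Möbius addition and shows that the same Euclidean gap costs far more hyperbolic distance near the rim:

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    """Mobius addition x (+)_c y on the Poincare ball (Eq. 4)."""
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    return num / (1 + 2 * c * xy + c ** 2 * x2 * y2)

def poincare_dist(x, y, c=1.0):
    """Geodesic distance d_c(x, y) = (2/sqrt(c)) * artanh( sqrt(c) ||(-x) (+)_c y|| )."""
    return (2.0 / np.sqrt(c)) * np.arctanh(
        np.sqrt(c) * np.linalg.norm(mobius_add(-x, y, c)))

# the same Euclidean gap of 0.05 costs several times more hyperbolic
# distance near the rim than near the center
d_center = poincare_dist(np.array([0.10, 0.0]), np.array([0.15, 0.0]))
d_rim    = poincare_dist(np.array([0.90, 0.0]), np.array([0.95, 0.0]))
```

Here d_rim is roughly seven times d_center, which is the “exponential growth of distance” that lets dissimilar objects coexist on a compact domain.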
Mathematical Operations.
We review the closed-form mathematical operations that enable differentiable training for hyperbolic-space models, namely the hyperbolic algebra (vector addition) and tangent-space computations (exponential/logarithm maps and parallel transport). The hyperbolic algebra is formulated under the framework of gyrovector spaces (Ungar, 2008), with the addition of two points x, y ∈ B_c^n given by the Möbius addition:

x ⊕_c y = ( (1 + 2c⟨x, y⟩ + c‖y‖²) x + (1 − c‖x‖²) y ) / ( 1 + 2c⟨x, y⟩ + c²‖x‖²‖y‖² ).   (4)
For any point x ∈ B_c^n, the exponential map exp_x^c and the logarithm map log_x^c are given for v ∈ T_x B_c^n and y ∈ B_c^n by

exp_x^c(v) = x ⊕_c ( tanh(√c λ_x^c ‖v‖ / 2) v / (√c ‖v‖) ),
log_x^c(y) = (2 / (√c λ_x^c)) arctanh(√c ‖−x ⊕_c y‖) (−x ⊕_c y) / ‖−x ⊕_c y‖,   (5)

where λ_x^c = 2/(1 − c‖x‖²). Note that the Poincaré ball model is geodesically complete, in the sense that exp_x^c is well-defined on the full tangent space T_x B_c^n. The parallel transport map of a vector v ∈ T_x B_c^n to another tangent space T_y B_c^n is given by

P_{x→y}^c(v) = (λ_x^c / λ_y^c) gyr[y, −x] v,   (6)

where gyr[·,·] denotes the gyration operator (Ungar, 2008).
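The operations in (4)–(6) admit a compact NumPy sketch (illustrative, not the authors' implementation; the parallel transport shown is the special case from the origin, where the gyration reduces to the identity):

```python
import numpy as np

def mobius_add(x, y, c=1.0):                          # Eq. (4)
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    return num / (1 + 2 * c * xy + c ** 2 * x2 * y2)

def lam(x, c=1.0):
    """Conformal factor lambda_x^c = 2 / (1 - c ||x||^2)."""
    return 2.0 / (1.0 - c * np.dot(x, x))

def exp_map(x, v, c=1.0):                             # Eq. (5), exponential map
    nv = np.linalg.norm(v)
    if nv < 1e-15:
        return x.copy()
    w = np.tanh(np.sqrt(c) * lam(x, c) * nv / 2.0) * v / (np.sqrt(c) * nv)
    return mobius_add(x, w, c)

def log_map(x, y, c=1.0):                             # Eq. (5), logarithm map
    u = mobius_add(-x, y, c)
    nu = np.linalg.norm(u)
    if nu < 1e-15:
        return np.zeros_like(x)
    return (2.0 / (np.sqrt(c) * lam(x, c))) * np.arctanh(np.sqrt(c) * nu) * u / nu

def transport_from_origin(y, v, c=1.0):               # Eq. (6) with x = 0
    """gyr[y, 0] is the identity, so P_{0->y}(v) = (lambda_0 / lambda_y) v."""
    return (2.0 / lam(y, c)) * v
```

log_map(x, exp_map(x, v)) recovers v, which is the inverse property the reparametrized sampling of Section 3.1 relies on.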
3 Adversarial Poincaré VAE
In this section, we first introduce our hyperbolic encoder and decoder, and how to apply reparametrization. We then provide detailed descriptions on model implementation, explaining how the primaldual form of KL divergence can help stabilize training. Finally, we describe how to adopt VampPrior Tomczak and Welling (2018) to enhance performance. A summary of our model scheme is provided in Figure 2.
3.1 Flexible Wrapped Distribution Encoder
We begin by generalizing the standard normal distribution to a Poincaré ball (Ganea et al., 2018). While there are a few competing definitions of the hyperbolic normal, we choose the wrapped normal as our prior and variational posterior, largely due to its flexibility for more expressive generalization. A wrapped normal distribution is defined by the following sampling procedure: (i) sample a vector v from N(0, Σ) in the tangent space at the origin; (ii) parallel transport v to the tangent space at μ; (iii) use the exponential map to project the transported vector onto B_c^n. Putting these together, a latent sample z has the following reparametrizable form:

z = exp_μ^c( P_{0→μ}^c(v) ),  v ∼ N(0, Σ).   (7)
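The three-step sampling procedure in (7) can be sketched as follows (an illustrative NumPy implementation under curvature c = 1; function names are ours, not the authors'):

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    return num / (1 + 2 * c * xy + c ** 2 * x2 * y2)

def lam(x, c=1.0):
    return 2.0 / (1.0 - c * np.dot(x, x))

def exp_map(x, v, c=1.0):
    nv = np.linalg.norm(v)
    if nv < 1e-15:
        return x.copy()
    w = np.tanh(np.sqrt(c) * lam(x, c) * nv / 2.0) * v / (np.sqrt(c) * nv)
    return mobius_add(x, w, c)

def sample_wrapped_normal(mu, sigma, c=1.0, rng=None):
    """Draw z per Eq. (7): sample in T_0, transport to T_mu, project with exp_mu."""
    rng = rng or np.random.default_rng()
    v = rng.normal(0.0, sigma, size=mu.shape)   # (i)  v ~ N(0, sigma^2 I) at the origin
    u = (2.0 / lam(mu, c)) * v                  # (ii) parallel transport 0 -> mu
    return exp_map(mu, u, c)                    # (iii) onto the Poincare ball
```

Every draw lands strictly inside the ball, and for small sigma the samples concentrate around mu, as expected of a location-scale family.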
For approximate posteriors, (μ, Σ) depends on the input x. We further generalize the (restrictive) hyperbolic wrapped normal by noting that, under the implicit VAE framework Fang et al. (2019), the approximate posterior need not be analytically tractable. This allows us to replace the tangent-space sampling step in (7) with a more flexible implicit distribution, from which we draw samples as v = G_φ(x, ε) with noise ε ∼ N(0, I). Note that μ can now be regarded as a deterministic displacement vector that anchors embeddings to the correct semantic neighborhood, allowing the stochastic v to focus only on modeling the local uncertainty of the semantic embedding. The synergy between the deterministic and stochastic parts enables efficient representation learning relative to existing alternatives. For simplicity, we denote the encoder network as q_φ(z|x), which contains μ and G, with parameters φ.

3.2 Poincaré Decoder
To build a geometry-aware decoder for a hyperbolic latent code, we follow Ganea et al. (2018) and use a generalized linear function analogously defined in the hyperbolic space. A Euclidean linear function takes the form f_{a,p}(z) = ⟨a, z − p⟩ = sign(⟨a, z − p⟩) ‖a‖ d(z, H_{a,p}), where a is the coefficient, p is the intercept, H_{a,p} is a hyperplane passing through p with a as the normal direction, and d(z, H_{a,p}) is the distance between z and the hyperplane. The counterpart in the Poincaré ball analogously writes

f_{a,p}^c(z) = sign(⟨−p ⊕_c z, a⟩) ‖a‖_p d_c(z, H̃_{a,p}^c),   (8)

where ‖·‖_p denotes the norm induced by the metric at p, and H̃_{a,p}^c and d_c(z, H̃_{a,p}^c) are the gyroplane and the distance between z and the gyroplane, respectively. Specifically, we use the hyperbolic linear function in (8) to extract features from the Poincaré embedding z. The features are the input to the RNN decoder. We denote the combined network of (8) and the RNN decoder as p_θ(x|z), where the parameters θ contain {a, p} and the RNN weights.
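The gyroplane feature in (8) can be sketched as follows, using the closed-form gyroplane distance of Ganea et al. (2018) (illustrative; for simplicity we scale by the plain Euclidean norm of a rather than the conformally scaled ‖a‖_p, and assume curvature c = 1):

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    return num / (1 + 2 * c * xy + c ** 2 * x2 * y2)

def gyroplane_feature(z, a, p, c=1.0):
    """Signed-distance feature: the side of the gyroplane through p with
    normal a, times the hyperbolic distance to it (Ganea et al., 2018),
    scaled here by the Euclidean ||a|| (a simplification of Eq. 8)."""
    w = mobius_add(-p, z, c)
    inner = np.dot(w, a)
    denom = (1.0 - c * np.dot(w, w)) * np.linalg.norm(a)
    dist = (1.0 / np.sqrt(c)) * np.arcsinh(2.0 * np.sqrt(c) * abs(inner) / denom)
    return np.sign(inner) * np.linalg.norm(a) * dist
```

A stack of such features, one per learned pair (a_k, p_k), would form the input to the RNN decoder.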
3.3 Implementing APoVAE
While it is straightforward to compute the ELBO in (1) via Monte Carlo estimates using the explicit wrapped normal density Mathieu et al. (2019), we empirically observe that: (i) the normal assumption restricts the expressiveness of the model; and (ii) the wrapped normal likelihood makes training severely unstable. Therefore, we appeal to a primal-dual view of VAE training to overcome such difficulties Rockafellar and others (1966); Dai et al. (2018); Fang et al. (2019). Specifically, the KL term in (3) can be reformulated via Fenchel duality as:

KL( q_φ(z|x) ‖ p(z) ) = max_ν { E_{q_φ(z|x)}[ν_ψ(x, z)] − E_{p(z)}[exp(ν_ψ(x, z))] } + 1,   (9)

where ν_ψ is the (auxiliary) dual function (i.e., a neural network) with parameters ψ. The primal-dual view of the KL term enhances the approximation ability, while remaining computationally tractable. Meanwhile, since the posterior density in the original KL term is replaced by the dual function ν_ψ, we avoid direct computation of the probability density function of the wrapped normal distribution.
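The dual form in (9) can be sanity-checked on a toy example where the optimal dual function is known in closed form: for two 1-D Gaussians q = N(1, 1) and p = N(0, 1), ν* = log q/p attains the bound, and a Monte Carlo plug-in recovers the analytic KL of 0.5 (an illustrative NumPy sketch, not the adversarial training procedure itself, which parameterizes ν with a network):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# q = N(1, 1), p = N(0, 1); the analytic KL(q || p) is 0.5
nu_star = lambda z: z - 0.5          # log q(z)/p(z), the optimal dual function

zq = rng.normal(1.0, 1.0, n)         # samples from the "posterior" q
zp = rng.normal(0.0, 1.0, n)         # samples from the "prior" p

# Fenchel-dual estimate, Eq. (9): E_q[nu] - E_p[exp(nu)] + 1
kl_dual = nu_star(zq).mean() - np.exp(nu_star(zp)).mean() + 1.0
```

The estimate kl_dual is close to 0.5. Crucially, only expectations over samples appear, which is what allows replacing the intractable wrapped-normal density with a learned ν_ψ.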
To train our proposed APoVAE with the primal-dual form of the VAE objective, we follow the training schemes of coupled variational Bayes (CVB) Dai et al. (2018) and implicit VAE Fang et al. (2019), which optimize the objective adversarially. Specifically, we update the parameters ψ of the dual function to maximize:

J(ψ) = E_{x∼D}[ E_{q_φ(z|x)}[ν_ψ(x, z)] − E_{p(z)}[exp(ν_ψ(x, z))] ],   (10)

where E_{x∼D} denotes the expectation over the empirical distribution on observations. Accordingly, parameters θ and φ are updated to maximize:

J(θ, φ) = E_{x∼D} E_{q_φ(z|x)}[ log p_θ(x|z) − ν_ψ(x, z) ].   (11)
Note that the term E_{q_φ(z|x)}[ν_ψ(x, z)] is maximized in (10) while it is minimized in (11), i.e., adversarial learning. In other words, one can consider the dual function ν_ψ as a discriminative network that distinguishes between samples from the prior p(z) and the variational posterior q_φ(z|x), both of which are paired with the input data x.
3.4 Data-driven Prior
While a standard normal prior is a simple choice in Euclidean space, we argue that it induces bias in the hyperbolic setup. This is because natural sentences carry specific meaning, and it is unrealistic to have the bulk of the probability mass concentrated at the center (this holds in low dimensions; in high dimensions, the mass instead concentrates near the surface of a sphere, which may partly explain why cosine similarity often works better than Euclidean distance in NLP applications).
To reduce the bias induced by a pre-fixed prior, we consider a data-driven alternative that also serves to close the variational gap. In this work, we adopt the VampPrior framework for this purpose Tomczak and Welling (2018), which is a mixture of variational posteriors conditioned on learnable pseudo data points. Specifically, we consider the prior as a learnable distribution given by

p_λ(z) = (1/K) Σ_{k=1}^K q_φ(z | u_k),   (12)

where q_φ is the learned approximate posterior, and we call the learnable parameters {u_k}_{k=1}^K pseudo-inputs. Intuitively, p_λ(z) seeks to match the aggregated posterior Makhzani et al. (2015), q(z) = E_{x∼D}[ q_φ(z|x) ], in a cost-efficient manner via parameterizing the pseudo-inputs. By replacing the prior distribution in (10) with p_λ(z), we complete the final objective of the proposed APoVAE. The detailed training procedure is summarized in Algorithm 1.
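The mixture in (12) can be sketched as a log-density computation (illustrative NumPy code with Gaussian components and a toy one-dimensional encoder; in the actual model the components are wrapped normals on the ball, and the names here are ours):

```python
import numpy as np

def vamp_log_prior(z, pseudo_inputs, encode):
    """log p_lambda(z) = log (1/K) sum_k q(z | u_k), as in Eq. (12).
    `encode` maps a pseudo-input u_k to the parameters (mu, sigma) of a
    one-dimensional Gaussian posterior (a stand-in for the wrapped normal)."""
    log_comps = []
    for u in pseudo_inputs:
        mu, sigma = encode(u)
        log_comps.append(-0.5 * ((z - mu) / sigma) ** 2
                         - np.log(sigma) - 0.5 * np.log(2.0 * np.pi))
    log_comps = np.stack(log_comps)          # shape (K, ...)
    m = log_comps.max(axis=0)                # log-sum-exp for numerical stability
    return m + np.log(np.exp(log_comps - m).mean(axis=0))
```

Because the mixture components reuse q_φ, gradients with respect to the prior flow into both the pseudo-inputs u_k and the shared encoder parameters, which is what couples the prior and posterior.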
4 Related Work
VAE for Text Generation.
Many VAE models have been proposed for text generation, most of which focus on solving the posterior-collapse issue. The most popular strategy is to alter the training dynamics, keeping the encoder away from bad local optima. For example, variants of KL annealing Bowman et al. (2016); Zhao et al. (2018); Fu (2019) dynamically adjust the weight on the KL penalty term as training progresses; lagging VAE He et al. (2019) aggressively optimizes the encoder before each decoder update, to overcome the imbalanced training between the encoder and decoder. Alternative strategies have also been proposed based on competing theories or heuristics. δ-VAE Razavi et al. (2019) tackles this issue by enforcing a minimum KL divergence between the posterior and the prior. Yang et al. (2017) attributes the collapse to the autoregressive design of the decoder and advocates alternative architectures. A semi-amortized inference network is considered by Kim et al. (2018) to bridge the amortization gap between the log-likelihood and the ELBO.

Recent work has also shown that posterior collapse can be ameliorated by using priors and variational posteriors more expressive than the Gaussian. A flow-based VAE is considered in Ziegler and Rush (2019) to enhance the flexibility of prior distributions. A topic-guided prior is proposed in Wang et al. (2019) to achieve more controllable text generation. Fang et al. (2019) explores implicit sample-based representations, without requiring an explicit density form for the approximate posterior. Xu and Durrett (2018) considers replacing the Gaussian with the spherical von Mises–Fisher (vMF) distribution. Compared to this prior art, our model features structured representation in hyperbolic space, which not only captures latent hierarchies but also combats posterior collapse.
Hyperbolic Space Representation Learning.
There has been a recent surge of interest in representation learning in hyperbolic space, largely due to its exceptional effectiveness in modeling data with underlying graph structure (Chamberlain et al., 2017), such as relation nets (Nickel and Kiela, 2017). In the context of NLP, hyperbolic geometry has been considered for word embeddings (Tifrea et al., 2018). A popular vehicle for hyperbolic representation learning is the autoencoder (AE) framework (Grattarola et al., 2019; Ovinnikov, 2019), where the decoders are built to efficiently exploit the hyperbolic geometry (Ganea et al., 2018). Closest to our APoVAE are the hyperbolic VAEs (Mathieu et al., 2019; Nagano et al., 2019), where wrapped normal distributions have also been used. Drawing power from the dual form of the KL divergence, the proposed APoVAE features an implicit posterior and a data-driven prior, showing improved training stability.
5 Experiments
We evaluate the proposed model on two tasks: (i) language modeling, and (ii) dialog-response generation, with quantitative results, human evaluation and qualitative analysis.
5.1 Experimental Setup
Datasets.
We use three datasets for language modeling: Penn Treebank (PTB) Marcus et al. (1993), Yahoo and Yelp corpora Yang et al. (2017). PTB contains one million words of 1989 Wall Street Journal material annotated in Treebank II style, with 42k sentences of varying lengths. Yahoo and Yelp are much larger datasets, each containing 100k sentences with greater average length.
Following Gu et al. (2019), we consider two datasets for dialogue-response generation: Switchboard Godfrey and Holliman (1997) and DailyDialog Li et al. (2017). The former contains 2.4k two-sided telephone conversations, manually transcribed and aligned. The latter contains 13k multi-turn conversations about daily life, crawled from websites for English learners. We split the data into training, validation and test sets following the protocol described in Zhao et al. (2017); Shen et al. (2018).
Evaluation Metrics.
To benchmark language-modeling performance, we report the negative ELBO and perplexity (PPL) of APoVAE and the baselines. To verify that the proposed APoVAE is more resistant to posterior collapse, we also report the KL divergence and the mutual information (MI) between x and z He et al. (2019). The number of active units (AU) of the latent code is also reported, where the activity of a latent dimension d is measured as A_d = Cov_x( E_{z∼q_φ(z|x)}[z_d] ), and the dimension is defined as active if A_d exceeds a small threshold (0.01 by convention).
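The AU statistic can be computed from the matrix of posterior means as follows (an illustrative sketch; the 0.01 threshold follows common practice rather than a value stated here):

```python
import numpy as np

def active_units(post_means, threshold=0.01):
    """Count active latent dimensions. `post_means` is an (N, d) matrix whose
    rows are the posterior means E_{z ~ q(z|x_i)}[z]; the activity
    A_d = Cov_x(E[z_d | x]) is estimated as the per-dimension variance
    of these means across the dataset."""
    A = post_means.var(axis=0)
    return int(np.sum(A > threshold))
```

A fully collapsed model has posterior means that do not vary with x, so every A_d is near zero and AU = 0, matching the VAE rows of Table 1.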
For dialogue-response generation, we adopt the evaluation metrics used in previous studies Zhao et al. (2017); Gu et al. (2019), including BLEU Papineni et al. (2002), BOW Liu et al. (2016), and intra/inter-dist values Gu et al. (2019). The first two metrics assess the relevance of the generated responses, and the third evaluates diversity. Detailed definitions of these metrics are provided in the Appendix.

Model Implementation.
For language modeling, we adopt an LSTM Hochreiter and Schmidhuber (1997) for both the encoder and decoder. The curvature hyperparameter c is set following Mathieu et al. (2019). For dialogue-response generation, we modify APoVAE following the conditional VAE framework Zhao et al. (2017). Specifically, an extra input of context embedding is supplied to the model (i.e., both the encoder and decoder are conditioned on the dialog context). The prior is a wrapped normal conditioned on the context embedding, learned together with the variational posterior. Additional details are provided in the Appendix.
Model  ELBO  PPL  KL  MI  AU 
PTB  
VAE  102.6  108.26  1.1  0.8  2
β-VAE  104.5  117.92  7.5  3.1  5
SAVAE  102.6  107.71  1.2  0.7  2 
vMFVAE  95.8  93.7  2.9  3.2  21 
iVAE  87.6  54.46  6.3  3.5  32 
APoVAE  87.2  53.32  8.4  4.8  32 
APoVAE+VP  87.0  53.02  8.9  4.5  32 
Yahoo  
VAE  328.6  61.21  0.0  0.0  0
β-VAE  328.7  61.29  6.3  2.8  8
SAVAE  327.2  60.15  5.2  2.9  10 
LAGVAE  326.7  59.77  5.7  2.9  15 
vMFVAE  318.5  53.9  6.3  3.7  23 
iVAE  309.5  48.22  8.0  4.4  32 
APoVAE  286.2  47.00  6.9  4.1  32 
APoVAE+VP  285.6  46.61  8.1  4.9  32 
Yelp  
VAE  357.9  40.56  0.0  0.0  0
β-VAE  358.2  40.69  4.2  2.0  4
SAVAE  357.8  40.51  2.8  1.7  8 
LAGVAE  355.9  39.73  3.8  2.4  11 
vMFVAE  356.2  51.0  4.1  3.9  13 
iVAE  348.2  36.70  7.6  4.6  32 
APoVAE  319.7  34.10  12.1  7.5  32 
APoVAE+VP  316.4  32.91  12.7  6.2  32 
Model  BLEU  BOW  Intra-dist  Inter-dist
  R  P  F1  A  E  G  dist-1  dist-2  dist-1  dist-2
SwitchBoard  
CVAE  0.295  0.258  0.275  0.836  0.572  0.846  0.803  0.415  0.112  0.102 
CVAEBOW  0.298  0.272  0.284  0.828  0.555  0.840  0.819  0.493  0.107  0.099 
CVAECO  0.299  0.269  0.283  0.839  0.557  0.855  0.863  0.581  0.111  0.110 
DialogWAE  0.394  0.254  0.309  0.897  0.627  0.887  0.713  0.651  0.245  0.413 
iVAE  0.427  0.254  0.319  0.930  0.670  0.900  0.828  0.692  0.391  0.668 
APoVAE  0.438  0.261  0.328  0.937  0.683  0.904  0.861  0.792  0.445  0.717 
DailyDialog  
CVAE  0.265  0.222  0.242  0.923  0.543  0.811  0.938  0.973  0.177  0.222 
CVAEBOW  0.256  0.224  0.239  0.923  0.540  0.812  0.947  0.976  0.165  0.206 
CVAECO  0.259  0.244  0.251  0.914  0.530  0.818  0.821  0.911  0.106  0.126 
DialogWAE  0.341  0.278  0.306  0.948  0.578  0.846  0.830  0.940  0.327  0.583 
iVAE  0.355  0.239  0.285  0.951  0.609  0.872  0.897  0.975  0.501  0.868 
APoVAE  0.359  0.265  0.305  0.954  0.616  0.873  0.919  0.989  0.511  0.869 
5.2 Experimental Results
Language Modeling.
Table 1 shows results on language modeling. We mainly compare APoVAE with other VAE-based solutions, including β-VAE Higgins et al. (2017), SAVAE Kim et al. (2018), lagging VAE (LAGVAE) He et al. (2019), vMFVAE Xu and Durrett (2018) and iVAE Fang et al. (2019). On all three datasets, our model achieves lower negative ELBO and PPL than the other models, demonstrating its strong ability to model sequential text data. Meanwhile, the larger KL term and higher mutual information (between x and z) of APoVAE indicate its robustness against posterior collapse. In addition, the data-driven prior (denoted APoVAE+VP) further boosts performance, especially on negative ELBO and PPL.
Visualization.
To verify our hypothesis that the proposed model is capable of learning latent tree structures in text data, we visualize the two-dimensional projection of the learned latent codes in Figure 3. For visualization, we randomly draw 5k samples from the PTB test set and encode them into the latent space using the APoVAE encoder. We color-code each sentence based on its length (i.e., blue for long sentences and red for short sentences). Only a small portion of the data are substantially longer, and human inspection verified that most of these are composed of multiple sub-sentences; as such, we exclude them from our analysis.
As shown in Figure 3, longer sentences (dark blue) tend to occupy the outer rim of the Poincaré ball, while shorter ones (dark red) concentrate in the inner area. In addition, we draw a long sentence (dark blue) from the test set and manually shorten it to create several variants of different lengths (ranging from 6 to 27), which are related in a hierarchical manner based on human judgement. We visualize their latent codes projected by the trained APoVAE. The resulting plot is consistent with a hierarchical structure: as the sentence becomes more specific, the embedding moves outward. We also decode from the neighbours of these latent codes; the outputs (see the Appendix) exhibit a hierarchical structure similar to that of the input sentences.
Dataset  Metric  APoVAE vs. iVAE  APoVAE vs. DialogWAE
  win  loss  tie  win  loss  tie
SwitchBoard  Inform.  52.8  27.9  19.3  63.7  27.1  19.2
  Coherence  41.7  35.5  22.8  41.2  34.4  24.4
  Diversity  51.2  26.4  22.4  62.1  25.1  12.8
DailyDialog  Inform.  45.4  26.9  17.7  46.1  26.5  27.4
  Coherence  40.1  25.9  24.0  40.7  24.2  25.1
  Diversity  43.9  30.8  25.3  47.5  31.4  21.1
Dialogue Response Generation.
Results on SwitchBoard and DailyDialog are summarized in Table 2. Our proposed model generates comparable or better responses than the baseline models in terms of both relevance (BLEU and BOW) and diversity (intra/inter-dist). For SwitchBoard, APoVAE improves the average recall from 0.427 (by iVAE) to 0.438, while significantly enhancing generation diversity (e.g., from 0.692 to 0.792 for intra-dist2). Similar trends are observed for DailyDialog. Three examples are provided in Figure 4; more can be found in the Appendix.
Human Evaluation.
We further perform human evaluation via Amazon Mechanical Turk. We asked the turkers to compare generated responses from two models, and assess each model’s informativeness, relevance to the dialog context (coherence), and diversity. We use randomly sampled contexts from the test set, each assessed by three judges. In order to evaluate diversity, 5 responses are generated for each dialog context. For quality control, only workers with a lifetime task approval rating greater than 98% were allowed to participate in our study. Table 3 summarizes the human evaluation results. The responses generated by our model are clearly preferred by the judges compared with other competing methods.
6 Conclusion
We present APoVAE, a novel model for text generation in hyperbolic space. Our model can learn latent hierarchies in natural language via the use of wrapped normals for the prior. A primaldual view of KL divergence is adopted for robust model training. Experiments on language modeling and dialog response generation demonstrate the superiority of the model. For future work, we plan to apply APoVAE to other text generation tasks such as image captioning.
References
Bowman et al. (2016). Generating sentences from a continuous space. In CoNLL.
Chamberlain et al. (2017). Neural embeddings of graphs in hyperbolic space. arXiv preprint arXiv:1705.10359.
Dai et al. (2018). Coupled variational Bayes via optimization embedding. In NeurIPS.
De Sa et al. (2018). Representation tradeoffs for hyperbolic embeddings. Proceedings of Machine Learning Research.
Fang et al. (2019). Implicit deep latent variable models for text generation. In EMNLP.
Fu et al. (2019). Cyclical annealing schedule: a simple approach to mitigating KL vanishing. In NAACL.
Ganea et al. (2018). Hyperbolic neural networks. In NeurIPS.
Godfrey and Holliman (1997). Switchboard-1 release 2. Linguistic Data Consortium, Philadelphia.
Grattarola et al. (2019). Adversarial autoencoders with constant-curvature latent manifolds. Applied Soft Computing 81, pp. 105511.
Gu et al. (2019). DialogWAE: multimodal response generation with conditional Wasserstein autoencoder. In ICLR.
He et al. (2019). Lagging inference networks and posterior collapse in variational autoencoders. In ICLR.
Higgins et al. (2017). beta-VAE: learning basic visual concepts with a constrained variational framework. In ICLR.
Hochreiter and Schmidhuber (1997). Long short-term memory. Neural Computation.
Kim et al. (2018). Semi-amortized variational autoencoders. In ICML.
Kingma and Welling (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Li et al. (2017). DailyDialog: a manually labelled multi-turn dialogue dataset. arXiv preprint arXiv:1710.03957.
Linial et al. (1995). The geometry of graphs and some of its algorithmic applications. Combinatorica 15(2), pp. 215–245.
Liu et al. (2016). How NOT to evaluate your dialogue system: an empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023.
Makhzani et al. (2015). Adversarial autoencoders. arXiv preprint arXiv:1511.05644.
Marcus et al. (1993). Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics.
Mathieu et al. (2019). Continuous hierarchical representations with Poincaré variational autoencoders. In NeurIPS.
Nagano et al. (2019). A wrapped normal distribution on hyperbolic space for gradient-based learning. In ICML.
Nickel and Kiela (2017). Poincaré embeddings for learning hierarchical representations. In NeurIPS.
Ovinnikov (2019). Poincaré Wasserstein autoencoder. arXiv preprint arXiv:1901.01427.
Papineni et al. (2002). BLEU: a method for automatic evaluation of machine translation. In ACL.
Razavi et al. (2019). Preventing posterior collapse with delta-VAEs. In ICLR.
Rezende et al. (2014). Stochastic backpropagation and approximate inference in deep generative models. In ICML.
Rockafellar and others (1966). Extension of Fenchel's duality theorem for convex functions. Duke Mathematical Journal.
Sarkar (2011). Low distortion Delaunay embedding of trees in hyperbolic plane. In International Symposium on Graph Drawing, pp. 355–366.
Shen et al. (2018). Improving variational encoder-decoders in dialogue generation. In AAAI.
Shi et al. (2019). Fixing Gaussian mixture VAEs for interpretable text generation. arXiv preprint arXiv:1906.06719.
Tifrea et al. (2018). Poincaré GloVe: hyperbolic word embeddings. arXiv preprint arXiv:1810.06546.
Tomczak and Welling (2018). VAE with a VampPrior. In AISTATS.
Ungar (2008). A gyrovector space approach to hyperbolic geometry. Synthesis Lectures on Mathematics and Statistics 1(1), pp. 1–194.
Wang et al. (2019). Topic-guided variational autoencoders for text generation. arXiv preprint arXiv:1903.07137.
Xu and Durrett (2018). Spherical latent spaces for stable variational autoencoders. In EMNLP.
Yang et al. (2017). Improved variational autoencoders for text modeling using dilated convolutions. In ICML.
Zhao et al. (2019). InfoVAE: information maximizing variational autoencoders. In AAAI.
Zhao et al. (2017). Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In ACL.
Ziegler and Rush (2019). Latent normalizing flows for discrete sequences. In ICML.