1 Introduction
Topic models have been extensively explored in the Natural Language Processing (NLP) community for unsupervised knowledge discovery. Latent Dirichlet Allocation (LDA) Blei et al. (2003), the most popular topic model, has been extended Lin and He (2009); Zhou et al. (2014); Cheng et al. (2014) for various extraction tasks. Due to the difficulty of exact inference, most LDA variants require approximate inference methods, such as mean-field methods and collapsed Gibbs sampling. However, these approximate approaches have the drawback that even small changes to the modeling assumptions require re-deriving the inference algorithm, which can be mathematically arduous.
One possible way of addressing this limitation is through neural topic models, which employ a black-box inference mechanism with neural networks. Inspired by the variational autoencoder (VAE) Kingma and Welling (2013), Srivastava and Sutton (2017) used the Logistic-Normal prior to mimic the simplex in the latent topic space and proposed the Neural Variational LDA (NVLDA). Moreover, they replaced the word-level mixture in NVLDA with a weighted product of experts and proposed ProdLDA Srivastava and Sutton (2017) to further enhance the topic quality.
Although Srivastava and Sutton (2017) used the Logistic-Normal distribution to approximate the Dirichlet distribution, the two are not exactly the same. An illustration of these two distributions is shown in Figure 1, in which the Logistic-Normal distribution does not exhibit multiple peaks at the vertices of the simplex as the Dirichlet distribution does; as such, it is less capable of capturing the multi-modality which is crucial in topic modeling Wallach et al. (2009). To deal with this limitation, Wang et al. (2019) proposed the Adversarial-neural Topic Model (ATM) based on adversarial training, which uses a generator network to capture the semantic patterns lying behind the documents. However, given a document, ATM is not able to infer the document-topic distribution, which is useful for downstream applications such as text clustering. Moreover, ATM takes the bag-of-words assumption and does not utilize any word relatedness information captured in word embeddings, which have been proved to be crucial for better performance in many NLP tasks Liu et al. (2018); Lei et al. (2018).
To address these limitations, we model topics with a Dirichlet prior and propose a novel Bidirectional Adversarial Topic model (BAT) based on bidirectional adversarial training. The proposed BAT employs a generator network to learn the projection function from a randomly-sampled document-topic distribution to a document-word distribution. Moreover, an encoder network is used to learn the inverse projection, transforming a document-word distribution into a document-topic distribution. Different from traditional models that often resort to analytic approximations, BAT employs a discriminator which aims to discriminate between the real distribution pair and the fake distribution pair, thereby helping the networks (generator and encoder) to learn the two-way projections better. During the adversarial training phase, the supervision signal provided by the discriminator guides the generator to construct a more realistic document and thus better capture the semantic patterns in text. Meanwhile, the encoder network is also guided to generate a more reasonable topic distribution conditioned on specific document-word distributions. Finally, to incorporate the word relatedness information captured by word embeddings, we extend BAT by modeling each topic with a multivariate Gaussian in the generator and propose the Bidirectional Adversarial Topic model with Gaussian (Gaussian-BAT).
The main contributions of the paper are:

We propose a novel Bidirectional Adversarial Topic (BAT) model, which is, to the best of our knowledge, the first attempt at using bidirectional adversarial training in neural topic modeling;

We extend BAT to incorporate word relatedness information into the modeling process and propose the Bidirectional Adversarial Topic model with Gaussian (Gaussian-BAT);

Experimental results on three public datasets show that BAT and Gaussian-BAT outperform the state-of-the-art approaches in terms of topic coherence measures. The effectiveness of BAT and Gaussian-BAT is further verified in text clustering.
2 Related Work
Our work is related to two lines of research: adversarial training and neural topic modeling.
2.1 Adversarial Training
Adversarial training, first employed in the Generative Adversarial Network (GAN) Goodfellow et al. (2014), has been extensively studied from both theoretical and practical perspectives.
Theoretically, Arjovsky et al. (2017) and Gulrajani et al. (2017) proposed the Wasserstein GAN, which employs the Wasserstein distance between the data distribution and the generated distribution as the training objective. To address the limitation that most GANs Goodfellow et al. (2014); Radford et al. (2015) could not project data into a latent space, Bidirectional Generative Adversarial Nets (BiGAN) Donahue et al. (2016) and Adversarially Learned Inference (ALI) Dumoulin et al. (2016) were proposed.
Adversarial training has also been extensively used for text generation. For example, SeqGAN Yu et al. (2017) incorporated a policy gradient strategy for text generation. RankGAN Lin et al. (2017) ranked a collection of human-written sentences to capture the language structure for improving the quality of text generation. To avoid mode collapse when dealing with discrete data, MaskGAN Fedus et al. (2018) used an actor-critic conditional GAN to fill in missing text conditioned on the context.
2.2 Neural Topic Modeling
To overcome the challenging exact inference of topic models based on directed graphs, a replicated softmax model (RSM) based on Restricted Boltzmann Machines was proposed in Hinton and Salakhutdinov (2009). Inspired by VAE, Miao et al. (2016) used the multivariate Gaussian as the prior distribution of the latent space and proposed the Neural Variational Document Model (NVDM) for text modeling. To model topics properly, the Gaussian Softmax Model (GSM) Miao et al. (2017), which constructs the topic distribution using a Gaussian distribution followed by a softmax transformation, was proposed based on the NVDM.
Likewise, to deal with the inappropriate Gaussian prior of NVDM, Srivastava and Sutton (2017) proposed the NVLDA, which approximates the Dirichlet prior using a Logistic-Normal distribution. Recently, the Adversarial-neural Topic Model (ATM) Wang et al. (2019) was proposed based on adversarial training; it models topics with a Dirichlet prior, which, unlike the logistic-normal prior, is able to capture the multi-modality, and it obtains better topics. Besides, the Adversarial-neural Event Model (AEM) Wang et al. (2019) was also proposed for open event extraction by representing each event as an entity distribution, a location distribution, a keyword distribution and a date distribution.
Despite the extensive exploration of this research field, little work has been done to incorporate the Dirichlet prior, word embeddings and bidirectional adversarial training into neural topic modeling. In this paper, we propose two novel topic modeling approaches, called BAT and Gaussian-BAT, which differ from existing approaches in the following aspects: (1) Unlike NVDM, GSM, NVLDA and ProdLDA, which model latent topics with a Gaussian or logistic-normal prior, BAT and Gaussian-BAT explicitly employ a Dirichlet prior to model topics; (2) Unlike ATM, which cannot infer the topic distribution of a given document, BAT and Gaussian-BAT use an encoder to generate the topic distribution corresponding to the document; (3) Unlike neural topic models that only utilize word co-occurrence information, Gaussian-BAT models each topic with a multivariate Gaussian and incorporates word relatedness into the modeling process.
3 Methodology
Our proposed neural topic models are based on bidirectional adversarial training Donahue et al. (2016) and aim to learn the two-way non-linear projection between two high-dimensional distributions. In this section, we first introduce the Bidirectional Adversarial Topic (BAT) model, which only employs word co-occurrence information. Then, building on BAT, we model topics with multivariate Gaussians in the generator of BAT and propose the Bidirectional Adversarial Topic model with Gaussian (Gaussian-BAT), which naturally incorporates the word relatedness information captured in word embeddings into the modeling process.
3.1 Bidirectional Adversarial Topic model
As depicted in Figure 2, the proposed BAT consists of three components: (1) The Encoder $E$ takes the $V$-dimensional document representation $\vec{d}_r$ sampled from the text corpus $C$ as input and transforms it into the corresponding $K$-dimensional topic distribution $\vec{\theta}_r$; (2) The Generator $G$ takes a random topic distribution $\vec{\theta}_f$ drawn from a Dirichlet prior as input and generates a $V$-dimensional fake word distribution $\vec{d}_f$; (3) The Discriminator $D$ takes the real distribution pair $\vec{p}_r = [\vec{\theta}_r; \vec{d}_r]$ and the fake distribution pair $\vec{p}_f = [\vec{\theta}_f; \vec{d}_f]$ as input and discriminates the real distribution pairs from the fake ones. The outputs of the discriminator are used as supervision signals to learn $E$, $G$ and $D$ during adversarial training. In what follows, we describe each component in more detail.
3.1.1 Encoder Network
The encoder $E$ learns a mapping function to transform the document-word distribution into the document-topic distribution. As shown in the top-left panel of Figure 2, it contains a $V$-dimensional document-word distribution layer, an $S$-dimensional representation layer and a $K$-dimensional document-topic distribution layer, where $V$ and $K$ denote the vocabulary size and the topic number, respectively.
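More concretely, for each document $d_j$ in the text corpus $C$, $E$ takes the document representation $\vec{d}_r^j$ as input, where $\vec{d}_r^j$ is the representation weighted by TF-IDF, calculated by:
$tf_{i,j} = \frac{n_{i,j}}{\sum_{v} n_{v,j}}, \quad idf_i = \log \frac{|C|}{|C_i|}, \quad \text{tf-idf}_{i,j} = tf_{i,j} \cdot idf_i, \quad d_r^{j,i} = \frac{\text{tf-idf}_{i,j}}{\sum_{v} \text{tf-idf}_{v,j}}$
where $n_{i,j}$ denotes the number of times the $i$-th word appears in document $d_j$, $|C|$ represents the number of documents in the corpus, and $|C_i|$ is the number of documents in the corpus that contain the $i$-th word. Thus, each document can be represented as a $V$-dimensional multinomial distribution $\vec{d}_r^j$, whose $i$-th dimension denotes the semantic consistency between the $i$-th word and the document.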
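The TF-IDF weighting described above can be illustrated with a minimal sketch. It assumes documents are given as token lists and that term frequency is normalized by document length; the function name `tfidf_representations`, the toy corpus, and the uniform fallback for all-zero rows are illustrative assumptions rather than details from the paper.

```python
import math
from collections import Counter

def tfidf_representations(docs):
    """Return (vocab, reps): one TF-IDF-weighted, sum-normalized vector per document."""
    vocab = sorted({w for doc in docs for w in doc})
    n_docs = len(docs)
    # Number of documents that contain each word (|C_i| in the text).
    doc_freq = Counter(w for doc in docs for w in set(doc))
    reps = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())
        weights = [
            (counts[w] / total) * math.log(n_docs / doc_freq[w]) if total else 0.0
            for w in vocab
        ]
        z = sum(weights)
        # Normalize so the vector sums to 1; assumed fallback: uniform if all weights are 0.
        reps.append([x / z for x in weights] if z > 0 else [1.0 / len(vocab)] * len(vocab))
    return vocab, reps

# Hypothetical toy corpus.
docs = [["game", "team", "win", "game"], ["vote", "election", "team"], ["vote", "poll"]]
vocab, reps = tfidf_representations(docs)
```

Each row of `reps` is a distribution over the vocabulary, so it can serve directly as the encoder input in a small experiment.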
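With the document representation $\vec{d}_r^j$ as input, $E$ first projects it into an $S$-dimensional semantic space through the representation layer as follows:
(1)  $\vec{h}_s^e = \mathrm{BN}(W_s^e \vec{d}_r^j + \vec{b}_s^e)$
(2)  $\vec{o}_s^e = \max(\vec{h}_s^e, \; leak \cdot \vec{h}_s^e)$
where $W_s^e \in \mathbb{R}^{S \times V}$ and $\vec{b}_s^e$ are the weight matrix and bias term of the representation layer, $\vec{h}_s^e$ is the state vector normalized by batch normalization $\mathrm{BN}(\cdot)$, $leak$ denotes the parameter of the LeakyReLU activation and $\vec{o}_s^e$ represents the output of the representation layer.
Then, the encoder transforms $\vec{o}_s^e$ into a $K$-dimensional topic space based on the equation below:
(3)  $\vec{\theta}_r^j = \mathrm{softmax}(W_t^e \vec{o}_s^e + \vec{b}_t^e)$
where $W_t^e \in \mathbb{R}^{K \times S}$ is the weight matrix of the topic distribution layer, $\vec{b}_t^e$ represents the bias term, $\vec{\theta}_r^j$ denotes the topic distribution corresponding to the input $\vec{d}_r^j$, and its $k$-th ($k \in \{1, 2, \dots, K\}$) dimension $\theta_r^{j,k}$ represents the proportion of the $k$-th topic in document $d_j$.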
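The encoder's forward pass (linear layer, batch normalization, LeakyReLU, then a softmax over topics) can be sketched in plain Python. This is a minimal sketch under stated assumptions: the weights are random rather than trained, batch normalization is applied over the mini-batch without learned scale and shift, and the sizes and `leak=0.2` are illustrative values not taken from the paper.

```python
import math
import random

def linear(W, x, b):
    """Compute W x + b for a single input vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]

def batch_norm(H, eps=1e-5):
    """Normalize each feature over the batch (assumption: no learned scale/shift)."""
    m = len(H)
    cols = list(zip(*H))
    means = [sum(c) / m for c in cols]
    variances = [sum((v - mu) ** 2 for v in c) / m for c, mu in zip(cols, means)]
    return [
        [(v - mu) / math.sqrt(var + eps) for v, mu, var in zip(h, means, variances)]
        for h in H
    ]

def leaky_relu(h, leak=0.2):
    """LeakyReLU written as max(h, leak * h), valid for 0 < leak < 1."""
    return [max(v, leak * v) for v in h]

def softmax(z):
    top = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - top) for v in z]
    s = sum(e)
    return [v / s for v in e]

def encode(batch, W_s, b_s, W_t, b_t, leak=0.2):
    """Map a batch of document-word distributions to document-topic distributions."""
    H = batch_norm([linear(W_s, d, b_s) for d in batch])
    return [softmax(linear(W_t, leaky_relu(h, leak), b_t)) for h in H]

random.seed(0)
V, S, K = 6, 4, 3  # hypothetical vocabulary, hidden and topic sizes
W_s = [[random.gauss(0, 0.1) for _ in range(V)] for _ in range(S)]
W_t = [[random.gauss(0, 0.1) for _ in range(S)] for _ in range(K)]
batch = [[1 / V] * V, [0.5, 0.5, 0.0, 0.0, 0.0, 0.0]]
thetas = encode(batch, W_s, [0.0] * S, W_t, [0.0] * K)
```

The same layer pattern, with the input and output sizes swapped, applies to a generator that maps topic distributions to word distributions.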
3.1.2 Generator network
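The generator $G$ is shown in the bottom-left panel of Figure 2. Contrary to the encoder, it provides an inverse projection from the document-topic distribution to the document-word distribution and contains a $K$-dimensional document-topic layer, an $S$-dimensional representation layer and a $V$-dimensional document-word distribution layer.
As pointed out in Wallach et al. (2009), the choice of a Dirichlet prior over the topic distribution is important for obtaining interpretable topics. Thus, BAT employs the Dirichlet prior parameterized with $\vec{\alpha}$ to mimic the multivariate simplex over the topic distribution $\vec{\theta}_f$. It can be drawn randomly based on the equation below:
(4)  $p(\vec{\theta}_f \mid \vec{\alpha}) = \mathrm{Dir}(\vec{\theta}_f \mid \vec{\alpha}) = \frac{1}{\Delta(\vec{\alpha})} \prod_{k=1}^{K} \left[\theta_f^k\right]^{\alpha_k - 1}$
where $\vec{\alpha}$ is the $K$-dimensional hyperparameter of the Dirichlet prior, $K$ is the topic number that should be set in BAT, $\theta_f^k \in [0, 1]$ follows the constraint that $\sum_{k=1}^{K} \theta_f^k = 1$ and represents the proportion of the $k$-th topic in the document, and the normalization term $\Delta(\vec{\alpha})$ is defined as $\frac{\prod_{k=1}^{K} \Gamma(\alpha_k)}{\Gamma\left(\sum_{k=1}^{K} \alpha_k\right)}$.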
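Drawing a topic distribution from a Dirichlet prior can be done by normalizing independent Gamma draws, a standard construction. The sketch below uses only the Python standard library; the concentration values `0.1` and `50.0` are illustrative assumptions chosen to contrast a sparse and a near-uniform draw, not settings from the paper.

```python
import random

def sample_dirichlet(alpha, rng=random):
    """Draw one sample from Dir(alpha) by normalizing Gamma(alpha_k, 1) variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [v / total for v in g]

random.seed(0)
K = 5  # hypothetical topic number
theta_sparse = sample_dirichlet([0.1] * K)  # small alpha: mass tends to sit near a vertex
theta_flat = sample_dirichlet([50.0] * K)   # large alpha: close to the uniform distribution
```

Samples with small concentration values tend to put most mass on a few topics, which is the behavior the Dirichlet prior is chosen for here.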
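To learn the transformation from the document-topic distribution to the document-word distribution, $G$ first projects $\vec{\theta}_f$ into an $S$-dimensional representation space based on the equations:
(5)  $\vec{a}_s^g = \mathrm{BN}(W_s^g \vec{\theta}_f + \vec{b}_s^g)$
(6)  $\vec{o}_s^g = \max(\vec{a}_s^g, \; leak \cdot \vec{a}_s^g)$
where $W_s^g \in \mathbb{R}^{S \times K}$ is the weight matrix of the representation layer, $\vec{b}_s^g$ represents the bias term, $\vec{a}_s^g$ is the state vector normalized by batch normalization, Eq. 6 represents the LeakyReLU activation parameterized with $leak$, and $\vec{o}_s^g$ is the output of the representation layer.
Then, to project $\vec{o}_s^g$ into the word distribution $\vec{d}_f$, a subnet containing a linear layer and a softmax layer is used, and the transformation follows:
(7)  $\vec{d}_f = \mathrm{softmax}(W_w^g \vec{o}_s^g + \vec{b}_w^g)$
where $W_w^g \in \mathbb{R}^{V \times S}$ and $\vec{b}_w^g$ are the weight matrix and bias of the word distribution layer, and $\vec{d}_f$ is the word distribution corresponding to $\vec{\theta}_f$. For each $v \in \{1, 2, \dots, V\}$, the $v$-th dimension $d_f^v$ is the probability of the $v$-th word in the fake document $\vec{d}_f$.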
3.1.3 Discriminator network
The discriminator $D$ consists of three layers (a $(V+K)$-dimensional joint distribution layer, an $S$-dimensional representation layer and an output layer), as shown in the right panel of Figure 2. It takes the real distribution pair $\vec{p}_r$ and the fake distribution pair $\vec{p}_f$ as input and then outputs $D_{out}$ to identify the input source (fake or real). Concretely, a higher value of $D_{out}$ indicates that $D$ is more inclined to predict the input as real, and vice versa.
3.2 BAT with Gaussian (Gaussian-BAT)
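In BAT, the generator models topics based on the bag-of-words assumption, as in most other neural topic models. To incorporate the word relatedness information captured in word embeddings Mikolov et al. (2013a, b); Pennington et al. (2014); Joulin et al. (2017); Athiwaratkun et al. (2018) into the inference process, we modify the generator of BAT and propose Gaussian-BAT, in which $G$ models each topic with a multivariate Gaussian, as shown in Figure 3.
Concretely, Gaussian-BAT employs the multivariate Gaussian $\mathcal{N}(\vec{\mu}_k, \Sigma_k)$ to model the $k$-th topic. Here, $\vec{\mu}_k$ and $\Sigma_k$ are trainable parameters that represent the mean and the covariance matrix, respectively. Following its probability density, for each word $v_n$, $n \in \{1, 2, \dots, V\}$, the probability in the $k$-th topic $\phi_{k,n}$ is calculated by:
(8)  $p(\vec{e}_{v_n} \mid topic = k) = \mathcal{N}(\vec{e}_{v_n}; \vec{\mu}_k, \Sigma_k) = \frac{\exp\left(-\frac{1}{2} (\vec{e}_{v_n} - \vec{\mu}_k)^{T} \Sigma_k^{-1} (\vec{e}_{v_n} - \vec{\mu}_k)\right)}{\sqrt{(2\pi)^{D_e} |\Sigma_k|}}$
(9)  $\phi_{k,n} = \frac{p(\vec{e}_{v_n} \mid topic = k)}{\sum_{n'=1}^{V} p(\vec{e}_{v_{n'}} \mid topic = k)}$
where $\vec{e}_{v_n}$ is the word embedding of the $n$-th word, $V$ is the vocabulary size, $|\Sigma_k|$ is the determinant of the covariance matrix $\Sigma_k$, $D_e$ is the dimension of the word embeddings, $p(\vec{e}_{v_n} \mid topic = k)$ is the probability calculated by the density, and $\vec{\phi}_k$ is the normalized word distribution of the $k$-th topic. With a randomly sampled topic distribution $\vec{\theta}_f$ and the calculated topic-word distributions $\{\vec{\phi}_1, \vec{\phi}_2, \dots, \vec{\phi}_K\}$, the fake word distribution $\vec{d}_f$ corresponding to $\vec{\theta}_f$ can be obtained by:
(10)  $\vec{d}_f = \sum_{k=1}^{K} \vec{\phi}_k \cdot \theta_k$
where $\theta_k$ is the topic proportion of the $k$-th topic. Then, $\vec{\theta}_f$ and $\vec{d}_f$ are concatenated to form the fake distribution pair $\vec{p}_f$, as shown in Figure 3. The encoder and discriminator of Gaussian-BAT are the same as in BAT, as shown in Figure 2. In our experiments, the pretrained 300-dimensional GloVe Pennington et al. (2014) embeddings are used.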
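The Gaussian generator can be illustrated with a small sketch: Gaussian densities of the word embeddings are normalized over the vocabulary to give each topic's word distribution, and these are mixed by the topic proportions. For brevity the sketch assumes diagonal covariance matrices (the text allows full covariances) and normalizes in log space for numerical stability; the 2-dimensional "embeddings" and all parameter values are hypothetical.

```python
import math

def log_gauss_diag(e, mu, var):
    """Log-density of N(mu, diag(var)) evaluated at embedding e."""
    quad = sum((x - m) ** 2 / v for x, m, v in zip(e, mu, var))
    log_det = sum(math.log(v) for v in var)
    return -0.5 * (quad + log_det + len(e) * math.log(2 * math.pi))

def topic_word_distribution(embeddings, mu, var):
    """Normalize the Gaussian densities over the vocabulary (softmax of log-densities)."""
    logp = [log_gauss_diag(e, mu, var) for e in embeddings]
    top = max(logp)
    w = [math.exp(lp - top) for lp in logp]
    z = sum(w)
    return [x / z for x in w]

def fake_word_distribution(theta, phis):
    """Mix the per-topic word distributions with the topic proportions."""
    vocab_size = len(phis[0])
    return [sum(t * phi[n] for t, phi in zip(theta, phis)) for n in range(vocab_size)]

# Hypothetical 2-d embeddings for a 4-word vocabulary and two topics (mean, variances).
embeddings = [[0.0, 0.0], [0.1, 0.0], [3.0, 3.0], [3.1, 2.9]]
topics = [([0.0, 0.0], [1.0, 1.0]), ([3.0, 3.0], [1.0, 1.0])]
phis = [topic_word_distribution(embeddings, mu, var) for mu, var in topics]
d_f = fake_word_distribution([0.7, 0.3], phis)
```

Words whose embeddings lie close to a topic's mean receive higher probability under that topic, which is how embedding similarity enters the word distribution.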
3.3 Objective and Training Procedure
In Figure 2, the real distribution pair $\vec{p}_r = [\vec{\theta}_r; \vec{d}_r]$ and the fake distribution pair $\vec{p}_f = [\vec{\theta}_f; \vec{d}_f]$ can be viewed as random samples drawn from two $(K+V)$-dimensional joint distributions $\mathbb{P}_r$ and $\mathbb{P}_f$, each of them comprising a $K$-dimensional Dirichlet distribution and a $V$-dimensional Dirichlet distribution. The training objective of BAT and Gaussian-BAT is to make the generated joint distribution $\mathbb{P}_f$ as close to the real joint distribution $\mathbb{P}_r$ as possible. In this way, a two-way projection between the document-topic distribution and the document-word distribution can be built by the learned encoder and generator.
To measure the distance between $\mathbb{P}_r$ and $\mathbb{P}_f$, we use the Wasserstein distance as the optimization objective, since it was shown to be more effective than the Jensen-Shannon divergence Arjovsky et al. (2017):
(11)  $Loss = \mathbb{E}_{\vec{p}_f \sim \mathbb{P}_f}\left[D(\vec{p}_f)\right] - \mathbb{E}_{\vec{p}_r \sim \mathbb{P}_r}\left[D(\vec{p}_r)\right]$
where $D(\cdot)$ represents the output signal of the discriminator. A higher value denotes that the discriminator is more inclined to consider the input as a real distribution pair, and vice versa. In addition, we use weight clipping, which was proposed to ensure the Lipschitz continuity Arjovsky et al. (2017) of $D$.
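The Wasserstein objective and weight clipping can be sketched on plain lists of discriminator scores. The sign convention is an assumption following common WGAN practice: the discriminator minimizes the mean score on fake pairs minus the mean score on real pairs, and the generator and encoder minimize the negation. The scores, weights and clipping value `0.01` are illustrative.

```python
def critic_loss(scores_fake, scores_real):
    """Empirical estimate of E[D(p_f)] - E[D(p_r)] over a mini-batch."""
    return sum(scores_fake) / len(scores_fake) - sum(scores_real) / len(scores_real)

def clip_weights(weights, c):
    """Clamp every parameter to [-c, c], applied after each discriminator update."""
    return [[max(-c, min(c, w)) for w in row] for row in weights]

# Hypothetical discriminator outputs for a batch of fake and real distribution pairs.
loss = critic_loss([-0.2, 0.1], [0.6, 0.8])
clipped = clip_weights([[0.5, -0.003], [-0.2, 0.01]], c=0.01)
```

A more negative `loss` means the discriminator separates real from fake pairs more strongly; clipping keeps its weights in a bounded box so that it remains Lipschitz.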
The training procedure of BAT and Gaussian-BAT is given in Algorithm 1. Here, $c$ is the clipping parameter, $n_d$ represents the number of discriminator iterations per generator iteration, $m$ is the batch size, $\alpha_1$ is the learning rate, $\beta_1$ and $\beta_2$ are hyperparameters of Adam Kingma and Ba (2014), and represents . In our experiments, we set the , , , , and .
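The alternation between discriminator updates and generator/encoder updates implied by the "discriminator iterations per generator iteration" parameter can be sketched as a schedule with stubbed update steps. This is an assumed skeleton in the style of standard WGAN training, not a reproduction of Algorithm 1; the callbacks and the values `3` and `5` are hypothetical.

```python
def train(num_iterations, n_d, discriminator_step, generator_encoder_step):
    """Run n_d discriminator updates before each generator/encoder update."""
    for _ in range(num_iterations):
        for _ in range(n_d):
            discriminator_step()      # e.g. one optimizer step on D, then weight clipping
        generator_encoder_step()      # e.g. one optimizer step on G and E

# Count how often each (stubbed) update would be called.
calls = {"D": 0, "GE": 0}
train(
    num_iterations=3,
    n_d=5,
    discriminator_step=lambda: calls.__setitem__("D", calls["D"] + 1),
    generator_encoder_step=lambda: calls.__setitem__("GE", calls["GE"] + 1),
)
```

In a real implementation the two callbacks would compute the losses on a freshly sampled mini-batch and apply the optimizer updates.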
3.4 Topic Generation and Cluster Inference
After model training, the learned $G$ and $E$ build a two-way projection between the document-topic distribution and the document-word distribution. Thus, $G$ and $E$ can be used for topic generation and cluster inference.
To generate the word distribution of each topic, we use $\vec{ts}_k$, a $K$-dimensional vector, as the one-hot encoding of the $k$-th topic. For example, $\vec{ts}_2 = [0, 1, 0, 0, 0, 0]^{T}$ in a six-topic setting. The word distribution of the $k$-th topic is then obtained by:
(12)  $\vec{\phi}_k = G(\vec{ts}_k)$
Likewise, given the document representation $\vec{d}_r$, the topic distribution $\vec{\theta}_r$ obtained by BAT/Gaussian-BAT can be used for cluster inference based on:
(13)  $\vec{\theta}_r = E(\vec{d}_r); \quad c_r = \arg\max_{k} \theta_r^{k}$
where $c_r$ denotes the inferred cluster of $\vec{d}_r$.
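Topic generation and cluster inference can be sketched with stand-in networks. The `toy_generator` and `toy_encoder` below are hypothetical replacements for trained models, built from a fixed topic-word matrix so the example is self-contained; topic and cluster indices are zero-based in the code.

```python
def one_hot(k, K):
    """One-hot encoding of topic k (zero-based) in a K-topic setting."""
    return [1.0 if i == k else 0.0 for i in range(K)]

def top_words(generator, vocab, K, n=10):
    """Feed each one-hot topic vector to the generator and keep the n most probable words."""
    topics = []
    for k in range(K):
        phi = generator(one_hot(k, K))
        ranked = sorted(range(len(vocab)), key=lambda i: phi[i], reverse=True)
        topics.append([vocab[i] for i in ranked[:n]])
    return topics

def infer_cluster(encoder, d_r):
    """Assign the document to the topic with the largest inferred proportion."""
    theta = encoder(d_r)
    return max(range(len(theta)), key=lambda k: theta[k])

# Hypothetical stand-ins for the trained generator and encoder.
vocab = ["vote", "election", "film", "actor"]
PHI = [[0.5, 0.4, 0.05, 0.05], [0.05, 0.05, 0.5, 0.4]]

def toy_generator(theta):
    return [sum(t * PHI[k][n] for k, t in enumerate(theta)) for n in range(len(vocab))]

def toy_encoder(d):
    scores = [sum(d[n] * PHI[k][n] for n in range(len(vocab))) for k in range(len(PHI))]
    total = sum(scores)
    return [s / total for s in scores]

topics = top_words(toy_generator, vocab, K=2, n=2)
cluster = infer_cluster(toy_encoder, [0.0, 0.1, 0.6, 0.3])
```

With trained networks in place of the stand-ins, `top_words` yields the word lists used for coherence evaluation and `infer_cluster` yields the assignments used for text clustering.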
4 Experiments
In this section, we first present the experimental setup which includes the datasets used and the baselines, followed by the experimental results.
4.1 Experimental Setup
We evaluate BAT and Gaussian-BAT on three datasets for topic extraction and text clustering: 20Newsgroups (http://qwone.com/~jason/20Newsgroups/), Grolier (https://cs.nyu.edu/~roweis/data/) and NYTimes (http://archive.ics.uci.edu/ml/datasets/Bag+of+Words). Details are summarized below:
20Newsgroups Lang (1995) is a collection of approximately 20,000 newsgroup articles, partitioned evenly across 20 different newsgroups.
Grolier is built from the Grolier Multimedia Encyclopedia, which covers almost all the fields in the world.
NYTimes is a collection of news articles published between 1987 and 2007, and contains a wide range of topics, such as sports, politics, education, etc.
We use the full datasets of 20Newsgroups and Grolier. For the NYTimes dataset, we randomly select 100,000 articles and remove the low frequency words. The final statistics are shown in Table 1:
Table 1: Statistics of the datasets.
| Dataset | #Doc (Train) | #Doc (Test) | #Words |
|---|---|---|---|
| 20Newsgroups | 11,259 | 7,488 | 1,995 |
| Grolier | 29,762 | - | 15,276 |
| NYTimes | 99,992 | - | 12,604 |
We choose the following models as baselines:
LDA Blei et al. (2003) extracts topics based on word co-occurrence patterns from documents. We implement LDA following the parameter setting suggested in Griffiths and Steyvers (2004).
NVDM Miao et al. (2016) is an unsupervised text modeling approach based on VAE. We use the original implementation of the paper (https://github.com/ysmiao/nvdm).
GSM Miao et al. (2017) is an enhanced topic model based on NVDM. We use the original implementation in our experiments (https://github.com/linkstrife/NVDM-GSM).
NVLDA Srivastava and Sutton (2017) is also built on VAE but with the logistic-normal prior. We use the implementation provided by the author (https://github.com/akashgit/autoencoding_vi_for_topic_models).
ProdLDA Srivastava and Sutton (2017) is a variant of NVLDA, in which the distribution over individual words is a product of experts. The original implementation is used.
ATM Wang et al. (2019) is a neural topic modeling approach based on adversarial training. We implement ATM following the parameter setting suggested in the original paper.
4.2 Topic Coherence Evaluation
Topic models are typically evaluated with the likelihood of held-out documents and topic coherence. However, Chang et al. (2009) showed that a higher likelihood of held-out documents does not correspond to human judgment of topic coherence. Thus, we follow Röder et al. (2015) and employ four topic coherence metrics (C_P, C_A, NPMI and UCI) to evaluate the topics generated by various models. In all experiments, each topic is represented by the top 10 words according to the topic-word probabilities, and all the topic coherence values are calculated using the Palmetto library (https://github.com/dice-group/Palmetto).
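To give a sense of what a coherence metric measures, the sketch below computes a simplified NPMI score for a topic's top words. It is only an illustration under stated assumptions: co-occurrence is counted at the document level on a tiny hypothetical corpus, whereas Palmetto estimates probabilities from a large reference corpus with its own windowing and smoothing, so the numbers are not comparable to those reported here.

```python
import math
from itertools import combinations

def npmi_coherence(top_words, docs):
    """Average NPMI over all pairs of top words (needs at least two words)."""
    doc_sets = [set(d) for d in docs]
    n = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all the given words.
        return sum(all(w in s for w in words) for s in doc_sets) / n

    scores = []
    for wi, wj in combinations(top_words, 2):
        p_i, p_j, p_ij = prob(wi), prob(wj), prob(wi, wj)
        if p_ij == 0.0:
            scores.append(-1.0)   # the pair never co-occurs
        elif p_ij == 1.0:
            scores.append(1.0)    # the pair always co-occurs (limit value)
        else:
            pmi = math.log(p_ij / (p_i * p_j))
            scores.append(pmi / -math.log(p_ij))
    return sum(scores) / len(scores)

# Hypothetical reference corpus.
docs = [["vote", "election", "poll"], ["vote", "election"], ["film", "actor"], ["film", "actor", "vote"]]
coherent = npmi_coherence(["vote", "election"], docs)
incoherent = npmi_coherence(["vote", "actor"], docs)
```

Word pairs that co-occur more often than chance push the score towards 1, while pairs that rarely appear together push it towards -1.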
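Table 2: Average topic coherence over five topic number settings [20, 30, 50, 75, 100].
| Dataset | Model | C_P | C_A | NPMI | UCI |
|---|---|---|---|---|---|
| 20Newsgroups | NVDM | 0.2558 | 0.1286 | 0.0984 | 2.9496 |
| 20Newsgroups | GSM | 0.2318 | 0.1067 | 0.0400 | 1.6083 |
| 20Newsgroups | NVLDA | 0.1205 | 0.1763 | 0.0207 | 1.3466 |
| 20Newsgroups | ProdLDA | 0.1858 | 0.2155 | 0.0083 | 1.5044 |
| 20Newsgroups | LDA | 0.2361 | 0.1769 | 0.0523 | 0.3399 |
| 20Newsgroups | ATM | 0.1914 | 0.1720 | 0.0207 | 0.3871 |
| 20Newsgroups | BAT | 0.2597 | 0.1976 | 0.0472 | 0.0969 |
| 20Newsgroups | Gaussian-BAT | 0.3758 | 0.2251 | 0.0819 | 0.5925 |
| Grolier | NVDM | 0.1877 | 0.1456 | 0.0619 | 2.1149 |
| Grolier | GSM | 0.1974 | 0.1966 | 0.0491 | 0.0410 |
| Grolier | NVLDA | 0.2205 | 0.1504 | 0.0653 | 2.4797 |
| Grolier | ProdLDA | 0.0374 | 0.1733 | 0.0193 | 1.6398 |
| Grolier | LDA | 0.1908 | 0.2009 | 0.0497 | 0.0503 |
| Grolier | ATM | 0.2105 | 0.2188 | 0.0582 | 0.1051 |
| Grolier | BAT | 0.2312 | 0.2108 | 0.0608 | 0.1709 |
| Grolier | Gaussian-BAT | 0.2606 | 0.2142 | 0.0724 | 0.2836 |
| NYTimes | NVDM | 0.4130 | 0.1341 | 0.1437 | 4.3072 |
| NYTimes | GSM | 0.3426 | 0.2232 | 0.0848 | 0.6224 |
| NYTimes | NVLDA | 0.1575 | 0.1482 | 0.0614 | 2.4208 |
| NYTimes | ProdLDA | 0.0034 | 0.1963 | 0.0282 | 1.9173 |
| NYTimes | LDA | 0.3083 | 0.2127 | 0.0772 | 0.5165 |
| NYTimes | ATM | 0.3568 | 0.2375 | 0.0899 | 0.6582 |
| NYTimes | BAT | 0.3749 | 0.2355 | 0.0951 | 0.7073 |
| NYTimes | Gaussian-BAT | 0.4163 | 0.2479 | 0.1079 | 0.9215 |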
Table 3: Topic examples extracted by the models.
| Model | Topics |
|---|---|
| Gaussian-BAT | voter campaign poll candidates democratic election republican vote presidential democrat |
| Gaussian-BAT | song album music band rock pop sound singer jazz guitar |
| Gaussian-BAT | film movie actor character movies director series actress young scenes |
| Gaussian-BAT | flight airline passenger airlines aircraft shuttle airport pilot carrier planes |
| BAT | vote president voter campaign election democratic governor republican black candidates |
| BAT | album band music rock song jazz guitar pop musician record |
| BAT | film actor play acting role playing character father movie actress |
| BAT | flight airline delay airlines plane pilot airport passenger carrier attendant |
| LDA | voter vote poll election campaign primary candidates republican race party |
| LDA | music song band sound record artist album show musical rock |
| LDA | film movie character play actor director movies minutes theater cast |
| LDA | flight plane ship crew air pilot hour boat passenger airport |
| ATM | voter vote poll republican race primary percent election campaign democratic |
| ATM | music song musical album jazz band record recording mp3 composer |
| ATM | film movie actor director award movies character theater production play |
| ATM | jet flight airline hour plane passenger trip plan travel pilot |
We first compare topic coherence across different topic proportions. Experiments are conducted on the datasets with five topic number settings [20, 30, 50, 75, 100]. We calculate the average topic coherence values among topics whose coherence values are ranked in the top 50%, 70%, 90% and 100%. For example, to calculate the average C_P value of BAT at 90%, we first compute, for each topic number setting, the average C_P coherence over the topics whose C_P values rank in the top 90%, and then average the five resulting values, one per topic number setting.
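The averaging procedure described above can be sketched as follows. The rounding rule (ceiling of the fraction times the number of topics) and the toy coherence values are assumptions for illustration; the per-setting lists are shortened to a few hypothetical topics.

```python
import math

def top_fraction_mean(values, fraction):
    """Mean of the best `fraction` of coherence values (higher is better)."""
    keep = max(1, math.ceil(len(values) * fraction))  # assumed rounding rule
    return sum(sorted(values, reverse=True)[:keep]) / keep

def coherence_at(per_setting, fraction):
    """Average the top-fraction means over all topic-number settings."""
    means = [top_fraction_mean(v, fraction) for v in per_setting.values()]
    return sum(means) / len(means)

# Hypothetical per-topic coherence values, keyed by topic-number setting.
per_setting = {
    4: [0.30, 0.10, 0.20, 0.40],
    5: [0.50, 0.10, 0.30, 0.20, 0.40],
}
at_50 = coherence_at(per_setting, 0.5)
at_100 = coherence_at(per_setting, 1.0)
```

Restricting to a smaller fraction reports how good a model's best topics are, while the 100% setting also penalizes its weakest topics.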
The detailed comparison is shown in Figure 4. It can be observed that BAT outperforms the baselines on all the coherence metrics for the NYTimes dataset. For the Grolier dataset, BAT outperforms all the baselines on the C_P, NPMI and UCI metrics, but gives slightly worse results than ATM on C_A. For the 20Newsgroups dataset, BAT performs the best on C_P and NPMI, but gives slightly worse results than ProdLDA on C_A and LDA on UCI. By incorporating word embeddings through trainable Gaussian distributions, Gaussian-BAT outperforms all the baselines and BAT on the four coherence metrics, often by a large margin, across all three datasets except for the Grolier dataset on C_A when considering 100% of the topics. This may be attributed to the following factors: (1) The Dirichlet prior employed in BAT and Gaussian-BAT can exhibit a multi-modal distribution in the latent space and is more suitable for discovering semantic patterns from text; (2) ATM does not consider the relationship between the topic distribution and the word distribution, since it only carries out adversarial training in the word distribution space; (3) The incorporation of word embeddings in Gaussian-BAT helps generate more coherent topics.
We also compare the average topic coherence values (all topics taken into account) numerically to show the effectiveness of the proposed BAT and Gaussian-BAT. The results of the numerical topic coherence comparison are listed in Table 2, and each value is calculated by averaging the average topic coherences over the five topic number settings. The best coherence value on each metric is highlighted in bold. It can be observed that Gaussian-BAT gives the best overall results across all metrics and on all the datasets except for the Grolier dataset on C_A. To make the comparison of topics more intuitive, we provide four topic examples extracted by the models in Table 3. It can be observed that the proposed BAT and Gaussian-BAT can generate more coherent topics.
Moreover, to explore how topic coherence varies with different topic numbers, we also provide the comparison of average topic coherence vs. different topic numbers on 20Newsgroups, Grolier and NYTimes (all topics taken into account). The detailed comparison is shown in Figure 5. It can be observed that Gaussian-BAT outperforms the baselines with 20, 30, 50 and 75 topics except for the Grolier dataset on the C_A metric. However, when the topic number is set to 100, Gaussian-BAT performs slightly worse than LDA (e.g., UCI for 20Newsgroups and C_A for NYTimes). This may be caused by the increased model complexity due to the larger topic number settings. Likewise, BAT achieves at least the second-best results among all the approaches in most cases for the NYTimes dataset. For Grolier, BAT also performs the second-best except on the C_A metric. However, for 20Newsgroups, the results obtained by BAT are worse than ProdLDA (C_A) and LDA (UCI) due to the limited training documents in the dataset, though it still largely outperforms the other baselines.
4.3 Text Clustering
We further compare our proposed models with the baselines on text clustering. Due to the lack of document label information in Grolier and NYTimes, we only use the 20Newsgroups dataset in our experiments. The topic number is set to 20 (the number of ground-truth categories) and the performance is evaluated by accuracy ($ACC$):
(14)  $ACC = \max_{map} \frac{\sum_{i=1}^{N_t} ind(l_i = map(c_i))}{N_t}$
where $N_t$ is the number of documents in the test set, $ind(\cdot)$ is the indicator function, $l_i$ is the ground-truth label of the $i$-th document, $c_i$ is the category assignment, and $map$ ranges over all possible one-to-one mappings between labels and clusters. The optimal map function can be obtained by the Kuhn-Munkres algorithm Kuhn (1955). A larger accuracy value indicates a better text clustering result.
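The accuracy with an optimal one-to-one mapping can be illustrated with a brute-force search over permutations. This swaps in exhaustive enumeration for the Kuhn-Munkres algorithm, which gives the same optimum but is only feasible for a handful of clusters; the labels and cluster assignments below are hypothetical.

```python
from itertools import permutations

def clustering_accuracy(labels, clusters, num_classes):
    """Best accuracy over all one-to-one cluster-to-label mappings (brute force)."""
    # counts[c][l]: number of documents in cluster c whose ground-truth label is l
    counts = [[0] * num_classes for _ in range(num_classes)]
    for l, c in zip(labels, clusters):
        counts[c][l] += 1
    best = max(
        sum(counts[c][mapping[c]] for c in range(num_classes))
        for mapping in permutations(range(num_classes))
    )
    return best / len(labels)

labels = [0, 0, 1, 1, 2, 2]      # ground-truth categories
clusters = [1, 1, 0, 0, 2, 0]    # inferred clusters
acc = clustering_accuracy(labels, clusters, 3)
```

For 20 clusters the factorial search is impractical; an assignment solver such as SciPy's `scipy.optimize.linear_sum_assignment` applied to the negated `counts` matrix is one way to obtain the same optimum efficiently.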
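Table 4: Text clustering accuracy on 20Newsgroups (20NG).
| Dataset | NVLDA | ProdLDA | LDA | BAT | Gaussian-BAT |
|---|---|---|---|---|---|
| 20NG | 33.31% | 33.82% | 35.36% | 35.66% | 41.25% |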
The comparison of text clustering results on 20Newsgroups is shown in Table 4. Due to the poor performance of NVDM in the topic coherence evaluation, its result is excluded here. Not surprisingly, NVLDA and ProdLDA perform worse than BAT and Gaussian-BAT, which model topics with the Dirichlet prior. This might be caused by the fact that the Logistic-Normal prior does not exhibit multiple peaks at the vertices of the simplex, as depicted in Figure 1. Compared with LDA, BAT achieves a comparable accuracy, since both models have the same Dirichlet prior assumption over topics and only employ word co-occurrence information. Gaussian-BAT outperforms the second-best model, BAT, by nearly 6 percentage points in accuracy (41.25% vs. 35.66%). This shows that the incorporation of word embeddings is important for improving the semantic coherence of topics and thus results in better consistency between cluster assignments and ground-truth labels.
5 Conclusion
In this paper, we have explored the use of bidirectional adversarial training in neural topic models and proposed two novel approaches: the Bidirectional Adversarial Topic (BAT) model and the Bidirectional Adversarial Topic model with Gaussian (Gaussian-BAT). BAT models topics with the Dirichlet prior and builds a two-way transformation between the document-topic distribution and the document-word distribution via bidirectional adversarial training. Gaussian-BAT extends BAT by incorporating word embeddings into the modeling process, thereby naturally considering the word relatedness information captured in word embeddings. The experimental comparison with existing neural topic models on three widely used benchmark text corpora shows that BAT and Gaussian-BAT achieve improved topic coherence results. In the future, we would like to devise a nonparametric neural topic model based on adversarial training. Besides, developing correlated topic models is another promising direction.
Acknowledgements
We would like to thank the anonymous reviewers for their valuable comments and helpful suggestions. This work was funded by the National Key Research and Development Program of China (2017YFB1002801) and the National Natural Science Foundation of China (61772132). YH is partially supported by EPSRC (grant no. EP/T017112/1).
References

Arjovsky et al. (2017). Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214–223.
Athiwaratkun et al. (2018). Probabilistic FastText for multi-sense word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1–11.
Blei et al. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research 3 (Jan), pp. 993–1022.
Chang et al. (2009). Reading tea leaves: how humans interpret topic models. In Advances in Neural Information Processing Systems, pp. 288–296.
Cheng et al. (2014). BTM: topic modeling over short texts. IEEE Transactions on Knowledge and Data Engineering 26 (12), pp. 2928–2941.
Donahue et al. (2016). Adversarial feature learning. arXiv preprint arXiv:1605.09782.
Dumoulin et al. (2016). Adversarially learned inference. arXiv preprint arXiv:1606.00704.
Fedus et al. (2018). MaskGAN: better text generation via filling in the ______. arXiv preprint arXiv:1801.07736.
Goodfellow et al. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
Griffiths and Steyvers (2004). Finding scientific topics. Proceedings of the National Academy of Sciences 101 (suppl 1), pp. 5228–5235.
Gulrajani et al. (2017). Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777.
Hinton and Salakhutdinov (2009). Replicated softmax: an undirected topic model. In Advances in Neural Information Processing Systems, pp. 1607–1614.
Joulin et al. (2017). Bag of tricks for efficient text classification. EACL 2017, pp. 427.
Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kingma and Welling (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Kuhn (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2, pp. 83–97.
Lang (1995). NewsWeeder: learning to filter netnews. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 331–339.
Lei et al. (2018). SAAN: a sentiment-aware attention network for sentiment analysis. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 1197–1200.
Lin and He (2009). Joint sentiment/topic model for sentiment analysis. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 375–384.
Lin et al. (2017). Adversarial ranking for language generation. In Advances in Neural Information Processing Systems, pp. 3155–3165.
Liu et al. (2018). Content attention model for aspect based sentiment analysis. In Proceedings of the 2018 World Wide Web Conference, pp. 1023–1032.
Miao et al. (2017). Discovering discrete latent topics with neural variational inference. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 2410–2419.
Miao et al. (2016). Neural variational inference for text processing. In International Conference on Machine Learning, pp. 1727–1736.
Mikolov et al. (2013a). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Mikolov et al. (2013b). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119.
Pennington et al. (2014). GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
Radford et al. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
Röder et al. (2015). Exploring the space of topic coherence measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pp. 399–408.
Srivastava and Sutton (2017). Autoencoding variational inference for topic models. arXiv preprint arXiv:1703.01488.
Wallach et al. (2009). Rethinking LDA: why priors matter. In Advances in Neural Information Processing Systems, pp. 1973–1981.
Wang et al. (2019). Open event extraction from online text using a generative adversarial network. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 282–291.
Wang et al. (2019). ATM: adversarial-neural topic model. Information Processing & Management 56 (6), pp. 102098.
Yu et al. (2017). SeqGAN: sequence generative adversarial nets with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence.
Zhou et al. (2014). A simple Bayesian modelling approach to event extraction from Twitter. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 700–705.