1 Introduction
Topic models have been successfully applied in Natural Language Processing to tasks such as information extraction, text clustering, summarization, and sentiment analysis Lu et al. (2011); Subramani et al. (2018); Tuan et al. (2020); Wang et al. (2019b); Wang and Mengoni (2021); Nguyen et al. (2021). The most popular conventional topic model, Latent Dirichlet Allocation Blei et al. (2003), learns document-topic and topic-word distributions via Gibbs sampling and mean-field approximation. To apply deep neural networks to topic modeling, Miao et al. (2017) proposed to use neural variational inference as the training method, while Srivastava and Sutton (2017) employed the logistic normal prior distribution. However, recent studies Wang et al. (2019a, 2020) showed that both Gaussian and logistic normal priors fail to capture the multimodal aspects and semantic patterns of a document, which are crucial to maintaining the quality of a topic model.
To cope with this issue, the Adversarial Topic Model (ATM) Wang et al. (2019a, 2020); Hu et al. (2020); Nan et al. (2019) was proposed with adversarial mechanisms using a combination of generator and discriminator. By seeking the equilibrium between the generator and discriminator, the generator is capable of learning meaningful semantic patterns of the document. Nonetheless, this framework has two main limitations. First, ATM relies on one key ingredient: discriminating the real distribution from the fake (negative) distribution to guide the training. Since the sampling of the fake distribution is not conditioned on the real distribution, it rarely generates positive samples, i.e., samples that largely preserve the semantic content of the real sample. This limits the model's ability to exploit the mutual information between the positive sample and the real one, which has been demonstrated to be a key driver for learning useful representations in unsupervised learning Blum and Mitchell (1998); Xu et al. (2013); Bachman et al. (2019); Chen et al. (2020a); Tian et al. (2020). Second, ATM takes random samples from a prior distribution to feed to the generator. Previous work Card et al. (2017) has shown that incorporating additional variables, such as metadata or sentiment, to estimate the topic distribution aids the learning of coherent topics. By relying on a predefined prior distribution, ATM hinders the integration of those variables.
To address the above drawbacks, in this paper we propose a novel method to model the relations among samples without relying on the generative-discriminative architecture. In particular, we formulate the objective as an optimization problem that aims to move the representation of the input (or prototype) closer to the one that shares its semantic content, i.e., the positive sample. We also take into account the relation between the prototype and the negative sample by forming an auxiliary constraint that forces the model to push the representation of the negative sample farther away from the prototype. Our mathematical framework ends with a contrastive objective, which is jointly optimized with the evidence lower bound of the neural topic model.
Nonetheless, another challenge arises: how can positive and negative samples be generated effectively in the neural topic model setting? Recent efforts have addressed positive sampling strategies and methods to generate hard negative samples for images Chuang et al. (2020); Robinson et al. (2020); Chen et al. (2020b); Tian et al. (2019). However, research on adapting these techniques to the neural topic model setting has been neglected in the literature. In this work, we introduce a novel sampling method that mimics the way humans perceive the similarity of a pair of documents, which is based on the following hypothesis:
Hypothesis 1.
The common theme of the prototype and the positive sample can be realized due to their relative frequency of salient words.
We use the example in Fig. 1 to explain the idea of our method. Humans are able to tell the similarity of the input and the positive sample because the frequency of salient words such as “league” and “teams” is proportional to that of their counterparts in the positive sample. On the other hand, the separation between the input and the negative sample can be inferred because those words in the input do not occur in the negative sample, even though both contain the words “billions” and “dollars”, which are not salient in the context of the input. Based on this intuition, our method generates positive samples for the topic model by maintaining the weights of salient entries and altering those of unimportant ones in the prototype, and performs the opposite procedure to construct negative samples. Since our method does not depend on a fixed prior distribution to draw samples, we are not restricted from incorporating external variables that provide additional knowledge for learning better topics.
In a nutshell, the contributions of our paper are as follows:

We target the problem of capturing meaningful representations through modeling the relations among samples from a new mathematical perspective and propose a novel contrastive objective which is jointly optimized with the evidence lower bound of the neural topic model. We find that capturing the mutual information between the prototype and its positive samples provides a strong foundation for constructing coherent topics, while differentiating the prototype from the negative samples plays a less important role.

We propose a novel sampling strategy that is motivated by human behavior when comparing different documents. By relying on the reconstructed output, we adapt the sampling to the learning process of the model and produce the most informative samples compared with other sampling strategies.

We conduct extensive experiments on three common topic modeling datasets and demonstrate the effectiveness of our approach by outperforming other state-of-the-art approaches in terms of topic coherence, on both a global and a topic-by-topic basis.
2 Related Work
Neural Topic Model (NTM) has been studied to encode a large set of documents using latent vectors. Inspired by the Variational Autoencoder, NTMs inherit most techniques from early VAE-specific works, such as the reparameterization trick Kingma and Welling (2013) and neural variational inference Rezende et al. (2014). Subsequent works applying these techniques to topic models Srivastava and Sutton (2017); Miao et al. (2016, 2017) focus on studying various prior distributions, e.g. Gaussian or logistic normal. Recently, research has directly targeted improving topic coherence by formulating it as an optimization objective Ding et al. (2018), incorporating contextual language knowledge Hoyle et al. (2020), or passing external information, e.g. sentiment or group of documents, as input Card et al. (2017). Generating topics that are human-interpretable has become the goal of a wide variety of recent efforts.
Adversarial Topic Model Wang et al. (2019b) is a topic modeling approach that models the topics with a GAN-based architecture. The key components of that architecture are a generator, which projects a randomly sampled document-topic distribution to obtain a document-word distribution that is as realistic as possible, and a discriminator, which tries to distinguish between the generated and the true sample Wang et al. (2019a, 2020). To better learn informative representations of a document, Hu et al. (2020) proposed adding two cycle-consistent constraints to encourage the coordination between the encoder and generator.
Contrastive Framework and Sampling Techniques There are various efforts studying contrastive methods to learn meaningful representations. For visual information, the contrastive framework is applied to tasks such as image classification Khosla et al. (2020); Hjelm et al. (2018), object detection Xie et al. (2021); Sun et al. (2021); Amrani et al. (2019), image segmentation Zhao et al. (2020); Chaitanya et al. (2020); Ke et al. (2021), etc. Applications beyond images include adversarial training Ho and Vasconcelos (2020); Miyato et al. (2018); Kim et al. (2020), graphs You et al. (2020); Sun et al. (2019); Li et al. (2019); Hassani and Khasahmadi (2020), and sequence modeling Logeswaran and Lee (2018); Oord et al. (2018); Henaff (2020). Specific positive sampling strategies have been proposed to improve the performance of contrastive learning, e.g. applying view-based transformations that preserve semantic content in the image Chen et al. (2020b, a); Tian et al. (2020). On the other hand, there is a recent surge of interest in studying negative sampling methods. Chuang et al. (2020) propose a debiasing method that corrects for false negative samples. For object detection, Jin et al. (2018) employ the temporal structure of video to generate negative examples. Although contrastive techniques have been widely studied, little effort has been made to adapt them to neural topic models.
In this paper, we reformulate our goal of learning document representations in the neural topic model as a contrastive objective. The form of our objective is most closely related to that of Robinson et al. (2020). However, there are two key differences: (1) Whereas they use the weighting factor associated with the impact of the negative sample as a tool to search for the distribution of hard negative samples, we consider it an adaptive parameter that controls the impact of the positive and negative samples on learning. (2) We regard the effect of the positive sample as the main driver for achieving meaningful representations, while they exploit the impact of negative ones. Our approach is more applicable to topic modeling, as suggested by our investigation into how humans distinguish among documents.
3 Methodology
3.1 Notations and Problem Setting
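In this paper, we focus on improving the performance of the neural topic model (NTM), measured via topic coherence. NTM inherits the architecture of the Variational Autoencoder, where the latent vector is taken as the topic distribution. Suppose the vocabulary has $V$ unique words; each document is represented as a word count vector $\mathbf{x} \in \mathbb{R}^{V}$ and a latent distribution over $T$ topics, $\mathbf{z} \in \mathbb{R}^{T}$. NTM assumes that $\mathbf{z}$ is generated from a prior distribution $p(\mathbf{z})$ and $\mathbf{x}$ is generated from the conditional distribution over the topics $p_{\phi}(\mathbf{x} \mid \mathbf{z})$ by a decoder $\phi$. The aim of the model is to infer the document-topic distribution given the word counts. In other words, it must estimate the posterior distribution $p(\mathbf{z} \mid \mathbf{x})$, which is approximated by the variational distribution $q_{\theta}(\mathbf{z} \mid \mathbf{x})$ modelled by an encoder $\theta$. NTM is trained by minimizing the following objective:
$$\mathcal{L}_{\mathrm{VAE}}(\mathbf{x}) = -\mathbb{E}_{q_{\theta}(\mathbf{z} \mid \mathbf{x})}\left[\log p_{\phi}(\mathbf{x} \mid \mathbf{z})\right] + \mathrm{KL}\left[q_{\theta}(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\right] \qquad (1)$$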
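As a rough illustration of a VAE-style training objective of this kind, the sketch below computes a negative evidence lower bound for a single document. It is not the authors' implementation: it assumes a standard Gaussian prior, a softmax decoder whose logits are given, and it evaluates the KL term in closed form rather than sampling with the reparameterization trick. The names `ntm_objective`, `logits`, `mu`, and `log_var` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ntm_objective(x, logits, mu, log_var):
    """Negative ELBO for one document (assumption: standard Gaussian prior).

    x        -- word-count vector of length V
    logits   -- decoder outputs of length V (before softmax)
    mu       -- encoder mean of the latent vector, length T
    log_var  -- encoder log-variance of the latent vector, length T
    """
    probs = softmax(logits)
    # Reconstruction term: negative log-likelihood of the observed counts.
    recon = -sum(c * math.log(p) for c, p in zip(x, probs) if c > 0)
    # KL divergence between N(mu, diag(exp(log_var))) and N(0, I).
    kl = 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))
    return recon + kl
```

With a uniform decoder and a posterior equal to the prior, the KL term vanishes and only the reconstruction cost remains, which makes the two parts of the objective easy to inspect separately.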
3.2 Contrastive objective derivation
Let denote the set of document bag-of-words vectors. Each vector is associated with a negative sample and a positive sample . We assume a discrete set of latent classes , so that have the same latent class while does not. In this work, we choose to use the semantic dot product to measure the similarity between the prototype and the drawn samples.
Our goal is to learn a mapping function of the encoder which transforms to the latent distribution ( and are transformed to and , respectively). A reasonable mapping function must fulfill two qualities: (1) and are mapped onto nearby positions; (2) and are projected distantly. Regarding goal (1) as the main objective and goal (2) as the constraint enforcing the model to learn the relations among dissimilar samples, we specify the constrained optimization problem, in which denotes the strength of the constraint
(2) 
Rewriting Eq. 2 as a Lagrangian under the KKT conditions Kuhn and Tucker (2014); Karush (1939), we obtain:
(3) 
where the positive KKT multiplier is the regularisation coefficient that controls the effect of the negative sample on training. From Eq. 3 we can derive the weighted-contrastive loss:
(4) 
where . The full proof of (4) can be found in the Appendix. Previous works Kim et al. (2020); Chaitanya et al. (2020); You et al. (2020); Khosla et al. (2020); Chuang et al. (2020); Han et al. (2021) treat the positive and negative samples equally by setting . In this paper, we leverage different values of to guide the model to concentrate on the sample that is distinct from the input. Consequently, a reasonable value of will provide a clear separation among topics in the dataset. We describe our procedure to estimate in the following section.
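To make the role of the weighting factor concrete, the sketch below implements one common form of a weighted contrastive loss with a single positive and a single negative sample, using the dot product as similarity. The exact form of Eq. 4 is not reproduced here; the assumption is an InfoNCE-style ratio in which a weight (called `beta` in the code, a hypothetical name) multiplies the negative term. Setting `beta = 1` corresponds to treating both samples equally.

```python
import math


def dot(a, b):
    """Dot-product similarity between two latent vectors."""
    return sum(u * v for u, v in zip(a, b))


def weighted_contrastive_loss(z, z_pos, z_neg, beta=1.0):
    """Assumed form: -log( e^{z.z+} / (e^{z.z+} + beta * e^{z.z-}) ).

    z, z_pos, z_neg -- latent vectors of the prototype, positive and
                       negative sample (lists of floats, same length)
    beta            -- non-negative weight on the negative sample
    """
    pos = dot(z, z_pos)
    neg = dot(z, z_neg)
    # Algebraically equal to the ratio above, but avoids computing
    # exp(pos) and exp(neg) separately.
    return math.log1p(beta * math.exp(neg - pos))
```

Under this assumed form, the loss vanishes when `beta` is zero and grows with `beta` for fixed similarities, so a larger weight makes the negative sample contribute more to the update.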
3.3 Controlling the effect of negative sample
When choosing the value of , we need to answer the following questions: (1) What impact does have on the process of training? and (2) Is it possible to design a data-oriented procedure to approximate ?
Understanding the impact of To exemplify point (1), we study the impact of on the contrastive loss presented in Section 3.2. The gradient of the contrastive loss (4) with respect to the latent distribution would be:
(5) 
This derivation confirms the proportionality of the gradient norm with respect to . As the training progresses, the update step must be carefully controlled to avoid bouncing around the minimum or getting stuck in local optima.
Adaptive scheduling We leverage an adaptive approach to construct a data-oriented procedure to estimate . Initially, the neural topic model will consider the representation of each document equally likely. The relation between the similarity of the positive sample and the prototype and that of the negative sample and the prototype can provide us with a starting viewpoint of the model. Concretely, we store that information in the initialized value of , which is estimated with the formula .
After initialisation, to accommodate the model's learning, we continue to adopt an adaptive strategy which keeps updating the value of according to the triangle scheduling procedure: . We summarize the details of choosing in Algo. 1.
3.4 Word-based Sampling Strategy
Here we provide a technical motivation and details of our sampling method. To choose a sample which has the same underlying topic as the input, it is reasonable to select the topics which hold large values in the document-topic distribution, as they are considered to be important by the neural topic model. Subsequently, the procedure draws salient words from each of these topics, which contribute their weights to the drawn samples. We call this the topic-based sampling strategy.
However, as shown in Miao et al. (2017), the process of choosing topics is sensitive to the training performance, and it is challenging to determine the optimal number of topics for every single input. Miao et al. (2017) implemented a stick-breaking procedure to specifically predict the number of topics for each document. Their strategy demands approximating the likelihood increase for each decision to break the stick, in other words, to increase the number of topics that the document represents. Since their process takes up a considerable amount of computation, we propose a simpler, word-based approach to draw both positive and negative samples.
For each document with its associated word count vector , we form the tf-idf representation . Then, we feed x to the neural topic model to obtain the latent vector and the reconstructed document . Our word-based sampling strategy is illustrated in Fig. 2.
Negative sampling We select tokens that have the highest tf-idf scores. We hypothesize that these words mainly contribute to the topic of the document. By substituting the weights of the chosen tokens in the original input x with the weights of the reconstructed representation : , we force the main content of the negative samples to deviate from that of the original input .
Note that since the model improves its reconstruction ability as training progresses, the weights of salient words from the reconstructed output approach those from the original input (without becoming equal). The model should take a more careful learning step to adapt to this situation. As the factor controlling the negative sample decays when training converges to the final step, due to the adaptive scheduling approach described in Section 3.3, the model is able to adapt to this phenomenon.
Positive sampling Contrary to the negative case, we select tokens possessing the lowest tf-idf scores . We obtain the positive sample, which bears a theme resembling that of the original input, by assigning the weights of the chosen tokens in to their counterparts in through . This forms a valid positive sampling procedure since modifying the weights of insignificant tokens retains the salient topics in the source document.
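The word-based procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the bag-of-words vector, its reconstruction, and the tf-idf scores are plain Python lists of equal length, the number of substituted tokens `k` is a free parameter, and ties or zero-count tokens receive no special treatment. All function names are hypothetical.

```python
def top_indices(scores, k, largest=True):
    """Indices of the k largest (or smallest) entries of scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=largest)
    return order[:k]


def substitute(x, x_recon, indices):
    """Copy x and overwrite the chosen entries with the reconstructed weights."""
    sample = list(x)
    for i in indices:
        sample[i] = x_recon[i]
    return sample


def draw_samples(x, x_recon, tfidf, k):
    """Return (positive, negative) samples for one document.

    Negative: the k tokens with the highest tf-idf take reconstructed weights,
              so the salient content is altered.
    Positive: the k tokens with the lowest tf-idf take reconstructed weights,
              so the salient content is kept.
    """
    negative = substitute(x, x_recon, top_indices(tfidf, k, largest=True))
    positive = substitute(x, x_recon, top_indices(tfidf, k, largest=False))
    return positive, negative
```

For example, with `x = [3, 0, 1, 2]`, `x_recon = [0.5, 0.1, 0.9, 1.5]`, `tfidf = [0.9, 0.0, 0.2, 0.6]` and `k = 1`, the negative sample changes only the first entry (the most salient token) while the positive sample changes only the second (the least salient one).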
3.5 Training objective
4 Experimental Setting
In this section, we describe the experimental setup used to evaluate the performance of our proposed method. We provide a statistical summary of the datasets in the Appendix.
4.1 Datasets
We conduct our experiments on three readily available datasets that span various domains, vocabulary sizes, and document lengths:

20Newsgroups (20NG) dataset Lang (1995) consists of about 18000 documents; each document is a newsgroup post associated with a newsgroup label (for example, talk.politics.misc). Following Huynh et al. (2020), we preprocess the dataset to remove stopwords, words possessing length equal to , and words whose frequency is less than . We split the dataset with , , for training, validation, and testing, respectively.

Wikitext-103 (Wiki) Merity et al. (2016) is a version of the WikiText dataset, which includes about articles from the Good and Featured sections on Wikipedia. We follow the preprocessing, keep the top words as in Merity et al. (2016), and use the train/dev/test split of , , and .
As the evaluation measure, we use Normalized Pointwise Mutual Information (NPMI), since it strongly correlates with human judgement and is widely applied to verify topic quality Hoyle et al. (2020). For text classification, we use the F1-score as the evaluation metric.
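For readers unfamiliar with the measure, the sketch below computes NPMI for the top words of one topic using its standard definition, NPMI(w_i, w_j) = log(p(w_i, w_j) / (p(w_i) p(w_j))) / (-log p(w_i, w_j)), averaged over all word pairs. Estimating the probabilities from document-level co-occurrence in a reference corpus is an assumption here, as are the function names; the evaluation pipeline used in the experiments may differ in its details.

```python
import math
from itertools import combinations


def npmi_pair(p_i, p_j, p_ij):
    """NPMI of two words from their marginal and joint probabilities."""
    if p_ij == 0.0:
        return -1.0  # the words never co-occur
    if p_ij >= 1.0:
        return 1.0   # degenerate case, treated as 1.0 by convention (assumption)
    return math.log(p_ij / (p_i * p_j)) / -math.log(p_ij)


def topic_npmi(top_words, docs):
    """Average pairwise NPMI of a topic's top words (needs at least two words).

    top_words -- list of word strings
    docs      -- reference corpus, each document a list of word strings
    """
    doc_sets = [set(d) for d in docs]
    n = len(doc_sets)

    def prob(*words):
        # Fraction of documents containing all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n

    scores = [
        npmi_pair(prob(a), prob(b), prob(a, b))
        for a, b in combinations(top_words, 2)
    ]
    return sum(scores) / len(scores)
```

Values lie in [-1, 1]: words that always appear together score 1, independent words score 0, and words that never co-occur score -1.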
4.2 Baselines
We compare our method with the following state-of-the-art neural topic models of diverse styles:

NTM Ding et al. (2018): a Gaussian-based neural topic model proposed by Miao et al. (2017), inheriting the VAE architecture and utilizing neural variational inference for training.

SCHOLAR Card et al. (2017): a VAE-based neural topic model that learns with a logistic normal prior and provides a method to incorporate external variables.

SCHOLAR + BAT Hoyle et al. (2020): a version of the SCHOLAR model trained using knowledge distillation, where a BERT model acts as a teacher that provides contextual knowledge for its student, the neural topic model.

WLDA Nan et al. (2019): a topic model which takes the form of a Wasserstein autoencoder with a Dirichlet prior approximated by minimizing Maximum Mean Discrepancy.

BATM Wang et al. (2020): a neural topic model whose architecture is inspired by the Generative Adversarial Network. We use the version trained with the bidirectional adversarial training method and an architecture consisting of 3 components: encoder, generator, and discriminator.
5 Results
5.1 Topic coherence
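Table 1: NPMI topic coherence (mean ± standard deviation). For each dataset, the two columns correspond to the two topic-number settings.
| Model | 20NG | 20NG | IMDb | IMDb | Wiki | Wiki |
| --- | --- | --- | --- | --- | --- | --- |
| NTM Ding et al. (2018) | 0.283 ± 0.004 | 0.277 ± 0.003 | 0.170 ± 0.008 | 0.169 ± 0.003 | 0.250 ± 0.010 | 0.291 ± 0.009 |
| WLDA Nan et al. (2019) | 0.279 ± 0.003 | 0.188 ± 0.001 | 0.136 ± 0.007 | 0.095 ± 0.003 | 0.451 ± 0.012 | 0.308 ± 0.007 |
| BATM Wang et al. (2020) | 0.314 ± 0.003 | 0.245 ± 0.001 | 0.065 ± 0.008 | 0.090 ± 0.004 | 0.336 ± 0.010 | 0.319 ± 0.005 |
| SCHOLAR Card et al. (2017) | 0.319 ± 0.007 | 0.263 ± 0.002 | 0.168 ± 0.002 | 0.140 ± 0.001 | 0.429 ± 0.011 | 0.446 ± 0.009 |
| SCHOLAR + BAT Hoyle et al. (2020) | 0.324 ± 0.006 | 0.272 ± 0.002 | 0.182 ± 0.002 | 0.175 ± 0.003 | 0.446 ± 0.010 | 0.455 ± 0.007 |
| Our model | 0.327 ± 0.006 | 0.274 ± 0.003 | 0.191 ± 0.007 | 0.185 ± 0.003 | 0.455 ± 0.012 | 0.450 ± 0.008 |
| Our model | 0.328 ± 0.004 | 0.277 ± 0.003 | 0.195 ± 0.008 | 0.187 ± 0.001 | 0.465 ± 0.012 | 0.456 ± 0.004 |
| Our model | 0.334 ± 0.004 | 0.280 ± 0.003 | 0.197 ± 0.006 | 0.188 ± 0.002 | 0.497 ± 0.009 | 0.478 ± 0.006 |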
Overall basis We evaluate our methods both at and . For each topic, we follow previous works Hoyle et al. (2020); Wang et al. (2019a); Card et al. (2017) to pick the top words, compute their NPMI, and report the average value. As shown in Tab. 1, our method achieves the best topic coherence on the three benchmark datasets. We surpass the baseline SCHOLAR Card et al. (2017), its version trained with distilled knowledge SCHOLAR + BAT Hoyle et al. (2020), and other state-of-the-art neural topic models in both cases of and . We also establish the robustness of our improvement by conducting experiments on runs with different random seeds and recording the mean and standard deviation. These results indicate that the contrastive framework improves the overall quality of the generated topics.
Topic-by-topic basis To further evaluate the performance of our method, we individually compare each of our topics with the aligned topic produced by the baseline neural topic model. Following Hoyle et al. (2020), we use a variant of competitive linking to greedily approximate the optimal weight of the bipartite graph matching. Specifically, a bipartite graph is constructed by linking the topics of our model and those of the baseline. The weight of each link is the Jensen-Shannon (JS) divergence Wong and You (1985); Lin (1991) between the two topics. We iteratively choose the pair with the lowest JS score, remove those two topics from the topic list, and repeat until the JS score surpasses a certain threshold. Fig. 3 (left) shows the aligned scores for the three benchmark corpora. Using visual inspection, we choose the 44 most aligned topic pairs to conduct the comparison. As shown in Fig. 3 (right), our model has more topics with a higher NPMI score than the baseline model. This means that our model generates better topics not only on average but also on a topic-by-topic basis.
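The greedy alignment described above can be sketched as follows. This is an illustrative version under stated assumptions rather than the exact variant of competitive linking used in the experiments: each topic is a probability distribution over the same vocabulary given as a list of floats, the JS divergence uses the natural logarithm, and the `threshold` value is left to the caller. Function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q); terms with p_i = 0 contribute nothing."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def competitive_linking(topics_a, topics_b, threshold):
    """Greedily pair topics by ascending JS divergence.

    Returns a list of (index_in_a, index_in_b, js_score) tuples. Each topic
    is used at most once, and pairing stops once the lowest remaining
    score exceeds the threshold.
    """
    pairs = sorted(
        (js_divergence(p, q), i, j)
        for i, p in enumerate(topics_a)
        for j, q in enumerate(topics_b)
    )
    used_a, used_b, matches = set(), set(), []
    for score, i, j in pairs:
        if score > threshold:
            break
        if i in used_a or j in used_b:
            continue  # one of the two topics is already matched
        used_a.add(i)
        used_b.add(j)
        matches.append((i, j, score))
    return matches
```

Sorting all candidate pairs once and skipping those whose topics are already matched is equivalent to repeatedly picking the lowest-scoring remaining pair and removing its two topics from the lists.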
5.2 Text classification
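Table 2: Text classification results (F1-score).
| Model | 20NG | IMDb |
| --- | --- | --- |
| BATM Wang et al. (2020) | 30.8 | 66.0 |
| SCHOLAR Card et al. (2017) | 52.9 | 83.4 |
| SCHOLAR + BAT Hoyle et al. (2020) | 32.2 | 73.1 |
| Our model | 54.4 | 84.2 |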
In order to compare the extrinsic predictive performance, we use document classification as the downstream task. We collect the latent vectors inferred by neural topic models in and train a Random Forest with the number of decision trees as and the maximum depth as to predict the class of each document. We pick IMDb and 20NG for our experiment. Our method outperforms the other neural topic models on downstream text classification, as shown in Tab. 2.
5.3 Ablation Study
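Table 3: Ablation study, NPMI (mean ± standard deviation). For each dataset, the two columns correspond to the two topic-number settings.
| Method | 20NG | 20NG | IMDb | IMDb | Wiki | Wiki |
| --- | --- | --- | --- | --- | --- | --- |
| Our method | 0.334 ± 0.004 | 0.280 ± 0.003 | 0.197 ± 0.006 | 0.190 ± 0.002 | 0.497 ± 0.009 | 0.478 ± 0.006 |
| w/o positive sampling | 0.320 ± 0.004 | 0.272 ± 0.002 | 0.187 ± 0.006 | 0.182 ± 0.007 | 0.452 ± 0.012 | 0.448 ± 0.009 |
| w/o negative sampling | 0.331 ± 0.002 | 0.277 ± 0.002 | 0.195 ± 0.008 | 0.188 ± 0.003 | 0.474 ± 0.010 | 0.468 ± 0.007 |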
To verify the effectiveness of mimicking human behavior in learning topics by grasping commonalities, we train our method under the best setting (, with word-based sampling), but with two different objectives: (1) Without positive sampling: the model captures semantic patterns only by distinguishing the input from the negative sample; (2) Without negative sampling: the model learns semantic patterns solely by maximizing the similarity of the input with the positive sample. Tab. 3 shows that losing either of the two views in the contrastive framework degrades the quality of the topics. We include the optimization objectives for the two approaches in the Appendix. Notably, removing the negative objective degrades performance less than removing the positive one. This supports our choice to focus on the effect of the positive sample, which takes inspiration from the human perspective.
6 Analysis
6.1 Effect of adaptive controlling parameter
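Table 4: NPMI of different sampling strategies (mean ± standard deviation). For each dataset, the two columns correspond to the two topic-number settings.
| Sampling strategy | 20NG | 20NG | IMDb | IMDb | Wiki | Wiki |
| --- | --- | --- | --- | --- | --- | --- |
| 0-sampling | 0.269 ± 0.003 | 0.231 ± 0.001 | 0.171 ± 0.005 | 0.172 ± 0.002 | 0.448 ± 0.008 | 0.429 ± 0.007 |
| Random sampling | 0.321 ± 0.005 | 0.273 ± 0.001 | 0.183 ± 0.002 | 0.177 ± 0.001 | 0.460 ± 0.012 | 0.462 ± 0.003 |
| Topic-based sampling | 0.313 ± 0.004 | 0.270 ± 0.005 | 0.189 ± 0.002 | 0.172 ± 0.002 | 0.467 ± 0.012 | 0.464 ± 0.002 |
| Topic-based sampling | 0.322 ± 0.005 | 0.268 ± 0.002 | 0.181 ± 0.006 | 0.170 ± 0.007 | 0.450 ± 0.013 | 0.461 ± 0.008 |
| Topic-based sampling | 0.319 ± 0.001 | 0.273 ± 0.002 | 0.176 ± 0.007 | 0.170 ± 0.003 | 0.472 ± 0.007 | 0.444 ± 0.006 |
| Our method | 0.334 ± 0.004 | 0.280 ± 0.003 | 0.197 ± 0.006 | 0.188 ± 0.002 | 0.497 ± 0.009 | 0.478 ± 0.006 |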
We then show the relation between , which controls the impact of our constraint, and the topic coherence measure in Fig. 4. As shown in the figure, the adaptive weight consistently outperforms a manually tuned constant parameter. We attribute this performance to the triangle scheduling, which brings self-adjustment in different training stages.
6.2 Random Sampling Strategy
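Table 5: p-values of the significance tests against random sampling.
| Number of Topics | 20NG | IMDb | Wiki |
| --- | --- | --- | --- |
| | 0.0140 | 0.0291 | 0.0344 |
| | 0.0494 | 0.0012 | 0.0156 |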
In this section, we demonstrate the effectiveness of our sampling strategy. We compare our performance with three other methods: (1) 0-sampling: we replace the weights of chosen tokens in the BoW with 0; (2) Random sampling: we create the negative samples by drawing other documents from the dataset, then extracting the topic vector of each document; we do not perform positive sampling in this variant; (3) Topic-based sampling: the sampling strategy we discussed in Section 3.4, for which we experiment with varying choices of . As shown in Tab. 4, our sampling method consistently outperforms the other strategies. This supports our hypothesis that topic-based sampling is vulnerable to drawing insufficient or redundant topics, which might harm performance.
In addition, to further evaluate the statistical significance of our improvement over the traditional random sampling method, we conduct significance testing and report the p-values in Tab. 5. All of the p-values are smaller than 0.05, which indicates that the improvement of our method over traditional contrastive learning is statistically significant.
6.3 Importance Measure
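Table 6: NPMI of different importance measures (mean ± standard deviation).
| Metrics | IMDb | 20NG | Wiki |
| --- | --- | --- | --- |
| PCA | 0.184 ± 0.004 | 0.325 ± 0.003 | 0.481 ± 0.005 |
| SVD | 0.181 ± 0.004 | 0.313 ± 0.003 | 0.476 ± 0.014 |
| tf | 0.196 ± 0.003 | 0.332 ± 0.006 | 0.495 ± 0.008 |
| idf | 0.193 ± 0.001 | 0.334 ± 0.004 | 0.490 ± 0.009 |
| tf-idf | 0.197 ± 0.006 | 0.334 ± 0.004 | 0.497 ± 0.009 |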
Our word-based sampling strategy employs the tf-idf measure to determine the important and unimportant words whose values are replaced to form positive and negative samples.
For a fair comparison, we also conduct experiments with two other, more complex sampling methods using Principal Component Analysis (PCA) or Singular Value Decomposition (SVD). Specifically, we decompose the reconstructed and original input vectors into singular values and then replace the largest/smallest singular values of the input with the largest/smallest ones of the reconstructed vector to obtain negative/positive samples, respectively. For SVD, we choose largest/smallest values for substitution, whereas for PCA, we decompose the input vector onto d space in order to make it similar to the latent space of the neural topic model (number of topics ) and proceed to substitute largest/smallest values as in SVD. We conducted our experiments on 3 datasets, IMDb, 20NG, and Wiki, with , and report the results (NPMI) in Tab. 6.
Despite its simplicity, the tf-idf-based sampling method outperforms the other, more complicated sampling methods in our tasks.
Table 7: Sample topics generated by SCHOLAR and by our model.
| Dataset | Method | NPMI | Topic |
| --- | --- | --- | --- |
| 20NG | SCHOLAR | 0.259 | max bush clinton crypto pgp clipper nsa announcement air escrow |
| 20NG | Our model | 0.543 | crypto clipper encryption nsa escrow wiretap chip proposal warrant secure |
| Wiki | SCHOLAR | 0.196 | airlines boeing vehicle manufactured flight skiing airline ski engine alpine |
| Wiki | Our model | 0.564 | skiing ski alpine athletes para paralympic nordic olympic paralympics ipc |
| IMDb | SCHOLAR | 0.145 | hong chinese kong imagery japanese rape lynch torture violence disturbing |
| IMDb | Our model | 0.216 | hong chinese kong japan fairy japanese sword martial fantasy magical |
6.4 Case Studies
We randomly extract a sample topic from each of the three datasets to study the quality of the generated topics and show the results in Tab. 7. Generally, the topic words generated by our model tend to concentrate on the main topic of the document. For example, in the 20NG dataset, our words concentrate on the topic related to cryptography (encryption, crypto, etc.) and computer hardware (chip, wiretap, clipper, etc.), rather than on political words such as bush and clinton, which are generated by the SCHOLAR model. Our generated topic in Wiki is more focused on skiing, while SCHOLAR's topic includes transportation terms such as vehicle, boeing, and engine. Similarly, the topic words in IMDb generated by our model mainly reflect the theme of fantasy movies (japan, chinese, hong, kong), while not including off-topic words such as torture and disturbing, which were generated by the SCHOLAR model.
7 Conclusion
In this paper, we propose a novel method to help neural topic models learn more meaningful representations. Approaching the problem from a mathematical perspective, we force our model to consider the effects of both positive and negative pairs. To better capture semantic patterns, we introduce a novel sampling strategy which takes inspiration from human behavior in differentiating documents. Experimental results on three common benchmark datasets show that our method outperforms other state-of-the-art neural topic models in terms of topic coherence.
References

Learning to detect and retrieve objects from unlabeled videos.
In
2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)
, pp. 3713–3717. Cited by: §2.  Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910. Cited by: §1.

Latent dirichlet allocation.
the Journal of machine Learning research
3, pp. 993–1022. Cited by: §1. 
Combining labeled and unlabeled data with cotraining.
In
Proceedings of the eleventh annual conference on Computational learning theory
, pp. 92–100. Cited by: §1.  Neural models for documents with metadata. arXiv preprint arXiv:1705.09296. Cited by: §1, §2, 2nd item, §5.1, Table 1, Table 2.
 Contrastive learning of global and local features for medical image segmentation with limited annotations. arXiv preprint arXiv:2006.10511. Cited by: §2, §3.2.
 A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607. Cited by: §1, §2.
 Big selfsupervised models are strong semisupervised learners. arXiv preprint arXiv:2006.10029. Cited by: §1, §2.
 Debiased contrastive learning. arXiv preprint arXiv:2007.00224. Cited by: §1, §2, §3.2.
 Coherenceaware neural topic modeling. arXiv preprint arXiv:1809.02687. Cited by: §2, 1st item, Table 1.

Dual contrastive learning for unsupervised imagetoimage translation
. arXiv preprint arXiv:2104.07689. Cited by: §3.2.  Contrastive multiview representation learning on graphs. In International Conference on Machine Learning, pp. 4116–4126. Cited by: §2.
 Dataefficient image recognition with contrastive predictive coding. In International Conference on Machine Learning, pp. 4182–4192. Cited by: §2.
 Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670. Cited by: §2.
 Contrastive learning with adversarial examples. arXiv preprint arXiv:2010.12050. Cited by: §2.
 Improving neural topic models using knowledge distillation. arXiv preprint arXiv:2010.02377. Cited by: §2, 3rd item, §4.1, §5.1, §5.1, Table 1, Table 2.
 Neural topic modeling with cycleconsistent adversarial training. arXiv preprint arXiv:2009.13971. Cited by: §1, §2.
 OTLDA: a geometryaware optimal transport approach for topic modeling. Advances in Neural Information Processing Systems 33. Cited by: 1st item.
 Unsupervised hard example mining from videos for improved object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 307–324. Cited by: §2.
 Minima of functions of several variables with inequalities as side constraints. M. Sc. Dissertation. Dept. of Mathematics, Univ. of Chicago. Cited by: §3.2.
 Universal weakly supervised segmentation by pixel-to-segment contrastive learning. arXiv preprint arXiv:2105.00957. Cited by: §2.
 Supervised contrastive learning. arXiv preprint arXiv:2004.11362. Cited by: §2, §3.2.
 Adversarial self-supervised contrastive learning. arXiv preprint arXiv:2006.07589. Cited by: §2, §3.2.
 Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114. Cited by: §2.
 Nonlinear programming. In Traces and emergence of nonlinear programming, pp. 247–258. Cited by: §3.2.
 Newsweeder: learning to filter netnews. In Machine Learning Proceedings 1995, pp. 331–339. Cited by: 1st item.
 Graph matching networks for learning the similarity of graph structured objects. In International Conference on Machine Learning, pp. 3835–3845. Cited by: §2.
 Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory 37 (1), pp. 145–151. Cited by: §5.1.
 An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893. Cited by: §2.
 Investigating task performance of probabilistic topic models: an empirical study of PLSA and LDA. Information Retrieval 14 (2), pp. 178–203. Cited by: §1.
 Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, pp. 142–150. Cited by: 3rd item.
 Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843. Cited by: 2nd item.
 Discovering discrete latent topics with neural variational inference. In International Conference on Machine Learning, pp. 2410–2419. Cited by: §1, §2, §3.4.
 Neural variational inference for text processing. In International conference on machine learning, pp. 1727–1736. Cited by: §2.

 Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence 41 (8), pp. 1979–1993. Cited by: §2.
 Topic modeling with Wasserstein autoencoders. arXiv preprint arXiv:1907.12374. Cited by: §1, 4th item, Table 1.
 Enriching and controlling global semantics for text summarization. arXiv preprint arXiv:2109.10616. Cited by: §1.
 Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §2.
 Stochastic backpropagation and approximate inference in deep generative models. In International conference on machine learning, pp. 1278–1286. Cited by: §2.
 Contrastive learning with hard negative samples. arXiv preprint arXiv:2010.04592. Cited by: §1, §2.
 Autoencoding variational inference for topic models. arXiv preprint arXiv:1703.01488. Cited by: §1, §2.
 A novel approach of neural topic modelling for document clustering. In 2018 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 2169–2173. Cited by: §1.
 FSCE: few-shot object detection via contrastive proposal encoding. arXiv preprint arXiv:2103.05950. Cited by: §2.
 InfoGraph: unsupervised and semi-supervised graph-level representation learning via mutual information maximization. arXiv preprint arXiv:1908.01000. Cited by: §2.
 Contrastive multiview coding. arXiv preprint arXiv:1906.05849. Cited by: §1.
 What makes for good views for contrastive learning. arXiv preprint arXiv:2005.10243. Cited by: §1, §2.

 Capturing greater context for question generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 9065–9072. Cited by: §1.
 How pandemic spread in news: text analysis using topic model. arXiv preprint arXiv:2102.04205. Cited by: §1.
 Neural topic modeling with bidirectional adversarial training. arXiv preprint arXiv:2004.12331. Cited by: §1, §1, §2, 5th item, Table 1, Table 2.
 ATM: adversarial-neural topic model. Information Processing & Management 56 (6), pp. 102098. Cited by: §1, §1, §2, §5.1.
 Open event extraction from online text using a generative adversarial network. arXiv preprint arXiv:1908.09246. Cited by: §1, §2.

 Entropy and distance of random graphs with application to structural pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (5), pp. 599–609. Cited by: §5.1.
 DetCo: unsupervised contrastive learning for object detection. arXiv preprint arXiv:2102.04803. Cited by: §2.
 A survey on multi-view learning. arXiv preprint arXiv:1304.5634. Cited by: §1.
 Graph contrastive learning with augmentations. Advances in Neural Information Processing Systems 33. Cited by: §2, §3.2.
 Contrastive learning for label-efficient semantic segmentation. arXiv preprint arXiv:2012.06985. Cited by: §2.
Appendix A Implementation details
In this section, we report the hyperparameters used in this work, such as the learning rate and batch size. We use a different set of hyperparameters for each dataset on which the neural topic model is trained.
Hyperparameter   20NG    IMDb    Wiki
Learning rate    0.002   0.001   0.002
Batch size       200     200     500
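As a minimal sketch of how these per-dataset settings could be organized in code, the snippet below stores the values from the table in a dictionary. The names `HYPERPARAMS` and `get_hyperparams` are hypothetical and are not taken from the implementation used in this work.

```python
# Per-dataset hyperparameters copied from the table above.
# The dictionary layout and the function name are illustrative assumptions.
HYPERPARAMS = {
    "20NG": {"learning_rate": 0.002, "batch_size": 200},
    "IMDb": {"learning_rate": 0.001, "batch_size": 200},
    "Wiki": {"learning_rate": 0.002, "batch_size": 500},
}


def get_hyperparams(dataset: str) -> dict:
    """Return a copy of the settings for one dataset, or raise if unknown."""
    if dataset not in HYPERPARAMS:
        raise KeyError(f"no hyperparameters recorded for dataset {dataset!r}")
    return dict(HYPERPARAMS[dataset])
```

Returning a copy keeps the recorded values from being modified by accident when a caller adjusts a setting for one run.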
Appendix B Contrastive loss derivation
In this section, we prove inequality (4).
Theorem 1.
Let denote the word count representation of a document, denote the positive sample and negative sample with respect to , denote the mapping function of the encoder, denote the positive KKT multiplier, and denote the strength of constraint. Suppose , then we have the following inequality
(7) 
Proof.
Appendix C Versions of loss function
We describe the versions of the loss function used in this work.
Contrastive approach: using both positive and negative samples
(8) 
Contrastive approach: using only the positive sample
(9) 
Contrastive approach: using only the negative sample
(10) 
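To make the difference between the three versions concrete, the sketch below gives one possible instantiation. It is an illustration under stated assumptions and not necessarily the exact form of Eqs. (8)–(10): similarity is assumed to be the dot product of latent vectors, the two-sided version is assumed to be an InfoNCE-style ratio with a hypothetical weight `beta` on the negative term, and the one-sided versions are assumed to simply raise or lower a single similarity. All function names are hypothetical.

```python
import math
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot-product similarity between two latent vectors (an assumed choice)."""
    return sum(a * b for a, b in zip(u, v))


def loss_both(z, z_pos, z_neg, beta: float = 1.0) -> float:
    """Assumed two-sided form: -log(e^{s+} / (e^{s+} + beta * e^{s-}))."""
    s_pos = dot(z, z_pos)
    s_neg = dot(z, z_neg)
    # Log-sum-exp trick for numerical stability of the denominator.
    m = max(s_pos, s_neg)
    log_denominator = m + math.log(math.exp(s_pos - m) + beta * math.exp(s_neg - m))
    return log_denominator - s_pos


def loss_positive_only(z, z_pos) -> float:
    """Assumed positive-only form: minimizing it raises similarity to z_pos."""
    return -dot(z, z_pos)


def loss_negative_only(z, z_neg) -> float:
    """Assumed negative-only form: minimizing it lowers similarity to z_neg."""
    return dot(z, z_neg)
```

Under these assumptions, the two-sided loss is small only when the anchor is simultaneously close to the positive sample and far from the negative one, whereas each one-sided loss controls only one of the two similarities.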
Appendix D Understanding number of chosen tokens
We demonstrate the effect of changing the number of tokens chosen for sampling. We train with different numbers of chosen tokens and record the topic coherence. For visibility, we normalize the results to a common scale before plotting them in Fig. 5. The performance initially increases as we select more tokens from the reconstructed output to substitute for the drawn sample. However, when the number of selected tokens grows too large, the topic coherence starts to decrease as that number increases further. We hypothesize that an overwhelming number of substituted values can alter the semantics of the positive samples while producing effectively random negative samples.
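The substitution step discussed above can be sketched as follows. This is a simplified illustration under assumptions that are not stated in this appendix: tokens are ranked by a salience weight (for example tf-idf), the positive sample substitutes the `k` least salient entries with the reconstructed values, and the negative sample substitutes the `k` most salient entries. The function name and arguments are hypothetical.

```python
from typing import List, Sequence, Tuple


def substitute_tokens(
    x: Sequence[float],
    x_recon: Sequence[float],
    weights: Sequence[float],
    k: int,
) -> Tuple[List[float], List[float]]:
    """Return (positive, negative) samples built from x by substituting k entries."""
    if not (len(x) == len(x_recon) == len(weights)):
        raise ValueError("x, x_recon and weights must have the same length")
    if not 0 <= k <= len(x):
        raise ValueError("k must be between 0 and the vocabulary size")

    # Token indices sorted from least to most salient.
    order = sorted(range(len(x)), key=lambda i: weights[i])
    low, high = order[:k], order[len(order) - k:]

    positive, negative = list(x), list(x)
    for i in low:   # assumed: changing unimportant tokens keeps the semantics
        positive[i] = x_recon[i]
    for i in high:  # assumed: changing salient tokens alters the semantics
        negative[i] = x_recon[i]
    return positive, negative
```

In this sketch, setting `k` to the vocabulary size makes both samples identical to the reconstruction, which is consistent with the hypothesis that very large values blur the distinction between positive and negative samples.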