
Contrastive Learning for Neural Topic Model

Recent empirical studies show that adversarial topic models (ATM) can successfully capture semantic patterns of a document by differentiating it from a dissimilar sample. However, this discriminative-generative architecture has two important drawbacks: (1) the architecture does not relate similar documents, which have the same document-word distribution of salient words; (2) it restricts the ability to integrate external information, such as the sentiment of the document, which has been shown to benefit the training of neural topic models. To address these issues, we revisit the adversarial topic architecture from the viewpoint of mathematical analysis, propose a novel approach that reformulates the discriminative goal as an optimization problem, and design a novel sampling method that facilitates the integration of external variables. The reformulation encourages the model to incorporate the relations among similar samples and enforces a constraint on the similarity among dissimilar ones, while the sampling method, which is based on the internal input and reconstructed output, helps inform the model of salient words contributing to the main topic. Experimental results show that our framework outperforms other state-of-the-art neural topic models in terms of topic coherence on three common benchmark datasets that span various domains, vocabulary sizes, and document lengths.





1 Introduction

Topic models have been successfully applied in Natural Language Processing with various applications such as information extraction, text clustering, summarization, and sentiment analysis Lu et al. (2011); Subramani et al. (2018); Tuan et al. (2020); Wang et al. (2019b); Wang and Mengoni (2021); Nguyen et al. (2021). The most popular conventional topic model, Latent Dirichlet Allocation Blei et al. (2003), learns document-topic and topic-word distributions via Gibbs sampling and mean field approximation. To apply deep neural networks to topic modeling, Miao et al. (2017) proposed to use neural variational inference as the training method, while Srivastava and Sutton (2017) employed the logistic normal prior distribution. However, recent studies Wang et al. (2019a, 2020) showed that both Gaussian and logistic normal priors fail to capture multi-modality aspects and semantic patterns of a document, which are crucial to maintain the quality of a topic model.

To cope with this issue, the Adversarial Topic Model (ATM) Wang et al. (2019a, 2020); Hu et al. (2020); Nan et al. (2019) was proposed, with adversarial mechanisms that combine a generator and a discriminator. By seeking the equilibrium between the two, the generator is capable of learning meaningful semantic patterns of the document. Nonetheless, this framework has two main limitations. First, ATM relies on one key ingredient: discriminating the real distribution from the fake (negative) distribution to guide the training. Since the sampling of the fake distribution is not conditioned on the real distribution, it rarely generates positive samples that largely preserve the semantic content of the real sample. This limits the model's ability to exploit the mutual information between the positive sample and the real one, which has been demonstrated to be a key driver of learning useful representations in unsupervised learning Blum and Mitchell (1998); Xu et al. (2013); Bachman et al. (2019); Chen et al. (2020a); Tian et al. (2020). Second, ATM takes random samples from a prior distribution to feed to the generator. Previous work Card et al. (2017) has shown that incorporating additional variables, such as metadata or sentiment, to estimate the topic distribution aids the learning of coherent topics. Because it relies on a pre-defined prior distribution, ATM hinders the integration of those variables.

To address the above drawbacks, in this paper we propose a novel method to model the relations among samples without relying on the generative-discriminative architecture. In particular, we formulate the objective as an optimization problem that aims to move the representation of the input (or prototype) closer to the one that shares the semantic content, i.e., positive sample. We also take into account the relation of the prototype and the negative sample by forming an auxiliary constraint to enforce the model to push the representation of the negative farther apart from the prototype. Our mathematical framework ends with a contrastive objective, which will be jointly optimized with the evidence lower bound of neural topic model.

Nonetheless, another challenge arises: how can positive and negative samples be generated effectively in the neural topic model setting? Recent efforts have addressed positive sampling strategies and methods to generate hard negative samples for images Chuang et al. (2020); Robinson et al. (2020); Chen et al. (2020b); Tian et al. (2019). However, adapting these techniques to the neural topic model setting has been neglected in the literature. In this work, we introduce a novel sampling method that mimics the way humans grasp the similarity of a pair of documents, which is based on the following hypothesis:

Hypothesis 1.

The common theme of the prototype and the positive sample can be realized due to their relative frequency of salient words.

We use the example in Fig. 1 to explain the idea of our method. Humans are able to tell that the input is similar to the positive sample because the frequency of salient words such as "league" and "teams" in the input is proportional to their frequency in the positive sample. On the other hand, the input can be separated from the negative sample because those salient words do not occur in the negative sample, even though both documents contain the words "billions" and "dollars", which are not salient in the context of the input. Based on this intuition, our method constructs the positive samples by maintaining the weights of salient entries in the prototype and altering those of unimportant ones, and constructs the negative samples by performing the opposite procedure. Since our method does not depend on a fixed prior distribution to draw samples, it is not restricted in incorporating external variables that provide additional knowledge for learning better topics.

Figure 1: Illustration of a document with one positive and negative pair.

In a nutshell, the contributions of our paper are as follows:

  • We target the problem of capturing meaningful representations through modeling the relations among samples from a new mathematical perspective and propose a novel contrastive objective which is jointly optimized with evidence lower bound of neural topic model. We find that capturing the mutual information between the prototype and its positive samples provides a strong foundation for constructing coherent topics, while differentiating the prototype from the negative samples plays a less important role.

  • We propose a novel sampling strategy that is motivated by human behavior when comparing different documents. By relying on the reconstructed output, we adapt the sampling to the learning process of the model, and produce the most informative samples compared with other sampling strategies.

  • We conduct extensive experiments on three common topic modeling datasets and demonstrate the effectiveness of our approach, which outperforms other state-of-the-art approaches in terms of topic coherence on both a global and a topic-by-topic basis.

2 Related Work

Neural Topic Model (NTM) has been studied to encode a large set of documents using latent vectors. Inspired by the Variational Autoencoder (VAE), NTMs inherit most techniques from early VAE-specific works, such as the reparameterization trick Kingma and Welling (2013) and neural variational inference Rezende et al. (2014). Subsequent works applying these techniques to topic models Srivastava and Sutton (2017); Miao et al. (2016, 2017) focus on studying various prior distributions, e.g. Gaussian or logistic normal. More recent research directly targets topic coherence by formulating it as an optimization objective Ding et al. (2018), incorporating contextual language knowledge Hoyle et al. (2020), or passing external information, e.g. sentiment or document groups, as input Card et al. (2017). Generating topics that are human-interpretable has become the goal of a wide variety of recent efforts.

Adversarial Topic Model Wang et al. (2019b) is a topic modeling approach that models topics with a GAN-based architecture. The key components of that architecture are a generator, which projects a randomly sampled document-topic distribution to a document-word distribution that is as realistic as possible, and a discriminator, which tries to distinguish between the generated and the true samples Wang et al. (2019a, 2020). To better learn informative representations of a document, Hu et al. (2020) proposed adding two cycle-consistent constraints to encourage coordination between the encoder and the generator.

Contrastive Framework and Sampling Techniques Various efforts have studied contrastive methods for learning meaningful representations. For visual information, the contrastive framework is applied to tasks such as image classification Khosla et al. (2020); Hjelm et al. (2018), object detection Xie et al. (2021); Sun et al. (2021); Amrani et al. (2019), and image segmentation Zhao et al. (2020); Chaitanya et al. (2020); Ke et al. (2021). Applications beyond images include adversarial training Ho and Vasconcelos (2020); Miyato et al. (2018); Kim et al. (2020), graphs You et al. (2020); Sun et al. (2019); Li et al. (2019); Hassani and Khasahmadi (2020), and sequence modeling Logeswaran and Lee (2018); Oord et al. (2018); Henaff (2020). Specific positive sampling strategies have been proposed to improve the performance of contrastive learning, e.g. applying view-based transformations that preserve the semantic content of the image Chen et al. (2020b, a); Tian et al. (2020). On the other hand, there is a recent surge of interest in negative sampling methods. Chuang et al. (2020) propose a debiasing method that corrects for false negative samples. For object detection, Jin et al. (2018) employ the temporal structure of video to generate negative examples. Although contrastive techniques have been widely studied, little effort has been made to adapt them to neural topic models.

In this paper, we reformulate our goal of learning document representations in neural topic models as a contrastive objective. The form of our objective is most closely related to that of Robinson et al. (2020). However, there are two key differences: (1) whereas they use the weighting factor associated with the impact of the negative sample as a tool to search for the distribution of hard negative samples, we treat it as an adaptive parameter that controls the impact of the positive and negative samples on learning; (2) we regard the effect of the positive sample as the main driver of meaningful representations, while they exploit the impact of negative ones. Our approach is more applicable to topic modeling, as supported by our investigation into how humans distinguish among documents.

3 Methodology

3.1 Notations and Problem Setting

In this paper, we focus on improving the performance of the neural topic model (NTM), measured via topic coherence. NTM inherits the architecture of the Variational Autoencoder, where the latent vector is taken as the topic distribution. Suppose the vocabulary has $V$ unique words; each document is represented as a word count vector $x \in \mathbb{R}^V$ and a latent distribution over $T$ topics, $z \in \mathbb{R}^T$. NTM assumes that $z$ is generated from a prior distribution $p(z)$ and $x$ is generated from the conditional distribution over the topic $p_\phi(x \mid z)$ by a decoder $\phi$. The aim of the model is to infer the document-topic distribution given the word count. In other words, it must estimate the posterior distribution $p(z \mid x)$, which is approximated by the variational distribution $q_\theta(z \mid x)$ modelled by an encoder $\theta$. NTM is trained by minimizing the following objective

\[ \mathcal{L}_{\mathrm{VAE}}(x) = -\mathbb{E}_{q_\theta(z \mid x)}\left[\log p_\phi(x \mid z)\right] + \mathrm{KL}\left[q_\theta(z \mid x) \,\|\, p(z)\right] \tag{1} \]
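To make the training objective concrete, the following is a minimal sketch of the negative evidence lower bound for a single document. It assumes a diagonal Gaussian posterior with a standard normal prior and a softmax decoder over the vocabulary; these choices, and the names `negative_elbo`, `word_logits`, `mu`, and `log_var`, are illustrative assumptions rather than the paper's implementation.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of decoder scores."""
    m = max(scores)
    log_total = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_total for s in scores]


def negative_elbo(x, word_logits, mu, log_var):
    """Negative ELBO for one document (assumed Gaussian posterior, N(0, I) prior).

    x           -- word count vector of length V
    word_logits -- decoder scores of length V (hypothetical decoder output)
    mu, log_var -- posterior mean and log-variance, each of length T
    """
    log_probs = log_softmax(word_logits)
    # Reconstruction term: negative log-likelihood of the observed counts.
    reconstruction = -sum(count * lp for count, lp in zip(x, log_probs))
    # Closed-form KL divergence between N(mu, diag(exp(log_var))) and N(0, I).
    kl = 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))
    return reconstruction + kl


# Uniform decoder over 3 words and a posterior equal to the prior.
loss = negative_elbo([2, 0, 1], [0.0, 0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
```

With a posterior equal to the prior the KL term vanishes, so the loss reduces to the reconstruction term alone.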


3.2 Contrastive objective derivation

Let $X = \{x\}$ denote the set of document bag-of-words. Each vector $x$ is associated with a negative sample $x^-$ and a positive sample $x^+$. We assume a discrete set of latent classes $\mathcal{C}$, so that $(x, x^+)$ have the same latent class while $(x, x^-)$ do not. In this work, we choose to use the semantic dot product to measure the similarity between the prototype $x$ and the drawn samples.

Our goal is to learn a mapping function $f_\theta$ of the encoder $\theta$ which transforms $x$ to the latent distribution $z$ ($x^-$ and $x^+$ are transformed to $z^-$ and $z^+$, respectively). A reasonable mapping function must fulfill two qualities: (1) $x$ and $x^+$ are mapped onto nearby positions; (2) $x$ and $x^-$ are projected distantly. Regarding goal (1) as the main objective and goal (2) as the constraint enforcing the model to learn the relations among dissimilar samples, we specify the constrained optimization problem, in which $\epsilon$ denotes the strength of the constraint

\[ \max_\theta \; \mathbb{E}_{x \sim X}\left(z \cdot z^+\right) \quad \text{subject to} \quad \mathbb{E}_{x \sim X}\left(z \cdot z^-\right) < \epsilon \tag{2} \]


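Rewriting Eq. 2 as a Lagrangian under KKT conditions Kuhn and Tucker (2014); Karush (1939), we attain:

\[ \mathcal{F}(\theta, x, x^+, x^-) = \mathbb{E}_{x \sim X}\left(z \cdot z^+\right) - \alpha \cdot \left[\mathbb{E}_{x \sim X}\left(z \cdot z^-\right) - \epsilon\right] \tag{3} \]

where the positive KKT multiplier $\alpha$ is the regularisation coefficient that controls the effect of the negative sample on training. Eq. 3 can be derived to arrive at the weighted-contrastive loss.

\[ \mathcal{L}_{\mathrm{cont}}(\theta, x, x^+, x^-) = \mathbb{E}_{x \sim X}\left[\log \frac{\exp(z \cdot z^+)}{\exp(z \cdot z^+) + \beta \cdot \exp(z \cdot z^-)}\right] \tag{4} \]

where $\alpha = \exp(\beta)$. The full proof of (4) can be found in the Appendix. Previous works Kim et al. (2020); Chaitanya et al. (2020); You et al. (2020); Khosla et al. (2020); Chuang et al. (2020); Han et al. (2021) weight the positive and negative samples equally, which corresponds to setting $\beta = 1$. In this paper, we leverage different values of $\beta$ to guide the model to concentrate on the sample that is distinct from the input. Consequently, a reasonable value of $\beta$ will provide a clear separation among topics in the dataset. We demonstrate our procedure for estimating $\beta$ in the following section.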
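As an illustration of the weighted-contrastive objective, the sketch below evaluates the log-ratio for one (prototype, positive, negative) triple of latent vectors. It assumes the form log(exp(z·z⁺) / (exp(z·z⁺) + w·exp(z·z⁻))) with a positive weight on the negative term; the function name `weighted_contrastive` and the parameter name `beta` are hypothetical.

```python
import math


def dot(u, v):
    """Dot product, used as the similarity measure between latent vectors."""
    return sum(a * b for a, b in zip(u, v))


def weighted_contrastive(z, z_pos, z_neg, beta=1.0):
    """log( e^{z.z+} / (e^{z.z+} + beta * e^{z.z-}) ) for one triple.

    beta must be positive; beta = 1 weights both samples equally.
    The value is at most 0 and is meant to be maximised.
    """
    # Rewrite as -log(1 + beta * e^{neg - pos}) = -softplus(t) for stability.
    t = dot(z, z_neg) - dot(z, z_pos) + math.log(beta)
    return -(max(t, 0.0) + math.log1p(math.exp(-abs(t))))


z = [0.7, 0.2, 0.1]
z_pos = [0.6, 0.3, 0.1]   # shares the dominant topic with z
z_neg = [0.1, 0.2, 0.7]   # dominated by a different topic
equal_weight = weighted_contrastive(z, z_pos, z_neg, beta=1.0)
heavy_negative = weighted_contrastive(z, z_pos, z_neg, beta=2.0)
```

Raising the weight increases the penalty contributed by the negative sample, so `heavy_negative` is lower than `equal_weight` for the same triple.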

Algorithm 1 Approximate $\beta$
Input: dataset, model parameter $\theta$, model $f$, total number of training steps $T$
Randomly pick a batch of samples from the training set
for each sample $x$ in the chosen batch do
     Draw the negative sample $x^-$ and a positive sample $x^+$
     Obtain the latent distributions associated with the drawn samples: $z^-$, $z^+$
     Obtain the candidate value of $\beta$ for this sample
end for
Initialize $\beta$ as the mean of the candidate list
for each training step $t = 1$ to $T$ do
     Train the model with the value of $\beta$ scheduled for step $t$
end for

3.3 Controlling the effect of negative sample

When choosing the value of $\beta$, we need to answer the following questions: (1) What impact does $\beta$ have on the process of training? and (2) Is it possible to design a data-oriented procedure to approximate $\beta$?

Understanding the impact of $\beta$ To address question (1), we study the impact of $\beta$ on the contrastive loss presented in Section 3.2. The gradient of the contrastive loss (4) with respect to the latent distribution $z$, where $z^+$ and $z^-$ are the latent distributions of the positive and negative samples, is:

\[ \frac{\partial \mathcal{L}_{\mathrm{cont}}}{\partial z} = \mathbb{E}_{x \sim X}\left[\frac{\beta \exp(z \cdot z^-)}{\exp(z \cdot z^+) + \beta \exp(z \cdot z^-)} \left(z^+ - z^-\right)\right] \tag{5} \]

This derivation shows that the gradient norm scales with $\beta$. As the training progresses, the update step must be carefully controlled to avoid bouncing around the minimum or getting stuck in local optima.
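The dependence of the gradient on the negative-sample weight can be checked numerically. The sketch below assumes the weighted log-ratio form of the contrastive loss, treats the positive and negative latent vectors as constants, and compares a hand-derived gradient against central finite differences; all function names are illustrative.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def contrastive(z, z_pos, z_neg, beta):
    """Assumed loss: log( e^{z.z+} / (e^{z.z+} + beta * e^{z.z-}) )."""
    pos, neg = dot(z, z_pos), dot(z, z_neg)
    return pos - math.log(math.exp(pos) + beta * math.exp(neg))


def analytic_grad(z, z_pos, z_neg, beta):
    """Gradient w.r.t. z: weight * (z+ - z-), weight = beta e^neg / (e^pos + beta e^neg)."""
    e_pos, e_neg = math.exp(dot(z, z_pos)), math.exp(dot(z, z_neg))
    weight = beta * e_neg / (e_pos + beta * e_neg)
    return [weight * (p - n) for p, n in zip(z_pos, z_neg)]


def numeric_grad(z, z_pos, z_neg, beta, h=1e-6):
    """Central finite differences, one coordinate of z at a time."""
    grad = []
    for i in range(len(z)):
        up, down = list(z), list(z)
        up[i] += h
        down[i] -= h
        grad.append((contrastive(up, z_pos, z_neg, beta)
                     - contrastive(down, z_pos, z_neg, beta)) / (2 * h))
    return grad


def norm(v):
    return math.sqrt(sum(a * a for a in v))


z, z_pos, z_neg = [0.2, 0.5, 0.3], [0.25, 0.45, 0.3], [0.6, 0.1, 0.3]
small = norm(analytic_grad(z, z_pos, z_neg, beta=0.5))
large = norm(analytic_grad(z, z_pos, z_neg, beta=2.0))
```

Under these assumptions the gradient norm grows monotonically with the weight (here `large > small`), though it saturates rather than growing without bound.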

Adaptive scheduling We leverage an adaptive approach to construct a data-oriented procedure for estimating $\beta$. Initially, the neural topic model treats the representation of each document as equally likely. The relation between the similarity of the positive sample to the prototype and the similarity of the negative sample to the prototype therefore provides a starting viewpoint of the model. Concretely, we store that information in the initial value of $\beta$, which is estimated from these two similarities.

After initialisation, to accommodate the model's learning, we continue with an adaptive strategy that keeps updating the value of $\beta$ according to a triangle scheduling procedure. We summarize the details of choosing $\beta$ in Algo. 1.
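The adaptive procedure can be sketched as follows. Two details are assumptions made only for illustration: the per-sample candidate is taken to be the ratio of the positive similarity to the negative similarity, and the triangle schedule is taken to rise linearly by 0.5 until the midpoint of training and then fall back to the initial value. The names `initial_weight` and `triangle_weight` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def initial_weight(triples, eps=1e-12):
    """Mean of per-sample candidates over a batch of (z, z_pos, z_neg) triples.

    Assumption: each candidate is the ratio (z . z_pos) / (z . z_neg);
    eps guards against a zero denominator.
    """
    candidates = [dot(z, zp) / max(dot(z, zn), eps) for z, zp, zn in triples]
    return sum(candidates) / len(candidates)


def triangle_weight(step, total_steps, start):
    """Assumed triangular schedule: start -> start + 0.5 at midpoint -> start."""
    return start + 0.5 - abs(total_steps / 2 - step) / total_steps


batch = [([1.0, 0.0], [1.0, 0.0], [0.5, 0.5])]
start = initial_weight(batch)
schedule = [triangle_weight(t, 100, start) for t in (0, 50, 100)]
```

The schedule is symmetric, so the weight at the last step equals the initial estimate, which matches the idea of taking more careful steps toward the end of training.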

3.4 Word-based Sampling Strategy

Here we provide the technical motivation and details of our sampling method. To choose a sample that has the same underlying topic as the input, it is reasonable to select the topics that hold large values in the document-topic distribution, as they are considered important by the neural topic model. Subsequently, the procedure draws salient words from each of these topics, which contribute their weights to the drawn samples. We call this the topic-based sampling strategy.

However, as shown in Miao et al. (2017), the process of choosing topics is sensitive to the training performance, and it is challenging to determine the optimal number of topics for every single input. Miao et al. (2017) implemented a stick-breaking procedure to predict the number of topics for each document. Their strategy demands approximating the likelihood increase for each decision to break the stick, in other words, each increase in the number of topics that the document covers. Since their process requires a considerable amount of computation, we propose a simpler, word-based approach to draw both positive and negative samples.

For each document with its associated word count vector $x$, we form the tf-idf representation $x^{\mathrm{tfidf}}$. Then, we feed $x$ to the neural topic model to obtain the latent vector $z$ and the reconstructed document $x^{\mathrm{recon}}$. Our word-based sampling strategy is illustrated in Fig. 2.

Negative sampling We select the $k$ tokens $N = \{n_1, \dots, n_k\}$ that have the highest tf-idf scores. We hypothesize that these words mainly contribute to the topic of the document. By substituting the weights of the chosen tokens in the original input $x$ with the weights of the reconstructed representation $x^{\mathrm{recon}}$, i.e. $x^-_{n_j} = x^{\mathrm{recon}}_{n_j}$ for $j = 1, \dots, k$, we force the main content of the negative sample $x^-$ to deviate from that of the original input $x$.

Note that since the model improves its reconstruction ability as training progresses, the weights of salient words in the reconstructed output approach (but do not equal) those in the original input. The model should take more careful learning steps to adapt to this situation. Because the factor controlling the negative sample decays as training approaches its final step, owing to the adaptive scheduling approach described in Section 3.3, the model is able to adapt to this phenomenon.

Positive sampling Contrary to the negative case, we select the $k$ tokens $P = \{p_1, \dots, p_k\}$ possessing the lowest tf-idf scores. We obtain the positive sample $x^+$, which bears a theme resembling that of the original input, by assigning the weights of the chosen tokens in $x^{\mathrm{recon}}$ to their counterparts in $x$ through $x^+_{p_j} = x^{\mathrm{recon}}_{p_j}$ for $j = 1, \dots, k$. This forms a valid positive sampling procedure since modifying the weights of insignificant tokens retains the salient topics of the source document.

Figure 2: Our sampling strategy.
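The word-based sampling strategy can be sketched in a few lines. The function below takes a word count vector, a reconstructed vector, and per-word tf-idf scores, then builds the positive sample by overwriting the lowest-scoring entries and the negative sample by overwriting the highest-scoring entries with their reconstructed values. The function name and argument names are illustrative, and the tf-idf scores are assumed to be precomputed.

```python
def word_based_samples(x, x_recon, tfidf, k):
    """Return (x_pos, x_neg) built from the input x and its reconstruction.

    x       -- original word count vector
    x_recon -- reconstructed document produced by the topic model
    tfidf   -- tf-idf score of each vocabulary entry for this document
    k       -- number of entries to substitute in each sample
    """
    order = sorted(range(len(x)), key=lambda i: tfidf[i])
    lowest = order[:k]                 # least salient words
    highest = order[len(order) - k:]   # most salient words

    x_pos, x_neg = list(x), list(x)
    for i in lowest:
        # Altering unimportant words keeps the main topic: positive sample.
        x_pos[i] = x_recon[i]
    for i in highest:
        # Altering salient words shifts the main content: negative sample.
        x_neg[i] = x_recon[i]
    return x_pos, x_neg


x = [5, 1, 0, 3]
x_recon = [4.0, 0.5, 0.2, 2.0]
tfidf = [0.9, 0.1, 0.0, 0.6]
x_pos, x_neg = word_based_samples(x, x_recon, tfidf, k=1)
```

Here the positive sample differs from the input only at the least salient word, while the negative sample differs at the most salient one.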

3.5 Training objective

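Joint objective We combine the goals of reconstructing the original input and matching the approximate posterior to the true posterior distribution with the contrastive objective specified in Section 3.2:

\[ \mathcal{L}(x, \theta, \phi) = -\mathbb{E}_{q_\theta(z \mid x)}\left[\log p_\phi(x \mid z)\right] + \mathrm{KL}\left[q_\theta(z \mid x) \,\|\, p(z)\right] - \gamma \, \mathbb{E}_{x \sim X}\left[\log \frac{\exp(z \cdot z^+)}{\exp(z \cdot z^+) + \beta \cdot \exp(z \cdot z^-)}\right] \tag{6} \]

where $\gamma$ is the contrastive controlling weight.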


We summarize our learning procedure in Algorithm 2.

Algorithm 2 Contrastive Neural Topic Model
Input: dataset, model parameter $\theta$, model $f$, push-pull balancing factor, contrastive controlling weight
repeat
     for each document $x$ in the current batch do
         Compute $z$, $x^{\mathrm{recon}}$ from $x$;
         Obtain top-k indices of words with smallest tf-idf weights;
         Sample $x^+$ from $x$ and $x^{\mathrm{recon}}$;
         Obtain top-k indices of words with largest tf-idf weights;
         Sample $x^-$ from $x$ and $x^{\mathrm{recon}}$;
     end for
     Compute the loss function defined in Eq. 6;
     Update the model parameters by gradients to minimize the loss;
until the training converges

4 Experimental Setting

In this section, we describe the setups of the experiments conducted to evaluate the performance of our proposed method. A statistical summary of the datasets is provided in the Appendix.

4.1 Datasets

We conduct our experiments on three readily available datasets that belong to various domains, vocabulary sizes, and document lengths:

  • 20Newsgroups (20NG) dataset Lang (1995) consists of about 18000 documents; each document is a newsgroup post associated with a newsgroup label (for example, talk.politics.misc). Following Huynh et al. (2020), we preprocess the dataset by removing stopwords, filtering out words by length, and discarding words whose frequency falls below a threshold. We split the dataset into training, validation, and test sets.

  • Wikitext-103 (Wiki) Merity et al. (2016) is a version of the WikiText dataset, which includes articles from the Good and Featured sections of Wikipedia. We follow the preprocessing of Merity et al. (2016), keep only the most frequent words as they do, and split the data into train/dev/test sets.

  • IMDb movie reviews (IMDb) Maas et al. (2011) is a corpus of movie reviews. Each review in the corpus is associated with a sentiment label, which we use as the external variable for our topic model. We again split the dataset into training, validation, and test sets.

As the evaluation measure, we use Normalized Pointwise Mutual Information (NPMI), since it correlates strongly with human judgement and is widely applied to verify topic quality Hoyle et al. (2020). For text classification, we use the F1-score as the evaluation metric.
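For readers unfamiliar with the coherence measure, the sketch below computes NPMI for a topic's top words from document-level co-occurrence. It assumes probabilities are estimated as the fraction of documents containing a word or a word pair, which is one common convention and not necessarily the exact reference corpus or windowing used in the experiments. The function names are illustrative.

```python
import math
from itertools import combinations


def npmi_pair(w1, w2, docs):
    """NPMI of two words, with docs given as a list of word sets."""
    n = len(docs)
    p1 = sum(w1 in d for d in docs) / n
    p2 = sum(w2 in d for d in docs) / n
    p12 = sum(w1 in d and w2 in d for d in docs) / n
    if p12 == 0.0:
        return -1.0   # words never co-occur: minimum score
    if p12 == 1.0:
        return 0.0    # degenerate case; treated as uninformative here by assumption
    pmi = math.log(p12 / (p1 * p2))
    return pmi / -math.log(p12)


def topic_npmi(top_words, docs):
    """Average NPMI over all unordered pairs of a topic's top words."""
    pairs = list(combinations(top_words, 2))
    return sum(npmi_pair(a, b, docs) for a, b in pairs) / len(pairs)


docs = [{"ski", "alpine"}, {"ski", "alpine"}, {"engine"}, {"engine", "boeing"}]
coherent = topic_npmi(["ski", "alpine"], docs)
mixed = topic_npmi(["ski", "alpine", "engine"], docs)
```

A topic whose words always appear together scores close to 1, while mixing in a word from an unrelated theme pulls the average down.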

4.2 Baselines

We compare our method with the following state-of-the-art neural topic models of diverse styles:

  • NTM Ding et al. (2018): a Gaussian-based neural topic model proposed by (Miao et al., 2017), inheriting the VAE architecture and utilizing neural variational inference for training.

  • SCHOLAR Card et al. (2017): a VAE-based neural topic model that learns with a logistic normal prior and provides a method to incorporate external variables.

  • SCHOLAR + BAT Hoyle et al. (2020): a version of the SCHOLAR model trained using knowledge distillation, where a BERT model acts as a teacher that provides contextual knowledge to its student, the neural topic model.

  • W-LDA Nan et al. (2019): a topic model that takes the form of a Wasserstein auto-encoder with a Dirichlet prior approximated by minimizing Maximum Mean Discrepancy.

  • BATM Wang et al. (2020): a neural topic model whose architecture is inspired by the Generative Adversarial Network. We use the version trained with the bidirectional adversarial training method and an architecture consisting of 3 components: encoder, generator, and discriminator.

5 Results

5.1 Topic coherence

Model                              20NG                          IMDb                          Wiki
NTM Ding et al. (2018)             0.283 ± 0.004  0.277 ± 0.003  0.170 ± 0.008  0.169 ± 0.003  0.250 ± 0.010  0.291 ± 0.009
W-LDA Nan et al. (2019)            0.279 ± 0.003  0.188 ± 0.001  0.136 ± 0.007  0.095 ± 0.003  0.451 ± 0.012  0.308 ± 0.007
BATM Wang et al. (2020)            0.314 ± 0.003  0.245 ± 0.001  0.065 ± 0.008  0.090 ± 0.004  0.336 ± 0.010  0.319 ± 0.005
SCHOLAR Card et al. (2017)         0.319 ± 0.007  0.263 ± 0.002  0.168 ± 0.002  0.140 ± 0.001  0.429 ± 0.011  0.446 ± 0.009
SCHOLAR + BAT Hoyle et al. (2020)  0.324 ± 0.006  0.272 ± 0.002  0.182 ± 0.002  0.175 ± 0.003  0.446 ± 0.010  0.455 ± 0.007
Our model -                        0.327 ± 0.006  0.274 ± 0.003  0.191 ± 0.007  0.185 ± 0.003  0.455 ± 0.012  0.450 ± 0.008
Our model -                        0.328 ± 0.004  0.277 ± 0.003  0.195 ± 0.008  0.187 ± 0.001  0.465 ± 0.012  0.456 ± 0.004
Our model -                        0.334 ± 0.004  0.280 ± 0.003  0.197 ± 0.006  0.188 ± 0.002  0.497 ± 0.009  0.478 ± 0.006
Table 1: Results measured in NPMI of neural topic models (mean ± standard deviation; the two columns under each dataset correspond to the two topic-number settings)

Overall basis We evaluate our method under two settings of the number of topics. For each topic, we follow previous works Hoyle et al. (2020); Wang et al. (2019a); Card et al. (2017) to pick the top words, measure their NPMI, and report the average value. As shown in Tab. 1, our method achieves the best topic coherence on all three benchmark datasets. We surpass the baseline SCHOLAR Card et al. (2017), its version trained with distilled knowledge, SCHOLAR + BAT Hoyle et al. (2020), and other state-of-the-art neural topic models in both settings. We also establish the robustness of our improvement by running the experiments with different random seeds and recording the mean and standard deviation. This confirms that the contrastive framework promotes the overall quality of the generated topics.

Figure 3: (left) Jensen-Shannon for aligned topic pairs of SCHOLAR and our model. (right) The number of aligned topic pairs which our model improves upon SCHOLAR model

Topic-by-topic basis To further evaluate the performance of our method, we individually compare each of our topics with the aligned topic produced by the baseline neural topic model. Following Hoyle et al. (2020), we use a variant of competitive linking to greedily approximate the optimal weight of the bipartite graph matching. In particular, a bipartite graph is constructed by linking the topics of our model to those of the baseline. The weight of each link is the Jensen-Shannon (JS) divergence Wong and You (1985); Lin (1991) between the two topics. We iteratively choose the pair with the lowest JS score, remove those two topics from the topic list, and repeat until the JS score surpasses a certain threshold. Fig. 3 (left) shows the aligned scores for the three benchmark corpora. Based on visual inspection, we choose the 44 most aligned topic pairs for the comparison. As shown in Fig. 3 (right), our model has more topics with a higher NPMI score than the baseline model. This means that our model generates better topics not only on average but also on a topic-by-topic basis.
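The greedy alignment described above can be sketched as follows. Topics are represented as topic-word probability vectors, pairs are ranked by Jensen-Shannon divergence, and each topic is matched at most once until the divergence exceeds a threshold. The function names and the threshold value in the usage lines are illustrative assumptions.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(u, v):
        # Terms with u_i = 0 contribute nothing; v_i > 0 whenever u_i > 0.
        return sum(a * math.log(a / b) for a, b in zip(u, v) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def align_topics(ours, baseline, threshold):
    """Greedy competitive linking: repeatedly take the closest unmatched pair."""
    scored = sorted(
        (js_divergence(p, q), i, j)
        for i, p in enumerate(ours)
        for j, q in enumerate(baseline)
    )
    used_ours, used_base, aligned = set(), set(), []
    for score, i, j in scored:
        if score > threshold:
            break                      # remaining pairs are too dissimilar
        if i in used_ours or j in used_base:
            continue                   # each topic may be linked only once
        used_ours.add(i)
        used_base.add(j)
        aligned.append((i, j, score))
    return aligned


ours = [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]]
baseline = [[0.0, 0.2, 0.8], [0.8, 0.2, 0.0]]
pairs = align_topics(ours, baseline, threshold=0.5)
```

Once pairs are aligned, the per-topic coherence of the two models can be compared pair by pair.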

5.2 Text classification

Model 20NG IMDb
BATM Wang et al. (2020) 30.8 66.0
SCHOLAR Card et al. (2017) 52.9 83.4
SCHOLAR + BAT Hoyle et al. (2020) 32.2 73.1
Our model 54.4 84.2
Table 2: Text classification employing the latent distribution predicted by neural topic models.

In order to compare the extrinsic predictive performance, we use document classification as the downstream task. We collect the latent vectors inferred by the neural topic models and train a Random Forest, with a fixed number of decision trees and a fixed maximum depth, to predict the class of each document. We pick IMDb and 20NG for our experiment. Our method surpasses the other neural topic models on the downstream text classification task, as shown in Tab. 2.

5.3 Ablation Study

Variant                  20NG                          IMDb                          Wiki
Our method               0.334 ± 0.004  0.280 ± 0.003  0.197 ± 0.006  0.190 ± 0.002  0.497 ± 0.009  0.478 ± 0.006
- w/o positive sampling  0.320 ± 0.004  0.272 ± 0.002  0.187 ± 0.006  0.182 ± 0.007  0.452 ± 0.012  0.448 ± 0.009
- w/o negative sampling  0.331 ± 0.002  0.277 ± 0.002  0.195 ± 0.008  0.188 ± 0.003  0.474 ± 0.010  0.468 ± 0.007
Table 3: Ablation studies

To verify the benefit of mimicking the human behavior of learning topics by grasping commonalities, we train our method under the best setting (with word-based sampling) but with two different objectives: (1) Without positive sampling: the model captures semantic patterns only by distinguishing the input from the negative sample; (2) Without negative sampling: the model learns semantic patterns solely by maximizing the similarity between the input and the positive sample. Tab. 3 demonstrates that losing either of the two views in the contrastive framework degrades the quality of the topics. We include the optimization objectives for the two variants in the Appendix. Notably, removing the negative objective hurts less than removing the positive one. This reconfirms the soundness of our approach of focusing on the effect of the positive sample, which takes inspiration from the human perspective.

6 Analysis

6.1 Effect of adaptive controlling parameter

Sampling method         20NG                          IMDb                          Wiki
0-sampling              0.269 ± 0.003  0.231 ± 0.001  0.171 ± 0.005  0.172 ± 0.002  0.448 ± 0.008  0.429 ± 0.007
Random sampling         0.321 ± 0.005  0.273 ± 0.001  0.183 ± 0.002  0.177 ± 0.001  0.460 ± 0.012  0.462 ± 0.003
Topic-based sampling -  0.313 ± 0.004  0.270 ± 0.005  0.189 ± 0.002  0.172 ± 0.002  0.467 ± 0.012  0.464 ± 0.002
Topic-based sampling -  0.322 ± 0.005  0.268 ± 0.002  0.181 ± 0.006  0.170 ± 0.007  0.450 ± 0.013  0.461 ± 0.008
Topic-based sampling -  0.319 ± 0.001  0.273 ± 0.002  0.176 ± 0.007  0.170 ± 0.003  0.472 ± 0.007  0.444 ± 0.006
Our method              0.334 ± 0.004  0.280 ± 0.003  0.197 ± 0.006  0.188 ± 0.002  0.497 ± 0.009  0.478 ± 0.006
Table 4: Results of different sampling methods
Figure 4: The influence of adaptive controlling parameter on topic coherence measure

We then show the relation between the controlling parameter, which governs the impact of our constraint, and the topic coherence measure in Fig. 4. As shown in the figure, the adaptive weight exhibits consistent superiority over a manually tuned constant parameter. We attribute this performance to the triangle scheduling, which brings self-adjustment across different training stages.

6.2 Random Sampling Strategy

Number of Topics 20NG IMDb Wiki
0.0140 0.0291 0.0344
0.0494 0.0012 0.0156
Table 5: Significance Testing results, reporting p-value

In this section, we demonstrate the effectiveness of our sampling strategy by comparing it with three other methods: (1) 0-sampling: we replace the weights of the chosen tokens in the BoW with 0; (2) Random sampling: we create the negative samples by drawing other documents from the dataset and extracting the topic vector of each document; we do not perform positive sampling in this variant; (3) Topic-based sampling: the sampling strategy discussed in Section 3.4, for which we experiment with varying choices of the number of selected topics. As shown in Tab. 4, our sampling method consistently outperforms the other strategies. This supports our hypothesis that topic-based sampling is vulnerable to drawing insufficient or redundant topics, which might harm the performance.

In addition, to further evaluate the statistical significance of our improvement over the traditional random sampling method, we conduct significance testing and report the p-values in Tab. 5. All of the p-values are smaller than 0.05, which indicates that the improvement of our method over traditional contrastive learning is statistically significant.

6.3 Importance Measure

Metrics  IMDb           20NG           Wiki
PCA      0.184 ± 0.004  0.325 ± 0.003  0.481 ± 0.005
SVD      0.181 ± 0.004  0.313 ± 0.003  0.476 ± 0.014
tf       0.196 ± 0.003  0.332 ± 0.006  0.495 ± 0.008
idf      0.193 ± 0.001  0.334 ± 0.004  0.490 ± 0.009
tf-idf   0.197 ± 0.006  0.334 ± 0.004  0.497 ± 0.009
Table 6: Results when employing various importance measures

Our word-based sampling strategy employs tf-idf measure to determine important and unimportant words that have values to be superseded to form positive and negative samples.

To have a fair judgement, we also conduct experiments with two other complex sampling methods using Principal Component Analysis (PCA) or Singular Value Decomposition (SVD). Specifically, we decompose the reconstructed and original input vectors into singular values and then replace the largest/smallest singular values of the input with the largest/smallest ones of the reconstructed to obtain negative/positive samples, respectively. For SVD, we choose

largest/smallest values for substitution whereas for PCA, we decompose the input vector onto -d space in order to make it similar to the latent space of neural topic model (number of topics ) and proceed to substitute largest/smallest values as in SVD. We conducted our experiments on 3 datasets IMDb, 20NG, and Wiki with , and reported the results (NPMI) in Tab. 6.

Despite its simplicity, the tf-idf-based sampling method outperforms the more complicated PCA- and SVD-based sampling methods on all three datasets.

Dataset   Method      NPMI    Topic
20NG      SCHOLAR     0.259   max bush clinton crypto pgp clipper nsa announcement air escrow
20NG      Our model   0.543   crypto clipper encryption nsa escrow wiretap chip proposal warrant secure
Wiki      SCHOLAR     0.196   airlines boeing vehicle manufactured flight skiing airline ski engine alpine
Wiki      Our model   0.564   skiing ski alpine athletes para paralympic nordic olympic paralympics ipc
IMDb      SCHOLAR     0.145   hong chinese kong imagery japanese rape lynch torture violence disturbing
IMDb      Our model   0.216   hong chinese kong japan fairy japanese sword martial fantasy magical
Table 7: Some example topics on three datasets 20NG, Wiki, and IMDb

6.4 Case Studies

We randomly extract a sample topic from each of the three datasets to study the quality of the generated topics and show the results in Tab. 7. Generally, the topic words generated by our model tend to concentrate on the main topic of the document. For example, on the 20NG dataset, our words concentrate on cryptography (encryption, crypto, etc.) and computer hardware (chip, wiretap, clipper, etc.), rather than political words such as bush and clinton generated by the SCHOLAR model. Our generated topic on Wiki is more focused on skiing, while SCHOLAR’s topic includes transportation terms such as vehicle, boeing, and engine. Similarly, the topic words generated by our model on IMDb mainly reflect the theme of fantasy movies related to japan, chinese, and hong kong, and do not include off-topic words such as torture and disturbing, which were generated by the SCHOLAR model.
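The NPMI scores reported for these topics measure how often the top words of a topic co-occur. A minimal sketch of the standard NPMI formula follows, assuming document-level co-occurrence counts; the reference corpus and windowing used in this work are not specified here, and the function name `topic_npmi` and the toy corpus are hypothetical.

```python
import math
from itertools import combinations


def topic_npmi(topic_words, documents):
    """Average NPMI over all word pairs of a topic."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every given word.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0.0:
            scores.append(-1.0)  # the pair never co-occurs: minimum NPMI
        elif p12 == 1.0:
            scores.append(0.0)   # degenerate case, set to 0 by assumption
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Toy corpus for illustration only.
docs = [
    ["crypto", "encryption", "nsa"],
    ["crypto", "encryption", "chip"],
    ["bush", "clinton", "nsa"],
    ["bush", "clinton"],
]
coherent = topic_npmi(["crypto", "encryption"], docs)
mixed = topic_npmi(["crypto", "bush"], docs)
```

In this toy corpus the pair that always appears together receives the highest possible score, while the pair that never shares a document receives the lowest.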

7 Conclusion

In this paper, we propose a novel method to help neural topic models learn more meaningful representations. Approaching the problem from a mathematical perspective, we constrain our model to consider the effects of both positive and negative pairs. To better capture semantic patterns, we introduce a novel sampling strategy that takes inspiration from human behavior in differentiating documents. Experimental results on three common benchmark datasets show that our method outperforms other state-of-the-art neural topic models in terms of topic coherence.



Appendix A Implementation details

In this section, we include the hyperparameter details used in this work, e.g., learning rate and batch size. We apply a different set of hyperparameters depending on the dataset on which the neural topic model is trained.

Hyperparameter   20NG    IMDb    Wiki
Learning rate    0.002   0.001   0.002
Batch size       200     200     500
Table 8: Hyperparameter details
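For convenience, the per-dataset settings in Table 8 could be kept in a small configuration mapping. The sketch below copies the values from the table; the names `TrainingConfig` and `HYPERPARAMETERS` are illustrative and not taken from any released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    learning_rate: float
    batch_size: int


# Values copied from Table 8; the class and dictionary names are illustrative.
HYPERPARAMETERS = {
    "20NG": TrainingConfig(learning_rate=0.002, batch_size=200),
    "IMDb": TrainingConfig(learning_rate=0.001, batch_size=200),
    "Wiki": TrainingConfig(learning_rate=0.002, batch_size=500),
}

# Look up the settings for one dataset.
config = HYPERPARAMETERS["Wiki"]
```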

Appendix B Contrastive loss derivation

We provide the proof of the inequality (4) in this section.

Theorem 1.

Let denote the word count representation of a document, denote the positive sample and negative sample with respect to , denote the mapping function of the encoder, denote the positive KKT multiplier, and denote the strength of constraint. Suppose , then we have the following inequality


We rewrite the LHS in (7)

At this point, we conclude our proof. ∎

Appendix C Versions of loss function

We describe the versions of the loss function used in this work.

Contrastive approach - Using both positive and negative samples


Contrastive approach - Using only positive sample


Contrastive approach - Using only negative sample


Appendix D Understanding number of chosen tokens

We demonstrate the effect of changing the number of tokens chosen for sampling. We train with different numbers of chosen tokens and record the topic coherence. For visibility, we normalize the results to a common scale before plotting them in Fig. 5. The performance initially increases as we select more tokens from the reconstructed output to substitute into the drawn sample. However, when the number of selected tokens grows too large, the topic coherence starts to decrease. We hypothesize that an overwhelming number of substituted values can alter the semantics of the positive samples while producing random negative samples.

Figure 5: The influence of number of tokens chosen to construct random samples
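The normalization to a common scale mentioned above is not specified further. One common choice is min-max scaling per dataset, sketched below under that assumption; the function name `min_max_normalize` and the scores are hypothetical placeholders, not measured values.

```python
def min_max_normalize(values):
    """Rescale a sequence of scores linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant curve: there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical coherence scores for increasing numbers of chosen tokens.
scores_by_dataset = {
    "dataset_a": [0.30, 0.33, 0.34, 0.32],
    "dataset_b": [0.18, 0.19, 0.20, 0.17],
}
normalized = {
    name: min_max_normalize(scores)
    for name, scores in scores_by_dataset.items()
}
```

After this step each curve spans the same range, so trends across datasets with different absolute coherence levels can be compared in one plot.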