A Mutual Information Maximization Perspective of Language Representation Learning

10/18/2019 ∙ by Lingpeng Kong, et al.

We show state-of-the-art word representation learning methods maximize an objective function that is a lower bound on the mutual information between different parts of a word sequence (i.e., a sentence). Our formulation provides an alternative perspective that unifies classical word embedding models (e.g., Skip-gram) and modern contextual embeddings (e.g., BERT, XLNet). In addition to enhancing our theoretical understanding of these methods, our derivation leads to a principled framework that can be used to construct new self-supervised tasks. We provide an example by drawing inspirations from related methods based on mutual information maximization that have been successful in computer vision, and introduce a simple self-supervised objective that maximizes the mutual information between a global sentence representation and n-grams in the sentence. Our analysis offers a holistic view of representation learning methods to transfer knowledge and translate progress across multiple domains (e.g., natural language processing, computer vision, audio processing).




1 Introduction

Advances in representation learning have driven progress in natural language processing. Performance on many downstream tasks has improved considerably, achieving parity with human baselines in benchmark leaderboards such as SQuAD (Rajpurkar et al., 2016, 2018) and GLUE (Wang et al., 2019). The main ingredient is the “pretrain and fine-tune” approach, where a large text encoder is trained on an unlabeled corpus with self-supervised training objectives and used to initialize a task-specific model. Such an approach has also been shown to reduce the number of training examples that are needed to achieve good performance on the task of interest (Yogatama et al., 2019).

In contrast to first-generation models that learn word type embeddings (Mikolov et al., 2013; Pennington et al., 2014), recent methods have focused on contextual token representations—i.e., learning an encoder to represent words in context. Many of these encoders are trained with a language modeling objective, where the representation of a context is trained to be predictive of a target token by maximizing the log likelihood of predicting this token (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018, 2019). In a vanilla language modeling objective, the target token is always the next token that follows the context. Peters et al. (2018) propose an improvement by adding a reverse objective that also predicts the word token that precedes the context. Following this trend, current state-of-the-art encoders such as BERT (Devlin et al., 2018) and XLNet (Yang et al., 2019) are also trained with variants of the language modeling objective: masked language modeling and permutation language modeling.

In this paper, we provide an alternative view and show that these methods also maximize a lower bound on the mutual information between different parts of a word sequence. Such a framework is inspired by the InfoMax principle (Linsker, 1988) and has been the main driver of progress in self-supervised representation learning in other domains such as computer vision, audio processing, and reinforcement learning (Belghazi et al., 2018; van den Oord et al., 2019; Hjelm et al., 2019; Bachman et al., 2019; O’Connor and Veeling, 2019). Many of these methods are trained to maximize a particular lower bound called InfoNCE (van den Oord et al., 2019), also known as contrastive learning (Arora et al., 2019). The main idea behind contrastive learning is to divide an input into multiple (possibly overlapping) views and maximize the mutual information between encoded representations of these views, using views derived from other inputs as negative samples. In §2, we provide an overview of representation learning with mutual information maximization. We then show how the skip-gram objective (§3.1; Mikolov et al. 2013), masked language modeling (§3.2; Devlin et al. 2018), and permutation language modeling (§3.3; Yang et al. 2019) fit in this framework.

In addition to providing a principled theoretical understanding that bridges progress in multiple areas, our proposed framework also gives rise to a general class of word representation learning models which serves as a basis for designing and combining self-supervised training objectives to create better language representations. As an example, we show how to use this framework to construct a simple self-supervised objective that maximizes the mutual information between a sentence and $n$-grams in the sentence (§4). We combine it with a variant of the masked language modeling objective and show that the resulting representation performs better, particularly on tasks such as question answering and linguistic acceptability (§5).

2 Mutual Information Maximization

Mutual information measures dependencies between random variables. Given two random variables $A$ and $B$, it can be understood as how much knowing $A$ reduces the uncertainty in $B$ or vice versa. Formally, the mutual information between $A$ and $B$ is:

$$I(A, B) = H(A) - H(A \mid B) = H(B) - H(B \mid A).$$

Consider $A$ and $B$ to be different views of an input (e.g., a word and its context, two different partitions of a sentence). The goal of training is to learn a function that maximizes $I(A, B)$.
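To make the definition concrete, the following minimal sketch computes $I(A, B) = H(A) - H(A \mid B)$ for a small discrete joint distribution. The two toy joint tables are hypothetical and only illustrate the two extremes (independent views and identical views); entropies are measured in nats.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(A, B) = H(A) - H(A | B) for a table with joint[a][b] = p(a, b)."""
    p_a = [sum(row) for row in joint]            # marginal of A
    p_b = [sum(col) for col in zip(*joint)]      # marginal of B
    h_ab = entropy([p for row in joint for p in row])
    h_a_given_b = h_ab - entropy(p_b)            # chain rule: H(A|B) = H(A,B) - H(B)
    return entropy(p_a) - h_a_given_b

# Hypothetical toy joints over two binary variables.
independent = [[0.25, 0.25], [0.25, 0.25]]   # knowing A says nothing about B
identical = [[0.5, 0.0], [0.0, 0.5]]         # knowing A determines B

mi_independent = mutual_information(independent)
mi_identical = mutual_information(identical)
```

For independent views the mutual information vanishes, while for identical binary views it equals $H(A) = \log 2$.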

Maximizing mutual information directly is generally intractable when the function $f$ consists of modern encoders such as neural networks (Paninski, 2003), so we need to resort to a lower bound on $I(A, B)$. One particular lower bound that has been shown to work well in practice is InfoNCE (Logeswaran and Lee, 2018; van den Oord et al., 2019),¹ which is based on Noise Contrastive Estimation (NCE; Gutmann and Hyvarinen, 2012).² InfoNCE is defined as:

$$I(A, B) \geq \mathbb{E}_{p(A, B)} \left[ f_\theta(a, b) - \mathbb{E}_{q(\tilde{\mathcal{B}})} \left[ \log \sum_{\tilde{b} \in \tilde{\mathcal{B}}} \exp f_\theta(a, \tilde{b}) \right] \right] + \log |\tilde{\mathcal{B}}|, \qquad (1)$$

where $a$ and $b$ are different views of an input sequence, $f_\theta$ is a function parameterized by $\theta$ (e.g., a dot product between encoded representations of a word and its context, a dot product between encoded representations of two partitions of a sentence), and $\tilde{\mathcal{B}}$ is a set of samples drawn from a proposal distribution $q(\tilde{\mathcal{B}})$. The set $\tilde{\mathcal{B}}$ contains the positive sample $b$ and $|\tilde{\mathcal{B}}| - 1$ negative samples.

¹ Alternative bounds include the Donsker-Varadhan representation (Donsker and Varadhan, 1983) and the Jensen-Shannon estimator (Nowozin et al., 2016), but we focus on InfoNCE here.

² See van den Oord et al. (2019); Poole et al. (2019) for detailed derivations of InfoNCE as a bound on mutual information.
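As an illustration of Eq. 1, the sketch below evaluates the quantity inside the expectation for a single positive pair and a handful of negatives. It assumes the scores $f_\theta(a, \tilde{b})$ have already been computed; the function names and the example scores are hypothetical.

```python
import math

def log_sum_exp(scores):
    """Numerically stable log of the sum of exponentials."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def info_nce(positive_score, negative_scores):
    """Single-sample InfoNCE estimate:
    f(a, b) - log sum_{b~} exp f(a, b~) + log |B~|,
    where the candidate set B~ holds the positive and all negatives."""
    candidates = [positive_score] + list(negative_scores)
    return positive_score - log_sum_exp(candidates) + math.log(len(candidates))

# Hypothetical scores: a critic that cannot tell the views apart ...
uninformative = info_nce(0.0, [0.0, 0.0, 0.0])
# ... and a critic that scores the positive far above the negatives.
confident = info_nce(10.0, [0.0, 0.0, 0.0])
```

The estimate can never exceed $\log |\tilde{\mathcal{B}}|$, because the positive score is itself one of the terms inside the log-sum-exp. This is one way to see why a larger candidate set allows a tighter bound.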

Learning representations based on this objective is also known as contrastive learning. Arora et al. (2019) show representations learned by such a method have provable performance guarantees and reduce sample complexity on downstream tasks.

We note that InfoNCE is related to cross-entropy. When $\tilde{\mathcal{B}}$ always includes all possible values of the random variable $B$ (i.e., $\tilde{\mathcal{B}} = \mathcal{B}$) and they are uniformly distributed, maximizing InfoNCE is analogous to maximizing the standard log-likelihood (i.e., minimizing the cross-entropy loss):

$$\mathbb{E}_{p(A, B)} \left[ f_\theta(a, b) - \log \sum_{\tilde{b} \in \mathcal{B}} \exp f_\theta(a, \tilde{b}) \right]. \qquad (2)$$

Eq. 2 above shows that InfoNCE is related to maximizing $p_\theta(b \mid a)$, and it approximates the summation over elements in $\mathcal{B}$ (i.e., the partition function) by negative sampling. As a function of the negative samples, the InfoNCE bound is tighter when $\tilde{\mathcal{B}}$ contains more samples (as can be seen in Eq. 1 above by inspecting the $\log |\tilde{\mathcal{B}}|$ term). Approximating a softmax over a large vocabulary with negative samples is a popular technique that has been widely used in natural language processing in the past. We discuss it here to make the connection under this framework clear.
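The relation between Eq. 1 and Eq. 2 can be checked numerically. In the sketch below, the score list stands in for $f_\theta(a, \tilde{b})$ over a hypothetical five-word vocabulary; when the candidate set is the full vocabulary, the InfoNCE term differs from the softmax log-probability only by the constant $\log |\mathcal{B}|$, and a sampled subset approximates the partition function.

```python
import math

def log_partition(scores):
    """Stable log of the partition function over the given scores."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def log_softmax_prob(scores, target):
    """log p(b | a) under a softmax over every candidate."""
    return scores[target] - log_partition(scores)

def info_nce_term(scores, target, candidate_ids):
    """f(a, b) - log sum over candidates + log |candidates|.
    candidate_ids must contain the target (the positive sample)."""
    cand = [scores[i] for i in candidate_ids]
    return scores[target] - log_partition(cand) + math.log(len(cand))

scores = [2.0, 0.5, -1.0, 0.0, 1.5]   # hypothetical f(a, b~) for 5 words
target = 0

full = info_nce_term(scores, target, range(len(scores)))
log_p = log_softmax_prob(scores, target)
# Negative sampling: the positive plus two sampled negatives.
sampled = info_nce_term(scores, target, [0, 2, 3])
```

Here `full` equals `log_p + log 5`, so maximizing one maximizes the other, while `sampled` is bounded above by `log 3`.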

3 Models

We describe how Skip-gram, BERT, and XLNet fit into the mutual information maximization framework as instances of InfoNCE. In the following, we assume that $f_\theta(a, b) = g_\psi(b)^\top g_\omega(a)$, where $\theta = \{\omega, \psi\}$. Denote the vocabulary set by $\mathcal{V}$ and the size of the vocabulary by $|\mathcal{V}|$. For word representation learning, we seek to learn an encoder parameterized by $\omega$ to represent each word in a sequence $x = \{x_1, x_2, \ldots, x_T\}$ in $d$ dimensions. For each of the models we consider in this paper, $a$ and $b$ are formed by taking different parts of $x$ (e.g., $a := x_1$ and $b := x_T$).

3.1 Skip-gram

We start with a simple word representation learning model, Skip-gram (Mikolov et al., 2013). Skip-gram is a method for learning word representations that relies on the assumption that a good representation of a word should be predictive of its context. The objective function that is maximized in Skip-gram is: $\mathbb{E}_{p(x_i, x_j^i)} \left[ \log p(x_j^i \mid x_i) \right]$, where $x_i$ is a word token and $x_j^i$ is a context word of $x_i$.

Let $b$ be the context word to be predicted $x_j^i$ and $a$ be the input word $x_i$. Recall that $f_\theta(a, b)$ is $g_\psi(b)^\top g_\omega(a)$. The skip-gram objective function can be written as an instance of InfoNCE (Eq. 1) where $g_\omega$ and $g_\psi$ are embedding lookup functions that map each word type to $\mathbb{R}^d$ (i.e., $g_\omega, g_\psi : \mathcal{V} \to \mathbb{R}^d$).

$p(x_j^i \mid x_i)$ can either be computed using a standard softmax over the entire vocabulary or with negative sampling (when the vocabulary is very large). These two approaches correspond to different choices of $\tilde{\mathcal{B}}$. In the softmax approach, $\tilde{\mathcal{B}}$ is the full vocabulary set $\mathcal{V}$ and each word in $\mathcal{V}$ is uniformly distributed. In negative sampling, $\tilde{\mathcal{B}}$ is a set of negative samples drawn from, e.g., a unigram distribution.

While Skip-gram has been widely accepted as an instance of contrastive learning (Mikolov et al., 2013; Mnih and Kavukcuoglu, 2013), we include it here to illustrate its connection with modern approaches such as BERT and XLNet described subsequently. We can see that the two views of an input sentence that are considered in Skip-gram are two words that appear in the same sentence, and they are encoded using simple lookup functions.
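The two choices of candidate set for Skip-gram can be sketched as follows. This is a toy illustration rather than the original word2vec implementation: the vocabulary, the unigram probabilities, the embedding dimension, and the randomly initialized lookup tables `g_omega` and `g_psi` are all hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def skipgram_info_nce(word, context, negatives, g_omega, g_psi):
    """InfoNCE term for one (input word a, context word b) pair,
    with f(a, b) = g_psi(b)^T g_omega(a) and both g's plain lookups."""
    a_vec = g_omega[word]
    candidates = [context] + list(negatives)     # positive goes first
    scores = [dot(g_psi[c], a_vec) for c in candidates]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[0] - log_z + math.log(len(candidates))

rng = random.Random(0)
vocab = ["the", "cat", "sat", "on", "mat"]
unigram = [0.4, 0.15, 0.15, 0.2, 0.1]            # hypothetical unigram probabilities
dim = 4
g_omega = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
g_psi = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}

# Negative sampling: candidates drawn from the unigram distribution
# (a sampled negative may coincide with the positive in this simple sketch).
negatives = rng.choices(vocab, weights=unigram, k=3)
sampled_term = skipgram_info_nce("cat", "sat", negatives, g_omega, g_psi)

# Softmax approach: every other word in the vocabulary is a candidate.
others = [w for w in vocab if w != "sat"]
full_term = skipgram_info_nce("cat", "sat", others, g_omega, g_psi)
```

Only the construction of the candidate set changes between the two variants; the scoring function stays the same.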

3.2 BERT

Devlin et al. (2018) introduce two self-supervised tasks for learning contextual word representations: masked language modeling and next sentence prediction. Previous work suggests that the next sentence prediction objective is not necessary to train a high-quality BERT encoder and that masked language modeling appears to be the key to learning good representations (Liu et al., 2019; Joshi et al., 2019; Lample and Conneau, 2019), so we focus on masked language modeling here. However, we also show how next sentence prediction fits into our framework in Appendix A.

In masked language modeling, given a sequence of word tokens of length $T$, $x = \{x_1, \ldots, x_T\}$, BERT replaces 15% of the tokens in the sequence with (i) a mask symbol 80% of the time, (ii) a random word 10% of the time, or (iii) its original word the remaining 10% of the time. For each replaced token, it introduces a term in the masked language modeling training objective to predict the original word given the perturbed sequence $\hat{x}_i = \{x_1, \ldots, \hat{x}_i, \ldots, x_T\}$ (i.e., the sequence $x$ masked at $x_i$). This training objective can be written as: $\mathbb{E}_{p(x_i, \hat{x}_i)} \left[ \log p(x_i \mid \hat{x}_i) \right]$.
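The perturbation step described above can be sketched in a few lines. This is a simplified illustration, not BERT's released preprocessing code: it selects each token independently with probability 0.15 rather than fixing an exact count, and the `[MASK]` string, the function name, and the toy vocabulary are assumptions.

```python
import random

MASK = "[MASK]"

def bert_mask(tokens, vocab, rng, select_prob=0.15):
    """Return (perturbed tokens, {position: original token}).
    Each selected position becomes a prediction target and is replaced by
    the mask symbol (80%), a random word (10%), or left unchanged (10%)."""
    perturbed = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                      # position not selected
        targets[i] = tok
        r = rng.random()
        if r < 0.8:
            perturbed[i] = MASK
        elif r < 0.9:
            perturbed[i] = rng.choice(vocab)
        # otherwise keep the original word
    return perturbed, targets

rng = random.Random(0)
vocab = ["the", "cat", "sat", "on", "a", "mat"]   # hypothetical vocabulary
tokens = ["the", "cat", "sat", "on", "the", "mat"] * 5
perturbed, targets = bert_mask(tokens, vocab, rng)
```

Each entry of `targets` contributes one term $\log p(x_i \mid \hat{x}_i)$ to the objective, with `perturbed` playing the role of $\hat{x}_i$.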

Following our notation in §2, we have $f_\theta(a, b) = g_\psi(b)^\top g_\omega(a)$. Let $b$ be a masked word $x_i$ and $a$ be the masked sequence $\hat{x}_i$. Consider a Transformer encoder parameterized by $\omega$ and denote $g_\omega(\hat{x}_i) \in \mathbb{R}^d$ as a function that returns the final hidden state corresponding to the $i$-th token (i.e., the masked token) after running $\hat{x}_i$ through the Transformer. Let $g_\psi : \mathcal{V} \to \mathbb{R}^d$ be a lookup function that maps each word type into a vector and $\tilde{\mathcal{B}}$ be the full vocabulary set $\mathcal{V}$. Under this formulation, the masked language modeling objective maximizes Eq. 1 and different choices of masking probabilities can be understood as manipulating the joint distributions $p(a, b)$. In BERT, the two views of a sentence correspond to a masked word in the sentence and its masked context.

Contextual vs. non-contextual.

It is generally understood that the main difference between Skip-gram and BERT is that Skip-gram learns representations of word types (i.e., the representation for a word is always the same regardless of the context it appears in) and BERT learns representations of word tokens. We note that under our formulation for either Skip-gram or BERT, the encoder that we want to learn appears in $g_\omega$, and $g_\psi$ is not used after training. We show that Skip-gram and BERT maximize a similar objective, and the main difference is in the choice of the encoder that forms $g_\omega$: a context-dependent Transformer encoder that takes a sequence as its input for BERT and a simple word embedding lookup for Skip-gram.

3.3 XLNet

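Yang et al. (2019) propose a permutation language modeling objective to learn contextual word representations. This objective considers all possible factorization permutations of a joint distribution of a sentence. Given a sentence $x = \{x_1, \ldots, x_T\}$, there are $T!$ ways to factorize its joint distribution.³ Given a sentence $x$, denote a permutation by $z$. XLNet optimizes the objective function:

$$\mathbb{E}_{p(x)} \left[ \mathbb{E}_{p(z)} \left[ \sum_{t=1}^{T} \log p(x_{z_t} \mid x_{z_{<t}}) \right] \right].$$

As a running example, consider a permutation order 3,1,5,2,4 for a sentence $x_1, x_2, x_3, x_4, x_5$. Given the order, XLNet is only trained to predict the last $S$ tokens in practice. For $S = 1$, the context sequence used for training is $x_1, x_2, x_3, \_, x_5$, with $x_4$ being the target word.

³ For example, we can factorize $p(x) = p(x_1) \, p(x_2 \mid x_1) \cdots p(x_T \mid x_1, \ldots, x_{T-1}) = p(x_T) \, p(x_{T-1} \mid x_T) \cdots p(x_1 \mid x_T, \ldots, x_2)$, and many others.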

In addition to replacing the Transformer encoder with Transformer XL (Dai et al., 2019), a key architectural innovation of XLNet is the two-stream self-attention. In two-stream self-attention, a shared encoder is used to compute two sets of hidden representations from one original sequence. They are called the query stream and the content stream. In the query stream, the input sequence is masked at the target position, whereas the content stream sees the word at the target position. Words at future positions for the permutation order under consideration are also masked in both streams. These masks are implemented as two attention mask matrices. During training, the final hidden representation for a target position from the query stream is used to predict the target word.

Since there is only one set of encoder parameters for both streams, we show that we can arrive at the permutation language modeling objective from the masked language modeling objective with an architectural change in the encoder. Denote a hidden representation by $h_t^k$, where $t$ indexes the position and $k$ indexes the layer, and consider the training sequence $x_1, x_2, x_3, \_, x_5$ and the permutation order 3,1,5,2,4. In BERT, we compute attention scores to obtain $h_t^k$ from $h_t^{k-1}$ for every $t$ (i.e., $t = 1, \ldots, T$), where $h_4^0$ is the embedding for the mask symbol. In XLNet, the attention scores for future words in the permutation order are masked to 0. For example, when we compute $h_1^k$, only the attention score from $h_3^{k-1}$ is considered (since the permutation order is 3,1,5,2,4). For $h_5^k$, we use $h_1^{k-1}$ and $h_3^{k-1}$. XLNet does not require a mask symbol embedding since the attention score from a masked token is always zeroed out with an attention mask (implemented as a matrix). As a result, we can consider XLNet training as masked language modeling with stochastic attention masks in the encoder.
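The attention masking induced by a permutation order can be sketched as below. This is a minimal illustration of the visibility rule only, not XLNet's implementation: the function names are hypothetical, positions are 1-indexed to match the running example, and the `include_self` flag is an assumed way to distinguish the content stream (which sees the word at its own position) from the query stream.

```python
def permutation_visibility(order, include_self=False):
    """Map each position to the set of positions it may attend to.
    A position sees only positions that come earlier in the factorization
    order; include_self=True additionally lets it see itself."""
    rank = {pos: r for r, pos in enumerate(order)}
    visible = {}
    for t in order:
        seen = {s for s in order if rank[s] < rank[t]}
        if include_self:
            seen.add(t)
        visible[t] = seen
    return visible

def visibility_to_matrix(visible, n):
    """Row t, column s is 1 if position t may attend to position s."""
    return [[1 if s in visible[t] else 0 for s in range(1, n + 1)]
            for t in range(1, n + 1)]

order = [3, 1, 5, 2, 4]
query_visible = permutation_visibility(order)
content_visible = permutation_visibility(order, include_self=True)
query_mask = visibility_to_matrix(query_visible, len(order))
```

For the order 3,1,5,2,4, position 1 sees only position 3 and position 5 sees positions 1 and 3, which matches the example in the text; sampling a new order for each training sequence yields a different mask matrix.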

It is now straightforward to see that the permutation language modeling objective is an instance of Eq. 1, where $b$ is a target token $x_i$ and $a$ is a masked sequence $\hat{x}_i = \{x_1, \ldots, \hat{x}_i, \ldots, x_T\}$. Similar to BERT, we have a Transformer encoder parameterized by $\omega$ and denote $g_\omega(\hat{x}_i) \in \mathbb{R}^d$ as a function that returns the final hidden state corresponding to the $i$-th token (i.e., the masked token) after running $\hat{x}_i$ through the Transformer. Let $g_\psi : \mathcal{V} \to \mathbb{R}^d$ be a lookup function that maps each word type into a vector and $\tilde{\mathcal{B}}$ be the full vocabulary set $\mathcal{V}$. The main difference between BERT and XLNet is that the encoder that forms $g_\omega$ used in XLNet implements attention masking based on a sampled permutation order when building its representations. In addition, XLNet and BERT also differ in the choice of $p(a, b)$ since each of them has its own masking procedure. However, we can see that both XLNet and BERT maximize the same objective.

4 InfoWord

Our analysis of Skip-gram, BERT, and XLNet shows that their objective functions are different instances of InfoNCE in Eq. 1, although they are typically trained using the entire vocabulary set $\mathcal{V}$ for $\tilde{\mathcal{B}}$ instead of negative sampling. These methods differ in how they choose which views of a sentence they use as $a$ and $b$, the data distribution $p(a, b)$, and the architecture of the encoder for computing $g_\omega$. Seen under this unifying framework, we can observe that progress in the field has largely been driven by using a more powerful encoder to represent $g_\omega$. While we only provide derivations for Skip-gram, BERT, and XLNet, it is straightforward to show that other language-modeling-based pretraining objectives such as those used in ELMo (Peters et al., 2018) and GPT-2 (Radford et al., 2019) can be formulated under this framework.

Our framework also allows us to draw connections to other mutual information maximization representation learning methods that have been successful in other domains (e.g., computer vision, audio processing, reinforcement learning). In this section, we discuss an example and derive insights to design a simple self-supervised objective for learning better language representations.

Deep InfoMax (DIM; Hjelm et al., 2019) is a mutual information maximization based representation learning method for images. The complete objective function that DIM maximizes consists of multiple terms. Here, we focus on a term in the objective that maximizes the mutual information between local features and global features. We describe the main idea of this objective for learning representations from a one-dimensional sequence, although it was originally proposed to learn from a two-dimensional object.

Given a sequence $x = \{x_1, x_2, \ldots, x_T\}$, we consider the “global” representation of the sequence to be the hidden state of the first token (assumed to be a special start of sentence symbol) after contextually encoding the sequence $g_\omega(x)$,⁴ and the local representations to be the encoded representations of each word in the sequence $g_\psi(x_t)$. We can use the contrastive learning framework to design a task that maximizes the mutual information between this global representation vector and its corresponding “local” representations using local representations from other sequences as negative samples. This is analogous to training the global representation vector of a sentence to choose which words appear in the sentence and which words are from other sentences.⁵ However, if we feed the original sequence $x$ to the encoder and take the hidden state of the first token as the global representation, the task becomes trivial since the global representation is built using all the words in the sequence. We instead use a masked sequence $a = \hat{x}_t = \{x_1, \ldots, \hat{x}_t, \ldots, x_T\}$ and $b = x_t$.

⁴ Alternatively, the global representation can be the averaged representations of words in the sequence although we do not explore this in our experiments.

⁵ We can see that this self-supervised task is related to the next sentence prediction objective in BERT. However, instead of learning a global representation (assumed to be the representation of the first token in BERT) to be predictive of whether two sentences are consecutive sentences, it learns its global representation to be predictive of words in the original sentence.

State-of-the-art methods based on language modeling objectives consider all negative samples since the second view of the input data (i.e., the part denoted by $b$ in Eq. 1) that is used is simple and consists of only a target word; hence the size of the negative set is still manageable. A major benefit of the contrastive learning framework is that we only need to be able to take negative samples for training. Instead of individual words, we can use $n$-grams as the local representations.⁶ Denote an $n$-gram by $x_{i:j}$ and a masked sequence masked at position $i$ to $j$ by $\hat{x}_{i:j}$. We define $\mathcal{I}_{\text{DIM}}$ as:

$$\mathcal{I}_{\text{DIM}} = \mathbb{E}_{p(\hat{x}_{i:j}, x_{i:j})} \left[ g_\omega(\hat{x}_{i:j})^\top g_\omega(x_{i:j}) - \log \sum_{\tilde{x}_{i:j} \in \tilde{\mathcal{S}}} \exp \left( g_\omega(\hat{x}_{i:j})^\top g_\omega(\tilde{x}_{i:j}) \right) \right],$$

where $\hat{x}_{i:j}$ is a sentence masked at position $i$ to $j$, $x_{i:j}$ is an $n$-gram spanning from $i$ to $j$, and $\tilde{x}_{i:j}$ is an $n$-gram from a set $\tilde{\mathcal{S}}$ that consists of the positive sample $x_{i:j}$ and negative $n$-grams from other sentences in the corpus. We use one Transformer to encode both views, so we do not need $g_\psi$ here.

⁶ Local image patches used in DIM are analogous to $n$-grams in a sentence.

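Since the main goal of representation learning is to train an encoder parameterized by $\omega$, it is possible to combine multiple self-supervised tasks into an objective function in the contrastive learning framework. Our model, which we denote InfoWord, combines the above objective with a masked language modeling objective $\mathcal{I}_{\text{MLM}}$. The only difference between our masked language modeling objective and the standard masked language modeling objective is that we use negative sampling to construct $\tilde{\mathcal{V}}$ by sampling from the unigram distribution. We have:

$$\mathcal{I}_{\text{MLM}} = \mathbb{E}_{p(\hat{x}_i, x_i)} \left[ g_\omega(\hat{x}_i)^\top g_\psi(x_i) - \log \sum_{\tilde{x}_i \in \tilde{\mathcal{V}}} \exp \left( g_\omega(\hat{x}_i)^\top g_\psi(\tilde{x}_i) \right) \right].$$

Our overall objective function is a weighted combination of the two terms above:

$$\mathcal{I}_{\text{InfoWord}} = \lambda_{\text{MLM}} \mathcal{I}_{\text{MLM}} + \lambda_{\text{DIM}} \mathcal{I}_{\text{DIM}},$$

where $\lambda_{\text{MLM}}$ and $\lambda_{\text{DIM}}$ are hyperparameters that balance the contribution of each term.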

5 Experiments

In this section, we evaluate the effects of training masked language modeling with negative sampling and adding $\mathcal{I}_{\text{DIM}}$ to the quality of learned representations.

5.1 Setup

We largely follow the same experimental setup as the original BERT model (Devlin et al., 2018). We have two Transformer architectures similar to BERT-Base and BERT-Large. BERT-Base has 12 hidden layers, 768 hidden dimensions, and 12 attention heads (110 million parameters); whereas BERT-Large has 24 hidden layers, 1024 hidden dimensions, and 16 attention heads (340 million parameters).

For each of the Transformer variants above, we compare three models in our experiments:

  • BERT: The original BERT model publicly available in https://github.com/google-research/bert.

  • BERT-NCE: Our reimplementation of BERT. It differs from the original implementation in several ways: (1) we only use the masked language modeling objective and remove next sentence prediction, (2) we use negative sampling instead of softmax, and (3) we only use one sentence for each training example in a batch.

  • InfoWord: Our model described in §4. The main difference between InfoWord and BERT-NCE is the addition of $\mathcal{I}_{\text{DIM}}$ to the objective function. We discuss how we mask the data for $\mathcal{I}_{\text{DIM}}$ in §5.2.

5.2 Pretraining

We use the same training corpora and apply the same preprocessing and tokenization as BERT. We create masked sequences for training with $\mathcal{I}_{\text{DIM}}$ as follows. We iteratively sample $n$-grams from a sequence until the masking budget (15% of the sequence length) has been spent. At each sampling iteration, we first sample the length of the $n$-gram (i.e., $n$ in $n$-grams) from a Gaussian distribution clipped at 1 (minimum length) and 10 (maximum length). Since BERT tokenizes words into subwords, we measure the $n$-gram length at the word level and compute the masking budget at the subword level. This procedure is inspired by the masking approach in Joshi et al. (2019).
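The span sampling procedure can be sketched as follows. This is a simplified illustration under several assumptions: the Gaussian mean and standard deviation used here are hypothetical placeholders (the values are not given in this section), every token is treated as one word so the word-level and subword-level counts coincide, and overlapping spans are simply resampled.

```python
import random

def sample_span_mask(num_tokens, rng, mean=3.0, std=1.0,
                     budget_ratio=0.15, min_len=1, max_len=10):
    """Return sorted token positions to mask.
    Spans are sampled until the masking budget is spent; each span length
    is drawn from a Gaussian and clipped to [min_len, max_len]."""
    budget = max(1, int(round(budget_ratio * num_tokens)))
    masked = set()
    attempts = 0
    while len(masked) < budget and attempts < 1000:
        attempts += 1
        length = int(round(rng.gauss(mean, std)))
        length = max(min_len, min(max_len, length))
        # Never exceed the remaining budget or the sequence length.
        length = min(length, budget - len(masked), num_tokens)
        start = rng.randrange(0, num_tokens - length + 1)
        span = set(range(start, start + length))
        if span & masked:
            continue                      # overlaps an earlier span; resample
        masked |= span
    return sorted(masked)

masked_positions = sample_span_mask(100, random.Random(0))
```

Each sampled span corresponds to one $n$-gram $x_{i:j}$, and the sequence with those positions masked plays the role of $\hat{x}_{i:j}$.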

For negative sampling, we use words and $n$-grams from other sequences in the same batch as negative samples (for MLM and DIM respectively). There are approximately 70,000 subwords and 10,000 $n$-grams (words and phrases) in a batch. We discuss hyperparameter details in Appendix B.

5.3 Fine-tuning

We evaluate on two benchmarks: GLUE (Wang et al., 2019) and SQuAD (Rajpurkar et al., 2016). We train a task-specific decoder and fine-tune pretrained models for each dataset that we consider. We describe hyperparameter details in Appendix B.

GLUE is a set of natural language understanding tasks that includes sentiment analysis, linguistic acceptability, paraphrasing, and natural language inference. Each task is formulated as a classification task. The tasks in GLUE are either a single-sentence classification task or a sentence pair classification task. We follow the same setup as the original BERT model and add a start of sentence symbol (i.e., the CLS symbol) to every example and use a separator symbol (i.e., the SEP symbol) to separate two concatenated sentences (for sentence pair classification tasks). We add a linear transformation and a softmax layer to predict the correct label (class) from the representation of the first token of the sequence.

SQuAD is a reading comprehension dataset constructed from Wikipedia articles. We report results on SQuAD 1.1. Here, we also follow the same setup as the original BERT model and predict an answer span, i.e., the start and end indices of the correct answer in the context. We use a standard span predictor as the decoder, which we describe in detail in Appendix C.

Size    Model      CoLA   SST-2   MRPC   QQP    MNLI (m/mm)   QNLI   RTE    Avg
Base    BERT       52.1   93.5    88.9   71.2   84.6/83.4     90.5   66.4   78.8
Base    BERT-NCE   50.8   93.0    88.6   70.5   83.2/83.0     90.9   65.9   78.2
Base    InfoWord   53.3   92.5    88.7   71.0   83.7/82.4     91.4   68.3   78.9
Large   BERT       60.5   94.9    89.3   72.1   86.7/85.9     92.7   70.1   81.5
Large   BERT-NCE   54.7   93.1    89.5   71.2   85.8/85.0     92.7   72.5   80.6
Large   InfoWord   57.5   94.2    90.2   71.3   85.8/84.8     92.6   72.0   81.1

Table 1: Summary of results on GLUE.

Size    Model      Dev F1   Dev EM   Test F1   Test EM
Base    BERT       88.5     80.8     -         -
Base    BERT-NCE   90.2     83.3     90.9      84.4
Base    InfoWord   90.7     84.0     91.4      84.7
Large   BERT       90.9     84.1     91.3      84.3
Large   BERT-NCE   92.0     85.9     92.7      86.6
Large   InfoWord   92.6     86.6     93.1      87.3

Table 2: Summary of results on SQuAD 1.1.

5.4 Results

We show our main results in Table 1 and Table 2. Our BERT reimplementation with negative sampling underperforms the original BERT model on GLUE but is significantly better on SQuAD. We think that the main reasons for this performance discrepancy are the different masking procedures (we use span-based masking instead of whole-word masking) and the different ways training examples are presented to the model (we use one consecutive sequence instead of two sequences separated by the separator symbol). Comparing BERT-NCE and InfoWord, we observe the benefit of the new self-supervised objective (better overall GLUE and SQuAD results), particularly on tasks such as question answering and linguistic acceptability that seem to require understanding of longer phrases. In order to better understand our model, we investigate its performance with varying numbers of training examples and different values of $\lambda_{\text{DIM}}$ on the SQuAD development set and show the results in Figure 1 (for models with the Base configuration). We can see that InfoWord consistently outperforms BERT-NCE and the performance gap is biggest when the dataset is smallest, suggesting the benefit of having better pretrained representations when there are fewer training examples.

Figure 1: The left plot shows $F_1$ scores of BERT-NCE and InfoWord as we increase the percentage of training examples on SQuAD (dev). The right plot shows $F_1$ scores of InfoWord on SQuAD (dev) as a function of $\lambda_{\text{DIM}}$.

5.5 Discussion

Span-based models.

We show how to design a simple self-supervised task in the InfoNCE framework that improves downstream performance on several datasets. Learning language representations to predict contiguous masked tokens has been explored in other contexts, and the objective introduced in $\mathcal{I}_{\text{DIM}}$ is related to span-based models such as SpanBERT (Joshi et al., 2019) and MASS (Song et al., 2019). While our experimental goal is to demonstrate the benefit of contrastive learning for constructing self-supervised tasks, we note that InfoWord is simpler to train and exhibits trends similar to SpanBERT in outperforming baseline models. We leave exhaustive comparisons to these methods to future work.

Mutual information maximization.

A recent study has questioned whether the success of InfoNCE as an objective function is due to its property as a lower bound on mutual information and provides an alternative hypothesis based on metric learning (Tschannen et al., 2019). Regardless of the prevailing perspective, InfoNCE is widely accepted as a good representation learning objective, and formulating state-of-the-art language representation learning methods under this framework offers valuable insights that unify many popular representation learning methods.


Regularization.

Image representation learning methods often incorporate a regularization term in their objective functions to encourage learned representations to look like a prior distribution (Hjelm et al., 2019; Bachman et al., 2019). This is useful for incorporating prior knowledge into a representation learning model. For example, the DeepInfoMax model has a term in its objective that encourages the learned representation from the encoder to match a uniform prior. Regularization is not commonly used when learning language representations. Our analysis and the connection we draw to representation learning methods used in other domains provide an insight into possible ways to incorporate prior knowledge into language representation learning models.

Future directions.

The InfoNCE framework provides a holistic way to view progress in language representation learning. The framework is very flexible and suggests several directions that can be explored to improve existing methods. We show that progress in the field has been largely driven by innovations in the encoder which forms $g_\omega$. InfoNCE is based on maximizing the mutual information between different views of an input, and it facilitates training on structured views as long as we can perform negative sampling (van den Oord et al., 2019; Bachman et al., 2019). Our analysis demonstrates that existing methods based on language modeling objectives only consider a single target word as one of the views. We think that incorporating more complex views (e.g., higher-order or skip $n$-grams, syntactic and semantic parses, etc.) and designing appropriate self-supervised tasks is a promising future direction. A related area that is also underexplored is designing methods to obtain better negative samples.

6 Conclusion

We analyzed state-of-the-art language representation learning methods from the perspective of mutual information maximization. We provided a unifying view of classical and modern word embedding models and showed how they relate to popular representation learning methods used in other domains. We used this framework to construct a new self-supervised task based on maximizing the mutual information between the global representation and local representations of a sentence. We demonstrated the benefit of this new task via experiments on GLUE and SQuAD.


  • S. Arora, H. Khandeparkar, M. Khodak, O. Plevrakis, and N. Saunshi (2019) A theoretical analysis of contrastive unsupervised representation learning. In Proc. of ICML, Cited by: §1, §2.
  • P. Bachman, R. D. Hjelm, and W. Buchwalter (2019) Learning representations by maximizing mutual information across views. arXiv preprint. Cited by: §1, §5.5, §5.5.
  • M. I. Belghazi, A. Baratin, S. Rajeswar, S. Ozair, Y. Bengio, A. Courville, and R. D. Hjelm (2018) MINE: mutual information neural estimation. In Proc. of ICML, Cited by: §1.
  • A. M. Dai and Q. V. Le (2015) Semi-supervised sequence learning. In Proc. of NIPS, Cited by: §1.
  • Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov (2019) Transformer-XL: attentive language models beyond a fixed-length context. In Proc. of ACL, Cited by: §3.3.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL, Cited by: §1, §1, §3.2, §5.1.
  • M. D. Donsker and S. R. S. Varadhan (1983) Asymptotic evaluation of certain markov process expectations for large time. iv. Communications on Pure and Applied Mathematics 36 (2), pp. 183––212. Cited by: footnote 1.
  • M. U. Gutmann and A. Hyvarinen (2012) Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics.

    Journal of Machine Learning Research

    13, pp. 307––361.
    Cited by: Appendix A, §2.
  • R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio (2019) Learning deep representations by mutual information estimation and maximization. In Proc. of ICLR, Cited by: §1, §4, §5.5.
  • J. Howard and S. Ruder (2018) Universal language model fine-tuning for text classification. In Proc. of ACL, Cited by: §1.
  • M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy (2019) SpanBERT: improving pre-training by representing and predicting spans. arXiv preprint. Cited by: Appendix B, §3.2, §5.2, §5.5.
  • D. P. Kingma and J. L. Ba (2015) Adam: a method for stochastic optimization. In Proc. of ICLR, Cited by: Appendix B.
  • G. Lample and A. Conneau (2019) Cross-lingual language model pretraining. arXiv preprint. Cited by: §3.2.
  • R. Linsker (1988) Self-organization in a perceptual network. Computer 21 (3), pp. 105–117. Cited by: §1.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized bert pretraining approach. arXiv preprint. Cited by: §3.2.
  • L. Logeswaran and H. Lee (2018) An efficient framework for learning sentence representations. In Proc. of ICLR, Cited by: §2.
  • T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Proc. of NIPS, Cited by: §1, §1, §3.1, §3.1.
  • A. Mnih and K. Kavukcuoglu (2013) Learning word embeddings efficiently with noise-contrastive estimation. In Proc. of NIPS, Cited by: §3.1.
  • S. Nowozin, B. Cseke, and R. Tomioka (2016) f-GAN: training generative neural samplers using variational divergence minimization. In Proc. of NIPS, Cited by: footnote 1.
  • S. Löwe, P. O’Connor, and B. S. Veeling (2019) Greedy InfoMax for biologically plausible self-supervised representation learning. In Proc. of NeurIPS, Cited by: §1.
  • L. Paninski (2003) Estimation of entropy and mutual information. Neural Computation 15 (6), pp. 1191–1253. Cited by: §2.
  • J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In Proc. of EMNLP, Cited by: §1.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proc. of NAACL, Cited by: §1, §4.
  • B. Poole, S. Ozair, A. van den Oord, A. A. Alemi, and G. Tucker (2019) On variational lower bounds of mutual information. In Proc. of ICML, Cited by: footnote 2.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. Technical report, OpenAI. Cited by: §1.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. Technical report, OpenAI. Cited by: §1, §4.
  • P. Rajpurkar, R. Jia, and P. Liang (2018) Know what you don’t know: unanswerable questions for squad. In Proc. of ACL, Cited by: §1.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In Proc. of EMNLP, Cited by: §1, §5.3.
  • K. Song, X. Tan, T. Qin, J. Lu, and T. Liu (2019) MASS: masked sequence to sequence pre-training for language generation. In Proc. of ICML, Cited by: §5.5.
  • M. Tschannen, J. Djolonga, P. K. Rubenstein, and S. Gelly (2019) On mutual information maximization for representation learning. arXiv preprint. Cited by: §5.5.
  • A. van den Oord, Y. Li, and O. Vinyals (2019) Representation learning with contrastive predictive coding. arXiv preprint. Cited by: §1, §2, §5.5, footnote 2.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2019) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proc. of ICLR, Cited by: §1, §5.3.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint. Cited by: §1, §1, §3.3.
  • D. Yogatama, C. de Masson d’Autume, J. Connor, T. Kocisky, M. Chrzanowski, L. Kong, A. Lazaridou, W. Ling, L. Yu, C. Dyer, and P. Blunsom (2019) Learning and evaluating general linguistic intelligence. arXiv preprint. Cited by: §1.

Appendix A Next Sentence Prediction

In this section, we show that the next sentence prediction objective used in BERT is an instance of contrastive learning. In next sentence prediction, given two sentences $x^1$ and $x^2$, the task is to predict whether they are consecutive. Training data for this task is created by sampling a random second sentence $\bar{x}^2$ from the corpus to be used as a negative example 50% of the time.

Consider a discriminator (i.e., a classifier with parameters $\phi$) that takes the encoded representation of the concatenation of $x^1$ and $x^2$ and returns a score. We denote this discriminator by $d_\phi(g_\omega([x^1, x^2]))$, where $g_\omega$ is the encoder. The next sentence prediction objective function is:

$$\mathbb{E}_{p(x^1, x^2)}\left[\log d_\phi\left(g_\omega([x^1, x^2])\right) + \log\left(1 - d_\phi\left(g_\omega([x^1, \bar{x}^2])\right)\right)\right].$$

This objective function, which is used for training BERT, is known in the literature as “local” Noise Contrastive Estimation (Gutmann and Hyvarinen, 2012). Since summing over all possible negative sentences is intractable, BERT approximates the sum by using a binary classifier to distinguish real samples from noise samples.
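The binary-classifier ("local" NCE) form can be illustrated with a small sketch. It assumes the discriminator is a sigmoid applied to a scalar score, which is an illustrative choice rather than a detail stated here; the function names `softplus` and `nsp_binary_loss` are hypothetical.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def nsp_binary_loss(pos_score: float, neg_score: float) -> float:
    """Negative of log d(pos) + log(1 - d(neg)), assuming d = sigmoid(score).

    Uses the identities -log(sigmoid(z)) = softplus(-z) and
    -log(1 - sigmoid(z)) = softplus(z) to avoid taking log(0).
    """
    return softplus(-pos_score) + softplus(neg_score)


# A real (consecutive) pair scored high and a random pair scored low
# gives a smaller loss than an uninformative discriminator.
confident = nsp_binary_loss(pos_score=5.0, neg_score=-5.0)
uninformative = nsp_binary_loss(pos_score=0.0, neg_score=0.0)
```

With both scores at zero the discriminator outputs 0.5 for each pair, so the loss equals 2 log 2.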

An alternative approximation to using a binary classifier is to use “global NCE”, which is what InfoNCE is based on. Here, we have:

$$\mathbb{E}_{p(x^1, x^2)}\left[\psi^\top g_\omega([x^1, x^2]) - \log \sum_{\bar{x}^2 \in \tilde{\mathcal{B}}} \exp\left(\psi^\top g_\omega([x^1, \bar{x}^2])\right)\right],$$

where we sample negative sentences from the corpus and combine them with the positive sentence to construct $\tilde{\mathcal{B}}$. To make the connection between this objective function and InfoNCE in Eq. 1 explicit, let $a$ and $b$ be two consecutive sentences $x^1$ and $x^2$. Let $f_\theta(a, b)$ be $\psi^\top g_\omega([x^1, x^2])$, where $\psi \in \mathbb{R}^d$ is a trainable parameter and $[x^1, x^2]$ denotes the concatenation of $x^1$ and $x^2$. Consider a Transformer encoder parameterized by $\omega$, and let $g_\omega([x^1, x^2]) \in \mathbb{R}^d$ be a function that returns the final hidden state of the first token after running the concatenated sequence through the Transformer. Note that the encoder that we want to learn only depends on $\omega$, so both of these approximations can be used for training next sentence prediction.
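The "global NCE" variant can be sketched in the same spirit. This is a minimal illustration, assuming the score is a dot product between a trainable vector and the encoder output for the concatenated pair; the names `score`, `log_sum_exp`, and `info_nce_loss` are hypothetical, and the encoder outputs are stand-in lists of floats rather than real Transformer states.

```python
import math
from typing import Sequence


def score(psi: Sequence[float], hidden: Sequence[float]) -> float:
    """Dot product of a trainable vector with an encoded sentence pair."""
    return sum(p * h for p, h in zip(psi, hidden))


def log_sum_exp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def info_nce_loss(pos_score: float, neg_scores: Sequence[float]) -> float:
    """Negative of: pos_score - log sum_{candidates} exp(candidate score).

    The candidate set contains the positive second sentence together
    with the sampled negative sentences.
    """
    candidates = [pos_score] + list(neg_scores)
    return -(pos_score - log_sum_exp(candidates))


# Stand-in encoder outputs for [first sentence, candidate second sentence].
psi = [1.0, -1.0]
positive = score(psi, [2.0, 0.0])
negatives = [score(psi, [0.0, 1.0]), score(psi, [0.5, 0.5])]
loss = info_nce_loss(positive, negatives)
```

Unlike the binary classifier, this loss normalizes over the whole candidate set: when all K candidates receive the same score, the loss is log K.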

Appendix B Hyperparameters


We use Adam (Kingma and Ba, 2015) with , and . The batch size for training is 1024 with a maximum sequence length of . We train for 400,000 steps (including 18,000 warmup steps) with a weight decay rate of 0.01. We set the learning rate to for all variants of the Base models and for the Large models. We set to and tune .


We set the maximum sequence length to 128. For each GLUE task, we use the respective development set to choose the learning rate from , and the batch size from . The number of training epochs is set to 4 for CoLA and 10 for other tasks, following Joshi et al. (2019). We run each hyperparameter configuration 5 times and evaluate the best model on the test set (once).


We set the maximum sequence length to 512 and train for 4 epochs. We use the development set to choose the learning rate from and the batch size from .

Appendix C Question Answering Decoder

We use a standard span predictor as follows. Denote the length of the context paragraph by $M$, and the context tokens by $x_1, \ldots, x_M$. Denote the encoded representation of the $m$-th token in the context by $h_m$. The question answering decoder introduces two sets of parameters: $w_{\text{start}}$ and $w_{\text{end}}$. The probability of each context token being the start of the answer is computed as:

$$p(\text{start} = m) = \frac{\exp\left(w_{\text{start}}^\top h_m\right)}{\sum_{n=1}^{M} \exp\left(w_{\text{start}}^\top h_n\right)}.$$

The probability of the end index of the answer is computed analogously using $w_{\text{end}}$. The predicted answer is the span with the highest probability after multiplying the start and end probabilities.
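The span predictor described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: token encodings are plain lists of floats, start and end probabilities are softmaxes over dot-product scores, and the search is restricted to spans whose end is not before their start (a common constraint that is assumed here). The names `predict_span`, `w_start`, and `w_end` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_span(
    encodings: Sequence[Sequence[float]],
    w_start: Sequence[float],
    w_end: Sequence[float],
) -> Tuple[int, int, float]:
    """Return (start, end, probability) of the highest-scoring span."""
    p_start = softmax([dot(w_start, h) for h in encodings])
    p_end = softmax([dot(w_end, h) for h in encodings])
    best = (0, 0, -1.0)
    for i, ps in enumerate(p_start):
        # Assumption: only consider spans with end index >= start index.
        for j in range(i, len(p_end)):
            prob = ps * p_end[j]
            if prob > best[2]:
                best = (i, j, prob)
    return best


# Three context tokens with 2-dimensional stand-in encodings.
context = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
start, end, prob = predict_span(context, w_start=[2.0, 0.0], w_end=[0.0, 2.0])
```

The exhaustive search is quadratic in the context length; practical systems often cap the maximum answer length, which is omitted here for brevity.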