On the Sentence Embeddings from Pre-trained Language Models

by   Bohan Li, et al.
ByteDance Inc.
Carnegie Mellon University

Pre-trained contextual representations like BERT have achieved great success in natural language processing. However, the sentence embeddings from pre-trained language models without fine-tuning have been found to capture the semantic meaning of sentences poorly. In this paper, we argue that the semantic information in the BERT embeddings is not fully exploited. We first reveal the theoretical connection between the masked language model pre-training objective and the semantic similarity task, and then analyze the BERT sentence embeddings empirically. We find that BERT always induces a non-smooth anisotropic semantic space of sentences, which harms its performance on semantic similarity. To address this issue, we propose to transform the anisotropic sentence embedding distribution into a smooth and isotropic Gaussian distribution through normalizing flows that are learned with an unsupervised objective. Experimental results show that our proposed BERT-flow method obtains significant performance gains over the state-of-the-art sentence embeddings on a variety of semantic textual similarity tasks. The code is available at https://github.com/bohanli/BERT-flow.





1 Introduction

Recently, pre-trained language models and their variants (radford2019language; devlin2018bert; yang2019xlnet; liu2019roberta), such as BERT (devlin2018bert), have been widely used as representations of natural language. Despite their great success on many NLP tasks through fine-tuning, the sentence embeddings from BERT without fine-tuning are significantly inferior in terms of semantic textual similarity (reimers2019sentence): for example, they even underperform the GloVe (pennington2014glove) embeddings, which are not contextualized and are trained with a much simpler model. Such issues hinder applying BERT sentence embeddings directly to many real-world scenarios where collecting labeled data is costly or even intractable.

In this paper, we aim to answer two major questions: (1) Why do the BERT-induced sentence embeddings perform poorly at retrieving semantically similar sentences? Do they carry too little semantic information, or is the semantic information in these embeddings simply not exploited properly? (2) If the BERT embeddings capture enough semantic information but it is hard to utilize directly, how can we make it easier to use without external supervision?

Towards this end, we first study the connection between the BERT pretraining objective and the semantic similarity task. Our analysis reveals that the sentence embeddings of BERT should intuitively be able to reflect the semantic similarity between sentences, which contradicts experimental observations. Inspired by gao2019representation, who find that language modeling performance can be limited by the learned anisotropic word embedding space in which the word embeddings occupy a narrow cone, and by ethayarajh2019contextual, who find that BERT word embeddings also suffer from anisotropy, we hypothesize that the sentence embeddings from BERT, computed as the average of context embeddings from the last layers (see footnote 1), may suffer from similar issues. Through empirical probing over the embeddings, we further observe that the BERT sentence embedding space is semantically non-smooth and poorly defined in some areas, which makes it hard to use directly through simple similarity metrics such as dot product or cosine similarity.

Footnote 1: In this paper, we compute the average of context embeddings from the last one or two layers as our sentence embeddings, since they are consistently better than the [CLS] vector.

To address these issues, we propose to transform the BERT sentence embedding distribution into a smooth and isotropic Gaussian distribution through normalizing flows (dinh2014nice), which are invertible functions parameterized by neural networks. Concretely, we learn a flow-based generative model to maximize the likelihood of generating BERT sentence embeddings from a standard Gaussian latent variable in an unsupervised fashion. During training, only the flow network is optimized while the BERT parameters remain unchanged. The learned flow, an invertible mapping between the BERT sentence embedding and the Gaussian latent variable, is then used to transform the BERT sentence embedding to the Gaussian space. We name the proposed method BERT-flow.

We perform extensive experiments on 7 standard semantic textual similarity benchmarks without using any downstream supervision. Our empirical results demonstrate that the flow transformation consistently improves BERT, by up to 12.70 points and by 8.16 points on average, in terms of Spearman correlation between cosine embedding similarity and human-annotated similarity. When combined with external supervision from natural language inference tasks (snli; mnli), our method outperforms the sentence-BERT embeddings (reimers2019sentence), leading to new state-of-the-art performance. In addition to semantic similarity tasks, we apply sentence embeddings to a question-answer entailment task, QNLI (wang2018glue), directly without task-specific supervision, and demonstrate the superiority of our approach. Moreover, our further analysis implies that BERT-induced similarity can correlate excessively with lexical similarity rather than semantic similarity, and that our proposed flow-based method can effectively remedy this problem.

2 Understanding the Sentence Embedding Space of BERT

To encode a sentence into a fixed-length vector with BERT, it is a convention to either compute an average of context embeddings in the last few layers of BERT, or extract the BERT context embedding at the position of the [CLS] token. Note that there is no token masked when producing sentence embeddings, which is different from pretraining.

reimers2019sentence demonstrate that such BERT sentence embeddings lag behind the state-of-the-art sentence embeddings in terms of semantic similarity. On the STS-B dataset, BERT sentence embeddings are even less competitive than averaged GloVe (pennington2014glove) embeddings, a simple and non-contextualized baseline proposed several years ago. Nevertheless, this weakness has not yet been well understood in the existing literature.

Note that as demonstrated by reimers2019sentence, averaging context embeddings consistently outperforms the [CLS] embedding. Therefore, unless mentioned otherwise, we use average of context embeddings as BERT sentence embeddings and do not distinguish them in the rest of the paper.

2.1 The Connection between Semantic Similarity and BERT Pre-training

We consider a sequence of tokens $x_{1:T} = (x_1, \ldots, x_T)$. Language modeling (LM) factorizes the joint probability $p(x_{1:T})$ in an autoregressive way, namely $\log p(x_{1:T}) = \sum_{t=1}^{T} \log p(x_t \mid c_t)$, where the context $c_t = x_{1:t-1}$. To capture bidirectional context during pretraining, BERT proposes a masked language modeling (MLM) objective, which instead factorizes the probability of noisy reconstruction $\log p(\bar{x} \mid \hat{x}) = \sum_{t=1}^{T} m_t \log p(x_t \mid c_t)$, where $\hat{x}$ is a corrupted sequence, $\bar{x}$ is the masked tokens, and $m_t$ is equal to 1 when $x_t$ is masked and 0 otherwise. The context $c_t = \hat{x}$.

Note that both LM and MLM can be reduced to modeling the conditional distribution of a token $x$ given the context $c$, which is typically formulated with a softmax function as

$$p(x \mid c) = \frac{\exp(\mathbf{h}_c^\top \mathbf{w}_x)}{\sum_{x'} \exp(\mathbf{h}_c^\top \mathbf{w}_{x'})}. \tag{1}$$

Here the context embedding $\mathbf{h}_c$ is a function of $c$, which is usually heavily parameterized by a deep neural network (e.g., a Transformer (vaswani2017attention)); the word embedding $\mathbf{w}_x$ is a function of $x$, which is parameterized by an embedding lookup table.
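As a minimal sketch of this softmax parameterization, the snippet below scores every word in a vocabulary by its dot product with a context embedding and normalizes the scores. The three-word vocabulary and the 3-dimensional vectors are hypothetical toy values, not BERT embeddings.

```python
import math

def next_token_distribution(h_c, word_embeddings):
    """Softmax over dot products between a context embedding and every word embedding."""
    logits = {w: sum(a * b for a, b in zip(h_c, vec)) for w, vec in word_embeddings.items()}
    max_logit = max(logits.values())  # subtract the max for numerical stability
    exps = {w: math.exp(v - max_logit) for w, v in logits.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

# Hypothetical toy values, for illustration only.
toy_word_embeddings = {
    "cat": [1.0, 0.0, 0.5],
    "dog": [0.9, 0.1, 0.4],
    "car": [-1.0, 0.5, 0.0],
}
toy_context = [1.0, 0.0, 1.0]
probs = next_token_distribution(toy_context, toy_word_embeddings)
```

Words whose embeddings have a larger dot product with the context embedding receive more probability mass, which is why the dot product between a context and a word embedding is the quantity the training objective shapes directly.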

The similarity between BERT sentence embeddings can be reduced to the similarity between BERT context embeddings $\mathbf{h}_c^\top \mathbf{h}_{c'}$ (see footnote 2). However, as shown in Equation 1, the pretraining of BERT does not explicitly involve the computation of $\mathbf{h}_c^\top \mathbf{h}_{c'}$. Therefore, we can hardly derive a mathematical formulation of what $\mathbf{h}_c^\top \mathbf{h}_{c'}$ exactly represents.

Footnote 2: This is because we approximate BERT sentence embeddings with context embeddings, and compute their dot product (or cosine similarity) as model-predicted sentence similarity. Dot product is equivalent to cosine similarity when the embeddings are normalized to the unit hyper-sphere.

Co-Occurrence Statistics as the Proxy for Semantic Similarity

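Instead of directly analyzing $\mathbf{h}_c^\top \mathbf{h}_{c'}$, we consider $\mathbf{h}_c^\top \mathbf{w}_x$, the dot product between a context embedding $\mathbf{h}_c$ and a word embedding $\mathbf{w}_x$. According to yang2017breaking, in a well-trained language model, $\mathbf{h}_c^\top \mathbf{w}_x$ can be approximately decomposed as follows,

$$\mathbf{h}_c^\top \mathbf{w}_x \approx \log p^*(x \mid c) + \lambda_c \tag{2}$$

$$= \mathrm{PMI}(x, c) + \log p(x) + \lambda_c, \tag{3}$$

where $\mathrm{PMI}(x, c) = \log \frac{p(x, c)}{p(x)\,p(c)}$ denotes the pointwise mutual information between $x$ and $c$, $\log p(x)$ is a word-specific term, and $\lambda_c$ is a context-specific term.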

PMI captures how much more frequently two events co-occur than they would if they occurred independently. Note that co-occurrence statistics are a typical tool for dealing with “semantics” in a computational way; specifically, PMI is a common mathematical surrogate for approximating word-level semantic similarity (levy2014neural; ethayarajh2019towards). Therefore, roughly speaking, it is semantically meaningful to compute the dot product between a context embedding and a word embedding.
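To make the PMI term concrete, here is a small sketch that estimates PMI from raw co-occurrence counts. The (word, context) pairs are a hypothetical toy corpus; each context is reduced to a single label for simplicity, which is an assumption of this example rather than how BERT contexts are defined.

```python
import math
from collections import Counter

def pmi_table(pairs):
    """Return PMI(x, c) = log[p(x, c) / (p(x) p(c))] for every observed (word, context) pair."""
    n = len(pairs)
    joint = Counter(pairs)
    word_counts = Counter(x for x, _ in pairs)
    ctx_counts = Counter(c for _, c in pairs)
    return {
        (x, c): math.log((k / n) / ((word_counts[x] / n) * (ctx_counts[c] / n)))
        for (x, c), k in joint.items()
    }

# Hypothetical toy corpus of (word, context) co-occurrences.
toy_pairs = [
    ("bank", "money"), ("bank", "money"), ("bank", "river"),
    ("loan", "money"), ("fish", "river"), ("fish", "river"),
]
pmi = pmi_table(toy_pairs)
```

A positive value means the pair co-occurs more often than independence would predict, and a negative value means less often. Since PMI(x, c) = log p(x | c) - log p(x), adding the word-specific term log p(x) back recovers the log conditional probability.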

Higher-Order Co-Occurrence Statistics as Context-Context Semantic Similarity.

During pretraining, the semantic relationship between two contexts $c$ and $c'$ could be inferred and reinforced through their connections to words. To be specific, if both contexts $c$ and $c'$ co-occur with the same word $x$, the two contexts are likely to share similar semantic meaning. During training, when $c$ and $x$ occur at the same time, the embeddings $\mathbf{h}_c$ and $\mathbf{w}_x$ are encouraged to be closer to each other, while the embeddings $\mathbf{h}_c$ and $\mathbf{w}_{x'}$, where $x' \neq x$, are pushed away from each other due to normalization. A similar scenario applies to the context $c'$. In this way, the similarity between $\mathbf{h}_c$ and $\mathbf{h}_{c'}$ is also promoted. With all the words in the vocabulary acting as hubs, the context embeddings should be aware of their semantic relatedness to each other.

Higher-order context-context co-occurrence could also be inferred and propagated during pretraining. The update of a context embedding $\mathbf{h}_c$ could affect another context embedding $\mathbf{h}_{c'}$ in the above way, and similarly $\mathbf{h}_{c'}$ can further affect another $\mathbf{h}_{c''}$. Therefore, the context embeddings can form an implicit interaction among themselves via higher-order co-occurrence relations.

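| Rank of word frequency (bins ordered from most to least frequent) | bin 1 | bin 2 | bin 3 | bin 4 |
|---|---|---|---|---|
| Mean $\ell_2$-norm | 0.95 | 1.04 | 1.22 | 1.45 |
| Mean $k$-NN $\ell_2$-dist. ($k$ = …) | 0.77 | 0.93 | 1.16 | 1.30 |
| Mean $k$-NN $\ell_2$-dist. ($k$ = …) | 0.83 | 0.99 | 1.22 | 1.34 |
| Mean $k$-NN $\ell_2$-dist. ($k$ = …) | 0.87 | 1.04 | 1.26 | 1.37 |
| Mean $k$-NN dot-product ($k$ = …) | 0.73 | 0.92 | 1.20 | 1.63 |
| Mean $k$-NN dot-product ($k$ = …) | 0.73 | 0.91 | 1.19 | 1.61 |
| Mean $k$-NN dot-product ($k$ = …) | 0.72 | 0.90 | 1.17 | 1.60 |

Table 1: The mean $\ell_2$-norm of the BERT word embeddings, as well as their distance to their $k$-nearest neighbors (among all the word embeddings), segmented by ranges of word frequency rank (counted based on a Wikipedia dump; the smaller the rank, the more frequent the word).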

2.2 Anisotropic Embedding Space Induces Poor Semantic Similarity

As discussed in Section 2.1, the pretraining of BERT should have implicitly encouraged semantically meaningful context embeddings. Why, then, do BERT sentence embeddings without fine-tuning yield unsatisfactory performance?

To investigate the underlying cause of this failure, we use word embeddings as a surrogate because words and contexts share the same embedding space. If the word embeddings exhibit misleading properties, the context embeddings will also be problematic, and vice versa.

gao2019representation and wang2020spectrum have pointed out that, for language modeling, maximum likelihood training with Equation 1 usually produces an anisotropic word embedding space. “Anisotropic” means that word embeddings occupy a narrow cone in the vector space. This phenomenon is also observed in pretrained Transformers such as BERT and GPT-2.


In addition, we have two empirical observations over the learned anisotropic embedding space.

Observation 1: Word Frequency Biases the Embedding Space

We expect the embedding-induced similarity to be consistent with semantic similarity. If embeddings are distributed in different regions according to frequency statistics, the induced similarity is no longer useful.

However, as discussed by gao2019representation, anisotropy is closely related to the imbalance of word frequency. They prove that, under some assumptions, the optimal embeddings of tokens that never appear in the training data of Transformer language models can be extremely far away from the origin. They also roughly generalize this conclusion to rarely appearing words.

To verify this hypothesis in the context of BERT, we compute the mean distance between the BERT word embeddings and the origin (i.e., the mean $\ell_2$-norm). In the upper half of Table 1, we observe that high-frequency words are all close to the origin, while low-frequency words are far away from the origin.

This observation indicates that the word embeddings can be biased by word frequency. This coincides with the second term in Equation 3, the log density of words. Because word embeddings connect the context embeddings during training, context embeddings might be misled by the word frequency information accordingly, and their preserved semantic information can be corrupted.

Observation 2: Low-Frequency Words Disperse Sparsely

We observe that, in the learned anisotropic embedding space, high-frequency words concentrate densely and low-frequency words disperse sparsely.

This observation is obtained by computing the mean distance of word embeddings to their $k$-nearest neighbors. In the lower half of Table 1, we observe that the embeddings of low-frequency words tend to be farther from their $k$-NN neighbors than the embeddings of high-frequency words. This demonstrates that low-frequency words tend to disperse sparsely.
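The two statistics behind these observations can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the authors' script: the embeddings are assumed to be ordered by frequency rank, the helper names are hypothetical, and the 2-D vectors are toy values chosen so that "frequent" words sit near the origin.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(a * a for a in v))

def l2_dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def mean_knn_dist(index, embeddings, k):
    """Mean L2 distance from embeddings[index] to its k nearest neighbors among all others."""
    dists = sorted(l2_dist(embeddings[index], e) for j, e in enumerate(embeddings) if j != index)
    return sum(dists[:k]) / k

def stats_by_frequency_bin(embeddings, bins, k):
    """embeddings are ordered by frequency rank; bins is a list of (start, end) rank ranges."""
    rows = []
    for start, end in bins:
        ids = range(start, end)
        mean_norm = sum(l2_norm(embeddings[i]) for i in ids) / len(ids)
        mean_knn = sum(mean_knn_dist(i, embeddings, k) for i in ids) / len(ids)
        rows.append((mean_norm, mean_knn))
    return rows

# Hypothetical toy 2-D "embeddings": frequent words near the origin, rare words far and spread out.
toy = [[0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [0.0, -0.1],
       [2.0, 0.0], [0.0, 3.0], [-4.0, 0.0], [0.0, -5.0]]
rows = stats_by_frequency_bin(toy, bins=[(0, 4), (4, 8)], k=2)
```

Each entry of `rows` holds the mean norm and mean k-NN distance for one frequency bin, mirroring one column of Table 1. The brute-force neighbor search is quadratic in vocabulary size, so a real vocabulary would call for a vectorized or approximate search.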

Due to the sparsity, many “holes” could be formed around the low-frequency word embeddings in the embedding space, where the semantic meaning can be poorly defined. Note that BERT sentence embeddings are produced by averaging the context embeddings, which is a convexity-preserving operation. However, the holes violate the convexity of the embedding space. This is a common problem in the context of representation learning (rezende2018taming; li2019surprisingly; ghosh2019variational). Therefore, the resulting sentence embeddings can land in the poorly defined areas, and the induced similarity can be problematic.

3 Proposed Method: BERT-flow

To verify the hypotheses proposed in Section 2.2, and to circumvent the weakness of the BERT sentence embeddings, we propose a calibration method called BERT-flow, in which we take advantage of an invertible mapping from the BERT embedding space to a standard Gaussian latent space. The invertibility condition ensures that the mutual information between the embedding space and the data examples does not change.

Figure 1: An illustration of our proposed flow-based calibration over the original sentence embedding space of BERT.

3.1 Motivation

A standard Gaussian latent space may have favorable properties which can help with our problem.

Connection to Observation 1

First, the standard Gaussian satisfies isotropy. The probability density of the standard Gaussian distribution does not vary with angle. If the $\ell_2$ norm of samples from a standard Gaussian is normalized to 1, these samples can be regarded as uniformly distributed over a unit sphere.
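The isotropy claim can be checked numerically with a short sketch: draw standard Gaussian samples, rescale each to unit norm, and confirm that no direction is preferred. The dimension, sample count, and seed below are arbitrary choices for illustration.

```python
import math
import random

def unit_gaussian_direction(dim, rng):
    """Draw z ~ N(0, I) and rescale it to unit L2 norm."""
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(a * a for a in z))
    return [a / norm for a in z]

rng = random.Random(0)
dim, n = 8, 5000
samples = [unit_gaussian_direction(dim, rng) for _ in range(n)]

# For an isotropic distribution the mean direction is close to the zero vector,
# which also means the average cosine similarity between random pairs is close to zero.
mean_vec = [sum(s[d] for s in samples) / n for d in range(dim)]
mean_vec_norm = math.sqrt(sum(a * a for a in mean_vec))
```

In an anisotropic space, where vectors occupy a narrow cone, the same `mean_vec_norm` would be far from zero because most vectors share a common direction.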

We can also understand the isotropy from a singular spectrum perspective. As discussed above, the anisotropy of the embedding space stems from the imbalance of word frequency. In the literature on traditional word embeddings, mu2017all discover that the dominating singular vectors can be highly correlated with word frequency, which misleads the embedding space. By fitting a mapping to an isotropic distribution, the singular spectrum of the embedding space can be flattened. In this way, the word-frequency-related singular directions, which are the dominating ones, can be suppressed.

Connection to Observation 2

Second, the probabilistic density of Gaussian is well defined over the entire real space. This means there are no “hole” areas, which are poorly defined in terms of probability. The helpfulness of Gaussian prior for mitigating the “hole” problem has been widely observed in existing literature of deep latent variable models (rezende2018taming; li2019surprisingly; ghosh2019variational).

3.2 Flow-based Generative Model

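We instantiate the invertible mapping with flows. A flow-based generative model (kobyzev2019normalizing) establishes an invertible transformation from the latent space $\mathcal{Z}$ to the observed space $\mathcal{U}$. The generative story of the model is defined as

$$\mathbf{z} \sim p_{\mathcal{Z}}(\mathbf{z}), \quad \mathbf{u} = f_\phi(\mathbf{z}),$$

where $\mathbf{z} \sim p_{\mathcal{Z}}(\mathbf{z})$ is the prior distribution, and $f_\phi: \mathcal{Z} \to \mathcal{U}$ is an invertible transformation. With the change-of-variables theorem, the probability density function (PDF) of the observable $\mathbf{u}$ is given as

$$p_{\mathcal{U}}(\mathbf{u}) = p_{\mathcal{Z}}\big(f_\phi^{-1}(\mathbf{u})\big) \left| \det \frac{\partial f_\phi^{-1}(\mathbf{u})}{\partial \mathbf{u}} \right|.$$

In our method, we learn a flow-based generative model by maximizing the likelihood of generating BERT sentence embeddings from a standard Gaussian latent variable. In other words, the base distribution $p_{\mathcal{Z}}$ is a standard Gaussian and we consider the extracted BERT sentence embeddings as the observed space $\mathcal{U}$. We maximize the likelihood of $\mathcal{U}$'s marginal via Equation 3.2 in a fully unsupervised way:

$$\max_\phi \; \mathbb{E}_{\mathbf{u} = \mathrm{BERT}(\text{sentence}),\ \text{sentence} \sim \mathcal{D}} \left[ \log p_{\mathcal{Z}}\big(f_\phi^{-1}(\mathbf{u})\big) + \log \left| \det \frac{\partial f_\phi^{-1}(\mathbf{u})}{\partial \mathbf{u}} \right| \right] \tag{3.2}$$

Here $\mathcal{D}$ denotes the dataset, in other words, the collection of sentences. Note that during training, only the flow parameters are optimized while the BERT parameters remain unchanged. Eventually, we learn an invertible mapping function $f_\phi^{-1}$ which can transform each BERT sentence embedding $\mathbf{u}$ into a latent Gaussian representation $\mathbf{z}$ without loss of information.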
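The change-of-variables likelihood can be illustrated in one dimension, where the determinant reduces to a single derivative. The sketch below assumes a toy affine flow u = scale * z + shift with a standard Gaussian prior; the function names, data values, and parameter settings are hypothetical and stand in for the learned flow and real sentence embeddings.

```python
import math

def standard_normal_logpdf(z):
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def affine_flow_logpdf(u, scale, shift):
    """log p_U(u) for u = scale * z + shift with z ~ N(0, 1), via change of variables."""
    z = (u - shift) / scale                 # f^{-1}(u)
    log_abs_det = -math.log(abs(scale))     # log |d f^{-1}(u) / du|
    return standard_normal_logpdf(z) + log_abs_det

def mean_log_likelihood(data, scale, shift):
    """The unsupervised objective: average log-likelihood of the observed values."""
    return sum(affine_flow_logpdf(u, scale, shift) for u in data) / len(data)

# Hypothetical 1-D "embeddings"; the flow whose parameters match the data scores higher.
toy_data = [4.0, 5.0, 6.0, 5.5, 4.5]
good = mean_log_likelihood(toy_data, scale=0.7, shift=5.0)
bad = mean_log_likelihood(toy_data, scale=1.0, shift=0.0)
```

Maximizing this quantity over the flow parameters is what pulls the mapped data toward the standard Gaussian: the first term rewards latent values near the origin and the log-determinant term prevents the flow from collapsing everything to a point.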

The invertible mapping is parameterized as a neural network, and the architectures are usually carefully designed to guarantee invertibility (dinh2014nice). Moreover, its determinant should also be easy to compute so as to make the maximum likelihood training tractable. In our experiments, we follow the design of Glow (kingma2018glow). The Glow model is composed of a stack of multiple invertible transformations, namely actnorm, invertible convolution, and affine coupling layer (see footnote 3). We simplify the model by replacing affine coupling with additive coupling (dinh2014nice) to reduce model complexity, and replacing the invertible convolution with random permutation to avoid numerical errors. For the mathematical formula of the flow model with additive coupling, please refer to Appendix A.

Footnote 3: For concrete mathematical formulations, please refer to Table 1 of kingma2018glow.
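A rough sketch of an additive coupling step combined with a fixed permutation is shown below. It is not the authors' implementation (their formulas are in Appendix A): `toy_shift` is a hypothetical stand-in for the learned network, the permutation is hand-picked, and calling the direction from embedding to latent "forward" is a naming choice made only for this example.

```python
import math

def toy_shift(first_half):
    # Hypothetical stand-in for the learned network inside the coupling layer.
    return [math.tanh(2.0 * a) + 0.5 for a in first_half]

def coupling_forward(u, shift_fn=toy_shift):
    """(u1, u2) -> (u1, u2 + m(u1)); the Jacobian is unit-triangular, so log|det| = 0."""
    if len(u) % 2:
        raise ValueError("this sketch assumes an even embedding dimension")
    half = len(u) // 2
    u1, u2 = u[:half], u[half:]
    return u1 + [b + s for b, s in zip(u2, shift_fn(u1))]

def coupling_inverse(z, shift_fn=toy_shift):
    """Exact inverse: recompute m(z1) from the untouched half and subtract it."""
    half = len(z) // 2
    z1, z2 = z[:half], z[half:]
    return z1 + [b - s for b, s in zip(z2, shift_fn(z1))]

def permute(v, order):
    """Fixed permutation between coupling layers so that both halves get updated."""
    return [v[i] for i in order]

u = [0.3, -1.2, 0.8, 2.0]
order = [2, 0, 3, 1]  # hypothetical fixed permutation
z = coupling_forward(permute(coupling_forward(u), order))

# Undo the stack in reverse order to recover the input.
inverse_order = [order.index(i) for i in range(len(order))]
u_back = coupling_inverse(permute(coupling_inverse(z), inverse_order))
```

Because the shift depends only on the half that passes through unchanged, the layer is invertible no matter how complicated the shift network is, and both the coupling and the permutation contribute a zero log-determinant to the likelihood.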

4 Experiments

| Method | STS-B | SICK-R | STS-12 | STS-13 | STS-14 | STS-15 | STS-16 |
|---|---|---|---|---|---|---|---|
| Published in reimers2019sentence | | | | | | | |
| Avg. GloVe embeddings | 58.02 | 53.76 | 55.14 | 70.66 | 59.73 | 68.25 | 63.66 |
| Avg. BERT embeddings | 46.35 | 58.40 | 38.78 | 57.98 | 57.98 | 63.15 | 61.06 |
| BERT CLS-vector | 16.50 | 42.63 | 20.16 | 30.01 | 20.09 | 36.88 | 38.03 |
| Our Implementation | | | | | | | |
| BERT_base | 47.29 | 58.21 | 49.07 | 55.92 | 54.75 | 62.75 | 65.19 |
| BERT_base-last2avg | 59.04 | 63.75 | 57.84 | 61.95 | 62.48 | 70.95 | 69.81 |
| BERT_base-flow (NLI) | 58.56 (↓) | 65.44 (↑) | 59.54 (↑) | 64.69 (↑) | 64.66 (↑) | 72.92 (↑) | 71.84 (↑) |
| BERT_base-flow (target) | 70.72 (↑) | 63.11 (↓) | 63.48 (↑) | 72.14 (↑) | 68.42 (↑) | 73.77 (↑) | 75.37 (↑) |
| BERT_large | 46.99 | 53.74 | 46.89 | 53.32 | 49.27 | 56.54 | 61.63 |
| BERT_large-last2avg | 59.56 | 60.22 | 57.68 | 61.37 | 61.02 | 68.04 | 70.32 |
| BERT_large-flow (NLI) | 68.09 (↑) | 64.62 (↑) | 61.72 (↑) | 66.05 (↑) | 66.34 (↑) | 74.87 (↑) | 74.47 (↑) |
| BERT_large-flow (target) | 72.26 (↑) | 62.50 (↑) | 65.20 (↑) | 73.39 (↑) | 69.42 (↑) | 74.92 (↑) | 77.63 (↑) |

Table 2: Experimental results on semantic textual similarity without using NLI supervision. We report the Spearman's rank correlation between the cosine similarity of sentence embeddings and the gold labels on multiple datasets. Numbers are reported as $\rho \times 100$. ↑ denotes outperformance over its BERT baseline and ↓ denotes underperformance. Our proposed BERT-flow method achieves the best scores. Note that our BERT-flow uses -last2avg as the default setting. Rows marked (NLI) use the NLI corpus for the unsupervised training of flow; supervision labels of NLI are NOT visible.

To verify our hypotheses and demonstrate the effectiveness of our proposed method, in this section we present our experimental results for various tasks related to semantic textual similarity under multiple configurations. For the implementation details of our siamese BERT models and flow-based models, please refer to Appendix B.

4.1 Semantic Textual Similarity


We evaluate our approach extensively on the semantic textual similarity (STS) tasks. We report results on 7 datasets, namely the STS benchmark (STS-B) (cer2017semeval), the SICK-Relatedness (SICK-R) dataset (marelli2014sick), and the STS tasks 2012 - 2016 (agirre2012semeval; agirre2013sem; agirre2014semeval; agirre2015semeval; agirre2016semeval). We obtain all these datasets via the SentEval toolkit (conneau2018senteval). These datasets provide a fine-grained gold standard semantic similarity between 0 and 5 for each sentence pair.

Evaluation Procedure.

Following the procedure in previous work like Sentence-BERT reimers2019sentence for the STS task, the prediction of similarity consists of two steps: (1) first, we obtain sentence embeddings for each sentence with a sentence encoder, and (2) then, we compute the cosine similarity between the two embeddings of the input sentence pair as our model-predicted similarity. The reported numbers are the Spearman’s correlation coefficients between the predicted similarity and gold standard similarity scores, which is the same way as in (reimers2019sentence).
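The two-step evaluation can be sketched with the standard library alone. The embedding pairs and gold scores below are hypothetical toy values; in practice the embeddings would come from the sentence encoder, and a library routine such as SciPy's `spearmanr` would normally replace the hand-written rank correlation.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical embedding pairs and gold scores on the 0-5 scale.
pairs = [([1.0, 0.0], [1.0, 0.1]), ([1.0, 0.0], [0.5, 0.5]), ([1.0, 0.0], [-1.0, 0.2])]
gold = [4.8, 2.5, 0.3]
predicted = [cosine(a, b) for a, b in pairs]
rho = spearman(predicted, gold)
```

Because Spearman's correlation depends only on rank order, any monotonic rescaling of the cosine scores leaves the metric unchanged; only the ordering of sentence pairs matters.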

Experimental Details.

We consider both BERT_base and BERT_large in our experiments. Specifically, we use average pooling over BERT context embeddings in the last one or two layers as the sentence embedding, which is found to outperform the [CLS] vector. Interestingly, our preliminary exploration shows that averaging the last two layers of BERT (denoted by -last2avg) consistently produces better results than averaging only the last layer. Therefore, we choose -last2avg as our default configuration when assessing our own approach.

For the proposed method, the flow-based objective (Equation 3.2) is maximized only to update the invertible mapping while the BERT parameters remain unchanged. Our flow models are by default learned over the full target dataset (train + validation + test). We denote this configuration as flow (target). Note that although we use the sentences of the entire target dataset, learning flow does not use any provided labels for training; thus it is a purely unsupervised calibration over the BERT sentence embedding space.

We also test our flow-based model learned on a concatenation of SNLI (snli) and MNLI (mnli) for comparison (flow (NLI)). The concatenated NLI datasets comprise far more sentence pairs (SNLI 570K + MNLI 433K). Note that “flow (NLI)” does not require any supervision label. When fitting flow on NLI corpora, we only use the raw sentences instead of the entailment labels. An intuition behind the flow (NLI) setting is that, compared to Wikipedia sentences (on which BERT is pretrained), the raw sentences of both NLI and STS are simpler and shorter. This means the NLI-STS discrepancy could be relatively smaller than the Wikipedia-STS discrepancy.

We run the experiments in two settings: (1) External labeled data is unavailable. This is the natural setting in which we learn the flow parameters with the unsupervised objective (Equation 3.2) while the BERT parameters are unchanged. (2) We first fine-tune BERT on the SNLI+MNLI textual entailment classification task in a siamese fashion (reimers2019sentence). For BERT-flow, we further learn the flow parameters. This setting allows comparison with the state-of-the-art results that utilize NLI supervision (reimers2019sentence). We denote the two models as BERT-NLI and BERT-NLI-flow respectively.

| Method | STS-B | SICK-R | STS-12 | STS-13 | STS-14 | STS-15 | STS-16 |
|---|---|---|---|---|---|---|---|
| Published in reimers2019sentence | | | | | | | |
| InferSent - GloVe | 68.03 | 65.65 | 52.86 | 66.75 | 62.15 | 72.77 | 66.86 |
| USE | 74.92 | 76.69 | 64.49 | 67.80 | 64.61 | 76.83 | 73.18 |
| SBERT_base-NLI | 77.03 | 72.91 | 70.97 | 76.53 | 73.19 | 79.09 | 74.30 |
| SBERT_large-NLI | 79.23 | 73.75 | 72.27 | 78.46 | 74.90 | 80.99 | 76.25 |
| SRoBERTa_base-NLI | 77.77 | 74.46 | 71.54 | 72.49 | 70.80 | 78.74 | 73.69 |
| SRoBERTa_large-NLI | 79.10 | 74.29 | 74.53 | 77.00 | 73.18 | 81.85 | 76.82 |
| Our Implementation | | | | | | | |
| BERT_base-NLI | 77.08 | 72.62 | 66.23 | 70.22 | 72.15 | 77.35 | 73.91 |
| BERT_base-NLI-last2avg | 78.03 | 74.07 | 68.37 | 72.44 | 73.98 | 79.15 | 75.39 |
| BERT_base-NLI-flow (NLI) | 79.10 (↑) | 78.03 (↑) | 67.75 (↓) | 76.73 (↑) | 75.53 (↑) | 80.63 (↑) | 77.58 (↑) |
| BERT_base-NLI-flow (target) | 81.03 (↑) | 74.97 (↑) | 68.95 (↑) | 78.48 (↑) | 77.62 (↑) | 81.95 (↑) | 78.94 (↑) |
| BERT_large-NLI | 77.80 | 73.44 | 66.87 | 73.91 | 74.04 | 79.14 | 75.35 |
| BERT_large-NLI-last2avg | 78.45 | 74.93 | 68.69 | 75.63 | 75.55 | 80.35 | 76.81 |
| BERT_large-NLI-flow (NLI) | 79.89 (↑) | 77.73 (↑) | 69.61 (↑) | 79.45 (↑) | 77.56 (↑) | 82.48 (↑) | 79.36 (↑) |
| BERT_large-NLI-flow (target) | 81.18 (↑) | 74.52 (↓) | 70.19 (↑) | 80.27 (↑) | 78.85 (↑) | 82.97 (↑) | 80.57 (↑) |

Table 3: Experimental results on semantic textual similarity with NLI supervision. Note that our flows are still learned in an unsupervised way. InferSent (conneau-EtAl:2017:EMNLP2017) is a siamese LSTM trained on NLI, Universal Sentence Encoder (USE) (cer2018universal) replaces the LSTM with a Transformer, and SBERT (reimers2019sentence) further uses BERT. We report the Spearman's rank correlation between the cosine similarity of sentence embeddings and the gold labels on multiple datasets. Numbers are reported as $\rho \times 100$. ↑ denotes outperformance over its BERT baseline and ↓ denotes underperformance. Our proposed BERT-flow (i.e., the “BERT-NLI-flow” in this table) method achieves the best scores. Note that our BERT-flow uses -last2avg as the default setting. Rows marked (NLI) use the NLI corpus for the unsupervised training of flow; supervision labels of NLI are NOT visible.

Results w/o NLI Supervision.

As shown in Table 2, the original BERT sentence embeddings (with both BERT_base and BERT_large) fail to outperform the averaged GloVe embeddings. Averaging the last two layers of the BERT model consistently improves the results. For BERT_base and BERT_large, our proposed flow-based method (BERT-flow (target)) further boosts the performance by 5.88 and 8.16 points on average, respectively. For most of the datasets, learning flows on the target datasets leads to larger performance improvement than learning them on NLI. The only exception is SICK-R, where training flows on NLI is better. We think this is because SICK-R is collected for both entailment and relatedness. Since SNLI and MNLI are also collected for textual entailment evaluation, the distribution discrepancy between SICK-R and NLI may be relatively small. Given also the much larger size of the NLI datasets, it is not surprising that learning flows on NLI results in stronger performance there.

Results w/ NLI Supervision.

Table 3 shows the results with NLI supervision. Similar to the fully unsupervised results above, our isotropic embedding space obtained from the invertible transformation consistently improves the SBERT baselines in most cases, and outperforms the state-of-the-art SBERT/SRoBERTa results by a large margin. A robustness analysis with respect to random seeds is provided in Appendix C.

4.2 Unsupervised Question-Answer Entailment

In addition to the semantic textual similarity tasks, we examine the effectiveness of our method on unsupervised question-answer entailment. We use Question Natural Language Inference (QNLI, wang2018glue), a dataset comprising 110K question-answer pairs (with 5K+ for testing). QNLI extracts the questions as well as their corresponding context sentences from SQuAD (rajpurkar2016squad), and annotates each pair as either entailment or no entailment. In this paper, we further adapt QNLI as an unsupervised task. The similarity between a question and an answer can be predicted by computing the cosine similarity of their sentence embeddings. We then regard entailment as 1 and no entailment as 0, and evaluate the performance of the methods with AUC.
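As a minimal sketch of this evaluation, AUC can be computed directly from its pairwise definition: the probability that a randomly chosen entailment pair receives a higher similarity score than a randomly chosen non-entailment pair. The scores and labels below are hypothetical toy values standing in for cosine similarities and QNLI annotations.

```python
def roc_auc(scores, labels):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical cosine similarities and labels (1 = entailment, 0 = no entailment).
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.3]
toy_labels = [1, 1, 1, 0, 0]
auc = roc_auc(toy_scores, toy_labels)
```

AUC needs no decision threshold, which suits the unsupervised setting: the cosine similarity is used as a raw ranking score and no labeled data is required to calibrate a cutoff. The quadratic loop is fine for a sketch; a rank-based formulation is preferable at the scale of a full validation set.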

As shown in Table 4, our method consistently improves the AUC on the validation set of QNLI. Also, learning flow on the target dataset can produce superior results compared to learning flows on NLI.

| Method | AUC |
|---|---|
| BERT_base-NLI-last2avg | 70.30 |
| BERT_base-NLI-flow (NLI) | 72.52 (↑) |
| BERT_base-NLI-flow (target) | 76.17 (↑) |
| BERT_large-NLI-last2avg | 70.41 |
| BERT_large-NLI-flow (NLI) | 74.19 (↑) |
| BERT_large-NLI-flow (target) | 77.09 (↑) |

Table 4: AUC on QNLI evaluated on the validation set. Rows marked (NLI) use the NLI corpus for the unsupervised training of flow; supervision labels of NLI are NOT visible.

4.3 Comparison with Other Embedding Calibration Baselines

| Method | Correlation |
|---|---|
| BERT_base | 47.29 |
| + SN | 55.46 |
| + NATSV ($k$ = …) | 51.79 |
| + NATSV ($k$ = …) | 60.40 |
| + SN + NATSV ($k$ = …) | 56.02 |
| + SN + NATSV ($k$ = …) | 63.51 |
| BERT_base-flow (target) | 65.62 |

Table 5: Comparing the flow-based method with baselines on STS-B. $k$ is selected among … on the validation set. We report the Spearman's rank correlation ($\rho \times 100$).

In the literature on traditional word embeddings, arora2016simple and mu2017all also discover the anisotropy phenomenon of the embedding space, and they provide several methods to encourage isotropy:

Standard Normalization (SN).

In this approach, we conduct a simple post-processing over the embeddings by computing the mean $\mu$ and standard deviation $\sigma$ of the sentence embeddings $\mathbf{u}$, and normalizing the embeddings by $\frac{\mathbf{u} - \mu}{\sigma}$.
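A minimal sketch of this post-processing follows. It assumes the mean and standard deviation are computed per dimension over the whole set of sentence embeddings and that no dimension is constant (which would make the standard deviation zero); the toy 2-D vectors are hypothetical.

```python
import math

def standard_normalize(embeddings):
    """Per-dimension (u - mean) / std computed over the whole set of sentence embeddings."""
    n, dim = len(embeddings), len(embeddings[0])
    mean = [sum(e[d] for e in embeddings) / n for d in range(dim)]
    std = [math.sqrt(sum((e[d] - mean[d]) ** 2 for e in embeddings) / n) for d in range(dim)]
    return [[(e[d] - mean[d]) / std[d] for d in range(dim)] for e in embeddings]

# Hypothetical toy sentence embeddings with very different scales per dimension.
toy = [[1.0, 10.0], [2.0, 14.0], [3.0, 12.0], [6.0, 20.0]]
normalized = standard_normalize(toy)
```

This rescales each coordinate independently, so it removes a shared offset and equalizes per-dimension variance, but it cannot remove correlations between dimensions.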

Nulling Away Top-$k$ Singular Vectors (NATSV).

mu2017all find that sentence embeddings computed by averaging traditional word embeddings tend to have a fast-decaying singular spectrum. They claim that, by nulling away the top-$k$ singular vectors, the anisotropy of the embeddings can be circumvented and better semantic similarity performance can be achieved.

We compare with these embedding calibration methods on the STS-B dataset, and the results are shown in Table 5. Standard normalization (SN) helps improve the performance, but it falls behind nulling away the top-$k$ singular vectors (NATSV). This means standard normalization cannot fundamentally eliminate the anisotropy. By combining the two methods and carefully tuning $k$ on the validation set, further improvements can be achieved. Nevertheless, our method still produces much better results. We argue that NATSV can help eliminate anisotropy, but it may also discard some useful information contained in the nulled vectors. In contrast, our method directly learns an invertible mapping to an isotropic latent space without discarding any information.

Figure 2: A scatterplot of sentence pairs, where the horizontal axis represents similarity (either gold standard semantic similarity or embedding-induced similarity) and the vertical axis represents edit distance. Sentence pairs with small edit distance are highlighted in green, while the rest of the pairs are colored blue. We can observe that lexically similar sentence pairs tend to be predicted to be similar by BERT embeddings, especially the green pairs. Such correlation is less evident for gold standard labels or flow-induced embeddings.

4.4 Discussion: Semantic Similarity Versus Lexical Similarity

In addition to semantic similarity, we further study the lexical similarity induced by different sentence embeddings. Specifically, we use edit distance as the metric for lexical similarity between a pair of sentences, and focus on the correlations between the sentence similarity and edit distance. Concretely, we compute the cosine similarity of BERT sentence embeddings as well as the edit distance for each sentence pair. Within a dataset consisting of many sentence pairs, we compute the Spearman's correlation coefficient between the similarities and the edit distances, as well as between similarities from different models. We perform this experiment on the STS-B dataset and include the human-annotated gold similarity in the analysis.
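For reference, edit distance can be computed with the standard Levenshtein dynamic program. The sketch below assumes word-level tokens obtained by whitespace splitting, since the granularity of the edit distance is not stated in this section; the same function works on character sequences as well.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

s1 = "I like this restaurant".split()
s2 = "I dislike this restaurant".split()
d = edit_distance(s1, s2)
```

The two example sentences differ by a single word substitution, so they are lexically very close even though their meanings are opposite. A larger edit distance means lower lexical similarity, which is why the correlations between similarity scores and edit distance are negative.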

BERT-Induced Similarity Excessively Correlates with Lexical Similarity.

Similarity                 Edit distance   Gold similarity
Gold similarity            -24.61          100.00
BERT-induced similarity    -50.49          59.30
Flow-induced similarity    -28.01          74.09
Table 6: Spearman’s correlation between various sentence similarities on the validation set of STS-B. We can observe that BERT-induced similarity is highly correlated to edit distance, while the correlation with edit distance is less evident for gold standard or flow-induced similarity.

Table 6 shows that the correlation between BERT-induced similarity and edit distance is very strong ($\rho = -50.49$), whereas the gold standard labels maintain a much weaker correlation with edit distance ($\rho = -24.61$). This phenomenon can also be observed in Figure 2. In particular, for sentence pairs with small edit distance (highlighted in green), BERT-induced similarity is extremely correlated with edit distance. By contrast, gold standard semantic similarity shows no evident correlation with edit distance. In other words, it is often the case that the semantics of a sentence can be dramatically changed by modifying a single word. For example, the sentences “I like this restaurant” and “I dislike this restaurant” differ by only one word but convey opposite meanings. BERT embeddings may fail in such cases. Therefore, we argue that the lexical proximity of BERT sentence embeddings is excessive and can spoil their induced semantic similarity.

Flow-Induced Similarity Exhibits Lower Correlation with Lexical Similarity.

By transforming the original BERT sentence embeddings into the learned isotropic latent space with flow, the embedding-induced similarity not only aligns better with the gold semantic similarity, but also shows a lower correlation with lexical similarity, as presented in the last row of Table 6. The phenomenon is especially evident for the examples with small edit distance (highlighted in green in Figure 2). This demonstrates that our proposed flow-based method can effectively suppress the excessive influence of lexical similarity over the embedding space.

5 Conclusion and Future Work

In this paper, we investigate the deficiency of BERT sentence embeddings on semantic textual similarity, and propose a flow-based calibration that effectively improves performance. In the future, we look forward to exploring representation learning with flow-based generative models from a broader perspective.


The authors would like to thank Jiangtao Feng, Wenxian Shi, Yuxuan Song, and the anonymous reviewers for their helpful comments and suggestions on this paper.


Appendix A Mathematical Formula of the Invertible Mapping

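Generally, a flow-based model is a stacked sequence of many invertible transformation layers: $f = f_1 \circ f_2 \circ \ldots \circ f_K$. Specifically, in our approach, each transformation $f_k$ is an additive coupling layer, which can be mathematically formulated as follows.

$$
\begin{aligned}
\mathbf{h}^{k}_{1:d} &= \mathbf{h}^{k-1}_{1:d}, \\
\mathbf{h}^{k}_{d+1:D} &= \mathbf{h}^{k-1}_{d+1:D} + g_\psi\left(\mathbf{h}^{k-1}_{1:d}\right).
\end{aligned}
$$

Here $g_\psi$ can be parameterized with a deep neural network for the sake of expressiveness.

Its inverse function can be explicitly written as:

$$
\begin{aligned}
\mathbf{h}^{k-1}_{1:d} &= \mathbf{h}^{k}_{1:d}, \\
\mathbf{h}^{k-1}_{d+1:D} &= \mathbf{h}^{k}_{d+1:D} - g_\psi\left(\mathbf{h}^{k}_{1:d}\right).
\end{aligned}
$$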
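To make the invertibility concrete, here is a minimal pure-Python sketch of one additive coupling layer in its standard form: the first d coordinates pass through unchanged, and the remaining coordinates are shifted by a function of the first d. The function `g` is a toy single tanh layer standing in for the deep network mentioned above, and the weights and input are hypothetical values chosen only for illustration.

```python
import math

def g(h1, weights):
    """Toy stand-in for the coupling network: maps R^d to R^(D-d)."""
    return [math.tanh(sum(w * x for w, x in zip(row, h1))) for row in weights]

def coupling_forward(h, d, weights):
    """Keep h[:d]; shift h[d:] by g(h[:d])."""
    h1, h2 = h[:d], h[d:]
    shift = g(h1, weights)
    return h1 + [b + s for b, s in zip(h2, shift)]

def coupling_inverse(h, d, weights):
    """Undo the shift; h[:d] is unchanged, so g can be recomputed exactly."""
    h1, h2 = h[:d], h[d:]
    shift = g(h1, weights)
    return h1 + [b - s for b, s in zip(h2, shift)]

d = 2
weights = [[0.5, -1.0], [0.3, 0.8], [-0.7, 0.2]]   # (D - d) x d, hypothetical
h = [1.0, -2.0, 0.5, 0.0, 3.0]                     # D = 5

out = coupling_forward(h, d, weights)
back = coupling_inverse(out, d, weights)
```

The inverse never needs to invert `g` itself, which is why the coupling network can be arbitrarily expressive. In practice the pass-through half is usually alternated or permuted between layers so that every coordinate gets transformed.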

Appendix B Implementation Details

Throughout our experiments, we adopt the official TensorFlow code of BERT (https://github.com/google-research/bert) as our codebase. Note that we clip the maximum sequence length to 64 to reduce GPU memory usage. For the NLI fine-tuning of siamese BERT, we follow the settings in (reimers2019sentence) (epochs = 1, learning rate = , and batch size = 16). Our results may differ from their published ones. The authors mentioned in https://github.com/UKPLab/sentence-transformers/issues/50 that this is a common phenomenon and might be related to the random seed. Note that their implementation relies on the Transformers repository of Hugging Face (https://github.com/huggingface/transformers). This may also lead to discrepancies in the specific numbers.

Our implementation of flows is adapted from both the official repository of GLOW (https://github.com/openai/glow) and the implementation in the Tensor2Tensor library (https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/research/glow.py). The hyperparameters of our flow models are given in Table 7. On the target datasets, we learn the flow parameters for 1 epoch with learning rate . On NLI datasets, we learn the flow parameters for 0.15 epoch with learning rate . The optimizer that we use is Adam.

In our preliminary experiments on STS-B, we tune the hyperparameters on the dev set of STS-B. Empirically, the performance does not vary much with regard to the architectural hyperparameters compared to the learning schedule. Afterwards, we do not tune the hyperparameters any more when working on the other datasets. Empirically, we find the hyperparameters of flow are not sensitive across the datasets.

Coupling architecture    3-layer CNN with residual connection
Coupling width           32
#levels                  2
Depth                    3
Table 7: Flow hyperparameters.

Appendix C Results with Different Random Seeds

We perform 5 runs with different random seeds in the NLI-supervised setting on STS-B. Results with standard deviation and median are reported in Table 8. Although the variance of NLI fine-tuning is not negligible, our proposed flow-based method consistently leads to improvement.

Method                                   Spearman’s ρ
BERT-NLI-large                           77.26 ± 1.76 (median: 78.19)
BERT-NLI-large-last2avg                  78.07 ± 1.50 (median: 78.68)
BERT-NLI-large-last2avg + flow-target    81.10 ± 0.55 (median: 81.35)
Table 8: Results with different random seeds.
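A summary in the "mean ± standard deviation (median)" format used in Table 8 can be produced with the Python standard library, as sketched below. The scores are hypothetical placeholders, not results from the paper, and the use of the sample standard deviation (`statistics.stdev`) rather than the population one is an assumption.

```python
import statistics

def summarize(scores):
    """Return mean, sample standard deviation, and median of per-seed scores."""
    return {
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),      # sample std (n - 1); an assumption
        "median": statistics.median(scores),
    }

def format_row(method, scores):
    s = summarize(scores)
    return f"{method}  {s['mean']:.2f} ± {s['std']:.2f} (median: {s['median']:.2f})"

# Hypothetical Spearman scores from 5 seeds; placeholders only.
scores = [76.0, 78.0, 79.0, 75.0, 80.0]
row = format_row("some-method", scores)
```

Reporting the median next to the mean is helpful here because a single poor seed can pull the mean down noticeably when only five runs are available.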