The application of deep neural language models (devlin2018bert; peters2018deep; radford2019language; brown2020language) has achieved great success in recent years, since these models create contextualized word representations that are sensitive to the surrounding context. This trend has also stimulated advances in generating semantic representations of longer pieces of text, such as sentences and paragraphs (arora2016simple). However, sentence embeddings have been shown to capture the underlying semantics of sentences poorly (li2020sentence). Previous work (gao2019representation; ethayarajh2019contextual; li2020sentence) suggested that word representations are not isotropic: they are not uniformly distributed with respect to direction. Instead, they occupy a narrow cone in the vector space and are therefore anisotropic. (ethayarajh2019contextual) showed that contextual word embeddings from pre-trained models are so anisotropic that any two word embeddings have, on average, a cosine similarity of 0.99. Further investigation by (li2020sentence) found that the BERT sentence embedding space suffers from two problems: word frequency biases the embedding space, and low-frequency words are dispersed sparsely. Together, these make it difficult to use BERT sentence embeddings directly with simple similarity metrics such as dot product or cosine similarity.
To address the aforementioned problem, (ethayarajh2019contextual) elaborates on the theoretical reasons behind the anisotropy observed in pre-trained models. (gao2019representation) designs a novel way to mitigate the degeneration problem by regularizing the word embedding matrix. A recent attempt named BERT-flow (li2020sentence) proposed to transform the BERT sentence embedding distribution into a smooth and isotropic Gaussian distribution through a normalizing flow (dinh2014nice), which is an invertible function parameterized by neural networks.
Instead of designing a sophisticated method as previous attempts did, in this paper we find that a simple and effective post-processing technique, whitening, is sufficient to tackle the anisotropy problem of sentence embeddings (reimers2019sentence). Specifically, we transform the mean of the sentence vectors to 0 and their covariance matrix to the identity matrix. In addition, we introduce a dimensionality reduction strategy alongside the whitening operation to further improve the effectiveness of our approach.
The experimental results on 7 standard semantic textual similarity benchmark datasets show that our method generally improves model performance and achieves state-of-the-art results on most of the datasets. Meanwhile, by adding the dimensionality reduction operation, our approach further boosts model performance while also reducing memory usage and accelerating retrieval.
The main contributions of this paper are summarized as follows:
We explore the reason for the poor performance of BERT-based sentence embeddings in similarity matching tasks, i.e., they are not expressed in a standard orthogonal basis.
A whitening post-processing method is proposed to transform BERT-based sentence embeddings to a standard orthogonal basis while reducing their size.
Experimental results on seven semantic textual similarity tasks demonstrate that our method not only improves model performance significantly, but also reduces vector size.
2 Related Work
Early attempts at tackling the anisotropy problem appeared in specific NLP contexts. (arora2016simple) first computed the sentence representations for the entire semantic textual similarity dataset, then extracted the top direction from those sentence representations, and finally projected each sentence representation away from it. The rationale is that the top direction inherently encodes the common information across the entire dataset. (mu2017all) proposed a post-processing operation on dense low-dimensional representations with both positive and negative entries: they eliminate the common mean vector and a few top dominating directions from the word vectors, which renders off-the-shelf representations even stronger. (gao2019representation) proposed a novel regularization method to address the anisotropy problem in training natural language generation models: they mitigate the degeneration problem by regularizing the word embedding matrix. Observing that the word embeddings are restricted to a narrow cone, the proposed approach directly increases the aperture of the cone, which can be achieved simply by decreasing the similarity between individual word embeddings. (ethayarajh2019contextual) investigated the inner mechanism of contextualized word representations. They found that upper layers of ELMo, BERT, and GPT-2 produce more context-specific representations than lower layers, and that this increased context-specificity is always accompanied by increased anisotropy. Following up on the work of (ethayarajh2019contextual), (li2020sentence) proposed BERT-flow, which transforms the anisotropic sentence embedding distribution into a smooth and isotropic Gaussian distribution through normalizing flows that are learned with an unsupervised objective.
When it comes to state-of-the-art sentence embedding methods, previous work (conneau2017supervised; cer2017semeval) found that the SNLI dataset is suitable for training sentence embeddings, and (yang2018learning; cer2018universal) proposed the so-called Universal Sentence Encoder, which trains a transformer network and augments unsupervised learning with training on the SNLI dataset. In the era of pre-trained methods, (humeau2019real) addressed the run-time overhead of the BERT cross-encoder and presented a method (poly-encoders) to compute a score between context vectors and pre-computed candidate embeddings using attention. Sentence-BERT (reimers2019sentence) is a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity.
3 Our Approach
Sentence embeddings should intuitively reflect the semantic similarity between sentences. When we retrieve semantically similar sentences, we generally encode the raw sentences into sentence representations and then calculate the cosine of the angle between them for comparison or ranking (rahutomo2012semantic). Therefore, a thought-provoking question comes up: what assumptions does cosine similarity make about the input vectors? In other words, under what preconditions is it appropriate to compare vectors by cosine similarity?
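We answer this question by studying the geometry of cosine similarity. Geometrically, given two vectors $x \in \mathbb{R}^{d}$ and $y \in \mathbb{R}^{d}$, the inner product of $x$ and $y$ is the product of their Euclidean magnitudes and the cosine of the angle between them. Accordingly, the cosine similarity is the inner product of $x$ and $y$ divided by their norms:
$$\cos(x, y) = \frac{\sum_{i=1}^{d} x_i y_i}{\sqrt{\sum_{i=1}^{d} x_i^{2}}\,\sqrt{\sum_{i=1}^{d} y_i^{2}}} \qquad (1)$$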
However, equation 1 above holds only when the coordinate basis is the Standard Orthogonal Basis. The cosine of the angle has a distinct geometric meaning, but equation 1 is operation-based and depends on the selected coordinate basis. Therefore, the coordinate formula of the inner product varies with the coordinate basis, and the coordinate formula of the cosine value changes accordingly.
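The basis dependence can be illustrated with a small numerical sketch. It is not part of the original experiments: the vectors and the non-orthonormal basis `B` below are arbitrary values chosen for illustration. The sketch compares the geometric cosine with the value obtained by applying equation 1 directly to coordinates taken in a skewed basis, and shows that the geometric value is recovered once the Gram matrix of the basis is taken into account.

```python
import numpy as np

def cosine(a, b):
    """Coordinate formula of equation 1: inner product divided by the norms."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two vectors expressed in the standard orthonormal basis (arbitrary example values).
x = np.array([1.0, 0.0])
y = np.array([1.0, 1.0])

# Hypothetical non-orthonormal basis: the columns of B are the basis vectors.
B = np.array([[1.0, 1.0],
              [0.0, 1.0]])

# Coordinates of the same two vectors with respect to B (solve B @ c = v).
x_coords = np.linalg.solve(B, x)
y_coords = np.linalg.solve(B, y)

true_cos = cosine(x, y)                  # geometric cosine of the angle
naive_cos = cosine(x_coords, y_coords)   # equation 1 applied to the skewed coordinates

# Using the Gram matrix of the basis restores the geometric value.
G = B.T @ B
gram_cos = float(
    x_coords @ G @ y_coords
    / np.sqrt((x_coords @ G @ x_coords) * (y_coords @ G @ y_coords))
)

print(true_cos, naive_cos, gram_cos)
```

In this toy case the two vectors are 45 degrees apart, yet equation 1 applied to their coordinates in `B` reports them as orthogonal.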
(li2020sentence) verified that sentence embeddings from BERT (devlin2018bert) contain sufficient semantic information, although it is not exploited properly. In this case, if the sentence embeddings perform poorly when equation 1 is used to calculate the cosine value of semantic similarity, the reason may be that the coordinate basis to which the sentence vectors belong is not the Standard Orthogonal Basis. From a statistical point of view, when we choose a basis for a set of vectors, we should ensure that each basis vector is independent and uniform. If this basis is the Standard Orthogonal Basis, then the corresponding set of vectors should show isotropy.
To summarize, the above heuristic hypothesis suggests the following: if a set of vectors is isotropic, we can assume it is expressed in the Standard Orthogonal Basis, which in turn indicates that we can calculate the cosine similarity via equation 1. Otherwise, if it is anisotropic, we need to transform the original sentence embeddings so that they become isotropic, and then use equation 1 to calculate the cosine similarity.
3.2 Whitening Transformation
Previous work (li2020sentence) addresses the hypothesis in section 3.1 by adopting a flow-based approach. We find that the whitening operation, which is commonly adopted in machine learning, can also achieve comparable gains.
We know that, for the standard normal distribution, the mean is 0 and the covariance matrix is the identity matrix. Thus, our goal is to transform the mean of the sentence vectors into 0 and their covariance matrix into the identity matrix. Suppose we have a set of sentence embeddings, which can also be written as a set of row vectors $\{x_i\}_{i=1}^{N}$. We then carry out the linear transformation in equation 2 such that the mean of $\{\tilde{x}_i\}_{i=1}^{N}$ is 0 and its covariance matrix is the identity matrix:
$$\tilde{x}_i = (x_i - \mu) W \qquad (2)$$
Equation 2 corresponds to the whitening operation in machine learning (christiansen2010data). For the mean to equal 0, we only need to set:
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad (3)$$
The most difficult part is solving for the matrix $W$. To do so, we denote the original covariance matrix of $\{x_i\}_{i=1}^{N}$ as:
$$\Sigma = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^{\top}(x_i - \mu) \qquad (4)$$
Then we obtain the transformed covariance matrix $\tilde{\Sigma}$:
$$\tilde{\Sigma} = W^{\top} \Sigma W \qquad (5)$$
Since we require the new covariance matrix to be the identity matrix, we need to solve equation 6 below:
$$W^{\top} \Sigma W = I \;\Rightarrow\; \Sigma = (W^{\top})^{-1} W^{-1} = (W^{-1})^{\top} W^{-1} \qquad (6)$$
The covariance matrix $\Sigma$ is a positive definite symmetric matrix, and a positive definite symmetric matrix admits the following form of SVD decomposition (golub1971singular):
$$\Sigma = U \Lambda U^{\top} \qquad (7)$$
where $U$ is an orthogonal matrix and $\Lambda$ is a diagonal matrix whose diagonal elements are all positive. Therefore, letting $W^{-1} = \sqrt{\Lambda}\, U^{\top}$, we obtain the solution:
$$W = U \sqrt{\Lambda^{-1}} \qquad (8)$$
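The derivation above can be checked numerically with a minimal NumPy sketch. The function name `compute_whitening` and the synthetic data are assumptions made for illustration; they are not taken from the paper's implementation. The sketch computes the mean and covariance, solves for $W$ from the SVD of the covariance matrix, and verifies that the transformed vectors have zero mean and identity covariance.

```python
import numpy as np

def compute_whitening(embeddings):
    """Return (mu, W) such that (embeddings - mu) @ W has zero mean and identity covariance."""
    mu = embeddings.mean(axis=0, keepdims=True)           # mean of the row vectors
    centered = embeddings - mu
    sigma = centered.T @ centered / embeddings.shape[0]   # covariance matrix (divide by N)
    u, s, _ = np.linalg.svd(sigma)                        # sigma = U diag(s) U^T
    W = u @ np.diag(1.0 / np.sqrt(s))                     # W = U sqrt(Lambda^{-1})
    return mu, W

rng = np.random.default_rng(0)
# Synthetic anisotropic "sentence embeddings": 1000 vectors of dimension 8 (illustrative only).
mixing = rng.normal(size=(8, 8))
X = rng.normal(size=(1000, 8)) @ mixing + 5.0

mu, W = compute_whitening(X)
X_tilde = (X - mu) @ W                                    # whitened vectors
cov_tilde = X_tilde.T @ X_tilde / X.shape[0]

print(np.round(cov_tilde, 3))                             # numerically close to the identity
```

The sketch assumes the covariance matrix is strictly positive definite; if some singular values are (near) zero, the inverse square root is ill-conditioned and those directions would need to be dropped or regularized.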
3.3 Dimensionality Reduction
So far, we have seen that the original covariance matrix $\Sigma$ of the sentence embeddings can be converted into an identity matrix by the transformation matrix $W = U\sqrt{\Lambda^{-1}}$. Here, the orthogonal matrix $U$ is a distance-preserving transformation: it does not change the relative distribution of the data as a whole, but it transforms the original covariance matrix $\Sigma$ into the diagonal matrix $\Lambda$.
Each diagonal element of the diagonal matrix $\Lambda$ measures the variation of the data along the corresponding dimension. If its value is small, the variation along this dimension is also small and insignificant, even close to constant. This suggests that the original sentence vectors may effectively lie in a lower-dimensional space, so we can remove such dimensions through dimensionality reduction. Doing so makes the cosine similarity results more reasonable and naturally accelerates vector retrieval, whose cost is directly proportional to the dimensionality.
In fact, the elements of the diagonal matrix $\Lambda$ derived from the Singular Value Decomposition (golub1971singular) are already sorted in descending order. Therefore, we only need to retain the first $k$ columns of $W$ to achieve this dimensionality reduction, which is theoretically equivalent to Principal Component Analysis (abdi2010principal). Here, $k$ is an empirical hyperparameter. We refer to the entire transformation workflow as Whitening-$k$, whose detailed implementation is shown in Algorithm 1.
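As a rough sketch of the Whitening-$k$ workflow described in this section (not a reproduction of Algorithm 1), the following NumPy code fits the whitening parameters, keeps only the first $k$ columns of $W$, and compares transformed vectors by cosine similarity. The function names `fit_whitening_k`, `apply_whitening`, and `cosine_similarity`, as well as the synthetic corpus, are hypothetical and introduced only for illustration.

```python
import numpy as np

def fit_whitening_k(embeddings, k):
    """Fit whitening on `embeddings` and keep only the first k output dimensions."""
    mu = embeddings.mean(axis=0, keepdims=True)
    sigma = np.cov(embeddings.T, bias=True)        # population covariance (divide by N)
    u, s, _ = np.linalg.svd(sigma)                 # singular values are returned in descending order
    W = (u @ np.diag(1.0 / np.sqrt(s)))[:, :k]     # retain the first k columns of W
    return mu, W

def apply_whitening(embeddings, mu, W):
    """Center the vectors and project them with the (truncated) whitening matrix."""
    return (embeddings - mu) @ W

def cosine_similarity(a, b):
    """Row-wise cosine similarity between two equally shaped matrices."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a * b).sum(axis=1)

rng = np.random.default_rng(0)
# Synthetic stand-in for a corpus of 500 sentence vectors of dimension 16.
corpus = rng.normal(size=(500, 16)) @ rng.normal(size=(16, 16))

mu, W = fit_whitening_k(corpus, k=4)
reduced = apply_whitening(corpus, mu, W)           # shape (500, 4)
similarities = cosine_similarity(reduced[:10], reduced[10:20])
print(reduced.shape, similarities.round(3))
```

Because the retained directions are the ones with the largest variance, the truncated output still has identity covariance within the kept $k$ dimensions.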
3.4 Complexity Analysis
In terms of computational efficiency on massive corpora, the mean and the covariance matrix can be calculated recursively. To be more specific, all that the algorithm in section 3.2 needs are the mean vector $\mu \in \mathbb{R}^{d}$ and the covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$ (where $d$ is the dimension of the word embedding) of the entire set of sentence vectors $\{x_i\}_{i=1}^{N}$. Therefore, given a new sentence vector $x_{n+1}$, the mean can be calculated as:
$$\mu_{n+1} = \frac{n}{n+1}\,\mu_n + \frac{1}{n+1}\,x_{n+1} \qquad (9)$$
Similarly, the covariance matrix is the expectation of $(x_i - \mu)^{\top}(x_i - \mu)$, and thus it can be calculated as:
$$\Sigma_{n+1} = \frac{n}{n+1}\,\Sigma_n + \frac{1}{n+1}\,(x_{n+1} - \mu)^{\top}(x_{n+1} - \mu) \qquad (10)$$
Therefore, we can conclude that the space complexities of $\mu$ and $\Sigma$ are both $O(1)$ with respect to the number of sentences $N$, and the time complexities are $O(N)$, which indicates that the efficiency of our algorithm is theoretically optimal. It is reasonable to infer that the algorithm in section 3.2 can obtain the covariance matrix $\Sigma$ and the mean $\mu$ with limited memory even on large-scale corpora.
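A single-pass version of this recursion can be sketched as follows. Note that this is a variation rather than a literal transcription of the update formulas: it is an assumption of this sketch to track the running mean together with the running second moment $E[x^{\top}x]$ and to recover the covariance as $E[x^{\top}x] - \mu^{\top}\mu$, which avoids needing the final mean in advance. The class name `RunningMoments` is hypothetical.

```python
import numpy as np

class RunningMoments:
    """Streaming mean and covariance using memory that is independent of the number of vectors."""

    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)
        self.second_moment = np.zeros((dim, dim))   # running estimate of E[x^T x]

    def update(self, x):
        """Fold one new vector into the running statistics."""
        self.n += 1
        weight = 1.0 / self.n
        # Recursive average: new = (n-1)/n * old + 1/n * contribution of x
        self.mean = (1.0 - weight) * self.mean + weight * x
        self.second_moment = (1.0 - weight) * self.second_moment + weight * np.outer(x, x)

    def covariance(self):
        """Population covariance: E[x^T x] - mu^T mu."""
        return self.second_moment - np.outer(self.mean, self.mean)

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 5))                    # synthetic vectors, illustrative only

moments = RunningMoments(dim=5)
for vector in data:                                 # one pass over the data
    moments.update(vector)

batch_mean = data.mean(axis=0)
batch_cov = np.cov(data.T, bias=True)
print(np.abs(moments.mean - batch_mean).max(), np.abs(moments.covariance() - batch_cov).max())
```

The streaming statistics agree with the batch computation up to floating-point error, while only a $d$-dimensional vector and a $d \times d$ matrix are kept in memory.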
To evaluate the effectiveness of the proposed approach, we present experimental results on various semantic textual similarity (STS) tasks under multiple configurations. In the following sections, we first introduce the benchmark datasets in section 4.1 and our detailed experimental settings in section 4.2. Then, we present our experimental results and in-depth analysis in section 4.3. Furthermore, we evaluate the effect of dimensionality reduction under different dimensionality settings in section 4.4.
|Method||STS-B||STS-12||STS-13||STS-14||STS-15||STS-16||SICK-R|
|Published in (reimers2019sentence)|
|Avg. GloVe embeddings||58.02||55.14||70.66||59.73||68.25||63.66||53.76|
|Avg. BERT embeddings||46.35||38.78||57.98||57.98||63.15||61.06||58.40|
|Published in (li2020sentence)|
|Published in (reimers2019sentence)|
|InferSent - GloVe||68.03||52.86||66.75||62.15||72.77||66.86||65.65|
|Published in (li2020sentence)|
We compare model performance with baselines on STS tasks without any task-specific training data, as (reimers2019sentence) does. Seven datasets, namely the STS 2012-2016 tasks (agirre2012semeval; agirre2013sem; agirre2014semeval; agirre2015semeval; agirre2016semeval), the STS benchmark (cer2017semeval), and the SICK-Relatedness dataset (marelli2014sick), are adopted as our evaluation benchmarks. For each sentence pair, these datasets provide a gold semantic similarity score ranging from 0 to 5. We adopt Spearman's rank correlation between the cosine similarity of the sentence embeddings and the gold labels, since (reimers2019sentence) suggested it is the most reasonable metric for STS tasks. The evaluation procedure is kept the same as in (li2020sentence): we first encode each raw sentence into a sentence embedding, then calculate the cosine similarity between the embeddings of each input sentence pair as our predicted similarity score.
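The evaluation protocol described above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the helper names (`average_ranks`, `spearman`, `cosine_pairs`) and the toy embeddings and gold scores are assumptions, and Spearman's correlation is computed here as the Pearson correlation of average ranks rather than through any particular evaluation toolkit.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing the average of their ranks."""
    values = np.asarray(values, dtype=float).ravel()
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]

def spearman(a, b):
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])

def cosine_pairs(a, b):
    """Cosine similarity between corresponding rows of a and b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a * b).sum(axis=1)

# Toy example: four sentence pairs with hypothetical 2-d embeddings and gold scores in [0, 5].
emb_a = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
emb_b = np.array([[0.0, 1.0], [1.0, 1.0], [1.0, 2.0], [2.0, 0.0]])
gold = np.array([0.0, 2.0, 3.0, 5.0])

predicted = cosine_pairs(emb_a, emb_b)   # predicted similarity score for each pair
score = spearman(predicted, gold)
print(round(score, 4))
```

In this toy example the predicted ranking swaps the two middle pairs relative to the gold ranking, which yields a Spearman correlation of 0.8.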
4.2 Experimental Settings and Baselines
We compare performance with the following baselines. In unsupervised STS, Avg. GloVe embeddings denotes that we adopt averaged GloVe (pennington2014glove) word embeddings as the sentence embedding. Similarly, Avg. BERT embeddings and BERT CLS-vector denote raw BERT (devlin2018bert) without and with the CLS-token output, respectively. In supervised STS, USE denotes the Universal Sentence Encoder (cer2018universal), which replaces the LSTM with a Transformer, while SBERT-NLI and SRoBERTa-NLI correspond to the BERT and RoBERTa (liu2019roberta) models trained on a combined NLI dataset (consisting of SNLI (bowman2015large) and MNLI (williams2017broad)) with the Sentence-BERT training approach (reimers2019sentence).
Since BERT-flow(NLI/target) is the primary baseline we compare against, we largely follow its experimental settings and notation. Concretely, we also use both $\text{BERT}_{\text{base}}$ and $\text{BERT}_{\text{large}}$ in our experiments. We choose -first-last-avg (marked as -last2avg in (li2020sentence), although it is actually -first-last-avg in its source code) as our default configuration, since averaging the first and the last layers of BERT stably achieves better performance than averaging only the last layer. Similar to (li2020sentence), we use the full target dataset (including all sentences in the train, development, and test sets, and excluding all labels) to calculate the whitening parameters $W$ and $\mu$ through the unsupervised approach described in Section 3.2. These models are denoted as -whitening(target). Furthermore, -whitening(NLI) denotes that the whitening parameters are obtained on the NLI corpus. -whitening-256(target/NLI) and -whitening-384(target/NLI) indicate that, through our whitening method, the output embedding size is reduced to 256 and 384, respectively.
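The -first-last-avg pooling can be sketched as below. This is a hypothetical illustration, not the authors' code: it assumes the per-layer token outputs are already available as a NumPy array `hidden_states` of shape (layers, batch, tokens, dim), and it simply treats the first and last entries of that array as the "first" and "last" layers. Which layer counts as the first one in a given BERT implementation is left open here.

```python
import numpy as np

def first_last_avg(hidden_states, attention_mask):
    """Average the masked mean-pooled outputs of the first and last layers.

    hidden_states: array of shape (layers, batch, tokens, dim)
    attention_mask: array of shape (batch, tokens), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[:, :, None].astype(float)

    def masked_mean(layer):
        # Mean over real tokens only; padding positions contribute nothing.
        return (layer * mask).sum(axis=1) / mask.sum(axis=1)

    return 0.5 * (masked_mean(hidden_states[0]) + masked_mean(hidden_states[-1]))

# Toy input: 3 layers, 1 sentence, 3 token positions (the last is padding), dimension 2.
hidden_states = np.array([
    [[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]],   # first layer
    [[[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]],       # middle layer (ignored)
    [[[2.0, 2.0], [6.0, 6.0], [-50.0, -50.0]]],   # last layer
])
attention_mask = np.array([[1, 1, 0]])

sentence_embedding = first_last_avg(hidden_states, attention_mask)
print(sentence_embedding)
```

The resulting sentence vectors would then be passed to the whitening transformation before computing cosine similarities.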
Without supervision of NLI.
As shown in Table 1, the raw BERT and GloVe sentence embeddings unsurprisingly obtain the worst performance on these datasets. Under the $\text{BERT}_{\text{base}}$ settings, our approach consistently outperforms BERT-flow and achieves state-of-the-art results with a sentence embedding dimensionality of 256 on the STS-B, STS-12, STS-13, STS-14, and STS-15 datasets. When we switch to $\text{BERT}_{\text{large}}$, better results are achieved if the dimensionality of the sentence embedding is set to 384. Our approach still obtains competitive results on most of the datasets compared to BERT-flow, and achieves state-of-the-art results by roughly 1 point on the STS-B, STS-13, and STS-14 datasets.
With supervision of NLI.
In Table 2, $\text{SBERT}_{\text{base}}$ and $\text{SBERT}_{\text{large}}$ are trained on the NLI dataset with supervised labels through the approach in (reimers2019sentence). It can be observed that our whitened $\text{SBERT}_{\text{base}}$ outperforms the corresponding baseline on the STS-13, STS-14, STS-15, and STS-16 tasks, and our whitened $\text{SBERT}_{\text{large}}$ obtains better results on the STS-B, STS-14, STS-15, and STS-16 tasks. These experimental results show that our whitening method can further improve the performance of SBERT, even though it has already been trained under the supervision of the NLI dataset.
4.4 Effect of Dimensionality
Dimensionality reduction is a crucial feature, because reducing the vector size brings a smaller memory footprint and faster retrieval for downstream vector search engines. The dimensionality $k$ is a hyperparameter specifying the number of retained dimensions of the sentence embeddings, and it can affect model performance by a large margin. Therefore, we carry out an experiment to test how the Spearman's correlation coefficient of the model varies with the dimensionality $k$. Figure 1 presents the variation curves of model performance under $\text{BERT}_{\text{base}}$ and $\text{BERT}_{\text{large}}$ embeddings. For most tasks, reducing the dimension of the sentence vector to one third of its original size is a relatively optimal solution, at which performance lies near the edge of the increasing portion of the curve.
In the SICK-R results in Table 1, although our $\text{BERT}_{\text{base}}$-whitening-256 is not as effective as $\text{BERT}_{\text{base}}$-flow, our model has a competitive advantage, namely the smaller embedding size (256 vs. 768). Furthermore, as presented in Figure 1(a), the correlation score of our $\text{BERT}_{\text{base}}$-whitening rises to 66.52 when the embedding size is set to 109, which outperforms $\text{BERT}_{\text{base}}$-flow by 1.08 points. Besides, other tasks can also achieve better performance by choosing $k$ carefully.
In this work, we explore an alternative approach to alleviating the anisotropy problem of sentence embeddings. Our approach is based on the whitening operation in machine learning, and experimental results on 7 semantic similarity benchmark datasets indicate that our method is simple but effective. Besides, we also find that introducing a dimensionality reduction operation can further boost model performance while naturally reducing memory usage and accelerating retrieval.