PromptBERT: Improving BERT Sentence Embeddings with Prompts

by Ting Jiang, et al.
Beihang University

The poor performance of the original BERT on sentence semantic similarity has been widely discussed in previous works. We find that the unsatisfactory performance is mainly due to static token embedding biases and ineffective BERT layers, rather than the high cosine similarity of the sentence embeddings. To this end, we propose a prompt based sentence embeddings method that reduces token embedding biases and makes the original BERT layers more effective. By reformulating the sentence embeddings task as a fill-in-the-blanks problem, our method significantly improves the performance of the original BERT. We discuss two prompt representing methods and three prompt searching methods for prompt based sentence embeddings. Moreover, we propose a novel unsupervised training objective based on template denoising, which substantially narrows the performance gap between the supervised and unsupervised settings. For experiments, we evaluate our method in both non fine-tuned and fine-tuned settings. Even the non fine-tuned method can outperform fine-tuned methods such as unsupervised ConSERT on STS tasks. Our fine-tuned method outperforms the state-of-the-art method SimCSE in both unsupervised and supervised settings. Compared to SimCSE, we achieve improvements of 2.29 and 2.58 points on BERT and RoBERTa respectively under the unsupervised setting.


1 Introduction

In recent years, we have witnessed the success of pre-trained language models like BERT devlin2018bert and RoBERTa liu2019roberta in sentence embeddings gao2021simcse; yan2021consert. However, the original BERT still shows poor performance in sentence embeddings reimers2019sentence; li2020sentence. The most commonly cited example is that it underperforms traditional word embedding methods like GloVe pennington2014glove.

Previous research has pointed to anisotropy to explain the poor performance of the original BERT li2020sentence; yan2021consert; gao2021simcse. Anisotropy makes the token embeddings occupy a narrow cone, resulting in a high similarity between any sentence pair li2020sentence. li2020sentence (li2020sentence) proposed a normalizing flows method to transform the sentence embeddings distribution into a smooth and isotropic Gaussian distribution, and yan2021consert (yan2021consert) presented a contrastive framework to transfer sentence representation. The goal of these methods is to eliminate anisotropy in sentence embeddings. However, we find that anisotropy is not the primary cause of poor semantic similarity. For example, averaging the last layer of the original BERT is even worse than averaging its static token embeddings on the semantic textual similarity task, even though the sentence embeddings from the last layer are less anisotropic than those from static token embeddings.

Following this result, we find that the original BERT layers (we denote the transformer blocks in BERT as BERT layers) actually damage the quality of sentence embeddings. However, if we treat static token embeddings (we denote the BERT token embeddings as static token embeddings) as word embeddings, they still yield unsatisfactory results compared to GloVe. Inspired by li2020sentence, who found that token frequency biases the embedding distribution, we find that the distribution is biased not only by frequency but also by case and by subword segmentation in WordPiece wu2016google. We design a simple experiment to test this conjecture: we remove the biased tokens (e.g., high frequency subwords and punctuation) and use the average of the remaining token embeddings as the sentence representation. This can outperform GloVe and even achieve results comparable to the post-processing methods BERT-flow li2020sentence and BERT-whitening su2021whitening.

These findings suggest that avoiding embedding bias can improve the performance of sentence representations. However, manually removing embedding biases is labor-intensive, and it may omit some meaningful words if the sentence is too short. Inspired by brown2020language, which reformulated different NLP tasks as fill-in-the-blanks problems with different prompts, we propose a prompt based method that uses a template to obtain sentence representations from BERT. The prompt based method can avoid embedding bias and make use of the original BERT layers. We find that the original BERT can achieve reasonable performance in sentence embeddings with the help of the template, and that it even outperforms some BERT based methods that fine-tune BERT on downstream tasks.

Our approach is equally applicable to the fine-tuned setting. Current methods use contrastive learning to help BERT learn better sentence embeddings gao2021simcse; yan2021consert. However, the unsupervised methods still suffer from lacking proper positive pairs. yan2021consert (yan2021consert) discuss four data augmentation methods, but their performance seems worse than directly using the dropout in BERT as noise gao2021simcse. We find that prompts provide a better way to generate positive pairs, since different templates give different viewpoints on the same sentence. To this end, we propose a prompt based contrastive learning method with template denoising to leverage the power of BERT in an unsupervised setting, which significantly narrows the gap between supervised and unsupervised performance. Our method achieves state-of-the-art results in both unsupervised and supervised settings.

2 Related Work

Learning sentence embeddings is a fundamental NLP problem that has been studied extensively. Currently, how to leverage the power of BERT in sentence embeddings has become a new trend. Many works li2020sentence; gao2021simcse have achieved strong performance with BERT in both supervised and unsupervised settings. Among these works, contrastive learning based methods achieve the state-of-the-art results. These works gao2021simcse; yan2021consert focus on constructing positive sentence pairs. gao2021simcse (gao2021simcse) proposed a novel contrastive training objective that directly uses inner dropout as noise to construct positive pairs. yan2021consert (yan2021consert) discuss four methods to construct positive pairs.

Although BERT has achieved great success in sentence embeddings, the original BERT shows unsatisfactory performance. Contextual token embeddings from the original BERT even underperform word embeddings like GloVe. One explanation is the anisotropy of the original BERT, which causes sentence pairs to have high similarity. Following this explanation, BERT-flow li2020sentence and BERT-whitening su2021whitening have been proposed to reduce the anisotropy by post-processing the sentence embeddings from the original BERT.

3 Rethinking the Sentence Embeddings of Original BERT

Previous works yan2021consert; gao2021simcse explained that the performance of the original BERT is limited by the learned anisotropic token embedding space, where the token embeddings occupy a narrow cone. However, by examining the relationship between anisotropy and performance, we find that anisotropy is not a key factor in inducing poor semantic similarity. We think the main reasons are the ineffective BERT layers and static token embedding biases.

Observation 1: Original BERT layers fail to improve the performance. In this section, we analyze the influence of BERT layers by comparing the two sentence embeddings methods: averaging static token embeddings (input of the BERT layers) and averaging last layer (output of the BERT layers). We report the sentence embeddings performance and its sentence level anisotropy.
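The two sentence embedding methods compared here differ only in which token vectors are averaged. The following is a minimal sketch under stated assumptions, not the exact implementation used in the experiments: `embedding_table` stands for the static token embedding matrix, `last_hidden_state` for the output of the final BERT layer, and `attention_mask` marks non-padding positions. All three names are illustrative.

```python
import numpy as np

def masked_mean(token_vectors, attention_mask):
    """Average token vectors over the non-padding positions of each sentence.

    token_vectors:  array of shape (batch, seq_len, dim)
    attention_mask: array of shape (batch, seq_len), 1 for real tokens, 0 for padding
    """
    vectors = np.asarray(token_vectors, dtype=np.float64)
    mask = np.asarray(attention_mask, dtype=np.float64)[..., None]
    summed = (vectors * mask).sum(axis=1)
    # Guard against sentences with no real tokens.
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts

def static_embedding_avg(embedding_table, input_ids, attention_mask):
    """Sentence embedding from the input of the BERT layers (static lookup only)."""
    return masked_mean(np.asarray(embedding_table)[input_ids], attention_mask)

def last_layer_avg(last_hidden_state, attention_mask):
    """Sentence embedding from the output of the BERT layers."""
    return masked_mean(last_hidden_state, attention_mask)
```

Under this sketch, any gap between the two methods comes from the BERT layers alone, since the pooling step is identical.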

To measure the anisotropy, we follow ethayarajh2019contextual and measure the sentence level anisotropy of sentence embeddings. Let $s_i$ be a sentence that appears in corpus $\{s_1, \ldots, s_n\}$. The anisotropy can be measured as follows:

$$\frac{1}{n^2 - n} \left| \sum_{i} \sum_{j \neq i} \cos\big(M(s_i), M(s_j)\big) \right|$$

where $M$ denotes the sentence embeddings method, which maps the raw sentence to its embedding, and $\cos$ is the cosine similarity. In other words, the anisotropy of $M$ is measured by the average cosine similarity of a set of sentences. If sentence embeddings were isotropic (i.e., directionally uniform), then the average cosine similarity between uniformly randomly sampled sentences would be 0 arora2016simple. The closer it is to 1, the more anisotropic the sentence embeddings. We randomly sample 100,000 sentences from the Wikipedia corpus to compute the anisotropy.
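The average pairwise cosine similarity described above can be computed without building the full similarity matrix, which matters for a sample of 100,000 sentences. The sketch below is one possible implementation, not necessarily the one used for the reported numbers. It relies on the identity that the sum of all pairwise dot products of unit vectors equals the squared norm of their sum. The function name is illustrative, and taking the absolute value of the mean is an assumed convention.

```python
import numpy as np

def sentence_anisotropy(embeddings):
    """Absolute mean cosine similarity over all ordered pairs (i, j) with i != j.

    embeddings: array of shape (n, dim), one row per sentence embedding.
    """
    x = np.asarray(embeddings, dtype=np.float64)
    n = x.shape[0]
    if n < 2:
        raise ValueError("need at least two embeddings")
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)
    total = unit.sum(axis=0)
    # ||sum_i u_i||^2 counts every pair including i == j, so subtract the diagonal.
    off_diagonal = float(total @ total) - float((unit * unit).sum())
    # Absolute value keeps the score non-negative (assumed convention).
    return abs(off_diagonal) / (n * n - n)
```

Called on an array of sentence embeddings, the function returns a value near 0 for directionally uniform embeddings and near 1 when all embeddings point in almost the same direction.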

We compare different pre-trained models (bert-base-uncased, bert-base-cased and roberta-base) and different sentence embeddings methods: last layer average, which averages the tokens of the last hidden layer as the sentence embedding, and static token embeddings average, which directly averages the static token embeddings. We report the Spearman correlation and the sentence level anisotropy of these methods in Table 1.

Pre-trained model               Correlation   Sentence anisotropy
Static token embeddings avg.
  bert-base-uncased             56.02         0.8250
  bert-base-cased               56.65         0.5755
  roberta-base                  55.88         0.5693
Last layer avg.
  bert-base-uncased             52.57         0.4874
  bert-base-cased               56.93         0.7514
  roberta-base                  53.49         0.9554
Table 1: The Spearman correlation and sentence anisotropy of last layer average and static token embeddings average. The Spearman correlation is the average of the correlations on STS12-16, STS-B and SICK.

As Table 1 shows, we find the BERT layers in bert-base-uncased and roberta-base significantly harm the sentence embeddings performance. Even in bert-base-cased, the gain of BERT layers is trivial with only 0.28 improvement. We also show the sentence level anisotropy of each method. The performance degradation of the BERT layers seems not to be related to the sentence level anisotropy. For example, the last layer average is more isotropic than the static token embeddings average in bert-base-uncased. However, the static token embeddings average achieves better sentence embeddings performance.

Figure 1: 2D visualization of token embeddings with different biases. Panels (a)-(c) show the frequency bias in bert-base-uncased, bert-base-cased and roberta-base; panels (d)-(f) show the subword and case biases in the same three models. For frequency bias, the darker the color, the higher the token frequency. For subword and case bias, yellow marks subword tokens and red marks tokens that contain capital letters.

Observation 2: Embedding biases harm the sentence embeddings performance. li2020sentence (li2020sentence) found that token embeddings can be biased by token frequency. Similar problems have been studied in yan2021consert. The anisotropy in BERT static token embeddings is sensitive to token frequency. Therefore, we investigate whether embedding bias causes the unsatisfactory performance of sentence embeddings. We observe that the token embeddings are biased not only by token frequency, but also by subwords in WordPiece wu2016google and by case.

In Figure 1, we visualize these biases in the token embeddings of bert-base-uncased, bert-base-cased and roberta-base. The token embeddings of the three pre-trained models are highly biased by token frequency, subword and case. According to the subword and case biases, the token embeddings can be roughly divided into three regions: 1) the lowercase begin-of-word tokens, 2) the uppercase begin-of-word tokens and 3) the subword tokens. For the uncased pre-trained model bert-base-uncased, the token embeddings can be roughly divided into two regions: 1) the begin-of-word tokens and 2) the subword tokens.

For frequency bias, we can observe that high frequency tokens are clustered, while low frequency tokens are dispersed sparsely in all models yan2021consert. The begin-of-word tokens are more vulnerable to frequency than subword tokens in BERT. However, the subword tokens are more vulnerable in RoBERTa.

Previous works yan2021consert; li2020sentence often connect the concept of "token embeddings bias" with token embeddings anisotropy, treating anisotropy as the reason for the bias. However, we think the anisotropy is unrelated to the bias. Bias means that the distribution of the embeddings is disturbed by irrelevant information such as token frequency, which can be directly visualized with PCA. Anisotropy means that the whole embedding space occupies a narrow cone in the high dimensional vector space, which cannot be directly visualized.
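A 2D projection of the kind used to visualize these biases can be obtained with a few lines of linear algebra. The sketch below is one possible way to do it, assuming plain PCA via singular value decomposition of the centered embedding matrix; the exact projection settings behind the figure are not specified here, and the function name is illustrative.

```python
import numpy as np

def pca_2d(embeddings):
    """Project embeddings of shape (n, dim) onto their first two principal components."""
    x = np.asarray(embeddings, dtype=np.float64)
    centered = x - x.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T
```

Coloring the projected points by token frequency, or by subword and case membership, would then show whether such irrelevant information separates the tokens into regions.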

Pre-trained model     Average cosine similarity
bert-base-uncased     0.4445
bert-base-cased       0.1465
roberta-base          0.0235
Table 2: The average cosine similarity in static token embeddings.

Table 2 shows the static token embeddings anisotropy of the three pre-trained models in Figure 1, measured by the average cosine similarity between any two token embeddings. Contrary to the previous conclusion yan2021consert; li2020sentence, we find that only the static token embeddings of bert-base-uncased are highly anisotropic. The static token embeddings of roberta-base are nearly isotropic, with an average cosine similarity of 0.0235. Nevertheless, all of these models suffer from the biases in static token embeddings, which shows that the biases are unrelated to the anisotropy.

                                  cased    uncased   roberta
Static Token Embeddings           56.93    56.02     55.88
  - Freq.                         60.27    59.65     65.41
  - Freq. & Sub.                  64.83    62.20     64.89
  - Freq. & Sub. & Case           65.07    -         65.06
  - Freq. & Sub. & Case & Pun.    66.05    63.10     67.64
Table 3: The influence of static embedding biases on Spearman correlation. The Spearman correlation is the average of STS12-16, STS-B and SICK. Cased, uncased and roberta represent bert-base-cased, bert-base-uncased and roberta-base. For Freq., Sub., Case and Pun., we remove the top frequency tokens, subword tokens, uppercase tokens and punctuation respectively. More details can be found in the appendix.

To demonstrate the negative impact of biases, we show their influence on sentence embeddings obtained by averaging static token embeddings (without BERT layers). As Table 3 shows, eliminating embedding biases gives substantial gains on all three pre-trained models. By simply removing a set of tokens, the result can be improved by 9.22, 7.08 and 11.76 respectively. The final result of roberta-base can outperform post-processing methods such as BERT-flow li2020sentence and BERT-whitening su2021whitening while using only static token embeddings.

Manually removing embedding biases is a simple method to improve the performance of sentence embeddings. However, it is not an adequate solution when the sentence is too short, since it may omit some meaningful words.
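The token removal experiment can be sketched as a filter followed by an average. This is an illustration under stated assumptions rather than the exact procedure: `vocab`, `freq_rank` and `top_k` are hypothetical names, the `##` prefix test applies to BERT WordPiece vocabularies only (RoBERTa marks word boundaries differently), and falling back to all tokens when nothing survives the filter is our own choice for very short sentences.

```python
import string
import numpy as np

def is_biased(token, rank, top_k):
    """Return True if the token belongs to one of the four removed groups."""
    if rank < top_k:                                   # top frequency tokens
        return True
    if token.startswith("##"):                         # WordPiece subword tokens
        return True
    if any(ch.isupper() for ch in token):              # tokens with capital letters
        return True
    if all(ch in string.punctuation for ch in token):  # punctuation
        return True
    return False

def debiased_average(tokens, embedding_table, vocab, freq_rank, top_k=100):
    """Average static token embeddings after dropping biased tokens.

    vocab maps token -> row index; freq_rank maps token -> frequency rank (0 = most frequent).
    top_k is a hypothetical cutoff, not a value taken from the experiments.
    """
    kept = [t for t in tokens if not is_biased(t, freq_rank[t], top_k)]
    if not kept:
        # Assumed fallback: a short sentence may lose every token to the filter.
        kept = list(tokens)
    rows = [vocab[t] for t in kept]
    return np.asarray(embedding_table, dtype=np.float64)[rows].mean(axis=0)
```

The fallback branch makes the weakness of manual removal concrete: for short inputs the filter can discard every token, so the sentence has to be represented by the biased tokens anyway.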