Zero-training Sentence Embedding via Orthogonal Basis

09/30/2018 ∙ by ZiYi Yang, et al. ∙ Microsoft ∙ Stanford University

We propose a simple and robust training-free approach for building sentence representations. Inspired by the Gram-Schmidt Process in geometric theory, we build an orthogonal basis of the subspace spanned by a word and its surrounding context in a sentence. We model the semantic meaning of a word in a sentence based on two aspects. One is its relatedness to the word vector subspace already spanned by its contextual words. The other is the word's novel semantic meaning which shall be introduced as a new basis vector perpendicular to this existing subspace. Following this motivation, we develop an innovative method based on orthogonal basis to combine pre-trained word embeddings into sentence representations. This approach requires zero training and zero parameters, along with efficient inference performance. We evaluate our approach on 11 downstream NLP tasks. Experimental results show that our model outperforms all existing zero-training alternatives in all the tasks and it is competitive to other approaches relying on either large amounts of labelled data or prolonged training time.




1 Introduction

Word embeddings have been prevalent in the NLP community in recent years, as they can characterize semantic similarity between any pair of words, achieving promising results in a large number of NLP tasks (Mikolov et al., 2013; Pennington et al., 2014; Salle et al., 2016). However, due to the hierarchical nature of human language, it is not sufficient to comprehend text solely based on an isolated understanding of each word. This has prompted a recent surge of interest in semantically robust embeddings for longer pieces of text, such as sentences and paragraphs.

Based on learning paradigms, the existing approaches to sentence embeddings can be categorized into three groups: i) unsupervised, ii) supervised and semi-supervised, and iii) training-free methods.

Unsupervised sentence embedding. These models are trained on large unlabelled corpora to generate sentence embeddings. SkipThought (Kiros et al., 2015) is an encoder-decoder model that predicts adjacent sentences. Pagliardini et al. (2018) propose an unsupervised model, Sent2Vec, which learns n-gram features in a sentence to predict the center word from the surrounding context. Quick Thoughts (QT) (Logeswaran & Lee, 2018) replaces the encoder with a classifier to predict context sentences from candidate sequences. Khodak et al. (2018) propose to learn a linear mapping to reconstruct the center word from its context.

Supervised and semi-supervised sentence embedding. In this category, the representation network is trained on labelled datasets, sometimes after pre-training on large unlabelled corpora. For instance, Conneau et al. (2017) train the sentence encoder InferSent on a Natural Language Inference (NLI) dataset. Universal Sentence Encoder (Yang et al., 2018; Cer et al., 2018) utilizes the transformer (Vaswani et al., 2017) for sentence embeddings. The model is first trained on large-scale unsupervised data from Wikipedia and forums, and then trained on the Stanford Natural Language Inference (SNLI) dataset. Wieting & Gimpel (2017) propose the gated recurrent averaging network (GRAN), which is trained on the Paraphrase Database (PPDB) and English Wikipedia. Subramanian et al. (2018) leverage a multi-task learning framework to generate sentence embeddings. Wieting et al. (2015a) learn paraphrastic sentence representations as the simple average of updated word embeddings.

Training-free sentence embedding. Recent work (Arora et al., 2017) shows that, surprisingly, a weighted sum or transformation of word representations can outperform many sophisticated neural network structures in sentence embedding tasks. These methods require no training and are parameter-free.

Arora et al. (2017) construct a sentence embedding called SIF as a sum of pre-trained word embeddings, weighted by reverse document frequency. Rücklé et al. (2018) concatenate different power mean word embeddings as a sentence vector in p-mean. As these methods do not have a parametric model, they can be easily adapted to novel text domains with both fast inference speed and high-quality sentence embeddings. In view of this trend, our work aims to further advance the frontier of this group and establish its new state of the art.

In this paper, we propose a novel sentence embedding algorithm, Geometric Embedding (GEM), based entirely on the geometric structure of word embedding space. Given a $d$-dim word embedding matrix $A \in \mathbb{R}^{d \times n}$ for a sentence with $n$ words, any linear combination of the sentence's word embeddings lies in the subspace spanned by the $n$ word vectors. We analyze the geometric structure of this subspace in $\mathbb{R}^d$. When we consider the words in a sentence one-by-one in order, each word may bring in a novel orthogonal basis to the existing subspace. This new basis can be considered as the new semantic meaning brought in by this word, while the length of projection in this direction can indicate the intensity of this new meaning. It follows that a word with a strong intensity should have a larger influence on the sentence's meaning. Thus, these intensities can be converted into weights to linearly combine all word embeddings to obtain the sentence embedding. In this paper, we theoretically frame the above approach in a QR factorization of the word embedding matrix $A$. Furthermore, since the meaning and importance of a word largely depends on its close neighborhood, we propose the sliding-window QR factorization method to capture the context of a word and characterize its significance within the context.

In the last step, we adopt an approach similar to Arora et al. (2017) and remove the top principal vectors before generating the final sentence embedding. This step ensures that commonly shared background components, e.g. stop words, do not bias sentence similarity comparison. As we build a new orthogonal basis for each sentence, we propose to use different background components for each sentence. This motivates a sentence-specific principal vector removal method, which leads to better empirical results.

We evaluate our algorithm on 11 NLP tasks, including both unsupervised and supervised tasks with downstream network structures. In all of these tasks, our algorithm outperforms all training-free methods and many parametric approaches. For example, compared to SIF (Arora et al., 2017), the performance is boosted by 5.5% on STS benchmark dataset, and by 2.5% on SST dataset. Plus, the running time of our model compares favorably with existing models.

The rest of this paper is organized as follows. In Section 2, we describe our sentence embedding algorithm GEM. We evaluate our model on various tasks in Section 3 and discuss the results in Section 4. Finally, we summarize our work in Section 5.

2 Approach

2.1 Quantify New Semantic Meaning

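Let us consider the idea of word embeddings (Mikolov et al., 2013), where a word $w_i$ is projected as a vector $v_{w_i} \in \mathbb{R}^d$. Any sequence of words can be viewed as a subspace in $\mathbb{R}^d$ spanned by its word vectors. Before the appearance of the $i$th word, $S$ is a subspace in $\mathbb{R}^d$ spanned by $\{v_{w_1}, v_{w_2}, \dots, v_{w_{i-1}}\}$. Its orthonormal basis is $\{q_1, q_2, \dots, q_{i-1}\}$. The embedding $v_{w_i}$ of the $i$th word $w_i$ can be decomposed into

$$v_{w_i} = \sum_{j=1}^{i-1} r_j q_j + r_i q_i, \qquad r_j = q_j^T v_{w_i}, \qquad r_i = \Big\| v_{w_i} - \sum_{j=1}^{i-1} r_j q_j \Big\|_2 \tag{1}$$

where $\sum_{j=1}^{i-1} r_j q_j$ is the part in $v_{w_i}$ that resides in subspace $S$, and $q_i$ is orthogonal to $S$ and is to be added to $S$. The above algorithm is also known as the Gram-Schmidt process. In the case of rank deficiency, i.e., $v_{w_i}$ is already a linear combination of $\{q_1, \dots, q_{i-1}\}$, $q_i$ is a zero vector and $r_i = 0$. In matrix form, this process is also known as QR factorization, defined as follows.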
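As an illustration of the decomposition described above, the following is a minimal NumPy sketch of a single Gram-Schmidt step. The function name `gram_schmidt_step`, the tolerance `tol`, and the random toy vectors are assumptions made for this example and are not part of the original method description.

```python
import numpy as np

def gram_schmidt_step(basis, v, tol=1e-10):
    """Split v into its part inside span(basis) and a new orthonormal direction.

    basis: (d, k) array with orthonormal columns (k may be 0).
    Returns (r_prev, r_new, q_new); q_new is the zero vector when v already
    lies in the span (the rank-deficient case).
    """
    r_prev = basis.T @ v               # coordinates of v along the existing basis
    residual = v - basis @ r_prev      # component of v orthogonal to the subspace
    r_new = np.linalg.norm(residual)   # length of the novel component
    if r_new < tol:                    # v adds no new direction
        return r_prev, 0.0, np.zeros_like(v)
    return r_prev, r_new, residual / r_new

# Toy data: three random "word vectors" in R^5, stored as columns.
rng = np.random.default_rng(0)
words = rng.standard_normal((5, 3))
basis = np.zeros((5, 0))
for j in range(words.shape[1]):
    r_prev, r_new, q_new = gram_schmidt_step(basis, words[:, j])
    basis = np.column_stack([basis, q_new])
```

After the loop, the columns of `basis` are orthonormal, and feeding in a vector that is already in their span returns a zero direction with zero length.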

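QR factorization. Define an embedding matrix of $n$ words as $A = [A_{:,1}, \dots, A_{:,n}] \in \mathbb{R}^{d \times n}$, where $A_{:,i}$ is the embedding of the $i$th word $w_i$ in a word sequence $(w_1, \dots, w_i, \dots, w_n)$. $A$ can be factorized into $A = QR$, where the non-zero columns in $Q \in \mathbb{R}^{d \times n}$ are the orthonormal basis, and $R \in \mathbb{R}^{n \times n}$ is an upper triangular matrix.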

The process above computes the novel semantic meaning of a word w.r.t all preceding words. As the meaning of a word influences and is influenced by its close neighbors, we now calculate the novel orthogonal basis vector of each word in its neighborhood, rather than only w.r.t the preceding words.

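Definition 1 (Contextual Window Matrix) Given a word $w_i$, and its $m$-neighborhood window inside the sentence $(w_{i-m}, \dots, w_{i-1}, w_i, w_{i+1}, \dots, w_{i+m})$, define the contextual window matrix of word $w_i$ as:

$$S^i = [v_{w_{i-m}}, \dots, v_{w_{i-1}}, v_{w_{i+1}}, \dots, v_{w_{i+m}}, v_{w_i}] \tag{2}$$

Here we shuffle $v_{w_i}$ to the end of $S^i$ to compute its novel semantic information compared with its context. Now the QR factorization of $S^i$ is

$$S^i = Q^i R^i \tag{3}$$

Note that $q_i$ is the last column of $Q^i$, which is also the new orthogonal basis vector to this contextual window matrix.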
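The construction of the contextual window matrix and its QR factorization can be sketched as follows. The helper names, the truncation of the window at sentence boundaries, and the sign normalization are assumptions of this sketch; the text above does not specify how boundary words are handled. The sketch also assumes the embedding dimension is at least as large as the window, so that the reduced QR returns one basis vector per column.

```python
import numpy as np

def contextual_window_matrix(E, i, m):
    """E: (d, n) sentence matrix with one word vector per column.

    Returns the vectors of the words within distance m of word i, with the
    vector of word i moved to the last column. The window is truncated at
    the sentence boundaries (an assumption of this sketch).
    """
    n = E.shape[1]
    context = [j for j in range(max(0, i - m), min(n, i + m + 1)) if j != i]
    return E[:, context + [i]]

def novel_basis(E, i, m):
    """Return (q, r): the last columns of Q and R for word i's window."""
    S_i = contextual_window_matrix(E, i, m)
    Q, R = np.linalg.qr(S_i)            # reduced QR; requires d >= window size
    q, r = Q[:, -1].copy(), R[:, -1].copy()
    if r[-1] < 0:                       # LAPACK may return a negative diagonal
        q, r[-1] = -q, -r[-1]           # flip so the last entry is a distance
    return q, r
```

For a sentence matrix `E`, `novel_basis(E, i, m)` returns a unit vector `q` orthogonal to all context vectors in the window, and the last entry of `r` is the length of the component of word `i` along `q`.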

Next, in order to generate the embedding for a sentence, we will assign a weight to each of its words. This weight should characterize how much new and important information a word brings to the sentence. The previous process yields the orthogonal basis vector $q_i$. We propose that $q_i$ represents the novel semantic meaning brought by word $w_i$. We will now discuss how to quantify i) the novelty of $q_i$ to other meanings in $w_i$, ii) the significance of $q_i$ to its context, and iii) the corpus-wise uniqueness of $q_i$ w.r.t the whole corpus.

2.2 Novelty

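We propose that a word $w_i$ is more important to a sentence if its novel orthogonal basis vector $q_i$ is a large component in $v_{w_i}$. This can be quantified as a novelty score:

$$\alpha_n = \exp\left(\frac{r_{-1}}{\|r\|_2}\right) \tag{4}$$

where $r$ is the last column of $R^i$, and $r_{-1}$ is the last element of $r$.

Connection to least square. From QR factorization theory, the novel orthogonal basis $q_i$ is also the normalized residual in the least square problem $\min_x \|Cx - v_{w_i}\|_2$, i.e. $q_i = (v_{w_i} - Cx^*) / \|v_{w_i} - Cx^*\|_2$, where $x^*$ is the minimizer and $C = [v_{w_{i-m}}, \dots, v_{w_{i-1}}, v_{w_{i+1}}, \dots, v_{w_{i+m}}]$. And $r_{-1} = \|v_{w_i} - Cx^*\|_2$ is the minimum distance from word vector $v_{w_i}$ to the hyper-plane spanned by $w_i$'s context words.

Since $Q^i$ has orthonormal columns, $\|r\|_2 = \|v_{w_i}\|_2$. It follows that $\alpha_n$ is the exponential of the normalized distance between $v_{w_i}$ and the subspace spanned by its context.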
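The link between the QR factorization and the least-squares residual can be checked numerically. The sketch below uses random toy vectors, and it assumes the novelty score has the form exp(distance / norm of the word vector), as the surrounding description states; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
C = rng.standard_normal((10, 4))     # toy context word vectors, one per column
v = rng.standard_normal(10)          # toy vector of the word being scored

# QR route: the last column of R holds the coordinates of v in the new basis.
Q, R = np.linalg.qr(np.column_stack([C, v]))
r = R[:, -1]
dist_qr = abs(r[-1])                 # abs() guards against LAPACK sign conventions

# Least-squares route: norm of the residual of min_x ||C x - v||_2.
x_star, *_ = np.linalg.lstsq(C, v, rcond=None)
dist_ls = np.linalg.norm(v - C @ x_star)

# Novelty score: exponential of the distance normalized by ||v||_2 = ||r||_2.
alpha_n = np.exp(dist_qr / np.linalg.norm(r))
```

Both routes give the same distance, and because the normalized distance lies between 0 and 1, the score under this form lies between 1 and e.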

2.3 Significance

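The significance of a word is related to how semantically aligned it is to the meaning of its context. To identify principal directions, i.e. meanings, in the contextual window matrix $S^i$, we employ Singular Value Decomposition.

Singular Value Decomposition. Given a matrix $A \in \mathbb{R}^{d \times n}$, there exists $U \in \mathbb{R}^{d \times n}$ with orthogonal columns, diagonal matrix $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_n)$, $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_n \ge 0$, and orthogonal matrix $V \in \mathbb{R}^{n \times n}$, such that $A = U \Sigma V^T$.

The columns of $U$, $\{U_{:,j}\}_{j=1}^{n}$, are an orthonormal basis of $A$'s columns subspace and we propose that they represent a set of semantic meanings from the context. Their corresponding singular values $\{\sigma_j\}_{j=1}^{n}$, denoted by $\sigma(A)$, represent the importance associated with $\{U_{:,j}\}_{j=1}^{n}$. The SVD of $w_i$'s contextual window matrix is $S^i = U^i \Sigma^i (V^i)^T \in \mathbb{R}^{d \times (2m+1)}$. It follows that $q_i^T U^i$ is the coordinate of $q_i$ in the basis of $\{U^i_{:,j}\}_{j=1}^{2m+1}$.

Intuitively, a word is more important if its novel semantic meaning has a better alignment with more principal meanings in its contextual window. This can be quantified as $\|\sigma(S^i) \odot (q_i^T U^i)\|_2$, where $\odot$ denotes element-wise product. Therefore, we define the significance of $w_i$ in its context to be:

$$\alpha_s = \frac{\|\sigma(S^i) \odot (q_i^T U^i)\|_2}{2m+1} \tag{5}$$

It turns out $\alpha_s$ can be rewritten as

$$\alpha_s = \frac{\|q_i^T U^i \Sigma^i\|_2}{2m+1} = \frac{\|q_i^T U^i \Sigma^i (V^i)^T\|_2}{2m+1} = \frac{\|q_i^T S^i\|_2}{2m+1} = \frac{q_i^T v_{w_i}}{2m+1} = \frac{r_{-1}}{2m+1} \tag{6}$$

We use the fact that $V^i$ is an orthogonal matrix and $q_i$ is orthogonal to all but the last column of $S^i$, $v_{w_i}$. Therefore, $\alpha_s$ is essentially the distance between $w_i$ and the context hyper-plane, normalized by the context size.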
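The equivalence between the SVD-based definition of the significance score and its simplified form can be verified numerically. This is a sketch on a random toy window matrix; the window half-width `m = 2` and the dimension 12 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 2
S_i = rng.standard_normal((12, 2 * m + 1))   # toy window matrix, target word last

# New basis vector and its coefficient from the QR factorization.
Q, R = np.linalg.qr(S_i)
q, r_last = Q[:, -1], R[-1, -1]

# Definition: singular values weighted by the coordinates of q in the basis U.
U, sigma, Vt = np.linalg.svd(S_i, full_matrices=False)
alpha_s_svd = np.linalg.norm(sigma * (q @ U)) / (2 * m + 1)

# Simplified form: distance to the context hyper-plane over the window size.
alpha_s_qr = abs(r_last) / (2 * m + 1)       # abs() handles the QR sign convention
```

The two values agree up to floating-point error, so in practice the SVD of each window is not needed to compute the significance score.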

2.4 Corpus-wise Uniqueness

Similar to the idea of inverse document frequency (IDF) (Sparck Jones, 1972), a word that is commonly present in the corpus is likely to be a stop word, thus its corpus-wise uniqueness is small. In our solution, we compute the principal directions of the corpus and then measure their alignment with the novel orthogonal basis vector $q_i$. If there is a high alignment, $w_i$ will be assigned a relatively low corpus-wise uniqueness score, and vice versa.

2.4.1 Compute Principal Directions of Corpus

As proposed in Arora et al. (2017), given a corpus containing a set of $N$ sentences, each sentence embedding is first computed as a linear combination of its word embeddings, thus generating a sentence embedding matrix $X = [c_1, \dots, c_N] \in \mathbb{R}^{d \times N}$ for a corpus $\mathcal{S}$ with $N$ sentences. Then principal vectors of $X$ are computed.

In comparison, we do not form the sentence embedding matrix after we finalize the sentence embedding. Instead, we obtain an intermediate coarse-grained sentence embedding matrix $X^c = [g_1, \dots, g_N]$ as follows. Suppose the SVD of the sentence matrix $S \in \mathbb{R}^{d \times n}$ of the $i$th sentence is $S = U \Sigma V^T$. Then the coarse-grained embedding for the $i$th sentence is defined as:

$$g_i = \sum_{j=1}^{n} f(\sigma_j) U_{:,j} \tag{7}$$

where $f(\sigma_j)$ is a monotonically increasing function. We then compute the top $K$ principal vectors $\{d_1, \dots, d_K\}$ of $X^c$, with singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_K$.
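A minimal sketch of the coarse-grained sentence embedding and the corpus principal vectors follows. It assumes the monotonically increasing function is a power of the singular value, as the sensitivity study suggests; the function names and the `power` argument are hypothetical and chosen for this example.

```python
import numpy as np

def coarse_embedding(S, power):
    """Sum of left singular vectors of S weighted by sigma**power.

    S: (d, n) sentence matrix, one word vector per column.
    power: exponent of the assumed weighting function f(sigma) = sigma**power.
    """
    U, sigma, _ = np.linalg.svd(S, full_matrices=False)
    return U @ sigma**power

def corpus_principal_vectors(sentences, K, power):
    """Top-K left singular vectors and singular values of the coarse matrix."""
    X_c = np.column_stack([coarse_embedding(S, power) for S in sentences])
    D, svals, _ = np.linalg.svd(X_c, full_matrices=False)
    return D[:, :K], svals[:K]
```

Singular vectors are only defined up to sign, so the coarse embedding in this sketch inherits whatever sign convention the SVD routine applies.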

2.4.2 Uniqueness Score

In contrast to Arora et al. (2017), we select different principal vectors of $X^c$ for each sentence, as different sentences may have disparate alignments with the corpus. For each sentence, $\{d_1, \dots, d_K\}$ are re-ranked in descending order of their correlation with sentence matrix $S$. The correlation is defined as $o_i = \sigma_i \|S^T d_i\|_2$, $1 \le i \le K$. Next, the top $h$ principal vectors after re-ranking based on $o_i$ are selected: $D = \{d_{t_1}, \dots, d_{t_h}\}$, with $o_{t_1} \ge o_{t_2} \ge \dots \ge o_{t_h}$, and their singular values in $X^c$ are $\sigma_d = [\sigma_{t_1}, \dots, \sigma_{t_h}]$.

Finally, a word $w_i$ with new semantic meaning vector $q_i$ in this sentence will be assigned a corpus-wise uniqueness score:

$$\alpha_u = \exp\left(-\frac{\|\sigma_d \odot (q_i^T D)\|_2}{h}\right) \tag{8}$$

This ensures that common stop words will have their effect diminished since their embeddings are closely aligned with the corpus' principal directions.
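The re-ranking step and the uniqueness score can be sketched as below. The exact formulas are assumptions of this sketch: the correlation is taken to be the singular value times the norm of the projection of the sentence's word vectors onto a principal vector, and the score is taken to be a decaying exponential of the singular-value-weighted alignment divided by the number of selected vectors. Function names are hypothetical.

```python
import numpy as np

def select_principal_vectors(S, D, svals, h):
    """Re-rank corpus principal vectors for one sentence and keep the top h.

    S: (d, n) sentence matrix; D: (d, K) principal vectors; svals: (K,).
    Assumed correlation: o_i = sigma_i * ||S^T d_i||_2.
    """
    o = svals * np.linalg.norm(S.T @ D, axis=0)
    top = np.argsort(-o)[:h]                 # indices of the h largest correlations
    return D[:, top], svals[top]

def uniqueness_score(q, D_h, svals_h):
    """Assumed form: exp(-||sigma_d * (q^T D)||_2 / h)."""
    h = D_h.shape[1]
    return np.exp(-np.linalg.norm(svals_h * (q @ D_h)) / h)
```

Under this form, a direction orthogonal to all selected principal vectors receives the maximum score of 1, while a direction aligned with a dominant corpus direction receives a smaller score.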

2.5 Sentence Vector

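A sentence vector $c_s$ is computed as a weighted sum of its word embeddings, where the weights come from three scores: a novelty score ($\alpha_n$), a significance score ($\alpha_s$) and a corpus-wise uniqueness score ($\alpha_u$).

$$\alpha_i = \alpha_n + \alpha_s + \alpha_u, \qquad c_s = \sum_{i} \alpha_i v_{w_i} \tag{9}$$

We provide a theoretical explanation of Equation 9 in Appendix A.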

Sentence-Dependent Removal of Principal Components. Arora et al. (2017) shows that given a set of sentence vectors, removing projections onto the principal components of the spanned subspace can significantly enhance the performance on semantic similarity task. However, as each sentence may have a different semantic meaning, it could be sub-optimal to remove the same set of principal components from all sentences.

Therefore, we propose the sentence-dependent principal component removal (SDR), where we re-rank top principal vectors based on correlation with each sentence. Using the method from Section 2.4.2, we obtain $D = \{d_{t_1}, \dots, d_{t_h}\}$ for a sentence $s$. The final embedding of this sentence is then computed as:

$$c_s \leftarrow c_s - \sum_{j=1}^{h} (d_{t_j}^T c_s)\, d_{t_j} \tag{10}$$

Ablation experiments show that sentence-dependent principal component removal achieves better results. The complete algorithm is summarized in Algorithm 1 with an illustration in Figure 1.

Figure 1: An illustration of GEM algorithm. Top middle: The sentence to encode. Bottom middle: Form $S^i$ for $w_i$, compute $q_i$ and novelty score $\alpha_n$ (Section 2.1 and Section 2.2). Bottom left: Compute the SVD of $S^i$ and significance score $\alpha_s$ (Section 2.3). Bottom right: Re-rank and select from principal components and compute uniqueness score $\alpha_u$ (Section 2.4).

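Inputs: A set of sentences $\mathcal{S}$, vocabulary $V$, word embeddings $\{v_w \in \mathbb{R}^d \mid w \in V\}$
Outputs: Sentence embeddings $\{c_s \in \mathbb{R}^d \mid s \in \mathcal{S}\}$
for $i$th sentence $s$ in $\mathcal{S}$ do
     Form matrix $S \in \mathbb{R}^{d \times n}$, $S_{:,j} = v_{w_j}$ and $w_j$ is the $j$th word in $s$
     The SVD is $S = U \Sigma V^T$
     Form the $i$th column of the coarse-grained sentence embedding matrix $X^c$, $X^c_{:,i} = \sum_j f(\sigma_j) U_{:,j}$
end for
Take first $K$ singular vectors $\{d_1, \dots, d_K\}$ and singular values $\sigma_1 \ge \dots \ge \sigma_K$ of $X^c$
for sentence $s$ in $\mathcal{S}$ do
     Re-rank $\{d_1, \dots, d_K\}$ in descending order by $o_i = \sigma_i \|S^T d_i\|_2$.
     Select top $h$ principal vectors as $D = [d_{t_1}, \dots, d_{t_h}]$, with singular values $\sigma_d = [\sigma_{t_1}, \dots, \sigma_{t_h}]$.
     for word $w_i$ in $s$ do
          $S^i$ is the contextual window matrix of $w_i$.
          Do QR decomposition $S^i = Q^i R^i$, let $q_i$ and $r$ denote the last column of $Q^i$ and $R^i$
          Compute the scores $\alpha_n$, $\alpha_s$ and $\alpha_u$ (Sections 2.2, 2.3 and 2.4)
     end for
     Form the sentence vector $c_s$ as the weighted sum of word embeddings (Section 2.5)
     Principal vectors removal: $c_s \leftarrow c_s - D D^T c_s$
end for
Algorithm 1 Geometric Embedding (GEM)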
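The following end-to-end NumPy sketch strings the steps of Algorithm 1 together. It is an illustration under stated assumptions rather than the authors' implementation: the three scores are assumed to be added to form each word's weight, the weighting function is assumed to be a power of the singular value, windows are truncated at sentence boundaries, word vectors are assumed non-zero, and the embedding dimension is assumed to be at least the window size. The function name `gem_embed` and its argument names are hypothetical.

```python
import numpy as np

def gem_embed(sentences, m, K, h, power):
    """sentences: list of (d, n_s) arrays of pre-trained word vectors (columns).

    m: window half-width, K: candidate principal vectors, h: vectors removed,
    power: exponent applied to singular values in the coarse embedding.
    Returns a (d, N) array with one sentence embedding per column.
    """
    # Pass 1: coarse-grained sentence embeddings and corpus principal vectors.
    coarse = []
    for S in sentences:
        U, sig, _ = np.linalg.svd(S, full_matrices=False)
        coarse.append(U @ sig**power)
    D_all, s_all, _ = np.linalg.svd(np.column_stack(coarse), full_matrices=False)
    D_all, s_all = D_all[:, :K], s_all[:K]

    out = []
    for S in sentences:
        d, n = S.shape
        # Sentence-specific re-ranking of the corpus principal vectors.
        o = s_all * np.linalg.norm(S.T @ D_all, axis=0)
        top = np.argsort(-o)[:h]
        D, s_d = D_all[:, top], s_all[top]

        c = np.zeros(d)
        for i in range(n):
            # Contextual window matrix with word i moved to the last column.
            ctx = [j for j in range(max(0, i - m), min(n, i + m + 1)) if j != i]
            Q, R = np.linalg.qr(S[:, ctx + [i]])
            q, r = Q[:, -1], R[:, -1]
            dist = abs(r[-1])                      # distance to the context subspace
            a_n = np.exp(dist / np.linalg.norm(r))  # novelty
            a_s = dist / (2 * m + 1)                # significance
            a_u = np.exp(-np.linalg.norm(s_d * (q @ D)) / D.shape[1])  # uniqueness
            c += (a_n + a_s + a_u) * S[:, i]
        # Sentence-dependent removal of the selected principal directions.
        out.append(c - D @ (D.T @ c))
    return np.column_stack(out)
```

As a usage example, `gem_embed(sentences, m=2, K=3, h=2, power=3)` on a list of small random matrices returns one column per sentence; the hyper-parameter values in this call are arbitrary and only serve to exercise the code.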

3 Experiments

3.1 Semantic Similarity Tasks: STS Benchmark

We evaluate our model on the STS Benchmark (Cer et al., 2017), a sentence-level semantic similarity dataset from SemEval and SEM STS. The goal for a model is to predict a similarity score of two sentences given a sentence pair. The evaluation is by the Pearson’s coefficient between human-labeled similarity (0 - 5 points) and predictions.

Experimental settings. We report two versions of our model, one only using GloVe word vectors (GEM + GloVe), and the other using word vectors concatenated from LexVec, fastText and PSL (Wieting et al., 2015b) (GEM + L.F.P). The final similarity score is computed as an inner product of normalized sentence vectors. Since our model is training-free, it does not utilize any information from the dev set when evaluating on the test set and vice versa.
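The scoring procedure described above, an inner product of normalized sentence vectors followed by Pearson correlation against human labels, can be sketched as follows. The function names and array layout are assumptions for illustration.

```python
import numpy as np

def cosine_scores(A, B):
    """A, B: (d, N) sentence embeddings for the two sides of N sentence pairs.

    Returns the inner product of the L2-normalized columns, one score per pair.
    """
    A = A / np.linalg.norm(A, axis=0, keepdims=True)
    B = B / np.linalg.norm(B, axis=0, keepdims=True)
    return np.sum(A * B, axis=0)

def pearson_x100(pred, gold):
    """Pearson correlation between predictions and gold scores, times 100."""
    return 100.0 * np.corrcoef(pred, gold)[0, 1]
```

Because Pearson correlation is invariant to affine rescaling, the predicted cosine scores do not need to be mapped onto the 0 to 5 label range before evaluation.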

Results on the dev and test set are reported in Table 1. As shown, on the test set, our model has a 5.5% higher score compared with another non-parametric model SIF, and is 25.5% higher than the baseline of averaging L.F.P word vectors. It also outperforms most parametric models including GRAN, InferSent, and Sent2Vec. Of all evaluated models, our model only ranks second to Reddit + SNLI, which is trained on the Reddit conversations dataset (600 million sentence pairs) and SNLI (570k sentence pairs). In comparison, our proposed method requires no external data and no training.

| Model | dev | test |
|---|---|---|
| Training-free models | | |
| GEM + L.F.P | 82.1 | 77.5 |
| GEM + LexVec | 81.9 | 76.5 |
| SIF (Arora et al., 2017) | 80.1 | 72.0 |
| LexVec | 58.78 | 50.43 |
| L.F.P | 62.4 | 52.0 |
| Word2vec skipgram | 70.0 | 56.5 |
| Glove | 52.4 | 40.6 |
| Training-required models | | |
| Reddit + SNLI (Yang et al., 2018) | 81.4 | 78.2 |
| GRAN (Wieting & Gimpel, 2017) | 81.8 | 76.4 |
| InferSent (Conneau et al., 2017) | 80.1 | 75.8 |
| Sent2Vec (Pagliardini et al., 2018) | 78.7 | 75.5 |
| Paragram-Phrase (Wieting et al., 2015a) | 73.9 | 73.2 |

Table 1: Pearson's r × 100 on STSB

| Model | MAP |
|---|---|
| GEM + L.F.P | 48.97 |
| Reddit + SNLI tuned | 47.44 |
| KeLP-contrastive1 | 49.00 |
| SimBow-contrastive2 | 47.87 |
| SimBow-primary | 47.22 |

Table 2: MAP on CQA subtask B

3.2 Semantic Similarity Tasks: CQA

We evaluate our model on subtask B of the SemEval Community Question Answering (CQA) task, another semantic similarity dataset. Given an original question $Q_o$ and a set of the first ten related questions $(Q_1, \dots, Q_{10})$ retrieved by a search engine, the model is expected to re-rank the related questions according to their similarity with respect to the original question. Each retrieved question $Q_i$ is labelled "PerfectMatch", "Relevant" or "Irrelevant", with respect to $Q_o$. Mean average precision (MAP) is used as the evaluation measure.

We encode each question text into a unit vector $u$. Retrieved questions $\{Q_i\}_{i=1}^{10}$ are ranked according to their cosine similarity with $Q_o$. Results are shown in Table 2. For comparison, we include results from the best models in the 2017 competition: SimBow (Charlet & Damnati, 2017), KeLP (Filice et al., 2017), and Reddit + SNLI tuned. Note that all three benchmark models require training, and SimBow and KeLP leverage optional features including usage of comments and user profiles. In comparison, our model only uses the question text without any training. Our model clearly outperforms both Reddit + SNLI tuned and SimBow-primary, and is on par with the KeLP model.
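The ranking and evaluation steps of this task can be sketched as below. This is an illustration only: it assumes that both "PerfectMatch" and "Relevant" count as relevant when computing average precision, and the official scorer may differ in details. Function names are hypothetical.

```python
import numpy as np

def rank_by_cosine(query, candidates):
    """query: (d,) vector; candidates: (d, k) matrix, one candidate per column.

    Returns candidate indices ordered from most to least similar.
    """
    q = query / np.linalg.norm(query)
    C = candidates / np.linalg.norm(candidates, axis=0, keepdims=True)
    return np.argsort(-(q @ C))

def average_precision(ranked_relevance):
    """ranked_relevance: booleans in ranked order (True means relevant)."""
    rel = np.asarray(ranked_relevance, dtype=bool)
    if not rel.any():
        return 0.0
    hits = np.cumsum(rel)                    # relevant items seen so far
    ranks = np.arange(1, len(rel) + 1)
    return float(np.mean(hits[rel] / ranks[rel]))
```

MAP is then the mean of `average_precision` over all original questions.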

3.3 Supervised tasks

We further test our model on nine supervised tasks. The sentence embeddings generated are fixed and only the downstream task-specific neural structure is learned. Results are in Table 3.

As shown, our model GEM outperforms all non-parametric sentence embedding models, including SIF, p-mean (Rücklé et al., 2018), and BOW on GloVe. It also compares favorably with most parametric models, including à la carte (Khodak et al., 2018), FastSent (Hill et al., 2016), InferSent, QT, Sent2Vec, SkipThought-LN (with layer normalization) (Kiros et al., 2015), SDAE (Hill et al., 2016), and STN (Subramanian et al., 2018). Note that sentence representations generated by GEM have a much smaller dimension than those of most benchmark models, and the subsequent neural structure has fewer learnable parameters.

| Model | Dim | Training time (h) | MR | CR | SUBJ | MPQA | SST | TREC | MRPC | SICK-R | SICK-E |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Training-free models | | | | | | | | | | | |
| GEM + L.F.P | 900 | 0 | 79.8 | 82.5 | 93.8 | 89.9 | 84.7 | 91.4 | 75.4/82.9 | 86.5 | 86.2 |
| GEM + GloVe | 300 | 0 | 78.8 | 81.1 | 93.1 | 89.4 | 83.6 | 88.6 | 73.4/82.3 | 86.3 | 85.3 |
| SIF | 300 | 0 | 77.3 | 78.6 | 90.5 | 87.0 | 82.2 | 78.0 | - | 86.0 | 84.6 |
| p-mean | 3600 | 0 | 78.4 | 80.4 | 93.1 | 88.9 | 83.0 | 90.6 | - | - | - |
| GloVe BOW | 300 | 0 | 78.7 | 78.5 | 91.6 | 87.6 | 79.8 | 83.6 | 72.1/80.9 | 80.0 | 78.6 |
| Training-required models | | | | | | | | | | | |
| InferSent | 4096 | 24 | 81.1 | 86.3 | 92.4 | 90.2 | 84.6 | 88.2 | 76.2/83.1 | 88.4 | 86.3 |
| Sent2Vec | 700 | 6.5 | 75.8 | 80.3 | 91.1 | 85.9 | - | 86.4 | 72.5/80.8 | - | - |
| SkipThought-LN | 4800 | 336 | 79.4 | 83.1 | 93.7 | 89.3 | 82.9 | 88.4 | - | 85.8 | 79.5 |
| FastSent | 300 | 2 | 70.8 | 78.4 | 88.7 | 80.6 | - | 76.8 | 72.2/80.3 | - | - |
| à la carte | 4800 | N/A | 81.8 | 84.3 | 93.8 | 87.6 | 86.7 | 89.0 | - | - | - |
| SDAE | 2400 | 192 | 74.6 | 78.0 | 90.8 | 86.9 | - | 78.4 | 73.7/80.7 | - | - |
| QT | 4800 | 28 | 82.4 | 86.0 | 94.8 | 90.2 | 87.6 | 92.4 | 76.9/84.0 | 87.4 | - |
| STN | 4096 | 168 | 82.5 | 87.7 | 94.0 | 90.9 | 83.2 | 93.0 | 78.6/84.4 | 88.8 | 87.8 |

Table 3: Results on supervised tasks. Sentence embeddings are fixed for downstream supervised tasks. Best results for each task are underlined, best results from models in the same category are in bold. SIF results are extracted from Arora et al. (2017) and Rücklé et al. (2018), and some training times are collected from Logeswaran & Lee (2018).

4 Discussion

Ablation Study. As shown in Table 4, every GEM weight ($\alpha_n$, $\alpha_s$, $\alpha_u$) and the proposed principal component removal method contribute to the performance. As listed on the left, adding GEM weights improves the score by 8.6% on the STS dataset compared with averaging three concatenated word vectors. The sentence-dependent principal component removal (SDR) proposed in GEM improves the score by 0.3% compared to directly removing the top $h$ corpus principal components (SIR). Using GEM weights and SDR together yields an overall improvement of 19.7%. As shown on the right in Table 4, every weight contributes to the performance of our model. For example, the three weights together improve the score on the SUBJ task by 0.38% compared with only using $\alpha_n$.

| Configurations | STSB dev |
|---|---|
| Mean of L.F.P | 62.4 |
| GEM weights | 71.0 |
| GEM weights + SIR | 81.8 |
| GEM weights + SDR | 82.1 |

| Configurations | STSB dev | SUBJ |
|---|---|---|
| $\alpha_n$ + SDR | 81.6 | 93.42 |
| $\alpha_n, \alpha_s$ + SDR | 81.9 | 93.6 |
| $\alpha_n, \alpha_s, \alpha_u$ + SDR | 82.1 | 93.8 |

Table 4: Comparison of different configurations demonstrates the effectiveness of our model on STSB dev set and SUBJ. SDR stands for sentence-dependent principal component removal in Section 2.4.2. SIR stands for sentence-independent principal component removal, i.e. directly removing top corpus principal components from the sentence embedding.

Sensitivity Study. We evaluate the effect of all four hyper-parameters in our model: the window size $m$ in the contextual window matrix, the number of candidate principal components $K$, the number of principal components to remove $h$, and the power of the singular value in coarse sentence embedding, i.e. the power $t$ in $f(\sigma_j) = \sigma_j^t$ in Equation 7. In the STS benchmark and all supervised experiments, these four hyper-parameters are held fixed. We sweep the hyper-parameters and test on STSB dev set, SUBJ, and MPQA. As shown in Figure 2, our model is quite robust with respect to hyper-parameters.

Figure 2: Sensitivity tests on four hyper-parameters, the window size $m$ in contextual window matrix, the number of candidate principal components $K$, the number of principal components to remove $h$, and the exponential power $t$ of singular value in coarse sentence embedding.

Inference speed. We also compare the inference speed of our algorithm on the STSB test set with the benchmark models SkipThought and InferSent. SkipThought and InferSent are run on an NVIDIA Tesla P100, and our model is run on a CPU (Intel® Xeon® CPU E5-2690 v4 @2.60GHz). For fair comparison, batch size in InferSent and SkipThought is set to be 1. The results are shown in Table 5. Even without GPU acceleration, our model is still faster than InferSent and takes less than half the running time of SkipThought.

| | GEM (CPU) | InferSent (GPU) | SkipThought (GPU) |
|---|---|---|---|
| Average running time (seconds) | 20.08 | 21.24 | 43.36 |
| Variance | 0.23 | 0.15 | 0.10 |

Table 5: Running time of GEM, InferSent and SkipThought on encoding sentences in STSB test set. GEM is run on CPU; InferSent and SkipThought are run on GPU. Data are collected from 5 trials.

5 Conclusions

We proposed a simple training-free method to generate sentence embeddings, based entirely on the geometric structure of the subspace spanned by word embeddings. Our sentence embedding evolves from the new orthogonal basis vector brought in by each word, which represents novel semantic meaning. The evaluation shows that our method not only establishes a new state of the art among training-free models but also performs competitively when compared with models requiring either large amounts of training data or prolonged training time. In future work, we plan to incorporate subword information into the model and explore other geometric structures in sentences. The code of GEM will be published soon.


We thank Jade Huang for suggestions on the writing of the paper.


Appendix A Proof

The novelty score ($\alpha_n$), significance score ($\alpha_s$) and corpus-wise uniqueness score ($\alpha_u$) are larger when a word $w$ appears relatively rarely in the corpus and can bring new and important semantic meaning to the sentence.

Following Section 3 in Arora et al. (2017), we can use the probability of a word $w$ emitted from sentence $s$ in a dynamic process to explain Equation 9. We state this as the following theorem, with its proof provided below.

Theorem 1. Suppose the probability that word $w_i$ is emitted from sentence $s$ is given by the following two-term expression (the first term is adapted from Arora et al. (2017), where words near the sentence vector have a higher probability of being generated; the second term is introduced so that words similar to the context in the sentence or close to common words in the corpus are also likely to occur):


where is the sentence embedding, and denotes the vocabulary. Then when is sufficiently large, the MLE for is:


Proof: According to Equation 11,


Where and are two partition functions defined as


The joint probability of sentence is then


To simplify the notation, let . It follows that the log likelihood of word emitted from sentence is given by


By Taylor expansion, we have


Again by Taylor expansion on ,


The approximation is based on the assumption that is sufficiently large. It follows that,


Then the maximum log likelihood estimation of



Appendix B Experimental settings

For all experiments, sentences are tokenized using the NLTK tokenizer (Bird et al., 2009) wordpunct_tokenize, and all punctuation is skipped. The weighting function in Equation 7 is $f(\sigma_j) = \sigma_j^t$. In the STS benchmark dataset and all supervised experiments, the hyper-parameters $m$, $h$, $K$, and $t$ are held fixed. In the CQA dataset, $m$ and $h$ are changed to 6 and 15, and the correlation term in Section 2.4.2 is changed to $o_i = \|S^T d_i\|_2$. In supervised tasks, as in Arora et al. (2017), we do not perform principal component removal.