The standard bag-of-words vector space model (vsm) (ml:SaltonBuckley1988) represents documents as real vectors. Documents are expressed in a basis where each basis vector corresponds to a single term, and each coordinate corresponds to the frequency of a term in a document. Consider the documents
represented in a basis of , where the basis vectors correspond to the terms in the order of first appearance. Then the corresponding document vectors , and would have the following coordinates in :
Assuming is orthonormal, we can take the inner product of the -normalized vectors , and to measure the cosine of the angle (i.e. the cosine similarity) between the documents , and :
Intuitively, this underestimates the true similarity between , and . Assuming is orthogonal but not orthonormal, and that the terms Julius, and Caesar are twice as important as the other terms, we can construct a diagonal change-of-basis matrix from to an orthonormal basis , where corresponds to the importance of a term . This brings us closer to the true similarity:
Since we assume that the bases and are orthogonal, the terms dead and killed contribute nothing to the cosine similarity despite the clear synonymy, because . In general, the vsm will underestimate the true similarity between documents that carry the same meaning but use different terminology.
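The effect described above can be reproduced with a minimal sketch in plain Python. The two toy documents, the helper names `bow`, `cosine`, and `weighted`, and the weight table are hypothetical illustrations and are not the documents or the notation used in the text. The sketch assumes an orthogonal basis, so the only term that contributes to the similarity is the shared one.

```python
import math
from collections import Counter


def bow(tokens, vocabulary):
    """Term-frequency coordinates of a tokenized document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def cosine(u, v):
    """Cosine similarity of two coordinate vectors in an orthonormal basis."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy documents (not the documents from the text).
doc1 = "caesar found dead".split()
doc2 = "caesar was killed".split()

# Basis terms in the order of first appearance.
vocabulary = list(dict.fromkeys(doc1 + doc2))
x, y = bow(doc1, vocabulary), bow(doc2, vocabulary)
similarity = cosine(x, y)

# Diagonal term weighting: "caesar" counts twice as much as other terms.
weights = {"caesar": 2.0}


def weighted(vector):
    """Scale every coordinate by the importance of its term."""
    return [weights.get(t, 1.0) * c for t, c in zip(vocabulary, vector)]


weighted_similarity = cosine(weighted(x), weighted(y))
```

Weighting raises the similarity, but the synonymous terms "dead" and "killed" still contribute nothing, because their basis vectors remain orthogonal.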
In this paper, we further develop the soft vsm described by sidorov2014soft, which does not assume that the basis is orthogonal and which achieved state-of-the-art results on the question answering (qa) task at SemEval 2017 (charletdamnati17). In Section 2, we review the previous work incorporating term similarity into the vsm. In Section 3, we restate the definition of the soft vsm and present several computational complexity results. In Section 4, we describe the implementation in vector databases and inverted indices. We conclude in Section 5 by summarizing our results and suggesting future work.
2. Related work
Most works incorporating term similarity into the vsm published prior to sidorov2014soft remain in an orthogonal coordinate system and instead propose novel document similarity measures. To name a few, mikawa2011proposal propose the extended cosine measure, which introduces a metric matrix as a multiplicative factor in the cosine similarity formula. The metric matrix is the solution of an optimization problem that maximizes the sum of extended cosine measures between each vector and the centroid of the vector's category. Conveniently, the metric matrix can be used directly with the soft vsm, where it defines the inner product between basis vectors. jimenez2012soft equip the multiset vsm with a soft cardinality operator that corresponds to cardinality, but takes term similarities into account.
The notion of generalizing the vsm to non-orthogonal coordinate systems was perhaps first explored by sidorov2014soft in the context of entrance exam question answering, where the basis vectors did not correspond directly to terms, but to n-grams constructed by following paths in syntactic trees. The authors derive the inner product of two basis vectors from the edit distance between the corresponding n-grams. They term the formula for computing the cosine similarity between two vectors expressed in a non-orthogonal basis the soft cosine measure (scm). They also present an algorithm that computes a change-of-basis matrix to an orthonormal basis in time $\mathcal{O}(n^4)$, where $n$ is the size of the vocabulary. We present an $\mathcal{O}(n^3)$ algorithm in this paper.
charletdamnati17 achieved state-of-the-art results at the qa task at SemEval 2017 (nakov2017semeval) by training a document classifier on soft cosine measures between document passages. Unlike sidorov2014soft, charletdamnati17 use basis vectors that correspond to terms rather than to n-grams. They derive the inner product of two basis vectors both from the edit distance between the corresponding terms and from the inner product of the corresponding word2vec term embeddings (mikolov2013efficient).
3. Computational complexity
In this section, we restate the definition of the soft vsm as it was described by sidorov2014soft. We then prove a tighter (i.e. lower) worst-case time complexity bound for computing a change-of-basis matrix to an orthonormal basis. We also prove that, under certain assumptions, the inner product is a linear-time operation.
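Definition 3.1.
Let $\mathbb{R}^n$ be the real $n$-space over $\mathbb{R}$ equipped with the bilinear inner product $\langle\cdot,\cdot\rangle$. Let $\beta = \{e_i\}_{i=1}^n$ be the basis of $\mathbb{R}^n$ in which we express our vectors. Let $W = (w_{ij})$ be a diagonal change-of-basis matrix from $\beta$ to a normalized basis $\beta' = \{e'_i\}_{i=1}^n$ of $\mathbb{R}^n$, i.e. $w_{ii} = \|e_i\| = \sqrt{\langle e_i, e_i\rangle}$ and $w_{ij} = 0$ for $i \neq j$. Let $S = (s_{ij})$ be the metric matrix of $\mathbb{R}^n$ w.r.t. $\beta'$, i.e. $s_{ij} = \langle e'_i, e'_j\rangle$. Then $(\mathbb{R}^n, W, S)$ is a soft vsm.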
Theorem 3.2.
Let $(\mathbb{R}^n, W, S)$ be a soft vsm. Then a change-of-basis matrix $E$ from the normalized basis $\beta'$ to an orthonormal basis of $\mathbb{R}^n$ can be computed in time $\mathcal{O}(n^3)$.
By definition, $S = E^\top E$ for any change-of-basis matrix $E$ from the normalized basis $\beta'$ to an orthonormal basis, where $S$ is the metric matrix. Since $S$ contains inner products of linearly independent vectors, it is Gramian and positive definite (horn2013matrix, p. 441). The Gramianness of $S$ also implies its symmetry. Therefore, a lower triangular $E^\top$ is uniquely determined by the Cholesky factorization of the symmetric positive-definite $S$, which we can compute in time $\mathcal{O}(n^3)$ (stewart1998matrix, p. 191). ∎
See Table 1 for an experimental comparison.
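The factorization used in the proof can be illustrated with a minimal sketch in plain Python. The function names and the 3-term metric matrix are hypothetical, and the sketch assumes a small dense symmetric positive-definite matrix. A practical implementation would rely on an optimized sparse linear algebra routine instead.

```python
import math


def cholesky_lower(s):
    """Return the lower-triangular L with L L^T = s.

    Assumes that s is a dense symmetric positive-definite matrix given
    as a list of rows. The three nested loops give cubic running time.
    """
    n = len(s)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = s[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if acc <= 0.0:
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(acc)
            else:
                low[i][j] = acc / low[j][j]
    return low


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)]
            for row in a]


# Hypothetical metric matrix: the second and third terms are similar.
S = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.5],
     [0.0, 0.5, 1.0]]

L = cholesky_lower(S)
E = transpose(L)  # upper-triangular factor with E^T E = S
reconstructed = matmul(transpose(E), E)
```

Multiplying the factor back together recovers the metric matrix up to rounding, which is the property the proof relies on.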
Although the vocabulary in our introductory example contains only a handful of terms, the vocabulary size $n$ is in the millions for real-world corpora such as the English Wikipedia. Therefore, we generally need to store the metric matrix $S$ in a sparse format, so that it fits into main memory. Later, we will discuss how the density of $S$ can be reduced, but the Cholesky factor $E$ with $S = E^\top E$ can also be arbitrarily dense and therefore expensive to store. Given a permutation matrix $P$, we can instead factorize $P^\top S P$ into $F^\top F$. Finding the permutation matrix $P$ that minimizes the density of the Cholesky factor $F$ is NP-hard (yannakakis1981computing), but heuristic strategies are known (cuthill1969reducing; heggernes2001computational). Using the fact that $P^\top = P^{-1}$, and basic facts about transpose, we can derive $E = F P^\top$ as follows:
$$S = P P^\top S P P^\top = P F^\top F P^\top = (F P^\top)^\top F P^\top = E^\top E.$$
| Terms | Algorithm | Real computation time |
|---|---|---|
| 100 | Cholesky factorization | 0.0006 sec (0.606 ms) |
| 100 | Gaussian elimination | 0.0529 sec (52.893 ms) |
| 500 | Cholesky factorization | 0.0086 sec (8.640 ms) |
| 500 | Gaussian elimination | 22.7361 sec (22.736 sec) |
| 1000 | Cholesky factorization | 0.0304 sec (30.378 ms) |
| 1000 | Gaussian elimination | 354.2746 sec (5.905 min) |
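Lemma 3.3.
Let $(\mathbb{R}^n, W, S)$ be a soft vsm with the diagonal change-of-basis matrix $W$ and the metric matrix $S$. Let $x, y \in \mathbb{R}^n$, and let $(x)_\beta$ denote the coordinates of $x$ in the basis $\beta$. Then $\langle x, y\rangle = (W(x)_\beta)^\top S\,W(y)_\beta$.
Let $E$ be the change-of-basis matrix from the normalized basis $\beta'$ to an orthonormal basis $\gamma$ of $\mathbb{R}^n$. Then:
$$\langle x, y\rangle = ((x)_\gamma)^\top (y)_\gamma = (E W (x)_\beta)^\top E W (y)_\beta = (W(x)_\beta)^\top E^\top E\,W(y)_\beta = (W(x)_\beta)^\top S\,W(y)_\beta. \qquad ∎$$
From here, we can directly derive the cosine of the angle between $x$ and $y$ (i.e. what sidorov2014soft call the scm) as follows:
$$\cos(x, y) = \frac{\langle x, y\rangle}{\sqrt{\langle x, x\rangle}\sqrt{\langle y, y\rangle}} = \frac{(W(x)_\beta)^\top S\,W(y)_\beta}{\sqrt{(W(x)_\beta)^\top S\,W(x)_\beta}\,\sqrt{(W(y)_\beta)^\top S\,W(y)_\beta}}.$$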
The scm is actually the starting point for charletdamnati17, who propose matrices $S$ that are not necessarily metric. If, like them, we are only interested in computing the scm, then we only require that the square roots remain real, i.e. that $\langle x, x\rangle \geq 0$ and $\langle y, y\rangle \geq 0$. For arbitrary $x$ and $y$, this holds iff $S$ is positive semi-definite. However, since the coordinates correspond to non-negative term frequencies, it is sufficient that the matrices $W$ and $S$ are non-negative as well. If we are only interested in computing the inner product, then $S$ can be arbitrary.
Theorem 3.4.
Let $(\mathbb{R}^n, W, S)$ be a soft vsm such that no column of the matrix $S$ contains more than $C$ non-zero elements, where $C$ is a constant. Let $x, y \in \mathbb{R}^n$ and let $m$ be the number of non-zero elements in the coordinates $(x)_\beta$. Then $\langle x, y\rangle$ can be computed in time $\mathcal{O}(m)$.
Assume that $(x)_\beta$ and $S$ are represented by data structures with constant-time column access and non-zero element traversal, e.g. compressed sparse column (csc) matrices. Further assume that the diagonal matrix $W$ is represented by an array containing the main diagonal of $W$. Then Algorithm 1 computes $(W(x)_\beta)^\top S\,W(y)_\beta$ in time $\mathcal{O}(m)$, which, by Lemma 3.3, corresponds to $\langle x, y\rangle$. ∎
Similarly, we can show that if a column of $S$ contains $C$ non-zero elements on average, $\langle x, y\rangle$ has the average-case time complexity of $\mathcal{O}(m)$. Note also that most information retrieval systems impose a limit on the length of a query document. Therefore, $m$ is usually bounded by a constant and $\mathcal{O}(m) = \mathcal{O}(1)$.
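A computation of this kind can be sketched in plain Python as follows. This is not a transcription of Algorithm 1. It is a minimal sketch that assumes a symmetric term similarity matrix stored column by column as dictionaries, sparse vectors stored as dictionaries with constant-time lookup, and the diagonal weights stored as a list. All names and the toy data are hypothetical.

```python
import math


def soft_inner(x, y, w, s_cols):
    """Soft inner product that touches only non-zero entries.

    x, y:    dicts mapping a term index to a non-zero coordinate.
    w:       list holding the main diagonal of the weight matrix.
    s_cols:  list of dicts; s_cols[i] maps a row index j to the
             similarity of terms j and i (matrix assumed symmetric).
    """
    total = 0.0
    for i, x_i in x.items():               # non-zero entries of x
        for j, s_ji in s_cols[i].items():  # bounded number per column
            y_j = y.get(j)
            if y_j is not None:
                total += w[i] * x_i * s_ji * w[j] * y_j
    return total


def soft_cosine(x, y, w, s_cols):
    """Soft cosine measure derived from the soft inner product."""
    norm_x = math.sqrt(soft_inner(x, x, w, s_cols))
    norm_y = math.sqrt(soft_inner(y, y, w, s_cols))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return soft_inner(x, y, w, s_cols) / (norm_x * norm_y)


# Hypothetical vocabulary: 0 = "caesar", 1 = "dead", 2 = "killed".
w = [1.0, 1.0, 1.0]
s_cols = [{0: 1.0},
          {1: 1.0, 2: 0.5},
          {1: 0.5, 2: 1.0}]
x = {0: 1.0, 1: 1.0}  # "caesar dead"
y = {0: 1.0, 2: 1.0}  # "caesar killed"

inner = soft_inner(x, y, w, s_cols)
scm = soft_cosine(x, y, w, s_cols)
```

With an identity similarity matrix, the same two vectors would have a cosine similarity of 0.5, so the off-diagonal similarity between the second and third term is what raises the score.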
Since we are usually interested in the inner products of all document pairs in two corpora (e.g. one containing queries and the other actual documents), we can achieve significant speed improvements with vector processors by computing $(WX)^\top S\,WY$, where $W$ is the diagonal change-of-basis matrix, $S$ is the metric matrix, and $X$ and $Y$ are corpus matrices containing the coordinates of document vectors in the basis $\beta$ as columns. To compute the scm, we first need to normalize the document vectors by performing an entrywise division of every column in $X$ by the square root of the corresponding column sum of $(WX) \circ (SWX)$, where $\circ$ denotes entrywise product. $Y$ is normalized analogously.
There are several strategies for making no column of the matrix $S$ contain more than $C$ non-zero elements. If we do not require that $S$ is metric (e.g. because we only wish to compute the inner product, or the scm), a simple strategy is to start with an empty matrix, and to insert the $C - 1$ largest non-diagonal elements and the diagonal element from every column of $S$. However, the resulting matrix will likely be asymmetric, which makes the inner product formula asymmetric as well. We can regain symmetry by always inserting an element $s_{ij}$ together with the element $s_{ji}$, and only if this does not make the column $i$ contain more than $C$ non-zero elements. This strategy is greedy, since later columns contain non-zero elements inserted by earlier columns. Our preliminary experiments suggest that processing columns that correspond to increasingly frequent terms performs best on the task of charletdamnati17. Finally, by limiting the sum of all non-diagonal elements in a column to be less than one, we can make the matrix strictly diagonally dominant and therefore positive definite, which enables us to compute a change-of-basis matrix to an orthonormal basis through Cholesky factorization.
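The greedy symmetric strategy can be sketched in plain Python as follows. The function `sparsify`, its parameters `max_nonzero` and `order`, and the toy matrix are hypothetical. The sketch assumes a dense symmetric input with a unit diagonal and leaves the choice of the column processing order to the caller.

```python
def sparsify(s, max_nonzero, order=None):
    """Greedily keep at most `max_nonzero` entries per column.

    s:      dense symmetric matrix (list of rows) with a unit diagonal.
    order:  column processing order; defaults to 0, 1, ..., n - 1.
    Returns a list of dicts, one per column: row index -> value.
    """
    n = len(s)
    cols = [{j: s[j][j]} for j in range(n)]  # always keep the diagonal
    for j in (order if order is not None else range(n)):
        # Rows of column j with a non-zero entry, largest value first.
        candidates = sorted(
            (i for i in range(n) if i != j and s[i][j] != 0.0),
            key=lambda i: s[i][j], reverse=True)
        for i in candidates:
            if len(cols[j]) >= max_nonzero:
                break  # column j is full
            if i in cols[j] or len(cols[i]) >= max_nonzero:
                continue  # already present, or the mirror column is full
            cols[j][i] = s[i][j]  # insert one element ...
            cols[i][j] = s[j][i]  # ... together with its mirror image
    return cols


# Hypothetical 4-term similarity matrix.
S = [[1.0, 0.8, 0.6, 0.4],
     [0.8, 1.0, 0.5, 0.3],
     [0.6, 0.5, 1.0, 0.7],
     [0.4, 0.3, 0.7, 1.0]]

sparse = sparsify(S, max_nonzero=2)
```

Because an element is inserted only when both affected columns have room, the result stays symmetric, and earlier columns take precedence over later ones, which is why the processing order matters.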
4. Implementation in vector databases and inverted indices
In this section, we present coordinate transformations for retrieving nearest document vectors according to the inner product, and the soft cosine measure from general-purpose vector databases such as Annoy, or Faiss (JDH17). We also discuss the implementation in the inverted indices of text search engines such as Apache Lucene (bialecki12).
With a vector database, we can transform document vectors to an orthonormal basis $\gamma$. In the transformed coordinates, the dot product corresponds to the inner product and the cosine similarity corresponds to the cosine of an angle (i.e. the soft cosine measure). A vector database that supports nearest neighbor search according to either the dot product or the cosine similarity will therefore retrieve vectors expressed in $\gamma$ according to either the inner product or the soft cosine measure. We can compute a change-of-basis matrix $E$ of order $n$ in time $\mathcal{O}(n^3)$ by Theorem 3.2 and use it to transform every vector $x$ to $\gamma$ by computing $E W (x)_\beta$, where $W$ is the diagonal change-of-basis matrix to the normalized basis. However, this approach requires that the metric matrix $S$ is symmetric positive-definite and that we recompute $E$, and reindex the vector database each time $S$ has changed. We will now discuss transformations that do not require $E$ and for which a non-negative $S$ is sufficient as discussed in the remark for Lemma 3.3.
Theorem 4.1.
Let be a soft vsm. Let such that Then
from Lemma 3.3.∎
By transforming a query vector into , we can retrieve documents according to the inner product in vector databases that only support nearest neighbor search according to the dot product. Note that we do not introduce into , which allows us to change without changing the documents in a vector database and that can be arbitrary as discussed in the remark for Lemma 3.3.
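One transformation with this property can be sketched in plain Python. The sketch assumes that the inner product has the bilinear form $(Wx)^\top S\,(Wy)$ with a diagonal matrix $W$ and an arbitrary matrix $S$, and it folds both matrices into the query. The function names and the toy data are hypothetical, and the transformation is not necessarily the exact one stated in Theorem 4.1.

```python
def soft_inner(x, y, w, s):
    """Reference value: (W x)^T S (W y) for a diagonal W given as list w."""
    n = len(x)
    return sum(w[i] * x[i] * s[i][j] * w[j] * y[j]
               for i in range(n) for j in range(n))


def transform_query(x, w, s):
    """Return W S^T W x, so that its dot product with y is the soft inner product."""
    n = len(x)
    wx = [w[i] * x[i] for i in range(n)]
    # j-th coordinate of S^T (W x)
    st_wx = [sum(s[i][j] * wx[i] for i in range(n)) for j in range(n)]
    return [w[j] * st_wx[j] for j in range(n)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical data; S is deliberately not symmetric.
w = [2.0, 1.0, 1.0]
S = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.5],
     [0.0, 0.25, 1.0]]
query = [1.0, 1.0, 0.0]
document = [1.0, 0.0, 1.0]  # stored in the database unchanged

transformed = transform_query(query, w, S)
score = dot(transformed, document)
```

Since only the query is transformed in this sketch, the similarity matrix can be replaced without touching the stored document vectors.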
Theorem 4.2.
Let be a soft vsm. Let s.t. and Then iff .
. From Lemma 3.3, this equals except for the missing term in the divisor. The term is constant in both , and , so ordering is preserved. ∎
By transforming a query vector into and document vectors into , we can retrieve documents according to the scm in vector databases that only support nearest neighbor search according to the dot product.
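The order-preservation argument can be checked with a minimal sketch in plain Python. It assumes the bilinear form $(Wx)^\top S\,(Wy)$ for the inner product with a symmetric non-negative $S$. Documents are divided by their soft norm before indexing, and the query is multiplied by $W S W$ without normalization. These choices and all names are hypothetical and may differ from the exact transformation in Theorem 4.2.

```python
import math


def soft_inner(x, y, w, s):
    """(W x)^T S (W y) for a diagonal W given as list w."""
    n = len(x)
    return sum(w[i] * x[i] * s[i][j] * w[j] * y[j]
               for i in range(n) for j in range(n))


def scm(x, y, w, s):
    """Soft cosine measure computed directly from its definition."""
    return soft_inner(x, y, w, s) / math.sqrt(
        soft_inner(x, x, w, s) * soft_inner(y, y, w, s))


def query_vector(x, w, s):
    """Unnormalized query: W S W x (S assumed symmetric)."""
    n = len(x)
    wx = [w[i] * x[i] for i in range(n)]
    return [w[j] * sum(s[j][i] * wx[i] for i in range(n)) for j in range(n)]


def document_vector(y, w, s):
    """Document coordinates divided by the soft norm of the document."""
    norm = math.sqrt(soft_inner(y, y, w, s))
    return [c / norm for c in y]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


w = [1.0, 1.0, 1.0]
S = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.5],
     [0.0, 0.5, 1.0]]
query = [1.0, 1.0, 0.0]
documents = {"a": [1.0, 0.0, 1.0], "b": [0.0, 0.0, 2.0], "c": [1.0, 1.0, 0.0]}

q = query_vector(query, w, S)
by_dot = sorted(documents, reverse=True,
                key=lambda d: dot(q, document_vector(documents[d], w, S)))
by_scm = sorted(documents, reverse=True,
                key=lambda d: scm(query, documents[d], w, S))
```

The dot product differs from the scm only by the soft norm of the query, which is the same for every document, so both rankings coincide.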
Theorem 4.3.
Let be a soft vsm s.t. is non-negative.
. Since is non-negative, and , and therefore , and (neyshabur2015symmetric, sec. 4.2). Therefore:
From Lemma 3.3, this equals except for the missing term , and the extra term in the divisor. The terms are constant in both , and , so ordering is preserved. ∎
By transforming a query vector into and document vectors into , we can retrieve documents according to the scm in vector databases that only support nearest neighbor search according to the cosine similarity.
Whereas most vector databases are designed for storing low-dimensional and dense vector coordinates, document vectors have the dimension $n$, which can be in the millions for real-world corpora such as the English Wikipedia. Apart from that, a document contains only a small fraction of the terms in the vocabulary, which makes the coordinates extremely sparse. Therefore, the coordinates need to be converted to a dense low-dimensional representation, using e.g. latent semantic analysis (lsa), before they are stored in a vector database or used for queries.
Unlike vector databases, inverted-index-based search engines are built around a data structure called the inverted index, which maps each term in our vocabulary to a list of documents (a posting) containing the term. Documents in a posting are sorted by a common criterion. The search engine tokenizes a text query into terms, retrieves postings for the query terms, and then traverses the postings, computing similarity between the query and the documents.
We can directly replace the search engine's document similarity formula with the formula for the inner product from Lemma 3.3, or the formula for the scm. After this straightforward change, the system will still only retrieve documents that have at least one term in common with the query. Therefore, we first need to expand the query vector by multiplying it with the term similarity matrix $S$, and then retrieve postings for all terms corresponding to the non-zero coordinates in the expanded vector. The expected number of these terms is $\mathcal{O}(mC)$, where $m$ is the number of non-zero elements in the query vector, and $C$ is the maximum number of non-zero elements in any column of $S$. Assuming $m$ and $C$ are bounded by a constant, $\mathcal{O}(mC) = \mathcal{O}(1)$.
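Query expansion over an inverted index can be sketched in plain Python as follows. The index layout, the function names, the toy corpus, and the table of related terms are hypothetical. The sketch assumes unit term weights and scores documents by the inner product, and it does not reflect the internals of any particular search engine.

```python
from collections import defaultdict


def build_index(documents):
    """Map each term to a posting: {document id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, tokens in documents.items():
        for term in tokens:
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def expand_query(query_tokens, similar):
    """Spread the weight of every query term over its related terms.

    similar[t] maps a related term u to its similarity with t; every
    term is implicitly related to itself with similarity 1.0.
    """
    expanded = defaultdict(float)
    for term in query_tokens:
        expanded[term] += 1.0
        for other, sim in similar.get(term, {}).items():
            expanded[other] += sim
    return dict(expanded)


def search(query_tokens, index, similar):
    """Rank documents by the inner product with the expanded query."""
    scores = defaultdict(float)
    for term, weight in expand_query(query_tokens, similar).items():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += weight * tf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical corpus and term similarities.
documents = {"d1": ["caesar", "found", "dead"],
             "d2": ["brutus", "was", "killed"],
             "d3": ["rome", "was", "founded"]}
similar = {"dead": {"killed": 0.5}, "killed": {"dead": 0.5}}

index = build_index(documents)
results = search(["dead"], index, similar)
```

Without the expansion step, the query would retrieve only the first document, because the second document shares no term with the query.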
5. Conclusion and future work
In this paper, we examined the soft vector space model (vsm) of sidorov2014soft. We restated the definition, we proved a tighter (i.e. lower) time complexity bound of $\mathcal{O}(n^3)$ for a related orthonormalization problem, and we showed how the inner product and the soft cosine measure between document vectors can be efficiently computed in general-purpose vector databases, in the inverted indices of text search engines, and in other applications. To complement this paper, we also contributed an implementation of the scm to Gensim (rehurek_lrec), a free open-source natural language processing library (see https://github.com/RaRe-Technologies/gensim/, pull requests 1827 and 2016).
In our remarks for Theorem 3.4, we discuss strategies for making no column of the term similarity matrix contain more than a constant number of non-zero elements. Future research will evaluate their performance on the semantic text similarity task with public datasets. Various choices of the matrix based on word embeddings, Levenshtein distance, thesauri, and statistical regression, as well as metric matrices from previous work (mikawa2011proposal), will also be evaluated both amongst themselves and against other document similarity measures such as the lda, lsa, and wmd.
We gratefully acknowledge the support by tačr under the Omega program, project td03000295. We also sincerely thank three anonymous reviewers for their insightful comments.