The decomposition of generalization errors into bias and variance (Geman et al, 1992) is one of the most profound insights of learning theory. Bias is caused by low capacity of models when the training samples are assumed to be infinite, whereas variance is caused by overfitting to finite samples. In this article, we apply the analysis to a new set of problems in Compositional Distributional Semantics, which studies the calculation of meanings of natural language phrases by vector representations of their constituent words. We prove an upper bound for the bias of a widely used compositional framework, the additive composition (Foltz et al, 1998; Landauer and Dumais, 1997; Mitchell and Lapata, 2010).
Calculations of meanings are fundamental problems in Natural Language Processing (NLP). In recent years, vector representations have seen great success at conveying meanings of individual words (Levy et al, 2015). These vectors are constructed from statistics of contexts surrounding the words, based on the Distributional Hypothesis that words occurring in similar contexts tend to have similar meanings (Harris, 1954). For example, given a target word $t$, one can consider its context as the close neighbors of $t$ in a corpus, and assess the probability $p_{it}$ of the $i$-th word (in a fixed lexicon) occurring in the context of $t$. Then, the word $t$ is represented by a vector $\big(f(p_{it})\big)_i$ (where $f$ is some function), and words with similar meanings to $t$ will have similar vectors (Miller and Charles, 1991).
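As a concrete illustration of this construction, the following sketch counts context neighbors of a target word in a tiny hand-made corpus (the corpus and the window size are arbitrary choices for the example, not taken from the article):

```python
from collections import Counter

# Toy sketch of a distributional representation: count the words occurring
# within a +/-2 window around each occurrence of the target word.
corpus = ("the tax rate rose while the tax burden fell "
          "and the interest rate rose again").split()

def context_counts(corpus, target, window=2):
    counts = Counter()
    for pos, token in enumerate(corpus):
        if token != target:
            continue
        lo, hi = max(0, pos - window), min(len(corpus), pos + window + 1)
        for j in range(lo, hi):
            if j != pos:
                counts[corpus[j]] += 1
    return counts

vec = context_counts(corpus, "rate")
print(vec.most_common(3))
```

Normalizing these counts and applying a function $f$ entrywise then yields a word vector of the kind described above.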
Beyond the word level, a naturally following challenge is to represent meanings of phrases or even sentences. Based on the Distributional Hypothesis, it is generally believed that such vectors should also be constructed from surrounding contexts, at least for phrases observed in a corpus (Boleda et al, 2013). However, a main obstacle here is that phrases are far more sparse than individual words. For example, in the British National Corpus (BNC) (The BNC Consortium, 2007), which consists of 100M word tokens, a total of 16K lemmatized words are observed more than 200 times, but there are only 46K such bigrams, far fewer than the $16\text{K} \times 16\text{K} = 256\text{M}$ possibilities for two-word combinations. Even with a larger corpus, one might only observe more rare words due to Zipf’s Law, so most two-word combinations will always be rare or unseen. Therefore, a direct estimation of the surrounding contexts of a phrase can have large sampling error. This partially fuels the motivation to construct phrase vectors by combining word vectors (Mitchell and Lapata, 2010), which is also based on the linguistic intuition that meanings of phrases are “composed” from meanings of their constituent words. From a machine learning viewpoint, word vectors have smaller sampling errors, i.e. lower variance, since words are more abundant than phrases. Then, a compositional framework which calculates meanings from word vectors will be favorable if its bias is also small.
Here, “bias” is the distance between two types of phrase vectors: one calculated by composing the vectors of the constituent words (the composed vector), and the other assessed from context statistics in which the phrase itself is treated as a target (the natural vector). The statistics are assessed from an infinitely large ideal corpus, so that the natural vector of the phrase can be reliably estimated without sampling error, hence conveying the meaning of the phrase by the Distributional Hypothesis. If the distance between the two vectors is small, the composed vector can be viewed as a reasonable approximation of the natural vector, hence an approximation of meaning; moreover, the composed vector can be more reliably estimated from finite real corpora, because words are more abundant than phrases. Therefore, an upper bound for the bias provides learning-theoretic support for the composition operation.
A number of compositional frameworks have been proposed in the literature (Baroni and Zamparelli, 2010; Grefenstette and Sadrzadeh, 2011; Socher et al, 2012; Paperno et al, 2014; Hashimoto et al, 2014). Some are complicated methods based on linguistic intuitions (Coecke et al, 2010), and others are compared to human judgments for evaluation (Mitchell and Lapata, 2010). However, none of them has been previously analyzed regarding its bias.[1] The most widely used framework is additive composition (Foltz et al, 1998; Landauer and Dumais, 1997), in which the composed vector is calculated by averaging word vectors. Yet, it was unknown whether this average is by any means related to the statistics of contexts surrounding the corresponding phrases.

[1] Unlike natural vectors, which always lie in the same space as word vectors, some compositional frameworks construct meanings of phrases in different spaces. Nevertheless, we argue that even in such cases it is reasonable to require some mapping to a common space, because humans can usually compare the meanings of a word and a phrase. Then, by considering distances between the mapped images of composed vectors and natural vectors, we can define bias and call for theoretical analysis.
In this article, we prove an upper bound for the bias of additive composition of two-word phrases, and demonstrate several applications of the theory. An overview is given in Figure 1; we summarize as follows.
In Section 2.1, we introduce notation and define the vectors considered in this work. Roughly speaking, we use $p_{iw}$ to denote the probability of the $i$-th word (in a fixed lexicon) occurring within a context window of a target (i.e. a word or phrase) $w$, and define the $i$-th entry of the natural vector as
$$\big[v_w^{(n)}\big]_i \;:=\; \frac{u_n}{\sqrt n}\Big( f\big(p_{iw}/p_i + a_n\big) - b_n \Big),$$
where $p_i$ is the probability of the $i$-th word. Here, $n$ is the lexicon size, $a_n$, $b_n$ and $u_n$ are real numbers, and $f$ is a function. We note that the formalization is general enough to be compatible with several previous lines of research.
In Section 2.2, we describe our bias bound for additive composition, sketch its proof, and emphasize its practical consequences that can be tested on a natural language corpus. Briefly, we show that the more exclusively two successive words tend to occur together, the more accurately one can guarantee their additive composition as an approximation to the natural phrase vector; but this guarantee comes with one condition, namely that $f$ should be a function which decreases steeply around $0$ and grows slowly at $+\infty$. When this condition is satisfied, one can derive an additional property that all natural vectors have approximately the same norm. These consequences are all experimentally verified in Section 5.3.
In Section 2.3, we give a formalized version of the bias bound (Theorem 1), with our assumptions on natural language data clarified. These assumptions include the well-known Zipf’s Law, a similar law applied to word co-occurrences which we call the Generalized Zipf’s Law, and some intuitively acceptable conditions. The assumptions are experimentally tested in Section 5.1 and Section 5.2. Moreover, we show that the Generalized Zipf’s Law can be derived from a widely used generative model for natural language (Section 2.6).
In Section 2.4, we prove some key lemmas regarding the aforementioned condition on the function $f$; in Section 2.5 we formally prove the bias bound (with some supporting lemmas proven in Appendix A), and further give an intuitive explanation for the strength of additive composition. Namely, given two words, the vector of each can be decomposed into two parts, one encoding the contexts shared by both words, and the other encoding contexts not shared. When the two word vectors are added up, the non-shared parts tend to cancel out, because they have nearly independent distributions; as a result, the shared part gets reinforced, and this shared part is exactly what the natural phrase vector encodes.
Empirically, we demonstrate three applications of our theory:
The condition required to be satisfied by $f$ provides a unified explanation of why some recently proposed word vectors are good at additive composition (Section 3.1). Our experiments also verify that this condition drastically affects additive compositionality and other properties of vector representations (Section 5.3, Section 6).
Our intuitive explanation inspires a novel method for making vectors recognize word order, which was long thought to be an issue for additive composition. Briefly speaking, since additive composition cancels out the non-shared parts of word vectors and reinforces the shared part, we show that one can use labels on context words to control what is shared. Specifically, we propose the Near-far Context, in which the contexts of ordered bigrams are shared (Section 3.2). Our experiments show that the resulting vectors can indeed assess meaning similarities between ordered bigrams (Section 5.4), and demonstrate strong performance on phrase similarity tasks (Section 6.1). Unlike previous approaches, the Near-far Context still composes vectors by taking averages, retaining the merits of being parameter-free and having a bias bound.
In this section, we discuss vector representations constructed from an ideal natural language corpus, and establish a mathematical framework for analyzing additive composition. Our analysis makes several assumptions on the ideal corpus, which might be approximations or oversimplifications of real data. In Section 5, we will test these assumptions on a real corpus and verify that the theory still makes reasonable predictions.
2.1 Notation and Vector Representation
A natural language corpus is a sequence of words. Ideally, we assume that the sequence is infinitely long and contains an infinite number of distinct words.
We consider a finite sample of the infinite ideal corpus. In this sample, we denote the number of distinct words by $n$, and use these words as a lexicon to construct vector representations. From the sample, we assess the count $c_i$ of the $i$-th word in the lexicon, and assume that the index is taken such that $c_1 \ge c_2 \ge \cdots \ge c_n$. Let $N := \sum_{i=1}^n c_i$ be the total count, and denote $\hat p_i := c_i / N$.
With a sample corpus given, we can construct vector representations for targets, which are either words or phrases. To define the vectors, one starts from specifying a context for each target, which is usually taken as the words surrounding the target in the corpus. As an example, Table 1 shows a word sequence, a phrase target and two word targets; contexts are taken as the closest four or five words to the targets.
We use $s$, $t$ to denote word targets, and $st$ a phrase target consisting of the two consecutive words $s$ and $t$. When the word order is ignored (i.e., the phrase is either $st$ or $ts$), we denote the target by $\{s,t\}$. A general target is denoted by $w$. Later in this article, we will consider other types of targets as well, and a full list of target types is shown in Table 2.
Let $c_w$ be the count of target $w$, and $c_{iw}$ the count of the $i$-th word co-occurring in the context of $w$. Denote $\hat p_{iw} := c_{iw} / \sum_{j=1}^n c_{jw}$.
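The counts $c_w$ and $c_{iw}$ for a phrase target can be assessed in the same sliding-window fashion. The following is a minimal sketch with an invented toy corpus, where the context of a bigram target is taken as the window around the whole bigram:

```python
from collections import Counter

# Hypothetical illustration of the counts c_w and c_{iw} for a two-word
# phrase target: scan for the bigram and collect its window neighbors.
corpus = "the tax rate rose while the tax rate fell".split()

def phrase_context(corpus, s, t, window=2):
    c_w, counts = 0, Counter()
    for pos in range(len(corpus) - 1):
        if corpus[pos] == s and corpus[pos + 1] == t:
            c_w += 1
            lo, hi = max(0, pos - window), min(len(corpus), pos + 2 + window)
            for j in range(lo, hi):
                if j not in (pos, pos + 1):
                    counts[corpus[j]] += 1
    return c_w, counts

c_w, c_iw = phrase_context(corpus, "tax", "rate")
total = sum(c_iw.values())
p_iw = {word: c / total for word, c in c_iw.items()}  # empirical co-occurrence probabilities
print(c_w, p_iw)
```

The resulting dictionary plays the role of the statistics from which a natural phrase vector is built.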
In order to approximate the ideal corpus, we will take a larger and larger sample, then consider the limit. Under this limit, it is obvious that $n \to \infty$ and $N \to \infty$. Further, we will assume some limit properties on $\hat p_i$ and $\hat p_{iw}$, as specified in Section 2.3. These properties capture our idealization of an infinitely large natural language corpus. In Section 2.6, we will show that such properties can be derived from a Hierarchical Pitman-Yor Process, a widely used generative model for natural language data.
We construct a natural vector $v_w^{(n)}$ for $w$ from the statistics $\hat p_{iw}$ as follows:
$$\big[v_w^{(n)}\big]_i \;:=\; \frac{u_n}{\sqrt n}\Big( f\big(\hat p_{iw}/\hat p_i + a_n\big) - b_n \Big), \qquad 1 \le i \le n.$$
Here, $a_n$, $b_n$ and $u_n$ are real numbers, and $f$ is a smooth function on $(0, +\infty)$. The index $n$ emphasizes that the vector will change if $n$ becomes larger (i.e. a larger sample corpus is taken). The scalar $u_n$ is for normalizing the scales of the vectors. In Section 2.2, we will further specify some conditions on $a_n$, $b_n$ and $u_n$, but without much loss of generality.
To consider $f\big(\hat p_{iw}/\hat p_i + a_n\big)$ instead of $f\big(\hat p_{iw}/\hat p_i\big)$ can be viewed as a smoothing scheme that guarantees $f$ being applied to strictly positive values. We will consider $f$ that is not continuous at $0$, such as $f(x) = \ln x$; yet the entries have to be well-defined even if $\hat p_{iw} = 0$. In practice, the $\hat p_{iw}$ estimated from a finite corpus can often be $0$; theoretically, the smoothing scheme plays a role in our proof as well.
The definition of $v_w^{(n)}$ is general enough to cover a wide range of previously proposed distributional word vectors. For example, if $f(x) = \ln x$ and $a_n = b_n = 0$, then the $i$-th entry is proportional to $\ln\big(\hat p_{iw}/\hat p_i\big)$, the Point-wise Mutual Information (PMI) value that has been widely adopted in NLP (Church and Hanks, 1990; Dagan et al, 1994; Turney, 2001; Turney and Pantel, 2010). More recently, the Skip-Gram with Negative Sampling (SGNS) model (Mikolov et al, 2013a) is shown to be a matrix factorization of the PMI matrix (Levy and Goldberg, 2014b); and the more general form with the constants $a_n$ and $b_n$ is explicitly introduced by the GloVe model (Pennington et al, 2014). Regarding other forms of $f$, it has been reported in Lebret and Collobert (2014) and Stratos et al (2015) that $f(x) = \sqrt{x}$ empirically outperforms $f(x) = \ln x$. We will discuss the function $f$ further in Section 3.1, and review some other distributional vectors in Section 4.
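For instance, a PMI-style vector with $f = \ln$ and a small smoothing constant (playing the role of $a_n$; the toy corpus and the constant are invented for illustration) can be sketched as follows:

```python
import math
from collections import Counter

# Sketch of PMI-style entries f(p_{iw}/p_i) with f = log and an additive
# smoothing term so that log is never applied to 0.
corpus = "the tax rate rose and the tax burden rose too".split()
N = len(corpus)
unigram = Counter(corpus)
p = {w: c / N for w, c in unigram.items()}           # unigram probabilities

# context counts for the target word "tax" (window of 1 on each side)
ctx = Counter()
for pos, tok in enumerate(corpus):
    if tok == "tax":
        for j in (pos - 1, pos + 1):
            if 0 <= j < N:
                ctx[corpus[j]] += 1
total = sum(ctx.values())
a = 1e-3                                             # smoothing constant (assumed)
entry = {w: math.log(ctx.get(w, 0) / total / p[w] + a) for w in p}
print(sorted(entry, key=entry.get, reverse=True)[:2])
```

Words that never co-occur with the target receive the large negative value $\ln a$, while frequent neighbors receive positive entries.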
We finish this section by pointing to Table 3 for a list of frequently used notations.
2.2 Practical Meaning of the Bias Bound
Table 2: List of target types considered in this article.

|Introduced in|Symbol|Description|
|---|---|---|
|Notation 2|$w$|a general target; can denote either of the following|
|Notation 2|$s$, $t$|word targets|
|Notation 2|$st$|two-word phrase target|
|Notation 2|$\{s,t\}$|two-word phrase target with word order ignored|
|Definition 6|$s\backslash t$|a token of word $s$ not next to the word $t$ in corpus|
|Definition 15| |words in the context of $s$ (resp. $t$) are assigned the left-hand-side (resp. right-hand-side) Near-far labels|
|Definition 16| |a target $s$ (resp. $t$) not at the left (resp. right) of word $t$ (resp. $s$)|
|Theorem 1|$s$, $t$|random word targets|
|General| |random word targets can form different types such as $st$ and $\{s,t\}$|
A compositional framework combines the vectors $v_s^{(n)}$ and $v_t^{(n)}$ to represent the meaning of the phrase “s t”. In this work, we study relations between this composed vector and the natural vector of the phrase target $\{s,t\}$.[2] More precisely, we study the Euclidean distance
$$\Big\| v_s^{(n)} \oplus v_t^{(n)} \;-\; v_{\{s,t\}}^{(n)} \Big\|,$$
where $\oplus$ is the composition operation. If a sample corpus is taken larger and larger, we have the limit $n \to \infty$, and $v_{\{s,t\}}^{(n)}$ will be well estimated to represent the meaning of “s t or t s”. Therefore, the above distance can be viewed as the bias of approximating $v_{\{s,t\}}^{(n)}$ by the composed vector $v_s^{(n)} \oplus v_t^{(n)}$. In practice, especially when $\oplus$ is a complicated operation with parameters, it has been a widely adopted approach to learn the parameters by minimizing the same distances for phrases observed in a corpus (Dinu et al, 2013; Baroni and Zamparelli, 2010; Guevara, 2010). These practices further motivate our study on the bias.

[2] Or the target $st$, if one cares about word order, which we will discuss in Section 3.2.
We consider additive composition, where $\oplus$ is a parameter-free composition operator. We define
$$v_s^{(n)} \oplus v_t^{(n)} \;:=\; \frac{1}{2}\Big( v_s^{(n)} + v_t^{(n)} \Big).$$
Table 3: Frequently used notations.

|Introduced in|Symbol|Description|
|---|---|---|
|Notation 1|$i$, $n$|index $1 \le i \le n$, where $n$ is the lexicon size|
|Notation 1|$\hat p_i$|empirical probability of the $i$-th word, $\hat p_i := c_i/N$|
|Notation 3|$\hat p_{iw}$|probability of the $i$-th word co-occurring in context of $w$; defined as $\hat p_{iw} := c_{iw}/\sum_j c_{jw}$|
|Definition 6|$q_{s\backslash t}$|probability for an occurrence of $s$ being non-neighbor of $t$|
|Definition 7|$\mathcal P$|set of observed two-word phrases, word order ignored|
|General|$\mathbb E[\cdot]$, $\mathbb V[\cdot]$|expected value and variance of a random variable|
|General|$\mathbb 1[\cdot]$|indicator; $1$ if the condition is true, $0$ otherwise|
|General|$\Pr[\cdot]$|probability of the condition being true|
|General|$\alpha$, $\beta$, $\ldots$|lowercase Greek letters denote real constants|
|Theorem 1|$X_{iw}$|$X_{iw} := \hat p_{iw}/\hat p_i$, where $w = \{s,t\}$, $s\backslash t$, or $t\backslash s$|
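As a toy numeric sketch of the quantities involved (using synthetic vectors rather than corpus-derived ones): the composed vector is the average of the two word vectors, and the bias is its Euclidean distance to the natural phrase vector. When the non-shared parts of the word vectors behave like independent noise, averaging shrinks them:

```python
import numpy as np

# Additive composition on synthetic vectors: each word vector is a shared
# part (the phrase direction) plus independent noise; averaging the two
# word vectors halves each noise term, so the composed vector lands closer
# to the phrase vector than either word vector alone (typically).
rng = np.random.default_rng(0)
shared = rng.normal(size=50)
v_phrase = shared / np.linalg.norm(shared)   # natural phrase vector, norm 1
noise_s = rng.normal(size=50) * 0.3          # non-shared part of word s
noise_t = rng.normal(size=50) * 0.3          # non-shared part of word t
v_s = v_phrase + noise_s
v_t = v_phrase + noise_t

composed = (v_s + v_t) / 2
bias = np.linalg.norm(composed - v_phrase)
print(bias, np.linalg.norm(v_s - v_phrase))
```

This mirrors, in miniature, the cancellation intuition developed later in this section.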
Our analysis starts from the observation that every word in the context of $\{s,t\}$ also occurs in the contexts of $s$ and of $t$: as illustrated in Table 1, if a word token (e.g. “rate”) comes from a phrase (e.g. “tax rate”), and if the context window size is not too small, the context of this token is almost the same as the context of the phrase. This motivates us to decompose the context of a word target into two parts, one coming from the phrase $\{s,t\}$ and the other not.
Define the target $s\backslash t$ as the tokens of word $s$ which do not occur next to word $t$ in the corpus. We use $q_{s\backslash t}$ to denote the probability of $s$ not occurring next to $t$, conditioned on a token of word $s$. Practically, $q_{s\backslash t}$ can be estimated by the count ratio $c_{s\backslash t}/c_s$. Then, we have the following equation:
$$\hat p_{is} \;=\; (1 - q_{s\backslash t})\,\hat p_{i\{s,t\}} \;+\; q_{s\backslash t}\,\hat p_{i,s\backslash t}, \qquad (1)$$
because a word in the context of $s$ occurs in the context of either $\{s,t\}$ or $s\backslash t$.
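The decomposition can be checked mechanically on a toy corpus (invented here): occurrences of $s$ split into tokens adjacent to $t$ and tokens not adjacent to $t$, and the context counts of the two groups add up exactly to the context counts of $s$:

```python
from collections import Counter

# Partition the contexts of word s by whether the token of s is adjacent
# to t; the two buckets reconstruct the context counts of s exactly.
corpus = "tax rate rose the rate fell a flat tax rate was proposed".split()
s, t, window = "rate", "tax", 2

def window_words(pos):
    return [corpus[j] for j in range(max(0, pos - window),
                                     min(len(corpus), pos + window + 1))
            if j != pos]

ctx_all, ctx_with_t, ctx_without_t = Counter(), Counter(), Counter()
n_all = n_with = 0
for pos, tok in enumerate(corpus):
    if tok != s:
        continue
    n_all += 1
    next_to_t = (pos > 0 and corpus[pos - 1] == t) or \
                (pos + 1 < len(corpus) and corpus[pos + 1] == t)
    if next_to_t:
        n_with += 1
    bucket = ctx_with_t if next_to_t else ctx_without_t
    for wd in window_words(pos):
        ctx_all[wd] += 1
        bucket[wd] += 1

q = 1 - n_with / n_all   # fraction of s-tokens occurring away from t
assert all(ctx_all[w] == ctx_with_t[w] + ctx_without_t[w] for w in ctx_all)
print(q)
```

Here `q` plays the role of the estimated count ratio described above.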
We can view $q_{s\backslash t}$ and $q_{t\backslash s}$ as indicating how weak the “collocation” between $s$ and $t$ is. When $q_{s\backslash t}$ and $q_{t\backslash s}$ are small, $s$ and $t$ tend to occur next to each other exclusively, so $v_s^{(n)}$ and $v_t^{(n)}$ are likely to correlate with $v_{\{s,t\}}^{(n)}$, making the bias small. This is the fundamental idea of our bias bound, which estimates the bias in terms of $q_{s\backslash t}$ and $q_{t\backslash s}$. We give a detailed sketch below. First, by the Triangle Inequality one immediately has
$$\Big\| \frac{v_s^{(n)} + v_t^{(n)}}{2} - v_{\{s,t\}}^{(n)} \Big\| \;\le\; \frac{1}{2}\Big\| v_s^{(n)} - v_{\{s,t\}}^{(n)} \Big\| + \frac{1}{2}\Big\| v_t^{(n)} - v_{\{s,t\}}^{(n)} \Big\|.$$
Then, we note that both the composed vector and the natural vector scale with $u_n$. Without loss of generality, we can assume that $u_n$ is normalized such that the average norm of the natural vectors equals $1$. Thus, if we can prove that $\lim_{n\to\infty} \|v_w^{(n)}\| = 1$ for every target $w$, we will have an upper bound
$$\lim_{n\to\infty} \Big\| \frac{v_s^{(n)} + v_t^{(n)}}{2} - v_{\{s,t\}}^{(n)} \Big\| \;\le\; 2. \qquad (2)$$
This is intuitively obvious, because if all vectors lie on the unit sphere, distances between them will be at most the diameter $2$. In this article, we will show that, roughly speaking, it is indeed the case that $\lim_{n\to\infty} \|v_w^{(n)}\| = 1$ for “every” target $w$, and the above “upper bound” can further be strengthened using Equation (1).
More precisely, we will prove that if a target phrase is randomly chosen, then $\|v_w^{(n)}\|$ converges to $1$ in probability. The argument is sketched as follows. First, when the target $w$ is random, the $\hat p_{iw}$'s become random variables. We assume that for each pair $i \ne j$, $\hat p_{iw}$ and $\hat p_{jw}$ are independent random variables. Note this assumption in contrast to the fact that $\sum_{i=1}^n \hat p_{iw} = 1$; nonetheless, we assume that natural language is random enough so that when $w$ changes, no obvious relation exists between $\hat p_{iw}$ and $\hat p_{jw}$. Thus, the entries $[v_w^{(n)}]_i$ ($1 \le i \le n$, $w$ random) are independent and we can apply the Law of Large Numbers:
$$\big\|v_w^{(n)}\big\|^2 \;=\; \sum_{i=1}^n \big[v_w^{(n)}\big]_i^2 \;\xrightarrow{\;P\;}\; \sum_{i=1}^n \mathbb E\Big[\big[v_w^{(n)}\big]_i^2\Big] \;\to\; 1. \qquad (3)$$
In words, the fluctuations of the squared entries cancel each other out, and their sum converges to its expectation. However, Equation (3) requires a stronger statement than the ordinary Law of Large Numbers; namely, we do not assume that the summands are identically distributed.[3] For this generalized Law of Large Numbers we need some technical conditions. One necessary condition is that the expectations exist, which we prove by explicitly calculating them; another requirement is that the fluctuations of the summands must be at comparable scales, so that they indeed cancel out. This is formalized as a uniform integrability condition, and we will show that it imposes a non-trivial constraint on the function $f$ in the definition of word vectors. Finally, if Equation (3) holds, by setting $w$ to the phrase target we are done.

[3] This is reasonable, because $\hat p_{iw}$ is likely to be at the same scale as $\hat p_i$, whereas $\hat p_i$ varies for different $i$.
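The flavor of this generalized Law of Large Numbers can be simulated. Below, the summands are independent but deliberately not identically distributed (each has a different scale), yet their normalized sum still approaches the corresponding sum of expectations; the exponential distributions are an assumption of the sketch, not the article's setting:

```python
import numpy as np

# Non-i.i.d. Law of Large Numbers: independent exponential variables whose
# means differ per index; the ratio of the sample sum to the sum of
# expectations still approaches 1 as the number of summands grows.
rng = np.random.default_rng(1)

def normalized_sum(n):
    scales = 1.0 / np.sqrt(np.arange(1, n + 1))   # a different scale per index
    x = rng.exponential(scale=scales)             # E[x_i] = scales[i]
    return x.sum() / scales.sum()

print(normalized_sum(100), normalized_sum(100_000))
```

The fluctuations of the large early summands are balanced by the many small later ones, which is exactly what the uniform integrability condition is meant to guarantee.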
From Equation (1), since $f$ is smooth, we can justify the approximation
$$v_s^{(n)} \;\approx\; (1 - q_{s\backslash t})\,v_{\{s,t\}}^{(n)} \;+\; q_{s\backslash t}\,v_{s\backslash t}^{(n)}$$
as long as $q_{s\backslash t}$ and $q_{t\backslash s}$ are small compared to $1$. Then, we will rigorously prove that, when $n$ is sufficiently large, the total error in the above approximation becomes infinitesimal:
$$\Big\| v_s^{(n)} - (1 - q_{s\backslash t})\,v_{\{s,t\}}^{(n)} - q_{s\backslash t}\,v_{s\backslash t}^{(n)} \Big\| \;\xrightarrow{\;P\;}\; 0.$$
So we can replace $v_s^{(n)}$ and $v_t^{(n)}$ in the definition of the bias:
$$\frac{v_s^{(n)} + v_t^{(n)}}{2} - v_{\{s,t\}}^{(n)} \;\approx\; \frac{1}{2}\Big( q_{s\backslash t}\big(v_{s\backslash t}^{(n)} - v_{\{s,t\}}^{(n)}\big) + q_{t\backslash s}\big(v_{t\backslash s}^{(n)} - v_{\{s,t\}}^{(n)}\big) \Big).$$
With arguments similar to the previous paragraph, we have that $\|v_{s\backslash t}^{(n)}\|$ and $\|v_{t\backslash s}^{(n)}\|$ converge to $1$ in probability. Therefore, by the Triangle Inequality we get
$$\lim_{n\to\infty} \Big\| \frac{v_s^{(n)} + v_t^{(n)}}{2} - v_{\{s,t\}}^{(n)} \Big\| \;\le\; q_{s\backslash t} + q_{t\backslash s},$$
a better bound than (2). However, our bias bound is even stronger than this. Our further argument goes to the intuition that $v_{s\backslash t}^{(n)}$ should have “positive correlation” with $v_{\{s,t\}}^{(n)}$, because as targets, both $s\backslash t$ and $\{s,t\}$ contain the word $s$; on the other hand, $v_{s\backslash t}^{(n)}$ and $v_{t\backslash s}^{(n)}$ should be “independent”, because the targets $s\backslash t$ and $t\backslash s$ cover disjoint tokens of different words. With this intuition, we will derive the following bias bound:
$$\lim_{n\to\infty} \Big\| \frac{v_s^{(n)} + v_t^{(n)}}{2} - v_{\{s,t\}}^{(n)} \Big\| \;\le\; \sqrt{\frac{q_{s\backslash t}^2 + q_{t\backslash s}^2}{2}}.$$
A brief explanation can be found in Section 2.5, after the formal proof. Our experiments suggest that this bound is remarkably tight (Section 5.3). In addition, the intuitive explanation inspires a way to make additive composition aware of word order (Section 3.2).
In the rest of this section, we will formally normalize $u_n$, $a_n$ and $b_n$ for simplicity of discussion. These are mild conditions and do not affect the generality of our results. Then, we will summarize our claim of the bias bound, focusing on its practical verifiability.
Definition 7. Let $\mathcal P$ be the set of two-word phrases observed in a finite corpus, word order ignored. We normalize $u_n$ such that the average norm of the natural phrase vectors becomes $1$:
$$\frac{1}{|\mathcal P|} \sum_{\{s,t\} \in \mathcal P} \big\| v_{\{s,t\}}^{(n)} \big\| \;=\; 1.$$
Since $b_n$ is canceled out in the definition of the bias, it does not affect the bias. It does affect the norms of vectors; we set $b_n$ such that the centroid of the natural phrase vectors becomes $0$:
Definition 8.
$$\frac{1}{|\mathcal P|} \sum_{\{s,t\} \in \mathcal P} v_{\{s,t\}}^{(n)} \;=\; 0. \qquad (5)$$
Note that, if the centroid of the natural phrase vectors were far from $0$, the normalization in Definition 7 would cause all phrase vectors to cluster around one point on the unit sphere. Then, the phrase vectors would not be able to distinguish different meanings of phrases. The choice of $b_n$ in Definition 8 prevents such degenerate cases.
Next, if $u_n$ and $b_n$ are fixed, the bias is taking its minimum at
$$\frac{1}{n} \sum_{i=1}^n \big[v_w^{(n)}\big]_i \;=\; 0. \qquad (6)$$
Hence, it is favorable to have the above equality. We can achieve it by adjusting $a_n$ such that the entries of each vector average to $0$.
Practically, one can calculate $a_n$ and $b_n$ by first assuming $b_n = 0$ in (6) to obtain $a_n$, and then substituting $a_n$ into (5) to obtain the actual $b_n$. The value of $a_n$ will not change, because if all vectors have average entry $0$, so does their centroid. In Section 2.4, we will derive asymptotic values of $u_n$, $a_n$ and $b_n$ theoretically.
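A simplified numeric sketch of the two normalizations follows; for simplicity the shift here is taken as the full centroid vector, rather than the scalar constants $a_n$ and $b_n$ of the text, and the raw vectors are random placeholders:

```python
import numpy as np

# Center the phrase vectors so the centroid is 0 (role of the shift), then
# rescale so the average norm is 1 (role of u_n).
rng = np.random.default_rng(2)
V = rng.normal(loc=0.5, scale=2.0, size=(1000, 20))  # stand-in phrase vectors

centroid = V.mean(axis=0)
V_centered = V - centroid                            # centroid becomes 0
u = 1.0 / np.mean(np.linalg.norm(V_centered, axis=1))
V_norm = u * V_centered                              # average norm becomes 1

print(np.linalg.norm(V_norm.mean(axis=0)),
      np.mean(np.linalg.norm(V_norm, axis=1)))
```

After this two-step normalization, the vectors spread over the unit sphere instead of clustering around a single point, matching the motivation given above.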
We assume there is a constant $\gamma$ with $0 \le \gamma < 1/2$ determining $f$. So $f$ can be either $\ln x$ (if $\gamma = 0$) or $x^\gamma$ (if $0 < \gamma < 1/2$). This assumption is mainly for simplicity; intuitively, only the behavior of $f$ near $0$ matters, because $f$ is applied to probability values which are close to $0$. Indeed, our results can be generalized to any $f$ that behaves like $x^\gamma$ or $\ln x$ asymptotically.
Our bias bound is summarized as follows.
Claim 1. Assume $f(x) = \ln x$ or $f(x) = x^\gamma$ with $0 < \gamma < 1/2$, the factors $u_n$, $a_n$ and $b_n$ are normalized as above, and the distributional vectors are constructed from an ideal natural language corpus. Then:
$$\lim_{n\to\infty} \Big\| \frac{v_s^{(n)} + v_t^{(n)}}{2} - v_{\{s,t\}}^{(n)} \Big\| \;\le\; \sqrt{\frac{q_{s\backslash t}^2 + q_{t\backslash s}^2}{2}}.$$
As expected, for more “collocational” phrases, $q_{s\backslash t}$ and $q_{t\backslash s}$ are smaller, so the bias bound becomes stronger. Claim 1 states a prediction that can be empirically tested on a real large corpus; namely, one can estimate $q_{s\backslash t}$ and $q_{t\backslash s}$ from the corpus, construct the vectors for a fixed $n$, and then check whether the inequality holds approximately while omitting the limit. In Section 5.3, we conduct this experiment and verify the prediction. Our theoretical assumptions on the “ideal natural language corpus” will be specified in Section 2.3.
Besides being empirically verifiable for phrases observed in a real corpus, the true value of Claim 1 is that the upper bound holds for an arbitrarily large ideal corpus. We can assume any plausible two-word phrase to occur sufficiently many times in the ideal corpus, even when it is unseen in the real one. In that case, a natural vector for the phrase can only be reliably estimated from the ideal corpus, but Claim 1 suggests that the additive composition of word vectors provides a reasonable approximation to that unseen natural vector. Meanwhile, since word vectors can be reliably estimated from the real corpus, Claim 1 endorses additive composition as a reasonable meaning representation for unseen or rare phrases. On the other hand, it endorses additive composition for frequent phrases as well, because such phrases usually have strong collocations, and Claim 1 says that the bias in this case is small.
The condition on the function $f$ is crucial; we discuss its empirical implications in Section 3.1.
Under the same conditions as in Claim 1, we have $\lim_{n\to\infty} \|v_w^{(n)}\| = 1$ for all targets $w$.
2.3 Formalization and Assumptions on Natural Language Data
Theorem 1. For an ideal natural language corpus, we assume that:
(A) Zipf's Law: $\hat p_i = \dfrac{1}{i \ln n}$, approximately, for $1 \le i \le n$.
(B) Let $s$, $t$ be randomly chosen word targets. If $w = \{s,t\}$, $w = s\backslash t$, or $w = t\backslash s$, then:
(B1) For $n$ fixed, the values $\hat p_{iw}$ ($1 \le i \le n$) can be viewed as independent random variables.
(B2) Put $X_{iw} := \hat p_{iw}/\hat p_i$. There exist constants $\alpha$ and $\xi$ such that $\Pr[X_{iw} > x] = \alpha/x$ for $x \ge \xi$.
(C) For each $i$ and each pair $s$, $t$, the random variables $\hat p_{i,s\backslash t}$ and $\hat p_{i,t\backslash s}$ are independent, whereas $\hat p_{i,s\backslash t}$ and $\hat p_{i\{s,t\}}$ have positive correlation.
Then, if $f(x) = \ln x$ or $f(x) = x^\gamma$ with $0 < \gamma < 1/2$, we have
$$\lim_{n\to\infty} \Big\| \frac{v_s^{(n)} + v_t^{(n)}}{2} - v_{\{s,t\}}^{(n)} \Big\| \;\le\; \sqrt{\frac{q_{s\backslash t}^2 + q_{t\backslash s}^2}{2}} \quad \text{in probability.}$$
We explain the assumptions of Theorem 1 in detail below.
Assumption (A) is Zipf’s Law (Zipf, 1935), which states that the frequency of the $i$-th word is inversely proportional to $i$. So $\hat p_i$ is proportional to $1/i$, and the factor $1/\ln n$ comes from the equations $\sum_{i=1}^n \hat p_i = 1$ and $\sum_{i=1}^n 1/i \approx \ln n$. One immediate implication of Zipf’s Law is that one can make $\hat p_i$ arbitrarily small by choosing $n$ and $i$ sufficiently large. More precisely, for any $\epsilon > 0$, we have
$$\hat p_i = \frac{1}{i \ln n}, \qquad (7)$$
so as long as $n$ is large enough that $\ln n \ge 1/\epsilon$, every $i$ in (7) satisfies $\hat p_i \le \epsilon$. The limit $n \to \infty$ will be extensively explored in our theory.
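These Zipf numerics are easy to reproduce: normalizing $1/i$ over a lexicon of size $n$ gives the harmonic number $H_n \approx \ln n$ as the constant, and every individual probability shrinks as $n$ grows:

```python
import math

# Zipf's Law numerics: p_i proportional to 1/i, normalized by the harmonic
# number H_n, which grows like ln(n); even the top word's probability
# vanishes as the lexicon size grows.
def zipf_probs(n):
    h = sum(1.0 / i for i in range(1, n + 1))   # H_n, roughly ln(n) + 0.577
    return [1.0 / (i * h) for i in range(1, n + 1)]

for n in (10**3, 10**5):
    p = zipf_probs(n)
    print(n, p[0], math.log(n))
```

The probabilities sum to one by construction, while the largest of them decays like $1/\ln n$.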
When a target $w$ is randomly chosen, (B1) assumes that the probability value $\hat p_{iw}$ is random enough that, when $i \ne j$, there is no obvious relation between $\hat p_{iw}$ and $\hat p_{jw}$ (i.e., they are independent). We test this assumption in Section 5.1. Assumption (B2) suggests that $\hat p_{iw}$ is at the same scale as $\hat p_i$, and that the random variable $X_{iw}$ has a power law tail[4] of index $1$. We regard (B2) as the Generalized Zipf’s Law, analogous to Zipf’s Law because the $X_{iw}$’s ($1 \le i \le n$, $w$ fixed) can also be viewed as i.i.d. samples drawn from a power law of index $1$. In Section 2.6, we show that Assumption (B) is closely related to a Hierarchical Pitman-Yor Process; and in Section 5.2 we empirically verify this assumption.

[4] The assumption can further be relaxed to an asymptotic tail $\Pr[X_{iw} > x] \sim \alpha/x$. We only consider (B2) for simplicity.
Assumption (C) is based on the intuition that, since $s\backslash t$ and $t\backslash s$ are different word targets, and $\hat p_{i,s\backslash t}$ and $\hat p_{i,t\backslash s}$ are assessed from disjoint parts of the corpus, the two random variables should be independent. On the other hand, the targets $s\backslash t$ and $\{s,t\}$ both contain the word $s$, so we expect $\hat p_{i,s\backslash t}$ and $\hat p_{i\{s,t\}}$ to have positive correlation. This assumption is also empirically tested in Section 5.1.
Remark 4. Since $X_{iw}$ has a power law tail of index $1$, its probability density is a multiple of $x^{-2}$ for sufficiently large $x$. Hence, $\mathbb E[f(X_{iw})^2]$ becomes an integral of $f(x)^2 x^{-2}$, so the convergence of $\int^{+\infty} f(x)^2 x^{-2}\,\mathrm dx$ is a necessary condition for the existence of $\mathbb E[f(X_{iw})^2]$; for $f(x) = x^\gamma$, this requires $\gamma < 1/2$. Conversely, the condition is usually sufficient as well, for instance if $X_{iw}$ follows the Pareto Distribution (i.e. $\Pr[X_{iw} > x] = 1/x$ for $x \ge 1$) or the Inverse-Gamma Distribution. Another example will be given in Section 2.6.
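The role of the tail can be checked numerically. Under the density $x^{-2}$ on $[1,\infty)$ (Pareto with tail index $1$), the truncated integral corresponding to $\mathbb E[X]$ keeps growing like the logarithm of the cutoff, while the one for $\mathbb E[(\ln X)^2]$ levels off near its finite value $2$ (plain midpoint-rule integration with assumed cutoffs):

```python
import math

# Numerically integrate f(x)^2 * x^-2 from 1 up to a cutoff `upper`,
# approximating E[f(X)^2] under the Pareto density x^-2 on [1, inf).
def truncated_mean(f, upper, steps=200_000):
    total, dx = 0.0, (upper - 1.0) / steps
    for k in range(steps):
        x = 1.0 + (k + 0.5) * dx
        total += f(x) ** 2 * x ** -2 * dx
    return total

for upper in (10.0, 1000.0, 100000.0):
    ident = truncated_mean(lambda x: math.sqrt(x), upper)  # f(x)^2 = x, i.e. E[X]
    logsq = truncated_mean(math.log, upper)                # E[(ln X)^2]
    print(upper, ident, logsq)
```

This is exactly why a slowly growing $f$ such as $\ln$ keeps the required second moment finite, while a fast-growing $f$ does not.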
Lemma 1. Define $F_{in} := f\big(X_{iw} + a_n\big)$. Put $m_{in} := \mathbb E[F_{in}]$, $v_{in} := \mathbb V[F_{in}]$, and $V_n := \frac{1}{n}\sum_{i=1}^n v_{in}$. Then:
(a) There exists a constant $\kappa$ such that $m_{in}^2 \le \kappa$ and $v_{in} \le \kappa$ for all $i$ and $n$.
(b) The set of random variables $\{F_{in}^2\}$ is uniformly integrable; i.e., for any $\epsilon > 0$, there exists $M$ such that $\mathbb E\big[F_{in}^2\,\mathbb 1[F_{in}^2 > M]\big] \le \epsilon$ for all $i$ and $n$.
(c) The averages $\frac{1}{n}\sum_{i=1}^n m_{in}$ and $V_n$ converge as $n \to \infty$, where the limits are determined by the power law tail of $X_{iw}$.
The lemma calculates the first and second moments of $F_{in}$; note that $F_{in}$ differs from the rescaled vector entry $\sqrt n\,[v_w^{(n)}]_i / u_n$ only by a constant shift.[5] As the lemma shows, when $i$ and $n$ vary, the squared first moment and the variance scale with the constant $\kappa$. In the limit $n \to \infty$, the lemma further suggests that the average first and second moments converge, which is where the power law tail of $X_{iw}$ mostly affects the behavior of $F_{in}$.

[5] Namely, the constant $b_n$. As a further clue, in the upcoming Theorem 2 we will prove asymptotic expressions for $b_n$ and $u_n$.
The random variable in question is dominated by an integrable one and converges pointwise as $n \to \infty$. Then, by Lebesgue’s Dominated Convergence Theorem, we can generalize Lemma 1 to a wider class of functions $f$, and in turn generalize our bias bound.
Regarding the asymptotic behavior of $a_n$, we have:
(a) For any $\epsilon > 0$, there exists $N$ such that $a_n \le \epsilon$ for all $n \ge N$.
(b) For any fixed $x > 0$, we have $f(x + a_n) \to f(x)$.
2.4 Why is the Condition on $f$ Important?
As we note in Remark 4, the condition on $f$ is necessary for the existence of $\mathbb E[f(X_{iw})^2]$. This existence is important because, briefly speaking, the Law of Large Numbers only holds when expected values exist. More precisely, we use the following lemma to prove the convergences in probability in Theorem 1, and in particular Equation (3) as discussed in Section 2.2. If $\mathbb E[f(X_{iw})^2]$ does not exist, the required uniform integrability is not satisfied, which means that the fluctuations of the random variables may have too different scales to completely cancel out, so the weighted averages we consider will not converge.
Assume the $Y_i$’s ($i = 1, 2, \ldots$) are independent random variables and the set $\{Y_i\}$ is uniformly integrable. Assume $\lim_{i\to\infty} \mathbb E[Y_i] = \mu$, and let the weights $d_i \ge 0$ satisfy $D_n := \sum_{i=1}^n d_i \to \infty$ with $\max_{1 \le i \le n} d_i / D_n \to 0$. Then,
$$\frac{1}{D_n}\sum_{i=1}^n d_i\,Y_i \;\xrightarrow{\;P\;}\; \mu.$$
This lemma is a combination of the Law of Large Numbers and the Stolz-Cesàro Theorem. We prove it in two steps.
First step, we prove
$$\frac{1}{D_n}\sum_{i=1}^n d_i\,Y_i \;-\; \frac{1}{D_n}\sum_{i=1}^n d_i\,\mathbb E[Y_i] \;\xrightarrow{\;P\;}\; 0.$$
This is a generalized version of the Law of Large Numbers, saying that the weighted average of the $Y_i$’s converges in probability to the weighted average of their expectations. Since $\{Y_i\}$ is uniformly integrable, for any $\epsilon > 0$ there exists $M$ such that $\mathbb E\big[\,|Y_i|\,\mathbb 1[|Y_i| > M]\,\big] \le \epsilon$ for all $i$. Our strategy is to divide the average of the $Y_i$’s into two parts, namely
$$\frac{1}{D_n}\sum_{i=1}^n d_i\,Y_i\,\mathbb 1[|Y_i| \le M] \;+\; \frac{1}{D_n}\sum_{i=1}^n d_i\,Y_i\,\mathbb 1[|Y_i| > M],$$
and show that each part is close to its expectation. For the second part, we have
$$\mathbb E\Big[\,\Big|\frac{1}{D_n}\sum_{i=1}^n d_i\,Y_i\,\mathbb 1[|Y_i| > M]\Big|\,\Big] \;\le\; \frac{1}{D_n}\sum_{i=1}^n d_i\,\mathbb E\big[\,|Y_i|\,\mathbb 1[|Y_i| > M]\,\big] \;\le\; \epsilon$$
by definition, so it has negligible expectation and can be bounded by Markov’s Inequality:
$$\Pr\Big[\,\frac{1}{D_n}\sum_{i=1}^n d_i\,|Y_i|\,\mathbb 1[|Y_i| > M] \;>\; \sqrt\epsilon\,\Big] \;\le\; \frac{\epsilon}{\sqrt\epsilon} \;=\; \sqrt\epsilon.$$
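The truncation step can be illustrated by simulation (with an assumed lognormal distribution standing in for the heavy-tailed summands): the part exceeding the level $M$ has a small average $\epsilon$, and the fraction of trials on which this average exceeds $\sqrt\epsilon$ stays below the Markov bound $\sqrt\epsilon$:

```python
import numpy as np

# Simulation of the truncation + Markov step: average only the part of the
# summands exceeding a level M; its expectation eps is small, and Markov's
# inequality bounds P(tail average > sqrt(eps)) by eps/sqrt(eps) = sqrt(eps).
rng = np.random.default_rng(3)
n, M, trials = 2_000, 200.0, 500
tail_means = []
for _ in range(trials):
    y = rng.lognormal(mean=0.0, sigma=1.5, size=n)   # heavy-tailed samples
    tail_means.append(np.where(y > M, y, 0.0).mean())
eps = float(np.mean(tail_means))                     # estimate of E[tail average]
freq = float(np.mean([m > np.sqrt(eps) for m in tail_means]))
print(eps, np.sqrt(eps), freq)
```

In practice the empirical frequency sits far below the Markov bound, which is what makes the truncated tail negligible in the proof.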