1 Introduction
Data sparseness is an inherent problem in statistical methods for natural language processing. Such methods use statistics on the relative frequencies of configurations of elements in a training corpus to learn how to evaluate alternative analyses or interpretations of new samples of text or speech. The most likely analysis is taken to be the one that contains the most frequent configurations. The problem of data sparseness, also known as the zero-frequency problem [Witten & Bell 1991], arises when analyses contain configurations that never occurred in the training corpus. Then it is not possible to estimate probabilities from observed frequencies, and some other estimation scheme that can generalize from the training data has to be used.
In language processing applications, the sparse data problem occurs even for very large data sets. For example, Essen and Steinbiss (1992) report that in a 75%/25% split of the million-word LOB corpus, 12% of the bigrams in the test partition did not occur in the training portion. For trigrams, the sparse data problem is even more severe: for instance, researchers at IBM [Brown, Della Pietra, de Souza, Lai & Mercer 1992] examined a training corpus of almost 366 million English words, and discovered that one can expect 14.7% of the word triples in any new English text to be absent from the training sample. Thus, estimating the probability of unseen configurations is crucial to accurate language modeling, since the aggregate probability of these unseen events can be significant.
We focus here on a particular kind of configuration, word cooccurrence. Examples of such cooccurrences include relationships between head words in syntactic constructions (verb-object or adjective-noun, for instance) and word sequences ($n$-grams). In commonly used models, the probability estimate for a previously unseen cooccurrence is a function of the probability estimates for the words in the cooccurrence. For example, in word bigram models, the probability $P(w_2|w_1)$ of a conditioned word $w_2$ that has never occurred in training following the conditioning word $w_1$ is typically calculated from the probability of $w_2$, as estimated by $w_2$'s frequency in the corpus [Jelinek, Mercer & Roukos 1992; Katz 1987]. This method makes an independence assumption on the cooccurrence of $w_1$ and $w_2$: the more frequent $w_2$ is, the higher the estimate of $P(w_2|w_1)$ will be, regardless of $w_1$.
Class-based and similarity-based models provide an alternative to the independence assumption. In these models, the relationship between given words is modeled by analogy with other words that are in some sense similar to the given ones.
For instance, Brown et al. (1992) suggest a class-based $n$-gram model in which words with similar cooccurrence distributions are clustered into word classes. The cooccurrence probability of a given pair of words is then estimated according to an averaged cooccurrence probability of the two corresponding classes. Pereira, Tishby and Lee (1993) propose a "soft" distributional clustering scheme for certain grammatical cooccurrences in which membership of a word in a class is probabilistic. Cooccurrence probabilities of words are then modeled by averaged cooccurrence probabilities of word clusters.
Dagan, Marcus and Markovitch (1993, 1995) present a similarity-based model that avoids building clusters. Instead, each word is modeled by its own specific class, a set of words that are most similar to it. Using this scheme, they predict which unobserved cooccurrences are more likely than others. Their model, however, does not provide probability estimates and so cannot be used as a component of a larger probabilistic model, as would be required in, say, speech recognition.
Class-based and similarity-based methods for cooccurrence modeling may at first sight seem to be special cases of the clustering and weighted nearest-neighbor approaches used widely in machine learning and pattern recognition [Aha, Kibler & Albert 1991; Cover & Hart 1967; Duda & Hart 1973; Stanfill & Waltz 1986; Devroye, Györfi & Lugosi 1996; Atkeson, Moore & Schaal 1997]. There are important differences, however, between those methods and ours. Clustering and nearest-neighbor techniques often rely on representing objects as points in a multidimensional space with coordinates determined by the values of intrinsic object features. However, in most language-modeling settings, all we know about a word are the frequencies of its cooccurrences with other words in certain configurations. Since the purpose of modeling is to estimate the probabilities of cooccurrences, the same cooccurrence statistics are the basis for both the similarity measure and the model predictions. That is, the only means we have for measuring word similarity are the predictions words make about what words they cooccur with, whereas in typical instance-based or (non-distributional) clustering learning methods, word similarity is defined from intrinsic features independently of the predictions (cooccurrence probabilities or classifications) associated with particular words (see for instance the work of Cardie 1993; Ng & Lee 1996; Ng 1997; and Zavrel & Daelemans 1997).

1.1 Main Contributions
Our main contributions are a general scheme for using word similarity to improve the probability estimates of backoff models, and a comparative analysis of several similarity measures and parameter settings in two important language processing tasks, language modeling and disambiguation, showing that similarity-based estimates are indeed useful.
In our initial study, a language-model evaluation, we used a similarity-based model to estimate unseen bigram probabilities for Wall Street Journal text and compared it to a standard backoff model [Katz 1987]. Testing on a held-out sample, the similarity model achieved a 20% perplexity reduction over backoff for unseen bigrams. These constituted 10.6% of the test sample, leading to an overall reduction in test-set perplexity of 2.4%. The similarity-based model was also tested in a speech-recognition task, where it yielded a statistically significant reduction in recognition error (32 versus 64 mistakes in cases where the two models disagreed).
In the disambiguation evaluation, we compared several variants of our initial method and the cooccurrence smoothing method of Essen and Steinbiss (1992) against the estimation method of Katz in a decision task involving unseen pairs of direct objects and verbs. We found that all the similarity-based models performed almost 40% better than backoff, which yielded about 49% accuracy in our experimental setting. Furthermore, a scheme based on the Jensen-Shannon divergence [Rao 1982; Lin 1991] yielded a statistically significant improvement in error rate over cooccurrence smoothing. (To the best of our knowledge, this is the first use of this particular distribution dissimilarity function in statistical language processing. The function itself is implicit in earlier work on distributional clustering [Pereira, Tishby & Lee 1993] and has been used by Tishby (p.c.) in other distributional similarity work. Finch (1993) discusses its use in word clustering, but does not provide an experimental evaluation on actual data.)
We also investigated the effect of removing extremely low-frequency events from the training set. We found that, in contrast to backoff smoothing, where such events are often discarded from training with little discernible effect, similarity-based smoothing methods suffer noticeable performance degradation when singletons (events that occur exactly once) are omitted.
The paper is organized as follows. Section 2 describes the general similaritybased framework; in particular, Section 2.3 presents the functions we use as measures of similarity. Section 3 details our initial language modeling experiments. Section 4 describes our comparison experiments on a pseudoword disambiguation task. Section 5 discusses related work. Finally, Section 6 summarizes our contributions and outlines future directions.
2 Distributional Similarity Models
We wish to model conditional probability distributions arising from the cooccurrence of linguistic objects, typically words, in certain configurations. We thus consider pairs $(w_1, w_2) \in V_1 \times V_2$ for appropriate sets $V_1$ and $V_2$, not necessarily disjoint. In what follows, we use subscript $i$ for the $i$th element of a pair; thus $P(w_2|w_1)$ is the conditional probability (or rather, some empirical estimate drawn from a base language model, the true probability being unknown) that a pair has second element $w_2$ given that its first element is $w_1$; and $P(w_1|w_2)$ denotes the probability estimate, according to the base language model, that $w_1$ is the first word of a pair given that the second word is $w_2$. $P(w)$ denotes the base estimate for the unigram probability of word $w$.

A similarity-based language model consists of three parts: a scheme for deciding which word pairs require a similarity-based estimate, a method for combining information from similar words, and, of course, a function measuring the similarity between words. We give the details of each of these three parts in the following three sections. We will only be concerned with similarity between words in $V_1$, which are the conditioning events for the probabilities that we want to estimate.
2.1 Discounting and Redistribution
Data sparseness makes the maximum likelihood estimate (MLE) for word pair probabilities unreliable. The MLE for the probability of a word pair $(w_1, w_2)$, conditional on the appearance of word $w_1$, is simply

$$P_{ML}(w_2|w_1) = \frac{c(w_1, w_2)}{c(w_1)} \qquad (1)$$

where $c(w_1, w_2)$ is the frequency of $(w_1, w_2)$ in the training corpus and $c(w_1)$ is the frequency of $w_1$. However, $P_{ML}(w_2|w_1)$ is zero for any unseen word pair; that is, any such pair would be predicted as impossible. More generally, the MLE is unreliable for events with small nonzero counts as well as for those with zero counts. In the language modeling literature, the term smoothing is used to refer to methods for adjusting the probability estimates of small-count events away from the MLE to try to alleviate its unreliability. Our proposals address the zero-count problem exclusively, and we rely on existing techniques to smooth other small counts.
Previous proposals for the zero-count problem [Good 1953; Jelinek, Mercer & Roukos 1992; Katz 1987; Church & Gale 1991] adjust the MLE so that the total probability of seen word pairs is less than one, leaving some probability mass to be redistributed among the unseen pairs. In general, the adjustment involves either interpolation, in which the MLE is used in linear combination with an estimator guaranteed to be nonzero for unseen word pairs, or discounting, in which a reduced MLE is used for seen word pairs, with the probability mass left over from this reduction used to model unseen pairs.
The backoff method of Katz (1987) is a prime example of discounting:

$$\hat{P}(w_2|w_1) = \begin{cases} P_d(w_2|w_1) & \text{if } c(w_1, w_2) > 0 \\ \alpha(w_1)\, P_r(w_2|w_1) & \text{otherwise} \end{cases} \qquad (2)$$

where $P_d(w_2|w_1)$ represents the Good-Turing discounted estimate [Katz 1987] for seen word pairs, and $P_r(w_2|w_1)$ denotes the model for probability redistribution among the unseen word pairs; $\alpha(w_1)$ is a normalization factor. Since an extensive comparison study by Chen and Goodman (1996) indicated that backoff is better than interpolation for estimating bigram probabilities, we will not consider interpolation methods here; however, one could easily incorporate similarity-based estimates into an interpolation framework as well.
In his original backoff model, Katz used the unigram probability $P(w_2)$ as the model $P_r(w_2|w_1)$ for unseen word pairs; that is, his model backed off to a unigram model for unseen bigrams. However, it is conceivable that backing off to a more detailed model than unigrams would be advantageous. Therefore, we generalize Katz's formulation by writing $P_r(w_2|w_1)$ instead of $P(w_2)$, enabling us to use similarity-based estimates for unseen word pairs instead of unigram frequency. Observe that similarity estimates are used for unseen word pairs only.
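The discounting-and-redistribution structure of equation (2) can be sketched in a few lines. This is an illustrative toy rather than the experimental setup: the corpus, the flat absolute discount (standing in for Good-Turing discounting), and the unigram redistribution model are all invented for the example.

```python
from collections import Counter

# Toy corpus of (w1, w2) pairs; a flat 0.5 absolute discount stands in
# for the Good-Turing discounting used by Katz.
pairs = [("buy", "stock"), ("buy", "stock"), ("buy", "bond"), ("sell", "stock")]
pair_counts = Counter(pairs)
w1_counts = Counter(w1 for w1, _ in pairs)
vocab2 = {w2 for _, w2 in pairs}
DISCOUNT = 0.5

def p_discounted(w1, w2):
    # Reduced MLE for seen pairs; the shaved-off mass goes to unseen pairs.
    return (pair_counts[(w1, w2)] - DISCOUNT) / w1_counts[w1]

def p_unigram(w2):
    # Unigram redistribution model, as in Katz's original backoff.
    return sum(c for (_, x), c in pair_counts.items() if x == w2) / sum(pair_counts.values())

def alpha(w1):
    # Normalization factor: mass left over from discounting, divided by the
    # redistribution model's mass on the pairs unseen with this w1.
    left_over = 1.0 - sum(p_discounted(w1, x) for x in vocab2 if pair_counts[(w1, x)] > 0)
    unseen_mass = sum(p_unigram(x) for x in vocab2 if pair_counts[(w1, x)] == 0)
    return left_over / unseen_mass

def p_backoff(w1, w2):
    if pair_counts[(w1, w2)] > 0:
        return p_discounted(w1, w2)
    return alpha(w1) * p_unigram(w2)
```

With these toy counts the probabilities conditioned on "sell" sum to one, seen and unseen pairs together, which is what the normalization factor guarantees.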
We next investigate estimates for $P_r(w_2|w_1)$ derived by averaging information from words that are distributionally similar to $w_1$.
2.2 Combining Evidence
Similarity-based models make the following assumption: if word $w_1'$ is "similar" to word $w_1$, then $w_1'$ can yield information about the probability of unseen word pairs involving $w_1$. We use a weighted average of the evidence provided by similar words, or neighbors, where the weight given to a particular word $w_1'$ depends on its similarity to $w_1$.
More precisely, let $W(w_1', w_1)$ denote an increasing function of the similarity between $w_1$ and $w_1'$, and let $S(w_1)$ denote the set of words most similar to $w_1$. Then the general form of similarity model we consider is a weighted linear combination of predictions of similar words:

$$P_{SIM}(w_2|w_1) = \sum_{w_1' \in S(w_1)} \frac{W(w_1', w_1)}{N(w_1)}\, P(w_2|w_1') \qquad (3)$$

where $N(w_1) = \sum_{w_1' \in S(w_1)} W(w_1', w_1)$ is a normalization factor. According to this formula, $w_2$ is more likely to occur with $w_1$ if it tends to occur with the words that are most similar to $w_1$.
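A direct transcription of equation (3) can serve as a sketch; the weight function, neighbor set, and base probabilities are all passed in, and the numbers in the example are invented for illustration.

```python
def p_sim(w2, w1, neighbors, weight, p_base):
    """Equation (3): normalized, similarity-weighted average of the
    base-model predictions P(w2 | w1') over w1's neighbor set S(w1)."""
    norm = sum(weight(n, w1) for n in neighbors)
    return sum(weight(n, w1) * p_base(w2, n) for n in neighbors) / norm

# Invented example: two neighbors, equal weights.
base = {("stock", "buy"): 0.7, ("stock", "sell"): 0.5}
p_base = lambda w2, n: base.get((w2, n), 0.0)
uniform = lambda n, w1: 1.0
estimate = p_sim("stock", "acquire", ["buy", "sell"], uniform, p_base)  # 0.6
```

With non-uniform weights the estimate simply tilts toward the closer neighbor's prediction, as the formula dictates.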
Considerable latitude is allowed in defining the set $S(w_1)$, as is evidenced by previous work that can be put in the above form. Essen and Steinbiss (1992) and Karov and Edelman (1996) (implicitly) set $S(w_1) = V_1$. However, it may be desirable to restrict $S(w_1)$ in some fashion for efficiency reasons, especially if $V_1$ is large. For instance, in the language modeling application of Section 3, we use the $k$ closest words $w_1'$ or fewer, such that the dissimilarity between $w_1$ and $w_1'$ is less than a threshold value $t$; $k$ and $t$ are tuned experimentally.
One can directly replace $P_r(w_2|w_1)$ in the backoff equation (2) with $P_{SIM}(w_2|w_1)$. However, other variations are possible, such as interpolating $P_{SIM}$ with the unigram probability $P(w_2)$:

$$P_r(w_2|w_1) = \gamma\, P(w_2) + (1 - \gamma)\, P_{SIM}(w_2|w_1), \qquad 0 \le \gamma \le 1.$$

This represents, in effect, a linear combination of the similarity estimate and the backoff estimate: if $\gamma = 1$, then we have exactly Katz's backoff scheme. In the language modeling task (Section 3) we set $\gamma$ experimentally; to simplify our comparison of different similarity models for sense disambiguation (Section 4), we set $\gamma$ to 0.
It would be possible to make $\gamma$ depend on $w_1$, so that the contribution of the similarity estimate could vary among words. Such dependences are often used in interpolated models [Jelinek & Mercer 1980; Jelinek, Mercer & Roukos 1992; Saul & Pereira 1997] and are indeed advantageous. However, since they introduce hidden variables, they require a more complex training algorithm, and we did not pursue that direction in the present work.
2.3 Measures of Similarity
We now consider several word similarity measures that can be derived automatically from the statistics of a training corpus, as opposed to being derived from manually constructed word classes [Yarowsky 1992; Resnik 1992; Resnik 1995; Luk 1995; Lin 1997]. Sections 2.3.1 and 2.3.2 discuss two related information-theoretic functions, the KL divergence and the Jensen-Shannon divergence. Section 2.3.3 describes the $L_1$ norm, a geometric distance function. Section 2.3.4 examines the confusion probability, which has previously been employed in language modeling tasks. There are, of course, many other possible functions; we have opted to restrict our attention to this reasonably diverse set.
For each function, a corresponding weight function $W(w_1', w_1)$ is given. The choice of weight function is to some extent arbitrary; the requirement that it be increasing in the similarity between $w_1$ and $w_1'$ is not extremely constraining. While performance clearly depends on using a good weight function, it would be impossible to try all conceivable candidates. Therefore, in Section 4.5, we describe experiments evaluating similarity-based models both with and without weight functions.
All the similarity functions we describe depend on some base language model $P(\cdot|\cdot)$, which may or may not be the Katz discounted model from Section 2.1 above. While we discuss the complexity of computing each similarity function, it should be noted that in our current implementation this is a one-time cost: we construct the matrix of word-to-word similarities before any parameter training takes place.
2.3.1 KL divergence
The Kullback-Leibler (KL) divergence is a standard information-theoretic measure of the dissimilarity between two probability mass functions [Kullback 1959; Cover & Thomas 1991]. We can apply it to the conditional distributions that words in $V_1$ induce on words in $V_2$:
$$D(w_1 \| w_1') = \sum_{w_2} P(w_2|w_1) \log \frac{P(w_2|w_1)}{P(w_2|w_1')} \qquad (4)$$
$D(w_1 \| w_1')$ is nonnegative, and is zero if and only if $P(w_2|w_1) = P(w_2|w_1')$ for all $w_2$. However, the KL divergence is non-symmetric and does not obey the triangle inequality.
For $D(w_1 \| w_1')$ to be defined it must be the case that $P(w_2|w_1') > 0$ whenever $P(w_2|w_1) > 0$. Unfortunately, this generally does not hold for MLEs based on samples; we must use smoothed estimates that redistribute some probability mass to zero-frequency events. But this forces the sum in (4) to be over all $w_2 \in V_2$, which makes this calculation expensive for large vocabularies.
Once the divergence is computed, we set

$$W(w_1', w_1) = 10^{-\beta D(w_1 \| w_1')}.$$

The role of the free parameter $\beta$ is to control the relative influence of the neighbors closest to $w_1$: if $\beta$ is high, then $W(w_1', w_1)$ is non-negligible only for those $w_1'$ that are extremely close to $w_1$, whereas if $\beta$ is low, distant neighbors also contribute to the estimate. We chose a negative exponential function of the KL divergence for the weight function by analogy with the form of the cluster membership function in related distributional clustering work [Pereira, Tishby & Lee 1993], and also because that is the form for the probability that $w_1$'s distribution arose from a sample drawn from the distribution of $w_1'$ [Cover & Thomas 1991; Lee 1997]. However, these reasons are heuristic rather than theoretical, since we do not have a rigorous probabilistic justification for similarity-based methods.
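The KL-based weight can be sketched directly from equation (4) and the exponential form above. Distributions here are plain dictionaries; the smoothing needed to keep the denominator nonzero is assumed to have been done by the caller, and the example distributions are invented.

```python
import math

def kl_divergence(p, q):
    """Equation (4): D(p || q) = sum_x p(x) * log(p(x) / q(x)).
    q must be nonzero wherever p is -- hence the need for smoothed
    base-model estimates. Base-10 logs are used here."""
    return sum(px * math.log10(px / q[x]) for x, px in p.items() if px > 0)

def kl_weight(p, q, beta):
    """W(w1', w1) = 10 ** (-beta * D(w1 || w1')); a larger beta
    concentrates the weight on the very nearest neighbors."""
    return 10.0 ** (-beta * kl_divergence(p, q))

# Invented smoothed distributions over the same support.
p = {"stock": 0.5, "bond": 0.3, "plan": 0.2}
q = {"stock": 0.4, "bond": 0.4, "plan": 0.2}
```

A word's divergence to itself is zero, so its weight as its own neighbor is the maximum, 1.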
2.3.2 Jensen-Shannon divergence
A related measure is the Jensen-Shannon divergence [Rao 1982; Lin 1991], which can be defined as the average of the KL divergence of each of two distributions to their average distribution:

$$J(w_1, w_1') = \frac{1}{2}\left[ D\!\left(w_1 \,\Big\|\, \frac{w_1 + w_1'}{2}\right) + D\!\left(w_1' \,\Big\|\, \frac{w_1 + w_1'}{2}\right) \right] \qquad (5)$$

where $\frac{w_1 + w_1'}{2}$ is shorthand for the distribution $\frac{1}{2}\big( P(\cdot|w_1) + P(\cdot|w_1') \big)$.
Since the KL divergence is nonnegative, $J(w_1, w_1')$ is also nonnegative. Furthermore, letting $p = P(\cdot|w_1)$ and $q = P(\cdot|w_1')$, it is easy to see that

$$J(w_1, w_1') = H\!\left(\frac{p+q}{2}\right) - \frac{1}{2} H(p) - \frac{1}{2} H(q) \qquad (6)$$

where $H(p) = -\sum_x p(x) \log p(x)$ is the entropy of the discrete density $p$. This equation shows that $J$ gives the information gain achieved by distinguishing the two distributions $p$ and $q$ (conditioning on contexts $w_1$ and $w_1'$) over pooling the two distributions (ignoring the distinction between $w_1$ and $w_1'$).
It is also easy to see that $J(w_1, w_1')$ can be computed efficiently, since it depends only on those conditioned words that occur in both contexts. Indeed, letting $A = \{ w_2 : P(w_2|w_1) > 0,\ P(w_2|w_1') > 0 \}$, writing $p(w_2) = P(w_2|w_1)$ and $q(w_2) = P(w_2|w_1')$, and grouping the terms of (6) appropriately, we obtain

$$J(w_1, w_1') = \log 2 + \frac{1}{2} \sum_{w_2 \in A} \big[ h(p(w_2) + q(w_2)) - h(p(w_2)) - h(q(w_2)) \big]$$

where $h(x) = -x \log x$. Therefore, $J$ is bounded, ranging between $0$ and $\log 2$, and smoothed estimates are not required because probability ratios are not involved.
As in the KL divergence case, we set $W(w_1', w_1) = 10^{-\beta J(w_1, w_1')}$; $\beta$ plays the same role as before.
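The common-words form makes $J$ cheap to compute over sparse count vectors. A sketch, using base-2 logarithms (so $J$ ranges between 0 and $\log_2 2 = 1$); the distributions are ordinary dictionaries with zero entries omitted.

```python
import math

def h(x):
    """h(x) = -x * log2(x), with h(0) = 0 by continuity."""
    return 0.0 if x == 0 else -x * math.log2(x)

def jensen_shannon(p, q):
    """J(p, q) via the grouped form: log 2 plus a sum over only those
    words with nonzero probability in both distributions."""
    common = set(p) & set(q)
    return math.log2(2) + 0.5 * sum(
        h(p[x] + q[x]) - h(p[x]) - h(q[x]) for x in common)
```

Identical distributions give 0; distributions with disjoint supports give the maximum, 1 bit, since the sum over common words is then empty.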
2.3.3 $L_1$ norm
The $L_1$ norm is defined as

$$L(w_1, w_1') = \sum_{w_2} \big| P(w_2|w_1) - P(w_2|w_1') \big| \qquad (7)$$

By grouping terms as before, we can express $L(w_1, w_1')$ in a form depending only on the "common" words $A = \{ w_2 : P(w_2|w_1) > 0 \text{ and } P(w_2|w_1') > 0 \}$:

$$L(w_1, w_1') = 2 - \sum_{w_2 \in A} \big[ P(w_2|w_1) + P(w_2|w_1') - |P(w_2|w_1) - P(w_2|w_1')| \big]$$

It follows from the triangle inequality that $0 \le L(w_1, w_1') \le 2$, with equality to $2$ if and only if there are no words $w_2$ such that both $P(w_2|w_1)$ and $P(w_2|w_1')$ are strictly positive.
Since we require a weighting scheme that is decreasing in $L$, we set

$$W(w_1', w_1) = \big( 2 - L(w_1, w_1') \big)^{\beta}$$

with $\beta$ again free. (We experimented with $10^{-\beta L(w_1, w_1')}$ as well, but it yielded poorer performance results.) As before, the higher $\beta$ is, the more relative influence is accorded to the nearest neighbors.
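A sketch of the $L_1$ distance and its weight, again exploiting the common-words form so that sparse distributions need not be expanded over the whole vocabulary; inputs are dictionaries with zero entries omitted.

```python
def l1_norm(p, q):
    """Equation (7) via the grouped form: L = 2 - sum over common words
    of [p + q - |p - q|]; ranges from 0 (identical) to 2 (disjoint)."""
    common = set(p) & set(q)
    return 2.0 - sum(p[x] + q[x] - abs(p[x] - q[x]) for x in common)

def l1_weight(p, q, beta):
    """W = (2 - L)**beta, increasing in similarity; a larger beta favors
    the nearest neighbors more strongly."""
    return (2.0 - l1_norm(p, q)) ** beta
```

The grouped form agrees with the direct sum of absolute differences: for two distributions over the same two words it reduces to exactly $\sum_{w_2} |p - q|$.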
It is interesting to note the following relations between the $L_1$ norm, the KL divergence, and the Jensen-Shannon divergence. Cover and Thomas (1991) give the following lower bound:

$$D(w_1 \| w_1') \ge \frac{1}{2 \ln b}\, L(w_1, w_1')^2$$

where $b$ is the base of the logarithm function. Lin (1991) notes that $L$ is an upper bound for $J$:

$$J(w_1, w_1') \le \frac{1}{2}\, L(w_1, w_1').$$
2.3.4 Confusion probability
Extending work by Sugawara et al. (1985), Essen and Steinbiss (1992) used confusion probability to estimate word cooccurrence probabilities. (They actually present two alternative definitions; we use their model 2B, which they found yielded the best experimental results.) They report a 14% improvement in test-set perplexity (defined below) on a small corpus. The confusion probability was also used by Grishman and Sterling (1993) to estimate the likelihood of selectional patterns.
The confusion probability $P_C(w_1'|w_1)$ is an estimate of the probability that word $w_1'$ can be substituted for word $w_1$, in the sense of being found in the same contexts:

$$P_C(w_1'|w_1) = \frac{1}{P(w_1)} \sum_{w_2} P(w_1|w_2)\, P(w_1'|w_2)\, P(w_2)$$

($P(w_1)$ serves as a normalization factor). In contrast to the distance functions described above, $P_C$ has the curious property that $w_1$ may not necessarily be the "closest" word to itself, that is, there may exist a word $w_1' \neq w_1$ such that $P_C(w_1'|w_1) > P_C(w_1|w_1)$; see Section 4.4 for an example.
The confusion probability can be computed from empirical estimates provided all unigram estimates are nonzero (as we assume throughout). In fact, the use of smoothed estimates such as those provided by Katz's backoff scheme is problematic, because those estimates typically do not preserve consistency with respect to marginal estimates and Bayes's rule (that is, it may be that $\sum_{w_1} P(w_2|w_1)\, P(w_1) \neq P(w_2)$). However, using consistent estimates (such as the MLE), we can safely apply Bayes's rule to rewrite $P_C$ as follows:

$$P_C(w_1'|w_1) = \sum_{w_2} \frac{P(w_2|w_1)\, P(w_2|w_1')\, P(w_1')}{P(w_2)} \qquad (8)$$

As with the Jensen-Shannon divergence and the $L_1$ norm, this sum requires computation only over the "common" $w_2$'s.
Examination of Equation (8) reveals an important difference between the confusion probability and the functions $D$, $J$, and $L$ described in the previous sections. Those functions rate $w_1'$ as similar to $w_1$ if, roughly, $P(w_2|w_1')$ is high when $P(w_2|w_1)$ is. $P_C(w_1'|w_1)$, however, is greater for those $w_1'$ for which $P(w_2|w_1')$ is large when the ratio $P(w_2|w_1)/P(w_2)$ is. When this ratio is large, we may think of $w_2$ as being exceptional, since if $w_2$ is infrequent, we do not expect $P(w_2|w_1)$ to be large.
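Equation (8) can be transcribed directly, again summing only over the shared contexts. The sketch assumes the estimates passed in are mutually consistent (e.g., all MLEs from the same counts), as the text requires; the toy corpus is invented.

```python
def confusion_prob(w1_alt, w1, p_cond, p_uni_1, p_uni_2):
    """Equation (8): P_C(w1'|w1) = P(w1') * sum over common w2 of
    P(w2|w1) * P(w2|w1') / P(w2). p_cond maps each w1 to its
    distribution P(.|w1); p_uni_1 and p_uni_2 hold the unigram
    estimates for first and second pair elements respectively."""
    p, q = p_cond[w1], p_cond[w1_alt]
    common = set(p) & set(q)
    return p_uni_1[w1_alt] * sum(p[x] * q[x] / p_uni_2[x] for x in common)

# Consistent MLEs from a toy corpus of four (w1, w2) pairs:
# (buy, stock) x2, (sell, stock), (sell, bond).
p_cond = {"buy": {"stock": 1.0}, "sell": {"stock": 0.5, "bond": 0.5}}
p_uni_1 = {"buy": 0.5, "sell": 0.5}
p_uni_2 = {"stock": 0.75, "bond": 0.25}
```

Because the estimates are consistent, $P_C(\cdot|w_1)$ is a proper distribution: the values summed over all candidate substitutes come to one.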
2.3.5 Summary
Several features of the measures of similarity listed above are summarized in Table 1. "Base LM constraints" are conditions that must be satisfied by the probability estimates of the base language model. The last column indicates whether the weight $W(w_1', w_1)$ associated with each similarity function depends on a parameter ($\beta$) that needs to be tuned experimentally.
name   range        base LM constraints                    tune?
D      [0, ∞]       P(w_2|w_1') > 0 if P(w_2|w_1) > 0      yes
J      [0, log 2]   none                                   yes
L      [0, 2]       none                                   yes
P_C    [0, 1]       Bayes consistency                      no
3 Language Modeling
The goal of our first set of experiments, described in this section, was to provide a proof of concept by showing that similarity-based models can achieve better language modeling performance than backoff. We therefore used only one similarity measure. The success of these experiments convinced us that similarity-based methods are worth examining more closely; the results of our second set of experiments, comparing several similarity functions on a pseudoword disambiguation task, are described in the next section.
Our language modeling experiments used a similarity-based model, with the KL divergence as (dis)similarity measure, as an alternative to unigram frequency when backing off in a bigram model. That is, we used the bigram language model defined by:

$$\hat{P}(w_2|w_1) = \begin{cases} P_d(w_2|w_1) & \text{if } c(w_1, w_2) > 0 \\ \alpha(w_1)\big[ \gamma P(w_2) + (1 - \gamma) P_{SIM}(w_2|w_1) \big] & \text{otherwise} \end{cases} \qquad (9)$$

where $V_1 = V_2$ is the entire vocabulary. As noted earlier, the estimates $P(w_2|w_1')$ must be smoothed to avoid division by zero when computing $D(w_1 \| w_1')$; we employed the standard Katz bigram backoff model for that purpose. Since the vocabulary is large in this application, we considered only a small fraction of it in computing $S(w_1)$, using the tunable thresholds $k$ and $t$ described in Section 2.2 for this purpose.
The standard evaluation metric for language models is the likelihood of the test data according to the model, or, more intuitively, the test-set perplexity

$$PP = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 \hat{P}(w_i | w_{i-1})}$$

which represents the average number of alternatives presented by the (bigram) model after each test word. Thus, a better model will have a lower perplexity. In our task, lower perplexity will indicate better prediction of unseen bigrams.
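The perplexity formula reduces to a few lines of code; this sketch takes the per-word model probabilities as input, so any bigram model can be plugged in.

```python
import math

def perplexity(word_probs):
    """Test-set perplexity: 2 ** (-(1/N) * sum_i log2 P(w_i | w_{i-1})),
    i.e. the geometric mean of the inverse word probabilities."""
    n = len(word_probs)
    return 2.0 ** (-sum(math.log2(p) for p in word_probs) / n)
```

A model that always offered four equally likely alternatives would assign probability 0.25 to every test word and score a perplexity of exactly 4, matching the "average number of alternatives" reading.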
We evaluated the above model by comparing its test-set perplexity and effect on speech-recognition accuracy with the baseline bigram backoff model developed by MIT Lincoln Laboratories for the Wall Street Journal (WSJ) text and dictation corpora provided by ARPA's HLT program [Paul 1991]. (The ARPA WSJ development corpora come in two versions, one with verbalized punctuation and the other without; we used the latter in all our experiments.) The baseline backoff model follows the Katz design, except that, for the sake of compactness, all frequency-one bigrams are ignored. The counts used in this model and in ours were obtained from 40.5 million words of WSJ text from the years 1987-89.
For perplexity evaluation, we tuned the similarity model parameters by minimizing perplexity on an additional sample of 57.5 thousand words of WSJ text, drawn from the ARPA HLT development test set. The best parameter values found were $k = 60$, $t = 2.5$, $\beta = 4.0$, and $\gamma = 0.15$. For these values, the improvement in perplexity for unseen bigrams in a held-out 18 thousand word sample (the ARPA HLT evaluation test set) is just over 20%. Since unseen bigrams comprise 10.6% of this sample, the improvement on unseen bigrams corresponds to an overall test-set perplexity improvement of 2.4% (from 237.4 to 231.7).
k     t     β     γ      training reduction (%)   test reduction (%)
60    2.5   4.0   0.15   18.4                     20.51
50    2.5   4.0   0.15   18.38                    20.45
40    2.5   4.0   0.2    18.34                    20.03
30    2.5   4.0   0.25   18.33                    19.76
70    2.5   4.0   0.1    18.3                     20.53
80    2.5   4.5   0.1    18.25                    20.55
100   2.5   4.5   0.1    18.23                    20.54
90    2.5   4.5   0.1    18.23                    20.59
20    1.5   4.0   0.3    18.04                    18.7
10    1.5   3.5   0.3    16.64                    16.94
Table 2 shows reductions in training and test perplexity, sorted by training reduction, for different choices of $k$, the number of closest neighbors used. The values of $t$, $\beta$, and $\gamma$ are the best ones found for each $k$. (Values of $\beta$ refer to base-10 logarithms and exponentials in all calculations.)
From equation (9), it is clear that the computational cost of applying the similarity model to an unseen bigram is $O(k)$. Therefore, lower values for $k$ (and $t$) are computationally preferable. From the table, we can see that reducing $k$ to 30 incurs a penalty of less than 1% in the perplexity improvement, so relatively low values of $k$ appear to be sufficient to achieve most of the benefit of the similarity model. As the table also shows, the best value of $\gamma$ increases as $k$ decreases; that is, for lower $k$, a greater weight is given to the conditioned word's frequency. This suggests that the predictive power of neighbors beyond the closest 30 or so can be modeled fairly well by the overall frequency of the conditioned word.
The bigram similarity model was also tested as a language model in speech recognition. The test data for this experiment were pruned word lattices for 403 WSJ closed-vocabulary test sentences. Arc scores in these lattices are sums of an acoustic score (negative log likelihood) and a language-model score, which in this case was the negative log probability provided by the baseline bigram model.
From the given lattices, we constructed new lattices in which the arc scores were modified to use the similarity model instead of the baseline model. We compared the best sentence hypothesis in each original lattice with the best hypothesis in the modified one, and counted the word disagreements in which one of the hypotheses was correct. There were a total of 96 such disagreements; the similarity model was correct in 64 cases, and the backoff model in 32. This advantage for the similarity model is statistically significant at the 0.01 level. The overall reduction in error rate is small, from 21.4% to 20.9%, because the number of disagreements is small compared with the overall number of errors in the recognition setup employed in these experiments.
Table 3 shows some examples of speech recognition disagreements between the two models. The hypotheses are labeled ‘B’ for backoff and ‘S’ for similarity, and the boldface words are errors. The similarity model seems to be better at modeling regularities such as semantic parallelism in lists and avoiding a past tense form after “to.” On the other hand, the similarity model makes several mistakes in which a function word is inserted in a place where punctuation would be found in written text.
B  commitments …from leaders felt the three point six billion dollars
S  commitments …from leaders fell to three point six billion dollars
B  followed by France the US agreed in Italy 
S  followed by France the US Greece …Italy 
B  he whispers to made a 
S  he whispers to an aide 
B  the necessity for change exist 
S  the necessity for change exists 
B  without …additional reserves Centrust would have reported 
S  without …additional reserves of Centrust would have reported 
B  in the darkness past the church 
S  in the darkness passed the church 
4 Word-Sense Disambiguation
Since the experiments described in the previous section demonstrated promising results for similarity-based estimation, we ran a second set of experiments designed to help us compare and analyze the somewhat diverse set of similarity measures given in Table 1. Unfortunately, the KL divergence and the confusion probability place different requirements on the base language model, and so we could not run a direct four-way comparison. As explained below, we elected to omit the KL divergence from consideration.
We chose to evaluate the three remaining measures on a word sense disambiguation task, in which each method was presented with a noun and two verbs, and was asked which verb was more likely to have the noun as a direct object. Thus, we did not measure the absolute quality of the assignment of probabilities, as would be the case in a perplexity evaluation, but rather the relative quality. We could therefore ignore constant factors, which is why we did not normalize the similarity measures.
4.1 Task Definition
In the usual word sense disambiguation problem, the method to be tested is presented with an ambiguous word in some context, and is asked to identify the correct sense of the word from that context. For example, a test instance might be the sentence fragment “robbed the bank”; the question is whether “bank” refers to a river bank, a savings bank, or perhaps some other alternative meaning.
While sense disambiguation is clearly an important problem for language processing applications, as an evaluation task it presents numerous experimental difficulties. First, the very notion of “sense” is not clearly defined; for instance, dictionaries may provide sense distinctions that are too fine or too coarse for the data at hand. Also, one needs to have training data for which the correct senses have been assigned; acquiring these correct senses generally requires considerable human effort. Furthermore, some words have many possible senses, whereas others are essentially monosemous; this means that test cases are not all uniformly hard.
To circumvent these and other difficulties, we set up a pseudoword disambiguation experiment [Schütze 1992; Gale, Church & Yarowsky 1992], the format of which is as follows. First, a list of pseudowords is constructed, each of which is the combination of two different words in $V_2$. Each word in $V_2$ contributes to exactly one pseudoword. Then, every $w_2$ in the test set is replaced with its corresponding pseudoword. For example, if a pseudoword is created out of the words "make" and "take", then the data is altered as follows:

    make plans   →  {make, take} plans
    take action  →  {make, take} action

The method being tested must choose between the two words that make up the pseudoword.
The advantages of using pseudowords are twofold. First, the alternative “senses” are under the control of the experimenter. Each test instance presents exactly two alternatives to the disambiguation method, and the alternatives can be chosen to be of the same frequency, the same part of speech, and so on. Secondly, the pretransformation data yields the correct answer, so that no handtagging of the word senses is necessary. These advantages make pseudoword experiments an elegant and simple means to test the efficacy of different language models; of course they may not provide a completely accurate picture of how the models would perform in real disambiguation tasks, although one could create more realistic settings by making pseudowords out of more than two words, varying the frequencies of the alternative pseudosenses, and so on.
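The construction can be sketched as follows. The pairing of adjacent words in a frequency-ranked list matches the procedure described in Section 4.2; the (verb, object) tuple layout is an assumption for illustration.

```python
def make_pseudowords(words_by_freq):
    """Pair adjacent words in a frequency-ranked list, so each
    pseudoword combines two words of similar frequency; returns a
    map from each word to its pseudoword (a frozenset of the two)."""
    mapping = {}
    for i in range(0, len(words_by_freq) - 1, 2):
        a, b = words_by_freq[i], words_by_freq[i + 1]
        mapping[a] = mapping[b] = frozenset({a, b})
    return mapping

def disguise(test_pairs, mapping):
    """Replace each verb in (verb, object) test pairs with its
    pseudoword: ('make', 'plans') -> ({'make', 'take'}, 'plans')."""
    return [(mapping[v], obj) for v, obj in test_pairs]
```

Because the pre-transformation data retains the original verb, the correct answer for each disguised instance is known without any hand-tagging.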
For ease of comparison, we did not consider interpolation with unigram probabilities. Thus, the model we used for these experiments differs slightly from that used in the language modeling tests; it can be summarized as equation (9) with $\gamma = 0$:

$$\hat{P}(w_2|w_1) = \begin{cases} P(w_2|w_1) & \text{if } c(w_1, w_2) > 0 \\ \alpha(w_1)\, P_{SIM}(w_2|w_1) & \text{otherwise.} \end{cases}$$
4.2 Data
We used a statistical part-of-speech tagger [Church 1988] and pattern matching and concordancing tools (due to David Yarowsky) to identify transitive main verbs ($w_2$) and head nouns ($w_1$) of the corresponding direct objects in 44 million words of 1988 Associated Press newswire. We selected the noun-verb pairs for the most frequent nouns in the corpus. These pairs are undoubtedly somewhat noisy, given the errors inherent in the part-of-speech tagging and pattern matching.

We used 587,833 of the pairs so derived for building models, reserving the remainder for testing purposes. As some, but not all, of the similarity measures require smoothed models, we calculated both a Katz backoff model (equation (2) with $P_r(w_2|w_1) = P(w_2)$) and a maximum-likelihood model ($\hat{P} = P_{ML}$). Furthermore, we wished to evaluate the hypothesis that a more compact language model can be built, without affecting model quality, by deleting singletons, word pairs that occur only once, from the training set. This claim had been made in particular for language modeling [Katz 1987]. We therefore built four base models, summarized in Table 4.
Table 4. The four base language models.

        with singletons    no singletons
        (587,833 pairs)    (505,426 pairs)
MLE     MLE1               MLEo1
Katz    BO1                BOo1
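A minimal sketch of how the two maximum-likelihood base models might be built (toy data standing in for the 587,833 real pairs; the pair counts here are invented):

```python
from collections import Counter

# Hypothetical (verb, object-noun) training pairs; ("entomb", "body") is a
# singleton, i.e. a pair that occurs exactly once.
train = [("play", "role")] * 3 + [("play", "guy")] * 2 + [("entomb", "body")]

def mle_model(pairs, delete_singletons=False):
    """Maximum-likelihood model over pairs: relative frequencies, with unseen
    pairs implicitly assigned probability 0. Setting delete_singletons mimics
    the MLEo1 model, which drops pairs occurring exactly once before
    estimating."""
    counts = Counter(pairs)
    if delete_singletons:
        counts = Counter({p: c for p, c in counts.items() if c > 1})
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}

mle1 = mle_model(train)                           # analogue of MLE1
mleo1 = mle_model(train, delete_singletons=True)  # analogue of MLEo1
```

Note that deleting singletons both removes events and renormalizes the probabilities of the surviving pairs.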
Since we wished to test the effectiveness of using similarity for unseen word cooccurrences, we removed from the test data any verb-object pairs that occurred in the training set; this left a set of unseen pairs (some occurring multiple times). The unseen pairs were further divided into five equal-sized parts, which formed the basis for five-fold cross-validation: in each of five runs, one part was used as a performance test set, with the other four combined into one set used for tuning parameters (where necessary) via a simple grid search that evaluated the error on the tuning set at regularly spaced points in parameter space. Finally, test pseudowords were created from pairs of verbs with similar frequencies, so as to control for word frequency in the decision task. Our method was simply to rank the verbs by frequency and create pseudowords out of all adjacent pairs (thus, each verb participated in exactly one pseudoword). Table 5 lists some randomly chosen pseudowords and the frequencies of the corresponding verbs.
Table 5. Sample pseudowords (verb frequencies in parentheses).

make (14782)/take (12871)
fetch (35)/renegotiate (35)
magnify (13)/exit (13)
meeet (1)/stupefy (1)
relabel (1)/entomb (1)
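The pairing scheme just described can be sketched as follows, using the verb frequencies listed in Table 5 (a tiny illustrative subset of the full verb inventory):

```python
# Verb frequencies as listed in Table 5.
freq = {"make": 14782, "take": 12871, "fetch": 35, "renegotiate": 35,
        "magnify": 13, "exit": 13}

# Rank verbs by frequency and fuse each adjacent pair into one pseudoword:
# the two alternatives then have similar frequencies, and each verb
# participates in exactly one pseudoword.
ranked = sorted(freq, key=freq.get, reverse=True)
pseudowords = [f"{a}/{b}" for a, b in zip(ranked[0::2], ranked[1::2])]
```

Python's sort is stable, so verbs with tied frequencies keep their original relative order.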
We use error rate as our performance metric, defined as

error rate = (1/N) (number of incorrect choices + (number of ties)/2),

where N is the size of the test corpus. A tie occurs when the two words making up a pseudoword are deemed equally likely; since random guessing would be correct half the time, a tie counts as half an error.
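Under this definition (ties counting as half an error), the metric can be computed as:

```python
def error_rate(decisions):
    """decisions: one of "correct", "incorrect", or "tie" per test instance.
    A tie (the two alternatives judged equally likely) counts as half an
    error, since random guessing would be right half the time."""
    wrong = sum(d == "incorrect" for d in decisions)
    ties = sum(d == "tie" for d in decisions)
    return (wrong + ties / 2) / len(decisions)

# An MLE model gives every unseen bigram probability 0, so every instance
# is a tie and the error rate is exactly 0.5.
mle_like = error_rate(["tie"] * 10)
```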
4.3 Baseline Experiments
The performances of the four base language models are shown in Table 6. MLE1 and MLEo1 both have error rates of exactly 0.5, because the test sets consist of unseen bigrams, which are all assigned a probability of 0 by maximum-likelihood estimates and thus are all ties for this method. The backoff models BO1 and BOo1 also perform similarly.
Table 6. Error rates of the four base models on the five test sets.

MLE1   0.5    0.5    0.5    0.5    0.5
MLEo1  0.5    0.5    0.5    0.5    0.5
BO1    0.517  0.520  0.512  0.513  0.516
BOo1   0.517  0.520  0.512  0.513  0.516
Since the backoff models consistently performed worse than the MLE models, we chose to use only the MLE models in our subsequent experiments. Therefore, we only ran comparisons between the measures that can use unsmoothed data, namely, the L1 norm, the Jensen-Shannon divergence, and the confusion probability. (It should be noted, however, that on BO1 data, the KL divergence performed slightly better than the L1 norm.)
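For concreteness, the two distance measures retained here can be sketched as follows. The key point is that the Jensen-Shannon divergence remains finite even when the two distributions do not share support, which is what makes it usable with unsmoothed maximum-likelihood models; its values also span a narrower range than the L1 norm's [0, 2], which matters for the parameter analysis in Section 4.5. (The toy distributions are invented.)

```python
from math import log

def l1(p, q):
    """L1 distance between two conditional distributions, given as dicts
    mapping words to probabilities; ranges over [0, 2]."""
    words = set(p) | set(q)
    return sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in words)

def js(p, q):
    """Jensen-Shannon divergence: the average KL divergence of p and q to
    their mean distribution. With natural logs it ranges over [0, ln 2]."""
    words = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in words}

    def kl(a):
        # Only terms with a[w] > 0 contribute, and m[w] > 0 whenever a[w] > 0,
        # so the sum is always finite.
        return sum(a[w] * log(a[w] / m[w]) for w in words if a.get(w, 0.0) > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

# Hypothetical verb distributions for two nouns.
p = {"see": 0.5, "get": 0.5}
q = {"see": 0.5, "play": 0.5}
```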
4.4 Sample Closest Words
In this section, we examine the closest words to a randomly selected noun, “guy”, according to the three measures: the L1 norm, the Jensen-Shannon divergence, and the confusion probability.

Table 7 shows the ten closest words, in order, under each measure when the base language model is MLE1. There is considerable overlap between the closest words under the L1 norm and under the Jensen-Shannon divergence, but very little overlap between the closest words for these two measures and the closest words with respect to the confusion probability: only the words “man” and “lot” are common to all three. Also observe that the word “guy” itself is only fourth on the list of words with the highest confusion probability with respect to “guy”.
Table 7. The ten closest words to “guy” under each measure (base model MLE1).

L1 norm           JS divergence      confusion probability
GUY     0.0       GUY     0.0        role     0.033
kid     1.23      kid     0.15       people   0.024
lot     1.35      thing   0.1645     fire     0.013
thing   1.39      lot     0.165      GUY      0.0127
man     1.46      man     0.175      man      0.012
doctor  1.46      mother  0.184      year     0.01
girl    1.48      doctor  0.185      lot      0.0095
rest    1.485     friend  0.186      today    0.009
son     1.497     boy     0.187      way      0.008778
bit     1.498     son     0.188      part     0.008772
(role: rank 173)  (role: rank 43)    (kid: rank 80)
Let us examine the case of the nouns “kid” and “role” more closely. According to the L1 norm and the Jensen-Shannon divergence, “kid” is the second closest word to “guy”, and “role” is considered relatively distant. Under the confusion probability, however, “role” has the highest confusion probability with respect to “guy”, whereas “kid” has only the 80th highest. What accounts for these differences?
Table 8, which gives the ten verbs most likely to occur with “guy”, “kid”, and “role”, indicates that both the L1 norm and the Jensen-Shannon divergence rate words as similar if they tend to cooccur with the same verbs. Observe that four of the ten verbs most likely to occur with “kid” are also very likely to occur with “guy”, whereas only the verb “play” commonly occurs with both “role” and “guy”.
Table 8. The ten verbs most likely to occur with each noun.

Noun  Most Likely Verbs
guy  see get play let give catch tell do pick need 
kid  get see take help want tell teach send give love 
role  play take lead support assume star expand accept sing limit 
Table 9. Verbs cooccurring with “guy”, reordered as described in the text: (1) electrocute, (2) shortchange, (3) bedevil, (4) admire, (5) bore, (6) fool, (7) bless, …, (26) play, …, (49) get.
If we instead rank the verbs by their contribution to the confusion probability, a different order emerges (Table 9): “play”, the most likely verb to cooccur with “role”, is ranked higher than “get”, the most likely verb to cooccur with “kid”, which indicates why “role” has a higher confusion probability with respect to “guy” than “kid” does.
Table 10. The ten closest words to “guy” under each measure (base model MLEo1).

L1 norm           JS divergence      confusion probability
GUY     0.0       GUY     0.0        role     0.05
kid     1.17      kid     0.15       people   0.025
lot     1.40      thing   0.16       fire     0.021
thing   1.41      lot     0.17       GUY      0.018
reason  1.417     mother  0.182      work     0.016
break   1.42      answer  0.1832     man      0.012
ball    1.439     reason  0.1836     lot      0.0113
answer  1.44      doctor  0.187      job      0.01099
tape    1.449     boost   0.189      thing    0.01092
rest    1.453     ball    0.19       reporter 0.0106
Finally, we examine the effect of deleting singletons from the base language model. Table 10 shows the ten closest words, in order, when the base language model is MLEo1. The relative order of the four closest words remains the same; however, the next six words are quite different from those for MLE1. These data suggest that the effect of singletons on similarity calculations is quite strong, as is borne out by the experimental evaluations described in Section 4.5.
We conjecture that this effect is due to the fact that there are many very low-frequency verbs in the data (65% of the verbs appeared with 10 or fewer nouns; the most common verb occurred with 710 nouns). Omitting singletons involving such verbs may well drastically alter the number of verbs that cooccur with both of two given nouns. Since the similarity functions we consider in this set of experiments depend on such shared verbs, it is not surprising that the effect of deleting singletons is rather dramatic. In contrast, a backoff language model is not as sensitive to missing singletons, because of the Good-Turing discounting of small counts and inflation of zero counts.
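For reference, a minimal sketch of the Good-Turing adjustment mentioned above (raw count-of-counts only; real backoff implementations smooth the n[r] sequence first, and the toy counts are invented):

```python
from collections import Counter

def good_turing(counts):
    """Good-Turing adjusted counts r* = (r + 1) * n[r+1] / n[r], where n[r]
    is the number of distinct events observed exactly r times. Events whose
    adjusted count is undefined here (n[r+1] == 0) are left out. The mass
    n[1] / N is reserved for unseen events, which is why a backoff model
    reacts to deleted singletons far less than the raw cooccurrence sets
    used by the similarity functions do."""
    n = Counter(counts.values())
    total = sum(counts.values())
    adjusted = {e: (r + 1) * n[r + 1] / n[r]
                for e, r in counts.items() if n[r + 1] > 0}
    unseen_mass = n[1] / total
    return adjusted, unseen_mass

# Hypothetical pair counts: three singletons, one doubleton, one tripleton.
adjusted, unseen_mass = good_turing({"a": 1, "b": 1, "c": 1, "d": 2, "e": 3})
```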
4.5 Performance of Similarity-Based Methods
Figure 1 shows the results of our experiments on the five test sets, using MLE1 as the base language model. The parameter β was always set to the optimal value for the corresponding training set. RAND, which is shown for comparison purposes, simply chooses the weights randomly; the remaining settings were the same in all cases.
The similarity-based methods consistently outperformed Katz’s backoff method and the MLE (recall that both yielded error rates of about .5) by a large margin, indicating that information from other word pairs is very useful for unseen pairs when unigram frequency is not informative. The similarity-based methods also do much better than RAND, which indicates that it is not enough to simply combine information from other words arbitrarily: word similarity should be taken into account. In all cases, the Jensen-Shannon divergence edged out the other methods. Its average improvement over the runner-up is .0082; this difference is significant at the .1 level according to the paired t-test.
The results for the MLEo1 case are depicted in Figure 2. Again, we see the similarity-based methods achieving far lower error rates than the MLE, backoff, and RAND methods, and again, the Jensen-Shannon divergence always performed the best. However, omitting singletons amplified the disparity between the measures: the average difference increased, and is significant at the .01 level (paired t-test).
An important observation is that all methods, including RAND, suffered a performance hit if singletons were deleted from the base language model. This seems to indicate that seen bigrams should be treated differently from unseen bigrams, even if the seen bigrams are extremely rare. We thus conclude that one cannot create a compressed similarity-based language model by omitting singletons without hurting performance, at least for this task.
We now analyze the role of the parameter β, which appears in the weight functions for the Jensen-Shannon divergence and the L1 norm. It controls the relative influence of the most similar words: their influence increases with higher values of β.
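Schematically, the similarity-based estimate is a normalized, similarity-weighted average over a word's nearest neighbors. The sketch below uses an exponential decay exp(-βd) as an illustrative weight function (not the paper's exact forms, which differ per measure); β plays the same sharpening role in either case. All distances and probabilities here are hypothetical.

```python
from math import exp

def sim_prob(w2, w1, neighbors, dist, cond_prob, beta=2.0):
    """Similarity-based estimate of P(w2 | w1): a normalized, similarity-
    weighted average of P(w2 | w1') over the neighbors w1' of w1 (the sum
    the text refers to as equation (9)). Larger beta concentrates the
    weights on the nearest neighbors."""
    weights = {n: exp(-beta * dist(w1, n)) for n in neighbors}
    z = sum(weights.values())
    return sum(weights[n] * cond_prob(w2, n) for n in neighbors) / z

# Hypothetical distances from "guy" and neighbor conditional probabilities.
d = {"kid": 0.15, "role": 0.9}
p = {("see", "kid"): 0.3, ("see", "role"): 0.0}
args = (["kid", "role"], lambda w, n: d[n], lambda v, n: p[(v, n)])
est_lo = sim_prob("see", "guy", *args, beta=2.0)
est_hi = sim_prob("see", "guy", *args, beta=10.0)
# Higher beta concentrates weight on the nearer neighbor "kid", pulling the
# estimate toward P(see | kid) = 0.3.
```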
Figure 3 shows how the value of affects disambiguation performance. Four curves are shown, each corresponding to a choice of similarity function and base language model. The error bars depict the average and range of error rates over the five disjoint test sets.
It is immediately clear that to get good performance, β must be set much higher for the Jensen-Shannon divergence than for the L1 norm. This results from the fact that the range of possible values of the Jensen-Shannon divergence is much smaller than that of the L1 norm. This “compression” of values requires a large β to scale differences between distances correctly.
We also observe that setting β too low causes substantially worse error rates; however, the curves level off rather than moving upwards again. That is, as long as a sufficiently large value is chosen, setting β suboptimally does not greatly impact performance. Furthermore, the shape of the curves is the same for both base language models, suggesting that the relation between β and test-set performance is relatively insensitive to variations in training data.
The fact that higher values of β seem to lead to better error rates suggests that β’s role is to filter out distant neighbors. To test this hypothesis, we experimented with using only the k most similar neighbors. Figure 4 shows how the error rate depends on k for different fixed values of β. The two lowest curves depict the performance of the Jensen-Shannon divergence and the L1 norm when β is set to the optimal value with respect to average test-set performance; it appears that the more distant neighbors have essentially no effect on the error rate, because their contribution to the sum (9) is negligible. In contrast, when too low a value of β is chosen (the upper two curves), distant neighbors are weighted too heavily. In this case, including more distant neighbors causes serious degradation of performance.
Interestingly, the behavior of the confusion probability is different from these two cases: adding more neighbors actually improves the error rate. This seems to indicate that the confusion probability does not correctly rank similar words in order of informativeness. However, an alternative explanation is that the confusion probability is at a disadvantage only because it is not being employed in the context of a tunable weighting scheme.
To distinguish between these two possibilities, we ran an experiment that dispensed with weights altogether. Instead, we took a vote of the k most similar neighbors: the alternative chosen as more likely was the one preferred by a majority of those neighbors (note that we ignored the degree to which alternatives were preferred). The results are shown in Figure 5.
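The voting scheme can be sketched as follows (the neighbor names and their preferences are hypothetical):

```python
def vote(neighbors, prefers):
    """Majority vote among the k most similar neighbors: prefers(n) returns
    the alternative that neighbor n makes more likely; the margin of that
    preference is deliberately ignored."""
    tally = {}
    for n in neighbors:
        choice = prefers(n)
        tally[choice] = tally.get(choice, 0) + 1
    return max(tally, key=tally.get)

# Hypothetical neighbors of a test noun, each preferring one of the two
# pseudoword alternatives "make" and "take".
picked = vote(["kid", "thing", "lot"],
              lambda n: "make" if n in ("kid", "lot") else "take")
```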
We see that the k most similar neighbors according to the L1 norm and the Jensen-Shannon divergence were always more informative than those chosen according to the confusion probability, with the largest performance gaps occurring for low k (of course, all methods performed the same at the maximal k, since in that case they were using the same set of neighbors). This graph provides clear evidence that the confusion probability is not as good a measure of the informativeness of other words.
5 Related Work
There is a large body of work on notions of word similarity, word clustering, and their applications. It is impossible to compare all these methods directly, since their assumptions, experimental settings, and applications vary widely. Therefore, the discussion below is mainly descriptive, highlighting some of the main similarities and differences between the methods.
5.1 Statistical similarity and clustering for disambiguation and language modeling
Our work is an instance of a growing body of research on using word similarity to improve performance on language-processing problems. Similarity-based algorithms either use the similarity scores between a word and other words directly in making their predictions, or rely on similarity scores between a word and representatives of precomputed similarity classes.
An early attempt to automatically classify words into semantic classes was carried out in the Linguistic String Project [Grishman, Hirschman & Nhan 1986]. Semantic classes were derived from similar cooccurrence patterns of words within syntactic relations. Cooccurrence statistics were then considered at the class level and used to alleviate data sparseness in syntactic disambiguation.

Schütze (1992, 1993) captures contextual word similarity by first reducing the dimensionality of a context representation using singular value decomposition and then using the reduced-dimensionality representation to characterize the possible contexts of a word. This information is used for word sense disambiguation: all occurrences of an ambiguous word are clustered, and each cluster is mapped manually to one of the senses of the word. The context vector of a new occurrence of the ambiguous word is mapped to the nearest cluster, which determines the sense for that occurrence. Schütze emphasizes that his method avoids clustering words into a predefined set of classes, claiming that such clustering is likely to introduce artificial boundaries that cut words off from part of their semantic neighborhood.
Karov and Edelman (1996) have also addressed the data sparseness problem in word sense disambiguation by using word similarity. They use a circular definition for both a word similarity measure and a context similarity measure. The circularity is resolved by an iterative process in which the system learns a set of typical usages for each of the senses of an ambiguous word. Given a new occurrence of the ambiguous word, the system selects the sense whose typical context is most similar to the current context, applying a procedure that resembles the sense selection process of Schütze.
Our scheme for employing word similarity in disambiguation was influenced by the work of Dagan, Marcus, and Markovitch (1993, 1995). Their method computes a word similarity measure directly from word cooccurrence data. A word is then modeled by a set of its most similar words, and the plausibility of an unseen cooccurrence is judged by the cooccurrence statistics of the words in this set. The similarity measure is a weighted Tanimoto measure, a version of which was also used by Grefenstette (1992, 1994). Word association is measured by mutual information, following earlier work on word similarity by Hindle (1990).
The method of Dagan, Marcus, and Markovitch (1993, 1995) does not provide probabilistic models. Disambiguation decisions are based on comparing scores for different alternatives, but they do not produce explicit probability estimates and therefore cannot be integrated directly into a larger probabilistic framework. The cooccurrence smoothing model of Essen and Steinbiss (1992), like our model, produces explicit estimates of word cooccurrence probabilities based on the cooccurrence statistics of similar words. The similarity-based estimates are interpolated with direct estimates of n-gram probabilities to form a smoothed n-gram language model. Word similarity in this model is computed by the confusion probability measure, which we described and evaluated earlier.
Several language modeling methods produce similarity-based probability estimates through class-based models. These methods do not use a direct measure of the similarity between a word and other words, but instead cluster the words into classes using a global optimization criterion. Brown et al. (1992) present a class-based n-gram model which records probabilities of sequences of word classes instead of sequences of individual words. The probability estimate for a bigram that contains a particular word is affected by bigram statistics for other words in the same class, where all words in the same class are considered similar in their cooccurrence behavior. Word classes are formed by a bottom-up hard-clustering algorithm whose objective function is the average mutual information of class cooccurrence. Ushioda (1996) introduces several improvements to mutual-information clustering. His method, which was applied to part-of-speech tagging, records all classes that contained a particular word during the bottom-up merging process. The word is then represented by a mixture of these classes rather than by a single class.
The algorithms of Kneser and Ney (1993) and Ueberla (1994) are similar to that of Brown et al. (1992), although a different optimization criterion is used, and the number of clusters remains constant throughout the membership assignment process. Pereira, Tishby, and Lee (1993) use a formalism from statistical mechanics to derive a top-down soft-clustering algorithm with probabilistic class membership. Word cooccurrence probability is then modeled by a weighted average of class cooccurrence probabilities, where the weights correspond to membership probabilities of words within classes.
5.2 Thesaurus-based similarity
The approaches described in the previous section induce word similarity relationships or word clusters from cooccurrence statistics in a corpus. Other researchers have developed methods that quantify similarity relationships based on information in the manually crafted WordNet thesaurus [Miller, Beckwith, Fellbaum, Gross & Miller 1990]. Resnik (1992, 1995) proposes a node-based approach for measuring the similarity between a pair of words in the thesaurus and applies it to various disambiguation tasks. His similarity function is an information-theoretic measure of the informativeness of the least general common ancestor of the two words in the thesaurus classification. Jiang and Conrath (1997) combine the node-based approach with an edge-based approach, in which the similarity of nodes in the thesaurus is influenced by the path that connects them. Their similarity method was tested on a data set of word-pair similarity ratings derived from human judgments.
Lin (1997, 1998) derives a general concept-similarity measure from assumptions on desired properties of similarity. His measure is a function of the number of bits required to describe each of the two concepts as well as their “commonality”. He then describes an instantiation of the measure for a hierarchical thesaurus and applies it to WordNet as part of a word sense disambiguation algorithm.
5.3 Contextual similarity for information retrieval
Query expansion in information retrieval (IR) provides an additional motivation for the automatic identification of word similarity. One line of work in the IR literature considers two words similar if they often occur in the same documents. Another line of work considers the same type of word similarity we are concerned with, that is, similarity measures derived from word-cooccurrence statistics.
Grefenstette (1992, 1994) argues that cooccurrence within a document yields similarity judgments that are not sharp enough for query expansion. Instead, he extracts coarse syntactic relationships from texts and represents a word by the set of its word cooccurrences within each relation. Word similarity is defined by a weighted version of the Tanimoto measure, which compares the cooccurrence statistics of two words. The similarity method was evaluated by measuring its impact on retrieval performance.
Ruge (1992) also extracted word cooccurrences within syntactic relationships and evaluated several similarity measures on those data, focusing on versions of the cosine measure. The similarity rankings obtained by these measures were compared to those produced by human judges.
6 Conclusions
Similarity-based language models provide an appealing approach for dealing with data sparseness. In this work, we proposed a general method for using similarity-based models to improve the estimates of existing language models, and we evaluated a range of similarity-based models and parameter settings on important language-processing tasks. In the pilot study, we compared the language modeling performance of a similarity-based model with a standard backoff model. While the improvement we achieved over a bigram backoff model is statistically significant, it is relatively modest in its overall effect because of the small proportion of unseen events. In a second, more detailed study, we compared several similarity-based models and parameter settings on a smaller, more manageable word sense disambiguation task. We observed that the similarity-based methods perform much better on unseen word pairs, with the measure based on the Jensen-Shannon divergence being the best overall.
Our experiments were restricted to bigram probability estimation for reasons of simplicity and computational cost. However, the relatively small proportion of unseen bigrams in test data makes the effect of similarity-based methods necessarily modest in the overall task. We believe that the benefits of similarity-based methods would be more substantial in tasks with a larger proportion of unseen events, for instance language modeling with longer contexts. There is no obstacle in principle to doing this: in the trigram case, for example, we would still be determining the probability of pairs of a conditioning event and a conditioned word, but the conditioning events would consist of word pairs instead of single words. However, the number of events similar to a given conditioning event is then much larger than in the bigram case. Direct tabulation of the events most similar to each event would thus not be practical, so more compact or approximate representations would have to be investigated. It would also be worth investigating the benefit of similarity-based methods for improving estimates of low-frequency seen events. However, we would then need to replace the backoff model with one that combines multiple estimates for the same event, for example an interpolated model with context-dependent interpolation parameters.
Another area for further investigation is the relationship between similarity-based and class-based approaches. As mentioned in the introduction, both rely on a common intuition, namely, that events can be modeled to some extent by similar events. Class-based methods are more computationally expensive at training time than nearest-neighbor methods, because they require searching for the best model structure (number of classes and, for hard clustering, class membership) and estimating hidden parameters (class membership probabilities in soft clustering). On the other hand, class-based methods reduce dimensionality and are thus smaller and more efficient at test time. Dimensionality reduction has also been claimed to improve generalization to test data, but the evidence for this is mixed. Furthermore, some class-based models have theoretically satisfying probabilistic interpretations [Saul & Pereira 1997], whereas the justification for our similarity-based models is heuristic and empirical at present. Given the variety of class-based language modeling algorithms described in the section on related work above, it is beyond the scope of this paper to compare the performance of the two approaches. However, such a comparison, especially one that would bring both approaches under a common probabilistic interpretation, would be well worth pursuing.
Acknowledgments
We thank Hiyan Alshawi, Joshua Goodman, Rebecca Hwa, Slava Katz, Doug McIlroy, Stuart Shieber, and Yoram Singer for many helpful discussions, Doug Paul for help with his bigram backoff model, and Andrej Ljolje and Michael Riley for providing word lattices for our speech recognition evaluation. We also thank the reviewers of this paper for their constructive criticisms, and the editors of the present issue, Claire Cardie and Ray Mooney, for their help and suggestions. Portions of this work have appeared previously [Dagan, Pereira & Lee 1994; Dagan, Lee & Pereira 1997]; we thank the reviewers of those papers for their comments. Part of this work was done while the first author was a member of technical staff and then a visitor at AT&T Labs, and the second author was a graduate student at Harvard University and a summer visitor at AT&T Labs. The second author received partial support from the National Science Foundation under Grant No. IRI-9350192, a National Science Foundation Graduate Fellowship, and an AT&T GRPW/ALFP grant.
References
 [Aha, Kibler & Albert 1991] Aha, D. W., Kibler, D. & Albert, M. K. 1991. Instance-based learning algorithms. Machine Learning, 6, 37–66.
 [Atkeson, Moore & Schaal 1997] Atkeson, C. G., Moore, A. W. & Schaal, S. 1997. Locally weighted learning. Artificial Intelligence Review, 11(1), 11–73.
 [Brown, Della Pietra, de Souza, Lai & Mercer 1992] Brown, P. F., Della Pietra, V. J., de Souza, P. V., Lai, J. C. & Mercer, R. L. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4), 467–479.
 [Cardie 1993] Cardie, C. 1993. A case-based approach to knowledge acquisition for domain-specific sentence analysis. In 11th National Conference on Artificial Intelligence (pp. 798–803). Menlo Park, California: AAAI.
 [Chen & Goodman 1996] Chen, S. F. & Goodman, J. T. 1996. An empirical study of smoothing techniques for language modeling. In 34th Annual Meeting of the ACL (pp. 310–318). Somerset, New Jersey: Association for Computational Linguistics. (Distributed by Morgan Kaufmann, San Francisco)
 [Church 1988] Church, K. W. 1988. A stochastic parts program and noun phrase parser for unrestricted text. In Second Conference on Applied Natural Language Processing (pp. 136–143). Somerset, New Jersey: Association for Computational Linguistics. (Distributed by Morgan Kaufmann, San Francisco)
 [Church & Gale 1991] Church, K. W. & Gale, W. A. 1991. A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. Computer Speech and Language, 5, 19–54.
 [Cover & Hart 1967] Cover, T. M. & Hart, P. 1967. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27.
 [Cover & Thomas 1991] Cover, T. M. & Thomas, J. A. 1991. Elements of information theory. New York: John Wiley.
 [Dagan, Lee & Pereira 1997] Dagan, I., Lee, L. & Pereira, F. C. N. 1997. Similarity-based methods for word sense disambiguation. In 35th Annual Meeting of the ACL (pp. 56–63). Somerset, New Jersey: Association for Computational Linguistics. (Distributed by Morgan Kaufmann, San Francisco)
 [Dagan, Marcus & Markovitch 1993] Dagan, I., Marcus, S. & Markovitch, S. 1993. Contextual word similarity and estimation from sparse data. In 31st Annual Meeting of the ACL (pp. 164–171). Somerset, New Jersey: Association for Computational Linguistics. (Distributed by Morgan Kaufmann, San Francisco)
 [Dagan, Marcus & Markovitch 1995] Dagan, I., Marcus, S. & Markovitch, S. 1995. Contextual word similarity and estimation from sparse data. Computer Speech and Language, 9, 123–152.
 [Dagan, Pereira & Lee 1994] Dagan, I., Pereira, F. C. N. & Lee, L. 1994. Similarity-based estimation of word cooccurrence probabilities. In 32nd Annual Meeting of the ACL (pp. 272–278). Somerset, New Jersey: Association for Computational Linguistics. (Distributed by Morgan Kaufmann, San Francisco)
 [Devroye, Györfi & Lugosi 1996] Devroye, L., Györfi, L. & Lugosi, G. 1996. A probabilistic theory of pattern recognition. New York: Springer-Verlag.
 [Duda & Hart 1973] Duda, R. O. & Hart, P. E. 1973. Pattern classification and scene analysis. New York: Wiley-Interscience.
 [Essen & Steinbiss 1992] Essen, U. & Steinbiss, V. 1992. Cooccurrence smoothing for stochastic language modeling. In ICASSP 92 (Vol. 1, pp. 161–164). Piscataway, New Jersey: IEEE.
 [Finch 1993] Finch, S. 1993. Finding structure in language. Doctoral dissertation, University of Edinburgh.
 [Gale, Church & Yarowsky 1992] Gale, W., Church, K. W. & Yarowsky, D. 1992. Work on statistical methods for word sense disambiguation. In R. Goldman (Ed.), Fall Symposium on Probabilistic Approaches to Natural Language (pp. 54–60). Menlo Park, California: AAAI.
 [Good 1953] Good, I. J. 1953. The population frequencies of species and the estimation of population parameters. Biometrika, 40(3 and 4), 237–264.
 [Grefenstette 1992] Grefenstette, G. 1992. Use of syntactic context to produce term association lists for text retrieval. In International Conference on Research and Development in Information Retrieval, SIGIR (pp. 89–97). New York: ACM.
 [Grefenstette 1994] Grefenstette, G. 1994. Explorations in automatic thesaurus discovery. Boston: Kluwer Academic Publishers.
 [Grishman, Hirschman & Nhan 1986] Grishman, R., Hirschman, L. & Nhan, N. T. 1986. Discovery procedures for sublanguage selectional patterns – initial experiments. Computational Linguistics, 12, 205–214.
 [Grishman & Sterling 1993] Grishman, R. & Sterling, J. 1993. Smoothing of automatically generated selectional constraints. In Human Language Technology: Proceedings of the ARPA Workshop (pp. 254–259). San Francisco: Morgan Kaufmann.
 [Hindle 1990] Hindle, D. 1990. Noun classification from predicate-argument structures. In 28th Annual Meeting of the ACL (pp. 268–275). Somerset, New Jersey: Association for Computational Linguistics. (Distributed by Morgan Kaufmann, San Francisco)
 [Jelinek & Mercer 1980] Jelinek, F. & Mercer, R. L. 1980. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice. Amsterdam: North Holland.
 [Jelinek, Mercer & Roukos 1992] Jelinek, F., Mercer, R. L. & Roukos, S. 1992. Principles of lexical language modeling for speech recognition. In S. Furui & M. M. Sondhi (Eds.), Advances in speech signal processing (pp. 651–699). New York: Marcel Dekker.
 [Jiang & Conrath 1997] Jiang, J. J. & Conrath, D. W. 1997. Semantic similarity based on corpus statistics and lexical taxonomy. In ROCLING X International Conference. Taipei, Taiwan: Academia Sinica.
 [Karov & Edelman 1996] Karov, Y. & Edelman, S. 1996. Learning similarity-based word sense disambiguation from sparse data. In E. Ejerhed & I. Dagan (Eds.), Fourth Workshop on Very Large Corpora (pp. 42–55). Somerset, New Jersey: Association for Computational Linguistics.
 [Katz 1987] Katz, S. M. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-35(3), 400–401.
 [Kneser & Ney 1993] Kneser, R. & Ney, H. 1993. Improved clustering techniques for class-based statistical language modelling. In EUROSPEECH ’93 (pp. 973–976). Grenoble, France: European Speech Communication Association.
 [Kullback 1959] Kullback, S. 1959. Information theory and statistics. New York: John Wiley and Sons.
 [LeeLee1997] Lee, L. 1997. Similaritybased approaches to natural language processing. , Harvard University, Cambridge, Massachusetts.
 [LinLin1997] Lin, D. 1997. Using syntactic dependency as local context to resolve word sense ambiguity. In 35th Annual Meeting of the ACL ( 64–71). Somerset, New Jersey: Association for Computational Linguistics. (Distributed by Morgan Kaufmann, San Francisco)
 [LinLin1998] Lin, D. 1998. An information theoretic definition of similarity. In Machine Learning: Proceedings of the Fiftheenth International Conference (ICML ’98). San Francisco: Morgan Kaufmann.
 [LinLin1991] Lin, J. 1991. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1), 145151.
 [LukLuk1995] Luk, A. K. 1995. Statistical sense disambiguation with relatively small corpora using dictionary definitions. In 33rd Annual Meeting of the ACL ( 181–188). Somerset, New Jersey: Association for Computational Linguistics. (Distributed by Morgan Kaufmann, San Francisco)
 [Miller et al. 1990] Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D. & Miller, K. J. 1990. Introduction to WordNet: an on-line lexical database. International Journal of Lexicography, 3(4), 235–244.
 [Ng 1997] Ng, H. T. 1997. Exemplar-based word sense disambiguation: Some recent improvements. In C. Cardie & R. Weischedel (Eds.), Second Conference on Empirical Methods in Natural Language Processing (EMNLP-2) (pp. 208–213). Somerset, New Jersey: Association for Computational Linguistics.
 [Ng & Lee 1996] Ng, H. T. & Lee, H. B. 1996. Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach. In 34th Annual Meeting of the ACL (pp. 40–47). Somerset, New Jersey: Association for Computational Linguistics. (Distributed by Morgan Kaufmann, San Francisco)
 [Paul 1991] Paul, D. B. 1991. Experience with a stack decoder-based HMM CSR and back-off n-gram language models. In ARPA Speech and Natural Language Workshop (pp. 284–288). San Francisco: Morgan Kaufmann.
 [Pereira, Tishby & Lee 1993] Pereira, F. C. N., Tishby, N. & Lee, L. 1993. Distributional clustering of English words. In 31st Annual Meeting of the ACL (pp. 183–190). Somerset, New Jersey: Association for Computational Linguistics. (Distributed by Morgan Kaufmann, San Francisco)
 [Rao 1982] Rao, C. R. 1982. Diversity: Its measurement, decomposition, apportionment and analysis. Sankhyā: The Indian Journal of Statistics, 44(A), 1–22.
 [Resnik 1992] Resnik, P. 1992. WordNet and distributional analysis: A class-based approach to lexical discovery. In Workshop on Statistically-based Natural Language Processing Techniques (pp. 56–64). Menlo Park, California: AAAI.
 [Resnik 1995] Resnik, P. 1995. Disambiguating noun groupings with respect to WordNet senses. In D. Yarowsky & K. W. Church (Eds.), Third Workshop on Very Large Corpora (pp. 54–68). Somerset, New Jersey: Association for Computational Linguistics.
 [Ruge 1992] Ruge, G. 1992. Experiments on linguistically-based term associations. Information Processing & Management, 28(3), 317–332.
 [Saul & Pereira 1997] Saul, L. & Pereira, F. C. N. 1997. Aggregate and mixed-order Markov models for statistical language processing. In C. Cardie & R. Weischedel (Eds.), Second Conference on Empirical Methods in Natural Language Processing (EMNLP-2) (pp. 81–89). Somerset, New Jersey: Association for Computational Linguistics.
 [Schütze 1992a] Schütze, H. 1992a. Context space. In R. Goldman (Ed.), Fall Symposium on Probabilistic Approaches to Natural Language (pp. 113–120). Menlo Park, California: AAAI.
 [Schütze 1992b] Schütze, H. 1992b. Dimensions of meaning. In Supercomputing ’92: Proceedings of the ACM/IEEE Conference (pp. 787–796). New York: ACM.
 [Schütze 1993] Schütze, H. 1993. Word space. In S. J. Hanson, J. D. Cowan & C. L. Giles (Eds.), Advances in Neural Information Processing Systems 5 (pp. 895–902). San Francisco: Morgan Kaufmann.
 [Stanfill & Waltz 1986] Stanfill, C. & Waltz, D. 1986. Toward memory-based reasoning. Communications of the ACM, 29(12), 1213–1228.
 [Sugawara et al. 1985] Sugawara, K., Nishimura, M., Toshioka, K., Okochi, M. & Kaneko, T. 1985. Isolated word recognition using hidden Markov models. In ICASSP 85 (pp. 1–4). Piscataway, New Jersey: IEEE.
 [Ueberla 1994] Ueberla, J. P. 1994. An extended clustering algorithm for statistical language models (Technical Report DRA/CIS(CSE1)/RN94/13). Forum Technology – DRA Malvern.
 [Ushioda 1996] Ushioda, A. 1996. Hierarchical clustering of words and applications to NLP tasks. In E. Ejerhed & I. Dagan (Eds.), Fourth Workshop on Very Large Corpora (pp. 28–41). Somerset, New Jersey: Association for Computational Linguistics.
 [Witten & Bell 1991] Witten, I. H. & Bell, T. C. 1991. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4), 1085–1094.
 [Yarowsky 1992] Yarowsky, D. 1992. Word-sense disambiguation using statistical models of Roget’s categories trained on large corpora. In COLING-92 (pp. 454–460). Nantes, France.
 [Zavrel & Daelemans 1997] Zavrel, J. & Daelemans, W. 1997. Memory-based learning: Using similarity for smoothing. In 35th Annual Meeting of the ACL (pp. 436–443). Somerset, New Jersey: Association for Computational Linguistics. (Distributed by Morgan Kaufmann, San Francisco)