Word Sense Disambiguation (WSD) is the problem of assigning the appropriate meaning (sense) to a given word in a text or discourse, where this meaning is distinguishable from the other senses potentially attributable to that word. Resolving the ambiguity of words is a central problem for language understanding applications and their associated tasks, including, for instance, machine translation, information retrieval and hypertext navigation, parsing, spelling correction, reference resolution, and automatic text summarization.
WSD is one of the most important open problems in the Natural Language Processing (NLP) field. Despite the wide range of approaches investigated and the large effort devoted to tackling this problem, to date no large-scale, broad-coverage and highly accurate word sense disambiguation system has been built.
The most successful current line of research is the corpus-based approach, in which statistical or Machine Learning (ML) algorithms are applied to learn statistical models or classifiers from corpora in order to perform WSD. Generally, supervised approaches (those that learn from a previously semantically annotated corpus) have obtained better results than unsupervised methods on small sets of selected highly ambiguous words, or artificial pseudo-words. Many standard algorithms for supervised learning have been applied, such as Naive Bayes [19, 22], Exemplar-based learning [19, 10], Decision Lists, Neural Networks [27], etc. Further, Mooney has also compared all the previously cited methods on a very restricted domain, additionally including Decision Trees and Rule Induction algorithms. Unfortunately, there have been very few direct comparisons of alternative methods on identical test data. However, it is commonly accepted that Naive Bayes, Neural Networks and Exemplar-based learning represent state-of-the-art accuracy on supervised WSD.
Supervised methods suffer from the lack of widely available semantically tagged corpora from which to construct really broad-coverage systems. This is known as the “knowledge acquisition bottleneck”. Ng estimates that the manual annotation effort necessary to build a broad-coverage semantically annotated corpus would be about 16 man-years. This extremely high overhead for supervision, together with the serious overhead for learning and testing many of the commonly used algorithms when scaling to real-size problems, explains why supervised methods have been seriously questioned.
Due to this fact, recent works have focused on reducing the acquisition cost as well as the need for supervision in corpus-based methods for WSD. Consequently, the following three lines of research can be found: 1) the design of efficient example sampling methods [6, 10]; 2) the use of lexical resources, such as WordNet, and WWW search engines to automatically obtain arbitrarily large samples of word senses from the Internet [12, 15]; 3) the use of unsupervised EM-like algorithms for estimating the statistical model parameters. It is also our belief that this body of work, and in particular the second line, provides enough evidence towards the “opening” of the acquisition bottleneck in the near future. For that reason, it is worth further investigating the application of new supervised methods to better resolve the WSD problem.
Boosting Algorithms. The main idea of boosting algorithms is to combine many simple and moderately accurate hypotheses (called weak classifiers) into a single, highly accurate classifier for the task at hand. The weak classifiers are trained sequentially and, conceptually, each of them is trained on the examples which were most difficult to classify by the preceding weak classifiers.
The algorithm applied in this paper, AdaBoost.MH, is a generalization of Freund and Schapire’s AdaBoost algorithm, which has been extensively studied (both theoretically and experimentally) and which has been shown to perform well on standard machine-learning tasks, using standard machine-learning algorithms as weak learners [23, 8, 5, 2].
Regarding Natural Language (NL) problems, AdaBoost.MH has been successfully applied to Part-of-Speech (PoS) tagging, Prepositional-Phrase-attachment disambiguation, and Text Categorization, with especially good results.
The Text Categorization domain shares several properties with the usual settings of WSD, such as: very high dimensionality (typical features test the presence or absence of concrete words), the presence of many irrelevant and highly dependent features, and the fact that both the learned concepts and the examples reside very sparsely in the feature space. Therefore, the application of AdaBoost.MH to WSD seems to be a promising choice. It has to be noted that, apart from the excellent results obtained on NL problems, AdaBoost.MH has the advantages of being theoretically well founded and easy to implement.
The paper is organized as follows: Section 2 explains the AdaBoost.MH algorithm in detail. Section 3 describes the domain of application and the initial experiments performed on a reduced set of words. In Section 4 several alternatives are explored for accelerating the learning process by reducing the feature space. The best alternative is fully tested in Section 5. Finally, Section 6 concludes and outlines some directions for future work.
2 The Boosting Algorithm
As already said, the purpose of boosting is to find a highly accurate classification rule by combining many weak hypotheses (or weak rules), each of which may be only moderately accurate. It is assumed that there exists a separate procedure, called the WeakLearner, for acquiring the weak hypotheses. The boosting algorithm finds a set of weak hypotheses by calling the weak learner repeatedly in a series of rounds. These weak hypotheses are then combined into a single rule called the combined hypothesis.
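Let $S = \{(x_1, Y_1), \ldots, (x_m, Y_m)\}$ be the set of $m$ training examples, where each instance $x_i$ belongs to an instance space $\mathcal{X}$ and each $Y_i$ is a subset of a finite set of labels or classes $\mathcal{Y}$. The size of $\mathcal{Y}$ is denoted by $k = |\mathcal{Y}|$.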
The pseudo-code of AdaBoost.MH is presented in figure 1. AdaBoost.MH maintains an $m \times k$ matrix of weights as a distribution $D$ over examples and labels. The goal of the WeakLearner algorithm is to find a weak hypothesis with moderately low error with respect to these weights. Initially, the distribution $D_1$ is uniform, but the boosting algorithm updates the weights on each round to force the weak learner to concentrate on the (example, label) pairs which are hardest to predict.
More precisely, let $D_t$ be the distribution at round $t$, and $h_t : \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}$ the weak rule acquired according to $D_t$. The sign of $h_t(x, l)$ is interpreted as a prediction of whether label $l$ should be assigned to example $x$ or not. The magnitude of the prediction, $|h_t(x, l)|$, is interpreted as a measure of confidence in the prediction. To read the updating formula correctly, one last piece of notation is needed: given $Y \subseteq \mathcal{Y}$ and $l \in \mathcal{Y}$, let $Y[l]$ be +1 if $l \in Y$ and -1 otherwise.
Now, it becomes clear that the updating function decreases (or increases) the weights $D_t(i, l)$ for which $h_t$ makes a good (or bad) prediction, and that the size of this change grows with $|h_t(x_i, l)|$.
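Since the pseudo-code of figure 1 is not reproduced in this text, the updating formula referred to above can be written as follows. This is the standard AdaBoost.MH update of Schapire and Singer, stated with the notation of this section; that figure 1 uses exactly this layout is an assumption.

```latex
D_{t+1}(i,l) \;=\; \frac{D_t(i,l)\,\exp\bigl(-Y_i[l]\,h_t(x_i,l)\bigr)}{Z_t}
```

Here $Z_t$ is the normalization factor that makes $D_{t+1}$ sum to one. When the sign of $h_t(x_i,l)$ agrees with $Y_i[l]$ the exponent is negative and the weight shrinks; when it disagrees the weight grows.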
Note that WSD is not a multi-label classification problem, since a unique sense is expected for each word in context. In our implementation, the algorithm runs exactly in the same way as explained above, except that the sets $Y_i$ are reduced to a unique label, and that the combined hypothesis is forced to output a unique label, which is the one that maximizes $f(x, l) = \sum_{t} h_t(x, l)$.
It only remains to define the form of the WeakLearner. Schapire and Singer prove that the Hamming loss of the AdaBoost.MH algorithm on the training set (i.e. the fraction of training examples $i$ and labels $l$ for which the sign of $f(x_i, l)$ differs from $Y_i[l]$) is at most $\prod_{t=1}^{T} Z_t$, where $T$ is the number of rounds and $Z_t$ is the normalization factor computed on round $t$. This upper bound is used in guiding the design of the WeakLearner algorithm, which attempts to find a weak hypothesis $h_t$ that minimizes: $Z_t = \sum_{i=1}^{m} \sum_{l \in \mathcal{Y}} D_t(i, l) \exp(-Y_i[l]\, h_t(x_i, l))$.
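To make the role of the normalization factor concrete, the following minimal Python sketch performs a single reweighting round on a toy problem. It assumes the standard AdaBoost.MH update of Schapire and Singer (each weight is multiplied by the exponential of minus the signed prediction and the result is renormalized); the function name `boosting_round` and the toy data are illustrative and not part of the original system.

```python
import math

def boosting_round(D, Y, H):
    """One AdaBoost.MH reweighting step (illustrative sketch).

    D: m x k matrix of weights summing to 1 (distribution over example/label pairs)
    Y: m x k matrix of +1/-1 signs, Y[i][l] = +1 iff label l belongs to example i
    H: m x k matrix of real-valued weak predictions h_t(x_i, l)
    Returns (D_next, Z), where Z is the normalization factor of the round.
    """
    unnorm = [[D[i][l] * math.exp(-Y[i][l] * H[i][l]) for l in range(len(D[i]))]
              for i in range(len(D))]
    Z = sum(sum(row) for row in unnorm)
    D_next = [[w / Z for w in row] for row in unnorm]
    return D_next, Z

# Toy problem: 2 examples, 2 labels, uniform initial distribution.
D0 = [[0.25, 0.25], [0.25, 0.25]]
Y = [[+1, -1], [-1, +1]]
H = [[0.5, -0.5], [0.5, -0.5]]  # correct on example 0, wrong on example 1
D1, Z = boosting_round(D0, Y, H)
```

In this toy run the pairs of the misclassified example gain weight and those of the correctly classified example lose weight, which is the behaviour the weak learner relies on in the next round.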
2.1 Weak Hypotheses for WSD
As in previous work, very simple weak hypotheses are used, which test the value of a boolean predicate and make a prediction based on that value. The predicates used, which are described in section 3.1, are of the form “$f = v$”, where $f$ is a feature and $v$ is a value (e.g.: “previous_word = hospital”). Formally, based on a given predicate $p$, our interest lies in weak hypotheses $h$ which make predictions of the form:
$$h(x, l) = \begin{cases} c_{1l} & \text{if } p \text{ holds in } x \\ c_{0l} & \text{otherwise} \end{cases}$$
where the $c_{jl}$’s are real numbers.
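For a given predicate $p$, and bearing the minimization of $Z_t$ in mind, the values $c_{jl}$ should be calculated as follows. Let $X_1$ be the subset of examples for which the predicate $p$ holds and let $X_0$ be the subset of examples for which the predicate $p$ does not hold. Let $[\![\pi]\!]$, for any predicate $\pi$, be 1 if $\pi$ holds and 0 otherwise. Given the current distribution $D_t$, the following real numbers are calculated for each possible label $l$, for $j \in \{0, 1\}$, and for $b \in \{+1, -1\}$:
$$W_b^{jl} = \sum_{i=1}^{m} D_t(i, l)\, [\![\, x_i \in X_j \wedge Y_i[l] = b \,]\!]$$
That is, $W_{+1}^{jl}$ ($W_{-1}^{jl}$) is the weight (with respect to distribution $D_t$) of the training examples in partition $X_j$ which are (or are not) labelled by $l$.
As shown by Schapire and Singer, $Z_t$ is minimized for a particular predicate by choosing:
$$c_{jl} = \frac{1}{2} \ln\left(\frac{W_{+1}^{jl}}{W_{-1}^{jl}}\right)$$
These settings imply that:
$$Z_t = 2 \sum_{j \in \{0,1\}} \sum_{l \in \mathcal{Y}} \sqrt{W_{+1}^{jl}\, W_{-1}^{jl}}$$
Thus, the predicate $p$ chosen is that for which the value of $Z_t$ is smallest.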
Very small or zero values for the parameters $W_b^{jl}$ cause the $c_{jl}$ predictions to be large or infinite in magnitude. In practice, such large predictions may cause numerical problems for the algorithm, and seem to increase the tendency to overfit. As previously suggested in the boosting literature, smoothed values for $c_{jl}$ have been used.
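The search performed by the weak learner can be sketched in a few lines of Python. The sketch below scores every binary attribute by its $Z$ value and returns the best one together with its smoothed predictions. Two points are assumptions rather than details given in this text: the smoothing adds a constant `eps` to both weights inside the logarithm (Schapire and Singer suggest $\varepsilon = 1/(mk)$), and the names `best_predicate`, `X`, `Y`, `D` are hypothetical.

```python
import math

def best_predicate(X, Y, D, eps):
    """Pick the binary attribute whose weak rule minimizes Z (illustrative sketch).

    X: m x n matrix of 0/1 values; the predicate for attribute f holds iff X[i][f] == 1
    Y: m x k matrix of +1/-1 signs; D: m x k matrix of weights
    eps: smoothing constant (assumed form: added to both weights in the log ratio)
    Returns (attribute index, c, Z) with c[j][l] the prediction for partition X_j.
    """
    m, n, k = len(X), len(X[0]), len(Y[0])
    best = None
    for f in range(n):
        # W[j][l] = [weight labelled by l, weight not labelled by l] within partition j
        W = [[[0.0, 0.0] for _ in range(k)] for _ in range(2)]
        for i in range(m):
            j = X[i][f]
            for l in range(k):
                W[j][l][0 if Y[i][l] > 0 else 1] += D[i][l]
        Z = 2.0 * sum(math.sqrt(wp * wn) for blk in W for wp, wn in blk)
        if best is None or Z < best[2]:
            c = [[0.5 * math.log((wp + eps) / (wn + eps)) for wp, wn in blk]
                 for blk in W]
            best = (f, c, Z)
    return best

# Toy data: attribute 0 separates the two senses perfectly, attribute 1 is noise.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
Y = [[+1, -1], [+1, -1], [-1, +1], [-1, +1]]
D = [[1.0 / 8] * 2 for _ in range(4)]
f, c, Z = best_predicate(X, Y, D, eps=1.0 / 8)
```

On the toy data the perfectly separating attribute reaches $Z = 0$ while the noisy one gives $Z = 1$, so the former is selected; without smoothing its predictions would be infinite.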
3 Applying Boosting to WSD
In our experiments the boosting approach has been evaluated using the DSO corpus, which contains 192,800 semantically annotated occurrences of 121 nouns and 70 verbs (the examples are tagged with a set of labels which correspond, with some minor changes, to the senses of WordNet 1.5). These correspond to the most frequent and ambiguous English words. The corpus was collected by Ng and colleagues and is available from the Linguistic Data Consortium (LDC).
For our first experiments, a group of 15 words (10 nouns and 5 verbs) which frequently appear in the related literature has been selected. These words are described in the left hand–side of table 1. Since our goal is to acquire a classifier for each word, each row represents a classification problem. The number of classes (senses) ranges from 4 to 30, the number of training examples from 373 to 1,500 and the number of attributes from 1,420 to 5,181. The column on the right hand–side of table 1 shows the percentage of the most frequent sense for each word, i.e. the accuracy that a naive “Most–Frequent–Sense” classifier would obtain.
The binary-valued attributes used for describing the examples correspond to the binarization of seven features referring to a very narrow linguistic context. Let “$w_{-2}\ w_{-1}\ w\ w_{+1}\ w_{+2}$” be the context of 5 consecutive words around the word $w$ to be disambiguated. The seven features mentioned above are exactly those used in previous work on the same corpus: $w_{-2}$, $w_{-1}$, $w_{+1}$, $w_{+2}$, $(w_{-2}, w_{-1})$, $(w_{-1}, w_{+1})$, and $(w_{+1}, w_{+2})$, where the last three correspond to collocations of two consecutive words.
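As an illustration of the binarization, the following Python sketch turns a tokenized sentence into the set of active “feature = value” attributes for one target word. It assumes that the seven features are the four neighbouring words and the three two-word collocations in a window of two words on each side; the attribute naming scheme, the function name `local_features` and the padding symbol are hypothetical.

```python
def local_features(tokens, pos):
    """Return the seven local-context attributes of tokens[pos] as 'name=value' strings."""
    def w(offset):
        j = pos + offset
        # "<pad>" marks positions outside the sentence (an assumption of this sketch)
        return tokens[j] if 0 <= j < len(tokens) else "<pad>"

    single = {"w-2": w(-2), "w-1": w(-1), "w+1": w(1), "w+2": w(2)}
    pairs = {
        "w-2_w-1": (w(-2), w(-1)),
        "w-1_w+1": (w(-1), w(1)),
        "w+1_w+2": (w(1), w(2)),
    }
    feats = {f"{name}={value}" for name, value in single.items()}
    feats |= {f"{name}={a}_{b}" for name, (a, b) in pairs.items()}
    return feats

tokens = ["he", "was", "taken", "to", "hospital", "after", "the", "accident"]
feats = local_features(tokens, tokens.index("hospital"))
```

Each distinct string observed in the training set becomes one binary attribute, which is why a handful of features yields thousands of attributes per word.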
3.2 Benchmark Algorithms and Experimental Methodology
AdaBoost.MH has been compared to the following algorithms:
Naive Bayes (NB). The naive Bayesian classifier has been used in its most classical setting. To avoid the effect of zero counts when estimating the conditional probabilities of the model, a very simple, previously proposed smoothing technique has been used.
Exemplar-based learning (EB). In our implementation, all examples are stored in memory and the classification of a new example is based on a $k$-NN algorithm using Hamming distance to measure closeness (in doing so, all examples are examined). If $k$ is greater than 1, the resulting sense is the weighted majority sense of the $k$ nearest neighbours (each example votes for its sense with a strength proportional to its closeness to the test example). Ties are resolved in favour of the most frequent sense among all those tied.
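The exemplar-based classifier described above can be sketched as follows. Examples are represented by their sets of active binary attributes, so the Hamming distance is the size of the symmetric difference. The text only states that votes are proportional to closeness; the particular weight $1/(1+d)$ used here, like the function name `knn_sense`, is an assumption of this sketch.

```python
from collections import Counter, defaultdict

def knn_sense(train, query, k):
    """Weighted k-NN over binary attributes (illustrative sketch).

    train: list of (set_of_active_attributes, sense); query: set_of_active_attributes
    """
    freq = Counter(sense for _, sense in train)
    # Hamming distance between binary vectors = size of the symmetric difference
    nearest = sorted(train, key=lambda ex: len(ex[0] ^ query))[:k]
    votes = defaultdict(float)
    for attrs, sense in nearest:
        votes[sense] += 1.0 / (1.0 + len(attrs ^ query))  # assumed closeness weight
    top = max(votes.values())
    tied = [s for s, v in votes.items() if v == top]
    # Ties go to the sense that is most frequent in the training set
    return max(tied, key=lambda s: freq[s])

train = [({"a", "b"}, "s1"), ({"a", "c"}, "s1"),
         ({"x", "y"}, "s2"), ({"a", "b", "c"}, "s2")]
```

For instance, `knn_sense(train, {"a", "b"}, 3)` lets the exact match and one further neighbour outvote the single `s2` neighbour.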
The comparison of algorithms has been performed in a series of controlled experiments using exactly the same training and test sets for each method. The experimental methodology consisted of a 10-fold cross-validation. All accuracy/error rate figures appearing in the paper are averaged over the results of the 10 folds. The statistical tests of significance have been performed using a 10-fold cross-validation paired Student’s t-test at a fixed confidence level.
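For reference, the statistic of a paired t-test over the 10 folds can be computed as in the sketch below. The per-fold accuracies are invented for illustration only, and the function name `paired_t` is hypothetical; the resulting value would be compared with the critical value of Student’s t distribution with 9 degrees of freedom at the chosen confidence level.

```python
import math

def paired_t(acc_a, acc_b):
    """Paired t statistic for per-fold accuracies of two methods on the same folds."""
    d = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # unbiased sample variance
    return mean / math.sqrt(var / n)

# Made-up accuracies (%) of two methods on the same 10 folds
acc_a = [71, 72, 70, 73, 71, 72, 70, 73, 71, 72]
acc_b = [69, 70, 69, 70, 69, 70, 69, 70, 69, 70]
t = paired_t(acc_a, acc_b)
```

The test is paired because both methods are evaluated on identical folds, so only the fold-wise differences enter the statistic. The sketch does not handle the degenerate case in which all differences are equal.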
The accompanying figure shows the error rate curve of AdaBoost.MH, averaged over the 15 reference words, for an increasing number of weak rules per word. This plot shows that the error obtained by AdaBoost.MH is lower than those obtained by NB and EB ($k$=15 is the best choice for that parameter from a number of tests between $k$=1 and $k$=30) for a number of rules above 100. It also shows that the error rate decreases slightly and monotonically as it approaches the maximum number of rules reported (the maximum number of rounds considered is 750, merely for efficiency reasons).
Figure: Error rate of AdaBoost.MH related to the number of weak rules.
According to the plot of the error rate against the number of weak rules, no overfitting is observed while increasing the number of rules per word. Although it seems that the best strategy could be “learn as many rules as possible”, it has been shown that the number of rounds must be determined individually for each word, since words behave differently. The number of rounds can be adjusted by cross-validation on the training set, as has been suggested elsewhere. However, in our case, this cross-validation inside the cross-validation of the general experiment would generate a prohibitive overhead. Instead, a very simple stopping criterion (sc) has been used, which consists in stopping the acquisition of weak rules whenever the error rate on the training set falls below 5%, with an upper bound of 750 rules. This variant obtained results comparable to those of AdaBoost.MH run for the full 750 rounds, while generating only 370.2 weak rules per word on average, which represents a very moderate storage requirement for the combined classifiers.
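The stopping criterion amounts to a small change in the boosting loop, sketched below in Python. The two callables are placeholders for the real rule acquisition and training-error computation; their names and signatures are assumptions of this sketch.

```python
def learn_with_stopping(learn_one_rule, training_error, max_rules=750, threshold=0.05):
    """Acquire weak rules until the training error falls below the threshold.

    learn_one_rule(rules) -> next weak rule, given the rules learned so far
    training_error(rules) -> error rate of the combined classifier on the training set
    """
    rules = []
    while len(rules) < max_rules:
        rules.append(learn_one_rule(rules))
        if training_error(rules) < threshold:
            break
    return rules

# Stub learner whose training error shrinks by 10% with every new rule
rules = learn_with_stopping(lambda rs: len(rs), lambda rs: 0.5 * 0.9 ** len(rs))
```

With the stub above the loop stops well before the upper bound; a word whose training error never drops below the threshold would simply receive the maximum number of rules.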
The numerical information corresponding to this experiment is included in table 1. This table shows the accuracy results, detailed for each word, of the benchmark algorithms and of the two boosting variants. The best result for each word is printed in boldface.
Table 1: Per-word statistics (number of senses, examples and attributes) and accuracy (%) of each method.
As can be seen, in 14 out of 15 cases the best results correspond to the boosting algorithms. When comparing global results, the accuracies of both boosting variants (with and without the stopping criterion) are significantly greater than those of any of the other methods. Finally, note that the accuracies of NB and EB are comparable (as suggested in previous work), and that the use of $k$’s greater than 1 is crucial for making Exemplar-based learning competitive on WSD.
4 Making Boosting Practical for WSD
Up to now, it has been seen that AdaBoost.MH is a simple and competitive algorithm for the WSD task. It achieves an accuracy superior to that of the Naive Bayes and Exemplar-based algorithms tested in this paper. However, AdaBoost.MH has the drawback of its computational cost, which prevents the algorithm from scaling properly to real domains of thousands of words.
The space and time-per-round requirements of AdaBoost.MH are $O(mk)$ (recall that $m$ is the number of training examples and $k$ the number of senses), not including the call to the weak learner. This cost is unavoidable since AdaBoost.MH is inherently sequential. That is, in order to learn the ($t$+1)-th weak rule it needs the calculation of the $t$-th weak rule, which properly updates the matrix $D_t$. Further, inside the WeakLearner, there is another iterative process that examines all attributes, one by one, so as to decide which is the one that minimizes $Z_t$. Since there are thousands of attributes, this is also a time-consuming part, which can be straightforwardly sped up either by reducing the number of attributes or by relaxing the need to examine all attributes at each iteration.
4.1 Accelerating the WeakLearner
Four methods have been tested in order to reduce the cost of searching for weak rules. The first three, which consist of aggressively reducing the feature space, are frequently applied in Text Categorization. The fourth consists of reducing the number of attributes that are examined at each round of the boosting algorithm.
Frequency filtering (Freq): This method consists in simply discarding those features corresponding to events that occur fewer than $N$ times in the training corpus. The idea behind that criterion is that frequent events are more informative than rare ones.
Local frequency filtering (LFreq): This method works similarly to Freq but considers the frequency of events locally, at the sense level. More particularly, it selects the $N$ most frequent features of each sense.
RLM ranking: This third method consists in ranking all attributes according to the RLM distance measure and selecting the $N$ most relevant features. This measure has been commonly used for attribute selection in decision tree induction algorithms (the RLM distance belongs to the distance-based and information-based families of attribute selection functions; it has been selected because it showed better performance than seven other alternatives in an experiment of decision tree induction for PoS tagging).
LazyBoosting: The last method does not filter out any attribute but reduces the number of those that are examined at each iteration of the boosting algorithm. More specifically, a small proportion $p$ of attributes are randomly selected and the best weak rule is selected among them. The idea behind this method is that if the proportion $p$ is not too small, a sufficiently good rule can probably be found at each iteration. Besides, the chance for a good rule to appear in the whole learning process is very high. Another important characteristic is that no attribute needs to be discarded, and so the risk of eliminating relevant attributes is avoided (this method is called LazyBoosting in reference to the work by Samuel and colleagues, who applied the same technique for accelerating the learning algorithm in a Dialogue Act tagging system).
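The random selection step of LazyBoosting can be sketched as follows. The function names and the `score` callable (assumed to return the $Z$ value of the rule built on a given attribute, lower being better) are hypothetical; only the idea of examining a random proportion of the attributes at each round comes from the description above.

```python
import random

def lazy_candidates(n_attributes, proportion, rng):
    """Draw the random subset of attribute indices examined in one boosting round."""
    size = max(1, round(n_attributes * proportion))
    return rng.sample(range(n_attributes), size)

def lazy_best(score, n_attributes, proportion, rng):
    """Return the best attribute among the sampled candidates (lowest score wins)."""
    return min(lazy_candidates(n_attributes, proportion, rng), key=score)

rng = random.Random(0)
candidates = lazy_candidates(1000, 0.10, rng)
```

Under uniform sampling, the probability that a given attribute is never examined in $T$ rounds is $(1 - p)^T$; with a proportion of 10% and 250 rounds this is below $10^{-11}$, which supports the claim that good rules are very likely to be found at some point in the process.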
The four methods above have been compared on the set of 15 reference words. The accompanying figure contains the average error-rate curves obtained by the four variants at increasing levels of attribute reduction. The top horizontal line corresponds to a baseline error rate, while the bottom horizontal line stands for the error rate of AdaBoost.MH working with all attributes. The results shown in the figure are calculated by running the boosting algorithm for 250 rounds for each word.
Figure: Error rate obtained by the four methods, at 250 weak rules per word, with respect to the percentage of rejected attributes.
The main conclusions that can be drawn are the following:
All methods seem to work quite well since no important degradation is observed in performance for values lower than 95% in rejected attributes. This may indicate that there are many irrelevant or highly dependent attributes in our domain.
LFreq is slightly better than Freq, indicating a preference to make frequency counts for each sense rather than globally.
The more informed RLM ranking performs better than the frequency-based reduction methods Freq and LFreq.
LazyBoosting is better than all other methods, confirming our expectations: it is worth keeping all information provided by the features. In this case, acceptable performance is obtained even if only 1% of the attributes is explored when looking for a weak rule. The value of 10%, for which LazyBoosting still achieves the same performance and runs about 7 times faster than AdaBoost.MH working with all attributes, will be selected for the experiments in section 5.
The LazyBoosting algorithm has been tested on the full semantically annotated DSO corpus, examining 10% of the attributes at each round and using the same stopping criterion described in section 3.3. The average number of senses is 7.2 for nouns, 12.6 for verbs, and 9.2 overall. The average number of training examples is 933.9 for nouns, 938.7 for verbs, and 935.6 overall.
The LazyBoosting algorithm learned an average of 381.1 rules per word, and took about 4 days of cpu time to complete (the current implementation is written in PERL-5.003 and was run on a SUN UltraSparc2 machine with 194Mb of RAM). It has to be noted that this time includes the cross-validation overhead. Eliminating it, it is estimated that 4 cpu days would be the time necessary for acquiring a boosting-based word sense disambiguation system covering about 2,000 words.
This LazyBoosting variant has been compared again to the benchmark algorithms using the 10-fold cross-validation methodology described in section 3.2. The average accuracy results are reported in the left hand-side of table 2. The best figures correspond to the boosting algorithm (column AB), and again, the differences are statistically significant using the 10-fold cross-validation paired t-test.
|Words||MFS||NB||EB||AB||AB vs. NB||AB vs. EB|
|Nouns (121)||56.4||68.7||68.0||70.8||99(51) – 1 – 21(3)||100(68) – 5 – 16(1)|
|Verbs (70)||46.7||64.8||64.9||67.5||63(35) – 1 – 6(2)||64(39) – 2 – 4(0)|
|Average (191)||52.3||67.1||66.7||69.5||162(86) – 2 – 27(5)||164(107) – 7 – 20(1)|
The right hand-side of the table shows the comparison of AB versus the NB and EB algorithms, respectively. Each cell contains the number of wins, ties, and losses of the competing algorithms. The counts of statistically significant differences are included in brackets. It is important to point out that EB significantly beats AB in only one case, while NB does so in five cases. Conversely, a significant superiority of AB over EB and NB is observed in 107 and 86 cases, respectively.
6 Conclusions and Future Work
In the present work, Schapire and Singer’s AdaBoost.MH algorithm has been evaluated on the word sense disambiguation task, which is one of the hardest open problems in Natural Language Processing. As has been shown, the boosting approach outperforms Naive Bayes and Exemplar-based learning, which represent state-of-the-art accuracy on supervised WSD. In addition, a faster variant, called LazyBoosting, has been suggested and tested. This variant allows the scaling of the algorithm to broad-coverage real domains, and is as accurate as AdaBoost.MH. Further details can be found in an extended version of this paper.
Future work is planned to be done in the following directions:
Extensively evaluate AdaBoost.MH on the WSD task. This would include taking into account additional attributes, and testing the algorithms on other manually annotated corpora, and especially on sense-tagged corpora automatically obtained from the Internet.
Confirm the validity of the LazyBoosting approach on other language learning tasks in which AdaBoost.MH works well, e.g.: Text Categorization.
It is known that mislabelled examples resulting from annotation errors tend to be hard to classify correctly and, therefore, tend to have large weights in the final distribution. This observation makes it possible both to identify the noisy examples and to use boosting as a way to improve data quality [26, 1]. It is suspected that the corpus used in the current work is very noisy, so it could be worth using boosting to try to improve it.
[1] Abney, S., Schapire, R.E. and Singer, Y.: Boosting Applied to Tagging and PP-attachment. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.
[2] Bauer, E. and Kohavi, R.: An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting and Variants. Machine Learning Journal. Special issue on IMLM for Improving and Scaling Machine Learning Algorithms, 1999.
[3] Breiman, L.: Arcing Classifiers. The Annals of Statistics, 26(3), 1998.
[4] Duda, R. O. and Hart, P. E.: Pattern Classification and Scene Analysis. Wiley, New York, 1973.
[5] Dietterich, T.G.: An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization. Machine Learning (to appear).
[6] Engelson, S.P. and Dagan, I.: Minimizing Manual Annotation Cost in Supervised Training from Corpora. In S. Wermter, E. Riloff and G. Scheler, editors, Connectionist, Statistical and Symbolic Approaches to Learning for Natural Language Processing, LNAI, 1040. Springer, 1996.
[7] Escudero, G., Màrquez, L. and Rigau, G.: Boosting Applied to Word Sense Disambiguation. Technical Report LSI-00-3-R, LSI Department, UPC, 2000.
[8] Freund, Y. and Schapire, R.E.: Experiments with a New Boosting Algorithm. In Procs. of the 13th International Conference on Machine Learning, ICML, 1996.
[9] Freund, Y. and Schapire, R.E.: A Decision-theoretic Generalization of On-line Learning and an Application to Boosting. Journal of Computer and System Sciences, 55(1), 1997.
[10] Fujii, A., Inui, K., Tokunaga, T. and Tanaka, H.: Selective Sampling for Example-based Word Sense Disambiguation. Computational Linguistics, 24(4), ACL, 1998.
[11] Ide, N. and Véronis, J.: Introduction to the Special Issue on Word Sense Disambiguation: The State of the Art. Computational Linguistics, 24(1), ACL, 1998.
[12] Leacock, C., Chodorow, M. and Miller, G.A.: Using Corpus Statistics and WordNet Relations for Sense Identification. Computational Linguistics, 24(1), ACL, 1998.
[13] López de Mántaras, R.: A Distance-based Attribute Selection Measure for Decision Tree Induction. Machine Learning, 6(1), 1991.
[14] Màrquez, L.: Part-of-speech Tagging: A Machine Learning Approach based on Decision Trees. PhD Thesis, LSI Department, UPC, 1999.
[15] Mihalcea, R. and Moldovan, I.: An Automatic Method for Generating Sense Tagged Corpora. In Proceedings of the 16th National Conference on Artificial Intelligence, AAAI, 1999.
[16] Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D. and Miller, K.: Five Papers on WordNet. Special Issue of International Journal of Lexicography, 3(4), 1990.
[17] Mooney, R.J.: Comparative Experiments on Disambiguating Word Senses: An Illustration of the Role of Bias in Machine Learning. In Proceedings of the 1st Conference on Empirical Methods in Natural Language Processing, EMNLP, 1996.
[18] Ng, H.T. and Lee, H.B.: Integrating Multiple Knowledge Sources to Disambiguate Word Senses: An Exemplar-based Approach. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, ACL, 1996.
[19] Ng, H.T.: Exemplar-based Word Sense Disambiguation: Some Recent Improvements. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing, EMNLP, 1997.
[20] Ng, H.T.: Getting Serious about Word Sense Disambiguation. In Proceedings of the SIGLEX Workshop “Tagging Text with Lexical Semantics: Why, What and How?”, 1997.
[21] Ng, H.T., Chung, Y. L. and Shou, K. F.: A Case Study on Inter-Annotation Agreement for WSD. In Proceedings of the SIGLEX Workshop “Standardizing Lexical Resources”, Maryland, USA, 1999.
[22] Pedersen, T. and Bruce, R.: Knowledge Lean Word-Sense Disambiguation. In Proceedings of the 15th National Conference on Artificial Intelligence, 1998.
[23] Quinlan, J.R.: Bagging, Boosting and C4.5. In Proceedings of the 13th National Conference on Artificial Intelligence, AAAI, 1996.
[24] Samuel, K.: Lazy Transformation-Based Learning. In Proceedings of the 11th International Florida AI Research Symposium Conference, 1998.
[25] Schapire, R.E. and Singer, Y.: Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning (to appear).
[26] Schapire, R.E. and Singer, Y.: BoosTexter: A Boosting-based System for Text Categorization. Machine Learning (to appear).
[27] Towell, G. and Voorhees, E.M.: Disambiguating Highly Ambiguous Words. Computational Linguistics, 24(1), ACL, 1998.
[28] Yarowsky, D.: Decision Lists for Lexical Ambiguity Resolution: Application to Accent Restoration in Spanish and French. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, ACL, 1994.