Interactive Semantic Featuring for Text Classification

06/24/2016 ∙ by Camille Jandot, et al. ∙ Télécom ParisTech Facebook Microsoft 0

In text classification, dictionaries can be used to define human-comprehensible features. We propose an improvement to dictionary features called smoothed dictionary features. These features recognize document contexts instead of n-grams. We describe a principled methodology to solicit dictionary features from a teacher, and present results showing that models built using these human-comprehensible features are competitive with models trained with Bag of Words features.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Featuring for machine learning (ML) is a way for the model builder, or the

teacher, to describe the data and encode pertinent signals to the underlying learner so that the machine can “see” the data in the way the teacher intended. In the case of text classification, the “Bag of Words” (BoW) representation–where each feature represents a single word or n-gram in the document–is a popular choice; these features are usually generated automatically from text corpora and require no input from the teacher. Although BoW features have shown to be very successful in experiments and in practice Croft et al. (2010); Scott & Matwin (1999), the dimensionality of the space (i.e., the number of words used) is so large that the resulting model is hard to understand and diagnose for humans. Furthermore, the BoW features do not take the order of words into account, which means that it necessarily loses information about its local context, semantic meaning, or syntactic structure Scott & Matwin (1999) (e.g. “social security in media” and “security in social media” are indistinguishable by BoW).

As an alternative to BoW features, dictionaries (or lexicons) allow the teacher to provide further supervision. Dictionaries contain multiple n-grams that convey the same meaning or are semantically related to each other. Dictionaries can be generated manually by the domain expert Taboada et al. (2011) or by leveraging external sources such as WordNet Fellbaum (1998). By using manually-created dictionaries, we can improve generalization Williams et al. (2015) and interpretability of the resulting model; furthermore, because the dimensionality of the feature space can be much lower, we require far fewer examples to train the models. Dictionary features also have a number of drawbacks: they can be time-consuming and tedious for teachers to construct, they can result in low-recall models if the teacher omits an important concept, and they have the same lack-of-ordering-information limitation as discussed above for BoW features.

In this paper, we propose an improvement to the dictionary feature, called the smoothed dictionary features, that takes into account the context in which a word appears. Instead of recognizing the explicit occurrences of n-grams of the dictionaries in the document, the smoothed dictionary features leverage the probability that a word belongs to the dictionary given its context (i.e. the words that co-appear with the considered word). We describe a procedure that interleaves featuring and labeling and show that models built using the resulting features compare favorably to models built using BoW features for two classification tasks.

2 Semantic Featuring

In this section, we describe two ways for a teacher to communicate semantic meaning to the model, dictionary features and smoothed dictionary features, and an interactive teaching process that can be used to solicit these features.

2.1 Dictionary Features

A dictionary is a set of n-gram terms . For a given text document, we say that the dictionary matches at a position if the document contains some at that position. A dictionary feature for a document is an aggregate count of the dictionary matches within that document. We assume for the remainder of the paper that, for dictionary with matches in document , the feature value is defined as

Note that the feature value is not a function of the identity of the matching n-grams. This means that care must be taken when constructing the dictionary.

On the positive side, this allows faster generalization by considering the presence of all n-grams to be equivalent. On the negative side, this equivalence multiplies the risk that a dictionary would fire outside of its intended scope. In particular, a single n-gram can derail the semantic intent of the dictionary. For example, a dictionary containing the months of the year will include the term “May”, which will match on “May I help you”.

Dictionaries, sometimes referred to as lexicons or gazetteers

, have shown promise for both sentiment analysis

Taboada et al. (2011)

, for extraction tasks such as named-entity recognition

Smith & Osborne (2006), and for utterance analysis tasks Williams et al. (2015).

2.2 Smoothed Dictionary Features

We propose an improvement to the dictionary feature that, instead of using literal dictionary matches, predicts the probability that there is a dictionary match according to a context model. By using the contexts instead of dictionary matches, our model can better capture the semantics instead of the syntax of the dictionary terms; the corresponding feature will match n-grams that are not in the dictionary but have similar contexts, and the feature will be suppressed for n-grams that are in the dictionary but have unrelated contexts.

Our context model is a logistic-regression model built to output, for any given n-gram

within a document, the probability that belongs to the dictionary . We train the model by selecting positive and negative instances of the dictionary terms, in context, from a large, unlabeled corpus of text. Each context feature of the context model corresponds to the sequence of unigrams within a context window near , with the requirement that these unigrams do not overlap with . For example, one context window might consist of the five tokens following . Given the context window, let denote the set of unigrams in that window for some document and position of ; we define the value of the corresponding context feature

via the log odds


We use a naïve-Bayes classifier for the probabilities in Equation

1. In our experiments, we used ten context features corresponding to non-overlapping windows of increasing size around : a size-one window immediately before and after , a size-two window immediately before and after the two respective size-one windows, a size-four window before and after the size-two windows, and so on.

The logistic-regression model combines the context-window features from Equation 1 together via:

By combining the naïve-Bayes classifier scores together with logistic regression, our model benefits from the efficiency of naïve-Bayes for both training and runtime–requiring only pair-wise counts to produce the scores–without needing to rely its independence assumptions in the final prediction.

Given the model and a threshold , we say that the dictionary smooth matches at a position if, for the n-gram at that position we have . A smoothed dictionary feature for a document is an aggregate count of the smooth dictionary matches. As above we assume for the remainder of the paper that, for a dictionary with smooth matches in document , the smoothed feature is defined as

The threshold can be set manually by the teacher by looking at sorted lists of contexts as we discuss below. In our experiments, however, we used an unsupervised method of setting such that we had a constant average number of smoothed matches per document across a large corpus of unlabeled documents.

In Table 1, we show how, unlike dictionary features, the smoothed dictionary features can distinguish homonyms. The table shows a list of contexts from the the Open Directory Project (ODP) data set ( where the unigram “may” occurs, ranked by decreasing probability that “may” occurs in the dictionary (containing all 12 months, including “may”) according to the context model trained as described above. A teacher would likely to manually choose a value between and as the threshold .

Prob. Words before Targ. Words after

september 2008 august 2008 july 2008 june 2008 may 2008 april 2008 march 2008 february 2008 january
as bachelor of theology when , on 24 may , , 1323 , john xxii at the request
were arrested . the following year , in may , four thousand huguenots assembled on the seine
0 like 6 5 blog gear about rss may 13 macro madness usa - washington posted by
to the new syndicomm web site as you may , or may not have noticed already , i
being phased out . the mp command set may or may not ( according to the implementation
, not be married to one another , who may or may not both be the biological parent
many cases to ‘ art ’ films that may appease only the film connaisseur ’ and
based on the filing system date , which may or may not be accurate ) arm object
convenience , in the same spirit as you may read a journal or a proceedings article in

  • Note: The table above displays the occurrences of the target term “may” and the context in which it appears, sorted by probability that the considered occurrence is part of the dictionary. If the threshold of the dictionary is set to , the percentage of words that will trigger the dictionary is .

Table 1: “May” in context for

The context model can also be used to recommend n-grams to add to existing dictionaries. In particular, for any n-gram , we can average the context-model predictions over all instances of ; we can then suggest to add the highest-scoring n-grams to . To illustrate this, we started with an incomplete dictionary of months = [“january”; “february”; “march”] and averaged the context-model predictions for all unigrams; the top-scoring unigrams were “april”, “august”, “october”, “september”, “july”, “november”, “june”, and “december”, followed by some misspelled months, abbreviations of months, “may” and months in German and French.

2.3 Soliciting Features From the Teacher

When training models with BoW features, the teacher needs only to supply labels (likely through an active-learning strategy

Settles (2009)) to train the model, as the feature space is pre-defined. If our goal is to build classifiers with semantic features such as dictionary or smoothed dictionary features, we need a methodology to solicit them from the teacher. Although the teacher could pre-define all the semantic features before concentrating on labeling of documents, we take instead the approach of interleaving featuring and labeling. This allows the teacher to re-examine feature decisions during the labeling effort, which can prevent the teacher from making incorrect assumptions about the data distribution and classification errors.

1:  repeat
2:     if (Not feature blind) then
3:         Sample, label, and add to training
4:     else
5:         Add or modify feature
6:     end if
7:     Train the model
8:  until 
Algorithm 1 Interactive Teaching Process

Consider the teaching process shown in Algorithm 1. At each iteration, the teacher samples for an example to add to the training set. Each iteration will train a new model, and if the resulting model is confused by the new example due to missing features, the model has a ”feature-blindness”. The teacher then adds a feature to help distinguish the confusing patterns until the blindness is resolved.

A good teaching strategy is one that uses this teaching process with two key principles. (1) Sample for confusion. During the Sample step, a good teacher checks for generalization ability by searching for examples that are likely to confuse the student. This active sampling strategy helps reduce the number of labels. (2) Feature for generalization. During the Feature step, a good teacher chooses features that will not only resolve existing conflicts but also resolve future conflicts, thus improving generalization on unseen data. For instance, a feature blindness for “piano” could benefit from adding an entire concept class for “musical instruments”. Features chosen in this fashion are not likely to lead to overtraining because they are only added when necessary. The resulting features are interpretable because there are relatively small in number (i.e. limited to teacher’s interaction) and they are expressed directly by the teacher.

3 Experiments

In this section we compare classifiers built using (1) BoW features, (2) dictionary features, and (3) smoothed dictionary features. We use the documents (web pages) and labels provided by ODP data set, which we split randomly into two sets and , containing respectively and documents. The first set is used to build the dictionaries and to train the models, and the second set is used for testing. We chose two classification tasks, health and music, that are similar in their positive distribution in ( and documents are positive, respectively).

3.1 Dictionary Generation

We built dictionaries as described in Section 2.3 using an interactive learning interface Simard et al. (2014) that enables us to both label and add (not smoothed) dictionary features. We first created an initial dictionary that contains descriptive words about the main classification task (e.g., for music, we used the dictionary: [“music”; “musician”; “musical”]). We then started labeling items by searching the unlabeled corpus for certain keywords or by soliciting suggested items to label. Suggested items were selected using active learning strategies designed to identify documents with potential feature deficiencies, including uncertainty sampling Settles (2009) and disagreement sampling where documents were selected based on the disagreement between the current model and a model trained with BoW features. Throughout the process, the model was trained in the background every time a few labels were submitted.

When an error occurs in the training data (i.e., the classifier disagrees with our label for an example that is in the training set), we tried to identify a feature blindness and addressed the error either by refining an existing dictionary or adding a new one. We also refined existing dictionaries using the context models as described in Section 2.2; for example, the first suggested words for the dictionary [“piano”; “guitar”; “sax”; “bass”] were “bassoon”, “saxophone”, “timpani” (a synonym of kettledrums), “cello”, “tuba”, “oboe”, “clarinet”, “ibanez” (a well-known guitar brand), “viola” and “trombone”.

The process produced dictionaries for health within

hours (we estimate that

of time was spent adding features and labeling), and dictionaries for music within hour ( of which was spent adding features).

3.2 Model Training

For each of the three candidate feature sets and classification tasks, we trained a model using L2-regularized logistic regression. For the BoW features, we used TF-IDF weighting. In order to eliminate the potential effect of different featuring strategies in our training data, we retained only the features from the process described above and created a separate training data set for evaluation.

We initialized the common training set with positive and negative examples drawn from and then applied uncertainty sampling using the BoW classifier to iteratively choose a document whose predicted score is closest to to build the training set , totaling 300 documents with 50/50 class balance. Note that because this training set was constructed explicitly to improve the BoW model, the semantic-feature models are somewhat at a disadvantage.

Figure 1: Precision-Recall curves using BoW (blue), dictionary (orange), and smoothed dictionary (green) features for health (top) and music (bottom) classifiers.

Health Music
Feature AUC() ACC() AUC() ACC()
BoW 93.3 97.5 95.3 97.7
Dict 86.4 96.5 94.1 97.2
Smoo 95.1 97.3 96.7 97.4
Table 2: AUROC and accuracy of the classifiers

3.3 Results

Figure 1 and Table 2 illustrate that the model built using smoothed dictionary features performs competitively with BoW model. However, the compressed BoW (trained with L1 with negligible loss of performance) model uses 6932 (resp. 6999) for health (resp. music), which become intractable for humans to understand, while only 24 (resp. 27) weights are used for dictionaries. When we force BoW model to use the same number of words as dictionaries (189 and 185 words), we see that BoW performs much worse, as expected. The top 100 BoW words (sorted by weight magnitude) are easy to interpret. After this, the effect of removal or addition of words is very difficult to predict. Dictionaries are semantically linked to “feature blindness”. The effect of removal or addition is more predictable because it is linked to being the last feature that allows the model to distinguish two subsets of examples of different classes.

4 Discussion

In this paper, we introduced smoothed dictionary features as well as an interactive teaching process. While our preliminary experimental results shows that smoothed dictionaries perform competitively with BoW features, much work remains to be done. The use of smoothed dictionary features is one way to incorporate contextual information into the model. We need to perform a series of experiments comparing smoothed dictionary feature to other context models in terms of overall model performance as well as interpretability and cohesiveness of the features. We introduced Algorithm 1 as a process for gathering semantic features. It remains to be seen if constructing features in an interactive loop is indeed more efficient in terms of labeling effort. We also need to explore whether or not the features constructed in this process are more interpretable. In a followup experiment, instead of using a fixed , we would like to solicit the teacher for the appropriate threshold.


CJ is affiliated with Télécom ParisTech. Part of this work was done when DG was at Microsoft Research.


  • Croft et al. (2010) Croft, W Bruce, Metzler, Donald, and Strohman, Trevor. Search engines: Information retrieval in practice, volume 283. Addison-Wesley Reading, 2010.
  • Fellbaum (1998) Fellbaum, Christiane. WordNet. Wiley Online Library, 1998.
  • Scott & Matwin (1999) Scott, Sam and Matwin, Stan. Feature engineering for text classification. In ICML, volume 99, pp. 379–388. Citeseer, 1999.
  • Settles (2009) Settles, Burr. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009.
  • Simard et al. (2014) Simard, Patrice, Chickering, David, Lakshmiratan, Aparna, Charles, Denis, Bottou, Léon, Suarez, Carlos Garcia Jurado, Grangier, David, Amershi, Saleema, Verwey, Johan, and Suh, Jina. Ice: enabling non-experts to build models interactively for large-scale lopsided problems. arXiv preprint arXiv:1409.4814, 2014.
  • Smith & Osborne (2006) Smith, Andrew and Osborne, Miles. Using gazetteers in discriminative information extraction. In Proceedings of the Tenth Conference on Computational Natural Language Learning, pp. 133–140. Association for Computational Linguistics, 2006.
  • Taboada et al. (2011) Taboada, Maite, Brooke, Julian, Tofiloski, Milan, Voll, Kimberly, and Stede, Manfred. Lexicon-based methods for sentiment analysis. Computational linguistics, 37(2):267–307, 2011.
  • Williams et al. (2015) Williams, Jason D, Kamal, Eslam, Mokhtar Ashour, Hani Amr, Miller, Jessica, and Zweig, Geoff. Fast and easy language understanding for dialog systems with microsoft language understanding intelligent service (luis). In 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 159, 2015.