Visualizing textual models with in-text and word-as-pixel highlighting

06/20/2016 · Abram Handler, et al. · University of Massachusetts Amherst

We explore two techniques which use color to make sense of statistical text models. One method uses in-text annotations to illustrate a model's view of particular tokens in particular documents. Another uses a high-level, "words-as-pixels" graphic to display an entire corpus. Together, these methods offer both zoomed-in and zoomed-out perspectives into a model's understanding of text. We show how these interconnected methods help diagnose a classifier's poor performance on Twitter slang, and make sense of a topic model on historical political texts.


1 Introduction

Probabilistic models of text are a core technology for natural language processing. Such models typically link words or phrases with semantic categories, like classes or topics. When we analyze data with these models, we need to understand how the method interprets text in order to perform (1) exploratory and confirmatory data analysis and (2) error analysis for engineering improvements.


Previous work on interpreting and understanding text models has focused on summarizing text at the semantic or category level—for instance, by showing a list of the most probable words in a particular latent topic (Gardner et al., 2010; Chaney & Blei, 2013).

In this work, we emphasize that text is originally a sequence of symbols (characters or words), intended for a person to read. A system can provide insight into what a text model is thinking by showing a user the original text with automatic in-text annotations describing the model’s inferences (§3). Such annotations can be shown abstractly with a zoomed-out words-as-pixels view (§4) of text. We demonstrate our methods using topic models on political speeches and language classification on dialectal Twitter.

2 Models

For all models that we consider, a document consists of a sequence of symbols $w_1, \dots, w_T$. This could be a sequence of words, or a sequence of characters; we refer to elements in such a sequence as tokens (though §5 examines a character-based model). For a particular model, we define a token-level visual quantity of interest $q_t$ for position $t$, which corresponds to an interesting value in the model. These values are then encoded as visual attributes when displaying the original token sequence directly to the user (§3).

2.1 Token-level models (LDA)

First we consider models that define latent variables at the token level. For example, the latent Dirichlet allocation model of text (Blei et al., 2003) posits that a document arises from a weighting $\theta$ over topics, where each token position $t$ has a latent class $z_t$, indexing which word distribution is used to generate the word: $w_t \sim \phi_{z_t}$. We conventionally describe $\phi_k$ as a topic.

At a single token position, the posterior topic membership breaks down as a compromise between document prevalence versus lexical probability; LDA is able to learn interesting representations since individual documents tend to be about a subset of topics and individual topics tend to include a subset of the vocabulary. The probability of a given latent topic is:

$$P(z_t = k \mid w, \theta, \phi) \;\propto\; \theta_k \, \phi_{k, w_t}$$

We consider the vector of membership probabilities to be the visual quantity of interest, defining:

$$q_t \;\equiv\; P(z_t = \cdot \mid w, \theta, \phi)$$
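As a concrete sketch, this per-token quantity can be computed directly from learned parameters; the `theta` and `phi` values below are hypothetical toy numbers, not parameters fit to any corpus:

```python
import numpy as np

# Hypothetical learned LDA parameters for a 3-topic, 5-word toy model
# (illustrative numbers, not fit to any corpus).
theta = np.array([0.7, 0.2, 0.1])        # document's topic proportions
phi = np.array([                          # phi[k, v]: P(word v | topic k)
    [0.5, 0.2, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.5, 0.2, 0.1],
    [0.1, 0.1, 0.1, 0.2, 0.5],
])

def token_topic_posterior(theta, phi, w_t):
    """q_t: posterior over topics at one token position, proportional to
    document prevalence (theta) times lexical probability (phi)."""
    unnorm = theta * phi[:, w_t]
    return unnorm / unnorm.sum()

doc = [0, 2, 4]                           # token sequence as vocab indices
q = [token_topic_posterior(theta, phi, w) for w in doc]
```

Coloring each token by the argmax of its posterior vector (one hue per topic) then yields the in-text view of §3.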

Although we demonstrate our method using LDA, the same approach would apply to other common text models. For example, supervised sequence models (such as conditional random fields; Lafferty et al., 2001) also place tokens into semantic categories using token-level variables which can be visualized, as is often done in annotation interfaces for information extraction.111e.g. Brat: http://brat.nlplab.org/ Similarly, Karpathy et al. (2015) give an excellent demonstration of visualizing latent states of a character-level long short-term memory (LSTM) recurrent neural network using token-level, in-text annotations (like §3) to help understand a machine-learned model.

2.2 Token-level posterior impacts (MNB, LogReg)

Many models do not directly define random variables at the token-level, but sufficient statistics resulting from individual tokens have a clear interpretation in terms of how they affect inferences on model variables. An example is document classification, where the frequencies of words impact the posterior probability of the document class.

Concretely, we consider multinomial naive Bayes (MNB; McCallum & Nigam, 1998), whose generative assumption posits that each document has a discrete label $y$ (drawn from a prior distribution $P(y)$), and the document's tokens are independently generated from a single class-specific word distribution $\phi_y$.

Given learned parameters $\phi$ and $P(y)$, to classify a document we utilize the posterior

$$P(y \mid w) \;\propto\; P(y) \prod_t \phi_{y, w_t}$$

and calculate the posterior log-odds between classes $y_1$ and $y_2$:

$$\log \frac{P(y_1 \mid w)}{P(y_2 \mid w)} \;=\; \log \frac{P(y_1)}{P(y_2)} \;+\; \sum_t \log \frac{\phi_{y_1, w_t}}{\phi_{y_2, w_t}}$$

We restrict our attention to comparing the model's relative preferences for two classes $y_1$ and $y_2$, and define

$$q_t \;\equiv\; \log \frac{\phi_{y_1, w_t}}{\phi_{y_2, w_t}}$$

to denote the token-level logit weight for one token instance in the text, representing how much that token contributes to the posterior prediction of the document class.
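A minimal numeric sketch of these token-level logit weights; the class-conditional probabilities below are made-up illustrative values, not langid.py's actual parameters:

```python
import math

# Made-up class-conditional token probabilities phi[y][token]
# (illustrative values, not a trained model).
phi = {
    "en": {"the": 0.05, "ao": 0.001, "nna": 0.002},
    "pt": {"the": 0.001, "ao": 0.04, "nna": 0.001},
}

def token_logit(token, y1="en", y2="pt"):
    """q_t = log(phi[y1][w_t] / phi[y2][w_t]): the token's additive
    contribution to the posterior log-odds of y1 over y2."""
    return math.log(phi[y1][token] / phi[y2][token])

# Document log-odds = prior log-odds + sum of per-token weights.
doc = ["the", "ao", "nna"]
log_odds = math.log(0.5 / 0.5) + sum(token_logit(w) for w in doc)
```

A positive `token_logit` pushes the prediction toward `y1`, a negative one toward `y2`; this sign is exactly what the diverging color scheme of §3 encodes.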

A wide variety of other models in the supervised setting may also define $q_t$ terms; for example, logistic regression has a very similar form (Ng & Jordan, 2002). In the binary classification case, with bias term $\beta_0$ and word weights $\beta$, logistic regression can be formulated similarly to MNB when features are word counts: the "posterior" log odds is $\beta_0 + \sum_t \beta_{w_t}$, in which case the token-level logits222One issue is that non-linear transforms of the word counts, such as thresholding or log scaling, often improve classification performance (Yogatama et al., 2015); unfortunately, they do not correspond to a uniform per-token impact. are $q_t = \beta_{w_t}$.

In practice, for both LDA and MNB, the full generative model is rarely used for all the text; at the very least, terms are excluded for being stopwords or punctuation, for having very high or very low frequency (e.g. Boyd-Graber et al., 2014), or during feature selection. Many tokens are therefore not accounted for in the model and do not change the posterior. For MNB, we define $q_t = 0$ in such cases.

2.3 N-gram features

It is useful to define features over n-grams, where each instance comes from a span in the text in the form of a [start position, end position) pair; e.g. span $(3, 5)$ corresponds to the bigram occupying positions 3 and 4. Using n-gram features, MNB is no longer a proper generative model of the text sequence, but it is still widely used in this setting, where the document's log-probability is defined as the sum over all n-grams in the model. In this case we define the span-level weight $q_{(s,e)}$ in a similar manner as §2.2:

$$q_{(s,e)} \;\equiv\; \log \frac{\phi_{y_1, w_{s:e}}}{\phi_{y_2, w_{s:e}}}$$

Since a single token may participate in multiple overlapping n-grams, we define the token-level weight as the sum of the weights of all (overlapping) n-grams that include position $t$:

$$q_t \;\equiv\; \sum_{(s,e)\,:\, s \le t < e} q_{(s,e)}$$

This can be extended to logistic regression or other feature-based classifiers as well. $q_t$ answers part of the counterfactual: if the token were deleted, the prediction's log-odds would change333This analysis ignores the impact of new n-gram features, bridging position $t$, that would be introduced; on the other hand, the new text likely would not be a valid or likely text, so perhaps the counterfactual viewpoint is limited. by $q_t$.
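The overlapping-span sum can be sketched in a few lines; the spans and weights below are hypothetical:

```python
def token_weights(n_tokens, span_weights):
    """Fold span-level weights q_(s,e) down to token level: each
    token's q_t is the sum over all [start, end) spans containing t."""
    q = [0.0] * n_tokens
    for (start, end), weight in span_weights.items():
        for t in range(start, end):
            q[t] += weight
    return q

# Hypothetical feature weights over a 5-token text; span (3, 5) is the
# bigram covering positions 3 and 4.
span_weights = {(0, 1): 0.5, (3, 5): -1.2, (4, 5): 0.3}
q = token_weights(5, span_weights)
# position 4 participates in both (3, 5) and (4, 5): q[4] = -1.2 + 0.3
```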

Other linguistic features could also be visualized using color annotations. For example, a syntactic dependency path is a sequence alternating between tokens and directed edges (e.g. Mintz et al., 2009). Unlike an n-gram, the set of tokens in a path is not necessarily contiguous. But tokens can be colored in the same way through a $q_t$ value: for a word token, the sum of the model's weights for paths whose token set includes that token. If dependency edges are shown alongside the text, they could be colored in a similar way.

3 In-text visualization

We define a visual encoding function to select the final visual attributes that represent the quantity of interest $q_t$ to the user, inspired by Wilkinson (2006)'s grammar of graphics approach to data visualization.

We would like to show the original text, with visual annotations. Some easily implementable options for visual encoding include

  • Color: the background or foreground text color.

  • Boldface or italics.

  • Underlines (possibly varying color or line width).

  • Size of text.

In our preliminary experiments, color emerged as an effective encoding scheme. Color can represent multiple dimensions as well as scalar values. Previous research in visualization has examined how to effectively encode data in color given the strengths and weaknesses of the human visual system (e.g. Ware (2012); Munzner (2014)), and research results such as the ColorBrewer palettes444http://colorbrewer2.org/, https://bl.ocks.org/mbostock/5577023 are available for use. (On the downside, colors can pose an issue for colorblind users.) Text size is another interesting option, but unfortunately variable-sized text is often difficult to read.555Both word and tag clouds have long sought to encode frequencies from a bag-of-words using text size, e.g. the Wordle system (http://www.wordle.net/). Boldface and italics have a relatively limited information capacity, and we found underlines visually busy. (An alternate approach is to use extra-textual visual cues alongside words; for example, Chahuneau et al. (2012) align a bar graph, with heights corresponding to $q_t$, next to word tokens.)

For the vector-valued $q_t$ from LDA, we assign different topics to different color hues (but similar brightness levels) and assign a token's color according to the argmax of $q_t$. (An additional possible strategy may be to blend the color towards white if the posterior entropy is higher.)

For binary document classification with a scalar-valued $q_t$, there is a diverging semantics: negative and positive values should correspond to different colors (e.g. red versus blue), blending to white at $q_t = 0$. We utilize this for classifier visualizations.
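One way to implement such a diverging encoding, as a sketch (the particular blue-white-red ramp and the clipping behavior are illustrative choices, not prescribed by our figures):

```python
def diverging_color(q, q_max):
    """Map a scalar q in [-q_max, q_max] to an RGB triple on a
    blue-white-red diverging ramp: blue for negative weights,
    white at zero, red for positive weights."""
    x = max(-1.0, min(1.0, q / q_max))  # clip to [-1, 1]
    if x < 0:
        s = 1 + x          # blend blue (x = -1) toward white (x = 0)
        return (int(255 * s), int(255 * s), 255)
    s = 1 - x              # blend white (x = 0) toward red (x = 1)
    return (255, int(255 * s), int(255 * s))

# Zero maps to white; the extremes saturate at pure blue / pure red.
white = diverging_color(0.0, 1.0)   # (255, 255, 255)
```

In practice one would pick the endpoints from a perceptually designed palette such as ColorBrewer rather than raw RGB extremes.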

4 Words-as-pixels visualization

Color can be used in zoomed-out views as well. At a very high aggregation level, such as summarizing topic frequencies across thousands of documents over time, the same colors as the in-text annotations can be used to assist interpretation.

We propose a complementary, high-level view—words as pixels, shown in Figure 1. Here, individual tokens are represented as pixels or very small squares, colored by their $q_t$, and these points are laid out in order within a document. We arrange them as left-to-right descending columns, mimicking the natural reading order of many left-to-right languages, and thus corresponding to a zoomed-out view of the original text.666An inspiration is the zoomed-out scrollbar view of the Sublime Text editor.
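The column layout amounts to a simple index-to-grid mapping; the column height below is a free parameter, not a value used in our figures:

```python
def pixel_layout(n_tokens, column_height):
    """Assign each token index a (column, row) grid cell: tokens run
    top to bottom within a column, columns run left to right, giving a
    zoomed-out analogue of reading order."""
    return [(t // column_height, t % column_height) for t in range(n_tokens)]

coords = pixel_layout(10, column_height=4)
# token 0 -> (0, 0); token 4 starts the second column -> (1, 0)
```

Rendering then reduces to drawing one small colored square per `(column, row)` cell, with the color taken from that token's $q_t$.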

When documents have a natural ordering, such as date of publication, or sections or chapters within a book, multiple documents can be laid out one after another. This allows the user to see certain discourse structures in the text; at least, ones that are captured by the model. In Figure 1, we visualize LDA on a corpus of U.S. presidential State of the Union speeches from 1946–2007 using David Mimno's jsLDA data preprocessing and topic model implementation (Mimno, 2016).777https://github.com/mimno/jsLDA The words-as-pixels view only shows tokens that are in the model, which here is roughly half of all tokens in the text after preprocessing. We average 100 Gibbs samples to calculate the posteriors used as the $q_t$ quantities. The model clearly picks up on natural local groupings of latent topics in the text. This is driven in part by the model assumptions encoded in data preprocessing, since this version of the corpus defines model "documents" as paragraphs from the speeches. The model assumption is that topic prevalence can be expected to vary by textual locality, and the visualization allows a qualitative assessment of the extent to which this assumption holds in the posterior inferences.

This is apparent in the example: large streaks of orange correspond to detailed discussions of budgets that were common in the 1940s and 1950s. We include callouts of individual paragraphs with a strong blue topic prevalence: discussion of political ideologies with regards to Communism (Eisenhower in 1959) and Islam (Bush in 2002). We aim to develop this interface as a linked views data explorer (Buja et al., 1996; O'Connor, 2014) where a user can click on the words-as-pixels view to show the corresponding text passage. A web demo is available at http://slanglab.cs.umass.edu/topic-animator/.

5 Classification: Language identification in social media

A key step in any internet text analysis pipeline is to identify which language a text was written in. Character n-gram models (where each $w_t$ is a character symbol) are a widely used approach for this task, and the popular open-source langid.py tool (Lui & Baldwin, 2011, 2012)888https://github.com/saffsd/langid.py uses a multinomial naive Bayes model.

Short texts pose a challenge for language identification — and social media messages also present a domain adaptation problem, since they contain much creative and non-standard language very different from traditional well-edited corpora that NLP systems are typically trained on. For example, langid.py uses Wikipedia corpora as a major source of training data.

Predicted as Portuguese (pt)

Predicted as Irish Gaelic (ga)

Figure 2: Tweets we assess as English that were classified as non-English; every character position $t$ has its own $q_t$. Blue indicates a log-likelihood weight towards the non-English language; red towards English.

We examined a corpus of tens of millions of public Twitter messages geolocated in the U.S., filtered to users who use language statistically associated with neighborhoods containing high populations of African-Americans.999Details in paper under review. As expected from the emerging sociolinguistic literature on social media corpora (Eisenstein, 2015a, b; Jones, 2015; Jørgensen et al., 2015), these messages contain rich dialectal language very different from well-edited genres of English. In fact, even after filtering to messages containing only Latin-1 characters,101010This filter gives the classifier an easier dataset more similar to its training data; for example, it excludes emoji. langid.py classifies 17% of these users' messages as non-English, but upon inspection, nearly all of them are English.

We used in-text highlighting to help diagnose model errors (Figure 2), assigning each character at position $t$ a color reflecting $q_t$, the sum of all n-gram feature weights that fire at that position (§2.3).

For example, the common term lmao (laughing my ass off) ends in ao, a common suffix in Portuguese identified by the classifier. The characters nna, which are common in modal verbs in non-formal American English (e.g. gonna, wanna, and the African-American English-associated finna, short for fixing to), cause confusion towards Irish Gaelic.

Another result we did not expect is the issue of sparsity in short texts. Many messages have only a small number of firing features, which we anticipate could lead to low accuracy. This is partly due to the feature selection process used to train langid.py's models, suggesting that its sparsity level may be tuned for longer documents rather than these short ones.

6 Conclusion

This work stems from a fundamental aspect of text processing: the most natural and intuitive way to grasp the full meaning of a written text is simply to read it. We believe in-text annotation is a less explored, but natural choice for explaining and understanding a computer’s view of language. We present a few simple methods for viewing text models, but expect many avenues for future work.

References

  • Blei et al. (2003) Blei, David M., Ng, Andrew Y., and Jordan, Michael I. Latent dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.
  • Boyd-Graber et al. (2014) Boyd-Graber, Jordan, Mimno, David, and Newman, David. Care and feeding of topic models: Problems, diagnostics, and improvements. In Handbook of Mixed Membership Models and Its Applications. CRC Press, 2014.
  • Buja et al. (1996) Buja, Andreas, Cook, Dianne, and Swayne, Deborah F. Interactive high-dimensional data visualization. Journal of Computational and Graphical Statistics, 5(1):78–99, 1996.
  • Chahuneau et al. (2012) Chahuneau, Victor, Gimpel, Kevin, Routledge, Bryan R., Scherlis, Lily, and Smith, Noah A. Word salad: Relating food prices and descriptions. In Proceedings of EMNLP/CoNLL, pp. 1357–1367, July 2012.
  • Chaney & Blei (2013) Chaney, Allison J.B. and Blei, David M. Visualizing topic models. In Proceedings of ICWSM, 2013.
  • Eisenstein (2015a) Eisenstein, Jacob. Identifying regional dialects in online social media. In Boberg, C., Nerbonne, J., and Watt, D. (eds.), Handbook of Dialectology. Wiley, 2015a.
  • Eisenstein (2015b) Eisenstein, Jacob. Systematic patterning in phonologically-motivated orthographic variation. Journal of Sociolinguistics, 19(2):161–188, 2015b.
  • Gardner et al. (2010) Gardner, M.J., Lutes, J., Lund, J., Hansen, J., Walker, D., Ringger, E., and Seppi, K. The topic browser: An interactive tool for browsing topic models. In NIPS Workshop on Challenges of Data Visualization, 2010.
  • Jones (2015) Jones, Taylor. Toward a description of African American Vernacular English dialect regions using “Black Twitter”. American Speech, 90(4):403–440, 2015.
  • Jørgensen et al. (2015) Jørgensen, Anna, Hovy, Dirk, and Søgaard, Anders. Challenges of studying and processing dialects in social media. In Proceedings of the Workshop on Noisy User-generated Text, Beijing, China, 2015. ACL.
  • Karpathy et al. (2015) Karpathy, Andrej, Johnson, Justin, and Li, Fei-Fei. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078, 2015.
  • Lafferty et al. (2001) Lafferty, J., McCallum, A., and Pereira, F. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML, pp. 282–289, 2001.
  • Lui & Baldwin (2012) Lui, M. and Baldwin, T. langid.py: An off-the-shelf language identification tool. In Proceedings of ACL, Demo Session, 2012.
  • Lui & Baldwin (2011) Lui, Marco and Baldwin, Timothy. Cross-domain feature selection for language identification. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pp. 553–561, 2011.
  • McCallum & Nigam (1998) McCallum, Andrew and Nigam, Kamal. A comparison of event models for naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, volume 752, pp. 41–48, 1998.
  • Mimno (2016) Mimno, David. jsLDA: In-browser topic models (in preparation). 2016.
  • Mintz et al. (2009) Mintz, Mike, Bills, Steven, Snow, Rion, and Jurafsky, Daniel. Distant supervision for relation extraction without labeled data. In Proceedings of ACL/IJCNLP, pp. 1003–1011, August 2009.
  • Munzner (2014) Munzner, Tamara. Visualization Analysis and Design. CRC Press, 2014.
  • Ng & Jordan (2002) Ng, Andrew and Jordan, Michael. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in neural information processing systems, 14:841, 2002.
  • O’Connor (2014) O’Connor, Brendan. MiTextExplorer: Linked brushing and mutual information for exploratory text data analysis. In Proceedings of the ACL Workshop on Interactive Language Learning, Visualization, and Interfaces, 2014.
  • Ware (2012) Ware, Colin. Information visualization: perception for design. Elsevier, 2012.
  • Wilkinson (2006) Wilkinson, Leland. The grammar of graphics. Springer, 2006.
  • Yogatama et al. (2015) Yogatama, Dani, Kong, Lingpeng, and Smith, Noah A. Bayesian optimization of text representations. In Proceedings of EMNLP, pp. 2100–2105, September 2015.