Towards a science of human stories: using sentiment analysis and emotional arcs to understand the building blocks of complex social systems

by Andrew J. Reagan, et al.

Given the growing assortment of sentiment measuring instruments, it is imperative to understand which aspects of sentiment dictionaries contribute to both their classification accuracy and their ability to provide richer understanding of texts. Here, we perform detailed, quantitative tests and qualitative assessments of 6 dictionary-based methods, and briefly examine a further 20 methods. We show that while inappropriate for sentences, dictionary-based methods are generally robust in their classification accuracy for longer texts. Stories often follow distinct emotional trajectories, forming patterns that are meaningful to us. By classifying the emotional arcs for a filtered subset of 4,803 stories from Project Gutenberg's fiction collection, we find a set of six core trajectories which form the building blocks of complex narratives. Of profound scientific interest will be the degree to which we can eventually understand the full landscape of human stories, and data-driven approaches will play a crucial role. Finally, we utilize web-scale data from Twitter to study the limits of what social data can tell us about public health, mental illness, discourse around the protest movement of #BlackLivesMatter, discourse around climate change, and hidden networks. We conclude with a review of published works in complex systems that separately analyze charitable donations, the happiness of words in 10 languages, 100 years of daily temperature data across the United States, and Australian Rules Football games.



1.1 Introduction

Individual words encapsulate information and emotion as the building blocks that we use to capture experiences in stories. Beyond words, multi-word expressions (phrases), conceptual metaphor, and complicated grammar (syntax) coalesce to provide incredible expressive power. Attempts to quantify semantic content build atop a syntactic understanding of language with the aim of transforming a model of meaning that has proven useful to our own cognitive machinery into something more readily applicable for another purpose (e.g., summarization by a computer). One such goal of semantic understanding is to measure the sentiment expressed in written communication, which is broadly known as sentiment analysis. The next evolution of natural language systems will tackle the harder-yet problems of pragmatics, where narrative understanding and generation can enable common-sense reasoning on par with human intuition.

In our work, we transfer the emotion of single, isolated words into a one-dimensional happiness measure to build the Hedonometer. Leveraging the Hedonometer technology and modern computational power, we analyze digitized text with the ultimate goal of understanding stories. This dissertation proceeds as follows: in this chapter we explore the foundations of sentiment analysis and narrative structure. In Chapter 2 we benchmark and compare methods for sentiment analysis. In Chapter 3 we apply these methods and extract dominant emotional arcs from digitized text. In Chapter 4, we discuss contributions made to published work in the broader science of complex systems. Finally, in Chapter 5 we offer some concluding remarks.

Next, we examine prior work in natural language processing, sentiment analysis, and computational narrative understanding.

1.2 Sentiment analysis

The field of Natural Language Processing (NLP) has been around since the advent of computers, but is growing rapidly alongside computational advances. While major advances have been made, there remain many open problems. We focus here on a specific NLP problem, namely understanding the emotional content of language. We refer to the emotional content in a written text broadly as the sentiment. In addition to the summaries given in recent review articles (Giachanou and Crestani, 2016), the landscape of tools and technologies is expanding quickly and sentiment analysis systems are deployed to tackle important challenges. As we will see, sentiment analysis is a sub-field of NLP that can benefit from advancement in other realms of NLP as well (e.g., phrase partitioning).

Applications of sentiment analysis span academia, industry, and government. Just some of the current uses include predicting elections (Tumasjan et al., 2010), product sales (Liu et al., 2007), stock market movement (Bar-Haim et al., 2011), and tracking public opinion (Cody et al., 2015). NLP and measures of sentiment are used to analyze consumption of information from the media, and societal level decisions are driven by this flow of public opinion online. Beyond individual and collective decisions, corporate success demands an understanding of the public sentiments directed towards their products.

Advances in Artificial Intelligence (AI) have elucidated the distinction between problems that are hard for computers and those that are hard for humans—a difference that is not obvious at the outset. Determining sentiment is one such task: understanding the sentiment of our friends and colleagues through informal text is relatively easy for us, but it is hard to codify in a computer algorithm. As we will see, machine learning (often broadly referred to as AI) is finding uses in all areas of Natural Language Processing (NLP), including advancing the state-of-the-art in sentiment classification and sentiment dictionary creation. While sentiment analysis benefits from machine learning to create classifiers and sentiment dictionaries, the output of sentiment detection also aids higher level approaches to language understanding.

1.2.1 Psychology of emotion

With few exceptions, current sentiment analysis methods aim to detect sentiment one-dimensionally, giving a score on a range from negative to positive sentiment. While this pragmatic approach proves useful, Jack et al. (2014) conjectured that there are four basic emotions, Ekman (1992) names six, and Plutchik (1991) identifies two additional basic emotions in humans. These theories are only the most well known classifications, with at least 90 such classifications being given over the past century, as noted by Plutchik (2001). Through the use of brain imaging and fMRI techniques, researchers in neuroscience have also attempted to distinguish whether basic emotions are best captured as discrete categories (anger, fear) or underlying dimensions (valence, arousal). Altogether they have found consistent neural locations for basic emotions but no one-to-one mapping, and further research is still needed (Harrison et al., 2010; Hamann, 2012).

The widely acknowledged six basic emotions identified by Paul Ekman are:

  • happy,

  • surprised,

  • afraid,

  • disgusted,

  • angry,

  • and sad.

In Figure 1.1, a visualization of these six basic emotions is shown. As noted in the caption, these six emotions serve as a basis for more complex emotions. The eight basic emotions of Plutchik (1991) are shown as the variations along four dimensions in Figure 1.2. While we do not expect that each of the six basic emotions has an orthogonal representation in its embodiment in language (as orthogonality may be inferred from the Figures, is found in facial expression, and underlies the theory), a basis of more than a single dimension is likely necessary to represent the full range of emotion. The basic emotions theory rejects the idea that all emotions can be represented as either positive or negative states, and this should extend to language. Indeed, attempts to cast the basic emotions as either positive (e.g., happy) or negative (e.g., sad) are subjective, e.g., Robinson (2008) classifies pride as a negative emotion. According to Ekman (1992), basic emotions are distinguished by nine characteristics:

  1. Distinctive universal signals.

  2. Presence in other primates.

  3. Distinctive physiology.

  4. Distinctive universals in antecedent events.

  5. Coherence among emotional response.

  6. Quick onset.

  7. Brief duration.

  8. Automatic appraisal.

  9. Unbidden occurrence.

To this end, in Figure 1.3 the theory of Russell (1980) attempts to find the core dimensions of emotion using data from emotions manually labelled for 28 adjectives. The explained variance by the first two principal components would provide an indication of how well we can capture emotion with two abstract dimensions; however, this is not provided by Russell (1980). Each of these theories expands upon the single dimension considered further in sentiment analysis: positive and negative. More complex emotions can be constructed from combinations of the basic emotions (e.g., delight = joy + surprise), which is not possible from combinations of simply positive and negative states (e.g., it would be nonsensical to find coefficients for the abstract categories positive and negative to satisfy delight = a*positive + b*negative).

Figure 1.1: The six emotions of Ekman (1992), illustrated here by McCloud (2006). In principle, the entire range of human emotions can be constructed from this minimal “basis”, e.g., the emotion delight is the addition of joy and surprise. This theory of basic emotions distinguishes these emotions as being fundamentally distinct, adapted for fundamental life tasks, and universally present through evolution (or, perhaps, universal social learning). In particular the distinction between basic emotions is not explained by variation in dimensions of arousal, pleasantness, or activity.
Figure 1.2: Schematic of the eight emotions from Plutchik (1991). The commonly known eight names (e.g., joy, etc.) are one row out from the center. In addition to the six emotions of Ekman (1992) we find anticipation and trust on the first level.
Figure 1.3: Eight emotions on the arousal–pleasure axes of Russell (1980), who finds these axes to be the best representation of emotion. To this end, using 28 emotional words manually annotated for the characteristics which they share, Russell finds the two major principal components in a Principal Component Analysis, establishing this “circular ordering.” This circular ordering agrees well with the mental model of emotional states used by psychologists at the time.

An alternative to basic, discrete emotions being the building blocks for all emotions is to place all emotions in the dimensions of valence, arousal, and dominance, often referred to as “norms” and measured alongside concreteness and age of acquisition (Lindquist et al., 2016). In the literature, the term valence is used interchangeably to mean the negative/positive emotional dimension.

The positivity bias in language is frequency-independent, as long as the frequency selections are rank ordered (see Dodds et al. (2015a) and Chapter 4). Schrauf and Sanchez (2004) asked participants to write as many emotion words as they could think of in two minutes, and found that participants were able to recall a larger list of negative emotional words. At least one theory for this difference, as elaborated in Koch et al. (2016), posits that this difference is because positive words are more similar than negative words. In one of six tests, they show that the scores for positive words are more tightly clustered than the scores for negative words from the Warriner & Kuperman sentiment dictionary.

In addition to the emotion of expression, we note that other work attempts to measure personality traits of individuals based on their expressions (rather than the sentiment of the expressions themselves), specifically Kosinski et al. (2013) and Youyou et al. (2015). As an example, given a person’s micro-blog post, the algorithms developed by Kosinski et al. (2013) are trained to measure whether the person is an introvert or extrovert. These attempts fundamentally differ from sentiment analysis by measuring traits of an individual rather than traits of the expression, though in practice the two goals make use of similar machine learning techniques.

For the remainder of this chapter, we will assume that each emotion is being measured on a scale from -4 to 4, with 0 representing no presence of emotion and a score of -4/4 representing the maximum negative/positive emotional priming. While some dictionaries benefit from considering emotion on a different scale for human evaluation (e.g., “labMT” with 1 to 9 or “AFINN” with -5 to 5), we make this choice to speak more generally about each sentiment dictionary we test.
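As a minimal sketch of how a score on another scale could be brought onto the common -4 to 4 scale, the function below applies a linear map. The linear form and the name `rescale` are assumptions made for illustration; the text does not specify how each dictionary is converted.

```python
def rescale(score, old_min, old_max, new_min=-4.0, new_max=4.0):
    """Linearly map a score from [old_min, old_max] onto [new_min, new_max].

    Hypothetical helper: the linear mapping is an assumption, not a method
    stated in the text.
    """
    if old_max == old_min:
        raise ValueError("old range must have nonzero width")
    fraction = (score - old_min) / (old_max - old_min)
    return new_min + fraction * (new_max - new_min)


# On a 1-to-9 rating scale, the midpoint 5 lands on 0 (no emotion),
# and the endpoints land on -4 and 4.
midpoint = rescale(5, 1, 9)
top = rescale(9, 1, 9)
bottom = rescale(1, 1, 9)
```

Under this mapping a 1-to-9 scale is simply shifted by 5, while a scale with a different width would also be stretched or compressed.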

1.2.2 Goals of sentiment analysis

It may help to first frame the problem of detecting sentiment in text, and we will utilize the generalization given by Bing Liu in his 2012 book Sentiment Analysis and Opinion Mining (Liu, 2012). Here, our goal is to detect and understand the average sentiment of a document using the words contained within: document-level sentiment classification. Our definition extends that of Liu (2012) to include the goal of better understanding text through sentiment detection, and this goal is complementary (and in some cases outright necessary) to achieve classification. While document length varies, Liu (2012) subdivides finer-grained classification into two categories: (1) classifying sentence-level sentiment and (2) classifying entity-level sentiment. Sentence-level sentiment is detecting sentiment in sentences, and entity-level sentiment aims to predict sentiments that are directed at named entities (e.g., products, people, or corporations). We express caution in pursuing these latter goals using existing methodology, namely in classifying short, informal text. We will examine in Chapter 2 how dictionary based approaches are effective at the document level, but fail at the sentence level (and by extension fail at the entity level as well). Several examples of different sentences are also given in Liu (2012), highlighting the difficulty of classifying individual sentences, and we share these examples here.
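To make document-level scoring concrete, the sketch below averages the scores of the words in a document that appear in a sentiment dictionary. The dictionary `TOY_SCORES` and its values are invented for illustration and are not taken from any real sentiment dictionary; splitting on whitespace is also a simplifying assumption.

```python
# Hypothetical word scores on the -4 to 4 scale (illustrative values only).
TOY_SCORES = {"love": 3.0, "happy": 3.2, "rain": -0.5, "hate": -3.1}


def average_sentiment(text, scores):
    """Return the mean score of the dictionary words found in text.

    Words missing from the dictionary are ignored; returns None when no
    word in the text has a score.
    """
    words = text.lower().split()
    matched = [scores[w] for w in words if w in scores]
    if not matched:
        return None
    return sum(matched) / len(matched)


example = average_sentiment("I love the rain", TOY_SCORES)
```

Because the result is an average over many words, a single misjudged word matters less as documents grow longer, which is consistent with the caution expressed about sentence-level use.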

The accuracy of classifying documents correctly as positive or negative is commonly measured using precision, recall, and F-score statistics, as in Ribeiro et al. (2016). These measures assess the classification accuracy, but do not attempt to assess the qualitative goal of achieving better understanding of text with sentiment analysis (an area on which our work will build). Both of these goals can be assessed with ground truth data, and next we review publicly available data sets for sentiment evaluation.
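For reference, precision, recall, and the F1 score for the positive class can be computed from paired true and predicted labels as in the sketch below. The function name and the label strings "pos" and "neg" are illustrative assumptions.

```python
def precision_recall_f1(true_labels, predicted_labels, positive="pos"):
    """Compute precision, recall, and F1 for one class of a binary task."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators when a class is never predicted/seen.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


truth = ["pos", "pos", "neg", "neg"]
guess = ["pos", "neg", "pos", "neg"]
p, r, f = precision_recall_f1(truth, guess)
```

In this toy case there is one true positive, one false positive, and one false negative, so all three statistics equal 0.5.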

1.2.3 Publicly available annotated data

Review papers such as those by Giachanou and Crestani (2016) attempt to capture the many advances in the field, including applications of machine learning with training data, although they only identify 3 of the 17 sentiment dictionaries that we list in Chapter 2. They identify the lack of benchmarks as an important issue (Giachanou and Crestani, 2016):

One of the main challenges in evaluating approaches that address Twitter-based sentiment analysis is the absence of benchmark datasets. In the literature, a large number of researchers have used the Twitter API to crawl tweets and create their own datasets, whereas others evaluate their methods on collections that were created by previously reported studies. One major challenge in creating new datasets is how the tweets should be annotated. There are two approaches that have been followed for annotating the tweets according to their polarity: manual annotation and distant supervision.

To this end, we note the availability of datasets below and attempt to collect each dataset enumerated by Giachanou and Crestani (2016); Saif et al. (2013); Ribeiro et al. (2016) in Table 1.1 and make them accessible in one place online. In addition to these public datasets, some academic groups choose not to release their tagged data, and there are claims of very large datasets held by private companies in the sentiment analysis space. Given the time and cost associated with obtaining high quality training data, and the ubiquity of machine learning for sentiment analysis in industry, the training data itself can be viewed as a trade secret.

Short name                       Description                        # Samples  Referenced by
STS,Tweets_STF,STS-Test          Stanford Twitter Sentiment         499        G, R, S
Sanders,Tweets_SAN,Sanders       Sanders Corpus                     3424       G, R, S
HCR,HCR                          Health Care Reform                 4616       G, S
OMD,Tweets_DBT,OMD               Obama-McCain Debate                3298       G, R, S
SS-Tweet,Tweets_RN_I,SS-Twitter  SentiStrength Twitter Dataset      4243       G, R, S
SemEval,Tweets_Semeval,SemEval   SemEval Datasets                   6087       G, R, S
STS-Gold,STS-Gold                STS-Gold                           2036       G, S
DETC,DETC                        Dialogue Earth Twitter Corpus      N/A        G, S
Tweets_RND_IV                    aisopos_ntua                       500        R
Comments_TED                     TED Comments                       839        R
Comments_BBC                     SentiStrength BBC Comments         1000       R
Comments_Digg                    SentiStrength Digg Comments        1077       R
Reviews_I                        SentiStrength Myspace Reviews      1041       R
RW                               SentiStrength Runners World Forum  1046       R
Comments_YTB                     SentiStrength YouTube Comments     3407       R
Amazon                           VADER Amazon Reviews               3708       R
Reviews_II                       VADER Movie Reviews                10605      R
Comments_NYT                     VADER NYT Comments                 5190       R
Tweets_RND_II                    VADER Tweets                       4200       R
Tweets_RND_III                   DAI-Labor English MT               3771       R
ORT                              Opinion Retrieval Twitter          5051       L

Table 1.1: Summary of publicly available Twitter datasets tagged with sentiment labels. In respect of Twitter’s Terms of Service, lists of the Tweet IDs are provided, as well as a script to download the Tweets through Twitter’s public API (note that some data may no longer be available). We shorten the references as follows: G: Giachanou and Crestani (2016), S: Saif et al. (2013), R: Ribeiro et al. (2016), and L: Luo et al. (2012).

In addition to the tagged datasets above, we attempt to provide a comprehensive list of sentiment dictionaries in Table 2.1.

1.2.4 Natural Language Processing techniques

As itself a tool for NLP, sentiment analysis leverages approaches that are applied more broadly (e.g., classification), and can benefit, if only slightly, from other such techniques. In this section, we provide a very brief overview of techniques for processing raw text, detecting boundaries of multi-word expressions, disambiguating word senses, tagging parts-of-speech, and dependency parsing.


Here, we consider words as the basis for our computation, and the process of extracting words from raw text is often referred to as “tokenization”. The simplest tokenization procedure is splitting raw text strings on spaces, with words being any contiguous non-space characters. For well structured (formal) writing, this approach presents few false positive matches, but it is often too simple for processing informal text (e.g., Twitter), where grammar is not reliable. To improve upon the aforementioned approach, we build a list of known “word characters” (e.g., the letters a-z, the apostrophe, hyphen, etc.) and extract all contiguous sequences of these characters as words. An example regular expression implementing this approach is provided in Section A.1.2. The final consideration here is the variety of forms that individual words take; the representation of a word differs based on, but not limited to, the different classes, inflection, contractions, possessive use, and/or pluralization of the word. Depending upon the ultimate use case, a choice can be made for how to process words. A common choice is to reduce words to their root, a process called “stemming”, which removes the inflection from words; a popular implementation is provided by Porter (2001). A widely used source for annotated data based on word stems is the morphology of WordNet (Fellbaum, 1998). In the approach that we adopt for sentiment analysis, we attempt to retain the most complete representation of words, without removing the information about usage that may be contained beyond a word’s root or stem. This achieves a very basic and computationally efficient disambiguation between word senses.
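The contrast between splitting on spaces and extracting runs of “word characters” can be sketched as follows. The character class used here (letters, apostrophe, hyphen) is an illustrative assumption and is not the regular expression from Section A.1.2.

```python
import re

# Assumed set of "word characters": ASCII letters, apostrophe, and hyphen.
WORD_PATTERN = re.compile(r"[a-z'\-]+", re.IGNORECASE)


def tokenize_on_spaces(text):
    """Simplest approach: every run of non-space characters is a word."""
    return text.split()


def tokenize_word_characters(text):
    """Extract contiguous runs of known word characters, lowercased."""
    return [match.lower() for match in WORD_PATTERN.findall(text)]


sample = "Don't stop: well-known words!!"
by_spaces = tokenize_on_spaces(sample)
by_characters = tokenize_word_characters(sample)
```

On the sample string, splitting on spaces leaves punctuation attached ("stop:" and "words!!"), while the character-based version yields the four clean tokens "don't", "stop", "well-known", and "words", keeping contractions and hyphenated forms intact.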

Multi-word Expressions

In addition to tokenization, the meaningful units of language often span multiple words. These multi-word expressions, or “phrases”, can also be extracted from tokenized words. Here we summarize two state-of-the-art approaches from Handler et al. (2016) and Williams (2016).

Williams, J. R. (2016). Boundary-based MWE segmentation with text partitioning. arXiv preprint arXiv:1608.02025.

Williams performs boundary-based MWE segmentation with text partitioning, building on prior work that introduces random and serial partitioning algorithms, and showing that phrase frequency follows Zipf’s law more closely than words alone. Trained models for partitioning rely on (1) phrase likelihood from “informed random partitioning”, (2) entries in the Wiktionary, and (3) annotated corpora. The model is general purpose for pattern recognition, and can be run using text data or PoS tags, combining the output phrases for higher recall. Altogether, this achieves state-of-the-art performance with flexible application to any text-based corpora.

Handler, A., M. J. Denny, H. Wallach, and B. O’Connor (2016). Bag of what? simple noun phrase extraction for text analysis. NLP+ CSS 2016, 114.

Handler and colleagues build upon prior work that defines a grammar of PoS labels for noun phrases. In essence, the approach uses patterns to match noun phrases. The implementation realizes computational feasibility with a Finite State Transducer (FST) compiled to find all matches of their pattern represented by a Finite State Grammar (FSG). As an example of this general type of approach, the pattern of word labels Adjective Noun Noun (encoded ANN) would be successfully matched by the grammar (A|N)*N(N)*, where the * represents 0 or more matches of the previous expression (as in standard regular expression syntax, otherwise known as the Kleene star). The availability of reliable part-of-speech tags is assumed by this approach, although this is known to be a harder problem for informal text (e.g., social media).
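The pattern-matching idea can be illustrated with an ordinary regular expression over a string of one-letter PoS labels. This is only a sketch of the general technique using the simplified grammar quoted above; it is not the finite state transducer implementation of Handler et al., and the one-letter encoding (A for adjective, N for noun, V for verb) is an assumption for the example.

```python
import re

# The simplified noun-phrase grammar from the text, over one-letter PoS labels.
NOUN_PHRASE = re.compile(r"(A|N)*N(N)*")


def is_noun_phrase(tags):
    """Return True when the whole tag string is matched by the grammar."""
    return NOUN_PHRASE.fullmatch(tags) is not None


matches_ann = is_noun_phrase("ANN")   # Adjective Noun Noun
matches_na = is_noun_phrase("NA")     # ends in an adjective
matches_v = is_noun_phrase("V")       # a lone verb
```

The sequence ANN is accepted, while NA and V are rejected because the grammar requires the phrase to end in a noun.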

We conclude that both of these available methods, and even the “naive” method described by Mikolov and Dean (2013), offer an improvement upon unigram models for bag-of-words approaches to sentiment analysis, which include the methods used in this dissertation. Sentiment dictionaries only contain ratings for single words, and extending existing dictionary ratings to MWEs is a widely acknowledged area for future research.

Word Sense Disambiguation (WSD)

To get a sense of the Word Sense Disambiguation (WSD) problem, here we examine a scholarly competition: the English All-Words Task of SENSEVAL-3. The SENSEVAL competitions began in 1998, and the second and third instantiations took place in 2001 and 2004. After 2004, the scope of tasks was broadened and the name switched to SemEval, which was held again in 2007, 2010, and then every year from 2012 to 2017. First, we summarize the construction of the benchmark by Snyder and Palmer (2004), and then we examine the winning entry from Decadt et al. (2004).

Snyder, B. and M. Palmer (2004). The english all-words task. In Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pp. 41–43. Association for Computational Linguistics.

To develop the training and testing data for Senseval-3, Snyder and Palmer extracted approximately 5,000 words from two Wall Street Journal articles and one excerpt from the Brown Corpus. Word sense was annotated by two people using WordNet senses, and then settled by a third party, for a total of 2,212 words and multi-word expressions. They found the inter-annotator agreement to be 72.5%, representing a practical upper bound for the performance of computational methods.

Decadt, B., V. Hoste, W. Daelemans, and A. Van den Bosch (2004). Gambl, genetic algorithm optimization of memory-based wsd. In Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pp. 108–112. Association for Computational Linguistics.

GAMBL is a “word expert” approach to WSD in which a word sense classifier is trained for each individual word. The parameters of this classifier are optimized using a genetic algorithm, and the method achieves the best precision/recall of .652.

Part-of-Speech tagging

Part-of-Speech (PoS) tagging aims to disambiguate between the various forms that a word can take: noun, verb, pronoun, preposition, adverb, conjunction, participle, and article are eight of the most well recognized categories. This information tells us how a word relates to the neighboring words around it, and finer grained taxonomies of parts of speech in English contain more than 80 types. To train and test algorithms for this task, large annotated corpora such as the Penn Treebank are available from Marcus et al. (1993) and OntoNotes.

Abney, S. (1997). Part-of-speech tagging and partial parsing. In Corpus-based methods in language and speech processing, pp. 118–136. Springer.

Abney (1997) elaborates upon the work of Church (1988) and DeRose (1988) to develop a reasonable, approximate approach to PoS tagging. State-of-the-art approaches can be classified into rule-based and stochastic, the latter making extensive use of Hidden Markov Models (HMMs) to represent state as a latent variable.

Toutanova, K., D. Klein, C. D. Manning, and Y. Singer (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, pp. 173–180. Association for Computational Linguistics.

Toutanova et al. (2003) develop a PoS tagger with improved accuracy which is competitive in terms of both speed and accuracy with any attempt since. This is achieved by using a cyclic dependency network to represent the state of the tagger, which achieves 97.24% accuracy on the Penn Treebank corpus. The tagger is used by Manning et al. (2014) in the most recent release of the Stanford CoreNLP natural language processing toolkit.

Owoputi, O., B. O’Connor, C. Dyer, K. Gimpel, N. Schneider, and N. A. Smith (2013). Improved part-of-speech tagging for online conversational text with word clusters. Association for Computational Linguistics.

Existing PoS taggers excel at the task in well structured language but are not applicable to short, informal text. In Owoputi et al. (2013), large-scale unsupervised word clustering and lexical features are used to achieve 93% accuracy on Twitter. In addition, guidelines for manually annotating this type of text are provided.

The application of PoS tagging in stand-alone tests on tagged corpora has achieved high rates of accuracy on both formal and informal text. It now stands to reason that this addition of information for individual words and MWEs has applications in an end-to-end system for sentiment analysis.

Dependency Parsing

Dependency parsing aims to extract the syntactic relationship between the words used in a sentence. Also referred to as syntax parsing, dependency parsing is one more NLP tool that aims to solve a disambiguation problem: of all possible dependency parses, choosing the most appropriate. In many cases, this disambiguation is between parses that are each grammatically valid, but nonsensical otherwise; consider the different interpretations of “They ate the pizza with anchovies” (seen in Figure 1.4). In this example, the anchovies could be utensils, a topping, or friends sharing the meal, but the intended reading is obvious to us with commonsense knowledge. Other examples that I found compelling for parsing are garden path sentences (those which confuse the common human parsing by leading our parse down the wrong path), such as “the old man the boat” or “the horse raced past the barn fell”. Both examples are valid sentences, but are easy to read incorrectly on the first pass. The dependency parsing algorithms that we examine next solve each of the examples we have just given correctly by utilizing neural network approaches that find the most probable parse.

We note that PoS tagging, a shallower form of parsing, is about twenty times faster than parsing, for applications where computational costs of parsing are a bottleneck (Handler et al., 2016). State-of-the-art approaches from both Chen and Manning (2014) and Andor et al. (2016) achieve parse accuracies over 90%.

Chen, D. and C. D. Manning (2014). A fast and accurate dependency parser using neural networks. In EMNLP, pp. 740–750.

In Chen and Manning (2014), a dependency parser is built that uses dense features of the surrounding text to improve upon both the accuracy and speed of current parsers. For performance, they note their “parser is able to parse more than 1000 sentences per second at 92.2% unlabeled attachment score on the English Penn Treebank”.

Andor, D., C. Alberti, D. Weiss, A. Severyn, A. Presta, K. Ganchev, S. Petrov, and M. Collins (2016). Globally normalized transition-based neural networks. arXiv preprint arXiv:1603.06042.

Andor et al. (2016) from Google Inc. (now Alphabet) improve further on the accuracy of neural network parsers and release a pre-trained model for general consumption. Their pre-trained model is Parsey McParseface and they note that “for dependency parsing on the Wall Street Journal we achieve the best-ever published unlabeled attachment score of 94.61%”.

Much like PoS tagging, dependency parsing algorithms extract meaningful information at the sentence level with high accuracy. An open challenge for sentiment analysis is the incorporation of this local information while retaining interpretability across large corpora.


In our pursuit to understand and evaluate sentiment analysis methods at a human level, it is intuitive yet deceiving to consider individual sentences. At the level of individual sentences, the bag of words approach is no longer useful. One attempt to improve these models for short text is to incorporate rules that are manually encoded to fit a given model for language, relying on the grammatical structure of language. Such a rule might be to consider negation words such as “not” to reverse the polarity of the following sentiment word, such that “not” and the sentiment word that follows it would be combined and assigned the reversed score of that word.
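A negation rule of this kind can be sketched as below. The negation list, the toy scores, and the choice to flip the sign of the following word's score are all assumptions for illustration; published systems implement negation in their own, often more elaborate, ways.

```python
# Hypothetical negation words and word scores (illustrative values only).
NEGATIONS = {"not", "never", "no"}
TOY_SCORES = {"good": 2.5, "happy": 3.2, "bad": -2.5}


def score_with_negation(words, scores):
    """Score a token list, merging a negation with the scored word after it.

    A negation immediately followed by a scored word becomes one unit whose
    score is the sign-flipped score of that word.
    """
    result = []
    i = 0
    while i < len(words):
        word = words[i]
        following = words[i + 1] if i + 1 < len(words) else None
        if word in NEGATIONS and following in scores:
            result.append((word + " " + following, -scores[following]))
            i += 2  # consume both the negation and the sentiment word
        else:
            if word in scores:
                result.append((word, scores[word]))
            i += 1
    return result


negated = score_with_negation(["this", "is", "not", "good"], TOY_SCORES)
```

Here "not good" is treated as a single unit with score -2.5, whereas a plain bag-of-words pass would have counted "good" as +2.5.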

Various attempts to incorporate rule-based heuristics and dictionary approaches for sentiment analysis include the work of Thelwall et al. (2012) and Hutto and Gilbert (2014). The systems developed by Kiritchenko et al. (2014), Wilson et al. (2005), and Polanyi and Zaenen (2006) incorporate a rule for negation. An analysis of the usefulness of different features for Twitter sentiment analysis is performed by Agarwal et al. (2011), including PoS and binary lexicon features. Perhaps unsurprisingly, the polarity of words is the single most useful feature. The analysis showed that the most useful combination is that of PoS with the polarity of words. Hutto and Gilbert (2014) report an increase in the F1 score for binary Tweet classification of 2.1% using negation, extended vowels (“happy” to “haaapy”), punctuation, and capitalization as cues.

1.2.5 Building corpus-specific sentiment dictionaries


Previous work on building sentiment dictionaries using data, as opposed to human evaluation, has taken various forms. We categorize these approaches along three main dimensions: (1) the type of data that is used to gain information about how words are similar, (2) how the data is processed, and (3) which methods are used to infer semantic properties.

Types of data include:

  • Thesaurus

  • Word associations

  • Unstructured text corpora

Data processing

  • Network from structured data

  • Network for POS patterns

  • Word embedding vectors

  • Vector similarity (cosine distance, etc.) networks (k-NN, etc.)

Some of the methods employed:

  • Graph clustering

  • Graph label propagation

  • Orthogonal subspace projection on embedding

We distinguish these approaches from machine learning approaches that estimate emotion of words from tagged training data in that these approaches extend existing scores about words.

Chronologically, the first approach here is by Hatzivassiloglou and McKeown (1997), and the most recent we have found is the work of Rothe et al. (2016). We will proceed by summarizing the main result of each paper, casting the methodology into one of the aforementioned categories.

Previous approaches

First, we take a close look at the earliest effort to build a corpus-specific sentiment dictionary to get a deeper sense of the steps involved in this task.

Hatzivassiloglou, V. and K. R. McKeown (1997). Predicting the semantic orientation of adjectives. In Proceedings of the eighth conference on European chapter of the Association for Computational Linguistics, pp. 174–181. Association for Computational Linguistics.

Hatzivassiloglou and McKeown (1997) use a four-pronged approach: (1) adjectives that are linked by conjunctions (“and” or “but”) are extracted from a large text corpus, (2) a log-linear regression determines whether they are synonyms or antonyms to make a graph of positive/negative connections, (3) a clustering algorithm is run for two clusters, and (4) the cluster with the greatest average frequency is labeled as the positive words. The 1987 WSJ corpus is used, with PoS tags for adjectives and conjunctions. They report 82% accuracy on the binary classification of word pairs as synonym or antonym, and 90% accuracy on semantic orientation (predicting manual labels on 1336 adjectives). Their approach does not rely on existing word scores, but nevertheless forms the basis for future work that does incorporate existing sentiment dictionary data.

Now that we have seen one approach in more detail, we will look ahead to methodology that more closely informs our own work. The years following saw an expansion in the methods, processing, and data used to automatically extend affective word scores (Turney, 2002; Turney and Littman, 2003; Taboada and Grieve, 2004; Kim and Hovy, 2004; Hu and Liu, 2004; Esuli and Sebastiani, 2006; Das and Chen, 2007; Kaji and Kitsuregawa, 2007; Blair-Goldensohn et al., 2008; Bestgen et al., 2008; Rao and Ravichandran, 2009). We start again in more depth with the more recent work of Velikovich et al. (2010), which is directly applicable to extending data sets that we are familiar with (e.g., labMT).

Velikovich, L., S. Blair-Goldensohn, K. Hannan, and R. McDonald (2010). The viability of web-derived polarity lexicons. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 777–785. Association for Computational Linguistics.

In the paper from Velikovich et al. (2010), many of the specifics of the approach are left out. We review this paper because the methodology outlined is very similar in spirit to all of the approaches that follow. For a domain corpus, they use n-grams up to length 10 scraped from 4 billion web pages; however, the details of this corpus are left vague. They then use the cosine distance between context vectors from these n-grams to build a k-nearest neighbor (k-NN) network (the method used to generate context vectors is again left to the reader). Seed words within the network are labeled with positive and negative sentiment, and scores for all n-grams are determined by shortest paths to the seed set, using a generic graph propagation algorithm. For results, Velikovich et al. (2010) report that their effort compares favorably to the manually constructed lexicon from Wilson et al. (2005) and a lexicon from WordNet used in Blair-Goldensohn et al. (2008).
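Because the paper leaves the specifics open, the general recipe (a similarity network plus propagation from labeled seeds) can only be illustrated under assumptions. The sketch below is one possible reading and not the algorithm of Velikovich et al.: the toy vectors, the value of k, the seed words, and the scoring rule (strongest path from a positive seed minus strongest path from a negative seed) are all hypothetical choices.

```python
import heapq
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def build_knn_graph(vectors, k):
    """Link each item to its k most similar items; edge weight = similarity."""
    graph = {name: {} for name in vectors}
    for name, vec in vectors.items():
        ranked = sorted(
            ((cosine_similarity(vec, other_vec), other)
             for other, other_vec in vectors.items() if other != name),
            reverse=True,
        )
        for sim, other in ranked[:k]:
            if sim > 0:  # keep only positively similar neighbors
                weight = min(sim, 1.0)  # guard against floating point overshoot
                graph[name][other] = weight
                graph[other][name] = weight
    return graph


def path_strength(graph, seeds):
    """Largest product of edge weights along any path from a seed to each node.

    Maximizing a product of weights in (0, 1] is the same as minimizing the
    sum of -log(weight), so this is Dijkstra's shortest path algorithm.
    """
    cost = {node: math.inf for node in graph}
    heap = []
    for seed in seeds:
        cost[seed] = 0.0
        heapq.heappush(heap, (0.0, seed))
    while heap:
        current, node = heapq.heappop(heap)
        if current > cost[node]:
            continue  # stale heap entry
        for neighbor, weight in graph[node].items():
            candidate = current - math.log(weight)
            if candidate < cost[neighbor]:
                cost[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return {node: math.exp(-c) for node, c in cost.items()}


def propagate_polarity(graph, positive_seeds, negative_seeds):
    """Score = strength of best positive path minus best negative path."""
    pos = path_strength(graph, positive_seeds)
    neg = path_strength(graph, negative_seeds)
    return {node: pos[node] - neg[node] for node in graph}


# Hypothetical two-dimensional context vectors, for illustration only.
vectors = {
    "great": (1.0, 0.1),
    "good": (0.9, 0.3),
    "fine": (0.7, 0.6),
    "bad": (-0.9, 0.3),
    "awful": (-1.0, 0.1),
}
graph = build_knn_graph(vectors, k=2)
polarity = propagate_polarity(graph, {"great"}, {"awful"})
```

In this toy network the unlabeled words “good” and “fine” inherit positive scores and “bad” inherits a negative one, purely from their position relative to the two seeds.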

Bestgen, Y. and N. Vincze (2012). Checking and bootstrapping lexical norms by means of word similarity indexes. Behavior research methods 44(4), 998–1006.

Bestgen and Vincze (2012) begin by taking 300-dimensional word embeddings from the Singular Value Decomposition (SVD) of the word co-occurrence matrix of the TASA corpus, comprising 44K documents. They use these embeddings to build a k-NN network, and then use the DIC-LSA technique of Bestgen et al. (2008) with the ANEW dictionary (using the dictionary scores to measure correlations with words in the network). This approach extends the ANEW dictionary by adding scores to additional words, directly using the scores in the ANEW dictionary itself. For different values of k, the score for each word in the network is taken to be the average of its neighbors (the closest words in the embedding space), and for words with scores from ANEW, the node value itself is held out. By using only the most extreme words (those in ANEW with scores closer to 1 and closer to 9), they achieve a correlation coefficient (Cohen’s Kappa) of .53–.94 on sets ranging from all of the words in ANEW down to the 190 most extreme (the latter .94 correlation is achieved using the 190 most extreme words in ANEW). In addition, they provide ratings using their method for 17,000 English words.
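The neighbor-averaging step can be sketched as follows. This is a minimal illustration of the idea rather than the DIC-LSA implementation: the two-dimensional embeddings and the scores are hypothetical, and restricting the neighbors to words that already have a score is an assumption of the sketch.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def knn_score(word, embeddings, known_scores, k):
    """Estimate a score for `word` as the mean score of its k nearest scored
    neighbors in the embedding space. The word's own score, if it has one,
    is held out so that the estimate can be compared against it."""
    candidates = [
        (cosine_similarity(embeddings[word], embeddings[other]), other)
        for other in known_scores
        if other != word and other in embeddings
    ]
    if not candidates:
        raise ValueError("no scored neighbors available")
    candidates.sort(reverse=True)
    nearest = candidates[:k]
    return sum(known_scores[other] for _, other in nearest) / len(nearest)


# Hypothetical embeddings and scores on a 1 to 9 scale, for illustration only.
embeddings = {
    "joy": (1.0, 0.1),
    "delight": (0.9, 0.2),
    "cheer": (0.8, 0.3),
    "grief": (-1.0, 0.1),
    "sorrow": (-0.9, 0.2),
}
known_scores = {"joy": 8.0, "cheer": 7.0, "grief": 2.0, "sorrow": 2.5}

new_word_estimate = knn_score("delight", embeddings, known_scores, k=2)
held_out_estimate = knn_score("joy", embeddings, known_scores, k=1)
```

Comparing held-out estimates such as `held_out_estimate` against the human ratings is what allows a correlation to be reported for the method.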

Tang, D., F. Wei, B. Qin, M. Zhou, and T. Liu (2014). Building large-scale twitter-specific sentiment lexicon: A representation learning approach. In COLING, pp. 172–182.

Tang et al. (2014) train a neural network (NN) to learn phrase sentiment from phrase embeddings using a graph collected from Urban Dictionary and Tweets with emoticons. The Tweets with emoticons are used to embed all phrases in a two dimensional space with the loss function as a hybrid between word context (e.g., word2vec) and emoticon label context (happy or sad). A network of words is extracted from Urban Dictionary and used to apply label propagation for positive (good, :)), negative (poor, :(), and neutral words (when, he) across the network (which includes phrases). The word embeddings and scores from label propagation are used as features for a ternary sentiment classifier that is trained to predict scores from label propagation. Their system outperforms those tested for the SemEval 2013 competition by attaining a performance of macro F1 score .78, and their final dataset, TS-Lex, is composed of 65,685 words with sentiment scores and provided online.

Amir, S., R. Astudillo, W. Ling, P. C. Carvalho, and M. J. Silva (2016). Expanding subjective lexicons for social media mining with embedding subspaces. arXiv preprint arXiv:1701.00145.

Their approach to lexicon expansion “consists of training models to predict the labels of pre-existing lexicons, leveraging unsupervised word embeddings as features” (Amir et al., 2016). Correlations between their method and existing continuous datasets had a maximum of 0.68, an improvement over support vector regression. The resulting lexicon out-performed other methods in Tweet classification, although not all methods were compared.

Hamilton, W. L., K. Clark, J. Leskovec, and D. Jurafsky (2016). Inducing domain-specific sentiment lexicons from unlabeled corpora. arXiv preprint arXiv:1606.02820.

Hamilton et al. (2016) utilize the approach set out in Velikovich et al. (2010), generating corpus-specific word embeddings using SVD and propagating sentiment labels on an inferred k-NN network. The most novel part of the approach measures the uncertainty in predicted labels with a bootstrapping procedure that holds out fractions of the seed set (with a seed set of 10 words, holding out 2). They claim to measure performance with correlations to the existing dataset of Warriner et al. (2013), but we did not find these correlations in their results.

Mandera, P., E. Keuleers, and M. Brysbaert (2015). How useful are corpus-based methods for extrapolating psycholinguistic variables? The Quarterly Journal of Experimental Psychology 68(8), 1623–1642.

Mandera et al. (2015) measure the sensitivity of the performance of corpus-specific sentiment dictionaries to the number of words in the training data. They split the Warriner et al. (2013) corpus into training and testing sets at different thresholds (e.g., 70/30 and 80/20). Networks are built using k-NN and Random Forests on four different distance metrics, and the best performance is attained from the SVD of the PMI embedding and a k-NN. They show that accuracy for this best method varies from .61–.72 between a 10/90 and a 50/50 split into testing and training. The reported accuracy leads the authors to cast doubt on the efficacy of automated approaches, but their survey is not exhaustive and the next methods we will explore improve upon the accuracy.

Van Rensbergen, B., S. De Deyne, and G. Storms (2016). Estimating affective word covariates using word association data. Behavior Research Methods 48(4), 1644–1652.

Van Rensbergen et al. (2016) estimate word scores using word association data for 14K Dutch words, finding the best correlation between this method and human evaluation for a k-NN algorithm (they also tried “Orientation towards Paradigm Words”). They obtained correlations for valence, arousal, and dominance of .91, .84, and .85. This performance is considerably better than was achieved by Mandera et al. (2015) for English using corpus derived word similarity. These results highlight the sensitive differences between word analogy tasks for human readers and the information extracted by vector space embedding methods.

Rothe, S., S. Ebert, and H. Schütze (2016). Ultradense word embeddings by orthogonal transformation. arXiv preprint arXiv:1602.07572.

Rothe et al. (2016) transform the embedding space of words via optimization of certain dimensions onto known semantic properties. This amounts to reducing the 300 or so dimensions typically used for vector space embedding into fewer than three dimensions. They apply Stochastic Gradient Descent (SGD) to learn a transformation that orthogonalizes the embedding matrix under the constraint of establishing a sentiment dimension. This approach is more successful than embedding words directly into such a low dimensional space, agreeing with previous work that has shown vector embedding performs best with more than 100 dimensions, while still extracting the relevant semantic information for sentiment analysis. For lexicon creation, their approach, labeled “Densifier”, achieves the statistically significant best performance on SemEval 2015 Task 10E with a Kendall’s τ of .654.

Altogether, these approaches provide a roadmap and demonstrate the possibility of constructing a high-quality, general purpose, phrase based sentiment dictionary.

1.2.6 Visualization

Lacking from the bulk of research that applies sentiment analysis, but crucial for validation and understanding, is visualization of sentiment analysis. Despite limited attempts by researchers in sentiment analysis to use visualization to understand their analysis, online tools have been built to allow anyone to build simple visualizations in a straightforward way (Viegas et al., 2007). Motivation for our choice of dictionary-based methods along with a straightforward averaging algorithm for generating scores is that the analysis can be visualized to be understood. The averaging algorithm is linear and this allows for the comparison of the individual word contributions to text sentiment classification, both enabling greater understanding and validating the analysis.

An overview of previous approaches to text visualization can be found in Heer (2014) and Cao and Cui (2016). We note the four goals of text visualization as identified by Heer: understanding, comparison, grouping, and correlation. Here, we focus on the task of understanding. A selection of recent work that builds on this task is available from Hearst (2009) (Chapter 11), Chuang et al. (2012), Van Ham et al. (2009), and Chuang et al. (2012).

Visualizations of readable portions of text are able to communicate the results of analysis at that level, such as the syntactic parse visualization in Figure 1.4. On a sentence level, we can see with or without added visual clues (e.g., colored backgrounds or font size) which individual words have either positive or negative scores, and how their balance contributes to the average-based classification. When rules become involved, this process is more complicated and it may be necessary to utilize a sentence diagram to understand the classification at even the individual sentence level. Neither of these approaches scales to visualize more than individual sentences, a fundamental shortcoming in working with big data.

Figure 1.4: Visualization of a syntactic dependency parse with the displaCy tool from Honnibal (2015), a companion to the spaCy package for NLP in Python. The tool doubles as an annotation tool with key-based input for efficient manual dependency tagging.

Next, we examine tag clouds as a tool to understand text and the results of text analysis.

Tag clouds

Tag clouds are a popular method for displaying the results of text analysis, with the size of text being used to represent one variable from the analysis and the words laid out with random locations, angles, and colors, generally positioned to minimize white space. Various attempts have been made to assess the efficacy of tag clouds compared to more traditional statistical information visualizations such as bar charts, with a consensus that they are less effective, though aesthetically pleasing: see Halvey and Keane (2007), Rivadeneira et al. (2007), and Hearst and Rosner (2008). One popular package for producing word cloud layouts is “Wordle” from Feinberg (2009).

Since tag clouds by Wordle have random layouts, improvements that incorporate relevant information into the layout itself have been considered. Schrammel et al. (2009) compare the performance and likability of four approaches: alphabetic, random, similarity on Flickr, and distance in WordNet. From 64 participants, they find that “semantically clustered tag clouds can provide improvements over random layouts in specific search tasks and that they tend to increase the attention towards tags in small fonts compared to other layouts”.

In Lohmann et al. (2009) tag cloud layouts are compared on three tasks and results show that there is no single best layout. The three tasks they test and the best layout for each are:

  • Finding a specific tag: Sequential layout with alphabetical sorting.

  • Finding the most popular tags: Circular layout with decreasing popularity.

  • Finding tags that belong to a certain topic: Thematically clustered layout.

It is also confirmed using eye tracking that tag clouds are scanned (not read), attention is focused on the center of the tag cloud, and they all perform sub-optimally for looking up specific words.

Viegas et al. (2009) study the social (non-academic) use of Wordle, finding that the existence of tools for building custom Wordles was crucial to their popularity and that 35% of men and 49% of women under the age of 20 did not know that font size encodes frequency of usage.

Adding a time component to tag clouds with the use of “sparklines”, Lee et al. (2010) find that SparkClouds are able to communicate trends as well. New layouts attempt to incorporate additional information to tag clouds through layout and color, such as the TAGGLE system of Emerson et al. (2015).

Moving beyond tag clouds, we briefly present word shift graphs in the next section.

Word shift graphs

An indispensable scientific tool for visualizing text analysis is the word shift graph. The graph was first designed and put to use by Dodds and Danforth (2009) to understand the result of sentiment analysis. Online, interactive versions of the graphs are widely used, and more details on the use of these graphs are available online. The important difference between the word shift graph and tag cloud is that the word shift graph uses both spatial dimensions meaningfully, encoding the ranking of words in the vertical direction and the relevant statistical value in the horizontal direction, enabling comparison between the values. We present a closer examination of an example word shift graph in Figure 1.5.

Figure 1.5: We quote the following caption and re-use the figure from Cody et al. (2015): A word shift graph comparing the happiness of tweets containing the word “climate” to all unfiltered tweets. The reference text is roughly 100 billion tweets from September 2008 to July 2014. The comparison text is tweets containing the word “climate” from September 2008 to July 2014. A yellow bar indicates a word with an above average happiness score. A purple bar indicates a word with below average happiness score. A down arrow indicates that this word is used less within tweets containing the word “climate”. An up arrow indicates that this word is used more within tweets containing the word “climate”. Words on the left side of the graph are contributing to making the comparison text (climate tweets) less happy. Words on the right side of the graph are contributing to making the comparison text more happy. The small plot in the lower left corner shows how the individual words contribute to the total shift in happiness. The gray squares in the lower right corner compare the sizes of the two texts. The circles in the lower right corner indicate how many happy words were used more or less and how many sad words were used more or less in the comparison text.
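The quantities encoded in a word shift graph follow from the linearity of the averaging algorithm. As a minimal sketch, assuming the commonly used decomposition in which each word contributes the product of its score relative to the reference average and its change in relative frequency, the ranked contributions can be computed as below. The word scores and the two tiny texts are hypothetical, and the function names are ours.

```python
from collections import Counter


def average_happiness(counts, scores):
    """Frequency-weighted average score over the words that have a score."""
    total = sum(n for word, n in counts.items() if word in scores)
    weighted = sum(scores[word] * n for word, n in counts.items() if word in scores)
    return weighted / total


def word_shift(ref_counts, comp_counts, scores):
    """Per-word contributions to (comparison average - reference average),
    sorted by decreasing absolute contribution."""
    ref_total = sum(n for word, n in ref_counts.items() if word in scores)
    comp_total = sum(n for word, n in comp_counts.items() if word in scores)
    ref_avg = average_happiness(ref_counts, scores)
    shifts = {}
    for word, score in scores.items():
        p_ref = ref_counts.get(word, 0) / ref_total
        p_comp = comp_counts.get(word, 0) / comp_total
        # Sign of (score - ref_avg): above or below average word (bar color).
        # Sign of (p_comp - p_ref): used more or less (arrow direction).
        shifts[word] = (score - ref_avg) * (p_comp - p_ref)
    return sorted(shifts.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical scores on a 1 to 9 scale and toy texts, for illustration only.
scores = {"love": 8.4, "happy": 8.3, "war": 1.8, "the": 5.0}
reference = Counter("the the love happy war".split())
comparison = Counter("the war war love".split())

ranked = word_shift(reference, comparison, scores)
total_shift = sum(value for _, value in ranked)
```

The contributions sum to the difference between the two average scores, which is why the ranked list can be read as an explanation of that difference: here the increased use of a below-average word dominates the negative shift.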

We elaborate more on the construction, present use cases where the word shift graph helps us understand successes and failures of sentiment analysis, and generally make extensive use of the word shift graph as a tool in Chapter 3. A future effort could aim to assess the efficacy of the word shift graph for text-based research, by performing a task-level user study.

1.2.7 Benchmarking literature

In this section we review recent efforts to benchmark sentiment analysis methods for their performance.

Liu, B. (2012, May). Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies. San Rafael, CA: Morgan & Claypool Publishers.

This book from Bing Liu provides a broad overview of sentiment analysis and the many different problems that it hopes to address, as well as summaries of many common approaches. Liu provides a framework to understand the aspects of sentiment analysis, with the levels of analysis (aspect, sentence, document level), and goals including classification and opinion summarization. In Chapter 8, a discussion of the methods for generating sentiment dictionaries is presented, and includes manual, dictionary-based, and corpus-based approaches. Survey methods are not considered (the well-known ANEW dictionary is absent), and there is some confusion between methods that use a dictionary to propagate scores and those that use features of a corpus to propagate scores (Velikovich et al. (2010) is incorrectly classified as the former). While the references are extensive, no analysis is conducted to understand how the different approaches for generating sentiment dictionaries perform. Despite these shortcomings, the book is a broad and very useful guide to the landscape of sentiment analysis.

Hutto, C. J. and E. Gilbert (2014). Vader: A parsimonious rule-based model for sentiment analysis of social media text. In Eighth International AAAI Conference on Weblogs and Social Media.

This paper is focused on the development of a new dictionary-based method for sentiment analysis that incorporates a rule-based system and a dictionary tailored to social media. While other papers that introduce dictionaries for sentiment analysis have made comparisons between methods (e.g., LIWC correlations between the 2001, 2007, and 2015 dictionaries on their website), we include this as a benchmark because of the uncommon rigor in the comparisons made. In particular, Hutto and Gilbert compare their new method VADER to 11 other sentiment analysis methodologies. They compare to seven dictionary-based methods and four ML methods, and find favorable correlations between the classification of Tweets for the dictionary-based methods. In addition they perform tests to measure the performance gains to be had using four rules and word sense disambiguation, finding mean F1 performance gains of 2 points on individual Tweets. These rules are a subset of those employed by Thelwall et al. (2012). The comparisons between sentiment dictionaries focus on the classification performance, and do not provide any insight into what properties of the dictionaries contribute to their performance. In addition, no effort is made to use sentiment analysis as more than a binary classifier, a shortcoming that we will address.

Giachanou, A. and F. Crestani (2016, June). Like it or not: A survey of twitter sentiment analysis methods. ACM Comput. Surv. 49(2), 28:1–28:41.

This extensive survey from Giachanou and Crestani provides an overview and categorization of methods used to quantify sentiment on Twitter. No quantitative comparisons are made between the methods themselves. The broad categories of the methods they find are based on those from Liu (2012):

  • Machine Learning.

  • Lexicon-Based.

  • Hybrid (Machine Learning & Lexicon-Based).

  • Graph-Based.

The focus is on ML approaches (as they note: “The majority of [Twitter Sentiment Analysis] methods use a method from the field of machine learning”).

Ribeiro, F. N., M. Araújo, P. Gonçalves, M. André Gonçalves, and F. Benevenuto (2016, jul). SentiBench — a benchmark comparison of state-of-the-practice sentiment analysis methods. EPJ Data Sci. 5(1), 23.

This recent benchmark from Ribeiro et al. was published while our work was under review, having been submitted after our preprint was released on the arXiv. The comparisons made by Ribeiro et al. utilize a variety of methods, and provide measures of performance for all methods based on F1 scores. The methods selected include commercial, ML, and dictionary-based methods, and they are applied to four corpora. Beyond metrics of classification performance, no insight is provided into the reasons why certain methods out-perform others, nor is there any focus on understanding texts through sentiment (or using visualization), the key tenets of our effort in Chapter 2.

1.3 Emotional arcs

Stories provide a useful framing to condense our experience, and through this they are both ubiquitous and powerful. In 2011, a DARPA initiative “Narrative Networks” (DARPA, 2011) said the following in relation to security:

Narratives exert a powerful influence on human thoughts and behavior. They consolidate memory, shape emotions, cue heuristics and biases in judgment, influence in-group/out-group distinctions, and may affect the fundamental contents of personal identity. It comes as no surprise that because of these influences stories are important in security contexts: for example, they change the course of insurgencies, frame negotiations, play a role in political radicalization, influence the methods and goals of violent social movements, and likely play a role in clinical conditions important to the military such as post-traumatic stress disorder.

The ubiquitous nature of stories is summed up well in Dodds (2013):

We humans are storytelling and story-finding machines: homo narrativus, if you will. In making sense of the world, we look for the shapes of meaningful narratives in everything. Even in science, we enjoy mathematical equations and algorithms because they are a kind of universal story. Fluids—the oceans and atmosphere, the blood in your body, honey—all flow according to a single, beautiful set of equations called the Navier-Stokes equations.
In our everyday, human stories, far away from science, we have a limited (if generous) capacity to entertain randomness—we are certainly not homo probabilisticus. Too many coincidences in a movie or book will render it unbelievable and unpalatable. We would think to ourselves, “that would never happen in real life!” This skews our stories. We tend to find or create story threads where there are none. While it can sometimes be useful to err on the side of causality, the fact remains that our tendency toward teleological explanations often oversteps the evidence.

In Chapter 3 we consider previous work that finds between one and 36 different plot types: Campbell (1949); Harris (1959); Abbott (2008); Booker (2006); Polti (1921). Of these, the work of Campbell and Moyers (1991) has gained popular attention as a result of the expositions of Dan Harmon in writing the show Community (Raftery, 2011). In a series of online posts, Harmon elaborates on the “monomyth” and its incorporation into the writing of the Star Wars movies (Volger, 1992). The plot here is cyclical, and therefore represented on a circle, and the argument goes that all well constructed plots can be arranged to fit into this mold. The basic circle consists of 8 locations, starting and ending in the same place, and we show a labeled visualization of these locations in Figure 1.6.

Figure 1.6: Harmon cycles with and without labels, as used to develop the show Community. The cyclical nature of the story has roots in the “monomyth” of Campbell (1949).

Lacking from the existing work considering theories of plot is a strong grounding in empirical evidence, or evidence for the stability of the “universal” theories across cultures. It is precisely this shortcoming which we hope to address, by using a broad collection of fiction stories within Western culture.

1.3.1 Story graphs, plot diagrams, and inferring causality

With the distinction between plot, structure, and emotional trajectory in mind, there have also been attempts to discover plot using data driven methods. In Brewer and Lichtenstein (1980), the distinction between plot and structure is made even clearer. Through experimentation with different structures, Brewer and Lichtenstein find that the resulting affect in readers is different, with some structures being considered stories and others not (the authors single out “suspense and resolution” and “surprise and resolution” as indicative of stories).

Plot units were first introduced by Lehnert (1981), and form the basis for almost all efforts that follow.

Using topic modeling, both Schmidt (2015b) and Jockers (2013) find known patterns of plot across many thousand stories. In Piper (2015a), computational analysis is applied to realize the potential of distant reading (a term owing to Moretti (2013)) to find and test scholarly insights. In Winston (2011), a system called “Genesis” is developed to compare plot summaries and infer causal connections between events, with the broad aim of the system formalized as the Strong Story Hypothesis:

The mechanisms that enable humans to tell, understand, and recombine stories separate human intelligence from that of other primates.

In his Master’s Thesis, Awad (2013) extends the Genesis system to model differences in American and Chinese stories by adding commonsense rules that differ between cultures. With commonsense rules, Genesis is able to measure story coherence.

Work by Regneri et al. (2010) learns event scripts from written descriptions of events that may not always exist in written form (implicit scripts, like shopping), using a graph-based (“temporal script graph”) algorithm and data collected on Amazon’s Mechanical Turk. The algorithm is tested to detect similar events with differing descriptions.

The Analogical Story Merging (ASM) system is developed using “Bayesian model merging” for story categorization and is applied to 15 Russian folktales (Finlayson, 2011). The test folktales are annotated for 18 aspects of meaning by 12 annotators using a tool developed for this task. The folktale categories defined by Vladimir Propp are predicted by ASM and the system achieves a Rand Index of 0.511 (a measure of the similarity between clusters).
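For readers unfamiliar with the measure, the plain (unadjusted) Rand index is the fraction of item pairs on which two clusterings agree, that is, pairs placed together in both or apart in both. The sketch below computes it for hypothetical label lists; whether the reported 0.511 refers to this plain form or to a chance-corrected variant is not stated here, so treat the code only as a definition of the basic measure.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree.

    A pair counts as an agreement if both clusterings put the two items in
    the same cluster, or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two items")
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Hypothetical cluster labels for four items.
same_partition = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_partition = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index depends only on which items are grouped together, not on the label names, so relabeling the clusters leaves it unchanged.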

In Elson (2012a) a Story Intention Graph (SIG) is developed to model stories and implemented to measure similarity and analogy. Elson’s propositional similarity metric is used to predict human judgments of story similarity and outperforms human annotation (is better than inter-annotator agreement) on Aesop’s fables.

The AESOP system of Goyal et al. (2013) converts narrative texts into their plot unit model (where plot units are “conceptual knowledge structure to represent the affect states of and emotional tensions between characters in narrative stories”). AESOP performs four steps: “affect state recognition, character identification, affect state projection, and link creation.” Performance is inspected on a set of Aesop’s fables, similar to Elson (2012a).

In Novel Devotions: Conversional Reading, Computational Modeling, and the Modern Novel, Piper (2015a) applies Multi-Dimensional Scaling (MDS) on representations of novels in a VSM (Vector Space Model, i.e., vectors of word frequencies), and performs hierarchical clustering to understand the differences between novels and autobiographies.

1.3.2 Story generation


In Plot Induction and Evolutionary Search for Story Generation, McIntyre and Lapata (2010) build upon their previous work to train a story planner from extracted events, their participants, and preceding relationships from a large corpus. Their system is used to generate simple stories of 4 or 5 sentences that are mildly coherent.

The Neukom Institute at Dartmouth hosts a competition for algorithms to produce short stories, in the fashion of a true Turing test (Neukom Institute, 2016). In the 2016 competition, algorithms and writers were given a one-word prompt and tasked to write a 500-word short story. The stories were then judged by a panel consisting of David Cope, Lynn Neary, and David Krakauer to be either human or machine written. Each judge received 8 human written stories and 3 machine generated stories, one from each of the 3 entrants into the competition. To quote their results:

No machine won, but one submission generated by Toksu and Ibrahim on the seed “thesaurus” “fooled” one of the judges!

With no first place award, the second place award and $1000 prize was awarded to Judy Malloy whose algorithm rearranged sentences from “Another Party in Woodside”.

1.3.3 Character Identification and Networks

Much work on computational understanding of stories has focused on the extraction and analysis of character networks. The ideas behind character networks were first examined in the original work of Moretti (Moretti, 2000, 2007; Schulz, 2011; Moretti, 2013), and have been used widely in Digital Humanities research. Below we highlight work that has caught our attention.

Elson et al. (2010) use character name chunking, quoted speech attribution and conversation detection to generate character networks from a collection of British novels. They find a lack of support for characterizations provided by literary scholars and suggest an alternative explanation. Namely, they do not find support for the hypotheses that (1) the social networks of 19th century novels differ by the setting of the novel (rural vs. urban) and that (2) novels with more characters have less dialogue (an inverse relationship is suggested by the so-called “chronotype” theories). Instead they find a strong correlation with the point of view of narration (first vs. third person). This work applies the distant reading philosophy by first carefully selecting a corpus of books and consulting previous literary research before doing analysis, an approach we aspire to emulate. Elson later extended this work with models of discourse (Elson, 2012b).

Bamman et al. use Bayesian models, word embeddings, and state-of-the-art NLP techniques to learn personas of characters in literature (Bamman et al., 2014) and in film (Bamman et al., 2014). Their analysis is performed across large corpora of 15,099 books selected from HathiTrust and 42,306 Wikipedia movie plot summaries for film, and is shown to replicate the classification of character roles by a literary scholar. A similar effort is undertaken by Valls-Vargas et al. (2014), utilizing PoS annotations from syntactic parsing to detect characters in a small set of stories, and using “action matrices” in another attempt (Valls-Vargas et al., 2014) to encode Propp’s narrative theory. They are able to automatically detect the roles of characters within 10 folktales (developing a system they refer to as “Voz”).

These methods have also been used to examine popular culture. In a blog post, Gabasova (2015) finds the most central character in Star Wars. Xanthos et al. (2016) elaborate on the method of constructing and visualizing character networks; an example of their work for Shakespeare is available as a poster. Min and Park (2016) perform an in-depth study of Victor Hugo’s Les Misérables, proposing the growth of edges and characters in the network over time as a way to compare different works (with each edge/character curve normalized to sum to 1 at the end of each book). More recently, Wu (2016) has made an interactive exploration of the play Hamilton using discourse and the character network, and Meeks and Averick built an interactive exploration of the dialogue in the show Archer (Meeks and Averick).

To compare character networks across movies, Ruths (2016) uses network alignment to map characters between the Star Wars movies The Force Awakens and A New Hope, revealing both expected and surprising similarities. For example, R2-D2 maps to BB-8 and Chewbacca maps to Chewbacca, as we might expect. However, the main characters have more surprising alignments from the interaction networks, with Luke mapping to Poe, Obi-wan mapping to Kylo Ren, and Darth Vader mapping to Rey. A particular problem in using character networks that span an entire movie, TV show, or book is that multiple story lines can intersect in ways that are not accounted for by the method. Bost et al. (2016) examine conversation in TV shows using a smoothing of narration to overcome the multiple narrative problem, finding protagonists more readily than with simpler interaction networks.

1.3.4 Frames for NLP

The seminal work by Schank and Abelson (1977) (and earlier efforts by Rumelhart (1975)) laid the groundwork for scripts as a framework for cognitive algorithmic computation. Research programs separately advancing AI capabilities and NLP tasks have made use of this framework. Although existing knowledge bases such as SUMO (Niles and Pease, 2001), Cyc (Lenat, 1995) or FrameNet (Fillmore et al., 2003) contain such script-like knowledge to a certain extent, their coverage is severely limited. Increases in computational power have realized the building of systems for script-based event detection, and there have been many efforts made in the past decade to advance such systems. Schemata such as NarrativeML to annotate narratives are reviewed by Mani (2012). Next, we very briefly highlight some of these approaches, focusing particularly on the research program of Chambers due to the accessibility of the papers and the breadth of research by himself and his students.

In a series of papers, Chambers et al. (2007); Chambers and Jurafsky (2008, 2009, 2010); Chambers (2013) set out to classify temporal relations between events, apply unsupervised learning to detect narrative event chains and the entities involved, build a database of narrative schemata, and find schemata in large corpora with probability-based models. A narrative event chain is defined as two events linked by a common actor. Event chains are identified in text through co-reference to a single entity, ordered by a trained classifier, and all possible event chains are restricted through a clustering approach in Chambers et al. (2007); Chambers and Jurafsky (2008). Both Cheung et al. (2013) (using the proposed approach of O’Connor (2013)) and Chambers (2013) utilize generative models for inducing event schemata, with the former utilizing an HMM over latent event variables and the latter using an entity-driven model. Recent work from Pichotta and Mooney (2015) improves on the baseline results of Chambers in detecting scripts using Recurrent Neural Networks (RNNs, particularly a flavor known as Long Short Term Memory (LSTM)) and architectures adapted to this task.

Corpora used by Chambers and by others include FrameNet from Baker et al. (1998), the Timebank Corpus from Pustejovsky et al. (2003), the Opinion Corpus from Mani et al. (2006), the Narrative Schema Database from Chambers and Jurafsky (2010), the Media Frames Corpus by Card et al. (2015), and most recently the Story Cloze Dataset from Mostafazadeh et al. (2016). As an example, Do et al. (2011) use a primarily unsupervised approach to specifically learn causality between events in the Penn Discourse Treebank, and Roemmele et al. (2017) use an RNN on the Story Cloze dataset. The understanding and generation of stories with these data sets and new models may hold promise for major advances in the field of NLP. Cambria and White (2014) have suggested that the next wave of NLP advances that aim to decode stories (a move from “bag of words” approaches to “bag of narratives”) may very well be a breakthrough in understanding human nature.

Along those lines, stories have been explored as a model for training artificial intelligence systems in commonsense reasoning. Advances in this area all recognize and leverage the utility of stories for sense-making (Bex and Bench-Capon, 2010; Bex, 2013; Li et al., 2012; Riedl, 2016).

1.3.5 Visualization

Stories as a model for understanding are not readily visualized, as finding a proper encoding for the mental models we use is difficult. Nevertheless, efforts at capturing the essence of story in a visual form are omnipresent in art, and automated attempts to generate such mappings have been made (recall Figure 1.6). The illustrated movie maps of DeGraff and Harmon (2015) make representations of movies in the limited space of two pages by using three dimensions to show the movement of time and place. The web comic XKCD draws inspiration from the well-known visualization of Napoleon’s march by Minard and maps the interaction of characters with time on the x-axis and character proximity as distance on the y-axis of a chart; see Munroe (2009) and Figure 1.7. Ogawa and Ma (2010) attempt to automatically build XKCD-style plots for software development, and an image of their reproduction of the XKCD Lord of the Rings visualization is shown by Cao and Cui (2016).

Figure 1.7: XKCD number 657 by Munroe (2009) shows the time evolution of character co-occurrence in Lord of the Rings, Star Wars, Jurassic Park, 12 Angry Men, and Primer. Munroe adds: “in the LotR map, up and down correspond LOOSELY to northwest and southeast respectively.” The width of lines corresponds to the number of characters in each group, which applies here to the Orcs in Lord of the Rings.

1.3.6 Emotional arcs

The emotional arcs drawn by Vonnegut (1981) are simpler, using time again on the x-axis and representing the fortune of the main character in the vertical direction. Vonnegut explicitly draws a connection between the New Testament and Cinderella, a story that has incredible popular appeal. Other story arcs named by Vonnegut are the “Man in the Hole” and the “Boy meets girl” arcs.

With the same goal of finding commonalities between stories as Vonnegut (1981), in a series of blog posts Jockers (2014) lays out a strategy for generating emotional arcs and eventually finds six story types using hierarchical clustering. Our work in Chapter 3 is a continuation of a very different core methodology that we first propose in Dodds et al. (2015a). Though the core methodology is markedly different, we note that Jockers’s first blog post appeared 10 days before the pre-print of our paper. As we note in Chapter 3 as well, the distinction between plot and emotional arc, as well as the correct use of sentiment analysis tools, distinguishes our contributions from those of Jockers (2014).

Attempts to analyze plot more directly than emotional arc have been increasing in the past few years. Cherny (2016) applies machine learning over a bag-of-words analysis to predict action and sex scenes using Naive Bayes (NB) and Stochastic Gradient Descent (SGD). Training data is crowd-sourced from two ratings of 500 word chunks on the survey platform Mechanical Turk (MT), and Cherny develops novel visualizations of the relationships between topics in chapters.

Reiter et al. (2014) use an unsupervised method to generate and compare event-based representations of rituals and folktales, but we were unable to obtain their manuscript. Piper (2015a) analyzes the differences between the first and second half of novels about “conversion.” We revisit the approach by Schmidt (2015b) here: he uses Latent Semantic Analysis (LSA) and dimensionality reduction to find patterns of plot in a reduced 2-dimensional topic space. While this is an interesting approach, we would not expect the coefficients of the first two modes in the SVD to hold particular relationships between themselves. Most recently, the approach of measuring sentiment using sentences and smoothing has been published by Gao et al. (2016).

The most similar approach to ours (perhaps based on our method from Dodds et al. (2015a), though they cite Vonnegut) was an effort by Dan Kuster of the sentiment analysis startup Indico (Kuster, 2015). Kuster uses sliding windows and dynamic time warping as a distance metric between emotional arcs. On single books the method is indeed very similar to ours, yet it is not extended to mine for patterns across a large corpus.
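To make the distance metric mentioned above concrete, the following is a minimal sketch of the textbook dynamic time warping recurrence applied to two arcs given as lists of numbers. It is not Kuster's implementation, whose details are not given here; the function name `dtw_distance` and the choice of absolute difference as the local cost are assumptions made for illustration.

```python
from math import inf


def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences.

    Assumption: the local cost is the absolute difference between points.
    """
    n, m = len(a), len(b)
    # cost[i][j] is the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Because warping allows one point to align with several, two arcs with the same shape but different pacing receive a small distance, which is the property that makes the metric attractive for comparing stories of different lengths.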

1.3.7 Syuzhet and validation

The work of Jockers (2014) has been publicly debated in the online sphere. The back-and-forth between Matt Jockers and Annie Swafford (and others) has happened in blog posts (Swafford, 2015), comments on blogs, and on Twitter. The extent of this debate is documented in two parts by Clancy (2015). We attempt to briefly summarize some of the discussion of prominent scholars in digital humanities and how this relates to our own work on emotional arcs, particularly the comments of Bamman, Piper, Schmidt, Enderle, and Underwood.

Bamman (2015) elaborates on the discussion around how to measure the validity of emotional arcs, and goes on to build a survey to perform the validation proposed by Piper (2015b) and Weingart (Weingart). Bamman’s survey for Shakespeare’s Romeo and Juliet takes responses from 5 participants on Mechanical Turk for each scene on a -5 to 5 scale along with a free text reasoning for the score. We plot the mean of these ratings along with our measure of the emotional arc (the happiness of the words in the play for a sliding window of 10000 words and 200 time points) of the play in Figure 1.8. This approach could, of course, be extended to provide additional formal validation of the methods and parameters used in our study of emotional arcs. However, non-expert annotations are not always a proper gold-standard (Snow et al., 2008), and there may even be (we might even expect) valid interpretations of a story that produce different emotional responses. In this case, we would expect that our automated method would find one of these arcs, and the goal of a more advanced system could be to find more than one arc for a given book.
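The sliding-window measurement described above can be sketched as follows. This is an illustrative reading, not the code used in our study: the function name `emotional_arc`, the decision to slide the window over dictionary-matched words only, and the even spacing of window start positions are all assumptions.

```python
def emotional_arc(words, scores, window=10000, points=200):
    """Average happiness in `points` evenly spaced sliding windows.

    Assumptions: `scores` maps words to happiness values, and the window
    slides over the sequence of dictionary-matched words only.
    """
    rated = [scores[w] for w in words if w in scores]
    if len(rated) < window:
        raise ValueError("text has fewer rated words than the window size")
    last_start = len(rated) - window
    arc = []
    for k in range(points):
        # Spread window start positions evenly from the first to the last valid offset.
        start = round(k * last_start / (points - 1)) if points > 1 else 0
        chunk = rated[start:start + window]
        arc.append(sum(chunk) / len(chunk))
    return arc
```

Larger windows smooth the arc at the cost of temporal resolution, so the window size and number of points are parameters that any validation against human ratings would need to explore.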

In addition to the problems identified by Swafford, Schmidt (2015a) builds on Enderle (2015) and highlights the problem that the low pass filter needs to be circular. These discussions have provided many interesting future directions for this work and the validation of computational approaches to narratives.

Figure 1.8: Emotional arcs of Shakespeare’s Romeo and Juliet, generated with the labMT sentiment dictionary and the average of 5 human annotations on each scene. The labMT approach generated 100 time points, with 2000 rated words at each point shown, ignoring scene boundaries (the same approach used in general). The human annotation data is from a survey conducted in Bamman (2015) with 5 responses for each of the 26 scenes in the play; points are shown on the x-axis in the center of each scene’s words. The survey collected responses from -5 to 5, which we have re-scaled linearly to -1 to 1 (by dividing by 5), and the labMT data is re-scaled by first mean centering the time series, then multiplying by the inverse of the absolute maximum (such that the time series will touch -1 or 1 in the direction of the absolute maximum).
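The two re-scalings described in the caption can be written out as a short sketch. The function names `rescale_arc` and `rescale_survey` are hypothetical, and the handling of a perfectly flat series is an assumption added to avoid division by zero.

```python
def rescale_arc(series):
    """Mean-center a time series, then divide by its largest absolute value."""
    mean = sum(series) / len(series)
    centered = [x - mean for x in series]
    peak = max(abs(x) for x in centered)
    if peak == 0:
        return centered  # assumption: a flat series is returned as all zeros
    return [x / peak for x in centered]


def rescale_survey(ratings, limit=5):
    """Linearly map ratings on a -limit..limit scale to -1..1."""
    return [r / limit for r in ratings]
```

After this transformation the dictionary-based arc touches -1 or 1 exactly once (in the direction of its largest excursion), while the survey means keep their original zero point.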

Our own work on emotional arcs (Chapter 3) has attracted a great deal of popular attention and has been noticed by those in the digital humanities community, particularly by Schmidt (2016) and Enderle (2016). We address the concerns raised in both of these critical takes in our work. Drawing directly from the suggestions from Schmidt, we utilize the Library of Congress classification to produce a better selection of texts from Project Gutenberg in our published manuscript, a notable improvement from the pre-print corpus he analyzes. In our treatment, we carefully consider the choice of a suitable null hypothesis to test whether there is structure in the emotional arcs of real stories. Our first pass used the emotional arcs of the same books with randomly shuffled words (“word salad” books), for a corpus that has no narrative structure but the same emotional words. The final version of our null model generates stories from a bigram Markov chain trained on the actual text. These “nonsense” narratives have no real structure, but resemble written English. For more complete details and sample text from each method, see Section B.3. Other reasonable attempts could consider shuffling sentences or paragraphs; however, Brownian noise and arbitrary random walks are not sensible comparisons. In particular, the singular value spectrum of Brownian noise is arbitrary.
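The two null models described above can be sketched in a few lines. This is a simplified illustration rather than the procedure documented in Section B.3: the function names, the fixed random seeds, and the rule of restarting from a random word when the chain reaches a word with no observed successor are assumptions.

```python
import random
from collections import defaultdict


def word_salad(words, seed=0):
    """Shuffle the words of a text: same vocabulary, no narrative order."""
    rng = random.Random(seed)
    shuffled = list(words)
    rng.shuffle(shuffled)
    return shuffled


def bigram_null(words, length, seed=0):
    """Generate `length` words from a bigram Markov chain fit to `words`."""
    if length <= 0:
        return []
    rng = random.Random(seed)
    followers = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)  # duplicates preserve bigram frequencies
    current = rng.choice(words)
    out = [current]
    while len(out) < length:
        options = followers.get(current)
        # Assumption: at a dead end (no observed successor), restart at random.
        current = rng.choice(options) if options else rng.choice(words)
        out.append(current)
    return out
```

The shuffled text keeps exactly the same word counts as the original, whereas the Markov text keeps local word order only, which is why it resembles written English while carrying no plot.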

In the next Chapter, we test sentiment analysis methods for performance in classification and providing understanding of emotional text, methods that form the basis of our study into emotional arcs.

2.1 Introduction

As we move further into what might be called the Sociotechnocene—with increasingly more interactions, decisions, and impact being made by globally distributed people and algorithms—the myriad human social dynamics that have shaped our history have become far more visible and measurable than ever before. Driven by the broad implications of being able to characterize social systems in microscopic detail, sentiment detection for populations at all scales has become a prominent research arena. Attempts to leverage online expression for sentiment mining include prediction of stock markets (Bollen et al., 2011; Si et al., 2013; Chung and Liu, 2011; Ruiz et al., 2012), assessing responses to advertising, real-time monitoring of global happiness (Dodds et al., 2015a), and measuring a health-related quality of life (Alajajian et al., 2016). The diverse set of instruments produced by this work now provide indicators that help scientists understand collective behavior, inform public policy makers, and, in industry, gauge the sentiment of public response to marketing campaigns. Given their widespread usage and potential to influence social systems, understanding how these instruments perform and how they compare with each other has become imperative. Benchmarking their ability to provide insight into sentiment, and their performance, both focuses future development and provides practical advice to non-experts in selecting a sentiment dictionary.

We identify sentiment detection methods as belonging to one of three categories, each carrying their own advantages and disadvantages:

  1. Dictionary-based methods (Dodds et al., 2015a; Bradley and Lang, 1999; Pennebaker et al., 2001; Wilson et al., 2005; Liu, 2010; Warriner et al., 2013),

  2. Supervised learning methods (Liu, 2010), and

  3. Unsupervised (or deep) learning methods (Socher et al., 2013).

Here, we focus on dictionary-based methods, which all center around the determination of a text $T$’s average happiness (sometimes referred to as valence) with sentiment dictionary $D$ through the equation:

$$h^{D}_{\mathrm{avg}}(T) = \frac{\sum_{w \in D} h_D(w)\, f_T(w)}{\sum_{w \in D} f_T(w)} = \sum_{w \in D} h_D(w)\, p_T(w),$$

where we denote each of the words in a given sentiment dictionary $D$ as words $w$, word sentiment scores as $h_D(w)$, word frequency as $f_T(w)$, and normalized frequency of $w$ in $T$ as $p_T(w) = f_T(w) / \sum_{w' \in D} f_T(w')$. In this way, we measure the happiness of a text in a manner analogous to taking the temperature of a room. While other simple sentiment metrics may be considered, we will see that analyzing individual word contributions is important and that this equation allows for a straightforward, meaningful interpretation.
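As a minimal sketch of this frequency-weighted average, the function below scores a tokenized text against a dictionary given as a plain mapping from words to scores. The function name `average_happiness` is hypothetical, and the three scores in the usage note are labMT values quoted later in this chapter, used here only as a toy dictionary.

```python
from collections import Counter


def average_happiness(words, scores):
    """Frequency-weighted mean score of the words that appear in `scores`."""
    # Only words present in the dictionary contribute to the counts.
    counts = Counter(w for w in words if w in scores)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no words in the text match the dictionary")
    return sum(scores[w] * f for w, f in counts.items()) / total
```

For example, with `scores = {"laughter": 8.50, "book": 7.24, "hospital": 3.50}` and the text `["laughter", "book", "hospital", "book", "the"]`, the unmatched word "the" is ignored and the result is (8.50 + 2 × 7.24 + 3.50) / 4 = 6.62.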

Dictionary-based methods offer two distinct advantages which we find necessary: (1) they are in principle corpus agnostic (applicable to corpora without ground truth data available) and (2) in contrast to black box (highly non-linear) methods, they offer the ability to “look under the hood” at words contributing to a particular score through word shift graphs (defined fully later; see also (Dodds and Danforth, 2009; Dodds et al., 2011)). Indeed, if we are at all concerned with understanding why a particular scoring method varies (e.g., our undertaking is scientific), then word shift graphs are essential tools. In the absence of word shift graphs, or similar devices, any explanation of sentiment trends is missing crucial information and rises only to the level of opinion or guesswork (Golder and Macy, 2011; Garcia et al., 2015; Dodds et al., 2015b; Wojcik et al., 2015).

As all methods must, dictionary-based “bag-of-words” approaches suffer from various drawbacks, and three are worth stating up front. First, they are only applicable to corpora of sufficient size, well beyond that of a single sentence (Ribeiro et al., 2016) (widespread usage in this misplaced fashion does not suffice as a counterargument). We directly verify this assertion on individual Tweets: although some sentiment dictionaries perform admirably, the average (median) F1-score on the STS-Gold data set is 0.50 (0.54) from all datasets (Table A.1), and others have shown similar results for dictionary methods with short text (Ribeiro et al., 2016). Second, state-of-the-art learning methods with a sufficiently large training set for a specific corpus will outperform dictionary-based methods on the same corpus (Liu, 2012). However, in practice the domains and topics to which sentiment analysis is applied are highly varied, such that training to a high degree of specificity for a single corpus may not be practical and, from a scientific standpoint, will severely constrain attempts to detect and understand universal patterns. Third, words may be evaluated out of context or with the wrong sense. A simple example is the word “miss” occurring frequently when evaluating articles in the Society section of the New York Times. This kind of contextual error is something we can readily identify and correct for through word shift graphs, but would remain hidden to users of nonlinear learning methods.

We lay out our paper as follows. We list and describe the dictionary-based methods we consider in Sec. Dictionaries, Corpora, and Word Shift Graphs, and outline the corpora we use for tests in Subsec. Corpora Tested. We present our results in Sec. Results, comparing all methods in how they perform for specific analyses of the New York Times (NYT) (Subsec. New York Times Word Shift Analysis), movie reviews (Subsec. Movie Reviews Classification and Word Shift Analysis), Google Books (Subsec. Google Books Time Series and Word Shift Analysis), and Twitter (Subsec. Twitter Time Series Analysis). In Subsec. Brief Comparison to Machine Learning Methods, we make some basic comparisons between dictionary-based methods and machine learning approaches. We provide concluding remarks in Sec. Conclusion and bolster our findings with figures, tables, and additional analysis in the Supporting Information.

2.2 Sentiment Dictionaries, Corpora, and Word Shift Graphs

| Dictionary | # Entries | Range | Construction | License | Ref. |
|---|---|---|---|---|---|
| labMT | 10222 | 1.3 to 8.5 | Survey: MT, 50 ratings | CC | (Dodds et al., 2015a) |
| ANEW | 1034 | 1.2 to 8.8 | Survey: FSU Psych 101 | Free for research | (Bradley and Lang, 1999) |
| LIWC07 | 4483 | [-1,0,1] | Manual | Paid, commercial | (Pennebaker et al., 2001) |
| MPQA | 7192 | [-1,0,1] | Manual + ML | GNU GPL | (Wilson et al., 2005) |
| OL | 6782 | [-1,1] | Dictionary propagation | Free | (Liu, 2010) |
| WK | 13915 | 1.3 to 8.5 | Survey: MT, 14–18 ratings | CC | (Warriner et al., 2013) |
| LIWC01 | 2322 | [-1,0,1] | Manual | Paid, commercial | (Pennebaker et al., 2001) |
| LIWC15 | 6549 | [-1,0,1] | Manual | Paid, commercial | (Pennebaker et al., 2001) |
| PANAS-X | 20 | [-1,1] | Manual | Copyrighted paper | (Watson and Clark, 1999) |
| Pattern | 1528 | -1.0 to 1.0 | Unspecified | BSD | (De Smedt and Daelemans, 2012) |
| SentiWordNet | 147700 | -1.0 to 1.0 | Synset synonyms | CC BY-SA 3.0 | (Baccianella et al., 2010) |
| AFINN | 2477 | [-5,-4,...,4,5] | Manual | ODbL v1.0 | (Nielsen, 2011) |
| GI | 3629 | [-1,1] | Harvard-IV-4 | Unspecified | (Stone et al., 1966) |
| WDAL | 8743 | 0.0 to 3.0 | Survey: Columbia students | Unspecified | (Whissell et al., 1986) |
| EmoLex | 14182 | [-1,0,1] | Survey: MT | Free for research | (Mohammad and Turney, 2013) |
| MaxDiff | 1515 | -1.0 to 1.0 | Survey: MT, MaxDiff | Free for research | (Kiritchenko et al., 2014) |
| HashtagSent | 54129 | -6.9 to 7.5 | PMI with hashtags | Free for research | (Zhu et al., 2014) |
| Sent140Lex | 62468 | -5.0 to 5.0 | PMI with emoticons | Free for research | (Mohammad et al., 2013) |
| SOCAL | 7494 | -30.2 to 30.7 | Manual | GNU GPL | (Taboada et al., 2011) |
| SenticNet | 30000 | -1.0 to 1.0 | Label propagation | Citation requested | (Cambria et al., 2014) |
| Emoticons | 132 | [-1,0,1] | Manual | Open source code | (Gonçalves et al., 2013) |
| SentiStrength | 2615 | [-5,-4,...,4,5] | LIWC+GI | Free for research | (Thelwall et al., 2010) |
| VADER | 7502 | -3.9 to 3.4 | MT survey, 10 ratings | Freely available | (Hutto and Gilbert, 2014) |
| Umigon | 927 | [-1,1] | Manual | Public Domain | (Levallois, 2013) |
| USent | 592 | [-1,1] | Manual | CC | (Pappas et al., 2013) |
| EmoSenticNet | 13188 | [-10,-2,-1,0,1,10] | Bootstrapped extension | Non-commercial | (Poria et al., 2013) |

Table 2.1: Summary of dictionary attributes used in sentiment measurement instruments. We provide all acronyms and abbreviations and further information regarding sentiment dictionaries in Subsec. Dictionaries. We test the first 6 dictionaries extensively. The range indicates whether scores are continuous or binary (we use the term binary for sentiment dictionaries for which words are scored as $\pm 1$ and optionally 0).

2.2.1 Sentiment Dictionaries

The words “sentiment dictionary,” “lexicon,” and “corpus” are often used interchangeably, and for clarity we define our usage as follows.

Sentiment Dictionary: Set of words (possibly including word stems) with ratings.

Corpus: Collection of texts which we seek to analyze.

Lexicon: The words contained within a corpus (often said to be “tokenized”).

We test the following six sentiment dictionaries in depth:


labMT: language assessment by Mechanical Turk (Dodds et al., 2015a).

ANEW: Affective Norms of English Words (Bradley and Lang, 1999).

WK: Warriner and Kuperman rated words from SUBTLEX by Mechanical Turk (Warriner et al., 2013).

MPQA: The Multi-Perspective Question Answering (MPQA) Subjectivity Dictionary (Wilson et al., 2005).

LIWC: Linguistic Inquiry and Word Count, three versions (Pennebaker et al., 2001).

OL: Opinion Lexicon, developed by Bing Liu (Liu, 2010).

We also make note of 18 other sentiment dictionaries:


PANAS-X: The Positive and Negative Affect Schedule, Expanded (Watson and Clark, 1999).

Pattern: A web mining module for the Python programming language, version 2.6 (De Smedt and Daelemans, 2012).

SentiWordNet: WordNet synsets each assigned three sentiment scores: positivity, negativity, and objectivity (Baccianella et al., 2010).

AFINN: Words manually rated -5 to 5 with impact scores by Finn Nielsen (Nielsen, 2011).

GI: General Inquirer: database of words and manually created semantic and cognitive categories, including positive and negative connotations (Stone et al., 1966).

WDAL: Whissell’s Dictionary of Affective Language: words rated in terms of their Pleasantness, Activation, and Imagery (concreteness) (Whissell et al., 1986).

EmoLex: NRC Word-Emotion Association Lexicon: emotions and sentiment evoked by common words and phrases using Mechanical Turk (Mohammad and Turney, 2013).

MaxDiff: NRC MaxDiff Twitter Sentiment Lexicon: crowdsourced real-valued scores using the MaxDiff method (Kiritchenko et al., 2014).

HashtagSent: NRC Hashtag Sentiment Lexicon: created from Tweets using Pointwise Mutual Information with sentiment hashtags as positive and negative labels (here we use only the unigrams) (Zhu et al., 2014).

Sent140Lex: NRC Sentiment140 Lexicon: created from the “sentiment140” corpus of Tweets, using Pointwise Mutual Information with emoticons as positive and negative labels (here we use only the unigrams) (Mohammad et al., 2013).

SOCAL: Manually constructed general-purpose sentiment dictionary (Taboada et al., 2011).

SenticNet: Sentiment dataset labeled with semantics and 5 dimensions of emotions by Cambria et al., version 3 (Cambria et al., 2014).

Emoticons: Commonly used emoticons with their positive, negative, or neutral emotion (Gonçalves et al., 2013).

SentiStrength: an API and Java program for general purpose sentiment detection (here we use only the sentiment dictionary) (Thelwall et al., 2010).

VADER: method developed specifically for Twitter and social media analysis (Hutto and Gilbert, 2014).

Umigon: Manually built specifically to analyze Tweets from the sentiment140 corpus (Levallois, 2013).

USent: set of emoticons and bad words that extend MPQA (Pappas et al., 2013).

EmoSenticNet: extends SenticNet words with WNA labels (Poria et al., 2013).

All of these sentiment dictionaries were produced by academic groups, and with the exception of LIWC, they are provided free of charge. In Table 2.1, we supply the main aspects, such as word count, score type (continuum or binary), and license information, for the sentiment dictionaries listed above. In the GitHub repository associated with our paper, we include all of the sentiment dictionaries except LIWC.

The labMT, ANEW, and WK sentiment dictionaries have scores ranging on a continuum from 1 (low happiness) to 9 (high happiness) with 5 as neutral, whereas the others we test in detail have scores of $\pm 1$, and either explicitly or implicitly 0 (neutral). We will refer to the latter sentiment dictionaries as being binary, even if neutral is included. Other non-binary ranges include a continuous scale from -1 to 1 (SentiWordNet), integers from -5 to 5 (AFINN), continuous from 1 to 3 (GI), and continuous from -5 to 5 (NRC). For coverage tests, we include all available words, to gain a full sense of the breadth of each sentiment dictionary. In scoring, we do not include neutral words from any sentiment dictionary.

We test the labMT, ANEW, and WK dictionaries for a range of stop words (starting with the removal of words scoring within $\Delta h = 1$ of the neutral score of 5) (Dodds et al., 2011). The ability to remove stop words, a common practice for text pre-processing, is one advantage of dictionaries that have a range of scores, allowing us to tune the instrument for maximum performance while retaining all of the benefits of a dictionary method. We will show that, in agreement with the original paper introducing labMT and looking at Twitter data, $\Delta h = 1$ is a pragmatic choice in general (Dodds et al., 2011).
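A minimal sketch of this stop-word removal for a continuous-scale dictionary is given below. The function name `remove_stop_words` and the choice to keep words lying exactly on the boundary are assumptions; the scores in the usage note are labMT values quoted elsewhere in this chapter.

```python
def remove_stop_words(scores, delta=1.0, neutral=5.0):
    """Drop words whose score lies strictly within `delta` of `neutral`.

    Assumption: words exactly `delta` away from neutral are kept.
    """
    return {w: h for w, h in scores.items() if abs(h - neutral) >= delta}
```

Applied to `{"laughter": 8.50, "evil": 1.90, "hospital": 3.50, "book": 7.24, "federal": 4.94, "patient": 5.04, "user": 5.48}` with the default window, the near-neutral words "federal", "patient", and "user" are removed and the four strongly scored words remain.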

Since we do not apply a part of speech tagger, when using the MPQA dictionary we are obliged to exclude words with scores of both +1 and -1. The words and stems with both scores are: blood, boast* (we denote stems with an asterisk), conscience, deep, destiny, keen, large, and precious. We choose to match a text’s words using the fixed word set from each sentiment dictionary before stems, hence words with overlapping matches (a fixed word that also matches a stem) are first matched by the fixed word.
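The matching rules in this paragraph can be sketched as below. The function names, the convention that stems end in an asterisk, and the preference for the longest matching stem are assumptions made for illustration; the entry scoring "laughter" as a fixed word is hypothetical and is included only to show a fixed word taking precedence over a stem.

```python
def build_lookup(entries):
    """Split (term, score) pairs into fixed words and stems.

    Terms carrying both +1 and -1 are excluded, since no part-of-speech
    tagger is available to disambiguate them.
    """
    scores_by_term = {}
    for term, score in entries:
        scores_by_term.setdefault(term, set()).add(score)
    fixed, stems = {}, {}
    for term, scores in scores_by_term.items():
        if 1 in scores and -1 in scores:
            continue
        score = next(iter(scores))  # assumption: otherwise a single score per term
        if term.endswith("*"):
            stems[term[:-1]] = score
        else:
            fixed[term] = score
    return fixed, stems


def match_score(word, fixed, stems):
    """Score a word, trying fixed words before stems; None if unmatched."""
    if word in fixed:
        return fixed[word]
    # Assumption: when several stems match, prefer the longest one.
    for stem in sorted(stems, key=len, reverse=True):
        if word.startswith(stem):
            return stems[stem]
    return None
```

With entries such as `("laugh*", -1)`, `("laughter", 1)`, `("boast*", 1)`, and `("boast*", -1)`, the word "laughing" is scored through the stem, "laughter" through the fixed word, and "boasting" is left unscored because its stem is ambiguous.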

2.2.2 Corpora Tested

For each sentiment dictionary, we test both the coverage and the ability to detect previously observed and/or known patterns within each of the following corpora, noting the pattern we hope to discern:

  1. The New York Times (NYT) (Sandhaus, 2008): Goal of understanding differences between sections and ranking by sentiment (Subsec. New York Times Word Shift Analysis).

  2. Movie reviews (Pang and Lee, 2004): Goal of discerning how emotional language differs in positive and negative reviews and how these differences influence classification accuracy (Subsec. Movie Reviews Classification and Word Shift Analysis).

  3. Google Books (Lin et al., 2012): Goal of understanding time series (Subsec. Google Books Time Series and Word Shift Analysis).

  4. Twitter: Goal of understanding time series (Subsec. Twitter Time Series Analysis).

For the corpora other than the movie reviews and small numbers of tagged Tweets, there is no publicly available ground truth sentiment, so we instead make comparisons between methods and examine how words contribute to scores. We note that measuring how patterns of sentiment compare with societal measures of well being would also be possible (Mitchell et al., 2013). We offer greater detail on corpus processing below, and we also provide the relevant scripts on GitHub.

2.2.3 Word Shift Graphs

Sentiment analysis is often applied to classify text as positive or negative. Indeed, if this were the only use case, the value added by sentiment analysis would be severely limited. Instead, we use sentiment analysis as a lens that allows us to see how the emotive words in a text shape the overall content. This is accomplished by first analyzing each word to find its individual contribution to the difference in sentiment scores between two texts. Most importantly, the final step is to examine the words themselves, ranked by their individual contribution. Of the four corpora that we analyze, three rely on this type of qualitative analysis: using the sentiment dictionary as a tool to better understand the sentiment of the corpora rather than as a binary classifier.

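To make this possible, we must first find the contribution of each word individually. Starting with the ANEW sentiment dictionary $D$ and two texts which we label reference ($T_{\mathrm{ref}}$) and comparison ($T_{\mathrm{comp}}$), we take the difference of their sentiment scores $h^{D}_{\mathrm{avg}}(T_{\mathrm{comp}})$ and $h^{D}_{\mathrm{avg}}(T_{\mathrm{ref}})$, rearrange a few things, and arrive at

$$h^{D}_{\mathrm{avg}}(T_{\mathrm{comp}}) - h^{D}_{\mathrm{avg}}(T_{\mathrm{ref}}) = \sum_{w \in D} \left[ h_D(w) - h^{D}_{\mathrm{avg}}(T_{\mathrm{ref}}) \right] \left[ p_{T_{\mathrm{comp}}}(w) - p_{T_{\mathrm{ref}}}(w) \right],$$

where $h_D(w)$ is the sentiment score of word $w$ and $p_T(w)$ is its normalized frequency in text $T$.

Each word in the summation contributes to the sentiment difference between the texts according to (1) its sentiment relative to the reference text ($+/-$ = more/less emotive), and (2) its change in frequency of usage ($\uparrow/\downarrow$ = more/less frequent). As a first step, it is possible to visualize this sorted word list in a table, along with the basic indicators of how its contribution is constituted. We use word shift graphs to present this information in the most accessible manner to advanced users. For further detail, we refer the reader to our instructional post and video.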
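The per-word decomposition described above can be sketched as follows: each word's contribution is the product of its score relative to the reference average and its change in normalized frequency, and the contributions sum to the difference in average scores. The function names and the toy scores in the usage note are hypothetical, and ranking by absolute contribution is an assumption about presentation.

```python
from collections import Counter


def normalized_frequencies(words, scores):
    """Normalized frequency of each dictionary word in a tokenized text."""
    counts = Counter(w for w in words if w in scores)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no words in the text match the dictionary")
    return {w: f / total for w, f in counts.items()}


def word_shift(reference, comparison, scores):
    """Per-word contributions to the comparison-minus-reference score."""
    p_ref = normalized_frequencies(reference, scores)
    p_comp = normalized_frequencies(comparison, scores)
    h_ref = sum(scores[w] * p for w, p in p_ref.items())
    shift = {}
    for w in set(p_ref) | set(p_comp):
        # (score relative to reference) x (change in normalized frequency)
        shift[w] = (scores[w] - h_ref) * (p_comp.get(w, 0.0) - p_ref.get(w, 0.0))
    # Rank words by the magnitude of their contribution.
    return sorted(shift.items(), key=lambda item: abs(item[1]), reverse=True)
```

With hypothetical scores `{"love": 8.0, "hate": 2.0, "rain": 5.0}`, a reference text `["love", "hate", "rain", "rain"]`, and a comparison text `["love", "love", "rain", "hate"]`, the averages are 5.0 and 5.75, and the whole 0.75 difference is attributed to the increased use of the relatively positive word "love".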

2.3 Results

In Fig 2.1, we show a direct comparison between word scores for each pair of the 6 dictionaries tested. Overall, we find strong agreement between all dictionaries with the exceptions we note below. As a guide, we will provide more detail on the individual comparison between the labMT dictionary and the other five dictionaries by examining the words whose scores disagree across dictionaries shown in Fig 2.2. We refer the reader to the S2 Appendix for the remaining individual comparisons.

Figure 2.1: Direct comparison of the words in each of the dictionaries tested. For the comparison of two dictionaries, we plot words that are matched by the independent variable “x” in the dependent variable “y”. Because of this, and cross stem matching, the plots are not symmetric across the diagonal of the entire figure. Where the scores are continuous in both dictionaries, we compute the RMA linear fit. When a sentiment dictionary contains both fixed and stem words, we plot the matches by fixed words in blue and by stem words in green. The axes in the bar plots are not of the same height, due to large mismatches in the number of words in the dictionaries, and we note the maximum height of the bar in the upper left of such plots. Detailed analysis of Panel C can be found in (Dodds et al., 2015b). We provide a table for each off-diagonal panel in the S2 Appendix with the words whose scores exhibit the greatest mismatch, and a subset of these tables in Fig 2.2.

To start with, consider the comparison of the labMT and ANEW dictionaries on a word-for-word basis. Because these dictionaries share the same range of values, a scatterplot is the natural way to visualize the comparison. Across the top row of Fig 2.1, which compares labMT to the other 5 dictionaries, we see in Panel B for the labMT-ANEW comparison that the RMA best fit (Rayner, 1985) is

for words in both labMT and ANEW. The 10 words farthest from the line of best fit, shown in Panel A of Fig 2.2, are (with labMT, ANEW scores in parentheses): lust (4.64, 7.12), bees (5.60, 3.20), silly (5.30, 7.41), engaged (6.16, 8.00), book (7.24, 5.72), hospital (3.50, 5.04), evil (1.90, 3.23), gloom (3.56, 1.88), anxious (3.42, 4.81), and flower (7.88, 6.64). We observe that these words have high standard deviations in labMT. While the overall agreement is very good, we should expect some variation in the emotional associations of words, due to chance, time of survey, and demographic variability. Indeed, the Mechanical Turk users who scored the words for the labMT set in 2011 are evidently different from the University of Florida students who took the ANEW survey in 2000.

Comparing labMT with WK in Panel C of Fig 2.1, we again find a fit with slope near 1, and with a smaller positive shift: . The 10 words farthest from the best fit line, shown in Panel B of Fig 2.2, are (labMT, WK): sue (4.30, 2.18), boogie (5.86, 3.80), exclusive (6.48, 4.50), wake (4.72, 6.57), federal (4.94, 3.06), stroke (2.58, 4.19), gay (4.44, 6.11), patient (5.04, 6.71), user (5.48, 3.67), and blow (4.48, 6.10). Like labMT, the WK dictionary used a Mechanical Turk online survey to gather word ratings. We speculate that the variation is due to differences in the number of scores required for each word in the surveys, with 14–18 in WK and 50 in labMT. For an in-depth comparison of these sentiment dictionaries, see (Dodds et al., 2015b).

To compare the word scores in a binary sentiment dictionary (those with ±1 or ±1, 0) to the word scores in a sentiment dictionary with a 1–9 range, we examine the distribution of the continuous scores for each binary score. Looking at the labMT-MPQA comparison in Panel D of Fig 2.1, we see that more of the matches are between words without stems (blue) than those with stems (orange), and that each score in {-1, 0, +1} from MPQA corresponds to a wider range of scores in labMT. We examine the shared individual words from labMT with high sentiment scores and MPQA with score -1, both the happiest and the least happy in labMT with MPQA score 0, and the least happy when MPQA is +1 (Fig 2.2 Panels C–E). The 10 happiest words in labMT matched by MPQA words with score -1 are: moonlight (7.50), cutest (7.62), finest (7.66), funniest (7.76), comedy (7.98), laughs (8.18), laughing (8.20), laugh (8.22), laughed (8.26), laughter (8.50). This is an immediately troubling list of evidently positive words rated as -1 in MPQA. We observe the top 5 are matched by the stem “laugh*” in MPQA. The least happy 5 words and happiest 5 words in labMT matched by words in MPQA with score 0 are: sorrows (2.69), screaming (2.96), couldn’t (3.32), pressures (3.49), couldnt (3.58), and baby (7.28), precious (7.34), strength (7.40), surprise (7.42), and song (7.58). We see that these MPQA word scores are departures from the other dictionaries, warranting concern about their scores. The least happy words in labMT with score +1 in MPQA that are matched by MPQA are: vulnerable (3.34), court (3.78), sanctions (3.86), defendant (3.90), conviction (4.10), backwards (4.22), courts (4.24), defendants (4.26), court’s (4.44), and correction (4.44). These words have sentiments that appear to vary with context.

While it would be simple to adjust these ratings in the MPQA dictionary going forward, we are naturally led to be concerned about existing work using MPQA that does not examine words contributing to overall sentiment. We note again that the use of word shift graphs of some kind would have exposed these problematic scores immediately.

Figure 2.2: We present the specific words from Panels G, M, S and Y of Fig 2.1 with the greatest mismatch. Only the center histogram from Panel Y of Fig 2.1 is included. We emphasize that the labMT dictionary scores generally agree well with the other dictionaries, and we are looking at the marginal words with the strongest disagreement. Within these words, we detect differences in the creation of these dictionaries that carry through to these edge cases. Panel A: The words with the most different scores between the labMT and ANEW dictionaries are suggestive of the different meanings that such words entail for the different demographics surveyed to score the words. Panel B: Both dictionaries use surveys from the same demographic (Mechanical Turk), where the labMT dictionary required more individual ratings for each word (at least 50, compared to 14) and appears to have dampened the effect of multiple meaning words. Panels C–E: The words in labMT matched by MPQA with scores of -1, 0, and +1 in MPQA show that there are at least a few words with negative rating in MPQA that are not negative (including the happiest word in the labMT dictionary: “laughter”), that not all of the MPQA words with score 0 are neutral, and that MPQA’s positive words are mostly positive according to the labMT score. Panel F: The function words in the expert-curated LIWC dictionary are not emotionally neutral.

For the labMT-LIWC comparison in Panel E of Fig 2.1 we examine the same matched word lists as before. The 10 happiest words in labMT matched by words in LIWC with score -1 are: trick (5.22), shakin (5.29), number (5.30), geek (5.34), tricks (5.38), defence (5.39), dwell (5.47), doubtless (5.92), numbers (6.04), shakespeare (6.88). From Panel F of Fig 2.2, the least happy 5 neutral words and happiest 5 neutral words in LIWC, matched in labMT from LIWC words (i.e., using the word stems in LIWC to match across labMT; directionality matters), are: negative (2.42), lack (3.16), couldn’t (3.32), cannot (3.32), never (3.34), millions (7.26), couple (7.30), million (7.38), billion (7.56), millionaire (7.62). The least happy words in labMT with score +1 in LIWC that are matched by LIWC are: merrill (4.90), richardson (5.02), dynamite (5.04), careful (5.10), richard (5.26), silly (5.30), gloria (5.36), securities (5.38), boldface (5.40), treasury’s (5.42). The +1 and -1 words in LIWC match some neutral words in labMT, which is not alarming. However, the problems with the “neutral” words in the LIWC set are immediate: these are not emotionally neutral words. The range of scores in labMT for these 0-score words in LIWC formed the basis for Garcia et al.’s response to (Dodds et al., 2015a), and we point out here that the authors must not have looked at the words, an all-too-common problem in studies using sentiment analysis (Garcia et al., 2015; Dodds et al., 2015b).

For the labMT-OL comparison in Panel F of Fig 2.1, we again examine the same matched word lists as before (except the neutral word list, because OL has no explicit neutral words). The 10 happiest words in labMT matched by OL’s negative list are: myth (5.90), puppet (5.90), skinny (5.92), jam (6.02), challenging (6.10), fiction (6.16), lemon (6.16), tenderness (7.06), joke (7.62), funny (7.92). The least happy words in labMT with score +1 in OL that are matched by OL are: defeated (2.74), defeat (3.20), envy (3.33), obsession (3.74), tough (3.96), dominated (4.04), unreal (4.57), striking (4.70), sharp (4.84), sensitive (4.86). Despite nearly twice as many negative words in OL as positive words (at odds with the frequency-dependent positivity bias of language (Dodds et al., 2015a)), after examining the words which are the most differently scored and seeing how quickly the labMT scores move into the neutral range, we can conclude that these dictionaries generally agree with the exception of only a few bad matches.

Direct comparisons between the word scores in sentiment dictionaries, while evidently tedious, have brought to light many problematic word scores. In addition, this analysis serves as a template for further comparisons of the words across new sentiment dictionaries. The six sentiment dictionaries under careful examination in the present study are further analyzed in the Supporting Information. Next, we examine how each sentiment dictionary can aid in understanding the sentiments contained in articles from the New York Times.

2.3.1 New York Times Word Shift Analysis

The New York Times corpus (Sandhaus, 2008) is split into 24 sections of the newspaper that are roughly contiguous throughout the data from 1987–2008. With each sentiment dictionary, we rate each section and then compute word shift graphs (described below) against the baseline, and produce a happiness ranked list of the sections.

To gain understanding of the sentiment expressed by any given text relative to another text, it is necessary to inspect the words which contribute most significantly by their emotional strength and the change in frequency of usage. We do this through the use of word shift graphs, which plot the contribution of each word from the sentiment dictionary (denoted ) to the shift in average happiness between two texts, sorted by the absolute value of the contribution. We use word shift graphs to both analyze a single text and to compare two texts, here focusing on comparing text within corpora. For a derivation of the algorithm used to make word shift graphs while separating the frequency and sentiment information, we refer the reader to Equations 2 and 3 in (Dodds et al., 2011). We consider both the sentiment difference and frequency difference components of by writing each term of Eq. B.1 as in (Dodds et al., 2011):
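A minimal Python sketch of this decomposition follows. It assumes that each word's contribution is the product of its sentiment relative to the reference text and its change in normalized frequency, scaled so that the contributions sum to ±100 percent, which is the form used in (Dodds et al., 2011). The function names and the toy word scores are hypothetical and are not taken from any of the six dictionaries.

```python
from collections import Counter

def average_happiness(counts, scores):
    """Frequency-weighted mean score over the words that the dictionary covers."""
    matched = {w: n for w, n in counts.items() if w in scores}
    total = sum(matched.values())
    return sum(scores[w] * n for w, n in matched.items()) / total

def word_shift(ref_counts, comp_counts, scores):
    """Percentage contribution of each word to the shift, largest magnitude first."""
    h_ref = average_happiness(ref_counts, scores)
    h_comp = average_happiness(comp_counts, scores)
    n_ref = sum(n for w, n in ref_counts.items() if w in scores)
    n_comp = sum(n for w, n in comp_counts.items() if w in scores)
    shifts = []
    for w, h in scores.items():
        sentiment = h - h_ref  # sign: more (+) or less (-) emotive than the reference
        frequency = comp_counts.get(w, 0) / n_comp - ref_counts.get(w, 0) / n_ref  # up or down
        shifts.append((w, 100.0 * sentiment * frequency / abs(h_comp - h_ref)))
    return sorted(shifts, key=lambda pair: abs(pair[1]), reverse=True)

# Toy example (hypothetical scores on a 1-9 scale, not from any real dictionary).
toy_scores = {"happy": 8.0, "war": 2.0, "the": 5.0}
reference = Counter({"happy": 2, "war": 2, "the": 6})
comparison = Counter({"happy": 4, "war": 1, "the": 5})
for word, contribution in word_shift(reference, comparison, toy_scores):
    print(f"{word:>6} {contribution:+.1f}")
```

The sketch divides by the absolute difference in average happiness, so it is undefined when the two texts score identically. In the toy example the more frequent positive word and the less frequent negative word both push the comparison text above the reference.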


An in-depth explanation of how to interpret the word shift graph can also be found at

To both demonstrate the necessity of using word shift graphs in carrying out sentiment analysis, and to gain understanding about the ranking of New York Times sections by each sentiment dictionary, we look at word shift graphs for the “Society” section of the newspaper from each sentiment dictionary in Fig 2.3, with the reference text being the whole of the New York Times. The “Society” section ranks 1, 1, 1, 18, 1, and 11 in happiness among the 24 sections for the dictionaries labMT, ANEW, WK, MPQA, LIWC, and OL, respectively. These graphs show only the very top of the distributions, which range in length from 1030 words (ANEW) to 13915 words (WK).

Figure 2.3: New York Times (NYT) “Society” section shifted against the entire NYT corpus for each of the six dictionaries listed in tiles A–F. We provide a detailed analysis in Sec. 2.3.1. Generally, we are able to glean the greatest understanding of the sentiment texture associated with this NYT section using the labMT dictionary. Additionally, we note that the labMT dictionary has the most coverage as quantified by word match count (Figure in S3 Appendix), that we are able to identify and correct problematic word scores in the OL dictionary, and that the MPQA dictionary disagrees entirely with the others because of an overly broad stem match.

First, using the labMT dictionary, we see that the words “graduated”, “father”, and “university” top the list, which is dominated by positive words that occur more frequently (+↑). These more frequent positive words paint a clear picture of family life (relationships, weddings, and divorces), as well as university accomplishment (graduations and college). In general, we are able to observe with only these words that the “Society” section is where we find the details of these events.

From the ANEW dictionary, we see that a few positive words have increased frequency, led by “mother”, “father”, and “bride”. Looking at this shift in isolation, we see only these words with three more (“graduate”, “wedding”, and “couple”) that would lead us to suspect these topics are present in the “Society” section.

The WK dictionary, with the most individual word scores of any sentiment dictionary tested, agrees with labMT and ANEW that the “Society” section is the happiest section, with a somewhat similar set of words at the top: “new”, “university”, and “father”. Low coverage of the New York Times corpus (see Fig A.3) resulted in less specific words describing the “Society” section, with more words that go down in frequency in the shift. With the words “bride” and “wedding” up, as well as “university”, “graduate”, and “college”, it is evident that the “Society” section covers both graduations and weddings, in consensus with the other sentiment dictionaries.

The MPQA dictionary ranks the “Society” section 18th of the 24 NYT sections, a departure from the other rankings, with the words “mar*”, “retire*”, and “yes*” the top three contributing words (where “*” denotes a wildcard “stem” match). Negative words increasing in frequency (−↑) are the most common type near the top, and of these, the words with the biggest contributions are being scored incorrectly in this context (specifically the words “mar*”, “retire*”, “bar*”, “division”, and “miss*”). Looking more in depth at the problems created by the first out-of-context word match, we find 1211 unique words match “mar*”. The five most frequent, with counts in parentheses, are married (36750), marriage (5977), marketing (5382), mary (4403), and mark (2624). The score for these words in MPQA is -1, in stark contrast to the scores in other sentiment dictionaries (e.g., the labMT scores are 6.76, 6.7, 5.2, 5.88, and 5.48). These problems plague the MPQA dictionary for scoring the New York Times corpus, and without using word shift graphs would have gone completely unseen. In an attempt to fix contextual issues by fixing corpus-specific words, we remove “mar*,retire*,vice,bar*,miss*” and find that the MPQA dictionary ranks the Society section of the NYT at 15th of the 24 sections.
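To make the stem problem concrete, the following Python sketch shows how a wildcard entry such as “mar*” captures unrelated words. The matching rule used here (exact entries first, then the longest matching stem) is an assumption for illustration and may differ from the matching used for MPQA in this study; the function name and the two-entry toy dictionary are hypothetical.

```python
def match_score(word, fixed, stems):
    """Return the score for `word`, or None if the dictionary does not cover it.

    Assumption: exact (fixed) entries take precedence, then the longest matching stem.
    """
    if word in fixed:
        return fixed[word]
    candidates = [stem for stem in stems if word.startswith(stem)]
    if not candidates:
        return None
    return stems[max(candidates, key=len)]

# Stems written as "mar*" in the text are stored here without the trailing "*".
toy_stems = {"mar": -1, "laugh": -1}
for word in ["married", "marriage", "marketing", "mary", "mark", "laughter", "table"]:
    print(word, match_score(word, {}, toy_stems))
```

Every word beginning with “mar” inherits the score -1, regardless of meaning, which is the behavior the word shift graph exposes.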

Figure 2.4: Coverage of the words in the movie reviews by each of the dictionaries. We observe that the labMT dictionary has the highest coverage of words in the movie reviews corpus both across word rank and cumulatively. The LIWC dictionary has initially high coverage since it contains some high-frequency function words, but quickly drops off across rank. The WK dictionary coverage increases across word rank and cumulatively, indicating that it contains a large number of less common words in the movie review corpus. The OL, ANEW, and MPQA dictionaries have a cumulative coverage of less than 20% of the lexicon.

The second binary sentiment dictionary, LIWC, agrees well with the first three dictionaries and ranks the “Society” section at the top with the words “rich*”, “miss”, and “engage*” at the top of the list. We immediately notice that the word “miss” is being used frequently in the “Society” section in a different sense than was coded for in the LIWC dictionary: it is used in the corpus to mean “the title prefixed to the name of an unmarried woman”, but is scored as negative in LIWC (with the likely intended meaning “to fail to reach a target or to acknowledge loss”). We would remove this word from LIWC for further analysis of this corpus (we would also remove the word “trust” here). The words matched by “miss*” aside, LIWC finds some positive words going up (+↑), with “engage*” hinting at weddings. Without words that capture the specific behavior happening in the “Society” section, we are unable to see anything about college, graduations, or marriages, and there is much less to be gained about the text from the words in LIWC than some of the other dictionaries we have seen. Nevertheless, LIWC finds consensus with the “Society” section ranked the top section, due in large part to a lack of negative words “war” (rank 18) and “fight*” (rank 22).

The OL sentiment dictionary departs from the consensus and ranks the “Society” section at 11th out of the 24 sections. The top three words, “vice”, “miss”, and “concern”, contribute heavily relative to the rest of the distribution, and two of them are clearly being used in the wrong sense. For a more reasonable analysis we remove both “vice” and “miss” from the OL dictionary to score this text, and in doing so the happiness goes from 0.168 to 0.297, making the “Society” section the second happiest of the 24 sections. Focusing on the words, we see that the OL dictionary finds many positive words increasing in frequency (+↑) that are mostly generic. In the word shift graph we do not find the wedding or university events as in sentiment dictionaries with more coverage, but rather a variety of positive language surrounding these events, for example “works” (4), “benefit” (5), “honor” (6), “best” (7), “great” (9), “trust” (10), “love” (11), etc. While this does not provide insight into the topics, the OL sentiment dictionary with fixes from the word shift graph analysis does provide details on the emotive words that make the “Society” section one of the happiest sections.

In conclusion, we find that 4 of the 6 dictionaries score the “Society” section at number 1, and in these cases we use the word shift graph to uncover the nuances of the language used. We find, unsurprisingly, that the most matches are found by the labMT dictionary, which is in part built from the NYT corpus (see S3 Appendix for coverage plots). Without as much corpus-specific coverage, we note that while the nuances of the text remain hidden, the LIWC and OL dictionaries still highlight the positive language in this section. Of the two that did not score the “Society” section at the top, we are able to assess and repair the MPQA and the OL dictionaries by removing the words “mar*,retire*,vice*,bar*,miss*” and “vice,miss”, respectively. By identifying words used in the wrong sense/context using the word shift graph, we directly improve the sentiment score for the New York Times corpus from both MPQA and OL dictionaries closer to consensus. While the OL dictionary, with two corrections, agrees with the other dictionaries, the MPQA dictionary with five corrections still ranks the Society section of the NYT as the 15th happiest of the 24 sections.

In the first Figure in S4 Appendix we show scatterplots for each comparison, and compute the Reduced Major Axes (RMA) regression fit (Rayner, 1985). In the second Figure in S4 Appendix we show the sorted bar chart from each sentiment dictionary.
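For reference, the Reduced Major Axis fit can be computed in a few lines. The sketch below uses the standard RMA definitions (the slope is the ratio of the standard deviations, signed by the correlation, and the line passes through the means); the function name and the toy data are hypothetical, and the numbers are not from any of the dictionaries.

```python
from statistics import mean, stdev

def rma_fit(x, y):
    """Reduced Major Axis fit y = slope * x + intercept.

    Unlike ordinary least squares, RMA treats x and y symmetrically,
    which suits a comparison where neither dictionary is the ground truth.
    """
    mx, my = mean(x), mean(y)
    covariance_sum = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sign = 1.0 if covariance_sum >= 0 else -1.0
    slope = sign * stdev(y) / stdev(x)
    intercept = my - slope * mx
    return slope, intercept

# Toy scores for the same words in two hypothetical 1-9 dictionaries.
scores_a = [2.0, 4.0, 5.0, 6.5, 8.0]
scores_b = [2.4, 3.9, 5.3, 6.4, 8.2]
slope, intercept = rma_fit(scores_a, scores_b)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

Words can then be ranked by their distance from this line to find the largest disagreements between two dictionaries.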

2.3.2 Movie Reviews Classification and Word Shift Graph Analysis

For the movie reviews, we first provide insight into the language differences and secondly perform binary classification of positive and negative reviews. The entire dataset consists of 1000 positive and 1000 negative reviews, as rated with 4 or 5 stars and 1 or 2 stars, respectively. We show how well each sentiment dictionary covers the review database in Fig 2.4. The average review length is 650 words, and we plot the distribution of review lengths in S5 Appendix. We average the sentiment of words in each review individually, using each sentiment dictionary. We also combine random samples of N positive or N negative reviews for N varying from 2 to 900 on a logarithmic scale, without replacement, and rate the combined text. With an increase in the size of the text, we expect that the dictionaries will be better able to distinguish positive from negative. The simple statistic we use to describe this ability is the percentage of distributions that overlap the average.
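The sampling procedure can be sketched as follows. This is an illustration under stated assumptions rather than the exact procedure used here: each review is represented as a list of the scores of its matched words, and the overlap statistic is taken to be the fraction of sampled texts that fall on the wrong side of the midpoint between the two group means, which is only one plausible reading of the statistic. All names and the synthetic data are hypothetical.

```python
import random
from statistics import mean

def sample_scores(reviews, n, samples, rng):
    """Average word score of `samples` texts, each built from n reviews
    drawn without replacement and concatenated."""
    scores = []
    for _ in range(samples):
        words = [s for review in rng.sample(reviews, n) for s in review]
        scores.append(mean(words))
    return scores

def overlap_fraction(pos_scores, neg_scores):
    """Fraction of sampled texts on the wrong side of the midpoint
    between the positive and negative means (assumed definition)."""
    midpoint = (mean(pos_scores) + mean(neg_scores)) / 2
    wrong = sum(s <= midpoint for s in pos_scores) + sum(s >= midpoint for s in neg_scores)
    return wrong / (len(pos_scores) + len(neg_scores))

# Synthetic reviews: word scores drawn around slightly different means.
rng = random.Random(42)
positive = [[rng.gauss(5.6, 1.5) for _ in range(50)] for _ in range(200)]
negative = [[rng.gauss(5.4, 1.5) for _ in range(50)] for _ in range(200)]
for n in (1, 15):
    frac = overlap_fraction(sample_scores(positive, n, 100, rng),
                            sample_scores(negative, n, 100, rng))
    print(n, round(frac, 2))
```

Under this reading, an overlap of 0 means the two sets of combined texts are perfectly separated, and a value near 0.5 means the dictionary does no better than chance.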

To analyze which words are being used by each sentiment dictionary, we compute word shift graphs of the entire positive corpus versus the entire negative corpus in Fig 2.5. Across the board, we see that a decrease in negative words is the most important word type for each sentiment dictionary, with the word “bad” being the top word for every sentiment dictionary in which it is scored (ANEW does not have it). Other observations that we can make from the word shift graphs include a few words that are potentially being used out of context: “movie”, “comedy”, “plot”, “horror”, “war”, “just”.

Figure 2.5: Word shift graphs for the movie review corpus. By analyzing the words that contribute most significantly to the sentiment score produced by each sentiment dictionary we are able to identify words that are problematic for each sentiment dictionary at the word-level, and generate an understanding of the emotional texture of the movie review corpus. Again we find that coverage of the lexicon is essential to produce meaningful word shift graphs, with the labMT dictionary providing the most coverage of this corpus and producing the most detailed word shift graphs. All words on the left hand side of these word shift graphs are words that individually made the positive reviews score more negatively than the negative reviews, and the removal of these words would improve the accuracy of the ratings given by each sentiment dictionary. In particular, across each sentiment dictionary the word shift graphs show that domain-specific words such as “war” and “movie” are used more frequently in the positive reviews and are not useful in determining the polarity of a single review.

In the lower right panel of Fig 2.6, the percentage overlap of positive and negative review distributions presents us with a simple summary of sentiment dictionary performance on this tagged corpus. The ANEW dictionary stands out as being considerably less capable of distinguishing positive from negative. In order, we then see WK is slightly better overall, labMT and LIWC perform similarly better than WK overall, and then MPQA and OL are each a degree better again, across the review lengths (see below for hard numbers at 1 review length). Two Figures in S5 Appendix show the distributions for 1 review and for 15 combined reviews.

Figure 2.6: The score assigned to increasing numbers of reviews drawn from the tagged positive and negative sets. For each sentiment dictionary we show mean sentiment and 1 standard deviation over 100 samples for each distribution of reviews in Panels A–F. For comparison we compute the fraction of the distributions that overlap in Panel G. At the single review level for each sentiment dictionary this simple performance statistic (fraction of distribution overlap) ranks the OL dictionary in first place, the MPQA, LIWC, and labMT dictionaries in a second place tie, WK in fifth, and ANEW far behind. All dictionaries require on the order of 1000 words to achieve 95% classification accuracy.

Classifying single reviews as positive or negative, the F1-scores are: labMT .63, ANEW .36, LIWC .53, MPQA .66, OL .71, and WK .34 (see Table A.4). We roughly confirm the rule-of-thumb that 10,000 words are enough to score with a sentiment dictionary confidently, with all dictionaries except MPQA and ANEW achieving 90% accuracy with this many words. We sample the number of reviews evenly in log space, generating sets of reviews with average word counts of 4550, 6500, 9750, 16250, and 26000 words. Specifically, the number of reviews necessary to achieve 90% accuracy is 15 reviews (9750 words) for labMT, 100 reviews (65000 words) for ANEW, 10 reviews (6500 words) for LIWC, 10 reviews (6500 words) for MPQA, 7 reviews (4550 words) for OL, and 25 reviews (16250 words) for WK.

While we are analyzing the movie review classification, which has ground truth labels, we will take a moment to further support our claims for the inaccuracy of these methods at the sentence level. The OL dictionary, with the highest performance classifying individual movie reviews of the 6 dictionaries tested in detail, performs worse than guessing at classifying individual sentences in movie reviews. Specifically, 76.9/74.2% of sentences in the positive/negative review sets have words in the OL dictionary, and of these OL achieves an F1-score of 0.44. The results for each sentiment dictionary are included in Table A.5, with an average (median) F1 score of 0.42 (0.45) across all dictionaries. While these results do cast doubt on the ability to classify positive and negative reviews from single sentences using dictionary based methods, we note that we need not expect the sentiment of individual sentences to be strongly correlated with the overall review polarity.

2.3.3 Google Books Time Series and Word Shift Analysis

We use the Google Books 2012 dataset with all English books (Lin et al., 2012), from which we remove part-of-speech tagging and split into years. From this, we make time series by year, and word shift graphs of decades versus the baseline. In addition, to assess the similarity of each time series, we produce correlations between each of the time series.

Despite claims from research based on the Google Books corpus (Michel et al., 2011), we keep in mind that there are several deep problems with this beguiling data set (Pechenick et al., 2015). Leaving aside these issues, the Google Books corpus nevertheless provides a substantive test of our six dictionaries.

In Fig 2.7, we plot the sentiment time series for Google Books. Three immediate trends stand out: a dip near the Great Depression, a dip near World War II, and a general upswing in the 1990’s and 2000’s. From these general trends, a few dictionaries waver: OL does not dip as much for WW2, OL and LIWC stay lower in the 90’s and 2000’s, and labMT with go downward near the end of the 2000’s. We take a closer look into the 1940’s to see what each sentiment dictionary is picking up in Google Books around World War 2 in Figure in S6 Appendix.

Figure 2.7: Google Books sentiment time series from each sentiment dictionary, with stop values of 0.5, 1.0, and 1.5 from the dictionaries with word scores in the 1–9 range. To normalize the sentiment score, we subtract the mean and divide by the absolute range. We observe that each time series has increased variance, with a few pronounced negative time periods, and trending positive towards the end of the corpus. The score of labMT varies substantially with different stop values, although remaining highly correlated, and finds absolute lows near the World Wars. The LIWC and OL dictionaries trend down towards 1990, dipping as low as the war periods.
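The two preprocessing steps named in the caption can be sketched in Python. Two assumptions are made and should be checked against the Methods: a stop value excludes words whose score lies within that distance of the neutral score 5 on the 1–9 scale, and "absolute range" means the maximum minus the minimum of the series. The function names are hypothetical; the example word scores are labMT values quoted in Sec. 2.3.

```python
from statistics import mean

def apply_stop_value(scores, stop, center=5.0):
    """Keep only words at least `stop` away from the neutral score (assumed rule)."""
    return {w: h for w, h in scores.items() if abs(h - center) >= stop}

def normalize(series):
    """Subtract the mean and divide by the range (max minus min) of the series."""
    m = mean(series)
    span = max(series) - min(series)
    return [(v - m) / span for v in series]

# labMT scores quoted earlier in this section.
labmt_subset = {"laughter": 8.50, "evil": 1.90, "hospital": 3.50, "trick": 5.22, "number": 5.30}
print(sorted(apply_stop_value(labmt_subset, 1.0)))
print(normalize([5.9, 6.0, 6.1]))
```

With a stop value of 1.0, the near-neutral words "trick" and "number" are dropped, so only the more emotive words contribute to the average. Normalizing puts time series from dictionaries with different scales on a common axis.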

In each panel of the word shift Figure in S6 Appendix, we see that the top word making the 1940’s less positive than the rest of Google Books is “war”, which is the top contributor for every sentiment dictionary except OL. Rounding out the top three contributing words are “no” and “great”, and we infer that the word “great” is being seen from mention of “The Great Depression” or “The Great War”, and is possibly being used out of context. All dictionaries but ANEW have “great” in the top 3 words, and each sentiment dictionary could be made more accurate if we remove this word.

In Panel A of the 1940’s word shift Figure in S6 Appendix, beyond the top words, increasing words are mostly negative and war-related: “against”, “enemy”, “operation”, which we could expect from this time period.

In Panel B, the ANEW dictionary scores the 1940’s of Google Books lower than the baseline as well, finding “war”, “cancer”, and “cell” to be the most important three words. With only 1030 words, there is not enough coverage to see anything beyond the top word “war,” and the shift is dominated by words that go down in frequency.

In Panel C, the WK dictionary finds the 1940’s slightly less happy than the baseline, with the top three words being “war”, “great”, and “old”. We see many of the same war-related words as in labMT, and in addition some positive words like “good” and “be” are up in frequency. The word “first” could be an artifact of first aid, a claim that could be substantiated with further analysis of the Google Books corpus at the 2-gram level beyond the scope of this manuscript.

In Panel D, the MPQA dictionary rates the 1940’s slightly less happy than the baseline, with the top three words being “war”, “great”, and “differ*”. Beyond the top word “war”, the score is dominated by words decreasing in frequency, with only a few words up in frequency. Without specific words increasing in frequency as contextual guides, it is difficult to obtain a good glance at the nature of the text. For this reason, having a higher coverage of the words in the corpus enables understanding.

In Panel E, the LIWC dictionary rates the 1940’s nearly the same as the baseline, with the top three words being “war”, “great”, and “argu*”. When the scores are nearly the same, although the 1940’s are slightly higher in happiness here, the word shift is a view into how the words of the reference and comparison text vary. In addition to a few war-related words being up and bringing the score down (“fight”, “enemy”, “attack”), we see some positive words up that could also be war related: “certain”, “interest”, and “definite”. Although LIWC does not manage to find World War II as a low point of the 20th century, the words that contribute to LIWC’s score for the 1940’s compared to all years are useful in understanding the corpus.

In Panel F, the OL dictionary rates the 1940’s as happier than the baseline, with the top three words being “great”, “support”, and “like”. With 7 positive words up, and 1 negative word up, we see how the OL dictionary misses the war without the word “war” itself and with only “enemy” contributing from the words surrounding the conflict. The nature of the positive words that are up is unclear, and could justify a more detailed analysis of why the OL dictionary fails here.

2.3.4 Twitter Time Series Analysis

For Twitter data, we use the Gardenhose feed, a random 10% of the full stream. We store data on the Vermont Advanced Computing Core (VACC), and process the text first into hash tables (with approximately 8 million unique English words each day) and then into word vectors for each 15 minutes, for each sentiment dictionary tested. From this, we build sentiment time series for time resolutions of 15 minutes, 1 hour, 3 hours, 12 hours, and 1 day. In addition to the raw time series, we compute correlations between each time series to assess the similarity of the ratings between dictionaries.
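Because the 15-minute word vectors are simple counts, coarser resolutions can be obtained by summing consecutive windows before scoring. The Python sketch below illustrates this idea only; it is not the pipeline run on the VACC, and the function name and toy counts are hypothetical.

```python
from collections import Counter

def coarsen(windows, factor):
    """Sum consecutive groups of `factor` word-count windows.

    With 15-minute windows, factor=4 gives hourly counts and factor=96 gives daily counts.
    """
    return [sum(windows[i:i + factor], Counter())
            for i in range(0, len(windows), factor)]

# Four toy 15-minute windows combined into two 30-minute windows.
quarter_hours = [Counter(haha=1), Counter(haha=2, lol=1), Counter(lol=3), Counter(haha=1)]
print(coarsen(quarter_hours, 2))
```

Scoring the summed counts weights each word by its total frequency in the longer window, which is not the same as averaging the four 15-minute scores when tweet volume varies.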

In Fig 2.8, we present a daily sentiment time series of Twitter processed using each of the dictionaries being tested. With the exception of LIWC and MPQA we observe that the dictionaries generally track well together across the entire range. A strong weekly cycle is present in all, although muted for ANEW.

Figure 2.8: Normalized sentiment time series on Twitter using a stop value of 1.0 for all dictionaries. To normalize the sentiment score, we subtract the mean and divide by the absolute range. The resolution is 1 day, and draws on the 10% gardenhose sample of public Tweets provided by Twitter. All of the dictionaries exhibit wide variation for very early Tweets, and from 2012 onward generally track together strongly with the exception of MPQA and LIWC. The LIWC and MPQA dictionaries show opposite trends: a rise until 2012 with a decline after 2012 from LIWC, and a decline before 2012 with a rise afterwards from MPQA. To analyze the trends we look at the words driving the movement across years using word shift Figures in S7 Appendix.

We plot the Pearson’s correlation between all time series in Fig 2.9, and confirm some of the general observations that we can make from the time series. Namely, the LIWC and MPQA time series disagree the most with the others, and even more so with each other. Generally, we see strong agreement within dictionaries with varying stop values.
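A pairwise correlation table of this kind can be built with the standard Pearson formula. The sketch below is illustrative: the function names, the series names, and the toy values are hypothetical and are not the Twitter data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def correlation_table(series):
    """Correlation for every ordered pair of named time series."""
    return {(a, b): pearson(series[a], series[b]) for a in series for b in series}

# Toy daily series for three hypothetical dictionaries.
toy = {
    "dict_a": [0.10, 0.12, 0.08, 0.15, 0.11],
    "dict_b": [0.20, 0.23, 0.17, 0.28, 0.21],
    "dict_c": [0.30, 0.25, 0.33, 0.22, 0.29],
}
for pair, r in correlation_table(toy).items():
    print(pair, round(r, 2))
```

Because the Pearson coefficient is unchanged by shifting and positive rescaling, the normalization applied to each time series does not affect these correlations.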

Figure 2.9: Pearson’s correlation between daily resolution Twitter sentiment time series for each sentiment dictionary. We see that there is strong agreement within dictionaries, with the biggest differences coming from the stop value of . The labMT and OL dictionaries do not strongly disagree with any of the others, while LIWC is the least correlated overall with other dictionaries. labMT and OL correlate strongly with each other, and disagree most with the ANEW, LIWC, and MPQA dictionaries. The two least correlated dictionaries are the LIWC and MPQA dictionaries. Again, since there is no publicly accessible ground truth for Twitter sentiment, we compare dictionaries against the others, and look at the words. With these criteria, we find the labMT dictionary to be the most useful.

The time series from each sentiment dictionary exhibits increased variance at the start of the time frame, when Twitter volume is low in 2008 and into 2009. As more people join Twitter and the Tweet volume increases through 2010, we see that LIWC rates the text as happier, while the rest start a slow decline in rating that is led by MPQA in the negative direction. In 2010, the LIWC dictionary is more positive than the rest with words like “haha”, “lol” and “hey” being used more frequently and swearing being less frequent than in all years of Twitter put together. The other dictionaries with more coverage find a decrease in positive words to balance this increase, with the exception of MPQA which finds many negative words going up in frequency (see 2010 word shift Figure in Appendix S7). All of the dictionaries agree most strongly in 2012, all finding a lot of negative language and swearing that brings scores down (see 2012 word shift Figure in Appendix S7). From the bottom at 2012, LIWC continues to go downward while the others trend back up. The signal from MPQA jumps to the most positive, and LIWC does start trending back up eventually. We analyze the words in 2014 with a word shift against all 7 years of Tweets for each sentiment dictionary in each panel in the 2014 word shift Figure in Appendix S7: A. labMT scores 2014 as less happy with more negative language. B. ANEW finds it happier with a few positive words up. C. WK finds it happier with more negative words (like labMT). D. MPQA finds it more positive with fewer negative words. E. LIWC finds it less positive with more negative and fewer positive words. F. OL finds it to be of the same sentiment as the background with a balance in positive and negative word usage. From these word shift graphs, we can analyze which words cause MPQA and LIWC to disagree with the other dictionaries: the disagreement of MPQA is again caused by broad stem matches, and the disagreement of LIWC is due to a lack of coverage.

2.3.5 Brief Comparison to Machine Learning Methods

We implement a Naive Bayes (NB) classifier (sometimes harshly called idiot Bayes (Hand and Yu, 2001)) on the tagged movie review dataset, using 10% of the reviews for training and then testing performance on the rest. Following standard practice, we remove the top 30 ranked words (“stop words”) from the 5000 most frequent words, and use the remaining 4970 words in our classifier for maximum performance (we observe a 0.5% improvement). Our implementation is analogous to those found in common Python natural language processing packages (see “NLTK” or “TextBlob” in (Bird, 2006)).
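The following is a minimal sketch of a multinomial Naive Bayes classifier with the vocabulary rule described above (keep the most frequent words, then drop the top-ranked ones). It is not the authors' implementation: the add-one smoothing, the function names, and the toy reviews are assumptions made for illustration.

```python
import math
from collections import Counter

def build_vocabulary(documents, n_top=5000, n_stop=30):
    """Keep the n_top most frequent words, then drop the n_stop most frequent."""
    counts = Counter(word for doc in documents for word in doc)
    ranked = [word for word, _ in counts.most_common(n_top)]
    return set(ranked[n_stop:])

def train_nb(documents, labels, vocabulary):
    """Estimate log priors and add-one smoothed log likelihoods per class."""
    log_prior, log_like = {}, {}
    for c in sorted(set(labels)):
        docs_c = [d for d, y in zip(documents, labels) if y == c]
        log_prior[c] = math.log(len(docs_c) / len(documents))
        counts = Counter(w for d in docs_c for w in d if w in vocabulary)
        total = sum(counts.values()) + len(vocabulary)
        log_like[c] = {w: math.log((counts[w] + 1) / total) for w in vocabulary}
    return log_prior, log_like

def classify(doc, log_prior, log_like):
    """Pick the class with the largest sum of log prior and word log likelihoods."""
    scores = {
        c: log_prior[c] + sum(log_like[c][w] for w in doc if w in log_like[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy, made-up reviews; n_stop=0 because the corpus is far too small for stop words.
train_docs = [["great", "fun", "film"], ["wonderful", "fun", "story"],
              ["dull", "boring", "film"], ["boring", "awful", "story"]]
train_labels = ["pos", "pos", "neg", "neg"]
vocab = build_vocabulary(train_docs, n_top=5000, n_stop=0)
log_prior, log_like = train_nb(train_docs, train_labels, vocab)
predicted = [classify(d, log_prior, log_like)
             for d in (["fun", "wonderful"], ["boring", "dull"])]
```

With a real corpus, the defaults `n_top=5000` and `n_stop=30` correspond to the 4970-word vocabulary described in the text.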

As we should expect, at the level of a single review, NB outperforms the dictionary-based methods with a classification accuracy of 75.7% averaged over 100 trials. As the number of reviews is increased, the overlap from NB diminishes, and using our simple “fraction overlapping” metric, the error drops to 0 with more than 200 reviews. Interestingly, NB starts to do worse with more reviews, and with more than 500 of the 1000 reviews concatenated, it rates both the positive and negative reviews as positive (Figure in S8 Appendix).

The rating curves do not touch, and neither do the standard deviation error bars (indicating that the result is not statistically significant), but they both go very slightly above 0 (again, see Figure in S8 Appendix). Overall, with Naive Bayes we are able to classify a higher percentage of individual reviews correctly, but with more variance.

In the two Tables in S8 Appendix we compute the words which the NB classifier uses to classify all of the positive reviews as positive, and all of the negative reviews as positive. The Natural Language Toolkit (NLTK (Bird, 2006)) implements a method to obtain the “most informative” words, by taking the ratio of the likelihood of words between all available classes, and looking for the largest ratio:


$$\max_{w} \frac{P(w \mid c_i)}{P(w \mid c_j)}$$

for all combinations of classes $c_i, c_j$. This is possible because of the “naive” assumption that feature (word) likelihoods are independent, resulting in a classification metric that is linear for each feature. In S8 Appendix, we provide the derivation of this linearity structure.
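To make the ratio ranking and the linearity concrete, here is a small sketch that ranks words by their likelihood ratio between two classes and scores a document as a sum of per-word log ratios. The likelihood values and function names are hypothetical, and the log-odds score ignores class priors (equivalent to assuming equal priors).

```python
import math

def most_informative(likelihoods, class_a, class_b, top=10):
    """Rank words by P(w | class_a) / P(w | class_b), largest ratio first."""
    ratios = {w: likelihoods[class_a][w] / likelihoods[class_b][w]
              for w in likelihoods[class_a]}
    return sorted(ratios, key=ratios.get, reverse=True)[:top]

def log_odds(doc, likelihoods, class_a, class_b):
    """Sum of per-word log ratios: each word occurrence adds a fixed amount."""
    return sum(math.log(likelihoods[class_a][w] / likelihoods[class_b][w])
               for w in doc if w in likelihoods[class_a])

# Hypothetical per-class word likelihoods (illustrative values only).
likelihoods = {
    "pos": {"wonderful": 0.4, "film": 0.4, "awful": 0.2},
    "neg": {"wonderful": 0.1, "film": 0.4, "awful": 0.5},
}
top_words = most_informative(likelihoods, "pos", "neg", top=2)
score = log_odds(["wonderful", "awful"], likelihoods, "pos", "neg")
```

Because the score is a plain sum over words, the contribution of each word can be read off directly, which is what makes a word-level inspection of the classifier possible.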

We find that the trained NB classifier relies heavily on words that are very specific to the training set, including the names of actors in the movies themselves, making these words useful for classification but not for understanding the nature of the text. We report the top 10 words for both positive and negative classes using both the ratio and difference methods in Table in S8 Appendix. To classify a document using NB, we use the frequency of each word in the document in conjunction with the probability that that word occurred in each labeled class. While steps can be taken to avoid this type of over-fitting, it is an ever-present danger that remains hidden without word shift graphs or similar.

We next take the movie-review-trained NB classifier and use it to classify the New York Times sections, both by ranking them and by looking at the words (the above ratio and difference weighted by the occurrence of the words). We ranked the sections 5 different times, and among those find the “Television” section both by far the happiest, and by far the least happy in independent tests. We show these rankings and report the top 10 words used to score the “Society” section in Table A.3.

We thus see that the NB classifier, a linear learning method, may perform poorly when assessing sentiment outside of the corpus on which it is trained. In general, performance will vary depending on the statistical dissimilarity of the training and novel corpora. Added to this is the inscrutability of black box methods: while susceptible to the aforementioned difficulty, nonlinear learning methods (unlike NB) also render detailed examination of how individual words contribute to a text’s score more difficult.

2.4 Conclusion

We have shown that measuring sentiment in various corpora presents unique challenges, and that sentiment dictionary performance is situation dependent. Across the board, the ANEW dictionary performs poorly, and the continued use of this sentiment dictionary with clearly better alternatives is a questionable choice. We have seen that the MPQA dictionary does not agree with the other five dictionaries on the NYT corpus and Twitter corpus due to a variety of context, word sense, phrase, and stem matching issues, and we would not recommend using this sentiment dictionary. While the OL achieves the highest binary classification accuracy, in comparison to labMT, the WK, LIWC, and OL dictionaries fail to provide much detail in corpora where their coverage is lower, including all four corpora tested, the main goal of our analysis. Sufficient coverage is essential for producing meaningful word shift graphs and thereby enabling deeper understanding.

In each case, to analyze the output of the dictionary method, we rely on the use of word shift graphs. With this tool, we can produce a finer grained analysis of the lexical content, and we can also detect words that are used out of context and can mask them directly. It should be clear that using any of the dictionary-based sentiment detecting methods without looking at how individual words contribute is indefensible, and analyses that do not use word shift graphs or similar tools cannot be trusted. The poor word shift performance of binary dictionaries in particular gravely limits their ability to reveal underlying stories.

In sum, we believe that dictionary-based methods will continue to play a powerful role—they are fast and well suited for web-scale data sets—and that the best instruments will be based on dictionaries with excellent coverage and continuum scores. To this end, we urge that all dictionaries should be regularly updated to capture changing lexicons, word usage, and demographics. Looking further ahead, a move from scoring words to scoring both phrases and words with senses should realize considerable improvement for many languages of interest. With phrase dictionaries, the resulting phrase shift graphs will allow for a more nuanced and detailed analysis of a corpus’s sentiment score (Alajajian et al., 2016), ultimately affording clearer stories for sentiment dynamics.


3.1 Introduction

The power of stories to transfer information and define our own existence has been shown time and again (Pratchett et al., 2003; Campbell, 1949; Gottschall, 2013; Cave, 2013). We as people are fundamentally driven to find and tell stories, likened to Pan Narrans or Homo Narrativus (Dodds, 2013). Stories are encoded in art, language, and even in the mathematics of physics: We use equations to represent both simple and complicated functions that describe our observations of the real world. In science, we formalize the ideas that best fit our experience with principles such as Occam’s Razor: The simplest story is the one we should trust. We tend to prefer stories that fit into the molds which are familiar, and reject narratives that do not align with our experience (Nickerson, 1998).

We seek here to better understand stories that are captured and shared in written form, a medium that since inception has radically changed how information flows (Gleick, 2011). Without evolved cues from tone, facial expression, or body language, written stories are forced to capture the entire transfer of experience on a page. An often integral part of a written story is the emotional experience that is evoked in the reader. Here, we use a simple, robust sentiment analysis tool to extract the reader-perceived emotional content of written stories as they unfold on the page.

We objectively test aspects of folkloristic theory (Propp, 1968; MacDonald, 1982), specifically the commonality of core stories within societal boundaries (Cave, 2013; da Silva and Tehrani, 2016). A major component of folkloristics is the study of society and culture through literary analysis. This is sometimes referred to as narratology, which at its core is “a series of events, real or fictional, presented to the reader or the listener” (Min and Park, 2016). In our present treatment, we consider the plot as the “backbone” of events that occur in a chronological sequence (more detail on previous theories of plot, and the framing we present next and adopt, are in Appendix B.1). While the plot captures the mechanics of a narrative and the structure encodes their delivery, in the present work we examine the emotional arc that is invoked through the words used. The emotional arc of a story does not give us direct information about the plot or the intended meaning of the story, but rather exists as part of the whole narrative (e.g., an emotional arc showing a fall in sentiment throughout a story may arise from very different plot and structure combinations). This distinction between the emotional arc and the plot of a story is one point of misunderstanding in other work that has drawn criticism from the digital humanities community (Jockers, 2014). Through the identification of motifs, narrative theories allow us to analyze, interpret, describe, and compare stories across cultures and regions of the world (Dundes, 1997; Dolby, 2008; Uther, 2011). We show that automated extraction of emotional arcs is not only possible, but can test previous theories and provide new insights with the potential to quantify unobserved trends as the field transitions from data-scarce to data-rich (Kirschenbaum, 2007; Moretti, 2013).

The rejected master’s thesis of Kurt Vonnegut—which he personally considered his greatest contribution—defines the emotional arc of a story on the “Beginning–End” and “Ill Fortune–Great Fortune” axes (Vonnegut, 1981). Vonnegut finds a remarkable similarity between Cinderella and the origin story of Christianity in the Old Testament (see Fig. B.1 in Appendix B.2), leading us to search for all such groupings. In a recorded lecture available on YouTube (Vonnegut, 1995), Vonnegut asserted:

“There is no reason why the simple shapes of stories can’t be fed into computers, they are beautiful shapes.”

For our analysis, we apply three independent tools: Matrix decomposition by Singular Value Decomposition (SVD), unsupervised learning by agglomerative (hierarchical) clustering with Ward’s method, and unsupervised learning by a Self Organizing Map (SOM, a type of neural network). Each tool has different strengths: the SVD finds the underlying basis of all of the emotional arcs, the clustering classifies the emotional arcs into distinct groups, and the SOM generates arcs from noise which are similar to those in our corpus using a stochastic process. By considering the results of each tool independently, we are able to confirm that our findings have broad support.

We proceed as follows. We first introduce our methods in Section 3.2, we then discuss the combined results of each method in Section 3.3, and we present our conclusions in Section 3.4. A graphical outline of the methodology and results can be found as Fig. B.2 in Appendix B.2.

3.2 Methods

3.2.1 Emotional arc construction

To generate emotional arcs, we analyze the sentiment of 10,000 word windows, which we slide through the text (see Fig. 3.1). We rate the emotional content of each window using our Hedonometer with the labMT dataset, chosen for lexical coverage and its ability to generate meaningful word shift graphs, specifically using 10,000 words as a minimum necessary to generate meaningful sentiment scores (Reagan et al., 2015; Ribeiro et al., 2016). We emphasize that dictionary-based methods for sentiment analysis usually perform worse than random on individual sentences (Reagan et al., 2015; Ribeiro et al., 2016), and although this issue can be resolved by using a rolling average of sentence scores, it betrays a basic misunderstanding of similar efforts (Jockers, 2014). In Fig. 3.2, we show the emotional arc of Harry Potter and the Deathly Hallows, the final book in the popular Harry Potter series by J.K. Rowling. While the plot of the book is nested and complicated, the emotional arc associated with each sub-narrative is clearly visible. We analyze the emotional arcs corresponding to complete books, and to limit the conflation of multiple core emotional arcs, we restrict our analysis to shorter books by selecting a maximum number of words when building our filter. Further details of the emotional arc construction can be found in Appendix B.3.
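A minimal sketch of the sliding-window construction is given below. It is a simplification under stated assumptions: each window is scored as the plain average of the scored words it contains (no stop-value exclusion), the step between window starts is chosen so the windows span the whole text, and the word scores are made-up stand-ins for a real dictionary such as labMT.

```python
def emotional_arc(words, word_scores, window=10000, n_points=200):
    """Average score of the scored words in a fixed-size window slid through the text."""
    if len(words) < window:
        raise ValueError("text is shorter than one window")
    # Assumed spacing: the first window starts at word 0, the last ends at the final word.
    step = (len(words) - window) / max(n_points - 1, 1)
    arc = []
    for k in range(n_points):
        start = int(round(k * step))
        scored = [word_scores[w] for w in words[start:start + window]
                  if w in word_scores]
        arc.append(sum(scored) / len(scored) if scored else float("nan"))
    return arc

# Toy text and hypothetical scores, with a tiny window so the result is easy to check.
toy_words = ["good"] * 6 + ["bad"] * 6
toy_scores = {"good": 8.0, "bad": 2.0}
toy_arc = emotional_arc(toy_words, toy_scores, window=4, n_points=3)
```

The default `n_points=200` is an arbitrary choice for this sketch; the window default follows the 10,000-word windows described above.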

Figure 3.1: Schematic of how we compute emotional arcs. The indicated uniform length segments (gap between samples) taken from the text form the sample with fixed window size set at 10,000 words. The segment length thus follows from the length of the book in words and the number of points in the time series. Sliding this fixed size window through the book, we generate sentiment scores with the Hedonometer, which comprise the emotional arc (Dodds et al., 2011).

Figure 3.2: Annotated emotional arc of Harry Potter and the Deathly Hallows, by J.K. Rowling, inspired by the illustration made by Medaris for The Why Files (Tenenbaum et al., 2015). The entire seven book series can be classified as a “Kill the monster” plot (Booker, 2006), while the many sub plots and connections between them complicate the emotional arc of each individual book: this plot could not be readily inferred from the emotional arc alone. The emotional arc shown here captures the major highs and lows of the story, and should be familiar to any reader well acquainted with Harry Potter. Our method does not pick up emotional moments discussed briefly, perhaps in one paragraph or sentence (e.g., the first kiss of Harry and Ginny). We provide interactive visualizations of all Project Gutenberg books at and a selection of classic and popular books at

3.2.2 Project Gutenberg Corpus

For a suitable corpus we draw on the open access Project Gutenberg data set (Various, Various). We apply rough filters to the collection (roughly 50,000 books) in an attempt to obtain a set of books that represent English works of fiction. We start by selecting for only English books, with total words between 20,000 and 100,000, with more than 40 downloads from the Project Gutenberg website, and with Library of Congress Class corresponding to English fiction (the specific classes have labels PN, PR, PS, and PZ). To ensure that the 40-download limit is not influencing the results here, we repeat the entire analysis for each method with 10, 20, 40, and 80 download thresholds, in each case confirming the 40 download findings to be qualitatively unchanged. Next, we remove books with any word in the title from a list of keywords (e.g., “poems” and “collection”, full list in Appendix B.3). From within this set of books, we remove the front and back matter of each book using regular expression pattern matches that match on 98.9% of the books included. Two slices of the data for download count and the total word count are shown in Appendix B.3 Fig. B.4. We provide a list of the book IDs which are included for download in the Online Appendices at, the books are listed in Table B.1 in Appendix B.4, and we attempt to provide the Project Gutenberg ID when we mention a book by title herein. Given the Project Gutenberg ID, the raw ebook is available online from Project Gutenberg at, e.g., Alice’s Adventures in Wonderland by Lewis Carroll, has ID 11 and is available at We also provide an online, interactive version of the emotional arc for each book indexed by the ID, e.g., Alice’s Adventures in Wonderland is available at
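The filter described above can be sketched as a single predicate over book metadata. The record field names (`title`, `language`, `n_words`, `downloads`, `loc_class`) are hypothetical, the keyword set contains only the two examples named in the text, and treating the word-count bounds as inclusive is an assumption.

```python
# Partial keyword list: only the two examples given in the text.
EXCLUDED_TITLE_WORDS = {"poems", "collection"}
FICTION_CLASSES = {"PN", "PR", "PS", "PZ"}

def keep_book(book, min_downloads=40):
    """Return True if a metadata record passes the rough English-fiction filter."""
    title_words = set(book["title"].lower().split())
    return (
        book["language"] == "en"
        and 20000 <= book["n_words"] <= 100000
        and book["downloads"] > min_downloads
        and book["loc_class"] in FICTION_CLASSES
        and not (title_words & EXCLUDED_TITLE_WORDS)
    )

# Hypothetical metadata records (values are made up for illustration).
books = [
    {"title": "A Short Novel", "language": "en", "n_words": 45000,
     "downloads": 120, "loc_class": "PR"},
    {"title": "Collected Poems", "language": "en", "n_words": 30000,
     "downloads": 500, "loc_class": "PS"},
    {"title": "A Long Saga", "language": "en", "n_words": 250000,
     "downloads": 900, "loc_class": "PZ"},
]
kept = [b["title"] for b in books if keep_book(b)]
```

Re-running the same predicate with `min_downloads` set to 10, 20, or 80 mirrors the robustness check on the download threshold.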

3.2.3 Principal Component Analysis (SVD)

We use the standard linear algebra technique Singular Value Decomposition (SVD) to find a decomposition of stories onto an orthogonal basis of emotional arcs. Starting with the emotional arc (sentiment time series) for each book $b_i$ as row $i$ in the matrix $A$, we apply the SVD to find

$$A = U \Sigma V^T = W V^T,$$

where $U$ contains the projection of each sentiment time series onto each of the right singular vectors (rows of $V^T$, eigenvectors of $A^T A$), which have singular values given along the diagonal of $\Sigma$, with $W = U \Sigma$. Different intuitive interpretations of the matrices $U$, $\Sigma$, and $V^T$ are useful in the various domains in which the SVD is applied; here, we focus on right singular vectors as an orthonormal basis for the sentiment time series in the rows of $A$, which we will refer to as the modes. We combine $\Sigma$ and $U$ into the single coefficient matrix $W$ for clarity and convenience, such that $W$ now represents the mode coefficients.
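A short NumPy sketch of this decomposition follows, assuming the usual factorization of the arc matrix into left singular vectors, singular values, and right singular vectors. The random arcs are placeholders for real emotional arcs, and the variable names (`A`, `W`, `modes`) are labels chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
arcs = rng.normal(size=(6, 20))              # hypothetical arcs: 6 books, 20 points each
A = arcs - arcs.mean(axis=1, keepdims=True)  # mean-center each book's arc

# Thin SVD: rows of Vt are orthonormal modes, s holds the singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
W = U * s                                    # coefficient matrix (U scaled by singular values)
modes = Vt

explained = s**2 / np.sum(s**2)              # fraction of variance carried by each mode
reconstruction = W[:, :3] @ modes[:3]        # approximation using the first 3 modes
```

Summing more leading terms of `W[:, :k] @ modes[:k]` recovers each arc progressively, and using all modes reproduces `A` exactly.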

3.2.4 Hierarchical Clustering

We use Ward’s method to generate a hierarchical clustering of stories, which proceeds by merging, at each step, the pair of clusters of books that gives the smallest increase in within-cluster variance (Ward Jr, 1963). We use the mean-centered books and a distance function, computed over the windows of each pair of books, to generate the distance matrix.
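The sketch below runs Ward clustering with SciPy on synthetic rising and falling arcs. It is illustrative only: `linkage` with `method="ward"` uses Euclidean distances between rows, which may differ from the distance function used in the study, and the synthetic arcs are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 30)
# Synthetic arcs: five noisy rises and five noisy falls.
rising = (t - t.mean()) + 0.05 * rng.normal(size=(5, 30))
falling = -(t - t.mean()) + 0.05 * rng.normal(size=(5, 30))
arcs = np.vstack([rising, falling])
arcs = arcs - arcs.mean(axis=1, keepdims=True)   # mean-center each arc

Z = linkage(arcs, method="ward")                 # (n - 1) merges, one row per merge
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into two clusters
```

Each row of `Z` records one merge and its linkage cost, so cutting at different costs (or different `t` with `criterion="maxclust"`) yields clusters of different granularity.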

3.2.5 Self Organizing Map (SOM)

We implement a Self Organizing Map (SOM), an unsupervised machine learning method (a type of neural network), to cluster emotional arcs (Kohonen, 1990). The SOM works by finding the most similar emotional arc in a random collection of arcs. We use an 8x8 SOM (for 64 nodes, roughly 5% of the number of books), connected on a square grid, training according to the original procedure (with winner take all, and scaling functions across both distance and magnitude). We take the neighborhood influence function at iteration as


for a node in the set of nodes , with distance function given above and total number of nodes . For results shown here we take . We implement the learning adaptation function at training iteration as , again with , a standard value for the training hyper-parameters.

3.3 Results

We obtain a collection of 1,327 books that are mostly, but not all, fictional stories by using metadata from Project Gutenberg to construct a rough filter. We find broad support for the following six emotional arcs:

  • “Rags to riches” (rise).

  • “Tragedy”, or “Riches to rags” (fall).

  • “Man in a hole” (fall-rise).

  • “Icarus” (rise-fall).

  • “Cinderella” (rise-fall-rise).

  • “Oedipus” (fall-rise-fall).

Importantly, we obtain these same six emotional arcs from all possible arcs by observing them as the result of three methods: As modes from a matrix decomposition by SVD, as clusters in a hierarchical clustering using Ward’s algorithm, and as clusters using unsupervised machine learning. We examine each of the results in this section.

3.3.1 Principal Component Analysis (SVD)

In Fig. 3.3 we show the leading 12 modes in both the weighted (dark) and un-weighted (lighter) representation. In total, the first 12 modes explain 80% and 94% of the variance from the mean centered and raw time series, respectively. The modes are from mean-centered emotional arcs, such that the first SVD mode need not extract the average from the labMT scores nor the positivity bias present in language (Dodds et al., 2015a). The coefficients for each mode within a single emotional arc are both positive and negative, so we need to consider both the modes and their negation. We can immediately recognize the familiar shapes of core emotional arcs in the first four modes, and compositions of these emotional arcs in modes 5 and 6. We observe “Rags to riches” (mode 1, positive), “Tragedy” or “Riches to rags” (mode 1, negative), Vonnegut’s “Man in a hole” (mode 2, positive), “Icarus” (mode 2, negative), “Cinderella” (mode 3, positive), “Oedipus” (mode 3, negative). We choose to include modes 7–12 only for completeness, as these high frequency modes have little contribution to variance and do not align with core emotional arc archetypes from other methods (more below).

Figure 3.3: Top 12 modes from the Singular Value Decomposition of 1,327 Project Gutenberg books. We show in a lighter color modes weighted by their corresponding singular value, where we have scaled the matrix such that the first entry is 1 for comparison (for reference, the largest singular value is 34.5). The mode coefficients normalized for each book are shown in the right panel accompanying each mode, in the range -1 to 1, with the “Tukey” box plot.

We emphasize that by definition of the SVD, the mode coefficients can be either positive or negative, such that the modes themselves explain variance with both the positive and negative version. In the right panels of each mode in Fig. 3.3 we project the 1,327 stories onto each of the first six modes and show the resulting coefficients. While none are far from 0 (as would be expected), mode 1 has a mean slightly above 0 and both modes 3 and 4 have means slightly below 0. To sort the books by their coefficient for each mode, we normalize the coefficients within each book (each row of the coefficient matrix) to sum to 1, accounting for books with higher total energy, and these are the coefficients shown in the right panels of each mode in Fig. 3.3. In Appendix B.5, we provide supporting, intuitive details of the SVD method, as well as example emotional arc reconstruction using the modes (see Figs. B.5–B.7). As expected, fewer than 10 modes are enough to reconstruct the emotional arc to a degree of accuracy visible to the eye.

We show labeled examples of the emotional arcs closest to the top 6 modes in Figs. 3.4 and B.8.

Figure 3.4: First 3 SVD modes and their negation with the closest stories to each. To locate the emotional arcs on the same scale as the modes, we show the modes directly as the right singular vectors and weight the emotional arcs by the inverse of their coefficient for the particular mode. The closest stories shown for each mode are those stories with emotional arcs which have the greatest coefficient for that mode. In parentheses for each story is the Project Gutenberg ID and the number of downloads from the Project Gutenberg website, respectively. Links below each story point to an interactive visualization on which enables detailed exploration of the emotional arc for the story.

We present both the positive and negative modes, and the stories closest to each by sorting on the coefficient for that mode. For the positive stories, we sort in ascending order, and vice versa. Mode 1, which encompasses both the “Rags to riches” and “Tragedy” emotional arcs, captures 30% of the variance of the entire space. We examine the closest stories to both sides of modes 1–3, and direct the reader to Fig. B.8 for more details on the higher order modes. The two stories that have the most support from the “Rags to riches” mode are The Winter’s Tale (1539) and Oscar Wilde, Art and Morality: A Defence of “The Picture of Dorian Gray” (33689). Among the most categorical tragedies we find Lady Susan (946) and Warlord of Kor (17958). Number 8 in the sorted list of tragedies is perhaps the most famous tragedy: Romeo and Juliet by William Shakespeare. Mode 2 is the “Man in a hole” emotional arc, and we find the stories which most closely follow this path to be The Magic of Oz (419) and Children of the Frost (10736). The negation of mode 2 most closely resembles the emotional arc of the “Icarus” narrative. For this emotional arc, the most characteristic stories are Shadowings (34215) and Battle-Pieces and Aspects of the War (12384). Mode 3 is the “Cinderella” emotional arc, and includes Mystery of the Hasty Arrow (17763) and Through the Magic Door (5317). The negation of Mode 3, which we refer to as “Oedipus”, is found most characteristically in This World is Taboo (18172), Old Indian Days (339), and The Evil Guest (10377). We also note that the spread of the stories from their core mode increases strongly for the higher modes.

3.3.2 Hierarchical Clustering

We show a dendrogram of the 60 clusters with highest linkage cost in Fig. 3.5. The average silhouette coefficient is shown on the bottom of Fig. 3.5, and the distributions of silhouette values within each cluster (see Figs. B.17–B.18) can be used to analyze the appropriate number of clusters (Rousseeuw, 1987). A characteristic book from each cluster is shown on the leaf nodes by sorting the books within each cluster by the total distance to other books in the cluster (e.g., considering each intra-cluster collection as a fully connected weighted network, we take the most central node), and in parentheses the number of books in that cluster. In other words, we label each cluster by considering the network centrality of the fully connected cluster with edges weighted by the distance between stories. By cutting the dendrogram in Fig. 3.5 at various linkage costs we are able to extract clusters of the desired granularity. For the cuts labeled C2, C4, and C8, we show these clusters in Figs. B.9–B.11, and B.15. We find the first four of our final six arcs appearing among the eight most different clusters (Fig. B.15).
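Selecting the characteristic book of a cluster amounts to picking the arc with the smallest total distance to the others. A minimal sketch is below; the mean absolute difference used as the distance is an assumption for illustration, and the toy arcs are made up.

```python
import numpy as np

def most_central(arcs):
    """Index of the arc with the smallest total distance to all other arcs."""
    arcs = np.asarray(arcs, dtype=float)
    # Pairwise distances: mean absolute difference over time points (assumed metric).
    pairwise = np.abs(arcs[:, None, :] - arcs[None, :, :]).mean(axis=2)
    return int(np.argmin(pairwise.sum(axis=1)))

# Toy cluster of three two-point arcs.
toy_cluster = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
central_index = most_central(toy_cluster)
```

In the toy cluster the total distances are 6, 5, and 9, so the middle arc is selected.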

Figure 3.5: Dendrogram from the hierarchical clustering procedure using Ward’s minimum variance method. For each cluster, a selection of the 20 most central books to a fully-connected network of books are shown along with the average of the emotional arc for all books in the cluster, along with the cluster ID and number of books in each cluster (shown in parenthesis). The cluster ID is given by numbering the clusters in order of linkage starting at 0, with each individual book representing a cluster of size 1 such that the final cluster (all books) has the ID for the books. At the bottom, we show the average Silhouette value for all books, with higher value representing a more appropriate number of clusters. For each of the 60 leaf nodes (right side) we show the number of books within the cluster and the most central book to that cluster’s book network.

The clustering method groups stories with a “Man in a hole” emotional arc for a range of different variances, separate from the other arcs. In total these clusters (Panel A, E, and I of Fig. B.16) account for 30% of the Gutenberg corpus. The remainder of the stories have emotional arcs that are clustered among the “Tragedy” arc (32%), “Rags to riches” arc (5%), and the “Oedipus” arc (31%). A more detailed analysis of the results from hierarchical clustering can be found in Appendix B.6, and this result generally agrees with other attempts that use only hierarchical clustering (Jockers, 2015).

3.3.3 Self Organizing Map (SOM)

Finally, we apply Kohonen’s Self-Organizing Map (SOM) and find core arcs from unsupervised machine learning on the emotional arcs. On the two dimensional component plane, the prescribed network topology, we find seven spatially coherent groups, with five emotional arcs. These spatial groups are comprised of stories with core emotional arcs of differing variance.

In Fig. 3.6 we see both the B-Matrix to demonstrate the strength of spatial clustering and a heat-map showing where we find the winning nodes. The A–I labels refer to the individual nodes shown in Fig. B.19, and we observe seven spatial groups in both panels of Fig. 3.6: (1) A and G, (2) B and I, (3) C, (4) D, (5) E, (6) H, and (7) F. These spatial clusters reinforce the visible similarity of the winning node arcs, given that nodes H and F are close spatially but separated by the B-Matrix and contain very distinct arcs. We show the winning node emotional arcs and the arcs of books for which they are the winners in Fig. B.19. The legend shows the node ID, numbers the cluster by size, and in parentheses indicates the size of the cluster on that individual node. In Panels A and G we see varying strengths of the “Man in a hole” emotional arc. In Panels B and I, the second largest individual cluster consists of the “Rags to riches” arcs. In Panel C, and in Panel F, we find the “Oedipus” emotional arc, with a more pronounced positive start and decline in Panel C. In Panel D we see the “Icarus” arc, and in Panel E and Panel H we see the “Tragedy” arc. Each of these top stories is readily identifiable, yet again demonstrating the universality of these story types.

Figure 3.6: Results of the SOM applied to Project Gutenberg books. Left panel: Nodes on the 2D SOM grid are shaded by the number of stories for which they are the winner. Right panel: The B-Matrix shows that there are clear clusters of stories in the 2D space imposed by the SOM network.

3.3.4 Null comparison

There are many possible emotional arcs in the space that we consider. To demonstrate that these specific arcs are uniquely compelling as stories written by and for homo narrativus, we consider the true emotional arcs in relation to their most suitable comparison: the book with randomly shuffled words (“word salad”) and the resulting text from a 2-gram Markov model trained on the individual book itself (“nonsense”). We chose to compare to “word salad” and “nonsense” versions as they are more representative of a null model: written stories that are without coherent plot or structure to generate a coherent emotional arc, which is not true of a stochastic process (e.g., a random walk model or noise). Examples of the emotional arc and null emotional arcs for a single book are shown in Fig. B.20, with 10 “word salad” and “nonsense” versions. Sampled text using each method is given in Appendix B.3. We re-run each method on the English fiction Gutenberg Corpus with the null versions of each book and verify that the emotional arcs of real stories are not simply an artifact. The singular value spectrum from the SVD is flatter, with higher-frequency modes appearing more quickly, and in total representing 45% of the total variance present in real stories (see Figs. B.22 and B.25). Hierarchical clustering generates less distinct clusters with considerably lower linkage cost (final linkage cost 1400 vs 7000) for the emotional arcs from nonsense books, and the winning node vectors on a self-organizing map lack coherent structure (see Figs. B.26 and B.29 in Appendix B.8).
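The two null versions of a book can be sketched as follows. This is an assumed reading of the procedure: “word salad” shuffles the book's words, and “nonsense” samples each next word from the words that follow the current word in the same book (a bigram chain), restarting at a random word when the current word has no recorded follower. Function names and the toy text are hypothetical.

```python
import random
from collections import defaultdict

def word_salad(words, seed=0):
    """Return the same words in a random order."""
    shuffled = list(words)
    random.Random(seed).shuffle(shuffled)
    return shuffled

def nonsense(words, length, seed=0):
    """Sample text from a bigram Markov chain fit to a single book."""
    rng = random.Random(seed)
    followers = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)
    out = [rng.choice(words)]
    while len(out) < length:
        options = followers.get(out[-1])
        # Restart from a random word if the current word never has a follower.
        out.append(rng.choice(options) if options else rng.choice(words))
    return out

toy_text = "the cat sat on the mat and the cat ran".split()
salad = word_salad(toy_text)
generated = nonsense(toy_text, length=50)
```

Both versions keep the book's vocabulary and word frequencies (exactly for the shuffle, approximately for the chain) while removing the ordering that gives a real story its arc.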

3.3.5 The Success of Stories

To examine how the emotional trajectory impacts success, in Fig. 3.7 we examine the downloads for all of the books that are most similar to each SVD mode (for additional modes, see Fig. B.3 in Appendix B.2). We find that the first four modes, which contain the greatest total number of books, are not the most popular. Along with the negative of mode 2, both polarities of modes 3 and 4 have markedly higher median downloads, while we discount the importance of the mean with the high variance. The success of the stories underlying these emotional arcs suggests that the emotional experience of readers strongly affects how stories are shared. We find “Icarus” (-SV 2), “Oedipus” (-SV 3), and two sequential “Man in a hole” arcs (SV 4), are the three most successful emotional arcs. These results are influenced by individual books within each mode which have high numbers of downloads, and we refer the reader to the download-sorted tables for each mode in Appendix B.5.

Figure 3.7: Download statistics for stories whose SVD Modes comprise more than 2.5% of books, for the total number of books and the number corresponding to the particular mode. Modes SV 3 through -SV 4 (both polarities of modes 3 and 4) exhibit a higher average number of downloads and more variance than the others. Mode arcs are rows of and the download distribution is shown in space from 20 to 30,000 downloads.

3.4 Conclusion

Using three distinct methods, we have demonstrated that there is strong support for six core emotional arcs. Our methodology brings to bear a cross section of data science tools with a knowledge of the potential issues that each presents. We have also shown that consideration of the emotional arc for a given story is important for the success of that story. Of course, downloads are only a rough proxy for success, and this work may provide an outline for more detailed analysis of the factors that impact meaningful measures of success, e.g., sales or cultural influence.

Our approach could be applied in the opposite direction: namely by beginning with the emotional arc and aiding in the generation of compelling stories (Li et al., 2013). Understanding the emotional arcs of stories may be useful to aid in constructing arguments (Bex and Bench-Capon, 2010) and teaching common sense to artificial intelligence systems (Riedl and Harrison, 2015).

Extensions of our analysis that use a more curated selection of full-text fiction can answer more detailed questions about which stories are the most popular throughout time, and across regions (da Silva and Tehrani, 2016). Automatic extraction of character networks would allow a more detailed analysis of plot structure for the Project Gutenberg corpus used here (Bost et al., 2016; Prado et al., 2016; Min and Park, 2016). Bridging the gap between the full text stories (Nenkova and McKeown, 2012) and systems that analyze plot sequences will allow such systems to undertake studies of this scale (Winston, 2011). Place could also be used to consider separate character networks through time, and to help build an analog to Randall Munroe’s Movie Narrative Charts (Munroe, 2009).

We are producing data at an ever increasing rate, including rich sources of stories written to entertain and share knowledge, from books to television series to news. Of profound scientific interest will be the degree to which we can eventually understand the full landscape of human stories, and data driven approaches will play a crucial role.

PSD and CMD acknowledge support from NSF Big Data Grant #1447634.


4.1 Collective Philanthropy: Describing and Modeling the Ecology of Giving

The first paper is Collective Philanthropy: Describing and Modeling the Ecology of Giving by William L. Gottesman, Andrew James Reagan, and Peter Sheridan Dodds, cited as Gottesman et al. (2014).

4.1.1 Abstract

Reflective of income and wealth distributions, philanthropic gifting appears to follow an approximate power-law size distribution as measured by the size of gifts received by individual institutions. We explore the ecology of gifting by analyzing data sets of individual gifts for a diverse group of institutions dedicated to education, medicine, art, public support, and religion. We find that the detailed forms of gift-size distributions differ across but are relatively constant within charity categories. We construct a model for how a donor’s income affects their giving preferences in different charity categories, offering a mechanistic explanation for variations in institutional gift-size distributions. We discuss how knowledge of gift-size distributions may be used to assess an institution’s gift-giving profile, to help set fund-raising goals, and to design an institution-specific giving pyramid.

4.1.2 Contribution

In this paper I prepared final versions of each visualization in the paper, working from the initial designs from both Professor Dodds and Bill Gottesman, and working closely with Professor Dodds in their preparation. Additionally, at the request of the reviewers, I performed the statistical tests for support of power law distributions discussed in the paper and included in the Appendix. In addition to testing for support of power law distributions using the maximum likelihood estimator (Clauset et al., 2009), I ran likelihood comparison tests across many distributions, which we argue in the manuscript are potentially more applicable here to determine the most appropriate distribution. The parameters for the various distributions mentioned in the paper are typeset using LaTeX variables, which are written to a .tex file by the MATLAB and Python scripts that perform the statistical procedures. To the extent possible, all figures and analysis can be reproduced by running a single script. In this Section we include a reprint of Figure 1, Figure S1, and the power law fit tables from the paper. The codebase for creating the figures and performing the statistical procedures is available at

Figure 4.1: A reprint of Figure 1 from Gottesman et al. (2014), part of the caption is as follows: “Gift size distributions for a range of institutions. The reported and were fitted to the region indicated by solid gray line, and the 95% CI of this fit, as well as year for which the fit is plotted, are included for each organization. The ranges over which the data were fit was chosen empirically; other approaches were found to be inconsistent (see Supplementary).”
Figure 4.2: A reprint of Figure S1 from Gottesman et al. (2014), part of the caption is as follows: “The Kolmogorov-Smirnoff statistic plotted over the log of , the minimum value fit for power law behavior, for the United Way of Chittenden County over the years 2006-2010. is generated from the ML estimate. Existence of multiple minima in our data indicate that there are multiple possible fitting regions for which the KS statistic details a good fit. The variability of this value over each year plotted produced widely varying scaling parameters , and thus cannot be used without actually looking at the data.”
Institution Year   D    p
Mount Sinai Hospital 2009 17618.40 450408.65 37259947 1.92 0.08 1 to 90 0.12 0.00
2010 19348.18 429587.88 27885708 2.02 0.10 1 to 90 0.10 0.00
Einstein School of Medicine 2006 3247.30 46940.29 2000000 1.79 0.02 1 to 2000 0.11 0.00
2007 4768.09 78762.48 5350000 1.71 0.01 1 to 2000 0.15 0.00
2008 10385.80 199751.68 10200000 1.80 0.01 1 to 2000 0.21 0.00
2009 5212.92 139468.89 10000000 1.84 0.01 1 to 2000 0.15 0.00
2010 4917.94 61893.49 2000000 1.80 0.06 1 to 2000 0.15 0.00
University of Vermont 1974 155.76 2811.94 200000 1.94 0.01 3 to 794 0.18 0.00
1980 284.31 5284.36 326000 1.85 0.03 3 to 794 0.11 0.00
1990 350.23 5382.45 500000 2.16 0.01 3 to 794 0.38 0.00
2000 805.33 15120.53 1488000 1.71 0.03 3 to 794 0.09 0.00
2010 741.40 17029.10 2000000 1.81 0.05 3 to 794 0.13 0.00
United Way, Chittenden County 2004 441.71 1133.02 30000 2.77 0.04 1 to 316 0.21 0.00
2005 464.47 1444.26 50000 2.58 0.22 1 to 316 0.13 0.00
2006 456.86 1199.92 25000 2.42 0.05 1 to 316 0.07 0.00
2007 456.16 1279.14 30000 2.42 0.14 1 to 316 0.07 0.00
2008 287.53 1089.92 45460 2.53 0.00 1 to 316 0.14 0.00
2009 278.93 1122.44 56500 2.55 0.08 1 to 316 0.12 0.00
2010 287.58 1271.10 70518 2.47 0.09 1 to 316 0.08 0.00
ECHO Science Museum 2005 977.77 3153.41 25000 1.66 0.03 2 to 88 0.20 0.00
2006 951.16 3415.22 25000 1.59 0.02 2 to 88 0.28 0.00
2007 941.61 3161.08 25000 1.59 0.07 2 to 88 0.31 0.00
2008 956.88 2688.31 20000 1.56 0.01 2 to 88 0.26 0.00
2009 676.84 2098.96 20000 1.73 0.15 2 to 88 0.17 0.00
Flynn Theater 2006 241.87 1528.82 65065 2.18 0.04 1 to 2000 0.26 0.00
2007 268.54 1732.33 60000 2.15 0.05 1 to 2000 0.25 0.00
2008 248.00 1015.39 27500 2.15 0.00 1 to 2000 0.22 0.00
2009 242.90 1212.42 40000 2.18 0.04 1 to 2000 0.23 0.00
2010 246.13 1606.43 70000 2.09 0.05 1 to 2000 0.22 0.00
Table 4.1: Summary statistics of all of the donation data are presented. The reported and range are fit with the MLE method, and the which was found to minimize the Kolmogorov-Smirnov statistic D is reported along with D itself. In this case, lower values of D indicate a better fit.
Log-Normal Exponential Stretched Exp. Cutoff Power Law
Institution Year   p   LR   p   LR   p   LR   p   LR   p
Mount Sinai Hospital 2009 0.00 -0.21 0.67 31.80 -0.19 0.82 -0.53 0.30
2010 0.00 -0.00 0.99 47.31 0.46 0.60 -0.23 0.50
Einstein School of Medicine 2006 0.00 -6.22 378.82 -7.06 -8.31
2007 0.00 -0.30 0.59 17.65 -0.35 0.61 -0.67 0.25
2008 0.00 -1.03 0.37 1235.22 0.71 0.81 -2.85
2009 0.00 -2.48 0.13 578.27 -2.75 0.22 -5.82
2010 0.00 -1.52 0.22 842.87 -0.64 0.80 -5.19
University of Vermont 1974 0.00 -0.39 0.54 20.93 -0.49 0.54 -1.17 0.13
1980 0.00 -0.72 0.41 82.27 -0.81 0.47 -1.82
1990 0.00 -0.94 0.36 23.05 -1.11 0.34 -1.79
2000 0.00 -0.65 0.45 30.59 -0.78 0.44 -1.52
2010 0.00 -inf nan 7.75 0.39 0.34 -0.00 0.94
United Way, Chittenden County 2004 0.00 -0.46 0.47 28.75 -0.53 0.55 -1.29 0.11
2005 0.00 -0.08 0.77 54.69 0.36 0.74 -0.69 0.24
2006 0.00 -0.12 0.71 68.71 0.44 0.71 -0.85 0.19
2007 0.00 -0.61 0.43 48.21 -0.65 0.57 -1.64
2008 0.00 -0.13 0.72 46.52 0.14 0.90 -0.71 0.23
2009 0.00 -0.35 0.55 48.39 -0.28 0.80 -1.15 0.13
2010 0.00 -0.32 0.58 35.25 -0.30 0.77 -0.90 0.18
ECHO Science Museum 2005 0.00 -2.47 0.25 31.43 -3.04 0.21 -3.56
2006 0.00 -0.20 0.69 1.42 0.57 -0.28 0.68 -0.53 0.30
2007 0.00 -inf nan 4.56 0.20 0.35 0.00 1.00
2008 0.00 -inf nan 4.28 0.29 0.19 0.00 1.00
2009 0.00 -0.87 0.47 31.48 -1.23 0.44 -2.51
Flynn Theater 2006 0.00 -0.52 0.46 272.93 0.32 0.87 -2.80
2007 0.00 -0.06 0.80 4.53 0.14 -0.08 0.86 -0.26 0.47
2008 0.00 -0.56 0.45 303.73 0.38 0.86 -3.35
2009 0.00 -0.25 0.63 281.34 1.11 0.59 -2.16
2010 0.00 -3.96 129.19 -4.61 -6.78
Table 4.2: The results of the Likelihood-Ratio and its associated p-value are reported for different distributions. Here, positive values lend support to the Power Law and negative values to the other stated distribution. The significance of the LR is p, where low values of p indicate a trustworthy LR. Values for which are bolded.
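
The quantities reported in Table 4.1 can be illustrated with a short sketch of the continuous power-law maximum likelihood estimate and the Kolmogorov-Smirnov distance D, following the standard formulas of Clauset et al. (2009). This is not the MATLAB or Python code used for the paper; the function names and the toy gift sizes are assumptions made for illustration.

```python
import math


def fit_alpha(values, x_min):
    """Continuous power-law MLE: alpha = 1 + n / sum(ln(x / x_min))."""
    tail = sorted(x for x in values if x >= x_min)
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0:
        raise ValueError("need at least one value above x_min")
    return 1.0 + len(tail) / log_sum, tail


def ks_distance(tail, x_min, alpha):
    """Largest gap between the empirical CDF and the fitted power-law CDF."""
    n = len(tail)
    distance = 0.0
    for i, x in enumerate(tail):
        model = 1.0 - (x / x_min) ** (1.0 - alpha)
        # Compare against the empirical CDF just before and just after x.
        distance = max(distance, abs(model - i / n), abs(model - (i + 1) / n))
    return distance


gifts = [1.0, math.e, math.e ** 2]  # toy values, not data from the paper
alpha, tail = fit_alpha(gifts, x_min=1.0)
d = ks_distance(tail, 1.0, alpha)
```

In the Clauset et al. procedure, these two steps are repeated over candidate lower cutoffs and the cutoff that minimizes D is retained, which is why D can have several local minima in practice.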

4.2 Shadow networks: Discovering hidden nodes with models of information flow

Paper number two is Shadow networks: Discovering hidden nodes with models of information flow by James P. Bagrow, Suma Desu, Morgan R. Frank, Narine Manukyan, Lewis Mitchell, Andrew Reagan, Eric E. Bloedorn, Lashon B. Booker, Luther K. Branting, Michael J. Smith, Brian F. Tivnan, Christopher M. Danforth, Peter S. Dodds, and Joshua C. Bongard, cited as Bagrow et al. (2014).

4.2.1 Abstract

Complex, dynamic networks underlie many systems, and understanding these networks is the concern of a great span of important scientific and engineering problems. Quantitative description is crucial for this understanding yet, due to a range of measurement problems, many real network datasets are incomplete. Here we explore how accidentally missing or deliberately hidden nodes may be detected in networks by the effect of their absence on predictions of the speed with which information flows through the network. We use Symbolic Regression (SR) to learn models relating information flow to network topology. These models show localized, systematic, and non-random discrepancies when applied to test networks with intentionally masked nodes, demonstrating the ability to detect the presence of missing nodes and where in the network those nodes are likely to reside.

4.2.2 Contribution

This paper is the result of a multi-day intensive collaboration called a Flash Mob Research Event. The format is one or two days of everyone in the same room, brainstorming how to tackle an important open question. An outline of the paper is written, and after the event each member works to complete their part in carrying out the research idea. My responsibility was to build reciprocal reply networks from Twitter data, in an effort to measure information flow over the network. The network construction proceeded in three steps: (1) build a network using replies, (2) measure information flow over this reciprocal reply network, and (3) collect edges in the network for the actual information flow. Each step of the construction would be carried out over a number of days, and using a single node on the VACC, we were able to build networks in memory for a total of 9 days. These 9 days were considered for combinations 3/3/3 or 4/4/1 days, respectively. These data were used in a real world test, to accompany testing of simulated data.
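
Step (1) above can be sketched as follows, assuming the replies are available as (author, replied-to) pairs. The function name and input format are illustrative assumptions rather than the code used for the paper: an undirected edge is kept only when each user has replied to the other at least once.

```python
def reciprocal_reply_edges(replies):
    """Return undirected edges between users who have replied to each other.

    replies: iterable of (author, replied_to) pairs (hypothetical format).
    """
    directed = set()
    for author, target in replies:
        if author != target:  # ignore self-replies
            directed.add((author, target))
    # Keep a pair only if the reverse direction was also observed.
    return {tuple(sorted((a, b))) for a, b in directed if (b, a) in directed}


replies = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "c"), ("d", "a"), ("a", "d")]
edges = reciprocal_reply_edges(replies)
```

Holding only the set of directed pairs in memory keeps the footprint proportional to the number of distinct reply relationships rather than the number of Tweets.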

4.3 Human language reveals a universal positivity bias

Paper number three is Human language reveals a universal positivity bias by Peter Sheridan Dodds, Eric M. Clark, Suma Desu, Morgan R. Frank, Andrew J. Reagan, Jake Ryland Williams, Lewis Mitchell, Kameron Decker Harris, Isabel M. Kloumann, James P. Bagrow, Karine Megerdoomian, Matthew T. McMahon, Brian F. Tivnan, and Christopher M. Danforth, cited as Dodds et al. (2015a).

4.3.1 Abstract

Using human evaluation of 100,000 words spread across 24 corpora in 10 languages diverse in origin and culture, we present evidence of a deep imprint of human sociality in language, observing that (1) the words of natural human language possess a universal positivity bias; (2) the estimated emotional content of words is consistent between languages under translation; and (3) this positivity bias is strongly independent of frequency of word usage. Alongside these general regularities, we describe inter-language variations in the emotional spectrum of languages which allow us to rank corpora. We also show how our word evaluations can be used to construct physical-like instruments for both real-time and offline measurement of the emotional content of large-scale texts.

4.3.2 Contribution

In this paper I built the online appendices and performed additional tests of our method for building the sentiment timeseries for books (measuring their emotional arcs). This included building a fully interactive version of an application of this dataset to analyze the emotional arcs of stories, which was done for a selection of the Western Canon and Project Gutenberg books. In particular, we analyzed the emotional arc for these books in their original language, providing translations of the word shift graphs into English. The translations relied upon Google Translate, as curated by Eric Clark. The additional statistical tests amounted to randomly shuffling the words in each book that we showcased, to demonstrate that the emotional arcs were meaningful.

4.4 Climate change sentiment on Twitter: An unsolicited public opinion poll

Paper number four is Climate change sentiment on Twitter: An unsolicited public opinion poll by Emily M. Cody, Andrew J. Reagan, Lewis Mitchell, Peter Sheridan Dodds, and Christopher M. Danforth, cited as Cody et al. (2015).

4.4.1 Abstract

The consequences of anthropogenic climate change are extensively debated through scientific papers, newspaper articles, and blogs. Newspaper articles may lack accuracy, while the severity of findings in scientific papers may be too opaque for the public to understand. Social media, however, is a forum where individuals of diverse backgrounds can share their thoughts and opinions. As consumption shifts from old media to new, Twitter has become a valuable resource for analyzing current events and headline news. In this research, we analyze tweets containing the word "climate" collected between September 2008 and July 2014. Through use of a previously developed sentiment measurement tool called the Hedonometer, we determine how collective sentiment varies in response to climate change news, events, and natural disasters. We find that natural disasters, climate bills, and oil-drilling can contribute to a decrease in happiness while climate rallies, a book release, and a green ideas contest can contribute to an increase in happiness. Words uncovered by our analysis suggest that responses to climate change news are predominantly from climate change activists rather than climate change deniers, indicating that Twitter is a valuable resource for the spread of climate change awareness.

4.4.2 Contribution

In this paper I was responsible for the data curation. This amounted to searching the Twitter database on the VACC for a variety of keywords, storing those results, and processing them into useful formats for analysis. Weighing in at approximately 37 TB of compressed JSON files, the Twitter database is difficult to search quickly over the GPFS architecture of the VACC, and searching is only possible through the use of many short-runtime (less than 2 hours) jobs. Given all of this, a single search of the database takes approximately 2 days if everything is running smoothly.
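
A single short job in such a search might look like the sketch below. It assumes, purely for illustration, that each file is gzip-compressed with one JSON-encoded Tweet per line and that the Tweet body lives in a "text" field; the actual storage layout on the VACC is not described here, and the function name is hypothetical.

```python
import gzip
import json
import os
import re
import tempfile


def search_tweets(path, pattern):
    """Yield tweets from a gzipped JSON-lines file whose text matches pattern."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            try:
                tweet = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed or truncated lines
            if pattern.search(tweet.get("text", "")):
                yield tweet


# Tiny synthetic file standing in for one chunk of the database.
sample = [{"text": "Climate rally today"}, {"text": "lunch"}, {"text": "climate bill"}]
path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as handle:
    for tweet in sample:
        handle.write(json.dumps(tweet) + "\n")

matches = list(search_tweets(path, re.compile(r"\bclimate\b", re.IGNORECASE)))
```

Because each file is scanned independently, the work splits naturally into many jobs that each stay under a short runtime limit.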

4.5 Reply to Garcia et al.: Common mistakes in measuring frequency dependent word characteristics

The fifth paper is Reply to Garcia et al.: Common mistakes in measuring frequency dependent word characteristics by P. S. Dodds, E. M. Clark, S. Desu, M. R. Frank, A. J. Reagan, J. R. Williams, L. Mitchell, K. D. Harris, I. M. Kloumann, J. P. Bagrow, K. Megerdoomian, M. T. McMahon, B. F. Tivnan, and C. M. Danforth, cited as Dodds et al. (2015b).

4.5.1 Abstract

We demonstrate that the concerns expressed by Garcia et al. are misplaced, due to (1) a misreading of our findings in Dodds et al. (2015a); (2) a widespread failure to examine and present words in support of asserted summary quantities based on word usage frequencies; and (3) a range of misconceptions about word usage frequency, word rank, and expert-constructed word lists. In particular, we show that the English component of our study compares well statistically with two related surveys, that no survey design influence is apparent, and that estimates of measurement error do not explain the positivity biases reported in our work and that of others. We further demonstrate that for the frequency dependence of positivity (of which we explored the nuances in great detail in Dodds et al. (2015a)), Garcia et al. did not perform a reanalysis of our data; they instead carried out an analysis of a statistically improper data set and introduced a nonlinearity before performing linear regression.

4.5.2 Contribution

For this paper I built a new online appendix, performed tests of the claims made by Garcia et al. (including re-making their visualizations), and built visualizations for the extended version of the reply (e.g., Table I and Figure 1 in the arXiv version). Below, we include a reprint of the aforementioned Figure 1 and a reproduction of Figures 1A and 1B from Garcia et al.:

Figure 4.3: Reprint of Figure 1 from Dodds et al. (2015b), with the caption as follows: “Comparison of word ratings for three studies for overlapping words: labMT (Dodds et al., 2011), ANEW (Bradley and Lang, 1999), and Warriner and Kuperman (Warriner et al., 2013). Reduced major axis regression (Rayner, 1985) yield the fits .”
Figure 4.4: A reproduction of the Figure 1A and 1B from Garcia et al. (2015).
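
For readers unfamiliar with the reduced major axis regression named in the caption of Figure 4.3, a minimal sketch follows. It uses the standard definition (slope equal to the ratio of standard deviations, signed by the covariance) and toy numbers; the function name and data are illustrative and this is not the code used to produce the figure.

```python
import math


def reduced_major_axis(xs, ys):
    """Return (slope, intercept) of the reduced major axis fit of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Slope magnitude is sd(y) / sd(x); its sign follows the covariance.
    slope = math.copysign(math.sqrt(syy / sxx), sxy)
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1.0, 2.0, 3.0, 4.0]  # toy ratings, not values from either study
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = reduced_major_axis(xs, ys)
inverse_slope, _ = reduced_major_axis(ys, xs)
```

Unlike ordinary least squares, the fit treats the two variables symmetrically: swapping them inverts the slope, which suits a comparison of two rating studies where neither is the "true" predictor.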

4.6 The game story space of professional sports: Australian Rules Football

Paper number six is The game story space of professional sports: Australian Rules Football by D. P. Kiley, A. J. Reagan, L. Mitchell, C. M. Danforth, and P. S. Dodds, cited as Kiley et al. (2016).

4.6.1 Abstract

Sports are spontaneous generators of stories. Through skill and chance, the script of each game is dynamically written in real time by players acting out possible trajectories allowed by a sport’s rules. By properly characterizing a given sport’s ecology of ‘game stories’, we are able to capture the sport’s capacity for unfolding interesting narratives, in part by contrasting them with random walks. Here, we explore the game story space afforded by a data set of 1,310 Australian Football League (AFL) score lines. We find that AFL games exhibit a continuous spectrum of stories rather than distinct clusters. We show how coarse-graining reveals identifiable motifs ranging from last minute comeback wins to one-sided blowouts. Through an extensive comparison with biased random walks, we show that real AFL games deliver a broader array of motifs than null models, and we provide consequent insights into the narrative appeal of real games.

4.6.2 Contribution

For this paper I consulted with lead author Dilan Kiley on the statistical methods used, and assisted in performing the statistical analysis by leveraging the computational resources of the VACC.

4.7 The Lexicocalorimeter: Gauging public health through caloric input and output on social media

Paper number seven is The Lexicocalorimeter: Gauging public health through caloric input and output on social media by S. E. Alajajian, J. R. Williams, A. J. Reagan, S. C. Alajajian, M. R. Frank, L. Mitchell, J. Lahne, C. M. Danforth, and P. S. Dodds, cited as Alajajian et al. (2016).

4.7.1 Abstract

We propose and develop a Lexicocalorimeter: an online, interactive instrument for measuring the “caloric content” of social media and other large-scale texts. We do so by constructing extensive yet improvable tables of food and activity related phrases, and respectively assigning them with sourced estimates of caloric intake and expenditure. We show that for Twitter, our naive measures of “caloric input”, “caloric output”, and the ratio of these measures are all strong correlates with health and well-being measures for the contiguous United States. Our caloric balance measure in many cases outperforms both its constituent quantities, is tunable to specific health and well-being measures such as diabetes rates, has the capability of providing a real-time signal reflecting a population’s health, and has the potential to be used alongside traditional survey data in the development of public policy and collective self-awareness. Because our Lexicocalorimeter is a linear superposition of principled phrase scores, we also show we can move beyond correlations to explore what people talk about in collective detail, and assist in the understanding and explanation of how population-scale conditions vary, a capacity unavailable to black-box type methods.

4.7.2 Contribution

For this paper I built an extensive online appendix and the accompanying website. The online appendix at features an interactive dashboard provided at In addition to this tool, we provide searchable maps for all food and activity words used in the study. Next, we show snapshots of the various visualizations available on the website, in Figures 4.5–4.8.

Figure 4.5: Lexicocalorimeter map, using square states to control for the disproportionate area and population of US States. Here, Vermont is highlighted by a hover.
Figure 4.6: Lexicocalorimeter food and activity shifts. Here we see which foods and which activities contribute to Vermont’s difference in caloric intake and expenditure from the US as a whole. We see that Bacon contributes most to caloric intake in Vermont relative to the average US intake, and overall Vermont is a middle-of-the-pack state (29th out of 49). On the right, Tweets from Vermont expend more calories than the US average with activities such as skiing, running, snowboarding, hiking, and sledding, giving the outdoorsy Vermont Twitter population the 3rd highest expenditure.
Figure 4.7: Overview of the Lexicocalorimeter dashboard. Each view is linked by hovering, and we can explore details of the caloric difference balances between states.
Figure 4.8: Snapshot of the Lexicocalorimeter activity search page. A similar page exists for foods. Here, we submit the query for “basketball”, seeing that Nebraskans Tweet more about basketball relative to other activities than other US States.

4.8 Tracking the Teletherms: The spatiotemporal dynamics of the hottest and coldest days of the year

Paper number eight is Tracking the Teletherms: The spatiotemporal dynamics of the hottest and coldest days of the year by Peter Sheridan Dodds, Lewis Mitchell, Andrew J. Reagan, and Christopher M. Danforth, cited as Dodds et al. (2016).

4.8.1 Abstract

Instabilities and long term shifts in seasons, whether induced by natural drivers or human activities, pose great disruptive threats to ecological, agricultural, and social systems. Here, we propose, measure, and explore two fundamental markers of location-sensitive seasonal variations: the Summer and Winter Teletherms — the on-average annual dates of the hottest and coldest days of the year. We analyze daily temperature extremes recorded at 1218 stations across the contiguous United States from 1853–2012, and observe large regional variation with the Summer Teletherm falling up to 90 days after the Summer Solstice, and 50 days for the Winter Teletherm after the Winter Solstice. We show that Teletherm temporal dynamics are substantive with clear and in some cases dramatic shifts reflective of system bifurcations. We also compare recorded daily temperature extremes with output from two regional climate models finding considerable though relatively unbiased error. Our work demonstrates that Teletherms are an intuitive, powerful, and statistically sound measure of local climate change, and that they pose detailed, stringent challenges for future theoretical and computational models.

4.8.2 Contribution

For this paper, I built the online appendices and transformed the visualizations into online, interactive versions at using D3 Javascript (Bostock et al., 2011). The online appendices are available at Maps of the United States are shown in Figure 4.9, with Voronoi cells for each station colored in addition to the direction and color of the arrows used in the static maps. Other features of these online maps include the ability to animate through time, select a fisheye lens for inspecting the map, and toggle between the various indicators (Summer/Winter Teletherm day and temperature).

Figure 4.9: Interactive teletherm map with time and variable controls. Select between the teletherm day & extent and teletherm temperature, the averaging window to compute the teletherms, and the time to show on the map. A linear color scale, “oranges”, is shown for teletherm day and extent. A diverging color scale is shown for temperatures, inspired by For each weather station, a tooltip hover shows details on demand.

To realize the goals of this research, the website is designed to communicate the patterns of Teletherm dynamics at both a local and a regional level. In addition to building interactive versions of the US maps, I worked with Professor Dodds to design novel visualizations for the individual station teletherm dynamics. These plots are shown in Figure 4.10, and accompany visualizations of the time dynamics of Teletherm days, extents, and temperatures. The online source code repository is publicly available at

Figure 4.10: Teletherm dials show the yearly temperature dynamics for a single location over a period of time, and time series below show the trends for both temperature extremes and teletherm dates. The min and max temperature for each day of the year are smoothed over three 25-year windows, one for each dial, and shown in blue and red, respectively. As in the paper, the smoothed temperature is computed with a Gaussian kernel smoothing over the average min/max over days of the year. To avoid issues with the boundary, the temperature is wrapped on both ends of the year (with the same data) before the Gaussian kernel is applied. Summer and winter solstice are shown with icons, and the details of the day of year are shown in the upper right of each dial (the hover is linked between the dials, so they all move together).
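
The wrapped Gaussian smoothing described in the caption of Figure 4.10 can be sketched as a circular convolution. The kernel width `sigma`, the truncation at four standard deviations, and the synthetic temperature curve are assumptions chosen for illustration; the bandwidth used in the paper is not restated here.

```python
import math


def smooth_periodic(values, sigma):
    """Gaussian-kernel smooth a yearly series, wrapping around the year ends."""
    n = len(values)
    radius = min(n // 2, int(math.ceil(4 * sigma)))  # truncate the kernel
    offsets = range(-radius, radius + 1)
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    total = sum(weights)
    smoothed = []
    for day in range(n):
        # Index modulo n so late December borrows from early January.
        acc = sum(w * values[(day + k) % n] for w, k in zip(weights, offsets))
        smoothed.append(acc / total)
    return smoothed


# Synthetic average daily maximum temperature peaking on day 200.
temps = [10 + 15 * math.cos(2 * math.pi * (d - 200) / 365) for d in range(365)]
smoothed = smooth_periodic(temps, sigma=15)
hottest_day = smoothed.index(max(smoothed))
```

Taking the index modulo the series length is equivalent to padding both ends of the year with copies of the same data, so the smoothed curve has no edge artifacts at the start or end of the year.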

4.9 Divergent Discourse Between Protests and Counter-Protests: #BlackLivesMatter and #AllLivesMatter

Paper number nine is Divergent Discourse Between Protests and Counter-Protests: #BlackLivesMatter and #AllLivesMatter by Ryan J. Gallagher, Andrew J. Reagan, Christopher M. Danforth, and Peter Sheridan Dodds, cited as Gallagher et al. (2016).

4.9.1 Abstract

Since the shooting of Black teenager Michael Brown by White police officer Darren Wilson in Ferguson, Missouri, the protest hashtag #BlackLivesMatter has amplified critiques of extrajudicial killings of Black Americans. In response to #BlackLivesMatter, other Twitter users have adopted #AllLivesMatter, a counter-protest hashtag whose content argues that equal attention should be given to all lives regardless of race. Through a multi-level analysis, we study how these protests and counter-protests diverge by quantifying aspects of their discourse. In particular, we introduce methodology that not only quantifies these divergences, but also reveals whether they are from widespread discussion or a few popular retweets within these groups. We find that #BlackLivesMatter exhibits many information rich conversations, while those within #AllLivesMatter are more muted and susceptible to hijacking. We also show that the discussion within #BlackLivesMatter is more likely to center around the deaths of Black Americans, while that of #AllLivesMatter is more likely to sympathize with the lives of police officers and express politically conservative views.

4.9.2 Contribution

My main contribution to this paper was working closely with lead author Ryan Gallagher to collect the data from our Twitter database on the VACC. We collected data for a number of hashtags, specifically all of the following:

        import re

        # Case-insensitive hashtag patterns; the list name is illustrative.
        hashtag_patterns = [
            {"re": re.compile(r"#alllivesmatter\b", flags=re.IGNORECASE)},
            {"re": re.compile(r"#bluelivesmatter\b", flags=re.IGNORECASE)},
            {"re": re.compile(r"#policelivesmatter\b", flags=re.IGNORECASE)},
            {"re": re.compile(r"#michaelbrown\b", flags=re.IGNORECASE)},
            {"re": re.compile(r"#ferguson\b", flags=re.IGNORECASE)},
            {"re": re.compile(r"#freddiegray\b", flags=re.IGNORECASE)},
            {"re": re.compile(r"#ericgarner\b", flags=re.IGNORECASE)},
            {"re": re.compile(r"#icantbreathe\b", flags=re.IGNORECASE)},
            {"re": re.compile(r"#sarahbland\b", flags=re.IGNORECASE)},
            {"re": re.compile(r"#templeton\b", flags=re.IGNORECASE)},
        ]

After collecting the Tweets for these hashtags, we reorganized them by user and stored them in a SQLite database using Django, a Python web framework. This web framework was then used to go back and collect the most recent 3,200 Tweets from each public Twitter account that we had found in our initial search. The collection ended on Nov 25th, 2015, so these Tweets were the 3,200 most recent as of that date. From these data, we were able to construct the social networks for analysis of the dynamics of these online communities.
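
The matching and reorganizing steps described above can be sketched as follows. For a self-contained example this uses Python's built-in `sqlite3` module in place of Django's ORM, with synthetic Tweets and hypothetical field, table, and column names; it is an illustration of the idea rather than the collection code itself.

```python
import re
import sqlite3
from collections import defaultdict

patterns = [
    re.compile(r"#ferguson\b", re.IGNORECASE),
    re.compile(r"#alllivesmatter\b", re.IGNORECASE),
]

tweets = [  # synthetic examples, not real data
    {"screen_name": "alice", "text": "Thinking of #Ferguson tonight"},
    {"screen_name": "bob", "text": "good morning"},
    {"screen_name": "alice", "text": "#AllLivesMatter #ferguson"},
    {"screen_name": "carol", "text": "#fergusonville is a different tag"},
]

# Keep only Tweets matching at least one hashtag, grouped by user.
by_user = defaultdict(list)
for tweet in tweets:
    if any(p.search(tweet["text"]) for p in patterns):
        by_user[tweet["screen_name"]].append(tweet["text"])

# Store the matches in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (screen_name TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?)",
    [(name, text) for name, texts in by_user.items() for text in texts],
)
users = [row[0] for row in conn.execute(
    "SELECT DISTINCT screen_name FROM tweets ORDER BY screen_name")]
```

The trailing `\b` in each pattern prevents a hashtag from matching as a prefix of a longer one, which is why the "#fergusonville" Tweet is excluded. The resulting list of distinct users is what a follow-up timeline collection would iterate over.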

5.1 Future directions

First, we look at future research on sentiment analysis, emotional arcs, and the related projects we covered in Chapter 4.

5.1.1 Sentiment analysis

Our work looked in detail at dictionary-based sentiment analysis methodology, focusing on the use of these methods in qualitative and quantitative analysis. Immediate directions for the extension of dictionary-based methods include the creation and use of dictionaries that offer (1) many emotions (Section 1.2.1), (2) MWEs (Section 1.2.4), (3) multiple word senses (Section 1.2.4), and (4) corpus-specific tuning. We reviewed automated methods to build corpus-specific dictionaries in Section 1.2.5, and while most approaches are low precision, we identified directions that provide the highest precision and recall. Combining automated (machine learning, propagation-based) approaches with MWEs, word senses, and many emotions will provide many opportunities for the study of the sentiment properties of language and the improvement of sentiment analysis.

In addition to the improvement of the dictionaries, many unanswered questions remain around the visualization of sentiment analysis measures. We reviewed some approaches in Section 1.2.6 and reiterate that future work can (1) incorporate task-specific usability testing (Munzner, 2014), (2) visualize non-linear features (Ribeiro et al., 2016), and (3) continue to build more tools that enable other researchers to make use of visualization.

5.1.2 Emotional arcs

Here we enumerate some directions for research on emotional arcs in addition to those mentioned at the end of Chapter 3 (see Section 3.4).

The emotional arcs of movies could be considered as a feature driving once-controversial movies towards normalization over time, a closer examination of the trend presented by Amendola et al. (2015). Various studies have examined the changes in the valence of language over time, and in a similar fashion it will be possible to see how the emotional trajectories of stories have changed.

The emotional arc of a book could be used to predict its Library of Congress classification, treating fiction and non-fiction separately to demonstrate the applicability of emotional arcs. In particular, one could feed the coefficient vector from the SVD projection for the first modes into a predictor and see how much predictive power is contained in each mode; this exploration would provide an additional test of how explanatory the first 6 modes are. Clustering on the emotional arc embedding vector would show whether these groups can be separated in a purely unsupervised manner.
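A minimal sketch of this experiment, using only NumPy, is given below. It is an illustration under stated assumptions rather than an analysis of real data: the arcs are synthetic, the two class labels stand in for hypothetical classification categories, and a nearest-centroid rule is swapped in as the simplest possible predictor. Each arc is mean-centered, projected onto the first 6 SVD modes, and each mode's coefficient is then evaluated on its own.

```python
import numpy as np


def mode_coefficients(arcs, n_modes=6):
    """Coefficients of each mean-centered arc on the first n_modes SVD modes."""
    centered = arcs - arcs.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    return U[:, :n_modes] * s[:n_modes]


def nearest_centroid_accuracy(X_train, y_train, X_test, y_test):
    """Accuracy of a nearest-centroid classifier (a stand-in predictor)."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    predictions = classes[dists.argmin(axis=1)]
    return float((predictions == y_test).mean())


# Synthetic stand-in data: class 0 arcs rise, class 1 arcs fall, plus noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 100)
labels = np.repeat([0, 1], 100)
signs = np.where(labels == 0, 1.0, -1.0)
arcs = signs[:, None] * (2.0 * t - 1.0) + 0.3 * rng.standard_normal((200, 100))

coeffs = mode_coefficients(arcs)

# Deterministic split: even-indexed books train, odd-indexed books test.
train = np.arange(0, 200, 2)
test = np.arange(1, 200, 2)

# Predictive power of each mode taken individually.
accuracy_per_mode = [
    nearest_centroid_accuracy(
        coeffs[train, j:j + 1], labels[train],
        coeffs[test, j:j + 1], labels[test],
    )
    for j in range(coeffs.shape[1])
]
```

The SVD here is computed on all arcs without reference to the labels, so only the predictor sees the train and test split. With real data, `labels` would be replaced by the classification codes and the same `coeffs` matrix could be passed to a clustering method for the unsupervised comparison.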

Extending the approach of Bamman (2015) and the validation shown in Figure 1.8, it will remain important to keep people in the loop of the analysis of emotional arcs, since it is our reaction to stories that is being measured. A follow-up project to our work on emotional arcs could build a more complete user study to examine the human aspect of emotion in narrative more directly.

We broadly examined the forefront of NLP research (Section 1.2.4), and can use the advancing methods to answer such questions as “is a character good or bad?”. The analysis of character networks (Section 1.3.3) will continue to improve with identification of the nature of relationships, and the events for particular characters (e.g., birth, marriage, death, and the associated sentiments).

Connecting the scripts, frames, and SIG-like approaches to narrative (see Section 1.3.1 and Section 1.3.4) more directly to the emotional arcs will provide a finer-grained emotional arc representation, connected to the events in a narrative. This approach will in part realize the jump from a bag-of-words to a bag-of-stories approach to natural language. As neural network approaches push the state of the art in NLP, there may be utility in considering architectures that have an explicit representation of abstraction levels. This approach is analogous to the Convolutional Neural Network (CNN) architecture that has proven successful in image recognition tasks. An example structure to build upon is the Historical Thesaurus of English (Kay et al., 2009), as is done by Alexander et al. (2015). In contrast to this proposed approach, the “automatic” feature selection (magic) of neural networks remains powerful (Radford et al., 2017).

5.1.3 Other projects

We have shown that it is possible to build population scale measures of well-being and public health. The Hedonometer and the Lexicocalorimeter can be utilized as only two of many broad measures that extend our dashboard of societal indicators; such additional “meters” of general interest that the Computational StoryLab has considered include such tools as an “insomniometer”. Considering the Lexicocalorimeter, taking these lexical meters from snapshot-in-time analysis to real-time feeds remains a difficult challenge that has been accomplished with and can be extended to additional meters.

There are many improvements possible for the visualizations hosted online at The teletherm animations can be improved through the use of the d3.timer module for smoother animation. Voronoi cells on the map are clipped at the boundary of the contiguous United States using a clipping mask that contains all 50 states as individual paths, and this does not work reliably in Google Chrome. More issues for improvement are noted in the “issues” tab of the online source code repository at In addition, it will be possible to extend the teletherm project to incorporate temperature data from across the world.

5.2 Parting thoughts

Narratives are not unique in their explanation of causal links between events, and often the “adjacent narratives” are in direct competition. We saw in Section 1.3.4 that the disambiguation of competing event chains is an active area of NLP research. This is identified as one factor contributing to information overload on the Internet (Orman, 2015), and participating in a collective cognitive denial of service attack (King et al., 2016). We are biased toward seeing the world through narratives that have the most support from our existing experiences. Embodied in the principle of Occam’s Razor, we often prefer stories that are the simplest. This premise is explored anecdotally (Storr, 2014), and the competition between narratives is a new avenue for computational study.

The use of narratives in science belies an understanding of natural phenomena through metaphor; the consequences of stories in science have been examined by Mahoney and Goertz (2006), Levy (2008), Collier (2011), and Gelman and Basbøll (2014). Narrative itself has been in the spotlight, being put forth to frame the decisions of economists in times of crisis and related to the political functions of democratic elections (Shriller, 2017).

Everyday causality and personal narrative build upon a fundamental assumption of personal agency and free will. Post-hoc rationalization is only useful to explain behavior that was intentional. Deterministic laws of physics are at odds with this worldview, but the science of complex systems has shown us that systems at different levels can exhibit emergent behavior that cannot be predicted from lower-level interactions (Anderson, 1972). Applying computational thinking to the human concepts of metaphor and narrative can force us to further elucidate these distinctions and provide us with a deeper understanding of the world around us as we see it.


  • Abbott (2008) Abbott, H. P. (2008). The Cambridge introduction to narrative. Massachusetts: Cambridge University Press.
  • Abney (1997) Abney, S. (1997). Part-of-speech tagging and partial parsing. In Corpus-based methods in language and speech processing, pp. 118–136. Springer.
  • Agarwal et al. (2011) Agarwal, A., B. Xie, I. Vovsha, O. Rambow, and R. Passonneau (2011). Sentiment analysis of twitter data. In Proceedings of the workshop on languages in social media, pp. 30–38. Association for Computational Linguistics.
  • Alajajian et al. (2016) Alajajian, S. E., J. R. Williams, A. J. Reagan, S. C. Alajajian, M. R. Frank, L. Mitchell, J. Lahne, C. M. Danforth, and P. S. Dodds (2016). The Lexicocalorimeter: Gauging public health through caloric input and output on social media. Available at
  • Alexander et al. (2015) Alexander, M., F. Dallachy, S. Piao, A. Baron, and P. Rayson (2015). Metaphor, popular science, and semantic tagging: Distant reading with the historical thesaurus of english. Digital Scholarship in the Humanities 30(suppl 1), i16–i27.
  • Amendola et al. (2015) Amendola, L., V. Marra, and M. Quartin (2015, Jul). The evolving perception of controversial movies. Palgrave Communications 1, 15038.
  • Amir et al. (2016) Amir, S., R. Astudillo, W. Ling, P. C. Carvalho, and M. J. Silva (2016). Expanding subjective lexicons for social media mining with embedding subspaces. arXiv preprint arXiv:1701.00145.
  • Anderson (1972) Anderson, P. W. (1972). More is different. Science 177(4047), 393–396.
  • Andor et al. (2016) Andor, D., C. Alberti, D. Weiss, A. Severyn, A. Presta, K. Ganchev, S. Petrov, and M. Collins (2016). Globally normalized transition-based neural networks. arXiv preprint arXiv:1603.06042.
  • Awad (2013) Awad, H. (2013). Culturally based story understanding. Ph. D. thesis, Citeseer.
  • Baccianella et al. (2010) Baccianella, S., A. Esuli, and F. Sebastiani (2010). SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In LREC, Volume 10, pp. 2200–2204.
  • Bagrow et al. (2014) Bagrow, J., S. Desu, M. R. Frank, N. Manukyan, L. Mitchell, A. Reagan, E. Bloedorn, L. K. Booker, L. B. Branting, M. J. Smith, B. F. Tivnan, C. M. Danforth, P. S. Dodds, and J. C. Bongard (2014). Shadow networks: Discovering hidden nodes with models of information flow. Preprint available at
  • Baker et al. (1998) Baker, C. F., C. J. Fillmore, and J. B. Lowe (1998). The berkeley framenet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics-Volume 1, pp. 86–90. Association for Computational Linguistics.
  • Bamman (2015) Bamman, D. (2015, Apr). Validity.
  • Bamman et al. (2014) Bamman, D., B. O’Connor, and N. A. Smith (2014). Learning latent personas of film characters. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 352.
  • Bamman et al. (2014) Bamman, D., T. Underwood, and N. A. Smith (2014). A bayesian mixed effects model of literary character. In ACL (1), pp. 370–379.
  • Bar-Haim et al. (2011) Bar-Haim, R., E. Dinur, R. Feldman, M. Fresko, and G. Goldstein (2011). Identifying and following expert investors in stock microblogs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1310–1319. Association for Computational Linguistics.
  • Bestgen et al. (2008) Bestgen, Y. et al. (2008). Building affective lexicons from specific corpora for automatic sentiment analysis. In LREC. Citeseer.
  • Bestgen and Vincze (2012) Bestgen, Y. and N. Vincze (2012). Checking and bootstrapping lexical norms by means of word similarity indexes. Behavior research methods 44(4), 998–1006.
  • Bex (2013) Bex, F. (2013). Values as the point of a story.
  • Bex and Bench-Capon (2010) Bex, F. J. and T. J. Bench-Capon (2010). Persuasive stories for multi-agent argumentation. In AAAI Fall Symposium: Computational Models of Narrative, Volume 10, pp.  04.
  • Bird (2006) Bird, S. (2006). Nltk: the natural language toolkit. In Proceedings of the COLING/ACL on Interactive presentation sessions, pp. 69–72. Association for Computational Linguistics.
  • Blair-Goldensohn et al. (2008) Blair-Goldensohn, S., K. Hannan, R. McDonald, T. Neylon, G. A. Reis, and J. Reynar (2008). Building a sentiment summarizer for local service reviews. In WWW workshop on NLP in the information explosion era, Volume 14, pp. 339–348.
  • Bollen et al. (2011) Bollen, J., H. Mao, and X. Zeng (2011). Twitter mood predicts the stock market. Journal of Computational Science 2(1), 1–8.
  • Booker (2006) Booker, C. (2006). The Seven Basic Plots: Why We Tell Stories. New York: Bloomsbury Academic.
  • Bost et al. (2016) Bost, X., V. Labatut, and G. Linarès (2016). Narrative smoothing: dynamic conversational network for the analysis of tv series plots.
  • Bostock et al. (2011) Bostock, M., V. Ogievetsky, and J. Heer (2011). D3: Data-driven documents. IEEE Trans. Visualization & Comp. Graphics (Proc. InfoVis).
  • Bradley and Lang (1999) Bradley, M. M. and P. J. Lang (1999). Affective norms for english words (ANEW): Stimuli, instruction manual and affective ratings. Technical report c-1, University of Florida, Gainesville, FL.
  • Brewer and Lichtenstein (1980) Brewer, W. F. and E. H. Lichtenstein (1980). Event schemas, story schemas, and story grammars. Center for the Study of Reading Technical Report; no. 197.
  • Cambria et al. (2014) Cambria, E., D. Olsher, and D. Rajagopal (2014). Senticnet 3: a common and common-sense knowledge base for cognition-driven sentiment analysis. In Proceedings of the twenty-eighth AAAI conference on artificial intelligence, pp. 1515–1521. AAAI Press.
  • Cambria and White (2014) Cambria, E. and B. White (2014). Jumping nlp curves: a review of natural language processing research [review article]. IEEE Computational Intelligence Magazine 9(2), 48–57.
  • Campbell (1949) Campbell, J. (1949). The Hero with a Thousand Faces (third ed.). California: New World Library.
  • Campbell and Moyers (1991) Campbell, J. and B. Moyers (1991). The Power of Myth. Anchor.
  • Cao and Cui (2016) Cao, N. and W. Cui (2016). Introduction to text visualization. Atlantis briefs in artificial intelligence ( 1.
  • Card et al. (2015) Card, D., A. E. Boydstun, J. H. Gross, P. Resnik, and N. A. Smith (2015). The media frames corpus: Annotations of frames across issues. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Volume 2, pp. 438–444.
  • Cave (2013) Cave, S. (2013, Jul). The 4 stories we tell ourselves about death.
  • Chambers (2013) Chambers, N. (2013). Event schema induction with a probabilistic entity-driven model. In EMNLP, Volume 13, pp. 1797–1807.
  • Chambers and Jurafsky (2008) Chambers, N. and D. Jurafsky (2008). Unsupervised learning of narrative event chains. In ACL, Volume 94305, pp. 789–797. Citeseer.
  • Chambers and Jurafsky (2009) Chambers, N. and D. Jurafsky (2009). Unsupervised learning of narrative schemas and their participants. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, pp. 602–610. Association for Computational Linguistics.
  • Chambers and Jurafsky (2010) Chambers, N. and D. Jurafsky (2010). A database of narrative schemas. In LREC.
  • Chambers et al. (2007) Chambers, N., S. Wang, and D. Jurafsky (2007). Classifying temporal relations between events. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 173–176. Association for Computational Linguistics.
  • Chen and Manning (2014) Chen, D. and C. D. Manning (2014). A fast and accurate dependency parser using neural networks. In EMNLP, pp. 740–750.
  • Cherny (2016) Cherny, L. (2016, Jun). The bones of a bestseller: Visualizing fiction.
  • Cheung et al. (2013) Cheung, J. C. K., H. Poon, and L. Vanderwende (2013). Probabilistic frame induction. arXiv preprint arXiv:1302.4813.
  • Chuang et al. (2012) Chuang, J., C. D. Manning, and J. Heer (2012). Termite: Visualization techniques for assessing textual topic models. In Proceedings of the International Working Conference on Advanced Visual Interfaces, pp. 74–77. ACM.
  • Chuang et al. (2012) Chuang, J., D. Ramage, C. Manning, and J. Heer (2012). Interpretation and trust: Designing model-driven visualizations for text analysis. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 443–452. ACM.
  • Chung and Liu (2011) Chung, S. and S. Liu (2011). Predicting stock market fluctuations from Twitter. Berkeley, California.
  • Church (1988) Church, K. W. (1988). A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of the second conference on Applied natural language processing, pp. 136–143. Association for Computational Linguistics.
  • Clancy (2015) Clancy, E. (2015). A fabula of syuzhet: A contretemps of digital humanities and sentiment analysis. and
  • Clauset et al. (2009) Clauset, A., C. R. Shalizi, and M. E. J. Newman (2009). Power-law distributions in empirical data. SIAM Review 51, 661–703.
  • Cobley (2005) Cobley, P. (2005). Narratology. The Johns Hopkins Guide to Literary Theory and Criticism, 2nd ed. Johns Hopkins University Press, London.
  • Cody et al. (2015) Cody, E. M., A. J. Reagan, L. Mitchell, P. S. Dodds, and C. M. Danforth (2015). Climate change sentiment on twitter: An unsolicited public opinion poll. PLOS ONE.
  • Collier (2011) Collier, D. (2011). Understanding process tracing. PS: Political Science & Politics 44(04), 823–830.
  • da Silva and Tehrani (2016) da Silva, S. G. and J. J. Tehrani (2016). Comparative phylogenetic analyses uncover the ancient roots of Indo-European folktales. Royal Society Open Science 3(1).
  • DARPA (2011) DARPA (2011, 12). Broad agency announcement: Narrative networks. available at, accessed June 20, 2016.
  • Das and Chen (2007) Das, S. R. and M. Y. Chen (2007). Yahoo! for amazon: Sentiment extraction from small talk on the web. Management Science 53(9), 1375–1388.
  • De Smedt and Daelemans (2012) De Smedt, T. and W. Daelemans (2012). Pattern for Python. The Journal of Machine Learning Research 13(1), 2063–2067.
  • Decadt et al. (2004) Decadt, B., V. Hoste, W. Daelemans, and A. Van den Bosch (2004). Gambl, genetic algorithm optimization of memory-based wsd. In Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pp. 108–112. Association for Computational Linguistics.
  • DeGraff and Harmon (2015) DeGraff, A. and D. Harmon (2015). Plotted: A Literary Atlas. Houghton Mifflin Harcourt.
  • DeRose (1988) DeRose, S. J. (1988). Grammatical category disambiguation by statistical optimization. Computational linguistics 14(1), 31–39.
  • Do et al. (2011) Do, Q. X., Y. S. Chan, and D. Roth (2011). Minimally supervised event causality identification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’11, Stroudsburg, PA, USA, pp. 294–303. Association for Computational Linguistics.
  • Dodds (2013) Dodds, P. S. (2013). Homo Narrativus and the trouble with fame. Nautilus Magazine.
  • Dodds et al. (2015a) Dodds, P. S., E. M. Clark, S. Desu, M. R. Frank, A. J. Reagan, J. R. Williams, L. Mitchell, K. D. Harris, I. M. Kloumann, J. P. Bagrow, K. Megerdoomian, M. T. McMahon, B. F. Tivnan, and C. M. Danforth (2015a). Human language reveals a universal positivity bias. PNAS 112(8), 2389–2394.
  • Dodds et al. (2015b) Dodds, P. S., E. M. Clark, S. Desu, M. R. Frank, A. J. Reagan, J. R. Williams, L. Mitchell, K. D. Harris, I. M. Kloumann, J. P. Bagrow, K. Megerdoomian, M. T. McMahon, B. F. Tivnan, and C. M. Danforth (2015b). Reply to garcia et al.: Common mistakes in measuring frequency-dependent word characteristics. Proceedings of the National Academy of Sciences 112(23), E2984–E2985.
  • Dodds and Danforth (2009) Dodds, P. S. and C. M. Danforth (2009, July). Measuring the happiness of large-scale written expression: Songs, blogs, and presidents. Journal of Happiness Studies 11(4), 441–456.
  • Dodds et al. (2011) Dodds, P. S., K. D. Harris, I. M. Kloumann, C. A. Bliss, and C. M. Danforth (2011, 12). Temporal patterns of happiness and information in a global social network: Hedonometrics and Twitter. PLoS ONE 6(12), e26752.
  • Dodds et al. (2016) Dodds, P. S., L. Mitchell, A. J. Reagan, and C. M. Danforth (2016, may). Tracking climate change through the spatiotemporal dynamics of the teletherms, the statistically hottest and coldest days of the year. PLOS ONE 11(5), e0154184.
  • Dolby (2008) Dolby, S. K. (2008). Literary Folkloristics and the Personal Narrative. Indiana: Trickster Press.
  • Dundes (1997) Dundes, A. (1997). The motif-index and the tale type index: A critique. Journal of Folklore Research, 195–202.
  • Ekman (1992) Ekman, P. (1992). An argument for basic emotions. Cognition & emotion 6(3-4), 169–200.
  • Elson (2012a) Elson, D. K. (2012a). Detecting story analogies from annotations of time, action and agency. In Proceedings of the LREC 2012 Workshop on Computational Models of Narrative, Istanbul, Turkey.
  • Elson (2012b) Elson, D. K. (2012b). Modeling narrative discourse. Ph. D. thesis, Citeseer.
  • Elson et al. (2010) Elson, D. K., N. Dames, and K. R. McKeown (2010). Extracting social networks from literary fiction. In Proceedings of the 48th annual meeting of the association for computational linguistics, pp. 138–147. Association for Computational Linguistics.
  • Emerson et al. (2015) Emerson, J., N. Churcher, and A. Cockburn (2015). Tag clouds for software and information visualisation. In Proceedings of the 14th Annual ACM SIGCHI_NZ conference on Computer-Human Interaction, pp.  1. ACM.
  • Enderle (2015) Enderle, S. (2015, April). What’s a sine wave of sentiment?
  • Enderle (2016) Enderle, S. (2016, Sept). Brownian noise and plot arcs.
  • Esuli and Sebastiani (2006) Esuli, A. and F. Sebastiani (2006). Sentiwordnet: A publicly available lexical resource for opinion mining. In Proceedings of LREC, Volume 6, pp. 417–422. Citeseer.
  • Feinberg (2009) Feinberg, J. (2009). Wordle-beautiful word clouds.
  • Fellbaum (1998) Fellbaum, C. (Ed.) (1998). WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.
  • Finlayson (2011) Finlayson, M. A. (2011). Learning narrative structure from annotated folktales. Ph. D. thesis, Massachusetts Institute of Technology.
  • Gabasova (2015) Gabasova, E. (2015, Dec). The star wars social network.
  • Gallagher et al. (2016) Gallagher, R. J., A. J. Reagan, C. M. Danforth, and P. S. Dodds (2016). Divergent discourse between protests and counter-protests: #blacklivesmatter and #alllivesmatter. CoRR abs/1606.06820.
  • Gao et al. (2016) Gao, J., M. L. Jockers, J. Laudun, and T. Tangherlini (2016). A multiscale theory for the dynamical evolution of sentiment in novels. In Behavioral, Economic and Socio-cultural Computing (BESC), 2016 International Conference on, pp. 1–4. IEEE.
  • Garcia et al. (2015) Garcia, D., A. Garas, and F. Schweitzer (2015). The language-dependent relationship between word happiness and frequency. Proceedings of the National Academy of Sciences 112(23), E2983.
  • Gelman and Basbøll (2014) Gelman, A. and T. Basbøll (2014). When do stories work? evidence and illustration in the social sciences. Sociological Methods & Research, 0049124114526377.
  • Giachanou and Crestani (2016) Giachanou, A. and F. Crestani (2016, June). Like it or not: A survey of twitter sentiment analysis methods. ACM Comput. Surv. 49(2), 28:1–28:41.
  • Gleick (2011) Gleick, J. (2011). The Information: A History, A Theory, A Flood. New York: Pantheon.
  • Golder and Macy (2011) Golder, S. A. and M. W. Macy (2011). Diurnal and seasonal mood vary with work, sleep, and daylength across diverse cultures. Science Magazine 333, 1878–1881.
  • Gonçalves et al. (2013) Gonçalves, P., M. Araújo, F. Benevenuto, and M. Cha (2013). Comparing and combining sentiment analysis methods. In Proceedings of the first ACM conference on Online social networks, pp. 27–38. ACM.
  • Gottesman et al. (2014) Gottesman, W. G., A. J. Reagan, and P. S. Dodds (2014). Collective philanthropy: Describing and modeling the ecology of giving. PLoS ONE 9, e98876.
  • Gottschall (2013) Gottschall, J. (2013). The Storytelling Animal: How Stories Make Us Human. New York, NY: Mariner Books.
  • Goyal et al. (2013) Goyal, A., E. Riloff, et al. (2013). A computational model for plot units. Computational Intelligence 29(3), 466–488.
  • Halvey and Keane (2007) Halvey, M. J. and M. T. Keane (2007). An assessment of tag presentation techniques. In Proceedings of the 16th international conference on World Wide Web, pp. 1313–1314. ACM.
  • Hamann (2012) Hamann, S. (2012). Mapping discrete and dimensional emotions onto the brain: controversies and consensus. Trends in cognitive sciences 16(9), 458–466.
  • Hamilton et al. (2016) Hamilton, W. L., K. Clark, J. Leskovec, and D. Jurafsky (2016). Inducing domain-specific sentiment lexicons from unlabeled corpora. arXiv preprint arXiv:1606.02820.
  • Hand and Yu (2001) Hand, D. J. and K. Yu (2001). Idiot’s bayes—not so stupid after all? International statistical review 69(3), 385–398.
  • Handler et al. (2016) Handler, A., M. J. Denny, H. Wallach, and B. O’Connor (2016). Bag of what? simple noun phrase extraction for text analysis. NLP+ CSS 2016, 114.
  • Harris (1959) Harris, W. F. (1959). The basic patterns of plot. Oklahoma: University of Oklahoma Press.
  • Harrison et al. (2010) Harrison, N. A., M. A. Gray, P. J. Gianaros, and H. D. Critchley (2010). The embodiment of emotional feelings in the brain. Journal of Neuroscience 30(38), 12878–12884.
  • Hatzivassiloglou and McKeown (1997) Hatzivassiloglou, V. and K. R. McKeown (1997). Predicting the semantic orientation of adjectives. In Proceedings of the eighth conference on European chapter of the Association for Computational Linguistics, pp. 174–181. Association for Computational Linguistics.
  • Hearst (2009) Hearst, M. (2009). Search user interfaces. Cambridge University Press.
  • Hearst and Rosner (2008) Hearst, M. A. and D. Rosner (2008). Tag clouds: Data analysis tool or social signaller? In Hawaii International Conference on System Sciences, Proceedings of the 41st Annual, pp. 160–160. IEEE.
  • Heer (2014) Heer, J. (2014). Text visualization. CSE 512 Lecture available at
  • Honnibal (2015) Honnibal, M. (2015, Aug). Displaying linguistic structure with css.
  • Hu and Liu (2004) Hu, M. and B. Liu (2004). Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 168–177. ACM.
  • Hutto and Gilbert (2014) Hutto, C. J. and E. Gilbert (2014, May). Vader: A parsimonious rule-based model for sentiment analysis of social media text. In Eighth International AAAI Conference on Weblogs and Social Media. AAAI Publications.
  • Jack et al. (2014) Jack, R. E., O. G. Garrod, and P. G. Schyns (2014). Dynamic facial expressions of emotion transmit an evolving hierarchy of signals over time. Current biology 24(2), 187–192.
  • Jockers (2014) Jockers, M. (2014, Jun). A novel method for detecting plot.
  • Jockers (2015) Jockers, M. (2015, Feb). The rest of the story.
  • Jockers (2013) Jockers, M. L. (2013). Macroanalysis: Digital methods and literary history. University of Illinois Press.
  • Kaji and Kitsuregawa (2007) Kaji, N. and M. Kitsuregawa (2007). Building lexicon for sentiment analysis from massive collection of HTML documents. In EMNLP-CoNLL, pp. 1075–1083.
  • Kay et al. (2009) Kay, C., J. Roberts, M. Samuels, and I. Wotherspoon (2009). Historical Thesaurus of the Oxford English Dictionary. Oxford University Press.
  • Kiley et al. (2016) Kiley, D. P., A. J. Reagan, L. Mitchell, C. M. Danforth, and P. S. Dodds (2016, May). Game story space of professional sports: Australian rules football. Phys. Rev. E 93, 052314.
  • Kim and Hovy (2004) Kim, S.-M. and E. Hovy (2004). Determining the sentiment of opinions. In Proceedings of the 20th international conference on Computational Linguistics, pp. 1367. Association for Computational Linguistics.
  • King et al. (2016) King, G., J. Pan, and M. E. Roberts (2016). How the chinese government fabricates social media posts for strategic distraction, not engaged argument. Harvard University.
  • Kiritchenko et al. (2014) Kiritchenko, S., X. Zhu, and S. M. Mohammad (2014). Sentiment analysis of short informal texts. Journal of Artificial Intelligence Research 50, 723–762.
  • Kirschenbaum (2007) Kirschenbaum, M. G. (2007). The remaking of reading: Data mining and the digital humanities. In The National Science Foundation Symposium on Next Generation of Data Mining and Cyber-Enabled Discovery for Innovation, Maryland.
  • Koch et al. (2016) Koch, A., H. Alves, T. Krüger, and C. Unkelbach (2016). A general valence asymmetry in similarity: Good is more alike than bad. Journal of Experimental Psychology: Learning, Memory, and Cognition 42(8), 1171.
  • Kohonen (1990) Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE 78(9), 1464–1480.
  • Kosinski et al. (2013) Kosinski, M., D. Stillwell, and T. Graepel (2013). Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences 110(15), 5802–5805.
  • Kuster (2015) Kuster, D. (2015, Jul). Exploring the shapes of stories using python and sentiment apis.
  • Lee et al. (2010) Lee, B., N. H. Riche, A. K. Karlson, and S. Carpendale (2010). Sparkclouds: Visualizing trends in tag clouds. IEEE transactions on visualization and computer graphics 16(6), 1182–1189.
  • Lehnert (1981) Lehnert, W. G. (1981). Plot units and narrative summarization. Cognitive Science 5(4), 293–331.
  • Levallois (2013) Levallois, C. (2013). Umigon: sentiment analysis for tweets based on terms lists and heuristics. In Second Joint Conference on Lexical and Computational Semantics (* SEM), Volume 2, pp. 414–417.
  • Levy (2008) Levy, J. S. (2008). Case studies: Types, designs, and logics of inference. Conflict Management and Peace Science 25(1), 1–18.
  • Li et al. (2012) Li, B., S. Lee-Urban, D. S. Appling, and M. O. Riedl (2012). Crowdsourcing narrative intelligence. Advances in Cognitive Systems 2, 25–42.
  • Li et al. (2013) Li, B., S. Lee-Urban, G. Johnston, and M. Riedl (2013). Story generation with crowdsourced plot graphs. In AAAI.
  • Lin et al. (2012) Lin, Y., J.-B. Michel, E. L. Aiden, J. Orwant, W. Brockman, and S. Petrov (2012). Syntactic annotations for the google books ngram corpus. In Proceedings of the ACL 2012 system demonstrations, pp. 169–174. Association for Computational Linguistics.
  • Lindquist et al. (2016) Lindquist, K., M. Gendron, A. Satpute, L. Barrett, M. Lewis, and J. Haviland-Jones (2016). Language and emotion: Putting words into feelings and feelings into words. Handbook of emotions.
  • Liu (2010) Liu, B. (2010). Sentiment analysis and subjectivity. Handbook of natural language processing 2, 627–666.
  • Liu (2012) Liu, B. (2012, May). Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies. San Rafael, CA: Morgan & Claypool Publishers.
  • Liu et al. (2007) Liu, Y., X. Huang, A. An, and X. Yu (2007). Arsa: a sentiment-aware model for predicting sales performance using blogs. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 607–614. ACM.
  • Lohmann et al. (2009) Lohmann, S., J. Ziegler, and L. Tetzlaff (2009). Comparison of tag cloud layouts: Task-related performance and visual exploration. In IFIP Conference on Human-Computer Interaction, pp. 392–404. Springer.
  • Luo et al. (2012) Luo, Z., M. Osborne, and T. Wang (2012). Opinion retrieval in twitter. In ICWSM.
  • MacDonald (1982) MacDonald, M. R. (1982). Storytellers Sourcebook: A Subject, Title, and Motif Index to Folklore Collections for Children. Michigan: Gale Group.
  • Mahoney and Goertz (2006) Mahoney, J. and G. Goertz (2006). A tale of two cultures: Contrasting quantitative and qualitative research. Political analysis 14(3), 227–249.
  • Mandera et al. (2015) Mandera, P., E. Keuleers, and M. Brysbaert (2015). How useful are corpus-based methods for extrapolating psycholinguistic variables? The Quarterly Journal of Experimental Psychology 68(8), 1623–1642.
  • Mani (2012) Mani, I. (2012). Computational modeling of narrative. Synthesis Lectures on Human Language Technologies 5(3), 1–142.
  • Mani et al. (2006) Mani, I., M. Verhagen, B. Wellner, C. M. Lee, and J. Pustejovsky (2006). Machine learning of temporal relations. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pp. 753–760. Association for Computational Linguistics.
  • Manning et al. (2014) Manning, C. D., M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, and D. McClosky (2014). The stanford corenlp natural language processing toolkit. In ACL (System Demonstrations), pp. 55–60.
  • Marcus et al. (1993) Marcus, M. P., M. A. Marcinkiewicz, and B. Santorini (1993). Building a large annotated corpus of english: The penn treebank. Computational linguistics 19(2), 313–330.
  • McCloud (2006) McCloud, S. (2006). Making comics: storytelling secrets of comics, manga and graphic novels. New York: Harper.
  • McIntyre and Lapata (2010) McIntyre, N. and M. Lapata (2010). Plot induction and evolutionary search for story generation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL ’10, Stroudsburg, PA, USA, pp. 1562–1572. Association for Computational Linguistics.
  • Meeks and Averick (Meeks and Averick) Meeks, E. and M. Averick. A data-driven exploration of archer.
  • Michel et al. (2011) Michel, J.-B., Y. K. Shen, A. P. Aiden, A. Veres, M. K. Gray, J. P. Pickett, D. Hoiberg, D. Clancy, P. Norvig, J. Orwant, et al. (2011). Quantitative analysis of culture using millions of digitized books. Science 331(6014), 176–182.
  • Mikolov and Dean (2013) Mikolov, T. and J. Dean (2013). Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems.
  • Min and Park (2016) Min, S. and J. Park (2016). Narrative as a complex network: A study of Victor Hugo’s les misérables. In Proceedings of HCI Korea.
  • Mitchell et al. (2013) Mitchell, L., M. R. Frank, K. D. Harris, P. S. Dodds, and C. M. Danforth (2013, May). The Geography of Happiness: Connecting Twitter Sentiment and Expression, Demographics, and Objective Characteristics of Place. PLoS ONE 8(5), e64417.
  • Mohammad et al. (2013) Mohammad, S. M., S. Kiritchenko, and X. Zhu (2013, June). Nrc-canada: Building the state-of-the-art in sentiment analysis of tweets. In Proceedings of the seventh international workshop on Semantic Evaluation Exercises (SemEval-2013), Atlanta, Georgia, USA.
  • Mohammad and Turney (2013) Mohammad, S. M. and P. D. Turney (2013). Crowdsourcing a word–emotion association lexicon. Computational Intelligence 29(3), 436–465.
  • Moretti (2000) Moretti, F. (2000). Conjectures on world literature. New Left Review 1, 54.
  • Moretti (2007) Moretti, F. (2007). Graphs, Maps, Trees: Abstract Models for a Literary History. New York: Verso.
  • Moretti (2013) Moretti, F. (2013). Distant Reading. New York: Verso.
  • Mostafazadeh et al. (2016) Mostafazadeh, N., N. Chambers, X. He, D. Parikh, D. Batra, L. Vanderwende, P. Kohli, and J. Allen (2016, June). A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 839–849. Association for Computational Linguistics.
  • Munroe (2009) Munroe, R. (2009, 11). Movie narrative charts.
  • Munzner (2014) Munzner, T. (2014). Visualization analysis and design. CRC Press.
  • Nenkova and McKeown (2012) Nenkova, A. and K. McKeown (2012). A survey of text summarization techniques. In Mining text data, pp. 43–76. Berlin, Germany: Springer.
  • Neukom Institute (2016) Neukom Institute, D. (2016). Turing tests in creative arts: Digilit 2016.
  • Nickerson (1998) Nickerson, R. S. (1998). Confirmation Bias; A ubiquitous phenomenon in many guises. Review of General Psychology 2, 175–220.
  • Nielsen (2011) Nielsen, F. Å. (2011, May). A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. In M. Rowe, M. Stankovic, A.-S. Dadzie, and M. Hardey (Eds.), Proceedings of the ESWC2011 Workshop on ’Making Sense of Microposts’: Big things come in small packages, Volume 718 of CEUR Workshop Proceedings, pp. 93–98.
  • O’Connor (2013) O’Connor, B. (2013). Learning frames from text with an unsupervised latent variable model. arXiv preprint arXiv:1307.7382.
  • Ogawa and Ma (2010) Ogawa, M. and K.-L. Ma (2010). Software evolution storylines. In Proceedings of the 5th international symposium on Software visualization, pp. 35–42. ACM.
  • Orman (2015) Orman, L. V. (2015). Information paradox: Drowning in information, starving for knowledge.
  • Owoputi et al. (2013) Owoputi, O., B. O’Connor, C. Dyer, K. Gimpel, N. Schneider, and N. A. Smith (2013). Improved part-of-speech tagging for online conversational text with word clusters. Association for Computational Linguistics.
  • Pang and Lee (2004) Pang, B. and L. Lee (2004). A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the ACL.
  • Pappas et al. (2013) Pappas, N., G. Katsimpras, and E. Stamatatos (2013). Distinguishing the popularity between topics: A system for up-to-date opinion retrieval and mining in the web. In International Conference on Intelligent Text Processing and Computational Linguistics, pp. 197–209. Springer.
  • Pechenick et al. (2015) Pechenick, E. A., C. M. Danforth, and P. S. Dodds (2015). Characterizing the google books corpus: Strong limits to inferences of socio-cultural and linguistic evolution. arXiv preprint arXiv:1501.00960.
  • Pennebaker et al. (2001) Pennebaker, J. W., M. E. Francis, and R. J. Booth (2001). Linguistic inquiry and word count: LIWC 2001. Mahway: Lawrence Erlbaum Associates 71, 2001.
  • Pichotta and Mooney (2015) Pichotta, K. and R. J. Mooney (2015). Learning statistical scripts with lstm recurrent neural networks. In Proceedings of the 30th AAAI Conference on Artificial Intelligence.
  • Piper (2015a) Piper, A. (2015a). Novel devotions: Conversional reading, computational modeling, and the modern novel. New Literary History 46(1), 63–98.
  • Piper (2015b) Piper, A. (2015b, Mar). Validation and subjective computing.
  • Plutchik (1991) Plutchik, R. (1991). The emotions. University Press of America.
  • Plutchik (2001) Plutchik, R. (2001). The nature of emotions: Human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice. American scientist 89(4), 344–350.
  • Polanyi and Zaenen (2006) Polanyi, L. and A. Zaenen (2006). Contextual valence shifters. In Computing attitude and affect in text: Theory and applications, pp. 1–10. Springer.
  • Polti (1921) Polti, G. (1921). The Thirty-Six Dramatic Situations. Ohio: James Knapp Reeve.
  • Poria et al. (2013) Poria, S., A. Gelbukh, A. Hussain, N. Howard, D. Das, and S. Bandyopadhyay (2013). Enhanced senticnet with affective labels for concept-based opinion mining. IEEE Intelligent Systems 28(2), 31–38.
  • Porter (2001) Porter, M. F. (2001). Snowball: A language for stemming algorithms.
  • Prado et al. (2016) Prado, S. D., S. R. Dahmen, A. L. C. Bazzan, P. M. Carron, and R. Kenna (2016). Temporal network analysis of literary texts.
  • Pratchett et al. (2003) Pratchett, T., I. Stewart, and J. Cohen (2003). The Science of Discworld II: The Globe. London, UK: Ebury Press.
  • Propp (1968) Propp, V. (1968). Morphology of the Folktale. 1928. Texas: Texas University Press.
  • Pustejovsky et al. (2003) Pustejovsky, J., P. Hanks, R. Sauri, A. See, R. Gaizauskas, A. Setzer, D. Radev, B. Sundheim, D. Day, L. Ferro, et al. (2003). The timebank corpus. In Corpus linguistics, Volume 2003, pp.  40.
  • Radford et al. (2017) Radford, A., R. Jozefowicz, and I. Sutskever (2017). Learning to generate reviews and discovering sentiment.
  • Raftery (2011) Raftery, B. (2011, Sep). How Dan Harmon drives himself crazy making Community. at, accessed June 20, 2016.
  • Rao and Ravichandran (2009) Rao, D. and D. Ravichandran (2009). Semi-supervised polarity lexicon induction. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pp. 675–682. Association for Computational Linguistics.
  • Rayner (1985) Rayner, J. M. V. (1985). Linear relations in biomechanics: the statistics of scaling functions. J. Zool. Lond. (A) 206, 415–439.
  • Reagan et al. (2015) Reagan, A., B. Tivnan, J. R. Williams, C. M. Danforth, and P. S. Dodds (2015). Benchmarking sentiment analysis methods for large-scale texts: A case for using continuum-scored words and word shift graphs. Preprint available at
  • Regneri et al. (2010) Regneri, M., A. Koller, and M. Pinkal (2010). Learning script knowledge with web experiments. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL ’10, Stroudsburg, PA, USA, pp. 979–988. Association for Computational Linguistics.
  • Reiter et al. (2014) Reiter, N., A. Frank, and O. Hellwig (2014). An nlp-based cross-document approach to narrative structure discovery. Literary and Linguistic Computing 29(4), 583–605.
  • Ribeiro et al. (2016) Ribeiro, F. N., M. Araújo, P. Gonçalves, M. André Gonçalves, and F. Benevenuto (2016, July). SentiBench — a benchmark comparison of state-of-the-practice sentiment analysis methods. EPJ Data Sci. 5(1), 23.
  • Ribeiro et al. (2016) Ribeiro, M. T., S. Singh, and C. Guestrin (2016). Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM.
  • Riedl (2016) Riedl, M. O. (2016). Computational narrative intelligence: A human-centered goal for artificial intelligence. arXiv preprint arXiv:1602.06484.
  • Riedl and Harrison (2015) Riedl, M. O. and B. Harrison (2015). Using stories to teach human values to artificial agents.
  • Rivadeneira et al. (2007) Rivadeneira, A. W., D. M. Gruen, M. J. Muller, and D. R. Millen (2007). Getting our head in the clouds: toward evaluation studies of tagclouds. In Proceedings of the SIGCHI conference on Human factors in computing systems, pp. 995–998. ACM.
  • Robinson (2008) Robinson, D. L. (2008). Brain function, emotional experience and personality. Netherlands Journal of Psychology 64(4), 152–168.
  • Roemmele et al. (2017) Roemmele, M., S. Kobayashi, N. Inoue, and A. Gordon (2017, April). An rnn-based binary classifier for the story cloze test. Proceedings of Linking Models of Lexical, Sentential and Discourse-level Semantics, workshop at European Association for Computational Linguistics.
  • Rothe et al. (2016) Rothe, S., S. Ebert, and H. Schütze (2016). Ultradense word embeddings by orthogonal transformation. arXiv preprint arXiv:1602.07572.
  • Rousseeuw (1987) Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics 20, 53–65.
  • Ruiz et al. (2012) Ruiz, E. J., V. Hristidis, C. Castillo, A. Gionis, and A. Jaimes (2012). Correlating financial time series with micro-blogging activity. In Proceedings of the fifth ACM international conference on Web search and data mining, pp. 513–522. ACM.
  • Rumelhart (1975) Rumelhart, D. E. (1975). Notes on a schema for stories. Representation and understanding: Studies in cognitive science 211(236), 45.
  • Russell (1980) Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology 39(6), 1161–1178.
  • Ruths (2016) Ruths, D. (2016, Mar). Why the force awakens is not just a remake of a new hope.
  • Saif et al. (2013) Saif, H., M. Fernandez, Y. He, and H. Alani (2013). Evaluation datasets for twitter sentiment analysis: A survey and a new dataset, the sts-gold.
  • Sandhaus (2008) Sandhaus, E. (2008). The New York Times Annotated Corpus. Linguistic Data Consortium, Philadelphia.
  • Schank and Abelson (1977) Schank, R. C. and R. P. Abelson (1977). Scripts, plans, goals, and understanding: An inquiry into human knowledge structures. Psychology Press.
  • Schmidt (2015a) Schmidt, B. (2015a, Apr). Commodius vici of recirculation: The real problem with syuzhet.
  • Schmidt (2016) Schmidt, B. (2016, Jul). Plot arceology 2016: emotion and tension.
  • Schmidt (2015b) Schmidt, B. M. (2015b). Plot arceology: A vector-space model of narrative structure. In Big Data (Big Data), 2015 IEEE International Conference on, pp. 1667–1672. IEEE.
  • Schrammel et al. (2009) Schrammel, J., M. Leitner, and M. Tscheligi (2009). Semantically structured tag clouds: an empirical evaluation of clustered presentation approaches. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 2037–2040. ACM.
  • Schrauf and Sanchez (2004) Schrauf, R. W. and J. Sanchez (2004). The preponderance of negative emotion words across generations and across cultures. Journal of Multilingual and Multicultural Development 25, 266–284.
  • Schulz (2011) Schulz, K. (2011, Jun). What is distant reading?
  • Shriller (2017) Shriller, R. J. (2017, Jan). Narrative economics. In C. F. for Research In Economics (Ed.), 129th annual meeting of the American Economic Association, Number 2069.
  • Si et al. (2013) Si, J., A. Mukherjee, B. Liu, Q. Li, H. Li, and X. Deng (2013). Exploiting topic based Twitter sentiment for stock prediction. In ACL (2), pp. 24–29.
  • Snow et al. (2008) Snow, R., B. O’Connor, D. Jurafsky, and A. Y. Ng (2008). Cheap and fast—but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the conference on empirical methods in natural language processing, pp. 254–263. Association for Computational Linguistics.
  • Snyder and Palmer (2004) Snyder, B. and M. Palmer (2004). The english all-words task. In Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pp. 41–43. Association for Computational Linguistics.
  • Socher et al. (2013) Socher, R., A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the conference on empirical methods in natural language processing (EMNLP), pp. 1631–1642. Citeseer.
  • Stone et al. (1966) Stone, P. J., D. C. Dunphy, and M. S. Smith (1966). The general inquirer: A computer approach to content analysis. MIT Press.
  • Storr (2014) Storr, W. (2014). The unpersuadables: Adventures with the enemies of science. The Overlook Press.
  • Swafford (2015) Swafford, A. (2015, Mar). Problems with the syuzhet package.
  • Taboada et al. (2011) Taboada, M., J. Brooke, M. Tofiloski, K. Voll, and M. Stede (2011). Lexicon-based methods for sentiment analysis. Computational linguistics 37(2), 267–307.
  • Taboada and Grieve (2004) Taboada, M. and J. Grieve (2004). Analyzing appraisal automatically. In Proceedings of AAAI Spring Symposium on Exploring Attitude and Affect in Text (AAAI Technical Report SS-04-07), Stanford University, CA, pp. 158–161. AAAI Press.
  • Tang et al. (2014) Tang, D., F. Wei, B. Qin, M. Zhou, and T. Liu (2014). Building large-scale twitter-specific sentiment lexicon: A representation learning approach. In COLING, pp. 172–182.
  • Tenenbaum et al. (2015) Tenenbaum, D. J., K. Barrett, S. Medaris, and T. Devitt (2015, February). In 10 languages, happy words beat sad ones.
  • Thelwall et al. (2012) Thelwall, M., K. Buckley, and G. Paltoglou (2012). Sentiment strength detection for the social web. Journal of the American Society for Information Science and Technology 63(1), 163–173.
  • Thelwall et al. (2010) Thelwall, M., K. Buckley, G. Paltoglou, D. Cai, and A. Kappas (2010). Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology 61(12), 2544–2558.
  • Tobias (1993) Tobias, R. B. (1993). 20 Master Plots: And How to Build Them. Ohio: Writer’s Digest Books.
  • Toutanova et al. (2003) Toutanova, K., D. Klein, C. D. Manning, and Y. Singer (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, pp. 173–180. Association for Computational Linguistics.
  • Tumasjan et al. (2010) Tumasjan, A., T. O. Sprenger, P. G. Sandner, and I. M. Welpe (2010). Predicting elections with twitter: What 140 characters reveal about political sentiment. ICWSM 10, 178–185.
  • Turney (2002) Turney, P. D. (2002). Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 417–424. Association for Computational Linguistics.
  • Turney and Littman (2003) Turney, P. D. and M. L. Littman (2003). Measuring praise and criticism: Inference of semantic orientation from association. ACM Transactions on Information Systems (TOIS) 21(4), 315–346.
  • Uther (2011) Uther, H.-J. (2011). The Types of International Folktales. A Classification and Bibliography. Based on the System of Antti Aarne and Stith Thompson. Part I. Animal Tales, Tales of Magic, Religious Tales, and Realistic Tales, with an Introduction (FF Communications, 284). Helsinki, Finland: Finnish Academy of Science and Letters.
  • Valls-Vargas et al. (2014) Valls-Vargas, J., S. Ontanón, and J. Zhu (2014). Toward automatic character identification in unannotated narrative text. In Seventh Intelligent Narrative Technologies Workshop.
  • Valls-Vargas et al. (2014) Valls-Vargas, J., J. Zhu, and S. Ontañón (2014). Toward automatic role identification in unannotated folk tales. In Tenth Artificial Intelligence and Interactive Digital Entertainment Conference.
  • Van Ham et al. (2009) Van Ham, F., M. Wattenberg, and F. B. Viégas (2009). Mapping text with phrase nets. IEEE transactions on visualization and computer graphics 15(6).
  • Van Rensbergen et al. (2016) Van Rensbergen, B., S. De Deyne, and G. Storms (2016). Estimating affective word covariates using word association data. Behavior Research Methods 48(4), 1644–1652.
  • Various (Various) Various. Project Gutenberg.
  • Velikovich et al. (2010) Velikovich, L., S. Blair-Goldensohn, K. Hannan, and R. McDonald (2010). The viability of web-derived polarity lexicons. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 777–785. Association for Computational Linguistics.
  • Viegas et al. (2009) Viegas, F. B., M. Wattenberg, and J. Feinberg (2009). Participatory visualization with wordle. IEEE transactions on visualization and computer graphics 15(6).
  • Viegas et al. (2007) Viegas, F. B., M. Wattenberg, F. Van Ham, J. Kriss, and M. McKeon (2007). Manyeyes: a site for visualization at internet scale. IEEE transactions on visualization and computer graphics 13(6).
  • Volger (1992) Volger, C. (1992). The writer’s journey. mythic structure for storytellers and screenwriters.
  • Vonnegut (1981) Vonnegut, K. (1981). Palm Sunday. New York: RosettaBooks LLC.
  • Vonnegut (1995) Vonnegut, K. (1995). Shapes of stories.
  • Ward Jr (1963) Ward Jr, J. H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American statistical association 58(301), 236–244.
  • Warriner et al. (2013) Warriner, A. B., V. Kuperman, and M. Brysbaert (2013). Norms of valence, arousal, and dominance for 13,915 english lemmas. Behavior research methods 45(4), 1191–1207.
  • Watson and Clark (1999) Watson, D. and L. A. Clark (1999). The PANAS-X: Manual for the positive and negative affect schedule-expanded form. Ph. D. thesis, University of Iowa.
  • Weingart (Weingart) Weingart, S. Not enough perspectives, pt. 1.
  • Whissell et al. (1986) Whissell, C., M. Fournier, R. Pelland, D. Weir, and K. Makarec (1986). A dictionary of affect in language: Iv. reliability, validity, and applications. Perceptual and Motor Skills 62(3), 875–888.
  • Williams (2016) Williams, J. R. (2016). Boundary-based mwe segmentation with text partitioning. arXiv preprint arXiv:1608.02025.
  • Wilson et al. (2005) Wilson, T., J. Wiebe, and P. Hoffmann (2005). Recognizing contextual polarity in phrase-level sentiment analysis. Proceedings of Human Language Technologies Conference/Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP 2005).
  • Winston (2011) Winston, P. H. (2011). The strong story hypothesis and the directed perception hypothesis.
  • Wojcik et al. (2015) Wojcik, S. P., A. Hovasapian, J. Graham, M. Motyl, and P. H. Ditto (2015). Conservatives report, but liberals display, greater happiness. Science 347(6227), 1243–1246.
  • Wu (2016) Wu, S. (2016, Dec). An interactive visualization of every line in hamilton.
  • Xanthos et al. (2016) Xanthos, A., I. Pante, Y. Rochat, and M. Grandjean (2016). Visualising the dynamics of character networks. Digital Humanities 2016: Conference Abstracts, 417–419.
  • Youyou et al. (2015) Youyou, W., M. Kosinski, and D. Stillwell (2015). Computer-based personality judgments are more accurate than those made by humans. Proceedings of the National Academy of Sciences 112(4), 1036–1040.
  • Zhu et al. (2014) Zhu, X., S. Kiritchenko, and S. M. Mohammad (2014). Nrc-canada-2014: Recent improvements in the sentiment analysis of tweets. In Proceedings of the 8th international workshop on semantic evaluation (SemEval 2014), pp. 443–447. Citeseer.

Appendix A Supplementary Material for Sentiment Dictionary Comparisons

a.1 S1 Appendix: Computational methods

All of the code to perform these tests is available and documented on GitHub. The repository can be found here:

a.1.1 Stem matching

Of the dictionaries tested, both LIWC and MPQA use “word stems”. Here we quickly note some of the technical difficulties with using word stems, and how we processed them, for future research to build upon and improve.

An example is abandon*, which is intended to match words of the standard regular expression form abandon[a-z]*. A naive approach is to check each word against the regular expression, but this is prohibitively slow. Instead, we store each of the dictionaries in a “trie” data structure with a record. We use the readily available “marisa-trie” Python library, which wraps its C++ counterpart. The speed of these libraries made the comparison possible over large corpora, in particular for the dictionaries with stemmed words, where a prefix search is necessary. Specifically, the “trie” structure is 70 times faster than a regular-expression-based search for stem words. We construct two tries for each dictionary: a fixed trie and a stemmed trie. We first attempt to match words against the fixed list, and then turn to the prefix match on the stemmed list.
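The two-step lookup described above (fixed list first, then a prefix match against the stems) can be sketched as follows. This is a minimal illustration only: it uses a small pure-Python nested-dict trie as a stand-in for the marisa-trie library, and the function names and toy dictionary entries are hypothetical, not taken from the released code.

```python
import re

_END = object()  # sentinel key marking the end of a stored entry


def build_trie(entries):
    """Build a nested-dict character trie from a {key: score} mapping."""
    root = {}
    for key, score in entries.items():
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
        node[_END] = score
    return root


def exact_match(trie, word):
    """Return the score stored for exactly `word`, or None if absent."""
    node = trie
    for ch in word:
        node = node.get(ch)
        if node is None:
            return None
    return node.get(_END)


def stem_match(trie, word):
    """Return the score of the longest stem s with word matching s[a-z]*."""
    node, best = trie, None
    for i, ch in enumerate(word):
        node = node.get(ch)
        if node is None:
            break
        # A stem only counts if the rest of the word is lowercase letters.
        if _END in node and re.fullmatch(r"[a-z]*", word[i + 1:]):
            best = node[_END]
    return best


def score_word(word, fixed_trie, stem_trie):
    """Try the fixed list first, then fall back to the stem prefix match."""
    score = exact_match(fixed_trie, word)
    if score is not None:
        return score
    return stem_match(stem_trie, word)


# Hypothetical toy entries; stems are stored without the trailing "*".
fixed_trie = build_trie({"happy": 1, "sad": -1})
stem_trie = build_trie({"abandon": -1, "laugh": 1})
abandoned_score = score_word("abandoned", fixed_trie, stem_trie)
```

Walking the trie once per word costs time proportional to the word length, independent of the number of stems, which is consistent with the speedup over testing every stem's regular expression in turn.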

a.1.2 Regular expression parsing

The first step in processing the text of each corpus is extracting the words from the raw text. Here we rely on a regular expression search, after first removing some punctuation. We choose to include the set of all characters that are found within the words in each of the six dictionaries tested in detail, so that the search respects the parse used to create these dictionaries by retaining such characters. This takes the following form in Python, for raw_text as a string:

punctuation_to_replace = ["---","--","’’"]
for punctuation in punctuation_to_replace:
    raw_text = raw_text.replace(punctuation," ")
words = [x.lower() for x in re.findall(\
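The pattern passed to re.findall is cut off in the listing above and is not reconstructed here. As a hedged sketch of the same two steps (punctuation replacement, then a regular expression search), the block below substitutes a hypothetical character class; the name TOKEN_PATTERN and the characters it retains are assumptions for illustration, not the pattern used in the original analysis.

```python
import re

# ASSUMPTION: a stand-in pattern keeping word characters plus a few symbols
# (@, #, apostrophe, &, hyphen) that can occur inside dictionary words.
# The original pattern is truncated in the source and may differ.
TOKEN_PATTERN = re.compile(r"[\w@#'&\-]+")


def extract_words(raw_text):
    """Replace dash runs and doubled quotes with spaces, then tokenize."""
    for punctuation in ("---", "--", "''"):
        raw_text = raw_text.replace(punctuation, " ")
    return [x.lower() for x in TOKEN_PATTERN.findall(raw_text)]


sample_words = extract_words("Well--that's #great, isn't it?")
```

Replacing the longer dash run first matters: substituting "--" before "---" would leave a stray hyphen attached to a neighboring word.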

a.2 S2 Appendix: Continued individual comparisons

Picking up right where we left off in Section 3.3, we next compare ANEW with the other dictionaries. The ANEW-WK comparison in Panel I of Fig. 2.1 contains all 1030 words of ANEW, with a fit of , making ANEW more positive and with increasing positivity for more positive words. The 20 most different scores are (ANEW,WK): fame (7.93,5.45), god (8.15,5.90), aggressive (5.10,3.08), casino (6.81,4.68), rancid (4.34,2.38), bees (3.20,5.14), teacher (5.68,7.37), priest (6.42,4.50), aroused (7.97,5.95), skijump (7.06,5.11), noisy (5.02,3.21), heroin (4.36,2.74), insolent (4.35,2.74), rain (5.08,6.58), patient (5.29,6.71), pancakes (6.08,7.43), hospital (5.04,3.52), valentine (8.11,6.40), and book (5.72,7.05). We again see some of the same words from the LabMT comparisons with these dictionaries, and again can attribute some differences to small sample sizes and differing demographics.

For the ANEW-MPQA comparison in Panel J of Fig. 2.1 we show the same matched word lists as before. The happiest 10 words in ANEW matched by MPQA are: clouds (6.18), bar (6.42), mind (6.68), game (6.98), sapphire (7.00), silly (7.41), flirt (7.52), rollercoaster (8.02), comedy (8.37), laughter (8.45). The least happy 5 neutral words and happiest 5 neutral words in MPQA, matched with MPQA, are: pressure (3.38), needle (3.82), quiet (5.58), key (5.68), alert (6.20), surprised (7.47), memories (7.48), knowledge (7.58), nature (7.65), engaged (8.00), baby (8.22). The least happy words in ANEW with score +1 in MPQA that are matched by MPQA are: terrified (1.72), meek (3.87), plain (4.39), obey (4.52), contents (4.89), patient (5.29), reverent (5.35), basket (5.45), repentant (5.53), trumpet (5.75). Again we see some very questionable matches by the MPQA dictionary, with broad stems capturing words with both positive and negative scores.

For the ANEW-LIWC comparison in Panel K of Fig. 2.1 we show the same matched word lists as before. The happiest 10 words in ANEW matched by LIWC are: lazy (4.38), neurotic (4.45), startled (4.50), obsession (4.52), skeptical (4.52), shy (4.64), anxious (4.81), tease (4.84), serious (5.08), aggressive (5.10). There are only 5 words in ANEW that are matched by LIWC with LIWC score of 0: part (5.11), item (5.26), quick (6.64), couple (7.41), millionaire (8.03). The least happy words in ANEW with score +1 in LIWC that are matched by LIWC are: heroin (4.36), virtue (6.22), save (6.45), favor (6.46), innocent (6.51), nice (6.55), trust (6.68), radiant (6.73), glamour (6.76), charm (6.77).

For the ANEW-Liu comparison in Panel L of Fig. 2.1 we show the same matched word lists as before, except the neutral word list because Liu has no explicit neutral words. The happiest 10 words in ANEW matched by Liu are: pig (5.07), aggressive (5.10), tank (5.16), busybody (5.17), hard (5.22), mischief (5.57), silly (7.41), flirt (7.52), rollercoaster (8.02), joke (8.10). The least happy words in ANEW with score +1 in Liu that are matched by Liu are: defeated (2.34), obsession (4.52), patient (5.29), reverent (5.35), quiet (5.58), trumpet (5.75), modest (5.76), humble (5.86), salute (5.92), idol (6.12).

For the WK-MPQA comparison in Panel P of Fig. 2.1 we show the same matched word lists as before. The happiest 10 words in WK matched by MPQA are: cutie (7.43), pancakes (7.43), panda (7.55), laugh (7.56), marriage (7.56), lullaby (7.57), fudge (7.62), pancake (7.71), comedy (8.05), laughter (8.05). The least happy 5 neutral words and happiest 5 neutral words in MPQA, matched with MPQA, are: sociopath (2.44), infectious (2.63), sob (2.65), soulless (2.71), infertility (3.00), thinker (7.26), knowledge (7.28), legacy (7.38), surprise (7.44), song (7.59). The least happy words in WK with score +1 in MPQA that are matched by MPQA are: kidnapper (1.77), kidnapping (2.05), kidnap (2.19), discriminating (2.33), terrified (2.51), terrifying (2.63), terrify (2.84), courtroom (2.84), backfire (3.00), indebted (3.21).

For the WK-LIWC comparison in Panel Q of Fig. 2.1 we show the same matched word lists as before. The happiest 10 words in WK matched by LIWC are: geek (5.56), number (5.59), fiery (5.70), trivia (5.70), screwdriver (5.76), foolproof (5.82), serious (5.88), yearn (5.95), dumpling (6.48), weeping willow (6.53). The least happy 5 neutral words and happiest 5 neutral words in LIWC, matched with LIWC, are: negative (2.52), negativity (2.74), quicksand (3.62), lack (3.68), wont (4.09), unique (7.32), millionaire (7.32), first (7.33), million (7.55), rest (7.86). The least happy words in WK with score +1 in LIWC that are matched by LIWC are: heroin (2.74), friendless (3.15), promiscuous (3.32), supremacy (3.48), faithless (3.57), laughingstock (3.77), promiscuity (3.95), tenderfoot (4.26), succession (4.52), dynamite (4.79).

For the WK-Liu comparison in Panel R of Fig. 2.1 we show the same matched word lists as before, except the neutral word list because Liu has no explicit neutral words. The happiest 10 words in WK matched by Liu are: goofy (6.71), silly (6.72), flirt (6.73), rollercoaster (6.75), tenderness (6.89), shimmer (6.95), comical (6.95), fanciful (7.05), funny (7.59), fudge (7.62), joke (7.88). The least happy words in WK with score +1 in Liu that are matched by Liu are: defeated (2.59), envy (3.05), indebted (3.21), supremacy (3.48), defeat (3.74), overtake (3.95), trump (4.18), obsession (4.38), dominate (4.40), tough (4.45).

Now we’ll focus our attention on the MPQA row, and first we see comparisons against the three full range dictionaries. For the first match against LabMT in Panel D of Fig. 2.1, the MPQA match catches 431 words with MPQA score 0, while LabMT (without stems) matches 268 words in MPQA in Panel S (1039/809 and 886/766 for the positive and negative words of MPQA). Since we’ve already highlighted most of these words, we move on and focus our attention on comparing the dictionaries.

In Panels V–X, BB–DD, and HH–JJ of Fig. 2.1 there are a total of 6 bins off of the diagonal, and we focus our attention on the bins that represent words that have opposite scores in each of the dictionaries. For example, consider the matches made by MPQA in Panel BB: the words in the top left corner and bottom right corner are scored in an opposite manner in LIWC, and are of particular concern. Looking at the words from Panel W with a +1 in MPQA and a -1 in LIWC (matched by LIWC) we see: stunned, fiery, terrified, terrifying, yearn, defense, doubtless, foolproof, risk-free, exhaustively, exhaustive, blameless, low-risk, low-cost, lower-priced, guiltless, vulnerable, yearningly, and yearning. The words with a -1 in MPQA that are +1 in LIWC (matched by LIWC) are: silly, madly, flirt, laugh, keen, superiority, supremacy, sillily, dearth, comedy, challenge, challenging, cheerless, faithless, laughable, laughably, laughingstock, laughter, laugh, grating, opportunistic, joker, challenge, flirty.

In Panel W of Fig. 2.1, the words with a +1 in MPQA and a -1 in Liu (matched by Liu) are: solicitude, flair, funny, resurgent, untouched, tenderness, giddy, vulnerable, and joke. The words with a -1 in MPQA that are +1 in Liu, matched by Liu, are: superiority, supremacy, sharp, defeat, dumbfounded, affectation, charisma, formidable, envy, empathy, trivially, obsessions, and obsession.

In Panel BB of Fig. 2.1, the words with a +1 in LIWC and a -1 in MPQA (matched by MPQA) are: silly, madly, flirt, laugh, keen, determined, determina, funn, fearless, painl, cute, cutie, and gratef. The words with a -1 in LIWC and a +1 in MPQA, that are matched by MPQA, are: stunned, terrified, terrifying, fiery, yearn, terrify, aversi, pressur, careless, helpless, and hopeless.

In Panel DD of Fig. 2.1, the words with a -1 in LIWC and a +1 in Liu, that are matched by Liu, are: silly and madly. The words with a +1 in LIWC and a -1 in Liu, that are matched by Liu, are: stunned and fiery.

In Panel HH of Fig. 2.1, the words with a -1 in Liu and a +1 in MPQA, that are matched by MPQA, are: superiority, supremacy, sharp, defeat, dumbfounded, charisma, affectation, formidable, envy, empathy, trivially, obsessions, obsession, stabilize, defeated, defeating, defeats, dominated, dominates, dominate, dumbfounding, cajole, cuteness, faultless, flashy, fine-looking, finer, finest, panoramic, pain-free, retractable, believeable, blockbuster, empathize, err-free, mind-blowing, marvelled, marveled, trouble-free, thumb-up, thumbs-up, long-lasting, and viewable. The words with a +1 in Liu and a -1 in MPQA, that are matched by MPQA, are: solicitude, flair, funny, resurgent, untouched, tenderness, giddy, vulnerable, joke, shimmer, spurn, craven, aweful, backwoods, backwood, back-woods, back-wood, back-logged, backaches, backache, backaching, backbite, tingled, glower, and gainsay.

In Panel II of Fig. 2.1, the words with a +1 in Liu and a -1 in LIWC, that are matched by LIWC, are: stunned, fiery, defeated, defeating, defeats, defeat, doubtless, dominated, dominates, dominate, dumbfounded, dumbfounding, faultless, foolproof, problem-free, problem-solver, risk-free, blameless, envy, trivially, trouble-free, tougher, toughest, tough, low-priced, low-price, low-risk, low-cost, lower-priced, geekier, geeky, guiltless, obsessions, and obsession. The words with a -1 in Liu and a +1 in LIWC, that are matched by LIWC, are: silly, madly, sillily, dearth, challenging, cheerless, faithless, flirty, flirt, funnily, funny, tenderness, laughable, laughably, laughingstock, grating, opportunistic, joker, and joke.

In the off-diagonal bins for all of the dictionaries, we see many of the same words. Again MPQA stem matches are disparagingly broad. We also find matches by LIWC that are concerning, and should in all likelihood be removed from the dictionary.
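The off-diagonal comparison used throughout this section amounts to finding words that two dictionaries score with opposite signs. The sketch below shows that operation for two plain word-to-score mappings on a -1/0/+1 scale. It ignores stem matching, the function name is hypothetical, and the toy entries are a small excerpt of words named above plus one agreeing entry ("excellent") added purely for illustration.

```python
def opposite_bins(scores_a, scores_b):
    """Return (words +1 in A and -1 in B, words -1 in A and +1 in B)."""
    shared = scores_a.keys() & scores_b.keys()
    pos_neg = sorted(w for w in shared
                     if scores_a[w] == 1 and scores_b[w] == -1)
    neg_pos = sorted(w for w in shared
                     if scores_a[w] == -1 and scores_b[w] == 1)
    return pos_neg, neg_pos


# Toy excerpt: signs for stunned, fiery, silly, and madly follow the lists
# in the text; "excellent" is a hypothetical entry on which both agree.
mpqa = {"stunned": 1, "fiery": 1, "silly": -1, "madly": -1, "excellent": 1}
liwc = {"stunned": -1, "fiery": -1, "silly": 1, "madly": 1, "excellent": 1}

mpqa_pos_liwc_neg, mpqa_neg_liwc_pos = opposite_bins(mpqa, liwc)
```

Words returned in either list are the candidates worth inspecting by hand, since at least one of the two dictionaries is likely to be wrong about them.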

a.3 S3 Appendix: Coverage for all corpuses

We provide coverage plots for the other three corpuses.

Figure A.1: Coverage of the words on Twitter by each of the dictionaries.
Figure A.2: Coverage of the words in Google books by each of the dictionaries.
Figure A.3: Coverage of the words in the New York Times by each of the dictionaries.

a.4 S4 Appendix: Sorted New York Times rankings

Figure A.4: NYT Sections scatterplot. The RMA fit and for the formula . For the sake of comparison, we normalized each dictionary to the range [-.5,.5] by subtracting the mean score (5 or 0) and dividing by the range (8 or 2).
Figure A.5: Sorted bar charts ranking each of the 24 New York Times Sections for each dictionary tested.
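
The normalization described in the caption of Figure A.4 (subtract the center of the scale, divide by its full range) and a reduced major axis (RMA) fit can be sketched as follows. This is a minimal illustration rather than the code used to produce the figure: the function names `normalize` and `rma_fit` are hypothetical, and the RMA slope is assumed to follow the standard definition (ratio of the standard deviations, signed by the covariance).

```python
import math
import statistics


def normalize(score, center, full_range):
    """Shift a dictionary score by its center and scale by its full range."""
    return (score - center) / full_range


def rma_fit(x, y):
    """Reduced major axis fit y = intercept + slope * x (standard definition)."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    # The sign of the covariance sets the sign of the slope.
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = math.copysign(statistics.stdev(y) / statistics.stdev(x), covariance)
    return slope, mean_y - slope * mean_x


# A 1 to 9 scale (center 5, range 8) and a -1 to 1 scale (center 0, range 2)
# both map onto [-0.5, 0.5].
top_of_nine_point_scale = normalize(9.0, 5.0, 8.0)
bottom_of_binary_scale = normalize(-1.0, 0.0, 2.0)
```

Unlike ordinary least squares, the RMA fit treats both variables symmetrically, which suits a comparison of two dictionaries where neither is the natural independent variable.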

a.5 S5 Appendix: Movie Review Distributions

Here we examine the distributions of movie review scores. These distributions are each summarized by their mean and standard deviation in the panels of Figure 2 for each dictionary. For example, the leftmost error bar of each panel in Figure 2 shows the standard deviation around the mean for the distribution of individual review scores (Figure A.6).

Figure A.6: Binned scores for each review by each corpus with a stop value of .
Figure A.7: Binned scores for samples of 15 concatenated random reviews. Each dictionary uses stop value of .
Figure A.8: Binned length of positive reviews, in words.

a.6 S6 Appendix: Google Books correlations and word shifts

Figure A.9: Google Books correlations. Here we include correlations for the Google Books time series, and word shifts for selected decades (1920’s, 1940’s, 1990’s, 2000’s).
Figure A.10: Google Books shifts in the 1920’s against the baseline of Google Books.
Figure A.11: Google Books shifts in the 1940’s against the baseline of Google Books.
Figure A.12: Google Books shifts in the 1990’s against the baseline of Google Books.
Figure A.13: Google Books shifts in the 2000’s against the baseline of Google Books.

a.7 S7 Appendix: Additional Twitter time series, correlations, and shifts

First, we present additional Twitter time series:

Figure A.14: Normalized time series on Twitter using a stop value of 1.0 for all dictionaries, at a resolution of 3 hours. We do not include any of the time series with resolution below 3 hours here because there are too many data points to see.
Figure A.15: Normalized time series on Twitter using a stop value of 1.0 for all dictionaries, at a resolution of 12 hours.

Next, we take a look at more correlations:

Figure A.16: Pearson’s correlation between Twitter time series for all resolutions below 1 day.

Now we include word shift graphs that are absent from the manuscript itself.

Figure A.17: Word Shifts for Twitter in 2010. The reference word usage is all of Twitter (the 10% Gardenhose feed) from September 2008 through April 2015, with the word usage normalized by year.
Figure A.18: Word Shifts for Twitter in 2012. The reference word usage is all of Twitter (the 10% Gardenhose feed) from September 2008 through April 2015, with the word usage normalized by year.
Figure A.19: Word Shifts for Twitter in 2014. The reference word usage is all of Twitter (the 10% Gardenhose feed) from September 2008 through April 2015, with the word usage normalized by year.

Finally, we include the results of each dictionary applied to a set of annotated Twitter data. We apply sentiment dictionaries to rate individual Tweets and classify a Tweet as positive (negative) if the Tweet rating is greater (less) than the average of all scores in the dictionary.

Rank  Dictionary     % Tweets scored  F1 of Tweets scored  Calibrated F1  Overall F1
1.    Sent140Lex     100.0            0.89                 0.88           0.89
2.    labMT          100.0            0.69                 0.78           0.69
3.    HashtagSent    100.0            0.67                 0.64           0.67
4.    SentiWordNet   98.6             0.67                 0.68           0.67
5.    VADER          81.3             0.75                 0.81           0.61
6.    SentiStrength  73.9             0.83                 0.81           0.61
7.    SenticNet      97.3             0.61                 0.64           0.59
8.    Umigon         67.1             0.87                 0.85           0.58
9.    SOCAL          82.2             0.71                 0.75           0.58
10.   WDAL           99.9             0.58                 0.64           0.58
11.   AFINN          73.6             0.78                 0.80           0.57
12.   OL             66.7             0.83                 0.82           0.55
13.   MaxDiff        94.1             0.58                 0.70           0.54
14.   EmoSenticNet   96.0             0.56                 0.59           0.54
15.   MPQA           73.2             0.73                 0.72           0.53
16.   WK             96.5             0.53                 0.72           0.51
17.   LIWC15         61.8             0.81                 0.78           0.50
18.   Pattern        69.0             0.71                 0.75           0.49
19.   GI             67.6             0.72                 0.70           0.49
20.   LIWC07         60.3             0.80                 0.75           0.48
21.   LIWC01         54.3             0.83                 0.75           0.45
22.   EmoLex         59.4             0.73                 0.69           0.43
23.   ANEW           64.1             0.65                 0.68           0.42
24.   USent          4.5              0.74                 0.73           0.03
25.   PANAS-X        1.7              0.88                 -              0.01
26.   Emoticons      1.4              0.72                 0.77           0.01

Table A.1: Ranked results of sentiment dictionary performance on individual Tweets from STS-Gold dataset (Saif, 2013). We report the percentage of Tweets for which each dictionary contains at least 1 entry, the F1 score on those Tweets, and the overall classification F1 score. The calibrated F1 score tunes the decision threshold between positive and negative Tweets with a random 10% training sample.
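
The classification rule behind Table A.1 can be sketched as follows. This is a minimal illustration, not the code used for the table: the function names and the toy dictionary are hypothetical, and a Tweet's rating is assumed here to be the plain average of the scores of its matched words.

```python
def score_tweet(words, dictionary):
    """Average score of the words found in the dictionary, or None if none match."""
    matched = [dictionary[w] for w in words if w in dictionary]
    if not matched:
        return None
    return sum(matched) / len(matched)


def classify_tweet(words, dictionary):
    """Label a Tweet relative to the average of all scores in the dictionary."""
    threshold = sum(dictionary.values()) / len(dictionary)
    score = score_tweet(words, dictionary)
    if score is None or score == threshold:
        return None  # unscored (no dictionary entry) or exactly at the threshold
    return "positive" if score > threshold else "negative"


def percent_scored(tweets, dictionary):
    """Percentage of Tweets that contain at least one dictionary entry."""
    scored = sum(1 for t in tweets if score_tweet(t, dictionary) is not None)
    return 100.0 * scored / len(tweets)


# Hypothetical toy dictionary on a 1 to 9 scale; not taken from any real lexicon.
toy = {"love": 8.5, "happy": 8.0, "the": 5.0, "sad": 2.0, "hate": 2.0}
tweets = [t.split() for t in ("i love the movie", "i hate mondays", "xyz")]
labels = [classify_tweet(t, toy) for t in tweets]
```

Tweets with no dictionary entry stay unlabeled, so low coverage can pull the overall F1 well below the F1 measured on the scored Tweets alone.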

a.8 S8 Appendix: Naive Bayes results and derivation

We now provide more details on the implementation of Naive Bayes, a derivation of the linearity structure, and more results from the classification of Movie Reviews.

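First, to implement a binary Naive Bayes classifier for a collection of documents, we denote each of the $N$ words in a given document $T$ as $w_i$, the normalized frequency of word $w_i$ in $T$ as $f_i(T)$, and the class labels as $c_1$ and $c_2$. The probability of a document $T$ belonging to class $c_1$ can be written as

$$P(c_1 \mid T) = \frac{P(c_1)\, P(T \mid c_1)}{P(T)}.$$

Since we do not know $P(T \mid c_1)$ explicitly, we make the naive assumption that each word appears independently, and thus write

$$P(c_1 \mid T) = \frac{P(c_1) \prod_{i=1}^{N} P(f_i(T) \mid c_1)}{P(T)}.$$

Since we are only interested in comparing $P(c_1 \mid T)$ and $P(c_2 \mid T)$, we disregard the shared denominator and have

$$P(c_1 \mid T) \propto P(c_1) \prod_{i=1}^{N} P(f_i(T) \mid c_1).$$

Finally we say that document $T$ belongs to class $c_1$ if $P(c_1 \mid T) > P(c_2 \mid T)$. Given that the probabilities of individual words are small, to avoid machine truncation error we compute these probabilities in log space, such that the product of individual word likelihoods becomes a sum

$$\log P(c_1 \mid T) = \log P(c_1) + \sum_{i=1}^{N} \log P(f_i(T) \mid c_1) - \log P(T),$$

where the term $-\log P(T)$ is shared by both classes.

Assigning a classification of class $c_1$ if $P(c_1 \mid T) > P(c_2 \mid T)$ is the same as saying that the difference between the two is positive, i.e. $P(c_1 \mid T) - P(c_2 \mid T) > 0$, and since the logarithm is monotonic, $\log P(c_1 \mid T) - \log P(c_2 \mid T) > 0$. To examine how individual words contribute to this difference, we can write

$$\log P(c_1 \mid T) - \log P(c_2 \mid T) = \log \frac{P(c_1)}{P(c_2)} + \sum_{i=1}^{N} \log \frac{P(f_i(T) \mid c_1)}{P(f_i(T) \mid c_2)}.$$

We can see from the above that the contribution of each word $w_i$ (or more accurately, the likelihood of the frequency $f_i(T)$ in document $T$ being predictive of class $c_1$ as $P(f_i(T) \mid c_1)$) is a linear constituent of the classification.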
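
The linear structure described above can be made concrete with a small sketch in which the log-odds between two classes is a sum of per-word terms. It rests on assumptions that are not taken from the text: it uses a multinomial Naive Bayes model over word counts with add-one (Laplace) smoothing rather than likelihoods of normalized frequencies, and the names `train`, `word_contributions`, and `log_odds`, as well as the toy training data, are hypothetical.

```python
import math
from collections import Counter


def train(docs_by_class, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods for each class."""
    vocab = {w for docs in docs_by_class.values() for doc in docs for w in doc}
    n_docs = sum(len(docs) for docs in docs_by_class.values())
    log_prior, log_like = {}, {}
    for label, docs in docs_by_class.items():
        counts = Counter(w for doc in docs for w in doc)
        total = sum(counts.values()) + alpha * len(vocab)
        log_prior[label] = math.log(len(docs) / n_docs)
        log_like[label] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_like


def word_contributions(doc, log_like, c1, c2):
    """Per-word terms of the log-odds; words outside the vocabulary are skipped."""
    counts = Counter(w for w in doc if w in log_like[c1])
    return {w: n * (log_like[c1][w] - log_like[c2][w]) for w, n in counts.items()}


def log_odds(doc, log_prior, log_like, c1, c2):
    """Log prior ratio plus the sum of word contributions; positive favors c1."""
    contributions = word_contributions(doc, log_like, c1, c2)
    return log_prior[c1] - log_prior[c2] + sum(contributions.values())


# Hypothetical toy training set with two documents per class.
training = {
    "pos": [["great", "fun"], ["great", "story"]],
    "neg": [["bad", "boring"], ["bad", "story"]],
}
log_prior, log_like = train(training)
doc = ["great", "story", "unseen"]
contributions = word_contributions(doc, log_like, "pos", "neg")
```

Sorting `contributions` by value gives the kind of "most informative" word lists reported in the tables of this appendix, although the values reported there may be computed differently.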

Next, we include the detailed results of the Naive Bayes classifier on the Movie Review corpus.

Figure A.20: Results of the NB classifier on the Movie Reviews corpus.
Figure A.21: NYT Sections ranked by Naive Bayes in two of the five trials.

Most informative
Positive                 Negative
Word         Value       Word         Value
flynt        27.27       godzilla     20.21
truman       26.33       werewolf     15.95
charles      20.68       gorilla      13.83
event        15.04       spice        13.83
shrek        14.10       memphis      13.83
cusack       13.16       sgt          13.83
bulworth     13.16       jennifer     12.76
robocop      13.16       hill         12.76
jedi         12.22       max          11.70
gangster     12.22       200          11.70

NYT Society
Positive                 Negative
Word         Value       Word         Value
truman       26.08       godzilla     20.40
charles      20.49       hill         12.88
gangster     12.11       jennifer     12.88
speech       10.25       fatal        10.73
melvin       9.32        freddie      8.59
wars         8.85        =            8.59
agents       7.45        mess         8.59
dance        6.52        gene         8.59
bleak        6.52        apparent     8.59
pitt         6.52        travolta     7.51

Table A.2: Trial 1 of Naive Bayes trained on a random 10% of the movie review corpus, and applied to the New York Times Society section. We show the words which are used by the trained classifier to classify individual reviews (in corpus), and on the New York Times (out of corpus). In addition, we report a second trial in Table A.3, since Naive Bayes is trained on a random subset of data, to show the variation in individual words between trials (while performance is consistent).

Most informative
Positive                 Negative
Word         Value       Word         Value
shrek        18.11       west         34.63
poker        17.15       webb         24.14
shark        15.25       jackal       18.89
maggie       14.29       travolta     17.84
guido        13.34       woo          17.84
outstanding  13.34       coach        17.84
political    13.34       awful        16.79
journey      13.34       brenner      16.79
bulworth     13.34       gabriel      15.74
bacon        12.39       general’s    15.74

NYT Society
Positive                 Negative
Word         Value       Word         Value
poker        17.79       west         33.39
journey      13.84       coach        17.20
political    13.84       travolta     17.20
tribe        8.90        gabriel      15.18
tony         7.91        pointless    12.14
price        7.91        stupid       9.44
threat       7.91        screaming    8.09
titanic      7.12        mess         7.59
dicaprio     6.92        boring       7.42
kate         6.92        =            7.08

Table A.3: Trial 2 of Naive Bayes trained on a random 10% of the movie review corpus, and applied to the New York Times Society section. We show the words which are used by the trained classifier to classify individual reviews (in corpus), and on the New York Times (out of corpus). This second trial is in addition to the first trial in Table A.2, since Naive Bayes is trained on a random subset of data, to show the variation in individual words between trials (while performance is consistent).

a.9 S9 Appendix: Movie review benchmark of additional dictionaries

Here, we present the accuracy of each dictionary applied to binary classification of Movie Reviews.

Rank  Title          % Scored  F1 Trained  F1 Untrained
1.    OL             100       0.70        0.71
2.    HashtagSent    100       0.67        0.66
3.    MPQA           100       0.67        0.66
4.    SentiWordNet   100       0.65        0.65
5.    labMT          100       0.64        0.63
6.    AFINN          100       0.67        0.63
7.    Umigon         100       0.65        0.62
8.    GI             100       0.65        0.61
9.    SOCAL          100       0.71        0.60
10.   VADER          100       0.67        0.60
11.   WDAL           100       0.60        0.59
12.   SentiStrength  100       0.63        0.58
13.   EmoLex         100       0.65        0.56
14.   LIWC15         100       0.64        0.55
15.   LIWC01         100       0.65        0.54
16.   LIWC07         100       0.64        0.53
17.   Pattern        100       0.73        0.52
18.   PANAS-X        33        0.51        0.51
19.   Sent140Lex     100       0.68        0.47
20.   SenticNet      100       0.62        0.45
21.   ANEW           100       0.57        0.36
22.   MaxDiff        100       0.66        0.36
23.   EmoSenticNet   100       0.58        0.34
24.   WK             100       0.63        0.34
25.   Emoticons      0         -           -
26.   USent          40        -           -

Table A.4: Ranked performance of dictionaries on the Movie Review corpus.

Rank  Title          % Scored  F1 Trained of Scored  F1 Untrained of Scored  F1 Untrained, All
1.    HashtagSent    100       0.55                  0.55                    0.55
2.    LIWC15         99        0.53                  0.55                    0.55
3.    LIWC07         99        0.53                  0.55                    0.54
4.    LIWC01         99        0.52                  0.55                    0.54
5.    labMT          99        0.54                  0.54                    0.54
6.    Sent140Lex     100       0.55                  0.54                    0.54
7.    SentiWordNet   99        0.54                  0.53                    0.53
8.    WDAL           99        0.53                  0.53                    0.52
9.    EmoLex         95        0.54                  0.55                    0.52
10.   MPQA           93        0.54                  0.55                    0.52
11.   SenticNet      97        0.53                  0.52                    0.50
12.   SOCAL          88        0.56                  0.55                    0.49
13.   EmoSenticNet   98        0.52                  0.46                    0.45
14.   Pattern        81        0.55                  0.55                    0.45
15.   GI             80        0.55                  0.55                    0.44
16.   WK             97        0.54                  0.45                    0.44
17.   OL             76        0.56                  0.57                    0.44
18.   VADER          79        0.56                  0.55                    0.43
19.   SentiStrength  77        0.54                  0.54                    0.41
20.   MaxDiff        83        0.54                  0.49                    0.41
21.   AFINN          70        0.56                  0.56                    0.39
22.   ANEW           63        0.52                  0.48                    0.30
23.   Umigon         53        0.56                  0.56                    0.30
24.   PANAS-X        1         0.53                  0.53                    0.01
25.   Emoticons      0         -                     -                       -
26.   USent          2         -                     -                       -

Table A.5: Ranked performance of dictionaries on the Movie Review corpus, broken into sentences.
Figure A.22: Word shifts for the movie review corpus, with panel letters continuing from Fig. 2.5. We again see many of the same patterns, and refer the reader to Fig. 2.5 for a more in depth analysis.

a.10 S10 Appendix: Coverage removal and binarization tests of labMT dictionary

Here, we perform a detailed analysis of the labMT dictionary to further isolate the effects of dictionary coverage and scoring type. This analysis is motivated by ensuring that our results are not confounded entirely by the quality of the word scores across dictionaries, such that the effects of coverage and scoring type are isolated. We focus on the Movie Review corpus for this analysis, examining the difference between positive and negative reviews using word shift graphs. While our attention is focused on a qualitative understanding of the differences in these two sets of documents, we also report the accuracy of the labMT dictionary with the aforementioned modifications using the F1 score.

a.10.1 Binarization

First, we gradually reduce the range of scores in the labMT dictionary from a centered -4 to 4, down to just the integer scores -1 and 1. This process is accomplished by first using a stop value of 1.0, leaving words with scores from 1–4 and 6–9, and then applying a linear transformation to these sets of words. We subtract the center value of 5.0 from the words, leaving words with ranges from -4 to -1 and 1 to 4, and then linearly map these sets to scores with a reduced range. For a binarization of 25%, we map -4 to -1 onto -3.25 to -1 and 1 to 4 onto 1 to 3.25, reducing the range in each direction from 3 to 2.25 (a 25% reduction). For a binarization of 50%, this becomes a map of -4 to -1 onto -2.5 to -1 and 1 to 4 onto 1 to 2.5, leaving only half of the original range of values. Finally, a binarization of 100% sets the score for all words in -4 to -1 to -1, and for all words in 1 to 4 to 1.
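
The linear map described above can be sketched as a single function. This is an illustration under stated assumptions, not the authors' code: the name `binarize` and its parameters are hypothetical, raw scores are assumed to lie on the 1 to 9 scale with center 5.0, and words closer than 1.0 to the center are dropped.

```python
def binarize(score, fraction, center=5.0, stop=1.0):
    """Center a raw score and shrink its range by `fraction` (0 = none, 1 = binary).

    Returns None for words that fall inside the excluded window around the center.
    """
    centered = score - center
    magnitude = abs(centered)
    if magnitude < stop:
        return None
    sign = 1.0 if centered > 0 else -1.0
    # Magnitudes in [stop, 4] map linearly onto [stop, stop + 3 * (1 - fraction)].
    return sign * (stop + (magnitude - stop) * (1.0 - fraction))


# The endpoints quoted in the text: 25% -> 3.25, 50% -> -2.5, 100% -> +/-1.
quarter = binarize(9.0, 0.25)
half = binarize(1.0, 0.50)
full = binarize(7.3, 1.00)
```

Because the map is linear in the magnitude, the ordering of words within the positive set and within the negative set is preserved until the final step, where all magnitudes collapse to 1.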

In Figs. A.23–A.26 we observe that the binarization of the labMT dictionary produces visibly different word shift graphs: it changes which words contribute to the sentiment differences and reduces the difference in sentiment scores between the two corpora. Looking specifically at Fig. A.26, the top 5 words in the control word shift graph are bad, no, movie, worst, and war. In the binarized version, the top 5 are bad, no, movie, nothing, and worst. The top 5 from the continuous dictionary move into places 1, 2, 3, 5, and 10. Examining only the positive words that increased in frequency (not all shown in the Figure), we have “3. movie (3)”, “11. like (24)”, “32. funny (102)”, “33. better (46)”, and “43. jokes (133)” in the control version, with these words’ positions in the binarized version in parentheses. In the binarized version, these top words are “3. movie (3)”, “24. like (11)”, “30. you (84)”, “36. up (126)”, “37. all (98)”, where the first number is the place in the overall list for the given labMT score list, with the place for that word in the control word shift graph in parentheses.

In Figure A.27, the F1 score is shown across this gradual, linear change to a binary dictionary. We observe that the full binarization of the labMT dictionary results in a degradation of performance, although the differences are not statistically significant.

Figure A.23: Word shift graph resulting from the 25% binarization of the labMT dictionary.
Figure A.24: Word shift graph resulting from the 50% binarization of the labMT dictionary.
Figure A.25: Word shift graph resulting from the 75% binarization of the labMT dictionary.
Figure A.26: Word shift graph resulting from the full binarization of the labMT dictionary.
Figure A.27: The direct binarization of the labMT dictionary results in a degradation of performance. The binarization is accomplished by linearly reducing the range of scores in the labMT dictionary from a centered -4 to 4 to the integer scores -1 and 1.

a.10.2 Reduced coverage

Second, to test the effect of coverage alone, we systematically reduce the coverage of the labMT dictionary and again attempt the binary classification task of identifying Movie Review polarity. Three possible strategies to reduce the coverage are (1) removing the most frequent words, (2) removing the least frequent words, and (3) removing words randomly (irrespective of their frequency of usage).
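
The three removal strategies can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `reduce_coverage`, the strategy labels, and the toy word list are hypothetical, and the dictionary words are assumed to be already sorted from most to least frequent.

```python
import random


def reduce_coverage(words_by_frequency, n_remove, strategy, seed=0):
    """Drop `n_remove` words from a list sorted from most to least frequent.

    strategy: "MF" removes the most frequent words, "LF" the least frequent,
    and "random" removes words irrespective of their frequency.
    """
    n_words = len(words_by_frequency)
    if not 0 <= n_remove <= n_words:
        raise ValueError("n_remove must be between 0 and the dictionary size")
    if strategy == "MF":
        return words_by_frequency[n_remove:]
    if strategy == "LF":
        return words_by_frequency[:n_words - n_remove]
    if strategy == "random":
        rng = random.Random(seed)  # seeded so that a trial can be repeated
        removed = set(rng.sample(range(n_words), n_remove))
        return [w for i, w in enumerate(words_by_frequency) if i not in removed]
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical toy dictionary, most frequent word first.
words = ["the", "of", "movie", "love", "terrible", "flynt"]
without_most_frequent = reduce_coverage(words, 2, "MF")
without_least_frequent = reduce_coverage(words, 2, "LF")
without_random = reduce_coverage(words, 2, "random")
```

Each reduced word list would then be used to score the reviews as before, so that any change in the word shift graphs or in the F1 score is attributable to coverage alone.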

In Figs. A.28–A.46, we show the resulting word shift graphs with the control (all words included) alongside word shift graphs using the labMT dictionary with the least frequent (LF) and most frequent (MF) words removed. Each word shift graph with reduced coverage shows the number of words removed in parentheses in the title, e.g., in Fig. A.28 we see the titles “LF Reduced coverage (511)” and “MF Reduced coverage (511)” which indicate that 511 words were removed in the indicated fashion. We first observe that the difference in sentiment scores between the positive and negative movie reviews is decreased from 0.17 to 0.02–0.05 and 0.09–0.15 for the LF and MF strategies, respectively, while noting that these differences do not result in a loss of predictive accuracy (i.e., classification accuracy is not statistically significantly worsened). Examining the words in Fig. A.28 more closely, where only 5% of the words have been removed, we already observe departures in individual word contributions. Of the top 5 words in the control graph (“bad”, “no”, “movie”, “worst”, and “war”), we see only 3 of these in the top 5 for LF (all in the top 8) and only 1 in the top 5 for MF (with 2 of the 5 showing on the graph at all). In the LF graph we lose words like “don’t”, “least”, “doesn’t”, “terrible”, “awful”, “problem”, and instead see the words “the”, “of”, “i”, “is”, “have” contribute more strongly. In the MF graph we lose common words like “best”, “family”, “love”, “life”, “like” and instead see the less common words “exc