Is language evolution grinding to a halt? The scaling of lexical turbulence in English fiction suggests it is not

03/11/2015 ∙ by Eitan Adam Pechenick, et al. ∙ The University of Vermont

Of basic interest is the quantification of the long term growth of a language's lexicon as it develops to more completely cover both a culture's communication requirements and knowledge space. Here, we explore the usage dynamics of words in the English language as reflected by the Google Books 2012 English Fiction corpus. We critique an earlier method that found decreasing birth and increasing death rates of words over the second half of the 20th Century, showing death rates to be strongly affected by the imposed time cutoff of the arbitrary present and not increasing dramatically. We provide a robust, principled approach to examining lexical evolution by tracking the volume of word flux across various relative frequency thresholds. We show that while the overall statistical structure of the English language remains stable over time in terms of its raw Zipf distribution, there is evidence of an enduring “lexical turbulence”: the flux of words across frequency thresholds from decade to decade scales superlinearly with word rank and exhibits a scaling break we connect to that of Zipf's law. To better understand the changing lexicon, we examine the contributions to the Jensen-Shannon divergence of individual words crossing frequency thresholds. We also find indications that scholarly works about fiction are strongly represented in the 2012 English Fiction corpus, and suggest that a future revision of the corpus should attempt to separate critical works from fiction itself.

I Introduction

The incredible volume and free availability of the Google Books corpus Michel et al. (2011); Lin et al. (2012) make it an intriguing candidate for linguistic research. In a previous work Pechenick et al. (2015), we broadly explored the characteristics and dynamics of the English and English Fiction data sets from both the 2009 and 2012 versions of the corpus. We showed that the 2009 and 2012 English unfiltered data sets and, surprisingly, the 2009 English Fiction data set all become increasingly influenced by scientific texts throughout the 1900s, with medical research language being especially prevalent. We concluded that, without sophisticated processing, only the 2012 English Fiction data set is suitable for any kind of analysis and deduction as it stands. We also described the library-like nature of the Google Books corpus, which reflects word usage by authors with each book, in principle, represented once. Word frequency is thus a deceptive aspect of the corpus as it is not reflective of how often these words are read, as might be informed by book sales and library borrowing data, much less spoken by the general public. Nevertheless, the corpus provides an imprint of a language’s lexicon and remains worthy of study, provided all caveats are clearly understood.

In this paper, we therefore focus on the 2012 version of the English Fiction data set. Fig. 1 shows the total number of 1-grams for this data set between 1800 and 2000 (1-grams are contiguous text elements and are more general than words including, for example, punctuation). An exponential increase in volume is apparent over time with notable exceptions during major conflicts when the total volume decreases. For ease of comparison with related work, and to avoid high levels of optical character recognition (OCR) errors due to the presence of the long s (e.g., “said” being read as “faid” Pechenick et al. (2015)), we omit the first two decades and concern ourselves henceforth with 1-grams between the years 1820 and 2000. In releasing the original data set, Michel et al. (2011) noted that English Fiction contained scholarly articles about fictional works (but not scholarly works in general), and we also explore this balance here.

Figure 1: The logarithms of the total 1-gram counts for the Google Books corpus 2012 English Fiction data set. An exponential increase in volume is apparent over time with notable exceptions during wartime when the total volume decreases. (This effect is clearest during the American Civil War and both World Wars.)

Many researchers have carried out broad studies of the Google Books corpus, examining properties and dynamics of entire languages. These include analyses of Zipf’s and Heaps’ laws as applied to the corpus Gerlach and Altmann (2013), the rates of verb regularization Michel et al. (2011), rates of word “birth” and “death” and durations of cultural memory Petersen et al. (2012a), as well as an observed decrease in the need for new words in several languages Petersen et al. (2012b). However, most of the studies were performed before the release of the second version, and, to our knowledge, none have taken into account the substantial effects of scientific literature on the data sets.

Here, we are especially interested in revisiting work on word “birth” and “death” rates as performed in Petersen et al. (2012a). As we show below in Sec. II, the methods employed in Petersen et al. (2012a) suffer from boundary effects, and we suggest an alternative approach insensitive to time range choice.

We do not, however, dispute that an asymmetry exists in the changes in word use. In our earlier work Pechenick et al. (2015), we observed this asymmetry in the contributions to the Jensen-Shannon divergence (defined below) between decades, with most large contributions being accounted for by words whose relative frequencies had increased over time. In this paper, we apply a similar information-theoretic approach to examine this effect for words moving across fixed usage frequency thresholds.

We structure the paper as follows. In Sec. II, we critique the method from Petersen et al. (2012a) which examines the birth and death rates of words in an evolving, time-coded corpus. In Sec. III, we recall and confirm a similar apparent bias toward increased usage rates of words from our previous paper. We then measure the flux of words across various relative frequency boundaries (in both directions) in the second English Fiction data set. We describe the use of the largest contributions to the Jensen-Shannon divergence between successive decades from among the words crossing each boundary as signals to highlight the specific dynamics of word growth and decay over time. In Sec. IV, we display examples of these word usage changes and explore the factors contributing to the observed disparities between growth and decay. We offer concluding remarks in Sec. V.

II Critique of earlier work

In Petersen et al. (2012a), Petersen et al. examined the birth and death rates of words over time for various data sets in the first version of the Google Books corpus. They defined the birth year and death year of an individual word as the first and last year, respectively, that the given word appeared above one twentieth its median relative frequency. Excluded from consideration were words appearing in only one year and words appearing for the first time before 1700. The rates of word birth and death, respectively, were found by normalizing the numbers of births and deaths by the total number of unique words in a given year.
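The birth and death definition described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in Petersen et al. (2012a): the function name `birth_and_death`, the input layout (a dict from year to relative frequency), and the choice to take the median over the years in which the word actually appears are all assumptions made for the example.

```python
from statistics import median


def birth_and_death(freq_by_year, fraction=1 / 20, earliest_first_year=1700):
    """Return (birth_year, death_year) for one word, or None if it is excluded.

    freq_by_year: dict mapping year -> relative frequency of the word.
    Assumption: the median is taken over years with nonzero frequency.
    """
    present = {year: f for year, f in freq_by_year.items() if f > 0}
    if len(present) < 2:
        return None  # never seen, or seen in only one year
    if min(present) < earliest_first_year:
        return None  # first appearance predates the cutoff
    threshold = fraction * median(present.values())
    above = [year for year, f in present.items() if f > threshold]
    # The maximum frequency is at least the median, so `above` is never empty.
    return min(above), max(above)


# Hypothetical word: rare in 1900 and 1904, common in between.
toy = {1900: 1e-8, 1901: 4e-6, 1902: 5e-6, 1903: 6e-6, 1904: 1e-7}
print(birth_and_death(toy))  # (1901, 1903)
```

Only the first and last years above the threshold matter here, so any dips below the threshold in between are invisible to this definition.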

Results typical of all data sets included decreased birth rates and increased death rates over time. These results are not implausible, and they were noted to be qualitatively similar when one tenth the median frequency is used as a threshold. But the very specific nature of the analysis (particularly the multiple temporal restrictions on the words included in the analysis, the reliance on a particular proportion of each word’s median frequency, and the ignoring of all but the first and last crossings over this threshold) raises questions as to the robustness of the method.

Now, the common-sense interpretation of “word death” is clearly that a word falls out of usage (relatively) at a fixed point in history. Ignoring all but the first and last crossings of a threshold tied to both a word’s usage frequency and a specific time range appears to cause problems in this regard in Petersen et al. (2012a), and we find a boundary effect for death rates induced by the choice of the time range’s end point. To demonstrate this, we recreate the described analysis for the second version of English Fiction.

We note that in our analyses, the relative frequencies are coarse-grained at the level of decades (see Methods below). We excluded words appearing in only one decade (rather than year) and words appearing before the 1820s (instead of 1700). Again, this more recent initial cut-off date accounts for the high frequency of OCR errors observed before 1820. These differences with Petersen et al. (2012a) are not substantive, and allow us to re-examine their work and build out our own in meaningful ways.

We compare the birth and death rates as observed recently versus historically by performing the analysis with three different endpoints imposed: the 1950s, the 1970s, and the 1990s. We present the results of the recreation in Fig. 2 (cf. Fig. 2 in Petersen et al. (2012a)).

Figure 2: Birth and death rates, with definitions based on the method used in Petersen et al. (2012a) for the 2012 version of English Fiction as observed between the 1820s and three different end-of-history boundaries. The lower panel shows death rates are affected by the choice of when history ends. Birth rates are similarly affected by moving the start of history.

Using the 1990s cutoff, the observed birth rates are qualitatively similar to those found for various data sets (from the 2009 version of the corpus) in Petersen et al. (2012a) and display spikes in the 1890s and 1920s (top panel in Fig. 2, light gray). We see that birth rates are not affected by moving the “end of history” back to the 1950s or 1970s.

The observed death rates with the 1990s boundary (bottom panel in Fig. 2, light gray) are also similar to those found in Petersen et al. (2012a), despite the lack of deaths detected during much of the 19th century. (Recall, we ignored words originating prior to 1820.)

However, as the terminal boundary is moved back to the 1970s, what was originally a stable region between the 1910s and 1940s turns into an apparent region of gradually increasing word extinction (bottom panel in Fig. 2, gray). As the boundary is moved further back to the 1950s, the increase in death rate is no longer gradual (bottom panel in Fig. 2, dark gray). We thus see a clear dependence of the observations of the death rate on when the history of the corpus ends. Moving the “start of history” forward in time similarly affects birth rates.
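A toy example may help show why the end of history matters for this definition of death. The sketch below uses a made-up decade series for a single hypothetical word that is common, then rare for three decades, then common again; the helper names `last_period_above` and `truncate` are illustrative, and taking the median over nonzero values is an assumption.

```python
from statistics import median


def last_period_above(freqs, fraction=1 / 20):
    """Last period in which the frequency exceeds fraction * median."""
    present = {t: f for t, f in freqs.items() if f > 0}
    threshold = fraction * median(present.values())
    return max(t for t, f in present.items() if f > threshold)


def truncate(freqs, end):
    """Keep only periods up to and including `end` (the imposed end of history)."""
    return {t: f for t, f in freqs.items() if t <= end}


# Made-up relative frequencies by decade for one hypothetical word.
toy = {1900: 1e-5, 1910: 1e-5, 1920: 1e-5, 1930: 1e-7,
       1940: 1e-7, 1950: 1e-7, 1960: 1e-5, 1970: 1e-5}

print(last_period_above(toy))                  # 1970
print(last_period_above(truncate(toy, 1950)))  # 1920
```

With the full series the word is still above threshold in the final decade, but when history is cut off at the 1950s the same word appears to have died in the 1920s. Both the median (and hence the threshold) and the last crossing change with the cutoff.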

Thus, while the method in Petersen et al. (2012a) provides a reasonable approach to analyzing dynamics and asymmetries in the evolutionary dynamics of a language data set, the results for birth and death rates in Fig. 2 depend on when the experiment is performed. So motivated, we proceed to develop an approach that is robust with respect to time boundaries.

III Methods

We coarse-grain the relative frequencies in the second English Fiction data set at the level of decades, from 1820-1829 through 1990-1999, by averaging the relative frequency of each unique word in a given decade over all years in that decade. (We weight each year equally.) This allows us to conveniently calculate and sort contributions to the Jensen-Shannon divergence (defined below) of individual 1-grams between any two time periods.
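As a minimal sketch of this coarse-graining step, the function below averages yearly relative frequencies within each decade, weighting each year equally. The function name `decade_average`, the nested-dict input layout, and the treatment of a word missing from a year as having frequency 0 in that year are assumptions for illustration; the numbers in the example are made up.

```python
from collections import defaultdict


def decade_average(yearly):
    """yearly: dict year -> dict word -> relative frequency in that year.

    Returns dict decade_start -> dict word -> mean relative frequency,
    weighting each available year of the decade equally. A word missing
    from a year is treated as having frequency 0 in that year.
    """
    sums = defaultdict(lambda: defaultdict(float))
    n_years = defaultdict(int)
    for year, table in yearly.items():
        decade = year - year % 10
        n_years[decade] += 1
        for word, freq in table.items():
            sums[decade][word] += freq
    return {
        decade: {word: total / n_years[decade] for word, total in words.items()}
        for decade, words in sums.items()
    }


# Made-up yearly tables for two years of the 1820s.
yearly = {1820: {"the": 0.06, "faid": 0.002}, 1821: {"the": 0.04}}
by_decade = decade_average(yearly)
```

Averaging relative frequencies (rather than pooling raw counts) keeps a high-volume year from dominating its decade, which is what equal weighting of years implies.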

III.1 Statistical divergence between decades

As in our previous paper Pechenick et al. (2015), we examined the dynamics of the 2012 version of English Fiction by calculating contributions to the Jensen-Shannon divergence (JSD) Lin (1991) between the distributions of 1-grams in two given decades. We then used these contributions to resolve specific and important signals in the dynamics of the language. (This material, which is presented in greater detail in our previous work, is outlined below.)

Given a language with 1-gram distributions $P$ in the first decade and $Q$ in the second, the JSD between $P$ and $Q$ can be expressed as

$$D_{JS}(P\,\|\,Q) = H(M) - \frac{1}{2}\left(H(P) + H(Q)\right), \tag{1}$$

where $M = \frac{1}{2}(P + Q)$ is a mixed distribution of the two decades, and $H(P) = -\sum_i p_i \log_2 p_i$ is the Shannon entropy Shannon (2001) of the original distribution. The JSD is symmetric and bounded between 0 and 1 bit. These bounds are only observed when the distributions are identical and free of overlap, respectively.

The contribution from the $i$th word to the divergence between two decades, as derived from Eq. 1, is given by

$$D_{JS,i}(P\,\|\,Q) = -m_i \log_2 m_i + \frac{1}{2}\left(p_i \log_2 p_i + q_i \log_2 q_i\right), \tag{2}$$

where $m_i = \frac{1}{2}(p_i + q_i)$, so that the contribution from an individual word is proportional to the average frequency of the word and also depends on the ratio between the smaller and average frequencies. To elucidate the second dependency, we reframe the contribution as

$$D_{JS,i}(P\,\|\,Q) = m_i\, C(r_i), \qquad C(r_i) = \frac{1}{2}\left(r_i \log_2 r_i + (2 - r_i)\log_2 (2 - r_i)\right), \tag{3}$$

where $r_i = p_i / m_i$ (by the symmetry of $C$, the same value results from using $q_i / m_i = 2 - r_i$).

Words with larger average frequencies yield larger contribution signals as do those with smaller ratios, $r_i$, between the frequencies. A common 1-gram changing subtly can produce a large signal. So can an uncommon or new word given a sufficient shift from one decade to the next. $C(r_i)$, the proportion of the average frequency contributed to the signal, is concave (up) and symmetric about $r_i = 1$, where the frequency remains unchanged yielding no contribution. If a word appears or disappears between two decades (e.g., $p_i = 0$ in the former case), then the contribution is maximized at precisely the average frequency of the word in question.
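The per-word contributions can be computed directly from two decade-level frequency tables. The sketch below assumes the standard base-2 form of the divergence, in which the contribution of word i is -m_i log2 m_i + (1/2)(p_i log2 p_i + q_i log2 q_i) with m_i = (p_i + q_i)/2, and also evaluates the factor C(r) = (1/2)(r log2 r + (2 - r) log2(2 - r)) with r = p_i / m_i. The function names and the toy distributions are illustrative, not taken from the paper.

```python
import math


def xlog2x(x):
    """x * log2(x), with the usual convention 0 * log2(0) = 0."""
    return x * math.log2(x) if x > 0 else 0.0


def jsd_contributions(p, q):
    """p, q: dicts word -> relative frequency. Returns word -> contribution in bits."""
    out = {}
    for word in set(p) | set(q):
        p_i, q_i = p.get(word, 0.0), q.get(word, 0.0)
        m_i = 0.5 * (p_i + q_i)
        out[word] = -xlog2x(m_i) + 0.5 * (xlog2x(p_i) + xlog2x(q_i))
    return out


def c_of_r(r):
    """Proportion of the average frequency m_i that contributes, for r in [0, 2]."""
    return 0.5 * (xlog2x(r) + xlog2x(2.0 - r))


# Toy distributions: one shared word, one word that vanishes, one that appears.
p = {"shared": 0.5, "old_word": 0.5}
q = {"shared": 0.5, "new_word": 0.5}
contrib = jsd_contributions(p, q)
total_jsd = sum(contrib.values())
```

In this toy case the shared word contributes nothing, while the vanishing and appearing words each contribute exactly their average frequency (0.25), consistent with C(0) = C(2) = 1 and C(1) = 0.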

III.2 Exploring asymmetric dynamics

We observed in Pechenick et al. (2015) that most large JSD contribution signals are due to words whose relative frequencies increase over time. In this paper, we confirm and explore this effect.

We texture our observations by examining JSD signals due to words crossing various relative frequency thresholds in either direction, as well as the total volume of word flux in either direction across these thresholds. It is both convenient and consistent to record flux over relative frequency thresholds instead of rank thresholds. To demonstrate this consistency, we observe in Fig. 3 that rank threshold boundaries correspond to nearly constant relative frequency thresholds, with the exception of the top 1-gram (always the comma), which decreases gradually in relative frequency. For thresholds of $10^{-5}$ and below, we omit signals corresponding to references to specific years, since such references would otherwise overwhelm the charts for these thresholds.
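Counting flux across a relative frequency threshold between two decades can be sketched as follows. The function names, the convention that a word sitting exactly at the threshold counts as above it, and the rule that any four-digit numeric token is a reference to a year are all assumptions for illustration; the frequencies in the example are made up.

```python
def is_year(token):
    """Hypothetical filter: treat any four-digit numeric token as a year reference."""
    return len(token) == 4 and token.isdigit()


def threshold_flux(earlier, later, threshold, skip_years=False):
    """Return (rising, falling): sets of words crossing `threshold` between decades.

    earlier, later: dicts word -> relative frequency (missing words count as 0).
    """
    rising, falling = set(), set()
    for word in set(earlier) | set(later):
        if skip_years and is_year(word):
            continue
        before = earlier.get(word, 0.0) >= threshold
        after = later.get(word, 0.0) >= threshold
        if after and not before:
            rising.add(word)
        elif before and not after:
            falling.add(word)
    return rising, falling


# Made-up frequencies for two consecutive decades.
earlier = {"old_word": 2e-4, "new_word": 6e-5, "1984": 2e-6, "the": 5e-2}
later = {"old_word": 8e-5, "new_word": 1.5e-4, "1984": 3e-5, "the": 5e-2}

up, down = threshold_flux(earlier, later, 1e-4)
up_low, down_low = threshold_flux(earlier, later, 1e-5, skip_years=True)
```

The sizes of the two returned sets give the upward and downward flux volumes, and the words themselves can then be ranked by their JSD contributions.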

Figure 3: Rank threshold boundaries correspond to nearly constant relative frequency threshold boundaries over many orders of magnitude, with the exception of the top 1-gram (always a comma), which decreases in relative frequency. The observed stability demonstrates that Zipf’s law remains largely unchanged for the English Fiction (2012) data set, even though individual words may vary greatly in rank over time.

IV Results and Discussion

As seen in Fig. 4, more than half of the JSD between a typical given decade and the next is due to contributions from words increasing in relative usage frequency. The JSDs between the 1820s, 1840s, and 1970s and their successive decades are the only exceptions. Moreover, when the time differential is increased to three decades, no exceptions remain. This confirms that an asymmetry exists between signals for words increasing and decreasing in relative use. We note relative extrema of the inter-decade JSD in the vicinity of major conflicts. Between the 1860s and successive decades, words on the rise contribute substantially to the JSD. This is consistent with words not relatively popular during wartime (specifically the American Civil War) being used more frequently in peacetime. A similar tendency holds for the JSD between the 1910s (World War I) and the 1920s. This is not as apparent in the JSD between the 1910s and the 1940s, possibly because the 1940s coincide with World War II. The absolute maximum for the single-decade curve corresponds to the divergence between the 1950s and 1960s. This suggests a strong effect from social movements. (For the 3-decade split, the absolute peak comes from the JSD between the 1940s and 1970s.)
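The share of the JSD attributable to rising words can be sketched by summing per-word contributions separately for words whose relative frequency increased. The code below assumes the standard base-2 per-word contribution, -m log2 m + (1/2)(p log2 p + q log2 q) with m = (p + q)/2; the function name `rising_share` and the toy distributions are illustrative only.

```python
import math


def _xlog2x(x):
    """x * log2(x), with 0 * log2(0) taken to be 0."""
    return x * math.log2(x) if x > 0 else 0.0


def rising_share(earlier, later):
    """Fraction of the JSD between two decades due to words with increased frequency.

    earlier, later: dicts word -> relative frequency (missing words count as 0).
    Returns NaN if the two distributions are identical.
    """
    total = rising = 0.0
    for word in set(earlier) | set(later):
        p_i, q_i = earlier.get(word, 0.0), later.get(word, 0.0)
        m_i = 0.5 * (p_i + q_i)
        contribution = -_xlog2x(m_i) + 0.5 * (_xlog2x(p_i) + _xlog2x(q_i))
        total += contribution
        if q_i > p_i:
            rising += contribution
    return rising / total if total > 0 else float("nan")


# Symmetric toy case: one word vanishes and one equally common word appears.
earlier = {"shared": 0.5, "old_word": 0.5}
later = {"shared": 0.5, "new_word": 0.5}
share = rising_share(earlier, later)
```

In this symmetric toy case the share is one half; values above one half indicate the bias toward rising words discussed above.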

Figure 4: Percent of JSD in English Fiction (version 2) due to words increasing in relative frequency of use for successive decades (dark gray), and decades three apart (light gray; e.g., 1990s versus 1960s). The contribution for successive decades is nearly always more than half—the exceptions are between the 1820s, 1840s, and 1970s, and their successive decades. For decades three apart, the contribution is always greater than 50%. The JSD between successive decades also shows peaks in the vicinity of major conflicts.
Figure 5: Total number of words () crossing relative frequency thresholds of , , , and in both directions between each decade and the next decade. For each threshold, the upward and downward flux roughly cancel. For either direction of flux, there appears to be little qualitative difference between the three smallest thresholds for which the downward flux between the 1950s and the 1960s is a minimum, the downward flux increases over the next two pairs of consecutive decades, then it dips again between the 1980s and 1990s. For the highest threshold, the increase between the 1960s and 1970s and the next pair of decades is more noticeable for the upward flux, as is the decrease between the last two pairs of decades.
Figure 6: Words crossing relative frequency threshold of $10^{-3}$ between consecutive decades. Signals for each pair of decades are sorted and weighted by contribution to the JSD between those decades. Bars pointing to the right represent words that rose above the threshold between decades. Bars pointing left represent words that fell. In parentheses in each title is the total percent of the JSD between the given pair of decades that is accounted for by flux over the threshold.
Figure 7: Words crossing relative frequency threshold of $10^{-4}$ between the 1970s and 1980s. Signals for each pair of decades are sorted and weighted by contribution to the JSD between those decades. Bars pointing to the right represent words that rose above the threshold between decades. Bars pointing left represent words that fell. (The first signal is the asterisk “*”.)
Figure 8: Words crossing relative frequency threshold of $10^{-4}$ between the 1980s and 1990s. See the caption for Fig. 7 for details.
Figure 9: Words (not counting references to years) crossing relative frequency threshold of $10^{-5}$ between the 1970s and 1980s. See the caption for Fig. 7 for details.
Figure 10: Words (not counting references to years) crossing relative frequency threshold of $10^{-5}$ between the 1980s and 1990s. See the caption for Fig. 7 for details.
Figure 11: Words (not counting references to years) crossing relative frequency threshold of $10^{-6}$ between the 1970s and 1980s. See the caption for Fig. 7 for details.
Figure 12: Words (not counting references to years) crossing relative frequency threshold of $10^{-6}$ between the 1980s and 1990s. See the caption for Fig. 7 for details.
Figure 13: Words crossing relative frequency threshold of $10^{-4}$ between the 1930s and 1940s. See the caption for Fig. 7 for details.
Figure 14: Words (not counting references to years) crossing relative frequency threshold of between the 1930s and 1950s. See the caption for Fig. 7 for details.
Figure 15: Words (not counting references to years) crossing relative frequency threshold of between the 1960s and 1970s. See the caption for Fig. 7 for details.

We next consider flux between decades across relative frequency thresholds of powers of 10 from down to .

In Fig. 5, we display the volume of flux of 1-grams in both directions across relative frequency thresholds of powers of 10 from down to . We first describe the very limited flux across the and boundaries (not shown in Fig. 5), and then investigate the richer transitions for the lower thresholds for , , , and .

Flux across the $10^{-2}$ boundary between consecutive decades is almost nonexistent during the observed period. Between the 1820s and 1830s, the semicolon falls below the threshold. Between the 1840s and 1850s, “I” rises above the boundary. Between the 1910s and 1920s, “was” rises across. This is the entirety of the flux across $10^{-2}$, which shows the regime of 1-grams above this frequency (roughly the top 10 1-grams) is quite stable. The eleven 1-grams above the $10^{-2}$ threshold in the 1990s, in decreasing order of frequency, are: the comma “,”, the period “.”, “the”, quotation marks, “to”, “and”, “of”, “a”, “I”, “in”, and “was”.

The set of 1-grams with relative frequencies above $10^{-3}$ (roughly the top 100 1-grams) is also fairly stable. The flux of 1-grams across this boundary between consecutive decades is entirely captured by Fig. 6. Parentheses drop in (relative frequency of) use between the 1840s and 1850s and cross back over the threshold after the American Civil War (between the 1860s and 1870s). The same holds before and after World War II (between the 1930s and 1940s and between the 1940s and 1950s, respectively). Beyond these, the flux is entirely due to proper words (not punctuation). For example, “made” fluctuates up and down over this threshold repeatedly over the course of a century. Between the 1870s and the 1880s, “made”, which sees slightly increased use, is the only word to cross the threshold. The largest number of crossings is 12, which occurs between the first two decades. Also, “great” struggled over the first 5 decades and eventually failed to remain great by this measure. “Mr.” fluctuated across the threshold between the 1830s and 1910s. More recently (since the 1930s), “They” has been going through its paces up and down across the threshold.

For each threshold between and , the upward and downward flux roughly cancel, which is consistent with Fig. 3. For both upward and downward flux, there appears to be little qualitative difference between the three smallest thresholds. For these thresholds, the downward flux between the 1950s and the 1960s is a minimum, the downward flux increases over the next two pairs of consecutive decades, then it dips again between the 1980s and 1990s. For the highest threshold, the increase between the 1960s and 1970s and the next pair of decades is more noticeable for the upward flux, as is the decrease between the last two pairs of decades.

In the experiment recreated in Fig. 2, the word birth rate initially exceeds the death rate by three orders of magnitude, and this gap declines gradually over the next two centuries. However, with respect to words fluctuating across relative frequency thresholds in opposite directions, we see no strong evidence of such marked asymmetry during any long period of time. With respect to total contributions to the JSD between consecutive decades, there is typically some bias toward words with increased relative use as seen in Fig. 4, but the difference need never be described in orders of magnitude.

To address the fluctuations during the last couple of decades, we begin by displaying in Fig. 7 the top 60 flux words between the 1970s and the 1980s sorted by contributions to the JSD between those decades. Note that this pair of decades corresponds to both a dip (below 50%) in the proportion of rising word contributions to the JSD and to an increase in the volume of downward flux (as well as upward flux for high thresholds). In Fig. 8, we show all 55 flux words between the 1980s and the 1990s.

Between each pair of decades, we see reduced relative use of particularly British words, including “England” between the first two decades and “King”, “George”, and “Sir” between the latter two. We also see reduced use of more formal-sounding words, such as “character”, “manner”, and “general” between the first two decades and “suppose”, “indeed”, and “hardly” between the latter two. Increasing are physical and emotional words. Those between the first two decades include “stared”, “breath”, “realized”, “shoulder” and “shoulders”, “coffee”, “guess”, “pain”, and “sorry.” Between the latter two, we see “chest”, “skin”, “whispered”, “hit”, “throat”, “hurt”, “control”, and “lives.” Also included are “phone” and “parents.”

In Figs. 9 and 10, we display the top 60 flux words, not counting references to years, across the $10^{-5}$ threshold between the same decades. Many of the words declining below the threshold between the 1970s and 1980s are unusual spellings such as “tho”, proper names like “Balzac”, or words from non-English languages like “une.” Increasing across this threshold between the first two decades are a plethora of mostly female proper names, with “Jessica” and “Megan” leading. Also seen are “KGB” and “jeans.” (“KGB” decreases in the 1990s, as does “Russians.”) Increasing between the 1980s and 1990s are a few proper names; however, most of the signals here are social and sexual in nature, and in part point to the inclusion of academic literary criticism. These include “lesbian” and “lesbians”, “AIDS”, and “gender” in the top positions. Also included are both “homosexuality” and the more general “sexuality.” We also see “girlfriend”, “boyfriend”, “feminist”, and “sexy.”

For contrast, we show in Fig. 12 the flux across a threshold of $10^{-6}$ between the 1980s and 1990s (again, not counting years). In particular, while increases in “HIV” and “bisexual” make the list (similarly to many signals in Fig. 10), as do “fax”, “laptop”, and “Internet”, a great swath of the signals are accounted for by one franchise. We note increases in “Picard”, “TNG”, “Sisko”, and “DS9.” These latter signals should serve as a reminder that the word distributions in the library-like Google Books corpus Pechenick et al. (2015), even for fiction, do not remotely resemble the contents of normal conversations (at least not for the general population). However, we do observe signals arising at this threshold from factors external to the imaginings of specific authors. It would therefore be premature to dismiss the contributions at this threshold because of an apparent overabundance of “Star Trek.” In fact, since “The Next Generation” and “Deep Space 9” aired precisely during these two decades, an abundance of “Star Trek” novels in the English Fiction data set is actually quite encouraging, because these novels do exist, are available in English, and are (clearly) fiction.

For consistency, we also include the flux (omitting years) across this threshold between the 1970s and 1980s in Fig. 11. While not particularly topical, we do see “AIDS” increase above this threshold a decade prior to its increase over $10^{-5}$ as seen in Fig. 10.

The texture of the signals changes as we dial down the frequency threshold. We typically find that thresholds of $10^{-4}$ and above produce signals with little to no noise. This is not surprising since this relative frequency roughly corresponds to the rank threshold for the 1000 most common words (see Fig. 3) in the data set. Using a threshold of $10^{-5}$ (fewer than 10,000 words fall above this frequency in any given decade), we see some noise (mostly in the form of familiar names), but still observe many valuable signals. Only when the threshold is reduced to $10^{-6}$ does the overall texture of the signals become questionable as a result of a variety of proper nouns far less familiar than those observed with the previous threshold. However, at this threshold, we also observe several early signals of real social importance.

Curiously, between the 1930s and 1940s the volume of flux across each threshold is not atypical (see Fig. 5). Moreover, the asymmetry between the JSD contributions between those decades is very low. Yet it is obvious that we should expect signals of historical significance between these two decades. In Figs. 13 and 14, we see words crossing a higher and a lower relative frequency threshold, respectively (with references to years omitted in Fig. 14). For the higher threshold, only 56 words cross. The most noticeable such words that are more commonly used in the 1940s are “General” and “German.” Also, “killed” appears in this list. Words used less frequently include “pleasure”, “garden”, and “spirit.” For the lower threshold, we see the signals from prolific authors as in our previous paper Pechenick et al. (2015), particularly Upton Sinclair’s character, Lanny Budd. We also see more Nazis (“Nazi” and “Nazis”).

Last, we include one of the more colorful examples. In Fig. 15, we show signals (not including years) for words crossing the threshold between the 1960s and 1970s. Profanity dominates. We see more references to The World According to Garp (“Garp”) and “Star Trek”, again (“Kirk” this time). We also see more “computer”, “TV”, and “plastic.” Signals also appear for “blacks” and “homosexual”, for drugs (“drug” and “drugs”), and (plausibly) for the War on Drugs (“enforcement” and “cop”).

We refer the reader to our paper’s Online Appendices at http://www.compstorylab.org/share/papers/pechenick2015b/ for figures representing flux across relative frequency thresholds of $10^{-4}$, $10^{-5}$, and $10^{-6}$ between consecutive decades over the entire period analyzed (the 1820s to the 1990s).

V Concluding remarks

We recall from Petersen et al. (2012b) and from our own work Pechenick et al. (2015) (Fig. 7d) that the rate of change of a given language tends to slow down over time. This applies to the 2012 English Fiction data set and is not contested by us in the present paper. In the critiqued paper Petersen et al. (2012a), it was suggested that the birth and death rates of words can be calculated in an intuitive, albeit very specific manner. This experiment produces birth rates that begin vastly higher than death rates with both rates converging over time to around 1%. However, we have seen that these rates converge to roughly the same values at the end of the available history, regardless of when that is; i.e., the experiment depends on when you perform it, and recent results always appear qualitatively similar.

Beyond this boundary issue, we find another cause for concern. When the increased usage bias in the JSD contributions and the overall and directed volumes of flux are taken into account, we do not observe even the initial orders-of-magnitude gap between so-called birth and death rates. Rather, the JSD bias toward increased relative use of words is within one order of magnitude, and the flux across thresholds is typically balanced.

In fact, this latter point appears to be a fundamental facet of this data set. As we see in Fig. 3, the number of words above each threshold is roughly constant. This stability of the rank-frequency relation compels the observed balancing act (and is consistent with a stable Zipf law distribution Zipf (1949)). Previously in Pechenick et al. (2015) (Fig. 5d), we have seen the divergence between a given year and a target year tends to increase gradually with the time difference. This is not true when, for example, the target year—e.g., 1940—falls during a major war, in which case we see a spike in divergence. However, as the target year exits this period—e.g. enters the 1950s—the spike settles back into the original gradual growth pattern. It is plausible based on these earlier observations and the observations in this paper that the distribution of the language is self-stabilizing: the overall shape of the distribution does not appear to change drastically with time or with the total volume of the data set. As old words fall out of favor, new words inevitably appear to fill in the gaps.

Furthermore, despite the fact that the divergence between consecutive years has been observed to decay over time, we find no shortage of novel word introductions during the most recent decades (which have the lowest decade-to-decade JSDs). This apparent dissonance clearly invites further investigation.

Finally, while extremely specific fiction can be of great interest, whether it be in the form of war novels or volumes from the “Star Trek” franchise, vocabulary from these works is more easily studied when placed in proper context. Dialing down the relative frequency threshold across several orders of magnitude helps to capture this distinction. However, further experimentation is called for, since an automatic means of separating specific signals from the more general signals (e.g., “Star Trek” from social movements) could allow a more intuitive grasp of the linguistic dynamics and might, ideally, allow investigators to hypothesize causal relationships between exogenous and endogenous drivers of the language.

Acknowledgements.
We thank Simon DeDeo for helpful discussions. PSD was supported by NSF CAREER Award # 0846668.

References

  • Michel et al. (2011) J.-B. Michel, Y. K. Shen, A. P. Aiden, A. Veres, M. K. Gray, J. P. Pickett, D. Hoiberg, D. Clancy, P. Norvig, J. Orwant, et al., Science 331, 176 (2011).
  • Lin et al. (2012) Y. Lin, J.-B. Michel, E. L. Aiden, J. Orwant, W. Brockman, and S. Petrov, in Proceedings of the ACL 2012 System Demonstrations (Association for Computational Linguistics, 2012), pp. 169–174.
  • Pechenick et al. (2015) E. A. Pechenick, C. M. Danforth, and P. S. Dodds, PLoS ONE 10, e0137041 (2015).
  • Gerlach and Altmann (2013) M. Gerlach and E. G. Altmann, Physical Review X 3, 021006 (2013).
  • Petersen et al. (2012a) A. M. Petersen, J. Tenenbaum, S. Havlin, and H. E. Stanley, Scientific Reports 2 (2012a).
  • Petersen et al. (2012b) A. M. Petersen, J. N. Tenenbaum, S. Havlin, H. E. Stanley, and M. Perc, Scientific Reports 2 (2012b).
  • Lin (1991) J. Lin, IEEE Transactions on Information Theory 37, 145 (1991).
  • Shannon (2001) C. E. Shannon, ACM SIGMOBILE Mobile Computing and Communications Review 5, 3 (2001).
  • Zipf (1949) G. K. Zipf, Human Behaviour and the Principle of Least-Effort (Addison-Wesley, Cambridge, MA, 1949).