
Dating Texts without Explicit Temporal Cues

This paper tackles temporal resolution of documents, such as determining when a document is about or when it was written, based only on its text. We apply techniques from information retrieval that predict dates via language models over a discretized timeline. Unlike most previous work, we rely solely on temporal cues implicit in the text. We consider both document-likelihood and divergence based techniques and several smoothing methods for both of them. Our best model predicts the mid-point of individuals' lives with a median error of 22 years and a mean error of 36 years for Wikipedia biographies from 3800 B.C. to the present day. We also show that this approach works well when training on such biographies and predicting dates both for non-biographical Wikipedia pages about specific years (500 B.C. to 2010 A.D.) and for publication dates of short stories (1798 to 2008). Together, our work shows that, even in the absence of temporal extraction resources, it is possible to achieve remarkable temporal locality across a diverse set of texts.




1 Introduction

Temporal analysis of text has been an active area of research since the early days of text mining, with a different focus across disciplines. Early computational linguistics research was primarily concerned with the fine-grained ordering of temporal events [Allen1983, Vilain1982]. Information retrieval research has focused largely on time-sensitive document ranking [Dakka et al.2008, Li and Croft2003], temporal organization of search results [Alonso et al.2009], and how queries and documents change over time [Kulkarni et al.2011].

This paper explores temporal analysis models that use ideas present in both computational linguistics and information retrieval. While some prior research has focused on extracting explicit mentions of temporal expressions [Alonso et al.2009], we investigate the feasibility of using text alone to assign timestamps to documents. Following previous document dating work [de Jong et al.2005, Kanhabua and Nørvåg2008, Kumar et al.2011], we construct supervised language models that capture the temporal distribution of words over chronons, which are contiguous atomic time spans used to discretize the timeline. Each chronon model is smoothed by interpolation with the entire training collection. For each test document, a unigram language model is computed and used to find the document's similarity with each chronon's language model. This provides a ranking over chronons for the document, representing the document's likelihood of being similar to the time periods covered by each chronon [de Jong et al.2005, Kanhabua and Nørvåg2008].

Our chronon models are learned from Wikipedia biographies spanning 3800 B.C. to 2010 A.D. Wikipedia-based training is advantageous since its recency enables us to control for stylistic vs. content factors influencing vocabulary use (e.g. consider the difference between William Mavor's 1796 discussion of Sir Walter Raleigh vs. a modern retrospective biography). This contrasts with resources such as the Google n-grams corpus [Michel et al.2010], which is based on publication dates and thus reflects information about when a document was written rather than what it is about.

Our methods, all of which use the Wikipedia biographies for training models, are evaluated on three tasks. The first is matched to the training data: predict the mid-point of an individual's life based on the text in his or her Wikipedia biography. Our best model achieves a median error of 22 years and a mean error of 36 years. The second task is to predict the year for a set of events between 500 B.C. and 2010 A.D., using Wikipedia's pages for events in each year. The best model gives a mean error of 36 years and a median error of 21 years. The final task is predicting the publication dates of short stories from the Gutenberg project from the period 1798 to 2008. In comparison to biographies, these stories have far fewer mentions of historical named entities with the peaked time signatures useful for prediction. This, plus the difference in genre between Wikipedia biographies (training) and works of fiction (testing), stands to make this task more challenging. However, the distributions learned from the biographies prove to be quite robust here: our best model achieves a mean error of 20 years and a median error of 17 years from the true publication date.

Our primary contribution is demonstrating the robustness and informativity of the implicit temporal cues available in text alone, across a diverse set of three prediction tasks. We do so for document collections spanning hundreds to thousands of years, whereas previous work has generally focused on relatively short periods (decades) in recent time spans. Note that we use a robust temporal expression identifier for English, HeidelTime [Strötgen and Gertz2010], to identify and remove dates from all texts for both training and testing. While one could exploit a resource such as HeidelTime to perform rule-based document dating (possibly in combination with our methods and others such as [Chambers2012]), this work demonstrates that text-based techniques can be used effectively for languages for which such temporal extraction resources are not available (HeidelTime has resources only for English, German and Dutch).

A second contribution is a thorough exploration of the information retrieval approach for this task, including consideration of three different techniques for smoothing chronon language models and a comparison of generative (document-likelihood) and KL-divergence models for identifying the best chronon for a test document. We find that straightforward Jelinek-Mercer smoothing (basic linear interpolation) works best, and that document likelihood and KL-divergence based approaches perform similarly.

A specific task of interest in digital humanities is to identify and visualize text sequences relating to the same time period across a collection of books. Our approach can be used to timestamp subsequences of documents, which could be book-length narratives or works of fiction, without explicit dates.

2 Related Work

Corpora for temporal evaluation. With increased focus on temporal analysis, there have been efforts to create richly annotated corpora to train and evaluate temporal models: TimeBank [Pustejovsky et al.2003] and Wikiwars [Mazur and Dale2010] were created to provide a common set of corpora for evaluating time-sensitive models. Loureiro11-GIS use these corpora to resolve geographic and temporal references in text, while Chambers:Jurafsky:2008 use them to model event structures.

Semantic based temporal models. Time-sensitive models have also been developed using semantic properties of data. [Grishman et al.2002] use semantic properties of web data to create and automatically update a database of infectious disease outbreaks. Other, simpler approaches have been explored to analyse literary and historical documents as well as recent datasets such as tweets and search queries. Time-based analysis of historical texts provides important information about how significant events unfolded on a temporal scale. The Google N-Grams viewer, which uses word counts from millions of books and their corresponding publication dates, provides plots of n-gram word sequences over a timeline [Michel et al.2010]. This gives useful insights into historical trends of events/topics and writing styles. Time-based analysis of tweets has gained popularity in recent years, especially to capture currently trending topics for tracking news items and market sentiment [Zhang et al.2010].

Time aware latent models. Another approach to temporal text analysis is latent variable based graphical models. Dynamic Topic Models [Blei and Lafferty2006] are used to analyze the evolution of topics over time in a large document collection [Wang et al.2008]. Wang:McCallum:2006 analyse variations in topic occurrences over large corpora for a fixed time period. Manning:Hall:2008 investigate the history of ideas in a research field through latent variable approaches. Chi:Zhu:2007 use graphical models for temporal analysis of blogs, and [Zhang et al.2010] provide clustering techniques for time-varying text corpora through hierarchical Dirichlet processes for modeling time sensitivity.

Temporal analysis using conventional language models. Time-based text analysis has been explored using conventional language model based approaches for various applications, e.g. time-sensitive query interpretation [Li and Croft2003, Dakka et al.2008], time-based presentation of search results [Alonso et al.2009], and modelling query and document changes over time [Kulkarni et al.2011]. [Li and Croft2003], one of the early temporal language models, use explicit document dates to estimate a more informative document prior. More recently, Dakka10 propose models for identifying important time intervals likely to be of interest for a query, incorporating document publication date into the ranking function. AlonsoCIKM09 use explicit temporal metadata and expressions as attributes to cluster documents and create timelines for exploring search results.

Document dating—the task of this paper—is a closely related problem. deJong05 follow a language model based approach to assign dates to Dutch newspaper articles from 1999-2005 by partitioning the timeline into discrete time periods. Kanhabua08ecdl extend this work to incorporate temporal entropy and search statistics from Google Zeitgeist. These approaches [de Jong et al.2005, Kanhabua and Nørvåg2008] normalize the evidence for each chronon by the whole collection. chambers:doctimestamps improve over these by including linguistic constraints such as NER, POS tagging and regular expression based temporal relation constraints (e.g. "after", "before", etc.) and training a MaxEnt classifier. Kanhabua:Romano:2012 use linguistic features such as sentence length, context, and the entity list in a document to discover events over Twitter and assign time stamps, framing it as a binary classification problem with relevant and non-relevant classes. However, all these approaches target small time ranges (6-10 years), whereas our datasets span around 5,000 years, over which the evidence would die down after normalization. Kumar11-cikm use divergence based methods and non-standard smoothing on Wikipedia biographies for the same task. We perform our experiments on two of their datasets, Wikipedia biographies and Gutenberg short stories, and we compare their smoothing method with standard Jelinek-Mercer and Dirichlet smoothing.

3 Document Collections

Our models are trained and evaluated on three datasets (all three will be released upon publication, including the processing and extraction needed for replication of experiments).

Wikipedia biographies (wiki-bio). The English Wikipedia dump of September 4, 2010 is used to obtain biographies of individuals who lived between the years 3800 B.C. and 2010 A.D. We extract the lifetime of each individual via each article's Infobox birth_date and death_date fields. We exclude biographies which specify neither field or which fall outside the year range considered. If the birth date is missing, we approximate it as 100 years before the death date (similarly and conversely when the death date is missing). We perform this only to estimate the word distributions in the training set; all such documents are discarded for validation and test. We treat the life span of each individual as the article's labeled time span. Note that the distribution of biographies is quite skewed toward recent times, as shown in Figure 1.
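The endpoint imputation above can be sketched as follows (a minimal illustration; the helper name is ours, the 100-year constant follows the description in the text):

```python
def labeled_span(birth_year, death_year):
    """Return the labeled (birth, death) span for a biography,
    imputing a missing endpoint as 100 years from the known one.
    Used only for training; such documents are dropped for dev/test."""
    if birth_year is None and death_year is None:
        return None  # unusable even for training
    if birth_year is None:
        birth_year = death_year - 100
    if death_year is None:
        death_year = birth_year + 100
    return (birth_year, death_year)
```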

Figure 1: Graph of number of births per year in the Wikipedia biography training set.
Year       Sample Text
400 B.C.   The Carthaginians occupy Malta.
           War breaks out between Sparta and Elis.
           San Lorenzo Tenochtitlán is abandoned.
           Thucydides, Greek historian, dies.
           The catapult is invented by Greek engineers.
50 A.D.    Claudius adopts Nero.
           Phaedrus, Roman fabulist, dies.
           The Epistle to the Romans is written.
           Abgarus of Edessa, king of Osroene, dies.
           Hero of Alexandria invents the steam turbine.
1000 A.D.  Dhaka, Bangladesh, is founded.
           The Diocese of Kołobrzeg is founded.
           Garcia IV of Pamplona dies.
           Gunpowder is invented in China.
           Middle Horizon period ends in the Andes.
2000 A.D.  Tate Modern Gallery opens in London.
           Tuvalu joins the United Nations.
           The last Mini is produced in Longbridge.
           The Constitution of Finland is rewritten.
           Patrick O'Brian, English writer, dies.
Table 1: Sample text from four different years in the wiki-year dataset.

The resulting dataset contains a total of 280,867 Wikipedia biographies of individuals whose lifetimes begin and end within the year range considered (3800 B.C. to 2010 A.D.). These biographies are randomly split into subsets for training, development, and testing. We remove documents from the development and test sets if either their birth_date or death_date is missing. This leaves us with 224,476 training articles, 8,358 development articles and 8,440 test articles.

Wiki-year pages (wiki-year). Wikipedia has a collection of pages corresponding to various years, each describing the events that occurred in that year. Each page has the corresponding year as its label, and the text contains all the events that occurred in that year; some examples are shown in Table 1. Pages for years before 500 B.C. at times contain events that span several years, so we restrict the documents to those from 500 B.C. to 2010 A.D. (For example, the events "Proto-Greek invasions of Greece." and "Minoan Old Palace (Protopalatial) period starts in Crete." are present in the text for 1878 as well as 1880 B.C.; these occurred around 1880 B.C. but their exact dates are unknown.)

The 2,511 documents for this span are divided into even years for development (1256 documents) and odd years for testing (1255 documents).

Table 1 shows random sample lines from four wiki-year pages. The lines are terse, and the text as a whole contains very few temporal expressions.

Gutenberg short stories (gutss). We collected 678 English short stories published between 1798 and 2008, obtained from the Gutenberg Project. Whereas with Wikipedia biographies we use labeled time spans corresponding to lifetimes, Gutenberg stories are labeled by publication year. The average, minimum and maximum word counts of these stories are (roughly) 14,000, 11,000 and 100,000 respectively. Stories are randomly split into a development and test set of 333 and 345 documents, respectively.

Notation. We refer to biographies, stories and Wiki-Year pages alike as documents, and each dataset as defining a document collection C consisting of documents: C = {d_1, …, d_|C|}.

4 Model

Similar to previous work, we represent continuous time via discrete units. Our formalization most closely follows that of Alonso et al. [Alonso et al.2009]. The smallest temporal granularity we consider in this work is a single year.

4.1 Chronon model estimation

Let a span of multiple, contiguous years be some interval [s, e], where s and e refer to start and end years, respectively. As noted in §3, we also know the year range covered by each document collection and restrict our overall timeline correspondingly to the span [s_C, e_C], covering a total of N years.

A chronon is an atomic interval upon which a discrete timeline is constructed [Alonso et al.2009]. In this paper, a chronon consists of k years, where k is a tunable parameter. Given k, the timeline is decomposed into a sequence of contiguous, non-overlapping chronons c_1, …, c_n, where n = ⌈N/k⌉.

A “pseudo-document” d̃_j is created for each chronon c_j as the concatenation of all training documents whose labeled span overlaps c_j. For example, for a chronon size k=25 years, the biography of Abraham Lincoln (1809-1865) is included in the pseudo-documents for each of the chronons representing 1800-1825, 1826-1850, and 1851-1875.
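Pseudo-document construction can be sketched as below, assuming chronons are laid out from the start of the timeline in fixed k-year blocks (function and variable names are illustrative):

```python
from collections import defaultdict

def build_pseudo_documents(docs, timeline_start, k):
    """Map each chronon index to the concatenated tokens of every
    training document whose labeled span overlaps that chronon.

    docs: iterable of (tokens, (start_year, end_year)) pairs.
    """
    pseudo = defaultdict(list)
    for tokens, (start, end) in docs:
        # indices of the first and last chronons the span touches
        first = (start - timeline_start) // k
        last = (end - timeline_start) // k
        for j in range(first, last + 1):
            pseudo[j].extend(tokens)
    return dict(pseudo)
```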

A chronon model θ_j is estimated from the pseudo-document d̃_j and smoothed via interpolation with the collection. Chronon models are smoothed in three ways: a) Jelinek-Mercer smoothing (JM) [Zhai and Lafferty2004], b) Dirichlet smoothing [Zhai and Lafferty2004], and c) chronon-specific smoothing (CS) [Kumar et al.2011]. For all three, for each word w, P(w|θ_j) can be computed as a mixture of document and document collection maximum-likelihood (ML) estimates:

P(w|θ_j) = (1 − λ) · f(w, d̃_j)/|d̃_j| + λ · f(w, C)/|C|

where f(w, d̃_j) and f(w, C) denote the frequency of word w in the document or collection respectively, |d̃_j| and |C| are the document and collection lengths, and the parameter λ specifies the smoothing strength. In the case of Jelinek-Mercer smoothing, the value of λ is chosen directly via tuning over values from zero to one.

With Dirichlet smoothing, λ is chosen as:

λ = μ / (|d̃_j| + μ)

where μ is a hyper-parameter tuned on the development set.

Chronon-specific smoothing, in turn, is a special case of Dirichlet smoothing where:

μ = β · |V(d_i, d̃_j)|

where V(d_i, d̃_j) denotes the document-chronon specific vocabulary for some collection document d_i and pseudo-document d̃_j, and β is a prior for the hyper-parameter μ that is tuned on the development set.
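The JM and Dirichlet variants can be sketched directly from the mixture formulation above; the two differ only in how the mixture weight is set (names are illustrative):

```python
from collections import Counter

def jm_prob(w, chronon, collection, lam):
    """Jelinek-Mercer: fixed-weight mixture of the chronon and
    collection maximum-likelihood estimates."""
    d_len = sum(chronon.values())
    c_len = sum(collection.values())
    p_doc = chronon[w] / d_len if d_len else 0.0
    return (1.0 - lam) * p_doc + lam * collection[w] / c_len

def dirichlet_lambda(chronon, mu):
    """Dirichlet smoothing: the mixture weight depends on the
    pseudo-document length, lambda = mu / (|d| + mu)."""
    return mu / (sum(chronon.values()) + mu)
```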

4.2 Estimating P(c_j | d_i)

We calculate the affinity between each chronon c_j and a document d_i by estimating the discrete distribution P(c_j | d_i). The mid-point of the most likely chronon (see Section 4.1) is then returned as the model's predicted year. We define two primary models for estimating P(c_j | d_i). The first approach estimates the likelihood of d_i for each chronon; via Bayes rule, this is combined with a chronon prior to calculate the likelihood of each chronon for d_i. The second approach ranks chronons based on the divergence between latent unigram distributions θ_{d_i} and θ_{c_j} [Lafferty and Zhai2001a].

Ranking by document likelihood

The language modeling approach for information retrieval was originally formulated as query-likelihood [Ponte and Croft1998]. For our task, the document is the “query” for which we wish to rank chronons. We refer to this approach as document-likelihood (DL).

We estimate P(c_j | d_i) via Bayes Rule. Assuming unigram modeling, the likelihood of a test document d_i is given by:

P(d_i | c_j) = ∏_{w ∈ d_i} P(w | θ_j)^{f(w, d_i)}

where the parameters of θ_j are estimated from the chronon c_j's pseudo-document d̃_j, as described in Section 4.1.
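In practice the product is computed in log space to avoid underflow. A minimal sketch, assuming `prob(w)` returns a smoothed (hence nonzero) chronon probability:

```python
import math

def doc_log_likelihood(doc_counts, prob):
    """log P(d | c_j) under a unigram chronon model.
    doc_counts maps word -> frequency in the test document."""
    return sum(n * math.log(prob(w)) for w, n in doc_counts.items())
```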

Just as informed document priors (e.g. PageRank or document length) inform traditional document ranking in information retrieval, an informed prior over chronons has the potential to benefit our task as well. We adopt a chronon prior intuitively informed by the distribution of training documents over chronons:

P(c_j) = n_j / Σ_{k'=1}^{n} n_{k'}

where n_j is the number of dated training documents overlapping with chronon c_j.
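The prior is just the per-chronon overlap counts normalized to sum to one (a minimal sketch with illustrative names):

```python
def chronon_prior(overlap_counts):
    """P(c_j) proportional to the number of dated training documents
    overlapping chronon c_j; overlap_counts[j] holds that count."""
    total = sum(overlap_counts)
    return [n / total for n in overlap_counts]
```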

Ranking by model comparison

Zhai and Lafferty [Lafferty and Zhai2001b] propose ranking via the KL-divergence between a query and each collection document. Kumar et al. Kumar11-cikm use this approach to compute P(c_j | d_i), which is estimated by computing the inverse KL-divergence of θ_{d_i} and θ_{c_j} and normalizing this value by the sum of the inverse divergences with all chronons:

P(c_j | d_i) = KL(θ_{d_i} ‖ θ_{c_j})^{-1} / Σ_{k'} KL(θ_{d_i} ‖ θ_{c_{k'}})^{-1}

It is straightforward to see that their formulation is rank equivalent to standard model comparison ranking with negative KL-divergence [de Jong et al.2005, Kanhabua and Nørvåg2008]:

score(c_j; d_i) = −KL(θ_{d_i} ‖ θ_{c_j})
Lafferty and Zhai showed such ranking is equivalent to generating the query (i.e. query-likelihood) assuming a uniform document prior and the query model being estimated by relative frequency  [Lafferty and Zhai2001b]. This means that for our task, if we adopt a uniform prior over chronons and estimate the document model by relative frequency, then KL-ranking and document-likelihood approaches will be rank equivalent.
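The rank equivalence is easy to check numerically: since 1/x is monotone decreasing for x > 0, sorting chronons by normalized inverse KL and by negative KL produces the same order. A sketch with illustrative names (models are dicts over a shared vocabulary; the chronon side is assumed smoothed, so it is nonzero wherever the document model is, and no KL value is exactly zero):

```python
import math

def kl(p, q):
    """KL(p || q); q must be nonzero wherever p is, which the
    smoothing of chronon models guarantees."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

def rank_by_inverse_kl(doc_model, chronon_models):
    # Kumar et al.'s normalized inverse-KL formulation
    scores = {j: 1.0 / kl(doc_model, q) for j, q in chronon_models.items()}
    norm = sum(scores.values())
    return sorted(scores, key=lambda j: scores[j] / norm, reverse=True)

def rank_by_neg_kl(doc_model, chronon_models):
    # standard model-comparison ranking
    return sorted(chronon_models,
                  key=lambda j: -kl(doc_model, chronon_models[j]),
                  reverse=True)
```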


Having determined P(c_j | d_i), we choose the mid-point of the most likely chronon; for a chronon c_j = [s_j, e_j], the mid-point is ⌊(s_j + e_j)/2⌋.
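Putting the pieces together, prediction is an argmax over chronon scores followed by taking the winning chronon's mid-point (a sketch with illustrative names):

```python
def predict_year(chronon_scores, timeline_start, k):
    """Return the mid-point year of the highest-scoring chronon.
    chronon_scores maps chronon index -> score (e.g. log-posterior
    or negative KL-divergence)."""
    best = max(chronon_scores, key=chronon_scores.get)
    start = timeline_start + best * k
    end = start + k - 1
    return (start + end) // 2
```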

5 Experimental Setup


To test the ability of word-based models to predict timestamps for documents, all temporal expressions identified in each document using the HeidelTime temporal tagger [Strötgen and Gertz2010] are removed. All numeric tokens and standard stopwords are also removed. The remaining tokens produce a vocabulary of 374,973 words for the entire Wikipedia biography corpus. HeidelTime also provides the first two dates present in the text, which we use as a strong baseline for the biography task.
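Temporal expression removal itself relies on HeidelTime (a Java tool), but the remaining preprocessing is a simple token filter, which could look roughly like this (the stopword list is an assumption):

```python
def filter_tokens(tokens, stopwords):
    """Drop standard stopwords and numeric tokens; temporal
    expressions are assumed already removed by the tagger."""
    return [t for t in tokens
            if t.lower() not in stopwords
            and not any(ch.isdigit() for ch in t)]
```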

Tuning and smoothing

For each model and task, we tune the parameters k, λ, μ, and β over the development set of the corresponding dataset. As in prior work [de Jong et al.2005, Kanhabua and Nørvåg2008, Kumar et al.2011], we smooth chronon pseudo-document language models (for all models and smoothing techniques) but not document models. While smoothing both may potentially help, smoothing the former is strictly necessary for KL-divergence to prevent division by zero.

Target predictions

For Wikipedia biographies, the predicted year represents the mid-point of the individual's life span; for wiki-years, it is the year of the events on the page; and for Gutenberg short stories it is the publication date of the story. In later sections we present the baseline predictions for each dataset.

Error Measurement

When predicting a single year for a document, a natural error measure between the predicted year (mid-point) ŷ and the actual year y is the absolute difference |ŷ − y|. We compute this difference for each document, then compute and report the mean and median of the differences across documents. Similar distance error measures have also been used for document geolocation [Eisenstein et al.2010, Wing and Baldridge2011].
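Both summary statistics are computed over per-document absolute differences (a minimal sketch):

```python
import statistics

def year_errors(predicted, actual):
    """Mean and median absolute error, in years, across documents."""
    errs = [abs(p, ) if False else abs(p - a) for p, a in zip(predicted, actual)]
    return statistics.mean(errs), statistics.median(errs)
```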


For Wikipedia biographies, the first baseline (baseline-ht) is the mid-point of the first two dates extracted by HeidelTime [Strötgen and Gertz2010]. This is a highly effective baseline since it is often the case in Wikipedia biographies that the first two dates are the birth and death dates. The second baseline for biographies is to always predict the year spanned by the greatest number of biographies, which is 1915 (baseline-1915). For Gutenberg stories, we take 1903, the midpoint of the range of publication dates (1798-2008), as the baseline (baseline-1903). For wiki-years, the baseline is the midpoint of the prediction range, i.e. 755 A.D. (baseline-755). This assumes that one knows a rough range of possible publication dates, which is reasonable for many applications and thus provides a good reference for comparison.

We also report oracle error, which is the mean and median error that would occur if a model always picked the correct chronon. This error arises because chronons span multiple years; large chronons in particular will have higher oracle error (but may perform better for actual prediction due to better model estimation).

Figure 2: Tuning of the chronon size k over the wiki-bio and wiki-years datasets for the KL model. β (for CS) and λ (for JM) are fixed at 0.01 and 0.99 respectively.
Figure 4: Tuning of the smoothing parameters (λ and β) over the wiki-bio and wiki-years datasets for the KL model.

6 Results

6.1 Parameter tuning

We begin with year prediction experiments on the development sets to tune the parameters λ, μ, and β. We parametrize μ as a function of the average chronon pseudo-document size in the training set: μ is a constant multiple of that average size, where the constant is dependent upon the model and the task and is tuned over the validation set.

Choice of chronon size and smoothing parameters.

We tune the chronon size k over the validation set and then tune the smoothing parameters λ, μ, and β (depending on the type of smoothing) for the best k obtained. For tuning k, we assign an arbitrary value to the smoothing parameter. The chronon size k is tuned for each dataset for the KL model with CS and JM smoothing. The DL model with Dirichlet/JM smoothing and the KL model with Dirichlet smoothing use the same best k obtained for the KL model with JM smoothing on the respective datasets. For each dataset, model and smoothing triad, the smoothing parameter λ, μ, or β is tuned. Tuning is performed to minimize the mean error on the development sets. The search space for the smoothing parameters λ, μ, and β includes { …, 0.1, 0.25, 0.75, 0.9, 0.99, …, 0.999999999 }.

Figures 2 and 4 show the tuning of k and the smoothing parameters (λ for JM and β for CS) for the wiki-bio and wiki-years datasets. All triplets formed by {KL, DL} model × {JM, Dirichlet} smoothing × {wiki-years, wiki-bio, gutss} dataset use the optimum chronon size obtained for the respective dataset from the KL model with JM smoothing.

From Figure 4, the mean error curve is generally smooth for the smoothing parameters λ and β, unlike for the chronon-size parameter k (Figure 2). This makes smoothing the LMs robust to a range of values. The chronon size shows more fluctuation even in the optimal neighborhood, which makes tuning it more critical. A straightforward strategy to reduce this sensitivity is to smooth chronon models based on the word distributions of neighboring chronons as well as interpolating with the collection model, which we intend to explore in future work. The optimal chronon sizes for the three datasets are 10 years for wiki-bio and gutss and 50 years for wiki-year.

6.2 Test results.

Table 2 shows the results for the various models on the test sets for all three datasets, using the parameters tuned on the corresponding development sets.


Wiki-bio. The models beat both baselines easily. Note that baseline-ht is quite strong for a large number of documents: it gives a median error of zero since over half of the documents have birth and death dates as their first dates. Nonetheless, it fails entirely for many documents and obviously has limited applicability. The models all reduce error by half in comparison to baseline-1915. The best model (DL + JM smoothing) achieves a mean error of 37.4 years, which is quite strong given that the prediction range is 5810 years. The mean oracle error for the best model is 2.5 years. The mean and median errors were 36.6 and 22.0 years for the best performing model (DL + JM smoothing) on the development set.


Wiki-year. The models beat baseline-755 comfortably. Despite the fact that the documents are relatively short and that any given document contains a number of often unrelated events (and thus low counts per word type), the results are in line with those for wiki-bio, with a mean error of 37.9 and median error of 20 years for the best models. The mean oracle error for this dataset, 12.4 years, is higher due to the larger chronon size. The KL model with JM smoothing provided the best mean and median errors of 36.7 and 21.0 years respectively on the development set.


Gutss. All models except the one that uses chronon-specific smoothing with KL-divergence outperform baseline-1903 on mean error, and even that one is better on median. Since these are works of fiction with few historical entities mentioned, the mean error of 22.9 and median error of 19.0 years of the best models indicate that the approach is quite capable of exploiting the implicit temporal cues of basic vocabulary choices. Also, recall that the model is trained on Wikipedia; this demonstrates that this choice of training set works well as the basis for predictions in other domains. The mean oracle error (for chronons of 10 years) is 2.5 years. For the development set, the mean and median errors were 20.4 and 17.0 years for the best performing model (DL + JM smoothing).

Data       Model (Smooth.)   Mean    Median
wiki-bio   baseline-ht       306.6   0.0
           baseline-1915     81.1    38.5
           KL (β=…)          42.8    22.5
           KL (λ=0.999)      37.4    22.5
           KL (μ=…)          38.1    22.0
           DL (λ=0.999)      37.4    22.5
           DL (μ=…)          38.0    22.0
wiki-year  baseline-755      627     627
           KL (β=0.99)       143.6   30.0
           KL (λ=0.25)       37.9    20.0
           KL (μ=0.01)       60.6    22.0
           DL (λ=0.50)       37.9    20.0
           DL (μ=0.01)       52.1    20.0
gutss      baseline-1903     37      50
           KL (β=…)          39.6    19.0
           KL (λ=0.999)      22.9    19.0
           KL (μ=…)          37.3    22.0
           DL (λ=0.999)      22.9    19.0
           DL (μ=…)          37.4    23.0
Table 2: Test set results (mean and median errors in years). λ=JM, β=CS, and μ=Dirichlet smoothing. DL uses the non-uniform chronon prior. The best results are bolded, and the results of the best model on the corresponding development set are italicized.

6.3 Output analysis

Using the output on the development set, we find interesting patterns in the predictions made by the models and the way they use the words as evidence.

Time warps

flickRwormholes:2010 used geotags on Flickr images to identify wormholes—locations that are not physically near but which are nonetheless similar to one another. We observe similar patterns, in our case time warps, in our dataset. These are particularly prominent in wiki-year documents due to their terseness, as they are lists of events that happened in a given year. Moreover, the models trained on the wiki-bio set add to this phenomenon, as the contexts of the two datasets are slightly different. A cluster of dev event years from between 250 and 150 A.D. (e.g. wiki-years 234, 214, 152, 156, etc.) are predicted to be in the 2nd century B.C. (200 B.C. to 150 B.C.) by our model. These event-year documents are very short, with an average length of 40-50 words per document. The discriminatory tokens present in these texts include: Roman, Empire, Kingdom, Han, Dynasty, China, Seleucid, Greek, etc. In the 200-150 B.C. period, all the documents in the training set are about Greek/Seleucid, Roman and Chinese (mostly Han dynasty) emperors and personalities (e.g. Attalus I, Eratosthenes, Plautus, Emperor Gaozu of Han, Emperor Hui of Han, Zhang Qian, Emperor Wen of Han, etc.) and contain prominent terms similar to the wiki-year event texts. This common collection of terms pushes the model to resolve the wiki-year texts to the 2nd century B.C. This happens because of the relative frequency of such terms in B.C. versus A.D.: although these terms are present in the A.D. chronons, their proportion with respect to other terms is much smaller. Test documents that contain these terms are thus attracted to the B.C. chronons, which have these terms in generally higher proportion.

meriwether komatsu capote cranmer payload
morelos kido stopes sap laila
hem shakuntala anthrax scooby crayon
plutarch sampaguita woodbury untimely teleplay
tele electorates derivatives polygram wavelength
Table 3: Top 25 most predictive words in descending order of strength (left to right and top to bottom) from the wiki-bio dev set.
oneself primari ssu thebes porphyry
lysias confucius morality romana matteo
unbroken goodness timpul tarii grout
sinop cynical tub crates lantern
bite phila transaction corporeal conciliation

Table 4: Bottom 25 least predictive words in descending order (left to right and top to bottom) from the wiki-bio dev set.

Another interesting cluster is short documents containing similar terms from 200-800 A.D. that are resolved to the mid-6th century A.D. The short wiki-year texts (e.g. years 246, 486, 750, 822, etc.) contain co-occurring terms like Byzantine, Empire, Roman, Arab, Conquest, Islam, and Caliphate. These short year-event texts mostly describe Byzantine wars and emperors, the Islamic/Arab conquest, caliphates, etc. They are resolved to the mid-6th century A.D. period, which predominantly contains biographies of Islamic caliphs (e.g. Abd al-Malik, Abu Bakr, Ali, Umar, etc.) and Byzantine emperors and prominent personalities (e.g. Maurice, Fausta, Constans II, etc.), with predominant terms such as Byzantine, Empire, Caliph, Islam, and conquest.

Discriminative Words

Tables 3 and 4 show the top and bottom 25 words in descending order of their strengths, where the predictive strength score of a word w is calculated as the average prediction error of all the documents that contain w (a lower average error means a more predictive word). The majority of the most predictive words are uncommon nouns, especially uncommon last names or famous titles, e.g. capote, komatsu, and cranmer. Words such as tele, wavelength, electorates, teleplay, and sap (the company) also have a strong temporal connection, as these came into use only in the 19th century or later. The least predictive words are mostly common words such as goodness, oneself, morality, tub, crates, and lantern. The uncommon words among the least predictive are generally present in just one or two documents for which our model performs very poorly. It is highly likely that these words induce the time warps described above due to their predominance and uniqueness.
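The word-strength score described above can be computed directly from per-document prediction errors (a sketch; names are illustrative):

```python
from collections import defaultdict

def word_strengths(doc_words, doc_errors):
    """Average absolute prediction error over the documents containing
    each word; a lower average marks a more predictive word."""
    total = defaultdict(float)
    count = defaultdict(int)
    for words, err in zip(doc_words, doc_errors):
        for w in set(words):  # count each word once per document
            total[w] += err
            count[w] += 1
    return {w: total[w] / count[w] for w in total}
```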

7 Conclusion

Using words alone, it is possible to identify the time period a document is about (via the Wikipedia datasets) or the time period in which it was written (via the Gutenberg dataset). In the former case, named entities dominate the texts, and their names provide strong evidence for particular historical periods. In the latter, the texts are fictional (including science fiction) and rarely mention historical entities; for these, general terms indicative of a given time period dominate the prediction. Interestingly, the models used (successfully) for this latter task are trained on Wikipedia biographies about historical individuals, yet those biographies were written in the last decade.
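The dating approach summarized above can be sketched as language models over a discretized timeline: train a unigram model per time chunk and assign a new document to the chunk under which it is most likely. This is a deliberately minimal sketch with add-one smoothing and invented toy periods and texts; the paper's models use more careful smoothing and divergence-based variants.

```python
import math
from collections import Counter

def train(period_docs):
    """period_docs: {period_label: list of token lists} -> per-period unigram counts."""
    return {p: Counter(tok for doc in docs for tok in doc)
            for p, docs in period_docs.items()}

def date_document(models, tokens, vocab_size=1000):
    """Return the period whose add-one-smoothed unigram model gives the
    highest log-likelihood to the token sequence."""
    def loglik(counts):
        total = sum(counts.values())
        return sum(math.log((counts[t] + 1) / (total + vocab_size))
                   for t in tokens)
    return max(models, key=lambda p: loglik(models[p]))

# Toy timeline with two chunks and a handful of period-typical words.
models = train({
    "600s": [["caliph", "byzantine", "conquest"], ["caliph", "empire"]],
    "1900s": [["teleplay", "wavelength"], ["teleplay", "electorates"]],
})
period = date_document(models, ["byzantine", "caliph"])
```

With finer-grained chunks (e.g. 50-year bins) and real training text, the argmax chunk serves as the predicted date, from which mid-point errors like those reported above can be measured.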

The predictions made by our models provide a natural counterpart to other temporally sensitive models of word choice, such as Dynamic Topic Models (DTMs)  [Blei and Lafferty2006]. DTMs assume that documents are labeled with dates; our model could thus be used to create labels for an otherwise un-dated set of documents which can then be analyzed with DTMs. An important aspect of our work is that it opens opportunities for analyzing sub-parts of documents, such as chapters, sections and paragraphs of books. Consider, for example, Samuel Goodrich’s “The Second Book of History” from 1840, which covers thousands of years of history for many parts of the world.

Of course, many texts include explicit dates, and exploiting their presence via approaches such as that of [Chambers2012] would only strengthen our predictions. Explicit dates also create opportunities for weaker but more pinpointed supervision: strings identified as dates with high confidence can serve as pivots for learning word distributions. This would obviate the need for labeled training material such as Wikipedia biographies and thereby enable our methods to be used and adapted for a wide variety of genres. Given decent temporal expression identifiers for other languages, this could be used to bootstrap models for those languages as well.
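The pivot idea above can be illustrated with a deliberately crude sketch: match explicit year strings with a regular expression, treat each matched year as a noisy date label, and accumulate word counts per time bucket. The regex, bucket size, and example texts are illustrative assumptions, not the pipeline proposed here.

```python
import re
from collections import Counter, defaultdict

YEAR = re.compile(r"\b(1[0-9]{3}|20[0-2][0-9])\b")  # crude 1000-2029 matcher

def pivot_counts(texts, bucket=50):
    """Bucket each text by its first explicit year and count its words,
    yielding per-bucket word distributions without any labeled corpus."""
    buckets = defaultdict(Counter)
    for text in texts:
        m = YEAR.search(text)
        if not m:
            continue  # no pivot date found: text contributes nothing
        start = (int(m.group()) // bucket) * bucket
        words = re.findall(r"[a-z]+", text.lower())
        buckets[start].update(words)
    return buckets

b = pivot_counts(["The telegraph arrived in 1844 .",
                  "Printed around 1455 in Mainz."])
```

The resulting per-bucket counts could then feed the same timeline language models, trading Wikipedia labels for self-supervision from the text itself.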


Acknowledgments

This work was partially supported by a grant from the Morris Memorial Trust Fund of the New York Community Trust and a Temple Fellowship.


References

  • [Allen1983] James F. Allen. 1983. Maintaining knowledge about temporal intervals. Commun. ACM, 26(11):832–843, November.
  • [Alonso et al.2009] Omar Alonso, Michael Gertz, and Ricardo Baeza-Yates. 2009. Clustering and exploring search results using timeline constructions. In Proceedings of the 18th ACM conference on Information and knowledge management, CIKM ’09, pages 97–106, New York, NY, USA. ACM.
  • [Blei and Lafferty2006] David M. Blei and John D. Lafferty. 2006. Dynamic topic models. In Proceedings of the 23rd international conference on Machine learning, ICML ’06, pages 113–120, New York, NY, USA. ACM.
  • [Chambers and Jurafsky2008] Nathanael Chambers and Daniel Jurafsky. 2008. Jointly combining implicit constraints improves temporal ordering. In EMNLP, pages 698–706. ACL.
  • [Chambers2012] Nathanael Chambers. 2012. Labeling documents with timestamps: Learning from their time expressions. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 98–106, Jeju Island, Korea, July. Association for Computational Linguistics.
  • [Chi et al.2007] Yun Chi, Shenghuo Zhu, Xiaodan Song, Junichi Tatemura, and Belle L. Tseng. 2007. Structural and temporal analysis of the blogosphere through community factorization. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’07, pages 163–172, New York, NY, USA. ACM.
  • [Clements et al.2010] Maarten Clements, Pavel Serdyukov, Arjen P. de Vries, and Marcel J. T. Reinders. 2010. Finding wormholes with flickr geotags. In Proceedings of the 32nd European conference on Advances in Information Retrieval, ECIR’2010, pages 658–661, Berlin, Heidelberg. Springer-Verlag.
  • [Dakka et al.2008] Wisam Dakka, Luis Gravano, and Panagiotis G. Ipeirotis. 2008. Answering general time sensitive queries. In Proceedings of the 17th ACM conference on Information and knowledge management, CIKM ’08, pages 1437–1438, New York, NY, USA. ACM.
  • [de Jong et al.2005] Franciska de Jong, Henning Rode, and Djoerd Hiemstra. 2005. Temporal Language Models for the Disclosure of Historical Text. In Humanities, computers and cultural heritage: Proceedings of the XVIth International Conference of the Association for History and Computing (AHC 2005), pages 161–168. Royal Netherlands Academy of Arts and Sciences.
  • [Eisenstein et al.2010] Jacob Eisenstein, Brendan O’Connor, Noah A. Smith, and Eric P. Xing. 2010. A latent variable model for geographic lexical variation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP ’10, pages 1277–1287, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • [Grishman et al.2002] Ralph Grishman, Silja Huttunen, and Roman Yangarber. 2002. Real-time event extraction for infectious disease outbreaks. In Proceedings of the second international conference on Human Language Technology Research, pages 366–369, San Diego, California. Morgan Kaufmann Publishers Inc.
  • [Hall et al.2008] David Hall, Daniel Jurafsky, and Christopher D. Manning. 2008. Studying the history of ideas using topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’08, pages 363–371, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • [Kanhabua and Nørvåg2008] Nattiya Kanhabua and Kjetil Nørvåg. 2008. Improving temporal language models for determining time of non-timestamped documents. In Proceedings of the 12th European conference on Research and Advanced Technology for Digital Libraries, ECDL ’08, pages 358–370, Berlin, Heidelberg. Springer-Verlag.
  • [Kulkarni et al.2011] Anagha Kulkarni, Jaime Teevan, Krysta M. Svore, and Susan T. Dumais. 2011. Understanding temporal query dynamics. In Proceedings of the fourth ACM international conference on Web search and data mining, WSDM ’11, pages 167–176, New York, NY, USA. ACM.
  • [Kumar et al.2011] Abhimanu Kumar, Matthew Lease, and Jason Baldridge. 2011. Supervised language modeling for temporal resolution of texts. In Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM), pages 2069–2072.
  • [Lafferty and Zhai2001a] J. Lafferty and C. Zhai. 2001a. Document language models, query models, and risk minimization for information retrieval. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 111–119. ACM.
  • [Lafferty and Zhai2001b] John Lafferty and Chengxiang Zhai. 2001b. Document language models, query models, and risk minimization for information retrieval. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’01, pages 111–119, New York, NY, USA. ACM.
  • [Li and Croft2003] Xiaoyan Li and W. Bruce Croft. 2003. Time-based language models. In Proceedings of the twelfth international conference on Information and knowledge management, CIKM ’03, pages 469–475, New York, NY, USA. ACM.
  • [Loureiro et al.2011] Vitor Loureiro, Ivo Anastácio, and Bruno Martins. 2011. Learning to resolve geographical and temporal references in text. In Isabel F. Cruz, Divyakant Agrawal, Christian S. Jensen, Eyal Ofek, and Egemen Tanin, editors, GIS, pages 349–352. ACM.
  • [Mazur and Dale2010] Pawel Mazur and Robert Dale. 2010. WikiWars: A new corpus for research on temporal expressions. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 913–922, Cambridge, MA, October. Association for Computational Linguistics.
  • [Michel et al.2010] Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. 2010. Quantitative analysis of culture using millions of digitized books. Science, online, 16.12.2010.
  • [Kanhabua et al.2012] Nattiya Kanhabua, Sara Romano, and Avaré Stewart. 2012. Identifying relevant temporal expressions for real-world events. In Proceedings of The SIGIR 2012 Workshop on Time-aware Information Access, Portland, Oregon.
  • [Ponte and Croft1998] Jay M. Ponte and W. Bruce Croft. 1998. A language modeling approach to information retrieval. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’98, pages 275–281, New York, NY, USA. ACM.
  • [Pustejovsky et al.2003] James Pustejovsky, Patrick Hanks, Roser Sauri, Andrew See, David Day, Lisa Ferro, Robert Gaizauskas, Marcia Lazo, Andrea Setzer, and Beth Sundheim. 2003. The TimeBank corpus. Corpus Linguistics, pages 647–656.
  • [Strötgen and Gertz2010] Jannik Strötgen and Michael Gertz. 2010. Heideltime: High quality rule-based extraction and normalization of temporal expressions. In Proceedings of the 5th International Workshop on Semantic Evaluation, SemEval ’10, pages 321–324, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • [Vilain1982] Marc B. Vilain. 1982. A system for reasoning about time. In David L. Waltz, editor, AAAI, pages 197–201. AAAI Press.
  • [Wang and McCallum2006] Xuerui Wang and Andrew McCallum. 2006. Topics over time: a non-markov continuous-time model of topical trends. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’06, pages 424–433, New York, NY, USA. ACM.
  • [Wang et al.2008] Chong Wang, David M. Blei, and David Heckerman. 2008. Continuous time dynamic topic models. In David A. McAllester and Petri Myllymäki, editors, UAI, pages 579–586. AUAI Press.
  • [Wing and Baldridge2011] Benjamin Wing and Jason Baldridge. 2011. Simple supervised document geolocation with geodesic grids. In Dekang Lin, Yuji Matsumoto, and Rada Mihalcea, editors, ACL, pages 955–964. The Association for Computer Linguistics.
  • [Zhai and Lafferty2004] Chengxiang Zhai and John Lafferty. 2004. A study of smoothing methods for language models applied to information retrieval. ACM Trans. Inf. Syst., 22(2):179–214.
  • [Zhang et al.2010] Jianwen Zhang, Yangqiu Song, Changshui Zhang, and Shixia Liu. 2010. Evolutionary hierarchical dirichlet processes for multiple correlated time-varying corpora. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’10, pages 1079–1088, New York, NY, USA. ACM.