Mapping Topic Evolution Across Poetic Traditions

06/28/2020 ∙ by Petr Plechac, et al. ∙ 0

Poetic traditions across languages evolved differently, but we find that certain semantic topics occur in several of them, albeit sometimes with a temporal delay or with diverging trajectories over time. We apply Latent Dirichlet Allocation (LDA) to poetry corpora in four languages: German (52k poems), English (85k poems), Russian (18k poems), and Czech (80k poems). We align and interpret salient topics and their trends over time (1600–1925 A.D.), showing similarities and disparities across poetic traditions for a few select topics, and use their trajectories to pinpoint specific literary epochs.

1 Corpora & Model

To determine the evolution of topics across poetic traditions, we collect four poetry corpora in Czech, Russian, German, and English. Table 1 gives an overview of the corpora and their sources. As these corpora are often contaminated with foreign-language poems, we filter these out with langdetect (https://pypi.org/project/langdetect/).

Language  Poems  Tokens  Source
Czech     80k    15M     Corpus of Czech Verse (http://versologie.cz)
Russian   18k    2.7M    Poetic subcorpus of Russian National Corpus (http://ruscorpora.ru)
German    52k    12M     German Poetry Corpus, Textgrid + Deutsches Textarchiv (http://github.com/thomasnikolaushaider/DLK)
English   85k    22M     Project Gutenberg, mined with GutenTag 'Poetry' (https://gutentag.sdsu.edu/)
Table 1: Diachronic Poetry Corpora
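The language filtering step mentioned above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the function names `filter_by_language` and `make_langdetect_detector` are hypothetical, and the detector is passed in as a callable so that the filtering logic does not depend on langdetect being installed.

```python
def filter_by_language(poems, expected_lang, detect):
    """Keep only the poems whose detected language equals expected_lang.

    poems         -- iterable of poem texts (strings)
    expected_lang -- language code such as "de", "en", "ru", "cs"
    detect        -- callable mapping a text to a language code
                     (assumption: langdetect.detect or a compatible function)
    """
    kept = []
    for text in poems:
        try:
            if detect(text) == expected_lang:
                kept.append(text)
        except Exception:
            # langdetect raises an exception for texts without usable
            # features (e.g. empty strings); such texts are skipped here.
            continue
    return kept


def make_langdetect_detector(seed=0):
    """Return langdetect's detect function with a fixed seed.

    langdetect is non-deterministic by default; setting
    DetectorFactory.seed makes repeated runs give the same result.
    """
    from langdetect import DetectorFactory, detect  # third-party package
    DetectorFactory.seed = seed
    return detect
```

With langdetect installed, a German corpus could then be cleaned with `filter_by_language(poems, "de", make_langdetect_detector())`. Whether short poems are detected reliably would need to be checked on the data.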

To learn semantic topics, Latent Dirichlet Allocation (LDA) [1] has proved useful. We use the LdaMulticore implementation provided in gensim (https://radimrehurek.com/gensim/models/ldamulticore.html) [4]. LDA assumes that a particular document contains a mixture of a few salient topics, each characterized by semantically related words.

We transform our documents to a bag-of-words representation, set the number of topics to 100, and train for 100 epochs (passes) to attain reasonably distinct topics. We choose 100 topics because previous research on poetic topics [2], [3] found this value to be optimal for distant reading.

Because two of the languages (Czech, Russian) are highly inflected, we use lemmas instead of word forms. For lemmatization and POS tagging we use TreeTagger [5] for English and German and MorphoDiTa [7] for Czech; for lemmatization of Russian we use MyStem [6]. In Czech, German, and English, all parts of speech except nouns, adjectives, and verbs were filtered out. In Russian, we use the stopword list provided by the NLTK library, manually extended by us.
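The preprocessing and training setup described above might look roughly like the following sketch. It assumes that each poem is already available as a list of (lemma, POS) pairs with coarse tags "NOUN", "ADJ", and "VERB"; the taggers named above use their own language-specific tagsets, so this tag set and the helper names `content_lemmas` and `train_lda` are illustrative assumptions. Only the gensim calls (`Dictionary`, `doc2bow`, `LdaMulticore`) are taken from the library itself.

```python
# Assumed coarse tag names; real tagger output would need to be mapped to these.
CONTENT_POS = {"NOUN", "ADJ", "VERB"}


def content_lemmas(tagged_poem, keep=CONTENT_POS):
    """Reduce a poem to the lemmas of its nouns, adjectives, and verbs.

    tagged_poem -- list of (lemma, pos) pairs for one poem
    """
    return [lemma for lemma, pos in tagged_poem if pos in keep]


def train_lda(lemmatized_poems, num_topics=100, passes=100, workers=None):
    """Train an LDA model on poems given as lists of lemmas.

    num_topics=100 and passes=100 follow the values stated in the text.
    Returns the model together with the dictionary and bag-of-words corpus.
    """
    # Imported lazily so the filtering helper works without gensim installed.
    from gensim.corpora import Dictionary
    from gensim.models import LdaMulticore

    dictionary = Dictionary(lemmatized_poems)
    corpus = [dictionary.doc2bow(poem) for poem in lemmatized_poems]
    model = LdaMulticore(
        corpus=corpus,
        id2word=dictionary,
        num_topics=num_topics,
        passes=passes,
        workers=workers,
    )
    return model, dictionary, corpus
```

Because `LdaMulticore` uses multiprocessing, `train_lda` is best called from within an `if __name__ == "__main__":` block when run as a script.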

2 Experiments

We approach diachronic variation in poetry as a distant reading task to visualize the development of interpretable topics over time and across languages. We retrieve the most important (most likely) words for all topics and interpret these sorted word lists as aggregated topics. We are then able to manually translate several topics that align across all four corpora.
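Retrieving the sorted word lists for manual interpretation could be done along these lines. The sketch assumes a trained gensim LDA model (any object with a `num_topics` attribute and a `show_topic(topic_id, topn)` method returning (word, probability) pairs); the helper name `topic_word_lists` and the default of 20 words are assumptions, not values from the paper.

```python
def topic_word_lists(model, topn=20):
    """Return {topic_id: [word, ...]} with the topn most likely words per topic.

    model -- a trained gensim LDA model, or any object exposing
             num_topics and show_topic(topic_id, topn=...)
    """
    lists = {}
    for topic_id in range(model.num_topics):
        # show_topic returns (word, probability) pairs, most likely first
        lists[topic_id] = [word for word, _prob in model.show_topic(topic_id, topn=topn)]
    return lists
```

The resulting lists can be printed per language and compared side by side to find topics that correspond across corpora.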

Figure 1: Size of Corpora over Time
Figure 2: Size of Corpora over Time; log(Size) at y-axis

To discover trends over time, we bin our documents into time slots of 25 years each, except for early English, where two large slots (1600–1674 and 1675–1749) were used due to sparse data. See figures 1 and 2 for the number of documents per bin. To visualize trends of single topics over time, we follow the strategy of [2]: we aggregate all documents d in slot s, sum the probabilities of topic t given d, and divide by the number of all d in s. This gives us the average probability of a topic per time slot. We then plot the trajectories for each single topic.
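The binning and averaging can be illustrated with a short sketch. Writing D_s for the set of documents in slot s (notation introduced here for illustration), the plotted quantity is the sum of P(t|d) over all d in D_s, divided by |D_s|. The code assumes that the 25-year slots start at multiples of 25 (1800, 1825, ...) and that each document comes as a (year, topic_probs) pair; both are assumptions, as are the function names.

```python
from collections import defaultdict


def slot_start(year, language, width=25):
    """Return the first year of the time slot that contains `year`.

    Early English uses the two wide slots named in the text
    (1600-1674 and 1675-1749); everything else uses slots of
    `width` years, assumed to start at multiples of `width`.
    """
    if language == "English" and 1600 <= year <= 1749:
        return 1600 if year <= 1674 else 1675
    return year - year % width


def average_topic_probability(docs, topic_id, language, width=25):
    """Average P(topic | document) per time slot.

    docs -- iterable of (year, topic_probs) pairs, where topic_probs
            maps topic ids to probabilities; missing topics count as 0.
    Returns {slot_start_year: average probability}, ordered by slot.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for year, topic_probs in docs:
        slot = slot_start(year, language, width)
        sums[slot] += topic_probs.get(topic_id, 0.0)
        counts[slot] += 1
    return {slot: sums[slot] / counts[slot] for slot in sorted(counts)}
```

With gensim, the per-document probabilities could be obtained from `model.get_document_topics(bow, minimum_probability=0.0)` and converted with `dict(...)`; the returned mapping can then be plotted as one trajectory per topic and language.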

2.1 Literary Periods

Figure 3: Annotation of Literary Periods from antikoerperchen.de

First, for context, we give a quick overview of German literary periods. See figure 3 for an annotation of literary periods in a small German corpus of school canon poetry (158 poems, mined from antikoerperchen.de). Even though the labels are not entirely standardized, we can clearly see many literary movements and periods. We have annotations for 'Barock', ranging from 1625 to 1700; 'Aufklärung' (Enlightenment) is left out, while 'Empfindsamkeit' (Sensibility) is present with only two poems, from 1755 and 1780 respectively. Furthermore, we have the periods 'Sturm & Drang' and 'Weimarer Klassik' at the end of the 18th and the beginning of the 19th century, with Goethe and Schiller contributing to both. The latter bleeds heavily into 'Romantik' (romanticism), which begins around 1800 and ends around 1870. As this is such a long period, it contains many sub-periods; 'Realismus' (realism) is the only period that stretches from romanticism into modernity, which itself is represented here by 'Symbolismus' (symbolism) (1875–1925) and 'Expressionismus' (expressionism) (1900–1930).

2.2 Alignment across languages

Based on a few selected topics, we can trace similarities and disparities across poetic traditions. See figures 4–9 for a selection of interpretable topic trends where the four languages align.

Figure 4: Topic Nation

Figure 4 shows the topic "Nation", which has a similar trend in German, Czech, and Russian, but is not present in the English corpus (cf. the completely different geopolitical situation of the British Empire). In the German corpus it emerges in the second half of the 18th century and peaks around 1825 to 1850 (outlining the period of 'Vormärz'). The same peak can be found in the Czech corpus (late National Revival), and slightly delayed in Russian. In all three corpora the topic becomes more pronounced again at the beginning of the 20th century.

Figure 5: Topic Sea

Figure 5 shows the topic "Sea", which shows a similar rising tendency across the corpora towards the second half of the 19th century. In Russian it is also associated with the period of romanticism (1825 to 1850).

Figure 6: Topic Sleep

The topic "Sleep" (Figure 6) seems correlated with the topic "Sea" in English, German, and Russian, but it is rather marginal in the Czech corpus.

Figure 7: Topic Sorrow

Figure 7 shows the topic "Sorrow", which has clearly different trends in English and German on the one hand and in Czech and Russian on the other. In the former case it is associated with the period of romanticism (although it becomes prominent earlier in English), and in the latter with late 19th century modernism (although in Russian it already emerges in the period of romanticism, 1825 to 1850).

Figure 8: Topic Stars

Figure 8 shows the topic "Stars", which is pronounced in English and German romanticism (1800 to 1825) and in Russian romanticism (1825 to 1850). In Czech the peak occurs later, in the generation of "Máj" (1850 to 1875). Note that these authors declared themselves followers of Karel Hynek Mácha (1810–1836), who in turn is well known for bringing themes of English romanticism into Czech poetry.

Figure 9: Topic Wine

Lastly, figure 9 shows the topic "Wine", which is clearly associated with the Anacreontics. It is prominent in English poetry of the early 18th century, in German poetry of the second half of the 18th century, and in Czech poetry of the late 18th century (almanacs edited by A. J. Puchmajer). In Russian poetry it surprisingly peaks in the period of romanticism (1825 to 1850).

3 Conclusion & Future Work

We have introduced Latent Dirichlet Allocation for the visualization of topic trends across languages, illustrating the similarities and disparities between different poetic traditions. We show that some topics align closely across languages, while others show a temporal delay (as they were picked up later in another language), and some were not discussed as heavily in other traditions.

References

  • [1] D. M. Blei, A. Y. Ng, and M. I. Jordan (2003) Latent Dirichlet allocation. Journal of Machine Learning Research 3 (Jan), pp. 993–1022.
  • [2] T. N. Haider (2019) Diachronic topics in New High German poetry. Proceedings of the International Digital Humanities Conference DH2020 in Utrecht.
  • [3] B. Navarro-Colorado (2018) On poetic topic modeling: extracting themes and motifs from a corpus of Spanish poetry. Frontiers in Digital Humanities 5, pp. 15.
  • [4] R. Rehurek and P. Sojka (2011) Gensim—statistical semantics in Python.
  • [5] H. Schmid (1994) Probabilistic part-of-speech tagging using decision trees. In Proceedings of International Conference on New Methods in Language Processing, Manchester, UK.
  • [6] I. Segalovich (2003) A fast morphological algorithm with unknown word guessing induced by a dictionary for a web search engine. In MLMTA.
  • [7] J. Straková, M. Straka, and J. Hajič (2014) Open-source tools for morphology, lemmatization, POS tagging and named entity recognition. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, Maryland, pp. 13–18.