1 Corpora & Model
To trace the evolution of topics across poetic traditions, we collect four poetry corpora in Czech, Russian, German and English. See table 1 for an overview of the corpora and the sources they were mined from. As these corpora are often contaminated with foreign-language poems, we filter these out with langdetect (https://pypi.org/project/langdetect/).
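The filtering step can be sketched as follows. This is a minimal illustration rather than our exact pipeline: the function names `filter_by_language` and `langdetect_detector` are hypothetical, and the choice to drop texts on which detection fails is an assumption. The detector is passed in as a callable so that the filter itself does not depend on any particular library.

```python
from typing import Callable, Iterable, List


def filter_by_language(
    poems: Iterable[str],
    target_lang: str,
    detect: Callable[[str], str],
) -> List[str]:
    """Keep only the poems whose detected language code equals target_lang."""
    kept = []
    for text in poems:
        try:
            lang = detect(text)
        except Exception:
            # Assumption: texts the detector cannot handle (e.g. empty or
            # symbol-only input) are dropped rather than kept.
            continue
        if lang == target_lang:
            kept.append(text)
    return kept


def langdetect_detector(text: str) -> str:
    """Detector backed by the third-party langdetect package (imported lazily)."""
    from langdetect import DetectorFactory, detect

    # langdetect is non-deterministic by default; a fixed seed makes
    # repeated runs return the same language code.
    DetectorFactory.seed = 0
    return detect(text)
```

With langdetect installed, a German corpus would be filtered with `filter_by_language(poems, "de", langdetect_detector)`.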
To learn semantic topics, Latent Dirichlet Allocation (LDA) has proved useful. We use the LdaMulticore implementation provided in gensim (https://radimrehurek.com/gensim/models/ldamulticore.html). LDA assumes that a particular document contains a mixture of a few salient topics, each consisting of semantically related words.
As we also deal with highly inflected languages (Czech, Russian), we use lemmas instead of word forms. For lemmatization and POS-tagging of English and German texts we use TreeTagger, for lemmatization and POS-tagging of Czech texts we use MorphoDiTa, and for lemmatization of Russian texts we use MyStem. In Czech, German, and English, all parts of speech except nouns, adjectives, and verbs were filtered out. In Russian, stopwords were removed using the list provided by the NLTK library, manually extended by us.
We then transform our documents to a bag-of-words representation, set the number of topics to 100, and train for 100 epochs (passes) to attain a reasonable distinctness of topics. We choose 100 topics as previous research on poetic topics determined this value to be optimal for distant reading.
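The following sketch illustrates the bag-of-words representation and the training call, assuming each document is already a list of lemmas. The helpers `build_vocabulary`, `to_bow` and `train_lda` are hypothetical names introduced for illustration; the hand-rolled bag-of-words mirrors the (token id, count) format that gensim's `Dictionary.doc2bow` produces, and `train_lda` uses gensim's own classes with the settings stated above (100 topics, 100 passes). All other gensim parameters are left at their defaults, which is an assumption.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_vocabulary(docs: List[List[str]]) -> Dict[str, int]:
    """Assign an integer id to every lemma, in order of first appearance."""
    vocab: Dict[str, int] = {}
    for doc in docs:
        for lemma in doc:
            vocab.setdefault(lemma, len(vocab))
    return vocab


def to_bow(doc: List[str], vocab: Dict[str, int]) -> List[Tuple[int, int]]:
    """Convert one lemmatized document into sorted (token id, count) pairs.

    Lemmas that are not in the vocabulary are ignored.
    """
    counts = Counter(vocab[lemma] for lemma in doc if lemma in vocab)
    return sorted(counts.items())


def train_lda(docs: List[List[str]], num_topics: int = 100, passes: int = 100):
    """Train an LDA model with gensim (third-party, imported lazily)."""
    from gensim.corpora import Dictionary
    from gensim.models import LdaMulticore

    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    return LdaMulticore(
        corpus=corpus,
        id2word=dictionary,
        num_topics=num_topics,
        passes=passes,
    )
```

Word order is discarded in this representation; only the lemma counts per document enter the model.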
We approach diachronic variation in poetry as a distant reading task, with the aim of visualizing the development of interpretable topics over time and across languages. We retrieve the most important (most likely) words for all topics and interpret these sorted word lists as aggregated topics. We are then able to manually translate several topics that align over all four corpora.
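Retrieving the most likely words of a topic amounts to sorting its word distribution by probability. The sketch below assumes a topic is given as a mapping from word to probability; the function name `top_words` and the alphabetical tie-breaking are illustrative choices. In gensim, `model.show_topic(topic_id, topn=n)` returns comparable (word, probability) pairs for a trained model.

```python
from typing import Dict, List


def top_words(word_probs: Dict[str, float], n: int = 10) -> List[str]:
    """Return the n most probable words of a topic, most probable first.

    Ties are broken alphabetically so that the output is deterministic.
    """
    ranked = sorted(word_probs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]
```

The resulting word list is what a reader inspects and labels, for example as "Sea" or "Wine".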
To discover trends over time, we bin our documents into time slots of 25 years each, except for early English, where two large slots (1600–1674 and 1675–1749) were used due to sparse data. See figures 2 and 2 for a plot of the number of documents per bin. To visualize trends of individual topics over time, we follow a previously proposed strategy: we aggregate all documents d in slot s, sum the probabilities of topic t given d, and divide by the number of documents d in s. This gives us the average probability of a topic per time slot. We then plot the trajectory of each topic.
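The binning and averaging can be sketched as follows, assuming each document comes with a year and a list of topic probabilities P(t | d), one entry per topic. The names `time_slot` and `average_topic_probability` and the `origin` parameter are hypothetical; `origin` is only there so that the wider early English slots can be expressed with the same function (width 75, origin 1600).

```python
from typing import Dict, Iterable, List, Tuple


def time_slot(year: int, width: int = 25, origin: int = 0) -> int:
    """Return the first year of the slot of the given width containing year."""
    return origin + ((year - origin) // width) * width


def average_topic_probability(
    docs: Iterable[Tuple[int, List[float]]],
    width: int = 25,
    origin: int = 0,
) -> Dict[int, List[float]]:
    """Average P(t | d) over all documents d in each time slot s.

    docs yields (year, topic_probs) pairs; the result maps each slot's
    start year to one averaged probability per topic.
    """
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for year, probs in docs:
        slot = time_slot(year, width, origin)
        if slot not in sums:
            sums[slot] = [0.0] * len(probs)
            counts[slot] = 0
        for t, p in enumerate(probs):
            sums[slot][t] += p
        counts[slot] += 1
    # Divide each summed probability by the number of documents in the slot.
    return {
        slot: [total / counts[slot] for total in sums[slot]]
        for slot in sorted(sums)
    }
```

The trajectory of topic t is then the sequence of values at index t, read across the sorted slots.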
2.1 Literary Periods
First, for context, we give a quick overview of German literary periods. See figure 3 for an annotation of literary periods in a small German corpus of school canon poetry (158 poems, mined from antikoerperchen.de). Even though the labels are not entirely standardized, we can clearly see many literary movements and periods. The annotation covers ’Barock’, ranging from 1625 to 1700, and leaves out ’Aufklärung’ (Enlightenment), while ’Empfindsamkeit’ (Sensibility) is present with only two poems, from 1755 and 1780 respectively. Furthermore, we have the periods ’Sturm & Drang’ and ’Weimarer Klassik’ at the end of the 18th and the beginning of the 19th century, with Goethe and Schiller contributing to both. The latter heavily bleeds into ’Romantik’ (romanticism), which begins around 1800 and ends around 1870. Being such a long period, it contains many sub-periods; ’Realismus’ (realism) is the only period that stretches from romanticism into modernity, which itself is represented here by ’Symbolismus’ (symbolism) (1875–1925) and ’Expressionismus’ (expressionism) (1900–1930).
2.2 Alignment across languages
Figure 4 shows the topic "Nation", which has a similar trend in German, Czech, and Russian, but is not present in the English corpus (cf. the completely different geopolitical situation of the British Empire). In the German corpus it emerges in the second half of the 18th century and peaks around 1825 to 1850 (outlining the period of ’Vormärz’). The same peak can be found in the Czech corpus (late National Revival), and slightly delayed in Russian. In all three corpora the topic becomes more prominent once again at the beginning of the 20th century.
Figure 5 shows the topic "Sea", which has a similar rising tendency towards the second half of the 19th century. In Russian it is also associated with the period of romanticism (1825 to 1850).
The topic "Sleep" (Figure 6) seems correlated with the topic "Sea" in English, German, and Russian, but it is rather marginal in the Czech corpus.
Figure 7 shows the topic "Sorrow", which has clearly different trends in English and German on the one hand and in Czech and Russian on the other. In the first case it is associated with the period of romanticism (although it becomes prominent earlier in English), and in the latter with late 19th century modernism (although in Russian it already emerges in the period of romanticism, 1825 to 1850).
Figure 8 shows the topic "Stars", which is pronounced in English and German romanticism (1800 to 1825) and in Russian romanticism (1825 to 1850). In Czech the peak occurs later, in the generation of "Máj" (1850 to 1875). Note that these authors regard themselves as followers of Karel Hynek Mácha (1810–1836), who in turn is well known for bringing themes of English romanticism into Czech poetry.
Lastly, figure 9 shows the topic "Wine", which is clearly associated with the Anacreontics. It is prominent in English poetry of the early 18th century, German poetry of the second half of the 18th century, and Czech poetry of the late 18th century (almanacs edited by A. J. Puchmajer). In Russian poetry it surprisingly peaks in the period of romanticism (1825 to 1850).
3 Conclusion & Future Work
We have used Latent Dirichlet Allocation to visualize topic trends across languages, illustrating the similarities and disparities between different poetic traditions. We show that some topics align closely across languages, that some show a temporal delay (as they were picked up later in another language), and that some were not as heavily discussed in other discourses.
- Latent Dirichlet allocation. Journal of Machine Learning Research 3 (Jan), pp. 993–1022. Cited by: §1.
-  (2019) Diachronic topics in new high german poetry. Proceedings of the International Digital Humanities Conference DH2020 in Utrecht. Cited by: §1, §2.
-  (2018) On poetic topic modeling: extracting themes and motifs from a corpus of spanish poetry. Frontiers in Digital Humanities 5, pp. 15. Cited by: §1.
-  (2011) Gensim—statistical semantics in python. statistical semantics; gensim; Python; LDA; SVD. Cited by: §1.
- Probabilistic part-of-speech tagging using decision trees. In Proceedings of International Conference on New Methods in Language Processing, Manchester, UK. Cited by: footnote 3.
-  (2003) A fast morphological algorithm with unknown word guessing induced by a dictionary for a web search engine. In MLMTA, Cited by: footnote 3.
-  (2014-06) Open-source tools for morphology, lemmatization, pos tagging and named entity recognition. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, Maryland, pp. 13–18. Cited by: footnote 3.