
A Case Study and Qualitative Analysis of Simple Cross-Lingual Opinion Mining

by   Gerhard Hagerer, et al.

User-generated content from social media is produced in many languages, making it technically challenging to compare the discussed themes of one domain across different cultures and regions. This is relevant for domains in a globalized world, such as market research, where people from two nations and markets might have different requirements for a product. We propose a simple, modern, and effective method for building a single topic model with sentiment analysis capable of covering multiple languages simultaneously, based on a pre-trained state-of-the-art deep neural network for natural language understanding. To demonstrate its feasibility, we apply the model to newspaper articles and user comments of a specific domain, i.e., organic food products and related consumption behavior. The themes match across languages. Additionally, we obtain a high proportion of stable and domain-relevant topics, a meaningful relation between topics and their respective textual contents, and an interpretable representation for social media documents. Marketing can potentially benefit from our method, since it provides an easy-to-use means of addressing specific customer interests from different market regions around the globe. For reproducibility, we provide the code, data, and results of our study.



1 Introduction

Topic modeling on social media texts is difficult, since the lack of data as well as spelling and grammatical errors can render the approach unfeasible. Dealing with multiple languages at the same time adds further complexity, which often makes such approaches unusable for domain experts. Thus, we propose a cross-lingually pre-trained deep neural network as a black box that requires very little textual pre-processing before embedding the texts and deriving their cluster and topic distributions.

For our method, we leverage current research on multi-lingual topic modeling, see Section 2. In Section 3, we provide an extensive description of a simple method that supports domain experts in applying it to specific social media domains. In Section 4, we qualitatively demonstrate our topic model, its feasibility, and its cross-lingual semantic characteristics on English and German newspaper and social media texts. We aim to inspire pragmatic ideas for exploring the potential of comparative, inter-cultural market research and agenda-setting studies. Unsolved problems and future potential are given in Section 5.

Figure 1:

Plain text is first tokenized into sentences and passed to topic modeling and sentiment analysis. Topic modeling involves (1) converting sentences of both languages into embeddings with XLING, (2) clustering all embeddings with K-means, and (3) deriving a topic label for each cluster. Sentiment analysis is performed using Textblob. Topic and sentiment scores are aggregated for the analysis.

2 Related Work

Topic modeling is meant to learn thematic structure from text corpora. Building on probabilistic topic modeling methods, such as latent semantic indexing (LSI) [12] or latent Dirichlet allocation (LDA) [3], researchers have tried to extend topic modeling from a single language to multiple languages. Using multi-lingual dictionaries and translated corpora is an intuitive way to tackle cross-lingual topic modeling problems [28, 26]. Further examples exist that rely on either dictionaries or translated text collections [14, 4, 16]. However, this creates a dependence on the availability of dictionaries or on good translation quality, and significant manual labor and verification are required to prevent deteriorating noise.

Recently, methods converting words to vectors according to their semantics have been widely adopted [19]. Several studies showed that text embeddings improve topic coherence [2, 23]. Regarding multi-linguality, word-level and sentence-level embeddings enable texts in different languages to be projected into the same vector space [5] such that semantically similar texts are clustered together independently of their language. This favors studies on multi-lingual topic modeling that do not rely on dictionaries and translation [27, 6]. Although such methods provide highly coherent topics, a recreation of the word spaces is required whenever new text corpora are introduced. In our scenario, these limitations are not present.

Regarding the application of topic modeling, various social media corpora have been studied by domain experts [24, 18], covering different domains such as politics, marketing, and public health. Regarding media agenda setting, [13] studied to what degree a Russian newspaper's coverage related to economic downturn. They also "introduced embedding-based methods for cross-lingually projecting English frames to Russian" based on the Media Frames Corpus. In contrast, we propose a straightforward topic modeling method that requires no fine-tuning, only clustering, on a social media corpus. This enables further investigation of media agenda setting cross-lingually and cross-culturally.

3 Topic Modeling Method

Figure 1 shows the overall workflow of our topic modeling approach. We aim to conduct simple, cross-lingual topic modeling on user-generated content with no translation, dictionary, or parallel corpus required for aligning semantic meanings across languages. Our approach solely depends on clustering sentence embeddings for topic modeling. Ready-made sentence representations simplify the approach, since they automatically suppress overly frequent, meaningless, and unimportant words without the need to model that part explicitly [17].

3.1 Preprocessing

The raw texts of articles and comments are first tokenized into sentences with the Natural Language Toolkit (NLTK). Then, URLs, especially those enclosed in HTML <a> tags, are replaced with the string 'url'. After that, sentences shorter than 15 characters are omitted to minimize noise, since they appear inscrutable and make up only 6.6% of all sentences. After preprocessing, there are 127,464 English sentences and 200,627 German sentences, i.e., 328,091 in total.
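The steps above can be sketched as follows. This is a minimal illustration: a simple regex splitter stands in for NLTK's sent_tokenize, and the helper name preprocess is hypothetical; min_len corresponds to the 15-character threshold described above.

```python
import re

def preprocess(raw_text, min_len=15):
    # simple regex sentence splitter as a stand-in for NLTK's sent_tokenize
    sentences = re.split(r"(?<=[.!?])\s+", raw_text.strip())
    cleaned = []
    for s in sentences:
        # replace URLs (including those wrapped in HTML <a> tags) with the token 'url'
        s = re.sub(r"<a[^>]*>.*?</a>|https?://\S+", "url", s)
        # drop very short sentences, which are mostly noise
        if len(s) >= min_len:
            cleaned.append(s)
    return cleaned
```

For example, `preprocess("Organic food is popular. See https://example.com for details. Ok.")` keeps the first sentence, replaces the URL in the second, and drops the too-short "Ok.".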

Figure 2: The AIC plot indicates that k = 15 is the global minimum.

3.2 Cross-Lingual Embeddings

In the following paragraph, we provide an explanation of the pre-trained XLING model, which we use for the present work, based on the words of the authors [7]. XLING calculates "sentence embeddings that map text written in different languages, but with similar meanings, to nearby embedding space representations". Similarity is calculated mathematically as the dot product between two sentence embeddings. In order to train the model, the authors "employ four unique task types for each language pair in order to learn a function g", i.e., the eventual sentence-to-vector model. The architecture is based on a Transformer neural network [25] tailored for modeling multiple languages at once. The tasks on which the model is eventually trained are "(i) conversational response prediction, (ii) quick thought, (iii) a natural language inference, and (iv) translation ranking as the bridge task". The data for training "is composed of Reddit, Wikipedia, Stanford Natural Language Inference (SNLI), and web mined translation pairs".
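The dot-product similarity can be illustrated with a toy sketch. The 3-dimensional vectors below are purely hypothetical stand-ins (real XLING embeddings are high-dimensional and produced by the pre-trained model):

```python
import numpy as np

def xling_similarity(u, v):
    # XLING scores semantic similarity as the dot product of sentence embeddings
    return float(np.dot(u, v))

# hypothetical toy embeddings: semantically similar sentences map to nearby vectors
en = np.array([0.6, 0.8, 0.0])        # e.g. an English sentence
de = np.array([0.59, 0.80, 0.07])     # a German sentence with similar meaning
unrelated = np.array([0.0, 0.1, 0.99])
```

With unit-length vectors, the dot product equals the cosine similarity, so the cross-lingual pair scores higher than the unrelated one.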


3.3 Sentence Clustering

The K-means clustering algorithm is applied to both English and German sentence embeddings at the same time. Since XLING provides semantically aligned sentence embeddings of both languages, this joint clustering step helps to establish one topic model for two disjunct datasets irrespective of their language. Clustering is performed for a varying number of clusters k, ranging from 1 to 30. The elbow method is first used for choosing the optimal k, but the inertia (sum of squared distances of samples to their closest cluster center) decreases rapidly at the beginning and then gently with increasing k, without a significant elbow point. Therefore, the Akaike Information Criterion (AIC) is adopted, and k = 15 is chosen as the optimal value as it is the global minimum, see Figure 2. In Section 4.2, further discussion on topic coherence shows that k = 15 results in semantically coherent topics.
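The model-selection step can be sketched as follows. Note that the exact AIC formulation for K-means is not specified in the text; the penalty term 2·k·d used here is one common simplification (treating each cluster as a spherical Gaussian) and should be read as an assumption, not the paper's definitive criterion:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_aic(X, k, seed=0):
    """Fit K-means and return a simplistic AIC: inertia plus a 2*k*d penalty."""
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(X)
    d = X.shape[1]
    return km.inertia_ + 2 * k * d

# toy data: three well-separated clusters in 4 dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.05, size=(50, 4)) for c in (0.0, 1.0, 2.0)])

scores = {k: kmeans_aic(X, k) for k in range(1, 7)}
best_k = min(scores, key=scores.get)  # the global AIC minimum, as in Figure 2
```

On this toy data the AIC minimum recovers the true number of clusters; on the sentence embeddings the same scan over k = 1..30 yields the minimum at k = 15.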

3.4 Topic Labeling

To be able to derive a meaningful topic label for each sentence cluster, the respective top words of each cluster are required. In order to get the top word list, the clarity score is adopted [9]. Following [1], it ranks each term w with respect to its importance for cluster c and language l, such that

clarity(w) = t_c(w) · log( t_c(w) / t(w) ),

where t_c(w) and t(w) are the l1-normalized tf-idf scores of the word w in the sentences within cluster c and in all sentences, respectively, for a certain language.

Additionally, stopword removal from the top word lists is a concern when calculating the clarity score. Generally, stopwords are the most frequent words in the documents, and sometimes they are so dominant that they interfere with the clarity scoring results. Thus, we remove domain-specific high-frequency words for each language from the corresponding topic top word lists.

Topics are labeled manually based on the English and German top word lists. The results are shown in Table 1 and will be discussed further in Section 4.2, which evaluates topic coherence across languages.

Topic | English top words | German top words
Environment | pesticide, plant, soil, use, crop, fertilize, pesticide, garden, herbicide, grow | pflanze, pestizid, dunger, boden, gulle, garten, anbau, gemuse, tomate, feld
Retailers | store, whole, shop, groceries, supermarket, local, market, amazon, price, online, discount | aldi, supermarkt, lidl, kauf, laden, lebensmittel, cent, einkauf, wochenmarkt
GMO & organic | gmo, label, gmos, monsanto, product, certificate, usda, genetic, product | produkt, bioprodukt, lebensmittel, gesund, konventionell, biodiesel, herstellung, enthalt, monsanto, pestizid
Food products & taste | taste, milk, sugar, cook, eat, fresh, flavor, fruit, potato, sweet | kase, schmeckt, gurke, essen, analogkase, schmeckt, tomate, milch, geschmack, kochen
Food safety | chemical, cancer, body, acid, effect, cause, toxic, toxin, glyphosat, disease | dioxin, gift, grenzwert, ddt, menge, giftig, toxisch, substanz, chemisch, antibiotika
Research | science, study, scientific, research, gene, scientist, genetic, human, stanford, nature | gentechnik, natur, mensch, wissenschaft, lebenserwartung, genetisch, studie, menschlich, planet
Health & nutrition | eat, diet, healthy, nutritious, health, fat, calory, obesity, junk | lebensmittel, essen, ernahrung, gesund, nahrungsmittel, lebensmittel, nahrung, fett, billig
Politics & policy | govern, public, politic, corporate, regulation, law, obama, vote | politik, skandal, verantwortung, bundestag, schaltet, bestraft, strafe, kontrolle, kriminell
Animals & meat | meat, chicken, anim, cow, beef, egg, fed, raise, pig, grass | tier, fleisch, eier, huhn, schwein, futter, kuh, verunsichert, vergiftet, deutsch
Farming | farm, farmer, agriculture, land, sustain, crop, yield, acre, grow, local | landwirtschaft, landwirt, bau, flache, okologisch, nachhaltig, konventionell, landbau, produktion, ertrag
Prices & profit | price, consume, market, company, profit, product, cost, amazon, money | verbrauch, preis, produkt, billig, qualitat, kunde, kauf, geld, unternehmen, kosten
Table 1: Top words for all meaningful topics with k = 15 for the English and German data

Figure 3: Topic distributions with increasing number of topics k. The percentage is the share of sentences in garbage topics.

3.5 Sentiment Analysis

In addition to topic modeling, we conduct sentiment analysis to investigate the feasibility and meaning of cross-lingual topic-related sentiments in articles and their respective comment sections. We make use of Textblob and Textblob-de to assign each of the English and German pre-processed sentences a polarity score. The polarity assignment was first proposed by [21] and reimplemented by [11]. Since the subjectivity assignment is not well-developed in Textblob-de, we filter out sentences with a polarity equal to 0 for both English and German in order to derive comparable results.

3.6 Topic and Sentiment Distributions

After assigning a labeled cluster, i.e., a topic, and a sentiment score to each sentence of the corpus, we derive the corresponding distributions.

For topic distributions, all sentences from a document are counted per topic. The distribution is then normalized to be comparable. For sentiment distributions, all sentences from a document are grouped per topic. Topic-wise sentiment distribution is derived based on the sentence-wise polarity scores and the respective median and quartiles. A document in that regard is either an article or all of its comments, i.e., its comment section.
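The aggregation described above can be sketched as follows; the function names are illustrative, not from the paper's released code:

```python
import numpy as np
from collections import Counter

def topic_distribution(topic_labels):
    """Normalized share of sentences per topic within one document."""
    counts = Counter(topic_labels)
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.items()}

def sentiment_quartiles(polarities):
    """Median and quartiles of sentence-wise polarity scores for one topic."""
    q1, median, q3 = np.percentile(polarities, [25, 50, 75])
    return q1, median, q3
```

A "document" here is either an article or its whole comment section, so the same two functions produce both the article-level and the comment-section-level distributions shown in Figures 4 and 5.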

Figure 4: Topic and sentiment distribution for Grocer.

4 Topic Coherence

In this section, we qualitatively evaluate the feasibility and semantic coherence of our cross-lingual topic modeling. Instead of providing quantitative coherence scores, we aim at a detailed, qualitative analysis of textual examples. We depict representative sentences and words of each topic in subsection 4.2 and investigate to what extent these are semantically coherent, also across languages. We expose the ratio of coherent and incoherent topics and how it develops with an increasing number of topics k in subsection 4.3. Eventually, we show the distribution of topics in selected newspaper articles and their respective comment sections to relate the discussed content with our actual topic model on English and German texts.

4.1 Data

The collection of the data used in this study is described in another publication [10] as follows. For the analysis, we downloaded "news articles and reader comments of two major news outlets representative of the German and the United States (US) context", i.e., The New York Times and Der Spiegel. The creation dates of the texts span "from January 2007 to February 2020". "Articles and related comments on the issue of organic food were identified using the search terms organic food and organic farming and the German equivalents." For topic modeling, we utilized "534 articles and 41,320 comments from the US for the years 2007 to 2020, and 568 articles and 63,379 comments from Germany for the years 2007 to 2017 and the year 2020".

Figure 5: Topic and sentiment distribution for Öko-Test.

4.2 Multi-Linguality of Topics

In this section, we evaluate the semantic coherence of our cross-lingual topic modeling by depicting the representative sentences and words for each topic and showing their semantic relation. Table 1 shows the first 10 English and German words with the highest clarity scores (see Section 3.4) in each cluster for k = 15. Table 2 shows the first 3 English and German sentences whose embeddings have the largest cosine similarity to their corresponding cluster centroids. Both top words and top sentences indicate that the clusters are grouped reasonably in terms of semantics. For example, this is the case for the topic Environment (pesticides & fertilizers), which is indeed related to the use of pesticides in planting. Even though this also appears to be the case for the sentences in GMO & organic at first glance, those are actually about organic food and how aspects such as GMO and pesticides relate to the food itself. This and the other representative top words and sentences indicate that clustering on cross-lingual sentence embeddings yields semantically coherent topics.
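Selecting representative sentences by cosine similarity to the cluster centroid can be sketched as follows (the helper name top_sentences is hypothetical):

```python
import numpy as np

def top_sentences(embeddings, sentences, centroid, top_n=3):
    """Return the sentences whose embeddings are most cosine-similar to a centroid."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    c = centroid / np.linalg.norm(centroid)
    sims = emb @ c                          # cosine similarity per sentence
    order = np.argsort(sims)[::-1][:top_n]  # highest similarity first
    return [sentences[i] for i in order]
```

Applied per cluster with the K-means centroids, this yields the top sentences shown in Table 2.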

According to our analysis, top sentences from garbage clusters are always short, with slightly more than 15 characters. Together with their top words (Table 1), they hardly contribute to the organic food domain and corresponding entities. Thus, it is feasible in our case to ignore them.

4.3 Amount of Meaningful Topics

Besides providing coherent cross-lingual topics, our method distinguishes usable from unusable topics well, and it provides a constantly high number of relevant topics independently of the overall number of topics. Figure 3 is a Sankey diagram showing the flow of topic assignments for all English and German sentences with an increasing number of clusters k. Topic modeling is performed independently for each k with all pre-processed sentences. It can be seen that more specific topics descend from general but related topics, as indicated by the colors.

For instance, GMO & organic, Food safety, Environment (pesticides & fertilizers), and Farming & agricultural policy & food security for k = 15 are derived from Organic vs. conventional farming at a smaller k. Organic vs. conventional farming generally focuses on the advantages of organic farming over conventional farming, such as preventing persistent toxic chemicals from entering the environment, food, and bodies; thus, bio-products are recommended. For larger k, the child topics are more specific. For example, GMO & organic shows the aims for having organic food, i.e., the avoidance of GMO and poisoning with pesticides. Moreover, Food safety is further split into Food safety and Environmental pollution.

To see how the topics relate to their actual sentences, we observe the top sentences of each topic, i.e., those sentences whose embeddings are closest to the centroid. The English and German top sentences are similar and share strong semantic similarity. The Food safety topic focuses on the toxicity of dioxin and other chemicals towards consumers. Environmental pollution, which is split off from it, indeed concerns the contamination of water resources by chemicals. This shows that fine-grained topics and the way they develop with increasing k have a meaningful relation to ancestor and sister topics.

Sentences without contribution to the organic food domain always remain in garbage clusters, so the proportion of usable and unusable clusters does not fluctuate. Thus, the topic model maintains its coherence independently of the number of topics, despite the fact that K-means is not deterministic in its clustering. This property is helpful, since the number of topics can be chosen as high as necessary to provide a sufficient level of detail for the domain of interest. Moreover, this highlights the meaningfulness and robustness of the given sentence representations, which are able to separate noise from informative data in an unsupervised fashion.

4.4 Validation of Opinion Distributions

In this section, two real text examples are given to evaluate our method qualitatively. The first one is an article from The New York Times, titled 'Major Grocer to Label Foods With Gene-Modified Content', hereafter referred to as Grocer. It reported that the first retailer in the United States announced to label all of the genetically modified food sold in its stores. Advocating and opposing stakeholders stated their arguments regarding different aspects. The second example is from Der Spiegel, titled '"Öko-Test" und Co. – Welche Lebensmittelsiegel wirklich taugen' (which food seals are actually worthwhile), below denoted as Öko-Test. It reported that the number of food claims, certifications, and seals in Germany was growing, as organic labeling was a good promotional strategy indicating high food quality. However, consumers knew little about the details, even though the tests for each label were transparent and well-documented. Based on these two summaries, it would be expected that topics related to supermarkets, retailers, and GMO labels are present in those articles. The Grocer article, however, expresses concerns about the consumption of genetically modified food, whereas Öko-Test discusses organic food labeling issues from various points of view, among others fair trading and organic fishing.

Topic Distribution

Figures 4 and 5 show the distribution of topics over the article sentences. It can be seen that the two topics Retailers and GMO & organic are mentioned the most in both articles, supporting our hypothesis. The comment section of the Grocer article corresponds to the article itself in that most of its sentences also talk about GMO & organic, followed by Retailers. However, the commenters of the Öko-Test article commented most about GMO & organic, followed by Consumer prices & profit and Food products & taste. Even though the dominating topics in German differ between article and comments, the topic distribution overall still refers to the actual topics of the given texts and domains. At the same time, differences in the distribution not only between article and comments but also between languages and thus cultures are directly visible, providing a means for clear comparability in several respects.

Sentiment Distribution

Figures 4 and 5 also show the sentiment distribution. Generally, the sentiment of the Grocer article spreads out less than that of the Öko-Test article. It is observed that, in the topic GMO & organic, comments on Grocer score within a narrower polarity range than comments on Öko-Test. This means that sentences from Grocer show weaker sentiment compared to those from Öko-Test. The actual texts indicate that sentiment on our German data indeed has more variance than on the English data. Thus, the proposed multi-lingual sentiment analysis with Textblob and Textblob-de appears to represent the data adequately in the given use case. However, it cannot be excluded that the sentiment distribution is affected by the fact that two different but methodically similar frameworks are used. Different biases and variances could be caused by differences in the sentiment dictionary sizes and in the subjectivity of human-assigned sentiment scores from different cultures. Further studies should examine this problem for more robust, domain-independent multi-lingual sentiment prediction.

5 Conclusion

This case study shows that our technically simple approach successfully generates a high proportion of relevant and coherent topics for our domain, i.e., organic food products and related consumption behavior, based on English and German social media texts. Moreover, the topics reflect the text contents correctly and support a domain expert in the content analysis of social media texts written in multiple languages.

However, the presented paper does not provide quantitative measurements of topic coherence or comparisons with the state of the art. For mono-lingual topic modeling, the baseline would be LDA [3]; for advanced cross-lingual topic modeling, it could be attention-based aspect extraction [15] utilizing aligned multi-lingual word vectors [8]. Several multi-lingual datasets would need to be included for a representative comparison. Since pre-trained models trained on external data are used in the proposed method, it might be relevant for coherence score calculation to include intrinsic coherence scoring methods based on train-test splits, such as the UMass coherence score [20], and to explore extrinsic methods calculated on external validation corpora, e.g., Wikipedia [22].

Regarding multi-lingual sentiment analysis, the difference in the sentiment analysis frameworks for different languages must be considered. For example, since two independent but similar sentiment analysis models are applied for English and German, the sentiment distribution could be affected. Therefore, future studies on developing and evaluating comparable sentiment models should be conducted.


  • [1] S. Angelidis and M. Lapata (2018) Summarizing opinions: aspect extraction meets sentiment prediction and they are both weakly supervised. arXiv:1808.08858.
  • [2] F. Bianchi, S. Terragni, and D. Hovy (2020) Pre-training is a hot topic: contextualized document embeddings improve topic coherence. arXiv:2004.03974.
  • [3] D. M. Blei, A. Y. Ng, and M. I. Jordan (2003) Latent Dirichlet allocation. The Journal of Machine Learning Research 3, pp. 993–1022.
  • [4] J. Boyd-Graber and D. Blei (2012) Multilingual topic models for unaligned text. arXiv:1205.2657.
  • [5] D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. St. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, Y. Sung, B. Strope, and R. Kurzweil (2018) Universal sentence encoder. arXiv:1803.11175.
  • [6] C. Chang, S. Hwang, and T. Xui (2018) Incorporating word embedding into cross-lingual topic modeling. 2018 IEEE International Congress on Big Data (BigData Congress), pp. 17–24.
  • [7] M. Chidambaram, Y. Yang, D. Cer, S. Yuan, Y. Sung, B. Strope, and R. Kurzweil (2018) Learning cross-lingual sentence representations via a multi-task dual-encoder model. arXiv:1810.12836.
  • [8] A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and H. Jégou (2017) Word translation without parallel data. arXiv:1710.04087.
  • [9] S. Cronen-Townsend, Y. Zhou, and W. B. Croft (2002) Predicting query performance.
  • [10] H. Danner, G. Hagerer, Y. Pan, and G. Groh (2021) The news media and its audience: agenda-setting on organic food in the United States and Germany.
  • [11] T. De Smedt and W. Daelemans (2012) Pattern for Python. The Journal of Machine Learning Research 13 (1), pp. 2063–2067.
  • [12] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman (1990) Indexing by latent semantic analysis. Journal of the American Society for Information Science 41 (6), pp. 391–407.
  • [13] A. Field, D. Kliger, S. Wintner, J. Pan, D. Jurafsky, and Y. Tsvetkov (2018) Framing and agenda-setting in Russian news: a computational analysis of intricate political strategies. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 3570–3580.
  • [14] E. Gutiérrez, E. Shutova, P. Lichtenstein, G. de Melo, and L. Gilardi (2016) Detecting cross-cultural differences using a multilingual topic model. Transactions of the Association for Computational Linguistics 4, pp. 47–60.
  • [15] R. He, W. S. Lee, H. T. Ng, and D. Dahlmeier (2017) An unsupervised neural attention model for aspect extraction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 388–397.
  • [16] J. Jagarlamudi and H. Daumé (2010) Extracting multilingual topics from unaligned comparable corpora. In Advances in Information Retrieval, Berlin, Heidelberg, pp. 444–456.
  • [17] H. K. Kim, H. Kim, and S. Cho (2017) Bag-of-concepts: comprehending document representation through clustering words in distributed representation. Neurocomputing 266, pp. 336–352.
  • [18] N. Ko, B. Jeong, S. Choi, and J. Yoon (2018) Identifying product opportunities using social media mining: application of topic modeling and chance discovery theory. IEEE Access 6, pp. 1680–1693.
  • [19] T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. arXiv:1301.3781.
  • [20] D. Mimno, H. Wallach, E. Talley, M. Leenders, and A. McCallum (2011) Optimizing semantic coherence in topic models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, Scotland, UK, pp. 262–272.
  • [21] B. Pang and L. Lee (2004) A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. arXiv preprint cs/0409058.
  • [22] M. Röder, A. Both, and A. Hinneburg (2015) Exploring the space of topic coherence measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pp. 399–408.
  • [23] A. Srivastava and C. Sutton (2017) Autoencoding variational inference for topic models. arXiv:1703.01488.
  • [24] O. Tsur, D. Calacci, and D. Lazer (2015) A frame of mind: using statistical models for detection of framing and agenda setting campaigns. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, pp. 1629–1638.
  • [25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. arXiv:1706.03762.
  • [26] I. Vulic, W. De Smet, and M. Moens (2013) Cross-language information retrieval models based on latent topic models trained with document-aligned comparable corpora. Information Retrieval 16.
  • [27] Q. Xie, X. Zhang, Y. Ding, and M. Song (2020) Monolingual and multilingual topic analysis using LDA and BERT embeddings. Journal of Informetrics 14 (3), 101055.
  • [28] D. Zhang, Q. Mei, and C. Zhai (2010) Cross-lingual latent topic extraction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, pp. 1128–1137.