Machine Learning meets Data-Driven Journalism: Boosting International Understanding and Transparency in News Coverage

06/16/2016 ∙ by Elena Erdmann, et al. ∙ TU Dortmund

Migration crisis, climate change or tax havens: Global challenges need global solutions. But agreeing on a joint approach is difficult without a common ground for discussion. Public spheres are highly segmented because news is mainly produced and received on a national level. Gaining a global view on international debates about important issues is hindered by the enormous quantity of news and by language barriers. Media analysis usually focuses only on qualitative research. In this position statement, we argue that it is imperative to pool methods from machine learning, journalism studies and statistics to help bridge the segmented data of the international public sphere, using the Transatlantic Trade and Investment Partnership (TTIP) as a case study.




1 The need for cross-national analysis

The recently published news on the Panama Papers leak demonstrates, firstly, that tax fraud is an international phenomenon and, secondly, how cross-national cooperation can benefit investigation and reporting. Admittedly, this is an exceptional case. Global events are still “primarily covered in accordance with the traditional national outlook, i.e. national domestications and the ‘domestic vs. foreign news’ logic” (Berglez, 2008, 847). A global public sphere to address globally relevant issues has not yet been established, and national biases impede possible international approaches. This way “the global sociopolitical order becomes defined by the realpolitik of nation-states that cling to the illusion of sovereignty despite the realities wrought by globalization” (Castells, 2008, 80). Reciprocal knowledge about controversial issues across national borders is necessary to provide common ground for fruitful global discussions and proposals.

In this position statement, we provide evidence that joining forces improves media transparency on a global scale: Combining machine learning with statistics and journalism studies contributes to bridging the segmented data of the international public sphere. Following an interdisciplinary approach, we tackle the question of how methods from machine learning help to deepen our understanding of the discussion on cross-national issues. Latent Dirichlet Allocation (LDA), Attentional Topic Models (@TM), Poisson Dependency Networks (PDNs) and word2vec are used to enhance the transparency of international media coverage: Range, amount and framing of issues can be compared with less translation effort. Differences in perception become obvious and divides in evaluation can be interpreted. This will be demonstrated by an analysis of the coverage of the controversial Transatlantic Trade and Investment Partnership (TTIP) between the United States of America (U.S.) and the European Union (E.U.).

TTIP was designed to facilitate trade between the U.S. and the E.U. However, TTIP’s actual impact on economies and societies has been the subject of controversy in both the U.S. and Europe. Media perception has differed in many aspects. The comparison of a U.S. newspaper (New York Times) and a German newspaper (Süddeutsche Zeitung) reveals that TTIP is more hotly debated in Germany than in the U.S., see Fig. 1(a). The New York Times highlighted the need for bank regulations and the threat that exporting nations pose to local markets, whereas Süddeutsche Zeitung focused largely on consumer protection. TTIP was criticized on the one hand for its implications for environmental and food standards, and on the other for negotiation proceedings characterized by democratic deficits and insufficient transparency. Intricate questions derive from the simple comparison of word frequencies: Why does the range of reported arguments differ so considerably? Is the German media opposed to the Trade Partnership in general? And what are the reasons for the German obsession with the ’chlorinated chicken’, as evidenced by the significant number of occurrences of these words in articles broaching the TTIP issue?

The TTIP example illustrates that media coverage can differ greatly across nations. Public discussion is still strongly influenced by national media. Multiple languages add to the difficulty of framing a common global perspective. However, in a globalized world, crises and political decisions have become far too complex to be dealt with on a national level only. Combining machine learning and data-driven journalism enables researchers to investigate large corpora of texts to reveal national patterns of argumentation, which in turn can promote international understanding.

(a) Comparison of number of articles containing ’TTIP’ in New York Times and Süddeutsche Zeitung.
(b) PDN created from the articles containing ’TTIP’ in Süddeutsche Zeitung. (German words were translated to English)
(c) Attentional curves capture the development of topics in news articles over time, here illustrated for the war in Ukraine.
Figure 1: Boosting international understanding and transparency in news coverage using machine learning and data-driven journalism.

2 ML meets Data-Driven Journalism

There is an arms race to ‘deeply’ understand text data, and consequently a range of different techniques has been developed for media analysis. However, when using them for data-driven journalism, e.g. to gain a deeper understanding of the news reception of important political and societal issues, there are also challenges.

To name just a few of the recent ML techniques, DeepDive Niu et al. (2012) aims to extract structured data from texts, Metro Maps Shahaf et al. (2012) extract easy-to-understand networks of news stories, and word2vec Mikolov et al. (2013) computes Euclidean embeddings of words. It is trained on a corpus of documents and transforms each word into a vector by calculating word correlations. Similarity measures can be applied to the resulting vectors. In particular, word2vec can be used to compute those words that are most likely to occur in the same context as a given word. It thus has the potential to reveal which words are most closely linked to a given issue and hence provides a semantically enriched alternative to classical keyword searches, which are widely used by journalists. Finally, topic models have been used successfully in many scenarios, in particular to model discourses. Most prominent among them is Latent Dirichlet Allocation (LDA) Blei et al. (2003), which characterizes each topic as a list of words and their respective probabilities of appearing in the topic. Topics over Time (TOT) Wang & McCallum (2006) follows this paradigm, but introduces a temporal component. In TOT, each document has a timestamp and the probability of a topic grows and declines over time. Thus, TOT can be employed to analyze trends in news.
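The similarity queries on word vectors described above can be sketched in a few lines. This is only an illustration of the idea, not the word2vec training procedure: the three-dimensional vectors in `toy_vectors` are invented, cosine similarity is assumed as the similarity measure, and the helper names `cosine_similarity` and `most_similar` are hypothetical.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
# Real word2vec embeddings are learned from a corpus and
# typically have hundreds of dimensions.
toy_vectors = {
    "ttip": [0.9, 0.8, 0.1],
    "trade": [0.8, 0.9, 0.2],
    "chicken": [0.7, 0.3, 0.6],
    "basketball": [0.0, 0.1, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, top_n=3):
    """Rank all other words by cosine similarity to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


for neighbour, score in most_similar("ttip", toy_vectors):
    print(f"{neighbour}: {score:.3f}")
```

With learned embeddings in place of the toy vectors, the same ranking step yields the kind of neighbour lists a journalist could use in place of a hand-picked keyword list.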

Due to this rich machine learning toolbox for analyzing news articles, it is tempting to put a stack of news articles on a data journalist’s desk saying ‘Enjoy’. Unfortunately, data-driven journalism is not that simple.

Reconsider topic models, the main focus of the present paper. TOT does not model the attention of the crowd in a physically plausible way. Triggered by models from communication studies Kolb (2005) and the observation that the Shifted Gompertz distribution models attentional curves Bauckhage et al. (2014), we developed a novel Attentional Topic Model (@TM) Pölitz et al. (2016). It captures the growth and decline of a topic’s popularity in a physically plausible way.
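To give a feeling for the shape of such an attentional curve, the following sketch evaluates the standard Shifted Gompertz density, f(t) = b e^{-bt} e^{-η e^{-bt}} (1 + η (1 - e^{-bt})) for t ≥ 0. It shows the distribution only, not the @TM model itself; the parameter values `b = 1.0` and `eta = 5.0` and the function names are assumptions chosen for illustration.

```python
import math


def shifted_gompertz_pdf(t, b, eta):
    """Density of the Shifted Gompertz distribution (scale b > 0, shape eta >= 0)."""
    if t < 0:
        return 0.0
    u = math.exp(-b * t)
    return b * u * math.exp(-eta * u) * (1.0 + eta * (1.0 - u))


def shifted_gompertz_cdf(t, b, eta):
    """Cumulative distribution function matching the density above."""
    if t < 0:
        return 0.0
    u = math.exp(-b * t)
    return (1.0 - u) * math.exp(-eta * u)


# Illustrative parameters (assumed, not taken from the paper).
b, eta = 1.0, 5.0

# Crude text plot: attention rises quickly, peaks, then decays slowly.
for t in range(0, 9):
    density = shifted_gompertz_pdf(float(t), b, eta)
    print(f"t={t}  density={density:.4f}  " + "#" * int(round(density * 100)))
```

The asymmetric rise and decay is what makes this family a candidate for modelling how attention to a news topic builds up and fades.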

Moreover, multinomial word distributions, such as those in LDA and TOT, capture the most common words used in each topic. However, they often fail to provide the deeper understanding of topics that is required when investigating media discourse. That is why Admixtures of Poisson MRFs (APMs) Inouye et al. (2014), which discover word dependencies in each topic, have been introduced; essentially, they encode topics as weighted undirected graphs. Often, however, word dependencies are asymmetric. If the word ’treaty’ appears in a text, it is very likely that the text will refer to the ’secretary of state’, too. The phrase ’secretary of state’, on the other hand, is a very general term and can be used in many different contexts. Thus, it does not make the word ’treaty’ per se more likely. In Erdmann et al. (2016), we therefore extended APMs to directed dependencies using Poisson Dependency Networks (PDNs) Hadiji et al. (2015). Moreover, longer chains of directed dependencies may provide interesting clues for understanding a topic.
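The asymmetry in the ’treaty’ example can be made concrete with simple conditional document frequencies. This is not a Poisson Dependency Network, only a minimal sketch of why a directed view is useful; the six toy documents and the helper `conditional_doc_frequency` are invented for illustration.

```python
# Each toy document is represented by the set of terms it contains.
# The documents are invented for illustration only.
docs = [
    {"treaty", "secretary of state", "negotiation"},
    {"treaty", "secretary of state"},
    {"secretary of state", "visit", "election"},
    {"secretary of state", "speech"},
    {"secretary of state", "budget"},
    {"treaty", "secretary of state", "trade"},
]


def conditional_doc_frequency(target, given, documents):
    """Share of documents containing `given` that also contain `target`."""
    with_given = [d for d in documents if given in d]
    if not with_given:
        return 0.0
    return sum(1 for d in with_given if target in d) / len(with_given)


p_sos_given_treaty = conditional_doc_frequency("secretary of state", "treaty", docs)
p_treaty_given_sos = conditional_doc_frequency("treaty", "secretary of state", docs)

print(f"P(secretary of state | treaty) = {p_sos_given_treaty:.2f}")
print(f"P(treaty | secretary of state) = {p_treaty_given_sos:.2f}")
```

In this toy corpus the first direction is much stronger than the second, which an undirected edge weight cannot express.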

Finally, topic models have traditionally been evaluated using intrinsic measurements such as the likelihood and the perplexity of topics Wallach et al. (2009). As these measurements do not necessarily correspond to human judgment Chang et al. (2009), we pool together the talents of journalists, machine learners and statisticians to obtain a better understanding of what makes a good topic. If we use topic models to create subcorpora, e.g. for content analysis, we have to ensure that the subcorpora are at least as good as the ones from other methods like keyword searches. The gold standard is evaluation based on human judgment Stryker et al. (2006). Statistical methods help to reduce the time human coders need to arrive at a significant statement about the quality of a subcorpus. Moreover, our interdisciplinary research led to several interesting observations about the quality of topics: While researchers from a mathematical background tended to focus on topics linked to large quantities of documents, journalists oftentimes preferred topics that were created from only a few meaningful documents. Likewise, words like ’can’, ’need’ and ’do’, which machine learners considered stopwords, really caught the journalists’ attention.
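One way statistics can limit the coding effort is to have human coders judge only a random sample of a subcorpus and to report the share of relevant articles together with a confidence interval. The sketch below uses the Wilson score interval, which is one standard choice and not necessarily the procedure used in this work; the counts (90 relevant articles in a sample of 100) are hypothetical.

```python
import math


def wilson_interval(relevant, sample_size, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p_hat = relevant / sample_size
    z2 = z * z
    denom = 1.0 + z2 / sample_size
    centre = (p_hat + z2 / (2.0 * sample_size)) / denom
    half_width = (
        z
        * math.sqrt(
            p_hat * (1.0 - p_hat) / sample_size
            + z2 / (4.0 * sample_size ** 2)
        )
        / denom
    )
    return centre - half_width, centre + half_width


# Hypothetical coding result: 90 of 100 sampled articles judged relevant.
low, high = wilson_interval(relevant=90, sample_size=100)
print(f"estimated precision 0.90, interval [{low:.3f}, {high:.3f}]")
```

The interval narrows as the sample grows, so the coders can stop once it is tight enough to compare the topic-model subcorpus against a keyword-search baseline.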

We believe that these small and seemingly insignificant observations can help to improve the application and lead the way to new computational models. Can new topic models be developed to cater better to the specific needs of journalists? Are there different approaches to gain deeper insight into each topic? We will illustrate this using the case of TTIP.

3 Towards an International View on TTIP

TTIP affects millions of people living in the U.S. and the E.U., and its negotiations have been controversial. However, content analysis of newspapers indicates that the importance of the issue diverges between nations. Both newspapers compared here are high-circulation dailies from major cities with a rather liberal orientation. Despite these similarities, coverage of TTIP varies significantly (see Fig. 1(a)).

It is notable that coverage of TTIP increases considerably from 2014 onwards in Süddeutsche Zeitung (SZ), whereas the number of articles in the New York Times (NYT) remains unchanged. In order to shed light on this disparity, the general coverage of the U.S.A. and Europe, respectively, was analyzed. In SZ, the sub-corpus of all articles containing the letter sequence ’usa’ comprised 59,637 articles (36 per cent of the corpus), whereas the corresponding ’europe’ sub-corpus in NYT comprised 34,177 articles (11 per cent of the corpus).
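Building such a sub-corpus by pattern matching can be sketched as follows. The four toy articles, the helper `build_subcorpus`, and the choice of case-insensitive matching are assumptions made for illustration; the actual preprocessing is not specified here.

```python
import re

# Toy articles, invented for illustration only.
articles = [
    "Die USA und die EU verhandeln über TTIP.",
    "Thousands attended the concert.",
    "Der Bundestag debattiert den Haushalt.",
    "Washington welcomes talks with Brussels.",
]


def build_subcorpus(documents, pattern):
    """Return all documents in which the regular expression occurs.

    Matching is case-insensitive (an assumption of this sketch).
    """
    regex = re.compile(pattern, re.IGNORECASE)
    return [doc for doc in documents if regex.search(doc)]


loose = build_subcorpus(articles, "usa")       # bare letter sequence
strict = build_subcorpus(articles, r"\busa\b")  # whole word only

print(f"loose match:  {len(loose)} of {len(articles)} articles")
print(f"strict match: {len(strict)} of {len(articles)} articles")
```

A bare letter sequence also matches inside longer words (here the English word "Thousands"), which is one reason why the quality of a pattern-based sub-corpus is worth validating against human judgment.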

Figure 2: Topics in NYT found by LDA and labeled by journalists.

(Topic Models) LDA was used to find 100 topics in each sub-corpus. The topics were labeled by journalists using top words and top articles. The topics in NYT are illustrated in Fig. 2. A glance at the results clearly shows the different perspectives on TTIP in the two public spheres. In SZ, TTIP appears in the top word list of a topic covering articles on policies of the European Commission, along with related words like customs, arbitration and investor protection. In NYT, TTIP is not among the top words of the European Commission topic, which is dominated by stakeholders dealing with financial issues around the euro crisis. Interpreting the LDA topics further, TTIP plays only a minor role in U.S.-European relations as represented by the topics, which mainly concern various international conflicts, art, sports and the general economy.

(Directed word dependencies) A PDN Hadiji et al. (2015) trained on the articles of Süddeutsche Zeitung shows that the constitution of the E.U. as a politico-economic union of 28 states places the question of parliamentary participation in the foreground (see Fig. 1(b)). In the U.S., this question does not arise.

(Attentional Topic Models) When analyzing the discourse on Europe in NYT through Attentional Topic Models Pölitz et al. (2016), we found no attentional topic corresponding to TTIP. Instead, the discussion focused on different topics such as the war in Ukraine (see Fig. 1(c)).

(word2vec) The word2vec results reveal differences in reciprocal perception. The U.S. and the German newspaper coincide in covering Germany mainly with regard to the recent migration. In their coverage of the U.S., however, SZ and NYT drift apart: SZ emphasizes the importance of the U.S. as an economic partner, whereas NYT covers the U.S. across a broader range of topics, including several sports (see Table 1). Comparing the use of TTIP indicates that NYT uses more matter-of-fact words in connection with TTIP, while SZ seems to include more commenting and evaluating words, including ’chlorinated chicken’. Word2vec solves the mystery of the ’chlorinated chicken’: free trade agreement, genetically modified food and genetically modified corn are among the most similar words. For German TTIP opponents, chicken meat disinfected with chlorine has become a symbol of lowered food safety standards and of the disadvantages of TTIP in general.

Applying machine learning to understand the cross-nationally diverse discourse on TTIP highlights the benefit which can be derived from an interdisciplinary approach. Yet, this interdisciplinary approach is still in its infancy.

USA
  NYT: santander consumer, served chairman, basketball, womens hockey, oracle team, senior vice, kan, goodgame, divac, columbus ohio
  SZ: kanada mexico, united states, china hongkong, largest market, most important trade partner, embargo, pacific states, china russia, great britain france, usa kanada
Germany
  NYT: germanys, europe, asylum seekers, migrants, german, plan distribute, migrants entered, accept migrants, hungary closed, human flow
  SZ: europe, countries, immigrant, kanada australia, prospects of remaining, immigrants, arriving refugees, many refugees, most refugees, european countries
TTIP
  NYT: transatlantic trade, investment partnership, trade ministers, mr froman, trade agreement, trade negotiations, trade negotiators, trade talks, euus, trade commissioner
  SZ: ceta, investor protection, free trade agreement, free trade agreement ttip, trade agreement, investment protection, eu kanada, transatlantic free trade agreement, chlorine chicken, free trade
Table 1: Most similar words to ’USA’, ’Germany’ and ’TTIP’ in NYT and SZ (translated from German) according to word2vec.
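How far the two newspapers' neighbour lists diverge can be quantified with a simple set overlap. The sketch below copies the lists from Table 1 and computes shared terms and the Jaccard index; the function names are hypothetical, and exact string matching on translated SZ terms is a crude assumption that will miss near-synonyms.

```python
# Neighbour lists copied from Table 1 (SZ terms translated from German).
neighbours = {
    "USA": {
        "NYT": ["santander consumer", "served chairman", "basketball",
                "womens hockey", "oracle team", "senior vice", "kan",
                "goodgame", "divac", "columbus ohio"],
        "SZ": ["kanada mexico", "united states", "china hongkong",
               "largest market", "most important trade partner", "embargo",
               "pacific states", "china russia", "great britain france",
               "usa kanada"],
    },
    "Germany": {
        "NYT": ["germanys", "europe", "asylum seekers", "migrants", "german",
                "plan distribute", "migrants entered", "accept migrants",
                "hungary closed", "human flow"],
        "SZ": ["europe", "countries", "immigrant", "kanada australia",
               "prospects of remaining", "immigrants", "arriving refugees",
               "many refugees", "most refugees", "european countries"],
    },
    "TTIP": {
        "NYT": ["transatlantic trade", "investment partnership",
                "trade ministers", "mr froman", "trade agreement",
                "trade negotiations", "trade negotiators", "trade talks",
                "euus", "trade commissioner"],
        "SZ": ["ceta", "investor protection", "free trade agreement",
               "free trade agreement ttip", "trade agreement",
               "investment protection", "eu kanada",
               "transatlantic free trade agreement", "chlorine chicken",
               "free trade"],
    },
}


def shared_terms(query):
    """Terms that appear in both newspapers' neighbour lists for `query`."""
    return set(neighbours[query]["NYT"]) & set(neighbours[query]["SZ"])


def jaccard(query):
    """Size of the intersection divided by the size of the union."""
    nyt, sz = set(neighbours[query]["NYT"]), set(neighbours[query]["SZ"])
    return len(nyt & sz) / len(nyt | sz)


for query in neighbours:
    print(f"{query}: shared={sorted(shared_terms(query))} jaccard={jaccard(query):.3f}")
```

Even this rough measure reflects the pattern described in the text: the lists for ’USA’ share no term at all, while those for ’Germany’ and ’TTIP’ share exactly one each.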

4 Lesson Learned: Bridging Fields

The absence of a common public sphere has already been identified as an enduring obstacle to further political and economic integration in Europe Habermas (2014); Jones et al. (2015); Vössing (2015); Risse (2015), a region where the difficulty of understanding between people speaking different languages becomes evident in spite of small distances. Agents in politics and business face a confusing multitude of partly conflicting national discourses, which hinders finding common solutions in a democratic context. Finding common ground for discussion becomes even more challenging when the aim is international understanding. To advance the development of a global public sphere that discusses and approaches international challenges, it is imperative that computer scientists, information scientists, and experts in communication studies pool their talents and knowledge to help find efficient and effective ways of managing the news sources available. Research in this new field is necessarily interdisciplinary, since developing new methods and applying established ones should eventually lead to instruments that enable not only researchers and experienced data journalists but also practitioners in the media, in politics and in business to compare debates internationally.

So far, using machine learning methods for content analysis is still uncommon in communication studies, and best practices for algorithmic text analysis (ATA) are still being negotiated. In the TTIP case, machine learning methods were used as part of a hybrid approach “that combines computational and manual methods throughout the process . . . [to] retain the strengths of traditional content analysis while maximizing the accuracy, efficiency, and largescale capacity of algorithms for examining Big Data” (Lewis et al., 2013). Following this approach, patterns that have so far been hidden can be made visible. Transparency will be achieved on the prevailing debates by showing how they evolve and relate to existing narratives, identifying national frames and agenda setters, and showing divergences and convergences across national debates. The focus of research should be on the interlocking of long-term discourse patterns with current issues. Which arguments and frames have dominated the debate on the refugee crisis? Why is the opposition against TTIP so strong in the German-speaking countries of Germany, Austria and Luxembourg while others embrace the deal? Why have Germany and France differed so fundamentally on how to handle the Greek debt crisis? Which persons and institutions are dominating the debates in the respective countries? In which areas do new topics or new frames emerge?

Our TTIP study also motivates revisiting a considerable number of important theories of communication studies. The mechanisms of agenda-setting Bennett (2006) and issue attention cycles Downs (1972) can be visualized using clustering models in an entirely new dimension; the most important agents of the public discourse Habermas (1991) can be analyzed with named-entity recognition and network visualizations; and the framing of news Entman (1993) can be illustrated using sentiment analysis. In a nutshell, the potential of machine learning for analyzing international communication and discourses is high. However, this potential can only be realized if new methods are developed and made available as easy-to-use applications.

Overall, applying machine learning to broaden insight into international news coverage opens up fundamentally new intellectual territory with great potential to advance the state of the art of computer science and related disciplines and to provide unique societal benefits. Realizing this potential requires intense interdisciplinary collaboration and the shared objective of developing methods that are easily usable by everyone interested in profound international understanding.


This work was supported by the DFG Collaborative Research Center SFB 876, projects A6 and A1, and the Dortmund Center for Media Analysis (DoCMA).


  • Bauckhage et al. (2014) Bauckhage, C., Kersting, K., and Rastegarpanah, B. Collective attention to social media evolves according to diffusion models. In Proceedings of the companion publication of the 23rd international conference on World wide web companion, pp. 223–224, 2014.
  • Bennett (2006) Bennett, W.L. Toward a Theory of Press–State Relations in the US. Journal of Communication, 40(2):103 – 127, 2006.
  • Berglez (2008) Berglez, P. What Is Global Journalism? Journalism Studies, 9(6):845–858, December 2008.
  • Blei et al. (2003) Blei, D.M., Ng, A., and Jordan, M. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
  • Castells (2008) Castells, M. The New Public Sphere: Global Civil Society, Communication Networks, and Global Governance. The ANNALS of the American Academy of Political and Social Science, 616(1):78–93, March 2008.
  • Chang et al. (2009) Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J.L., and Blei, D.M. Reading tea leaves: How humans interpret topic models. In Bengio, Y., Schuurmans, D., Lafferty, J. D., Williams, C. K. I., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 22, pp. 288–296. 2009.
  • Downs (1972) Downs, A. Up and Down with Ecology-the Issue-Attention Cycle. The Public Interest, 0(28), 1972.
  • Entman (1993) Entman, R.M. Framing: Toward Clarification of a Fractured Paradigm. Journal of Communication, 43(4):51–58, December 1993.
  • Erdmann et al. (2016) Erdmann, E., Molina, A., and Kersting, K. Topic models with asymmetric word dependencies. Manuscript under revision, 2016.
  • Habermas (1991) Habermas, J. The Structural Transformation of the Public Sphere: An Inquiry Into a Category of Bourgeois Society. MIT Press, 1991.
  • Habermas (2014) Habermas, J. The crisis of the European Union: a response. Polity Press, 1 edition, March 2014.
  • Hadiji et al. (2015) Hadiji, F., Molina, A., Natarajan, S., and Kersting, K. Poisson dependency networks: Gradient boosted models for multivariate count data. Machine Learning, pp. 1–31, 2015.
  • Inouye et al. (2014) Inouye, D., Ravikumar, P., and Dhillon, I. Admixture of poisson mrfs: A topic model with word dependencies. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pp. 683–691, 2014.
  • Jones et al. (2015) Jones, E., Kelemen, R.D., and Meunier, S. Failing Forward? The Euro Crisis and the Incomplete Nature of European Integration. Comparative Political Studies, pp. 1–25, December 2015.
  • Kolb (2005) Kolb, S. Mediale Thematisierung in Zyklen. 2005.
  • Lewis et al. (2013) Lewis, S.C., Zamith, R., and Hermida, A. Content Analysis in an Era of Big Data: A Hybrid Approach to Computational and Manual Methods. Journal of Broadcasting & Electronic Media, 57:34–52, January 2013.
  • Mikolov et al. (2013) Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., and Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119, 2013.
  • Niu et al. (2012) Niu, F., Zhang, C., Ré, C., and Shavlik, J.W. Deepdive: Web-scale knowledge-base construction using statistical learning and inference. VLDB, 12:25–28, 2012.
  • Pölitz et al. (2016) Pölitz, C., Erdmann, E., Bauckhage, C., Müller, H., Morik, K., and Kersting, K. Attentional topic models: Gompertz captures growth and decline of popular topics. manuscript under revision, 2016.
  • Risse (2015) Risse, T. European public spheres. Contemporary European politics. Cambridge Univ. Press, Cambridge, 2015.
  • Shahaf et al. (2012) Shahaf, D., Guestrin, C., and Horvitz, E. Trains of thought: Generating information maps. In Proceedings of the 21st international conference on World Wide Web, pp. 899–908. ACM, 2012.
  • Stryker et al. (2006) Stryker, J.E., Wray, R.J., Hornik, R.C., and Yanovitzky, I. Validation of Database Search Terms for Content Analysis: The Case of Cancer News Coverage. Journalism & Mass Communication Quarterly, 83:413–430, 2006. ISSN 1077-6990, 2161-430X.
  • Vössing (2015) Vössing, K. Transforming public opinion about European integration: Elite influence and its limits. European Union Politics, 16:157–175, 2015.
  • Wallach et al. (2009) Wallach, H.M., Murray, I., Salakhutdinov, R., and Mimno, D. Evaluation methods for topic models. In Proceedings of the 26th International Conference on Machine Learning (ICML), ICML ’09, pp. 1105–1112, 2009.
  • Wang & McCallum (2006) Wang, X. and McCallum, A. Topics over time: A non-markov continuous-time model of topical trends. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06, pp. 424–433, 2006.