In recent years, data derived from public social media has been successfully used to capture diverse trends in health and disease-related issues such as flu symptoms, sentiments towards vaccination, allergies, and many others [1, 2, 3, 4, 5]. Most of these approaches are based on natural language processing (NLP) and share a common workflow: data collection, human annotation of a subset of this data, training of a supervised classifier, and subsequent analysis of the remaining data. The approach has proven promising in many cases, but it also has a few shortcomings. A major drawback of this type of research process is that a model trained on data from previous years might not generalize well into the future. This issue, commonly known as concept drift, is not necessarily related only to overfitting, but may simply be a consequence of how language and content, especially on the internet, evolve over time. A similar effect has been suggested to be the main reason for the increasing inaccuracy of Google Flu Trends (GFT), one of the best-known flu surveillance systems of the past. After the platform launched in 2008, GFT’s model was retrained in 2009, which led to a significant improvement of its performance in the following years. However, during the influenza epidemic in 2012/13, the model’s performance decreased again and it overestimated the extent of the epidemic by a large margin. Shortly after, it was discontinued [8, 9].
Apart from the issue of concept drift, a second issue associated with current NLP models is that collecting large amounts of labelled data, usually through platforms such as Amazon Mechanical Turk, is very costly. Labelling a random subset of the collected social media data may be inefficient because, depending on the degree of filtering applied, large fractions of the collected data may not be relevant to the topic and therefore have to be discarded.
Lastly, there is a growing interest in the public health field in capturing more fine-grained categorizations of trends, opinions or emotions. Such categorizations could paint a more accurate picture of the nature of the health issue at hand. However, multi-class annotation of a large sample of data again increases costs considerably.
Here, we introduce Crowdbreaks (https://www.crowdbreaks.org), a platform targeted at tackling some of these issues. Crowdbreaks allows the continuous labelling of public social media content in a crowdsourced way. The system is built in a way which allows algorithms to improve as more labelled data is collected. This work describes the functionalities of the platform in its current state as well as its possible use cases and extensions.
2 Related Work
In recent years, a number of platforms have been launched which allow the public to contribute to solving a specific scientific problem. Among many others, examples of successful projects include the Zooniverse platform (formerly known as Galaxy Zoo), Crowdcrafting, eBird (a platform for collecting ornithological data), and FoldIt (a platform to solve protein folding structures). Many of these projects have shown that citizen science can be used to help solve complex scientific problems. At the same time, there is a growing number of platforms which offer monetary compensation to workers for the fulfillment of microtasks (the most prominent example being Amazon Mechanical Turk, https://aws.amazon.com/). These platforms gain importance as the need for large amounts of labelled data for the training of supervised machine learning algorithms increases. Previous work focused mostly on improving the efficiency of large-scale human annotation of images, e.g. in the context of the ImageNet project. Most of these improvements concern better ways to select which data to annotate, how to annotate (which is a UI-specific problem) and what type of annotations (classes and subclasses) should be collected. Online task assignment algorithms have been suggested which may consider both label uncertainty and annotator uncertainty during the annotation process [16, 17]. Results suggest that this allows for a more efficient training of algorithms. More recently, a crowd-based scientific image annotation platform called Quantius has been proposed, showing decreased analysis time and cost. To our knowledge, no similar work has been proposed with regard to the human annotation of textual data (such as tweets).
3 Platform overview
Crowdbreaks is a platform which aims at automating the whole process from data collection (currently through Twitter) and filtering to crowdsourced annotation and training of machine learning classifiers. Eventually these algorithms can help evaluate trends in health behaviours, such as vaccine hesitancy or the risk potential for disease outbreaks.
Crowdbreaks consists of a data collection pipeline (“streaming pipeline”) and a platform for the collection of labelled data (“user interface”), connected through an API (Application Programming Interface), as schematized in figure 1.
3.1 Streaming pipeline
Currently, Crowdbreaks consumes data from the Twitter streaming API only; the rest of this work therefore focuses on tweets as the only data source. However, the platform could be extended to any textual data which can be collected in the form of a data stream through an API. The Twitter API allows for the filtering of tweets by a specific set of keywords in real-time. Collected tweets contain at least one exact keyword match within certain fields of the tweet object. Incoming tweets are put on a background job queue for filtering, pre-processing, geo-tag enrichment, and annotation with metadata, such as estimated relevance or sentiment (more on this in section 4). Apart from filtering by the simple list of keywords mentioned before, Crowdbreaks also allows further filtering of content through complex keyword queries, such as (keyword1 OR keyword2) AND keyword3. After these processing steps, tweets are stored in a database. Based on a relevance score (e.g. the uncertainty of a predicted label, see section 3.3.1) the tweet IDs are also pushed into a priority queue for subsequent labelling. Once the priority queue has reached a certain size, older items with low priority are removed from the queue and replaced with more recent items. The queue therefore keeps a pool of recent and presumably relevant tweets for labelling. Once a tweet has been labelled, the system ensures that the same tweet is labelled by a certain number of distinct users in order to reach a consensus.
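The keyword query and the size-limited priority queue described above can be illustrated with a minimal Python sketch. It is not the Crowdbreaks implementation (which, according to section 3.4, uses Redis for queuing); the names `matches_query` and `BoundedPriorityQueue`, the whole-word matching rule and the eviction rule (drop the lowest-priority item, oldest first on ties) are assumptions made for illustration.

```python
import heapq
import itertools
import re


def matches_query(text, any_of, all_of):
    """Evaluate a query of the form (k1 OR k2 ...) AND k3 AND ... on whole words.

    Hypothetical helper: matching on lower-cased word tokens is an assumption.
    """
    tokens = set(re.findall(r"\w+", text.lower()))
    return any(k in tokens for k in any_of) and all(k in tokens for k in all_of)


class BoundedPriorityQueue:
    """Keeps at most `max_size` tweet IDs for labelling (illustrative only)."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._heap = []  # min-heap of (priority, arrival_index, tweet_id)
        self._arrival = itertools.count()

    def push(self, tweet_id, priority):
        heapq.heappush(self._heap, (priority, next(self._arrival), tweet_id))
        if len(self._heap) > self.max_size:
            # Evict the lowest-priority item; among ties, the oldest one goes first.
            heapq.heappop(self._heap)

    def pop_highest(self):
        """Remove and return the highest-priority tweet ID, or None if empty."""
        if not self._heap:
            return None
        item = max(self._heap)  # ties resolve to the most recent arrival
        self._heap.remove(item)
        heapq.heapify(self._heap)
        return item[2]
```

With a capacity of two, pushing three tweets with priorities 0.1, 0.9 and 0.5 evicts the first one, so the queue hands out the 0.9 tweet followed by the 0.5 tweet.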
3.2 User interface
The user interface allows labelling of tweets by answering a sequence of questions. Arbitrary question sequences can be defined, which allow the annotation of multiple classes and subclasses to a single tweet. Most commonly, different follow-up questions are asked depending on the answers given previously, e.g. whether or not the tweet is relevant to the topic at hand (see figure 2a). At the beginning of a question sequence an API call is made to the streaming pipeline to retrieve a new tweet ID from the priority queue (see section 3.1). Every question a user answers creates a new row in a database table, containing the respective user, tweet, question and answer IDs. After the user has successfully finished the question sequence, the respective user ID is added to a set in order to ensure that the same tweet is not labelled multiple times by the same user.
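The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an in-memory list stands in for the database table, the threshold `REQUIRED_LABELLERS` is a made-up value (the text only speaks of "a certain number" of distinct users), and a strict majority vote is one possible way to define consensus, not necessarily the one used by Crowdbreaks.

```python
from collections import Counter, defaultdict

REQUIRED_LABELLERS = 3  # assumed threshold, not taken from the paper

answers = []                    # stand-in for the table of answer rows
labelled_by = defaultdict(set)  # tweet_id -> user IDs that finished the sequence


def can_label(user_id, tweet_id):
    """A user may label a tweet only once."""
    return user_id not in labelled_by.get(tweet_id, set())


def record_answer(user_id, tweet_id, question_id, answer_id):
    """One row per answered question, as described in the text."""
    answers.append({"user_id": user_id, "tweet_id": tweet_id,
                    "question_id": question_id, "answer_id": answer_id})


def finish_sequence(user_id, tweet_id):
    """Mark the question sequence as completed for this user and tweet."""
    labelled_by[tweet_id].add(user_id)


def consensus(tweet_id, question_id):
    """Return the majority answer once enough users have finished, else None."""
    users = labelled_by.get(tweet_id, set())
    if len(users) < REQUIRED_LABELLERS:
        return None
    votes = Counter(row["answer_id"] for row in answers
                    if row["tweet_id"] == tweet_id
                    and row["question_id"] == question_id
                    and row["user_id"] in users)
    ranked = votes.most_common(2)
    if not ranked or (len(ranked) == 2 and ranked[0][1] == ranked[1][1]):
        return None  # no answers for this question, or a tie
    return ranked[0][0]
```

Only users who completed the whole sequence are counted, so partially answered sequences do not influence the consensus in this sketch.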
Crowdbreaks supports multiple projects, each of which may be connected to its own data stream from Twitter. New projects can be created through an admin interface, which makes it possible both to control the data collection and to define project-specific question sequences. Eventually, visualizations, such as sentiment trends over time, may be presented to public users, allowing them to see the outcomes of their work. Crowdbreaks also features an integration of the question sequence interface with Amazon Mechanical Turk, allowing the collection of labelled data through paid crowdworkers as an alternative to public users.
3.3 Sentiment analysis
In recent years, algorithms for sentiment analysis based on word embeddings have become increasingly popular compared to traditional approaches which rely on manual feature engineering [19, 20, 21]. Word embeddings give a dense, comparatively low-dimensional vector representation of the input text, usually based on a pre-trained language model. Although these approaches may not consistently yield better results than traditional approaches, they allow for an easier automation of the training workflow and are usually more generalizable to other problems. This is a desirable property in the context of Crowdbreaks, as it aims to further automate this process and retrain classifiers automatically as more labelled data arrives. Furthermore, pre-trained word embeddings based on large Twitter corpora are available in different languages, which also makes them interesting for following health trends in languages other than English.
3.3.2 Active Learning
Active learning frameworks have been proposed for a more efficient training of classifiers in the context of word embeddings [23, 24]. These frameworks allow algorithms to be trained with a much smaller amount of annotated data, compared to a standard supervised training workflow (see figure 3). The query strategy, which is usually related to label uncertainty, is generally the critical component for the relative performance speed-up of these methods. In the context of Crowdbreaks, we prioritize not only data with higher label uncertainty, but also data which is more recent in time. We are therefore faced with a trade-off between exploration of more recent data and exploitation of previous data. Crowdbreaks can serve as a framework to explore these challenges and find the right balance.
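One way to picture such a query strategy is a priority score that mixes label uncertainty with recency. The following Python sketch is purely illustrative and not the scoring used by Crowdbreaks: the normalized-entropy uncertainty measure, the exponential recency decay, and the parameters `alpha` and `half_life_hours` are all assumptions chosen to make the trade-off explicit.

```python
import math


def label_entropy(probs):
    """Shannon entropy of predicted class probabilities, scaled to [0, 1]."""
    n = len(probs)
    if n < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(n)


def recency(age_hours, half_life_hours=24.0):
    """Exponential decay: 1.0 for a brand-new tweet, 0.5 after one half-life."""
    return 0.5 ** (age_hours / half_life_hours)


def priority(probs, age_hours, alpha=0.5, half_life_hours=24.0):
    """Weighted mix of uncertainty and recency (both weights are hypothetical).

    alpha = 1.0 ranks purely by label uncertainty,
    alpha = 0.0 ranks purely by how recent the tweet is.
    """
    return (alpha * label_entropy(probs)
            + (1.0 - alpha) * recency(age_hours, half_life_hours))
```

Varying `alpha` shifts the balance between labelling uncertain tweets regardless of age and labelling recent tweets regardless of how confident the classifier already is.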
3.3.3 Example use case
The intensity, spread and effects of public opinion towards vaccination on social media and news sources have been explored in previous work [25, 3]. Declines in vaccine confidence and boycotts of vaccination programs could sometimes be linked to disease outbreaks or to setbacks in efforts to eradicate certain diseases such as polio or measles [26, 27]. In particular, the potential benefits of real-time monitoring of vaccine sentiments as a tool for the improved planning of public health intervention programs have been highlighted [28, 29, 30].
Tracking such sentiments towards vaccines is a primary use case of Crowdbreaks. Figure 4 shows real-time predictions based on a supervised bag-of-words fastText classifier. The predicted tweets were collected through the Twitter Streaming API using a list of vaccine-related keywords (the keywords include “vaccine”, “vaccination”, “vaxxer”, “vaxxed”, “vaccinated”, “vaccinating”, “vacine”). The classifier was trained using publicly available data provided in recent work by Bauch et al. Please refer to their work for a detailed description of the collection and processing workflow of this data set.
3.4 Technologies used
Crowdbreaks uses a Python Flask API to interface between the components of the streaming pipeline and the user interface. The streaming pipeline makes use of Redis for the message queuing of the processing queue as well as the priority queue (see figure 1). Filtering and data processing, as well as NLP-related tasks, are written in Python using the standard data analysis toolchain (numpy, scipy, nltk). Tweet objects are stored as flat files as well as in JSON format on Elasticsearch, which allows for an easier exploration and visualization of the data using Kibana. The user interface is built using Ruby on Rails with a PostgreSQL database backend in order to store the annotations as well as user-related data.
4 Discussion & Future work
Here we introduced Crowdbreaks, a tool allowing any researcher to start measuring health and other trends in real-time from public social media content. By involving crowdworkers as well as the general public, we hope that these models will eventually improve to a level at which they can be incorporated into mathematical models in order to predict actual health indicators. After proper validation and benchmarking, such models could eventually be used to improve public health decision-making, as well as risk assessments and disease forecasting. In the case of disease prediction, a precise understanding of the content (e.g. whether a tweet just raises awareness or actually reports an infection) is crucial for the robustness of the model. As disease prediction solely from Twitter data remains a hard problem, previous work has suggested that hybrid models combining Twitter with less volatile data sources (such as Wikipedia page view rates) are more robust [31, 32]. This may also serve as a future direction for disease prediction projects on Crowdbreaks.
We thank Sean Carroll, Yannis Jacquet, Djilani Kabaili and S.P. Mohanty for valuable discussions and help regarding the technical aspects of this project. Thanks also to Chloé Allémann for comments on a draft of the paper.
-  Aron Culotta. Towards detecting influenza epidemics by analyzing twitter messages. In Proceedings of the first workshop on social media analytics, pages 115–122. ACM, 2010.
-  Michael J Paul and Mark Dredze. You are what you tweet: Analyzing twitter for public health. ICWSM, 20:265–272, 2011.
-  Marcel Salathé and Shashank Khandelwal. Assessing vaccination sentiments with online social media: Implications for infectious disease dynamics and control. PLoS Computational Biology, 7(10), 2011.
-  Michael J Paul and Mark Dredze. A model for mining public health topics from twitter. Health, 11:16–6, 2012.
-  Jon Parker, Yifang Wei, Andrew Yates, Ophir Frieder, and Nazli Goharian. A framework for detecting public health trends with twitter. In Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pages 556–563. ACM, 2013.
-  Gerhard Widmer and Miroslav Kubat. Learning in the presence of concept drift and hidden contexts. Machine learning, 23(1):69–101, 1996.
-  Jeremy Ginsberg, Matthew H Mohebbi, Rajan S Patel, Lynnette Brammer, Mark S Smolinski, and Larry Brilliant. Detecting influenza epidemics using search engine query data. Nature, 457(7232):1012, 2009.
-  David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. The parable of google flu: Traps in big data analysis, 2014.
-  Declan Butler. When Google got flu wrong. Nature, 494(February):155–156, 2013.
-  Robert Simpson, Kevin R Page, and David De Roure. Zooniverse: observing the world’s largest citizen science platform. Proceedings of the 23rd International Conference on World Wide Web, pages 1049–1054, 2014.
-  Crowdcrafting. https://crowdcrafting.org. Accessed: April 2018.
-  Chris Wood, Brian Sullivan, Marshall Iliff, Daniel Fink, and Steve Kelling. eBird: Engaging birders in science and conservation. PLoS Biology, 9(12), 2011.
-  Firas Khatib, Frank Dimaio, Seth Cooper, Maciej Kazmierczyk, Miroslaw Gilski, Szymon Krzywda, Helena Zabranska, Iva Pichova, James Thompson, Zoran Popović, Mariusz Jaskolski, and David Baker. Crystal structure of a monomeric retroviral protease solved by protein folding game players. Nature Structural and Molecular Biology, 18(10):1175–1177, 2010.
-  Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
-  Adriana Kovashka, Olga Russakovsky, Li Fei-Fei, Kristen Grauman, et al. Crowdsourcing in computer vision. Foundations and Trends® in Computer Graphics and Vision, 10(3):177–243, 2016.
-  Peter Welinder and Pietro Perona. Online crowdsourcing: Rating annotators and obtaining cost-effective 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, CVPRW 2010, pages 25–32, 2010.
-  Chien-Ju Ho and Jennifer Wortman Vaughan. Online Task Assignment in Crowdsourcing Markets. Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, pages 45–51, 2012.
-  Alex J. Hughes, Joseph D. Mornin, Sujoy K. Biswas, David P. Bauer, Simone Bianco, and Zev J. Gartner. Quantius: Generic, high-fidelity human annotation of scientific images at 10^5 - clicks-per-hour. bioRxiv (preprint), 2017.
-  Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A Neural Probabilistic Language Model. The Journal of Machine Learning Research, 3:1137–1155, 2003.
-  Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
-  Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759, 2016.
-  Jan Deriu, Aurelien Lucchi, Valeria De Luca, Aliaksei Severyn, Simon Müller, Mark Cieliebak, Thomas Hofmann, and Martin Jaggi. Leveraging large amounts of weakly supervised data for multi-language sentiment classification. In Proceedings of the 26th International Conference on World Wide Web, pages 1045–1052. International World Wide Web Conferences Steering Committee, 2017.
-  M. Kholghi, L. De Vine, L. Sitbon, G. Zuccon, and A. Nguyen. Clinical information extraction using small data: An active learning approach based on sequence representations and word embeddings. Journal of the Association for Information Science and Technology, 68(September):2543–2556, 2017.
-  Ye Zhang and Byron Wallace. Active Discriminative Word Embedding Learning. NAACL, 2016.
-  Neil Seeman, Alton Ing, and Carlos Rizo. Assessing and responding in real time to online anti-vaccine sentiment during a flu pandemic. Healthc Q, 13(Sp):8–15, 2010.
-  Heidi J Larson and Isaac Ghinai. Lessons from polio eradication. Nature, 473(7348):446, 2011.
-  Maryam Yahya. Polio vaccines—“no thank you!” barriers to polio eradication in northern nigeria. African Affairs, 106(423):185–204, 2007.
-  Heidi J. Larson, David M.D. Smith, Pauline Paterson, Melissa Cumming, Elisabeth Eckersberger, Clark C. Freifeld, Isaac Ghinai, Caitlin Jarrett, Louisa Paushter, John S. Brownstein, and Lawrence C. Madoff. Measuring vaccine confidence: Analysis of data obtained by a media surveillance system used to analyse public concerns about vaccines. The Lancet Infectious Diseases, 13(7):606–613, 2013.
-  A. Demetri Pananos, Thomas M. Bury, Clara Wang, Justin Schonfeld, Sharada P. Mohanty, Brendan Nyhan, Marcel Salathé, and Chris T. Bauch. Critical dynamics in population vaccinating behavior. Proceedings of the National Academy of Sciences, page 201704093, 2017.
-  Chi Y. Bahk, Melissa Cumming, Louisa Paushter, Lawrence C. Madoff, Angus Thomson, and John S. Brownstein. Publicly available online tool facilitates real-time monitoring of vaccine conversations and sentiments. Health Affairs, 35(2):341–347, 2016.
-  David J. McIver and John S. Brownstein. Wikipedia Usage Estimates Prevalence of Influenza-Like Illness in the United States in Near Real-Time. PLoS Computational Biology, 10(4), 2014.
-  Mauricio Santillana, André T Nguyen, Mark Dredze, Michael J Paul, O Nsoesie, and John S Brownstein. Combining Search , Social Media , and Traditional Data Sources to Improve Influenza Surveillance. pages 1–15, 2015.