Image-based Social Sensing: Combining AI and the Crowd to Mine Policy-Adherence Indicators from Twitter

by Virginia Negri, et al.
Politecnico di Milano

Social media provides a trove of information that, if aggregated and analysed appropriately, can provide important statistical indicators to policy makers. In some situations these indicators are not available through other mechanisms. For example, given the ongoing COVID-19 outbreak, it is essential for governments to have access to reliable data on policy adherence with regard to mask wearing, social distancing, and other hard-to-measure quantities. In this paper we investigate whether it is possible to obtain such data by aggregating information from images posted to social media. We combine recent advances in image recognition technology with geocoding and crowdsourcing techniques to build a pipeline for image-based social sensing. Our aim is to discover in which countries, and to what extent, people are following COVID-19 related policy directives. We compare the results with the indicators produced within the Covid-19 Behavior Tracker initiative by ICL and YouGov (CovidDataHub). Preliminary results show that social media images can produce reliable indicators for policy makers.




1. Introduction

The massive number of images posted to social media each day (five hundred million tweets were posted to Twitter each day in 2020) represents a relatively untapped resource for mining useful social indicators and policy information. Providing better and more timely information to policy makers could provide widespread benefit by allowing for more reliable evidence-based decision making (Fritz et al., 2019).

While text in social media has been mined extensively in the past, images have seen less interest, likely due to the difficulty of extracting semantic information from them. Indeed, according to a recent survey on social sensing (Wang et al., 2019), one current research challenge is that of analysing the interdependent relationship between sensing measurements with different data modalities, such as text, sound, images, and video, in order to obtain more accurate sensing results. In our paper we focus on jointly analysing text and images from social media posts.

Recent advances in deep-learning based image-processing techniques mean that the salient information contained within images is becoming easier to extract (Lecun et al., 2015). Moreover, since each picture can contain a wealth of information (particularly compared to the limited amount of text that often accompanies it in platforms such as Twitter), we believe that image-based pipelines could substantially increase the breadth and depth of questions that can be answered using social media data.

The goal of this paper is thus to propose a methodology for machine-learning enabled image-based social sensing of social indicators. The methodology makes use of visual (and also textual) information from social media and processes it with both AI and crowdsourcing to discover and validate observations of social behaviour, which are then aggregated in now-casting fashion (Bańbura et al., 2013) to estimate statistical indicators of social behaviour that are useful to policy makers. In our work, a semi-automated social sensing pipeline is developed that combines automated image classification techniques with crowd-based validation techniques, which then allows for reliable estimation of social indicators.

Figure 1. Examples of images posted to Twitter that could be useful for determining the level of COVID-19 related mask usage in different locations.

The specific application we tackle in this paper is that of monitoring indicators of adherence to COVID-19 related policy directives, such as the requirements to practise social distancing and to wear masks. Images on Twitter, such as those shown in Figure 1, provide useful information to an analyst tasked with determining the amount of policy adherence in different locations. Obviously a single analyst cannot alone monitor the massive flow of images on Twitter to determine the amount of adherence, but by leveraging image classification techniques we can automatically extract only the relevant images from that massive stream. The resulting stream of relevant images may still be too large for a single analyst to deal with, but by recruiting a team of analysts through crowdsourcing, the capacity of the analysts can scale to fit the data need.

In this work we combine the strengths of automated Machine Learning based image filtering techniques, namely their speed and scalability, with those of crowdsourcing, in particular its accuracy and flexibility, to present a new methodology for estimating policy indicators using an image-based social sensing framework to mine images from Twitter.

The main contributions of this work are as follows:

  • We introduce a framework for image-based social sensing that allows for image-based evidence of certain COVID-19 related social behaviour (mask wearing, social distancing, etc.) to be aggregated into indicators.

  • We develop and test a series of image filters based on deep learning techniques to automatically select with high accuracy only those images which are photos depicting a certain type of event (e.g., 2 or more people meeting in a public place).

  • We build a crowdsourcing application that leverages the crowd to extract the necessary information from the selected pictures for calculating the desired indicators.

  • We derive indicators about COVID-related behaviors from the crowdsourcing results and compare them with data from other external sources, obtained through surveys (Institute of Global Health Innovation Imperial College London and YouGov, 2020).

The paper is structured as follows. We first discuss related work in Section 2, presenting the state-of-the-art on several aspects relevant for developing an image-based social sensing pipeline. In Section 3 we present our approach for extracting information from social media regarding the social impact of COVID-19. Specifically we develop a pipeline of tools to derive information from Twitter with an approach based on the combination of AI and crowdsourcing. Finally, in Section 4 we illustrate an experimental dataset and evaluate the performance of each step in the pipeline before comparing the resulting indicators with an external data source based on surveys.

2. Related Work

In this section we discuss the different branches of research that we bring together in this paper, namely social sensing, citizen science, and deep learning.

2.1. Social Sensing versus Traditional Surveys

Social sensing has been proposed in the literature (Wang et al., 2015) as a term for describing the gathering of information from humans, whether through crowdsourcing, human-connected devices (mobiles, etc.) and/or the extraction of information from social media, with the goal of mining social signals to gain situation awareness and support decision making. A related term is crowd-sensing (Draghici and Steen, 2018).

The use of social media to gain timely evidence of ongoing emergency events (such as pictures of flooded towns, earthquake-devastated buildings, or burnt forests) has been widely studied (Havas et al., 2017). Recent approaches propose to retrieve visual evidence on the events by combining both textual data mining and automated image classification, in order to reduce the information overload involved in inspecting images manually (Huang et al., 2019; Imran et al., 2020). To the best of our knowledge, however, no previous work has looked to perform social sensing based on the visual information available in social media, i.e. to aggregate evidence from observations of social behaviour to compute near real-time social indicators that are useful for policy makers.

The traditional approach to collecting policy adherence information is through surveys. In the context of the COVID-19 emergency, surveys regarding face mask usage have been performed, and thematic maps have been produced based on survey results to study the evolution of mask use over time. Another group collecting survey-based data on COVID-19 behavioural aspects is the Covid-19 Behavior Tracker initiative (CovidDataHub project) by the Institute of Global Health Innovation (IGHI) at Imperial College London and YouGov. Weekly survey data is available online for a selection of countries. In order to generate the data, each week around 1,000 people from each country are interviewed, and summary data is made available for the countries in which the target number of respondents is reached. While the total number of countries being surveyed is 30, the reports usually present data for a subset of countries, depending on the availability of data (e.g., only four countries were reported in the first week of August, 2020). This survey data is used as an external data source for validating the results of our pipeline.

The limitation of survey-based social indicator estimation is, of course, the need to reach a sufficient number of representative individuals on a regular basis. In this paper we propose a completely different approach based on social media and crowdsourcing that avoids altogether the need to find representative individuals and entice them to respond to online surveys.

2.2. Crowdsourcing and Citizen Science

The involvement of citizens in the solution of social science problems has been proposed and discussed in the literature. In the context of critical societal challenges, the authors of a recent roadmap paper (Fritz et al., 2019) discuss the difficulty of collecting data to measure the 232 indicators related to the Sustainable Development Goals (SDG) defined by the United Nations. Citizen-generated data is considered a possible non-traditional data source that could be used to complement the official data sources, which are often costly to produce in terms of both time and resources. Citizen-generated data often allows for wider coverage, both spatial and temporal. The collection process may differ from traditional methods, with the involvement of citizens at different expertise levels, actively or passively (through social media) collecting data to support scientists, or even fostering co-creation initiatives. The main issue in this case is the quality of the collected data. In addition to general data quality metrics, such as those proposed by ISO 25012 for data quality in general and ISO 19157 for the quality of spatial data, the quality issue has been studied in depth in the context of crowdsourcing, with several strategies for ensuring data quality for this type of data (Daniel et al., 2018; Jin et al., 2020). In this paper we make use of simple majority-vote based crowdsourcing quality control techniques, deferring the implementation of more sophisticated techniques to future work.

Another important issue which arises when citizens are involved in the collection and/or assessment of data to support scientific projects is the size of the data to be analysed. In this case the task is to present the citizens with a manageable amount of data to analyse, and to select only the data that is relevant for the problem being studied. As noted in the previous section, AI techniques such as the use of automated classifiers can be employed to reduce the amount of information provided to the citizen scientists. We follow this approach in the development of our pipeline.

We note that there are emerging examples in the health care domain of even more collaborative approaches between AI and crowdsourcing, targeted in particular at helping communities at risk, where health-care providers and experts involved through crowdsourcing are actively supported by AI techniques (e.g., (Ruiz De Castaneda et al., 2019)).

2.3. Deep Learning for Image Filtering

A focus of our work is on automatically analysing the visual evidence emerging from social media before involving crowd workers and experts. Thus in this section we provide a brief overview of technology developments in deep learning based image processing that allow for implementing the large-scale filtering of images needed in this project. Later in Section 3 we will illustrate our specific approach and the specific types of filters we have developed.

As noted above, deep learning and in particular deep Convolutional Neural Network (CNN) architectures have massively improved the state-of-the-art performance in image recognition tasks (such as image classification, object recognition, segmentation, etc.) over the last few years (Lecun et al., 2015). Models like VGG (Simonyan and Zisserman, 2015), ResNet (He et al., 2016) and EfficientNet (Tan and Le, 2019) come pre-trained on massive image datasets, such as ImageNet (Russakovsky et al., 2015), and can then be fine-tuned on specific classification tasks for excellent performance, provided sufficient training data is available. In this paper, we make use of several pre-trained models. Moreover, in order to leverage task-specific training on large external collections, we often make use of models that have been fine-tuned and extended (in terms of the network architecture) for the particular image processing task that a given filter requires. In all cases, the performance of these models could be further improved by training on task-specific data gathered during the crowdsourcing phase of our pipeline.

The functionality required for building an image filtering pipeline for social sensing can be summarised in three key questions:

  1. What type of objects does the image contain?

  2. Where was the photo taken?

  3. Was the photo seen before?

To answer the first question, it is important not only to look at which objects are displayed but, most importantly, whether the displayed content is safe to show to crowdworkers. For this purpose, specialised Not Safe For Work (NSFW) classifiers exist, such as Yahoo’s OpenNSFW. This model uses a convolutional neural network based on ResNet-50 (He et al., 2016).

To identify specific objects in images, a number of techniques exist, with the most famous among them being YOLO (Redmon et al., 2015), a real-time object detector pre-trained on the COCO dataset (Lin et al., 2014). YOLO is fast and accurate, outperforming Faster R-CNN (Ren et al., 2015), the previous state-of-the-art, both in terms of accuracy and speed (100 times faster). YOLO is a convolutional model that, in a single pass, simultaneously predicts multiple bounding boxes and class probabilities for each box. Thus, unlike traditional region proposal networks, it can be trained end-to-end, and as such is much faster and more accurate. The COCO dataset, on which YOLO is trained, provides a large number of object categories, allowing a filter based on it to be used in a wide variety of scenarios. Since our study focuses on COVID-related social behaviour, an object detector can be used to filter out images containing fewer than two people.

Where was the photo taken?

The second criterion for filtering the image content is to look at where the depicted event occurred. Scene classifiers can be used for this purpose, since we can select the types of location that are of interest. This approach allows us to be flexible in the choice of scenes and to select those pertinent to the goal of the study. An open source repository containing various convolutional neural networks (CNNs) pre-trained on the Places365 dataset (Zhou et al., 2017) is available. This dataset gathers images belonging to 365 scene categories, which are sufficiently specific to be used in a wide range of tasks (including, in our case, detecting whether the location is public or private).

2.4. Geolocating Observations

In the context of many social sensing projects that make use of data from social media, one of the important issues is the ability to associate a location with the information being extracted. Considering Twitter as the most common social network from which social media information is extracted, one problem is that only a small percentage of tweets are natively geolocated and all images are stripped of metadata for privacy reasons. As a result, many authors have studied the problem of geolocating tweets from the available information (e.g., (Zheng et al., 2018; Middleton et al., 2018)).

In this paper we adopt the CIME geolocation algorithm proposed in the E2mC project (Francalanci et al., 2017; Francalanci et al., 2018), which for non-geolocated tweets extracts a possible location from the text and metadata of the post, using the Stanford CoreNLP named entity extraction algorithm (Manning et al., 2014) and OpenStreetMap (Haklay and Weber, 2008) with the Nominatim API as a gazetteer, and a context-based approach for disambiguation (Francalanci et al., 2018).

2.5. Preventing Duplicate Observations

It is important in our framework to have filters that can automatically detect and remove identical or similar images, since they would increase the workload for the crowd and, more importantly, could bias and distort the estimates of social indicators through the double counting of individual observations.

Detecting similar images that come from the same original photo source is non-trivial, since intermediate processing might result in different spatial resolutions, enhancement with various image filters, or even cropping of the source image. It is possible to train Deep Convolutional Neural Networks in a so-called Siamese Architecture for detecting near duplicates (Chopra et al., 2005). Simpler techniques that require no training, but instead make use of similarity-preserving hash functions on images, are also available, such as the well-known Perceptual Hash (P-hash) functions (Zauner, 2010). Modifications to an image, such as the rescaling and enhancement mentioned above, can easily be detected by comparing the image’s P-hash with previously seen values.

Figure 2. Components of the social sensing pipeline.

3. Approach

Our approach to image-based social sensing from Twitter leverages the respective strengths of machine learning based automated image classification techniques and human user based crowdsourcing. Figure 2 provides a depiction of the workflow developed. We now describe each of the components in the pipeline, focusing on the approach being followed to achieve the main goals:

  • automatically extract and select images from Twitter posts that likely provide evidence for estimating a social indicator (e.g. images of people meeting in a public space),

  • determine whether the candidate observations can be located automatically (at the country level), and

  • ask the crowd to validate and annotate the candidate observations, such that they can be aggregated into estimates of the indicator.

In the following subsections we illustrate the important steps in the pipeline. (A GitHub repository with the pipeline code, along with a dataset containing tweet ids and the corresponding validated annotations, will be posted once the paper is accepted.)

3.1. Keyword-based Crawling

The Twitter crawls analysed in this paper (see Section 4) were performed using the Tweepy library to access the Twitter API. During the crawl only tweets containing images were retrieved and retweets were excluded.
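As a rough illustration of the selection rule described above (keep tweets that carry images, drop retweets), the following sketch filters tweet objects that have already been downloaded. It assumes the tweets are stored as dictionaries in the Twitter API v1.1 JSON layout (the `retweeted_status` and `extended_entities` fields); the helper names `photo_urls` and `keep_tweet` are hypothetical, and this is not the crawler code used in the paper.

```python
def photo_urls(tweet):
    """Return the URLs of all photos attached to a tweet (v1.1 JSON layout assumed)."""
    media = tweet.get("extended_entities", {}).get("media", [])
    return [m["media_url_https"] for m in media if m.get("type") == "photo"]


def keep_tweet(tweet):
    """Keep original tweets (not retweets) that carry at least one photo."""
    is_retweet = "retweeted_status" in tweet
    return not is_retweet and bool(photo_urls(tweet))


# Tiny hand-made examples; real tweet objects carry many more fields.
photo = {"type": "photo", "media_url_https": "https://example.org/a.jpg"}
tweets = [
    {"id": 1, "text": "masks on the metro #covid19",
     "extended_entities": {"media": [photo]}},
    {"id": 2, "text": "RT: masks on the metro", "retweeted_status": {"id": 1},
     "extended_entities": {"media": [photo]}},
    {"id": 3, "text": "stay safe everyone"},
]

kept = [t["id"] for t in tweets if keep_tweet(t)]
print(kept)
```

Only the first tweet survives: the second is a retweet and the third has no image attached.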

3.2. Removing Duplicate Images

Since each image represents a different observation for the purpose of estimating an indicator statistic (such as the amount of mask usage in a particular country), it is important that the same image does not pass multiple times through the pipeline. The same image of an event may be posted (or retweeted) by different people on social media, and thus may appear multiple times in our crawl. Thus at the very beginning of the pipeline we implement a check to remove duplicate image URLs from the crawl.

Checking for duplicate URLs does not guarantee the complete absence of duplicate images, however, since the same image content could be available from different locations. One could remove duplicate images by computing their cryptographic hashes and discarding those already seen. However, this approach cannot detect the case in which two image files differ but both come from the same original source image. The same image, for example, may have been uploaded with different spatial resolutions or been modified with colour filters. One way to allow for such transformations is to use a similarity-preserving hash function for comparing images. Similar images will have the same similarity hash, so repeats can be discarded. In our framework we adopt the perceptual hash (P-hash) function (Zauner, 2010). The P-hash algorithm extracts transformation-invariant features from the multimedia object and computes their hash. As a result, similar images will have the same features, and thus also the same hash, as shown for instance in Figure 3.

Figure 3. Transformations on an image that do not affect its P-hash value: (a) original image, (b) rescaled image, (c) saturated image, (d) blurred image.
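To make the idea of a similarity-preserving hash concrete, here is a minimal pure-Python sketch of a DCT-based perceptual hash in the spirit of P-hash. It is a simplified stand-in, not the implementation of Zauner (2010) nor the code used in the pipeline: the 32x32 box-average downscaling, the 8x8 block of low frequencies, the median threshold, and the function names are all assumptions chosen for illustration, and images are represented as plain lists of grayscale rows.

```python
import math
import statistics

SIZE, KEEP = 32, 8  # downscale to 32x32, keep the 8x8 lowest DCT frequencies


def shrink(img, size=SIZE):
    """Box-average a grayscale image (list of rows of ints) down to size x size."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(size):
        y0 = i * h // size
        y1 = max((i + 1) * h // size, y0 + 1)
        row = []
        for j in range(size):
            x0 = j * w // size
            x1 = max((j + 1) * w // size, x0 + 1)
            block = [img[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def dct_low(vec, keep=KEEP):
    """First `keep` coefficients of the (unnormalised) DCT-II of vec."""
    n = len(vec)
    return [sum(v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(vec))
            for k in range(keep)]


def phash(img):
    """64-bit hash: one bit per low-frequency DCT coefficient (above median or not)."""
    rows = [dct_low(r) for r in shrink(img)]                     # DCT along rows
    cols = [dct_low([r[k] for r in rows]) for k in range(KEEP)]  # then along columns
    coeffs = [c for col in cols for c in col]
    med = statistics.median(coeffs)
    bits = 0
    for c in coeffs:
        bits = (bits << 1) | int(c > med)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


# Synthetic 64x64 image and a copy upscaled to 128x128 by pixel replication.
base = [[(7 * x + 13 * y + (x * y) % 31) % 256 for x in range(64)] for y in range(64)]
upscaled = [[base[y // 2][x // 2] for x in range(128)] for y in range(128)]

print(hamming(phash(base), phash(upscaled)))  # prints 0
```

In practice near-duplicates are usually detected by thresholding the Hamming distance between hashes rather than requiring exact equality, since transformations such as blurring or recolouring may flip a few bits.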

3.3. Filtering Irrelevant Images

In order to fulfil our goal of gathering significant statistical information for policy makers, we must provide our crowd with only relevant data for the task. Thus we perform a set of filtering operations to extract only those images that are likely relevant. For this purpose, we built an image filtering pipeline based on deep learning techniques, including both state-of-the-art models pre-trained on large public datasets, and custom filters built according to our needs. The pipeline performs the following filter operations:

  • Removing non-photos

  • Removing NSFW content

  • Detecting the scene

  • Detecting people

In principle, the four classifiers can be applied in any order; however, we ordered them on the basis of their performance characteristics, considering both speed of execution and selectivity. The characteristics of the image classification filters, each applied separately to 1,000 randomly selected images, are reported in Table 1. The geocoding algorithm requires more than a second per tweet, so it was performed after the image filtering.

Filter | Images removed | Time per image
person detector | 78.6% | 0.99 s
photo detector | 65.3% | 0.58 s
NSFW detector | 7.7% | 0.33 s
public/private scene | 20.3% | 0.34 s
Table 1. Selectivity and execution time for image filters on 1,000 randomly chosen images
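The trade-off between speed and selectivity can be made explicit with a small calculation. The sketch below uses the figures from Table 1 and a standard heuristic for ordering independent filters (sort by time per image divided by fraction removed). It assumes that the filters reject images independently of one another, which is a simplification, and it is not necessarily the ordering rule the authors applied.

```python
# (fraction of images removed, seconds per image), taken from Table 1.
FILTERS = {
    "person detector": (0.786, 0.99),
    "photo detector": (0.653, 0.58),
    "NSFW detector": (0.077, 0.33),
    "public/private scene": (0.203, 0.34),
}


def expected_cost(order, filters=FILTERS):
    """Expected seconds spent per crawled image when filters run in `order`.

    Assumes filters reject images independently of each other; Table 1
    reports each filter applied on its own, so this is only an estimate.
    """
    cost, surviving = 0.0, 1.0
    for name in order:
        removed, seconds = filters[name]
        cost += surviving * seconds   # only surviving images reach this filter
        surviving *= 1.0 - removed
    return cost


def greedy_order(filters=FILTERS):
    """Sort filters by time per image divided by fraction of images removed."""
    return sorted(filters, key=lambda name: filters[name][1] / filters[name][0])


table_order = list(FILTERS)
best_order = greedy_order()
print(best_order)
print(round(expected_cost(table_order), 3), round(expected_cost(best_order), 3))
```

Under these assumptions, running the photo detector first is cheaper per image than running the slower (though more selective) person detector first.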

3.3.1. Removing non-photos

In order to efficiently remove from the crawled images those that are not photos, a photo detector was implemented. The crawled images contained a significant percentage of irrelevant images corresponding to internet memes or photos modified with text. To tackle this problem, a VGG19 model, pre-trained on the ImageNet dataset (Deng et al., 2009), was fine-tuned on a data set containing 3376 images labelled as memes / non-acceptable photos (taken from the Reddit Memes Dataset) and 2448 images considered acceptable (taken from the Multi-Salient-Object (MSO) Dataset). To fine-tune the model, VGG19’s last layer was replaced to adapt the model to the new classification task, and all layers inherited from the original architecture were held frozen during training. In this way, the model achieved excellent performance on the filtering task (see Section 4.1.1).

3.3.2. Removing NSFW content

It is critical to discard all Not Safe For Work (NSFW) content from our data before passing it to the crowd workers. To this end, we made use of Yahoo’s NSFW classifier, OpenNSFW. Deciding what type of content is safe or not is subjective and context-specific. Yahoo’s model specifically filters out pornographic content; it does not address non-photos or offensive text, both of which we target with the photo filter. It also does not address images depicting violence, which we might in any case want to retain in order to investigate people’s behaviour on behalf of policy makers. We use this model as a preliminary filter, knowing that it provides only a limited guarantee on the accuracy of the output. We therefore warn our crowd about the possibility of encountering explicit content and ask them to alert us in such cases.

3.3.3. Detecting the scene

Selecting the right scene in images allows us to extract a more meaningful subset of data for our task. For this purpose we introduced a scene detector in our pipeline. This component consists of a convolutional neural network able to classify an image as belonging to one of a set of scene categories. In our framework we used an open-source model pre-trained on Places365 (Zhou et al., 2017), a public dataset of images corresponding to 365 scene categories. For our specific task, a more meaningful distinction is that between public and private scenes. Thus, we aggregated the original 365 scenes into these two subsets. Policy makers are likely to be more interested in observing how people behave in public scenes such as streets or supermarkets, where they must conform to safety regulations, than in private spaces.
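One simple way to collapse scene categories into the two groups is to sum the classifier's probabilities over each group and pick the larger total. The sketch below illustrates this under stated assumptions: the category-to-group assignment is hypothetical (the mapping used in the paper is not given in the text), the aggregation rule is only one plausible choice, and the probabilities are made up.

```python
# Hypothetical assignment of a few scene category names to the two groups;
# the full public/private mapping used in the paper is not given in the text.
PUBLIC = {"street", "supermarket", "plaza"}
PRIVATE = {"living_room", "bedroom", "kitchen"}


def public_private(scene_probs):
    """Collapse per-category scene probabilities into a public/private label.

    Returns (label, public_mass, private_mass); ties fall back to "private".
    """
    public = sum(p for name, p in scene_probs.items() if name in PUBLIC)
    private = sum(p for name, p in scene_probs.items() if name in PRIVATE)
    label = "public" if public > private else "private"
    return label, public, private


# Made-up classifier output for one image (top categories only).
probs = {"street": 0.55, "plaza": 0.20, "living_room": 0.10, "kitchen": 0.05}
print(public_private(probs)[0])
```

Treating ties as "private" is a conservative choice here, since only public scenes are passed on to the crowd.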

3.3.4. Detecting People

Detecting specific objects in a scene helps to narrow the crawl down to the images most relevant for our purpose. To do so, we introduced in our pipeline the YOLO (Redmon et al., 2015) object detector, pre-trained on the COCO dataset (Lin et al., 2014). In the specific scenario of gathering relevant images for policy makers during the COVID-19 outbreak, we extracted images containing people. In addition, retaining only images with at least two people significantly increased the quality of the result, for example by discarding all selfies.

Figure 4. YOLO object detector used to detect people
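The people filter reduces to counting "person" detections in the detector output. A minimal sketch follows, assuming the detector's output has already been converted into a list of `(label, confidence)` pairs; that format, the 0.5 confidence threshold, and the function names are assumptions for illustration rather than details taken from the paper.

```python
def count_people(detections, min_confidence=0.5):
    """Count detections labelled 'person' with confidence at or above the threshold."""
    return sum(1 for label, conf in detections
               if label == "person" and conf >= min_confidence)


def keep_image(detections, min_people=2):
    """Keep only images showing at least `min_people` people."""
    return count_people(detections) >= min_people


# Made-up detector outputs for two images.
group_photo = [("person", 0.92), ("person", 0.81), ("handbag", 0.60)]
selfie = [("person", 0.97), ("cell phone", 0.55)]

print(keep_image(group_photo), keep_image(selfie))
```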

3.4. Geocoding Images

In order to evaluate the social impact of COVID-19 in different countries, it is necessary to associate a location with each post. The geolocation was performed using the CIME service (Francalanci et al., 2017; Francalanci et al., 2018) described in Section 2, applying it to the textual part of the tweet, combined with the textual user location, if present. When a geolocation was already available from Twitter itself, we used that original location rather than running CIME.

With the goal of creating thematic maps, we located each post with CIME and then extracted the country or territory it refers to. When multiple candidate locations were available, one was chosen at random to be shown to the crowd. The CIME function we used returns the coordinates of the centroid of each candidate location. We used these coordinates to extract the corresponding country or territory code. Only tweets with an associated location are sent to the crowd workers for analysis, with the name of the location shown as text.

Figure 5. Crowdsourcing interface with the questions posed on the left and the text of the tweet, the proposed country and the image from the tweet on the right.

3.5. Crowdsourcing

In our project we use a citizen science approach to complement the collection of information, asking the crowd to evaluate the behaviour of people in the extracted images. A series of questions is posed to the crowd workers to assess the visible behaviour of the people. In particular we focus on social distancing and the use of face masks, as this data is difficult to extract automatically (see, for example, existing challenges on mask detection) and in many cases requires human judgement.

The open source PyBossa platform for human data mining has been adopted in this project, with the extension of the Project Builder realised at the Citizen Science Center Zurich for easier creation and management of crowdsourcing projects. Each post is shown to the crowd worker with the image and a proposed geolocation for it (see Fig. 5). A series of questions about image contents related to the Covid-19 pandemic is put to the crowd worker, concerning social distancing and face mask usage. The full list of questions is given in Appendix A (Crowdsourcing questions). Some of the questions are conditioned on the answer to a previous question. For instance, if the crowd worker’s response to the first question, “Is this a photo (rather than a cartoon, graph, meme, etc.)?”, is “No”, then no further questions are asked about that image.

For each tweet, a separate task is created, and redundancy is set to 3, such that three independent crowd workers have to analyse the tweet for the task to complete. This is done in order to ensure, and be able to assess, the quality of the crowdsourcing results. The crowd was composed of 38 volunteers recruited from social networks and working students.

3.6. Result Aggregation and Visualisation

For the completed tasks, the most frequent response to each question is selected by majority vote before proceeding with the analysis. In addition, all posts for which a “surely not” answer was given to the question “Do you think the picture was likely taken in this location?” were discarded, since their geolocation is incorrect and they are therefore not useful for mapping.
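The aggregation step can be sketched in a few lines. The example below assumes three answers per question (the redundancy used in Section 3.5), returns no result when there is no strict majority, and applies the “surely not” location rule to the aggregated answer. The question keys, the “probably yes” answer option, and the tie-handling rule are hypothetical choices made for illustration.

```python
from collections import Counter


def majority(answers):
    """Return the answer given by more than half of the workers, else None."""
    answer, votes = Counter(answers).most_common(1)[0]
    return answer if votes > len(answers) / 2 else None


def aggregate(task_answers):
    """Aggregate one task, given as a dict mapping question -> list of worker answers.

    Returns None when the crowd rejects the proposed location (assumption:
    the rule is applied to the majority answer, not to any single answer).
    """
    result = {question: majority(answers) for question, answers in task_answers.items()}
    if result.get("location") == "surely not":
        return None
    return result


# Made-up answers from three workers for one tweet.
task = {
    "masks": ["yes", "yes", "some of them"],
    "location": ["probably yes", "surely not", "probably yes"],
}
print(aggregate(task))
```

With three workers and more than two answer options, a three-way split is possible; returning `None` in that case leaves the decision on how to treat such tasks to the analyst.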

After the crowd assessment and aggregation, a thematic map is produced presenting the resulting indicators, i.e. percentages for the various questions across the countries for which sufficient data has been retrieved. The maps are generated using the Python plotly library with Mapbox choropleth polygon maps and are interactive: the user can select different questions for display and, by hovering over a country, see the counts from which each percentage indicator has been computed.

Stage of pipeline | Number of tweets
crawled tweets with images | 470,255
after all image filtering | 42,978
after automated geolocating | 25,541
annotated via crowdsourcing | 2,461
with location validated by crowd | 2,061
Table 2. Number of tweets after each phase of the social-sensing pipeline

4. Experiments

We now discuss an actual execution of the pipeline and analyse the results it produced.

A Twitter crawl was performed on May 13, 2020 using the keywords: {coronavirus, corona, virus, covid, covid19, covid-19, flu, wuhan, Coronaviridae, N95}. The crawl was limited to tweets containing images and produced a total of 470,255 tweets, all posted within a 37-hour time period from May 12, 2020 02:02:06 to May 13, 2020 14:58:27 (GMT). After filtering the crawled images through the de-duplication + photo/NSFW/person/scene detection pipeline, a total of 42,978 tweets remained. Of those, only 3% were natively geolocated (which is in line with percentages often reported in the literature). Using the CIME algorithm we were able to geolocate 25,541 tweets (59%), which were used for the crowdsourcing phase.

In Table 2 we show the number of tweets after each phase of the pipeline. We note that the number of tweets emerging from the geocoding step was too large to be evaluated in its entirety by the crowd available, and only around 10% of the tasks were completed by at least three crowd workers. Possible remedies for this scaling issue will be discussed later, but we note that even with a limited crowd resource, we are able to produce reasonable predictions (as noted in Section 4.2) based on the completed tasks for which the crowd confirmed the geolocation of the image.
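The stage-to-stage retention implied by Table 2 can be checked with a few lines of arithmetic. The sketch below only uses the counts reported in the table; the function name `retention_rates` is introduced here for illustration.

```python
# Counts copied from Table 2.
stages = [
    ("crawled tweets with images", 470_255),
    ("after all image filtering", 42_978),
    ("after automated geolocating", 25_541),
    ("annotated via crowdsourcing", 2_461),
    ("with location validated by crowd", 2_061),
]


def retention_rates(stages):
    """Return (stage name, fraction of the previous stage retained)."""
    rates = []
    for (_, previous), (name, current) in zip(stages, stages[1:]):
        rates.append((name, current / previous))
    return rates


for name, rate in retention_rates(stages):
    print(f"{name}: {rate:.1%} of the previous stage")
```

The geolocation and crowd-validation ratios computed this way correspond to the 59% and 84% figures quoted in the text, and the annotation ratio to the "around 10%" of tasks completed.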

Figure 6. “Are people wearing masks?” The map shows the percentage of images in which people are in a public place and wearing a mask, using tweets crawled on May 13, 2020.

After the crowd evaluation phase, the thematic map shown in Figure 6 was produced. It displays the responses to the question “Are people wearing masks?”, i.e. the percentage of images in which individuals are seen to be wearing masks. To ensure that the statistics shown are meaningful, only countries with at least 5 labelled posts are displayed. For instance, by hovering the cursor over the United States in Fig. 6 we see that there were 139 validated tweets, with yes=51%, some of them=17% and no=32%.
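A minimal sketch of how such per-country percentages could be derived from the aggregated labels is given below. It assumes the input is a list of (country, answer) pairs for posts that have already passed location validation; the function name and data layout are illustrative, and only the threshold of 5 labelled posts comes from the text.

```python
from collections import Counter, defaultdict

MIN_POSTS = 5  # threshold stated in the text


def country_indicators(posts, min_posts=MIN_POSTS):
    """Compute answer percentages per country.

    posts is an iterable of (country, answer) pairs. Countries with fewer
    than min_posts labelled posts are omitted from the result.
    """
    counts = defaultdict(Counter)
    for country, answer in posts:
        counts[country][answer] += 1

    indicators = {}
    for country, answer_counts in counts.items():
        n = sum(answer_counts.values())
        if n < min_posts:
            continue  # too few posts for a meaningful percentage
        indicators[country] = {
            "n": n,
            "pct": {a: 100 * k / n for a, k in answer_counts.items()},
        }
    return indicators


# Made-up labels, for illustration only.
example = [("A", "Yes")] * 3 + [("A", "No")] * 2 + [("B", "Yes")] * 4
print(country_indicators(example))
```

Keeping the raw count `n` alongside the percentages mirrors the hover information described for the map, where the user sees the number of validated tweets behind each indicator.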

Finally, to show variations over time in the indicators being estimated, we generated a second map, shown in Figure 7, using data from a second crawl of Twitter performed in August 2020.

Figure 7. Are people still wearing masks on the 1st of August? An updated map from a more recent crawl of Twitter shows changes in the usage of masks across countries.

4.1. Evaluating Individual Components

We now discuss the validity of the approach. In this section we analyse the performance of each component of the pipeline (namely the image filters and the geocoding), and then in Section 4.2 we compare the obtained country-wise indicators with those derived from an external survey-based data source.

4.1.1. Filter Evaluation

Each filter described in Section 3.3 was evaluated by computing its Precision, Recall and F1 measure, as shown in Table 3. Precision in this case measures the percentage of relevant images amongst those retrieved (accepted) by the filter, while Recall is the fraction of relevant images retrieved. Metrics were computed on a random sample of 700 images from the crawl, with ground truth annotations provided by three independent annotators and aggregated using majority vote. Test images for the NSFW, people, and photo detectors were selected from the entire crawl, while for the scene classifier they were chosen from those accepted by the photo detector, since the scene detector expects to see only photos (rather than cartoons, etc.).

Filter                  Precision   Recall    F1
photo detector          99.77%      94.67%    97.15%
people detector         95.81%      98.77%    97.26%
NSFW detector           99.38%      99.85%    99.61%
public-private scene    91.81%      96.91%    94.29%
Table 3. Evaluation of the various image filters in pipeline. Each filter is evaluated on 700 randomly chosen images with the ground truth labelled by 3 independent annotators.
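For reference, the three metrics reported in Table 3 can be computed as sketched below. The confusion counts in the usage example are made up for illustration; only the precision and recall of the photo detector are taken from the table.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion counts.

    tp: relevant images accepted by the filter
    fp: irrelevant images accepted by the filter
    fn: relevant images rejected by the filter
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def f1_from(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# Illustrative counts, not taken from the paper.
print(precision_recall_f1(tp=8, fp=2, fn=2))

# Consistency check against the photo detector row of Table 3.
print(round(f1_from(99.77, 94.67), 2))
```

Recomputing F1 from the rounded precision and recall in the table can differ from the reported value in the last decimal, since the reported figures were presumably derived from unrounded counts.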

4.1.2. Geocoding evaluation

One of the questions in the crowdsourcing task asked workers to evaluate the correctness of the geolocation. We excluded the posts whose locations the crowd marked as “Surely not”. From the data reported in Table 2, it can be seen that 84% of the annotated posts with automatically extracted locations could be retained for further analysis.

4.2. External Validation

We evaluated the accuracy of the overall pipeline by comparing the statistics generated from the annotated images with an external online-survey dataset. In particular, we compare with the CovidDataHub survey dataset produced by Imperial College London (ICL), since it provides weekly information on the social impact of COVID-19. The particular aspect on which we focus for the validation concerns the use of masks in different countries, with the specific survey question being whether the respondent has “Worn a face mask outside my home”, with possible answers: Not at all, Rarely, Sometimes, Frequently, Always. For the period considered (May 11-17, 2020), the CovidDataHub portal provides data about 25 countries (out of 30 in which surveys are run). The corresponding question from the crowdsourcing stage of our image-based social sensing pipeline asked: “Are the people wearing masks?” with reference to an image of people in a public location, where the possible answers were: Yes, Some of them, No, Cannot tell. Images assigned the ‘Cannot tell’ response were excluded from further analysis, and to enable the comparison with the survey data, we map the survey responses down to three categories as follows: {Not at all, Rarely} => No, {Sometimes} => Sometimes, {Frequently, Always} => Yes.
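The category mapping described above can be written down directly. The following is a sketch under the assumption that the survey data is available as per-country shares of respondents for each of the five answers; the function name and the example shares are hypothetical.

```python
# Mapping from the five survey answers to the three comparison categories,
# as stated in the text.
SURVEY_TO_CATEGORY = {
    "Not at all": "No",
    "Rarely": "No",
    "Sometimes": "Sometimes",
    "Frequently": "Yes",
    "Always": "Yes",
}


def collapse_survey(shares):
    """Collapse five-point survey shares into three categories.

    shares maps each survey answer to the fraction of respondents
    who gave it.
    """
    collapsed = {"No": 0.0, "Sometimes": 0.0, "Yes": 0.0}
    for answer, share in shares.items():
        collapsed[SURVEY_TO_CATEGORY[answer]] += share
    return collapsed


# Made-up shares for a single country, for illustration only.
example_shares = {
    "Not at all": 0.10,
    "Rarely": 0.10,
    "Sometimes": 0.20,
    "Frequently": 0.25,
    "Always": 0.35,
}
print(collapse_survey(example_shares))
```

The survey category “Sometimes” is then compared against the crowd answer “Some of them”, although the two are not strictly equivalent: one describes how often a respondent wears a mask, the other how many people in an image do.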

For our image-based pipeline, we include only countries and territories for which we have collected at least 5 annotated images. This corresponds to 40 countries in total, as shown in Figure 6. It should be noted that the map includes a number of countries in which performing online surveys might be complicated for various reasons.

Figure 8. Are people wearing masks? Comparison of results from manual surveys (left) and our social media based pipeline (right). Here we include only countries with at least 20 labelled images as of mid May, 2020.

In Figure 8, we compare charts of the survey results (CovidDataHub data) and the pipeline for 9 countries with a large representation in our data (more than 20 annotated images each). The visual correspondence between the two charts, which are based on completely independent data sources, lends a great deal of support to our approach. To further investigate the correspondence, we computed Pearson’s correlation between the survey responses and the results of our social sensing pipeline, as shown in Figure 9. If we treat the survey and crowdsourcing responses as binary (by removing the response category ‘Sometimes’/‘Some of them’), the resulting Pearson’s correlation across the 19 common countries for the two data sources is 0.89, indicating that the indicators vary in a very similar way across those countries.

Figure 9. Correlation between our social-media based frequency estimates (rows) and survey responses (columns) across 16 countries for a question about mask wearing. If question responses are instead treated as binary (by removing the category ‘sometimes’/‘some of them’) the correlation between the estimates increases to 0.89.
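As an illustration of the correlation analysis, the sketch below computes Pearson's correlation between two lists of per-country values and shows one possible way of obtaining a binary indicator by dropping the middle category. The exact procedure used to produce the reported figures is not spelled out in the text, so the renormalisation in `binary_yes_share` is an assumption.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def binary_yes_share(yes, no):
    """Share of 'Yes' once the middle category has been removed.

    Assumption: the remaining two categories are renormalised to sum to 1.
    """
    return yes / (yes + no)


# United States values quoted for Figure 6: yes=51%, no=32%.
print(round(binary_yes_share(51, 32), 3))
```

In use, `pearson` would be applied to the list of binary shares from the pipeline and the corresponding list from the survey, one value per common country.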

In conclusion, even with a limited number of annotations from the crowd, we see that the indicators produced are well correlated with external survey data. It should be noted that the information gathered through surveys can itself be variable. For instance, for the last period available at the time of writing, the week of July 27-August 2, 2020, the CovidDataHub shows data for only four countries. With our method, the number of covered countries should be less dependent on the period, and crowd workers can help provide useful indications to decision makers (see Fig. 7, showing all countries with at least 5 posts).

5. Conclusions

In this paper we have demonstrated that it is possible to build a social sensing pipeline with humans in the loop for collecting important policy indicators from social media images posted on Twitter. The method is scalable since it combines machine-learning-based automated filters, which reduce the amount of social media data to be manually analysed, with crowdsourcing-based annotation for deriving indicators for the problem at hand. The presented approach is general and can be extended to other contexts of investigation by selecting the appropriate filters and questions for the crowd.

There are a number of directions for improving the image-based social sensing pipeline, including:

  • Multi-linguality: extend the crawl to languages such as Arabic, Russian, Spanish, Portuguese, Indonesian, etc.

  • Evaluate the approach for new indicators, such as monitoring climate events (floods, hurricanes, etc.) or engagement in social movements (such as the “Black Lives Matter” protest movement).

  • Study feedback mechanisms and non-monetary incentives to increase crowd involvement and overcome scaling issues.

  • Make use of the crowd responses to fine-tune specific filters and improve their performance over time.

  • Incorporate more sophisticated crowdsourcing quality control mechanisms, such as worker vetting.

  • Investigate time series aspects of the indicators produced, and the ability to now-cast and pre-empt survey results.


This work was partially funded by the European Commission H2020 project Crowd4SDG “Citizen Science for Monitoring Climate Impacts and Achieving Climate Resilience”, project no. 872944.

This work expresses the opinions of the authors and not necessarily those of the European Commission. The European Commission is not liable for any use that may be made of the information contained in this work.

The authors thank all students working as crowd workers and all other anonymous contributors to the crowdsourcing phase.

The proof-of-concept prototype for this research was started during the VersusVirus and EUvsVirus online hackathons run in April and May 2020.

Appendix A: Crowdsourcing questions

Questions asked during the crowdsourcing step were the following, with some questions contingent on the answer to the previous question.

  1. Is this a photo (rather than a cartoon, graph, meme, etc.)?
    - Yes, No, Not Sure

  2. Does it look like it has been taken recently (in the last three months)?
    - Yes, No, Cannot tell

  3. Are there people in this image?
    - Yes, No, Not Sure

  4. Are the people wearing masks?
    - Yes, Some of them, No, Cannot tell

  5. If so, which type?
    - Scarf, Cloth, Surgical, FP2, FP3, Gas mask, Other, Cannot tell

  6. Are the people wearing the mask correctly?
    - Yes, No, Only some of them, Cannot tell, Not sure

  7. How many people are there in the image?
    - 1, 2, 3, 4, 5 or more

  8. Are they respecting social distance?
    - Yes, No, Cannot tell.

  9. Are they in a public place (shops, outdoors, …)?
    - Yes, No, Not sure

  10. If they are in a public place, what type?
    - street/square, park, shop, hospital, outdoors, other, cannot tell

  11. What are the people doing?
    - socializing, exercising, shopping, queuing, volunteering, protesting, working, other, cannot tell

  12. We have associated a country or territory with this image. Do you think the picture was likely taken in this location?
    - Yes, Maybe, Surely not, Cannot tell.

