Curating Social Media Data

by Kushal Vaghani, et al.

Social media platforms have democratized access to the pulse of people in the modern era. Due to their immense popularity and high usage, data published on social media sites (e.g., Twitter, Facebook and Tumblr) is a rich ocean of information. Data-driven analytics of social imprints has therefore become a vital asset for organisations and governments seeking to improve their products and services. However, due to the dynamic and noisy nature of social media data, performing accurate analysis on raw data is a challenging task. A key requirement is to curate the raw data before it is fed into analytics pipelines. This curation process transforms the raw data into contextualized data and knowledge. We propose a data curation pipeline, namely CrowdCorrect, to enable analysts to cleanse and curate social data and prepare it for reliable analytics. Our pipeline performs automatic feature extraction from a corpus of social media data using existing in-house tools. Further, we offer a dual-correction mechanism using both automated and crowd-sourced approaches. The implementation of this pipeline also includes a set of tools for automatically creating micro-tasks to facilitate the contribution of crowd users in curating the raw data. For the purposes of this research, we use Twitter as our motivating social media platform due to its popularity.




1.1 Background

Ever since the dawn of the industrial age, understanding data to gain knowledge and wisdom has been given the utmost importance. Data can be described as merely a collection of facts such as numbers, words, measurements and posts on blogs [38]. Today, continuous improvements in connectivity, storage and processing allow access to a deluge of data from open and private sources. Such raw data needs to be processed to increase its usefulness; once processed and transformed into knowledge, it can support meaningful insights and decision-making processes. With the modern popularity of social media networks such as Twitter, Facebook and LinkedIn, an enormous amount of open data content (e.g., tweets on Twitter) is published on a daily basis [141]. As an example, there are approximately 500 million tweets posted each day on Twitter. It is no secret [91] that the world is glued to social media, with user populations in the millions. The data within these social channels natively captures the pulse and opinions of the masses in a way never before available. This opens up new opportunities for a deeper understanding of aspects such as trends, opinions and influential actors, and such data can provide valuable insights to aid decision making in diverse areas such as marketing, public policy and healthcare. Organisations can use social data to target and validate their marketing campaigns; governments can devise better policies and improve their services. As an example, the Australian government’s Department of Jobs and Small Business states on its website that it uses social media to improve stakeholder engagement, among other aims such as countering inaccurate news and promoting transparency. Another research study [66] links social media usage to brand and product loyalty. Organisations as well as governments therefore consider the analysis of such information a vital asset and a strategic priority.
Raw data from social media sites needs to be pre-processed, contextualized and prepared (i.e., curated) for analytics. Motivations for curating social media data are discussed in Section 1.2. The curation process consists of ingesting, cleaning, merging, linking, enriching and preparing the data for analytics. In short, it transforms the raw (structured, semi-structured and unstructured) data into curated data, i.e., contextualized data and knowledge [36]. This curated data is then made available to applications and end-users.

1.2 Motivations and Problems

There are several motivations and research issues in preparing raw social data for analytics tasks. Raw data from social platforms is generally semi-structured, consisting of unstructured parts such as text and media alongside structured parts such as friend/follower relationships. Structured data is organised and easy to process, e.g., a list of followers on Twitter in the standard JSON format, while unstructured data is difficult to process [94]. Next, since social networks allow their users to express themselves without restrictions (e.g., freeform tweet text), there is a high amount of noise in the raw data [139, 63]. Such noise includes misspellings, slang words, abbreviations, truncations, incorrect syntax and grammatical errors; Table 1.1 illustrates some examples prevalent on social media. In addition, words and phrases often need proper contextualization to be comprehended correctly. For example, Figure 1.1 shows two tweets taken from Twitter which contain the word doctor: whilst the second tweet (B) is related to health, the first tweet (A) refers to a song.

Issue | Example
Spelling mistakes | healht, hspital
Abbreviations | lol (laugh out loud), aust. (Australia)
Phonetic substitutions | lyk (like), 2 (to)
Jargon | pill (for medicine)
Truncation | tom (for tomorrow)
Deletion of words | gng home (for “I am going home”)
Table 1.1: Common text issues in social media
Figure 1.1: Tweet example: the same word “doctor” used in different contexts.
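As an illustration of how issues like those in Table 1.1 might be handled automatically, the sketch below combines a slang/abbreviation dictionary with fuzzy spelling correction against a lexicon. The `SLANG` map and `LEXICON` here are toy stand-ins, not the actual resources used in this thesis; a real pipeline would draw on external services such as a spell-check API.

```python
import difflib

# Hypothetical resources: a small domain lexicon and a slang/abbreviation map.
LEXICON = {"health", "hospital", "doctor", "budget", "insurance"}
SLANG = {"lyk": "like", "lol": "laugh out loud", "aust.": "australia",
         "tom": "tomorrow", "gng": "going"}

def normalize_token(token: str) -> str:
    """Resolve slang/abbreviations first, then fall back to fuzzy spelling."""
    t = token.lower()
    if t in SLANG:
        return SLANG[t]          # abbreviation or phonetic substitution
    if t in LEXICON:
        return t                 # already a known word
    # Fuzzy match against the lexicon to catch misspellings such as "healht".
    match = difflib.get_close_matches(t, LEXICON, n=1, cutoff=0.8)
    return match[0] if match else t

print(normalize_token("healht"))
print(normalize_token("lyk"))
```

This dictionary-plus-edit-distance pattern only covers the lexical issues in Table 1.1; contextual ambiguity, as in the “doctor” example of Figure 1.1, requires the richer correction mechanisms discussed later.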

Essentially, the quality of raw social data is low [92], which introduces linguistic challenges in machine processing for analytics and can lead to inaccurate analysis [3]. These quality issues are further compounded by the large volumes of data generated daily (size) at a continuous rate (dynamism). Without a robust data cleansing and curation process to resolve these issues, the results of carrying out analytics would be erratic. Here, cleansing refers to improving the quality of raw data, while curation refers to a process that produces contextualized data which can be used for analytics [36, 119]. Research has shown that state-of-the-art natural language processing (NLP) systems perform significantly worse on social media text [63]. In order to better understand these challenges, we consider a motivating scenario in Twitter.

An Example from Twitter

Shown below are three tweets extracted from a corpus of tweet data collected in public response to the Australian government’s annual budget announcement in 2016. While there are thousands of similar tweets in the corpus, the three illustrated below highlight some of the challenges of raw social media data:

  1. "MRI and CT Scan must be at loooowest $ for needy patients #budget"

  2. "Healht insurers given all clear. OMG!!"

  3. "@arcgp: low socio-economic bypass pat. head to emergcy departments as aussie govt’s budget freezes #budget2016"

A few issues are apparent from the above sample tweets. First, there is no preference for any standard form of text or language; such social media data is full of grammatical and syntactical errors. Second, there is a heavy inclination towards internet slang, jargon and abbreviations. For example, MRI, CT and OMG are some of the short forms or abbreviations used in the tweets above, and tweet (3) contains the word bypass, a medical term for a (heart) surgery. Third, there are numerous spelling errors, with words such as loooowest and healht spelled incorrectly. These examples also show that individual social media posts, tweets in this case, are usually very short and sparse. Table 1.1 highlights some of the common issues found in social media text. Such issues necessitate proper cleansing and curation of data for robust analytics. An ineffective or absent cleansing and curation process may lead to the following implications:

Faulty decisions. Without robust cleansing and curation, raw social data fed into deeper analytics is not fit for use [36]. For example, consider an analyst who wants to classify tweets (from Twitter) related to doctors to ascertain whether general feedback is positive or negative. The analyst would rely on a classifier or other analytical tool that computationally parses each tweet and tags it with a label such as doctor or other. The tweets illustrated in Figure 1.1 would both be classified as doctor; however, this would be inaccurate for tweet (A). The example highlights that without proper cleansing and curation, relevant data points are missed or assigned incorrectly; decision making can thus be severely compromised by incorrect judgements.

High costs. Decision makers in organisations rely on data analytics to gauge customer sentiment and thereby improve marketing and other strategies, and social media data, due to its popularity, forms a key part of this analytics. The impact of poor quality data on organisations has been widely studied; reliance on bad data can have adverse effects [83, 127]. Such effects can manifest as customer dissatisfaction and loss of business and credibility, all of which raise the cost of running the business in the longer term. For example, misspellings in customer records (e.g., name and address) often lead to mail or product delivery errors and higher system maintenance costs. To summarize, raw data can be compared to a raw material such as iron ore: raw iron has limited use, but once cleaned and alloyed into steel it becomes useful for construction and other industries. Similarly, curation transforms raw data into something useful for the analytics process [35].

1.3 Contributions

To address the above-mentioned challenges, in this thesis we propose an extensible social data curation pipeline for transforming raw social data into contextualized data and knowledge [23]. It comprises two main phases: (i) Automated feature extraction and correction: we design and implement micro-services to extract features such as keywords from a corpus of tweet data and automatically perform major data cleansing tasks on the extracted keywords. (ii) Crowdsourced correction: we extend our approach to use crowd input to further cleanse data that could not be corrected in the earlier step. To achieve this, we take the features (e.g., keywords) extracted in the earlier step and automatically generate micro-tasks with possible options for the user to choose from. These micro-tasks are presented to users within a simple web interface. Our micro-task generation service uses external knowledge bases, such as Bing spell check, to suggest possible answers. Further, we select a corpus of tweets, perform the automated and crowd correction steps, and then use a classifier to measure accuracy; we compare the results against a similar classification performed without our approach.

1.3.1 CrowdCorrect Curation Pipeline

We propose CrowdCorrect as a social data curation pipeline consisting of several steps. The first step covers automatic feature extraction (e.g., keywords and named entities) and correction. We focus on three types of textual issues found in social media, namely:

  1. Misspellings - we provide services to correct the spelling

  2. Abbreviations - we provide services to expand to the full form, e.g., Aus. to Australia

  3. Jargon - we provide services to normalize to a more standard form, e.g., replace cardiologist with doctor

The automatic correction relies on external knowledge sources to identify the best possible match. In the second step, we design micro-tasks and use the wisdom of the crowd to identify and correct information items that could not be corrected in the first step. The micro-tasks are automatically generated from the extracted features, with possible correction suggestions chosen from external knowledge sources. For example, we pick an extracted keyword such as helht and use a spell check service to provide suggestions. The micro-task is then presented to crowd users with a simple option to select the suggestion they consider correct. After running an experiment with crowd users, we aggregate the answers and select the result with the highest number of votes, which also mitigates individual user bias. CrowdCorrect is offered as an open source project that is publicly available on GitHub. Both contributions are discussed further below.
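To make the micro-task step concrete, the sketch below shows how a micro-task might be assembled for one uncorrected keyword. `make_microtask` and the inline suggestion function are hypothetical stand-ins for the actual CrowdCorrect services; in the real pipeline the suggestions would come from an external knowledge source such as the Bing spell-check API.

```python
def make_microtask(keyword, suggest):
    """Build a micro-task for one keyword the automated step could not fix.

    `suggest` stands in for an external knowledge source; here it is any
    callable that returns candidate corrections for the keyword.
    """
    options = suggest(keyword)
    return {
        "keyword": keyword,
        "question": f'Which word did "{keyword}" most likely mean?',
        # An escape option lets crowd users reject all suggestions.
        "options": options + ["none of the above"],
    }

# Toy suggestion function standing in for a spell-check service call.
task = make_microtask("helht", lambda w: ["health", "helot", "heat"])
print(task["question"])
print(task["options"])
```

Each generated task is then rendered in the web interface as a multiple-choice question, so contributing requires only a single click per item.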

1.3.2 Automated Feature Extraction and Correction

We implement a set of micro-services to automatically extract features and correct raw social data. These services will extract:

  1. Lexical features - such as keywords

  2. Natural-language features - such as named entities (e.g., person, product) and part-of-speech tags (e.g., verb, noun)

  3. Time and Location features - mentions of location or time within the data

We then design and implement services that use the keywords extracted in the previous step to identify and correct misspellings, abbreviations and jargon (i.e., special words or expressions used by a profession or group that are difficult for others to understand). These services leverage external knowledge bases and services, namely Bing (for misspellings), Cortical (for synonyms) and STAND4 (for abbreviations). The result of this step is an annotated dataset which contains the cleaned and corrected raw data.
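A toy version of such a feature-extraction micro-service is sketched below. The stopword list and regular expressions are illustrative assumptions, not the thesis's actual implementation; the real services would use proper NLP tooling for named entities and part-of-speech tags.

```python
import re

# Minimal stopword list; "am"/"pm" are excluded here because they are
# consumed by the time pattern below.
STOPWORDS = {"the", "a", "an", "to", "for", "and", "of", "in", "at",
             "is", "are", "am", "pm"}

def extract_features(text):
    """Toy micro-service: lexical keywords plus naive time mentions and hashtags."""
    lowered = text.lower()
    tokens = re.findall(r"[a-zA-Z#@][\w.]*", lowered)
    keywords = [t for t in tokens
                if t not in STOPWORDS and not t.startswith(("#", "@"))]
    # Very rough clock-time pattern, e.g. "9am" or "10:30 pm".
    times = re.findall(r"\b\d{1,2}(?::\d{2})?\s?(?:am|pm)\b", lowered)
    hashtags = [t for t in tokens if t.startswith("#")]
    return {"keywords": keywords, "times": times, "hashtags": hashtags}

print(extract_features("Healht insurers given all clear at 9am #budget"))
```

Note that the misspelling healht survives extraction unchanged; it is exactly this kind of extracted keyword that the subsequent correction services operate on.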

1.3.3 Crowdsourced Correction

Further, we design and implement micro-tasks and use the wisdom of the crowd to identify and correct information items that could not be corrected in the first step. The micro-tasks are automatically generated from the extracted keywords, with possible correction suggestions sourced from the above-mentioned knowledge bases and services. Crowd users are shown a generated micro-task in a web browser and select the correct suggestion. Several rules are used to ensure maximum coverage of the feature dataset; these rules govern which feature item is picked for the “next” micro-task based on the number of answers given by existing users. The output of this step is crowd-corrected data.
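One plausible form of such coverage and aggregation rules is sketched below: serve the item with the fewest answers until a per-item quota is met, then take a majority vote. The quota value and function names are illustrative assumptions, not the exact rules used in CrowdCorrect.

```python
from collections import Counter

def next_feature(answer_log, quota=3):
    """Pick the feature item with the fewest answers that is still under quota.

    `answer_log` maps each feature item to the list of crowd answers
    collected so far; returns None once every item has met the quota.
    """
    under = {f: len(a) for f, a in answer_log.items() if len(a) < quota}
    return min(under, key=under.get) if under else None

def aggregate(answers):
    """Majority vote over crowd answers, reducing individual user bias."""
    return Counter(answers).most_common(1)[0][0]

log = {"helht": ["health", "health", "helot"], "emergcy": ["emergency"]}
print(next_feature(log))        # the item still short of answers
print(aggregate(log["helht"]))  # the winning correction
```

Serving under-answered items first spreads crowd effort evenly across the feature dataset, while the vote threshold keeps any single user's choice from dominating.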

1.4 Thesis Organization

This thesis is organized as follows. In Chapter 2, we present the background and state of the art in social data curation, crowdsourcing and crowd-sourced curation. In Chapter 3, we present the design of the CrowdCorrect pipeline. In Chapter 4, we present the implementation of the CrowdCorrect platform along with the experimental evaluation. Finally, in Chapter 5, we provide concluding remarks and discuss possible future work.

2.1 Social Media Data

Social networks or microblogging sites started appearing in the public domain as early as 2003 [100], when a site called MySpace was launched. By design, social network sites empowered their users to build relationships, communities and popularity, and to express opinions and concerns, amongst other social benefits [151, 24, 7]. This drove the popularity of social media sites [151], which continued to rise over the last decade, with sites such as Twitter, Facebook and Instagram attracting record numbers of users across the continents. Governments and organizations have also jumped on board to examine their policies [30, 80, 39, 25, 18, 28, 33, 145], develop products, gauge sentiment [93], shape marketing strategies [150, 74, 1, 157, 73] and so on. As such, information shared online nowadays is predominantly user generated content [68, 26, 27, 147, 10, 134]. Table 2.1 illustrates statistics for some of the popular sites as of 2018.

 | Twitter | Facebook | Instagram
Purpose | Micro-blogging | Social networking | Social media sharing
Active users | 328 million | 2.1 billion | 700 million
Daily data stats | 500 million tweets | 300 million photos | 95 million posts
Finer stats (per second) | 6,000 tweets | 500,000 links | 4,500 photos
Table 2.1: 2018 Stats of Popular Social Media Sites


User content generated on social network sites, along with linkage data [5] (e.g., user information, friends, followers and location), is collectively referred to as social media data. This data can be classified as open data, since it is publicly available and can be queried [75]. Data from each social channel (e.g., Twitter and Facebook) can be broken down into individual messages, more commonly known as posts or blogs; as an example, each individual post on Twitter is termed a tweet. Further, each post can contain several smaller artefacts such as text, media and hyperlinks. Figure 2.1 illustrates the content and linkage data that can be extracted from a tweet on Twitter. Sites sometimes place limits on the length of a post (e.g., 140 characters for a tweet on Twitter), making it sparse. On any given day there are millions of posts on popular social channels, covering virtually every imaginable topic, which makes them a gold mine of information.

Figure 2.1: Parts of tweet broken down.
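The split between content and linkage data can be illustrated by pulling apart a tweet payload. The JSON below is a heavily simplified, hypothetical example; a real Twitter API response contains many more fields.

```python
import json

# Hypothetical, stripped-down tweet payload (real Twitter JSON is much richer).
raw = json.loads("""{
  "text": "MRI must be cheaper #budget http://t.co/x",
  "user": {"screen_name": "jane", "followers_count": 120},
  "created_at": "2016-05-03",
  "coordinates": null
}""")

# Content: what the user actually wrote.
content = {"text": raw["text"]}

# Linkage data: who wrote it, their network size, and where from.
linkage = {"user": raw["user"]["screen_name"],
           "followers": raw["user"]["followers_count"],
           "location": raw["coordinates"]}

print(content["text"])
print(linkage["user"], linkage["followers"])
```

Note that the structured linkage fields are trivial to process, whereas the freeform `text` field carries all of the noise issues discussed in Chapter 1.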

Velocity and volume are well-known challenges in analysing social data [42], owing to the size and load of the data. This stems both from the popularity and from the round-the-clock availability of social channels on the web and mobile. In addition, collected data is often either semi-structured (e.g., the JSON format of a tweet) or unstructured (e.g., text). However, it is also the quality of the data that poses significant challenges when making sense of it. As noted by Eisenstein [63], social media contains user content which defies expectations about vocabulary, spelling, reliability and syntax. For example, in a part-of-speech tagging experiment, the accuracy of the Stanford tagger fell well below its usual level when posed with Twitter content [76]. Several studies, such as [63], have accordingly pointed out that state-of-the-art natural language processing (NLP) systems perform significantly worse on social media text. A lot of research has looked at domain-specific approaches to harnessing social data; examples include disaster management [6] and engagement for airline operators [107]. Other approaches have focussed on specific attributes of social sites (such as the like feature on Facebook and the re-tweet on Twitter), while some works have studied the popularity of the sites themselves; examples include the news propagation ability of Twitter [104], the benefits of Facebook friends [64], the recommendation system of YouTube [58] and the combination of blogging with the networking aspects of Tumblr [45]. Finally, challenges in performing social media analytics, such as data volume and quality, have been pointed out in works such as [142, 19, 20]; such works usually fall short of offering specific guidelines to solve the issues. Focussing on Twitter, there is a large body of work presenting mechanisms to capture, store, query and analyze Twitter data [78]. These works seek to understand various aspects of Twitter data, including the temporal behaviour of arriving tweets [120], user influence measurement [43], message propagation [158] and sentiment analysis of Twitter audiences [12]. The research mentioned above is related in that it focuses on social media, but it does not directly address our research problem. Our work is more closely concerned with improving the quality of social data via robust curation before it is fed into deeper analytics; the closest approaches to ours deal with noisy text [13, 144]. We further examine related work on data quality in Section 2.3.

2.2 Data Curation

In this section, we discuss the background and importance of data curation, and then look at related work and techniques for curating social media data. Understanding and analysing data is considered a vital capability for critical decision making in governments and organizations [21, 36, 22, 34, 31, 37, 8, 9, 110]. Any issues with raw data introduce significant challenges in extracting actionable intelligence. Such issues, as discussed in the previous sections, include noise (quality-related issues) and the mix of data types (from structured to unstructured), alongside the volume and velocity of the data. It is therefore important to transform the raw data into contextualized data and knowledge which analytics tools and end-users can consume. Lord et al. [109] discuss further rationales for such a transformation, around proper selection and preservation. Given its nature (as discussed earlier), social media data is an ideal candidate. Data curation is a process that takes raw data as input and produces curated, or contextualized, data and knowledge, which can then be consumed for deeper analytics [21, 119, 32, 29]. As simply put in [55], “Data curation is the active and on-going management of data through its lifecycle of interest and usefulness; curation activities enable data discovery and retrieval, maintain quality, add value, and provide for re-use over time”. The curation process thus abstracts and adds value to the data, making it useful for users engaged in analysis and data discovery. To transform raw data into contextualized data and knowledge, a curation process typically consists of a number of iterative activities, discussed in Section 2.2.1. In the past, the term curation commonly referred to the work of library and museum professionals [113]. Curators and their curation activities formed the backbone of museum and library management: the skills and knowledge of these staff added value to physical objects, providing context and history for research and learning. In a sense, data curation was a term coined to explicitly transfer the curation guidelines and techniques that museum and library professionals applied to physical objects onto data [17]. Techniques related to data curation include ETL (Extract, Transform and Load) systems, entity deduplication [47] and various other data integration systems such as schema integration [49, 125], graph modeling and processing [15, 81, 16] and data federation [48]. Such systems are distinct from curation in that curation views the transformation of raw data and the curation sub-tasks holistically [143]; the goal of such tools is not to build a scalable curation pipeline.

2.2.1 Data Curation Activities

Data curation usually consists of a pipeline of iterative activities, techniques and algorithms [36]. Figure 2.2 highlights some of the major activities within a curation pipeline. These activities include:

  1. Ingest: identification and extraction of data and knowledge, e.g., from a data source such as a database [79] or via human input [133];

  2. Cleanse: a process to improve the quality of the data, e.g., identifying and removing unwanted items [102]; cleansing improves data quality (discussed in depth in Section 2.3), while linking and enriching add value to the data;

  3. Link: a process to link data with other relevant data items, e.g., entity linking [135, 65];

  4. Enrich: the use of internal as well as external sources to enrich the data, e.g., knowledge bases such as Wikipedia [149];

  5. Merge: identifying and merging data as relevant, e.g., merging of data streams [67];

  6. Maintain: preserving and making data available as required, e.g., storing data in formats that promote re-use [132, 46].

Figure 2.2: Activities in a curation process.
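The activities above can be pictured as composable stages over a stream of posts. The sketch below is a minimal, hypothetical chaining of the six activities; each function body is a placeholder, with comments naming what a real implementation would do.

```python
# Minimal sketch: the six curation activities as functions over post dicts.
def ingest():
    # e.g., pull posts from a database or a streaming API
    return [{"text": "Healht alert in Sydney today"}]

def cleanse(posts):
    # e.g., drop empty or non-target-language posts
    return [p for p in posts if p["text"].strip()]

def link(posts):
    # e.g., entity linking against a knowledge base
    for p in posts:
        p["entities"] = ["Sydney"]
    return posts

def enrich(posts):
    # e.g., annotate with external knowledge (Wikipedia lookups)
    for p in posts:
        p["topic"] = "health"
    return posts

def merge(posts):
    # e.g., merge duplicate items arriving from multiple streams
    return posts

def maintain(posts):
    # e.g., persist in a re-usable format
    return posts

curated = maintain(merge(enrich(link(cleanse(ingest())))))
print(curated[0]["entities"], curated[0]["topic"])
```

Because each stage takes and returns the same shape of data, the activities can be iterated or reordered as the pipeline's end goal requires.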

Simple curation pipeline example. Let us consider an example from Twitter as a simple curation process. Say we want to perform analytics on English-language tweets related to today’s news in Sydney, Australia. We can ingest tweets using Twitter’s API and store them inside a relational database such as SQL Server. Then, we can cleanse by removing any tweets that are not in English, using a tool such as LingPipe; in this cleansing step, we can also automatically correct any misspellings using an off-the-shelf tool such as the Bing Spell Check API. Further, we can add value by annotating the tweets with additional information contained in the URLs inside them. Referring to Figure 2.3, by extracting the content behind the link within the sample tweet, we could annotate the tweet with additional information such as a cold “11.6c” expected in Sydney. We store the annotated tweets inside the database as a curated set and then perform analytics on them; for example, we could classify the various news items into sports, weather and politics, and rank them in order of social media popularity, i.e., number of tweets.

Figure 2.3: Example of extracting and annotation of Twitter data.
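The cleanse-and-annotate portion of this example can be sketched in a few lines. The language check and the `fetch` callable below are crude stand-ins for real tools (e.g., LingPipe for language detection and an HTTP fetch of the linked page); the URL and annotation text are invented for illustration.

```python
def is_english(text):
    # Crude ASCII heuristic standing in for a language-detection tool.
    return all(ord(c) < 128 for c in text)

def annotate_with_link_content(tweet, fetch):
    # `fetch` stands in for an HTTP call returning a summary of the page
    # behind the URL embedded in the tweet.
    if tweet.get("url"):
        tweet["annotation"] = fetch(tweet["url"])
    return tweet

tweets = [
    {"text": "Cold day ahead for Sydney", "url": "http://news.example/syd"},
    {"text": "今日は寒い", "url": None},
]
english = [t for t in tweets if is_english(t["text"])]
curated = [annotate_with_link_content(t, lambda u: "11.6c expected in Sydney")
           for t in english]
print(curated[0]["annotation"])
```

The annotated tweets would then be stored as the curated set on which classification and ranking are performed.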

2.2.2 Data Curation Approaches and Frameworks

Data curation is an umbrella term for activities that are often combined into one or several approaches. This section lists some of the popular curation approaches; further, we look at curation platforms and Application Programming Interfaces (APIs) discussed in the literature.

The curation activities discussed in the previous section are usually combined inside a curation approach depending on the end goal (e.g., analytics, content-only curation, or enrichment). These approaches, along with identified research works, are shown in Table 2.2.

Approach | Technique | References
Collaboration platforms | Curate content such as news and blogs using an online platform (e.g., Storify, Wikipedia) | [41], [156]
Curation at source | Integrate lightweight curation activities into other workflows | [56], [85]
Master data management | Create and maintain a single source of data with curation activities performed on it | [117], [114]
Crowdsourcing | Utilize the collective wisdom of crowds to perform intensive or simpler curation activities | [60], [121]
Curation at scale | End-to-end data curation pipeline | [143]
Table 2.2: Approaches to Curation.

Collaboration platforms allow content to be aggregated or extracted from various sources and curated collectively by users; platforms such as Storify are prime examples. These platforms rely on user input and users’ motivation to collaborate; in other words, due to the manual nature of the curation tasks, the process is often time consuming. The end goal here is also to develop a story from already published content (tweets and news articles), not to improve the quality of the underlying data. The master data management approach, on the other hand, focusses on creating a single source of curated data for an enterprise. This may sound ideal, but it is challenging given the amount of data (and number of sources) in today’s organisations, along with the need to create a uniform model across it; further, mining one single source of data is challenging and there are quality trade-offs [137]. Curation activities can also be integrated into other workflow activities, for example by embedding capture and curation in researchers’ working practices [86]; while such an approach may aid customisation, it can lead to bespoke, non-standard solutions and increased maintenance. Curation at scale implies building an end-to-end curation pipeline for scalability, automation and quality. Finally, data curation can be a resource-intensive and complex task, beyond the capacity of a single individual. Using crowdsourcing platforms such as Amazon Mechanical Turk, certain well-defined curation tasks such as cleansing can be outsourced to a crowd of users [70]. Although the crowd-sourced approach can effectively use the wisdom of the crowd, user participation and motivation introduce challenges such as longer completion times, user selection and biased opinions. Our research focusses on improving the quality of social media data through a robust cleansing process within an extensible curation pipeline. Our approach differs from those discussed above in that it combines the automated, scalable nature of curation at scale with crowdsourcing to achieve our goals.

Curation Platforms and APIs

Curation platforms provide a curation pipeline with a focus on all, or a particular, curation activity (e.g., linking). Below we discuss some of these platforms. DataTamer [143] is an end-to-end curation system for integrating and transforming multiple data sources into a single predefined data structure for further reuse. The system uses machine learning algorithms to inspect a data source and then automatically extract entities and perform deduplication, transformation and mapping. A human can intervene and specify transformations and mappings manually via a user interface, and an expert or domain expert then validates the data transformation. However, DataTamer is not designed to perform any quality checks on the data itself, as is the case in our research, nor is it geared towards the largely unstructured nature of social media data. Another system, ZenCrowd [59], proposes a curation process with a focus on linking entities in text to an external knowledge base. The system works by automatically extracting a limited set of features (e.g., persons, entities and organisations) from the text of an HTML page. It then uses an algorithmic matcher to extract more information about each extracted feature from the linked open data cloud. The results of the algorithmic matcher are scored using a probabilistic method, and low-scoring results are passed to a crowd task module which automatically creates a crowd task posted on a crowdsourcing platform. While the results show higher precision, there is uncertainty when entities contain the textual imperfections (e.g., misspellings and abbreviations) prevalent in social media. Several other approaches to assist a curation process have been discussed in the literature. Kurator [61] is a curation workflow for aiding the curation of scientific data, although limited to spreadsheet data. Many commercial tools, such as Dremio and Snowflake, have also sprung up and can be leveraged to build a custom curation process. In addition to the curation systems discussed above, Beheshti et al. [36] proposed a set of basic data curation APIs. These APIs are exposed as RESTful services so that they can be used by researchers and developers alike; the services cover extraction, linking and classification of raw (open) data.

2.3 Data Cleansing in Curation

Data cleansing is an important phase in the data curation process, as discussed earlier. Given the theme of our research, we now discuss the background of data cleansing and then look at issues and existing techniques for cleansing social-media-like data.

2.3.1 Data Quality

In order to better understand the data cleansing process, it is important to first understand the concept of data quality. One popular way to understand data quality is to comprehend its “fitness for use” [155]. Similar definitions exist; for example, researchers in [154] define data quality as data that is fit for use by data consumers. Also, in the data quality literature, data is associated with several dimensions that imply overall quality. For example, in one such research work, data quality is broken into dimensions such as timeliness, accuracy, completeness and consistency [106]. Researchers corroborate that accuracy is straightforward to evaluate, as it merely compares the correct value against the observed value. A further argument is that timeliness and completeness are also relatively straightforward to evaluate. Consistency is viewed as slightly more complex, as it relies on ongoing comparison of other dimensions. Other research works have added more dimensions such as interpretability and accessibility [153]. Yet another research work [154] classifies data quality into categories such as intrinsic, contextual, representational and accessibility, each having a set of dimensions.
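The dimensions above can be made concrete with simple metrics. The sketch below, using an illustrative record set and field names of our own choosing, computes completeness (fraction of non-null values) and a naive timeliness measure; accuracy and consistency would additionally require reference data and cross-field rules.

```python
from datetime import datetime

# Hypothetical record set; field names are illustrative only.
records = [
    {"text": "Budget cuts to health", "user": "alice", "created_at": "2017-05-09T10:00:00"},
    {"text": None, "user": "bob", "created_at": "2017-05-09T11:30:00"},
    {"text": "Hosp. are running short", "user": None, "created_at": "2017-05-10T09:15:00"},
]

def completeness(records, fields):
    """Fraction of field values that are present (non-null)."""
    total = len(records) * len(fields)
    filled = sum(1 for r in records for f in fields if r.get(f) is not None)
    return filled / total if total else 1.0

def timeliness(records, field, cutoff):
    """Fraction of records created on or after a cutoff timestamp."""
    fresh = sum(1 for r in records
                if datetime.fromisoformat(r[field]) >= cutoff)
    return fresh / len(records) if records else 1.0

print(completeness(records, ["text", "user", "created_at"]))  # 7/9 ~ 0.78
print(timeliness(records, "created_at", datetime(2017, 5, 10)))  # 1/3
```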

Figure 2.4: Dimensions for Data Quality

Data with poor quality is also referred to as dirty or bad data [154]. The impacts of poor quality data are serious. For businesses, it can have negative consequences such as increased costs and inaccurate decisions, leading to unsatisfied customers. A recent study [14] has shown that bad quality data can have not only economic but also social consequences for organisations. As per estimates discussed in [96], around 25% of data in organisations is dirty. This includes both structured data (for example, an entity stored in a relational database) and unstructured data (for example, text and emails). Data quality in largely unstructured, online user-generated content, as on social media sites (e.g., Twitter), is even worse. As noted earlier, Eisenstein et al. [63] view social media content as one that defies expectations about vocabulary, spelling, reliability and syntax. We discussed common problems with social media data earlier, which include slang, non-standard text and grammar. Such linguistic noise is often coupled with brevity (140 characters for a tweet), making the quality of the data poor. In the next section, we look at data cleansing, an activity to improve data quality.

2.3.2 Data Cleansing

Data cleansing is a vital task to improve the quality, and thus usefulness, of data. From a process perspective, data cleansing is defined as the entirety of operations performed on existing data to remove anomalies, such that the resulting set of data is an accurate representation of the mini-world [116]. An anomaly is typically a data value which has an incorrect representation [116]. Common anomalies are inaccurate, duplicated or incomplete pieces of data. Anomalies usually arise from erroneous inputs or measurements while collecting, entering or maintaining data. In simpler terms, cleansing makes the data fit for use by removing uncertainties in the data.
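As a minimal illustration of anomaly detection, the following sketch flags incomplete records (missing required fields) and exact duplicates in a hypothetical record set; real cleansing pipelines use far richer rules and similarity matching.

```python
def find_anomalies(records, required_fields):
    """Flag incomplete records (missing required fields) and exact duplicates."""
    seen = set()
    incomplete, duplicates = [], []
    for i, r in enumerate(records):
        if any(r.get(f) in (None, "") for f in required_fields):
            incomplete.append(i)
        key = tuple(sorted(r.items()))  # hashable fingerprint of the record
        if key in seen:
            duplicates.append(i)
        seen.add(key)
    return incomplete, duplicates

records = [
    {"id": 1, "text": "good data"},
    {"id": 2, "text": ""},            # incomplete: empty text
    {"id": 1, "text": "good data"},   # exact duplicate of the first record
]
print(find_anomalies(records, ["id", "text"]))  # ([1], [2])
```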

2.4 Social Media Data Curation

In social media curation, the focus is to transform raw data (unstructured or semi-structured) into curated data. Key challenges include: ingestion of continuously flowing social data; cleansing of the data due to its noisy nature, as discussed earlier; and linking and enriching the data for a given context. Past research has highlighted the need for curating social media data, as pointed out by Duh et al. in [62], given social media’s wide reach and acceptance. At the same time, challenges in data collection, preparation and analysis have also been widely articulated [4]. Mining social media sites relies on parts of curation techniques for specific purposes. For example, in Twitter-related research, work has been done to understand the emotions in a tweet [128], identify mentions of a drug in a tweet [77] and detect political opinions in tweets [111]. Another study highlighted the need for curating tweets but did not provide a framework or methodology to generate the contextualized version of a tweet [62]. One of the more closely related works is lexical normalisation of social media text [82]. The proposed approach uses a classifier to target out-of-vocabulary words and normalises lexically similar words. This works well in isolation for a subset of noisy text issues, but without an aim to contextualise the data for analytics.

2.5 Crowdsourcing for Curation

While technology continues to evolve at a rapid pace, there are many problems where human intelligence and interpretation are more effective. As an example, consider a simple task of tagging images with types of animals such as dog, cat or horse. This is relatively hard for a computer program, given the varied physical characteristics of animals, yet relatively easy for humans. Employing a paid workforce to accomplish such tasks is both time-consuming and expensive. With the rapid advancement and scalability of Web technologies, outsourcing tasks to a crowd of users has become popular [40]. This has led to keen interest in the research community in areas such as crowdsourcing concepts, effectiveness, crowd selection, crowd motivation, task design and practical use cases.

2.5.1 Crowdsourcing Concepts

The term crowdsourcing was coined in 2006 in [89] as “taking a function once performed by employees and outsourcing it to an undefined (and generally large) network of people in the form of an open call”. The underlying principle, asserted in [146], is that the collective wisdom of a large group of people produces superior results to those of an individual. Today’s growth in Web and mobile technologies has created an atmosphere in which distributed, large groups of people can be tapped at scale [97]. This has also rendered the original definition of crowdsourcing in [89] rather obsolete, with crowdsourcing campaigns maturing to target specific crowds, the availability of crowdsourcing APIs (application programming interfaces) and combinations of machine and human inputs [98]. Popular examples of crowdsourcing are Wikipedia and Threadless. Wikipedia allows volunteers around the world to create and edit content. Threadless allows a community of users to select and create t-shirt designs for an incentive. Further, platforms such as Amazon Mechanical Turk and UpWork have allowed organisations to rapidly create and deploy crowd tasks at scale. Another crowdsourcing platform, Figure Eight (formerly known as CrowdFlower), provides the ability to annotate unstructured data with crowd judgements for use as training data for machine learning programs. Several research studies have applied crowdsourcing to solve problems such as assembling dictionaries [105], outer space mapping [112] and aiding in disaster relief situations [160, 130]. Before any form of crowdsourcing can take place, both the problem and the anticipated crowd inputs must be clearly defined. Several studies [122, 123, 98] have categorized content or inputs from the crowd as either objective or subjective, and contributions or responses from users as either aggregated or filtered. An illustration from [122] is shown in Figure 2.5. The four types of crowdsourcing are shown with an example of each.
Idea and solution crowdsourcing contributions can be termed macro-tasking, whereas crowd-voting and micro-tasks can be thought of as micro-tasking due to their level of granularity [72]. Responses from micro-task-based crowdsourcing are usually aggregated, e.g., aggregating the total votes from a poll. On the other hand, contributions from macro-task-based crowdsourcing are usually selected or filtered as required.
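Aggregation of micro-task responses can be as simple as a majority vote. The sketch below (our own illustration, not tied to any particular platform) returns the winning answer together with its level of agreement, so that low-consensus results can be filtered out downstream.

```python
from collections import Counter

def aggregate_votes(responses):
    """Majority vote over crowd answers for a single micro-task.

    Returns the winning answer and its share of the vote, so that
    low-agreement results can be filtered out downstream.
    """
    counts = Counter(responses)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(responses)

# e.g., five workers classify the keyword "Hosp." from a tweet
answer, agreement = aggregate_votes(
    ["abbreviation", "abbreviation", "misspelling", "abbreviation", "none"])
print(answer, agreement)  # abbreviation 0.6
```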

Figure 2.5: Illustration of crowdsourcing types

2.5.2 Crowd Participation and Crowdsourcing Effectiveness

The success of a crowdsourcing campaign depends on the performance of the crowd. Several recent studies such as [72] have highlighted that crowd users can be slow, give wrong answers or opinions, and even use the platform to spam without doing any work. There is a trade-off with uncertainty when dealing with contributions from the crowd. These studies propose proper worker evaluation techniques [72] or consensus computation [90], such as aggregating results, to remove undesired contributions. Further, they highlight the importance of a proper technical infrastructure setup (for example, an easily accessible web-based tool) and of task design, so as to improve crowd efficiency and accuracy. Researchers have also looked closely at what makes a crowd tick, to understand the effectiveness of crowdsourcing. Relying on a pool of potentially unknown crowd workers, rather than trusted employees, can be a double-edged sword. Such studies have brought forth how incentives [95, 129], gaining social capital [124], game mechanics [54] and the general public good [103] play a part in motivating a crowd user.

2.5.3 Micro-task Design

Our research is closely tied to creating micro-tasks for social media curation; therefore it is important to discuss studies pertaining to micro-task design. As discussed earlier, micro-tasks usually have a low level of granularity, with contributions that need to be aggregated for better results. A recent study by Gadiraju et al. [71] broke micro-tasks down into several types: information finding, verification and validation, interpretation and analysis, surveys, and content access. Information finding tasks delegate the process of searching to a crowd of users, for example, “find a hospital in West London”. Verification and validation (or moderation) tasks require a crowd user to validate a piece of information, such as “Is Italy a country?”. In interpretation tasks, crowd workers are often asked to use their mental skills, for example “Choose the best colour for Father’s Day”. Content access tasks require crowd users to view or access content such as an advertisement. The choice of task type for a problem has a direct implication for the accuracy of results. Since micro-tasks are often undertaken by non-experts, they need to be simple to process both mentally and logistically. They should not be too time-consuming, nor should they require a high degree of expertise or too much introductory training [51]. Several research studies [72, 131] have proposed guidelines for defining the input statement or problem, determining the task type, designing the task interface and finding workers for best results. For interface design, the guidelines point to designing simple tasks with clear, short instructions to attract workers and reduce human errors. At the same time, however, they note that interface design for crowdsourcing remains something of a dark art, and that much more research is required to understand its impact on crowd performance and accuracy.

2.5.4 Crowdsourcing for Curating Social Media Data

Crowdsourcing has been used to curate social media posts themselves. For example, tools such as Storify and Curated.by allow users to collect and curate tweets into stories, making them easier to read. Due to its inherent scalability, crowdsourcing can leverage the collective wisdom of a group of people for many data processing and curation tasks [72]. Tasks that are fairly easy for humans but beyond the current limitations of machine learning algorithms and computer programs are also candidates for crowdsourcing [152]. For example, crowdsourcing has been studied and quantified for extraction [87], collection of data [136], data cleansing and assessment [148, 50, 2], entity resolution [152] and enrichment [57]. Such work relies on designing a crowd-facing tool or interface, gathering contributions or answers from a crowd of users, and then aggregating the results. To our knowledge, most of the research highlighted above deals with curating and cleansing data that is essentially structured (for example, in a relational database) or not related to social media. The sparse, unstructured and noisy form of social media data, coupled with limitations in natural language processing, makes crowdsourcing an attractive proposition. There has been limited research into using crowdsourcing for curating social media data. One such work, CrisisTracker [130], extracts tweets from Twitter in real time during a natural disaster. It then automatically detects localised events or stories based on clustering of tweet data. Finally, the system uses a crowd of users to curate stories by ranking them. Here crowdsourcing is used for the limited purpose of ranking.

2.6 Summary and Discussion

Analytics of social media data is quite important and can be a vital priority and asset for organisations and governments. This has been driven by the large footprints of social media channels such as Twitter. As we discussed in Section 2.1, there has been prior research work on various facets of social media, such as: (i) Domain specific, such as disaster management using Twitter; (ii) Feature specific, such as the re-tweet feature in Twitter or the like in Facebook; (iii) Popularity of the social media sites; (iv) High-level analytics that visualise trends in social media, such as popular topics. One of the biggest challenges in performing accurate analytics of social media data is poor data quality (Section 2.3.1), due to the non-standardization of user inputs. These quality issues, discussed in Section 2.1, include misspellings, grammatical errors and the use of slang. Further, we saw how terms and words can imply different meanings when used in different contexts. Such linguistic problems of social media data often introduce additional challenges in computational analysis [3]. Data curation (Section 2.2) helps in transforming the raw data into contextualised data and knowledge, which can then be used for deeper analytics. There are several known approaches to curating data, and prior research work has also proposed curation platforms or services (discussed in Section 2.2.2). The key approaches are: (i) Collaboration platforms, which help collectively curate content from various sources; (ii) Master data management, where the purpose is to build one single model of data; (iii) Curation at source, where curation is viewed as part of another larger task; (iv) Curation at scale, where the aim is to build an extensible end-to-end pipeline; and (v) Crowdsourcing, which uses inputs from crowd users to help curate data. In addition to the above, several platforms such as DataTamer [143] have been proposed, which have limitations in dealing with low-quality data.
Further, we discussed (in Section 2.5) the background and benefits of crowdsourcing, along with important issues to address such as task design and user motivation. Many popular examples of and platforms for crowdsourcing were illustrated. Following on, we discussed some applications where crowdsourcing has been used to curate social media data in Section 2.5.4.

Challenges and Recommendations

We have acknowledged that we need to transform raw data into contextualized data and knowledge in order to carry out robust analytics. Our key motivation is to help improve the underlying data quality, which is low in social media channels. As an example, consider a simple curation pipeline to ingest, cleanse and enrich a corpus of a few thousand tweets. To our knowledge, there are no existing off-the-shelf solutions that cater to this problem. There are, however, several individual components such as a spell checker. Further, many existing curation approaches cater to structured data and are not geared towards the unstructured nature of social media. The various curation approaches proposed in the literature are usually intended for specific purposes. We have not come across an approach that combines automated curation with the power of crowd-based curation in a single pipeline. Our research focuses on improving the quality of social media data through a robust cleansing process within an extensible curation pipeline. Our approach is different in that we combine the automated and scalable nature of curation at scale with crowdsourcing to achieve our goals.

3.1 Introduction

Data cleansing or correction aims to improve the quality of data by removing errors and inconsistencies [126]. As discussed in previous chapters, in the context of social media this is challenging due to the heavy usage of slang, abbreviations and acronyms. Data cleansing forms an integral part of the data curation activity. Before cleansing, raw data needs to be selected and ingested, and key features (e.g., keywords) need to be extracted within a curation pipeline. We discuss techniques in the literature related to social media data cleansing, followed by our approach.

3.1.1 Related Work

As a response to such unusual style and syntactical error prone nature of social media data text; the research community has looked at two major approaches namely, normalization and domain adaptation or contextualization [63]. Normalization approaches tend to find and replace non-standard words or terms with contextually correct ones. In other words, the idea is to fix or fit the data such that analytics tools can consume. A familiar example is of spelling checker algorithms [88]

; which uses pattern matching and n-gram analysis to correct words. For example, the tweet

“njoying at a bday” is normalised to “enjoying at a birthday”. Other examples of such approaches are machine translation [11], Twitter pre-processing approaches [52] and noisy channel models [53]. Contextualization techniques works in reverse, i.e., making the tools smarter to adapt to bad data. These techniques apply nature language processing (NLP) algorithms like part-of-speech tagging [76, 118]

and named entity recognition 

[69] to label and train the cleansing process. Essentially such approaches stem from a closely related field of noisy text analytics. The closest work in this category to our approach is the noisy-text project111 This research work is close in that it deals with quality issues in text in general; but does not look at wider issues in social media data. Despite such work, curation and cleansing of social media text remains a challenge. The cleansing of social media data goes beyond a simple spell correction. The range of problems presented with out-of-vocabulary words, abbreviations, slangs, inconsistent grammar and the use of emoticons; make automated normalization or contextualization difficult if not impossible [52]. Normalization assumes that there is some direct mapping from out-of-vocabulary words to normal words. This can be misleading for social media data. For example, do we normalize the abbreviated slang word pat. to patient or something else. Further, some words such as howdy have no direct mapping in English. Also, incorrect normalization can also result in semantic ambiguity [63]. For example how do we normalize the Twitter post, howdy baby. Automated contextualization using parts-of-speech tagging also has many limitations. For example, Twitter data is composed of so many different styles and slangs with a lot of exceptions. The inherent presence of other non-standard textual items such as hashtags make tagging or named entity recognition difficult. To sum up, Einsentein et al. [63] illustrates that state-of-art Natural Language Processing (NLP) systems perform significantly worse on social media text. Crowdsourcing has shown potential in problems which are relatively easier to solve for humans such as image labeling [84] or annotationg parts of text [69]. Also, crowdsourcing can be leveraged to accomplish tasks on a global scale by rapidly mobilising large number of people [101]. 
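A minimal dictionary-based normaliser illustrates both the idea and its limits: the toy lexicon below is our own, and any out-of-vocabulary token not in it (or an ambiguous one like “pat.”) simply passes through unchanged.

```python
import re

# Toy normalisation lexicon; real systems derive such mappings from
# annotated corpora or string-similarity models.
LEXICON = {"njoying": "enjoying", "bday": "birthday", "u": "you", "gr8": "great"}

def normalise(tweet):
    """Replace known out-of-vocabulary tokens, leaving hashtags and mentions intact."""
    def fix(match):
        token = match.group(0)
        return LEXICON.get(token.lower(), token)
    # only touch plain word tokens, not #hashtags or @mentions
    return re.sub(r"(?<![#@\w])\w+", fix, tweet)

print(normalise("njoying at a bday"))  # enjoying at a birthday
```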
As an example, anyone with internet access can perform micro-tasks using platforms such as Amazon Mechanical Turk or Figure Eight. Social media services such as Twitter also support publishing simple tasks using Twitter Polls. One could also put together a simple web-based interface and share micro-tasks with friends, colleagues or anyone else. Crowdsourcing has already been used for collection of data [136], data cleansing and assessment [148, 50, 2], entity resolution [152] and enrichment [57].

3.1.2 CrowdCorrect

In order to address the challenges discussed above, we combine automated approaches with crowdsourcing approaches in an extensible curation and cleansing pipeline, CrowdCorrect. Our rationale is that, given a clear and well-defined task, a human should be able to identify and correct issues such as slang words relatively easily. As such, the cleansing of social media text can benefit from crowd-based approaches alongside automated ones. More specifically, this chapter discusses the three phases that form the pipeline: (i) Pre-processing: ingestion and extraction techniques that leverage off-the-shelf tools and APIs to ingest raw data and extract features (e.g., keywords, named entities) from social media data; (ii) Automated curation: extraction and correction techniques that leverage external knowledge bases and services to automatically correct features in social media data; (iii) Crowdsourced curation: correction techniques that use a crowd of users to identify and correct features which failed in the earlier step. An overview of our curation pipeline, which enables analysts to cleanse and curate social data and prepare it for social media analytics, is illustrated in Figure 3.1. There are three steps: pre-processing (ingest and extract), automated correction and crowd-sourced correction. As an example, tweets are presented as raw inputs. The following sections discuss our contributions, as shown in the illustration, in detail. We use and describe examples from Twitter, as that is the social media site we use for research purposes.

Figure 3.1: CrowdCorrect Curation Pipeline.

3.2 Pre-processing : Ingestion and Extraction

This section presents an architectural overview of the ingestion of raw data and the extraction of features from it. First, we develop services to ingest data from social media channels such as Twitter (refer to Section 3.2.1). Ingestion takes the data and makes it available within our data store. Then, we extract features (e.g., keywords) using off-the-shelf extraction micro-services developed previously within our research group. These services are outlined in the research paper by Beheshti et al. [36] and illustrated in Figure 3.2, using a tweet from Twitter as an example.

Figure 3.2: An example from Twitter: Extraction services.

3.2.1 Ingestion Service

We implemented a set of micro-services (for Twitter) to obtain and persist data for further use within a data lake, CoreDB [19]. This enables us to deal with the dynamism of data arrivals and with large sets of social media data. We then define a schema for information items and persist them in MongoDB (a data island in our data lake) in JSON format. JSON is a popular and simple-to-parse text format for data interchange.

Figure 3.3: A tweet ingested from Twitter.

Each tweet within Twitter contains several attributes; refer to Figure 3.3 for an example. Some of the important attributes ingested from a tweet are:

  1. Text - Text within a tweet;

  2. Hashtags - List of hashtags within the tweet e.g., #JazzFit;

  3. Links - List of links (URLs) mentioned within the tweet;

  4. User - Name and other details of the user e.g., UtahJazz;

  5. Geo - Location from where the tweet was posted e.g., Utah, USA.
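The mapping from a raw tweet payload onto the attributes above can be sketched as follows. The source field paths (entities.hashtags, user.screen_name, etc.) follow the Twitter v1.1 JSON payload; the target schema names are our own choice, and the resulting document is what would be persisted to the MongoDB data island.

```python
import json

def to_schema(raw_tweet):
    """Map a raw Twitter API payload onto our ingestion schema."""
    entities = raw_tweet.get("entities", {})
    place = raw_tweet.get("place") or {}
    return {
        "text": raw_tweet.get("text"),
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "links": [u["expanded_url"] for u in entities.get("urls", [])],
        "user": raw_tweet.get("user", {}).get("screen_name"),
        "geo": place.get("full_name"),
    }

raw = {
    "text": "Great game! #JazzFit",
    "entities": {"hashtags": [{"text": "JazzFit"}], "urls": []},
    "user": {"screen_name": "UtahJazz"},
    "place": {"full_name": "Utah, USA"},
}
doc = to_schema(raw)
print(json.dumps(doc))  # ready to insert into the MongoDB data island
```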

After the ingestion process, the raw tweets in JSON format are available for further use. An example tweet stored in JSON format is illustrated in Figure 3.4.

Figure 3.4: Tweet stored in JSON format.

3.2.2 Extraction Service

Next in our curation pipeline, we design and implement services to extract items from the raw data. These items consist of features which are of value for deriving meaningful inferences. These features include:

  1. Lexical features: Words that form part of the vocabulary of a language. This includes keywords, misspellings, abbreviations and slang. For example, from the tweet in Figure 3.3, tournament would be extracted as a keyword.

  2. Natural language features: Words that can be extracted through analysis of natural language, such as named entities (e.g., person name, organisation, product) and parts of speech (e.g., noun and verb). For example, from the tweet in Figure 3.3, Men would be extracted as a noun.

  3. Time and location features: Mentions of time and location in a social media post, i.e., a tweet. For example, in Twitter, a tweet may contain the location of posting. From the tweet in Figure 3.3, “14 June 19” would be extracted as a date.

To sum up, we perform data curation feature engineering by identifying variables that encode information for analytics. We extract these variables for cleansing and curation further down the pipeline. The extracted features are stored in a featureDB using the Microsoft SQL Server database engine.
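A toy extractor conveys the shape of this feature set. The patterns below are illustrative stand-ins for the off-the-shelf extraction micro-services, which use proper NLP for named entities and parts of speech rather than regular expressions.

```python
import re

def extract_features(text):
    """A minimal lexical/time feature extractor over a tweet's text."""
    return {
        # plain word tokens of 4+ letters, skipping #hashtags and @mentions
        "keywords": [w.lower() for w in re.findall(r"(?<![#@\w])[A-Za-z]{4,}\b", text)],
        "hashtags": re.findall(r"#\w+", text),
        "mentions": re.findall(r"@\w+", text),
        # crude "day Month year" date pattern
        "dates": re.findall(
            r"\b\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\w* \d{2,4}\b",
            text),
    }

feats = extract_features("Men's tournament starts 14 June 19 #JazzFit @UtahJazz")
print(feats["keywords"])  # ['tournament', 'starts', 'june']
print(feats["dates"])     # ['14 June 19']
```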

3.3 Automated Curation : Extraction and Correction

This section presents the architectural overview of the automated correction step of the CrowdCorrect pipeline. In this step, we leverage external knowledge sources and services to automatically correct the data. It is important to note that we perform curation and correction of the extracted features: variables that encode information and help derive meaningful inferences. We term this data curation feature engineering. Examples of features extracted from tweet text are keywords, named entities and so on. This is discussed further in the following section.

3.3.1 Automated Correction Services

Once the extracted features are available, we implement services to automatically identify and correct them. We focus on correcting misspellings, jargon (i.e., special words or expressions used in a professional context, which can be difficult to understand) and abbreviations using external knowledge sources, available as services, as illustrated in Figure 3.5.

Figure 3.5: List of external knowledge sources for each feature.

This automated correction step works as follows. It submits each extracted feature to each of the three external services (shown in Figure 3.5). The services return matching words with scores. For example, for the misspelled word “healht”, the Microsoft Cognitive Services API returns the word “health” with a score of 1. Similarly, the abbreviations API and jargon API return likely matches with scores. The automated correction step outputs cleansed and corrected raw data in an annotated dataset format.
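The step above can be sketched as follows. The lookup tables stand in for the external services (the real ones are remote APIs returning scored matches), and the score threshold is our own assumption; features with no confident match fall through to the crowd curation step.

```python
# Stand-ins for the external correction services; the lookup tables and
# scores here are illustrative, not the real APIs.
SPELLING = {"healht": ("health", 1.0)}
ABBREVIATIONS = {"hosp.": ("hospital", 0.9)}
JARGON = {"triage": ("priority assessment of patients", 0.8)}

SCORE_THRESHOLD = 0.7  # assumed cut-off; low scores go to the crowd step

def auto_correct(feature):
    """Query each service in turn; accept the best match above the threshold."""
    candidates = []
    for service in (SPELLING, ABBREVIATIONS, JARGON):
        if feature.lower() in service:
            candidates.append(service[feature.lower()])
    if not candidates:
        return feature, None  # unchanged; candidate for crowd correction
    best, score = max(candidates, key=lambda c: c[1])
    return (best, score) if score >= SCORE_THRESHOLD else (feature, None)

print(auto_correct("healht"))  # ('health', 1.0)
print(auto_correct("Hosp."))   # ('hospital', 0.9)
print(auto_correct("brb"))     # ('brb', None) -> routed to crowd curation
```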

3.4 Crowd Curation : Correction Tasks

In this step, we design a simple web interface that lets users in the crowd correct items which were not corrected in the previous step. To achieve this goal, we design two micro-tasks, namely suggestion and correction micro-tasks. Both tasks are automatically generated and presented in a web interface. To automatically generate a micro-task, we designed a heuristic which determines what task to present to the user. Our goal is a hybrid combination of crowd workers and automated techniques, such that we can build collective intelligence. The design of the two types of micro-tasks is illustrated in the next sub-section.

3.4.1 Crowd Tasks Generation

The core of the crowd-sourced correction relies on the crowd tasks generator service. This service accesses the annotated dataset and builds simple micro-tasks in a multiple-choice question-and-answer format. The two key types of tasks generated are:

  1. Suggestion - A micro-task where the user selects whether the presented feature within a tweet is a jargon, abbreviation or misspelling.

  2. Correction - A micro-task where the user selects the possible match for a presented tweet, keyword (feature) and issue (jargon, abbreviation or misspelling).

Possible answers for the suggestion tasks are jargon, abbreviation, misspelling or none. For example, we present a tweet: “Hosp. are running short on trained doctors”. Along with the tweet, we present the crowd user with the question of whether the keyword “Hosp.” is a jargon, abbreviation or misspelling. A user can select their answer by clicking a radio button on the web page. We also include the option of selecting none for cases where there are no issues. Similarly, possible answers for correction tasks are sourced from the external knowledge sources used earlier for misspellings, jargon and abbreviations. For example, from the earlier suggestion question for “Hosp. are running short on trained doctors”, we would have verified “Hosp.” to be an abbreviation. Then, in a correction question, we ask the user to provide its full form. We present a range of options sourced from the external knowledge sources (the abbreviations API, in this case). In addition, we allow a user to type in an answer in a free-text field, if they desire to do so. The correct answer in this example would be “hospital”. The interfaces for these tasks are illustrated further in the next chapter.
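The generation of the two micro-task types can be sketched as follows; the dictionary structure and option wording are our own illustration, not the exact interface described in the next chapter.

```python
def make_suggestion_task(tweet, feature):
    """Build a multiple-choice suggestion micro-task for one extracted feature."""
    return {
        "type": "suggestion",
        "tweet": tweet,
        "question": f'Is the keyword "{feature}" a jargon, abbreviation or misspelling?',
        "options": ["jargon", "abbreviation", "misspelling", "none"],
    }

def make_correction_task(tweet, feature, issue, candidates):
    """Build a correction micro-task; candidates come from the external services."""
    return {
        "type": "correction",
        "tweet": tweet,
        "question": f'Select the correct form of "{feature}" ({issue}):',
        "options": candidates + ["other (free text)"],
    }

tweet = "Hosp. are running short on trained doctors"
task = make_suggestion_task(tweet, "Hosp.")
print(task["options"])
follow_up = make_correction_task(tweet, "Hosp.", "abbreviation",
                                 ["hospital", "hospice"])
print(follow_up["options"])  # ['hospital', 'hospice', 'other (free text)']
```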

Suggestion Micro-tasks
Figure 3.6: Algorithm for automatically generating suggestions micro-tasks.

We design and implement an algorithm to present a tweet with an extracted feature (e.g., a keyword) and ask the crowd user whether the extracted feature can be considered a misspelling, jargon or abbreviation. An illustration is shown in Figure 3.6.

Correction Micro-tasks

We design and implement a correction algorithm for users to select the correct form of a feature. For example, if a tweet’s feature (keyword) is identified as an abbreviation, we automatically generate correction matches and present them to the user to select the most appropriate. The automatic generation of correction matches relies on the external knowledge sources (services) mentioned earlier. An illustration of a correction micro-task is presented in Figure 3.7.

Figure 3.7: Algorithm for automatically generating correction micro-tasks.

3.5 Conclusion

In this chapter, we discussed the challenges in analysing raw social data. In particular, we looked at issues that arise in social media data due to the unusual syntax and style of its text. We looked at normalization and contextualization techniques to improve the quality of text data. However, due to the range of problems encountered, such as out-of-vocabulary words, abbreviations and slang, cleansing and curation of social media text remains a challenge. We address this challenge by proposing CrowdCorrect, an extensible curation pipeline. CrowdCorrect performs automated curation of features, followed by a crowd-sourced approach to correct features which failed in the automated step.

4.1 Introduction

Social network sites by design empower their users to express and share their ideas, thoughts and opinions to a wider audience. This has lead to an exponential rise in popularity of the social media sites such as Twitter and Facebook [151]. The data within these social channels natively captures the beats of the masses [141]. This has opened up new opportunities for deeper understanding of several aspects such as trends, opinions and influential actors. This social media data provides valuable insights to aid decision making in diverse areas such as marketing, public policy and healthcare. Therefore, analytics of social media data is considered as vital and strategic priority for organisations and government. Raw data from social platforms is generally semi-structured and noisy [139][63]. Such noise can include misspellings, slang words, abbreviations, truncations, incorrect syntax and grammatical errors. To sum up, the quality of the raw social data is low [92]; which introduce linguistic challenges in algorithms for analytics and can lead to inaccurate analysis [3]. Therefore, there is a need to transform raw data into contextualized data and knowledge. This transformation process is referred to as data curation and cleansing forms an integral part of it. Next, we look at our proposed cleansing and curation pipeline for social media data, namely CrowdCorrect. CrowdCorrect is an extensible social media data cleansing and curation pipeline. The key focus for the pipeline is to cleanse raw social data; using both automated and crowd-sourced techniques. The pipeline consists of set of micro-services that also leverage external knowledge sources and services. An illustration of this pipeline was presented in Chapter 3 earlier. The micro-services are broken down into three activities namely pre-processing, automatic correction and crowd sourced correction. 
The key motivation for the development of CrowdCorrect is the low quality of raw data on social sites such as Twitter. The quality challenges posed by raw social data, and the difficulty automated techniques face with it, led us to leverage crowd-sourcing approaches. In order to understand these challenges and evaluate CrowdCorrect, we present a motivating scenario in the next section. We then discuss the implementation of CrowdCorrect and experimentation using the motivating use case.

4.2 Motivating Scenario

In order to evaluate our CrowdCorrect pipeline, we looked for potential use cases within social media channels. We settled on a use case containing a corpus of tweets from Twitter around the budget announcement by the Australian Government. The key criterion for selecting a use case was the presence of a large number of issues, in particular the use of slang words (jargon and abbreviations) and misspellings. We also considered an analytics task related to “understanding the Government’s Budget in the context of Urban Social Issues”. A typical government budget denotes how policy objectives are reconciled and implemented across various categories and programs. Figure 4.1(A) shows the overall budget categories for the 2017 Federal budget.

Figure 4.1: An example of budget categories (A) and associated programs (B)

Budget categories (e.g., Health, Social Services, Transport and Employment) are broken down into a hierarchical set of programs (e.g., Medicare Benefits in Health, and Aged Care in Social Services). These programs refer to sets of activities or services that meet specific policy objectives of the government [99]. Examples of such programs are shown in Figure 4.1(B). Social media channels are abuzz with reactions to the government’s budget announcements. To accurately gauge public opinion on the various budget-related programs, an analyst would first classify the social media feed into the corresponding categories. This is difficult using traditionally adopted budget systems, which in turn makes it hard to accurately evaluate the government’s service requirements and performance.

4.3 Methodology

As the first step, we analyze the different budget categories from our selected use case. Since there are many categories (e.g., health, defence and social welfare), for the purpose of cleansing and curation we picked the “health” category; that is, identifying and curating health-related tweets within a corpus of budget-related tweets. The key reason for picking health was also tied to the various key government spending initiatives regarding Medicare and hospital treatments in the budget announcement. These health-related initiatives are a constant source of debate in the popular media, leading to increased public attention and scrutiny [138][140]. There are a number of issues in social media data, some of which we highlighted in Figure 1.1. We selected three classes of textual issues to use as a basis for cleansing and curating tweets: jargon, abbreviations and misspellings. Jargon and abbreviations are forms of slang widely used on the Internet. Together, these three classes are the most common types of issues found in social media channels such as Twitter [159][115][108]. The key challenge for an analyst in such a scenario is to cleanse and correctly classify tweets into the health category and its associated government programs. As discussed in the following sections, we designed and implemented a set of micro-services and a user interface, which collectively form parts of the CrowdCorrect pipeline (Figure 3.1); this is discussed in Section 4.4 in more detail. The micro-services cover ingestion, automated cleansing and curation, and crowdsourced cleansing and curation. Finally, to evaluate our pipeline, we designed an experiment to automatically correct tweets and then engage crowd users to help us further cleanse and curate the tweets, as discussed in Section 4.5.

4.4 Implementation

Implementation consists of building a pipeline of activities: (i) ingestion of raw tweets; (ii) extraction of keywords (features); (iii) micro-services to automatically cleanse the tweets; (iv) building a crowd-facing front-end tool and related services; and (v) preparation of the curated dataset. After these steps, machine learning algorithms may be used to classify each tweet into a specific budget category. As we discuss in subsequent sections, the pipeline persists ingested social media data using MongoDB for the raw data and SQL Server for the curated result set. Micro-services for automatic and crowd cleansing were developed using Microsoft’s .NET framework. In addition, we leverage existing services [35] for data ingestion and extraction of features from raw data. We discuss each step in detail in the following sections.
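To make the flow of these five activities concrete, the following sketch chains them as plain functions over a toy tweet. All function names and the dictionary-based corrections are illustrative stand-ins for the actual micro-services, not our implementation:

```python
def ingest(raw_tweets):
    """Stage (i): persist raw tweets (here simply wrapped as documents)."""
    return [{"text": t} for t in raw_tweets]

def extract_keywords(tweet):
    """Stage (ii): naive keyword extraction standing in for the real service."""
    stop = {"my", "the", "on", "a", "of", "like", "wont"}
    words = [w.strip("#").lower().replace("'", "") for w in tweet["text"].split()]
    return [w for w in words if w.isalpha() and w not in stop]

def auto_correct(keywords, corrections):
    """Stage (iii): automated corrections (in the pipeline: external services)."""
    return [corrections.get(k, k) for k in keywords]

def crowd_correct(keywords, crowd_answers):
    """Stage (iv): crowd-sourced fixes for keywords the automated step missed."""
    return [crowd_answers.get(k, k) for k in keywords]

def run_pipeline(raw_tweets, corrections, crowd_answers):
    """Stage (v): produce the curated dataset."""
    curated = []
    for tweet in ingest(raw_tweets):
        kws = extract_keywords(tweet)
        kws = auto_correct(kws, corrections)
        kws = crowd_correct(kws, crowd_answers)
        curated.append({"text": tweet["text"], "keywords": kws})
    return curated

curated = run_pipeline(
    ["My cardio won't like the govt plan #ausbudget"],
    corrections={"govt": "government"},        # automated dictionary hit
    crowd_answers={"cardio": "cardiologist"},  # resolved by the crowd
)
print(curated[0]["keywords"])
```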

4.4.1 Data Ingestion and Extraction of Features

At this initial step, we import social data using micro-services and store it inside MongoDB in JSON format. In the budget scenario, the Treasurer announced the budget on Tuesday 3 May, 2016. We collected all tweets from one month before to two months after the budget announcement. This comprised about 15 million raw tweets, which were persisted and indexed inside our MongoDB data store. The key fields within the persisted tweets for us are text and hashtags (Section 3.2.1), as we build upon them to perform the cleansing task. We then perform feature extraction using existing open source services [35]; in particular, we leverage the keyword extraction service. Keywords are words of great importance or value; from a scientific perspective, keywords help to filter and index data. In essence, we cleanse and curate the keywords within the text, as they are the candidate inputs for machine-learning classifiers that assign items to appropriate classes. Table 4.1 shows an example of a tweet (with a misspelling and an abbreviation) and the keywords extracted using our services. At the end of this step, we have extracted all the keywords from the raw tweets.
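As an illustration, a raw tweet arrives as a JSON document from which only a few fields matter for cleansing. The sketch below extracts the text and hashtags fields (field names follow Twitter’s standard v1.1 payload; the MongoDB call is left as a comment so the snippet stays self-contained, and the id value is made up):

```python
import json

# A simplified raw tweet as delivered by the Twitter v1.1 API.
raw = '''{
  "id_str": "729000000000000000",
  "text": "My cardio won't like the govt plan on hulthcare #ausbudget",
  "entities": {"hashtags": [{"text": "ausbudget"}]}
}'''

tweet = json.loads(raw)

# Keep only the fields the cleansing stages build upon.
doc = {
    "id": tweet["id_str"],
    "text": tweet["text"],
    "hashtags": [h["text"] for h in tweet["entities"]["hashtags"]],
}

# With a running MongoDB instance this document would be persisted via:
#   from pymongo import MongoClient
#   MongoClient().budget.tweets.insert_one(doc)
print(doc["hashtags"])
```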

Tweet: My cardio won’t like the govt plan on hulthcare #ausbudget
Extracted keywords: cardio, govt, plan, ausbudget, hulthcare
Table 4.1: Keywords extracted from a tweet.

4.4.2 Automated Correction Microservices

Next, we develop a set of micro-services to automatically correct keywords from the tweets. To achieve this, we link the extracted information to external knowledge bases and services, as shown in Table 4.2. For misspellings, the cleansing services replace a keyword with the possible match that has the highest score; for example, Bing Spell Check returns a corrected spelling with a statistical score. We perform the same process for abbreviations. For jargon, we developed a list of standard forms of words related to the health category of the budget. This is a form of background knowledge, or meta-data, for a given use case; for example, both cardiologist and neurologist may map to the term doctor. Our service inspects jargon terms, matches them against the standard forms, and replaces them when a match is found.
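The correction logic can be sketched as follows. The jargon and abbreviation dictionaries and the spell-check scores are mocked for illustration; in the pipeline they come from the external services listed in Table 4.2:

```python
# Background knowledge: jargon -> standard form for the health category.
JARGON = {"cardiologist": "doctor", "neurologist": "doctor"}

ABBREVIATIONS = {"govt": "government"}

def spell_candidates(word):
    """Mocked spell-check service returning (candidate, score) pairs."""
    mock = {"hulthcare": [("healthcare", 0.92), ("half care", 0.31)]}
    return mock.get(word, [])

def auto_correct(keyword):
    if keyword in JARGON:                    # jargon -> standard form
        return JARGON[keyword]
    if keyword in ABBREVIATIONS:             # abbreviation -> full form
        return ABBREVIATIONS[keyword]
    candidates = spell_candidates(keyword)   # misspelling -> highest score
    if candidates:
        best, _score = max(candidates, key=lambda c: c[1])
        return best
    return keyword                           # nothing matched: left for the crowd

print([auto_correct(k) for k in ["cardiologist", "govt", "hulthcare", "cardo"]])
```

The fall-through at the end is deliberate: keywords that no service can resolve are exactly the ones handed to the crowdsourced step.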

Service: Microsoft Bing Spell Check — Purpose: identify misspellings
Service: Abbreviations API — Purpose: identify a word as an abbreviation and obtain its full form
Service: Jargon list — Purpose: find matching standard-form words
Table 4.2: Reference for external services.

It is important to note that, since the automated services rely on the best scores returned by external services to replace keywords, this is likely to introduce errors or wrong word matches. For example, we checked the score for the misspelled word cardo from a tweet against Bing Spell Check. Bing’s best match was card, with a score of about 90%; ideally, the correct word would be cardio. Crowdsourcing can therefore help us identify and correct issues that the automated step failed to rectify.

4.4.3 Crowdsourced Correction

In this step, we developed a simple web-based user interface along with a set of micro-services. This web interface can be accessed via a web browser such as Chrome. Each crowd task consists of ten multiple-choice questions, which a user can answer with a simple click. An illustration of the crowd tasks generated from our pipeline is shown in Figures 4.2 and 4.3.

Figure 4.2: An example of a Suggestion Crowd Task generated from our tool.
Figure 4.3: An example of a Correction Crowd Task generated from our tool.

There are three types of questions we pose to a crowd user:

  1. Identification: Identify whether a tweet is related to the health category or not.

  2. Suggestion: Suggest if a keyword is a misspelling, jargon or an abbreviation. As an option, a user can choose none if there are no matches.

  3. Correction: Select the best answer for the correction, or write your own.

The identification task helps us narrow down the tweets that belong to the health category; we used it to filter tweets from the initial 15 million tweet dataset. The suggestion tasks help us identify whether a particular keyword is a misspelling, jargon or abbreviation, so there are three options for a crowd user to choose from. In order to present a suggestion task, we developed a heuristic that ensures maximum tweet coverage; in the context of this heuristic, a social item is a piece of data such as a tweet. Once we identify a particular keyword within a tweet as, say, a misspelling, we once again leverage the same set of services (refer to Figure 4.2) to present options to the user. We use a simple highest-score algorithm over all crowd users’ answers to judge whether a particular keyword is a misspelling, jargon or abbreviation. The pseudocode for the correction heuristic is shown in Figure 3.6 in Chapter 3. The crowd-cleansed and curated tweets were persisted inside a Microsoft SQL Server database. We then ran a simple heuristic of selecting the best result based on the majority-vote score; an illustration of this heuristic is shown in Figure 3.7 in Chapter 3.
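The majority-vote aggregation over crowd answers can be sketched as below; ties and minimum-vote thresholds are omitted here, and the sample answers are invented for illustration:

```python
from collections import Counter

def majority_vote(answers):
    """Pick the correction submitted by the most crowd users."""
    winner, _votes = Counter(answers).most_common(1)[0]
    return winner

# Hypothetical crowd submissions: misspelled keyword -> answers from users.
crowd_answers = {
    "cardo": ["cardio", "cardio", "card", "cardio"],
    "hulthcare": ["healthcare", "healthcare", "health care"],
}
curated = {kw: majority_vote(votes) for kw, votes in crowd_answers.items()}
print(curated)
```

Note how the crowd recovers cardio for cardo, the case where the automated spell-check service preferred the wrong candidate card.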

4.5 Evaluation

We used three months of Twitter data, from May 2016 to August 2016, comprising roughly fifteen million tweets. The raw dataset was large and contained many tweets unrelated to the health category; therefore, after ingesting the tweet data into MongoDB, we first ran a crowdsourced identification task. The aim of this task was for a crowd user to simply identify whether a tweet belonged to the health category or not. We then ran an experiment using the tweets identified in this step, covering keyword extraction, automated correction and finally crowdsourced correction. To demonstrate the effectiveness of the CrowdCorrect approach, we created two datasets in the field of healthcare: the raw tweets formed the first dataset, while the curated tweets, in which all jargon, misspellings and abbreviations were corrected, formed the other. The following steps summarise the methodology used to perform the experiment: (i) Data ingestion - ingesting three months of Twitter data covering the period before and after the budget announcement; (ii) Identification crowd task - asking crowd users to identify whether a tweet belongs to health or not; (iii) Keyword extraction - extracting keywords from each tweet; (iv) Automated correction; and (v) Crowdsourced correction - based on generated crowd micro-tasks.

Finally, to evaluate our approach, we developed four machine learning classifiers using binomial logistic regression and a gradient descent algorithm. A logistic regression classifier is a generalized linear model that can be used to model or predict categorical outcome variables; the gradient descent algorithm, widely used in optimization problems, aims to minimize a cost function. The classifiers were trained to match tweets against two classes, namely health or other. That is, we wanted to evaluate the effectiveness of the cleansing and curation operations of the proposed approach.
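As a sketch of this training setup, the following toy example fits a binomial logistic regression by gradient descent over bag-of-words features. The vocabulary, sample tweets and hyper-parameters are illustrative, not those of our experiment:

```python
import math

VOCAB = ["medicare", "hospital", "doctor", "road", "tax"]

def featurize(text):
    """Binary bag-of-words features over a toy vocabulary."""
    words = text.lower().split()
    return [1.0 if v in words else 0.0 for v in VOCAB]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lr=0.5, epochs=200):
    """Fit logistic regression by gradient descent on the log-loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi                        # derivative of the log-loss
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

tweets = ["more medicare funding for hospital care",
          "doctor shortage in rural hospital",
          "new road tax announced",
          "road upgrades and tax cuts"]
labels = [1, 1, 0, 0]                           # 1 = health, 0 = other

w, b = train([featurize(t) for t in tweets], labels)
score = sum(wj * xj for wj, xj in zip(w, featurize("medicare rebate for doctor visits"))) + b
print("health" if sigmoid(score) > 0.5 else "other")
```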

4.5.1 User Selection for Crowdsourced Tasks

For our experiment, we asked students enrolled in semester two, 2017 in the Web Application Engineering course cs9321/16s2 to be the main participants as crowd users. In addition, we encouraged members of the service oriented computing group at the University of New South Wales to join the crowd. Finally, in 2018 we also invited a set of crowd users from a local organisation to take part in the experiment. In total, our crowd comprised close to 500 people. Each potential participant received an invitation via email. A web link in the email navigated the user to the web interface for the crowd micro-tasks, after providing two unique identifiers: name and email address. Each invitation email also stated the number of questions to answer and the expected duration, and included a help guide.

Discussion. In this experiment, the classifiers were constructed to verify whether a tweet is relevant to the health category or not. First, we trained two classifiers (logistic regression and gradient descent) on each of the raw and curated tweet datasets. For training, we filtered out tokens occurring fewer than three times, removed punctuation and stop words, and used the Porter stemmer to stem the remaining tokens. The results of our experiment are summarised in Figure 4.4. Both the logistic regression and gradient descent classifiers performed better on the curated dataset: the gradient descent algorithm improved precision by 4%, and the logistic regression algorithm by 5%. In addition, Figure 4.4(B) illustrates the improvement in F-measure, which increased for the gradient descent and logistic regression classifiers by 2% and 3% respectively.
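The token preprocessing described above can be sketched as follows; a crude suffix stripper stands in here for the Porter stemmer used in our experiment, and the stop-word list and sample tweets are illustrative:

```python
from collections import Counter
import string

STOP = {"the", "a", "an", "for", "on", "in", "and", "of", "to"}

def stem(token):
    """Toy suffix stripper standing in for the Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(tweets, min_count=3):
    """Remove punctuation and stop words, drop rare tokens, then stem."""
    tokens_per_tweet = []
    for t in tweets:
        t = t.translate(str.maketrans("", "", string.punctuation)).lower()
        tokens_per_tweet.append([w for w in t.split() if w not in STOP])
    counts = Counter(w for toks in tokens_per_tweet for w in toks)
    return [[stem(w) for w in toks if counts[w] >= min_count]
            for toks in tokens_per_tweet]

tweets = ["Funding for hospitals!", "hospitals need funding.",
          "more funding, more hospitals"]
print(preprocess(tweets))
```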

Figure 4.4: Results of experiment run - comparison of Raw and Curated data.

4.6 Conclusion

In this chapter, we detailed our implementation of the CrowdCorrect cleansing and curation pipeline. We discussed how CrowdCorrect ingests data and then extracts keywords from tweets to perform further cleansing on them. We further illustrated how the automated and crowdsourced activities leverage external knowledge bases and services, and how we developed tools to engage the crowd, gather feedback and use it to correct the raw data. The output of our approach is a curated dataset which can then be fed into further analytics. Although we illustrated one use case with Australian Government budget tweets, the approach can be applied to other use cases as well. Our experimental results illustrate the effectiveness of using a curated dataset over raw data for reliable analytics tasks. Our approach leverages the collective wisdom of the crowd to improve the quality of raw data, leading to more robust curation; essentially, it adds another layer of quality checking and correction where automated approaches struggle. One downside of our approach is crowd engagement, i.e., the selection, motivation and unbiased participation of crowd users, which still needs to be addressed.

5.1 Concluding Remarks

Social media sites have grown rapidly since their first introduction in the 2000s [44]. Popular sites such as Twitter, Facebook and Instagram have combined user populations that run into the billions. Furthermore, governments and organisations have taken to social media to examine their policies [39][80], develop products and gauge sentiment [93], shape marketing strategies [150], and so on. Naturally, due to the immense popularity of social media channels, a lot of valuable data is published on a daily basis. This social data is openly available to be queried, and analytics over it has become a vital priority for organisations and governments. However, there are many roadblocks to utilizing this data for valuable purposes: it is constantly flowing (velocity) and large (volume). In our research work, we examined, and proposed a framework for, another major roadblock: the quality of this data. In this thesis, we argued that the non-standard use of language on social sites such as Twitter makes it difficult to make sense of the data. Raw social data usually contains misspellings and slang, and lacks proper grammar [139][63]. Data cleansing and curation offers a potential solution, but automated cleansing techniques perform poorly on social media data [3]. To address these challenges we proposed an extensible curation and cleansing framework, CrowdCorrect, in which we embedded a crowdsourced approach, or crowd cleansing, in addition to automated techniques. We discussed the motivations and rationale for building the CrowdCorrect cleansing and curation pipeline in Chapter 1. Below we summarise the contributions of our research:

  1. Study of the State of the Art. First, we surveyed research on social media analytics. We found that much of the existing research focuses on specific attributes (features such as Likes on Facebook) or on the social media sites themselves; there is no significant body of research on solving quality issues in social media data for analytics. We then discussed what data cleansing and curation are, and reviewed the various curation techniques and frameworks in the literature. We found that many existing frameworks either cater for structured data or do not provide an end-to-end pipeline with a focus on cleansing. Finally, we discussed research on crowdsourcing techniques and their applicability, specifically to social media data.

  2. CrowdCorrect. We proposed an extensible cleansing and curation pipeline for social media data. This pipeline ingests raw data from social media sites and extracts features from it. We then discussed the use of automated and crowdsourcing techniques to cleanse and curate the raw social data, leveraging external knowledge sources and services. The CrowdCorrect pipeline has two major activities: (i) Automated feature extraction and correction - we discussed the design and implementation of micro-services to extract features such as keywords from a corpus of tweet data and automatically perform the major data cleansing tasks on the extracted keywords.
    (ii) Crowdsourced correction - we discussed our approach of using crowd input to further cleanse data that could not be corrected in the earlier step. To achieve this, we take the extracted features (e.g., keywords) from the earlier step and automatically generate micro-tasks with possible options for the user to choose from. These micro-tasks are presented to users within a simple web interface. Our micro-task generation service uses external knowledge bases such as Bing Spell Check to suggest possible answers.

5.2 Future Directions

Given the vital importance of social media analytics, there is ample future work to extend this research. Our work focused on extracted keywords for cleansing and curation; this can be extended to other extractable features within raw social media text, such as named entities, topics and sentiments. The pipeline could also be extended to automatically use tools such as Twitter Polls to passively engage crowd users. In addition, as ongoing and future work, we propose designing micro-tasks that turn the knowledge of a domain expert into a domain-mediated model, presented as a set of rule-sets, to support cases where the automatic curation algorithms and the knowledge of the crowd may not be able to properly contextualize the social items.


  • [1] B. Abu-Salih, P. Wongthongtham, S. Beheshti, and D. Zhu (2015) A preliminary approach to domain-based evaluation of users’ trustworthiness in online social networks. In 2015 IEEE International Congress on Big Data, New York City, NY, USA, June 27 - July 2, 2015, pp. 460–466. Cited by: §2.1.
  • [2] M. Acosta, A. Zaveri, E. Simperl, D. Kontokostas, S. Auer, and J. Lehmann (2013) Crowdsourcing Linked Data Quality Assessment. In The Semantic Web - ISWC 2013, D. Hutchison, T. Kanade, J. Kittler, J. M. Kleinberg, F. Mattern, J. C. Mitchell, M. Naor, O. Nierstrasz, C. Pandu Rangan, B. Steffen, M. Sudan, D. Terzopoulos, D. Tygar, M. Y. Vardi, G. Weikum, H. Alani, L. Kagal, A. Fokoue, P. Groth, C. Biemann, J. X. Parreira, L. Aroyo, N. Noy, C. Welty, and K. Janowicz (Eds.), Vol. 8219, pp. 260–276. External Links: ISBN 978-3-642-41337-7 978-3-642-41338-4, Document Cited by: §2.5.4, §3.1.1.
  • [3] M. Adedoyin-Olowe, M. M. Gaber, and F. Stahl A Survey of Data Mining Techniques for Social Network Analysis. pp. 25 (en). Cited by: §1.2, §2.6, §4.1, §5.1.
  • [4] N. Agarwal and Y. Yiliyasi (2010) Information quality challenges in social media.. In ICIQ, Cited by: §2.4.
  • [5] C. C. Aggarwal (2011) An Introduction to Social Network Data Analytics. In Social Network Data Analytics, pp. 1–15 (en). External Links: ISBN 978-1-4419-8461-6 978-1-4419-8462-3, Link, Document Cited by: §2.1.
  • [6] D. E. Alexander (2014) Social media in disaster risk reduction and crisis management. Science and engineering ethics 20 (3), pp. 717–733. Cited by: §2.1.
  • [7] M. Allahbakhsh, A. Ignjatovic, B. Benatallah, S. Beheshti, E. Bertino, and N. Foo (2012) Reputation management in crowdsourcing systems. In 8th International Conference on Collaborative Computing: Networking, Applications and Worksharing, CollaborateCom 2012, Pittsburgh, PA, USA, October 14-17, 2012, pp. 664–671. Cited by: §2.1.
  • [8] M. Allahbakhsh, A. Ignjatovic, B. Benatallah, S. Beheshti, E. Bertino, and N. Foo (2013) Collusion detection in online rating systems. In Web Technologies and Applications - 15th Asia-Pacific Web Conference, APWeb 2013, Sydney, Australia, April 4-6, 2013. Proceedings, pp. 196–207. Cited by: §2.2.
  • [9] M. Allahbakhsh, A. Ignjatovic, B. Benatallah, S. Beheshti, N. Foo, and E. Bertino (2014) Representation and querying of unfair evaluations in social rating systems. Computers & Security 41, pp. 68–88. Cited by: §2.2.
  • [10] F. Amouzgar, A. Beheshti, S. Ghodratnama, B. Benatallah, J. Yang, and Q. Z. Sheng (2018) ISheets: A spreadsheet-based machine learning development platform for data-driven process analytics. In Service-Oriented Computing - ICSOC 2018 Workshops - ADMS, ASOCA, ISYyCC, CloTS, DDBS, and NLS4IoT, Hangzhou, China, November 12-15, 2018, Revised Selected Papers, pp. 453–457. Cited by: §2.1.
  • [11] A. Aw, M. Zhang, J. Xiao, and J. Su (2006) A phrase-based statistical model for sms text normalization. In Proceedings of the COLING/ACL on Main conference poster sessions, pp. 33–40. Cited by: §3.1.1.
  • [12] Y. Bae and H. Lee (2012) Sentiment analysis of twitter audiences: measuring the positive or negative influence of popular twitterers. Journal of the American Society for Information Science and Technology 63 (12), pp. 2521–2535. Cited by: §2.1.
  • [13] B. Han, P. Cook, and T. Baldwin (2013) Lexical normalisation of short text messages. ACM Transactions on Intelligent Systems and Technology 4 (1), pp. 5–27 (en). Cited by: §2.1.
  • [14] D. Ballou, S. Madnick, and R. Wang (2003) Special section: assuring information quality. Journal of Management Information Systems 20 (3), pp. 9–11. Cited by: §2.3.1.
  • [15] A. Barnawi, O. Batarfi, S. Beheshti, R. E. Shawi, A. G. Fayoumi, R. Nouri, and S. Sakr (2014) On characterizing the performance of distributed graph computation platforms. In Performance Characterization and Benchmarking. Traditional to Big Data - 6th TPC Technology Conference, TPCTC 2014, Hangzhou, China, September 1-5, 2014. Revised Selected Papers, pp. 29–43. Cited by: §2.2.
  • [16] O. Batarfi, R. E. Shawi, A. G. Fayoumi, R. Nouri, S. Beheshti, A. Barnawi, and S. Sakr (2015) Large scale graph processing systems: survey and an experimental evaluation. Cluster Computing 18 (3), pp. 1189–1213. Cited by: §2.2.
  • [17] N. Beagrie (2008) Digital curation for science, digital libraries, and individuals. International Journal of Digital Curation 1 (1), pp. 3–16. Cited by: §2.2.
  • [18] A. Beheshti, B. Benatallah, and H. R. Motahari-Nezhad (2018) ProcessAtlas: A scalable and extensible platform for business process analytics. Softw., Pract. Exper. 48 (4), pp. 842–866. Cited by: §2.1.
  • [19] A. Beheshti, B. Benatallah, R. Nouri, V. M. Chhieng, H. Xiong, and X. Zhao (2017) Coredb: a data lake service. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 2451–2454. Cited by: §2.1, §3.2.1.
  • [20] A. Beheshti, B. Benatallah, R. Nouri, and A. Tabebordbar (2018) CoreKG: a knowledge lake service. PVLDB 11 (12), pp. 1942–1945. Cited by: §2.1.
  • [21] A. Beheshti, B. Benatallah, A. Tabebordbar, H. R. Motahari-Nezhad, M. C. Barukh, and R. Nouri (2018-08-23) DataSynapse: a social data curation foundry. Distributed and Parallel Databases. External Links: ISSN 1573-7578, Document, Link Cited by: §2.2.
  • [22] A. Beheshti, F. Schiliro, S. Ghodratnama, F. Amouzgar, B. Benatallah, J. Yang, Q. Z. Sheng, F. Casati, and H. R. Motahari-Nezhad (2018) IProcess: enabling iot platforms in data-driven knowledge-intensive processes. In Business Process Management Forum - BPM Forum 2018, Sydney, NSW, Australia, September 9-14, 2018, Proceedings, pp. 108–126. Cited by: §2.2.
  • [23] A. Beheshti, K. Vaghani, B. Benatallah, and A. Tabebordbar (2018) CrowdCorrect: a curation pipeline for social data cleansing and curation. In International Conference on Advanced Information Systems Engineering, pp. 24–38. Cited by: §1.3.
  • [24] A. Beheshti, K. Vaghani, B. Benatallah, and A. Tabebordbar (2018) CrowdCorrect: A curation pipeline for social data cleansing and curation. In Information Systems in the Big Data Era - CAiSE Forum 2018, Tallinn, Estonia, June 11-15, 2018, Proceedings, pp. 24–38. Cited by: §2.1.
  • [25] S. Beheshti, B. Benatallah, and H. R. Motahari-Nezhad (2016) Galaxy: A platform for explorative analysis of open data sources. In Proceedings of the 19th International Conference on Extending Database Technology, EDBT 2016, Bordeaux, France, March 15-16, 2016, Bordeaux, France, March 15-16, 2016., pp. 640–643. Cited by: §2.1.
  • [26] S. Beheshti, B. Benatallah, and H. R. Motahari-Nezhad (2016) Scalable graph-based OLAP analytics over process execution data. Distributed and Parallel Databases 34 (3), pp. 379–423. Cited by: §2.1.
  • [27] S. Beheshti, B. Benatallah, H. R. M. Nezhad, and M. Allahbakhsh (2012) A framework and a language for on-line analytical processing on graphs. In Web Information Systems Engineering - WISE 2012 - 13th International Conference, Paphos, Cyprus, November 28-30, 2012. Proceedings, pp. 213–227. Cited by: §2.1.
  • [28] S. Beheshti, B. Benatallah, H. R. M. Nezhad, and S. Sakr (2011) A query language for analyzing business processes execution. In Business Process Management - 9th International Conference, BPM 2011, Clermont-Ferrand, France, August 30 - September 2, 2011. Proceedings, pp. 281–297. Cited by: §2.1.
  • [29] S. Beheshti, B. Benatallah, and H. R. M. Nezhad (2013) Enabling the analysis of cross-cutting aspects in ad-hoc processes. In Advanced Information Systems Engineering - 25th International Conference, CAiSE 2013, Valencia, Spain, June 17-21, 2013. Proceedings, pp. 51–67. Cited by: §2.2.
  • [30] S. Beheshti, B. Benatallah, S. Sakr, D. Grigori, H. R. Motahari-Nezhad, M. C. Barukh, A. Gater, and S. H. Ryu (2016) Process analytics - concepts and techniques for querying and analyzing process data. Springer. Cited by: §2.1.
  • [31] S. Beheshti, B. Benatallah, S. Venugopal, S. H. Ryu, H. R. Motahari-Nezhad, and W. Wang (2017) A systematic review and comparative analysis of cross-document coreference resolution methods and tools. Computing 99 (4), pp. 313–349. Cited by: §2.2.
  • [32] S. Beheshti, H. R. M. Nezhad, and B. Benatallah (2012) Temporal provenance model (TPM): model and query language. CoRR abs/1211.5009. Cited by: §2.2.
  • [33] S. Beheshti, S. Sakr, B. Benatallah, and H. R. M. Nezhad (2012) Extending SPARQL to support entity grouping and path queries. CoRR abs/1211.5817. Cited by: §2.1.
  • [34] S. Beheshti, A. Tabebordbar, B. Benatallah, and R. Nouri (2016) Data curation apis. CoRR abs/1612.03277. Cited by: §2.2.
  • [35] S. Beheshti, A. Tabebordbar, B. Benatallah, and R. Nouri (2017) On automating basic data curation tasks. In Proceedings of the 26th International Conference on World Wide Web Companion, pp. 165–169. Cited by: §1.2, §4.4.1, §4.4.
  • [36] S. Beheshti, A. Tabebordbar, B. Benatallah, and R. Nouri (2017) On automating basic data curation tasks. In Proceedings of the 26th International Conference on World Wide Web Companion, Perth, Australia, April 3-7, 2017, pp. 165–169. Cited by: §1.1, §1.2, §1.2, §2.2.1, §2.2.2, §2.2, §3.2.
  • [37] S. Beheshti, S. Venugopal, S. H. Ryu, B. Benatallah, and W. Wang (2013) Big data and cross-document coreference resolution: current state and future opportunities. CoRR abs/1311.3987. Cited by: §2.2.
  • [38] G. Bellinger, D. Castro, and A. Mills (2004) Data, information, knowledge, and wisdom. Cited by: §1.1.
  • [39] J. C. Bertot, P. T. Jaeger, and D. Hansen (2012) The impact of polices on government social media usage: issues, challenges, and recommendations. Government information quarterly 29 (1), pp. 30–40. Cited by: §2.1, §5.1.
  • [40] D. C. Brabham (2008) Crowdsourcing as a model for problem solving: an introduction and cases. Convergence 14 (1), pp. 75–90. External Links: Document, Link, Cited by: §2.5.
  • [41] A. Bruns and T. Highfield (2015) 18. from news blogs to news on twitter: gatewatching and collaborative news curation. Handbook of digital politics 325. Cited by: Table 2.2.
  • [42] E. Cambria, D. Rajagopal, D. Olsher, and D. Das (2013) Big social data analysis. Big data computing 13, pp. 401–414. Cited by: §2.1.
  • [43] M. Cha, H. Haddadi, F. Benevenuto, P. K. Gummadi, et al. (2010) Measuring user influence in twitter: the million follower fallacy.. Icwsm 10 (10-17), pp. 30. Cited by: §2.1.
  • [44] D. Chaffey (2016) Global social media research summary 2016. Smart Insights: Social Media Marketing. Cited by: §5.1.
  • [45] Y. Chang, L. Tang, Y. Inagaki, and Y. Liu (2014) What is tumblr: a statistical overview and comparison. ACM SIGKDD explorations newsletter 16 (1), pp. 21–29. Cited by: §2.1.
  • [46] S. Chaudhuri and U. Dayal (1997) An overview of data warehousing and olap technology. ACM Sigmod record 26 (1), pp. 65–74. Cited by: item 6.
  • [47] S. Chaudhuri, V. Ganti, and R. Motwani (2005) Robust identification of fuzzy duplicates. In Data Engineering, 2005. ICDE 2005. Proceedings. 21st International Conference on, pp. 865–876. Cited by: §2.2.
  • [48] B. Chen, J. Oliver, D. Schwartz, W. Lindsey, and A. MacDonald (2005-January 27) Data federation methods and system. Google Patents. Note: US Patent App. 10/850,826 Cited by: §2.2.
  • [49] L. Chiticariu, M. A. Hernández, P. G. Kolaitis, and L. Popa (2007) Semi-automatic schema integration in clio. In Proceedings of the 33rd international conference on Very large data bases, pp. 1326–1329. Cited by: §2.2.
  • [50] X. Chu, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, N. Tang, and Y. Ye (2015) Katara: a data cleaning system powered by knowledge bases and crowdsourcing. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 1247–1261. Cited by: §2.5.4, §3.1.1.
  • [51] J. Cibej, D. Fiser, and I. Kosem (2015) The role of crowdsourcing in lexicography. In Proc. of the fourth biennial conference on electronic lexicography, eLex, pp. 72–79. Cited by: §2.5.3.
  • [52] E. Clark and K. Araki (2011) Text normalization in social media: progress, problems and applications for a pre-processing system of casual english. Procedia-Social and Behavioral Sciences 27, pp. 2–11. Cited by: §3.1.1.
  • [53] P. Cook and S. Stevenson (2009) An unsupervised model for text message normalization. In Proceedings of the workshop on computational approaches to linguistic creativity, pp. 71–78. Cited by: §3.1.1.
  • [54] S. Cooper, F. Khatib, A. Treuille, J. Barbero, J. Lee, M. Beenen, A. Leaver-Fay, D. Baker, Z. Popović, et al. (2010) Predicting protein structures with a multiplayer online game. Nature 466 (7307), pp. 756. Cited by: §2.5.2.
  • [55] M. H. Cragin, P. B. Heidorn, C. L. Palmer, and L. C. Smith (2007) An educational program on data curation. Cited by: §2.2.
  • [56] E. Curry, A. Freitas, and S. O’Riáin (2010) The role of community-driven data curation for enterprises. In Linking enterprise data, pp. 25–47. Cited by: Table 2.2.
  • [57] N. Dalvi, R. Kumar, B. Pang, R. Ramakrishnan, A. Tomkins, P. Bohannon, S. Keerthi, and S. Merugu (2009) A web of concepts. In Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pp. 1–12. Cited by: §2.5.4, §3.1.1.
  • [58] J. Davidson, B. Liebald, J. Liu, P. Nandy, T. Van Vleet, U. Gargi, S. Gupta, Y. He, M. Lambert, B. Livingston, et al. (2010) The youtube video recommendation system. In Proceedings of the fourth ACM conference on Recommender systems, pp. 293–296. Cited by: §2.1.
  • [59] G. Demartini, D. E. Difallah, and P. Cudré-Mauroux (2012) ZenCrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In Proceedings of the 21st international conference on World Wide Web, pp. 469–478. Cited by: §2.2.2.
  • [60] A. Doan, R. Ramakrishnan, and A. Y. Halevy (2011) Crowdsourcing systems on the world-wide web. Communications of the ACM 54 (4), pp. 86–96. Cited by: Table 2.2.
  • [61] L. Dou, G. Cao, P. J. Morris, R. A. Morris, B. Ludäscher, J. A. Macklin, and J. Hanken (2012) Kurator: a kepler package for data curation workflows. Procedia Computer Science 9, pp. 1614–1619. Cited by: §2.2.2.
  • [62] K. Duh, T. Hirao, A. Kimura, K. Ishiguro, T. Iwata, and C. A. Yeung (2012) Creating stories: social curation of twitter messages.. In ICWSM, Cited by: §2.4.
  • [63] J. Eisenstein (2013) What to do about bad language on the internet. In Proceedings of the 2013 conference of the North American Chapter of the association for computational linguistics: Human language technologies, pp. 359–369. Cited by: §1.2, §1.2, §2.1, §2.3.1, §3.1.1, §4.1, §5.1.
  • [64] N. B. Ellison, C. Steinfield, and C. Lampe (2007) The benefits of facebook "friends:" social capital and college students’ use of online social network sites. Journal of Computer-Mediated Communication 12 (4), pp. 1143–1168. Cited by: §2.1.
  • [65] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios (2007) Duplicate record detection: a survey. IEEE Transactions on knowledge and data engineering 19 (1), pp. 1–16. Cited by: item 3.
  • [66] İ. E. Erdoğmuş and M. Cicek (2012) The impact of social media marketing on brand loyalty. Procedia-Social and Behavioral Sciences 58, pp. 1353–1360. Cited by: §1.1.
  • [67] H. Eul (1996-July 23) Method for merging data streams. Google Patents. Note: US Patent 5,539,749 Cited by: item 5.
  • [68] F. Figueiredo, J. M. Almeida, M. A. Goncalves, and F. Benevenuto (2014) On the dynamics of social media popularity: a youtube case study. ACM Transactions on Internet Technology (TOIT) 14 (4), pp. 24. Cited by: §2.1.
  • [69] T. Finin, W. Murnane, A. Karandikar, N. Keller, J. Martineau, and M. Dredze (2010) Annotating named entities in twitter data with crowdsourcing. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, pp. 80–88. Cited by: §3.1.1.
  • [70] E. Freitas (2016) Big Data Curation. In New Horizons for a Data-Driven Economy, pp. 87–118. External Links: ISBN 978-3-319-21568-6 978-3-319-21569-3, Link, Document Cited by: §2.2.2.
  • [71] U. Gadiraju, G. Demartini, R. Kawase, and S. Dietze (2015-07) Human Beyond the Machine: Challenges and Opportunities of Microtask Crowdsourcing. IEEE Intelligent Systems 30 (4), pp. 81–85. External Links: ISSN 1541-1672, Document Cited by: §2.5.3.
  • [72] H. Garcia-Molina and V. Verroios Challenges in data crowdsourcing. pp. 14. Cited by: §2.5.1, §2.5.2, §2.5.3, §2.5.4.
  • [73] S. M. Ghafari, S. Yakhchi, A. Beheshti, and M. Orgun (2018) SETTRUST: social exchange theory based context-aware trust prediction in online social networks. In Data Quality and Trust in Big Data - 5th International Workshop, QUAT 2018, Held in Conjunction with WISE 2018, Dubai, UAE, November 12-15, 2018, Revised Selected Papers, pp. 46–61. Cited by: §2.1.
  • [74] S. M. Ghafari, S. Yakhchi, A. Beheshti, and M. Orgun (2018) Social context-aware trust prediction: methods for identifying fake news. In Web Information Systems Engineering - WISE 2018 - 19th International Conference, Dubai, United Arab Emirates, November 12-15, 2018, Proceedings, Part I, pp. 161–177. Cited by: §2.1.
  • [75] F. Giglietto, L. Rossi, and D. Bennato (2012) The open laboratory: limits and possibilities of using facebook, twitter, and youtube as a research data source. Journal of Technology in Human Services 30 (3-4), pp. 145–159. External Links: Document, Link, Cited by: §2.1.
  • [76] K. Gimpel, N. Schneider, B. O’Connor, D. Das, D. Mills, J. Eisenstein, M. Heilman, D. Yogatama, J. Flanigan, and N. A. Smith (2010) Part-of-speech tagging for twitter: annotation, features, and experiments. Technical report Carnegie-Mellon Univ Pittsburgh Pa School of Computer Science. Cited by: §2.1, §3.1.1.
  • [77] R. Ginn, P. Pimpalkhute, A. Nikfarjam, A. Patki, K. O’Connor, A. Sarker, K. Smith, and G. Gonzalez (2014) Mining twitter for adverse drug reaction mentions: a corpus and classification benchmark. In Proceedings of the fourth workshop on building and evaluating resources for health and biomedical text processing, Cited by: §2.4.
  • [78] O. Goonetilleke, T. Sellis, X. Zhang, and S. Sathe (2014) Twitter analytics: a big data management perspective. ACM SIGKDD Explorations Newsletter 16 (1), pp. 11–20. Cited by: §2.1.
  • [79] R. Grover and M. J. Carey (2015) Data ingestion in asterixdb. In EDBT, pp. 605–616. Cited by: item 1.
  • [80] L. Hagen, R. Scharf, S. Neely, and T. Keller (2018) Government social media communications during zika health crisis. In Proceedings of the 19th Annual International Conference on Digital Government Research: Governance in the Data Age, pp. 12. Cited by: §2.1, §5.1.
  • [81] M. Hammoud, D. A. Rabbou, R. Nouri, S. Beheshti, and S. Sakr (2015) DREAM: distributed RDF engine with adaptive query planner and minimal communication. PVLDB 8 (6), pp. 654–665. Cited by: §2.2.
  • [82] B. Han, P. Cook, and T. Baldwin (2013) Lexical normalization for social media text. ACM Transactions on Intelligent Systems and Technology (TIST) 4 (1), pp. 5. Cited by: §2.4.
  • [83] A. Haug, F. Zachariassen, and D. Van Liempd (2011) The costs of poor data quality. Journal of Industrial Engineering and Management (JIEM) 4 (2), pp. 168–193. Cited by: §1.2.
  • [84] J. He, J. van Ossenbruggen, and A. P. de Vries (2013) Do you need experts in the crowd?: a case study in image annotation for marine biology. In Proceedings of the 10th Conference on Open Research Areas in Information Retrieval, pp. 57–60. Cited by: §3.1.1.
  • [85] M. Hedges and T. Blanke (2012) Sheer curation for experimental data and provenance. In Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries, pp. 405–406. Cited by: Table 2.2.
  • [86] M. Hedges and T. Blanke (2013) Digital libraries for experimental data: capturing process through sheer curation. In International Conference on Theory and Practice of Digital Libraries, pp. 108–119. Cited by: §2.2.2.
  • [87] C. Heipke (2010) Crowdsourcing geospatial data. ISPRS Journal of Photogrammetry and Remote Sensing 65 (6), pp. 550–557. Cited by: §2.5.4.
  • [88] V. J. Hodge and J. Austin (2003) A comparison of standard spell checking algorithms and a novel binary neural approach. IEEE transactions on knowledge and data engineering 15 (5), pp. 1073–1081. Cited by: §3.1.1.
  • [89] J. Howe (2006) The rise of crowdsourcing. Wired magazine 14 (6), pp. 1–4. Cited by: §2.5.1.
  • [90] N. Q. V. Hung, H. H. Viet, N. T. Tam, M. Weidlich, H. Yin, and X. Zhou (2018) Computing crowd consensus with partial agreement. IEEE Transactions on Knowledge and Data Engineering 30 (1), pp. 1–14. Cited by: §2.5.2.
  • [91] R. Hutt (2017) The world's most popular social networks, mapped. Note: accessed 2018-12-30 Cited by: §1.1.
  • [92] A. Immonen, P. Pääkkönen, and E. Ovaska (2015) Evaluating the quality of social media data in big data architecture. Ieee Access 3, pp. 2028–2043. Cited by: §1.2, §4.1.
  • [93] B. Jeong, J. Yoon, and J. Lee (2017) Social media mining for product planning: a product opportunity mining approach based on topic modeling and sentiment analysis. International Journal of Information Management. Cited by: §2.1, §5.1.
  • [94] A. Katal, M. Wazid, and R. Goudar (2013) Big data: issues, challenges, tools and good practices. IEEE. Cited by: §1.2.
  • [95] N. Kaufmann, T. Schulze, and D. Veit (2011) More than fun and money. worker motivation in crowdsourcing-a study on mechanical turk.. In AMCIS, Vol. 11, pp. 1–11. Cited by: §2.5.2.
  • [96] Z. Khayyat, I. F. Ilyas, A. Jindal, S. Madden, M. Ouzzani, P. Papotti, J. Quiané-Ruiz, N. Tang, and S. Yin (2015) Bigdansing: a system for big data cleansing. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 1215–1230. Cited by: §2.3.1.
  • [97] J. H. Kietzmann, K. Hermkens, I. P. McCarthy, and B. S. Silvestre (2011) Social media? get serious! understanding the functional building blocks of social media. Business horizons 54 (3), pp. 241–251. Cited by: §2.5.1.
  • [98] J. H. Kietzmann (2017) Crowdsourcing: a revised definition and introduction to new research. Business Horizons 60 (2), pp. 151–153. Cited by: §2.5.1.
  • [99] N. W. Kim, J. Jung, E. Ko, S. Han, C. W. Lee, J. Kim, and J. Kim (2016) Budgetmap: engaging taxpayers in the issue-driven classification of a government budget. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing, pp. 1028–1039. Cited by: §4.2.
  • [100] D. L. King (2015) Why use social media. Library Technology Reports 51 (1), pp. 6–9. Cited by: §2.1.
  • [101] A. Kittur, J. V. Nickerson, M. Bernstein, E. Gerber, A. Shaw, J. Zimmerman, M. Lease, and J. Horton (2013) The future of crowd work. In Proceedings of the 2013 conference on Computer supported cooperative work, pp. 1301–1318. Cited by: §3.1.1.
  • [102] S. Krishnan, D. Haas, M. J. Franklin, and E. Wu (2016) Towards reliable interactive data cleaning: a user survey and recommendations. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics, pp. 9. Cited by: item 2.
  • [103] S. Kuznetsov (2006) Motivations of contributors to wikipedia. ACM SIGCAS computers and society 36 (2), pp. 1. Cited by: §2.5.2.
  • [104] H. Kwak, C. Lee, H. Park, and S. Moon (2010) What is twitter, a social network or a news media?. In Proceedings of the 19th international conference on World wide web, pp. 591–600. Cited by: §2.1.
  • [105] N. Lanxon (2011) How the oxford english dictionary started out like wikipedia. Wired, Jan. 2011. Cited by: §2.5.1.
  • [106] R. Lederman, G. Shanks, and M. R. Gibbs (2003) Meeting privacy obligations: the implications for information systems development. ECIS 2003 Proceedings, pp. 96. Cited by: §2.3.1.
  • [107] R. Leung, M. Schuckert, and E. Yeung (2013) Attracting user social media engagement: a study of three budget airlines facebook pages. In Information and communication technologies in tourism 2013, pp. 195–206. Cited by: §2.1.
  • [108] F. Liu, F. Weng, and X. Jiang (2012) A broad-coverage normalization system for social media language. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pp. 1035–1044. Cited by: §4.3.
  • [109] P. Lord, A. Macdonald, L. Lyon, and D. Giaretta (2004) From Data Deluge to Data Curation. In Proceedings of the 3rd UK e-Science All Hands Meeting, pp. 371–375. Cited by: §2.2.
  • [110] Z. Maamar, S. Sakr, A. Barnawi, and S. Beheshti (2015) A framework of enriching business processes life-cycle with tagging information. In Databases Theory and Applications - 26th Australasian Database Conference, ADC 2015, Melbourne, VIC, Australia, June 4-7, 2015. Proceedings, pp. 309–313. Cited by: §2.2.
  • [111] D. Maynard and A. Funk (2011) Automatic detection of political opinions in tweets. In Extended Semantic Web Conference, pp. 88–99. Cited by: §2.4.
  • [112] E. McLaughlin (2014) Image overload: help us sort it all out, NASA requests. CNN.com. Cited by: §2.5.1.
  • [113] K. Moore (1994) Museum management. Psychology Press. Cited by: §2.2.
  • [114] H. Morris and D. Vesset (2005) Managing master data for business performance management: the issues and hyperion’s solution. IDC white paper. Cited by: Table 2.2.
  • [115] R. C. Mosley Jr (2012) Social media analytics: data mining applied to insurance twitter posts. In Casualty Actuarial Society E-Forum, Vol. 2, pp. 1. Cited by: §4.3.
  • [116] H. Müller and J. Freytag (2005) Problems, methods, and challenges in comprehensive data cleansing. Professoren des Inst. Für Informatik. Cited by: §2.3.2.
  • [117] B. Otto and A. Schmidt (2010) Enterprise master data architecture: design decisions and options. In 15th International Conference on Information Quality (ICIQ 2010), Little Rock, Cited by: Table 2.2.
  • [118] O. Owoputi, B. O’Connor, C. Dyer, K. Gimpel, N. Schneider, and N. A. Smith (2013) Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of the 2013 conference of the North American chapter of the association for computational linguistics: human language technologies, pp. 380–390. Cited by: §3.1.1.
  • [119] C. Palmer, N. M. Weber, A. Renear, and T. Muñoz (2013) Foundations of data curation: the pedagogy and practice of "purposeful work" with research data. Cited by: §1.2, §2.2.
  • [120] R. D. Perera, S. Anand, K. Subbalakshmi, and R. Chandramouli (2010) Twitter analytics: architecture, tools and analysis. In Military Communications Conference, 2010-MILCOM 2010, pp. 2186–2191. Cited by: §2.1.
  • [121] D. Porcello and S. Hsi (2013) Crowdsourcing and curating online education resources. Science 341 (6143), pp. 240–241. Cited by: Table 2.2.
  • [122] J. Prpic, P. P. Shukla, J. H. Kietzmann, and I. P. McCarthy (2015-01) How to work a crowd: Developing crowd capital through crowdsourcing. Business Horizons 58 (1), pp. 77–85. External Links: ISSN 00076813 Cited by: §2.5.1.
  • [123] J. Prpic and P. Shukla (2016) Crowd science: measurements, models, and methods. In System Sciences (HICSS), 2016 49th Hawaii International Conference on, pp. 4365–4374. Cited by: §2.5.1.
  • [124] D. R. Raban (2008) The incentive structure in an online information market. Journal of the American Society for Information Science and Technology 59 (14), pp. 2284–2295. Cited by: §2.5.2.
  • [125] E. Rahm and P. A. Bernstein (2001) A survey of approaches to automatic schema matching. the VLDB Journal 10 (4), pp. 334–350. Cited by: §2.2.
  • [126] E. Rahm and H. H. Do (2000) Data cleaning: problems and current approaches. IEEE Data Eng. Bull. 23 (4), pp. 3–13. Cited by: §3.1.
  • [127] T. C. Redman (1998) The impact of poor data quality on the typical enterprise. Communications of the ACM 41 (2), pp. 79–83. Cited by: §1.2.
  • [128] K. Roberts, M. A. Roach, J. Johnson, J. Guthrie, and S. M. Harabagiu (2012) EmpaTweet: annotating and detecting emotions on twitter.. In LREC, Vol. 12, pp. 3806–3813. Cited by: §2.4.
  • [129] J. Rogstadius, V. Kostakos, A. Kittur, B. Smus, J. Laredo, and M. Vukovic (2011) An assessment of intrinsic and extrinsic motivation on task performance in crowdsourcing markets.. ICWSM 11, pp. 17–21. Cited by: §2.5.2.
  • [130] J. Rogstadius, M. Vukovic, C. Teixeira, V. Kostakos, E. Karapanos, and J. A. Laredo (2013) CrisisTracker: crowdsourced social media curation for disaster awareness. IBM Journal of Research and Development 57 (5), pp. 4–1. Cited by: §2.5.1, §2.5.4.
  • [131] A. Rumshisky (2011) Crowdsourcing word sense definition. In Proceedings of the 5th Linguistic Annotation Workshop, pp. 74–81. Cited by: §2.5.3.
  • [132] E. A. Rundensteiner, A. Koeller, and X. Zhang (2000) Maintaining data warehouses over changing information sources. Communications of the ACM 43 (6), pp. 57–62. Cited by: item 6.
  • [133] F. Sadeghi, S. K. Kumar Divvala, and A. Farhadi (2015) Viske: visual knowledge extraction and question answering by visual verification of relation phrases. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1456–1464. Cited by: item 1.
  • [134] F. Schiliro, A. Beheshti, S. Ghodratnama, F. Amouzgar, B. Benatallah, J. Yang, Q. Z. Sheng, F. Casati, and H. R. Motahari-Nezhad (2018) ICOP: iot-enabled policing processes. In Service-Oriented Computing - ICSOC 2018 Workshops - ADMS, ASOCA, ISYCC, CloTS, DDBS, and NLS4IoT, Hangzhou, China, November 12-15, 2018, Revised Selected Papers, pp. 447–452. Cited by: §2.1.
  • [135] W. Shen, J. Wang, and J. Han (2015) Entity linking with a knowledge base: issues, techniques, and solutions. IEEE Transactions on Knowledge and Data Engineering 27 (2), pp. 443–460. Cited by: item 3.
  • [136] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta (2016) Hollywood in homes: crowdsourcing data collection for activity understanding. In European Conference on Computer Vision, pp. 510–526. Cited by: §2.5.4, §3.1.1.
  • [137] R. Silvola, O. Jaaskelainen, H. Kropsu-Vehkapera, and H. Haapasalo (2011) Managing one master data–challenges and preconditions. Industrial Management & Data Systems 111 (1), pp. 146–162. Cited by: §2.2.2.
  • [138] J. A. Smith and M. Herriot (2017) Positioning health promotion as a policy priority in australia. Health Promotion Journal of Australia 28 (1), pp. 5–7. Cited by: §4.3.
  • [139] A. J. Soto, C. Ryan, and F. P. Silva Data Quality Challenges in Twitter Content Analysis for Informing Policy Making in Health Care. pp. 10. Cited by: §1.2, §4.1, §5.1.
  • [140] P. M. Sowa, S. Kault, J. Byrnes, S. Ng, T. Comans, and P. A. Scuffham (2018) Private health insurance incentives in australia: in search of cost-effective adjustments. Applied health economics and health policy 16 (1), pp. 31–41. Cited by: §4.3.
  • [141] S. Stieglitz, M. Mirbabaie, B. Ross, and C. Neuberger (2018) Social media analytics - challenges in topic discovery, data collection, and data preparation. International Journal of Information Management 39, pp. 156–168. Cited by: §1.1, §4.1.
  • [142] S. Stieglitz, M. Mirbabaie, B. Ross, and C. Neuberger (2018) Social media analytic-challenges in topic discovery, data collection, and data preparation. International journal of information management 39, pp. 156–168. Cited by: §2.1.
  • [143] M. Stonebraker, D. Bruckner, and I. F. Ilyas (2013) Data curation at scale: the data tamer system. In CIDR, pp. 10. Cited by: §2.2.2, §2.2, §2.6, Table 2.2.
  • [144] L. V. Subramaniam, S. Roy, T. A. Faruquie, and S. Negi (2009) A survey of types of text noise and techniques to handle noisy text. pp. 115. External Links: ISBN 978-1-60558-496-6 Cited by: §2.1.
  • [145] Y. J. Sun, M. C. Barukh, B. Benatallah, and S. Beheshti (2015) Scalable saas-based process customization with casewalls. In Service-Oriented Computing - 13th International Conference, ICSOC 2015, Goa, India, November 16-19, 2015, Proceedings, pp. 218–233. Cited by: §2.1.
  • [146] J. Surowiecki, M. P. Silverman, et al. (2007) The wisdom of crowds. American Journal of Physics 75 (2), pp. 190–192. Cited by: §2.5.1.
  • [147] A. Tabebordbar and A. Beheshti (2018) Adaptive rule monitoring system. In Proceedings of the 1st International Workshop on Software Engineering for Cognitive Services, SE4COG@ICSE 2018, Gothenburg, Sweden, May 28-2, 2018, pp. 45–51. Cited by: §2.1.
  • [148] Y. Tong, C. C. Cao, C. J. Zhang, Y. Li, and L. Chen (2014) Crowdcleaner: data cleaning for multi-version data on the web via crowdsourcing. In Data Engineering (ICDE), 2014 IEEE 30th International Conference on, pp. 1182–1185. Cited by: §2.5.4, §3.1.1.
  • [149] R. Troncy (2016) Linking entities for enriching and structuring social media content. In Proceedings of the 25th International Conference Companion on World Wide Web, pp. 597–597. Cited by: item 4.
  • [150] T. L. Tuten and M. R. Solomon (2017) Social media marketing. Sage. Cited by: §2.1, §5.1.
  • [151] J. Van Dijck and T. Poell (2013) Understanding social media logic. Media and communication 1 (1), pp. 2–14. Cited by: §2.1, §4.1.
  • [152] J. Wang, T. Kraska, M. J. Franklin, and J. Feng (2012) Crowder: crowdsourcing entity resolution. Proceedings of the VLDB Endowment 5 (11), pp. 1483–1494. Cited by: §2.5.4, §3.1.1.
  • [153] R. Y. Wang, M. P. Reddy, and H. B. Kon (1995) Toward quality data: an attribute-based approach. Decision support systems 13 (3-4), pp. 349–372. Cited by: §2.3.1.
  • [154] R. Y. Wang and D. M. Strong (1996) Beyond accuracy: what data quality means to data consumers. Journal of management information systems 12 (4), pp. 5–33. Cited by: §2.3.1, §2.3.1.
  • [155] S. Watts, G. Shankaranarayanan, and A. Even (2009) Data quality assessment in context: a cognitive perspective. Decision Support Systems 48 (1), pp. 202–211. Cited by: §2.3.1.
  • [156] E. Yakel (2007) Digital curation. OCLC Systems & Services: International digital library perspectives 23 (4), pp. 335–340. Cited by: Table 2.2.
  • [157] S. Yakhchi, S. M. Ghafari, and A. Beheshti (2018) CNR: cross-network recommendation embedding user’s personality. In Data Quality and Trust in Big Data - 5th International Workshop, QUAT 2018, Held in Conjunction with WISE 2018, Dubai, UAE, November 12-15, 2018, Revised Selected Papers, pp. 62–77. Cited by: §2.1.
  • [158] S. Ye and S. F. Wu (2010) Measuring message propagation and social influence on twitter.com. In International Conference on Social Informatics, pp. 216–231. Cited by: §2.1.
  • [159] M. Zappavigna (2012) Discourse of twitter and social media: how we use language to create affiliation on the web. Vol. 6, A&C Black. Cited by: §4.3.
  • [160] M. Zook, M. Graham, T. Shelton, and S. Gorman (2010) Volunteered geographic information and crowdsourcing disaster relief: a case study of the haitian earthquake. World Medical & Health Policy 2 (2), pp. 7–33. Cited by: §2.5.1.