A Total Error Framework for Digital Traces of Humans

07/18/2019
by Indira Sen et al.

The interactions and activities of hundreds of millions of people worldwide are recorded as digital traces every single day. When pulled together, these data offer increasingly comprehensive pictures of both individuals and groups interacting on different platforms, but they also allow inferences about broader target populations beyond those platforms, representing an enormous potential for the Social Sciences. Notwithstanding the many advantages of digital traces, recent studies have begun to discuss the errors that can occur when digital traces are used to learn about humans and social phenomena. Incidentally, many similar errors also affect survey estimates, which survey designers have been addressing for decades using error conceptualization frameworks such as the Total Survey Error Framework. In this work, we propose a conceptual framework to diagnose, understand and avoid errors that may occur in studies that are based on digital traces of humans leveraging the systematic approach of the Total Survey Error Framework.


Introduction

For decades, the empirical social sciences have relied on surveying samples of individuals drawn from well-defined populations, e.g. general national populations, as one of their main data sources when investigating social phenomena. An accompanying development was the constant improvement of methods as well as statistical tools to collect and analyze survey data. These days, survey methodology can be considered an academic discipline of its own. It has distilled its long history of research into identifying and analyzing the various errors that occur in the statistical measurement of collective behavior and attitudes as well as in generalizing to larger populations. The Total Survey Error framework (TSE) provides a conceptual structure to identify, describe, and quantify the errors of survey estimates (Biemer, 2010; Groves and Lyberg, 2010; Weisberg, 2009; Groves et al., 2011). While it does not exist in one single canonical form, the tenets of the TSE are stable and provide survey designers with a guideline for balancing cost and efficacy of a potential survey and, not least, a common vocabulary to identify error sources in their research design from sampling to inference. In the remainder, we will refer to the concepts of the TSE as put forth by Groves et al. (2011), p. 48.

Figure 1:

Potential measurement and representation errors in a digital trace based study lifecycle. Errors are classified according to their sources: errors of measurement (due to how the construct is measured from nonreactive digital traces) and errors of representation (due to generalizing from the population studied to the target population of interest). [Best viewed in color.]

Icons used in this image have been designed by Becris, Elias Bikbulatov and Pixel perfect from www.flaticon.com.

Nowadays, digital traces of human behavior that have not been collected in a scientifically pre-designed process but are captured by web-based platforms and other digital technologies such as smartphones, fitness devices, RFID mass transit cards or credit cards are becoming increasingly available to researchers. Research based on digital trace data is frequently referred to as “Social Sensing” – i.e., studies that repurpose individual users of technology and their traces as sensors for larger patterns of behavior or attitudes in a population (An and Weber, 2015; Sakaki, Okazaki, and Matsuo, 2010). It is especially the often easily accessible data from social media and other platforms on the World Wide Web that has become of heightened interest to scientists in various fields aiming to understand, explain or predict human behavior (Watts, 2007; Lazer et al., 2009; Salganik, 2017). Researchers refer to digital trace data by different names, including social data (Olteanu et al., 2019; Alipourfard, Fennell, and Lerman, 2018) and big data (Salganik, 2017; Pasek et al., 2019).

Besides studying platforms per se to understand user behavior and the societal implications of web platforms (see e.g. DiMaggio et al. (2001); Robinson (2011); Hampton and Wellman (2003)), digital trace data promises inferences to broad target populations similar to surveys, but at a lower cost and with larger samples (Olteanu et al., 2019; Salganik, 2017; Boyd and Crawford, 2012). In addition to increasing scale and decreasing costs, it is also generally available much sooner after or even during events (e.g. elections, disasters) compared to survey responses, which makes it particularly interesting for studying reactions to unforeseeable events which surveys can only ask about in retrospect. But digital trace data also comes with various challenges: the data is captured as people make use of specific platforms; data access is regulated by platform owners and is often nontransparent; “platform owners are usually not concerned with maintaining instrumentation consistency for the benefit of research” (Howison, Wiggins, and Crowston, 2011); the design and features of the platform, the community around the platform and their behavior, as well as the community around other similar platforms, may change over time and impact the behavior of users; lastly, humans and non-humans can self-select into the platform population (even multiple times). Therefore, this data may be incomplete and biased in many ways, which raises epistemological concerns that are accompanied by ethical and legal concerns as well as the typical methodological challenges of big data studies (Tufekci, 2014; Olteanu et al., 2019; Ruths and Pfeffer, 2014). We refer to error as the difference between the obtained value and the true value we want to measure, while bias refers to systematic errors, following the definitions in Weisberg’s ‘The Total Survey Error Approach’, p. 22 (Weisberg, 2009). While not all of these issues can be mitigated, they can be documented and examined for each particular study that makes use of digital traces, to understand the limits and generalizability of the study’s insights. Only by developing a thorough understanding of the limitations of a study can we make it comparable with other studies.

Figure 2: Total Survey Error Components Linked to Steps in the Measurement and Representational Inference Process (Groves et al., 2011)

Our Contributions. In this work, we put forth a framework that allows researchers to describe, analyze and mitigate errors that occur in digital trace based studies which aim to make inferences about a theoretical construct (see Figure 1) in a larger target population of entities beyond the platforms providing the digital traces. Our work adds to the growing literature on identifying errors in digital trace based research (Olteanu et al., 2019; Tufekci, 2014; Hsieh and Murphy, 2017; Ruths and Pfeffer, 2014; Lazer, 2015) by highlighting these errors through the lens of survey methodology and leveraging its systematic approach. We believe that by establishing a connection between errors in digital trace based studies and the TSE, we can establish a shared vocabulary for social scientists and computational social scientists and help them document, communicate and compare their research. Further, we can make use of error identification methods and solutions developed in survey methodology and increase their utility for researchers familiar with the TSE. To establish this connection, we map errors to their pertinent counterparts in the Total Survey Error framework where applicable and, on the other hand, describe new types of errors that can arise. To that end, we highlight the distinction between two main sources of errors – measurement and representation errors – as done in the TSE (cf. Figure 2). Next, we link errors to the design decision steps inherent to conducting observational studies using digital traces: errors due to the conceptualization of theoretical constructs, the selection of platforms that record users’ behavior, the data collection strategy, data preprocessing and data analysis, as shown in Figure 1. Finally, we give suggestions on how to avoid these errors and discuss the applicability of our error framework for digital trace data. Our framework is mainly inspired by – but not limited to – social media data such as Twitter, Facebook, and Reddit data as well as other data collected on web platforms such as search engine queries, used in “Social Sensing” studies.

Related Work

Our error framework for digital traces is inspired by insights from two disciplines: (i) survey methodology, whose methods have been developed and refined for decades, and (ii) the relatively new but rapidly developing field of computational social science, which has increasingly sought to understand the uses of newer forms of data emerging in the digital age. In this section, we discuss some of the main threads of research spanning these fields.

Survey Methods. The Total Survey Error Framework is an amalgamation of efforts to consolidate the different errors in the survey pipeline (Groves and Lyberg, 2010; Biemer and Christ, 2008). Recently, researchers have explored the efficacy of non-probability sampled surveys such as opt-in web surveys (Goel, Obeng, and Rothschild, 2015, 2017), finding that adjustment methods can improve estimates even when representation errors are present. In a similar vein, Kohler, Kreuter, and Stuart (2019) find that nonprobability samples can be effective in certain cases. Meng (2018) poses the question “Which one should we trust more, a 5% survey sample or an 80% administrative dataset?” and introduces the concept of the ‘data defect index’ to make surveys more comparable. Researchers have also attempted to integrate survey data and digital traces to address social scientific questions (Stier et al., 2019).

The Potentials of Digital Traces. Web and social media data have been leveraged to ‘predict the future’ in many areas such as politics, health, and economics (Askitas and Zimmermann, 2015; Phillips et al., 2017). They are also considered a means for learning about the present, e.g., for gaining insights into human behavior and opinions, such as using search queries to examine agenda-setting effects (Ripberger, 2011), leveraging Instagram to detect drug use (Yang and Luo, 2017), and measuring consumer confidence through Twitter (Pasek et al., 2018). Acknowledging the potential advantages of digital trace data, the American Association for Public Opinion Research (AAPOR) has commissioned reports to understand the feasibility of using digital trace data for public opinion research (Japec et al., 2015; Murphy and others, 2018).

The Pitfalls of Digital Traces. There is important work aimed at uncovering errors that arise from using online data or, more generally, various kinds of digital trace data. Recently, researchers have studied the pitfalls related to political science research using digital trace data (Gayo-Avello, 2012; Metaxas, Mustafaraj, and Gayo-Avello, 2011; Diaz et al., 2016), with Jungherr et al. (2017) focusing especially on issues of validity. Researchers have also analysed biases in digital traces arising due to demographic differences (Pavalanathan and Eisenstein, 2015; Olteanu, Weber, and Gatica-Perez, 2016), platform effects (Malik and Pfeffer, 2016) and data availability (Morstatter et al., 2013; Pfeffer, Mayer, and Morstatter, 2018). More generally, Olteanu et al. provide a comprehensive overview of the errors and biases that can potentially affect studies based on digital behavioral data and outline these errors along an idealized study framework (Olteanu et al., 2019), while Tufekci outlines errors that can occur in Twitter-based studies (Tufekci, 2014). Recently, Jungherr (2017) has called for conceptualizing a measurement theory that adequately accounts for the pitfalls of digital traces.

Addressing and Accounting for the Pitfalls of Digital Traces. Recently, researchers working with digital traces have used techniques typically employed in surveys to correct representation errors (Zagheni, Weber, and Gummadi, 2017; Fatehkia, Kashyap, and Weber, 2018; Wang et al., 2019). Besides addressing specific biases, researchers have also attempted to identify and document the different errors in digital traces as a first step towards addressing them. In addition to an overview of biases in digital trace data, Olteanu et al. provide a framework using an idealized pipeline to enumerate various errors and how they may arise (Olteanu et al., 2019). While highly comprehensive, they do not establish a connection with the Total Survey Error framework (Groves and Lyberg, 2010), despite the similarities to the errors that plague surveys. Ruths and Pfeffer prescribe actions that researchers can follow to reduce errors in social media data in two steps: data collection and methods (Ruths and Pfeffer, 2014). We extend this work in two ways: (1) by expanding our understanding of errors beyond social media platforms to other forms of digital trace data and (2) by taking a deep dive into more fine-grained steps to understand which design decisions by a researcher contribute to which kinds of error. Hsieh and Murphy conceptualize the Total Twitter Error (TTE) framework, which describes three kinds of errors in studying unstructured Twitter textual data that can be mapped to survey errors: coverage error, query error and interpretation error (Hsieh and Murphy, 2017). The authors also provide an empirical study of inferring attitudes towards political issues from tweets and limit their error framework to studies which follow a similar lifecycle. We aim to extend this framework to more diverse inference strategies, beyond textual analysis, as well as to social media platforms other than Twitter and other forms of digital trace data. Finally, the AAPOR report on public opinion assessment from ‘Big Data’ sources (which include large data sources other than digital trace data, such as traffic and infrastructure data) describes a way to extend the TSE to such data, but cautions that a potential framework will have to account for errors that are specific to big data (Japec et al., 2015). We restrict our error framework to digital trace data from the web collected in a typical social sensing study pipeline, highlight where and how some of the new types of errors may arise, and discuss how researchers may tackle them. Recent efforts to document issues in using digital trace data include “Datasheets for Datasets”, proposed by Gebru et al. (2018), where each dataset is accompanied by a datasheet that documents its motivation, composition, collection process, and recommended uses. We propose a similar strategy to document digital trace based research designs to enable better communication, reproducibility, and reuse of said data.

In the following sections, we begin by discussing measurement and representation concerns of research based on digital trace data, before looking at the different phases of conducting such a study and outlining the respective sources of errors that may appear at each stage.

Surveys and Digital Traces: Similarities and Dissimilarities

It is crucial to understand the differences and commonalities between a typical survey and a typical digital trace data study of human behavior in order to develop an understanding of the limitations and strengths of both data types for social science research (cf. Figure 3).

Figure 3: The divergence between survey estimates and digital trace data estimates. The primary point of difference is that digital trace data is entirely nonreactive due to the lack of a solicitation of responses. In surveys, users are surveyed, and their responses are used to construct the final estimate, whereas, in digital traces, a subset of entities (which are usually users) of a platform or their signals are used to construct the estimate. Icons used in this image have been designed by Hadrien, Becris, Freepik, Smartline, Pixel perfect and prettycons from www.flaticon.com.

The life-cycle of a survey from a quality perspective is depicted by Groves et al. (2011) in Figure 2. Researchers usually use a sampling frame as a stand-in for the target population of interest. Since it is not feasible to survey the entire population, potential respondents are sampled from the sampling frame; to guarantee generalization to a well-defined population, the sampling design (the process of selecting the sample) should be probabilistic. The survey instrument or questionnaire is constructed so that the responses to it allow the researchers to measure the theoretical construct they are interested in. Next, the sampled units are contacted with the questionnaire, and their responses are recorded. In this step, some of the sampled units may not respond to the questionnaire, so the final set of respondents might deviate from the sample. Finally, the responses are processed, coded, and aggregated into the final survey estimate.

While digital traces are often used to measure the same things as surveys do, they proceed differently. Like in a survey, the researcher defines the ideal measurement which will measure the theoretical construct. She then picks the source of the digital traces, often one or more web platforms. In this case, the platform(s) chosen act(s) as the sampling frame. Depending on data accessibility, all entities on the platform, as well as the signals produced by them, may be available to the researcher. (Signals here refer to content – tweets on Twitter, posts on Facebook or Instagram, boards on Pinterest – or interactions – likes, retweets, friending – generated by entities on digital platforms; we use the terms signals and traces interchangeably.) A specific distinction between digital traces and surveys is that researchers can often have access to the ‘census’ of a platform rather than a sample of it. Therefore, there is often no need for sampling due to logistical infeasibility (as is typically done in surveys). Though the entire sampling frame or census is available, the researcher may still sample signals due to restrictions imposed by data providers (the best-known example being Twitter’s restricted-access streaming APIs) and/or other technical, legal or ethical restrictions. Therefore, depending on her needs as well as the construct being studied, the researcher devises a data collection or sampling strategy – which usually involves a set of queries that further subset the platform data, producing the final sample that will be used for the study. (For the rest of the paper, we use the term ‘platform data’ to refer to the entities on a particular platform and the signals they generate.) The signals or the entities that make up the sample can be further filtered or preprocessed, before finally undergoing coding (usually automated) and being aggregated to produce the final estimate.

Note that while both processes have some overlapping stages (for example, coding and aggregation), the first, primary point of difference is that digital trace data is typically nonreactive due to the lack of a solicitation of responses. This has the advantage that researchers can directly observe how entities behave rather than asking them and thereby relying only on self-reported data. It has the disadvantage that researchers run the risk of misunderstanding these signals, a risk that can be reduced in a survey by constructing better questionnaires. Much of the divergence between surveys and digital trace based studies rests on how we may effectively use potentially noisy, unsolicited signals to understand theoretical constructs while keeping in mind that these signals may not be effective proxies for that construct. The second point of divergence is that the difference between units of observation and units of analysis is often more blurred in digital trace based studies than in surveys, and their distinction is harder to draw. Units of observation are the level at which data is collected, and units of analysis are the level at which the inference is made (Sedgwick, 2014). In survey based social science studies, the unit of observation and analysis is, in most cases, an individual who is filling in a questionnaire. In digital trace based studies, even though the unit of analysis is also often an individual (i.e., researchers are interested in individuals’ behavior, opinions, etc.), the units of observation are commonly not individuals. More commonly, the units of observation are various signals that, although produced by individual entities (e.g., Twitter users), are collected independently of the entities based on other query criteria, such as tweets collected by hashtags (Tufekci, 2014), keywords (O’Connor et al., 2010) or location information (Bruns and Weller, 2016). Failing to account for the difference between the two units may lead to ecological fallacies or reductionism (Freedman, 1999). These pitfalls manifest even more strongly in studies using digital trace data since the units of observation and the units of analysis are often not clearly defined. In the next section, we explain the two main sources of errors in surveys and how we may extend them to digital traces.
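To make this distinction concrete, the following minimal Python sketch (with made-up stance-coded posts) contrasts a signal-level aggregate, in which every post counts equally, with an entity-level aggregate that first collapses each user's posts into one value so that highly active accounts do not dominate the estimate.

```python
from collections import defaultdict

# Hypothetical (user_id, stance) pairs, e.g. tweets coded as 1 = approve, 0 = disapprove.
signals = [
    ("alice", 1), ("alice", 1), ("alice", 1), ("alice", 1),  # one very vocal user
    ("bob", 0),
    ("carol", 0),
]

# Unit of observation = signal: every tweet counts equally, vocal users dominate.
signal_level = sum(s for _, s in signals) / len(signals)

# Unit of analysis = entity: aggregate each user's signals first, then average users.
per_user = defaultdict(list)
for user, stance in signals:
    per_user[user].append(stance)
entity_level = sum(sum(v) / len(v) for v in per_user.values()) / len(per_user)

print(f"signal-level estimate: {signal_level:.2f}")  # 0.67, driven by alice
print(f"entity-level estimate: {entity_level:.2f}")  # 0.33, one "vote" per user
```

Which of the two aggregates is appropriate depends, of course, on whether the study's inferential claims are about posts or about people.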

A General Distinction: Measurement and Representation Errors

For surveys, the literature aimed at understanding and aggregating errors focuses on multiple dimensions, most prominently the source of errors – the distinction between measurement errors and representation errors – or the type of errors – the distinction between systematic errors (biases) and random errors (variance). For our framework, we focus on the former, following Groves et al.'s approach of making an overall distinction between, firstly, errors in defining and measuring a theoretical construct with chosen indicators (measurement errors) and, secondly, errors arising when inferring from a (non-representative) sample population to the defined target population (representation errors), as illustrated in Figure 2 (Groves et al., 2011). We adopt this specific distinction as it helps to conceptually untangle the different fallacies potentially plaguing surveys as well as related research designs with non-designed, non-reactive, and non-probabilistic digital data. Further, note that errors can be both systematic (biased) and randomly varying at every step of the inferential process we describe in the remainder.

Measurement error pertains to “the extent to which a given test/instrumentation is an effective measure of a theoretical construct” (Straub, Boudreau, and Gefen, 2004; emphasis ours). Accordingly, the first step of a survey-driven – and any similar – study requires defining the theoretical construct of interest and establishing a theoretical link between the captured data and the construct (Howison, Wiggins, and Crowston, 2011). Survey researchers usually start by defining the main construct of interest (e.g., “political leaning”, “personality”, “attitude towards X”) and potentially related constructs. This is followed by the development or reuse of scales (i.e., sets of questions and items) able to measure the construct adequately. In developing scales, content validity, convergent construct validity, discriminant construct validity, internal consistency, as well as other quality marks, are checked (cf. Straub, Boudreau, and Gefen (2004)). The design of items usually follows a fixed, pre-defined research question and theorized constructs (notwithstanding some adaptations after pre-testing), followed by fielding the actual survey. Groves et al. (2011) further point out response errors that arise when recording the responses of survey participants even when an ideal measurement has been found, i.e., during the solicitation of actual information in the field, which can be hampered by social desirability and other issues, and lastly processing errors introduced when processing data, such as coding textual answers into quantitative indicators and data cleaning. Besides validity, individual responses may also suffer from variability over time or between participants, contributing to low reliability.

At each design decision step of the research process with digital traces, investigators can likewise commit related measurement errors and must take similar precautions. In the remainder, we describe a process pipeline (see Figure 1) that researchers typically follow in digital trace based studies and link the decisions with the respective errors that are likely to arise. Note that in an ideal world, that design pipeline would be followed sequentially, whereas in reality, researchers will start their process at different points in the pipeline, e.g., get inspiration in the form of an early-stage research question through a platform’s available data and its structure, and then iteratively refine the several steps of their research process. This is a notably different premise vis-à-vis survey-based research, since the data is largely given by what a platform stores and what is made available to the researcher – hence, the theoretical fitting often happens post hoc (Howison, Wiggins, and Crowston, 2011). A major difficulty is hence that, to avoid measurement errors, instead of designing a survey scale, web trace studies must consider several steps of the process at once: (i) the definition of a theoretical construct aligned with an ideal measurement, (ii) the platform to be chosen, (iii) ways to extract (subsets of) data from the platform, (iv) ways to process the data, and (v) the actual measurement of the construct through manifest indicators extracted from the data (Lazer, 2015). In the following sections, we will delve into each of those steps and detail the effect of each choice made by the researcher on the resulting measurement errors.

Representation errors, on the other hand, are made whenever conditions exist or are created that assign unequal chances to individual target population units of being represented in the eventual aggregated measurement/indicator statistics of interest, even if a theoretically perfect way of measuring a construct were used.

Survey design begins its quest for unbiased representation by clearly defining the target population that the construct of interest should be measured for and inferred to, e.g., the national population of a nation-state. Then, a sampling frame is defined – the best available approximation of all units in the target population, e.g., telephone lists or (imperfect) population registers – whose under- or over-coverage of population elements constitutes coverage error. Ineligible units, such as business telephone numbers, might add noise to the sampling frame. Note that coverage error “exists before the sample is drawn and thus is not a problem arising because we do a sample survey” (Groves et al., 2011). Sampling error then comes into play when a subset of units is selected from the sampling frame, giving some members of the frame lower chances of being chosen than others. Importantly, sampling error can only arise when random (or another strategy of) sampling has to be executed because there is no feasible way of reaching all elements in the sampling frame – if one could access the complete frame with minimal cost, sampling (error) would not occur. Further, if individuals drawn as part of the sample refuse to answer the whole survey, we speak of unit nonresponse errors, or item nonresponse errors if only certain items are not collected from certain respondents (e.g., if they break off the survey process after some questions). While in most cases providing insufficient responses to items hinders valid inferences regarding the topical research questions, nonresponse to demographic items can also hinder post-survey adjustment of representation errors. (While not mentioned explicitly by Groves, this affects the adjustment step and becomes much more important when working with digital traces.) Lastly, Groves et al. (2011) list adjustment error, occurring when reweighting is applied post-survey to under- or over-represented cases due to any of the representation errors described above. The reweighting is usually based on socio-demographic attributes of individuals and often on their belonging to a certain stratum.

Again, working with digital trace data requires reflecting on representation errors in all process steps. Large differences with regard to representation, compared to surveys, lie in (i) the a priori given sampling frame that is restricted by the platform and usually induces a large coverage error if the target population is off-platform (non-probabilistic), and (ii) the lack of a specific request stimulus in found data (non-reactive). Also, research studies based on digital trace data often do not clearly define what their actual unit of analysis is, e.g., individual humans, organizations, or separate signals (like posts, tweets, likes, etc.). This frequently limits the understanding of what the eventual aggregate statistic represents, since it might either be an aggregate of signals without taking into account the entities that created them, or take these entities into account, e.g., by aggregating social media posts per user profile. This can lead to methodological as well as epistemological issues. We want to highlight, in fact, that for any study with non-reactive, non-probabilistic digital trace data that aims to make inferences about a target population of entities (e.g. humans), the “units of analysis” should always be these entities, not signals or single expressions of those entities. For most of these entities, single user accounts on platforms serve as their digital representation. Accordingly, in the remainder, “entities” or “units of analysis” will in many cases refer to user accounts as proxies for the actual individuals of a target population.

In the following sections, we will discuss the process steps of the research pipeline described in Figure 1 and will refer to the different kinds of measurement and representation errors lurking in each step.

Definition of Construct and Ideal Measurement

Constructs are abstract ‘elements of information’ (Groves et al., 2011) that a scientist attempts to quantify through the definition of the construct, minding its translation into an ideal measurement, followed by recording responses through the actual (survey) instrument and, finally, by analysis of the responses. A non-definition or under-definition of the construct, or a mismatch between the construct and the envisioned measurement, corresponds to issues of validity. Given that digital trace data from any particular platform is not specifically produced for the inferential study, researchers must establish a link between the behavior that is observable on the platform and the theoretical construct of interest. The first step of transforming a construct into a measurement involves thinking about competing and related constructs, in the best case rooted in theory. Next, one has to deliberate whether a potential ideal measurement is feasible to extract from the given data and sufficiently captures the construct, and whether the envisioned measurement does not also – or instead – capture other constructs, i.e., think about the convergent vs. discriminant validity of a measurement (Jungherr et al., 2017).

Since digital trace based studies might proceed in a non-linear fashion, researchers may or may not begin their study with a theoretical construct and a pre-defined measurement in mind. Sometimes a researcher might start with a construct but re-evaluate it, and the corresponding measurement, throughout the study depending on the nature of the digital traces. Since the data is largely given by what the platform/system stores, what is available to the public and/or what can be accessed via Application Programming Interfaces (APIs), it may require rethinking the original construct and its definition. Another alternative, which Salganik (2017) describes as the “ready-made” approach, is to start with a platform or dataset and then envision constructs that can be studied from that particular platform or dataset.

Example: Construct Definition. An example of a construct that researchers have aimed to measure with digital trace data is presidential approval. Whereas in a survey one expresses the defined construct as directly as possible in the questionnaire (“Do you approve or disapprove of the way Donald Trump is handling his job as president?” – a question that has remained largely unchanged since its inception: https://news.gallup.com/poll/160715/gallup-daily-tracking-questions-methodology.aspx), a digital trace data researcher may consider the act of tweeting positively about the president to be equivalent to approval (O’Connor et al., 2010; Pasek et al., 2019). While positive mentions may indicate approval, researchers should also investigate whether the tweets focus on the presidential role. Due to the unsolicited nature of Twitter data, it can be difficult to disentangle whether tweets are targeted towards presidential or private activities. While comments about the president’s private life may indirectly impact approval ratings, they do not directly measure how the president is handling his job, thereby weakening the measurement.

Platform Selection

In selecting a platform, the researcher needs to ensure the general existence of a link between digital trace data that is observable on the platform and the theoretical construct of interest – irrespective of whether she begins the study with a theoretical construct and defined measurement, or defers that to later stages. She, however, also needs to account for the (almost guaranteed) divergence between the target population of interest and the platform population and the impact of the platform and its community on the observable traces. In this section, we discuss the errors that may occur due to the chosen platform(s).

Platform Coverage Error. The gap between the target population and the platform population is the platform coverage error, a representation error. It is related to coverage error in the TSE, as the “sampling frame” of a given platform is usually not aligned with the target population; the difference in digital trace based research is that it is set ex ante and is unchangeable by the researcher for a given platform. Different web and social media platforms, for instance, exhibit variable inclusion probabilities (see http://www.pewinternet.org/2018/03/01/social-media-use-in-2018/). Twitter’s demographics, as a particular example, tend to be very different from population demographics (Mislove et al., 2011; Blank, 2017), while Reddit users are predominantly young, Caucasian and male (https://www.pewinternet.org/2013/07/03/6-of-online-adults-are-reddit-users/). Population discrepancy could be due to differences in internet penetration or social media adoption rates across socio-demographic or geographical groups, independent of the particular platform (Wang et al., 2019). Secondly, particular platforms attract specific audiences, for instance, because of topical or technological idiosyncrasies.

Example: Platform Coverage. To illustrate platform selection errors, we again turn to our running example of measuring presidential approval using social media data. Setting aside data collection abilities, if we have access to all posts about the president from social media platforms, our sampling frame is restricted to users who have chosen or self-selected to express their opinion on social media. These respondents may not be a uniform representation of the adult population. Indeed, those who are not on social media or do not express their political opinions on social media might have radically different views. If there is a systemic bias in opinion expression on social media, for example, that political conservatives are less likely to be vocal, then the resulting nonresponse bias can lead to highly misleading estimates (Fischer and Budescu, 1995). Platform coverage error can thus be thought of as a counterpart to coverage error. Researchers, inspired by survey methodology, may reweight participants by socio-demographics (directly available or inferred) to potentially obtain a representative sample (Pasek et al., 2018; Locker, 1993; An and Weber, 2015) (see Section “Analysis and Inference”), though the efficacy of these correction methods depends on the nature of the self-selection of users (Schnell, Noack, and Torregroza, 2017).
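As an illustration of what such a reweighting step could look like, here is a minimal post-stratification sketch in Python, assuming that each sampled account carries an (observed or inferred) demographic group and that trustworthy population shares for those groups are available from, e.g., census data; the group labels, shares and approval scores below are invented for illustration.

```python
import pandas as pd

# Hypothetical platform sample: one row per user, with an inferred demographic
# group and an individual approval score (e.g., share of positive posts).
sample = pd.DataFrame({
    "group":    ["18-29", "18-29", "18-29", "30-49", "50+"],
    "approval": [0.2, 0.4, 0.3, 0.6, 0.7],
})

# Assumed population shares per group (e.g., taken from census data).
population_share = {"18-29": 0.20, "30-49": 0.35, "50+": 0.45}

# Post-stratification weight: population share divided by sample share.
sample_share = sample["group"].value_counts(normalize=True)
sample["weight"] = sample["group"].map(
    lambda g: population_share[g] / sample_share[g]
)

unweighted = sample["approval"].mean()
weighted = (sample["approval"] * sample["weight"]).sum() / sample["weight"].sum()
print(f"unweighted estimate:      {unweighted:.3f}")
print(f"post-stratified estimate: {weighted:.3f}")
```

Such an adjustment can only help to the extent that, conditional on the adjustment variables, on-platform users resemble the off-platform target population, which is precisely the caveat raised by Schnell, Noack, and Torregroza (2017).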

Platform Affordances Error. Platform-specific socio-cultural norms – implicit ones or those explicated as community guidelines – and the platform’s design and technical constraints may impact the behavior of users and lead to measurement errors, which we summarize as platform affordances error. For example, Facebook recommends “people you may know” and thereby impacts the friendship links that people create (Malik and Pfeffer, 2016), while Twitter’s 280-character limit on tweets can influence the writing style of Twitter users (Gligorić, Anderson, and West, 2018); perceived or explicated community norms can likewise influence what and how users post, for example, politically conservative users may be less open about their opinion on a platform they regard as unwelcoming of conservative statements. A major challenge for digital trace data based studies is to disentangle what Ruths and Pfeffer (2014) call “platform-driven behavior” from psychological behavior.

Just as question wording or social desirability in surveys may influence answers, users may alter their behavior in response to technical and social changes of the platform. Changing norms or technical settings may also affect the validity of longitudinal studies (Bruns and Weller, 2016), since these changes may cause “behavioral drifts” (Salganik, 2017), contributing to what Lazer (2015) describes as “construct instability”.

Example: Platform Affordances. Platform norms such as the character limit (a technical affordance) or terms of service and cultural norms can also inhibit how and to what extent users express their opinion about the president, leading to platform affordances errors. For example, users may have to write terse tweets or a thread consisting of multiple tweets to express their opinion on Twitter.

Data Collection

Table 1: Different types of feature-based data collection strategies, their definitions, and examples.

Keyword – Using keywords (terms, hashtags, image tags, regular expressions) to subset signals such as posts (tweets, comments, images) or entities (users). Examples: predicting influenza rates from search queries (Yuan et al., 2013); understanding the use and effect of psychiatric drugs through Twitter (Buntain and Golbeck, 2015).

Attribute – Using attributes such as location or community affiliation to subset entities, e.g., users who declare the relevant attribute in their biography. Examples: inferring demographic information through mobility patterns on photo-sharing platforms (Riederer et al., 2015); geographic panels used to study responses to mass shootings and TV advertising (Zhang, Hill, and Rothschild, 2016).

RDG – Generating random digits and using them as identifiers of platform entities. Examples: studying the collective privacy behaviour of users (Garcia et al., 2018); understanding the demographics and voting behaviour of Twitter users (Barberá, 2016).

Structure – Using structural properties of entities or signals to select data, such as interactions (retweeting, liking, friending). Examples: understanding the influence of users (Cha et al., 2010); predicting the political affinity of Twitter users based on their mention networks (Conover et al., 2011).

After choosing a platform, the next step of a digital trace based study consists of collecting data, e.g., through official APIs, web scraping, or collaborations with platform/data providers. Then, even if the full data of recorded traces is in principle available from the platform, researchers often select a subset of signals, entities or both by querying the data based on explicit features (e.g., keywords, cf. Table 1). This is usually done (i) to discard all signals (e.g., tweets) that are presumed not to carry any information relevant to the construct or (ii) to discard user profiles (or other representations of the units of analysis) that are not related to the elements in the target population. Additionally, if a researcher selects signals (say, tweets with a certain keyword), she may have to collect additional information about the entity generating the signal (the author of the tweet), as sketched below.
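The following Python sketch illustrates this kind of feature-based subsetting on hypothetical post records: a researcher-defined keyword/regular-expression query selects signals, and the authors of the selected signals become the de facto entity sample. The posts, user names and keywords are invented for illustration.

```python
import re

# Hypothetical records of signals already retrieved from a platform;
# in practice these would come from an API, a scrape, or a data dump.
posts = [
    {"id": 1, "user": "u1", "text": "Trump signs new executive order"},
    {"id": 2, "user": "u2", "text": "Lovely weather in Washington today"},
    {"id": 3, "user": "u3", "text": "The president addressed the nation"},
]

# Researcher-defined query: keywords / regular expressions meant to capture
# signals about the construct (here: posts about the US president).
query = re.compile(r"\b(trump|president)\b", flags=re.IGNORECASE)

selected_signals = [p for p in posts if query.search(p["text"])]

# Because signals were selected (not entities), the authors of the selected
# signals form the de facto entity sample and may need extra metadata later.
selected_entities = {p["user"] for p in selected_signals}

print(len(selected_signals), "signals from", len(selected_entities), "entities")
```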

Of course, different platforms have different data access policies: for some platforms, like GitHub or Stack Exchange, the entire history of signals ever generated is available and the researcher can make selections from this full set freely; others, like Facebook Ads or Google Trends, only share aggregated signals; and platforms such as Twitter impose access restrictions that force a selection of a subset of their data, instead of the researcher performing this selection solely based on self-defined features. (Platforms also deal with content deletion by users in different ways, which can affect the resultant samples, a point we hope to explore in future work.)

However, while this “forced” selection from the set of all recorded data of a system can happen, the voluntary query-based reduction of data – on top of any forced selection – is the norm in digital trace studies of human behavior and attitudes, often simply to reduce the data volume one has to process. Below, we therefore discuss two main errors that follow from the deliberate selection process through queries and afterwards address the problem of non-controlled selection.

Signal Selection Error. Typically, researcher-specified queries are used to capture signals broadly relevant to the construct of interest. When the data collection strategy is devised based on the research question or construct of interest (as is often the case in studies based on digital trace data), query choices may lead to measurement errors. The difference between the ideal measurement and the collected signals is termed signal selection error. It is related to measurement error in the TSE – not measurement errors in general, but the specific “measurement error” box on the left side of Figure 2, defined as the difference between the ideal measurement and the response obtained through a survey. Low precision and low recall of queries directly impact the measurement that can be established from the subselection of platform data used. He and Rothschild examine different methods of obtaining relevant political tweets, establishing that bias exists in keyword-based data collection (He and Rothschild, 2016), affecting both the users included in the sample and the sentiment of the resultant tweets.

Example: Signal Selection. To illustrate, say we aim to capture all search queries about the US president through a search engine. If search phrases that mention the keyword “Trump” are collected, phrases unrelated to Donald Trump could also be collected – phrases referring to Melania Trump or Donald Trump Jr. would be included and add noise. The presence of these ineligible signals would decrease the reliability and validity of the measurement. Likewise, relevant tweets might be excluded that simply refer to “president” or “the Donald”. One way of assessing signal selection error is through an analysis of the precision and recall of the queries being used. Ruiz et al. attempt to quantify coverage for various keywords related to a single topic using exact and bounded approximation algorithms (Ruiz, Hristidis, and Ipeirotis, 2014). While low precision can usually be addressed to some extent in subsequent filtering steps after the data collection is finished (see Section “Data Preprocessing”), the non-observation of relevant signals cannot be remedied without repeating the data selection step.
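A minimal sketch of such a precision/recall assessment, assuming a small hand-coded validation set that records, for each signal, whether the query selected it and whether it is actually about the construct (the labels below are invented):

```python
# Hypothetical validation set: for each signal we know whether the query
# selected it and (from manual coding) whether it is actually relevant.
labeled = [
    {"matched_query": True,  "relevant": True},   # "Trump handles the economy..."
    {"matched_query": True,  "relevant": False},  # about Melania Trump
    {"matched_query": False, "relevant": True},   # "the Donald is doing great"
    {"matched_query": False, "relevant": False},  # unrelated chatter
    {"matched_query": True,  "relevant": True},
]

tp = sum(s["matched_query"] and s["relevant"] for s in labeled)
fp = sum(s["matched_query"] and not s["relevant"] for s in labeled)
fn = sum(not s["matched_query"] and s["relevant"] for s in labeled)

precision = tp / (tp + fp)  # how much of the collected data is on-construct
recall = tp / (tp + fn)     # how much of the relevant data the query reached
print(f"precision: {precision:.2f}, recall: {recall:.2f}")
```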

Entity Selection Error. While signals are filtered according to their estimated ability to measure the construct of interest, their exclusion can entail filtering out the entities that produced them (think of Twitter profiles attached to tweets) if no other signals of an entity remain in the subset. (This is true for most data collection strategies except random digit generation, where entities are selected independently of their characteristics.) In this manner, entities with specific attributes (e.g., teenagers) might be filtered out simply because they do not produce specific signals (e.g., they use different terms to refer to the US president). The error incurred due to the gap between the selection of entities and the sampling frame is called the entity selection error. It is a representation error related to sampling error in the TSE. It is also related to coverage error if one considers the researcher-specified query result set as a second sampling frame – collecting data through explicit features is equivalent to defining a boundary around the signals or entities to be used for further analysis (González-Bailón et al., 2014) – where entity selection error is the gap between the query boundaries and the platform population. Of course, this error occurs as well if entities are selected directly by their features (not via their signals), for instance, when removing user profiles deemed irrelevant for inferences to the target population as determined by their indicated race or age. This is especially critical when certain entities are less likely to correctly respond to demographic fields (e.g. female editors in Wikipedia) (Pavalanathan and Eisenstein, 2015).

There are many approaches to collecting data (cf. Table 1), and each comes with different types of signal and entity selection pitfalls. As keyword-based search is a popular choice, Tufekci analyzes how hashtags can be used for data collection on Twitter and finds that hashtag usage tapers off as time goes on – users continue to discuss a certain topic, they merely stop using the pertinent hashtags (Tufekci, 2014). For collecting data related to elections or political opinion, many studies use mentions of the political candidates (Barberá, 2016; O’Connor et al., 2010; Diaz et al., 2016; Stier et al., 2018). While this is a high-precision query, it may have low recall, excluding users who refer to political candidates by nicknames and thereby reducing the sample’s generalizability. On the other hand, it may also include ineligible users who refer to someone with the same name as the politician, in case a candidate’s name is common.

In addition to keyword selection, other criteria for data selection are also in use, e.g., those based on attributes such as location (Bruns and Stieglitz, 2014) or structural characteristics (Demartini, 2007), affiliation to particular subcommunities or lists (Chandrasekharan et al., 2017), and random digit generation (RDG) (Barberá, 2016). Each method has different strengths and weaknesses, with RDG being closest to the random digit dialing method of conducting surveys, although it may generate a very small sample of relevant tweets (Barberá, 2016). Lists of users, subcommunity selection, and selection based on attributes – similar to keywords – restrict the dataset to entities who have either chosen to be on them, have been chosen by other entities to appear on them, or have opted (self-selected) to declare that particular attribute (say, location), often leading to (additional) coverage error since the characteristics of the selected entities may be systematically different from the target population (Cohen and Ruths, 2013). Further, when network data is collected, different crawling and sampling strategies may impact the accuracy of various global and local statistical network measures such as centrality or degree (Galaskiewicz, 1991; Borgatti and Krackhardt, 2006; Kossinets, 2006; Wang et al., 2012; Lee and Pfeffer, 2015; Costenbader and Valente, 2003), the representation of minorities and majorities in the sample network (Wagner et al., 2017) and the estimation of dynamic processes such as peer effects in networks (Yang, Ribeiro, and Neville, 2017). This is especially important if the construct of interest is operationalized with structural measurements (e.g., if political leaning is assessed based on the connectivity between users and politicians, or if extroversion is assessed based on the number of interaction partners).
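The sketch below illustrates, on a synthetic network, how a crude partial crawl can already change which accounts appear most central; the graph model, sampling rate and "top-10 overlap" diagnostic are arbitrary choices made for illustration, not a prescribed procedure.

```python
import random

import networkx as nx

random.seed(42)

# Synthetic "full" interaction network standing in for the complete platform graph.
G = nx.barabasi_albert_graph(n=500, m=3, seed=42)

# Crude node-sampling strategy: keep a random 30% of accounts and the edges
# among them, mimicking a partial crawl.
sampled_nodes = random.sample(list(G.nodes()), k=int(0.3 * G.number_of_nodes()))
G_sample = G.subgraph(sampled_nodes)

full_centrality = nx.degree_centrality(G)
sample_centrality = nx.degree_centrality(G_sample)

# Compare the top-10 "most central" accounts under each view of the network.
top_full = sorted(full_centrality, key=full_centrality.get, reverse=True)[:10]
top_sample = sorted(sample_centrality, key=sample_centrality.get, reverse=True)[:10]
overlap = len(set(top_full) & set(top_sample))
print(f"overlap of top-10 central nodes (full vs. sampled): {overlap}/10")
```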

For those platforms where data access is regulated to a provider-determined subset, researchers only get access to data that may or may not be a probabilistic sample of the whole set of digital traces in the system. As the most popular example, Twitter provides users with varying levels of access. A popular choice for obtaining Twitter data is the 1% ‘spritzer’ API, yet research has found that this free 1% sample is significantly different from the commercial 10% ‘gardenhose’ API (Morstatter et al., 2013). Apart from the errors introduced through deliberate querying by the researcher as discussed above – which still apply in this case – these selection mechanisms devised by providers may lead to an additional representation error that can best be linked to actual sampling error in the TSE when the given selection is non-probabilistic. An example of this is vocal users being assigned a higher inclusion probability than other users, simply because they produce more signals.

Example: Entity Selection. When studying political opinions on Twitter, vocal or opinionated individuals’ opinions will be overrepresented, especially when data is collected based on signals (e.g., tweets) instead of individual accounts (e.g., tweets stratified by account activity) (Barberá, 2016; O’Connor et al., 2010; Pasek et al., 2018; Diaz et al., 2016), simply because they tweet more about the topic and have a higher probability of being included in the sample than others. Further, certain groups of entities (e.g., teenagers or Spanish-speaking people living in the US) may be underrepresented if keyword lists are generated that mainly capture how adult Americans talk about politics.

A potential solution for mitigating differences in activity due to this “sampling error” of entities could be stratified sampling of specific accounts (An and Weber, 2015; Lin et al., 2013) to explore how a user’s activity level influences their efficacy as a ‘social sensor’, as sketched below.
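A minimal sketch of such an activity-stratified entity sample, assuming a table of accounts with per-account activity counts (the data below is synthetic):

```python
import pandas as pd

# Hypothetical table of accounts with their activity level on the topic.
accounts = pd.DataFrame({
    "user":   [f"u{i}" for i in range(1000)],
    "tweets": pd.Series(range(1000)) % 50 + 1,  # stand-in for per-user tweet counts
})

# Stratify accounts by activity (low / medium / high) ...
accounts["stratum"] = pd.qcut(accounts["tweets"], q=3, labels=["low", "mid", "high"])

# ... and draw the same number of accounts from each stratum, so that highly
# active accounts do not dominate the entity sample.
per_stratum = 50
stratified = (
    accounts.groupby("stratum", observed=True)
    .sample(n=per_stratum, random_state=1)
)
print(stratified["stratum"].value_counts())
```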

Finally, a researcher should assess the impact of their data collection strategy on the sample they obtain, either through a triangulation approach that compares different querying strategies (Denzin, 2012) or by comparison with a ‘random’ sample from that particular platform (Morstatter et al., 2013).
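One simple way to operationalize such a comparison is to contrast, say, the hashtag or term distributions of the query-based sample and a platform random sample; the sketch below uses invented counts and the Jensen-Shannon divergence as one possible distance measure.

```python
from collections import Counter

import numpy as np


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(np.where(a > 0, a * np.log2(a / b), 0.0))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical hashtag counts in a keyword-collected sample vs. a platform
# random ("spritzer"-like) sample, aligned over the same vocabulary.
vocab = ["#maga", "#resist", "#nfl", "#weather"]
keyword_sample = Counter({"#maga": 400, "#resist": 350, "#nfl": 30, "#weather": 20})
random_sample = Counter({"#maga": 60, "#resist": 55, "#nfl": 300, "#weather": 280})

p = [keyword_sample[t] for t in vocab]
q = [random_sample[t] for t in vocab]
print(f"JS divergence between samples: {js_divergence(p, q):.3f}")
```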

Data Preprocessing

Data preprocessing refers to the process of removing noise in the form of ineligible units or items from the raw dataset, as well as augmenting it with auxiliary information or additionally needed metadata.

Entity Preprocessing

Entity preprocessing is done because researchers may want additional information about the units of analysis (augmentation) or may want to discard ineligible units mistakenly selected into the sampling frame during data collection (reduction). These steps can lead to the following distinct but related representation errors.

Entity Augmentation Error. In their downstream analysis, researchers may want to reweight digital trace data by socio-demographic attributes and/or by the activity levels of individuals. The former is traditionally also done in surveys to mitigate representation errors (discussed as ‘Adjustment’ in Section “Analysis and Inference”). However, since such attributes are rarely pre-collected and/or available in platform data, demographic attribute inference for the accounts of individuals is a popular way to obtain them, often with the help of machine learning methods (Zhang et al., 2016; Rao et al., 2010; Sap et al., 2014; Wang et al., 2019). Naturally, such demographic attribute inference is itself a task that may be affected by various errors (Karimi et al., 2016; McCormick et al., 2017) and can be especially problematic if error rates differ between groups of people (Buolamwini and Gebru, 2018). Platforms may also offer aggregate information about their user base, which can itself be produced by provider-internal inference methods and be prone to the same kinds of errors without the researcher knowing (Zagheni, Weber, and Gummadi, 2017). The overall error incurred due to the (limited) efficacy of entity augmentation methods is denoted as entity augmentation error. It may be quantified through an error analysis of the methods used for annotating auxiliary information, such as the accuracy of demographic inference; the reliability of multiple augmentation methods can also be assessed (González-Bailón and Paltoglou, 2015).
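A minimal sketch of such an error analysis, assuming a small validation subset for which both a trusted ("ground-truth") attribute and the automatically inferred attribute are known (the labels below are invented); reporting per-group error rates rather than only overall accuracy is the key point.

```python
import pandas as pd

# Hypothetical validation subset: accounts with both a ground-truth gender
# (e.g., from a survey linkage or manual coding) and the automatically
# inferred gender used for augmentation.
validation = pd.DataFrame({
    "true_gender":     ["f", "f", "f", "m", "m", "m", "f", "m"],
    "inferred_gender": ["f", "m", "m", "m", "m", "f", "f", "m"],
})

validation["correct"] = validation["true_gender"] == validation["inferred_gender"]

# Overall accuracy hides group-specific error rates, which is exactly what
# matters for downstream reweighting.
print("overall accuracy:", validation["correct"].mean())
print(validation.groupby("true_gender")["correct"].mean())
```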

Entity Reduction Error. In addition to entity augmentation, preprocessing steps usually followed in the literature include removing inactive users, spam users, and non-human users – comparable to “ineligible units” in survey terminology – or filtering content based on various observed or inferred criteria (e.g., the location of a tweet, the topic of a message). Similar to entity augmentation error, the methods for detecting and removing specific types of users can themselves have hidden biases depending on how they were built. This step therefore introduces an additional layer of representation error, namely entity reduction error. This error can be the result of researchers not removing ineligible units that contribute noisy signals, or of filtering criteria that are too strict. An approximate counterpart to this error is coverage error in surveys. Previous research compares various methods for bot (De Cristofaro et al., 2018) and spam detection (Wu et al., 2018), which are often used in preprocessing digital trace data, and finds that different methods perform differently depending on the characteristics of the data.

Example: Entity Preprocessing. Augmentation: Twitter users’ age, gender, ethnicity, or location may be inferred to understand how presidential approval differs across demographics, as is typically done in surveys. Automated gender inference methods have been found to have higher error rates for African American faces (Buolamwini and Gebru, 2018); gender inferred through such means would therefore over- or underestimate approval rates among African American populations due to an entity augmentation error.

Reduction: While estimating presidential approval from tweets, researchers are usually not interested in posts that are created by bots or organizations. In such cases, one can detect such accounts or detect first-person accounts as done by An and Weber (2015), even if the first selection process for data collection has retained them.

Entity reduction error can be assessed by understanding the criteria or definition chosen for exclusion, and researchers should note if their criteria could potentially exclude entities that act as effective “sensors”.
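One way to make this assessment tangible is a sensitivity analysis over the exclusion criterion. The sketch below (with hypothetical bot_score and approves columns) recomputes an aggregate signal under several bot-score thresholds; this is one possible diagnostic, not a method proposed in the text, and a strongly threshold-dependent estimate flags entity reduction error as a relevant concern.

```python
import pandas as pd

def estimate_under_thresholds(df, bot_score_col="bot_score", signal_col="approves",
                              thresholds=(0.3, 0.5, 0.7, 0.9)):
    """Recompute the aggregate after excluding accounts above each bot-score threshold."""
    estimates = {}
    for t in thresholds:
        kept = df[df[bot_score_col] <= t]          # entities retained at this threshold
        estimates[t] = kept[signal_col].mean()     # e.g., share of approving accounts
    return pd.Series(estimates, name="estimate")

# Invented account-level data with a bot score from some external classifier.
accounts = pd.DataFrame({
    "bot_score": [0.05, 0.20, 0.45, 0.60, 0.85, 0.95],
    "approves":  [1, 0, 1, 1, 0, 0],
})
print(estimate_under_thresholds(accounts))
```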

Signal Preprocessing

In contrast to reducing or augmenting entities or units of analysis, signals in the form of content or produced expressions may also be preprocessed before the statistical analysis, i.e., the individual expressions are filtered or augmented. These steps could lead to the following measurement errors.

Signal Augmentation Error. A form of content augmentation comprises the generation of auxiliary information through sentiment detection or named entity recognition on a text post, the annotation of “like” actions with receiver and sender information, or the annotation of image material with recognized objects. This augmentation of content is done as part of measuring the theoretical concept and mainly pertains to the machine-reliant coding of user-generated content. An error may be introduced in this step due to the precision and recall of the annotation method. If automated or semi-automated methods are used, algorithmic interpretability is yet another challenge.

Just as survey responses are coded into categories and inadequately trained coders can cause processing or coding errors, signal augmentation error occurs due to inaccurate categorization, whether manual or automatic.
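As a hedged illustration of how such error might be quantified, automatic labels can be compared against a small manually coded subsample; the sketch below uses scikit-learn’s classification report with made-up labels, and the per-class precision and recall stand in for the accuracy of the augmentation step.

```python
from sklearn.metrics import classification_report

# Manually coded gold labels vs. labels assigned by an automated sentiment coder (invented).
manual    = ["pos", "neg", "neu", "pos", "neg", "neu", "pos", "neg"]
automatic = ["pos", "neu", "neu", "pos", "neg", "pos", "neg", "neg"]

# Per-class precision and recall indicate how strongly the automated coding
# distorts the distribution of the construct-relevant signal.
print(classification_report(manual, automatic, zero_division=0))
```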

A typical method for augmenting digital trace data is sentiment or stance detection. Although there is a large body of research on sentiment analysis, specifically sentiment analysis for digital content, many challenges remain (Kenyon-Dean et al., 2018; Puschmann and Powell, 2018). An added layer of complexity stems from the multimodal nature of digital trace data: a post’s sentiment may be strongly tied to its surrounding context, such as links embedded in the post, the attributes of its author, or multimedia content present in the post. For example, a tweet’s sentiment may depend on the tweet it is replying to or quoting, often rendering incorrect any automated sentiment analysis that is trained to make a decision solely based on the text of the tweet itself.

Signal Reduction Error. Finally, certain signals may be ineligible items for a variety of reasons, such as spam or hashtag hijacking, or because they are irrelevant to the task at hand. The error incurred in removing such ineligible signals is termed signal reduction error.

Generally, for all types of preprocessing errors, researchers should note how different groups of users interact with the platform. For example, systemic biases could be introduced if certain groups of users are more willing to share information that makes metadata inference easier, impacting entity augmentation. Additionally, different styles and behaviors may affect how signals are produced (what Olteanu et al. (2019) describe as content biases), which in turn affects signal preprocessing errors, since common tools for signal augmentation and reduction may have differential errors for different types of content (González-Bailón and Paltoglou, 2015).

Example: Signal Preprocessing

Augmentation: Political tweets are often annotated with sentiment to understand public opinion (O’Connor et al., 2010; Barberá, 2016). However, the users of different social media platforms might use very different vocabularies than those covered in popular sentiment lexicons, and may even use words in different contexts, leading to misidentification or undercoverage of sentiments for a certain platform or subcommunity on that platform.
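A simple diagnostic for this kind of mismatch is to check how much of a platform’s vocabulary a lexicon actually covers before relying on it. The sketch below uses a tiny stand-in lexicon and invented tweets; it is only meant to illustrate the idea of a coverage check, not a specific tool.

```python
from collections import Counter

lexicon = {"good", "great", "bad", "terrible", "approve"}   # stand-in for a full sentiment lexicon
tweets = [
    "potus absolutely slaying it today",
    "this administration is a dumpster fire",
    "great speech, strongly approve",
]

# Tokenize naively and count how many tokens the lexicon recognizes.
tokens = Counter(tok.strip(",.!?") for tweet in tweets for tok in tweet.lower().split())
covered = sum(count for tok, count in tokens.items() if tok in lexicon)
coverage = covered / sum(tokens.values())
print(f"lexicon coverage: {coverage:.0%}")   # low coverage warns of undercoverage of sentiment
```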

Reduction: Signal reduction errors occur when researchers decide, e.g., to remove tweets that do not contain any textual content but only hyperlinks or embedded pictures/videos. To mitigate preprocessing errors, researchers should check whether the methods or techniques used are suitable for the particular type of digital traces in question. If methods have been developed for a different type of content, domain adaptation may be used to improve the performance of methods trained on a data source different from the current data of interest (Yang and Eisenstein, 2017; Hamilton et al., 2016).

Analysis and Inference

Finally, after preprocessing, we move on to measuring a final indicator for the construct of interest based on the resulting dataset. As noted before, depending on the nature and availability of the digital traces, the construct and its corresponding measurement may be defined or redefined at this stage. We discuss the resulting errors below.

Adjustment Error. To draw conclusions about the target population, a researcher can attempt to account for coverage and selection errors. To do so, she may use techniques developed to improve the representativeness of estimates obtained from non-probabilistic samples, such as opt-in web surveys (Goel, Obeng, and Rothschild, 2015, 2017). These web-based surveys usually do not employ probability sampling, and researchers have suggested specific ways to handle their nonprobabilistic nature, which may also be applied to digital traces (Kohler, Kreuter, and Stuart, 2019). These methods usually involve reweighting through techniques like raking or post-stratification. Depending on the resources and demographic information available, the choice of method can itself introduce errors, which we label adjustment error, in line with the adjustment error described by Groves et al. (2011).

A few researchers have explored adjustment in digital trace based studies (Zagheni and Weber, 2015; Barberá, 2016; Pasek et al., 2018, 2019; Wang et al., 2019) using calibration or post-stratification. Broadly, there are two approaches to reweighting in digital trace based studies. The first reweights the digital trace data sample according to a known population distribution (Yildiz et al., 2017; Zagheni and Weber, 2015), obtained, for example, through the census. The second reweights the survey statistic according to the demographic distribution of the online platform (Pasek et al., 2018, 2019), obtained through social media or web usage surveys or provided by the platform itself. While the second approach has the advantage of bypassing potentially biased methods for demographic inference (thus mitigating entity augmentation errors), the platform demographics might not apply to the particular dataset, since not all users are equally active on all topics, and the error introduced by this reweighting step is difficult to quantify. For the first approach, researchers often use biased methods for demographic inference, but the errors of these methods can be quantified through error analysis.
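For the first approach, a minimal post-stratification sketch might look as follows; the census shares, the single demographic variable, and the approval column are all invented for illustration, and a real application would use finer cells and inferred attributes whose own errors propagate into the weights.

```python
import pandas as pd

# Invented platform sample: one demographic cell per user plus a binary approval signal.
sample = pd.DataFrame({
    "age_group": ["18-29", "18-29", "18-29", "30-49", "30-49", "50+"],
    "approves":  [1, 0, 1, 1, 0, 0],
})

# Hypothetical target-population shares per age group (e.g., from a census).
population_share = pd.Series({"18-29": 0.20, "30-49": 0.35, "50+": 0.45})
sample_share = sample["age_group"].value_counts(normalize=True)

# Post-stratification weight = population share / sample share of the cell.
weights = sample["age_group"].map(population_share / sample_share)

unweighted = sample["approves"].mean()
weighted = (sample["approves"] * weights).sum() / weights.sum()
print(unweighted, weighted)   # the gap reflects the adjustment applied
```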

Example: Adjustment

When comparing presidential approval on Twitter with survey data, Pasek et al. (2019) reweight the survey estimates with Twitter usage demographics but fail to find alignment between the two measures (cf. footnote 12 in Section “Platform Selection” for Twitter’s demographic composition). In this case, the researchers assume that the demographics of Twitter users are the same as those of the subset of Twitter users tweeting about the president, an assumption which might not hold. Previous research has shown that political users tend to have different characteristics than random Twitter users (Cohen and Ruths, 2013) and that they tend to be younger and are more likely to be white men (Bekafigo and McBride, 2013); therefore, using Twitter demographics as a proxy for political Twitter users may lead to adjustment errors.

Signal Measurement Error. After we have obtained the augmented signals, either through manual coding or an automated process, they are aggregated to arrive at the final estimate. Just as the set of entities is adjusted through reweighting to correct for representation errors, different kinds of signals can likewise be taken into account to different degrees in the final estimate, e.g., to account for differences in the activity of the entities generating the signals (e.g., power users’ posts) or in their relevance for the construct to be measured. This can lead to erroneous final estimates, which we denote as signal measurement error. This error may arise from the choice of modeling or aggregation methods used by the researcher as well as from how the units of observation are mapped to units of analysis.

In particular, after augmenting signals with the required extraneous information, the researcher can calculate the final estimate through many different techniques, from simple counting to complex machine learning models. While simpler methods may be less powerful, they are often more interpretable and less costly (they can work with less data). When machine learning methods are used, representation and measurement errors become connected, since these methods are usually trained on the data collected for the study (unless pre-trained models are used). If the data is not representative of the target population and the construct, i.e., not all dimensions of the construct are captured during data collection and/or the users in the training data deviate systematically from the users in the target population, the method cannot learn to accurately annotate signals with respect to the construct.

Finally, researchers should also account for the heterogeneity of digital traces: the signals are generated by various subgroups of entities who may behave differently. Aggregation may mask or even reverse underlying trends in digital traces due to signal measurement errors in the form of Simpson’s paradoxes, a type of ecological fallacy (Alipourfard, Fennell, and Lerman, 2018; Howison, Wiggins, and Crowston, 2011; Lerman, 2018).
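The toy sketch below (with invented post-level data) shows how a pooled aggregate is dominated by highly active users and can diverge sharply from an aggregate that first averages within user groups; which of the two is appropriate depends on the construct, and the point is only that the aggregation choice itself carries error.

```python
import pandas as pd

# Invented posts: eight from prolific "power" users, two from "casual" users.
posts = pd.DataFrame({
    "user_type": ["power"] * 8 + ["casual"] * 2,
    "positive":  [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
})

pooled = posts["positive"].mean()                            # one vote per post
per_group = posts.groupby("user_type")["positive"].mean()    # signal per subgroup
balanced = per_group.mean()                                  # one vote per subgroup

print(pooled)      # dominated by prolific "power" users
print(per_group)   # casual users disagree but contribute few posts
print(balanced)    # a very different estimate once subgroups count equally
```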

Example: Signal Measurement

Depending on how the construct was defined (say, positive sentiment towards the president) and on the signal augmentation performed (lexicons that count words with a sentiment polarity), the researcher obtains tweets whose positive and negative words have been counted. The researcher may then define a final link function which combines all the signals into a single aggregate. She may choose to count the normalized positive words per day (Barberá, 2016), compute the ratio of positive to negative words per tweet, or sum the ratios of all tweets in a day (Pasek et al., 2019). The former aggregate (counting positive words per day) may underestimate the negative sentiment of a particular day, while in the latter, tweets that express both negative and positive stances would be neutralized to no sentiment.

A user may have expressed varying sentiment across multiple tweets. The researcher faces the choice of averaging the sentiment across all tweets, taking the most frequently expressed sentiment, or not aggregating them at all and (implicitly) assuming each tweet to be a signal of an individual entity (Tumasjan et al., 2010; O’Connor et al., 2010; Barberá, 2016), which may amplify some entities’ voices at the expense of others.

To avoid the vocabulary mismatch between pre-defined lexica and social media language, researchers may want to use machine learning methods that learn which words indicate positive sentiment towards the president by looking at extreme cases (e.g., tweets about the president from his supporters and critics). The validity of this measurement depends on the data it is trained on and on the extent to which this data is representative of the population and the construct. That means the researchers have to show that the selected supporters and critics express positive and negative sentiment towards the president in a similar way as random users do.

Application of the Error Framework

The aim of our error framework is to systematically describe and document errors in research designs making use of digital traces, and to do so with a vocabulary shareable between disciplines. We hope that by systematizing the description of errors, computational social science studies become more comparable. To illustrate the practical applicability and utility of the framework, we now look at a typical computational social science study through the lens of our error framework. Future work will include further case studies to showcase the applicability of our framework.

Case Study: Nowcasting Flu Activity through Wikipedia Views

Our case study focuses on a study predicting the prevalence of Influenza-like Illnesses (ILI) from Wikipedia usage by McIver and Brownstein (2014), which stands as an illustrative example for a line of research on predictions based on aggregated digital trace data. With regard to construct definition, the construct is quite clearly defined as a case of ILI of a person as diagnosed by the Centers for Disease Control and Prevention (CDC). The target population is described as the “American Population”, which could refer to all U.S. citizens residing in the U.S., all U.S. residents, or simply the population for which the CDC data used as the gold standard is collected. Typically, the construct of interest here would be measured by reports from local medical professionals, which are then recorded by a central agency like the CDC in the US. The authors investigate whether Wikipedia usage rates would be an adequate replacement for reports by the CDC; specifically, they use views of selected Wikipedia articles relevant to ILI topics to predict ILI prevalence. In terms of the ideal measurement envisioned for this construct, the important question is what the act of accessing ILI-related Wikipedia pages implies about the person who looks at these pages: do we assume that these persons suffer from influenza or related symptoms themselves, or could they also be people with a general interest in learning about the disease, potentially inspired by media coverage of the topic? To test construct validity in such cases, one option would be to survey a random sample of Wikipedia readers who land on the selected ILI-related articles to find out about their health status or alternative motivations. Further surveys and/or focus groups could help to learn if, how, and where people search for advice online when they are sick. Additionally, even if we assume that mainly people who feel sick consume ILI-related articles on Wikipedia, feeling sick does not necessarily mean that a viewer really has the flu. Only a medical expert can discern symptoms of flu from other diseases with similar symptoms.

People frequently arrive at Wikipedia articles via a Google search (McMahon, Johnson, and Hecht, 2017; Dimitrov et al., 2019). This means that Google’s ranking of Wikipedia’s ILI-related articles with respect to certain search queries impacts the digital traces we observe in Wikipedia data as well. Also, if users do not use a search engine but navigate within Wikipedia, past research has found that article structure and the position of links play an important role in users’ navigation habits (Lamprecht et al., 2017). Such platform affordances can impact the way users arrive at and interact with the articles presented, thereby affecting viewing behavior. While it is impossible to control for all such factors, reflecting on them and describing them can help to make research more reproducible and comparable.

Errors of non-observation in using Wikipedia manifest in the form of platform coverage error, since the readers of Wikipedia are not representative of the target population.

Another issue when only aggregated signals are available (here, aggregated views, which correspond to our unit of observation) is that the signals cannot be matched to the entities that produced them (here, people, who are our unit of analysis). This not only makes it difficult to identify whether multiple units of observation refer to the same unit of analysis (multiple views of ILI-related Wikipedia pages made by the same person) but also complicates accounting for all forms of representation errors.

The authors of the example paper curate a list of 32 Wikipedia articles that they identified as relevant for ILI, including Avian influenza, Influenza Virus B, Centers for Disease Control and Prevention, Influenza Virus C, Common Cold, Vaccine, and Influenza. This choice of query might lead to both measurement and representation errors. A signal selection error could occur if the views of selected articles do not indicate a strong primary interest in ILI (‘Centers for Disease Control and Prevention’ and ‘Vaccine’ are viewed for many other reasons), or if the views of other relevant articles are not counted. On the other hand, noisy article selection could lead to entity selection error: all selected articles might be primarily viewed out of interest in ILI, but only by a specific part of the target population, while another part of the population (e.g., elderly people) distributes its views over articles that are not a good indicator of “pure ILI interest” (and were therefore not included in the signal selection).

A detailed explanation of the selection criteria for the particular set of Wikipedia articles would be useful for comparing the results with related approaches. For example, it could be mentioned whether the selection process was informed by specific theory or by empirical findings from interviews or focus groups. Such information would help to assess the impact of signal selection errors. Another aspect to keep in mind is time: existing articles may change and new articles may be added during the field observation, leading to unstable estimates.

Finally, daily flu rates are estimated using a Generalised Linear Model (GLM) on the article views. Since there is no information about the humans behind these views, it is difficult to ascertain whether there should be a one-to-one mapping between views and people infected with influenza, or some other form of aggregation; this can lead to a signal measurement error if, for example, the views of Wikipedia power users are counted much more often than those of average readers.
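For illustration only, a sketch of this kind of modelling step with synthetic data and statsmodels might look as follows; the view counts, coefficients, and Gaussian family are assumptions made for the example, not a reconstruction of the original study’s specification.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n_weeks = 52

# Synthetic weekly view counts for two ILI-related Wikipedia articles.
views = rng.poisson(lam=[5000, 1200], size=(n_weeks, 2)).astype(float)
# Synthetic ILI rate loosely coupled to the view counts (purely illustrative).
ili_rate = 0.01 + 1e-6 * views[:, 0] + 2e-6 * views[:, 1] + rng.normal(0, 0.002, n_weeks)

glm = sm.GLM(ili_rate, sm.add_constant(views), family=sm.families.Gaussian())
fit = glm.fit()
print(fit.summary())
# The fit operates purely on aggregated view counts and says nothing about who
# generated the views, so representation errors and the view-to-person mapping
# remain unaddressed at this stage.
```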

Conclusion

The use of digital trace data of human behaviors, such as web and social media data, has become of great interest for various research communities, including the social sciences as well as other disciplines. Application areas of this type of data include, but are not limited to, real-time predictions, nowcasting, and many other forms of Social Sensing, for example for public opinion research. In some cases, digital trace data are considered a less time-consuming and less costly alternative or addition to surveys. One of the crucial challenges of digital trace data is identifying and disentangling the various kinds of errors that may originate in the unsolicited nature of this data, in effects related to specific platforms, and in data collection and analysis strategies. Multiple errors can potentially occur during the construction of a digital trace data research design, and it is difficult to pinpoint which error has what effect.

To make research on human behavior and opinions that is based either on digital trace data or on survey data more comparable, and to increase the reproducibility of digital trace data research, it is important to describe potential error sources and mitigation strategies systematically. We therefore add to the growing body of literature that aims at understanding errors in digital trace data research by conceptualizing a preliminary error framework for digital traces that identifies typical errors that may occur in such studies. We provide a suggestion for comparing these errors to their survey methodology counterparts based on the TSE, in order to (i) draw from methods developed in survey research to quantify and mitigate errors and (ii) develop a shared vocabulary to enhance the dialogue among scientists from heterogeneous disciplines working in the area of computational social science.

Just as survey methodology has benefited from understanding potential limitations in a systematic manner, our proposed framework acts as a set of recommendations for researchers on what to reflect on and on how they may use digital traces for studying social indicators. We recommend that researchers not only use our framework to think about the various types of errors that may occur in a planned study, but also use it to document errors and limitations more systematically. Such design documents should be shared with the collected data where possible. An error framework is a first step in delineating each source and type of error in order to prevent them, to improve how studies and metrics that make use of digital traces can be documented or audited, and, finally, to draw better inferences from such data.

Acknowledgments. We thank Haiko Lietz, Sebastian Stier, members of the CSS Department at GESIS as well as participants of the Demography workshop at ICWSM’2019 for their helpful feedback and suggestions.

References

  • Alipourfard, Fennell, and Lerman (2018) Alipourfard, N.; Fennell, P. G.; and Lerman, K. 2018. Can you trust the trend?: Discovering simpson’s paradoxes in social data. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 19–27. ACM.
  • An and Weber (2015) An, J., and Weber, I. 2015. Whom should we sense in “social sensing” – analyzing which users work best for social media now-casting. EPJ Data Science 4(1):22.
  • Askitas and Zimmermann (2015) Askitas, N., and Zimmermann, K. F. 2015. The internet as a data source for advancement in social sciences. International Journal of Manpower 36(1):2–12.
  • Barberá (2016) Barberá, P. 2016. Less is more? how demographic sample weights can improve public opinion estimates based on twitter data. Work Paper NYU.
  • Bekafigo and McBride (2013) Bekafigo, M. A., and McBride, A. 2013. Who tweets about politics? political participation of twitter users during the 2011 gubernatorial elections. Social Science Computer Review 31(5):625–643.
  • Biemer and Christ (2008) Biemer, P., and Christ, S. 2008. Weighting survey data. In de Leeuw, E. D.; Hox, J. J.; and Dillman, D. A., eds., International handbook of survey methodology. New York: Lawrence Erlbaum.
  • Biemer (2010) Biemer, P. P. 2010. Total survey error: Design, implementation, and evaluation. Public Opinion Quarterly 74(5):817–848.
  • Blank (2017) Blank, G. 2017. The digital divide among twitter users and its implications for social research. Social Science Computer Review 35(6):679–697.
  • Borgatti and Krackhardt (2006) Borgatti, S. P.; Carley, K. M.; and Krackhardt, D. 2006. Robustness of centrality measures under conditions of imperfect data. Social Networks 28(1):124–136.
  • Boyd and Crawford (2012) Boyd, D., and Crawford, K. 2012. Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon. Information, communication & society 15(5):662–679.
  • Bruns and Stieglitz (2014) Bruns, A., and Stieglitz, S. 2014. Twitter data: what do they represent? it-Information Technology 56(5):240–245.
  • Bruns and Weller (2016) Bruns, A., and Weller, K. 2016. Twitter as a first draft of the present: and the challenges of preserving it for the future. In Proceedings of the 8th ACM Conference on Web Science, 183–189. ACM.
  • Buntain and Golbeck (2015) Buntain, C., and Golbeck, J. 2015. This is your twitter on drugs: Any questions? In Proceedings of the 24th international conference on World Wide Web, 777–782. ACM.
  • Buolamwini and Gebru (2018) Buolamwini, J., and Gebru, T. 2018. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, 77–91.
  • Cha et al. (2010) Cha, M.; Haddadi, H.; Benevenuto, F.; and Gummadi, K. P. 2010. Measuring user influence in twitter: The million follower fallacy. In fourth international AAAI conference on weblogs and social media.
  • Chandrasekharan et al. (2017) Chandrasekharan, E.; Pavalanathan, U.; Srinivasan, A.; Glynn, A.; Eisenstein, J.; and Gilbert, E. 2017. You can’t stay here: The efficacy of reddit’s 2015 ban examined through hate speech. Proceedings of the ACM on Human-Computer Interaction 1(CSCW):31.
  • Cohen and Ruths (2013) Cohen, R., and Ruths, D. 2013. Classifying political orientation on twitter: It’s not easy! In Seventh International AAAI Conference on Weblogs and Social Media.
  • Conover et al. (2011) Conover, M. D.; Gonçalves, B.; Ratkiewicz, J.; Flammini, A.; and Menczer, F. 2011. Predicting the political alignment of twitter users. In 2011 IEEE third international conference on privacy, security, risk and trust and 2011 IEEE third international conference on social computing, 192–199. IEEE.
  • Costenbader and Valente (2003) Costenbader, E., and Valente, T. W. 2003. The stability of centrality measures when networks are sampled. Social networks 25(4):283–307.
  • De Cristofaro et al. (2018) De Cristofaro, E.; Kourtellis, N.; Leontiadis, I.; Stringhini, G.; Zhou, S.; et al. 2018. Lobo: Evaluation of generalization deficiencies in twitter bot classifiers. In Proceedings of the 34th Annual Computer Security Applications Conference, 137–146. ACM.
  • Demartini (2007) Demartini, G. 2007. Finding experts using wikipedia. In Proceedings of the 2nd International Conference on Finding Experts on the Web with Semantics-Volume 290, 33–41. Citeseer.
  • Denzin (2012) Denzin, N. K. 2012. Triangulation 2.0. Journal of mixed methods research 6(2):80–88.
  • Diaz et al. (2016) Diaz, F.; Gamon, M.; Hofman, J. M.; Kıcıman, E.; and Rothschild, D. 2016. Online and social media data as an imperfect continuous panel survey. PloS one 11(1):e0145406.
  • DiMaggio et al. (2001) DiMaggio, P.; Hargittai, E.; Neuman, W. R.; and Robinson, J. P. 2001. Social implications of the internet. Annual Review of Sociology 27(1):307–336.
  • Dimitrov et al. (2019) Dimitrov, D.; Lemmerich, F.; Flöck, F.; and Strohmaier, M. 2019. Different topic, different traffic: How search and navigation interplay on wikipedia. The Journal of Web Science 1.
  • Fatehkia, Kashyap, and Weber (2018) Fatehkia, M.; Kashyap, R.; and Weber, I. 2018. Using facebook ad data to track the global digital gender gap. World Development 107:189–209.
  • Fischer and Budescu (1995) Fischer, I., and Budescu, D. V. 1995. Desirability and hindsight biases in predicting results of a multi-party election.
  • Freedman (1999) Freedman, D. A. 1999. Ecological inference and the ecological fallacy. International Encyclopedia of the social & Behavioral sciences 6(4027-4030):1–7.
  • Galaskiewicz (1991) Galaskiewicz, J. 1991. Estimating point centrality using different network sampling techniques. Social Networks 13(4):347–386.
  • Garcia et al. (2018) Garcia, D.; Goel, M.; Agrawal, A. K.; and Kumaraguru, P. 2018. Collective aspects of privacy in the twitter social network. EPJ Data Science 7(1):3.
  • Gayo-Avello (2012) Gayo-Avello, D. 2012. “i wanted to predict elections with twitter and all i got was this lousy paper”–a balanced survey on election prediction using twitter data. arXiv preprint arXiv:1204.6441.
  • Gebru et al. (2018) Gebru, T.; Morgenstern, J.; Vecchione, B.; Vaughan, J. W.; Wallach, H.; Daumeé III, H.; and Crawford, K. 2018. Datasheets for datasets. arXiv preprint arXiv:1803.09010.
  • Gligorić, Anderson, and West (2018) Gligorić, K.; Anderson, A.; and West, R. 2018. How constraints affect content: The case of twitter’s switch from 140 to 280 characters. In Twelfth International AAAI Conference on Web and Social Media.
  • Goel, Obeng, and Rothschild (2015) Goel, S.; Obeng, A.; and Rothschild, D. 2015. Non-representative surveys: Fast, cheap, and mostly accurate. Working Paper.
  • Goel, Obeng, and Rothschild (2017) Goel, S.; Obeng, A.; and Rothschild, D. 2017. Online, opt-in surveys: Fast and cheap, but are they accurate? Working Paper, Stanford University, Stanford, CA.
  • González-Bailón and Paltoglou (2015) González-Bailón, S., and Paltoglou, G. 2015. Signals of public opinion in online communication: A comparison of methods and data sources. The ANNALS of the American Academy of Political and Social Science 659(1):95–107.
  • González-Bailón et al. (2014) González-Bailón, S.; Wang, N.; Rivero, A.; Borge-Holthoefer, J.; and Moreno, Y. 2014. Assessing the bias in samples of large online networks. Social Networks 38:16–27.
  • Groves and Lyberg (2010) Groves, R. M., and Lyberg, L. 2010. Total survey error: Past, present, and future. Public opinion quarterly 74(5):849–879.
  • Groves et al. (2011) Groves, R. M.; Fowler Jr, F. J.; Couper, M. P.; Lepkowski, J. M.; Singer, E.; and Tourangeau, R. 2011. Survey methodology, volume 561. John Wiley & Sons.
  • Hamilton et al. (2016) Hamilton, W. L.; Clark, K.; Leskovec, J.; and Jurafsky, D. 2016. Inducing domain-specific sentiment lexicons from unlabeled corpora. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, volume 2016, 595. NIH Public Access.
  • Hampton and Wellman (2003) Hampton, K., and Wellman, B. 2003. Neighboring in netville: How the internet supports community and social capital in a wired suburb. City & Community 2(4):277–311.
  • He and Rothschild (2016) He, R., and Rothschild, D. 2016. Selection bias in documenting online conversations. Working paper.
  • Howison, Wiggins, and Crowston (2011) Howison, J.; Wiggins, A.; and Crowston, K. 2011. Validity issues in the use of social network analysis with digital trace data. Journal of the Association for Information Systems 12(12).
  • Hsieh and Murphy (2017) Hsieh, Y. P., and Murphy, J. 2017. Total twitter error. Total Survey Error in Practice 23–46.
  • Japec et al. (2015) Japec, L.; Kreuter, F.; Berg, M.; Biemer, P.; Decker, P.; Lampe, C.; Lane, J.; O’Neil, C.; and Usher, A. 2015. Big data in survey research: Aapor task force report. Public Opinion Quarterly 79(4):839–880.
  • Jungherr et al. (2017) Jungherr, A.; Schoen, H.; Posegga, O.; and Jürgens, P. 2017. Digital trace data in the study of public opinion: An indicator of attention toward politics rather than political support. Social Science Computer Review 35(3):336–356.
  • Jungherr (2017) Jungherr, A. 2017. Normalizing digital trace data. Digital discussions: How big data informs political communication.
  • Karimi et al. (2016) Karimi, F.; Wagner, C.; Lemmerich, F.; Jadidi, M.; and Strohmaier, M. 2016. Inferring gender from names on the web: A comparative evaluation of gender detection methods. In Proceedings of the 25th International Conference Companion on World Wide Web, WWW ’16 Companion, 53–54. International World Wide Web Conferences Steering Committee.
  • Kenyon-Dean et al. (2018) Kenyon-Dean, K.; Ahmed, E.; Fujimoto, S.; Georges-Filteau, J.; Glasz, C.; Kaur, B.; Lalande, A.; Bhanderi, S.; Belfer, R.; Kanagasabai, N.; et al. 2018. Sentiment analysis: It’s complicated! In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), volume 1, 1886–1895.
  • Kohler, Kreuter, and Stuart (2019) Kohler, U.; Kreuter, F.; and Stuart, E. A. 2019. Nonprobability sampling and causal analysis. Annual review of statistics and its application 6:149–172.
  • Kossinets (2006) Kossinets, G. 2006. Effects of missing data in social networks. Social Networks 28:247–268.
  • Lamprecht et al. (2017) Lamprecht, D.; Lerman, K.; Helic, D.; and Strohmaier, M. 2017. How the structure of wikipedia articles influences user navigation. New Review of Hypermedia and Multimedia 23(1):29–50.
  • Lazer et al. (2009) Lazer, D.; Pentland, A.; Adamic, L.; Aral, S.; Barabasi, A.-L.; Brewer, D.; Christakis, N.; Contractor, N.; Fowler, J.; Gutmann, M.; et al. 2009. Social science. computational social science. Science (New York, NY) 323(5915):721–723.
  • Lazer (2015) Lazer, D. 2015. Issues of construct validity and reliability in massive, passive data collections. The City Papers: An Essay Collection from The Decent City Initiative.
  • Lee and Pfeffer (2015) Lee, J., and Pfeffer, J. 2015. Estimating centrality statistics for complete and sampled networks: Some approaches and complications. In 48th Hawaii International Conference on System Sciences, HICSS 2015, Kauai, Hawaii, USA, January 5-8, 2015, 1686–1695.
  • Lerman (2018) Lerman, K. 2018. Computational social scientist beware: Simpson’s paradox in behavioral data. Journal of Computational Social Science 1(1):49–58.
  • Lin et al. (2013) Lin, Y.-R.; Margolin, D.; Keegan, B.; and Lazer, D. 2013. Voices of victory: A computational focus group framework for tracking opinion shift in real time. In Proceedings of the 22nd international conference on World Wide Web, 737–748. ACM.
  • Locker (1993) Locker, D. 1993. Effects of non-response on estimates derived from an oral health survey of older adults. Community dentistry and oral epidemiology 21(2):108–113.
  • Malik and Pfeffer (2016) Malik, M. M., and Pfeffer, J. 2016. Identifying platform effects in social media data. In Tenth International AAAI Conference on Web and Social Media.
  • McCormick et al. (2017) McCormick, T. H.; Lee, H.; Cesare, N.; Shojaie, A.; and Spiro, E. S. 2017. Using twitter for demographic and social science research: Tools for data collection and processing. Sociological methods & research 46(3):390–421.
  • McIver and Brownstein (2014) McIver, D. J., and Brownstein, J. S. 2014. Wikipedia usage estimates prevalence of influenza-like illness in the united states in near real-time. PLoS computational biology 10(4):e1003581.
  • McMahon, Johnson, and Hecht (2017) McMahon, C.; Johnson, I.; and Hecht, B. 2017. The substantial interdependence of wikipedia and google: A case study on the relationship between peer production communities and information technologies. In Eleventh International AAAI Conference on Web and Social Media.
  • Meng (2018) Meng, X.-L. 2018. Statistical paradises and paradoxes in big data (i): Law of large populations, big data paradox, and the 2016 us presidential election. The Annals of Applied Statistics 12(2):685–726.
  • Metaxas, Mustafaraj, and Gayo-Avello (2011) Metaxas, P. T.; Mustafaraj, E.; and Gayo-Avello, D. 2011. How (not) to predict elections. In 2011 IEEE Third International Conference on Privacy, Security, Risk and Trust and 2011 IEEE Third International Conference on Social Computing, 165–171. IEEE.
  • Mislove et al. (2011) Mislove, A.; Lehmann, S.; Ahn, Y.-Y.; Onnela, J.-P.; and Rosenquist, J. N. 2011. Understanding the demographics of twitter users. In Fifth international AAAI conference on weblogs and social media.
  • Morstatter et al. (2013) Morstatter, F.; Pfeffer, J.; Liu, H.; and Carley, K. M. 2013. Is the sample good enough? comparing data from twitter’s streaming api with twitter’s firehose. In Seventh international AAAI conference on weblogs and social media.
  • Murphy and others (2018) Murphy, J., et al. 2018. Social media in public opinion research: Report of the aapor task force on emerging technologies in public opinion research. american association for public opinion research. 2014.
  • O’Connor et al. (2010) O’Connor, B.; Balasubramanyan, R.; Routledge, B. R.; and Smith, N. A. 2010. From tweets to polls: Linking text sentiment to public opinion time series. In Fourth International AAAI Conference on Weblogs and Social Media.
  • Olteanu et al. (2019) Olteanu, A.; Castillo, C.; Diaz, F.; and Kiciman, E. 2019. Social data: Biases, methodological pitfalls, and ethical boundaries. Frontiers in Big Data 2:13.
  • Olteanu, Weber, and Gatica-Perez (2016) Olteanu, A.; Weber, I.; and Gatica-Perez, D. 2016. Characterizing the demographics behind the# blacklivesmatter movement. In 2016 AAAI Spring Symposium Series.
  • Pasek et al. (2018) Pasek, J.; Yan, H. Y.; Conrad, F. G.; Newport, F.; and Marken, S. 2018. The stability of economic correlations over time: Identifying conditions under which survey tracking polls and twitter sentiment yield similar conclusions. Public Opinion Quarterly 82(3):470–492.
  • Pasek et al. (2019) Pasek, J.; McClain, C. A.; Newport, F.; and Marken, S. 2019. Who’s tweeting about the president? what big survey data can tell us about digital traces? Social Science Computer Review 0894439318822007.
  • Pavalanathan and Eisenstein (2015) Pavalanathan, U., and Eisenstein, J. 2015. Confounds and consequences in geotagged twitter data. arXiv preprint arXiv:1506.02275.
  • Pfeffer, Mayer, and Morstatter (2018) Pfeffer, J.; Mayer, K.; and Morstatter, F. 2018. Tampering with twitter’s sample api. EPJ Data Science 7(1):50.
  • Phillips et al. (2017) Phillips, L.; Dowling, C.; Shaffer, K.; Hodas, N.; and Volkova, S. 2017. Using social media to predict the future: a systematic literature review. arXiv preprint arXiv:1706.06134.
  • Puschmann and Powell (2018) Puschmann, C., and Powell, A. 2018. Turning words into consumer preferences: How sentiment analysis is framed in research and the news media. Social Media+ Society 4(3):2056305118797724.
  • Rao et al. (2010) Rao, D.; Yarowsky, D.; Shreevats, A.; and Gupta, M. 2010. Classifying latent user attributes in twitter. In Proceedings of the 2nd international workshop on Search and mining user-generated contents, 37–44. ACM.
  • Riederer et al. (2015) Riederer, C. J.; Zimmeck, S.; Phanord, C.; Chaintreau, A.; and Bellovin, S. M. 2015. I don’t have a photograph, but you can have my footprints.: Revealing the demographics of location data. In Proceedings of the 2015 ACM on Conference on Online Social Networks, 185–195. ACM.
  • Ripberger (2011) Ripberger, J. T. 2011. Capturing curiosity: Using internet search trends to measure public attentiveness. Policy Studies Journal 39(2):239–259.
  • Robinson (2011) Robinson, J. P. 2011. It use and leisure time displacement. Information, Communication & Society 14(4):495–509.
  • Ruiz, Hristidis, and Ipeirotis (2014) Ruiz, E. J.; Hristidis, V.; and Ipeirotis, P. G. 2014. Efficient filtering on hidden document streams. In Eighth International AAAI Conference on Weblogs and Social Media.
  • Ruths and Pfeffer (2014) Ruths, D., and Pfeffer, J. 2014. Social media for large studies of behavior. Science 346(6213):1063–1064.
  • Sakaki, Okazaki, and Matsuo (2010) Sakaki, T.; Okazaki, M.; and Matsuo, Y. 2010. Earthquake shakes twitter users: real-time event detection by social sensors. In Proceedings of the 19th international conference on World wide web, 851–860. ACM.
  • Salganik (2017) Salganik, M. J. 2017. Bit by bit: social research in the digital age. Princeton University Press.
  • Sap et al. (2014) Sap, M.; Park, G.; Eichstaedt, J.; Kern, M.; Stillwell, D.; Kosinski, M.; Ungar, L.; and Schwartz, H. A. 2014. Developing age and gender predictive lexica over social media. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1146–1151.
  • Schnell, Noack, and Torregroza (2017) Schnell, R.; Noack, M.; and Torregroza, S. 2017. Differences in general health of internet users and non-users and implications for the use of web surveys. In Survey Research Methods, volume 11, 105–123.
  • Sedgwick (2014) Sedgwick, P. 2014. Unit of observation versus unit of analysis. Bmj 348:g3840.
  • Stier et al. (2018) Stier, S.; Bleier, A.; Bonart, M.; Mörsheim, F.; Bohlouli, M.; Nizhegorodov, M.; Posch, L.; Maier, J.; Rothmund, T.; and Staab, S. 2018. Systematically monitoring social media: The case of the german federal election 2017. arXiv preprint arXiv:1804.02888.
  • Stier et al. (2019) Stier, S.; Breuer, J.; Siegers, P.; and Thorson, K. 2019. Integrating survey data and digital trace data: Key issues in developing an emerging field.
  • Straub, Boudreau, and Gefen (2004) Straub, D.; Boudreau, M.-C.; and Gefen, D. 2004. Validation guidelines for is positivist research. Communications of the Association for Information systems 13(1):24.
  • Tufekci (2014) Tufekci, Z. 2014. Big questions for social media big data: Representativeness, validity and other methodological pitfalls. In Eighth International AAAI Conference on Weblogs and Social Media.
  • Tumasjan et al. (2010) Tumasjan, A.; Sprenger, T. O.; Sandner, P. G.; and Welpe, I. M. 2010. Predicting elections with twitter: What 140 characters reveal about political sentiment. In Fourth international AAAI conference on weblogs and social media.
  • Wagner et al. (2017) Wagner, C.; Singer, P.; Karimi, F.; Pfeffer, J.; and Strohmaier, M. 2017. Sampling from social networks with attributes. In Proceedings of the 26th International Conference on World Wide Web, 1181–1190. International World Wide Web Conferences Steering Committee.
  • Wang et al. (2012) Wang, D. J.; Shi, X.; McFarland, D. A.; and Leskovec, J. 2012. Measurement error in network data: A re-classification. Social Networks 34(4):396–409.
  • Wang et al. (2019) Wang, Z.; Hale, S.; Adelani, D. I.; Grabowicz, P.; Hartman, T.; Flöck, F.; and Jurgens, D. 2019. Demographic inference and representative population estimates from multilingual social media data. In The World Wide Web Conference, WWW ’19, 2056–2067. New York, NY, USA: ACM.
  • Watts (2007) Watts, D. J. 2007. A twenty-first century science. Nature 445(7127):489.
  • Weisberg (2009) Weisberg, H. F. 2009. The total survey error approach: A guide to the new science of survey research. University of Chicago Press.
  • Wu et al. (2018) Wu, T.; Wen, S.; Xiang, Y.; and Zhou, W. 2018. Twitter spam detection: Survey of new approaches and comparative study. Computers & Security 76:265–284.
  • Yang and Eisenstein (2017) Yang, Y., and Eisenstein, J. 2017. Overcoming language variation in sentiment analysis with social attention. Transactions of the Association for Computational Linguistics 5:295–307.
  • Yang and Luo (2017) Yang, X., and Luo, J. 2017. Tracking illicit drug dealing and abuse on instagram using multimodal analysis. ACM Transactions on Intelligent Systems and Technology (TIST) 8(4):58.
  • Yang, Ribeiro, and Neville (2017) Yang, J.; Ribeiro, B.; and Neville, J. 2017. Should we be confident in peer effects estimated from social network crawls? In Eleventh International AAAI Conference on Web and Social Media.
  • Yildiz et al. (2017) Yildiz, D.; Munson, J.; Vitali, A.; Tinati, R.; and Holland, J. A. 2017. Using twitter data for demographic research. Demographic Research 37:1477–1514.
  • Yuan et al. (2013) Yuan, Q.; Nsoesie, E. O.; Lv, B.; Peng, G.; Chunara, R.; and Brownstein, J. S. 2013. Monitoring influenza epidemics in china with search query from baidu. PloS one 8(5):e64323.
  • Zagheni and Weber (2015) Zagheni, E., and Weber, I. 2015. Demographic research with non-representative internet data. International Journal of Manpower 36(1):13–25.
  • Zagheni, Weber, and Gummadi (2017) Zagheni, E.; Weber, I.; and Gummadi, K. 2017. Leveraging facebook’s advertising platform to monitor stocks of migrants. Population and Development Review 43(4):721–734.
  • Zhang et al. (2016) Zhang, J.; Hu, X.; Zhang, Y.; and Liu, H. 2016. Your age is no secret: Inferring microbloggers’ ages via content and interaction analysis. In Tenth International AAAI Conference on Web and Social Media.
  • Zhang, Hill, and Rothschild (2016) Zhang, H.; Hill, S.; and Rothschild, D. 2016. Geolocated twitter panels to study the impact of events. In 2016 AAAI Spring Symposium Series.