Understanding complex socio-technical phenomena requires data-driven research based on large-scale, reliable, relevant data sets. Web data, particularly data from application programming interfaces (APIs), has been an enormous boon for researchers using online social platforms’ databases of user-generated activity and content [49, 58, 67, 89]. The ability to “crawl” and “scrape” large-scale and high-resolution samples of publicly-accessible user data stimulated emerging fields like social computing and computational social science, and developed new fields like crisis informatics. But following major scandals around data privacy and ethics, social media platforms like Facebook and Twitter changed previously permissive data access provisions of their public APIs. As a consequence, the ability for researchers to collect timely data, share tools, instruct students, and reproduce findings has been curtailed.
This “post-API age” is characterized by the deprecation of data resources used for research and teaching [50, 107], increased stratification of data access based on social, technical, and financial capital [16, 93], and greater fear of prosecution around violating terms of service in the course of research [65, 105]. These changes have had a profoundly chilling effect on researchers’ use of API-derived data to investigate behavior like discrimination, harassment, radicalization, hate speech, and disinformation. Furthermore, researchers have struggled in systematically studying the role that platforms’ changing features, design affordances, and governance strategies play in sustaining these forms of “turpitude-as-a-service” [20, 82]. Faced with conflicting incentives between protecting their users’ data from abuse and maintaining their commitments to values of openness, online social platforms are exploring alternative data sharing models like “trusted third party” models that still carry significant technical and reputational risks [20, 56, 74, 99, 107].
Even if the “golden age” of API-driven computational social science and social computing research had not closed in the shadow of privacy scandals, it was nevertheless characterized by enormous inefficiencies in data collection and inequalities in access [93, 107], ethically-suspect methods and implications [15, 119, 102], a lack of concern for data sharing or reproducibility [12, 124], and failures to validate constructs or generalize to off-platform behavior [39, 72, 75]. Facebook’s and Twitter’s changes in data access were significant; however, the enclosure of previously open big social data sources is not ubiquitous among platform providers [17, 68, 73]. Social platforms and online communities like Wikipedia, Stack Exchange, GitHub, and Reddit continue to offer open APIs and data dumps that are valuable for researchers.
In this paper, we contribute to the goal of providing open APIs and data dumps to researchers by releasing the Pushshift Reddit dataset. In addition to monthly dumps of 651M submissions and 5.6B comments posted on Reddit between 2005 and 2019 (available at https://files.pushshift.io/reddit/), the Pushshift Reddit dataset also includes an API for researcher access and a Slackbot that allows researchers to easily interact with the collected data. The Pushshift Reddit API enables researchers to easily execute queries on the whole dataset without the need for downloading the monthly dumps. This reduces the requirement for substantial storage capacity, thus making the data available to a wider range of users. Finally, we provide access to a Slackbot that allows researchers to easily produce visualizations of data from the Pushshift Reddit dataset in real time and discuss them with colleagues on Slack. These resources allow research teams to quickly begin interacting with data with very little time spent on the tedious aspects of data collection, cleaning, and storage.
Pushshift is not a new or isolated data platform, but a five-year-old platform with a track record in peer-reviewed publications and an active community of several hundred users. Pushshift not only collects Reddit data, but exposes it to researchers via an API. Why do people use Pushshift’s API instead of the official Reddit API? In short, Pushshift makes it much easier for researchers to query and retrieve historical Reddit data, extends functionality with full-text search against comments and submissions, and has larger single-query limits. Specifically, because, at the time of this writing, Pushshift has a size limit five times greater than Reddit’s 100-object limit, Pushshift enables the end user to quickly ingest large amounts of data. Additionally, the Pushshift API offers aggregation endpoints to provide summary analysis of Reddit activity, a feature that the Reddit API lacks entirely.
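To make the size-limit comparison concrete, the following is a minimal sketch of how a full-text search request might be assembled and how the larger limit reduces the number of requests needed. The base URL and the parameter names (`q`, `size`, `subreddit`) are assumptions for illustration and should be checked against the online API documentation; only the 100-object and five-times figures come from the text above.

```python
from urllib.parse import urlencode

# Assumed endpoint for comment search; verify against the API documentation.
BASE_URL = "https://api.pushshift.io/reddit/search/comment/"

# Per-request object limits as described in the text.
REDDIT_LIMIT = 100
PUSHSHIFT_LIMIT = 5 * REDDIT_LIMIT


def build_search_url(query, subreddit=None, size=PUSHSHIFT_LIMIT):
    """Build a full-text search URL (parameter names are assumptions)."""
    if not 1 <= size <= PUSHSHIFT_LIMIT:
        raise ValueError(f"size must be between 1 and {PUSHSHIFT_LIMIT}")
    params = {"q": query, "size": size}
    if subreddit is not None:
        params["subreddit"] = subreddit
    return BASE_URL + "?" + urlencode(params)


def requests_needed(total_objects, per_request):
    """Number of requests needed to page through total_objects results."""
    return -(-total_objects // per_request)  # ceiling division
```

Under these assumptions, retrieving 10,000 objects takes `requests_needed(10000, REDDIT_LIMIT)` requests at the smaller limit versus `requests_needed(10000, PUSHSHIFT_LIMIT)` at the larger one, a fivefold reduction.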
The Pushshift Reddit dataset provides not just a technical infrastructure of software and hardware for collecting “big social data” but also a social infrastructure of organizational processes for responsibly collecting, governing, and discussing these research data.
2.1 Data collection process
Pushshift uses multiple backend software components to collect, store, catalog, index, and disseminate data to end-users. As seen in Fig. 1, these subsystems are:
The ingest engine, which is responsible for collecting and storing raw data.
A PostgreSQL database, which allows for advanced querying of data and meta-data storage.
An Elasticsearch document store cluster, which performs indexing and aggregation of ingested data.
An API to allow researchers dynamic access to collected data and aggregation functionality.
The first stage in the Pushshift pipeline is the ingest engine, which is responsible for actually collecting data. The ingest engine can be thought of as a framework for large-scale collection of heterogeneous social media data sources. The ingest engine orchestrates the execution of multiple data collection programs, each designed to handle a particular data source. Specifically, the ingest engine provides and manages a job scheduling queue, and provides a set of common APIs to handle the data storage. Currently, Pushshift’s ingest engine works as follows:
First, the program runner starts each ingest program (i.e., the programs that actually collect the data). The ingest engine is agnostic to the particulars of the individual ingest programs: no particular programming language is required, and there is no particular expectation of how an ingest program works, modulo its interactions with the remainder of the ingest engine. Typically, an ingest program will directly interact with Web APIs, scrape content from HTML pages, use data streams where available, etc. Next, the ingest program inserts the raw data retrieved into a database as well as into a document store. Behind the scenes, each piece of collected data is added to an intermediate queue (currently implemented via Redis), which serves as a staging area until the data is processed by any custom processing scripts the ingest program’s creator might require. Finally, the raw data is periodically flushed to disk. The data storage format can be specified by the ingest program creator via the custom processing scripts previously mentioned, or a standard, Pushshift-implemented format can be used (e.g., ndjson).
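The staging and flushing steps described above can be sketched as follows. This is an illustrative toy under stated assumptions, not Pushshift's implementation: an in-memory `deque` stands in for the Redis queue, the class and method names are hypothetical, and the database and document-store inserts are omitted.

```python
import json
import os
import tempfile
from collections import deque


class IngestSketch:
    """Toy version of the collect -> stage -> process -> flush flow."""

    def __init__(self, out_path, process=None):
        self.queue = deque()  # in-memory stand-in for the Redis staging queue
        self.out_path = out_path
        # Optional custom processing script; defaults to passing data through.
        self.process = process or (lambda record: record)

    def collect(self, records):
        """Called by an ingest program with freshly retrieved raw records."""
        self.queue.extend(records)

    def flush(self):
        """Drain the queue, apply processing, and append to disk as ndjson."""
        written = 0
        with open(self.out_path, "a", encoding="utf-8") as fh:
            while self.queue:
                record = self.process(self.queue.popleft())
                fh.write(json.dumps(record, ensure_ascii=False) + "\n")
                written += 1
        return written
```

An ingest program would call `collect()` as data arrives, while a scheduler would call `flush()` periodically; each flushed record occupies exactly one line of the output file.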
PostgreSQL & Elasticsearch.
Pushshift currently uses Elasticsearch (ES) as a scalable document store for each data source that is part of the ingest pipeline. ES offers a number of important features for storing and analyzing large amounts of data. For example, ES achieves ease of scaling by utilizing a cluster approach for horizontal expansion. It ensures redundancy by creating multiple replicas for each index so that a node outage does not affect the overall health of the cluster. ES’s robust dynamic mapping tools allow easy modification and expansion of indexes to accommodate changes in data structure from the source. This is useful because Reddit’s API does not implement any type of versioning, yet there are constant additions and modifications made to the API when new features and data types are added to the response objects. By using dynamic mapping types, Pushshift can easily add new fields to existing indices. This enables us to quickly modify the corresponding mappings to allow search and aggregation on those new fields. Pushshift also makes use of the ICU Analysis plug-in for ES [31, 40], which provides support for international locales, full Unicode support up through Unicode 12, and complete emoji search support.
API. Pushshift currently allows users to search Reddit data via an API. Right now, this API exports much of the search and aggregation functionality provided by Elasticsearch. This functionality supports dozens of community applications and numerous research projects. The API is the major workload handled by Pushshift’s computational resources, serving 500M requests per month. Although in this paper we focus on a description of the data (Section 3) due to space limitations, we provide online API documentation at https://pushshift.io/api-parameters/.
Community. In addition to Pushshift’s website, which features an interactive dashboard of current activity trends, Pushshift also has two active user communities on Reddit and Slack. The /r/pushshift subreddit was created in April 2015 and is used for sharing announcements, answering questions, reporting bugs, and soliciting feedback for new features. There are more than 2,100 subscribers to this subreddit, an active team of 10 moderators, and more than 700 posts (with more than 4,000 comments) from over 350 unique users (see Fig. 2).
The Pushshift Slack team has nearly 300 registered users and more than 260,000 messages across 53 channels discussing data science and visualization. Custom tools have also been developed to integrate the Pushshift archive into these Slack communities. For example, users can interact with a Slack chatbot in real time. The bot can analyze and visualize Pushshift data based on queries made in the Slack channel, and return those visualizations to the channel for discussion and observation. In Fig. 3, a user queried the number of daily comments to the /r/the_donald subreddit over the past four years and received a time series plot and summary statistics from the chatbot within a few seconds. The chatbot can also be shared to other non-Pushshift workspaces, allowing researchers in other Slack workspaces to use the data. This extends the reach of Pushshift data even further.
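The kind of daily time series and summary statistics returned by the chatbot can be approximated offline from the `created_utc` field of each record. The sketch below is an assumption about one reasonable way to compute such counts; it is not the chatbot's actual code, and the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def daily_counts(records):
    """Count records per UTC calendar day using their created_utc field."""
    counts = Counter()
    for record in records:
        stamp = int(record["created_utc"])
        day = datetime.fromtimestamp(stamp, tz=timezone.utc).date()
        counts[day.isoformat()] += 1
    # Sort by ISO date string so the result reads as a time series.
    return dict(sorted(counts.items()))


def summarize(counts):
    """Basic summary statistics over a mapping of day -> count."""
    values = list(counts.values())
    if not values:
        return {"days": 0, "total": 0, "max": 0, "mean": 0.0}
    return {
        "days": len(values),
        "total": sum(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }
```

Bucketing by UTC day keeps the counts independent of the machine's local time zone, which matters when results are shared across research teams.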
Table 1: The most important keys/values in each submission’s JSON object.
|Field|Description|
|---|---|
|id|The submission’s identifier, e.g., “5lcgjh” (String).|
|url|The URL that the submission is posting. This is the same as the permalink in cases where the submission is a self post. E.g., “https://www.reddit.com/r/AskReddit/|
|permalink|Relative URL of the permanent link that points to this specific submission, e.g., “/r/AskReddit/comments/5lcgj9/what_did_you_think_of_the_ending_of_rogue_one/” (String).|
|author|The account name of the poster, e.g., “example_username” (String).|
|created_utc|UNIX timestamp referring to the time of the submission’s creation, e.g., 1483228803 (Integer).|
|subreddit|Name of the subreddit that the submission is posted to. Note that it excludes the prefix /r/. E.g., “AskReddit” (String).|
|subreddit_id|The identifier of the subreddit, e.g., “t5_2qh1i” (String).|
|selftext|The text that is associated with the submission (String).|
|title|The title that is associated with the submission, e.g., “What did you think of the ending of Rogue One?” (String).|
|num_comments|The number of comments associated with this submission, e.g., 7 (Integer).|
|score|The score that the submission has accumulated. The score is the number of upvotes minus the number of downvotes. E.g., 5 (Integer). NB: Reddit fuzzes the real score to prevent spam bots.|
|is_self|Flag that indicates whether the submission is a self post, e.g., true (Boolean).|
|over_18|Flag that indicates whether the submission is Not-Safe-For-Work, e.g., false (Boolean).|
|distinguished|Flag to determine whether the submission is distinguished by moderators (see https://www.reddit.com/r/redditdev/comments/19ak1b/api_change_distinguished_is_now_available_in_the/). “null” means not distinguished (String).|
|edited|Indicates whether the submission has been edited. Either a number indicating the UNIX timestamp that the submission was edited at, or “false” otherwise.|
|domain|The domain of the submission, e.g., self.AskReddit (String).|
|stickied|Flag indicating whether the submission is set as sticky in the subreddit, e.g., false (Boolean).|
|locked|Flag indicating whether the submission is currently closed to new comments, e.g., false (Boolean).|
|quarantine|Flag indicating whether the community is quarantined, e.g., false (Boolean).|
|hidden_score|Flag indicating if the submission’s score is hidden, e.g., false (Boolean).|
|retrieved_on|UNIX timestamp referring to the time we crawled the submission, e.g., 1483228803 (Integer).|
|author_flair_css_class|The CSS class of the author’s flair. This field is specific to the subreddit (String).|
|author_flair_text|The text of the author’s flair. This field is specific to the subreddit (String).|
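As an illustration of how the submission fields listed above might be consumed, the sketch below parses one JSON line and normalizes the fields with mixed or encoded types: the UNIX timestamps and the `edited` field, which holds either a timestamp or false. The function name and output keys are hypothetical, and the `https://www.reddit.com` prefix for the relative permalink is taken from the example URL in the table.

```python
import json
from datetime import datetime, timezone


def _to_utc(stamp):
    """Convert a UNIX timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(stamp), tz=timezone.utc)


def parse_submission(line):
    """Parse one ndjson line into a small normalized dictionary."""
    record = json.loads(line)
    edited = record.get("edited", False)
    # `edited` is either a timestamp or a boolean; a boolean carries no
    # edit time, so it is mapped to None here.
    edited_at = None if isinstance(edited, bool) else _to_utc(edited)
    return {
        "id": record["id"],
        "subreddit": record["subreddit"],
        "created": _to_utc(record["created_utc"]),
        "edited_at": edited_at,
        "full_permalink": "https://www.reddit.com" + record["permalink"],
        "is_self": bool(record.get("is_self", False)),
    }
```

Checking `isinstance(edited, bool)` before converting avoids silently treating a boolean as the timestamp 0 or 1, since Python booleans are also integers.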
3 Description of the Pushshift Reddit Dataset
Pushshift makes available all the submissions and comments posted on Reddit between June 2005 and April 2019. The dataset consists of 651,778,198 submissions and 5,601,331,385 comments posted on 2,888,885 subreddits. Fig. 4 shows the number of submissions and comments per day. We observe that the numbers of submissions and comments increase over the course of our dataset. After August 2013, we have consistently over 1M comments per day, while by the end of our dataset (April 2019) we have 5M comments per day. Also, while submissions are substantially fewer than comments, submissions have reached a consistent level of over 500K per day in this dataset.
The Pushshift Reddit dataset is made up of two sets of files: one set of files for the submissions and one for the comments. Below, we describe the structure of each of the files in these two sets.
Submissions. The submissions dataset consists of a set of newline-delimited JSON (ndjson; see http://ndjson.org/) files: we maintain a separate file for each month of our data collection. Each line in these files corresponds to a submission and is a JSON object. Table 1 describes the most important keys/values included in each submission’s JSON object.
Comments. Similarly to the submissions, the comments dataset is a collection of ndjson files, with each file corresponding to a month’s worth of data. Each line in these files corresponds to a comment and is a JSON object. Table 2 describes the most important keys/values in each comment’s JSON object.
Table 2: The most important keys/values in each comment’s JSON object.
|Field|Description|
|---|---|
|id|The comment’s identifier, e.g., “dbumnq8” (String).|
|author|The account name of the poster, e.g., “example_username” (String).|
|link_id|Identifier of the submission that this comment is in, e.g., “t3_5l954r” (String).|
|parent_id|Identifier of the parent of this comment: either the identifier of the submission, if it is a top-level comment, or the identifier of another comment, e.g., “t1_dbu5bpp” (String).|
|created_utc|UNIX timestamp that refers to the time of the comment’s creation, e.g., 1483228803 (Integer).|
|subreddit|Name of the subreddit that the comment is posted to. Note that it excludes the prefix /r/. E.g., “AskReddit” (String).|
|subreddit_id|The identifier of the subreddit where the comment is posted, e.g., “t5_2qh1i” (String).|
|body|The comment’s text, e.g., “This is an example comment” (String).|
|score|The score of the comment. The score is the number of upvotes minus the number of downvotes. Note that Reddit fuzzes the real score to prevent spam bots. E.g., 5 (Integer).|
|distinguished|Flag to determine whether the comment is distinguished by the moderators. “null” means not distinguished; see https://www.reddit.com/r/redditdev/comments/19ak1b/api_change_distinguished_is_now_available_in_the/ for more details (String).|
|edited|Flag indicating if the comment has been edited. Either the UNIX timestamp that the comment was edited at, or “false”.|
|stickied|Flag indicating whether the comment is set as sticky in the subreddit, e.g., false (Boolean).|
|retrieved_on|UNIX timestamp that refers to the time that we crawled the comment, e.g., 1483228803 (Integer).|
|gilded|The number of times this comment received Reddit gold, e.g., 0 (Integer).|
|controversiality|Number that indicates whether the comment is controversial, e.g., 0 (Integer).|
|author_flair_css_class|The CSS class of the author’s flair. This field is specific to the subreddit (String).|
|author_flair_text|The text of the author’s flair. This field is specific to the subreddit (String).|
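The `link_id` and `parent_id` fields make it possible to reconstruct comment threads from the ndjson lines. The following is a minimal sketch, assuming the type prefixes seen in the examples above (`t1_` for comments, `t3_` for submissions); the function names are hypothetical and the input is any iterable of lines, such as an open monthly dump file.

```python
import json
from collections import defaultdict


def split_fullname(fullname):
    """Split a prefixed identifier such as 't1_dbu5bpp' into (kind, id)."""
    kind, _, ident = fullname.partition("_")
    return kind, ident


def is_top_level(comment):
    """A comment is top-level when its parent is the submission itself."""
    return comment["parent_id"] == comment["link_id"]


def build_reply_index(lines):
    """Map each parent identifier to the ids of its direct replies."""
    children = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines between records
            continue
        comment = json.loads(line)
        children[comment["parent_id"]].append(comment["id"])
    return dict(children)
```

Because the index is keyed by the prefixed parent identifier, replies to a submission and replies to a comment stay distinguishable without a second lookup. Processing one line at a time also keeps memory use proportional to the index rather than to the size of the dump.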
FAIR principles. The Pushshift Reddit dataset aligns with the FAIR principles (https://www.go-fair.org/fair-principles/). Our dataset is Findable, as the monthly dumps are publicly available via Pushshift’s website (https://files.pushshift.io/reddit/). We also uploaded a small sample of the dataset to the Zenodo service, so that we obtain a persistent digital object identifier (DOI): 10.5281/zenodo.3608135. Note that we were unable to upload the entire dataset to Zenodo, since the service has a limit of 100 GB and our dataset is on the order of several terabytes. The Pushshift Reddit dataset is Accessible, as it can be accessed by anyone visiting Pushshift’s website. Furthermore, we offer an API and a Slackbot that allow researchers to easily execute queries and obtain data from our infrastructure without the need to download the large monthly dumps. Also, our dataset is Interoperable because it is in JSON format, which is a widely known and used format for data. Because the provenance for the collected data is very clear, and users are simply asked to cite Pushshift in order to use the data, our dataset is also Reusable.
4 Dataset Use Cases
The Pushshift Reddit dataset has attracted a substantial research community. As of late 2019, Google Scholar indexes over 100 peer-reviewed publications that used Pushshift data (see Fig. 5). This research covers a diverse cross-section of research topics including measuring toxicity, personality, virality, and governance. Pushshift’s influence as a primary source of Reddit data among researchers has attracted empirical scrutiny, which in turn has led to improved data validation efforts. We note that there is some difficulty in ascertaining our dataset’s full contribution to the scientific community due to a previous lack of deliberate efforts to conform to FAIR principles, which we address in this paper.
Online community governance.
Reddit’s ecosystem of subreddits is primarily governed by volunteer moderators with substantial discretion over creating and enforcing rules about user behavior and content moderation [45, 113]. This distributed and volunteer-led model stands in contrast to the centralized strategies of other prominent social platforms like Facebook, Twitter, and YouTube. These differences between centralized and delegated moderation make ideal case studies for comparing the effectiveness of responses to difficult issues like social movements, fringe identities, hate speech, and harassment campaigns [95, 96, 97]. Pushshift data has already been instrumental for researchers exploring the spillover effects of banning offensive sub-communities, identifying common features of abusive behavior across communities, similarity in norms and rules across communities [27, 45], perceptions of fairness in moderation decisions [76, 77], and improving automated moderation tools.
The political extremism research community currently faces significant challenges in understanding how mainstream and fringe online spaces are used by bad actors. Despite widespread agreement that recent increases in online radicalization are due to “a globalised, toxic, anonymous online culture” operating largely outside mainstream social media platforms, much of the research on extremist use of social media still focuses on mainstream sites like Facebook or Twitter. Access to these rapidly-changing online spaces is difficult, and many research teams end up using out-of-date data, or relying on the data they have, rather than the data they need. Many social media platforms face pressure to monetize their data or remove access to it entirely, making research access to these spaces expensive and difficult. Yet, extremism researchers agree that data access is a key limitation to understanding online radicalization as a phenomenon. Online extremism researchers’ top recommendation is to “invest more in data-driven analysis of far-right violent extremist and terrorist use of the Internet.” Pushshift data has already been used to understand the phenomenon of hate speech and political extremism [79, 25, 43, 44, 64] and trolling and negative behaviors in fringe online spaces [4, 125, 126].
|Tutorials & demos||❍||◗||⚫||◗||◗|
The online disinformation research community has focused its attention on how social media facilitates the spread of deliberately inaccurate information [100, 115]. The use of social media platforms to spread this “fake news” and biased political propaganda was particularly concerning given the events surrounding Russian interference in the 2016 US presidential election. Researchers studying disinformation acknowledge that mainstream platforms, particularly Facebook, are still the main place where disinformation campaigns take place and that a lack of data access is significantly limiting their efforts. While mainstream sites are the largest amplifiers of disinformation content, the content itself is often created on fringe sites that serve as proving grounds [52, 60, 94]. As with extremism and terrorism research, data access and data sharing in the disinformation research community is an ongoing struggle. Pushshift data has already been used in a number of papers on disinformation and social media trustworthiness [32, 71, 127, 130].
Datasets like Pushshift are critically important for researchers who answer questions at the intersection of Internet and society. How does technology spread? What is the impact of each interface or design choice on the efficacy of social media platforms? How should we measure the success or failure of an online community? Pushshift data has already been used in studies of user engagement on social media, social media moderation schemes [112, 114], measuring success and growth of online communities [33, 116], conflict in online groups [34, 35, 83], the spread of technological innovations, modeling collaboration [81, 98], and measuring engagement and collective attention [6, 91].
Big data science.
Because of the relative anonymity allowed by certain social media platforms, large social media datasets are useful for researchers studying topics in health informatics including sensitive medical issues, atypical behaviors, and interpersonal conflict. Pushshift data has been used by researchers studying eating disorders and weight loss, addiction and substance abuse [8, 9, 14, 19, 92, 128], sexually transmitted infections, difficult child-rearing problems, and various mental health challenges [23, 36, 48, 63, 62, 106, 109].
Intelligent systems that can augment and enhance human understanding often require large amounts of human-generated text data generated in a social context. Social media data collected by Pushshift has been used already by researchers in computational linguistics and natural language processing [51, 54, 70, 78, 122, 129, 131], recommender systems [21, 38, 66, 69], intelligent conversational systems [1, 59, 80, 120], entity recognition, and other fields associated with the development of systems that can sense, reason, learn, and predict.
5 Related Work
5.1 Existing Data Collection Services
Promising alternatives to the aforementioned model of “storage buckets of open data hosted by cloud providers” exist that are better-tailored towards the needs of researchers.
Pushshift is not the first large-scale real-time social media data collection service aimed towards researchers. Table 3 summarizes the social and organizational features of other similar services. While not an exhaustive list, the following have heavily influenced the research community as well as motivated Pushshift’s own goals and design.
- Media Cloud is an “open source platform for studying media ecosystems” that tracks hundreds of millions of news stories and makes aggregated count and topical data available via a free and semi-public API. The Media Cloud platform has been used to study digital health communication, agenda-setting, and online social movements. Researchers can use the API to get counts of stories, topics, words, and tags in response to queries by keyword, media source, and time window using a Solr search platform.
- Stack Exchange is a platform of social question answering communities, including Stack Overflow. While data dumps of the platform are hosted by the Internet Archive, Stack Exchange offers both an API of activity as well as a “Data Explorer” allowing users to write SQL queries via a web interface against a regularly-updated database.
- Wikimedia is the parent organization of projects like Wikipedia. It hosts data dumps of revision histories, content, and pageviews; makes data available through robust APIs; and offers a variety of interactive services. Wikimedia’s deployment of Jupyter Notebooks can access replication databases of revisions and content. This enables researchers to focus on analyzing data rather than system and database administration.
Other dataset papers. Considering the challenges in the post-API age, the collection, curation, and dissemination of datasets is crucial for the advancement of science. To that end, it is worth exploring other works whose primary contribution has been the dataset they provide. For example, one released dataset includes 37M posts and 24M comments covering August 2016 through December 2018 from Gab, a Twitter-like social media platform that, after being de-platformed by major service providers, ported its codebase to use the federated social network protocol from the Mastodon project. Another released dataset focuses on Mastodon itself. It contains 5M posts, along with a label, crowdsourced from Mastodon users, that indicates whether or not the post contains inappropriate content. Research into other types of computer-mediated communication platforms has also been enabled by dataset contributions, including a released dataset from 178 WhatsApp groups that includes 454K messages from 45K different users.
6 Discussion & Conclusion
In this paper, we presented the Pushshift Reddit Dataset, which includes hundreds of millions of submissions and billions of comments from 2005 until the present. In addition to offering Pushshift’s data as monthly dumps, we also make this dataset available via a searchable API, as well as additional tools and community resources. This paper also serves as a more formal and archival description of what Pushshift’s Reddit dataset provides. Having already been used in over 100 papers from numerous disciplines over the past four years, the Pushshift Reddit dataset will continue to be a valuable resource for the research community in the future.
A. Ahmadvand, H. Sahijwani, J. I. Choi, and E. Agichtein.
Concet: Entity-aware topic classification for open-domain conversational agents.In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 1371–1380. ACM, 2019.
-  D. Alba. Ahead of 2020, facebook falls short on plan to share data on disinformation. https://www.nytimes.com/2019/09/29/technology/facebook-disinformation.html, sep 2019.
-  K. K. Aldous, J. An, and B. J. Jansen. View, like, comment, post: Analyzing user engagement by topic at 4 levels across 5 social media platforms for 53 news organizations. In Proceedings of the International AAAI Conference on Web and Social Media, volume 13, pages 47–57, 2019.
-  H. Almerekhi, H. Kwak, B. J. Jansen, and J. Salminen. Detecting toxicity triggers in online discussions. In Proceedings of the 30th ACM Conference on Hypertext and Social Media, pages 291–292. ACM, 2019.
-  T. Ammari, S. Schoenebeck, and D. Romero. Self-declared throwaway accounts on reddit: How platform affordances and shared norms enable parenting disclosure and support. Proceedings of the ACM on Human-Computer Interaction, 3(CSCW):135, 2019.
-  J. An, H. Kwak, O. Posegga, and A. Jungherr. Political discussions in homogeneous and cross-cutting communication spaces. In Proceedings of the International AAAI Conference on Web and Social Media, volume 13, pages 68–79, 2019.
-  I. Archive. Stack exchange data dump. https://archive.org/details/stackexchange, 2019.
-  D. Balsamo, P. Bajardi, and A. Panisson. Firsthand opiates abuse on social media: Monitoring geospatial patterns of interest through a digital cohort. In The World Wide Web Conference, pages 2572–2579. ACM, 2019.
-  J. O. Barker and J. A. Rohde. Topic clustering of e-cigarette submissions among reddit communities: A network perspective. Health Education & Behavior, 46(2_suppl):59–68, 2019.
-  M. Bastos and S. Walker. Facebook’s data lockdown is a disaster for academic researchers. https://theconversation.com/facebooks-data-lockdown-is-a-disaster-for-academic-researchers-94533, Apr. 2018.
-  J. Baumgartner. My response to the paper highlighting issues with data incompleteness concerning my reddit corpus. https://www.reddit.com/r/datasets/comments/884vkh/, 2018.
-  C. L. Borgman. The conundrum of sharing research data. Journal of the American Society for Information Science and Technology, 63(6):1059–1078, 2012.
-  A. Botta, N. Digiacomo, and K. Mole. Monetizing data: A new source of value in payments. https://www.mckinsey.com/industries/financial-services/our-insights/monetizing-data-a-new-source-of-value-in-payments, 2017.
-  D. A. Bowen, J. O’Donnell, and S. A. Sumner. Increases in online posts about synthetic opioids preceding increases in synthetic opioid death rates: a retrospective observational study. Journal of general internal medicine, 34(12):2702–2704, 2019.
-  d. boyd. Untangling research and practice: What Facebook’s ”emotional contagion” study teaches us. Research Ethics, 12(1):4–13, 2016.
-  d. boyd and K. Crawford. Critical Questions for Big Data. Information, Communication & Society, 15(5):662–679, 2012.
-  J. Boyle. The second enclosure movement and the construction of the public domain. In Copyright Law, pages 63–104. Routledge, 2017.
-  S. Bradshaw and P. N. Howard. The global disinformation order: 2019 global inventory of organised social media manipulation. https://comprop.oii.ox.ac.uk/wp-content/uploads/sites/93/2019/09/CyberTroop-Report19.pdf, 2019.
-  E. I. Brett, E. M. Stevens, T. L. Wagener, E. L. Leavens, T. L. Morgan, W. D. Cotton, and E. T. Hébert. A content analysis of juul discussions on social media: Using reddit to understand patterns and perceptions of juul use. Drug and alcohol dependence, 194:358–362, 2019.
-  A. Bruns. After the ‘APIcalypse’: Social media platforms and their fight against critical scholarly research. Information, Communication & Society, 22(11):1544–1566, 2019.
N. Buhagiar, B. Zahir, and A. Abhari.
Using deep learning to recommend discussion threads to users in an online forum.In
2018 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2018.
-  V. Burris, E. Smith, and A. Strahm. White supremacist networks on the internet. Sociological Focus, 33(2):215–235, 2000.
-  D. Chakravorti, K. Law, J. Gemmell, and D. Raicu. Detecting and characterizing trends in online mental health discussions. In 2018 IEEE International Conference on Data Mining Workshops (ICDMW), pages 697–706. IEEE, 2018.
-  E. Chandrasekharan, C. Gandhi, M. W. Mustelier, and E. Gilbert. Crossmod: A cross-community learning-based system to assist reddit moderators. Proc. ACM Hum.-Comput. Interact., 3, Nov. 2019.
-  E. Chandrasekharan, U. Pavalanathan, A. Srinivasan, A. Glynn, J. Eisenstein, and E. Gilbert. You can’t stay here: The efficacy of reddit’s 2015 ban examined through hate speech. Proceedings of the ACM on Human-Computer Interaction, 1(CSCW):31, 2017.
-  E. Chandrasekharan, U. Pavalanathan, A. Srinivasan, A. Glynn, J. Eisenstein, and E. Gilbert. You can’t stay here: The efficacy of reddit’s 2015 ban examined through hate speech. Proc. ACM Hum.-Comput. Interact., 1, Dec. 2017.
-  E. Chandrasekharan, M. Samory, S. Jhaver, H. Charvat, A. Bruckman, C. Lampe, J. Eisenstein, and E. Gilbert. The internet’s hidden rules: An empirical study of reddit norm violations at micro, meso, and macro scales. Proc. ACM Hum.-Comput. Interact., 2, Nov. 2018.
-  E. Chandrasekharan, M. Samory, A. Srinivasan, and E. Gilbert. The bag of communities: Identifying abusive behavior online with preexisting internet data. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, CHI ’17, pages 3175–3187. ACM, 2017.
-  J. Chuang, S. Fish, D. Larochelle, W. P. Li, and R. Weiss. Large-scale topical analysis of multiple online news sources with media cloud. In NewsKDD: Data Science for News Publishing, 2014.
-  Media Cloud. API specifications. https://github.com/berkmancenter/mediacloud/, 2019.
-  ICU Project Management Committee. International components for unicode. http://site.icu-project.org/home, 2019.
-  E. Crothers, N. Japkowicz, and H. L. Viktor. Towards ethical content-based detection of online influence. In 2019 IEEE 29th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6. IEEE, 2019.
-  T. Cunha, D. Jurgens, C. Tan, and D. Romero. Are all successful communities alike? characterizing and predicting the success of online communities. In The World Wide Web Conference, pages 318–328. ACM, 2019.
-  S. Datta and E. Adar. Extracting inter-community conflicts in reddit. In Proceedings of the International AAAI Conference on Web and Social Media, volume 13, pages 146–157, 2019.
-  S. Datta, C. Phelan, and E. Adar. Identifying misaligned inter-group links and communities. Proceedings of the ACM on Human-Computer Interaction, 1(CSCW):37, 2017.
-  F. Delahunty, I. D. Wood, and M. Arcan. First insights on a passive major depressive disorder prediction system with incorporated conversational chatbot. In AICS, pages 327–338, 2018.
-  L. Derczynski, E. Nichols, M. van Erp, and N. Limsopatham. Results of the wnut2017 shared task on novel and emerging entity recognition. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 140–147, 2017.
-  L. Eberhard, S. Walk, L. Posch, and D. Helic. Evaluating narrative-driven movie recommendations on reddit. In IUI, pages 1–11, 2019.
-  H. Ekbia, M. Mattioli, I. Kouper, G. Arave, A. Ghazinejad, T. Bowman, V. R. Suri, A. Tsou, S. Weingart, and C. R. Sugimoto. Big data, bigger dilemmas: A critical review. Journal of the Association for Information Science and Technology, 66(8):1523–1545, 2015.
-  Elasticsearch. Icu analysis plugin. https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu.html, 2019.
-  K. B. Enes, P. P. V. Brum, T. O. Cunha, F. Murai, A. P. C. da Silva, and G. L. Pappa. Reddit weight loss communities: do they have what it takes for effective health interventions? In 2018 IEEE/WIC/ACM International Conference on Web Intelligence (WI), pages 508–513. IEEE, 2018.
-  Stack Exchange. Data explorer help. https://data.stackexchange.com/help, 2019.
-  G. Fair and R. Wesslen. Shouting into the void: A database of the alternative social media platform gab. In Proceedings of the International AAAI Conference on Web and Social Media, volume 13, pages 608–610, 2019.
-  T. Farrell, M. Fernandez, J. Novotny, and H. Alani. Exploring misogyny across the manosphere in reddit. In Proceedings of the 10th ACM Conference on Web Science, WebSci’19, pages 87–96, 2019.
-  C. Fiesler, J. McCann, K. Frye, J. R. Brubaker, et al. Reddit rules! characterizing an ecosystem of governance. In Twelfth International AAAI Conference on Web and Social Media. AAAI, 2018.
-  M. Fire and C. Guestrin. The rise and fall of network stars: Analyzing 2.5 million graphs to reveal how high-degree vertices emerge over time. Information Processing & Management, 2019.
-  Wikimedia Foundation. Wikimedia downloads. https://dumps.wikimedia.org/, 2019.
-  B. S. Fraga, A. P. C. da Silva, and F. Murai. Online social networks in health care: A study of mental disorders on reddit. In 2018 IEEE/WIC/ACM International Conference on Web Intelligence (WI), pages 568–573. IEEE, 2018.
-  D. Freelon. On the Interpretation of Digital Trace Data in Communication and Social Computing Research. Journal of Broadcasting & Electronic Media, 58(1):59–75, 2014.
-  D. Freelon. Computational Research in the Post-API Age. Political Communication, 35(4):665–668, 2018.
-  N. E. Fulda. Semantically aligned sentence-level embeddings for agent autonomy and natural language understanding. 2019.
-  D. Funke. Misinformers are moving to smaller platforms. so how should fact-checkers monitor them? https://www.poynter.org/fact-checking/2018/misinformers-are-moving-to-smaller-platforms-so-how-should-fact-checkers-monitor-them/, Dec. 2018.
-  D. Gaffney and J. N. Matias. Caveat emptor, computational social science: Large-scale missing data in a widely-published Reddit corpus. PLOS ONE, 13(7):e0200162, July 2018.
-  P. Gamallo, S. Sotelo, J. R. Pichel, and M. Artetxe. Contextualized translations of phrasal verbs with distributional compositional semantics and monolingual corpora. Computational Linguistics, pages 1–27, 2019.
-  K. Garimella and G. Tyson. WhatsApp, Doc? A First Look at WhatsApp Public Group Data. In Twelfth International AAAI Conference on Web and Social Media, 2018.
-  E. Gibney. Privacy hurdles thwart Facebook democracy research. Nature, 574(7777):158–159, Oct. 2019.
-  M. Glenski, E. Saldanha, and S. Volkova. Characterizing speed and scale of cryptocurrency discussion spread on reddit. In The World Wide Web Conference, pages 560–570. ACM, 2019.
-  S. A. Golder and M. W. Macy. Digital Footprints: Opportunities and Challenges for Online Social Research. Annual Review of Sociology, 40(1):129–152, 2014.
-  S. Golovanov, A. Tselousov, R. Kurbanov, and S. I. Nikolenko. Lost in conversation: A conversational agent based on the transformer and transfer learning. In The NeurIPS’18 Competition, pages 295–315. Springer, 2020.
-  D. Gonimah. Storyful’s guide to the social media landscape: Beyond the iceberg metaphor. https://storyful.com/thought-leadership/storyfuls-guide-to-the-social-media-landscape-beyond-the-iceberg-metaphor/, Oct. 2018.
-  G. Gousios. The ghtorrent dataset and tool suite. In Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13. IEEE Press, 2013.
-  R. Grant, D. Kucher, A. M. León, J. Gemmell, and D. Raicu. Discovery of informal topics from post traumatic stress disorder forums. In 2017 IEEE International Conference on Data Mining Workshops (ICDMW), pages 452–461. IEEE, 2017.
-  R. N. Grant, D. Kucher, A. M. León, J. F. Gemmell, D. S. Raicu, and S. J. Fodeh. Automatic extraction of informal topics from online suicidal ideation. BMC bioinformatics, 19(8):211, 2018.
-  T. Grover and G. Mark. Detecting potential warning behaviors of ideological radicalization in an alt-right subreddit. In Proceedings of the International AAAI Conference on Web and Social Media, volume 13, pages 193–204, 2019.
-  A. Halavais. Overcoming terms of service: A proposal for ethical distributed research. Information, Communication & Society, 22(11):1567–1581, 2019.
-  K. Halder, M.-Y. Kan, and K. Sugiyama. Predicting helpful posts in open-ended discussion forums: A neural architecture. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3148–3157, 2019.
-  K. N. Hampton. Studying the Digital: Directions and Challenges for Digital Methods. Annual Review of Sociology, 43(1):167–188, 2017.
-  C. Hess and E. Ostrom. Ideas, artifacts, and facilities: information as a common-pool resource. Law and contemporary problems, 66(1/2):111–145, 2003.
-  J. Hessel, L. Lee, and D. Mimno. Cats and captions vs. creators and the clock: Comparing multimodal content to context in predicting relative popularity. In Proceedings of the 26th International Conference on World Wide Web, pages 927–936. International World Wide Web Conferences Steering Committee, 2017.
-  C. Hidey and K. McKeown. Fixed that for you: Generating contrastive claims with semantic edits. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1756–1767, 2019.
-  B. D. Horne and S. Adali. The impact of crowds on news engagement: A reddit case study. In Eleventh International AAAI Conference on Web and Social Media, 2017.
-  J. Howison, A. Wiggins, and K. Crowston. Validity Issues in the Use of Social Network Analysis with Digital Trace Data. Journal of the Association for Information Systems, 12(12):767–797, Dec. 2011.
-  D. Hunter. Cyberspace as place and the tragedy of the digital anticommons. California Law Review, 91:439, 2003.
-  M. Ingram. Silicon Valley’s Stonewalling. Columbia Journalism Review, 2019.
-  L. Japec, F. Kreuter, M. Berg, P. Biemer, P. Decker, C. Lampe, J. Lane, C. O’Neil, and A. Usher. Big Data in Survey Research: AAPOR Task Force Report. Public Opinion Quarterly, 79(4):839–880, 2015.
-  S. Jhaver, D. S. Appling, E. Gilbert, and A. Bruckman. “did you suspect the post would be removed?”: Understanding user reactions to content removals on reddit. Proc. ACM Hum.-Comput. Interact., 3, Nov. 2019.
-  S. Jhaver, A. Bruckman, and E. Gilbert. Does transparency in moderation really matter? user behavior after content removal explanations on reddit. Proc. ACM Hum.-Comput. Interact., 3, Nov. 2019.
-  J.-Y. Jiang, F. Chen, Y.-Y. Chen, and W. Wang. Learning to disentangle interleaved conversational threads with a siamese hierarchical network and similarity ranking. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1812–1822, 2018.
-  A. Johnston and A. Marku. Identifying extremism in text using deep learning. Development and Analysis of Deep Learning Architectures, pages 267–289, 2020.
-  P. Jonell, P. Fallgren, F. I. Doğan, J. Lopes, U. Wennberg, and G. Skantze. Crowdsourcing a self-evolving dialog graph. In Proceedings of the 1st International Conference on Conversational User Interfaces, page 14. ACM, 2019.
-  P. Kasper, P. Koncar, S. Walk, T. Santos, M. Wölbitsch, M. Strohmaier, and D. Helic. Modeling user dynamics in collaboration websites. In Dynamics on and of Complex Networks, pages 113–133. Springer, 2017.
-  B. Keegan. Discovering the Social. http://www.brianckeegan.com/2018/03/discovering-the-social/, 2018.
-  S. Kumar, W. L. Hamilton, J. Leskovec, and D. Jurafsky. Community interaction and conflict on the web. In Proceedings of the 2018 World Wide Web Conference, pages 933–943. International World Wide Web Conferences Steering Committee, 2018.
-  A. Kunft, A. Katsifodimos, S. Schelter, S. Breß, T. Rabl, and V. Markl. An intermediate representation for optimizing machine learning pipelines. Proceedings of the VLDB Endowment, 12(11):1553–1567, 2019.
-  A. Kunft, A. Katsifodimos, S. Schelter, T. Rabl, and V. Markl. Blockjoin: efficient matrix partitioning through joins. Proceedings of the VLDB Endowment, 10(13):2061–2072, 2017.
-  A. Kunft, L. Stadler, D. Bonetta, C. Basca, J. Meiners, S. Breß, T. Rabl, J. Fumero, and V. Markl. Scootr: Scaling r dataframes on dataflow systems. In Proceedings of the ACM Symposium on Cloud Computing, pages 288–300. ACM, 2018.
-  Y. Lama, D. Hu, A. Jamison, S. C. Quinn, and D. A. Broniatowski. Characterizing trends in human papillomavirus vaccine discourse on reddit (2007-2015): An observational study. JMIR public health and surveillance, 5(1):e12480, 2019.
-  D. Lazer, A. Pentland, L. Adamic, S. Aral, A.-L. Barabási, D. Brewer, N. Christakis, N. Contractor, J. Fowler, M. Gutmann, T. Jebara, G. King, M. Macy, D. Roy, and M. V. Alstyne. Computational Social Science. Science, 323(5915):721–723, 2009.
-  D. Lazer and J. Radford. Data ex Machina: Introduction to Big Data. Annual Review of Sociology, 43(1):19–39, 2017.
-  K. Leetaru and P. A. Schrodt. Gdelt: Global data on events, location, and tone, 1979–2012. In ISA annual convention, number 4, pages 1–49, 2013.
-  P. Lorenz-Spreen, B. M. Mønsted, P. Hövel, and S. Lehmann. Accelerating dynamics of collective attention. Nature communications, 10(1):1759, 2019.
-  J. Lu, S. Sridhar, R. Pandey, M. A. Hasan, and G. Mohler. Investigate transitions into drug addiction through text mining of reddit data. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2367–2375. ACM, 2019.
-  L. Manovich. Trending: The promises and the challenges of big social data. Debates in the digital humanities, 2:460–475, 2011.
-  A. Marwick and R. Lewis. Media manipulation and disinformation online: Case studies. https://datasociety.net/pubs/oh/DataAndSociety_MediaManipulationAndDisinformationOnline.pdf, May 2017.
-  A. Massanari. #gamergate and the fappening: How reddit’s algorithm, governance, and culture support toxic technocultures. New Media & Society, 19(3):329–346, 2017.
-  J. N. Matias. Going dark: Social factors in collective action against platform operators in the reddit blackout. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, CHI ’16, pages 1138–1151. ACM, 2016.
-  J. N. Matias. The Civic Labor of Volunteer Moderators Online. Social Media + Society, Apr. 2019.
-  A. N. Medvedev, J.-C. Delvenne, and R. Lambiotte. Modelling structure and predicting dynamics of discussion threads in online boards. Journal of Complex Networks, 7(1):67–82, 2018.
-  J. Mervis. Privacy concerns could derail Facebook data-sharing plan. Science, 365(6460):1360–1361, 2019.
-  V. Narayanan, V. Barash, J. Kelly, B. Kollanyi, L.-M. Neudert, and P. N. Howard. Polarization, partisanship and junk news consumption over the us. http://comprop.oii.ox.ac.uk/wp-content/uploads/sites/93/2018/02/Polarization-Partisanship-JunkNews.pdf, 2018.
-  A. Oboler, W. Allington, and P. Scolyer-Gray. Hate and violent extremism from an online subculture: The yom kippur terrorist attack in halle, germany. http://ohpi.org.au/halle/Hate%20and%20Violent%20Extremism%20from%20an%20Online%20Sub-Culture.pdf, 2019.
-  A. Olteanu, C. Castillo, F. Diaz, and E. Kıcıman. Social Data: Biases, Methodological Pitfalls, and Ethical Boundaries. Frontiers in Big Data, 2, 2019.
-  F. Ozcan. Bayesian Nonparametric Models on Big Data. PhD thesis, UC Irvine, 2017.
-  L. Palen and K. M. Anderson. Crisis informatics—New data for extraordinary times. Science, 353(6296):224–225, 2016.
-  K. S. Patel. Testing the Limits of the First Amendment: How Online Civil Rights Testing is Protected Speech Activity. Columbia Law Review, 118(5):1473–1516, 2018.
-  I. Pirina and Ç. Çöltekin. Identifying depression on reddit: The effect of training data. In Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task, pages 9–12, 2018.
-  C. Puschmann. An end to the wild west of social media research: A response to Axel Bruns. Information, Communication & Society, 22(11):1582–1589, 2019.
-  Reddit. API documentation. https://www.reddit.com/dev/api/, 2019.
-  N. Rezaii, E. Walker, and P. Wolff. A machine learning approach to predicting psychosis using semantic density and latent content analysis. NPJ schizophrenia, 5, 2019.
-  I. Sarantopoulos, D. Papatheodorou, D. Vogiatzis, G. Tzortzis, and G. Paliouras. Timerank: A random walk approach for community discovery in dynamic networks. In International Conference on Complex Networks and their Applications, pages 338–350. Springer, 2018.
-  J. Seering, T. Wang, J. Yoon, and G. Kaufman. Moderator engagement and community development in the age of algorithms. New Media & Society, 21(7):1417–1443, July 2019.
-  Q. Shen and C. Rose. The discourse of online content moderation: Investigating polarized user responses to changes in reddit’s quarantine policy. In Proceedings of the Third Workshop on Abusive Language Online, pages 58–69, 2019.
-  T. Squirrell. Platform dialectics: The relationships between volunteer moderators and end users on reddit. New Media & Society, Mar. 2019.
-  K. B. Srinivasan, C. Danescu-Niculescu-Mizil, L. Lee, and C. Tan. Content removal as a moderation strategy: Compliance and other outcomes in the changemyview community. Proceedings of the ACM on Human-Computer Interaction, 3(CSCW):163, 2019.
-  K. Starbird. Examining the alternative media ecosystem through the production of alternative narratives of mass shooting events on twitter. In International AAAI Conference on Web and Social Media. AAAI, 2017.
-  C. Tan. Tracing community genealogy: how new communities emerge from the old. In Twelfth International AAAI Conference on Web and Social Media, 2018.
-  Tech Against Terrorism. Insights from the centre for analysis of the radical right’s inaugural conference in london. https://www.techagainstterrorism.org/2019/06/06/insights-from-the-centre-for-analysis-of-the-radical-rights-inaugural-conference-in-london/, May 2019.
-  S. Tsugawa and S. Niida. The impact of social network structure on the growth and survival of online communities. International Conference on Advances in Social Networks Analysis and Mining, pages 1112–1119, 2019.
-  Z. Tufekci. Big questions for social media big data: Representativeness, validity and other methodological pitfalls. In Eighth International AAAI Conference on Weblogs and Social Media. AAAI, 2014.
-  M. Völske, M. Potthast, S. Syed, and B. Stein. TL;DR: Mining reddit to learn automatic summarization. In Proceedings of the Workshop on New Frontiers in Summarization, pages 59–63, 2017.
-  S. Walker, D. Mercea, and M. Bastos. The disinformation landscape and the lockdown of social platforms. Information, Communication & Society, 22(11):1531–1543, 2019.
-  A. Wang, J. Hula, P. Xia, R. Pappagari, R. T. McCoy, R. Patel, N. Kim, I. Tenney, Y. Huang, K. Yu, et al. Can you tell me how to get past sesame street? sentence-level pretraining beyond language modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4465–4476, 2019.
-  F.-Y. Wang, K. M. Carley, D. Zeng, and W. Mao. Social Computing: From Social Informatics to Social Intelligence. IEEE Intelligent Systems, 22(2):79–83, Mar. 2007.
-  K. Weller and K. E. Kinder-Kurlanda. A manifesto for data sharing in social media research. In Proceedings of the 8th ACM Conference on Web Science, pages 166–172. ACM, 2016.
-  S. Zannettou, J. Blackburn, E. De Cristofaro, M. Sirivianos, and G. Stringhini. Understanding web archiving services and their (mis) use on social media. In Twelfth International AAAI Conference on Web and Social Media, 2018.
-  S. Zannettou, T. Caulfield, J. Blackburn, E. De Cristofaro, M. Sirivianos, G. Stringhini, and G. Suarez-Tangil. On the Origins of Memes by Means of Fringe Web Communities. In Proceedings of the Internet Measurement Conference 2018, IMC ’18, pages 188–202, Boston, MA, USA, 2018. ACM.
-  S. Zannettou, T. Caulfield, W. Setzer, M. Sirivianos, G. Stringhini, and J. Blackburn. Who let the trolls out?: Towards understanding state-sponsored trolls. In Proceedings of the 10th ACM Conference on Web Science, pages 353–362. ACM, 2019.
-  Y. Zhan, Z. Zhang, J. M. Okamoto, D. D. Zeng, and S. J. Leischow. Underage juul use patterns: Content analysis of reddit messages. Journal of medical Internet research, 21(9):e13038, 2019.
-  W. Zheng and K. Zhou. Enhancing conversational dialogue models with grounded knowledge. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 709–718. ACM, 2019.
-  Y. Zhou, M. Dredze, D. A. Broniatowski, and W. D. Adler. Elites and foreign actors among the alt-right: The gab social media platform. First Monday, 24(9), 2019.
-  Y. Zhuang, J. Xie, Y. Zheng, and X. Zhu. Quantifying context overlap for training word embeddings. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 587–593, 2018.
-  M. Zignani, C. Quadri, A. Galdeman, S. Gaito, and G. P. Rossi. Mastodon Content Warnings: Inappropriate Contents in a Microblogging Platform. In Proceedings of the International AAAI Conference on Web and Social Media, volume 13, pages 639–645, 2019.