Reddit-TUDFE: practical tool to explore Reddit usability in data science and knowledge processing

by Jan Sawicki, et al.
Politechnika Warszawska

This contribution argues that Reddit, as a massive, categorized, open-access dataset, can be used for knowledge capture on "almost any topic". The presented analysis is based on 180 manually annotated papers related to Reddit, together with data acquired from top databases of scientific papers. Moreover, an open-source tool is introduced, which provides easy access to Reddit resources and supports exploratory data analysis of how Reddit covers selected topics.






1. Introduction

Recently, social networks have become popular repositories of data used for knowledge capture. The aim of this work is to explore the usability of Reddit as a knowledge exploration data source. In this context, we present a review of the literature about Reddit. Moreover, a specialized tool (Reddit-TUDFE) is introduced, which allows a fast check of Reddit's coverage of a given topic. The key contributions of this work are answers to the following research questions (RQs):
RQ1: What is the most popular method to acquire Reddit data?
RQ2: What are the most researched problems using Reddit as a dataset (do they include knowledge capture/management)?
RQ3: How does Reddit usage in science change over time?
RQ4: Are there topics that are not substantially covered on Reddit?
RQ5: Is Reddit used as a single dataset or with other platforms?

2. What is Reddit?

Reddit is a web content rating and discussion website (Medvedev et al., 2017). It was created in 2005 and is ranked as the 17th most visited website in the world, with over 430 million monthly active users. Reddit is divided into thematic subfora (subreddits) dynamically created by its users. Therefore, the topic structure evolves in response to user needs. Each subreddit has its moderators, who may supervise submissions and comments. Comments are linked to submissions, or to earlier comments, forming a tree-like structure.
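The submission-comment structure described above can be sketched as a simple tree; the names below (`Node`, `add_reply`) are illustrative and not part of any Reddit client library.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A submission or a comment; each comment attaches to a parent node."""
    node_id: str
    body: str
    children: list = field(default_factory=list)

    def add_reply(self, reply: "Node") -> "Node":
        self.children.append(reply)
        return reply

    def count(self) -> int:
        """Total nodes in this subtree (the submission plus all comments)."""
        return 1 + sum(c.count() for c in self.children)

# A submission with one top-level comment and one nested reply:
post = Node("t3_abc", "submission")
c1 = post.add_reply(Node("t1_def", "top-level comment"))
c1.add_reply(Node("t1_ghi", "reply to the comment"))
print(post.count())  # → 3
```

Traversing such a tree recursively recovers the full discussion under a submission, which is what makes Reddit threads convenient for conversation-structure research.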

Most of the subreddits are public (for registered and non-registered users), with some exceptions (based on karma points, comments, gold, moderator status, time on Reddit, username, and others). The tool (introduced in Section 6) is based on publicly available data.

2.1. Accessibility – Reddit API vs. Pushshift API

Not only is the data on Reddit publicly accessible, but it is also distributed via the official Reddit API. However, most researchers do not actually use it; instead, they choose the Pushshift API (Baumgartner et al., 2020). None of the analysed papers states an explicit reason for this choice (very few mention how the dataset was retrieved). However, testing the capabilities of both APIs shows that the key factor is that the Reddit API does not allow easy retrieval of historical data, while Pushshift does.
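To illustrate the difference, a time-windowed historical query is a single request against Pushshift's submission-search endpoint. The sketch below only builds the request URL (the endpoint and its `subreddit`/`after`/`before`/`size` parameters follow the dataset description in Baumgartner et al., 2020); actually fetching it requires network access.

```python
from urllib.parse import urlencode

# Pushshift submission-search endpoint; the Reddit API offers no comparable
# time-window query over arbitrary historical posts.
PUSHSHIFT_SUBMISSIONS = "https://api.pushshift.io/reddit/search/submission"

def historical_query(subreddit: str, after: int, before: int, size: int = 100) -> str:
    """Build a URL fetching up to `size` posts from a subreddit in a UTC-epoch window."""
    params = {"subreddit": subreddit, "after": after, "before": before, "size": size}
    return f"{PUSHSHIFT_SUBMISSIONS}?{urlencode(params)}"

# January 2020 posts from a hypothetical target subreddit:
url = historical_query("datascience", after=1577836800, before=1580515200)
# Fetching it would return JSON with a "data" list of posts, e.g.:
# import requests; posts = requests.get(url).json()["data"]
```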

3. Data acquisition and processing

To explore Reddit as seen by scientists, a dataset of 180 papers was assembled. All of them were related to Reddit and submitted to arXiv between 01-01-2019 and 01-03-2021 (retrieved on 30-03-2021). This dataset has been processed manually and automatically. First, the collected papers were manually annotated with four attribute sets: topic (a general area of research), methods (theoretical approach, e.g. neural network, text embedding), dataset, and technologies (practical software, e.g. BERT (Devlin et al., 2018)). Next, the obtained results were merged with publicly available data, i.e. the content (title and raw text) and the bibliometric metadata. This allowed extraction of the information presented in Section 4.

Each collected paper was converted to a raw text file using PDF Miner (Shinyama, 2015). Next, the key features of titles and texts were cleaned and mined using the NLTK framework (Loper and Bird, 2002) (for sentiment and subjectivity) and TF-IDF (Rajaraman and Ullman, 2011) for vectorization (from the sklearn library (Pedregosa et al., 2011)).

4. Analysis and findings

As a result of data processing we were able to formulate a number of observations. Let us summarize the most important ones.

4.1. Metadata and bibliometrics

Firstly, let us consider a few noticeable bibliometric and authorship statistics gathered using Semantic Scholar:

  • There were two major surges in the published-article count: one after March 2020 (correlated with the outbreak of the COVID-19 pandemic) and a second in October 2020 (correlated with the notification dates of many scientific conferences (Viglione, 2020)).

  • The majority of papers (over 65%) were written by 2-4 authors, with one having 26 authors (Garibay et al., 2020).

  • The most prolific authors (with over 10 publications) of Reddit-related papers are Savvas Zannettou (Max Planck Institute) and Jeremy Blackburn (Binghamton University).

4.2. Analysis of topic, methods and technology

Topics and methods used in the studies are summarized in the subfigures of Figure 1. The top subfigure of Figure 1 shows that the most popular research topic is conversation, which matches the fact that Reddit is a discussion forum. Due to the timing of our work (overlapping with the COVID-19 pandemic), the second most common topic is COVID.

Figure 1. Article count by article topic (top) and methods (bottom)

Since Reddit consists of text discussions, it is not surprising that the two most common methods in Reddit-related research are text embeddings, used in text processing, and networks, used for social network analysis. Note that “network” (graph) and “neural network” are separate terms. Regarding technologies, over 45% of studies used Pushshift (Baumgartner et al., 2020) for Reddit data extraction, and over 35% applied BERT (Devlin et al., 2018) embeddings (and their variations) for NLP.

Finally, topics and methods have been combined in a correlation heatmap (Figure 2).

Figure 2. Research methods correlated with article topics

Here, a few significant correlations have been established.

  • Papers related to drugs typically use word embeddings (which can, however, be related to the overall popularity of word embeddings; see the citation count for (Devlin et al., 2018)).

  • Networks are typically applied in analysis of trends.

  • Articles dealing with sarcasm often use LSTM networks.

  • Research devoted to conversation generation typically applies the BLEU metric.
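The heatmap in Figure 2 is, in essence, a topic-by-method co-occurrence table over the annotated papers. A sketch with pandas on a hypothetical handful of annotations (the rows below are invented, though they mirror the correlations listed above):

```python
import pandas as pd

# Hypothetical per-paper annotations in the style of Section 3 (invented data).
papers = pd.DataFrame({
    "topic":  ["drugs", "drugs", "sarcasm", "trends", "conversation"],
    "method": ["word embedding", "word embedding", "LSTM", "network", "BLEU"],
})

# Topic x method co-occurrence counts; rendered as a heatmap,
# this gives the layout of Figure 2.
cooccurrence = pd.crosstab(papers["topic"], papers["method"])
```

Cells with high counts correspond to the bright regions of such a heatmap, e.g. here the drugs/word-embedding cell holds 2.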

Interestingly, only two papers (1%) are related to knowledge processing (specifically, knowledge graphs (Cao et al., 2020; Zhang et al., 2019)). Expanding the arXiv search to all articles including “knowledge” and “Reddit” resulted in 4 records, none related to knowledge capture. Therefore, top knowledge-related conferences were searched, but only one paper about knowledge and Reddit (Hastings et al., 2011) has been found (published at K-CAP 2011). This points to Reddit as an underexplored resource.

The previous remark may seem counter-intuitive in light of the following one. Moving to RQ5, it was discovered that among papers that use Reddit, over 30% also use Twitter, a data source mainly used for sentiment analysis (Kharde et al., 2016). Other datasets utilized together with Reddit are: Facebook, 4Chan, YouTube, and Gab. Each of them appears in less than 10% of the papers which used Reddit.

4.3. Linguistic analysis

Linguistic analysis was performed using the NLTK framework (Loper and Bird, 2002). The sentiment of the papers’ texts is mostly neutral (Figure 3).

Figure 3. Polarity histogram (scale normalized)

Interestingly, a number of texts had high subjectivity (see Figure 4), which should not occur in scientific publications (Levitt et al., 2020). However, closer scrutiny revealed that the use of “subjective” words (e.g. “controversial”, “bias”) results in higher subjectivity scores, e.g.: “However, this openness formed a platform for the polarization of opinions and controversial discussions” (Jasser et al., 2020) (score: 0.95).

Figure 4. Subjectivity histogram (scale normalized)

5. Reddit – The Ultimate Dataset for Everything

Let us now address RQ3 and RQ4. Even though they cannot be unequivocally answered, they can be experimentally evaluated.

5.1. Reddit in scholarly research

As shown in Figure 1, within arXiv, Reddit has been used to study a variety of research topics. However, the Reddit dataset is also present in other sources. To show how the number of scholarly papers related to Reddit changed between 2010 and 2021, 10 databases have been analysed. As shown in Figure 5, the number of articles published each year rises dynamically (RQ3).

Figure 5. Non-cumulative count of papers related to Reddit in scientific databases in years 2010-2021.

The reason for the outlying results of Google Scholar may be one of its primary criticisms (Jacsó, 2005; Shultz, 2007), i.e. incorrect bibliometrics (due to the use of automated algorithms) (López-Cózar et al., 2017; Gray et al., 2012) and its “inability or unwillingness to elaborate on what documents its system crawls” (Gray et al., 2012). Moreover, Google Scholar declares the number of query results inconsistently with the actual number of returned results (e.g. a query returns 1,000 results while declaring 58,600; accessed on 11-09-2021).

5.2. Google Trends

The next experiment explored the presence of popular trends on Reddit. For all Global Google Trends 2020, their Reddit presence has been measured (see Table 1). Overall, 79% of top Google Trends have a dedicated subreddit, while all of them are widely discussed. Table 1 lists the top entry in each Google Trends category.

Google Trend      Category    On Reddit     Reference
Coronavirus       searches    subreddit     r/Coronavirus
Tom Hanks         actors      subreddit     r/tomhanks
Ryan Newman       athletes    subreddit     r/RyanNewman
Parasite          movies      subreddit     r/parasite
Tiger King        tv shows    subreddit     r/TigerKing
Joe Biden         people      subreddit     r/JoeBiden
Coronavirus       news        subreddit     r/Coronavirus
Among Us          games       subreddit     r/AmongUs
Dalgona coffee    recipes     discussion    r/cafe
Kobe Bryant       loss        subreddit     r/kobebryant
Table 1. Global Google Trends 2020 (top one in each Google Trends category) and their appearance on Reddit (“subreddit” - there exists a dedicated subforum, “discussion” - the topic is present in (a) subreddit(s) of a broader topic)

6. Is this on Reddit? – Reddit-TUDFE

To further explore whether Reddit supports capturing “knowledge about any area”, a tool for easy exploratory data analysis (EDA (Cox, 2017)) was designed. Reddit-TUDFE allows a quick search for any topic on Reddit, checking if/how it is represented and how it is discussed. Reddit-TUDFE delivers the following functions:

  1. Uses the Reddit API to search for the best matching subreddit.

  2. Downloads the newest posts from the subreddit, using the Pushshift API and a combination of PRAW and PSAW.

  3. Performs basic text cleaning (tokenization with NLTK (Loper and Bird, 2002), removal of stopwords, punctuation, and numbers).

  4. Generates and displays wordclouds of post titles and content.
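The cleaning in step 3 can be approximated without NLTK's tokenizer data downloads; the stopword list below is a tiny illustrative subset of NLTK's English list, not the one the tool actually ships.

```python
import re

# Tiny illustrative subset of NLTK's English stopword list.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "in", "on", "and", "to"}

def clean(text: str) -> list:
    """Lowercase, keep alphabetic tokens of 2+ letters, drop stopwords.

    This mirrors step 3: tokenization plus removal of stopwords,
    punctuation, and numbers.
    """
    tokens = re.findall(r"[a-z]{2,}", text.lower())  # drops digits and punctuation
    return [t for t in tokens if t not in STOPWORDS]

clean("The Beatles released 13 albums in the 1960s!")
# → ['beatles', 'released', 'albums']
```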

The code follows state-of-the-art solutions for code sharing (Perkel, 2018) and is publicly available on GitHub as a Jupyter Notebook (Randles et al., 2017).

To present the capabilities of the application, let us discuss a few examples, shown as subfigures of Figure 6:

  • The left subfigure shows the result for the phrase “music”, a generic term, which is certainly discussed on Reddit. One may see particular genres: rock, pop, rap, relaxing, electronic, etc.

  • The middle subfigure displays results for the phrase “rock”, a somewhat narrower topic, but still vague and also present on Reddit, including artists/bands like: the Rolling Stones, AC/DC, Led Zeppelin, Queen, Pink, etc.

  • The right subfigure covers a strictly specific topic, i.e. the band “The Beatles”, which is also widely covered on Reddit. Here one may see the individual members: John Lennon, Paul McCartney, Ringo Starr, and George Harrison.

Figure 6. Wordclouds of 200 posts (before 01-09-2021) concerning (left to right): “music”, “rock” and “The Beatles”.

The wordclouds are built from posts related to the subreddit dedicated (or closest) to the searched topic. Reddit-TUDFE thus allows one to quickly check if, and how, a particular topic is covered. Note that similar examples can be derived for any other topic, and Reddit shows potential in, e.g., building ontologies or semantic graphs.
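At its core, the wordcloud step reduces to counting cleaned tokens. The sketch below uses invented tokens, and the commented lines show how such frequencies would feed the `wordcloud` package used for rendering.

```python
from collections import Counter

# Invented cleaned tokens from hypothetical r/beatles post titles.
tokens = ["lennon", "mccartney", "lennon", "harrison", "lennon", "starr", "mccartney"]

freqs = Counter(tokens)
print(freqs.most_common(2))  # → [('lennon', 3), ('mccartney', 2)]

# These frequencies can be rendered directly (requires the `wordcloud` package):
# from wordcloud import WordCloud
# WordCloud().generate_from_frequencies(freqs).to_file("cloud.png")
```

More frequent tokens are drawn larger, so the dominant terms of a subreddit stand out at a glance.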

7. Concluding remarks

This work provides evidence that Reddit is a robust, but underutilized, resource for knowledge capture in almost any field of interest. In this context, the following conclusions can be formulated:

  • Reddit offers publicly available data, which can be easily retrieved with the Pushshift API.

  • The most popular techniques for Reddit knowledge capture are: word embeddings, graph networks, and neural networks.

  • Reddit covers the majority (79%) of topics that appear in Global Google Trends, supporting the claim that Reddit is a robust source of knowledge about “everything”.

  • Reddit research is becoming more popular over time (based on the count of published articles).

  • Reddit is most commonly used in tandem with Twitter.

This analysis and the Reddit-TUDFE tool provide a foundation for future research on Reddit and its potential for fully automatic knowledge extraction and knowledge graph building.


  • J. Baumgartner, S. Zannettou, B. Keegan, M. Squire, and J. Blackburn (2020) The pushshift reddit dataset. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 14, pp. 830–839. Cited by: §2.1, §4.2.
  • L. Cao, H. Zhang, and L. Feng (2020) Building and using personal knowledge graph to improve suicidal ideation detection on social media. IEEE Transactions on Multimedia. Cited by: §4.2.
  • V. Cox (2017) Exploratory data analysis. In Translating Statistics to Make Decisions, pp. 47–74. Cited by: §6.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §3, 1st item, §4.2.
  • I. Garibay, T. A. Oghaz, N. Yousefi, E. C. Mutlu, M. Schiappa, S. Scheinert, G. C. Anagnostopoulos, C. Bouwens, S. M. Fiore, A. Mantzaris, et al. (2020) Deep agent: studying the dynamics of information spread and evolution in social networks. arXiv preprint arXiv:2003.11611. Cited by: 2nd item.
  • J. E. Gray, M. C. Hamilton, A. Hauser, M. M. Janz, J. P. Peters, and F. Taggart (2012) Scholarish: google scholar and its value to the sciences. Issues in Science and Technology Librarianship 70 (Summer). Cited by: §5.1.
  • J. Hastings, O. Kutz, and T. Mossakowski (2011) How to model the shapes of molecules? combining topology and ontology using heterogeneous specifications. In In Proc. of the Deep Knowledge Representation Challenge Workshop (DKR-11), K-CAP-11, Cited by: §4.2.
  • P. Jacsó (2005) Google scholar: the pros and the cons. Online information review. Cited by: §5.1.
  • J. Jasser, I. Garibay, S. Scheinert, and A. V. Mantzaris (2020) Controversial information spreads faster and further in reddit. arXiv preprint arXiv:2006.13991. Cited by: §4.3.
  • V. Kharde, P. Sonawane, et al. (2016) Sentiment analysis of twitter data: a survey of techniques. arXiv preprint arXiv:1601.06971. Cited by: §4.2.
  • H. M. Levitt, F. I. Surace, M. B. Wu, B. Chapin, J. G. Hargrove, C. Herbitter, E. C. Lu, M. R. Maroney, and A. L. Hochman (2020) The meaning of scientific objectivity and subjectivity: from the perspective of methodologists.. Psychological methods. Cited by: §4.3.
  • E. Loper and S. Bird (2002) Nltk: the natural language toolkit. arXiv preprint cs/0205028. Cited by: §3, §4.3, item 3.
  • E. D. López-Cózar, E. Orduña-Malea, A. Martín-Martín, and J. M. Ayllón (2017) Google scholar: the big data bibliographic tool. Research analytics: boosting university productivity and competitiveness through scientometrics, pp. 59. Cited by: §5.1.
  • A. N. Medvedev, R. Lambiotte, and J. Delvenne (2017) The anatomy of reddit: an overview of academic research. In Dynamics on and of Complex Networks, pp. 183–204. Cited by: §2.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §3.
  • J. M. Perkel (2018) Why jupyter is data scientists’ computational notebook of choice.. Nature 563 (7732), pp. 145–147. Cited by: §6.
  • A. Rajaraman and J. D. Ullman (2011) Data mining. In Mining of Massive Datasets, pp. 1–17. External Links: Document Cited by: §3.
  • B. M. Randles, I. V. Pasquetto, M. S. Golshan, and C. L. Borgman (2017) Using the jupyter notebook as a tool for open science: an empirical study. In 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pp. 1–2. Cited by: §6.
  • Y. Shinyama (2015) Pdfminer: python pdf parser and analyzer. Retrieved on 11. Cited by: §3.
  • M. Shultz (2007) Comparing test searches in pubmed and google scholar. Journal of the Medical Library Association: JMLA 95 (4), pp. 442. Cited by: §5.1.
  • G. Viglione (2020) How scientific conferences will survive the coronavirus shock.. Nature 582 (7811), pp. 166–168. Cited by: 1st item.
  • H. Zhang, Z. Liu, C. Xiong, and Z. Liu (2019) Grounded conversation generation as guided traverses in commonsense knowledge graphs. arXiv preprint arXiv:1911.02707. Cited by: §4.2.