Recently, social networks have become popular repositories of data used for knowledge capture. The aim of this work is to explore the usability of Reddit as a data source for knowledge exploration.
In this context, we present a review of the literature about Reddit. Moreover, a specialized tool (Reddit-TUDFE) is introduced, which allows a quick check of how well a topic is covered on Reddit. The key contributions of this work are answers to the following research questions (RQs):
RQ1: What is the most popular method to acquire Reddit data?
RQ2: What are the most researched problems using Reddit as a dataset (do they include knowledge capture/management)?
RQ3: How does Reddit usage in science change over time?
RQ4: Are there topics that are not substantially covered on Reddit?
RQ5: Is Reddit used as a single dataset or with other platforms?
2. What is Reddit?
Reddit is a web content rating and discussion website (Medvedev et al., 2017). It was created in 2005 and is ranked as the 17th most visited website in the world, with over 430 million monthly active users (https://www.statista.com/topics/5672/reddit/#dossierSummary). Reddit is divided into thematic subfora (subreddits) dynamically created by its users. Therefore, the topic structure evolves in response to user needs. Each subreddit has its moderators, who may supervise submissions and comments. Comments are linked to submissions, or to earlier comments, forming a tree-like structure.
Most subreddits are public (for registered and non-registered users), with some exceptions (access may depend on karma points, comments, gold, moderator status, time on Reddit, username and others). The tool (introduced in Section 6) is based on publicly available data.
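The tree-like structure of comments can be illustrated with a minimal sketch. The record layout (comment id, parent id, text) and all ids below are hypothetical and chosen only for illustration; they are not the field names of the Reddit API.

```python
from collections import defaultdict

# Hypothetical flat records: (comment_id, parent_id, text).
# A parent_id equal to the submission id marks a top-level comment.
SUBMISSION_ID = "s1"
comments = [
    ("c1", "s1", "First top-level comment"),
    ("c2", "c1", "Reply to c1"),
    ("c3", "c1", "Another reply to c1"),
    ("c4", "s1", "Second top-level comment"),
    ("c5", "c2", "Reply to c2"),
]


def build_children(records):
    """Map each parent id to the ids of its direct replies, in input order."""
    children = defaultdict(list)
    for comment_id, parent_id, _text in records:
        children[parent_id].append(comment_id)
    return children


def depth_of(children, root):
    """Return the number of comment levels below `root` (0 if it has no replies)."""
    return max(
        (1 + depth_of(children, child) for child in children.get(root, [])),
        default=0,
    )


children = build_children(comments)
tree_depth = depth_of(children, SUBMISSION_ID)
```

Storing only the parent-to-children mapping keeps the sketch short; a full tree with text payloads could be rebuilt from the same mapping.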
2.1. Accessibility – Reddit API vs. Pushshift API
Not only is the data on Reddit publicly accessible, but it is also distributed via the official Reddit API. However, it was found that most researchers do not actually use it. Instead, they choose the Pushshift API (Baumgartner et al., 2020). None of the analysed papers states an explicit reason for this choice (very few mention how the dataset was retrieved). However, testing the capabilities of the Reddit API and the Pushshift API shows that the key factor is that the Reddit API does not allow easy retrieval of historical data, while Pushshift does.
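To make the notion of historical retrieval concrete, the sketch below only builds a query URL for a date-bounded search and does not send any request. It assumes the Pushshift submission search endpoint and its `subreddit`, `after`, `before` and `size` parameters as they were publicly documented around the time of this study; the service may since have changed or restricted access. The helper names and the example subreddit and dates are illustrative.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Assumption: endpoint as documented by Pushshift at the time of the study.
# It may no longer be reachable or may require authorization.
PUSHSHIFT_SUBMISSIONS = "https://api.pushshift.io/reddit/search/submission/"


def to_epoch(date_string):
    """Convert a YYYY-MM-DD date (taken as UTC midnight) to epoch seconds."""
    moment = datetime.strptime(date_string, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(moment.timestamp())


def historical_query_url(subreddit, start, end, size=100):
    """Build a URL asking for submissions posted between `start` and `end`."""
    params = {
        "subreddit": subreddit,
        "after": to_epoch(start),
        "before": to_epoch(end),
        "size": size,
    }
    return PUSHSHIFT_SUBMISSIONS + "?" + urlencode(params)


url = historical_query_url("TigerKing", "2020-03-01", "2020-04-01")
```

The explicit time window is the part that matters here: it expresses the kind of historical query that, according to the text, is not easy to perform with the official Reddit API.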
3. Data acquisition and processing
To explore Reddit, as seen by scientists, a dataset of 180 papers was assembled. All of them were related to Reddit and submitted to arXiv between 01-01-2019 and 01-03-2021 (retrieved on 30-03-2021 from https://arxiv.org/search/advanced?&terms-0-term=reddit&classification-computer_science=y&date-from_date=2019-01-01&date-to_date=2021-03-01). This dataset has been processed manually and automatically. First, the collected papers were manually annotated with four attribute sets: topic (a general area of research), methods (theoretical approach, e.g. neural network, text embedding), dataset, and technologies (practical software, e.g. BERT (Devlin et al., 2018)). Next, the obtained results were merged with publicly available data, i.e. the content (title and raw text) and the bibliometric metadata. This allowed extraction of the information presented in Section 4.
Each collected paper has been converted to a raw text file, using PDF Miner (Shinyama, 2015). Next, the key features of titles and texts have been cleaned and mined using the NLTK framework (Loper and Bird, 2002) (for sentiment and subjectivity) and TF-IDF (Rajaraman and Ullman, 2011) for vectorization (both from the sklearn library (Pedregosa et al., 2011)).
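The TF-IDF vectorization step can be sketched in pure Python. This is not the pipeline used in the study: it is a minimal re-implementation of the weighting that scikit-learn's `TfidfVectorizer` applies by default (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization), with simplified whitespace tokenization. The example titles are hypothetical.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


# Hypothetical paper titles, used only to exercise the function.
titles = [
    "reddit conversation analysis",
    "reddit sarcasm detection",
    "reddit covid discussion trends",
]
vectors = tfidf(titles)
```

Because "reddit" occurs in every title, it receives the lowest weight in each vector, which is the behaviour that makes TF-IDF useful for a corpus in which every paper mentions Reddit.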
4. Analysis and findings
As a result of data processing we were able to formulate a number of observations. Let us summarize the most important ones.
4.1. Metadata and bibliometrics
Firstly, let us consider a few notable bibliometric and authorship statistics gathered using Semantic Scholar (https://api.semanticscholar.org/v1/paper/):
There were two major increases in the number of published articles: one after March 2020 (correlated with the outbreak of the COVID-19 pandemic) and the second in October 2020 (correlated with notification dates for many scientific conferences (Viglione, 2020)).
The majority of papers (over 65%) were written by 2-4 authors, with one having 26 authors (Garibay et al., 2020).
The most prolific authors (with over 10 publications) of Reddit-related papers are Savvas Zannettou (Max-Planck-Institute) and Jeremy Blackburn (Binghamton University).
4.2. Analysis of topic, methods and technology
Topics and methods used in the studies are summarized in the subfigures of Figure 1. The top subfigure of Figure 1 shows that the most popular research topic is conversation, which matches the fact that Reddit is a discussion forum. Due to the timing of our work (overlapping with the COVID-19 pandemic), the second most common topic is COVID.
Since Reddit consists of text discussions, it is not surprising that the two most common methods in Reddit-related research are text embeddings, used in text processing, and networks, used for social network analysis. Note that “network” (graph) and “neural network” are separate terms. Regarding technologies, over 45% of studies used Pushshift (Baumgartner et al., 2020) for Reddit data extraction, and over 35% applied BERT (Devlin et al., 2018) embeddings (and their variations) for NLP.
Finally, topics and methods have been combined in a correlation heatmap (Figure 2).
Here, a few significant correlations have been established.
Papers related to drugs typically use word embeddings (which may, however, simply reflect the overall popularity of word embeddings; see the citation count for (Devlin et al., 2018)).
Networks are typically applied in analysis of trends.
Articles dealing with sarcasm often use LSTM networks.
Research devoted to conversation generation typically applies the BLEU metric.
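A topic-method correlation table of this kind can be sketched as follows. The text does not state which coefficient underlies Figure 2, so the sketch assumes the Pearson correlation between binary indicator columns (one column per topic and per method). The annotated papers below are hypothetical and merely mimic the reported tendencies.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: correlation is undefined, reported as 0 here
    return cov / math.sqrt(var_x * var_y)


# Hypothetical manual annotations (not the study's data).
papers = [
    {"topics": {"sarcasm"}, "methods": {"LSTM"}},
    {"topics": {"sarcasm"}, "methods": {"LSTM", "text embedding"}},
    {"topics": {"trends"}, "methods": {"network"}},
    {"topics": {"trends"}, "methods": {"network"}},
    {"topics": {"drugs"}, "methods": {"text embedding"}},
]


def indicator(records, field, label):
    """Return a 0/1 column marking which records carry `label` in `field`."""
    return [1 if label in record[field] else 0 for record in records]


topics = ["sarcasm", "trends", "drugs"]
methods = ["LSTM", "network", "text embedding"]
heatmap = {
    (t, m): pearson(indicator(papers, "topics", t), indicator(papers, "methods", m))
    for t in topics
    for m in methods
}
```

Each entry of `heatmap` corresponds to one cell of a topic-by-method grid; values near 1 indicate that a topic and a method tend to be annotated on the same papers.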
Interestingly, only two papers (1%) are related to knowledge processing (specifically, knowledge graphs (Cao et al., 2020; Zhang et al., 2019)). Expanding the arXiv search to all articles including “knowledge” and “Reddit” resulted in 4 records, none related to knowledge capture. Therefore, top knowledge-related conferences were searched, but only one paper about knowledge and Reddit (Hastings et al., 2011) has been found (published at K-CAP 2011, https://www.k-cap.org/kcap11/index.html). This points to Reddit as an underexplored resource.
The previous observation may seem at odds with the following one. Moving to RQ5, it was discovered that among papers that use Reddit, over 30% also use Twitter, which is a data source mainly used for sentiment analysis (Kharde et al., 2016). Other datasets utilized together with Reddit are: Facebook, 4Chan, YouTube, and Gab. Each of them appears in less than 10% of the papers that used Reddit.
4.3. Linguistic analysis
Interestingly, a number of texts had high subjectivity (see Figure 4), which should not occur in scientific publications (Levitt et al., 2020). However, closer scrutiny revealed that the use of “subjective” words (e.g. “controversial”, “bias”) results in higher subjectivity scores, e.g.: “However, this openness formed a platform for the polarization of opinions and controversial discussions” (Jasser et al., 2020) (score: 0.95).
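Why such words inflate the score can be illustrated with a toy lexicon-based scorer. The text does not describe how the subjectivity score was computed, so the sketch below is only an assumption about the general mechanism: it averages per-word scores from a small lexicon. The lexicon and its values are invented for illustration and do not come from NLTK or any other library, so the resulting number is not expected to match the reported 0.95.

```python
# Hypothetical subjectivity lexicon (0 = objective, 1 = subjective).
# The values are invented for illustration only.
LEXICON = {
    "controversial": 1.0,
    "polarization": 0.9,
    "opinions": 0.9,
    "openness": 0.8,
}


def subjectivity(sentence):
    """Average the lexicon scores of the words found in the lexicon."""
    words = [w.strip(".,;:!?\"'()").lower() for w in sentence.split()]
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


quoted = (
    "However, this openness formed a platform for the polarization "
    "of opinions and controversial discussions"
)
score = subjectivity(quoted)
neutral = subjectivity("The dataset contains 180 papers")
```

Under this mechanism, a sentence that objectively describes a controversy still scores high, because only the presence of the flagged words counts, not the stance of the author.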
5. Reddit – The Ultimate Dataset for Everything
Let us now address RQ3 and RQ4. Even though they cannot be unequivocally answered, they can be experimentally evaluated.
5.1. Reddit in scholarly research
As shown in Figure 1, within arXiv, Reddit has been used to study a variety of research topics. However, Reddit datasets also appear in other sources. To show how the number of scholarly papers (related to Reddit) changed between 2010 and 2021, 10 databases have been analysed. As shown in Figure 5, the number of articles rises dynamically from year to year (RQ3).
The reason for the outlying results of Google Scholar may be one of its primary criticisms (Jacsó, 2005; Shultz, 2007), i.e. incorrect bibliometrics (due to the use of automated algorithms) (López-Cózar et al., 2017; Gray et al., 2012) and its “inability or unwillingness to elaborate on what documents its system crawls” (Gray et al., 2012). Moreover, the number of query results that Google Scholar declares is inconsistent with the number of results it actually returns (e.g. a query returns 1000 results while 58 600 are declared; https://scholar.google.com/scholar?start=990&q=reddit&hl=en&as_sdt=0,5&as_ylo=2020&as_yhi=2020, accessed on 11-09-2021).
5.2. Google Trends
The next experiment explored the presence of popular trends on Reddit. For all Global Google Trends 2020 (https://trends.google.com/trends/yis/2020/GLOBAL/), their Reddit presence has been measured (see Table 7). Overall, 79% of top Google Trends have a dedicated subreddit, while all of them are widely discussed. Table 7 illustrates the top 1 in each Google Trend category.
| Google Trend | Category | On Reddit | Reference |
| Tiger King | tv shows | subreddit | r/TigerKing |
6. Is this on Reddit? – Reddit-TUDFE
To further explore whether Reddit supports capturing “knowledge about any area”, a tool for easy exploratory data analysis (EDA (Cox, 2017)) was designed. Reddit-TUDFE allows quick search of any topic on Reddit, checking if/how it is represented, and how it is discussed. Reddit-TUDFE delivers the following functions:
The code follows state-of-the-art solutions for code sharing (Perkel, 2018) and is publicly available on GitHub (github.com/JanSawicki/reddit-tudfe/) as a Jupyter Notebook (Randles et al., 2017).
To present the capabilities of the application, let us consider a few examples, shown as subfigures of Figure 6.
The left subfigure shows the result for the phrase “music”, a generic term, which is certainly discussed on Reddit. One may see particular genres: rock, pop, rap, relaxing, electronic, etc.
The middle subfigure displays results for the phrase “rock”, a somewhat narrower topic, but still vague and also present on Reddit, including artists/bands like: Rolling Stones, AC/DC, Led Zeppelin, Queen, Pink etc.
The right subfigure contains a strictly specific topic, i.e. the band “The Beatles”, which is also widely covered on Reddit. Here one may see individual members: John Lennon, Paul McCartney, Ringo Starr, and George Harrison.
The wordclouds are built from posts related to a subreddit dedicated (or closest) to the searched topic. Reddit-TUDFE allows one to quickly check if, and how, a particular topic is covered. Note that similar examples can be derived for any other topic, while Reddit shows potential in e.g. building ontologies, or semantic graphs.
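The input to a wordcloud is essentially a table of word frequencies. The sketch below shows one simple way to derive such a table from a list of posts; it is not the implementation in the Reddit-TUDFE repository, and the stop-word list and example posts are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOPWORDS = {"the", "a", "is", "of", "and", "to", "in", "my", "by", "on", "was"}


def word_frequencies(posts, top_n=5):
    """Count non-stop-words across all posts and return the most frequent ones."""
    counts = Counter()
    for post in posts:
        for word in re.findall(r"[a-z']+", post.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(top_n)


# Hypothetical post titles standing in for posts fetched from a subreddit.
posts = [
    "John Lennon wrote the lyrics",
    "Paul McCartney and John Lennon on stage",
    "Ringo Starr is my favourite drummer",
    "George Harrison and John Lennon",
]
top_words = word_frequencies(posts, top_n=3)
```

The resulting (word, count) pairs can be handed to any wordcloud renderer, where the count typically determines the font size of each word.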
7. Concluding remarks
This work provides evidence that Reddit is a robust, but underutilized resource for knowledge capture, in almost any field of interest. In this context, the following conclusions can be formulated:
Reddit offers publicly available data, which can be easily retrieved with the Pushshift API.
The most popular techniques for Reddit knowledge capture are: word embeddings, graph networks, and neural networks.
Reddit covers the majority (79%) of topics that appear in Global Google Trends, sustaining the claim that Reddit is a robust source of knowledge about “everything”.
Reddit research is becoming more popular over time (based on the count of published articles).
Reddit is most commonly used in tandem with Twitter.
This analysis and the Reddit-TUDFE tool provide a foundation for future research on Reddit and its potential in fully automatic knowledge extraction and knowledge graph building.
- The Pushshift Reddit dataset. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 14, pp. 830–839.
- Building and using personal knowledge graph to improve suicidal ideation detection on social media. IEEE Transactions on Multimedia.
- Exploratory data analysis. In Translating Statistics to Make Decisions, pp. 47–74.
- BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Deep agent: studying the dynamics of information spread and evolution in social networks. arXiv preprint arXiv:2003.11611.
- Scholarish: Google Scholar and its value to the sciences. Issues in Science and Technology Librarianship 70 (Summer).
- How to model the shapes of molecules? Combining topology and ontology using heterogeneous specifications. In Proc. of the Deep Knowledge Representation Challenge Workshop (DKR-11), K-CAP-11.
- Google Scholar: the pros and the cons. Online Information Review.
- Controversial information spreads faster and further in Reddit. arXiv preprint arXiv:2006.13991.
- Sentiment analysis of Twitter data: a survey of techniques. arXiv preprint arXiv:1601.06971.
- The meaning of scientific objectivity and subjectivity: from the perspective of methodologists. Psychological Methods.
- NLTK: the natural language toolkit. arXiv preprint cs/0205028.
- Google Scholar: the big data bibliographic tool. Research analytics: boosting university productivity and competitiveness through scientometrics, pp. 59.
- The anatomy of Reddit: an overview of academic research. In Dynamics on and of Complex Networks, pp. 183–204.
- Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
- Why Jupyter is data scientists’ computational notebook of choice. Nature 563 (7732), pp. 145–147.
- Data mining. In Mining of Massive Datasets, pp. 1–17.
- Using the Jupyter notebook as a tool for open science: an empirical study. In 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pp. 1–2.
- PDFMiner: Python PDF parser and analyzer. Retrieved on 11.
- Comparing test searches in PubMed and Google Scholar. Journal of the Medical Library Association: JMLA 95 (4), pp. 442.
- How scientific conferences will survive the coronavirus shock. Nature 582 (7811), pp. 166–168.
- Grounded conversation generation as guided traverses in commonsense knowledge graphs. arXiv preprint arXiv:1911.02707.