A Personalized Subreddit Recommendation Engine

05/03/2019 ∙ by Abhishek K Das, et al. ∙ Texas A&M University

This paper aims to improve upon the generic recommendations that Reddit provides for its users. We propose a novel personalized recommender system that learns from both the presence and the content of user-subreddit interactions, using implicit and explicit signals to provide robust recommendations.




I Introduction

Reddit, with more than 540 million active users and 138K+ subreddits, is the third most frequently visited website in the US. The website comprises more than a million 'subreddits', each of which caters to a specific topic. For example, r/soccer is the subreddit for the game of soccer. Each subreddit has anywhere from a handful to millions of users who post content and comment on the posted content, and these comments form the aggregated dataset for our work.

Although its tagline is 'The front page of the internet', Reddit's subreddit recommendation system is lacking and usually provides generic recommendations. While websites such as Amazon and LinkedIn have mastered the art of making personalized recommendations based on user profiles, browsing and interaction history, Reddit still relies on trending topics and moderators to generate a list of recommendations. Moreover, these are not unique or personalized to the user's taste.

Our project proposes a novel model that takes in implicit and explicit factors from the interactions that a user has had with a subreddit, and tackles the problem of generating personalized recommendations in several ways. We use methods from traditional Information Retrieval and Natural Language Processing to test various models and evaluate them on the basis of their performance. We propose a hybrid recommender system, which relies on both collaborative filtering and content-based features, to build a system that can handle the cold-start problem as well as limited computational resources.

II Related Work

Previous work has sought to provide and improve subreddit recommendations. Akash Japi, in his blog article [6], wrote a web crawler to scrape the comments of users who appeared among the top comments of the subreddit r/all, which itself aggregates the top posts of Reddit. He then created a vector for each user and used k-NN to find similar users.

Furthermore, although it addresses a different domain, we found the work presented by He and McAuley in VBPR: Visual Bayesian Personalized Ranking from Implicit Feedback [2] quite relevant. The authors use a visual embedding of items to make better recommendations by grouping visually similar items together. Based on this system, we develop a Textual Bayesian Personalized Ranking recommendation system, which creates lower-dimensional embeddings for both users and subreddits to suggest similar subreddits.

III Methodology

III-A Dataset

We obtained a dataset comprising user comments on Reddit from the month of January 2015. It contains 57 million comments from Reddit users. Since Reddit does not release user-subreddit subscription data, implicit and explicit indicators of user interest have to be derived from this dataset. We further processed this dataset to improve its quality by performing the following steps.

  • Removing user-subreddit interactions (comments) shorter than 30 characters, and users with fewer than 5 comments

  • Removing bots and [deleted] comments

  • Dividing the dataset into two parts: dataset A with interactions only, and dataset B with comments as well as interactions

Our final datasets consisted of 28 million comments, 735834 users and 14842 subreddits. All the comments were converted to lowercase, and stopwords and punctuation were removed.
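The filtering steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names `author`, `subreddit` and `body`, the tiny stopword set, the bot list, and the order in which the filters are applied are all hypothetical choices, not details given in the paper.

```python
import string
from collections import Counter

# Hypothetical stopword and bot lists, for illustration only.
STOPWORDS = {"the", "a", "an", "is", "to", "and", "of", "in", "it"}
KNOWN_BOTS = {"AutoModerator"}

MIN_CHARS = 30     # drop comments shorter than this
MIN_COMMENTS = 5   # drop users with fewer comments than this


def clean_text(body):
    """Lowercase the text, strip punctuation, and drop stopwords."""
    table = str.maketrans("", "", string.punctuation)
    tokens = body.lower().translate(table).split()
    return " ".join(t for t in tokens if t not in STOPWORDS)


def preprocess(comments):
    """comments: iterable of dicts with 'author', 'subreddit', 'body' keys."""
    kept = [
        c for c in comments
        if c["author"] not in KNOWN_BOTS
        and c["author"] != "[deleted]"
        and c["body"] != "[deleted]"
        and len(c["body"]) >= MIN_CHARS
    ]
    # Assumption: the per-user comment count is taken after the other filters.
    per_user = Counter(c["author"] for c in kept)
    kept = [c for c in kept if per_user[c["author"]] >= MIN_COMMENTS]
    # Dataset A: interactions only. Dataset B: interactions plus cleaned text.
    dataset_a = sorted({(c["author"], c["subreddit"]) for c in kept})
    dataset_b = [(c["author"], c["subreddit"], clean_text(c["body"])) for c in kept]
    return dataset_a, dataset_b
```

Dataset A in this sketch is a deduplicated list of (user, subreddit) pairs, which is the form a collaborative filtering model would consume, while dataset B keeps one cleaned comment per row for the text-based models.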

The final dataset contained subreddits of varying popularity. One aim of our project is to avoid simply recommending the most popular, or most interacted with, subreddits. We explored the dataset to understand what fraction of the data was dominated by the most popular subreddits. This is shown in Figure 1, which indicates that nearly 20% of the comments in our dataset were made on the top 30 subreddits.

Fig. 1: Most popular subreddits by comments in January 2015

III-B Matrix Factorization - ALS

To create a baseline for our recommender system, and to leverage the large number of informative interactions between users and subreddits, we use dataset A to implement a user-subreddit collaborative filtering model based on Matrix Factorization, with Alternating Least Squares (ALS) as the optimization method [cite]. For this model, we treat the presence of an interaction between a user and a subreddit as a positive rating of 1, and the absence of an interaction as a rating of 0.

These ratings are then used to minimize the objective function described in [Implicit cf paper cite], which learns the user factors and the subreddit factors by projecting users and subreddits into a common latent factor space where they can be directly compared.
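As a rough illustration of the alternating updates, the sketch below factorizes a binary user-by-subreddit matrix with plain regularized ALS in NumPy. It is a simplification and not the authors' implementation: it gives every entry the same weight, whereas implicit-feedback formulations usually add per-entry confidence weights, and the function name `als` and its default hyperparameters are assumptions.

```python
import numpy as np


def als(R, k=8, reg=0.1, iters=10, seed=0):
    """Factorize a binary user-by-subreddit matrix R as U @ V.T using ALS."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, k))  # user factors
    V = rng.normal(scale=0.1, size=(n_items, k))  # subreddit factors
    ridge = reg * np.eye(k)
    for _ in range(iters):
        # Fix V and solve the ridge-regression normal equations for all users.
        U = np.linalg.solve(V.T @ V + ridge, V.T @ R.T).T
        # Fix U and solve the same problem for all subreddits.
        V = np.linalg.solve(U.T @ U + ridge, U.T @ R).T
    return U, V
```

Each half-step has a closed-form solution because the objective is quadratic in one factor matrix once the other is held fixed, which is what makes ALS attractive as a baseline. The score of a user for a subreddit is the inner product of the corresponding rows of `U` and `V`.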

III-C Bayesian Personalized Ranking - BPR

Fig. 2: Bayesian Personalized Rankings

Bayesian Personalized Ranking (BPR) aims to estimate a personalized ranking function for each user over all pairs of items $i$ and $j$. In the BPR paper [1], the authors use implicit signals to compare pairs of items for each user and derive rankings, as shown in Figure 2. For each user $u$, the value $\hat{x}_{uij}$ is calculated. This value is positive if user $u$ prefers item $i$ over item $j$, and negative otherwise. It can be calculated by any method that approximates the relationship of a user with any two items. In our experiments, we estimate this value as follows:

$$\hat{x}_{uij} = \hat{x}_{ui} - \hat{x}_{uj}$$

where the $\hat{x}_{ui}$ and $\hat{x}_{uj}$ terms are computed from Matrix Factorization based latent factors.
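A minimal sketch of how such pairwise preferences can be learnt with stochastic gradient ascent on matrix-factorization scores is given below. It follows the general BPR recipe of sampling a user, an observed item and an unobserved item; the function name `train_bpr`, the uniform sampling scheme and all hyperparameter values are assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def train_bpr(positives, n_users, n_items, k=8, lr=0.05, reg=0.01,
              steps=20000, seed=0):
    """positives: dict mapping user index -> set of interacted item indices."""
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_users, k))  # user factors
    Q = rng.normal(scale=0.1, size=(n_items, k))  # item factors
    # Only users with at least one observed and one unobserved item can be sampled.
    users = [u for u in positives if 0 < len(positives[u]) < n_items]
    for _ in range(steps):
        u = users[int(rng.integers(len(users)))]
        i = int(rng.choice(sorted(positives[u])))  # observed item
        j = int(rng.integers(n_items))             # candidate unobserved item
        while j in positives[u]:
            j = int(rng.integers(n_items))
        pu, qi, qj = P[u].copy(), Q[i].copy(), Q[j].copy()
        x_uij = pu @ (qi - qj)                     # x_ui - x_uj
        g = sigmoid(-x_uij)                        # derivative of ln(sigmoid(x)) at x_uij
        # Gradient ascent on ln(sigmoid(x_uij)) with L2 regularization.
        P[u] += lr * (g * (qi - qj) - reg * pu)
        Q[i] += lr * (g * pu - reg * qi)
        Q[j] += lr * (-g * pu - reg * qj)
    return P, Q
```

After training, `P @ Q.T` gives a score for every user and item, and ranking a user's row in descending order yields that user's personalized ranking.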

III-D Textual Bayesian Personalized Ranking (t-BPR)

The concept of t-BPR takes inspiration from VBPR [cite paper]: we combine the implicit knowledge of a user's preference for a certain subreddit with explicit features, that is, the content of the comments. Similar to the visual factors in VBPR, we use textual factors for both the items (in this case, subreddits) and the users, and incorporate them into our algorithm. We trained both sets of textual factors, Subreddit2Vec and User2Vec, using the Doc2Vec algorithm [cite paper]. We then use these factors in our textual BPR algorithm in two ways: vanilla t-BPR and learnt t-BPR.

Fig. 3: Textual BPR architecture

III-D1 Vanilla t-BPR

Using the independently learnt textual factors for each subreddit and each user, we calculated the values of $\hat{x}_{uij}$ for all pairs of subreddits $i$ and $j$, for each user. This is done by calculating the preference of any user $u$ for any subreddit $i$ using the following equation:

$$\hat{x}_{ui} = \alpha + \beta_u + \beta_i + \gamma_u^\top \gamma_i + \theta_u^\top \theta_i$$

where $\alpha$ is the global bias, $\beta_u$ is the bias for user $u$, $\beta_i$ is the bias for subreddit $i$, and the $\gamma$ values are the subreddit2vec and user2vec values for that particular subreddit and user. $\theta_u$ and $\theta_i$ are the newly introduced textual factors whose inner product models the interaction between the users and the subreddits in the form of comments' representations in $D$ dimensions.
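The scoring rule described above can be illustrated with a short sketch. It assumes a VBPR-style additive form: a global bias, a user bias, a subreddit bias, and two inner products, one over the user2vec and subreddit2vec vectors (called `gamma` here) and one over the textual factors (called `theta` here). The dictionary layout and the function names are hypothetical.

```python
import numpy as np


def predict(user, sub, alpha=0.0):
    """Preference of one user for one subreddit: biases plus two inner products.

    user, sub: dicts with a scalar 'beta' and vectors 'gamma' and 'theta'
    (assumed names for the bias, the *2vec vector, and the textual factors).
    """
    return (alpha + user["beta"] + sub["beta"]
            + user["gamma"] @ sub["gamma"]
            + user["theta"] @ sub["theta"])


def pairwise_preference(user, sub_i, sub_j, alpha=0.0):
    """x_uij = x_ui - x_uj; positive when the user prefers sub_i over sub_j."""
    return predict(user, sub_i, alpha) - predict(user, sub_j, alpha)
```

Note that the global bias and the user bias cancel in the pairwise difference, so only the subreddit bias and the two interaction terms influence the ranking for a given user.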

The estimated values of $\hat{x}_{uij}$ are then evaluated using the AUC methodology described in the Evaluation section.

III-D2 Learnt t-BPR

We further extend t-BPR to learn a user embedding kernel that linearly transforms the high-dimensional features of the users into a much lower-dimensional 'textual rating' space. This is represented by the following:

The final prediction model is:
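By analogy with the item-side embedding used in VBPR [2], one plausible way to write the user-side projection and the resulting predictor is sketched here. The symbols $f_u$ (the high-dimensional textual feature vector of user $u$, for instance its User2Vec vector) and $E$ (the learnt embedding kernel, a matrix that maps $f_u$ into $D$ dimensions) are assumed names, and the exact form used by the authors may differ.

```latex
% Assumed notation: f_u is the raw textual feature vector of user u,
% E is the learnt D x F embedding kernel.
\theta_u = E f_u

\hat{x}_{ui} = \alpha + \beta_u + \beta_i + \gamma_u^\top \gamma_i + (E f_u)^\top \theta_i
```

Under this reading, the only change from the vanilla variant is that the user-side textual factors are no longer fixed inputs but are produced by a projection whose weights are trained together with the other parameters.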

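The following optimization criterion is used for personalized ranking:

$$\sum_{(u,i,j) \in D_S} \ln \sigma(\hat{x}_{uij}) - \lambda_\Theta \lVert \Theta \rVert^2$$

where $\sigma$ is the logistic (sigmoid) function, $D_S$ is the set of training triples $(u, i, j)$ in which user $u$ interacted with subreddit $i$ but not with subreddit $j$, $\Theta$ denotes the model parameters, and $\lambda_\Theta$ is a model-specific regularization hyperparameter.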


Evaluation

For evaluation, we split our data into two sets: training and test data. We start with the list of subreddits each user interacts with, which serves as our proxy for subscriptions. We take out 10% of the subreddits associated with each user and add them to our test set. We used stratified sampling to split the dataset, in order to ensure that there is enough training data for each user. Once training is complete, we test how many of the subreddits we removed are present in our recommendations.
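A per-user hold-out of this kind can be sketched as follows. The function name `split_per_user`, the rule of holding out at least one subreddit when possible, and the rule of always keeping at least one subreddit for training are assumptions added to make the example concrete; the paper only states that 10% of each user's subreddits are held out.

```python
import math
import random


def split_per_user(user_subs, test_frac=0.1, seed=0):
    """Hold out roughly test_frac of each user's subreddits for testing.

    user_subs: dict mapping user -> set of subreddits the user interacted with.
    """
    rng = random.Random(seed)
    train, test = {}, {}
    for user, subs in user_subs.items():
        subs = sorted(subs)  # fixed order so the shuffle is reproducible
        rng.shuffle(subs)
        # Assumption: hold out at least one subreddit, but keep one for training.
        n_test = max(1, math.floor(test_frac * len(subs)))
        n_test = max(0, min(len(subs) - 1, n_test))
        test[user] = set(subs[:n_test])
        train[user] = set(subs[n_test:])
    return train, test
```

Splitting inside each user's own list, rather than sampling interactions globally, is what guarantees that every user keeps some training data.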

We use the area under the ROC curve (AUC) as our evaluation metric. After considering various evaluation criteria, we decided that AUC was the best fit for our recommender system [3]. This criterion gives the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. Comparing pairs of subreddits by their relative ranking is the most important feature of a subreddit recommender system, so AUC matches what we want to measure. AUC is also well defined in our setting because our model is comparison-based and there is no ground-truth rating for our recommender system.

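We take the definition as given in the [cite BPR paper] BPR paper. AUC is defined as:

$$\mathrm{AUC} = \frac{1}{|U|} \sum_{u} \frac{1}{|E(u)|} \sum_{(i,j) \in E(u)} \delta(\hat{x}_{ui} > \hat{x}_{uj})$$

where $\delta$ is the indicator function and the evaluation pairs for each user $u$ are

$$E(u) = \{(i,j) \mid (u,i) \in S_{test} \wedge (u,j) \notin (S_{test} \cup S_{train})\}$$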
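The per-user AUC used in BPR-style evaluation can be computed directly from its definition: for each user, count the fraction of (held-out item, never-seen item) pairs in which the held-out item receives the higher score, then average over users. The sketch below assumes scores are available as nested dictionaries; the function name `auc` and the data layout are illustrative choices.

```python
def auc(scores, train, test, all_items):
    """Average per-user AUC over (held-out, unobserved) item pairs.

    scores: dict user -> dict item -> predicted preference
    train, test: dict user -> set of items
    all_items: iterable of every item identifier
    """
    all_items = list(all_items)
    per_user = []
    for u, held_out in test.items():
        seen = train.get(u, set()) | held_out
        negatives = [j for j in all_items if j not in seen]
        if not held_out or not negatives:
            continue  # this user contributes no evaluation pairs
        wins = sum(scores[u][i] > scores[u][j]
                   for i in held_out for j in negatives)
        per_user.append(wins / (len(held_out) * len(negatives)))
    return sum(per_user) / len(per_user) if per_user else float("nan")
```

A value of 0.5 corresponds to random ranking and 1.0 to ranking every held-out subreddit above every subreddit the user never interacted with.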



We plotted t-SNE plots [4] for our subreddits, based on the Subreddit2Vec vectors that we generated, to observe the similarities between subreddits and find interesting combinations between them, as shown in Figure 4.

Fig. 4: t-SNE plot for subreddits

We compare our ALS, BPR and Textual BPR models using the AUC methodology, running our experiments with vectors of various dimensions between 8 and 128. We observe in Figure 5 that Textual BPR performs better than the other two methods by a significant margin. We hypothesize that this is due to the word embeddings of comments providing our model with context about the user and the subreddit.

Fig. 5: AUC results comparing Textual BPR vs other methods

The best AUC scores, along with the dimensions that gave the best results, are given in Table I.

Model      Best AUC score   Number of Dimensions
MF - ALS   0.815            128
BPR        0.717            32
t-BPR      0.901            16
TABLE I: AUC scores for best performing models

III-E Qualitative Analysis

We also present an example of our system's recommendations for a sample user, u/TallnFrosty. This user's history suggests an interest in sports, especially basketball, soccer and football. We find that our system's recommendations are a good mix of niche and generic subreddits which may interest this user. We also give an example of the subreddits that are most similar to the Game of Thrones subreddit, and again find other TV shows similar in theme to Game of Thrones in the results, as expected.

After a thorough analysis of our results, we find that the majority of our recommendations do not include Reddit's most popular subreddits, which suggests that our model captures meaning from the comments and is able to make novel recommendations to a fair extent.

Fig. 6: Examples of Textual BPR


Our work presents a new framework for recommending subreddits of interest to existing Reddit users based on their past interaction with the website. We present Textual BPR, an embedding-based BPR system which performs significantly better than both the pre-existing real-world Reddit recommender and traditional ALS and BPR approaches. We observe that our system performs well in practice, often suggesting novel and diverse subreddits of interest.


We would like to thank Professor James Caverlee and Jianling Wang for their invaluable contributions and feedback for this project. We would also like to thank our classmates, who gave us valuable suggestions during the initial stages as well as a critical review during our poster presentation.

Future Work

In the future, we would like to investigate improving serendipity, user novelty and coverage in our recommender system, based on work being done in this field [5].

In addition, as suggested by Professor Caverlee, our item space involves polarized subreddits. User emotion is highly dependent on the type of subreddit. For example, subreddits involving politics are more polarized, and we want to capture these latent factors using Natural Language Processing to improve our recommender.

Our dataset also provides a controversiality score, a measure of how controversial a comment is. We plan to combine this score with polarization to draw further inferences about how trolls function on sites like Reddit, and how they affect users and discussions. This would be extremely valuable, as trolls are difficult to combat in real-world scenarios, especially on popular sites like Reddit.