Representative sampling of collaborative filtering (CF) data is a crucial problem from numerous standpoints and can be performed at various levels: (1) mining hard-negatives while training complex algorithms over massive datasets (Mittal et al., 2021; Chen et al., 2017); (2) down-sampling the item-space to estimate expensive ranking metrics (Krichene and Rendle, 2020; Cañamares and Castells, 2020); and (3) reasons like easy sharing, fast experimentation, and mitigating the significant environmental footprint of training resource-hungry machine learning models (Patterson et al., 2021; Wu et al., 2021; Gupta et al., 2021; Schwartz et al., 2019). In this paper, we are interested in finding a sub-sample of a dataset which has minimal effects on model utility evaluation, i.e., an algorithm performing well on the sub-sample should also perform well on the original dataset.
Preserving exactly the same levels of performance on sub-sampled data over metrics, such as MSE and AUC, is a challenging problem. A simpler yet useful problem is accurately preserving the ranking or relative performance of different algorithms on sub-sampled data. For example, a sampling scheme that has low bias but high variance in preserving metric performance values has less utility than a different sampling scheme with high amounts of bias but low variance, since the latter still preserves the overall algorithm ranking.
Performing ad-hoc sampling such as randomly removing interactions, or making dense subsets by removing users or items with few interactions (Sachdeva and McAuley, 2020) can have adverse downstream repercussions. For example, sampling only the head-portion of a dataset—from a fairness and inclusion perspective—is inherently biased against minority-groups and benchmarking algorithms on this biased data is likely to propagate the original sampling bias. Alternatively, simply from a model performance view-point, accurately retaining the relative performance of different recommendation algorithms on much smaller sub-samples is a challenging research problem in itself.
Two prominent directions toward better and more representative sampling of CF data are: (1) designing principled sampling strategies, especially for user-item interaction data; and (2) analyzing the performance of different sampling strategies, in order to better grasp which sampling scheme performs “better” for which types of data. In this paper, we explore both directions through the lens of expediting the recommendation algorithm development cycle by:
Characterizing the efficacy of sixteen different sampling strategies in accurately benchmarking various kinds of recommendation algorithms on smaller sub-samples.
Proposing a sampling strategy, SVP-CF, which dynamically samples the “toughest” portion of a CF dataset. SVP-CF is tailor-designed to handle the inherent data heterogeneity and missing-not-at-random properties in user-item interaction data.
Developing Data-genie, which analyzes the performance of different sampling strategies. Given a dataset sub-sample, Data-genie can directly estimate the likelihood of model performance being preserved on that sample.
Ultimately, our experiments reveal that SVP-CF outperforms all other sampling strategies and can accurately benchmark recommendation algorithms with roughly of the original dataset size. Furthermore, by employing Data-genie to dynamically select the best sampling scheme for a dataset, we are able to preserve model performance with only of the initial data, leading to a net training time speedup.
2. Related Work
Sampling CF data.
Sampling in CF-data has been a popular choice for three major scenarios. Most prominently, sampling is used for mining hard-negatives while training recommendation algorithms. Some popular approaches include random sampling; using the graph-structure (Ying et al., 2018; Mittal et al., 2021); and ad-hoc techniques like similarity search (Jain et al., 2019), stratified sampling (Chen et al., 2017), etc. Sampling is also generally employed for evaluating recommendation algorithms by estimating expensive-to-compute metrics like Recall, nDCG, etc. (Krichene and Rendle, 2020; Cañamares and Castells, 2020). Finally, sampling is also used to create smaller sub-samples of a big dataset for reasons like fast experimentation, benchmarking different algorithms, privacy concerns, etc. However, the consequences of different samplers on any of these downstream applications are under-studied, and are the main research interest of this paper.
Closest to our work, a coreset is loosely defined as a subset of data-points that maintains a similar “quality” as the full dataset for subsequent model training. Submodular approaches try to optimize a function which measures the utility of a subset, and use it as a proxy to select the best coreset (Kaushal et al., 2019). More recent works treat coreset selection as a bi-level optimization problem (Borsos et al., 2020; Krause et al., 2021) and directly optimize for the best coreset for a given downstream task. Selection-via-proxy (Coleman et al., 2020) is another technique which employs a base-model as a proxy to tag the importance of each data-point. Note, however, that most existing coreset selection approaches were designed primarily for classification data, whereas adapting them for CF-data is non-trivial because of: (1) the inherent data heterogeneity; (2) the wide range of recommendation metrics; and (3) the prevalent missing-data characteristics.
Evaluating sample quality.
The quality of a dataset sample, if estimated correctly, is of high interest for various applications. Short of being able to evaluate the “true” utility of a sample, one generally resorts to either retaining task-dependent characteristics (Morstatter et al., 2013) or employing universal, handcrafted features like a social network’s hop distribution, eigenvalue distribution, etc. (Leskovec and Faloutsos, 2006) as meaningful proxies. Note that evaluating the sample quality with a limited set of handcrafted features might introduce bias in the sampled data, depending on the number and quality of such features.
3. Sampling Collaborative Filtering Datasets
Given our motivation of quickly benchmarking recommendation algorithms, we now aim to characterize the performance of various commonly-used sampling strategies. We loosely define the performance of a sampling scheme as its ability in effectively retaining the performance-ranking of different recommendation algorithms on the full vs. sub-sampled data. In this section, we start by discussing the different recommendation feedback scenarios we consider, along with a representative sample of popular recommendation algorithms that we aim to efficiently benchmark. We then examine popular data sampling strategies, followed by proposing a novel, proxy-based sampling strategy (SVP-CF) that is especially suited for sampling representative subsets from long-tail CF data.
3.1. Problem Settings & Methods Compared
To give a representative sample of typical recommendation scenarios, we consider three different user feedback settings. In explicit feedback, each user $u$ gives a numerical rating to each interacted item $i$; the model must predict these ratings for novel (test) user-item interactions. Models from this class are evaluated in terms of the Mean Squared Error (MSE) of the predicted ratings. Another scenario we consider is implicit feedback, where the interactions for each user are only available for positive items (e.g. clicks or purchases), whilst all non-interacted items are considered as negatives. We employ the AUC, Recall@$k$, and nDCG@$k$ metrics to evaluate model performance for implicit feedback algorithms. Finally, we also consider sequential feedback, where each user interacts with an ordered sequence of items. Given these sequences, the goal is to identify the next item that each user is most likely to interact with. We use the same metrics as in implicit feedback settings. Note that following recent warnings against sampled metrics for evaluating recommendation algorithms (Krichene and Rendle, 2020; Cañamares and Castells, 2020), we compute both Recall and nDCG by ranking all items in the dataset. Further specifics about the datasets used, pre-processing, train/test splits, etc. are discussed in-depth in Section 3.5.
Given the diversity of the scenarios discussed above, there are numerous relevant recommendation algorithms. We use the following seven recommendation algorithms, intended to represent the state-of-the-art and standard baselines:
PopRec: A naïve baseline that simply ranks items according to overall train-set popularity. Note that this method is unaffected by the user for which items are being recommended, and has the same global ranking of all items.
Bias-only: Another simple baseline that assumes no interactions between users and items. Formally, it learns: (1) a global bias $\alpha$; (2) scalar biases $\beta_u$ for each user $u$; and (3) scalar biases $\beta_i$ for each item $i$. Ultimately, the rating/relevance for user $u$ and item $i$ is modeled as $\hat{r}^u_i = \alpha + \beta_u + \beta_i$.
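Matrix Factorization (MF) (Koren et al., 2009): Represents both users and items in a common, low-dimensional latent-space by factorizing the user-item interaction matrix. Formally, the rating/relevance for user $u$ and item $i$ is modeled as $\hat{r}^u_i = \alpha + \beta_u + \beta_i + \gamma_u \cdot \gamma_i$, where $\alpha$, $\beta_u$, $\beta_i$ are the global, user, and item biases and $\gamma_u, \gamma_i \in \mathbb{R}^d$ are learned latent representations.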
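Neural Matrix Factorization (NeuMF) (He et al., 2017): Leverages the representation power of deep neural-networks to capture non-linear correlations between user and item embeddings. Formally, the rating/relevance for user $u$ and item $i$ is modeled as $\hat{r}^u_i = \alpha + \beta_u + \beta_i + f(\gamma_u \,\|\, \gamma_i \,\|\, \gamma_u \cdot \gamma_i)$, where $\alpha$, $\beta_u$, $\beta_i$ are the global, user, and item biases, $\gamma_u, \gamma_i \in \mathbb{R}^d$, ‘$\|$’ represents the concatenation operation, and $f$ represents an arbitrarily complex neural network.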
Variational Auto-Encoders for Collaborative Filtering (MVAE) (Liang et al., 2018): Builds upon the Variational Auto-Encoder (VAE) (Kingma and Welling, 2014) framework to learn a low-dimensional representation of a user’s consumption history. More specifically, MVAE encodes each user’s bag-of-words consumption history using a VAE and further decodes the latent representation to obtain the completed user preference over all items.
Sequential Variational Auto-Encoders for Collaborative Filtering (SVAE) (Sachdeva et al., 2019): A sequential algorithm that combines the temporal modeling capabilities of a GRU (Chung et al., 2014) along with the representation power of VAEs. Unlike MVAE, SVAE uses a GRU to encode the user’s consumption sequence followed by a multinomial VAE at each time-step to model the likelihood of the next item.
Self-attentive Sequential Recommendation (SASRec) (Kang and McAuley, 2018): Another sequential algorithm that relies on the sequence modeling capabilities of self-attentive neural networks (Vaswani et al., 2017) to predict the occurrence of the next item in a user’s consumption sequence. To be precise, given a user $u$ and their time-ordered consumption history $\mathcal{S}^u$, SASRec first applies self-attention on $\mathcal{S}^u$ followed by a series of non-linear feed-forward layers to finally obtain the next item likelihood.
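As a small illustration of the two simplest scoring functions in the list above, the following sketch assumes the common parameterization in which MF adds a latent dot product on top of the bias-only score. The function names and the toy numbers are hypothetical, and no training is shown.

```python
from typing import Sequence


def bias_only_score(alpha: float, beta_u: float, beta_i: float) -> float:
    """Bias-only relevance: global bias + user bias + item bias."""
    return alpha + beta_u + beta_i


def mf_score(alpha: float, beta_u: float, beta_i: float,
             gamma_u: Sequence[float], gamma_i: Sequence[float]) -> float:
    """Assumed biased-MF form: bias-only score plus the latent dot product."""
    dot = sum(a * b for a, b in zip(gamma_u, gamma_i))
    return bias_only_score(alpha, beta_u, beta_i) + dot


# Toy parameters (made up for illustration only).
example_bias = bias_only_score(3.5, 0.2, -0.1)
example_mf = mf_score(3.5, 0.2, -0.1, [1.0, 0.5], [0.4, -0.2])
```

PopRec needs no per-user parameters at all, which is why it produces the same global ranking for every user.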
We also list models and metrics for each of the three different CF-scenarios in Table 1. Since bias-only, MF, and NeuMF can be trained for all three CF-scenarios, we optimize them using the regularized least-squares regression loss for explicit feedback, and the pairwise-ranking (BPR (Rendle et al., 2009)) loss for implicit/sequential feedback. Note however that the aforementioned algorithms are only intended to be a representative sample of a wide pool of recommendation algorithms, and in our pursuit to benchmark recommender systems faster, we are primarily concerned with the ranking of different algorithms on the full dataset vs. a smaller sub-sample.
3.2. Sampling Strategies
Given a user-item CF dataset $\mathcal{D}$, we aim to create a $p\%$ subset $\mathcal{D}^{s,p}$ according to some sampling strategy $s$. In this paper, to be comprehensive, we consider a sample of eight popular sampling strategies, which can be grouped into the following three categories:
3.2.1. Interaction sampling.
We first discuss three strategies that sample interactions from $\mathcal{D}$. In Random Interaction Sampling, we generate $\mathcal{D}^{s,p}$ by randomly sampling $p\%$ of all the user-item interactions in $\mathcal{D}$. User-history Stratified Sampling is another popular sampling technique (see e.g. (Sachdeva et al., 2019; X et al., 2011)) to generate smaller CF-datasets. To match the user-frequency distribution amongst $\mathcal{D}$ and $\mathcal{D}^{s,p}$, it randomly samples $p\%$ of interactions from each user’s consumption history. Unlike random stratified sampling, User-history Temporal Sampling samples $p\%$ of the most recent interactions for each user. This strategy is representative of the popular practice of making data subsets from the online traffic of the last $x$ days (Mittal et al., 2021; Jain et al., 2016).
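The three interaction samplers can be sketched as follows. This is a minimal illustration, assuming each interaction is a `(user, item, timestamp)` tuple and that every user keeps at least one interaction; both the data layout and that floor are assumptions, not details from the text.

```python
import random
from collections import defaultdict


def _by_user(data):
    """Group (user, item, timestamp) rows by user."""
    hist = defaultdict(list)
    for row in data:
        hist[row[0]].append(row)
    return hist


def random_interaction_sample(data, p, seed=0):
    """Keep p% of all interactions, chosen uniformly at random."""
    k = round(len(data) * p / 100)
    return random.Random(seed).sample(list(data), k)


def stratified_sample(data, p, seed=0):
    """Keep a random p% of each user's history (at least one row, by assumption)."""
    rng = random.Random(seed)
    out = []
    for rows in _by_user(data).values():
        k = max(1, round(len(rows) * p / 100))
        out.extend(rng.sample(rows, k))
    return out


def temporal_sample(data, p):
    """Keep the most recent p% of each user's history (at least one row)."""
    out = []
    for rows in _by_user(data).values():
        k = max(1, round(len(rows) * p / 100))
        out.extend(sorted(rows, key=lambda r: r[2])[-k:])
    return out


# Toy data: 2 users, 10 interactions each, timestamp equal to the item id.
toy = [(u, t, t) for u in range(2) for t in range(10)]
rand_half = random_interaction_sample(toy, 50)
strat_half = stratified_sample(toy, 50)
recent_half = temporal_sample(toy, 50)
```

Only the stratified and temporal variants preserve each user's share of the data; the random variant can leave some users with no interactions at all.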
3.2.2. User sampling.
Similar to sampling interactions, we also consider two strategies which sample users in $\mathcal{D}$ instead. To ensure a fair comparison amongst the different kinds of sampling schemes used in this paper, we retain exactly $p\%$ of the total interactions in $\mathcal{D}$. In Random User Sampling, we retain users from $\mathcal{D}$ at random. To be more specific, we iteratively preserve all the interactions for a random user until we have retained $p\%$ of the original interactions. Another strategy we employ is Head User Sampling, in which we iteratively remove the user with the fewest total interactions. This method is representative of commonly used data pre-processing strategies (see e.g. (Liang et al., 2018; He et al., 2017)) to make data suitable for parameter-heavy algorithms. Sampling the data in such a way can introduce bias against users from minority groups, which might raise concerns from a diversity and fairness perspective (Mitchell et al., 2018).
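A minimal sketch of the two user samplers, assuming interactions are tuples whose first field is the user id. Because whole users are kept or dropped, the retained fraction can overshoot the target slightly; how the exact budget is enforced is not specified in the text, so the stopping rules below are assumptions.

```python
import random
from collections import defaultdict


def _histories(data):
    """Group rows by user id (first field of each row)."""
    hist = defaultdict(list)
    for row in data:
        hist[row[0]].append(row)
    return hist


def random_user_sample(data, p, seed=0):
    """Keep whole users, in random order, until about p% of interactions are retained."""
    hist = _histories(data)
    users = list(hist)
    random.Random(seed).shuffle(users)
    budget = len(data) * p / 100
    kept = []
    for user in users:
        if len(kept) >= budget:
            break
        kept.extend(hist[user])
    return kept


def head_user_sample(data, p):
    """Drop the least active users one by one while at least p% of interactions remain."""
    hist = _histories(data)
    users = sorted(hist, key=lambda u: len(hist[u]))  # least active first
    remaining = len(data)
    budget = len(data) * p / 100
    while users and remaining - len(hist[users[0]]) >= budget:
        remaining -= len(hist[users.pop(0)])
    return [row for u in users for row in hist[u]]


# Toy data: user 0 has 1 interaction, user 1 has 2, user 2 has 7.
toy_users = [(0, 0)] + [(1, i) for i in range(2)] + [(2, i) for i in range(7)]
head = head_user_sample(toy_users, 70)
rand_users = random_user_sample(toy_users, 50)
```

On the toy data, head sampling at 70% keeps only the single most active user, which illustrates how the tail of the user distribution disappears first.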
3.2.3. Graph sampling.
Instead of sampling directly from $\mathcal{D}$, we also consider three strategies that sample from the inherent user-item bipartite interaction graph $\mathcal{G}$. In Centrality-based Sampling, we proceed by computing the pagerank centrality scores (Page et al., 1999) for each node in $\mathcal{G}$, and retain all the edges (interactions) of the top scoring nodes until a total of $p\%$ of the original interactions have been preserved. Another popular strategy we employ is Random-walk Sampling (Leskovec and Faloutsos, 2006), which performs multiple random-walks with restart on $\mathcal{G}$ and retains the edges amongst those pairs of nodes that have been visited at least once. We keep expanding our walk until $p\%$ of the initial edges have been retained. We also utilize Forest-fire Sampling (Leskovec et al., 2005), which is a snowball sampling method and proceeds by randomly “burning” the outgoing edges of visited nodes. It initially starts with a random node, and then propagates to a random subset of previously unvisited neighbors. The propagation is terminated once we have created a graph-subset with $p\%$ of the initial edges.
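Of the three graph samplers, random-walk sampling is sketched below. This is an illustrative version only: the restart probability, the `patience` rule for beginning a new walk, and the `(user, item)` edge layout are all assumptions, and it expects a non-empty edge list with `0 < p <= 100`.

```python
import random
from collections import defaultdict


def random_walk_sample(edges, p, restart=0.15, patience=100, seed=0):
    """Return the (user, item) edges induced by nodes visited during random walks."""
    rng = random.Random(seed)
    edges = sorted(set(edges))
    adj = defaultdict(list)
    for u, i in edges:
        adj[("user", u)].append(("item", i))
        adj[("item", i)].append(("user", u))
    nodes = sorted(adj)
    budget = len(edges) * p / 100
    visited, kept = set(), set()
    start = node = rng.choice(nodes)
    stall = 0
    while len(kept) < budget:
        if node not in visited:
            visited.add(node)
            stall = 0
            # Retain every edge between the new node and already-visited nodes.
            for nb in adj[node]:
                if nb in visited:
                    user, item = (node, nb) if node[0] == "user" else (nb, node)
                    kept.add((user[1], item[1]))
        else:
            stall += 1
        if stall > patience:
            # No new node for a while: begin a new walk from a random node.
            start = node = rng.choice(nodes)
            stall = 0
        elif rng.random() < restart:
            node = start
        else:
            node = rng.choice(adj[node])
    return kept


# Toy graph: complete bipartite graph with 5 users and 4 items (20 edges).
toy_edges = [(u, i) for u in range(5) for i in range(4)]
walk_sample = random_walk_sample(toy_edges, 50)
```

Since visiting one new node can add several edges at once, the returned set may slightly exceed the requested percentage.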
3.3. SVP-CF: Selection-Via-Proxy for CF data
Selection-Via-Proxy (SVP) (Coleman et al., 2020) is a coreset selection technique designed for classification datasets such as ImageNet (Deng et al., 2009). The main idea proposed is simple and effective, and proceeds by training a relatively inexpensive base-model as a proxy to define the “importance” of a data-point. However, applying SVP to CF-data can be highly non-trivial because of the following impediments:
Data heterogeneity: Unlike classification data over some input space $\mathcal{X}$ and label-space $\mathcal{Y}$, CF-data consists of numerous four-tuples (user, item, interaction, timestamp). Such multimodal data adds many different dimensions to sample data from, making it increasingly complex to define meaningful samplers.
Defining the importance of a data point: Unlike classification, where we can measure the performance of a classifier by its empirical risk on held-out data, for recommendation, there are a variety of different scenarios (Section 3.1) along with a wide list of relevant evaluation metrics. Hence, it becomes challenging to adapt importance-tagging techniques like greedy k-centers (Sener and Savarese, 2018), forgetting-events (Toneva et al., 2019), etc. for recommendation tasks.
Missing data: CF-data is well-known for (1) its sparsity; (2) skewed and long-tail user/item distributions; and (3) missing-not-at-random (MNAR) properties of the user-item interaction matrix. This results in additional problems as we are now sampling data from skewed, MNAR data, especially using proxy-models trained on the same skewed data. Such sampling in the worst-case might even lead to exacerbating existing biases in the data or even aberrant data samples.
To address these fundamental limitations in applying the SVP philosophy to CF-data, we propose SVP-CF to sample representative subsets from large user-item interaction data. SVP-CF is also specifically devised for our objective of benchmarking different recommendation algorithms, as it relies on the crucial assumption that the “easiest” part of a dataset will generally be easy for all algorithms. Under this assumption, even after removing such data we are still likely to retain the overall algorithms’ ranking.
Because of the inherent data heterogeneity in user-item interaction data, we can sub-sample in a variety of different ways. We design SVP-CF to be versatile in this aspect as it can be applied to sample users, items, interactions, or combinations of them, by marginally adjusting the definition of importance of each data-point. In this paper, we limit the discussion to only sampling users and interactions (separately), but extending SVP-CF for sampling across other data modalities should be relatively straightforward.
Irrespective of whether to sample users or interactions, SVP-CF proceeds by training an inexpensive proxy model $\mathcal{P}$ on the full, original data and modifies the forgetting-events approach (Toneva et al., 2019) to retain the points with the highest importance. To be more specific, for explicit feedback, we define the importance of each data-point, i.e., each interaction, as $\mathcal{P}$’s average MSE (over epochs) of the specific interaction if we’re sampling interactions, or $\mathcal{P}$’s average MSE of the corresponding user (over epochs) if we’re sampling users. Whereas, for implicit and sequential feedback, we use $\mathcal{P}$’s average inverse-AUC while computing the importance of each data-point. For the sake of completeness, we experiment with both Bias-only and MF as two different kinds of proxy-models for SVP-CF. Since both models can be trained for all three CF-scenarios (Table 1), we can directly use them to tag the importance for each CF-scenario.
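For the explicit-feedback case, the importance tagging and selection step can be sketched as below. The sketch assumes the proxy's per-interaction squared errors have already been logged after every epoch; the input layout and all function names are hypothetical.

```python
def interaction_importance(errors_per_epoch):
    """errors_per_epoch[e][k] is the proxy's squared error on interaction k after epoch e."""
    n_epochs = len(errors_per_epoch)
    n_points = len(errors_per_epoch[0])
    return [sum(epoch[k] for epoch in errors_per_epoch) / n_epochs
            for k in range(n_points)]


def user_importance(users, importance):
    """Average the interaction importances within each user's history."""
    totals, counts = {}, {}
    for user, weight in zip(users, importance):
        totals[user] = totals.get(user, 0.0) + weight
        counts[user] = counts.get(user, 0) + 1
    return {user: totals[user] / counts[user] for user in totals}


def keep_hardest(importance, p):
    """Indices of the p% most important (hardest) points, in original order."""
    k = round(len(importance) * p / 100)
    order = sorted(range(len(importance)), key=lambda j: importance[j], reverse=True)
    return sorted(order[:k])


# Toy log: 2 epochs, 4 interactions belonging to users "a", "a", "b", "b".
toy_errors = [[0.1, 0.9, 0.5, 0.2],
              [0.3, 0.7, 0.5, 0.0]]
point_scores = interaction_importance(toy_errors)
user_scores = user_importance(["a", "a", "b", "b"], point_scores)
hardest = keep_hardest(point_scores, 50)
```

Averaging over epochs, rather than using the final error alone, rewards points the proxy struggles with throughout training, which is in the spirit of the forgetting-events idea the text builds on.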
Ultimately, to handle the MNAR and long-tail problems, we also propose SVP-CF-Prop which employs user and item propensities to correct the distribution mismatch while estimating the importance of each datapoint. More specifically, let $p_{u,i}$ denote the probability of user $u$ and item $i$’s interaction actually being observed (propensity), $E$ be the total number of epochs that the proxy model $\mathcal{P}$ was trained for, $\mathcal{P}_e$ denote the proxy model after the $e^{\text{th}}$ epoch, $\mathcal{I}_u^+$ be the set of positive interactions for $u$, and $\mathcal{I}_u^-$ be the set of negative interactions for $u$; then, the importance function for SVP-CF-Prop is defined as follows:
Proposition 3.1. Given an ideal propensity-model, is an unbiased estimator of .
The most common ways of modeling propensity comprise using machine learning models like naïve Bayes and logistic regression (Schnabel et al., 2016), or fitting handcrafted functions (Jain et al., 2016). For our problem statement, we make a simplifying assumption that the data noise is one-sided, i.e., the probability of a user interacting with a wrong item is zero, and model the probability of an interaction going missing to decompose over the user and item as follows:
Ultimately, following (Jain et al., 2016), we assume the user and item propensities to lie on the following sigmoid curves:
Where, and represent the total number of interactions of user and item respectively, and are two fixed scalars, and .
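The propensity model can be sketched in code under two stated assumptions: that the interaction propensity factorizes as a product of a user term and an item term, and that each term follows the sigmoid form used by Jain et al. (2016), $1 / (1 + C e^{-A \log(N + B)})$. The constants below are placeholders rather than values from this paper, and the inverse-propensity weighting in the last function is the usual correction, assumed here for illustration.

```python
import math


def sigmoid_propensity(n_interactions, a, b, c):
    """Assumed form: 1 / (1 + C * exp(-A * log(N + B)))."""
    return 1.0 / (1.0 + c * math.exp(-a * math.log(n_interactions + b)))


def interaction_propensity(n_user, n_item, a=0.55, b=1.5, c_user=1.0, c_item=1.0):
    """Assumed decomposition: p_{u,i} = p_u * p_i (placeholder constants)."""
    return (sigmoid_propensity(n_user, a, b, c_user)
            * sigmoid_propensity(n_item, a, b, c_item))


def propensity_weighted(value, propensity):
    """Inverse-propensity weighting of a per-interaction quantity (assumed correction)."""
    return value / propensity


# A popular user/item pair is more likely to be observed than a rare one.
p_head = interaction_propensity(n_user=500, n_item=1000)
p_tail = interaction_propensity(n_user=3, n_item=2)
weighted_tail = propensity_weighted(0.8, p_tail)
```

Under this weighting, errors on rarely observed (tail) interactions count for more, which is how the correction counteracts the long-tail skew.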
3.4. Performance of a sampling strategy
To quantify the performance of a sampling strategy $s$ on a dataset $\mathcal{D}$, we start by creating various $p\%$ subsets of $\mathcal{D}$ according to $s$ and call them $\mathcal{D}^{s,p}$. Next, we train and evaluate all the relevant recommendation algorithms on both $\mathcal{D}$ and $\mathcal{D}^{s,p}$. Let the ranking of all algorithms according to CF-scenario $f$ and metric $m$ trained on $\mathcal{D}$ and $\mathcal{D}^{s,p}$ be $\mathcal{R}_{f,m}$ and $\mathcal{R}^{s,p}_{f,m}$ respectively, then the performance measure $\Psi(\mathcal{D}, s)$ is defined as the average correlation between $\mathcal{R}_{f,m}$ and $\mathcal{R}^{s,p}_{f,m}$ measured through Kendall’s Tau over all possible CF-scenarios, metrics, and sampling percents:
$$\Psi(\mathcal{D}, s) = \lambda \sum_{f} \sum_{m} \sum_{p} \tau\left(\mathcal{R}_{f,m}, \mathcal{R}^{s,p}_{f,m}\right)$$
where $\lambda$ is an appropriate normalizing constant for computing the average; the sampling percents $p$, CF-scenarios $f$, metrics $m$, and their pertinence towards each other can all be found in Table 1. $\Psi$ has the same range as Kendall’s Tau, i.e., $[-1, 1]$, and a higher $\Psi$ indicates strong agreement between the algorithm ranking on the full and sub-sampled datasets, whereas a large negative $\Psi$ implies that the algorithm order was effectively reversed.
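The performance measure can be computed with a few lines of code. The sketch below implements Kendall's Tau without tie handling (rankings are assumed to be strict orderings of the same algorithms) and averages it over all (scenario, metric, percent) combinations; the dictionary layout is an assumption made for the example.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's Tau between two strict rankings (lists of algorithm names, best first)."""
    pos_a = {alg: k for k, alg in enumerate(rank_a)}
    pos_b = {alg: k for k, alg in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        sign = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) / 2
    return (concordant - discordant) / n_pairs


def psi(full_rankings, sampled_rankings):
    """Average Tau over all (scenario, metric, percent) keys of sampled_rankings."""
    taus = [kendall_tau(full_rankings[(f, m)], ranking)
            for (f, m, p), ranking in sampled_rankings.items()]
    return sum(taus) / len(taus)


# Toy example: one scenario/metric, two sampling percents.
full = {("implicit", "AUC"): ["MF", "Bias-only", "PopRec"]}
sampled = {
    ("implicit", "AUC", 50): ["MF", "Bias-only", "PopRec"],  # order preserved
    ("implicit", "AUC", 10): ["PopRec", "Bias-only", "MF"],  # order reversed
}
toy_psi = psi(full, sampled)
```

In the toy example one sample preserves the ranking perfectly and the other reverses it, so the two correlations cancel and the average is zero.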
To promote dataset diversity in our experiments, we use six public user-item rating interaction datasets with varying sizes, sparsity patterns, and other characteristics. We use the Magazine, Luxury, and Video-games categories of the Amazon review dataset (Ni et al., 2019), along with the Movielens-100k (Harper and Konstan, 2015), BeerAdvocate (McAuley et al., 2012), and GoodReads Comics (Wan and McAuley, 2018) datasets. A brief set of data statistics is also presented in Table 2. We simulate all three CF-scenarios (Section 3.1) via different pre-processing strategies. For explicit and implicit feedback, we follow a randomized 80/10/10 train-test-validation split for each user’s consumption history in the dataset, and make use of the leave-one-last (Meng et al., 2020) strategy for sequential feedback, i.e., keep the last two interactions in each user’s time-sorted consumption history in the validation and test-set respectively. Since we can’t control the initial construction of datasets, and to minimize the initial data bias, we follow the least restrictive data pre-processing (Dacrema et al., 2019; Sachdeva and McAuley, 2020). We only weed out the users with fewer than 3 interactions, to keep at least one occurrence in the train, validation, and test sets.
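The two per-user splitting schemes can be sketched as follows. The rounding rule that guarantees at least one validation and one test item is an assumption; the text only states the 80/10/10 proportions and the minimum of three interactions per user.

```python
import random


def leave_one_last(history):
    """Time-sorted history -> (train, validation item, test item)."""
    return history[:-2], history[-2], history[-1]


def random_split(history, seed=0):
    """Randomized 80/10/10 split of one user's history (at least 1 val and 1 test item)."""
    items = list(history)
    random.Random(seed).shuffle(items)
    n_val = max(1, round(0.1 * len(items)))
    n_test = max(1, round(0.1 * len(items)))
    train = items[n_val + n_test:]
    val = items[:n_val]
    test = items[n_val:n_val + n_test]
    return train, val, test


seq_train, seq_val, seq_test = leave_one_last([11, 12, 13, 14])
train10, val10, test10 = random_split(range(10))
train3, val3, test3 = random_split([7, 8, 9])
```

With exactly three interactions, each of the three sets receives one item, which matches the stated reason for the minimum-history filter.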
We implement all algorithms in PyTorch (code is available at https://github.com/noveens/sampling_cf) and train on a single GPU. For a fair comparison across algorithms, we perform hyper-parameter search on the validation set. For the three smallest datasets used in this paper (Table 2), we search the latent size in , dropout in , and the learning rate in . Whereas for the three largest datasets, we fix the learning rate to be . Note that despite the limited number of datasets and recommendation algorithms used in this study, given that we need to train all algorithms with hyper-parameter tuning for all CF scenarios, data sampled according to all different sampling strategies discussed in Section 3.2, there are a total of unique models trained, equating to a cumulative train time of over months.
To compute the $\Psi$-values as defined in Section 3.4, we construct samples for each dataset and sampling strategy. To keep comparisons as fair as possible, for all sampling schemes, we only sample on the train set and never touch the validation and test sets.
3.5.1. How do different sampling strategies compare to each other?
Results with $\Psi$-values for all sampling schemes on all datasets are in Table 3. Even though there are only six datasets under consideration, there are a few prominent patterns. First, the average $\Psi$ for most sampling schemes is around , which implies a statistically significant correlation between the ranking of algorithms on the full vs. sub-sampled datasets. Next, SVP-CF generally outperforms all commonly used sampling strategies by some margin in retaining the ranking of different recommendation algorithms. Finally, the methods that discard the tail of a dataset (head-user and centrality-based) are the worst performing strategies overall, which supports the recent warnings against dense sampling of data (Sachdeva and McAuley, 2020).
Table 3 (excerpt). $\Psi$-values of the SVP-CF variants, in two groups of four rows; the six middle columns are per-dataset values and the last column is their average.
| SVP-CF w/ MF | 0.418 | 0.674 | 0.398 | 0.326 | 0.425 | 0.662 | 0.484 |
| SVP-CF w/ Bias-only | 0.38 | 0.684 | 0.431 | 0.348 | 0.365 | 0.6 | 0.468 |
| SVP-CF-Prop w/ MF | 0.381 | 0.617 | 0.313 | 0.305 | 0.356 | 0.608 | 0.43 |
| SVP-CF-Prop w/ Bias-only | 0.408 | 0.617 | 0.351 | 0.316 | 0.437 | 0.617 | 0.458 |

| SVP-CF w/ MF | 0.468 | 0.578 | 0.308 | 0.13 | 0.136 | 0.444 | 0.344 |
| SVP-CF w/ Bias-only | 0.49 | 0.608 | 0.276 | 0.124 | 0.196 | 0.362 | 0.343 |
| SVP-CF-Prop w/ MF | 0.438 | 0.683 | 0.307 | 0.098 | 0.458 | 0.592 | 0.429 |
| SVP-CF-Prop w/ Bias-only | 0.434 | 0.751 | 0.233 | 0.107 | 0.506 | 0.637 | 0.445 |
3.5.2. How does the relative performance of algorithms change as a function of sampling rate?
In an attempt to better understand the impact of sampling on different recommendation algorithms used in this study (Section 3.1), we visualize the probability of a recommendation algorithm moving up in the overall method ranking with data sampling. We estimate the aforementioned probability using Maximum-Likelihood-Estimation (MLE) on the experiments already run in computing $\Psi$. Formally, given a recommendation algorithm $r$, CF-scenario $f$, and data sampling percent $p$:
where $\lambda$ is an appropriate normalizing constant, and $n$ represents the total number of algorithms. A heatmap visualizing this probability for all algorithms and CF-scenarios is shown in Figure 1. We see that simpler methods like Bias-only and PopRec have the highest probability across data scenarios of increasing their ranking order with extreme sampling. Whereas parameter-heavy algorithms like SASRec, SVAE, MVAE, etc. tend to decrease in the ranking order, which is indicative of overfitting on smaller data samples.
3.5.3. How much data to sample?
Since $\Psi$ is averaged over all $p\%$ data samples, to better estimate a reasonable amount of data to sample, we stratify $\Psi$ according to each value of $p$ and note the average Kendall’s Tau. As we observe from Figure 2, there is a steady increase in the performance measure when more data is retained. Next, despite the results in Figure 2 being averaged over sixteen sampling strategies, we still notice a significant amount of performance retained after sampling just of the data.
3.5.4. Are different metrics affected equally by sampling?
In an attempt to better understand how the different implicit and sequential feedback metrics (Section 3.1) are affected by sampling, we visualize the average Kendall’s Tau for all sampling strategies (except SVP-CF for brevity) and all $p\%$ data sampling choices separately over the AUC, Recall, and nDCG metrics in Figure 3. As expected, we observe a steady decrease in the model quality across the accuracy metrics over the different sampling schemes. This is in agreement with the analysis from Figure 2. Next, most sampling schemes follow a similar downwards trend in performance for the three metrics with AUC being slightly less affected and nDCG being slightly more affected by extreme sampling.
4. Data-genie: Which sampler is best for me?
Although the results presented in Section 3.5 are indicative of correlation between the ranking of recommendation algorithms on the full dataset vs. smaller sub-samples, there still is no ‘one-size-fits-all’ solution to the question of how to best sub-sample a dataset for retaining the performance of different recommendation algorithms. In this section, we propose Data-genie, which attempts to answer this question from a statistical perspective, in contrast with existing literature that generally has to resort to sensible heuristics (Leskovec and Faloutsos, 2006; Belletti et al., 2019; Chen et al., 2017).
4.1. Problem formulation
Given a dataset $\mathcal{D}$, we aim to gauge how a new sampling strategy will perform in retaining the performance of different recommendation algorithms. Having already experimented with sixteen different sampling strategies on six datasets (Section 3.5), we take a frequentist approach in predicting the performance of any sampling scheme. To be precise, to predict the performance of sampling scheme $s$ on dataset $\mathcal{D}$, we start by creating $\mathcal{D}$’s $p\%$ subset according to $s$ and call it $\mathcal{D}^{s,p}$. We then represent $\mathcal{D}$ and $\mathcal{D}^{s,p}$ in a low-dimensional latent space, followed by a powerful regression model to directly estimate the performance of $s$ on $\mathcal{D}$.
4.2. Dataset representation
We experiment with the following techniques of embedding a user-item interaction dataset into lower dimensions:
4.2.1. Handcrafted.
For this method, we cherry-pick a few representative characteristics of $\mathcal{D}$ and the underlying user-item bipartite interaction graph $\mathcal{G}$. Inspired by prior work (Leskovec and Faloutsos, 2006), we represent $\mathcal{D}$ as a combination of five features. We first utilize the frequency distribution of all users and items in $\mathcal{D}$. Next, we evaluate the distribution of the top eigenvalues of $\mathcal{G}$’s adjacency matrix. All three of these distributions are generally long-tailed and heavily skewed. Furthermore, to capture notions like the diameter of $\mathcal{G}$, we compare the distribution of the number of hops $h$ vs. the number of pairs of nodes in $\mathcal{G}$ reachable at a distance less than $h$ (Palmer et al., 2002). This distribution, unlike the others, is monotonically increasing in $h$. Finally, we also compute the size distribution of all connected components in $\mathcal{G}$, where a connected component is defined to be the maximal set of nodes, such that a path exists between any pair of nodes. Ultimately, we ascertain $\mathcal{D}$’s final representation by concatenating evenly-spaced samples from each of the aforementioned distributions along with the total number of users, items, and interactions in $\mathcal{D}$. This results in a dimensional embedding for each dataset. Note that unlike previous work of simply retaining the discussed features as a proxy of the quality of data subsets (Leskovec and Faloutsos, 2006), Data-genie instead uses these features to learn a regression model on-top which can dynamically establish the importance of each feature in the performance of a sampling strategy.
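A partial sketch of such a handcrafted embedding is given below. It covers only three of the five distributions (user frequencies, item frequencies, and connected-component sizes) and omits the eigenvalue and hop distributions; the number of samples per distribution `k` and the `(user, item)` edge layout are assumptions.

```python
from collections import Counter


def evenly_spaced(values, k):
    """Pick k evenly spaced entries from the values sorted in descending order."""
    values = sorted(values, reverse=True)
    if not values:
        return [0.0] * k
    if k == 1:
        return [float(values[0])]
    last = len(values) - 1
    return [float(values[round(j * last / (k - 1))]) for j in range(k)]


def component_sizes(edges):
    """Node counts of the connected components of the bipartite graph (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, i in edges:
        parent[find(("user", u))] = find(("item", i))
    return list(Counter(find(x) for x in list(parent)).values())


def handcrafted_embedding(edges, k=4):
    """Concatenate sampled distributions with the user, item, and interaction counts."""
    user_freq = Counter(u for u, _ in edges)
    item_freq = Counter(i for _, i in edges)
    feats = evenly_spaced(user_freq.values(), k)
    feats += evenly_spaced(item_freq.values(), k)
    feats += evenly_spaced(component_sizes(edges), k)
    feats += [float(len(user_freq)), float(len(item_freq)), float(len(edges))]
    return feats


toy_graph = [(0, 0), (0, 1), (1, 1), (2, 2)]
toy_embedding = handcrafted_embedding(toy_graph, k=2)
```

Because every distribution is reduced to the same number of evenly spaced samples, datasets of very different sizes map to vectors of identical length, which is what a downstream regression model requires.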
4.2.2. Unsupervised GCN.
With the recent advancements in the field of Graph Convolutional Networks (Kipf and Welling, 2017) to represent graph-structured data for a variety of downstream tasks, we also experiment with a GCN approach to embed $\mathcal{G}$. We modify the InfoGraph framework (Sun et al., 2020), which uses graph convolution encoders to obtain patch-level representations, followed by sort-pooling (Zhang et al., 2018) to obtain a fixed, low-dimensional embedding for the entire graph. Since the nodes in $\mathcal{G}$ are the union of all users and items in $\mathcal{D}$, we randomly initialize dimensional embeddings using a Xavier-uniform prior (Glorot and Bengio, 2010). Parameter optimization is performed in an unsupervised fashion by maximizing the mutual information (Nowozin et al., 2016) amongst the graph-level and patch-level representations of nodes in the same graph. We validate the best values of the latent dimension and number of layers of the GCN from and respectively.
4.3. Training & Inference
Having discussed different representation functions $\mathcal{E}$ to embed a CF-dataset in Section 4.2, we now discuss Data-genie’s training framework agnostic to the actual details about $\mathcal{E}$.
As a proxy of the performance of a sampler on a given dataset, we re-use the Kendall’s Tau for each CF-scenario, metric, and sampling percent used while computing the $\Psi$ scores in Section 3.5. To be specific, given $\mathcal{D}^{s,p}$, which is a $p\%$ sample of $f$-type feedback data $\mathcal{D}$, sampled according to sampling strategy $s$, we aim to estimate the corresponding Kendall’s Tau without ever computing the actual ranking of algorithms on either the full or sampled datasets:
where $\Phi$ is an arbitrary neural network, and $m$ is the metric of interest (see Table 1). We train $\Phi$ by either (1) regressing on the Kendall’s Tau computed for each CF scenario, metric, and sampling percent used while computing the $\Psi$ scores in Section 3.5; or (2) performing BPR-style (Rendle et al., 2009) pairwise ranking on pairs of sampling schemes. Formally, the two optimization problems are defined as follows:
The critical differences between the two aforementioned optimization problems are the downstream use-case and the training time of the estimator $\Phi$. If the utility of Data-genie is to rank different sampling schemes for a given dataset, then Data-genie-ranking is better suited as it is robust to the noise in computing Kendall’s Tau, like improper hyper-parameter tuning, local minima, etc. On the other hand, Data-genie-regression is better suited for the use-case of estimating the exact values of Kendall’s Tau for a sampling scheme on a given dataset. Even though both optimization problems converge in less than minutes given the data collected in Section 3.5, the complexity of optimizing Data-genie-ranking is still quadratic in the total number of sampling schemes, whilst that of Data-genie-regression is linear.
To compute the estimate, we concatenate the embedding of the full dataset, the embedding of the sample, and a one-hot embedding of the metric, and pass the result through two ReLU-activated MLP projections. For Data-genie-regression, we also pass the final output through a tanh activation to reflect the range of Kendall’s Tau, i.e., [-1, 1].
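A minimal sketch of this forward pass is given below, assuming NumPy and randomly initialized (untrained) weights. The function and parameter names, the hidden size, and the final linear read-out to a scalar are assumptions of this example rather than details given in the text.

```python
import numpy as np

def init_params(in_dim, hidden, rng):
    # Hypothetical parameter layout: two hidden projections plus a scalar read-out.
    return {
        "W1": rng.normal(scale=0.1, size=(hidden, in_dim)), "b1": np.zeros(hidden),
        "W2": rng.normal(scale=0.1, size=(hidden, hidden)), "b2": np.zeros(hidden),
        "w_out": rng.normal(scale=0.1, size=hidden), "b_out": 0.0,
    }

def data_genie_forward(dataset_emb, sample_emb, metric_id, n_metrics, params,
                       regression=True):
    # One-hot embedding of the evaluation metric.
    one_hot = np.zeros(n_metrics)
    one_hot[metric_id] = 1.0
    # Concatenate full-dataset embedding, sample embedding, and metric one-hot.
    x = np.concatenate([dataset_emb, sample_emb, one_hot])
    # Two ReLU-activated projections.
    h = np.maximum(params["W1"] @ x + params["b1"], 0.0)
    h = np.maximum(params["W2"] @ h + params["b2"], 0.0)
    score = float(params["w_out"] @ h + params["b_out"])
    # Regression variant: squash into [-1, 1], the range of Kendall's Tau.
    return float(np.tanh(score)) if regression else score

rng = np.random.default_rng(0)
emb_dim, n_metrics = 8, 3  # hypothetical sizes
params = init_params(in_dim=2 * emb_dim + n_metrics, hidden=16, rng=rng)
full_emb = rng.normal(size=emb_dim)  # stands in for the full-dataset embedding
samp_emb = rng.normal(size=emb_dim)  # stands in for the sample embedding
tau_hat = data_genie_forward(full_emb, samp_emb, metric_id=1,
                             n_metrics=n_metrics, params=params)
```

For the ranking variant, `regression=False` returns the unbounded score, which only needs to order sampling schemes correctly.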
Since both the representation function and the trained network are agnostic to the datasets and the sampling schemes, we can use them to rank any sampling scheme for any CF-dataset. Computationally, given a trained Data-genie, the utility of a sampling scheme can be obtained by computing the dataset representation twice (once for the full dataset and once for the sample), followed by a single forward pass through the network, which completes on the order of milliseconds.
We first create a train/validation/test split by randomly splitting all possible pairs of metric and sampling percent. Subsequently, for each dataset, CF-scenario, and metric/sampling-percent pair in the validation/test-set, we ask Data-genie to rank all samplers (Table 3) by sorting its estimate for each sampler, as defined in Equation 1. To evaluate Data-genie, we use the P metric between the actual sampler ranking obtained from the experiments in Section 3.5 and the one estimated by Data-genie.
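The evaluation step can be sketched as below, assuming a precision-at-k style comparison of the two rankings. The cutoff `k`, the sampler names, and all numeric values are hypothetical and chosen only to make the example concrete.

```python
def rank_samplers(scores):
    # Sort sampler names by score, best (highest) first.
    return sorted(scores, key=scores.get, reverse=True)

def precision_at_k(predicted, actual, k=1):
    # Fraction of the predicted top-k samplers that are also in the actual top-k.
    return len(set(predicted[:k]) & set(actual[:k])) / k

# Hypothetical Data-genie estimates and measured Kendall's Tau values.
estimated = {"random_interaction": 0.41, "user_head": 0.12, "svp_cf_prop": 0.63}
measured = {"random_interaction": 0.35, "user_head": 0.50, "svp_cf_prop": 0.70}

predicted_ranking = rank_samplers(estimated)
actual_ranking = rank_samplers(measured)
p_at_1 = precision_at_k(predicted_ranking, actual_ranking, k=1)
p_at_2 = precision_at_k(predicted_ranking, actual_ranking, k=2)
```

With `k=1`, the score simply records whether the sampler predicted to be best is the one that actually performed best.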
4.4.1. How accurately can Data-genie predict the best sampling scheme?
In Figure 5, we compare all dataset representation choices and multiple architectures for the task of predicting the best sampling strategy. In addition to the regression and ranking architectures discussed in Section 4.3, we also compare with linear least-squares regression and XGBoost regression (Chen and Guestrin, 2016) as alternative architectures. We further compare Data-genie with two simple baselines: (1) randomly choosing a sampling strategy; and (2) the best possible static choice of sampler: always predict user sampling w/ Bias-only SVP-CF-Prop. First and foremost, irrespective of the representation and architecture choices, Data-genie outperforms both baselines. Next, both the handcrafted features and the unsupervised GCN features perform quite well in predicting the best sampling strategy, indicating that the graph characteristics are well correlated with the final performance of a sampling strategy. Finally, Data-genie-regression and Data-genie-ranking both perform better than the alternative architectures, especially on the P metric.
4.4.2. Can we use Data-genie to sample more data without compromising performance?
In Figure 5, we examine the impact of Data-genie on sampling more data by dynamically choosing an appropriate sampler for a given dataset, metric, and amount of data to sample. More specifically, we compare the percentage of data sampled against the Kendall’s Tau averaged over all datasets, CF-scenarios, and relevant metrics, for different approaches to selecting a sampling strategy. We compare Data-genie with: (1) randomly picking a sampling strategy, averaged over multiple runs; and (2) the Pareto frontier, a skyline which always selects the best sampling strategy for any CF-dataset. As we observe from Figure 5, Data-genie is better than picking a sampling scheme at random, and is much closer to the Pareto frontier. Next, pairwise ranking approaches are marginally better than regression approaches, irrespective of the dataset representation. Finally, Data-genie can identify the best-performing recommendation algorithm with a suitable amount of confidence using only a small fraction of the original data. This is significantly more efficient than always sampling with a fixed strategy, which requires retaining a larger portion of the data.
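To make the skyline comparison concrete, the sketch below computes the best achievable Kendall’s Tau at each sampling percent and the smallest percent that reaches a target Tau. The sampler names, the Tau values, and the target are hypothetical, not results from the experiments.

```python
def skyline(tau_by_sampler):
    # For every sampling percent, keep the best Kendall's Tau over all samplers.
    percents = sorted({p for taus in tau_by_sampler.values() for p in taus})
    return {
        p: max(taus[p] for taus in tau_by_sampler.values() if p in taus)
        for p in percents
    }

def smallest_percent(curve, target):
    # Smallest sampling percent whose Kendall's Tau reaches the target, if any.
    reaching = [p for p, tau in curve.items() if tau >= target]
    return min(reaching) if reaching else None

# Hypothetical Kendall's Tau per sampler and sampling percent.
tau_by_sampler = {
    "random_interaction": {10: 0.20, 50: 0.60, 90: 0.90},
    "svp_cf_prop": {10: 0.70, 50: 0.65, 90: 0.85},
}

frontier = skyline(tau_by_sampler)
oracle_percent = smallest_percent(frontier, target=0.6)
fixed_percent = smallest_percent(tau_by_sampler["random_interaction"], target=0.6)
```

A selector that chooses a sampler per dataset sits between the fixed-strategy curve and this frontier; the closer it is to the frontier, the less data it needs for the same target.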
In this work, we discussed two approaches for representative sampling of CF-data, with a focus on accurately retaining the relative performance of different recommendation algorithms. First, we proposed SVP-CF, which is better than commonly used sampling strategies. We then introduced Data-genie, which analyzes the performance of different samplers on different datasets. Detailed experiments suggest that Data-genie can confidently estimate the downstream utility of any sampler within a few milliseconds, thereby enabling practitioners to benchmark different algorithms on much smaller data sub-samples with a sizable average speedup in run-time.
To quantify the real-world environmental impact of Data-genie, we discuss a typical weekly RecSys development cycle and its carbon footprint. Taking the Criteo Ad dataset as inspiration, we assume a common industry-scale dataset to have on the order of billions of interactions. We assume a hypothetical use case that benchmarks several different algorithms, each with several hyper-parameter variations. To estimate the energy consumption of GPUs, we scale the MLPerf (Mattson et al., 2020) run of training NeuMF (He et al., 2017) on the MovieLens-20M dataset on an Nvidia DGX-2 machine. From the total estimated run-time of all experiments, and following Strubell et al. (2019), the net CO2 emissions would be roughly 20k lbs. To better understand the significance of this number, a brief CO2 emissions comparison is presented in Table 4. Clearly, in addition to saving a large amount of experimentation time and cloud compute cost, Data-genie can also reduce the carbon footprint of this weekly process by more than an average human’s yearly CO2 emissions.
|Consumption||CO2 emissions (lbs)|
|1 person, NY↔SF flight|| |
|Human life, 1 year avg.||11k|
|Weekly RecSys development cycle||20k|
|Weekly RecSys development cycle w/ Data-genie||3.4k|
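As a quick check of the claimed saving, using only the values listed in Table 4:

```latex
\underbrace{20\text{k}}_{\text{weekly cycle}} - \underbrace{3.4\text{k}}_{\text{w/ Data-genie}}
= 16.6\text{k lbs} \;>\; 11\text{k lbs (human life, 1 year avg.)}
```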
Although Data-genie significantly reduces the run-time and environmental impact of benchmarking algorithms on massive datasets, our analysis relies heavily on experiments that train seven recommendation algorithms on six datasets and their various samples. Despite this already large experimental cost, we strongly believe that the downstream performance of Data-genie could be further improved by considering more algorithms and more diverse datasets. In addition to better sampling, analyzing the fairness aspects of training algorithms on sub-sampled datasets is an interesting research direction, which we plan to explore in future work.
Acknowledgements. This work was partly supported by NSF Award #1750063.
- Scaling up collaborative filtering data sets through randomized fractal expansions.
- Coresets via bilevel optimization for continual learning and streaming. In Advances in Neural Information Processing Systems, Vol. 33.
- On target item sampling in offline recommender system evaluation. In 14th ACM Conference on Recommender Systems.
- XGBoost: a scalable tree boosting system. In KDD ’16.
- On sampling strategies for neural network-based collaborative filtering. In Proceedings of the 23rd ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’17.
- Empirical evaluation of gated recurrent neural networks on sequence modeling.
- Selection via proxy: efficient data selection for deep learning. In ICLR.
- Are we really making much progress? A worrying analysis of recent neural recommendation approaches. In Proceedings of the 13th ACM Conference on Recommender Systems, RecSys ’19.
- ImageNet: a large-scale hierarchical image database.
- Understanding the difficulty of training deep feedforward neural networks. In AISTATS.
- Chasing carbon: the elusive environmental footprint of computing. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA).
- The MovieLens datasets: history and context. ACM Transactions on Interactive Intelligent Systems (TiiS).
- Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, WWW ’17.
- Slice: scalable linear extreme classifiers trained on 100 million labels for related searches. In Proceedings of the 12th ACM Conference on Web Search and Data Mining.
- Extreme multi-label loss functions for recommendation, tagging, ranking and other missing label applications. In Proceedings of the SIGKDD Conference on Knowledge Discovery and Data Mining.
- Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining.
- Learning from less data: a unified data subset selection and active learning framework for computer vision. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).
- Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014.
- Semi-supervised classification with graph convolutional networks. In ICLR.
- Matrix factorization techniques for recommender systems. Computer 42 (8).
- Semi-supervised batch active learning via bilevel optimization. In 2021 IEEE International Conference on Acoustics, Speech and Signal Processing.
- On sampled metrics for item recommendation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’20.
- Learning multiple layers of features from tiny images. Technical report.
- Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceedings of the 11th ACM SIGKDD Conference on Knowledge Discovery in Data Mining, KDD ’05.
- Sampling from large graphs. In KDD ’06.
- Variational autoencoders for collaborative filtering. In Proceedings of the 2018 World Wide Web Conference, WWW ’18.
- MLPerf training benchmark. In Proceedings of Machine Learning and Systems.
- Learning attitudes and attributes from multi-aspect reviews. In ICDM ’12.
- Exploring data splitting strategies for the evaluation of recommendation models. In Fourteenth ACM Conference on Recommender Systems, RecSys ’20.
- Prediction-based decisions and fairness: a catalogue of choices, assumptions, and definitions. arXiv preprint arXiv:1811.07867.
- ECLARE: extreme classification with label graph correlations. In Proceedings of The ACM International World Wide Web Conference.
- Is the sample good enough? Comparing data from Twitter’s streaming API with Twitter’s firehose. Proceedings of the International AAAI Conference on Web and Social Media 7 (1).
- Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
- f-GAN: training generative neural samplers using variational divergence minimization. In NeurIPS.
- The PageRank citation ranking: bringing order to the web. Technical report, Stanford InfoLab.
- ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the 8th SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’02.
- Carbon emissions and large neural network training.
- BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI ’09.
- Sequential variational autoencoders for collaborative filtering. In Proceedings of the ACM International Conference on Web Search and Data Mining, WSDM ’19.
- How useful are reviews for recommendation? A critical review and potential improvements. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’20.
- Off-policy bandits with deficient support. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’20.
- Recommendations as treatments: debiasing learning and evaluation. In Proceedings of The 33rd International Conference on Machine Learning.
- Green AI.
- Active learning for convolutional neural networks: a core-set approach. In ICLR.
- Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy.
- InfoGraph: unsupervised and semi-supervised graph-level representation learning via mutual information maximization. In ICLR.
- An empirical study of example forgetting during deep neural network learning. In ICLR.
- Attention is all you need. In NeurIPS.
- Item recommendation on monotonic behavior chains. In Proceedings of the 12th ACM Conference on Recommender Systems.
- Sustainable AI: environmental implications, challenges and opportunities.
- Data mining methods for recommender systems. In Recommender Systems Handbook.
- Graph convolutional neural networks for web-scale recommender systems. In KDD ’18.
- An end-to-end deep learning architecture for graph classification. In AAAI.
- Unbiased implicit recommendation and propensity estimation via combinational joint learning. In Fourteenth ACM Conference on Recommender Systems, RecSys ’20.