1. Introduction
Representative sampling of collaborative filtering (CF) data is a crucial problem from numerous standpoints and can be performed at various levels: (1) mining hard-negatives while training complex algorithms over massive datasets (Mittal et al., 2021; Chen et al., 2017); (2) down-sampling the item-space to estimate expensive ranking metrics (Krichene and Rendle, 2020; Cañamares and Castells, 2020); and (3) creating smaller datasets for easy sharing, fast experimentation, and mitigating the significant environmental footprint of training resource-hungry machine learning models (Patterson et al., 2021; Wu et al., 2021; Gupta et al., 2021; Schwartz et al., 2019). In this paper, we are interested in finding a subsample of a dataset which has minimal effects on model utility evaluation, i.e., an algorithm performing well on the subsample should also perform well on the original dataset. Preserving exactly the same levels of performance on subsampled data over metrics such as MSE and AUC is a challenging problem. A simpler yet useful problem is accurately preserving the ranking
or relative performance of different algorithms on subsampled data. For example, a sampling scheme with low bias but high variance in preserving metric performance values has less utility than one with high bias but low variance, since with the latter the overall algorithm ranking is still preserved.
Performing ad-hoc sampling, such as randomly removing interactions or making dense subsets by removing users or items with few interactions (Sachdeva and McAuley, 2020), can have adverse downstream repercussions. For example, sampling only the head portion of a dataset is, from a fairness and inclusion perspective, inherently biased against minority groups, and benchmarking algorithms on this biased data is likely to propagate the original sampling bias. Alternatively, simply from a model performance viewpoint, accurately retaining the relative performance of different recommendation algorithms on much smaller subsamples is a challenging research problem in itself.
Two prominent directions toward better and more representative sampling of CF data are: (1) designing principled sampling strategies, especially for useritem interaction data; and (2) analyzing the performance of different sampling strategies, in order to better grasp which sampling scheme performs “better” for which types of data. In this paper, we explore both directions through the lens of expediting the recommendation algorithm development cycle by:

Characterizing the efficacy of sixteen different sampling strategies in accurately benchmarking various kinds of recommendation algorithms on smaller subsamples.

Proposing a sampling strategy, SVP-CF, which dynamically samples the “toughest” portion of a CF dataset. SVP-CF is tailor-made to handle the inherent data heterogeneity and missing-not-at-random properties of user–item interaction data.

Developing Data-Genie, which analyzes the performance of different sampling strategies. Given a dataset subsample, Data-Genie can directly estimate the likelihood of model performance being preserved on that sample.
Ultimately, our experiments reveal that SVP-CF outperforms all other sampling strategies and can accurately benchmark recommendation algorithms with only a small fraction of the original dataset size. Furthermore, by employing Data-Genie to dynamically select the best sampling scheme for a dataset, we are able to preserve model performance with an even smaller fraction of the initial data, leading to a net training-time speedup.
2. Related Work
Sampling CF data.
Sampling in CF data has been a popular choice in three major scenarios. Most prominently, sampling is used for mining hard-negatives while training recommendation algorithms. Popular approaches include random sampling; using the graph structure (Ying et al., 2018; Mittal et al., 2021); and ad-hoc techniques like similarity search (Jain et al., 2019), stratified sampling (Chen et al., 2017), etc. Sampling is also commonly employed when evaluating recommendation algorithms, to estimate expensive-to-compute metrics like Recall, nDCG, etc. (Krichene and Rendle, 2020; Cañamares and Castells, 2020). Finally, sampling is used to create smaller subsamples of a large dataset for purposes such as fast experimentation, benchmarking different algorithms, and privacy concerns. However, the consequences of different samplers for any of these downstream applications are understudied, and are the main research interest of this paper.
Coreset selection.
Closest to our work, a coreset is loosely defined as a subset of datapoints that maintains a similar “quality” as the full dataset for subsequent model training. Submodular approaches try to optimize a function which measures the utility of a subset, and use it as a proxy to select the best coreset (Kaushal et al., 2019). More recent works treat coreset selection as a bilevel optimization problem (Borsos et al., 2020; Krause et al., 2021) and directly optimize for the best coreset for a given downstream task. Selection-via-proxy (Coleman et al., 2020) is another technique, which employs a base model as a proxy to tag the importance of each datapoint. Note, however, that most existing coreset selection approaches were designed primarily for classification data, and adapting them to CF data is non-trivial because of: (1) the inherent data heterogeneity; (2) the wide range of recommendation metrics; and (3) the prevalent missing-data characteristics.
Evaluating sample quality.
The quality of a dataset sample, if estimated correctly, is of high interest for various applications. Short of being able to evaluate the “true” utility of a sample, one generally resorts to either retaining task-dependent characteristics (Morstatter et al., 2013) or employing universal, handcrafted features, like a social network’s hop distribution or eigenvalue distribution (Leskovec and Faloutsos, 2006), as meaningful proxies. Note that evaluating sample quality with a limited set of handcrafted features might introduce bias in the sampled data, depending on the number and quality of such features.
3. Sampling Collaborative Filtering Datasets
Given our motivation of quickly benchmarking recommendation algorithms, we now aim to characterize the performance of various commonly-used sampling strategies. We loosely define the performance of a sampling scheme as its ability to effectively retain the performance ranking of different recommendation algorithms on the full vs. subsampled data. In this section, we start by discussing the different recommendation feedback scenarios we consider, along with a representative sample of popular recommendation algorithms that we aim to efficiently benchmark. We then examine popular data sampling strategies, followed by proposing a novel, proxy-based sampling strategy (SVP-CF) that is especially suited to sampling representative subsets from long-tail CF data.
3.1. Problem Settings & Methods Compared
To give a representative sample of typical recommendation scenarios, we consider three different user feedback settings. In explicit feedback, each user $u$ gives a numerical rating $r_{u,i}$ to each interacted item $i$; the model must predict these ratings for novel (test) user–item interactions. Models from this class are evaluated in terms of the Mean Squared Error (MSE) of the predicted ratings. The second scenario we consider is implicit feedback, where interactions are only available for each user's positive items (e.g. clicks or purchases), whilst all non-interacted items are considered negatives. We employ the AUC, Recall@k, and nDCG@k metrics to evaluate model performance for implicit feedback algorithms. Finally, we also consider sequential feedback, where each user $u$ interacts with a time-ordered sequence of items $S_u$; given $S_u$, the goal is to identify the next item each user is most likely to interact with. We use the same metrics as in the implicit feedback setting. Note that, following recent warnings against sampled metrics for evaluating recommendation algorithms (Krichene and Rendle, 2020; Cañamares and Castells, 2020), we compute both Recall and nDCG by ranking all items in the dataset. Further specifics about the datasets used, preprocessing, train/test splits, etc. are discussed in depth in Section 3.5.
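To make the unsampled evaluation protocol concrete, the sketch below ranks all items by predicted score and computes Recall@k and nDCG@k for a single user. The helper name and the Recall normalization by min(k, |test items|) are our own conventions, not taken from the paper's code.

```python
import numpy as np

def recall_ndcg_at_k(scores, test_items, k=10):
    # Rank ALL items by predicted score (no sampled candidate set),
    # then check which held-out test items appear in the top-k.
    top_k = np.argsort(-scores)[:k]
    ranks = [r for r, item in enumerate(top_k) if item in set(test_items)]
    recall = len(ranks) / min(k, len(test_items))
    dcg = sum(1.0 / np.log2(r + 2) for r in ranks)
    idcg = sum(1.0 / np.log2(r + 2) for r in range(min(k, len(test_items))))
    return recall, dcg / idcg if idcg > 0 else 0.0
```

In practice these per-user values are averaged over all test users to obtain the dataset-level metric.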
Given the diversity of the scenarios discussed above, there are numerous relevant recommendation algorithms. We use the following seven, intended to represent both the state-of-the-art and standard baselines:

PopRec: A naïve baseline that simply ranks items according to their overall train-set popularity. Note that this method is unaffected by the user for whom items are being recommended, and produces the same global ranking of all items.

Bias-only: Another simple baseline that assumes no interactions between users and items. Formally, it learns: (1) a global bias $\alpha$; (2) a scalar bias $\beta_u$ for each user $u$; and (3) a scalar bias $\beta_i$ for each item $i$. Ultimately, the rating/relevance for user $u$ and item $i$ is modeled as $\hat{r}_{u,i} = \alpha + \beta_u + \beta_i$.

Matrix Factorization (MF) (Koren et al., 2009): Represents both users and items in a common, low-dimensional latent space by factorizing the user–item interaction matrix. Formally, the rating/relevance for user $u$ and item $i$ is modeled as $\hat{r}_{u,i} = \alpha + \beta_u + \beta_i + \gamma_u \cdot \gamma_i$, where $\gamma_u, \gamma_i$ are learned latent representations.

Neural Matrix Factorization (NeuMF) (He et al., 2017): Leverages the representation power of deep neural networks to capture non-linear correlations between user and item embeddings. Formally, the rating/relevance for user $u$ and item $i$ is modeled as $\hat{r}_{u,i} = f(\gamma_u \,\|\, \gamma_i)$, where ‘$\|$’ represents the concatenation operation and $f$ represents an arbitrarily complex neural network.
Variational AutoEncoders for Collaborative Filtering (MVAE) (Liang et al., 2018): Builds upon the Variational AutoEncoder (VAE) (Kingma and Welling, 2014) framework to learn a low-dimensional representation of a user’s consumption history. More specifically, MVAE encodes each user’s bag-of-words consumption history using a VAE and then decodes the latent representation to obtain the completed user preference over all items.

Sequential Variational AutoEncoders for Collaborative Filtering (SVAE) (Sachdeva et al., 2019): A sequential algorithm that combines the temporal modeling capabilities of a GRU (Chung et al., 2014) with the representation power of VAEs. Unlike MVAE, SVAE uses a GRU to encode the user’s consumption sequence, followed by a multinomial VAE at each time-step to model the likelihood of the next item.

Self-attentive Sequential Recommendation (SASRec) (Kang and McAuley, 2018): Another sequential algorithm, which relies on the sequence-modeling capabilities of self-attentive neural networks (Vaswani et al., 2017) to predict the occurrence of the next item in a user’s consumption sequence. To be precise, given a user $u$ and their time-ordered consumption history $S_u$, SASRec first applies self-attention on $S_u$, followed by a series of non-linear feed-forward layers, to finally obtain the next-item likelihood.
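The Bias-only and MF scoring functions described above can be sketched in a few lines; the array names and random initialization below are illustrative, not the paper's implementation. The neural models (NeuMF, MVAE, SVAE, SASRec) build on the same user/item embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, dim = 5, 7, 4

# Bias-only parameters: a global bias, one scalar per user, one per item.
alpha = 0.5
beta_user = rng.normal(size=n_users)
beta_item = rng.normal(size=n_items)

# MF adds low-dimensional latent factors on top of the biases.
gamma_user = rng.normal(scale=0.1, size=(n_users, dim))
gamma_item = rng.normal(scale=0.1, size=(n_items, dim))

def bias_only_score(u, i):
    # r_hat = alpha + beta_u + beta_i
    return alpha + beta_user[u] + beta_item[i]

def mf_score(u, i):
    # r_hat = alpha + beta_u + beta_i + gamma_u . gamma_i
    return bias_only_score(u, i) + gamma_user[u] @ gamma_item[i]
```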
We also list the models and metrics for each of the three CF scenarios in Table 1. Since Bias-only, MF, and NeuMF can be trained under all three CF scenarios, we optimize them using the regularized least-squares regression loss for explicit feedback, and the pairwise-ranking (BPR (Rendle et al., 2009)) loss for implicit/sequential feedback. Note, however, that the aforementioned algorithms are only intended to be a representative sample of a wide pool of recommendation algorithms, and in our pursuit of benchmarking recommender systems faster, we are primarily concerned with the ranking of different algorithms on the full dataset vs. a smaller subsample.
| CF scenario | Bias-only | MF | NeuMF | PopRec | MVAE | SVAE | SASRec | MSE | AUC | Recall@k | nDCG@k |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Explicit | Yes | Yes | Yes | | | | | Yes | | | |
| Implicit | Yes | Yes | Yes | Yes | Yes | | | | Yes | Yes | Yes |
| Sequential | Yes | Yes | Yes | Yes | Yes | Yes | Yes | | Yes | Yes | Yes |
3.2. Sampling Strategies
Given a user–item CF dataset $\mathcal{D}$, we aim to create a subsampled dataset $\mathcal{D}_s$ according to some sampling strategy $s$. In this paper, to be comprehensive, we consider eight popular sampling strategies, which can be grouped into the following three categories:
3.2.1. Interaction sampling.
We first discuss three strategies that sample interactions from $\mathcal{D}$. In Random Interaction Sampling, we generate $\mathcal{D}_s$ by randomly sampling a fixed proportion of all the user–item interactions in $\mathcal{D}$. User-history Stratified Sampling is another popular technique (see e.g. (Sachdeva et al., 2019; X et al., 2011)) for generating smaller CF datasets. To match the user-frequency distributions of $\mathcal{D}$ and $\mathcal{D}_s$, it randomly samples the same proportion of interactions from each user’s consumption history. Unlike random stratified sampling, User-history Temporal Sampling samples the same proportion of each user’s most recent interactions. This strategy is representative of the popular practice of making data subsets from the last few days of online traffic (Mittal et al., 2021; Jain et al., 2016).
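A minimal sketch of these interaction samplers follows; the helper names are hypothetical, and interactions are assumed to be (user, item, rating, timestamp) tuples.

```python
import random

def random_interaction_sample(interactions, frac, seed=0):
    # Random Interaction Sampling: keep frac of all interactions uniformly.
    rnd = random.Random(seed)
    return rnd.sample(interactions, int(len(interactions) * frac))

def user_history_sample(interactions, frac, temporal=False, seed=0):
    # Stratified: keep frac of EACH user's history, either at random
    # or (temporal=True) the most recent interactions per user.
    rnd = random.Random(seed)
    by_user = {}
    for row in interactions:
        by_user.setdefault(row[0], []).append(row)
    kept = []
    for rows in by_user.values():
        k = max(1, int(len(rows) * frac))
        if temporal:
            rows = sorted(rows, key=lambda r: r[3])  # sort by timestamp
            kept.extend(rows[-k:])                   # most recent k
        else:
            kept.extend(rnd.sample(rows, k))
    return kept
```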
3.2.2. User sampling.
Similar to sampling interactions, we also consider two strategies that sample users instead. To ensure a fair comparison amongst the different kinds of sampling schemes used in this paper, we retain exactly the same proportion of the total interactions in $\mathcal{D}$. In Random User Sampling, we retain users from $\mathcal{D}$ at random; to be more specific, we iteratively preserve all the interactions of a randomly chosen user until we have retained the target proportion of the original interactions. Another strategy we employ is Head User Sampling, in which we iteratively remove the user with the fewest total interactions. This method is representative of commonly-used data preprocessing strategies (see e.g. (Liang et al., 2018; He et al., 2017)) that make data suitable for parameter-heavy algorithms. Sampling the data in such a way can introduce bias against users from minority groups, which might raise concerns from a diversity and fairness perspective (Mitchell et al., 2018).
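Random user sampling, keeping whole user histories until the target proportion of interactions is reached, might look as follows (an illustrative sketch under the same tuple convention as above, not the released code):

```python
import random

def random_user_sample(interactions, frac, seed=0):
    # Keep whole users, chosen at random, until ~frac of interactions remain.
    rnd = random.Random(seed)
    by_user = {}
    for row in interactions:
        by_user.setdefault(row[0], []).append(row)
    users = list(by_user)
    rnd.shuffle(users)
    target, kept = frac * len(interactions), []
    for u in users:
        if len(kept) >= target:
            break
        kept.extend(by_user[u])   # preserve the user's ENTIRE history
    return kept
```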
3.2.3. Graph sampling.
Instead of sampling directly from $\mathcal{D}$, we also consider three strategies that sample from the inherent user–item bipartite interaction graph $\mathcal{G}$. In Centrality-based Sampling, we compute the PageRank centrality score (Page et al., 1999) for each node in $\mathcal{G}$, and retain all the edges (interactions) of the top-scoring nodes until the target proportion of the original interactions has been preserved. Another popular strategy we employ is Random-walk Sampling (Leskovec and Faloutsos, 2006), which performs multiple random walks with restart on $\mathcal{G}$ and retains the edges amongst those pairs of nodes that have been visited at least once; we keep expanding the walk until the target proportion of the initial edges has been retained. We also utilize Forest-fire Sampling (Leskovec et al., 2005), a snowball sampling method which proceeds by randomly “burning” the outgoing edges of visited nodes. It starts with a random node and propagates to a random subset of previously unvisited neighbors; the propagation is terminated once we have created a graph subset with the target proportion of the initial edges.
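A simplified sketch of random-walk sampling on the bipartite interaction graph is given below; the restart probability and the stuck-walk fallback are our own simplifications of the Leskovec–Faloutsos sampler.

```python
import random

def random_walk_sample(edges, frac, restart=0.15, seed=0):
    # edges: list of (user, item) pairs; returns a set of visited edges
    # covering ~frac of the original edges.
    rnd = random.Random(seed)
    adj = {}
    for u, i in edges:
        adj.setdefault(('u', u), []).append(('i', i))
        adj.setdefault(('i', i), []).append(('u', u))
    nodes = list(adj)
    target = int(frac * len(edges))
    start = node = rnd.choice(nodes)
    visited, stale = set(), 0
    while len(visited) < target:
        if rnd.random() < restart:
            node = start                      # jump back to the walk's seed
        nxt = rnd.choice(adj[node])
        edge = (node[1], nxt[1]) if node[0] == 'u' else (nxt[1], node[1])
        if edge in visited:
            stale += 1
            if stale > 100 * len(edges):      # walk is stuck: restart elsewhere
                start = node = rnd.choice(nodes)
                stale = 0
                continue
        else:
            visited.add(edge)
            stale = 0
        node = nxt
    return visited
```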
3.3. SVP-CF: Selection-Via-Proxy for CF data
Selection-Via-Proxy (SVP) (Coleman et al., 2020) is a leading coreset mining technique for classification datasets like CIFAR10 (Krizhevsky, 2009) and ImageNet (Deng et al., 2009). The main idea is simple and effective: train a relatively inexpensive base model as a proxy to define the “importance” of a datapoint. However, applying SVP to CF data can be highly non-trivial because of the following impediments:
Data heterogeneity: Unlike classification data over some input space $\mathcal{X}$ and label space $\mathcal{Y}$, CF data consists of numerous (user, item, relevance, time) four-tuples. Such multimodal data adds many different dimensions along which to sample, making it increasingly complex to define meaningful samplers.

Defining the importance of a datapoint: Unlike classification, where we can measure the performance of a classifier by its empirical risk on held-out data, for recommendation there are a variety of different scenarios (Section 3.1) along with a wide list of relevant evaluation metrics. Hence, it becomes challenging to adapt importance-tagging techniques like greedy k-centers (Sener and Savarese, 2018), forgetting-events (Toneva et al., 2019), etc. to recommendation tasks.
Missing data: CF data is well-known for (1) its sparsity; (2) skewed, long-tail user/item distributions; and (3) the missing-not-at-random (MNAR) property of the user–item interaction matrix. This creates additional problems, as we are now sampling from skewed MNAR data, especially when using proxy models trained on that same skewed data. In the worst case, such sampling might even exacerbate the existing biases in the data or yield aberrant data samples.
To address these fundamental limitations in applying the SVP philosophy to CF data, we propose SVP-CF to sample representative subsets from large user–item interaction data. SVP-CF is specifically devised for our objective of benchmarking different recommendation algorithms, as it relies on the crucial assumption that the “easiest” part of a dataset will generally be easy for all algorithms. Under this assumption, even after removing such data we are still likely to retain the overall ranking of algorithms.
Because of the inherent heterogeneity of user–item interaction data, we can subsample it in a variety of different ways. We design SVP-CF to be versatile in this respect: it can be applied to sample users, items, interactions, or combinations thereof, by marginally adjusting the definition of the importance of each datapoint. In this paper, we limit the discussion to sampling users and interactions (separately), but extending SVP-CF to sample across other data modalities should be relatively straightforward.
Irrespective of whether we sample users or interactions, SVP-CF proceeds by training an inexpensive proxy model $\mathcal{P}$ on the full, original data and modifies the forgetting-events approach (Toneva et al., 2019) to retain the points with the highest importance. To be more specific, for explicit feedback we define the importance of a datapoint, i.e., an interaction $(u, i)$, as $\mathcal{P}$'s MSE on that specific interaction (averaged over training epochs) if we're sampling interactions, or $\mathcal{P}$'s average MSE over all of user $u$'s interactions if we're sampling users. For implicit and sequential feedback, we instead use $\mathcal{P}$'s average inverse-AUC to compute the importance of each datapoint. For the sake of completeness, we experiment with both Bias-only and MF as two different kinds of proxy models for SVP-CF. Since both models can be trained for all three CF scenarios (Table 1), we can directly use them to tag importance in each CF scenario.

Ultimately, to handle the MNAR and long-tail problems, we also propose SVP-CF-Prop, which employs user and item propensities to correct the distribution mismatch while estimating the importance of each datapoint. More specifically, let $p_{u,i}$ denote the probability of user $u$ and item $i$'s interaction actually being observed (the propensity), $E$ be the total number of epochs that $\mathcal{P}$ was trained for, $\mathcal{P}_e$ denote the proxy model after the $e$-th epoch, and $\mathcal{I}^+_u$ and $\mathcal{I}^-_u$ be the sets of positive and negative interactions for user $u$. The importance function for SVP-CF-Prop then accumulates the per-epoch error of $\mathcal{P}_e$ over all $E$ epochs, as above, and weights it by the inverse propensity $1/p_{u,i}$, so that interactions from the data's long tail are not systematically under-sampled.
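The selection step of SVP-CF can be sketched as follows, assuming we have already recorded the proxy model's per-epoch error for every candidate point (MSE for explicit feedback, inverse-AUC otherwise). The helper name is hypothetical, and the inverse-propensity weighting corresponds to the SVP-CF-Prop variant.

```python
import numpy as np

def svp_cf_select(errors, frac, propensity=None):
    # errors: [n_epochs, n_points] per-epoch error of the proxy model.
    # Importance = average error over epochs; SVP-CF-Prop additionally
    # divides by each point's propensity before ranking.
    importance = np.asarray(errors, dtype=float).mean(axis=0)
    if propensity is not None:
        importance = importance / np.asarray(propensity, dtype=float)
    k = int(len(importance) * frac)
    return np.argsort(-importance)[:k]   # indices of the "toughest" points
```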
Propensity model.
A wide variety of ways exist to model the propensity score of a user–item interaction (Zhu et al., 2020; Schnabel et al., 2016; Sachdeva et al., 2020; Jain et al., 2016). The most common comprise using machine learning models like naïve Bayes and logistic regression (Schnabel et al., 2016), or fitting handcrafted functions (Jain et al., 2016). For our problem statement, we make the simplifying assumption that the data noise is one-sided, i.e., the probability of a user interacting with a wrong item is zero, and model the probability of an interaction going missing as decomposing over the user and item, i.e., $p_{u,i} = p_u \cdot p_i$. Ultimately, following (Jain et al., 2016), we assume the user and item propensities lie on the following sigmoid curves:

$p_u = \frac{1}{1 + C \cdot e^{-A \cdot \log(N_u + B)}}, \qquad p_i = \frac{1}{1 + C \cdot e^{-A \cdot \log(N_i + B)}}$

where $N_u$ and $N_i$ represent the total number of interactions of user $u$ and item $i$ respectively, $A$ and $B$ are two fixed scalars, and $C$ is a scalar derived from $A$, $B$, and the size of the dataset (Jain et al., 2016).
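Assuming the parameterization of Jain et al. (2016), where C = (log n − 1)(B + 1)^A with n the dataset size, the propensity curve can be computed as below. The default values for A and B are illustrative, not the paper's tuned constants.

```python
import numpy as np

def propensity(counts, A=0.55, B=1.5, n_total=None):
    # counts: per-user (or per-item) interaction counts N.
    # p = 1 / (1 + C * exp(-A * log(N + B))), a sigmoid in log-frequency:
    # rare users/items get low propensity, frequent ones approach 1.
    counts = np.asarray(counts, dtype=float)
    n_total = n_total or counts.sum()
    C = (np.log(n_total) - 1) * (B + 1) ** A
    return 1.0 / (1.0 + C * np.exp(-A * np.log(counts + B)))
```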
3.4. Performance of a sampling strategy
To quantify the performance of a sampling strategy $s$ on a dataset $\mathcal{D}$, we start by creating various subsets of $\mathcal{D}$ according to $s$, one for each sampling percent $p$, and call them $\{\mathcal{D}^p_s\}$. Next, we train and evaluate all the relevant recommendation algorithms on both $\mathcal{D}$ and each $\mathcal{D}^p_s$. Let the rankings of all algorithms for CF scenario $f$ and metric $m$ when trained on $\mathcal{D}$ and $\mathcal{D}^p_s$ be $\mathcal{R}_{f,m}$ and $\mathcal{R}^{s,p}_{f,m}$ respectively; then the performance measure $\Psi(\mathcal{D}, s)$ is defined as the average correlation between the two rankings, measured through Kendall's Tau, over all possible CF scenarios, metrics, and sampling percents:

$\Psi(\mathcal{D}, s) = \frac{1}{Z} \sum_{p} \sum_{f} \sum_{m} \tau\left(\mathcal{R}_{f,m},\, \mathcal{R}^{s,p}_{f,m}\right)$

where $Z$ is an appropriate normalizing constant for computing the average, and the sampling percents $p$, CF scenarios $f$, metrics $m$, and their pertinence toward each other can all be found in Table 1. $\Psi$ has the same range as Kendall's Tau, i.e., $[-1, 1]$: a higher $\Psi$ indicates strong agreement between the algorithm rankings on the full and subsampled datasets, whereas a large negative $\Psi$ implies that the algorithm order was effectively reversed.
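The performance measure reduces to averaging Kendall's Tau between full-data and subsample rankings; a small self-contained sketch (Tau-a over strict, tie-free rankings, with hypothetical helper names) follows.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    # Kendall's Tau-a between two strict rankings of the same algorithms,
    # each given as a list ordered best-to-worst.
    pos_a = {alg: r for r, alg in enumerate(rank_a)}
    pos_b = {alg: r for r, alg in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    n = len(rank_a)
    return (concordant - discordant) / (n * (n - 1) / 2)

def psi(rankings_full, rankings_sub):
    # Average tau over matching (scenario, metric, sampling %) settings,
    # keyed identically in both dicts.
    taus = [kendall_tau(rankings_full[k], rankings_sub[k]) for k in rankings_full]
    return sum(taus) / len(taus)
```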
3.5. Experiments
Datasets.
To promote dataset diversity in our experiments, we use six public user–item rating interaction datasets with varying sizes, sparsity patterns, and other characteristics: the Magazine, Luxury, and Video-games categories of the Amazon review dataset (Ni et al., 2019), along with the Movielens-100k (Harper and Konstan, 2015), BeerAdvocate (McAuley et al., 2012), and GoodReads Comics (Wan and McAuley, 2018) datasets. A brief set of data statistics is presented in Table 2. We simulate all three CF scenarios (Section 3.1) via different preprocessing strategies. For explicit and implicit feedback, we follow a randomized 80/10/10 train-test-validation split of each user's consumption history, and we use the leave-one-last (Meng et al., 2020) strategy for sequential feedback, i.e., we keep the last two interactions in each user's time-sorted consumption history in the validation and test sets respectively. Since we can't control the initial construction of the datasets, and to minimize the initial data bias, we follow the least restrictive data preprocessing (Dacrema et al., 2019; Sachdeva and McAuley, 2020): we only weed out users with fewer than 3 interactions, so as to keep at least one interaction each in the train, validation, and test sets.

| Dataset | # Interactions | # Users | # Items | Avg. user history length |
|---|---|---|---|---|
| Amazon Magazine | 12.7k | 3.1k | 1.3k | 4.1 |
| ML-100k | 100k | 943 | 1.7k | 106.04 |
| Amazon Luxury | 126k | 29.7k | 8.4k | 4.26 |
| Amazon Video-games | 973k | 181k | 55.3k | 5.37 |
| BeerAdvocate | 1.51M | 18.6k | 64.3k | 81.45 |
| GoodReads Comics | 4.37M | 133k | 89k | 32.72 |

Training details.
We implement all algorithms in PyTorch¹ and train on a single GPU. For a fair comparison across algorithms, we perform hyper-parameter search on the validation set: for the three smallest datasets used in this paper (Table 2), we search over the latent size, dropout, and learning rate, whereas for the three largest datasets we fix the learning rate. Note that despite the limited number of datasets and recommendation algorithms used in this study, since we need to train every algorithm, with hyper-parameter tuning, for every CF scenario, on data sampled according to every sampling strategy discussed in Section 3.2, a very large number of unique models is trained, amounting to a cumulative train time measured in months.

¹Code is available at https://github.com/noveens/sampling_cf

Data sampling.
To compute the $\Psi(\mathcal{D}, s)$ values defined in Section 3.4, we construct subsamples at several sampling percents for each dataset and sampling strategy. To keep comparisons as fair as possible, for all sampling schemes we only sample the train set and never touch the validation and test sets.
3.5.1. How do different sampling strategies compare to each other?
Results with $\Psi$ values for all sampling schemes on all datasets are shown in Table 3. Even though there are only six datasets under consideration, a few prominent patterns emerge. First, the average $\Psi$ for most sampling schemes is moderately high, which implies a statistically significant correlation between the rankings of algorithms on the full vs. subsampled datasets. Next, SVP-CF generally outperforms all the commonly-used sampling strategies by some margin in retaining the ranking of different recommendation algorithms. Finally, the methods that discard the tail of a dataset (head-user and centrality-based sampling) are the worst-performing strategies overall, which supports the recent warnings against dense sampling of data (Sachdeva and McAuley, 2020).
| Sampling strategy | Magazine | ML-100k | Luxury | Video-games | BeerAdvocate | Comics | Average |
|---|---|---|---|---|---|---|---|
| **Interaction sampling** | | | | | | | |
| Random | 0.428 | 0.551 | 0.409 | 0.047 | 0.455 | 0.552 | 0.407 |
| Stratified | 0.27 | 0.499 | 0.291 | -0.01 | 0.468 | 0.538 | 0.343 |
| Temporal | 0.289 | 0.569 | 0.416 | -0.02 | 0.539 | 0.634 | 0.405 |
| SVP-CF w/ MF | 0.418 | 0.674 | 0.398 | 0.326 | 0.425 | 0.662 | 0.484 |
| SVP-CF w/ Bias-only | 0.38 | 0.684 | 0.431 | 0.348 | 0.365 | 0.6 | 0.468 |
| SVP-CF-Prop w/ MF | 0.381 | 0.617 | 0.313 | 0.305 | 0.356 | 0.608 | 0.43 |
| SVP-CF-Prop w/ Bias-only | 0.408 | 0.617 | 0.351 | 0.316 | 0.437 | 0.617 | 0.458 |
| **User sampling** | | | | | | | |
| Random | 0.436 | 0.622 | 0.429 | 0.17 | 0.344 | 0.582 | 0.431 |
| Head | 0.369 | 0.403 | 0.315 | 0.11 | -0.04 | -0.02 | 0.19 |
| SVP-CF w/ MF | 0.468 | 0.578 | 0.308 | 0.13 | 0.136 | 0.444 | 0.344 |
| SVP-CF w/ Bias-only | 0.49 | 0.608 | 0.276 | 0.124 | 0.196 | 0.362 | 0.343 |
| SVP-CF-Prop w/ MF | 0.438 | 0.683 | 0.307 | 0.098 | 0.458 | 0.592 | 0.429 |
| SVP-CF-Prop w/ Bias-only | 0.434 | 0.751 | 0.233 | 0.107 | 0.506 | 0.637 | 0.445 |
| **Graph sampling** | | | | | | | |
| Centrality | 0.307 | 0.464 | 0.407 | 0.063 | 0.011 | 0.343 | 0.266 |
| Random-walk | 0.596 | 0.5 | 0.395 | 0.306 | 0.137 | 0.442 | 0.396 |
| Forest-fire | 0.564 | 0.493 | 0.415 | 0.265 | 0.099 | 0.454 | 0.382 |
3.5.2. How does the relative performance of algorithms change as a function of sampling rate?
In an attempt to better understand the impact of sampling on the different recommendation algorithms used in this study (Section 3.1), we visualize the probability of a recommendation algorithm moving up in the overall method ranking as data is sampled. We estimate this probability using Maximum-Likelihood-Estimation (MLE) on the experiments already run while computing $\Psi$. Formally, given a recommendation algorithm $A$, CF scenario $f$, and data sampling percent $p$, the MLE is the fraction of experiments, over all datasets, samplers, and metrics, in which $A$ ranks higher on the subsampled data than on the full data:

$\hat{P}(A \uparrow \mid f, p) = \frac{1}{Z} \sum \mathbb{1}\left[A \text{ ranks higher on } \mathcal{D}^p_s \text{ than on } \mathcal{D}\right]$

where $Z$ is an appropriate normalizing constant, and rank positions range over the total number of algorithms. A heatmap visualizing this probability for all algorithms and CF scenarios is shown in Figure 1. We see that simpler methods like Bias-only and PopRec have the highest probability, across data scenarios, of increasing their ranking order under extreme sampling, whereas parameter-heavy algorithms like SASRec, SVAE, MVAE, etc. tend to move down the ranking, which is indicative of overfitting on smaller data samples.
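The MLE described above is just a conditional frequency count; a sketch with hypothetical experiment tuples:

```python
from collections import defaultdict

def rank_up_probability(experiments):
    # experiments: (algorithm, scenario, pct, rank_full, rank_sample) tuples;
    # rank indices start at 0 and lower = better position.
    # Returns the MLE of P(algorithm moves up the ranking | scenario, pct).
    up, total = defaultdict(int), defaultdict(int)
    for alg, scen, pct, r_full, r_samp in experiments:
        key = (alg, scen, pct)
        total[key] += 1
        up[key] += int(r_samp < r_full)
    return {k: up[k] / total[k] for k in total}
```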
3.5.3. How much data to sample?
Since $\Psi(\mathcal{D}, s)$ is averaged over all sampling percents $p$, to better estimate a reasonable amount of data to sample, we stratify $\Psi$ according to each value of $p$ and note the average Kendall's Tau. As we observe from Figure 2, there is a steady increase in the performance measure as more data is retained. Nevertheless, despite the results in Figure 2 being averaged over sixteen sampling strategies, we still notice a significant amount of performance retained after sampling only a small fraction of the data.
3.5.4. Are different metrics affected equally by sampling?
In an attempt to better understand how the different implicit and sequential feedback metrics (Section 3.1) are affected by sampling, in Figure 3 we visualize the average Kendall's Tau, separately over the AUC, Recall, and nDCG metrics, for all sampling strategies (except SVP-CF, for brevity) and all sampling percents. As expected, we observe a steady decrease in model quality across the accuracy metrics for the different sampling schemes, in agreement with the analysis from Figure 2. Furthermore, most sampling schemes follow a similar downward trend in performance on the three metrics, with AUC being slightly less affected and nDCG slightly more affected by extreme sampling.
4. Data-Genie: Which sampler is best for me?
Although the results presented in Section 3.5 are indicative of correlation between the rankings of recommendation algorithms on the full dataset vs. smaller subsamples, there still is no ‘one-size-fits-all’ solution to the question of how best to subsample a dataset while retaining the relative performance of different recommendation algorithms. In this section, we propose Data-Genie, which attempts to answer this question from a statistical perspective, in contrast with existing literature that generally resorts to sensible heuristics (Leskovec and Faloutsos, 2006; Belletti et al., 2019; Chen et al., 2017).

4.1. Problem formulation
Given a dataset $\mathcal{D}$, we aim to gauge how a new sampling strategy will perform in retaining the performance of different recommendation algorithms. Having already experimented with sixteen different sampling strategies on six datasets (Section 3.5), we take a frequentist approach to predicting the performance of any sampling scheme. To be precise, to predict the performance of sampling scheme $s$ on dataset $\mathcal{D}$, we start by creating $\mathcal{D}$'s subset according to $s$ and call it $\mathcal{D}_s$. We then represent $\mathcal{D}$ and $\mathcal{D}_s$ in a low-dimensional latent space, and train a powerful regression model on top to directly estimate the performance of $s$ on $\mathcal{D}$.
4.2. Dataset representation
We experiment with the following techniques for embedding a user–item interaction dataset into lower dimensions:
4.2.1. Handcrafted.
For this method, we cherry-pick a few representative characteristics of $\mathcal{D}$ and the underlying user–item bipartite interaction graph $\mathcal{G}$. Inspired by prior work (Leskovec and Faloutsos, 2006), we represent $\mathcal{D}$ as a combination of five features. We first utilize the frequency distributions of all users and of all items in $\mathcal{D}$. Next, we evaluate the distribution of the top eigenvalues of $\mathcal{G}$'s adjacency matrix. All three of these distributions are generally long-tailed and heavily skewed. Furthermore, to capture notions like the diameter of $\mathcal{G}$, we compute the hop-plot: the distribution of the number of pairs of nodes in $\mathcal{G}$ reachable within $h$ hops, as a function of $h$ (Palmer et al., 2002). This distribution, unlike the others, is monotonically increasing in $h$. Finally, we also compute the size distribution of all connected components in $\mathcal{G}$, where a connected component is defined as a maximal set of nodes such that a path exists between any pair of them. Ultimately, we ascertain $\mathcal{D}$'s final representation by concatenating evenly-spaced samples from each of the aforementioned distributions, along with the total number of users, items, and interactions in $\mathcal{D}$, resulting in a fixed-size embedding for each dataset. Note that unlike previous work, which simply retains the discussed features as a proxy for the quality of data subsets (Leskovec and Faloutsos, 2006), Data-Genie instead uses these features to learn a regression model on top, which can dynamically establish the importance of each feature for the performance of a sampling strategy.
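A pared-down sketch of the handcrafted embedding follows, covering only the user/item frequency distributions and the global counts; the eigenvalue, hop-plot, and component-size features are omitted for brevity, and the helper name is our own.

```python
import numpy as np

def handcrafted_embedding(interactions, n_samples=10):
    # interactions: list of (user, item) pairs. Each skewed distribution
    # contributes n_samples evenly-spaced values, so the embedding size is
    # fixed regardless of the dataset's size.
    users = [u for u, _ in interactions]
    items = [i for _, i in interactions]
    feats = []
    for ids in (users, items):
        _, freq = np.unique(ids, return_counts=True)
        freq = np.sort(freq)[::-1].astype(float)          # long-tail, desc.
        idx = np.linspace(0, len(freq) - 1, n_samples).astype(int)
        feats.append(freq[idx])                           # evenly-spaced samples
    counts = [len(interactions), len(set(users)), len(set(items))]
    return np.concatenate(feats + [counts])
```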
4.2.2. Unsupervised GCN.
With the recent advancements in Graph Convolutional Networks (Kipf and Welling, 2017) for representing graph-structured data across a variety of downstream tasks, we also experiment with a GCN approach to embed $\mathcal{G}$. We modify the InfoGraph framework (Sun et al., 2020), which uses graph convolution encoders to obtain patch-level representations, followed by sort-pooling (Zhang et al., 2018) to obtain a fixed, low-dimensional embedding for the entire graph. Since the nodes in $\mathcal{G}$ are the union of all users and items in $\mathcal{D}$, we randomly initialize their embeddings using a Xavier-uniform prior (Glorot and Bengio, 2010). Parameter optimization is performed in an unsupervised fashion by maximizing the mutual information (Nowozin et al., 2016) between the graph-level and patch-level representations of nodes in the same graph. We validate the best values of the GCN's latent dimension and number of layers on held-out data.
4.3. Training & Inference
Having discussed different representation functions $\mathcal{E}(\cdot)$ for embedding a CF dataset in Section 4.2, we now discuss Data-Genie's training framework, which is agnostic to the actual details of $\mathcal{E}$.
Optimization problem.
As a proxy for the performance of a sampler on a given dataset, we reuse the Kendall's Tau (τ) for each CF-scenario, metric, and sampling percent used while computing the sampler-performance scores in Section 3.5. To be specific, given D_s^f, a sample of the type-f feedback data in dataset D obtained according to sampling strategy s, we aim to estimate τ without ever computing the actual ranking of algorithms on either the full or sampled datasets:

(1)   τ̂_{s,f,m}(D) := Ψ( φ(D), φ(D_s^f), m )

where Ψ is an arbitrary neural network, φ is a dataset representation function from Section 4.2, and m is the metric of interest (see Table 1). We train Ψ by either (1) regressing on the Kendall's Tau computed for each CF-scenario, metric, and sampling percent used while computing the scores in Section 3.5; or (2) performing BPR-style (Rendle et al., 2009) pairwise ranking on two sampling schemes s₁ and s₂. Formally, the two optimization problems are defined as follows:

(Datagenie-regression)   arg min_Ψ  Σ_{D,s,f,m} ( τ̂_{s,f,m}(D) − τ_{s,f,m}(D) )²

(Datagenie-ranking)   arg max_Ψ  Σ_{D,f,m}  Σ_{(s₁,s₂) : τ_{s₁,f,m}(D) > τ_{s₂,f,m}(D)}  ln σ( τ̂_{s₁,f,m}(D) − τ̂_{s₂,f,m}(D) )

where σ(x) = 1 / (1 + e^{−x}) is the sigmoid function, and τ_{s,f,m}(D) denotes the actual Kendall's Tau measured in Section 3.5.
The critical differences between the two aforementioned optimization problems are the downstream use-case and the training time. If the utility of Datagenie is to rank different sampling schemes for a given dataset, then Datagenie-ranking is better suited, as it is robust to noise in the measured Kendall's Tau caused by, e.g., improper hyperparameter tuning or local minima. On the other hand, Datagenie-regression is better suited for estimating the exact Kendall's Tau of a sampling scheme on a given dataset. Even though both optimization problems converge within minutes given the data collected in Section 3.5, the complexity of optimizing Datagenie-ranking is still quadratic in the total number of sampling schemes, whilst that of Datagenie-regression is linear.
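The two training objectives can be sketched as follows. This minimal numpy version treats the network's predicted scores as given and only illustrates the losses; the explicit double loop also makes the quadratic cost of the ranking objective visible:

```python
import numpy as np

def regression_loss(pred_tau, true_tau):
    """Regression variant: MSE against the measured Kendall's Tau."""
    pred_tau, true_tau = np.asarray(pred_tau), np.asarray(true_tau)
    return np.mean((pred_tau - true_tau) ** 2)

def bpr_ranking_loss(pred_tau, true_tau):
    """Ranking variant: BPR-style pairwise loss over sampler pairs.

    For every ordered pair of samplers (i, j) where the true Tau of i
    exceeds that of j, we penalize -log(sigmoid(pred_i - pred_j)),
    i.e., we reward the network for scoring i above j.
    """
    pred_tau, true_tau = np.asarray(pred_tau), np.asarray(true_tau)
    loss, n_pairs = 0.0, 0
    for i in range(len(true_tau)):
        for j in range(len(true_tau)):
            if true_tau[i] > true_tau[j]:
                sig = 1.0 / (1.0 + np.exp(-(pred_tau[i] - pred_tau[j])))
                loss += -np.log(sig)
                n_pairs += 1
    return loss / max(n_pairs, 1)  # O(n^2) pairs vs O(n) for regression
```

Minimizing the regression loss fits exact Tau values; minimizing the BPR loss only fits the relative order of samplers, which is why it is the more robust choice when ranking is all that matters.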
Architecture.
To predict the Kendall's Tau, we concatenate the embeddings of the full and sampled datasets along with a one-hot embedding of the metric, and pass the result through two ReLU-activated MLP projections to obtain the final score. For Datagenie-regression, we also pass the final output through a tanh activation, to reflect the range of Kendall's Tau, i.e., [-1, 1].
Inference.
Since both the dataset representation function and the scoring network are agnostic to the specific datasets and sampling schemes, we can simply reuse them, once trained, to rank any sampling scheme for any CF dataset. Computationally, given a trained Datagenie, the utility of a sampling scheme can be estimated by computing the dataset representation twice (once for the full data, once for the sample), followed by a single pass over the network, completing in the order of milliseconds.
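The forward pass and inference step might look as follows; the layer widths, parameter names, and placeholder weights are all illustrative assumptions rather than the paper's actual configuration:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def predict_tau(full_emb, sample_emb, metric_onehot, params, regression=True):
    """Sketch of the scoring network's forward pass.

    Concatenates the full-data embedding, the sampled-data embedding,
    and a one-hot metric embedding; applies two ReLU-activated MLP
    projections; and, for the regression variant, a final tanh so the
    output lies in [-1, 1], matching the range of Kendall's Tau.
    """
    x = np.concatenate([full_emb, sample_emb, metric_onehot])
    h = relu(params["W1"] @ x + params["b1"])
    h = relu(params["W2"] @ h + params["b2"])
    out = float(params["w_out"] @ h + params["b_out"])
    return np.tanh(out) if regression else out
```

At inference, ranking all candidate samplers for a dataset just means calling `predict_tau` once per sampler (reusing the full-data embedding) and sorting the scores, which is why the whole procedure finishes in milliseconds.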
4.4. Experiments
Setup.
We first create a train/validation/test split by randomly partitioning all possible (metric, sampling percent) pairs into fixed proportions. Subsequently, for each dataset, CF-scenario, and pair in the validation/test set, we ask Datagenie to rank all samplers (Table 3) for sampling the given feedback type and evaluating with the given metric, by sorting the estimated Kendall's Tau of each sampler as defined in Equation 1. To evaluate Datagenie, we use the P metric between the actual sampler ranking, computed while obtaining the scores in Section 3.5, and the one estimated by Datagenie.
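The evaluation protocol can be sketched as below, under the assumption that "P" denotes the overlap between the true and predicted top-k samplers (a precision-style reading; the paper's exact definition may differ):

```python
import numpy as np
from scipy.stats import kendalltau

def sampler_ranking_quality(true_scores, pred_scores, k=1):
    """Agreement between the actual and predicted sampler rankings.

    Returns (tau, p_at_k): Kendall's Tau between the two score lists,
    and the fraction of the true top-k samplers recovered in the
    predicted top-k. Both the metric name and k are assumptions made
    for illustration.
    """
    true_scores = np.asarray(true_scores)
    pred_scores = np.asarray(pred_scores)
    tau, _ = kendalltau(true_scores, pred_scores)
    top_true = set(np.argsort(-true_scores)[:k])
    top_pred = set(np.argsort(-pred_scores)[:k])
    return tau, len(top_true & top_pred) / k
```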
4.4.1. How accurately can Datagenie predict the best sampling scheme?
In Figure 5, we compare all dataset representation choices and multiple architectures on the task of predicting the best sampling strategy. In addition to the regression and ranking architectures discussed in Section 4.3, we also compare with linear least-squares regression and XGBoost regression (Chen and Guestrin, 2016) as other choices of the scoring network. In addition, we compare Datagenie with two simple baselines: (1) randomly choosing a sampling strategy; and (2) the best possible static sampler-choosing strategy, i.e., always predicting user sampling w/ Bias-only SVP-CF-Prop. First and foremost, irrespective of the representation and architecture choices, Datagenie outperforms both baselines. Next, both the hand-crafted features and the unsupervised GCN features perform quite well at predicting the best sampling strategy, indicating that the graph characteristics are well correlated with the final performance of a sampling strategy. Finally, Datagenie-regression and Datagenie-ranking both perform better than the alternative choices, especially on the P metric.
4.4.2. Can we use Datagenie to sample more data without compromising performance?
In Figure 5, we also measure the impact of Datagenie when sampling more data by dynamically choosing an appropriate sampler for a given dataset, metric, and amount of data to sample. More specifically, we compare the percentage of data sampled against the Kendall's Tau averaged over all datasets, CF-scenarios, and relevant metrics for different sampler-selection approaches. We compare Datagenie with: (1) randomly picking a sampling strategy, averaged over multiple runs; and (2) the Pareto frontier as a skyline, which always selects the best sampling strategy for any CF dataset. As we observe from Figure 5, Datagenie is better than predicting a sampling scheme at random, and is much closer to the Pareto frontier. Next, pairwise ranking approaches are marginally better than regression approaches irrespective of the dataset representation. Finally, Datagenie can identify the best-performing recommendation algorithm with a suitable amount of confidence using only a small fraction of the original data, significantly less than we would need if we always sampled with a single, fixed strategy.
5. Discussion
In this work, we discussed two approaches for representative sampling of CF data, especially for accurately retaining the relative performance of different recommendation algorithms. First, we proposed SVP-CF, which outperforms commonly used sampling strategies; we then introduced Datagenie, which analyzes the performance of different samplers on different datasets. Detailed experiments suggest that Datagenie can confidently estimate the downstream utility of any sampler within a few milliseconds, thereby enabling practitioners to benchmark different algorithms on small data subsamples with a substantial average speedup in experimentation time.
To gauge the real-world environmental impact of Datagenie, we consider a typical weekly RecSys development cycle and its carbon footprint. Taking the Criteo Ad dataset as inspiration, we assume a common industry-scale dataset to have billions of interactions. We assume a hypothetical use case that benchmarks, e.g., a handful of algorithms, each with numerous hyperparameter variations. To estimate the energy consumption of GPUs, we scale the MLPerf (Mattson et al., 2020) reference run of training NeuMF (He et al., 2017) on the Movielens-20M dataset over an Nvidia DGX-2 machine. The total estimated runtime for all experiments is substantial; and following (Strubell et al., 2019), the net CO2 emissions would roughly be 20k lbs. To put this number in perspective, a brief CO2 emissions comparison is presented in Table 4. Clearly, Datagenie, along with saving a large amount of experimentation time and cloud compute cost, can also reduce the carbon footprint of this weekly process by more than an average human's yearly CO2 emissions.
Consumption                        CO2e (lbs.)
1 person, NY↔SF flight             2k
Human life, 1 year avg.            11k
Weekly RecSys development cycle    20k
  ” w/ Datagenie                   3.4k
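For a rough sense of where such CO2e numbers come from, the standard back-of-the-envelope conversion of (Strubell et al., 2019) multiplies compute time by hardware power draw, datacenter power usage effectiveness (PUE), and the grid's emission factor. All concrete inputs below (GPU-hours, wattage) are illustrative assumptions, not figures from the paper:

```python
def co2e_lbs(gpu_hours, watts_per_gpu=350.0, pue=1.58, lbs_per_kwh=0.954):
    """Back-of-the-envelope CO2e estimate in the style of Strubell et al.

    kWh = gpu_hours * (watts / 1000) * PUE; emissions = kWh * grid factor.
    The PUE (1.58) and US-average grid factor (0.954 lbs CO2e/kWh) follow
    Strubell et al. (2019); the per-GPU wattage is an assumed value.
    """
    kwh = gpu_hours * (watts_per_gpu / 1000.0) * pue
    return kwh * lbs_per_kwh

# A hypothetical weekly cycle of ~37,000 GPU-hours lands near the
# ~20k-lbs scale shown for the weekly RecSys cycle in Table 4.
```

Cutting the sampled data size cuts `gpu_hours` roughly proportionally, which is the mechanism behind the reduction reported in the last row of Table 4.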
Despite significantly reducing the runtime and environmental impact of benchmarking algorithms on massive datasets, our analysis relied heavily on experiments training seven recommendation algorithms on six datasets and their various samples. Despite this already large experimental cost, we strongly believe that the downstream performance of Datagenie could be further improved simply by considering more algorithms and more diverse datasets. In addition to better sampling, analyzing the fairness implications of training algorithms on subsampled datasets is an interesting research direction, which we plan to explore in future work.
Acknowledgements.
This work was partly supported by NSF Award #1750063.
References
Scaling up collaborative filtering data sets through randomized fractal expansions. arXiv:1905.09874.
Coresets via bilevel optimization for continual learning and streaming. In Advances in Neural Information Processing Systems, Vol. 33.
On target item sampling in offline recommender system evaluation. In RecSys '20.
XGBoost: a scalable tree boosting system. In KDD '16.
On sampling strategies for neural network-based collaborative filtering. In KDD '17.
Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555.
Selection via proxy: efficient data selection for deep learning. In ICLR.
Are we really making much progress? A worrying analysis of recent neural recommendation approaches. In RecSys '19.
ImageNet: a large-scale hierarchical image database. In CVPR '09.
Understanding the difficulty of training deep feedforward neural networks. In AISTATS.
Chasing carbon: the elusive environmental footprint of computing. In HPCA '21.
The MovieLens datasets: history and context. ACM Transactions on Interactive Intelligent Systems (TiiS).
Neural collaborative filtering. In WWW '17.
SLICE: scalable linear extreme classifiers trained on 100 million labels for related searches. In WSDM '19.
Extreme multi-label loss functions for recommendation, tagging, ranking and other missing label applications. In KDD.
Self-attentive sequential recommendation. In ICDM '18.
Learning from less data: a unified data subset selection and active learning framework for computer vision. In WACV '19.
Auto-Encoding Variational Bayes. In ICLR '14. arXiv:1312.6114.
Semi-supervised classification with graph convolutional networks. In ICLR.
Matrix factorization techniques for recommender systems. Computer 42(8).
Semi-supervised batch active learning via bilevel optimization. In ICASSP '21.
On sampled metrics for item recommendation. In KDD '20.
Learning multiple layers of features from tiny images. Technical report.
Graphs over time: densification laws, shrinking diameters and possible explanations. In KDD '05.
Sampling from large graphs. In KDD '06.
Variational autoencoders for collaborative filtering. In WWW '18.
MLPerf training benchmark. In Proceedings of Machine Learning and Systems.
Learning attitudes and attributes from multi-aspect reviews. In ICDM '12.
Exploring data splitting strategies for the evaluation of recommendation models. In RecSys '20.
Prediction-based decisions and fairness: a catalogue of choices, assumptions, and definitions. arXiv:1811.07867.
ECLARE: extreme classification with label graph correlations. In WWW '21.
Is the sample good enough? Comparing data from Twitter's streaming API with Twitter's firehose. In ICWSM 7(1).
Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In EMNLP-IJCNLP '19.
f-GAN: training generative neural samplers using variational divergence minimization. In NeurIPS.
The PageRank citation ranking: bringing order to the web. Technical report, Stanford InfoLab.
ANF: a fast and scalable tool for data mining in massive graphs. In KDD '02.
Carbon emissions and large neural network training. arXiv:2104.10350.
BPR: Bayesian personalized ranking from implicit feedback. In UAI '09.
Sequential variational autoencoders for collaborative filtering. In WSDM '19.
How useful are reviews for recommendation? A critical review and potential improvements. In SIGIR '20.
Off-policy bandits with deficient support. In KDD '20.
Recommendations as treatments: debiasing learning and evaluation. In ICML '16.
Green AI. arXiv:1907.10597.
Active learning for convolutional neural networks: a core-set approach. In ICLR.
Energy and policy considerations for deep learning in NLP. In ACL '19.
InfoGraph: unsupervised and semi-supervised graph-level representation learning via mutual information maximization. In ICLR.
An empirical study of example forgetting during deep neural network learning. In ICLR.
Attention is all you need. In NeurIPS.
Item recommendation on monotonic behavior chains. In RecSys '18.
Sustainable AI: environmental implications, challenges and opportunities. arXiv:2111.00364.
Data mining methods for recommender systems. In Recommender Systems Handbook.
Graph convolutional neural networks for web-scale recommender systems. In KDD '18.
An end-to-end deep learning architecture for graph classification. In AAAI '18.
Unbiased implicit recommendation and propensity estimation via combinational joint learning. In RecSys '20.