A unified, comprehensive and efficient recommendation library
In recent years, a large number of recommendation algorithms have been proposed in the literature, ranging from traditional collaborative filtering to neural network algorithms. However, concerns about how to standardize the open-source implementation of recommendation algorithms continue to grow in the research community. In light of this challenge, we propose a unified, comprehensive and efficient recommender system library called RecBole, which provides a unified framework to develop and reproduce recommender systems for research purposes. In this library, we implement 53 recommendation models on 27 benchmark datasets, covering the categories of general recommendation, sequential recommendation, context-aware recommendation and knowledge-based recommendation. We implement the RecBole library based on PyTorch, which is one of the most popular deep learning frameworks. Our library is featured in many aspects, including general and extensible data structures, comprehensive benchmark models and datasets, efficient GPU-accelerated execution, and extensive and standard evaluation protocols. We provide a series of auxiliary functions, tools, and scripts to facilitate the use of this library, such as automatic parameter tuning and break-point resume. Such a framework is useful for standardizing the implementation and evaluation of recommender systems. The project and documents are released at https://recbole.io.
In the era of big data, recommender systems play a key role in tackling information overload, and they largely improve the user experience in a variety of applications, ranging from e-commerce and video sharing to healthcare assistance and online education. The huge business value has made recommender systems a longstanding research topic, with a large number of new models proposed each year (Zhang et al., 2019a).
As recommendation algorithms grow rapidly in number, they are usually developed under different platforms or frameworks. Even an experienced researcher often finds it difficult to implement the compared baselines in a unified way or framework. Indeed, many common components or procedures of these recommendation algorithms are duplicate or highly similar, and they should be reused or extended. Besides, we are aware that there is an increasing concern about model reproducibility in the research community. For various reasons, many published recommendation algorithms still lack public implementations. Even with open source code, many details are implemented inconsistently (e.g., with different loss functions or optimization strategies) by different developers. There is a need to reconsider the implementation of recommendation algorithms in a unified way, especially with deep learning.
In order to alleviate the above issues, we initiate a project to provide a unified framework for developing recommendation algorithms. We implement an open source recommender system library, called RecBole (pronounced as [rɛk'boʊlər]). The name comes from Bole, a famous Chinese judge of horses in the Spring and Autumn period, who was the legendary inventor of equine physiognomy (“judging a horse's qualities from appearance”). Bole is frequently associated with the fabled qianlima (a Chinese word), the “thousand-li horse”, which was supposedly able to gallop one thousand li (approximately 400 km) in a single day; more details about Bole can be found at the Wikipedia page https://en.wikipedia.org/wiki/Bo_Le. Here, we make an analogy between identifying qianlima horses and making good recommendations. Based on this library, we would like to enhance the reproducibility of existing models and ease the developing process of new algorithms. Our work is also useful for standardizing the evaluation protocol of recommendation algorithms.
Indeed, a considerable number of recommender system libraries have been released in the past decade (Guo et al., 2015; Gantner et al., 2011; Wu et al., 2017; Wang et al., 2020a; Sun et al., 2020). These works have largely advanced the progress of open source recommender systems, and many libraries have made continuous improvements with increasingly added features. We have extensively surveyed these libraries and broadly fused their merits into RecBole. The key features and capabilities of our RecBole library can be summarized in the following five aspects:
Unified recommendation framework. We adopt PyTorch (Paszke et al., 2019) to develop the entire recommender system library, since it is one of the most popular deep learning frameworks, especially in the research community. As three core components of our library, we design and develop data modules, model modules, and evaluation modules, and encapsulate many common components, functions or procedures shared by different recommendation algorithms. In our library, for reusing existing models, one can easily compare different recommendation algorithms with built-in evaluation protocols via simple yet flexible configuration; for developing new models, one only needs to focus on a small number of interface functions, so that common parts can be reused and implementation details are made transparent to the developers.
General and extensible data structure. For unified algorithm development, we implement the supporting data structures at two levels. At the user level, we introduce atomic files to format the input of mainstream recommendation tasks in a flexible way. The proposed atomic files are able to characterize the input of four kinds of mainstream recommendation tasks. At the algorithm level, we introduce a general data structure Interaction to unify the internal data representations tailored to the GPU-based environment. The design of Interaction makes it particularly convenient to develop new algorithms with supporting mechanisms or functions, e.g., fetching the data by feature name. We implement Dataset and DataLoader (two Python classes) to automate the entire data flow, which greatly reduces the effort of developing new recommendation models.
Comprehensive benchmark models and datasets. So far, we have implemented 53 recommendation algorithms, covering the categories of general recommendation, sequential recommendation, context-aware recommendation and knowledge-based recommendation. Besides traditional recommendation algorithms, we incorporate a large number of neural algorithms proposed in recent years. We provide flexible supporting mechanisms via the configuration files or command lines to run, compare and test these algorithms. We also implement rich auxiliary functions to use these models, including automatic parameter tuning and break-point resume. To construct a reusable benchmark, we incorporate 27 commonly used datasets for evaluating recommender systems. With original dataset copies, a user can simply transform the data into a form that can be used in our library with the provided preprocessing tools or scripts. More datasets and methods will be continually incorporated into our library.
Efficient GPU-accelerated execution. We design and implement a number of efficiency optimization techniques that are tailored to the GPU environment. As the two major sources of time cost, both model training and testing are accelerated with GPU-oriented implementations. For model testing, a special acceleration strategy is proposed to improve the efficiency of full ranking for top-k item recommendation. We convert the top-k evaluation for all the users into a computation based on a unified matrix form. With this matrix form, we can utilize the GPU-version topk() function in PyTorch to directly optimize the top-k finding procedure. Furthermore, such a matrix form is particularly convenient for generating the recommendations and computing the evaluation metrics. We empirically show that it significantly reduces the time cost compared with a straightforward implementation without our acceleration strategy.
Extensive and standard evaluation protocols. Our library supports a series of widely adopted evaluation protocols for testing and comparing recommendation algorithms. It incorporates the various evaluation settings discussed in (Zhao et al., 2020). Specially, we implement different combinations of item sorting (i.e., how to sort the items before data splitting) and data splitting (i.e., how to derive the train/validation/test sets) for deriving the evaluation sets. We also consider both full ranking and sample-based ranking, which is recently a controversial issue in the field of recommender systems (Krichene and Rendle, 2020). We encapsulate four basic interfaces (namely Group, Split, Order and NegSample) to support the above evaluation protocols, which are flexible enough to include other evaluation settings. We provide a few commonly used evaluation settings (e.g., ratio-based splitting plus random ordering for dataset splitting), which integrate the alternative settings of the above four factors. Our library makes it possible to evaluate recommendation models under different evaluation settings.
The overall framework of our library RecBole is presented in Figure 1. The bottom part is the configuration module, which helps users set up the experimental environment (e.g., hyperparameters and running details). The data, model and evaluation modules are built upon the configuration module, and they form the core code of our library. The execution module is responsible for running and evaluating the model based on the specific settings of the environment. All the auxiliary functions are collected in the utility module, including automatic parameter tuning, logger and evaluation metrics. In the following, we briefly present the designs of the three core modules, and more details can be found in the library documents.
A major development guideline of our library is to make the code highly self-contained and unified. For this purpose, data module is indeed the most important part that supports the entire library by providing fundamental data structures and functions.
For extensibility and reusability, our data module designs an elegant data flow that transforms raw data into the model input.
The overall data flow can be described as follows: raw input → atomic files → Dataset → Dataloader → algorithms. The implementation of the class Dataset is mainly based on pandas.DataFrame, the primary data structure of the pandas library, and the implementation of the class Dataloader is based on a general internal data structure implemented by our library, called Interaction.
Our data flow involves two special data forms, which are oriented to users and algorithms, respectively. For data preparation, we introduce and define six atomic file types (having the same or similar file format) to unify the input at the user level. For internal data representations, we introduce and implement a flexible data structure Interaction at the algorithm level. The atomic files are able to characterize most forms of the input data required by different recommendation tasks, and the Interaction data structure provides a unified internal data representation for different recommendation algorithms.
In order to help users transform raw input into atomic files, we have collected more than 27 commonly used datasets and released the corresponding conversion tools, which makes it quite convenient to start with our library. We present the statistics of these datasets in Table 1. During the transformation step from atomic files to the class Dataset, we provide many useful functions that support a series of preprocessing steps in recommender systems, such as k-core data filtering and missing value imputation. We present the functions supported by the class Dataset in Table 2.
Table 1: Statistics of the collected datasets.

|Dataset|#Users|#Items|#Interactions|Reference|
|MovieLens|-|-|-|Harper and Konstan (2016)|
|Epinions|116,260|41,269|188,478|Zhao et al. (2014)|
|Book-Crossing|105,284|340,557|1,149,780|Ziegler et al. (2005)|
|Jester|73,421|101|4,136,360|Goldberg et al. (2001)|
|Yahoo Music|1,948,882|98,211|11,557,943|Yahoo-Research (2020)|
|KDD2010|-|-|-|Stamper et al. (2010)|
|Amazon|-|-|-|He and McAuley (2016)|
|-|55,187|9,911|1,445,622|He et al. (2017b)|
|Gowalla|107,092|1,280,969|6,442,892|Cho et al. (2011)|
|Last.FM|1,892|17,632|92,834|Cantador et al. (2011)|
|Steam|2,567,538|32,135|7,793,069|Kang and McAuley (2018)|
|LFM-1b|120,322|3,123,496|1,088,161,692|Schedl and Ferwerda (2017)|
|iPinYou|19,731,660|163|24,637,657|Liao et al. (2014)|
|Phishing websites|-|-|11,055|Mohammad et al. (2012)|
Table 2: Functions supported by the class Dataset.

|Function|Description|
|_filter_by_inter_num|remove users/items with too many or too few records|
|_filter_by_field_value|filter the data based on the value of some feature|
|_remap_ID|map the features of type token to a new set of integer IDs|
|_fill_nan|missing value imputation|
|_set_label_by_threshold|generate interaction labels according to the feature value|
|_normalize|normalize the features of type float|
|_preload_weight_matrix|initialize embedding tables with some features|
So far, our library introduces six atomic file types, which serve as basic components for characterizing the input of various recommendation tasks. Although a considerable number of recommendation tasks have been studied in the literature, we try to summarize and unify the most basic input forms for mainstream recommendation tasks. Note that these files are only functionally different, while their formats are rather similar. The details of these atomic files are summarized in Table 3.
Table 3: Detailed descriptions of the atomic files.

|Suffix|Data types|Content|Example format|
|.inter|all types|User-item interaction|UserID,ItemID,Rating,Review|
|.user|all types|User feature|UserID,Age|
|.item|all types|Item feature|ItemID,Category|
|.kg|token|Triplets in a knowledge graph|HeadID,TailID,RelationID|
|.link|token|Item-entity linkage data|ItemID,EntityID|
|.net|all types|Social graph data|SourceID,TargetID,Weight|
We identify different files by their suffixes. By summarizing existing recommendation models and datasets, we conclude with four basic data types, i.e., “token” (representing integers or strings), “token sequence”, “float” and “float sequence”. “token” and “token sequence” are used to represent discrete features such as IDs or categories, while “float” and “float sequence” are used to represent continuous features, such as price. Atomic files support sparse feature representations, so that the space taken by the atomic files can be largely reduced. Most atomic files support all four data types, except the .kg and .link files. The example formats of the files are presented in the fourth column of Table 3.
Next, we present the detailed description of each atomic file:
.inter is a mandatory file used in all the recommendation tasks. Each line is composed of the user ID (token), item ID (token), user-item rating (float, optional), timestamp (float, optional) and review text (token sequence, optional). Different fields are separated by commas; an illustrative example is given after the file descriptions below.
.user is a user profile file, which includes the user categorical or continuous features. Each line is formatted as user ID (token), feature (token or float), feature (token or float), …, feature (token or float).
.item is an item feature file, which describes the item characteristics, and the format is as follows: item ID (token), feature (token or float), feature (token or float), …, feature (token or float). .user and .item are used for context-aware recommendation.
.kg is a knowledge graph file used for knowledge-based recommendation. Each line corresponds to a triplet, and the format is as follows: head entity ID (token), tail entity ID (token), relation ID (token).
.link is also used for knowledge-based recommendation. It records the correspondence between recommender system items and knowledge graph entities. The file format is as follows: item ID (token), entity ID (token), which denotes the item-to-entity mapping.
.net is a social network file used for social recommendation. The format is as follows: source user ID (token), target user ID (token), weight (float, optional).
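For illustration, a hypothetical .inter file with the optional rating and timestamp fields could look as follows (the field names and values are made up and only serve as an example of the comma-separated format):

```
UserID,ItemID,Rating,Timestamp
196,242,3.0,881250949
186,302,4.0,891717742
22,377,1.0,878887116
```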
The essence of the atomic files is feature-based data frames corresponding to different parts of the task input. They can cover the input of most mainstream recommendation tasks in the literature. In case the atomic files are not sufficient to support new tasks, one can incrementally introduce new atomic files in a flexible way.
Based on the above atomic files, we can utilize a series of file combinations to facilitate five mainstream recommendation tasks, namely general recommendation, context-aware recommendation, knowledge-based recommendation, sequential recommendation and social recommendation. Currently, we have implemented the supporting mechanisms for the first four kinds of recommendation tasks, while the code for social recommendation is under development.
The correspondence between atomic files and recommendation tasks is presented in Table 4. A major merit of our input files is that the atomic files themselves are not dependent on specific tasks. As we can see, given a dataset, the user can reuse the same .inter file (without any modification of the data files) when switching between different recommendation tasks. Our library reads the configuration file and determines what to do with the data files.
Another note is that Table 4 presents the combination of mandatory atomic files in each task. It is also possible to use additional atomic files besides mandatory files. For example, for sequential recommendation, we may also need to use context features. To support this, one can simply extend the original combination to .inter, .user, .item as needed.
Table 4: Mandatory atomic files for different recommendation tasks.

|Tasks|Mandatory atomic files|
|Context-aware recommendation|.inter, .user, .item|
|Knowledge-based recommendation|.inter, .kg, .link|
|Social recommendation|.inter, .net|
As discussed in Section 2.1.1, in our library, Interaction is the internal data structure that is fed into the recommendation algorithms.
In order to make it unified and flexible, it is implemented as a new abstract data type based on python.dict, which is a key-value indexed data structure. The keys correspond to features from the input, which can be conveniently referenced by feature name when writing recommendation algorithms; the values correspond to tensors (implemented by torch.Tensor), which will be used for the update and computation in learning algorithms. Specially, the value entry for a specific key stores all the corresponding tensor data in a batch or mini-batch.
With such a data structure, our library provides a friendly interface to write recommendation algorithms in a batch-based mode. For example, we can read all the user embeddings and item embeddings from an instantiated Interaction object with a simple key lookup by feature name, as sketched below.
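The following minimal sketch (the embedding tables and variable names are illustrative, not part of a specific shipped model) shows how a batch of features can be fetched from an Interaction object inside a model:

```python
# Inside a model method that receives an Interaction object for one batch:
user = interaction[self.USER_ID]      # tensor of user IDs in the batch
item = interaction[self.ITEM_ID]      # tensor of item IDs in the batch
user_e = self.user_embedding(user)    # look up the corresponding user embeddings
item_e = self.item_embedding(item)    # look up the corresponding item embeddings
```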
All the details of the transformation from raw input to internal data representations are transparent to the developers. They can implement different algorithms easily based on the unified internal data representation Interaction. Besides, the value components are implemented based on torch.Tensor. We wrap many functions of PyTorch to develop a GPU-oriented data structure, which supports batch-based mechanisms (e.g., copying a batch of data to the GPU). Specially, we summarize the important functions that Interaction supports in Table 5.
Table 5: Important functions supported by Interaction.

|Function|Description|
|to(device)|transfer all tensors to torch.device|
|cpu|transfer all tensors to the CPU|
|numpy|transfer all tensors to numpy.ndarray|
|repeat|repeat each tensor along the batch_size dimension|
|repeat_interleave|repeat elements of each tensor, similar to torch.repeat_interleave|
|update|update this object with another Interaction, similar to dict.update|
Table 6: The recommendation models implemented in our library.

|Category|Model|Reference|
|General recommendation|ItemKNN|Deshpande and Karypis (2004)|
||BPR|Rendle et al. (2009)|
||NeuMF|He et al. (2017b)|
||DMF|Xue et al. (2017)|
||NAIS|He et al. (2018b)|
||NGCF|Wang et al. (2019e)|
||GCMC|van den Berg et al. (2017)|
||LightGCN|He et al. (2020)|
||DGCF|Wang et al. (2020b)|
||ConvNCF|He et al. (2018a)|
||FISM|Kabbur et al. (2013)|
||SpectralCF|Zheng et al. (2018)|
|Context-aware recommendation|DIN|Zhou et al. (2018)|
||DSSM|Huang et al. (2013)|
||DeepFM|Guo et al. (2017)|
||xDeepFM|Lian et al. (2018)|
||Wide&Deep|Cheng et al. (2016)|
||NFM|He and Chua (2017)|
||AFM|Xiao et al. (2017)|
||AutoInt|Song et al. (2019)|
||DCN|Wang et al. (2017)|
||FNN(DNN)|Zhang et al. (2016b)|
||PNN|Qu et al. (2016)|
||FFM|Juan et al. (2016)|
||FwFM|Pan et al. (2018)|
|Sequential recommendation|Improved GRU-Rec|Tan et al. (2016)|
||SASRec|Kang and McAuley (2018)|
||NARM|Li et al. (2017)|
||FPMC|Rendle et al. (2010)|
||STAMP|Liu et al. (2018)|
||Caser|Tang and Wang (2018)|
||NextItNet|Yuan et al. (2019)|
||TransRec|He et al. (2017a)|
||S3Rec|Zhou et al. (2020)|
||GRU4RecF (+feature embedding)|Hidasi et al. (2016)|
||SASRecF (+feature embedding)|-|
||BERT4Rec|Sun et al. (2019)|
||FDSA|Zhang et al. (2019b)|
||SRGNN|Wu et al. (2019)|
||GCSAN|Xu et al. (2019)|
||GRU + KG Embedding|-|
||KSR|Huang et al. (2018)|
|Knowledge-based recommendation|CKE|Zhang et al. (2016a)|
||KTUP|Cao et al. (2019)|
||RippleNet|Wang et al. (2018)|
||KGAT|Wang et al. (2019d)|
||KGNN-LS|Wang et al. (2019a)|
||KGCN|Wang et al. (2019c)|
||MKR|Wang et al. (2019b)|
||CFKG|Ai et al. (2018)|
Based on the data module, we organize the implementations of recommendation algorithms in a separate model module.
By setting up the model module, we can largely decouple the algorithm implementation from the other components, which is particularly important for the collaborative development of this library. To implement a new model within the four tasks in Table 6, one only needs to follow the required interfaces to connect with the input and evaluation modules, while the details of the other parts can be ignored.
Specifically, we utilize the interface function calculate_loss for training and the interface function predict for testing. To implement a model, a user only needs to implement these interface functions, without considering other details. These interface functions are general to various recommendation algorithms, so that we can implement various algorithms in a highly unified way. Such a design mode enables the quick development of new algorithms.
Besides, our model module further encapsulates many important model implementation details, such as the learning strategy. For code reuse, we implement several commonly used loss functions (e.g., BPR loss, margin-based loss, and regularization-based loss), neural components (e.g., MLP, multi-head attention, and graph neural network) and initialization methods (e.g., Xavier’s normal and uniform initialization) as individual components, which can be directly used when building complex models or algorithms.
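As a brief sketch of reusing these shipped components when assembling a new model (the layer sizes and dropout value below are illustrative), one might write:

```python
from recbole.model.layers import MLPLayers
from recbole.model.loss import BPRLoss

# Reuse the provided MLP component and pairwise BPR loss in a custom model.
mlp = MLPLayers([64, 32, 16], dropout=0.2)  # a small multi-layer perceptron
loss_fn = BPRLoss()                         # pairwise Bayesian personalized ranking loss
```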
As the first release version, we have implemented 53 recommendation models in the four categories of general recommendation, sequential recommendation, context-aware recommendation and knowledge-based recommendation. We summarize all the implemented models in Table 6.
We have carefully surveyed the recent literature and selected the commonly used recommendation models and their associated variants (which may not receive high citations) for our library. As we can see from Table 6, we mainly focus on recently proposed neural methods, while also keeping some classic traditional methods such as ItemKNN and FM. In the future, more methods will be incorporated in regular updates.
For all the implemented models, we have tested their performance on two or four selected datasets, and invited a code reviewer to examine the correctness of the implementation.
In order to better use the models in our library, we also implement a series of useful functions.
A particularly useful function is automatic parameter tuning. The user is allowed to provide a parameter set for searching an optimal value leading to the best performance.
Given a set of parameter values, we can indicate four types of tuning methods, i.e., “Grid Search”, “Random Search”, “Tree of Parzen Estimators (TPE)” and “Adaptive TPE”. The tuning procedure is implemented based on the hyperopt library (Bergstra et al., 2013).
Besides, we add the functions of model saving and loading to store and reuse the learned models, respectively. Our library also supports resuming model learning from a previously stored break point. In the training process, one can print and monitor the change of the loss value and apply training tricks such as early stopping. These small features largely improve the usage experience of our library.
The function of evaluation module is to implement commonly used evaluation protocols for recommender systems. Since different models can be compared under the same evaluation module, our library is useful to standardize the evaluation of recommender systems.
Our library supports both value-based and ranking-based evaluation metrics. The value-based metrics (for rating prediction) include Root Mean Square Error (RMSE) and Mean Absolute Error (MAE), measuring the difference between the true and predicted values. The ranking-based metrics (for top-k item recommendation) include the most widely used ranking-aware metrics, such as Recall, Precision, NDCG, and MRR, measuring the ranking quality of the recommendation lists generated by an algorithm.
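For reference, the two value-based metrics follow their standard definitions over a test set $\mathcal{T}$ of user-item pairs:

$$\mathrm{RMSE} = \sqrt{\frac{1}{|\mathcal{T}|} \sum_{(u,i) \in \mathcal{T}} \left(r_{ui} - \hat{r}_{ui}\right)^2}, \qquad \mathrm{MAE} = \frac{1}{|\mathcal{T}|} \sum_{(u,i) \in \mathcal{T}} \left|r_{ui} - \hat{r}_{ui}\right|,$$

where $r_{ui}$ and $\hat{r}_{ui}$ denote the true and predicted ratings of user $u$ on item $i$.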
In recent years, there have been more and more concerns about the appropriate evaluation of recommender systems (Krichene and Rendle, 2020; Zhao et al., 2020). Generally speaking, the divergence mainly lies in the ranking-based evaluation for top-k item recommendation. Note that the focus of our library is not to identify the most suitable evaluation protocols. Instead, we aim to provide most of the widely adopted evaluation protocols (even the controversial ones) in the literature. Our library makes it possible to compare the performance of various recommendation models under different evaluation protocols.
For top-k item recommendation, the implemented evaluation settings cover the various settings of our earlier work (Zhao et al., 2020), where we studied the influence of different evaluation protocols on the performance comparison of models. In particular, we mainly consider the combinations of item sorting (i.e., how to sort the items before data splitting) and data splitting (i.e., how to derive the train/validation/test sets) for constructing evaluation sets. We also consider both full ranking and sample-based ranking, which is recently a controversial issue in the field of recommender systems (Krichene and Rendle, 2020). We summarize the evaluation settings supported by our library in Table 7.
Table 7: The evaluation settings supported by our library.

|Notation|Description|
|RO_RS|Random Ordering + Ratio-based Splitting|
|TO_LS|Temporal Ordering + Leave-one-out Splitting|
|RO_LS|Random Ordering + Leave-one-out Splitting|
|TO_RS|Temporal Ordering + Ratio-based Splitting|
|full|full ranking with all item candidates|
|uni|sample-based ranking: each positive item is paired with sampled negative items|
In order to facilitate various evaluation settings, we encapsulate the related functions into four major parts, namely Group, Split, Order and NegSample. With these implementations, we can effectively support different evaluation protocols, which is also an appealing feature of our library.
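As a hypothetical illustration (the configuration keys and values below are illustrative and should be checked against the library documentation), an evaluation setting could be selected through a parameter dictionary such as:

```python
# Hypothetical configuration fragment selecting an evaluation protocol;
# key names and values are illustrative and may differ across versions.
config_dict = {
    'eval_setting': 'RO_RS,full',       # random ordering + ratio-based splitting, full ranking
    'split_ratio': [0.8, 0.1, 0.1],     # train / validation / test
    'metrics': ['Recall', 'NDCG', 'MRR'],
    'topk': 10,
}
```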
For top-k item recommendation, a large source of time cost is predicting the top items for recommendation during the test stage. The recommendation model needs to iteratively go through each user and then select the k most probable items for that user. Since the methods for computing the ranking score of each item differ across models, it is not easy to optimize the entire evaluation procedure in a general way. Therefore, we mainly focus on the step of selecting and generating the top-k items after the model has assigned each item a confidence score.
A problem is that different users have a varying number of ground-truth items in the test set (resulting in different-sized user-by-item matrices), which is not suitable for unified parallel computation on GPUs. Our solution is to consider all the items, including the items in the training set (called training items). Given the users and the entire item set under consideration, we can obtain a user-by-item matrix consisting of the confidence scores from a model over the entire item set. However, this matrix cannot be directly used for top-k finding, since it contains the scores of the training items, which should not be considered for recommendation during testing. Our solution is to set the scores of training items to negative infinity, and to perform the full ranking over the entire item set without removing training items. In this way, all the users correspond to equal-sized score vectors over the entire item set for subsequent computation. This step is called filling.
In a batch, we sample a certain number of users for computation, each with her/his scores over the entire item set. We re-organize the items so that the ground-truth items in test set (blue boxes in Figure 2) are moved to the head of item lists, which is a key point for our acceleration. Unlike the sorting operation, item re-organization can be finished with a much smaller time cost. This step is called re-organizing.
Then, we utilize the GPU-version topk() function provided by PyTorch to find the top-k items with the highest scores for the users. The GPU-version topk() function has been specially optimized based on CUDA, which is very efficient in our case. This step is called topk-finding.
With the topk() function, we can obtain a result matrix with one row per user and k columns (concatenating the results of all the batches), which records the original indices (just after re-organization) of the top-k selected items. We further generate a vector, with one entry per user, consisting of the number of test items for each user, and produce the final recommendation result matrix based on the index matrix and this vector. Specially, we apply the broadcasting mechanism in NumPy and perform a logical comparison between the index matrix and the vector, which means that each column of the index matrix is compared with the vector. The comparison results form a binary matrix, indicating whether the corresponding entry is a correct recommendation or not. The rationale is as follows: since all the ground-truth items in the test set were placed at the head of the list, only the items with small index numbers are likely to be correct recommendations. Given a user, the original index of the last ground-truth item in the test set is the number of test items minus one (assuming we number the indices from zero). This step is called broadcasting-comparison.
The generated result matrix consists of zeros and ones, which are particularly convenient for computing evaluation metrics. As will be shown next, such an acceleration strategy is able to improve the efficiency for full-ranking item recommendation.
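The following self-contained PyTorch sketch illustrates the described strategy on toy data (variable names are illustrative, and this is not RecBole's internal code). It covers the filling, topk-finding and broadcasting-comparison steps; the re-organization step is assumed to have already placed each user's test items at the head of the item list:

```python
import torch

n_users, n_items, k = 4, 10, 3
scores = torch.rand(n_users, n_items)                     # confidence scores from a model
train_mask = torch.zeros(n_users, n_items, dtype=torch.bool)
train_mask[:, 5:] = True                                  # pretend the last columns are training items
n_test_items = torch.tensor([2, 1, 3, 2])                 # number of test items per user

scores = scores.masked_fill(train_mask, -float('inf'))    # "filling": mask training items
_, topk_idx = torch.topk(scores, k, dim=1)                # "topk-finding" on GPU/CPU
# "broadcasting-comparison": after re-organization, an index smaller than the
# user's number of test items marks a correct recommendation.
hits = (topk_idx < n_test_items.unsqueeze(1)).int()       # binary (n_users, k) result matrix
```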
In this part, we empirically analyze the efficiency improvement yielded by our acceleration strategy.
Specifically, the classic BPR model (Rendle et al., 2009) is selected for efficiency analysis, since it is one of the most commonly used baselines for top-k recommendation. Besides, its model architecture is rather simple, without the influence of other factors, which makes it suitable for efficiency analysis. We compare its performance with and without the acceleration strategy in our implementation. We measure the performance by the total time needed to (1) generate a recommendation list of the top ten items for every user and (2) compute the metrics (NDCG@10 and Recall@10) over these recommendation lists. To further analyze the efficiency on datasets of varying sizes, we use three MovieLens datasets (https://grouplens.org/datasets/movielens/), i.e., MovieLens-100K, MovieLens-1M, and MovieLens-10M, to conduct the experiments. We split each original dataset into train, validation and test sets with a pre-set ratio. We only count the time for generating the top ten recommendations (with full ranking) on the test set. For a stable comparison, we average the time over ten runs for each of the two implementations. Our experiments are performed on a Windows PC with a CPU (AMD 3900X, 12 cores, 24 threads, 3.8GHz) and a GPU (Nvidia RTX 2070 Super 8G).
The results of the efficiency comparison are shown in Table 8. From the results we can see that applying the acceleration strategy significantly speeds up the evaluation process. In particular, on the largest dataset, MovieLens-10M, the accelerated implementation can perform the full ranking within about one second, which indicates that our implementation is efficient. Currently, we only compare the overall time with all the acceleration techniques enabled. As future work, we will analyze the contribution of each specific technique in detail.
In this section, we show how to use our library with code examples. We detail the usage description in two parts, namely running existing models in our library and implementing new models based on the interfaces provided in our library.
The models contained in our library can be run with either fixed parameters or auto-tuned parameters.
Figure 3 presents the general procedure for running existing models in our library. To begin with, one needs to download and format the raw public dataset based on our provided utils. The running procedure relies on the experimental configuration, which can be obtained from configuration files, the command line, or parameter dictionaries. The dataset and model are prepared according to the configured parameters and settings, and the execution module is responsible for training and evaluating the models.
The detailed steps are given as follows:
Unified dataset formatting process. The user first selects a dataset for use. Then the dataset is formatted based on the scripts provided in our library. The scripts generate the required atomic files for different datasets, which are used as the input. Until now, we have collected nearly 27 commonly used datasets and released their preprocessing scripts, and more datasets will be continually incorporated. This procedure is carried out with the released conversion scripts.
Flexible configuration generation methods. In our library, the experiment configurations can be generated in different ways. One can write a configuration file, and then read this file in the main function as follows:
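A minimal sketch following RecBole's configuration interface is given below (the file name test.yaml is illustrative):

```python
from recbole.config import Config

# Build the configuration from a model name, a dataset name and a YAML file.
config = Config(model='BPR', dataset='ml-100k',
                config_file_list=['test.yaml'])
```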
Another way to provide the configuration is to include parameters in the command line, which is useful for a few specially focused parameters. Finally, one can also directly write the parameter dictionaries in the code.
Dataset filtering and splitting. Based on the configuration, we provide auxiliary functions to filter and split the dataset. The user can filter the dataset by keeping only the users/items with at least a given number of interactions, removing the data that occurred in some fixed time period, and so on. Different filtering methods can be applied through a unified function:
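A sketch using the library's data API (the function name follows the documented quick-start interface):

```python
from recbole.data import create_dataset

# Build the Dataset object; the configured filtering steps (e.g., k-core
# filtering) are applied during this call.
dataset = create_dataset(config)
```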
When splitting the dataset, one can indicate the ratio-based method or the leave-one-out method. Then the user can use the following function to generate the training, validation and testing sets:
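A sketch, again following the documented quick-start interface:

```python
from recbole.data import data_preparation

# Split the Dataset according to the configured evaluation setting and wrap
# each split into a DataLoader.
train_data, valid_data, test_data = data_preparation(config, dataset)
```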
Model loading. Given a target model, the user can adopt the following function to obtain a model instance, where the hyper-parameters are set according to the configuration files:
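A sketch of model instantiation (mirroring the library's quick-start code):

```python
from recbole.utils import get_model

# Instantiate the configured model class and move it to the configured device;
# the hyper-parameters are read from the configuration.
model = get_model(config['model'])(config, train_data.dataset).to(config['device'])
```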
Model training and evaluation. Once the dataset and model are prepared, the user can train and evaluate the model based on the following functions:
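A sketch using the shipped Trainer:

```python
from recbole.trainer import Trainer

trainer = Trainer(config, model)
# Train with validation-based model selection, then evaluate on the test set.
best_valid_score, best_valid_result = trainer.fit(train_data, valid_data)
test_result = trainer.evaluate(test_data)
```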
Our library features the capability of automatic parameter (or hyper-parameter) tuning. One can readily optimize a given model over the provided hyper-parameter ranges. The general steps are given as follows:
Setting the parameter range. The user is allowed to provide candidate parameter values in the file “hyper.test”. In this file, each line specifies one parameter, for example:
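An illustrative fragment of such a file (the parameter names, distribution keywords and candidate values below are examples and should be checked against the library documentation):

```
learning_rate loguniform -8,0
embedding_size choice [64,96,128]
train_batch_size choice [512,1024,2048]
```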
Instead of a fixed value, the users can empirically indicate a value set, which will be explored in the following tuning steps.
Setting the tuning method. Our parameter tuning function is implemented based on the hyperopt library. Given a set of parameter values, we can indicate four types of tuning methods, i.e., “Grid Search”, “Random Search”, “Tree of Parzen Estimators (TPE)” and “Adaptive TPE”. The tuning method is passed to the program by the following code:
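A sketch using the library's HyperTuning helper (the algo value and file names are illustrative; the accepted options should be checked against the documentation):

```python
from recbole.trainer import HyperTuning
from recbole.quick_start import objective_function

# Wrap the training/evaluation objective with hyperopt-based tuning;
# 'exhaustive' corresponds to a grid-search-style exploration.
hp = HyperTuning(objective_function, algo='exhaustive',
                 params_file='hyper.test',
                 fixed_config_file_list=['test.yaml'])
```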
where the parameter range file is used to indicate parameter values as mentioned above.
Starting the tuning process. The user can start the tuning process with the following code:
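A sketch (attribute names may differ slightly across versions):

```python
# Launch the search over the specified parameter space and report the best setting.
hp.run()
print('best params: ', hp.best_params)
```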
With the set tuning range and method, our library will run the model iteratively, and finally output and save the optimal parameters and the corresponding model performance.
Based on our library, it is convenient to implement a new model with the provided interfaces. The user only needs to implement three mandatory functions. A typical implementation process is as follows:
Implementing the “__init__()” function. In this function, the user performs parameter initialization, global variable definition and so on. The new model should be a sub-class of the abstract model class provided in our library. Until now, we have implemented the abstract classes for general recommendation, knowledge-based recommendation, sequential recommendation and context-aware recommendation.
Implementing the “calculate_loss()” function. This function calculates the loss to be optimized by the new model. Based on the return value of this function, the library will automatically invoke different optimization methods to learn the model according to the pre-set configurations.
Implementing the “predict()” function. This function is used to predict a score from the input data (e.g., the rating given a user-item pair). It can be used to compute the loss or to derive the item ranking during the model testing phase.
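Putting the three functions together, a minimal sketch of a new general recommender might look as follows (this toy matrix-factorization model is illustrative and is not a model shipped with the library):

```python
import torch
from recbole.model.abstract_recommender import GeneralRecommender
from recbole.model.loss import BPRLoss
from recbole.utils import InputType


class MyMF(GeneralRecommender):
    input_type = InputType.PAIRWISE  # trained on (user, positive item, negative item)

    def __init__(self, config, dataset):
        super().__init__(config, dataset)
        dim = config['embedding_size']
        self.user_embedding = torch.nn.Embedding(self.n_users, dim)
        self.item_embedding = torch.nn.Embedding(self.n_items, dim)
        self.loss = BPRLoss()

    def calculate_loss(self, interaction):
        # Fetch the batch fields by feature name from the Interaction object.
        user_e = self.user_embedding(interaction[self.USER_ID])
        pos_e = self.item_embedding(interaction[self.ITEM_ID])
        neg_e = self.item_embedding(interaction[self.NEG_ITEM_ID])
        pos_score = (user_e * pos_e).sum(dim=-1)
        neg_score = (user_e * neg_e).sum(dim=-1)
        return self.loss(pos_score, neg_score)

    def predict(self, interaction):
        # Score the given user-item pairs for evaluation.
        user_e = self.user_embedding(interaction[self.USER_ID])
        item_e = self.item_embedding(interaction[self.ITEM_ID])
        return (user_e * item_e).sum(dim=-1)
```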
In recent years, a considerable number of open source recommender system libraries have been released for research purposes. We summarize the characteristics of existing recommender system libraries in Table 9, where we report the number of benchmark models and datasets in each library.
From Table 9, we can see a programming language evolution from C/C++/Java to Python/TensorFlow/PyTorch. Besides, there is an increasing trend towards deep learning based recommender system libraries. We select PyTorch as the deep learning framework for development, since it has become one of the most popular deep learning frameworks in the research community, with many appealing usage features.
In addition to reproducing existing models, we aim to ease the developing process of new algorithms. We design general and extensible underlying data structures to support the unified development framework. By providing a series of useful tools, functions and scripts (e.g., automatic parameter tuning), our library is particularly convenient for scientific research.
Currently, we have included a considerable number of benchmark datasets and models, and in the future we will continually add more datasets and models.
Table 9: Comparison of existing open source recommender system libraries.

|Library|Language/Framework|#Models|#Datasets|DL|PT|
|LibFM (Rendle (2012))|C++|1|-|No|manual|
|MyMediaLite (Gantner et al. (2011))|C#|44|5|No|manual|
|LibRec (Guo et al. (2015))|Java|70+|11|No|manual|
|RankSys (Castells et al. (2015))|Java|8|-|No|manual|
|Surprise (Hug (2020))|Python|9|3|No|manual|
|Crab (Caraciolo et al. (2011))|Python|3|4|No|manual|
|LightFM (Kula (2015))|Python|1|2|No|manual|
|Case Recommender (Costa et al. (2018))|Python|26|-|No|manual|
|NeuRec (Wu et al. (2017))|TensorFlow|31|3|Yes|manual|
|Recommenders (Argyriou et al. (2020))|TensorFlow|27|5|Yes|automatic|
|Cornac (Salah et al. (2020))|TensorFlow|39|14|Yes|automatic|
|Spotlight (Kula and Maciej (2017))|PyTorch|3|4|Yes|automatic|
|ReChorus (Wang et al. (2020a))|PyTorch|10|24|Yes|manual|
|Beta-RecSys (Meng et al. (2020))|PyTorch|6|16|Yes|manual|
|DaisyRec (Sun et al. (2020))|PyTorch|19|14|Yes|manual|

DL denotes whether the library is built on a deep learning framework; PT denotes parameter tuning. Statistics were collected before October 1, 2020 and may not reflect the up-to-date status.
In this paper, we have released a new recommender system library called RecBole. So far, we have implemented 53 recommendation algorithms on 27 commonly used datasets. We design general and extensible data structures to offer a unified development framework for new recommendation algorithms. We also support extensive and standard evaluation protocols to compare and test different recommendation algorithms. Besides, our library is implemented in a GPU-accelerated way, involving a series of optimization techniques for achieving efficient execution.
The RecBole library is expected to improve the reproducibility of recommendation models, ease the developing process of new algorithms, and set up a benchmark framework for the field of recommender systems. In the future, we will make continuous efforts to add more datasets and models. We will also consider adding more utilities to facilitate the usage of our library, such as result visualization and algorithm debugging.
We thank Gaole He, Shuqing Bian, Jingsen Zhang and Kaizhou Zhang for contributing model implementations to this library. We thank Chen Yang, Chenzhan Shang, Zheng Gong and Zhen Zhang for their contribution to collect the data and test the models in our library. We thank Shuqing Bian, Gaole He, Zihan Song and Ze Zhang for verifying the correctness of the model implementation. We also thank Yaqing Dai for developing the homepage for our project. This work was partially supported by the National Natural Science Foundation of China under Grant No. 61872369, 61802029 and 61972155, Beijing Outstanding Young Scientist Program under Grant No. BJJWZYJH012019100020098, and Beijing Academy of Artificial Intelligence (BAAI).
J. Bergstra, D. Yamins, and D. D. Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, volume 28 of JMLR Workshop and Conference Proceedings, pages 115–123. JMLR.org, 2013.
B. Hidasi, M. Quadrana, A. Karatzoglou, and D. Tikk. Parallel recurrent neural network architectures for feature-rich session-based recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, Boston, MA, USA, September 15-19, 2016, pages 241–248, 2016.
S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme. Factorizing personalized Markov chains for next-basket recommendation. In Proceedings of the 19th International Conference on World Wide Web, WWW 2010, Raleigh, North Carolina, USA, April 26-30, 2010, pages 811–820, 2010.
K. Zhou, H. Wang, W. X. Zhao, Y. Zhu, S. Wang, F. Zhang, Z. Wang, and J. Wen. S3-Rec: Self-supervised learning for sequential recommendation with mutual information maximization. In CIKM '20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19-23, 2020, pages 1893–1902. ACM, 2020.