1 Introduction
Personalization and recommendation systems are currently deployed for a variety of tasks at large internet companies, including ad clickthrough rate (CTR) prediction and rankings. Although these methods have had long histories, these approaches have only recently embraced neural networks. Two primary perspectives contributed towards the architectural design of deep learning models for personalization and recommendation.
The first comes from the view of recommendation systems. These systems initially employed content filtering where a set of experts classified products into categories, while users selected their preferred categories and were matched based on their preferences mgp. The field subsequently evolved to use collaborative filtering, where recommendations are based on past user behaviors, such as prior ratings given to products. Neighborhood methods ning2015 that provide recommendations by grouping users and products together and latent factor methods that characterize users and products by certain implicit factors via matrix factorization techniques frolov2017tensor ; koren2009matrix were later deployed with success.
The second view comes from predictive analytics, which relies on statistical models to classify or predict the probability of events based on the given data Devroye1996. Predictive models shifted from using simple models such as linear and logistic regression walker1967 to models that incorporate deep networks. In order to process categorical data, these models adopted the use of embeddings, which transform the one- and multi-hot vectors into dense representations in an abstract space naumov2019. This abstract space may be interpreted as the space of the latent factors found by recommendation systems.
In this paper, we introduce a personalization model that was conceived by the union of the two perspectives described above. The model uses embeddings to process sparse features that represent categorical data and a multilayer perceptron (MLP) to process dense features, then interacts these features explicitly using the statistical techniques proposed in rendle2010. Finally, it finds the event probability by postprocessing the interactions with another MLP. We refer to this model as a deep learning recommendation model (DLRM); see Fig. 1. A PyTorch and Caffe2 implementation of this model will be released for testing and experimentation with the publication of this manuscript.
2 Model Design and Architecture
In this section, we will describe the design of DLRM. We will begin with the high-level components of the network and explain how and why they have been assembled together in a particular way, with implications for future model design, then characterize the low-level operators and primitives that make up the model, with implications for future hardware and system design.
2.1 Components of DLRM
The high-level components of the DLRM can be more easily understood by reviewing early models. We will avoid the full scientific literature review and focus instead on the four techniques used in early models that can be interpreted as salient high-level components of the DLRM.
2.1.1 Embeddings
In order to handle categorical data, embeddings map each category to a dense representation in an abstract space. In particular, each embedding lookup may be interpreted as using a one-hot vector $e_i$ (with the $i$-th position being 1 while others are 0, where index $i$ corresponds to the $i$-th category) to obtain the corresponding row vector of the embedding table $W \in \mathbb{R}^{m \times d}$ as follows
$$w_i^T = e_i^T W. \tag{1}$$
In more complex scenarios, an embedding can also represent a weighted combination of multiple items, with a multi-hot vector of weights $a^T = [0, \dots, a_{i_1}, \dots, a_{i_k}, \dots, 0]$, with elements $a_i \neq 0$ for $i = i_1, \dots, i_k$ and 0 everywhere else, where $i_1, \dots, i_k$ index the corresponding items. Note that a minibatch of $t$ embedding lookups can hence be written as
$$S = A^T W \tag{2}$$
where sparse matrix $A = [a_1, \dots, a_t]$ naumov2019.
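The lookups described above can be checked numerically. The following is a minimal NumPy sketch, assuming the embedding table is stored as a dense array with one row per category; the sizes, indices, and weights are illustrative choices and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 6, 4                      # hypothetical table: m categories, d-dimensional vectors
W = rng.standard_normal((m, d))  # embedding table, one row per category

# One-hot lookup: multiplying by a one-hot vector selects a single row of W.
i = 2
e = np.zeros(m)
e[i] = 1.0
assert np.allclose(e @ W, W[i])

# Multi-hot lookup: nonzero weights produce a weighted sum of the selected rows.
a = np.zeros(m)
a[[1, 4]] = [0.5, 2.0]
pooled = a @ W
assert np.allclose(pooled, 0.5 * W[1] + 2.0 * W[4])

# Minibatch of t lookups: stack the vectors as columns of A and compute A^T W.
A = np.stack([e, a], axis=1)     # shape (m, t) with t = 2
S = A.T @ W                      # shape (t, d), one pooled vector per lookup
```

In practice the matrix of lookups is sparse, so frameworks gather rows by index rather than forming the dense product shown here.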
DLRMs will utilize embedding tables for mapping categorical features to dense representations. However, even after these embeddings are meaningfully devised, how are they to be exploited to produce accurate predictions? To answer this, we return to latent factor methods.
2.1.2 Matrix Factorization
Recall that in the typical formulation of the recommendation problem, we are given a set $S$ of users that have rated some products. We would like to represent the $i$-th product by a vector $w_i \in \mathbb{R}^d$ for $i = 1, \dots, n$ and the $j$-th user by a vector $v_j \in \mathbb{R}^d$ for $j = 1, \dots, m$ to find all the ratings, where $n$ and $m$ denote the total number of products and users, respectively. More rigorously, the set $S$ consists of tuples $(i, j)$ indexing when the $i$-th product has been rated by the $j$-th user.
The matrix factorization approach solves this problem by minimizing
$$\sum_{(i,j) \in S} \left( r_{ij} - w_i^T v_j \right)^2 \tag{3}$$
where $r_{ij} \in \mathbb{R}$ is the rating of the $i$-th product by the $j$-th user for $i = 1, \dots, n$ and $j = 1, \dots, m$. Then, letting $W^T = [w_1, \dots, w_n]$ and $V^T = [v_1, \dots, v_m]$, we may approximate the full matrix of ratings $R = [r_{ij}]$ as the matrix product $R \approx W V^T$. Note that $W$ and $V$ may be interpreted as two embedding tables, where each row represents a user/product in a latent factor space koren2009matrix. (This problem is different from low-rank approximation, which can be solved by SVD golub1996, because not all entries of the matrix $R$ are known.) The dot product of these embedding vectors yields a meaningful prediction of the subsequent rating, a key observation to the design of factorization machines and DLRM.
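To make the factorization concrete, the sketch below fits product and user vectors to a handful of observed ratings with plain stochastic gradient descent. It assumes a squared-error objective over the observed entries only; the ratings, sizes, learning rate, and iteration count are hypothetical, and no regularization is applied.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 5, 4, 2                          # products, users, latent dimension (hypothetical)
W = 0.1 * rng.standard_normal((n, d))      # product factors, one row per product
V = 0.1 * rng.standard_normal((m, d))      # user factors, one row per user

# Observed ratings keyed by (product i, user j); every other pair is unknown.
ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 1): 4.0,
           (2, 2): 1.0, (3, 3): 2.0, (4, 0): 4.0}

def loss(W, V):
    """Sum of squared errors over the observed entries only."""
    return sum((r - W[i] @ V[j]) ** 2 for (i, j), r in ratings.items())

initial = loss(W, V)
lr = 0.02                                  # assumed step size
for _ in range(1000):
    for (i, j), r in ratings.items():
        err = r - W[i] @ V[j]
        w_old = W[i].copy()
        W[i] += lr * err * V[j]            # descent step for the product vector
        V[j] += lr * err * w_old           # descent step for the user vector
final = loss(W, V)

R_hat = W @ V.T                            # predictions for all pairs, including unrated ones
```

The unobserved entries of `R_hat` are the model's predicted ratings, which is what distinguishes this setting from a low-rank approximation of a fully known matrix.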
2.1.3 Factorization Machine
In classification problems, we want to define a prediction function $\phi : \mathbb{R}^n \to T$ from an input datapoint $x \in \mathbb{R}^n$ to a target label $y \in T$. As an example, we can predict the click-through rate by defining $T = \{+1, -1\}$ with $+1$ denoting the presence of a click and $-1$ as the absence of a click.
Factorization machines (FM) incorporate second-order interactions into a linear model with categorical data by defining a model of the form
$$\hat{y} = b + w^T x + x^T \, \mathrm{upper}(V V^T) \, x \tag{4}$$
where $V \in \mathbb{R}^{n \times d}$, $w \in \mathbb{R}^n$, and $b \in \mathbb{R}$ are the parameters with $d \ll n$, and upper selects the strictly upper triangular part of the matrix rendle2010.
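The second-order term of a factorization machine can be evaluated without forming the full interaction matrix. The sketch below compares a direct evaluation using the strictly upper triangular part of the factor Gram matrix with the well-known reformulation from rendle2010 that runs in time linear in the number of features. All sizes and values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 3                        # hypothetical: n input features, d latent factors
x = rng.standard_normal(n)         # input datapoint
V = rng.standard_normal((n, d))    # one latent factor vector per feature
w = rng.standard_normal(n)         # linear weights
b = 0.5                            # bias

# Direct evaluation: keep only pairs i < j of the Gram matrix V V^T.
upper = np.triu(V @ V.T, k=1)
y_direct = b + w @ x + x @ upper @ x

# Linear-time evaluation:
# sum_{i<j} <v_i, v_j> x_i x_j
#   = 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ]
s = V.T @ x
pairwise = 0.5 * (s @ s - ((V ** 2).T @ (x ** 2)).sum())
y_linear = b + w @ x + pairwise
```

Both expressions give the same prediction; the second avoids the quadratic number of pairwise products.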
FMs are notably distinct from support vector machines (SVMs) with polynomial kernels Corinna1995 because they factorize the second-order interaction matrix into its latent factors (or embedding vectors) as in matrix factorization, which more effectively handles sparse data. This significantly reduces the complexity of the second-order interactions by only capturing interactions between pairs of distinct embedding vectors, yielding linear computational complexity.
2.1.4 Multilayer Perceptrons
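Simultaneously, much recent success in machine learning has been due to the rise of deep learning. The most fundamental model of these is the multilayer perceptron (MLP), a prediction function composed of an interleaving sequence of fully connected (FC) layers and an activation function $\sigma : \mathbb{R} \to \mathbb{R}$ applied componentwise as shown below
$$\hat{y} = W_k \, \sigma\left( W_{k-1} \, \sigma\left( \dots \sigma\left( W_1 x + b_1 \right) \dots \right) + b_{k-1} \right) + b_k \tag{5}$$
where weight matrix $W_l \in \mathbb{R}^{n_l \times n_{l-1}}$, bias $b_l \in \mathbb{R}^{n_l}$ for layer $l = 1, \dots, k$.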
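A forward pass through such a network is a short loop over the layers. The sketch below assumes a ReLU activation and leaves the final layer linear; the layer widths and random parameters are hypothetical and serve only to show the shapes involved.

```python
import numpy as np

def relu(z):
    """Componentwise activation (an assumed choice; any scalar function would do)."""
    return np.maximum(z, 0.0)

def mlp_forward(x, weights, biases, activation=relu):
    """Apply FC layers with the activation between them; the last layer stays linear."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = activation(W @ h + b)
    return weights[-1] @ h + biases[-1]

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 1]   # hypothetical widths: input, two hidden layers, output
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(n_out) for n_out in sizes[1:]]

x = rng.standard_normal(sizes[0])
y_hat = mlp_forward(x, weights, biases)
```

Each weight matrix maps a vector of the previous layer's width to the next layer's width, so the shapes chain from the input dimension to the output dimension.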
These methods have been used to capture more complex interactions. It has been shown, for example, that given enough parameters, MLPs with sufficient depth and width can fit data to arbitrary precision bishop1995. Variations of these methods have been widely used in various applications including computer vision and natural language processing. One specific case, Neural Collaborative Filtering (NCF) he2017neural ; sedhain2015autorec, used as part of the MLPerf benchmark mlperf, uses an MLP rather than dot product to compute interactions between embeddings in matrix factorization.
2.2 DLRM Architecture
So far, we have described different models used in recommendation systems and predictive analytics. Let us now combine their intuitions to build a state-of-the-art personalization model.
Let the users and products be described by many continuous and categorical features. To process the categorical features, each categorical feature will be represented by an embedding vector of the same dimension, generalizing the concept of latent factors used in matrix factorization (3). To handle the continuous features, they will be transformed by an MLP (which we call the bottom or dense MLP), which will yield a dense representation of the same length as the embedding vectors (5).
We will compute second-order interactions of different features explicitly, following the intuition for handling sparse data provided in FMs (4), optionally passing them through MLPs. This is done by taking the dot product between all pairs of embedding vectors and processed dense features. These dot products are concatenated with the original processed dense features and postprocessed with another MLP (the top or output MLP) (5), and fed into a sigmoid function to give a probability.
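The data flow described above can be sketched end to end in a few lines of NumPy. This is an illustrative sketch and not the released implementation: it assumes one index per categorical feature (no pooling), ReLU activations, randomly initialized parameters, and hypothetical sizes; the helper names `mlp`, `make_layers`, and `dlrm_forward` are invented for this example.

```python
import numpy as np

def mlp(x, layers):
    """x: (batch, n_in); layers: list of (W, b). ReLU between layers, last layer linear."""
    for k, (W, b) in enumerate(layers):
        x = x @ W + b
        if k < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x

def make_layers(rng, sizes):
    """Random (W, b) pairs for consecutive layer widths (hypothetical initialization)."""
    return [(0.1 * rng.standard_normal((i, o)), np.zeros(o))
            for i, o in zip(sizes[:-1], sizes[1:])]

def dlrm_forward(dense, sparse_idx, bottom, tables, top):
    z = mlp(dense, bottom)                                  # (batch, d) processed dense features
    embs = [T[idx] for T, idx in zip(tables, sparse_idx)]   # one (batch, d) lookup per table
    feats = np.stack([z] + embs, axis=1)                    # (batch, f, d)
    dots = feats @ feats.transpose(0, 2, 1)                 # (batch, f, f) pairwise dot products
    iu, ju = np.triu_indices(feats.shape[1], k=1)           # keep each distinct pair once
    inter = dots[:, iu, ju]                                 # (batch, f * (f - 1) / 2)
    logits = mlp(np.concatenate([z, inter], axis=1), top)   # (batch, 1)
    return 1.0 / (1.0 + np.exp(-logits[:, 0]))              # sigmoid gives a probability

rng = np.random.default_rng(0)
batch, n_dense, d, table_rows = 4, 13, 8, [10, 20, 30]      # hypothetical sizes
bottom = make_layers(rng, [n_dense, 16, d])
tables = [0.1 * rng.standard_normal((r, d)) for r in table_rows]
n_feat = 1 + len(tables)
top = make_layers(rng, [d + n_feat * (n_feat - 1) // 2, 16, 1])

dense = rng.standard_normal((batch, n_dense))
sparse_idx = [rng.integers(0, r, size=batch) for r in table_rows]
p = dlrm_forward(dense, sparse_idx, bottom, tables, top)
```

Note that the input width of the top MLP is fixed by the architecture: the dense representation of length d plus one dot product per distinct pair of feature vectors.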
We refer to the resulting model as DLRM, shown in Fig. 1. We show some of the operators used in DLRM in PyTorch paszke2017pytorch and Caffe2 caffe2 frameworks in Table 1.
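Table 1: DLRM operators by framework.
| Framework | Embedding | MLP | Interactions | Loss |
|---|---|---|---|---|
| PyTorch | nn.EmbeddingBag | nn.Linear/addmm | matmul/bmm | nn.CrossEntropyLoss |
| Caffe2 | SparseLengthsSum | FC | BatchMatMul | CrossEntropy |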
2.3 Comparison with Prior Models
Many deep learning-based recommendation models cheng2016wide ; guo2017deepfm ; wang2017deep ; lian2018xdeepfm ; zhou2018deepi ; zhou2018deep use similar underlying ideas to generate higher-order terms to handle sparse features. Wide and Deep, Deep and Cross, DeepFM, and xDeepFM networks, for example, design specialized networks to systematically construct higher-order interactions. These networks then sum the results from both their specialized model and an MLP, passing this through a linear layer and sigmoid activation to yield a final probability. DLRM specifically interacts embeddings in a structured way that mimics factorization machines to significantly reduce the dimensionality of the model by only considering cross-terms produced by the dot-product between pairs of embeddings in the final MLP. We argue that higher-order interactions beyond second-order found in other networks may not necessarily be worth the additional computational/memory cost.
A key difference between DLRM and other networks is in how these networks treat embedded feature vectors and their cross-terms. In particular, DLRM (and xDeepFM lian2018xdeepfm) interpret each feature vector as a single unit representing a single category, whereas networks like Deep and Cross treat each element in the feature vector as a new unit that should yield different cross-terms. Hence, Deep and Cross networks will produce cross-terms not only between elements from different feature vectors as in DLRM via the dot product, but also produce cross-terms between elements within the same feature vector, resulting in higher dimensionality.
3 Parallelism
Modern personalization and recommendation systems require large and complex models to capitalize on vast amounts of data. DLRMs particularly contain a very large number of parameters, up to multiple orders of magnitude more than other common deep learning models like convolutional neural networks (CNN), transformer and recurrent networks (RNN), and generative networks (GAN). This results in training times up to several weeks or more. Hence, it is important to parallelize these models efficiently in order to solve these problems at practical scales.
As described in the previous section, DLRMs process both categorical features (with embeddings) and continuous features (with the bottom MLP) in a coupled manner. Embeddings contribute the majority of the parameters, with several tables each requiring in excess of multiple GBs of memory, making DLRM memory-capacity and bandwidth intensive. The size of the embeddings makes it prohibitive to use data parallelism since it requires replicating large embeddings on every device. In many cases, this memory constraint necessitates the distribution of the model across multiple devices to be able to satisfy memory capacity requirements.
On the other hand, the MLP parameters are smaller in memory but translate into sizeable amounts of compute. Hence, data parallelism is preferred for MLPs since this enables concurrent processing of the samples on different devices and only requires communication when accumulating updates. Our parallelized DLRM will use a combination of model parallelism for the embeddings and data parallelism for the MLPs to mitigate the memory bottleneck produced by the embeddings while parallelizing the forward and backward propagations over the MLPs. Combined model and data parallelism is a unique requirement of DLRM as a result of its architecture and large model sizes. Such combined parallelism is not supported in either Caffe2 or PyTorch (as well as other popular deep learning frameworks); therefore, we design a custom implementation. We plan to provide its detailed performance study in forthcoming work.
In our setup, the top MLP and the interaction operator require access to part of the minibatch from the bottom MLP and all of the embeddings. Since model parallelism has been used to distribute the embeddings across devices, this requires a personalized all-to-all communication commsref. At the end of the embedding lookup, each device has a vector for the embedding tables resident on that device for all the samples in the minibatch, which needs to be split along the minibatch dimension and communicated to the appropriate devices, as shown in Fig. 2. Neither PyTorch nor Caffe2 provides native support for model parallelism; therefore, we have implemented it by explicitly mapping the embedding operators (nn.EmbeddingBag for PyTorch, SparseLengthsSum for Caffe2) to different devices. Then personalized all-to-all communication is implemented using the butterfly shuffle operator, which appropriately slices the resulting embedding vectors and transfers them to the target devices. In the current version, these transfers are explicit copies, but we intend to further optimize this using the available communication primitives (such as allgather and send-recv).
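The exchange pattern can be simulated on a single machine to see which slices move where. The sketch below is a NumPy illustration of a personalized all-to-all under assumed sizes and an assumed assignment of tables to devices; it uses array slicing in place of real device transfers and is not the butterfly shuffle operator itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dev, n_tables, batch, d = 2, 4, 6, 3     # hypothetical sizes; batch divisible by n_dev
tables_on = [[0, 1], [2, 3]]               # assumed placement: table ids held by each device

# Reference result: the embedding output for every (table, sample) pair.
full = rng.standard_normal((n_tables, batch, d))

# After the local lookups, device p holds all samples, but only for its own tables.
local = [full[tables_on[p]] for p in range(n_dev)]

# Personalized all-to-all: device p slices its output along the minibatch
# dimension and sends slice q to device q.
slices = np.array_split(np.arange(batch), n_dev)
send = [[local[p][:, slices[q]] for q in range(n_dev)] for p in range(n_dev)]

# Device q concatenates what it received along the table dimension, ending up
# with every table's vectors for its own part of the minibatch.
received = [np.concatenate([send[p][q] for p in range(n_dev)], axis=0)
            for q in range(n_dev)]
```

After the exchange, each device holds the layout that the data-parallel interaction operator and top MLP expect: all embeddings, but only for a slice of the minibatch.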
We note that for the data parallel MLPs, the parameter updates in the backward pass are accumulated with an allreduce (optimized implementations of the allreduce operation include Nvidia's NCCL nccl and Facebook's gloo gloo) and applied to the replicated parameters on each device commsref in a synchronous fashion, ensuring the updated parameters on each device are consistent before every iteration. In PyTorch, data parallelism is enabled through the nn.DistributedDataParallel and nn.DataParallel modules that replicate the model on each device and insert allreduce with the necessary dependencies. In Caffe2, we manually insert allreduce before the gradient update.
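The effect of the allreduce step can be illustrated without any communication library. The sketch below simulates synchronous data parallelism for a linear model in NumPy, assuming equally sized shards, a mean squared error loss, and an assumed learning rate; the averaging line stands in for the collective operation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dev, n_in, batch = 4, 5, 8               # hypothetical sizes; batch divisible by n_dev
w = rng.standard_normal(n_in)              # parameters replicated on every device
X = rng.standard_normal((batch, n_in))
y = rng.standard_normal(batch)

def grad(w, X, y):
    """Gradient of 0.5 * mean((X w - y)^2) with respect to w."""
    return X.T @ (X @ w - y) / len(y)

# Each replica computes a gradient on its own shard of the minibatch.
shards = np.array_split(np.arange(batch), n_dev)
local_grads = [grad(w, X[s], y[s]) for s in shards]

# Simulated allreduce: every replica ends up with the same averaged gradient.
reduced = sum(local_grads) / n_dev

# Applying the same update everywhere keeps the replicas consistent.
replicas = [w - 0.1 * reduced for _ in range(n_dev)]
```

With equal shard sizes, the averaged gradient matches the gradient computed on the whole minibatch, which is why the replicated parameters stay identical after every step.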
4 Data
In order to measure the accuracy of the model, test its overall performance, and characterize the individual operators, we need to create or obtain a data set for our implementation. Our current implementation of the model supplies three types of data sets: random, synthetic and public data sets.
The former two data sets are useful in experimenting with the model from the systems perspective. In particular, they permit us to exercise different hardware properties and bottlenecks by generating data on the fly while removing dependencies on data storage systems. The latter allows us to perform experiments on real data and measure the accuracy of the model.
4.1 Random
Recall that DLRM accepts continuous and categorical features as inputs. The former can be modeled by generating a vector of random numbers using either a uniform or normal (Gaussian) distribution with the numpy.random package rand or randn calls with default parameters. Then a minibatch of inputs can be obtained by generating a matrix where each row corresponds to an element in the minibatch.
To generate categorical features, we need to determine how many non-zero elements we would like to have in a given multi-hot vector. The benchmark allows this number to be either fixed or random within a range (see options --num-indices-per-lookup=k and --num-indices-per-lookup-fixed). Then, we generate the corresponding number of integer indices, within a range $[0, m-1]$, where $m$ is the number of rows in the embedding table $W$ in (2). Finally, in order to create a minibatch of lookups, we concatenate the above indices and delineate each individual lookup with lengths (SparseLengthsSum) or offsets (nn.EmbeddingBag). (For instance, in order to represent three embedding lookups, with indices …, … and …, we use …)
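The two ways of delineating lookups carry the same information. The sketch below generates a small random minibatch with NumPy and builds both the lengths and the offsets representation; the sizes are hypothetical, and drawing distinct indices within a lookup is an assumption of this example rather than a stated property of the benchmark.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_dense, n_rows, max_per_lookup = 3, 4, 10, 5   # hypothetical sizes

# Continuous features: one row per minibatch element.
dense = rng.random((batch, n_dense))   # uniform on [0, 1); standard_normal would give Gaussian

# Categorical features: a random number of row indices per lookup.
lookups = [np.sort(rng.choice(n_rows,
                              size=rng.integers(1, max_per_lookup + 1),
                              replace=False))
           for _ in range(batch)]

indices = np.concatenate(lookups)                           # all lookups back to back
lengths = np.array([len(l) for l in lookups])               # lengths-style delineation
offsets = np.concatenate([[0], np.cumsum(lengths)[:-1]])    # offsets-style delineation

# Either form recovers the individual lookups from the flat index array.
recovered = np.split(indices, offsets[1:])
```

The offsets are the exclusive prefix sum of the lengths, so converting between the two conventions is a single cumulative sum or difference.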
4.2 Synthetic
There are many reasons to support custom generation of indices corresponding to categorical features. For instance, if our application uses a particular data set, but we would not like to share it for privacy purposes, then we may choose to express the categorical features through distributions. This could potentially serve as an alternative to the privacy-preserving techniques used in applications such as federated learning bonawitz2019federated ; gentry2009. Also, if we would like to exercise system components, such as studying memory behavior, we may want to capture the fundamental locality of accesses of the original trace within the synthetic trace.
Let us now illustrate how we can use a synthetic data set. Assume that we have a trace of indices that correspond to embedding lookups for a single categorical feature (and repeat the process for all features). We can record the unique accesses and frequency of distances between repeated accesses in this trace (Alg. 1) and then generate a synthetic trace (Alg. 2) as proposed in hassan2007synthetic .
Note that we can only generate a stack distance up to the number s of unique accesses we have seen so far; therefore, s is used to control the support of the distribution p in Alg. 2. Given a fixed number of unique accesses, a longer input trace will result in lower probability being assigned to them in Alg. 1, which will lead to a longer time to achieve full distribution support in Alg. 2. In order to address this problem, we increase the probability for the unique accesses up to a minimum threshold and adjust the support to remove unique accesses from it once all have been seen. A visual comparison of the probability distribution p based on original and synthetic traces is shown in Fig. 3. In our experiments, original and adjusted synthetic traces produce similar cache hit/miss rates.
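The profiling and generation steps can be sketched with an LRU-style stack. The code below is an illustrative interpretation and not a transcription of Alg. 1 and Alg. 2: it assumes that a stack distance is the depth of an item in a most-recently-used list, uses a marker distance of 0 for first-time accesses, restricts the support to distances that are currently realizable, and omits the minimum-threshold adjustment. The function names and the example trace are hypothetical.

```python
import random
from collections import Counter

NEW = 0  # marker distance for the first access to an index

def profile(trace):
    """Return (unique accesses in first-seen order, Counter of stack distances)."""
    stack, counts, uniques = [], Counter(), []   # stack[0] is the most recently used
    for x in trace:
        if x in stack:
            depth = stack.index(x) + 1           # 1 means the same item was just used
            stack.remove(x)
            counts[depth] += 1
        else:
            uniques.append(x)
            counts[NEW] += 1
        stack.insert(0, x)
    return uniques, counts

def generate(uniques, counts, length, seed=0):
    """Sample a synthetic trace whose stack distances follow the recorded counts."""
    rng = random.Random(seed)
    stack, unseen, out = [], list(uniques), []
    for _ in range(length):
        # Only distances that can be realized right now are in the support.
        support = [k for k in counts
                   if (k == NEW and unseen) or (k != NEW and k <= len(stack))]
        if not support:                          # e.g. a trace without any reuse
            break
        k = rng.choices(support, weights=[counts[s] for s in support])[0]
        x = unseen.pop(0) if k == NEW else stack.pop(k - 1)
        stack.insert(0, x)
        out.append(x)
    return out

trace = [3, 7, 3, 3, 9, 7, 3, 9, 9, 7]
uniques, counts = profile(trace)
synthetic = generate(uniques, counts, length=20)
```

The synthetic trace reuses the original indices but in a new order, so it can be shared or replayed while preserving the reuse pattern that drives cache behavior.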
4.3 Public
Few public data sets are available for recommendation and personalization systems. The Criteo AI Labs Ad Kaggle (https://www.kaggle.com/c/criteo-display-ad-challenge) and Terabyte (https://labs.criteo.com/2013/12/download-terabyte-click-logs/) data sets are open-sourced data sets consisting of click logs for ad CTR prediction. Each data set contains 13 continuous and 26 categorical features. Typically the continuous features are preprocessed with a simple log transform. The categorical features are mapped to their corresponding embedding indices, with unlabeled categorical features or labels mapped to 0 or NULL.
The Criteo Ad Kaggle data set contains approximately 45 million samples over 7 days. In experiments, typically the 7th day is split into a validation and test set while the first 6 days are used as the training set. The Criteo Ad Terabyte data set is sampled over 24 days, where the 24th day is split into a validation and test set and the first 23 days are used as a training set. Note that there are an approximately equal number of samples from each day.
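The preprocessing and split described above can be sketched with the standard library alone. Several details here are assumptions made for illustration: the log transform is taken to be log(1 + x) on non-negative values, unseen or missing categories map to row 0, the held-out day is divided into two equal halves, and the helper names and record layout are hypothetical.

```python
import math

def preprocess(dense, categorical, vocab):
    """dense: non-negative numbers; categorical: raw tokens ('' when missing);
    vocab: one dict per categorical feature mapping token -> embedding row (0 reserved)."""
    x = [math.log1p(v) for v in dense]                        # assumed transform: log(1 + x)
    idx = [voc.get(tok, 0) for voc, tok in zip(vocab, categorical)]
    return x, idx

def split_by_day(samples, last_day):
    """samples: list of (day, record). Earlier days train; the last day is halved."""
    train = [rec for day, rec in samples if day < last_day]
    held_out = [rec for day, rec in samples if day == last_day]
    half = len(held_out) // 2
    return train, held_out[:half], held_out[half:]

x, idx = preprocess([0.0, math.e - 1.0], ["a", "zz"], [{"a": 1}, {"b": 1}])
train, val, test = split_by_day([(1, "r1"), (2, "r2"), (7, "r3"), (7, "r4")], last_day=7)
```

Splitting by day rather than at random keeps the validation and test samples strictly later in time than the training samples.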
5 Experiments
Let us now illustrate the performance and accuracy of DLRM. The model is implemented in PyTorch and Caffe2 frameworks and is available on GitHub (https://github.com/facebookresearch/dlrm). It uses fp32 floating point and int32 (Caffe2) / int64 (PyTorch) types for model parameters and indices, respectively. The experiments are performed on the Big Basin platform with Dual Socket Intel Xeon 6138 CPU @ 2.00GHz and eight Nvidia Tesla V100 16GB GPUs, publicly available through the Open Compute Project (https://www.opencompute.org), shown in Fig. 4.
5.1 Model Accuracy on Public Data Sets
We evaluate the accuracy of the model on the Criteo Ad Kaggle data set and compare the performance of DLRM against a Deep and Cross network (DCN) as-is without extensive tuning wang2017deep. We compare with DCN because it is one of the few models that has comprehensive results on the same data set. Notice that in this case the models are sized to accommodate the number of features present in the data set. In particular, DLRM consists of both a bottom MLP for processing dense features consisting of three hidden layers with , and nodes, respectively, and a top MLP consisting of two hidden layers with and nodes. On the other hand DCN consists of six cross layers and a deep network with and nodes. An embedding dimension of is used. Note that this yields a DLRM and DCN both with approximately parameters.
We plot both the training (solid) and validation (dashed) accuracies over a full single epoch of training for both models with SGD and Adagrad optimizers duchi2011. No regularization is used. In this experiment, DLRM obtains slightly higher training and validation accuracy, as shown in Fig. 5. We emphasize that this is without extensive tuning of model hyperparameters.
5.2 Model Performance on a Single Socket/Device
To profile the performance of our model on a single socket device, we consider a sample model with 8 categorical features and 512 continuous features. Each categorical feature is processed through an embedding table with 1,000,000 vectors, with vector dimension 64, while the continuous features are assembled into a vector of dimension 512. Let the bottom MLP have two layers, while the top MLP has four layers. We profile this model on a data set with 2,048,000 randomly generated samples organized into 1,000 minibatches. For instance, this configuration can be achieved with the following command line arguments:
--arch-embedding-size=1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 --arch-sparse-feature-size=64 --arch-mlp-bot=512-512-64 --arch-mlp-top=1024-1024-1024-1 --data-generation=random --mini-batch-size=2048 --num-batches=1000 --num-indices-per-lookup=100 [--use-gpu] [--enable-profiling]
This model implementation in Caffe2 runs in around 256 seconds on the CPU and 62 seconds on the GPU, with profiling of individual operators shown in Fig. 6. As expected, the majority of time is spent performing embedding lookups and fully connected layers. On the CPU, fully connected layers take a significant portion of the computation, while on the GPU they are almost negligible.
6 Conclusion
In this paper, we have proposed and open-sourced a novel deep learning-based recommendation model that exploits categorical data. Although recommendation and personalization systems still drive much practical success of deep learning within industry today, these networks continue to receive little attention in the academic community. By providing a detailed description of a state-of-the-art recommendation system and its open-source implementation, we hope to draw attention to the unique challenges that this class of networks presents in an accessible way for the purpose of further algorithmic experimentation, modeling, system co-design, and benchmarking.
Acknowledgments
The authors would like to acknowledge AI Systems CoDesign, Caffe2, PyTorch and AML team members for their help in reviewing this document.
References

[1] Christopher M. Bishop. Neural Networks for Pattern Recognition. The Oxford University Press, 1st edition, 1995.
[2] Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloé Kiddon, Jakub Konečný, Stefano Mazzocchi, Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, and Jason Roselander. Towards federated learning at scale: System design. In Proc. 2nd Conference on Systems and Machine Learning (SysML), 2019.
[3] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah. Wide & deep learning for recommender systems. In Proc. 1st Workshop on Deep Learning for Recommender Systems, pages 7–10, 2016.
[4] Corinna Cortes and Vladimir N. Vapnik. Support-vector networks. Machine Learning, 2:273–297, 1995.
[5] Luc Devroye, Laszlo Gyorfi, and Gabor Lugosi. A Probabilistic Theory of Pattern Recognition. New York, Springer-Verlag, 1996.
[6] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
[7] Facebook. Collective communications library with various primitives for multi-machine training (gloo), https://github.com/facebookincubator/gloo.
[8] Facebook. Caffe2, https://caffe2.ai, 2016.
[9] Evgeny Frolov and Ivan Oseledets. Tensor methods and recommender systems. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 7(3):e1201, 2017.
[10] Craig Gentry. A fully homomorphic encryption scheme. PhD thesis, Stanford University, 2009.
[11] Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 3rd edition, 1996.
[12] Ananth Grama, Vipin Kumar, Anshul Gupta, and George Karypis. Introduction to parallel computing. Pearson Education, 2003.
[13] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247, 2017.
[14] Rahman Hassan, Antony Harris, Nigel Topham, and Aris Efthymiou. Synthetic trace-driven simulation of cache memory. In Proc. 21st International Conference on Advanced Information Networking and Applications Workshops (AINAW'07), 2007.
[15] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. Neural collaborative filtering. In Proc. 26th Int. Conf. World Wide Web, pages 173–182, 2017.
[16] Sylvain Jeaugey. NCCL 2.0, 2017.
[17] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, (8):30–37, 2009.
[18] Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. xDeepFM: Combining explicit and implicit feature interactions for recommender systems. In Proc. of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1754–1763. ACM, 2018.
[19] MLPerf. https://mlperf.org/.
[20] Maxim Naumov. On the dimensionality of embeddings for sparse features and data. arXiv preprint arXiv:1901.02103, 2019.
[21] Xia Ning, Christian Desrosiers, and George Karypis. A comprehensive survey of neighborhood-based recommendation methods. In Recommender Systems Handbook, 2015.
[22] Pandora. Music genome project, https://www.pandora.com/about/mgp.
[23] Adam Paszke, Sam Gross, Soumith Chintala, and Gregory Chanan. PyTorch: Tensors and dynamic neural networks in python with strong GPU acceleration, https://pytorch.org/, 2017.
[24] Steffen Rendle. Factorization machines. In Proc. 2010 IEEE International Conference on Data Mining, pages 995–1000, 2010.
[25] Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. Autorec: Autoencoders meet collaborative filtering. In Proc. 24th Int. Conf. World Wide Web, pages 111–112, 2015.
[26] Strother H. Walker and David B. Duncan. Estimation of the probability of an event as a function of several independent variables. Biometrika, 54:167–178, 1967.
[27] Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. Deep & cross network for ad click predictions. In Proc. ADKDD, page 12, 2017.
[28] Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. Deep interest evolution network for click-through rate prediction. arXiv preprint arXiv:1809.03672, 2018.
[29] Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. Deep interest network for click-through rate prediction. In Proc. of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1059–1068. ACM, 2018.