In many real-world applications, e.g. recommendation systems, certain items appear much more frequently than others. However, standard embedding methods, which form the basis of many ML algorithms, allocate the same dimension to all of the items. This leads to statistical and memory inefficiencies. In this work, we propose mixed dimension embedding layers, in which the dimension of a particular embedding vector can depend on the frequency of the item. This approach drastically reduces the memory requirement for the embedding, while maintaining and sometimes improving the ML performance. We show that the proposed mixed dimension layers achieve higher accuracy, while using 8× fewer parameters, for collaborative filtering on the MovieLens dataset. They also improve accuracy by 0.1% using half as many parameters, or maintain baseline accuracy using 16× fewer parameters, for the click-through rate prediction task on the Criteo Kaggle dataset.
It is difficult to overstate the impact of representation learning and embedding-based models in the present AI landscape. Embedding representations power state-of-the-art applications in diverse domains, including computer vision [barz2019hierarchy, vasileva2018learning], natural language processing [chiu2016train, Mikolov2013distributed, liu2015topical, shoeybi2019megatron, akbik2018contextual, peters2018dissecting, liu2019roberta], computational biology [Asgari2015], and recommendation systems [cheng2016wide, Park2018, Wu2019].
There seems to be a fundamental trade-off between the dimension of an embedding representation and the statistical performance (i.e. accuracy or loss) of embedding-based models on a particular learning task. It is well-documented that statistical performance suffers when embedding dimension is too low [yin2018dimensionality]. Thus, we are interested in the fundamental question: Is it possible to re-architect embedding representations for a more favorable trade-off between number of parameters and statistical performance?
This challenge is particularly prominent in recommendation systems tasks such as collaborative filtering (CF) and click-through rate (CTR) prediction problems. Recommendation models power some of the most commonplace data-driven services on the internet that benefit users across the globe with personalized experiences on a daily basis [Aggarwal2016, hazelwood2018, he2014practical, Pi2019].
At present, the out-sized memory consumption of standard embedding layers in these recommendation systems is a major burden on the memory hierarchy – an embedding layer for a single recommendation engine can consume tens of gigabytes of space [Park2018, Pi2019]. This makes the engineering challenges associated with large-scale recommendation embeddings of particular importance and interest.
In many models, embedding layers are used to map input categorical values into a vector space. This feature mapping outputs a $d$-dimensional vector for each value, and is learned concurrently with the model during supervised training. However, the relationship between data distributions, embedding representations, model architectures, and optimization algorithms is highly complex. This makes it difficult to design efficient quantization and compression algorithms for embedding representations from information-theoretic principles.
Nevertheless, the distribution of accesses for users and items is often heavily skewed in many real-world applications. For instance, for the CF task on the MovieLens dataset (https://grouplens.org/datasets/movielens/), the top 10% of users receive as many queries as the remaining 90%, and the top 1% of items receive as many queries as the remaining 99%. To an even greater extent, for the CTR prediction task on the Criteo AI Labs Ad Kaggle dataset (https://www.kaggle.com/c/criteo-display-ad-challenge), a tiny fraction of the most popular indices receive as many queries as the remaining 30 million combined, as summarized in Fig. 1. In network science, this phenomenon is referred to as popularity [cho2012wave, papadopoulos2012popularity], a terminology we adopt here.
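This kind of skew is easy to measure from an access log. The following sketch uses synthetic Zipf-distributed access counts (a hypothetical stand-in for real per-item query counts) to compute what fraction of the most popular items covers half of all accesses:

```python
import numpy as np

# Synthetic, heavy-tailed (Zipf-like) access counts standing in for
# per-item query counts from a real access log.
rng = np.random.default_rng(0)
counts = rng.zipf(a=2.5, size=100_000)

# Sort descending and compute the cumulative share of total accesses.
sorted_counts = np.sort(counts)[::-1]
cum_share = np.cumsum(sorted_counts) / sorted_counts.sum()

# Smallest fraction of items that covers half of all accesses.
frac_for_half = np.argmax(cum_share >= 0.5) / len(sorted_counts)
print(f"top {100 * frac_for_half:.2f}% of items cover 50% of accesses")
```

With a heavier tail (a Zipf exponent closer to 1), this fraction shrinks further, matching the popularity skew described above.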
We propose mixed dimension embedding layers, a novel architecture for embedding representations. The central thesis behind mixed dimension embedding layers is that the dimension of a particular embedding vector should not remain uniform, but should scale with that feature’s popularity. In particular, we show that popular embedding features are often not allotted sufficient parameters, whereas infrequent embeddings waste them. By architecting mixed dimension embedding layers such that the dimension of an embedding vector scales with its popularity, we can make significant improvements to statistical performance at small parameter budgets.
In order to illustrate that the proposed mixed dimension embedding layers greatly improve parameter-efficiency, we show that mixed dimension layers achieve slightly lower loss, while using 8× fewer parameters, compared to uniform dimension embeddings on the MovieLens dataset [Harper2015]. Also, we show that mixed dimension embeddings improve accuracy by 0.1% using half as many parameters, and maintain the baseline accuracy using 16× fewer parameters, on the Criteo Kaggle dataset [Criteo2014].
We point out that even though we focus on recommendation and event-probability prediction problems, our methods are applicable to representation learning in general. We state our main contributions next.
We propose mixed dimension embedding layers, where the dimension of a particular embedding vector scales with said vector’s popularity.
We provide a simple heuristic scheme for sizing of embedding vectors given a prior distribution or training sample. The dimensions prescribed by our scheme compare favorably to uniform dimensions.
We identify two distinct mechanisms by which mixed dimension layers improve parameter-efficiency and achieve superior statistical performance at a given parameter budget.
In this work, we focus on explicit CF as well as CTR prediction. In explicit CF, user ratings for particular items are directly observed, so the task can be formally framed as a matrix completion problem [candes2010matrix, candes2009exact, candes2010power, hastie2015matrix]. Embedding-based approaches, such as matrix factorization (MF) [hastie2015matrix, Koren2009, Rendle2019] or neural collaborative filtering (NCF) [Dacrema2019, he2017neural], are among the most popular and efficient solutions to matrix completion. The main alternative is to use a convex relaxation to find the minimum nuclear norm solution. This entails solving a semi-definite program, which is computationally expensive [zheng2012efficient] and thus will not scale to real-world applications. Embedding/factorization approaches instead have the drawback of explicitly requiring an embedding dimension, but in practice this can be addressed with cross-validation or other hyperparameter tuning techniques. In CTR prediction tasks we predict the event-probability of a click, which can also be viewed as context-based CF with binary ratings. A wide variety of models has been developed for this task in recent years, including but not limited to [cheng2016wide, guo2017deepfm, lian2018xdeepfm, naumov2019deep, zhou2018deepi, zhou2018deep]. These state-of-the-art models share many characteristics, and all of them without exception use memory-intensive embedding layers that dwarf the rest of the model.
In modern machine learning, embeddings are often used to represent categorical features. Embedding vectors are mined from data, with the intention that certain semantic relationships between the categorical concepts represented by the vectors are encoded by the spatial or geometric relationships and properties of the vectors [Mikolov2013distributed, naumov2019dimensionality]. Thus, large embeddings are a natural choice for recommendation systems, which require models to understand the relationships between users and items.
Many techniques have been developed to decrease the amount of memory consumed by embedding layers. They can roughly be split into two high-level classes: (i) compression algorithms and (ii) compressed architectures. Compression algorithms usually involve some form of additional processing of the model beyond standard training. They can be performed offline, when they only involve post-training processing, or online, when the compression process is interleaved with or otherwise materially alters the training process. Simple offline compression algorithms include post-training quantization, pruning, or low-rank SVD [Andrews2015, Bhavana2019, Sattigeri, Sun2016]. Model distillation techniques, such as compositional coding [Shu2017] and neural binarization [Tissier2018], are also a form of complex offline compression in which autoencoders are trained to mimic uncompressed, pre-trained embedding layers. Online compression algorithms include quantization-aware training, gradual pruning, and periodic regularization [alvarez2017compression, Elthakeb2019, frankle2018lottery, Naumov2018reg, park2018value]. We note that many of these compression algorithms are not unique to embedding layers and are widely used in the model compression literature.
On the other hand, we also have compressed architectures, which attempt to architect embedding representations of comparable statistical quality with fewer parameters. Compressed architectures have the advantage of reducing memory requirements not only at inference time, but also at training time. This is the approach followed by hashing-based and tensor factorization methods [attenberg2009collaborative, gao2018cuckoo, karatzoglou2010collaborative, Khrulkov2019, Shi2018], which focus on reducing the number of parameters used in an embedding layer by re-using parameters in various ways. These techniques stand in contrast with our approach, which focuses on a non-uniform reduction of the dimensionality of embedding vectors based on embedding popularity. In principle, nothing precludes the compound use of our proposed technique and most other compression algorithms or compressed architectures. This is an interesting direction for future investigation.
Finally, we note that non-uniform and deterministic sampling has been addressed in the matrix completion literature in a line of work [negahban2012restricted], but only in so far as how to correct for popularity so as to improve statistical recovery performance, or build theoretical guarantees for completion under non-uniform sampling [papadopoulos2012popularity]. As far as we know, we are the first to leverage popularity to actually reduce parameter counts in large-scale embedding layers.
We pose the problem formulations for CF and CTR tasks. For each task, we also describe the relevant models.
Inspired by the apparent inefficiencies of using uniform dimension vectors in the non-uniform sampling regime, we formally state the central formulation of this study in the noisy setting, similar to that of [negahban2012restricted].
Let $M \in \mathbb{R}^{n \times m}$ be an unknown target matrix. Let $\Omega$ denote a sample of indices. Define $M'_{ij} = M_{ij} + \epsilon_{ij}$ for some noise term $\epsilon_{ij}$ with $\mathbb{E}[\epsilon_{ij}] = 0$. Our observation, denoted $S$, is the set of 3-tuples
$$S = \{ (i, j, M'_{ij}) : (i, j) \in \Omega \},$$
viewed as a set-valued random variable. In particular, $\Omega$ samples entries from $M$ without replacement according to some distribution $P$. Like [negahban2012restricted], we assume that the probability matrix $P$ is rank-1: $P = f g^\top$ for some $n$-dimensional and $m$-dimensional probability vectors $f$ and $g$, respectively. In principle, if $f$ and $g$ are known, they may be used directly. In our applications, we estimate them from the training data using the empirical frequencies of access, $f$ for users and $g$ for items, respectively.
We refer to a collaborative filter as an algorithm that inputs the observation $S$ and outputs an estimate of $M$, denoted $\hat{M}$. The goal of the filter is to minimize the $P$-weighted Frobenius loss:
$$\| \hat{M} - M \|_{P}^2 = \sum_{i,j} P_{ij} \, (\hat{M}_{ij} - M_{ij})^2.$$
In MF we define $\hat{M} = U V^\top$ as solving the following optimization problem
$$\min_{U, V} \; \sum_{(i, j, M'_{ij}) \in S} (u_i^\top v_j - M'_{ij})^2,$$
where the user and item embedding matrices $U \in \mathbb{R}^{n \times d}$ and $V \in \mathbb{R}^{m \times d}$ correspond to a $d$-dimensional embedding layer. When the rank of $M$ is known a priori, it can be used to set the dimension $d$ of the embedding layer; otherwise $d$ is treated as a hyperparameter.
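As a concrete sketch of this objective (with synthetic ratings and illustrative sizes; not the paper's exact training setup), MF can be trained with the Amsgrad variant of Adam in PyTorch:

```python
import torch

n, m, d = 1000, 500, 16          # users, items, embedding dimension (hyperparameter)
U = torch.nn.Embedding(n, d)     # user embedding matrix
V = torch.nn.Embedding(m, d)     # item embedding matrix
opt = torch.optim.Adam(
    list(U.parameters()) + list(V.parameters()), lr=1e-2, amsgrad=True
)

# Synthetic observation set S of (user, item, noisy rating) 3-tuples.
users = torch.randint(0, n, (4096,))
items = torch.randint(0, m, (4096,))
ratings = torch.rand(4096) * 5

for _ in range(10):
    opt.zero_grad()
    pred = (U(users) * V(items)).sum(dim=1)   # u_i^T v_j for each observed pair
    loss = torch.nn.functional.mse_loss(pred, ratings)
    loss.backward()
    opt.step()
```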
Although NCF lacks the nice theoretical guarantees offered by MF, it empirically performs mildly better on real-world datasets [he2017neural]. Since we are primarily concerned with the embedding layer, we adopt the simplest NCF model, where user and item embeddings are concatenated to form the input to a multilayer perceptron (MLP).
In NCF we define $\hat{M}_{ij} = f_\theta([u_i; v_j])$ as solving the following optimization problem
$$\min_{U, V, \theta} \; \sum_{(i, j, M'_{ij}) \in S} (f_\theta([u_i; v_j]) - M'_{ij})^2,$$
where $U$ and $V$ are embeddings, while $\theta$ denotes the additional parameters of the MLP, denoted by the function $f_\theta$.
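A minimal sketch of this NCF variant (concatenated embeddings feeding an MLP; the layer sizes are hypothetical):

```python
import torch

n, m, d = 1000, 500, 16
U = torch.nn.Embedding(n, d)
V = torch.nn.Embedding(m, d)
f_theta = torch.nn.Sequential(       # the MLP applied to [u_i ; v_j]
    torch.nn.Linear(2 * d, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, 1),
)

users = torch.randint(0, n, (32,))
items = torch.randint(0, m, (32,))
pred = f_theta(torch.cat([U(users), V(items)], dim=1)).squeeze(1)
```

Training then proceeds exactly as in MF, with the MLP weights optimized jointly with $U$ and $V$.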
CTR prediction tasks can be interpreted as an event-probability prediction problem or as context-based CF with binary ratings. Compared to canonical CF, these tasks also include a large amount of context, which can be used to better predict user and item interaction events. Therefore, this problem can be viewed as restricting the targets to binary values, while allowing users and items to be represented by multiple features, often expressed through sets of indices (categorical) and floating point values (continuous).
Facebook’s state-of-the-art deep learning recommendation model (DLRM) [naumov2019deep] allows for $k$ categorical features, which can represent arbitrary details about the context of an on-click or personalization event. The $i$-th categorical feature can be represented by an index $x_i$ for $i = 1, \dots, k$. In addition to categorical features, we also have scalar features, together producing a dense feature vector $x'$. Thus, given some $(x', x_1, \dots, x_k)$, we would like to predict $y$, which denotes an on-click event in response to a particular personalized context.
In DLRM we define $\hat{y} = f_\theta(x', E_1(x_1), \dots, E_k(x_k))$ as solving the following optimization problem
$$\min_{E_1, \dots, E_k, \theta} \; \sum_{(x', x_1, \dots, x_k, y) \in S} (f_\theta(x', E_1(x_1), \dots, E_k(x_k)) - y)^2,$$
where $E_i$ for $i = 1, \dots, k$ are embeddings, while $\theta$ denotes any additional parameters (mostly related to MLPs) in the model $f_\theta$. Note that in practice we often use binary cross-entropy instead of the MSE loss.
Let us now define the mixed dimension embedding layer and describe the equipartition- as well as popularity-based schemes used to determine its dimensionality. We will also discuss how we can apply it to CF and CTR prediction tasks.
Let a mixed dimension embedding layer $\bar{E} = (\bar{E}^{(1)}, \dots, \bar{E}^{(k)}, P^{(1)}, \dots, P^{(k)})$ consist of $k$ blocks and be defined by $2k$ matrices, so that
$$\bar{E}^{(i)} \in \mathbb{R}^{n_i \times d_i} \quad \text{and} \quad P^{(i)} \in \mathbb{R}^{d_i \times \bar{d}}$$
for $i = 1, \dots, k$, where $P^{(1)}$ is implicitly defined as the identity (so that $d_1 = \bar{d}$).
Let us assume that the dimensions of these blocks are fixed. Then, forward propagation for a mixed dimension embedding layer takes an index $x$ in the range from $1$ to $n = \sum_i n_i$ and produces an embedding vector as defined in Alg. 1. The steps involved in this algorithm are differentiable; therefore we can perform backward propagation through this layer and update the matrices $\bar{E}^{(i)}$ and $P^{(i)}$ accordingly during training. We note that Alg. 1 may be generalized to support multi-hot lookups, where embedding vectors corresponding to some query indices are fetched and reduced by a differentiable operator, such as add, multiply or concatenation.
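The forward lookup of Alg. 1 can be sketched as follows; the block row counts, per-block dimensions, and base dimension below are hypothetical, and each projection $P^{(i)}$ is represented as a bias-free linear map:

```python
import torch

d_bar = 16                             # base dimension (d_1 = d_bar, so P^(1) = I)
row_counts = [1_000, 10_000, 100_000]  # n_1, ..., n_k
dims = [16, 8, 4]                      # d_1, ..., d_k

# Offsets t_0, ..., t_k delimiting the blocks in the global (0-based) index space.
offsets = [0]
for n_i in row_counts:
    offsets.append(offsets[-1] + n_i)

blocks = [torch.nn.Embedding(n_i, d_i) for n_i, d_i in zip(row_counts, dims)]
projections = [torch.nn.Linear(d_i, d_bar, bias=False) for d_i in dims[1:]]

def mixed_dim_lookup(idx: int) -> torch.Tensor:
    """Find the block containing row idx, look it up, and project to d_bar."""
    for b in range(len(blocks)):
        if idx < offsets[b + 1]:
            e = blocks[b](torch.tensor(idx - offsets[b]))
            return e if b == 0 else projections[b - 1](e)
    raise IndexError(idx)

v = mixed_dim_lookup(50_000)  # falls in the third block: lookup in R^4, project to R^16
```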
Note that we return projected embeddings for all but the first embedding matrix, and that all embedding vectors therefore share the same base dimension $\bar{d}$. Consequently, models based on a mixed dimension embedding layer should be sized with respect to $\bar{d}$. We illustrate the matrix architecture of the mixed dimension embedding layer with two blocks in Fig. 2, where the parameter budget (total area) consumed by uniform and mixed dimension matrices is the same, but allocated differently.
Let us now focus on how to find the block structure in the mixed dimension architecture. This includes the row count as well as the dimension assigned to each block in the mixed dimension embedding layer. We restrict ourselves to the use of popularity information for sizing the mixed dimension embedding layer (i.e. the frequency $f$ of access of a particular feature, assumed here to be mostly consistent between training and test samples). We note that in principle one could also use a related but distinct notion of importance, which refers to how statistically informative a particular feature is, on average, to the inference of the target variable. Importance could be determined either by domain experts or in a data-driven manner at training time.
There is some amount of malleability and choice in the re-architecture of a uniform dimension embedding layer into a mixed dimension layer. With appropriate re-indexing, multiple embedding matrices may be stacked into a single block matrix, or a single embedding matrix may be row-wise partitioned into multiple block matrices. The point of the partitioned blocking scheme is to map the total $n$ embedding rows into $k$ blocks, with block-level row counts given by $(n_1, \dots, n_k)$ and an offset vector $(t_0, \dots, t_k)$ with $t_0 = 0$ and $t_i = \sum_{j \le i} n_j$.
In CF tasks we only have two features – corresponding to users and items – with corresponding embedding matrices $U$ and $V$, respectively. To size the mixed dimension embedding layer we apply mixed dimensions within individual embedding matrices by partitioning them. First, we sort and re-index the rows based on row-wise frequency: $f_1 \ge f_2 \ge \dots \ge f_n$. Then, we partition each embedding matrix into $k$ blocks such that the total popularity (area under the frequency curve) in each block is constant, as shown in Alg. 2. For a given frequency $f$, the $k$-equipartition is unique and simple to compute. In our experiments, we saw that a moderate number of blocks $k$ is sufficient to observe the effects induced by mixed dimensions, with diminishing effect beyond that.
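A sketch of the $k$-equipartition, assuming frequencies already sorted in descending order; the Zipf-distributed frequencies here are synthetic stand-ins for real access counts:

```python
import numpy as np

def equipartition(freqs: np.ndarray, k: int) -> list:
    """Split rows (sorted by descending frequency) into k blocks so that
    each block covers roughly 1/k of the total popularity mass."""
    cum = np.cumsum(freqs) / freqs.sum()      # cumulative popularity in (0, 1]
    bounds = [0]
    for t in range(1, k):
        bounds.append(int(np.searchsorted(cum, t / k)) + 1)
    bounds.append(len(freqs))
    return bounds                             # offsets t_0, ..., t_k

freqs = np.sort(np.random.default_rng(0).zipf(2.5, 10_000))[::-1].astype(float)
offs = equipartition(freqs, k=8)
```

Because the frequencies are sorted in descending order, the early (popular) blocks contain few rows while the late (unpopular) blocks contain many.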
In CTR prediction tasks, we have several categorical features, with corresponding embedding matrices . To size the mixed dimension embedding layer for CTR prediction applications we apply mixed dimensions across different embedding matrices by stacking them. Therefore, the problem structure defines the number of blocks and number of vectors in each original embedding defines the row counts in the corresponding block in the mixed dimension embedding layer.
We now assume that the number of vectors in each block of the mixed dimension embedding layer is already fixed. Therefore, it only remains to assign the dimensions to completely specify it.
We propose a popularity-based scheme that operates at the block level and is based on a heuristic: each block should be assigned a dimension proportional to some fractional power of its popularity. Note that here we distinguish the block-level probability $p$ from the row-wise frequency $f$. Given $f$, we define $a_i$ as the area under the frequency curve in the interval $[t_{i-1}, t_i]$ and the total area $a = \sum_i a_i$. Then, we let the block-level probability vector $p$ be defined by its elements $p_i = a_i / a$. We formalize the popularity-based scheme in Alg. 3, with an extra hyperparameter, the temperature $\alpha$.
The proposed technique requires knowledge of the probability vector $p$ that governs the feature popularity. When this distribution is unknown, we may easily replace it with the empirical distribution from the data sample (for instance, in CF for practical purposes we can approximate it using the number of samples pertaining to the $i$-th user). Alternatively, we can set the scaling factor $\lambda$ to approximately constrain the embedding layer sizing to a total parameter budget $B$; this does not account for integer rounding or for the parameters in the projection matrices $P^{(i)}$, but the projection matrices should only add a small number of parameters relative to the total when $\bar{d} \ll n_i$, which generally holds. Furthermore, to get crisp sizing, one might also elect to round the dimensions $d_i$, perhaps to the nearest power of two, after applying Alg. 3.
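A sketch of this sizing rule under a total parameter budget; the specific form $d_i = \lfloor \lambda \, p_i^\alpha \rfloor$ with $\lambda$ solved from the budget, and the cap on $d_i$, are illustrative assumptions rather than the exact Alg. 3:

```python
import numpy as np

def assign_dims(block_probs, row_counts, budget, alpha, d_max=64):
    """Assign block dimensions d_i proportional to p_i^alpha, with the scale
    lambda chosen so that sum_i n_i * d_i approximately meets the budget B
    (projection-matrix parameters are ignored, as discussed above)."""
    p = np.asarray(block_probs, dtype=float) ** alpha
    lam = budget / float(np.dot(row_counts, p))
    return np.clip(np.floor(lam * p), 1, d_max).astype(int)

# Hypothetical 3-block layer: popular blocks are small but frequently accessed.
dims = assign_dims([0.6, 0.3, 0.1], [1_000, 10_000, 100_000],
                   budget=400_000, alpha=0.5)
```

Setting $\alpha = 0$ recovers uniform dimensions, while larger $\alpha$ skews the dimensions toward the popular blocks.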
We proceed to present our experiments validating mixed dimension embedding layers. All implementations described herein are done in the PyTorch framework [paszke2017automatic] using NVIDIA V100 32GB GPUs [NvidiaVolta100]. In total, we run experiments on two datasets with corresponding models summarized in Table 1. We note that we always use the MLP variation of the NCF model.
| Dataset | Tasks | # Samples | # Categories | Models Used |
We explore the trade-off between memory and statistical performance for mixed dimension embeddings, parameterized by varying $\alpha$. For a given $\alpha$, we vary the budget by varying the base dimension $\bar{d}$. In our setting, we measure memory in terms of the number of 32-bit floating point parameters in the embedding layer. We measure performance in terms of MSE loss for CF tasks and accuracy for CTR prediction tasks. We choose loss rather than accuracy for CF because we would like to differentiate how close the quality of our predictions is to the target (e.g. how far a predicted rating lies from the target rating) in our results. We do not have to address this issue for the binary targets in the CTR prediction problem and therefore use the more natural accuracy metric for it. In all experiments, we use the Xavier uniform initialization [glorot2010understanding] for all matrices in our models.
The parameters are allocated to embedding layers according to the $\alpha$-parameterized rule proposed in Alg. 3. Notice that for models with uniform dimension embedding layers ($\alpha = 0$), the number of parameters directly controls the embedding dimension, because the number of vectors per embedding is fixed. On the other hand, for mixed dimension layers ($\alpha > 0$) a fixed parameter budget leads to skewed dimensions. In fact, the larger $\alpha$ becomes, the more skewed the dimension assignments become based on popularity. We illustrate how different choices of $\alpha$ assign embedding dimensions in Fig. 3.
MF is a canonical algorithm for CF. From the perspective of this work, the pure MF model for CF tasks is important because it offers the most distilled, simplest setting in which we can investigate the effects of mixed dimension embeddings. In our experiments we use the MovieLens dataset [Harper2015]. We train using the Amsgrad optimizer [Reddi2018] for 100 epochs, taking the model that scored the lowest validation loss. We always run 3 replicates per experiment.
In Fig. 3(a), we report learning curves for MF with three different embedding layers, plotting the number of training epochs against the validation loss for each epoch. We plot the learning curve for the uniform dimension that attains the lowest validation loss (under early termination). We also plot the learning curve for a smaller uniform dimension as a baseline. Third, we plot the learning curve for mixed dimension embedding layers, which use a number of parameters equivalent to that of the smaller uniform dimension.
Notice that for the smaller uniform dimension (orange line) the model underfits the data, because the attained loss is significantly higher than the one achieved with other hyperparameters. Also, notice that the model with the larger uniform dimension (blue line) severely overfits after a modest number of epochs. On the other hand, the mixed dimension embeddings (green line) train well and achieve the lowest validation loss among these hyperparameters.
This is evidence that at the smaller dimension the popular embedding vectors underfit, and that at the larger dimension the unpopular embedding vectors overfit. At the same parameter budget, the mixed dimension embeddings yield an embedding layer architecture that can more adequately fit both popular and unpopular embedding vectors.
In Fig. 3(b), we report the test loss for different $\alpha$ at varying total parameter budgets using optimal early termination. We can see that using mixed dimension embedding layers generally improves the memory-performance frontier at each parameter budget. We point out a simple trend illustrated by the uniform dimension (blue line): performance improves with the number of parameters until we reach a critical point, after which performance decreases with increasing parameters. Notice that at all memory budgets there is an $\alpha > 0$ that produces mixed dimension embeddings that are equivalent to or better than uniform dimension embeddings ($\alpha = 0$).
Finally, we point out that the optimal $\alpha$ is dependent on the total memory budget. Thus, at any given budget, one should tune $\alpha$ as a hyperparameter. Ultimately, we are able to achieve approximately 0.02 lower MSE while using approximately 4× fewer parameters by using mixed dimension embeddings.
NCF models represent more modern approaches for CF. From the perspective of this work, NCF models are interesting because they add a moderate degree of realism and show how the presence of the non-linearity in neural layers affects the results.
In our experiments we use NCF with a 3-layer MLP with hidden dimension 128. We train using the Amsgrad optimizer for 100 epochs, taking the model that scored the lowest validation loss. We use the same dataset as before and run 3 random seeds for each point in the experiments.
In Fig. 4(a), we report learning curves for the NCF model based on three different embedding layers, plotting the number of training epochs against the validation loss for each epoch. We plot the learning curve for the uniform dimension that attains the lowest validation loss (under early termination). We also plot the learning curve for a smaller uniform dimension as a baseline. Third, we plot the learning curve for mixed dimension embedding layers, which use a number of parameters equivalent to that of the smaller uniform dimension.
Similarly to the MF setting, the optimally terminated model using the smaller uniform embeddings has slightly higher loss than the chosen mixed dimension embeddings. However, unlike in MF, here we see that all three models overfit as training progresses, and thus early termination is essential for all three models.
Under prolonged training, the NCF model based on the larger uniform dimension overfits severely. The mixed dimension NCF model overfits significantly, but not as drastically. The NCF model with the smaller uniform embeddings overfits only slightly. Given that the latter two embedding layers, when identically architected and initialized, do not cause the model to overfit when the embedding vectors are used for MF, we can attribute the overfitting directly to the presence of the MLP layers in the NCF model.
In Fig. 4(b), we report the test loss for different $\alpha$ at varying total parameter budgets using optimal early termination. Overall, the NCF models attain slightly lower test losses than their MF counterparts, generally a small decrease in MSE. Interestingly, the optimally terminated NCF models actually suffered from less overfitting than the MF models did at higher parameter counts. The critical points corresponding to the optimal total parameters also increased for each $\alpha$. Yet, mixed dimensions ($\alpha > 0$) still performed best, albeit at a larger budget.
Finally, concerning the comparison between mixed and uniform dimension embeddings, we see similar trends and draw the same conclusions as in the MF setting. Ultimately, we achieve approximately 0.01 lower MSE while using approximately 8× fewer parameters by using mixed dimension embeddings.
CTR prediction tasks can be interpreted as an event-probability prediction problem or as context-based CF with binary ratings. From the perspective of this work, using mixed dimension embedding layers on real CTR prediction data with a state-of-the-art deep recommendation model is an important experiment that shows how mixed dimension embedding layers might scale to real-world recommendation engines.
In our experiments we use Facebook's state-of-the-art DLRM [naumov2019deep] and the Criteo Kaggle dataset [Criteo2014]. We train using the Amsgrad optimizer for a single epoch. For each $\alpha$ we increase the embedding layer's parameter budget up to 32GB (the GPU memory limit). Notice that because of this limitation we do not see the overfitting behavior from the earlier sections. Also, as usual, we perform 3 replicates per experiment. However, for this task we report accuracy rather than loss, as discussed earlier in this section.
In Fig. 5(a), we plot learning curves for the DLRM model based on three different embedding layers. We show that mixed dimension embeddings (orange line) produce a learning curve equivalent to that of uniform dimension embeddings (blue line), while using a total parameter count equivalent to that of a much smaller uniform dimension (green line).
In Fig. 5(b), we present test accuracy for DLRM using mixed dimension embedding layers at various $\alpha$ with varying parameter budgets. It is evident that mixed dimension embedding layers improve the memory-performance frontier. In fact, we see a very similar trend compared to the two classical CF settings. Ultimately, we achieve approximately 0.01% higher accuracy while using approximately half as many parameters, and achieve on-par accuracy with 16× fewer parameters, by using mixed dimension embeddings.
To summarize we identify two distinct mechanisms by which mixed dimension embedding layers improve upon uniform-dimension layers in our experiments.
i) Generalization – At training time, embeddings learn to represent categorical features such as users or products. We find that with a uniform dimension, frequently-accessed vectors often underfit, whereas infrequently-accessed vectors often overfit. Thus, mixed dimension embeddings learn in a more parameter-efficient manner.
ii) Allocation – Learning aside, at inference time, for a sufficiently constrained parameter budget, one faces a resource allocation trade-off. Even with an oracle factorization of the target matrix, under an expected distortion metric at a given parameter budget, it is more efficient to allocate more parameters to frequently-accessed vectors than to infrequently-accessed vectors.
We propose mixed dimension embedding layers in which the dimension of a particular embedding vector is based on its popularity. This approach addresses severe inefficiencies in the number of parameters used in uniform dimension embedding layers. It offers a superior trade-off between the number of parameters and the statistical performance of the model. In general, it improves the memory-performance frontier across memory budgets, but is particularly effective at small parameter budgets. For instance, in our experiments we were able to attain the baseline accuracy with about 8× and 16× smaller models for CF on MovieLens and CTR prediction on Criteo Kaggle datasets, respectively. In the future we would like to investigate composing the mixed dimension embedding technique with other compression algorithms and architectures.
The authors would like to thank Shubho Sengupta, Michael Shi, Jongsoo Park, Jonathan Frankle and Misha Smelyanskiy for helpful comments and suggestions about this work.