Random Offset Block Embedding Array (ROBE) for CriteoTB Benchmark MLPerf DLRM Model: 1000× Compression and 2.7× Faster Inference

08/04/2021 · by Aditya Desai, et al. · Rice University

Deep learning for recommendation data is one of the most pervasive and challenging AI workloads of recent times. State-of-the-art recommendation models are among the largest models in use, rivalling the likes of GPT-3 and Switch Transformer. The challenge in deep learning recommendation models (DLRM) stems from learning dense embeddings for each categorical value. These embedding tables in industrial-scale models can be as large as hundreds of terabytes. Such large models lead to a plethora of engineering challenges, not to mention prohibitive communication overheads and slower training and inference times. Of these, slower inference time directly impacts user experience. Model compression for DLRM is gaining traction, and the community has recently shown impressive compression results. In this paper, we present the Random Offset Block Embedding Array (ROBE) as a low-memory alternative to embedding tables, which provides orders-of-magnitude reduction in memory usage while maintaining accuracy and boosting execution speed. ROBE is a simple, fundamental approach to improving both the cache performance and the variance of randomized hashing, which could be of independent interest in itself. We demonstrate that we can successfully train DLRM models with the same accuracy while using 1000× less memory. A 1000× compressed model directly results in faster inference without any additional engineering. In particular, we show that we can train a DLRM model using a ROBE Array of size 100MB on a single GPU to achieve an AUC of 0.8025 or higher, as required by the official MLPerf CriteoTB benchmark DLRM model of 100GB, while achieving about a 2.7× (170%) improvement in inference throughput.

1 Introduction

Recommendation systems are one of the top applications of machine learning. Therefore, considerable efforts have been, and continue to be, expended to develop systems that help users make more personalized and well-informed choices in various application domains. Recent approaches utilize deep learning-based models to achieve state-of-the-art performance. However, a key challenge is the need to handle millions of categorical features that dominate recommendation data [16, 2]. Following work in natural language processing [14, 19], current approaches [16, 20, 18, 7, 11, 8] utilize a real-valued feature vector (i.e., an embedding) to represent each categorical value. These categorical representations are learned end-to-end and organized in a memory structure called embedding tables.

Inference with deep learning recommendation models is memory-bound. If the set of all categorical values is V and the embedding dimension is d, then the embedding table size is |V| × d. With the number of categories per feature as large as tens of millions, embedding tables can take up over 99.9% of the total model memory. Specifically, the memory footprint can be as large as hundreds of terabytes (TB) [15, 16, 6, 17]. For example, Facebook recently showcased training of a 50TB model distributed over 128 GPUs [15]. Given the large size of the embeddings, a majority of inference time is spent fetching the embeddings from memory to the computational resource. Hence, these models are memory bound [17, 6, 22].
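As a concrete illustration of the |V| × d memory formula above, here is a minimal back-of-the-envelope sketch in Python. The row count and embedding dimension are hypothetical round numbers chosen only to land near the ~100GB scale of the MLPerf CriteoTB model discussed later; they are not taken from any specific deployment.

# Back-of-the-envelope: an embedding table holds |V| x d parameters.
# The numbers below are assumptions and only illustrate the formula.
num_rows = 200_000_000      # |V|: total categorical values across all tables (assumed)
emb_dim = 128               # d: embedding dimension (assumed)
bytes_per_param = 4         # float32

table_bytes = num_rows * emb_dim * bytes_per_param
print(f"embedding tables: {table_bytes / 1e9:.1f} GB")   # ~102.4 GB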

Training deep learning recommendation models suffers from high communication cost. If we use data-parallel training, then at each step we need to communicate the gradients of all the parameters to all the involved nodes. The communication cost is exactly equal to the size of the model. Therefore, at each step of training an industrial-scale model, terabytes of data need to be communicated. Moreover, the embedding tables of industrial-scale models (multiple TBs) cannot be stored on a single GPU/node. Thus, the model has to be distributed over multiple nodes/GPUs and trained in a model-parallel fashion as well, which adds additional communication cost to the forward and backward passes and makes training slow.

Training of deep learning recommendation models is not accessible to a general user. Training models with multiple terabytes of parameters comes with its own engineering challenges, not to mention the cost of expensive hardware. DLRM models have to be trained in a mixed model- and data-parallel setting on expensive clusters of nodes or GPUs. Thus, these models are out of reach for an average machine learning user. This also severely restricts the pace of research in this area.

The deep learning recommendation model (DLRM) [16] gave rise to increased interest in constructing more memory-efficient embeddings. Recent state-of-the-art work includes increasing the expressive power of embeddings by adding computation over smaller memory, such as compositional embeddings [17]; learning different-sized embeddings for different values to leverage the inherent power law in frequencies [6, 9, 12, 13, 3, 23]; and low-rank decomposition of embedding tables [22]. These approaches show a single ([17, 6]) or double ([22]) order of magnitude reduction in embedding table size with no (or minimal) loss of accuracy. In our empirical evaluation, we show that with the ROBE Array for the DLRM model, we can obtain as much as 1000× compression with similar (or even improved) accuracy, while at the same time obtaining a multi-fold increase in inference throughput. Specifically, we can train a 1000× compressed DLRM MLPerf model for the CriteoTB dataset that reaches the MLPerf target AUC of 0.8025 or higher with an inference throughput boost of 2.7×. Similar observations hold on the Criteo Kaggle dataset, where a 1000× compressed model achieves similar or better accuracy than the original model across a variety of state-of-the-art deep learning models.

What are the implications of 1000× compression of embedding tables?

(1) Eliminate the need for model-parallel training. For models as large as 50TB, a 1000× compression reduces the model size to 50 gigabytes (GB), which can easily fit on a single high-end GPU (e.g., Nvidia A100). Hence, we can simply run data-parallel model optimization.

(2) 1000× lower communication cost. With data-parallel model optimization, we would achieve a 1000× reduction in the communication cost incurred at each training step. This leads to significant savings in communication.

(3) Lower memory latency. With a smaller memory footprint, we can potentially store the embedding tables closer to the computational resource. For example, in the case of CPUs, embedding tables can be shifted from disk or multiple nodes to RAM, and those that fit in RAM to the last-level cache (LLC).

(4) Overall faster training and inference times. Taken together, we have the potential to construct compact models with faster inference and training times.

(5) Faster refresh cycle for industrial models. With changing user interests, recommendation data suffers from frequent concept shift [5]. Faster training implies we can refresh models at a faster rate, which can potentially lead to substantial revenue gains.

Weight sharing is a widely used idea in machine learning for reducing the memory required by a model. Examples include feature hashing [21] to reduce input dimension, HashedNet [1] to compress fully connected multi-layered neural networks, the use of shared filters in convolutional neural networks, and the recently demonstrated success of LSH-based weight sharing in recommendation models [4]. In this paper, we introduce a memory sharing technique: the Random Offset Block Embedding Array (ROBE). We use universal hash functions on chunks/blocks of the embeddings in the embedding tables to locate them in a small circular array of memory. We refer to this form of hashing as ROBE hashing. In the standard feature hashing scenario, where we project a vector from an n-dimensional space to an m-dimensional space, ROBE hashing outperforms the usual feature hashing defined in [21]. We discuss the theoretical results in Section 3. In addition to being theoretically superior, ROBE also leads to better cache performance due to coalesced array accesses. Our results shed new light on how to make randomized hashing algorithms cache friendly while at the same time achieving lower variance. We also provide precise quantification of the various trade-offs involved. The results could be of independent interest to the algorithms community working on randomized hashing.

Caveat:

The caveat of our compression technique is that training the compressed model requires more iterations than training the original model. For example, the original CriteoTB MLPerf model (100GB) takes 1 epoch to reach the target AUC of 0.8025, while the same model using 1000× less memory with the ROBE Array (100MB) takes 2 epochs to reach the same AUC. We see a similar trend in our experiments with the Criteo Kaggle dataset.

While we see a clear improvement in inference throughput, our current proof-of-concept code does not show training time benefits. However, with memory optimizations that leverage the 1000× smaller memory footprint, we believe we should be able to obtain faster training times even while requiring more iterations. We leave this aspect for future work.

2 Random Offset Block Embedding Array

In a deep learning model, the memory footprint is determined by the memory used to store the parameters of the model. Generally, each parameter is stored in a separate individual memory location. However, when the number of parameters far exceeds the total amount of memory we intend to use, approaches such as pruning or precision reduction are used to fit the parameters in memory. While these approaches have proven useful, their reduction factor is generally quite small (up to about 4×). For further reduction in the memory footprint of the model, the weights have to share memory locations. We achieve this memory sharing via memory allocation functions, which we describe below. Once we have an allocation function, say h, the forward pass accesses memory via that function: a parameter indexed by i is accessed at location h(i). The backward pass aggregates the gradients from all the parameters that share a particular memory location. The memory allocation function, along with the associated forward and backward passes, is described below. As we will see in Section 3, when learning a shared memory, we can expect to obtain well-performing models even with very high compression. This is further supported by our experiments in Section 4.

2.1 ROBE-1: ROBE with block size 1

We begin by describing ROBE with a block size of 1, as it is easier to convey. Subsequently, we describe ROBE-Z, which allows for an arbitrary block size of Z.

Figure 1: ROBE-1: Color coding shows the mapping from each of the embedding tables into a single shared memory M. The figure also shows a typical forward-pass lookup of an embedding under ROBE with block size 1.

Memory Allocation Function: In this paper, we are interested in allocating the parameters of embedding tables to a smaller memory, say M of size m. We use hash functions to achieve this allocation. We need to ensure that these hash functions are cheap to store and fast to compute. Hence, we use universal hash functions, which have O(1) space usage and O(1) computation cost, and are convenient to implement on a GPU for parallel batch computation. Specifically, in our implementation we use

h(i, x, j) = ((A · z(i, x, j) + B) mod P) mod m        (1)

where i is the id of the embedding table, x is the value for which we want the embedding, j is the index within the embedding, z(i, x, j) is the position of that element in the flattened parameter vector, P is a large prime, and A and B are the standard random parameters of universal hashing. This hash function is a light-weight replacement for a fully random hash function and tries to minimize collisions among different elements of the embedding tables.

In principle, this memory allocation is very similar to HashedNet [1]. However, there are differences: (1) we use light-weight universal hashing as opposed to xxhash in HashedNet; (2) HashedNet keeps separate arrays for separate matrices, whereas we use a single array to map all the elements; (3) HashedNet was demonstrated on multi-layered neural networks, whereas we focus on embedding tables; and (4) most importantly, our view of the memory allocation allows us to extend ROBE-1 to ROBE-Z with a circular array (described later). We show in Section 3 that ROBE-Z is theoretically superior to ROBE-1, while also providing a coalesced access pattern that improves memory access efficiency by reducing the number of memory fetches.
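For concreteness, here is a minimal Python sketch of a lightweight universal hash of the kind Equation (1) describes. The way the triple (i, x, j) is combined, the prime P, and the class name are illustrative assumptions rather than the authors' released implementation.

import random

P = (1 << 61) - 1  # a large prime; ((A*z + B) mod P) mod m is the standard universal form

class RobeOneHash:
    """Sketch of a ROBE-1 style allocation: (table id, value, embedding index) -> [0, m)."""
    def __init__(self, m, seed=0):
        rng = random.Random(seed)
        # The random parameters are drawn once; storing them costs O(1) memory
        # and evaluating the hash takes O(1) time, as required in the text.
        self.A = [rng.randrange(1, P) for _ in range(3)]
        self.B = rng.randrange(0, P)
        self.m = m

    def location(self, table_id, value, index):
        z = self.A[0] * table_id + self.A[1] * value + self.A[2] * index + self.B
        return (z % P) % self.m

# Example: where does element j=3 of the embedding of value 17 in table 2 live?
h = RobeOneHash(m=100_000)
print(h.location(table_id=2, value=17, index=3))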

Forward Pass: With the universal memory allocation function for ROBE-1, we illustrate the forward pass in Figure 1. The forward pass requires looking up the embedding for a given value of a categorical feature. In the original (full-fledged) embedding table, this operation simply indexes into the embedding matrix at the location of the categorical value, say x, and reads the embedding at that location. In ROBE-1, the lookup of each element is redirected into the shared memory M using the universal memory allocation function described above. Optionally, each element read from memory is multiplied by a sign obtained from another independent hash function g.
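The lookup itself is a gather from the shared array. Below is a PyTorch-flavoured sketch of a ROBE-1 style embedding module, assuming a vectorised universal hash; the sign here is derived from the same hash value for brevity, whereas the text describes an independent hash g, and none of this is the authors' released kernel.

import torch
import torch.nn as nn

class RobeOneEmbedding(nn.Module):
    """ROBE-1 sketch: every scalar of every embedding is redirected into one
    shared 1-D parameter array via a universal hash."""
    def __init__(self, mem_size, emb_dim, prime=(1 << 31) - 1, seed=0):
        super().__init__()
        self.weights = nn.Parameter(0.01 * torch.randn(mem_size))   # shared memory M
        self.m, self.d, self.P = mem_size, emb_dim, prime
        g = torch.Generator().manual_seed(seed)
        self.register_buffer("A", torch.randint(1, prime, (3,), generator=g))
        self.register_buffer("B", torch.randint(0, prime, (1,), generator=g))

    def forward(self, table_id, values):
        # values: LongTensor of categorical ids, shape (batch,)
        j = torch.arange(self.d, device=values.device)               # (d,)
        z = (self.A[0] * table_id + self.A[1] * values[:, None]
             + self.A[2] * j[None, :] + self.B) % self.P
        loc = z % self.m                                             # (batch, d) locations in M
        sign = 1.0 - 2.0 * ((z // self.m) % 2).float()               # stand-in for the sign hash g
        return sign * self.weights[loc]                              # (batch, d) embeddings

# Example: a 100MB memory would correspond to ~25M float32 slots; tiny numbers used here.
emb = RobeOneEmbedding(mem_size=10_000, emb_dim=16)
vectors = emb(table_id=0, values=torch.tensor([5, 123, 5]))          # (3, 16)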

Backward Pass: With universal memory allocation, the backward pass is illustrated in Figure 2. The gradient of a single element of the shared memory is computed by accumulating the gradients of all the embedding-table elements that share that particular location.

Figure 2: Aggregation of gradients in backward pass.
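In an autograd framework this aggregation comes essentially for free: indexing a shared 1-D parameter with repeated locations accumulates gradients at those locations in the backward pass. A minimal check, reusing the RobeOneEmbedding sketch above:

emb = RobeOneEmbedding(mem_size=1024, emb_dim=16)
out = emb(table_id=0, values=torch.tensor([7, 7, 42]))   # value 7 is looked up twice
out.sum().backward()
# Slots hashed from value 7 receive the summed gradient of both occurrences
# (up to the signs and any collisions with value 42), i.e. the aggregation in Figure 2.
print(emb.weights.grad.abs().max())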

2.2 ROBE-Z: ROBE with arbitrary block size Z

ROBE-Z is analogous to ROBE-1; however, the basic allocation unit is a block of size Z instead of 1. We divide the flattened embedding table into blocks of size Z, and each block is mapped into the memory M at the location given by our memory allocation function. Note that the memory itself is not divided into blocks; hence, mapped blocks can overlap, with the overlap ranging anywhere from 0 to Z.

Memory Allocation Function: As with ROBE-1, we use universal hash functions to allocate memory in ROBE-Z. The memory allocation function for ROBE-Z is

h(i, x, j) = (h_block(B(i, x, j)) + O(i, x, j)) mod m        (2)

where B(i, x, j) is the block number to which element j of the embedding of x belongs, O(i, x, j) is the offset at which that element is located inside its block, and h_block is a universal hash mapping block ids into [0, m). In our implementation, assuming that the flattened index of element j of the embedding of value x is x·d + j, where V is the set of all values and d is the embedding size, we can compute B and O as follows:

B(i, x, j) = floor((x·d + j) / Z),    O(i, x, j) = (x·d + j) mod Z        (3)

Essentially, the start of each block is mapped randomly into the memory, and the block is then laid out linearly starting at the mapped address. If, while placing the block, we exceed the memory size m, the latter part of the block continues from the start of the array. So, logically, the memory M is a circular array. The memory allocation is illustrated in Figure 3(a).
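A minimal Python sketch of the ROBE-Z address computation, assuming the flattened-index convention above; hash_block stands in for a universal hash on block ids, and the helper names are ours, not the paper's.

def robe_z_location(flat_idx, Z, m, hash_block):
    """Map one element of the flattened parameter vector into the circular array M.

    flat_idx   : position of the element in the flattened embedding tables
    Z          : block size
    m          : size of the shared memory M
    hash_block : universal hash mapping a block id into [0, m)
    """
    block_id = flat_idx // Z                    # B(.): which block the element falls in
    offset = flat_idx % Z                       # O(.): position inside that block
    return (hash_block(block_id) + offset) % m  # wrap-around makes M a circular array

# Example: the d=16 contiguous elements of one embedding land in a few contiguous
# (possibly wrapped) slices of M, which is what makes the accesses coalesced.
d, Z, m = 16, 8, 1_000
toy_hash = lambda b: (b * 2654435761 + 97) % m        # stand-in for a universal hash
start = 123 * d                                       # flattened start of value 123's embedding
locations = [robe_z_location(start + j, Z, m, toy_hash) for j in range(d)]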

Forward and Backward Pass: The forward and backward passes are implemented in the same fashion as for ROBE-1, except that we use the hash function of Equation (2) instead of Equation (1).

2.3 Advantages of ROBE-Z

Memory Latency: The most obvious advantage of using a smaller memory for embedding tables is that it can potentially be stored closer to the processing unit. For example, embedding tables with a collective size of 100GB, when allocated a memory of 100MB (i.e., a 1000× reduction), can be stored in the last-level cache. The original model, without any memory sharing, has to be stored in RAM, or even worse, on disk.

Memory Fetches: The number of memory fetches can potentially increase when using the ROBE-Z allocation scheme. Table 1 compares the number of memory fetches under ROBE-Z against those required with the original embedding.

Consider an original embedding of size d (generally kept in multiples of the bus size). Let the bus size be L (generally 32 or 64). Thus, in order to fetch a single embedding, we require a maximum of d/L memory fetches. As we can see from Table 1, as we increase the value of Z, the number of memory fetches decreases, approaching that of the original embedding (up to one extra fetch due to misalignment) once Z is large enough. Also, as we will see in Section 3, the greater the value of Z, the better the dimensionality reduction. So it is advisable to set Z to be large.

Dimensionality Reduction: As we will see in Section 3, ROBE-Z hashing is better than ROBE-1 in terms of dimensionality reduction. As the value of Z increases, the estimate of inner products in the projected space remains unbiased while its variance decreases, until Z reaches the memory size m.

Table 1: Maximum number of memory fetches needed to read one embedding for varying block sizes, where d is the embedding size, L is the bus size, and Z is the block size; a | b denotes that a divides b. The original embedding requires at most d/L fetches; for ROBE-Z, the bounds under the different divisibility conditions on Z and L hold with high probability.

3 Theoretical Considerations

The procedure described in ROBE-Z is closely related to the sketching literature, and in particular to the area of random projections. A parameter vector can be created by joining all the flattened embedding matrices. ROBE-Z hashing essentially projects this parameter vector from an n-dimensional space into an m-dimensional space. We know from the Johnson-Lindenstrauss lemma that random projections can provide low-dimensional, low-distortion embeddings of vectors from a high-dimensional space. Feature hashing is an efficient form of random projection where the sketching matrix is sparse, i.e., each row of the matrix has exactly one non-zero entry (usually ±1) whose location is determined randomly. The sketching matrix for ROBE-Z can be visualized as shown in Figure 3(b). Using the mapping function, or alternatively the sketching matrix, we can recover the original embedding vector from the memory. In fact, in this paper, we directly learn the compressed representation of the parameter vector (and hence of the embedding tables).

Figure 3: (a) Example of allocation using ROBE-2. (b) Sketching matrices for ROBE-1 (left) and ROBE-3 (right).

We provide two analyses. The first measures the quality of the dimensionality reduction obtained with ROBE-Z hashing. This is a standard analysis along the lines of that presented in [21]. We show that ROBE-Z (with Z > 1) is better than ROBE-1, which is essentially the feature hashing described in [21]. The second analysis concerns the quality of the embedding structure maintained by ROBE-Z hashing in the projected space: we measure how the relation between two embeddings changes under this memory allocation scheme.

3.1 Dimensionality reduction: ROBE-Z beats feature hashing

In order to assess the quality of the dimensionality reduction of the parameter vector in the n-dimensional space, we look at the estimation of the inner product of two vectors in the projected space. This is a standard way to measure the preservation of distances under projection.

Let x and y be two vectors in the original n-dimensional space, and let their inner product be denoted ⟨x, y⟩. Let the ROBE-Z sketching matrix project vectors from the n-dimensional space to an m-dimensional space, and let the projected vectors be f(x) and f(y) respectively. The projections are obtained using two hash functions h and g: h is the memory allocation function described in the section above, which maps into the range [m], and g assigns a value in {+1, −1} to each index of the vector. We do not use g in practice, but we use it here to simplify the analysis. To begin with, the inner product estimator can be written as

⟨f(x), f(y)⟩ = Σ_i Σ_j g(i) g(j) x_i y_j 1[h(i) = h(j)]        (4)

where 1[·] is an indicator function. The expectation and variance of this inner product estimator are given below (Theorem 1). Here, we apply the function h directly to an index of the parameter vector, as opposed to the (i, x, j) form of Equation (2) used when dealing with embedding tables; both notations are equivalent, and we use them interchangeably depending on context.
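To make the comparison in Theorem 1 tangible, here is a small, self-contained simulation (ours, not from the paper) that sketches two fixed vectors many times with ROBE-Z style hashing and reports the empirical mean and variance of the inner-product estimator; Z = 1 corresponds to plain feature hashing, and larger Z should show the same mean with smaller variance.

import numpy as np

def robe_sketch(v, Z, m, starts, signs):
    """Place each Z-block of v at starts[block] in a circular array of size m,
    multiplying every element by its random sign (ROBE-Z style sketch)."""
    out = np.zeros(m)
    for b, s in enumerate(starts):
        idx = (s + np.arange(Z)) % m                       # circular placement of the block
        out[idx] += signs[b * Z:(b + 1) * Z] * v[b * Z:(b + 1) * Z]
    return out

rng = np.random.default_rng(0)
n, m, trials = 256, 64, 3000
x, y = rng.normal(size=n), rng.normal(size=n)
for Z in (1, 8, 32):                                       # Z = 1 is plain feature hashing
    estimates = np.empty(trials)
    for t in range(trials):
        starts = rng.integers(0, m, size=n // Z)           # block hash h, shared by x and y
        signs = rng.choice([-1.0, 1.0], size=n)            # sign hash g, shared by x and y
        estimates[t] = robe_sketch(x, Z, m, starts, signs) @ robe_sketch(y, Z, m, starts, signs)
    print(f"Z={Z:2d}: mean={estimates.mean():+.2f} (true {x @ y:+.2f}), variance={estimates.var():.1f}")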

Theorem 1

The inner product of two vectors x and y projected into the m-dimensional space using the sketching matrix for ROBE-Z with block size Z and hash functions h and g is a random variable with expectation and variance as noted below. Let B(i) denote the block id of index i as computed in Equation (2).

E[⟨f(x), f(y)⟩] = ⟨x, y⟩        (5)

Var[⟨f(x), f(y)⟩] = (1/m) Σ_{i ≠ j, B(i) ≠ B(j)} ( x_i^2 y_j^2 + x_i y_i x_j y_j )        (6)

Let V_Z(m, x, y) be the variance of the inner product between x and y while using ROBE-Z with memory size m. Then the above equation can be rewritten as follows:

V_Z(m, x, y) = V_1(m, x, y) − Σ_{k=1}^{n/Z} V_1(m, x_{((k−1)Z+1 : kZ)}, y_{((k−1)Z+1 : kZ)})        (7)

where x_{(a : b)} refers to the slice of the vector x from index a to b.

Note that V_1(m, x, y) is exactly the variance observed with the feature hashing matrix of [21]. We can clearly see that ROBE-Z has lower variance than ROBE-1; hence, ROBE-Z is better at dimensionality reduction than feature hashing.

Intuition: The results are not surprising and can be explained by observing that once we hash blocks, we ensure that elements of the parameter vector that lie within a particular block do not collide under the random projection (benign correlations due to blocking). Also, the marginal probability of collision of two elements that lie in different blocks is the same as in ROBE-1. While ROBE-Z introduces additional constraints of the form "if i and j collide, then i′ and j′ also collide whenever they lie in the same blocks as i and j respectively," these relations do not affect the variance, as can be seen in the detailed analysis in the Appendix. The exciting part is that the improved variance also comes with improved cache performance.

The analysis can be extended to obtain concentration inequalities along the lines of the analysis provided by [21]; to do so, we simply replace the variance expressions with the appropriate terms.

3.2 Embedding structure preservation under projection

Let x and y be two values and let their corresponding embedding vectors be E(x) and E(y). Let θ be the parameter vector; then E(x) and E(y) are contiguous slices of θ of length d starting at some offsets a and b respectively. We assess the quality of embedding structure preservation under the ROBE-Z projection by measuring the inner product estimation under the projection. The inner product estimator can be written as

⟨Ê(x), Ê(y)⟩ = Σ_{j=0}^{d−1} g(a + j) g(b + j) M[h(a + j)] M[h(b + j)]        (8)

where M is the compressed memory obtained by sketching θ.
Theorem 2

The inner product of the embeddings of two distinct values x and y, say ⟨E(x), E(y)⟩, when the parameter vector is projected into the m-dimensional space using ROBE-Z hashing, has the expected value shown below.

(9)
(10)

The variance for the case when the embeddings of x and y lie in separate blocks and Z > d can be expressed as

(11)

We provide the variance of the inner product estimator for a commonly occurring case; the detailed proof of Theorem 2 can be found in the Appendix. The variance for other cases can be computed similarly.

Intuition: The expected value of the inner product of two embeddings is unbiased only when the embeddings of x and y lie in the same block. This is expected: when the embeddings of x and y lie in different blocks, elements of E(x) and E(y) can potentially be mapped to the same location in memory, leading to biased estimates. Also, the variance depends on the 2-norm of the parameter vector (equivalently, the Frobenius norm of the embedding tables). Again, one can expect this, as every element of the parameter vector can potentially collide with the elements of the embeddings of x and y.

4 Experimental Results

4.1 1000× Compression of the CriteoTB MLPerf Model with AUC 0.8025 or Higher

Dataset: The CriteoTB dataset has 13 integer features and 26 categorical features with around 800 million categorical values in total. It consists of 23 days of advertising data published by Criteo. We use exactly the same training settings as prescribed for the official MLPerf version.

Model: The official MLPerf model for DLRM on CriteoTB (https://github.com/facebookresearch/dlrm/tree/6d75c84d834380a365e2f03d4838bee464157516) requires around 100GB of embedding tables and achieves the target MLPerf AUC of 0.8025 in 1 epoch. We use the same quality metric of 0.8025 AUC, as prescribed in the MLPerf settings for the CriteoTB dataset, to evaluate ROBE-Z.

Results: With ROBE-Z using 1000× less memory, i.e., only 100MB, we achieve an AUC higher than 0.8025 within 2 epochs for several block-size settings. The details are given in Table 2.

Model (100MB embedding) | Target AUC (0.8025) reached?
ROBE-Z | Yes
ROBE-Z | Yes
ROBE-Z | Yes
Table 2: CriteoTB dataset: 1000× reduced memory with ROBE-Z for varying Z. The target AUC was reached within 2 epochs for all settings.

As we can see, we achieve the same target AUC, although with almost 2× the number of training iterations. We experiment with different block sizes and find that the required quality is reached for each of them. (We plan to add more experiments on the CriteoTB dataset with larger block sizes in a follow-up version.)

The results can be reproduced using our code: the DLRM patch (https://github.com/apd10/dlrm) and ROBE-Z (https://github.com/apd10/universal_memory_allocation).

4.2 1000× Compression of Embedding Tables on the Criteo Kaggle Dataset

Dataset: The Criteo Kaggle dataset (https://www.kaggle.com/c/criteo-display-ad-challenge) has 13 integer features and 26 categorical features with 33.7M categorical values in total. It is similar to the CriteoTB dataset, but covers fewer days and uses a different sampling strategy. We split the data randomly 9:1, with the smaller partition used for testing. The training partition is further split 8:2, with the smaller partition used for validation. We use early stopping based on validation AUC to choose the model.

Models: We use six different embedding-based models from the literature: DLRM [16], DCN [20], AutoInt [18], DeepFM [7], xDeepFM [11], and FiBiNET [8]. The exact hyperparameters for the models and optimizers, the data split used for testing, and the properties of the dataset can be found in Appendix 6.4. Specifically, we use an embedding size of 16 for all categorical values (around 33.7M in total). Hence, the original models have about 540M embedding parameters. We use Adam [10] for all models except DLRM, which uses SGD as provided in the original code. In this experiment, we set the compressed memory size to 540K parameters for ROBE-Z (i.e., 1000× compression).
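The compression ratio follows directly from the sizes stated above; a quick sanity check of the arithmetic in Python:

total_values = 33_700_000          # categorical values in the Criteo Kaggle dataset
emb_dim = 16
original_params = total_values * emb_dim          # ~540M embedding parameters
robe_params = 540_000                             # compressed ROBE-Z memory used here
print(original_params)                            # 539,200,000
print(original_params / robe_params)              # ~1000x compression
print(original_params * 4 / 1e9,                  # ~2.2 GB of float32 embeddings originally
      robe_params * 4 / 1e6)                      # ~2.2 MB compressed (cf. Table 3 caption)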

Results: Table 3 shows the test AUC along with the corresponding standard deviations for all the models. The standard deviations of AUC are similar across settings, so we omit them for the remaining configurations (original and ROBE-Z for Z > 1) to save space.

Model | Original AUC | Epochs | ROBE-1 AUC (avg, seeds 1-3) | ROBE-1 AUC (stdev) | Epochs (to reach original AUC) | ROBE-2 (seed 1) | ROBE-4 (seed 1) | ROBE-8 (seed 1) | ROBE-16 (seed 1)
DLRM | 0.8031 | 1.37 | 0.8050 | 0.0001 | 3.96 | 0.8050 | 0.8049 | 0.8047 | 0.8050
DCN | 0.7973 | 1 | 0.7991 | 0.0004 | 1.8 | 0.7994 | 0.7995 | 0.7994 | 0.7993
AutoInt | 0.7968 | 1 | 0.7987 | 0.0002 | 1.62 | 0.7984 | 0.7984 | 0.7988 | 0.7985
DeepFM | 0.7957 | 1 | 0.7951 | 0.0001 | 1.99 | 0.7949 | 0.7949 | 0.7947 | 0.7950
XDeepFM | 0.8007 | 1.625 | 0.7989 | 0.0004 | 3.93 | 0.7987 | 0.7988 | 0.7990 | 0.7991
FiBiNET | 0.8016 | 3 | 0.8011 | 0.0002 | 2.99 | 0.8011 | 0.8010 | 0.8013 | 0.8012
Table 3: Criteo Kaggle dataset: test AUC of the 1000× reduced ROBE-Z models (2MB embedding) compared against the original models (2GB). "Epochs" for the ROBE-Z models measures the epochs taken to reach the AUC of the original model; if a model does not reach the original quality, we report the epochs taken to reach its best AUC. The AUC under ROBE-Z is the best AUC reached (using early stopping). While all other models overfit after some time, the DLRM models do not, and their training is cut off at 11 epochs.

We make the following observations from Table 3.

  • The test AUC of the ROBE-Z 1000× compressed model is either better than the original model (3/6 models) or similar (2/6 models). Only in the case of xDeepFM does ROBE-Z perform worse than the original model.

  • The quality of the model (i.e., AUC) reached is stable across different values of Z for ROBE-Z.

  • The number of epochs required to train a model of the same quality is larger for ROBE-Z models compared to the original models. This is in line with our observation on the CriteoTB dataset.

Our results can be reproduced using the DLRM code referenced above for the DLRM model, and the deepctr code (https://github.com/apd10/criteo_deepctr), which builds on the original DeepCTR-Torch code (https://github.com/shenweichen/DeepCTR-Torch), for the other models.

4.3 Inference Time for ROBE-Z

Model | Samples/second | Improvement
Original (100GB) | 341454 | -
ROBE-1 | 755469 | 121%
ROBE-2 | 865757 | 153%
ROBE-8 | 913893 | 167%
ROBE-32 | 920183 | 170%
Table 4: Sample throughput, measured with a batch size of 16384. The reported time includes sending the batch from RAM to GPU global memory, the forward pass on the batch, and the hash computation. There is about a 120% increase in throughput using ROBE-1, which improves further to 170% with ROBE-32. The original model is run on 4 Quadro RTX 8000 GPUs, while the ROBE-Z models are run on a single GPU. All models have access to 120 CPUs.

With our proof-of-concept code (experimental and un-optimized), we measure the inference throughput on the CriteoTB dataset and see a substantial improvement. While the original 100GB model, run on 4 Quadro RTX 8000 (46GB) GPUs, processes around 341K samples per second, the ROBE-Z models, which are only 100MB in size, are much faster. Using ROBE-1, we process about 755K samples per second. Specifically, we see about a 120% (2.2×) improvement in throughput with ROBE-1. As expected, increasing the value of Z in ROBE-Z improves the throughput further, up to a 170% (2.7×) improvement for ROBE-32.

4.4 Training Time for ROBE-Z

As noted on both datasets, ROBE-Z models are slower in terms of the number of epochs required to reach the same quality. This can potentially be explained by the fact that having more parameters to tune can significantly speed up learning. We see that for recommendation models with large embedding tables, we can achieve the same quality with a much smaller compressed ROBE-Z array given enough training time. A lot of research on recommendation models for click-through-rate (CTR) data such as Criteo restricts itself to 1 epoch of training. However, we want to stress that smaller models can potentially reach the same quality; it is just a matter of training longer.

Also, we believe that by leveraging the smaller overall embedding size, training of these smaller ROBE-Z models can still be faster than the original models in terms of wall-clock time. Our current timing experiments on proof-of-concept code (experimental and un-optimized) do not show improvements in training time. However, we plan to perform rigorous training-time experiments with optimized code in the future.

5 Conclusion

While industrial-scale recommendation models are exploding in size due to the large number of categorical features, the ROBE Array is a compelling alternative to embedding tables, enabling models with far less memory to reach the same quality. The ROBE Array also shows a clear inference-throughput benefit and can potentially be trained much faster than the original models. Moreover, training models with the ROBE Array is accessible to an average DLRM user who does not have access to the high-end hardware or the engineering expertise required to train models that are hundreds of terabytes in size. We believe DLRM with the ROBE Array will serve as a new baseline for compression and expedite research in recommendation models.

References

  • [1] W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen (2015) Compressing neural networks with the hashing trick. In International conference on machine learning, pp. 2285–2294. Cited by: §1, §2.1.
  • [2] H. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, et al. (2016) Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems, pp. 7–10. Cited by: §1.
  • [3] W. Cheng, Y. Shen, and L. Huang (2020) Differentiable neural input search for recommender systems. arXiv preprint arXiv:2006.04466. Cited by: §1.
  • [4] A. Desai, Y. Pan, K. Sun, L. Chou, and A. Shrivastava (2021) Semantically constrained memory allocation (scma) for embedding in efficient recommendation systems. arXiv preprint arXiv:2103.06124. Cited by: §1.
  • [5] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia (2014) A survey on concept drift adaptation. ACM computing surveys (CSUR) 46 (4), pp. 1–37. Cited by: §1.
  • [6] A. Ginart, M. Naumov, D. Mudigere, J. Yang, and J. Zou (2019) Mixed dimension embeddings with application to memory-efficient recommendation systems. Note: arXiv:1909.11810 Cited by: §1, §1, item 1, §6.4.
  • [7] H. Guo, R. Tang, Y. Ye, Z. Li, and X. He (2017) DeepFM: a factorization-machine based neural network for ctr prediction. arXiv preprint arXiv:1703.04247. Cited by: §1, §4.2.
  • [8] T. Huang, Z. Zhang, and J. Zhang (2019) FiBiNET: combining feature importance and bilinear feature interaction for click-through rate prediction. In Proceedings of the 13th ACM Conference on Recommender Systems, pp. 169–177. Cited by: §1, §4.2.
  • [9] M. R. Joglekar, C. Li, M. Chen, T. Xu, X. Wang, J. K. Adams, P. Khaitan, J. Liu, and Q. V. Le (2020) Neural input search for large scale recommendation models. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2387–2397. Cited by: §1.
  • [10] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.
  • [11] J. Lian, X. Zhou, F. Zhang, Z. Chen, X. Xie, and G. Sun (2018) Xdeepfm: combining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1754–1763. Cited by: §1, §4.2.
  • [12] H. Liu, X. Zhao, C. Wang, X. Liu, and J. Tang (2020) Automated embedding size search in deep recommender systems. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2307–2316. Cited by: §1.
  • [13] S. Liu, C. Gao, Y. Chen, D. Jin, and Y. Li (2021) Learnable embedding sizes for recommender systems. arXiv preprint arXiv:2101.07577. Cited by: §1.
  • [14] T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. Note: arXiv:1301.3781 Cited by: §1.
  • [15] D. Mudigere, Y. Hao, J. Huang, A. Tulloch, S. Sridharan, X. Liu, M. Ozdal, J. Nie, J. Park, L. Luo, et al. (2021) High-performance, distributed training of large-scale deep learning recommendation models. arXiv preprint arXiv:2104.05158. Cited by: §1.
  • [16] M. Naumov, D. Mudigere, H. M. Shi, J. Huang, N. Sundaraman, J. Park, X. Wang, U. Gupta, C. Wu, A. G. Azzolini, D. Dzhulgakov, A. Mallevich, I. Cherniavskii, Y. Lu, R. Krishnamoorthi, A. Yu, V. Kondratenko, S. Pereira, X. Chen, W. Chen, V. Rao, B. Jia, L. Xiong, and M. Smelyanskiy (2019) Deep learning recommendation model for personalization and recommendation systems. Note: arXiv:1906.00091 Cited by: §1, §1, §1, §4.2.
  • [17] H. M. Shi, D. Mudigere, M. Naumov, and J. Yang (2019) Compositional embeddings using complementary partitions for memory-efficient recommendation systems. Note: arXiv:1909.02107 Cited by: §1, §1, item 1, §6.4.
  • [18] W. Song, C. Shi, Z. Xiao, Z. Duan, Y. Xu, M. Zhang, and J. Tang (2019) Autoint: automatic feature interaction learning via self-attentive neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1161–1170. Cited by: §1, §4.2.
  • [19] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. Note: arXiv:1706.03762 Cited by: §1.
  • [20] R. Wang, B. Fu, G. Fu, and M. Wang (2017) Deep & cross network for ad click predictions. In Proceedings of the ADKDD, Cited by: §1, §4.2.
  • [21] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg (2009) Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, New York, NY, USA, pp. 1113–1120. External Links: ISBN 9781605585161, Link, Document Cited by: §1, §3.1, §3.1, §3, §6.2.
  • [22] C. Yin, B. Acun, C. Wu, and X. Liu (2021) TT-Rec: tensor train compression for deep learning recommendation models. Proceedings of Machine Learning and Systems 3. Cited by: §1, §1.
  • [23] X. Zhao, C. Wang, M. Chen, X. Zheng, X. Liu, and J. Tang (2020) Autoemb: automated embedding dimensionality search in streaming recommendations. arXiv preprint arXiv:2002.11252. Cited by: §1.

6 Appendix

6.1 ROBE-Z

The parameter vector is constructed by flattening each embedding table row-wise and concatenating all the embedding tables. Let all the embeddings be of dimension d and let the chunk (block) size be Z.

ROBE-Z is performed as follows (a code sketch is given after this list):

  • Split the parameter vector into chunks of size Z.

  • Hash each chunk to a location in the array of size m.

  • The chunk is added to the corresponding sub-array of the memory; if it runs past the end of the array, we wrap around and continue from the beginning. So the array into which we sketch the parameter vector is effectively a circular array.

  • Each element is multiplied by a sign obtained from another hash function g(), applied at the element level. We do not use the sign in our experiments; however, it can be used, and it greatly simplifies the theory.
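A compact way to view the whole procedure, used in Section 3, is as an explicit sparse sketching matrix with one ±1 entry per row (Figure 3(b)). The sketch below builds that matrix for a toy parameter vector; the function name and random-hash stand-ins are ours, and in practice the matrix is never materialised.

import numpy as np

def robe_z_sketch_matrix(n, m, Z, rng):
    """Explicit n x m sketching matrix for ROBE-Z: row i has a single entry
    g(i) in {-1, +1} at column (h(block(i)) + offset(i)) mod m."""
    S = np.zeros((n, m))
    starts = rng.integers(0, m, size=n // Z)   # h: block id -> random start in the circular array
    signs = rng.choice([-1.0, 1.0], size=n)    # g: per-element sign
    for i in range(n):
        S[i, (starts[i // Z] + i % Z) % m] = signs[i]
    return S

rng = np.random.default_rng(1)
n, m, Z = 24, 8, 4
theta = rng.normal(size=n)                 # flattened embedding tables (parameter vector)
S = robe_z_sketch_matrix(n, m, Z, rng)
memory = theta @ S                         # the m-dimensional array that is actually learned
element_5 = S[5] @ memory                  # how a forward pass reads back element 5 (up to collisions)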

6.2 Analysis 1: Quality of dimensionality reduction (feature hashing)

Let the parameter vector be an n-dimensional vector. The projection maps this vector into an m-dimensional space.

Let f(x) and f(y) be the projections of x and y. The estimator we want to analyse is that of the inner product ⟨x, y⟩. The estimator can be written in terms of indicator functions as follows:

(12)

We can simplify the above indicator as

(13)
(14)

Let B(i) be the chunk id assigned to index i, following the notation used in Equation (2). Then, we know that the indicator is 0 whenever i ≠ j but B(i) = B(j), i.e., two distinct elements of the same chunk never collide. Using this fact, we have

(15)

We can easily see that this estimator of ⟨x, y⟩ is unbiased. Let us now look at the second moment:

(16)

The variance of the estimator can be computed as

(17)
(18)

The expected value of a term in the summation above is non-zero only if the indices are equal in pairs, so that the g(·) factors are eliminated; since i cannot equal j, the pairing must occur across the two factors. This implies that

(19)

Note that some additional constraints appear when we do chunk hashing: if i and j collide, and i′ and j′ lie within the same chunks as i and j respectively, then i′ and j′ will also collide. But, as can be seen, this relation does not appear in the expression for the variance; it may appear in higher moments of the estimator.

Using the fact that two indices from different chunks collide with probability 1/m, we can simplify the expression above as

(20)

Note that when Z = 1, the expression for the variance is exactly that of the random projection used for "feature hashing" as proposed by [21].

Let us denote by V_Z(m, x, y) the variance of the inner product of vectors x and y projected from dimension n to m while using chunk size Z. We use this notation so that our statements are precise.

Note that

(21)

where x^{(k)}_i is an element of the sub-vector x^{(k)}, which refers to the k-th chunk of the parameter vector.

(22)

It is clear from the above equation that ROBE-Z has lower variance than feature hashing (ROBE-1).

6.3 Effect of ROBE-Z on the inner product of embeddings of two values, i.e., parts of the parameter vector that are identified as two separate embeddings

The previous analysis examined the sketching matrix and how well it preserves distances. However, another important aspect of this projection, pertinent to the discussion in this paper, is how projecting the entire parameter vector affects the relation between the embeddings of two different values.

Consider how the parameter vector is constructed: we flatten each embedding table row-wise (so each embedding is contiguous) and concatenate all flattened embedding tables together to obtain a single parameter vector, which is then projected down.

There are a few cases that we need to check. We will assume that either Z divides d, or d divides Z (if Z > d); also, both Z and d divide n. Let the embeddings of the two values under consideration be E(x) and E(y).

  • E(x) and E(y) lie in the same chunk

  • E(x) and E(y) lie in different chunks

CASE: E(x) and E(y) lie in the same chunk.

Let us first look at the product of two elements of the recovered embeddings.

(23)
(24)
(25)
(26)

Hence,

(27)

CASE: E(x) and E(y) lie in different chunks.

Let us first look at the product of two elements of the recovered embeddings.

(28)
(29)
(30)
(31)

Hence,

(32)

Variance:

We will analyse the variance for the specific case of Z > d where the two values have embeddings in different blocks; other cases can be computed similarly. The variance expands as

(33)
(34)

separating the cases on the index tuple (i1, i2, i3, i4) enumerated in Table 5 from the others,

(35)
(36)

We analyse the case where the embeddings of x and y lie in separate blocks and Z > d.

Case | Indices | Term | Factor
j1 = j2 | i1 = i2 = i3 = i4 | t^4 | 1/m^2
j1 = j2 | i1 = i2 != i3 = i4 | t1^2 t2^2 | 1/m^4
j1 = j2 | i1 = i3 != i2 = i4 | t1^2 t2^2 | 1/m^2
j1 = j2 | i1 = i4 != i2 = i3 | t1^2 t2^2 | 1/m^4
j1 != j2 | i1 = y+j1, i3 = y+j2, i2 = x+j1, i4 = x+j2 | x1 y1 x2 y2 | 1/m^2
j1 != j2 | i1 = y+j1, i3 = y+j2, i2 = x+j2, i4 = x+j1 | x1 y1 x2 y2 | 1/m^3
j1 != j2 | i1 = y+j2, i3 = y+j1, i2 = x+j1, i4 = x+j2 | x1 y1 x2 y2 | 1/m^3
j1 != j2 | i1 = y+j2, i3 = y+j1, i2 = x+j2, i4 = x+j1 | x1 y1 x2 y2 | 1/m^2

Comments: For j1 = j2, the values of i1, i2, i3, i4 must be equal in pairs (or all equal) and must not match their own elements (i.e., i1 != x + j1, etc.); here t denotes a typical element of the parameter vector. For j1 != j2, the values of i1, i2, i3, i4 must equal the 4 distinct elements (2 of x and 2 of y) without matching their own elements; for the indicators to be non-zero, i1 cannot be x + j1, and it cannot be x + j2 either, since the indicator would then always be 0 (as Z > d, h(x + j2) cannot equal h(x + j1)). Hence i1 can equal y + j1 or y + j2. Here x and y denote the embedding vectors.

Table 5: Cases for x and y lying in different blocks with Z > d, so that each embedding lies completely inside a block.

For convenience, we use a shorthand for the elements of x, and similarly for y.

(37)

We use an inequality ("less than") here because not all index combinations are present in the actual summation.

(38)

6.4 Criteo Kaggle Experiment

The experimental settings are described in detail below

Dataset: We choose the Criteo Kaggle dataset in order to demonstrate the compressive power of UMA (universal memory allocation, i.e., the ROBE Array). The original Criteo dataset has 13 integer features and 26 categorical features. The total number of categorical feature values is around 33M; the breakdown per feature is as follows.

counts:
[1460, 583, 10131227, 2202608, 305, 24, 12517, 633, 3, 93145, 5683, 8351593,
 3194, 27, 14992, 5461306, 10, 5652, 2173, 4, 7046547, 18, 15, 286181, 105,
 142572]

We do not perform any rare-feature filtering, which would reduce the number of categorical values, as is done in papers reporting SOTA values for the Criteo dataset. We want to demonstrate an algorithm that can deal with large embedding tables (e.g., the terabyte-sized embedding tables observed in industry), and we therefore use the entire feature set of the Criteo dataset to make a reasonably sized benchmark. This is also standard practice in papers dealing with compression or efficient embedding-table research [6, 17].

Hyperparameters chosen for different models. The hyperparameters for the different models were chosen as specified in their respective original papers. We did not do hyperparameter tuning for UMA, as the results sufficiently support our hypothesis that these memory allocation approaches are valuable for efficiently learning compressed embedding tables. We fix the batch size to 2048 for all training runs and cut off training at 300K iterations for all models except DLRM, which is run until 540K iterations (due to SGD). All models use embedding size 16.

Model | Architecture | Dropout | L2 regularization | Optimizer | Learning rate
DLRM | bot: 13-512-256-64-16, top: 512-256-1 | 0 | 0 | SGD | 1.0
DCN | 1024-1024-1024-1024 | 0 | 0 | ADAM | 0.001
AutoInt | 400-400-400, attention_embedding_size: 32 | 0 | 0 | ADAM | 0.001
DeepFM | 400-400-400 | 0.5 | 0 | ADAM | 0.001
XDeepFM | dnn: 400-400-400, cross interaction: 200-200-200 | 0.5 | 0.0001 | ADAM | 0.001
FiBiNET | 400-400-400 | 0.5 | 0.0001 | ADAM | 0.001
Table 6: Hyperparameters of the different models, chosen as per the specifications in their respective papers. The code used for the DLRM model is https://github.com/facebookresearch/dlrm; for the other models, the code used is https://github.com/shenweichen/DeepCTR-Torch.

Train/Test data split. We use the following code to split the data into random training and testing sets for all models except DLRM. For DLRM, the provided code has its own data loader, and we make no changes there.

from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.1, random_state=2020)

Choosing models while training. We observe that most models (especially the original models) start to overfit after some iterations. Hence, we use early stopping based on validation AUC to select the final model.

6.4.1 FAQ on Kaggle Experiment:

  1. Why do we not perform rare-feature filtering? We do not perform any rare-feature filtering, which would reduce the number of categorical values, as is done in papers reporting SOTA values for the Criteo dataset. We want to demonstrate an algorithm that can deal with large embedding tables (e.g., the terabyte-sized embedding tables observed in industry), and we therefore use the entire feature set of the Criteo dataset to make a reasonably sized benchmark. This is also standard practice in papers dealing with compression or efficient embedding-table research [6, 17].

  2. Why do we not achieve the SOTA results reported in the original papers? Reproducing the results in the original papers would require access to their code, any dataset preprocessing they perform, and the dataset split they use to create the training and testing sets. Unfortunately, most papers do not make their code public and do not specify the random seeds used to split the data. In our experiments, we use random seed 2020 for all models except DLRM, which has its own random splitter in the provided code. Apart from this, we also do not perform rare-feature filtering, which might affect the results. However, our experiments on some models with rare-feature filtering showed that it does not improve the performance of the original models.

  3. Why do we use a fixed embedding size for models like DCN, which prescribe custom embedding sizes? While it is true that the DCN model specifies a custom way to choose the embedding size based on the number of values in a category, using that formula leads to very large embedding tables for the Criteo dataset with full features (no rare-feature filtering). Hence, it is not possible to use the custom formula in our case. We uniformly set the embedding size to 16 across the different models.