Compositional Embeddings Using Complementary Partitions for Memory-Efficient Recommendation Systems

09/04/2019 ∙ by Hao-Jun Michael Shi, et al. ∙ Northwestern University

Modern deep learning-based recommendation systems exploit hundreds to thousands of different categorical features, each with millions of different categories ranging from clicks to posts. To respect the natural diversity within the categorical data, embeddings map each category to a unique dense representation within an embedded space. Since each categorical feature could take on as many as tens of millions of different possible categories, the embedding tables form the primary memory bottleneck during both training and inference. We propose a novel approach for reducing the embedding size in an end-to-end fashion by exploiting complementary partitions of the category set to produce a unique embedding vector for each category without explicit definition. By storing multiple smaller embedding tables based on each complementary partition and combining embeddings from each table, we define a unique embedding for each category at smaller cost. This approach may be interpreted as using a specific fixed codebook to ensure uniqueness of each category's representation. Our experimental results demonstrate the effectiveness of our approach over the hashing trick for reducing the size of the embedding tables in terms of model loss and accuracy, while retaining a similar reduction in the number of parameters.


1 Introduction

The design of modern deep learning-based recommendation models (DLRMs) is challenging because of the need to handle a large number of categorical (or sparse) features. For personalization or click-through rate (CTR) prediction tasks, examples of categorical features could include users, posts, or pages, with hundreds or thousands of these different features (Naumov et al., 2019). Within each categorical feature, the set of categories could take on many diverse meanings. For example, social media pages could contain topics ranging from sports to movies.

In order to exploit this categorical information, DLRMs utilize embeddings to map each category to a unique dense representation in an embedded space; see Cheng et al. (2016); He et al. (2017); Wang et al. (2017); Guo et al. (2017); Lian et al. (2018); Zhou et al. (2018b, c); Naumov (2019). More specifically, given a category set $S$ with cardinality $|S|$, each categorical instance is mapped to an indexed row vector in an embedding table $W \in \mathbb{R}^{|S| \times D}$, as shown in Figure 1. Rather than predetermining the embedding weights, it has been found that jointly training the embeddings with the rest of the neural network is effective in producing accurate models.

Each categorical feature, however, could take on as many as tens of millions of different possible categories (i.e., $|S|$ on the order of $10^7$), with a comparatively small embedding vector dimension $D$. Because of the vast number of categories, the number of embedding vectors forms the primary memory bottleneck within both DLRM training and inference, since each table could require multiple GBs to store.[1]

[1] Note that the relative dimensionality significantly differs from traditional language models, which use embedding vectors of length 100 to 500, with dictionaries of a maximum of hundreds of thousands of words.

One natural approach for reducing memory requirements is to decrease the size of the embedding tables by defining a hash function (such as a remainder function) that maps each category to an embedding index, where the number of embedding rows $m$ is strictly smaller than the number of categories, $m \ll |S|$ (Weinberger et al., 2009).[2] However, this approach may blindly map vastly different categories to the same embedding vector, resulting in loss of information and deterioration in model quality. Ideally, one ought to reduce the size of the embedding tables while still producing a unique representation for each category in order to respect the natural diversity of the data.

[2] We consider the case where the hashing trick is used primarily for reducing the number of categories. In practice, one may hash the categories for indexing and randomization purposes, then apply a remainder function to reduce the number of categories. Our proposed technique applies to the latter case.

Figure 1: An embedding table.

In addition, DLRMs do not currently exploit the natural partitionable or hierarchical structure that exists within categorical data; in particular, if a categorical feature is naturally partitioned in multiple ways, current models do not capitalize on these partitions to improve model performance or reduce model complexity. Consider, for example, a categorical feature consisting of different cars as part of a dataset to predict the CTR of a particular advertisement. Each car consists of a different make, type, color, year, etc. These properties generate natural partitions of the set of cars. This structure ought to be exploited to reduce the dimensionality of the model while still constructing meaningful representations of the categories.

In this paper, we propose an approach for generating a unique embedding for each categorical feature by using complementary partitions of the category set to generate compositional embeddings, which interact multiple smaller embeddings to produce a final embedding. These complementary partitions could be obtained from inherent characteristics of the categorical data, or enforced artificially to reduce model complexity. We propose concrete methods for artificially defining these complementary partitions and demonstrate their usefulness on modified Deep and Cross (DCN) (Wang et al., 2017) and Facebook DLRM networks (Naumov et al., 2019) on the Kaggle Criteo Ad Display Challenge dataset. These methods are simple to implement, compress the model for both training and inference, do not require any additional pre- or post-training processing, and better preserve model quality than the hashing trick.

1.1 Related Work

Wide and deep models (Cheng et al., 2016) jointly train both a deep network and linear model to combine the benefits of memorization and generalization for recommendation systems. Factorization machines (Rendle, 2010, 2012) played a key role in the next step of development of DLRMs by identifying that sparse features (induced by nominal categorical data) could be appropriately exploited by interacting different dense representations of sparse features with an inner product to produce meaningful higher-order terms. Generalizing this observation, some recommendation models (Guo et al., 2017; He et al., 2017; Wang et al., 2017; Lian et al., 2018; Zhou et al., 2018c) jointly train a deep network with a specialized model in order to directly capture higher-order interactions in an efficient manner. The Facebook DLRM network (Naumov et al., 2019) mimics factorization machines more directly by passing the pairwise dot product between different embeddings into a multilayer perceptron (MLP). More sophisticated techniques that incorporate trees, memory, and (self-)attention mechanisms (to capture sequential user behavior) have also been proposed (Zheng et al., 2018; Zhou et al., 2018a, b; Zhu et al., 2018).

Towards the design of the embeddings, Naumov (2019) proposed an approach for prescribing the embedding dimension based on the amount of information entropy contained within each categorical feature. Yin and Shen (2018) used perturbation theory for matrix factorization problems to similarly analyze the effect of the embedding dimension on the quality of classification. These methods focused primarily on the choice of $D$.

Much recent work on model compression also uses compositional embeddings to reduce model complexity; see Shu and Nakayama (2017); Chen et al. (2018). Most of these approaches require learning and storing discrete codes, similar to the idea of product quantization (Jegou et al., 2010), where each category's index $i$ is mapped to its corresponding embedding indices $(i_1, \dots, i_k)$. In order to learn these codes during training, one is required to store them, hence requiring a number of additional parameters on the order of $|S|$, with only the potential to decrease $D$. Since $D \ll |S|$ in recommendation systems, these approaches unfortunately remain ineffective in our setting.

Unlike prior approaches that focus on reducing $D$, our method seeks to directly reduce the size of the embedding tables using fixed codes that do not require additional storage, while enforcing uniqueness of the final embedding. Related work by Khrulkov et al. (2019) may be interpreted as a specific operation applied to each element of the embedding table, similar to the framework described here.

1.2 Main Contributions

Our main contributions in this paper are as follows:

  • We propose a novel modification for reducing the size of the embedding tables while still yielding a unique embedding vector for each category. The trick uses both the quotient and remainder functions to produce two different embeddings, then combines these embeddings to yield the final embedding, called a compositional embedding. This reduces the number of embedding parameters from $O(|S| D)$ to $O(\sqrt{|S|}\, D)$, where $|S|$ is the number of categories and $D$ is the embedding dimension.

  • We abstract this approach to compositional embeddings based on complementary partitions of the category set. Complementary partitions require each category to be distinct from every other category according to at least one partition. This has the potential to reduce the number of embedding parameters to $O(k |S|^{1/k} D)$, where $k$ is the number of partitions.

  • The experimental results demonstrate that our compositional embeddings yield better performance than the hashing trick, which is commonly used in practice. Although the best operation for defining the compositional embedding may vary, the element-wise multiplication operation produces embeddings that are most scalable and effective in general.

Section 2 will provide a simple example to motivate our framework for reducing model complexity by introducing the quotient-remainder trick. In Section 3, we will define complementary partitions, and provide some concrete examples that are useful in practice. Section 4 describes our proposed idea of compositional embeddings and clarifies the tradeoffs between this approach and using a full embedding table. Lastly, Section 5 gives our experimental results.

2 A Simple Example

Recall that in the typical DLRM setup, each category is mapped to a unique embedding vector in the embedding table. Mathematically, consider a single categorical feature with category set $S$, and let $\epsilon : S \to \{0, 1, \dots, |S| - 1\}$ denote an enumeration of $S$.[3] Let $W \in \mathbb{R}^{|S| \times D}$ be its corresponding embedding matrix or table, where $D$ is the dimension of the embeddings. We may encode each category (say, category $x \in S$ with index $i = \epsilon(x)$) with a one-hot vector $e_i \in \mathbb{R}^{|S|}$, then map this to a dense embedding vector $x_{\text{emb}} \in \mathbb{R}^{D}$ by

$x_{\text{emb}} = W^T e_i.$    (1)

Alternatively, the embedding may also be interpreted as a simple row lookup of the embedding table, i.e. $x_{\text{emb}} = W_{i,:}$. Note that this yields a memory complexity of $O(|S| D)$ for storing the embeddings, which becomes restrictive when $|S|$ is large.

[3] As an example, if the set of categories consists of $S = \{\text{dog}, \text{cat}, \text{mouse}\}$, then a potential enumeration of $S$ is $\epsilon(\text{dog}) = 0$, $\epsilon(\text{cat}) = 1$, and $\epsilon(\text{mouse}) = 2$.

The naive approach of reducing the embedding table is to use a simple hash function (Weinberger et al., 2009), such as the remainder function; this is called the hashing trick. In particular, given an embedding table $\tilde{W} \in \mathbb{R}^{m \times D}$ where $m \ll |S|$, one can define a hash matrix $R \in \mathbb{R}^{m \times |S|}$ by:

$R_{i,j} = 1$ if $j \bmod m = i$, and $0$ otherwise.    (2)

Then the embedding is performed by:

$x_{\text{emb}} = \tilde{W}^T R e_i.$    (3)

This process is summarized in Algorithm 1.

Input:  Embedding table $\tilde{W} \in \mathbb{R}^{m \times D}$, category $x \in S$
  Determine index $i = \epsilon(x)$ of category $x$.
  Compute hash index $j = i \bmod m$.
  Look up embedding $x_{\text{emb}} = \tilde{W}_{j,:}$.
Algorithm 1 Hashing Trick
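For concreteness, the following is a minimal sketch of this lookup in PyTorch (not taken from any reference implementation); the class name and table sizes are illustrative, and `nn.Embedding` plays the role of the reduced table $\tilde{W}$.

```python
import torch
import torch.nn as nn

class HashedEmbedding(nn.Module):
    """Hashing trick (Algorithm 1): category index i shares row (i mod m)."""
    def __init__(self, num_categories, m, dim):
        super().__init__()
        assert m < num_categories
        self.m = m
        self.table = nn.Embedding(m, dim)   # plays the role of W~ in R^{m x D}

    def forward(self, idx):
        # idx: LongTensor of category indices in {0, ..., |S| - 1}
        return self.table(idx % self.m)

emb = HashedEmbedding(num_categories=10_000_000, m=1_000_000, dim=16)
vecs = emb(torch.tensor([3, 1_000_003]))    # a collision: both map to row 3
```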

Although this approach significantly reduces the size of the embedding matrix from $O(|S| D)$ to $O(m D)$ since $m \ll |S|$, it naively maps multiple categories to the same embedding vector, resulting in loss of information and rapid deterioration in model quality. The key observation is that this approach does not yield a unique embedding for each unique category and hence does not respect the natural diversity of the categorical data in recommendation systems.

To overcome this, we propose using the quotient-remainder trick. Assume for simplicity that $m$ divides $|S|$ (although this does not have to hold in order for the trick to be applied). Let “$\backslash$” denote integer division, or the quotient operation. Using two complementary functions – the integer quotient and remainder functions – we can produce two separate embedding tables and combine the embeddings in such a way that a unique embedding for each category is produced. This is formalized in Algorithm 2.

Input:  Embedding tables $W_1 \in \mathbb{R}^{(|S|/m) \times D}$ and $W_2 \in \mathbb{R}^{m \times D}$, category $x \in S$
  Determine index $i = \epsilon(x)$ of category $x$.
  Compute hash indices $j = i \,\backslash\, m$ and $k = i \bmod m$.
  Look up embeddings $x_{\text{quo}} = (W_1)_{j,:}$ and $x_{\text{rem}} = (W_2)_{k,:}$.
  Compute $x_{\text{emb}} = x_{\text{quo}} \odot x_{\text{rem}}$.
Algorithm 2 Quotient-Remainder Trick

More rigorously, define two embedding matrices $W_1 \in \mathbb{R}^{(|S|/m) \times D}$ and $W_2 \in \mathbb{R}^{m \times D}$. Then define an additional hash matrix $Q \in \mathbb{R}^{(|S|/m) \times |S|}$ by

$Q_{i,j} = 1$ if $j \,\backslash\, m = i$, and $0$ otherwise.    (4)

Then we obtain our embedding by

$x_{\text{emb}} = (W_1^T Q e_i) \odot (W_2^T R e_i),$    (5)

where $\odot$ denotes element-wise multiplication. This trick results in a memory complexity of $O\!\left(\frac{|S|}{m} D + m D\right)$, a slight increase in memory compared to the hashing trick, but with the benefit of producing a unique representation for each category. We demonstrate the usefulness of this method in our experiments in Section 5.
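As an illustration, a minimal PyTorch sketch of the quotient-remainder trick with the element-wise multiplication combiner follows; the class name, the choice $m \approx \sqrt{|S|}$, and the example sizes are illustrative rather than taken from a reference implementation.

```python
import torch
import torch.nn as nn

class QREmbedding(nn.Module):
    """Quotient-remainder trick (Algorithm 2) with element-wise multiplication."""
    def __init__(self, num_categories, m, dim):
        super().__init__()
        self.m = m
        num_quotients = (num_categories + m - 1) // m    # ceil(|S| / m)
        self.W_q = nn.Embedding(num_quotients, dim)      # quotient table
        self.W_r = nn.Embedding(m, dim)                  # remainder table

    def forward(self, idx):
        # idx: LongTensor of category indices in {0, ..., |S| - 1}
        q = torch.div(idx, self.m, rounding_mode="floor")  # i \ m
        r = idx % self.m                                   # i mod m
        return self.W_q(q) * self.W_r(r)                   # element-wise product

# Each of the 10M categories receives a distinct (quotient, remainder) pair,
# using roughly 2*sqrt(|S|) rows instead of |S| rows when m ~ sqrt(|S|).
emb = QREmbedding(num_categories=10_000_000, m=3163, dim=16)
vecs = emb(torch.tensor([0, 1, 9_999_999]))
```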

3 Complementary Partitions

The quotient-remainder trick is, however, only a single example of a more general framework for decomposing embeddings. Note that in the quotient-remainder trick, each operation (the quotient or remainder) partitions the set of categories into multiple “buckets” such that every index in the same “bucket” is mapped to the same vector. However, by combining embeddings from both the quotient and remainder together, one is able to generate a distinct vector for each index.

Similarly, we want to ensure that each element in the category set may produce its own unique representation, even across multiple partitions. Using basic set theory, we formalize this concept into a notion that we call complementary partitions. Let $[x]_P$ denote the equivalence class of $x \in S$ induced by the partition $P$.[4]

[4] We slightly abuse notation by denoting the equivalence class by its partition rather than its equivalence relation for simplicity. For more details on set partitions, equivalence classes, and equivalence relations, please refer to the Appendix.

Definition 1.

Given set partitions $P_1, P_2, \dots, P_k$ of the set $S$, the set partitions are complementary if, for all $x, y \in S$ such that $x \neq y$, there exists an $i$ such that $[x]_{P_i} \neq [y]_{P_i}$.

As a concrete example, consider the set $S = \{0, 1, 2, 3, 4\}$. Then the following three set partitions are complementary: $\{\{0\}, \{1, 3, 4\}, \{2\}\}$, $\{\{0, 1, 3\}, \{2, 4\}\}$, and $\{\{0, 3\}, \{1, 2, 4\}\}$. In particular, one can check that each element is distinct from every other element according to at least one of these partitions.
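The definition can be checked mechanically; the following small Python sketch (illustrative, not part of the paper) verifies Definition 1 for the partitions in the example above.

```python
from itertools import combinations

def equivalence_class(partition, x):
    """Return the block of the partition containing x."""
    return next(block for block in partition if x in block)

def are_complementary(partitions, S):
    """Definition 1: every pair of distinct elements is separated by at
    least one of the partitions."""
    return all(
        any(equivalence_class(P, x) != equivalence_class(P, y) for P in partitions)
        for x, y in combinations(S, 2)
    )

S = {0, 1, 2, 3, 4}
P1 = [{0}, {1, 3, 4}, {2}]
P2 = [{0, 1, 3}, {2, 4}]
P3 = [{0, 3}, {1, 2, 4}]
print(are_complementary([P1, P2, P3], S))   # True
print(are_complementary([P2, P3], S))       # False: 0 and 3 are never separated
```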

Note that each equivalence class of a given partition designates a “bucket” that is mapped to an embedding vector. Hence, each partition corresponds to a single embedding table. Under complementary partitions, after each embedding arising from each partition is combined through some operation, each index is mapped to a distinct embedding vector, as we will see in Section 4.

3.1 Examples of Complementary Partitions

Using this definition of complementary partitions, we can abstract the quotient-remainder trick and consider other, more general complementary partitions. These examples are proved to be complementary in the Appendix. For notational simplicity, we denote the set $\mathcal{E}(n) = \{0, 1, \dots, n - 1\}$ for a given $n \in \mathbb{N}$.

  1. Naive Complementary Partition: If

     $P = \{ \{x\} : x \in S \},$

     then $P$ is a complementary partition by definition. This corresponds to a full embedding table with dimension $|S| \times D$.

  2. Quotient-Remainder Complementary Partitions: Given $m \in \mathbb{N}$, the partitions

     $P_1 = \{ \{x \in S : \epsilon(x) \,\backslash\, m = l\} : l \in \mathcal{E}(\lceil |S|/m \rceil) \}$ and
     $P_2 = \{ \{x \in S : \epsilon(x) \bmod m = l\} : l \in \mathcal{E}(m) \}$

     are complementary. This corresponds to the quotient-remainder trick in Section 2.

  3. Generalized Quotient-Remainder Complementary Partitions: Given $m_i \in \mathbb{N}$ for $i = 1, \dots, k$ such that $\prod_{i=1}^{k} m_i \geq |S|$, we can recursively define complementary partitions

     $P_1 = \{ \{x \in S : \epsilon(x) \bmod m_1 = l\} : l \in \mathcal{E}(m_1) \}$ and
     $P_j = \{ \{x \in S : (\epsilon(x) \,\backslash\, M_j) \bmod m_j = l\} : l \in \mathcal{E}(m_j) \},$

     where $M_j = \prod_{i=1}^{j-1} m_i$ for $j = 2, \dots, k$. This generalizes the quotient-remainder trick.

  4. Chinese Remainder Partitions: Consider a pairwise coprime factorization greater than or equal to $|S|$, that is, $\prod_{i=1}^{k} m_i \geq |S|$ for $m_i \in \mathbb{N}$ for all $i$, and $\gcd(m_i, m_j) = 1$ for all $i \neq j$. Then we can define the complementary partitions

     $P_j = \{ \{x \in S : \epsilon(x) \bmod m_j = l\} : l \in \mathcal{E}(m_j) \}$

     for $j = 1, \dots, k$.

More arbitrary complementary partitions could also be defined depending on the application. Returning to our car example, one could define different partitions based on the year, make, type, etc. Assuming that the unique specification of these properties yields a unique car, these partitions would indeed be complementary. In the following section, we will demonstrate how to exploit this structure to reduce memory complexity.
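As a sketch of how such partitions translate into index functions, the following Python snippet (illustrative; the moduli are arbitrary choices) maps a category index to its equivalence-class index under the generalized quotient-remainder and Chinese remainder constructions and checks that the resulting tuples are unique.

```python
import math

def quotient_remainder_indices(i, ms):
    """Generalized quotient-remainder partitions (example 3): map a category
    index i to its equivalence-class index in each partition."""
    idxs, M = [], 1
    for m in ms:
        idxs.append((i // M) % m)
        M *= m
    return tuple(idxs)

def chinese_remainder_indices(i, ms):
    """Chinese remainder partitions (example 4); requires pairwise coprime ms."""
    return tuple(i % m for m in ms)

# Sanity check on a small category set: with prod(ms) >= |S|, every category
# receives a distinct tuple of per-partition indices (a unique "code").
num_categories, ms = 1000, (8, 5, 27)          # 8 * 5 * 27 = 1080 >= 1000
assert math.prod(ms) >= num_categories
for index_fn in (quotient_remainder_indices, chinese_remainder_indices):
    codes = {index_fn(i, ms) for i in range(num_categories)}
    assert len(codes) == num_categories        # uniqueness of the codes
```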

4 Compositional Embeddings Using Complementary Partitions

Generalizing our approach in Section 2, we would like to create an embedding table for each partition such that each equivalence class is mapped to an embedding vector. These embeddings could either be combined using some operation to generate a compositional embedding or used directly as separate sparse features (which we call the feature generation approach). The feature generation approach, although effective, may significantly increase the number of parameters needed by adding additional features, while not utilizing the inherent structure, namely that the complementary partitions are formed from the same initial categorical feature.

More rigorously, consider a set of complementary partitions $P_1, P_2, \dots, P_k$ of the category set $S$. For each partition $P_j$, we can create an embedding table $W_j \in \mathbb{R}^{|P_j| \times D_j}$, where each equivalence class $[x]_{P_j}$ is mapped to an embedding vector indexed by $i_j$, and $D_j$ is the embedding dimension for embedding table $W_j$. Let $p_j : S \to \{0, \dots, |P_j| - 1\}$ be the function that maps each element $x \in S$ to its corresponding equivalence class's embedding index, i.e. $x \mapsto i_j$.

To generate our (operation-based) compositional embedding, we interact all of the corresponding embeddings from each embedding table for our given category to obtain our final embedding vector

$x_{\text{emb}} = \omega\!\left( W_1^T e_{p_1(x)}, W_2^T e_{p_2(x)}, \dots, W_k^T e_{p_k(x)} \right),$    (6)

where $\omega : \mathbb{R}^{D_1} \times \cdots \times \mathbb{R}^{D_k} \to \mathbb{R}^{D}$ is an operation function. Examples of the operation function include (but are not limited to):

  1. Concatenation: Suppose $D = \sum_{j=1}^{k} D_j$, then $\omega(z_1, \dots, z_k) = [z_1^T, z_2^T, \dots, z_k^T]^T$.

  2. Addition: Suppose $D_j = D$ for all $j$, then $\omega(z_1, \dots, z_k) = z_1 + z_2 + \cdots + z_k$.

  3. Element-wise Multiplication: Suppose $D_j = D$ for all $j$, then $\omega(z_1, \dots, z_k) = z_1 \odot z_2 \odot \cdots \odot z_k$.[5]

[5] This is equivalent to factorizing the embeddings into the product of tensorized embeddings: if $W \in \mathbb{R}^{|P_1| \times \cdots \times |P_k| \times D}$ is a $(k+1)$-dimensional tensor containing all of the embeddings and $W_j \in \mathbb{R}^{|P_j| \times D}$ for $j = 1, \dots, k$ is the embedding table for partition $P_j$, then $W_{:, \dots, :, d} = (W_1)_{:, d} \otimes (W_2)_{:, d} \otimes \cdots \otimes (W_k)_{:, d}$ for fixed $d$, where $\otimes$ denotes the tensor outer product. This is similar to Khrulkov et al. (2019), but instead applied vector-wise rather than component-wise.

Figure 2: Visualization of compositional embeddings with element-wise multiplication operation. The red arrows denote the selection of the embedding vector for each embedding table.

One can show that this approach yields a unique embedding for each category under simple assumptions. We show this in the following theorem (proved in the Appendix). For simplicity, we will restrict ourselves to the concatenation operation.

Theorem 1.

Assume that the vectors in each embedding table are distinct, that is, $(W_j)_{i,:} \neq (W_j)_{i',:}$ for $i \neq i'$, for all $j = 1, \dots, k$. If the concatenation operation is used, then the compositional embedding of any category is unique, i.e. if $x, y \in S$ and $x \neq y$, then $x_{\text{emb}} \neq y_{\text{emb}}$.

This approach reduces the memory complexity of storing the entire embedding table from $O(|S| D)$ to $O(|P_1| D_1 + |P_2| D_2 + \cdots + |P_k| D_k)$. Assuming $D_j = D$ and that the partitions can be chosen arbitrarily, this approach yields an optimal memory complexity of $O(k |S|^{1/k} D)$, a stark improvement over storing and utilizing the full embedding table. This approach is visualized in Figure 2.
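The following PyTorch sketch (illustrative; the class and argument names are not from a reference implementation) realizes Equation 6 with one table per partition and the three operations listed above.

```python
import torch
import torch.nn as nn

class CompositionalEmbedding(nn.Module):
    """Operation-based compositional embedding (Eq. 6): one small table per
    complementary partition, combined by an operation (concat / add / mult)."""
    def __init__(self, partition_sizes, dim, operation="mult"):
        super().__init__()
        self.tables = nn.ModuleList([nn.Embedding(n, dim) for n in partition_sizes])
        self.operation = operation

    def forward(self, partition_indices):
        # partition_indices: one LongTensor per partition holding p_j(x)
        # for every category x in the batch.
        zs = [table(idx) for table, idx in zip(self.tables, partition_indices)]
        if self.operation == "concat":
            return torch.cat(zs, dim=-1)
        if self.operation == "add":
            return torch.stack(zs, dim=0).sum(dim=0)
        out = zs[0]
        for z in zs[1:]:
            out = out * z                       # element-wise multiplication
        return out

# Example with the generalized quotient-remainder partitions for |S| <= 1080.
ms = (8, 5, 27)
emb = CompositionalEmbedding(ms, dim=16, operation="mult")
i = torch.tensor([0, 7, 999])                   # category indices
p = [i % 8,
     torch.div(i, 8, rounding_mode="floor") % 5,
     torch.div(i, 40, rounding_mode="floor") % 27]
print(emb(p).shape)                             # torch.Size([3, 16])
```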

4.1 Path-Based Compositional Embeddings

An alternative approach for generating embeddings is to define a different set of transformations for each partition (aside from the first embedding table). In particular, we can use a single partition to define an initial embedding table then pass our initial embedding through a composition of functions determined by the other partitions to obtain our final embedding vector.

More formally, given a set of complementary partitions $P_1, P_2, \dots, P_k$ of the category set $S$, we can define an embedding table $W \in \mathbb{R}^{|P_1| \times D_1}$ for the first partition, then define sets of functions $M_j = \{ M_{j,i} : \mathbb{R}^{D_{j-1}} \to \mathbb{R}^{D_j} \text{ for } i \in \{0, \dots, |P_j| - 1\} \}$ for every other partition $j = 2, \dots, k$. As before, let $p_j : S \to \{0, \dots, |P_j| - 1\}$ be the function that maps each category to its corresponding equivalence class's embedding index.

To obtain the embedding for category $x \in S$, we can perform the following transformation:

$x_{\text{emb}} = \left( M_{k, p_k(x)} \circ \cdots \circ M_{2, p_2(x)} \right)\!\left( W^T e_{p_1(x)} \right).$    (7)

We call this formulation of embeddings path-based compositional embeddings because each function in the composition is determined based on the unique set of equivalence classes from each partition, yielding a unique “path” of transformations. These transformations may contain parameters that also need to be trained concurrently with the rest of the network. Examples of the function $M_{j,i}$ could include:

  1. Linear Function: If $A \in \mathbb{R}^{D_j \times D_{j-1}}$ and $b \in \mathbb{R}^{D_j}$ are parameters, then $M_{j,i}(z) = A z + b$.

  2. Multilayer Perceptron (MLP): Let $L$ be the number of layers. Let $d_0 = D_{j-1}$, $d_L = D_j$, and $d_1, \dots, d_{L-1}$ denote the number of nodes at each intermediate layer. Then if $A_1 \in \mathbb{R}^{d_1 \times d_0}$, ..., $A_L \in \mathbb{R}^{d_L \times d_{L-1}}$ and $b_1 \in \mathbb{R}^{d_1}$, ..., $b_L \in \mathbb{R}^{d_L}$ are parameters, and $\sigma$ is an activation function (say, ReLU or sigmoid) that is applied componentwise, then

     $M_{j,i}(z) = A_L \, \sigma( \cdots \sigma( A_2 \, \sigma(A_1 z + b_1) + b_2 ) \cdots ) + b_L.$

Unlike operation-based compositional embeddings, path-based compositional embeddings require non-embedding parameters within the functions $M_{j,i}$ to be learned, which may complicate training. The reduction in memory complexity also depends on how these functions are defined and how many additional parameters they add; for linear functions or MLPs of small fixed size, one can maintain the $O(k |S|^{1/k} D)$ complexity. This is visualized in Figure 3.

Figure 3: Visualization of path-based compositional embeddings. The red arrows denote the selection of the embedding vector and its corresponding path of transformations.
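A minimal PyTorch sketch of Equation 7 follows; it is illustrative only and handles a single category, whereas a practical implementation would batch categories that share the same path. Each remaining partition contributes one small MLP per equivalence class.

```python
import torch
import torch.nn as nn

class PathBasedEmbedding(nn.Module):
    """Path-based compositional embedding (Eq. 7): an embedding table for the
    first partition, then one small MLP per equivalence class of each remaining
    partition, composed along the category's path."""
    def __init__(self, partition_sizes, dim, hidden=64):
        super().__init__()
        self.table = nn.Embedding(partition_sizes[0], dim)          # W for P_1
        self.paths = nn.ModuleList([
            nn.ModuleList([
                nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
                for _ in range(n)
            ])
            for n in partition_sizes[1:]
        ])

    def forward(self, partition_indices):
        # partition_indices: per-partition class indices p_j(x) for ONE category.
        z = self.table(torch.tensor([partition_indices[0]]))
        for functions, idx in zip(self.paths, partition_indices[1:]):
            z = functions[idx](z)                                   # M_{j, p_j(x)}
        return z.squeeze(0)

# A category set split into a remainder partition of size 1000 and a quotient
# partition of size 4 (i.e. roughly 4 hash collisions).
emb = PathBasedEmbedding(partition_sizes=(1000, 4), dim=16, hidden=64)
vec = emb([123, 2])          # p_1(x) = 123, p_2(x) = 2
print(vec.shape)             # torch.Size([16])
```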

5 Experiments

In this section, we present a comprehensive set of experiments to test the quotient-remainder trick for reducing the number of parameters while preserving model loss and accuracy over many different operations. In particular, we show that the quotient-remainder trick allows us to trade off the model accuracy attained by full embedding tables against the model size obtained with the hashing trick.

For comparison, we consider both DCN (Wang et al., 2017) and Facebook DLRM networks. These two networks were selected as they are representative of most models for CTR prediction. We provide the model and experimental setup below.

5.1 Model Specifications

The DCN architecture considered in this paper consists of a deep network with 3 hidden layers consisting of 512, 256, and 64 nodes, respectively. The cross network consists of 6 layers. An embedding dimension of 16 is used across all categorical features.

The Facebook DLRM architecture consists of a bottom (or dense) MLP with 3 hidden layers with 512, 256, and 64 nodes, respectively, and a top (or output) MLP with 2 hidden layers consisting of 512 and 256 nodes. An embedding dimension of 16 is used. When thresholding is used, the concatenation operation uses an embedding dimension of 32 for non-compositional embeddings.

Note that in the baseline (using full embedding tables), the total number of rows in each table is determined by the cardinality of the category set for the table’s corresponding feature.

5.2 Experimental Setup and Data Pre-processing

The experiments are performed on the Criteo Ad Kaggle Competition dataset (http://labs.criteo.com/2014/02/kaggle-display-advertising-challenge-dataset/), hereafter referred to as Kaggle. Kaggle has 13 dense features and 26 categorical features. It consists of approximately 45 million datapoints sampled over 7 days. We use the first 6 days as the training set and split the 7th day equally into a validation and test set. The dense features are transformed using a log-transform. Unlabeled categorical features or labels are mapped to NULL or 0, respectively.

Note that for this dataset, each category is preprocessed to map to its own index. However, it is common in practice to apply the hashing trick to map each category to an index in an online fashion and to randomize the categories prior to reducing the number of embedding rows using the remainder function. Our techniques may still be applied on top of the initial hashing, as a replacement for the remainder function, to systematically reduce embedding sizes.
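A hedged preprocessing sketch is given below; the exact form of the log-transform and the choice of NULL index are assumptions for illustration and are not specified in detail above.

```python
import numpy as np

def transform_dense(x):
    # Log-transform of a dense feature; the exact form (here log(1 + x) with
    # negative values clipped to zero) is an assumption, not taken from the paper.
    return np.log1p(np.maximum(np.asarray(x, dtype=np.float64), 0.0))

def categorical_to_index(raw_value, num_rows):
    # Missing categories map to a reserved NULL index (assumed to be 0 here).
    # Otherwise: hash for indexing/randomization, then reduce the number of rows;
    # the reduction step is where the quotient-remainder trick can replace the
    # plain remainder function.
    if raw_value is None:
        return 0
    return (hash(raw_value) & 0x7FFFFFFF) % num_rows
```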

Each model is optimized using the Adagrad (Duchi et al., 2011) and AMSGrad (Kingma and Ba, 2014; Reddi et al., 2019) optimizers with their default hyperparameters; we choose the optimizer that yields the best validation loss. Single-epoch training is used with a batch size of 128 and no regularization. All experiments are averaged over 5 trials, and both the mean and a single standard deviation are plotted. Here, we use an embedding dimension of 16. The model loss is evaluated using binary cross-entropy.
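The optimizer setup can be sketched as follows; the helper name is illustrative, and AMSGrad is obtained via the amsgrad variant of Adam in PyTorch.

```python
import torch

def make_optimizer(model, which="adagrad"):
    # Default hyperparameters, per the setup above; the best of the two
    # optimizers is selected by validation loss.
    if which == "adagrad":
        return torch.optim.Adagrad(model.parameters())
    return torch.optim.Adam(model.parameters(), amsgrad=True)

criterion = torch.nn.BCELoss()   # binary cross-entropy on the predicted CTR
batch_size = 128                 # single-epoch training, no regularization
```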

Figure 4: Validation loss against the number of iterations when training the DCN (left) and Facebook DLRM (right) networks over 5 trials. Both the mean and standard deviation are plotted. “Full Table” corresponds to the baseline using full embedding tables (without hashing), “Hash Trick” refers to the hashing trick, and “Q-R Trick” refers to the quotient-remainder trick (with element-wise multiplication). Note that the hashing trick and quotient-remainder trick result in an approximate 4x reduction in model size.

Only the test loss is shown for brevity unless designated otherwise. Please refer to the Appendix for a complete set of experimental results.

5.3 Simple Comparison

To illustrate the quotient-remainder trick, we provide a simple comparison of the validation loss throughout training for full embedding tables, the hashing trick, and the quotient-remainder trick (with element-wise multiplication) in Figure 4. We enforce 4 hash collisions (yielding about a 4x reduction in model size). Each curve shows the average and standard deviation of the validation loss over 5 trials.

As expected, we see in Figure 4 that the quotient-remainder trick interpolates between the compression of the hashing trick and the accuracy attained by the full embedding tables.

5.4 Compositional Embeddings

To provide a more comprehensive comparison, we vary the number of hash collisions enforced within each feature and plot the number of parameters against the test loss for each operation. We enforce between 2 and 7, as well as 60, hash collisions on each categorical feature. We plot our results in Figure 5, where each point corresponds to the averaged result for a fixed number of hash collisions over 5 trials. In particular, since the number of embedding parameters dominates the total number of parameters in the entire network, the number of hash collisions is approximately inversely proportional to the number of parameters in the network.

Figure 5: Test loss against the number of parameters for 2-7 and 60 hash collisions on DCN (left) and Facebook DLRM (right) networks over 5 trials. Both the mean and standard deviation are plotted. Hash, Feature, Concat, Add, and Mult correspond to different operations. Full corresponds to the baseline using full embedding tables (without hashing). The baseline model using full embedding tables contains approximately 540 million parameters.

The multiplication operation performs best overall, performing closely to the feature generation baseline (which comes at the cost of an additional half-million parameters for the Facebook DLRM) and significantly outperforming all other operations for DCN. Interestingly, we found that AMSGrad significantly outperformed Adagrad when using the multiplication operation. Compared to the hashing trick with 4 hash collisions, we were able to attain similar or better solution quality with up to 60 hash collisions, an approximately 14x smaller model. With up to 4 hash collisions, we are within 0.3% of the baseline model loss for DCN and within 0.7% of the baseline model loss for DLRM. Note that the baseline performance for DLRM outperforms DCN in this instance.

Because the number of categories within each categorical feature may vary widely, it may be useful to only apply the hashing trick to embedding tables with sizes larger than some threshold. To see the tradeoff due to thresholding, we consider the thresholds 20, 200, 2,000, and 20,000 and plot the threshold number against the test loss for 4 hash collisions. For comparison, we include the result with the full embedding table as a baseline in Figure 6.

Figure 6: Test loss against the threshold number with 4 hash collisions on DCN (left) and Facebook DLRM (right) networks over 5 trials. Both the mean and standard deviation are plotted. Hash, Feature, Concat, Add, and Mult correspond to different operations. Full corresponds to the baseline using full embedding tables (without hashing). The baseline model using full embedding tables contains approximately 540 million parameters.

We see that when thresholding is used, the results are much more nuanced, and the improvement in performance depends on the operation considered. In particular, we find that the element-wise multiplication works best for DCN, while the concatenation operation works better for Facebook DLRM. For DLRM, we observe an improvement from a 0.7% error to a 0.5% error relative to the baseline while maintaining an approximate 4x reduction in model size. The performance, however, may still vary depending on the number of hash collisions used.
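The thresholding rule itself is simple to implement; the sketch below (illustrative, reusing the QREmbedding sketch from Section 2) applies the quotient-remainder trick only to features whose category set exceeds the threshold.

```python
import torch.nn as nn

def build_table(num_categories, dim, collisions=4, threshold=20_000):
    """Apply the quotient-remainder trick only when the category set exceeds
    the threshold; smaller features keep a full embedding table. QREmbedding
    refers to the sketch given after Section 2."""
    if num_categories <= threshold:
        return nn.Embedding(num_categories, dim)             # full table
    m = (num_categories + collisions - 1) // collisions      # ~`collisions` categories per remainder row
    return QREmbedding(num_categories, m, dim)
```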

5.5 Path-Based Compositional Embeddings

In these experiments, we consider the quotient-remainder trick for path-based compositional embeddings. Here, we fix the number of hash collisions to 4 and define an MLP with a single hidden layer of size 16, 32, 64, or 128. The results are shown in Table 1.


Hidden Layer 16 32 64 128
DCN # Parameters 135,464,410 135,519,322 135,629,146 135,848,794
Test Loss 0.45263 0.45254 0.45252 0.4534
DLRM # Parameters 135,581,537 135,636,449 135,746,273 135,965,921
Test Loss 0.45349 0.45312 0.45306 0.45651
Table 1: Average test loss and number of parameters for different MLP sizes with 4 hash collisions over 5 trials.

From Table 1, we obtain an optimal hidden layer size of 64. The trend follows an intuitive tradeoff: a smaller network may be easier to train but may not sufficiently transform the embeddings, while a larger network has greater capacity to fit a more complex transformation but requires more parameters to be learned. The modified DCN outperforms the modified DLRM network in this case, although this is not true in general.

5.6 Discussion of Tradeoffs

As seen in Figure 5, using the quotient-remainder trick enforces an arbitrary structure on the categorical data, which can yield an arbitrary loss in performance, as expected. This yields a tradeoff between memory and performance: a larger embedding table yields better model quality, but at the cost of increased memory requirements. Similarly, using a more aggressive version of the quotient-remainder trick yields smaller models but leads to a reduction in model quality. For most models, performance degrades roughly exponentially as the number of parameters decreases.

Both types of compositional embeddings reduce the number of parameters by implicitly enforcing some structure, defined by the complementary partitions, in the generation of each category's embedding. Hence, the quality of the model ought to depend on how closely the chosen partitions reflect intrinsic properties of the category set and their respective embeddings. In some problems, this structure may be identified; however, the dataset considered in this paper contains no additional knowledge of the categories. Since practitioners performing CTR prediction typically apply the hashing trick, our method clearly improves upon this baseline with only a small additional cost in memory.

Path-based compositional embeddings also yield more compute-intensive models, with the benefit of lower model complexity. Whereas the operation-based approach attempts to definitively operate on multiple coarse representations, path-based embeddings explicitly define the transformations on the representation, a more difficult but intriguing problem. Unfortunately, our preliminary experiments show that our current implementation of path-based compositional embeddings does not supersede operation-based compositional embeddings; however, we believe that path-based compositional embeddings are potentially capable of producing improved results with better modeling and training techniques, and are worthy of further investigation.

6 Conclusion

Modern recommendation systems, particularly for CTR prediction and personalization tasks, handle large amounts of categorical data by representing each category with an embedding; the resulting embedding tables may each require multiple GBs of memory. We have proposed an improvement for reducing the number of embedding vectors that is easily implementable and applicable end-to-end while preserving the uniqueness of the embedding representation for each category. We extensively tested multiple operations for composing embeddings from complementary partitions.

Based on our results, we suggest combining the use of thresholding with the quotient-remainder trick (and compositional embeddings) in practice. The appropriate operation depends on the network architecture; in the two cases considered here, the element-wise multiplication operation appears to work well.

This work provides an improved trick for compressing embedding tables by reducing the number of embeddings in the recommendation setting, with room for design of more intricate operations. There are, however, general weaknesses of this framework; it does not take into account the frequency of categories or learn the intrinsic structure of the embeddings as in codebook learning. Although this would be ideal, we found that categorical features for CTR prediction or personalization are far less structured, with embedding sizes that often prohibit the storage of an explicit codebook during training. It remains to be seen if other compression techniques that utilize further structure within categorical data (such as codebook learning (Shu and Nakayama, 2017; Chen et al., 2018)) can be devised or generalized to end-to-end training and inference for CTR prediction.

Acknowledgments

We thank Tony Ginart, Jianyu Huang, Krishnakumar Nair, Jongsoo Park, Misha Smelyanskiy, and Chonglin Sun for their helpful comments. Also, we express our gratitude to Jorge Nocedal for his consistent support and encouragement.

References

  • T. Chen, M. R. Min, and Y. Sun (2018) Learning k-way d-dimensional discrete codes for compact embedding representations. arXiv preprint arXiv:1806.09464. Cited by: §1.1, §6.
  • H. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, et al. (2016) Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems, pp. 7–10. Cited by: §1.1, §1.
  • J. Duchi, E. Hazan, and Y. Singer (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 (Jul), pp. 2121–2159. Cited by: §5.2.
  • H. Guo, R. Tang, Y. Ye, Z. Li, and X. He (2017) DeepFM: a factorization-machine based neural network for ctr prediction. arXiv preprint arXiv:1703.04247. Cited by: §1.1, §1.
  • X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T. Chua (2017) Neural collaborative filtering. In Proceedings of the 26th international conference on world wide web, pp. 173–182. Cited by: §1.1, §1.
  • H. Jegou, M. Douze, and C. Schmid (2010) Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence 33 (1), pp. 117–128. Cited by: §1.1.
  • V. Khrulkov, O. Hrinchuk, L. Mirvakhabova, and I. Oseledets (2019) Tensorized embedding layers for efficient model compression. arXiv preprint arXiv:1901.10787. Cited by: §1.1, footnote 5.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.2.
  • J. Lian, X. Zhou, F. Zhang, Z. Chen, X. Xie, and G. Sun (2018) xDeepFM: combining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1754–1763. Cited by: §1.1, §1.
  • M. Naumov, D. Mudigere, H. M. Shi, J. Huang, N. Sundaraman, J. Park, X. Wang, U. Gupta, C. Wu, A. G. Azzolini, et al. (2019) Deep learning recommendation model for personalization and recommendation systems. arXiv preprint arXiv:1906.00091. Cited by: §1.1, §1, §1.
  • M. Naumov (2019) On the dimensionality of embeddings for sparse features and data. arXiv preprint arXiv:1901.02103. Cited by: §1.1, §1.
  • S. J. Reddi, S. Kale, and S. Kumar (2019) On the convergence of adam and beyond. arXiv preprint arXiv:1904.09237. Cited by: §5.2.
  • S. Rendle (2010) Factorization machines. In 2010 IEEE International Conference on Data Mining, pp. 995–1000. Cited by: §1.1.
  • S. Rendle (2012) Factorization machines with LibFM. ACM Transactions on Intelligent Systems and Technology (TIST) 3 (3), pp. 57. Cited by: §1.1.
  • R. Shu and H. Nakayama (2017) Compressing word embeddings via deep compositional code learning. arXiv preprint arXiv:1711.01068. Cited by: §1.1, §6.
  • R. Wang, B. Fu, G. Fu, and M. Wang (2017) Deep & cross network for ad click predictions. In Proceedings of the ADKDD’17, pp. 12. Cited by: §1.1, §1, §1, §5.
  • K. Weinberger, A. Dasgupta, J. Attenberg, J. Langford, and A. Smola (2009) Feature hashing for large scale multitask learning. arXiv preprint arXiv:0902.2206. Cited by: §1, §2.
  • Z. Yin and Y. Shen (2018) On the dimensionality of word embedding. In Advances in Neural Information Processing Systems, pp. 887–898. Cited by: §1.1.
  • L. Zheng, C. Lu, L. He, S. Xie, V. Noroozi, H. Huang, and P. S. Yu (2018) Mars: memory attention-aware recommender system. arXiv preprint arXiv:1805.07037. Cited by: §1.1.
  • C. Zhou, J. Bai, J. Song, X. Liu, Z. Zhao, X. Chen, and J. Gao (2018a) ATRank: an attention-based user behavior modeling framework for recommendation. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §1.1.
  • G. Zhou, N. Mou, Y. Fan, Q. Pi, W. Bian, C. Zhou, X. Zhu, and K. Gai (2018b) Deep interest evolution network for click-through rate prediction. arXiv preprint arXiv:1809.03672. Cited by: §1.1, §1.
  • G. Zhou, X. Zhu, C. Song, Y. Fan, H. Zhu, X. Ma, Y. Yan, J. Jin, H. Li, and K. Gai (2018c) Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1059–1068. Cited by: §1.1, §1.
  • H. Zhu, X. Li, P. Zhang, G. Li, J. He, H. Li, and K. Gai (2018) Learning tree-based deep model for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1079–1088. Cited by: §1.1.

Appendix A Background on Set Partitions, Equivalence Relations, and Equivalence Classes

For completeness, we include the definitions of set partitions, equivalence relations, and equivalence classes, which we use extensively in the paper.

Definition 2.

Given a set $S$, a set partition $\mathcal{P}$ is a family of sets such that:

  1. $\bigcup_{P \in \mathcal{P}} P = S$, and

  2. $P \cap Q = \emptyset$ for all $P, Q \in \mathcal{P}$ where $P \neq Q$.

Definition 3.

A binary relation $\sim$ on a set $S$ is an equivalence relation if and only if, for all $x, y, z \in S$,

  1. $x \sim x$ (reflexivity),

  2. $x \sim y$ if and only if $y \sim x$ (symmetry), and

  3. if $x \sim y$ and $y \sim z$, then $x \sim z$ (transitivity).

Definition 4.

Given an equivalence relation $\sim$ on $S$, the equivalence class of $x \in S$ is defined as $[x] = \{ y \in S : x \sim y \}$.

Given a set partition $\mathcal{P}$ of $S$, we can define an equivalence relation on $S$ by $x \sim y$ if and only if there exists a $P \in \mathcal{P}$ such that $x \in P$ and $y \in P$. (One can easily show that this binary relation is indeed an equivalence relation.) In words, $x$ is “equivalent” to $y$ if and only if $x$ and $y$ are in the same set of the partition. This equivalence relation yields the set of equivalence classes consisting of $[x] = P$ for $x \in P \in \mathcal{P}$.

As an example, consider the set of numbers $S = \{0, 1, 2, 3, 4\}$. One partition of $S$ is $\mathcal{P} = \{\{0, 1, 3\}, \{2, 4\}\}$. Then the equivalence classes are defined as $[0] = [1] = [3] = \{0, 1, 3\}$ and $[2] = [4] = \{2, 4\}$.

To see how this is relevant to compositional embeddings, consider the following partition of the set of categories $S$:

$\mathcal{P} = \{ \{x \in S : \epsilon(x) \bmod m = l\} : l \in \mathcal{E}(m) \},$    (8)

where $m \in \mathbb{N}$ is given. This partition induces an equivalence relation as defined above. Then each equivalence class consists of all elements whose remainder modulo $m$ is the same, i.e.

$[x] = \{ y \in S : \epsilon(y) \bmod m = \epsilon(x) \bmod m \}.$    (9)

Mapping each equivalence class of this partition to a single embedding vector is hence equivalent to performing the hashing trick, as seen in Section 2.
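For illustration, the following Python snippet (not from the paper) lists the equivalence classes of the remainder partition for a small category set.

```python
from collections import defaultdict

def remainder_equivalence_classes(num_categories, m):
    """Equivalence classes of the remainder partition in Eqs. (8)-(9): indices
    with the same value of i mod m share a class, and hence share a single
    embedding row under the hashing trick."""
    classes = defaultdict(list)
    for i in range(num_categories):
        classes[i % m].append(i)
    return dict(classes)

print(remainder_equivalence_classes(10, 4))
# {0: [0, 4, 8], 1: [1, 5, 9], 2: [2, 6], 3: [3, 7]}
```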

Appendix B Proof of Complementary Partition Examples

In this section, we prove that each of the listed families of partitions is indeed complementary. Note that in order to show that partitions $P_1, \dots, P_k$ are complementary, it is sufficient to show that for each pair of distinct elements $x, y \in S$, there exists a partition $P_j$ such that $[x]_{P_j} \neq [y]_{P_j}$.

  1. If $P = \{ \{x\} : x \in S \}$, then $P$ is a complementary partition.

     Proof.

     Note that since all elements of $S$ are in different sets by definition of $P$, $[x]_P \neq [y]_P$ for all $x \neq y$.

  2. Given $m \in \mathbb{N}$, the partitions

     $P_1 = \{ \{x \in S : \epsilon(x) \,\backslash\, m = l\} : l \in \mathcal{E}(\lceil |S|/m \rceil) \}$ and
     $P_2 = \{ \{x \in S : \epsilon(x) \bmod m = l\} : l \in \mathcal{E}(m) \}$

     are complementary.

     Proof.

     Suppose that $x, y \in S$ such that $x \neq y$ and $[x]_{P_2} = [y]_{P_2}$. (If $[x]_{P_2} \neq [y]_{P_2}$, then we are done.) Then there exists an $r \in \mathcal{E}(m)$ such that $\epsilon(x) \bmod m = r$ and $\epsilon(y) \bmod m = r$. In other words, $\epsilon(x) = q_x m + r$ and $\epsilon(y) = q_y m + r$ where $q_x = \epsilon(x) \,\backslash\, m$ and $q_y = \epsilon(y) \,\backslash\, m$. Since $\epsilon(x) \neq \epsilon(y)$, we have that $q_x m + r \neq q_y m + r$, and hence $q_x \neq q_y$. Thus, $\epsilon(x) \,\backslash\, m \neq \epsilon(y) \,\backslash\, m$ and $[x]_{P_1} \neq [y]_{P_1}$, so the partitions are complementary.

  3. Given $m_i \in \mathbb{N}$ for $i = 1, \dots, k$ such that $\prod_{i=1}^{k} m_i \geq |S|$, we can recursively define complementary partitions

     $P_1 = \{ \{x \in S : \epsilon(x) \bmod m_1 = l\} : l \in \mathcal{E}(m_1) \}$    (10)
     $P_j = \{ \{x \in S : (\epsilon(x) \,\backslash\, M_j) \bmod m_j = l\} : l \in \mathcal{E}(m_j) \}$    (11)

     where $M_j = \prod_{i=1}^{j-1} m_i$ for $j = 2, \dots, k$. Then $P_1, \dots, P_k$ are complementary.

     Proof.

     We can show this by induction on $k$. The base case $k = 1$ is trivial, since then $m_1 \geq |S|$ and $P_1$ places every category in its own equivalence class.

     Suppose that the statement holds for $k - 1$, that is, if $\prod_{i=1}^{k-1} m_i \geq |S|$ for $m_i \in \mathbb{N}$, $i = 1, \dots, k-1$, and $P_1, \dots, P_{k-1}$ are defined by equation 10 and equation 11, then $P_1, \dots, P_{k-1}$ are complementary. We want to show that the statement holds for $k$.

     Consider a factorization with $k$ elements, that is, $\prod_{i=1}^{k} m_i \geq |S|$ for $m_i \in \mathbb{N}$, $i = 1, \dots, k$. Let $P_j$ for $j = 1, \dots, k$ be defined as in equation 10 and equation 11 with

     $P_1 = \{ \{x \in S : \epsilon(x) \bmod m_1 = l\} : l \in \mathcal{E}(m_1) \}$    (12)
     $P_j = \{ \{x \in S : (\epsilon(x) \,\backslash\, M_j) \bmod m_j = l\} : l \in \mathcal{E}(m_j) \}$    (13)

     for all $l \in \mathcal{E}(m_j)$ and $j = 2, \dots, k$, where $M_j = \prod_{i=1}^{j-1} m_i$.

     Let $x, y \in S$ with $x \neq y$. Since $x \neq y$, $\epsilon(x) \neq \epsilon(y)$. We want to show that $[x]_{P_j} \neq [y]_{P_j}$ for some $j$. We have two cases:

     1. If $\epsilon(x) \bmod m_1 \neq \epsilon(y) \bmod m_1$, then $[x]_{P_1} \neq [y]_{P_1}$ and we are done.

     2. Suppose $\epsilon(x) \bmod m_1 = \epsilon(y) \bmod m_1$. Then

       $\epsilon(x) \bmod m_1 = \epsilon(y) \bmod m_1 = r$

       for some $r \in \mathcal{E}(m_1)$. Consider the subset

       $S_r = \{ z \in S : \epsilon(z) \bmod m_1 = r \}.$

       Note that $x, y \in S_r$. We will define an enumeration over $S_r$ by

       $\epsilon'(z) = \epsilon(z) \,\backslash\, m_1.$

       Note that this is an enumeration since if $\epsilon'(z) = \epsilon'(z')$, then

       $\epsilon(z) = \epsilon'(z)\, m_1 + r = \epsilon'(z')\, m_1 + r = \epsilon(z')$

       for $z, z' \in S_r$, so $z = z'$. This function is clearly a bijection on $S_r$ since $\epsilon$ is a bijection. Using this new enumeration, we can define the sets

       $Q_{j,l} = \{ z \in S_r : (\epsilon'(z) \,\backslash\, M'_j) \bmod m_{j+1} = l \},$

       where $M'_j = \prod_{i=2}^{j} m_i$ for $j = 1, \dots, k-1$ and $l \in \mathcal{E}(m_{j+1})$, and their corresponding partitions

       $P'_j = \{ Q_{j,l} : l \in \mathcal{E}(m_{j+1}) \}.$

       Since $\prod_{i=2}^{k} m_i \geq \lceil |S| / m_1 \rceil \geq |S_r|$, by the inductive hypothesis, we have that this set of partitions $P'_1, \dots, P'_{k-1}$ is complementary. Thus, there exists a $j$ such that $[x]_{P'_j} \neq [y]_{P'_j}$.

       In order to show that this implies that $[x]_{P_{j+1}} \neq [y]_{P_{j+1}}$, one must show that $(\epsilon(z) \,\backslash\, M_{j+1}) \bmod m_{j+1} = (\epsilon'(z) \,\backslash\, M'_j) \bmod m_{j+1}$ for all $z \in S_r$ and $j = 1, \dots, k-1$.

       To see this, since $\epsilon(z) = \epsilon'(z)\, m_1 + r$ with $0 \leq r < m_1$ and $M_{j+1} = m_1 M'_j$, by modular arithmetic we have

       $\epsilon(z) \,\backslash\, M_{j+1} = (\epsilon'(z)\, m_1 + r) \,\backslash\, (m_1 M'_j) = \epsilon'(z) \,\backslash\, M'_j$

       and

       $(\epsilon(z) \,\backslash\, M_{j+1}) \bmod m_{j+1} = (\epsilon'(z) \,\backslash\, M'_j) \bmod m_{j+1}$

       for $j = 1, \dots, k-1$. Thus, $[x]_{P_{j+1}} \neq [y]_{P_{j+1}}$ for the index $j$ obtained above, and we are done.

  4. Consider a pairwise coprime factorization greater than or equal to $|S|$, that is, $\prod_{i=1}^{k} m_i \geq |S|$ for $m_i \in \mathbb{N}$ for all $i$, and $\gcd(m_i, m_j) = 1$ for all $i \neq j$. Then we can define the partitions

     $P_j = \{ \{x \in S : \epsilon(x) \bmod m_j = l\} : l \in \mathcal{E}(m_j) \}$

     for $j = 1, \dots, k$. Then $P_1, \dots, P_k$ are complementary.

     Proof.

     Let $M = \prod_{i=1}^{k} m_i$. Since the $m_i$ for $i = 1, \dots, k$ are pairwise coprime and $M \geq |S|$, by the Chinese Remainder Theorem there exists a bijection $\phi : \mathbb{Z}/M\mathbb{Z} \to \mathbb{Z}/m_1\mathbb{Z} \times \cdots \times \mathbb{Z}/m_k\mathbb{Z}$ defined as $\phi(a) = (a \bmod m_1, \dots, a \bmod m_k)$. Let $x, y \in S$ such that $x \neq y$. Then $\epsilon(x) \neq \epsilon(y)$, and since $\epsilon(x), \epsilon(y) < |S| \leq M$, we have $\phi(\epsilon(x)) \neq \phi(\epsilon(y))$, and so there must exist an index $j$ such that $\epsilon(x) \bmod m_j \neq \epsilon(y) \bmod m_j$, as desired. Hence $[x]_{P_j} \neq [y]_{P_j}$.

Appendix C Proof of Theorem 1

Theorem 1.

Assume that the vectors in each embedding table are distinct, that is, $(W_j)_{i,:} \neq (W_j)_{i',:}$ for $i \neq i'$, for all $j = 1, \dots, k$. If the concatenation operation is used, then the compositional embedding of any category is unique, i.e. if $x, y \in S$ and $x \neq y$, then $x_{\text{emb}} \neq y_{\text{emb}}$.

Proof.

Suppose that $x, y \in S$ and $x \neq y$. Let $P_1, \dots, P_k$ be complementary partitions, and define $W_1, \dots, W_k$ to be their respective embedding tables. Since the concatenation operation is used, denote their corresponding final embeddings as

$x_{\text{emb}} = \left[ (W_1^T e_{p_1(x)})^T, \dots, (W_k^T e_{p_k(x)})^T \right]^T$ and $y_{\text{emb}} = \left[ (W_1^T e_{p_1(y)})^T, \dots, (W_k^T e_{p_k(y)})^T \right]^T,$

respectively, where $W_j^T e_{p_j(x)}$ and $W_j^T e_{p_j(y)}$ are the embedding vectors from each corresponding partition's embedding table.

Since $P_1, \dots, P_k$ are complementary and $x \neq y$, there exists a $j$ such that $[x]_{P_j} \neq [y]_{P_j}$, and hence $p_j(x) \neq p_j(y)$. Thus, since the embedding vectors in each embedding table are distinct, $W_j^T e_{p_j(x)} \neq W_j^T e_{p_j(y)}$. Hence, $x_{\text{emb}} \neq y_{\text{emb}}$, as desired.

Appendix D Additional Experimental Results

We present complete results on the training, validation, and test loss/accuracy against the number of parameters for 2-7 and 60 hash collisions on DCN and Facebook DLRM networks over 5 trials in Figures 7 and 8. We observe consistent results between the training, validation, and test performance across both networks. As noted in Section 5, the element-wise multiplication operation appears to perform best across both networks.

In order to avoid evaluating the training loss and accuracy on the entire training set, we approximate these quantities by averaging over a window from the results of the forward pass over the last 1024 iterations. This notably results in a larger standard deviation compared to the validation and test loss and accuracy.

Figure 7: Training, validation, and test loss against the number of parameters for 2-7 and 60 hash collisions on DCN (left) and Facebook DLRM (right) networks over 5 trials. Both the mean and standard deviation are plotted. Hash, Feature, Concat, Add, and Mult correspond to different operations. Full corresponds to the baseline using full embedding tables (without hashing). The baseline model using full embedding tables contains approximately 540 million parameters.
Figure 8: Training, validation, and test accuracy against the number of parameters for 2-7 and 60 hash collisions on DCN (left) and Facebook DLRM (right) networks over 5 trials. Both the mean and standard deviation are plotted. Hash, Feature, Concat, Add, and Mult correspond to different operations. Full corresponds to the baseline using full embedding tables (without hashing). The baseline model using full embedding tables contains approximately 540 million parameters.

We present the complete results for the thresholding experiments, where we compare the threshold against the test loss for 4 hash collisions; see Figures 9 and 10. We also plot the number of parameters against the threshold number in Figure 11.

Figure 9: Training, validation, and test loss against the threshold number with 4 hash collisions on DCN (left) and Facebook DLRM (right) networks over 5 trials. Both the mean and standard deviation are plotted. Hash, Feature, Concat, Add, and Mult correspond to different operations. Full corresponds to the baseline using full embedding tables (without hashing). The baseline model using full embedding tables contains approximately 540 million parameters.
Figure 10: Training, validation, and test accuracy against the threshold number with 4 hash collisions on DCN (left) and Facebook DLRM (right) networks over 5 trials. Both the mean and standard deviation are plotted. Hash, Feature, Concat, Add, and Mult correspond to different operations. Full corresponds to the baseline using full embedding tables (without hashing). The baseline model using full embedding tables contains approximately 540 million parameters.
Figure 11: Number of parameters against the threshold number with 4 hash collisions on DCN (left) and Facebook DLRM (right) networks. Hash, Feature, Concat, Add, and Mult correspond to different operations. Full corresponds to the baseline using full embedding tables (without hashing). The baseline model using full embedding tables contains approximately 540 million parameters.

We also present the complete results for the path-based compositional embeddings with different MLP sizes in Table 2. A full table containing results from the best operation for each fixed number of hash collisions is provided in Table 3. The table containing the thresholding experimental results for 4 hash collisions is provided in Table 4.


Hidden Layer 16 32 64 128
DCN # Parameters 135,464,410 135,519,322 135,629,146 135,848,794
Training Loss 0.44628 0.44649 0.44649 0.4473
Training Accuracy 0.79269 0.79247 0.79267 0.79208
Validation Loss 0.45281 0.45277 0.45273 0.45363
Validation Accuracy 0.78871 0.78873 0.78876 0.78838
Test Loss 0.45263 0.45254 0.45252 0.4534
Test Accuracy 0.78888 0.7889 0.78892 0.78849
DLRM # Parameters 135,581,537 135,636,449 135,746,273 135,965,921
Training Loss 0.44736 0.44706 0.44676 0.45025
Training Accuracy 0.79209 0.79241 0.79268 0.79091
Validation Loss 0.45366 0.4534 0.45324 0.45652
Validation Accuracy 0.78823 0.78834 0.78851 0.78667
Test Loss 0.45349 0.45312 0.45306 0.45651
Test Accuracy 0.78834 0.78856 0.78862 0.78667
Table 2: Table of the average training, validation, and test loss and accuracy and the number of parameters for different MLP sizes with 4 hash collisions on DCN and Facebook DLRM networks over 5 trials.

Hash Collisions 0 2 3 4 5 6 7 60
DCN # Parameters 540,558,778 270,458,970 180,425,866 135,409,498 108,399,882 90,393,562 77,532,042 9,385,882
Operation N/A Mult Mult Mult Mult Mult Mult Mult
Training Loss 0.44302 0.44433 0.4446 0.44476 0.44491 0.44508 0.44546 0.44874
Training Accuracy 0.79414 0.79381 0.79339 0.79343 0.79325 0.79308 0.79307 0.79155
Validation Loss 0.44951 0.4507 0.45095 0.4512 0.45141 0.4516 0.4518 0.45489
Validation Accuracy 0.79047 0.78986 0.78974 0.7895 0.78948 0.78946 0.78923 0.78774
Test Loss 0.44924 0.45054 0.45079 0.45103 0.45129 0.4513 0.45161 0.4548
Test Accuracy 0.79066 0.79004 0.78982 0.78977 0.78972 0.78968 0.7895 0.78787
DLRM # Parameters 540,675,905 271,101,921 181,068,817 136,052,449 109,042,833 91,036,513 78,174,993 10,028,832
Operation N/A Feature Feature Feature Feature Feature Feature Feature
Training Loss 0.44066 0.44206 0.44281 0.4436 0.44364 0.44418 0.44438 0.44866
Training Accuracy 0.79517 0.79443 0.79424 0.79402 0.79423 0.79396 0.79364 0.7917
Validation Loss 0.4473 0.44864 0.44959 0.45017 0.4503 0.45064 0.45102 0.45499
Validation Accuracy 0.79147 0.7909 0.79037 0.79015 0.79007 0.78988 0.78974 0.78782
Test Loss 0.44705 0.44832 0.44925 0.44981 0.44996 0.45031 0.45065 0.45466
Test Accuracy 0.79168 0.79111 0.79062 0.79037 0.79035 0.79015 0.79002 0.78804
Table 3: Table of the average training, validation, and test loss and accuracy and the number of parameters for the best operation (with respect to validation loss) across a varied number of hash collisions on DCN and Facebook DLRM networks over 5 trials.

Threshold 0 20 200 2,000 20,000
DCN # Parameters 135,409,498 135,409,802 135,411,482 135,447,018 135,977,178
Operation Mult Mult Mult Mult Mult
Training Loss 0.44476 0.44487 0.44487 0.44493 0.44429
Training Accuracy 0.79343 0.79357 0.79349 0.79339 0.7936
Validation Loss 0.4512 0.4513 0.45122 0.4513 0.45076
Validation Accuracy 0.7895 0.78957 0.78955 0.78957 0.78981
Test Loss 0.45103 0.45112 0.45106 0.45114 0.45045
Test Accuracy 0.78977 0.78971 0.78971 0.78973 0.79002
DLRM # Parameters 136,052,449 135,924,753 135,855,777 135,804,273 136,853,921
Operation Feature Feature Feature Feature Concat
Training Loss 0.4436 0.44333 0.44311 0.4431 0.44119
Training Accuracy 0.79402 0.79406 0.79405 0.79391 0.79503
Validation Loss 0.45017 0.45001 0.44977 0.44967 0.44787
Validation Accuracy 0.79015 0.79021 0.79027 0.79027 0.7913
Test Loss 0.44981 0.44965 0.44945 0.44935 0.44757
Test Accuracy 0.79037 0.79045 0.7905 0.79056 0.79149
Table 4: Table of the average training, validation, and test loss and accuracy and the number of parameters for the best operation (with respect to validation loss) over different thresholds with 4 hash collisions on DCN and Facebook DLRM networks averaged over 5 trials.