Differentiable Product Quantization for End-to-End Embedding Compression

08/26/2019 ∙ by Ting Chen, et al.

Embedding layers are commonly used to map discrete symbols into continuous embedding vectors that reflect their semantic meanings. As the number of symbols increases, the number of embedding parameters, and with it the size of the embedding layer, increases linearly and becomes problematically large. In this work, we aim to reduce the size of the embedding layer by learning discrete codes and composing embedding vectors from the codes. More specifically, we propose a differentiable product quantization framework with two instantiations, which can serve as an efficient drop-in replacement for an existing embedding layer. Empirically, we evaluate the proposed method on three different language tasks and show that it enables end-to-end training of embedding compression that achieves significant compression ratios (14-238×) at almost no performance cost (and sometimes even improves performance).




1 Introduction

The embedding layer is a basic neural network module that maps a discrete symbol/word into a continuous hidden vector. It is used in almost all NLP applications, including language modeling, machine translation, and text classification. With a large vocabulary, the embedding layer consumes a large amount of storage and memory. For example, in the LSTM-based medium-sized language model on the PTB dataset [21], the embedding table accounts for more than 95% of the total parameters. Even with sub-word encoding (e.g. byte-pair encoding), the size of the embedding layer is still very significant. Beyond text, embedding layers have wider applications, such as in knowledge graphs [2, 18] and recommender systems [13, 3], where the vocabulary is even larger.

Recent efforts have been made to reduce the size of the embedding layer [5, 17]. These methods first learn to encode symbols/words with discrete codes (such as 5-1-2-4 for “cat” and 5-1-2-3 for “dog”), and then compose the codes to form the output symbol embedding. However, in [17] the discrete codes are fixed before training and thus cannot adapt to the task-specific downstream network. [5] proposes to learn the codes in an end-to-end fashion, which yields better performance. However, the method in [5] still requires a distillation procedure, which incorporates a pre-trained embedding table as guidance, in order to avoid a performance drop compared to the original full embedding baseline.

In this work, we propose a novel differentiable product quantization (DPQ) framework. The proposal is based on the observation that the discrete codes (KD codes) can be obtained via a process of quantization (product quantization [10] in particular). Our framework can be instantiated with two approximation techniques that make the quantization differentiable, which lets us learn the KD codes in an end-to-end fashion. Compared to previous methods [5, 17], our framework brings a new perspective and allows for more flexible designs (such as the choice of approximation algorithm). Furthermore, [5, 17] use a sophisticated transformation function (such as an MLP or LSTM) to turn discrete codes into continuous embedding vectors, while we simplify this function, enabling better trade-offs between efficiency and effectiveness.

We conduct experiments on three different language tasks, by simply replacing the original full embedding layer with our proposed one. The results show that the proposed method can achieve higher compression ratios than current methods, at almost the same performance as the original embedding. More importantly, our results are obtained from the end-to-end training where no extra procedure, e.g. distillation, is required.

2 Method

We first introduce the end-to-end embedding compression problem and KD codes for embedding compression, and then introduce the proposed method.

Problem setup.

An embedding function can be defined as $\mathcal{F}_W: \mathcal{V} \rightarrow \mathbb{R}^d$, where $\mathcal{V}$ denotes the vocabulary, which contains the set of all possible discrete symbols, such as words/sub-words [15, 6], entities [2], or users/items [13]; and $\mathbb{R}^d$ is the continuous feature space. In standard end-to-end training, the embedding parameter $W$ is jointly trained with the other neural network parameters $\theta$ to optimize a given loss function, i.e.

$$\min_{\theta, W} \sum_i \mathcal{L}\big(f_\theta(\mathcal{F}_W(x_i)),\, y_i\big),$$

where $x_i$ is the discrete input for the $i$-th example, $y_i$ is the target value, and $f_\theta$ is a neural network function applied on the embedding vector $\mathcal{F}_W(x_i)$.

The problem of end-to-end embedding compression is to find $\mathcal{F}_{W'}$ in the same end-to-end fashion, such that the number of bits needed to represent $W'$ is substantially smaller than that needed for $W$. Typically, the embedding parameter $W$ is a table/matrix with $W \in \mathbb{R}^{n \times d}$, where $n = |\mathcal{V}|$. The total number of bits used to represent this table is $O(nd)$ (e.g. $32nd$ if each real number is represented by a 32-bit floating point), which is problematic with large $n$ and/or $d$.


The intuition behind the embedding compression method, similar to [5], is to decompose each symbol in the vocabulary into a sequence of discrete codes, and to compose a symbol embedding vector from the embedding vectors of these codes. When many shared factors can be abstracted into these discrete codes, we are able to reduce the redundancy in the conventional embedding table, which is flat and not factorized. For example, both “cat” and “dog” are mammals and pets, and a model can reflect this similarity with two continuous embedding vectors that are close to each other. Alternatively, we can use two similar compact discrete codes that share a prefix, and compose the corresponding continuous vectors from a small set of code embeddings. The latter is more efficient since fewer parameters are required.

To learn the discrete codes, we propose a differentiable product quantization framework that is trainable in an end-to-end fashion. The key insight of our novel framework, different from [5], is based on the following observation: the process of finding the discrete codes can be considered as a process of quantization (product quantization [10] in particular). Making this quantization process differentiable enables the end-to-end learning of discrete codes via optimizing some task-specific objective.

2.1 Differentiable Product Quantization Framework

Figure 1: Illustration of the proposed framework. During the training, we use differentiable quantization to approximate raw embedding table (i.e. Query Matrix). After training, we only keep Codebooks and Value matrix to construct the Embedding Table.

We start by introducing the differentiable quantization function, which (during training) takes a raw embedding table and produces a quantized one. A quantization function $\mathcal{T}$ is a composition of two functions: 1) a discretization function $\phi(\cdot)$ that maps a continuous vector into a K-way D-dimensional discrete code with cardinality $K$ (namely, a KD code), and 2) a reverse-discretization function $\rho(\cdot)$ that maps the KD code into a continuous embedding vector. That is, $\mathcal{T}(\cdot) = \rho \circ \phi(\cdot)$. During training, both $\phi$ and $\rho$ are learned; every symbol is then represented by a KD code, obtained by applying $\phi$, to save space (compression). In the inference stage, only $\rho$ is used to decode the KD codes into regular embedding vectors.

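A form for discretization function $\phi(\cdot)$.

We use a Query matrix $\mathbf{Q}$ and a Key matrix $\mathbf{K}$ to find KD codes in the space $\{1, \dots, K\}^D$ for the symbols in the vocabulary. The Query matrix $\mathbf{Q} \in \mathbb{R}^{n \times d}$ can be considered as a raw embedding table (before quantization), and has the same number of rows as the vocabulary size $n$. The Key matrix $\mathbf{K} \in \mathbb{R}^{K \times d}$ has the same number of rows as the cardinality $K$ of the KD codes, which is much smaller than the vocabulary size. We further split the columns of $\mathbf{Q}$ and $\mathbf{K}$ into $D$ groups such that $\mathbf{Q}^{(j)} \in \mathbb{R}^{n \times d/D}$ and $\mathbf{K}^{(j)} \in \mathbb{R}^{K \times d/D}$. Each group corresponds to one of the $D$ dimensions in KD codes.

With the Query and Key matrices, we compute each dimension of the $D$-dimensional discrete codes separately. The $j$-th dimension of the KD code $\mathbf{C}_i$ for the $i$-th symbol is computed as follows.

$$\mathbf{C}_i^{(j)} = \arg\min_k \; \mathrm{dist}\big(\mathbf{Q}_i^{(j)}, \mathbf{K}_k^{(j)}\big) \qquad (1)$$

In other words, $\phi(\mathbf{Q}_i) = \mathbf{C}_i = \big(\mathbf{C}_i^{(1)}, \dots, \mathbf{C}_i^{(D)}\big)$. The $\mathrm{dist}(\cdot, \cdot)$ function computes a distance score between two vectors, which is used to decide which discrete code to take. Note that after training, we discard $\mathbf{Q}$ and $\mathbf{K}$, and only keep the codebook of KD codes inferred from Eq. 1.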
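As a rough illustration of the discretization step, the following NumPy sketch splits the columns of a query table and a key table into groups and, per group, picks the index of the key row with the smallest distance score. The function names (`kd_codes`, `sq_euclidean`), the toy shapes, and the use of squared Euclidean distance as the default score are assumptions made for this example and are not taken from the authors' code; codes are 0-based here.

```python
import numpy as np

def sq_euclidean(q, k):
    """Pairwise squared Euclidean distances: q is (n, g), k is (K, g) -> (n, K)."""
    diff = q[:, None, :] - k[None, :, :]
    return (diff ** 2).sum(axis=-1)

def kd_codes(query, key, num_groups, dist=sq_euclidean):
    """Return an (n, D) integer array of KD codes with entries in [0, K).

    query: (n, d) raw embedding table; key: (K, d) key table.
    Columns are split into D = num_groups equal groups, so d must be divisible by D.
    """
    q_groups = np.split(query, num_groups, axis=1)
    k_groups = np.split(key, num_groups, axis=1)
    # For each group, take the key index with the smallest distance score.
    codes = [dist(q, k).argmin(axis=1) for q, k in zip(q_groups, k_groups)]
    return np.stack(codes, axis=1)

# Toy example (hypothetical sizes): n = 6 symbols, d = 8, K = 4, D = 2.
rng = np.random.default_rng(0)
query = rng.normal(size=(6, 8))
key = rng.normal(size=(4, 8))
codes = kd_codes(query, key, num_groups=2)
print(codes.shape)  # (6, 2)
```

Passing a different `dist` callable (for example a negated dot product) changes which instantiation of the framework the sketch mimics, without touching the grouping logic.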

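A form for reverse-discretization function $\rho(\cdot)$.

We introduce the Value matrix $\mathbf{V} \in \mathbb{R}^{K \times d}$, which allows us to assign a learnable embedding to each code. We also split the columns of $\mathbf{V}$ into $D$ groups in the same way as $\mathbf{Q}$ and $\mathbf{K}$, i.e. $\mathbf{V}^{(j)} \in \mathbb{R}^{K \times d/D}$. We can compute the embedding vector $\mathbf{H}_i^{(j)}$ for the $j$-th dimension of the KD code $\mathbf{C}_i$ as follows.

$$\mathbf{H}_i^{(j)} = \mathbf{V}^{(j)}_{\mathbf{C}_i^{(j)}} \qquad (2)$$

In other words, the final embedding vector for the $i$-th symbol is $\rho(\mathbf{C}_i) = \big[\mathbf{H}_i^{(1)}, \dots, \mathbf{H}_i^{(D)}\big]$, which is a concatenation of vectors from the different groups. We note that this is a simplification of the functions used in [5, 17], which reduces the overhead and eases the optimization.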
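The decoding step can be sketched in a few lines of NumPy: for each group, index the corresponding column block of the value table with that group's code, then concatenate the pieces. The function name `compose_embeddings` and the toy numbers are illustrative assumptions, and codes are 0-based.

```python
import numpy as np

def compose_embeddings(codes, value, num_groups):
    """Decode KD codes into embedding vectors.

    codes: (n, D) integer codes with entries in [0, K).
    value: (K, d) value table whose columns are split into D equal groups.
    Returns an (n, d) array: the concatenation of one value row per group.
    """
    v_groups = np.split(value, num_groups, axis=1)  # D arrays of shape (K, d/D)
    parts = [v_groups[j][codes[:, j]] for j in range(num_groups)]
    return np.concatenate(parts, axis=1)

# Toy example (hypothetical sizes): K = 3 codes, d = 4, D = 2 groups, n = 2 symbols.
value = np.array([[0., 1., 10., 11.],
                  [2., 3., 12., 13.],
                  [4., 5., 14., 15.]])
codes = np.array([[0, 2],
                  [1, 1]])
emb = compose_embeddings(codes, value, num_groups=2)
print(emb.tolist())  # [[0.0, 1.0, 14.0, 15.0], [2.0, 3.0, 12.0, 13.0]]
```

Only integer indexing and concatenation are involved, which is why decoding adds little cost on top of a plain table lookup.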

Figure 1 illustrates the proposed framework, which we dub Differentiable Product Quantization (DPQ).

Storage complexity.

Our framework decouples the vocabulary size from the number of continuous embedding weights by means of extra discrete KD codes, which can be stored compactly. Assuming the default 32-bit floating point is used, the original full embedding table requires $32nd$ bits. As for our method, we have 1) a Value matrix that requires $32Kd$ or $32Kd/D$ bits, depending on whether or not we tie the weights among groups, and 2) KD codes that require an extra $nD\log_2 K$ bits. In our framework, only the compact discrete KD codes depend on the vocabulary size $n$.
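The bit counts above are easy to check numerically. The sketch below is a minimal calculator under stated assumptions: the function names are hypothetical, each code is stored with a whole number of bits (the ceiling of the base-2 logarithm of the cardinality, which equals the logarithm itself when the cardinality is a power of two), and the example sizes are arbitrary rather than settings reported in the paper.

```python
def full_table_bits(n, d, float_bits=32):
    """Bits needed for a dense n x d embedding table."""
    return float_bits * n * d

def dpq_bits(n, d, K, D, tie_groups=False, float_bits=32):
    """Bits for the value table plus the KD codes (illustrative estimate)."""
    value_bits = float_bits * K * d
    if tie_groups:
        value_bits //= D  # one shared (K, d/D) block instead of D blocks
    bits_per_code = (K - 1).bit_length()  # ceil(log2(K)) for K >= 1
    code_bits = n * D * bits_per_code
    return value_bits + code_bits

# Arbitrary example sizes: n = 10,000 symbols, d = 128, K = 32, D = 16.
n, d, K, D = 10_000, 128, 32, 16
full = full_table_bits(n, d)
untied = dpq_bits(n, d, K, D)
tied = dpq_bits(n, d, K, D, tie_groups=True)
print(full, untied, tied)  # 40960000 931072 808192
print(round(full / untied, 2), round(full / tied, 2))
```

In this example the code bits (800,000) dominate the value table, so tying the groups changes the total only modestly.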

Inference complexity.

During inference, we want to retrieve the continuous embedding vector for a given symbol. As shown above, this is done by retrieving the symbol's KD code and the corresponding code embedding vectors, and then concatenating them into the final embedding vector. Since only indexing and concatenation are used, both the extra computation and the extra memory footprint are very small compared to the conventional full embedding (which directly indexes the continuous vector).

The proposed framework is general, with several concrete design choices to make. Specifically, which distance function is used in Eq. 1? How can we compute gradients through the $\arg\min$ function in Eq. 1? Do we tie the Key matrix with the Value matrix? We introduce two instantiations that answer these questions with specific design choices.

2.2 Vector Quantization-based

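Here we use the Euclidean distance function to compute the KD codes as follows.

$$\mathbf{C}_i^{(j)} = \arg\min_k \; \big\| \mathbf{Q}_i^{(j)} - \mathbf{K}_k^{(j)} \big\|^2$$

We also tie the Key and Value matrices, so they are in the same space, i.e. $\mathbf{V} = \mathbf{K}$.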

The resulting model is similar to Vector Quantization (VQ) in [19], with the key difference that we split the space into orthogonal subspaces (groups). We name this model DPQ-VQ. Intuitively, the model uses the Query to search for the nearest neighbor in the Key/Value space, and outputs it as the embedding vector.

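Due to the $\arg\min$ operation, the resulting quantization function has no gradient with respect to its input $\mathbf{Q}$, so we utilize the straight-through estimator [1] to allow a pseudo gradient. That is, we rewrite the quantization function as follows, where $\mathbf{H}_i^{(j)}$ denotes the quantized output for the $j$-th group of the $i$-th symbol.

$$\mathbf{H}_i^{(j)} = \mathbf{Q}_i^{(j)} - \mathrm{sg}\big(\mathbf{Q}_i^{(j)} - \mathbf{V}^{(j)}_{\mathbf{C}_i^{(j)}}\big)$$

Here sg is the stop-gradient operator, which acts as an identity function in the forward pass but prevents gradients from back-propagating through its argument. So during the forward pass, $\mathbf{H}_i^{(j)} = \mathbf{V}^{(j)}_{\mathbf{C}_i^{(j)}}$, but during the backward pass, we use the gradient of $\mathbf{Q}_i^{(j)}$.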

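The sg trick only computes a gradient for the Query matrix, so to update the tied Key/Value matrix, similar to [19], we add a regularization term: $\mathcal{L}_{\mathrm{reg}} = \sum_{i,j} \big\| \mathbf{V}^{(j)}_{\mathbf{C}_i^{(j)}} - \mathrm{sg}\big(\mathbf{Q}_i^{(j)}\big) \big\|^2$, which drives each entry of the Key/Value matrix toward the arithmetic mean of its members (the query vectors assigned to it). Alternatively, one can also use an Exponential Moving Average [12] to update the centroids.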
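The claim that the regularizer pulls each Key/Value entry toward the mean of its members can be checked on a toy example. The NumPy sketch below, for a single group, assigns queries to their nearest key, evaluates the squared-distance regularizer, and then moves each key to the mean of its assigned queries. It does not implement the straight-through gradient itself, which needs an autodiff framework; the function names and the data are assumptions made for illustration.

```python
import numpy as np

def nearest_codes(q, key):
    """Index of the nearest key row (squared Euclidean distance) for each query row."""
    d2 = ((q[:, None, :] - key[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def reg_loss(q, key, codes):
    """Sum of squared distances between each query and its assigned key row."""
    return float(((key[codes] - q) ** 2).sum())

def mean_update(q, key, codes):
    """Move every used key row to the mean of the queries assigned to it."""
    new_key = key.copy()
    for k in range(key.shape[0]):
        members = q[codes == k]
        if len(members) > 0:
            new_key[k] = members.mean(axis=0)
    return new_key

# Hypothetical data: 4 queries and 2 tied key/value rows in one group.
q = np.array([[0., 0.], [0., 2.], [10., 10.], [10., 12.]])
key = np.array([[1., 0.], [9., 9.]])

codes = nearest_codes(q, key)
quantized = key[codes]  # what the forward pass would output for each query
before = reg_loss(q, key, codes)
new_key = mean_update(q, key, codes)
after = reg_loss(q, new_key, codes)
print(codes.tolist(), before, after)  # [0, 0, 1, 1] 18.0 4.0
```

With the assignments held fixed, the per-code mean is the minimizer of the squared-distance regularizer, which is why gradient steps on that term move the centroids in the same direction as this closed-form update.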

2.3 Softmax-based

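Different from DPQ-VQ, which uses Euclidean distance with tied Key/Value matrices, a different design choice here is to use dot product and softmax for computing the proximity between Query and Key vectors, similar to [5]. Importantly, the Key and Value matrices are not shared, so they are in different latent spaces. Intuitively, the model decides the KD code by comparing the proximity of Query and Key in one latent space, and emits the output embedding from a different latent space. Specifically,

$$\mathbf{C}_i^{(j)} = \arg\max_k \; \big\langle \mathbf{Q}_i^{(j)}, \mathbf{K}_k^{(j)} \big\rangle,$$

where $\langle \cdot, \cdot \rangle$ denotes the dot product of two vectors. Due to the $\arg\max$, we cannot compute the gradient of this function. So we relax it into a softmax function with temperature $\tau$:

$$\tilde{\mathbf{C}}_i^{(j)} = \mathrm{softmax}_k\big( \big\langle \mathbf{Q}_i^{(j)}, \mathbf{K}_k^{(j)} \big\rangle / \tau \big)$$

Note that $\tilde{\mathbf{C}}_i^{(j)}$ is now a probabilistic vector (i.e. a soft one-hot vector) instead of an integer $\mathbf{C}_i^{(j)}$. And $\tilde{\mathbf{C}}_i^{(j)} \in \Delta^{K}$, or $\tilde{\mathbf{C}}_i \in \Delta^{K \times D}$, where $\Delta^{K}$ denotes the probability simplex over the $K$ codes (one such distribution for each of the $D$ groups). With the KD code relaxed into a soft one-hot vector, we replace the index operation with a dot product to compute the output embedding vector as follows.

$$\mathbf{H}_i^{(j)} = \tilde{\mathbf{C}}_i^{(j)} \mathbf{V}^{(j)}$$

To compute discrete KD codes, we can set $\tau \to 0$, so that the softmax function becomes a spike concentrated on the $\mathbf{C}_i^{(j)}$-th dimension. This is equivalent to the $\arg\max$ operation, so we cannot compute the gradient. To enable a pseudo gradient, we use different temperatures, i.e. we set $\tau \to 0$ in the forward pass, but $\tau = 1$ in the backward pass. Such a quantization function can be expressed as follows.

$$\mathbf{H}_i^{(j)} = \tilde{\mathbf{C}}_{i,\tau=1}^{(j)} \mathbf{V}^{(j)} - \mathrm{sg}\big( \tilde{\mathbf{C}}_{i,\tau=1}^{(j)} \mathbf{V}^{(j)} - \tilde{\mathbf{C}}_{i,\tau=0}^{(j)} \mathbf{V}^{(j)} \big)$$

We name this model DPQ-SX.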
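To see how the temperature controls the relaxation, the NumPy sketch below computes soft one-hot codes from dot-product scores for a single group and compares the resulting soft embedding with the hard (index-based) lookup as the temperature shrinks. The function name `soft_codes` and the toy matrices are assumptions for illustration; the pseudo-gradient with two temperatures is not shown because it requires an autodiff framework.

```python
import numpy as np

def soft_codes(q, key, tau):
    """Softmax over dot-product scores: q (n, g), key (K, g) -> (n, K) soft one-hot rows."""
    logits = (q @ key.T) / tau
    logits -= logits.max(axis=1, keepdims=True)  # stabilise the exponentials
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

# Hypothetical single-group example with n = 2 queries and K = 3 codes.
q = np.array([[1.0, 0.0], [0.0, 1.0]])
key = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
value = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])

hard = (q @ key.T).argmax(axis=1)  # discrete codes from the arg max
hard_emb = value[hard]             # index-based lookup

for tau in (1.0, 0.1, 0.01):
    soft_emb = soft_codes(q, key, tau) @ value  # dot product replaces indexing
    gap = np.abs(soft_emb - hard_emb).max()
    print(f"tau={tau}: max gap to hard lookup = {gap:.3g}")
```

As the temperature decreases, the soft embedding approaches the row selected by the arg max, which is the behaviour the forward pass relies on.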

A comparison between DPQ-VQ and DPQ-SX.

At inference, DPQ-VQ and DPQ-SX are the same (i.e. both concatenate the code embedding vectors selected by the learned codes); they mainly differ during training. DPQ-SX directly models the soft one-hot distribution (each symbol in the batch has a $K \times D$ matrix), which is more memory intensive, while DPQ-VQ only uses the nearest neighbor as an approximation, making it more scalable (to large K, D, and batch size). Since the Key and Value matrices are not tied in DPQ-SX, it offers more flexibility in the choice of dimensionality.

3 Experiments

3.1 Datasets and settings

We conduct the experiments on three different tasks, namely language modeling (LM), neural machine translation (NMT), and text classification (TextC). For LM, we test on the PTB and Wikitext-2 datasets. For NMT, we test on IWSLT15 in both the English-Vietnamese and Vietnamese-English directions. For TextC, we test on five datasets from [22], namely AG News, Yahoo! Answers, DBpedia, Yelp Polarity and Yelp Full. (Among the text classification datasets, Yahoo! Answers and AG News represent topic prediction, Yelp Polarity and Yelp Full represent sentiment analysis, and DBpedia represents ontology classification.) The detailed data statistics are shown in Table 1.

Task            Dataset            Vocab. size
LM              PTB                10,000
                Wikitext-2         33,278
NMT             IWSLT15 (En-Vi)    17,191
                IWSLT15 (Vi-En)    7,709
Text Classif.   AG News            69,322
                Yahoo! Ans.        477,522
                DBpedia            612,530
                Yelp P             246,739
                Yelp F             268,414
Table 1: The encoder’s vocabulary size for each dataset.

We adopt existing architectures for these tasks and only replace the encoder embedding layer with the proposed method. For LM, we adopt the LSTM-based models from [21], which come in three different model sizes; for NMT, we adopt the seq2seq-based model from [14]; and for TextC, we use a model that resembles fastText [11] and has one hidden layer after mean pooling of the word vectors.

Baselines. The main baselines we consider are 1) the original full word embedding, and 2) variants of the KD-code-based methods from [5] and [17], namely: Pre-train: pre-train and fix the KD codes; E2E: end-to-end training without distillation guidance from a pre-trained embedding table; and E2E-dist.: end-to-end training with a distillation procedure.

Ablations. For the proposed method, we tune and compare two sets of hyper-parameters that trade off between compression ratio and task performance: 1) the size of the KD codes, i.e. $K$ and $D$, and 2) whether or not to share/tie groups in the Key/Value matrices, i.e. setting $\mathbf{K}^{(j)} = \mathbf{K}^{(j')}$ and $\mathbf{V}^{(j)} = \mathbf{V}^{(j')}$ for all groups $j, j'$.

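Metrics. The effectiveness metric is given by each task, such as perplexity in LM, BLEU score in NMT, and accuracy in TextC. To evaluate the (compression) efficiency for the encoder embedding table, we adopt the compression ratio (CR), which can be computed as follows (based on 32-bit floating point).

$$\mathrm{CR} = \frac{\text{bits used by the full embedding table}}{\text{bits used by the compressed embedding}} = \frac{32nd}{32Kd + nD\log_2 K}$$

When the weights are tied among groups, the $32Kd$ term in the denominator becomes $32Kd/D$.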

Figure 2: Heatmaps of task performance and CR on (a) PTB and (b) IWSLT15 (En-Vi). The darker the better.
             Small           Medium          Large
Method       PPL     CR      PPL     CR      PPL     CR
Full         114.5   1       83.4    1       78.7    1
Pre-train    108.0   4.8     84.9    11.7    80.7    18.5
E2E          108.5   4.8     89.0    11.7    86.4    18.5
E2E-dist.    107.8   4.8     83.1    11.7    77.7    18.5
DPQ-SX       105.8   85.5    82.0    82.9    78.5    238.3
DPQ-VQ       106.5   51.1    83.3    58.7    79.5    238.3
Table 2: Comparisons on LM using PTB. Three LSTM model sizes are studied.
Task                  Dataset            Metric    Full     DPQ-SX   DPQ-VQ
LM                    PTB                PPL       83.38    83.17    83.27
                                         CR        1        163.18   58.67
                      Wikitext-2         PPL       95.61    94.94    95.92
                                         CR        1        59.25    59.25
NMT                   IWSLT15 (En-Vi)    BLEU      25.4     25.3     25.3
                                         CR        1        86.17    16.13
                      IWSLT15 (Vi-En)    BLEU      23.0     23.1     22.5
                                         CR        1        72.00    14.05
Text Classification   AG News            Acc(%)    92.59    92.49    92.55
                                         CR        1        19.26    23.95
                      Yahoo! Ans.        Acc(%)    69.41    69.62    69.15
                                         CR        1        48.16    19.24
                      DBpedia            Acc(%)    98.12    98.13    98.14
                                         CR        1        24.08    38.45
                      Yelp P             Acc(%)    93.92    94.17    93.91
                                         CR        1        38.52    24.04
                      Yelp F             Acc(%)    60.33    60.10    60.22
                                         CR        1        48.16    24.05
Table 3: Comparisons among DPQ variants (DPQ-SX and DPQ-VQ) and full embedding baseline on three different tasks.
Figure 3: Compression ratio and task performance trade-off curves on three tasks/datasets for DPQ variants: (a) PTB, (b) IWSLT15 (En-Vi), and (c) DBpedia.

3.2 Results

Table 2 compares our method with the existing methods in terms of perplexity (PPL) and compression ratio (CR) on three different model sizes for LM using PTB. We find that 1) both the Pre-train and E2E baselines achieve good compression ratios, but at the cost of worse perplexity on the Medium and Large models; 2) the E2E-dist. baseline achieves the same CR without sacrificing performance, but it requires an extra distillation procedure; and 3) our methods (DPQ-SX and DPQ-VQ), without a distillation procedure, achieve much better CR than the baselines, while obtaining the lowest PPL on the Small and Medium models and staying close to the best PPL on the Large model.

Table 3 provides a comparison among variants of the proposed methods on different tasks. For LM, we show the evaluation on the medium-sized LSTM. We find that: 1) the proposed methods can achieve compression ratios of 14× to 163×, while obtaining comparable or even better performance than the full embedding baseline; 2) in 7 out of 9 datasets, DPQ-SX provides a better compression ratio or performance than DPQ-VQ.

Figure 3 shows the trade-off between CR and performance for DPQ variants: DPQ-SX vs. DPQ-VQ, and share Key/Value between groups vs. not share. We find that: 1) for LM, sharing is better, 2) for NMT, not sharing is better, and 3) for TextC, it is beneficial to share among groups for DPQ-SX, and not to share for DPQ-VQ.

Figure 2 shows the performance/CR heatmaps under different $K$ and $D$ on PTB and IWSLT15 (En-Vi). We find that large $K$ and small $D$ is a bad trade-off, while small $K$ and large $D$ seems better. An intermediate setting of $K$ and $D$ yields the optimal trade-off between performance and CR. Furthermore, we find that when K is small and D is large (i.e. the nearest neighbor approximation is not reliable), DPQ-SX performs much better than DPQ-VQ.

More experimental results can be found in the appendix.

4 Related Work

Modern neural networks are very large and redundant, and the compression of such models has attracted many research efforts [7, 8, 4]. Most of these compression techniques focus on weights that are shared among many examples, such as convolutional or dense layers [8, 4]. Embedding layers are different in the sense that they are tabular and very sparsely accessed, i.e. pruning cannot remove a row/symbol from the embedding table, and only a few symbols are accessed in each data sample. This makes the compression challenges different for the embedding layer. There is existing work on compressing embedding layers [17, 5]. Our work generalizes the methods in [17, 5] into a new DPQ framework and improves the compression without resorting to a distillation process.

Our work differs from traditional quantization techniques [10] in that it can be trained in an end-to-end fashion. The idea of utilizing multiple orthogonal subspaces/groups for quantization is used in product quantization [10, 16] and multi-head attention [20]. Our work also resembles the Transformer [20], with attention being discrete and Key/Value matrices being internal parameters/memory (instead of hidden states of the input sequence).

The two instantiations of our model also share similarities with Gumbel-softmax [9] and VQ-VAE [19]. However, we do not find using the stochastic noises (as in Gumbel-softmax) useful since we aim to get deterministic codes. It is also worth pointing out that these techniques [9, 19] by themselves cannot be directly applied for compression, while our DPQ framework enables it.

5 Conclusion

In this work, we propose a novel and general differentiable product quantization framework for embedding table compression. We give two instantiations under our framework, which can serve as an efficient drop-in replacement for an existing embedding layer. Empirically, we evaluate the proposed method on 3 different language tasks (9 datasets), and show that it surpasses the state of the art and can compress the embedding table by up to 238 times without a loss in performance. We believe we are the first to show that such a layer can be trained in an end-to-end fashion without distillation. In the future, we want to apply this technique to a wider range of applications and architectures, and to better understand the effectiveness of the proposed framework.


We would like to thank Koyoshi Shindo for helpful discussions.


  • [1] Y. Bengio, N. Léonard, and A. Courville (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
  • [2] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko (2013) Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pp. 2787–2795.
  • [3] T. Chen, L. Hong, Y. Shi, and Y. Sun (2017) Joint text embedding for personalized content-based recommendation. arXiv preprint arXiv:1706.01084.
  • [4] T. Chen, J. Lin, T. Lin, S. Han, C. Wang, and D. Zhou (2018) Adaptive mixture of low-rank factorizations for compact neural modeling. Neural Information Processing Systems (CDNNRIA workshop).
  • [5] T. Chen, M. R. Min, and Y. Sun (2018) Learning k-way d-dimensional discrete codes for compact embedding representations. In International Conference on Machine Learning.
  • [6] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • [7] S. Han, H. Mao, and W. J. Dally (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
  • [8] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
  • [9] E. Jang, S. Gu, and B. Poole (2016) Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144.
  • [10] H. Jegou, M. Douze, and C. Schmid (2010) Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (1), pp. 117–128.
  • [11] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov (2017) Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics.
  • [12] Ł. Kaiser, A. Roy, A. Vaswani, N. Parmar, S. Bengio, J. Uszkoreit, and N. Shazeer (2018) Fast decoding in sequence models using discrete latent variables. arXiv preprint arXiv:1803.03382.
  • [13] Y. Koren, R. Bell, and C. Volinsky (2009) Matrix factorization techniques for recommender systems. Computer, pp. 30–37.
  • [14] M. Luong, E. Brevdo, and R. Zhao (2017) Neural machine translation (seq2seq) tutorial. https://github.com/tensorflow/nmt.
  • [15] T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • [16] M. Norouzi and D. J. Fleet (2013) Cartesian k-means. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3017–3024.
  • [17] R. Shu and H. Nakayama (2017) Compressing word embeddings via deep compositional code learning. arXiv preprint arXiv:1711.01068.
  • [18] R. Socher, D. Chen, C. D. Manning, and A. Ng (2013) Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems, pp. 926–934.
  • [19] A. van den Oord, O. Vinyals, et al. (2017) Neural discrete representation learning. In Advances in Neural Information Processing Systems, pp. 6306–6315.
  • [20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
  • [21] W. Zaremba, I. Sutskever, and O. Vinyals (2014) Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.
  • [22] X. Zhang, J. Zhao, and Y. LeCun (2015) Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pp. 649–657.

Appendix A More Experimental Results

Figure 4: Trade-off curves between compression ratio and model performance on (a) PTB, (b) IWSLT15 (En-Vi), (c) DBpedia, (d) Wikitext-2, (e) IWSLT15 (Vi-En), (f) AG News, (g) Yahoo! Answers, (h) Yelp Polarity, and (i) Yelp Full. The performance metric varies per task.

Figure 4 shows the trade-off between CR and performance for DPQ variants: DPQ-SX vs. DPQ-VQ, and share Key/Value between groups vs. not share. We find that: 1) for LM, sharing is better, 2) for NMT, not sharing is better, and 3) for TextC, it is beneficial to share among groups for DPQ-SX, and not to share for DPQ-VQ.

Figure 5: Heatmaps of task performance and CR on Wikitext-2. The darker the better.
Figure 6: Heatmaps of task performance and CR on IWSLT15 (En-Vi). The darker the better.
Figure 7: Heatmaps of task performance and CR on IWSLT15 (Vi-En). The darker the better.

Figures 5, 6, and 7 show the impact of different choices of hyper-parameters ($K$ and $D$) on task performance and compression ratio.