Differentiable Product Quantization for End-to-End Embedding Compression.
An embedding layer is commonly used to map discrete symbols into continuous embedding vectors that reflect their semantic meanings. As the number of symbols increases, the number of embedding parameters, and hence the size of the layer, grows linearly and becomes problematically large. In this work, we aim to reduce the size of the embedding layer by learning discrete codes and composing embedding vectors from the codes. More specifically, we propose a differentiable product quantization framework with two instantiations, which can serve as an efficient drop-in replacement for an existing embedding layer. Empirically, we evaluate the proposed method on three different language tasks, and show that it enables end-to-end training of embedding compression that achieves significant compression ratios (14-238×) at almost no performance cost (and sometimes with better performance).
The embedding layer is a basic neural network module that maps a discrete symbol/word into a continuous hidden vector. It is used in almost all NLP applications, including language modeling, machine translation, and text classification. With a large vocabulary, the embedding layer consumes a large amount of storage and memory. For example, in the LSTM-based medium-sized language model on the PTB dataset [21], the embedding table accounts for more than 95% of the total parameters. Even with sub-word encoding (e.g. byte-pair encoding), the embedding layer remains very large. Beyond text, embedding layers have wider applications, such as in knowledge graphs [2, 18] and recommender systems [13, 3], where the vocabulary is even larger.
To reduce the size of the embedding layer, recent efforts [5, 17] first learn to encode symbols/words with discrete codes (such as 5-1-2-4 for “cat” and 5-1-2-3 for “dog”), and then compose the codes to form the output symbol embedding. However, in [17] the discrete codes are fixed before training and thus cannot adapt to the task-specific downstream network. [5] proposes to learn the codes in an end-to-end fashion, which performs better. However, their method [5] still requires a distillation procedure, which incorporates a pre-trained embedding table as guidance, in order to avoid a performance drop compared to the original full embedding baseline.
In this work, we propose a novel differentiable product quantization (DPQ) framework. The proposal is based on the observation that the discrete codes (KD codes) can be obtained via a process of quantization (product quantization [10] in particular). Our framework can be instantiated with two approximation techniques that allow differentiable learning. By making quantization differentiable, we are able to learn the KD codes in an end-to-end fashion. Compared to previous methods [5, 17], our framework brings a new perspective and allows for more flexible designs (such as the choice of approximation algorithm). Furthermore, [5, 17] use sophisticated transformation functions (such as an MLP or LSTM) to turn discrete codes into continuous embedding vectors, while we simplify this function, enabling better trade-offs between efficiency and effectiveness.
We conduct experiments on three different language tasks, by simply replacing the original full embedding layer with our proposed one. The results show that the proposed method can achieve higher compression ratios than current methods, at almost the same performance as the original embedding. More importantly, our results are obtained from the end-to-end training where no extra procedure, e.g. distillation, is required.
We first introduce the end-to-end embedding compression problem and KD codes for embedding compression, and then introduce the proposed method.
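An embedding function can be defined as $\mathcal{F}_{\mathcal{W}}: \mathcal{V} \to \mathbb{R}^d$, where $\mathcal{V}$ denotes the vocabulary, i.e. the set of all possible discrete symbols, such as words/sub-words [15, 6], entities [2], or users/items [13], and $\mathbb{R}^d$ is the continuous feature space. In standard end-to-end training, the embedding parameter $\mathcal{W}$ is jointly trained with the other neural network parameters $\Theta$ to optimize a given loss function, i.e.
$$\min_{\Theta, \mathcal{W}} \sum_i \mathcal{L}\big(\mathcal{M}_{\Theta}(\mathcal{F}_{\mathcal{W}}(x_i)),\, y_i\big),$$
where $x_i \in \mathcal{V}$ is the discrete input for the $i$-th example, $y_i$ is the target value, and $\mathcal{M}_{\Theta}$ is a neural network function applied to the embedding vector $\mathcal{F}_{\mathcal{W}}(x_i)$.
The problem of end-to-end embedding compression is to find $\mathcal{F}_{\mathcal{W}'}$ in the same end-to-end fashion, such that the number of bits needed to represent $\mathcal{W}'$ is substantially smaller than for $\mathcal{W}$. Typically, the embedding parameter $\mathcal{W}$ is a table/matrix with $\mathcal{W} \in \mathbb{R}^{n \times d}$, where $n = |\mathcal{V}|$. The total number of bits used to represent this table is $O(nd)$ (e.g. $32nd$ if each real number is represented by a 32-bit floating point), which is problematic with large $n$ and/or $d$.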
The intuition behind the embedding compression method, similar to [5], is to decompose each symbol in the vocabulary into a sequence of discrete codes, and to compose a symbol embedding vector from the embedding vectors of these codes. When many shared factors can be abstracted into these discrete codes, we can reduce the redundancy of a conventional embedding table, which is flat and not factorized. For example, “cat” and “dog” are both mammals and pets, and a model can reflect this similarity with two continuous embedding vectors that are close to each other. Alternatively, we can use two similar compact discrete codes that share a prefix, and compose the corresponding continuous vectors from a small set of code embeddings. The latter is more efficient since fewer parameters are required.
To learn the discrete codes, we propose a differentiable product quantization framework that is trainable in an end-to-end fashion. The key insight of our novel framework, different from [5], is based on the following observation: the process of finding the discrete codes can be considered as a process of quantization (product quantization [10] in particular). Making this quantization process differentiable enables the end-to-end learning of discrete codes via optimizing some task-specific objective.
We start by introducing the differentiable quantization function, which (during training) takes a raw embedding table and produces a quantized one. A quantization function $\mathcal{T}$ is a composition of two functions: 1) a discretization function $\phi(\cdot)$ that maps a continuous vector into a $K$-way $D$-dimensional discrete code, i.e. a code with $D$ entries, each of cardinality $K$ (namely, a KD code), and 2) a reverse-discretization function $\rho(\cdot)$ that maps the KD code into a continuous embedding vector. That is, $\mathcal{T}(\cdot) = \rho \circ \phi(\cdot)$. During training, both $\phi$ and $\rho$ are learned; then every symbol is represented by a KD code obtained by applying $\phi$, which saves space (compression). In the inference stage, only $\rho$ is used to decode the KD codes into regular embedding vectors.
We use a Query matrix $\mathbf{Q} \in \mathbb{R}^{n \times d}$ and a Key matrix $\mathbf{K} \in \mathbb{R}^{K \times d}$ to find KD codes in the space $\{1, \ldots, K\}^{D}$ for the symbols in the vocabulary. The Query matrix can be considered a raw embedding table (before quantization), with as many rows as the vocabulary size $n$. The Key matrix has as many rows as the cardinality $K$ of the KD codes, which is much smaller than the vocabulary size. We further split the columns of $\mathbf{Q}$ and $\mathbf{K}$ into $D$ groups such that $\mathbf{Q}^{(j)} \in \mathbb{R}^{n \times d/D}$ and $\mathbf{K}^{(j)} \in \mathbb{R}^{K \times d/D}$. Each group corresponds to one of the $D$ dimensions of the KD codes.
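With the Query and Key matrices, we compute each dimension of the $D$-dimensional discrete codes separately. The $j$-th dimension of the KD code for the $i$-th symbol is computed as follows.
$$\mathbf{C}_i^{(j)} = \arg\min_{k} \; \text{dist}\big(\mathbf{Q}_i^{(j)}, \mathbf{K}_k^{(j)}\big) \tag{1}$$
In other words, $\phi(\mathbf{Q}_i) = [\mathbf{C}_i^{(1)}, \ldots, \mathbf{C}_i^{(D)}]$, where $\mathbf{Q}_i^{(j)}$ is the $j$-th group of the Query row of symbol $i$ and $\mathbf{K}_k^{(j)}$ is the $j$-th group of the $k$-th Key row. The function $\text{dist}(\cdot, \cdot)$ computes a distance score between two vectors, which is used to decide which discrete code to take. Note that after training, we discard $\mathbf{Q}$ and $\mathbf{K}$, and only keep the codebook of KD codes $\mathbf{C}$ inferred from Eq. 1.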
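The code assignment of Eq. 1 can be sketched in a few lines of NumPy. This is a minimal illustration, assuming squared Euclidean distance as the distance function and contiguous column groups; the function name `kd_codes` and the toy sizes are hypothetical and not taken from the paper.

```python
import numpy as np

def kd_codes(query, key, num_groups):
    """Assign a K-way, D-dimensional code to every row of `query`.

    query: (n, d) raw embedding table; key: (K, d) key matrix.
    Returns an (n, D) integer array with entries in [0, K).
    Assumption: squared Euclidean distance plays the role of dist(., .).
    """
    n, d = query.shape
    num_codes = key.shape[0]
    assert d % num_groups == 0, "d must be divisible by D"
    sub = d // num_groups
    # Split the columns into D contiguous groups.
    q = query.reshape(n, num_groups, sub)        # (n, D, d/D)
    k = key.reshape(num_codes, num_groups, sub)  # (K, D, d/D)
    # Distance between every query and key sub-vector, per group: (n, K, D).
    dist = ((q[:, None, :, :] - k[None, :, :, :]) ** 2).sum(axis=-1)
    # Pick the closest key independently in each group.
    return dist.argmin(axis=1)

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 8))     # n = 6 symbols, d = 8
Kmat = rng.normal(size=(4, 8))  # K = 4 codes per group
codes = kd_codes(Q, Kmat, num_groups=2)  # D = 2
print(codes.shape)
```

Each row of `codes` is the KD code of one symbol; only this integer array needs to be stored per symbol after training.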
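We introduce the Value matrix $\mathbf{V} \in \mathbb{R}^{K \times d}$, which allows us to assign a learnable embedding to each code. We also split the columns of $\mathbf{V}$ into $D$ groups in the same way as $\mathbf{Q}$ and $\mathbf{K}$, i.e. $\mathbf{V}^{(j)} \in \mathbb{R}^{K \times d/D}$. Given the $j$-th dimension $\mathbf{C}_i^{(j)}$ of the KD code, we compute the corresponding part of the embedding vector as follows.
$$\mathbf{H}_i^{(j)} = \mathbf{V}^{(j)}_{\mathbf{C}_i^{(j)}} \tag{2}$$
In other words, the final embedding vector for symbol $i$ is $\mathbf{H}_i = [\mathbf{H}_i^{(1)}, \ldots, \mathbf{H}_i^{(D)}]$, which is a concatenation of vectors from the different groups. We note that this $\rho$ is a simplification of the ones used in [5, 17], which reduces the overhead and eases the optimization.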
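The decoding step of Eq. 2 amounts to one index lookup per group followed by concatenation. The sketch below illustrates this with NumPy; the function name `decode` and the small integer-valued Value matrix are hypothetical choices made so that the result is easy to read.

```python
import numpy as np

def decode(codes, value, num_groups):
    """Look up one sub-vector per group and concatenate them.

    codes: (n, D) integers in [0, K); value: (K, d) value matrix.
    Returns the (n, d) table of reconstructed embeddings.
    """
    num_codes, d = value.shape
    sub = d // num_groups
    v = value.reshape(num_codes, num_groups, sub)  # (K, D, d/D)
    group_idx = np.arange(num_groups)              # (D,)
    # For symbol i and group j, take v[codes[i, j], j].
    parts = v[codes, group_idx]                    # (n, D, d/D)
    return parts.reshape(codes.shape[0], d)

V = np.arange(4 * 6, dtype=float).reshape(4, 6)  # K = 4, d = 6
C = np.array([[0, 3, 1],
              [2, 2, 0]])                         # n = 2, D = 3
H = decode(C, V, num_groups=3)
print(H)
```

No matrix multiplication is involved, which is why the lookup cost stays close to that of a conventional embedding table.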
Figure 1 illustrates the proposed framework, which we dub Differentiable Product Quantization (DPQ).
Our framework decouples the vocabulary size from the number of continuous embedding weights by means of extra discrete KD codes, which can be stored compactly. Assuming the default 32-bit floating point is used, the original full embedding table with $n$ rows and $d$ columns requires $32nd$ bits. For our method with $K$-way $D$-dimensional codes, we have 1) a Value matrix that requires $32Kd$ or $32Kd/D$ bits, depending on whether or not we tie the weights among groups, and 2) KD codes that require an extra $nD\log_2 K$ bits. In our framework, only the compact discrete KD codes depend on the vocabulary size $n$.
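To make the storage accounting concrete, the following sketch computes the number of bits for the compressed representation and the resulting ratio to the full table. The function names are hypothetical, the example sizes (n = 10,000, d = 128, K = 32, D = 16) are illustrative rather than settings from the experiments, and rounding the per-code bit width up to an integer is an assumption for values of K that are not powers of two.

```python
import math

def compressed_bits(n, d, K, D, tie_groups=False, float_bits=32):
    """Bits needed at inference: Value matrix plus KD codes."""
    if tie_groups:
        # One (K, d/D) block is shared by all D groups.
        value_bits = float_bits * K * (d // D)
    else:
        value_bits = float_bits * K * d
    # Each of the n * D code entries needs log2(K) bits
    # (rounded up here, which is an assumption).
    code_bits = n * D * math.ceil(math.log2(K))
    return value_bits + code_bits

def compression_ratio(n, d, K, D, tie_groups=False, float_bits=32):
    """Bits of the full (n, d) table divided by the compressed bits."""
    full_bits = float_bits * n * d
    return full_bits / compressed_bits(n, d, K, D, tie_groups, float_bits)

untied = compression_ratio(10_000, 128, K=32, D=16)
tied = compression_ratio(10_000, 128, K=32, D=16, tie_groups=True)
print(round(untied, 2), round(tied, 2))
```

In this toy setting the code term dominates the Value matrix term, so tying the groups changes the ratio only moderately.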
During the inference, we want to retrieve the continuous embedding vector for a given symbol. As shown above, this is achieved by the retrieval of KD codes and their code embedding vectors, then the final embedding vector is obtained by concatenation. Since only indexing and concatenation are used, both the extra computation complexity and memory footprint are very small compared to the conventional full embedding (which directly indexes the continuous vector).
The proposed framework is general and leaves several concrete design choices open. Specifically, which distance function is used in Eq. 1? How can we compute gradients through the $\arg\min$ function in Eq. 1? Do we tie the Key matrix with the Value matrix? We introduce two instantiations that answer these questions with specific design choices.
Here we use the Euclidean distance function to compute the KD codes as follows.
$$\mathbf{C}_i^{(j)} = \arg\min_{k} \; \big\|\mathbf{Q}_i^{(j)} - \mathbf{K}_k^{(j)}\big\|^2$$
We also tie the Key and Value matrices, so they are in the same space, i.e. $\mathbf{V} = \mathbf{K}$.
The resulting model is similar to Vector Quantization (VQ) in [19], with the key difference that we split the space into orthogonal subspaces (groups). We name this model DPQ-VQ. Intuitively, the model uses the Query to search for the nearest neighbor in the Key/Value space, and outputs it as the embedding vector.
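Due to the $\arg\min$ operation, the resulting quantization function has no gradient with respect to its input $\mathbf{Q}$, so we use the straight-through estimator [1] to obtain a pseudo gradient. That is, we rewrite the quantization function as follows.
$$\mathcal{T}(\mathbf{Q}) = \mathbf{Q} - \text{sg}\big(\mathbf{Q} - \rho \circ \phi(\mathbf{Q})\big)$$
Here sg is the stop-gradient operator, which is an identity function in the forward pass but prevents gradients from back-propagating through its argument. So during the forward pass $\mathcal{T}(\mathbf{Q}) = \rho \circ \phi(\mathbf{Q})$, but during the backward pass we use the gradient of the identity $\mathcal{T}(\mathbf{Q}) = \mathbf{Q}$.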
The sg trick only computes a gradient for the Query matrix, so to update the tied Key/Value matrix we add, similar to [19], a regularization term $\mathcal{L}_{\text{reg}} = \sum_i \big\|\rho \circ \phi(\mathbf{Q}_i) - \text{sg}(\mathbf{Q}_i)\big\|^2$, which drives each entry of the Key/Value matrix toward the arithmetic mean of its members. Alternatively, one can use an Exponential Moving Average [12] to update the centroids.
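The two ingredients above can be checked numerically for a single group. The sketch below is an illustration under stated assumptions, not the training code: NumPy has no automatic differentiation, so the stop-gradient is only indicated in a comment, the gradient of the regularizer is written out by hand, and the names `vq_forward` and `reg_grad`, the learning rate, and the toy sizes are all hypothetical.

```python
import numpy as np

def vq_forward(q, centroids):
    """Nearest-centroid quantization for one group.

    q: (n, s) query sub-vectors; centroids: (K, s) tied Key/Value block.
    Returns (codes, quantized) with quantized[i] = centroids[codes[i]].
    """
    dist = ((q[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    codes = dist.argmin(axis=1)
    return codes, centroids[codes]

def reg_grad(q, centroids, codes):
    """Gradient of sum_i ||centroids[codes[i]] - q[i]||^2 w.r.t. centroids.

    The queries are treated as constants (the role of the stop-gradient).
    """
    diff = centroids[codes] - q
    grad = np.zeros_like(centroids)
    np.add.at(grad, codes, 2.0 * diff)  # accumulate per assigned centroid
    return grad

rng = np.random.default_rng(0)
q = rng.normal(size=(200, 4))
centroids = rng.normal(size=(3, 4))
codes, quantized = vq_forward(q, centroids)

# Forward value of Q - sg(Q - quantized): numerically just `quantized`.
# In an autodiff framework the parenthesized term would be wrapped in a
# stop-gradient so that the backward pass sees the identity in Q.
straight_through = q - (q - quantized)

# Gradient descent on the regularizer with the assignments held fixed.
counts = np.maximum(np.bincount(codes, minlength=3), 1)[:, None]
c = centroids.copy()
for _ in range(200):
    c -= 0.1 * reg_grad(q, c, codes) / counts
```

With the assignments fixed, each centroid that has at least one member converges to the mean of the queries assigned to it, which matches the k-means style interpretation of the regularizer.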
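Unlike DPQ-VQ, which uses Euclidean distance with tied Key/Value matrices, a different design choice is to use dot product and softmax for computing the proximity between Query and Key vectors, similar to [5]. Importantly, the Key and Value matrices are not shared, so they are in different latent spaces. Intuitively, the model decides the KD code by comparing the proximity of Query and Key in one latent space, and emits the output embedding from a different latent space. Specifically,
$$\mathbf{C}_i^{(j)} = \arg\max_{k} \; \big\langle \mathbf{Q}_i^{(j)}, \mathbf{K}_k^{(j)} \big\rangle,$$
where $\langle \cdot, \cdot \rangle$ denotes the dot product of two vectors. Due to the $\arg\max$, we cannot compute a gradient for this function. So we relax it with a softmax function with temperature $\tau$:
$$\tilde{\mathbf{C}}_{i,k}^{(j)} = \frac{\exp\big(\langle \mathbf{Q}_i^{(j)}, \mathbf{K}_k^{(j)} \rangle / \tau\big)}{\sum_{k'} \exp\big(\langle \mathbf{Q}_i^{(j)}, \mathbf{K}_{k'}^{(j)} \rangle / \tau\big)}$$
Note that $\tilde{\mathbf{C}}_i^{(j)}$ is now a probabilistic vector (i.e. a soft one-hot vector) instead of an integer $\mathbf{C}_i^{(j)}$. The two are related by $\mathbf{C}_i^{(j)} = \arg\max_k \tilde{\mathbf{C}}_{i,k}^{(j)}$, or equivalently $\text{one\_hot}(\mathbf{C}_i^{(j)}) = \lim_{\tau \to 0} \tilde{\mathbf{C}}_i^{(j)}$. With the KD code relaxed into a soft one-hot vector, we replace the index operation with a dot product to compute the output embedding vector as follows.
$$\mathbf{H}_i^{(j)} = \tilde{\mathbf{C}}_i^{(j)} \mathbf{V}^{(j)}$$
To compute discrete KD codes, we can set $\tau \to 0$, so that the softmax function becomes a spike concentrated on the $\mathbf{C}_i^{(j)}$-th dimension. This is equivalent to the $\arg\max$ operation, for which we cannot compute the gradient. To enable a pseudo gradient, we use different temperatures, i.e. $\tau \to 0$ in the forward pass but $\tau = 1$ in the backward pass. Such a quantization function can be expressed as follows.
$$\tilde{\mathbf{C}}_i^{(j)} = \tilde{\mathbf{C}}_{i,\tau=1}^{(j)} - \text{sg}\big(\tilde{\mathbf{C}}_{i,\tau=1}^{(j)} - \tilde{\mathbf{C}}_{i,\tau \to 0}^{(j)}\big)$$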
We name this model DPQ-SX.
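The effect of the temperature can be seen on a tiny hand-made example for one group. This is a sketch under stated assumptions: the matrices are made up so that the largest logit in each row is unambiguous, a small positive temperature stands in for the limit of zero temperature, and the stop-gradient is only described in a comment because NumPy does not differentiate.

```python
import numpy as np

def softmax(logits, tau):
    """Row-wise softmax with temperature tau (numerically stabilized)."""
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# One group with n = 3 query sub-vectors and K = 3 keys (toy values).
q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
key = np.array([[2.0, 0.0], [0.0, 3.0], [-1.0, -1.0]])
value = np.arange(9, dtype=float).reshape(3, 3)  # not tied to `key`

logits = q @ key.T                 # dot-product proximity, shape (n, K)
codes = logits.argmax(axis=1)      # hard KD code for this group
soft = softmax(logits, tau=1.0)    # relaxation used for the backward pass
hard = softmax(logits, tau=1e-3)   # small tau: effectively one-hot

# Forward value of soft - sg(soft - hard) is `hard`; with autodiff the
# parenthesized term would be wrapped in a stop-gradient, so gradients
# would flow through `soft` only.
forward = soft - (soft - hard)
h_soft_path = forward @ value      # dot product with the Value block
h_lookup = value[codes]            # plain index lookup
print(codes)
```

In the forward pass the dot product with the near one-hot vector reproduces the index lookup, which is why the two variants coincide at inference.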
At inference, DPQ-VQ and DPQ-SX are the same (i.e. a concatenation of the code embedding vectors selected by the learned codes); they mainly differ during training. DPQ-SX directly models the soft one-hot distribution (each symbol in the batch has a $K \times D$ matrix), which is more memory intensive, while DPQ-VQ only uses the nearest neighbor as an approximation, making it more scalable (to large K, D, and batch size). Since the Key and Value matrices are not tied in DPQ-SX, it offers more flexibility in the choice of dimensionality.
We conduct experiments on three different tasks, namely language modeling (LM), neural machine translation (NMT), and text classification (TextC). For LM, we test on the PTB and Wikitext-2 datasets. For NMT, we test on IWSLT15 in both the English-Vietnamese and Vietnamese-English directions. For TextC, we test on five datasets from [22], namely AG News, Yahoo! Answers, DBpedia, Yelp Polarity and Yelp Full. (Among the text classification datasets, Yahoo! Answers and AG News represent topic prediction, Yelp Polarity and Yelp Full represent sentiment analysis, and DBpedia represents ontology classification.) The detailed data statistics are shown in Table 1.
Table 1: Dataset statistics.
Task | Dataset | Vocab. size
LM | PTB | 10,000
LM | Wikitext-2 | 33,278
NMT | IWSLT15 (En-Vi) | 17,191
NMT | IWSLT15 (Vi-En) | 7,709
Text Classif. | AG News | 69,322
Text Classif. | Yahoo! Ans. | 477,522
Text Classif. | DBpedia | 612,530
Text Classif. | Yelp P | 246,739
Text Classif. | Yelp F | 268,414
We adopt existing architectures for these tasks and only replace the encoder embedding layer with the proposed method. For LM, we adopt the LSTM-based models from [21], which come in three different model sizes; for NMT, we adopt the seq2seq-based model from [14]; and for TextC, we use a model that resembles fasttext [11] and has one hidden layer after mean pooling of the word vectors.
Baselines. The main baselines we consider are 1) original full word embedding, 2) variants of KD code based methods from [5] and [17], including the following. Pretrain: pretrain and fix KD codes; E2E: end-to-end training without distillation guidance from pre-trained embedding table; and E2E-dist.: end-to-end training with distillation procedure.
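Ablations. For the proposed method, we tune and compare two sets of hyper-parameters that trade off compression ratio against task performance: 1) the size of the KD codes, i.e. $K$ and $D$, and 2) whether or not to share/tie the groups in the Key/Value matrices, i.e. setting $\mathbf{K}^{(j)} = \mathbf{K}^{(j')}$ and $\mathbf{V}^{(j)} = \mathbf{V}^{(j')}$ for all groups $j, j'$.
Metrics. The effectiveness metric is task-specific: perplexity for LM, BLEU score for NMT, and accuracy for TextC. To evaluate the (compression) efficiency of the encoder embedding table, we adopt the compression ratio (CR), defined as the number of bits used by the full embedding table divided by the number of bits used by the compressed model at inference. Based on 32-bit floating point, with vocabulary size $n$, embedding dimension $d$, and $K$-way $D$-dimensional codes without tying among groups, it can be computed as follows.
$$\text{CR} = \frac{32nd}{nD\log_2 K + 32Kd}$$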
Table 2: Perplexity (PPL) and compression ratio (CR) on PTB language modeling for three model sizes.
Method | Small PPL | Small CR | Medium PPL | Medium CR | Large PPL | Large CR
Full | 114.5 | 1 | 83.4 | 1 | 78.7 | 1
Pre-train | 108.0 | 4.8 | 84.9 | 11.7 | 80.7 | 18.5
E2E | 108.5 | 4.8 | 89.0 | 11.7 | 86.4 | 18.5
E2E-dist. | 107.8 | 4.8 | 83.1 | 11.7 | 77.7 | 18.5
DPQ-SX | 105.8 | 85.5 | 82.0 | 82.9 | 78.5 | 238.3
DPQ-VQ | 106.5 | 51.1 | 83.3 | 58.7 | 79.5 | 238.3
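Table 3: Comparison of DPQ variants with the full embedding baseline on different tasks.
Task | Dataset | Metric | Full | DPQ-SX | DPQ-VQ
LM | PTB | PPL | 83.38 | 83.17 | 83.27
LM | PTB | CR | 1 | 163.18 | 58.67
LM | Wikitext-2 | PPL | 95.61 | 94.94 | 95.92
LM | Wikitext-2 | CR | 1 | 59.25 | 59.25
NMT | IWSLT15 (En-Vi) | BLEU | 25.4 | 25.3 | 25.3
NMT | IWSLT15 (En-Vi) | CR | 1 | 86.17 | 16.13
NMT | IWSLT15 (Vi-En) | BLEU | 23.0 | 23.1 | 22.5
NMT | IWSLT15 (Vi-En) | CR | 1 | 72.00 | 14.05
Text Classification | AG News | Acc(%) | 92.59 | 92.49 | 92.55
Text Classification | AG News | CR | 1 | 19.26 | 23.95
Text Classification | Yahoo! Ans. | Acc(%) | 69.41 | 69.62 | 69.15
Text Classification | Yahoo! Ans. | CR | 1 | 48.16 | 19.24
Text Classification | DBpedia | Acc(%) | 98.12 | 98.13 | 98.14
Text Classification | DBpedia | CR | 1 | 24.08 | 38.45
Text Classification | Yelp P | Acc(%) | 93.92 | 94.17 | 93.91
Text Classification | Yelp P | CR | 1 | 38.52 | 24.04
Text Classification | Yelp F | Acc(%) | 60.33 | 60.10 | 60.22
Text Classification | Yelp F | CR | 1 | 48.16 | 24.05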
Table 2 compares our method with existing methods in terms of perplexity (PPL) and compression ratio (CR) on three different model sizes for LM using PTB. We find that 1) both the Pre-train and E2E baselines achieve good compression ratios, but at the cost of worse perplexity for the Medium and Large models, 2) the E2E-dist. baseline achieves the same CR without sacrificing performance, but it requires an extra distillation procedure, and 3) our methods (DPQ-SX and DPQ-VQ), without any distillation procedure, achieve much better CR than the baselines and at the same time obtain the smallest PPL in almost all cases.
Table 3 provides a comparison among variants of the proposed method on different tasks. For LM, we show the evaluation on the medium-sized LSTM. We find that: 1) the proposed methods achieve compression ratios of 14-163× while obtaining comparable or even better performance than the full embedding baseline; 2) in 7 out of 9 datasets, DPQ-SX provides a better compression ratio or performance than DPQ-VQ.
Figure 3 shows the trade-off between CR and performance for DPQ variants: DPQ-SX vs. DPQ-VQ, and share Key/Value between groups vs. not share. We find that: 1) for LM, sharing is better, 2) for NMT, not sharing is better, and 3) for TextC, it is beneficial to share among groups for DPQ-SX, and not to share for DPQ-VQ.
Figure 2 shows the performance/CR heatmaps under different $K$ and $D$ on PTB and IWSLT15 (En-Vi). We find that a large $K$ with a small $D$ is a bad trade-off, while a small $K$ with a large $D$ seems better. An intermediate setting of $K$ and $D$ yields the best trade-off between performance and CR. Furthermore, we find that when K is small and D is large (i.e. the nearest neighbor approximation is not reliable), DPQ-SX performs much better than DPQ-VQ.
More experimental results can be found in the appendix.
Modern neural networks are very large and redundant, and the compression of such models has attracted many research efforts [7, 8, 4]. Most of these compression techniques focus on weights that are shared among many examples, such as those of convolutional or dense layers [8, 4]. Embedding layers are different in the sense that they are tabular and very sparsely accessed, i.e. pruning cannot remove a row/symbol from the embedding table, and only a few symbols are accessed in each data sample. This makes the compression challenges different for the embedding layer. There is existing work on compressing embedding layers [17, 5]. Our work generalizes the methods in [17, 5] to a new DPQ framework and improves the compression without resorting to a distillation process.
Our work differs from traditional quantization techniques [10] in that it can be trained in an end-to-end fashion. The idea of utilizing multiple orthogonal subspaces/groups for quantization is used in product quantization [10, 16] and multi-head attention [20]. Our work also resembles the Transformer [20], with attention being discrete and the Key/Value matrices being internal parameters/memory (instead of hidden states of the input sequence).
The two instantiations of our model also share similarities with Gumbel-softmax [9] and VQ-VAE [19]. However, we do not find using the stochastic noises (as in Gumbel-softmax) useful since we aim to get deterministic codes. It is also worth pointing out that these techniques [9, 19] by themselves cannot be directly applied for compression, while our DPQ framework enables it.
In this work, we propose a novel and general differentiable product quantization framework for embedding table compression. We give two instantiations of our framework, which can serve as an efficient drop-in replacement for an existing embedding layer. Empirically, we evaluate the proposed method on three different language tasks (nine datasets), and show that it surpasses the state of the art and can compress the embedding table by up to 238 times without loss of performance. We believe we are the first to show that such a layer can be trained in an end-to-end fashion without distillation. In the future, we want to apply this technique to a wider range of applications and architectures, and to better understand the effectiveness of the proposed framework.
We would like to thank Koyoshi Shindo for helpful discussions.
Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
International Conference on Machine Learning.
MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
Cartesian k-means. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3017–3024.
Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems, pp. 926–934.
Figure 4 shows the trade-off between CR and performance for DPQ variants: DPQ-SX vs. DPQ-VQ, and share Key/Value between groups vs. not share. We find that: 1) for LM, sharing is better, 2) for NMT, not sharing is better, and 3) for TextC, it is beneficial to share among groups for DPQ-SX, and not to share for DPQ-VQ.