Learning K-way D-dimensional Discrete Codes for Compact Embedding Representations

06/21/2018, by Ting Chen, et al.

Conventional embedding methods directly associate each symbol with a continuous embedding vector, which is equivalent to applying a linear transformation based on a "one-hot" encoding of the discrete symbols. Despite its simplicity, such an approach yields a number of parameters that grows linearly with the vocabulary size and can lead to overfitting. In this work, we propose a much more compact K-way D-dimensional discrete encoding scheme to replace the "one-hot" encoding. In the proposed "KD encoding", each symbol is represented by a D-dimensional code with a cardinality of K, and the final symbol embedding vector is generated by composing the code embedding vectors. To learn semantically meaningful codes end-to-end, we derive a relaxed discrete optimization approach based on stochastic gradient descent, which can be applied to any differentiable computational graph with an embedding layer. In our experiments with various applications, from natural language processing to graph convolutional networks, the total size of the embedding layer can be reduced by up to 98% while achieving similar or better performance.




1 Introduction

Embedding methods, such as word embedding (Mikolov et al., 2013; Pennington et al., 2014), have become pillars in many applications when learning from discrete structures. Examples include language modeling (Kim et al., 2016), machine translation (Sennrich et al., 2015), text classification (Zhang et al., 2015b), knowledge graph and social network modeling (Bordes et al., 2013; Chen & Sun, 2017), and many others (Kipf & Welling, 2016; Chen et al., 2016). The objective of the embedding module in neural networks is to represent a discrete symbol, such as a word or an entity, with a continuous embedding vector. At first glance this seems to be a trivial problem: we can directly associate each symbol with a learnable embedding vector, as is done in existing work. To retrieve the embedding vector of a specific symbol, an embedding table lookup operation is performed. This is equivalent to the following: first we encode each symbol v with a "one-hot" encoding vector b ∈ {0,1}^N whose v-th entry is 1 (N is the total number of symbols), and then generate the embedding vector by simply multiplying the "one-hot" vector with the embedding matrix W ∈ R^{N×d}, i.e. e_v = W^T b.
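For concreteness, the equivalence between a table lookup and a linear transformation of the "one-hot" code can be sketched in NumPy (sizes and values below are illustrative, not from the paper):

```python
import numpy as np

# Illustrative sketch: an embedding lookup is equivalent to multiplying a
# one-hot row vector with the embedding matrix W.
N, d = 5, 3                      # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
W = rng.standard_normal((N, d))  # embedding matrix, one row per symbol

v = 2                            # index of the symbol to embed
one_hot = np.zeros(N)
one_hot[v] = 1.0

lookup = W[v]                    # table lookup
matmul = one_hot @ W             # linear transformation of the one-hot code

assert np.allclose(lookup, matmul)
```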

Despite the simplicity of this "one-hot" encoding based embedding approach, it has several issues. The major issue is that the number of parameters grows linearly with the number of symbols. This becomes very challenging when there are millions or billions of entities in the database, or when there are many symbols with only a few observations each (e.g. under Zipf's law). There is also redundancy in the parameterization, considering that many symbols are actually similar to each other. This over-parameterization can further lead to overfitting, and it requires a lot of memory, which prevents the model from being deployed to mobile devices. Another issue is purely from the code space utilization perspective, where we find "one-hot" encoding extremely inefficient: its code space utilization rate N/2^N approaches zero as N grows, while a D-dimensional discrete coding system with cardinality K per dimension can effectively represent K^D symbols.

To address these issues, we propose a novel and much more compact coding scheme to replace the "one-hot" encoding. In the proposed approach, we use a K-way D-dimensional code to represent each symbol: each code has D dimensions, and each dimension has a cardinality of K. For example, the concept of cat may be encoded as (5-1-3-7), and the concept of dog may be encoded as (5-1-3-9). The code allocation for each symbol is learned from data and specific tasks so that the codes capture the semantics of symbols, with similar codes reflecting similar meanings. While we mainly focus on the encoding of symbols in this work, the learned discrete codes can have broader applications, such as information retrieval. We dub the proposed encoding scheme "KD encoding".

The KD code system is much more compact than its "one-hot" counterpart. To represent a set of symbols of size N, the "KD encoding" only requires K^D ≥ N, i.e. D ≥ log_K N. By increasing K or D a small amount, we can easily achieve K^D ≫ N, in which case the code is still much more compact and D remains O(log_K N). Consider K = 2: the utilization rate of "KD encoding" is N/2^D, which is 2^(N−D) times higher than that of its "one-hot" counterpart, whose rate is N/2^N ("one-hot" encoding utilizes N out of 2^N possible codes, while "KD encoding" utilizes N out of 2^D).
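The minimal code length follows directly from the constraint K^D ≥ N; a small sketch (vocabulary sizes are illustrative):

```python
import math

# Hedged illustration: the minimal code length D for a K-way code to cover
# N symbols is ceil(log_K N), since K**D must be at least N.
def min_code_length(N, K):
    return math.ceil(math.log(N) / math.log(K))

N = 1_000_000
for K in (2, 16, 32):
    D = min_code_length(N, K)
    # D is the smallest length whose code space covers the vocabulary
    assert K ** D >= N and K ** (D - 1) < N
```

For a million symbols, a 2-way code needs only 20 dimensions, versus a million-dimensional "one-hot" vector.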

The compactness of the code can be translated into compactness of the parametrization. Instead of storing a giant embedding matrix of symbol embeddings, we leverage the semantic similarities between symbols: each symbol embedding vector is generated by composing a much smaller number of code embedding vectors. This is achieved as follows: first we embed each of the D components of a "KD code" into a code embedding vector in R^{d'}, and then apply an embedding transformation function to generate the final symbol embedding. By adopting the new approach, we can reduce the number of embedding parameters from O(N·d) to O(K·D·d' + θ_net), where d' is the code embedding size and θ_net is the number of neural network parameters in the transformation function.
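A back-of-envelope comparison of the two parameter counts (the concrete N, d, K, D values below are illustrative assumptions, not the paper's settings):

```python
# Rough parameter-count comparison: a full embedding stores N*d floats; KD
# encoding stores K*D code-embedding vectors of size dp, plus the
# composition network's parameters.
def full_embedding_params(N, d):
    return N * d

def kd_embedding_params(K, D, dp, net_params):
    return K * D * dp + net_params

N, d = 10_000, 300
K, D, dp = 32, 32, 300
full = full_embedding_params(N, d)               # 3,000,000 parameters
kd = kd_embedding_params(K, D, dp, net_params=0) # 307,200 parameters

assert kd < 0.11 * full   # about 90% fewer embedding parameters
```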

Due to the discreteness of the code allocation problem, it is very challenging to learn, in an end-to-end fashion, meaningful discrete codes that exploit the similarities among symbols according to a target task. A compromise is to learn the codes from a trained embedding matrix, and then fix the codes during the stage of task-specific training. While this has been shown to work relatively well in previous work (Chen et al., 2017; Shu & Nakayama, 2017), it produces a sub-optimal solution and requires a multi-stage procedure that is hard to tune. In this work, we derive a relaxed discrete optimization approach based on stochastic gradient descent (SGD), and propose two guided methods to assist the end-to-end code learning. To validate our idea, we conduct experiments on three different tasks, from natural language processing to graph convolutional networks for semi-supervised node classification. We achieve a 95% reduction of the embedding model size in the language modeling task and 98% in text classification, with similar or better performance.

2 The K-way D-dimensional Discrete Encoding Framework

In this section, we introduce the "KD encoding" framework in detail.

2.1 Problem Formulation

Symbols are represented with a vocabulary V = {v_1, …, v_N}, where v_i corresponds to the i-th symbol. We aim to learn a transformation function T: V → R^d that maps a symbol to a continuous embedding vector e ∈ R^d. In the conventional embedding method, T is a linear transformation of the "one-hot" code of a symbol.

To measure the fitness of T, we consider a differentiable computational graph G that takes discrete symbols as input and outputs predictions, such as a text classification model based on word embeddings. We also assume a task-specific loss function L is given. The task-oriented learning of T is to learn T such that the task loss is minimized, i.e. argmin_{T, θ} L(G(T(v); θ)), where θ are task-specific parameters.

2.2 The “KD Encoding” Framework

In the proposed framework, each symbol is associated with a K-way D-dimensional discrete code. We denote the discrete code for the i-th symbol as c_i = (c_i^1, …, c_i^D), where each code component c_i^j takes a value in {1, …, K}. To connect symbols with discrete codes, a code allocation function φ: V → {1, …, K}^D is used. The learning of this mapping function will be introduced later; once fixed, it can be stored as a hash table for fast lookup. Since a discrete code has D dimensions, we cannot directly use a single embedding lookup to find the symbol embedding as in "one-hot" encoding. Instead, we learn an adaptive code composition function F that takes a discrete code and generates a continuous embedding vector, i.e. F: {1, …, K}^D → R^d. The details of F will be introduced in the next subsection. In sum, the "KD encoding" framework computes the embedding of a symbol v as F(φ(v)), with the "KD code" allocation function φ and the composition function F, as illustrated in Figure 1(a) and 1(b).

Figure 1: (a) The conventional symbol embedding based on “one-hot” encoding. (b) The proposed KD encoding scheme. (c) and (d) are examples of embedding transformation functions by DNN and RNN used in the “KD encoding” when generating the symbol embedding from its code.

In order to uniquely identify every symbol, we only need K^D = N, as we can then assign a distinct code to each symbol. When this holds, the code space is fully utilized, and no symbol can change its code without affecting other symbols. We call this type of code system a compact code. The optimization problem for compact codes can be very difficult, and usually requires approximate combinatorial algorithms such as graph matching (Li et al., 2016). Realizing the difficulties in optimization, we propose to adopt a redundant code system with K^D ≫ N; namely, there are a lot of "empty" codes with no symbol associated. Changing the code of one symbol then rarely affects other symbols, since the random collision probability can be very small (for example, for a billion symbols with a code space much larger than the square of the vocabulary size, the probability of no collision at all under a random code assignment can be as high as 99.5%), which makes the system easier to optimize. The redundant code can be achieved by slightly increasing K or D, thanks to the exponential nature of their relation to N. Therefore, for both compact and redundant codes, the code length only needs to be D = O(log_K N).
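The collision argument follows the standard birthday-problem bound; a hedged sketch (the code-space sizes below are illustrative assumptions):

```python
import math

# With N symbols assigned uniformly at random into a code space of size
# K**D, the probability of no collision is approximately
# exp(-N*(N-1) / (2 * code_space))  (birthday-problem approximation).
def no_collision_prob(N, code_space):
    return math.exp(-N * (N - 1) / (2 * code_space))

# A code space quadratically larger than the vocabulary makes collisions rare.
assert no_collision_prob(10_000, 10_000 ** 2) > 0.6
assert no_collision_prob(10_000, 100 * 10_000 ** 2) > 0.99
```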

2.3 Discrete Code Embedding

As mentioned above, given the learned allocation function φ and the i-th symbol v_i, we can retrieve its code via a lookup, i.e. c_i = φ(v_i). In order to generate the composite embedding vector e_i, we adopt an adaptive code composition function F. To do so, we first embed the code c_i into a sequence of code embedding vectors (W^1_{c_i^1}, …, W^D_{c_i^D}), and then apply a transformation f to generate e_i. Here W^j ∈ R^{K×d'} is the code embedding matrix for the j-th code dimension (W^j_k denotes its k-th row), and f is the embedding transformation function that maps the code embedding vectors to the symbol embedding vector. The choice of f is very flexible and varies from task to task. In this work, we consider two types of embedding transformation functions.

The first is based on a linear transformation: e_i = H [W^1_{c_i^1}; …; W^D_{c_i^D}], where H ∈ R^{d×Dd'} is a transformation matrix for matching the dimensions. While this is simple and efficient, due to its linear nature, the capacity of the generated symbol embedding may be limited when K or the code embedding dimension d' is small.

The second type of embedding transformation function is nonlinear; here we introduce one based on a recurrent neural network, in particular an LSTM (Hochreiter & Schmidhuber, 1997). That is, we have (h^1, …, h^D) = LSTM(W^1_{c_i^1}, …, W^D_{c_i^D}) (see supplementary for details). The final symbol embedding is computed by summing the LSTM outputs over all code dimensions (using a linear layer to match dimensions if d ≠ d'), i.e. e_i = Σ_j h^j. Figure 1(c) and 1(d) illustrate the above two embedding transformation functions.
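As a concrete sketch of a composition function, the simplest linear variant with no hidden layer keeps d' = d and just sums the selected code embedding vectors (the per-dimension tables and sizes below are illustrative assumptions):

```python
import numpy as np

# Minimal sketch of a composition function F: each of the D code dimensions
# has its own K x d' embedding table; the symbol embedding is the sum of the
# D selected code embedding vectors (the no-hidden-layer linear variant).
K, D, dp = 4, 3, 5
rng = np.random.default_rng(1)
code_tables = [rng.standard_normal((K, dp)) for _ in range(D)]

def compose(code):
    # code is a length-D tuple with entries in {0, ..., K-1}
    return sum(code_tables[j][code[j]] for j in range(D))

e_cat = compose((3, 0, 2))
e_dog = compose((3, 0, 1))   # similar codes share component embeddings
assert e_cat.shape == (dp,)
# the two embeddings differ only in the contribution of the last dimension
assert np.allclose(e_cat - e_dog, code_tables[2][2] - code_tables[2][1])
```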

2.4 Analysis of the Proposed “KD Encoding”

To measure the parameter and model size reduction, we first introduce two definitions as follows.

Definition 1.

(Embedding parameters) The embedding parameters are the parameters used in the code composition function F. Specifically, they include the code embedding matrices {W^j}, as well as the other parameters used in the embedding transformation function f.

It is worth noting that we do not explicitly include the code as embedding parameters. This is consistent with not counting "one-hot" codes as parameters; also, in some cases the codes are not adaptively learned but, for example, hashed from symbols (Svenstrup et al., 2017). However, when we export the model to embedded devices, the storage of discrete codes does occupy space. Hence, we introduce another concept below to take this into consideration as well.

Definition 2.

(Embedding layer’s size) The embedding layer’s size is the number of bits used to store both embedding parameters as well as the discrete codes.

Lemma 1.

The number of embedding parameters used in KD encoding is O(K·D·d' + θ_net), where θ_net is the number of parameters of the neural nets in the composition function.

The proof is given in the supplementary material.

For the analysis of the embedding layer's size under "KD encoding", we assume that 32-bit floating point numbers are used. The total number of bits used by the "KD encoding" is N·D·log2(K) + 32·(K·D·d' + θ_net), consisting of both the code size and the size of the embedding parameters. Compared to the total size of a conventional full embedding model, which is 32·N·d bits, this can still be a huge saving of model space, especially when N and d are large.
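The size accounting above can be sketched directly (the N, d, K, D values are illustrative assumptions, 32-bit floats assumed as in the text):

```python
import math

# Back-of-envelope model-size estimate: KD encoding stores the codes
# (N * D * log2(K) bits) plus 32-bit embedding parameters; a full embedding
# stores 32 * N * d bits.
def kd_bits(N, K, D, dp, net_params=0):
    return N * D * math.log2(K) + 32 * (K * D * dp + net_params)

def full_bits(N, d):
    return 32 * N * d

N, d = 10_000, 300
assert kd_bits(N, K=32, D=32, dp=300) < 0.2 * full_bits(N, d)
```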

Here we provide a theoretical connection between the proposed "KD encoding" and the SVD, or low-rank factorization, of the embedding matrix. We consider the scenario where the composition function is a linear function with no hidden layer, that is, e_i = Σ_j W^j_{c_i^j}.

Proposition 1.

A linear composition function with no hidden layer is equivalent to a sparse binary low-rank factorization of the embedding matrix.

The proof is also provided in the supplementary material. The overall idea is that the "KD code" mimics a 1-out-of-K selection within each of the D groups.

The computation overhead introduced by the linear composition is very small compared to the downstream neural network computation (without a hidden layer in the linear composition function, we only need to sum up D vectors). However, the expressiveness of the linear factorization is limited by the number of bases, or the rank of the factorization, which is determined by K and D. The use of a non-linear composition function can largely increase the expressiveness of the composite embedding matrix and may be an appealing alternative; this is shown by Proposition 2 in the supplementary.

3 End-to-End Learning of the Discrete Code

In this section, we propose methods for learning task-specific “KD codes”.

3.1 Continuous Relaxation for Discrete Code Learning

As mentioned before, we want to learn the symbol-to-embedding-vector mapping function to minimize the target task loss. This includes optimizing both the code allocation function φ and the code composition function F. While F is differentiable w.r.t. its parameters θ_c, the allocation function φ is very challenging to learn due to the discreteness and non-differentiability of the codes.

Specifically, we are interested in solving the following optimization problem,

argmin_{{c_i}, θ_c, θ} L(G(f(W^1_{c_i^1}, …, W^D_{c_i^D}); θ))

where f is the embedding transformation function mapping code embeddings to the symbol embedding, θ_c contains the code embeddings and the composition parameters, and θ denotes other task-specific parameters.

We assume the above loss function is differentiable w.r.t. the continuous parameters, including the embedding parameters θ_c and the other task-specific parameters θ, so they can be optimized by standard stochastic gradient descent and its variants (Kingma & Ba, 2014). However, each c_i is a discrete code and cannot be directly optimized via SGD like the other parameters. In order to adopt a gradient-based approach and learn the discrete codes in an end-to-end fashion, we derive a continuous relaxation of the discrete code to approximate the gradient effectively.

We start by making the observation that each code c can be seen as a concatenation of D "one-hot" vectors, i.e. c = (o^1, …, o^D), where o^j ∈ {0,1}^K and Σ_k o^j_k = 1, with o^j_k the k-th component of o^j. To make it differentiable, we relax each o^j from a "one-hot" vector to a continuous vector ô^j by applying a tempering Softmax:

ô^j_k = exp(π^j_k / τ) / Σ_{k'} exp(π^j_{k'} / τ)

where π^j are learnable code logits and τ is a temperature term; as τ → 0, this approximation becomes exact (except for the case of ties). We show this approximation effect for K = 2 in Figure 2(a). Similar techniques have been introduced in the Gumbel-Softmax trick (Jang et al., 2016; Maddison et al., 2016).
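A small numerical sketch of the tempering Softmax (the logit values are illustrative):

```python
import numpy as np

# Tempering-Softmax relaxation: as the temperature tau shrinks,
# softmax(logits / tau) approaches the one-hot argmax vector; as tau grows,
# the output approaches the uniform distribution.
def tempering_softmax(logits, tau):
    z = (logits - logits.max()) / tau   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1.0, 2.5, 0.3])
hot = tempering_softmax(logits, tau=0.01)
warm = tempering_softmax(logits, tau=10.0)

assert hot.argmax() == 1 and hot[1] > 0.999   # nearly one-hot
assert abs(warm.max() - warm.min()) < 0.1     # nearly uniform
```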

Figure 2: The effects of the temperature τ on the output probability of the Softmax and its entropy (when K = 2). (a) The output probabilities. (b) The entropy. As τ decreases, the probabilistic output approximates a step function when K = 2, and generally a "one-hot" vector.

Since ô^j is continuous (given τ is not approaching 0), instead of learning the discrete code assignment directly, we learn ô^j as an approximation to o^j. To do so, we adjust the code logits π^j using SGD and gradually decrease the temperature τ during training. Since the indexing operator used to retrieve code embedding vectors, i.e. W^j_{c^j}, is non-differentiable, to generate the embedding vector for the j-th code dimension we instead use an affine transformation, i.e. (ô^j)^T W^j, which enables the gradient to flow backwards normally.

It is easy to see that the control of the temperature can be important. When τ is too large, the output is close to uniform, which is far from the desired "one-hot" vector o^j. When τ is too small, slight differences between the logits are greatly magnified. Also, the gradient vanishes when the Softmax output approaches a "one-hot" vector, i.e. when it is too confident. A "right" schedule of the temperature can thus be crucial. While we could handcraft a good temperature schedule, we also observe that the temperature is closely related to the entropy of the output probabilistic vector, as shown in Figure 2(b), where the same set of random logits produces probabilities of different entropies as τ varies. This motivates us to implicitly control the temperature by regularizing the entropy of the model. To do so, we add the entropy regularization term Σ_{i,j} H(ô_i^j), where H(p) = −Σ_k p_k log p_k. A large penalty on this regularization term encourages a small entropy for the relaxed codes, i.e. a more spiky distribution.
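The entropy regularizer can be sketched numerically (probability vectors are illustrative): a near-one-hot vector has low entropy, a uniform one has high entropy, so penalizing entropy pushes the relaxed codes toward discrete decisions.

```python
import numpy as np

# Entropy of a probability vector; penalizing it implicitly plays the role
# of lowering the Softmax temperature (spikier output = lower entropy).
def entropy(p, eps=1e-12):
    return -np.sum(p * np.log(p + eps))

spiky = np.array([0.98, 0.01, 0.01])   # close to one-hot
flat = np.array([1 / 3, 1 / 3, 1 / 3]) # uniform

assert entropy(spiky) < entropy(flat)
```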

Up to this point, we still use the continuous relaxation ô to approximate o during training. At inference time, we only use discrete codes. This discrepancy between the continuous codes used in training and the discrete codes used in inference is undesirable. To close the gap, we take inspiration from the Straight-Through Estimator (Bengio et al., 2013). In the forward pass, instead of using the relaxed tempering Softmax output ô^j, which is likely a smooth continuous vector, we take its argmax and turn it into a "one-hot" vector, i.e. o^j = one_hot(argmax_k ô^j_k), which recovers a discrete code.

We interpret the use of the straight-through estimator as using different temperatures in the forward and backward passes. In the forward pass, τ → 0 is used, for which we simply apply the argmax operator. In the backward pass (to compute the gradient), it pretends that a larger τ was used. Compared to using the same temperature in both passes, this always outputs a "one-hot" discrete code o^j, which closes the above gap between training and inference.
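A minimal NumPy sketch of the straight-through forward pass (names are illustrative; in an autodiff framework the `(one_hot - soft)` term would be wrapped in a stop-gradient so the backward pass sees only the soft output):

```python
import numpy as np

# Straight-through sketch: the forward value is the hard one-hot code, while
# the quantity carrying gradient is the relaxed softmax output. In autodiff
# frameworks this is written as:  o = stop_gradient(one_hot - soft) + soft
# Here we only verify the forward-pass identity.
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([0.2, 1.7, -0.5])
soft = softmax(logits)
one_hot = np.eye(len(logits))[soft.argmax()]

st_forward = (one_hot - soft) + soft   # what the forward pass computes
assert np.allclose(st_forward, one_hot)
```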

The training procedure is summarized in Algorithm 1, in which the stop_gradient operator will prevent the gradient from back-propagating through it.

  Parameters: code logits π, code embedding matrices {W^j}, transformation parameters θ_f, and other task-specific parameters θ.
  for each mini-batch do
     for j = 1 to D do
        ô^j ← Softmax(π^j / τ)
        o^j ← one_hot(argmax_k ô^j_k)
        use õ^j ← stop_gradient(o^j − ô^j) + ô^j to compose the embedding
     end for
     A step of SGD on π, {W^j}, θ_f, θ to reduce the task loss L
  end for
Algorithm 1: An epoch of code learning via the Straight-Through Estimator with Tempering Softmax.

3.2 Code Learning with Guidances

It is not surprising that the optimization problem is more challenging for learning discrete codes than for learning conventional continuous embedding vectors, due to the discreteness of the problem (which can be NP-hard). This can lead to suboptimal solutions in which the discrete codes are not competitive. Therefore, we propose to use guidance from continuous embedding vectors to mitigate the problem. The basic idea is that, instead of adjusting codes only according to noisy gradients from the end task as shown above, we also require the embedding vectors composed from the codes to mimic continuous embedding vectors, which can be either jointly trained (online distillation guidance) or pre-trained (pre-trained distillation guidance). The continuous embeddings can provide better signals for both the code learning and the rest of the neural network, subsequently improving training.

Online Distillation Guidance (ODG).

Good learning progress in the code allocation function φ can be important for the rest of the neural network to learn. For example, it is hard to imagine that we can train a good model based on "KD codes" if the code allocation is of poor quality. However, the learning of φ also depends on the rest of the network to provide good signals.

Based on this observation, we propose to associate a regular continuous embedding vector u_i with each symbol during training, and we want the "KD encoding" function to mimic the continuous embedding vectors, while both are simultaneously optimized for the end task. More specifically, during training, instead of using the embedding vector e_i generated from the code, we use a dropout average of the two, i.e.

ē_i = B·u_i + (1 − B)·e_i

where B is a Bernoulli random variable selecting between the regular embedding vector and the KD embedding vector. When B is turned on with a relatively high probability (e.g. 0.7), even if φ is difficult to learn, u_i can still be learned to assist the improvement of the task-specific parameters θ, which in turn helps code learning. During inference, we only use e_i as the output embedding. This choice can lead to a gap between training and generalization errors; hence, we add a regularization loss during training that encourages e_i to match u_i (the gradient through u_i is stopped, to prevent the regular embedding vectors, which have too much freedom, from being dragged toward e_i).
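A hedged sketch of the training-time mixture (the 0.7 probability follows the text; the vectors are illustrative):

```python
import numpy as np

# Online distillation guidance, training-time sketch: a Bernoulli selector
# mixes the regular embedding u and the KD-composed embedding e; at
# inference only e is used.
rng = np.random.default_rng(0)

def mixed_embedding(u, e, p_regular=0.7):
    b = rng.random() < p_regular   # Bernoulli selector B
    return u if b else e

u = np.ones(4)    # stands in for the regular embedding vector
e = np.zeros(4)   # stands in for the KD-composed embedding vector
draws = [mixed_embedding(u, e)[0] for _ in range(2000)]

# the regular embedding is selected roughly 70% of the time
assert 0.6 < sum(draws) / len(draws) < 0.8
```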

Pre-trained Distillation Guidance (PDG).

Figure 3: Online Distillation Guidance. Dashed lines denote regularization; the dotted line in the middle denotes sharing of the transformation function.

It is important to close the gap between training and inference in the online distillation guidance process, but unfortunately this can still be difficult. Alternatively, we can adopt pre-trained continuous embedding vectors as guidance. Instead of training the continuous embedding vectors alongside the discrete codes, we utilize a pre-trained continuous embedding matrix produced by the same model with conventional continuous embedding vectors. During the end-to-end training of the codes (as well as other parameters), we ask the composite embedding vector generated by "KD encoding" to mimic the given embedding vector by minimizing the distance between them.

Furthermore, we can build an auto-encoder of the pre-trained continuous embedding vectors, and use both the continuous embedding vectors and the code logits as guidance. In the encoding pass, a transformation function maps a pre-trained embedding vector to code logits. In the decoding pass, the auto-encoder uses the same transformation function f as the "KD encoding" to reconstruct the embedding vector; its loss function is the reconstruction error between the pre-trained embedding vector and this reconstruction.

To follow the guidance of the pre-trained embedding matrix, we ask the code logits π_i and the composite symbol embedding f(c_i) (here we overload f by noting that the code c_i can be turned into "one-hot" vectors, so the composite embedding can be written as f(c_i)) to mimic their counterparts in the auto-encoder, by penalizing the distances between them.

During training, both guidance losses are added to the task-specific loss function and trained jointly. The method is illustrated in Figure 3.

Here we also make a distinction between pre-trained distillation guidance (PDG) and pre-training of the codes. Firstly, PDG learns the codes end-to-end to optimize the task loss, while pre-trained codes are fixed during task learning. Secondly, the PDG training procedure is much easier, especially for tuning the discrete code learning, while pre-training of codes requires three stages and is unfriendly to parameter tuning.

4 Experiments

In this section, we conduct experiments to validate the proposed approach. Since the proposed "KD encoding" can be applied to various tasks and applications that involve embedding layers, we choose three important tasks for evaluation: (1) language modeling, (2) text classification, and (3) graph convolutional networks for semi-supervised node classification. For detailed descriptions of these tasks and other applications of our method, we refer readers to the supplementary material.

For the language modeling task, we test on the widely used English Penn Treebank (Marcus et al., 1993) dataset, which contains 1M words with a vocabulary size of 10K. The training/validation/test split is provided by convention according to (Mikolov et al., 2010). Since we only focus on the embedding layer, we simply adopt a previous state-of-the-art model (Zaremba et al., 2014), which provides three LSTM (Hochreiter & Schmidhuber, 1997) variants of different sizes: the large model has a word embedding size and LSTM hidden size of 1500, while the corresponding numbers are 650 for the medium and 200 for the small model. By default, we use pre-trained distillation guidance for the proposed method, and a linear embedding transformation function with one hidden layer of 300 hidden units.

For the text classification task, we utilize five different datasets from (Zhang et al., 2015b), namely Yahoo! news, AG's news, DBpedia, Yelp review polarity ratings, as well as Yelp review full-scale ratings (YahooAnswers has 477K unique words and 131M tokens, and Yelp has 268K unique words and 94M tokens; more details are available in (Zhang et al., 2015b)). We adopt the network architecture used in FastText (Joulin et al., 2016b, a), where a Softmax is stacked on top of the averaged word embedding vectors of the text. For simplicity, we only use unigram word information, not the sub-words or bi-grams used in their work. The word embedding dimension is chosen to be 300, as it yields a good balance between size and performance. By default, we use a linear transformation with no hidden layer for the proposed method; that is, we add the code embedding vectors together to generate the symbol embedding vector, and the dimension of the code embedding is the same as that of the word embedding.

For the application with graph convolutional networks, we follow the same settings and hyper-parameters as (Kipf & Welling, 2016). Three datasets are used for comparison: Cora, Citeseer, and Pubmed. Since both the number of symbols (1433, 3703, and 500, respectively) and the embedding dimension (16) are small, the compressible space is actually quite small. Nevertheless, we apply the proposed method to all three datasets, with a linear embedding transformation function with one hidden layer of size 16. We do not use guidance for the text classification and graph node classification tasks, since direct optimization is already satisfactory.

Model Full Lr(5X) Lr(10X) Ours
Perplexity Small 114.53 134.01 134.89 107.77
Medi. 83.38 84.84 85.53 83.11
Large 78.71 81.23 81.85 77.72
# of emb. params. (M) Small 2.00 0.40 0.19 0.37
Medi. 6.50 1.30 0.65 0.50
Large 15.00 2.99 1.50 0.76
# of bits (M) Small 64.00 12.73 6.20 13.39
Medi. 208.00 41.58 20.79 17.75
Large 480.00 95.68 47.84 26.00
Table 1: Language modeling (PTB). Compared with Conventional full embedding, and low-rank (denoted with Lr) with different compression rates.
Model Full Lr(10X) Lr(20X) Ours
Accuracy Yahoo! 0.698 0.695 0.691 0.695
AG N. 0.914 0.914 0.915 0.916
Yelp P. 0.932 0.924 0.923 0.931
Yelp F. 0.592 0.578 0.573 0.590
DBpedia 0.977 0.977 0.979 0.980
# of emb. params. (M) Yahoo! 143.26 13.857 6.690 0.308
AG N. 20.797 2.019 0.975 0.308
Yelp P. 74.022 7.164 3.459 0.308
Yelp F. 80.524 7.793 3.762 0.308
DBpedia 183.76 17.772 8.580 0.308
# of bits (G) Yahoo! 4.584 0.443 0.214 0.086
AG N. 0.665 0.065 0.031 0.021
Yelp P. 2.369 0.229 0.111 0.049
Yelp F. 2.577 0.249 0.120 0.053
DBpedia 5.880 0.569 0.275 0.108
Table 2: Text classification. Lr denotes low-rank.

We mainly compare the proposed "KD encoding" approach with its conventional continuous (full) embedding counterpart, and also with low-rank factorization (Sainath et al., 2013) at different compression ratios. The results for the three tasks are shown in Tables 1, 2, and 3, respectively. In these tables, three types of metrics are shown: (1) the performance metric, perplexity for language modeling and accuracy for the others, (2) the number of embedding parameters used, and (3) the total embedding layer's size, which includes the parameters as well as the codes. From these tables, we observe that the proposed "KD encoding" with end-to-end code learning performs similarly, or even better in many cases, while consistently saving more than 90% of the embedding parameters and model size, and 98% in the text classification case. To achieve a similar level of compression, we note that the low-rank factorization baseline reduces performance significantly.

Dataset Full Lr(2X) Lr(4X) Ours
Accuracy Cora 0.814 0.789 0.767 0.823
Citese. 0.721 0.710 0.685 0.723
Pubm. 0.795 0.773 0.780 0.797
# of emb. params. (K) Cora 22.93 10.14 5.8 8.22
Citese. 59.25 26.03 14.88 8.22
Pubm. 8.00 3.61 2.06 2.69
# of bits (M) Cora 0.73 0.32 0.19 0.33
Citese. 1.90 0.83 0.48 0.44
Pubm. 0.26 0.12 0.07 0.10
Table 3: Graph Convolutional Networks. Lr denotes low-rank.

We further compare with broader baselines on the language modeling task (with the medium-sized language model for convenience): (1) directly using the first 10 chars of a word as its code (padding when necessary), (2) training-aware quantization (Jacob et al., 2017), and (3) product quantization (Jegou et al., 2011; Joulin et al., 2016a). The results are shown in Table 4. Our method significantly outperforms these baselines, in terms of both PPL and model size (bits) reduction.

Methods PPL Bits saved
Char-as-codes 108.14 96%
Scalar quantization (8 bits) 84.06 75%
Scalar quantization (6 bits) 87.73 81%
Scalar quantization (4 bits) 92.86 88%
Product quantization(64x325) 84.03 88%
Product quantization(128x325) 83.71 85%
Product quantization(256x325) 83.66 81%
Ours 83.11 92%
Table 4: Comparisons with more baselines in Language Modeling (Medium sized model).

In the following, we scrutinize different components of the proposed model on PTB language modeling. To start, we test various code learning methods and demonstrate the impact of training with guidance. The results are shown in Table 5. First, we note that both random codes and pre-trained codes are suboptimal, which is understandable as they are not (fully) adapted to the target task. Second, end-to-end training without guidance suffers a serious performance loss, especially as the task-specific network grows in complexity (larger hidden size and use of dropout). Finally, by adopting the proposed continuous guidance (especially the pre-trained distillation guidance), the performance loss can be overcome.

Small Medium Large
Full embedding 114.53 83.38 78.71
Random code 115.79 104.12 98.38
Pre-trained code 107.95 84.92 80.69
Ours (no guidance) 108.50 89.03 86.41
Ours (ODG) 108.19 85.50 83.00
Ours (PDG) 107.77 83.11 77.72
Table 5: Comparisons of different code learning methods.

We further vary the sizes of K and D and observe how they affect performance. As shown in Figures 4(a) and 4(b), a small K or D may harm performance (even when K^D ≥ N is satisfied), which suggests that redundant codes are easier to learn. The size of K seems to have a higher impact on performance than D. Also, when K is small, a non-linear encoder such as an RNN performs much better than its linear counterpart, which verifies our Proposition 2.

Figure 4: The effects of varying K and D under different instantiations of the embedding transformation function. (a) Linear instantiation. (b) RNN instantiation.


Code Words
3-1-0-3 up when over into time back off set left open half behind quickly starts
3-1-0-4 week tuesday wednesday monday thursday friday sunday saturday
3-1-0-5 by were after before while past ago close soon recently continued meanwhile
3-1-1-1 year month months record fall annual target cuts


Table 6: Learned codes for 10K Glove embeddings (K=6, D=4).

To examine the learned codes, we apply our method to the pre-trained embedding vectors from Glove (Pennington et al., 2014), which have better coverage and quality. We force the model to assign multiple words to the same code by setting K = 6 and D = 4 (a code space of 1296) for a vocabulary size of 10K. Table 6 shows a snippet of the learned codes: semantically similar words are assigned to the same or nearby discrete codes.

5 Related Work

The idea of using a more efficient coding system traces back to information theory, such as error-correcting codes (Hamming, 1950) and Huffman codes (Huffman, 1952). However, most embedding techniques, such as word embedding (Mikolov et al., 2013; Pennington et al., 2014) and entity embedding (Chen et al., 2016; Chen & Sun, 2017), use the “one-hot” encoding along with a usually large embedding matrix. Recent work (Kim et al., 2016; Sennrich et al., 2015; Zhang et al., 2015b) explores character- or sub-word-based embedding models instead of word embedding models and shows some promising results. (Svenstrup et al., 2017) proposes using hash functions to automatically map texts to pre-defined bases with a smaller vocabulary size, according to which vectors are composed. However, in their cases the characters, sub-words, and hash functions are fixed, given a priori, and language dependent; thus they may carry little semantic meaning and may not be available for other types of data. In contrast, we learn the code assignment function from data and tasks, and our method is language independent.

The compression of neural networks (Han et al., 2015a, b; Chen et al., 2015) has become increasingly important for deploying large networks to small mobile devices. Our work can be seen as a way to compress the embedding layer of a neural network. Most existing network compression techniques focus on dense/convolutional layers, whose weights are shared/amortized across all data instances, while a single data instance utilizes only the fraction of embedding-layer weights associated with its symbols. To compress these types of weights, some efforts have been made, such as product quantization (Jegou et al., 2011; Joulin et al., 2016a; Zhang, ; Zhang et al., 2015a; Babenko & Lempitsky, 2014). Compared to their methods, our framework is more general: many of these methods can be seen as special cases of the “KD encoding” with a linear embedding transformation function and no hidden layer. Also, under our framework, both the codes and the transformation functions can be learned jointly by minimizing task-specific losses.

Our work is also related to LightRNN (Li et al., 2016), which can be seen as a special case of our proposed KD code with D = 2 and K = √|V|. Due to its use of a more compact code, its code learning is harder and more expensive. This work is an extension of our previous workshop paper (Chen et al., 2017), adding guided end-to-end code learning. In parallel to (Chen et al., 2017), (Shu & Nakayama, 2017) explore similar ideas with linear composition functions and pre-trained codes.

6 Conclusions

In this paper, we propose a novel K-way D-dimensional discrete encoding scheme to replace the “one-hot” encoding, which significantly improves the parameter efficiency of models with embedding layers. To learn semantically meaningful codes, we derive a relaxed discrete optimization technique based on SGD that enables end-to-end code learning. We demonstrate the effectiveness of our approach with applications in language modeling, text classification, and graph convolutional networks.


Acknowledgements

We would like to thank the anonymous reviewers for their constructive comments. We would also like to thank Chong Wang, Denny Zhou, and Lihong Li for helpful discussions. This work is partially supported by NSF III-1705169, NSF CAREER Award 1741634, and Snapchat gift funds.


Appendix A Proofs of Lemmas and Propositions

Lemma 2.

The number of embedding parameters used in the KD encoding is O(K × D × d_c + C), where d_c is the code embedding dimension and C is the number of parameters of the neural network.


Proof. As mentioned, the embedding parameters include the code embedding matrix and the embedding transformation function. There are K × D code embedding vectors, each with d_c dimensions. As for the number of parameters C in the embedding transformation function, such as a neural network (LSTM), it can be treated as a constant with respect to the number of symbols, since C is independent of the vocabulary size |V|, provided that certain structures are present in the symbol embeddings. For example, if we assume all symbol embeddings lie within ε-balls of a finite number of centroids in d-dimensional space, only a constant C is required to achieve an ε-distance error bound, regardless of the vocabulary size, since the neural network only has to memorize the finite set of centroids. ∎
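As a back-of-the-envelope illustration of the lemma's parameter counts (all sizes below are assumptions for illustration, not the paper's experimental settings):

```python
# Hypothetical sizes chosen only for illustration.
V, d = 10_000, 300        # vocabulary size and symbol embedding dimension
K, D, d_c = 32, 4, 100    # KD code sizes and code embedding dimension

full_embedding = V * d    # "one-hot" baseline: one d-dim vector per symbol
kd_embedding = K * D * d_c  # shared code-embedding vectors only (excludes C)

assert K ** D >= V        # the code space must be able to cover the vocabulary
print(full_embedding, kd_embedding)  # prints: 3000000 12800
```

The transformation-network cost C (e.g. an LSTM over the D code embeddings) is a constant independent of V, so the savings grow with the vocabulary size.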


Proposition 2.

A linear composition function with no hidden layer is equivalent to a sparse binary low-rank factorization of the embedding matrix.

Proof sketch. First consider the case K = 2, where the composed embedding matrix can be written as E = BW, with B the binary code matrix (one row per symbol) and W the code embedding matrix. This is a low-rank factorization of the embedding matrix with binary codes B. When we increase K, by representing each code entry as a one-hot vector of size K, we still have E = BW, with the additional constraint that each row of B is a concatenation of D one-hot vectors. Due to the one-hot constraint, each row of B is sparse, as only a 1/K fraction of its entries are non-zero; hence this corresponds to a sparse binary low-rank factorization of the embedding matrix.
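The factorization in the proof sketch can be checked numerically; the sizes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, d, V = 4, 3, 16, 50          # illustrative sizes
codes = rng.integers(0, K, size=(V, D))

# B: each row is a concatenation of D one-hot vectors of size K.
B = np.zeros((V, K * D))
for i, code in enumerate(codes):
    for j, k in enumerate(code):
        B[i, j * K + k] = 1.0

W = rng.normal(size=(K * D, d))     # stacked code-embedding matrix
E = B @ W                           # composed embedding matrix

# Rank is bounded by the code size K*D, and each row of B has exactly
# D non-zeros, i.e. a 1/K fraction of its K*D entries.
assert np.linalg.matrix_rank(E) <= K * D
assert np.allclose(B.sum(axis=1), D)
```

This makes the rank bound of the linear instantiation explicit, which motivates the non-linear transformation discussed next.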

The linear composition with no hidden layer can be limited in some cases, since the expressiveness of the function relies heavily on the number of bases, i.e., the rank of the factorization. Hence, a non-linear composition may be more appealing in such cases.

Proposition 3.

Given the same dimensionality of the “KD code”, i.e., K and D, and the same code embedding dimension, a non-linear embedding transformation function can reconstruct an embedding matrix of higher rank than its linear counterpart.

Proof sketch. As shown above, in the linear case we approximate the embedding matrix by a low-rank factorization, E = BW. The rank is constrained by the dimensionality of the binary matrix B, i.e., rank(E) ≤ KD. However, if we consider a non-linear transformation function f, we have E = f(BW). As long as no two rows in B and no two columns in W are the same, i.e., every symbol has its unique code and every code has its unique embedding vector, the non-linear function f, such as a neural network with enough capacity, can approximate a matrix of much higher rank, even full rank, than BW.

Appendix B The LSTM Code Embedding Transformation Function

Here we present more details on the LSTM code embedding transformation function. Assuming the code embedding dimension is the same as the LSTM hidden dimension, the (standard LSTM) formulation is given as follows:

i_j = σ(W_i e_j + U_i h_{j-1} + b_i)
f_j = σ(W_f e_j + U_f h_{j-1} + b_f)
o_j = σ(W_o e_j + U_o h_{j-1} + b_o)
g_j = tanh(W_g e_j + U_g h_{j-1} + b_g)
m_j = f_j ⊙ m_{j-1} + i_j ⊙ g_j
h_j = o_j ⊙ tanh(m_j)

where e_j is the embedding vector of the j-th code component, ⊙ denotes element-wise multiplication, and σ(·) and tanh(·) are, respectively, the standard sigmoid and tanh activation functions. Please note that the symbol index is omitted for simplicity.
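A minimal NumPy sketch of this transformation function, assuming the last hidden state is taken as the symbol embedding; the weight shapes and names are our own, not the paper's:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_compose(code, code_emb, params):
    """Run a single-layer LSTM over the D code embedding vectors and
    return the last hidden state as the symbol embedding."""
    Wx, Wh, b = params            # gate weights: (4h, d_c), (4h, h), (4h,)
    h_dim = Wh.shape[1]
    h = np.zeros(h_dim)
    m = np.zeros(h_dim)           # memory cell
    for j, k in enumerate(code):
        e = code_emb[j, k]        # embedding of the j-th code component
        z = Wx @ e + Wh @ h + b
        i, f, o, g = np.split(z, 4)   # input, forget, output, candidate gates
        m = sigmoid(f) * m + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(m)
    return h

rng = np.random.default_rng(0)
K, D, d_c = 6, 4, 8               # illustrative sizes; d_c equals the hidden dim
code_emb = rng.normal(size=(D, K, d_c))
params = (rng.normal(size=(4 * d_c, d_c)),
          rng.normal(size=(4 * d_c, d_c)),
          np.zeros(4 * d_c))

emb = lstm_compose((3, 1, 0, 4), code_emb, params)  # e.g. the code for "tuesday"
```

Because the gates are non-linear, the composed embeddings are no longer constrained to a rank-KD subspace as in the linear case.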

Appendix C Examples and Applications

Our proposed task-specific end-to-end learned “KD encoding” can be applied to any problem that involves learning embeddings, in order to reduce model size and increase efficiency. In the following, we list some typical examples and applications, for which detailed descriptions can be found in the supplementary material.

Language Modeling

Language modeling is a fundamental problem in NLP, which can be formulated as predicting the probability of a sequence of words. Models based on recurrent neural networks (RNNs) with word embeddings (Mikolov et al., 2010; Kim et al., 2016) achieve state-of-the-art results, so we base our experiments on them. An RNN language model estimates the probability of a sequence of words by modeling the conditional probability of each word given its preceding words,

P(w_1, …, w_T) = ∏_{t=1}^{T} P(w_t | w_1, …, w_{t-1}),

where w_t is the t-th word in the vocabulary, and each conditional probability is naturally modeled by the softmax output at the t-th time step of the RNN. The RNN parameters and the word embeddings are the model parameters of the language model.
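The chain-rule factorization can be illustrated with a toy recurrent model; all names and sizes here are illustrative, not the experimental setup:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
V, h_dim = 5, 8                         # toy vocabulary and hidden size
E = rng.normal(size=(V, h_dim))         # word embeddings (the layer KD encoding compresses)
W_out = rng.normal(size=(V, h_dim))     # softmax output projection
W_h = rng.normal(size=(h_dim, h_dim))   # recurrent weights

def sequence_log_prob(words):
    """log P(w_1..w_T) = sum_t log P(w_t | w_<t), via the chain rule."""
    h = np.zeros(h_dim)
    log_p = 0.0
    for w in words:
        p = softmax(W_out @ h)          # conditional distribution over the vocab
        log_p += np.log(p[w])
        h = np.tanh(W_h @ h + E[w])     # consume w_t into the hidden state
    return log_p

lp = sequence_log_prob([0, 3, 1])
```

Note that the first prediction is made from the zero initial state, giving a uniform distribution over the toy vocabulary.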

Text Classification

Text classification is another important problem in NLP with many different applications. In this problem, given a training set of documents, each containing a number of words, together with its target label, we learn the embedding representation of each word and a binary or multi-class classifier with a logistic or softmax output, and predict the labels of test documents drawn from the same vocabulary as the training set. To test the “KD encoding” of word embeddings on several typical text classification applications, we use several different types of datasets: Yahoo Answers and AG News represent topic prediction, Yelp Polarity and Yelp Full represent sentiment analysis, and DBpedia represents ontology classification.

Graph Convolutional Networks for Semi-Supervised Node Classification

In (Kipf & Welling, 2016), graph convolutional networks (GCNs) are proposed for semi-supervised node classification on undirected graphs. In a GCN, the matrix Â = D̃^{-1/2} Ã D̃^{-1/2}, obtained by normalizing the adjacency matrix with added self-connections Ã = A + I_N (where D̃ is the degree matrix of Ã), is used to approximate spectral graph convolutions. As a result, σ(Â H W) defines a non-linear convolutional feature transformation on a node embedding matrix H with a projection matrix W and non-linear activation function σ. This layer-wise transformation can be repeated to build a deep network before making predictions with the final output layer. Minimizing a task-specific loss function, the network weights and the node embedding matrix X are learned simultaneously using standard back-propagation. A simple GCN with one hidden layer takes the following form:

Z = softmax(Â ReLU(Â X W^(0)) W^(1)),

where W^(0) and W^(1) are network weights, and the softmax is performed in a row-wise manner. When the labels of only a subset of nodes are given, this framework is readily extended to graph-based semi-supervised node classification by minimizing the following loss function,


L = − ∑_{l ∈ Y_L} ∑_{f=1}^{F} Y_{lf} ln Z_{lf},

where Y_L is the set of labeled graph nodes, F is the total number of classes of the graph nodes, and Y is a binary label matrix with each row summing to 1. We apply our proposed KD code learning to the graph node embeddings in the above GCN framework for semi-supervised node classification.
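The one-hidden-layer GCN forward pass can be sketched as follows; all sizes and names are illustrative:

```python
import numpy as np

def normalize_adj(A):
    """Â = D̃^{-1/2} (A + I) D̃^{-1/2}, i.e. normalization with self-connections."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)                  # degrees of A-tilde (always >= 1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, d, h, F = 6, 8, 4, 3                      # nodes, embedding dim, hidden dim, classes
A = (rng.random((N, N)) < 0.3).astype(float)
A = np.triu(A, 1); A = A + A.T               # symmetric adjacency, no self-loops
X = rng.normal(size=(N, d))                  # node embedding matrix (the KD-compressible layer)
W0 = rng.normal(size=(d, h))
W1 = rng.normal(size=(h, F))

A_hat = normalize_adj(A)
Z = softmax_rows(A_hat @ np.maximum(A_hat @ X @ W0, 0.0) @ W1)  # ReLU hidden layer
```

Each row of Z is a class distribution for one node; the cross-entropy loss above is then evaluated only on the labeled rows.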


The learned discrete codes can also be seen as data-dependent hash codes for fast data retrieval. In this paper, we also perform case studies evaluating the effectiveness of our learned KD codes as hash codes.

Appendix D Additional Experimental Results

We also test the effects of different code embedding dimensions; the results are presented in Figure 5. We find that the linear encoder requires a larger code embedding dimensionality, while the non-linear encoder works well with relatively small ones. This again verifies Proposition 2.

Figure 5: The perplexity on PTB as a function of different code embedding dimensions as well as the embedding transformation functions.

Table 7 shows the effectiveness of different tricks in the continuous-relaxation-based optimization. We can clearly see the positive impacts of temperature scheduling, entropy regularization, and auto-encoding. However, the biggest performance jump is brought by the proposed distillation guidance.

Variants                                        PPL
CR                                              90.61
CR + STE                                        90.15
CR + STE + temperature scheduling               89.55
CR + STE + entropy reg                          89.03
CR + STE + entropy reg + PDG (w/o autoencod.)
CR + STE + entropy reg + PDG (w/ autoencod.)    83.11
Table 7: Effectiveness of different optimization tricks. Here, CR=Continuous Relaxation using softmax, STE=straight-through estimation, CDG=continuous distillation guidance.

Appendix E Notations

For clarity, Table 8 explains the major notations used in our paper.

Notations Explanation
One-hot representation of the code.
Its continuous relaxation.
Code logits for computing the relaxed code.
Code embedding matrix.
The transformation from a symbol to its embedding.
The transformation from a symbol to its code.
The code transformation function, mapping a code to an embedding; it has its own parameters.
The embedding transformation function, mapping code embedding vectors to a symbol embedding vector.
The composite symbol embedding vector.
The task-specific (non-embedding) parameters.
Pre-trained symbol embedding matrix.
Pre-trained symbol embedding vector.
Symbol embedding dimensionality.
Code embedding dimensionality.
Table 8: Notations