Learning K-way D-dimensional Discrete Code For Compact Embedding Representations

11/08/2017 ∙ by Ting Chen, et al.

Embedding methods such as word embedding have become pillars for many applications involving discrete structures. Conventional embedding methods directly associate each symbol with a continuous embedding vector, which is equivalent to applying a linear transformation to a "one-hot" encoding of the discrete symbols. Despite its simplicity, such an approach yields a number of parameters that grows linearly with the vocabulary size and can lead to overfitting. In this work we propose a much more compact K-way D-dimensional discrete encoding scheme to replace the "one-hot" encoding. In "KD encoding", each symbol is represented by a D-dimensional code, and each dimension has a cardinality of K. The final symbol embedding vector is generated by composing the code embedding vectors. To learn semantically meaningful codes, we derive a relaxed discrete optimization technique based on stochastic gradient descent. By adopting the new coding system, the efficiency of the parameterization can be significantly improved (from linear to logarithmic), which also mitigates the overfitting problem. In our experiments with language modeling, the number of embedding parameters can be reduced by 97% while achieving similar or better performance.


1 Introduction

Embedding methods, such as word embedding Mikolov et al. (2013); Pennington et al. (2014), have become pillars in many applications when learning from discrete structures. Examples include language modeling Kim et al. (2016), machine translation Sennrich et al. (2015), text classification Zhang et al. (2015), knowledge graph and social network modeling Bordes et al. (2013), and many others Chen et al. (2016). The objective of the embedding module in neural networks is to represent a discrete symbol, such as a word or an entity, with a continuous embedding vector.

At first glance this seems trivial: we can directly associate each symbol with a learnable embedding vector, as is done in existing work. To retrieve the embedding vector of a specific symbol, an embedding table lookup is performed. This is equivalent to the following: first we encode each symbol $s_i$ with a "one-hot" encoding vector $b_i \in \{0,1\}^N$ whose only non-zero entry is at position $i$ ($N$ is the total number of symbols); then, to generate the embedding vector $e_i$, we simply multiply the "one-hot" vector with the embedding matrix $W \in \mathbb{R}^{N \times d}$, i.e. $e_i = b_i^\top W$.

Despite the simplicity of this "one-hot" encoding based embedding approach, it has several issues. The major issue is that the number of parameters grows linearly with the number of symbols. This becomes very challenging when we have millions or billions of entities in a database, or when there are many symbols with only a few observations each (cf. Zipf's law). There is also redundancy in the parameterization, since many symbols are actually similar to each other. This over-parameterization can further lead to overfitting; it also requires a lot of memory, which prevents the model from being deployed to mobile devices. Another issue is purely from the code-space-utilization perspective, where we find "one-hot" encoding to be extremely inefficient: its code space utilization rate is almost zero, as only $N$ out of the $2^N$ possible binary codes are used, while $\log_2 N$ bits/dimensions suffice to represent $N$ symbols.

To address these issues, we propose a novel and much more compact coding scheme that replaces the "one-hot" encoding. In the proposed approach, we use a $K$-way $D$-dimensional code to represent each symbol, where each code has $D$ dimensions and each dimension has a cardinality of $K$. For example, the concept of cat may be encoded as (5-1-3-7), and the concept of dog may be encoded as (5-1-3-9). The code allocation for each symbol is learned from data, so that the codes capture the semantics of the symbols and similar codes reflect similar meanings. We dub the proposed encoding scheme "KD encoding".

The KD code system is much more compact than its "one-hot" counterpart. To represent a set of symbols of size $N$, the "KD encoding" only requires $K^D \ge N$, i.e. $D \ge \log_K N$. By increasing $K$ or $D$ by a small amount, we can easily achieve $K^D \gg N$, in which case the code is still much more compact. In the compact case $K^D = N$, the code space is fully utilized, which is exponentially more efficient than the near-zero utilization of the "one-hot" counterpart: the code length shrinks from $N$ bits to $D \log_2 K$ bits.
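As a quick sanity check of the relation $D \ge \log_K N$, the following Python sketch (with purely illustrative values of $N$ and $K$, not a configuration from the paper) computes the minimum code dimension for a given vocabulary size:

```python
import math

def min_code_dim(N: int, K: int) -> int:
    """Smallest D such that K**D >= N, i.e. D = ceil(log_K N)."""
    return math.ceil(math.log(N) / math.log(K))

N, K = 1_000_000, 16      # illustrative vocabulary size and cardinality
D = min_code_dim(N, K)    # -> 5, since 16**5 = 1,048,576 >= 1,000,000
print(D, K ** D)          # a near-compact code: the code space barely exceeds N
```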

The compactness of the code translates into compactness of the parameterization. Instead of storing a giant embedding matrix of symbol embeddings, the symbol embedding vector is generated by composing a much smaller number of code embedding vectors. This is achieved as follows: first we embed each KD code into a sequence of code embedding vectors in $\mathbb{R}^{d'}$, and then apply a transformation $f(\cdot)$, which can be based on neural networks, to generate the final symbol embedding. In order to learn meaningful discrete codes that exploit the similarities among symbols, we derive a relaxed discrete optimization algorithm based on stochastic gradient descent (SGD). By adopting the new approach, we reduce the number of embedding parameters from $\mathcal{O}(Nd)$ to $\mathcal{O}(KDd' + C)$, where $d'$ is the code embedding size and $C$ is the number of neural network parameters. To validate our idea, we conduct experiments on both synthetic data and a real language modeling task. We achieve a 97% reduction of embedding parameters in the language modeling task while obtaining similar or better performance.

2 The K-way D-dimensional Discrete Encoding

In this section we introduce the "KD encoding" in detail. Specifically, we present methods to generate the symbol embedding from its (given or learned) "KD code", as well as techniques for learning the "KD code" from data.

2.1 The “KD encoding” Framework

In the proposed framework, each symbol is associated with a $K$-way, $D$-dimensional discrete code. We denote each symbol by $s_i \in S$, where $S$ is the set of symbols with cardinality $N$, and each discrete code by $c_i = (c_i^1, c_i^2, \dots, c_i^D) \in C$, where each code bit $c_i^j$ takes a value from a set of cardinality $K$. To connect symbols with discrete codes, a mapping function $\phi: S \to C$ is used. The learning of this mapping function is introduced later; once fixed, it can be stored as a hash table for fast lookup.

Given the $i$-th symbol $s_i$, we retrieve its code via a code lookup, $c_i = \phi(s_i)$. The final embedding $e_i$ is generated by first embedding the code into a sequence of code embedding vectors $(W^1_{c_i^1}, W^2_{c_i^2}, \dots, W^D_{c_i^D})$, and then applying a differentiable transformation function $f(\cdot)$, which is learned as well. We introduce the transformation function in the next sub-section. Here $W^j \in \mathbb{R}^{K \times d'}$ is the embedding matrix for the $j$-th code bit, and $W^j_k$ denotes its $k$-th row. The overall framework is illustrated in Figure 1.
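The mapping $\phi$ can be stored as an ordinary hash table. The toy sketch below (hypothetical symbols and codes, reusing the cat/dog example from the introduction) illustrates the constant-time code lookup:

```python
# Toy illustration (hypothetical symbols and codes) of the mapping phi stored
# as a hash table: each symbol maps to a D-dimensional code over K values.
phi = {
    "cat": (5, 1, 3, 7),
    "dog": (5, 1, 3, 9),  # semantically similar symbols may share most code bits
}

def lookup_code(symbol: str) -> tuple:
    """c_i = phi(s_i): an O(1) hash-table lookup."""
    return phi[symbol]

print(lookup_code("cat"))  # (5, 1, 3, 7)
```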

Figure 1: (a) The conventional symbol embedding based on "one-hot" encoding. (b) The proposed KD encoding scheme. (c) An example of an RNN-based embedding transformation function used in the KD encoding to generate the symbol embedding from its code.

In order to uniquely identify a symbol, we only need $K^D \ge N$, since we can then assign a unique code to each symbol. When $K^D = N$, the code space is fully utilized, and no symbol can change its code without affecting another symbol. We call this type of code system a compact code. The optimization problem for compact codes can be very difficult, and usually requires approximate combinatorial algorithms such as graph matching Li et al. (2016). Opposite to the compact code is the redundant code system, where $K^D \gg N$. In this case much of the code space is "empty", with no symbol correspondence, so changing the code of one symbol is unlikely to affect other symbols, since the random collision probability is very small (for example, with a code space of size $K^D = 10^{20}$, e.g. $K = 100$ and $D = 10$, and a billion symbols, the probability of no collision at all under a random code assignment is about 99.5%). This makes the redundant code easier to optimize, and it can be obtained by slightly increasing $K$ or $D$ thanks to the exponential nature of their relation. Hence, for both compact and redundant codes, we have $D = \mathcal{O}(\log_K N)$.
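The 99.5% figure follows from the standard birthday-problem approximation; the sketch below reproduces it under the stated assumption of a $10^{20}$-sized code space (the particular $K$, $D$ split is illustrative):

```python
import math

def no_collision_prob(n_symbols: float, code_space: float) -> float:
    """Birthday-problem approximation: P(no collision) ~ exp(-n*(n-1) / (2*M))."""
    return math.exp(-n_symbols * (n_symbols - 1) / (2.0 * code_space))

# One billion symbols randomly assigned codes from a space of size K**D = 1e20.
print(round(no_collision_prob(1e9, 1e20), 3))  # ~0.995, i.e. about 99.5%
```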

2.2 Discrete Code Embedding

Since a discrete code has multiple bits/dimensions, we cannot directly use a single embedding lookup to obtain the symbol embedding as in "one-hot" encoding. Hence, we first map each code $c_i$ into code embedding vectors $(W^1_{c_i^1}, \dots, W^D_{c_i^D})$ via code lookups, and then use a function $f(\cdot)$ that transforms the code embedding vectors into the final symbol embedding vector.

As mentioned above, we associate an embedding matrix $W^j \in \mathbb{R}^{K \times d'}$ with the $j$-th dimension of the discrete code. This enables us to turn a discrete code $c_i$ into a sequence of code embedding vectors $(W^1_{c_i^1}, \dots, W^D_{c_i^D})$.

Now, to generate the final embedding vector $e_i$, a transformation function $f(\cdot)$ is applied. In this work we consider two types of embedding transformation functions. The first is a linear transformation,

$$e_i = H \big[ W^1_{c_i^1}; W^2_{c_i^2}; \dots; W^D_{c_i^D} \big],$$

where $H \in \mathbb{R}^{d \times Dd'}$ is the linear transformation matrix and $[\cdot\,;\cdot]$ denotes concatenation. While this is simple, its linear nature limits the capacity of the generated symbol embeddings. This motivates us to adopt a non-linear transformation function based on a recurrent neural network, in particular an LSTM Hochreiter and Schmidhuber (1997). Assuming the code embedding dimension is the same as the LSTM hidden dimension, the formulation follows the standard LSTM recurrence over the $D$ code bits:

$$
\begin{aligned}
i_j &= \sigma(U_i W^j_{c^j} + V_i h_{j-1} + b_i), \\
f_j &= \sigma(U_f W^j_{c^j} + V_f h_{j-1} + b_f), \\
o_j &= \sigma(U_o W^j_{c^j} + V_o h_{j-1} + b_o), \\
\tilde{m}_j &= \tanh(U_m W^j_{c^j} + V_m h_{j-1} + b_m), \\
m_j &= f_j \odot m_{j-1} + i_j \odot \tilde{m}_j, \\
h_j &= o_j \odot \tanh(m_j),
\end{aligned}
$$

where $\sigma(\cdot)$ and $\tanh(\cdot)$ are, respectively, the standard sigmoid and tanh activation functions, and the symbol index is omitted for simplicity. The final symbol embedding is computed by summing the LSTM outputs over all code bits (with a linear transformation to match the dimension if $d' \ne d$), i.e. $e = H' \sum_{j=1}^{D} h_j$.
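The following PyTorch sketch illustrates this composition (per-dimension code books, an LSTM over the $D$ code bits, and a projection to the target embedding size); the module and argument names are ours, not the authors' implementation:

```python
import torch
import torch.nn as nn

class KDEmbedding(nn.Module):
    """Sketch of a KD-encoded embedding layer: D code books of K vectors each,
    composed by an LSTM over the code bits (names and shapes are illustrative)."""
    def __init__(self, K: int, D: int, code_dim: int, emb_dim: int):
        super().__init__()
        # One K x code_dim embedding matrix W^j per code dimension j.
        self.code_books = nn.ModuleList([nn.Embedding(K, code_dim) for _ in range(D)])
        self.lstm = nn.LSTM(code_dim, code_dim, batch_first=True)
        # Linear map to match the target embedding size when code_dim != emb_dim.
        self.proj = nn.Linear(code_dim, emb_dim)

    def forward(self, codes: torch.Tensor) -> torch.Tensor:
        # codes: (batch, D) integer KD codes, one column per code dimension.
        vecs = [book(codes[:, j]) for j, book in enumerate(self.code_books)]
        seq = torch.stack(vecs, dim=1)       # (batch, D, code_dim)
        out, _ = self.lstm(seq)              # (batch, D, code_dim)
        return self.proj(out.sum(dim=1))     # sum over code bits, then project

emb = KDEmbedding(K=16, D=8, code_dim=64, emb_dim=200)
codes = torch.randint(0, 16, (32, 8))        # a batch of 32 KD codes
print(emb(codes).shape)                      # torch.Size([32, 200])
```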

Lemma 1.

The number of embedding parameters used in the KD encoding is $\mathcal{O}(K \times D \times d' + C)$, where $C$ is the number of parameters of the neural network.

The proof is straightforward. There are two types of embedding parameters in the KD encoding: (1) code embedding vectors and (2) neural network parameters. There are $K \times D$ code embedding vectors, each with $d'$ dimensions. As for the number of parameters in the neural network (LSTM), which is $C$, it may be treated as a constant with respect to the number of symbols, since $C$ is independent of $N$ provided that certain structure is present in the symbol embeddings. For example, if we assume the symbol embeddings lie within $\epsilon$-balls around a finite number of centroids in $d$-dimensional space, only a constant $C$ should be required to achieve an $\epsilon$-distance error bound regardless of the vocabulary size, since the neural network merely has to memorize the finite set of centroids.
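A minimal sketch of the parameter counting in Lemma 1, with purely illustrative sizes (not the paper's configuration):

```python
def onehot_params(N: int, d: int) -> int:
    """Conventional embedding table: one d-dimensional vector per symbol."""
    return N * d

def kd_params(K: int, D: int, code_dim: int, net_params: int) -> int:
    """K*D code embedding vectors of size code_dim, plus the (constant-size) net."""
    return K * D * code_dim + net_params

# Purely illustrative sizes.
print(onehot_params(10_000, 200))               # 2,000,000
print(kd_params(16, 8, 64, net_params=50_000))  # 58,192
```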

2.3 Discrete Code Learning

The code assignment is very important for both parameterization efficiency and generalization, so we want to learn the code allocation function $\phi(\cdot)$ end-to-end from data, in contrast to the hand-crafted "one-hot" encoding. In this work, we assume that we are given pre-trained embedding vectors $\{v_i\}$, each $v_i \in \mathbb{R}^d$, and learn the discrete codes based on the given $\{v_i\}$. Once the codes are learned, we can re-learn the code embedding parameters, including the transformation function, for the specific task. In the future, we will extend this to the case where such embeddings are not available.

To find the optimal codes, we minimize the squared loss between the given embedding vector and the embedding vector generated from the KD code. This yields the following objective:

$$\min_{\{c_i\}, \{W^j\}, \theta} \; \sum_{i=1}^{N} \big\| f\big(W^1_{c_i^1}, \dots, W^D_{c_i^D}; \theta\big) - v_i \big\|_2^2, \tag{1}$$

where $f(\cdot)$ is a differentiable transformation function as introduced above, with parameters $\theta$.

Since each $c_i$ is a discrete code, it cannot be directly optimized via stochastic gradient descent like the other parameters, so we need a relaxation in order to learn it effectively via SGD. We observe that each code can be seen as a concatenation of $D$ "one-hot" vectors, i.e. $c_i = (o_i^1, \dots, o_i^D)$, where each $o_i^j \in \{0,1\}^K$ satisfies $\sum_k o_{ik}^j = 1$, with $o_{ik}^j$ the $k$-th component of $o_i^j$. We could adjust $o_i^j$ in order to update the code, but it is still non-differentiable. To address this issue, we relax $o_i^j$ from a "one-hot" vector to a continuous vector by applying a tempering Softmax:

$$o_{ik}^j = \frac{\exp(\pi_{ik}^j / \tau)}{\sum_{k'} \exp(\pi_{ik'}^j / \tau)},$$

where $\pi_i^j \in \mathbb{R}^K$ are learnable code logits and $\tau$ is a temperature term; as $\tau \to 0$, this approximation becomes exact (except in the case of ties). Similar techniques have been applied in Gumbel-Softmax Jang et al. (2016); Maddison et al. (2016). We illustrate the effect of the temperature in Figure 2.
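A minimal illustration of the tempering Softmax: as the temperature decreases, the output sharpens toward a "one-hot" vector (the logit values below are illustrative):

```python
import torch

def tempering_softmax(logits: torch.Tensor, tau: float) -> torch.Tensor:
    """Relaxed 'one-hot' vector; approaches the argmax one-hot as tau -> 0."""
    return torch.softmax(logits / tau, dim=-1)

logits = torch.tensor([2.0, 1.0, 0.5, 0.0])    # illustrative code logits pi
for tau in (10.0, 1.0, 0.1):
    print(tau, tempering_softmax(logits, tau))
# Large tau: nearly uniform; small tau: mass concentrates on the largest logit.
```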

Figure 2: The effects of the temperature $\tau$.

To learn the relaxed code logits $\pi_i^j$, we can gradually decrease the temperature $\tau$ during training. When $\tau$ is not small enough, $o_i^j$ is still a smooth vector, so we use the linear combination $(o_i^j)^\top W^j$, instead of indexing $W^j_{c_i^j}$, to generate the embedding vector for the $j$-th code dimension.

Note that the tempering Softmax approximation only provides a useful gradient when $\tau$ is not too small; the gradient vanishes as $\tau \to 0$. So at the beginning, when $\tau$ is not small enough, we are actually learning continuous rather than discrete codes, which may not be desirable. Once $\tau$ becomes small enough that we start to learn truly discrete codes, the small $\tau$ in turn prevents the codes from being updated further, as it makes the gradient vanish.

To address this issue, we take inspiration from the Straight-Through Estimator Bengio et al. (2013). In the forward pass, instead of using the tempering Softmax output, which is likely a smooth continuous vector, we take its maximum and turn it into a "one-hot" vector, which resembles the exact discrete code:

$$\hat{o}_{ik}^j = \mathbb{1}\big[k = \arg\max_{k'} o_{ik'}^j\big].$$

The use of the straight-through estimator is equivalent to using different temperatures in the forward and backward passes. In the forward pass, $\tau \to 0$ is used, for which we simply take the argmax. In the backward pass (to compute the gradient), we pretend that a larger $\tau$ was used. Although this is a biased gradient estimator, the sign of the gradient is still correct. Compared to using the same temperature in both passes, this always outputs a "one-hot" discrete code, and there is no vanishing gradient problem as long as the backward temperature does not approach zero.
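A common way to implement this straight-through behavior in an autograd framework is sketched below (the exact construction used by the authors may differ):

```python
import torch
import torch.nn.functional as F

def st_tempering_softmax(logits: torch.Tensor, tau: float) -> torch.Tensor:
    """Forward: hard one-hot via argmax; backward: gradient of softmax(logits/tau)."""
    soft = torch.softmax(logits / tau, dim=-1)
    hard = F.one_hot(soft.argmax(dim=-1), num_classes=logits.size(-1)).to(soft.dtype)
    # (hard - soft) is detached, so the forward value equals `hard`,
    # while gradients flow only through `soft`.
    return (hard - soft).detach() + soft

logits = torch.zeros(4, requires_grad=True)
out = st_tempering_softmax(logits, tau=1.0)
loss = (out * torch.arange(4.0)).sum()   # any downstream differentiable loss
loss.backward()
print(out)          # an exact one-hot vector
print(logits.grad)  # non-zero: gradients flow through the softmax relaxation
```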

The training procedure is summarized in Algorithm 1, in which the stop_gradient operator will prevent the gradient from back-propagating through it.

Input: symbol embeddings $\{v_i\}$, code logits $\{\pi_i^j\}$, code embedding matrices $\{W^j\}$, transformation parameters $\theta$.
Output: discrete codes $\{c_i\}$.
for $i = 1$ to $N$ do
    for $j = 1$ to $D$ do
        $o_i^j \leftarrow \mathrm{softmax}(\pi_i^j / \tau)$
        $\hat{o}_i^j \leftarrow \mathrm{stop\_gradient}\big(\mathrm{one\_hot}(\arg\max_k o_{ik}^j) - o_i^j\big) + o_i^j$
    take a step of SGD on $\{\pi_i^j\}$, $\{W^j\}$, $\theta$ to reduce $\big\| f\big((\hat{o}_i^1)^\top W^1, \dots, (\hat{o}_i^D)^\top W^D; \theta\big) - v_i \big\|_2^2$

Algorithm 1: An epoch of code learning via the Straight-Through Estimator with Tempering Softmax.
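Below is a self-contained sketch of one (toy) epoch of Algorithm 1, using the linear transformation for brevity and an assumed Adam optimizer; sizes, learning rate, and temperature are illustrative, not the paper's exact settings:

```python
import torch
import torch.nn.functional as F

# Toy sizes; the linear transformation f is used here for brevity.
N, K, D, code_dim, emb_dim = 1000, 16, 8, 32, 100
target = torch.randn(N, emb_dim)                              # pre-trained embeddings v_i
logits = torch.zeros(N, D, K, requires_grad=True)             # code logits pi
codebooks = torch.randn(D, K, code_dim, requires_grad=True)   # code embedding matrices W^j
H = torch.randn(D * code_dim, emb_dim, requires_grad=True)    # linear transformation
opt = torch.optim.Adam([logits, codebooks, H], lr=1e-3)
tau = 1.0                                                     # decayed over epochs in practice

for step in range(100):                                       # one (toy) epoch
    soft = torch.softmax(logits / tau, dim=-1)                # tempering softmax, (N, D, K)
    hard = F.one_hot(soft.argmax(-1), K).float()
    o = (hard - soft).detach() + soft                         # straight-through codes
    code_vecs = torch.einsum('ndk,dkc->ndc', o, codebooks)    # select rows of each W^j
    pred = code_vecs.reshape(N, -1) @ H                       # f(...): linear composition
    loss = F.mse_loss(pred, target)                           # Eq. (1) reconstruction loss
    opt.zero_grad()
    loss.backward()
    opt.step()

codes = logits.argmax(-1)                                     # final discrete KD codes, (N, D)
```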

3 Experiments

In this section we present both real and synthetic experiments to validate the proposed approach. The first set of experiments is based on language modeling, a fundamental task in NLP that can be formulated as predicting the probability of a sequence of words. Models based on recurrent neural networks with word embeddings Mikolov et al. (2010); Kim et al. (2016) achieve state-of-the-art results, so we base our experiments on them. We use the widely adopted English Penn Treebank Marcus et al. (1993) dataset, which contains 1M words with a vocabulary size of 10K. The training/validation/test split follows the convention of Mikolov et al. (2010). We utilize a standard LSTM Hochreiter and Schmidhuber (1997) with two different model sizes, which trade off model size and accuracy. The larger model has word embedding and LSTM hidden sizes of 1500; the number is 200 for the smaller model. Fixed default values of $K$ and $D$ are used in the proposed approach, and the codes are trained with a temperature schedule in which $\tau$ decays with the iteration number $t$. We first train the model regularly using the conventional embedding approach to obtain the embedding vectors, which are then used to learn the discrete codes. Once the discrete codes are obtained and fixed, we re-train the model, with the same architecture and hyper-parameters, for the code embedding from scratch.

Table 1 compares the conventional "one-hot" word embeddings against the proposed KD encoding. We present several variants of the KD encoding scheme, distinguished by the combination of (1) the discrete code learning model and (2) the symbol embedding re-learning/re-training model. For discrete code learning, we consider three cases: random assignment, codes learned with a linear transformation, and codes learned with an LSTM transformation function; the latter two can also be used as the symbol embedding re-learning model. Firstly, we observe that discrete code learning is critical for KD encoding, as random discrete codes produce much worse performance. Secondly, we observe that with appropriate code learning, the test perplexity is similar to or better than that of the "one-hot" encoding, while saving 82%-97% of the embedding parameters.

 

                   Small model                   Large model
                   PPL      E. Size   C. Rate    PPL      E. Size   C. Rate
Conventional       114.53   2M        1          84.04    15M       1
Random + Linear    144.32   0.1M      0.05       103.44   0.4M      0.033
Random + LSTM      147.13   0.37M     0.185      119.62   0.63M     0.042
Linear + Linear    118.40   0.1M      0.05       87.42    0.4M      0.033
Linear + LSTM      111.13   0.37M     0.185      88.82    0.63M     0.042
LSTM + Linear      117.21   0.1M      0.05       84.61    0.4M      0.033
LSTM + LSTM        111.31   0.37M     0.185      85.37    0.63M     0.042

 

Table 1: Comparison of language modeling on PTB. Test perplexity (PPL), embedding size (E. Size), and compression rate (C. Rate) are shown for both the small and large model settings. See text for the KD encoding variants.

We also vary $K$ or $D$ and observe how they affect performance. As shown in Figures 3(a) and 3(b), a small $K$ or $D$ may harm performance (even though $K^D \ge N$ is still satisfied), which suggests that the redundant code may be easier to learn.

In order to understand the effect of the temperature, and the importance of using a discrete code output (i.e., zero forward temperature), we create another set of experiments based on synthetic embedding clusters. We generate 10K points belonging to 100 well-separated clusters in 10-dimensional space, and use $K = 100$, $D = 1$, which mimics the K-means clustering problem since each code represents a cluster assignment. Both the squared loss and the clustering NMI are shown in Figures 3(c) and 3(d). We observe that the STE with temperature scheduling is much more effective than its counterparts. When the temperature is kept constant, some fraction of the codes keeps changing, and both the loss and the NMI converge to a worse local optimum. When a smooth continuous code is used instead of a discrete code, the loss first decreases and then increases; this is because its behavior mimics the discrete code output only when the temperature is small enough.

Figure 3: (a) Fix K, vary D and (b) fix D, vary K on the PTB language modeling task. (c) Clustering loss and (d) clustering NMI on the synthetic task.

To further inspect the learned codes, we use pre-trained GloVe embeddings Pennington et al. (2014), which have better coverage and quality than embeddings pre-trained on the PTB language modeling task. We intentionally use $K = 6$, $D = 4$ (a code space of $6^4 = 1296$) for a vocabulary size of 10K, so that the model is forced to collide words. Table 2 shows codes learned from the GloVe vectors, demonstrating that semantically similar words are assigned similar discrete codes.

 

Code       Words
3-1-0-3    up when over into time back off set left open half behind quickly starts
3-1-0-4    week tuesday wednesday monday thursday friday sunday saturday
3-1-0-5    by were after before while past ago close soon recently continued meanwhile
3-1-1-1    year month months record fall annual target cuts

 

Table 2: Learned codes for K=6, D=4 on 10K GloVe word embeddings.

4 Related Work

The idea of using a more efficient coding system dates back to information theory, e.g. error-correcting codes Hamming (1950) and Huffman codes Huffman (1952). However, most embedding techniques, such as word embedding Mikolov et al. (2013); Pennington et al. (2014) and entity embedding Chen et al. (2016), use "one-hot" encoding along with a usually large embedding matrix. Recent work Kim et al. (2016); Sennrich et al. (2015); Zhang et al. (2015) explores character or sub-word based embedding models instead of word-level embedding models and achieves good results. However, in those cases the characters and sub-words are fixed and given a priori by the language, so they may carry little attached semantic meaning and are not available for other types of data. In contrast, we learn the code assignment function from data, and use a fixed length for the code.

The compression of neural networks Han et al. (2015a, b); Chen et al. (2015) has become an important topic, as the parameter size is often too large and becomes a bottleneck for deploying models to mobile devices. Our work can also be seen as a way to compress the embedding layer of a neural network. Most existing network compression techniques focus on layers that are shared across all data examples, whereas in our setting only one or a few symbols' embeddings are accessed in the embedding layer at a time.

LightRNN Li et al. (2016) can be seen as a special case of the proposed KD code, with $D = 2$ and $K = \lceil \sqrt{N} \rceil$. Due to the use of a more compact code, its code learning is harder and more expensive. We also note that a similar work on encoding embeddings with discrete codes Shu (2017) was conducted in parallel to ours.

5 Conclusions and Future Work

In this paper, we propose a novel K-way D-dimensional discrete encoding scheme to replace the "one-hot" encoding. By adopting the new coding system, the efficiency of the parameterization can be significantly improved, and the reduction of parameters can also mitigate the overfitting problem. To learn semantically meaningful codes, we derive a relaxed discrete optimization technique based on SGD. In our language modeling experiments, the number of free parameters can be reduced by 97% while achieving similar or better performance. We are currently working on on-the-fly KD code learning for a given task, where the symbol embeddings are not given beforehand.

References