Embedding methods, such as word embedding Mikolov et al. (2013); Pennington et al. (2014), have become pillars of many applications that learn from discrete structures. Examples include language modeling Kim et al. (2016), machine translation Sennrich et al. (2015), text classification Zhang et al. (2015), knowledge graph and social network modeling Bordes et al. (2013), and many others Chen et al. (2016). The objective of the embedding module in neural networks is to represent a discrete symbol, such as a word or an entity, with a continuous embedding vector. At first glance this seems to be a trivial problem: we can directly associate each symbol with a learnable embedding vector, as is done in existing work. To retrieve the embedding vector of a specific symbol, an embedding table lookup is performed. This is equivalent to the following: first we encode each symbol with a “one-hot” encoding vector $\mathbf{b} \in \{0,1\}^N$ where $\sum_j b_j = 1$ ($N$ is the total number of symbols); then, to generate the embedding vector $\mathbf{v}$, we simply multiply the “one-hot” vector $\mathbf{b}$ with the embedding matrix $\mathbf{W} \in \mathbb{R}^{N \times d}$, i.e. $\mathbf{v} = \mathbf{W}^\top \mathbf{b}$.
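As a small illustration of the equivalence described above, the following sketch checks that multiplying a “one-hot” vector with the embedding matrix returns the same vector as a direct table lookup. The sizes and the matrix values are hypothetical toy choices, and the helper names are introduced here only for illustration.

```python
# Hypothetical toy sizes: N symbols, d-dimensional embeddings.
N, d = 5, 3

# Embedding matrix with one row per symbol (values are arbitrary).
W = [[float(10 * i + j) for j in range(d)] for i in range(N)]

def one_hot(index, size):
    """Return a one-hot list of length `size` with a 1 at `index`."""
    return [1.0 if k == index else 0.0 for k in range(size)]

def embed_by_matmul(b, W):
    """Multiply the transposed embedding matrix with a one-hot vector b."""
    rows, cols = len(W), len(W[0])
    return [sum(W[i][j] * b[i] for i in range(rows)) for j in range(cols)]

symbol = 2
v_lookup = W[symbol]                                # embedding table lookup
v_matmul = embed_by_matmul(one_hot(symbol, N), W)   # one-hot times matrix
print(v_lookup == v_matmul)
```

The lookup touches a single row, whereas the matrix product touches all rows, which is why the lookup form is the one used in practice.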
Despite the simplicity of this “one-hot” encoding based embedding approach, it has several issues. The major issue is that the number of parameters grows linearly with the number of symbols. This becomes very challenging when we have millions or billions of entities in the database, or when there are many symbols with only a few observations (e.g. under Zipf’s law). There is also redundancy in the parameterization, since many symbols may actually be similar to each other. This over-parameterization can further lead to overfitting, and it also requires a lot of memory, which prevents the model from being deployed to mobile devices. Another issue comes purely from the code space utilization perspective, where we find “one-hot” encoding to be extremely inefficient. Its code space utilization rate is almost zero, as $N / 2^N \to 0$ when $N \to \infty$, while $N$ bits/dimensions of code can effectively represent $2^N$ symbols.
To address these issues, we propose a novel and much more compact coding scheme that replaces the “one-hot” encoding. In the proposed approach, we use a $K$-way $D$-dimensional code to represent each symbol, where each code has $D$ dimensions, and each dimension has a cardinality of $K$. For example, the concept of cat may be encoded as (5-1-3-7), and the concept of dog may be encoded as (5-1-3-9). The code allocation for each symbol is based on data, so that the codes capture the semantics of symbols and similar codes reflect similar meanings. We dub the proposed encoding scheme “KD encoding”.
The KD code system is much more compact than its “one-hot” counterpart. To represent a set of symbols of size $N$, the “KD encoding” only requires that $K^D \ge N$. By increasing $K$ or $D$ by a small amount, we can easily achieve $K^D \gg N$, in which case it will still be much more compact. Consider $K^D = N$: the utilization rate of “KD encoding” is $1$, compared with $N / 2^N$ for “one-hot” encoding, so it is $2^N / N$ times more compact than its “one-hot” counterpart (footnote 1: Assuming we have vocabulary size , and setting number of dimensions , that is times more efficient).
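To make the compactness argument concrete, the sketch below computes the smallest code length that still gives every symbol its own code for a few code cardinalities, and the base-2 logarithm of the “one-hot” utilization rate. The vocabulary size of 10,000 is a hypothetical choice (it matches the PTB vocabulary size quoted in the experiments), and the function names are introduced here for illustration only.

```python
import math

def min_code_length(num_symbols, K):
    """Smallest code length D such that K**D >= num_symbols."""
    D, capacity = 0, 1
    while capacity < num_symbols:
        capacity *= K
        D += 1
    return D

def log2_utilization_one_hot(num_symbols):
    """log2 of N / 2**N, the one-hot code space utilization rate."""
    return math.log2(num_symbols) - num_symbols

N = 10_000  # hypothetical vocabulary size
for K in (2, 10, 100):
    D = min_code_length(N, K)
    # Utilization of the KD code space: fraction of codes actually needed.
    print(K, D, N / K ** D)

print(log2_utilization_one_hot(N))
```

The logarithm is used for the “one-hot” case because the rate itself is far too small to represent as a floating-point number.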
The compactness of the code can be translated into compactness of the parameterization. Instead of storing a giant embedding matrix with one row per symbol, the symbol embedding vector is generated by composing far fewer code embedding vectors. This is achieved as follows: first we embed each KD code into a sequence of code embedding vectors in $\mathbb{R}^{D \times d'}$, and then apply some transformation $f(\cdot)$, which can be based on neural networks, to generate the final symbol embedding. In order to learn meaningful discrete codes that exploit the similarities among symbols, we derive a relaxed discrete optimization algorithm based on stochastic gradient descent (SGD). By adopting the new approach, we can reduce the number of parameters from $O(Nd)$ to $O(\frac{K}{\log K} \log N \cdot d' + C)$, where $d'$ is the code embedding size and $C$ is the number of neural network parameters. To validate our idea, we conduct experiments on both synthetic data and a real language modeling task. We achieve a 97% reduction in embedding parameters in the language modeling task and obtain similar or better performance.
2 The K-way D-dimensional Discrete Encoding
In this section we introduce the “KD encoding” in detail. Specifically, we present methods to generate a symbol embedding from its (given or learned) “KD code”, as well as techniques for learning the “KD code” from data.
2.1 The “KD encoding” Framework
In the proposed framework, each symbol is associated with a $K$-way $D$-dimensional discrete code. We denote each symbol by $s_i \in \mathcal{S}$, where $\mathcal{S}$ is a set of symbols with cardinality $N$. Each discrete code is denoted by $\mathbf{c}_i \in \mathcal{B}^D$, where $\mathcal{B}$ is the set of code bits with cardinality $K$. To connect symbols with discrete codes, a mapping function $\phi(\cdot): \mathcal{S} \to \mathcal{B}^D$ is used. The learning of this mapping function will be introduced later; once fixed, it can be stored as a hash table for fast lookup.
Given the $i$-th symbol $s_i$, we can retrieve its code via a code lookup, $\mathbf{c}_i = \phi(s_i)$. The final embedding $\mathbf{v}_i$ is generated by first embedding the code $\mathbf{c}_i$ into a sequence of code embedding vectors $(\mathbf{W}^1_{c_i^1}, \dots, \mathbf{W}^D_{c_i^D})$, and then applying a differentiable transformation function $f$, which is learned as well. We introduce the transformation function in the next sub-section. Here $\mathbf{W}^j \in \mathbb{R}^{K \times d'}$ is the embedding matrix for the $j$-th code bit. The overall framework is illustrated in Figure 1.
In order to uniquely identify a symbol, we only need $K^D = N$, as this is just enough to assign a unique code to each symbol. When this holds, the code space is fully utilized, and no symbol can change its code without affecting the other symbols. We call this type of code system the compact code. The optimization problem for compact code can be very difficult, and usually requires approximate combinatorial algorithms such as graph matching Li et al. (2016). The opposite of the compact code is the redundant code system, where we have $K^D \gg N$, and there is a lot of “empty” code space with no corresponding symbol. Changing the code of one symbol then may not affect other symbols, since the random collision probability can be very small (footnote 2: for example, we can set $K$ and $D$ such that $K^D = 10^{20}$, e.g. $K = 100$, $D = 10$, for a billion symbols; in a random code assignment, the probability of no collision at all is 99.5%), which makes it easier to optimize. The redundant code can be achieved by slightly increasing the size of $K$ or $D$, thanks to the exponential nature of their relation. Hence, in both compact code and redundant code, we have $D = O(\log N / \log K)$.
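The collision argument for the redundant code can be checked numerically with the standard birthday-problem approximation. In the sketch below, the setting of 100 values per code dimension and 10 code dimensions is an assumption made here because it reproduces the 99.5% figure quoted for a billion symbols; the second call uses a deliberately small code space of 1296 for 10K symbols to show the opposite regime.

```python
import math

def prob_no_collision(num_symbols, K, D):
    """Birthday-problem approximation of P(all random codes are distinct)."""
    code_space = K ** D
    pairs = num_symbols * (num_symbols - 1) / 2
    return math.exp(-pairs / code_space)

# Assumed redundant setting: a billion symbols, code space of 100**10.
p_redundant = prob_no_collision(10 ** 9, K=100, D=10)
# Deliberately tiny code space: collisions are unavoidable.
p_tiny = prob_no_collision(10 ** 4, K=6, D=4)
print(round(p_redundant, 4), p_tiny)
```

The approximation assumes codes are drawn uniformly and independently, which is the random assignment case described in the footnote rather than the learned assignment.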
2.2 Discrete Code Embedding
Since a discrete code has multiple bits/dimensions, we cannot directly use a single embedding lookup to find the symbol embedding, as is done in “one-hot” encoding. Hence, we first map each code into a sequence of code embedding vectors via a code embedding lookup, and then use a function $f$ that transforms the code embedding vectors into the final symbol embedding vector.
As mentioned above, we associate an embedding matrix $\mathbf{W}^j \in \mathbb{R}^{K \times d'}$ with the $j$-th dimension of the discrete code. This enables us to turn a discrete code $\mathbf{c}_i$ into a sequence of code embedding vectors $(\mathbf{W}^1_{c_i^1}, \dots, \mathbf{W}^D_{c_i^D})$.
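Now, to generate the final embedding vector $\mathbf{v}_i$, a transformation function $f$ is applied. In this work we consider two types of embedding transformation functions. The first one is based on a linear transformation,
$$\mathbf{v}_i = \mathbf{H}^\top \sum_{j=1}^{D} \mathbf{W}^j_{c_i^j},$$
where $\mathbf{H} \in \mathbb{R}^{d' \times d}$ is the linear matrix. While this is simple, due to its linear nature, the capacity of the generated symbol embedding can be limited. This motivates us to adopt a non-linear transformation function based on a recurrent neural network, in particular an LSTM Hochreiter and Schmidhuber (1997). Assuming the code embedding dimension is the same as the LSTM hidden dimension, the formulation is the standard LSTM recurrence applied over the code bits $j = 1, \dots, D$, with input $\mathbf{x}_j = \mathbf{W}^j_{c^j}$:
$$
\begin{aligned}
\mathbf{i}_j &= \sigma(\mathbf{U}_i \mathbf{x}_j + \mathbf{V}_i \mathbf{h}_{j-1} + \mathbf{b}_i), \\
\mathbf{f}_j &= \sigma(\mathbf{U}_f \mathbf{x}_j + \mathbf{V}_f \mathbf{h}_{j-1} + \mathbf{b}_f), \\
\mathbf{q}_j &= \sigma(\mathbf{U}_q \mathbf{x}_j + \mathbf{V}_q \mathbf{h}_{j-1} + \mathbf{b}_q), \\
\mathbf{m}_j &= \mathbf{f}_j \odot \mathbf{m}_{j-1} + \mathbf{i}_j \odot \tanh(\mathbf{U}_m \mathbf{x}_j + \mathbf{V}_m \mathbf{h}_{j-1} + \mathbf{b}_m), \\
\mathbf{h}_j &= \mathbf{q}_j \odot \tanh(\mathbf{m}_j),
\end{aligned}
$$
where $\mathbf{i}_j$, $\mathbf{f}_j$ and $\mathbf{q}_j$ are the input, forget and output gates, $\mathbf{m}_j$ is the memory cell, $\mathbf{h}_j$ is the LSTM output, and $\sigma(\cdot)$ and $\tanh(\cdot)$ are, respectively, standard sigmoid and tanh activation functions. Note that the symbol index $i$ is omitted for simplicity. The final symbol embedding can be computed by summing over the LSTM outputs at all code bits (with a linear transformation to match the dimension if $d' \neq d$), i.e. $\mathbf{v} = \mathbf{H}^\top \sum_{j=1}^{D} \mathbf{h}_j$.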
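The code embedding step with a linear composition can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear transformation is taken to be "sum the code embedding vectors, then project with one matrix", all sizes are hypothetical, and random values stand in for learned parameters. The two codes are the cat and dog examples from the introduction.

```python
import random

random.seed(0)
K, D = 10, 4            # hypothetical code cardinality and code length
d_code, d_out = 5, 3    # hypothetical code / symbol embedding sizes

def rand_matrix(rows, cols):
    """Random matrix as a list of rows (stands in for learned parameters)."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# One K x d_code embedding matrix per code dimension.
code_tables = [rand_matrix(K, d_code) for _ in range(D)]
# Linear map from the code embedding size to the symbol embedding size.
H = rand_matrix(d_code, d_out)

def code_to_vectors(code):
    """Look up one code embedding vector per code dimension."""
    return [code_tables[j][c] for j, c in enumerate(code)]

def linear_compose(vectors):
    """Sum the code embedding vectors, then project with H (assumed form)."""
    summed = [sum(column) for column in zip(*vectors)]
    return [sum(summed[a] * H[a][b] for a in range(d_code)) for b in range(d_out)]

cat, dog = (5, 1, 3, 7), (5, 1, 3, 9)
v_cat = linear_compose(code_to_vectors(cat))
v_dog = linear_compose(code_to_vectors(dog))
num_params = K * D * d_code + d_code * d_out
print(len(v_cat), num_params)
```

Because the two codes agree in their first three dimensions, they share three of their four code embedding vectors, which is how similar codes lead to related embeddings while the parameter count stays independent of the number of symbols.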
Proposition. The number of embedding parameters used in KD encoding is $O(\frac{K}{\log K} \log N \cdot d' + C)$, where $C$ is the number of parameters of the neural network.
Proof. The proof is straightforward. There are two types of embedding parameters in the KD encoding: (1) code embedding vectors, and (2) neural network parameters. There are $K \times D$ code embedding vectors with $d'$ dimensions; since $D = O(\log N / \log K)$ suffices to give each symbol its own code, this amounts to $O(\frac{K}{\log K} \log N \cdot d')$ parameters. As for the number of parameters in the neural network (LSTM), which is $C$, it may be treated as a constant with respect to the number of symbols, since $C$ is independent of $N$ provided that there is a certain structure present in the symbol embeddings. For example, if we assume the symbol embeddings lie within an $\epsilon$-ball of a finite number of centroids in $d$-dimensional space, it should only require a constant $C$ to achieve an $\epsilon$-distance error bound, regardless of the vocabulary size, since the neural network just has to memorize the finite set of centroids.
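The parameter count in the proposition can be compared with the “one-hot” case numerically. The sketch below assumes the minimal (compact) code length for each vocabulary size, a hypothetical code cardinality of 10, a hypothetical embedding size of 200 for both schemes, and a neural network term that defaults to zero; none of these values are taken from the experiments.

```python
def code_length(N, K):
    """Smallest D with K**D >= N (compact code length)."""
    D, capacity = 0, 1
    while capacity < N:
        capacity *= K
        D += 1
    return D

def one_hot_params(N, d):
    """One d-dimensional row per symbol."""
    return N * d

def kd_params(N, K, d_code, C=0):
    """K * D code embedding vectors of size d_code, plus C network parameters."""
    return K * code_length(N, K) * d_code + C

d = d_code = 200   # hypothetical embedding sizes
K = 10             # hypothetical code cardinality
for N in (10 ** 4, 10 ** 6, 10 ** 9):
    print(N, one_hot_params(N, d), kd_params(N, K, d_code))
```

Growing the vocabulary by a factor of 100,000 multiplies the “one-hot” count by the same factor, while the KD count only grows with the code length. A redundant code would use a somewhat larger code length than the minimum assumed here.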
2.3 Discrete Code Learning
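The code assignment can be very important for both parameterization efficiency and generalization, so we want to learn the code allocation function end-to-end from data, in contrast to the hand-coded “one-hot” encoding. In this work, we assume that we are already given the pre-trained embedding vectors $\mathbf{U}$, with each $\mathbf{u}_i \in \mathbb{R}^d$. Thus we learn the discrete codes based on the given $\mathbf{U}$. Once the codes are learned, we can re-learn the code embedding parameters, including the transformation function, according to the specific task. In the future, we will extend this to the case where such embeddings are not available.
To find the optimal codes, we minimize the squared loss between the real embedding vector $\mathbf{u}_i$ and the embedding vector generated from the KD code. This yields the following objective:
$$\min \sum_{i=1}^{N} \left\lVert \mathbf{u}_i - f\!\left(\mathbf{W}^1_{c_i^1}, \dots, \mathbf{W}^D_{c_i^D}\right) \right\rVert^2,$$
where the minimization is over the codes $\mathbf{c}_i$, the code embedding matrices $\mathbf{W}^j$ and the parameters of $f$; $\mathbf{W}^j_{c_i^j}$ is the code embedding vector selected by the $j$-th dimension of the code of symbol $i$, and $f$ is a differentiable transformation function as introduced above.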
Since each $\mathbf{c}_i$ is a discrete code, it cannot be directly optimized via stochastic gradient descent as the other parameters are. Thus we need to use a relaxation in order to learn it effectively via SGD. We observe that each code can be seen as a concatenation of $D$ “one-hot” vectors, i.e. $\mathbf{c} = (\mathbf{o}^1, \dots, \mathbf{o}^D)$, where $\mathbf{o}^j \in \{0,1\}^K$ and $\sum_k o^j_k = 1$, where $o^j_k$ is the $k$-th component of $\mathbf{o}^j$. We can adjust $\mathbf{o}^j$ in order to update the code, but it is still non-differentiable. To address the issue, we relax $\mathbf{o}^j$ from a “one-hot” vector to a continuous vector $\hat{\mathbf{o}}^j$ by applying a tempering Softmax to code logits $\pi^j_k$:
$$o^j_k \approx \hat{o}^j_k = \frac{\exp(\pi^j_k / \tau)}{\sum_{k'} \exp(\pi^j_{k'} / \tau)},$$
where $\tau$ is a temperature term; as $\tau \to 0$, this approximation becomes exact (except in the case of ties). Similar techniques have been applied in Gumbel-Softmax Jang et al. (2016); Maddison et al. (2016). We show effects of the temperature when with in Figure 2.
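The effect of the temperature on the tempering Softmax can be seen in a few lines of Python. The logits below are hypothetical; the function subtracts the maximum logit before exponentiating, which is a standard numerical-stability step and does not change the result.

```python
import math

def tempered_softmax(logits, tau):
    """Softmax of logits / tau, computed in a numerically stable way."""
    m = max(logits)
    exps = [math.exp((x - m) / tau) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 2.0, 0.5]   # hypothetical code logits for one code dimension
for tau in (100.0, 1.0, 0.1, 0.01):
    print(tau, [round(p, 3) for p in tempered_softmax(logits, tau)])

hot = tempered_softmax(logits, 100.0)    # close to uniform
cold = tempered_softmax(logits, 0.01)    # close to one-hot at the argmax
```

At a high temperature the output is nearly uniform, and at a low temperature it approaches the “one-hot” vector of the largest logit, matching the limiting behaviour described above.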
To learn the relaxed code logits, we can gradually decrease the temperature during training. When the temperature $\tau$ is not small enough, the relaxed code $\hat{\mathbf{o}}^j$ is still a smooth vector, so we use a linear combination, i.e. $(\hat{\mathbf{o}}^j)^\top \mathbf{W}^j$, instead of indexing, i.e. $\mathbf{W}^j_{c^j}$, to generate the embedding vector for the $j$-th code dimension.
Note that the tempering Softmax approximation only provides useful gradients when $\tau$ is not too small; the gradient vanishes when $\tau \to 0$. So at the beginning, when $\tau$ is not small enough, we are actually learning continuous codes instead of discrete codes, which may not be desirable. When $\tau$ becomes small enough that we start to learn real discrete codes, the small $\tau$ in turn prevents the codes from being updated further, as it makes the gradient vanish.
To address this issue, we take inspiration from the Straight-Through Estimator Bengio et al. (2013). In the forward pass, instead of using the tempering Softmax output, which is likely a smooth continuous vector, we take its maximum and turn it into a “one-hot” vector as follows, which gives an exactly discrete code:
$$\mathbf{o}^j = \mathrm{one\_hot}\!\left(\arg\max_k \hat{o}^j_k\right).$$
The use of the straight-through estimator is equivalent to using different temperatures during the forward and backward passes. In the forward pass, $\tau \to 0$ is used, for which we simply take the argmax. In the backward pass (to compute the gradient), we pretend that a larger $\tau$ was used. Although this is a biased gradient estimator, the sign of the gradient is still correct. Compared to using the same temperature in both passes, this always outputs a “one-hot” discrete code, and there is no vanishing gradient problem as long as the backward temperature does not approach zero.
The training procedure is summarized in Algorithm 1, in which the stop_gradient operator will prevent the gradient from back-propagating through it.
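The straight-through idea can be illustrated for a single code dimension with a hand-written forward and backward pass. This is a minimal sketch under stated assumptions, not a reproduction of Algorithm 1: the code embedding table, target vector, logits, temperature and learning rate are all hypothetical, only the logits are updated, and the gradient is derived manually instead of using a stop_gradient operator in an autodiff framework.

```python
import math

def softmax(logits, tau):
    """Tempering Softmax, computed in a numerically stable way."""
    m = max(logits)
    exps = [math.exp((x - m) / tau) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ste_step(logits, table, target, tau=1.0, lr=1.0):
    """One straight-through update of the code logits for one code dimension."""
    K = len(logits)
    soft = softmax(logits, tau)
    # Forward pass: hard one-hot code, i.e. plain indexing into the table.
    k_star = max(range(K), key=lambda k: soft[k])
    emb = table[k_star]
    loss = sum((e - t) ** 2 for e, t in zip(emb, target))
    # Gradient of the loss w.r.t. each entry of the one-hot vector o,
    # using emb = sum_k o[k] * table[k].
    g = [sum(2.0 * (e - t) * w for e, t, w in zip(emb, target, table[k]))
         for k in range(K)]
    # Backward pass: pretend o was the tempering Softmax and use its Jacobian.
    g_bar = sum(gk * sk for gk, sk in zip(g, soft))
    grad_logits = [soft[m] * (g[m] - g_bar) / tau for m in range(K)]
    new_logits = [x - lr * dx for x, dx in zip(logits, grad_logits)]
    return new_logits, k_star, loss

table = [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]]  # hypothetical code embeddings
target = [1.0, 1.0]                              # hypothetical target embedding
logits = [0.5, 0.0, 0.0]                         # code initially selects entry 0
history = []
for _ in range(3):
    logits, code, loss = ste_step(logits, table, target)
    history.append((code, loss))
print(history)
```

In this toy run the forward pass always uses a discrete code, yet the logits still receive a non-zero gradient, so the code moves from entry 0 to the entry that matches the target. With a plain argmax and no surrogate gradient the logits would never change.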
3 Experiments
In this section we present both real and synthetic experiments to validate the proposed approach. The first set of experiments is based on the language modeling task. Language modeling is a fundamental task in NLP, and it can be formulated as predicting the probability of a sequence of words. Models based on recurrent neural networks with word embedding Mikolov et al. (2010); Kim et al. (2016) achieve state-of-the-art results, so we base our experiments on them. The widely used English Penn Treebank Marcus et al. (1993) dataset is used in our experiments; it contains 1M words with a vocabulary size of 10K. The training/validation/test split follows the convention of Mikolov et al. (2010). We use a standard LSTM Hochreiter and Schmidhuber (1997) with two different model sizes, which trade off model size and accuracy. The larger model has a word embedding size and LSTM hidden size of 1500, and both are 200 for the smaller model. By default, is used in the proposed approach. A temperature schedule, i.e. , is used to train the code, where , and is the iteration number. We first train the model regularly using the conventional embedding approach to obtain the embedding vectors, which are then used to learn the discrete codes. Once the discrete codes are obtained and fixed, we re-train the model from scratch with the same architecture and hyper-parameters, using the code embedding in place of the conventional embedding.
Table 1 compares the conventional “one-hot” word embeddings with the proposed KD encoding. We present several variants of the KD encoding scheme, distinguished by the combination of (1) the discrete code learning model and (2) the symbol embedding re-learning/re-training model. For discrete code learning, we have three cases: random assignment, code learned with a linear transformation, and code learned with an LSTM transformation function; the latter two can also be used in the symbol embedding re-learning model. Firstly, we observe that discrete code learning is critical for KD encoding, as random discrete codes produce much worse performance. Secondly, we observe that with appropriate code learning, the test perplexity is similar to or better than in the “one-hot” encoding case, while saving 82%-97% of the embedding parameters.
| Code learning + re-training | Small model: PPL | Small model: E. Size | Small model: C. Rate | Large model: PPL | Large model: E. Size | Large model: C. Rate |
|---|---|---|---|---|---|---|
| Random + Linear | 144.32 | 0.1M | 0.05 | 103.44 | 0.4M | 0.033 |
| Random + LSTM | 147.13 | 0.37M | 0.185 | 119.62 | 0.63M | 0.042 |
| Linear + Linear | 118.40 | 0.1M | 0.05 | 87.42 | 0.4M | 0.033 |
| Linear + LSTM | 111.13 | 0.37M | 0.185 | 88.82 | 0.63M | 0.042 |
| LSTM + Linear | 117.21 | 0.1M | 0.05 | 84.61 | 0.4M | 0.033 |
| LSTM + LSTM | 111.31 | 0.37M | 0.185 | 85.37 | 0.63M | 0.042 |
We also vary the size of $K$ or $D$ and see how they affect the performance. As shown in Figure 2(a) and 2(b), a small $K$ or $D$ may harm the performance (even though $K^D \ge N$ is satisfied), which suggests that the redundant code may be easier to learn.
In order to understand the effects of temperature, and the importance of using discrete code output (i.e., with zero temperature), we create another set of experiments based on synthetic embedding clusters. We generate 10K nodes that belong to 100 well-separated clusters in 10-dimensional space, and $K = 100$, $D = 1$ is used, which mimics the K-means clustering problem, as each code represents a cluster assignment. Both the squared loss and the clustering NMI are shown in Figure 2(c) and 2(d). We observe that the STE with temperature scheduling is much more effective than its counterparts. When the temperature is kept constant, some percentage of the codes keeps changing, and the loss as well as the NMI converge to a worse local optimum. When a smooth continuous code is used instead of a discrete code, we observe that the loss first decreases and then increases. This is because the behavior only mimics the discrete code output once the temperature is small enough.
To further inspect the learned codes, we use the pre-trained embeddings from GloVe Pennington et al. (2014), which have better coverage and quality than the embeddings pre-trained on PTB language modeling. We intentionally use $K = 6$, $D = 4$ (the code space is 1296) for a vocabulary size of 10K, so that the model is forced to assign the same code to multiple words. Table 2 shows the codes learned from the GloVe vectors, which demonstrates that similar discrete codes are learned for semantically similar words.
| Code | Words |
|---|---|
| 3-1-0-3 | up when over into time back off set left open half behind quickly starts |
| 3-1-0-4 | week tuesday wednesday monday thursday friday sunday saturday |
| 3-1-0-5 | by were after before while past ago close soon recently continued meanwhile |
| 3-1-1-1 | year month months record fall annual target cuts |
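The codes listed in the table can be compared position by position. The sketch below parses the four codes shown, computes their Hamming distances to the first code, and reports how many words must share each code on average. The code cardinality of 6 and code length of 4 are assumptions made here; they are consistent with the stated code space of 1296 and with the four-digit codes in the table.

```python
def parse_code(text):
    """Turn a code string such as '3-1-0-4' into a tuple of integers."""
    return tuple(int(part) for part in text.split("-"))

def hamming(a, b):
    """Number of code dimensions in which two codes differ."""
    return sum(x != y for x, y in zip(a, b))

K, D = 6, 4            # assumed: 6 ** 4 == 1296, the stated code space
vocab_size = 10_000    # vocabulary size stated in the text
codes = ["3-1-0-3", "3-1-0-4", "3-1-0-5", "3-1-1-1"]
parsed = [parse_code(c) for c in codes]

code_space = K ** D
words_per_code = vocab_size / code_space
distances = [hamming(parsed[0], p) for p in parsed]
print(code_space, round(words_per_code, 2), distances)
```

The word groups for time-related concepts differ from one another in only one or two code dimensions, which is the sense in which similar codes reflect similar meanings.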
4 Related Work
The idea of using a more efficient coding system dates back to information theory, for example error-correcting codes Hamming (1950) and Huffman codes Huffman (1952). However, most embedding techniques, such as word embedding Mikolov et al. (2013); Pennington et al. (2014) and entity embedding Chen et al. (2016), use “one-hot” encoding along with a usually large embedding matrix. Recent work Kim et al. (2016); Sennrich et al. (2015); Zhang et al. (2015) explores character or sub-word based embedding models instead of the word embedding model and obtains good results. However, in these cases the characters and sub-words are fixed and given a priori by the language, so they may carry little semantic meaning and are not available for other types of data. In contrast, we learn the code assignment function from data, and we use a fixed length for the code.
The compression of neural networks Han et al. (2015a, b); Chen et al. (2015) has become an important topic, as the large number of parameters is a bottleneck for deploying models to mobile devices. Our work can also be seen as a way to compress the embedding layer in neural networks. Most existing network compression techniques focus on layers that are shared across all data examples, whereas in the embedding layer considered in our work only one or a few symbols are used at a time.
5 Conclusions and Future Work
In this paper, we propose a novel K-way D-dimensional discrete encoding scheme to replace the “one-hot” encoding. By adopting the new coding system, the efficiency of parameterization can be significantly improved. Furthermore, the reduction in parameters can also mitigate the over-fitting problem. To learn semantically meaningful codes, we derive a relaxed discrete optimization technique based on SGD. In our language modeling experiments, the number of embedding parameters can be reduced by 97% while achieving similar or better performance. We are currently working on learning the KD code on the fly along with the given task, for the case where the symbol embeddings are not given beforehand.
References
- Bengio et al. [2013] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
- Bordes et al. [2013] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795, 2013.
- Chen et al. [2016] Ting Chen, Lu-An Tang, Yizhou Sun, Zhengzhang Chen, and Kai Zhang. Entity embedding-based anomaly detection for heterogeneous categorical events. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 1396–1403. AAAI Press, 2016.
- Chen et al. [2015] Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, and Yixin Chen. Compressing neural networks with the hashing trick. In International Conference on Machine Learning, pages 2285–2294, 2015.
- Hamming [1950] Richard W Hamming. Error detecting and error correcting codes. Bell Labs Technical Journal, 29(2):147–160, 1950.
- Han et al. [2015a] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015a.
- Han et al. [2015b] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015b.
- Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- Huffman [1952] David A Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9):1098–1101, 1952.
- Jang et al. [2016] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.
- Kim et al. [2016] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. Character-aware neural language models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 2741–2749. AAAI Press, 2016.
- Li et al. [2016] Xiang Li, Tao Qin, Jian Yang, Xiaolin Hu, and Tieyan Liu. LightRNN: Memory and computation-efficient recurrent neural networks. In Advances in Neural Information Processing Systems, pages 4385–4393, 2016.
- Maddison et al. [2016] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
- Marcus et al. [1993] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
- Mikolov et al. [2010] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černockỳ, and Sanjeev Khudanpur. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010.
- Mikolov et al. [2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
- Pennington et al. [2014] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543, 2014.
- Sennrich et al. [2015] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
- Shu and Nakayama [2017] Raphael Shu and Hideki Nakayama. Compressing word embeddings via deep compositional code learning. https://arxiv.org/abs/1711.01068, 2017.
- Zhang et al. [2015] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657, 2015.