1 Introduction
Embedding methods, such as word embeddings Mikolov et al. (2013); Pennington et al. (2014), have become pillars of many applications that learn from discrete structures. Examples include language modeling Kim et al. (2016), machine translation Sennrich et al. (2015), text classification Zhang et al. (2015), knowledge graph and social network modeling Bordes et al. (2013), and many others Chen et al. (2016). The objective of the embedding module in neural networks is to represent a discrete symbol, such as a word or an entity, with a continuous embedding vector.
At first glance this seems a trivial problem, in which we can directly associate each symbol with a learnable embedding vector, as is done in existing work. To retrieve the embedding vector of a specific symbol, an embedding table lookup is performed. This is equivalent to the following: first we encode each symbol with a "one-hot" encoding vector b ∈ {0, 1}^N (where N is the total number of symbols); then, to generate the embedding vector, we simply multiply the "one-hot" vector by the embedding matrix W ∈ R^{N×d}, i.e. v = bW.

Despite its simplicity, this "one-hot" encoding based embedding approach has several issues. The major issue is that the number of parameters grows linearly with the number of symbols. This becomes very challenging when there are millions or billions of entities in the database, or when there are many symbols with only a few observations each (e.g. under Zipf's law). There is also redundancy in the parameterization, since many symbols may actually be similar to each other. This overparameterization can lead to overfitting, and it also requires a lot of memory, which prevents the model from being deployed to mobile devices. Another issue is purely from the code space utilization perspective, where we find "one-hot" encoding extremely inefficient: its code space utilization rate is almost zero, as log2(N)/N → 0, while log2(N) bits/dimensions of code can effectively represent N symbols.
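The equivalence between an embedding table lookup and the one-hot matrix product can be checked in a few lines (a minimal sketch with arbitrary toy sizes):

```python
import numpy as np

N, d = 5, 3                      # 5 symbols, embedding size 3
rng = np.random.default_rng(0)
W = rng.standard_normal((N, d))  # embedding matrix, one row per symbol

def onehot(i, n):
    b = np.zeros(n)
    b[i] = 1.0
    return b

# Table lookup (row indexing) and the one-hot product give the same vector.
i = 2
assert np.allclose(W[i], onehot(i, N) @ W)
```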
To address these issues, we propose a novel and much more compact coding scheme to replace the "one-hot" encoding. In the proposed approach, we use a K-way D-dimensional code to represent each symbol, where each code has D dimensions and each dimension has a cardinality of K. For example, the concept of cat may be encoded as (5-1-3-7), and the concept of dog may be encoded as (5-1-3-9). The code allocation for each symbol is learned from data so that the codes capture the semantics of symbols, and similar codes may reflect similar meanings. We dub the proposed encoding scheme "KD encoding".
The KD code system is much more compact than its "one-hot" counterpart. To represent a set of symbols of size N, the "KD encoding" only requires K^D ≥ N, i.e. D ≥ log_K N. By increasing K or D by a small amount, we can easily achieve K^D ≫ N, in which case the code is still much more compact. Consider K = 2: the utilization rate of the "KD encoding" is log2(N)/D, which is N/D times more compact than the "one-hot" counterpart.[1]

[1] For example, with a vocabulary of size N and the number of dimensions D set proportional to log2(N), the code is on the order of N/log2(N) times more efficient.
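A quick sanity check of the code-space arithmetic above, using hypothetical values of N, K, and D:

```python
import math

N = 1_000_000                       # hypothetical vocabulary size
K, D = 32, 4                        # a K-way D-dimensional code
assert K ** D >= N                  # 32^4 = 1,048,576 codes cover N symbols

# The minimal D for a given K is ceil(log_K N):
D_min = math.ceil(math.log(N) / math.log(K))
assert D_min == 4

# A one-hot code needs N dimensions; the KD code needs only D of them.
assert N / D == 250_000             # dimension reduction factor
```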
The compactness of the code translates into compactness of the parameterization. Dropping the giant embedding matrix that stores one vector per symbol, the symbol embedding is instead generated by composing a much smaller number of code embedding vectors. This is achieved as follows: first we embed each KD code into a sequence of D code embedding vectors, and then apply a transformation f, which can be based on neural networks, to generate the final symbol embedding. In order to learn meaningful discrete codes that exploit the similarities among symbols, we derive a relaxed discrete optimization algorithm based on stochastic gradient descent (SGD). By adopting the new approach, we reduce the number of embedding parameters from O(N d) to O(K D d_c + C), where d_c is the code embedding size and C is the number of neural network parameters. To validate our idea, we conduct experiments on both synthetic data and a real language modeling task. We achieve a 97% reduction of embedding parameters in the language modeling task while obtaining similar or better performance.
2 The K-way D-dimensional Discrete Encoding
In this section we introduce the "KD encoding" in detail. Specifically, we present methods to generate a symbol embedding from its (given or learned) "KD code", as well as techniques for learning the "KD code" from data.
2.1 The “KD encoding” Framework
In the proposed framework, each symbol is associated with a K-way D-dimensional discrete code. We denote each symbol by s ∈ S, where S is the set of symbols with cardinality N, and each discrete code by c = (c^1, ..., c^D) ∈ C, where each code dimension c^j takes one of K values, i.e. c^j ∈ {1, ..., K}. To connect symbols with discrete codes, a mapping function φ : S → C is used. The learning of this mapping function is introduced later; once fixed, it can be stored as a hash table for fast lookup.
Given the i-th symbol s_i, we retrieve its code via a code lookup, c_i = φ(s_i). The final embedding is generated by first embedding the code into a sequence of code embedding vectors (e_1, ..., e_D), where e_j = W^j_{c^j}, and then applying a differentiable transformation function f, which is learned as well. We introduce the transformation function in the next subsection. Here W^j is the embedding matrix for the j-th code dimension. The overall framework is illustrated in Figure 1.
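The pipeline above (code lookup, per-dimension code embedding, then a transformation f) can be sketched as follows; the sizes, the toy code table, and the use of a plain linear map as f are illustrative assumptions:

```python
import numpy as np

K, D, d_c, d = 10, 4, 6, 10   # hypothetical sizes: 10-way 4-dimensional codes
rng = np.random.default_rng(0)

# One code-embedding matrix per code dimension: W[j] is K x d_c.
W = rng.standard_normal((D, K, d_c))
# A linear map H standing in for the transformation f (an LSTM could be used instead).
H = rng.standard_normal((D * d_c, d))

# Toy code table reusing the cat/dog codes from the introduction.
code_table = {"cat": (5, 1, 3, 7), "dog": (5, 1, 3, 9)}

def embed(symbol):
    code = code_table[symbol]                              # code lookup
    e = np.concatenate([W[j, code[j]] for j in range(D)])  # per-dimension embedding
    return e @ H                                           # transformation f

v_cat, v_dog = embed("cat"), embed("dog")
assert v_cat.shape == (d,) and not np.allclose(v_cat, v_dog)
```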
In order to uniquely identify every symbol, we only need K^D ≥ N, as we can then assign a unique code to each symbol. When K^D = N, the code space is fully utilized, and no symbol can change its code without affecting another symbol. We call this type of code system a compact code. The optimization problem for compact codes can be very difficult, and usually requires approximate combinatorial algorithms such as graph matching Li et al. (2016). Opposite to the compact code is the redundant code system, where K^D ≫ N. Here there is a lot of "empty" code space with no symbol correspondence, so changing the code of one symbol will likely not affect other symbols, since the random collision probability is very small,[2] which makes the code easier to optimize. A redundant code can be achieved by slightly increasing K or D, thanks to the exponential nature of their relation. Hence, for both compact and redundant codes, we have D on the order of log_K N.

[2] For example, we can set K = 100 and D = 10 for a billion symbols; under a random code assignment, the probability of no collision at all is 99.5%.

2.2 Discrete Code Embedding
Since a discrete code has multiple dimensions, we cannot directly use a single embedding lookup to find the symbol embedding as in "one-hot" encoding. Instead, we first map each code into a sequence of code embedding vectors via per-dimension code lookups, and then use a function f that transforms the code embedding vectors into the final symbol embedding vector.
As mentioned above, we associate an embedding matrix W^j ∈ R^{K×d_c} with the j-th dimension of the discrete code. This enables us to turn a discrete code c into a sequence of code embedding vectors (e_1, ..., e_D), where e_j = W^j_{c^j}.
Now, to generate the final embedding vector v, a transformation function f is applied. In this work we consider two types of embedding transformation functions. The first is a linear transformation,

v = H [e_1; e_2; ...; e_D],

where H ∈ R^{d × D d_c} is the linear transformation matrix and [·; ·] denotes concatenation. While this is simple, due to its linear nature the capacity of the generated symbol embeddings can be limited. This motivates us to adopt a nonlinear transformation function based on a recurrent neural network, LSTM
Hochreiter and Schmidhuber (1997), in particular. Assuming the code embedding dimension equals the LSTM hidden dimension, the formulation is as follows:

i_j = σ(W_i e_j + U_i h_{j−1} + b_i)
f_j = σ(W_f e_j + U_f h_{j−1} + b_f)
o_j = σ(W_o e_j + U_o h_{j−1} + b_o)
m_j = f_j ⊙ m_{j−1} + i_j ⊙ tanh(W_m e_j + U_m h_{j−1} + b_m)
h_j = o_j ⊙ tanh(m_j)

where σ(·) and tanh(·) are, respectively, the standard sigmoid and tanh activation functions. Note that the symbol index is omitted for simplicity. The final symbol embedding is computed by summing the LSTM outputs over all code dimensions (with a linear transformation to match dimensions if d ≠ d_c), i.e. v = Σ_{j=1}^{D} h_j.

Lemma 1.
The number of embedding parameters used in the KD encoding is O(K D d_c + C), where C is the number of parameters of the neural network.
The proof is straightforward. There are two types of embedding parameters in the KD encoding: (1) code embedding vectors and (2) neural network parameters. There are K × D code embedding vectors, each with d_c dimensions. As for the number of parameters C in the neural network (the LSTM), it can be treated as constant with respect to the number of symbols, since C is independent of N, provided that certain structure is present in the symbol embeddings. For example, if we assume the symbol embeddings lie within ε-balls around a finite number of centroids in d-dimensional space, only a constant C is required to achieve an ε distance error bound, regardless of the vocabulary size, since the neural network only has to memorize the finite set of centroids.
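The parameter counts in Lemma 1 are easy to check numerically (all sizes hypothetical):

```python
# Embedding parameter counts, ignoring the transformation net of size C.
N, d = 1_000_000, 300        # vocabulary and symbol-embedding sizes (hypothetical)
K, D, d_c = 32, 4, 300       # KD configuration with code-embedding size d_c

onehot_params = N * d        # O(N d): one d-vector per symbol
kd_params = K * D * d_c      # O(K D d_c): K vectors per code dimension

assert onehot_params == 300_000_000
assert kd_params == 38_400   # several thousand times fewer, before adding C
```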
2.3 Discrete Code Learning
The code assignment is important for both parameterization efficiency and generalization, so we want to learn the code allocation function end-to-end from data, in contrast to the hand-coded "one-hot" encoding. In this work, we assume that pretrained embedding vectors v_s ∈ R^d are given for each symbol s, and we learn the discrete codes based on them. Once the codes are learned, we relearn the code embedding parameters, including the transformation function, for the specific task. In the future, we will extend this to the case where such embeddings are not available.
To find the optimal codes, we minimize the squared loss between the given embedding vector and the embedding vector generated from the KD code. This yields the following objective:

argmin Σ_{s ∈ S} || v_s − f(φ(s)) ||²   (1)

where f is a differentiable transformation function (composed with the code embedding lookup) as introduced above.
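Objective (1) can be sketched concretely as follows; the toy sizes, the random code assignment, and the use of the linear transformation as f are illustrative assumptions:

```python
import numpy as np

K, D, d_c, d, N = 4, 3, 5, 8, 20   # hypothetical toy sizes
rng = np.random.default_rng(0)

V = rng.standard_normal((N, d))          # given pretrained symbol embeddings
codes = rng.integers(0, K, size=(N, D))  # current discrete code assignment
W = rng.standard_normal((D, K, d_c))     # code embedding matrices
H = rng.standard_normal((D * d_c, d))    # linear transformation f

def kd_embed(code):
    # Embedding generated from a KD code: per-dimension lookup, then f.
    return np.concatenate([W[j, code[j]] for j in range(D)]) @ H

# Squared reconstruction loss of Eq. (1), summed over all symbols.
loss = sum(np.sum((V[i] - kd_embed(codes[i])) ** 2) for i in range(N))
assert loss > 0 and np.isfinite(loss)
```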
Since each code is discrete, it cannot be directly optimized via stochastic gradient descent like the other parameters, so we need a relaxation in order to learn it effectively via SGD. We observe that each code can be seen as a concatenation of D "one-hot" vectors, i.e. c = (o_1, ..., o_D), where o_j ∈ {0, 1}^K and Σ_k o_j^{(k)} = 1, with o_j^{(k)} being the k-th component of o_j. We can adjust o_j in order to update the code, but it is still non-differentiable. To address this, we relax each o_j from a "one-hot" vector to a continuous vector by applying a tempering softmax:

o_j^{(k)} = exp(π_j^{(k)} / τ) / Σ_{k'} exp(π_j^{(k')} / τ)
where τ is a temperature term; as τ → 0, this approximation becomes exact (except in the case of ties). Similar techniques have been applied in Gumbel-Softmax Jang et al. (2016); Maddison et al. (2016). We show the effects of the temperature in Figure 2.
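A minimal implementation of the tempering softmax, showing that a low temperature yields a nearly one-hot vector while a high temperature yields a nearly uniform one (the logits are arbitrary):

```python
import numpy as np

def tempered_softmax(logits, tau):
    z = logits / tau
    z = z - z.max()                   # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

pi = np.array([1.0, 3.0, 2.0])        # code logits for one dimension, K = 3
hot = tempered_softmax(pi, tau=0.01)  # low temperature: nearly one-hot
warm = tempered_softmax(pi, tau=20.0) # high temperature: nearly uniform

assert hot.argmax() == 1 and hot[1] > 0.999
assert abs(warm.max() - warm.min()) < 0.1
```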
To learn the relaxed code logits π, we gradually decrease the temperature τ during training. When τ is not small enough, o_j is still a smooth vector, so we use the linear combination o_j^T W^j, instead of indexing W^j by argmax_k o_j^{(k)}, to generate the embedding vector for the j-th code dimension.

Note that the tempering softmax approximation is only usefully differentiable when τ is not too small: the gradient vanishes as τ → 0. So at the beginning, when τ is not small enough, we are actually learning continuous codes rather than discrete codes, which may not be desirable; and once τ becomes small enough that we start to learn truly discrete codes, the small τ in turn prevents the codes from further updates, as it makes the gradient vanish.
To address this issue, we take inspiration from the Straight-Through Estimator Bengio et al. (2013). In the forward pass, instead of using the tempering softmax output, which is likely a smooth continuous vector, we take its maximum and turn it into a "one-hot" vector,

ô_j = one_hot(argmax_k o_j^{(k)}),

which corresponds to an exactly discrete code. Using the straight-through estimator is equivalent to using different temperatures in the forward and backward passes: in the forward pass, τ → 0 is used, for which we simply take the argmax; in the backward pass (to compute the gradient), we pretend that a larger τ was used. Although this is a biased gradient estimator, the sign of the gradient remains correct. Compared to using the same temperature in both passes, this always outputs a "one-hot" discrete code, and there is no vanishing-gradient problem as long as the backward temperature does not approach zero.
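The straight-through trick above amounts to a hard argmax in the forward pass and a tempered-softmax Jacobian in the backward pass; a self-contained numpy sketch, with an assumed backward temperature of 1:

```python
import numpy as np

def softmax(z):
    p = np.exp(z - z.max())
    return p / p.sum()

def ste_forward(logits):
    """Forward pass: exact discrete code (tau -> 0), i.e. argmax as one-hot."""
    hard = np.zeros_like(logits)
    hard[logits.argmax()] = 1.0
    return hard

def ste_backward(logits, grad_out, tau_back=1.0):
    """Backward pass: pretend a softmax with temperature tau_back was used."""
    p = softmax(logits / tau_back)
    jac = (np.diag(p) - np.outer(p, p)) / tau_back  # softmax Jacobian
    return jac @ grad_out

pi = np.array([1.0, 3.0, 2.0])
out = ste_forward(pi)
grad = ste_backward(pi, grad_out=np.array([0.0, 1.0, 0.0]))

assert out.tolist() == [0.0, 1.0, 0.0]   # output is always a discrete one-hot
assert np.abs(grad).sum() > 0            # gradient does not vanish
```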
The training procedure is summarized in Algorithm 1, in which the stop_gradient operator will prevent the gradient from backpropagating through it.
3 Experiments
In this section we present both real and synthetic experiments to validate the proposed approach. The first set of experiments is based on language modeling, a fundamental task in NLP that can be formulated as predicting the probability of a sequence of words. Models based on recurrent neural networks with word embeddings Mikolov et al. (2010); Kim et al. (2016) achieve state-of-the-art results, so we base our experiments on them. We use the widely adopted English Penn Treebank Marcus et al. (1993) dataset, which contains 1M words with a vocabulary size of 10K; the training/validation/test split follows the convention of Mikolov et al. (2010). We utilize a standard LSTM Hochreiter and Schmidhuber (1997) with two different model sizes, which trade off model size and accuracy. The larger model has word embedding and LSTM hidden sizes of 1500; the corresponding number for the smaller model is 200. A temperature schedule that decreases τ as a function of the iteration number is used to train the code. We first train the model regularly using the conventional embedding approach to obtain the embedding vectors, which are then used to learn the discrete codes. Once the discrete codes are obtained and fixed, we retrain the model, with the same architecture and hyperparameters, for the code embeddings from scratch.
Table 1 shows the performance of the conventional "one-hot" word embedding against the proposed KD encoding. We present several variants of the KD encoding scheme, distinguished by the combination of (1) the discrete code learning model and (2) the symbol embedding relearning/retraining model. For discrete code learning, we have three cases: random assignment, codes learned with a linear transformation, and codes learned with an LSTM transformation function; the latter two can also be used as the symbol embedding relearning model. First, we observe that discrete code learning is critical for KD encoding, as random discrete codes produce much worse performance. Second, we observe that with appropriate code learning, the test perplexity is similar to or better than in the "one-hot" encoding case, while saving 82%–97% of embedding parameters.


Table 1: Test perplexity (PPL), embedding parameter size (E. Size), and compression rate (C. Rate) for the small and large models.

| Method (code learning + relearning) | Small PPL | E. Size | C. Rate | Large PPL | E. Size | C. Rate |
|---|---|---|---|---|---|---|
| Conventional | 114.53 | 2M | 1 | 84.04 | 15M | 1 |
| Random + Linear | 144.32 | 0.1M | 0.05 | 103.44 | 0.4M | 0.033 |
| Random + LSTM | 147.13 | 0.37M | 0.185 | 119.62 | 0.63M | 0.042 |
| Linear + Linear | 118.40 | 0.1M | 0.05 | 87.42 | 0.4M | 0.033 |
| Linear + LSTM | 111.13 | 0.37M | 0.185 | 88.82 | 0.63M | 0.042 |
| LSTM + Linear | 117.21 | 0.1M | 0.05 | 84.61 | 0.4M | 0.033 |
| LSTM + LSTM | 111.31 | 0.37M | 0.185 | 85.37 | 0.63M | 0.042 |
We also vary K and D to see how they affect performance. As shown in Figures 2(a) and 2(b), small K or D may harm performance (even when K^D ≥ N is satisfied), which suggests that a redundant code may be easier to learn.
To understand the effects of the temperature, and the importance of using discrete code output (i.e., with zero temperature), we create another set of experiments based on synthetic embedding clusters. We generate 10K points that belong to 100 well separated clusters in 10-dimensional space, and use D = 1 with K = 100, which mimics the K-means clustering problem since each code represents a cluster assignment. Both the squared loss and the clustering NMI are shown in Figures 2(c) and 2(d). We observe that the STE with temperature scheduling is much more effective than its counterparts. When the temperature is kept constant, some percentage of codes keeps changing, and the loss as well as the NMI converge to a worse local optimum. When smooth continuous codes are used instead of discrete codes, we observe that the loss first decreases and then increases; this is because its behavior mimics the discrete code output only when the temperature is small enough.

To further inspect the learned codes, we use pretrained GloVe embeddings Pennington et al. (2014), which have better coverage and quality than embeddings pretrained on PTB language modeling. We intentionally use K = 6 and D = 4 (a code space of 1296) for a vocabulary of 10K, so that the model is forced to collide words. Table 2 shows codes learned from the GloVe vectors, demonstrating that similar discrete codes are learned for semantically similar words.


Table 2: Examples of learned codes and the words assigned to them.

| Code | Words |
|---|---|
| 3103 | up when over into time back off set left open half behind quickly starts |
| 3104 | week tuesday wednesday monday thursday friday sunday saturday |
| 3105 | by were after before while past ago close soon recently continued meanwhile |
| 3111 | year month months record fall annual target cuts |

4 Related Work
The idea of using a more efficient coding system dates back to information theory, e.g. error-correcting codes Hamming (1950) and Huffman codes Huffman (1952). However, most embedding techniques, such as word embedding Mikolov et al. (2013); Pennington et al. (2014) and entity embedding Chen et al. (2016), use "one-hot" encoding along with a usually large embedding matrix. Recent work Kim et al. (2016); Sennrich et al. (2015); Zhang et al. (2015) explores character- or subword-based embedding models instead of word embedding models, with good results. However, in those cases the characters or subwords are fixed and given a priori by the language; they may thus carry little semantic meaning and are not available for other types of data. In contrast, we learn the code assignment function from data, and use a fixed length for the code.
The compression of neural networks Han et al. (2015a, b); Chen et al. (2015) has become an important topic, as the number of parameters can be too large and become a bottleneck for deploying models to mobile devices. Our work can also be seen as a way to compress the embedding layer of a neural network. Most existing network compression techniques focus on layers that are shared by all examples, whereas in our setting only one or a few symbols' embeddings are accessed in the embedding layer at a time.
5 Conclusions and Future Work
In this paper, we propose a novel K-way D-dimensional discrete encoding scheme to replace the "one-hot" encoding. The new coding system significantly improves the efficiency of the parameterization, and the reduction in parameters can also mitigate overfitting. To learn semantically meaningful codes, we derive a relaxed discrete optimization technique based on SGD. In our language modeling experiments, the number of free embedding parameters is reduced by 97% while achieving similar or better performance. We are currently working on learning the KD codes on the fly along with a given task, where symbol embeddings are not available beforehand.
References
 Bengio et al. [2013] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
 Bordes et al. [2013] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795, 2013.

 Chen et al. [2016] Ting Chen, Lu-An Tang, Yizhou Sun, Zhengzhang Chen, and Kai Zhang. Entity embedding-based anomaly detection for heterogeneous categorical events. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 1396–1403. AAAI Press, 2016.
 Chen et al. [2015] Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, and Yixin Chen. Compressing neural networks with the hashing trick. In International Conference on Machine Learning, pages 2285–2294, 2015.
 Hamming [1950] Richard W Hamming. Error detecting and error correcting codes. Bell Labs Technical Journal, 29(2):147–160, 1950.
 Han et al. [2015a] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015a.
 Han et al. [2015b] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015b.
 Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 Huffman [1952] David A Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9):1098–1101, 1952.
 Jang et al. [2016] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.
 Kim et al. [2016] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. Character-aware neural language models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 2741–2749. AAAI Press, 2016.
 Li et al. [2016] Xiang Li, Tao Qin, Jian Yang, Xiaolin Hu, and Tie-Yan Liu. LightRNN: Memory and computation-efficient recurrent neural networks. In Advances in Neural Information Processing Systems, pages 4385–4393, 2016.
 Maddison et al. [2016] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
 Marcus et al. [1993] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of english: The penn treebank. Computational linguistics, 19(2):313–330, 1993.
 Mikolov et al. [2010] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černockỳ, and Sanjeev Khudanpur. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010.
 Mikolov et al. [2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.

 Pennington et al. [2014] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543, 2014.
 Sennrich et al. [2015] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
 Shu and Nakayama [2017] Raphael Shu and Hideki Nakayama. Compressing word embeddings via deep compositional code learning. arXiv preprint arXiv:1711.01068, 2017.
 Zhang et al. [2015] Xiang Zhang, Junbo Zhao, and Yann LeCun. Characterlevel convolutional networks for text classification. In Advances in neural information processing systems, pages 649–657, 2015.