1 Introduction
Deep neural networks (DNNs) used in natural language processing (NLP) typically employ large embedding layers, which map input words into continuous representations and usually take the form of lookup tables. Despite this simplicity, and arguably because of it, the resulting models are cumbersome, which may cause problems both in training them and in deploying them in resource-limited settings. Thus, the compression of large neural networks and the development of novel lightweight architectures have become essential problems in NLP research.
One way to reduce the number of parameters in the trained model is to impose a specific structure on its weight matrices (e.g., assume that they are low-rank or can be well approximated by low-rank tensor networks). Such approaches are successful at compressing pretrained models, but they do not facilitate the training itself. Furthermore, they usually add to the overall training time by requiring an additional fine-tuning phase, since the compression algorithms usually optimize objective functions different from the task loss.
In this paper, we introduce a new, parameter-efficient embedding layer, termed TT–embedding, which can be plugged into any model and trained end-to-end. The benefits of our compressed TT–layer are twofold. Firstly, instead of storing a huge rectangular embedding matrix, we store a sequence of much smaller 2-dimensional and 3-dimensional tensors from which the required embeddings can be reconstructed, which allows compressing the model significantly at the cost of a negligible performance drop. Secondly, the number of model parameters can be kept relatively small (and constant) during the whole training stage, which makes it possible to use larger batches and to train efficiently under limited resources.
To validate the efficiency of the proposed approach, we have tested it on a variety of popular NLP tasks, namely sentiment analysis, neural machine translation, and language modeling. In our computational experiments, we have observed that in the majority of tasks, the standard embeddings can be replaced by TT–embeddings with large compression ratios without any significant drop (and sometimes even with a slight gain) in the metric of interest. Specifically, we report substantial compression ratios of the embedding layers: on the IMDB dataset with an absolute increase in classification accuracy, on the WMT En–De dataset with a small drop in the BLEU score, and on the WikiText–103 dataset with a small drop in perplexity.
Additionally, we have evaluated our algorithm on a task of binary classification with a large number of categorical features. More concretely, we applied TT–embedding to the click-through rate (CTR) prediction problem, a crucial task in the field of digital advertising. Neural networks typically used for solving this problem, while being rather elementary, include a large number of sizable embedding layers. As a result, the parameters representing these layers make up the majority of the model and may occupy hundreds of gigabytes of space. We show that TT–embedding not only considerably reduces the number of parameters in such models, but sometimes also improves their accuracy.
2 Related work
A number of prior works have explored different methods for compressing DNNs. Sainath et al. (2013), Xue et al. (2013), and Yu et al. (2017b) proposed to replace weight matrices in fully-connected layers with their low-rank approximations obtained via truncated SVD. Jaderberg et al. (2014) showed that using rank-1 decompositions of convolutional filters in the spatial domain leads to significant compression and speed-up at inference. Kim et al. (2015) and Howard et al. (2017) developed low-rank structural approximations with automatic selection of hyperparameters (e.g., ranks) for the specific purpose of deploying large multilayer neural networks on mobile devices. Other methods for DNN compression include, but are not limited to, pruning (Han et al., 2015b), quantization (Hubara et al., 2017; Xu et al., 2018), and their combination with Huffman coding (Han et al., 2015a).

In recent years, a large body of research has been devoted to compressing and speeding up various components of neural networks used in NLP tasks. Joulin et al. (2016) adapted the framework of product quantization to reduce the number of parameters in linear models used for text classification. See et al. (2016) proposed to compress LSTM-based neural machine translation models with pruning algorithms. Lobacheva et al. (2017) showed that recurrent models can be significantly sparsified with the help of variational dropout (Kingma et al., 2015). Chen et al. (2018b) proposed a more compact K-way D-dimensional discrete encoding scheme to replace the "one-hot" encoding of categorical features, such as words in NLP tasks. Very recently, Chen et al. (2018a) and Variani et al. (2018) introduced GroupReduce and WEST, two very efficient compression methods for the embedding and softmax layers, based on structured low-rank matrix approximation. Concurrently, Lam (2018) proposed a quantization algorithm for compressing word vectors and showed the superiority of the obtained embeddings on word similarity, word analogy, and question answering tasks.

Tensor methods have also been successfully applied to neural network compression. Novikov et al. (2015) coined the idea of reshaping the weights of fully-connected layers into high-dimensional tensors and representing them in the Tensor Train (TT) format (Oseledets, 2011). This approach was later extended to convolutional (Garipov et al., 2016) and recurrent (Yang et al., 2017; Tjandra et al., 2017; Yu et al., 2017a) neural networks. Furthermore, Lebedev et al. (2014) showed that convolutional layers can also be compressed with the canonical (CP) tensor decomposition (Carroll & Chang, 1970; Harshman, 1970). While all these methods reduce the number of parameters in the networks dramatically, they mostly capitalize on heavy fully-connected and convolutional layers (present in AlexNet (Krizhevsky et al., 2012) or VGG (Simonyan & Zisserman, 2014) networks), which became outdated in the following years. In this work, we show the benefits of applying tensor machinery to the compression of embedding layers, which are still widely used in NLP.
3 Tensor Train embedding
In this section, we briefly introduce the necessary notation and present the algorithm for constructing and training the TT–embedding layer. Hereinafter, by a $d$-way tensor we mean a multidimensional array
$$\mathcal{X} \in \mathbb{R}^{n_1 \times n_2 \times \dots \times n_d},$$
with entries $\mathcal{X}(i_1, \dots, i_d)$ such that $0 \le i_k < n_k$.
3.1 Motivation
Since the embedding layers account for most of the parameters in NLP models, we can greatly compress the entire model by compressing these layers, which is the problem we attack in this work. Our goal is to replace the standard embedding layer, specified by an embedding matrix $E \in \mathbb{R}^{I \times J}$ for a vocabulary of size $I$ and an embedding dimension $J$, with a more compact, yet powerful and trainable, representation which would allow us to efficiently map words into vectors.
The simplest approach to compactly represent a matrix of large size is the low-rank matrix factorization, which treats the matrix $E \in \mathbb{R}^{I \times J}$ as a product of two matrices, $E = UV^{\top}$. Here, $U \in \mathbb{R}^{I \times R}$ and $V \in \mathbb{R}^{J \times R}$ are much "thinner" matrices, and $R$ is the rank hyperparameter. Note that rather than training the model with the standard embedding layer and then trying to compress the obtained embedding, we can seek the embedding matrix in the described low-rank format from the start. Then, for evaluation and training, the individual word embedding $E(i,:)$ can be computed as the product $U(i,:)V^{\top}$, which does not require materializing the full matrix $E$. This approach reduces the number of degrees of freedom in the embedding layer from $IJ$ to $(I+J)R$.

However, in NLP tasks the embedding dimension $J$ is typically much smaller than the vocabulary size $I$, and obtaining a significant compression ratio with low-rank matrix factorization is problematic. In order to preserve the model performance, the rank $R$ cannot be taken very small, and the compression ratio is bounded by $\frac{IJ}{(I+J)R} \approx \frac{J}{R}$, which is close to $1$ for the usually full-rank embedding matrix. To overcome this bound and achieve a significant compression ratio even for matrices of disproportionate dimensions, we reshape them into multidimensional tensors and apply the Tensor Train decomposition, which allows for a more compact representation in which the number of parameters scales logarithmically with $I$.
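To make the bound concrete, here is a minimal numeric sketch; the sizes `I`, `J`, and `R` are hypothetical values chosen for illustration, not the paper's settings.

```python
# Hypothetical sizes: vocabulary I = 50,000, embedding J = 512, rank R = 256.
I, J, R = 50_000, 512, 256

full = I * J             # parameters of the dense embedding matrix
lowrank = (I + J) * R    # parameters of the rank-R factorization U V^T

# For I >> J the achievable compression is capped near J / R.
print(full / lowrank)    # ~1.98, versus the bound J / R = 2.0
```

Even halving the rank only doubles the ratio, so the matrix-factorization route cannot get past a factor of roughly $J/R$.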
3.2 Tensor Train decomposition
A tensor $\mathcal{X}$ is said to be represented in the Tensor Train (TT) format (Oseledets, 2011) if each element of $\mathcal{X}$ can be computed as
$$\mathcal{X}(i_1, i_2, \dots, i_d) = \sum_{\alpha_1=1}^{r_1} \sum_{\alpha_2=1}^{r_2} \dots \sum_{\alpha_{d-1}=1}^{r_{d-1}} \mathcal{G}_1(i_1, \alpha_1)\, \mathcal{G}_2(\alpha_1, i_2, \alpha_2) \dots \mathcal{G}_d(\alpha_{d-1}, i_d),$$
where the tensors $\mathcal{G}_k \in \mathbb{R}^{r_{k-1} \times n_k \times r_k}$ are the so-called TT–cores and $r_0 = r_d = 1$ by definition. The minimal values of $\{r_k\}_{k=1}^{d-1}$ for which the TT–decomposition exists are called TT–ranks. Note that the element $\mathcal{X}(i_1, i_2, \dots, i_d)$ is effectively just a product of vectors and matrices:
$$\mathcal{X}(i_1, i_2, \dots, i_d) = \mathcal{G}_1[i_1]\, \mathcal{G}_2[i_2] \dots \mathcal{G}_d[i_d],$$
where $\mathcal{G}_k[i_k]$ stands for the slice (a subset of a tensor with some indices fixed) of the corresponding TT–core $\mathcal{G}_k$.

The number of degrees of freedom in such a decomposition is $\sum_{k=1}^{d} r_{k-1} n_k r_k$. Thus, in the case of small ranks, the total number of parameters required to store a tensor in the TT–representation is significantly smaller than the $\prod_{k=1}^{d} n_k$ parameters required to store the full tensor of the corresponding size. This observation makes the application of the TT–decomposition appealing in many problems dealing with extremely large tensors.

The TT–decomposition exists for any tensor (although it is not unique); however, compressing a tensor by a significant factor is only possible up to some relative error. To make use of this significant parameter reduction for tensors of low TT–ranks, in many practical problems it is common to seek a solution in the TT–format explicitly (i.e., via the TT–cores only, without ever forming the full tensor), since this allows performing many operations with low complexity with respect to the hyperparameters of the decomposition. These operations include computing slices and performing basic linear operations on tensors.
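As a sketch of the formulas above, the following minimal NumPy snippet builds random TT–cores for a tiny tensor, evaluates one entry as a product of core slices, and checks it against the fully materialized tensor. The shapes and ranks are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small 3-way tensor of shape (4, 5, 6) in TT-format with TT-ranks (1, 2, 3, 1).
shape, ranks = (4, 5, 6), (1, 2, 3, 1)
cores = [rng.standard_normal((ranks[k], shape[k], ranks[k + 1]))
         for k in range(len(shape))]

def tt_element(cores, index):
    """Evaluate one tensor entry as the product of core slices G_k[i_k]."""
    res = np.eye(1)
    for core, i in zip(cores, index):
        res = res @ core[:, i, :]   # each slice is an r_{k-1} x r_k matrix
    return res.item()               # the final product is a 1 x 1 matrix

# Sanity check against the fully materialized tensor (feasible only for tiny shapes).
full = np.einsum('aib,bjc,ckd->ijk', *cores)
assert np.isclose(tt_element(cores, (2, 3, 4)), full[2, 3, 4])
```

Here the cores hold 1·4·2 + 2·5·3 + 3·6·1 = 56 numbers, versus 120 entries in the full tensor; the gap widens rapidly as the mode sizes grow.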
3.3 TT–matrix
Let $X$ be a matrix of size $I \times J$. Given two arbitrary factorizations of its dimensions into natural numbers, $I = \prod_{k=1}^{N} I_k$ and $J = \prod_{k=1}^{N} J_k$, we can reshape^{1} and transpose this matrix into an $N$-way tensor and then apply the TT–decomposition to it, resulting in a more compact representation.

^{1} By reshape we mean a column-major reshape command, implemented, for example, as numpy.reshape in Python.

More concretely, define the bijections $\mathcal{I}(i) = (i_1, \dots, i_N)$ and $\mathcal{J}(j) = (j_1, \dots, j_N)$ that map the row and column indices $i$ and $j$ of the matrix $X$ to the $N$-dimensional vector indices, such that $0 \le i_k < I_k$ and $0 \le j_k < J_k$. From the matrix $X$ we can form an $N$-way tensor $\mathcal{X}$ whose $k$-th dimension is of length $I_k J_k$ and is indexed by the tuple $(i_k, j_k)$. This tensor is then represented in the TT–format:
$$\mathcal{X}\big((i_1, j_1), \dots, (i_N, j_N)\big) = \mathcal{G}_1[(i_1, j_1)]\, \mathcal{G}_2[(i_2, j_2)] \dots \mathcal{G}_N[(i_N, j_N)]. \qquad (1)$$

Such a representation of the matrix in the TT–format is called a TT–matrix (Oseledets, 2010; Novikov et al., 2015) and is also known as a Matrix Product Operator (MPO) (Pirvu et al., 2010) in the physics literature. The factorizations $(I_1, \dots, I_N)$ and $(J_1, \dots, J_N)$ will be referred to as the shape of the TT–matrix, or TT–shapes. The process of constructing the TT–matrix from the standard matrix is visualized in Figure 1. Note that in this case the TT–cores $\mathcal{G}_k \in \mathbb{R}^{r_{k-1} \times I_k \times J_k \times r_k}$ are in fact 4th-order tensors, but all the operations defined for tensors in the TT–format are naturally extended to TT–matrices.
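The reshape-and-transpose construction can be sketched on a tiny hypothetical example. Note that the paper's footnote specifies a column-major reshape, while this sketch uses NumPy's default row-major ordering, which yields an equivalent interleaving of modes.

```python
import numpy as np

# Hypothetical TT-shapes: I = 4*6 = 24 rows, J = 3*5 = 15 columns, so the
# resulting 2-way tensor has modes of lengths I_1*J_1 = 12 and I_2*J_2 = 30.
I_f, J_f = [4, 6], [3, 5]
I, J = int(np.prod(I_f)), int(np.prod(J_f))
X = np.arange(I * J, dtype=float).reshape(I, J)

def to_multi(flat, factors):
    """Mixed-radix bijection from a flat index to a vector index (row-major)."""
    out = []
    for f in reversed(factors):
        out.append(flat % f)
        flat //= f
    return tuple(reversed(out))

# Reshape the matrix into (I_1, I_2, J_1, J_2), interleave row and column
# modes, then merge each pair (i_k, j_k) into one composite mode.
T = (X.reshape(I_f + J_f)
      .transpose(0, 2, 1, 3)                        # -> (I_1, J_1, I_2, J_2)
      .reshape(I_f[0] * J_f[0], I_f[1] * J_f[1]))

i, j = 13, 7
(i1, i2), (j1, j2) = to_multi(i, I_f), to_multi(j, J_f)
assert T[i1 * J_f[0] + j1, i2 * J_f[1] + j2] == X[i, j]
```

The TT–decomposition is then applied to `T`; the composite indexing is exactly the $(i_k, j_k)$ tuple indexing of Eq. (1).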
3.4 TT–embedding
By TT–embedding, we call a layer with trainable parameters (TT–cores) represented as a TT–matrix $\mathcal{E}$ of the underlying tensor shape $(I_1 J_1) \times (I_2 J_2) \times \dots \times (I_N J_N)$, which can be transformed into a valid embedding layer $E \in \mathbb{R}^{I \times J}$, with $I = \prod_{k=1}^{N} I_k$ and $J = \prod_{k=1}^{N} J_k$. To specify the shapes of the TT–cores, one also has to provide the TT–ranks, which are treated as hyperparameters of the layer and explicitly define the total compression ratio.

In order to compute the embedding for a particular word indexed $i$ in the vocabulary, we first map the row index $i$ into the $N$-dimensional vector index $(i_1, \dots, i_N)$, and then calculate the components of the embedding with formula (1). Note that the computation of all its components is equivalent to selecting particular slices in the TT–cores (slices of shapes $J_1 \times r_1$ in $\mathcal{G}_1$, $r_1 \times J_2 \times r_2$ in $\mathcal{G}_2$, and so on) and performing a sequence of matrix multiplications, which is executed efficiently in modern linear algebra packages, such as cuBLAS. The procedure of computing the mapping $i \to (i_1, \dots, i_N)$ is given by Algorithm 1.
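A minimal sketch of this lookup procedure follows; the TT–shapes and ranks are made up for illustration, and the mixed-radix decomposition plays the role of the index mapping of Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical TT-embedding: vocabulary I = 6*5*4 = 120, embedding J = 2*2*2 = 8,
# TT-ranks (1, 3, 3, 1). Core k has shape (r_{k-1}, I_k, J_k, r_k).
I_f, J_f, r = [6, 5, 4], [2, 2, 2], [1, 3, 3, 1]
cores = [rng.standard_normal((r[k], I_f[k], J_f[k], r[k + 1])) * 0.3
         for k in range(3)]

def embed(word_idx):
    """Compute one embedding row: map the flat row index to a vector index,
    select the corresponding core slices, and multiply them out."""
    # mixed-radix decomposition of the row index (row-major convention)
    idx, rem = [], word_idx
    for f in reversed(I_f):
        idx.append(rem % f)
        rem //= f
    idx.reverse()
    # res accumulates a (J_1 * ... * J_k, r_k) matrix as cores are absorbed
    res = np.eye(1)
    for core, i in zip(cores, idx):
        slice_k = core[:, i, :, :]                    # shape (r_{k-1}, J_k, r_k)
        res = np.einsum('ab,bjc->ajc', res, slice_k)  # contract the rank index
        res = res.reshape(-1, slice_k.shape[-1])
    return res.ravel()                                # length J = 2*2*2 = 8

vec = embed(57)
assert vec.shape == (8,)
```

In practice, the chain of small matrix multiplications is batched over word indices and dispatched to a BLAS backend, as noted above.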
In order to construct a TT–embedding layer for a vocabulary of size $I$ and embedding dimension $J$, and to train a model with such a layer, one has to perform the following steps.

1. Provide factorizations of $I$ and $J$ into factors $I = I_1 \times I_2 \times \dots \times I_N$ and $J = J_1 \times J_2 \times \dots \times J_N$, and specify the set of TT–ranks $\{r_1, r_2, \dots, r_{N-1}\}$.

2. Initialize the set of parameters of the embedding $\Theta = \{\mathcal{G}_k\}_{k=1}^{N}$. Concrete initialization scenarios are discussed further in the text.

3. During training, given a batch of indices, compute the corresponding embeddings using Eq. (1) and Algorithm 1.

4. Computed embeddings can be followed by any standard layer such as LSTM (Hochreiter & Schmidhuber, 1997) or self-attention (Vaswani et al., 2017), and trained with backpropagation, since they differentiably depend on the parameters $\Theta$.
TT–embedding implies a specific structure on the order of tokens in the vocabulary (the order of rows in the embedding matrix), and determining the optimal order is an appealing problem to solve. However, we leave this problem for future work and use the order produced by the standard tokenizer (sorted by frequency) in our current experiments.
Initialization
The standard way to initialize an embedding matrix is via, e.g., the Glorot initializer (Glorot & Bengio, 2010), which samples each element from a zero-mean distribution whose variance $\sigma^2$ is scaled by the layer dimensions. For the TT–embedding, we can only initialize the TT–cores, and the distribution of the elements of the resulting matrix is rather non-trivial. However, it is easy to verify that if we initialize each TT–core element from a zero-mean Gaussian, the resulting matrix elements also have zero mean, with a variance determined by the core variance, the TT–ranks, and the number of cores. Capitalizing on this observation, in order to obtain the desired variance $\sigma^2$ while keeping the mean zero, we can simply initialize each element of the TT–cores as
$$\mathcal{G}_k(\cdot) \sim \mathcal{N}\!\left(0, \left(\frac{\sigma^2}{\prod_{m=1}^{N-1} r_m}\right)^{1/N}\right). \qquad (2)$$

The resulting distribution is not Gaussian; however, it approaches the Gaussian distribution as the TT–rank increases (Figure 2). In our experiments, we have used the modified Glorot initializer implemented by formula (2), which greatly improved performance compared with initializing the TT–cores simply via a standard normal distribution. It is also possible to initialize a TT–embedding layer by converting a learned embedding matrix into the TT–format using the standard TT–SVD algorithm (Oseledets, 2011); however, this approach requires a pretrained embedding matrix and does not exhibit better performance in practice.

Hyperparameter selection
Our embedding layer introduces two additional structure-specific hyperparameters: the TT–shapes and the TT–ranks.
TT–embedding does not require the vocabulary size $I$ to be represented exactly as a product of the chosen factors; in fact, any factorization whose product is at least $I$ will suffice, since the extra rows are simply never indexed. However, in order to achieve the highest possible compression ratio for a fixed value of $I$, the factors should be as close to each other as possible. Our implementation includes a simple automated procedure for selecting good values of the factors during TT–embedding initialization. The factors of $J$ are defined by the embedding dimensionality, which can be chosen to admit a good factorization (e.g., a power of two).
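A simple heuristic for such an automated procedure can be sketched as follows; this greedy merging of prime factors is an illustration, not necessarily the exact procedure of the released implementation.

```python
def prime_factors(n):
    """Prime factorization of n, largest factors first."""
    fs, p = [], 2
    while p * p <= n:
        while n % p == 0:
            fs.append(p)
            n //= p
        p += 1
    if n > 1:
        fs.append(n)
    return sorted(fs, reverse=True)

def suggest_shapes(n, num_factors):
    """Greedily merge the prime factors of n into `num_factors` groups,
    always growing the currently smallest group, so the groups stay balanced."""
    groups = [1] * num_factors
    for p in prime_factors(n):
        groups[groups.index(min(groups))] *= p
    return sorted(groups, reverse=True)

print(suggest_shapes(60_000, 3))   # [50, 40, 30]
```

For vocabulary sizes with unhelpful factorizations, one would first round the size up to a nearby highly-composite number and apply the same procedure.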
The values of the TT–ranks directly define the compression ratio, so choosing them too small or too large will result in either a significant performance drop or little reduction of the number of parameters. In our experiments, we set all TT–ranks to a single common value, taking it smaller for problems with small vocabularies and larger for problems with larger vocabularies, which allowed us to achieve significant compression of the embedding layer at the cost of a tiny sacrifice in the metrics of interest.
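Given fixed TT–shapes, the resulting compression ratio follows directly from the core sizes; a small sketch with hypothetical shapes and ranks:

```python
from math import prod

def tt_matrix_params(I_factors, J_factors, ranks):
    """Parameter count of a TT-matrix with cores of shape (r_{k-1}, I_k, J_k, r_k)."""
    return sum(r0 * i * j * r1
               for r0, i, j, r1 in zip(ranks[:-1], I_factors, J_factors, ranks[1:]))

def compression_ratio(I_factors, J_factors, tt_rank):
    """Full embedding size divided by TT-embedding size, all TT-ranks equal."""
    n = len(I_factors)
    ranks = [1] + [tt_rank] * (n - 1) + [1]
    full = prod(I_factors) * prod(J_factors)
    return full / tt_matrix_params(I_factors, J_factors, ranks)

# Hypothetical shapes: vocabulary 50*40*30 = 60,000, embedding 8*8*8 = 512.
print(round(compression_ratio([50, 40, 30], [8, 8, 8], 16)))  # 333
```

Doubling the rank roughly quadruples the middle cores, so the ratio is quite sensitive to this hyperparameter.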
4 Experiments
Table 1
Dataset  Model  Embedding shape  Test acc.  Compr.

IMDB  Full  
TT1  
TT2  
TT3  
SST  Full  
TT1  
TT2  
TT3 
Table 2
Model  Embedding shape  TT–rank  Test BLEU  Compr.

Full  —  
TT1  
TT2  
TT3 
Code
We have implemented the TT–embeddings described in Section 3 in Python using PyTorch (Paszke et al., 2017). The code is available at the repository https://github.com/KhrulkovV/tt-pytorch.
Experimental setup
We tested our approach on several popular NLP tasks:

Sentiment analysis — as a starting point of our experiments, we test TT–embeddings on the rather simple task of predicting the polarity of a sentence.

Neural Machine Translation (NMT) — to verify the applicability of TT–embeddings to more practical problems, we test them on the more challenging task of translating from one language to another.

Language Modeling (LM) — finally, we evaluate TT–embeddings on language modeling tasks with extremely large vocabularies.

Moreover, since our approach is not limited to NLP but can be applied to any problem with categorical features, we have also performed the following experiment:

Click-Through Rate (CTR) prediction — we show that TT–embeddings can be successfully applied to the task of binary classification with numerous categorical features of significant cardinality.

To demonstrate the generality and wide applicability of the proposed approach, we tested it on various popular architectures: MLPs (CTR), LSTMs (sentiment analysis), and Transformers (LM, NMT).
4.1 Sentiment analysis
Sentiment analysis is a classification task in which one has to predict whether a sequence of tokens (usually words or sentences) carries positive or negative meaning. For this experiment, we have used the IMDB dataset (Maas et al., 2011) with two categories and the Stanford Sentiment Treebank (SST) with five categories. We have taken the most frequent words in each dataset, embedded them into a dense vector space using either a standard embedding layer or a TT–embedding layer, and performed classification using a standard bidirectional two-layer LSTM with dropout. We trained the model for various embedding dimensionalities, TT–ranks, and TT–shapes (for the TT–embedding).
Our findings are summarized in Table 1. We observe that models with largely compressed embedding layers can perform on par with, or even better than, the full uncompressed models. For instance, in the case of TT3 on the IMDB dataset, the number of parameters in the embedding layer was drastically reduced while the test accuracy did not change significantly. This suggests that learning an individual independent embedding for each particular word is superfluous, as the expressive power of the LSTM is sufficiently large to make use of these intertwined, yet more compact, embeddings. Moreover, the slightly better test accuracy of the compressed models in certain cases (e.g., on the rather small SST dataset) suggests that imposing a specific tensorial low-rank structure on the embedding matrix can be viewed as a special form of regularization, thus potentially improving the generalization power of the model. A detailed and comprehensive test of this hypothesis goes beyond the scope of this paper, and we leave it for future work.
4.2 Neural Machine Translation
In the task of Neural Machine Translation, the goal is to map an input sequence of symbols representing a phrase in one language to an output sequence representing the same phrase in a different language. A typical architecture employed in this task is based on the encoder–decoder framework, which maps the input sequence into continuous representations and uses them to generate the output sequence, commonly also making use of an attention mechanism. The encoder–decoder framework serves as the foundation of most modern NMT models (Cho et al., 2014; Bahdanau et al., 2014; Sutskever et al., 2014; Vaswani et al., 2017).
For this experiment, we have trained the popular Transformer model (Vaswani et al., 2017) on the WMT English–German dataset, consisting of several million sentence pairs. For validation, we used the news-commentary-v11 dataset. Sentences were tokenized using the SentencePiece software (https://github.com/google/sentencepiece), resulting in a fixed-size subword vocabulary for each language. As the baseline implementation of the Transformer, we have used the 'base model' architecture from (Vaswani et al., 2017) as implemented in the OpenNMT-py library (https://github.com/OpenNMT/OpenNMT-py) (Klein et al., 2017), and for our experiments we have replaced each of its embedding layers with the corresponding TT–embedding. For evaluation, we used beam search with a length penalty.
Our results are summarized in Table 2. We observe that even in this rather challenging task, both embedding layers can be compressed significantly, at the cost of a small drop in the BLEU score. Compared to sentiment analysis, NMT is a much more complex task which benefits more from additional capacity (in the form of a more powerful RNN or more transformer blocks) than from regularization (Vaswani et al., 2017; Baevski & Auli, 2018), which may explain why we did not manage to improve the model by regularizing its embedding layers. Note, however, that for a fixed memory budget, TT–embeddings allow including more transformer blocks, which may lead to a more powerful model with the same number of parameters as the full model with a standard embedding layer.
Table 3
Model  Embedding shape  TT–rank  Train ppl  Test ppl  Compr.

Full  —  
TT1  
TT2  
Full–tied  —  — 
Table 4
Hashing  Model  Factorization  Test loss  Compr.  Model size
Full  —  Mb  
TT1  factors  Mb  
TT2  factors  Mb  
—  TT1  factors  Mb  
TT2  factors  Mb 
4.3 Language modeling
The task of language modeling is to estimate the joint probability $p(y_1, \dots, y_T)$ of a corpus of tokens $(y_1, \dots, y_T)$, where the tokens may be sentences, words, word pieces, or single characters. The resulting models can be used to generate text or can be further fine-tuned to solve other NLP tasks (Radford et al., 2018). In this paper, we employ the standard setting of predicting the next token given the sequence of preceding tokens, based on the factorization $p(y_1, \dots, y_T) = \prod_{t=1}^{T} p(y_t \mid y_1, \dots, y_{t-1})$. However, more complex scenarios can also be used, such as masking some words in the sentence and predicting them from the context, or predicting the next sentences from the previous ones (Devlin et al., 2018).

Specifically, we take Transformer-XL (Dai et al., 2018), an open-source (https://github.com/kimiyoung/transformer-xl) state-of-the-art language modeling architecture at the time of this writing, and replace its standard embedding layer with a TT–embedding. We then test different model configurations on the WikiText–103 (Merity et al., 2016) dataset and report the results in Table 3.
We compare the model with distinct softmax and embedding layers (Full), the original Transformer-XL model (Full–tied), which ties the softmax and embedding layers together as suggested in (Press & Wolf, 2016), and the models with TT–embeddings of different shapes. We see that the model with TT–embedding is superior to the full model, which learns the embedding and softmax layers separately and overfits strongly to the training data. The simple modification which uses the same weight matrix in the embedding and softmax layers (and can be seen as a form of regularization) performs much better. However, its larger gap between test and train perplexity suggests that it overfits more than the architecture with TT–embedding.
4.4 Click Through Rate prediction
Among other applications of the TT–embedding layer, we chose to focus on experiments in the field of click-through rate prediction, a popular task in digital advertising (He et al., 2014). In this paper, we consider the open dataset provided by Criteo for the Kaggle Display Advertising Challenge (Criteo Labs, 2014). This dataset consists of a number of categorical features over a large number of samples and is binary-labeled according to whether the user clicked on the given advertisement. The unique values of the categorical features are first bijectively mapped into integers. In order to reduce the amount of stored data, if the size of the corresponding vocabulary is immense (the cardinality of some features in this dataset is enormous), these integers are further hashed by taking the modulus with respect to some fixed number. However, due to the strong compression properties of TT–embeddings, this step is not necessary for our approach. In our experiments, we consider both the full and the hashed datasets.
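The hashing step is a plain modulus; a minimal sketch, with the bucket size and id range made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical raw ids for one huge-cardinality categorical feature.
raw_ids = rng.integers(0, 10**8, size=10)
bucket_size = 10**5                 # fixed modulus; a made-up value
hashed_ids = raw_ids % bucket_size  # distinct raw ids may now collide
print(hashed_ids.max() < bucket_size)  # prints True
```

The price of the reduced table is that unrelated categories sharing a bucket are forced onto the same embedding row, which TT–embeddings avoid.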
Table 5
Hashing  Model  Hidden size  Factorization  Test loss  Compr.  Model size

Full  —  Mb  
TT1  factors  Mb  
TT2  factors  Mb  
Full  —  Mb  
TT1  factors  Mb  
TT2  factors  Mb 
CTR with the baseline algorithm
The task at hand can be treated as a binary classification problem. As a baseline, we consider a neural network with the following architecture. First, each categorical feature is passed through its own embedding layer. The embedded features are then concatenated and passed through several fully-connected layers with ReLU activations. In all experiments, we use the Adam optimizer with a fixed learning rate. Since many input features have a very large number of unique values, and storing the corresponding embedding matrices would require an immense amount of memory, in this setting we employ the hashing procedure mentioned earlier.
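A forward pass of such a baseline can be sketched as follows; all sizes here are made up for illustration and are much smaller than in the actual experiments, and training (with Adam, as above) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only: 26 categorical features, a shared hashed vocabulary.
n_features, vocab, d_emb, hidden = 26, 1000, 16, 64
emb = [rng.standard_normal((vocab, d_emb)) * 0.01 for _ in range(n_features)]
W1 = rng.standard_normal((n_features * d_emb, hidden)) * 0.05
W2 = rng.standard_normal((hidden, 1)) * 0.05

def forward(batch_ids):
    """batch_ids: (batch, n_features) integer matrix of hashed category ids."""
    # look up one embedding per feature and concatenate them
    x = np.concatenate([emb[f][batch_ids[:, f]] for f in range(n_features)],
                       axis=1)
    h = np.maximum(x @ W1, 0.0)              # fully-connected layer + ReLU
    return 1.0 / (1.0 + np.exp(-(h @ W2)))   # sigmoid click probability

p = forward(rng.integers(0, vocab, size=(4, n_features)))
assert p.shape == (4, 1)
```

Replacing each `emb[f]` lookup table with a TT–matrix, as in the next subsection, leaves the rest of this computation untouched.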
CTR with TT–embeddings
As in the previous experiments, we substitute the embedding layers with TT–embedding layers. Apart from the embedding layers, we leave the overall structure of the neural network unchanged, with the same parameters as in the baseline approach. Throughout these experiments, we consider a set of different TT–ranks and various factorizations.
Table 4 presents the experimental results on the Criteo CTR dataset. We have fixed the embedding dimension and the TT–rank across these runs. To the best of our knowledge, our loss value is very close to the state-of-the-art result (Juan et al., 2016). These experiments indicate that substituting large embedding layers with TT–embeddings leads to significant compression ratios with a slight improvement in the test loss. If the hashing procedure is used, the dataset is already compressed, which is in line with the smaller compression achieved by the TT–embedding layers in that case. Nevertheless, the total size of the compressed model is a small fraction of the baseline model's size. The obtained compression ratios suggest that the usage of TT–embedding layers may be beneficial in CTR prediction tasks; however, rigorous evaluation on large industrial benchmarks would shed more light on this case.
Finally, to make the proposed method more practical, we performed experiments aiming to compress the model by a greater factor. Since such lightweight models are not as precise as larger ones, they can serve as preliminary prediction methods in industrial settings. We fixed the rank of the underlying TT–matrix and reduced the hidden size, leaving the remaining architecture untouched. The performance of these lightweight models is summarized in Table 5.
5 Discussion and future work
We propose a novel embedding layer, the TT–embedding, for compressing huge lookup tables used for encoding categorical features of significant cardinality, such as the index of a token in natural language processing tasks. The proposed approach, based on the TT–decomposition, experimentally proved to be effective, as it heavily decreases the number of training parameters at the cost of a small deterioration in performance. In addition, our method can be easily integrated into any deep learning framework and trained via backpropagation, while capitalizing on reduced memory requirements and increased training batch size.
Our experimental results suggest several appealing directions for future work. First of all, TT–embeddings impose a concrete tensorial low-rank structure on the embedding matrix, which was shown to improve the generalization ability of the networks, acting as a regularizer. The properties and conditions of applicability of this regularizer are subject to more rigorous analysis. Secondly, it is important to understand how the order of tokens in the vocabulary affects the properties of networks with TT–embeddings. We hypothesize that there exists an optimal order of tokens which better exploits the particular structure of the TT–embedding and leads to a boost in performance and/or compression ratio. Additionally, another interesting direction is to determine the optimal number of factors in the TT–shapes, as our extensive experiments demonstrate a slight dependence of the total accuracy on the number of factors. Finally, the idea of applying higher-order tensor decompositions to reduce the number of parameters in neural nets is complementary to more traditional methods such as pruning and quantization. Thus, it would be interesting to make a thorough comparison of all these methods and investigate whether their combination may lead to even stronger compression.
Acknowledgements
We would like to thank Andrzej Cichocki for constructive discussions during the preparation of the manuscript. This work was supported by the Ministry of Education and Science of the Russian Federation (grant 14.756.31.0001).
References
 Baevski & Auli (2018) Baevski, A. and Auli, M. Adaptive input representations for neural language modeling. arXiv preprint arXiv:1809.10853, 2018.
 Bahdanau et al. (2014) Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
 Carroll & Chang (1970) Carroll, J. D. and Chang, J.-J. Analysis of individual differences in multidimensional scaling via an n-way generalization of Eckart–Young decomposition. Psychometrika, 35(3):283–319, 1970.
 Chen et al. (2018a) Chen, P. H., Si, S., Li, Y., Chelba, C., and Hsieh, C.-J. GroupReduce: Block-wise low-rank approximation for neural language model shrinking. arXiv preprint arXiv:1806.06950, 2018a.
 Chen et al. (2018b) Chen, T., Min, M. R., and Sun, Y. Learning K-way D-dimensional discrete codes for compact embedding representations. arXiv preprint arXiv:1806.09464, 2018b.
 Cho et al. (2014) Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder–decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
 Criteo Labs (2014) Criteo Labs. Kaggle Display Advertising Challenge, 2014. URL https://www.kaggle.com/c/criteo-display-ad-challenge.
 Dai et al. (2018) Dai, Z., Yang, Z., Yang, Y., Cohen, W. W., Carbonell, J., Le, Q. V., and Salakhutdinov, R. Transformer-XL: Language modeling with longer-term dependency. arXiv preprint arXiv:1901.02860, 2018.
 Devlin et al. (2018) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
 Garipov et al. (2016) Garipov, T., Podoprikhin, D., Novikov, A., and Vetrov, D. Ultimate tensorization: compressing convolutional and FC layers alike. arXiv preprint arXiv:1611.03214, 2016.

 Glorot & Bengio (2010) Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.
 Han et al. (2015a) Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015a.
 Han et al. (2015b) Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143, 2015b.
 Harshman (1970) Harshman, R. A. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis. UCLA Working Papers in Phonetics, 1970.
 He et al. (2014) He, X., Pan, J., Jin, O., Xu, T., Liu, B., Xu, T., Shi, Y., Atallah, A., Herbrich, R., Bowers, S., et al. Practical lessons from predicting clicks on ads at facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, pp. 1–9. ACM, 2014.
 Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 Howard et al. (2017) Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 Hubara et al. (2017) Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. Quantized neural networks: Training neural networks with low precision weights and activations. Journal of Machine Learning Research, 18(187):1–30, 2017.
 Jaderberg et al. (2014) Jaderberg, M., Vedaldi, A., and Zisserman, A. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
 Joulin et al. (2016) Joulin, A., Grave, E., Bojanowski, P., Douze, M., Jégou, H., and Mikolov, T. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651, 2016.
 Juan et al. (2016) Juan, Y., Zhuang, Y., Chin, W.-S., and Lin, C.-J. Field-aware factorization machines for CTR prediction. In Proceedings of the 10th ACM Conference on Recommender Systems, pp. 43–50. ACM, 2016.
 Kim et al. (2015) Kim, Y.-D., Park, E., Yoo, S., Choi, T., Yang, L., and Shin, D. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530, 2015.
 Kingma et al. (2015) Kingma, D. P., Salimans, T., and Welling, M. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pp. 2575–2583, 2015.
 Klein et al. (2017) Klein, G., Kim, Y., Deng, Y., Senellart, J., and Rush, A. M. OpenNMT: Open-Source Toolkit for Neural Machine Translation. In Proc. ACL, 2017. doi: 10.18653/v1/P17-4012. URL https://doi.org/10.18653/v1/P17-4012.
 Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
 Lam (2018) Lam, M. Word2Bits – quantized word vectors. arXiv preprint arXiv:1803.05651, 2018.
 Lebedev et al. (2014) Lebedev, V., Ganin, Y., Rakhuba, M., Oseledets, I., and Lempitsky, V. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553, 2014.
 Lobacheva et al. (2017) Lobacheva, E., Chirkova, N., and Vetrov, D. Bayesian sparsification of recurrent neural networks. arXiv preprint arXiv:1708.00077, 2017.
 Maas et al. (2011) Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1015.
 Merity et al. (2016) Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
 Novikov et al. (2015) Novikov, A., Podoprikhin, D., Osokin, A., and Vetrov, D. P. Tensorizing neural networks. In Advances in Neural Information Processing Systems, pp. 442–450, 2015.
 Oseledets (2010) Oseledets, I. V. Approximation of matrices using tensor decomposition. SIAM Journal on Matrix Analysis and Applications, 31(4):2130–2145, 2010.
 Oseledets (2011) Oseledets, I. V. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011.
 Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. In NeurIPS 2017 Autodiff Workshop: The Future of Gradient-based Machine Learning Software and Techniques, 2017.
 Pirvu et al. (2010) Pirvu, B., Murg, V., Cirac, J. I., and Verstraete, F. Matrix product operator representations. New Journal of Physics, 12(2):025012, 2010.
 Press & Wolf (2016) Press, O. and Wolf, L. Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859, 2016.
 Radford et al. (2018) Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding by generative pre-training. 2018. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
 Sainath et al. (2013) Sainath, T. N., Kingsbury, B., Sindhwani, V., Arisoy, E., and Ramabhadran, B. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 6655–6659. IEEE, 2013.
 See et al. (2016) See, A., Luong, M.-T., and Manning, C. D. Compression of neural machine translation models via pruning. arXiv preprint arXiv:1606.09274, 2016.
 Simonyan & Zisserman (2014) Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 Sutskever et al. (2014) Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112, 2014.
 Tjandra et al. (2017) Tjandra, A., Sakti, S., and Nakamura, S. Compressing recurrent neural network with tensor train. arXiv preprint arXiv:1705.08052, 2017.
 Variani et al. (2018) Variani, E., Suresh, A. T., and Weintraub, M. WEST: Word Encoded Sequence Transducers. arXiv preprint arXiv:1811.08417, 2018.
 Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
 Xu et al. (2018) Xu, Y., Wang, Y., Zhou, A., Lin, W., and Xiong, H. Deep neural network compression with single and multiple level quantization. arXiv preprint arXiv:1803.03289, 2018.

 Xue et al. (2013) Xue, J., Li, J., and Gong, Y. Restructuring of deep neural network acoustic models with singular value decomposition. In Interspeech, pp. 2365–2369, 2013.
 Yang et al. (2017) Yang, Y., Krompass, D., and Tresp, V. Tensor-train recurrent neural networks for video classification. arXiv preprint arXiv:1707.01786, 2017.
 Yu et al. (2017a) Yu, R., Zheng, S., Anandkumar, A., and Yue, Y. Long-term forecasting using tensor-train RNNs. arXiv preprint arXiv:1711.00073, 2017a.

 Yu et al. (2017b) Yu, X., Liu, T., Wang, X., and Tao, D. On compressing deep models by low rank and sparse decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7370–7379, 2017b.