Tensorized Embedding Layers for Efficient Model Compression

01/30/2019 ∙ by Valentin Khrulkov, et al.

Embedding layers, which transform input words into real vectors, are key components of deep neural networks used in natural language processing. However, when the vocabulary is large (e.g., 800k unique words in the One-Billion-Word dataset), the corresponding weight matrices can be enormous, which precludes their deployment in a limited resource setting. We introduce a novel way of parametrizing embedding layers based on the Tensor Train (TT) decomposition, which allows compressing the model significantly at the cost of a negligible drop or even a slight gain in performance. Importantly, our method does not take a pre-trained model and compress its weights; rather, it supplants the standard embedding layers with their TT-based counterparts. The resulting model is then trained end-to-end and can capitalize on larger batches due to the reduced memory requirements. We evaluate our method on a wide range of benchmarks in sentiment analysis, neural machine translation, and language modeling, and analyze the trade-off between performance and compression ratios for a wide range of architectures, from MLPs to LSTMs and Transformers.




1 Introduction

Deep neural networks (DNNs) typically used in natural language processing (NLP) employ large embedding layers, which map the input words into continuous representations and usually have the form of lookup tables. Despite such simplicity (and, arguably, because of it), the resulting models are cumbersome, which may cause problems in training and deploying them in a limited resource setting. Thus, the compression of large neural networks and the development of novel lightweight architectures have become essential problems in NLP research.

One way to reduce the number of parameters in the trained model is to impose a specific structure on its weight matrices (e.g., assume that they are low-rank or can be well approximated by low-rank tensor networks). Such approaches are successful at compressing pre-trained models, but they do not facilitate the training itself. Furthermore, they usually add to the overall training time by requiring an additional fine-tuning phase, as the compression algorithms usually optimize different objective functions.

In this paper, we introduce a new, parameter-efficient embedding layer, termed TT–embedding, which can be plugged into any model and trained end-to-end. The benefits of our compressed TT–layer are twofold. Firstly, instead of storing a huge rectangular embedding matrix, we store a sequence of much smaller 2-dimensional and 3-dimensional tensors, necessary for reconstructing the required embeddings, which allows compressing the model significantly at the cost of a negligible performance drop. Secondly, the number of model parameters can be relatively small (and constant) during the whole training stage, which allows us to use larger batches and train efficiently when resources are limited.

To validate the efficiency of the proposed approach, we have tested it on a variety of popular NLP tasks, namely sentiment analysis, neural machine translation, and language modeling. In our computational experiments, we have observed that in a majority of tasks, the standard embeddings can be replaced by TT–embeddings with the compression ratio of or orders without any significant drop (and sometimes even with a slight gain) of the metric of interest. Specifically, we report the following compression ratios of the embedding layers: on the IMDB dataset with absolute increase in classification accuracy; on the WMT En–De dataset with drop in the BLEU score, and on the WikiText–103 dataset with drop in perplexity.

Additionally, we have evaluated our algorithm on a task of binary classification based on a large number of categorical features. More concretely, we applied TT–embedding to the click-through rate (CTR) prediction problem, a crucial task in the field of digital advertising. Neural networks typically used for solving this problem, while being rather elementary, include a large number of embedding layers of significant size. As a result, the majority of model parameters, which represent these layers, usually occupy hundreds of gigabytes of space. We show that TT–embedding not only considerably reduces the number of parameters in such models, but also sometimes improves their accuracy.

2 Related work

A number of prior works have explored different methods for compressing DNNs. (Sainath et al., 2013; Xue et al., 2013; Yu et al., 2017b) proposed to replace weight matrices in fully-connected layers with their low-rank approximations, obtained via truncated SVD. (Jaderberg et al., 2014) showed that using rank-1 decompositions of convolutional filters in the spatial domain led to significant compression and speed-up at inference. (Kim et al., 2015; Howard et al., 2017) developed low-rank structural approximation with automatic selection of hyperparameters (e.g., ranks) for the specific purpose of deploying large multilayer neural networks on mobile devices. Other methods for DNN compression include, but are not limited to, pruning (Han et al., 2015b), quantization (Hubara et al., 2017; Xu et al., 2018), or their combination with Huffman coding (Han et al., 2015a).

In recent years, a large body of research was devoted to compressing and speeding up various components of neural networks used in NLP tasks. (Joulin et al., 2016) adapted the framework of product quantization to reduce the number of parameters in linear models used for text classification. (See et al., 2016) proposed to compress LSTM-based neural machine translation models with pruning algorithms. (Lobacheva et al., 2017) showed that the recurrent models could be significantly sparsified with the help of variational dropout (Kingma et al., 2015). (Chen et al., 2018b) proposed a more compact K-way D-dimensional discrete encoding scheme to replace the “one-hot” encoding of categorical features, such as words in NLP tasks. Very recently, (Chen et al., 2018a) and (Variani et al., 2018) introduced GroupReduce and WEST, two very efficient compression methods for the embedding and softmax layers, based on structured low-rank matrix approximation. Concurrently, (Lam, 2018) proposed a quantization algorithm for compressing word vectors and showed the superiority of the obtained embeddings on word similarity, word analogy, and question answering tasks.

Tensor methods have also been successfully applied to neural network compression. (Novikov et al., 2015) coined the idea of reshaping weights of fully-connected layers into high-dimensional tensors and representing them in the Tensor Train (TT) (Oseledets, 2011) format. This approach was later extended to convolutional (Garipov et al., 2016) and recurrent (Yang et al., 2017; Tjandra et al., 2017; Yu et al., 2017a) neural networks. Furthermore, (Lebedev et al., 2014) showed that convolutional layers could also be compressed with the canonical (CP) tensor decomposition (Carroll & Chang, 1970; Harshman, 1970). While all these methods dramatically reduced the number of parameters in the networks, they mostly capitalized on heavy fully-connected and convolutional layers (present in AlexNet (Krizhevsky et al., 2012) or VGG (Simonyan & Zisserman, 2014) networks), which became outdated in the following years. In this work, we show the benefits of applying tensor machinery to the compression of embedding layers, which are still widely used in NLP.

3 Tensor Train embedding

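In this section, we briefly introduce the necessary notation and present the algorithm for constructing and training the TT–embedding layer. Hereinafter, by $N$-way tensor $\mathcal{X}$ we mean a multidimensional array:

$$\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \dots \times I_N},$$

with entries $\mathcal{X}(i_1, \dots, i_N)$ such that $\{0 \leq i_k < I_k\}_{k=1}^{N}$.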

Figure 1: Construction of the TT–matrix from the standard embedding matrix. Blue color depicts how the single element in the initial matrix is transformed into the product of the highlighted vectors and matrices in the TT–cores.

3.1 Motivation

Since most of the parameters in NLP models are concentrated in the embedding layers, we can greatly compress the entire model by compressing these layers, which is the problem we attack in this work. Our goal is to replace the standard embedding layer specified by an embedding matrix $X \in \mathbb{R}^{I \times J}$ with a more compact, yet powerful and trainable, representation which would allow us to efficiently map words into vectors.

The simplest approach to compactly represent a matrix of a large size is to use the low–rank matrix factorization, which treats the matrix $X$ as a product of two matrices $X = U V^{\top}$. Here $U \in \mathbb{R}^{I \times R}$ and $V \in \mathbb{R}^{J \times R}$ are much “thinner” matrices, and $R$ is the rank hyperparameter. Note that rather than training the model with the standard embedding layer, and then trying to compress the obtained embedding, we can initially seek the embedding matrix in the described low–rank format. Then, for evaluation and training, the individual word embedding $X[i, :]$ can be computed as a product $U[i, :] V^{\top}$, which does not require materializing the full matrix $X$. This approach reduces the number of degrees of freedom in the embedding layer from $IJ$ to $(I + J)R$.

However, typically, in NLP tasks the embedding dimension $J$ is much smaller than the vocabulary size $I$, and obtaining a significant compression ratio using low-rank matrix factorization is problematic. In order to preserve the model performance, the rank $R$ cannot be taken very small, and the compression ratio is bounded by $\frac{IJ}{(I + J)R} \leq \frac{J}{R}$, which is close to $1$ for a usually full-rank embedding matrix. To overcome this bound and achieve a significant compression ratio even for matrices of disproportional dimensionalities, we reshape them into multidimensional tensors and apply the Tensor Train decomposition, which allows for a more compact representation, where the number of parameters becomes logarithmic with respect to $I$.
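The low-rank lookup and its compression bound can be illustrated with a minimal NumPy sketch. The function names and the sizes below are illustrative assumptions, not taken from the paper or its code.

```python
import numpy as np

def low_rank_lookup(U, V, i):
    """Return row i of X = U @ V.T without materializing X."""
    return U[i] @ V.T

def low_rank_compression(I, J, R):
    """Ratio of full-matrix parameters I*J to factorized parameters (I + J)*R."""
    return (I * J) / ((I + J) * R)

rng = np.random.default_rng(0)
I, J, R = 1000, 64, 8            # illustrative sizes only
U = rng.standard_normal((I, R))  # "thin" factor of shape (I, R)
V = rng.standard_normal((J, R))  # "thin" factor of shape (J, R)

row = low_rank_lookup(U, V, 17)          # embedding of the token with index 17
ratio = low_rank_compression(I, J, R)    # never exceeds J / R
```

Since $I / (I + J) < 1$, the ratio stays below $J / R$ no matter how large the vocabulary is, which is the limitation the TT format is meant to overcome.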

3.2 Tensor Train decomposition

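A tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \dots \times I_N}$ is said to be represented in the Tensor Train (TT) format (Oseledets, 2011) if each element of $\mathcal{X}$ can be computed as:

$$\mathcal{X}(i_1, i_2, \dots, i_N) = \sum_{r_1=1}^{R_1} \sum_{r_2=1}^{R_2} \dots \sum_{r_{N-1}=1}^{R_{N-1}} \mathcal{G}^{(1)}(i_1, r_1)\, \mathcal{G}^{(2)}(r_1, i_2, r_2) \dots \mathcal{G}^{(N)}(r_{N-1}, i_N),$$

where the tensors $\mathcal{G}^{(k)} \in \mathbb{R}^{R_{k-1} \times I_k \times R_k}$ are the so-called TT–cores and $R_0 = R_N = 1$ by definition. The minimal values of $\{R_k\}_{k=1}^{N-1}$ for which the TT–decomposition exists are called TT–ranks. Note that the element $\mathcal{X}(i_1, \dots, i_N)$ is effectively the product of $2$ vectors and $N - 2$ matrices:

$$\mathcal{X}(i_1, \dots, i_N) = \mathcal{G}^{(1)}[i_1, :]\; \mathcal{G}^{(2)}[:, i_2, :] \dots \mathcal{G}^{(N-1)}[:, i_{N-1}, :]\; \mathcal{G}^{(N)}[:, i_N],$$

where $\mathcal{G}^{(k)}[:, i_k, :]$ stands for the slice (a subset of a tensor with some indices fixed) of the corresponding TT–core $\mathcal{G}^{(k)}$.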

The number of degrees of freedom in such a decomposition can be evaluated to be $\sum_{k=1}^{N} R_{k-1} I_k R_k$. Thus, in the case of small ranks, the total number of parameters required to store a tensor in TT–representation is significantly smaller than the $\prod_{k=1}^{N} I_k$ parameters required to store the full tensor of the corresponding size. This observation makes the application of the TT–decomposition appealing in many problems dealing with extremely large tensors.
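The element-wise formula and the parameter count can be checked on a toy example. The sketch below uses NumPy with randomly generated cores; the dimensions, ranks, and helper names are illustrative assumptions, and each core is stored with explicit boundary ranks $R_0 = R_N = 1$.

```python
import numpy as np

def tt_element(cores, index):
    """Multiply the slices G^(k)[:, i_k, :] from left to right; the result is 1x1."""
    result = np.ones((1, 1))
    for core, i_k in zip(cores, index):
        result = result @ core[:, i_k, :]
    return result[0, 0]

def tt_full(cores):
    """Contract all cores into the full tensor (feasible only for tiny examples)."""
    full = cores[0]                                   # shape (1, I_1, R_1)
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    return full[0, ..., 0]                            # drop the boundary ranks

def tt_num_params(cores):
    """Sum of R_{k-1} * I_k * R_k over all cores."""
    return sum(core.size for core in cores)

rng = np.random.default_rng(0)
dims, ranks = (4, 5, 6), (1, 3, 2, 1)                 # illustrative I_k and R_k
cores = [rng.standard_normal((ranks[k], dims[k], ranks[k + 1])) for k in range(3)]
full = tt_full(cores)
```

Here the TT representation stores fewer numbers than the full tensor even at this toy scale, and the gap grows quickly with the number and size of the modes.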

A TT–decomposition exists for any tensor (but it is not unique); however, compressing the tensor by a significant factor is only possible up to some relative error. To make use of this significant parameter reduction for tensors of low TT–ranks, in many practical problems it is common to seek a solution in the TT–format explicitly (i.e., via TT–cores only, without forming the full tensor), since this allows many operations to be performed with low complexity with respect to the hyperparameters of the decomposition. These operations include computing slices and performing basic linear operations on tensors.

3.3 TT–matrix

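Let $X \in \mathbb{R}^{I \times J}$ be a matrix of size $I \times J$. Given two arbitrary factorizations of its dimensions into natural numbers, $I = \prod_{k=1}^{N} I_k$ and $J = \prod_{k=1}^{N} J_k$, we can reshape (by reshape we mean a column-major reshape command, implemented, for example, as numpy.reshape with order='F' in Python) and transpose this matrix into an $N$-way tensor $\mathcal{X} \in \mathbb{R}^{I_1 J_1 \times I_2 J_2 \times \dots \times I_N J_N}$ and then apply the TT–decomposition to it, resulting in a more compact representation.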

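More concretely, define the bijections $\mathcal{I}(i) = (i_1, \dots, i_N)$ and $\mathcal{J}(j) = (j_1, \dots, j_N)$ that map row and column indices $i$ and $j$ of the matrix $X$ to the $N$-dimensional vector-indices such that $0 \leq i_k < I_k$, $0 \leq j_k < J_k$, $\forall k = 1, \dots, N$. From the matrix $X$ we can form an $N$-way tensor $\mathcal{X}$ whose $k$-th dimension is of length $I_k J_k$ and is indexed by the tuple $(i_k, j_k)$. This tensor is then represented in the TT–format:

$$\mathcal{X}((i_1, j_1), \dots, (i_N, j_N)) = \mathcal{G}^{(1)}[(i_1, j_1), :]\; \mathcal{G}^{(2)}[:, (i_2, j_2), :] \dots \mathcal{G}^{(N)}[:, (i_N, j_N)]. \qquad (1)$$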
Such a representation of the matrix in the TT–format is called a TT–matrix (Oseledets, 2010; Novikov et al., 2015) and is also known as a Matrix Product Operator (MPO) (Pirvu et al., 2010) in the physics literature. The factorizations $(I_1, I_2, \dots, I_N) \times (J_1, J_2, \dots, J_N)$ will be referred to as the shape of the TT–matrix, or TT–shapes. The process of constructing the TT–matrix from the standard matrix is visualized in Figure 1 for the tensor of order $3$. Note that in this case the TT–cores are in fact $4$-th order tensors, but all the operations defined for tensors in the TT–format are naturally extended to TT–matrices.

3.4 TT–embedding

By TT–embedding, we call a layer with trainable parameters (TT–cores) represented as a TT–matrix $\mathcal{E}$ of the underlying tensor shape $(I_1 \times J_1) \times (I_2 \times J_2) \times \dots \times (I_N \times J_N)$, which can be transformed into a valid embedding layer $E \in \mathbb{R}^{I \times J}$, with $I = \prod_{k=1}^{N} I_k$ and $J = \prod_{k=1}^{N} J_k$. To specify the shapes of TT–cores one also has to provide the TT–ranks, which are treated as hyperparameters of the layer and explicitly define the total compression ratio.

In order to compute the embedding for a particular word indexed $i$ in the vocabulary, we first map the row index $i$ into the $N$-dimensional vector index $(i_1, \dots, i_N)$, and then calculate the components of the embedding with formula (1). Note that the computation of all its components is equivalent to selecting the particular slices in the TT–cores (slices of shapes $J_1 \times R_1$ in $\mathcal{G}^{(1)}$, $R_1 \times J_2 \times R_2$ in $\mathcal{G}^{(2)}$ and so on) and performing a sequence of matrix multiplications, which is executed efficiently in modern linear algebra packages, such as cuBLAS. The procedure of computing the mapping $i \rightarrow (i_1, \dots, i_N)$ is given by Algorithm 1.

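  Require: $I$ – vocabulary size, $\{I_k\}_{k=1}^{N}$ – an arbitrary factorization of $I$, $i$ – index of the target word in vocabulary.
  Returns: $\mathcal{I}(i) = (i_1, \dots, i_N)$ – $N$-dimensional index.
  Initialize: $L = \{1, I_1, I_1 I_2, \dots, I_1 I_2 \cdots I_{N-1}\}$
  for $k = N$ to $1$ do
    $i_k \leftarrow \lfloor i / L[k] \rfloor$
    $i \leftarrow i \bmod L[k]$
  end for
Algorithm 1 The algorithm implementing the bijection $\mathcal{I}(i)$ as described in Section 3.3.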
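A flat-to-multi-index bijection of this kind can be sketched in a few lines of Python. The version below assumes the column-major convention mentioned in Section 3.3, i.e. $i = i_1 + I_1 i_2 + I_1 I_2 i_3 + \dots$; the function name is hypothetical and the code is not taken from the authors' repository.

```python
import numpy as np

def row_to_multi_index(i, factors):
    """Map a flat index i to (i_1, ..., i_N), assuming column-major ordering."""
    # strides[k] = I_1 * ... * I_k (empty product = 1 for the first factor)
    strides = [1]
    for f in factors[:-1]:
        strides.append(strides[-1] * f)
    multi = [0] * len(factors)
    # Peel off the slowest-varying index first, as in a mixed-radix expansion.
    for k in reversed(range(len(factors))):
        multi[k], i = divmod(i, strides[k])
    return tuple(multi)

factors = (4, 5, 6)                       # illustrative factorization of I = 120
example = row_to_multi_index(7, factors)
# Under this convention the result agrees with NumPy's Fortran-order unraveling.
reference = tuple(int(v) for v in np.unravel_index(7, factors, order="F"))
```

Any other fixed bijection would work equally well for the layer itself, as long as the same convention is used consistently when building and reading the TT–matrix.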

In order to construct a TT–embedding layer for a vocabulary of size $I$ and embedding dimension $J$, and to train a model with such a layer, one has to perform the following steps.

  • Provide factorizations of $I$ and $J$ into factors $I = I_1 \times I_2 \times \dots \times I_N$ and $J = J_1 \times J_2 \times \dots \times J_N$, and specify the set of TT–ranks $\{R_1, R_2, \dots, R_{N-1}\}$.

  • Initialize the set of parameters of the embedding $\Theta = \{\mathcal{G}^{(k)} \in \mathbb{R}^{R_{k-1} \times I_k \times J_k \times R_k}\}_{k=1}^{N}$. Concrete initialization scenarios are discussed further in the text.

  • During training, given a batch of indices $\{i_1, i_2, \dots, i_b\}$, compute the corresponding embeddings using Eq. (1) and Algorithm 1.

  • Computed embeddings can be followed by any standard layer such as LSTM (Hochreiter & Schmidhuber, 1997) or self-attention (Vaswani et al., 2017), and trained with backpropagation since they depend differentiably on the parameters $\Theta$.
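The forward pass described by these steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' PyTorch implementation: cores are stored as 4-way arrays of shape $(R_{k-1}, I_k, J_k, R_k)$, both row and column indices use the column-major convention, and all function names and sizes are hypothetical. A real layer would batch the contractions and rely on automatic differentiation.

```python
import numpy as np

def row_to_multi_index(i, factors):
    """Column-major mixed-radix expansion of a flat row index."""
    strides = [1]
    for f in factors[:-1]:
        strides.append(strides[-1] * f)
    multi = [0] * len(factors)
    for k in reversed(range(len(factors))):
        multi[k], i = divmod(i, strides[k])
    return tuple(multi)

def tt_embedding_row(cores, multi_index):
    """Contract the slices G^(k)[:, i_k, :, :] over the rank indices."""
    out = np.ones(1)                                   # boundary rank R_0 = 1
    for core, i_k in zip(cores, multi_index):
        out = np.tensordot(out, core[:, i_k, :, :], axes=([-1], [0]))
    # out has shape (J_1, ..., J_N, 1); flatten the column indices column-major.
    return out[..., 0].reshape(-1, order="F")

def tt_embedding_lookup(cores, indices):
    """Embeddings for a batch of token indices, one row per token."""
    factors = [core.shape[1] for core in cores]
    rows = [tt_embedding_row(cores, row_to_multi_index(int(i), factors)) for i in indices]
    return np.stack(rows)

def tt_matrix_full(cores):
    """Materialize the full I x J matrix (for checking small examples only)."""
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    full = full[0, ..., 0]                             # (I_1, J_1, ..., I_N, J_N)
    n = len(cores)
    full = full.transpose(list(range(0, 2 * n, 2)) + list(range(1, 2 * n, 2)))
    n_rows = int(np.prod(full.shape[:n]))
    return full.reshape(n_rows, -1, order="F")

rng = np.random.default_rng(0)
row_factors, col_factors, ranks = (3, 4, 5), (2, 2, 3), (1, 4, 4, 1)
cores = [rng.standard_normal((ranks[k], row_factors[k], col_factors[k], ranks[k + 1]))
         for k in range(3)]
batch = tt_embedding_lookup(cores, [0, 7, 59])         # shape (3, 12)
```

The lookup touches only one slice per core for each token, so its cost depends on the ranks and the factors $J_k$ rather than on the vocabulary size.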


TT–embedding implies a specific structure on the order of tokens in the vocabulary (the order of rows in the embedding matrix), and determining the optimal order is an appealing problem to solve. However, we leave this problem for future work and use the order produced by the standard tokenizer (sorted by frequency) in our current experiments.


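The standard way to initialize an embedding matrix $E \in \mathbb{R}^{I \times J}$ is via, e.g., the Glorot initializer (Glorot & Bengio, 2010), which initializes each element as $E(i, j) \sim \mathcal{N}\left(0, \frac{2}{I + J}\right)$. For the TT–embedding, we can only initialize the TT–cores, and the distribution of the elements of the resulting matrix $\mathcal{E}$ is rather non–trivial. However, it is easy to verify that if we initialize each TT–core element as $\mathcal{G}^{(k)}(r_{k-1}, (i_k, j_k), r_k) \sim \mathcal{N}(0, 1)$, the resulting distribution of the matrix elements $\mathcal{E}(i, j)$ has the property that $\mathbb{E}[\mathcal{E}(i, j)] = 0$ and $\mathrm{Var}[\mathcal{E}(i, j)] = \prod_{k=1}^{N} R_k = R^2$. Capitalizing on this observation, in order to obtain the desired variance $\mathrm{Var}[\mathcal{E}(i, j)] = \sigma^2$ while keeping $\mathbb{E}[\mathcal{E}(i, j)] = 0$, we can simply initialize each TT–core as

$$\mathcal{G}^{(k)}(r_{k-1}, (i_k, j_k), r_k) \sim \mathcal{N}\left(0, \left(\frac{\sigma}{R}\right)^{2/N}\right). \qquad (2)$$

The resulting distribution is not Gaussian; however, it approaches the Gaussian distribution as the TT–rank increases (Figure 2).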

Figure 2: Distribution of a matrix element of the TT–matrix of shape , with cores initialized by formula (2) with . As the TT–rank increases, the resulting distribution approaches .
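The variance argument behind the modified initializer can be checked empirically. The sketch below is an assumption-laden illustration in NumPy: it samples cores with per-element variance $(\sigma / R)^{2/N}$, where $R^2$ is the product of the TT–ranks, and estimates the variance of a single matrix element over independent draws. The helper names, the rank, and the number of trials are hypothetical choices.

```python
import numpy as np

def init_tt_cores(row_factors, col_factors, ranks, sigma, rng):
    """Sample cores so that each matrix element has variance close to sigma^2."""
    n = len(row_factors)
    r = np.sqrt(np.prod(ranks))                 # R, with R^2 = product of TT-ranks
    std = (sigma / r) ** (1.0 / n)              # square root of (sigma / R)^(2/N)
    return [std * rng.standard_normal((ranks[k], row_factors[k], col_factors[k], ranks[k + 1]))
            for k in range(n)]

def sample_elements(n_trials, n_cores, rank, sigma, rng):
    """Draw one matrix element per independent initialization."""
    ranks = [1] + [rank] * (n_cores - 1) + [1]
    values = np.empty(n_trials)
    for t in range(n_trials):
        # Factors of size 1 suffice: every element has the same distribution.
        cores = init_tt_cores([1] * n_cores, [1] * n_cores, ranks, sigma, rng)
        prod = np.ones((1, 1))
        for core in cores:
            prod = prod @ core[:, 0, 0, :]
        values[t] = prod[0, 0]
    return values

rng = np.random.default_rng(0)
values = sample_elements(n_trials=5000, n_cores=3, rank=4, sigma=1.0, rng=rng)
empirical_var = values.var()
```

With enough trials the empirical variance should lie near $\sigma^2$, although the individual samples are heavier-tailed than a Gaussian at small ranks.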

In our experiments, we have used the modified Glorot initializer implemented by formula (2), which greatly improved performance, as opposed to initializing TT–cores simply via a standard normal distribution. It is also possible to initialize the TT–embedding layer by converting a learned embedding matrix into the TT–format using the standard TT–SVD algorithm (Oseledets, 2011); however, this approach requires a pretrained embedding matrix and does not exhibit better performance in practice.

Hyperparameter selection

Our embedding layer introduces two additional structure-specific hyperparameters, namely TT–shapes and TT–ranks.

TT–embedding does not require the vocabulary size $I$ to be represented exactly as the product of factors $I_1, \dots, I_N$; in fact, any factorization $\prod_{k=1}^{N} I_k = \tilde{I} \geq I$ will suffice. However, in order to achieve the highest possible compression ratio for a fixed value of $\tilde{I}$, the factors $\{I_k\}_{k=1}^{N}$ should be as close to each other as possible. Our implementation includes a simple automated procedure for selecting good values of $\{I_k\}_{k=1}^{N}$ during TT–embedding initialization. The factors $J_1, \dots, J_N$ are defined by the embedding dimensionality $J$, which can be easily chosen to support good factorization, e.g., $512 = 8 \times 8 \times 8$ or $480 = 6 \times 5 \times 4 \times 4$.

The values of TT–ranks directly define the compression ratio, so choosing them to be too small or too large will result into either significant performance drop or little reduction of the number of parameters. In our experiments, we set all TT–ranks to be equal to for the problems with small vocabularies and or for the problems with larger vocabularies, which allowed us to achieve significant compression of the embedding layer, at the cost of a tiny sacrifice in the metrics of interest.
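To make the effect of these two hyperparameters concrete, the sketch below counts TT–embedding parameters for given TT–shapes and TT–ranks and computes the resulting compression ratio. It also includes a greedy heuristic for picking nearly equal factors whose product covers the vocabulary. The heuristic is a hypothetical stand-in and is not the automated procedure from the authors' implementation; all sizes are illustrative.

```python
import math

def tt_embedding_params(row_factors, col_factors, ranks):
    """Sum of R_{k-1} * I_k * J_k * R_k, with boundary ranks R_0 = R_N = 1."""
    full_ranks = [1] + list(ranks) + [1]
    return sum(full_ranks[k] * row_factors[k] * col_factors[k] * full_ranks[k + 1]
               for k in range(len(row_factors)))

def compression_ratio(vocab_size, emb_dim, row_factors, col_factors, ranks):
    """Parameters of the full embedding divided by parameters of the TT-embedding."""
    return vocab_size * emb_dim / tt_embedding_params(row_factors, col_factors, ranks)

def balanced_factors(vocab_size, n_factors):
    """Hypothetical heuristic: near-equal factors with product >= vocab_size."""
    base = max(1, math.ceil(vocab_size ** (1.0 / n_factors)))
    while base ** n_factors < vocab_size:       # guard against rounding errors
        base += 1
    factors = [base] * n_factors
    # Greedily shrink each factor while the product still covers the vocabulary.
    for k in range(n_factors):
        while factors[k] > 1 and math.prod(factors) // factors[k] * (factors[k] - 1) >= vocab_size:
            factors[k] -= 1
    return factors

vocab_size, emb_dim = 25000, 256                # illustrative sizes only
row_factors = balanced_factors(vocab_size, 3)
col_factors = [4, 8, 8]                         # 4 * 8 * 8 = 256
ratio = compression_ratio(vocab_size, emb_dim, row_factors, col_factors, ranks=(16, 16))
```

Because the middle cores scale with the square of the rank, doubling all TT–ranks roughly quadruples the parameter count, which is why the ranks dominate the achievable compression.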

4 Experiments

Dataset Model Embedding shape Test acc. Compr.
SST Full
Table 1: Sentiment analysis results. Embedding compression is calculated as the ratio between the number of parameters in the full embedding layer and TT–embedding layer. The LSTM parts are identical in both models, and the TT–ranks were set to in these experiments. In IMDB experiments, we observe that both the best accuracy and the highest compression ratio are achieved with TT–cores in TT–embedding layer. As for the SST dataset, the highest performance is attained with TT–cores, while the best compression ratio is realized in the experiment with TT–cores.
Model Embedding shape TT–rank Test BLEU Compr.
Table 2: Application of TT–embeddings to the task of English-to-German translation. WMT 14 En–De dataset was used for training, and news-commentary-11 for testing. The Transformer architecture (‘base model’) from (Vaswani et al., 2017) was used for this task. All layers except for embeddings are identical, and the models were trained using the same learning rate schedule, defined by Eq. (3) in (Vaswani et al., 2017). In these experiments, the embedding dimension was fixed to .


We have implemented TT–embeddings described in Section 3 in Python using PyTorch (Paszke et al., 2017). The code is available at the repository https://github.com/KhrulkovV/tt-pytorch.

Experimental setup

We tested our approach on several popular NLP tasks:

  • Sentiment analysis — as a starting point in our experiments, we test TT–embeddings on a rather simple task of predicting polarity of a sentence.

  • Neural Machine Translation (NMT) — to verify the applicability of TT–embeddings in more practical problems, we test it on a more challenging task of performing translation from one language to another.

  • Language Modeling (LM) — finally, we evaluate TT–embeddings on language modeling tasks in the case of extremely large vocabularies.

Moreover, since our approach is not limited to NLP tasks but can also be applied to any problem possessing categorical features, we have performed the following experiment:

  • Click Through Rate (CTR) prediction — we show that TT–embeddings can be successfully applied for the task of binary classification with numerous categorical features of significant cardinality.

In order to prove the generality and wide applicability of the proposed approach, we tested it on various popular architectures, such as MLPs (CTR), LSTMs (sentiment analysis), and Transformers (LM, NMT).

4.1 Sentiment analysis

Sentiment analysis is a classification task, where one has to predict whether the sequence of tokens (usually words or sentences) contains either positive or negative meaning. For this experiment, we have used the IMDB dataset (Maas et al., 2011) with two categories, and the Stanford Sentiment Treebank (SST) with five categories. We have taken the most frequent words for the IMDB dataset and for SST, embedded them into a –dimensional space using either standard embedding or TT–embedding layer, and performed classification using a standard bidirectional two–layer LSTM with hidden size , and dropout rate . For our experiments, we have set to , and trained the model for various values of and various TT-shapes (for TT–embedding).

Our findings are summarized in Table 1. We observe that the models with largely compressed embedding layers can perform equally or even better than the full uncompressed models. For instance, in the case of TT3 for the IMDB dataset, the number of parameters in the embedding layer was reduced from to just , while the test accuracy had not changed significantly. This suggests that learning individual independent embeddings for each particular word is superfluous, as the expressive power of LSTM is sufficiently large to make use of these intertwined, yet more compact embeddings. Moreover, slightly better test accuracy of the compressed models in certain cases (e.g., for the SST dataset of a rather small size) insinuates that imposing specific tensorial low–rank structure on the embedding matrix can be viewed as a special form of regularization, thus potentially improving the generalization power of the model. A detailed and comprehensive test of this hypothesis goes beyond the scope of this paper, and we leave it for future work.

4.2 Neural Machine Translation

In the task of Neural Machine Translation, the goal is to map an input sequence of symbols representing a phrase in one language to an output sequence representing the same phrase in a different language. A typical architecture employed in this task is based on an encoder–decoder framework, which maps the input sequence into continuous representations and uses them for generating the output sequence, commonly also making use of the attention mechanism. The encoder–decoder framework serves as the foundation of most modern NMT models (Cho et al., 2014; Bahdanau et al., 2014; Sutskever et al., 2014; Vaswani et al., 2017).

For this experiment, we have trained the popular Transformer model (Vaswani et al., 2017) on WMT English–German dataset consisting of roughly million sentence pairs. For validation, we used the news–commentary–v11 dataset. Sentences were tokenized using the SentencePiece software (https://github.com/google/sentencepiece), resulting in tokens for each language. As the baseline implementation of Transformer, we have used ‘base model’ architecture from (Vaswani et al., 2017) implemented in the OpenNMT–py library (https://github.com/OpenNMT/OpenNMT-py) (Klein et al., 2017), and for our experiments we have replaced each of the embedding layers with the corresponding TT–embedding. For evaluation we used beam search with a beam size of and length penalty .

Our results are summarized in Table 2. We observe that even in this rather challenging task, both embedding layers can be compressed significantly, at the cost of a small drop in the BLEU score. Compared to the sentiment analysis, NMT is a much more complex task which benefits more from additional capacity (in the form of more powerful RNN or more transformer blocks) rather than regularization (Vaswani et al., 2017; Baevski & Auli, 2018), which may explain why we did not manage to improve the model by regularizing its embedding layers. However, note, that for a fixed memory budget, TT-embeddings allow to include more transformer blocks, which may lead to a more powerful model with the same number of parameters as in the full model with standard embedding layer.

Model Embedding shape TT–rank Train ppl Test ppl Compr.
Table 3: WikiText-103 language modeling results. The Transformer-XL architecture from (Dai et al., 2018) was used for this task. All layers except for embeddings are identical. The same training schedule was used for all the models. In these experiments, the embedding dimension was equal to , the number of transformer blocks , the number of attention heads .
Hashing Model Factorization Test loss Compr. Model size
Full Mb
TT1 factors Mb
TT2 factors Mb
TT1 factors Mb
TT2 factors Mb
Table 4: Criteo CTR results. The hashed dataset is constructed as specified in Section 4.4 with hashing value , and the unhashed dataset is considered as is. For the baseline algorithm, we have used the hashed version. Large embedding layers (with more than unique tokens) were replaced by TT–embedding layers with shape factorizations consisting of or factors. In the case of the full dataset, the compression ratio is measured with respect to the original dataset without hashing procedure. In these experiments, we took the TT–rank equal to and the embedding dimension is .

4.3 Language modeling

The task of language modeling is to estimate the joint probability $P(w_1, \dots, w_T)$ of a corpus of tokens $(w_1, \dots, w_T)$, which represent sentences, words, word pieces, or single characters. The resulting models can be used to generate text or be further fine-tuned to solve other NLP tasks (Radford et al., 2018). In this paper, we employ the standard setting of predicting the next token given the sequence of preceding tokens, based on the factorization $P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})$. However, more complex scenarios can also be used, such as masking some words in the sentence and predicting them from the context or predicting next sentences from the previous ones (Devlin et al., 2018).
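The chain-rule factorization and the perplexity metric reported for language models can be illustrated with a short sketch. It assumes the per-token conditional probabilities are already available (in practice they come from the model's softmax output); the function names and the numbers are illustrative.

```python
import numpy as np

def sequence_log_prob(cond_probs):
    """log P(w_1, ..., w_T) as the sum of log P(w_t | w_1, ..., w_{t-1})."""
    return float(np.sum(np.log(cond_probs)))

def perplexity(cond_probs):
    """Exponential of the average negative log-likelihood per token."""
    return float(np.exp(-np.mean(np.log(cond_probs))))

# Hypothetical conditional probabilities assigned to three consecutive tokens.
cond_probs = np.array([0.5, 0.25, 0.5])
log_p = sequence_log_prob(cond_probs)
ppl = perplexity(cond_probs)
```

Lower perplexity means the model assigns higher probability to the observed tokens, so a gap between train and test perplexity is a common indicator of overfitting.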

Specifically, we take Transformer-XL (Dai et al., 2018), the open-source state-of-the-art language modeling architecture at the time of this writing (https://github.com/kimiyoung/transformer-xl), and replace the standard embedding layer with TT–embedding. Then, we test different model configurations on the WikiText–103 (Merity et al., 2016) dataset and report the results in Table 3.

We compare the model with distinct softmax and embedding layers (Full), the original Transformer-XL model (Full–shared) which ties softmax and embedding layers together as suggested in (Press & Wolf, 2016), and the models with TT–embeddings of different shapes. We see that the model with TT–embedding is superior to the full model which learns the embedding and softmax layers separately and overfits strongly to the training data. A simple modification which uses the same weight matrix in embedding and softmax layers (and can be seen as a form of regularization) performs much better. However, a larger difference between test and train perplexity suggests that it overfits more than the architecture with TT–embedding.

4.4 Click Through Rate prediction

Among other applications of the TT–embedding layer, we chose to focus on the experiments lying in the field of click–through rate prediction, a popular task in digital advertising (He et al., 2014). In this paper, we consider the open dataset provided by Criteo for Kaggle Display Advertising Challenge (Criteo Labs, 2014). This dataset consists of categorical features, samples and is binary labeled according to whether the user clicked on the given advertisement. Unique values of categorical features are first bijectively mapped into integers. In order to reduce the amount of stored data, if the size of a corresponding vocabulary is immense (e.g., a cardinality of some features in this dataset is of order ), these integers are further hashed by taking modulus with respect to some fixed number such as . However, due to strong compression properties of TT–embeddings, this is not necessary for our approach. In our experiments, we consider both full and hashed datasets.
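The preprocessing described above (mapping unique values to integers, then optionally hashing by a modulus) can be sketched as follows. The function name, the toy values, and the bucket count are illustrative assumptions and do not reproduce the actual Criteo pipeline.

```python
def encode_feature(values, num_buckets=None):
    """Map categorical values to integer ids; optionally hash ids by a modulus."""
    vocab = {}
    ids = []
    for value in values:
        # Assign the next free integer to each previously unseen value.
        idx = vocab.setdefault(value, len(vocab))
        ids.append(idx if num_buckets is None else idx % num_buckets)
    return ids, vocab

raw = ["a", "b", "a", "c", "d"]                           # toy categorical column
full_ids, vocab = encode_feature(raw)                     # one id per unique value
hashed_ids, _ = encode_feature(raw, num_buckets=3)        # ids collide modulo 3
```

Hashing caps the size of the embedding table at the number of buckets, at the price of collisions between unrelated values, which is the trade-off a compact TT–embedding aims to avoid.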

Hashing Model Hidden size Factorization Test loss Compr. Model size
Full Mb
TT1 factors Mb
TT2 factors Mb
Full Mb
TT1 factors Mb
TT2 factors Mb
Table 5: The performance of lightweight models for CTR prediction on Criteo dataset. In these experiments, all TT–ranks were equal to , and hidden sizes were taken either equal to or . Other parameters are the same as in the previous experiments (Table 4). We can observe the drop in accuracy paired, however, with an impressive compression ratio.

CTR with the baseline algorithm

The task at hand can be treated as a binary classification problem. As a baseline algorithm, we consider the neural network with the following architecture. First, each of the categorical features is passed through a separate embedding layer with embedding size . After that, the embedded features are concatenated and passed through fully-connected layers of neurons and ReLU activation functions. In all experiments, we use Adam optimizer with the learning rate equal to . In this format, since many input features have a large number of unique values (e.g., ) and storing the corresponding embedding matrices would require an immense amount of memory, we employ the hashing procedure mentioned earlier.

CTR with TT–embeddings

Similarly, as in the previous experiments, we propose to substitute the embedding layers with the TT–embedding layers. Besides the embedding layers, we leave the overall structure of the neural network unchanged with the same parameters as in the baseline approach. Throughout our experiments, we consider a set of different TT–ranks and various factorizations.

Table 4 presents the experimental results on the Criteo CTR dataset. We have fixed the embedding dimension equal to and the TT–rank to . To the best of our knowledge, our loss value is very close to the state-of-the-art result  (Juan et al., 2016). These experiments indicate that the substitution of large embedding layers with TT–embeddings leads to significant compression ratios (up to times) with a slight improvement in test loss. If we use the hashing procedure, the dataset is already compressed, which is in line with a smaller compressing power of TT–embedding layers. Nevertheless, the total size of the compressed model does not exceed Mb, while the baseline model weighs about Mb. The obtained compression ratio suggests that the usage of TT–embedding layers may be beneficial in CTR prediction tasks; however, rigorous evaluation on large industrial benchmarks would shed more light on this case.

Finally, to make the usage of the proposed method more applicable and practical, we have performed the experiments aiming to compress the model by a greater factor. Since these lightweight models are not as precise as larger ones, they can serve as preliminary prediction methods in the context of industrial purposes. We have considered the following parameters: the rank of underlying TT–matrix was equal to , and the hidden size was taken to be either or while leaving the remaining architecture untouched. The performance of these lightweight models is summarized in the Table 5.

5 Discussion and future work

We propose a novel embedding layer, the TT–embedding, for compressing huge lookup tables used for encoding categorical features of significant cardinality, such as the index of a token in natural language processing tasks. The proposed approach, based on the TT–decomposition, experimentally proved to be effective, as it heavily decreases the number of training parameters at the cost of a small deterioration in performance. In addition, our method can be easily integrated into any deep learning framework and trained via backpropagation, while capitalizing on reduced memory requirements and increased training batch size.

Our experimental results suggest several appealing directions for future work. First of all, TT–embeddings impose a concrete tensorial low-rank structure on the embedding matrix, which was shown to improve the generalization ability of the networks acting as a regularizer. The properties and conditions of applicability of this regularizer are subject to more rigorous analysis. Secondly, it is important to understand how the order of tokens in the vocabulary affects the properties of the networks with TT–embedding. We hypothesize that there exists the optimal order of tokens which better exploits the particular structure of TT–embedding and leads to a boost in performance and/or compression ratio. Additionally, another interesting direction is to determine the optimal number of factors of TT–cores as our extensive experiments demonstrate a slight dependence of total accuracy on the number of factors. Finally, the idea of applying higher–order tensor decompositions to reduce the number of parameters in neural nets is complementary to more traditional methods such as pruning and quantization. Thus, it would be interesting to make a thorough comparison of all these methods and investigate whether their combination may lead to even stronger compression.


We would like to thank Andrzej Cichocki for constructive discussions during the preparation of the manuscript. This work was supported by the Ministry of Education and Science of the Russian Federation (grant 14.756.31.0001).