## 1 Introduction

Neural Machine Translation (NMT) is a deep learning model that provides a robust method for machine translation using recurrent neural networks (RNNs). Originally proposed in [seq2seq], NMT relies primarily on an encoder-decoder architecture that provides increased fluency over phrase-based systems. This was implemented successfully in [google] for fast, accurate use on very large datasets. However, it has been suggested that there is significant redundancy in the current method of neural network parametrization [novikov2015tensor], presenting the opportunity for significant speedup.

Tensor Train (TT) decomposition [originalTT] is a method by which large tensors can be approximated by the product of a ‘train’ of smaller matrices (see Section 2.2). TT-decomposition has been proposed as a method of speeding up and reducing the memory usage of machine translation systems with dense weight matrices by reducing the number of parameters required to describe the system [novikov2015tensor]. The dense weight matrix of the fully-connected layers can be decomposed into the Tensor Train format, creating what we will refer to as a ‘TT-layer’. This lower-rank factorization can then be trained in a similar way to the original dense weight matrix. In this work, we implement a TT-layer using t3f, a library which enables straightforward implementation of TT-decomposition within TensorFlow models. The t3f library provides an implementation with GPU support, eliminating the need to rewrite core functionality from scratch, which has been necessary for previous libraries (see [novikov2018tensor]).

## 2 Background

### 2.1 Neural Machine Translation

Neural Machine Translation (NMT) is a method of machine translation that uses an encoder-decoder architecture to coherently translate whole sentences by capturing long-range dependencies. This produces more fluent and accurate results than previous phrase-based approaches [luong17]. The encoder and decoder use recurrent neural networks (RNNs) to train the model, finally enabling translation of an input source vector $x$ to an output $y$ using the linear transformation

$$y = Wx + b, \tag{1}$$

where the weight matrix $W$ and bias vector $b$ are trained by the model.

There is a wide range of RNN models that differ, for example, in terms of directionality (unidirectional or bidirectional), depth (single- or multi-layer) and type (straightforward RNN, Long Short-Term Memory (LSTM), or gated recurrent unit (GRU)). Here, we use a deep multi-layer RNN with LSTM as the recurrent unit. This high-level NMT model consists of two recurrent neural networks: the encoder consumes input source words to build a ‘thought’ vector, while the decoder processes the vector to emit a translation, thereby using information from the entire source sentence [luongdissertation]. Further information about this particular model can be found in [luong17].

### 2.2 Tensor Train Decomposition

The Tensor Train (TT) format can be used to represent the dense weight matrix of a fully-connected layer using fewer parameters. In this work, we implement a ‘TT-layer’ as described in [novikov2015tensor]; a fully-connected layer with the weight matrix stored in the TT-format. The following outline follows the descriptions given in [originalTT] and [novikov2015tensor] and uses similar notation.

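A $d$-dimensional tensor $\mathcal{A}$ is said to be represented in the TT-format if there exist matrices $G_k[j_k]$ such that all the elements of $\mathcal{A}$ can be computed by the matrix product

$$\mathcal{A}(j_1, j_2, \ldots, j_d) = G_1[j_1]\, G_2[j_2] \cdots G_d[j_d], \tag{2}$$

where $j_k = 1, \ldots, n_k$ index the tensor elements and $G_k[j_k]$ is an $r_{k-1} \times r_k$ matrix, where $r_k$ are the ‘ranks’ of the TT-representation. The values $r_0$ and $r_d$ equal 1 in order to keep the matrix product (2), and hence each element of $\mathcal{A}$, of size $1 \times 1$. Note that each $G_k$ is actually an $r_{k-1} \times n_k \times r_k$ array with elements $G_k(\alpha_{k-1}, j_k, \alpha_k)$, using which we can write equation (2) in index form:

$$\mathcal{A}(j_1, j_2, \ldots, j_d) = \sum_{\alpha_0, \ldots, \alpha_d} G_1(\alpha_0, j_1, \alpha_1)\, G_2(\alpha_1, j_2, \alpha_2) \cdots G_d(\alpha_{d-1}, j_d, \alpha_d). \tag{3}$$

Each three-dimensional tensor $G_k$ is referred to as a ‘core’ of the TT-decomposition.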
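As a minimal sketch of how a single tensor element is recovered from a tensor train, the snippet below multiplies one matrix slice per core, starting and ending with boundary rank 1. The helper names `matmul` and `tt_element`, the nested-list layout `core[alpha_prev][j][alpha_next]`, and the toy cores are illustrative assumptions; they are not part of t3f or of the model described here.

```python
from typing import List

Matrix = List[List[float]]
Core = List[List[List[float]]]  # indexed as core[alpha_prev][j][alpha_next]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def tt_element(cores: List[Core], index: List[int]) -> float:
    """Evaluate one tensor element as a product of core slices."""
    result: Matrix = [[1.0]]  # 1 x 1 start, matching the boundary rank of 1
    for core, j in zip(cores, index):
        # Slice of the core at index j: an r_prev x r_next matrix.
        slice_j = [[row[j][a] for a in range(len(row[j]))] for row in core]
        result = matmul(result, slice_j)
    return result[0][0]  # boundary ranks of 1 leave a 1 x 1 matrix


# Toy 2 x 2 x 2 tensor with ranks (1, 2, 2, 1); values are arbitrary.
g1 = [[[1.0, 0.0], [0.0, 1.0]]]                   # shape 1 x 2 x 2
g2 = [[[1.0, 2.0], [3.0, 4.0]],
      [[5.0, 6.0], [7.0, 8.0]]]                   # shape 2 x 2 x 2
g3 = [[[1.0], [2.0]], [[3.0], [4.0]]]             # shape 2 x 2 x 1

value = tt_element([g1, g2, g3], [0, 1, 0])
print(value)
```

Only the three small cores are stored; any of the eight tensor elements can be reconstructed on demand in this way.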

The weight matrix $W \in \mathbb{R}^{M \times N}$, where $M = \prod_{k=1}^{d} m_k$ and $N = \prod_{k=1}^{d} n_k$, can be written in Tensor Train form. We define bijections $\mu(t) = (\mu_1(t), \ldots, \mu_d(t))$ and $\nu(\ell) = (\nu_1(\ell), \ldots, \nu_d(\ell))$, mapping the row and column indices $t$ and $\ell$ to $d$-dimensional vector-indices, whose $k$-th dimensions are of length $m_k$ and $n_k$ respectively. In this case, the cores are now described by a four-dimensional array. We will not go further into the mathematical details here, as the precise implementation is handled by the t3f library. Further information can be found in [novikov2015tensor].
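To make the parameter saving concrete, the following sketch counts the entries stored by the four-dimensional cores of a TT-matrix, each of shape rank × row factor × column factor × rank, and compares the total with the dense matrix. The function name `tt_matrix_params` and the example shapes and ranks are hypothetical choices for illustration, not the configurations used in this work.

```python
from math import prod
from typing import Sequence


def tt_matrix_params(row_dims: Sequence[int],
                     col_dims: Sequence[int],
                     ranks: Sequence[int]) -> int:
    """Number of entries stored by the cores of a TT-matrix.

    Core k has shape ranks[k] x row_dims[k] x col_dims[k] x ranks[k + 1].
    """
    if len(row_dims) != len(col_dims) or len(ranks) != len(row_dims) + 1:
        raise ValueError("need d row dims, d column dims and d + 1 ranks")
    if ranks[0] != 1 or ranks[-1] != 1:
        raise ValueError("boundary ranks must equal 1")
    return sum(ranks[k] * row_dims[k] * col_dims[k] * ranks[k + 1]
               for k in range(len(row_dims)))


# Hypothetical example: a 1024 x 2048 matrix split into d = 3 cores.
row_dims, col_dims, ranks = (8, 8, 16), (8, 16, 16), (1, 4, 4, 1)
dense = prod(row_dims) * prod(col_dims)            # entries in the dense matrix
tt = tt_matrix_params(row_dims, col_dims, ranks)   # entries across the 3 cores
print(f"dense: {dense}, TT: {tt}, ratio: {tt / dense:.4%}")
```

A smaller parameter count does not by itself guarantee fewer flops or less memory at training time, which is why both are measured separately.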

In this paper, we use an explicit notation of the form to clarify the shape of the tensors that we are training. We choose $d = 3$, splitting each weight matrix into 3 cores. For example, for a rank distribution and an underlying weight matrix shape , the shapes of the cores being trained are given by Table 1.

Core | Tensor Shape |
---|---|

. |

## 3 Methodology

We implement the TT-format in the TensorFlow NMT model [luong17], decomposing the weight matrix using the t3f library [novikov2018tensor] as described in Section 2.2 to create a TT-layer. We achieve this by creating a new BasicLSTMCell class, the code for which is provided in the Appendix. The TT-decomposition itself is performed using the key function t3f.to_tt_matrix as follows:

where the maximum rank is set by max_tt_rank and the product of the core dimensions is determined by the original dimensions of the weight matrix. The above code implements the specific example with the core dimensions given in Table 1. The max_tt_rank and shape arguments can be changed as necessary for different configurations of ranks and core dimensions. Note that, as outlined in Section 2.2, the core dimensions $m_k$ and $n_k$ must be chosen such that $\prod_k m_k = M$ and $\prod_k n_k = N$ for a weight matrix $W \in \mathbb{R}^{M \times N}$.
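Because a mismatch between the core dimensions and the weight matrix shape is an easy mistake to make when changing configurations, a small check along the following lines may be useful before building the layer. The helper `check_tt_shape` and the example numbers are hypothetical and are not part of t3f; the nested pair it returns simply groups the row factors and the column factors.

```python
from math import prod
from typing import Sequence, Tuple


def check_tt_shape(matrix_shape: Tuple[int, int],
                   row_dims: Sequence[int],
                   col_dims: Sequence[int]
                   ) -> Tuple[Tuple[int, ...], Tuple[int, ...]]:
    """Validate core dimensions against a weight matrix shape.

    Returns (row_dims, col_dims) as nested tuples if the products match.
    """
    rows, cols = matrix_shape
    if len(row_dims) != len(col_dims):
        raise ValueError("row and column factors must have the same length")
    if prod(row_dims) != rows:
        raise ValueError(
            f"row factors {tuple(row_dims)} do not multiply to {rows}")
    if prod(col_dims) != cols:
        raise ValueError(
            f"column factors {tuple(col_dims)} do not multiply to {cols}")
    return tuple(row_dims), tuple(col_dims)


# Hypothetical configuration: a 1024 x 2048 weight matrix with 3 cores.
shape = check_tt_shape((1024, 2048), (8, 8, 16), (8, 16, 16))
print(shape)
```

Failing early with a descriptive error is a design choice here: it keeps shape mistakes from surfacing later as opaque errors inside the graph.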

We test our TT model by training on two publicly available datasets. First, we perform benchmark tests using the original TensorFlow NMT model and a similar model which uses a low-rank approximation factorization. We then carry out approximately 20 training runs with different parameters using the IWSLT English-Vietnamese ’15 dataset (https://sites.google.com/site/iwsltevaluation2015/), with 133K examples, to determine which configurations give results competitive with those achieved by the benchmarks. Second, we train on the WMT German-English ’16 dataset (http://www.statmt.org/wmt16/translation-task.html), with 4.5M examples, to test the suitability of our model for larger datasets. All training runs are performed on *Nvidia Tesla P100* GPUs.

## 4 Results

### 4.1 IWSLT English-Vietnamese ’15

#### 4.1.1 Benchmark

We first perform a benchmark run against which to compare our TT model, using similar hyperparameters to the IWSLT English-Vietnamese training in [luong17]. We use a 2-layer LSTM with 512 hidden units, a bidirectional encoder (i.e., 1 bidirectional layer for the encoder) and embedding dimension 512. LuongAttention is used with scale=True, together with dropout probability 0.2. We use the Adam optimizer with learning rate 0.0004. We train for 12K steps (approximately 12 epochs); after 6K steps, we halve the learning rate every 600 steps.
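The learning rate schedule used for this benchmark can be sketched as a step function. This is an illustration only: the function name `learning_rate_at` is hypothetical, and the convention that the first halving takes effect one full interval after decay starts (a staircase decay) is an assumption, since the text does not pin down the exact step of the first halving.

```python
def learning_rate_at(step: int,
                     base_lr: float = 0.0004,
                     start_decay_step: int = 6000,
                     decay_every: int = 600,
                     decay_factor: float = 0.5) -> float:
    """Staircase schedule: constant, then scaled by decay_factor per interval."""
    if step < start_decay_step:
        return base_lr
    # Assumption: count only completed intervals since decay started.
    n_decays = (step - start_decay_step) // decay_every
    return base_lr * decay_factor ** n_decays


for step in (0, 6000, 6600, 12000):
    print(step, learning_rate_at(step))
```

The same function covers other runs by changing the arguments, for example a different base learning rate or a longer decay interval.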

We obtain a BLEU test/dev [bleuscore] score of 24.1/22.6, reaching a score of 24.0/22.4 after 7K steps. For our Tensor Train runs, we would like to test whether we can reach a comparable accuracy with a comparable number (or fewer) total flops. We therefore choose the cutoff BLEU test = 24.0 as a target against which to compare.

#### 4.1.2 Low-Rank Approximation

We perform a second benchmark run with the weight matrix $W$ decomposed using a low-rank approximation factorization inspired by Singular Value Decomposition (SVD). We assume such a decomposition exists and initialise and train the matrices $U$ and $V$ defined by $W = UV$, changing the order of computation of $Wx$ in equation (1) from $(UV)x$ to $U(Vx)$. Reordering the calculation in this way reduces the number of flops from $\mathcal{O}(BMN)$, where $B$ is the batch size, to $\mathcal{O}(Br(M+N))$, where $r$ is an appropriately chosen SVD dimension. It is implemented by splitting the tf.matmul into two successive multiplications.

We perform one benchmark run on the IWSLT English-Vietnamese dataset using the same parameters as in Section 4.1.1. We obtain a BLEU test/dev score of 24.8/22.5 after 12K steps, reaching a score of 24.0/22.1 after 7K steps.
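To illustrate why the order of multiplication matters in the low-rank model, the sketch below counts scalar multiplications for the dense product and for the two-stage factored product, and checks on a tiny example that both orderings give the same result. The names `dense_mults`, `factored_mults`, `matvec` and `matmat`, the rank argument `r`, and all sizes are hypothetical; the counts ignore additions and the bias term.

```python
from typing import List

Matrix = List[List[float]]


def dense_mults(batch: int, m: int, n: int) -> int:
    """Multiplications for applying an m x n matrix to `batch` vectors."""
    return batch * m * n


def factored_mults(batch: int, m: int, n: int, r: int) -> int:
    """Multiplications for U (V x), with V of shape r x n and U of shape m x r."""
    return batch * r * n + batch * m * r


def matvec(a: Matrix, x: List[float]) -> List[float]:
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def matmat(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


# Tiny check that (U V) x equals U (V x).
U = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]]            # 3 x 2
V = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 0.0]]    # 2 x 4
x = [1.0, 2.0, 3.0, 4.0]
full_first = matvec(matmat(U, V), x)   # forms the 3 x 4 product first
factored = matvec(U, matvec(V, x))     # never forms the 3 x 4 product

# Hypothetical sizes: batch 128, a 2048 x 1024 weight matrix, rank 256.
saving = factored_mults(128, 2048, 1024, 256) / dense_mults(128, 2048, 1024)
print(full_first, factored, saving)
```

The factored ordering only pays off when the chosen rank is small relative to both matrix dimensions.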

#### 4.1.3 Tensor Train

We perform approximately 20 training runs on the IWSLT English-Vietnamese dataset with the weight matrix decomposed using Tensor Train decomposition. We use a range of maximum ranks and initial core dimensions (see Table 2). All tests were performed using the same parameters as the benchmark runs, other than the learning rate, which is specified as necessary. All runs were performed using one GPU and took approximately 1-2 hours. For the runs which obtain BLEU test $\geq$ 24.0, we report the total percentage of flops used compared with the original model. For the rest, we report the BLEU scores after 12K training steps. The results are given in Table 2.

We find that the IWSLT dataset obtains a BLEU test score 24.0 for core dimensions with learning rate 0.0012 and rank distributions and , for which we obtain BLEU test/dev = 24.0/21.9 and 24.2/21.9 respectively. These runs use 113% and 397% of the flops of the benchmark run respectively. We also find in general that increasing the learning rate increases the BLEU score within a given number of steps, as does a lower and .

| Rank Dist. | Weight Matrix Dimensions | Learning Rate | BLEU test/dev | Flops % |
|---|---|---|---|---|
| Original Model | | | 24.0/22.4 | 100% |
| Low-Rank Model | | | 24.0/22.1 | 69% |
| | | 0.0012 | 23.3/21.8 | 84% |
| | | 0.0008 | 21.7/20.1 | - |
| | | 0.0004 | 18.8/17.3 | - |
| | | 0.0012 | 24.0/21.9 | 113% |
| - | | 0.0008 | 23.0/21.6 | - |
| | | 0.0004 | 21.5/19.9 | - |
| | | 0.0012 | 23.0/20.8 | - |
| | | 0.0008 | 22.3/20.7 | - |
| | | 0.0004 | 20.8/19.0 | - |
| | | 0.0012 | 22.3/20.8 | - |
| | | 0.0008 | 21.8/20.6 | - |
| | | 0.0004 | 19.9/18.6 | - |
| | | 0.0012 | 23.9/22.1 | - |
| | | 0.0008 | 23.7/21.9 | - |
| | | 0.0004 | 23.0/21.3 | - |
| | | 0.0012 | 23.2/21.6 | - |
| | | 0.0008 | 23.1/21.5 | - |
| | | 0.0004 | 21.8/20.4 | - |
| | | 0.0012 | 23.1/21.1 | - |
| | | 0.0008 | 22.3/20.6 | - |
| | | 0.0004 | 20.9/19.4 | - |
| | | 0.0012 | 24.2/21.9 | 397% |
| | | 0.0008 | 24.1/21.6 | 397% |
| | | 0.0004 | 23.3/21.5 | - |
| | | 0.0012 | 23.4/22.0 | - |
| | | 0.0008 | 23.1/21.2 | - |
| | | 0.0004 | 22.9/21.3 | - |
| | | 0.0012 | 23.0/21.5 | - |
| | | 0.0004 | 21.3/20.0 | - |

### 4.2 WMT German-English ’16

For the WMT German-English dataset, we again use hyperparameters similar to the corresponding experiment outlined in [luong17]. We train 4-layer LSTMs of 1024 units with a bidirectional encoder (i.e., 2 bidirectional layers for the encoder) and embedding dimension 1024, using the Adam optimizer with a learning rate of 0.0012. We train for 340K steps (approximately 10 epochs); after 170K steps, we halve the learning rate every 17K steps. The data is split into subword units using BPE (32K operations).

We perform one training run using a rank distribution and core dimensions . This run was performed using 4 GPUs and took approximately 5 days. We obtain a final BLEU test/dev score of 24.0/23.8.

We also attempted training runs using core dimensions and a maximum rank , also on 4 GPUs. We find that for these configurations, the application crashes due to a lack of memory. As the total number of parameters should be less than the original model when using Tensor Train decomposition, we assume this is due to intermediate copies of the matrices being stored. However, this requires further investigation.

## 5 Conclusion and Future Work

We have successfully implemented TT-layers for the TensorFlow NMT model using the t3f Tensor Train library. We have performed training runs on two datasets, the first using the IWSLT English-Vietnamese ’15 dataset and the second with the WMT German-English ’16 dataset. We find that the IWSLT model obtains a BLEU test score 24.0 for the core dimensions with learning rate 0.0012 and rank distributions and , for which we obtain BLEU test/dev = 24.0/21.9 and 24.2/21.9 respectively. We also find that, of the parameters surveyed, a higher learning rate and more ‘rectangular’ weight matrix decomposition, i.e. a lower and , generally produce higher BLEU scores. We have also performed one successful training run using the larger WMT German-English dataset, using core dimensions and rank distribution . We obtained a final BLEU test/dev score of 24.0/23.8.

This work shows that TT-layers can be straightforwardly introduced to the TensorFlow NMT model and can obtain BLEU scores comparable to those of the original. With optimization, there is potential for this model to enable more efficient training using fewer flops, less memory and less overall training time. Training on larger datasets is currently limited by the memory consumption of the model, despite the fact that Tensor Train decomposition should use fewer parameters. This suggests that the t3f library stores intermediate copies of the matrices, which could be addressed in future work on optimization. Finally, there is further scope to apply this decomposition to the Transformer model [transformer], which produces the best BLEU results at the time of writing, with the potential to improve its efficiency.

## Acknowledgements

This work was undertaken on the Fawcett supercomputer at the Department of Applied Mathematics and Theoretical Physics (DAMTP), University of Cambridge, funded by STFC Consolidated Grant ST/P000673/1. We would like to thank Yang You for his work on the low-rank matrix format Singular Value Decomposition, and Cole Hawkins for useful comments.

AD is supported by an EPSRC iCASE Studentship in partnership with Intel (EP/N509620/1, Voucher 16000206).