Tensor Train decomposition on TensorFlow (T3F)

01/05/2018 ∙ by Alexander Novikov, et al. ∙ HSE University, Skoltech, Cornell University

Tensor Train decomposition is used across many branches of machine learning, but until now it lacked an implementation with GPU support, batch processing, automatic differentiation, and versatile functionality for the Riemannian optimization framework, which takes into account the underlying manifold structure in order to construct efficient optimization methods. In this work, we propose a library that aims to fix this and to make machine learning papers that rely on the Tensor Train decomposition easier to implement. The library includes 92% test coverage.


1 Introduction

Methods based on tensor decompositions gain more and more traction in the machine learning community and are used for analyzing theoretical properties of deep learning (Cohen et al., 2016; Cohen and Shashua, 2016; Khrulkov et al., 2017), compactly parametrizing models (Lebedev et al., 2015; Novikov et al., 2015; Yu et al., 2017), training probabilistic models (Anandkumar et al., 2012; Jernite et al., 2013; Song et al., 2013) and deep learning models (Janzamin et al., 2015), parameterizing recommender systems (Frolov and Oseledets, 2017), and more. In this work, we focus on implementing a library for working with one tensor decomposition in particular: the Tensor Train (TT) decomposition (Oseledets, 2011).

Although three different libraries implementing the Tensor Train decomposition already exist (https://github.com/oseledets/ttpy, https://www.mathworks.com/matlabcentral/fileexchange/46312-oseledets-tt-toolbox, https://pypi.python.org/pypi/TensorToolbox/), all the recent papers that use it for machine learning purposes had to rewrite core functionality from scratch. The existing implementations do not support GPU execution or automatic differentiation (which forced Novikov et al. (2015) to derive gradients by hand), do not support parallel processing of a batch of tensors (which forced Novikov et al. (2016) to rewrite basic operations in TensorFlow), and lack advanced support for Riemannian geometry operations, a technique that makes it possible to accelerate the optimization of models constrained to have a compact Tensor Train representation (or any other constraint set that forms a smooth manifold).

With the presented library, we aim to make all the results in machine learning papers utilizing the TT decomposition easy to reproduce and to provide flexible support for developing new ideas. The library is released under the MIT license (https://github.com/Bihaqo/t3f) and is distributed as a PyPI package (https://pypi.python.org/pypi/t3f) to simplify the installation process. The API reference documentation is available online (https://t3f.readthedocs.io). The library includes several Jupyter notebook examples, such as compressing neural network weights by factorizing them into the Tensor Train format or performing tensor completion by assuming that the result has low TT-rank. The library has 92% test coverage.

2 Implementation details

The library provides two base classes, TensorTrain and TensorTrainBatch, that store one tensor in the Tensor Train format and a batch of such tensors respectively, i.e. a list of tensors of the same shape that are supposed to be processed together. These two classes support most of the logic of the tf.Tensor class (e.g. the .op and .name properties and the .get_shape method). Under the hood, they are containers for the factors of the TT-format (which are represented as tf.Tensor objects) plus lightweight meta-information, so users who need to work with the factors directly can easily access them. The rest of the library is a collection of functions that take one or two TT-objects as input and return a TT-object or a tf.Tensor object, depending on the semantics of the particular function. For example, the function t3f.multiply(left, right) implements elementwise multiplication of two TT tensors (or batches of TT tensors), but also supports multiplication of a TT tensor by a number. As an output, this function returns a TensorTrain or a TensorTrainBatch object.
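
As a brief illustration, the sketch below creates two random TT tensors, multiplies them elementwise, and inspects the underlying factors. The functions t3f.random_tensor and t3f.multiply are described in this section; the tt_cores attribute used to access the factors is an assumption about the exact API.

import t3f

# Two random tensors of shape 3 x 4 x 5 stored in the TT-format.
a = t3f.random_tensor((3, 4, 5), tt_rank=2)
b = t3f.random_tensor((3, 4, 5), tt_rank=3)

# Elementwise product; the result is again a TensorTrain object.
c = t3f.multiply(a, b)
print(c.get_shape())        # behaves like tf.Tensor: (3, 4, 5)

# The TT-factors themselves are ordinary tf.Tensor objects
# (tt_cores is an assumed attribute name).
for core in c.tt_cores:
    print(core.get_shape())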

The basic functionality of the library consists of tools for creating tensors (e.g. t3f.ones or t3f.random_tensor), rich indexing of the tensors, elementwise operations (addition and multiplication), matrix-by-matrix multiplication, and SVD-based operations (e.g. factorizing a tensor into the TT-format or rounding a TT-object to find its closest lower-rank approximation). For a complete list of supported operations, see the API reference documentation.
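
As an example of the SVD-based operations, the following sketch factorizes a dense tensor into the TT-format and rounds it to a lower TT-rank; the helper names t3f.to_tt_tensor and t3f.full (for converting back to a dense tensor) are assumptions about the exact API.

import numpy as np
import tensorflow as tf
import t3f

# A dense 8 x 8 x 8 x 8 tensor to be factorized.
dense = tf.constant(np.random.rand(8, 8, 8, 8))

tt = t3f.to_tt_tensor(dense, max_tt_rank=16)   # factorize into the TT-format
tt_rounded = t3f.round(tt, max_tt_rank=4)      # closest lower-rank approximation

# Frobenius-norm error of the rounded approximation.
error = tf.norm(t3f.full(tt_rounded) - dense)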

2.1 Batch processing

Most operations support broadcasting and accept a batch of TT-objects as input. For example, C = t3f.matmul(A, B) for a batch of TT-matrices A and a single TT-matrix B returns a batch of TT-matrices C with C_i = A_i B, where the result is computed in parallel across the batch dimension. There are also operations specifically tailored to batch inputs, such as pairwise_flat_inner(x, y), which computes the matrix of inner products G_ij = ⟨x_i, y_j⟩.
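
The sketch below illustrates both batch broadcasting and the batch-specific operations; the creation helpers t3f.random_matrix_batch, t3f.random_matrix, and t3f.random_tensor_batch are assumptions about the exact API.

import t3f

# A batch of 100 TT-matrices A (modes 4 x 4, i.e. 16 x 16 matrices) and a single TT-matrix B.
A = t3f.random_matrix_batch([[4, 4], [4, 4]], tt_rank=3, batch_size=100)
B = t3f.random_matrix([[4, 4], [4, 4]], tt_rank=3)

# A batch of 100 products computed in parallel: C_i = A_i B.
C = t3f.matmul(A, B)

# Two batches of TT-vectors and the 100 x 100 matrix of their pairwise inner products.
x = t3f.random_tensor_batch((4, 4, 4), tt_rank=3, batch_size=100)
y = t3f.random_tensor_batch((4, 4, 4), tt_rank=3, batch_size=100)
G = t3f.pairwise_flat_inner(x, y)   # tf.Tensor with G[i, j] = <x_i, y_j>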

2.2 Riemannian geometry operations

One of the advantages of the Tensor Train format is that the set of tensors of fixed TT-rank forms a Riemannian manifold, which allows using Riemannian geometry ideas to speed up tensor calculus while preserving theoretical guarantees (see Steinlechner (2016) for more details). The T3F library has rich support for Riemannian operations, the most basic being the projection of a TT-object z (or a batch of them) onto the tangent space of another TT-object x. We denote this projection operation by P_x(z).
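
A minimal sketch of this basic projection, assuming it is exposed as t3f.project with the same what/where calling convention as t3f.project_sum shown below:

import t3f

x = t3f.random_tensor((4, 4, 4), tt_rank=5)   # the point whose tangent space we project onto
z = t3f.random_tensor((4, 4, 4), tt_rank=3)   # the TT-object to be projected

p = t3f.project(what=z, where=x)              # P_x(z), again a TensorTrain object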

Other supported operations are special cases of combining this basic projection with non-Riemannian operations, but they are heavily optimized by exploiting the structure of objects that are projected onto the same tangent space. Such operations include projecting a weighted sum of a batch of TT-objects onto a tangent space (necessary for efficiently computing the Riemannian gradient):

S = t3f.project_sum(what=A, where=B, weights=c)

Mathematically, this function implements the following operation: S = P_B(∑_α c_α A_α). Denote by b the batch size, by d the number of TT-cores, by n the mode size of each axis (i.e. the tensors are of size n × n × ⋯ × n), and by r_A and r_B the TT-ranks of the tensors A_α and B. The same operation can be implemented by a projection followed by summation and rounding, but the cost of that approach is dominated by rounding an intermediate TT-object whose TT-rank grows proportionally to b · r_B, and hence scales cubically in the batch size, while the tailored project_sum operation scales only linearly in b.
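
For comparison, a naive version of the same computation built from generic operations (project each tensor, form the weighted sum, round the result) could look as follows; t3f.project and the multiplication-by-a-number behaviour of t3f.multiply are used as described above, while the batch indexing A[alpha] and the batch creation helper are assumptions about the exact API.

import numpy as np
import t3f

B = t3f.random_tensor((4, 4, 4), tt_rank=5)
A = t3f.random_tensor_batch((4, 4, 4), tt_rank=3, batch_size=10)
c = np.random.rand(10)

# Tailored, asymptotically cheaper operation:
S = t3f.project_sum(what=A, where=B, weights=c)

# Naive equivalent: project every tensor, sum the weighted projections, round back.
# The intermediate sum has a TT-rank that grows with the batch size, which is
# exactly what project_sum avoids.
S_naive = t3f.multiply(t3f.project(what=A[0], where=B), c[0])
for alpha in range(1, 10):
    S_naive += t3f.multiply(t3f.project(what=A[alpha], where=B), c[alpha])
# All summands lie in the same tangent space, so TT-rank 2 * 5 represents the sum exactly.
S_naive = t3f.round(S_naive, max_tt_rank=10)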

Other tailored operations include computing the Gram matrix of a batch of tensors lying in the same tangent space, which exploits the shared tangent-space structure and is therefore asymptotically cheaper than computing the pairwise inner products of general TT-objects; and projecting a matrix-by-vector product onto a tangent space, which is asymptotically cheaper than doing it in two steps (matrix-by-vector multiplication followed by projection) because it never explicitly forms the intermediate product, whose TT-rank is the product of the TT-ranks of the matrix and the vector.

2.3 Kronecker products

For matrices, the TT-format is introduced in a special way (in contrast to treating a matrix as a 2-dimensional tensor), such that the Kronecker product of two matrices is a TT-matrix with two TT-factors and TT-rank 1 (for details see Novikov et al. (2014)). Since the Kronecker product is a special case of a TT-object, the T3F library provides means to work with Kronecker products. For example, one can find the closest approximation (in the Frobenius norm) of a matrix E as a Kronecker product of two matrices of sizes M1 × N1 and M2 × N2 with the following call:

t3f.to_tt_matrix(E, shape=((M1, N1), (M2, N2)), max_tt_rank=1)

Kronecker product matrices allow many operations, such as computing the determinant, the inverse, or the Cholesky decomposition, to be performed much faster than in the general case. These operations are supported in the t3f.kronecker module (see the full list of supported operations in the API reference documentation). However, some operations, such as summing two Kronecker product matrices, result in a general matrix that lacks these properties. To return to the class of Kronecker product matrices, one can compute the closest approximation of a sum of two Kronecker product matrices as a Kronecker product matrix, without ever materializing the large matrix, with the following code:

first = t3f.TensorTrain([A1, B1])    # Kronecker product of A1 and B1 as a rank-1 TT-matrix
second = t3f.TensorTrain([A2, B2])   # Kronecker product of A2 and B2
res_exact = first + second           # exact sum has TT-rank 2 and is not a Kronecker product
res_kronecker_product = t3f.round(res_exact, max_tt_rank=1)  # closest Kronecker product approximation

3 Benchmarking

In this section, we benchmark the basic functionality of T3F on CPU and GPU and compare its performance against TTPY, the most actively developed alternative library. To reproduce the benchmark on your hardware, see the examples/profile folder in the T3F repository.

For benchmarking, we generated a batch of 100 random TT-matrices whose TT-representations consist of 10 factors and have TT-rank 10, and a batch of 100 random TT-vectors of matching size. We benchmarked matrix-by-vector multiplication (‘matvec’), matrix-by-matrix multiplication (‘matmul’), computing the Frobenius norm (‘norm’), and computing the Gram matrix of 1 or of 100 TT-vectors (‘gram’; in the case of one vector this is just the dot product of the vector with itself). There are also two additional operations with different inputs: rounding 1 or a batch of 100 TT-vectors of TT-rank 100 to the closest TT-rank-10 TT-vectors (‘round’), and projecting 1 or a batch of 100 TT-vectors of TT-rank 100 onto the tangent space of a TT-vector of TT-rank 10 (‘project’). The results are reported in Table 1. Note that TTPY lacks GPU and batch processing support. In the batch case, the time is reported per object; e.g. it actually takes 0.3 ms to process a batch of 100 matrix-by-vector multiplications, but the number 0.003 is reported in the table.

Op      | TTPY (1 object, CPU) | T3F (1 object, CPU) | T3F (1 object, GPU) | T3F (100 objects, CPU) | T3F (100 objects, GPU)
matvec  |  11.142              | 0.129               | 0.121               | 0.003                  | 0.003
matmul  |  86.191              | 0.125               | 0.133               | 0.004                  | 0.004
norm    |   3.790              | 1.902               | 0.893               | 0.422                  | 0.050
round   |  73.027              | 0.159               | 0.165               | 0.006                  | 0.006
gram    |   0.145              | 0.806               | 0.703               | 0.029                  | 0.001
project | 116.868              | 1.564               | 1.658               | 0.017                  | 0.018

Table 1: Time in ms for different operations implemented in the TTPY library vs T3F on CPU and GPU. The timing for a batch of 100 objects is reported per single object. The comparison was made in double precision on an NVIDIA DGX-1 station with Tesla V100 GPUs (using only one GPU at a time) and an Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz with 80 logical cores.
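
As a rough illustration of how such timings can be obtained (the bundled benchmark in examples/profile is more careful), the following TF 1.x style sketch times the batch matrix-by-vector multiplication; the mode sizes are illustrative, and t3f.random_matrix_batch, t3f.random_matrix, and the tt_cores attribute are assumptions about the exact API.

import time
import tensorflow as tf
import t3f

# A batch of 100 TT-matrices with 10 TT-cores and TT-rank 10, and one TT-vector
# (a TT-matrix whose column modes are all equal to 1).
A = t3f.random_matrix_batch([[4] * 10, [4] * 10], tt_rank=10, batch_size=100)
x = t3f.random_matrix([[4] * 10, [1] * 10], tt_rank=10)
y = t3f.matmul(A, x)   # batch matrix-by-vector multiplication

with tf.Session() as sess:
    sess.run(y.tt_cores)   # warm-up run to exclude graph construction overhead
    start = time.time()
    n_runs = 10
    for _ in range(n_runs):
        sess.run(y.tt_cores)
    per_object_ms = 1000 * (time.time() - start) / n_runs / 100
    print('matvec: %0.3f ms per object' % per_object_ms)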

References

  • Anandkumar et al. (2012) Anima Anandkumar, Dean P. Foster, Daniel J. Hsu, Sham M. Kakade, and Yi-Kai Liu. A spectral algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 25, pages 917–925, 2012.
  • Cohen and Shashua (2016) N. Cohen and A. Shashua. Convolutional rectifier networks as generalized tensor decompositions. In International Conference on Machine Learning, pages 955–963, 2016.
  • Cohen et al. (2016) N. Cohen, O. Sharir, and A. Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory, pages 698–728, 2016.
  • Frolov and Oseledets (2017) Evgeny Frolov and Ivan Oseledets. Tensor methods and recommender systems. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 7(3), 2017.
  • Janzamin et al. (2015) Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. In Advances in Neural Information Processing Systems 28. 2015.
  • Jernite et al. (2013) Y. Jernite, Y. Halpern, and D. Sontag. Discovering hidden variables in noisy-or networks using quartet tests. In Neural Information Processing Systems (NIPS), 2013.
  • Khrulkov et al. (2017) V. Khrulkov, A. Novikov, and I. Oseledets. Expressive power of recurrent neural networks. arXiv preprint arXiv:1711.00811, 2017.
  • Lebedev et al. (2015) V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
  • Novikov et al. (2014) A. Novikov, A. Rodomanov, A. Osokin, and D. Vetrov. Putting MRFs on a Tensor Train. In International Conference on Machine Learning (ICML), 2014.
  • Novikov et al. (2015) A. Novikov, D. Podoprikhin, A. Osokin, and D. Vetrov. Tensorizing neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 442–450, 2015.
  • Novikov et al. (2016) A. Novikov, M. Trofimov, and I. Oseledets. Exponential machines. arXiv preprint arXiv:1605.03795, 2016.
  • Oseledets (2011) I. V. Oseledets. Tensor-Train decomposition. SIAM J. Scientific Computing, 33(5):2295–2317, 2011.
  • Song et al. (2013) L. Song, M. Ishteva, A. Parikh, E. Xing, and H. Park. Hierarchical tensor decomposition of latent tree graphical models. In International Conference on Machine Learning (ICML), 2013.
  • Steinlechner (2016) M. Steinlechner. Riemannian optimization for solving high-dimensional problems with low-rank tensor structure. PhD thesis, École Polytechnique Fédérale de Lausanne, 2016.
  • Yu et al. (2017) Rose Yu, Stephan Zheng, Anima Anandkumar, and Yisong Yue. Long-term forecasting using tensor-train rnns. arXiv preprint arXiv:1711.00073, 2017.