TensorNetwork for Machine Learning

06/07/2019 ∙ by Stavros Efthymiou, et al.

We demonstrate the use of tensor networks for image classification with the TensorNetwork open source library. We explain in detail the encoding of image data into a matrix product state form, and describe how to contract the network in a way that is parallelizable and well-suited to automatic gradients for optimization. Applying the technique to the MNIST and Fashion-MNIST datasets we find out-of-the-box performance of 98% and 88% accuracy, respectively, using the same tensor network architecture. The TensorNetwork library allows us to seamlessly move from CPU to GPU hardware, and we see a factor of more than 10 improvement in computational speed using a GPU.

I Introduction

Tensor networks have seen numerous applications in the physical sciences Fannes ; White ; Vidal ; Perez-Garcia ; MERA ; MERA2 ; MERAalgorithms ; Shi ; Tagliacozzo ; Murg ; PEPS1 ; PEPS2 ; PEPS3 ; rev1 ; rev2 ; rev3 ; rev4 ; rev5 ; QC1 ; QC2 ; QC3 ; QC4 ; cMPS ; cMERA ; CTMRG ; TRG ; TEFRG ; TNR ; Swingle ; dS1 ; dS2 ; dS3 ; MERAgeometry , but there has been significant progress recently in applying the same methods to problems in machine learning ML1 ; ML3 ; ML9 ; ML2 ; ML4 ; ML5 ; ML6 ; ML7 ; ML8 ; ML10 ; ML11 . The TensorNetwork library library was created to facilitate this research and accelerate the adoption of tensor network methods by the ML community. In a previous paper paper1 we showed how TensorNetwork could be used in a physics setting. Here we illustrate how to use a matrix product state (MPS) tensor network to classify MNIST and Fashion-MNIST images. The basic technique was applied to the MNIST dataset by Stoudenmire and Schwab ML1 , who adapted the DMRG algorithm from physics White to train the network. Our implementation differs from theirs and follows more closely the implementation in the specialized TorchMPS library torchmps . The most significant change from Stoudenmire and Schwab is that we use automatic gradients to train the model rather than DMRG. This method of training is familiar to ML practitioners and is built into TensorFlow TensorFlow , which is the TensorNetwork backend we use. We also empirically find it useful to use a different contraction order from that of Stoudenmire and Schwab when computing components of the MPS, for ease of parallelization. MPS methods were previously applied to Fashion-MNIST in fashionMPS , and we achieve the same accuracy here. In terms of speed, we note that a factor of more than 10 improvement is gained moving from CPU to GPU hardware using the same code with a TensorFlow backend.

Up to a few tweaks of implementation, the main strategies we employ for image classification can be found elsewhere. The purpose of this note is to be a resource for machine learning practitioners who wish to learn how tensor networks can be applied to classification problems. An important part of this is the code that we have uploaded to GitHub library , which we hope will also be a valuable resource for the community.

To summarize the results, we find that using an MPS tensor network for classification results in 98% test accuracy for MNIST and 88% test accuracy for Fashion-MNIST. The only hyperparameter in the model is the bond dimension, $\chi$, but we find that the results are largely independent of $\chi$ over the range of bond dimensions we tested (see Fig. 7). We also compare training times on CPU and GPU, and find that the GPU leads to about a 10x speedup over the 64-core CPU.

II Setup

II.1 Encoding Data in a Tensor Network

In this section we will briefly review the structure of a tensor network and how it is used to encode image data. See library for further background that does not assume expertise in quantum physics.

The motivation for using a tensor network for data analysis is similar to the motivation for using a kernel method. Even though tensor networks are most useful for linear operations on data, they act on a very high-dimensional space, and that high dimensionality can be leveraged into representation power. We begin by carefully defining that high-dimensional space.

For us an image consists of a list of $N$ greyscale pixels. In the MNIST and Fashion-MNIST datasets $N = 28 \times 28 = 784$. By flattening the images into a list we lose the two-dimensional nature of the image. This is obviously a drawback and will negatively impact the performance, but our goal is just to illustrate the ideas of a tensor network in a simple application. More complicated schemes could be employed to get around this limitation. See, for example, ML9 for a two-dimensional tensor network applied to image classification.

The pixel values of an image will be encoded in a $2^N$-dimensional vector space as follows. First, each pixel of the image is encoded into its own two-dimensional pixel space according to a local feature map. One local feature map, used in ML1 , is

$$\Phi(p) = \begin{pmatrix} \cos(\pi p / 2) \\ \sin(\pi p / 2) \end{pmatrix}.$$

Here $p$ is the pixel value normalized to unit range. Notice that the pixel values $p = 0$ and $p = 1$ get mapped to independent vectors in this space, and all other pixel values are linear combinations of those. If the image were purely black-and-white this would simply be a one-hot encoding of the pixel values. Another feature map which has similar properties is

$$\Phi(p) = \begin{pmatrix} 1 - p \\ p \end{pmatrix}.$$

In practice we will use the latter, linear feature map, though this is not very important. With color images it would make sense to consider a higher-dimensional pixel space that encodes the RGB values of each pixel. Finally, by way of notation we will refer to the components $\Phi^{i}(p)$, where the index $i$ takes on two values. So for the linear feature map, $\Phi^{1}(p) = 1 - p$ and $\Phi^{2}(p) = p$.
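The two local feature maps can be written as short Python functions. This is a minimal sketch rather than code from the TensorNetwork repository: it assumes the map of ML1 has components (cos(πp/2), sin(πp/2)), that the linear map is ordered as (1 − p, p), and the function names are hypothetical.

```python
import math

def trig_feature_map(p):
    """Map a pixel value p in [0, 1] to (cos(pi p / 2), sin(pi p / 2))."""
    return (math.cos(math.pi * p / 2), math.sin(math.pi * p / 2))

def linear_feature_map(p):
    """Map a pixel value p in [0, 1] to (1 - p, p) (assumed component order)."""
    return (1.0 - p, p)

# Compare the two maps at the end points and at a grey value.
for p in (0.0, 0.5, 1.0):
    print(p, trig_feature_map(p), linear_feature_map(p))
```

Both maps send p = 0 and p = 1 to the two basis vectors of the pixel space (up to floating-point rounding for the trigonometric map), so a purely black-and-white image is one-hot encoded either way.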

The total image space is defined as the tensor product of all of the pixel spaces. A key property of this space is that flipping a single pixel value from black to white results in an independent image vector. In equations, an image $x$ with pixel values $p_1, p_2, \ldots, p_N$ is encoded in the image space as

$$\Phi(x) = \Phi(p_1) \otimes \Phi(p_2) \otimes \cdots \otimes \Phi(p_N).$$

We will refer to this object as a “data tensor” or “data state.” The data tensor has $2^N$ components, each of which is the product of one of the two components of the local feature map (e.g., $1 - p_j$ or $p_j$ for the linear feature map) over all of the pixels. In index notation, the components of the data tensor are given by

$$\Phi^{i_1 i_2 \cdots i_N}(x) = \Phi^{i_1}(p_1)\, \Phi^{i_2}(p_2) \cdots \Phi^{i_N}(p_N).$$

Since there are two possible values for each of the $N$ indices, the total number of components is $2^N$.
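For a handful of pixels the data tensor can be built explicitly, which makes the counting and the independence property easy to check. The sketch below is illustrative only: it assumes the linear feature map ordered as (1 − p, p), uses zero-based index values, stores the tensor as a dictionary keyed by index tuples, and all function names are hypothetical.

```python
import itertools

def linear_feature_map(p):
    """Assumed linear local feature map: p in [0, 1] -> (1 - p, p)."""
    return (1.0 - p, p)

def data_tensor(pixels):
    """Map each index tuple (i_1, ..., i_N) to the product of local components."""
    local = [linear_feature_map(p) for p in pixels]
    tensor = {}
    for indices in itertools.product(range(2), repeat=len(pixels)):
        value = 1.0
        for site, i in enumerate(indices):
            value *= local[site][i]
        tensor[indices] = value
    return tensor

def inner(a, b):
    """Inner product of two data tensors stored as dictionaries."""
    return sum(a[key] * b[key] for key in a)

phi = data_tensor([0.0, 1.0, 0.25])    # a three-pixel "image"
print(len(phi))                        # number of components, 2**3

# Flipping one pixel of a two-valued image gives an orthogonal data tensor.
black_white = data_tensor([0.0, 1.0, 0.0])
flipped = data_tensor([1.0, 1.0, 0.0])
print(inner(black_white, flipped))
```

The dictionary grows as 2^N, so this representation is only usable for toy sizes; for the 784 pixels of MNIST the data tensor is never formed explicitly.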

Figure 1: Matrix Product State (MPS) tensor network. The free indices correspond to the pixels in the image.

The MPS tensor network represents another type of vector in the image space. As such, it also has $2^N$ components, and the MPS writes each of those components in terms of the product of $N$ matrices. Letting $T$ represent the total MPS tensor, we have

$$T^{i_1 i_2 \cdots i_N} = \sum_{\alpha_1, \ldots, \alpha_{N-1}} A^{(1)\, i_1}_{\alpha_1}\, A^{(2)\, i_2}_{\alpha_1 \alpha_2} \cdots A^{(N)\, i_N}_{\alpha_{N-1}}. \tag{1}$$

The ranges of each of the $\alpha_j$ indices, called the bond dimensions, are hyperparameters of the model. The bond dimensions determine the sizes of the $A^{(j)}$ tensors. The components of the $A^{(j)}$ tensors are variational parameters, or weights, that are fixed via training. There is some redundancy in those parameters, known as gauge freedom gauge , but we will not concern ourselves with that here.
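To make the matrix-product structure of Eq. (1) concrete, the following sketch evaluates a single component as a product of matrices and counts the variational parameters. It rests on several assumptions: open boundaries with boundary bonds of size 1, one common bond dimension, no label index, and random weights. The helper names are hypothetical.

```python
import random

def random_mps(n_sites, bond_dim, phys_dim=2, seed=0):
    """Random MPS as a list of tensors indexed as tensor[left][i][right]."""
    rng = random.Random(seed)
    dims = [1] + [bond_dim] * (n_sites - 1) + [1]   # boundary bonds have size 1
    return [[[[rng.uniform(-1, 1) for _ in range(dims[j + 1])]
              for _ in range(phys_dim)]
             for _ in range(dims[j])]
            for j in range(n_sites)]

def mps_component(mps, indices):
    """Evaluate one component T^{i_1...i_N} as a product of N matrices."""
    vec = [1.0]                                     # row vector on the left boundary bond
    for tensor, i in zip(mps, indices):
        n_right = len(tensor[0][i])
        vec = [sum(vec[a] * tensor[a][i][b] for a in range(len(vec)))
               for b in range(n_right)]
    return vec[0]

def parameter_count(n_sites, bond_dim, phys_dim=2):
    """Number of weights: two boundary tensors plus (n_sites - 2) bulk tensors."""
    return 2 * phys_dim * bond_dim + (n_sites - 2) * phys_dim * bond_dim ** 2

mps = random_mps(n_sites=4, bond_dim=3)
print(mps_component(mps, (0, 1, 1, 0)))
print(parameter_count(784, 10), "weights versus 2**784 tensor components")
```

The parameter count grows linearly with the number of pixels and quadratically with the bond dimension, whereas the number of components of the full tensor grows exponentially with the number of pixels.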

To summarize, we have described an image space and an embedding of our data into that space in the form of the data tensors. The MPS tensor is another vector in that space which is not itself the data tensor of a single image. Intuitively speaking, we would like the MPS tensor to be equal to a linear combination of all of the images in a given class. An image not belonging to the class will be orthogonal to the MPS tensor $T$, while an image belonging to the class will not. In the next section we will describe how to train the MPS to have this property. When we have multiple classes to label, we can either construct a separate MPS vector for each class or add an extra “label” node to the MPS to keep track of the class. The details of this are discussed in the following sections.

II.2 Objective Function

The classification task can be expressed as finding a map $f$ from the space of objects to the space of labels. In the MNIST case $f$ should map each handwritten image to the corresponding digit from the set $\{0, 1, \ldots, 9\}$. In a machine learning setting $f$ is parametrized using a large number of variational parameters that are then optimized using pairs $(x_n, y_n)$ of labeled examples. Here $x_n$ represents a flattened image and $y_n$ the corresponding label (with $x_n$ a list of 784 pixel values and $y_n \in \{0, 1, \ldots, 9\}$ for MNIST). Typical choices for such parametrization range from a simple linear regression or support vector machines to more complicated deep neural networks. In our setting, following ML1 , we define the classification map as follows: First we calculate the inner product between the encoded image vector $\Phi(x)$ (see Section II.1) and a variational MPS $T$:

$$f^{l}(x) = \sum_{i_1, \ldots, i_N} T^{l;\, i_1 i_2 \cdots i_N}\, \Phi^{i_1 i_2 \cdots i_N}(x). \tag{2}$$

Figure 2: Inner product between the variational MPS (blue nodes) and the encoded data vector (red nodes). Notice the free MPS label index, depicted in lighter grey.

The inner product is depicted in Fig. 2 in tensor network graphical notation. Note that all pixel indices of the MPS are contracted with data, except the label index $l$, which is free and used to distinguish the different labels. The position of the label index in the MPS chain is arbitrary, and a typical choice is in the middle of the chain. Writing $f^{l}(x)$ for the inner product with label index $l$, the classification map is defined as:

$$f(x) = \operatorname{argmax}_{l}\, f^{l}(x).$$

In other words, for each image we select the label whose MPS has the largest overlap with the corresponding encoded image vector.
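The classification rule can be tried out on a toy problem where every tensor is small enough to store densely. The sketch below is a brute-force illustration of the definition and not the contraction scheme used in practice: it assumes the linear feature map ordered as (1 − p, p), represents the weight tensor of each label as a dictionary over index tuples, and uses hypothetical function names and hand-picked toy weights.

```python
import itertools

def data_tensor(pixels):
    """Components of the data tensor for the assumed linear feature map (1 - p, p)."""
    tensor = {}
    for indices in itertools.product(range(2), repeat=len(pixels)):
        value = 1.0
        for p, i in zip(pixels, indices):
            value *= (1.0 - p, p)[i]
        tensor[indices] = value
    return tensor

def label_overlaps(weights, pixels):
    """One overlap per label: sum over index tuples of weight times data component."""
    phi = data_tensor(pixels)
    return [sum(w[indices] * phi[indices] for indices in phi) for w in weights]

def classify(overlaps):
    """Return the label with the largest overlap (the argmax)."""
    return max(range(len(overlaps)), key=lambda label: overlaps[label])

# Toy weights: label 0 responds to the all-zero image, label 1 to the all-one image.
weights = [data_tensor([0.0, 0.0, 0.0]), data_tensor([1.0, 1.0, 1.0])]
print(classify(label_overlaps(weights, [0.9, 0.8, 1.0])))   # image close to all-one
print(classify(label_overlaps(weights, [0.1, 0.0, 0.2])))   # image close to all-zero
```

Here each weight tensor is the data tensor of a single prototype image, the simplest instance of the picture in which the tensor for a class is a linear combination of images from that class.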

Following the typical machine learning procedure, the variational parameters that define each MPS (see Eq. (1)) should be tuned to minimize an objective function on the training set. In the original work ML1 , the average mean squared error was chosen as the objective function. Here we choose instead to optimize the multi-class cross-entropy defined on our training set as:

$$\mathcal{L} = -\frac{1}{M} \sum_{n=1}^{M} \log \operatorname{softmax} f^{y_n}(x_n), \tag{3}$$

where $M$ is the number of training pairs $(x_n, y_n)$, $f^{l}(x)$ is the inner product of Eq. (2), and

$$\operatorname{softmax} f^{y_n}(x_n) = \frac{e^{f^{y_n}(x_n)}}{\sum_{l} e^{f^{l}(x_n)}}.$$

Note that the outputs of the softmax function can be interpreted as the predicted probabilities for each label. The final prediction corresponds to the label with the maximum probability. The cross-entropy following a softmax activation is a choice that is well suited to classification problems, as is known from standard machine learning methods.
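A softmax cross-entropy of this kind can be computed from the per-label overlaps in a few lines. The sketch below is a plain-Python illustration under stated assumptions: the loss is averaged over the batch, the log-sum-exp shift is a standard numerical-stability device that is not discussed in the text, and the function names are hypothetical.

```python
import math

def log_softmax(scores):
    """Log of the softmax probabilities, shifted by the maximum for stability."""
    shift = max(scores)
    log_norm = shift + math.log(sum(math.exp(s - shift) for s in scores))
    return [s - log_norm for s in scores]

def cross_entropy(batch_scores, labels):
    """Average negative log-probability assigned to the correct labels."""
    total = 0.0
    for scores, label in zip(batch_scores, labels):
        total -= log_softmax(scores)[label]
    return total / len(labels)

# Overlaps for two images and three labels, with correct labels 0 and 2.
batch_scores = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
print(cross_entropy(batch_scores, [0, 2]))
```

With identical overlaps for all labels the loss equals the logarithm of the number of labels, which is a useful sanity check at the start of training.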

II.3 Implementation

High-level frameworks such as TensorFlow allow for an efficient implementation of standard machine learning algorithms, such as neural networks, through a very simple API. One of the reasons for this simplicity is the automatic calculation of gradients used in the famous backpropagation algorithm backprop . The user only needs to define the forward pass of the model, which is generally straightforward, while the more complicated backward pass is handled automatically by the library.

Gradient optimization methods are not the typical choice for optimizing tensor networks. In most physics applications, sweeping algorithms, such as the celebrated Density Matrix Renormalization Group (DMRG) White , are preferred, as they can lead to faster convergence. Nevertheless, a “brute force” optimization using the gradients of the objective function will work for the tensor network case, too. This approach may be suboptimal when compared to a more sophisticated sweeping method, but the simplicity of the underlying code for gradient-based optimization when written in a high-level machine learning library can be more attractive to machine learning practitioners.

Figure 3: Contract MPS in order.

The way the automatic differentiation works is strongly dependent on how the forward pass is defined by the user. When using high-level libraries, one should take advantage of the efficiency of vectorized operations, as this will lead to a more efficient forward pass and possibly more efficient gradients as well. In our case, the forward pass amounts to calculating the inner product of Eq. (2) (see Fig. 2).

A straightforward way to calculate this inner product (or equivalently to contract the tensor network), which is also commonly used in sweeping methods, is depicted in Fig. 3. Let $d$ denote the dimension of the pixel space (size of vertical legs in the figure, where $d = 2$ for the feature map described in Section II.1) and $\chi$ the bond dimension (size of horizontal legs), which we assume to be constant across the MPS. Contracting the vertical and horizontal legs of one site then costs $\mathcal{O}(d\chi^2)$ to leading order. Assuming that we continue the contraction as depicted from left to right, once we pass the free label index we will have to keep track of the label for the rest of the contractions, increasing all subsequent costs by a factor of $N_{\text{labels}}$ (the number of different labels) and leading to a total cost of order $\mathcal{O}(N N_{\text{labels}} d \chi^2)$ for a chain of $N$ sites. An easy way to avoid the extra factor of $N_{\text{labels}}$ is to start a contraction from both ends of the chain and contract with the tensor that carries the label index in the final step, resulting in an improved total cost of $\mathcal{O}(N d \chi^2)$. Note that this analysis only takes into account the forward pass, that is calculating $f^{l}(x)$ for a given image $x$, and not its gradients with respect to the MPS parameters, for which we cannot avoid the additional factors.
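The contraction from both ends of the chain can be sketched in plain Python. This is an illustration under stated assumptions rather than the repository code: the label index is assumed to sit on one of the site tensors (stored as one site tensor per label), the feature map is assumed to be the linear map (1 − p, p), boundary bonds have size 1, weights are random, and all helper names are hypothetical.

```python
import random

def random_tensor(left, right, rng, phys_dim=2):
    """Site tensor stored as tensor[a][i][b] (left bond, pixel index, right bond)."""
    return [[[rng.uniform(-1, 1) for _ in range(right)]
             for _ in range(phys_dim)]
            for _ in range(left)]

def random_labeled_mps(n_sites, bond_dim, n_labels, label_site, seed=0):
    """Open-boundary MPS; the entry at label_site holds one site tensor per label."""
    rng = random.Random(seed)
    dims = [1] + [bond_dim] * (n_sites - 1) + [1]
    mps = []
    for j in range(n_sites):
        if j == label_site:
            mps.append([random_tensor(dims[j], dims[j + 1], rng)
                        for _ in range(n_labels)])
        else:
            mps.append(random_tensor(dims[j], dims[j + 1], rng))
    return mps

def site_matrix(tensor, p):
    """Contract the pixel index with the assumed linear feature vector (1 - p, p)."""
    phi = (1.0 - p, p)
    return [[sum(phi[i] * row[i][b] for i in range(2)) for b in range(len(row[0]))]
            for row in tensor]

def overlaps_from_both_ends(mps, label_site, pixels):
    """Sweep inwards from both ends; touch the label tensor only in the last step."""
    left = [1.0]
    for j in range(label_site):                          # vector-matrix products
        m = site_matrix(mps[j], pixels[j])
        left = [sum(left[a] * m[a][b] for a in range(len(left)))
                for b in range(len(m[0]))]
    right = [1.0]
    for j in range(len(mps) - 1, label_site, -1):        # matrix-vector products
        m = site_matrix(mps[j], pixels[j])
        right = [sum(m[a][b] * right[b] for b in range(len(right)))
                 for a in range(len(m))]
    scores = []
    for label_tensor in mps[label_site]:                 # one overlap per label
        m = site_matrix(label_tensor, pixels[label_site])
        scores.append(sum(left[a] * m[a][b] * right[b]
                          for a in range(len(left)) for b in range(len(right))))
    return scores

pixels = [0.0, 0.3, 1.0, 0.7, 0.5, 0.2]
mps = random_labeled_mps(n_sites=6, bond_dim=4, n_labels=3, label_site=3)
scores = overlaps_from_both_ends(mps, 3, pixels)
print(scores)
```

Only the final loop depends on the number of labels, which is how the extra label factor is confined to a single step of the forward pass.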

Figure 4: Parallelized MPS contraction. (i) Contract all of the pixel indices with a data tensor. This creates new effective tensors depicted as dark blue squares. (ii) Contract the new tensors in pairs. This step can be done independently and in parallel for each pair. The result is a new chain with half as many effective tensors. (iii) Contract again in pairs, and repeat until the chain is fully contracted.

Although this method of contraction is expected to work well when coded using a low-level language, we empirically find it to be suboptimal for our specific application and our particular choices of parameters, both in the forward pass and in the automatic backward pass. We follow an alternative contraction order inspired by the implementation of torchmps and depicted in Fig. 4. The total cost of this contraction has two terms: the first comes from step (i) and the second from the consecutive pairwise contractions. This method requires fewer contractions per site; however, the cost scales as $\chi^3$ in the bond dimension $\chi$, as it requires matrix-matrix multiplications. In contrast, the first method only has matrix-vector contractions. Even though the total cost is higher asymptotically for the second method, an advantage is that each step is easy to parallelize, as the matrix multiplications are independent and do not require results from the neighboring calculation (as they do in the first method). This is of particular importance when using a machine learning library such as TensorFlow, as the supported batching operations can be used to easily implement these contractions in parallel. We note that this implementation leads not only to a faster forward pass, but also to a more efficient automatic gradient calculation.
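The pairwise scheme of Fig. 4 can be mimicked with nested lists to check that it gives the same number as a left-to-right product. The sketch below is sequential and only illustrates the contraction order: it omits the label index, assumes the linear feature map (1 − p, p), carries an unpaired last matrix over to the next round when the chain length is odd, and uses hypothetical helper names. In a library with batched matrix multiplication, all products within one round could be issued as a single batched call.

```python
import random
from functools import reduce

def matmul(a, b):
    """Plain-Python matrix product of nested lists."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))]
            for r in range(len(a))]

def random_mps(n_sites, bond_dim, seed=0):
    """Random open-boundary MPS with tensors indexed as tensor[left][i][right]."""
    rng = random.Random(seed)
    dims = [1] + [bond_dim] * (n_sites - 1) + [1]
    return [[[[rng.uniform(-1, 1) for _ in range(dims[j + 1])]
              for _ in range(2)]
             for _ in range(dims[j])]
            for j in range(n_sites)]

def contract_pixels(mps, pixels):
    """Step (i): contract every pixel index with its feature vector (1 - p, p)."""
    matrices = []
    for tensor, p in zip(mps, pixels):
        phi = (1.0 - p, p)
        matrices.append([[sum(phi[i] * row[i][b] for i in range(2))
                          for b in range(len(row[0]))]
                         for row in tensor])
    return matrices

def pairwise_contract(matrices):
    """Steps (ii)-(iii): multiply neighbours in pairs until one matrix is left."""
    while len(matrices) > 1:
        paired = [matmul(matrices[k], matrices[k + 1])
                  for k in range(0, len(matrices) - 1, 2)]
        if len(matrices) % 2:            # odd length: carry the last matrix over
            paired.append(matrices[-1])
        matrices = paired
    return matrices[0]

pixels = [0.1, 0.9, 0.4, 0.0, 1.0, 0.6, 0.3]
matrices = contract_pixels(random_mps(len(pixels), bond_dim=3), pixels)
tree_result = pairwise_contract(matrices)[0][0]
chain_result = reduce(matmul, matrices)[0][0]
print(tree_result, chain_result)
```

The two results agree up to floating-point rounding because matrix multiplication is associative; what changes between the two orders is the cost per step and how much of the work is independent.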

II.4 Optimization

As described in the previous section, upon defining our model’s forward pass as depicted in Fig. 4, TensorFlow automatically calculates the gradients of the loss function in Eq. (3). These gradients are then used to minimize the loss via a stochastic gradient descent method. A typical setting that we find to perform well is to use the Adam optimizer adam , with the batch size chosen according to the total amount of training data used. We note that in the original sweeping DMRG-like optimization method proposed in ML1 each step updates two of the MPS tensors. In contrast, in gradient-based methods using automatic differentiation, typically all variational tensors are updated simultaneously in each update step.
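The simultaneous update of all parameters can be illustrated with a hand-written Adam step. This is a toy sketch and not the training loop used for the results: a quadratic function stands in for the loss of Eq. (3), its gradient is written analytically instead of being obtained by automatic differentiation, the learning rate is an arbitrary illustrative value, and the remaining defaults are the ones commonly quoted for Adam. The function names are hypothetical.

```python
import math

def adam_step(params, grads, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update applied to every parameter at once; returns new parameters."""
    state["t"] += 1
    t = state["t"]
    new_params = []
    for k, (theta, g) in enumerate(zip(params, grads)):
        state["m"][k] = beta1 * state["m"][k] + (1 - beta1) * g        # first moment
        state["v"][k] = beta2 * state["v"][k] + (1 - beta2) * g * g    # second moment
        m_hat = state["m"][k] / (1 - beta1 ** t)                       # bias correction
        v_hat = state["v"][k] / (1 - beta2 ** t)
        new_params.append(theta - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_params

def loss(params):
    """Toy quadratic loss standing in for the cross-entropy."""
    return sum(theta ** 2 for theta in params)

params = [1.0, -2.0]
state = {"t": 0, "m": [0.0] * len(params), "v": [0.0] * len(params)}
initial_loss = loss(params)
for _ in range(200):
    grads = [2 * theta for theta in params]   # analytic gradient of the toy loss
    params = adam_step(params, grads, state, lr=0.05)
final_loss = loss(params)
print(initial_loss, final_loss)
```

In the MPS setting the list of parameters would hold the entries of all site tensors, so a single call updates the whole network, in contrast to a sweep that changes two tensors at a time.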

A disadvantage of this method, when implemented naively, is that the bond dimension $\chi$ is an additional hyperparameter that is set a priori and is kept constant during training. In the sweeping implementation, a singular value decomposition (SVD) step allows the bond dimensions to change adaptively during training. This is a particularly interesting feature as it allows the model to change the number of its variational parameters according to the complexity of the data to be learned.

III Results

Figure 5: Evolution of training (solid) and test (dashed) loss and accuracy during training on the MNIST dataset.
Figure 6: Evolution of training (solid) and test (dashed) loss and accuracy during training on the Fashion-MNIST dataset.

We implement the contraction method described in Fig. 4 using TensorNetwork with the TensorFlow backend, and we optimize using the built-in automatic differentiation and the Adam optimizer.

We first train on the total MNIST training set consisting of 60,000 images of size $28 \times 28$ and we show the relevant training history in Fig. 5. Each training epoch corresponds to a full iteration over the training set. We observe that with this setting the model converges to almost 100% training accuracy and about 98% test accuracy. Here test accuracy is calculated on the whole test dataset consisting of 10,000 images. Training with automatic gradients is found to be independent of the bond dimension used, with the largest bond dimensions being of course more computationally expensive. In Fig. 7 left we plot the final training and test accuracies as a function of the bond dimension. Here the test accuracy is calculated on the MNIST test set of 10,000 images once training is completed. We again find no dependence on the bond dimension, obtaining about 98% test accuracy, in agreement with previous works. Furthermore, we compare optimizing with our softmax cross-entropy loss (denoted as CE) with the original mean square loss from ML1 and we find no significant difference in terms of final accuracies. The square loss is more efficient to calculate and thus leads to slightly faster convergence in terms of actual wall time; however, the difference becomes negligible as the bond dimension is increased, when contractions dominate the computation time.

Figure 7: Final accuracies on the full train (60,000 images) and test (10,000) sets as a function of bond dimension. The left plot is on the MNIST dataset of hand-written digits and the right plot on Fashion-MNIST. Blue color (squares and circles) corresponds to the training set and red color (triangles and diamonds) to the test set. For MNIST we also compare performance using the cross-entropy loss (solid lines) to the mean square loss (dotted lines) employed in ML1 .

We repeat the same optimization schedule for Fashion-MNIST, a dataset that has exactly the same structure as MNIST ($28 \times 28$ grayscale images of clothing with 10 labels) but is significantly harder to classify, with state-of-the-art deep learning methods obtaining about 93% test accuracy fashionSOTA . The corresponding training dynamics are shown in Fig. 6. As demonstrated in Fig. 7 right we are able to obtain 88% test accuracy and we again find no significant dependence on the bond dimension.

Finally, it is well known that the use of accelerators such as GPUs can greatly reduce training time for many machine learning models, particularly in cases where complexity is dominated by linear algebra operations. Since this is the case with tensor networks, we expect to see some advantage in our methods. TensorNetwork with the TensorFlow backend allows direct implementation on a GPU without any changes in the code. In Fig. 8 we verify the advantage, with the GPU being four times faster for the smallest bond dimension and the relative speed-up increasing with increasing bond dimension.

Figure 8: Wall-time required per optimization epoch when implementing the same TensorNetwork/TensorFlow code on GPU and CPU. One epoch corresponds to a full iteration over the whole training set.

IV Conclusion

Here we have described a tensor network algorithm for image classification using MPS tensor networks. All of the code required for reproducing the results with the open source TensorNetwork library is available on GitHub library . We are hopeful that the techniques we reviewed here will be taken up by the wider ML community, and that the code examples we are providing will become a valuable resource. By using TensorFlow as a backend, we were able to access automatic gradients for optimization of the tensor network, which moves us beyond the physics-centric techniques ordinarily used. Furthermore, TensorFlow already offers high-level methods for deploying state-of-the-art deep learning models. Adding tensor network machinery to that same library will allow direct comparison of these tools with more traditional machine learning methods, such as neural networks, a particularly active research area. Ultimately, one might even be able to push the state of the art further by combining tensor networks with more traditional methods, something that can be implemented very easily using TensorNetwork on top of TensorFlow.

Acknowledgements.
We would like to thank Glen Evenbly, Martin Ganahl, Ash Milsted, Chase Roberts, Miles Stoudenmire, and Guifre Vidal for valuable discussions. X is formerly known as Google[x] and is part of the Alphabet family of companies, which includes Google, Verily, Waymo, and others (www.x.company).

References