Tensor networks have seen numerous applications in the physical sciences Fannes ; White ; Vidal ; Perez-Garcia ; MERA ; MERA2 ; MERAalgorithms ; Shi ; Tagliacozzo ; Murg ; PEPS1 ; PEPS2 ; PEPS3 ; rev1 ; rev2 ; rev3 ; rev4 ; rev5 ; QC1 ; QC2 ; QC3 ; QC4 ; cMPS ; cMERA ; CTMRG ; TRG ; TEFRG ; TNR ; Swingle ; dS1 ; dS2 ; dS3 ; MERAgeometry , but there has been significant progress recently in applying the same methods to problems in machine learning ML1 ; ML3 ; ML9 ; ML2 ; ML4 ; ML5 ; ML6 ; ML7 ; ML8 ; ML10 ; ML11 . The TensorNetwork library library was created to facilitate this research and accelerate the adoption of tensor network methods by the ML community. In a previous paper paper1
we showed how TensorNetwork could be used in a physics setting. Here we illustrate how to use a matrix product state (MPS) tensor network to classify MNIST and Fashion-MNIST images. The basic technique was applied to the MNIST dataset by Stoudenmire and Schwab ML1 , who adapted the DMRG algorithm from physics White to train the network. Our implementation differs from theirs and follows more closely the implementation in the specialized TorchMPS library torchmps
. The most significant change from Stoudenmire and Schwab is that we use automatic gradients to train the model rather than DMRG. This method of training is familiar to ML practitioners and is built into TensorFlow TensorFlow , which is the TensorNetwork backend we use. We also empirically find it useful to use an alternative contraction order, compared to Stoudenmire and Schwab, when computing components of the MPS, for ease of parallelization. MPS methods were previously applied to Fashion-MNIST in fashionMPS , and we achieve the same accuracy here. In terms of speed, we note that a speedup of more than a factor of 10 is gained by moving from CPU to GPU hardware using the same code with a TensorFlow backend.
Up to a few implementation tweaks, the main strategies we employ for image classification can be found elsewhere. The purpose of this note is to be a resource for machine learning practitioners who wish to learn how tensor networks can be applied to classification problems. An important part of this is the code that we have uploaded to GitHub library , which we hope will also be a valuable resource for the community.
To summarize the results, we find that using an MPS tensor network for classification results in 98% test accuracy for MNIST and 88% test accuracy for Fashion-MNIST. The only hyperparameter in the model is the bond dimension, but we find that the results are largely independent of its value. We also compare training times on CPU and GPU, and find that the GPU leads to about a 10x speedup over a 64-core CPU.
II.1 Encoding Data in a Tensor Network
In this section we will briefly review the structure of a tensor network and how it is used to encode image data. See library for further background that does not assume expertise in quantum physics.
The motivation for using a tensor network for data analysis is similar to the motivation for using a kernel method. Even though tensor networks are most useful for linear operations on data, they act on a very high-dimensional space, and that high dimensionality can be leveraged into representation power. We begin by carefully defining that high-dimensional space.
For us an image consists of a list of $N$ greyscale pixels. In the MNIST and Fashion-MNIST datasets the images are $28 \times 28$, so $N = 784$. By flattening the images into a list we lose the two-dimensional nature of the image. This is obviously a drawback and will negatively impact performance, but our goal is just to illustrate the ideas of a tensor network in a simple application. More complicated schemes could be employed to get around this limitation. See ML9 for an example of a two-dimensional tensor network applied to image classification.
The pixel values of an image will be encoded in a $2^N$-dimensional vector space as follows. First, each pixel of the image is encoded into its own two-dimensional pixel space according to a local feature map. One local feature map, used in ML1 , is
$\phi(p) = \left(\cos(\pi p/2),\, \sin(\pi p/2)\right).$
Here $p$ is the pixel value normalized to the unit range $0 \le p \le 1$. Notice that the pixel values $p = 0$ and $p = 1$
get mapped to independent vectors in this space, and all other pixel values are linear combinations of those. If the image were purely black-and-white this would simply be a one-hot encoding of the pixel values. Another feature map which has similar properties is
$\phi(p) = (1 - p,\, p).$
In practice we will use the latter, linear feature map, though this choice is not very important. With color images it would make sense to consider a higher-dimensional pixel space, such as one encoding the RGB values of each pixel. Finally, by way of notation, we will refer to the components $\phi_s(p)$, where the index $s$ takes on two values. So for the linear feature map, $\phi_1(p) = 1 - p$ and $\phi_2(p) = p$.
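The two feature maps above can be written down directly. A minimal sketch in plain Python follows; the function names are ours, not part of any library API:

```python
import math

def feature_map_trig(p):
    """Trigonometric local feature map used in ML1: (cos(pi*p/2), sin(pi*p/2))."""
    return (math.cos(math.pi * p / 2), math.sin(math.pi * p / 2))

def feature_map_linear(p):
    """Linear local feature map used here: phi_1(p) = 1 - p, phi_2(p) = p."""
    return (1.0 - p, p)

# Black (p = 0) and white (p = 1) pixels map to independent basis vectors
# under both maps; intermediate greys are linear combinations of the two.
print(feature_map_linear(0.0))  # (1.0, 0.0)
print(feature_map_linear(1.0))  # (0.0, 1.0)
```

Either map sends a normalized pixel value to a two-component vector, which is all the construction below requires.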
The total image space is defined as the tensor product of all of the pixel spaces. A key property of this space is that flipping a single pixel value from black to white results in an independent image vector. In equations, an image $x$ is encoded in the image space as
$\Phi(x) = \phi(p_1) \otimes \phi(p_2) \otimes \cdots \otimes \phi(p_N).$
We will refer to this object as a “data tensor” or “data state.” The data tensor has $2^N$ components, each of which is the product of one of the two components of the local feature map (e.g., $1 - p$ or $p$) over all of the pixels. In index notation, the components of the data tensor are given by
$\Phi_{s_1 s_2 \cdots s_N}(x) = \phi_{s_1}(p_1)\, \phi_{s_2}(p_2) \cdots \phi_{s_N}(p_N).$
Since there are two possible values for each of the $s_j$ indices, the total number of components is $2^N$.
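Since the data tensor factorizes over pixels, any single component can be evaluated lazily without materializing all $2^N$ numbers. A sketch, assuming the linear feature map and our own (hypothetical) function name:

```python
def data_tensor_component(indices, pixels):
    """Component Phi_{s1...sN}(x) = prod_j phi_{s_j}(p_j), linear feature map."""
    phi = lambda s, p: (1.0 - p, p)[s]  # phi_1(p) = 1 - p, phi_2(p) = p
    out = 1.0
    for s, p in zip(indices, pixels):
        out *= phi(s, p)
    return out

pixels = [0.0, 1.0, 0.5]  # a tiny 3-pixel "image"
# The full data tensor has 2**3 = 8 components; here we evaluate just one:
print(data_tensor_component((0, 1, 1), pixels))  # (1 - 0.0) * 1.0 * 0.5 = 0.5
```

This factorized structure is exactly what makes contracting the data state against an MPS tractable despite the exponentially large ambient space.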
The MPS tensor network represents another type of vector in the image space. As such, it also has $2^N$ components, and the MPS writes each of those components in terms of a product of $N$ matrices. Letting $T$ represent the total MPS tensor, we have
$T_{s_1 s_2 \cdots s_N} = \sum_{\alpha_1, \ldots, \alpha_{N-1}} A^{s_1}_{\alpha_1} A^{s_2}_{\alpha_1 \alpha_2} \cdots A^{s_N}_{\alpha_{N-1}}.$   (1)
The ranges of the $\alpha_i$ indices, called the bond dimensions, are hyperparameters of the model. The bond dimensions determine the sizes of the $A$ tensors. The components of the $A$ tensors are variational parameters, or weights, that are fixed via training. There is some redundancy in those parameters, known as gauge freedom gauge , but we will not concern ourselves with that here.
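Eq. (1) says that a single MPS component is a product of matrices selected by the physical indices. The following sketch evaluates one component in plain Python (the helper name is ours; a real implementation would use the TensorNetwork library's contraction routines):

```python
import random

def mps_component(tensors, indices):
    """Evaluate one component T_{s1...sN} of an MPS as a matrix product.

    tensors[0][s] and tensors[-1][s] are length-chi boundary vectors, and
    each interior tensors[j][s] is a chi x chi matrix (nested Python lists).
    """
    vec = tensors[0][indices[0]]                     # left boundary vector
    for A, s in zip(tensors[1:-1], indices[1:-1]):
        M = A[s]                                     # chi x chi matrix for s
        vec = [sum(vec[a] * M[a][b] for a in range(len(vec)))
               for b in range(len(M[0]))]            # vector-matrix product
    right = tensors[-1][indices[-1]]                 # right boundary vector
    return sum(v * w for v, w in zip(vec, right))

# Bond dimension 1 reduces every matrix to a scalar, so the component is
# just a product of numbers: 2.0 * 0.5 * 4.0.
toy = [[[2.0], [3.0]], [[[1.0]], [[0.5]]], [[4.0], [1.0]]]
print(mps_component(toy, (0, 1, 0)))  # 4.0

# A random chi = 3 MPS with N = 4 sites.
random.seed(0)
chi = 3
rand_vec = lambda: [random.gauss(0, 1) for _ in range(chi)]
rand_mat = lambda: [rand_vec() for _ in range(chi)]
tensors = ([[rand_vec(), rand_vec()]]
           + [[rand_mat(), rand_mat()] for _ in range(2)]
           + [[rand_vec(), rand_vec()]])
print(mps_component(tensors, (0, 1, 1, 0)))
```

The number of parameters grows only linearly in $N$ even though the MPS represents a vector in the $2^N$-dimensional image space.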
To summarize, we have described an image space and an embedding of our data into that space in the form of the data tensors. The MPS tensor $T$ is another vector in that space which is not itself the data tensor of a single image. Intuitively speaking, we would like the MPS tensor to be equal to a linear combination of all of the images in a given class. An image not belonging to the class will be orthogonal to $T$, while an image belonging to the class will not. In the next section we will describe how to train the MPS to have this property. When we have multiple classes to label, we can either construct one MPS vector per class or add an extra “label” node to the MPS to keep track of the class. The details are discussed in the following sections.
II.2 Objective Function
The classification task can be expressed as finding a map $f$ from the space of objects to the space of labels. In the MNIST case, $f$ should map each handwritten image to the corresponding digit from the set $\{0, 1, \ldots, 9\}$. In a machine learning setting, $f$ is parametrized using a large number of variational parameters that are then optimized using pairs $(x_i, y_i)$ of labeled examples, where $x_i$ represents a flattened image and $y_i$ the corresponding label. Following ML1 , we define the classification map as follows: First we calculate the inner product between the encoded image vector $\Phi(x)$ (see Section II.1) and a variational MPS $T^\ell$:
$f^\ell(x) = \sum_{s_1 \cdots s_N} T^{\ell}_{s_1 \cdots s_N}\, \Phi_{s_1 \cdots s_N}(x).$   (2)
The inner product is depicted in Fig. 2 in tensor network graphical notation. Note that all pixel indices of the MPS are contracted with data, except the label index $\ell$, which is free and is used to distinguish the different labels. The position of the $\ell$ index in the MPS chain is arbitrary, and a typical choice is in the middle of the chain. After calculating the inner product, the classification map is defined as
$\tilde{f}(x) = \operatorname{argmax}_\ell f^\ell(x).$
In other words, for each image we select the label whose MPS has the largest overlap with the corresponding encoded image vector.
Following the typical machine learning procedure, the variational parameters that define each MPS (see Eq. (1)) should be tuned to minimize an objective function on the training set. In the original work ML1 , the average mean squared error was chosen as the objective function. Here we choose instead to optimize the multi-class cross-entropy, defined on our training set as
$\mathcal{L} = -\frac{1}{N_{\mathrm{train}}} \sum_i \log \frac{e^{f^{y_i}(x_i)}}{\sum_\ell e^{f^\ell(x_i)}}.$   (3)
Note that the outputs of the softmax function can be interpreted as the predicted probabilities for each label. The final prediction corresponds to the label with the maximum probability. Cross-entropy following a softmax activation is a choice that is well suited to classification problems, as is known from standard machine learning methods.
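The softmax cross-entropy of Eq. (3) can be sketched directly, with the overlaps $f^\ell(x)$ playing the role of the logits (function names are ours, and real training code would use the framework's built-in loss):

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract the max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits_batch, labels):
    """Average multi-class cross-entropy over a batch of overlap vectors."""
    loss = 0.0
    for logits, y in zip(logits_batch, labels):
        loss -= math.log(softmax(logits)[y])
    return loss / len(labels)

# Two toy images, three classes; each row holds the overlaps f^l(x).
print(cross_entropy([[2.0, 0.1, -1.0], [0.0, 3.0, 0.5]], [0, 1]))
```

When both classes are equally likely the per-sample loss is log 2, the familiar sanity check for a two-class problem.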
High-level frameworks such as TensorFlow allow for an efficient implementation of standard machine learning algorithms, such as neural networks, through a very simple API. One of the reasons for this simplicity is the automatic calculation of gradients used in the famous backpropagation algorithm backprop . The user only needs to define the forward pass of the model, which is generally straightforward, while the more complicated backward pass is handled automatically by the library.
Gradient optimization methods are not the typical choice for optimizing tensor networks. In most physics applications, sweeping algorithms such as the celebrated Density Matrix Renormalization Group (DMRG) White are preferred, as they can lead to faster convergence. Nevertheless, a “brute force” optimization using the gradients of the objective function works for the tensor network case, too. This approach may be suboptimal when compared to a more sophisticated sweeping method, but the simplicity of the underlying code for gradient-based optimization, when written in a high-level machine learning library, can be more attractive to machine learning practitioners.
The way automatic differentiation works depends strongly on how the forward pass is defined by the user. When using high-level libraries, one should take advantage of the efficiency of vectorized operations, as this will lead to a more efficient forward pass and possibly more efficient gradients as well. In our case, the forward pass amounts to calculating the inner product of Eq. (2) (see Fig. 2).
A straightforward way to calculate this inner product (or equivalently to contract the tensor network), which is also commonly used in sweeping methods, is depicted in Fig. 3. Denote by $d$ the dimension of the pixel space (the size of the vertical legs in the figure, with $d = 2$ for the feature maps described in Section II.1) and by $\chi$ the bond dimension (the size of the horizontal legs), which we assume to be constant across the MPS. Contracting a vertical leg (absorbing a data vector into an MPS tensor) costs $O(d\chi^2)$, while contracting a horizontal leg (a matrix-vector product with the running boundary) costs $O(\chi^2)$. Assuming that we continue the contraction as depicted from left to right, once we pass the free label index we have to keep track of the label for the rest of the contractions, increasing all subsequent costs by a factor of $L$ (the number of different labels) and leading to a total cost of order $O(L N d \chi^2)$. An easy way to avoid the extra factor of $L$ is to start the contraction from both ends of the chain and contract with the tensor that carries the label index in the final step, resulting in an improved total cost of $O(N d \chi^2)$. Note that this analysis only takes into account the forward pass, that is, calculating $f^\ell(x)$ for a given $x$, and not its gradients with respect to the MPS parameters, for which the additional factor of $L$ cannot be avoided.
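The left-to-right "zipper" contraction just described can be sketched in plain Python (helper names are ours; the cost structure, one $O(d\chi^2)$ vertical contraction and one $O(\chi^2)$ matrix-vector product per site, matches the analysis above):

```python
def contract_sequential(tensors, phis):
    """Left-to-right contraction of the overlap between an MPS and a data state.

    tensors[j][s] is the chi x chi matrix A_j^{s} (the boundary entries are
    chi-vectors); phis[j] = (phi_1(p_j), phi_2(p_j)) is the feature vector
    of pixel j.  Each step depends on the previous one, so this contraction
    is inherently sequential.
    """
    # Left boundary: contract the first data vector with the first MPS tensor.
    vec = [sum(phis[0][s] * tensors[0][s][a] for s in range(2))
           for a in range(len(tensors[0][0]))]
    for A, phi in zip(tensors[1:-1], phis[1:-1]):
        chi = len(A[0])
        # Vertical leg: collapse the physical index into one chi x chi matrix.
        M = [[sum(phi[s] * A[s][a][b] for s in range(2)) for b in range(chi)]
             for a in range(chi)]
        # Horizontal leg: matrix-vector product with the running boundary.
        vec = [sum(vec[a] * M[a][b] for a in range(chi)) for b in range(chi)]
    right = [sum(phis[-1][s] * tensors[-1][s][a] for s in range(2))
             for a in range(len(tensors[-1][0]))]
    return sum(v * w for v, w in zip(vec, right))
```

The serial dependence between sites is what the alternative contraction order below the figure is designed to avoid.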
Although this method of contraction is expected to work well when coded in a low-level language, we empirically find it to be suboptimal for our specific application and the particular choices of the parameters $N$, $d$, and $\chi$, both in the forward pass and in the automatic backward pass. We follow an alternative contraction order inspired by the implementation of torchmps and depicted in Fig. 4. The total cost of this contraction is $O(N d \chi^2 + N \chi^3)$, where the first term comes from step (i) and the second term from the consecutive pairwise contractions. This method requires fewer sequential contraction steps, since the chain is halved in each round; however, the cost scales as $\chi^3$ because it requires matrix-matrix multiplications, whereas the first method only has matrix-vector contractions scaling as $\chi^2$. Even though the total cost is asymptotically higher for the second method, an advantage is that each step is easy to parallelize: the matrix multiplications within a round are independent and do not require results from the neighboring calculation (as they do in the first method). This is of particular importance when using a machine learning library such as TensorFlow, as the supported batching operations can be used to easily implement these contractions in parallel. We note that this implementation not only leads to a faster forward pass, but also to a more efficient automatic gradient calculation.
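The pairwise scheme above can be sketched as follows, assuming step (i) has already absorbed each data vector into its MPS tensor to produce one matrix per pixel (function names are ours; in TensorFlow each round would be a single batched matmul):

```python
def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def contract_pairwise(mats):
    """Contract a chain of matrices by repeated pairwise multiplication.

    Each round halves the chain; the rounds are sequential, but every
    product within a round is independent of its neighbors, so a round can
    run as one batched matrix multiplication on an accelerator.
    """
    while len(mats) > 1:
        nxt = [matmul(mats[i], mats[i + 1]) for i in range(0, len(mats) - 1, 2)]
        if len(mats) % 2 == 1:      # an odd leftover passes through untouched
            nxt.append(mats[-1])
        mats = nxt
    return mats[0]

# With 1x1 matrices the chain product is just a product of numbers.
print(contract_pairwise([[[2.0]], [[3.0]], [[4.0]], [[5.0]]]))  # [[120.0]]
```

Because matrix multiplication is associative, the result agrees with the left-to-right contraction while reducing the number of dependent rounds from $N$ to roughly $\log_2 N$.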
As described in the previous section, upon defining our model’s forward pass as depicted in Fig. 4, TensorFlow automatically calculates the gradients of the loss function in Eq. (3). These gradients are then used to minimize the loss via a stochastic gradient descent method. A typical setting that we find to perform well is the Adam optimizer adam , with the batch size chosen according to the total amount of training data used. We note that in the original sweeping DMRG-like optimization method proposed in ML1 each step updates two of the MPS tensors. In contrast, in gradient-based methods using automatic differentiation, all variational tensors are typically updated simultaneously in each update step.
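The simultaneous-update behavior can be illustrated on a toy model. The sketch below uses one bond-dimension-1 MPS per label and finite differences as a stand-in for TensorFlow's automatic gradients; all names, sizes, and the learning rate are our own illustrative choices, not the paper's settings:

```python
import math, random

def forward(w, phis):
    """Toy overlaps f^l(x): with bond dimension 1, each overlap is a product
    over sites of the local contraction (phi . w)."""
    return [math.prod(sum(phi[s] * w[l][j][s] for s in range(2))
                      for j, phi in enumerate(phis))
            for l in range(len(w))]

def loss(w, phis, y):
    """Softmax cross-entropy of the overlaps for one labeled image."""
    logits = forward(w, phis)
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits)) - logits[y]

def gradient_step(w, phis, y, lr=0.02, eps=1e-6):
    """One update touching every tensor entry simultaneously (contrast with
    DMRG-style sweeps, which update two sites at a time).  Forward
    differences stand in for automatic differentiation."""
    base = loss(w, phis, y)
    grads = [[[0.0, 0.0] for _ in label] for label in w]
    for l in range(len(w)):
        for j in range(len(w[l])):
            for s in range(2):
                w[l][j][s] += eps
                grads[l][j][s] = (loss(w, phis, y) - base) / eps
                w[l][j][s] -= eps
    for l in range(len(w)):
        for j in range(len(w[l])):
            for s in range(2):
                w[l][j][s] -= lr * grads[l][j][s]
    return base

random.seed(1)
phis = [(1.0 - p, p) for p in (0.2, 0.9, 0.5)]   # one toy 3-pixel "image"
w = [[[random.gauss(0, 1) for _ in range(2)] for _ in range(3)]
     for _ in range(2)]                           # 2 labels x 3 sites x d=2
before = loss(w, phis, 0)
for _ in range(10):
    gradient_step(w, phis, y=0)
print(before, "->", loss(w, phis, 0))
```

In real training the finite differences are replaced by the framework's reverse-mode gradients and the plain update by Adam, but the structure of the step, one simultaneous update of all tensors, is the same.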
A disadvantage of this method, when implemented naively, is that the bond dimension is an additional hyperparameter that is set a priori and kept constant during training. In the sweeping implementation, a singular value decomposition (SVD) step allows the bond dimensions to change adaptively during training. This is a particularly interesting feature, as it allows the model to adjust the number of its variational parameters according to the complexity of the data to be learned.
We implement the contraction method described in Fig. 4 using TensorNetwork with the TensorFlow backend, and we optimize using the built-in automatic differentiation and the Adam optimizer.
We first train on the full MNIST training set, consisting of 60,000 images of size $28 \times 28$, and we show the relevant training history in Fig. 5. Each training epoch corresponds to a full iteration over the training set. We observe that with this setting the model converges to almost 100% training accuracy and about 98% test accuracy. Here test accuracy is calculated on the whole test dataset, consisting of 10,000 images. Training with automatic gradients is found to be largely independent of the bond dimension used, with larger bond dimensions being of course more computationally expensive. In the left panel of Fig. 7 we plot the final training and test accuracies as a function of the bond dimension. Here the test accuracy is calculated on the MNIST test set of 10,000 images once training is completed. We again find no dependence on the bond dimension, obtaining about 98% test accuracy, in agreement with previous works. Furthermore, we compare optimizing with our softmax cross-entropy loss (denoted CE) with the original mean square loss from ML1 and find no significant difference in terms of final accuracies. The square loss is more efficient to calculate and thus leads to slightly faster convergence in terms of actual wall time; however, the difference becomes negligible as the bond dimension is increased and contractions dominate the computation time.
We repeat the same optimization schedule for Fashion-MNIST, a dataset with exactly the same structure as MNIST ($28 \times 28$ greyscale images of clothing with 10 labels), but which is significantly harder to classify, with state-of-the-art deep learning methods obtaining about 93% test accuracy fashionSOTA . The corresponding training dynamics are shown in Fig. 6. As demonstrated in the right panel of Fig. 7, we are able to obtain 88% test accuracy, and we again find no significant dependence on the bond dimension.
Finally, it is well known that the use of accelerators such as GPUs can greatly reduce training time for many machine learning models, particularly in cases where the complexity is dominated by linear algebra operations. Since this is the case with tensor networks, we expect to see some advantage in our methods. TensorNetwork with the TensorFlow backend allows direct execution on a GPU without any changes to the code. In Fig. 8 we verify the advantage, with the GPU being four times faster for the smallest bond dimension and the relative speed-up growing as the bond dimension increases.
Here we have described an algorithm for image classification using MPS tensor networks. All of the code required to reproduce the results, built on the open-source TensorNetwork library, is available on GitHub library . We are hopeful that the techniques we reviewed here will be taken up by the wider ML community, and that the code examples we are providing will become a valuable resource. By using TensorFlow as a backend, we were able to use automatic gradients to optimize the tensor network, which moves us beyond the physics-centric techniques ordinarily used. Furthermore, TensorFlow already offers high-level methods for deploying state-of-the-art deep learning models. Adding tensor network machinery to the same library allows direct comparison of these tools with more traditional machine learning methods, such as neural networks, a particularly active research area. Ultimately, one might even be able to push the state of the art further by combining tensor networks with more traditional methods, something that can be implemented very easily using TensorNetwork on top of TensorFlow.
Acknowledgements. We would like to thank Glen Evenbly, Martin Ganahl, Ash Milsted, Chase Roberts, Miles Stoudenmire, and Guifre Vidal for valuable discussions. X is formerly known as Google[x] and is part of the Alphabet family of companies, which includes Google, Verily, Waymo, and others (www.x.company).
- (1) C. Roberts et al. TensorNetwork: A Library for Physics and Machine Learning, arXiv:1905.01330. The TensorNetwork library and all of the code to reproduce the results of this paper are available at https://github.com/google/TensorNetwork.
- (2) M. Fannes, B. Nachtergaele, and R. F. Werner, Finitely correlated states on quantum spin chains, Commun. Math. Phys. 144, 443 (1992).
- (3) S.R. White, Density matrix formulation for quantum renormalization groups, Phys. Rev. Lett. 69, 2863 (1992).
- (4) G. Vidal, Efficient classical simulation of slightly entangled quantum computations, Phys. Rev. Lett., 91, 147902 (2003), arXiv:quant-ph/0301063
- (5) D. Perez-Garcia, F. Verstraete, M. M.Wolf, and J. I. Cirac, Matrix Product State Representations, Quant. Inf. Comput. 7, 401 (2007), arXiv:quant-ph/0608197
- (6) G. Vidal, Entanglement renormalization, Phys. Rev. Lett. 99, 220405 (2007), arXiv:cond-mat/0512165
- (7) G. Vidal, A class of quantum many-body states that can be efficiently simulated, Phys. Rev. Lett. 101, 110501 (2008), arXiv:quant-ph/0610099
- (8) G. Evenbly, G. Vidal, Algorithms for entanglement renormalization, Phys. Rev. B 79, 144108 (2009), arXiv: 0707.1454
- (9) Y. Shi, L. Duan, and G. Vidal, Classical simulation of quantum many-body systems with a tree tensor network, Phys. Rev. A 74, 022320 (2006), quant-ph/0511070
- (10) L. Tagliacozzo, G. Evenbly, and G. Vidal Simulation of two-dimensional quantum systems using a tree tensor network that exploits the entropic area law, Phys. Rev. B 80, 235127 (2009), arXiv:0903.5017
- (11) V. Murg, F. Verstraete, O. Legeza, and R. M. Noack Simulating strongly correlated quantum systems with tree tensor networks, Phys. Rev. B 82, 205105 (2010), arXiv:1006.3095
- (12) F. Verstraete, and J. I. Cirac, Renormalization algorithms for Quantum-Many Body Systems in two and higher dimensions, arXiv:cond-mat/0407066 (2004).
- (13) G. Sierra and M.A. Martin-Delgado, The Density Matrix Renormalization Group, Quantum Groups and Conformal Field Theory, arXiv:cond-mat/9811170 (1998).
- (14) T. Nishino and K. Okunishi, A Density Matrix Algorithm for 3D Classical Models, J. Phys. Soc. Jpn. 67, 3066 (1998).
- (15) J.C. Bridgeman and C. T. Chubb, Hand-waving and Interpretive Dance: An Introductory Course on Tensor Networks J. Phys. A: Math. Theor. 50 223001 (2017), arXiv:1603.03039
- (16) R. Orus, A practical introduction to tensor networks: Matrix product states and projected entangled pair states, Ann. Phys. 349, 117-158 (2014), arXiv preprint arXiv:1306.2164
- (17) G. Evenbly, G. Vidal, Tensor network states and geometry, J. Stat. Phys. 145:891-918 (2011), arXiv:1106.1082
- (18) J. I. Cirac, F. Verstraete, Renormalization and tensor product states in spin chains and lattices, J. Phys. A: Math. Theor. 42, 504004 (2009), arXiv:0910.1130
- (19) U. Schollwoeck, The density-matrix renormalization group, Rev. Mod. Phys. 77, 259 (2005), arXiv:cond-mat/0409292
- (20) S. R. White, R. L. Martin, Ab Initio Quantum Chemistry using the Density Matrix Renormalization Group, J. Chem. Phys. 110, 4127 (1999), arXiv:cond-mat/9808118
- (21) G. K.-L. Chan, J. J. Dorando, D. Ghosh, J. Hachmann, E. Neuscamman, H. Wang and T. Yanai, An Introduction to the Density Matrix Renormalization Group Ansatz in Quantum Chemistry arXiv:0711.1398.
- (22) S. Szalay, M. Pfeffer, V. Murg, G. Barcza, F. Verstraete, R. Schneider, O. Legeza, Tensor product methods and entanglement optimization for ab initio quantum chemistry, Int. J. Quant. Chem. 115, 1342 (2015). arXiv:1412.5829
- (23) C. Krumnow, L. Veis, O. Legeza and J. Eisert, Fermionic orbital optimisation in tensor network states Phys. Rev. Lett. 117, 210402 (2016), arXiv:1504.00042
- (24) T. Nishino, K. Okunishi, Corner Transfer Matrix Renormalization Group Method, J. Phys. Soc. Jpn. 65, pp. 891-894 (1996), arXiv:cond-mat/9507087
- (25) M. Levin, C. P. Nave, Tensor renormalization group approach to 2D classical lattice models, Phys. Rev. Lett. 99, 120601 (2007), arXiv:cond-mat/0611687
- (26) Z.-C. Gu, X.-G. Wen, Tensor-Entanglement-Filtering Renormalization Approach and Symmetry Protected Topological Order, Phys. Rev. B 80, 155131 (2009), arXiv:0903.1069
- (27) G. Evenbly, G. Vidal, Tensor Network Renormalization, Phys. Rev. Lett. 115, 180405 (2015), arXiv:1412.0732
- (28) F. Verstraete and J. I. Cirac, Continuous Matrix Product States for Quantum Fields Phys. Rev. Let. 104, 190405 (2010), arXiv:1002.1824
- (29) J. Haegeman, T. J. Osborne, H. Verschelde, F. Verstraete, Entanglement renormalization for quantum fields, Phys. Rev. Lett. 110, 100402 (2013), arXiv:1102.5524
- (30) B. Swingle, Entanglement Renormalization and Holography, Phys. Rev. D 86, 065007 (2012), arXiv:0905.1317
- (31) C. Beny, Causal structure of the entanglement renormalization ansatz, New J. Phys. 15 (2013) 023020, arXiv:1110.4872
- (32) B. Czech, L. Lamprou, S. McCandlish, and J. Sully, Tensor Networks from Kinematic Space, JHEP07 (2016) 100, arXiv:1512.01548
- (33) N. Bao, C. Cao, S. M. Carroll, A. Chatwin-Davies, De Sitter space as a tensor network: Cosmic no-hair, complementarity, and complexity, Phys. Rev. D 96, 123536 (2017), arXiv:1709.03513
- (34) A. Milsted, G. Vidal Geometric interpretation of the multi-scale entanglement renormalization ansatz, arXiv:1812.00529
- (35) E. M. Stoudenmire, D. J. Schwab, Supervised Learning with Quantum-Inspired Tensor Networks, Adv. Neu. Inf. Proc. Sys. 29, 4799 (2016), arXiv:1605.05775
- (36) Y. Levine, D. Yakira, N. Cohen, A. Shashua, Deep Learning and Quantum Entanglement: Fundamental Connections with Implications to Network Design, arXiv:1704.01552
- (37) D. Liu, S.-J. Ran, P. Wittek, C. Peng, R. B. García, G. Su, M. Lewenstein, Machine Learning by Two-Dimensional Hierarchical Tensor Networks: A Quantum Information Theoretic Perspective on Deep Architectures arXiv: 1710.04833
- (38) J. Chen, S. Cheng, H. Xie, L. Wang, T. Xiang, Equivalence of restricted Boltzmann machines and tensor network states, Phys. Rev. B 97, 085104 (2018), arXiv:1701.04831
- (39) Y. Levine, O. Sharir, N. Cohen, A. Shashua, Quantum entanglement in deep learning architectures, Physical Review Letters, 122(6), 065301 (2019), arXiv:1803.09780
- (40) I. Glasser, N. Pancotti, J. I. Cirac, Supervised learning with generalized tensor networks, arXiv:1806.05964
- (41) S. D. Sarma, D.-L. Deng, L.-M. Duan, Machine learning meets quantum physics, Physics Today 72, 3, 48 (2019), arXiv:1903.03516
- (42) W. Huggins, P. Patel, K. B. Whaley, E. M. Stoudenmire, Towards Quantum Machine Learning with Tensor Networks, Quantum Science and Technology, Volume 4, 024001 (2019), arXiv:1803.11537
- (43) E. M. Stoudenmire Learning Relevant Features of Data with Multi-scale Tensor Networks, Quantum Science and Technology, Volume 3, 034003 (2018), arXiv: 1801.00315
- (44) A. Cichocki, N. Lee, I. V. Oseledets, A.-H. Phan, Q. Zhao, D. Mandic, Low-Rank Tensor Networks for Dimensionality Reduction and Large-Scale Optimization Problems: Perspectives and Challenges PART 1, Foundations and Trends in Machine Learning, 9(4-5), 249–429. arXiv: 1609.00893
- (45) A. Cichocki Tensor Networks for Big Data Analytics and Large-Scale Optimization Problems, arXiv: 1407.3124
- (46) A. Milsted, M. Ganahl, S. Leichenauer, J. Hidary, G. Vidal, TensorNetwork on TensorFlow: A Spin Chain Application Using Tree Tensor Networks (2019), arXiv: 1905.01331
- (47) J. Miller, TorchMPS, https://github.com/jemisjoky/torchmps (2019)
- (48) M. Abadi et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems (2015), Software available from tensorflow.org.
- (49) Z.-Z. Sun et al., Generative Tensor Network Classification Model for Supervised Machine Learning, arXiv: 1903.10742
- (50) G. Evenbly, Gauge fixing, canonical forms, and optimal truncations in tensor networks with closed loops, Physical Review B, 98(8), 085155 (2018).
- (51) D. Rumelhart et al., Learning representations by back-propagating errors, Cognitive Modeling 5(3), 1 (1988).
- (52) D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv: 1412.6980
- (53) S. Bhatnagar, D. Ghosal, and M. H. Kolekar, Classification of fashion article images using convolutional neural networks, Fourth International Conference on Image Information Processing (ICIIP), 1–6 (2017).