Machine Learning by Two-Dimensional Hierarchical Tensor Networks: A Quantum Information Theoretic Perspective on Deep Architectures

10/13/2017, by Ding Liu et al. (ICFO)

The resemblance between the methods used in studying quantum many-body physics and in machine learning has drawn considerable attention. In particular, tensor networks (TNs) and deep learning architectures bear striking similarities to the extent that TNs can be used for machine learning. Previous results used one-dimensional TNs in image recognition, showing limited scalability and requiring a high bond dimension. In this work, we train two-dimensional hierarchical TNs to solve image recognition problems, using a training algorithm derived from the multipartite entanglement renormalization ansatz (MERA). This approach overcomes scalability issues and implies novel mathematical connections among quantum many-body physics, quantum information theory, and machine learning. By keeping the TN unitary in the training phase, TN states can be defined that optimally encode each class of images into a quantum many-body state. We study the quantum features of the TN states, including quantum entanglement and fidelity. We suggest that these quantities could be novel properties characterizing the image classes, as well as the machine learning tasks. Our work could further be applied to identifying possible quantum properties of certain artificial intelligence methods.


Results

Power of representation and generalization

To verify the representation power of the TTN, we use the CIFAR-10 dataset [32], which consists of 10 classes with 50,000 RGB images in the training set and 10,000 images in the testing set. Each RGB image is originally 32x32 pixels. We transformed the images to gray-scale to reduce the complexity of the training, a reasonable trade-off between the information retained and the computational cost.
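As an illustration, a minimal sketch of this preprocessing step, assuming the images are available as a NumPy array of shape (n, 32, 32, 3) with values in 0-255 (the data loader is not part of the original text):

    import numpy as np

    def rgb_to_gray(images):
        """Convert RGB images of shape (n, 32, 32, 3) to gray-scale, shape (n, 32, 32).

        Standard luminance weights are used, and pixel values are rescaled to [0, 1],
        the range expected by the feature map of Eq. (1)."""
        weights = np.array([0.299, 0.587, 0.114])
        gray = images.astype(np.float64) @ weights   # contract the color channel
        return gray / 255.0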

Figs. 2 (a) and (b) exhibit the relation between the representation power (learnability, or model complexity) and the bond dimensions of the TTN. The TTN gives a mapping that optimally projects the vectorized images from the exponentially large input space to the low-dimensional label space. Thus, from the perspective of tensor algebra, the limit of the representation power of the TTN depends on the input bond dimension. On the other hand, the TTN can be considered as an approximation of such an exponentially large mapping, obtained by writing it as a contraction of small tensors. The dummy indexes that are contracted inside the TTN are called virtual bonds, whose dimensions determine how closely the TTN can approach this limit.

The sequence of convolutional and pooling layers in the feature extraction part of a deep learning network is known to arrive at higher and higher levels of abstraction that help separate the classes in a discriminative learner [13]. This is often visualized by embedding the representation in two dimensions by t-SNE [29], and by coloring the instances according to their classes. If the classes clearly separate in this embedding, the subsequent classifier will have an easy task performing classification at high accuracy. We plotted this embedding for each layer in the TTN in Fig. 4. We observe the same pattern as in deep learning, with a clear separation in the highest level of abstraction.
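A minimal sketch of this visualization, assuming the representations of one layer have already been read out of the TTN as an array of shape (number of samples, feature dimension); the extraction itself depends on the implementation and is not shown:

    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    def plot_layer_embedding(features, labels, layer_name):
        """Embed one layer's representations in 2D with t-SNE and color points by class."""
        emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
        plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=5)
        plt.title("t-SNE embedding, " + layer_name)
        plt.show()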

Figure 2: Binary classification accuracy on CIFAR-10 (Horses vs. Planes) with (a) 200 training samples and (b) 600 training samples.
Figure 3: Training and testing accuracy as a function of the bond dimension on the MNIST dataset. The virtual bond dimensions are set equal to the input bond dimensions. The number of training samples is 1000 for each pair of classes.
model                    0    1    2    3    4    5    6    7    8    9    10-class
Training accuracy (%)    96   97   96   94   96   94   97   94   93   94   95
Testing accuracy (%)     97   97   95   93   95   95   96   94   93   93   92
Input bond dimension     3    3    3    4    2    6    2    6    6    4    /
Virtual bond dimension   3    3    4    4    3    6    3    6    6    6    /
Table 1: 10-class classification on MNIST.

Furthermore, to test the generalization power of TTNs, we used the MNIST dataset, which is widely used for handwritten-digit recognition. The training set consists of 60,000 gray-scale images of 28x28 pixels, and there are 10,000 testing examples. For simplicity of encoding, we rescaled the images to 16x16 pixels so that the TTN can be built with four layers (each layer coarse-grains 2x2 blocks, so a 16x16 input closes to a single top tensor in four layers).
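A minimal sketch of this rescaling, assuming the images are stored as 28x28 uint8 arrays (the particular resize routine is an implementation choice, not specified in the text):

    import numpy as np
    from PIL import Image

    def rescale_mnist(img28):
        """Downscale a 28x28 MNIST image to 16x16 and normalize it to [0, 1]."""
        img16 = Image.fromarray(img28).resize((16, 16))
        return np.asarray(img16, dtype=np.float64) / 255.0

    # With 2x2 coarse-graining at every layer, a 16x16 input closes in four layers:
    # 16x16 -> 8x8 -> 4x4 -> 2x2 -> 1.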

Figure 4: Embedding of data instances of CIFAR-10 by t-SNE corresponding to each layer in the TTN: (a) original data distribution and (b) the 1st, (c) 2nd, (d) 3rd, (e) 4th, and (f) 5th layer.

With the increase of the bond dimensions (both input and virtual), we find an apparent rise of the training accuracy, as shown in Fig. 3. At the same time, we observe a decline of the testing accuracy. Increasing the bond dimension leads to a sharp increase in the number of parameters and, as a result, gives rise to overfitting and lowers the generalization performance, mirroring the principles of statistical learning theory. Therefore, one must pay attention to finding the optimal bond dimension; we can think of it as a hyperparameter controlling the model complexity. Balancing efficiency against overfitting, we use the minimal values of the bond dimensions (Table 1) that reach a training accuracy of around 95%. Our results indicate that only small bond dimensions (no larger than 6) are needed.

Encoding images to states: fidelity and entanglement

With the unitary TTN obtained in the training phase, TTN states can be defined in the (exponentially large) space of the vectorized images, one state Φ_p corresponding to the p-th label. In this work, we use the following strategy to obtain the TTN states for the classification of the 10 classes. Taking the p-th image class as an example, we first relabel the training samples as "yes" or "no" according to whether they belong to this class. Then we train the TTN as a binary classifier. Finally, Φ_p is obtained by fixing the label index of the trained TTN to the "yes" label. We keep each Φ_p normalized in the algorithm.

The fidelity between two such states is defined as F_pp' = |⟨Φ_p|Φ_p'⟩|. It measures the distance between the two quantum states in the Hilbert space. Fig. 5 shows the fidelity between each pair of Φ_p's trained from the MNIST dataset. One can see that the fidelity remains quite small in most cases, meaning that the Φ_p's are almost orthogonal to one another. Although the space of the vectorized images is exponentially large, most of the relevant information gathers in a small corner spanned by the nearly orthonormal states Φ_p.
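As a small worked illustration with dense vectors (in practice each Φ_p is a TTN and the overlap is computed by contracting the two networks; the dense example below is only for intuition):

    import numpy as np

    def fidelity(phi_p, phi_q):
        """Fidelity |<phi_p|phi_q>| between two normalized pure states."""
        return np.abs(np.vdot(phi_p, phi_q))

    rng = np.random.default_rng(0)
    a = rng.normal(size=16); a /= np.linalg.norm(a)
    b = rng.normal(size=16); b /= np.linalg.norm(b)
    print(fidelity(a, a))   # 1.0: the diagonal terms of Fig. 5(a)
    print(fidelity(a, b))   # small for nearly orthogonal states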

In addition, the largest values of the fidelity appear between particular pairs of digit classes. We speculate that this is closely related to the way the data instances are fed and processed in the TTN. In our case, two image classes that have similar shapes result in a larger fidelity, because the TTN essentially provides a real-space renormalization flow. In other words, the input vectors are initially arranged and then renormalized layer by layer according to their spatial locations in the image; each tensor renormalizes four nearest-neighboring vectors into one vector. The fidelity could potentially be used to build a network in which the nodes are classes of images and the weights of the connections are given by the fidelities between the corresponding states. This might provide a mathematical model of how different classes of images are associated with each other. We leave these questions for future investigations.

Another important concept of quantum mechanics is (bipartite) entanglement, a quantum version of correlations [33]. It is one of the key characteristics that distinguish quantum states from classical ones. Entanglement is usually described by a normalized positive vector called the entanglement spectrum (denoted by Λ), and its strength is measured by the entanglement entropy S = -Σ_a Λ_a^2 ln Λ_a^2. Fig. 5 shows the entanglement entropy of the states Φ_p trained with the MNIST dataset. We compute two kinds of entanglement entropy, obtained by cutting the images in the middle along the x and y directions, as shown in Fig. 1; the results are marked as up-down and left-right in Fig. 5. The first denotes the entanglement between the upper and lower halves of the image; the latter denotes the entanglement between the left and right halves. With the TTN, the entanglement spectrum is simply given by the singular values of the top tensor reshaped into a matrix. This is because all the tensors in the TTN are orthogonal (isometric). Note that the top tensor has four downward indexes, each of which represents the effective space renormalized from one quarter of the vectorized image. Thus, the chosen bipartition determines how the four indexes of the top tensor are grouped into two larger indexes before computing the SVD.
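A minimal sketch of this computation, assuming the top tensor is available as an array of shape (D, D, D, D) whose four indexes correspond to the (upper-left, upper-right, lower-left, lower-right) quarters of the image; this index ordering is an assumption made for illustration:

    import numpy as np

    def entanglement_entropy(top, cut="up-down"):
        """Entanglement entropy of the TTN state for a bipartition of the image.

        top : 4-index tensor of shape (D, D, D, D), indexes = (UL, UR, LL, LR).
        cut : "up-down" groups (UL, UR) vs (LL, LR); "left-right" groups
              (UL, LL) vs (UR, LR)."""
        if cut == "left-right":
            top = top.transpose(0, 2, 1, 3)              # reorder to (UL, LL, UR, LR)
        D = top.shape[0]
        s = np.linalg.svd(top.reshape(D * D, D * D), compute_uv=False)
        lam2 = (s / np.linalg.norm(s)) ** 2              # normalized entanglement spectrum
        lam2 = lam2[lam2 > 1e-12]
        return -np.sum(lam2 * np.log(lam2))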

Two implications can be drawn from the entanglement entropy. Firstly, it is known from tensor network theory that the entanglement entropy reveals the virtual bond dimensions needed to reach a certain precision. In other words, the entanglement entropy characterizes the computational complexity of the classification with a TTN. Secondly, for a physical state with two subsystems, the entanglement entropy measures the amount of information about one subsystem that can be gained by measuring the other. Here, an important analogy is between knowing a part of the image and measuring the corresponding subsystem of the quantum state. Thus, we suggest that in our image recognition setting, the entanglement entropy characterizes how much information about one part of the image can be gained by knowing the rest of it. In other words, if we only know a part of an image and want to predict the rest according to the trained TTN state, the entanglement entropy measures how accurately this can be done. Moreover, we find that the states Φ_p actually possess small entanglement, meaning that the TTN can efficiently capture and classify the images with relatively small virtual bond dimensions. Our results suggest that the images of "0" and "4" are, respectively, the easiest and the hardest for which to predict a missing part given the rest.

Figure 5: (a) Fidelity between each pair of handwritten-digit classes. The diagonal terms equal 1 because the quantum states are normalized. (b) Entanglement entropy corresponding to each handwritten-digit class.

Discussion

We continued the forays into using tensor networks for machine learning, focusing on hierarchical, two-dimensional tree tensor networks that we found to be a natural fit for image recognition problems. This provides a scalable approach with high precision. We conclude with the following observations:

  • The representation power (learnability) of a TTN is limited by the input bond dimensions, and the virtual bond dimensions determine how closely the TTN can approach this limit.

  • A hierarchical tensor network exhibits the same increasing level of abstraction as a deep convolutional neural network or a deep belief network.

  • Our scheme naturally connects classical images to quantum states, permitting the use of quantum properties (fidelity and entanglement) to characterize the classical data and the computational task.

Moreover, our work contributes towards the implementation of machine learning by quantum simulation/computation. Firstly, since we propose to encode image classes into TTN states, it should be possible to realize the proposed machine learning scheme by, e.g., quantum state tomography techniques [27]. Secondly, arbitrary unitary gates can in principle be realized by so-called digital quantum simulators [34]. Thanks to the unitary conditions on the local tensors, this offers another possible way of realizing our proposal by quantum simulation.

Methods

Feature map

Our approach to classifying image data begins by mapping each pixel value x to a d-component vector v(x). This feature map was introduced in Ref. [11] and is defined as

v_s(x) = \sqrt{\binom{d-1}{s-1}}\, \left[\cos\left(\frac{\pi x}{2}\right)\right]^{d-s} \left[\sin\left(\frac{\pi x}{2}\right)\right]^{s-1},    (1)

where s runs from 1 to d. By using a larger d, the TTN has the potential to approximate a richer class of functions. With such a nonlinear feature map, a gray-scale image of N pixels is projected from scalar space into a d^N-dimensional vector space, where the image is represented as a direct product of N local d-dimensional vectors. The coefficients of the local vector at each site are given by the feature map [Eq. (1)] applied to the corresponding pixel.
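A minimal sketch of this feature map (following the form used in Ref. [11]; pixel values are assumed to be rescaled to [0, 1]):

    import numpy as np
    from scipy.special import comb

    def feature_map(x, d=2):
        """Map a pixel value x in [0, 1] to a d-component vector, Eq. (1)."""
        s = np.arange(1, d + 1)
        return (np.sqrt(comb(d - 1, s - 1))
                * np.cos(np.pi * x / 2) ** (d - s)
                * np.sin(np.pi * x / 2) ** (s - 1))

    # For d = 2 this reduces to [cos(pi*x/2), sin(pi*x/2)]; the vector has unit
    # norm for any x, so an N-pixel image maps to a normalized product state.
    print(feature_map(0.3, d=2))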

MERA-inspired training algorithm

The TTN, which we denote by Ψ, can be written as a hierarchical tensor network of K layers (see Fig. 1 for an example), whose coefficients are given by the contraction of all its tensors,

\Psi^{\,l}_{s_1 s_2 \cdots s_N} = \sum_{\{\alpha\}} \prod_{k=1}^{K} \prod_{n=1}^{N_k} T^{[k,n]},    (2)

where N_k is the number of tensors in the k-th layer; the downward indexes of the first layer are the physical indexes s_1, ..., s_N, the upward index of the top tensor is the label index l, and all intermediate (virtual) indexes {α} shared between neighboring layers are summed over. The output for classifying the j-th sample is a label vector obtained by contracting the vectorized image (denoted by v^{[j]} for the j-th sample) with the TTN, which reads

\tilde{L}^{[j]}_{\,l} = \sum_{s_1 \cdots s_N} \Psi^{\,l}_{s_1 \cdots s_N}\, v^{[j]}_{s_1 \cdots s_N},    (3)

where \tilde{L}^{[j]} acts as the predicted label for the j-th sample. Based on this, we derive a highly efficient training algorithm inspired by MERA [22]. We choose the cost function to be minimized as the square error, defined as

f = \sum_{j=1}^{J} \big| \tilde{L}^{[j]} - L^{[j]} \big|^2,    (4)

where L^{[j]} is the given label of the j-th training sample and J is the number of training samples.

To proceed, let us rewrite the cost function in the following form:

f = \sum_{j=1}^{J} \tilde{L}^{[j]\dagger} \tilde{L}^{[j]} - 2 \sum_{j=1}^{J} \tilde{L}^{[j]\dagger} L^{[j]} + J.    (5)

The third term comes from the normalization of the labels L^{[j]}, and we assume the second term is always real.

The dominant computational cost comes from the first term. We borrow the idea of the MERA approach to reduce this cost. Mathematically speaking, the central idea is to impose that Ψ is orthogonal, i.e., Ψ†Ψ = I. Then Ψ is optimized with this condition satisfied in the subspace that is relevant to the classification. By imposing orthogonality only in this subspace, we do not require Ψ†Ψ to be the identity on the whole exponentially large space, but only on the subspace spanned by the training samples.

In MERA, a stronger constraint is used. With the TTN, each tensor has one upward and four downward indexes, which gives a non-square orthogonal matrix when the downward indexes are grouped into one larger index. Such tensors are called isometries and satisfy

\sum_{s_1 s_2 s_3 s_4} \big(T^{[k,n]}\big)^{*}_{\alpha' s_1 s_2 s_3 s_4} \big(T^{[k,n]}\big)_{\alpha s_1 s_2 s_3 s_4} = \delta_{\alpha' \alpha},

i.e., the identity is obtained after contracting all downward indexes with the conjugate tensor. When all the tensors are isometries, the TTN gives a unitary transformation that satisfies Ψ†Ψ = I; it compresses the d^N-dimensional input space to the low-dimensional label space.
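A minimal sketch of this isometric constraint for a single TTN tensor; building the tensor from the Q factor of a QR decomposition is an illustrative choice, not a prescription from the text:

    import numpy as np

    rng = np.random.default_rng(0)
    d_up, d_in = 4, 2      # upward (virtual) and downward (input) bond dimensions

    # Group the four downward indexes into one (2*2*2*2 = 16 here), take the Q factor
    # of a QR decomposition, and reshape back: the result is an isometry.
    q, _ = np.linalg.qr(rng.normal(size=(d_in ** 4, d_up)))
    T = q.T.reshape(d_up, d_in, d_in, d_in, d_in)

    # Isometry condition: contracting all four downward indexes with the conjugate
    # tensor gives the identity on the upward index.
    check = np.einsum("uabcd,vabcd->uv", T, T)
    print(np.allclose(check, np.eye(d_up)))   # True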

In this way, the first term becomes a constant, and we only need to deal with the second term. The cost function (up to an additive constant) becomes

f = -\sum_{j=1}^{J} \tilde{L}^{[j]\dagger} L^{[j]}.    (6)

Each term in Eq. (6) is simply a contraction of the tensor network, which can be computed efficiently.
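As an illustration of such contractions, a toy dense sketch of Eqs. (3) and (4) for a 4x4 image with d = 2 and two layers (the tensors are random and not yet isometric; a real implementation contracts the network more efficiently and keeps the tensors isometric as described next):

    import numpy as np

    d, D = 2, 4                                    # input and virtual bond dimensions
    rng = np.random.default_rng(0)

    def phi(x):                                    # feature map of Eq. (1) with d = 2
        return np.array([np.cos(np.pi * x / 2), np.sin(np.pi * x / 2)])

    # Toy 4x4 gray-scale image, mapped pixel-wise to d-dimensional vectors.
    img = rng.random((4, 4))
    v = np.array([[phi(x) for x in row] for row in img])           # shape (4, 4, d)

    # First layer: four tensors, each coarse-graining a 2x2 block of pixel vectors.
    layer1 = rng.normal(size=(2, 2, D, d, d, d, d))                # (block row, block col, up, 4 down)
    # Top tensor: maps the four coarse-grained vectors to the 2-dimensional label index.
    top = rng.normal(size=(2, D, D, D, D))

    def forward(v, layer1, top):
        """Contract the TTN with one vectorized image, Eq. (3)."""
        coarse = np.empty((2, 2, D))
        for i in range(2):
            for j in range(2):
                b = v[2 * i:2 * i + 2, 2 * j:2 * j + 2]            # 2x2 block of pixel vectors
                coarse[i, j] = np.einsum("uabcd,a,b,c,d->u", layer1[i, j],
                                         b[0, 0], b[0, 1], b[1, 0], b[1, 1])
        return np.einsum("labcd,a,b,c,d->l", top,
                         coarse[0, 0], coarse[0, 1], coarse[1, 0], coarse[1, 1])

    L_yes = np.array([1.0, 0.0])                                    # "yes" label vector
    out = forward(v, layer1, top)
    cost = np.sum((out - L_yes) ** 2)                               # Eq. (4), single sample
    print(out, cost)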

The tensors in the TTN are updated alternately to minimize Eq. (6). To update a given tensor T^{[k,n]}, for instance, we assume that all other tensors are fixed and define its environment tensor E^{[k,n]}, which is calculated by contracting everything in Eq. (6) after taking out T^{[k,n]} (Fig. 1) [25]. The cost function then becomes f = -Tr(T^{[k,n]\dagger} E^{[k,n]}), with both tensors reshaped into matrices by grouping their downward indexes. Under the constraint that T^{[k,n]} is an isometry, the optimal point is given by T^{[k,n]} = U V^\dagger, where U and V are calculated from the singular value decomposition E^{[k,n]} = U \Lambda V^\dagger. At this point, we have f = -\sum_a \Lambda_a.

Then, the update of one tensor amounts to the calculation of its environment tensor followed by a singular value decomposition. In the alternating process of updating all the tensors, some tricks are used to accelerate the computations. The idea is to save intermediate results to avoid repeated calculations by taking advantage of the tree structure. Another important detail is to normalize the vector obtained each time four vectors are contracted with a tensor.
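A minimal sketch of a single update step, assuming the environment tensor has already been computed and has the same index structure as the tensor being updated; building the environment requires contracting the rest of the network with the training data and is not shown here:

    import numpy as np

    def update_tensor(env):
        """Optimal isometric tensor for a given environment tensor (upward index first).

        Writing the cost as f = -Tr(T^dagger E) + const over isometric T, the optimum
        is T = U V^dagger from the SVD of E reshaped into an (up, down) matrix, and the
        minimal cost is minus the sum of the singular values (plus the constant)."""
        up = env.shape[0]
        u, s, vh = np.linalg.svd(env.reshape(up, -1), full_matrices=False)
        return (u @ vh).reshape(env.shape), -np.sum(s)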

Both the time complexity and the space complexity of the algorithm scale polynomially with the dimension of the input vectors, the input bond dimension, the virtual bond dimension, and the number of training inputs.

Multi-class classification

The strategy for building a multi-class classifier is the one-against-all scheme of machine learning. For each class, we train one TTN so that it recognizes whether an image belongs to that class or not. The output of Eq. (3) is a two-dimensional vector; we fix the label vector for a "yes" answer to one of the two basis vectors, say L^yes = (1, 0). For the 10 image classes, we accordingly have 10 trained TTNs, each satisfying the orthogonality condition described above. Then, for recognizing the j-th sample, we introduce a 10-dimensional vector \hat{L}^{[j]}, whose p-th element is defined as the inner product between the "yes" component of the p-th TTN (i.e., the state Φ_p) and the vectorized image,

\hat{L}^{[j]}_{p} = \langle \Phi_p | v^{[j]} \rangle.    (7)

The position of the maximal element of \hat{L}^{[j]} gives the class to which the image belongs.
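A minimal sketch of this decision rule; the helper yes_overlap is hypothetical and stands for the contraction of the "yes" component of the p-th trained TTN with the vectorized image:

    import numpy as np

    def classify(image_vec, yes_overlap, num_classes=10):
        """One-against-all decision rule of Eq. (7): pick the class with the largest overlap."""
        scores = np.array([yes_overlap(p, image_vec) for p in range(num_classes)])
        return int(np.argmax(scores))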

Acknowledgments

SJR is grateful to Ivan Glasser and Nicola Pancotti for stimulating discussions. DL was supported by the China Scholarship Council (201609345008), the National Natural Science Key Foundation of China (61433015), and the National Natural Science Foundation of China (61771340). SJR, PW, and ML acknowledge support from the Spanish Ministry of Economy and Competitiveness (Severo Ochoa Programme for Centres of Excellence in R&D SEV-2015-0522), Fundació Privada Cellex, and the Generalitat de Catalunya CERCA Programme. SJR and ML were further supported by ERC AdG OSYRIS (ERC-2013-AdG Grant No. 339106), the Spanish MINECO grants FOQUS (FIS2013-46768-P), FISICATEAMO (FIS2016-79508-P), Catalan AGAUR SGR 874, EU FETPRO QUIC, EQuaM (FP7/2007-2013 Grant No. 323714), and the Fundació Catalunya - La Pedrera Ignacio Cirac Program Chair. PW acknowledges financial support from the ERC (Consolidator Grant QITBOX), QIBEQI (FIS2016-80773-P), and a hardware donation by Nvidia Corporation. GS and CP were supported by the MOST of China (Grant No. 2013CB933401), the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDB07010100), and the NSFC (Grant No. 14474279). CP thanks ICFO (Spain) for the hospitality during her visit and gratefully acknowledges financial support from UCAS and ICFO.

References

  • [1] Andreas Trabesinger, J. Ignacio Cirac, Peter Zoller, Immanuel Bloch, Jean Dalibard, and Sylvain Nascimbéne. Nature physics insight - quantum simulation. Nature Physics, 8, 2012.
  • [2] Andrew Steane. Quantum computing. Reports on Progress in Physics, 61(2):117–173, 1998.
  • [3] Emanuel Knill. Physics: quantum computing. Nature, 463(7280):441–443, 2010.
  • [4] Iulia Buluta, Sahel Ashhab, and Franco Nori. Natural and artificial atoms for quantum computation. Reports on Progress in Physics, 74(10):104401, 2011.
  • [5] Frank Verstraete, Valentin Murg, and J. Ignacio Cirac. Matrix product states, projected entangled pair states, and variational renormalization group methods for quantum spin systems. Advances in Physics, 57:143–224, 2008.
  • [6] Román Orús. A practical introduction to tensor networks: Matrix product states and projected entangled pair states. Annals of Physics, 349:117, 2014.
  • [7] Román Orús. Advances on tensor network theory: symmetries, fermions, entanglement, and holography. The European Physical Journal B, 87(11):280, November 2014.
  • [8] Shi-Ju Ran, Emanuele Tirrito, Cheng Peng, Xi Chen, Gang Su, and Maciej Lewenstein. Review of tensor network contraction approaches. 2017.
  • [9] Andrzej Cichocki, Namgil Lee, Ivan Oseledets, Anh-Huy Phan, Qibin Zhao, Danilo P. Mandic, and Others. Tensor networks for dimensionality reduction and large-scale optimization: Part 1 low-rank tensor decompositions. Foundations and Trends® in Machine Learning, 9(4-5):249–429, 2016.
  • [10] Andrzej Cichocki, Anh-Huy Phan, Qibin Zhao, Namgil Lee, Ivan Oseledets, Masashi Sugiyama, Danilo P. Mandic, and Others. Tensor networks for dimensionality reduction and large-scale optimization: Part 2 applications and future perspectives. Foundations and Trends® in Machine Learning, 9(6):431–673, 2017.
  • [11] E. Miles Stoudenmire and David J. Schwab. Supervised learning with tensor networks. Advances in Neural Information Processing Systems, 29:4799–4807, 2016.
  • [12] Zhao-Yu Han, Jun Wang, Heng Fan, Lei Wang, and Pan Zhang. Unsupervised generative modeling using matrix product states. September 2017.
  • [13] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.
  • [14] Cédric Bény. Deep learning and the renormalization group. January 2013.
  • [15] Yi-Zhuang You, Zhao Yang, and Xiao-Liang Qi. Machine learning spatial geometry from entanglement features. September 2017.
  • [16] Wen-Cong Gan and Fu-Wen Shu. Holography as deep learning. 2017.
  • [17] Yoav Levine, David Yakira, Nadav Cohen, and Amnon Shashua. Deep learning and quantum entanglement: Fundamental connections with implications to network design. 2017.
  • [18] Giuseppe Carleo and Matthias Troyer. Solving the quantum many-body problem with artificial neural networks. Science, 355(6325):602–606, February 2017.
  • [19] Jing Chen, Song Cheng, Haidong Xie, Lei Wang, and Tao Xiang. On the equivalence of restricted Boltzmann machines and tensor network states. 2017.
  • [20] Yichen Huang and Joel E. Moore. Neural network representation of tensor network and chiral states. 2017.
  • [21] Ivan Glasser, Nicola Pancotti, Moritz August, Ivan D. Rodriguez, and J. Ignacio Cirac. Neural networks quantum states, string-bond states and chiral topological states. October 2017.
  • [22] Guifre Vidal. Entanglement renormalization. Physical Review Letters, 99:220405, 2007.
  • [23] Guifre Vidal. Class of quantum many-body states that can be efficiently simulated. Physical Review Letters, 101:110501, 2008.
  • [24] Lukasz Cincio, Jacek Dziarmaga, and Marek M. Rams. Multiscale entanglement renormalization ansatz in two dimensions: quantum Ising model. Physical Review Letters, 100:240603, 2008.
  • [25] Glen Evenbly and Guifre Vidal. Algorithms for entanglement renormalization. Physical Review B, 79:144108, April 2009.
  • [26] Valentin Murg, Frank Verstraete, Reinhold Schneider, Péter R. Nagy, and Örs Legeza. Tree tensor network state with variable tensor order: An efficient multireference method for strongly correlated systems. Journal of Chemical Theory and Computation, 11:1027–1036, 2015.
  • [27] Yuan-Yuan Zhao, Zhibo Hou, Guo-Yong Xiang, Yong-Jian Han, Chuan-Feng Li, and Guang-Can Guo. Experimental demonstration of efficient quantum state tomography of matrix product states. Opt. Express, 25(8):9010–9018, Apr 2017.
  • [28] Elina Robeva and Anna Seigal. Duality of graphical models and tensor networks. 2017.
  • [29] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(85):2579–2605, 2008.
  • [30] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, volume 25, pages 1097–1105. 2012.
  • [31] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, July 2006.
  • [32] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
  • [33] Charles H. Bennett and David P. DiVincenzo. Quantum information and computation. Nature, 404:247–255, 2000.
  • [34] Seth Lloyd. Universal quantum simulators. Science, 273(5278):1073–1078, 1996.