Kernel methods revolutionised pattern recognition and machine learning in the 1990s with the class of support vector machines (SVMs), based on the fundamental insight that difficult problems in low dimensions may become easier when lifted to high dimensional spaces (Boser et al., 1992; Cortes and Vapnik, 1995; Hofmann et al., 2008).
An efficient approach to dealing with such high dimensional spaces is offered by tensor networks, of which tensor trains are a prominent example. Tensor networks are factorisations of high dimensional tensors into networks of low rank tensors, and come with a class of efficient algorithms to perform these approximations (Oseledets, 2011; Bridgeman and Chubb, 2017). The number of parameters needed to specify an $N$-dimensional tensor using tensor networks can be drastically reduced, from exponential in $N$ to polynomial in $N$ (Perez-Garcia et al., 2006).
Recently, there has been increasing interest in using tensor networks in the context of supervised machine learning, specifically focused on image classification tasks (Stoudenmire and Schwab, 2016; Efthymiou et al., 2019). These methods rely on transforming the 2-d input images into 1-d vectors before encoding them into high dimensional spaces. As a consequence of this flattening, these methods are constrained to work with images of small spatial resolution, and focus on employing improved flattening strategies to maximize the retained pixel correlation (Stoudenmire and Schwab, 2016; Efthymiou et al., 2019). For small enough images (as in the MNIST or Fashion MNIST datasets) there is some residual correlation in the flattened images which can be exploited using tensor networks. In medical imaging tasks, however, images with such low spatial resolutions are rarely encountered. Further, the information lost by flattening can be crucial in medical imaging tasks, as many decisions depend on the global structure of the pixels.
In this work, we extend the use of tensor networks to the classification of medical images. We propose the locally orderless tensor network, or LoTeNet (pronounced “low tenet”), inspired by the classical theory of locally orderless images of Koenderink and Van Doorn (1999). According to this theory, statistics from small neighbourhoods in images can be derived by ignoring the local order of pixels, while the global structure is still captured by operating at different scales. The proposed LoTeNet model performs linear classification in high dimensional spaces and is optimized end-to-end by backpropagating the error signal through the tensor network. We present experiments on two medical imaging datasets: the PCam dataset with histopathology images (Veeling et al., 2018) and the LIDC-IDRI dataset with thoracic CT images (Armato III et al., 2004). Our model fares comparably to state-of-the-art deep learning models while using fewer model hyperparameters and a fraction of the GPU memory of its CNN counterparts.
2 Background and Problem Formulation
2.1 Tensor Network Notations
Tensor networks and operations on them are described using an intuitive graphical notation, introduced in Penrose (1971). Figure 1 (left) shows the commonly used notations for a scalar, a vector, a matrix and a general 3-D tensor. The number of dimensions of a tensor is captured by the number of edges emanating from its node, denoted by the edge indices. For instance, a vector has one dimension indicated by a single edge, a 3-D tensor has three indices depicted by three edges, and so on.
Operations on high dimensional tensors can be succinctly captured using tensor networks, as shown in Figure 1 (center), where a matrix product is depicted; this operation is also known as tensor contraction. The edge shared between the two tensor nodes is the dimension summed over in the matrix multiplication, which is subsumed in the resulting tensor. A more thorough introduction to tensor notations can be found in Bridgeman and Chubb (2017).
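As a concrete instance of the graphical rules, matrix multiplication and other contractions can be written with `numpy.einsum`, where each contracted (shared) edge corresponds to a repeated index; the sizes below are illustrative only.

```python
import numpy as np

# Matrix product as a tensor contraction: the shared edge (index k)
# between A and B is summed over, leaving a 2-D tensor C with indices i, j.
A = np.arange(6.0).reshape(2, 3)   # indices (i, k)
B = np.arange(12.0).reshape(3, 4)  # indices (k, j)
C = np.einsum("ik,kj->ij", A, B)   # contract over k

assert np.allclose(C, A @ B)

# A 3-way contraction: contracting a 3-D tensor with two vectors
# collapses two edges, leaving a vector along the remaining edge.
T = np.random.rand(2, 3, 4)        # indices (i, j, k)
u, v = np.random.rand(2), np.random.rand(4)
w = np.einsum("ijk,i,k->j", T, u, v)
assert w.shape == (3,)
```

In the graphical notation, each `einsum` subscript string is exactly a list of the edges at each node, with repeated letters marking the edges that are glued together and summed.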
2.2 Linear model in high dimensions
A linear model in a sufficiently high dimensional space can be very powerful (Novikov et al., 2016). In SVMs, this is accomplished by the implicit
mapping of the input data into an infinite dimensional space using radial basis function kernels (Hofmann et al., 2008). In this section, we describe the procedure followed in this work to map the input data into a higher dimensional space.
Consider an input vector $\mathbf{x} \in [0,1]^N$, which can be obtained from a flattened 2-d image with $N$ pixels in total and intensity values normalized to the interval $[0,1]$. A commonly used feature map for tensor networks is obtained by taking the tensor product of pixel-wise feature maps (Stoudenmire and Schwab, 2016):
$$\Phi(\mathbf{x}) = \phi(x_1) \otimes \phi(x_2) \otimes \dots \otimes \phi(x_N),$$
where $\phi(x_j)$ denotes the local feature map acting on pixel $x_j$. The local feature map is $d$-dimensional and usually a simple non-linear function, additionally restricted to have unit norm so that the joint feature map $\Phi(\mathbf{x})$ also has unit norm. A widely used local feature map with $d=2$, inspired by quantum wave function analysis, is (Stoudenmire and Schwab, 2016):
$$\phi(x_j) = \Big[\cos\big(\tfrac{\pi}{2}x_j\big),\; \sin\big(\tfrac{\pi}{2}x_j\big)\Big].$$
The dimensionality of the joint feature map $\Phi(\mathbf{x})$ is $d^N$ due to the tensor products, as each local feature map $\phi$ is of dimensionality $d$. The joint feature map thus maps each image to a vector in the $d^N$-dimensional feature space. For an RGB image, or other image modalities with multiple input channels as commonly encountered in medical imaging, the local feature map can be applied to each channel separately, which further increases the dimension of the resulting space (Efthymiou et al., 2019).
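The local and joint feature maps can be sketched directly in NumPy. Note that the joint map is $d^N$-dimensional, so materialising it explicitly, as below, is feasible only for a toy number of pixels; the helper names are illustrative.

```python
import numpy as np

def local_feature_map(x):
    """d = 2 local map; has unit norm for any x since cos^2 + sin^2 = 1."""
    return np.array([np.cos(np.pi * x / 2.0), np.sin(np.pi * x / 2.0)])

def joint_feature_map(pixels):
    """Tensor product of pixel-wise maps: an N-pixel image becomes a
    2**N-dimensional unit-norm vector (tractable here only for tiny N)."""
    phi = np.array([1.0])
    for x in pixels:
        phi = np.kron(phi, local_feature_map(x))
    return phi

pixels = np.array([0.0, 0.5, 1.0, 0.25])      # a toy 4-pixel "image"
Phi = joint_feature_map(pixels)
assert Phi.shape == (2 ** len(pixels),)       # dimensionality d**N = 16
assert np.isclose(np.linalg.norm(Phi), 1.0)   # unit norm is preserved
```

The unit norm of the joint map follows because the Kronecker product of unit-norm vectors is itself unit norm.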
Given the high dimensional feature map $\Phi(\mathbf{x})$ for the input data $\mathbf{x}$, a decision rule for a multi-class classification task can be formulated as a linear model of the form:
$$f^{\ell}(\mathbf{x}) = W^{\ell} \cdot \Phi(\mathbf{x}),$$
where $\ell$ indexes the classes, and the weight tensor $W^{\ell}$ is an $(N+1)$-dimensional tensor with output tensor index $\ell$.
In tensor notation, this linear model is depicted in Figure 2 (Step 1), where the first column of gray nodes are the individual pixel feature maps of feature dimension $d$. The pixel feature maps are connected to the weight tensor $W$ along their shared edges, and $W$ has one additional output dimension marked with index $\ell$.
The $(N+1)$-dimensional weight tensor $W$ comprises on the order of $d^N$ weights per class. Even for a relatively small gray scale image, the total number of components in $W$ can be massive, since it grows exponentially with the number of pixels $N$. In the next section we will see how tensor networks can represent such high dimensional tensors with a number of parameters that grows linearly with $N$ instead of exponentially.
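A back-of-the-envelope comparison makes the gap concrete. All sizes below are illustrative, and `mps_params` uses a simplified count of $N \cdot d \cdot D^2$ for a tensor-network factorisation with bond dimension $D$ (introduced in Section 2.3), ignoring the smaller boundary cores.

```python
# Exponential vs. linear parameter growth for the weight tensor W.
d, N, D, L = 2, 1024, 5, 2   # feature dim, pixels, bond dim, classes (illustrative)

dense_params = L * d ** N              # full W: one index of size d per pixel, L outputs
mps_params = N * d * D * D + L * D     # simplified MPS count (hedged estimate)

assert dense_params > 10 ** 300        # astronomically large, never materialisable
assert mps_params < 10 ** 5            # linear in N, trivially storable
```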
2.3 Matrix Product State (MPS)
Consider two 1-d vectors $u$ and $v$ with dimension indices $i$ and $j$, respectively. The tensor (outer) product of these two vectors yields a 2-d matrix with entries $M_{ij} = u_i v_j$. The matrix product state (MPS) (Perez-Garcia et al., 2006) is a type of tensor network that expands on this notion, allowing the factorization of an $N$-dimensional tensor (with $N$ edge indices) into a chain of rank-3 tensors (with three edge indices each), except at the borders where they are of rank 2, as shown in Figure 2 (Step 2). Concretely, the weight tensor $W^{\ell}$ with indices $i_1, \dots, i_N$ can be approximated using lower rank tensors as
$$W^{\ell}_{i_1 i_2 \dots i_N} = \sum_{\alpha_1, \dots, \alpha_{N-1}} A^{i_1}_{\alpha_1} A^{i_2}_{\alpha_1 \alpha_2} \cdots A^{\ell, i_j}_{\alpha_{j-1} \alpha_j} \cdots A^{i_N}_{\alpha_{N-1}}.$$
The subscript indices $\alpha_1, \dots, \alpha_{N-1}$ are the virtual indices that are contracted; they are of dimension $D$, which is known as the bond dimension. The components of the intermediate lower rank tensors $A$ are the parameters of the MPS approximation. The placement of the output dimension $\ell$ on the $j$-th tensor is an arbitrary choice and can be adapted during optimisation (Stoudenmire and Schwab, 2016). Note that any $N$-dimensional tensor can be represented exactly using an MPS if $D$ is allowed to grow as large as $d^{N/2}$, where $d$ is the feature dimension. In most applications, however, $D$ is fixed to a small value or allowed to adapt dynamically when the MPS is used to approximate a high dimensional tensor (Perez-Garcia et al., 2006; Miller, 2019).
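A minimal NumPy sketch of evaluating the linear model with an MPS-factorised weight tensor follows. The core shapes, the placement of the output node and the left-to-right contraction order follow the description above, but all sizes, the random cores and the helper names are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, D, n_classes = 8, 2, 4, 2   # pixels, local feature dim, bond dim, outputs (toy sizes)

# Random MPS cores: rank-2 at the boundaries, rank-3 inside; a separate core
# carries the extra output index of size n_classes (placed near the centre).
cores = [rng.normal(size=(d, D))]                        # left boundary (i1, a1)
cores += [rng.normal(size=(D, d, D)) for _ in range(N - 2)]
cores += [rng.normal(size=(D, d))]                       # right boundary (a_{N-1}, iN)
out_core = rng.normal(size=(D, n_classes, D))            # output node with index l

def local_map(x):
    return np.array([np.cos(np.pi * x / 2), np.sin(np.pi * x / 2)])

def mps_forward(pixels):
    """Contract left-to-right: cost O(N d D^2) instead of O(d^N)."""
    msg = local_map(pixels[0]) @ cores[0]                # running message, shape (D,)
    for j in range(1, N):
        if j == N // 2:                                  # absorb the output core
            msg = np.einsum("a,aob->ob", msg, out_core)  # shape (n_classes, D)
        core = cores[j]
        if core.ndim == 3:
            mat = np.einsum("aib,i->ab", core, local_map(pixels[j]))
        else:                                            # right boundary core
            mat = np.einsum("ai,i->a", core, local_map(pixels[j]))[:, None]
        msg = msg @ mat
    return msg.squeeze()                                 # one score per class

scores = mps_forward(rng.uniform(size=N))
assert scores.shape == (n_classes,)
```

Each step only ever multiplies small matrices, which is why the cost stays linear in the number of pixels.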
The decision function $f^{\ell}(\mathbf{x})$ can now be computed using the MPS approximation of $W^{\ell}$, depicted in Figure 2 (Step 2). The order in which the tensor contractions are performed determines the computational cost. The original MPS algorithm (Perez-Garcia et al., 2006) starts from one of the ends, contracts a pair of tensors to obtain a new tensor, which is then contracted with the next tensor, and this process is repeated until the output tensor is reached. The cost of this algorithm scales as $O(N d D^2)$, compared to a cost scaling as $O(d^N)$ without the MPS approximation. In this work we use the MPS implementation in Miller (2019), which contracts the horizontal edges in parallel and then proceeds to contract the resulting tensors vertically, as depicted in Figure 2 (Step 3), yielding improved approximations (Efthymiou et al., 2019).
Recently proposed tensor network models for image classification flatten entire 2-d images into 1-d vectors using different raveling strategies (Han et al., 2018; Efthymiou et al., 2019). In contrast to these methods, we only flatten small regions of the images, which can be assumed to be locally orderless, and derive useful features from them using MPS operations (Koenderink and Van Doorn, 1999). We process these locally orderless regions using a hierarchical MPS tensor network, which we call the locally orderless tensor network (LoTeNet), shown in Figure 4. This enables our model to handle larger images without losing their global structure.
3.1 Locally orderless Tensor Network (LoTeNet)
The locally orderless image regions are created in two steps. First, the squeeze operation illustrated in Figure 3 is applied on $k \times k$ image patches, where $k$ is the stride of the square kernel. This operation rearranges pixels in spatially local regions and stacks them along the feature dimension. Similar strategies have been used in normalizing flow models to provide spatial context via the feature dimensions, such as in Dinh et al. (2017). The stride $k$ of the kernel decides the extent of reduction in spatial dimensions and the corresponding increase in feature dimension of the squeezed image. In the second step, the squeezed image, with its inflated feature dimension, is flattened from 2-d to 1-d. Flattening these local regions with spatial information along the feature axis provides our model with additional structural information. Further, the increase in feature dimension makes the tensor network more flexible, as it increases the dimensionality of the feature space (Stoudenmire and Schwab, 2016).
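The squeeze operation is a space-to-depth rearrangement and can be sketched in a few lines of NumPy; `squeeze` here is an illustrative helper, not the implementation used in the paper.

```python
import numpy as np

def squeeze(image, k):
    """Space-to-depth with stride k: (H, W, C) -> (H//k, W//k, C*k*k).
    Pixels of each k x k patch move onto the feature axis, discarding
    their local order but keeping all the information."""
    H, W, C = image.shape
    assert H % k == 0 and W % k == 0
    x = image.reshape(H // k, k, W // k, k, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H // k, W // k, C * k * k)

img = np.arange(4 * 4 * 1, dtype=float).reshape(4, 4, 1)
sq = squeeze(img, 2)
assert sq.shape == (2, 2, 4)
# The top-left 2x2 patch {0, 1, 4, 5} now lives along the feature axis.
assert set(sq[0, 0]) == {0.0, 1.0, 4.0, 5.0}
```

The spatial grid shrinks by a factor of $k$ per side while the feature dimension grows by $k^2$, which is exactly the trade-off the text describes.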
Consider the input image to the first MPS layer in Figure 4, with grids marking the different patches. Each of these patches is squeezed and input to an MPS block. The MPS block first embeds the squeezed vectors into the joint feature space via the local and joint feature maps of Section 2.2. Then, each image patch in the high dimensional feature space is contracted to output a vector. In our model, the dimension of this output vector is set to be the same as the bond dimension $D$. The functionality of the MPS blocks can thus be interpreted as summarising each patch with a vector of size $D$ using a linear model in the high dimensional feature space.
The output vectors from all the MPS blocks at a given layer are reshaped back into the 2-d image space. However, due to the MPS contractions, the intermediate image map after the first layer is of lower resolution, as indicated by the smaller image with fewer patches in Figure 4. This is analogous to obtaining an average pooled version of the intermediate feature maps in CNN operations. The smaller 2-d maps formed after the first layer of MPS blocks are further squeezed and contracted in the next layer of the model. This process is continued for all layers, and the final MPS block performs the decision contraction of the linear model in Section 2.2.
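As a rough illustration of how the hierarchy trades spatial resolution for feature content, the following sketch only tracks feature-map shapes across layers; all sizes (input resolution, kernel $k$, bond dimension $D$, layer count) are hypothetical, and the actual values used are given in Section 4.

```python
# Shape bookkeeping for the hierarchy: at each layer a k x k squeeze folds
# pixels into the feature axis, and an MPS block contracts each flattened
# patch down to a D-dimensional vector that becomes one "pixel" of the
# next, smaller feature map.
def layer_shapes(H, W, C, k, D, n_layers):
    shapes = [(H, W, C)]
    for _ in range(n_layers):
        H, W = H // k, W // k   # squeeze + contraction shrink the spatial grid
        C = D                   # each patch is summarised by a D-vector
        shapes.append((H, W, C))
    return shapes

# Hypothetical sizes: 96 x 96 single-channel input, k = 2, D = 5, three layers.
print(layer_shapes(96, 96, 1, 2, 5, 3))
```

Because each layer's output is already a small contracted tensor, the intermediate maps never grow, in contrast to the activation maps of a CNN.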
3.2 Model Optimization
The components of the weight matrix $W$ (the parameters of the model) are approximated using the layers of MPS blocks as described in Section 3.1. We view the sequence of MPS contractions in successive layers of our model as the forward pass and rely on automatic differentiation in PyTorch (Paszke et al., 2019) to compute the backward computation graph (Efthymiou et al., 2019). The TorchMPS module (Miller, 2019) is used to learn the MPS parameters from the training data in an end-to-end fashion.
4 Data and Experiments
We perform experiments on two publicly available datasets, with the task in both cases formulated as binary classification: metastasis detection from histopathologic scans, and detection of nodules in thoracic computed tomography (CT) scans.
PCam Dataset: The PatchCamelyon (PCam) dataset is a recently introduced binary image classification dataset (Veeling et al., 2018). Image patches of size 96 × 96 px are extracted from the Camelyon16 Challenge dataset (Bejnordi et al., 2017). A positive label indicates the presence of at least one pixel of tumour tissue in the central 32 × 32 px region, and a negative label indicates the absence of tumour, as shown in Figure 5. Patch extraction is performed to ensure the class balance is close to 50/50. We use the modified PCam dataset from the Kaggle challenge (https://www.kaggle.com/c/histopathologic-cancer-detection), which excludes duplicate image patches and consists of about 220,000 patches, further split for training and validation purposes. An independent test set is provided for evaluating the models. All image planes are intensity normalized. We use random rotations and random horizontal and vertical flips on the training data for data augmentation.
LIDC Dataset: The LIDC-IDRI dataset comprises 1018 thoracic CT images with lesions annotated by four radiologists (Armato III et al., 2004). Similar to Kohl et al. (2018) and Baumgartner et al. (2019), we extract image patches centered on a lesion and use the preprocessed data from Knegt (2018) (shown in Figure 6 in the Appendix). To transform this into a classification task, we pose it as predicting the presence or absence of lesions based on the four annotations. We assign a patch a positive label if more than two (i.e., at least three) radiologists have annotated the presence of a lesion, and a negative label otherwise. The binary task thus amounts to capturing the majority vote amongst the radiologists. We split the dataset into training, validation and hold-out test sets.
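The label construction above reduces to a one-line majority-vote rule; `majority_label` is an illustrative helper, not part of the released preprocessing code.

```python
# Binary label from four radiologist annotations (1 = lesion marked):
# positive only when at least three of the four mark a lesion.
def majority_label(annotations):
    assert len(annotations) == 4
    return int(sum(annotations) >= 3)

assert majority_label([1, 1, 1, 0]) == 1
assert majority_label([1, 1, 0, 0]) == 0   # a 2-2 split counts as negative
assert majority_label([1, 1, 1, 1]) == 1
```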
4.2 Experiments and Results
The proposed model (LoTeNet) is evaluated with a fixed number of layers and a fixed kernel size for the squeeze operation, and is implemented based on the efficient MPS implementation in Miller (2019). The only critical hyperparameter inherent to LoTeNet is its bond dimension $D$, which was chosen based on performance on the PCam validation set (Figure 8 in the Appendix). We used the Adam optimizer (Kingma and Ba, 2014) with a fixed learning rate and batch size. Models were assumed to have converged if there was no improvement in validation AUC over several consecutive epochs, and the model with the best validation performance was used to predict on the test set. All experiments were run on a single Tesla K80 GPU with 12 GB memory. The same settings were used for the experiments on the LIDC dataset.
We compare the performance of our model with a DenseNet baseline as described in Huang et al. (2017), and also with the single layer MPS model in Efthymiou et al. (2019), reported as Tensor Net-X. Additionally, we compare performance on the PCam dataset to the rotation equivariant CNN method which also introduced the dataset (Veeling et al., 2018). We report the area under the ROC curve (AUC) as the metric to compare the different models.
The test set performances for the PCam and LIDC datasets, along with the relevant comparison methods, are reported in Table 1. We notice that LoTeNet attains an AUC on the PCam dataset that is comparable with the methods of Veeling et al. (2018) and Dinh et al. (2017). There is a clearer improvement over DenseNet on the LIDC dataset. Further, LoTeNet outperforms Tensor Net-X with its single layer MPS (Efthymiou et al., 2019) on both datasets. Per-epoch training and validation times for LoTeNet on both datasets are on the order of minutes.
5 Discussion and Conclusions
The most important hyperparameter of any MPS model is its bond dimension $D$, as it controls the quality of the approximation of the high dimensional tensor. In our proposed model, LoTeNet, which is composed of multiple layers of MPS blocks, we noticed the performance to be robust to changes in $D$ (Figure 8 in the Appendix). This is consistent with other findings, where increasing the bond dimension beyond a certain point has been shown to have no impact on the performance of the models (Efthymiou et al., 2019). Due to the distributed nature of the approximation in LoTeNet across several layers, this effect is all the more pronounced: we find only minor fluctuations in the performance of the model and can get away with a much smaller $D$.
The results reported for the LIDC-IDRI dataset in Table 1 are based on the model configuration (including hyperparameters such as learning rate and batch size) obtained for the PCam dataset. This further strengthens the case for Tensor Network based methods as they can be easily transferred to different datasets.
In Table 1, we also report the GPU memory requirement for each of the models. Tensor network models require only a fraction of the memory utilised by the corresponding DenseNet or Rotation Eq-CNN models, even though the number of parameters in LoTeNet is higher than in the other two models (Veeling et al., 2018). This drastic reduction in GPU memory utilisation is because tensor networks do not maintain massive intermediate feature maps, unlike CNNs, which use a large chunk of GPU memory mainly to store intermediate feature maps (Rhu et al., 2016). As the entire pipeline of LoTeNet is based on contracting input data into smaller tensors, its memory consumption does not grow with successive contracted layers. This can be an important feature in medical imaging applications, as larger images and larger batch sizes can be processed.
In conclusion, the proposed model, LoTeNet, overcomes the loss of global structure due to flattening in tensor networks using locally orderless regions that are added to the feature dimension of the input image. By using a hierarchical approach, the model also retains the global structure. We have demonstrated the ability of the model to perform classification on two publicly available datasets, yielding performance comparable to state-of-the-art deep learning models – using fewer model hyperparameters and substantially smaller GPU memory consumption.
The authors would like to thank Silas Ørting and Mathias Perslev for their useful feedback on the manuscript. The authors also thank the four anonymous reviewers and the Area Chair for their insightful feedback which has strengthened the manuscript enormously.
- Armato III et al.  Samuel G Armato III, Geoffrey McLennan, Michael F McNitt-Gray, Charles R Meyer, David Yankelevitz, Denise R Aberle, Claudia I Henschke, Eric A Hoffman, Ella A Kazerooni, Heber MacMahon, et al. Lung image database consortium: developing a resource for the medical imaging research community. Radiology, 232(3):739–748, 2004.
- Baumgartner et al.  Christian F Baumgartner, Kerem C Tezcan, Krishna Chaitanya, Andreas M Hötker, Urs J Muehlematter, Khoschy Schawkat, Anton S Becker, Olivio Donati, and Ender Konukoglu. Phiseg: Capturing uncertainty in medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 119–127. Springer, 2019.
- Bejnordi et al.  Babak Ehteshami Bejnordi, Mitko Veta, Paul Johannes Van Diest, Bram Van Ginneken, Nico Karssemeijer, Geert Litjens, Jeroen AWM Van Der Laak, Meyke Hermsen, Quirine F Manson, Maschenka Balkenhol, et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. Jama, 318(22):2199–2210, 2017.
- Boser et al.  Bernhard E Boser, Isabelle M Guyon, and Vladimir N Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152, 1992.
- Bridgeman and Chubb  Jacob C Bridgeman and Christopher T Chubb. Hand-waving and interpretive dance: an introductory course on tensor networks. Journal of Physics A: Mathematical and Theoretical, 50(22):223001, 2017.
- Cortes and Vapnik  Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.
- Dinh et al.  Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. In International Conference on Learning Representations, 2017.
- Efthymiou et al.  Stavros Efthymiou, Jack Hidary, and Stefan Leichenauer. TensorNetwork for Machine Learning. arXiv preprint arXiv:1906.06329, 2019.
- Han et al.  Zhao-Yu Han, Jun Wang, Heng Fan, Lei Wang, and Pan Zhang. Unsupervised generative modeling using matrix product states. Physical Review X, 8(3):031012, 2018.
- Hofmann et al.  Thomas Hofmann, Bernhard Schölkopf, and Alexander J Smola. Kernel methods in machine learning. The annals of statistics, pages 1171–1220, 2008.
- Huang et al.  Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
- Kingma and Ba  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Knegt  Stefan Knegt. A Probabilistic U-Net for segmentation of ambiguous images implemented in PyTorch. https://github.com/stefanknegt/Probabilistic-Unet-Pytorch, 2018.
- Koenderink and Van Doorn  Jan J Koenderink and Andrea J Van Doorn. The structure of locally orderless images. International Journal of Computer Vision, 31(2-3):159–168, 1999.
- Kohl et al.  Simon Kohl, Bernardino Romera-Paredes, Clemens Meyer, Jeffrey De Fauw, Joseph R Ledsam, Klaus Maier-Hein, SM Ali Eslami, Danilo Jimenez Rezende, and Olaf Ronneberger. A probabilistic u-net for segmentation of ambiguous images. In Advances in Neural Information Processing Systems, pages 6965–6975, 2018.
- Miller  Jacob Miller. TorchMPS. https://github.com/jemisjoky/torchmps, 2019.
- Novikov et al.  Alexander Novikov, Mikhail Trofimov, and Ivan Oseledets. Exponential machines. arXiv preprint arXiv:1605.03795, 2016.
- Oseledets  Ivan V Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011.
- Paszke et al.  Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035, 2019.
- Penrose  Roger Penrose. Applications of negative dimensional tensors. Combinatorial mathematics and its applications, 1:221–244, 1971.
- Perez-Garcia et al.  David Perez-Garcia, Frank Verstraete, Michael M Wolf, and J Ignacio Cirac. Matrix product state representations. arXiv preprint quant-ph/0608197, 2006.
- Rhu et al.  Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and Stephen W Keckler. vdnn: Virtualized deep neural networks for scalable, memory-efficient neural network design. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1–13. IEEE, 2016.
- Stoudenmire and Schwab  Edwin Stoudenmire and David J Schwab. Supervised learning with tensor networks. In Advances in Neural Information Processing Systems, pages 4799–4807, 2016.
- Veeling et al.  Bastiaan S Veeling, Jasper Linmans, Jim Winkens, Taco Cohen, and Max Welling. Rotation equivariant CNNs for digital pathology. In International Conference on Medical image computing and computer-assisted intervention, pages 210–218. Springer, 2018.