Deep Tensor Encoding

03/18/2017 ∙ by B Sengupta, et al. ∙ University of Cambridge ∙ Cortexica Vision Systems Ltd

Learning an encoding of feature vectors in terms of an over-complete dictionary or an information-geometric construct (Fisher vectors) is widespread in statistical signal processing and computer vision. In content-based information retrieval using deep-learning classifiers, such encodings are learnt on the flattened last layer, without adherence to the multi-linear structure of the underlying feature tensor. We illustrate a variety of feature encodings, including sparse dictionary coding and Fisher vectors, and propose a structured tensor factorization scheme that enables retrieval on par, in terms of average precision, with Fisher vector encoded image signatures. In short, we illustrate how structural constraints increase retrieval fidelity.




1. Introduction

The success of deep learning lies in constructing feature spaces wherein competing classes of objects, sounds, etc. can be shattered using high-dimensional hyperplanes. The classifier relies on the accumulation of representation in the final convolution layer of a deep neural network. Often, classifier performance increases as one incorporates information from earlier layers of the neural network. Such a structural constraint has been imposed on certain deep learning architectures via the inception module. In addition to decreasing the computational effort, the use of $1 \times 1$ convolution filters enables the dimensionality of the feature maps to be greatly reduced; in tandem with pooling, the dimensionality reduces even further. Thus, information from the previous layer(s) can be accumulated and concatenated in the inception module. Learning the weights feeding into the inception module gives additional flexibility in vetting the different sources of information that reach deeper layers.

A major weakness of deep-learning architectures is their inability to perform well when only small amounts of training data are available. Tricks such as input augmentation (rotation, blur, etc.) and feature augmentation (in terms of the inception module) have proven useful (Goodfellow et al., 2016). Such structural constraints regularize the optimization problem, reducing over-fitting. In this paper, we propose a much simpler structural constraint, i.e., to utilize the multi-linear structure of deep feature tensors. We first emphasize the importance of feature encoding, starting with Fisher encoding (of the last layer), followed by a sparse dictionary based on the last feature layer; this leads to deep tensor factorization, which brings together tensor learning and deep learning, cementing the idea that taking into account the high-dimensional multi-linear representation increases the fidelity of the learnt representation. Although these algorithms are evaluated in a content-based image retrieval (CBIR) setting for a texture dataset, many of them have been combined in Cortexica’s findSimilar technology (Figures 2 and 3), which helps retailers recommend items from fashion databases comprising inventory items such as tops, trousers, handbags, etc.

2. Methods

2.1. Dataset and deep-feature generation

In this paper, the Describable Textures Dataset (DTD) (Cimpoi et al., 2014) is used to evaluate feature encodings for image retrieval. A wide variety of fashion inventory relies on capturing the differences between varying textures. Hence, our feature encoding comparison leverages DTD, a widely used dataset for texture discrimination. Rather than targeting the recognition and description of objects, texture images in DTD are collected in the wild (Google and Flickr) and classified based on human visual perception (Tamura et al., 1978), such as directionality (line-like), regularity (polka-dotted and chequered), etc. DTD contains 5640 wild texture images with 47 describable attributes drawn from the psychology literature, and is publicly available on the web.

Textures can be described via orderless pooling of filter bank responses (Gong et al., 2014). In a deep convolutional neural network (dCNN), the convolutional layers are akin to non-linear filter banks; these have in fact been shown to be better for texture description (Cimpoi et al., 2015). Here, the local deep features are extracted from the last convolutional layer of a pre-trained VGG-M (Chatfield et al., 2014). This is represented by $X = [x_1, \dots, x_N] \in \mathbb{R}^{D \times N}$; the size of the last convolutional layer is $H \times W \times D$, where $D$ denotes the dimension of the filter response at each pixel of the last convolution layer and $N = H \times W$ is the total number of local features. Different feature encoding strategies are then utilized for encoding the local descriptors. A similarity metric is then applied to rank images. We use the $\ell_2$ norm (Frobenius norm for matrices and tensors) between vectors as a notion of distance between the query and the database images.
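The flattening and ranking steps described above can be sketched as follows. This is a minimal illustration, assuming the feature map is already available as an array of shape (H, W, D); the helper names `local_features` and `rank_by_distance` are hypothetical and are not taken from the paper's Matlab implementation.

```python
import numpy as np

def local_features(feature_map):
    """Flatten an (H, W, D) feature map into a (D, N) matrix with N = H * W."""
    h, w, d = feature_map.shape
    return feature_map.reshape(h * w, d).T

def rank_by_distance(query, database):
    """Sort database entries by l2 (Frobenius) distance to the query signature."""
    diffs = database - query[None, ...]
    dists = np.sqrt((diffs.reshape(len(database), -1) ** 2).sum(axis=1))
    return np.argsort(dists, kind="stable"), dists
```

The same ranking function applies whether the signatures are vectors, matrices or tensors, because the Frobenius norm is computed over all entries.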

We will now introduce five different encodings of the feature matrix: (a) Fisher encoding, (b) sparse matrix dictionary learning, (c) t-SVD factorization, (d) low-rank plus sparse factorization, and (e) multilinear principal component analysis (mPCA).

Feature encoding

2.1.1. Fisher encoding

We use a Gaussian Mixture Model (GMM) for encoding a probabilistic visual vocabulary on the training dataset. Images are then represented as Fisher vectors (Perronnin and Dance, 2006; Uricchio et al., 2015), i.e., derivatives of the log-likelihood of the model with respect to its parameters (the score function). Fisher encoding describes how the distribution of features of an individual image differs from the distribution fitted to the features of all training images.

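First, a set of $D$-dimensional local deep features is extracted from an image and denoted as $X = \{x_i, i = 1, \dots, N\}$. As in Refs. (Simonyan et al., 2013; Cimpoi et al., 2015), a $K$-component GMM with diagonal covariance is fitted on the training set with the parameters $\{\pi_k, \mu_k, \sigma_k^2\}_{k=1}^{K}$; only the derivatives with respect to the means $\mu_k$ and variances $\sigma_k^2$ are encoded and concatenated to represent an image,

$$\Phi(X) = \left[ u_1^\top, v_1^\top, \dots, u_K^\top, v_K^\top \right]^\top. \tag{1}$$

For each component $k$, the mean and covariance deviations on each vector dimension $j = 1, \dots, D$ are

$$u_{jk} = \frac{1}{N\sqrt{\pi_k}} \sum_{i=1}^{N} q_{ik} \frac{x_{ji} - \mu_{jk}}{\sigma_{jk}}, \qquad v_{jk} = \frac{1}{N\sqrt{2\pi_k}} \sum_{i=1}^{N} q_{ik} \left[ \left( \frac{x_{ji} - \mu_{jk}}{\sigma_{jk}} \right)^2 - 1 \right], \tag{2}$$

where $q_{ik}$ is the soft assignment weight of feature $x_i$ to the $k$-th Gaussian, defined as

$$q_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2)}{\sum_{t=1}^{K} \pi_t \, \mathcal{N}(x_i \mid \mu_t, \sigma_t^2)}. \tag{3}$$

As an image representation, the dimension of the Fisher vector is $2KD$, where $K$ is the number of components in the GMM and $D$ is the dimension of the local feature descriptor. After normalization of the Fisher vector, the Euclidean distance is calculated to find the similarity between two images.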
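A minimal sketch of the Fisher encoding follows, assuming a diagonal-covariance GMM whose parameters have already been fitted on the training set. The function name `fisher_vector` is hypothetical, the soft assignments are computed as GMM posteriors, and the power plus $\ell_2$ normalization at the end is an assumption, since the text does not say which normalization was used.

```python
import numpy as np

def fisher_vector(X, pi, mu, sigma2):
    """X: (N, D) local features; pi: (K,) weights; mu, sigma2: (K, D) GMM parameters."""
    N, D = X.shape
    diff = X[:, None, :] - mu[None, :, :]                      # (N, K, D)
    # log of pi_k * N(x_i | mu_k, diag(sigma2_k)), up to a constant shared by all k
    log_p = (np.log(pi)[None, :]
             - 0.5 * np.log(sigma2).sum(axis=1)[None, :]
             - 0.5 * (diff ** 2 / sigma2[None, :, :]).sum(axis=2))
    log_p -= log_p.max(axis=1, keepdims=True)                  # numerical stability
    q = np.exp(log_p)
    q /= q.sum(axis=1, keepdims=True)                          # soft assignments q_ik
    z = diff / np.sqrt(sigma2)[None, :, :]                     # standardised deviations
    u = (q[:, :, None] * z).sum(axis=0) / (N * np.sqrt(pi)[:, None])
    v = (q[:, :, None] * (z ** 2 - 1)).sum(axis=0) / (N * np.sqrt(2 * pi)[:, None])
    fv = np.concatenate([u, v], axis=1).ravel()                # [u_1, v_1, ..., u_K, v_K]
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                     # assumed power normalisation
    return fv / max(np.linalg.norm(fv), 1e-12)                 # assumed l2 normalisation
```

The returned vector has length $2KD$, so two images can be compared directly with the Euclidean distance.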

2.1.2. Sparse coding on deep features

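The compressed (sparse) sensing framework allows us to learn a set of over-complete bases $B$ and sparse weights $W$ such that the feature matrix $X$ can be represented by a linear combination of these basis vectors:

$$\min_{B, W} \| X - B W \|_F^2 \quad \text{subject to} \quad \| w_i \|_0 \le T \;\; \forall i, \tag{4}$$

where $w_i$ is the $i$-th column of $W$ and $T$ bounds the number of non-zero weights per feature.

The k-SVD algorithm (Jiang et al., 2011) comprises two stages: first, a sparse coding stage (using either matching pursuit or basis pursuit) and second, a dictionary update stage. In the first stage, when the dictionary $B$ is fixed, we solve for $W$ using orthogonal matching pursuit (OMP). Briefly, OMP recovers the support of the weights using an iterative greedy algorithm that selects at each step the column of $B$ that is most correlated with the current residual. Practically, we initialise the residual $r_0 = x$, subsequently computing the column that reduces the residual the most and finally adding this column to the support $\Lambda$:

$$k_t = \arg\max_k | b_k^\top r_{t-1} |, \qquad \Lambda_t = \Lambda_{t-1} \cup \{k_t\}, \qquad r_t = x - B_{\Lambda_t} \left( B_{\Lambda_t}^\top B_{\Lambda_t} \right)^{-1} B_{\Lambda_t}^\top x. \tag{5}$$

Iterating through these equations for a pre-specified number of iterations, we can update the sparse weights $W$. After obtaining the sparse weights, we use a dictionary update stage where we update only one column of $B$ at a time. The update for the $k$-th column is

$$\| X - B W \|_F^2 = \Big\| X - \sum_{j \ne k} b_j w^j - b_k w^k \Big\|_F^2 = \| E_k - b_k w^k \|_F^2, \tag{6}$$

where $w^k$ denotes the $k$-th row of $W$. In order to minimize $\| E_k - b_k w^k \|_F^2$ we decompose $E_k$ (SVD) as $E_k = U \Sigma V^\top$. Using this decomposition we utilize a rank-1 approximation to form $b_k = u_1$ and $w^k = \sigma_1 v_1^\top$. We can then iterate this for every column of $B$. Sparsity is attained by collapsing each $w^k$ only to its non-zero entries, and constraining $E_k$ to the corresponding columns.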
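The two stages can be sketched as below. This is an illustrative NumPy version under stated assumptions, not the authors' Matlab code: `omp` and `ksvd_update_atom` are hypothetical names, the dictionary columns are assumed to have unit norm, and atoms that no feature uses are simply left unchanged.

```python
import numpy as np

def omp(B, x, n_nonzero):
    """Greedy sparse code w with at most n_nonzero entries so that x ~ B @ w."""
    residual, support = x.copy(), []
    w = np.zeros(B.shape[1])
    for _ in range(n_nonzero):
        k = int(np.argmax(np.abs(B.T @ residual)))   # atom most correlated with residual
        if k in support:                             # no further progress possible
            break
        support.append(k)
        # least-squares fit of x on the atoms selected so far
        coef, *_ = np.linalg.lstsq(B[:, support], x, rcond=None)
        residual = x - B[:, support] @ coef
    if support:
        w[support] = coef
    return w

def ksvd_update_atom(X, B, W, k):
    """In-place rank-1 update of atom k and its weights, restricted to features using it."""
    users = np.flatnonzero(W[k])
    if users.size == 0:
        return
    W[k, users] = 0.0
    E = X[:, users] - B @ W[:, users]                # residual without atom k
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    B[:, k] = U[:, 0]                                # new unit-norm atom
    W[k, users] = s[0] * Vt[0]                       # matching weights
```

A full dictionary learner would alternate between coding every column of the feature matrix with `omp` and sweeping `ksvd_update_atom` over all atoms until the reconstruction error stops decreasing.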

For image retrieval, each local deep feature $x_i$ can be encoded by sparse weights $w_i$. An image signature can then be formed by max pooling over the set $\{w_i\}$, followed by measuring a distance between such signatures.
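As a small illustration of the pooling step, the sketch below max-pools the sparse codes of one image into a single signature. Pooling the absolute values of the weights is an assumption (the text only says max pooling), and the function names are hypothetical.

```python
import numpy as np

def image_signature(W):
    """Max-pool sparse codes W (K atoms x N local features) over the N features."""
    return np.abs(W).max(axis=1)

def signature_distance(sig_a, sig_b):
    """Euclidean distance between two pooled signatures."""
    return float(np.linalg.norm(sig_a - sig_b))
```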

2.1.3. Tensor factorization of deep features

In the earlier section, we relied on an alternating minimization of the dictionary and the loadings (weights) to yield a tractable optimization problem. Specifically, the last convolutional layer was used to form a dictionary without reference to the tensorial (multi-linear) representation of feature spaces obtained from the deep convolutional network. Thus, our goal is to approximate these tensors as sparse conic combinations of atoms that have been learnt from a dictionary comprising the entire image training set. In other words, we would like to obtain an over-complete dictionary such that each feature tensor can be represented as a weighted sum of a small subset of the atoms (loadings) of the dictionary.

Two decompositions are commonly used for factorizing tensors: one is the Tucker decomposition, whilst the second is known as the Canonical Decomposition/Canonical Polyadic Decomposition (CANDECOMP/CPD), also known as Parallel Factor Analysis (PARAFAC). In the former, tensors are represented by sparse core tensors with block structures. Specifically, $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is approximated as a multi-linear transformation of a small “core” tensor $\mathcal{G} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_N}$ by factor matrices $U^{(n)} \in \mathbb{R}^{I_n \times J_n}$ (Kolda and Bader, 2009),

$$\mathcal{X} \approx \mathcal{G} \times_1 U^{(1)} \times_2 U^{(2)} \cdots \times_N U^{(N)}. \tag{7}$$

In the latter, a tensor is written as a sum of $R$ rank-1 tensors, each of which can be written as the outer product of factor vectors, i.e.,

$$\mathcal{X} \approx \sum_{r=1}^{R} \lambda_r \, a_r^{(1)} \circ a_r^{(2)} \circ \cdots \circ a_r^{(N)}. \tag{8}$$

The decomposition is canonical when the rank $R$ is minimal; such a decomposition is unique under certain conditions (Domanov and Lathauwer, 2013). Even then, due to numerical errors, factorization of a feature matrix obtained using a deep neural network results in a non-unique factorization. Thus, CPD proves inadequate for unique feature encoding. We therefore utilize a factorization that is similar to a 2D-PCA, albeit lifted for multi-linear objects. Specifically, we use t-SVD (Kilmer et al., 2013) to factorize the feature matrices.
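To make the two forms concrete, the sketch below reconstructs a tensor from a Tucker core and factor matrices, and from CPD weights and factor vectors (third-order case only). The helper names are hypothetical, and this only evaluates the models; it does not fit them.

```python
import numpy as np

def mode_n_product(T, M, n):
    """Multiply tensor T by matrix M (of shape J x I_n) along mode n."""
    return np.moveaxis(np.tensordot(M, T, axes=(1, n)), 0, n)

def tucker_reconstruct(core, factors):
    """Apply one factor matrix per mode to the core tensor."""
    X = core
    for n, U in enumerate(factors):
        X = mode_n_product(X, U, n)
    return X

def cp_reconstruct(weights, factors):
    """Weighted sum of rank-1 outer products for a third-order tensor."""
    A, B, C = factors
    return np.einsum("r,ir,jr,kr->ijk", weights, A, B, C)
```

A CPD is the special case of a Tucker model whose core is zero everywhere except on its super-diagonal, which is a convenient consistency check for both functions.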

Based on t-SVD

The t-product between two tensors, $\mathcal{A} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ and $\mathcal{B} \in \mathbb{R}^{n_2 \times n_4 \times n_3}$, is

$$\mathcal{A} * \mathcal{B} = \mathrm{fold}\left( \mathrm{circ}(\mathcal{A}) \cdot \mathrm{unfold}(\mathcal{B}) \right), \tag{9}$$

where $\mathrm{circ}(\cdot)$ creates a block circulant matrix and the unfold operator matricizes the tensor on the tube (3rd) axis. The t-SVD factorizes the training tensor as $\mathcal{X} = \mathcal{U} * \mathcal{S} * \mathcal{V}^\top$. $\mathcal{S}$ is an f-diagonal tensor that contains the eigen-tuples of the covariance tensor on its diagonal, whereas the columns of $\mathcal{U}$ are the eigenmatrices of the covariance tensor. The images in the training set are inserted as the second index of the tensor. In other words, using t-SVD, we obtain an orthogonal factor dictionary of the entire training set. Ultimately, a projection of the mean-removed input tensor (i.e., a single feature tensor, $\mathcal{X}_i$) on the orthogonal projector ($\mathcal{U}^\top$) forms the tensor encoding of each image. The Frobenius norm then measures the closeness between a pair of images (individual tensor projections). The computation is efficient because the comparison between the test image and the database is carried out in the Fourier domain: the t-product uses a fast Fourier transform at its core (Nvidia, 2007).
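A compact NumPy sketch of the t-product, the t-SVD and the resulting encoding is given below. It follows the standard FFT-based construction for real third-order tensors rather than the authors' Matlab code; all function names are hypothetical, and the truncation index `k` passed to `encode` is an assumed parameter.

```python
import numpy as np

def t_product(A, B):
    """t-product of A (n1 x n2 x n3) and B (n2 x n4 x n3) via FFT along the tubes."""
    Af, Bf = np.fft.fft(A, axis=2), np.fft.fft(B, axis=2)
    Cf = np.einsum("ijk,jlk->ilk", Af, Bf)        # matrix product per frontal slice
    return np.real(np.fft.ifft(Cf, axis=2))

def t_transpose(A):
    """Transpose each frontal slice and reverse the order of slices 2..n3."""
    At = np.transpose(A, (1, 0, 2))
    return np.concatenate([At[:, :, :1], At[:, :, :0:-1]], axis=2)

def t_svd(X):
    """Return real U, S, V such that X = U * S * V^T under the t-product."""
    n1, n2, n3 = X.shape
    r = min(n1, n2)
    Xf = np.fft.fft(X, axis=2)
    Uf = np.zeros((n1, r, n3), dtype=complex)
    Sf = np.zeros((r, r, n3), dtype=complex)
    Vf = np.zeros((n2, r, n3), dtype=complex)
    for k in range(n3 // 2 + 1):
        slice_k = Xf[:, :, k]
        if k == 0 or 2 * k == n3:
            slice_k = slice_k.real                # these slices are real for real X
        u, s, vh = np.linalg.svd(slice_k, full_matrices=False)
        Uf[:, :, k], Sf[:, :, k], Vf[:, :, k] = u, np.diag(s), vh.conj().T
    for k in range(n3 // 2 + 1, n3):              # enforce conjugate symmetry
        Uf[:, :, k] = Uf[:, :, n3 - k].conj()
        Sf[:, :, k] = Sf[:, :, n3 - k]
        Vf[:, :, k] = Vf[:, :, n3 - k].conj()
    inverse = lambda T: np.real(np.fft.ifft(T, axis=2))
    return inverse(Uf), inverse(Sf), inverse(Vf)

def encode(U, x_centred, k):
    """Project a mean-removed feature tensor (n1 x 1 x n3) onto the first k slices of U."""
    return t_product(t_transpose(U[:, :k, :]), x_centred)
```

With the training images stacked along the second index, `t_svd` is applied once to the mean-removed training tensor, and `encode` is then applied to each database and query tensor before comparing the encodings with the Frobenius norm.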

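Another feature encoding that we consider is the partition of each individual tensor, i.e., $\mathcal{X} = \mathcal{L} + \mathcal{E}$, where $\mathcal{L}$ is a low-rank tensor and $\mathcal{E}$ is a sparse tensor. With $\mathcal{U}$, $\mathcal{S}$ and $\mathcal{V}$ the t-SVD factors of $\mathcal{X}$, we have $\mathcal{L} = \sum_{i=1}^{k} \mathcal{U}(:, i, :) * \mathcal{S}(i, i, :) * \mathcal{V}(:, i, :)^\top$ and $\mathcal{E} = \sum_{i=k+1}^{\min(n_1, n_2)} \mathcal{U}(:, i, :) * \mathcal{S}(i, i, :) * \mathcal{V}(:, i, :)^\top$, where $k$ denotes the truncation index of the singular components.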
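One way to realise such a split is to truncate the singular components slice by slice in the Fourier domain, which is equivalent to keeping the first $k$ terms of the t-SVD. The sketch below assumes that reading; the function name is hypothetical, and the remainder is returned simply as the difference between the input and its low-rank part.

```python
import numpy as np

def low_rank_plus_residual(X, k):
    """Split X into a t-SVD rank-k part L and the remainder E = X - L."""
    Xf = np.fft.fft(X, axis=2)
    Lf = np.zeros_like(Xf)
    for j in range(X.shape[2]):
        u, s, vh = np.linalg.svd(Xf[:, :, j], full_matrices=False)
        Lf[:, :, j] = (u[:, :k] * s[:k]) @ vh[:k]   # keep the k leading components
    L = np.real(np.fft.ifft(Lf, axis=2))
    return L, X - L
```

Increasing `k` moves energy from the remainder into the low-rank part, so `k` trades off compactness of the encoding against reconstruction error.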

Based on mPCA

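For high-order tensor analysis, multilinear Principal Component Analysis (mPCA) or Higher-Order Singular Value Decomposition (HOSVD) (Vasilescu and Terzopoulos, 2003; Lu et al., 2008) computes a set of orthonormal matrices associated with each mode of a tensor; this is analogous to the orthonormal row and column spaces of a matrix computed by the matrix PCA/SVD.

In a Tucker decomposition (Eqn. 7), if the factor matrices are constrained to be orthogonal, the input tensor can be decomposed as a linear combination of rank-1 tensors. Given a set of $N$-th order training tensors $\{\mathcal{X}_m, m = 1, \dots, M\}$, the objective of mPCA is to find $N$ linear projection matrices $\{\tilde{\mathbf{U}}^{(n)} \in \mathbb{R}^{I_n \times P_n}, n = 1, \dots, N\}$ that maximize the total tensor scatter (variance) in the projection subspace. When the factor matrices $U^{(n)}$ in Eqn. 7 are orthogonal, then $\mathcal{G} = \mathcal{X} \times_1 U^{(1)\top} \times_2 U^{(2)\top} \cdots \times_N U^{(N)\top}$. Each query (test) tensor $\mathcal{X}$ can then be projected using $\mathcal{Y} = \mathcal{X} \times_1 \tilde{\mathbf{U}}^{(1)\top} \times_2 \tilde{\mathbf{U}}^{(2)\top} \cdots \times_N \tilde{\mathbf{U}}^{(N)\top}$, where the bold-faced matrices represent a low-dimensional space ($P_n \le I_n$). The objective then becomes learning a set of matrices to admit

$$\{\tilde{\mathbf{U}}^{(n)}\}_{n=1}^{N} = \arg\max_{\tilde{\mathbf{U}}^{(1)}, \dots, \tilde{\mathbf{U}}^{(N)}} \sum_{m=1}^{M} \| \mathcal{Y}_m - \bar{\mathcal{Y}} \|_F^2, \tag{10}$$

where $\bar{\mathcal{Y}} = \frac{1}{M} \sum_{m=1}^{M} \mathcal{Y}_m$ is the mean tensor. Since the projection to an $N$-th order tensor subspace consists of $N$ projections to $N$ vector subspaces, in Ref. (Lu et al., 2008) the optimization of the $N$ projection matrices is solved by finding each $\tilde{\mathbf{U}}^{(n)}$ that maximizes the scatter in the $n$-mode vector subspace with all other $\tilde{\mathbf{U}}$-matrices fixed. The local optimization procedure can be iterated until convergence.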
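The alternating procedure can be sketched as follows. This is a simplified reading of the scheme in Lu et al. (2008), with a fixed number of sweeps instead of a convergence test; the function names, the `ranks` argument (the subspace dimension kept per mode) and the initialisation by truncating the full-projection eigenvectors are assumptions made for illustration.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: the chosen mode indexes rows, all other modes are flattened."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def project(T, mats, skip=None):
    """Apply the transpose of each projection matrix along its mode (except `skip`)."""
    for n, U in enumerate(mats):
        if n != skip:
            T = np.moveaxis(np.tensordot(U.T, T, axes=(1, n)), 0, n)
    return T

def top_eigvecs(scatter, p):
    """Eigenvectors of a symmetric matrix for its p largest eigenvalues."""
    _, vecs = np.linalg.eigh(scatter)             # eigenvalues in ascending order
    return vecs[:, ::-1][:, :p]

def mpca(samples, ranks, n_iter=5):
    """samples: (M, I_1, ..., I_N). Returns the mean tensor and N projection matrices."""
    mean = samples.mean(axis=0)
    centred = samples - mean
    n_modes = centred.ndim - 1
    mats = []
    for n in range(n_modes):                      # initialise from the full scatter
        scatter = sum(unfold(x, n) @ unfold(x, n).T for x in centred)
        mats.append(top_eigvecs(scatter, ranks[n]))
    for _ in range(n_iter):                       # update one mode, others fixed
        for n in range(n_modes):
            partial = [unfold(project(x, mats, skip=n), n) for x in centred]
            mats[n] = top_eigvecs(sum(p @ p.T for p in partial), ranks[n])
    return mean, mats
```

A query tensor `x` is then encoded as `project(x - mean, mats)`, and encodings are compared with the Frobenius norm.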


3. Results

In this section, five deep feature encoding methods are evaluated for image retrieval on the DTD dataset. Images are resized to the same size (256 x 256), and deep features are extracted from the last convolutional layer of a pre-trained VGG-M. For Fisher vector and sparse coding, deep features are flattened as a set of 1D local features. For t-SVD, deep features are represented as 2D feature maps and treated as [HxW,1,D] data structures (see Methods). After encoding and normalization, the Euclidean distance is calculated to find the similarity between two images.

To evaluate image retrieval, Mean Average Precision (MAP) on the top-1, top-5 and top-10 rankings is calculated. Two images per category, i.e., a total of 94 images, are selected as queries from the test dataset. The retrieval database includes 3760 images from the DTD training and validation datasets. MAP on DTD is listed in Table 1. An example of the retrieval obtained with each method is shown in Figure 1. In each case 10 images are displayed. The top-left image is the query. The remaining images are arranged by similarity to the query image as obtained with each encoding method.
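For reference, one common way to compute MAP over the top-k results is sketched below. The text does not spell out the exact definition used, so this version (precision averaged over the ranks at which a same-category image appears, zero if none appears) is an assumption, and the function names are hypothetical.

```python
def average_precision_at_k(ranked_labels, query_label, k):
    """Average precision over the top-k results; relevant = same category as the query."""
    hits, precisions = 0, []
    for rank, label in enumerate(ranked_labels[:k], start=1):
        if label == query_label:
            hits += 1
            precisions.append(hits / rank)   # precision at this relevant rank
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(rankings, query_labels, k):
    """Mean of the per-query average precision at k."""
    scores = [average_precision_at_k(r, q, k) for r, q in zip(rankings, query_labels)]
    return sum(scores) / len(scores)
```

Here `rankings` holds, for each query, the category labels of the database images sorted by increasing distance to that query.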

Figure 1. Retrieved results on DTD: a) Fisher vector b) Sparse coding c) t-SVD d) mPCA e) Low rank tensor f) Raw tensor

Image retrieval amounts to searching large databases to return a ranked list of images that have similar properties to the query image. Here, we reason that, in contrast to raw feature tensors (i.e., without any encoding of the feature maps), encodings using Fisher vectors, sparse dictionaries or multi-linear tensors increase the average precision of retrieval. Table 1 shows that multi-linear encodings based on t-SVD, mPCA or low-rank decomposition of individual images all have similar fidelity (top-1 precision of 0.51 to 0.53), whilst performing close to the information-geometric encoding of the feature vectors (0.56).

Sparse coding has the highest top-1 average precision (0.62), although Fisher vectors score higher at top-5 and top-10. Its advantage arises because the dictionary learnt using k-SVD is matched to the underlying image distribution (i.e., by learning the sparse weights), whereas the tensor dictionaries (t-SVD or mPCA) use orthogonal matrices as dictionaries without the added constraint to finesse the weights, or to add additional structure in terms of sparsity, low rank or non-negativity.

As shown in Table 1, computing mPCA tensor encodings is computationally efficient (5 ms) in contrast to sparse dictionaries (35 ms) or learning a probabilistic model for Fisher vector encodings (12 ms). Combined with reasonable retrieval performance, this makes tensor encodings of deep neural features a contender for commercial infrastructures.

The code was implemented in Matlab 2015a under Linux on an Intel Xeon CPU E5-2640 @ 2.00GHz with 125 GB RAM. Sandia’s Tensor Toolbox (Bader et al., 2015), KU Leuven’s Tensorlab (Vervliet et al., 2016) and TFOCS (Becker et al., 2011) were used to implement our algorithms.

| Method           | top-1 | top-5 | top-10 | Time taken (ms) |
|------------------|-------|-------|--------|-----------------|
| Fisher Vector    | 0.56  | 0.52  | 0.48   | 12              |
| Sparse Coding    | 0.62  | 0.49  | 0.44   | 35              |
| t-SVD dictionary | 0.53  | 0.42  | 0.38   | 2188            |
| mPCA dictionary  | 0.53  | 0.43  | 0.39   | 5               |
| Low Rank         | 0.51  | 0.42  | 0.38   | 967             |
| Raw Tensor       | 0.41  | 0.35  | 0.31   | -               |

Table 1. Average precision for the DTD dataset. Raw tensors are feature tensors without any encoding.


4. Discussion

Feature encoding is crucial for comparing images (or videos, text, etc.) that are similar to one another. Our experiments show that whilst sparse encoding of a feature tensor yields the highest top-1 precision for retrieval, having no encoding markedly reduces the average precision. Taking the multi-linear properties of the feature tensor into account improves retrieval, and the fidelity is close to that of Fisher encoding. We further show that computing a multi-linear representation of the feature tensor with mPCA is computationally much more efficient than constructing a sparse dictionary or learning a probabilistic model.

The sparse dictionary encoding has the highest top-1 average precision due to the added flexibility of learning the weights as well as imposing the structural constraint of sparsity. Fisher vector encoding has the second highest top-1 precision, and the highest at top-5 and top-10, because of its ability to capture the information geometry of the underlying probabilistic model. Specifically, the Fisher information tensor encodes the underlying Riemannian metric, which augments the encoding with the curvature of the underlying distribution. The multi-linear approaches based on t-SVD, mPCA and low-rank decomposition perform close to Fisher vectors as they encode the third and higher order interactions in the feature tensors. Comparison of the compute time tells us that, amongst all of the methods, encoding deep feature tensors using mPCA is the most time-efficient.

Our results have not exploited the geometry exhibited by the tensors. For example, one can calculate lagged covariance tensors from the feature tensor; these tensors exhibit a Riemannian geometry due to their positive-definite structure. Therefore, building a dictionary of covariance tensors using t-SVD, wherein an Augmented Lagrangian alternating direction method can be employed to learn a sparse representation, is the next viable step in our work. We anticipate that such a multi-linear over-complete dictionary should, at the very least, improve on the precision of the Fisher encoding method. So far, the last convolutional layer was used to form a dictionary without reference to the earlier layers of the deep neural network. A step forward would be to constrain the construction of an image signature with tensorial information from an earlier layer. The tensor methods rely on factorizing large tensors, especially those that emerge from deep neural networks. Yet, GPU implementation of the Fourier transform in the form of cuFFT enables us to build a scalable commercial solution.


References

  • Bader et al. (2015) Brett W. Bader, Tamara G. Kolda, and others. 2015. MATLAB Tensor Toolbox Version 2.6. Available online. (February 2015).
  • Becker et al. (2011) Stephen R Becker, Emmanuel J Candès, and Michael C Grant. 2011. Templates for convex cone problems with applications to sparse signal recovery. Mathematical programming computation 3, 3 (2011), 165.
  • Chatfield et al. (2014) K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. 2014. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference.
  • Cimpoi et al. (2014) M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. 2014. Describing textures in the wild. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Cimpoi et al. (2015) M Cimpoi, S Maji, and A Vedaldi. 2015. Deep Filter Banks for Texture Recognition and Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Domanov and Lathauwer (2013) Ignat Domanov and Lieven De Lathauwer. 2013. On the Uniqueness of the Canonical Polyadic Decomposition of Third-Order Tensors - Part II: Uniqueness of the Overall Decomposition. SIAM J. Matrix Analysis Applications 34, 3 (2013), 876–903.
  • Gong et al. (2014) Yunchao Gong, Liwei Wang, Ruiqi Guo, and Svetlana Lazebnik. 2014. Multi-scale Orderless Pooling of Deep Convolutional Activation Features. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VII. 392–407.
  • Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
  • Jiang et al. (2011) Zhuolin Jiang, Zhe Lin, and Larry S. Davis. 2011. Learning a discriminative dictionary for sparse coding via label consistent K-SVD. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. 1697 – 1704.
  • Kilmer et al. (2013) Misha E Kilmer, Karen Braman, Ning Hao, and Randy C Hoover. 2013. Third-order tensors as operators on matrices: A theoretical and computational framework with applications in imaging. SIAM J. Matrix Anal. Appl. 34, 1 (2013), 148–172.
  • Kolda and Bader (2009) Tamara G. Kolda and Brett W. Bader. 2009. Tensor Decompositions and Applications. SIAM Rev. 51, 3 (2009), 455–500.
  • Lu et al. (2008) H. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos. 2008. MPCA: Multilinear principal component analysis of tensor objects. IEEE Transactions on Neural Networks 19, 1 (2008), 18–39.
  • Nvidia (2007) Nvidia. 2007. CUDA CUFFT Library. (2007).
  • Perronnin and Dance (2006) F. Perronnin and C. Dance. 2006. Fisher kernels on visual vocabularies for image categorization. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Simonyan et al. (2013) K. Simonyan, O. M. Parkhi, A. Vedaldi, and A. Zisserman. 2013. Fisher Vector Faces in the Wild. In British Machine Vision Conference.
  • Tamura et al. (1978) H. Tamura, S. Mori, and T. Yamawaki. 1978. Textural features corresponding to visual perception. IEEE Transactions on Systems, Man and Cybernetics (1978), 460–473.
  • Uricchio et al. (2015) Tiberio Uricchio, Marco Bertini, Lorenzo Seidenari, and Alberto Del Bimbo. 2015. Fisher Encoded Convolutional Bag-of-Windows for Efficient Image Retrieval and Social Image Tagging. In ICCV Workshops. 1020–1026.
  • Vasilescu and Terzopoulos (2003) M.A.O. Vasilescu and D. Terzopoulos. 2003. Multilinear Subspace Analysis for Image Ensembles. In IEEE Conference on Computer Vision and Pattern Recognition. 93–99.
  • Vervliet et al. (2016) N. Vervliet, O. Debals, L. Sorber, M. Van Barel, and L. De Lathauwer. 2016. Tensorlab 3.0. (Mar. 2016). Available online.

Appendix A Tensor Norms

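Let $x_{ijk}$ be the elements of tensor $\mathcal{X}$; then the Frobenius norm is $\|\mathcal{X}\|_F = \sqrt{\sum_{i,j,k} |x_{ijk}|^2}$. The nuclear (trace) norm is defined as $\|\mathcal{X}\|_* = \sum_i \sigma_i$, where $\sigma_i$ are the singular values of $\mathcal{X}$.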

Appendix B Cortexica’s findSimilar Technology

Figure 2. The findSimilar technology: A consumer takes a photograph of a clothing item. Using proprietary versions of feature encodings that leverage deep-learning as well as multi-scale analysis, the retailer presents similar items from the database.
(a) Retrieval of similar bags
(b) Retrieval of similar dresses
Figure 3. Feature encodings: Each image is encoded using a (proprietary) combination of encodings described in this paper, along with other patented descriptors. Shown here are examples wherein the query is the top-left item and a ranked list is returned based on similarity.