# Invariant Tensor Feature Coding

We propose a novel feature coding method that exploits invariance. We consider the setting where the transformations that preserve the image contents form a finite group of orthogonal matrices. This is the case for many image transformations, such as image rotation and image flipping. We prove that the group-invariant feature vector contains sufficient discriminative information when we learn a linear classifier using convex loss minimization. Based on this result, we propose novel feature modeling methods for principal component analysis and k-means clustering, which are used in most feature coding methods, and global feature functions that explicitly consider the group action. Although the global feature functions are complex nonlinear functions in general, we can calculate the group action on this space easily by constructing the functions as tensor product representations of basic representations, resulting in an explicit form of the invariant feature functions. We demonstrate the effectiveness of our methods on several image datasets.

## 1 Introduction

Feature coding is a method that calculates one global feature by summarizing the statistics of the local features extracted from one image. Given the local features $\{x_n\}_{n=1}^{N}$, we apply a nonlinear function $F$ to each local feature and use the average $\frac{1}{N}\sum_{n=1}^{N}F(x_n)$ as the global feature. Nowadays, we use activations of convolutional layers of pretrained CNNs such as VGG-Net simonyan2014very as local features to obtain considerable performance improvement sanchez2013image ; wang2016raid . Further, existing works handle coding methods as differentiable layers and train them end-to-end to obtain high accuracy arandjelovic2016netvlad ; gao2016compact ; lin2015bilinear . Thus, feature coding is a general method for transferring the information of CNNs to a wider domain.

Invariance of images under geometric transformations is essential for image recognition because we can obtain compact and discriminative features by focusing on the information that is invariant to the transformations that preserve image contents. For example, some researchers construct CNNs with more complex invariances, such as invariance to image rotation cohen2016group ; cohen2016steerable ; worrall2017harmonic , and obtain models with high accuracy and fewer parameters. Therefore, by incorporating invariance into feature coding, we expect to construct a coding method that contains highly discriminative information per dimension and is robust to the considered transformations.

In this work, we propose a novel feature coding method that exploits invariance. Specifically, we assume that the transformations $g$ that preserve the image contents act as a finite group $G$ consisting of orthogonal matrices on each local feature $x_n$. For example, when we use the concatenation of the pixel values in an image subregion as the local feature, image flipping acts as a change in the order of the pixel values. Hence, it can be represented by a permutation matrix, which is orthogonal. We ignore the change in the feature position because we apply global feature pooling. We need to construct a nonlinear feature coding function $F$ that exploits $G$. Our first result is that when we learn a linear classifier using l2-regularized convex loss minimization on a vector space where $G$ acts as orthogonal matrices, the learned weight lies in the subspace that is invariant under the $G$ action. From this result, we propose a guideline: we first construct a vector space in which $G$ acts orthogonally on $F(x_n)$, and then calculate the $G$-invariant subspace.

Two problems occur in constructing the global feature. The first problem is that, in general, $F$ exhibits complex nonlinearity. The action of $G$ on CNN features can be calculated relatively easily because CNNs consist of linear transformations and point-wise activation functions. This is not the case for feature coding methods. The second is that we must learn $F$ from the training data. When we encode the feature, we first apply principal component analysis (PCA) on $x_n$ to reduce the dimension. We often learn the clustering using k-means or a Gaussian Mixture Model (GMM) and calculate $F$ from the learned model. Therefore, we must consider the effect of $G$ on the learned model.

To solve these problems, we exploit two concepts of group representation theory: the irreducible decomposition of a representation and the tensor product of two representations. The former is the decomposition of the action of $G$ on $x_n$ into the direct sum of irreducible representations. This decomposition is important when we construct a dimensionality reduction method that is compatible with the group action, and when we subsequently calculate the tensor product of the representations. The tensor product of representations is a method for constructing a vector space on which the group acts from the product of the input representations. Therefore, it is important when we construct nonlinear feature functions on which the group action can be easily calculated.

With these concepts, we propose a novel feature coding method, and model training methods that exploit the group structure. We applied our methods to the D4 group that consists of rotations and flipping, and conducted experiments on image recognition datasets. We observed the performance improvement and robustness to such image transformations.

In summary, our contributions are as follows:

• We prove that the linear classifier becomes group-invariant when trained on the space where the group of content-preserving transformations acts orthogonally.

• We propose a group-invariant extension of feature modeling and feature coding methods when the group acts orthogonally on local features.

• We evaluated the accuracy and invariance of our methods on image recognition datasets.

## 2 Related Work

### 2.1 Feature coding

Two types of feature coding approaches exist: covariance-based and clustering-based approaches.

The covariance-based approach models the distribution of local features using a Gaussian distribution, and uses its statistics as the global feature. For example, GLC nakayama2010dense uses the mean and covariance of the local descriptors as the feature. The global Gaussian nakayama2010global applies an information geometric metric on the statistical manifold of the Gaussian distribution as the similarity measure. Bilinear pooling (BP) lin2015bilinear uses the mean of self-products instead of the mean and covariance, but the performance is similar. BP is defined as $\mathrm{vec}\left(\frac{1}{N}\sum_{n=1}^{N}x_n x_n^{t}\right)$, where $\mathrm{vec}(A)$ denotes the vector that stores the elements of $A$. Because of the simplicity of BP, there are various extensions such as lin2017improved ; wang2017g2denet ; gou2018monet . For example, improved bilinear pooling (iBP) uses the matrix square root as a global feature.

The simplest clustering-based approach is the vector of locally aggregated descriptors (VLAD) jegou2010aggregating . The VLAD uses k-means clustering with $C$ clustering components and is a $Cd_{\mathrm{local}}$-dimensional vector, where $d_{\mathrm{local}}$ denotes the local feature dimension. It consists of the sum of the differences between each local feature and the cluster centroid $\mu_c$ to which it is assigned, so that its $c$-th block is written as $F_c=\sum_{x_n\in S_c}(x_n-\mu_c)$, where $S_c$ is the set of local descriptors that are assigned to the $c$-th cluster. The VLAT picard2013efficient is an extension of the VLAD that exploits second-order information. The VLAT uses the sum of tensor products of the differences between each local descriptor and the cluster centroid $\mu_c$: $F_c=\sum_{x_n\in S_c}\left((x_n-\mu_c)\otimes(x_n-\mu_c)-T_c\right)$, where $T_c$ is the mean of $(x_n-\mu_c)\otimes(x_n-\mu_c)$ over all the local descriptors assigned to the $c$-th cluster. The VLAT contains information similar to the full covariance of the GMM. The Fisher vector (FV) sanchez2013image also exploits second-order information but uses only the diagonal covariance.

### 2.2 Feature extraction that considers invariance

One direction to exploit the invariance in the model structure is to calculate all the transformations and subsequently apply pooling with respect to the transformation to obtain the invariant feature. TI-Pooling laptev2016ti first applies various transformations to the input images, subsequently applies the CNN with the same weights, and finally applies max-pooling to obtain the invariant feature. RotEqNet marcos2017rotation calculates the vector field by rotating the convolutional filters and lining up the activations, and subsequently applies pooling to the vector fields to obtain rotation-invariant features. Group equivariant CNNs cohen2016group construct a network by the average of activations with respect to the transformed convolutional filters and point-wise activations.

Another direction is to exploit the group structure of the transformations and construct the feature using group representations. Harmonic networks worrall2017harmonic consider continuous image rotation, and construct a layer with spherical harmonics, which form the basis of the representation of the two-dimensional (2-d) rotation group. Steerable CNNs cohen2016steerable consider the D4 group and construct the filters with the direct sum of the irreducible representations of the group to reduce the model parameters while preserving accuracy. Unlike these works, we handle feature modeling methods and feature coding methods whose functions are more complex than the network layers.

## 3 Preliminary about group representation

In this section, we present an overview of the group representation theory necessary for constructing our method. More specific explanations are available in groupbook .

##### Group representation

When a set $G$ and an operation $\circ: G\times G\to G$ satisfy the following properties:

• $\circ$ is associative: all $g_1,g_2,g_3\in G$ satisfy $(g_1\circ g_2)\circ g_3=g_1\circ(g_2\circ g_3)$.

• The identity element $e\in G$ exists that satisfies $g\circ e=e\circ g=g$ for all $g\in G$.

• Every $g\in G$ has an inverse $g^{-1}$ that satisfies $g\circ g^{-1}=g^{-1}\circ g=e$.

the pair $(G,\circ)$ is called a group. The axioms above are an abstraction of the properties of a set of transformations. For example, the set of 2-d image rotations together with the composition of transformations forms a group. When the number of elements in $G$, written as $|G|$, is finite, $G$ is called a finite group. As mentioned in Section 1, we consider a finite group herein.

We now consider a complex vector space $V=\mathbb{C}^{d}$. We use a complex space for the simplicity of the theory; however, in our setting, the proposed global feature is real. The space of bijective linear operators on $V$ can be identified with the space of $d\times d$ regular complex matrices, written as $GL(d,\mathbb{C})$, which is also a group with the matrix product as the operation. A homomorphism $\pi$ from $G$ to $GL(d,\mathbb{C})$, that is, a mapping that satisfies

• For $g_1,g_2\in G$, $\pi(g_1)\pi(g_2)=\pi(g_1\circ g_2)$.

• $\pi(e)=I_d$.

is called a representation of $G$ on $V$, where $I_d$ denotes the $d$-dimensional identity matrix. When we explicitly denote the space that the matrices act on, we write $(\pi,V)$. The representation that maps all $g\in G$ to $I_d$ is called a trivial representation. We denote the one-dimensional trivial representation as $1$. When all $\pi(g)$ are unitary matrices, the representation is called a unitary representation. Moreover, we call it an orthogonal representation when the $\pi(g)$s are orthogonal matrices. An orthogonal representation is also a unitary representation. In this work, we assume that all transformations are orthogonal.

##### Intertwining operator

For two representations $(\pi_1,V_1)$ and $(\pi_2,V_2)$, a linear operator $A:V_1\to V_2$ is called an intertwining operator if it satisfies $\pi_2(g)A=A\pi_1(g)$ for all $g\in G$. This implies that $(\pi_2,AV_1)$ also becomes a representation. Thus, when we apply linear dimension reduction, it is important that the projection matrix is an intertwining operator. We write the space of intertwining operators as $\mathrm{Hom}_G(V_1,V_2)$. When a bijective $A\in\mathrm{Hom}_G(V_1,V_2)$ exists, we write $\pi_1\simeq\pi_2$. This implies that the two representations are virtually the same, and that the only difference is the basis of the vector space.

##### Irreducible representation

Given two representations $\pi_1$, $\pi_2$, the mapping that associates $g$ with the matrix in which we concatenate $\pi_1(g)$ and $\pi_2(g)$ in block-diagonal form is a representation on the direct sum of the input representation spaces. This representation is called the direct sum representation of $\pi_1$ and $\pi_2$, written as $\pi_1\oplus\pi_2$. When the representation $\pi$ is equivalent to some direct sum representation, $\pi$ is called a completely reducible representation. The direct sum is the composition of spaces on which the group acts independently. Therefore, a completely reducible representation can be decomposed into independent and simpler representations. When the representation is unitary, a representation that cannot be decomposed further is called an irreducible representation, and all representations are equivalent to a direct sum of irreducible representations. The irreducible representations $\tau_t$, $t=1,\dots,T$, are determined by the group structure. $\pi$ is decomposed as $\pi\simeq n_1\tau_1\oplus n_2\tau_2\oplus\dots\oplus n_T\tau_T$, where $n_t\tau_t$ is the $n_t$-times direct sum of $\tau_t$. When we denote the character of $\pi$ as $\chi_\pi(g)=\mathrm{Tr}(\pi(g))$, we can calculate these coefficients as $n_t=\frac{1}{|G|}\sum_{g\in G}\overline{\chi_{\tau_t}(g)}\chi_\pi(g)$. Further, the projection operator onto $n_t\tau_t$ is calculated as $P_{\tau_t}=\frac{d_{\tau_t}}{|G|}\sum_{g\in G}\overline{\chi_{\tau_t}(g)}\pi(g)$, where $d_{\tau_t}$ is the dimension of $\tau_t$. Specifically, because $\chi_1(g)=1$, we can calculate the projection onto the trivial representation using

 $$P_1=\frac{1}{|G|}\sum_{g\in G}\pi(g). \tag{1}$$

This equation reflects the fact that the average of all $\pi(g)$ is invariant to the group action. Schur’s lemma indicates that $\mathrm{Hom}_G(\tau_{t_1},\tau_{t_2})=\{0\}$ if $\tau_{t_1}\not\simeq\tau_{t_2}$, and $\mathrm{Hom}_G(\tau_t,\tau_t)=\{\alpha A\mid\alpha\in\mathbb{C}\}$ for some matrix $A$.
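As a small illustration of Eq. (1), the sketch below averages a vector over a toy group. The group (cyclic shifts of four coordinates), the helper names `act` and `project_trivial`, and the input vector are assumptions chosen for illustration; they are not taken from the paper.

```python
# Assumed toy group: cyclic shifts of four coordinates. Each element is stored
# as an index tuple, which stands for a permutation (hence orthogonal) matrix.
GROUP = [(0, 1, 2, 3), (1, 2, 3, 0), (2, 3, 0, 1), (3, 0, 1, 2)]


def act(perm, x):
    """Apply the permutation matrix of `perm` to x: result[i] = x[perm[i]]."""
    return [x[i] for i in perm]


def project_trivial(x, group):
    """Eq. (1): P_1 x = (1/|G|) * sum over g of pi(g) x."""
    images = [act(g, x) for g in group]
    return [sum(column) / len(group) for column in zip(*images)]


x = [1.0, 2.0, 3.0, 6.0]
p = project_trivial(x, GROUP)

# The projected vector is fixed by every group element, and projecting a
# second time changes nothing (P_1 is idempotent).
is_invariant = all(act(g, p) == p for g in GROUP)
is_idempotent = project_trivial(p, GROUP) == p
```

For this permutation action the projection simply replaces every coordinate by the mean of the coordinates in its orbit, which is why the result is unchanged by any shift.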

##### Tensor product representation

Finally, we explain the tensor product representation. The tensor product representation is important when we construct nonlinear feature functions. Given $\pi_1$, $\pi_2$, the mapping that associates $g$ with the matrix tensor product $\pi_1(g)\otimes\pi_2(g)$ is a representation on the tensor product of the input spaces. We write the tensor product representation as $\pi_1\otimes\pi_2$. The tensor product of unitary representations is also unitary. The important properties of the tensor product representation are (i) the distributive property $(\pi_1\oplus\pi_2)\otimes\pi_3=(\pi_1\otimes\pi_3)\oplus(\pi_2\otimes\pi_3)$, and (ii) $\chi_{\pi_1\otimes\pi_2}(g)=\chi_{\pi_1}(g)\chi_{\pi_2}(g)$. Thus, we can calculate the irreducible decomposition of a tensor product representation from that of the irreducible representations.
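The following sketch checks these properties numerically on a two-element group. The two representations `PI1` and `PI2` and the helper functions are hypothetical examples, not objects defined in the paper; the tensor product is realized as the Kronecker product of matrices.

```python
def kron(a, b):
    """Kronecker (tensor) product of two matrices stored as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def matmul(a, b):
    """Plain matrix product for lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def trace(a):
    return sum(a[i][i] for i in range(len(a)))


# Assumed group {e, m} with m∘m = e, and two orthogonal representations of it.
I2 = [[1, 0], [0, 1]]
PI1 = {"e": I2, "m": [[0, 1], [1, 0]]}   # swaps the two coordinates
PI2 = {"e": I2, "m": [[1, 0], [0, -1]]}  # negates the second coordinate
TENSOR = {g: kron(PI1[g], PI2[g]) for g in ("e", "m")}

# The tensor product is again a representation: m∘m = e is respected.
respects_product = matmul(TENSOR["m"], TENSOR["m"]) == TENSOR["e"]

# Property (ii): the character of the tensor product is the product of characters.
characters_multiply = all(
    trace(TENSOR[g]) == trace(PI1[g]) * trace(PI2[g]) for g in TENSOR
)

# Multiplicity of the trivial representation: the average of the character.
n_trivial = sum(trace(TENSOR[g]) for g in TENSOR) // len(TENSOR)
```

Here the four-dimensional tensor product space contains a two-dimensional invariant subspace, which is obtained from the characters alone without constructing the subspace explicitly.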

## 4 Invariant Tensor Feature Coding

In this section, we explain the proposed method. Our goal is to construct an effective feature function $F$ when the transformations that preserve the image contents act as a finite group of orthogonal matrices on $x_n$. To this end, we prove a theorem that reveals the condition under which the invariant feature has sufficient discriminative information in Section 4.1. We subsequently explain the feature modeling method necessary for constructing the coding in Section 4.2. We describe our proposed invariant feature function in Section 4.3. We finally discuss the group and the local features used in the experiment in Section 4.4.

### 4.1 Guideline for Invariant Feature

First, to decide what an effective feature is, we prove the following theorem.

###### Theorem 1.

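When we assume that a finite group $G$ acts as an orthogonal representation $\pi$ on $\mathbb{R}^{d}$ and preserves the distribution of the training data $\{(x_m,y_m)\}_{m=1}^{M}$, which implies that $\{(\pi(g)x_m,y_m)\}_{m=1}^{M}$ exhibits the same distribution as $\{(x_m,y_m)\}_{m=1}^{M}$ for any $g\in G$, the solution of the l2-regularized convex loss minimization

 $$\operatorname*{arg\,min}_{w\in\mathbb{R}^{d}}\ \frac{\lambda}{2}\|w\|^{2}+\frac{1}{M}\sum_{m=1}^{M}l\left(\langle w,x_m\rangle_{\mathbb{R}},y_m\right) \tag{2}$$

is $G$-invariant, implying that $\langle w,\pi(g)x\rangle_{\mathbb{R}}=\langle w,x\rangle_{\mathbb{R}}$ for any $g\in G$ and $x\in\mathbb{R}^{d}$.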

The $G$-invariance of the training data corresponds to the fact that $g$ does not change the contents of the image. From another viewpoint, this corresponds to data augmentation that uses the transformed images as additional training data. The proof is described in the appendix.
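The statement of Theorem 1 can be checked numerically in a tiny case. The sketch below assumes a squared loss (one convex choice of $l$), two-dimensional inputs, and the two-element group that swaps the coordinates; the data values and the function name `ridge_2d` are made up for illustration. When the training set is closed under the swap, the minimizer of Eq. (2) has equal coordinates, that is, it is fixed by the swap.

```python
def ridge_2d(xs, ys, lam):
    """Minimize (lam/2)|w|^2 + (1/M) sum_m 0.5 * (<w, x_m> - y_m)^2 in 2-d.

    Setting the gradient to zero gives (lam*I + (1/M) sum x x^t) w = (1/M) sum y x,
    which is solved here with the closed-form 2x2 inverse.
    """
    m = len(xs)
    a00 = lam + sum(x[0] * x[0] for x in xs) / m
    a01 = sum(x[0] * x[1] for x in xs) / m
    a11 = lam + sum(x[1] * x[1] for x in xs) / m
    b0 = sum(y * x[0] for x, y in zip(xs, ys)) / m
    b1 = sum(y * x[1] for x, y in zip(xs, ys)) / m
    det = a00 * a11 - a01 * a01
    return [(a11 * b0 - a01 * b1) / det, (a00 * b1 - a01 * b0) / det]


# Hypothetical training data and labels.
xs = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0]]
ys = [1.0, -1.0, 1.0]

# Closing the data under the swap makes its distribution group-invariant.
xs_aug = xs + [[x[1], x[0]] for x in xs]
ys_aug = ys + ys

w_plain = ridge_2d(xs, ys, lam=0.1)
w_aug = ridge_2d(xs_aug, ys_aug, lam=0.1)
```

With the original three samples the two weights differ, whereas with the swap-closed data `w_aug[0]` and `w_aug[1]` coincide, so the learned classifier only uses the swap-invariant direction.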

This theorem indicates that we can reduce the complexity of the problem by considering the invariance. Because the generalization error increases with the complexity, this theorem explains one reason that invariance contributes to the good test performance. Sokolic et al. analyzed the generalization error in a similar setting using a covering number sokolic2016generalization . While this work calculated the upper bound of the complexity, we focus on the linear classifier, and obtain the explicit complexity and the form of the learned classifier. Hence, our guideline is to construct a global feature as an invariant vector in the vector space where the group acts orthogonally.

### 4.2 Invariant feature modeling

We first construct a novel feature modeling method necessary for calculating the feature coding methods that consider group action.

##### Invariant PCA

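The original PCA is the solution of

 $$\max_{W^{t}W=I}\ \mathrm{Tr}\left(W^{t}\,\frac{1}{N}\sum_{n=1}^{N}(x_n-\mu)(x_n-\mu)^{t}\,W\right), \tag{3}$$

where $\mu=\frac{1}{N}\sum_{n=1}^{N}x_n$. PCA attempts to maximize the sum of the variances of the projected vectors. The solution is the matrix whose columns are the eigenvectors of $\frac{1}{N}\sum_{n=1}^{N}(x_n-\mu)(x_n-\mu)^{t}$ that correspond to the largest eigenvalues.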

In addition to the original constraint $W^{t}W=I$, we assume that $W^{t}$ is an intertwining operator to the projected space, such that the projected space is also a representation space. From Schur’s lemma described in Section 3, a $W$ that satisfies these conditions is obtained by composing the projection onto each component of the irreducible decomposition (the direct sum of all copies of one irreducible representation) with a dimensionality reduction within that component. Further, because of Schur’s lemma, the dimensionality reduction within such a component takes linear combinations of the copies of the irreducible representation and applies the same coefficients to every dimension of the irreducible representation. Hence, our basic strategy is to first calculate, for each component, the coefficient vectors that maximize the variance with the orthonormality preserved, and subsequently choose the ones with larger variances per dimension. In fact, these coefficient vectors can be calculated using PCA with the sum of the covariances over the dimensions of the irreducible representation.

Because the projected vector must be real, additional care is required when some of the irreducible representations are complex. We describe the modification for this case in the appendix. In our application, we use the D4 group, in which all irreducible representations can be real; thus, we can use the irreducible representations directly. The algorithm is written in Algorithm 1.

##### Invariant k-means

The original k-means is calculated as

 $$\min_{\mu}\ \sum_{n=1}^{N}\ \min_{c=1,\dots,C}\ \|x_n-\mu_c\|^{2}. \tag{4}$$

To guarantee that $G$ acts orthogonally on the learned model, we simply construct the cluster centroids by applying all the transformations to the original centroids. This implies that we learn $\mu_{g,c}$ for $g\in G$ and $c=1,\dots,C$ such that $\mu_{g,c}=\pi(g)\mu_{e,c}$ is satisfied for all $g\in G$. The algorithm is shown in Algorithm 2.
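Since Algorithm 2 is not reproduced here, the following is only a minimal sketch of one way to realize the tied codebook, under stated assumptions: group elements are stored as coordinate permutations (a special case of orthogonal matrices), and each base centroid is updated as the mean of its assigned points after mapping them back with the inverse transformation. This update rule follows from minimizing Eq. (4) under the tying constraint, but the function names and the exact procedure are illustrative rather than the authors' implementation.

```python
def act(perm, x):
    """Apply the permutation matrix of `perm` to x: result[i] = x[perm[i]]."""
    return [x[i] for i in perm]


def inverse(perm):
    """Inverse permutation, so that act(inverse(p), act(p, x)) == x."""
    inv = [0] * len(perm)
    for i, j in enumerate(perm):
        inv[j] = i
    return tuple(inv)


def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def invariant_kmeans(points, base, group, iters=10):
    """Lloyd-style iterations with the codebook tied as mu_{g,c} = pi(g) mu_c."""
    base = [list(mu) for mu in base]
    for _ in range(iters):
        sums = [[0.0] * len(mu) for mu in base]
        counts = [0] * len(base)
        for x in points:
            # Assign x to the nearest centroid among all transformed copies.
            _, g, c = min(
                (sq_dist(x, act(g, mu)), g, c)
                for c, mu in enumerate(base)
                for g in group
            )
            # Map x back to the frame of the base centroid before averaging.
            aligned = act(inverse(g), x)
            sums[c] = [s + a for s, a in zip(sums[c], aligned)]
            counts[c] += 1
        base = [
            [s / n for s in total] if n else mu
            for total, n, mu in zip(sums, counts, base)
        ]
    return base


# Hypothetical 2-d data that is symmetric under swapping the two coordinates.
SWAP_GROUP = [(0, 1), (1, 0)]
points = [[4.0, 0.0], [5.0, 1.0], [0.0, 4.0], [1.0, 5.0]]
base = invariant_kmeans(points, base=[[3.0, 0.0]], group=SWAP_GROUP)
codebook = [act(g, mu) for mu in base for g in SWAP_GROUP]
```

In this example one base centroid is learned from all four points, and the full codebook contains the centroid together with its swapped copy.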

### 4.3 Invariant Feature Coding

We subsequently construct a feature coding function $F$ as a $G$-invariant vector in a space where $G$ acts orthogonally. We first use spaces of functions of $x_n$ on which we can guarantee that $G$ acts orthogonally as the basis spaces, and construct more complex representation spaces using the tensor product. The first basis space is the space of $x_n$ itself. The second is the $|G|C$-dimensional 1-of-k vector $1_{\mu}(x_n)$ that indicates the centroid $\mu_{g,c}$ nearest to $x_n$. When we apply $g$ to $x_n$, the nearest centroid becomes the one obtained by applying $g$ to the originally nearest centroid. Therefore, $G$ acts as a permutation matrix on this space, which is orthogonal. We denote this representation as $1_{\mu}$.

##### Invariant Bilinear Pooling

Since BP is written as the tensor product of $x_n$ with itself, we can directly use

 $$F=\mathrm{vec}\left(\frac{1}{N}\sum_{n=1}^{N}P_1\left(x_n\otimes x_n\right)\right), \tag{5}$$

as the invariant global feature. Though $P_1$ is the projection onto the trivial representation defined by Eq. (1), in practice we can calculate this feature from the irreducible decomposition of $x_n$ and the irreducible decomposition of the tensor products. We can also apply normalization to the invariant covariance separately for each component of the decomposition to obtain variants of BP such as iBP. Because we discard the elements that are not invariant, the feature dimensions of these methods are smaller than the original ones.
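A brute-force reading of Eq. (5) is sketched below: it applies $P_1$ on the tensor product space by averaging $(\pi(g)x_n)\otimes(\pi(g)x_n)$ over the group. The toy group (cyclic shifts of four coordinates) and the data are assumptions for illustration. Unlike the computation via the irreducible decomposition described above, this sketch keeps the full $d^{2}$-dimensional vector instead of only the invariant coordinates.

```python
# Assumed toy group: cyclic shifts of four coordinates, stored as index tuples.
GROUP = [(0, 1, 2, 3), (1, 2, 3, 0), (2, 3, 0, 1), (3, 0, 1, 2)]


def act(perm, x):
    """Apply the permutation matrix of `perm` to x: result[i] = x[perm[i]]."""
    return [x[i] for i in perm]


def invariant_bp(points, group):
    """Eq. (5) by brute force: average (pi(g) x) (x) (pi(g) x) over n and g."""
    d = len(points[0])
    total = [[0.0] * d for _ in range(d)]
    for x in points:
        for g in group:
            gx = act(g, x)
            for i in range(d):
                for j in range(d):
                    total[i][j] += gx[i] * gx[j]
    scale = len(points) * len(group)
    # vec(.) flattens the d x d matrix row by row.
    return [total[i][j] / scale for i in range(d) for j in range(d)]


# Hypothetical local features of one image, and the same features after a
# group element has been applied to every one of them.
points = [[1.0, 2.0, 0.0, 3.0], [2.0, 0.0, 1.0, 1.0]]
shifted = [act(GROUP[1], x) for x in points]

feature = invariant_bp(points, GROUP)
feature_shifted = invariant_bp(shifted, GROUP)
```

The two features agree, which is the invariance that Eq. (5) is designed to provide.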

##### Invariant VLAD

We subsequently propose an invariant version of the VLAD as follows:

 $$F=\mathrm{vec}\left(\frac{1}{N}\sum_{n=1}^{N}P_1\left(1_{\mu}(x_n)\otimes(x_n-\mu_c)\right)\right), \tag{6}$$

where $\mu_c$ is the nearest centroid to $x_n$. Because $1_{\mu}(x_n)$ is the vector in which only the element corresponding to the nearest $\mu_c$ is $1$, the tensor product becomes the vector in which the elements corresponding to the nearest $\mu_c$ are $x_n-\mu_c$, which is the same as the original VLAD. Because both the space of $1_{\mu}(x_n)$ and the space of $x_n$ are orthogonal representation spaces, this space is also an orthogonal representation space. Although the size of the codebook becomes $|G|$-times larger, the dimension of the global feature is not as large because we only use the invariant vector. When we use the D4 group, as will be described later, we can prove that the dimension of the Invariant VLAD with $|G|C$ centroids equals that of the original VLAD with $C$ centroids.
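One way to see why the invariant part of Eq. (6) stays small is sketched below. Under the assumption that the codebook is tied as $\mu_{g,c}=\pi(g)\mu_c$ and that group elements act as coordinate permutations, the invariant vector is determined by the residuals after each local feature is mapped back to the frame of its base centroid. The function name `invariant_vlad` and this aligned-residual parametrization are an illustrative derivation, not the authors' stated implementation.

```python
def act(perm, x):
    """Apply the permutation matrix of `perm` to x: result[i] = x[perm[i]]."""
    return [x[i] for i in perm]


def inverse(perm):
    """Inverse permutation, so that act(inverse(p), act(p, x)) == x."""
    inv = [0] * len(perm)
    for i, j in enumerate(perm):
        inv[j] = i
    return tuple(inv)


def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def invariant_vlad(points, base, group):
    """Aligned-residual coordinates of the invariant part of Eq. (6)."""
    sums = [[0.0] * len(mu) for mu in base]
    for x in points:
        # Nearest centroid among all transformed copies pi(g) mu_c.
        _, g, c = min(
            (sq_dist(x, act(g, mu)), g, c)
            for c, mu in enumerate(base)
            for g in group
        )
        # Undo g so that the residual is expressed relative to mu_c itself.
        aligned = act(inverse(g), x)
        sums[c] = [s + a - m for s, a, m in zip(sums[c], aligned, base[c])]
    return [s / len(points) for total in sums for s in total]


# Hypothetical setting: the swap group on R^2 and one base centroid.
SWAP_GROUP = [(0, 1), (1, 0)]
base = [[4.5, 0.5]]
points = [[4.0, 0.0], [1.0, 6.0]]
flipped = [act(SWAP_GROUP[1], x) for x in points]

feature = invariant_vlad(points, base, SWAP_GROUP)
feature_flipped = invariant_vlad(flipped, base, SWAP_GROUP)
```

The feature has one block per base centroid, so its length matches a VLAD with the base centroids only, and it does not change when every local feature is flipped.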

##### Invariant VLAT

We finally propose the invariant VLAT, which can incorporate local second-order statistics. We can calculate the feature by combining the two features above:

 $$F=\mathrm{vec}\left(\frac{1}{N}\sum_{n=1}^{N}P_1\left(1_{\mu}(x_n)\otimes\left((x_n-\mu_c)\otimes(x_n-\mu_c)-T_c\right)\right)\right). \tag{7}$$

The dimension of the Invariant VLAT with $|G|C$ components is the same as that of the VLAT with $C$ components.

### 4.4 D4 group

In the experiment, we used the D4 group in cohen2016steerable , which contains rich information and is easy to calculate. The D4 group is a group consisting of $\pi/2$ image rotations $r$ and an image flipping $m$, with $|D_4|=8$. We summarize the properties of the D4 group in the appendix.

Because D4 does not act orthogonally on the output of general CNNs, we pretrained group equivariant CNNs with respect to the D4 group, and used the last convolutional activation as the local feature extractor. The group equivariant CNN is a model in which we preserve the group action by using $|G|$ times the number of filters, obtained by applying the $g$s to the original filters, and by taking the average with respect to $g$ when we apply convolution. When the feature is $8n$-dimensional, it can be regarded as the $n$-times direct sum of the eight-dimensional orthogonal representation space because D4 acts as a permutation. As the regular representation of D4, this eight-dimensional representation decomposes into one copy of each of the four one-dimensional irreducible representations and two copies of the two-dimensional irreducible representation.
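The sketch below constructs D4 concretely and counts invariant dimensions with the character formula from Section 3. It assumes that the eight-dimensional block is the regular (permutation) representation, realizes D4 as permutations of the four corners of a square, and uses hypothetical generator names `ROT` and `FLIP`.

```python
ROT = (1, 2, 3, 0)   # quarter turn: corner i goes to corner i + 1 (mod 4)
FLIP = (1, 0, 3, 2)  # reflection swapping corners 0 <-> 1 and 2 <-> 3


def compose(p, q):
    """Permutation 'p after q': i -> p[q[i]]."""
    return tuple(p[j] for j in q)


def generate(generators):
    """Close a set of generators under composition."""
    identity = tuple(range(len(generators[0])))
    elements = {identity}
    frontier = [identity]
    while frontier:
        g = frontier.pop()
        for s in generators:
            h = compose(s, g)
            if h not in elements:
                elements.add(h)
                frontier.append(h)
    return sorted(elements)


def regular_character(g, group):
    """Trace of left multiplication by g acting on the group elements."""
    return sum(1 for h in group if compose(g, h) == h)


D4 = generate([ROT, FLIP])
chars = [regular_character(g, D4) for g in D4]

# Multiplicity of the trivial representation = average of the character.
n_invariant_linear = sum(chars) // len(D4)
# For the tensor product (bilinear features) the character is squared.
n_invariant_bilinear = sum(c * c for c in chars) // len(D4)
```

Under these assumptions one eight-dimensional block carries a single invariant direction, and its 64-dimensional bilinear feature carries eight invariant directions.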

## 5 Experiment

### 5.1 Experiment with fixed local features

In this subsection, we evaluated our methods on image recognition datasets using pretrained CNN local features. Note that we fixed the local features to compare only the performance of coding methods, and thus the overall scores are lower than the existing end-to-end methods.

We evaluated the methods on the Flickr Material Dataset (FMD) sharan2013recognizing , the Describable Textures Dataset (DTD) cimpoi2014describing , the UIUC material dataset (UIUC) liao2013non , Caltech-UCSD Birds (CUB) welinder2010caltech , and Stanford Cars (Cars) krause20133d . FMD contains 10 material categories with 1,000 images. DTD contains 47 texture categories with 5,640 images. UIUC contains 18 categories with 216 images. CUB contains 200 bird categories with 11,788 images. Cars consists of 196 car categories with 16,185 images. We used the given train-test splits for DTD, CUB, and Cars. For FMD and UIUC, we randomly split the data 10 times such that the sizes of the train and test data are the same for each category.

We pretrained the group equivariant CNNs with the VGG19 architecture with reduced convolutional filter sizes as the local feature extractor. Further, we added batch normalization layers for each convolution layer to accelerate the training. We trained the model with the ILSVRC2012 dataset imagenet_cvpr09 . We applied the standard data augmentation strategy and used the same learning setting as the original VGG-Network.

We extracted the last convolutional activation of this pretrained group equivariant VGG19 after rescaling the input images at multiple scales. For efficiency, we discarded the scales that increased the image size beyond a fixed pixel limit. Subsequently, we applied the nonlinear embedding proposed in vedaldi2012efficient such that the feature dimension is three times as large. Because this embedding is a point-wise function, we can regard the output as three times the direct sum of the original representations when we consider the group action.

We subsequently reduced the local feature dimension using PCA for the existing methods and the proposed Invariant PCA for the proposed methods. We applied BP and iBP with 1,024 dimensions, VLAD with 512 dimensions and 1,024 components, FV with 512 dimensions and 512 components, and VLAT with 256 dimensions and 8 components. We also applied the proposed BP and iBP with the same settings, and the proposed VLAD and VLAT with eight times the number of components.

We used the linear SVM implemented in liblinear REF08a , and evaluated the average of the test accuracy. Further, to validate that the proposed feature is D4 invariant, we used the same training data and evaluated the accuracy when we augmented the test data eight times using the D4 group.

Table 1 shows the accuracy for the original test datasets and for the augmented test datasets. Our methods demonstrate better accuracy than the non-invariant methods. Further, the dimensions of Invariant BP and iBP are only a fraction of the original dimensions. Thus, we obtained features with much smaller dimensions and higher performance. This table also shows that the existing methods show poor performance on the augmented test data, whereas the proposed methods demonstrate performance similar to the original score. These results suggest that the existing methods use information that is not related to the image contents but to dataset biases such as the angle of the contents in the image. Our method can discard such bias and focus more on the contents of the image.

### 5.2 Experiment with end-to-end model

We then applied our Invariant BP to an end-to-end learning framework and evaluated the accuracy on the ILSVRC2012 imagenet_cvpr09 dataset. We constructed the model based on iSQRT-COV li2018towards , which is a variant of BP that demonstrated good performance.

We used the group equivariant CNNs with Resnet50 he2016deep architecture as a local feature extractor, where we reduced the filter sizes like the case of VGG19. We then substituted global average pooling with iSQRT-COV and proposed Invariant iSQRT-COV.

We trained the whole models, including the feature extractor, using SGD with momentum with an initial learning rate of 0.1, a momentum of 0.9, and a weight decay rate of 1e-4 for 65 epochs. We multiplied the learning rate by 0.1 at 30, 45, and 60 epochs. We set the batch size to 160 for Inv iSQRT-COV and 80 for iSQRT-COV owing to the GPU memory restriction. We then evaluated the top-1 and top-5 test errors. We used the average score for the original image and the flipped image for the evaluation.

Table 2 shows that iSQRT-COV does not gain accuracy by changing the feature extractor to the equivariant one. On the other hand, our Inv iSQRT-COV demonstrates good accuracy with small feature dimension. Therefore, considering invariance is also effective for the end-to-end training.

## 6 Conclusion

In this research, we proposed a feature coding method that considers the transformations that preserve the image information. Based on group representation theory, we proposed a guideline in which we first construct a feature space on which the group acts orthogonally, and subsequently calculate the invariant vector. We then constructed a novel model learning method and coding methods that explicitly consider the group action. We applied our methods to image classification datasets and demonstrated that our features can yield high accuracy while preserving invariance. Our work provides a novel framework for constructing invariant features.

#### Acknowledgements

This work was partially supported by JST CREST Grant Number JPMJCR1403, Japan, and partially supported by JSPS KAKENHI Grant Number JP19176033. We would like to thank Atsushi Kanehira, Takuhiro Kaneko, Toshihiko Matsuura, Takuya Nanri and Masatoshi Hidaka for the helpful discussion.

## References

• (1) R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. Netvlad: Cnn architecture for weakly supervised place recognition. In CVPR, 2016.
• (2) M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In CVPR, 2014.
• (3) T. Cohen and M. Welling. Group equivariant convolutional networks. In ICML, 2016.
• (4) T. S. Cohen and M. Welling. Steerable cnns. In ICLR, 2017.
• (5) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
• (6) R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. JMLR, 9:1871–1874, 2008.
• (7) W. Fulton and J. Harris. Representation Theory: A First Course. Springer, 2014.
• (8) Y. Gao, O. Beijbom, N. Zhang, and T. Darrell. Compact bilinear pooling. In CVPR, 2016.
• (9) M. Gou, F. Xiong, O. Camps, and M. Sznaier. Monet: Moments embedding network. In CVPR, 2018.
• (10) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
• (11) E. Hoogeboom, J. W. Peters, T. S. Cohen, and M. Welling. Hexaconv. In ICLR, 2018.
• (12) H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In CVPR, 2010.
• (13) J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3d object representations for fine-grained categorization. In ICCV, 2013.
• (14) D. Laptev, N. Savinov, J. M. Buhmann, and M. Pollefeys. Ti-pooling: transformation-invariant pooling for feature learning in convolutional neural networks. In CVPR, 2016.
• (15) P. Li, J. Xie, Q. Wang, and Z. Gao. Towards faster training of global covariance pooling networks by iterative matrix square root normalization. In CVPR, 2018.
• (16) Z. Liao, J. Rock, Y. Wang, and D. Forsyth. Non-parametric filtering for geometric detail extraction and material representation. In CVPR, 2013.
• (17) T.-Y. Lin and S. Maji. Improved bilinear pooling with cnns. In BMVC, 2017.
• (18) T.-Y. Lin, A. RoyChowdhury, and S. Maji. Bilinear cnn models for fine-grained visual recognition. In ICCV, 2015.
• (19) D. Marcos, M. Volpi, N. Komodakis, and D. Tuia. Rotation equivariant vector field networks. In ICCV, 2017.
• (20) H. Nakayama, T. Harada, and Y. Kuniyoshi. Dense sampling low-level statistics of local features. IEICE TRANSACTIONS on Information and Systems, 93:1727–1736, 2010.
• (21) H. Nakayama, T. Harada, and Y. Kuniyoshi. Global gaussian approach for scene categorization using information geometry. In CVPR, 2010.
• (22) D. Picard and P.-H. Gosselin. Efficient image signatures and similarities using tensor products of local descriptors. CVIU, 117:680–687, 2013.
• (23) J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek. Image classification with the fisher vector: Theory and practice. IJCV, 105:222–245, 2013.
• (24) L. Sharan, C. Liu, R. Rosenholtz, and E. H. Adelson. Recognizing materials using perceptually inspired features. IJCV, 103(3):348–371, 2013.
• (25) K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2014.
• (26) J. Sokolic, R. Giryes, G. Sapiro, and M. R. Rodrigues. Generalization error of invariant classifiers. In AISTATS, 2017.
• (27) A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. PAMI, 34(3):480–492, 2012.
• (28) Q. Wang, P. Li, and L. Zhang. G2denet: Global gaussian distribution embedding network and its application to visual recognition. In CVPR, 2017.
• (29) Q. Wang, P. Li, W. Zuo, and L. Zhang. Raid-g: Robust estimation of approximate infinite dimensional gaussian with application to material recognition. In CVPR, 2016.
• (30) P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-ucsd birds 200. Technical Report CNS-TR-201, Caltech, 2010.
• (31) D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow. Harmonic networks: Deep translation and rotation equivariance. In CVPR, 2017.

## Appendix A Illustrative Example of the proposed method.

We illustrate our methods in a simple setting.

##### Group consisting of the identity mapping and image flipping.

We consider the group consisting of the identity mapping and horizontal image flipping . Since applying the image flip twice recovers the original image, it follows that , , , . This definition satisfies the three properties of a group by setting :

• is associative: satisfies .

• An identity element exists that satisfies for all .

• Every has an inverse that satisfies .
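The three properties can be checked mechanically for this two-element group. The following Python sketch is illustrative only: the labels `e` (identity) and `m` (horizontal flip) and the helpers `compose` and `name_of` are assumptions introduced here, not notation taken from the paper. Each element is encoded as a permutation of the two pixel positions.

```python
from itertools import product

# Hypothetical labels: "e" is the identity mapping, "m" is the horizontal flip.
# Each element is stored as a permutation of the two pixel positions.
G = {"e": (0, 1), "m": (1, 0)}

def compose(a, b):
    """Return the permutation obtained by applying b first, then a."""
    return tuple(a[b[i]] for i in range(len(b)))

def name_of(p):
    """Look up the label of a permutation; raises if p is not in G (closure)."""
    return next(k for k, v in G.items() if v == p)

# Multiplication table of the group.
table = {(x, y): name_of(compose(G[x], G[y])) for x, y in product(G, G)}

# Associativity: (x y) z == x (y z) for all triples.
assoc_ok = all(
    compose(compose(G[x], G[y]), G[z]) == compose(G[x], compose(G[y], G[z]))
    for x, y, z in product(G, repeat=3)
)
# Identity: e x == x e == x for all x.
identity_ok = all(table[("e", x)] == x and table[(x, "e")] == x for x in G)
# Inverse: every x has some y with x y == y x == e.
inverse_ok = all(
    any(table[(x, y)] == "e" and table[(y, x)] == "e" for y in G) for x in G
)

print(table)
print(assoc_ok, identity_ok, inverse_ok)
```

Applying the flip twice gives the identity, so the flip is its own inverse in this table.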

Also, we can see that the irreducible representations are and . This is proved as follows: the s defined above satisfy the conditions of a representation:

• For , .

• .

1-dimensional representations are trivially irreducible. When the dimension of the representation space is larger than 1, we denote and . We apply the eigendecomposition of to obtain , where is diagonal, and we denote the -th diagonal element as . Because and , this implies that is decomposed as a direct sum of the representations , for each -th dimension. Therefore, this representation is not irreducible. Thus, an irreducible representation needs to be 1-dimensional. Also, . Thus, is 1 or -1.

##### Image feature and its irreducible decomposition.

As an image feature, we use the concatenation of the luminosities of two horizontally adjacent pixels as a local feature and apply average pooling to obtain a global feature. Thus, the feature dimension is 2. Since image flipping reverses the order of the pixels, it acts as the permutation of the first and second elements in the feature space. Therefore, the group acts as , , as plotted in the left panel of Figure 2. By applying the orthogonal matrices calculated by Eq. (7) in the original paper, we obtain the feature space shown in the center panel of Figure 2. In this space, the group acts as , . Thus, this is the irreducible decomposition of into as defined in the Irreducible representation paragraph in Section 3 of the original paper. Note that, in the general case, each representation matrix is block-diagonalized rather than diagonalized, and each diagonal block becomes more complex, like those listed in Table 1 in the original paper.
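A small numerical sketch of this decomposition follows. It assumes the natural choice of orthogonal matrix for this toy case, namely the normalized sum and difference of the two pixel values; the names `rho_m`, `P`, `dec_m`, and the sample feature `x` are hypothetical and are not taken from the paper.

```python
import numpy as np

# Action of the group on the 2-D feature (left pixel, right pixel).
rho_e = np.eye(2)                            # identity mapping
rho_m = np.array([[0.0, 1.0], [1.0, 0.0]])   # flip swaps the two entries

# Assumed orthogonal change of basis: rows are the normalized sum and difference.
P = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)

# Representation matrices expressed in the new coordinates.
dec_e = P @ rho_e @ P.T
dec_m = P @ rho_m @ P.T   # diagonal: +1 on the sum, -1 on the difference

# A sample feature and its flipped version in the decomposed space.
x = np.array([0.2, 0.7])
z = P @ x
z_flipped = P @ (rho_m @ x)

print(np.round(dec_m, 6))
print(z, z_flipped)
```

In the new coordinates the flip leaves the first element unchanged and negates the second, which is the diagonal form described above.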

##### Invariant classifier.

In this decomposed space, when we train the classifier on both the original images and the flipped images, the classification boundary learned with l2-regularized convex loss minimization takes the form of , as plotted in the right panel of Figure 2. Thus, we can discard the -elements of the features and obtain a compact feature vector. This result is validated as follows: when we denote the feature in the decomposed space that corresponds to the original image as , the feature that corresponds to the flipped image becomes because . When we write the linear classifier as , the loss for these two images is written as . From Jensen's inequality, this is lower-bounded by . This means that the loss is minimized when , resulting in the classification boundary . This calculation is generalized to Eq. (10) in the original paper.
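The effect can be reproduced numerically. The sketch below is an illustration under stated assumptions rather than the paper's experiment: it uses the squared loss (one convex loss among many) with l2 regularization so that the minimizer has a closed form, and it uses synthetic features. The names `Z`, `D`, `lam`, and `w_inv` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
M, lam = 20, 0.1

# Synthetic features in the decomposed space and synthetic +/-1 labels.
Z = rng.normal(size=(M, 2))
y = np.sign(Z[:, 0] + 0.5 * Z[:, 1])

# The flip negates the second coordinate; flipped images keep their labels.
D = np.diag([1.0, -1.0])
A = np.vstack([Z, Z @ D])
t = np.concatenate([y, y])

# Closed-form minimizer of sum((A w - t)^2) + lam * ||w||^2.
w = np.linalg.solve(A.T @ A + lam * np.eye(2), A.T @ t)

# The same problem restricted to the flip-invariant first coordinate.
w_inv = (2 * Z[:, 0] @ y) / (2 * Z[:, 0] @ Z[:, 0] + lam)

print(w, w_inv)
```

The weight on the sign-flipping coordinate vanishes up to rounding error, and the remaining weight matches the classifier trained on the invariant coordinate alone.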

##### Tensor product representation.

Subsequently, we consider the 'product' of feature spaces to obtain a nonlinear feature. We consider two feature spaces and where acts as , on both spaces. We can use any input spaces as long as these spaces and satisfy the above condition. For example, we use the same feature space for and to obtain bilinear pooling. The tensor product of these spaces becomes a 4-dimensional vector space consisting of . Since acts as permutations of and , and at the same time, acts as permutations of and and and at the same time. Thus, is written as . Thus, the tensor product space is also a group representation space. This space can also be decomposed into irreducible representations as and . We only need the first 2 elements as an invariant feature.
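The tensor product action can be sketched with a Kronecker product. This is an illustrative example, not the paper's implementation: the matrices `rho_m` and `P` are the assumed swap matrix and sum/difference basis for the toy two-pixel feature, and the sample vectors `x` and `y` are arbitrary. In the basis produced by `np.kron`, the invariant elements are found by their sign under the flip and are not necessarily the first two entries.

```python
import numpy as np

rho_m = np.array([[0.0, 1.0], [1.0, 0.0]])               # flip on one 2-D space
P = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)   # assumed decomposition

# Flip acting on the 4-D tensor product space (x1y1, x1y2, x2y1, x2y2).
K = np.kron(rho_m, rho_m)

# Group average over {identity, flip}: projector onto the invariant subspace.
proj = (np.eye(4) + K) / 2
n_invariant = int(round(np.trace(proj)))

# In the product of the decomposed bases the flip is diagonal with entries +/-1.
P2 = np.kron(P, P)
signs = np.diag(P2 @ K @ P2.T)
invariant_idx = np.where(np.isclose(signs, 1.0))[0]

# Tensor product feature before and after flipping both inputs.
x = np.array([0.2, 0.7])
y = np.array([0.9, 0.1])
feat = P2 @ np.kron(x, y)
feat_flipped = P2 @ np.kron(rho_m @ x, rho_m @ y)

print(n_invariant, signs, invariant_idx)
print(feat[invariant_idx], feat_flipped[invariant_idx])
```

Two of the four product coordinates are unchanged by the flip, so only those two need to be kept as the invariant feature.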

As plotted in Figure 4, when the learned codebook is for , and are both assigned to . When these features are flipped, flipped is assigned to while flipped is assigned to . Thus, image flipping does not act consistently on the assignment vector . As plotted in Figure 4, when we learn the codebook so that there exists with the -element flipped for each codeword, the group acts consistently on . This is because whenever is assigned to , flipped is assigned to . Also, flipped is assigned to when is assigned to . Thus, the group acts orthogonally on .
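The consistency of the assignment under a flip-closed codebook can be checked with a short sketch. It assumes features in the decomposed space, where the flip negates the second element, and uses random codewords in place of a learned codebook; `assign`, `partner`, and `base` are hypothetical names.

```python
import numpy as np

D = np.array([1.0, -1.0])   # flip in the decomposed space: negate 2nd element

def assign(x, codebook):
    """Index of the nearest codeword (hard assignment)."""
    return int(np.argmin(np.sum((codebook - x) ** 2, axis=1)))

rng = np.random.default_rng(0)
base = rng.normal(size=(3, 2))          # stand-in for learned codewords
codebook = np.vstack([base, base * D])  # codeword k + K is codeword k flipped
K = len(base)

def partner(k):
    """Index of the flipped counterpart of codeword k."""
    return (k + K) % (2 * K)

# Flipping a feature should move its assignment to the partner codeword.
xs = rng.normal(size=(50, 2))
consistent = all(
    assign(x * D, codebook) == partner(assign(x, codebook)) for x in xs
)
print(consistent)
```

Because flipping permutes the codeword indices, it acts on the one-hot assignment vector as a permutation matrix, which is orthogonal.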

## Appendix B Proof of Theorem 1.

In this section, we provide the proof for Theorem 1.

###### Proof.

A non-trivial irreducible unitary representation satisfies . This is because if we assume , there exists that satisfies and is a one-dimensional -invariant subspace. This violates the irreducibility of . Because is a unitary representation of , it is completely reducible. We write the elements of and as . It follows that , . Further, s and s are orthogonal for different s. It follows that

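$$
\begin{aligned}
\frac{1}{M}\sum_{m=1}^{M} l\left(\langle w, v_m\rangle_{\mathbb{R}}, y_m\right)
&= \frac{1}{M}\sum_{m=1}^{M} l\left(\mathrm{Re}\left(\sum_{t=1}^{T}\left\langle w^{(t)}, v_m^{(t)}\right\rangle_{\mathbb{C}}\right), y_m\right) \\
&= \frac{1}{M|G|}\sum_{m=1}^{M}\sum_{g\in G} l\left(\mathrm{Re}\left(\sum_{t=1}^{T}\left\langle w^{(t)}, \tau_t(g^{-1})\, v_m^{(t)}\right\rangle_{\mathbb{C}}\right), y_m\right) \\
&= \frac{1}{M|G|}\sum_{m=1}^{M}\sum_{g\in G} l\left(\mathrm{Re}\left(\sum_{t=1}^{T}\left\langle \tau_t(g)\, w^{(t)}, v_m^{(t)}\right\rangle_{\mathbb{C}}\right), y_m\right) \\
&\geq \frac{1}{M}\sum_{m=1}^{M} l\left(\mathrm{Re}\left(\sum_{t=1}^{T}\left\langle \frac{1}{|G|}\sum_{g\in G}\tau_t(g)\, w^{(t)}, v_m^{(t)}\right\rangle_{\mathbb{C}}\right), y_m\right) \\
&= \frac{1}{M}\sum_{m=1}^{M} l\left(\left\langle w^{(1)}, v_m^{(1)}\right\rangle_{\mathbb{R}}, y_m\right),
\end{aligned}
\tag{8}
$$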

where $\mathrm{Re}$ denotes the real part of a complex number; the first equality comes from the orthogonality of the $w^{(t)}$s and $v_m^{(t)}$s; the second equality comes from the $G$-invariance of the training data; the third comes from the unitarity of $\tau_t$; the inequality comes from the convexity of $l$ and the additivity of $\mathrm{Re}$ and of the inner products; the final equality comes from the fact that the average of $\tau_t(g)$ over $g \in G$ equals $0$ for non-trivial $\tau_t$, and that $w^{(1)}$ and $v_m^{(1)}$ are real vectors.

Combined with the fact that , the loss value of is larger than the loss value of . Therefore, the solution is -invariant. ∎
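The vanishing group average used at the start of the proof can be checked numerically on a concrete case. The sketch below uses the standard 2-dimensional irreducible representation of the D4 group, generated by a 90-degree rotation and a reflection; this choice of example, and the names `R`, `S`, and `w`, are assumptions made for illustration.

```python
import numpy as np

# Generators of the standard 2-D representation of D4 (orthogonal matrices).
R = np.array([[0.0, -1.0], [1.0, 0.0]])   # rotation by 90 degrees
S = np.array([[1.0, 0.0], [0.0, -1.0]])   # reflection about the first axis

# All eight group elements R^k S^j.
elements = [
    np.linalg.matrix_power(R, k) @ np.linalg.matrix_power(S, j)
    for k in range(4)
    for j in range(2)
]

# Sum over the group: zero for a non-trivial irreducible representation.
total = sum(elements)

# Averaging the action on any weight vector therefore removes this component.
w = np.array([0.3, -1.2])
w_avg = total @ w / len(elements)

print(total)
print(w_avg)
```

Only the component of the weight vector that transforms under the trivial representation survives the group average, which is the step that yields the last line of Eq. (8).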

## Appendix C Invariant PCA when the irreducible representations are not real.

In the case where some have complex elements, we cannot apply Algorithm 1 directly because the intertwining operators and covariance matrices become complex. In this case, we couple with into . Because , the multiplicities of and in the decomposition are equal because of Eq. (6), and the projected vectors are complex conjugates of each other because of Eq. (7). Thus, times the concatenation of the real and imaginary parts becomes the -th elements. We replace with defined above in Algorithm 1 to obtain Invariant PCA in the general case.

## Appendix D Properties of the D4 group

We summarize the irreducible representations and decomposition of the tensor products in Tables 3 and 4 respectively.

## Appendix E Proof that the dimension of Invariant VLAD and VLAT is the same as that of original feature

When we use components, we can decompose into . When first or second order statistics are decomposed as , the multiplicity of of is , which is the same as the dimension of original feature.

## Appendix F Results on SIFT features

In this section, we report the results using the SIFT feature, on which the D4 group acts orthogonally. We describe the irreducible decomposition of the SIFT feature in Section F.1. We evaluate the accuracy on image recognition datasets in Section F.2.

### F.1 Irreducible decomposition of SIFT

Figure 5 gives an overview of the SIFT feature, where the D4 group acts as a permutation on both and . We can further decompose these into permutations on , , , , and . These permutations can be decomposed as , , , , and respectively, which can be calculated using the characters of the representations. Thus, the permutation on SIFT can be decomposed as .
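The character-based calculation can be sketched on a toy case. The example below is not the SIFT index sets themselves: it assumes D4 permuting the four corners of a square, encodes each group element as a pair `(k, j)` standing for a rotation power and an optional reflection, and uses the hypothetical labels `A1`, `A2`, `B1`, `B2`, `E` for the five irreducible representations. The multiplicity of each irreducible representation is the average over the group of the product of its character with the number of fixed points.

```python
from itertools import product

# D4 elements encoded as (k, j): reflect j times, then rotate by 90*k degrees.
elements = list(product(range(4), range(2)))

def act(g, i):
    """Action of g = (k, j) on corner index i in {0, 1, 2, 3}."""
    k, j = g
    return (k + (i if j == 0 else -i)) % 4

def perm_character(g):
    """Character of a permutation representation: the number of fixed points."""
    return sum(1 for i in range(4) if act(g, i) == i)

# Characters of the irreducible representations of D4 (all real-valued).
irreps = {
    "A1": lambda k, j: 1,
    "A2": lambda k, j: (-1) ** j,
    "B1": lambda k, j: (-1) ** k,
    "B2": lambda k, j: (-1) ** (k + j),
    "E": lambda k, j: (2 if k == 0 else -2 if k == 2 else 0) if j == 0 else 0,
}

def multiplicity(chi):
    """Inner product of the permutation character with an irreducible one."""
    total = sum(perm_character(g) * chi(*g) for g in elements)
    return total // len(elements)

mult = {name: multiplicity(chi) for name, chi in irreps.items()}
print(mult)
```

For this toy action the 4-dimensional permutation representation splits into two 1-dimensional components and one 2-dimensional component; the same inner-product recipe applies to each permuted index set of a larger feature.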

### F.2 Experimental Results

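We evaluated the methods on the Flickr Material Database (FMD) sharan2013recognizing, the Describable Textures Dataset (DTD) cimpoi2014describing, the UIUC material dataset (UIUC) liao2013non, and Caltech-UCSD Birds (CUB) welinder2010caltech. The evaluation protocol is the same as that used for the VGG feature.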

We extracted dense SIFT features from multi-scale images, as in the VGG case, and then applied the nonlinear homogeneous mapping vedaldi2012efficient, which makes the feature dimension three times as large.

We reduced the dimension to 256 and then we applied BP, VLAD with 1,024 components, FV with 512 components, and VLAT with 16 components. We also applied the proposed Invariant BP with the same setting, and VLAD and VLAT with eight times the number of components.

Table 5 shows a tendency similar to that of the results for the VGG feature. Our methods perform better than existing methods on both the original test data and the augmented test data.