Deep Convolutional Neural Networks (CNNs) have shown great success in many vision tasks. Several successful networks have been proposed, e.g., AlexNet [Krizhevsky2012ImageNet], VGG [Simonyan2014Very], GoogLeNet [Szegedy2014Going], Network In Network [Lin2013Network] and ResNet [He2015Deep]. Driven by the emergence of large-scale data sets and the fast development of computation power, features based on CNNs have proven to perform remarkably well on a wide range of visual recognition tasks [Zeiler2013Visualizing, Donahue2013DeCAF]. Two contemporaneous works by Lin et al. [Lin2016Bilinear] and Babenko and Lempitsky [Yandex2016Aggregating] demonstrate that convolutional features can be seen as a set of local features that capture the visual representation related to objects. To make better use of deep convolutional features, many efforts have been devoted to aggregating them, such as max pooling [Tolias2015Particular], cross-dimensional pooling [Kalantidis2015Cross], sum pooling [Yandex2016Aggregating], and bilinear pooling [Lin2016Bilinear, Gao2015Compact]. However, modeling these convolutional features to boost the feature learning ability of a CNN is still a challenging task. This work investigates a more effective scheme to aggregate convolutional features to produce a robust representation using an end-to-end deep network for visual tasks.
The second-order statistics of convolutional features, e.g., the covariance matrix and the Gaussian distribution, are widely used SPD matrix representations combined with CNNs [Li2017Is, Ionescu2015Matrix, Yu2017Second]. The dimensionality of convolutional features extracted from CNNs may be much larger than that of hand-crafted features. As a result, modeling convolutional features from CNNs with the covariance matrix or the Gaussian distribution is insufficient to precisely capture the real feature distribution. When the dimension of features is larger than the number of features, the covariance matrix and the Gaussian-based representation are only symmetric Positive SemiDefinite (PSD) matrices, i.e., singular matrices. A singular matrix gives the data an unreasonable manifold structure. In this case, Riemannian metrics, e.g., the affine-invariant distance and the Log-Euclidean distance, are unsuitable for measuring the manifold structure of SPD matrices. Moreover, most SPD matrix embeddings in deep networks only capture the linear correlation of features. The ability to capture nonlinear relationships among features is indispensable for a generic representation.
It is thus desirable to establish a more discriminative and suitable SPD representation, aggregated from deep features, in an end-to-end framework for visual analysis. To this end, we design a series of new layers to overcome the aforementioned issues, based on the following two observations.
Kernel functions are able to model nonlinear relationships in data, and they are flexible and easy to compute. Works such as Beyond Covariance [Wang2015Beyond] have demonstrated the advantages of positive definite kernel functions, whose kernel matrices are real SPD matrices regardless of the feature dimension and the number of features. Since many kernel functions are differentiable, such as the Radial Basis Function (RBF) kernel, the polynomial kernel and the Laplacian kernel [Bo2010Kernel], they can be readily embedded into a network for end-to-end training, which is well aligned with the design requirements of a deep network.
Several deep SPD networks [Dong2017Deep, Huang2016A, Zhang2017Deep] transform an SPD matrix into a new compact and discriminative matrix. The network input is an SPD matrix, and the transformed matrix is still an SPD matrix that captures desirable data properties. We find that the transformed SPD matrix after the learnable layers leads to better performance than the original SPD matrix. The output SPD matrix not only has the characteristics of a general SPD matrix that captures the desirable properties of visual features, but is also more suitable and discriminative for the specific visual task.
Motivated by the empirical observations mentioned above, we introduce a convolutional feature aggregation operation which consists of the SPD generation and the SPD transformation. Three new layers, namely a kernel aggregation layer, an SPD matrix transformation layer and a vectorization layer, are designed to replace the traditional global pooling layers and fully connected (FC) layers. Concretely, we treat each feature map as a sample and present a kernel aggregation layer that uses a nonlinear kernel function to generate an SPD matrix. The proposed kernel matrix models a nonlinear relationship among feature maps and ensures that the SPD matrix is nonsingular. More importantly, our kernel matrix is differentiable, which meets the requirements of a deep network. The SPD matrix transformation layer is employed to map the SPD matrix to a more discriminative and compact one. Thanks to the symmetry property, the vectorization layer carries out the upper triangle vectorization and normalization operations on the SPD matrix, followed by the classifier. The architecture of our network is illustrated in Fig. 1. The proposed method first generates an SPD matrix based on convolutional features and then transforms the initial SPD matrix to a more discriminative one. It can not only capture the real spatial information but also encode high-level variation information among convolutional features. The obtained descriptor acts as a mid-level representation bridging convolutional features and high-level semantic features. The resulting vector can contribute to visual classification tasks, as validated in the experiments.
In summary, our contributions are three-fold.
(1) We apply nonlinear SPD matrix aggregation to convolutional features through two processes, generation and transformation. In this way, the network learns a compact and robust SPD matrix representation that characterizes the underlying structure of convolutional features.
(2) We carry out the nonlinear aggregation of convolutional features under a Riemannian deep network architecture, where three novel layers are introduced, i.e., a kernel aggregation layer, an SPD matrix transformation layer and a vectorization layer. Our SPD aggregation network consistently achieves state-of-the-art performance on visual classification tasks.
(3) We exploit faster matrix operations to avoid loop-based computation in the forward and backward propagations of the kernel aggregation layer. In addition, we present the component decomposition and retraction on the orthogonal Stiefel manifold to carry out the backpropagation on the SPD matrix transformation layer.
The remaining sections are organized as follows. We review recent works on feature aggregation methods in both the Euclidean space and the Riemannian space in Section II. Section III presents the details of our SPD aggregation method. We report and discuss the experimental results in Section IV, and conclude the paper in Section V.
II Related Work
Feature aggregation is an important problem in computer vision. Recent works have witnessed significant advances of CNNs, but finding a suitable way to aggregate convolutional features is still challenging. In this section, we review typical techniques of feature aggregation in both the Euclidean space and the Riemannian space.
II-A Convolutional Feature Aggregation in the Euclidean Space
An effective image representation is an essential element for visual recognition due to the object appearance variations caused by pose, view angle, and illumination changes. Traditional methods typically obtain the image representation by aggregating hand-crafted local features (e.g., SIFT) into a global image descriptor. Popular aggregation schemes include Bag-of-Words (BOW) [Sivic2003Video], Fisher Vector (FV) [Liu2014Encoding], and Vector of Locally Aggregated Descriptors (VLAD) [Ng2015Exploiting]. Gong et al. [Gong2014Multi] introduced a multi-scale orderless pooling scheme to aggregate FC6 features of local patches into a global feature using VLAD. However, VLAD ignores the different effects of each cluster center. Cimpoi et al. [Cimpoi2015Deep] treated the convolutional layer of CNNs as a filter bank and built an orderless representation using FV. In addition, Liu et al. [Liu2015Cross] proposed a cross convolutional layer pooling scheme which regards feature maps as weighting filters for the local features. Tolias et al. [Tolias2015Particular] max pooled convolutional features of the last convolutional layer to represent each patch and achieved compelling performance for object retrieval. Babenko et al. [Yandex2016Aggregating] compared different kinds of aggregation methods (i.e., max pooling, sum pooling and Fisher vector) for last convolutional layer features and demonstrated that the sum-pooled convolutional descriptor is competitive with other aggregation schemes.
The works mentioned above only treat the CNN as a black-box feature extractor rather than studying the properties of CNN features in an end-to-end framework. Several researchers [Lin2016Bilinear, Zhang2016Deep, Arandjelovic2016NetVLAD] suggested that an end-to-end network can achieve better performance because it is sufficient by itself to discover good features for visual tasks. Arandjelovic et al. [Arandjelovic2016NetVLAD] proposed NetVLAD, which adopts an end-to-end framework for weakly supervised place recognition. Based on ResNet, Zhang et al. [Zhang2016Deep] introduced an extended version of VLAD, i.e., Deep-TEN, for texture classification. Lin et al. [Lin2016Bilinear] presented a general orderless pooling model named Bilinear to compute the outer product of local features. He et al. [He2015Spatial] introduced a spatial pyramid pooling method that eliminates the constraint of a fixed-size input image.
Recent research shows that exploiting a manifold structure representation is more effective than assuming a Euclidean distribution in some visual tasks. The difference between our method and the traditional aggregation methods in the Euclidean space is that we use the powerful SPD manifold structure to aggregate the desirable data distributions of features. We design an SPD aggregation scheme to generate the SPD matrix as the resulting representation, and transform the SPD representation to a more discriminative one by learnable layers.
II-B Convolutional Feature Aggregation in the Riemannian Space
The aggregation methods in the non-Euclidean space have been successfully applied, since they can capture more appropriate information about feature distributions. Second-order statistics perform better than first-order statistics such as average pooling [Li2017Is]. Some works directly regard the second-order statistics as the SPD matrix. Ionescu et al. [Ionescu2015Matrix] proposed the DeepO2P network, which uses a covariance matrix as the image representation. They mapped points on the manifold to the logarithm tangent space and derived a new chain rule for derivatives. Li et al. [Li2017Is] presented a matrix normalized covariance method exploring the second-order statistics. This work tackles the singularity issue of the covariance matrix by the normalization operation. Yu and Salzmann [Yu2017Second] introduced a covariance descriptor unit to integrate second-order statistic information. The covariance matrix of convolutional features is generated and then transformed to a vector for the softmax classifier. Compared with our network, these three works are confined by the drawbacks of covariance matrices. Engin et al. [Engin2017DeepKSPD] designed a deep kernel matrix based SPD representation, but it does not contain the transformation process.
Other SPD Riemannian networks mainly project an SPD matrix to a more discriminative one. Dong et al. [Dong2017Deep] and Huang and Gool [Huang2016A] proposed Riemannian networks contemporaneously, in which the inputs of the networks are SPD matrices. The networks project high dimensional SPD matrices to a low dimensional discriminative SPD manifold by a nonlinear mapping. Zhang et al. [Zhang2017Deep] introduced new layers to transform and vectorize the SPD matrix for action recognition, where the input is a nonlinear kernel matrix modeling the correlation of frames in a video. However, these three works only focus on how to transform the SPD matrix without utilizing the powerful convolutional features, so the generation of the input SPD matrix cannot be guided by the loss function. In contrast, our method focuses not only on the SPD matrix transformation but also on its generation from convolutional features.
Our work is closely related to [Li2017Is, Yu2017Second, Engin2017DeepKSPD]. We make it clear that the proposed convolutional feature aggregation method is composed of generation and transformation processes. Compared with [Li2017Is], our method utilizes the kernel matrix as the representation instead of the second-order statistic covariance matrix, characterizing complex nonlinear variation information of features. In addition, unlike [Li2017Is], our aggregation method contains a learnable transformation process, making the SPD representation more compact and robust. The generated SPD matrix in our method is more powerful than the covariance matrix in [Yu2017Second], avoiding some drawbacks of PSD matrices. In addition, instead of a transformation from a matrix to a vector, the vectorization operation in our work takes the upper triangle of a matrix, since there are already transformation operations between the SPD matrices. Compared to [Engin2017DeepKSPD], our SPD representation can be more compact and robust through the transformation process.
III SPD Aggregation Method
Our model aims to aggregate convolutional features into a powerful SPD representation in an end-to-end fashion. To this end, we design three novel layers including a kernel aggregation layer, an SPD matrix transformation layer and a vectorization layer. Our SPD aggregation can be applied to the visual classification. Specifically, the convolutional features pass through the proposed three layers followed by an FC layer and a loss function. The intermediate generated SPD matrix can be treated as a mid-level representation which is a connection between convolutional features and high-level features. The architecture of our network is illustrated in Fig. 1(c).
III-A Preprocessing of Convolutional Features
A CNN model trained on a large dataset such as ImageNet has a strong general representation ability. We would like to fuse the convolutional features of the last convolutional layer and adjust their dimension for different tasks. We therefore introduce an additional convolutional layer between the last convolutional layer of the off-the-shelf model and the kernel aggregation layer, to make the processed convolutional features better adapted to the SPD matrix representation. A ReLU layer follows this convolutional layer to increase the nonlinear ability.
III-B Kernel Aggregation Layer
We present the kernel aggregation layer to aggregate convolutional features into an initial SPD matrix. Let $X \in \mathbb{R}^{C \times H \times W}$ be the three-dimensional convolutional features, where $C$ is the number of channels, i.e., the number of feature maps, and $H$ and $W$ are the height and width of each feature map, respectively. Let $x_k \in \mathbb{R}^{C}$ denote the $k$-th local feature; there are $N$ local features in total, where $N = H \times W$. $f_i$ is the $i$-th feature map, which we treat as a vector in $\mathbb{R}^{N}$.
Although several approaches have applied the covariance matrix as a generic feature representation and obtained promising results, two issues remain to be addressed. First, the rank of the covariance matrix of $N$ local features of dimension $C$ is at most $\min(C, N-1)$, so the covariance matrix is singular whenever the dimension of the local features is larger than the number of local features extracted from an image region. Second, for a generic representation, the capability of modeling nonlinear feature relationships is essential, whereas the covariance matrix only evaluates the linear correlation of features.
To address these issues, we adopt the nonlinear kernel matrix as a generic feature representation to aggregate deep convolutional features. In particular, we take advantage of the Riemannian structure of SPD matrices to describe the second-order statistics and nonlinear correlations among deep convolutional features. The nonlinear kernel matrix is capable of modeling nonlinear feature relationships and is guaranteed to be nonsingular. Different from traditional kernel-based methods, whose entries evaluate the similarity between a pair of samples, we apply the kernel mapping to each feature map $f_i$ (one channel of the convolutional features, flattened into a vector) rather than to each sample $x_k$ (a local feature). Mercer kernels are usually employed to carry out the mapping implicitly. A Mercer kernel is a function that generates a kernel matrix from the pairwise inner products between mapped convolutional features for all the input data points. The $(i,j)$-th element $K_{ij}$ of our nonlinear kernel matrix $K \in \mathbb{R}^{C \times C}$, with $C$ the number of feature maps, can be defined as
$$K_{ij} = \kappa(f_i, f_j) = \langle \phi(f_i), \phi(f_j) \rangle, \tag{1}$$
where $\phi(\cdot)$ is an implicit mapping. In this paper, we exploit the Radial Basis Function (RBF) kernel function, expressed as
$$K_{ij} = \exp\left(-\frac{\|f_i - f_j\|_2^2}{2\sigma^2}\right), \tag{2}$$
where $\sigma$ is a positive constant set to the mean Euclidean distance between all feature maps. Eq. (2) captures the nonlinear relationship between convolutional features.
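We now state an important theorem for the kernel aggregation operation. By Theorem 1, the kernel matrix of the RBF kernel function is guaranteed to be positive definite regardless of the number of feature maps $C$ and the number of local features $N$.

Theorem 1. Let $\{f_1, \dots, f_C\}$ denote a set of distinct points with $f_i \in \mathbb{R}^{N}$. Then the kernel matrix $K \in \mathbb{R}^{C \times C}$ of the RBF kernel function on this set, whose $(p,q)$-th element is $K_{pq} = \exp\left(-\|f_p - f_q\|_2^2 / (2\sigma^2)\right)$ with $\sigma > 0$, is guaranteed to be a positive definite matrix.

Proof. The Fourier transform representation of the RBF kernel function is
$$\exp\left(-\frac{\|f_p - f_q\|_2^2}{2\sigma^2}\right) = \int_{\mathbb{R}^{N}} \hat{\kappa}(\omega)\, e^{\mathrm{i}\,\omega^{\top}(f_p - f_q)}\, d\omega, \qquad \hat{\kappa}(\omega) = (2\pi)^{-N/2}\sigma^{N} \exp\left(-\frac{\sigma^2\|\omega\|_2^2}{2}\right), \tag{3}$$
where $\mathrm{i}$ is the imaginary unit. Then we calculate the quadratic form of the kernel matrix $K$. Let $c = (c_1, \dots, c_C)^{\top} \in \mathbb{R}^{C}$ denote an arbitrary nonzero vector. The quadratic form is
$$c^{\top} K c = \sum_{p=1}^{C}\sum_{q=1}^{C} c_p c_q K_{pq} = \int_{\mathbb{R}^{N}} \hat{\kappa}(\omega) \left| \sum_{p=1}^{C} c_p\, e^{\mathrm{i}\,\omega^{\top} f_p} \right|^2 d\omega, \tag{4}$$
where $^{\top}$ is the transpose operation. Because $\hat{\kappa}(\omega)$ is a positive and continuous function, the quadratic form satisfies $c^{\top} K c = 0$ only on the condition that
$$\sum_{p=1}^{C} c_p\, e^{\mathrm{i}\,\omega^{\top} f_p} = 0 \quad \text{for all } \omega \in \mathbb{R}^{N}. \tag{5}$$
However, the complex exponentials $e^{\mathrm{i}\,\omega^{\top} f_1}, \dots, e^{\mathrm{i}\,\omega^{\top} f_C}$ are linearly independent for distinct points, so Eq. (5) cannot hold for a nonzero $c$. Accordingly, $c^{\top} K c > 0$ and the kernel matrix $K$ is a positive definite matrix. ∎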
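The positive definiteness claimed by the theorem can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: it assumes the common RBF form $\exp(-\|f_p - f_q\|^2 / (2\sigma^2))$, and the sizes `C`, `N`, the bandwidth `sigma=3.0` and the helper names are hypothetical choices made for illustration. It deliberately uses more feature maps than spatial positions, the regime in which a covariance matrix would be singular.

```python
import numpy as np


def rbf_kernel_matrix(F, sigma):
    """RBF kernel matrix over the rows of F (one flattened feature map per row)."""
    sq_norms = np.sum(F * F, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (F @ F.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negative round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def min_eigenvalue(S):
    """Smallest eigenvalue of a symmetric matrix."""
    return float(np.linalg.eigvalsh(S).min())


# Hypothetical sizes: more feature maps (C) than spatial positions (N).
rng = np.random.default_rng(0)
C, N = 16, 9
F = rng.standard_normal((C, N))

K = rbf_kernel_matrix(F, sigma=3.0)
print("smallest eigenvalue of K:", min_eigenvalue(K))
```

A strictly positive smallest eigenvalue is what the theorem predicts for distinct feature maps; in floating point the margin shrinks as the bandwidth grows relative to the pairwise distances.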
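In this work, the kernel matrix $K$ is the generated SPD matrix used as the mid-level image representation. Any SPD manifold optimization can be applied to it directly, without destroying its structure. A toy example of the kernel aggregation is illustrated in Fig. 2. As is well known, the kernel aggregation layer should be differentiable to meet the requirement of an end-to-end deep learning framework. Clearly, Eq. (2) is differentiable with respect to the input convolutional features. Denoting by $L$ the loss function, the gradient with respect to the kernel matrix is $\frac{\partial L}{\partial K}$, and $K_{ij}$ is an element of $K$. For the RBF kernel $K_{ij} = \exp\left(-\|f_i - f_j\|_2^2 / (2\sigma^2)\right)$ of Eq. (2), with the bandwidth $\sigma$ held constant, we compute the partial derivatives of $L$ through $K_{ij}$ with respect to the feature maps $f_i$ and $f_j$, which are
$$\frac{\partial L}{\partial K_{ij}} \frac{\partial K_{ij}}{\partial f_i} = -\frac{\partial L}{\partial K_{ij}}\, K_{ij}\, \frac{f_i - f_j}{\sigma^2}, \qquad \frac{\partial L}{\partial K_{ij}} \frac{\partial K_{ij}}{\partial f_j} = \frac{\partial L}{\partial K_{ij}}\, K_{ij}\, \frac{f_i - f_j}{\sigma^2}. \tag{6}$$
Summing these contributions over all entries of $K$ gives the full gradient with respect to each feature map. In this process, the gradient of the SPD matrix can flow back to the convolutional features.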
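To make the backward pass concrete, here is a small loop-based NumPy sketch with a finite-difference check. It is an illustration under stated assumptions rather than the authors' code: the kernel is taken as $\exp(-\|f_i - f_j\|^2/(2\sigma^2))$, `sigma` is treated as a fixed constant (not recomputed from the features), and the function names, sizes and the stand-in upstream gradient `G` are hypothetical.

```python
import numpy as np


def rbf_forward(F, sigma):
    """Kernel matrix over the rows of F, one RBF evaluation per pair."""
    C = F.shape[0]
    K = np.empty((C, C))
    for i in range(C):
        for j in range(C):
            d = F[i] - F[j]
            K[i, j] = np.exp(-(d @ d) / (2.0 * sigma ** 2))
    return K


def rbf_backward(F, K, dL_dK, sigma):
    """Accumulate the per-entry contributions into the gradient w.r.t. F."""
    C = F.shape[0]
    dL_dF = np.zeros_like(F)
    for i in range(C):
        for j in range(C):
            # K[i, j] depends on F[i] and F[j] with opposite-signed derivatives.
            g = dL_dK[i, j] * K[i, j] * (F[i] - F[j]) / sigma ** 2
            dL_dF[i] -= g
            dL_dF[j] += g
    return dL_dF


def numerical_grad(F, G, sigma, eps=1e-6):
    """Central differences of the scalar loss sum(G * K(F))."""
    num = np.zeros_like(F)
    for idx in np.ndindex(*F.shape):
        Fp, Fm = F.copy(), F.copy()
        Fp[idx] += eps
        Fm[idx] -= eps
        num[idx] = (np.sum(G * rbf_forward(Fp, sigma))
                    - np.sum(G * rbf_forward(Fm, sigma))) / (2 * eps)
    return num


rng = np.random.default_rng(0)
F = rng.standard_normal((4, 5))   # hypothetical: 4 feature maps, 5 positions
G = rng.standard_normal((4, 4))   # stand-in for dL/dK
sigma = 1.5

analytic = rbf_backward(F, rbf_forward(F, sigma), G, sigma)
numeric = numerical_grad(F, G, sigma)
print("max abs difference:", np.abs(analytic - numeric).max())
```

The double loop mirrors the element-wise formulation; its quadratic number of iterations in the channel count is the cost that a matrix formulation avoids.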
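During forward propagation, Eq. (2), and backward propagation, Eq. (6), we have to run $C^2$ loop iterations to compute the kernel matrix $K$ and $C^2$ loop iterations to obtain the gradient with respect to the convolutional features, where $C$ is the number of channels. Obviously, both the forward and backward propagations are computationally demanding. It is well known that computation using matrix operations is preferable because it can be parallelized. Accordingly, our kernel aggregation layer can be calculated in a faster way via matrix operations. Let us reshape the convolutional features into a matrix $M \in \mathbb{R}^{C \times N}$, where $N$ is the number of local features. Each row of $M$ is a reshaped feature map $f_i^{\top}$, and each column of $M$ is a convolutional local feature $x_k$. Note that $\|f_i - f_j\|_2^2$ in Eq. (2) can be expanded to $f_i^{\top} f_i - 2 f_i^{\top} f_j + f_j^{\top} f_j$. Each of the inner products $f_i^{\top} f_i$, $f_i^{\top} f_j$ and $f_j^{\top} f_j$ needs to be calculated $C^2$ times in the loops of Eq. (2). Now, we can convert these $C^2$ inner product operations into matrix multiplications that only need to be computed once,
$$K_1 = (M \circ M)\,\mathbf{1}, \qquad K_2 = \mathbf{1}^{\top} (M \circ M)^{\top}, \qquad K_3 = M M^{\top}, \tag{7}$$
where $\circ$ is the Hadamard product and $\mathbf{1} \in \mathbb{R}^{N \times C}$ is a matrix whose elements are all "1"s. $K_1$, $K_2$ and $K_3$ are all real $C \times C$ matrices. The element $(K_1)_{ij}$ is the squared 2-norm of the $i$-th row vector of $M$, and is equal to $f_i^{\top} f_i$. The element $(K_2)_{ij}$ is the squared 2-norm of the $j$-th column vector of $M^{\top}$, and is equal to $f_j^{\top} f_j$. The element $(K_3)_{ij}$ is equal to $f_i^{\top} f_j$. $K_1$, $K_2$ and $K_3$ can be calculated in advance.
Therefore, we compute $\|f_i - f_j\|_2^2$ in Eq. (2) by matrix addition and multiplication, and apply the $\exp(\cdot)$ function to the whole matrix in parallel instead of calculating each element in a loop. The kernel matrix $K$ can then be calculated by matrix operations as follows,
$$K = \exp\left(-\frac{K_1 + K_2 - 2K_3}{2\sigma^2}\right), \tag{8}$$
where $\exp(\cdot)$ means the exponential operation applied to each element of the matrix. Although calculating the kernel function directly is time-consuming, it can be computed efficiently in matrix form through Eq. (8), which is faster than through Eq. (2). Similarly, the backpropagation process in Eq. (6) can also be carried out with matrix operations, which is given by
$$\frac{\partial L}{\partial M} = \frac{1}{\sigma^2}\left(S - \operatorname{diag}(S\,\mathbf{1}_C)\right) M, \qquad S = \frac{\partial L}{\partial K} \circ K + \left(\frac{\partial L}{\partial K} \circ K\right)^{\top}, \tag{9}$$
where $\mathbf{1}_C$ is the all-ones vector of length $C$ and $\operatorname{diag}(\cdot)$ places a vector on the diagonal of a matrix.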
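The matrix formulation can be sketched and checked against the loop formulation as follows. This is a hedged NumPy illustration, not the paper's Caffe layer: it assumes the kernel $\exp(-\|f_i - f_j\|^2/(2\sigma^2))$ with a constant `sigma`, and the backward expression used in `kernel_backward_matrix` is our own derivation from that kernel, verified here only by finite differences. All function names and sizes are hypothetical.

```python
import numpy as np


def kernel_forward_loops(M, sigma):
    """Reference: one RBF evaluation per pair of feature maps (rows of M)."""
    C = M.shape[0]
    K = np.empty((C, C))
    for i in range(C):
        for j in range(C):
            d = M[i] - M[j]
            K[i, j] = np.exp(-(d @ d) / (2.0 * sigma ** 2))
    return K


def kernel_forward_matrix(M, sigma):
    """Same kernel matrix built from three matrix products."""
    C, N = M.shape
    ones = np.ones((N, C))
    K1 = (M * M) @ ones   # K1[i, j] = f_i . f_i
    K2 = K1.T             # K2[i, j] = f_j . f_j
    K3 = M @ M.T          # K3[i, j] = f_i . f_j
    return np.exp(-(K1 + K2 - 2.0 * K3) / (2.0 * sigma ** 2))


def kernel_backward_matrix(M, K, dL_dK, sigma):
    """Gradient of the loss w.r.t. M, with sigma treated as a constant."""
    A = dL_dK * K
    S = A + A.T
    return (S @ M - S.sum(axis=1, keepdims=True) * M) / sigma ** 2


def finite_difference_grad(M, G, sigma, eps=1e-6):
    """Central differences of the scalar loss sum(G * K(M))."""
    grad = np.zeros_like(M)
    for idx in np.ndindex(*M.shape):
        Mp, Mm = M.copy(), M.copy()
        Mp[idx] += eps
        Mm[idx] -= eps
        grad[idx] = (np.sum(G * kernel_forward_matrix(Mp, sigma))
                     - np.sum(G * kernel_forward_matrix(Mm, sigma))) / (2 * eps)
    return grad


rng = np.random.default_rng(1)
M = rng.standard_normal((5, 7))   # hypothetical C = 5, N = 7
G = rng.standard_normal((5, 5))   # stand-in for dL/dK
sigma = 2.0

K_loop = kernel_forward_loops(M, sigma)
K_mat = kernel_forward_matrix(M, sigma)
grad_mat = kernel_backward_matrix(M, K_mat, G, sigma)
grad_fd = finite_difference_grad(M, G, sigma)
```

In an actual layer the bandwidth would be computed once per input (for example from the mean pairwise distance, as the text describes) and then held fixed during the backward pass, which is the assumption this sketch makes.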
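Remark: The covariance matrix descriptor, as a special case of SPD matrices, captures feature correlations compactly in an object region, and has therefore proven effective for many applications. Given the local features $x_1, \dots, x_N \in \mathbb{R}^{C}$, the covariance descriptor is defined as
$$\Sigma = \frac{1}{N-1} \sum_{k=1}^{N} (x_k - \mu)(x_k - \mu)^{\top}, \tag{10}$$
where $\mu = \frac{1}{N}\sum_{k=1}^{N} x_k$ is the mean vector. The covariance matrix can also be seen as a kernel matrix, where the $(i,j)$-th element of the covariance matrix can be expressed as
$$\Sigma_{ij} = \left\langle \frac{f_i - \mu_i}{\sqrt{N-1}}, \frac{f_j - \mu_j}{\sqrt{N-1}} \right\rangle, \tag{11}$$
where $\langle \cdot, \cdot \rangle$ denotes the inner product, $f_i \in \mathbb{R}^{N}$ is the $i$-th feature map, and $\mu_i$ is the mean value of $f_i$ (subtracted from every entry). Therefore, the covariance matrix corresponds to a special case of the kernel matrix defined in Eq. (1), where $\phi(f_i) = (f_i - \mu_i)/\sqrt{N-1}$. In this way, we can see that covariance matrices contain only the simple linear correlation of features. Whether the covariance matrix is a positive definite matrix depends on $C$ and $N$; it can be positive definite only if $C \le N - 1$.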
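The remark can be illustrated with a short NumPy sketch that computes the covariance descriptor in two ways, once from the local features and once as a linear kernel between centered feature maps, and then shows the rank deficiency when there are more channels than local features. The `1/(N-1)` normalization, the function names and the sizes are assumptions made for this example.

```python
import numpy as np


def covariance_from_local_features(M):
    """Sum of outer products of centered local features (columns of M)."""
    C, N = M.shape
    mu = M.mean(axis=1)
    cov = np.zeros((C, C))
    for k in range(N):
        d = M[:, k] - mu
        cov += np.outer(d, d)
    return cov / (N - 1)


def covariance_as_linear_kernel(M):
    """Inner products between centered, rescaled feature maps (rows of M)."""
    C, N = M.shape
    phi = (M - M.mean(axis=1, keepdims=True)) / np.sqrt(N - 1)
    return phi @ phi.T


rng = np.random.default_rng(2)

# Enough local features: the two constructions agree with np.cov.
M_tall = rng.standard_normal((6, 20))      # hypothetical C = 6, N = 20
cov_a = covariance_from_local_features(M_tall)
cov_b = covariance_as_linear_kernel(M_tall)

# More channels than local features: the covariance is rank deficient.
M_wide = rng.standard_normal((10, 6))      # hypothetical C = 10, N = 6
rank_wide = np.linalg.matrix_rank(covariance_as_linear_kernel(M_wide))
print("rank with C=10, N=6:", rank_wide)
```

Because the mapped features here are only a shifted and scaled copy of the feature maps, the resulting matrix can encode linear correlation only, which is the limitation the nonlinear kernel is meant to remove.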
III-C SPD Matrix Transformation Layer
As discussed in [Dong2017Deep, Huang2016A, Zhang2017Deep], SPD matrix transformation networks are capable of achieving better performance than the original SPD matrix. Inspired by [Yu2016Weakly] and [Yu2017Second], we add a learnable layer to make the network more flexible and more adaptive to the specific task. Based on the SPD matrix generated by the kernel aggregation layer, we expect to transform the existing SPD representation into a more discriminative, suitable and desirable matrix. To preserve the powerful ability of the SPD matrix, the transformed matrix should also be an SPD matrix. Moreover, we attempt to adjust the dimension to make the SPD matrix more flexible and compact. To this end, we design the SPD matrix transformation layer in our network.
Let us define the Riemannian manifold of $C \times C$ SPD matrices as $\mathcal{S}^{C}_{++}$. The output SPD matrix $K$ of the kernel aggregation layer lies on the manifold $\mathcal{S}^{C}_{++}$. We use a matrix mapping to complete the transformation operation. As depicted in Fig. 3, we map the input SPD matrix, which lies on the original manifold $\mathcal{S}^{C}_{++}$, to a new discriminative and compact SPD matrix on another manifold $\mathcal{S}^{C'}_{++}$, where $C'$ is the dimension of the SPD matrix transformation layer. In this way, the desired transformed matrix can be obtained by a learnable mapping. Given an SPD matrix $K \in \mathbb{R}^{C \times C}$ as input, the output SPD matrix can be calculated as
$$Y = W^{\top} K W, \tag{12}$$
where $Y \in \mathbb{R}^{C' \times C'}$ is the output of the transformation layer, and $W \in \mathbb{R}^{C \times C'}$ are learnable parameters which are randomly initialized at the start of training. $C'$ controls the size of $Y$. By Theorem 2, the learnable parameter matrix $W$ should have full column rank so that $Y$ is an SPD matrix as well.
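Theorem 2. Let $K \in \mathbb{R}^{C \times C}$ denote an SPD matrix, and $Y = W^{\top} K W$, where $W \in \mathbb{R}^{C \times C'}$. $Y$ is an SPD matrix if and only if $W$ is a column full rank matrix, i.e., $\operatorname{rank}(W) = C'$.

Proof. Suppose $K$ is an SPD matrix and $W$ is a column full rank matrix with $\operatorname{rank}(W) = C'$. The homogeneous equations $Wx = \mathbf{0}$ with $x \in \mathbb{R}^{C'}$ then have only the zero solution, where $\mathbf{0}$ is the zero vector. Hence, for an arbitrary nonzero vector $x$, $Wx \neq \mathbf{0}$. We calculate the quadratic form of $Y$,
$$x^{\top} Y x = x^{\top} W^{\top} K W x = (Wx)^{\top} K (Wx). \tag{13}$$
Because $Wx \neq \mathbf{0}$ and $K$ is an SPD matrix, $x^{\top} Y x > 0$. This proves that $Y$ is an SPD matrix.
On the other hand, if $Y$ is an SPD matrix, then for an arbitrary nonzero vector $x$, $x^{\top} Y x = (Wx)^{\top} K (Wx) > 0$. Because $K$ is an SPD matrix, this is possible only if $Wx \neq \mathbf{0}$. Accordingly, $Wx = \mathbf{0}$ has only the zero solution, so $\operatorname{rank}(W) = C'$ and $W$ is a column full rank matrix. ∎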
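Both directions of the theorem can be observed numerically. The sketch below is a NumPy illustration under assumed sizes: it builds a random SPD input, applies the congruence transform once with a full column rank parameter matrix and once with a rank-deficient one (one column made a linear combination of two others), and compares the smallest eigenvalues. The helper names are hypothetical.

```python
import numpy as np


def random_spd(C, rng):
    """A random symmetric positive definite matrix (A A^T plus a ridge)."""
    A = rng.standard_normal((C, C))
    return A @ A.T + C * np.eye(C)


def spd_transform(K, W):
    """Congruence transform Y = W^T K W."""
    return W.T @ K @ W


rng = np.random.default_rng(4)
C, C_out = 8, 4                          # hypothetical layer sizes
K = random_spd(C, rng)

W_full = rng.standard_normal((C, C_out)) # full column rank with probability 1
W_def = W_full.copy()
W_def[:, 3] = W_def[:, 0] + W_def[:, 1]  # force rank C_out - 1

Y_full = spd_transform(K, W_full)
Y_def = spd_transform(K, W_def)

min_eig_full = np.linalg.eigvalsh(Y_full).min()
min_eig_def = np.linalg.eigvalsh(Y_def).min()
print("full rank:", min_eig_full, " rank deficient:", min_eig_def)
```

With the rank-deficient parameters the output is only positive semidefinite, which is why the layer needs a constraint that keeps the columns of the parameter matrix linearly independent.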
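Since there are learnable parameters in the SPD matrix transformation layer, we should compute not only the gradient of the loss function $L$ with respect to the input $K$, but also the gradient with respect to the parameters $W$. For $Y = W^{\top} K W$, the gradient with respect to the input $K$ is
$$\frac{\partial L}{\partial K} = W \frac{\partial L}{\partial Y} W^{\top}, \tag{14}$$
where $\frac{\partial L}{\partial Y}$ is the gradient with respect to the output $Y$.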
Since the $C \times C'$ parameter matrix $W$ is a column full rank matrix, it lies on a non-compact Stiefel manifold [Absil2009Optimization]. However, directly optimizing $W$ on the non-compact Stiefel manifold is infeasible. To overcome this issue, we constrain $W$ to be semi-orthogonal, i.e., $W^{\top} W = I$. In this case, $W$ lies on the orthogonal Stiefel manifold $St(C', C)$. The optimization space of the parameters $W$ thus changes from the non-compact Stiefel manifold to the orthogonal Stiefel manifold $St(C', C)$. Considering the manifold structure of $W$, the optimization process is quite different from the gradient descent method in the Euclidean space. We first compute the partial derivative of the loss with respect to $W$. Then we convert the partial derivative to the manifold gradient that lies on the tangent space. Along the tangent gradient, we find a new point on the tangent space. Finally, the retraction operation is applied to map the new point on the tangent space back to the orthogonal Stiefel manifold. Thus, one iteration of the optimization process on the manifold is completed. This process is illustrated in Fig. 4. Next we elaborate on each step.
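First, the partial derivative of the loss $L$ with respect to $W$ for $Y = W^{\top} K W$ is computed by
$$\frac{\partial L}{\partial W} = K W \left(\frac{\partial L}{\partial Y}\right)^{\top} + K^{\top} W \frac{\partial L}{\partial Y}. \tag{15}$$
The partial derivative $\frac{\partial L}{\partial W}$ does not contain any manifold constraints. Considering that $W$ is a point on the orthogonal Stiefel manifold, the partial derivative needs to be converted to the manifold gradient, which lies on the tangent space. As shown in Fig. 5, on the orthogonal Stiefel manifold, the partial derivative $\frac{\partial L}{\partial W}$ is a Euclidean gradient at the point $W$, not tangent to the manifold. The tangential component of $\frac{\partial L}{\partial W}$ is what we need for optimization, and it lies on the tangent space. The normal component is perpendicular to the tangent space. We decompose $\frac{\partial L}{\partial W}$ into two components that are perpendicular to each other, i.e., one tangent to the manifold and the other normal to it, based on Theorem 3.

Theorem 3. Let $St(C', C)$ denote an orthogonal Stiefel manifold and let $W$ be a point on $St(C', C)$. Let $L(W)$ denote a function defined on the orthogonal Stiefel manifold. If the partial derivative of $L$ with respect to $W$ is $\frac{\partial L}{\partial W}$, the manifold gradient at $W$ which is tangent to $St(C', C)$ is $\nabla_W L = \frac{\partial L}{\partial W} - W \left(\frac{\partial L}{\partial W}\right)^{\top} W$.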
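Proof. Because $W$ is a point on the orthogonal Stiefel manifold, $W^{\top} W = I$, where $I$ is an identity matrix. Differentiating $W^{\top} W = I$ yields $W^{\top} \Delta + \Delta^{\top} W = 0$, where $\Delta$ is a tangent vector. Thus, $W^{\top} \Delta$ is a skew-symmetric matrix. Note that the canonical metric for the orthogonal Stiefel manifold at the point $W$ is $g_c(\Delta_1, \Delta_2) = \operatorname{tr}\left(\Delta_1^{\top} \left(I - \frac{1}{2} W W^{\top}\right) \Delta_2\right)$. For all tangent vectors $\Delta$ at $W$, we can get that
$$\operatorname{tr}\left(\left(\frac{\partial L}{\partial W}\right)^{\top} \Delta\right) = g_c(\nabla_W L, \Delta) = \operatorname{tr}\left((\nabla_W L)^{\top} \left(I - \frac{1}{2} W W^{\top}\right) \Delta\right), \tag{16}$$
where $\nabla_W L$ denotes the manifold gradient and $\frac{\partial L}{\partial W}$ the Euclidean partial derivative. Because $W^{\top} \nabla_W L$ is a skew-symmetric matrix, Eq. (16) can be solved, i.e., $\nabla_W L = \frac{\partial L}{\partial W} - W \left(\frac{\partial L}{\partial W}\right)^{\top} W$. ∎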
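Then the tangential component at $W$ can be expressed in terms of the partial derivative $\frac{\partial L}{\partial W}$,
$$\nabla_W L = \frac{\partial L}{\partial W} - W \left(\frac{\partial L}{\partial W}\right)^{\top} W.$$
$\nabla_W L$ is the manifold gradient on the orthogonal Stiefel manifold. Searching along the gradient direction gives a new point on the tangent space. Finally, we use the retraction operation to map the point on the tangent space back to the Stiefel manifold,
$$W := \Gamma\left(W - \lambda \nabla_W L\right),$$
where $\Gamma(\cdot)$ is the retraction operation mapping the data back to the manifold. Specifically, $\Gamma(A)$ denotes the $Q$ matrix of the QR decomposition of $A$, i.e., $A = QR$, where $Q \in \mathbb{R}^{C \times C'}$ is a semi-orthogonal matrix and $R \in \mathbb{R}^{C' \times C'}$ is an upper triangular matrix. $\lambda$ is the learning rate.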
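The three steps of one update (Euclidean gradient, projection onto the tangent space, QR retraction) can be sketched in NumPy as below. This is an illustrative sketch under assumptions, not the authors' Caffe code: the transformation is taken as $Y = W^\top K W$, the tangent projection uses the canonical-metric form $G - W G^\top W$, and the sign correction inside `qr_retraction`, the sizes and the learning rate are our own hypothetical choices.

```python
import numpy as np


def euclidean_grad_W(K, W, dL_dY):
    """Euclidean gradient of the loss w.r.t. W for Y = W.T @ K @ W."""
    return K @ W @ dL_dY.T + K.T @ W @ dL_dY


def tangent_grad(W, G):
    """Project a Euclidean gradient G onto the tangent space at W."""
    return G - W @ G.T @ W


def qr_retraction(A):
    """Map A back to the manifold by keeping the Q factor of its QR decomposition."""
    Q, R = np.linalg.qr(A)
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs  # sign fix makes the factorization unique


def stiefel_sgd_step(K, W, dL_dY, lr):
    """One update: Euclidean gradient, tangent projection, retraction."""
    G = euclidean_grad_W(K, W, dL_dY)
    return qr_retraction(W - lr * tangent_grad(W, G))


rng = np.random.default_rng(3)
C, C_out = 8, 3                                 # hypothetical layer sizes
A = rng.standard_normal((C, C))
K = A @ A.T + C * np.eye(C)                     # a random SPD input
W0 = qr_retraction(rng.standard_normal((C, C_out)))
dL_dY = rng.standard_normal((C_out, C_out))
dL_dY = 0.5 * (dL_dY + dL_dY.T)                 # symmetric stand-in for dL/dY

grad_t = tangent_grad(W0, euclidean_grad_W(K, W0, dL_dY))
W1 = stiefel_sgd_step(K, W0, dL_dY, lr=1e-2)
print("orthonormality error:", np.abs(W1.T @ W1 - np.eye(C_out)).max())
```

After the step the columns of the updated parameters remain orthonormal, so the transformed matrix stays SPD for any SPD input; a plain Euclidean update would not preserve this.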
Note that a ReLU activation layer can follow the SPD matrix transformation layer. The output of the ReLU layer is still an SPD matrix, based on Theorem 4.
The relu activation function on a matrix is . Let ,
If is an SPD matrix, is an SPD matrix.
The detailed proof of this theorem is shown in the appendix section of [Dong2017Deep]. ∎
III-D Vectorization Layer
Since the inputs of common classifiers are all vectors, we should vectorize the SPD matrix into a vector. Because of the symmetry of the SPD matrix $Y \in \mathbb{R}^{C' \times C'}$ produced by the transformation layer, $Y$ is determined by $C'(C'+1)/2$ elements, i.e., the upper triangular part or the lower triangular part of $Y$. Here, we take the upper triangular part of $Y$ and reshape it into a vector as the input of the loss function. Let us denote the vector by $V$,
$$V = \left[Y_{11}, Y_{12}, \dots, Y_{1C'}, Y_{22}, Y_{23}, \dots, Y_{C'C'}\right]^{\top}.$$
Due to the symmetry of the matrix , the gradient is also a symmetric matrix. For the diagonal elements of , its gradient of the loss function is equal to the gradient of its corresponding element in the vector , while the gradient of non-diagonal elements of is times of the element in the vector . The gradient with respect to is given by
The normalization operation is important as well. We apply the power normalization followed by a normalization operation to the resulting vector. The gradient formulations in Eq. (9), Eq. (14) and Eq. (18) give the gradient with respect to the input of the corresponding layer, respectively. Once these gradients are obtained, the standard SGD backpropagation can be easily employed to update the parameters directly with the learning rate. Fig. 6 shows the data flow in our network with the proposed three layers, including forward and backward propagations. denotes the output of the last fully-connected layer. In Algorithm 1, we summarize the training process of our model. We can use more than one SPD transformation layer in the network, where each one can be followed by a ReLU layer as the activation layer.
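The forward pass of the vectorization layer can be sketched as follows. This NumPy example is a hedged illustration: the upper triangle is read in row-major order, and the normalization is assumed to be a signed square root (power normalization) followed by L2 normalization, which is a common pairing but is our assumption here. The function names and the small example matrix are hypothetical, and the backward pass is not shown.

```python
import numpy as np


def vectorize_upper(Y):
    """Row-major upper triangle (diagonal included) of a symmetric matrix."""
    rows, cols = np.triu_indices(Y.shape[0])
    return Y[rows, cols]


def power_l2_normalize(v, eps=1e-12):
    """Signed square root followed by L2 normalization (an assumed choice)."""
    v = np.sign(v) * np.sqrt(np.abs(v))
    return v / (np.linalg.norm(v) + eps)


Y = np.array([[2.0, 0.5, -0.3],
              [0.5, 1.5, 0.2],
              [-0.3, 0.2, 1.0]])   # small hypothetical SPD matrix
V = vectorize_upper(Y)
V_norm = power_l2_normalize(V)
print(V)
```

For a matrix of size $d \times d$ the vector has $d(d+1)/2$ entries, so reducing the output size of the transformation layer shrinks the classifier input quadratically.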
IV Experiments
To demonstrate the benefits of our method, we conduct extensive experiments on visual classification tasks, evaluating the performance of the SPD aggregation framework including the generation and transformation processes. We report results on five datasets covering challenging texture and fine-grained classification tasks. Texture classification needs a powerful global representation, because texture features should be invariant to translation, scaling and rotation. Differences among fine-grained images are very small, and it is challenging to represent these differences in the aggregation process.
[Table: only row labels survive; the numeric entries were lost. Rows: "conv + B-CNN [Lin2016Bilinear]"; "512 conv + VGG-16 [Simonyan2014Very]"; three rows labeled "+ Kernel Aggregation Layer".]
IV-A Datasets and Evaluation Protocols
We choose three texture datasets in the experiments. They are the Describable Textures Dataset (DTD) [Cimpoi2014Describing], the Flickr Material Database (FMD) [Sharan2013Recognizing] and KTH-TIPS-2b (KTH-2b) [Caputo2005Class]. DTD and FMD are both collected in the wild, while KTH-2b is collected under laboratory conditions. DTD has classes, and each class contains images. There are totally images in DTD. FMD contains images of classes, each class has images. KTH-2b contains images of classes. Fig. 7 illustrates the texture datasets for our experiments. For these texture datasets, we follow the standard train-test protocol. We divide DTD and FMD into three subsets randomly, and use two subsets for training and the remaining subset for testing. Images of KTH-2b are split into four samples. We train the framework using one sample and test on the remaining three samples. Inspired by [Krizhevsky2012ImageNet], the texture images are augmented. We do times augmentation to the training data, including randomly cropping times and picking from the center and four corners. The test images are only picked from the center and four corners. The cropped images are resized to .
We report results on the birds and aircraft fine-grained recognition datasets. The birds dataset [Wah2011The] is CUB-200-2011, which contains classes and images in total. The FGVC-aircraft dataset [Maji2013Fine] contains aircraft images of classes. Fig. 8 illustrates some fine-grained images. We train and test on the birds and aircraft fine-grained datasets using their provided training and test splits. Data augmentation is not applied to the fine-grained images. We resize them to the size of . All the texture and fine-grained images are normalized by subtracting the means of the RGB channels.
Iv-B Implementation Details
The basic convolutional and pooling layers before our SPD aggregation are taken from the VGG-16 model pre-trained on the ImageNet dataset. We remove the layers after the conv5-3 layer of the VGG-16 model and insert our SPD aggregation method into the network after the conv5-3 layer. Finally, an FC layer and a softmax layer follow the vectorization layer, where the output dimension of the FC layer is equal to the number of classes. All our networks run under the Caffe framework. We use SGD with a mini-batch size of. The training process is divided into two stages. In the first stage, we fix the parameters of the layers before the SPD aggregation method and train only the new layers. The learning rate is started from and reduced by when error plateaus. In the second stage, we train the whole network. The learning rate is started from and divided by when error plateaus.
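The two-stage procedure with a plateau-based learning rate can be sketched in plain Python as follows. This is a hedged illustration rather than the Caffe solver configuration used in the experiments: `PlateauSchedule`, `trainable_groups`, the group names and the `patience` parameter are all hypothetical, and the initial learning rate and reduction factor are left as arguments because their values are not given here.

```python
class PlateauSchedule:
    """Divide the learning rate by `factor` when the error stops improving.

    `patience` (an assumption of this sketch) is the number of consecutive
    non-improving evaluations that counts as a plateau.
    """

    def __init__(self, initial_lr, factor, patience):
        self.lr = initial_lr
        self.factor = factor
        self.patience = patience
        self.best_error = float("inf")
        self.stale_steps = 0

    def step(self, error):
        """Record a new error value and return the learning rate to use next."""
        if error < self.best_error:
            self.best_error = error
            self.stale_steps = 0
        else:
            self.stale_steps += 1
            if self.stale_steps >= self.patience:
                self.lr /= self.factor
                self.stale_steps = 0
        return self.lr


def trainable_groups(stage):
    """Stage 1 trains only the new layers; stage 2 trains the whole network."""
    if stage == 1:
        return ["new_layers"]
    if stage == 2:
        return ["pretrained_layers", "new_layers"]
    raise ValueError("stage must be 1 or 2")
```

Each stage would use its own `PlateauSchedule` instance, since the text specifies a separate starting learning rate for each stage.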
IV-C Experiments for the SPD Aggregation Framework
In this section, we compare the SPD aggregation framework with several state-of-the-art convolutional feature aggregation methods. First, a convolutional layer whose number of channels is follows the conv5-3 convolutional layer. Then our SPD aggregation method, which consists of a kernel aggregation layer, an SPD matrix transformation layer and a vectorization layer, is inserted after this convolutional layer. The output size of the SPD matrix transformation layer is . Because these datasets are not large, we use only one SPD matrix transformation layer to avoid overfitting. Table I shows the comparison on the texture datasets and the fine-grained datasets.
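To make the data flow through the three layers concrete, the following NumPy sketch shows one possible forward pass. It rests on assumptions that are not stated in this section: the kernel aggregation is taken to be a Gaussian RBF kernel computed between channels, the transformation is taken to be a bilinear mapping with a column-orthonormal weight matrix, and the vectorization keeps the upper triangle with off-diagonal entries scaled by the square root of two. The function names and the bandwidth `sigma` are hypothetical.

```python
import numpy as np


def kernel_aggregation(features, sigma):
    """Map a (C, N) feature array to a (C, C) Gaussian kernel matrix.

    Each of the C rows is one channel flattened over N spatial positions.
    """
    sq_norms = np.sum(features ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against small negative round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def spd_transform(spd, weight):
    """Bilinear mapping W^T X W with a (C, D) column-orthonormal weight."""
    return weight.T @ spd @ weight


def vectorize(spd):
    """Stack the upper triangle; off-diagonal entries are scaled by sqrt(2)
    so that the Euclidean norm of the vector equals the Frobenius norm."""
    rows, cols = np.triu_indices(spd.shape[0])
    scale = np.where(rows == cols, 1.0, np.sqrt(2.0))
    return spd[rows, cols] * scale
```

Under these assumptions, a Gaussian kernel matrix built from distinct channel vectors is positive definite even when there are more channels than spatial positions, and a full-column-rank bilinear mapping preserves positive definiteness while reducing the matrix size.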
The following methods are evaluated in our experiments: FV-CNN [Cimpoi2015Deep], FV-FC-CNN [Cimpoi2015Deep], B-CNN [Lin2016Bilinear] and Deep-TEN [Zhang2016Deep]. The pure VGG-16 model [Simonyan2014Very] is used as the baseline. FV-CNN aggregates convolutional features from the conv5-3 layer of VGG-16. The dimension of it is and is compressed to by PCA for classification. FV-FC-CNN combines the FC features with the FV vector. B-CNN applies bilinear pooling to the conv5-3 layer of the VGG-16 model. Deep-TEN uses -layers ResNet, a larger number of training samples and a larger image size, while the other methods use the VGG-16 model. A direct comparison between Deep-TEN and the other feature aggregation methods is therefore not strictly fair.
We can see that our method achieves better performance, especially on the KTH-2b and FGVC-aircraft datasets. The average precision of our method on KTH-2b and FGVC-aircraft datasets are and . In contrast, B-CNN achieves and . On the DTD and CUB-200-2011 datasets, our method is slightly worse than B-CNN. A possible reason is that linear relationships among features are dominant on some datasets, while nonlinear relationships are more important on others.
IV-D Experiments for the Components of the Proposed Aggregation Method
convolutional layer. As mentioned above, we employ a convolutional layer to preprocess the convolutional features. In this subsection, we examine whether this preprocessing is necessary for our method. The experiments are summarized in Table II. We combine different numbers of channels of layer with the kernel aggregation layer. We also add the layer to the B-CNN and pure VGG network. , and in the Table indicate whether there is a layer before the kernel aggregation layer and the number of channels of that layer. The kernel aggregation layer in the Table means that the network contains only the kernel aggregation layer, without the SPD matrix transformation layer and the vectorization layer. Table II shows that the layer is beneficial to our nonlinear kernel aggregation method, whereas it is useless or even harmful to B-CNN and the pure VGG-16 model. This suggests that our gains come from the SPD matrix representation rather than from the preprocessing alone, although the preprocessing does further improve the SPD aggregation. A possible reason is that the convolutional features are very different from the kernel matrices but share some similarities with the bilinear matrices and FC features. We also observe that the number of channels of layer has little influence on the texture datasets. But when it is reduced to , the performance on the fine-grained datasets declines. We therefore argue that the convolutional features are redundant for the texture datasets but not for the fine-grained datasets.
SPD Matrix Generation Process. To evaluate the effectiveness of the nonlinear SPD matrix generation process, we build a network that contains only the kernel aggregation layer, without the SPD matrix transformation layer and the vectorization layer. The results are shown in Table II. Without the layer, i.e., + kernel aggregation layer in the Table, our network is comparable with B-CNN and the pure VGG-16 model. When the layer is added to the network, i.e., + Kernel Aggregation Layer in the Table, it performs clearly better than the other methods on the FMD, KTH-2b and FGVC-aircraft datasets.
SPD Matrix Transformation Process. In this subsection, we evaluate the effectiveness of the proposed transformation process. Compared with our full method, the kernel aggregation layer entry in Table II lacks only the SPD matrix transformation layer and the vectorization layer; the rest of the network is the same. Table II shows that the SPD matrix transformation process transforms the SPD matrix into a more suitable and discriminative representation. In particular, on the DTD and FGVC-aircraft datasets, the performance is improved by and respectively.
In this paper, we have proposed a new SPD aggregation method that models convolutional feature aggregation as a nonlinear SPD matrix learning problem on the Riemannian manifold. To achieve this goal, we have designed three new layers that aggregate the convolutional features into an SPD matrix and transform the SPD matrix to be more discriminative and suitable. The three layers are a kernel aggregation layer, an SPD matrix transformation layer and a vectorization layer, all trained within an end-to-end framework. We investigated the component decomposition and retraction of the orthogonal Stiefel manifold to carry out the backpropagation of our model. In addition, a faster matrix operation was adopted to speed up the forward and backward passes. Compared with alternative aggregation strategies such as FV, VLAD and bilinear pooling, our SPD aggregation achieves appealing performance on visual classification tasks. Experiments on challenging datasets have demonstrated that our approach outperforms or is competitive with state-of-the-art methods.