Learning rich and compact representations is an open topic in many fields such as object recognition [Szegedy_2015_CVPR] or image retrieval [Opitz_2017_ICCV, Carvalho_2018_SIGIR]. Recently, representations that compute first-order statistics over input data have been outperformed by improved models that compute higher-order statistics [Perronnin_2010_ECCV, Picard_2011_ICIP, Picard_2016_ICIP, Jacob_2018_ICIP]. This strategy generates richer representations and underlies the state-of-the-art methods on fine-grained visual classification tasks [Lin_2015_ICCV].
However, even if the increase in performance is unquestionable, second-order models suffer from a collection of drawbacks: quadratically increasing dimensionality, costly dimensionality reduction, difficult training, and the lack of a properly adapted pooling scheme.
The two main downsides, namely the high-dimensional output representations and the inefficient pooling scheme, have been widely studied over the last decade. On the one hand, the dimensionality issue has been studied through factorization schemes, either representation oriented [Gao_2016_CVPR, Kim_2017_ICLR] or task oriented [Kong_2017_CVPR]. While these factorization schemes are efficient in terms of computation cost and number of parameters, the intermediate representation is still very large (typically 10k dimensions) and hinders the training process, while using a lower dimension greatly deteriorates performance.
On the other hand, it is well known that global average pooling schemes aggregate unrelated features. This problem has been tackled by the use of codebooks (e.g., VLAD [Arandjelovic_2013_CVPR] and Fisher Vectors [Perronnin_2010_ECCV]) and extended to be end-to-end trainable [Arandjelovic_2016_CVPR, Tang_2016_Arxiv]. However, using a codebook on second-order features leads to an unreasonably large model, since the already large feature has to be duplicated for each entry of the codebook. This is the case for example in MFAFVNet [Li_2017_ICCV_MFAFVNet], for which the second-order layer alone (i.e., without the CNN part) costs as much as an entire ResNet50.
In this paper, we tackle the intermediate representation cost and the lack of proper pooling shortcomings by exploring joint factorization and codebook strategies. Our main results are the following:
We first show that state-of-the-art factorization schemes can be improved by the use of a codebook pooling, albeit at a prohibitive cost.
We then propose our main contribution, a joint codebook and factorization scheme that achieves similar results at a much reduced cost.
Since our approach focuses on representation learning and is task agnostic, we validate it in a retrieval context on several image datasets to show the relevance of the learned representations. We show our model achieves competitive results on these datasets at a very reasonable cost.
The remainder of this paper is organized as follows: in the next section, we present the related work on second-order pooling, factorization schemes and codebook strategies. In section 3, we present our factorization with the codebook strategy and how we improve its integration. In section 4, we show an ablation study on the Stanford Online Products dataset [Song_2016_CVPR]. Finally, we compare our approach to the state-of-the-art methods on three image retrieval datasets (Stanford Online Products, CUB-200-2011, Cars-196).
2 Related work
2.1 Second-Order Pooling
In this section, we briefly review end-to-end trainable bilinear pooling (BP) [Lin_2015_ICCV]. This method extracts representations from the same image with two CNNs and computes their cross-covariance as the representation. This representation outperforms its first-order version and other second-order representations such as Fisher Vectors [Perronnin_2010_ECCV] once the global architecture is fine-tuned. Most recent works on bilinear pooling focus only on computing the covariance of the features extracted with a single CNN, that is:
$$\mathbf{Y} = \sum_{i} \mathbf{x}_i \mathbf{x}_i^\top = \mathbf{X}\mathbf{X}^\top \tag{1}$$
where $\mathbf{X}$ is the matrix of the extracted $d$-dimensional CNN features, with one feature $\mathbf{x}_i$ per column. Another formulation is the vectorized version $\mathbf{y}$ of $\mathbf{Y}$, obtained by computing the Kronecker product ($\otimes$) of $\mathbf{x}_i$ with itself:
$$\mathbf{y} = \sum_{i} \mathbf{x}_i \otimes \mathbf{x}_i \tag{2}$$
Due to the very high dimension of the above representation that is quadratic in the feature dimension, factorization schemes are mandatory.
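The equivalence between the covariance form and its Kronecker vectorization can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation; the sizes `d` and `n` and the random features are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 6                        # toy sizes (assumed), not the paper's
X = rng.standard_normal((d, n))    # one d-dimensional feature per column

# Second-order pooling as a sum of outer products over all features.
Y = sum(np.outer(X[:, i], X[:, i]) for i in range(n))

# Vectorized form: sum of Kronecker products of each feature with itself.
y = sum(np.kron(X[:, i], X[:, i]) for i in range(n))

# The sum of outer products is X X^T, and the Kronecker form is its
# row-major flattening, so the output has d * d components.
same_as_gram = np.allclose(Y, X @ X.T)
same_as_flat = np.allclose(y, Y.reshape(-1))
```

The output length `d * d` is what makes the raw representation quadratic in the feature dimension.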
2.2 Factorization schemes
Recent works on bilinear pooling proposed factorization schemes with two objectives: avoiding the direct computation of second-order features and reducing the high dimensionality of the output representation. One of the main end-to-end trainable factorizations is based on Tensor Sketch (CBP-TS) [Gao_2016_CVPR], which tackles the high dimensionality of second-order features using sketching functions. This formulation keeps less than 4% of the components with nearly no loss in performance compared to the uncompressed model.
This rank-one factorization has been generalized to multi-rank by taking advantage of the SVM formulation to jointly train the network and the classifier [Kong_2017_CVPR]. Even if the second-order features are never directly computed, this factorization is limited to the SVM formulation and cannot be used for other tasks. Other task-agnostic extensions include FBN [Li_2017_ICCV_FBN], which also integrates the first order into the representation, and HPBP [Kim_2017_ICLR], which improves the factorization with an attention model and non-linearity and applies it to visual question answering. Grassmann BP [Wei_2018_ECCV] also improves second-order pooling by dealing with the "burstiness" of features, which may be predominant in high-order representations, by using Grassmann manifolds and providing an indirect computation of the representation. However, this method relies on Singular Value Decomposition (SVD), and the input feature dimension must be greatly reduced because the SVD computation complexity is cubic in the feature dimension.
For image retrieval tasks, producing very compact representations is mandatory to tackle the indexing of very large datasets. For example, the current state-of-the-art method on the CUB dataset [CUB_200_2011], named HTL [Wei_2018_ECCV], uses only 512 dimensions for the representation. Thus, all of the aforementioned methods have representations that are still too large to compete in this category. In this work, we start from a rank-one factorization, detailed in section 3.1, which is extended by the introduction of a codebook strategy that allows a smaller representation dimension, improves performance and makes it competitive with state-of-the-art methods in image retrieval.
2.3 Codebook strategies
An acknowledged drawback of pooling methods is that they pool unrelated features, which may decrease performance. To cope with this observation, codebook strategies (e.g., Bag of Words) have been proposed and have greatly improved performance by pooling only features that belong to the same codeword.
In the case of second-order information, the first representations that take advantage of codebook strategies are VLAT [Picard_2011_ICIP, Picard_2013_CVIU] and Fisher Vectors [Perronnin_2010_ECCV]. While in VLAT the high dimensionality is handled by PCA on local features and intra-projections, Fisher Vectors (FVs) replace the hard assignment by a Gaussian Mixture Model (GMM) and assume that the covariance matrices are diagonal, which leads to smaller representations. However, FVs ignore cross-dimension correlations. Strategies like STA [Picard_2016_ICIP, Jacob_2018_ICIP] extend the VLAT representation by computing cross-correlation matrices of nearby features to integrate spatial information, and take advantage of a codebook strategy to avoid the computation of unrelated features. However, as the dimensionality is quadratic in both the codebook size and the feature dimension, a factorization scheme is mandatory. In the case of ISTA [Jacob_2018_ICIP], the proposed dimensionality reduction only reduces the representation to around 20k dimensions, which is about twice as large as standard second-order pooling factorizations.
In end-to-end trainable architectures, FisherNet [Tang_2016_Arxiv] extends the FVs and outperforms non-trainable FV approaches but nonetheless has the high output dimension of the original FV. The MFA-FV network [Li_2017_ICCV_MFAFVNet], which extends the MFA-FV of [Dixit_2016_NIPS], generates an efficient representation of non-linear manifolds with a small latent space and is trainable in an end-to-end way. The main drawbacks of this method are the direct computation of second-order features for each codeword (computation cost), the raw projection of this covariance matrix into the latent space for each codeword (computation cost and number of parameters), and finally the representation dimension. In the original paper, the proposed representation reaches 500k dimensions, which is prohibitive for image retrieval as it may require more memory than whole images.
To our knowledge, no efficient factorization has been combined with a codebook strategy to exploit the richer representation of second-order features. Our propositions combine the best of both worlds by providing a joint codebook and factorization optimization scheme with a number of parameters and a computation cost similar to those of methods without codebook strategies.
3 Method overview
After a presentation of the initial factorization (section 3.1), we first propose an extension to a codebook strategy (section 3.2) and show the limitations of this architecture in terms of computation cost, low-rank approximation, number of parameters, etc. Finally, we present our shared projectors strategy (section 3.3) which leads to a joint codebook and factorization optimization.
3.1 Initial factorization scheme
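In this section, we present the factorization of the projection matrix and highlight the advantages and limitations of this scheme. Using the same notation as in section 2.1, we want to find the optimal linear projection matrix $\mathbf{W} \in \mathbb{R}^{d^2 \times D}$ to build the output feature $\mathbf{z}_i = \mathbf{W}^\top (\mathbf{x}_i \otimes \mathbf{x}_i) \in \mathbb{R}^{D}$. These output features are then pooled to build the output representation $\mathbf{z}$:
$$\mathbf{z} = \sum_{i} \mathbf{z}_i = \sum_{i} \mathbf{W}^\top (\mathbf{x}_i \otimes \mathbf{x}_i) \tag{3}$$
In the rest of the paper, we use the notation $z_k$ for the $k$-th dimension of the output representation $\mathbf{z}$ and $z_{i;k}$ for the $k$-th dimension of the output feature $\mathbf{z}_i$, that is:
$$z_{i;k} = \langle \mathbf{w}_k \,;\, \mathbf{x}_i \otimes \mathbf{x}_i \rangle \tag{4}$$
where $\mathbf{w}_k \in \mathbb{R}^{d^2}$ is a column of $\mathbf{W}$. Due to the large number of parameters induced by this projection matrix, we enforce the rank-one decomposition $\mathbf{w}_k = \mathbf{u}_k \otimes \mathbf{v}_k$, where $\mathbf{u}_k, \mathbf{v}_k \in \mathbb{R}^{d}$, for all projectors of $\mathbf{W}$. $z_{i;k}$ from Eq. (4) becomes:
$$z_{i;k} = \langle \mathbf{u}_k \otimes \mathbf{v}_k \,;\, \mathbf{x}_i \otimes \mathbf{x}_i \rangle = \langle \mathbf{u}_k \,;\, \mathbf{x}_i \rangle \langle \mathbf{v}_k \,;\, \mathbf{x}_i \rangle \tag{5}$$
This factorization is efficient in terms of parameters, as it needs only $2dD$ parameters instead of $d^2 D$ for the full projection matrix. However, even if this rank-one decomposition allows efficient dimension reduction, it is not enough to keep all the richness of the second-order statistics, due to the pooling of unrelated features. Consequently, we extend the second-order feature to a codebook strategy.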
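The rank-one decomposition relies on the identity $\langle \mathbf{u} \otimes \mathbf{v} ; \mathbf{x} \otimes \mathbf{x} \rangle = \langle \mathbf{u} ; \mathbf{x} \rangle \langle \mathbf{v} ; \mathbf{x} \rangle$, which avoids ever forming the $d^2$-dimensional second-order feature. The sketch below checks this identity on random vectors and compares the two parameter counts; the helper names and the toy sizes are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 5, 3                               # toy sizes (assumed)
x, u, v = rng.standard_normal((3, d))     # one feature and one projector pair

# Explicit form: project the d*d second-order feature onto w = u ⊗ v.
explicit = np.dot(np.kron(u, v), np.kron(x, x))

# Factorized form: two d-dimensional dot products, no d*d vector needed.
factored = np.dot(u, x) * np.dot(v, x)

def n_params_full(d, D):
    """Parameters of a full projection matrix W of shape (d*d, D)."""
    return d * d * D

def n_params_rank_one(d, D):
    """Parameters of D pairs (u_k, v_k), each vector of dimension d."""
    return 2 * d * D

identity_holds = np.isclose(explicit, factored)
```

The ratio between the two counts is $d/2$, so the saving grows linearly with the feature dimension.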
3.2 Codebook strategy
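To avoid destructive averaging, we want to pool only similar features which belong to the same codeword. This codebook pooling is interesting because each projection to a sub-space should involve only similar features. Thus, they lie on a simpler manifold and can be encoded with fewer dimensions. For a codebook of size $N$, we compute an assignment function $h(\cdot) \in \mathbb{R}^{N}$. This function can be a hard assignment (e.g., the arg min over the distance to each cluster) or a soft assignment (e.g., the softmax). Thus, the output feature $z_{i;k}$ becomes:
$$z_{i;k} = \langle \mathbf{w}_k \,;\, h(\mathbf{x}_i) \otimes \mathbf{x}_i \otimes h(\mathbf{x}_i) \otimes \mathbf{x}_i \rangle \tag{6}$$
Remark that now $\mathbf{w}_k \in \mathbb{R}^{N^2 d^2}$ and $\mathbf{W} \in \mathbb{R}^{N^2 d^2 \times D}$. Here, we duplicate $h(\mathbf{x}_i)$ to keep the generality of bilinear pooling (two codebooks can be learned, one per modality) or of STA-based strategies (two nearby features may belong to different codewords). As in equation 5, we enforce the rank-one decomposition $\mathbf{w}_k = \mathbf{p}_k \otimes \mathbf{q}_k$, where $\mathbf{p}_k, \mathbf{q}_k \in \mathbb{R}^{Nd}$, to split the modalities. This first factorization leads to the following output feature $z_{i;k}$:
$$z_{i;k} = \langle \mathbf{p}_k \,;\, h(\mathbf{x}_i) \otimes \mathbf{x}_i \rangle \langle \mathbf{q}_k \,;\, h(\mathbf{x}_i) \otimes \mathbf{x}_i \rangle \tag{7}$$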
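However, this representation is still too large to be computed directly. We therefore enforce two supplementary factorizations:
$$\mathbf{p}_k = \sum_{j=1}^{N} \mathbf{e}^{(j)} \otimes \mathbf{u}_{k,j}, \qquad \mathbf{q}_k = \sum_{j=1}^{N} \mathbf{e}^{(j)} \otimes \mathbf{v}_{k,j} \tag{8}$$
where $\mathbf{e}^{(j)}$ is the $j$-th vector from the natural basis of $\mathbb{R}^{N}$ and $\mathbf{u}_{k,j}, \mathbf{v}_{k,j} \in \mathbb{R}^{d}$. The decompositions of $\mathbf{p}_k$ and $\mathbf{q}_k$ play the same role as intra-projection in VLAD [Delhumeau_2013_ACM]. Indeed, if we consider $h(\cdot)$ as a hard assignment function, the only computed projection is the one assigned to the corresponding codeword. Thus, this model learns a projection matrix for each codebook entry.
Furthermore, by exploiting the same property used in Eq. (7), the resulting expression can be compacted as:
$$z_{i;k} = \left( h(\mathbf{x}_i)^\top \mathbf{U}_k^\top \mathbf{x}_i \right) \left( h(\mathbf{x}_i)^\top \mathbf{V}_k^\top \mathbf{x}_i \right) \tag{9}$$
where $\mathbf{U}_k = [\mathbf{u}_{k,1}, \dots, \mathbf{u}_{k,N}]$ and $\mathbf{V}_k = [\mathbf{v}_{k,1}, \dots, \mathbf{v}_{k,N}]$, both in $\mathbb{R}^{d \times N}$, are the matrices concatenating the projections of all entries of the codebook for the $k$-th output dimension. We call it Joint Codebook and Factorization, JCF-N.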
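As an illustration, the JCF-N output feature can be sketched in NumPy under the assumption that its compact form reads $z_{i;k} = (h(\mathbf{x}_i)^\top \mathbf{U}_k^\top \mathbf{x}_i)(h(\mathbf{x}_i)^\top \mathbf{V}_k^\top \mathbf{x}_i)$, with a softmax over cosine similarity as the soft assignment. The function names `soft_assign` and `jcf_feature`, the tensor layout and the toy sizes are hypothetical choices; this is a sketch, not the authors' code.

```python
import numpy as np

def soft_assign(x, codebook):
    """Softmax over the cosine similarity between x (d,) and each codeword (N, d)."""
    x_n = x / np.linalg.norm(x)
    c_n = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    logits = c_n @ x_n
    e = np.exp(logits - logits.max())     # shift for numerical stability
    return e / e.sum()

def jcf_feature(x, h, U, V):
    """All D output dimensions at once.

    U, V: (D, d, N) stacks of the per-dimension matrices U_k and V_k.
    Returns z_i with z_i[k] = (h^T U_k^T x) * (h^T V_k^T x).
    """
    return np.einsum('kdn,d,n->k', U, x, h) * np.einsum('kdn,d,n->k', V, x, h)

rng = np.random.default_rng(0)
D, d, N = 3, 4, 5                         # toy sizes (assumed)
x = rng.standard_normal(d)
codebook = rng.standard_normal((N, d))
U = rng.standard_normal((D, d, N))
V = rng.standard_normal((D, d, N))

h = soft_assign(x, codebook)
z = jcf_feature(x, h, U, V)

# Reference: build p_k and q_k explicitly (stacked per-codeword projectors)
# and apply them to h(x) ⊗ x, i.e. the form before compaction.
hx = np.kron(h, x)                        # length N * d
z_ref = np.array([
    (U[k].T.reshape(-1) @ hx) * (V[k].T.reshape(-1) @ hx)
    for k in range(D)
])
matches_reference = np.allclose(z, z_ref)
```

With a hard (one-hot) assignment, `h` selects a single column of each `U[k]` and `V[k]`, which is the per-codeword projection described above.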
This representation has multiple advantages. First, it computes second-order features, which lead to better performance than their first-order counterpart. Second, our first factorization provides an efficient alternative in terms of number of parameters and computation, despite the decrease in performance when it reaches small representation dimensions. This downside is addressed by the codebook strategy: it pools only related features, whose projections to a sub-space are more compressible. However, even if this codebook strategy improves performance, the number of parameters is in $\mathcal{O}(dDN)$. As such, using a large codebook may become intractable. In the next section, we extend this scheme by sharing a set of projectors and enhance the decompositions of $\mathbf{p}_k$ and $\mathbf{q}_k$.
3.3 Sharing projectors
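In the previous section, one projector is learned for each entry of the codebook to map all features that belong to that entry. Instead of using such a one-to-one correspondence, we propose to learn a set of projectors that is shared across the codebook. The reasoning is that projectors from different codebook entries are unlikely to all be orthogonal. Under this hypothesis (i.e., the vector space spanned by the combination of all the projection matrices has a lower dimension than the codebook itself), we can have smaller models with nearly no loss in performance. To check this hypothesis, we extend the proposed factorization from section 3.2. We want to generate $\mathbf{U}_k$ from $\{\tilde{\mathbf{u}}_{k,r}\}_{r=1,\dots,R}$ and $\mathbf{V}_k$ from $\{\tilde{\mathbf{v}}_{k,r}\}_{r=1,\dots,R}$, where $R$ is the number of projections in the set. Then the two new enforced factorizations of $\mathbf{p}_k$ and $\mathbf{q}_k$ are:
$$\mathbf{p}_k = \sum_{j=1}^{N} \mathbf{e}^{(j)} \otimes \sum_{r=1}^{R} f_{p,r}(\mathbf{e}^{(j)})\, \tilde{\mathbf{u}}_{k,r}, \qquad \mathbf{q}_k = \sum_{j=1}^{N} \mathbf{e}^{(j)} \otimes \sum_{r=1}^{R} f_{q,r}(\mathbf{e}^{(j)})\, \tilde{\mathbf{v}}_{k,r} \tag{10}$$
where $f_p, f_q$ are two functions from $\mathbb{R}^{N}$ to $\mathbb{R}^{R}$ that transform the codebook assignment into a set of coefficients which generate their respective projection matrices. Similarly to Eq. (9), we have:
$$z_{i;k} = \left( f_p(h(\mathbf{x}_i))^\top \tilde{\mathbf{U}}_k^\top \mathbf{x}_i \right) \left( f_q(h(\mathbf{x}_i))^\top \tilde{\mathbf{V}}_k^\top \mathbf{x}_i \right) \tag{11}$$
with $\tilde{\mathbf{U}}_k = [\tilde{\mathbf{u}}_{k,1}, \dots, \tilde{\mathbf{u}}_{k,R}]$ and $\tilde{\mathbf{V}}_k = [\tilde{\mathbf{v}}_{k,1}, \dots, \tilde{\mathbf{v}}_{k,R}]$ in $\mathbb{R}^{d \times R}$. In this paper, we only study the case of a linear projection:
$$z_{i;k} = \left( h(\mathbf{x}_i)^\top \mathbf{A}^\top \tilde{\mathbf{U}}_k^\top \mathbf{x}_i \right) \left( h(\mathbf{x}_i)^\top \mathbf{B}^\top \tilde{\mathbf{V}}_k^\top \mathbf{x}_i \right) \tag{12}$$
where $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{R \times N}$. Eq. (12) is more efficient in terms of parameters than Eq. (9), as it requires about $N/R$ times fewer parameters and computations. We call this approach JCF-N-R. In section 4, we provide an ablation study of the proposed method, comparing Eq. (9) and Eq. (12), demonstrating that learning the recombination is both efficient and effective.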
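The shared-projector variant can be sketched in the same way, assuming the linear case reads $z_{i;k} = (h^\top \mathbf{A}^\top \tilde{\mathbf{U}}_k^\top \mathbf{x})(h^\top \mathbf{B}^\top \tilde{\mathbf{V}}_k^\top \mathbf{x})$ with recombination matrices of shape $R \times N$ shared across output dimensions. The function names, tensor layout and toy sizes below are hypothetical, and the parameter counts ignore the backbone and the codebook itself.

```python
import numpy as np

def jcf_shared_feature(x, h, U_tilde, V_tilde, A, B):
    """JCF-N-R feature. U_tilde, V_tilde: (D, d, R); A, B: (R, N); h: (N,)."""
    a = A @ h                              # (R,) coefficients for the U side
    b = B @ h                              # (R,) coefficients for the V side
    zu = np.einsum('kdr,d,r->k', U_tilde, x, a)
    zv = np.einsum('kdr,d,r->k', V_tilde, x, b)
    return zu * zv

def n_params_jcf(d, D, N):
    """One (u, v) pair of d-dimensional projectors per codeword and output dim."""
    return 2 * d * D * N

def n_params_jcf_shared(d, D, N, R):
    """R shared projector pairs per output dim, plus the matrices A and B."""
    return 2 * d * D * R + 2 * R * N

rng = np.random.default_rng(0)
D, d, N, R = 3, 4, 8, 2                    # toy sizes (assumed)
x = rng.standard_normal(d)
h = rng.random(N)
h /= h.sum()                               # stand-in for a soft assignment
U_tilde = rng.standard_normal((D, d, R))
V_tilde = rng.standard_normal((D, d, R))
A = rng.standard_normal((R, N))
B = rng.standard_normal((R, N))

z = jcf_shared_feature(x, h, U_tilde, V_tilde, A, B)

# Equivalent per-codeword projectors U_k = U_tilde_k A (rank at most R),
# plugged into the unshared form for comparison.
U_eff = np.einsum('kdr,rn->kdn', U_tilde, A)
V_eff = np.einsum('kdr,rn->kdn', V_tilde, B)
z_full = (np.einsum('kdn,d,n->k', U_eff, x, h)
          * np.einsum('kdn,d,n->k', V_eff, x, h))
matches_unshared = np.allclose(z, z_full)
```

The comparison makes the low-rank hypothesis explicit: sharing is lossless exactly when the per-codeword projectors already lie in an $R$-dimensional span.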
3.4 Implementation details
We build our model on top of a pre-trained backbone network: VGG16 [Simonyan_2014_ILSVRC] (on the CUB and CARS datasets) or ResNet50 [He_2016_CVPR] (on Stanford Online Products). In both cases, the features are reduced to 256d and $\ell_2$-normalized. The assignment function $h(\cdot)$ is the softmax over the cosine similarity between the features and the codebook. As the evaluation metric, we use Recall@K, which takes the value 1 if there is at least one element from the same instance in the top-K results and 0 otherwise, and averages these scores over the test set. Images are resized to 224x224 and we do not use data augmentation. We use SGD with a learning rate of, a batch of 64 images, the N-pair triplet loss [Sohn_2016_NIPS] with the margin set to and . We also use semi-hard mining for the final comparison to the state-of-the-art.
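The Recall@K metric described above can be sketched as follows. This is a minimal version that assumes each test item is used in turn as a query against all the others and that ranking uses cosine similarity; the function name `recall_at_k` and the toy data are hypothetical.

```python
import numpy as np

def recall_at_k(embeddings, labels, ks=(1, 2)):
    """Fraction of queries with at least one same-label item in their top-K."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = emb @ emb.T                          # cosine similarity matrix
    np.fill_diagonal(sim, -np.inf)             # a query never retrieves itself
    order = np.argsort(-sim, axis=1)           # most similar first
    hits = labels[order] == labels[:, None]    # True where the label matches
    return {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}

# Toy example: two pairs of near-duplicates and one item with a unique label.
embeddings = np.array([[1.0, 0.0], [0.9, 0.1],
                       [0.0, 1.0], [0.1, 0.9],
                       [1.0, 1.0]])
labels = np.array([0, 0, 1, 1, 2])
scores = recall_at_k(embeddings, labels, ks=(1, 2))
```

In this toy example the item with the unique label can never score 1, so four of the five queries succeed at both values of K.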
4 Ablation studies
4.1 Bilinear pooling and codebook strategy
In this section, we demonstrate both the relevance of second-order information for retrieval tasks and the influence of the codebook on our method. We report recall@1 on Stanford Online Products in Table 1 for the different configurations detailed below.
First, as a reference, we train a Baseline network, which consists of the average of the features reduced to 512 dimensions (first-order model). Then we re-implement BP and extend it naively to a codebook strategy. The objective is to demonstrate that such a strategy performs well, but at an intractable cost. Results are reported in the left part of Table 1. This experiment confirms the interest of second-order information in image retrieval, with an improvement of 2% over the baseline while using a 512-dimensional representation. Furthermore, using a codebook strategy with few codewords improves bilinear pooling by a further 1%. However, the number of parameters becomes intractable for codebooks of size greater than 4: this naive strategy requires 270M parameters to extend this model to a codebook of size 8.
Using the factorization from Eq. (9) greatly reduces the required number of parameters and allows the exploration of larger codebooks. In the case of the factorization alone, the small representation dimension leads to poor performance, which is only slightly recovered by using a codebook. In contrast, our factorization, which exploits both the larger codebook and the low-rank approximation, is able to reach higher performance (+4% between BP and JCF-32-32) with nearly four times fewer parameters.
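The parameter figures quoted above can be reproduced with a short calculation. It assumes a 256-dimensional input feature and a 512-dimensional representation (the values given in section 3.4 and in this section), one full projection per codeword for the naive codebook BP, and one pair of $d$-dimensional projectors per codeword and output dimension for the factorized model. The counts ignore the backbone, the codebook itself and any recombination matrices, and the function names are hypothetical.

```python
d, D = 256, 512   # assumed feature and representation dimensions

def naive_codebook_bp_params(N):
    """One full (d*d, D) projection per codebook entry; N = 1 is plain BP."""
    return N * d * d * D

def factorized_codebook_params(N):
    """Two d-dimensional projectors per codeword and per output dimension."""
    return 2 * N * d * D

bp = naive_codebook_bp_params(1)          # 33,554,432
bp_8 = naive_codebook_bp_params(8)        # 268,435,456, i.e. about 270M
jcf_32 = factorized_codebook_params(32)   # 8,388,608
ratio = bp / jcf_32                       # 4.0
```

Under these assumptions, a factorized model with 32 codewords uses four times fewer projection parameters than plain bilinear pooling, while the naive codebook extension with only 8 codewords already needs eight times more.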
4.2 Sharing projections
In this part, we study the impact of sharing projections. We use the same training procedure as in the previous section. For each codebook size, we train architectures with different numbers of projections, which allows us to compare architectures without the sharing process to architectures with a larger codebook but the same number of parameters obtained by sharing projectors. Results are reported in the right part of Table 1. Sharing projectors leads to smaller models with little loss in performance, and using richer codebooks allows more compression with superior results. In the next section, we compare our best model, JCF-32-32, and its shared version with four times fewer parameters, JCF-32-8, against state-of-the-art methods.
5 Comparison to the state-of-the-art
In this section, we compare our method to the state-of-the-art on 3 retrieval datasets: Stanford Online Products [Song_2016_CVPR], CUB-200-2011 [CUB_200_2011] and Cars-196 [CARS_196]. For Stanford Online Products and CUB-200-2011, we use the same train/test split as [Song_2016_CVPR]. For Cars-196, we use the same as [Opitz_2017_ICCV]. We report the standard recall@K with $K \in \{1, 10, 100, 1000\}$ for Stanford Online Products and with $K \in \{1, 2, 4, 8\}$ for the other two. We implement the codebook factorization from Eq. (9) with a codebook size of 32 (denoted JCF-32). While JCF-32 outperforms state-of-the-art methods on the three datasets, our low-rank approximation JCF-32-8, which costs 4 times less, also leads to state-of-the-art performance on two of them, with a loss of 1-2% that is consistent with our ablation studies. In the case of Cars-196, however, the performance is much lower than that of the full model. We argue that the variety introduced by the colors, the shapes, etc. in cars requires more projections to be estimated, as is observed for the full model.
|Method|R@1|R@10|R@100|R@1000|
|---|---|---|---|---|
|Binomial deviance [Ustinova_2016_NIPS]|65.5|82.3|92.3|97.6|
|N-pair loss [Sohn_2016_NIPS]|67.7|83.8|93.0|97.8|
In this paper, we explore codebook based second order representations that are intractable in practice. We propose a two-step factorization and a low-rank approximation designed to keep the richness of the second-order representation but with the compactness of the first-order. We provide ablation studies to confirm the necessity of a codebook pooling strategy, the impact of the different factorizations and the benefit of the low-rank approximation to control the computation cost. This representation named JCF outperforms state-of-the-art methods on three image retrieval benchmarks and its low-rank approximation is state-of-the-art on two of them.