Implementation of CVPR'19 paper (oral) - ContextDesc: Local Descriptor Augmentation with Cross-Modality Context
Most existing studies on learning local features focus on the patch-based descriptions of individual keypoints, while neglecting the spatial relations established from their keypoint locations. In this paper, we go beyond the local detail representation by introducing context awareness to augment off-the-shelf local feature descriptors. Specifically, we propose a unified learning framework that leverages and aggregates the cross-modality contextual information, including (i) visual context from high-level image representation, and (ii) geometric context from 2D keypoint distribution. Moreover, we propose an effective N-pair loss that eschews the empirical hyper-parameter search and improves the convergence. The proposed augmentation scheme is lightweight compared with the raw local feature description, yet improves remarkably on several large-scale benchmarks with diversified scenes, which demonstrates both strong practicality and generalization ability in geometric matching applications.
As a common practice, a multi-scale-like architecture can help to capture visual context at different levels, which is referred to as multi-scale domain aggregation by DSP-SIFT and adopted by recent learned descriptors [50, 19, 43]. Besides the challenge of selecting proper domain sizes, a naïve multi-scale implementation may incur excessive computation, such as doubled inference time and doubled feature dimensionality [50, 19, 43]. Seeking more reasonable accuracy-efficiency trade-offs, we instead resort to well-studied high-level image representations, e.g., the regional representation used by image retrieval studies [33, 38], which essentially incorporates rich image context. We thereby strive to effectively combine the local feature description with off-the-shelf visual understanding so as to go beyond the local detail representation.
In addition, it would be interesting to exploit context in other modalities. In particular, as shown in Fig. 0(b), since keypoints are principally designed to be repeatable in the same underlying scene, their distribution reveals a comprehensive scene structure that allows us humans to establish coarse matches even without color information. This motivates us to explore the geometric context formed by the spatial relations of keypoints, in order to alleviate the visual ambiguity of local descriptions.
Thus far, we have discussed two context candidates, referred to as visual context and geometric context, which incorporate high-level visual representation over the image and geometric cues from 2D keypoint distribution, respectively. Instead of learning a completely new descriptor, in the present work we aim to flexibly leverage the above context awareness to augment off-the-shelf local descriptors without altering their dimensionality. In this process, we identify three key challenges:
A proper integration of geometric local feature and semantic high-level representation. As keypoint description requires sub-pixel accuracy, the integration is not supposed to obscure the raw representation of local details.
The instability of 2D keypoint distribution. Due to image appearance changes, the keypoint distribution often suffers from substantial variations in sparsity, non-uniformity or perspective, which makes it difficult for the feature encoder to acquire a strong invariance property.
An effective learning scheme. Input signals and features in different modalities are supposed to be efficiently processed and aggregated in a unified framework.
Finally, regarding practicability, the augmentation is not supposed to introduce excessive computational cost, as the local feature description is often regarded as part of preprocessing in practical pipelines.
Although contextual information has been widely explored in semantic-based tasks, the challenges faced by local feature learning are substantially different, posing many non-trivial technical and systematic issues to overcome. In this paper, we propose a unified augmentation scheme that effectively leverages and aggregates cross-modality context, whose contributions are threefold: 1) A novel visual context encoder that integrates high-level visual understanding from regional image representation, a technique often used by image retrieval [33, 38]. 2) A novel geometric context encoder that consumes unordered points and exploits geometric cues from 2D keypoint distribution, while being robust to complex variations. 3) A novel N-pair loss that requires no manual hyper-parameter search and has better convergence properties. To the best of our knowledge, this is the first work that emphasizes the importance of context awareness, and in particular addresses the usability of spatial relations of keypoints in local feature learning.
The proposed augmentation is extensively evaluated and achieves state-of-the-art results on several large-scale benchmarks, including patch-level homography dataset, image-level wild outdoor/indoor scenes and application-level 3D reconstruction image sets, while being lightweight compared with raw local description, demonstrating both strong generalization ability and practicability.
Initially, local descriptors are jointly learned with a new comparison metric [9, 50], which is later simplified to direct comparison in Euclidean space [40, 48, 3, 19, 1]. More recently, efforts have been spent on efficient training data sampling [43, 25, 11], effective regularizations [43, 53], and geometric shape estimation of input patches [26, 7]. However, most of the above methods take individual image patches as input, whereas in the present work, we aim to take advantage of contextual cues beyond the local detail and incorporate features in multiple modalities.
Although widely introduced in computer vision tasks, context awareness has received little attention in learning 2D local descriptors. In terms of visual context, the central-surround (CS) structure [50, 19, 43] leverages multi-scale information by additionally feeding the central part of patches to boost the performance, at the cost of computational efficiency due to the doubled extraction time and feature dimensionality. To incorporate semantics, one previous practice designs a new comparison metric and describes features from a histogram of semantic labels. In contrast to geometric matching, a family of studies has focused on finding semantic correspondences [45, 34] across different objects of the same category. Besides visual information, a recent study encodes motion context for identifying outliers from keypoint matches, i.e., 4-d coordinate pairs, while we aim to exploit geometric context from a single image without any reference. Overall, how to encode proper context is non-trivial and still unclear in 2D local feature learning.
Point feature learning. In the present work, one of our goals is to explore geometric features from keypoint distribution; we thus resort to PointNet and its variants [32, 5, 49] to consume unordered points. Although great success has been witnessed in learning tasks on 3D points, only a few studies exploit the potential of 2D keypoint sets. In essence, keypoint structure is not intuitively meaningful or robust, as it is highly dependent on the performance of interest point detectors and strongly affected by image variations. However, in descriptor learning, we consider the keypoint location an important cue that bridges individual local features and has the potential to alleviate the local visual ambiguity.
Loss formulation. Recent local descriptors have often evolved with advanced variants of N-pair losses. Initially, L2-Net adopts a log-likelihood formulation, which is later extended by HardNet with a subtractive hinge loss. Furthermore, GeoDesc applies an adaptive margin to improve the convergence under different hard negative mining strategies, while AffNet approaches the same issue by fixing the distance to the hardest negative sample during training. Meanwhile, DOAP extends the N-pair loss to a list-wise ranking loss, and a further study points out and studies the scale effects in N-pair losses, while introducing additional manual tuning of hyper-parameters. Principally, a good loss is supposed to encourage similar patches to be close and dissimilar ones to be distant in the descriptor space. In this spirit, we aim to further resolve the scale effects in a self-adaptive manner, without the need for complex heuristics or manual tuning.
Overview. As illustrated in Fig. 2, the proposed framework consists of two main modules: preparation (left) and augmentation (right). The preparation module provides input signals in different modalities (raw local feature, high-level visual feature and keypoint location), which are then fed to the augmentation module and aggregated into compact feature descriptions. At test time, the augmentation needs to be performed only once per image, producing the feature vectors for all corresponding keypoints.
Patch sampler. This module takes images and their keypoints as input, producing gray-scale patches. Akin to [48, 23], image patches are sampled by a spatial transformer, whose parameters are derived from keypoint attributes (coordinates, orientation and scale) from the SIFT detector. As a result, the sampled patch has the same support region size as the SIFT descriptor.
Local feature extractor. This module takes image patches as input, producing 128-d feature descriptions as output. We borrow the lightweight 7-layer convolutional networks as used in several recent works [43, 25, 23].
Regional feature extractor. In contrast to aggregating features of different domain sizes [50, 19, 43], in the present work we fix the sampling scale of patches, and exploit contextual cues inspired by the well-studied regional representation in image retrieval tasks [44, 33, 28]. Without loss of generality, we reuse features from an off-the-shelf deep image retrieval model based on ResNet-50. Feature maps are extracted from the last bottleneck block, across which each response is regarded as a regional feature vector effectively corresponding to a particular region in the image. As a result, we derive a grid of regional features whose spatial size is determined by the original image height and width. The aggregation of regional and local features will be discussed later in Sec. 3.3.
This module takes unordered points as input, and outputs 128-d corresponding feature vectors. Each input point is represented as 2D keypoint coordinate, and can be associated with other attributes.
2D point processing. At first glance, 2D keypoints are inappropriate to serve as robust contextual cues, as their presence is heavily dependent on image appearance and thus affected by various image variations. As a result, keypoint distributions depicting the same scene may suffer from significant density or structure variations, as in the examples shown in Fig. 0(b). Hence, acquiring a strong invariance property is the key challenge when designing the context encoder.
Initially, we attempt to approach the goal by PointNet and its variants [32, 5]. Although they have shown great success in processing 3D point clouds, those prevalent PointNet methods fail to achieve consistent improvement in 2D point processing (Sec. 4.4.1). Instead, we resort to context normalization (CN), which prior work equips in PointNet to consume putative matches (4-d coordinate pairs) for outlier rejection in image matching. In this work, we aim to further explore the usability of CN for modeling the 2D point distribution in a single image.
Formally, CN is a non-parametric operation that simply normalizes feature maps according to their distribution, written as $\mathrm{CN}(\mathbf{o}_i^l) = (\mathbf{o}_i^l - \mu^l) / \sigma^l$, where $\mathbf{o}_i^l$ is the output of the $i$-th point in layer $l$, and $\mu^l$ and $\sigma^l$ are the mean and standard deviation of the outputs in layer $l$. To equip the operation, we borrow the residual architecture shown in Fig. 3a.
However, the above design leads to a non-negative output from the residual branch that may impact the representational ability as investigated in  and witnessed in our experiments (Sec. 4.4.1). Following the teachings of , we re-arrange the operations in each residual unit with pre-activation, which is compatible with CN as presented in Fig. 3b. We then construct four such units for the encoder, as shown in Fig. 2. We will show that this simple revision plays an important role to ease the optimization.
Intuitively, the non-parametric CN suffices to model the keypoint distribution in our task, while high-level abstractions (e.g., in PointNet++ ) may not be necessary.
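To make the two ideas above concrete, the following is a minimal NumPy sketch of context normalization and of a pre-activated residual unit built on it. It is an illustration under stated assumptions rather than the architecture of Fig. 3b: the function names, the `eps` term, the use of plain point-wise linear layers, and the omission of any batch normalization are all choices made here for brevity.

```python
import numpy as np

def context_norm(x, eps=1e-6):
    """Normalize each channel over the N points of one image. x has shape (N, C)."""
    mu = x.mean(axis=0, keepdims=True)
    sigma = x.std(axis=0, keepdims=True)
    # eps is an assumption added here to avoid division by zero.
    return (x - mu) / (sigma + eps)

def pre_act_unit(x, w1, b1, w2, b2):
    """Pre-activated residual unit: (CN -> ReLU -> point-wise linear) twice, plus skip.

    Because the residual branch ends with a linear layer instead of a ReLU,
    its output is not restricted to be non-negative.
    """
    h = np.maximum(context_norm(x), 0.0) @ w1 + b1
    h = np.maximum(context_norm(h), 0.0) @ w2 + b2
    return x + h
```

Since the statistics are computed over all points of an image and every other operation is point-wise, the unit is equivariant to the ordering of the input points, which is what allows it to consume unordered keypoint sets.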
Matchability predictor. In 3D point cloud processing, low-level color and normal  information or complex geometric attributes  are often incorporated to enhance the representation. Similarly, associating 2D coordinate input with other meaningful attributes would be promising to boost the performance. However, due to the substantial variations, e.g. perspective change, it is non-trivial to define appropriate intermediate attributes on 2D points.
Although this issue has rarely been discussed, we draw inspiration from prior work that poses a problem named matchability prediction, which aims to decide whether a keypoint descriptor is matchable before the matching stage. In practice, the matchability serves as a learned attenuation to diversify the keypoints, so that the feature encoder can implicitly focus on the points that are more robust, i.e., matchable, in order to improve the invariance property.
We resort to an unsupervised learning scheme that aims to appropriately rank points by their matchability. Formally, given the correspondences from an image pair, we first extract their local features, then construct feature quadruples from pairs of distinct correspondences that are required to satisfy a ranking condition (Cond. 1). The raw local feature is absorbed into a single real-valued matchability by a function implemented as standard multi-layer perceptrons (MLPs). Here, Cond. 1 aims to preserve the ranking of each keypoint, and hence improves the repeatability of the prediction. The condition can be rewritten as a single inequality, from which the final objective (Eq. 3) is obtained with a hinge loss.
In the proposed framework, the matchability is learned as an auxiliary task; it is then passed through an activation and associated with keypoint coordinates as the network input, as in Fig. 2. Besides Eq. 3, the gradient from the final augmented features also flows through the matchability predictor, allowing a joint optimization of the entire encoder. The visualization of predicted matchability is shown in Fig. 4.
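As an illustration of the ranking objective, the sketch below implements one common quad-style formulation: for every pair of distinct correspondences, the matchability ordering in the first image should agree with the ordering in the second image, and disagreement is penalized with a hinge. This is an assumption-laden sketch, not the exact objective of the paper: the product form of the ranking term, the unit margin, and the function name `quad_ranking_hinge_loss` are choices made here.

```python
import numpy as np

def quad_ranking_hinge_loss(m_a, m_b):
    """Hinge loss on ranking agreement of matchability scores.

    m_a, m_b: arrays of shape (K,), the predicted matchability of K
    corresponding keypoints in image A and image B (entry i of m_a
    corresponds to entry i of m_b).
    """
    da = m_a[:, None] - m_a[None, :]   # m_a[i] - m_a[j]
    db = m_b[:, None] - m_b[None, :]   # m_b[i] - m_b[j]
    r = da * db                        # positive when both images rank (i, j) alike
    mask = ~np.eye(len(m_a), dtype=bool)  # keep only pairs with i != j
    # Margin of 1 is an assumption of this sketch.
    return np.maximum(0.0, 1.0 - r)[mask].mean()
```

A loss of zero means that all pairs are ranked consistently with a sufficient gap; reversed rankings yield a large loss.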
This module consumes the regional features from Sec. 3.1, the local features and their locations, and produces augmented features. To integrate visual information at different levels, a valid option is to concatenate the global representation of the entire image to the raw local features. In our framework, the global feature can be derived by applying Maximum Activations of Convolutions (MAC) aggregation, which simply max-pools over all dimensions of regional features. However, such a compact representation is shown to obscure the raw local description, due to the lack of spatial distinctions (Sec. 4.4.1). Hence, we stick to the regional representation, where the key issue is to handle the different numbers of regional features and keypoints.
To achieve the goal, we associate regional features with a regular sampling grid on the image, then interpolate the grid points at the coordinates of the keypoints. For interpolation, we use an inverse-distance-weighted average based on nearest neighbors: each interpolated feature is the weighted average of the regional features located at the nearest grid points, where each weight is inversely related to the distance between the interpolated point and the original grid point. Next, the dimensionality is reduced by applying point-wise MLPs, where we also insert CN after each perceptron in order to capture global context. Finally, raw local features are concatenated and further mapped by MLPs, forming the final 128-d features. The above process is illustrated in Fig. 2.
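The interpolation step can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a brute-force nearest-neighbor search and plain inverse-distance weights; the function name `idw_interpolate`, the default `k=3`, and the `eps` term are hypothetical choices made here and are not taken from the paper.

```python
import numpy as np

def idw_interpolate(grid_xy, grid_feat, kpt_xy, k=3, eps=1e-8):
    """Interpolate regional features at keypoint locations.

    grid_xy:   (R, 2) coordinates of the regular sampling grid
    grid_feat: (R, C) regional feature at each grid point
    kpt_xy:    (K, 2) keypoint coordinates
    returns:   (K, C) interpolated features
    """
    # Pairwise distances between keypoints and grid points, shape (K, R).
    d = np.linalg.norm(kpt_xy[:, None, :] - grid_xy[None, :, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]          # k nearest grid points per keypoint
    d_k = np.take_along_axis(d, idx, axis=1)    # their distances, shape (K, k)
    w = 1.0 / (d_k + eps)                       # inverse-distance weights
    w = w / w.sum(axis=1, keepdims=True)        # normalize to sum to one
    return (w[:, :, None] * grid_feat[idx]).sum(axis=1)
```

The output has one feature per keypoint regardless of the grid resolution, which is what resolves the mismatch between the number of regional features and the number of keypoints.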
To aggregate the above two types of contextual features, similar to the CS structure, one option is to concatenate them, forming features of higher dimensionality. However, the increased dimensionality introduces excessive computational cost in the matching stage. Instead, as shown in Tab. 2, we propose to combine different feature streams into a single vector by element-wise summation and L2-normalization, i.e., without altering the feature dimensionality. Besides its simplicity, such a strategy allows flexible use of the proposed augmentation. For example, in situations where regional features are not available, one may aggregate with only geometric context without the need to retrain the model.
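The summation-based aggregation is simple enough to show directly. The sketch below assumes that all streams share the same dimensionality and that the raw local feature takes part in the sum; the function name `aggregate` and the `eps` term are illustrative choices made here.

```python
import numpy as np

def aggregate(local_feat, *context_feats, eps=1e-12):
    """Sum feature streams element-wise, then L2-normalize each row.

    local_feat and every entry of context_feats have shape (K, D);
    the result keeps shape (K, D).
    """
    total = local_feat + sum(context_feats)
    return total / (np.linalg.norm(total, axis=-1, keepdims=True) + eps)
```

Because the streams are summed rather than concatenated, a stream can simply be left out at test time, e.g. `aggregate(local_feat, geo_feat)` when no regional features are available, and the output dimensionality stays the same.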
N-pair losses have been primarily used by recent works. Empirically, the subtractive hinge loss [25, 23, 7] has shown better performance; its main idea is to push similar samples away from dissimilar ones by a certain margin in the descriptor space. However, setting the appropriate margin is tricky and does not always assure convergence, as observed in [23, 7]. More generally, the criteria for a good loss have been studied in prior work, which provides guidelines on tuning loss parameters on a particular dataset. In this spirit, we aim to further ease the pain of parameter search and obtain an adaptive loss that allows fast convergence regardless of the learning difficulty.
We use the log-likelihood form of the N-pair loss as a base, which originally does not involve any tunable parameter. Formally, given two sets of L2-normalized feature descriptors, a distance matrix is computed between them. By applying both row-wise and column-wise softmax to this matrix, we derive the final loss.
Note that since the input features are L2-normalized, the entries of the resulting matrix are bounded, which causes convergence issues due to the scale sensitivity of the softmax function. To address this, we introduce a single trainable parameter, referred to as the softmax temperature, to amend the inability to re-scale the input.
The temperature is regularized with the same weight decay as the network, and hence does not require any manual tuning or complex heuristics. In the experiments in Sec. 4.4.2, we show that this simple technique improves drastically over the original form, whose performance we suspect is hindered by the above-mentioned scale sensitivity. In the proposed framework, we compute the N-pair loss on the augmented features and obtain the total loss, whose remaining parameter is fixed in the experiments.
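The effect of the temperature can be seen in a small NumPy sketch of a log-likelihood N-pair loss. Several points are assumptions of this sketch rather than statements about the paper: it scores pairs with the inner product of descriptors (so larger is better), it treats the temperature `alpha` as a plain multiplicative scale that would be a trainable variable in a real training loop, and it averages the row-wise and column-wise terms with equal weight.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def npair_loss(f1, f2, alpha=1.0):
    """Log-likelihood N-pair loss with a scale (temperature) parameter.

    f1, f2: (N, D) L2-normalized descriptors; row i of f1 matches row i of f2.
    alpha:  scalar that re-scales the bounded similarities before softmax.
    """
    s = alpha * (f1 @ f2.T)                  # (N, N) scaled similarity matrix
    row = np.diag(log_softmax(s, axis=1))    # row-wise softmax, matching pairs
    col = np.diag(log_softmax(s, axis=0))    # column-wise softmax, matching pairs
    return -0.5 * (row.mean() + col.mean())
```

With perfectly matched orthonormal descriptors, the loss at `alpha = 1` is still far from zero because the similarities are confined to a narrow range, whereas a larger `alpha` lets the softmax saturate; this is the scale sensitivity that a learned temperature is meant to remove.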
Training details. Although the framework is end-to-end trainable, we fix the local and regional feature extractors in Sec. 3.1 during training, in order to clearly demonstrate the efficacy of the proposed augmentation scheme. We train the networks using SGD with a base learning rate of 0.05, weight decay of 0.0001 and momentum of 0.9. The learning rate exponentially decays by 0.1 every 100k steps. The batch size is set to 2, and each time 1024 keypoints are randomly sampled, including random numbers of matchable and noisy keypoints (see Appendix A.1). Input patches are standardized to have zero mean and unit norm, while input keypoint coordinates are normalized with respect to the image size.
Training dataset. Although UBC Phototour  is used as a common practice, this dataset consists of only three scenes with limited diversity of keypoint distribution. In order to achieve better generalization ability, we resort to large-scale photo-tourism [46, 33] and aerial datasets (GL3D)  as in [48, 23], and generate groundtruth matches from SfM. We manually exclude the data that is used in the evaluation.
Data augmentation. We randomly perturb input patches by affine transformations including rotation (90), anisotropic scaling and translation w.r.t. the detection scale. For keypoint augmentation, we perturb the coordinate with random homography transformation as in  (see Appendix A.1).
Homography dataset. HPatches  is a large-scale patch dataset for evaluating local features regarding illumination and viewpoint changes. As groundtruth homographies and raw images are provided, HPatches can also be used to evaluate image matching performance, which we accordingly refer to as HPSequences as in , consisting of 116 sequences and 580 image pairs.
Wild dataset. Similar to settings in , we also evaluate on outdoor YFCC100M  (1000 pairs) and indoor SUN3D  (539 pairs) datasets. Compared with HPSequences, the two datasets additionally introduce variations such as self-occlusions, and in particular, repetitive or feature-poor patterns in indoor scenes, which is generally considered challenging for sparse matching.
SfM dataset. Following , we evaluate on SfM datasets such as the well-known Fountain and Herzjesu , or landmark collections . We integrate the proposed framework into an SfM pipeline, i.e., COLMAP , and use the keypoints provided in  to compute the local features.
|Visual context encoder: strategy|Recall (i)|Recall (v)|Geometric context encoder: network architecture|Recall (i)|Recall (v)|Comparison with other methods: method|Recall (i)|Recall (v)|
|---|---|---|---|---|---|---|---|---|
|baseline (GeoDesc)|59.46|71.24|baseline (GeoDesc)|59.46|71.24|SIFT|47.36|53.06|
|CS (256-d) [50, 19, 43]|59.83|71.27|PointNet|59.61|70.96|L2-Net|47.58|53.96|
|w/ global feature|59.11|71.02|w/ CN (pre.) + xy|61.67|72.63|HardNet|57.63|63.36|
|w/ regional feature|63.64|73.37|w/ CN (pre.) + xy + raw local feature|60.91|72.99|GeoDesc|59.46|71.24|
|w/ regional feature + CN|63.98|73.63|w/ CN (orig.) + xy + matchability|59.94|71.25|ContextDesc|66.55|75.52|
||||w/ CN (pre.) + xy + matchability|62.82|73.40|ContextDesc+|67.14|76.42|
Patch level. For HPatches , we follow its evaluation protocols and use mean average precision (mAP) for three subtasks, including patch verification, matching, and retrieval.
Image level. For HPSequences, we use Recall = # Correct Matches / # Correspondences defined in , to quantify the image matching performance, where # Correct Matches are matches found by nearest neighbor searching and verified by groundtruth geometry, e.g., homography, while # Correspondences are matches that should have been identified given the keypoint locations. Following , a match is determined to be correct if it is within 2.5 pixels of the warped keypoint in the reference image. We use a standard SIFT detector to localize the keypoints, from which 2048 are randomly sampled. For YFCC100M  and SUN3D , we follow the same setting in  and report the median number of inlier matches after RANSAC for each dataset.
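The Recall metric can be sketched as follows. This is a simplified reading of the definition above and not the benchmark's reference implementation: it assumes a query-to-reference homography, counts a query keypoint as a correspondence when any reference keypoint lies within the threshold of its warped location, and uses brute-force nearest-neighbor matching on Euclidean descriptor distance. The names `warp` and `recall` are illustrative.

```python
import numpy as np

def warp(h, pts):
    """Apply a 3x3 homography h to (N, 2) points."""
    ph = np.hstack([pts, np.ones((len(pts), 1))]) @ h.T
    return ph[:, :2] / ph[:, 2:3]

def recall(desc_ref, desc_qry, kpt_ref, kpt_qry, h_qry_to_ref, thresh=2.5):
    """Recall = # correct matches / # correspondences."""
    warped = warp(h_qry_to_ref, kpt_qry)                               # (Nq, 2)
    geo = np.linalg.norm(warped[:, None] - kpt_ref[None], axis=-1)     # (Nq, Nr)
    # Query keypoints that have a reference keypoint close to their warped location.
    n_corr = int((geo.min(axis=1) <= thresh).sum())
    # Nearest neighbor of each query descriptor among the reference descriptors.
    dist = np.linalg.norm(desc_qry[:, None] - desc_ref[None], axis=-1)
    nn = dist.argmin(axis=1)
    # A match is correct if the matched reference keypoint is within the threshold.
    n_correct = int((geo[np.arange(len(nn)), nn] <= thresh).sum())
    return n_correct / max(n_corr, 1)
```

For example, with identical keypoints, an identity homography and perfectly discriminative descriptors, the function returns 1.0; swapping the descriptors of two distant keypoints out of four lowers it to 0.5.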
Reconstruction level. For clarity, we report metrics in  that quantify the completeness of SfM, including the number of registered images (# Registered), sparse points (# Sparse Points) and image observations (# Observations).
In this section, we evaluate two splits of HPSequences : illumination (i) and viewpoint (v), regarding different image transformations. We report Recall as defined in Sec. 4.3. If not specified, we use GeoDesc  as a baseline model (baseline (GeoDesc)) to extract raw local features, whose parameters are fixed during the training of augmentation.
Visual context. We compare four designs, including i) CS (256-d): the central-surround (CS) structure [50, 19, 43] as described in Sec. 2, which concatenates local features from different domain sizes. ii) w/ global feature: the integration with global features , which is originally designed for improving 3D local descriptors. iii) w/ regional feature: the proposed integration with interpolated regional features, and its variant iv) w/ regional feature + CN: with context normalization to incorporate global visual information.
As shown in Tab. 1 (left columns), the CS structure [50, 19, 43] delivers only marginal improvements despite the doubled dimensionality. Meanwhile, though being effective in 3D descriptor learning, the integration with global features  instead harms the performance, which we ascribe to the limited representation ability of a single global feature. Finally, the proposed integration with interpolated regional features shows clear improvements, as it better handles both spatial and visual distinctiveness. Moreover, to strengthen global context awareness, we show that the performance can be further boosted by equipping context normalization when encoding regional features.
Geometric context. We study five options: i) PointNet-like architecture, i.e., segmentation networks in  without the final classifier. ii) Pre-activated context normalization (CN) networks in Sec. 3.2 with 2D xy input, and its variants iii) with additional raw local feature input or iv) with matchability. We also compare the use of pre-activation of the residual unit in context normalization networks.
As presented in Tab. 1 (middle columns), though widely used in processing 3D points, PointNet  does not perform well in our task, and a similar phenomenon is also observed in  when processing 2D correspondences. Besides, we notice that adding the raw local feature as input does not help to boost the performance, which we attribute to the weak relevance between local features extracted from different orientations and levels of the scale space pyramid. Instead, the incorporation of matchability is notably beneficial, as matchability is more comprehensive as a high-level abstraction of the local feature. Finally, the pre-activation is clearly preferable to the original design.
Integration with cross-modality context. Finally, we evaluate the full augmentation with both visual and geometric context (ContextDesc). As shown in Tab. 1 (right columns), the simple summation aggregation in Sec. 3.4 effectively takes advantage of both contexts, delivering remarkable improvements over the state of the art.
To demonstrate the validity of proposed loss in Sec. 3.5, we train only the local base model without any context awareness, and compare different losses including: i) the plain N-pair loss in  without scale temperature, and ii) the scale-aware loss in  with its original parameters.
|GeoDesc ||w/ loss in ||w/ loss in ||Ours|
|HPatches, mAP [%]|
As shown in Tab. 2, the proposed loss improves the overall performance over the previous best-performing GeoDesc  under similar training settings, while GeoDesc requires additional geometric supervision. Besides, the proposed loss clearly shows better convergence compared with losses in  and . Although we suspect that the loss in  may perform better with careful parameter searching, the proposed loss is advantageous due to its self-adaptivity without the need of complex heuristics or manual tuning.
Moreover, once GeoDesc is replaced with the above model as the base in the augmentation scheme, the final performance can be further improved by a significant margin, denoted as ContextDesc+ in Tab. 1 (right columns), which again highlights the benefit of an improved base model. We will use this model to complete the following experiments.
Wild dataset. The evaluation results on two challenging datasets (outdoor YFCC100M  and indoor SUN3D ) are presented in Tab. 3. The proposed cross-modality context augmentation delivers improvements on both datasets over the previous state of the art, which effectively demonstrates the strong generalization ability of the learned augmented features in practical scenes.
|SIFT ||L2-Net ||HardNet ||GeoDesc ||Ours|
|median number of inlier matches|
SfM dataset. We further demonstrate the improvement in complex SfM pipeline. As shown in Tab. 4, the integration of augmented feature generalizes well among different scenes even in large-scale SfM tasks, meanwhile consistently boosts the completeness of sparse reconstruction. Some matching results are presented in Fig. 5, and more visualizations can be found in the appendix.
|# Images||# Registered||# Sparse Points||# Observations|
Invariance property. We again use Recall and evaluate on the Heinly benchmark  to quantify the invariance property. As shown in Tab. 5, the proposed method improves remarkably over the previous best-performing descriptor, except for some minor underperformance regarding Rotation change when images are rotated up to 180°, which may be caused by the inability to be fully rotation-invariant, especially for the regional feature extractor.
|SIFT ||GeoDesc ||Ours|
Computational cost. Towards practicability, we only use shallow MLPs or non-parametric context normalization in the augmentation framework, which thus introduces only insignificant computation overhead. As reported in Tab. 6, supposing that regional features are readily extracted, e.g., from a retrieval model deployed in an SfM pipeline for accelerating image matching, the full augmentation then requires only a fraction of the time cost of the raw local feature description. In principle, the proposed framework allows flexible integration and reuse of other visual components to achieve system-level efficiency, such as saliency or segmentation masks, and thus has ample room for future improvements.
|local feat.||regional feat.||geo. context||vis. context||multi-context|
End-to-end training. For ablation purposes, the parameters of base local and regional models are previously fixed in the training, and we here provide further studies about the efficacy of an end-to-end training scheme.
In the first setting, we freeze only the regional model and train from scratch with Eq. 7 on the augmented feature. As a result, the performance is notably improved from 67.14 to 67.53, and 76.42 to 77.20 for i/v sequences of HPSequences, compared with ContextDesc+ in Tab. 1.
In the second setting, we further train end-to-end with the regional model, which is additionally optimized by a standard cross-entropy classification loss as in  for simplicity (see Appendix A.1 for details). Although several loss balancing strategies have been tried, we did not observe a consistent improvement in final matching performance, which we ascribe to the substantial challenge posed by multi-task learning. Thus, we currently recommend a separate training for the regional model, and look forward to an improved solution in the future.
In contrast to current trends, we have addressed the importance of introducing context awareness to augment local feature descriptors. The proposed framework takes keypoint location, raw local and high-level regional feature as input, from which two types of context are encoded: geometric and visual context, while the training adopts a novel N-pair loss that is self-adaptive and parameter-tuning free. We have conducted extensive evaluations on diversified and large-scale datasets, and demonstrate remarkable improvements over the state of the art, meanwhile showing the strong generalization and practicability in real applications.
This work is supported by Hong Kong RGC GRF 16203518, T22-603/15N, ITC PSKL12EG02. We thank the support of Google Cloud Platform.
Learning local feature descriptors with triplets and shallow convolutional neural networks. In BMVC, 2016.
Learning local image descriptors with deep siamese and triplet convolutional networks by minimising global loss functions. In CVPR, 2016.
In terms of the matchability predictor, we construct 4-layer MLPs whose output node numbers are 128, 32, 32, 1, respectively. The visual context encoder is composed of two 2-layer MLPs, located before/after the concatenation with raw local features. We insert context normalization only into the former MLPs, while insertion in the latter one is observed to harm the performance.
The retrieval model is trained on the Google-Landmarks Dataset, which contains more than 1M landmark images. In our experiments, we compared different networks in terms of retrieval performance. In brief, ResNet-101 is slightly better than ResNet-50, while VGG and AlexNet notably underperform. We choose ResNet-50 for a better trade-off in memory and speed.
Instead of adopting the training scheme in , we find that the model pretrained on a landmark classification task (containing 15K classes) as in  suffices to produce satisfactory results in practice, and avoids the difficulties of preparing training data for Siamese networks or hard negative mining with complex heuristics. We have evaluated the retrieval model with MAC aggregation on the standard Oxford buildings dataset, where we obtain an mAP of 0.83, on par with  of 0.80.
Similar to , we use a 4-point parameterization to represent the homography, in which four corner points at fixed locations are displaced by random variables drawn from a bounded range. The result can easily be converted to a standard homography by, e.g., the normalized Direct Linear Transform (DLT) algorithm. In our implementation, we apply the random homography to each keypoint coordinate set before feeding it into the geometric context encoder.
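As an illustration of the 4-point parameterization, the sketch below perturbs four corners, recovers the 3x3 homography, and warps keypoint coordinates with it. Several things here are assumptions: the corner locations (a square in normalized coordinates), the offset range `max_offset`, and all function names are hypothetical, and the conversion solves the exact 8x8 linear system with the last entry fixed to 1 rather than running the normalized DLT mentioned in the text (the two agree for four non-degenerate correspondences).

```python
import random

def solve_linear(a, b):
    """Gauss-Jordan elimination with partial pivoting for a square system."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]

def homography_from_4pt(src, dst):
    """3x3 homography (last entry fixed to 1) mapping four src points onto dst."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        b.append(u)
        a.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]

def apply_homography(h, points):
    """Warp 2D points with a homography, including the projective division."""
    warped = []
    for x, y in points:
        w = h[2][0] * x + h[2][1] * y + h[2][2]
        warped.append(((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
                       (h[1][0] * x + h[1][1] * y + h[1][2]) / w))
    return warped

def random_homography(rng, corners, max_offset):
    """Displace each corner by uniform random offsets and fit the homography."""
    perturbed = [(x + rng.uniform(-max_offset, max_offset),
                  y + rng.uniform(-max_offset, max_offset)) for x, y in corners]
    return homography_from_4pt(corners, perturbed), perturbed

# Hypothetical corner layout and offset range (not taken from the paper).
CORNERS = [(-1.0, -1.0), (1.0, -1.0), (1.0, 1.0), (-1.0, 1.0)]
rng = random.Random(0)
H, perturbed_corners = random_homography(rng, CORNERS, max_offset=0.3)
keypoints = [(0.0, 0.0), (0.5, -0.25), (-0.8, 0.6)]
augmented_keypoints = apply_homography(H, keypoints)
```

In this sketch the same `H` is applied to every keypoint of one coordinate set, so relative geometry is distorted consistently, which is the kind of perturbation a geometric context encoder would need to become robust to.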
The training of the proposed framework needs to be conducted on image pairs instead of isolated patches, since we also take keypoint coordinates as input. Such data organization is referred to as simulating image matching in . However, the simulation in  is not complete, as it considers only keypoints that have successfully established correspondences, whereas in real situations only a subset of keypoints is repeatable in other images. In practice, as illustrated in Fig. 6, we divide keypoints obtained from SfM as in [48, 23] into three categories: i) Matchable: repeatable and verified by SfM; ii) Undiscovered: repeatable but did not survive SfM; iii) Unrepeatable: unable to be re-detected in other images.
In training, we randomly sample a number of matchable keypoints as well as some undiscovered and unrepeatable keypoints (denoted as noisy keypoints), so that the simulation is complete, which is necessary to acquire strong generalization ability. Otherwise, the training would consider only the ideal setting in which all keypoints are matchable, which is inconsistent with real applications.
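The sampling step can be sketched as follows. The function name, the keypoint ids, and the sample counts are hypothetical; the text does not state how many keypoints of each category are drawn, nor in what proportion.

```python
import random

def sample_training_keypoints(matchable, undiscovered, unrepeatable,
                              n_matchable, n_noisy, rng):
    """Draw matchable keypoints plus 'noisy' ones for one image.

    Noisy keypoints are drawn from the union of the undiscovered and
    unrepeatable categories (pooling them is an assumption of this sketch).
    """
    picked_matchable = rng.sample(matchable, min(n_matchable, len(matchable)))
    noisy_pool = list(undiscovered) + list(unrepeatable)
    picked_noisy = rng.sample(noisy_pool, min(n_noisy, len(noisy_pool)))
    return picked_matchable, picked_noisy

# Toy keypoint ids for one image: 0-99 matchable, 100-129 undiscovered,
# 130-159 unrepeatable.
rng = random.Random(0)
matchable_ids, noisy_ids = sample_training_keypoints(
    list(range(100)), list(range(100, 130)), list(range(130, 160)),
    n_matchable=8, n_noisy=4, rng=rng)
```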
To accommodate the above training strategy, we need to adapt the loss formulation. Formally, given the index sets determined by the numbers of matchable and noisy keypoints for an image pair, the losses of Eq. 3 and Eq. 5 are rewritten over these sets.
Adding noisy keypoints first influences the encoding of geometric context, posing a harder training setting and playing a key role in acquiring the invariance properties. Second, it influences the computation of the loss, as those keypoints are all cross-paired as negative samples. It also enables us to increase the number of pairs in each batch, i.e., 1024 in our implementation compared with 64 in GeoDesc, which boosts the effectiveness of the N-pair loss as observed in .
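To make the cross-pairing of noisy keypoints concrete, the sketch below assumes that the N-pair loss takes a softmax cross-entropy form over temperature-scaled descriptor similarities. That form, the single matching direction, and all names are assumptions of this sketch and are not copied from Eq. 3 or Eq. 5.

```python
import math

def log_softmax_at(logits, target):
    """Numerically stable log-softmax evaluated at one target index."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return logits[target] - log_norm

def n_pair_loss_with_noise(anchors, candidates, temperature):
    """Softmax cross-entropy over similarities, one row per matchable anchor.

    anchors:    descriptors of the matchable keypoints in the first image.
    candidates: descriptors in the second image; the first len(anchors)
                entries are the true matches (same order), and any further
                entries are noisy keypoints that act only as negatives.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [temperature * sum(a * c for a, c in zip(anchor, cand))
                  for cand in candidates]
        total -= log_softmax_at(logits, i)
    return total / len(anchors)

anchors = [(1.0, 0.0), (0.0, 1.0)]
matches = [(1.0, 0.0), (0.0, 1.0)]
noisy = [(0.6, 0.8)]
loss_clean = n_pair_loss_with_noise(anchors, matches, temperature=10.0)
loss_noisy = n_pair_loss_with_noise(anchors, matches + noisy, temperature=10.0)
```

Under this assumed form, every noisy keypoint adds one more term to each softmax denominator, so `loss_noisy` is strictly larger than `loss_clean`: the noisy keypoints make the task harder without contributing positives of their own.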
In this work, as in Sec. 3.4, we simply sum and normalize the cross-modality features for aggregation. We have also attempted to make this module learnable by concatenating the features and feeding them to several fully-connected layers. However, the experimental results showed a considerable performance decrease from this choice, i.e., a 2-point decrease on HPatches even compared with the base model, GeoDesc. Our observation is that the raw local features should be preserved as much as possible, and that a learnable aggregation results in over-parameterization and an inability to represent the local detail.
We plot the growth of the softmax temperature and the corresponding loss decrease in Fig. 7. As can be seen, the softmax temperature grows quickly at the beginning and gradually converges to a constant value, . As mentioned in Sec. 3.5, the softmax temperature is regularized with the same weight decay as the network, whereas we have observed that eschewing the regularization does not harm the performance, but results in a larger temperature value, i.e., .
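A minimal sketch of the sum-and-normalize aggregation, assuming that the three feature vectors share one dimensionality and that "normalize" means L2 normalization; the function names and toy values are hypothetical.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against a zero vector)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]

def aggregate(raw_local, visual_context, geometric_context):
    """Element-wise sum of the three features followed by L2 normalization."""
    summed = [r + v + g
              for r, v, g in zip(raw_local, visual_context, geometric_context)]
    return l2_normalize(summed)

# Toy 4-d features; the real descriptors are higher-dimensional.
augmented = aggregate([0.5, 0.5, 0.5, 0.5],
                      [0.1, -0.2, 0.0, 0.3],
                      [0.0, 0.1, -0.1, 0.2])
```

Because the aggregation has no parameters, the output keeps the dimensionality of the raw local feature and the raw feature enters the sum unchanged, which fits the observation above that the raw local features should be preserved as much as possible.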
In the previous experiments on image matching, we did not apply any outlier rejection (e.g., mutual check, ratio test ) to any method for fair comparison, whereas early outlier rejection is critical and necessary for later geometry computation, e.g., recovering camera pose. In particular, the ratio test  has demonstrated remarkable success; we thus follow the practice in  to determine the ratio criterion of the proposed augmented descriptor. Specifically, given # Correct Matches defined in Sec. 4.3, we test on HPSequences  and aim to find a proper ratio that achieves Precision = # Correct Matches / # Putative Matches similar to SIFT. As a result, we choose  for the proposed descriptor.
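The procedure of choosing a ratio can be sketched as below, with precision computed as correct matches over putative matches. The distance matrix, the ground-truth set, and the candidate ratios are toy values, and `pick_ratio` is a hypothetical helper that simply selects the candidate whose precision is closest to a target.

```python
def ratio_test_matches(dist_rows, ratio):
    """Keep a nearest-neighbour match only if it passes the ratio test.

    dist_rows[i][j] is the descriptor distance between query i and candidate j.
    """
    matches = []
    for i, row in enumerate(dist_rows):
        order = sorted(range(len(row)), key=row.__getitem__)
        best, second = order[0], order[1]
        if row[best] < ratio * row[second]:
            matches.append((i, best))
    return matches

def precision(putative, ground_truth):
    """Fraction of putative matches that are correct."""
    if not putative:
        return 0.0
    return sum(1 for m in putative if m in ground_truth) / len(putative)

def pick_ratio(dist_rows, ground_truth, target_precision, candidates):
    """Choose the candidate ratio whose precision is closest to the target."""
    return min(candidates, key=lambda r: abs(
        precision(ratio_test_matches(dist_rows, r), ground_truth)
        - target_precision))

# Toy data: 3 queries against 3 candidates, with known correct matches.
dist_rows = [[0.1, 0.9, 0.8],
             [0.5, 0.55, 0.9],
             [0.7, 0.2, 0.6]]
ground_truth = {(0, 0), (1, 1), (2, 1)}
strict = ratio_test_matches(dist_rows, 0.8)   # rejects the ambiguous query 1
loose = ratio_test_matches(dist_rows, 0.95)   # accepts it, but wrongly
```

In the toy data, the strict ratio discards the ambiguous second query while the loose one accepts its wrong nearest neighbour, which is the precision trade-off the ratio criterion controls.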
Tab. 7: mAP of pose (error threshold 20°).
To demonstrate the efficacy of the obtained ratio, we evaluate on the wild indoor/outdoor data [47, 42] with an error metric of relative camera pose accuracy. Following the protocols defined in , we use the mean average precision (mAP) at a certain threshold (e.g., 20°) to quantify the error of rotation and translation. For comparison, we use a ratio criterion of 0.80 for SIFT and 0.89 for GeoDesc, and present the evaluation results in Tab. 7, which demonstrate consistent improvements with proper outlier rejection.
Somewhat counter-intuitively, the CS structure improves marginally on image matching tasks as reported in Tab. 1. To further study this phenomenon, we compare the patch sampling from different domain sizes, including the original SIFT’s () as used in previous experiments, half () or double () sizes. We also compare the aggregation of multiple sizes, i.e., the original and halved or the original and doubled . Instead of concatenating features as used by CS structure, we apply simple summing-and-normalizing aggregation in Sec. 3.4 to avoid increasing the dimensionality.
We experiment with our ContextDesc+ model in Tab. 1, and present the comparison results with different domain sizes in Tab. 8. As can be seen, when only a single size is adopted, the original ‘’ performs best, consistent with the training. In addition, when combining a larger size (), we can further boost the proposed method by a considerable margin, yet at excessive computational cost that doubles the inference time. In practice, the aggregation of different domain sizes is compatible with the proposed framework, and can be applied when high accuracy is in demand.
Tab. 8: Recall (i/v) for different domain sizes.
We further demonstrate the robustness to density change on HPSequences, whose images are feature-rich and have up to 15k keypoints. Besides sampling different numbers of keypoints, we consider a more challenging case where all detected keypoints are used. As presented in Fig. 9, the proposed method delivers consistent improvements in all cases, which demonstrates the reliable invariance property acquired by the context encoders.
To better interpret the functionality of the proposed matchability predictor in Sec. 3.2, we quantitatively evaluate its performance when used as a keypoint detector. Following , we apply the matchability predictor to the entire image, then select the top responses after NMS as keypoints, whose performance is measured by repeatability. Compared with the SIFT detector, the results improve from 32.81 to 37.93 and from 25.53 to 26.34 on the i/v sequences of HPatches. While detector performance is not the focus of this paper, we believe that with more advanced techniques this module could benefit the joint training of keypoint detector and descriptor, and leaves ample room for future improvement.
We use an open-source implementation of VocabTree (https://github.com/hlzz/libvot) for evaluating image retrieval performance, and compare SIFT, GeoDesc and the proposed ContextDesc. The mAPs on the Paris dataset for the three competitors are 49.89, 53.84 and 61.29, while those on Oxford buildings are 47.27, 53.29 and 61.64. After re-ranking the top-100 results by spatial verification, the mAPs on Paris improve to 52.23, 55.02 and 64.53, and those on Oxford to 51.64, 54.98 and 65.03. These results demonstrate the superiority of the proposed method.
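Selecting the top responses after NMS can be sketched on a dense response map as follows. The square suppression window, its `radius`, the value of `top_k`, and the toy response values are assumptions; the text does not specify which NMS variant is used.

```python
def top_responses_after_nms(response, radius, top_k):
    """Return (x, y) of the strongest local maxima of a 2D response map.

    A pixel survives if no value inside its (2 * radius + 1)-sized square
    window exceeds it; ties on a plateau are all kept in this sketch.
    """
    height, width = len(response), len(response[0])
    peaks = []
    for y in range(height):
        for x in range(width):
            value = response[y][x]
            window_max = max(
                response[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1)))
            if value >= window_max:
                peaks.append((value, x, y))
    peaks.sort(reverse=True)  # strongest response first
    return [(x, y) for _, x, y in peaks[:top_k]]

# Toy matchability map; in practice this would be the predictor's dense output.
response_map = [[0.1, 0.2, 0.1, 0.0],
                [0.2, 0.9, 0.3, 0.1],
                [0.1, 0.3, 0.2, 0.7],
                [0.0, 0.1, 0.6, 0.4]]
keypoints = top_responses_after_nms(response_map, radius=1, top_k=2)
```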