PhotoSynth Dataset for improving local patch Descriptors
We propose a new dataset for learning local image descriptors that can be used for significantly improved patch matching. Our dataset contains an order of magnitude more scenes, images, and positive and negative correspondences than the currently available Multi-View Stereo (MVS) dataset from Brown et al. It also has better coverage of viewpoint, scale, and lighting changes than the MVS dataset. In addition, our dataset provides supplementary information such as RGB patches with scale and rotation values, and intrinsic and extrinsic camera parameters, which, as shown later, can be used to customize the training data for a given application. We train an existing state-of-the-art model on our dataset and evaluate it on publicly available benchmarks, namely the HPatches dataset and the dataset of Strecha et al., to quantify descriptor performance. Experimental evaluations show that descriptors trained on our dataset outperform the current state-of-the-art descriptors trained on MVS by 8% on the matching task and by about 10% on the retrieval task of the HPatches dataset. Similarly, on the Strecha dataset, we see an improvement of 3-5% on the matching task.
Finding correspondences between images using descriptors is important in many computer vision tasks such as 3D reconstruction, structure from motion (SFM), wide-baseline matching, image panorama stitching, and tracking [6, 7]. However, due to changes in viewpoint, scale variations, occlusion, and variations in illumination and shading in real-world scenarios, finding correspondences in the wild is challenging and remains an active area of research.
Traditionally, handcrafted descriptors such as SIFT, SURF, and LIOP were used. These types of descriptors encode pixel, super-pixel, or sub-pixel level statistics. However, handcrafted features cannot capture higher-level structural information. Learning-based descriptors using Convolutional Neural Networks (CNNs), on the other hand, have the potential to capture higher-level structural information and to generalize well. Hence, CNN-based descriptors have been gaining importance in recent years [10, 11, 12, 13, 14, 15, 16].
Many works on CNN-based descriptors focus on the architecture, on defining better loss functions [13, 16], and on improving training strategies [14, 15] to enhance descriptor quality and achieve state-of-the-art results. As previously noted, it is unclear whether these descriptors can be used in applications where the data is not representative of the dataset they were trained on. This is because some datasets are small [17, 18], some lack diversity [1, 19], and in others the scenes are obtained through controlled laboratory experiments using small toys. As a result, despite the wide variety of datasets available in the literature [1, 19, 18, 17, 20], they cannot be employed to design descriptors for applications in the wild.
Recently, the HPatches dataset has been proposed as a benchmark for the evaluation of local features. This dataset is large and diverse, with clear protocols for evaluation metrics and reproducibility. HPatches overcomes the shortcomings of older, smaller datasets such as Oxford-Affine that were used as evaluation benchmarks. Although HPatches is an excellent benchmark for evaluation, it is seldom used for training, because the images in its scenes are related only by 2D homographies and such an assumption cannot be made for real-world applications.
The most frequently used dataset for training local descriptors is the Multi-View Stereo (MVS) dataset from Brown et al. The MVS dataset comprises matching and non-matching pairs for training, obtained from scenes of real-world objects captured from different viewpoints. However, the MVS dataset consists of only 3 scenes and cannot be considered diverse enough. Data augmentation is one of the traditional methods employed to increase the size of a dataset. Mishchuk et al. highlighted the importance of data augmentation and achieved state-of-the-art results. Nevertheless, data augmentation cannot substitute for the advantages of training with a larger and more diverse dataset. These drawbacks of the current datasets limit the potential of powerful CNN-based approaches and highlight the need for an improved, next-generation dataset, as concluded in earlier work.
In this paper, we introduce a novel dataset for training CNN-based descriptors that overcomes many drawbacks of current datasets such as MVS. It has a sufficiently large number of scenes, is diverse, and has better coverage of viewpoint, scale, and illumination. Moreover, the dataset contains RGB patches together with their location, scale, and rotation, which allows them to be mapped back onto the scene. The dataset also provides intrinsic and extrinsic camera parameters for all the images in a scene, which makes it possible to control the scale and viewpoint variations allowed between matching correspondences. With these ingredients, the dataset is well suited for learning descriptors that can be customized to diverse tasks, including narrow-baseline and wide-baseline matching.
A sampling technique for generating matching correspondences is also introduced. This type of sampling ensures that the training dataset has sufficient variations in viewpoint and scale while generating patch-pairs and avoids the generation of redundant patch-pairs having similar contextual information.
The success of CNNs in various computer vision tasks can be partly attributed to the availability of large datasets for training. An ideal dataset for learning a particular task should capture all the real-world scenarios involved in that task. An example is the ImageNet dataset for image classification. In the context of learning patch descriptors, the dataset provided by Brown et al. is the most widely used for training. The dataset contains 3 scenes, viz., liberty, notredame, and yosemite. Each scene consists of a large collection of images. A dense 3D point cloud and visibility maps are estimated from the set of images. The 3D points are projected into different reference images, taking visibility into account, to extract patches. Each scene contains more than 400,000 patches. Patches belonging to the same 3D point form matching pairs. However, the dataset suffers from two major drawbacks. Firstly, it lacks data diversity, as it contains only 3 scenes. Secondly, inconsistencies in the predicted visibility maps produce noisy matching pairs. In Fig. 1, a few noisy matching pairs from the liberty and notredame scenes are shown. These limitations severely restrict the performance of descriptors trained with the dataset, as shown in Sec. 5.
The DTU dataset contains images and 3D point clouds of small objects obtained using a robotic arm in a controlled laboratory environment. Images are taken from different viewpoints with varying illumination. Although the dataset is large in terms of the number of images, it does not capture the intricacies of images in the wild.
The CDVS dataset is another large patch-based dataset, offering more scenes than the MVS dataset. However, as shown in Fig. 2, the matching pairs in the dataset do not exhibit severe deformations. A quantitative analysis demonstrating the weakness of this dataset has been reported in prior work.
The Oxford-Affine dataset is a small dataset containing 8 scenes with a sequence of 6 images per scene. The images in a sequence are related by homographies. Although the dataset is suitable for benchmarking, it is too small for training CNN models. Similar to Oxford-Affine, another dataset exists in which matching pairs are created synthetically. In this dataset, every scene contains a reference image and a collection of images that are transformations of the reference image. The dataset has good variation in scene content and deformations. However, the deformations are limited to homographies. Table 1 gives a comparison of the various publicly available datasets.
Mishchuk et al. used the MVS dataset for training their network and noted that state-of-the-art results can be achieved by using better CNN architectures and training procedures. However, Schoenberger et al., through extensive experiments, highlighted the importance and the necessity of a better training dataset for learning patch descriptors.
Based on all these considerations, the contributions of the paper are:
(a). A large and novel PS dataset for learning patch descriptors, created from real-world photo-collections, having a good coverage of viewpoint, scale and illumination is proposed.
(b). A sampling technique to generate high quality matching correspondences without resulting in redundant patch matches is proposed.
(c). By training the current state-of-the-art model on the proposed dataset and outperforming the same model trained on MVS, we show that, alongside better models and training procedures, the quality of the training dataset is also important in realizing the potential of the CNN.
The dataset proposed in this paper is called the PhotoSynth (PS) dataset, as its images were collected by crawling Microsoft PhotoSynth. This section covers various aspects of the dataset. The scenes and images collected to form the dataset are described in Sec. 3.1, followed by the methodology adopted to create data for learning local descriptors from the vast collection of images, and the format of the dataset, in Sec. 3.2 and Sec. 3.3.
The PS dataset (publicly available, along with trained models, at https://github.com/rmitra/PS-Dataset) consists of a total of 30 scenes, with 25 scenes for training and 5 scenes for validation. Sample image pairs from the dataset are shown in Fig. 3. It can be observed from Fig. 3 that the diversity of the proposed PS dataset in terms of scene content, illumination, and geometric variations is large.
Each scene in the dataset contains 200 RGB images on an average. The resolution of the images varies from to . The number of patches extracted per scene on an average is 250,000. The number of correspondences depend on the threshold imposed on scale and viewpoint variations. For the training data used in Sec. 4.1, matching correspondences were obtained by setting scale and viewpoint threshold to and respectively. The higher viewpoint threshold is used for scenes which have planar structures. With these thresholds, on an average, 300,000 matching correspondences per scene are generated. Detailed statistics about each scene is provided in the supplementary material.
Structure from Motion (SFM) is used to create ground-truth correspondence pairs. The 3D reconstructions are generated with the Colmap [24, 25] SFM software. The SFM process outputs a 3D point cloud in which each point carries the list of feature points, from different images, from which it was triangulated, together with the estimated intrinsic and extrinsic camera parameters of each image in the scene. Difference of Gaussian (DOG) feature points are used in our reconstructions.
Patches are extracted by traversing through the list of feature points associated with each 3D point. An extracted patch is scale and rotation normalized by cropping the patch around the feature point with size , and then rotating the patch by degree , where and are the scale and rotation values of the feature point respectively. The value of has been limited in the range , so that minimum and maximum crop sizes are of and respectively. The resultant patch is then scaled to . All of the experiments reported in this paper are based on patch size of which is cropped around the center pixel. This facilitates in avoiding border artifacts when applying data augmentation techniques.
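As an illustration of the normalization described above, the following sketch samples a scale- and rotation-normalized patch with NumPy. The function names, the `crop_factor` and `out_size` parameters and their default values, the nearest-neighbour sampling, and the rotation sign convention are all assumptions made for this example; they are not the values or the implementation used to build the dataset.

```python
import numpy as np

def extract_patch(image, cx, cy, scale, angle_deg, crop_factor=12.0, out_size=64):
    """Sample a square patch around (cx, cy), normalized for scale and rotation.

    crop_factor and out_size are placeholder values (assumptions).
    """
    half = 0.5 * crop_factor * scale              # half-width of the crop in source pixels
    theta = np.deg2rad(angle_deg)
    # Regular grid in the normalized patch frame, spanning [-half, half].
    lin = np.linspace(-half, half, out_size)
    gx, gy = np.meshgrid(lin, lin)
    # Rotate the sampling grid by the feature orientation, then move it to the feature.
    sx = cx + np.cos(theta) * gx - np.sin(theta) * gy
    sy = cy + np.sin(theta) * gx + np.cos(theta) * gy
    # Nearest-neighbour lookup, clamped at the image border.
    xi = np.clip(np.rint(sx).astype(int), 0, image.shape[1] - 1)
    yi = np.clip(np.rint(sy).astype(int), 0, image.shape[0] - 1)
    return image[yi, xi]

def center_crop(patch, size):
    """Keep the central size x size region, e.g. to discard border artifacts."""
    off = (patch.shape[0] - size) // 2
    return patch[off:off + size, off:off + size]
```

Because the crop is taken around the patch center, augmentations such as small rotations can be applied to the larger patch first and the border discarded afterwards.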
As the PS dataset is constructed from photo collections, there are many instances where a particular scene has images that are captured from almost similar viewpoint and scale. Therefore a sampling technique has been adopted to ensure that the sampled correspondence pairs belonging to a particular 3D point have good coverage of viewpoint and scale.
Let be a 3D point and be the set of patches associated with . Let be the estimated focal lengths and be viewing directions of cameras of . Let be the camera centers. We calculate to be the distance of from camera centers in the direction of i.e. . The scale between two patches can be estimated by comparing their ratio. Let SC_TH, MIN_V_TH, MAX_V_TH be user defined thresholds for scale, minimum viewpoint difference and maximum viewpoint difference between the pairs. To form matching correspondences with varied viewpoint changes, we initially compute the angle between all possible pairs from . Next, given a patch , its matching set is initialized by . Algorithm 1 has been used to fill the matching set .
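The geometric quantities used by the sampling step can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the relative scale is taken to be the ratio of focal length to depth (a common proxy for apparent size), and the viewpoint angle is measured between the cameras' viewing directions. The text does not confirm either choice.

```python
import numpy as np

def depth_along_view(X, centers, view_dirs):
    """Distance of 3D point X from each camera center, measured along that
    camera's (normalized) viewing direction."""
    v = view_dirs / np.linalg.norm(view_dirs, axis=1, keepdims=True)
    return np.einsum("ij,ij->i", v, X[None, :] - centers)

def pairwise_scale_ratio(focals, depths):
    """Symmetric relative scale between patches, always >= 1.
    Assumes apparent size is proportional to focal length / depth."""
    s = focals / depths
    r = s[:, None] / s[None, :]
    return np.maximum(r, 1.0 / r)

def pairwise_view_angle_deg(view_dirs):
    """Angle in degrees between every pair of viewing directions."""
    v = view_dirs / np.linalg.norm(view_dirs, axis=1, keepdims=True)
    cos = np.clip(v @ v.T, -1.0, 1.0)
    return np.degrees(np.arccos(cos))
```

The resulting matrices can then be thresholded with SC_TH, MIN_V_TH, and MAX_V_TH when forming pairs.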
The algorithm works in an iterative approach. In each iteration, a patch in and within MAX_V_TH from , is assigned a minimum viewpoint difference (MVD) value. The value for is computed as follows. The pairwise viewpoint differences (or angles) between and all patches in are computed and the minimum of these differences is assigned as the MVD for in that iteration. This is repeated for all remaining patches in and within MAX_V_TH from . The patch in having the highest MVD in that iteration is considered. The patch is added to the set if angle between and is more than MIN_V_TH or the scale between the two patches differs by at least 1.5. The iterations stop when the algorithm fails to add a patch to the set in an iteration. The sampling technique avoids adding redundant pairs to which are very similar to already existing pairs. Hence we can obtain the required coverage in viewpoint and scale without creating all possible pairs. Once is computed, patches in the set is paired with forming valid matching correspondences.
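The iterative selection described above resembles a greedy farthest-first traversal in viewpoint space. The sketch below is one possible reading, not the authors' Algorithm 1. It assumes that the matching set starts with the query patch itself, that the acceptance test compares the candidate against its nearest already-selected patch, and that `angles` and `scale_ratio` are precomputed square matrices. All names are hypothetical.

```python
def sample_matching_set(query, angles, scale_ratio, min_v_th, max_v_th, sc_th=1.5):
    """Greedily pick patches that add new viewpoint or scale coverage.

    angles[i][j]      : viewpoint difference between patches i and j (degrees)
    scale_ratio[i][j] : relative scale between patches i and j (>= 1)
    Returns the indices to pair with `query`.
    """
    selected = [query]  # assumption: the set is seeded with the query patch
    candidates = [j for j in range(len(angles))
                  if j != query and angles[query][j] <= max_v_th]

    def nearest_selected(j):
        # Selected patch with the smallest viewpoint difference to candidate j.
        return min(selected, key=lambda k: angles[j][k])

    while candidates:
        # MVD of a candidate = angle to its nearest selected patch;
        # consider the candidate whose MVD is largest.
        best = max(candidates, key=lambda j: angles[j][nearest_selected(j)])
        near = nearest_selected(best)
        if angles[best][near] > min_v_th or scale_ratio[best][near] >= sc_th:
            selected.append(best)
            candidates.remove(best)
        else:
            break  # no candidate adds enough new coverage
    return selected[1:]
```

Since only the candidate with the highest MVD is tested in each iteration, a failed test means that every remaining candidate is at least as close in viewpoint to the current set, which is why the loop can stop there.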
Details of the experimental setup used for evaluating the various models are discussed in this section. Sec. 4.1 describes the procedure followed to train the model on the proposed PS dataset. The evaluation is described in Sec. 4.2.
For training, the CNN architecture is adopted from HardNet (L2-Net has a similar architecture). Since the CNN is trained on the proposed PS dataset, we call it HardNet-PS. A schematic diagram of the CNN architecture is shown in Fig. 6. Note that the original HardNet and its better variant HardNet+ are trained on the MVS dataset.
For comparison with HardNet+, the same loss function as described in  is adapted. In each iteration, unique 3D points were randomly sampled, where is the batch size. For a 3D point if there are patches then the hardest from all the ’s (see Sec. 3.2) are chosen based on descriptor distance. Selecting matching pairs from 3D points gives a list of matching pairs . Next, a pairwise distance matrix is formed of size , where and function is the L2 distance between the descriptors of and . The selection of the nearest non-matching pair of and of are modified as follows:
where contains a set of valid ’s. Given , a patch is valid w.r.t it, when 3D point and corresponding to and have at-least one image in common and their projections in that common image differ by of the un-normalized patch size, i.e before scaling to pixels as done in Sec. 3.2. The average loss over the batch is given in Eq. 1,
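To make the role of the validity constraint concrete, the sketch below computes a HardNet-style hardest-in-batch triplet margin loss in which the nearest non-matching descriptor is searched only among entries marked as valid. It is an illustration under assumptions: the function name, the margin of 1, and the precomputed boolean `valid` mask (which the caller would derive from the projection rule described above) are not taken from the paper.

```python
import numpy as np

def hardest_in_batch_loss(anchors, positives, valid, margin=1.0):
    """anchors, positives: (n, dim) descriptors of n matching pairs.
    valid: (n, n) boolean mask; valid[i, j] is True when the pair
    (anchor i, positive j) may be used as a non-matching example.
    """
    n = anchors.shape[0]
    # Pairwise L2 distance matrix D[i, j] = d(a_i, p_j).
    diff = anchors[:, None, :] - positives[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    pos = np.diag(D)                                  # distances of matching pairs
    # Exclude the matching pairs themselves and all invalid negatives.
    mask = valid & ~np.eye(n, dtype=bool)
    masked = np.where(mask, D, np.inf)
    neg_for_anchor = masked.min(axis=1)               # nearest valid p_j for a_i
    neg_for_positive = masked.min(axis=0)             # nearest valid a_k for p_i
    neg = np.minimum(neg_for_anchor, neg_for_positive)
    # A pair without any valid negative contributes zero loss.
    return float(np.maximum(0.0, margin + pos - neg).mean())
```

This dense version builds an n x n x dim array and is meant for small batches only.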
To reduce generalization error, augmentation of data is carried out by randomly rotating the patches between and scaling within .
Two evaluation benchmarks were used for a fair performance comparison, namely HPatches for planar objects and Strecha for non-planar objects. The procedures followed to evaluate on them are given in Sec. 4.2.1 and Sec. 4.2.2, respectively. As is the case with all the other descriptors, HardNet-PS is not trained on either of these two evaluation datasets.
The HPatches benchmark dataset contains image sequences that vary either in viewpoint or in illumination. It has 59 scenes with geometric deformations (viewpoint) and 57 scenes with photometric changes (illumination). Three types of detectors, namely DOG, Hessian, and Harris affine, are used to extract key points. While extracting key points, additional geometric noise was introduced at 3 levels, namely easy, medium, and hard. A brief overview of the three evaluation protocols in HPatches is given below:
Patch verification: The task is to classify a list of patch pairs as matching or non-matching. Each pair is assigned a similarity score based on the L2 distance between the descriptors of the two patches, and classification is done on the basis of this score. Mean Average Precision (mAP) is calculated from the list of similarity scores.
Image matching: The task is to match key points from a reference image to a target image. This is done using nearest-neighbor search on the descriptors of the key points. As in patch verification, each predicted match is associated with a similarity score, and mAP is calculated over the list of predictions.
Patch retrieval: In this protocol, a patch is queried in a large collection of patches consisting mostly of distractors. A similarity score consistent with the previous evaluations is computed between the query patch and the collection of patches. The evaluation is carried out by varying the number of distractors and taking the mean.
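For reference, average precision over a ranked list of scored pairs, as used in the verification protocol, can be computed as in the following sketch. It is a plain textbook implementation with hypothetical function names, and it does not reproduce the official HPatches evaluation code. The similarity score is assumed to be the negative L2 distance between descriptors.

```python
import math

def average_precision(labels, scores):
    """AP of a ranking: labels are 1 (matching) or 0 (non-matching),
    and higher scores are ranked first."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank      # precision at this positive
    return total / hits if hits else 0.0

def verification_ap(desc_a, desc_b, is_match):
    """Score each patch pair by the negative L2 distance of its descriptors."""
    scores = [-math.dist(a, b) for a, b in zip(desc_a, desc_b)]
    return average_precision(is_match, scores)
```

Averaging this value over several lists of pairs gives an mAP figure of the kind reported for the benchmark.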
The HPatches benchmark provides a comprehensive evaluation for image sequences related by 2D homographies. However, it does not capture image pairs in the wild, which are non-planar and exhibit self-occlusions and external occlusions. Hence, the Herzjesu and Fountain scenes from Strecha et al., which contain wide-baseline image pairs of non-planar objects, have been adopted for a more critical evaluation. The dataset provides images with projection matrices and a dense point cloud of the scene. The Herzjesu-P8 scene contains 8 images indexed from 0 to 7, with a gradual shift in viewpoint when iterated in order. In other words, the image pair ‘‘0’’-‘‘7’’ has the highest viewpoint difference. Similarly, the Fountain-P11 scene has a sequence of 11 images.
To ensure high repeatability, we take one of the images in the sequence as the reference image, extract key-points from it, and transfer them to the other images. The following steps are used to transfer a key-point from the reference image to a target image:
Project all 3D points onto the reference image.
Find the 3D point whose projection onto the reference image is nearest to the key-point and within a distance of 3 pixels.
If such a 3D point exists, project it onto the target image.
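The transfer steps above can be sketched with 3x4 projection matrices as follows. This is a minimal illustration with hypothetical function names. It ignores visibility and occlusion, and it searches the projected points by brute force.

```python
import numpy as np

def project(P, points3d):
    """Project (n, 3) world points with a 3x4 projection matrix P to (n, 2) pixels."""
    homog = np.hstack([points3d, np.ones((len(points3d), 1))])
    x = homog @ P.T
    return x[:, :2] / x[:, 2:3]

def transfer_keypoint(kp, P_ref, P_tgt, points3d, max_dist=3.0):
    """Transfer key-point kp = (x, y) from the reference to the target image.

    Returns the (x, y) location in the target image, or None when no 3D point
    projects within max_dist pixels of kp in the reference image.
    """
    proj_ref = project(P_ref, points3d)                 # step 1
    dist = np.linalg.norm(proj_ref - np.asarray(kp), axis=1)
    i = int(np.argmin(dist))                            # step 2
    if dist[i] > max_dist:
        return None
    return project(P_tgt, points3d[i:i + 1])[0]         # step 3
```

In practice the reference projections would be computed once per image rather than once per key-point.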
The reference images used in Fountain-P11 and Herzjesu-P8 are index ‘‘5’’ and index ‘‘4’’ respectively.
DOG key-points with 4 octaves and 3 scales per octave were used. The peak and edge threshold are set to and respectively. Points with scales larger than 1.6 are retained for stability with at-most 2 orientations per point. vl_covdet  is used to extract patches from the images with default parameters values. This makes the smallest patch extracted of size which is similar to the support window used by SIFT. In both scenes, we pair all other images with image indexed ‘‘0’’ to form the list of image pairs. We divide the image pairs into 3 categories on the basis of viewpoint difference. Range of viewpoint change for ‘‘Narrow’’, ‘‘Wide’’ and ‘‘Very-Wide’’ has been categorized as , and respectively. Table. 2 lists the categorized image pairs of both scenes. Since, Herzjesu sequence does not have any image pair differing more than in viewpoint, the category ‘‘Very-Wide’’ is not applicable to it.
| Scene | Narrow | Wide | Very-Wide |
| Fountain-P11 | ‘‘0’’-‘‘1’’, ‘‘0’’-‘‘2’’, ‘‘0’’-‘‘3’’ | ‘‘0’’-‘‘4’’, ‘‘0’’-‘‘5’’, ‘‘0’’-‘‘6’’ | ‘‘0’’-‘‘7’’, ‘‘0’’-‘‘8’’, ‘‘0’’-‘‘9’’ |
| Herzjesu-P8 | ‘‘0’’-‘‘1’’, ‘‘0’’-‘‘2’’, ‘‘0’’-‘‘3’’ | ‘‘0’’-‘‘4’’, ‘‘0’’-‘‘5’’, ‘‘0’’-‘‘6’’, ‘‘0’’-‘‘7’’ | NA |
Key-point matching is used as the metric, and the same protocol as in HPatches is followed to calculate mAP values. Given a pair of images, we compute the mAP value on 2000 random points visible in both images.
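A possible way to score key-point matching between two images is sketched below: each reference descriptor is matched to its nearest neighbour in the target image, the matches are ranked by descriptor distance, and average precision is computed over the ranking. The function name and the normalization by the number of queries (so that every missed match lowers the score) are assumptions made for this illustration, not the benchmark's exact code.

```python
import numpy as np

def matching_ap(desc_ref, desc_tgt):
    """desc_ref[i] and desc_tgt[i] describe the same transferred key-point,
    so a nearest-neighbour match i -> j is correct when j == i."""
    # Pairwise L2 distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b,
    # which avoids building an (n, n, dim) array.
    sq = ((desc_ref ** 2).sum(1)[:, None] + (desc_tgt ** 2).sum(1)[None, :]
          - 2.0 * desc_ref @ desc_tgt.T)
    D = np.sqrt(np.maximum(sq, 0.0))
    nn = D.argmin(axis=1)
    nn_dist = D[np.arange(len(D)), nn]
    correct = (nn == np.arange(len(D))).astype(float)
    # Rank predicted matches from most to least confident (smallest distance first).
    order = np.argsort(nn_dist, kind="stable")
    correct = correct[order]
    precision = np.cumsum(correct) / (np.arange(len(correct)) + 1)
    return float((precision * correct).sum() / len(correct))
```

With 2000 key-points per image pair, the distance matrix has 4 million entries, which is easily handled in memory.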
Quantitative comparisons between models trained on MVS dataset and HardNet+ trained on our dataset are described in this section. As described in Sec. 4, Hardnet-PS indicates Hardnet+ trained on proposed PS dataset. Results on Hpatches benchmark evaluation and the Strecha benchmark are discussed in Sec. 5.1 and Sec. 5.2 respectively.
Results for the matching task are shown in Table 3. The results are categorized into illumination and viewpoint sequences. As can be observed, in the overall score HardNet-PS outperforms HardNet+ by a margin of 8%. It is noteworthy that HardNet-PS outperforms the state-of-the-art on all the viewpoint sequences, and by large margins of 15.5% and 19.2% on the 'Hard' and 'Tough' sequences, respectively.
The performance comparison on the verification task is shown in Table 4. As in the matching task, the sequences can be categorized into same-sequence (intra) and different-sequence (inter). Overall, HardNet-PS is better than HardNet+ by 4.4%. The improvement over HardNet+ increases as the difficulty level of the scenes increases. As can be seen from Table 4, HardNet-PS performs notably better than HardNet+, by nearly 10%, on the 'Tough' scenes.
The results of the retrieval task in the HPatches evaluation are reported in Table 5. HardNet-PS outperforms the current state-of-the-art HardNet+ by around 10% on average. Again, as in the previous tasks, the margin of improvement for HardNet-PS is higher for the 'Hard' and 'Tough' scenes, at 9.3% and 16.5%, respectively.
The mAP values of the different models for the matching task on the two scenes of Strecha et al. are shown in Tables 6 and 7, respectively. HardNet-PS performs better than the state-of-the-art by nearly 5% and 3.5% on the Fountain-P11 and Herzjesu-P8 scenes, respectively. The margin of improvement over HardNet+ is higher in the 'Very-Wide' category for the Fountain-P11 scene and in the 'Wide' category for the Herzjesu-P8 scene.
Qualitative comparison for the matching task on the Fountain-P11 from the Strecha benchmark is shown in Figure 7. It can be seen that for wide baseline and very wide baseline, the matches from the proposed Hardnet-PS model are better than the matches from Hardnet+ model.
The results on the HPatches and Strecha benchmarks indicate a common pattern. The HardNet+ and HardNet-PS models yield comparable mAP scores for the 'Easy' scenes (HPatches) and the 'Narrow' category (Strecha). However, as the difficulty of the scenes increases ('Hard' and 'Tough', or 'Wide' and 'Very-Wide'), the HardNet-PS model trained on the PS dataset outperforms the state-of-the-art HardNet+ model by a larger margin.
In this paper, we have introduced a novel dataset for training CNN-based descriptors that overcomes many drawbacks of current datasets such as MVS. It has a sufficiently large number of scenes and better coverage of viewpoint, scale, and illumination. We trained the state-of-the-art CNN model available in the literature on the proposed dataset and evaluated it on the HPatches and Strecha benchmark datasets. On these benchmarks, the model trained on the proposed dataset outperforms the current state-of-the-art significantly, and the margin of improvement is higher for the difficult scenes ('Hard' and 'Tough' in HPatches, and 'Wide' and 'Very-Wide' in Strecha). With these new state-of-the-art results, we conclude that, alongside improving the CNN architecture and the training procedure, a good dataset that conforms to the real world, such as the proposed PS dataset, is also necessary to learn high-quality, widely applicable descriptors.
Y. Tian, B. Fan, and F. Wu, ‘‘L2-Net: Deep learning of discriminative patch descriptor in Euclidean space,’’ CVPR, 2017.