
ViewSynth: Learning Local Features from Depth using View Synthesis

We address the problem of jointly detecting keypoints and learning descriptors in depth data under challenging viewpoint changes. Despite great improvements in recent RGB-based local feature learning methods, we show that these methods cannot be directly transferred to the depth image modality, and that they do not utilize the 2.5D information present in depth images. We propose ViewSynth, a framework designed to jointly learn a 3D-structure-aware depth image representation and local features from that representation. ViewSynth consists of a `View Synthesis Network' (VSN), trained to synthesize depth image views given a depth image representation and query viewpoints. The ViewSynth framework jointly learns keypoints and feature descriptors, paired with our view synthesis loss, which guides the model to propose keypoints robust to viewpoint changes. We demonstrate the effectiveness of our formulation on several depth image datasets, where local features learned with our proposed ViewSynth framework outperform state-of-the-art methods on keypoint matching and camera localization tasks.




1 Introduction

Figure 1: Overview of ViewSynth. We learn robust local feature representations from depth images using a keypoint matching loss, and encourage structure-aware depth image representation learning by learning to synthesize depth images from different views.

Due to the rapid development of inexpensive commodity depth sensors in the past decade, learning representations of depth images has become ubiquitous in many applications such as robotics and human pose estimation [12, 45, 25]. Depth images have a unique advantage over RGB images in being invariant to color and illumination changes [12]. This property makes them suitable for many tasks, including depth image to 3D correspondence matching and camera localization, especially when high illumination and color variation are expected [24]. Learning local features from RGB images requires either real-world annotated datasets (in a supervised learning setting) [11, 27] or synthetic scenes with designed realistic textures like [15]. For depth images, on the other hand, one can utilize large repositories of easily available 3D CAD models of objects and scenes, for example the ModelNet dataset [48] or the Stanford 3D scanning repository [1], to render depth images from different viewpoints and learn pose invariant local features without any costly data annotation [12].

A popular approach for local feature learning has been generating sparse local features, which aim to describe only relevant parts of the image [22, 27, 12, 26, 23]. This requires first acquiring keypoints in the image [22, 40], and then using them to generate descriptors for their surrounding patches [8, 42]. However, generating keypoints that are repeatable [30], along with descriptors that allow correct matching of those keypoints across different images, has proven difficult under varying imaging conditions when only low-level image features are utilized. Drastic changes in color or illumination lead to inferior results in keypoint localization and matching [51, 32], motivating approaches that learn deep local representations [27, 42, 11].

Recent advances in learning local features from RGB images, for example the work in [11], have shown good performance improvements over previous works in such challenging cases. They extract dense features from an RGB image, and jointly extract keypoints and descriptors from those dense features.

Although D2Net [11] achieves state-of-the-art performance in the keypoint matching task for RGB images, as we demonstrate in Sec. 4, it is not directly applicable to learning viewpoint invariant local features in the challenging depth image modality. We found D2Net [11] to suffer from triplet collapse [47], where the descriptors of all keypoints collapse onto a single representation very early in the training process on several depth image datasets. Another major shortcoming of directly adapting RGB-based local feature learning methods [22, 11, 27] to the depth modality is that they are not explicitly designed to utilize the 2.5D geometric information in the depth data. Sitzmann et al. [37] learn a 3D-structure-aware scene representation that encodes both geometry and appearance, and show its efficacy for various tasks like few-shot reconstruction, shape and appearance interpolation, and novel view synthesis.

Inspired by [37], we propose to learn a 3D-structure-aware depth image representation from depth images, hypothesizing that it will help generate local features better optimized for keypoint matching and camera localization. To this end, we propose the View Synthesis Network (VSN), a network designed to generate depth image views given depth image features and a relative pose, and the View Synthesis Loss, a loss function to train VSN. Given a depth image, a dense feature representation is extracted from it, from which keypoints and descriptors can be jointly estimated. We propose to use this dense feature representation with a given relative pose and synthesize the view from that pose using VSN, which consists of two sub-networks: the Grid Transformation Encoding Network, which encodes the relative-transformation-related parameters as a high-dimensional representation; and the Depthmap Synthesis Network, which synthesizes the view from the relative viewpoint using the depth image features and the encoded transformation representation.

Additionally, to adapt D2Net [11] to learning local features in the depth modality, we propose to use a contrastive loss [14] with hardest negative sampling for descriptor learning. Unlike the triplet formulation, the contrastive loss optimizes toward completely viewpoint invariant local feature learning by penalizing any descriptor difference between a pair of correctly corresponding keypoints, which is another desirable quality. Synthesizing views from unseen viewpoints involves reconstructing surfaces invisible in the given image. While our contrastive loss formulation optimizes the dense feature extractor to learn a viewpoint invariant representation of the scene, we hypothesize that training VSN will encourage the dense feature extractor to encode a 3D-structure-aware depth image representation. Using D2Net with the contrastive formulation along with our VSN, we demonstrate that our framework, ViewSynth, performs well for local feature learning in the depth modality.

We make the following contributions: (1) To learn a 3D-structure-aware depth image representation, we propose the View Synthesis Network (VSN), composed of the Grid Transformation Encoding Network (GTN) and the Depthmap Synthesis Network (DSN), to synthesize depth image views from a depth image and a query pose. (2) We propose the View Synthesis Loss (VSL) to train VSN for learning a 3D-structure-aware depth image representation.

We validate the effectiveness of our proposed formulation on the depth image to 3D keypoint matching task and the camera localization task on the real-world datasets MSR-7 [34], TUM [39], and CoRBS [46]. Our method outperforms the state-of-the-art depth image local feature learning method [12] and D2Net [11] on all datasets, with margins between 21.89% and 52.32% under various viewpoint changes and thresholds.

2 Related work

Sparse and dense local feature learning: The introduction of hand-crafted local feature algorithms like SIFT [22], SURF [4] and ORB [31] signified the importance of local features in computer vision, as they dramatically outperformed other existing techniques for describing objects and images, providing robustness to variations in scale, rotation and pose. While these methods extract local features from images by looking at pixel neighborhoods alone, recent deep-learning based local feature extraction methods [27, 42, 11] demonstrate more robust local feature detection by utilizing contextual information from the image.

Recent improvements in local feature learning include improving keypoint localization [44], orientation estimation [27], and easy-to-match description generation [35, 42, 43], where the keypoint detection and description stages are either learned independently or learned jointly [12, 27, 11, 9]. Most existing local feature methods take a detect-then-describe approach, where keypoints are detected in a first step and described in a second step [22, 4, 31, 6, 50, 27, 9]. In contrast, D2Net [11] proposes a network that shares all parameters between detection and description, and uses a joint methodology to solve both tasks concurrently. While we use a detect-and-describe approach similar to [11], their method does not directly operate on the depth modality (discussed further in Sec. 3.2), and it does not utilize the 3D information available in depth images. [40] shows that learning 3D keypoints via geometric reasoning leads to keypoint learning optimized for pose estimation. Unlike their method of learning keypoints separately, we jointly learn keypoints and descriptors, while encouraging 3D-geometric-structure-aware depth representation learning using our proposed VSN. We adapt [11] to operate in the depth image modality, and improve keypoint generation and description by explicitly learning a 3D-structure-aware representation of scenes in the feature extractor network, using VSN trained with VSL. Our intuition is that the better the feature extractor encodes structure-aware scene information, the better the local feature extraction will detect and describe keypoints that are more accurate to match.

Learning from depth data: The reliance on depth data in robotics and autonomous driving has seen a surge in recent years. This has led to various works proposing methods geared towards depth data utilization [12, 13, 2, 21, 41, 3, 17] in domains such as object detection [29, 10], crowd counting [5], and activity recognition [19]. Georgakis et al. [12] learn keypoints and descriptors for pose invariant 3D matching using depth images. However, their method relies on a separately trained detector and descriptor, following the detect-then-describe pipeline. This pipeline has been shown to under-perform compared to the detect-and-describe formulation, as studied in D2Net [11]. In contrast to [12], our method uses the detect-and-describe pipeline.

Synthesizing novel views: The use of view synthesis has largely focused on generating missing information from a given state, with known applications in point cloud reconstruction [20], resolution enhancement [38], image inpainting [49, 28], and image-to-image translation [18, 7]. Many novel view synthesis methods operate in the RGB domain and do not utilize depth image geometry; they generate realistic-looking fake structures to produce more training data [49, 28, 18]. Generation of such fake structures is unwanted in our framework, and hence we do not use such adversarial loss formulations. In contrast to previous works, VSN employs a lightweight view synthesis sub-network, which takes in the dense features of a depth image and a query pose, and synthesizes the normalized view of the depth image from that query pose. Our main goal is not to synthesize perfect views from arbitrary viewpoints, but rather to optimize the feature extraction network to learn a structure-aware depth image representation. [52] predicts an appearance flow to synthesize novel views from an image, geared towards the RGB modality only. They learn to copy pixel colors from input images to synthesize novel views. This copying mechanism is impractical for depth images, since the intensity of the same point can vary drastically over different viewpoints. Instead of using MLPs [37] for encoding the scene, we use a single convolutional neural network on the regular 2.5D depth image to encode the depth image representation. Our VSN also differs from the neural rendering process in [37].

Figure 2: The architecture of ViewSynth. Given a depth image pair, dense features, keypoints, and descriptors are extracted. The keypoint matching loss supervises keypoint and descriptor learning. The View Synthesis Network, trained with the View Synthesis Loss, synthesizes the depth image from the second camera's view using the first image's features.

3 Methodology

Recent works [27, 12, 11] show that joint keypoint-descriptor learning reaches state-of-the-art performance in the keypoint matching task. We propose ViewSynth, a joint keypoint-descriptor learning framework for depth images using the detect-and-describe formulation, which learns a structure-aware depth image representation via view synthesis. The architecture of ViewSynth is illustrated in Fig. 2, and the details of its formulation are discussed below.

3.1 Feature descriptor and keypoint detection

We first use a dense feature extractor, VGG-16 [36] up to the conv_4_3 layer, to extract depth image features F. Here H, W and C refer to the height, width and channels of the feature representation respectively. F_ij represents the unnormalized representation at feature map location (i, j), where 1 ≤ i ≤ H and 1 ≤ j ≤ W. Applying L2 normalization to this representation gives us the keypoint descriptor d_ij = F_ij / ||F_ij||.
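As a concrete sketch of this step, the per-location L2 normalization can be written in a few lines of NumPy (a minimal illustration; the `descriptors_from_dense` helper and the random array standing in for VGG-16 conv_4_3 features are ours, not the paper's code):

```python
import numpy as np

def descriptors_from_dense(F, eps=1e-8):
    """L2-normalize a dense feature map F of shape (H, W, C) so that the
    vector at each spatial location (i, j) becomes a unit-length descriptor."""
    norms = np.linalg.norm(F, axis=-1, keepdims=True)  # (H, W, 1)
    return F / (norms + eps)

# Every location of the feature map now carries a unit descriptor.
F = np.random.rand(4, 4, 8)   # stand-in for conv_4_3 features
D = descriptors_from_dense(F)
```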

To detect keypoints from the dense features, we follow the strategy proposed by D2-Net [11], where a keypoint score is obtained at each potential keypoint location based on the relative magnitude of the feature representation along a spatial neighborhood and along the channel dimension.

A hard binary scoring mechanism determines the keypoints during testing. During the training phase, a soft scoring mechanism is used for gradient propagation [11]. For each spatial position in the dense feature map, a soft keypoint-ness score is computed, indicating how confident the feature extraction network is about the correct match-ability of the pixel at that feature location.
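A minimal NumPy sketch of such a soft detection score, in the spirit of D2-Net's formulation (the exact neighborhood size, numerical stabilization, and normalization details here are our assumptions, not the paper's):

```python
import numpy as np

def soft_detection_score(F, radius=1):
    """Soft keypoint-ness score: each location is scored by how strongly it
    peaks over a spatial neighborhood (alpha) and over the channel dimension
    (beta); per-image normalization makes the scores sum to 1.
    F: dense feature map of shape (H, W, C)."""
    H, W, C = F.shape
    expF = np.exp(F - F.max())                 # stabilized exponentials
    alpha = np.zeros_like(F)
    for i in range(H):
        for j in range(W):
            i0, i1 = max(0, i - radius), min(H, i + radius + 1)
            j0, j1 = max(0, j - radius), min(W, j + radius + 1)
            # ratio of this location's activation to its spatial neighborhood
            alpha[i, j] = expF[i, j] / expF[i0:i1, j0:j1].sum(axis=(0, 1))
    beta = F / (F.max(axis=-1, keepdims=True) + 1e-8)   # channel-wise ratio
    gamma = (alpha * beta).max(axis=-1)                 # (H, W) raw score
    return gamma / gamma.sum()                          # image-level normalization

s = soft_detection_score(np.random.rand(8, 8, 16))
```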

3.2 Keypoint and descriptor learning

We learn the local features by training the network with pairs of images at each iteration and penalizing it to learn correct image-to-image keypoint correspondences. Given a pair of normalized depth images, we first pass them through the dense feature extraction network to obtain their dense features. We then extract from the dense features the keypoints with their scores, and the descriptors, as described above. A set of ground truth correspondences is created based on the known 3D distance between the keypoints. For each ground truth correspondence between the images, where one pixel lies in the first image and its counterpart in the second, we minimize the positive descriptor distance to ensure descriptor similarity between correct correspondences. We also maximize the descriptor distance between incorrect correspondences, the negative descriptor distance. For this, we choose the most confounding incorrect correspondence distance from the first image to the second: the smallest descriptor distance to any spatial position lying outside a safe radius K around the true match. The safe radius defines a boundary around each correctly matched keypoint, within which we do not consider any point as a negative match; similar to [11], we use their value of K. In similar fashion, we compute the most confounding incorrect correspondence distance from the second image to the first.
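The hardest-negative search with a safe radius can be sketched as follows (our illustration; the function name and the Chebyshev-distance safe radius are assumptions, and K = 4 is merely a common default for this family of methods rather than a value stated in the text):

```python
import numpy as np

def hardest_negative_distance(d_p, q, D2, K=4):
    """For an anchor descriptor d_p (from image 1) whose true match in image 2
    is at pixel q = (row, col), return the distance to the most confounding
    negative: the closest descriptor in image 2 lying OUTSIDE a K-pixel safe
    radius around q.  D2: dense descriptor map of image 2, shape (H, W, C)."""
    H, W, _ = D2.shape
    ii, jj = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # mask of positions outside the safe radius (Chebyshev distance > K)
    outside = np.maximum(np.abs(ii - q[0]), np.abs(jj - q[1])) > K
    dists = np.linalg.norm(D2 - d_p, axis=-1)   # (H, W) descriptor distances
    return dists[outside].min()
```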

D2-Net uses a triplet loss formulation to minimize positive descriptor distances and maximize negative descriptor distances. Interestingly, we observed that this loss often led to all descriptors collapsing onto a single representation in the early phases of training on depth image datasets. We presume that the inherent difficulty of learning a pose invariant representation from often noisy depth data, coupled with a high learning rate and the hard-negative sampling suggested by [11], led to this phenomenon when a triplet loss is used. We propose to use a contrastive loss to avoid this problem. The contrastive loss also encourages the network to learn the same descriptor for a given keypoint across all depth images. This is a desirable effect in ViewSynth, since we want the densely extracted features to encode a 3D-structure-aware representation for each keypoint in a completely viewpoint invariant fashion. For each keypoint in the first image, the descriptor learning loss becomes:
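In standard contrastive form, writing p(c) for the positive descriptor distance of a correspondence c, n_1(c) and n_2(c) for the two most confounding negative distances defined above, and M for the margin, the loss can be written as (our reconstruction and notation; whether the distances are squared is our assumption):

```latex
\ell(c) \;=\; p(c)^{2} \;+\; \Big[\max\!\big(0,\; M - \min\big(n_1(c),\, n_2(c)\big)\big)\Big]^{2}
```

The first term pulls correct correspondences together; the second pushes the hardest negatives at least a margin M apart.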


The margin is kept fixed across all of our experiments. Similarly, we calculate the descriptor loss for each keypoint in the second image. The overall descriptor loss is the summation of these two terms. For jointly learning keypoints and descriptors, we use the joint learning formulation of [11]. During the training phase, it encourages the network to detect keypoints it can correctly match with higher confidence, and keypoints it fails to match correctly with lower confidence.

Figure 3: VSN takes in the dense representation of the depth image and the parameters of the pixel-wise transformation between the two views, and synthesizes the normalized representation from the view of the second camera. See subsection 3.3 for details.

3.3 Learning view synthesis

Inspired by [37, 40], we hypothesize that learning 3D structure aware depth image representation can assist in learning local features more suitable for correct matching. Intuitively, the better the feature extractor is at representing the depth image in a structure aware fashion, the better it can identify and describe keypoints more optimized for correct matching. To this end, we propose the View Synthesis Network (VSN) (Fig. 3): given the dense features of a depth image, and a pose relative to the image, VSN synthesizes the view of the depth image from the relative pose.

Depth images from different viewpoints of a scene capture different visible surfaces. Imagine two depth cameras looking at a sofa, one positioned in front of it and one positioned on its right, illustrated as the two cameras in Figure 3. Each camera will capture different surfaces of the sofa. We hypothesize that VSN can correctly synthesize the view from the second camera's viewpoint by observing the features of the first image and the relative pose parameters only if the depth image features of the first image are capable of encoding 3D-structure-aware information. In each training iteration, we utilize a depth image pair with their respective camera parameters to learn view synthesis; the camera parameters embody the intrinsics and extrinsics of each camera. A ground truth mapping function defines where each pixel of the first image maps to in the second; it is mathematically defined by the two cameras' parameters and the unnormalized representation of the first depth image. Note that VSN utilizes the unnormalized depth only to compute this mapping function. It never uses the unnormalized depth directly, and instead uses the dense features and the camera parameters to synthesize the normalized depth image view from the second camera's perspective.
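The mapping function can be sketched as a back-project/re-project step (a minimal NumPy illustration with pinhole intrinsics K1, K2 and relative pose (R, t); `pixel_mapping` is a hypothetical helper, not the paper's code):

```python
import numpy as np

def pixel_mapping(depth1, K1, K2, R, t):
    """Sketch of the ground-truth mapping: back-project every pixel of depth
    image 1 with intrinsics K1, move it by the relative pose (R, t), and
    re-project it with intrinsics K2.  Returns the (u, v) target coordinates
    in image 2 for each pixel of image 1."""
    H, W = depth1.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x HW
    X1 = np.linalg.inv(K1) @ pix * depth1.reshape(1, -1)  # 3D points, cam-1 frame
    X2 = R @ X1 + t.reshape(3, 1)                         # 3D points, cam-2 frame
    proj = K2 @ X2
    uv2 = proj[:2] / proj[2:3]                            # perspective division
    return uv2.T.reshape(H, W, 2)

# Identity pose and equal intrinsics map every pixel onto itself.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
m = pixel_mapping(np.full((4, 4), 2.0), K, K, np.eye(3), np.zeros(3))
```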

The first step of VSN is to warp the dense feature representation onto the image space of the second camera using the mapping function, to obtain a warped representation. The Grid Transformation Encoding Network (GTN) then encodes the transformation related parameters into a high-dimensional space. Finally, using the encoded transformation parameters and the warped representation, we synthesize the normalized depth image view from the second camera's viewpoint using the Depthmap Synthesis Network (DSN). VSN is optimized using the View Synthesis Loss (VSL), as discussed later.

Grid Transformation Encoding Network (GTN): Since the keypoint-descriptor learning loss guides the dense feature extractor to learn a viewpoint invariant representation of the depth images, synthesizing the depth view from an arbitrary viewpoint requires using the transformation related parameters in the view synthesis process as well. The GTN sub-network (Figure 3) is designed to encode the parameters of the transformation between the two image spaces. GTN is applied at each spatial position of the warped feature representation, and its input is the set of physical transformation related parameters at each spatial position. The features at a given spatial position encode the location of the corresponding pixel in the first image's space, the spatial position it gets mapped to, the unnormalized depth value, and the parameters of both cameras. These are the physical parameters that define the pose transformation between the two views. Similar to the viewpoint transformation technique used in [52], we employ a small fully connected network, GTN, to map the transformation related physical parameters into a high-dimensional space for each spatial position; the dimensionality was chosen empirically. GTN is composed of two fully connected residual blocks [16], which are shared among all spatial positions. The output of GTN is an encoded representation of the relative transformation from the first view to the second for each spatial position of the feature map.
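The per-position shared MLP structure of GTN can be sketched as follows (our illustration with random, untrained weights; the output dimensionality is elided in the text, so the value 128 below is an arbitrary placeholder, not the paper's choice):

```python
import numpy as np

rng = np.random.default_rng(0)

def fc_residual_block(x, W1, W2):
    """One fully connected residual block, shared across spatial positions.
    x: (N, E) batch of per-position vectors."""
    h = np.maximum(x @ W1, 0)   # ReLU
    return x + h @ W2           # residual connection

def gtn(params, E=128):
    """Sketch of the Grid Transformation Encoding Network: a small shared MLP
    that lifts the per-position transformation parameters (shape (H, W, Ct))
    into an E-dimensional encoding."""
    H, W, Ct = params.shape
    x = params.reshape(-1, Ct)                       # flatten spatial grid
    W_in = rng.standard_normal((Ct, E)) * 0.1        # input projection
    x = x @ W_in
    for _ in range(2):                               # two residual FC blocks
        x = fc_residual_block(x, rng.standard_normal((E, E)) * 0.1,
                                 rng.standard_normal((E, E)) * 0.1)
    return x.reshape(H, W, E)

T = gtn(np.random.rand(6, 6, 9))
```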

Figure 4: Qualitative demonstration of VSN's contribution on matching examples from pairs of images in the TUM test dataset. ViewSynth shows that learning a structure-aware depth representation enables robust-to-match keypoint-descriptor learning. Best viewed in color.

Depthmap Synthesis Network (DSN): The Depthmap Synthesis Network is outlined in Figure 3. DSN takes as input the warped feature representation and the transformation features, and synthesizes the normalized depth image view from the second camera's viewpoint. First, global contextual information of the feature representation is extracted by applying global average pooling (GAP). The GAP features are then concatenated with every spatial position and passed through a convolutional layer. Each spatial position of the result encodes position-specific information that is aware of the global construct of the scene, provided by the GAP features. This is then concatenated with the high-dimensional transformation representation along the channel dimension. Three residual convolutional blocks follow, outputting the synthesized normalized depth image view from the second camera's viewpoint, which is 8 times downsampled compared to the input image. According to our hypothesis, the synthesized view will closely resemble the ground truth normalized depth image only if the input to DSN implicitly encodes information about the surfaces invisible in the scene. This is possible only if the dense features represent the depth image in a 3D-structure-aware fashion. This, in turn, allows the feature extractor network to generate pose invariant local features more optimized for correct correspondence matching.
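The data flow of DSN, GAP context, concatenation, and per-position mixing, can be sketched as follows (our simplification: the 1x1 convolutions become per-position linear maps, and the three residual conv blocks are collapsed into a single output map; all names and shapes are illustrative assumptions):

```python
import numpy as np

def dsn(F_warp, E_t, w_ctx, w_out):
    """Sketch of the Depthmap Synthesis Network: global average pooling gives
    scene-level context, which is concatenated at every position and mixed by
    a per-position linear map (a 1x1 conv); the transformation encoding E_t is
    then concatenated and a final per-position map produces the synthesized
    depth view.  F_warp: (H, W, C), E_t: (H, W, E)."""
    H, W, C = F_warp.shape
    gap = F_warp.mean(axis=(0, 1))                     # (C,) global context
    ctx = np.concatenate([F_warp,
                          np.broadcast_to(gap, (H, W, C))], axis=-1)
    Fp = np.maximum(ctx @ w_ctx, 0)                    # "1x1 conv" + ReLU
    mixed = np.concatenate([Fp, E_t], axis=-1)
    return (mixed @ w_out)[..., 0]                     # (H, W) synthesized depth

C, Cp, E = 8, 8, 4
out = dsn(np.random.rand(5, 5, C), np.random.rand(5, 5, E),
          np.random.rand(2 * C, Cp), np.random.rand(Cp + E, 1))
```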

View Synthesis Loss (VSL): We train depth image synthesis using the View Synthesis Loss (VSL). The formulation compares the synthesized depth image against the ground truth normalized depth image. We apply an L1 loss at each pixel of the synthesized and ground truth depth images to obtain the view synthesis loss. If V refers to the set of pixels that correspond to 3D points contained within the second camera's view frustum, but not necessarily visible in the first image, then the VSL loss is:
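Writing \hat{D}(p) for the synthesized depth at pixel p and \bar{D}(p) for the downsampled ground-truth normalized depth, one consistent form is (our reconstruction; the 1/|V| normalization is our assumption):

```latex
\mathcal{L}_{vs} \;=\; \frac{1}{\lvert V \rvert} \sum_{p \in V} \big\lvert \hat{D}(p) - \bar{D}(p) \big\rvert
```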


The supervisory signal used to train VSN is the 8 times downsampled representation of the ground truth normalized depth image. VSN is used only during training, to encourage the initial feature extractor to learn a structure-aware scene representation; it is not required for keypoint-descriptor generation at test time.

4 Experimental Evaluation

To validate local feature learning with our proposed ViewSynth framework, we compare keypoint matching accuracy against the state-of-the-art (SOTA) local feature extractor for depth images [12] and the SOTA local feature extractor for RGB images [11], adapted to the depth modality.

We experiment on three datasets: RGB-D Dataset 7-Scenes [34], TUM RGBD-SLAM [39], and the CoRBS dataset [46], each of which is a compilation of tracked sequences of real RGB-D camera frames of naturally occurring indoor scenes, captured by RGB-D sensors such as the Kinect.

4.1 Experimental protocol

We follow the same experimental setup as [12] to evaluate keypoint matching accuracy. The training process takes pairs of depth images and their camera parameters as input. Training pairs are created by pairing depth images that are 10 or 30 frames apart, as denoted in the experimentation tables. After the model is trained, the training images are used to create a repository of 3D keypoints with their descriptors. Each depth image is passed through the local feature extractor network to extract the 50 highest scoring keypoints and their corresponding descriptors. We then put the keypoints, the descriptors and their 3D world coordinates in the reference 3D keypoint repository R. In the second step, we apply the model to each depth image of the testing set, extract the 50 highest scoring keypoints, and match them against the keypoints in R. A match is considered correct if the 3D world coordinates of the matched keypoints are within a certain distance of each other.
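The matching step of this protocol can be sketched in NumPy (our illustration; `match_accuracy` is a hypothetical helper computing the fraction of nearest-descriptor matches whose 3D coordinates fall within the threshold):

```python
import numpy as np

def match_accuracy(repo_desc, repo_xyz, query_desc, query_xyz, thresh):
    """Match each query keypoint to its nearest-descriptor neighbor in the
    repository R, and count the match as correct when the matched 3D world
    coordinates are within `thresh` of each other.
    repo_desc: (M, C), repo_xyz: (M, 3); query_desc: (N, C), query_xyz: (N, 3)."""
    # pairwise descriptor distances, shape (N, M)
    d = np.linalg.norm(query_desc[:, None, :] - repo_desc[None, :, :], axis=-1)
    nn = d.argmin(axis=1)                       # nearest repository entry
    err = np.linalg.norm(query_xyz - repo_xyz[nn], axis=-1)
    return (err < thresh).mean()                # fraction of correct matches

# Querying the repository with its own keypoints matches everything exactly.
rd = np.array([[1.0, 0.0], [0.0, 1.0]])
rx = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
acc = match_accuracy(rd, rx, rd, rx, 0.1)
```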

To evaluate camera localization performance using the local features, we use the experimental protocol described in [32]. For this task, we create R from the depth images in the training dataset as described before. For each image in the testing set, we extract the keypoints with the local feature extractor, match them against the 3D keypoints in R, and estimate the camera pose using a RANSAC-based PnP solver. In accordance with the experimental setup, we use 50 keypoints per image both when creating the keypoint repository and when processing each image in the testing data. Camera localization accuracy is measured at different position error and orientation error thresholds; here we use the (0.5m, 2°), (1.0m, 5°) and (5.0m, 10°) thresholds. We show the efficacy of our method against other baselines on the depth image to 3D keypoint matching task and the camera localization task for the different datasets.
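The correctness criterion for a pose estimate under a (position, orientation) threshold pair can be sketched as follows (our illustration of the evaluation rule, not the paper's code):

```python
import numpy as np

def localization_correct(R_est, t_est, R_gt, t_gt, pos_thresh, ang_thresh_deg):
    """A pose estimate counts as correct under a (position, orientation)
    threshold pair such as (0.5 m, 2 deg): translation error is the Euclidean
    distance between camera positions, and rotation error is the angle of the
    relative rotation R_est^T R_gt."""
    pos_err = np.linalg.norm(t_est - t_gt)
    cos_a = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0       # rotation-angle formula
    ang_err = np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))
    return bool(pos_err < pos_thresh and ang_err < ang_thresh_deg)

# A perfect estimate passes; a 1 m translation error fails the 0.5 m threshold.
ok = localization_correct(np.eye(3), np.zeros(3), np.eye(3), np.zeros(3), 0.5, 2.0)
bad = localization_correct(np.eye(3), np.array([1.0, 0.0, 0.0]),
                           np.eye(3), np.zeros(3), 0.5, 2.0)
```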

Figure 5: View synthesis output on the TUM dataset. The rows represent the input depth image, the ground truth depth image, and the synthesized output for validation image pairs 10/30/50 frames apart.

4.2 Baseline

We use the current SOTA method for local feature learning from RGB images, D2Net [11] with its triplet loss formulation, as our baseline for local feature learning in depth images. The original D2Net formulation led to descriptor collapse in every experimental setup we established. Hence, we modify D2Net by removing its hardest negative sampling and switching to all-negative sampling. We call this modified D2Net mD2Net, and use it as our baseline.

property                   MSR-7 [34]  TUM [39]  CoRBS [46]
# of scenes                7           11        15
# of sequences             18          55        75
sensor type                Kinect      Kinect    Kinect v2
# training/testing images  26K/17K     18K/4K    26K/6K
Table 1: Experimentation datasets and their properties.
                  TUM              CoRBS            MSR-7
                  10      30       10      30       10      30
D2Net             Collapsed        Collapsed        Collapsed
mD2Net            8.72    3.62     20.48   12.60    30.89   21.33
D2Net-c           33.38   23.93    53.19   45.82    68.93   61.25
ViewSynth (ours)  34.75   35.63    59.45   57.39    77.02   73.65
D2Net             Collapsed        Collapsed        Collapsed
mD2Net            17.10   13.93    29.83   28.13    44.61   42.10
D2Net-c           56.73   51.53    71.24   66.65    80.35   75.47
ViewSynth (ours)  67.30   52.69    72.43   69.25    81.76   79.16
D2Net             Collapsed        Collapsed        Collapsed
mD2Net            45.69   45.02    61.31   59.55    71.48   69.25
D2Net-c           79.87   80.35    89.84   90.30    93.30   93.41
ViewSynth (ours)  80.10   80.56    89.70   90.72    93.37   94.19
Table 2: Comparison of MMA on the TUM, CoRBS, and MSR-7 datasets, trained in the 10/30-frames-apart settings; the three row blocks correspond to increasing 3D matching thresholds. Acronyms: mD2Net: modified D2Net; D2Net-c: D2Net with contrastive loss formulation; ViewSynth: D2Net-c with view synthesis loss, our proposed method.

4.3 Results

Our proposed method ViewSynth is compared against existing methods on the depthmap to 3D keypoint matching task and the camera localization task. On each dataset, we report the mean matching accuracy (MMA) obtained by each method using the experimental protocol discussed above. The original D2Net [11] method, which uses the triplet loss formulation with hardest negative sampling, is denoted D2Net in the tables. mD2Net refers to the D2Net method using all-negative triplet sampling. D2Net-c refers to D2Net with the contrastive loss formulation. ViewSynth indicates our final proposed method, which learns local features using the contrastive loss and learns view synthesis using the view synthesis loss. Note that for keypoint detection during evaluation, we used the multi-scale keypoint detection setting [11] for mD2Net and our ViewSynth method. We use the MSR-7, TUM RGBD-SLAM, and CoRBS datasets for evaluation; Tab. 1 summarizes them.

We show the efficacy of our method on all three datasets in Tab. 2, where we compare different methods of learning local features on the depth image to 3D keypoint matching task, using MMA as the metric. The table compares the variants of our method against the SOTA methods at three 3D matching threshold values. We notice that D2Net consistently suffers descriptor collapse on all datasets and fails to produce any keypoints. Our method ViewSynth outperforms the mD2Net baseline by a significant margin in all cases. This is also apparent in the qualitative results in Fig. 4, where local features learned with ViewSynth generate a significantly higher number of correct matches for pairs of depth images. ViewSynth also comfortably beats [12] on the MSR-7 dataset in the 10-frames-apart training setting; their MMA performance for this dataset is taken from [12]. Since the code for [12] is not publicly available, and ViewSynth demonstrates far superior accuracy on the reported metric, we do not evaluate [12] on the other metrics. Some view synthesis examples can be found in Fig. 5.

We see superior local feature learning performance of ViewSynth on the camera localization task as well (Tab. 3). Local features learned using ViewSynth lead to the most accurate localization compared to the baselines on the CoRBS and TUM datasets.

                  0.5m, 2°         1.0m, 5°         5.0m, 10°
                  10      30       10      30       10      30
TUM
D2Net             Collapsed        Collapsed        Collapsed
mD2Net            1.18    1.77     4.81    6.51     9.67    12.49
ViewSynth (ours)  7.70    7.58     23.02   16.60    35.49   27.18
CoRBS
D2Net             Collapsed        Collapsed        Collapsed
mD2Net            1.90    4.18     6.85    11.56    13.40   18.51
ViewSynth (ours)  8.19    8.57     23.36   30.29    47.78   50.52
Table 3: Camera localization accuracy (%) on the TUM and CoRBS datasets, with the 10/30-frames-apart training settings. For all localization correctness thresholds, our proposed method outperforms the SOTA.

4.4 Ablation

For our ablation studies, we examine the effectiveness of structure-aware representation learning with VSN by comparing ViewSynth against D2Net-c, i.e., D2Net trained only with the contrastive loss and without view synthesis. Tab. 2 reports the comparison between these methods on all three datasets. In every case, ViewSynth achieves superior or on-par MMA compared to D2Net-c, leading to state-of-the-art results. This result strongly supports our hypothesis that learning a structure-aware depth image representation leads to robust local feature learning. The effect is especially apparent in the 30-frames-apart training setting, where VSN is more effective because it learns to synthesize views under larger viewpoint variation; quantitatively, our method consistently performs better than D2Net-c in this setting. In an ablation study on MSR-7 (Tab. 4) evaluating camera localization, we again see that ViewSynth leads to more accurate or on-par results, supporting our hypothesis.

| Method | 0.5m, 2° (10) | 0.5m, 2° (30) | 1.0m, 5° (10) | 1.0m, 5° (30) | 5.0m, 10° (10) | 5.0m, 10° (30) |
|---|---|---|---|---|---|---|
| D2Net | Collapsed | Collapsed | Collapsed | Collapsed | Collapsed | Collapsed |
| mD2Net | 15.46 | 14.11 | 37.68 | 34.66 | 53.98 | 50.24 |
| D2Net (contrastive only) | 31.52 | 21.92 | 66.33 | 58.25 | 85.24 | 82.61 |
| ViewSynth (ours) | 34.60 | 23.83 | 70.09 | 57.04 | 86.67 | 80.34 |

Table 4: Camera localization accuracy (%) on the MSR-7 dataset, with 10/30-frames-apart training settings. For all localization correctness thresholds, our proposed method outperforms the SOTA.

4.5 Discussion

Quantitative and qualitative evaluations of the keypoint matching and camera localization tasks on different datasets indicate the superiority of our method. In all cases, the original D2Net faced feature collapse during training and was unable to generate any keypoints at test time. mD2Net bypasses the feature collapse problem, but performs poorly in the keypoint matching task because its all-negative sampling strategy for learning descriptors does not provide challenging negatives, leading to very slow learning. An alternative to all-negative sampling is semi-hard negative sampling [33], which we did not explore in this work. We notice a good improvement in the D2Net setting, where descriptors are learned using a contrastive loss formulation with hardest negative sampling. Our overall proposed method achieves superior results to the compared methods on the TUM and CoRBS datasets in all training scenarios and for all thresholds. On MSR-7, our method achieves superior performance in the 30-frames-apart training setting compared to all other methods, and competitive performance in the 10-frames-apart setting. Especially when trained in the 30-frames-apart setting, VSN can exploit the higher viewpoint variance between training image pairs to learn view synthesis more effectively. This is apparent across all three datasets, where we see significant improvement over all baselines in the MMA metric. We see the usefulness of VSN on the camera localization metric in the MSR-7 dataset as well. All of these results assert the effectiveness of 3D structure aware depth representation learning using VSN.
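The hardest-negative sampling discussed above can be sketched in a few lines. This is a simplified NumPy illustration of a contrastive margin loss with in-batch hardest-negative mining, assuming row-aligned matching descriptor sets; it is not the paper's exact loss, and the function name and margin value are ours.

```python
import numpy as np

def hardest_negative_contrastive(desc_a, desc_b, margin=1.0):
    """Contrastive-style loss with hardest in-batch negative mining.
    desc_a[i] and desc_b[i] form a matching (positive) pair; each
    anchor's hardest negative is its nearest non-matching descriptor."""
    # Pairwise L2 distances between the two descriptor sets, shape (N, N).
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=-1)
    pos = np.diag(d)                   # distances of the true matches
    neg = d + np.eye(len(d)) * 1e6     # mask the diagonal (positives)
    hardest = neg.min(axis=1)          # closest wrong match per anchor
    # Pull positives together, push hardest negatives past the margin.
    return float(np.mean(np.maximum(pos - hardest + margin, 0.0)))
```

With all-negative sampling the `hardest` term would instead average over every off-diagonal entry, so most negatives are already far away and contribute no gradient, which is consistent with the slow learning observed for mD2Net.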

5 Conclusion

We show that the state-of-the-art detect-and-describe approach for keypoint detection and description does not transfer directly to the depth modality. We propose modifications that make training stable, along with an architecture (VSN) and a loss (VSL) which encourage the network to focus on the local features that are integral to synthesizing a novel view from a different viewpoint, allowing the network to learn keypoints that not only capture local information but are also important for describing the global scene. We show performance improvements of multiple percentage points over baseline methods on two datasets, the MSR-7 Scenes dataset and the TUM RGB-D benchmark dataset.


  • [1] The Stanford 3D scanning repository. Accessed on: 2019-10-17.
  • [2] Sari Awwad, Fairouz Hussein, and Massimo Piccardi. Local depth patterns for tracking in depth videos. In Proceedings of the 23rd ACM international conference on Multimedia, pages 1115–1118. ACM, 2015.
  • [3] Sari Awwad and Massimo Piccardi. Local depth patterns for fine-grained activity recognition in depth videos. In 2016 International Conference on Image and Vision Computing New Zealand (IVCNZ), pages 1–6. IEEE, 2016.
  • [4] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. Speeded-up robust features (surf). Comput. Vis. Image Underst., 110(3):346–359, June 2008.
  • [5] Enrico Bondi, Lorenzo Seidenari, Andrew D Bagdanov, and Alberto Del Bimbo. Real-time people counting from depth imagery of crowded environments. In 2014 11th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pages 337–342. IEEE, 2014.
  • [6] Michael Calonder, Vincent Lepetit, Christoph Strecha, and Pascal Fua. Brief: Binary robust independent elementary features. In European conference on computer vision, pages 778–792. Springer, 2010.
  • [7] Inchang Choi, Orazio Gallo, Alejandro Troccoli, Min H Kim, and Jan Kautz. Extreme view synthesis. In Proceedings of the IEEE International Conference on Computer Vision, pages 7781–7790, 2019.
  • [8] Christopher B Choy, JunYoung Gwak, Silvio Savarese, and Manmohan Chandraker. Universal correspondence network. In Advances in Neural Information Processing Systems, pages 2414–2422, 2016.
  • [9] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 224–236, 2018.
  • [10] Bertram Drost and Slobodan Ilic. 3d object detection and localization using multimodal point pair features. In 2012 Second International Conference on 3D Imaging, Modeling, Processing, Visualization & Transmission, pages 9–16. IEEE, 2012.
  • [11] Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2-net: A trainable cnn for joint detection and description of local features. arXiv preprint arXiv:1905.03561, 2019.
  • [12] Georgios Georgakis, Srikrishna Karanam, Ziyan Wu, Jan Ernst, and Jana Košecká. End-to-end learning of keypoint detector and descriptor for pose invariant 3d matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1965–1973, 2018.
  • [13] Georgios Georgakis, Srikrishna Karanam, Ziyan Wu, and Jana Kosecka. Learning local rgb-to-cad correspondences for object pose estimation. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [14] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In Proceedings - 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2006, volume 2, pages 1735–1742, 2006.
  • [15] A. Handa, T. Whelan, J.B. McDonald, and A.J. Davison. A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM. In IEEE Intl. Conf. on Robotics and Automation, ICRA, Hong Kong, China, May 2014.
  • [16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [17] Rabah Iguernaissi, Djamal Merad, and Pierre Drap. People counting based on kinect depth data. In ICPRAM, pages 364–370, 2018.
  • [18] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
  • [19] Ahmad Jalal, Yeon-Ho Kim, Yong-Joong Kim, Shaharyar Kamal, and Daijin Kim. Robust human activity recognition from depth video using spatiotemporal multi-fused features. Pattern recognition, 61:295–308, 2017.
  • [20] Chen-Hsuan Lin, Chen Kong, and Simon Lucey. Learning efficient point cloud generation for dense 3d object reconstruction. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [21] Mengyuan Liu and Hong Liu. Depth context: A new descriptor for human activity recognition by using sole depth sequences. Neurocomputing, 175:747–758, 2016.
  • [22] David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004.
  • [23] David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004.
  • [24] Karol Matusiak, Piotr Skulimowski, and Pawel Strumillo. Depth-based descriptor for matching keypoints in 3d scenes. International Journal of Electronics and Telecommunications, 64(3):299–306, 2018.
  • [25] Derek McColl, Zhe Zhang, and Goldie Nejat. Human body pose interpretation and classification for social human-robot interaction. International Journal of Social Robotics, 3(3):313, 2011.
  • [26] Krystian Mikolajczyk and Cordelia Schmid. Scale & affine invariant interest point detectors. International journal of computer vision, 60(1):63–86, 2004.
  • [27] Yuki Ono, Eduard Trulls, Pascal Fua, and Kwang Moo Yi. Lf-net: Learning local features from images, 2018.
  • [28] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2536–2544, 2016.
  • [29] Charles R Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas. Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 918–927, 2018.
  • [30] Jerome Revaud, Philippe Weinzaepfel, César De Souza, Noe Pion, Gabriela Csurka, Yohann Cabon, and Martin Humenberger. R2d2: Repeatable and reliable detector and descriptor. arXiv preprint arXiv:1906.06195, 2019.
  • [31] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary R Bradski. Orb: An efficient alternative to sift or surf. In ICCV, volume 11, page 2. Citeseer, 2011.
  • [32] Torsten Sattler, Will Maddern, Carl Toft, Akihiko Torii, Lars Hammarstrand, Erik Stenborg, Daniel Safari, Masatoshi Okutomi, Marc Pollefeys, Josef Sivic, et al. Benchmarking 6dof outdoor visual localization in changing conditions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8601–8610, 2018.
  • [33] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
  • [34] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon. Scene coordinate regression forests for camera relocalization in rgb-d images. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 2930–2937, June 2013.
  • [35] Edgar Simo-Serra, Eduard Trulls, Luis Ferraz, Iasonas Kokkinos, Pascal Fua, and Francesc Moreno-Noguer. Discriminative learning of deep convolutional feature point descriptors. In Proceedings of the IEEE International Conference on Computer Vision, pages 118–126, 2015.
  • [36] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [37] Vincent Sitzmann, Michael Zollhoefer, and Gordon Wetzstein. Scene representation networks: Continuous 3d-structure-aware neural scene representations. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 1119–1130. Curran Associates, Inc., 2019.
  • [38] Xibin Song, Yuchao Dai, and Xueying Qin. Deeply supervised depth map super-resolution as novel view synthesis. IEEE Transactions on Circuits and Systems for Video Technology, 2018.
  • [39] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of rgb-d slam systems. In Proc. of the International Conference on Intelligent Robot Systems (IROS), Oct. 2012.
  • [40] Supasorn Suwajanakorn, Noah Snavely, Jonathan Tompson, and Mohammad Norouzi. Discovery of latent 3d keypoints via end-to-end geometric reasoning. In Advances in Neural Information Processing Systems, 2018.
  • [41] Mariusz Szwoch and Pawel Pieniazek. Facial emotion recognition using depth data. 2015 8th International Conference on Human System Interaction (HSI), pages 271–277, 2015.
  • [42] Yurun Tian, Bin Fan, and Fuchao Wu. L2-net: Deep learning of discriminative patch descriptor in euclidean space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 661–669, 2017.
  • [43] Engin Tola, Vincent Lepetit, and Pascal Fua. Daisy: An efficient dense descriptor applied to wide-baseline stereo. IEEE transactions on pattern analysis and machine intelligence, 32(5):815–830, 2009.
  • [44] Alessio Tonioni, Samuele Salti, Federico Tombari, Riccardo Spezialetti, and Luigi Di Stefano. Learning to detect good 3d keypoints. International Journal of Computer Vision, 126(1):1–20, 2018.
  • [45] Keze Wang, Liang Lin, Chuangjie Ren, Wei Zhang, and Wenxiu Sun. Convolutional memory blocks for depth data representation learning. In IJCAI, pages 2790–2797, 2018.
  • [46] Oliver Wasenmüller, Marcel Meyer, and Didier Stricker. Corbs: Comprehensive rgb-d benchmark for slam using kinect v2. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–7. IEEE, 2016.
  • [47] Chao-Yuan Wu, R Manmatha, Alexander J Smola, and Philipp Krahenbuhl. Sampling matters in deep embedding learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2840–2848, 2017.
  • [48] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015.
  • [49] Chao Yang, Xin Lu, Zhe Lin, Eli Shechtman, Oliver Wang, and Hao Li. High-resolution image inpainting using multi-scale neural patch synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6721–6729, 2017.
  • [50] Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. Lift: Learned invariant feature transform. In European Conference on Computer Vision, pages 467–483. Springer, 2016.
  • [51] Hao Zhou, Torsten Sattler, and David W Jacobs. Evaluating local features for day-night matching. In European Conference on Computer Vision, pages 724–736. Springer, 2016.
  • [52] Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A Efros. View synthesis by appearance flow. In European conference on computer vision, pages 286–301. Springer, 2016.