NinjaDesc: Content-Concealing Visual Descriptors via Adversarial Learning

by Tony Ng et al.

In light of recent analyses showing that scene content can be revealed from visual descriptors, we develop descriptors that conceal the input image content. In particular, we propose an adversarial learning framework for training visual descriptors that prevent image reconstruction while maintaining matching accuracy. We let a feature encoding network and an image reconstruction network compete with each other: the feature encoder tries to impede image reconstruction with its generated descriptors, while the reconstructor tries to recover the input image from those descriptors. The experimental results demonstrate that the visual descriptors obtained with our method significantly deteriorate image reconstruction quality, with minimal impact on correspondence matching and camera localization performance.






1 Introduction

Local visual descriptors are fundamental to a wide range of computer vision applications such as SLAM [mur-artal2016orb-slam, newcombe2011dtam, mei2011rslam, dong2015distributed], SfM [schonberger2016colmap, sweeney2015theia, agarwal2011building], wide-baseline stereo [jin2021image, mur2017orb], camera calibration [Oth_2013_CVPR], tracking [hare2012efficient, nebehay2014consensus, pernici2013object], image retrieval [tolias2020learning, simeoni2019local, arandjelovic2016netvlad, noh2017delf, arandjelovic2014dislocation, tolias2013smk], and camera pose estimation [toft2020long, sattler2017active, porav2018adversarial, detone2018superpoint, sarlin20superglue, dusmanu2019d2net, revaud2019r2d2, toft2020single]. These descriptors represent local regions of images and are used to establish local correspondences between and across images and 3D models.

The descriptors take the form of vectors in a high-dimensional space, and thus are not directly interpretable by humans. However, researchers have shown that it is possible to reveal the input images from local visual descriptors [weinzaepfel2011reconstructing, dosovitskiy2016inverting, d2013bits]. With the recent advances in deep learning, the quality of the reconstructed image content has improved significantly [invsfm, dangwal21]. This poses potential privacy concerns for visual descriptors if they are used for sensitive data without proper encryption [dangwal21, linecloud, weinzaepfel2011reconstructing].

To prevent the reconstruction of image content from visual descriptors, several methods have been proposed. These include obfuscating keypoint locations by lifting them to lines that pass through the original points [speciale2019privacy, linecloud, geppert2021privacy, shibuya2020privacy], or to affine subspaces with augmented adversarial feature samples [dusmanu2020privacy], to increase the difficulty of recovering the original images. However, recent work [chelani2021howprivacypreserving] has demonstrated that the closest points between lines can yield a good approximation to the original point locations, enabling descriptor inversion.

Figure 1: Our proposed content-concealing visual descriptor. a) We train NinjaNet, the content-concealing network, via adversarial learning to give NinjaDesc. b) On the two examples shown, we compare inversions of SOSNet [tian2019sosnet] descriptors vs. NinjaDesc (SOSNet encoded with NinjaNet). c) NinjaDesc conceals facial features and landmark structures while retaining correspondences. Image credits: laylamoran4battersea & sgerner (Flickr, CC BY 2.0 & CC BY-SA 2.0 licenses).

In this work, we explore whether such local feature inversion could be mitigated at the descriptor level. Ideally, we want a descriptor that does not reveal the image content without a compromise in its performance. This may seem counter-intuitive due to the trade-off between utility and privacy discussed in the recent analysis on visual descriptors [dangwal21], where the utility is defined as matching accuracy, and the privacy is defined as non-invertibility of the descriptors. The analysis showed that the more useful the descriptors are for correspondence matching, the easier it is to invert them. To minimize this trade-off, we propose an adversarial approach to train visual descriptors.

Specifically, we optimize our descriptor encoding network with an adversarial loss for descriptor invertibility, in addition to the traditional metric learning loss for feature correspondence matching. For the adversarial loss, we jointly train an image reconstruction network to compete with the feature descriptor network in revealing the original image content from the descriptors. In this way, the feature descriptor network learns to hinder the image reconstruction network by generating visual descriptors that conceal the original image content, while being optimized for correspondence matching.

In particular, we introduce an auxiliary encoder network NinjaNet that can be trained with any existing visual descriptors and transform them to our content-concealing NinjaDesc, as illustrated in Fig. 1. In the experiments, we show that visual descriptors trained with our adversarial learning framework lead to only marginal drop in performance for feature matching and visual localization tasks, while significantly reducing the visual similarity of the reconstruction to the original input image.

One of the main benefits of our method is that we can control the trade-off between utility and privacy by changing a single parameter in the loss function. In addition, our method generalizes to different types of visual descriptors, and different image reconstruction network architectures.

In summary, our main contributions are as follows: a) We propose a novel adversarial learning framework for visual descriptors to prevent reconstruction of the original input image content from the descriptors. We experimentally validate that the obtained descriptors significantly deteriorate the image quality of descriptor inversion with only a marginal drop in matching accuracy, using standard benchmarks for matching (HPatches [balntas2017hpatches]) and visual localization (Aachen Day-Night [sattler2018benchmarking, zhang2020aachenv_1_1]). b) We empirically demonstrate that we can effectively control the trade-off between utility (matching accuracy) and privacy (non-invertibility) by changing a single training parameter. c) We provide ablation studies using different types of visual descriptors, image reconstruction network architectures, and scene categories to demonstrate the generalizability of our method.

2 Related work

This section discusses prior work on visual descriptor inversion and the state-of-the-art descriptor designs that attempt to prevent such inversion.

Inversion of visual descriptors. Early results on reconstructing images from local descriptors were shown by Weinzaepfel et al. [weinzaepfel2011reconstructing], who stitched together image patches from a known database that were closest to the input SIFT [lowe2004sift] descriptors in feature space. d'Angelo et al. [d2013bits] used a deconvolution approach on local binary descriptors such as BRIEF [calonder2010brief] and FREAK [alahi2012freak]. Vondrick et al. [hoggles] used paired dictionary learning to invert HoG [zhu2006fast] features, revealing their limitations for object detection. On the global descriptor side, Kato and Harada [kato2014image] demonstrated image reconstruction from bag-of-words (BoW) descriptors [sivic2003video]. However, the quality of the reconstructions by these early works was not sufficient to raise concerns about privacy or security.

Subsequent work introduced methods that steadily improved reconstruction quality. Mahendran and Vedaldi [mahendran2015understanding] used a back-propagation technique with a natural image prior to invert CNN features as well as SIFT [liu2010sift] and HOG [zhu2006fast]. Dosovitskiy and Brox [dosovitskiy2016inverting] trained up-convolutional networks that estimate the input image from features in a regression fashion, demonstrating superior results on both classical [lowe2004sift, zhu2006fast, ojala2002multiresolution] and CNN [krizhevsky2017imagenet] features. More recently, descriptor inversion methods have leveraged larger and more advanced CNN models as well as improved optimization techniques. Pittaluga et al. [invsfm] and Dangwal et al. [dangwal21] demonstrated reconstruction quality high enough to reveal not only semantic information but also details of the original images.

Preventing descriptor inversion for privacy. Descriptor inversion techniques may raise privacy concerns [invsfm, dangwal21, linecloud, weinzaepfel2011reconstructing]. For example, in computer vision systems where the visual descriptors are transferred between the device and the server, an honest-but-curious server may try to exploit the descriptors sent by the client device. In particular, many large-scale camera localization systems adopt cloud computing and storage, due to limited processing power, memory, and storage on mobile devices. Cryptographic methods like homomorphic encryption [erkin2009privacy, sadeghi2009efficient, yonetani2017privacy] can be used to protect descriptors, but they are too computationally expensive for large-scale applications.

Proposed by Speciale et al. [linecloud], the line-cloud representation obfuscates 2D / 3D point locations in the map-building process [geppert2020privacy, shibuya2020privacy, geppert2021privacy] without compromising localization accuracy. However, since the descriptors are unchanged, Chelani et al. [chelani2021howprivacypreserving] showed that line-clouds are vulnerable to inversion attacks once the underlying point-cloud is recovered.

Recently, Dusmanu et al. [dusmanu2020privacy] proposed a privacy-preserving visual descriptor that lifts descriptors to affine subspaces, concealing the visual content from inversion attacks. However, this comes at a significant cost to the descriptor's utility in downstream tasks. Our work differs from [dusmanu2020privacy] in that we propose a learned content-concealing descriptor and explicitly train it for utility retention, achieving a better trade-off between the two.

3 Method

Figure 2: Top: Architecture of our content-concealing NinjaNet encoder $\mathcal{E}$. Bottom: A base descriptor of dimensionality $D$ is transformed to a NinjaDesc of the same dimensionality.

We propose an adversarial learning framework for obtaining content-concealing visual descriptors, by introducing a descriptor inversion model as an adversary. In this section, we detail our content-concealing encoder NinjaNet (Sec. 3.1) and the descriptor inversion model (Sec. 3.2), as well as the joint adversarial training procedure (Sec. 3.3).

3.1 NinjaNet: the content-concealing encoder

In order to conceal the visual content of a local descriptor while maintaining its utility, we need a trainable encoder which transforms the original descriptor space to a different one, where visual information essential for reconstruction is reduced. Our NinjaNet encoder $\mathcal{E}$ is implemented as an MLP, shown in Fig. 2. It takes a base descriptor $\mathbf{d}$ and transforms it into a content-concealing NinjaDesc $\tilde{\mathbf{d}}$:

$$\tilde{\mathbf{d}} = \mathcal{E}(\mathbf{d}). \quad (1)$$

The design of NinjaNet is light-weight and plug-and-play, making it flexible in accepting different types of existing local descriptors. The encoded NinjaDesc maintains the matching performance of the original descriptor, but prevents high-quality reconstruction of images. In many of our experiments, we adopt SOSNet [tian2019sosnet] as our base descriptor, since it is one of the top-performing descriptors for correspondence matching and visual localization [jin2021image].
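To make the setup concrete, a NinjaNet-style encoder can be sketched as a small residual MLP in PyTorch. This is a hypothetical sketch: the layer composition, the number of submodules, and the normalization choices are illustrative assumptions, not the exact architecture of Fig. 2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NinjaNet(nn.Module):
    """Sketch of a residual-MLP encoder mapping a base descriptor to a
    same-dimensional, content-concealing descriptor. Layer sizes and the
    number of submodules are assumptions, not the authors' exact design."""

    def __init__(self, dim: int = 128, n_submodules: int = 2, p_drop: float = 0.1):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.Linear(dim, dim),
                nn.BatchNorm1d(dim),
                nn.ReLU(inplace=True),
                nn.Dropout(p_drop),
                nn.Linear(dim, dim),
            )
            for _ in range(n_submodules)
        ])

    def forward(self, d: torch.Tensor) -> torch.Tensor:
        x = d
        for block in self.blocks:
            x = x + block(x)           # residual connection
        return F.normalize(x, dim=-1)  # unit-norm output, like most descriptors

desc = torch.randn(8, 128)             # a batch of 8 base descriptors
ninja = NinjaNet()(desc)
assert ninja.shape == desc.shape       # dimensionality is preserved
```

Because the output dimensionality matches the input, the encoder can be dropped behind any existing descriptor pipeline without changing downstream matching code.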

Utility initialization. To maintain the utility (i.e. accuracy for downstream tasks) of our encoded descriptor, we use a patch-based descriptor training approach [tian2017l2net, mishchuk2017hardnet, tian2019sosnet]. The initialization step trains NinjaNet via a triplet-based ranking loss. We use the UBC dataset [goesele2007ubc] which contains three subsets of patches labelled as positive and negative pairs, allowing for easy implementation of triplet-loss training.

Figure 3: The pipeline for training our content-concealing NinjaDesc. Top: The two networks at play and their corresponding objectives: 1. NinjaNet $\mathcal{E}$, trained for utility retention (A); and 2. the descriptor inversion model, which reconstructs RGB images from input sparse features (B). Bottom: During joint adversarial training, we alternate between steps 1 and 2, as presented in Algorithm 1.

Utility loss. We extract base descriptors $\mathbf{d}$ from image patches and train NinjaNet $\mathcal{E}$ with the descriptor learning loss from [tian2019sosnet] to optimize NinjaDesc $\tilde{\mathbf{d}}$:

$$\mathcal{L}_{\text{util}} = \mathcal{L}_{\text{tri}} + \mathcal{L}_{\text{SOS}}, \quad (2)$$

where $\mathcal{L}_{\text{tri}}$ is the triplet-based ranking loss and $\mathcal{L}_{\text{SOS}}$ is the second-order similarity regularization term [tian2019sosnet]. We always freeze the weights of the base descriptor network, including during the joint training process in Sec. 3.3.
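A SOSNet-style utility loss can be sketched as follows. The in-batch hardest-negative mining and the exact form of the second-order similarity (SOS) term are illustrative assumptions rather than the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def utility_loss(anchor, positive, margin=1.0, sos_weight=1.0):
    """Sketch of a SOSNet-style loss: a triplet term with in-batch hardest
    negatives plus a second-order similarity regularizer. Mining scheme
    and weighting are illustrative only."""
    # Pairwise L2 distances between anchors and positives (batch of matching pairs).
    d = torch.cdist(anchor, positive)                      # (B, B)
    pos = d.diag()                                         # matching-pair distances
    # Hardest in-batch negative for each anchor (mask out the positive).
    mask = torch.eye(len(d), dtype=torch.bool)
    neg = d.masked_fill(mask, float("inf")).min(dim=1).values
    triplet = F.relu(margin + pos - neg).mean()
    # Second-order similarity: matching descriptors should have similar
    # distance profiles to every other descriptor in the batch.
    da = torch.cdist(anchor, anchor)
    dp = torch.cdist(positive, positive)
    sos = (da - dp).pow(2).sum(dim=1).sqrt().mean()
    return triplet + sos_weight * sos
```

During the utility-initialization step, a loss of this shape would be applied to the NinjaNet outputs while the base descriptor network stays frozen.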

3.2 Descriptor inversion model

For our proposed adversarial learning framework, we utilize a descriptor inversion network as the adversary to reconstruct input images from our NinjaDesc. We adopt the UNet-based [unet] inversion network from prior work [invsfm, dangwal21]. Following Dangwal et al. [dangwal21], the inversion model $\mathcal{R}$ takes as input the sparse feature map $F$ composed from the descriptors and their keypoints, and predicts the RGB image $\hat{I}$, i.e. $\hat{I} = \mathcal{R}(F)$. We denote by $H \times W$ the resolution of both the sparse feature image and the reconstructed RGB image, and by $D$ the dimensionality of the descriptor. The detailed architecture is provided in the supplementary.

Reconstruction loss. The descriptor inversion model is optimized under a reconstruction loss composed of two parts. The first is the mean absolute error (MAE) between the predicted image $\hat{I}$ and the input image $I$:

$$\mathcal{L}_{\text{MAE}} = \frac{1}{HW} \,\| \hat{I} - I \|_1. \quad (3)$$

The second is the perceptual loss, the L2 distance between intermediate features of a VGG16 [simonyan2015VGG] network pretrained on ImageNet:

$$\mathcal{L}_{\text{perc}} = \sum_{l} \frac{1}{H_l W_l} \,\| \phi_l(\hat{I}) - \phi_l(I) \|_2^2, \quad (4)$$

where $\phi_l(\cdot)$ are the feature maps extracted at layers $l$, and $H_l \times W_l$ is the corresponding resolution. The reconstruction loss is the sum of the two terms:

$$\mathcal{L}_{\text{recon}}(\mathcal{D}) = \mathcal{L}_{\text{MAE}} + \mathcal{L}_{\text{perc}}, \quad (5)$$

where $\mathcal{D}$ denotes the image data term that includes both the descriptor feature map $F$ and the RGB image $I$.
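The two-part reconstruction loss can be sketched as below. To keep the example offline and self-contained, the function accepts an arbitrary list of frozen feature stages; in the paper, the stages are intermediate layers of an ImageNet-pretrained VGG16 (e.g. torchvision's `vgg16`), so the stand-in stages here are an assumption for illustration.

```python
import torch
import torch.nn as nn

def reconstruction_loss(pred, target, feature_stages):
    """Sketch of the reconstruction loss: a mean absolute error (MAE) term
    plus a perceptual term, the L2 distance between intermediate feature
    maps of a fixed network (VGG16 in the paper; here any list of frozen
    feature extractors)."""
    mae = (pred - target).abs().mean()
    perceptual = 0.0
    x, y = pred, target
    for stage in feature_stages:
        x, y = stage(x), stage(y)
        # L2 distance, normalized by the feature-map size via .mean()
        perceptual = perceptual + (x - y).pow(2).mean()
    return mae + perceptual

# Stand-in for VGG16 stages (use a pretrained VGG16's layers in practice).
stages = [nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU()),
          nn.Sequential(nn.MaxPool2d(2), nn.Conv2d(8, 16, 3, padding=1), nn.ReLU())]
pred = torch.rand(1, 3, 32, 32)
target = torch.rand(1, 3, 32, 32)
loss = reconstruction_loss(pred, target, stages)
assert loss.item() > 0
```

A perfect reconstruction drives both terms to zero, which is what makes this loss usable as the adversary's objective and, negated, as the encoder's concealment signal.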

Reconstruction initialization. For the joint adversarial training described in Sec. 3.3, we initialize the inversion model using the initialized NinjaDesc from Sec. 3.1. This step uses the MegaDepth [li2018megadepth] dataset, which contains images of landmarks across the world. For keypoint detection, we use Harris corners [harris_corner] in our experiments.
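The sparse feature map fed to the inversion model can be composed as follows. This follows the general recipe of the cited inversion works; the rounding and collision handling are assumptions about details the text leaves open.

```python
import numpy as np

def compose_sparse_feature_map(keypoints, descriptors, height, width):
    """Sketch of forming the inversion model's input: an H x W x D map that
    is zero everywhere except at keypoint locations, where the corresponding
    descriptor is placed."""
    dim = descriptors.shape[1]
    fmap = np.zeros((height, width, dim), dtype=np.float32)
    for (x, y), desc in zip(keypoints, descriptors):
        r, c = int(round(y)), int(round(x))
        if 0 <= r < height and 0 <= c < width:
            fmap[r, c] = desc   # later keypoints overwrite earlier ones
    return fmap

rng = np.random.default_rng(0)
kps = np.array([[10.2, 5.7], [31.9, 12.1]])            # (x, y) keypoint locations
descs = rng.random((2, 128), dtype=np.float32)         # e.g. 128-D descriptors
fmap = compose_sparse_feature_map(kps, descs, 48, 64)
assert fmap.shape == (48, 64, 128)
assert np.count_nonzero(fmap.any(axis=-1)) == 2        # only two occupied pixels
```

Stacking descriptors into an image-shaped tensor is what lets a convolutional UNet act as the reconstruction adversary.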

3.3 Joint adversarial training

The central component of engineering our content-concealing NinjaDesc is the joint adversarial training step, which is illustrated in Fig. 3 and elaborated as pseudo-code in Algorithm 1. We aim to minimize the trade-off between utility and privacy, the two competing objectives. Inspired by methods using adversarial learning [goodfellow2014generative, xie2017controllable, roy2019mitigating], we formulate the optimization of the utility and privacy trade-off as an adversarial learning process.

1: Initialize NinjaNet $\mathcal{E}_\Theta$ with Eqn. 2
2: Initialize the descriptor inversion model $\mathcal{R}_\Phi$ with Eqn. 5
3: Set the privacy parameter $\lambda$
4: for each training iteration do
5:     Compute $\mathcal{L}_{\text{util}}$ from the encoded anchor and positive patch descriptors.
6:     Extract sparse features on image $I$ with $\mathcal{E}_\Theta$, reconstruct the image with $\mathcal{R}_\Phi$, and compute $\mathcal{L}_{\text{recon}}$.
7:     Update the weights $\Theta$ of $\mathcal{E}$ by descending $\nabla_\Theta (\mathcal{L}_{\text{util}} - \lambda \mathcal{L}_{\text{recon}})$.
8:     Extract sparse features on $I$ with the updated $\mathcal{E}_\Theta$, reconstruct the image with $\mathcal{R}_\Phi$, and compute $\mathcal{L}_{\text{recon}}$.
9:     Update the weights $\Phi$ of $\mathcal{R}$ by descending $\nabla_\Phi \mathcal{L}_{\text{recon}}$.
10: end for
Algorithm 1: Pseudo-code for the joint adversarial training of NinjaDesc
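The alternating updates of Algorithm 1 can be sketched with toy stand-in networks. The loss terms below are placeholders; only the update structure mirrors the algorithm: the encoder minimizes its utility loss minus $\lambda$ times the adversary's reconstruction error, then the inversion model minimizes the reconstruction error with the encoder held fixed.

```python
import torch
import torch.nn as nn

# Minimal sketch of the alternating adversarial update; `lam` plays the
# role of the privacy parameter weighting concealment vs. utility.
torch.manual_seed(0)
encoder = nn.Linear(16, 16)    # stands in for NinjaNet (weights Theta)
inverter = nn.Linear(16, 16)   # stands in for the inversion model (weights Phi)
opt_theta = torch.optim.Adam(encoder.parameters(), lr=1e-3)
opt_phi = torch.optim.Adam(inverter.parameters(), lr=1e-3)
lam = 1.0

for step in range(10):
    x = torch.randn(32, 16)    # stands in for descriptor/image data
    # 1. Update the encoder: minimize utility loss minus lam * reconstruction loss.
    enc = encoder(x)
    utility = (enc.norm(dim=1) - 1).pow(2).mean()   # toy utility term
    recon = (inverter(enc) - x).pow(2).mean()       # adversary's error
    opt_theta.zero_grad()
    (utility - lam * recon).backward()
    opt_theta.step()
    # 2. Update the inversion model: minimize reconstruction loss alone,
    #    with the encoder output detached (encoder held fixed).
    recon = (inverter(encoder(x).detach()) - x).pow(2).mean()
    opt_phi.zero_grad()
    recon.backward()
    opt_phi.step()
```

The sign flip on the reconstruction term in step 1 is what turns the adversary's training signal into a concealment objective for the encoder.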

The objective of the descriptor inversion model $\mathcal{R}_\Phi$ is to minimize the reconstruction error over image data $\mathcal{D}$. On the other hand, NinjaNet aims to conceal the visual content by maximizing this error. Thus, the resulting objective for content concealment is a minimax game between the two:

$$\min_\Phi \max_\Theta \; \mathcal{L}_{\text{recon}}(\mathcal{D}; \Theta, \Phi). \quad (6)$$
At the same time, we wish to maintain the descriptor utility:

$$\min_\Theta \; \mathcal{L}_{\text{util}}(\Theta). \quad (7)$$
Figure 4: Qualitative results on landmark images. First column: original images overlaid with up to 1000 Harris corners [harris_corner] (red). Second column: reconstructions by the inversion model from raw SOSNet [tian2019sosnet] descriptors extracted at those points. The last five columns show reconstructions from NinjaDesc with increasing privacy parameter $\lambda$. The SSIM and PSNR w.r.t. the original images are shown above each reconstruction. Best viewed digitally. Image credits: first 3, Holidays dataset [jegou2008hamming]; last, laylamoran4battersea (Flickr).

This brings us to two separate optimization objectives, for $\mathcal{R}_\Phi$ and $\mathcal{E}_\Theta$, which we describe in the following. For the inversion model $\mathcal{R}_\Phi$, the objective remains the same as in Eqn. 6:

$$\Phi^* = \arg\min_\Phi \; \mathcal{L}_{\text{recon}}(\mathcal{D}; \Theta, \Phi). \quad (8)$$

However, to maintain utility, NinjaNet with weights $\Theta$ is also optimized with the utility loss from Eqn. 2. In conjunction with the maximization over $\Theta$ from Eqn. 6, the loss for NinjaNet becomes

$$\Theta^* = \arg\min_\Theta \; \big( \mathcal{L}_{\text{util}}(\Theta) - \lambda \, \mathcal{L}_{\text{recon}}(\mathcal{D}; \Theta, \Phi) \big), \quad (9)$$

where $\lambda$, the privacy parameter, controls how much $\mathcal{E}$ prioritizes content concealment over utility retention. In practice, we optimize $\Theta$ and $\Phi$ in an alternating manner, such that $\Theta$ is not optimized in Eqn. 8 and $\Phi$ is not optimized in Eqn. 9. The overall objective is then

$$\min_\Theta \Big( \mathcal{L}_{\text{util}}(\Theta) - \lambda \min_\Phi \mathcal{L}_{\text{recon}}(\mathcal{D}; \Theta, \Phi) \Big). \quad (10)$$
3.4 Implementation details

The code is implemented using PyTorch [paszke2019PyTorch]. We use Kornia [riba2020kornia]'s implementation of SIFT for GPU acceleration. For all training, we use the Adam [adam_optimizer] optimizer with momentum parameters $\beta_1$ and $\beta_2$.

Utility initialization. We use the liberty set of the UBC patches [goesele2007ubc] to train NinjaNet for 200 epochs and select the model with the lowest average FPR@95 on the other two sets (notredame and yosemite). The number of submodules in NinjaNet ($N$ in Fig. 2) is kept small, since we observed no improvement in FPR@95 from increasing $N$. The dropout rate is 0.1. We use a batch size of 1024 and a learning rate of 0.01.
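FPR@95, the model-selection metric used here, can be computed as follows. The sketch assumes descriptor distances for labelled matching and non-matching pairs; it is a generic implementation of the metric, not the authors' evaluation code.

```python
import numpy as np

def fpr_at_95(pos_dists, neg_dists):
    """False-positive rate at the distance threshold where 95% of the
    true matching pairs are accepted (lower is better)."""
    threshold = np.percentile(pos_dists, 95)    # accepts 95% of positives
    return float(np.mean(neg_dists <= threshold))

# Toy check: well-separated distance distributions give a low FPR@95.
rng = np.random.default_rng(0)
pos = rng.normal(0.5, 0.1, 1000)    # distances of matching pairs (smaller)
neg = rng.normal(1.5, 0.3, 1000)    # distances of non-matching pairs (larger)
assert 0.0 <= fpr_at_95(pos, neg) <= 1.0
```

A descriptor that concentrates matching pairs at small distances while pushing non-matching pairs away scores close to zero on this metric.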

Reconstruction initialization. We randomly split MegaDepth [li2018megadepth] into train / validation / test splits with ratio 0.6 / 0.1 / 0.3. The process of forming a feature map is the same as in [dangwal21], and we use up to 1000 Harris corners [harris_corner] for all experiments. We train the inversion model with a batch size of 64 and a learning rate of 1e-4 for a maximum of 200 epochs, and select the model with the best structural similarity (SSIM) on the validation split. We also do not use the discriminator from [dangwal21], since its convergence takes substantially longer and it improves the inversion model only very slightly.

Joint adversarial training. The dataset configurations for $\mathcal{E}$ and $\mathcal{R}$ are the same as in the above two steps, except the batch size, which is 968 for UBC patches. We use equal learning rates for $\mathcal{E}$ and $\mathcal{R}$: 5e-5 for SOSNet [tian2019sosnet] and HardNet [mishchuk2017hardnet], and 1e-5 for SIFT [lowe2004sift]. The NinjaDesc model with the best FPR@95 on the validation set within 20 epochs is selected for testing.

4 Experimental results

In this section, we evaluate NinjaDesc on the two criteria that guide its design — the ability to simultaneously achieve: (1) content concealment (privacy) and (2) utility (matching accuracy and camera localization performance).

4.1 Content concealment (Privacy)

We assess the content-concealing ability of NinjaDesc by measuring the reconstruction quality of descriptor inversion attacks. Here we assume the inversion model has access to NinjaDesc and the input RGB images for training, i.e. the image data $\mathcal{D}$ in Sec. 3.2. We train the inversion model from scratch on NinjaDesc (Eqn. 5) using the train split of MegaDepth [li2018megadepth], and the best model, with the highest SSIM on the validation split, is used for the evaluation.

Figure 5: HPatches evaluation results. We compare the baseline SOSNet [tian2019sosnet] vs. NinjaDesc with 5 different levels of the privacy parameter $\lambda$ (indicated by the number in parentheses). All results are from models trained on the liberty subset of the UBC patches [goesele2007ubc] dataset.

Recall from Eqn. 9 that $\lambda$ is the privacy parameter controlling how much NinjaDesc prioritizes privacy over utility. The intuition is that the higher $\lambda$ is, the more aggressively NinjaDesc impedes the reconstruction quality of the inversion model. We perform descriptor inversion on NinjaDesc variants trained with a range of $\lambda$ values to demonstrate the effect on reconstruction quality.

Metric     SOSNet (Raw)   NinjaDesc ($\lambda$)
                          0.001    0.01     0.1      0.25     1.0      2.5
MAE (↑)    0.104          0.117    0.125    0.129    0.162    0.183    0.212
SSIM (↓)   0.596          0.566    0.569    0.527    0.484    0.385    0.349
PSNR (↓)   17.904         18.037   16.826   17.821   17.671   13.367   12.010
Table 1: Quantitative results of descriptor inversion on SOSNet vs. NinjaDesc, evaluated on the MegaDepth [li2018megadepth] test split. The arrows indicate whether a higher / lower value is better for privacy. (Note that [dangwal21] reports only SSIM, and we do not share the same train / validation / test split. Also, [dangwal21] uses a discriminator loss for training, which we omit; this leads to a slight difference in SSIM.)

Fig. 4 shows qualitative results of descriptor inversion attacks when varying $\lambda$. We observe that $\lambda$ indeed fulfills the role of controlling how much NinjaDesc conceals the original image content. When $\lambda$ is small, the reconstruction is only slightly worse than that from the baseline SOSNet. As $\lambda$ increases, there is a visible deterioration in quality. Once equal or stronger weighting is given to privacy ($\lambda \geq 1$), little texture or structure is revealed, achieving high privacy.

This observation is also validated quantitatively by Table 1, where we see a drop in the performance of the inversion model as $\lambda$ increases across all three metrics: mean absolute error (MAE), structural similarity (SSIM), and peak signal-to-noise ratio (PSNR), computed between the reconstructed image and the original input image.
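The MAE and PSNR metrics used in this evaluation can be computed as below; for SSIM one would typically use an off-the-shelf implementation such as `skimage.metrics.structural_similarity`. This is a generic sketch of the metrics, not the authors' evaluation code.

```python
import numpy as np

def mae(a, b):
    """Mean absolute error between two images in [0, 1]."""
    return float(np.abs(a - b).mean())

def psnr(a, b, data_range=1.0):
    """Peak signal-to-noise ratio in dB (higher = more faithful)."""
    mse = ((a - b) ** 2).mean()
    return float("inf") if mse == 0 else float(10 * np.log10(data_range ** 2 / mse))

# Toy check: a noisier "reconstruction" scores worse on both metrics.
rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))
mild = np.clip(img + rng.normal(0, 0.01, img.shape), 0, 1)
heavy = np.clip(img + rng.normal(0, 0.2, img.shape), 0, 1)
assert mae(img, mild) < mae(img, heavy)
assert psnr(img, mild) > psnr(img, heavy)
```

Note the direction of "better" flips in this context: for privacy, the defender wants high MAE and low SSIM/PSNR on the attacker's reconstructions.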

4.2 Utility retention

We measure the utility of NinjaDesc via two tasks: image matching and visual localization.

Image matching. We evaluate NinjaDesc based on SOSNet [tian2019sosnet] with a set of privacy parameter values on the HPatches [balntas2017hpatches] benchmark, as shown in Fig. 5. NinjaDesc is comparable to SOSNet in mAP across all three tasks, especially verification and retrieval. Also, a higher privacy parameter generally corresponds to lower mAP, as the utility term becomes less dominant in Eqn. 9.

Figure 6: Illustration of our proposed adversarial descriptor learning framework's generalization across three different base descriptors. Top: We show two matching images. The two rows of small images to the right of each are reconstructions: the top and bottom rows are, respectively, from the raw descriptor and from the corresponding NinjaDesc. Bottom: We visualize the matches between the two images for raw descriptors vs. NinjaDesc for each of the three base descriptors. Image credits: left, Tatyana Gladskih; right, Urse Ovidiu (Wikimedia Commons, Public Domain).
Query   NNs   Method        Accuracy @ thresholds (%), each cell SOS / Hard / SIFT
                            (0.25m, 2°)          (0.5m, 5°)           (5m, 10°)
Day     20    Raw           85.1 / 85.4 / 84.3   92.7 / 93.1 / 92.7   97.3 / 98.2 / 97.6
              λ=0.1         85.4 / 84.7 / 82.0   92.5 / 91.9 / 91.1   97.5 / 96.8 / 96.4
              λ=1.0         84.7 / 84.3 / 82.9   92.4 / 91.9 / 91.0   97.2 / 96.7 / 96.1
              λ=2.5         84.6 / 83.7 / 82.5   92.4 / 92.0 / 91.0   97.1 / 96.8 / 96.0
        50    Raw           85.9 / 86.8 / 86.0   92.5 / 93.7 / 94.1   97.3 / 98.1 / 98.2
              λ=0.1         85.2 / 85.2 / 84.2   92.2 / 92.4 / 91.4   97.1 / 97.1 / 96.6
              λ=1.0         84.7 / 85.7 / 83.4   92.2 / 92.6 / 91.6   97.2 / 96.7 / 96.7
              λ=2.5         85.6 / 85.3 / 83.6   92.7 / 91.7 / 91.1   97.3 / 96.8 / 96.2
Night   20    Raw           49.2 / 52.4 / 50.8   60.2 / 62.3 / 62.3   68.1 / 72.3 / 72.8
              λ=0.1         47.6 / 43.5 / 44.0   57.1 / 54.5 / 51.3   63.4 / 61.8 / 61.3
              λ=1.0         45.5 / 44.5 / 41.4   56.0 / 51.8 / 52.9   61.8 / 60.2 / 62.3
              λ=2.5         45.0 / 44.5 / 43.5   55.0 / 54.5 / 49.7   61.8 / 61.3 / 61.3
        50    Raw           44.5 / 47.6 / 51.3   52.4 / 59.7 / 62.3   60.2 / 64.9 / 74.3
              λ=0.1         39.8 / 39.8 / 41.9   47.6 / 48.7 / 50.3   57.6 / 56.0 / 59.7
              λ=1.0         42.9 / 39.8 / 39.8   52.4 / 49.2 / 48.2   57.1 / 54.5 / 56.5
              λ=2.5         41.9 / 38.2 / 40.3   49.2 / 47.1 / 49.2   56.6 / 55.0 / 57.1
Table 2: Visual localization results on Aachen-Day-Night v1.1 [zhang2020aachenv_1_1]. 'Raw' corresponds to the base descriptor in each column, followed by three privacy parameter values (λ = 0.1, 1.0, 2.5) for NinjaDesc.

Visual localization. We evaluate NinjaDesc with three base descriptors (SOSNet [tian2019sosnet], HardNet [mishchuk2017hardnet], and SIFT [lowe2004sift]) on the Aachen-Day-Night v1.1 [sattler2018benchmarking, zhang2020aachenv_1_1] dataset using the Kapture [kapture2020] pipeline. We use AP-Gem [revaud2019aploss] for retrieval and localize with shortlist sizes of 20 and 50. The keypoint detector is DoG [lowe2004sift]. Table 2 shows the localization results. Again, we observe little drop in accuracy for NinjaDesc overall compared to the original base descriptors, from low ($\lambda = 0.1$) to high ($\lambda = 2.5$) privacy settings. Comparing our results on HardNet and SIFT with Table 3 of Dusmanu et al. [dusmanu2020privacy], NinjaDesc is noticeably better at retaining the visual localization accuracy of the base descriptors than the subspace descriptors in [dusmanu2020privacy]: e.g. the drop at night is up to 30% for HardNet in [dusmanu2020privacy], but substantially smaller for NinjaDesc. (Note that [dusmanu2020privacy] is evaluated on Aachen-Day-Night v1.0, resulting in higher night accuracy due to poor ground truth, and its code is not yet released; we also report our results on v1.0 in the supplementary.)

Hence, the results on both the image matching and visual localization tasks demonstrate that NinjaDesc retains the majority of the utility of its base descriptors.

5 Ablation studies

Table 2 already hints that our proposed adversarial descriptor learning framework generalizes to several base descriptors in terms of retaining utility. In this section, we further investigate the generalizability of our method through additional experiments on different types of descriptors, inversion network architectures, and scene categories.

5.1 Generalization to different descriptors

Base Descriptor   Raw (w/o NinjaNet)   NinjaDesc ($\lambda$)
                                       0.01    0.1     0.25    1.0     2.5
SOSNet            0.596                0.569   0.527   0.484   0.385   0.349
HardNet           0.582                0.545   0.516   0.399   0.349   0.312
SIFT              0.553                0.490   0.459   0.395   0.362   0.296
Table 3: Quantitative performance (SSIM) of the descriptor inversion model on the MegaDepth [li2018megadepth] test split, with three base descriptors and the corresponding NinjaDescs of varying privacy parameter.

We extend the same experiments performed on SOSNet [tian2019sosnet] in Table 1 to include HardNet [mishchuk2017hardnet] and SIFT [lowe2004sift] as well, and report SSIM in Table 3. Similar to the observation for SOSNet, increasing the privacy parameter reduces reconstruction quality for both HardNet and SIFT. In Fig. 6, we qualitatively show descriptor inversion and correspondence matching results across all three base descriptors. NinjaDesc derived from each of the three base descriptors is effective in concealing important content, such as people or landmarks, compared with the raw base descriptors. The visualization of keypoint correspondences between the images also demonstrates the utility retention of our proposed learning framework across different base descriptors.

5.2 Generalization to different architectures

Arch.      UNet                         UResNet
           Raw      λ=1.0    λ=2.5      Raw      λ=1.0    λ=2.5
MAE (↑)    0.104    0.183    0.212      0.121    0.190    0.202
SSIM (↓)   0.596    0.385    0.349      0.595    0.427    0.380
PSNR (↓)   17.904   13.367   12.010     16.533   12.753   12.299
Table 4: Reconstruction results on MegaDepth [li2018megadepth]. We compare the UNet used in this work vs. a different architecture, UResNet.

So far, all experiments have used the same architecture for the inversion model: the UNet [unet]-based network [dangwal21, invsfm]. To verify that NinjaDesc does not overfit to this specific architecture, we conduct a descriptor inversion attack using an inversion model with a drastically different architecture, called UResNet, which has a ResNet50 [he2016ResNet] encoder backbone and residual decoder blocks (see the supplementary material). The results are shown in Table 4: only SSIM improves slightly compared to UNet, whereas MAE and PSNR remain relatively unaffected. This result illustrates that our proposed method is not limited to a particular inversion-model architecture.

5.3 Content concealment on faces

We further show qualitative results on human faces using the Deepfake Detection Challenge (DFDC) [DFDC] dataset. Fig. 7 presents descriptor inversion results using the base descriptor (SOSNet [tian2019sosnet]) as well as our NinjaDesc with varying privacy parameter $\lambda$. Similar to what we observed in Fig. 4, we see progressive concealment of facial features as $\lambda$ increases, compared to the reconstruction from SOSNet.

Figure 7: Qualitative reconstruction results on faces. Images are cropped frames sampled from videos in the DFDC [DFDC] dataset.

6 Utility and privacy trade-off

We now describe two experiments we perform to further investigate the utility and privacy trade-off of NinjaDesc.

First, in Fig. 8 we evaluate the mean matching accuracy (MMA) of NinjaDesc at the highest privacy parameter ($\lambda = 2.5$), for both HardNet [mishchuk2017hardnet] and SIFT [lowe2004sift], on the HPatches sequences [balntas2017hpatches], and compare it with the sub-hybrid lifting method of Dusmanu et al. [dusmanu2020privacy] at a low privacy level (dim. 2). Even at a higher privacy level, NinjaDesc significantly outperforms sub-hybrid lifting for both descriptor types. The drop in MMA of NinjaDesc w.r.t. HardNet is also minimal, and MMA even increases w.r.t. SIFT.

Figure 8: Utility vs. privacy trade-off analyses. Top: Mean matching accuracy on HPatches [balntas2017hpatches] sequences; we compare NinjaDesc to sub-hybrid lifting (dim. 2) of Dusmanu et al. [dusmanu2020privacy]. Bottom: For each descriptor we select NinjaDesc with varying privacy parameter values (annotated next to the data points), and compare their utility relative to the raw descriptor vs. content concealment.

Second, in Fig. 8 we perform a detailed utility vs. privacy trade-off analysis on NinjaDesc for all three base descriptors. The utility axis shows the average difference in NinjaDesc's mAP across the three HPatches tasks in Fig. 5, and the privacy axis shows 1 − SSIM [dangwal21]. We plot the results while varying the privacy parameter. For SOSNet and HardNet, the drop in utility is an order of magnitude smaller than the gain in privacy, indicating a favorable trade-off. Interestingly, for SIFT we see a net gain in utility for all privacy parameter values (positive values on the utility axis). This is due to the SOSNet-like utility training, which improves the verification and retrieval performance of NinjaDesc beyond the handcrafted SIFT. Full HPatches results for HardNet and SIFT are in the supplementary.

7 Limitations

NinjaDesc only affects the descriptors, not the keypoint locations. Therefore, it does not prevent inferring scene structure from the patterns of the keypoint locations themselves [linecloud, Luo_2019_CVPR]. Also, some level of structure can still be revealed where keypoints are very dense, e.g., the Venetian blinds in the second example of Fig. 7.

8 Conclusions

We introduced a novel adversarial learning framework for visual descriptors that prevents reconstructing the original input image content from the descriptors. We experimentally validated that the resulting descriptors deteriorate descriptor inversion quality with only a marginal drop in utility on standard descriptor matching benchmarks. We also empirically demonstrated that our framework lets us control the trade-off between utility and non-invertibility by changing a single parameter that weighs the adversarial loss. Ablations with different types of visual descriptors and image reconstruction network architectures demonstrate the generalizability of our method. Our pipeline can enhance the security of computer vision systems that use visual descriptors, and has great potential to be extended to applications beyond local descriptor encoding. Our observations suggest that visual descriptors contain more information than is needed for matching, and that the adversarial learning process removes this surplus. This opens up a new opportunity in general representation learning: obtaining representations that carry only the information necessary for the target task, thereby preserving privacy.
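The single-parameter trade-off can be made concrete: the feature encoder minimizes its utility loss while being rewarded for a large reconstruction error, and the reconstructor minimizes that same reconstruction error. A schematic of the two competing objectives (variable names are illustrative, not our actual implementation):

```python
def encoder_loss(utility_loss, reconstruction_loss, privacy_weight):
    """Feature encoder objective: preserve matching utility while impeding
    reconstruction, so the adversarial term enters with a negative sign."""
    return utility_loss - privacy_weight * reconstruction_loss

def reconstructor_loss(reconstruction_loss):
    """Image reconstructor objective: recover the input, i.e. minimize the
    same reconstruction error that the encoder tries to maximize."""
    return reconstruction_loss
```

Increasing the privacy weight pushes the equilibrium toward descriptors that are harder to invert, at some cost in matching utility.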


Supplementary material

We first provide a comparison of our NinjaDesc and the base descriptor on the 3D reconstruction task using SfM (Sec. A). Next, we report the full HPatches results using HardNet [mishchuk2017hardnet] and SIFT [lowe2004sift] as the base descriptors (Sec. B). In addition to our results on Aachen-Day-Night v1.1 in the main paper, we also provide results on Aachen-Day-Night v1.0 (Sec. C). Finally, we illustrate the detailed architectures of the descriptor inversion models (Sec. D).

Appendix A 3D Reconstruction

Table 5 shows a quantitative comparison of our content-concealing NinjaDesc and the base descriptor SOSNet [tian2019sosnet] on the SfM reconstruction task, using the landmarks dataset for local feature benchmarking [schonberger2017comparative]. As can be seen, the decrease in performance for our content-concealing NinjaDesc is only marginal across all metrics.

| Dataset | Method | Reg. images | Sparse points | Observations | Track length | Reproj. error |
|---|---|---|---|---|---|---|
| South Building | SOSNet | 128 | 101,568 | 638,731 | 6.29 | 0.56 |
| | NinjaDesc (1.0) | 128 | 105,780 | 652,869 | 6.17 | 0.56 |
| | NinjaDesc (2.5) | 128 | 105,961 | 653,449 | 6.17 | 0.56 |
| Madrid Metropolis | SOSNet | 572 | 95,733 | 672,836 | 7.03 | 0.62 |
| | NinjaDesc (1.0) | 566 | 94,374 | 668,148 | 7.08 | 0.64 |
| | NinjaDesc (2.5) | 564 | 94,104 | 667,387 | 7.09 | 0.63 |
| Gendarmenmarkt | SOSNet | 1076 | 246,503 | 1,660,694 | 6.74 | 0.74 |
| | NinjaDesc (1.0) | 1087 | 312,469 | 1,901,060 | 6.08 | 0.75 |
| | NinjaDesc (2.5) | 1030 | 340,144 | 1,871,726 | 5.50 | 0.77 |
| Tower of London | SOSNet | 825 | 200,447 | 1,733,994 | 8.65 | 0.62 |
| | NinjaDesc (1.0) | 797 | 198,767 | 1,727,785 | 8.69 | 0.62 |
| | NinjaDesc (2.5) | 837 | 218,888 | 1,792,908 | 8.19 | 0.64 |

Table 5: 3D reconstruction statistics on the local feature evaluation benchmark [schonberger2017comparative]. The number in parentheses is the privacy parameter.

Appendix B Full HPatches results for HardNet and SIFT

Figure 9 illustrates our full evaluation results on HPatches using HardNet [mishchuk2017hardnet] and SIFT [lowe2004sift] as the base descriptors for NinjaDesc, in addition to the results using SOSNet [tian2019sosnet] provided in the main paper. Similar to the results for SOSNet, we observe little drop in accuracy for NinjaDesc overall compared to the original base descriptors, across the range from low to high privacy parameter settings.

(a) HardNet Base Descriptor
(b) SIFT Base Descriptor
Figure 9: HPatches evaluation results. For each base descriptor (HardNet [mishchuk2017hardnet] and SIFT [lowe2004sift]), we compare against NinjaDesc at 5 different levels of the privacy parameter (indicated by the number in parentheses). All results are from models trained on the liberty subset of the UBC patches [goesele2007ubc] dataset, apart from SIFT, which is handcrafted; for SIFT we use the Kornia [riba2020kornia] GPU implementation evaluated on patches.

Appendix C Evaluation on Aachen-Day-Night v1.0

In Table 2 of the main paper, we report the results of NinjaDesc on the Aachen-Day-Night v1.1 dataset. v1.1 has more accurate ground truth than the older v1.0. Because Dusmanu et al. [dusmanu2020privacy] evaluated on v1.0, we also provide our results on v1.0 in Table 6 for a better comparison.

Accuracy @ thresholds (%), reported as SOS / Hard / SIFT:

| Query | NNs | Method | 0.25 m, 2° | 0.5 m, 5° | 5 m, 10° |
|---|---|---|---|---|---|
| Day | 20 | Raw | 85.1 / 85.4 / 84.3 | 92.7 / 93.1 / 92.7 | 97.3 / 98.2 / 97.6 |
| | | NinjaDesc (0.1) | 85.4 / 84.7 / 82.0 | 92.5 / 91.9 / 91.1 | 97.5 / 96.8 / 96.4 |
| | | NinjaDesc (1.0) | 84.7 / 84.3 / 82.9 | 92.4 / 91.9 / 91.0 | 97.2 / 96.7 / 96.1 |
| | | NinjaDesc (2.5) | 84.6 / 83.7 / 82.5 | 92.4 / 92.0 / 91.0 | 97.1 / 96.8 / 96.0 |
| | 50 | Raw | 85.9 / 86.8 / 86.0 | 92.5 / 93.7 / 94.1 | 97.3 / 98.1 / 98.2 |
| | | NinjaDesc (0.1) | 85.2 / 85.2 / 84.2 | 92.2 / 92.4 / 91.4 | 97.1 / 97.1 / 96.6 |
| | | NinjaDesc (1.0) | 84.7 / 85.7 / 83.4 | 92.2 / 92.6 / 91.6 | 97.2 / 96.7 / 96.7 |
| | | NinjaDesc (2.5) | 85.6 / 85.3 / 83.6 | 92.7 / 91.7 / 91.1 | 97.3 / 96.8 / 96.2 |
| Night | 20 | Raw | 51.0 / 57.2 / 55.1 | 65.3 / 68.4 / 67.3 | 70.4 / 76.5 / 74.5 |
| | | NinjaDesc (0.1) | 51.0 / 45.9 / 45.9 | 62.2 / 56.1 / 54.1 | 68.4 / 62.2 / 63.3 |
| | | NinjaDesc (1.0) | 50.0 / 43.9 / 44.9 | 62.2 / 54.1 / 56.1 | 66.3 / 62.2 / 64.3 |
| | | NinjaDesc (2.5) | 48.0 / 44.9 / 44.9 | 58.2 / 59.2 / 52.0 | 65.3 / 65.3 / 62.2 |
| | 50 | Raw | 48.0 / 51.0 / 54.1 | 59.2 / 64.3 / 65.3 | 65.3 / 68.4 / 74.5 |
| | | NinjaDesc (0.1) | 41.8 / 39.8 / 41.8 | 52.0 / 51.0 / 52.0 | 60.2 / 56.1 / 60.2 |
| | | NinjaDesc (1.0) | 43.9 / 39.8 / 43.9 | 54.1 / 50.0 / 54.1 | 63.3 / 58.2 / 63.3 |
| | | NinjaDesc (2.5) | 42.9 / 40.8 / 42.9 | 52.0 / 50.0 / 52.0 | 61.2 / 56.1 / 58.2 |

Table 6: Visual localization results on Aachen-Day-Night v1.0 [sattler2018benchmarking]. ‘Raw’ corresponds to the base descriptor in each column (SOS = SOSNet, Hard = HardNet), followed by NinjaDesc at three privacy parameter values (0.1, 1.0, 2.5).

Appendix D Detailed architectures of the descriptor inversion models

UNet. The architecture of the UNet-based descriptor inversion model, which is also used in [dangwal21, invsfm], is shown in Figure 10.

UResNet. Figure 11 illustrates the architecture of the descriptor inversion model based on UResNet, used for the ablation study in Section 5.2 of the main paper. The overall “U” shape of UResNet is similar to UNet, but each convolution block is drastically different. We use the 5 stages of ResNet50 [he2016ResNet] (pretrained on ImageNet [deng2009ImageNet]), {conv1, conv2_x, conv3_x, conv4_x, conv5_x}, as the 5 encoding / down-sampling blocks, except that for conv2_x we remove the MaxPool2d so that each encoding block corresponds to a 1/2 down-sampling in resolution. Since ResNet50 takes an RGB image as input, whereas the sparse feature maps have a different number of channels, we pre-process the input with 4 additional basic residual blocks, denoted res_conv_block in Figure 11. The up-sampling decoder blocks (denoted up_conv) are also residual blocks, with an additional input up-sampling layer using bilinear interpolation. In contrast to UNet, the skip connections in our UResNet are performed by additions rather than concatenations.
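As an illustration of the decoder’s up-sampling and additive skip connection, the following NumPy sketch mimics 2x bilinear up-sampling (in the align_corners=False convention) followed by an additive skip. The real up_conv blocks are convolutional residual blocks, so this shows only the interpolate-and-add part:

```python
import numpy as np

def bilinear_upsample2x(x):
    """2x bilinear up-sampling of an (H, W, C) feature map,
    sampling at pixel centers (align_corners=False convention)."""
    h, w, _ = x.shape
    ys = (np.arange(2 * h) + 0.5) / 2 - 0.5   # source row coordinates
    xs = (np.arange(2 * w) + 0.5) / 2 - 0.5   # source column coordinates
    y0f, x0f = np.floor(ys), np.floor(xs)
    wy = (ys - y0f)[:, None, None]            # vertical weights
    wx = (xs - x0f)[None, :, None]            # horizontal weights
    y0 = np.clip(y0f.astype(int), 0, h - 1)
    y1 = np.clip(y0f.astype(int) + 1, 0, h - 1)
    x0 = np.clip(x0f.astype(int), 0, w - 1)
    x1 = np.clip(x0f.astype(int) + 1, 0, w - 1)
    top = x[np.ix_(y0, x0)] * (1 - wx) + x[np.ix_(y0, x1)] * wx
    bot = x[np.ix_(y1, x0)] * (1 - wx) + x[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

def up_block(x, skip):
    """Decoder step: up-sample, then merge the encoder feature by addition
    (additive skip connection, in contrast to UNet's concatenation)."""
    return bilinear_upsample2x(x) + skip
```

The additive merge keeps the channel count unchanged across the skip connection, whereas UNet’s concatenation doubles it.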

Figure 10: UNet Architecture.
Figure 11: UResNet Architecture.