Localization with Sampling-Argmax

10/17/2021
by   Jiefeng Li, et al.
Shanghai Jiao Tong University

Soft-argmax operation is commonly adopted in detection-based methods to localize the target position in a differentiable manner. However, training the neural network with soft-argmax makes the shape of the probability map unconstrained. Consequently, the model lacks pixel-wise supervision through the map during training, leading to performance degradation. In this work, we propose sampling-argmax, a differentiable training method that imposes implicit constraints to the shape of the probability map by minimizing the expectation of the localization error. To approximate the expectation, we introduce a continuous formulation of the output distribution and develop a differentiable sampling process. The expectation can be approximated by calculating the average error of all samples drawn from the output distribution. We show that sampling-argmax can seamlessly replace the conventional soft-argmax operation on various localization tasks. Comprehensive experiments demonstrate the effectiveness and flexibility of the proposed method. Code is available at https://github.com/Jeff-sjtu/sampling-argmax


Code Repositories

sampling-argmax

Code for "Localization with Sampling-Argmax", NeurIPS 2021



1 Introduction

Localizing the target position from the input is a fundamental task in the field of computer vision. Common approaches to localization can be divided into two categories: regression-based and detection-based. Detection-based methods show superiority over regression-based methods and demonstrate impressive performance on a wide variety of tasks Zhou et al. (2017); Sun et al. (2018); Zhang et al. (2018); He et al. (2019a); Lee et al. (2019); Honari et al. (2018); Li et al. (2019); Joung et al. (2020); Spezialetti et al. (2020); Li et al. (2021b); Shi et al. (2021). Probability maps (also referred to as heat maps) are predicted in detection-based methods to indicate the likelihood of the target position. The position with the highest probability is retrieved from the probability map with the argmax operation. However, the argmax operation is not differentiable and suffers from quantization error. For accurate localization and end-to-end learning, soft-argmax Goroshin et al. (2015); Finn et al. (2016) is proposed as an approximation of argmax. It has found a wide range of applications in human pose estimation Sun et al. (2018); Luvizon et al. (2018, 2019); Wang et al. (2020), facial landmark localization Honari et al. (2018); Liu et al. (2019); Chandran et al. (2020), stereo matching Zhou et al. (2017); Kendall et al. (2017); Duggal et al. (2019) and object keypoint estimation Shi et al. (2021).

Nevertheless, the mechanism of training networks with soft-argmax is rarely studied. The conventional training strategy is to minimize the error between the output coordinate from soft-argmax and the ground truth position. However, this strategy is deficient since it only constrains the expectation of the probability map, not its shape. As shown in Figure 1, the two maps have the same mean value, but the bottom one is more concentrated. In well-calibrated probability maps, positions located closer to the ground truth have higher probabilities. Such maps can provide reliable confidence scores for the localization results, which is essential in unconstrained real-world applications and downstream tasks. Besides, imposing constraints on the probability map can provide supervised pixel-wise gradients and facilitate the learning process.

Prior work Nibali et al. (2018) attempts to shape the probability map by introducing hand-crafted regularizations. The variance regularization encourages the variance of the probability map to get close to a pre-defined variance. The Gaussian regularization forces the probability map to resemble a Gaussian distribution. We argue that these variants are overconstrained. The hand-crafted constraints are not always correct in different cases. For example, the underlying shape of the probability map is not necessarily Gaussian, and the underlying variance might change as the input changes. Forcing the model to learn a fixed-variance Gaussian distribution might degrade the model performance.

Figure 1: Top: an unconstrained probability map. Bottom: a well-calibrated probability map. These two maps have different shapes but the same mean value.

In this work, we present sampling-argmax, a novel training method to obtain well-calibrated probability maps and improve the localization accuracy. To constrain the shape of the map, we replace the objective function of minimizing “the error of the expectation” with minimizing “the expectation of the error”. In this way, the network is encouraged to generate higher probabilities around the ground truth position.

A natural way to estimate the expectation is by calculating the probability-weighted sum of the errors at all grid positions. However, we find that the gradient has high variance, and the model is hard to train. To address this issue, we choose to approximate the expectation by sampling. The expectation of the error is calculated as the mean error of all samples. Therefore, the sampling process should be differentiable for end-to-end learning.

In our work, we show that the likelihood of the target position can be modelled in the continuous space with a mixture distribution. Samples can be drawn from the mixture distribution by three steps: i) generate categorical weights from the probability map; ii) draw samples from sub-distributions; iii) obtain a sample by the category-weighted sum. The benefit of using mixture distribution is that differentiable sampling from arbitrary continuous distributions can be resolved by differentiable sampling from categorical distributions, which is less challenging and can be addressed by off-the-shelf discrete sampling methods.

Sampling-argmax is simple and effective. With out-of-the-box settings, it can be integrated into methods that use the soft-argmax operation. To study its effectiveness, we conduct experiments on a variety of localization tasks. Quantitative results demonstrate the superiority of sampling-argmax over soft-argmax and its variants. In summary, the contributions of this work are threefold:

  • We propose sampling-argmax for improving detection-based localization methods. By minimizing “the expectation of the error”, the network generates well-calibrated probability maps and obtains higher localization accuracy.

  • We show the output likelihood can be formulated as a mixture distribution and develop a differentiable sampling pipeline.

  • Comprehensive experiments show that sampling-argmax is effective and can be flexibly generalized to different localization tasks.

2 Preliminary

Given a learned discrete probability map $\pi$, the value $\pi_{y_i}$ indicates the probability of the predicted target appearing at $y_i$. A direct way to localize the target is taking the position with the maximum likelihood. However, this approach is non-differentiable, and the output is discrete, which impedes end-to-end training and brings quantization errors. Soft-argmax is an elegant approximation to address these issues:

$$\hat{y} = \text{soft-argmax}(\pi) = \sum_i \pi_{y_i}\, y_i. \tag{1}$$

Notice that $\pi$ is a normalized distribution and the soft-argmax operation calculates the probability-weighted sum, which is equivalent to taking the expectation of the probability map, $\mathbb{E}_{y \sim \pi}[y]$. A conventional way to train the model with the soft-argmax operation is minimizing the distance between the expectation and the ground truth:

$$\mathcal{L} = d\big(y_t,\; \mathbb{E}_{y \sim \pi}[y]\big), \tag{2}$$

where $y_t$ denotes the ground truth position and $d(\cdot, \cdot)$ denotes the distance function, e.g. an $\ell_p$ distance such as $\ell_1$. We refer to this objective function as “the error of the expectation”.
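As a minimal sketch of Equations 1 and 2, the snippet below contrasts hard argmax with soft-argmax on a toy one-dimensional map. The grid, the probabilities, the target and the choice of an L1 distance are illustrative assumptions, not values from the paper.

```python
def argmax_position(prob_map, grid):
    """Hard argmax: position of the most probable grid cell (not differentiable)."""
    best_index = max(range(len(prob_map)), key=lambda i: prob_map[i])
    return grid[best_index]


def soft_argmax(prob_map, grid):
    """Equation 1: probability-weighted sum of the grid positions."""
    return sum(p * y for p, y in zip(prob_map, grid))


def error_of_expectation(prob_map, grid, y_true):
    """Equation 2 with an L1 distance (an assumed choice of d)."""
    return abs(y_true - soft_argmax(prob_map, grid))


grid = [0.0, 1.0, 2.0, 3.0, 4.0]           # toy grid positions
prob_map = [0.05, 0.15, 0.50, 0.25, 0.05]  # toy normalized map
y_true = 2.1                               # toy sub-pixel target

hard = argmax_position(prob_map, grid)     # snaps to a grid point
soft = soft_argmax(prob_map, grid)         # can land between grid points
loss = error_of_expectation(prob_map, grid, y_true)
```

In this toy case the hard argmax is stuck on the grid point 2.0, while the soft-argmax output can match the sub-pixel target, which illustrates the quantization issue described above.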

3 Method

The conventional detection-based method with soft-argmax only supervises the expectation of the probability map. The shape of the distribution remains unconstrained. In well-calibrated probability maps, the positions closer to the ground truth should have higher probabilities. To this end, we propose a new objective function that optimizes “the expectation of the error” instead of “the error of the expectation”. In particular, the objective function is formulated as:

$$\mathcal{L} = \mathbb{E}_{y}\big[d(y_t, y)\big]. \tag{3}$$

The learned distribution tends to allocate high probabilities around the ground truth to minimize the entire loss. In this way, the shape of the probability map is implicitly constrained.

Discrete Distribution.

The probability map predicted by the neural network is discrete. Similar to the soft-argmax operation, the expectation of the error can be approximated by calculating the probability-weighted sum of the errors at all grid positions:

$$\mathcal{L} = \mathbb{E}_{y}\big[d(y_t, y)\big] \approx \sum_i \pi_{y_i}\, d(y_t, y_i). \tag{4}$$

This approximation treats the distribution of the target position as a discrete distribution. The target only appears at the grid positions, i.e. at position $y_i$ with the probability $\pi_{y_i}$.
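To make the difference between the two objectives concrete, the sketch below evaluates “the error of the expectation” and the discrete “expectation of the error” on two toy maps that share the same mean. The maps, the grid and the L1 distance are illustrative assumptions.

```python
def soft_argmax(prob_map, grid):
    """Probability-weighted sum of grid positions."""
    return sum(p * y for p, y in zip(prob_map, grid))


def error_of_expectation(prob_map, grid, y_true):
    """Distance between the expected position and the target."""
    return abs(y_true - soft_argmax(prob_map, grid))


def expectation_of_error(prob_map, grid, y_true):
    """Probability-weighted sum of the per-position errors."""
    return sum(p * abs(y_true - y) for p, y in zip(prob_map, grid))


grid = [0.0, 1.0, 2.0, 3.0, 4.0]
spread = [0.4, 0.1, 0.0, 0.1, 0.4]   # mass far from the target, mean 2.0
peaked = [0.0, 0.1, 0.8, 0.1, 0.0]   # mass near the target, mean 2.0
y_true = 2.0

eoe_spread = error_of_expectation(spread, grid, y_true)   # about 0
eoe_peaked = error_of_expectation(peaked, grid, y_true)   # about 0
exp_spread = expectation_of_error(spread, grid, y_true)   # about 1.8
exp_peaked = expectation_of_error(peaked, grid, y_true)   # about 0.2
```

The first objective cannot tell the two maps apart, whereas the second one penalizes the spread-out map, which is the implicit shape constraint the text refers to.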

However, because the underlying target lies in a continuous space, modelling the distribution as a discrete distribution is not accurate. The probability map has limited resolution due to the computational complexity. Besides, we find that the model is slow to converge when trained with Equation 4: it only obtains 30.9 mAP on COCO Keypoint, while conventional soft-argmax obtains 64.5 mAP. For analysis, we derive the gradient of the loss function with respect to the model parameters $\theta$ under the discrete approximation:

$$\nabla_\theta \mathcal{L} = \sum_i d(y_t, y_i)\, \nabla_\theta \pi_{y_i} = \sum_i \pi_{y_i}\, d(y_t, y_i)\, \nabla_\theta \log \pi_{y_i} = \mathbb{E}_{y \sim \pi}\big[d(y_t, y)\, \nabla_\theta \log \pi_{y}\big]. \tag{5}$$

Notice that the form of the gradient is similar to the score function estimator (SF), which is alternatively called the REINFORCE estimator Williams (1992). The SF estimator is known to have very high variance and to converge slowly. Therefore, using the discrete approximation for training is not a good solution. This challenge prompts us to explore a better approximation to calculate the expectation of the error.
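The score-function form of this gradient can be checked numerically. The sketch below assumes, purely for illustration, that the probability map is a softmax over logits (the parameters here) and that the distance is L1. It computes the exact gradient of the discrete objective, then estimates the same gradient from single-sample score-function terms to show how noisy they are.

```python
import math
import random


def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_error(logits, grid, y_true):
    """Discrete expectation of the error (probability-weighted L1 errors)."""
    probs = softmax(logits)
    return sum(p * abs(y_true - y) for p, y in zip(probs, grid))


def exact_gradient(logits, grid, y_true):
    """Gradient w.r.t. the logits: pi_k * (d_k - L) for a softmax map."""
    probs = softmax(logits)
    errors = [abs(y_true - y) for y in grid]
    loss = sum(p * e for p, e in zip(probs, errors))
    return [p * (e - loss) for p, e in zip(probs, errors)]


def score_function_sample(logits, grid, y_true, rng):
    """One-sample estimate d(y_t, y_i) * grad log pi_i with i drawn from pi."""
    probs = softmax(logits)
    i = rng.choices(range(len(grid)), weights=probs)[0]
    err = abs(y_true - grid[i])
    # For a softmax, d/dz_k log pi_i = 1[i == k] - pi_k.
    return [err * ((1.0 if k == i else 0.0) - probs[k]) for k in range(len(grid))]


grid = [0.0, 1.0, 2.0, 3.0, 4.0]
logits = [0.2, -0.1, 0.5, 0.0, -0.3]   # toy parameters
y_true = 1.3
rng = random.Random(0)

grad = exact_gradient(logits, grid, y_true)
samples = [score_function_sample(logits, grid, y_true, rng) for _ in range(20000)]
mc_mean = [sum(s[k] for s in samples) / len(samples) for k in range(len(grid))]
mc_var = [sum((s[k] - mc_mean[k]) ** 2 for s in samples) / len(samples)
          for k in range(len(grid))]
```

The averaged estimates approach the exact gradient, but the per-sample variance in `mc_var` is large relative to the gradient magnitude, consistent with the slow convergence noted above.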

In the following parts, we present sampling-argmax to estimate the expectation of the error by sampling. We first develop a continuous approximation to the distribution of the target position (Section 3.1). Then we propose a differentiable sampling method (Section 3.2).

3.1 Continuous Mixture Distribution

A differentiable process is necessary to estimate the expectation by sampling. However, since the underlying probability density functions can vary among different input images, it is challenging to draw samples from arbitrary distributions differentiably. In this work, we present a unified method by formulating the target distribution as a mixture distribution.

Let $p(y)$ denote the underlying density function of the target position, which is defined within the boundary of the input image, i.e. $y \in [0, W]$, where $W$ is the extent of the input. As illustrated in Figure 2(a), the interval can be divided into $n$ subintervals. The density function can be partitioned into $n$ shapes in the subintervals. We can use regular shapes (rectangles, triangles, Gaussian functions) in the subintervals to form the entire function (as illustrated in Figure 2(b-c)).

Figure 2: Representing the continuous distribution as a mixture distribution. (a) The original probability density function can be viewed as the sum of $n$ sub-functions. Each sub-function can be replaced by standard density functions with proper weights to approximate the original function. (b) Approximate the original function by replacing the sub-functions with uniform distributions. (c) Approximate the original function by replacing the sub-functions with triangular distributions, which is equivalent to the linear interpolation of the discrete weights.

Formally, given a finite set of probability density functions $\{f_1(y), \dots, f_n(y)\}$ and weights $\{w_1, \dots, w_n\}$ such that $w_i \ge 0$ and $\sum_i w_i = 1$, the mixture density function is formulated as a sum:

$$p(y) = \sum_{i=1}^{n} w_i\, f_i(y). \tag{6}$$

Here, we can leverage the discrete probability map to represent the mixture weights, i.e. $w_i = \pi_{y_i}$. In the context of signal processing, the original function can be perfectly reconstructed if the sample rate (the distance between two adjacent grid points) satisfies the Nyquist-Shannon sampling theorem. However, in our case, the sub-function must be a probability density function, i.e. it has non-negative values, and its integral over the entire space is equal to $1$. Therefore, with these restrictions, the original function cannot be perfectly reconstructed. For approximation, we study three different types of standard density functions below.

Uniform Basis.

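For the uniform basis, the sub-function $f_i$ is a uniform distribution centred at the position $y_i$:

$$f_i(y) = \begin{cases} \dfrac{1}{a}, & |y - y_i| \le \dfrac{a}{2}, \\ 0, & \text{otherwise}, \end{cases} \tag{7}$$

where $a$ is the distance between two adjacent grid points.

Triangular Basis.

For the triangular basis, the sub-function $f_i$ is a triangular distribution:

$$f_i(y) = \frac{1}{a} \max\left(0,\; 1 - \frac{|y - y_i|}{a}\right). \tag{8}$$

For all $y$, there exist grid points $y_i$ and $y_{i+1}$ that satisfy $y_i \le y \le y_{i+1}$. Therefore, we have $p(y) = \frac{1}{a}\left[\pi_{y_i} \frac{y_{i+1} - y}{a} + \pi_{y_{i+1}} \frac{y - y_i}{a}\right]$, which is, up to the constant factor $1/a$, the linear interpolation of $\pi_{y_i}$ and $\pi_{y_{i+1}}$. In other words, using triangular bases is equivalent to the linear interpolation of the discrete probability map.

Gaussian Basis.

For the Gaussian basis, $f_i$ is the Gaussian function:

$$f_i(y) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(y - y_i)^2}{2\sigma^2}\right), \tag{9}$$

where $\sigma$ denotes the standard deviation, which is set to a fixed default value in the experiments.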
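The three bases can be sketched as plain functions and plugged into the mixture density. In the snippet below, the uniform basis is assumed to span exactly one grid cell and the Gaussian standard deviation is set equal to the grid spacing; both are illustrative assumptions, as are the toy map and the helper names.

```python
import math


def uniform_basis(y, center, a):
    """Uniform density of width a centred at `center` (assumed width)."""
    return 1.0 / a if abs(y - center) <= a / 2 else 0.0


def triangular_basis(y, center, a):
    """Tent-shaped density with support [center - a, center + a]."""
    return max(0.0, 1.0 - abs(y - center) / a) / a


def gaussian_basis(y, center, sigma):
    """Gaussian density with standard deviation sigma."""
    z = (y - center) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_density(y, prob_map, grid, basis, scale):
    """Mixture density: probability-map weights times basis densities."""
    return sum(w * basis(y, c, scale) for w, c in zip(prob_map, grid))


def integrate(fn, lo, hi, steps=20000):
    """Midpoint-rule integral, used only to sanity-check normalization."""
    h = (hi - lo) / steps
    return sum(fn(lo + (k + 0.5) * h) for k in range(steps)) * h


grid = [0.0, 1.0, 2.0, 3.0, 4.0]
a = 1.0                                    # spacing between adjacent grid points
prob_map = [0.05, 0.15, 0.50, 0.25, 0.05]

areas = {
    name: integrate(lambda y, b=basis: mixture_density(y, prob_map, grid, b, a),
                    -10.0, 14.0)
    for name, basis in [("uniform", uniform_basis),
                        ("triangular", triangular_basis),
                        ("gaussian", gaussian_basis)]
}
midpoint_value = mixture_density(1.5, prob_map, grid, triangular_basis, a)
```

With unit spacing, `midpoint_value` equals the average of the two neighbouring weights, which is the linear-interpolation property of the triangular basis.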

3.2 Differentiable Sampling

In this part, we present how to draw a sample from the mixture distribution. We first study the non-differentiable process and then present the differentiable approximation.

Non-differentiable Process.

As illustrated in Figure 3(a), the non-differentiable sampling process can be divided into two steps: i) determine which sub-distribution the sample comes from; ii) draw a sample from the selected sub-distribution. In the first step, the sub-distribution can be selected by drawing a random variable from a categorical distribution. The categorical distribution is indicated by the predicted probability map $\pi$. The sub-distribution $f_i$ is chosen with the probability $\pi_{y_i}$. There are a number of methods to draw samples from the categorical distribution. Here, we introduce the Gumbel-Max trick Gumbel (1954); Maddison et al. (2014):

$$\hat{w} = \text{one\_hot}\Big(\arg\max_i \big[g_i + \log \pi_{y_i}\big]\Big), \tag{10}$$

where $g_1, \dots, g_n$ are i.i.d. samples drawn from Gumbel(0, 1), and the sample $\hat{w}$ is a one-hot vector with the value $1$ in the maximum categorical column.

In the second step, sampling from the standard basis function is easy to implement. This step is independent of the predicted probability map $\pi$. Therefore, the key to differentiable sampling from the mixture distribution is to make the first step differentiable.
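The two-step, non-differentiable process can be sketched as follows. The example assumes a triangular basis and a strictly toy probability map; the function names are hypothetical and the seed is fixed only to make the run repeatable.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) variate by inverse transform sampling."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep log() finite
    return -math.log(-math.log(u))


def gumbel_max_index(prob_map, rng):
    """Step i: pick a sub-distribution via argmax of log-probability plus noise."""
    scores = [math.log(p) + sample_gumbel(rng) if p > 0 else -math.inf
              for p in prob_map]
    return max(range(len(scores)), key=lambda i: scores[i])


def sample_mixture(prob_map, grid, a, rng):
    """Step ii: draw from the selected triangular sub-distribution."""
    i = gumbel_max_index(prob_map, rng)
    return grid[i] + rng.triangular(-a, a, 0.0)


grid = [0.0, 1.0, 2.0, 3.0, 4.0]
a = 1.0
prob_map = [0.05, 0.15, 0.50, 0.25, 0.05]
rng = random.Random(0)

counts = [0] * len(grid)
for _ in range(20000):
    counts[gumbel_max_index(prob_map, rng)] += 1
freqs = [c / 20000 for c in counts]          # should approximate prob_map

draws = [sample_mixture(prob_map, grid, a, rng) for _ in range(1000)]
```

The argmax makes the selected index a piecewise-constant function of the probability map, so no useful gradient reaches the map through this procedure.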

Figure 3: Illustration of the sampling process. (a) The non-differentiable process: i) select a sub-distribution by categorical sampling; ii) draw samples from the selected sub-distribution. (b) The differentiable process: i) approximate the categorical sampled weights by Gumbel-softmax; ii) draw samples from all sub-distributions; iii) add all samples together with the sampled weights. Reparameterization allows gradients to flow from the sample to the probability map.

Differentiable Process.

The differentiable sampling process consists of three steps. In the first step, we adopt the Gumbel-softmax Jang et al. (2017) operation to sample the categorical weight from the probability map. Gumbel-softmax is a continuous and differentiable approximation of the Gumbel-Max trick. We can obtain a point $\hat{w}$ on the $(n-1)$-dimensional simplex:

$$\hat{w}_i = \frac{\exp\big((g_i + \log \pi_{y_i}) / \tau\big)}{\sum_{k=1}^{n} \exp\big((g_k + \log \pi_{y_k}) / \tau\big)}, \tag{11}$$

where $\hat{w} = [\hat{w}_1, \dots, \hat{w}_n]$ and $\hat{w}_i$ denotes the sampled weight of the sub-distribution $f_i$. As the softmax temperature $\tau$ approaches $0$, $\hat{w}$ becomes one-hot, and its distribution becomes identical to the categorical distribution $\pi$.
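A small sketch of the Gumbel-softmax weights and of the effect of the temperature is given below. The noise vector is hard-coded for repeatability rather than drawn from Gumbel(0, 1), and the probability map and temperatures are illustrative assumptions.

```python
import math


def gumbel_softmax(prob_map, noise, tau):
    """Softmax of (log-probability + noise) / tau; requires positive probabilities."""
    scores = [(math.log(p) + g) / tau for p, g in zip(prob_map, noise)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


prob_map = [0.05, 0.15, 0.50, 0.25, 0.05]
noise = [0.3, -0.2, 1.1, 0.5, -0.7]   # stand-in for Gumbel(0, 1) draws

warm = gumbel_softmax(prob_map, noise, tau=5.0)    # smooth weights
cold = gumbel_softmax(prob_map, noise, tau=0.05)   # nearly one-hot weights
winner = max(range(len(cold)), key=lambda i: cold[i])
```

At the high temperature the weights stay spread over several categories, while at the low temperature almost all mass sits on the category that Gumbel-Max would have selected.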

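In the second step, we draw a sample $\tilde{y}_i$ from every sub-distribution $f_i$. Note that the sampled weight is not completely one-hot. Therefore, we obtain the final sample $\tilde{y}$ in the third step by adding all samples together with the sampled weight $\hat{w}$:

$$\tilde{y} = \sum_{i=1}^{n} \hat{w}_i\, \tilde{y}_i. \tag{12}$$

This process is illustrated in Figure 3(b). With the reparameterization trick, the sample $\tilde{y}$ is computed as a deterministic function of the probability map and the independent random variables. The randomness of the sampling process is transferred to the variable $\epsilon$. We denote the sampling process as $\tilde{y} = s(\pi, \epsilon)$, where $\epsilon$ follows the multivariate Gumbel(0, 1) distribution. The gradient from the expected error to the model parameters $\theta$ is derived as:

$$\nabla_\theta\, \mathbb{E}_{\tilde{y}}\big[d(y_t, \tilde{y})\big] = \nabla_\theta\, \mathbb{E}_{\epsilon}\big[d(y_t, s(\pi, \epsilon))\big] = \mathbb{E}_{\epsilon}\left[\frac{\partial d}{\partial \tilde{y}}\, \frac{\partial \tilde{y}}{\partial \pi}\, \frac{\partial \pi}{\partial \theta}\right]. \tag{13}$$

As we see, the gradient of the continuous sampling process is easy to compute via backpropagation. Therefore, we can relax the objective function by calculating the average error of the samples drawn from the mixture distribution. The objective function is written as:

$$\mathcal{L} = \frac{1}{N} \sum_{k=1}^{N} d\big(y_t, \tilde{y}^{(k)}\big), \tag{14}$$

where $N$ denotes the number of samples and $\tilde{y}^{(k)}$ is the $k$-th sample. In the testing phase, no randomness is introduced, and sampling-argmax degrades to soft-argmax.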
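Putting the three steps together, a forward-only sketch of the sampled objective might look as follows. It uses a triangular basis, an L1 distance, a fixed temperature and toy maps, all of which are illustrative assumptions. Plain Python has no automatic differentiation, so gradients are not computed here; in a deep learning framework they would flow through the Gumbel-softmax weights.

```python
import math
import random


def sample_gumbel(rng):
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax(prob_map, tau, rng):
    """Step 1: relaxed categorical weights (requires positive probabilities)."""
    scores = [(math.log(p) + sample_gumbel(rng)) / tau for p in prob_map]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def draw_sample(prob_map, grid, a, tau, rng):
    weights = gumbel_softmax(prob_map, tau, rng)
    # Step 2: one draw from every triangular sub-distribution.
    sub_samples = [c + rng.triangular(-a, a, 0.0) for c in grid]
    # Step 3: weighted sum of the sub-samples.
    return sum(w * s for w, s in zip(weights, sub_samples))


def sampling_argmax_loss(prob_map, grid, y_true, a, tau, num_samples, rng):
    """Average L1 error over the drawn samples."""
    errors = [abs(y_true - draw_sample(prob_map, grid, a, tau, rng))
              for _ in range(num_samples)]
    return sum(errors) / num_samples


def predict(prob_map, grid):
    """Test time: no sampling, plain soft-argmax."""
    return sum(p * y for p, y in zip(prob_map, grid))


grid = [0.0, 1.0, 2.0, 3.0, 4.0]
peaked = [0.01, 0.09, 0.80, 0.09, 0.01]
spread = [0.39, 0.10, 0.02, 0.10, 0.39]
rng = random.Random(0)

loss_peaked = sampling_argmax_loss(peaked, grid, 2.0, 1.0, 0.2, 2000, rng)
loss_spread = sampling_argmax_loss(spread, grid, 2.0, 1.0, 0.2, 2000, rng)
one_draw = draw_sample(peaked, grid, 1.0, 0.2, rng)
```

Both toy maps have the same soft-argmax output, yet the sampled loss is lower for the concentrated map, which is the behaviour the objective is designed to reward.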

While the sampling process is differentiable, the sample $\tilde{y}$ does not follow the original mixture distribution $p(y)$ for non-zero temperature. For small temperatures, the distribution of $\tilde{y}$ is close to $p(y)$, but the variance of the gradients is large. For large temperatures, the gradients are smoother, but the samples deviate further from $p(y)$. There is therefore a tradeoff between small and large temperatures. In our experiments, we start at a high temperature and anneal to a small temperature.
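One possible way to anneal the temperature is an exponential decay between two endpoints. The schedule shape, the endpoint values and the function name below are hypothetical; the text only states that training starts at a high temperature and ends at a small one.

```python
def annealed_temperature(step, total_steps, tau_start=2.0, tau_end=0.1):
    """Exponentially interpolate from tau_start to tau_end (assumed schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return tau_start * (tau_end / tau_start) ** frac


total_steps = 1000
schedule = [annealed_temperature(s, total_steps) for s in range(0, total_steps + 1, 100)]
```

Early steps then use smooth, low-variance weights, and later steps draw samples that follow the mixture distribution more closely.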

4 Related Work

Variants of Soft-Argmax.

Nibali et al. (2018) introduced hand-crafted regularization to constrain the shape of the probability map.

Variance Regularization. Variance regularization controls the variance of the probability map. It pushes the variance of the probability map close to the target variance $\sigma_t^2$:

$$\mathcal{L}_{var} = \big(\mathrm{Var}[\pi] - \sigma_t^2\big)^2, \tag{15}$$

where the target variance $\sigma_t^2$ is a hyperparameter and the variance of the probability map $\mathrm{Var}[\pi]$ is approximated in a discrete manner, i.e. $\mathrm{Var}[\pi] = \sum_i \pi_{y_i} \big(y_i - \mathbb{E}[y]\big)^2$.

Distribution Regularization. Distribution regularization imposes a strict constraint on the appearance of the heatmap to directly encourage a certain shape. Specifically, Nibali et al. (2018) force the probability map to resemble a Gaussian distribution by minimizing the Jensen-Shannon divergence between $\pi$ and a target discrete Gaussian distribution:

$$\mathcal{L}_{dist} = \mathrm{JS}\big(\pi \,\|\, \mathcal{N}(y_t, \sigma_t^2)\big). \tag{16}$$
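The two regularizers described above can be sketched as follows. The squared penalty on the variance, the way the target Gaussian is discretized and normalized on the grid, and all numeric values are assumptions made for illustration, not details confirmed by the text.

```python
import math


def discrete_variance(prob_map, grid):
    """Variance of the map, computed over the grid positions."""
    mean = sum(p * y for p, y in zip(prob_map, grid))
    return sum(p * (y - mean) ** 2 for p, y in zip(prob_map, grid))


def variance_regularizer(prob_map, grid, target_variance):
    """Squared gap between the map variance and a target variance (assumed form)."""
    return (discrete_variance(prob_map, grid) - target_variance) ** 2


def discrete_gaussian(grid, center, sigma):
    """Gaussian evaluated on the grid and renormalized to sum to 1."""
    weights = [math.exp(-0.5 * ((y - center) / sigma) ** 2) for y in grid]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mid) + 0.5 * kl_divergence(q, mid)


grid = [0.0, 1.0, 2.0, 3.0, 4.0]
prob_map = [0.0, 0.1, 0.8, 0.1, 0.0]
target = discrete_gaussian(grid, center=2.0, sigma=1.0)

var_penalty = variance_regularizer(prob_map, grid, target_variance=1.0)
dist_penalty = js_divergence(prob_map, target)
```

Both penalties depend on a hand-chosen target variance, which is the hyperparameter sensitivity that the following paragraph contrasts with sampling-argmax.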

Unlike them, our objective function does not set pre-defined hyperparameters for the shape of the map, which makes it general and flexible in applying to various applications.

Other works Joung et al. (2020); Lee et al. (2019) study how to localize the target with soft-argmax in different situations. Joung et al. (2020) proposed sinusoidal soft-argmax for cylindrical probability maps. Lee et al. (2019) proposed kernel soft-argmax to make the results less susceptible to multi-modal probability maps. Our work is compatible with these methods by applying the sinusoidal function to the grid positions or multiplying by the Gaussian kernel before obtaining the probability map.

Differentiable Sampling.

Differentiable sampling for a discrete random variable has been studied for a long time. Maddison et al. (2017) and Jang et al. (2017) concurrently proposed the idea of using a softmax of Gumbel as relaxation for differentiable sampling from discrete distributions. Kočiskỳ et al. (2016) relaxed the discrete sampling by drawing symbols from a logistic-normal distribution rather than drawing from softmax. In this work, unlike previous methods that study discrete distributions, we focus on continuous distributions. We propose a relaxation of continuous sampling by formulating the target distribution as a mixture distribution.

5 Experiments

We validate the benefits of the proposed sampling-argmax with experiments on a variety of localization tasks, including human pose estimation, retina segmentation and object keypoint estimation. Additional experiments on facial landmark localization are provided in the appendix. Sampling-argmax is compared with the conventional soft-argmax and the variants that use an additional auxiliary loss Nibali et al. (2018). Training details of all tasks are provided in the supplemental material.

5.1 2D Human Pose Estimation from RGB

We first evaluate the proposed sampling-argmax in 2D human pose estimation. In 2D human pose estimation, the probability map is a typical representation to localize body keypoints. The experiments are conducted on the large-scale in-the-wild 2D human pose benchmark – COCO Keypoint Lin et al. (2014). Significant progress has been achieved in this field Xiao et al. (2018); Sun et al. (2019); Moon et al. (2019b); McNally et al. (2020). We adopt the standard model SimplePose Xiao et al. (2018) for experiments. We follow the standard metric of COCO Keypoint and use mAP over 10 OKS (object keypoint similarity) thresholds for evaluation.

As shown in Table 1, the proposed sampling-argmax significantly outperforms the soft-argmax operation and its variants. Soft, Soft w/ V.R. and Soft w/ D.R. correspond to conventional soft-argmax, soft-argmax with variance regularization and soft-argmax with distribution regularization, respectively. Samp. Uni., Samp. Tri. and Samp. Gau. correspond to sampling-argmax with uniform, triangular and Gaussian basis, respectively. The triangular basis brings a 5.3 mAP improvement (relative 8.2%) over the original soft-argmax operation. Besides, we find the auxiliary losses degrade the model performance on COCO Keypoint.

| Metric | Soft | Soft w/ V.R. | Soft w/ D.R. | Samp. Uni. | Samp. Tri. | Samp. Gau. |
|---|---|---|---|---|---|---|
| mAP | 64.5 | 60.6 | 55.6 | 68.2 | 69.8 | 68.3 |
| mAP@0.5 | 84.7 | 81.5 | 77.8 | 87.2 | 87.9 | 87.3 |
| mAP@0.75 | 70.9 | 65.7 | 60.8 | 75.0 | 76.2 | 75.2 |

Table 1: Quantitative results on COCO Keypoint.

Number of Samples.

In our method, the differentiable sampling process is utilized to approximate the expectation of the error. As the number of samples increases, the approximation will be closer to the underlying expectation. To study how the number of samples affects the final results, we compare the performance of models trained with different numbers of samples. In Table 2, we report the results with $N \in \{1, 5, 10, 30, 50\}$. It shows that a large number of samples might improve the performance but is not necessary. Training the model with only one sample can still obtain high performance while saving computation resources.

| Number of samples N | 1 | 5 | 10 | 30 | 50 |
|---|---|---|---|---|---|
| Samp. Uni. | 67.8 | 67.8 | 67.9 | 68.2 | 68.1 |
| Samp. Tri. | 69.7 | 69.7 | 69.6 | 69.8 | 69.8 |
| Samp. Gau. | 68.1 | 68.1 | 68.2 | 68.3 | 68.3 |

Table 2: Comparison of different sample numbers (mAP on COCO Keypoint).

Correlation with Prediction Correctness.

| Method | Corr. |
|---|---|
| Soft | 0.233 |
| Soft w/ V.R. | 0.158 |
| Soft w/ D.R. | 0.082 |
| Samp. Uni. | 0.394 |
| Samp. Tri. | 0.432 |
| Samp. Gau. | 0.423 |

Table 3: Correlation testing.

For a well-calibrated probability map, the shape of the map could reflect the uncertainty of the regression output. When encountering challenging cases, the probability map would have a large variance, resulting in a lower peak value. In other words, the peak value is correlated with the prediction correctness. To demonstrate that the probability map trained with sampling-argmax is better calibrated, we calculate the Pearson correlation coefficient between the peak value and the prediction correctness. The correctness is represented by the OKS between the predicted pose and the ground-truth pose. Table 3 compares the correlation with prediction correctness among different methods. It shows that sampling-argmax has a much stronger correlation with the correctness than other methods. Compared to the soft-argmax operation, sampling-argmax with the triangular basis brings an 85.4% relative improvement (0.432 vs. 0.233). It demonstrates that training with sampling-argmax can obtain a more reliable probability map, which is essential to real-world applications and downstream tasks.
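The correlation test can be sketched with a plain Pearson coefficient. The peak values and OKS scores below are made-up illustrative numbers, not data from the experiments; only the two correlation values used for the relative-improvement check are taken from Table 3.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-instance peak probabilities and OKS correctness scores.
peak_values = [0.91, 0.35, 0.78, 0.22, 0.64, 0.50]
oks_scores = [0.95, 0.40, 0.70, 0.55, 0.80, 0.45]
corr = pearson(peak_values, oks_scores)

# Relative improvement of Samp. Tri. over Soft, using the Table 3 values.
relative_gain = 0.432 / 0.233 - 1.0
```

A coefficient near 1 means the peak value is a usable confidence score, while a coefficient near 0 means the peak says little about correctness.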

5.2 3D Human Pose Estimation from RGB

We further evaluate the proposed sampling-argmax on Human3.6M Ionescu et al. (2014), an indoor benchmark for 3D human pose estimation. The 3D probability map is adopted to represent the likelihoods for joints in the discrete 3D space. We adopt the model architecture of prior work Sun et al. (2018). Following previous methods Pavlakos et al. (2017); Sun et al. (2018); Moon et al. (2019a); Li et al. (2021a), MPJPE and PA-MPJPE Gower (1975) are used as the evaluation metrics. Comparisons with baselines are shown in Table 4. The proposed sampling-argmax provides consistent performance improvements. Different from the experiments on COCO Keypoint, the variance regularization provides performance improvements in Human3.6M.

| Metric | Soft | Soft w/ V.R. | Soft w/ D.R. | Samp. Uni. | Samp. Tri. | Samp. Gau. |
|---|---|---|---|---|---|---|
| MPJPE | 50.4 | 49.7 | 51.9 | 49.6 | 49.5 | 50.9 |
| PA-MPJPE | 39.5 | 39.2 | 41.4 | 39.1 | 39.1 | 39.0 |

Table 4: Quantitative results on Human3.6M.

5.3 Retina Segmentation from OCT

Using optical coherence tomography (OCT) to obtain 3D retina images is widely used in the clinic. A major goal of analyzing retinal OCT images is retinal layer segmentation. Previous work He et al. (2019a) proposes a regression method to regress the boundary and obtain the sub-pixel surface positions. One-dimensional probability maps are leveraged to model the position distribution of the surface in each column. In the testing phase, the soft-argmax method is used to infer the final surface positions. The entire surface can be reconstructed by connecting the surface positions in all columns.

The experiments are conducted on the multiple sclerosis and healthy controls dataset (MSHC) He et al. (2019b). Mean absolute distance (MAD) and standard deviation (Std. Dev.) are used as evaluation metrics. Quantitative results are reported in Table 5. It shows that sampling-argmax achieves superior performance to other methods, while the auxiliary losses also provide performance improvements.

| Metric | Soft | Soft w/ V.R. | Soft w/ D.R. | Samp. Uni. | Samp. Tri. | Samp. Gau. |
|---|---|---|---|---|---|---|
| MAD | 3.08 | 0.743 | 0.746 | 0.735 | 0.744 | 0.740 |
| Std. Dev. | 0.281 | 0.114 | 0.108 | 0.101 | 0.100 | 0.104 |

Table 5: Quantitative results on MSHC dataset.

5.4 Supervised Object Keypoint Estimation from Point Clouds

Detecting aligned 3D object keypoints from point clouds has a wide range of applications in object tracking, shape retrieval and robotics. Probability maps are adopted to localize the semantic keypoints. Different from the RGB input, the probability map indicates the pointwise score of the input point cloud, not the grid position of an image. The distances between adjacent point pairs are different. Besides, point clouds are unordered, and each point has a different number of neighbours. Therefore, it is hard to directly apply the uniform bases or linear interpolation, which require a constant adjacent distance. Fortunately, the Gaussian basis can be adopted. In the experiment, we set the standard deviation of the Gaussian bases to the average adjacent point distance in the input point clouds. PointNet++ Qi et al. (2017) is adopted as the backbone network. The experiments are conducted on the large-scale object keypoint dataset – KeypointNet You et al. (2020). The percentage of correct keypoints (PCK) Yi et al. (2017) is adopted for evaluation, computed at a fixed error distance threshold.
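For reference, PCK can be sketched as the fraction of predicted keypoints that fall within a distance threshold of their ground-truth positions. The threshold and the coordinates below are hypothetical values chosen for illustration.

```python
import math


def pck(predicted, ground_truth, threshold):
    """Fraction of keypoints whose Euclidean error is within the threshold."""
    correct = sum(1 for p, g in zip(predicted, ground_truth)
                  if math.dist(p, g) <= threshold)
    return correct / len(ground_truth)


# Toy 3D keypoints; the threshold value is an assumption.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 1.0)]
ground_truth = [(0.0, 0.0, 0.005), (1.0, 0.0, 0.2), (0.0, 1.0, 0.0), (1.0, 1.0, 1.0)]
score = pck(predicted, ground_truth, threshold=0.01)
```

Here three of the four keypoints lie within the threshold, so the score is 0.75.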

Table 6 shows the quantitative results on 16 categories. It shows that the proposed sampling-argmax is also effective on non-grid input data. Table 6 also compares the results of sampling-argmax with different numbers of samples. Among these settings, the last one listed gives the best average performance (39.9).

| Method | Air. | Bat. | Bed | Bot. | Cap | Car | Cha. | Gui. | Hel. | Kni. | Lap. | Mot. | Mug | Ska. | Tab. | Ves. | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Soft | 64.9 | 43.6 | 44.0 | 53.9 | 8.3 | 40.2 | 37.2 | 45.5 | 4.9 | 43.8 | 46.6 | 40.8 | 23.9 | 27.7 | 53.9 | 32.6 | 38.2 |
| Soft w/ V.R. | 64.1 | 41.6 | 39.2 | 53.2 | 12.5 | 38.3 | 37.7 | 44.5 | 3.7 | 39.8 | 52.8 | 44.0 | 24.9 | 25.6 | 54.4 | 30.7 | 37.9 |
| Soft w/ D.R. | 63.2 | 42.7 | 43.9 | 55.8 | 16.7 | 42.2 | 38.6 | 43.2 | 4.9 | 42.4 | 48.9 | 41.9 | 26.8 | 28.2 | 54.0 | 30.3 | 39.0 |
| Samp. Gau. () | 65.0 | 43.0 | 41.2 | 53.6 | 6.2 | 43.4 | 38.7 | 42.5 | 6.2 | 45.4 | 50.6 | 43.5 | 26.3 | 37.5 | 51.6 | 33.3 | 39.3 |
| Samp. Gau. () | 65.1 | 42.4 | 43.8 | 54.7 | 12.5 | 43.2 | 37.1 | 44.6 | 1.9 | 45.4 | 46.6 | 44.7 | 29.7 | 26.7 | 54.6 | 31.4 | 39.0 |
| Samp. Gau. () | 64.0 | 45.5 | 41.7 | 58.6 | 20.8 | 40.9 | 37.0 | 43.4 | 3.7 | 45.7 | 48.3 | 46.4 | 18.2 | 34.4 | 53.5 | 32.3 | 39.7 |
| Samp. Gau. () | 64.3 | 45.1 | 47.5 | 58.4 | 6.2 | 44.6 | 39.2 | 45.4 | 6.2 | 45.8 | 48.7 | 43.4 | 29.9 | 30.4 | 54.1 | 28.8 | 39.9 |

Table 6: Quantitative results of supervised learning on KeypointNet dataset, reported as PCK (higher is better).

5.5 Unsupervised Object Keypoint Estimation from Point Clouds

We then evaluate the proposed method on object keypoint estimation in the context of unsupervised learning. The autoencoder framework is adopted to estimate the keypoints in an unsupervised manner. The encoder first estimates the 3D keypoints, and the decoder reconstructs the object point clouds from the estimated keypoints. We follow the state-of-the-art method Shi et al. (2021) that generates 3D keypoints with the soft-argmax operation for differentiable and end-to-end learning. The soft-argmax is replaced with sampling-argmax using Gaussian bases.

The experiments are conducted on KeypointNet You et al. (2020). Unlike supervised learning, the semantics of each predicted keypoint are unknown in unsupervised methods. Therefore, the PCK metric is not applicable. For evaluation, we adopt the dual alignment score (DAS) following the previous method Shi et al. (2021). Table 7 reports the performance comparison with other methods.

| Method | Air. | Bat. | Bed | Bot. | Cap | Car | Cha. | Gui. | Hel. | Kni. | Lap. | Mot. | Mug | Ska. | Tab. | Ves. | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Soft | 69.1 | 56.2 | 58.0 | 45.4 | 59.1 | 70.2 | 76.8 | 34.1 | 55.7 | 50.0 | 91.5 | 53.4 | 52.2 | 65.7 | 72.5 | 35.8 | 59.1 |
| Soft w/ V.R. | 72.0 | 55.4 | 57.4 | 52.8 | 54.7 | 63.4 | 70.9 | 56.1 | 61.6 | 50.3 | 82.4 | 59.8 | 71.7 | 65.3 | 85.1 | 38.1 | 62.3 |
| Soft w/ D.R. | 47.9 | 35.5 | 47.3 | 46.1 | 58.3 | 65.5 | 60.9 | 35.3 | 47.6 | 69.3 | 64.1 | 55.0 | 45.9 | 44.2 | 57.6 | 28.8 | 50.6 |
| Samp. Gau. () | 73.9 | 53.8 | 63.5 | 43.9 | 67.0 | 69.3 | 77.7 | 46.6 | 59.1 | 55.9 | 87.8 | 59.0 | 67.0 | 66.2 | 80.3 | 36.4 | 62.9 |
| Samp. Gau. () | 73.1 | 54.0 | 61.9 | 48.4 | 64.4 | 67.0 | 81.1 | 50.7 | 55.2 | 50.1 | 87.5 | 58.2 | 58.9 | 65.9 | 77.9 | 41.2 | 62.2 |
| Samp. Gau. () | 73.9 | 58.8 | 61.7 | 46.2 | 60.9 | 68.6 | 72.0 | 53.6 | 56.5 | 48.1 | 91.6 | 59.8 | 68.8 | 65.8 | 83.5 | 34.9 | 62.8 |
| Samp. Gau. () | 71.2 | 56.7 | 60.0 | 51.0 | 58.4 | 64.1 | 83.8 | 47.6 | 61.8 | 47.8 | 91.3 | 55.5 | 68.5 | 70.6 | 81.7 | 37.5 | 63.0 |

Table 7: Quantitative results of unsupervised learning on KeypointNet dataset, reported as DAS (higher is better).

5.6 Discussion

Although the variants of soft-argmax can bring improvements in some cases, they need laborious tuning of parameters, such as the weight of the regularization term and the variance of the target distribution. The best parameters differ from task to task. Besides, the best parameters for variance regularization and distribution regularization are also different, which increases the effort needed for parameter tuning. In our experiments, we tune the loss weight in the range from 0.1 to 10 and the variance in the range from 1 to 5 for each task. After laborious tuning, the performance of these variants is still not consistent across different tasks and is inferior to the performance of our method, while our method works out of the box and is free from parameter tuning. Therefore, we think our method is effective and general across different cases.

In addition to more accurate localization, sampling-argmax can predict well-calibrated probability maps and provide more reliable confidence scores. COCO Keypoint uses the mAP metric to evaluate multi-person pose estimation. Thus reliable confidence scores could also improve the performance. In other datasets, the metric only reflects the localization performance and ignores the importance of confidence scores. In many real-world applications and downstream tasks, a reliable confidence score is very important and necessary.

6 Conclusion

In this paper, we propose sampling-argmax, an operation for improving detection-based localization. Sampling-argmax implicitly imposes shape constraints on the predicted probability map by optimizing the expectation of the localization error. With the continuous formulation and differentiable sampling, sampling-argmax can seamlessly replace the conventional soft-argmax operation. We show that sampling-argmax is effective and flexible by conducting comprehensive experiments on various localization tasks.

References

  • [1] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele (2014) 2d human pose estimation: new benchmark and state of the art analysis. In CVPR, Cited by: Appendix A.
  • [2] P. Chandran, D. Bradley, M. Gross, and T. Beeler (2020) Attention-driven cropping for very high resolution facial landmark detection. In CVPR, Cited by: §1.
  • [3] COCO - common objects in context. Note: https://cocodataset.org/#home Cited by: Appendix D.
  • [4] COCO license agreement. Note: https://creativecommons.org/licenses/by/4.0/legalcode Cited by: Appendix D.
  • [5] S. Duggal, S. Wang, W. Ma, R. Hu, and R. Urtasun (2019) Deeppruner: learning efficient stereo matching via differentiable patchmatch. In ICCV, Cited by: §1.
  • [6] Facial landmark detection by deep multi-task learning. Note: http://mmlab.ie.cuhk.edu.hk/projects/TCDCN.html Cited by: Appendix D.
  • [7] C. Finn, X. Y. Tan, Y. Duan, T. Darrell, S. Levine, and P. Abbeel (2016) Deep spatial autoencoders for visuomotor learning. In ICRA, Cited by: §1.
  • [8] R. Goroshin, M. Mathieu, and Y. LeCun (2015) Learning to linearize under uncertainty. In NeurIPS, Cited by: §1.
  • [9] J. C. Gower (1975) Generalized procrustes analysis. Psychometrika. Cited by: §5.2.
  • [10] E. J. Gumbel (1954) Statistical theory of extreme values and some practical applications: a series of lectures. Vol. 33, US Government Printing Office. Cited by: §3.2.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: Appendix A, Appendix A, Appendix A.
  • [12] Y. He, A. Carass, Y. Liu, B. M. Jedynak, S. D. Solomon, S. Saidha, P. A. Calabresi, and J. L. Prince (2019) Fully convolutional boundary regression for retina oct segmentation. In MICCAI, Cited by: Appendix A, §1, §5.3.
  • [13] Y. He, A. Carass, S. D. Solomon, S. Saidha, P. A. Calabresi, and J. L. Prince (2019) Retinal layer parcellation of optical coherence tomography images: data resource for multiple sclerosis and healthy controls. Data in brief. Cited by: Appendix D, §5.3.
  • [14] S. Honari, P. Molchanov, S. Tyree, P. Vincent, C. Pal, and J. Kautz (2018) Improving landmark localization with semi-supervised learning. In CVPR, Cited by: §1.
  • [15] Human3.6m dataset. Note: http://vision.imar.ro/human3.6m/description.php Cited by: Appendix D.
  • [16] Human3.6m license agreement. Note: http://vision.imar.ro/human3.6m/eula.php Cited by: Appendix D.
  • [17] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu (2014) Human3.6m: large scale datasets and predictive methods for 3D human sensing in natural environments. TPAMI. Cited by: Appendix D, §5.2.
  • [18] E. Jang, S. Gu, and B. Poole (2017) Categorical reparameterization with gumbel-softmax. In ICLR, Cited by: §3.2, §4.
  • [19] S. Joung, S. Kim, H. Kim, M. Kim, I. Kim, J. Cho, and K. Sohn (2020) Cylindrical convolutional networks for joint object detection and viewpoint estimation. In CVPR, Cited by: §1, §4.
  • [20] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry (2017) End-to-end learning of geometry and context for deep stereo regression. In ICCV, Cited by: §1.
  • [21] KeypointNet. Note: https://github.com/qq456cvb/KeypointNet Cited by: Appendix D.
  • [22] T. Kočiskỳ, G. Melis, E. Grefenstette, C. Dyer, W. Ling, P. Blunsom, and K. M. Hermann (2016) Semantic parsing with semi-supervised sequential autoencoders. In EMNLP, Cited by: §4.
  • [23] J. Lee, D. Kim, J. Ponce, and B. Ham (2019) Sfnet: learning object-aware semantic correspondence. In CVPR, Cited by: §1, §4.
  • [24] J. Li, S. Bian, A. Zeng, C. Wang, B. Pang, W. Liu, and C. Lu (2021) Human pose regression with residual log-likelihood estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11025–11034. Cited by: §5.2.
  • [25] J. Li, C. Wang, H. Zhu, Y. Mao, H. Fang, and C. Lu (2019) Crowdpose: efficient crowded scenes pose estimation and a new benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10863–10872. Cited by: §1.
  • [26] J. Li, C. Xu, Z. Chen, S. Bian, L. Yang, and C. Lu (2021) HybrIK: a hybrid analytical-neural inverse kinematics solution for 3d human pose and shape estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3383–3393. Cited by: §1.
  • [27] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In ECCV, Cited by: Appendix A, Appendix D, §5.1.
  • [28] Y. Liu, H. Shen, Y. Si, X. Wang, X. Zhu, H. Shi, Z. Hong, H. Guo, Z. Guo, Y. Chen, et al. (2019) Grand challenge of 106-point facial landmark localization. In ICMEW, Cited by: §1.
  • [29] D. C. Luvizon, D. Picard, and H. Tabia (2018) 2d/3d pose estimation and action recognition using multitask deep learning. In CVPR, Cited by: §1.
  • [30] D. C. Luvizon, H. Tabia, and D. Picard (2019) Human pose regression by combining indirect part detection and contextual information. Computers & Graphics. Cited by: §1.
  • [31] C. J. Maddison, A. Mnih, and Y. W. Teh (2017) The concrete distribution: a continuous relaxation of discrete random variables. In ICLR, Cited by: §4.
  • [32] C. J. Maddison, D. Tarlow, and T. Minka (2014) A* sampling. In NeurIPS, Cited by: §3.2.
  • [33] W. McNally, K. Vats, A. Wong, and J. McPhee (2020) Evopose2d: pushing the boundaries of 2d human pose estimation using neuroevolution. arXiv preprint arXiv:2011.08446. Cited by: §5.1.
  • [34] G. Moon, J. Y. Chang, and K. M. Lee (2019) Camera distance-aware top-down approach for 3D multi-person pose estimation from a single rgb image. In ICCV, Cited by: Appendix A, §5.2.
  • [35] G. Moon, J. Y. Chang, and K. M. Lee (2019) Posefix: model-agnostic general human pose refinement network. In CVPR, Cited by: §5.1.
  • [36] A. Nibali, Z. He, S. Morgan, and L. Prendergast (2018) Numerical coordinate regression with convolutional neural networks. arXiv preprint arXiv:1801.07372. Cited by: §1, §4, §4, §5.
  • [37] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis (2017) Coarse-to-fine volumetric prediction for single-image 3D human pose. In CVPR, Cited by: §5.2.
  • [38] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In NeurIPS, Cited by: Appendix A, §5.4.
  • [39] Resources - iacl. Note: http://iacl.ece.jhu.edu/index.php?title=Resources Cited by: Appendix D.
  • [40] R. Shi, Z. Xue, Y. You, and C. Lu (2021) Skeleton merger: an unsupervised aligned keypoint detector. In CVPR, Cited by: §1, §5.5, §5.5.
  • [41] R. Spezialetti, F. Stella, M. Marcon, L. Silva, S. Salti, and L. Di Stefano (2020) Learning to orient surfaces by self-supervised spherical cnns. In NeurIPS, Cited by: §1.
  • [42] K. Sun, B. Xiao, D. Liu, and J. Wang (2019) Deep high-resolution representation learning for human pose estimation. In CVPR, Cited by: §5.1.
  • [43] X. Sun, B. Xiao, F. Wei, S. Liang, and Y. Wei (2018) Integral human pose regression. In ECCV, Cited by: Appendix A, §1, §5.2.
  • [44] C. Wang, J. Li, W. Liu, C. Qian, and C. Lu (2020) Hmor: hierarchical multi-person ordinal relations for monocular multi-person 3d pose estimation. In European Conference on Computer Vision, pp. 242–259. Cited by: §1.
  • [45] R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning. Cited by: §3.
  • [46] B. Xiao, H. Wu, and Y. Wei (2018) Simple baselines for human pose estimation and tracking. In ECCV, Cited by: Appendix A, §5.1.
  • [47] L. Yi, H. Su, X. Guo, and L. J. Guibas (2017) Syncspeccnn: synchronized spectral cnn for 3d shape segmentation. In CVPR, Cited by: §5.4.
  • [48] Y. You, Y. Lou, C. Li, Z. Cheng, L. Li, L. Ma, C. Lu, and W. Wang (2020) Keypointnet: a large-scale 3d keypoint dataset aggregated from numerous human annotations. In CVPR, Cited by: Appendix D, §5.4, §5.5.
  • [49] T. Zhang, Z. McCarthy, O. Jow, D. Lee, X. Chen, K. Goldberg, and P. Abbeel (2018) Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In ICRA, Cited by: §1.
  • [50] Z. Zhang, P. Luo, C. C. Loy, and X. Tang (2014) Facial landmark detection by deep multi-task learning. In ECCV, Cited by: Appendix D, Appendix E.
  • [51] C. Zhou, H. Zhang, X. Shen, and J. Jia (2017) Unsupervised learning of stereo matching. In ICCV, Cited by: §1.

Appendix

In the supplemental document, we elaborate on the training settings (Appendix A), the broader impact of our work (Appendix B), limitation and future work (Appendix C), descriptions of the utilized datasets (Appendix D), experiments on facial landmark localization (Appendix E), comparison between the learned distribution of soft-argmax and sampling-argmax (Appendix F), and qualitative results (Appendix G).

Appendix A Training Details

2D Human Pose Estimation from RGB

We adopt SimplePose [46] for experiments. The model is trained and evaluated on COCO Keypoint [27]. ResNet-50 [11] is adopted as the backbone network. The input image is resized to . The learning rate is set to at first and reduced by a factor of at the th epoch and the th epoch. We use the Adam solver and train for epochs, with a mini-batch size of per GPU and 1080Ti GPUs in total. For comparison with the auxiliary losses, we set the target variance to , the loss weight of variance regularization to , and the loss weight of distributions regularization to to achieve the best results after tuning.
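The step schedule described above (an initial learning rate divided by a fixed factor at two milestone epochs) can be sketched as follows. The numeric values in the usage lines are placeholders chosen purely for illustration, not the settings of the paper, and `step_decay_lr` is a hypothetical helper name.

```python
def step_decay_lr(base_lr, epoch, milestones, factor):
    """Learning rate at `epoch`, divided by `factor` once per milestone reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / (factor ** drops)


# Placeholder values for illustration only (not taken from the source).
schedule = [step_decay_lr(0.1, e, milestones=(3, 6), factor=2) for e in range(8)]
```

With these placeholder values, the rate stays at its initial value for the first three epochs, is halved at epoch 3, and is halved again at epoch 6.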

3D Human Pose Estimation from RGB

We follow the model architecture of Integral Pose [43]. ResNet-50 [11] is adopted as the backbone network. The input image is resized to . The learning rate is set to at first and reduced by a factor of at the th and th epoch. We use the Adam solver and train for epochs, with a mini-batch size of per GPU and 1080Ti GPUs in total. Following the settings of previous works [43, 34], we mix Human3.6M and MPII [1] data for training. Each mini-batch consists of half 2D and half 3D samples. Five subjects (S1, S5, S6, S7, S8) are used for training and two subjects (S9, S11) for evaluation. We set the target variance to , the loss weight of variance regularization to , and the loss weight of distributions regularization to to achieve the best results after tuning.
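The half-2D, half-3D mini-batch composition can be sketched as follows. This is a minimal illustration, assuming two in-memory sample lists; `mixed_batches` and the tuple-based samples are hypothetical and are not taken from the authors' code.

```python
import random


def mixed_batches(samples_2d, samples_3d, batch_size, rng):
    """Yield mini-batches made of half 2D-annotated and half 3D-annotated samples."""
    if batch_size % 2 != 0:
        raise ValueError("batch_size must be even to split it in halves")
    half = batch_size // 2
    pool_2d = list(samples_2d)
    pool_3d = list(samples_3d)
    rng.shuffle(pool_2d)
    rng.shuffle(pool_3d)
    # Stop when the smaller pool can no longer fill its half of a batch.
    n_batches = min(len(pool_2d), len(pool_3d)) // half
    for b in range(n_batches):
        lo, hi = b * half, (b + 1) * half
        yield pool_2d[lo:hi] + pool_3d[lo:hi]


# Hypothetical toy data: tags stand in for real 2D and 3D training samples.
toy_2d = [("2d", i) for i in range(10)]
toy_3d = [("3d", i) for i in range(6)]
batches = list(mixed_batches(toy_2d, toy_3d, batch_size=4, rng=random.Random(0)))
```

One epoch in this sketch ends when the smaller dataset is exhausted; whether the original training loop resamples the smaller set instead is not stated in the source.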

Retina Segmentation from OCT

We follow the model architecture of [12]. The input image is resized to . The learning rate is set to at first and reduced by a factor of at the th and the th epoch. We use the Adam solver and train for epochs, with a mini-batch size of and GPU. The split of training, validation and test sets follows the settings of the previous method [12]. We set the target variance to , the loss weight of variance regularization to , and the loss weight of distributions regularization to to achieve the best results after tuning.

Supervised Object Keypoint Estimation from Point Clouds

We adopt PointNet++ [38] as the backbone network. The output of the last layer is a per-point probability map for each keypoint. The input point cloud consists of 2048 points represented by their Euclidean coordinates sampled from a normalized object, and the indexes of keypoints are given. The learning rate is set to and halved every epochs. We use the Adam solver and train for 100 epochs with a mini-batch size of on one GPU for each category. We set the target variance to , the loss weight of variance regularization to , and the loss weight of distributions regularization to to achieve the best results after tuning.

Unsupervised Object Keypoint Estimation from Point Clouds

The learning rate is set to and halved every epochs. We use the Adam solver and train for epochs, with a mini-batch size of and one GPU for each category. We set the target variance to , the loss weight of variance regularization to , and the loss weight of distributions regularization to to achieve the best results after tuning.

Facial Landmark Localization from RGB

ResNet-18 [11] is adopted as the backbone network. The head network consists of deconvolution layers and a convolution layer. The input image is resized to . The learning rate is set to at first and reduced by a factor of at the th and th epoch. We use the Adam solver and train for epochs, with a mini-batch size of and GPUs in total. We set the target variance to , the loss weight of variance regularization to , and the loss weight of distributions regularization to to achieve the best results after tuning.

Appendix B Broader Impact

In this work, we propose sampling-argmax to improve the ability of machines to understand target positions in input data. Current methods usually adopt computationally expensive models to improve the localization accuracy, which can incur substantial financial and environmental costs. We partly alleviate this issue by presenting a simple yet effective method.

Furthermore, our method is an improvement of existing capabilities but does not introduce a radically new capability in machine learning. Thus our contribution is unlikely to facilitate misuse of technology that is already available to anyone.

Appendix C Limitation and Future Work

In our method, the underlying density function of the target position is approximated by a mixture of sub-distributions. By comparing the performance of the three proposed bases, we see that a more accurate reconstruction of the underlying function leads to better results. Theoretically, the underlying density function cannot be perfectly reconstructed, since the proposed basis distributions are fixed. To address this limitation, learnable sub-distributions could be adopted in future work. For example, normalizing flow models could be leveraged to predict the sub-distribution at each position according to the corresponding features. In this way, the sub-distributions are no longer fixed, and the mixture distribution has the potential to precisely reconstruct the underlying distribution and further improve the model performance.
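The mixture view can be illustrated with a small forward-only sketch: grid positions carry the predicted weights, each position contributes a fixed Gaussian sub-distribution, and the expected error is estimated by averaging over samples. The basis width `sigma`, the L1 error, and all function names are assumptions made for illustration. The sketch uses ordinary (non-differentiable) sampling, so it is not the differentiable sampling process of the paper.

```python
import math
import random


def softmax(logits):
    """Normalize raw scores into mixture weights that sum to one."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_mixture(weights, sigma, n_samples, rng):
    """Draw samples from sum_i weights[i] * Normal(i, sigma**2).

    Each sample picks a grid position i with probability weights[i], then
    adds noise from the fixed Gaussian basis centered on that position.
    """
    centers = list(range(len(weights)))
    picks = rng.choices(centers, weights=weights, k=n_samples)
    return [c + rng.gauss(0.0, sigma) for c in picks]


def expected_error(weights, target, sigma=0.5, n_samples=2000, seed=0):
    """Monte Carlo estimate of E|x - target| under the mixture density."""
    rng = random.Random(seed)
    samples = sample_mixture(weights, sigma, n_samples, rng)
    return sum(abs(s - target) for s in samples) / n_samples


# Hypothetical 1D output over 9 grid positions, peaked near position 4.
weights = softmax([0.0, 0.0, 0.0, 2.0, 4.0, 2.0, 0.0, 0.0, 0.0])
error = expected_error(weights, target=4.0)
```

Because the basis here is fixed, the only way for the model to lower the estimated error is to reshape the weights, which is the limitation the paragraph above proposes to relax with learnable sub-distributions.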

Appendix D Data Acquisition

In our experiments, we use five different datasets, including COCO Keypoint [27], Human3.6M [17], MSHC [13], KeypointNet [48] and MTFL [50]. These public datasets do not contain personally identifiable information or offensive content.

COCO Keypoint

COCO Keypoint dataset is licensed under the Creative Commons Attribution 4.0 License [4]. The images and annotations are publicly available. We download the images and annotations from its official website [3].

Human3.6M

Human3.6M dataset is licensed under [16]. To obtain the data, we register and download it from its official website [15].

MSHC

MSHC dataset is publicly available, and no license is specified. We download the data from its official website [39].

KeypointNet

KeypointNet dataset is publicly available, and no license is specified. We download the data from its official website [21].

MTFL

MTFL dataset is publicly available, and no license is specified. We download the data from its official website [6].

Appendix E Facial Landmark Localization from RGB

We further evaluate the proposed sampling-argmax on the facial landmark localization dataset MTFL [50]. Absolute error and relative error (normalized by the two-eye distance) are adopted as evaluation metrics. Quantitative results are reported in Table 8. Consistent with the experiments on other tasks, sampling-argmax improves facial landmark localization.

Metric     Soft   Soft w/ V.R.   Soft w/ D.R.   Samp. Uni.   Samp. Tri.   Samp. Gau.
Abs. Err   3.18   3.16           3.15           3.00         2.98         2.94
Rel. Err   7.25   7.22           7.20           6.86         6.82         6.96
Table 8: Quantitative results on MTFL dataset.
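The two metrics reported in Table 8 can be sketched for a single face as follows. This is a minimal illustration, assuming 2D landmarks given as (x, y) tuples and assuming the two eye landmarks sit at indices 0 and 1; the index choice, the per-face averaging, and the function names are hypothetical. The sketch returns the relative error as a plain ratio, since the source does not state the scale used in the table.

```python
import math


def abs_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth landmarks."""
    dists = [math.hypot(p[0] - g[0], p[1] - g[1]) for p, g in zip(pred, gt)]
    return sum(dists) / len(dists)


def rel_error(pred, gt, left_eye_idx=0, right_eye_idx=1):
    """Absolute error normalized by the ground-truth two-eye distance."""
    left, right = gt[left_eye_idx], gt[right_eye_idx]
    eye_dist = math.hypot(left[0] - right[0], left[1] - right[1])
    return abs_error(pred, gt) / eye_dist


# Toy example: only the first landmark is off, by a distance of 5 pixels.
gt = [(0.0, 0.0), (10.0, 0.0), (5.0, 5.0)]
pred = [(3.0, 4.0), (10.0, 0.0), (5.0, 5.0)]
face_abs = abs_error(pred, gt)
face_rel = rel_error(pred, gt)
```

Normalizing by the two-eye distance makes the error comparable across faces of different sizes in the image.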

Appendix F Visualization of learned probability maps

We show the predicted probability maps of soft-argmax and sampling-argmax in Figure 4. Soft-argmax tends to predict multi-modal distributions, while the proposed sampling-argmax predicts better-calibrated probability maps.

Figure 4: Visualization of the learned distribution. Left: Soft-Argmax. Right: Sampling-Argmax.
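A toy 1D example suggests why soft-argmax alone leaves the map shape unconstrained: a single-peak map and a two-peak map can have exactly the same expectation, so a loss on the soft-argmax output cannot tell them apart. The maps below are hypothetical and hand-built for illustration.

```python
def soft_argmax(probs):
    """Expected position under a discrete 1D probability map."""
    return sum(i * p for i, p in enumerate(probs))


def spread(probs):
    """Variance of the position, used here only to expose the shape."""
    mean = soft_argmax(probs)
    return sum(p * (i - mean) ** 2 for i, p in enumerate(probs))


# Single peak at position 8.
unimodal = [0.0] * 17
unimodal[8] = 1.0

# Two equal peaks at positions 4 and 12, with nothing at position 8.
bimodal = [0.0] * 17
bimodal[4] = 0.5
bimodal[12] = 0.5

same_location = soft_argmax(unimodal) == soft_argmax(bimodal)
shape_gap = spread(bimodal) - spread(unimodal)
```

Both maps localize the target at position 8, yet the second one assigns no probability there, which is the kind of multi-modal output the left panel of Figure 4 is described as showing.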

Appendix G Qualitative Results

Qualitative results on six tasks are shown in Figures 5, 6, 7, 8, 9, and 10.

Figure 5: Qualitative results of 2D human pose estimation on COCO Keypoint.
Figure 6: Qualitative results of 3D human pose estimation on Human3.6M.
Figure 7: Qualitative results of retina segmentation on MSHC.
Figure 8: Qualitative results of supervised model on KeypointNet.
Figure 9: Qualitative results of unsupervised model on KeypointNet.
Figure 10: Qualitative results of facial landmark localization on MTFL.

Checklist

  1. For all authors…

    1. Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? We claim that the proposed sampling-argmax can help the model obtain well-calibrated probability maps and improve the localization accuracy. In our experiments, we validate the localization accuracy of sampling-argmax across six tasks. We also demonstrate that the probability maps are well-calibrated by conducting correlation testing.

    2. Did you describe the limitations of your work? Please see Appendix C.

    3. Did you discuss any potential negative societal impacts of your work? Please see the supplemental material.

    4. Have you read the ethics review guidelines and ensured that your paper conforms to them?

  2. If you are including theoretical results…

    1. Did you state the full set of assumptions of all theoretical results?

    2. Did you include complete proofs of all theoretical results?

  3. If you ran experiments…

    1. Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? Our code is attached in the supplemental material.

    2. Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? Training details are elaborated in the supplemental material.

    3. Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? Error bars are not reported because it would be too computationally expensive.

    4. Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? The required training resources (number of GPUs) are elaborated in the supplemental material.

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    1. If your work uses existing assets, did you cite the creators? Please see Section 5.

    2. Did you mention the license of the assets? Please see the supplemental material.

    3. Did you include any new assets either in the supplemental material or as a URL?

    4. Did you discuss whether and how consent was obtained from people whose data you’re using/curating? All data we used is publicly available. Please see the supplemental material.

    5. Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? Please see the supplemental material.

  5. If you used crowdsourcing or conducted research with human subjects…

    1. Did you include the full text of instructions given to participants and screenshots, if applicable?

    2. Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

    3. Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?