Code for "Localization with Sampling-Argmax", NeurIPS 2021
Soft-argmax operation is commonly adopted in detection-based methods to localize the target position in a differentiable manner. However, training the neural network with soft-argmax makes the shape of the probability map unconstrained. Consequently, the model lacks pixel-wise supervision through the map during training, leading to performance degradation. In this work, we propose sampling-argmax, a differentiable training method that imposes implicit constraints on the shape of the probability map by minimizing the expectation of the localization error. To approximate the expectation, we introduce a continuous formulation of the output distribution and develop a differentiable sampling process. The expectation can be approximated by calculating the average error of all samples drawn from the output distribution. We show that sampling-argmax can seamlessly replace the conventional soft-argmax operation on various localization tasks. Comprehensive experiments demonstrate the effectiveness and flexibility of the proposed method. Code is available at https://github.com/Jeff-sjtu/sampling-argmax
Localizing the target position from the input is a fundamental task in the field of computer vision. Common approaches to localization can be divided into two categories: regression-based and detection-based. Detection-based methods show superiority over regression-based methods and demonstrate impressive performance on a wide variety of tasks Zhou et al. (2017); Sun et al. (2018); Zhang et al. (2018); He et al. (2019a); Lee et al. (2019); Honari et al. (2018); Li et al. (2019); Joung et al. (2020); Spezialetti et al. (2020); Li et al. (2021b); Shi et al. (2021). Probability maps (also referred to as heat maps) are predicted in detection-based methods to indicate the likelihood of the target position. The position with the highest probability is retrieved from the probability map with the argmax operation. However, the argmax operation is not differentiable and suffers from quantization error. For accurate localization and end-to-end learning, soft-argmax Goroshin et al. (2015); Finn et al. (2016) is proposed as an approximation of argmax. It has found a wide range of applications in human pose estimation Sun et al. (2018); Luvizon et al. (2018, 2019); Wang et al. (2020), facial landmark localization Honari et al. (2018); Liu et al. (2019); Chandran et al. (2020), stereo matching Zhou et al. (2017); Kendall et al. (2017); Duggal et al. (2019) and object keypoint estimation Shi et al. (2021).
Nevertheless, the mechanism of training networks with soft-argmax is rarely studied. The conventional training strategy is to minimize the error between the output coordinate from soft-argmax and the ground truth position. However, this strategy is deficient since it only provides constraints to the expectation of the probability map, not to its shape. As shown in Figure 1, these two maps have the same mean values, but the bottom one is more concentrated. In well-calibrated probability maps, positions that locate closer to the ground truth have higher probabilities. Reliable confidence scores of localization results could be provided, which is essential in unconstrained real-world applications and downstream tasks. Besides, imposing constraints on the probability map can provide supervised pixel-wise gradients and facilitate the learning process.
Prior work Nibali et al. (2018) attempts to shape the probability map by introducing hand-crafted regularizations. The variance regularization encourages the variance of the probability map to get close to a pre-defined variance. The Gaussian regularization forces the probability map to resemble a Gaussian distribution. We argue that these variants are overconstrained. The hand-crafted constraints are not always correct in different cases. For example, the underlying shape of the probability map is not necessarily Gaussian, and the underlying variance might change as the input changes. Forcing the model to learn a fixed-variance Gaussian distribution might degrade the model performance.
In this work, we present sampling-argmax, a novel training method to obtain well-calibrated probability maps and improve the localization accuracy. To constrain the shape of the map, we replace the objective function of minimizing “the error of the expectation” with minimizing “the expectation of the error”. In this way, the network is encouraged to generate higher probabilities around the ground truth position.
A natural way to estimate the expectation is by calculating the probability-weighted sum of the errors at all grid positions. However, we find that the gradient has high variance, and the model is hard to train. To address this issue, we choose to approximate the expectation by sampling. The expectation of the error is calculated as the mean error of all samples. Therefore, the sampling process should be differentiable for end-to-end learning.
In our work, we show that the likelihood of the target position can be modelled in the continuous space with a mixture distribution. Samples can be drawn from the mixture distribution by three steps: i) generate categorical weights from the probability map; ii) draw samples from sub-distributions; iii) obtain a sample by the category-weighted sum. The benefit of using mixture distribution is that differentiable sampling from arbitrary continuous distributions can be resolved by differentiable sampling from categorical distributions, which is less challenging and can be addressed by off-the-shelf discrete sampling methods.
Sampling-argmax is simple and effective. With out-of-the-box settings, it can be integrated into methods that use the soft-argmax operation. To study its effectiveness, we conduct experiments on a variety of localization tasks. Quantitative results demonstrate the superiority of sampling-argmax over soft-argmax and its variants. In summary, the contributions of this work are threefold:
We propose sampling-argmax for improving detection-based localization methods. By minimizing “the expectation of the error”, the network generates well-calibrated probability maps and obtains higher localization accuracy.
We show the output likelihood can be formulated as a mixture distribution and develop a differentiable sampling pipeline.
Comprehensive experiments show that sampling-argmax is effective and can be flexibly generalized to different localization tasks.
Given a learned discrete probability map $\pi$, the value $\pi_{y_i}$ indicates the probability of the predicted target appearing at $y_i$. A direct way to localize the target is taking the position with the maximum likelihood. However, this approach is non-differentiable, and the output is discrete, which impedes end-to-end training and brings quantization errors. Soft-argmax is an elegant approximation to address these issues:

$$\hat{y} = \sum_{i} \pi_{y_i}\, y_i. \tag{1}$$

$\pi$ is a normalized distribution and the soft-argmax operation calculates the probability-weighted sum, which is equivalent to taking the expectation of the probability map. A conventional way to train the model with the soft-argmax operation is minimizing the distance between the expectation and the ground truth:

$$\mathcal{L} = d(y_t, \hat{y}) = d(y_t, \mathbb{E}[y]), \tag{2}$$

where $y_t$ denotes the ground truth position and $d(\cdot, \cdot)$ denotes the distance function, e.g. the $\ell_1$ distance. We refer to this objective function as “the error of the expectation”.
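As a minimal sketch of the soft-argmax operation and "the error of the expectation" on a one-dimensional map, the following uses only the Python standard library. The grid, the raw scores and the target position are hypothetical values chosen for illustration, and the absolute difference stands in for the distance function.

```python
import math


def softmax(logits):
    """Normalize raw scores into a probability map that sums to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_argmax(prob_map, grid):
    """Probability-weighted sum of the grid positions (the map's expectation)."""
    return sum(p * y for p, y in zip(prob_map, grid))


def error_of_expectation(prob_map, grid, y_true):
    """Absolute (L1) distance between the soft-argmax output and the target."""
    return abs(y_true - soft_argmax(prob_map, grid))


grid = [0.0, 1.0, 2.0, 3.0, 4.0]               # hypothetical 1D grid positions
prob_map = softmax([0.0, 1.0, 3.0, 1.0, 0.0])  # hypothetical network scores
estimate = soft_argmax(prob_map, grid)
loss = error_of_expectation(prob_map, grid, y_true=2.3)
print(estimate, loss)
```

Because the output is a weighted average, it can fall between grid positions, which is how soft-argmax avoids the quantization error of a plain argmax.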
The conventional detection-based method with soft-argmax only supervises the expectation of the probability map. The shape of the distribution remains unconstrained. In well-calibrated probability maps, the positions closer to the ground truth should have higher probabilities. To this end, we propose a new objective function that optimizes “the expectation of the error” instead of “the error of the expectation”. In particular, the objective function is formulated as:

$$\mathcal{L} = \mathbb{E}_{y}\left[ d(y_t, y) \right], \tag{3}$$

where $y_t$ is the ground truth position, $d(\cdot, \cdot)$ is the distance function and the expectation is taken over the predicted distribution of the position $y$.
The learned distribution tends to allocate high probabilities around the ground truth to minimize the entire loss. In this way, the shape of the probability map is implicitly constrained.
The probability map predicted by the neural network is discrete. Similar to the soft-argmax operation, the expectation of the error can be approximated by calculating the probability-weighted sum of the errors at all grid positions:

$$\mathcal{L} \approx \sum_{i} \pi_{y_i}\, d(y_t, y_i). \tag{4}$$

This approximation treats the distribution of the target position as a discrete distribution. The target only appears at the grid positions, i.e. at position $y_i$ with the probability $\pi_{y_i}$.
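The difference between the two objectives can be illustrated with a small sketch. The two hand-written maps below are hypothetical; they share the same mean but differ in shape. The absolute difference is again assumed as the distance function.

```python
def error_of_expectation(prob_map, grid, y_true):
    """Distance between the map's mean (soft-argmax output) and the target."""
    mean = sum(p * y for p, y in zip(prob_map, grid))
    return abs(y_true - mean)


def expectation_of_error(prob_map, grid, y_true):
    """Probability-weighted sum of the errors at all grid positions."""
    return sum(p * abs(y_true - y) for p, y in zip(prob_map, grid))


grid = [0.0, 1.0, 2.0, 3.0, 4.0]
spread = [0.3, 0.1, 0.2, 0.1, 0.3]  # hypothetical map with mass far from the target
peaked = [0.0, 0.1, 0.8, 0.1, 0.0]  # hypothetical map concentrated on the target
y_true = 2.0

for name, prob_map in (("spread", spread), ("peaked", peaked)):
    print(name,
          error_of_expectation(prob_map, grid, y_true),
          expectation_of_error(prob_map, grid, y_true))
```

Both maps give an error of the expectation of (numerically) zero, so that objective cannot tell them apart, while the expectation of the error is much lower for the concentrated map.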
However, because the underlying target lies in a continuous space, modelling the distribution as a discrete distribution is not accurate. The probability map has limited resolution due to the computational cost. Besides, we find that the model is slow to converge when trained with Equation 4. Under the discrete approximation, the gradient of Equation 4 with respect to the model parameters $\theta$ is:

$$\nabla_\theta \mathcal{L} = \sum_{i} d(y_t, y_i)\, \nabla_\theta \pi_{y_i} = \sum_{i} \pi_{y_i}\, d(y_t, y_i)\, \nabla_\theta \log \pi_{y_i} = \mathbb{E}_{y \sim \pi}\left[ d(y_t, y)\, \nabla_\theta \log \pi_{y} \right].$$

Notice that the form of the gradient is similar to the score function estimator (SF), which is alternatively called the REINFORCE estimator Williams (1992). The SF estimator is known to have very high variance and is slow to converge. Therefore, using the discrete approximation for training is not a good solution. This challenge prompts us to explore a better approximation to calculate the expectation of the error.
A differentiable process is necessary to estimate the expectation by sampling. However, since the underlying probability density functions can vary among different input images, it is challenging to draw samples from arbitrary distributions differentiably. In this work, we present a unified method by formulating the target distribution as a mixture distribution.
Let $p(y)$ denote the underlying density function of the target position $y$, which is defined within the boundary of the input image. As illustrated in Figure 2(a), the interval can be divided into $n$ subintervals. The density function can be partitioned into shapes in the subintervals. We could use regular shapes (rectangles, triangles, Gaussian functions) in the subintervals to form the entire function (as illustrated in Figure 2(b-c)).
Formally, given a finite set of probability density functions $\{f_1(y), \dots, f_n(y)\}$ and weights $\{w_1, \dots, w_n\}$ such that $w_i \geq 0$ and $\sum_{i} w_i = 1$, the mixture density function is formulated as a sum:

$$p(y) = \sum_{i=1}^{n} w_i\, f_i(y).$$

Here, we can leverage the discrete probability map $\pi$ to represent the mixture weights, i.e. $w_i = \pi_{y_i}$. In the context of signal processing, the original function can be perfectly reconstructed if the sample rate (the distance between two adjacent grid points) satisfies the Nyquist-Shannon sampling theorem. However, in our case, the sub-function $f_i$ must be a probability density function, i.e. it has non-negative values, and its integral over the entire space is equal to $1$. Therefore, with these restrictions, the original function cannot be perfectly reconstructed. For approximation, we study three different types of standard density functions below.
For the uniform basis, the sub-function is a uniform distribution centred at the position :
where is the distance between two adjacent grid points.
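For the triangular basis, the sub-function $f_i$ is a triangular distribution centred at the position $y_i$:

$$f_i(y) = \frac{1}{\Delta} \max\left(0,\; 1 - \frac{|y - y_i|}{\Delta}\right),$$

with $\Delta$ the distance between two adjacent grid points. For all $y$ within the map, there exist adjacent grid points $y_i$ and $y_{i+1}$ that satisfy $y_i \leq y \leq y_{i+1}$. Therefore, we have $p(y) = \frac{1}{\Delta}\left[\pi_{y_i}\frac{y_{i+1} - y}{\Delta} + \pi_{y_{i+1}}\frac{y - y_i}{\Delta}\right]$, which is the linear interpolation of $\pi_{y_i}$ and $\pi_{y_{i+1}}$ scaled by $1/\Delta$. In other words, using triangular bases is equivalent to the linear interpolation of the discrete probability map.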
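The equivalence between triangular bases and linear interpolation can be checked numerically. This is a minimal sketch that assumes each triangle is centred on a grid point, has half-width equal to the grid spacing `delta`, and is normalized to integrate to one; the map values are hypothetical.

```python
def triangular_pdf(y, centre, delta):
    """Triangular density centred at `centre` with half-width `delta`."""
    return max(0.0, 1.0 - abs(y - centre) / delta) / delta


def mixture_density(y, prob_map, grid, delta):
    """Mixture of triangular bases weighted by the probability map."""
    return sum(p * triangular_pdf(y, c, delta) for p, c in zip(prob_map, grid))


def linear_interpolation(y, prob_map, grid, delta):
    """Linear interpolation of the map between the two grid points around y,
    divided by delta so that the result is a density."""
    i = min(int((y - grid[0]) // delta), len(grid) - 2)
    t = (y - grid[i]) / delta
    return ((1.0 - t) * prob_map[i] + t * prob_map[i + 1]) / delta


delta = 0.5
grid = [0.0, 0.5, 1.0, 1.5]
prob_map = [0.1, 0.4, 0.3, 0.2]  # hypothetical probability map

for y in (0.0, 0.2, 0.7, 1.1, 1.5):
    print(y, mixture_density(y, prob_map, grid, delta),
          linear_interpolation(y, prob_map, grid, delta))
```

The two columns agree at every query point inside the grid, since at most two neighbouring triangles overlap any position.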
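For the Gaussian basis, the sub-function $f_i$ is the Gaussian function centred at the position $y_i$:

$$f_i(y) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y - y_i)^2}{2\sigma^2}\right),$$

where $\sigma$ denotes the standard deviation. We set $\sigma$ to a fixed default value in the experiments.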
In this part, we present how to draw a sample from the mixture distribution. We first study the non-differentiable process and then present the differentiable approximation.
As illustrated in Figure 3(a), the non-differentiable sampling process can be divided into two steps: i) determine which sub-distribution the sample comes from; ii) draw a sample from the selected sub-distribution. In the first step, the sub-distribution can be selected by drawing a random variable from a categorical distribution. The categorical distribution is indicated by the predicted probability map. The sub-distribution $f_i$ is chosen with the probability $\pi_{y_i}$. There are a number of methods to draw samples from the categorical distribution. Here, we introduce the Gumbel-Max trick Gumbel (1954); Maddison et al. (2014):

$$\hat{w} = \mathrm{one\_hot}\left(\arg\max_{i}\left[ g_i + \log \pi_{y_i} \right]\right),$$

where $g_1, \dots, g_n$ are i.i.d. samples drawn from Gumbel(0, 1), and the sample $\hat{w}$ is a one-hot vector with the value $1$ in the maximum categorical column.
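The Gumbel-Max trick can be sketched in a few lines of standard-library Python. The probability map and the seed are hypothetical, and all probabilities are assumed to be strictly positive so that the logarithm is defined.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample by inverse transform sampling."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() can return exactly 0.0
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_max(prob_map, rng):
    """Return a one-hot vector selecting index i with probability prob_map[i]."""
    scores = [math.log(p) + sample_gumbel(rng) for p in prob_map]
    winner = max(range(len(scores)), key=lambda i: scores[i])
    return [1 if i == winner else 0 for i in range(len(scores))]


rng = random.Random(0)
prob_map = [0.1, 0.6, 0.3]  # hypothetical categorical distribution
n_draws = 20000
counts = [0] * len(prob_map)
for _ in range(n_draws):
    one_hot = gumbel_max(prob_map, rng)
    counts[one_hot.index(1)] += 1
frequencies = [c / n_draws for c in counts]
print(frequencies)
```

The empirical selection frequencies approach the probabilities in the map, but the argmax makes the output a step function of the map, so no useful gradient flows through it.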
In the second step, sampling from the standard basis function is easy to implement. This step is independent of the predicted probability map $\pi$. Therefore, the key to differentiable sampling from the mixture distribution is to make the first step differentiable.
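The differentiable sampling process consists of three steps. In the first step, we adopt the Gumbel-softmax Jang et al. (2017) operation to sample the categorical weight from the probability map. Gumbel-softmax is a continuous and differentiable approximation of the Gumbel-Max trick. We can obtain a point $\hat{w}$ on the $(n-1)$-dimensional simplex:

$$\hat{w}_i = \frac{\exp\left((g_i + \log \pi_{y_i})/\tau\right)}{\sum_{k=1}^{n} \exp\left((g_k + \log \pi_{y_k})/\tau\right)},$$

where $g_1, \dots, g_n$ are i.i.d. samples drawn from Gumbel(0, 1), $\hat{w} = (\hat{w}_1, \dots, \hat{w}_n)$ and $\hat{w}_i$ denotes the sampled weight of the sub-distribution $f_i$. As the softmax temperature $\tau$ approaches $0$, $\hat{w}$ becomes one-hot, and its distribution becomes identical to the categorical distribution $\pi$.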
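A minimal sketch of the Gumbel-softmax relaxation follows, using only the standard library. The map, seed and temperatures are hypothetical; the same seed is reused for both calls so that only the temperature differs.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample by inverse transform sampling."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_softmax(prob_map, tau, rng):
    """Relaxed categorical sample: softmax of (log-probability + Gumbel noise) / tau."""
    scores = [(math.log(p) + sample_gumbel(rng)) / tau for p in prob_map]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


prob_map = [0.1, 0.6, 0.3]  # hypothetical probability map
sharp = gumbel_softmax(prob_map, tau=0.1, rng=random.Random(0))
smooth = gumbel_softmax(prob_map, tau=10.0, rng=random.Random(0))
print(sharp)
print(smooth)
```

With identical noise, the low-temperature weights are closer to one-hot and the high-temperature weights are closer to uniform, which is the behaviour described for the softmax temperature.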
In the second step, we draw a sample $s_i$ from every sub-distribution $f_i$. Note that the sampled weight $\hat{w}$ is not completely one-hot. Therefore, we obtain the final sample $\tilde{y}$ in the third step by adding all samples together with the sampled weight $\hat{w}$:

$$\tilde{y} = \sum_{i=1}^{n} \hat{w}_i\, s_i.$$
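This process is illustrated in Figure 3(b). With the reparameterization trick, the sample is computed as a deterministic function of the probability map and the independent random variables. The randomness of the sampling process is transferred to the variable $\epsilon$. We denote the sampling process as $\tilde{y} = g(\pi_\theta, \epsilon)$, where $\pi_\theta$ is the probability map predicted with the model parameters $\theta$ and $\epsilon$ follows the multivariate Gumbel(0, 1) distribution. The gradient from the expected error to the model parameters is derived as:

$$\nabla_\theta\, \mathbb{E}_{\epsilon}\left[ d\left(y_t, g(\pi_\theta, \epsilon)\right) \right] = \mathbb{E}_{\epsilon}\left[ \frac{\partial d}{\partial g}\, \frac{\partial g}{\partial \pi_\theta}\, \frac{\partial \pi_\theta}{\partial \theta} \right].$$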
As we see, the gradient of the continuous sampling process is easy to compute via backpropagation. Therefore, we can relax the objective function by calculating the average error of the samples drawn from the mixture distribution. The objective function is written as:

$$\mathcal{L} = \frac{1}{N} \sum_{k=1}^{N} d\left(y_t, \tilde{y}^{(k)}\right),$$

where $\tilde{y}^{(k)}$ is the $k$-th drawn sample and $N$ denotes the number of samples. In the testing phase, no randomness is introduced, and sampling-argmax degrades to soft-argmax.
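The three sampling steps and the sample-averaged objective can be put together in a forward-pass sketch. It assumes a Gaussian basis, the absolute difference as the distance function, and hypothetical values for the maps, temperature, standard deviation and seed. It uses plain Python floats, so it computes the loss value only; obtaining gradients would require an automatic differentiation framework.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample by inverse transform sampling."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_softmax(prob_map, tau, rng):
    """Step 1: relaxed categorical weights from the probability map."""
    scores = [(math.log(p) + sample_gumbel(rng)) / tau for p in prob_map]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_position(prob_map, grid, tau, sigma, rng):
    """Steps 2 and 3: one sample per Gaussian sub-distribution, then weighted sum."""
    weights = gumbel_softmax(prob_map, tau, rng)
    sub_samples = [rng.gauss(centre, sigma) for centre in grid]
    return sum(w * s for w, s in zip(weights, sub_samples))


def sampling_argmax_loss(prob_map, grid, y_true, n_samples, tau, sigma, rng):
    """Average absolute error over n_samples drawn positions."""
    errors = [abs(y_true - sample_position(prob_map, grid, tau, sigma, rng))
              for _ in range(n_samples)]
    return sum(errors) / n_samples


rng = random.Random(0)
grid = [0.0, 1.0, 2.0, 3.0, 4.0]
spread = [0.3, 0.1, 0.2, 0.1, 0.3]       # hypothetical map, same mean as `peaked`
peaked = [0.01, 0.09, 0.8, 0.09, 0.01]   # hypothetical map concentrated on the target
loss_spread = sampling_argmax_loss(spread, grid, 2.0, 2000, tau=0.1, sigma=0.3, rng=rng)
loss_peaked = sampling_argmax_loss(peaked, grid, 2.0, 2000, tau=0.1, sigma=0.3, rng=rng)
print(loss_spread, loss_peaked)
```

Although both maps have the same soft-argmax output, the sampled loss is lower for the concentrated map, which is the implicit shape constraint the objective is meant to provide.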
While the sampling process is differentiable, the sample does not follow the original mixture distribution for non-zero temperature. For small temperatures, the distribution of the sample is close to the mixture distribution, but the variance of the gradients is large. There is a tradeoff between small temperatures and large temperatures. In our experiments, we start at a high temperature and anneal to a small temperature.
Nibali et al. (2018) introduced hand-crafted regularization to constrain the shape of the probability map.
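Variance Regularization. Variance regularization controls the variance of the probability map. It pushes the variance of the probability map close to the target variance $\sigma_t^2$:

$$\mathcal{L}_{var} = \left( \mathrm{Var}[\pi] - \sigma_t^2 \right)^2,$$

where the target variance $\sigma_t^2$ is a hyperparameter and the variance of the probability map $\mathrm{Var}[\pi]$ is approximated in a discrete manner, i.e. $\mathrm{Var}[\pi] = \sum_{i} \pi_{y_i} \left( y_i - \sum_{j} \pi_{y_j} y_j \right)^2$.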
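A minimal sketch of the variance regularization term on a one-dimensional map is given below. The maps and the target variance are hypothetical values, and the squared difference follows the description above.

```python
def discrete_variance(prob_map, grid):
    """Variance of the grid positions under the probability map."""
    mean = sum(p * y for p, y in zip(prob_map, grid))
    return sum(p * (y - mean) ** 2 for p, y in zip(prob_map, grid))


def variance_regularization(prob_map, grid, target_variance):
    """Squared gap between the map's variance and the target variance."""
    return (discrete_variance(prob_map, grid) - target_variance) ** 2


grid = [0.0, 1.0, 2.0, 3.0, 4.0]
peaked = [0.0, 0.1, 0.8, 0.1, 0.0]  # hypothetical concentrated map
spread = [0.3, 0.1, 0.2, 0.1, 0.3]  # hypothetical dispersed map
target_variance = 1.0               # hypothetical hyperparameter value

print(variance_regularization(peaked, grid, target_variance))
print(variance_regularization(spread, grid, target_variance))
```

Note that the concentrated map is also penalized here because its variance is below the fixed target, which illustrates how a pre-defined variance can work against a sharp, well-localized prediction.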
Distribution Regularization. Distribution regularization imposes strict regularization on the appearance of the heatmap to directly encourage a certain shape. Specifically, Nibali et al. (2018) forces the probability map to resemble a Gaussian distribution by minimizing the Jensen-Shannon divergence between $\pi$ and a target discrete Gaussian distribution:

$$\mathcal{L}_{D} = D_{JS}\left( \pi \,\|\, \mathcal{N}(y_t, \sigma_t^2) \right).$$
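The distribution regularization can be sketched as a Jensen-Shannon divergence between the predicted map and a discretized Gaussian. The normalization of the Gaussian over the grid, the map values and the standard deviation are assumptions made for this illustration.

```python
import math


def discrete_gaussian(grid, mean, sigma):
    """Gaussian evaluated on the grid and renormalized to sum to 1."""
    weights = [math.exp(-((y - mean) ** 2) / (2.0 * sigma ** 2)) for y in grid]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q):
    """KL(p || q); terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2)."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


grid = [0.0, 1.0, 2.0, 3.0, 4.0]
target = discrete_gaussian(grid, mean=2.0, sigma=1.0)  # hypothetical target shape
peaked = [0.0, 0.1, 0.8, 0.1, 0.0]                     # hypothetical predicted map

print(js_divergence(peaked, target))
print(js_divergence(target, target))
```

The divergence is zero only when the map matches the target shape exactly, so a correctly centred but sharper map still incurs a penalty.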
Unlike them, our objective function does not set pre-defined hyperparameters for the shape of the map, which makes it general and flexible in applying to various applications.
Other works Joung et al. (2020); Lee et al. (2019) study how to localize the target with soft-argmax in different situations. Joung et al. (2020) proposed sinusoidal soft-argmax for cylindrical probability maps. Lee et al. (2019) proposed kernel soft-argmax to make the results less susceptible to multi-modal probability maps. Our work is compatible with these methods by applying the sinusoidal function to the grid positions or multiplying the Gaussian kernel before obtaining the probability map.
Differentiable sampling for a discrete random variable has been studied for a long time. Maddison et al. (2017) and Jang et al. (2017) concurrently proposed the idea of using a softmax of Gumbel as relaxation for differentiable sampling from discrete distributions. Kočiskỳ et al. (2016) relaxed the discrete sampling by drawing symbols from a logistic-normal distribution rather than drawing from softmax. In this work, unlike previous methods that study discrete distributions, we focus on continuous distributions. We propose a relaxation of continuous sampling by formulating the target distribution as a mixture distribution.
We validate the benefits of the proposed sampling-argmax with experiments on a variety of localization tasks, including human pose estimation, retina segmentation and object keypoint estimation. Additional experiments on facial landmark localization are provided in the appendix. Sampling-argmax is compared with the conventional soft-argmax and the variants that use an additional auxiliary loss Nibali et al. (2018). Training details of all tasks are provided in the supplemental material.
We first evaluate the proposed sampling-argmax in 2D human pose estimation. In 2D human pose estimation, the probability map is a typical representation to localize body keypoints. The experiments are conducted on the large-scale in-the-wild 2D human pose benchmark – COCO Keypoint Lin et al. (2014). Significant progress has been achieved in this field Xiao et al. (2018); Sun et al. (2019); Moon et al. (2019b); McNally et al. (2020). We adopt the standard model SimplePose Xiao et al. (2018) for experiments. We follow the standard metric of COCO Keypoint and use mAP over 10 OKS (object keypoint similarity) thresholds for evaluation.
As shown in Table 1, the proposed sampling-argmax significantly outperforms the soft-argmax operation and its variants. Soft, Soft w/V.R. and Soft w/D.R correspond to conventional soft-argmax, soft-argmax with variance regularization and distribution regularization, respectively. Samp. Uni., Tri. and Gau. correspond to sampling-argmax with uniform, triangular and Gaussian basis, respectively. The triangular basis brings 5.3 mAP improvement (relative 8.2%) to the original soft-argmax operation. Besides, we find the auxiliary losses degrade the model performance in COCO Keypoint.
| Soft | Soft w/ V.R. | Soft w/ D.R. | Samp. Uni. | Samp. Tri. | Samp. Gau. |
In our method, the differentiable sampling process is utilized to approximate the expectation of the error. As the number of samples increases, the approximation will be closer to the underlying expectation. To study how the number of samples affects the final results, we compare the performance of models trained with different numbers of samples. Table 2 reports the results. It shows that a larger number of samples might improve the performance but is not necessary. Training the model with only one sample can still obtain high performance while saving computation resources.
| Soft w/ V.R. | 0.158 |
| Soft w/ D.R. | 0.082 |
For a well-calibrated probability map, the shape of the map could reflect the uncertainty of the regression output. When encountering challenging cases, the probability map would have a large variance, resulting in a lower peak value. In other words, the peak value is correlated with the prediction correctness. To demonstrate that the probability map trained with sampling-argmax is better calibrated, we calculate the Pearson correlation coefficient between the peak value and the prediction correctness. The correctness is represented by the OKS between the predicted pose and the ground-truth pose. Table 3 compares the correlation with prediction correctness among different methods. It shows that sampling-argmax has a much stronger correlation to the correctness than other methods. Compared to the soft-argmax operation, sampling-argmax with the triangular bases brings 85.4% relative improvement. It demonstrates that training with sampling-argmax can obtain a more reliable probability map, which is essential to real-world applications and downstream tasks.
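The calibration measure described above can be sketched as a plain Pearson correlation. The peak values and OKS scores below are hypothetical numbers for illustration only and are not results from the experiments.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Hypothetical per-instance values: peak of the probability map and OKS.
peak_values = [0.90, 0.70, 0.40, 0.20, 0.65]
oks_scores = [0.95, 0.80, 0.50, 0.30, 0.60]

print(pearson(peak_values, oks_scores))
```

A coefficient near 1 would mean the peak value is a good proxy for correctness, while a value near 0 would mean the peak carries little information about it.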
We further evaluate the proposed sampling-argmax on Human3.6M Ionescu et al. (2014), an indoor benchmark for 3D human pose estimation. The 3D probability map is adopted to represent the likelihoods for joints in the discrete 3D space. We adopt the model architecture of prior work Sun et al. (2018). Following previous methods Pavlakos et al. (2017); Sun et al. (2018); Moon et al. (2019a); Li et al. (2021a), MPJPE and PA-MPJPE Gower (1975) are used as the evaluation metrics. Comparisons with baselines are shown in Table 4. The proposed sampling-argmax provides consistent performance improvements. Different from the experiments on COCO Keypoint, the variance regularization provides performance improvements in Human3.6M.
| Soft | Soft w/ V.R. | Soft w/ D.R. | Samp. Uni. | Samp. Tri. | Samp. Gau. |
Using optical coherence tomography (OCT) to obtain 3D retina images is widely used in the clinic. A major goal of analyzing retinal OCT images is retinal layer segmentation. Previous work He et al. (2019a) proposes a regression method to regress the boundary and obtain the sub-pixel surface positions. One-dimensional probability maps are leveraged to model the position distribution of the surface in each column. In the testing phase, the soft-argmax method is used to infer the final surface positions. The entire surface can be reconstructed by connecting the surface positions in all columns.
The experiments are conducted on the multiple sclerosis and healthy controls dataset (MSHC) He et al. (2019b). Mean absolute distance (MAD) and standard deviation (Std. Dev.) are used as evaluation metrics. Quantitative results are reported in Table 5. It shows that sampling-argmax achieves superior performance to other methods, while the auxiliary losses also provide performance improvements.
| Soft | Soft w/ V.R. | Soft w/ D.R. | Samp. Uni. | Samp. Tri. | Samp. Gau. |
Detecting aligned 3D object keypoints from point clouds has a wide range of applications on object tracking, shape retrieval and robotics. Probability maps are adopted to localize the semantic keypoints. Different from the RGB input, the probability map indicates the pointwise score of the input point cloud, not the grid position of an image. The distances between the adjacent point-pairs are different. Besides, point clouds are unordered, and each point has a different number of neighbours. Therefore, it is hard to directly apply the uniform bases or linear interpolation, which requires a constant adjacent distance. Fortunately, the Gaussian basis can be adopted. In the experiment, we set the standard deviation of the Gaussian bases to , which is the average adjacent point distance in the input point clouds. PointNet++ Qi et al. (2017) is adopted as the backbone network. The experiments are conducted on the large-scale object keypoint dataset – KeypointNet You et al. (2020). The percentage of correct keypoints (PCK) Yi et al. (2017) is adopted for evaluation. The error distance threshold is set to .
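The PCK metric can be sketched as the fraction of keypoints whose Euclidean error falls within a distance threshold. The keypoints and the threshold below are hypothetical and do not reproduce the evaluation setting of the experiments.

```python
import math


def pck(predicted, ground_truth, threshold):
    """Fraction of keypoints whose Euclidean error is at most `threshold`."""
    correct = sum(1 for p, g in zip(predicted, ground_truth)
                  if math.dist(p, g) <= threshold)
    return correct / len(ground_truth)


# Hypothetical 3D keypoints on a normalized object.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
ground_truth = [(0.05, 0.0, 0.0), (1.0, 0.2, 0.0), (0.0, 1.0, 0.0), (0.0, 0.3, 1.0)]

print(pck(predicted, ground_truth, threshold=0.1))
```

Two of the four hypothetical keypoints lie within the threshold, so the sketch reports a PCK of 0.5.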
Table 6 shows the quantitative results on 16 categories. It shows that the proposed sampling-argmax is also effective on non-grid input data. Table 6 also compares the results of sampling-argmax with different numbers of samples. The best average performance among these settings is 39.9 PCK.
| Soft w/ V.R. | 64.1 | 41.6 | 39.2 | 53.2 | 12.5 | 38.3 | 37.7 | 44.5 | 3.7 | 39.8 | 52.8 | 44.0 | 24.9 | 25.6 | 54.4 | 30.7 | 37.9 |
| Soft w/ D.R. | 63.2 | 42.7 | 43.9 | 55.8 | 16.7 | 42.2 | 38.6 | 43.2 | 4.9 | 42.4 | 48.9 | 41.9 | 26.8 | 28.2 | 54.0 | 30.3 | 39.0 |
| Samp. Gau. () | 65.0 | 43.0 | 41.2 | 53.6 | 6.2 | 43.4 | 38.7 | 42.5 | 6.2 | 45.4 | 50.6 | 43.5 | 26.3 | 37.5 | 51.6 | 33.3 | 39.3 |
| Samp. Gau. () | 65.1 | 42.4 | 43.8 | 54.7 | 12.5 | 43.2 | 37.1 | 44.6 | 1.9 | 45.4 | 46.6 | 44.7 | 29.7 | 26.7 | 54.6 | 31.4 | 39.0 |
| Samp. Gau. () | 64.0 | 45.5 | 41.7 | 58.6 | 20.8 | 40.9 | 37.0 | 43.4 | 3.7 | 45.7 | 48.3 | 46.4 | 18.2 | 34.4 | 53.5 | 32.3 | 39.7 |
| Samp. Gau. () | 64.3 | 45.1 | 47.5 | 58.4 | 6.2 | 44.6 | 39.2 | 45.4 | 6.2 | 45.8 | 48.7 | 43.4 | 29.9 | 30.4 | 54.1 | 28.8 | 39.9 |
Quantitative results of supervised learning on the KeypointNet dataset, reported as PCK (higher is better). Each row lists the 16 per-category scores followed by their average.
We then evaluate the proposed method on object keypoint estimation in the context of unsupervised learning. The autoencoder framework is adopted to estimate the keypoints in an unsupervised manner. The encoder first estimates the 3D keypoints, and the decoder reconstructs the object point clouds from the estimated keypoints. We follow the state-of-the-art method Shi et al. (2021) that generates 3D keypoints with the soft-argmax operation for differentiable and end-to-end learning. The soft-argmax is replaced with sampling-argmax, where the Gaussian bases with the standard deviation are used.
The experiments are conducted on KeypointNet You et al. (2020). Unlike supervised learning, the semantic of each predicted keypoint is unknown in unsupervised methods. Therefore, the PCK metric is not applicable. For evaluation, we adopt the dual alignment score (DAS) following the previous method Shi et al. (2021). Table 7 reports the performance comparison with other methods.
| Soft w/ V.R. | 72.0 | 55.4 | 57.4 | 52.8 | 54.7 | 63.4 | 70.9 | 56.1 | 61.6 | 50.3 | 82.4 | 59.8 | 71.7 | 65.3 | 85.1 | 38.1 | 62.3 |
| Soft w/ D.R. | 47.9 | 35.5 | 47.3 | 46.1 | 58.3 | 65.5 | 60.9 | 35.3 | 47.6 | 69.3 | 64.1 | 55.0 | 45.9 | 44.2 | 57.6 | 28.8 | 50.6 |
| Samp. Gau. () | 73.9 | 53.8 | 63.5 | 43.9 | 67.0 | 69.3 | 77.7 | 46.6 | 59.1 | 55.9 | 87.8 | 59.0 | 67.0 | 66.2 | 80.3 | 36.4 | 62.9 |
| Samp. Gau. () | 73.1 | 54.0 | 61.9 | 48.4 | 64.4 | 67.0 | 81.1 | 50.7 | 55.2 | 50.1 | 87.5 | 58.2 | 58.9 | 65.9 | 77.9 | 41.2 | 62.2 |
| Samp. Gau. () | 73.9 | 58.8 | 61.7 | 46.2 | 60.9 | 68.6 | 72.0 | 53.6 | 56.5 | 48.1 | 91.6 | 59.8 | 68.8 | 65.8 | 83.5 | 34.9 | 62.8 |
| Samp. Gau. () | 71.2 | 56.7 | 60.0 | 51.0 | 58.4 | 64.1 | 83.8 | 47.6 | 61.8 | 47.8 | 91.3 | 55.5 | 68.5 | 70.6 | 81.7 | 37.5 | 63.0 |
Quantitative results of unsupervised learning on the KeypointNet dataset, reported as DAS. Each row lists the 16 per-category scores followed by their average.
Although the variants of soft-argmax can bring improvements in some cases, they need laborious tuning of parameters, such as the weight of the regularization term and the variance of the target distribution. The best parameters differ across tasks. Besides, the best parameters for variance regularization and distribution regularization are also different, which increases the effort needed for parameter tuning. In our experiments, we tune the loss weight in the range from 0.1 to 10 and the variance in the range from 1 to 5 for each task. After laborious tuning, the performance of these variants is still not consistent across different tasks and is inferior to that of our method, while our method works out of the box and is free from parameter tuning. Therefore, we think our method is effective and general across different cases.
In addition to a more accurate localization performance, sampling-argmax can predict well-calibrated probability maps and provide more reliable confidence scores. COCO Keypoint uses the mAP metric to evaluate multi-person pose estimation. Thus reliable confidence scores could also improve the performance. In other datasets, the metric only reflects the localization performance and ignores the importance of confidence scores. In many real-world applications and downstream tasks, a reliable confidence score is very important and necessary.
In this paper, we propose sampling-argmax, an operation for improving the detection-based localization. Sampling-argmax implicitly imposes shape constraints to the predicted probability map by optimizing “the expectation of error”. With the continuous formulation and differentiable sampling, sampling-argmax can seamlessly replace the conventional soft-argmax operation. We show that sampling-argmax is effective and flexible by conducting comprehensive experiments on various localization tasks.
Honari et al. (2018). Improving landmark localization with semi-supervised learning. In CVPR.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10863–10872.
Luvizon et al. (2018). 2D/3D pose estimation and action recognition using multitask deep learning. In CVPR.
Nibali et al. (2018). Numerical coordinate regression with convolutional neural networks. arXiv preprint arXiv:1801.07372.
Williams (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning.
Zhang et al. (2018). Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In ICRA.
In the supplemental document, we elaborate on the training settings (Appendix A), the broader impact of our work (Appendix B), limitations and future work (Appendix C), descriptions of the utilized datasets (Appendix D), experiments on facial landmark localization (Appendix E), a comparison between the learned distributions of soft-argmax and sampling-argmax (Appendix F), and qualitative results (Appendix G).
We adopt SimplePose for experiments. The model is trained and evaluated on COCO Keypoint. ResNet-50 is adopted as the backbone network. The input image is resized to . The learning rate is set to at first and reduced by a factor of at the th epoch and the th epoch. We use the Adam solver and train for epochs, with a mini-batch size of per GPU and 1080Ti GPUs in total. For comparison with the auxiliary losses, we set the target variance to , the loss weight of variance regularization to , and the loss weight of distribution regularization to to achieve the best results after tuning.
We follow the model architecture of Integral Pose . ResNet-50  is adopted as the backbone network. The input image is resized to . The learning rate is set to at first and reduced by a factor of at the th and th epoch. We use the Adam solver and train for epochs, with a mini-batch size of per GPU and 1080Ti GPUs in total. Following the settings of previous works [43, 34], we mix Human3.6M and MPII  data for training. Each mini-batch consists of half 2D and half 3D samples. Five subjects (S1, S5, S6, S7, S8) are used for training and two subjects (S9, S11) for evaluation. We set the target variance to , the loss weight of variance regularization to , and the loss weight of distributions regularization to to achieve the best results after tuning.
We follow the model architecture of . The input image is resized to . The learning rate is set to at first and reduced by a factor of at the th and the th epoch. We use the Adam solver and train for epochs, with a mini-batch size of and GPU. The split of training, validation and test sets follows the settings of the previous method . We set the target variance to , the loss weight of variance regularization to , and the loss weight of distributions regularization to to achieve the best results after tuning.
We adopt PointNet++  as the backbone network. The output of the last layer is a per-point probability map for each keypoint. The input point cloud consists of 2048 points represented by their Euclidean coordinates sampled from a normalized object, and the indexes of keypoints are given. The learning rate is set to and halved every epochs. We use Adam solver and train for 100 epochs with a mini-batch size of on one GPU for each category. We set the target variance to , the loss weight of variance regularization to , and the loss weight of distributions regularization to to achieve the best results after tuning.
The learning rate is set to and halved every epochs. We use the Adam solver and train for epochs, with a mini-batch size of and one GPU for each category. We set the target variance to , the loss weight of variance regularization to , and the loss weight of distributions regularization to to achieve the best results after tuning.
ResNet-18 is adopted as the backbone network. The head network consists of deconvolution layers and a convolution layer. The input image is resized to . The learning rate is set to at first and reduced by a factor of at the th and th epoch. We use the Adam solver and train for epochs, with a mini-batch size of and GPUs in total. We set the target variance to , the loss weight of variance regularization to , and the loss weight of distribution regularization to to achieve the best results after tuning.
In this work, we propose sampling-argmax to improve the ability of machines to understand target positions in input data. Current methods usually adopt computationally expensive models to improve localization accuracy, which can incur substantial financial and environmental costs. We partly alleviate this issue by presenting a simple yet effective method.
Furthermore, our method is an improvement of existing capabilities but does not introduce a radically new capability in machine learning. Thus our contribution is unlikely to facilitate misuse of technology that is already available to anyone.
In our method, the underlying density function of the target position is approximated by a mixture of sub-distributions. Comparing the performance of the three proposed bases, we see that a more accurate reconstruction of the underlying function leads to better results. In theory, however, the underlying density function cannot be perfectly reconstructed because the proposed basis distributions are fixed. To address this limitation, learnable sub-distributions could be adopted in future work. For example, normalizing flow models could be leveraged to predict the sub-distribution at each position from the corresponding features. The sub-distributions would then no longer be fixed, and the mixture distribution would have the potential to precisely reconstruct the underlying distribution and further improve model performance.
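To make the fixed-basis mixture concrete, the sketch below draws samples from a 1D probability map by first picking a position according to the map weights and then adding an offset drawn from a fixed basis distribution, and it estimates the expected localization error as the average error over the samples. This is a plain Monte Carlo illustration, not the differentiable sampling process used for training. The function names, the basis widths (uniform on [-0.5, 0.5], triangular on [-1, 1], Gaussian with standard deviation 0.5) and the L1 error are assumptions made for the example.

```python
import random


def sample_offset(basis, rng):
    """Draw an offset around a grid position from a fixed basis distribution.

    The supports and widths below are illustrative assumptions.
    """
    if basis == "uniform":
        return rng.uniform(-0.5, 0.5)
    if basis == "triangular":
        return rng.triangular(-1.0, 1.0, 0.0)  # low, high, mode
    if basis == "gaussian":
        return rng.gauss(0.0, 0.5)
    raise ValueError(f"unknown basis: {basis}")


def expected_error(weights, target, basis="triangular", n_samples=1000, seed=0):
    """Monte Carlo estimate of the expected L1 localization error."""
    rng = random.Random(seed)
    total = sum(weights)
    probs = [w / total for w in weights]  # normalize the map
    positions = list(range(len(probs)))
    err = 0.0
    for _ in range(n_samples):
        center = rng.choices(positions, weights=probs, k=1)[0]  # mixture component
        sample = center + sample_offset(basis, rng)             # sub-distribution draw
        err += abs(sample - target)
    return err / n_samples


if __name__ == "__main__":
    print(expected_error([0, 0, 1, 0, 0], target=2.0))        # single peak at the target
    print(expected_error([0.5, 0, 0, 0, 0.5], target=2.0))    # two peaks far from the target
```

The two maps in the usage lines have the same mean position, so a soft-argmax readout would treat them alike, whereas the sampled error is larger for the two-peak map.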
In our experiments, we use five different datasets: COCO Keypoint, Human3.6M, MSHC, KeypointNet and MTFL. These public datasets do not contain personally identifiable information or offensive content.
The MSHC dataset is publicly available, and no license is specified. We download the data from its official website.
The KeypointNet dataset is publicly available, and no license is specified. We download the data from its official website.
The MTFL dataset is publicly available, and no license is specified. We download the data from its official website.
We further evaluate the proposed sampling-argmax on the facial landmark localization dataset MTFL . Absolute error and relative error (normalized by the two-eye distance) are adopted as evaluation metrics. Quantitative results are reported in Table 8. Consistent with the experiments on other tasks, sampling-argmax provides performance improvement to facial landmark localization.
| Soft | Soft w/ V.R. | Soft w/ D.R. | Samp. Uni. | Samp. Tri. | Samp. Gau. |
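The two evaluation metrics can be sketched as follows. The example assumes that the absolute error is the mean Euclidean distance over landmarks and that the relative error divides it by the ground-truth distance between the two eyes. The function name and the eye indices (0 and 1) are hypothetical and should be adapted to the landmark order of the dataset at hand.

```python
import math


def landmark_errors(pred, gt, left_eye=0, right_eye=1):
    """Return (absolute error, relative error) for one face.

    pred, gt: sequences of (x, y) landmark coordinates in the same order.
    left_eye, right_eye: assumed indices of the eye landmarks in `gt`.
    """
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    abs_err = sum(dists) / len(dists)                    # mean Euclidean distance
    eye_dist = math.dist(gt[left_eye], gt[right_eye])    # two-eye distance
    return abs_err, abs_err / eye_dist


if __name__ == "__main__":
    gt = [(0.0, 0.0), (10.0, 0.0), (5.0, 5.0)]
    pred = [(3.0, 4.0), (10.0, 0.0), (5.0, 5.0)]
    print(landmark_errors(pred, gt))
```

Normalizing by the two-eye distance makes the error comparable across faces of different sizes in the image.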
We show the predicted probability maps of soft-argmax and sampling-argmax in Figure 4. Soft-argmax is prone to predicting multi-modal distributions, while the proposed sampling-argmax predicts better-calibrated probability maps.
For all authors…
Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? We claim that the proposed sampling-argmax can help the model obtain well-calibrated probability maps and improve the localization accuracy. In our experiments, we validate the localization accuracy of sampling-argmax across six tasks. We also demonstrate that the probability maps are well-calibrated by conducting correlation testing.
Did you describe the limitations of your work? Please see Section C.
Did you discuss any potential negative societal impacts of your work? Please see the supplemental material.
Have you read the ethics review guidelines and ensured that your paper conforms to them?
If you are including theoretical results…
Did you state the full set of assumptions of all theoretical results?
Did you include complete proofs of all theoretical results?
If you ran experiments…
Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? Our code is attached in the supplemental material.
Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? Training details are elaborated in the supplemental material.
Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? Error bars are not reported because it would be too computationally expensive.
Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? The required training resources (number of GPUs) are elaborated in the supplemental material.
If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…
If your work uses existing assets, did you cite the creators? Please see Section 5.
Did you mention the license of the assets? Please see the supplemental material.
Did you include any new assets either in the supplemental material or as a URL?
Did you discuss whether and how consent was obtained from people whose data you’re using/curating? All data we used is publicly available. Please see the supplemental material.
Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? Please see the supplemental material.
If you used crowdsourcing or conducted research with human subjects…
Did you include the full text of instructions given to participants and screenshots, if applicable?
Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?
Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?