1 Introduction
Supervised regression entails learning a model capable of predicting a continuous target value y from an input x, given a set of paired training examples. It is a fundamental machine learning problem with many important applications within computer vision and other domains. Common regression tasks within computer vision include object detection
(Jiang et al., 2018; Zhou et al., 2019), head- and body-pose estimation
(Xiao et al., 2018; Yang et al., 2019), age estimation (Rothe et al., 2016; Pan et al., 2018), visual tracking (Danelljan et al., 2019) and medical image registration (Niethammer et al., 2011; Chou et al., 2013), just to mention a few. While all of these tasks benefit from accurate regression of the target values, high accuracy can even be safety-critical in e.g. automotive and medical applications. Today, such regression problems are commonly tackled using Deep Neural Networks (DNNs), due to their ability to learn powerful feature representations from data. While classification is generally addressed using standardized losses and output representations, a wide variety of different techniques are employed for regression. The most conventional strategy is to train a DNN to directly predict a target y given an input x (Lathuilière et al., 2019). In such direct regression
approaches, the model parameters of the DNN are learned by minimizing a loss function, for example the L2 or L1 loss, penalizing the discrepancy between the predicted and ground truth target values. From a probabilistic perspective, this approach corresponds to creating a simple parametric model of the conditional target density p(y|x), and minimizing the associated negative log-likelihood. The L2 loss, for example, corresponds to a fixed-variance Gaussian model. More recent work (Kendall & Gal, 2017; Lakshminarayanan et al., 2017) has also explored learning more expressive models of p(y|x), by letting a DNN instead output the full set of parameters of a certain family of probability distributions. These
probabilistic regression approaches however still restrict the parametric model to fairly simple distributions in most cases, such as Gaussian (Chua et al., 2018) or Laplace (Gast & Roth, 2018; Ilg et al., 2018), limiting the expressiveness of the learned conditional target density. While these methods benefit from a clear probabilistic interpretation, they may thus not fully exploit the predictive power of the DNN.

The quest for improved regression accuracy has also led to the development of more specialized methods, designed for a specific set of tasks. In computer vision, one particularly popular approach is that of confidence-based regression. Here, a DNN instead predicts a scalar confidence value f(x, y) for input-target pairs (x, y). The confidence can then be maximized w.r.t. y to obtain a target prediction for a given input x. The approach is commonly employed for image-coordinate regression tasks within e.g. human pose estimation (Cao et al., 2017; Xiao et al., 2018) and object detection (Law & Deng, 2018; Zhou et al., 2019), where a 2D heatmap over image pixel coordinates is predicted. Recently, the approach was also applied to the problem of bounding box regression by Jiang et al. (2018). Their proposed method, IoU-Net, obtained state-of-the-art accuracy on object detection, and was later also successfully applied to the task of visual tracking (Danelljan et al., 2019). The training of such confidence-based regression methods does however entail generating additional pseudo ground truth labels, for example by employing a Gaussian kernel (Xiao et al., 2018), and selecting an appropriate loss function. This both requires numerous design choices to be made, and limits the general applicability of the methods. Moreover, confidence-based regression methods do not allow for a natural probabilistic interpretation in terms of the conditional target density p(y|x). In this work, we therefore set out to develop a method combining the general applicability and clear interpretation of probabilistic regression with the predictive power of confidence-based approaches.

Contributions We propose Deep Conditional Target Densities (DCTD), a novel and general regression method with a clear probabilistic interpretation. DCTD predicts the un-normalized conditional target density p(y|x) from the input-target pair (x, y). It is trained by directly minimizing the associated negative log-likelihood, exploiting Monte Carlo approximations. At test time, targets are predicted by maximizing the conditional target density through gradient-based refinement. Compared to confidence-based approaches, our DCTD requires no pseudo labels and benefits from a direct probabilistic interpretation. Unlike existing probabilistic models, our approach can learn highly flexible target densities directly from data, as visualized in Figure 1.
We evaluate the proposed method on four diverse computer vision regression tasks: object detection, age estimation, head-pose estimation and visual tracking. Our DCTD method is found to outperform both the direct regression baselines and popular probabilistic and confidence-based alternatives. Notably, our method achieves an improvement in AP over the FPN Faster-RCNN (Lin et al., 2017) baseline on the COCO dataset (Lin et al., 2014) when applied for object detection. It also sets a new state-of-the-art on standard benchmarks (Müller et al., 2018; 2016) when applied for bounding box regression in the ATOM visual tracking algorithm (Danelljan et al., 2019).
2 Background & related work
In supervised regression, the task is to learn to predict a target value y from a corresponding input x, given a training set D of i.i.d. input-target examples, D = {(x_i, y_i)}_{i=1}^N. As opposed to classification, the target space Y is a continuous set, e.g. Y = R^K. In computer vision, the input space X often corresponds to the space of images, whereas the output space Y depends on the task at hand. Common examples include Y = R in age estimation (Rothe et al., 2016), Y = R^2 in image keypoint estimation (Xiao et al., 2018), and Y = R^4 in object bounding box regression (Jiang et al., 2018).
Direct regression Over the last decade, DNNs have been shown to excel at a variety of regression problems. Here, a DNN is viewed as a function f_θ, parameterized by a set of learnable weights θ. The most conventional regression approach is to train a DNN to directly predict the targets, ŷ = f_θ(x), called direct regression. The model parameters θ are learned by minimizing a loss ℓ(f_θ(x_i), y_i) that penalizes the discrepancy between the prediction f_θ(x_i) and the ground truth target value y_i on training samples (x_i, y_i). The most common choices are the L2 loss, ℓ(ŷ, y) = (ŷ − y)², the L1 loss, ℓ(ŷ, y) = |ŷ − y|, and their close relatives (Huber, 1964; Lathuilière et al., 2019). From a probabilistic perspective, the choice of loss corresponds to minimizing the negative log-likelihood of a specific model of the conditional target density. For example, the L2 loss is derived from a fixed-variance Gaussian model, p(y|x; θ) = N(y; f_θ(x), σ²).
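The correspondence between the L2 loss and a fixed-variance Gaussian model can be made concrete in a few lines. The following is an illustrative sketch (not code from this work): up to an additive constant and a scale factor, the fixed-variance Gaussian negative log-likelihood equals the L2 loss, so both induce the same optimum.

```python
import math

def l2_loss(pred, y):
    # Standard L2 regression loss.
    return (pred - y) ** 2

def fixed_var_gaussian_nll(pred, y, sigma2=1.0):
    # Negative log-likelihood of y under N(pred, sigma2).
    # For fixed sigma2, this equals the L2 loss scaled by 1/(2*sigma2)
    # plus a constant, so minimizing it is equivalent to minimizing L2.
    return 0.5 * math.log(2 * math.pi * sigma2) + (pred - y) ** 2 / (2 * sigma2)
```

With sigma2 = 1, differences of the NLL between two predictions are exactly half the corresponding differences of the L2 loss, confirming the equivalence up to scale and shift.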
Probabilistic regression More recent work (Kendall & Gal, 2017; Lakshminarayanan et al., 2017; Chua et al., 2018) has explicitly taken advantage of this probabilistic perspective to achieve more flexible parametric models p(y|x; θ), by letting the DNN output the parameters of a family of probability distributions. For example, a general 1D Gaussian model can be realized as p(y|x; θ) = N(y; μ_θ(x), σ²_θ(x)), where the DNN outputs the mean μ_θ(x) and log-variance log σ²_θ(x). The model parameters θ are learned by minimizing the negative log-likelihood Σ_i −log p(y_i|x_i; θ) over the training set D. At test time, a target estimate is obtained by first predicting the density parameter values and then, for instance, taking the expected value of the predicted distribution. Previous work has applied Gaussian and Laplace models on computer vision tasks such as object detection (Feng et al., 2019; He et al., 2019) and optical flow estimation (Gast & Roth, 2018; Ilg et al., 2018). The aim of such probabilistic approaches is often not only to achieve accurate predictions, but also to provide an estimate of the aleatoric uncertainty (Kendall & Gal, 2017), which models noise and ambiguities inherent in the data itself. Our method also entails predicting a conditional target density and minimizing the associated negative log-likelihood. However, our model is not restricted to the functional form of any specific probability density (e.g. Gaussian or Laplace), but is instead directly defined by the DNN architecture itself, allowing for more expressive target densities.
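As a concrete illustration of this training loss, consider the 1D Gaussian case, where the per-sample negative log-likelihood takes mean and log-variance in place of the DNN outputs (an illustrative sketch, not the implementation used in this work):

```python
import math

def gaussian_nll(y, mean, log_var):
    # Negative log-likelihood of target y under N(mean, exp(log_var)):
    # the per-sample loss of a Gaussian probabilistic regression model
    # whose DNN outputs (mean, log_var) for each input.
    return 0.5 * (math.log(2 * math.pi) + log_var
                  + (y - mean) ** 2 / math.exp(log_var))
```

Predicting a larger variance down-weights the squared error but pays a log-variance penalty, which is how such models express aleatoric uncertainty.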
Confidence-based regression Another category of approaches reformulate the regression problem as y* = argmax_y f_θ(x, y), where f_θ(x, y) is a scalar confidence value predicted by the DNN. The idea is thus to predict a quantity f_θ(x, y), depending on both input x and target y, that can be maximized over y to obtain the final prediction y*. This maximization-based formulation is inherent in Structural SVMs (Tsochantaridis et al., 2005), but has also been adopted for DNNs. We term this family of approaches confidence-based regression. Different from direct regression, the predicted confidence can encapsulate multiple hypotheses and other ambiguities. Confidence-based regression has been shown particularly suitable for image-coordinate regression tasks, such as hand keypoint localization (Simon et al., 2017) and body-part detection (Wei et al., 2016; Pishchulin et al., 2016; Xiao et al., 2018). In these cases, a CNN is trained to output a 2D heatmap over the image pixel coordinates y, thus taking full advantage of the translational invariance of the problem. A similar approach has also been employed to locate the two defining corners (Law & Deng, 2018) or four extreme points (Zhou et al., 2019) of bounding boxes in object detection. In computer vision, confidence prediction has also been successfully employed for tasks other than pure image-coordinate regression. Jiang et al. (2018) proposed the IoU-Net for bounding box regression in object detection, where a bounding box y and image x are both inputs to the DNN predicting a confidence f_θ(x, y). It employs a pooling-based architecture that is differentiable w.r.t. the bounding box y, allowing gradient-based maximization to obtain the final estimate y*. IoU-Net was later also applied to visual tracking (Danelljan et al., 2019). In general, confidence-based approaches are trained using a set of generated pseudo label confidences c, and by employing a loss ℓ(f_θ(x, y), c).
One strategy (Pishchulin et al., 2016; Law & Deng, 2018) is to treat the confidence prediction as a binary classification problem, where c represents either the class, c ∈ {0, 1}, or its probability, c ∈ [0, 1], and to employ cross-entropy based losses. The other approach is to treat the confidence prediction as a direct regression problem itself, applying standard regression losses such as L2 (Simon et al., 2017; Danelljan et al., 2019; Wei et al., 2016) or the Huber loss (Jiang et al., 2018). In these cases, the pseudo label confidences c can be constructed using a similarity measure S in the target value space, c = S(y, y_i), for example defined as the Intersection over Union (IoU) between two bounding boxes (Jiang et al., 2018) or simply by a Gaussian kernel (Wei et al., 2016; Xiao et al., 2018). While these approaches have demonstrated impressive results, existing confidence-based approaches require important design choices. In particular, the strategy for constructing the pseudo labels and the choice of loss are often crucial for performance and highly task-dependent. Moreover, the predicted confidence can be difficult to interpret, since it has no natural connection to the conditional target density p(y|x). In contrast, our approach is directly trained to predict p(y|x) itself, and does not require generation of pseudo label confidences or choosing a specific loss.
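For concreteness, a Gaussian-kernel pseudo label confidence can be generated as in the following sketch (a hypothetical 1D version; heatmap-based methods apply the same idea over 2D pixel coordinates):

```python
import math

def gaussian_kernel_confidence(y, y_true, sigma=1.0):
    # Pseudo label confidence c = S(y, y_true) from a Gaussian kernel:
    # maximal (1.0) at the ground truth target, decaying with distance.
    # The bandwidth sigma is one of the task-dependent design choices.
    return math.exp(-((y - y_true) ** 2) / (2 * sigma ** 2))
```

Note that both the kernel bandwidth and the loss applied to these pseudo labels must be chosen per task, which is exactly the design burden discussed above.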
Regression-by-classification A regression problem can also be treated as a classification problem by first discretizing the target space Y into a finite set of classes. Standard techniques from classification, such as softmax and the cross-entropy loss, can then be employed. Rothe et al. (2016) additionally computed the softmax expected value to obtain a more fine-grained prediction, and applied their method to the task of age estimation. Ruiz et al. (2018) applied the same method to head-pose estimation, but also added an L2 loss term for the softmax expected value during training. Again for age estimation, Pan et al. (2018) then added an additional loss term penalizing the softmax variance. A hierarchical classification approach has also been proposed for both age estimation (Yang et al., 2018) and head-pose estimation (Yang et al., 2019). The discretization of the target space often complicates exploiting its inherent neighborhood structure. This has been addressed by exploring ordinal regression methods for 1D problems (Cao et al., 2019; Diaz & Marathe, 2019). Finally, classification into coarse discrete bins can be combined with direct regression, a technique often utilized in 2D (Redmon et al., 2016; Liu et al., 2016) and 3D (Shi et al., 2019; Qi et al., 2018) object detection. While our approach can be seen as a generalization of the softmax model for classification to the continuous target space Y, it does not suffer from the aforementioned drawbacks of regression-by-classification. On the contrary, our model naturally allows the network to exploit the full structure of the continuous target space.
3 Regression using deep conditional target densities
In this work, we take the probabilistic view of regression by creating a model p(y|x; θ) of the conditional target density p(y|x). Instead of defining p(y|x; θ) by letting a DNN predict the parameters of a certain family of probability distributions (e.g. Gaussian or Laplace), we construct a versatile model that can better leverage the predictive power of DNNs. To that end, we take inspiration from confidence-based regression approaches and let a DNN predict a scalar value f_θ(x, y) for any input-target pair (x, y). Unlike confidence-based methods however, this prediction has a clear probabilistic interpretation. Specifically, we view a DNN as a function f_θ: X × Y → R, parameterized by θ, that maps an input-target pair (x, y) to a scalar value f_θ(x, y). Then, we define the Deep Conditional Target Density (DCTD) according to,
p(y|x; θ) = exp(f_θ(x, y)) / Z(x, θ),   (1)
where Z(x, θ) = ∫ exp(f_θ(x, ỹ)) dỹ is the normalizing constant. We train our DCTD model by minimizing the negative log-likelihood −Σ_{i=1}^N log p(y_i|x_i; θ), where each term −log p(y_i|x_i; θ) is given by,
−log p(y_i|x_i; θ) = log(∫ exp(f_θ(x_i, y)) dy) − f_θ(x_i, y_i).   (2)
The training thus requires the evaluation of the normalizing constant Z(x_i, θ), i.e. the integral in equation 2. This can be achieved using effective finite approximations. In some tasks, such as image-coordinate regression, this is naturally performed by a grid approximation, utilizing the dense prediction already employed in many such methods. In this work, we however investigate a more generally applicable technique, namely Monte Carlo approximations. This procedure, when employed for training the network, is detailed in Section 3.1.
At test time, given an input x, our model in equation 1 allows evaluating the conditional target density p(y|x; θ) for any target y, by first approximating the constant Z(x, θ) and then predicting the scalar f_θ(x, y) using the DNN. This enables the computation of, for instance, means and variances of the target value y. In this work, we focus on finding the most likely prediction, y* = argmax_y p(y|x; θ) = argmax_y f_θ(x, y), which does not require the evaluation of Z(x, θ) during inference. Thanks to the auto-differentiation capabilities of modern deep learning frameworks, we can apply gradient-based techniques to find the final prediction by simply maximizing the network output f_θ(x, y) w.r.t. y. We elaborate on this procedure for prediction in Section 3.2.
3.1 Training
Our model p(y|x; θ) of the conditional target density is trained by minimizing the negative log-likelihood −Σ_{i=1}^N log p(y_i|x_i; θ). To evaluate the integral in equation 2, we employ a Monte Carlo approximation. Specifically, each term −log p(y_i|x_i; θ) is approximated by sampling M values from a proposal distribution q(y|y_i) that depends on the ground truth target y_i,
−log p(y_i|x_i; θ) ≈ log( (1/M) Σ_{m=1}^M exp(f_θ(x_i, y_i^{(m)})) / q(y_i^{(m)}|y_i) ) − f_θ(x_i, y_i).   (3)
The final loss J(θ) is then obtained by averaging over all n training samples in the mini-batch,

J(θ) = (1/n) Σ_{i=1}^n [ log( (1/M) Σ_{m=1}^M exp(f_θ(x_i, y_i^{(m)})) / q(y_i^{(m)}|y_i) ) − f_θ(x_i, y_i) ],   (4)

where {y_i^{(m)}}_{m=1}^M are samples drawn from q(y|y_i). Qualitatively, minimizing J(θ) encourages the DNN to output large values f_θ(x_i, y_i) for the ground truth target y_i, while minimizing the predicted value f_θ(x_i, y) at all other targets y. In ambiguous or uncertain cases, the DNN can output small values everywhere or large values at multiple hypotheses, but at the cost of a higher loss.
As seen in equation 4, the DNN f_θ is applied both to the input-target pair (x_i, y_i) and to all input-sample pairs (x_i, y_i^{(m)}) during training. While this can seem inefficient, most applications in computer vision employ network architectures that first extract a deep feature representation for the input x_i. The DNN can thus be designed to combine this input feature with the target y at a late stage, meaning that the input feature extraction process, which becomes the main computational bottleneck, needs to be performed only once for each x_i. In practice, we found our training strategy to not add any significant computational overhead compared to the baselines.

Compared to confidence-based regression, a significant advantage of our approach is that there is no need for generating task-dependent pseudo label confidences or choosing between different losses. The only design choice of our method is the proposal distribution q(y|y_i). Note however that since the loss (equation 4) explicitly adapts to q, this choice has no effect on the overall behaviour of the loss, only on the quality of the sampled approximation. We found a simple mixture of a few equally weighted Gaussian components, all centered at the target label y_i, to consistently perform well in our experiments. Specifically, we set,
q(y|y_i) = (1/K) Σ_{k=1}^K N(y; y_i, σ_k² I),   (5)
where the variances σ_k² are hyperparameters, selected based on a validation set for each experiment.
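Sampling from this proposal is straightforward: pick one of the K components uniformly at random, then draw from the corresponding Gaussian centered at the label. The snippet below is a 1D illustration with hypothetical standard deviation values:

```python
import random

def sample_proposal(y_true, sigmas):
    # Draw one sample from q(y|y_true) in equation 5: an equally
    # weighted mixture of Gaussians, all centered at the ground truth
    # target y_true, with component standard deviations `sigmas`.
    sigma = random.choice(sigmas)       # uniform choice of component
    return random.gauss(y_true, sigma)  # sample from that component
```

Mixing a small and a large standard deviation yields both samples close to the label, where the density should peak, and samples further away, where the density should be suppressed.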
3.2 Prediction
Given an input x at test time, the trained DNN can be used to evaluate the full target density p(y|x; θ), by employing the aforementioned techniques to approximate Z(x, θ). In many applications, the most likely prediction y* = argmax_y p(y|x; θ) is however the single desired output. For DCTD, this is obtained by directly maximizing the DNN output, y* = argmax_y f_θ(x, y), thus not requiring Z(x, θ) to be evaluated. By designing the DNN to be differentiable w.r.t. the target y, the gradient ∇_y f_θ(x, y) can be efficiently evaluated using the auto-differentiation tools implemented in modern deep learning frameworks. We can therefore perform gradient ascent to find a local maximum of f_θ(x, y). The gradient ascent refinement is performed either on a single initial estimate, or on a set of random initializations to obtain a final accurate prediction. As noted in Section 3.1, this prediction procedure can be made highly efficient in practice by extracting the deep feature representation for x only once. Back-propagation is then performed only through a few final layers of the DNN in order to evaluate the gradient ∇_y f_θ(x, y). Moreover, the gradient computation for a set of target candidates can be parallelized on the GPU by simple batching, requiring no significant overhead. Please refer to Appendix B for a detailed algorithm of this prediction procedure.
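The refinement step can be summarized by the following sketch. It is an illustrative stand-alone version: `grad_f` stands in for the gradient ∇_y f_θ(x, y), which in practice is obtained via auto-differentiation, and a fixed step-length is assumed.

```python
def refine_prediction(grad_f, y_init, step_length=0.1, num_steps=100):
    # Gradient ascent on the DNN output f_theta(x, y) w.r.t. the
    # target y, starting from an initial estimate y_init (Section 3.2):
    # y <- y + step_length * d f / d y.
    y = y_init
    for _ in range(num_steps):
        y = y + step_length * grad_f(y)
    return y
```

When several random initializations are used, the refined candidate attaining the largest network output is taken as the final prediction.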
4 Experiments
We perform comprehensive experiments on four different computer vision tasks. Our DCTD method is compared both to baseline regression methods and to state-of-the-art models. All experiments are implemented in PyTorch (Paszke et al., 2017).

4.1 Object detection
We first perform experiments on visual object detection, the problem of estimating a bounding box for each object in the image from a set of given classes. Specifically, we compare our regression method to other techniques for the task of bounding box regression, by integrating them into an existing object detection pipeline. To this end, we use the Faster-RCNN (Ren et al., 2015) framework, which serves as a popular baseline in the object detection field due to its strong state-of-the-art performance. It uses one network head for classification and a second head for regressing the bounding box using the direct method. We also compare our approach to the confidence-based IoU-Net (Jiang et al., 2018). It extends Faster-RCNN with an additional branch that predicts the IoU overlap between a target box y and the ground truth. The IoU prediction branch uses differentiable region pooling (Jiang et al., 2018), allowing the initial bounding box predicted by the Faster-RCNN to be refined using gradient-based maximization of the predicted IoU confidence.
For our approach, we employ an identical architecture as used in IoU-Net for a fair comparison. Instead of training the network to output the IoU, we predict the exponent f_θ(x, y) in equation 1, trained by minimizing the negative log-likelihood (NLL) in equation 4. We parametrize the bounding box y in terms of its center coordinate and size, normalized by a reference size, which is set to that of the ground truth during training and of the initial box during inference. For the proposal distribution (equation 5) we employ isotropic Gaussians, with standard deviations selected on the validation set. In addition to the standard IoU-Net, we compare with a version (denoted IoU-Net*) employing the same proposal distribution and inference settings as in our approach. For both our method and IoU-Net, we set the refinement step-length using grid search on a separate validation set. We also compare with a Gaussian and a Laplace probabilistic model for bounding box regression, obtained by modifying the Faster-RCNN regression head to predict both the mean and log-variance of the distribution, and adopting the NLL loss.

Table 1: Results for object detection on the COCO test-dev split.
Formulation | Direct | Gaussian | Laplace | Confidence | Confidence | DCTD
Approach | Faster-RCNN | | | IoU-Net | IoU-Net* | Ours
AP (%) | 37.2 | 36.7 | 37.1 | 38.3 | 38.2 | 39.1
AP50 (%) | 59.2 | 58.7 | 59.1 | 58.3 | 58.4 | 58.5
AP75 (%) | 40.3 | 39.6 | 40.2 | 41.4 | 41.4 | 41.8
Our experiments are performed on the large-scale COCO benchmark (Lin et al., 2014). As per the official guideline, we use the 2017 train split (∼118 000 images) for training and the 2017 val split (∼5 000 images) as the validation set for setting the hyperparameters. The results are reported on the 2017 test-dev split (∼20 000 images), in terms of the standard COCO metrics: AP (mean Average Precision over 10 IoU thresholds between 0.50 and 0.95), AP50, and AP75. We initialize all networks in our comparison with the pre-trained Faster-RCNN weights, using the ResNet50-FPN (Lin et al., 2017) backbone, and re-train only the newly added layers for a fair comparison. Further details are provided in Appendix C. The results are shown in Table 1. Our DCTD approach obtains the best results, outperforming both Faster-RCNN and IoU-Net in terms of AP.
4.2 Age estimation
In age estimation, we are given a cropped image x of a person's face, and the task is to predict his/her age y. We utilize the UTKFace (Zhang et al., 2017) dataset, specifically the subset of images and the dataset split used by Cao et al. (2019). Additionally, we hold out part of the training images for validation. Methods are evaluated in terms of the Mean Absolute Error (MAE). The DNN architecture of our DCTD first extracts ResNet50 (He et al., 2016) features from the input image x. The age y is processed by four fully-connected layers, generating a target feature vector. The two feature vectors are then concatenated and processed by two fully-connected layers, outputting f_θ(x, y). We apply our DCTD to refine the age predicted by baseline models, using the gradient ascent maximization of f_θ(x, y) (Section 3.2). All baseline DNN models employ a similar architecture, including an identical ResNet50 for feature extraction and the same number of fully-connected layers, to output either the age directly (Direct), mean and variance parameters for Gaussian and Laplace distributions, or logits for a set of discretized classes (Softmax). The results are found in Table 2. We observe that the age refinement provided by our DCTD method consistently improves the accuracy of the predictions generated by the baseline methods. Further details are provided in Appendix D.

Table 2: MAE results for age estimation on UTKFace.
+DCTD | Cao et al. (2019) | Direct | Gaussian | Laplace | Softmax (CE, L2) | Softmax (CE, L2, Var)
 | 5.47 ± 0.01 | 4.81 ± 0.02 | 4.79 ± 0.06 | 4.85 ± 0.04 | 4.78 ± 0.05 | 4.81 ± 0.03
✓ | – | 4.65 ± 0.02 | 4.66 ± 0.04 | 4.81 ± 0.04 | 4.65 ± 0.04 | 4.69 ± 0.03
4.3 Head-pose estimation
In head-pose estimation, we are given an image x of a person, and are tasked with predicting the orientation y of his/her head, where y consists of the Yaw, Pitch and Roll angles. We utilize the BIWI (Fanelli et al., 2013) dataset, specifically the processed dataset provided by Yang et al. (2019), in which the images have been cropped to faces detected using MTCNN (Zhang et al., 2016). We also employ protocol 2 as defined by Yang et al. (2019), and hold out part of the training images for validation. The methods are evaluated in terms of the average MAE for Yaw, Pitch and Roll. The network architecture of the DNN defining our DCTD takes the image x and orientation y as inputs, but is otherwise identical to the age estimation case (Section 4.2). Our DCTD model is again evaluated by applying the optimization-based refinement to the predicted orientation outputted by a number of baseline models. We use the same baselines as for age estimation, and apart from the minor changes required to increase the output dimension from one to three, identical network architectures are also used. The results are found in Table 3, and also in this case we observe that refinement using DCTD consistently improves upon the baselines. Further details are provided in Appendix E.

Table 3: Average MAE results for head-pose estimation on BIWI.
+DCTD | Yang et al. (2019) | Direct | Gaussian | Laplace | Softmax (CE, L2) | Softmax (CE, L2, Var)
 | 3.60 | 3.09 ± 0.07 | 3.12 ± 0.08 | 3.21 ± 0.06 | 3.04 ± 0.08 | 3.15 ± 0.07
✓ | – | 3.07 ± 0.07 | 3.11 ± 0.07 | 3.19 ± 0.06 | 3.01 ± 0.07 | 3.11 ± 0.06
4.4 Visual tracking
Lastly, we evaluate our approach on the problem of generic visual object tracking. The task is to estimate the bounding box of a target object in every frame of a video, where the target object is defined by a given box in the first video frame. We employ the recently introduced ATOM (Danelljan et al., 2019) tracker as our baseline. Given the first-frame annotation, ATOM trains a classifier to first roughly localize the target in a new frame. The target bounding box is then determined using an IoU-Net based module, which is also conditioned on the first-frame target appearance using a modulation-based architecture. We train our network to predict the conditional target density through f_θ(x, y) in equation 1, using a network architecture identical to the baseline ATOM tracker. In particular, we employ the same bounding box parameterization as for object detection (Section 4.1), and sample boxes during training from a proposal distribution (equation 5) generated by Gaussians with two different standard deviations. During tracking, we follow the same procedure as in ATOM, sampling a set of boxes in each frame, followed by gradient ascent to refine the estimate generated by the classification module.

We demonstrate results on two standard tracking benchmarks: TrackingNet (Müller et al., 2018) and UAV123 (Müller et al., 2016). TrackingNet contains challenging videos sampled from YouTube, with a test set of 511 videos. The main metric is the Success, defined as the average IoU overlap with the ground truth. UAV123 contains 123 videos captured from a UAV, and includes small and fast-moving objects. We report the overlap precision metric OP_T, defined as the percentage of frames having bounding box IoU overlap larger than a threshold T. The final AUC score is computed as the average OP over all thresholds. Hyperparameters are set on the OTB (Wu et al., 2015) and NFS (Galoogahi et al., 2017) datasets, containing 100 videos each. Due to the significant challenges imposed by the limited supervision and generic nature of the tracking problem, there are no competitive baselines employing direct bounding box regression. Current state-of-the-art methods employ either confidence-based regression, as in ATOM, or anchor-based bounding box regression techniques (Zhu et al., 2018; Li et al., 2019). We therefore only compare with the ATOM baseline and include other recent state-of-the-art methods in the comparison. As in Section 4.1, we compare with a version of the IoU-Net based ATOM (denoted ATOM*) employing the same training and inference settings as our final approach. The results are shown in Table 4.
Our approach achieves significant absolute improvements over ATOM on the overall metric on both TrackingNet and UAV123. Note that the improvements are most prominent for high-accuracy boxes, as indicated by the OP0.75 score. Moreover, our approach outperforms the recent SiamRPN++ (Li et al., 2019), which employs anchor-based bounding box regression (Ren et al., 2015; Redmon & Farhadi, 2016) and a much deeper backbone network (ResNet50) compared to ours (ResNet18). Figure 2 visualizes the conditional target density generated by our approach for tracking.
Table 4: Results for visual tracking on TrackingNet and UAV123.
Dataset | Metric | SiamFC (Bertinetto et al., 2016) | MDNet (Nam & Han, 2016) | DaSiamRPN (Zhu et al., 2018) | SiamRPN++ (Li et al., 2019) | ATOM (Danelljan et al., 2019) | ATOM* | Ours
TrackingNet | Precision (%) | 53.3 | 56.5 | 59.1 | 69.4 | 64.8 | 66.7 | 68.9
TrackingNet | Norm. Prec. (%) | 66.6 | 70.5 | 73.3 | 80.0 | 77.1 | 78.3 | 79.5
TrackingNet | Success (%) | 57.1 | 60.6 | 63.8 | 73.3 | 70.3 | 72.1 | 73.7
UAV123 | OP0.50 (%) | – | – | 73.6 | 75 | 78.9 | 79.6 | 80.1
UAV123 | OP0.75 (%) | – | – | 41.1 | 56 | 55.7 | 56.0 | 59.8
UAV123 | AUC (%) | – | 52.8 | 58.4 | 61.3 | 65.0 | 65.0 | 66.5
5 Conclusion
We proposed Deep Conditional Target Densities (DCTD), a novel and generally applicable regression method with a clear probabilistic interpretation. It directly models the conditional target density p(y|x) by predicting the un-normalized density through a DNN f_θ(x, y), taking the input-target pair (x, y) as input. The model is trained by minimizing the associated negative log-likelihood, employing a Monte Carlo approximation of the normalizing constant. At test time, targets are predicted by maximizing the DNN output f_θ(x, y) w.r.t. y via gradient-based refinement. Experiments performed on four diverse computer vision applications demonstrate the high accuracy and wide applicability of our method. However, this work constitutes an initial investigation of DCTD. Future directions include exploring better architectural designs, studying other regression applications, and investigating DCTD's potential for aleatoric uncertainty estimation.
Acknowledgments
This research was financially supported by the Swedish Foundation for Strategic Research (SSF) via the project ASSEMBLE (contract number: RIT15-0012) and by the project Learning flexible models for nonlinear dynamics (contract number: 2017-03807), funded by the Swedish Research Council.
References
 Bertinetto et al. (2016) Luca Bertinetto, Jack Valmadre, João F Henriques, Andrea Vedaldi, and Philip HS Torr. Fully-convolutional siamese networks for object tracking. In ECCV workshop, 2016.
 Cao et al. (2019) Wenzhi Cao, Vahid Mirjalili, and Sebastian Raschka. Rank-consistent ordinal regression for neural networks. arXiv preprint arXiv:1901.07884, 2019.

 Cao et al. (2017) Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7291–7299, 2017.
 Chou et al. (2013) Chen-Rui Chou, Brandon Frederick, Gig Mageras, Sha Chang, and Stephen Pizer. 2D/3D image registration using regression learning. Computer Vision and Image Understanding, 117(9):1095–1106, 2013.

 Chua et al. (2018) Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems (NeurIPS), pp. 4759–4770, 2018.
 Danelljan & Bhat (2019) Martin Danelljan and Goutam Bhat. PyTracking: Visual tracking library based on PyTorch. https://github.com/visionml/pytracking, 2019. Accessed: 12/08/2019.
 Danelljan et al. (2019) Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, and Michael Felsberg. ATOM: Accurate tracking by overlap maximization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4660–4669, 2019.
 Diaz & Marathe (2019) Raul Diaz and Amit Marathe. Soft labels for ordinal regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
 Fan et al. (2019) Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, and Haibin Ling. LaSOT: A high-quality benchmark for large-scale single object tracking. In CVPR, 2019.
 Fanelli et al. (2013) Gabriele Fanelli, Matthias Dantone, Juergen Gall, Andrea Fossati, and Luc Van Gool. Random forests for real time 3d face analysis. International Journal of Computer Vision (IJCV), 101(3):437–458, 2013.

Feng et al. (2019) Di Feng, Lars Rosenbaum, Fabian Timm, and Klaus Dietmayer. Leveraging heteroscedastic aleatoric uncertainties for robust real-time LiDAR 3D object detection. In 2019 IEEE Intelligent Vehicles Symposium (IV), pp. 1280–1287. IEEE, 2019.
Galoogahi et al. (2017) Hamed Kiani Galoogahi, Ashton Fagg, Chen Huang, Deva Ramanan, and Simon Lucey. Need for speed: A benchmark for higher frame rate object tracking. In ICCV, 2017.
 Gast & Roth (2018) Jochen Gast and Stefan Roth. Lightweight probabilistic deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3369–3378, 2018.
Girshick (2015) Ross B. Girshick. Fast R-CNN. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1440–1448, 2015.

Gu et al. (2017) Jinwei Gu, Xiaodong Yang, Shalini De Mello, and Jan Kautz. Dynamic facial analysis: From Bayesian filtering to recurrent neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1548–1557, 2017.
He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
He et al. (2017) Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988, 2017.
 He et al. (2019) Yihui He, Chenchen Zhu, Jianren Wang, Marios Savvides, and Xiangyu Zhang. Bounding box regression with uncertainty for accurate object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2888–2897, 2019.
Huang et al. (2018) Lianghua Huang, Xin Zhao, and Kaiqi Huang. GOT-10k: A large high-diversity benchmark for generic object tracking in the wild. arXiv preprint arXiv:1810.11981, 2018.
 Huber (1964) Peter J Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, pp. 73–101, 1964.
Ilg et al. (2018) Eddy Ilg, Ozgun Cicek, Silvio Galesso, Aaron Klein, Osama Makansi, Frank Hutter, and Thomas Brox. Uncertainty estimates and multi-hypotheses networks for optical flow. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 652–667, 2018.
 Jiang et al. (2018) Borui Jiang, Ruixuan Luo, Jiayuan Mao, Tete Xiao, and Yuning Jiang. Acquisition of localization confidence for accurate object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–799, 2018.
 Kendall & Gal (2017) Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems (NeurIPS), pp. 5574–5584, 2017.
 Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Lakshminarayanan et al. (2017) Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems (NeurIPS), pp. 6402–6413, 2017.
 Lathuilière et al. (2019) Stéphane Lathuilière, Pablo Mesejo, Xavier AlamedaPineda, and Radu Horaud. A comprehensive analysis of deep regression. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2019.
Law & Deng (2018) Hei Law and Jia Deng. CornerNet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 734–750, 2018.
Li et al. (2019) Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, and Junjie Yan. SiamRPN++: Evolution of siamese visual tracking with very deep networks. In CVPR, 2019.
Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 740–755, 2014.
 Lin et al. (2017) TsungYi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2117–2125, 2017.
 Liu et al. (2016) Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, ChengYang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 21–37. Springer, 2016.
Massa & Girshick (2018) Francisco Massa and Ross Girshick. maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch. https://github.com/facebookresearch/maskrcnn-benchmark, 2018. Accessed: 04/09/2019.
Müller et al. (2016) Matthias Müller, Neil Smith, and Bernard Ghanem. A benchmark and simulator for UAV tracking. In ECCV, 2016.
Müller et al. (2018) Matthias Müller, Adel Bibi, Silvio Giancola, Salman Al-Subaihi, and Bernard Ghanem. TrackingNet: A large-scale dataset and benchmark for object tracking in the wild. In ECCV, 2018.

Nam & Han (2016) Hyeonseob Nam and Bohyung Han. Learning multi-domain convolutional neural networks for visual tracking. In CVPR, 2016.
Niethammer et al. (2011) Marc Niethammer, Yang Huang, and François-Xavier Vialard. Geodesic regression for image time-series. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 655–662. Springer, 2011.
Niu et al. (2016) Zhenxing Niu, Mo Zhou, Le Wang, Xinbo Gao, and Gang Hua. Ordinal regression with multiple output CNN for age estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4920–4928, 2016.
Pan et al. (2018) Hongyu Pan, Hu Han, Shiguang Shan, and Xilin Chen. Mean-variance loss for deep age estimation from a face. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5285–5294, 2018.
 Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NeurIPS  Autodiff Workshop, 2017.
Pishchulin et al. (2016) Leonid Pishchulin, Eldar Insafutdinov, Siyu Tang, Bjoern Andres, Mykhaylo Andriluka, Peter V. Gehler, and Bernt Schiele. DeepCut: Joint subset partition and labeling for multi-person pose estimation. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 4929–4937, 2016.
Qi et al. (2018) Charles R Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas. Frustum PointNets for 3D object detection from RGB-D data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 918–927, 2018.
Redmon & Farhadi (2016) Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6517–6525, 2017.
 Redmon et al. (2016) Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, realtime object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788, 2016.
Ren et al. (2015) Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39:1137–1149, 2015.
Rothe et al. (2016) Rasmus Rothe, Radu Timofte, and Luc Van Gool. Deep expectation of real and apparent age from a single image without facial landmarks. International Journal of Computer Vision, 126(2-4):144–157, 2016.
Ruiz et al. (2018) Nataniel Ruiz, Eunji Chong, and James M Rehg. Fine-grained head pose estimation without keypoints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2074–2083, 2018.
Shi et al. (2019) Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. PointRCNN: 3D object proposal generation and detection from point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–779, 2019.
Simon et al. (2017) Tomas Simon, Hanbyul Joo, Iain A. Matthews, and Yaser Sheikh. Hand keypoint detection in single images using multiview bootstrapping. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 4645–4653, 2017.
 Tsochantaridis et al. (2005) Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res., 6:1453–1484, 2005.
Wei et al. (2016) Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 4724–4732, 2016.
 Wu et al. (2015) Yi Wu, Jongwoo Lim, and MingHsuan Yang. Object tracking benchmark. TPAMI, 37(9):1834–1848, 2015.
 Xiao et al. (2018) Bin Xiao, Haiping Wu, and Yichen Wei. Simple baselines for human pose estimation and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 466–481, 2018.

Yang et al. (2018) Tsun-Yi Yang, Yi-Hsuan Huang, Yen-Yu Lin, Pi-Cheng Hsiu, and Yung-Yu Chuang. SSR-Net: A compact soft stagewise regression network for age estimation. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2018.
Yang et al. (2019) Tsun-Yi Yang, Yi-Ting Chen, Yen-Yu Lin, and Yung-Yu Chuang. FSA-Net: Learning fine-grained structure aggregation for head pose estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1087–1096, 2019. URL https://github.com/shamangary/FSA-Net.

Zhang et al. (2016) Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint face detection and alignment using multi-task cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, 2016.
Zhang et al. (2017) Zhifei Zhang, Yang Song, and Hairong Qi. Age progression/regression by conditional adversarial autoencoder. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5810–5818, 2017. URL https://susanqq.github.io/UTKFace/.
Zhou et al. (2019) Xingyi Zhou, Jiacheng Zhuo, and Philipp Krahenbuhl. Bottom-up object detection by grouping extreme and center points. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 850–859, 2019.
Zhu et al. (2018) Zheng Zhu, Qiang Wang, Li Bo, Wei Wu, Junjie Yan, and Weiming Hu. Distractor-aware siamese networks for visual object tracking. In ECCV, 2018.
Appendix A Illustrative example
The ground truth conditional target density in Figure 1 is defined by a mixture of two Gaussian components (with weights and ) for , and a log-normal distribution (with , ) for . The training data was generated by uniform random sampling of , . Both models were trained for epochs with a batch size of using the ADAM (Kingma & Ba, 2014) optimizer. The Gaussian model is defined using a DNN according to,
It is trained by minimizing the negative log-likelihood, corresponding to the loss,
The DNN is a simple feedforward neural network, containing two shared fully-connected layers (dimensions: , ) and two identical heads for and of three fully-connected layers (, , ). The DCTD model is defined using a feedforward neural network containing two fully-connected layers (, ) for both and , and three fully-connected layers (, , ) processing the concatenated feature vector. It is trained using samples from a proposal distribution (equation 5) with and variances , .
Appendix B Prediction using deep conditional target densities
Appendix C Object Detection
Here, we provide further details about the network architectures, training procedure, and hyperparameters used for our experiments on object detection (Section 4.1).
C.1 Network architecture
We use the Faster R-CNN (Ren et al., 2015) detector with ResNet50-FPN (Lin et al., 2017) as our baseline. Faster R-CNN generates object proposals using a region proposal network (RPN). The features from the proposal regions are then pooled to a fixed-sized feature map using the RoIPool layer (Girshick, 2015). The pooled features are then passed through a feature extractor (denoted FeatBox) consisting of two fully-connected (FC) layers. The output feature vector is then passed through two parallel FC layers, one which predicts the class label (denoted FCCls), and another which regresses the offsets between the proposal and the ground truth box (denoted FCBBReg). We use the PyTorch implementation of Faster R-CNN from Massa & Girshick (2018). Note that we use the RoIAlign (He et al., 2017) layer instead of RoIPool in our experiments, as it has been shown to achieve better performance (He et al., 2017).
For the Gaussian and Laplace probabilistic models (Gaussian and Laplace in Table 1), we replace the FCBBReg layer in Faster R-CNN with two parallel FC layers, denoted FCBBMean and FCBBVar, which predict the mean and the log-variance of the distribution modeling the offset between the proposal and the ground truth box for each coordinate.
For our confidence-based IoUNet (Jiang et al., 2018) models (IoUNet and IoUNet in Table 1), we use the same network architecture as employed in the original paper. That is, we add an additional branch to predict the IoU overlap between the proposal box and the ground truth. This branch uses the PrRoIPool (Jiang et al., 2018) layer to pool the features from the proposal regions. The pooled features are passed through a feature extractor (denoted FeatConf) consisting of two FC layers. The output feature vector is passed through another FC layer, FCConf, which predicts the IoU. We use an identical architecture for our approach, but train it to output in equation 1 instead. Illustrations of the architectures are found in Figure 3.
C.2 Training
We use the pretrained weights for Faster R-CNN from Massa & Girshick (2018). Note that the bounding box regression in Faster R-CNN is trained using a direct method, with a Huber loss (Huber, 1964). We trained the other networks in Table 1 (Gaussian, Laplace, IoUNet, IoUNet and DCTD) on the MS-COCO (Lin et al., 2014) training split (2017 train) using stochastic gradient descent (SGD) with a batch size of 16 for 60k iterations. The base learning rate is reduced by a factor of after 40k and 50k iterations, for all the networks. We also warm up the training by linearly increasing the learning rate from to during the first 500 iterations. We use a weight decay of and a momentum of . For all the networks, we only trained the newly added layers, while keeping the backbone and the region proposal network fixed. For the Gaussian and Laplace models, we only train the final predictors (FCBBMean and FCBBVar), while keeping the class predictor (FCCls) and the box feature extractor (FeatBox) fixed. We also tried finetuning the FCCls and FeatBox weights, with different learning rate settings, but obtained worse performance on the validation set. The weights for both FCBBMean and FCBBVar were initialized with a zero-mean Gaussian with standard deviation of . Both the Gaussian and Laplace models were trained with a base learning rate by minimizing the negative log-likelihood.
For the IoUNet, IoUNet and our DCTD model, we only trained the newly added confidence branch. We found it beneficial to initialize the feature extractor block (FeatConf) with the corresponding weights from Faster R-CNN, i.e. the FeatBox block. The weights for the predictor FCConf were initialized with a zero-mean Gaussian with standard deviation of . As mentioned in the original paper, we used a base learning rate for the IoUNet and IoUNet networks. For our DCTD network, we used due to the different scaling of the loss. Note that we did not perform any parameter tuning for setting the learning rates. We generate proposals for each ground truth box during training. For the IoUNet, we use the proposal generation strategy mentioned in the original paper. That is, for each ground truth box, we generate a large set of candidate boxes which have an IoU overlap of at least with the ground truth, and uniformly sample proposals from this candidate set w.r.t. the IoU. For IoUNet and DCTD, we sample boxes from a proposal distribution (equation 5) generated by Gaussians with standard deviations of , , and . The IoUNet and IoUNet are trained by minimizing the Huber loss between the predicted IoU and the ground truth, while DCTD is trained by minimizing the negative log-likelihood of the training data.
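A proposal distribution of this kind, an equal-weight mixture of Gaussians centred on the ground truth box with several standard deviations, can be sketched as follows (the function name, the default standard deviations, and the normalized box parametrization are all illustrative, not the paper's exact settings):

```python
import numpy as np

def sample_proposals(gt_box, stds=(0.05, 0.1, 0.2), num=128, rng=None):
    """Draw box proposals from a mixture of Gaussians, each component
    centred on the ground truth box with a different standard deviation."""
    rng = np.random.default_rng() if rng is None else rng
    gt = np.asarray(gt_box, dtype=float)
    component = rng.integers(len(stds), size=num)   # pick a mixture component per sample
    noise = rng.standard_normal((num, gt.size))     # unit Gaussian noise
    return gt[None, :] + noise * np.asarray(stds)[component, None]

boxes = sample_proposals([0.0, 0.0, 1.0, 1.0], num=16)
```

Mixing several noise scales yields both near-duplicates of the ground truth and clearly displaced boxes, so the density over the whole confidence surface is supervised.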
C.3 Inference
The inference in both the Gaussian and Laplace models is identical to the one employed by Faster R-CNN. Thus, we do not utilize the predicted variances for inference. For IoUNet and IoUNet, we perform IoU-guided NMS as described in (Jiang et al., 2018), followed by optimization-based refinement (Algorithm 1). For our approach, we adopt the same NMS technique, but guide it with the values predicted by our network instead. We use a step length and step-length decay for IoUNet. For IoUNet and our approach, we perform the gradient-based refinement in the relative bounding box parametrization (see Section 4.1). Here, we employ different step lengths for position and size. For IoUNet, we use and respectively, with a decay of . For our DCTD approach, we use and with . For all methods, these hyperparameters ( and ) were set using a grid search on the MS-COCO validation split (2017 val). We used refinement iterations for each of the three models.
Appendix D Age estimation
In this appendix, further details on the age estimation experiments (Section 4.2) are provided.
D.1 DCTD network architecture
The DNN architecture of the DCTD model first extracts ResNet50 features from the input image . The age is processed by four fully-connected layers (dimensions: , , , ), generating . The two feature vectors , are then concatenated to form , which is processed by two fully-connected layers (, ), outputting .
D.2 DCTD training
The DCTD model is trained using samples from a proposal distribution (equation 5) with and variances , . It is trained for epochs with a batch size of , using the ADAM optimizer with weight decay of . The images are of size . For data augmentation, we use random flipping along the vertical axis and random scaling in the range . After random flipping and scaling, a random image crop of size is also selected. The ResNet50 is imported from torchvision.models in PyTorch with the pretrained option set to true; all other network parameters are randomly initialized using the default initializer in PyTorch.
D.3 DCTD prediction
For this experiment, we use a slight variation of Algorithm 1, which is found in Algorithm 2. There, is the number of gradient ascent iterations, is the step size, is an early-stopping threshold and is a degeneration tolerance. Following IoUNet, we set , and . Based on the validation set, we select . We refine a single estimate , predicted by each baseline model.
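A refinement procedure of this form, gradient ascent on a scalar score with an early-stopping threshold and a degeneration tolerance, can be sketched on a toy one-dimensional score (all names and default values are illustrative; the paper's actual hyperparameters are stated above):

```python
def refine(f, grad_f, y0, step_size=0.1, num_iters=30,
           early_stop=1e-4, degen_tol=0.1):
    """Refine y0 by gradient ascent on the score f(y). A step that
    decreases the score by more than degen_tol is rejected and iteration
    stops; an update smaller than early_stop also stops early."""
    y = y0
    for _ in range(num_iters):
        y_new = y + step_size * grad_f(y)
        if f(y_new) < f(y) - degen_tol:
            break          # degenerate step: keep the previous estimate
        converged = abs(y_new - y) < early_stop
        y = y_new
        if converged:
            break          # update too small: stop early
    return y

# Toy concave score with its maximum at y = 2.
score = lambda y: -(y - 2.0) ** 2
grad = lambda y: -2.0 * (y - 2.0)
y_hat = refine(score, grad, y0=0.0)
```

The degeneration check guards against overshooting into a region where the score collapses, while the early-stop check avoids wasted iterations once the estimate has settled.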
D.4 Baselines
All baselines are trained for epochs with a batch size of , using the ADAM optimizer with weight decay of . Identical data augmentation and parameter initialization as for DCTD is used.
Direct
The DNN architecture of Direct first extracts ResNet50 features from the input image . The feature vector is then processed by two fully-connected layers (, ), outputting the prediction . It is trained by minimizing either the Huber or loss.
Gaussian
The Gaussian model is defined using a DNN according to,
It is trained by minimizing the negative log-likelihood, corresponding to the loss,
The DNN architecture of first extracts ResNet50 features from the input image . The feature vector is then processed by two heads of two fully-connected layers (, ) to output and . The mean is taken as the prediction .
Laplace
The Laplace model is defined using a DNN according to,
It is trained by minimizing the negative log-likelihood, corresponding to the loss,
The DNN architecture of first extracts ResNet50 features from the input image . The feature vector is then processed by two heads of two fully-connected layers (, ) to output and . The mean is taken as the prediction .
Softmax
The DNN architecture of Softmax first extracts ResNet50 features from the input image . The feature vector is then processed by two fully-connected layers (, ), outputting logits for discretized classes. It is trained by minimizing either the cross-entropy (CE) and losses, , or the CE, and variance (Pan et al., 2018) losses, . The prediction is computed as the softmax expected value.
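The softmax expected-value prediction can be sketched as follows (the function name and the bin centres are illustrative; the number of discretized classes is given in the text above):

```python
import numpy as np

def softmax_expected_value(logits, bin_centers):
    """Prediction as the expected value of the discretized target
    under the softmax distribution defined by the network logits."""
    z = np.asarray(logits, dtype=float)
    z -= z.max()                           # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()        # softmax probabilities
    return float(p @ np.asarray(bin_centers, dtype=float))

# With equal logits the prediction is simply the mean of the bin centres.
age = softmax_expected_value([0.0, 0.0, 0.0], [20.0, 40.0, 60.0])
```

Taking the expectation, rather than the arg-max bin, yields a continuous prediction that is differentiable in the logits.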
D.5 Full results
Full experiment results, expanding the results found in Table 2 (Section 4.2), are provided in Table 5.
Method | MAE
OR-CNN (Niu et al., 2016) | 5.74 ± 0.05
CORAL-CNN (Cao et al., 2019) | 5.47 ± 0.01
Direct - Huber | 4.80 ± 0.06
Direct - Huber + DCTD | 4.74 ± 0.06
Direct - L2 | 4.81 ± 0.02
Direct - L2 + DCTD | 4.65 ± 0.02
Gaussian | 4.79 ± 0.06
Gaussian + DCTD | 4.66 ± 0.04
Laplace | 4.85 ± 0.04
Laplace + DCTD | 4.81 ± 0.04
Softmax - CE & L2 | 4.78 ± 0.05
Softmax - CE & L2 + DCTD | 4.65 ± 0.04
Softmax - CE, L2 & Var | 4.81 ± 0.03
Softmax - CE, L2 & Var + DCTD | 4.69 ± 0.03

Appendix E Head-pose estimation
In this appendix, further details on the head-pose estimation experiments (Section 4.3) are provided.
E.1 DCTD network architecture
The DNN architecture of the DCTD model first extracts ResNet50 features from the input image . The pose is processed by four fully-connected layers (dimensions: , , , ), generating . The two feature vectors , are then concatenated to form , which is processed by two fully-connected layers (, ), outputting .
E.2 DCTD training
The DCTD model is trained using samples from a proposal distribution (equation 5) with and variances , for Yaw, Pitch and Roll. It is trained for epochs with a batch size of , using the ADAM optimizer with weight decay of . The images are of size . For data augmentation, we use random flipping along the vertical axis and random scaling in the range . After random flipping and scaling, a random image crop of size is also selected. The ResNet50 is imported from torchvision.models in PyTorch with the pretrained option set to true; all other network parameters are randomly initialized using the default initializer in PyTorch.
E.3 DCTD prediction
For this experiment, we also use the prediction procedure detailed in Algorithm 2. Again following IoUNet, we set , and . Based on the validation set, we select . We refine a single estimate , predicted by each baseline model.
E.4 Baselines
All baselines are trained for epochs with a batch size of , using the ADAM optimizer with weight decay of . Identical data augmentation and parameter initialization as for DCTD is used.
Direct
The DNN architecture of Direct first extracts ResNet50 features from the input image . The feature vector is then processed by two fully-connected layers (, ), outputting the prediction . It is trained by minimizing either the Huber or loss.
Gaussian
The Gaussian model is defined using a DNN according to,
It is trained by minimizing the negative log-likelihood, corresponding to the loss,
The DNN architecture of first extracts ResNet50 features from the input image . The feature vector is then processed by two heads of two fully-connected layers (, ) to output and . The mean is taken as the prediction .
Laplace
Following Gast & Roth (2018), the Laplace model is defined using a DNN according to,
It is trained by minimizing the negative log-likelihood, corresponding to the loss,
The DNN architecture of first extracts ResNet50 features from the input image . The feature vector is then processed by two heads of two fully-connected layers (, ) to output and . The mean is taken as the prediction .
Softmax
The DNN architecture of Softmax first extracts ResNet50 features from the input image . The feature vector is then processed by three heads of two fully-connected layers (, ), outputting logits for discretized classes for the Yaw, Pitch and Roll angles (in degrees). It is trained by minimizing either the cross-entropy (CE) and losses, , or the CE, and variance (Pan et al., 2018) losses, . The prediction is obtained by computing the softmax expected value for Yaw, Pitch and Roll.
E.5 Full results
Full experiment results, expanding the results found in Table 3 (Section 4.3), are provided in Table 6.
Method | Yaw MAE | Pitch MAE | Roll MAE | Av. MAE
SSR-Net-MD (Yang et al., 2018) | 4.24 | 4.35 | 4.19 | 4.26
VGG16 (Gu et al., 2017) | 3.91 | 4.03 | 3.03 | 3.66
FSA-Caps-F (Yang et al., 2019) | 2.89 | 4.29 | 3.60 | 3.60
Direct - Huber | 2.78 ± 0.09 | 3.73 ± 0.13 | 2.90 ± 0.09 | 3.14 ± 0.07
Direct - Huber + DCTD | 2.75 ± 0.08 | 3.70 ± 0.11 | 2.87 ± 0.09 | 3.11 ± 0.06
Direct - L2 | 2.81 ± 0.08 | 3.60 ± 0.14 | 2.85 ± 0.08 | 3.09 ± 0.07
Direct - L2 + DCTD | 2.78 ± 0.08 | 3.62 ± 0.13 | 2.81 ± 0.08 | 3.07 ± 0.07
Gaussian | 2.89 ± 0.09 | 3.64 ± 0.13 | 2.83 ± 0.09 | 3.12 ± 0.08
Gaussian + DCTD | 2.84 ± 0.08 | 3.67 ± 0.12 | 2.81 ± 0.08 | 3.11 ± 0.07
Laplace | 2.93 ± 0.08 | 3.80 ± 0.15 | 2.90 ± 0.07 | 3.21 ± 0.06
Laplace + DCTD | 2.89 ± 0.07 | 3.81 ± 0.13 | 2.88 ± 0.06 | 3.19 ± 0.06
Softmax - CE & L2 | 2.73 ± 0.09 | 3.63 ± 0.13 | 2.77 ± 0.11 | 3.04 ± 0.08
Softmax - CE & L2 + DCTD | 2.67 ± 0.08 | 3.61 ± 0.12 | 2.75 ± 0.10 | 3.01 ± 0.07
Softmax - CE, L2 & Var | 2.83 ± 0.12 | 3.79 ± 0.10 | 2.84 ± 0.11 | 3.15 ± 0.07
Softmax - CE, L2 & Var + DCTD | 2.76 ± 0.10 | 3.74 ± 0.09 | 2.83 ± 0.10 | 3.11 ± 0.06

Appendix F Visual Tracking
Here, we provide further details about the training procedure and hyperparameters used for our experiments on visual object tracking (Section 4.4).
F.1 Training
We adopt the ATOM (Danelljan et al., 2019) tracker as our baseline, and use the PyTorch implementation and pretrained weights from Danelljan & Bhat (2019). ATOM trains an IoUNet-based module to predict the IoU overlap between a candidate box and the ground truth, conditioned on the first-frame target appearance. The IoU predictor is trained by generating candidates for each ground truth box. The candidates are generated by adding Gaussian noise to each ground truth box coordinate, while ensuring a minimum IoU overlap of between the candidate box and the ground truth. The network is trained by minimizing the squared error ( loss) between the predicted and ground truth IoU.
Our DCTD model is instead trained by sampling candidate boxes from a proposal distribution (equation 5) generated by Gaussians with standard deviations of and , and minimizing the negative log-likelihood of the training data. We use the training splits of the TrackingNet (Müller et al., 2018), LaSOT (Fan et al., 2019), GOT10k (Huang et al., 2018), and MS-COCO datasets for our training. Our network is trained for epochs, using the ADAM optimizer with a base learning rate of , which is reduced by a factor of after every epochs. The rest of the training parameters are exactly the same as in ATOM. The ATOM model is trained using the exact same proposal distribution, datasets and settings. It only differs by the loss, which is the same squared error between the predicted and ground truth IoU as in the original ATOM.
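The negative log-likelihood of a density of the form p(y|x) = exp(f(x, y)) / Z(x) involves the intractable normalizing constant Z(x); assuming the standard importance-sampling estimate over the proposal samples (a schematic, not the authors' exact implementation), it can be computed as:

```python
import numpy as np

def approx_nll(f_gt, f_samples, q_samples):
    """Monte Carlo estimate of -log p(y|x) = log Z(x) - f(x, y), where
    Z(x), the integral of exp(f(x, y')) over y', is approximated with
    samples y^(m) drawn from the proposal density q. f_gt is f(x, y) at
    the ground truth; f_samples and q_samples hold the values f(x, y^(m))
    and q(y^(m)) for the proposal samples."""
    log_Z = np.log(np.mean(np.exp(f_samples) / q_samples))
    return log_Z - f_gt
```

In practice the log-sum-exp would be computed in a numerically stable way; this sketch keeps the estimator in its plain form for readability.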
F.2 Inference
During tracking, the ATOM tracker first applies the classification head network, which is trained online, to coarsely localize the target object. 10 random boxes are then sampled around this prediction, to be refined by the IoU prediction network. We only alter the final bounding box refinement step of the 10 given random initial boxes, and preserve all other settings as in the original ATOM tracker. The original version performs gradient ascent iterations with a step length of . For our DCTD-based version and the ATOM version, we use iterations, employing the bounding box parameterization described in Section 4.1. For our approach, we set the step length to for position and for size dimensions. For ATOM, we use for position and for size dimensions. These parameters were set on the separate validation set. For simplicity, we adopt the vanilla gradient ascent strategy employed in ATOM for the two other methods as well. That is, we use no decay () and do not check whether the confidence score increases in each iteration.