weighted-hausdorff-loss
A loss function (Weighted Hausdorff Distance) for object localization in PyTorch
Recent advances in Convolutional Neural Networks (CNN) have achieved remarkable results in localizing objects in images. In these networks, the training procedure usually requires providing bounding boxes or the maximum number of expected objects. In this paper, we address the task of estimating object locations without annotated bounding boxes, which are typically hand-drawn and time consuming to label. We propose a loss function that can be used in any Fully Convolutional Network (FCN) to estimate object locations. This loss function is a modification of the Average Hausdorff Distance between two unordered sets of points. The proposed method does not require one to "guess" the maximum number of objects in the image, and has no notion of bounding boxes, region proposals, or sliding windows. We evaluate our method with three datasets designed to locate people's heads, pupil centers and plant centers. We report an average precision and recall of 94% over the three datasets, and an average location error of 6 pixels in 256x256 images.
Weighted Hausdorff Distance Loss: use it as point cloud similarity metric based loss for keras and tf. Useful in keypoint detection.
Custom Loss Functions in Keras
Tensorflow implementation for Weighted Hausdorff Distance: A Loss Function For Object Localization https://arxiv.org/abs/1806.07564
plaque detection using CNN
Locating objects in images is an important task in computer vision. A common approach in object detection is to obtain bounding boxes around the objects of interest. In this paper, we are not interested in obtaining bounding boxes. Instead, we define the object localization task as obtaining a single 2D coordinate corresponding to the location of each object. The location of an object can be any key point we are interested in, such as its center. Figure 1 shows an example of localized objects in images. Differently from other keypoint detection problems, we do not know in advance the number of keypoints in the image. To make the method as generic as possible, we also do not assume any physical constraint between the points, unlike in cases such as pose estimation. This definition of object localization is more appropriate for applications where objects are very small or substantially overlap (see the overlapping plants in Figure 1). In these cases, bounding boxes may not be provided by the dataset or they may be infeasible to groundtruth.

Bounding-box annotation is tedious, time-consuming and expensive [37].
For example, annotating ImageNet [43] required 42 seconds per bounding box when crowdsourcing on Amazon's Mechanical Turk using a technique specifically developed for efficient bounding box annotation [50]. In [6], Bell et al. introduce a new dataset for material recognition and segmentation. By collecting click location labels in this dataset instead of a full per-pixel segmentation, they reduce the annotation costs by an order of magnitude.

In this paper, we propose a modification of the average Hausdorff distance as a loss function of a CNN to estimate the location of objects. Our method does not require the use of bounding boxes in the training stage, and does not require knowing the maximum number of objects when designing the network architecture. For simplicity, we describe our method only for a single class of objects, although it can trivially be extended to multiple object classes. Our method is object-agnostic, thus the discussion in this paper does not include any information about the object characteristics. Our approach maps input images to a set of coordinates, and we validate it with diverse types of objects. We evaluate our method with three datasets. One dataset contains images acquired from a surveillance camera in a shopping mall, and we locate the heads of people. The second dataset contains images of human eyes, and we locate the center of the pupil. The third dataset contains aerial images of a crop field taken from an Unmanned Aerial Vehicle (UAV), and we locate the centers of highly occluded plants.
Our approach to object localization via keypoint detection is not a universal drop-in replacement for bounding box detection, especially for those tasks that inherently require bounding boxes, such as automated cropping. Also, a limitation of this approach is that bounding box labeling incorporates some sense of scale, while keypoints do not.
The contributions of our work are:
We propose a loss function for object localization, which we name weighted Hausdorff distance (WHD), that overcomes the limitations of pixelwise losses, such as the L2 loss, and of the Hausdorff distances.
We develop a method to estimate the location and number of objects in an image, without any notion of bounding boxes or region proposals.
We formulate the object localization problem as the minimization of distances between points, independently of the model used in the estimation. This allows the use of any fully convolutional network architecture.
We outperform state-of-the-art generic object detectors and achieve comparable results with crowd counting methods without any domain-specific knowledge, data augmentation, or transfer learning.
Generic object detectors.
Recent advances in deep learning [16, 27] have increased the accuracy of localization tasks such as object or keypoint detection. By generic object detectors, we mean methods that can be trained to detect any object type or types, such as Faster R-CNN [15], the Single Shot MultiBox Detector (SSD) [31], or YOLO [40]. In Fast R-CNN, candidate regions or proposals are generated by classical methods such as selective search [59]. Although activations of the network are shared between region proposals, the system cannot be trained end-to-end. Region Proposal Networks (RPNs) in object detectors such as Faster R-CNN [15, 41] allow for end-to-end training of models. Mask R-CNN [18] extends Faster R-CNN by adding a branch for predicting an object mask that runs in parallel with the existing branch for bounding box recognition. Mask R-CNN can estimate human pose keypoints by generating a segmentation mask with a single class indicating the presence of the keypoint. The loss function in Mask R-CNN is applied location by location, making the keypoint detection highly sensitive to the alignment of the segmentation mask. SSD provides fixed-sized bounding boxes and scores indicating the presence of an object in the boxes. The described methods either require groundtruthed bounding boxes to train the CNNs or require setting the maximum number of objects in the image being analyzed. In [19], it is observed that generic object detectors such as Faster R-CNN and SSD perform very poorly for small objects.

Counting and locating objects. Counting the number of objects in an image is not a trivial task. In [28], Lempitsky et al. estimate a density function whose integral corresponds to the object count. In [47], Shao et al. propose two methods for locating objects. One method first counts and then locates, and the other first locates and then counts.
Locating and counting people is necessary for many applications such as crowd monitoring in surveillance systems, surveys for new businesses, and emergency management [28, 60]. There are multiple studies in the literature where people in videos of crowds are detected and tracked [2, 7]. These detection methods often use bounding boxes around each human as ground truth. Acquiring bounding boxes for each person in a crowd can be labor intensive and imprecise under conditions where many people overlap, such as sports events or rush-hour agglomerations in public transport stations. More modern approaches avoid the need for bounding boxes by estimating a density map whose integral yields the total crowd count. In approaches that involve a density map, the label of the density map is constructed from the labels of the people's heads. This is typically done by centering Gaussian kernels at the location of each head. Zhang et al. [62] estimate the density map using a multi-column CNN that learns features at different scales. In [44], Sam et al. use multiple independent CNNs to predict the density map at different crowd densities. An additional CNN classifies the density of the crowd scene and relays the input image to the appropriate CNN. Huang et al. [20] propose to incorporate information about the body part structure into the conventional density map to reformulate crowd counting as a multi-task problem. Other works such as Zhang et al. [61] use additional information such as the groundtruthed perspective map.

Methods for pupil tracking and precision agriculture are usually domain-specific. In pupil tracking, the center of the pupil must be resolved in images obtained in real-world illumination conditions [13]. A wide range of applications, from commercial applications such as video games [52] to driving [48, 17] and microsurgery [14], rely on accurate pupil tracking. In remote precision agriculture, it is critical to locate the center of plants in a crop field. Agronomists use plant traits such as plant spacing to predict future crop yield [56, 51, 57, 12, 8], and plant scientists use them to breed new plant varieties [3, 35]. In [1], Aich et al. count wheat plants by first segmenting plant regions and then counting the number of plants in each segmented patch.
Hausdorff distance. The Hausdorff distance can be used to measure the distance between two sets of points [5]. Modifications of the Hausdorff distance [10] have been used for various tasks, including character recognition [33, 23] and scene matching [23]. Schutze et al. [46] use the average Hausdorff distance to evaluate solutions in multi-objective optimization problems. In [24], Elkhiyari et al. compare features extracted by a CNN according to multiple variants of the Hausdorff distance for the task of face recognition. In [11], Fan et al. use the Chamfer and Earth Mover's distance, along with a new neural network architecture, for 3D object reconstruction by estimating the location of a fixed number of points. The Hausdorff distance is also a common metric to evaluate the quality of segmentation boundaries in the medical imaging community [54, 63, 30, 55].

Our work is based on the Hausdorff distance, which we briefly review in this section. Consider two unordered, non-empty sets of points $X$ and $Y$, and a distance metric $d(x, y)$ between two points $x \in X$ and $y \in Y$. The function $d(\cdot, \cdot)$ could be any metric; in our case we use the Euclidean distance. The sets $X$ and $Y$ may have a different number of points. Let $\Omega$ be the space of all possible points. In its general form, the Hausdorff distance between $X$ and $Y$ is defined as
$$d_H(X, Y) = \max\left\{\, \sup_{x \in X} \inf_{y \in Y} d(x, y),\ \sup_{y \in Y} \inf_{x \in X} d(x, y) \,\right\} \tag{1}$$
When considering a discretized and bounded $\Omega$, such as all the possible pixel coordinates in an image, the suprema and infima are achievable and become maxima and minima, respectively. This bounds the Hausdorff distance as
$$d_H(X, Y) \leq \max_{x, y \,\in\, \Omega} d(x, y) =: d_{max} \tag{2}$$
which corresponds to the diagonal of the image when using the Euclidean distance. As shown in [5], the Hausdorff distance is a metric. Thus we have the following properties:
$$d_H(X, Y) \geq 0 \tag{3a}$$
$$d_H(X, Y) = 0 \iff X = Y \tag{3b}$$
$$d_H(X, Y) = d_H(Y, X) \tag{3c}$$
$$d_H(X, Y) \leq d_H(X, Z) + d_H(Z, Y) \tag{3d}$$
Equation (3b) follows from $X$ and $Y$ being closed, because in our task the pixel coordinate space $\Omega$ is discretized. These properties are very desirable when designing a function to measure how similar $X$ and $Y$ are [4].
Figure 2: Despite the clear difference in the distances between the points (solid vs. dashed dots), their Hausdorff distances are equal because the worst outlier is the same.
A shortcoming of the Hausdorff distance is its high sensitivity to outliers [46, 54]. Figure 2 shows an example for two finite sets of points with one outlier. To avoid this, the average Hausdorff distance is more commonly used:
$$d_{AH}(X, Y) = \frac{1}{|X|} \sum_{x \in X} \min_{y \in Y} d(x, y) + \frac{1}{|Y|} \sum_{y \in Y} \min_{x \in X} d(x, y) \tag{4}$$
where $|X|$ and $|Y|$ are the number of points in $X$ and $Y$, respectively. Note that properties (3a), (3b) and (3c) still hold, but (3d) does not. Also, the average Hausdorff distance is differentiable with respect to any point in $X$ or $Y$.
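To make these two definitions concrete, here is a small numerical sketch (not part of the original paper) that computes the Hausdorff distance of Equation (1) and the average Hausdorff distance of Equation (4) for two 2D point sets with the Euclidean metric; the function names and the example points are purely illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

def hausdorff(X, Y):
    """Hausdorff distance (Eq. 1) between two finite 2D point sets."""
    D = cdist(X, Y)                      # pairwise Euclidean distances
    return max(D.min(axis=1).max(),      # max over x of min over y of d(x, y)
               D.min(axis=0).max())      # max over y of min over x of d(x, y)

def average_hausdorff(X, Y):
    """Average Hausdorff distance (Eq. 4), less sensitive to outliers."""
    D = cdist(X, Y)
    return D.min(axis=1).mean() + D.min(axis=0).mean()

# One outlier dominates the Hausdorff distance but not the average one.
X = np.array([[10., 10.], [12., 11.], [50., 50.]])
Y = np.array([[10., 10.], [12., 11.]])
print(hausdorff(X, Y), average_hausdorff(X, Y))
```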
Let $Y$ contain the ground truth pixel coordinates, and $X$ be our estimate. Ideally, we would like to use $d_{AH}(X, Y)$ as the loss function during the training of our convolutional neural network (CNN). We find two limitations when incorporating the average Hausdorff distance as a loss function. First, CNNs with linear layers implicitly determine the estimated number of points as the size of the last layer. This is a drawback because the actual number of points depends on the content of the image itself. Second, FCNs such as U-Net [42] can indicate the presence of an object center with a higher activation in the output layer, but they do not return the pixel coordinates. In order to learn with backpropagation, the loss function must be differentiable with respect to the network output.
To overcome these two limitations, we modify the average Hausdorff distance as follows:
$$\hat{d}_{WH}(p, Y) = \frac{1}{\mathcal{S} + \epsilon} \sum_{x \in \Omega} p_x \min_{y \in Y} d(x, y) \;+\; \frac{1}{|Y|} \sum_{y \in Y} \underset{x \in \Omega}{M_{\alpha}} \big[\, p_x\, d(x, y) + (1 - p_x)\, d_{max} \,\big] \tag{5}$$
where
$$\mathcal{S} = \sum_{x \in \Omega} p_x \tag{6}$$
$$\underset{a \in A}{M_{\alpha}} \big[ f(a) \big] = \left( \frac{1}{|A|} \sum_{a \in A} f(a)^{\alpha} \right)^{1/\alpha} \tag{7}$$
is the generalized mean, and $\epsilon$ is set to $10^{-6}$. We call $\hat{d}_{WH}(\cdot, \cdot)$ the weighted Hausdorff distance (WHD). $p_x \in [0, 1]$ is the single-valued output of the network at pixel coordinate $x$. The last activation of the network can be bounded between zero and one by using a sigmoid non-linearity. Note that $p$ does not need to be normalized, i.e., $\sum_{x \in \Omega} p_x = 1$ is not necessary. Note that the generalized mean $M_{\alpha}$ corresponds to the minimum function when $\alpha \to -\infty$. We justify the modifications applied to Equation (4) to obtain Equation (5) as follows:
The $\epsilon$ in the denominator of the first term provides numerical stability when $\mathcal{S} \approx 0$.
When $p_x \in \{0, 1\}$, $\alpha \to -\infty$, and $\epsilon = 0$, the weighted Hausdorff distance becomes the average Hausdorff distance. We can interpret this as the network indicating with complete certainty where the object centers are. In this case, the global minimum ($\hat{d}_{WH} = 0$) corresponds to $p_x = 1$ if $x \in Y$ and $p_x = 0$ otherwise.
In the first term, we multiply by $p_x$ to penalize high activations in areas of the image where there is no ground truth point nearby. In other words, the loss function penalizes estimated points that should not be there.
In the second term, by using the expression $f(x) = p_x\, d(x, y) + (1 - p_x)\, d_{max}$
we enforce that:
If $p_x = 1$, then $f(x) = d(x, y)$. This means the point $x$ will contribute to the loss as in the AHD (Equation (4)).
If $p_x \approx 0$, then $f(x) \approx d_{max}$, regardless of how close $x$ is to $y$. Then, if $\alpha \to -\infty$, the point $x$ will not contribute to the loss because the "minimum" $M_{\alpha}$ will ignore $x$. If another point $x'$ closer to $y$ with $p_{x'} \approx 1$ exists, $x'$ will be "selected" instead by $M_{\alpha}$. Otherwise $M_{\alpha}$ will be high. This means that low activations around ground truth points will be penalized.
Note that $f(x)$ is not the only expression that would enforce these two constraints ($f(x)\big|_{p_x = 1} = d(x, y)$ and $f(x)\big|_{p_x = 0} = d_{max}$). We chose a linear function because of its simplicity and numerical stability.
Both terms in the WHD are necessary. If the first term is removed, then the trivial solution is $p_x = 1\ \forall x \in \Omega$. If the second term is removed, then the trivial solution is $p_x = 0\ \forall x \in \Omega$. These two cases hold for any value of $\alpha$, and the proof can be found in the appendix. Ideally, the parameter $\alpha \to -\infty$ so that $M_{\alpha}$ becomes the minimum operator [26]. However, this would make the second term flat with respect to the output of the network. For a given $y$, changes in $p_x$ at a point $x$ that is far from $y$ would be ignored by $M_{-\infty}$ if there is another point with high activation closer to $y$. In practice, this makes training difficult because the minimum is not a smooth function with respect to its inputs. Thus, we approximate the minimum with the generalized mean $M_{\alpha}$, with $\alpha < 0$. The more negative $\alpha$ is, the more similar to the AHD the WHD becomes, at the expense of the term becoming less smooth. In our experiments, $\alpha = -1$. There is no need to use $M_{\alpha}$ in the first term because $p_x$ is not inside the minimum, thus the term is already differentiable with respect to $p$.
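The authors' PyTorch implementation is linked at the end of this article. As an illustration only, the following simplified single-image sketch of Equation (5) uses assumed tensor shapes, $\epsilon = 10^{-6}$, and $\alpha = -1$; the function name and the clamping used for numerical stability are our own choices, not the reference implementation.

```python
import torch

def weighted_hausdorff_distance(p, gt_points, alpha=-1.0, eps=1e-6):
    """Simplified single-image weighted Hausdorff distance (Eq. 5).

    p:          (H, W) tensor with values in [0, 1], the network output p_x.
    gt_points:  (|Y|, 2) tensor of ground truth (row, col) coordinates.
    """
    h, w = p.shape
    d_max = (h ** 2 + w ** 2) ** 0.5                       # image diagonal, Eq. (2)

    # All pixel coordinates x in Omega, shape (H*W, 2).
    rows, cols = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    omega = torch.stack([rows, cols], dim=-1).reshape(-1, 2).float()
    p = p.reshape(-1)                                      # (H*W,)

    # Pairwise distances d(x, y), shape (H*W, |Y|).
    d = torch.cdist(omega, gt_points.float())

    # First term: (1 / (S + eps)) * sum_x p_x * min_y d(x, y), with S from Eq. (6).
    s = p.sum()
    term1 = (p * d.min(dim=1).values).sum() / (s + eps)

    # Second term: for each y, the generalized mean M_alpha (Eq. 7) over x of
    # f(x) = p_x d(x, y) + (1 - p_x) d_max, which approximates a minimum over x
    # because alpha is negative.
    f = p.unsqueeze(1) * d + (1.0 - p.unsqueeze(1)) * d_max    # (H*W, |Y|)
    m_alpha = (f.clamp(min=eps) ** alpha).mean(dim=0) ** (1.0 / alpha)
    term2 = m_alpha.mean()

    return term1 + term2
```

A batch can be handled by averaging this quantity over the images in the batch, which is also how the training loss sketch later in this article uses it.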
If the input image needs to be resized to be fed into the network, we can normalize the WHD to account for this distortion. Denote the original image size as $(H_o, W_o)$ and the resized image size as $(H, W)$. In Equation (5), we compute distances in the original pixel space by replacing $d(x, y)$ with $d(x \odot s, y \odot s)$, where $\odot$ denotes element-wise multiplication and

$$s = \left( \frac{H_o}{H},\, \frac{W_o}{W} \right) \tag{8}$$
A naive alternative is to use a one-hot map as the label, defined as $l_x = 1$ for $x \in Y$ and $l_x = 0$ otherwise, and then use a pixelwise loss, such as the Mean Squared Error (MSE), between $l$ and the estimated map. The issue with pixelwise losses is that they are not informative of how close two points $x$ and $y$ are unless $x = y$. In other words, the loss is flat for the vast majority of the pixels, making training unfeasible. This issue is locally mitigated in [58] by using the MSE loss with Gaussians centered at each $y \in Y$. By contrast, the WHD in Equation (5) will decrease the closer a high activation is to a ground truth point, making the loss function informative outside of the global minimum.
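To see this difference, the toy comparison below (reusing the `weighted_hausdorff_distance` sketch from above; the image size and point positions are arbitrary) places a single activated pixel either 5 or 30 pixels away from the true location: the pixelwise MSE against the one-hot label is identical in both cases, while the WHD grows with the distance.

```python
import torch

H, W = 64, 64
gt = torch.tensor([[32., 50.]])           # single ground truth point (row, col)
label = torch.zeros(H, W)
label[32, 50] = 1.0                       # one-hot label map

for col in (45, 20):                      # estimates 5 px and 30 px away
    pred = torch.zeros(H, W)
    pred[32, col] = 1.0
    mse = ((pred - label) ** 2).mean()
    whd = weighted_hausdorff_distance(pred, gt)
    print(f"offset {50 - col:2d} px: MSE = {float(mse):.6f}, WHD = {float(whd):.2f}")
```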
Figure 3: Network architecture. We add a small fully-connected layer that combines the deepest features and the estimated probability map to regress the number of points.
In this section, we describe the architecture of the fully convolutional network (FCN) we use, and how we estimate the final object locations. We want to emphasize that the network design is not a meaningful contribution of this work, thus we have not made any attempt to optimize it. Our main contribution is the use of the weighted Hausdorff distance as the loss function. We adopt the U-Net architecture [42] and modify it minimally for this task. Networks similar to U-Net have proven capable of accurately mapping the input image into an output image, when trained in a conditional adversarial network setting [22] or when using a carefully tuned loss function [42]. Figure 3 shows the hourglass design of U-Net. The residual connections between each layer in the encoder and its symmetric layer in the decoder are not shown for simplicity.
This FCN has two well differentiated blocks. The first block follows the typical architecture of a CNN. It consists of the repeated application of two $3 \times 3$ convolutions (with padding 1), each followed by a batch normalization operation and a Rectified Linear Unit (ReLU). After the ReLU, we apply a $2 \times 2$ max pooling operation with stride 2 for downsampling. At each downsampling step we double the number of feature channels, starting with 64 channels and using 512 channels for the last 5 layers. The second block consists of repeated applications of the following elements: a bilinear upsampling, a concatenation with the feature map from the downsampling block, and two $3 \times 3$ convolutions, each followed by a batch normalization and a ReLU. The final layer is a $1 \times 1$ convolution layer that maps to the single-channel output of the network, $p$.
To estimate the number of objects in the image, we add a branch that combines the information from the deepest level features and from the estimated probability map. This branch combines both features (the deepest feature vector and the estimated probability map) into a hidden layer, and uses the resulting 128-dimensional feature vector to output a single number. We then apply a ReLU to ensure the output is positive, and round it to the closest integer to obtain our final estimate of the number of objects, $\hat{C}$. Although we use this particular network architecture, any other architecture could be used. The only requirement is that the output of the network must be an image of the same size as the input image. The choice of an FCN arises from the natural interpretation of its output as the weights ($p_x$) in the WHD (Equation (5)). In previous works [24, 11], variants of the average Hausdorff distance were successfully used with non-FCN networks that estimate the point set directly. However, in those cases the size of the estimated set is fixed by the size of the last layer. To locate an unknown number of objects, the network must be able to estimate a variable number of object locations. Thus, we could envision the WHD also being used in non-FCN networks, as long as the output of the network is used as $p$ in Equation (5).
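A sketch of such a count-regression branch follows. The pooling operations, the 512-dimensional bottleneck and the 32x32 downsampled map are assumptions made to keep the example self-contained; only the 128-dimensional hidden layer and the final ReLU plus rounding follow the description above.

```python
import torch
import torch.nn as nn

class CountRegressor(nn.Module):
    """Illustrative count-regression branch: combines the deepest (bottleneck)
    features of the FCN with a downsampled version of the estimated map p,
    and regresses a single non-negative number."""

    def __init__(self, bottleneck_dim=512, map_size=32, hidden_dim=128):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                # bottleneck -> vector
        self.downsample = nn.AdaptiveAvgPool2d(map_size)   # p -> map_size x map_size
        self.fc = nn.Sequential(
            nn.Linear(bottleneck_dim + map_size * map_size, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 1),
            nn.ReLU(inplace=True),                         # ensures a positive output
        )

    def forward(self, bottleneck_feats, p_map):
        # bottleneck_feats: (B, bottleneck_dim, h, w);  p_map: (B, 1, H, W)
        v = self.pool(bottleneck_feats).flatten(1)         # (B, bottleneck_dim)
        m = self.downsample(p_map).flatten(1)              # (B, map_size**2)
        c_hat = self.fc(torch.cat([v, m], dim=1)).squeeze(1)
        return c_hat       # round to the closest integer at inference time
```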
The training loss we use to train the network is a combination of Equation (5) and a smooth L1 loss for the regression of the object count. The final training loss is
$$\mathcal{L}(p, Y) = \hat{d}_{WH}(p, Y) + \mathcal{L}_{reg}(\hat{C} - C) \tag{9}$$
where $Y$ is the set containing the ground truth coordinates of the objects in the image, $p$ is the output of the network, $C = |Y|$ is the true number of objects, and $\hat{C}$ is the estimated number of objects. $\mathcal{L}_{reg}(\cdot)$ is the regression term, for which we use the smooth L1 or Huber loss [21], defined as
$$\mathcal{L}_{reg}(z) = \begin{cases} 0.5\, z^2 & \text{if } |z| < 1 \\ |z| - 0.5 & \text{otherwise} \end{cases} \tag{10}$$
This loss is robust to outliers when the regression error is high, and at the same time is differentiable at the origin.
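Putting Equations (9) and (10) together, one possible way to compute the training loss for a batch could look like the following sketch, which reuses the `weighted_hausdorff_distance` function sketched above; averaging the WHD over the batch is our assumption, not a detail stated in the paper.

```python
import torch
import torch.nn.functional as F

def training_loss(p_maps, gt_point_sets, c_hat):
    """Eq. (9): per-image WHD plus smooth-L1 regression of the object count.

    p_maps:        (B, H, W) estimated maps p.
    gt_point_sets: list of B tensors, each (n_i, 2), ground truth coordinates.
    c_hat:         (B,) estimated object counts from the regression branch.
    """
    whd = torch.stack([
        weighted_hausdorff_distance(p_maps[i], gt_point_sets[i])
        for i in range(len(gt_point_sets))
    ]).mean()

    c_true = torch.tensor([float(len(y)) for y in gt_point_sets],
                          device=c_hat.device)
    reg = F.smooth_l1_loss(c_hat, c_true)      # Huber loss of Eq. (10)

    return whd + reg
```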
The network outputs a saliency map $p$, indicating with $p_x$ the confidence that there is an object at pixel $x$. Figure 4 shows $p$ in the second row. During evaluation, our ultimate goal is to obtain $\hat{X}$, i.e., the estimate of all object locations. In order to convert $p$ to $\hat{X}$, we threshold $p$ to obtain the pixels $T = \{ x \in \Omega \mid p_x > \tau \}$. We can use three different methods to decide which threshold $\tau$ to use:
Use a constant $\tau$ for all images.
Use Otsu thresholding [36] to find an adaptive $\tau$, different for every image.
Use a Beta mixture model-based thresholding (BMM). This method fits a mixture of two Beta distributions to the values of $p$ using the algorithm described in [45], and then takes the mean value of the distribution with the highest mean as $\tau$.

Figure 4 shows in the third row an example of the result of thresholding the saliency map $p$. Then, we fit a Gaussian mixture model to the thresholded pixels $T$. This is done using the expectation maximization (EM) [34] algorithm and the estimated number of objects $\hat{C}$. The means of the fitted Gaussians are considered the final estimate $\hat{X}$. The third row of Figure 4 shows the estimated object locations with red crosses. Note that even if the map produced by the FCN is of good quality, i.e., there is a cluster on each object location, EM may not yield the correct object locations if $\hat{C} \neq C$. An example can be observed in the first column of Figure 4, where a single head is erroneously estimated as two heads.
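This post-processing step (threshold the map, then fit a Gaussian mixture with $\hat{C}$ components and take its means as the locations) can be sketched as follows. The paper mentions using the scikit-learn implementation for the mixture fitting; the Otsu variant of the threshold is shown here for brevity, and the use of scikit-image for it is our own choice.

```python
import numpy as np
from skimage.filters import threshold_otsu
from sklearn.mixture import GaussianMixture

def estimate_locations(p_map, n_objects):
    """Threshold the estimated map and fit a GMM with n_objects components;
    the component means are returned as the estimated object locations."""
    tau = threshold_otsu(p_map)                        # adaptive threshold
    coords = np.argwhere(p_map > tau).astype(float)    # (row, col) of kept pixels
    if n_objects < 1 or len(coords) < n_objects:
        return np.empty((0, 2))                        # nothing to localize
    gmm = GaussianMixture(n_components=n_objects).fit(coords)
    return gmm.means_                                  # (n_objects, 2) locations
```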
We evaluate our method with three datasets.
The first dataset consists of 2,000 images acquired from a surveillance camera in a shopping mall. It contains annotated locations of the heads of the crowd. This dataset is publicly available at http://personal.ie.cuhk.edu.hk/~ccloy/downloads_mall_dataset.html [32]. 80%, 10% and 10% of the images were randomly assigned to the training, validation, and testing datasets, respectively.
The second dataset is presented in [13] with the roman letter V and publicly available at http://www.ti.uni-tuebingen.de/Pupil-detection.1827.0.html. It contains 2,135 images with a single eye, and the goal is to detect the center of the pupil. It was also randomly split into training, validation and testing datasets as 80/10/10 %, respectively.
The third dataset consists of aerial images of a crop field taken from a UAV flying at an altitude of 40 m. The images were stitched together to generate an orthoimage, with a resolution on the order of centimeters per pixel, shown in Figure 5. The location of the center of all plants in this image was groundtruthed, resulting in a total of 15,208 unique plant centers. This mosaic image was split, and the left 80% area was used for training, the middle 10% for validation, and the right 10% for testing. Within each region, random image crops were generated. These random crops have a uniformly distributed height and width between 100 and 600 pixels. We extracted 50,000 random image crops in the training region, and additional random crops in the validation and testing regions. Note that some of these crops may highly overlap. We are making the third dataset publicly available at https://engineering.purdue.edu/~sorghum/dataset-plant-centers-2016. We believe this dataset will be valuable for the community, as it poses a challenge due to the high occlusion between plants.

All the images were resized to 256x256 because that is the minimum size our architecture allows. The groundtruthed object locations were also scaled accordingly. As for data augmentation, we only use random horizontal flip. For the plant dataset, we also flipped the images vertically. We set $\alpha = -1$ in Equation (7). We have also experimented with other values of $\alpha$ with no apparent improvement, but we did not attempt to find an optimal value. We retrain the network for every dataset, i.e., we do not use pretrained weights. For the mall and plant datasets, we used a batch size of 32 and the Adam optimizer [25, 39] with momentum of 0.9. For the pupil dataset, we reduced the size of the network by removing the five central layers, and used a batch size of 64 and stochastic gradient descent with momentum of 0.9. At the end of each epoch, we evaluate the average Hausdorff distance (AHD) in Equation (4) over the validation set, and select the epoch with the lowest AHD on validation.

As metrics, we report Precision, Recall, F-score, AHD, Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Mean Absolute Percent Error (MAPE):
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |e_i|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} e_i^2} \tag{11}$$

$$\mathrm{MAPE} = \frac{100}{N} \sum_{i=1}^{N} \frac{|e_i|}{C_i} \tag{12}$$
where $e_i = \hat{C}_i - C_i$, $N$ is the number of images, $C_i$ is the true object count in the $i$-th image, and $\hat{C}_i$ is our estimate.
A true positive is counted if an estimated location is at most at distance $r$ from a ground truth point. A false positive is counted if an estimated location does not have any ground truth point at a distance of at most $r$. A false negative is counted if a true location does not have any estimated location at a distance of at most $r$. Precision is the proportion of our estimated points that are close enough to a true point. Recall is the proportion of the true points that we are able to detect. The F-score is the harmonic mean of precision and recall. Note that one can achieve a precision and recall of 100% even if we estimate more than one object location per ground truth point. This would not be an ideal localization. To take this into account, we also report metrics (MAE, RMSE and MAPE) that indicate if the number of objects is incorrect. The AHD can be interpreted as the average location error in pixels.
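These metrics can be computed directly from the definitions above; the sketch below is an illustrative implementation (no one-to-one matching between estimated and true points, following the definition of precision and recall given here).

```python
import numpy as np
from scipy.spatial.distance import cdist

def precision_recall(est, gt, r):
    """Precision and recall at radius r, per the definitions above."""
    if len(est) == 0 or len(gt) == 0:
        return 0.0, 0.0
    d = cdist(est, gt)
    precision = float((d.min(axis=1) <= r).mean())   # estimated points near a true point
    recall = float((d.min(axis=0) <= r).mean())      # true points with a nearby estimate
    return precision, recall

def count_metrics(c_true, c_est):
    """MAE, RMSE and MAPE over a set of images (Eqs. 11-12)."""
    c_true = np.asarray(c_true, float)
    e = np.asarray(c_est, float) - c_true
    mae = np.abs(e).mean()
    rmse = np.sqrt((e ** 2).mean())
    mape = 100.0 * (np.abs(e) / c_true).mean()   # assumes every image has objects
    return mae, rmse, mape
```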
Figure 8 shows the F-score as a function of $r$. Note that $r$ is only an evaluation parameter. It is not needed during training or testing. MAE, RMSE, and MAPE are shown in Table 1. Note that we are using the same architecture for all tasks, except for the pupil dataset, where we removed intermediate layers. Also, in the case of pupil detection, we know that there is always a single object in the image. Thus, regression is not necessary and we can remove the regression term in Equation (9) and fix $\hat{C} = 1$.
A naive alternative approach to object localization would be to use generic object detectors such as Faster R-CNN [41]. One can train these detectors by constructing bounding boxes with a fixed size centered at each labeled point. Then the center of each bounding box can be taken as the estimated location. We used bounding boxes whose size matches the approximate average head and pupil size, with correspondingly small anchor sizes. Note that these parameters may be suboptimal even though they were selected to match the type of object. The threshold we used for the softmax scores was 0.5 and for the intersection over union it was 0.4, because they minimize the AHD over the validation set. We used the VGG-16 architecture [49] and trained it using stochastic gradient descent with momentum of 0.9. For the pupil dataset, we always selected the bounding box with the highest score. We experimentally observed that Faster R-CNN struggles with detecting very small objects that are very close to each other. Tables 2-4 show the results of Faster R-CNN on the mall, pupil, and plant datasets. Note that the mall and plant datasets, with many small and highly overlapping objects, are the most challenging for Faster R-CNN. This behaviour is consistent with the observations in [19], where all generic object detectors perform very poorly and Faster R-CNN yields a mean Average Precision (mAP) of 5% in the best case.
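Constructing such fixed-size boxes from point labels is straightforward; the helper below is illustrative, and the box size is a free parameter rather than the value used in the paper (which is not stated here).

```python
import numpy as np

def points_to_boxes(points, box_size, img_h, img_w):
    """Build fixed-size boxes (x1, y1, x2, y2) centered at each labeled point,
    clipped to the image boundaries."""
    half = box_size / 2.0
    points = np.asarray(points, float)   # (N, 2) as (row, col)
    y, x = points[:, 0], points[:, 1]
    return np.stack([
        np.clip(x - half, 0, img_w - 1),
        np.clip(y - half, 0, img_h - 1),
        np.clip(x + half, 0, img_w - 1),
        np.clip(y + half, 0, img_h - 1),
    ], axis=1)
```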
We also experimented with using mean shift [9] instead of Gaussian mixtures (GM) to detect the local maxima. However, mean shift is prone to detecting multiple local maxima, and GMs are more robust against outliers. In our experiments, we observed that precision and recall were substantially worse than with GM. More importantly, using mean shift slowed down validation by an order of magnitude. The average time for the mean shift algorithm to run on one of our images was 12 seconds, while fitting a GM using expectation maximization took around 0.5 seconds, when using the scikit-learn implementations [38].
We also investigated the effect of the threshold $\tau$, and the three methods to select it presented in Section 5. One may think that this parameter could be a trade-off between some metrics, and that it should be cross-validated. In practice, we observed that $\tau$ does not balance precision and recall, thus a precision-recall curve is not meaningful. Instead, we plot the F-score as a function of $\tau$ in Figure 8. Also, cross-validating $\tau$ would imply fixing an "optimal" value for all images. Figure 6 shows that we can do better with adaptive thresholding methods (Otsu or BMM). Note that BMM thresholding (dashed lines) always outperforms Otsu (solid lines), and most fixed values of $\tau$. To justify the appropriateness of the BMM method, note that in Figure 4 most of the values in the estimated map $p$ are very high or very low. This makes a Beta distribution a better fit than a Normal distribution (as used in Otsu's method) to model $p$. Figure 7 shows the fitted BMM, a kernel density estimate of the values of $p$, and the threshold $\tau$ adaptively selected by the BMM method.

Lastly, as our method locates and counts objects simultaneously, it could be used as a counting technique. We also evaluated our technique on the task of crowd counting using the ShanghaiTech Part B dataset presented in [62], and achieve an MAE of 19.9. Even though we do not outperform state-of-the-art methods that are specifically fine-tuned for crowd counting [29], we can achieve comparable results with our generic method. We expect future improvements such as architectural changes or using transfer learning to further increase the performance.
A PyTorch implementation of the weighted Hausdorff distance loss and trained models are available at
https://github.com/javiribera/locating-objects-without-bboxes.

Table 1: Results of our method on the three datasets.

| Metric | Mall dataset | Pupil dataset | Plant dataset | Average |
|---|---|---|---|---|
| Precision | 95.2% | 99.5% | 88.1% | 94.4% |
| Recall | 96.2% | 99.5% | 89.2% | 95.0% |
| F-score | 95.7% | 99.5% | 88.6% | 94.6% |
| AHD | 4.5 px | 2.5 px | 7.1 px | 4.7 px |
| MAE | 1.4 | - | 1.9 | 1.7 |
| RMSE | 1.8 | - | 2.7 | 2.3 |
| MAPE | 4.4% | - | 4.2% | 4.3% |
Table 2: Results on the mall dataset.

| Metric | Faster R-CNN | Ours |
|---|---|---|
| Precision | 81.1% | 95.2% |
| Recall | 76.7% | 96.2% |
| F-score | 78.8% | 95.7% |
| AHD | 7.6 px | 4.5 px |
| MAE | 4.7 | 1.4 |
| RMSE | 5.6 | 1.8 |
| MAPE | 14.8% | 4.4% |
Table 3: Results on the pupil dataset.

| Method | Precision | Recall | AHD |
|---|---|---|---|
| Swirski [53] | 77% | 77% | - |
| ExCuSe [13] | 77% | 77% | - |
| Faster R-CNN | 99.5% | 99.5% | 2.7 px |
| Ours | 99.5% | 99.5% | 2.5 px |
Table 4: Results on the plant dataset.

| Metric | Faster R-CNN | Ours |
|---|---|---|
| Precision | 86.6% | 88.1% |
| Recall | 78.3% | 89.2% |
| F-score | 82.2% | 88.6% |
| AHD | 9.0 px | 7.1 px |
| MAE | 9.4 | 1.9 |
| RMSE | 13.4 | 2.7 |
| MAPE | 17.7% | 4.2% |
We have presented a loss function for the task of locating objects in images that does not need bounding boxes. This loss function is a modification of the average Hausdorff distance (AHD), which measures the similarity between two unordered sets of points. To make the AHD differentiable with respect to the network output, we have considered the certainty of the network when estimating an object location. The output of the network is a saliency map of object locations and the estimated number of objects. Our method is not restricted to a maximum number of objects in the image, does not require bounding boxes, and does not use region proposals or sliding windows. This approach can be used in tasks where bounding boxes are not available, or the small size of objects makes the labeling of bounding boxes impractical. We have evaluated our approach with three different datasets, and outperform generic object detectors and task-specific techniques. Future work will include developing a multi-class object location estimator in a single network, and evaluating more modern CNN architectures.
Acknowledgements: This work was funded by the Advanced Research Projects Agency-Energy (ARPA-E), U.S. Department of Energy, under Award Number DE-AR0000593. The views and opinions of the authors expressed herein do not necessarily reflect those of the U.S. Government or any agency thereof. We thank Professor Ayman Habib for the orthophotos used in this paper. Contact information: Edward J. Delp, ace@ecn.purdue.edu
In Section 4, we made the following claim:
Both terms of the weighted Hausdorff distance (WHD) are necessary. If the first term is removed, then $p_x = 1\ \forall x \in \Omega$ is a trivial solution that minimizes the WHD. If the second term is removed, then the trivial solution is $p_x = 0\ \forall x \in \Omega$.
If the first term is removed and $p_x = 1\ \forall x \in \Omega$, then Equation (5) reduces to

$$\frac{1}{|Y|} \sum_{y \in Y} \underset{x \in \Omega}{M_{\alpha}} \big[ d(x, y) \big].$$

From the definition in Equation (2), $0 \leq d(x, y) \leq d_{max}$ for all $x, y \in \Omega$. For any $y \in Y$ and any $p$, we have $p_x\, d(x, y) + (1 - p_x)\, d_{max} \geq d(x, y)$, and since the generalized mean $M_{\alpha}$ is non-decreasing in each of its arguments, no choice of $p$ can make the remaining term smaller than the expression above. Thus $p_x = 1\ \forall x \in \Omega$ minimizes it. Note that as $\alpha \to -\infty$ this minimum approaches $\min_{x \in \Omega} d(x, y) = 0$ because $y \in \Omega$, but the proof holds for any $\alpha$.

If the second term is removed and $p_x = 0\ \forall x \in \Omega$, then Equation (5) reduces to

$$\frac{1}{0 + \epsilon} \sum_{x \in \Omega} 0 \cdot \min_{y \in Y} d(x, y) = 0,$$

which is the global minimum of the remaining term, since that term is non-negative for any $p$, regardless of where the ground truth points are.
∎