1 Introduction
The ability to find multiple instances of characteristic entities in a scene is core to many computer vision applications. For example, finding people [29, 42], detecting an arbitrary number of classes and objects [27, 13, 26], and detecting local features [21, 2] all rely on this ability. In traditional vision pipelines, selecting the top-K responses in a heatmap and using their locations is the typical way to approach the problem [21, 2, 8]. However, due to the non-differentiable nature of this operation, it has not found immediate application in deep-learning-based solutions.
Circumventing top-K in end-to-end learning.
To overcome this challenge, researchers proposed to use grids [25, 13, 6], to simplify the formulation by isolating each instance [40], or to provide alternative supervision by optimizing over multiple branches [24]. While effective, these approaches do not generalize well outside the application domain for which they were designed. Other formulations, such as sequential detection [7] or channel-wise approaches [44], are problematic to apply when the number of instances of the same object is large.
Introducing MIST architectures.
Therefore, we introduce a new deep framework which we name Multiple Instance Spatial Transformer, or MIST for brevity. From a high level, the MIST framework first decomposes the image into a finite collection of patches, and then processes these patches to perform a given task. As illustrated in Figure 2 for the image synthesis task, given an image we first compute a heatmap via a deep network whose local maxima correspond to locations of interest. From this heatmap, we gather the parameters of the top $K$ local maxima, and then extract the corresponding collection of image patches via a resampling process. On each patch in the collection, we execute the same task-specific network, whose outputs are aggregated to evaluate a task-specific loss. We then optimize this task loss to train the entire framework.
Training MISTs by lifting.
Training a pipeline that includes a non-differentiable selection/gather operation is non-trivial. To solve this problem we propose to lift the problem to a higher dimensional one by treating the parameters defining the interest points as slack variables, and introduce a hard constraint that they must correspond to the output that the heatmap network gives. This constraint is realized by introducing an auxiliary function that creates a heatmap given a set of interest point parameters. We then solve for the relaxed version of this problem, where the hard constraint is turned into a soft one, and the slack variables are also optimized within the training process. Critically, our training strategy allows us to have an optimizable version of ① non-maximum suppression, and ② top-K selection, thus creating a network architecture resembling compute strategies that were dominant in pre-deep-learning computer vision.
Applications.
To demonstrate the capabilities of MISTs, we evaluate our network on a variety of weakly-supervised multi-instance problems. Note how in some of these applications, the value of $K$ is the only supervision signal we provide. We consider ① the problem of recovering the basis functions that created a given texture, and ② the classification of numbers in cluttered scenes where the only supervision is the occurrence of these numbers.
In summary, in this paper:

we introduce the MIST framework for weakly-supervised multi-instance visual learning;

we propose a training method that allows the use of top-K approaches for end-to-end trainable architectures;

we show that our framework can reconstruct images as parts, as well as detect/classify instances without any location supervision.
2 Related works
Attention models and the use of localized information have been actively investigated in the literature. Some examples include discriminative tasks such as fine-grained classification [32] and pedestrian detection [42], and generative ones such as image synthesis from natural language [18]. We now discuss a selection of representative works, and classify them according to how they deal with multiple instances.
Grid-based methods.
Since the introduction of Region Proposal Networks (RPN) [27], grid-based strategies have been used for dense image captioning [19], instance segmentation [13], keypoint detection [10], and multi-instance object detection [26]. Recent improvements to RPNs attempt to learn the concept of a generic object covering multiple classes [30], and to model multi-scale information [5]. The multiple transformations corresponding to separate instances can also be densely regressed via Instance Spatial Transformers [39], which removes the need to identify discrete instances early in the network. Unfortunately, all these methods are fully supervised, as they require both class labels and object locations for training.
Heatmap-based methods.
Heatmap-based methods have recently gained interest to detect features [40, 24, 6], find landmarks [44, 22], and regress human body keypoints [36, 23]. While it is possible to output one heatmap per type of point [44, 36], this still restricts the number of instances to one. Yi et al. [40] reformulate the problem based on each instance, but in doing so introduce a non-ideal difference between training and testing regimes. Grids can also be used in combination with heatmaps [6], but this results in an unrealistic underlying assumption of uniformly distributed detections in the image. Overall, heatmap-based methods excel when the “final” task of the network is to generate a heatmap [22], but are problematic to use as an intermediate layer in the presence of multiple instances.
Sequential inference methods.
Another way to approach multi-instance problems is to attend to one instance at a time in a sequential way. Gregor et al. [11] propose a recurrent network that processes only a small area at a time for both discriminative and generative tasks. These sequential models have then been extended to localize and recognize MNIST digits in a cluttered image [1, 7]. Overall, RNNs often struggle to generalize to sequences longer than the ones encountered during training, and while recent results on inductive reasoning are promising [12], their performance does not scale well when the number of instances is large.
Knowledge transfer.
To overcome the acquisition cost of labeled training data, one can transfer knowledge from labeled to unlabeled datasets. For example, Inoue et al. [16] train on a single-instance dataset, and then attempt to generalize to multi-instance domains, while Uijlings et al. [37] also attempt to transfer a multi-class proposal generator to the new domain. While knowledge transfer can be effective, it is highly desirable to devise unsupervised methods such as ours that do not depend on an additional dataset.
Weakly supervised methods.
To further reduce the labeling effort, weakly supervised methods have also been proposed. Wan et al. [38] learn how to detect multiple instances of a single object via region proposals and ROI pooling, while Tang et al. [34] propose to use a hierarchical setup to refine their estimates. Gao et al. [9] provide additional supervision by specifying the number of instances in each class, while Zhang et al. [43] localize objects by looking at the network activation maps [45, 28]. However, all these methods still rely on region proposals from an existing method, or define them via a hand-tuned process.
3 MIST Framework
A prototypical MIST architecture is composed of two trainable components: ① the first module receives an image $\mathcal{I}$ as input and extracts a collection of patches, at image locations and scales that are computed by a trainable heatmap network $\mathcal{H}_\eta$ with weights $\eta$; see Section 4. ② the second module processes each extracted patch with a task-specific network $\mathcal{T}_\tau$ whose weights $\tau$ are shared across patches, and further manipulates these signals to express a task-specific loss $\mathcal{L}_{\text{task}}$; see Section 5. The two modules are connected through non-maximum suppression on the scale-space heatmap output of $\mathcal{H}_\eta$, followed by a top-$K$ selection process to extract the parameters defining the patches, which we denote as $\mathcal{E}_K$. We then sample patches at these locations through bilinear sampling and feed them to the second module.
The defining characteristic of MIST architectures is that they are quasi-unsupervised: the only strictly required supervision is the number $K$ of patches to extract. The training of MIST architectures is summarized by the optimization:
$$\operatorname*{arg\,min}_{\tau,\,\eta} \; \mathcal{L}_{\text{task}}\big(\mathcal{T}_\tau(\mathcal{E}_K(\mathcal{H}_\eta(\mathcal{I})))\big) \tag{1}$$
where $\tau$ and $\eta$ are the trainable network parameters. Note how $\mathcal{E}_K$ in the expression above is non-differentiable, thus making (1) unapproachable by backpropagation for training.
Example task.
In Figure 2, we illustrate an example of a MIST architecture for image reconstruction. In more detail, the task is to understand how to re-synthesize the image as a superposition of spatially localized basis functions. Note that, while this task is quasi-unsupervised, it presents several joint challenges as it needs to estimate: ① an unknown shared low-dimensional set of latent bases; ② where to place instances from this latent space; ③ the latent coefficients representing each instance. Further details and additional example tasks are described in Section 5.
Training MISTs.
While our MISTs enable us to approach several vision tasks, such as the ones above, with minimal supervision, the true challenge lies in the definition of an effective training strategy. Backpropagation through a selection process is possible, akin to what is performed for max-pooling. In Section 6, we argue why this is highly detrimental, and propose an effective multi-stage training solution.
Evaluation and implementation.
4 Patch extraction
We extract a set of $K$ (square) patches that correspond to “important” locations in the image, where importance is a direct consequence of the task loss $\mathcal{L}_{\text{task}}$. The localization of such patches can be computed by regressing a 2D heatmap whose top $K$ peaks correspond to the patch centers. However, as we do not assume these patches to be equal in size, we regress to a collection of heatmaps at different scales. To limit the number of necessary scales, we use a discrete and sparse scale-space, while resolving for intermediate scales by weighted interpolation.
Multi-scale heatmap network $\mathcal{H}_\eta$.
Our multi-scale heatmap network is inspired by LF-Net [24]. We employ a fully convolutional network with (shared) weights $\eta$ at multiple scales, indexed by $s$, on the input image $\mathcal{I}$. The weights across scales are shared so that the network cannot implicitly favor a particular scale. To do so, we first downsample the image to each scale $s$, execute the network $\mathcal{H}_\eta$ on it, and finally upsample to the original resolution. This process generates a multi-scale heatmap tensor $\mathbf{h} = \{\mathbf{h}_s\}$ of size $H \times W \times S$, where $S$ is the number of scales, $H$ is the height of the image, and $W$ is the width. For the convolutional network we use ResNet blocks [14], where each block is composed of two convolutions with ReLU activations and no downsampling. We then apply a local spatial softmax operator [24] over a fixed spatial extent to sharpen the responses.
Extracting patch location and scale.
We first normalize the heatmap tensor $\mathbf{h}$ so that it has a unit sum along the scale dimension:
$$\bar{\mathbf{h}}_s = \frac{\mathbf{h}_s}{\sum_{s'} \mathbf{h}_{s'} + \epsilon} \tag{2}$$
where $\epsilon$ is added to prevent divisions by zero. We then compute the top $K$ spatial locations across all scales, to obtain the image-space coordinates of patch centers:
$$\{\mathbf{x}_k\}_{k=1}^{K} = \operatorname{TopK}\Big(\sum_{s} \mathbf{h}_s\Big) \tag{3}$$
Note that a direct extraction of the $K$ maxima is possible because the aforementioned local spatial softmax performs a localized non-maximum suppression. The corresponding scale is computed by weighted first-order moments [33], where the weights are the responses in the corresponding heatmaps:
$$s_k = \sum_{s} \bar{\mathbf{h}}_s(\mathbf{x}_k)\, s \tag{4}$$
Note we do not need to normalize here, as $\bar{\mathbf{h}}$ in (2) has unit sum along the scale dimension. These two operations are abstracted by our top-$K$ extractor $\mathcal{E}_K$ in (1).
Note also that our extraction process uses a single heatmap for all instances that we extract. By contrast, existing heatmap-based methods [7, 44] typically rely on heatmaps dedicated to each instance, which is problematic when an image contains two instances of the same class. Conversely, we restrict the role of the heatmap network to finding the “important” areas in a given image, without having to distinguish between classes, hence simplifying learning.
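The extraction steps above can be sketched in NumPy. This is a minimal illustration rather than the authors' implementation: the function name `extract_topk`, the choice to aggregate responses over scales by summation before ranking, and the toy heatmap are all assumptions, and the heatmap is assumed to have already been sharpened so that peaks are isolated.

```python
import numpy as np

def extract_topk(heatmap, scales, k, eps=1e-8):
    """heatmap: (H, W, S) non-negative responses; scales: (S,) scale values.

    Hypothetical helper: returns integer patch centers and real-valued scales.
    """
    # Unit sum along the scale dimension; eps guards against division by zero.
    normalized = heatmap / (heatmap.sum(axis=2, keepdims=True) + eps)
    # Assumption: rank spatial locations by their response summed over scales.
    spatial = heatmap.sum(axis=2)
    flat_idx = np.argsort(spatial.ravel())[::-1][:k]
    rows, cols = np.unravel_index(flat_idx, spatial.shape)
    # Scale as a weighted first-order moment of the normalized responses.
    patch_scales = (normalized[rows, cols, :] * scales).sum(axis=1)
    return np.stack([rows, cols], axis=1), patch_scales

# Toy heatmap with two isolated peaks.
heatmap = np.zeros((8, 8, 3))
heatmap[2, 3, 0] = 2.0   # peak entirely at the first scale
heatmap[5, 6, 1] = 0.5   # peak split evenly between the last two scales
heatmap[5, 6, 2] = 0.5
scales = np.array([1.0, 2.0, 4.0])

centers, patch_scales = extract_topk(heatmap, scales, k=2)
```

In this toy case the second peak is split evenly between scales 2 and 4, so its interpolated scale lands between the two sampled scales, which is what allows a sparse scale-space to represent intermediate sizes.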
Patch sampling.
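As a patch is uniquely parameterized by its location $\mathbf{x}_k$ and scale $s_k$, we can then proceed to sample its corresponding tensor $\mathbf{P}_k$ via bilinear interpolation [17]:
$$\mathbf{P}_k = \mathcal{S}(\mathcal{I}, \mathbf{x}_k, s_k) \tag{5}$$
where $\mathcal{S}$ denotes the bilinear sampling operation.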
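As a rough illustration of patch sampling by bilinear interpolation, the sketch below samples a square patch at a real-valued center and scale. The helper names `bilinear_sample` and `sample_patch`, the convention that the scale is the side length of the patch in pixels, and the border clamping are assumptions made for this example only.

```python
import numpy as np

def bilinear_sample(image, ys, xs):
    """Sample a 2D image at real-valued coordinates (ys, xs)."""
    h, w = image.shape
    # Clamp coordinates to the image so border samples stay valid.
    ys = np.clip(ys, 0, h - 1)
    xs = np.clip(xs, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = ys - y0
    wx = xs - x0
    # Interpolate along x on the two neighboring rows, then along y.
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom

def sample_patch(image, center, scale, size=5):
    """Extract a size x size patch centered at `center` spanning `scale` pixels."""
    offsets = np.linspace(-0.5, 0.5, size) * scale
    ys, xs = np.broadcast_arrays(center[0] + offsets[:, None],
                                 center[1] + offsets[None, :])
    return bilinear_sample(image, ys, xs)

# A linear ramp image, image[y, x] = 10 * y + x, is reproduced exactly
# by bilinear interpolation, which makes the result easy to check.
image = np.arange(100.0).reshape(10, 10)
patch = sample_patch(image, center=(4.5, 5.25), scale=2.0, size=3)
```

Because the sampled values depend continuously on the center and scale, gradients of a downstream loss with respect to these parameters are well defined, which is what the refinement of patch positions relies on.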
Comparison to LF-Net.
Note that differently from LF-Net [24], we do not perform a softmax along the scale dimension. The scale-wise softmax in LF-Net is problematic because a softmax only approximates a max when its inputs are unbounded. For example, in order for the softmax function to behave as a max function, due to exponentiation, it is necessary that one of the input values reaches infinity (i.e. the value that will correspond to the max), or that all other values reach negative infinity. However, at the network stage where softmax is applied in [24], the scores range from zero to one, effectively making the softmax behave similarly to averaging. Our formulation does not suffer from this drawback.
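The effect of bounded inputs on a softmax can be checked numerically. The snippet below is a small illustration with made-up scores, not values taken from LF-Net: it compares the softmax weights for scores confined to [0, 1] with the weights for the same scores stretched to a much larger range.

```python
import numpy as np

def softmax(x):
    # Subtract the maximum for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

bounded = np.array([0.1, 0.9, 0.5])   # illustrative scores confined to [0, 1]
unbounded = bounded * 50.0            # same ordering, much larger range

w_bounded = softmax(bounded)          # weights stay close to uniform
w_unbounded = softmax(unbounded)      # weights concentrate on the maximum
```

For $S$ scores in [0, 1], the largest softmax weight is at most $e / (e + S - 1)$, which is about 0.58 for three scores, so the result can never approach a hard selection.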
5 Task-specific networks
We now introduce two instances of the MIST architecture corresponding to different applications. We keep the heatmap network and the extractor architectures the same, and only change the task-specific network, as well as the loss used for supervision. In particular, we consider a reconstruction problem in Section 5.1, and a classification problem in Section 5.2. Further implementation details can be found in Section 8.
5.1 Image reconstruction
As illustrated in Fig. 2, for image reconstruction we append to our patch extraction network a shared autoencoder $\mathcal{T}_\tau$ that is applied to each extracted patch $\mathbf{P}_k$. We can then train this network to reconstruct the original image by inverting the patch extraction process, and forming the task-specific loss to be the squared $\ell_2$ norm between the input image and the reconstructed image. Specifically, we introduce the inverse sampling operation $\mathcal{S}^{-1}$, which starts with an image of all zeros, and places the patch at location $\mathbf{x}_k$ with scale $s_k$. We then add all the images together to obtain the reconstructed image, overall expressing the following task loss:
$$\mathcal{L}_{\text{task}} = \Big\| \mathcal{I} - \sum_{k=1}^{K} \mathcal{S}^{-1}\big(\mathcal{T}_\tau(\mathbf{P}_k), \mathbf{x}_k, s_k\big) \Big\|_2^2 \tag{6}$$
Overall, the network is designed to jointly model and localize repeating structures in the input signal. Regressing shared basis functions can be related to non-local means processes [4], as a model for the input signal is created by agglomerating the information in scattered spatial instances. Our task architecture is also related to transforming autoencoders [15], with the difference that in this previous work a single instance is present in the image, and the ground-truth transformations are provided as supervision.
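The reconstruction loss can be sketched as follows. To keep the example short, it makes several simplifying assumptions that are not part of the method described above: patches have a fixed size and are pasted at integer centers (no scale, no bilinear splatting), the autoencoder output is replaced by given patch arrays, and the loss is a sum of squared differences. The names `paste_patch` and `reconstruction_loss` are hypothetical.

```python
import numpy as np

def paste_patch(canvas_shape, patch, center):
    """Place `patch` on a zero canvas, centered at the integer `center`."""
    canvas = np.zeros(canvas_shape)
    ph, pw = patch.shape
    top, left = center[0] - ph // 2, center[1] - pw // 2
    # Clip the patch against the canvas borders.
    y0, x0 = max(top, 0), max(left, 0)
    y1 = min(top + ph, canvas_shape[0])
    x1 = min(left + pw, canvas_shape[1])
    if y1 <= y0 or x1 <= x0:
        return canvas  # patch lies entirely outside the canvas
    canvas[y0:y1, x0:x1] = patch[y0 - top:y1 - top, x0 - left:x1 - left]
    return canvas

def reconstruction_loss(image, patches, centers):
    """Sum the pasted patches and compare against the input image."""
    recon = np.zeros(image.shape)
    for patch, center in zip(patches, centers):
        recon += paste_patch(image.shape, patch, center)
    return float(np.sum((image - recon) ** 2)), recon

# Toy image made of two constant 3x3 blobs.
image = np.zeros((8, 8))
image[1:4, 1:4] = 1.0
image[4:7, 4:7] = 2.0
patches = [np.ones((3, 3)), 2.0 * np.ones((3, 3))]

loss_exact, _ = reconstruction_loss(image, patches, [(2, 2), (5, 5)])
loss_shifted, _ = reconstruction_loss(image, patches, [(2, 3), (5, 5)])
```

The loss is zero only when both the patch content and its placement are right; shifting one patch by a single pixel already raises it, which is the signal that drives localization in this task.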
5.2 Multiple instance classification
By appending a classification network to the patch extraction module, we can realize an architecture for multiple instance learning. For each extracted patch $\mathbf{P}_k$ we apply a shared classifier network $\mathcal{T}_\tau$ to output $\mathbf{z}_k \in \mathbb{R}^{C}$, where $C$ is the number of classes. In turn, these are then converted into probability estimates by the transformation $\hat{\mathbf{y}}_k = \operatorname{softmax}(\mathbf{z}_k)$. By denoting the one-hot ground-truth labels of instance $n$ in the image as $\mathbf{y}_n$, we define the multi-instance classification loss as
$$\mathcal{L}_{\text{task}} = \mathrm{CE}\Big(\frac{1}{N}\sum_{n=1}^{N} \mathbf{y}_n,\; \frac{1}{K}\sum_{k=1}^{K} \hat{\mathbf{y}}_k\Big) \tag{7}$$
where $\mathrm{CE}$ denotes cross entropy and $N$ is the number of instances in the image. Note here that we do not provide supervision about where the instances of each class are, yet the detector network will automatically learn how to localize the content with minimal supervision.
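One plausible reading of a loss that uses only the occurrence of classes is to compare the average of the ground-truth one-hot labels with the average of the per-patch probability estimates through cross entropy. The sketch below implements that reading; the averaging scheme, the function names, and the toy logits are assumptions for illustration, not a confirmed description of the authors' loss.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def multi_instance_loss(patch_logits, labels_onehot, eps=1e-12):
    """Cross entropy between averaged labels and averaged patch predictions.

    patch_logits: (K, C) classifier outputs, one row per extracted patch.
    labels_onehot: (N, C) one-hot labels of the instances in the image.
    """
    predicted = softmax(patch_logits).mean(axis=0)  # (C,) class histogram
    target = labels_onehot.mean(axis=0)             # (C,) class histogram
    return float(-np.sum(target * np.log(predicted + eps)))

labels = np.eye(3)[[0, 2]]             # image contains classes 0 and 2
logits_good = 20.0 * np.eye(3)[[2, 0]] # right classes, found in another order
logits_perm = 20.0 * np.eye(3)[[0, 2]] # same predictions, patches permuted
logits_bad = 20.0 * np.eye(3)[[1, 1]]  # both patches predict class 1

loss_good = multi_instance_loss(logits_good, labels)
loss_perm = multi_instance_loss(logits_perm, labels)
loss_bad = multi_instance_loss(logits_bad, labels)
```

Because both sides are averaged, the loss does not depend on the order in which patches are extracted, so no correspondence between patches and labels has to be supplied. Note that the minimum of this cross entropy is the entropy of the label histogram ($\log 2$ in the toy case), not zero.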
6 Training MISTs
While it is technically possible to backpropagate through MIST architectures, the gradients would only flow through the spatial regions corresponding to the selected keypoints, which results in a training process that ignores locations away from the selection. This is a fundamental issue: for the network to learn to respond only to the desired locations, we need negative examples just as much as we need positive examples. To circumvent this problem, we propose a multi-stage training optimization.
Differentiable top-K via lifting.
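The introduction of auxiliary variables (i.e. lifting) to simplify the structure of an optimization problem has proven effective in domains ranging from registration via ICP [35], to efficient deformation models [31], and robust optimization [41]. To simplify our training optimization, we start by decoupling the heatmap tensor from the optimization (1) by introducing the corresponding auxiliary variables $\mathbf{h}$, as well as the patch parameterization variables $\{\mathbf{x}_k, s_k\}$ that are extracted by the top-$K$ extractor:
$$\operatorname*{arg\,min}_{\eta,\,\tau,\,\mathbf{h},\,\{\mathbf{x}_k, s_k\}} \; \mathcal{L}_{\text{task}}\big(\mathcal{T}_\tau(\{\mathbf{x}_k, s_k\})\big) \tag{8}$$
$$\text{s.t.} \quad \{\mathbf{x}_k, s_k\} = \mathcal{E}_K(\mathbf{h}) \tag{9}$$
$$\mathbf{h} = \mathcal{H}_\eta(\mathcal{I}) \tag{10}$$
We then relax (10) to a least-squares penalty:
$$\operatorname*{arg\,min}_{\eta,\,\tau,\,\mathbf{h},\,\{\mathbf{x}_k, s_k\}} \; \mathcal{L}_{\text{task}}\big(\mathcal{T}_\tau(\{\mathbf{x}_k, s_k\})\big) + \big\| \mathbf{h} - \mathcal{H}_\eta(\mathcal{I}) \big\|_2^2 \tag{11}$$
$$\text{s.t.} \quad \{\mathbf{x}_k, s_k\} = \mathcal{E}_K(\mathbf{h}) \tag{12}$$
and finally approach it by alternating optimization:
$$\operatorname*{arg\,min}_{\tau,\,\{\mathbf{x}_k, s_k\}} \; \mathcal{L}_{\text{task}}\big(\mathcal{T}_\tau(\{\mathbf{x}_k, s_k\})\big) \tag{13}$$
$$\operatorname*{arg\,min}_{\eta} \; \big\| \mathcal{E}_K^{-1}(\{\mathbf{x}_k, s_k\}) - \mathcal{H}_\eta(\mathcal{I}) \big\|_2^2 \tag{14}$$
where $\mathbf{h}$ has been dropped as it is not a free parameter: it can be computed as $\mathbf{h} = \mathcal{E}_K^{-1}(\{\mathbf{x}_k, s_k\})$ after the $\{\mathbf{x}_k, s_k\}$ have been optimized by (13), and as $\mathbf{h} = \mathcal{H}_\eta(\mathcal{I})$ after $\eta$ has been optimized by (14). To accelerate training, we further split (13) into two stages, and alternate between optimizing $\tau$ and $\{\mathbf{x}_k, s_k\}$. In particular, multiple optimization iterations of $\{\mathbf{x}_k, s_k\}$ are executed to allow keypoints to displace faster during training. The three-stage optimization procedure is outlined in Alg. 1: ① we optimize the parameters $\tau$ with the loss $\mathcal{L}_{\text{task}}$; ② we then fix $\tau$, and refine the positions of the patches for several iterations with $\mathcal{L}_{\text{task}}$; ③ with the optimized patch positions $\{\mathbf{x}_k, s_k\}$, we invert the top-$K$ operation by creating a target heatmap $\mathcal{E}_K^{-1}(\{\mathbf{x}_k, s_k\})$, and optimize the parameters $\eta$ of our heatmap network using the $\ell_2$ distance between the two heatmaps, $\| \mathcal{E}_K^{-1}(\{\mathbf{x}_k, s_k\}) - \mathcal{H}_\eta(\mathcal{I}) \|_2^2$. Notice that we are not introducing any additional supervision signal that is tangential to the given task.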
Generating the target heatmap.
For creating the target heatmap, we create a tensor that has zeros everywhere except for the positions corresponding to the optimized positions. However, as the optimized patch parameters are no longer integer values, we need to quantize them with care. For the spatial locations we simply round to the nearest pixel, which creates a quantization error of at most half a pixel and does not cause problems in practice. For scale, however, simple nearest-neighbor assignment causes too much quantization error as our scale-space is sparsely sampled. We therefore assign values to the two nearest neighboring scales in such a way that the center of mass is the optimized scale value. That is, we create a heatmap tensor that would result in the optimized patch locations when used in forward inference.
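The quantization described above can be sketched as follows: spatial locations are rounded to the nearest pixel, and each scale is split between its two neighboring sampled scales so that the weighted mean equals the optimized value. The function name `target_heatmap`, the unit total mass per instance, and the clamping of out-of-range scales are assumptions for this illustration.

```python
import numpy as np

def target_heatmap(shape, centers, patch_scales, scales):
    """Build a sparse (H, W, S) target from real-valued patch parameters.

    centers: iterable of real-valued (row, col); scales: sorted array of
    the S sampled scale values.
    """
    target = np.zeros(shape)
    for (r, c), s in zip(centers, patch_scales):
        # Spatial locations: round to the nearest pixel.
        r = int(np.clip(np.rint(r), 0, shape[0] - 1))
        c = int(np.clip(np.rint(c), 0, shape[1] - 1))
        # Scale: split the unit mass between the two neighboring scales so
        # that their weighted mean (center of mass) equals s.
        s = float(np.clip(s, scales[0], scales[-1]))
        hi = int(np.clip(np.searchsorted(scales, s), 1, len(scales) - 1))
        lo = hi - 1
        w_hi = (s - scales[lo]) / (scales[hi] - scales[lo])
        target[r, c, lo] += 1.0 - w_hi
        target[r, c, hi] += w_hi
    return target

scales = np.array([1.0, 2.0, 4.0])
target = target_heatmap((8, 8, 3), centers=[(2.4, 3.6)],
                        patch_scales=[3.0], scales=scales)
# Recover the scale as a first-order moment at the rounded location.
recovered_scale = float((target[2, 4, :] * scales).sum())
```

Taking the first-order moment over scales at the rounded location returns the optimized scale, so running the forward extraction on this target reproduces the patch parameters it was built from, up to the half-pixel spatial rounding.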
7 Results and evaluation
To demonstrate the effectiveness of our framework we evaluate it on two different tasks. We first perform a quasi-unsupervised image reconstruction task, where only the total number of instances in the scene is provided. We then show that our method can also be applied to weakly supervised multi-instance classification, where only image-level supervision is provided. Note that, unlike region-proposal-based methods, our localization network only relies on cues from the classifier, and both networks are trained from scratch.
7.1 Image reconstruction
From the MNIST dataset, we derive two different scenarios. In the “MNIST easy” dataset, we consider a simple setup where the sorted digits are confined to a perturbed grid layout; see Figure 3 (top). Specifically, we perturb the digits with Gaussian noise centered at each grid center, with a standard deviation equal to one-eighth of the grid width/height. In the “MNIST hard” dataset, the positions are randomized through a Poisson distribution [3], as are the identity and cardinality of each digit. Note how we allow multiple instances of the same digit to appear in this variant. Both datasets contain training and testing subsets, and the testing portion is never seen at training time.
Comparison baselines
We compare our method against four baselines: ① grid: we set up a grid of keypoints, and apply the same autoencoder architecture as MIST to reconstruct the input image; ② channel-wise: we use the same heatmap network, except that the last convolutional layer outputs $K$ channels, where each channel is dedicated to an interest point. Their locations are obtained through a channel-wise soft-argmax as in [44]. We also use the same architecture for the autoencoder as MIST; ③ the method of Eslami et al. [7], which is a sequential generative model. To generate nine digits, this method must also be trained with examples containing varying numbers of digits (images with only 1 digit, 2 digits, etc.). We make a special exception for this method, and populate the training set with all of these cases; ④ the state-of-the-art method by Zhang et al. [44], a heatmap-based method with a channel-wise strategy for unsupervised learning of landmarks.
Results for “MNIST easy”
As shown in Figure 3 (top) all methods successfully resynthesize the image, with the exception of [7]. As this method is sequential, with nine digits the sequential implementation simply becomes too difficult to optimize through. Note how this method only learns to describe the scene with a few large regions. Quantitative results are summarized in Table 1.
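|            | MIST | Grid | Ch.-wise | [7]  | [44] |
|------------|------|------|----------|------|------|
| MNIST easy | .038 | .039 | .042     | .100 | .169 |
| MNIST hard | .089 | .047 | .128     | .154 | .191 |
| Gabor      | .095 | N/A  | N/A      | N/A  | N/A  |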
Results for “MNIST hard”
As shown in Figure 3 (bottom), all methods except ours fail to properly represent the image. Our method not only reconstructs the image, but also learns how to localize the digits. Note that while it might look like the grid method succeeded, its trained autoencoder simply failed to capture the concept of individual digits. Conversely, as shown in Figure 4, our method is able to learn this, as demonstrated by the autoencoder successfully separating the existing overlaps. For quantitative results, please see Table 1.
Finding the basis of a procedural texture
We further demonstrate that our method can be used to find the basis function of a procedural texture. For this experiment we synthesize textures with procedural Gabor noise [20]. Gabor noise is obtained by convolving oriented Gabor wavelets with a Poisson impulse process. Hence, given exemplars of noise, our framework is tasked to regress the underlying impulse process, and reconstruct the Gabor kernels so that when the two are combined, we can reconstruct the original image. Figure 5 illustrates the results of our experiment. Note how the autoencoder learnt to reconstruct Gabor kernels very well, even though they overlap heavily in the training images. Further, note that the number of instances detected is significantly larger than that possible with other methods.
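To make the texture model concrete, the sketch below synthesizes a simplified Gabor noise by splatting one Gabor kernel at randomly drawn impulse positions, which is equivalent to convolving the kernel with a sparse impulse image. It is an illustration under assumptions, not the procedure of [20]: a single fixed orientation and frequency, unit impulse weights, integer positions, and all parameter values are made up for the example.

```python
import numpy as np

def gabor_kernel(size, sigma, frequency, theta):
    """Oriented Gabor wavelet on an odd-sized square grid."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * frequency
                     * (x * np.cos(theta) + y * np.sin(theta)))
    return envelope * carrier

def splat_kernels(shape, positions, kernel):
    """Add a copy of `kernel` centered at every impulse position."""
    half = kernel.shape[0] // 2
    # Pad so kernels near the border can be added without clipping logic.
    padded = np.zeros((shape[0] + 2 * half, shape[1] + 2 * half))
    for r, c in positions:
        padded[r:r + kernel.shape[0], c:c + kernel.shape[1]] += kernel
    return padded[half:half + shape[0], half:half + shape[1]]

rng = np.random.default_rng(0)
kernel = gabor_kernel(size=9, sigma=2.0, frequency=0.2, theta=np.pi / 4)
# Impulse process: Poisson-distributed count, uniformly random positions.
num_impulses = rng.poisson(20)
positions = rng.integers(0, 64, size=(num_impulses, 2))
texture = splat_kernels((64, 64), positions, kernel)

# A single impulse reproduces the kernel itself around its position.
single = splat_kernels((21, 21), [(10, 10)], kernel)
```

In this setting the reconstruction task amounts to recovering `positions` and `kernel` from `texture` alone, which is hard precisely because overlapping kernels add up and are never observed in isolation.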
|                | MIST  | Channel-wise |
|----------------|-------|--------------|
| IoU 50%        | 84.6% | 25.4%        |
| Classification | 95.6% | 75.5%        |
| Both           | 83.5% | 24.8%        |
7.2 Multiple instance classification
To test our method in a multiple instance classification setup, we rely on the MNIST hard dataset. We compare our method to the channel-wise baseline, as the other baselines are designed for purely generative tasks. To evaluate the accuracy of the models, we compute the intersection over union (IoU) between the ground-truth bounding box and the detection results, and count a detection as a match if the IoU score is over 50%. We report the number of correctly classified matches in Table 2. Our method clearly outperforms the channel-wise strategy. A few qualitative results are illustrated in Figure 6. Note that even without direct supervision on the digit locations, our method correctly localizes them. Conversely, the channel-wise strategy fails to learn. This is because multiple instances of the same digit are present in the image. For example, in Figure 6 (right), the image contains repeated sixes, zeros, and nines. This prevents any of these digits from being detected/classified properly by a channel-wise approach.
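The matching criterion can be sketched in a few lines of plain Python. The box convention `(x_min, y_min, x_max, y_max)`, the greedy one-to-one matching, and the function names are assumptions; the text above only specifies the 50% IoU threshold.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(ix1 - ix0, 0.0) * max(iy1 - iy0, 0.0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def count_matches(detections, ground_truth, threshold=0.5):
    """Greedily match detections to ground truth (assumed matching rule).

    Both arguments are lists of (box, label). Returns the number of
    localized detections and how many of those are also classified correctly.
    """
    unmatched = list(ground_truth)
    localized = correct = 0
    for box, label in detections:
        best = max(unmatched, key=lambda g: iou(box, g[0]), default=None)
        if best is not None and iou(box, best[0]) > threshold:
            unmatched.remove(best)  # each ground-truth box matches once
            localized += 1
            correct += int(label == best[1])
    return localized, correct

ground_truth = [((0, 0, 10, 10), 3), ((20, 20, 30, 30), 7)]
detections = [((1, 1, 11, 11), 3),    # well localized, right class
              ((20, 22, 30, 32), 1),  # well localized, wrong class
              ((50, 50, 60, 60), 7)]  # no ground truth nearby
localized, correct = count_matches(detections, ground_truth)
```

Separating `localized` from `correct` mirrors the rows of Table 2, where localization, classification, and their combination are reported individually.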
8 Implementation details
Autoencoder network.
The input layer of the autoencoder is 32×32×C, where C is the number of color channels. We use 5 up/downsampling levels. Each level is made of 3 ResNet blocks, and the number of channels used by the blocks doubles after each downsampling step. ResNet blocks use 3×3 convolutions of stride 1 with ReLU activation. For downsampling we use 2D max pooling with a 2×2 stride and kernel. For upsampling we use 2D transposed convolutions with a 2×2 stride and kernel. The output layer uses a sigmoid function. We use layer normalization before each convolution layer.
Classification network.
We reuse the same architecture as the encoder for the first task and append a dense layer to map the latent space to the score vector of our 10 digit classes.
9 Conclusion
In this paper, we introduced the MIST framework for multi-instance image reconstruction/classification. Both these tasks are based on localized analysis of the image, yet we train the network without providing any localization supervision. The network learns how to extract patches on its own, and these patches are then fed to a task-specific network to realize an end goal. While at first glance the MIST framework might appear non-differentiable, we show how via lifting it can be effectively trained in an end-to-end fashion. We demonstrated the effectiveness of MIST by introducing a variant of the MNIST dataset, and demonstrating compelling performance in both reconstruction and classification. We also show how the network can be trained to reverse engineer a procedural texture synthesis process. MISTs are a first step towards the definition of optimizable image-decomposition networks that could be extended to a number of exciting unsupervised learning tasks. Amongst these, we intend to explore the applicability of MISTs to unsupervised detection/localization of objects, facial landmarks, and local feature learning.
Acknowledgements
This work was partially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant “Deep Visual Geometry Machines” (RGPIN-2018-03788), and by systems supplied by Compute Canada.
References
 [1] J. L. Ba, V. Mnih, and K. Kavukcuoglu. Multiple Object Recognition With Visual Attention. In ICLR, 2015.
 [2] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. SURF: Speeded Up Robust Features. CVIU, 10(3):346–359, 2008.
 [3] R. Bridson. Fast Poisson Disk Sampling in Arbitrary Dimensions. In SIGGRAPH sketches, 2007.
 [4] A. Buades, B. Coll, and J.-M. Morel. A Non-Local Algorithm for Image Denoising. In CVPR, 2005.
 [5] Y.-W. Chao, S. Vijayanarasimhan, B. Seybold, D. A. Ross, J. Deng, and R. Sukthankar. Rethinking the Faster R-CNN Architecture for Temporal Action Localization. In CVPR, 2018.
 [6] D. DeTone, T. Malisiewicz, and A. Rabinovich. SuperPoint: Self-Supervised Interest Point Detection and Description. CVPR Workshop on Deep Learning for Visual SLAM, 2018.
 [7] S. M. A. Eslami, N. Heess, T. Weber, Y. Tassa, D. Szepesvari, K. Kavukcuoglu, and G. E. Hinton. Attend, Infer, Repeat: Fast Scene Understanding with Generative Models. In NIPS, 2015.
 [8] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object Detection with Discriminatively Trained Part Based Models. PAMI, 32(9):1627–1645, 2010.
 [9] M. Gao, A. Li, V. I. Morariu, and L. S. Davis. C-WSL: Count-guided Weakly Supervised Localization. In ECCV, 2018.
 [10] G. Georgakis, S. Karanam, Z. Yu, J. Ernst, and J. Košecká. End-to-end Learning of Keypoint Detector and Descriptor for Pose Invariant 3D Matching. In CVPR, 2018.
 [11] K. Gregor, I. Danihelka, A. Graves, D. Rezende, and D. Wierstra. DRAW: A Recurrent Neural Network For Image Generation. In ICML, 2015.
 [12] A. Gupta, A. Vedaldi, and A. Zisserman. Inductive Visual Localization: Factorised Training for Superior Generalization. In BMVC, 2018.
 [13] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
 [14] K. He, X. Zhang, S. Ren, and J. Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In ICCV, 2015.
 [15] G. Hinton, A. Krizhevsky, and S. Wang. Transforming Auto-Encoders. In International Conference on Artificial Neural Networks, pages 44–51, 2011.
 [16] N. Inoue, R. Furuta, T. Yamasaki, and K. Aizawa. Cross-Domain Weakly-Supervised Object Detection through Progressive Domain Adaptation. In ECCV, 2018.
 [17] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial Transformer Networks. In NIPS, pages 2017–2025, 2015.
 [18] J. Johnson, A. Gupta, and L. Fei-Fei. Image Generation from Scene Graphs. In CVPR, 2018.
 [19] J. Johnson, A. Karpathy, and L. Fei-Fei. DenseCap: Fully Convolutional Localization Networks for Dense Captioning. In CVPR, 2016.
 [20] A. Lagae, S. Lefebvre, G. Drettakis, and Ph. Dutré. Procedural Noise using Sparse Gabor Convolution. ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH 2009), July 2009.
 [21] D. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 20(2), 2004.
 [22] D. Merget, M. Rock, and G. Rigoll. Robust Facial Landmark Detection via a Fully-Convolutional Local-Global Context Network. In CVPR, 2018.
 [23] A. Newell, K. Yang, and J. Deng. Stacked Hourglass Networks for Human Pose Estimation. In ECCV, 2016.
 [24] Y. Ono, E. Trulls, P. Fua, and K. M. Yi. LF-Net: Learning Local Features from Images. In NIPS, 2018.
 [25] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You Only Look Once: Unified, Real-Time Object Detection. In CVPR, 2016.
 [26] J. Redmon and A. Farhadi. YOLO9000: Better, Faster, Stronger. In CVPR, 2017.
 [27] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In NIPS, 2015.
 [28] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. In ICCV, 2017.
 [29] R. Stewart and M. Andriluka. End-to-End People Detection in Crowded Scenes. In CVPR, 2016.
 [30] B. Singh, H. Li, A. Sharma, and L. S. Davis. R-FCN-3000 at 30fps: Decoupling Detection and Classification. In CVPR, 2018.
 [31] O. Sorkine and M. Alexa. As-Rigid-As-Possible Surface Modeling. In Symposium on Geometry Processing, 2007.
 [32] M. Sun, Y. Yuan, F. Zhou, and E. Ding. Multi-Attention Multi-Class Constraint for Fine-grained Image Recognition. In ECCV, 2018.
 [33] S. Suwajanakorn, N. Snavely, J. Tompson, and M. Norouzi. Discovery of Latent 3D Keypoints via End-to-End Geometric Reasoning. In NIPS, 2018.
 [34] P. Tang, X. Wang, A. Wang, Y. Yan, W. Liu, J. Huang, and A. Yuille. Weakly Supervised Region Proposal Network and Object Detection. In ECCV, 2018.
 [35] J. Taylor, L. Bordeaux, T. Cashman, B. Corish, C. Keskin, E. Soto, D. Sweeney, J. Valentin, B. Luff, A. Topalian, E. Wood, S. Khamis, P. Kohli, T. Sharp, S. Izadi, R. Banks, A. Fitzgibbon, and J. Shotton. Efficient and Precise Interactive Hand Tracking through Joint, Continuous Optimization of Pose and Correspondences. TOG, 2016.
 [36] B. Tekin, P. Márquez-Neila, M. Salzmann, and P. Fua. Learning to Fuse 2D and 3D Image Cues for Monocular Body Pose Estimation. In ICCV, 2017.
 [37] J. R. R. Uijlings, S. Popov, and V. Ferrari. Revisiting Knowledge Transfer for Training Object Class Detectors. In CVPR, 2018.
 [38] F. Wan, P. Wei, J. Jiao, Z. Han, and Q. Ye. Min-Entropy Latent Model for Weakly Supervised Object Detection. In CVPR, 2018.
 [39] F. Wang, L. Zhao, X. Li, X. Wang, and D. Tao. Geometry-Aware Scene Text Detection with Instance Transformation Network. In CVPR, 2018.
 [40] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua. LIFT: Learned Invariant Feature Transform. In ECCV, 2016.
 [41] C. Zach and G. Bourmaud. Descending, Lifting or Smoothing: Secrets of Robust Cost Optimization. In ECCV, 2018.
 [42] S. Zhang, J. Yang, and B. Schiele. Occluded Pedestrian Detection Through Guided Attention in CNNs. In CVPR, 2018.
 [43] X. Zhang, Y. Wei, G. Kang, Y. Wang, and T. Huang. Self-produced Guidance for Weakly-supervised Object Localization. In ECCV, 2018.
 [44] Y. Zhang, Y. Guo, Y. Jin, Y. Luo, Z. He, and H. Lee. Unsupervised Discovery of Object Landmarks as Structural Representations. In CVPR, 2018.
 [45] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning Deep Features for Discriminative Localization. In CVPR, 2016.