
MIST: Multiple Instance Spatial Transformer Network

We propose a deep network that can be trained to tackle image reconstruction and classification problems that involve detection of multiple object instances, without any supervision regarding their whereabouts. The network learns to extract the most significant top-K patches, and feeds these patches to a task-specific network -- e.g., auto-encoder or classifier -- to solve a domain-specific problem. The challenge in training such a network is the non-differentiable top-K selection process. To address this issue, we lift the training optimization problem by treating the result of top-K selection as a slack variable, resulting in a simple, yet effective, multi-stage training. Our method is able to learn to detect recurrent structures in the training dataset by learning to reconstruct images. It can also learn to localize structures when only knowledge of the occurrence of the object is provided, and in doing so it outperforms the state-of-the-art.




1 Introduction

The ability to find multiple instances of characteristic entities in a scene is core to many computer vision applications. For example, finding people [29, 42], detecting an arbitrary number of classes and objects [27, 13, 26], and detecting local features [21, 2] all rely on this ability. In traditional vision pipelines, selecting the top-K responses in a heat-map and using their locations is the typical way to approach the problem [21, 2, 8]. However, due to the non-differentiable nature of this operation, it has not found immediate application in deep learning based solutions.

Circumventing top-K in end-to-end learning.

To overcome this challenge, researchers have proposed to use grids [25, 13, 6], to simplify the formulation by isolating each instance [40], or to provide alternative supervision by optimizing over multiple branches [24]. While effective, these solutions do not generalize well outside the application domain for which they were designed. Other formulations, such as sequential detection [7] or channel-wise approaches [44], are problematic to apply when the number of instances of the same object is large.

Figure 1: Examples of training data with associated supervision for two tasks, and the inference produced by MIST.
Figure 2: An example of MIST architecture – A network estimates locations and scales of patches, encoded in a heatmap. Patches are then extracted via a sampler, and fed to a task-specific network. In this example, the specific task is to re-synthesize the image as a super-position of (unknown, local) basis functions.

Introducing MIST architectures.

Therefore, we introduce a new deep framework which we name Multiple Instance Spatial Transformer, or MIST for brevity. At a high level, the MIST framework first decomposes the image into a finite collection of patches, and then processes these patches to perform a given task. As illustrated in Figure 2 for the image synthesis task, given an image we first compute, via a deep network, a heatmap whose local maxima correspond to locations of interest. From this heatmap, we gather the parameters of the top-K local maxima, and then extract the corresponding collection of image patches via a re-sampling process. We execute the same task-specific network on each patch, aggregate the outputs, and evaluate a task-specific loss, which we optimize to train the entire framework.
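The forward pass described above can be sketched with a minimal numpy example; `topk_peaks` and `extract_patches` are hypothetical helpers standing in for the heatmap/selection stages, not the paper's implementation:

```python
import numpy as np

def topk_peaks(heatmap, k):
    """Return (row, col) coordinates of the k largest heatmap responses.
    Assumes non-maximum suppression has already sharpened the map."""
    flat = np.argsort(heatmap.ravel())[::-1][:k]
    return np.stack(np.unravel_index(flat, heatmap.shape), axis=1)

def extract_patches(image, centers, size):
    """Crop a (size x size) patch around each center, zero-padded at borders."""
    r = size // 2
    padded = np.pad(image, r, mode="constant")
    return np.stack([padded[y:y + size, x:x + size] for (y, x) in centers])

# Toy image with two bright blobs playing the role of heatmap peaks.
img = np.zeros((16, 16))
img[4, 4] = 1.0
img[10, 12] = 0.8
centers = topk_peaks(img, k=2)
patches = extract_patches(img, centers, size=5)
```

Each patch would then be fed to the shared task-specific network.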

Training MISTs by lifting.

Training a pipeline that includes a non-differentiable selection/gather operation is non-trivial. To solve this problem, we lift the training problem to a higher-dimensional one by treating the parameters defining the interest points as slack variables, and introduce a hard constraint that they must correspond to the output of the heatmap network. This constraint is realized by introducing an auxiliary function that creates a heatmap from a given set of interest point parameters. We then solve a relaxed version of this problem, where the hard constraint is turned into a soft one, and the slack variables are optimized within the training process. Critically, our training strategy allows us to have an optimizable version of (1) non-maximum suppression, and (2) top-K selection, thus creating a network architecture resembling compute strategies that were dominant in pre-deep-learning computer vision.


To demonstrate the capabilities of MISTs, we evaluate our network on a variety of weakly-supervised multi-instance problems. Note how in some of these applications, the number of instances K is the only supervision signal we provide. We consider (1) the problem of recovering the basis functions that created a given texture, and (2) the classification of numbers in cluttered scenes where the only supervision is the occurrence of these numbers.

In summary, in this paper:


  • we introduce the MIST framework for weakly-supervised multi-instance visual learning;

  • we propose a training method that allows the use of top-K approaches for end-to-end trainable architectures;

  • we show that our framework can reconstruct images as parts, as well as detect/classify instances without any location supervision.

2 Related works

Attention models and the use of localized information have been actively investigated in the literature. Some examples include discriminative tasks, such as fine-grained classification [32] and pedestrian detection [42], and generative ones, such as image synthesis from natural language [18]. We now discuss a selection of representative works, classified according to how they deal with multiple instances.

Grid-based methods.

Since the introduction of Region Proposal Networks (RPN) [27], grid-based strategies have been used for dense image captioning [19], instance segmentation [13], keypoint detection [10], and multi-instance object detection [26]. Recent improvements to RPNs attempt to learn the concept of a generic object covering multiple classes [30], and to model multi-scale information [5]. The multiple transformations corresponding to separate instances can also be densely regressed via Instance Spatial Transformers [39], which removes the need to identify discrete instances early in the network. Unfortunately, all these methods are fully supervised, as they require both class labels and object locations for training.

Heatmap-based methods.

Heatmap-based methods have recently gained interest to detect features [40, 24, 6], find landmarks [44, 22], and regress human body keypoints [36, 23]. While it is possible to output one heatmap per type of point [44, 36], this still restricts the number of instances of each type to one. Yi et al. [40] re-formulate the problem based on each instance, but in doing so introduce a non-ideal difference between training and testing regimes. Grids can also be used in combination with heatmaps [6], but this results in an unrealistic underlying assumption of uniformly distributed detections in the image. Overall, heatmap-based methods excel when the “final” task of the network is to generate a heatmap [22], but are problematic to use as an intermediate layer in the presence of multiple instances.

Sequential inference methods.

Another way to approach multi-instance problems is to attend to one instance at a time in a sequential way. Gregor et al. [11] propose a recurrent network that processes only a small area at a time for both discriminative and generative tasks. These sequential models have since been extended to localize and recognize MNIST digits in cluttered images [1, 7]. Overall, RNNs often struggle to generalize to sequences longer than those encountered during training, and while recent results on inductive reasoning are promising [12], their performance does not scale well when the number of instances is large.

Knowledge transfer.

To overcome the acquisition cost of labelled training data, one can transfer knowledge from a labeled to an unlabeled dataset. For example, Inoue et al. [16] train on a single-instance dataset and then attempt to generalize to multi-instance domains, while Uijlings et al. [37] attempt to also transfer a multi-class proposal generator to the new domain. While knowledge transfer can be effective, it is highly desirable to devise unsupervised methods such as ours that do not depend on an additional dataset.

Weakly supervised methods.

To further reduce the labeling effort, weakly supervised methods have also been proposed. Wan et al. [38] learn how to detect multiple instances of a single object via region proposals and ROI pooling, while Tang et al. [34] propose a hierarchical setup to refine their estimates. Gao et al. [9] provide additional supervision by specifying the number of instances in each class, while Zhang et al. [43] localize objects by looking at the network activation maps [45, 28]. However, all these methods still rely on region proposals from an existing method, or define them via a hand-tuned process.

3 MIST Framework

A prototypical MIST architecture is composed of two trainable components: (1) the first module receives an image as input and extracts a collection of patches, at image locations and scales that are computed by a trainable heatmap network; see Section 4. (2) The second module processes each extracted patch with a task-specific network whose weights are shared across patches, and further manipulates these signals to express a task-specific loss; see Section 5. The two modules are connected through non-maximum suppression on the scale-space heatmap, followed by a top-K selection process to extract the parameters defining the patches. We then sample patches at these locations through bilinear sampling and feed them to the second module.

The defining characteristic of MIST architectures is that they are quasi-unsupervised: the only strictly required supervision is the number of patches to extract. The training of MIST architectures is summarized by the optimization:


min_{w_h, w_t}  L( T_{w_t}( S( I, E_K( H_{w_h}(I) ) ) ) )        (1)

where w_h and w_t are the trainable parameters of the heatmap network H and the task-specific network T, E_K denotes the top-K extractor, and S the patch sampler. Note how E_K in the expression above is non-differentiable, thus making (1) unapproachable by back-propagation for training.

Example task.

In Figure 2, we illustrate an example of a MIST architecture for image reconstruction. In more detail, the task is to understand how to re-synthesize the image as a super-position of spatially localized basis functions. Note that, while this task is quasi-unsupervised, it presents several joint challenges, as the network needs to estimate: (1) an unknown shared low-dimensional set of latent bases; (2) where to place instances from this latent space; (3) the latent coefficients representing each instance. Further details and additional example tasks are described in Section 5.

Training MISTs.

While our MISTs enable us to approach several vision tasks, such as the ones above, with minimal supervision, the true challenge lies in the definition of an effective training strategy. Back-propagation through a selection process is possible, akin to what is performed for max-pooling. In Section 6, we argue why this is highly detrimental, and propose an effective multi-stage training solution.

Evaluation and implementation.

We qualitatively and quantitatively evaluate MISTs on a number of tasks in Section 7, and provide further implementation details in Section 8.

4 Patch extraction

We extract a set of (square) patches that correspond to “important” locations in the image, where importance is a direct consequence of the task loss. The localization of such patches can be computed by regressing a 2D heatmap whose top-K peaks correspond to the patch centers. However, as we do not assume these patches to be equal in size, we regress a collection of heatmaps at different scales. To limit the number of necessary scales, we use a discrete and sparse scale-space, while resolving for intermediate scales by weighted interpolation.

Multiscale heatmap network.

Our multiscale heatmap network is inspired by LF-Net [24]. We employ a fully convolutional network, with weights shared across scales, on the input image; the sharing ensures that the network cannot implicitly favor a particular scale. Specifically, we first downsample the image to each scale, execute the network on it, and finally upsample the result to the original resolution. This process generates a multiscale heatmap tensor whose dimensions span the number of scales, the image height, and the image width. For the convolutional network we use ResNet blocks [14], where each block is composed of two convolutions with relu activations and no downsampling. We then apply a local spatial softmax operator [24] to sharpen the responses.
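The sharpening step can be illustrated with a toy local spatial softmax; the window size and temperature below are illustrative choices, not the paper's settings:

```python
import numpy as np

def local_spatial_softmax(h, window=5, temperature=0.1):
    """Sharpen a heatmap by normalizing exp(h/T) over each local window,
    approximating a localized non-maximum suppression (a sketch; the
    paper's exact operator follows LF-Net [24])."""
    e = np.exp(h / temperature)
    r = window // 2
    padded = np.pad(e, r, mode="constant")
    out = np.empty_like(h)
    H, W = h.shape
    for y in range(H):
        for x in range(W):
            # each response is renormalized by its local neighborhood
            out[y, x] = e[y, x] / padded[y:y + window, x:x + window].sum()
    return out
```

A strong response dominates its window, while weaker nearby responses are suppressed.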

Extracting patch location and scale.

We first normalize the heatmap tensor h so that it has unit sum along the scale dimension:

ĥ(s, x, y) = h(s, x, y) / ( Σ_s' h(s', x, y) + ε )        (2)

where ε is a small constant added to prevent divisions by zero. We then compute the top-K spatial locations across all scales, to obtain the image-space coordinates of the patch centers. Note that a direct extraction of maxima is possible because the aforementioned local spatial softmax performs a localized non-maximum suppression. The corresponding scale is computed by weighted first-order moments [33], where the weights are the responses in the corresponding heatmaps; that is, the scale of each patch is the expectation of the scale values under ĥ at that location. Note we do not need to normalize here, as (2) already has unit sum along the scale dimension. These two operations are abstracted by our top-K extractor in (1).

Note also that our extraction process uses a single heatmap for all instances that we extract. By contrast, existing heatmap-based methods [7, 44] typically rely on heatmaps dedicated to each instance, which is problematic when an image contains two instances of the same class. Conversely, we restrict the role of the heatmap network to finding the “important” areas in a given image, without having to distinguish between classes, hence simplifying learning.
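A minimal numpy sketch of the extraction stage described above — scale normalization, top-K spatial selection, and scale recovery via first-order moments; `extract_patch_params` is a hypothetical helper, and real heatmaps would come from the trained network:

```python
import numpy as np

def extract_patch_params(h, scales, k, eps=1e-6):
    """h is an (S, H, W) multiscale heatmap; scales lists the S sampled
    scale values. Returns k (y, x, scale) triplets; scale is the
    first-order moment of the scale-normalized responses."""
    # unit sum along the scale dimension, eps guards against division by zero
    h_norm = h / (h.sum(axis=0, keepdims=True) + eps)
    spatial = h.max(axis=0)                       # peak response per pixel
    flat = np.argsort(spatial.ravel())[::-1][:k]  # top-K spatial locations
    ys, xs = np.unravel_index(flat, spatial.shape)
    # expected scale under the normalized per-location distribution
    sc = np.array([float((h_norm[:, y, x] * scales).sum())
                   for y, x in zip(ys, xs)])
    return ys, xs, sc
```

Because the normalized responses sum to one, the moment directly interpolates between the discrete scale samples.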

Patch sampling.

As a patch is uniquely parameterized by its location and scale, we can then proceed to sample its corresponding tensor via bilinear interpolation [17].
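Patch sampling by bilinear interpolation, as in spatial transformers [17], can be sketched as follows; `sample_patch` is an illustrative helper with a hypothetical parameterization (center, scale, fixed output size):

```python
import numpy as np

def bilinear_sample(img, ys, xs):
    """Sample img at real-valued coordinates via bilinear interpolation."""
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = y0 + 1, x0 + 1
    wy, wx = ys - y0, xs - x0
    H, W = img.shape
    y0c, y1c = np.clip(y0, 0, H - 1), np.clip(y1, 0, H - 1)
    x0c, x1c = np.clip(x0, 0, W - 1), np.clip(x1, 0, W - 1)
    return ((1 - wy) * (1 - wx) * img[y0c, x0c]
            + (1 - wy) * wx * img[y0c, x1c]
            + wy * (1 - wx) * img[y1c, x0c]
            + wy * wx * img[y1c, x1c])

def sample_patch(img, center_y, center_x, scale, out_size=8):
    """Resample an out_size x out_size patch, spaced by `scale` pixels,
    centered at the (continuous) location (center_y, center_x)."""
    offs = (np.arange(out_size) - (out_size - 1) / 2.0) * scale
    ys = center_y + offs[:, None] + np.zeros((1, out_size))
    xs = center_x + np.zeros((out_size, 1)) + offs[None, :]
    return bilinear_sample(img, ys, xs)
```

Because every operation is differentiable in the continuous coordinates, gradients can flow back to the patch parameters.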
Comparison to LF-Net.

Note that, unlike LF-Net [24], we do not perform a softmax along the scale dimension. The scale-wise softmax in LF-Net is problematic because a softmax only approximates a maximum when its inputs are unbounded: due to the exponentiation, one input value must grow very large relative to the others (the value that will correspond to the max), or all other values must become very negative. However, at the network stage where the softmax is applied in [24], the scores range from zero to one, effectively making the softmax behave similarly to averaging. Our formulation does not suffer from this drawback.
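A quick numeric illustration of this argument, comparing scores bounded in [0, 1] against unbounded scores:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Scores bounded in [0, 1]: weights stay close to uniform (1/3 each),
# so the weighted combination behaves like an average rather than a max.
print(softmax(np.array([0.2, 0.5, 0.9])))

# Unbounded scores: the largest entry dominates, approximating a true max.
print(softmax(np.array([1.0, 5.0, 10.0])))
```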

5 Task-specific networks

We now introduce two instances of the MIST architecture, corresponding to different applications. We keep the heatmap network and the extractor architecture the same, and only change the task-specific network, as well as the loss used for supervision. In particular, we consider a reconstruction problem in Section 5.1, and a classification problem in Section 5.2. Further implementation details can be found in Section 8.

5.1 Image reconstruction

As illustrated in Fig. 2, for image reconstruction we append our patch extraction network with an auto-encoder that is shared across the extracted patches. We can then train this network to reconstruct the original image by inverting the patch extraction process, and forming the task-specific loss as the distance between the input image and the reconstructed image. Specifically, we introduce the inverse sampling operation, which starts with an image of all zeros and places each decoded patch at its extracted location and scale. We then add all the images together to obtain the reconstructed image.
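The inverse sampling and reconstruction loss can be sketched as follows (integer patch placement for simplicity; the actual pipeline places patches at the continuous extracted locations and scales):

```python
import numpy as np

def place_patch(canvas_shape, patch, y, x):
    """Inverse sampling: paste a patch into a zero image at (y, x)."""
    out = np.zeros(canvas_shape)
    p = patch.shape[0]
    out[y:y + p, x:x + p] = patch
    return out

def reconstruction_loss(image, patches, locations):
    """RMSE between the input and the sum of the placed patches."""
    recon = sum(place_patch(image.shape, p, y, x)
                for p, (y, x) in zip(patches, locations))
    return float(np.sqrt(((image - recon) ** 2).mean())), recon
```

When the decoded patches and their locations match the image content, the summed canvas reproduces the input and the loss vanishes.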
Overall, the network is designed to jointly model and localize repeating structures in the input signal. Regressing shared basis functions can be related to non-local mean processes [4], as a model for the input signal is created by agglomerating the information in scattered spatial instances. Our task architecture is also related to transforming auto-encoders [15], with the difference that in this previous work a single instance is present in the image, and the ground-truth transformations are provided as supervision.

5.2 Multiple instance classification

By appending the patch extraction module with a classification network we can realize an architecture for multiple instance learning. To each extracted patch we apply a shared classifier network that outputs a vector of class scores, which is then converted into probability estimates through a softmax transformation. Given the one-hot ground-truth labels of the instances in the image, we define the multi-instance classification loss as the cross entropy between predictions and labels, averaged over the number of instances in the image. Note that we do not provide supervision about where each class instance is, yet the detector network automatically learns how to localize the content with this minimal supervision.
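A minimal sketch of this loss, assuming the K extracted patches have already been matched to the K image-level labels (the matching itself emerges implicitly during training):

```python
import numpy as np

def softmax(z):
    """Row-wise numerically stable softmax."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_instance_ce(logits, onehot):
    """Cross entropy averaged over the K extracted patches.
    `logits`: (K, C) scores from the shared classifier;
    `onehot`: (K, C) one-hot ground-truth labels."""
    p = softmax(logits)
    return float(-(onehot * np.log(p + 1e-12)).sum(axis=1).mean())
```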

6 Training MISTs

While it is technically possible to back-propagate through MIST architectures, the gradients would only flow through the spatial regions corresponding to the selected keypoints – this results in a training process that ignores locations away from the selection. This is a fundamental issue, as, in order for the network to learn to only respond to the desired locations, we need negative examples just as much as we need positive examples. To circumvent this problem, we propose a multi-stage training optimization.

1: Input: number of patches to extract, task-specific loss, input image, parameters of the heatmap network, parameters of the task network.
2: function TrainMIST
3:     for each training batch do
4:         optimize the task-network parameters under the task loss
5:         for a fixed number of iterations do
6:             refine the slack patch parameters under the task loss
7:         end for
8:         optimize the heatmap-network parameters to match the target heatmap
9:     end for
10: end function
Algorithm 1 Multi-stage optimization for MISTs

Differentiable top-K via lifting.

The introduction of auxiliary variables (i.e. lifting) to simplify the structure of an optimization problem has proven effective in a range of domains, from registration via ICP [35], to efficient deformation models [31], and robust optimization [41]. To simplify our training optimization, we start by decoupling the heatmap tensor from the optimization (1): we treat the heatmap as an auxiliary variable, together with the patch parameters extracted from it by the top-K extractor, subject to the hard constraint that they agree with the output of the heatmap network. We then relax this hard constraint to a least-squares penalty, and approach the relaxed problem by alternating optimization. The auxiliary heatmap can be dropped from the alternation, as it is not a free parameter: it is determined by the auxiliary heatmap generator once the patch parameters have been optimized, and by the heatmap network once its weights have been optimized. To accelerate training, we further split the first alternation step into two stages, alternating between optimizing the task network and the patch parameters; in particular, multiple optimization iterations on the patch parameters are executed to allow keypoints to displace faster during training. The resulting three-stage procedure is outlined in Alg. 1: (1) we optimize the task-network parameters under the task loss; (2) we then fix them, and refine the positions of the patches for a number of iterations under the same loss; (3) with the optimized patch positions, we invert the top-K operation by creating a target heatmap, and optimize the parameters of our heatmap network by penalizing the distance between the two heatmaps. Notice that we are not introducing any additional supervision signal that is tangent to the given task.
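The alternation can be illustrated on a 1-D toy problem, where a scalar `t` plays the role of the slack patch position and `w` the heatmap network's prediction; the quadratic losses and learning rates are illustrative only:

```python
# Toy 1-D illustration of the lifted training (cf. Alg. 1), assuming a
# quadratic task loss whose optimum sits at position 3.0.
def task_loss(t):
    return (t - 3.0) ** 2

def train(w=0.0, t=0.0, outer=50, inner=5, lr=0.1):
    for _ in range(outer):
        # stage 1 (task-network update) is omitted in this toy
        for _ in range(inner):           # stage 2: refine the slack position
            t -= lr * 2.0 * (t - 3.0)    # gradient of task_loss
        w -= lr * 2.0 * (w - t)          # stage 3: fit predictor via (w - t)^2
    return w, t
```

The slack variable chases the task optimum, and the predictor is then regressed toward the slack variable, exactly mirroring the soft-constraint penalty.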

Generating the target heatmap.

To create the target heatmap, we create a tensor that is zero everywhere except at the optimized patch positions. However, as the optimized patch parameters are no longer integer-valued, we need to quantize them with care. For the spatial locations we simply round to the nearest pixel, which creates a quantization error of at most half a pixel and does not cause problems in practice. For scale, however, simple nearest-neighbor assignment causes too much quantization error, as our scale-space is sparsely sampled. We therefore distribute the mass between the two nearest neighboring scales so that their center of mass equals the optimized scale value. That is, we create a heatmap tensor that would result in the optimized patch locations when used in forward inference.
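The center-of-mass assignment for scale can be sketched as follows; `split_scale_mass` is a hypothetical helper assuming the continuous scale falls strictly inside the sampled scale grid:

```python
import numpy as np

def split_scale_mass(scale, scale_grid):
    """Distribute unit mass over the two nearest sampled scales so that
    the center of mass equals the (continuous) optimized scale."""
    idx = np.searchsorted(scale_grid, scale)   # first grid point >= scale
    lo, hi = scale_grid[idx - 1], scale_grid[idx]
    w_hi = (scale - lo) / (hi - lo)            # linear interpolation weight
    return idx - 1, idx, 1.0 - w_hi, w_hi
```

For example, an optimized scale of 3.0 on the grid [1, 2, 4] splits its mass evenly between scales 2 and 4.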

7 Results and evaluation

To demonstrate the effectiveness of our framework, we evaluate it on two different tasks. We first perform a quasi-unsupervised image reconstruction task, where only the total number of instances in the scene is provided. We then show that our method can also be applied to weakly supervised multi-instance classification, where only image-level supervision is provided. Note that, unlike region proposal based methods, our localization network only relies on cues from the classifier, and both networks are trained from scratch.


Figure 3: MNIST character synthesis examples for (top) the “easy” single instance setup and (bottom) the hard multi-instance setup. We compare the output of MISTs to grid, channel-wise, Eslami et al. [7] and Zhang et al. [44].

7.1 Image reconstruction

From the MNIST dataset, we derive two different scenarios. In the “MNIST easy” dataset, we consider a simple setup where the sorted digits are confined to a perturbed grid layout; see Figure 3 (top). Specifically, we perturb the digits with Gaussian noise centered at each grid center, with a standard deviation equal to one-eighth of the grid width/height. In the “MNIST hard” dataset, the positions are randomized through a Poisson distribution [3], as are the identity and cardinality of each digit; note how we allow multiple instances of the same digit to appear in this variant. Both datasets contain training and testing subsets, and the testing portion is never seen at training time.

Comparison baselines.

We compare our method against four baselines: (1) grid: we set up a grid of keypoints, and apply the same auto-encoder architecture as MIST to reconstruct the input image; (2) channel-wise: we use the same heatmap network, except that the last convolutional layer outputs one channel per interest point, whose locations are obtained through a channel-wise soft argmax as in [44]; we also use the same auto-encoder architecture as MIST; (3) the method of Eslami et al. [7], a sequential generative model; to generate nine digits, this method must also be trained on examples with varying numbers of total digits (images with only 1 digit, 2 digits, etc.), so we make a special exception and populate its training set with all of these cases; (4) finally, the state-of-the-art method of Zhang et al. [44], a heatmap-based method with a channel-wise strategy for unsupervised learning of landmarks.


Figure 4: Two auto-encoding examples learnt from MNIST-hard. In the top row, for each example we visualize the input, patch detections, and synthesis. In the bottom row we visualize each extracted patch, and how it is modified by the learnt auto-encoder. Notice the full process is self-supervised, with the exception that we know the number of digits each image contains.

Results for “MNIST easy”

As shown in Figure 3 (top), all methods successfully re-synthesize the image, with the exception of [7]: as this method is sequential, with nine digits the optimization simply becomes too difficult, and it only learns to describe the scene with a few large regions. Quantitative results are summarized in Table 1.

            MIST   Grid   Ch.-wise  [7]    [44]
MNIST easy  .038   .039   .042      .100   .169
MNIST hard  .089   .047   .128      .154   .191
Gabor       .095   N/A    N/A       N/A    N/A
Table 1: Reconstruction error in terms of Root Mean Square Error (RMSE) across the various baselines. Our method performs best on MNIST easy, and second best on MNIST hard. Note, however, that the grid method is not able to learn any notion of individual digits.

Results for “MNIST hard”

As shown in Figure 3 (bottom), all methods except ours fail to properly represent the image: our method not only reconstructs the image, but also learns how to localize the digits. While it might look like the grid method succeeded, its auto-encoder simply fails to capture the concept of individual digits. Conversely, as shown in Figure 4, our method does learn this concept, as demonstrated by the auto-encoder successfully separating the existing overlaps. For quantitative results, please see Table 1.

Figure 5: Unsupervised inverse rendering of procedural Gabor noise. In the last column we highlight localization mistakes.

Finding the basis of a procedural texture

We further demonstrate that our method can be used to find the basis functions of a procedural texture. For this experiment we synthesize textures with procedural Gabor noise [20], which is obtained by convolving oriented Gabor wavelets with a Poisson impulse process. Given exemplars of such noise, our framework is tasked to regress the underlying impulse process and reconstruct the Gabor kernels, so that when the two are combined we can reconstruct the original image. Figure 5 illustrates the results. Note how the auto-encoder learns to reconstruct the Gabor kernels very well, even though they overlap heavily in the training images. Further, note that the number of instances detected is significantly larger than what is possible with other methods.

Figure 6: Two qualitative examples for detection and classification on our Multi-MNIST dataset.
                MIST    channel-wise
IoU ≥ 50%       84.6%   25.4%
Classification  95.6%   75.5%
Both            83.5%   24.8%
Table 2: Instance level detection and classification performance on the MNIST hard dataset.

7.2 Multiple instance classification

To test our method in a multiple instance classification setup, we rely on the MNIST hard dataset. We compare our method to channel-wise only, as the other baselines are designed for purely generative tasks. To evaluate accuracy, we compute the intersection over union (IoU) between the ground-truth bounding boxes and the detection results, and count a detection as a match if the IoU score is over 50%. We report the number of correctly classified matches in Table 2. Our method clearly outperforms the channel-wise strategy. A few qualitative results are illustrated in Figure 6. Note that even without direct supervision on the digit locations, our method correctly localizes them. Conversely, the channel-wise strategy fails to learn, because multiple instances of the same digit are present in the image; for example, in Figure 6 (right) the image contains repeated digits, which prevents any of them from being detected/classified properly by a channel-wise approach.

8 Implementation details

Auto-encoder network.

The input layer of the auto-encoder is 32x32xC, where C is the number of color channels. We use 5 up/down-sampling levels. Each level is made of 3 resnet blocks, and each resnet block uses a number of channels that doubles after each downsampling step. Resnet blocks use 3x3 convolutions of stride 1 with relu activation. For downsampling we use 2D max pooling with 2x2 stride and kernel; for upsampling we use 2D transposed convolutions with 2x2 stride and kernel. The output layer uses a sigmoid function, and we apply layer normalization before each convolution layer.

Classification network.

We re-use the same encoder architecture for this task, and append a dense layer that maps the latent space to the score vector over our 10 digit classes.

9 Conclusion

In this paper, we introduced the MIST framework for multi-instance image reconstruction/classification. Both these tasks are based on localized analysis of the image, yet we train the network without providing any localization supervision. The network learns how to extract patches on its own, and these patches are then fed to a task-specific network to realize an end goal. While at first glance the MIST framework might appear non-differentiable, we show how, via lifting, it can be effectively trained in an end-to-end fashion. We demonstrated the effectiveness of MIST by introducing variants of the MNIST dataset, and demonstrating compelling performance in both reconstruction and classification. We also showed how the network can be trained to reverse-engineer a procedural texture synthesis process. MISTs are a first step towards the definition of optimizable image-decomposition networks that could be extended to a number of exciting unsupervised learning tasks. Amongst these, we intend to explore the applicability of MISTs to unsupervised detection/localization of objects, facial landmarks, and local feature learning.


This work was partially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant “Deep Visual Geometry Machines” (RGPIN-2018-03788), and by systems supplied by Compute Canada.


  • [1] J. L. Ba, V. Mnih, and K. Kavukcuoglu. Multiple Object Recognition With Visual Attention. In ICLR, 2015.
  • [2] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. SURF: Speeded Up Robust Features. CVIU, 110(3):346–359, 2008.
  • [3] R. Bridson. Fast Poisson Disk Sampling in Arbitrary Dimensions. In SIGGRAPH sketches, 2007.
  • [4] A. Buades, B. Coll, and J.-M. Morel. A Non-Local Algorithm for Image Denoising. In CVPR, 2005.
  • [5] Y.-W. Chao, S. Vijayanarasimhan, B. Seybold, D. A. Ross, J. Deng, and R. Sukthankar. Rethinking the Faster R-CNN Architecture for Temporal Action Localization. In CVPR, 2018.
  • [6] D. Detone, T. Malisiewicz, and A. Rabinovich. Superpoint: Self-Supervised Interest Point Detection and Description. CVPR Workshop on Deep Learning for Visual SLAM, 2018.
  • [7] S. M. A. Eslami, N. Heess, T. Weber, Y. Tassa, D. Szepesvari, K. Kavukcuoglu, and G. E. Hinton. Attend, Infer, Repeat: Fast Scene Understanding with Generative Models. In NIPS, 2015.
  • [8] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object Detection with Discriminatively Trained Part Based Models. PAMI, 32(9):1627–1645, 2010.
  • [9] M. Gao, A. Li, V. I. Morariu, and L. S. Davis. C-WSL: Coung-guided Weakly Supervised Localization. In ECCV, 2018.
  • [10] G. Georgakis, S. Karanam, Z. Yu, J. Ernst, and J. Košecká. End-to-end Learning of Keypoint Detector and Descriptor for Pose Invariant 3D Matching. In CVPR, 2018.
  • [11] K. Gregor, I. Danihelka, A. Graves, D. Rezende, and D. Wierstra.

    DRAW: A Recurrent Neural Network For Image Generation.

    In ICML, 2015.
  • [12] A. Gupta, A. Vedaldi, and A. Zisserman. Inductive Visual Localization: Factorised Training for Superior Generalization. In BMVC, 2018.
  • [13] K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask R-CNN. In ICCV, 2017.
  • [14] K. He, X. Zhang, R. Ren, and J. Sun.

    Delving Deep into Rectifiers: Surpassing Human-Level Performance on Imagenet Classification.

    In ICCV, 2015.
  • [15] G. Hinton, A. Krizhevsky, and S. Wang. Transforming Auto-Encoders. In International Conference on Artificial Neural Networks, pages 44–51, 2011.
  • [16] N. Inoue, R. Furuta, T. Yamasaki, and K. Aizawa. Cross-Domain Weakly-Supervised Object Detection through Progressive Domain Adaptation. In ECCV, 2018.
  • [17] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial Transformer Networks. In NIPS, pages 2017–2025, 2015.
  • [18] J. Johnson, A. Gupta, and L. Fei-fei. Image Generation from Scene Graphs. In CVPR, 2018.
  • [19] J. Johnson, A. Karpathy, and L. Fei-fei.

    Densecap: Fully Convolutional Localization Networks for Dense Captioning.

    In CVPR, 2016.
  • [20] A. Lagae, S. Lefebvre, G. Drettakis, and Ph. Dutré. Procedural noise using sparse gabor convolution. ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH 2009), July 2009.
  • [21] D. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 20(2), 2004.
  • [22] D. Merget, M. Rock, and G. Rigoll. Robust Facial Landmark Detection via a Fully-Conlolutional Local-Global Context Network. In CVPR, 2018.
  • [23] A. Newell, K. Yang, and J. Deng. Stacked Hourglass Networks for Human Pose Estimation. In ECCV, 2016.
  • [24] Y. Ono, E. Trulls, P. Fua, and K. M. Yi. Lf-Net: Learning Local Features from Images. In NIPS, 2018.
  • [25] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You Only Look Once: Unified, Real-Time Object Detection. In CVPR, 2016.
  • [26] J. Redmon and A. Farhadi. YOLO 9000: Better, Faster, Stronger. In CVPR, 2017.
  • [27] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In NIPS, 2015.
  • [28] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. In ICCV, 2017.
  • [29] R. Sewart and M. Andriluka. End-to-End People Detection in Crowded Scenes. In CVPR, 2016.
  • [30] B. Singh, H. Li, A. Sharma, and L. S. Davis. R-FCN-3000 at 30fps: Decoupling Detection and Classification. In CVPR, 2018.
  • [31] O. Sorkine and M. Alexa. As-rigid-as-possible surface modeling. In Symposium on Geometry processing, 2007.
  • [32] M. Sun, Y. Yuan, F. Zhou, and E. Ding. Multi-Attention Multi-Class Constraint for Fine-grained Image Recognition. In ECCV, 2018.
  • [33] S. Suwajanakorn, N. Snavely, J. Tompson, and M. Norouzi. Discovery of Latent 3D Keypoints via End-To-End Geometric Reasoning. In NIPS, 2018.
  • [34] P. Tang, X. Wang, A. Wang, Y. Yan, W. Liu, J. Huang, and A. Yuille. Weakly Supervised Region Proposal Network and Object Detection. In ECCV, 2018.
  • [35] J. Taylor, L. Bordeaux, T. Cashman, B. Corish, C. Keskin, E. Soto, D. Sweeney, J. Valentin, B. Luff, A. Topalian, E. Wood, S. Khamis, P. Kohli, T. Sharp, S. Izadi, R. Banks, A. Fitzgibbon, and J. Shotton. Efficient and precise interactive hand tracking through joint, continuous optimization of pose and correspondences. TOG, 2016.
  • [36] B. Tekin, P. Marquez-neila, M. Salzmann, and P. Fua. Learning to Fuse 2D and 3D Image Cues for Monocular Body Pose Estimation. In ICCV, 2017.
  • [37] J. R. R. Uijlings, S. Popov, and V. Ferrari. Revisiting Knowledge Transfer for Training Object Class Detectors. In CVPR, 2018.
  • [38] F. Wan, P. Wei, J. Jiao, Z. Han, and Q. Ye. Min-Entropy Latent Model for Weakly Supervised Object Detection. In CVPR, 2018.
  • [39] F. Wang, L. Zhao, X. Li, X. Wang, and D. Tao. Geometry-Aware Scene Text Detection with Instance Transformation Network. In CVPR, 2018.
  • [40] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua. LIFT: Learned Invariant Feature Transform. In ECCV, 2016.
  • [41] C. Zach and G. Bournaoud. Descending, Lifting or Smoothing: Secrets of Robust Cost Opimization. In ECCV, 2018.
  • [42] S. Zhang, J. Yang, and B. Schiele. Occluded Pedestrian Detection Through Guided Attention in CNNs. In CVPR, 2018.
  • [43] X. Zhang, Y. Wei, G. Kang, Y. Wang, and T. Huang. Self-produced Guidance for Weakly-supervised Object Localization. In ECCV, 2018.
  • [44] Y. Zhang, Y. Gui, Y. Jin, Y. Luo, Z. He, and H. Lee. Unsupervised Discovery of Object Landmarks as Structural Representations. In CVPR, 2018.
  • [45] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba.

    Learning Deep Features for Discriminative Localization.

    In CVPR, 2016.