AutoCorrect: Deep Inductive Alignment of Noisy Geometric Annotations

08/14/2019 · Honglie Chen et al.

We propose AutoCorrect, a method to automatically learn object-annotation alignments from a dataset with annotations affected by geometric noise. The method is based on a consistency loss that enables deep neural networks to be trained, given only noisy annotations as input, to correct the annotations. When some noise-free annotations are available, we show that the consistency loss reduces to a stricter self-supervised loss. We also show that the method can implicitly leverage object symmetries to reduce the ambiguity arising in correcting noisy annotations. When multiple object-annotation pairs are present in an image, we introduce a spatial memory map that allows the network to correct annotations sequentially, one at a time, while accounting for all other annotations in the image and corrections performed so far. Through ablation, we show the benefit of these contributions, demonstrating excellent results on geo-spatial imagery. Specifically, we show results using a new Railway tracks dataset as well as the public INRIA buildings benchmark, achieving new state-of-the-art results for the latter.


1 Introduction

Digital images are nowadays collected in enormous quantities. An important example is geo-spatial data, collected continuously by satellites, and containing a wealth of information useful for urban planning, crop and forest management, disaster relief, climate modelling, and many other applications. However, the scale of such datasets requires automated processing via machine learning and, while machine learning methods are increasingly powerful, providing annotations manually to train them can be prohibitively expensive.

The annotation costs may be substantially reduced if labels need not be very accurate. In this case, it is sometimes possible to recycle annotations that were not collected specifically for the images at hand. With geo-spatial data, for instance, there are publicly available maps (e.g. OpenStreetMap [OpenStreetMap contributors(2017)], Google Maps [Google(2017)]) that can provide annotations for large areas of the planet for free. However, while maps are generally accurate, they usually fail to match satellite images exactly due to various issues. To list a few: 1) maps do not capture the 3D structure of features such as buildings or vegetation, leading to misaligned annotations due to viewpoint variations; 2) maps may not be temporally synchronized with the satellite data, thus failing to account for variations in buildings, roads and vegetation; 3) features recorded in a map (e.g. subways) may not necessarily be visible in images and vice-versa. Figure 1 shows examples of noisy geometric labels obtained from these data sources in the INRIA buildings and our new Railway tracks datasets, and compares them with the manually-corrected versions.

Figure 1: Example aerial images with noisy labels (red) and accurate labels (green). (a) and (b) are extracted from the INRIA buildings dataset; (c) and (d) are examples from the Railway tracks dataset. The original labels (red) show clear registration noise; the cleaned labels (green) show the human-corrected annotations we aim to achieve.

Noisy labels can severely impact the quality of learned object detectors, as shown in satellite/aerial segmentation [Mnih and Hinton(2012), Saito et al.(2016)Saito, Yamashita, and Aoki, Alshehhi et al.(2017)Alshehhi, Marpu, Woon, and Mura] and detection [Laptev et al.(2000)Laptev, Mayer, Lindeberg, Eckstein, Steger, and Baumgartner, Hu et al.(2007)Hu, Razdan, Femiani, Cui, and Wonka]. Hence, in this paper, we consider the problem of improving noisy labels to reduce or eliminate the impact of such noise on learned models. Our method, AutoCorrect, is mostly concerned with registration noise, which is usually the predominant noise type in geo-spatial data (Figure 1). We build a model that takes a set of images and misaligned object annotations as input and shifts the annotations to their correct image locations.

There are several challenges. Satellite images usually contain multiple occurrences of the same object types, which may lead to association errors. Geo-spatial images capture the top of tall objects such as buildings and trees, whereas maps annotate their base. Finally, tall objects (e.g. trees in Figure 1 or buildings) can occlude other objects or cast significant shadows, so that some objects annotated in the map may effectively be invisible.

Given an image and a set of object annotations, AutoCorrect sequentially registers each annotation to its corresponding object occurrence by estimating an instance-level transformation. This is much more flexible than existing works that seek a single image-level transformation, and allows us to obtain substantial improvements over them (indeed, as will be seen in the results, the annotations are displaced independently per object, so a single image-level correction does not suffice). However, this comes with several challenges. First, the model may not have access to any noise-free annotation, or at least may not know which annotations are noise-free, making the correction process ambiguous. Second, there are usually several objects in each image, which means that the model must generalise to an arbitrary number of object occurrences whilst avoiding errors due to duplicate associations.

We solve the first problem by combining a geometric consistency loss, which is valid even if the ground-truth annotations are unknown, with a self-supervised loss, which is reliable for annotations with a small amount of noise. We also show that the symmetry of certain objects such as roads provides an implicit constraint that makes registering annotations much less ambiguous. We solve the second problem by introducing a spatial memory map which represents all image annotations and reflects all previously-applied corrections.

2 Related work

Image alignment.

Two very related works [Girard et al.(2018)Girard, Charpiat, and Tarabalka, Zampieri et al.(2018)Zampieri, Charpiat, Girard, and Tarabalka] have shown good alignment performance by training a CNN to predict a displacement field between a map and an image. [Zampieri et al.(2018)Zampieri, Charpiat, Girard, and Tarabalka] uses a multi-scale CNN, and [Girard et al.(2018)Girard, Charpiat, and Tarabalka] improves performance by training jointly for both alignment and segmentation. We compare to their results (and improve over them) in Section 4.

Inductive models and spatial memory.

Explicit decomposition into repeated sub-tasks and recursively solving the problem have been applied in neural programming [Zaremba et al.(2016)Zaremba, Mikolov, Joulin, and Fergus, Cai et al.(2017)Cai, Shin, and D., Reed and Freitas(2016)] and many visual tasks [Li et al.(2018)Li, Chen, and Koltun, Romera-Paredes and Torr(2015), Kowalski et al.(2017)Kowalski, Naruniec, and Trzcinski, Carreira et al.(2016)Carreira, Agrawal, Fragkiadaki, and Malik, Oberweger et al.(2015)Oberweger, Wohlhart, and Lepetit, Gupta et al.(2018)Gupta, Vedaldi, and Zisserman]. In [Kowalski et al.(2017)Kowalski, Naruniec, and Trzcinski], each stage predicts a landmark transformation that updates the keypoints iteratively. Similarly, an updater function is formulated in [Oberweger et al.(2015)Oberweger, Wohlhart, and Lepetit] for hand pose alignment. [Gupta et al.(2018)Gupta, Vedaldi, and Zisserman] proposes an inductive RNN to localise visual objects which can generalise to an arbitrary number of inputs. Many of these methods use a form of spatial memory, though this isn’t always made explicit. Others have used spatial memory for interactive image segmentation [Li et al.(2018)Li, Chen, and Koltun], and context reasoning in object detection [Chen and Gupta(2017)].

Cycle consistency.

Assessing performance via cycling between two or more samples is a commonly used technique in computer vision. It has proven effective in tasks such as optical flow (forward-backward consistency) [Sundaram et al.(2010)Sundaram, Brox, and Keutzer], co-segmentation [Wang et al.(2014)Wang, Huang, Ovsjanikov, and Guibas], image matching [Zhou et al.(2015)Zhou, Zhu, and Daniilidis, Zhou et al.(2016)Zhou, Philipp, Aubry, Huang, and Efros], image translation [Zhu et al.(2017)Zhu, Park, Isola, and Efros] and domain adaptation [Hoffman et al.(2018)Hoffman, Tzeng, Park, Zhu, Isola, Saenko, Efros, and Darrell]. We introduce here a geometric-consistency loss: within an image, differently misaligned versions of the same annotation should be mapped back to a single, unique position.

Learning with imperfect annotation.

Most works on learning with imperfect annotations have considered classification rather than registration. Examples include exploiting a small set of clean samples alongside many noisy ones [Xiao et al.(2015)Xiao, Xia, Yang, Huang, and Wang, Veit et al.(2017)Veit, Alldrin, Chechik, Krasin, Gupta, and Belongie], using robust loss functions [Ghosh et al.(2017)Ghosh, Kumar, and Sastry, Patrini et al.(2017)Patrini, Rozza, Menon, Nock, and Qu], or using a top-k loss [Berrada et al.(2018)Berrada, Zisserman, and Kumar].

3 Approach

Our goal is to train deep networks for the detection of visual objects while relying on noisy annotations. While the approach is fairly general, we apply it to the detection of objects such as buildings and roads in geo-spatial images, where noisy annotations can be extracted from on-line data repositories such as mapping services. The mismatch between annotations and images is sometimes large, as shown in Figure 1. Naïvely training a model with these annotations leads to inaccurate predictions.

Figure 2: AutoCorrect architecture. The green dotted line shows the ground-truth label for the example image of a railway track. In Stage 1, given an image-label pair $(x, a)$, the noisy annotation $a$ is further perturbed by applying random transformations $g_1$ and $g_2$. In Stage 2, the network computes corrections $\hat g_1 = \Phi(x, g_1 a)$ and $\hat g_2 = \Phi(x, g_2 a)$, producing corrected labels which must satisfy the consistency equation $\hat g_1 (g_1 a) = \hat g_2 (g_2 a)$ (see text). If $a$ is known to be noise-free, then we can set $g_2$ to the identity, reducing to the stricter constraint $\hat g_1 (g_1 a) = a$.

There are two main challenges. First, all annotations are potentially noisy and thus it is not clear how the noise can be identified and removed. Second, as different objects in the image may be misaligned in different ways, we must enable instance-level corrections while handling an arbitrary number of object instances per image. We address these challenges in three ways. First, we use a self-supervised consistency loss based on the fact that multiple perturbations of the same label must always map to the same noiseless label. Second, we show that the intrinsic symmetry of certain visual objects provides a powerful implicit constraint that can reduce the ambiguity in the annotation clean-up process. Third, we introduce the idea of inductive alignment, adjusting annotations one instance at a time, sequentially, keeping track of the algorithm state by means of a spatial memory map. This is implemented by a recurrent neural network (RNN), which applies the same alignment logic to each annotation, but accounting for annotations already processed.

3.1 Single instance alignment

We start by describing a neural network architecture that can predict a translation and rotation for an individual object annotation in order to better align it to the image content. Note that, while this task may sound similar to object detection, it is in fact much easier as the annotation cues us to the existence and rough location of an object.

At each step, the input to the model is the concatenation of the RGB image $x$ with a scalar label map $a$, which encodes the annotation as a binary image. Since the annotations can potentially be noisy, we wish to learn a predictor function $\Phi$ that outputs the transformation (i.e. 2 scalars for translation and 1 for rotation) needed to align the image and the annotation. This is implemented using a CNN that takes $x$ and $a$ as input and outputs a transformation $g$:

$$g = \Phi(x, a), \qquad g \in G. \tag{1}$$

The corrected annotation is then $\hat a = g\,a$, the annotation $a$ transformed by the predicted transformation $g \in G$, where $G$ is a group of transformations such as 2D similarities and the notation $g\,a$ denotes warping an image $a$ by a transformation $g$. If the annotation is noise-free, $g$ is expected to be the identity transformation, so that $\hat a = a$. If the annotation is noisy, the corrected annotation $\hat a$ should approximate the underlying noise-free annotation $\bar a$, which, however, is never observed during training.
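For illustration, the following is a minimal PyTorch sketch of a predictor of this form together with a differentiable warp. The backbone, channel sizes and the use of affine_grid/grid_sample are illustrative assumptions, not the VGG-M architecture used in our experiments (Section 3.3), and the predicted translation is expressed in normalised image coordinates rather than pixels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CorrectionNet(nn.Module):
    """Toy predictor Phi: takes an RGB image concatenated with a binary
    annotation map (4 channels in total) and regresses (dx, dy, theta)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(128, 3)  # 2 translation scalars + 1 rotation

    def forward(self, image, annotation):
        z = self.features(torch.cat([image, annotation], dim=1))
        return self.head(z.flatten(1))  # (B, 3): dx, dy, theta


def warp(annotation, params):
    """Build an affine sampling grid from g = (dx, dy, theta), expressed in
    normalised coordinates, and resample a (B, 1, H, W) annotation map."""
    dx, dy, theta = params[:, 0], params[:, 1], params[:, 2]
    cos, sin = torch.cos(theta), torch.sin(theta)
    mat = torch.stack([
        torch.stack([cos, -sin, dx], dim=1),
        torch.stack([sin,  cos, dy], dim=1),
    ], dim=1)                                   # (B, 2, 3) affine matrices
    grid = F.affine_grid(mat, annotation.shape, align_corners=False)
    return F.grid_sample(annotation, grid, align_corners=False)
```

In this sketch the corrected annotation of the text corresponds to `warp(annotation, net(image, annotation))`.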

Model (1) has several useful geometric properties:

Lemma 1.

If $a$ is the ground-truth annotation for image $x$ and a perfect predictor $\Phi$ is available, then $\Phi(x, a)$ is the identity transformation. Furthermore, for all invertible transformations $g \in G$, we have $\Phi(x, g\,a) = g^{-1}$ and $\Phi(x, g\,a)\,(g\,a) = a$.

The lemma is easy to prove once we note that, if $a$ is the ground-truth label of image $x$, then $g\,a$ is the ground-truth label of image $g\,x$. From this lemma, we can also see that any annotation that can be recovered from an image must have the same symmetries as the image itself.

Lemma 2.

Let $\hat a$ be the annotation reconstructed from image $x$ using model (1), and assume that $g$ is a symmetry of the image, i.e. $g\,x = x$. Then the reconstructed annotation has the same symmetry, in the sense that $g\,\hat a = \hat a$.

Proof.

This lemma shows that annotations can be predicted from images only if they have the same symmetries as the images. For example, if the model labels a straight road with a line, then the line must coincide with the road's axis of symmetry. Hence image symmetries implicitly constrain the predictor (1) (in the example of the road, the correction must move the line onto the visual axis of symmetry of the road), reducing the ambiguity in registering the annotation. Note that this effect does not require specific images to be exactly symmetric; rather, it suffices that the object category is statistically symmetric (for example, a road remains statistically symmetric about its axis even if a few trees on one side make a particular image asymmetric).

If we assume that all annotations are correct, i.e. $a = \bar a$, then Lemma 1 can be used to train model (1) via self-supervised learning. The idea is to perturb the noise-free annotation synthetically by applying a random transformation $g \in G$ to the annotation $a$. From Lemma 1, and using the assumption $a = \bar a$, we have $\Phi(x, g\,a)\,(g\,a) = a$. We may capture this constraint in the self-supervised loss:

$$\mathcal{L}_{\text{self}}(\Phi \mid x, a) = \mathbb{E}_{g}\!\left[\, \big\| \Phi(x, g\,a)\,(g\,a) - a \big\| \,\right]. \tag{2}$$

However, in our case $\bar a$ is unknown, so this loss can only be used as an approximation. In this case, the constraint can be written in terms of relative transformations. To this end, consider applying two random transformations $g_1$ and $g_2$ to the annotation $a$. From Lemma 1, we have $\Phi(x, g_1 a)\,(g_1 a) = \Phi(x, g_2 a)\,(g_2 a)$. This can be written as a consistency loss:

$$\mathcal{L}_{\text{cons}}(\Phi \mid x, a) = \mathbb{E}_{g_1, g_2}\!\left[\, \big\| \Phi(x, g_1 a)\,(g_1 a) - \Phi(x, g_2 a)\,(g_2 a) \big\| \,\right]. \tag{3}$$

Intuitively, when two random transformations are applied to the same annotation, an ideal alignment model should transform both copies back to the same position, since $\bar a$ is unique.
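A sketch of how losses (2) and (3) could be computed, reusing the `CorrectionNet`-style predictor and `warp` helper from the sketch above; the L1 distance between annotation maps is an illustrative choice of discrepancy measure, not necessarily the one used in our implementation.

```python
import torch.nn.functional as F  # `warp` and the model are as in the earlier sketch

def self_supervised_loss(model, image, clean_ann, g):
    """Eq. (2) sketch: perturb a (presumed) noise-free annotation by g,
    predict a correction, and compare the corrected map with the original."""
    perturbed = warp(clean_ann, g)
    corrected = warp(perturbed, model(image, perturbed))
    return F.l1_loss(corrected, clean_ann)


def consistency_loss(model, image, noisy_ann, g1, g2):
    """Eq. (3) sketch: two independent perturbations of the same annotation
    must be mapped back to the same (unknown) position."""
    p1, p2 = warp(noisy_ann, g1), warp(noisy_ann, g2)
    c1 = warp(p1, model(image, p1))
    c2 = warp(p2, model(image, p2))
    return F.l1_loss(c1, c2)
```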

Overall, to train models on noisy data, we therefore consider a weighted combination $\mathcal{L} = \alpha\,\mathcal{L}_{\text{self}} + \beta\,\mathcal{L}_{\text{cons}}$ (details in Section 3.3).

3.2 Inductive alignment

Figure 3: Correcting annotations sequentially using a memory map. The correction process is shown in the AutoCorrect box, whereas the top Memory maps visualisation box highlights the corrected annotations (white) on the satellite image. In detail, at each step $t$ the input to the network is the concatenation of the RGB image $x$, the image of the annotation $a_t$ to be corrected, and a memory map $M_t$ representing all other annotations, part of which have already been corrected (annotations not yet corrected are colour-coded). The network is applied to obtain the correction $g_t$, which is then used to update the memory map for the next step.

A naïve implementation of model (1) may align single object instances well, but it would fail when an image contains multiple object occurrences, especially when, as in satellite images, these are spatially close and similar in appearance. In particular, independent alignment may cause different noisy annotations to be incorrectly associated with the same object occurrence. To tackle this challenge, we introduce an inductive alignment model which uses an external spatial memory map to make the algorithm aware of all annotations present in the image and to keep track of all corrections processed so far. Formally, given a training image $x$ with object annotations $a_1, \dots, a_N$, our goal can be seen as estimating the joint posterior density of the transformations $g_1, \dots, g_N$ for all the noisy object annotations. Rather than modelling multiple object annotations simultaneously, we break this down into a sequence of simpler steps, in which a single transformation is predicted at a time, conditioned on the previous decisions, resembling an autoregressive model. Formally, this autoregressive model can be written as:

$$p(g_1, \dots, g_N \mid x, a_1, \dots, a_N) = \prod_{t=1}^{N} p(g_t \mid g_1, \dots, g_{t-1},\, x,\, a_1, \dots, a_N).$$

Note that this process requires learning a sequence of models $\Phi_1, \dots, \Phi_N$. Directly parameterising the relations among transformations is difficult and results in a model which is rather opaque; instead, we propose to summarise the effect of conditioning on the previous corrections via a spatial memory map $M_t$. Ideally, the memory map should represent all annotations and corrections performed so far, except the annotation $a_t$ that is currently being processed; formally:

$$M_t = \sum_{s < t} g_s\,a_s \;+\; \sum_{s > t} a_s. \tag{4}$$

An explicit example is illustrated in Figure 3, showing four railway track annotations to be corrected. The algorithm starts with four binary masks, each encoding one of the noisy railway annotations $a_1, \dots, a_4$. At the very first step, the memory $M_1$ is composed of the three annotations $a_2, a_3, a_4$ and is concatenated as an additional input to the network. The first annotation is then corrected by predicting the rigid transformation $g_1$, and the memory is updated by adding the image of the corrected annotation $g_1 a_1$ and removing the image of annotation $a_2$ (the next one to be corrected), readying it for the next cycle. The induction process ends at $t = 4$, when all tracks have been corrected by the model. Note that instances are aligned from left to right and from bottom to top.
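A sketch of this inductive loop, assuming a predictor variant that accepts the current annotation and the memory map as two extra input channels (five input channels rather than the four in the earlier sketch); the mask-summation form of the memory follows Eq. (4), and the processing order is assumed to be fixed beforehand.

```python
import torch  # `warp` is the helper from the earlier sketch

def correct_sequentially(model, image, annotations):
    """Sketch of the inductive alignment of Section 3.2.
    `annotations` is a list of (1, 1, H, W) binary masks, already sorted
    in the processing order (e.g. left-to-right, bottom-to-top).
    Returns the list of corrected masks."""
    corrected = []
    for t, ann in enumerate(annotations):
        # Memory map: already-corrected masks plus not-yet-processed ones,
        # excluding the annotation currently being corrected (Eq. 4).
        others = corrected + annotations[t + 1:]
        memory = torch.clamp(sum(others), 0, 1) if others else torch.zeros_like(ann)
        net_in = torch.cat([ann, memory], dim=1)  # annotation + memory channels
        g = model(image, net_in)                  # predict the rigid correction
        corrected.append(warp(ann, g))            # update the state for step t+1
    return corrected
```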

3.3 Implementation details

Consider a training image $x$ with noisy object annotations $a_1, \dots, a_N$. Annotations are perturbed by applying random transformations $g_i \in G$, where each $g_i$ is the composition of a translation of up to a fixed number of pixels in each direction and a rotation of up to 5 degrees (clockwise or anticlockwise); this was found to be commensurate with the maximum amount of noise in the geo-spatial datasets we used for assessment.
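A sketch of how such perturbations could be sampled for the warp helper above; the translation bound is left as a parameter since it should match the noise level of the dataset at hand, and translations are converted to the normalised coordinates used by grid_sample.

```python
import math
import torch

def sample_rigid(batch, max_shift_px, image_size, max_rot_deg=5.0):
    """Sample random training perturbations: a translation of up to
    max_shift_px pixels in each direction and a rotation of up to
    max_rot_deg degrees, returned as (dx, dy, theta) rows."""
    dx = (torch.rand(batch) * 2 - 1) * (2.0 * max_shift_px / image_size)
    dy = (torch.rand(batch) * 2 - 1) * (2.0 * max_shift_px / image_size)
    theta = (torch.rand(batch) * 2 - 1) * math.radians(max_rot_deg)
    return torch.stack([dx, dy, theta], dim=1)
```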

During early training, we set the gating parameters $\alpha$ and $\beta$ in the joint objective function $\mathcal{L} = \alpha\,\mathcal{L}_{\text{self}} + \beta\,\mathcal{L}_{\text{cons}}$ to fixed values. This ensures the model converges quickly to an approximate solution within a few pixels of the ground-truth annotation, despite the fact that annotations are noisy, so that the $\mathcal{L}_{\text{self}}$ term in the objective function is not exactly valid. In a second phase, when the model is close to the final solution, the terms $\mathcal{L}_{\text{self}}$ and $\mathcal{L}_{\text{cons}}$ start to conflict for the annotations that contain the largest amount of noise. Hence the coefficients $\alpha$ and $\beta$ are adjusted as follows:

$$(\alpha, \beta) =
\begin{cases}
(0, 1) & \text{if } \min_i \operatorname{IoU}(g_i\,a_i,\, a_i) < \tau,\\
(1, 0) & \text{otherwise,}
\end{cases} \tag{5}$$

where $\operatorname{IoU}$ denotes the standard Intersection-over-Union measure and $\tau$ a threshold. This states that when any of the predicted corrections is far away from the given label, the label is expected to contain a large amount of noise and only the consistency loss is applied; otherwise, only the stricter self-supervised loss is used.
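A sketch of this gating, using a mask IoU between the predicted correction and the given label and the loss functions from the earlier sketches; the threshold value and the binarisation at 0.5 are illustrative assumptions.

```python
import torch  # reuses warp, self_supervised_loss, consistency_loss from above

def mask_iou(a, b, eps=1e-6):
    """Intersection over Union between two binary masks."""
    inter = (a * b).sum()
    union = a.sum() + b.sum() - inter
    return inter / (union + eps)


def gated_loss(model, image, ann, g1, g2, iou_thresh=0.5):
    """Eq. (5) sketch: trust the given (possibly noisy) label and use the
    self-supervised loss only when the predicted correction stays close to
    it; otherwise fall back to the consistency loss alone."""
    with torch.no_grad():
        corrected = warp(ann, model(image, ann))
        close_to_label = mask_iou((corrected > 0.5).float(), ann) > iou_thresh
    if close_to_label:
        return self_supervised_loss(model, image, ann, g1)
    return consistency_loss(model, image, ann, g1, g2)
```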

Architecture and optimisation.

The proposed AutoCorrect model uses as backbone the VGG-M network [Chatfield et al.(2014)Chatfield, Simonyan, Vedaldi, and Zisserman] with minor modifications (details in the supplementary material). The network is trained using the Adam optimiser; the initial learning rate is reduced by a constant factor after the training error plateaus.

4 Experiments

The experiments thoroughly assess our AutoCorrect method on two benchmark datasets: our own Railway tracks dataset and the INRIA buildings dataset. The new Railway tracks dataset will be released at http://robots.ox.ac.uk/~vgg/research/autocorrect/.

4.1 Datasets and evaluation

Railway tracks dataset.

The Railway tracks dataset was obtained by extracting views of railways in the UK from Google Maps. We used the minimum zoom level at which railway tracks can be resolved. The dataset contains approximately 35k overhead images of the tracks. Binary mask annotations are provided by Google Maps to indicate the position of the railway tracks; however, the annotations are not perfectly aligned with the images (see Figure 1). In order to evaluate the effectiveness of the self-supervised loss, the consistency loss, and the spatial memory map, we manually identify a subset of images for which the railway annotations are accurate. We use these in the experiments by synthetically adding noise to their ground-truth annotations.

INRIA buildings dataset.

The INRIA buildings dataset contains 360 images of 5000 × 5000 pixels. This dataset may seem small compared to other deep learning datasets, but, as each image has a large spatial footprint, it contains a large number of buildings in the test split alone. In order to directly compare with prior work, we adopt the same data and evaluation protocol as [Girard et al.(2018)Girard, Charpiat, and Tarabalka].

Evaluation metrics.

To evaluate the effectiveness of the proposed AutoCorrect model on the Railway tracks dataset, we assess railway alignment using the standard IoU measure between the image of a noise-free label and the predicted correction of a noisy label. For the INRIA buildings dataset, in order to compare with existing work, we adopt the standard protocol and report results using the Percentage of Correct Keypoints (PCK) metric. We use the IoU measure for railway tracks because they tend to be straight and long, so that, unlike for buildings, it is difficult to define keypoints. Note that IoU is very sensitive for thin structures such as railroads.
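For reference, a minimal sketch of the PCK metric as used for the buildings (the railway IoU is the same mask IoU as in the gating sketch above); the matching between predicted and ground-truth keypoints is assumed to be given.

```python
import numpy as np

def pck(pred_pts, gt_pts, threshold_px):
    """Percentage of Correct Keypoints: fraction of predicted points within
    threshold_px pixels of their matched ground-truth points.
    pred_pts, gt_pts: (N, 2) arrays of (x, y) coordinates in pixels."""
    dists = np.linalg.norm(np.asarray(pred_pts) - np.asarray(gt_pts), axis=1)
    return float((dists <= threshold_px).mean())
```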

Model | Data | Noise       | SMM | Consist. | IoU
A     | 3k   | 0%          | ✗   | ✗        | 0.321
B     | 3k   | 0%          | ✓   | ✗        | 0.425
C     | 3k   | 0%          | ✓   | ✓        | 0.436
D     | 3k   | 20% Synth.  | ✓   | ✗        | 0.404
E     | 3k   | 20% Synth.  | ✓   | ✓        | 0.429
F     | 3k   | 40% Synth.  | ✓   | ✗        | 0.369
G     | 3k   | 40% Synth.  | ✓   | ✓        | 0.381
H     | 20k  | 40% Natural | ✓   | ✗        | 0.417
I     | 20k  | 40% Natural | ✓   | ✓        | 0.435
J     | 35k  | 40% Natural | ✓   | ✓        | 0.445
Table 1: Railway tracks dataset results. SMM and Consist. refer to the spatial memory map and the consistency loss, respectively.
Table 2: INRIA buildings dataset results. We outperform all recent works; from a threshold of around 10 pixels onwards, our result is 100% (i.e. it cannot be improved further).

4.2 Railway tracks results

Synthetic annotation noise.

In the following, we use the images with ground-truth (i.e. correct) annotations, split as 3,000 for training the AutoCorrect network and 1,000 for testing. With these image-annotation pairs, we perform controlled experiments to evaluate the effectiveness of the proposed components. First, we assess the effectiveness of the spatial memory map by training our model using only the noise-free annotations and the self-supervised loss. Then, to evaluate the robustness of the consistency loss against different levels of noise, we intentionally replace the noise-free annotations with perturbed ones in the training set, and train three sets of models with respectively 0%, 20% and 40% noisy annotations, each with and without the consistency loss. During the testing stage, we artificially perturb the testing annotations three times and apply our models to correct them. All artificial perturbations are composed of a random translation of up to 25px in each direction and a random rotation of up to 5 degrees (clockwise or anticlockwise).

As shown in Table 1, models A-G are trained on only 3k images, either with noise-free labels or with synthetic noise injected into part of those; models H, I and J are trained on real annotation noise. First, to show that the spatial memory map plays an important role in instance alignment, we compare models A and B: the performance gap is significant (0.321 vs. 0.425 IoU), as the spatial memory map provides important contextual information. Second, comparing models B and C shows that the consistency loss is beneficial even when training on the noise-free subset of the data; we conjecture that this is because the consistency loss acts as a regulariser. Third, to verify the effectiveness of the consistency loss in dealing with noisy data, we note that as the noise ratio is increased (models D and F), the performance of the model that uses only the self-supervised loss drops markedly (0.404 and 0.369 IoU); the transformation consistency loss improves robustness to noise significantly (models E and G, 0.429 and 0.381 IoU). Note that, when the noise ratio is around 20%, model E performs about as well as model C, which was trained on noise-free annotations. This shows that models trained with transformation consistency can discount moderate amounts of noise almost entirely.

Natural annotation noise.

After demonstrating the concept in these controlled experiments, we now train the network using the entire dataset (which we estimate to contain about 40% of labels with significant geometric distortion), using either 20k or 35k images and switching the consistency loss on and off to test its effectiveness once more. As in the synthetic-noise experiments, we artificially perturb the 1,000 testing images to evaluate models trained on natural annotation noise. Models I and J (20k/35k images, 0.435/0.445 IoU) show that, even with substantial real annotation noise (40%), our model reaches similar or superior performance to a model trained on a manually filtered dataset with no annotation noise (C, 3k images, 0.436 IoU). The advantage is that, while the datasets used for models I and J are large, they are obtained "for free", without any manual filtering.

4.3 INRIA buildings dataset results

To evaluate our alignment method on the INRIA buildings dataset, we follow the standard testing protocol introduced in [Girard et al.(2018)Girard, Charpiat, and Tarabalka, Zampieri et al.(2018)Zampieri, Charpiat, Girard, and Tarabalka] by randomly and independently perturbing the accurate annotations on images of the city of San Francisco. In contrast to the displacement maps generated in [Girard et al.(2018)Girard, Charpiat, and Tarabalka, Zampieri et al.(2018)Zampieri, Charpiat, Girard, and Tarabalka], we consider instance-level transformations. The testing labels are generated by randomly and independently perturbing the accurate annotation instances to achieve an error comparable to that of [Girard et al.(2018)Girard, Charpiat, and Tarabalka, Zampieri et al.(2018)Zampieri, Charpiat, Girard, and Tarabalka]. As shown in Table 2, the AutoCorrect approach outperforms all previous methods at all thresholds (in pixels). This is because our method outputs transformation parameters for each instance independently, whereas prior works output a displacement field map, which is less expressive. Furthermore, our consistency loss also acts as a form of data augmentation, which counters the small size of the INRIA buildings dataset and further improves performance.

Note that we learn to correct random and different perturbations of objects that co-occur in the same image; therefore, our proposed local (per-object) correction is a better match to the type of errors observed in practice in aerial datasets, as the locations of the shifted annotations can be random and uncorrelated.

4.4 Qualitative results

As label noise in satellite imagery is random, each instance label must be considered and corrected individually; our AutoCorrect model handles the geometric alignment of an arbitrary number of instances by aligning them sequentially. Figure 4 shows the progression of AutoCorrect corrections on test data, and Figure 5 shows the final predictions on a number of examples from the test data.

Figure 4: Correction progression. Red polygons refer to noisy labels, green to noise-free labels, and yellow are our predictions. Noisy annotations are cleaned inductively.
Figure 5: Alignment results for examples from the Railway tracks dataset (top row) and the INRIA buildings dataset (bottom row). The label noise of each instance (red) is random, i.e. a local transformation of each instance is needed. Our predictions (yellow) achieve accurate corrections compared to the ground truth (green) by predicting a transformation for each instance. AutoCorrect can correct noisy instances with regular shapes (a) as well as instances with complex shapes ((b) & (h)). Figures (c) and (d) illustrate the ability to correct an arbitrary number of instances.

5 Conclusion

The AutoCorrect method is based on three ideas: a spatial memory map that enables annotations to be adjusted sequentially while taking into account the other annotations and their corrections, a consistency loss that enables the model to be trained without knowledge of any noise-free annotation, and a self-supervised loss that generates training data automatically. AutoCorrect outperforms previously published works and can learn to correct annotations almost for free from a large dataset in which 40% of the annotations are heavily distorted, obtaining results comparable to approaches that require noise-free annotations. Finally, we have introduced the new Railway tracks benchmark.

Acknowledgement.

We thank Kai Han, Erika Lu and Tengda Han for proofreading. Financial support was provided by the EPSRC Programme Grant Seebibyte EP/M013774/1.

References

  • [Alshehhi et al.(2017)Alshehhi, Marpu, Woon, and Mura] R. Alshehhi, P. R. Marpu, W. L. Woon, and M. D. Mura. Simultaneous extraction of roads and buildings in remote sensing imagery with convolutional neural networks. ISPRS Journal of Photogrammetry and Remote Sensing, 2017.
  • [Berrada et al.(2018)Berrada, Zisserman, and Kumar] L. Berrada, A. Zisserman, and M. P. Kumar. Smooth loss functions for deep top-k classification. In International Conference on Learning Representations, 2018.
  • [Cai et al.(2017)Cai, Shin, and D.] J. Cai, R. Shin, and D. Song. Making neural programming architectures generalize via recursion. In Proc. ICLR, 2017.
  • [Carreira et al.(2016)Carreira, Agrawal, Fragkiadaki, and Malik] J. Carreira, P Agrawal, K. Fragkiadaki, and J. Malik. Human pose estimation with iterative error feedback. Proc. CVPR, 2016.
  • [Chatfield et al.(2014)Chatfield, Simonyan, Vedaldi, and Zisserman] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In Proc. BMVC., 2014.
  • [Chen and Gupta(2017)] X. Chen and A. Gupta. Spatial memory for context reasoning in object detection. In Proc. ICCV, 2017.
  • [Ghosh et al.(2017)Ghosh, Kumar, and Sastry] A. Ghosh, H. Kumar, and P. Sastry. Robust loss functions under label noise for deep neural networks. In AAAI, 2017.
  • [Girard et al.(2018)Girard, Charpiat, and Tarabalka] N. Girard, G. Charpiat, and Y. Tarabalka. Aligning and updating cadaster maps with aerial images by multi-task, multi-resolution deep learning. In Proc. ACCV, 2018.
  • [Google(2017)] Google. Google Maps. https://www.google.co.uk/maps, 2017.
  • [Gupta et al.(2018)Gupta, Vedaldi, and Zisserman] A. Gupta, A. Vedaldi, and A. Zisserman. Inductive visual localisation: Factorised training for superior generalisation. In Proc. BMVC., 2018.
  • [Hoffman et al.(2018)Hoffman, Tzeng, Park, Zhu, Isola, Saenko, Efros, and Darrell] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell. Cycada: Cycle consistent adversarial domain adaptation. In Proc. ICML, 2018.
  • [Hu et al.(2007)Hu, Razdan, Femiani, Cui, and Wonka] J. Hu, A. Razdan, J. C. Femiani, M. Cui, and P. Wonka. Road network extraction and intersection detection from aerial images by tracking road footprints. IEEE Transactions on Geoscience and Remote Sensing, 2007.
  • [Kowalski et al.(2017)Kowalski, Naruniec, and Trzcinski] M. Kowalski, J. Naruniec, and T. Trzcinski. Deep alignment network: A convolutional neural network for robust face alignment. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017.
  • [Laptev et al.(2000)Laptev, Mayer, Lindeberg, Eckstein, Steger, and Baumgartner] I. Laptev, H. Mayer, T. Lindeberg, W. Eckstein, C. Steger, and A. Baumgartner. Automatic extraction of roads from aerial images based on scale space and snakes. Machine Vision and Applications, 2000.
  • [Li et al.(2018)Li, Chen, and Koltun] Z. Li, Q. Chen, and V. Koltun. Interactive image segmentation with latent diversity. In Proc. CVPR, 2018.
  • [Mnih and Hinton(2012)] V. Mnih and G. E. Hinton. Learning to label aerial images from noisy data. In Proc. ICML, 2012.
  • [Oberweger et al.(2015)Oberweger, Wohlhart, and Lepetit] M. Oberweger, P. Wohlhart, and V. Lepetit. Training a feedback loop for hand pose estimation. In Proc. ICCV, 2015.
  • [OpenStreetMap contributors(2017)] OpenStreetMap contributors. Planet dump retrieved from https://planet.osm.org . https://www.openstreetmap.org, 2017.
  • [Patrini et al.(2017)Patrini, Rozza, Menon, Nock, and Qu] G. Patrini, A. Rozza, A. K. Menon, R. Nock, and L. Qu. Making deep neural networks robust to label noise: A loss correction approach. In Proc. CVPR, 2017.
  • [Reed and Freitas(2016)] S. Reed and N. Freitas. Neural programmer-interpreters. In Proc. ICLR, 2016.
  • [Romera-Paredes and Torr(2015)] B. Romera-Paredes and P. H. S. Torr. Recurrent instance segmentation. In Proc. ECCV, 2015.
  • [Saito et al.(2016)Saito, Yamashita, and Aoki] S. Saito, Y. Yamashita, and Y. Aoki. Multiple object extraction from aerial imagery with convolutional neural networks. Journal of Imaging Science and Technology, 2016.
  • [Sundaram et al.(2010)Sundaram, Brox, and Keutzer] N. Sundaram, T. Brox, and K. Keutzer. Dense point trajectories by GPU-accelerated large displacement optical flow. In Proc. ECCV, 2010.
  • [Veit et al.(2017)Veit, Alldrin, Chechik, Krasin, Gupta, and Belongie] A. Veit, N. Alldrin, G. Chechik, I. Krasin, A. Gupta, and S. J. Belongie. Learning from noisy large-scale datasets with minimal supervision. In Proc. CVPR, 2017.
  • [Wang et al.(2014)Wang, Huang, Ovsjanikov, and Guibas] F. Wang, Q. Huang, M. Ovsjanikov, and L.J. Guibas. Unsupervised multi-class joint image segmentation. In Proc. CVPR, 2014.
  • [Xiao et al.(2015)Xiao, Xia, Yang, Huang, and Wang] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang. Learning from massive noisy labeled data for image classification. In Proc. CVPR, 2015.
  • [Zampieri et al.(2018)Zampieri, Charpiat, Girard, and Tarabalka] A. Zampieri, G. Charpiat, N. Girard, and Y. Tarabalka. Multimodal image alignment through a multiscale chain of neural networks with application to remote sensing. In Proc. ECCV, 2018.
  • [Zaremba et al.(2016)Zaremba, Mikolov, Joulin, and Fergus] W. Zaremba, T. Mikolov, A. Joulin, and R. Fergus. Learning simple algorithms from examples. In Proc. ICML, 2016.
  • [Zhou et al.(2016)Zhou, Philipp, Aubry, Huang, and Efros] T. Zhou, K. Philipp, M. Aubry, Q. Huang, and A. Efros. Learning dense correspondence via 3d-guided cycle consistency. In Proc. CVPR, 2016.
  • [Zhou et al.(2015)Zhou, Zhu, and Daniilidis] X. Zhou, M. Zhu, and K. Daniilidis. Multi-image matching via fast alternating minimization. In Proc. ICCV, 2015.
  • [Zhu et al.(2017)Zhu, Park, Isola, and Efros] J. Y. Zhu, T. Park, P. Isola, and A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proc. ICCV, 2017.