The presence of extraneous, confusing or overwhelming information in an image has been shown to reduce task-specific performance in humans. As such, it should not come as a surprise that similar performance drops are observed in machine learning applications. From localisation in crowded scenes to tracking in environments with multiple objects, visual clutter can have a significant effect on the performance of neural networks. This can be explained, in both humans and Artificial Intelligence (AI), as multiple objects competing for neural representation.
Object removal is an application in its own right, where the task is to fill in the “gap” left when an object is removed from an image. That removal is normally defined by a mask, which is required at runtime. Object removal can be used to remove visual clutter from images as a pre-processing step for any task. However, there are important limitations to this approach. Firstly, it requires expensive image processing to be done for every single datum in a dataset and, crucially, also at runtime. Secondly, it requires manual mask annotation of the objects that are to be removed. Finally, the approach must reliably fill the image with temporally consistent content in order to be useful for downstream tasks. Our key insight is that, for some applications, it is less important that an object is removed from an image and more important that the computer is capable of completely ignoring that object. In this work, we propose an approach that allows a neural network to ignore chosen objects, not by simply removing them from the image, but by training the network not to “see” them in the first place.
Our approach, which we call Neural Blindness, aims to prevent a neural network from representing chosen objects in its latent space. Humans are capable of ignoring clutter when the task at hand requires them to do so. In a similar way, we train an autoencoder network to remove class-specific objects from its latent representation and demonstrate how this improves the performance of an autonomous agent.
We introduce two key novelties. First, we introduce the concept of Neural Blindness along with a novel Siamese autoencoder architecture used for training neural-blind latent representations. We additionally offer insights into why certain types of autoencoders are suitable for this task, while others are not. Second, we demonstrate that our neural-blind latent representation can be used to improve performance on downstream tasks by ignoring visual clutter. More explicitly, we demonstrate that Neural Blindness can be used as an effective pre-training approach to inject invariance to distractors, particularly when applied to localisation of autonomous vehicles.
II Literature Review
In many computer vision tasks, the use of latent spaces has become commonplace [8, 28, 31]. These latent spaces encode high-level semantic information about the image. Our task aims to remove targeted information/objects from the original image such that this information is never encoded in the representation to begin with. The task therefore requires learning invariant representations and the “forgetting” of information.
One way to obtain a latent space that is domain-specific or, equivalently, invariant to certain information, is via the use of Variational Autoencoders (VAEs) [1, 23, 18, 13]. A VAE is a Bayesian autoencoder, whereby the training objective involves reconstruction of a high-dimensional observational distribution from a low-dimensional representation. By regularizing this representation so that it resembles a pre-specified prior distribution (often an isotropic Gaussian), the network undertakes variational inference and learns a smooth, traversable latent manifold. VAEs have been applied to tasks including neural image inpainting [8, 31].
Researchers have encouraged invariance or forgetting with VAEs by utilizing various combinations of weak supervision [1, 23] and adversarial training, and by trading off reconstruction error against encoding capacity according to the information bottleneck principle. Others have sought to fully disentangle the latent space into independent partitions, thereby enabling the isolation of wanted from unwanted information. Whilst unsupervised methods for disentanglement exist [4, 5], recent research indicates that such methods do not achieve disentanglement consistently and, even if they do, they do not allocate factors to the latent space in a predictable way. Furthermore, in order for VAEs to handle complex data and to reconstruct detailed images, the complexity of the latent space is increased, reducing the emphasis on achieving disentangled and semantically interpretable latent spaces [21, 26]. For instance, the Vector Quantised VAE (VQ-VAE) [24, 27] deviates from the typical formulation of a VAE and utilizes a deterministic, discrete, vector-quantized codebook as a latent space. In this work, we incorporate a number of techniques into VAE/autoencoder learning in order to achieve a representation that is invariant to targeted objects.
The latent representations in neural networks can be used for downstream tasks such as classification and regression. In this area, one of the key applications is camera pose regression, and the most common family of approaches is PoseNet [12, 10, 11, 3] and its derivatives. Fundamentally, these approaches rely on sensor/pose pairs to train a CNN that can regress pose given an image. Normally, they start with a pre-trained encoder network and modify the final fully connected layers to regress 6 Degrees of Freedom (DoF) camera poses. The network then learns a mapping between the latent representation in the encoder and the camera poses. A key limitation of these approaches is that they indiscriminately map latent representations to poses. More explicitly, the network may learn to map a parked red Corvette to a set of camera poses regardless of the fact that the car may move. As such, the network may not be able to successfully recover the poses when the car moves, or indeed to predict them at different locations where red Corvettes exist. Work by Barnes et al. has attempted to explicitly remove objects using an ephemerality mask, but relies on pre-mapped areas using LiDAR. While invariance to such salient objects might be achieved with sufficient data, acquiring it is not always feasible. In this work, we propose to explicitly remove such transient objects from the latent space of a network before it is ever trained for localisation. Such a network would have the ability to learn to regress poses in areas with large amounts of distractors, such as car parks and crowded areas, with relatively small amounts of data.
It is tempting to think of neural blindness as a two-step process of identification and removal. However, the purpose of this paper is to achieve a subtler and more complicated task. Class-specific neural blindness requires the neural network to be fundamentally incapable of representing those classes in its latent space. Therefore, identifying the object before it can be removed fundamentally defeats the purpose, as the object has to be present in the latent space for removal to occur in the generated image. Our architecture, loss functions and pre-training strategies force the network to learn how to represent images while removing the presence of specific classes from its latent space.
Our goal is to be able to remove a chosen class, taken from the set of all classes in the dataset, from the latent space of the network. In order to accomplish this, we adopt a Siamese architecture where each arm is a VQ-VAE, shown in Figure 1. This means that we can feed the network two versions of every image: the original image, which does not contain the chosen class, and a copy in which the class has been added to the image. Since VQ-VAEs are capable of producing both an informative embedding and a crisp reconstruction, we can compare the two arms of the network to ensure that the arm that processes the image containing the class does not learn to represent it. We use a VQ-VAE as we have found, empirically, that a standard VAE struggles to produce the informative embedding and crisp reconstruction required for training.
The VQ-VAEs employed in the Siamese architecture are two-level hierarchical, which means they each sample from two trained code-books. The quantised outputs of both code-books are then concatenated into a single latent space for each of the two input images, as shown in Figure 1. These subsets of the code-book are fed to the decoder arms in order to generate the output images. Employing a hierarchical model allows our approach to specialise the code-books such that the top code-book encodes low-frequency image detail while the bottom encodes high-frequency detail. Since the code-books are trained to focus on different image “levels”, it is easier to discriminate class-specific elements.
In order to ensure the code-book is trained to correctly encode the images, we use a quantisation loss to train our latent space. Given an input image, an encoder, a quantiser and a decoder, the loss function for the code-book (equation 1) is computed between the encoder output and the quantised encoder output, using a stop-gradient operation that prevents back-propagation. The first term in equation 1 ensures the code-book learns to represent the input correctly, while the second term ensures the code-book does not keep changing and instead “commits” to a set of vectors. Therefore, the weight on the second term describes how reluctant the network is to update its code-book. We also adopt the standard practice of replacing the first term of the equation with an exponential moving average update step for the code-book.
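As an illustration of the quantisation step described above, the following is a minimal NumPy sketch rather than the implementation used in this work. It assumes a single flat code-book with nearest-neighbour lookup under the L2 distance; the function names, the commitment weight `beta=0.25` and the decay `0.99` are assumptions chosen for the example. Because NumPy has no automatic differentiation, the stop-gradient is only indicated in comments.

```python
import numpy as np

def quantise(z_e, codebook):
    """Map each encoder vector to its nearest code-book entry (L2 distance).

    z_e: (N, D) encoder outputs; codebook: (K, D) code vectors.
    """
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    idx = dists.argmin(axis=1)
    return codebook[idx], idx

def commitment_loss(z_e, z_q, beta=0.25):
    """Second (commitment) term; in an autograd framework z_q would be
    wrapped in a stop-gradient so only the encoder receives gradients."""
    return beta * np.mean((z_e - z_q) ** 2)

def ema_update(codebook, counts, sums, z_e, idx, decay=0.99, eps=1e-5):
    """Exponential-moving-average code-book update used in place of the first term."""
    k = codebook.shape[0]
    one_hot = np.eye(k)[idx]                                   # (N, K) assignments
    counts = decay * counts + (1 - decay) * one_hot.sum(axis=0)
    sums = decay * sums + (1 - decay) * (one_hot.T @ z_e)
    n = counts.sum()
    smoothed = (counts + eps) / (n + k * eps) * n              # Laplace smoothing
    return sums / smoothed[:, None], counts, sums
```

In this sketch the running statistics `counts` and `sums` would be initialised once (for example to ones and to a copy of the code-book) and carried across training iterations.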
A standard implementation of a VQ-VAE would combine this loss with a reconstruction loss between the input image and the decoder output (equation 2), which forces the network to learn a code-book that accurately reconstructs the image. However, our objective is to ensure the network does not encode a certain class into its code-book. Therefore, we modify the loss in equation 2 so that the reconstruction error is multiplied element-wise by a mask for the class we want to ignore. More explicitly, we ensure the network gets no reward for reconstructing the target class in the image.
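The masked reconstruction loss described above can be sketched as follows. This is an illustrative example under stated assumptions, not the exact loss used in this work: it assumes a squared-error penalty, a binary mask equal to 1 on the class to be ignored, and normalisation by the number of unmasked values. The function and argument names are hypothetical.

```python
import numpy as np

def masked_reconstruction_loss(x, x_hat, class_mask):
    """Mean squared error over pixels outside the target-class mask.

    x, x_hat: (H, W, C) float arrays.
    class_mask: (H, W) array, 1 on the class to ignore and 0 elsewhere.
    """
    keep = 1.0 - class_mask[..., None]       # 1 where reconstruction is still rewarded
    sq_err = keep * (x - x_hat) ** 2         # element-wise product zeroes the class region
    n_kept = keep.sum() * x.shape[-1]        # number of unmasked values across channels
    return float(sq_err.sum() / max(n_kept, 1.0))
```

With this form, any reconstruction error that falls inside the mask contributes nothing to the loss, so the network is neither rewarded nor penalised for what it produces there.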
These losses ensure that the network is capable of learning a code-book that can successfully reconstruct an image. However, while there is no reward for reconstructing the target class, there is nothing in these loss functions that actively discourages the network from learning global features that can reconstruct it. Therefore, a loss is required to actively prevent the network from learning to encode and reconstruct the target classes. To accomplish this, we use a Siamese loss consisting of two individual losses: the first ensures that the objects are not represented in the latent space, and the second ensures that the objects are not present in the reconstructed image.
We assume the existence of an image pair such that the only difference between the two images is the presence of the target classes, as shown in Figure 1. This is a strong assumption, and we provide more details on the creation of such a dataset in section III-B.
Given such a pair, each image is passed through one of the Siamese encoder networks to obtain its own latent encoding. The two encoded latent spaces then define a latent loss function, which maintains a level of separation between the dimensions of the latent space.
This loss ensures that the network learns to map the images containing the target classes into a representation that fundamentally does not encode them. One of the key insights in our work is that this type of Siamese loss is particularly well suited to a VQ-VAE’s architecture. In a standard VAE, the encoder would be forced to implicitly learn to map an image into a latent space without the target classes. In our Siamese VQ-VAE, the network learns to sample from the codebook such that the target class does not end up in the representation space.
As a final step, we use the fact that neither the original image nor the reconstructed image should contain the target classes. Therefore, we can estimate a Siamese reconstruction loss between the two, which encourages the network not to reconstruct the chosen class and is complementary to the latent loss.
The final Siamese loss can therefore be defined as the sum of the latent loss and the Siamese reconstruction loss, where a hyperparameter defines the weight of the reconstruction loss.
Finally, the losses are combined into a single weighted objective, where the weights are training hyperparameters. The combined loss function is capable of training a neural-blind architecture end-to-end, as will be shown in the following section.
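To make the structure of the combined objective concrete, the sketch below shows one way the Siamese terms and the final weighted sum could be assembled. It is a hedged illustration only: the use of mean squared error for both Siamese terms, and the names `lam`, `w_vq`, `w_rec` and `w_siam`, are assumptions made for the example and are not taken from this work.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((a - b) ** 2))

def siamese_loss(z_clean, z_overlaid, x_clean, recon_overlaid, lam=1.0):
    """Latent term plus weighted reconstruction term.

    z_clean / z_overlaid: latents of the image without / with the target class.
    x_clean: original image without the class.
    recon_overlaid: decoder output for the image that contains the class.
    """
    latent_term = mse(z_clean, z_overlaid)     # both arms should encode the same content
    recon_term = mse(x_clean, recon_overlaid)  # the class should not reappear in the output
    return latent_term + lam * recon_term

def total_loss(vq, masked_recon, siamese, w_vq=1.0, w_rec=1.0, w_siam=1.0):
    """Weighted sum of the code-book, masked reconstruction and Siamese losses."""
    return w_vq * vq + w_rec * masked_recon + w_siam * siamese
```

Keeping the three components separate until the final sum makes it straightforward to tune each weight independently during training.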
III-B Latent Space Pre-Training
As mentioned in section III-A, we assume the existence of an image pair where the presence of the target class is the only difference. In principle, this would limit our approach to datasets where the target classes are added and removed from scenes while the images are being captured. Instead, we propose to generate such pairs from pre-existing datasets in order to train and evaluate our approach. Given a set of target object classes, drawn from the set of all classes in the dataset, we split the dataset into a subset of images that contain the category and a subset of images that do not. For every image that does not contain the chosen class, we randomly select an image that does and use a mask to overlay one of the class objects onto it. This defines an overlaid image through a function that overlays a category from the image that contains it onto the image that does not.
Using this overlay, a set of 3 images can be created: the original image, the original image with a category overlaid, and a mask image which defines the location of the overlaid class. These overlay images can be used in equations 5 and 6. At training time, these images are generated automatically, so the overlay is randomised at every iteration. Although these images are sufficient to train the system to be blind, the network is never exposed to the classes in their natural context, which may limit the network’s ability to generalise outside of the training data. To address this concern, we also show the network natural images while training, which can be used in equation 3 to help the network generalise to natural data.
As previously mentioned, this dataset is dynamic during training: each iteration presents a new opportunity for the classes to be overlaid on different images, preventing the network from overfitting to the overlay while ensuring performance on natural images.
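The overlay procedure described in this section can be sketched as follows. This is a minimal example under simplifying assumptions: all images are assumed to share the same resolution, masks are binary, and the helper names `overlay` and `make_training_triplet` are hypothetical rather than part of any released code.

```python
import numpy as np

def overlay(src_img, src_mask, dst_img):
    """Copy the masked class pixels from src_img onto dst_img.

    src_img, dst_img: (H, W, C) arrays; src_mask: (H, W) binary mask of the class.
    """
    m = src_mask[..., None].astype(dst_img.dtype)
    return m * src_img + (1 - m) * dst_img

def make_training_triplet(clean_images, class_images, class_masks, rng):
    """Sample (original, overlaid, mask) afresh, so every call gives a new pairing."""
    i = rng.integers(len(clean_images))   # image without the target class
    j = rng.integers(len(class_images))   # image that contains the target class
    x = clean_images[i]
    x_overlaid = overlay(class_images[j], class_masks[j], x)
    return x, x_overlaid, class_masks[j]
```

Calling `make_training_triplet` with a generator such as `np.random.default_rng()` at every iteration reproduces the dynamic behaviour described above, since the class object and the background image are re-drawn each time.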
In this section, we describe the use of a pre-trained latent space for an autonomous vehicle localisation task where blindness to visual clutter (cars) improves localisation performance. We also demonstrate qualitative “blindness” to objects by decoding latent spaces where the object has been removed.
IV-A Localisation Task
A useful downstream task for neural blindness is localisation. Most learning-based localisation techniques employ an encoder to map an image to a latent space, then regress the pose of the camera. However, such approaches to localisation can easily overfit and often fail to generalise to new sequences. For example, training a system to learn pose from sequence A and then regress the correct pose on unseen sequence B means the latent space must be capable of overcoming environmental or structural changes between the two sequences. Consider elements of the scene that easily change such as people in crowded scenes or cars in automotive datasets. These factors need to be ignored so that localisation can focus on the consistent structure of the environment.
Given sufficient data, pose-regression networks should be able to learn to ignore distractors. However, it isn’t always feasible to capture sufficient data representing the variability needed. In this section, we demonstrate that our neural-blind network is capable of mapping images to a more useful feature space by actively ignoring distractors, allowing us to train a distractor-robust pose-regression network without providing variance in the training data and potentially allowing one-shot training.
IV-A1 Car-Park Localisation
We first focus on localisation of a car in a multi-storey car park. This is an ideal use-case for pose regression, as indoor car parks are often GPS-denied. Furthermore, the very nature of a car park means that much of the visual input to any localisation system will include other cars, all of which have different appearances and change on a daily basis. Therefore, any localisation system needs to ignore cars and base its localisation on the structure of the building (as humans do).
We first pre-train a neural-blind network that cannot see cars using the MSCOCO dataset. Once the network is pre-trained, we fix the encoder and feed the representation space to a pose-regression network in order to produce 6-DoF pose estimates for each input image. This architecture can be seen in Figure 2. It should be noted that the encoder part of the network is not fine-tuned during training of the pose regressor, nor are masks of the distractors required. Instead, we simply learn to regress a pose from a vehicle-blind representation space. Regressing poses from a representation that is incapable of encoding vehicles should lead to better one-shot learning and pose regression.
To train the pose regression network, we use a dataset that consists of three separate runs through a 6-floor multi-storey car park. The runs were captured over two days, with two captures on the first day and a third capture on the second day. Each of these sequences traverses all 6 floors. We will refer to these sequences as D1T1, D1T2 and D2T1 to represent the Day (D) and Trajectory (T). The capture vehicle was equipped with 3 front-facing cameras and a 16-beam LiDAR. Of the three front-facing cameras, two are narrow field-of-view cameras (Left and Right) and one is a wide-angle camera (Centre), which we use as input to the pose regressor. We use the LiDAR to create the ground-truth poses and extract between 2500 and 4000 images per camera, per trajectory.
IV-A2 Cross-Floor Validation
In this experiment, we exploit the fact that the floors of the multi-storey car park are structurally identical, as can be seen in Figure 3. However, despite the structural similarity of the car park, the appearance varies drastically due to the different population of parked cars. To test the effect of neural blindness, we trained three different networks (Blind, Non-Blind and PoseNet) on the second floor of the multi-storey car park and evaluated 2D localisation performance for floor one across D1T1, D1T2 and D2T1.
We train a neural-blind pose regression network for 100 epochs using an Adam optimizer and a step learning rate scheduler. We use the same hyperparameters to train a non-blind network. As in other work, we report the median error in meters and evaluate our networks against a state-of-the-art PoseNet trained on the same data. All networks are trained using a homoscedastic loss. For PoseNet, we use a pre-trained ResNet as an encoder and allow the network to fine-tune the entire architecture. For our networks, we do not fine-tune the encoders, and expect the network’s representation space to be robust enough for pose regression.
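For readers unfamiliar with the evaluation protocol above, the following sketch shows a homoscedastic-uncertainty pose loss of the kind commonly used in learned pose regression, together with the median translation error. It is an assumed illustration: the parameter names `s_t` and `s_q` (learned log-variance weights for translation and rotation) and both function names are hypothetical, and the exact loss configuration used in these experiments is not specified here.

```python
import numpy as np

def homoscedastic_pose_loss(t_err, q_err, s_t, s_q):
    """Balance translation and rotation errors with learned log-variances.

    t_err, q_err: scalar translation and rotation errors for a batch.
    s_t, s_q: learnable scalars; larger values down-weight the matching term
              but are themselves penalised, which prevents a trivial solution.
    """
    return t_err * np.exp(-s_t) + s_t + q_err * np.exp(-s_q) + s_q

def median_translation_error(pred_xyz, true_xyz):
    """Median Euclidean distance between predicted and true positions.

    pred_xyz, true_xyz: (N, 3) arrays in the same units as the ground truth.
    """
    return float(np.median(np.linalg.norm(pred_xyz - true_xyz, axis=1)))
```

The median is preferred over the mean for reporting because a small number of grossly wrong pose estimates would otherwise dominate the figure.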
As can be seen in Table I, PoseNet has the lowest training error suggesting it can memorise the images. However, while the Non-blind VQ-VAE has comparable performance to PoseNet on the unseen sequences (D1T1, D1T2, D2T1), the Car-blind VQ-VAE has a significantly reduced localisation error. This demonstrates the benefit of a localisation framework that is blind to the distractors (cars) in the environment.
In a second experiment, we train both a neural-blind VQ-VAE pose regression network and a standard PoseNet on the entire D1T2 (6 floors) sequence to regress the full 6 degree of freedom vehicle pose. We train both networks for 500 epochs. We then test both networks on the full 6 floors, using unseen sequences D1T1 and D2T1. Tables II and III show the performance of PoseNet as well as our approach. Our neural-blind pose regression outperforms the standard PoseNet, reducing the error by a significant margin. This is because PoseNet cannot identify the cars as distractors. In this scenario, PoseNet learns to map car features to poses, and when those features are modified (or absent), the localisation performance suffers. In contrast, our approach doesn’t see or encode car features. Therefore, the network must rely on other features to estimate the pose and does not suffer in the presence of high variability of distractors. Since any class can be removed from the latent space, we expect this approach to work in any scene where visual clutter is a problem.
|Neural-Blind Pose (Us)||0.99||1.08||1.09|
|Neural-Blind Pose (Us)||1.61||1.74||1.62|
IV-B Blindness Decoding
We now demonstrate that our network becomes blind to certain classes by decoding the latent space of a neural-blind VQ-VAE. We show, qualitatively, that the class the network is blind to is not reconstructed. It should be noted that our network is not rewarded for the quality of the image “inpainting” but is simply incapable of representing these classes in its latent space; the result is inpainting by proxy.
In a first example, we show how our neural-blind VQ-VAE can be trained to become blind to facial apparel, particularly glasses and hats. We train two blind VQ-VAEs (glasses and hats) using the CelebAMask-HQ dataset. While many approaches have shown excellent results in this domain, they all require a mask before they remove the features. In our work, the network is fundamentally incapable of representing these categories, so a mask is not necessary at runtime. As can be seen in Figure 4, our network is capable of removing the eyeglasses class while maintaining the eyes and facial expression. It is also capable of removing hats while maintaining sensible hair appearance. It should be noted that the network was NOT provided with a mask, nor has the network been trained to optimize for appearance (using an adversarial loss or similar). Instead, the network is incapable of encoding these classes into its latent representation.
In a second example, we select a representative set of categories to be blind to and train on each class independently using the MSCOCO 2017 dataset. We evaluate our approach against a state-of-the-art inpainting approach (DeepFill V2) and show that our approach is capable of outperforming inpainting without the use of a mask. Figure 8 shows qualitative results on natural images from the validation set of MSCOCO 2014. The first column shows the original images, the second column shows DeepFill V2 applied to these images with a mask, and the third shows our neural-blind VQ-VAE without a mask. Note that comparing against an approach that is given a mask is inherently unfair to our approach, especially in situations where the mask is unavailable at runtime. However, it shows that the performance gains in localisation are down to the removal of objects in the encoder latent space.
To evaluate quantitatively, we create a fixed unseen dataset for each class out of the MSCOCO 2014 validation set with randomized overlays. This is done to ensure that there is ground-truth data for the area covered by the target class. Natural images, such as those shown in Figure 8, do not contain this information, which would make meaningful quantitative evaluation impossible. Note that the chosen classes provide representative performance, but our algorithm is flexible enough to learn blindness for other classes, including combinations of classes. Table IV shows the mean reconstruction error, Mean Squared Error (MSE) and Peak Signal-to-Noise Ratio (PSNR) estimated between the overlaid images and the originals. It can be seen that we outperform DeepFill V2 in all metrics. This shows that our approach is capable of robustly inpainting a chosen class even without the use of a mask.
|Class||Method||Mean error||MSE||PSNR|
|Person||DeepFill V2 (Mask)||0.214||0.123||9.68|
|Person||Ours (No Mask)||0.129||0.056||13.86|
|Car||DeepFill V2 (Mask)||0.229||0.136||9.30|
|Car||Ours (No Mask)||0.148||0.069||13.28|
|Fire-Hydrant||DeepFill V2 (Mask)||0.215||0.125||9.55|
|Fire-Hydrant||Ours (No Mask)||0.102||0.036||15.66|
|Cow||DeepFill V2 (Mask)||0.220||0.129||9.39|
|Cow||Ours (No Mask)||0.119||0.046||14.33|
|Fork||DeepFill V2 (Mask)||0.195||0.107||10.20|
|Fork||Ours (No Mask)||0.082||0.027||16.89|
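The image-quality metrics reported in Table IV can be computed along the following lines. This is a sketch under stated assumptions: images are taken to be float arrays scaled to [0, 1], the first metric is assumed here to be a mean absolute error, and PSNR uses the standard definition with peak value `max_val`. The function names are illustrative.

```python
import numpy as np

def mean_abs_error(a, b):
    """Mean absolute difference between two images."""
    return float(np.mean(np.abs(a - b)))

def mean_squared_error(a, b):
    """Mean squared difference between two images."""
    return float(np.mean((a - b) ** 2))

def psnr(a, b, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB; infinite for identical images."""
    err = mean_squared_error(a, b)
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / err))
```

Note that a PSNR averaged over many images generally differs from the PSNR of the averaged MSE, so per-image values should be averaged only after each has been computed.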
We have trained a network to not encode certain classes into its latent space and shown this latent space can be used for downstream tasks. This kind of “negative attention” is capable of revolutionising the way we think about certain tasks. It means that visual clutter can be removed from tasks like localisation in crowded scenes or tracking in busy environments. Our contributions allow networks to be trained with smaller amounts of data, as we can choose to remove attention from elements that may cause the network to overfit.
- (2017) JADE: joint autoencoders for dis-entanglement. arXiv:1711.09163v1.
- (2018) Driven to distraction: self-supervised distractor learning for robust monocular visual odometry in urban environments. In IEEE Int. Conf. on Robotics and Automation (ICRA), pp. 1894–1900.
- Geometry-aware learning of maps for camera localization. IEEE/CVF Conf. on Comp. Vis. and Pattern Recognition (CVPR), pp. 2616–2625.
- (2018) Understanding disentangling in Beta-VAE. arXiv:1804.03599v1.
- (2018) Learning disentangled joint continuous and discrete representations. Neural Inf. Proc. Sys. (NeurIPS).
- (2016) Domain-adversarial training of neural networks. arXiv:1505.07818.
- (2017) A recurrent variational autoencoder for human motion synthesis. British Machine Vis. Conf. (BMVC).
- (2018) Variational image inpainting. Workshop on Bayesian Deep Learning (NeurIPS).
- (2019) Disentangled representation learning for 3D face shape. arXiv:1902.09887v2.
- (2016) Modelling uncertainty in deep learning for camera relocalization. In IEEE Int. Conf. on Robotics and Automation (ICRA), pp. 4762–4769.
- (2017) Geometric loss functions for camera pose regression with deep learning. In IEEE/CVF Conf. on Comp. Vis. and Pattern Recognition (CVPR), pp. 5974–5983.
- (2015) PoseNet: a convolutional network for real-time 6-DOF camera relocalization. In IEEE Int. Conf. on Comp. Vis. (ICCV), pp. 2938–2946.
- (2014) Auto-encoding variational Bayes. arXiv:1312.6114v10.
- (2015) Deep convolutional inverse graphics network. arXiv:1503.03167v4.
- (2020) MaskGAN: towards diverse and interactive facial image manipulation. In IEEE/CVF Conf. on Comp. Vis. and Pattern Recognition (CVPR).
- (2019) SuperVAE: superpixelwise variational autoencoder for salient object detection. AAAI Conf. on Art. Int.
- (2018) Image inpainting for irregular holes using partial convolutions. arXiv:1804.07723v2.
- (2019) DIVA: domain invariant variational autoencoders. arXiv:1905.10427.
- Challenging common assumptions in the unsupervised learning of disentangled representations. arXiv:1811.12359v3.
- (2019) Disentangling factors of variation using few labels. arXiv:1905.01258.
- (2019) BIVA: a very deep hierarchy of latent variables for generative modeling. Neural Inf. Proc. Sys. (NeurIPS).
- (2011) Interactions of top-down and bottom-up mechanisms in human visual cortex. Journal of Neuroscience 31 (2), pp. 587–597.
- (2018) Invariant representations without adversarial training. Neural Inf. Proc. Sys. (NeurIPS).
- (2019) Generating diverse high-fidelity images with VQ-VAE-2. arXiv:1906.00446.
- (2015) Deep learning and the information bottleneck principle. arXiv:1503.02406v1.
- (2020) NVAE: a deep hierarchical variational autoencoder. Neural Inf. Proc. Sys. (NeurIPS).
- (2017) Neural discrete representation learning. Neural Inf. Proc. Sys. (NeurIPS).
- (2019) Foreground-aware image inpainting. IEEE/CVF Conf. on Comp. Vis. and Pattern Recognition (CVPR).
- (2018) Generative image inpainting with contextual attention. In IEEE/CVF Conf. on Comp. Vis. and Pattern Recognition (CVPR), pp. 5505–5514.
- (2019) Free-form image inpainting with gated convolution. In IEEE Int. Conf. on Comp. Vis. (ICCV), pp. 4471–4480.
- (2019) Variational auto-decoder. arXiv:1903.00840v5.