Binding via Reconstruction Clustering

11/19/2015 · Klaus Greff et al. (IDSIA)

Disentangled distributed representations of data are desirable for machine learning, since they are more expressive and can generalize from fewer examples. However, for complex data, the distributed representations of multiple objects present in the same input can interfere and lead to ambiguities, which is commonly referred to as the binding problem. We argue for the importance of the binding problem to the field of representation learning, and develop a probabilistic framework that explicitly models inputs as a composition of multiple objects. We propose an unsupervised algorithm that uses denoising autoencoders to dynamically bind features together in multi-object inputs through an Expectation-Maximization-like clustering process. The effectiveness of this method is demonstrated on artificially generated datasets of binary images, showing that it can even generalize to bind together new objects never seen by the autoencoder during training.




1 The Binding Problem

Two important properties of good representations are that they are distributed and disentangled. Distributed representations (Hinton, 1984) are far more expressive than local ones, requiring exponentially fewer features to capture the same space. Complementary to that, disentangling (Barlow et al., 1989; Schmidhuber, 1992; Bengio et al., 2007) requires the factors of variation in the data to be separated into different independent features. This concept is closely related to invariance and eases further processing, because many properties that we might be interested in are invariant under a wide variety of transformations (Bengio et al., 2013a). Unfortunately, distributed representations can interfere and lead to ambiguities when multiple objects are to be represented at the same time.

The binding problem refers to these ambiguities that can arise from the superposition of multiple distributed representations. This problem has been debated quite extensively in the neuroscience and psychology communities, perhaps starting with Milner (1974) and von der Malsburg (1981), but its existence can be traced back at least to a description by Rosenblatt (1961). It is classically demonstrated with a system required to identify an input as either a square or a triangle and to decide whether it is at the top or at the bottom. Such a system represents every object as a distributed representation with two active disentangled features (see Figure 1(b)). The binding problem arises when the system is presented with two objects at the same time: in this scenario, all four features become active, and from the representation alone it cannot be determined whether the input contains a square on top and a triangle at the bottom or vice versa.

One way the system can circumvent this problem is through the use of a local representation, with one feature for each combination of shape and position. Sadly, the size of such a purely local representation scales exponentially with the number of factors to represent. In contrast, distributed representations (Hinton, 1984) are much more expressive and can generalize better through the reuse of features. The former system could, for example, correctly represent the position of a new object such as a circle using the already available position features.

Generalization of internal representations is a crucial capability of any intelligent system, and one that still sets humans apart from current machine learning systems. Consider Figure 1(a), an example from studies in psychology: chances are this is the first time you see a Greeble (Gauthier & Tarr, 1997). Nevertheless, you are capable of describing its shape, texture, and color. Moreover, you can easily segment it and tell it apart from the background, without having seen any other Greeble before. It has long been argued that such generalization capabilities are a result of the use of distributed representations in the human brain.

Despite its importance to the neuroscience community, binding has received relatively little attention in representation learning. Two important reasons for this are:

Firstly, most pattern recognition tasks and benchmarks are set up to avoid the binding problem. Many popular visual pattern recognition datasets consist of images that contain only one object at a time. Similarly, speech recognition mostly considers recordings of a single speaker with little background noise. In these settings, a machine learning algorithm can assume that there is only one prominent object of interest, reducing the binding problem to the problem of ignoring irrelevant details. When tackling more challenging problems such as image caption generation, scene parsing and segmentation, or the cocktail party problem, the deficiencies of popular methods become more apparent and restrictive.

Secondly, the recent increase in processing power due to the use of Graphics Processing Units made it feasible to mitigate the binding problem using localized binding in the form of convolutions (Riesenhuber & Poggio, 1999). Convolutional Networks (Fukushima, 1979; Le Cun et al., 1990) represent their inputs using feature detectors with limited receptive fields (filters) replicated over the whole input. The resulting features of spatially separated objects therefore do not interact: they are invariant to changes outside their field of view. On the other hand, they do not disentangle the location from the detected pattern, which comes at the cost of having to compute the same feature replicated many times over the image. While this is reasonable for low-level features like edges, it seems wasteful to replicate specialized high-level features such as dog-faces.

In this paper, we develop an unsupervised method that dynamically binds features of different objects together. This is in contrast to local representations, which by nature statically bind several input features together (a feature for a square at the top, for instance, permanently binds the concepts square and top together). Our method explicitly models inputs as a composition of multiple entities and recovers these “objects” using the notion of mutual predictability. This is achieved through a clustering process which utilizes a denoising autoencoder (DAE; Behnke, 2001; Vincent et al., 2008) to iteratively reconstruct an input. In the future, such a mechanism could help to effectively use distributed representations in multi-object settings without being impaired by the ambiguities due to superposition. Alternative approaches to the binding problem proposed in the literature are discussed in Section 5.

Figure 1: (a) A Greeble. (b) Example demonstrating the binding problem. (c) An illustration of intra-object predictability: the missing pixels from the square can be predicted using other pixels constituting the square, but not from pixels constituting other objects.

2 Reconstruction Clustering

This section describes Reconstruction Clustering (RC), a formal framework for tackling the binding problem as a clustering problem. For ease of explanation we will refer to inputs as images and the individual dimensions of an input as pixels, though the framework is not restricted to visual inputs. It is based upon two insights: Firstly, if the image was segmented into its constituent objects, there would be no binding problem. Secondly, the intuitive notion of an object can be formalized as a group of mutually predictive pixels. The proposed method therefore iteratively clusters pixels based on how well they predict each other.

2.1 Images as Compositions

The first central idea behind RC is to model images as being composed of several independent objects with each pixel belonging to one of them. Unlike in classic segmentation where each pixel is assigned to a predefined class, the goal here is to simply segregate different objects. In doing so we avoid all ambiguities that might arise from a superposition of their representations. Of course, the information about which objects are present and which pixels they consist of is unknown in practical applications. So for each image, the aim is to infer both the object representations and the corresponding pixel assignments.

Formally, we introduce a binary latent vector

for each pixel that specifies which of the objects it belongs to. Therefore, with the constraint that . Let denote the number of pixels in the image . Then we define the prior over as:


where the ’s are assumed to be independent given . The assumed probabilistic structure is shown in (a). We assume

to be uniformly distributed for simplicity, but its estimation can be incorporated into the algorithm if required (see Appendix).

2.2 Objects

So far, we have used the word object to describe a group of pixels that somehow “belong together”. The second central idea of RC is to concretize that notion using mutual predictability of the pixels. Intuitively, knowing about some pixel values that belong to an object helps in predicting the others. An example can be seen in Figure 1(c), where the corrupted pixels in the bottom left corner of the square could be reconstructed from knowledge about the rest of the square, but not from any of the triangles. So we define an object as a group of pixels that help in predicting each other, but do not carry information about pixels outside of that group.

Predictability, as we use it here, is derived from the structure of the underlying data-distribution. Knowledge about this structure is also precisely what is needed in order to remove corruption from an image. Based on this insight, we propose to use a denoising autoencoder (DAE) to measure predictability.

2.3 Denoising Autoencoder

Let $f$ be the encoder and $g$ be the decoder of a DAE, such that $h = f(x)$ is the encoded representation of input $x$ and $g(h)$ is the decoded output. The DAE is trained to remove corruption from images of single objects and thus learns a local model of the data generating distribution (Vincent et al., 2008; Bengio et al., 2013b). After training, the same DAE is used for each of the $K$ clusters to get predictions $m_{i,k} = g(h_k)_i$ from cluster $k$ for pixel $i$, where the object in cluster $k$ is represented by $h_k$:

$$P(x_i \mid h_k, z_{i,k} = 1) = m_{i,k}^{x_i} (1 - m_{i,k})^{1 - x_i}.$$

Here the $x_i$'s are assumed to be independent given $h$. Combining this with the latent variables $z$ we get:

$$P(x \mid h, z) = \prod_{i=1}^{D} \prod_{k=1}^{K} \left( m_{i,k}^{x_i} (1 - m_{i,k})^{1 - x_i} \right)^{z_{i,k}}.$$
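As a concrete illustration, a single-hidden-layer DAE of the kind used in the experiments can be sketched as follows. This is a minimal NumPy sketch with untrained random weights; the class and variable names are ours, not from the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class TinyDAE:
    """Single-hidden-layer denoising autoencoder (illustrative; weights untrained)."""
    def __init__(self, n_pixels, n_hidden):
        self.W1 = rng.normal(0, 0.1, (n_pixels, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0, 0.1, (n_hidden, n_pixels))
        self.b2 = np.zeros(n_pixels)

    def encode(self, x):
        # f(x): hidden representation h
        return np.tanh(x @ self.W1 + self.b1)

    def decode(self, h):
        # g(h): per-pixel Bernoulli means in (0, 1) via sigmoid outputs
        return sigmoid(h @ self.W2 + self.b2)

dae = TinyDAE(n_pixels=64, n_hidden=16)
x = (rng.random(64) < 0.3).astype(float)   # a random binary "image"
m = dae.decode(dae.encode(x))              # predictions m_i in (0, 1)
```

In the experiments, such a network is trained to map corrupted single-object images back to their clean versions; here only the forward pass is shown.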
2.4 Clustering

Figure 2: (a) The assumed probabilistic structure. (b) A schematic illustration of one iteration of the RC algorithm.

We can now outline a clustering algorithm that estimates the object identities and the corresponding pixel assignments. Formally, we seek to maximize the complete data log-likelihood:

$$\log P(x, z \mid h, \pi) = \sum_{i=1}^{D} \sum_{k=1}^{K} z_{i,k} \left[ \log \pi_k + x_i \log m_{i,k} + (1 - x_i) \log(1 - m_{i,k}) \right].$$
This can be done in an iterative procedure where we start by randomly initializing the latent cluster assignments and then alternating between the following two steps:

  1. Apply the autoencoder to each of the partial images that are assigned to the clusters to get a new estimate of the object representations. (R-step)

  2. Re-assign the pixels to the clusters according to their reconstruction accuracy. (E-step)
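The two alternating steps can be sketched in NumPy as follows. This is an illustrative implementation under our own naming, with a simple stand-in function in place of a trained DAE:

```python
import numpy as np

rng = np.random.default_rng(1)

def reconstruction_clustering(x, dae_predict, K=3, iters=10):
    """Sketch of the RC loop for a flat binary image x.

    dae_predict maps a (soft-masked) image to per-pixel Bernoulli means;
    in the paper this is a trained denoising autoencoder.
    """
    D = x.size
    pi = np.full(K, 1.0 / K)
    gamma = rng.dirichlet(np.ones(K), size=D)   # random soft assignments, rows sum to 1
    for _ in range(iters):
        # R-step: reconstruct each cluster's partial image gamma_k * x
        m = np.stack([dae_predict(gamma[:, k] * x) for k in range(K)], axis=1)  # (D, K)
        m = np.clip(m, 1e-6, 1 - 1e-6)
        # E-step: Bernoulli likelihood of each pixel under each cluster, normalized
        lik = pi * m ** x[:, None] * (1 - m) ** (1 - x[:, None])
        gamma = lik / lik.sum(axis=1, keepdims=True)
    return gamma

# dummy "DAE": a smoother standing in for a trained network
dummy = lambda v: 0.5 + 0.4 * (v - 0.5)
x = (rng.random(16) < 0.5).astype(float)
gamma = reconstruction_clustering(x, dummy, K=2, iters=5)
```

With a real trained DAE in place of `dummy`, the soft assignments `gamma` converge towards a segregation of the objects.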

2.4.1 Reconstruction Step

The R-step applies the encoder to generate a new object representation from each of the partial images that are assigned to the clusters. We call $\gamma_k \odot x$ a partial image since the encoder only gets to see as much of each pixel of the original image as has been soft-assigned to the current cluster. The DAE then denoises the “corruption” caused by the cluster assignments. The R-step is thus given by the following formula, where $\odot$ denotes point-wise multiplication:

$$h_k = f(\gamma_k \odot x).$$
Unfortunately, this step cannot be guaranteed to increase the expected log-likelihood, because only in expectation does the DAE map from regions of low likelihood to regions of higher likelihood. Moreover, this property only holds for the whole image and not for all subsets of pixels. Thus, convergence cannot be proven and RC is not an Expectation Maximization algorithm (Dempster et al., 1977). Nevertheless, empirical results show that convergence does occur reliably (Section 4.2).

2.4.2 Estimation Step

In the E-step, for each pixel $i$ the posterior of $z_{i,k}$, given the data and the predictions of the autoencoders based on the object representations, is

$$\gamma_{i,k} := P(z_{i,k} = 1 \mid x_i, h) = \frac{\pi_k \, P(x_i \mid h_k, z_{i,k} = 1)}{\sum_{j=1}^{K} \pi_j \, P(x_i \mid h_j, z_{i,j} = 1)}.$$

In this paper, we assume the pixels to be binary and the predictions $m_{i,k}$ of the network to correspond to the mean of a Bernoulli distribution. Then the following performs a soft-assignment of the pixels to the $K$ different clusters:

$$\gamma_{i,k} = \frac{\pi_k \, m_{i,k}^{x_i} (1 - m_{i,k})^{1 - x_i}}{\sum_{j=1}^{K} \pi_j \, m_{i,j}^{x_i} (1 - m_{i,j})^{1 - x_i}}.$$
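For binary pixels this soft-assignment is only a few lines of code. The sketch below (in our own notation) shows it on a toy example with three pixels and two clusters:

```python
import numpy as np

def e_step(x, m, pi):
    """Soft-assign binary pixels x (shape D) to clusters
    given Bernoulli means m (shape D x K) and priors pi (shape K)."""
    m = np.clip(m, 1e-6, 1 - 1e-6)
    lik = pi * m ** x[:, None] * (1 - m) ** (1 - x[:, None])
    return lik / lik.sum(axis=1, keepdims=True)

x = np.array([1.0, 0.0, 1.0])
m = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.6, 0.5]])
gamma = e_step(x, m, np.array([0.5, 0.5]))
# pixel 0 is predicted well by cluster 0, so gamma[0, 0] > gamma[0, 1]
```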
3 Experiments

Figure 3: One example from each of the six datasets. The input images are shown on the top row with the corresponding ground-truth grouping below.

We evaluated RC on a series of artificially generated datasets consisting of binary images of varying complexity. For each dataset, a DAE was trained to remove salt & pepper noise on images with single objects. The autoencoders used were fully connected feed-forward neural networks with a single hidden layer and sigmoid output units. A random search was used to select appropriate hyperparameters (see Appendix for details). The best DAE obtained for each dataset was used for reconstruction clustering on 1000 test images containing multiple objects, and the binding performance was evaluated based on ground-truth object identities. All the code for this paper (including the creation of the datasets and figures) is available online on GitHub.
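The salt & pepper corruption used for DAE training can be implemented as follows. This is a sketch under our own naming; the noise level p is one of the hyperparameters of the random search:

```python
import numpy as np

def salt_and_pepper(x, p, rng):
    """Corrupt a binary image: each pixel is independently replaced,
    with probability p, by 0 or 1 (each with probability p/2)."""
    x = x.copy()
    flip = rng.random(x.shape) < p
    x[flip] = (rng.random(x.shape) < 0.5)[flip].astype(x.dtype)
    return x

rng = np.random.default_rng(0)
img = np.zeros((8, 8))
noisy = salt_and_pepper(img, p=0.5, rng=rng)
```

The DAE is then trained to map `noisy` back to `img` under the binomial cross-entropy loss.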

3.1 Datasets

Representative examples from the datasets are shown in Figure 3.

Simple Superposition

A collection of simple pixel patterns, two of which are superimposed. Taken from Rao et al. (2008). This is a simple dataset with no translations but significant overlap between patterns.


Shapes

Taken from Reichert & Serre (2013). Three shapes are randomly placed in an image (possibly with overlap). This dataset tests binding of shapes under translation invariance and varying overlap.


Bars

Introduced by Földiák (1990) to demonstrate unsupervised learning of independent components of an image. We use the variant from Reichert & Serre (2013), which employs 6 horizontal and 6 vertical lines placed at random positions in the image.


Corners

This dataset consists of 8 corner shapes placed in random orientations and positions, such that 4 of them align to form a square. It was introduced by Reichert & Serre (2013) to demonstrate that spatial connectedness is not a requirement for binding.


MNIST+Shape

Another dataset from Reichert & Serre (2013), which combines a random shape from the Shapes dataset with a single MNIST digit. This dataset is useful for investigating the binding of multiple types of objects.


Multi MNIST

Three random MNIST digits are randomly placed in a 48×48 image. This provides a more challenging setup with multiple complex objects.

3.2 Evaluation

Since the data is generated, a ground-truth segmentation for each image is available. For the binding task, all pixels corresponding to the same object should be clustered together. We evaluated performance by measuring the Adjusted Mutual Information (AMI; Vinh et al., 2010) between the true segmentation and the result of the binding, which we refer to as the score. This score measures how well two cluster assignments agree: it takes a value of 1 when they are equivalent, and 0 when their agreement corresponds to that expected by chance. Only pixels that unambiguously belong to one object were counted, ignoring background pixels and regions where multiple objects overlap.
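The masking step of this evaluation can be sketched as below. For brevity the sketch computes plain (unadjusted) normalized mutual information rather than the chance-corrected AMI used in the paper; in practice one would use a library implementation such as scikit-learn's `adjusted_mutual_info_score`. All function names here are ours:

```python
import numpy as np

def masked_labels(true_seg, pred_seg, overlap_mask, background_label=0):
    """Keep only pixels that unambiguously belong to one object."""
    keep = (true_seg != background_label) & (~overlap_mask)
    return true_seg[keep], pred_seg[keep]

def nmi(a, b):
    """Unadjusted normalized mutual information between two labelings
    (no correction for chance, unlike AMI)."""
    n = a.size
    cont = np.zeros((a.max() + 1, b.max() + 1))
    for i, j in zip(a, b):
        cont[i, j] += 1
    p = cont / n
    pa, pb = p.sum(axis=1), p.sum(axis=0)
    nz = p > 0
    mi = (p[nz] * np.log(p[nz] / (pa[:, None] * pb[None, :])[nz])).sum()
    ha = -(pa[pa > 0] * np.log(pa[pa > 0])).sum()
    hb = -(pb[pb > 0] * np.log(pb[pb > 0])).sum()
    return mi / max(np.sqrt(ha * hb), 1e-12)

true_seg = np.array([0, 1, 1, 2, 2, 2])   # 0 = background
pred_seg = np.array([0, 2, 2, 1, 1, 1])   # same grouping, relabeled
overlap  = np.zeros(6, dtype=bool)
t, p = masked_labels(true_seg, pred_seg, overlap)
score = nmi(t, p)   # the groupings agree up to relabeling
```

Note that the score is invariant to relabeling of the clusters, which is essential since RC's cluster indices carry no meaning.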

4 Results

4.1 Scores

(a) Overall Scores
(b) Convergence
Figure 4: Left: Mean AMI score over 1000 test samples for all datasets and various numbers of clusters $K$. Right: Convergence of the log-likelihood on the shapes dataset for different numbers of clusters, showing the mean (line) and standard deviation (shaded) over the test set.

Figure 4(a) shows the mean scores obtained using RC for each dataset, averaged over 100 runs, for different choices of the number of clusters $K$. Results are consistent across runs, hence the standard deviations are very low and barely visible. The optimal number of clusters is two for Simple Superposition and MNIST+Shape, three for Multi MNIST and Shapes, five for Corners, and 12 for Bars. Scores are higher than 0.5 for all datasets and higher than 0.8 for four out of the six datasets, demonstrating the ability of RC to successfully bind objects together.

4.2 Convergence

Figure 4(b) shows the convergence of the mean log-likelihood over RC iterations on the shapes dataset. Convergence is quick, typically within 5-10 iterations, depending on the chosen number of clusters and the dataset (not shown). As expected, the final likelihood is highest when the number of clusters equals the number of objects in the shapes dataset (three), matching the results from Figure 4(a). The likelihood is much lower with too few clusters, drops again slightly with too many, and is lowest with a single cluster. In some cases the correct choice of $K$ did not result in the highest likelihood, but in general this correspondence appeared to hold. If the number of objects is unknown, this trend can be used to determine the correct number of clusters.
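This model-selection heuristic can be sketched as follows, where `run_rc` is a hypothetical function (not from the paper's code) that runs RC with $K$ clusters and returns the final log-likelihood:

```python
def select_k(x, run_rc, candidates=(1, 2, 3, 4, 5)):
    """Pick the number of clusters whose final RC log-likelihood is highest.
    `run_rc(x, K)` is assumed to run RC and return the final log-likelihood."""
    scores = {K: run_rc(x, K) for K in candidates}
    return max(scores, key=scores.get)

# toy stand-in: likelihood peaks at the true number of objects (3)
toy_ll = lambda x, K: -abs(K - 3)
best_K = select_k(None, toy_ll)
```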

4.3 Qualitative Analysis

Figure 5: The top plot shows the score and confidence for each of the 1000 test images from the shapes dataset, sorted by score. The confidence is the average value of $\max_k \gamma_{i,k}$ over the evaluated pixels (non-background, non-overlap). The central part of the figure shows six examples (columns) along with the cluster assignments (indicated by different colors) over RC iterations. The corresponding ground-truth is shown at the bottom. The right vertical plot shows the log-likelihood over the RC iterations corresponding to the displayed cluster assignments. Similar plots for the other datasets are included in the Appendix.

Figure 5 shows a few example RC runs on the shapes dataset for qualitative evaluation. The initial cluster assignments are random, therefore all observed structure is due to the clustering process. The final clustering corresponds well to the ground truth even for cases with significant overlap. Again, it is notable that RC converges quickly (within 5 iterations).

4.4 Loss vs Score

(a) Loss Vs Score
(b) Single vs. Multi-object DAE training
Figure 6: Left: Relationship between the DAE loss and the AMI score. All networks have 250 hidden units and were trained with random learning rates and initializations. A few networks that failed to train were removed from the plot for better visualization. Right: RC scores obtained when training DAEs on multi-object images vs. single object images.

RC utilizes autoencoders trained with the denoising objective for binding. It is therefore instructive to examine the relationship between denoising performance and the final RC binding score. For this purpose, we trained 100 DAEs with the same architecture on each dataset with random learning rates and initializations, and then performed RC using each of them. Figure 6(a) shows the relationship between the denoising loss and binding score for each dataset. Lower loss correlates with higher score for all datasets, indicating that denoising is a suitable surrogate training objective. We added a regression line to indicate that relation for each dataset, even though for MNIST+Shape and Multi MNIST the relationship is clearly not linear; instead, the individual points are approximately arranged on a curve. This suggests that there is a direct but complex interplay between the denoising performance and the score.

4.5 Training on Multiple Objects

So far the DAEs were trained on single-object images, then used to bind objects in multi-object images. In general it is desirable to not require single-object images for training, and be able to directly use any image without this restriction. This would remove the last bit of supervision and make RC a truly unsupervised method.

Why should this work at all? On the surface it seems that RC would depend on the DAE to prefer single objects in order to work correctly. However, even if each cluster tries to reconstruct every object, there will be small asymmetries due to the difference in inputs they see. Since no object carries any information about the shape and position of another object in our datasets, this will lead to differences in prediction quality of the objects. The resulting difference in reconstruction quality will then be amplified by RC and can still lead to a segregation of the objects.

To test this scenario, we performed a new random search to tune DAE hyperparameters for the case of multi-object training. Similar to the single-object case, we then used the best obtained DAEs to perform RC on test examples. We found that with soft-assignments to the clusters, the differences were too small and would even out over several iterations, leading to uniform cluster assignments. By changing the E-step to hard (K-Means-like) assignments, we were able to amplify these changes enough to make the whole procedure work.
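Hardening the E-step amounts to replacing the soft posterior with a one-hot argmax, as in the sketch below (our own notation):

```python
import numpy as np

def harden(gamma):
    """K-means-like E-step: one-hot assignment to the most likely cluster."""
    hard = np.zeros_like(gamma)
    hard[np.arange(gamma.shape[0]), gamma.argmax(axis=1)] = 1.0
    return hard

gamma = np.array([[0.6, 0.4],
                  [0.5, 0.5],
                  [0.1, 0.9]])
hard = harden(gamma)
# row 1 is a tie; argmax assigns it to the first cluster
```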

Figure 6(b) shows that DAEs trained on multi-object images can indeed be used for binding via RC with hard assignments, although they lead to lower scores in comparison. Further discussion and examples for this case can be found in Appendix C.

4.6 Generalization to a new domain

Figure 7: Binding novel objects via RC. The DAE used was trained on the Multi MNIST dataset.

A central intuition behind our approach to binding is that the low-level structures learned by the model will generalize to new and unseen configurations. Evaluation on unseen test sets demonstrated this to be true, but we can take it one step further and test what happens when we confront our method with novel objects that the autoencoder has not been trained on.

We ran RC on several images with non-digits using a DAE trained on the Multi MNIST dataset. Figure 7 shows that RC “correctly” binds letters and circles together. We also show images for which the resulting binding differs from our expectation. It appears that the network has mainly learned to bind based on spatial proximity, with a slight bias towards vertical proximity. This is to be expected, since it has only seen digits of roughly the same size so far, and because the autoencoder used is very limited. Nevertheless, it is very interesting that a fully connected network, which is permutation invariant, learns this preference for spatial proximity entirely from data. It is reasonable to speculate that in the future it may be possible to recover other Gestalt principles, such as continuity and similarity, with a similar procedure.

5 Relationship to other Methods

The binding problem and its possible solutions are a long-standing debate in the neuroscience literature (see e.g. Milner (1974); von der Malsburg (1981); Gray (1999); Treisman (1999); Di Lollo (2012)). A major thread of work on binding has been inspired by the temporal correlation theory (von der Malsburg, 1981), based on utilizing synchronous oscillations to bind neuronal activities together. von der Malsburg (1995) provides an overview of these ideas. Recently, these ideas were implemented using complex-valued activations in neural networks to jointly encode firing rate and phase (Rao et al., 2008; Reichert & Serre, 2013). Such binding mechanisms are close to their biological inspiration, clustering only implicitly through synchronization. In contrast, RC is based on a mathematical framework which explicitly incorporates binding.

Mechanisms for tackling the binding problem which do not require temporal synchronization have also been proposed (e.g. O’Reilly et al., 2003). O’Reilly & Busby (2002) argued that the intuitive explanation of the binding problem from Figure 1(b) only applies if the distributed features themselves are local codes. They suggested that neural networks can avoid the binding problem using coarse-coded representations. Various feature representation types, including coarse-coding, and their limitations were described by Hinton (1984).

In principle, Recurrent Neural Networks (RNNs; e.g. Robinson & Fallside, 1987; Werbos, 1988) can solve the binding problem by learning a mechanism to avoid it. Psychologists (Di Lollo, 2012) and machine learning researchers (Weng et al., 2006) alike have suggested feedback as a mechanism for binding. An RNN may utilize an implicit or explicit attention mechanism to selectively process different parts of the input (Schmidhuber & Huber, 1991; Mnih et al., 2014; Bahdanau et al., 2014). In this context, explicit binding via RC can be seen as a technique for paying attention to multiple objects at once, instead of focusing on them sequentially.

In some aspects, RC is similar to segmentation algorithms. The main difference is that RC learns the segmentation from the data in a largely unsupervised manner. In this sense, it is more similar to superpixel methods (see e.g. Achanta et al. (2012) for an overview). However, these methods impose a handcrafted similarity measure over pixels or pixel regions, whereas RC learns a non-linear similarity measure from the data, parameterized by a DAE.

6 Conclusion and Future Work

We introduced the Reconstruction Clustering framework to explicitly model data as a composition of objects, where the notion of object-ness is defined by mutual predictability. Compared to many previous solutions to the binding problem, this framework is mathematically rigorous, integrates well with current representation learning methods, and is effective for a variety of binary image datasets. While a typical representation learning method (such as a denoising autoencoder) learns a static binding of features, Reconstruction Clustering utilizes it to iteratively perform dynamic binding for every input example by introducing interaction between the statically bound features extracted by the autoencoder. In particular, this interaction enables dynamic binding of feature combinations never seen before by the autoencoder.

This paper lays the groundwork for many concrete lines of future exploration. The treatment of real-valued inputs is an important next step in extending RC towards natural data. The use of more powerful autoencoders will also be key. Integrating RC with the training of the DAE should help to deal with multiple objects in the training data. Since the method is general, we expect to apply it to other domains such as audio data (binding different speaker voices together) or medical data (binding various related symptoms of a disease together). A particularly interesting direction for future work is to show that Gestalt principles are a natural result of such a representation learning approach.


Acknowledgements

We thank Jan Koutník, Sjoerd van Steenkiste, Boyan Beronov and Julian Zilly for helpful discussions and comments. This project was funded by the EU project NASCENCE (FP7-ICT-317662).


Appendix A Reconstruction Clustering Derivation

This section contains a more detailed derivation of Reconstruction Clustering (RC) for binary inputs. It follows the notation and derivation of an Expectation Maximization (EM) algorithm wherever possible. Only for the M-step does RC deviate from EM.


Let $x_1, \dots, x_D$ be binary random variables (one for each pixel) that are distributed according to a mixture of $K$ Bernoulli distributions with means $m_{i,k}$ and mixing coefficients $\pi_k$ that sum to one, $\sum_{k=1}^{K} \pi_k = 1$. Under this model the data likelihood given the parameters is given by:

$$P(x \mid m, \pi) = \prod_{i=1}^{D} \sum_{k=1}^{K} \pi_k \, m_{i,k}^{x_i} (1 - m_{i,k})^{1 - x_i}.$$

By defining $x = (x_1, \dots, x_D)$ and $m = (m_{i,k})$, and assuming independence of the $x_i$'s given $m$ and $\pi$ (but not identical distribution)¹ we get the (incomplete) log-likelihood function for this model:

$$\log P(x \mid m, \pi) = \sum_{i=1}^{D} \log \sum_{k=1}^{K} \pi_k \, m_{i,k}^{x_i} (1 - m_{i,k})^{1 - x_i}.$$

¹This assumption means that we assume the hidden representation of the DAE to capture the structure in the image well.
Let us now introduce an explicit binary latent variable $z_i \in \{0, 1\}^K$ with $\sum_{k} z_{i,k} = 1$, associated with each $x_i$. Let the prior distribution be:

$$P(z_i \mid \pi) = \prod_{k=1}^{K} \pi_k^{z_{i,k}},$$

where we set $\pi_k = \frac{1}{K}$ and assume the $z_i$'s to be independent given $\pi$. With that we define the conditional distribution of $x_i$ given the latent variables as:

$$P(x_i \mid m, z_i) = \prod_{k=1}^{K} \left( m_{i,k}^{x_i} (1 - m_{i,k})^{1 - x_i} \right)^{z_{i,k}}.$$
If we marginalize Equation 12 over all choices of $z_i$ we recover Equation 8:

$$P(x_i \mid m, \pi) = \sum_{z_i} P(z_i \mid \pi) \, P(x_i \mid m, z_i) = \sum_{k=1}^{K} \pi_k \, m_{i,k}^{x_i} (1 - m_{i,k})^{1 - x_i}.$$

The second line is obtained by realizing that the sum over $z_i$ runs over exactly $K$ terms, each corresponding to a $z_i$ with one entry equal to one and all other entries equal to zero. So we can replace this sum by a sum over $k$. The product over the entries of $z_i$ then vanishes except for the term corresponding to the $k$-th entry.

Using the same conditional independence assumption from before we can thus write the data distribution given all the latent variables as follows:

$$P(x \mid m, z) = \prod_{i=1}^{D} \prod_{k=1}^{K} \left( m_{i,k}^{x_i} (1 - m_{i,k})^{1 - x_i} \right)^{z_{i,k}}.$$

And by using Bayes rule and assuming that $z$ is independent of $m$:

$$P(x, z \mid m, \pi) = P(x \mid m, z) \, P(z \mid \pi).$$

If we set $z = (z_1, \dots, z_D)$,² the complete-data log likelihood becomes:

$$\log P(x, z \mid m, \pi) = \sum_{i=1}^{D} \sum_{k=1}^{K} z_{i,k} \left[ \log \pi_k + x_i \log m_{i,k} + (1 - x_i) \log(1 - m_{i,k}) \right].$$

²Here we deviate slightly from the notation in the paper.
To maximize with respect to $z$ and $m$ we follow the same idea as the EM algorithm, based on the observation that if we knew either of the two, optimizing the other would be feasible. So we divide the optimization problem into two steps where we pretend to know either $m$ (E-step) or $z$ (M-step).

In the E-step we assume $m$ to be known and calculate the posterior probability of $z_{i,k} = 1$ for each pixel, calling it $\gamma_{i,k}$ (we assume the $x_i$'s to be conditionally independent given $m$):

$$\gamma_{i,k} := P(z_{i,k} = 1 \mid x_i, m) = \frac{\pi_k \, m_{i,k}^{x_i} (1 - m_{i,k})^{1 - x_i}}{\sum_{j=1}^{K} \pi_j \, m_{i,j}^{x_i} (1 - m_{i,j})^{1 - x_i}}.$$
Next we calculate the value $Q$ used in EM, which is defined as the expectation of the complete-data log-likelihood with respect to the posterior of $z$ given the data and the old parameters $m^{\text{old}}$:

$$Q(m, m^{\text{old}}) = \sum_{i=1}^{D} \sum_{k=1}^{K} \gamma_{i,k} \left[ \log \pi_k + x_i \log m_{i,k} + (1 - x_i) \log(1 - m_{i,k}) \right].$$

In the M-step of EM we aim to maximize $Q$ over all choices of $m$ and $\pi$. Using a Lagrange multiplier to enforce $\sum_{k} \pi_k = 1$ we find:

$$\pi_k = \frac{1}{D} \sum_{i=1}^{D} \gamma_{i,k}.$$
But when maximizing $Q$ with respect to $m$ we see that the maximum is trivially obtained by setting $m_{i,k} = x_i$ for all $i$ and $k$. This is due to the fact that the problem is actually ill-posed, in the sense that we have $D \times K$ parameters to fit for each datapoint. So there are infinitely many solutions which achieve the optimal log-likelihood of the data of 0.

At this point we introduce an autoencoder with encoder $f$ and decoder $g$ to restrict the capacity of our model by forcing $m_k$ to be:

$$m_k = g(f(\gamma_k \odot x)).$$

We use this reconstruction step (Equation 28) instead of an actual maximization step, thus deviating from the EM formulation.

Appendix B Training Details

All experiments have been performed with the brainstorm library and were organized and logged using sacred. The code for this paper can be found on GitHub.

B.1 Training Denoising Autoencoders

  • simple feed-forward, fully connected NNs

  • with sigmoid output layer

  • loss is Binomial Cross-Entropy Error

  • trained with SGD

  • minibatch size 100

  • salt & pepper noise

  • early stopped when the validation Binomial CEE doesn't decrease for more than 10 epochs
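The early-stopping rule in the last bullet can be sketched as follows. The function names are ours, not from the brainstorm library; `train_epoch` runs one epoch of SGD and `validate` returns the validation loss:

```python
def train_with_early_stopping(train_epoch, validate, patience=10, max_epochs=500):
    """Stop when the validation loss has not improved for `patience` epochs;
    return the best epoch and its loss."""
    best_loss, best_epoch = float("inf"), -1
    for epoch in range(max_epochs):
        train_epoch()
        loss = validate()
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break   # no improvement for `patience` epochs
    return best_epoch, best_loss

# toy run: loss improves for 5 epochs, then plateaus
losses = iter([1.0, 0.8, 0.6, 0.5, 0.45, 0.45] + [0.45] * 100)
best_epoch, best_loss = train_with_early_stopping(lambda: None, lambda: next(losses))
```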

B.2 Random Search

There are several hyperparameters to be chosen for the denoising autoencoders. To find good values we performed a random search with 100 runs for each dataset. For each run we randomly sampled from the following parameters:

  • learning rate log-uniform from

  • amount of salt & pepper noise from

  • hidden layer size from

  • hidden layer activation function from [ReL, sigmoid, tanh]

The best network configurations found by that search can be found in Table 1.

Figure 8: Summary of the scores achieved during the random search
Dataset learning rate # hidden units activation salt&pepper score
bars 0.768015 100 ReL 0.0 0.951809
corners 0.001920 100 ReL 0.0 0.853866
multi_mnist 0.011362 1000 ReL 0.6 0.651657
mnist_shape 0.031685 250 sigmoid 0.6 0.545559
shapes 0.083147 500 tanh 0.4 0.928792
simple_superpos 0.366627 100 ReL 0.1 0.890472
Table 1: Configuration of the best network for each dataset as found by the random search.

B.3 Random Search for Training with Multiple Objects

For training with multiple objects we performed an equivalent random search for hyperparameters. The only differences are the training data and that, for determining the final score, we use K-means-like (hard) cluster assignments in RC. Note also that we didn't include the Simple Superposition dataset, since for it we only have 120 images with multiple objects available and no separate test set.

Figure 9: Summary of the scores achieved during the random search for training with multiple objects
Dataset learning rate # hidden units activation salt&pepper score
bars 0.012192 100 sigmoid 0.8 0.851777
corners 0.026035 100 ReL 0.7 0.704285
mnist_shape 0.033200 1000 ReL 0.6 0.259646
multi_mnist 0.001786 250 sigmoid 0.9 0.614277
shapes 0.049402 100 sigmoid 0.9 0.776656
Table 2: Configuration of the best network trained on multiple objects for each dataset as found by the random search.

Appendix C Multi Object Training

When training the DAEs on images with multiple objects, it is less obvious why running RC should lead to a segregation of the objects: it seems that the autoencoder should always try to reconstruct the whole image, including all the objects. Indeed, if we run normal (soft) RC, we see that after a few iterations each pixel is equally represented by each cluster.

By switching to hard cluster assignments we eliminate this stable state and force the clusters to compete more for the pixels. Together with the fact that in our datasets objects don't carry any information about other objects, this leads to a stronger amplification of the initial differences in reconstruction quality. Figure 10 shows this process on the shapes dataset. Note that hard RC converges even faster, but generally leads to worse performance.

Figure 10: Example iterations of RC when using hard assignments and a DAE that has been trained only on images with multiple objects.

Appendix D Additional Figures
