Neural Expectation Maximization

08/11/2017 ∙ by Klaus Greff, et al. ∙ IDSIA

Many real world tasks such as reasoning and physical interaction require identification and manipulation of conceptual entities. A first step towards solving these tasks is the automated discovery of distributed symbol-like representations. In this paper, we explicitly formalize this problem as inference in a spatial mixture model where each component is parametrized by a neural network. Based on the Expectation Maximization framework we then derive a differentiable clustering method that simultaneously learns how to group and represent individual entities. We evaluate our method on the (sequential) perceptual grouping task and find that it is able to accurately recover the constituent objects. We demonstrate that the learned representations are useful for next-step prediction.


1 Introduction

Learning useful representations is an important aspect of unsupervised learning, and one of the main open problems in machine learning. It has been argued that such representations should be distributed hinton1984distributed ; vondermalsburg1981correlation and disentangled barlow1989finding ; schmidhuber1992learninga ; bengio2013deep . The latter has recently received an increasing amount of attention, producing representations that can disentangle features like rotation and lighting chen2016infogan ; higgins2017betavae .

So far, these methods have mostly focused on the single object case whereas, for real world tasks such as reasoning and physical interaction, it is often necessary to identify and manipulate multiple entities and their relationships. In current systems this is difficult, since superimposing multiple distributed and disentangled representations can lead to ambiguities. This is known as the Binding Problem milner1974model ; vondermalsburg1981correlation ; hinton1984distributed and has been extensively discussed in neuroscience treisman1996binding . One solution to this problem involves learning a separate representation for each object. In order to allow these representations to be processed identically they must be described in terms of the same (disentangled) features. This would then avoid the binding problem, and facilitate a wide range of tasks that require knowledge about individual objects. This solution requires a process known as perceptual grouping: dynamically splitting (segmenting) each input into its constituent conceptual entities.

In this work, we tackle this problem of learning how to group and efficiently represent individual entities, in an unsupervised manner, based solely on the statistical structure of the data. Our work follows a similar approach as the recently proposed Tagger greff2016tagger and aims to further develop the understanding, as well as build a theoretical framework, for the problem of symbol-like representation learning. We formalize this problem as inference in a spatial mixture model where each component is parametrized by a neural network. Based on the Expectation Maximization framework we then derive a differentiable clustering method, which we call Neural Expectation Maximization (N-EM). It can be trained in an unsupervised manner to perform perceptual grouping in order to learn an efficient representation for each group, and naturally extends to sequential data.

2 Neural Expectation Maximization

The goal of training a system that produces separate representations for the individual conceptual entities contained in a given input (here: image) depends on what notion of entity we use. Since we are interested in the case of unsupervised learning, this notion can only rely on statistical properties of the data. We therefore adopt the intuitive notion of a conceptual entity as being a common cause (the object) for multiple observations (the pixels that depict the object). This common cause induces a dependency-structure among the affected pixels, while the pixels that correspond to different entities remain (largely) independent. Intuitively this means that knowledge about some pixels of an object helps in predicting its remainder, whereas it does not improve the predictions for pixels of other objects. This is especially obvious for sequential data, where pixels belonging to a certain object share a common fate (e.g. move in the same direction), which makes this setting particularly appealing.

We are interested in representing each entity (object) $k$ with some vector $\theta_k$ that captures all the structure of the affected pixels, but carries no information about the remainder of the image. This modularity is a powerful invariant, since it allows the same representation to be reused in different contexts, which enables generalization to novel combinations of known objects. Further, having all possible objects represented in the same format makes it easier to work with these representations. Finally, having a separate $\theta_k$ for each object (as opposed to one for the entire image) allows $\theta_k$ to be distributed and disentangled without suffering from the binding problem.

We treat each image as a composition of $K$ objects, where each pixel is determined by exactly one object. Which objects are present, as well as the corresponding assignment of pixels, varies from input to input. Assuming that we have access to the family of distributions $P(x \mid \theta_k)$ that corresponds to an object level representation as described above, we can model each image as a mixture model. Then Expectation Maximization (EM) can be used to simultaneously compute a Maximum Likelihood Estimate (MLE) for the individual $\theta_k$-s and the grouping that we are interested in.

The central problem we consider in this work is therefore how to learn such a $P(x \mid \theta_k)$ in a completely unsupervised fashion. We accomplish this by parametrizing this family of distributions by a differentiable function $f_\phi(\theta)$ (a neural network with weights $\phi$). We show that in that case, the corresponding EM procedure becomes fully differentiable, which allows us to backpropagate an appropriate outer loss into the weights of the neural network. In the remainder of this section we formalize and derive this method, which we call Neural Expectation Maximization (N-EM).

2.1 Parametrized Spatial Mixture Model

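We model each image $x \in \mathbb{R}^D$ as a spatial mixture of $K$ components parametrized by vectors $\theta_1, \dots, \theta_K \in \mathbb{R}^M$. A differentiable non-linear function $f_\phi$ (a neural network) is used to transform these representations $\theta_k$ into parameters $\psi_{i,k} = f_\phi(\theta_k)_i$ for separate pixel-wise distributions. These distributions are typically Bernoulli or Gaussian, in which case $\psi_{i,k}$ would be a single probability or a mean and variance respectively. This parametrization assumes that given the representation, the pixels are independent but not identically distributed (unlike in standard mixture models). A set of binary latent variables $Z \in \{0,1\}^{D \times K}$ encodes the unknown true pixel assignments, such that $z_{i,k} = 1$ iff pixel $i$ was generated by component $k$, and $\sum_k z_{i,k} = 1$. A graphical representation of this model can be seen in Figure 1, where $\pi = (\pi_1, \dots, \pi_K)$ are the mixing coefficients (or prior for $z$). The full likelihood for $x$ given $\theta = (\theta_1, \dots, \theta_K)$ is given by:

$$P(x \mid \theta) = \prod_{i=1}^{D} \sum_{z_i} P(x_i, z_i \mid \psi_i) = \prod_{i=1}^{D} \sum_{k=1}^{K} \underbrace{P(z_{i,k} = 1)}_{\pi_k} \, P(x_i \mid \psi_{i,k}, z_{i,k} = 1). \quad (1)$$
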
Figure 1: left: The probabilistic graphical model that underlies N-EM. right: Illustration of the computations for two steps of N-EM.
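
As a rough illustration of the likelihood in (1), the following sketch evaluates it for Bernoulli pixels. The single-layer sigmoid network, the names `pixel_params` and `log_likelihood`, and all sizes are assumptions made for this example only; they are not taken from the released code.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def pixel_params(theta, W, b):
    """Map K representations (K, M) to per-pixel Bernoulli means (K, D)."""
    return sigmoid(theta @ W + b)

def log_likelihood(x, psi, pi):
    """log P(x | theta) = sum_i log sum_k pi_k * P(x_i | psi[k, i])."""
    # Bernoulli likelihood of every pixel under every component: shape (K, D)
    p = np.where(x[None, :] == 1, psi, 1.0 - psi)
    mixture = (pi[:, None] * p).sum(axis=0)   # marginalise over components
    return np.log(mixture).sum()              # pixels are independent given theta

rng = np.random.default_rng(0)
K, M, D = 3, 4, 6                             # hypothetical sizes
theta = rng.normal(size=(K, M))
W, b = rng.normal(size=(M, D)), np.zeros(D)
x = rng.integers(0, 2, size=D)                # a tiny binary "image"
pi = np.full(K, 1.0 / K)                      # uniform mixing coefficients
psi = pixel_params(theta, W, b)
print(log_likelihood(x, psi, pi))
```

Note that the sum over components sits inside the product over pixels, which is what makes the mixture spatial: every pixel has its own assignment.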

2.2 Expectation Maximization

Directly optimizing $\log P(x \mid \psi)$ with respect to $\theta$ is difficult due to marginalization over $z$, while for many distributions optimizing $\log P(x, z \mid \psi)$ is much easier. Expectation Maximization (EM; dempster1977maximum ) takes advantage of this and instead optimizes a lower bound given by the expected log likelihood:

$$Q(\theta, \theta^{\text{old}}) = \sum_{z} P(z \mid x, \psi^{\text{old}}) \log P(x, z \mid \psi). \quad (2)$$

Iterative optimization of this bound alternates between two steps: in the E-step we compute a new estimate of the posterior probability distribution over the latent variables given $\theta^{\text{old}}$ from the previous iteration, yielding a new soft-assignment of the pixels to the components (clusters):

$$\gamma_{i,k} := P(z_{i,k} = 1 \mid x_i, \psi_i^{\text{old}}). \quad (3)$$
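
The E-step in (3) amounts to a per-pixel softmax over components. The sketch below assumes Gaussian pixel distributions with a fixed standard deviation; the function name `e_step`, the value of `sigma` and the toy data are illustrative assumptions.

```python
import numpy as np

def e_step(x, psi, pi, sigma=0.25):
    """Responsibilities gamma[k, i] = P(z_{i,k} = 1 | x_i, psi_i).

    Assumes Gaussian pixels with mean psi[k, i] and a fixed, shared sigma.
    """
    log_p = -0.5 * ((x[None, :] - psi) / sigma) ** 2   # log N up to a constant
    log_joint = np.log(pi)[:, None] + log_p            # prior times likelihood
    log_joint -= log_joint.max(axis=0, keepdims=True)  # numerical stability
    w = np.exp(log_joint)
    return w / w.sum(axis=0, keepdims=True)            # normalise over components

x = np.array([0.0, 0.1, 0.9, 1.0])
psi = np.array([[0.0, 0.0, 0.0, 0.0],    # component 0 predicts dark pixels
                [1.0, 1.0, 1.0, 1.0]])   # component 1 predicts bright pixels
gamma = e_step(x, psi, pi=np.array([0.5, 0.5]))
print(gamma.round(3))
```

Each pixel is softly assigned to the component that predicts it best, so the grouping falls out of posterior inference rather than being predicted by the network.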

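In the M-step we then aim to find the configuration of $\theta$ that would maximize the expected log-likelihood using the posteriors computed in the E-step. Due to the non-linearity of $f_\phi$ there exists no analytical solution to $\arg\max_\theta Q(\theta, \theta^{\text{old}})$. However, since $f_\phi$ is differentiable, we can improve $Q(\theta, \theta^{\text{old}})$ by taking a gradient ascent step (see Footnote 1):

$$\theta^{\text{new}} = \theta^{\text{old}} + \eta \frac{\partial Q}{\partial \theta} \quad \text{where} \quad \frac{\partial Q}{\partial \theta_k} \propto \sum_{i=1}^{D} \gamma_{i,k} \, (x_i - \psi_{i,k}) \, \frac{\partial \psi_{i,k}}{\partial \theta_k}. \quad (4)$$

Footnote 1: Here we assume that $P(x_i \mid z_{i,k} = 1, \psi_{i,k})$ is given by $\mathcal{N}(x_i; \mu = \psi_{i,k}, \sigma^2)$ for some fixed $\sigma^2$, yet a similar update arises for many typical parametrizations of pixel distributions.

The resulting algorithm belongs to the class of generalized EM algorithms and is guaranteed (for a sufficiently small learning rate $\eta$) to converge to a (local) optimum of the data log likelihood (wu1983convergence).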
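
To make the generalized M-step concrete, here is a minimal sketch of one gradient-ascent update on the expected log-likelihood. It assumes Gaussian pixels with fixed variance and a hypothetical bias-free single-layer sigmoid network as the differentiable function; `expected_log_lik` and `m_step` are names introduced for this example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def expected_log_lik(theta, W, x, gamma, sigma=1.0):
    """Expected log-likelihood up to additive constants (Gaussian pixels)."""
    psi = sigmoid(theta @ W)                                   # (K, D)
    return -(gamma * (x[None, :] - psi) ** 2).sum() / (2 * sigma ** 2)

def m_step(theta, W, x, gamma, eta=0.1, sigma=1.0):
    """One gradient-ascent step on the expected log-likelihood w.r.t. theta."""
    psi = sigmoid(theta @ W)
    err = gamma * (x[None, :] - psi) / sigma ** 2              # weighted residual
    grad = (err * psi * (1.0 - psi)) @ W.T                     # chain rule through the network
    return theta + eta * grad, grad

rng = np.random.default_rng(1)
K, M, D = 2, 3, 5                                              # hypothetical sizes
theta, W = rng.normal(size=(K, M)), rng.normal(size=(M, D))
x = rng.uniform(size=D)
gamma = rng.dirichlet(np.ones(K), size=D).T                    # each column sums to 1
theta_new, grad = m_step(theta, W, x, gamma, eta=0.01)
print(expected_log_lik(theta, W, x, gamma),
      expected_log_lik(theta_new, W, x, gamma))
```

The soft-assignments are held fixed during the step, which is what distinguishes the M-step from a joint gradient update of assignments and representations.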

2.3 Unrolling

In our model the information about statistical regularities required for clustering the pixels into objects is encoded in the neural network $f_\phi$ with weights $\phi$. So far we have considered $f_\phi$ to be fixed and have shown how we can compute an MLE for $\theta$ alongside the appropriate clustering. We now observe that by unrolling the iterations of the presented generalized EM, we obtain an end-to-end differentiable clustering procedure based on the statistical model implemented by $f_\phi$. We can therefore use (stochastic) gradient descent and fit the statistical model to capture the regularities corresponding to objects for a given dataset. This is implemented by back-propagating an appropriate loss (see Section 2.4) through “time” (BPTT; werbos1988generalization ; williams1989complexity ) into the weights $\phi$. We refer to this trainable procedure as Neural Expectation Maximization (N-EM), an overview of which can be seen in Figure 1.

Figure 2: RNN-EM Illustration. Note the changed encoder and recurrence compared to Figure 1.

Upon inspection of the structure of N-EM we find that it resembles $K$ copies of a recurrent neural network with hidden states $\theta_k$ that, at each timestep, receive $\gamma_k \odot (x - \psi_k)$ as their input. Each copy generates a new $\psi_k$, which is then used by the E-step to re-estimate the soft-assignments $\gamma$. In order to accurately mimic the M-Step (4) with an RNN, we must impose several restrictions on its weights and structure: the “encoder” must correspond to the Jacobian $\partial \psi_k / \partial \theta_k$, and the recurrent update must linearly combine the output of the encoder with $\theta_k$ from the previous timestep. Instead, we introduce a new algorithm named RNN-EM, obtained by substituting that part of the computational graph of N-EM with an actual RNN (without imposing any restrictions). Although RNN-EM can no longer guarantee convergence of the data log likelihood, its recurrent weights increase the flexibility of the clustering procedure. Moreover, by using a fully parametrized recurrent weight matrix RNN-EM naturally extends to sequential data. Figure 2 presents the computational graph of a single RNN-EM timestep.
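
A single RNN-EM iteration can be sketched as follows. This is only an illustration under stated assumptions: the input to each copy is taken to be the assignment-weighted residual between image and prediction, the recurrent cell is a plain sigmoid RNN, pixels are Gaussian with a fixed `sigma`, the prior over components is uniform, and the weights are random rather than trained. All names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def rnn_em_step(x, theta, gamma, psi, params, sigma=0.25):
    """One RNN-EM iteration for K weight-sharing copies (rows of theta)."""
    W_in, W_rec, W_out = params
    inp = gamma * (x[None, :] - psi)                   # masked residual, (K, D)
    theta = sigmoid(inp @ W_in + theta @ W_rec)        # unconstrained recurrent update
    psi = sigmoid(theta @ W_out)                       # new pixel-wise predictions
    log_p = -0.5 * ((x[None, :] - psi) / sigma) ** 2   # E-step with a uniform prior
    log_p -= log_p.max(axis=0, keepdims=True)
    w = np.exp(log_p)
    gamma = w / w.sum(axis=0, keepdims=True)
    return theta, gamma, psi

rng = np.random.default_rng(0)
K, M, D = 3, 8, 16                                     # hypothetical sizes
params = (rng.normal(scale=0.1, size=(D, M)),          # W_in  (the "encoder")
          rng.normal(scale=0.1, size=(M, M)),          # W_rec
          rng.normal(scale=0.1, size=(M, D)))          # W_out (the "decoder")
x = rng.integers(0, 2, size=D).astype(float)
theta = rng.uniform(size=(K, M))
psi = sigmoid(theta @ params[2])
gamma = np.full((K, D), 1.0 / K)
for _ in range(5):                                     # unrolled iterations
    theta, gamma, psi = rnn_em_step(x, theta, gamma, psi, params)
print(gamma.sum(axis=0))
```

Because the same parameters are shared by all copies, the copies can only differ through their hidden states and the soft-assignments they receive.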

2.4 Training Objective

N-EM is a differentiable clustering procedure, whose outcome relies on the statistical model $f_\phi$. We are interested in a particular unsupervised clustering that corresponds to grouping entities based on the statistical regularities in the data. To train our system, we therefore require a loss function that teaches $f_\phi$ to map from representations $\theta$ to parameters $\psi$ that correspond to pixelwise distributions for such objects. We accomplish this with a two-term loss function that guides each of the $K$ networks to model the structure of a single object independently of any other information in the image:

$$L(x) = -\sum_{i=1}^{D} \sum_{k=1}^{K} \Big[ \underbrace{\gamma_{i,k} \log P(x_i, z_{i,k} \mid \psi_{i,k})}_{\text{intra-cluster loss}} - \underbrace{(1 - \gamma_{i,k}) \, D_{KL}\big[P(x_i) \,\|\, P(x_i \mid \psi_{i,k}, z_{i,k})\big]}_{\text{inter-cluster loss}} \Big]. \quad (5)$$

The intra-cluster loss corresponds to the same expected data log-likelihood $Q$ as is optimized by N-EM. It is analogous to a standard reconstruction loss used for training autoencoders, weighted by the cluster assignment. Similar to autoencoders, this objective is prone to trivial solutions in case of overcapacity, which prevent the network from modelling the statistical regularities that we are interested in. Standard techniques can be used to overcome this problem, such as making $\theta$ a bottleneck or using a noisy version of $x$ to compute the inputs to the network. Furthermore, when RNN-EM is used on sequential data we can use a next-step prediction loss.

Weighting the loss pixelwise is crucial, since it allows each network to specialize its predictions to an individual object. However, it also introduces a problem: the loss for out-of-cluster pixels ($\gamma_{i,k} = 0$) vanishes. This leaves the network free to predict anything and does not yield specialized representations. Therefore, we add a second term (inter-cluster loss) which penalizes the KL divergence between out-of-cluster predictions and the pixelwise prior of the data. Intuitively this tells each representation $\theta_k$ to contain no information regarding non-assigned pixels $x_i$: $P(x_i \mid \psi_{i,k}, z_{i,k}) = P(x_i)$.

A disadvantage of the interaction between $\gamma$ and $\psi$ in (5) is that it may yield conflicting gradients. For any $\theta_k$ the loss for a given pixel $i$ can be reduced by better predicting $x_i$, or by decreasing $\gamma_{i,k}$ (i.e. taking less responsibility), which is (due to the E-step) realized by being worse at predicting $x_i$. A practical solution to this problem is obtained by stopping the $\gamma$ gradients, i.e. by setting $\partial L / \partial \gamma = 0$ during backpropagation.
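
The two-term loss can be illustrated with a small NumPy sketch for Bernoulli pixels. The function name `n_em_loss`, the prior value 0.2 and the toy predictions are assumptions for this example. The soft-assignments enter as plain constants, which mirrors stopping their gradients; in an automatic differentiation framework this would correspond to a stop-gradient operation on them.

```python
import numpy as np

def n_em_loss(x, psi, gamma, prior=0.2, eps=1e-6):
    """Intra- and inter-cluster loss terms for Bernoulli pixels.

    gamma is treated as a constant (no gradient would flow through it).
    """
    psi = np.clip(psi, eps, 1.0 - eps)
    log_p = x[None, :] * np.log(psi) + (1 - x[None, :]) * np.log(1 - psi)
    intra = -(gamma * log_p).sum()                     # weighted reconstruction
    # KL[Bernoulli(prior) || Bernoulli(psi)] for every component and pixel
    kl = prior * np.log(prior / psi) + (1 - prior) * np.log((1 - prior) / (1 - psi))
    inter = ((1.0 - gamma) * kl).sum()                 # penalise out-of-cluster predictions
    return intra, inter

x = np.array([1.0, 1.0, 0.0, 0.0])
gamma = np.array([[1.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 1.0]])
prior = 0.2                                            # assumed pixelwise prior
specialised = np.array([[0.99, 0.99, prior, prior],    # predicts the prior elsewhere
                        [prior, prior, 0.01, 0.01]])
greedy = np.array([[0.99, 0.99, 0.01, 0.01],           # both copies model everything
                   [0.99, 0.99, 0.01, 0.01]])
print(n_em_loss(x, specialised, gamma, prior))
print(n_em_loss(x, greedy, gamma, prior))
```

Both sets of predictions reconstruct their assigned pixels equally well, so only the second term separates the specialised solution from the one in which every copy models the whole image.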

3 Related work

The method most closely related to our approach is Tagger greff2016tagger , which similarly learns perceptual grouping in an unsupervised fashion using $K$ copies of a neural network that work together by reconstructing different parts of the input. Unlike in the case of N-EM, these copies additionally learn to output the grouping, which gives Tagger more direct control over the segmentation and supports its use on complex texture segmentation tasks. Our work maintains a close connection to EM and relies on the posterior inference of the E-Step as a grouping mechanism. This facilitates theoretical analysis and simplifies the task for the resulting networks, which we find can be markedly smaller than in Tagger. Furthermore, Tagger does not include any recurrent connections on the level of the hidden states, precluding it from next step prediction on sequential tasks (see Footnote 2).

Footnote 2: RTagger ilin2017recurrent , a recurrent extension of Tagger that does support sequential data, was developed concurrently with this work.

The Binding problem was first considered in the context of Neuroscience milner1974model ; vondermalsburg1981correlation and has sparked some early work in oscillatory neural networks that use synchronization as a grouping mechanism vondermalsburg1995binding ; wang1995locally ; rao2008unsupervised . Later, complex valued activations have been used to replace the explicit simulation of oscillation rao2010objective ; reichert2013neuronal . By virtue of being general computers, any RNN can in principle learn a suitable mechanism. In practice however it seems hard to learn, and adding a suitable mechanism like competition wersing2001competitivelayer , fast weights schmidhuber1992learning , or perceptual grouping as in N-EM seems necessary.

Unsupervised Segmentation has been studied in several different contexts schmidhuber1992learningb , from random vectors hyvarinen2006learning over texture segmentation guerrero-colon2008image to images kannan2007clustering ; isola2015learning . Early work in unsupervised video segmentation jojic2001learning used generalized Expectation Maximization (EM) to infer how to split frames of moving sprites. More recently optical flow has been used to train convolutional networks to do figure/ground segmentation pathak2016learning ; vijayanarasimhan2017sfmnet . A related line of work under the term of multi-causal modelling saund1995multiple has formalized perceptual grouping as inference in a generative compositional model of images. Masked RBMs (leroux2011learning) for example extend Restricted Boltzmann Machines with a latent mask inferred through Block-Gibbs sampling.

Gradient backpropagation through inference updates has previously been addressed in the context of sparse coding with (Fast) Iterative Shrinkage/Thresholding Algorithms ((F)ISTA; daubechies2004iterative ; rozell2008sparse ; beck2009fast ). Here the unrolled graph of a fixed number of ISTA iterations is replaced by a recurrent neural network that parametrizes the gradient computations and is trained to predict the sparse codes directly gregor2010learning . We derive RNN-EM from N-EM in a similar fashion and likewise obtain a trainable procedure that has the structure of iterative pursuit built into the architecture, while leaving tunable degrees of freedom that can improve its modeling capabilities sprechmann2015learning . An alternative that would further empower the network by untying its weights across iterations hershey2014deep was not considered for flexibility reasons.

4 Experiments

We evaluate our approach on a perceptual grouping task for generated static images and video. By composing images out of simple shapes we have control over the statistical structure of the data, as well as access to the ground-truth clustering. This allows us to verify that the proposed method indeed recovers the intended grouping and learns representations corresponding to these objects. In particular we are interested in studying the role of next-step prediction as an unsupervised objective for perceptual grouping, the effect of the hyperparameter $K$, and the usefulness of the learned representations.

In all experiments we train the networks using ADAM kingma2014adam with default parameters and a batch size of 64, using separate train, validation and test sets. Consistent with earlier work greff2015binding ; greff2016tagger , we evaluate the quality of the learned groupings with respect to the ground truth while ignoring the background and overlap regions. This comparison is done using the Adjusted Mutual Information (AMI; vinh2010information ) score, which provides a measure of clustering similarity between 0 (random) and 1 (perfect match). We use early stopping when the validation loss has not improved for 10 epochs (see Footnote 3). A detailed overview of the experimental setup can be found in Appendix A. All reported results are averages computed over five runs (see Footnote 4).

Footnote 3: Note that we do not stop on the AMI score as this is not part of our objective function and only measured to evaluate the performance after training.

Footnote 4: Code to reproduce all experiments is available at https://github.com/sjoerdvansteenkiste/Neural-EM
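
The early-stopping rule with a patience of 10 epochs can be sketched as below. The helper `early_stopping_epoch` is a hypothetical name introduced here, and the list of validation losses stands in for a real training loop.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once `patience` consecutive epochs bring no improvement.
    """
    best, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Loss improves until epoch 1, then plateaus at a worse value.
print(early_stopping_epoch([1.0, 0.5] + [0.6] * 20))   # (11, 1)
```

The criterion only looks at the validation loss, in line with the remark that the AMI score is measured after training and is not used for stopping.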

Figure 3: Groupings by RNN-EM (bottom row), N-EM (middle row) for six input images (top row). Both methods recover the individual shapes accurately when they are separated (a, b, f), even when confronted with the same shape (b). RNN-EM is able to handle most occlusion (c, d) but sometimes fails (e). The exact assignments are permutation invariant and depend on initialization; compare (a) and (f).

4.1 Static Shapes

To validate that our approach yields the intended behavior we consider a simple perceptual grouping task that involves grouping three randomly chosen regular shapes () located in random positions of binary images reichert2013neuronal . This simple setup serves as a test-bed for comparing N-EM and RNN-EM, before moving on to more complex scenarios.

We implement $f_\phi$ by means of a single layer fully connected neural network with a sigmoid output $\psi_{i,k}$ for each pixel that corresponds to the mean of a Bernoulli distribution. The representation $\theta_k$ is a real-valued 250-dimensional vector squashed to the $(0, 1)$ range by a sigmoid function before being fed into the network. Similarly for RNN-EM we use a recurrent neural network with 250 sigmoidal hidden units and an equivalent output layer. Both networks are trained with $K = 3$ and unrolled for 15 EM steps.

As shown in Figure 3, we observe that both approaches are able to recover the individual shapes as long as they are separated, even when confronted with identical shapes. N-EM performs worse if the image contains occlusion, and we find that RNN-EM is in general more stable and produces considerably better groupings. This observation is in line with findings for Sparse Coding gregor2010learning . Similarly we conclude that the tunable degrees of freedom in RNN-EM help speed-up the optimization process resulting in a more powerful approach that requires fewer iterations. The benefit is reflected in the large score difference between the two: AMI compared to AMI for N-EM. In comparison, Tagger achieves an AMI score of (and with layernorm), while using about twenty times more parameters greff2016tagger .

4.2 Flying Shapes

We consider a sequential extension of the static shapes dataset in which the shapes are floating along random trajectories and bounce off walls. An example sequence with 5 shapes can be seen in the bottom row of Figure 4. We use a convolutional encoder and decoder inspired by the discriminator and generator networks of infoGAN chen2016infogan , with a recurrent neural network of 100 sigmoidal units (for details see Section A.2). At each timestep $t$ the network receives $\gamma_k \odot (\tilde{x}^{(t)} - \psi_k^{(t-1)})$ as input, where $\tilde{x}^{(t)}$ is the current frame corrupted with additional bitflip noise. The next-step prediction objective is implemented by replacing $x$ with $x^{(t+1)}$ in (5), and is evaluated at each time-step.
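
Bitflip noise on a binary frame can be sketched in a few lines. The function name `bitflip` and the flip probability 0.2 used in the example are assumptions for illustration only.

```python
import numpy as np

def bitflip(frame, p, rng):
    """Flip each binary pixel independently with probability p."""
    flips = rng.random(frame.shape) < p
    return np.where(flips, 1 - frame, frame)

rng = np.random.default_rng(0)
frame = rng.integers(0, 2, size=(8, 8))      # a toy binary frame
noisy = bitflip(frame, p=0.2, rng=rng)       # p is an assumed value
print((noisy != frame).mean())               # fraction of flipped pixels
```

Only the network input is corrupted in this way; the prediction target remains the clean frame.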

Figure 4: A sequence of 5 shapes flying along random trajectories (bottom row). The next-step prediction of each copy of the network (rows 2 to 5) and the soft-assignment of the pixels to each of the copies (top row). Observe that the network learns to separate the individual shapes as a means to efficiently solve next-step prediction. Even when many of the shapes are overlapping, as can be seen in time-steps 18-20, the network is still able to disentangle the individual shapes from the clutter.

Table 1 summarizes the results on flying shapes, and an example of a sequence with 5 shapes when using can be seen in Figure 4. For 3 shapes we observe that the produced groupings are close to perfect (AMI: ). Even in the very cluttered case of 5 shapes the network is able to separate the individual objects in almost all cases (AMI: ).

These results demonstrate the adequacy of the next step prediction task for perceptual grouping. However, we find that the converse also holds: the corresponding representations are useful for the prediction task. In Figure 5 we compare the next-step prediction error of RNN-EM with $K = 1$ (which reduces to a recurrent autoencoder that receives the difference between its previous prediction and the current frame as input) to RNN-EM with $K > 1$ on this task. To evaluate RNN-EM on next-step prediction we computed its loss using $P(x_i \mid \psi_i) = P(x_i \mid \max_k \psi_{i,k})$ as opposed to $P(x_i \mid \psi_i) = \sum_k \gamma_{i,k} P(x_i \mid \psi_{i,k})$ to avoid including information from the next timestep. The reported BCE loss for RNN-EM is therefore an upper bound to the true BCE loss. From the figure we observe that RNN-EM produces significantly lower errors, especially when the number of objects increases.

Finally, in Table 1 we also provide insight about the impact of choosing the hyper-parameter $K$, which is unknown for many real-world scenarios. Surprisingly we observe that training with too large a $K$ is in fact favourable, and that the network learns to leave the excess groups empty. When training with too few components we find that the network still learns about the individual shapes and we observe only a slight drop in score when correctly setting the number of components at test time. We conclude that RNN-EM is robust towards different choices of $K$, and specifically that choosing $K$ to be too high is not detrimental.

Figure 5: Binomial Cross Entropy Error obtained by RNN-EM and a recurrent autoencoder (RNN-EM with $K = 1$) on the denoising and next-step prediction task. RNN-EM produces significantly lower BCE across different numbers of objects.

Figure 6: Average AMI score (blue line) measured for RNN-EM (trained for 20 steps) across the flying MNIST test-set and corresponding quartiles (shaded areas), computed for each of 50 time-steps. The learned grouping dynamics generalize to longer sequences and even further improve the AMI score.

Train                          Test                           Test Generalization
# obj.  K   AMI                # obj.  K   AMI                # obj.  K   AMI
3       3   0.969 ± 0.006      3       3   0.970 ± 0.005      3       5   0.972 ± 0.007
3       5   0.997 ± 0.001      3       5   0.997 ± 0.002      3       3   0.914 ± 0.015
5       3   0.614 ± 0.003      5       3   0.614 ± 0.003      3       3   0.886 ± 0.010
5       5   0.878 ± 0.003      5       5   0.878 ± 0.003      3       5   0.981 ± 0.003

Table 1: AMI scores obtained by RNN-EM on flying shapes when varying the number of objects and number of components $K$, during training and at test time.

4.3 Flying MNIST

In order to incorporate greater variability among the objects we consider a sequential extension of MNIST. Here each sequence consists of gray-scale images containing two down-sampled MNIST digits that start in random positions and float along randomly sampled trajectories within the image for timesteps. An example sequence can be seen in the bottom row of Figure 7.

We deploy a slightly deeper version of the architecture used in flying shapes. Its details can be found in Appendix A.3. Since the images are gray-scale we now use a Gaussian distribution for each pixel with fixed variance and with mean $\psi_{i,k}$ as computed by each copy of the network. The training procedure is identical to flying shapes except that we replace bitflip noise with masked uniform noise: we first sample a binary mask from a multi-variate Bernoulli distribution and then use this mask to interpolate between the original image and samples from a Uniform distribution between the minimum and maximum values of the data (0 and 1).
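
Masked uniform noise, as described above, can be sketched as follows. The function name `masked_uniform_noise` and the mask probability 0.2 in the example are assumptions, not values from the experiments.

```python
import numpy as np

def masked_uniform_noise(image, p, rng, low=0.0, high=1.0):
    """Replace a random subset of pixels by uniform noise.

    A Bernoulli(p) mask selects the pixels; the mask then interpolates
    between the original image and samples from Uniform(low, high).
    """
    mask = (rng.random(image.shape) < p).astype(image.dtype)
    noise = rng.uniform(low, high, size=image.shape)
    return mask * noise + (1.0 - mask) * image

rng = np.random.default_rng(0)
image = rng.uniform(size=(6, 6))                         # toy gray-scale frame in [0, 1]
noisy = masked_uniform_noise(image, p=0.2, rng=rng)      # p is an assumed value
print(np.mean(noisy != image))                           # fraction of replaced pixels
```

Unlike bitflip noise, this corruption is meaningful for real-valued pixels, since a replaced pixel can take any value in the data range.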

We train with and on flying MNIST having two digits and obtain an AMI score of on the test set, measured across 5 runs.

In early experiments we observed that, given the large variability among the unique digits, we can boost the model performance by training in stages using digits. Here we exploit the generalization capabilities of RNN-EM to quickly transfer knowledge from a less varying set of MNIST digits to unseen variations. We used the same hyper-parameter configuration as before and obtain an AMI score of on the test set, measured across 5 runs.

Figure 7: A sequence of 3 MNIST digits flying across random trajectories in the image (bottom row). The next-step prediction of each copy of the network (rows 2 to 4) and the soft-assignment of the pixels to each of the copies (top row). Although the network was trained (stage-wise) on sequences with two digits, it is accurately able to separate three digits.

We study the generalization capabilities and robustness of these trained RNN-EM networks by means of three experiments. In the first experiment we evaluate them on flying MNIST having three digits (one extra) and likewise set . Even without further training we are able to maintain a high AMI score of (stage-wise: ) on the test-set. A test example can be seen in Figure 7. In the second experiment we are interested in whether the grouping mechanism that has been learned can be transferred to static images. We find that using 50 RNN-EM steps we are able to transfer a large part of the learned grouping dynamics and obtain an AMI score of (stage-wise: ) for two static digits. As a final experiment we evaluate the directly trained network on the same dataset for a larger number of timesteps. Figure 6 displays the average AMI score across the test set as well as the range of the upper and lower quartile for each timestep.

The results of these experiments confirm our earlier observations for flying shapes, in that the learned grouping dynamics are robust and generalize across a wide range of variations. Moreover we find that the AMI score further improves at test time when increasing the sequence length.

5 Discussion

The experimental results indicate that the proposed Neural Expectation Maximization framework can indeed learn how to group pixels according to constituent objects. In doing so the network learns a useful and localized representation for individual entities, which encodes only the information relevant to it. Each entity is represented separately in the same space, which avoids the binding problem and makes the representations usable as efficient symbols for arbitrary entities in the dataset. We believe that this is useful for reasoning in particular, and a potentially wide range of other tasks that depend on interaction between multiple entities. Empirically we find that the learned representations are already beneficial in next-step prediction with multiple objects, a task in which overlapping objects are problematic for standard approaches, but can be handled efficiently when learning a separate representation for each object.

As is typical in clustering methods, in N-EM there is no preferred assignment of objects to groups and so the grouping numbering is arbitrary and only depends on initialization. This property renders our results permutation invariant and naturally allows for instance segmentation, as opposed to semantic segmentation where groups correspond to pre-defined categories. RNN-EM learns to segment in an unsupervised fashion, which makes it applicable to settings with little or no labeled data. On the downside this lack of supervision means that the resulting segmentation may not always match the intended outcome. This problem is inherent to this task since in real world images the notion of an object is ill-defined and task dependent. We envision future work to alleviate this by extending unsupervised segmentation to hierarchical groupings, and by dynamically conditioning them on the task at hand using top-down feedback and attention.

6 Conclusion

We have argued for the importance of separately representing conceptual entities contained in the input, and suggested clustering based on statistical regularities as an appropriate unsupervised approach for separating them. We formalized this notion and derived a novel framework that combines neural networks and generalized EM into a trainable clustering algorithm. We have shown how this method can be trained in a fully unsupervised fashion to segment its inputs into entities, and to represent them individually. Using synthetic images and video, we have empirically verified that our method can recover the objects underlying the data, and represent them in a useful way. We believe that this work will help to develop a theoretical foundation for understanding this important problem of unsupervised learning, as well as providing a first step towards building practical solutions that make use of these symbol-like representations.

Acknowledgements

The authors wish to thank Paulo Rauber and the anonymous reviewers for their constructive feedback. This research was supported by the Swiss National Science Foundation grant 200021_165675/1 and the EU project “INPUT” (H2020-ICT-2015 grant no. 687795). We are grateful to NVIDIA Corporation for donating us a DGX-1 as part of the Pioneers of AI Research award, and to IBM for donating a “Minsky” machine.

References

Appendix A Experiment Details

The following subsections provide detailed information about the experimental setup of our empirical evaluation.

In all experiments we train the networks using ADAM [19] with default parameters, a batch size of 64 and train + validation + test inputs. The quality of the learned groupings is evaluated by computing the Adjusted Mutual Information (AMI; [35]) with respect to the ground truth, while ignoring the background and overlap regions (as is consistent with earlier work [8, 7]). We use early stopping when the validation loss has not improved for 10 epochs.

A.1 Experiments on Static Shapes

Each input consists of a binary image containing three regular shapes () located in random positions [26].

For N-EM we implement $f_\phi$ by means of a single layer fully connected neural network with a sigmoid activation function. It receives a real-valued 250-dimensional vector $\theta_k$ as input and outputs for each pixel a value that parameterizes a Bernoulli distribution. We squash $\theta_k$ with a Sigmoid before passing it to the network and train an additional weight $\eta$ to implement the learning rate that is used to combine the gradient ascent updates into the current parameter estimate.

Similarly for RNN-EM we use a recurrent neural network with 250 Sigmoidal hidden units and a fully-connected output layer with a sigmoid activation function that parametrizes a Bernoulli distribution for each pixel in the same fashion.

We train both networks with for 15 EM steps and add bitflip noise with probability 0.1 to each of the pixels. The prior for each pixel in the data is set to a Bernoulli distribution with . The outer-loss is only injected at the final EM-step.

A.2 Experiments on Flying Shapes

Each input consists of a sequence of binary images containing a fixed number of shapes () that start in random positions and float along randomly sampled trajectories within the image for 20 steps.

We use a convolutional encoder-decoder architecture inspired by recent GANs [4] with a recurrent neural network as bottleneck:

  1. conv. 32 ELU. stride 2. layer norm

  2. conv. 64 ELU. stride 2. layer norm

  3. fully connected. 512 ELU. layer norm

  4. recurrent. 100 Sigmoid. layer norm on the output

  5. fully connected. 512 RELU. layer norm

  6. fully connected. RELU. layer norm

  7. reshape 2 nearest-neighbour, conv. 32 RELU. layer norm

  8. reshape 2 nearest-neighbour, conv. 1 Sigmoid

Instead of using transposed convolutions (to implement the "de-convolution") we first reshape the image using the default nearest-neighbour interpolation followed by a normal convolution in order to avoid frequency artifacts [22]. Note that we do not add layer norm on the recurrent connection.
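
The upsampling scheme described above (nearest-neighbour resize followed by an ordinary convolution) can be sketched for a single channel in NumPy. The helper names, the 3x3 averaging kernel and the naive loop are assumptions made to keep the example self-contained; the "convolution" is implemented as cross-correlation, as is common in deep learning libraries.

```python
import numpy as np

def upsample_nearest(img, factor=2):
    """Nearest-neighbour upsampling: repeat every pixel along both axes."""
    return img.repeat(factor, axis=0).repeat(factor, axis=1)

def conv2d_same(img, kernel):
    """Naive single-channel 'same' convolution (odd kernel sizes assumed)."""
    kh, kw = kernel.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros(img.shape, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = (padded[i:i + kh, j:j + kw] * kernel).sum()
    return out

img = np.array([[1.0, 2.0],
                [3.0, 4.0]])
up = upsample_nearest(img)                    # shape (4, 4)
smooth = np.full((3, 3), 1.0 / 9.0)           # hypothetical 3x3 kernel
print(conv2d_same(up, smooth).round(2))
```

Every output pixel is covered by the same number of kernel taps, which is the usual argument for why this variant avoids the uneven overlap of strided transposed convolutions.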

At each timestep $t$ we feed $\gamma_k \odot (\tilde{x}^{(t)} - \psi_k^{(t-1)})$ as input to the network, where $\tilde{x}^{(t)}$ is the input with added bitflip noise. RNN-EM is trained with a next-step prediction objective implemented by replacing $x$ with $x^{(t+1)}$ in (5), which we evaluate at each time-step. A single RNN-EM step is used for each timestep. The prior for each pixel in the data is set to a Bernoulli distribution. We prevent conflicting gradient updates by not back-propagating any gradients through $\gamma$.

A.3 Experiments on Flying MNIST

Each input consists of a sequence of gray-scale images containing a fixed number of down-sampled (by a factor of two along each dimension) MNIST digits that start in random positions and “fly” across randomly sampled trajectories within the image for timesteps.

We use a slightly deeper version of the architecture used for flying shapes:

  1. conv. 32 ELU. stride 2. layer norm

  2. conv. 64 ELU. stride 2. layer norm

  3. conv. 128 ELU. stride 2. layer norm

  4. fully connected. 512 ELU. layer norm

  5. recurrent. 250 Sigmoid. layer norm on the output

  6. fully connected. 512 RELU. layer norm

  7. fully connected. RELU. layer norm

  8. reshape 2 nearest-neighbour, conv. 64 RELU. layer norm

  9. reshape 2 nearest-neighbour, conv. 32 RELU. layer norm

  10. reshape 2 nearest-neighbour, conv. 1 linear

The training procedure is largely identical to the one described for flying shapes except that we replace the bitflip noise with masked uniform noise: we first sample a binary mask from a multi-variate Bernoulli distribution with and then use this mask to interpolate between the original image and samples from a Uniform distribution between the minimum and maximum values of the data. We use a learning rate of

(from the second stage onwards in case of stage-wise training), scale the second-loss term by a factor of 0.2 and find it beneficial to normalize the masked differences between the prediction and the image (zero mean, standard deviation one) before passing it to the network.
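
The input normalization mentioned above can be sketched as follows. Two assumptions are made here: the masked difference is taken to be the assignment-weighted residual between image and prediction, and the statistics are computed over the whole array. The function name is hypothetical.

```python
import numpy as np

def normalise_masked_difference(gamma, psi, x, eps=1e-6):
    """Standardise the masked residual to zero mean and unit standard deviation."""
    diff = gamma * (x[None, :] - psi)                  # masked difference, (K, D)
    return (diff - diff.mean()) / (diff.std() + eps)   # eps avoids division by zero

rng = np.random.default_rng(0)
K, D = 2, 10                                           # hypothetical sizes
x = rng.uniform(size=D)
psi = rng.uniform(size=(K, D))
gamma = rng.dirichlet(np.ones(K), size=D).T            # columns sum to 1
inp = normalise_masked_difference(gamma, psi, x)
print(inp.mean().round(6), inp.std().round(6))
```

Standardising the input keeps its scale roughly constant even when the soft-assignments shrink the residual for most pixels.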