Tagger: Deep Unsupervised Perceptual Grouping

06/21/2016 · by Klaus Greff, et al. · IDSIA & The Curious AI Company

We present a framework for efficient perceptual inference that explicitly reasons about the segmentation of its inputs and features. Rather than being trained for any specific segmentation, our framework learns the grouping process in an unsupervised manner or alongside any supervised task. By enriching the representations of a neural network, we enable it to group the representations of different objects in an iterative manner. By allowing the system to amortize the iterative inference of the groupings, we achieve very fast convergence. In contrast to many other recently proposed methods for addressing multi-object scenes, our system does not assume the inputs to be images and can therefore directly handle other modalities. For multi-digit classification of very cluttered images that require texture segmentation, our method offers improved classification performance over convolutional networks despite being fully connected. Furthermore, we observe that our system greatly improves on the semi-supervised result of a baseline Ladder network on our dataset, indicating that segmentation can also improve sample efficiency.


1 Introduction

Figure 1: An example of perceptual grouping for vision.

Humans naturally perceive the world as being structured into different objects, their properties, and their relations to each other. This phenomenon, which we refer to as perceptual grouping, is also known as amodal perception in psychology. It occurs effortlessly and includes a segmentation of the visual input, such as that shown in Figure 1. This grouping also applies analogously to other modalities, for example in solving the cocktail party problem (audio) or when separating the sensation of a grasped object from the sensation of fingers touching each other (tactile). Even more abstract features such as object class, color, position, and velocity are naturally grouped together with the inputs to form coherent objects. This rich structure is crucial for many real-world tasks such as manipulating objects or driving a car, where awareness of different objects and their features is required.

In this paper, we introduce a framework for learning efficient iterative inference of such perceptual grouping which we call iTerative Amortized Grouping (TAG). This framework entails a mechanism for iteratively splitting the inputs and internal representations into several different groups. We make no assumptions about the structure of this segmentation and rather train the model end-to-end to discover which are the relevant features and how to perform the splitting.

By using an auxiliary denoising task we focus directly on amortizing the posterior inference of the object features and their grouping. Because our framework does not make any assumptions about the structure of the data, it is completely domain agnostic and applicable to any type of data. The TAG framework works completely unsupervised, but can also be combined with supervised learning for classification or segmentation.

Another class of recently proposed mechanisms for addressing complex structured inputs is attention Schmidhuber & Huber (1991); Bahdanau et al. (2014); Eslami et al. (2016). These methods simplify the problem of perception by learning to restrict processing to a part of the input. In contrast, TAG simply structures the input without directing the focus or discarding irrelevant information. These two systems are not mutually exclusive and could complement each other: the group structure can help in deciding what exactly to focus on, which in turn may help simplify the task at hand.

We apply our framework to two artificial datasets: a simple binary one with multiple shapes and one with overlapping textured MNIST digits on a textured background. We find that our method learns intuitively appealing groupings that support denoising and classification. Our results for the 2-digit classification are significantly better than a strong ConvNet baseline despite the use of a fully connected network. The improvements for semi-supervised learning with 1,000 labels are even greater, suggesting that grouping can help learning by increasing the sample efficiency.

2 Iterative Amortized Grouping (TAG)

Grouping.

Our goal is to enable neural networks to split inputs and internal representations into coherent groups that can be processed separately. We hypothesize that processing the whole input in one clump is often difficult due to unwanted interference. However, if we allow the network to separately process groups, it can make use of invariant distributed features without the risk of ambiguities. We thus define a group to be a collection of inputs and internal representations that are processed together (largely) independently of the other groups.

The “correct” grouping is often dynamic, ambiguous, and task dependent. For example, when driving along a road, it is useful to group all the buildings together. To find a specific house, however, it is important to separate the buildings, and to enter one, they need to be subdivided even further. Rather than treating segmentation as a separate task, we provide grouping as a mechanism that our system can use as a tool. We make no assumptions about the correspondence between objects and groups. If the network can process several objects in one group without unwanted interference, then it is free to do so.

Processing of the input is split into K different groups, but it is left up to the network to learn how best to use this ability for a given problem, such as classification. To make the task of instance segmentation easy, we keep the groups symmetric in the sense that each group is processed by the same underlying model. We introduce latent variables to encode whether input element i is assigned to group g. More formally, for each input element x_i, indexed by i, we introduce a discrete random variable G_i ∈ {1, …, K}. As a shorthand for the assignment probability q(G_i = g) we write m_{g,i}, and we denote the vector of these elements over all i by m_g.

Amortized Iterative Inference.

We want our model to reason not only about the group assignments but also about the representation of each group. This amounts to inference over two sets of variables: the latent group assignments and the individual group representations, a formulation very similar to mixture models, for which exact inference is typically intractable. For these models, a common approach is to approximate the inference iteratively by alternating between (re-)estimation of these two sets (e.g., EM-like methods; Dempster et al. (1977)). The intuition is that given the grouping, inferring the object features becomes easy, and vice versa. We employ a similar strategy by allowing our network to iteratively refine its estimates of the group assignments as well as the object representations. If the model can improve the estimates in each step, then it will converge to a final solution.
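For intuition, the alternating estimation that TAG amortizes can be mimicked on a toy one-dimensional mixture. The following sketch is purely illustrative (names, constants, and the scalar-means simplification are our own assumptions) and is not the learned TAG mapping:

```python
import numpy as np

def alternating_grouping(x, K=2, iters=10, var=0.1, seed=0):
    """Toy EM-style alternation: given group means, re-estimate the
    per-element group assignments (masks); given masks, re-estimate
    the means. Illustrative only -- not the learned TAG network."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=K)             # one scalar "reconstruction" per group
    m = np.full((K, x.size), 1.0 / K)  # uniform initial assignments
    for _ in range(iters):
        # E-like step: responsibilities from Gaussian likelihoods
        lik = np.exp(-0.5 * (x[None, :] - z[:, None]) ** 2 / var)
        m = lik / lik.sum(axis=0, keepdims=True)
        # M-like step: means from mask-weighted averages
        z = (m * x[None, :]).sum(axis=1) / m.sum(axis=1)
    return z, m

# Two noisy clusters around 0 and 1; the alternation separates them.
x = np.concatenate([np.full(50, 0.0), np.full(50, 1.0)])
x = x + 0.05 * np.random.default_rng(1).normal(size=100)
z, m = alternating_grouping(x, K=2)
```

Each step improves one set of estimates given the other, which is exactly the loop that TAG trains a single parametric mapping to perform in an amortized fashion.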

Rather than deriving and then running an inference algorithm, we train a parametric mapping to arrive at the end result of inference as efficiently as possible Gregor & LeCun (2010). This is known as amortized inference Srikumar et al. (2012), and it is used, for instance, in variational autoencoders, where the encoder learns to amortize the posterior inference required by the generative model represented by the decoder. Here we instead apply the framework of denoising autoencoders Gallinari et al. (1987); Le Cun (1987); Vincent et al. (2008), which are trained to reconstruct the original input x from a corrupted version x̃. This encourages the network to implement useful amortized posterior inference without ever having to specify or even know the underlying generative model whose inference is implicitly learned.

This situation is analogous to standard supervised deep learning, which can also be viewed as amortized inference Bengio et al. (2014). Rather than specifying all the hidden variables that relate the inputs and labels and then deriving and running an inference algorithm, a supervised deep model is trained to arrive at an approximation of the true posterior without the user specifying, or typically even knowing, the underlying generative model. This works as long as the network is provided with the input information and mechanisms required for an efficient approximation of posterior inference.

2.1 Definition of the TAG mechanism

Figure 2: Illustration of the TAG framework used for training. Left: The system learns by denoising its input over iterations, using several groups to distribute the representation. Each group, represented by several panels of the same color, maintains its own reconstruction estimate z_g of the input and a corresponding mask m_g, which encodes the parts of the input that this group is responsible for representing. These estimates are updated over iterations by the same network; that is, each group and iteration share the weights of the network and only the inputs to the network differ. In the case of images, z_g contains pixel values. Right: In each iteration, z_g and m_g from the previous iteration are used to compute a likelihood term L_g and a modeling error δz_g. These four quantities are fed to the parametric mapping to produce the z_g and m_g for the next iteration. During learning, all inputs to the network are derived from the corrupted input x̃, as shown here. The unsupervised task for the network is to learn to denoise, i.e., to output an estimate of the original clean input. See Section 2.1 for more details.

A high-level illustration of the TAG framework is presented in Figure 2: we train a network with a learnable grouping mechanism to iteratively denoise corrupted inputs x̃. The output q^t(x) at each iteration is an approximation of the true probability p(x | x̃), which is refined over iterations indexed by t. As the cost function for training the network, we use the negative log likelihood

C(x) = − Σ_t Σ_i log q^t(x_i),    (1)

where the summation is over iterations t and input elements i. From here on we mostly omit t from the equations for readability. Since this cost function does not require any class labels or intended grouping information, training can be completely unsupervised, though additional terms for supervised tasks can be added too.

Group representation.

Internally, the network maintains K versions of its representations, indexed by g. This can also be thought of as running K separate copies of the same network, where each copy only sees a subset of the inputs and outputs z_g (the expected value of the input for that group) and m_g (the group assignment probabilities). Each z_g and m_g has the same dimensionality as the input, and they are updated over iterations. Each group makes its own prediction about the original input based on z_g. In the binary case we use z_{g,i} directly as the probability, and in the continuous case we take z_{g,i} to represent the mean of a Gaussian distribution with variance v. We assumed the variance of the Gaussian distribution to be constant over iterations and groups but learned it from the data. It would be easy to add a more accurate estimate of the variance.

The final prediction of the network is defined as

q(x_i) = Σ_g m_{g,i} N(x_i; z_{g,i}, v).    (2)

The group assignment probabilities are forced to be non-negative and to sum up to one over the groups:

m_{g,i} ≥ 0 and Σ_g m_{g,i} = 1.    (3)
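The mixture prediction in Equation 2 and the per-iteration cost in Equation 1 are straightforward to express. Below is a minimal NumPy sketch (array shapes and function names are our own, not from the reference implementation):

```python
import numpy as np

def predict(z, m, x, var):
    """Mixture prediction q(x_i) = sum_g m[g, i] * N(x_i; z[g, i], var) (Eq. 2).
    z, m: arrays of shape (K, D); x: array of shape (D,)."""
    gauss = np.exp(-0.5 * (x[None, :] - z) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return (m * gauss).sum(axis=0)

def denoising_cost(z, m, x, var):
    """Negative log likelihood of the clean input under the mixture
    (Eq. 1, single iteration): C(x) = -sum_i log q(x_i)."""
    return -np.log(predict(z, m, x, var)).sum()
```

Accurate reconstructions z that cover the input under the masks m give a low cost; mismatched reconstructions give a high one.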
Inputs.

In contrast to a normal denoising autoencoder, which receives the corrupted x̃ directly, we feed in the estimates z_g and m_g from the previous iteration along with two additional quantities: δz_g and L_g. They are functions of the estimates and the corrupted x̃ and carry information about how the estimates could be improved. A parametric mapping (here a neural network) then produces the new estimates z_g and m_g. The initial values of m_g are randomized, and z_g is set to the data mean for all groups g.

Because the x_i are continuous variables, their likelihood is a function over all possible values of x_i, and not all of this information can be easily represented. Typically, the relevant information is found close to the current estimate z_{g,i}; therefore we use δz_g, which is proportional to the gradient of the negative log likelihood. Essentially, it represents the remaining modeling error:

δz_{g,i} ∝ m_{g,i} (x̃_i − z_{g,i}).    (4)

The derivation of the analogous term in the binary case is presented in Appendix A.5.

Since we are using denoising as a training objective, the network can only be allowed to take inputs through the corrupted x̃ during learning. Therefore, we need to look at the likelihood of the corrupted input when trying to determine how the estimates could be improved. Since the G_i are discrete variables, unlike the x_i, we treat them slightly differently: for G_i it is feasible to express the complete likelihood table, holding the other values constant. We denote this function by

L_{g,i} = N(x̃_i; z_{g,i}, v) / Σ_h N(x̃_i; z_{h,i}, v).    (5)

Note that we normalize over the groups g such that L_{g,i} sums up to one for each input element i. This amounts to providing each group with information about how likely each input element is to belong to it rather than to some other group. In other words, this is equivalent to a likelihood ratio rather than the raw likelihood. Intuitively, the term L_g describes how well each group reconstructs the individual input elements relative to the other groups.
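Under the Gaussian model above, both auxiliary inputs reduce to a few array operations. A NumPy sketch (up to the constant factors absorbed by the proportionality in Eq. 4; shapes and names are our own):

```python
import numpy as np

def modeling_error(z, m, x_tilde):
    """delta_z[g, i] proportional to m[g, i] * (x_tilde[i] - z[g, i]),
    i.e. the gradient of the negative log likelihood w.r.t. z (Eq. 4).
    z, m: (K, D); x_tilde: (D,)."""
    return m * (x_tilde[None, :] - z)

def likelihood_ratio(z, x_tilde, var):
    """L[g, i]: Gaussian likelihood of the corrupted input under each group,
    normalized over groups so it sums to one per element (Eq. 5)."""
    lik = np.exp(-0.5 * (x_tilde[None, :] - z) ** 2 / var)
    return lik / lik.sum(axis=0, keepdims=True)
```

Both terms are computed from the corrupted input x̃ only, so no information about the clean input leaks into the network.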

Parametric mapping.

The final component needed in the TAG framework is the parametric model, which does all the heavy lifting of inference. This model has a dual task: first, to denoise the estimate z_g of what each group says about the input, and second, to update the group assignment probabilities m_g of each input element. The information about the remaining modeling error, δz_g, is based on the corrupted input x̃; thus, the parametric network has to denoise this and in effect implement posterior inference for the estimated quantities. The mapping function is the same for each group and for each iteration. In other words, we share weights and in effect have only a single function approximator that we reuse.

The denoising task encourages the network to iteratively group its inputs into coherent groups that can be modeled efficiently. The trained network can be useful for a real-world denoising application, but typically the idea is to encourage the network to learn interesting internal representations. Therefore, it is not q(x) but rather z_g, m_g, and the internal representations of the parametric mapping that we are typically concerned with.

Summary.

By using the negative log likelihood (Eq. 1) as a cost function, we train our system to compute an approximation q^t(x) of the true denoising posterior p(x | x̃) at each iteration t. An overview of the whole system is given in Figure 2. For each input element x_i we introduce a latent variable G_i that takes the value g if this element is generated by group g. This way inference is split into K groups, and we can write the approximate posterior as follows:

q(x) = Π_i Σ_g q(x_i | G_i = g) q(G_i = g),    (6)

where we model the group reconstruction q(x_i | G_i = g) as a Gaussian with mean z_{g,i} and variance v, and the group assignment posterior q(G_i = g) as a categorical distribution m_{g,i}.

The trainable part of the TAG framework is given by a parametric mapping that operates independently on each group g and is used to compute both z_g and m_g (the latter is afterwards normalized using an elementwise softmax over the groups). This parametric mapping is usually implemented by a neural network, and the whole system is trained end-to-end using standard backpropagation through time.

The input to the network for the next iteration consists of the vectors z_g and m_g along with two additional quantities: the remaining modeling error δz_g and the group assignment likelihood ratio L_g, which carry information about how the estimates can be improved. Note that they are derived from the corrupted input x̃, to make sure we do not leak information about the clean input into the system.

2.2 The Tagger: Combining TAG and Ladder Network

Data: corrupted input x̃; trained parametric mapping (e.g., a Ladder network)
Result: final estimates z_g and m_g for all groups g, and the prediction q(x)
begin Initialization:
       z_g ← mean(x̃) for all groups g;
       m_g ← random initialization, normalized so that Σ_g m_{g,i} = 1 for each element i;
end
for iteration t = 1, …, T do
       for group g = 1, …, K do
              δz_g ← m_g ⊙ (x̃ − z_g);   (Eq. 4)
              L_{g,i} ← N(x̃_i; z_{g,i}, v) / Σ_h N(x̃_i; z_{h,i}, v);   (Eq. 5)
              (z_g, a_g) ← ParametricMapping(z_g, m_g, δz_g, L_g);
       end for
       m ← softmax of a over the groups;   (enforces Eq. 3)
end for
q(x_i) ← Σ_g m_{g,i} N(x_i; z_{g,i}, v);   (Eq. 2)
Algorithm 1: Pseudocode for running Tagger on a single real-valued example x̃. For details and a binary-input version, please refer to the supplementary material.
Figure 3: An example of how Tagger would use a 3-layer-deep Ladder Network as its parametric mapping to perform one inference iteration. Note the optional class prediction output for classification tasks. See supplementary material for details.
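The main loop of Algorithm 1 can be sketched in NumPy, with the trained Ladder network replaced by a placeholder mapping. This is a skeleton under our own assumptions (names, shapes, and the dummy `mapping` interface are illustrative), not the reference implementation:

```python
import numpy as np

def softmax(a, axis=0):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def run_tagger(x_tilde, mapping, K=4, T=3, var=0.1, seed=0):
    """Skeleton of Algorithm 1 for one real-valued example x_tilde (shape (D,)).
    `mapping` stands in for the trained Ladder network: it takes the
    concatenated (z, m, delta_z, L) for one group and returns (z_new, mask_logit)."""
    rng = np.random.default_rng(seed)
    D = x_tilde.size
    z = np.full((K, D), x_tilde.mean())     # z initialized to the data mean
    m = softmax(rng.normal(size=(K, D)))    # random masks, normalized over groups
    for _ in range(T):
        lik = np.exp(-0.5 * (x_tilde[None, :] - z) ** 2 / var)
        L = lik / lik.sum(axis=0, keepdims=True)   # likelihood ratio (Eq. 5)
        dz = m * (x_tilde[None, :] - z)            # modeling error (Eq. 4)
        out = [mapping(np.concatenate([z[g], m[g], dz[g], L[g]])) for g in range(K)]
        z = np.stack([o[0] for o in out])
        m = softmax(np.stack([o[1] for o in out]))  # enforce Eq. 3
    return z, m
```

Note that the same `mapping` is applied to every group and every iteration, mirroring the weight sharing described in Section 2.1.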

We chose the Ladder network Rasmus et al. (2015) as the parametric mapping because its structure reflects the computations required for posterior inference in hierarchical latent variable models. This means that the network should be well equipped to handle the hierarchical structure one might expect to find in many domains. We call this Ladder network wrapped in the TAG framework Tagger. This is illustrated in Figure 3 and the corresponding pseudocode can be found in Algorithm 1.

We mostly used the specifications of the Ladder network as described by Rasmus et al. (2015), but we made some minor modifications to fit it to the TAG framework. We found that the model becomes more stable over iterations when we add a sigmoid function to the gating variable (Rasmus et al., 2015, Equation 2) used in all the decoder layers with continuous outputs. None of the noise sources or denoising costs were in use (i.e., the denoising cost weights were set to zero for all layers in Eq. 3 of Rasmus et al. (2015)), but the Ladder's classification cost (Rasmus et al. (2015)) was added to Tagger's cost (Equation 1) for the semi-supervised tasks.

All four inputs (z_g, m_g, δz_g, and L_g) were concatenated and projected to a hidden representation that served as the input layer of the Ladder network. Subsequently, the values for the next iteration were simply read from the reconstruction (see Rasmus et al. (2015)) and projected linearly into z_g and, via a softmax over groups, into m_g to enforce the conditions in Equation 3. For the binary case, we used a logistic sigmoid activation for z_g.
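The readout described above, a linear projection for z_g and a softmax across groups for m_g (enforcing Eq. 3), can be sketched as follows (weight shapes and names are illustrative assumptions):

```python
import numpy as np

def readout(h, W_z, W_m):
    """h: (K, H) per-group hidden states read from the decoder reconstruction.
    Returns z of shape (K, D) via a linear projection, and m of shape (K, D)
    via a softmax across groups, so m >= 0 and sum_g m[g, i] = 1 (Eq. 3)."""
    z = h @ W_z                                   # linear readout of reconstructions
    a = h @ W_m                                   # mask logits
    e = np.exp(a - a.max(axis=0, keepdims=True))  # stable softmax over groups
    m = e / e.sum(axis=0, keepdims=True)
    return z, m
```

Because the softmax runs over the group axis (not the feature axis), each input element's assignment probabilities compete across groups, exactly as Equation 3 requires.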

3 Experiments and results

We explore the properties and evaluate the performance of Tagger both in fully unsupervised settings and in semi-supervised tasks on two datasets. The datasets and a Theano (Theano Development Team, 2016) reference implementation of Tagger are available at http://github.com/CuriousAI/tagger. Although both datasets consist of images, and grouping is intuitively similar to image segmentation, there is no prior for images in the Tagger model: our results (unlike those of the ConvNet baseline) generalize even if all the pixels are permuted.

Shapes.

We use the simple Shapes dataset Reichert et al. (2011) to examine the basic properties of our system. It consists of 60,000 (train) + 10,000 (test) binary images of size 20x20. Each image contains three randomly chosen shapes (squares and upward- or downward-pointing triangles) composed together at random positions with possible overlap.

Textured MNIST.

We generated a two-object supervised dataset (TextureMNIST2) by sequentially stacking two textured 28x28 MNIST digits, shifted two pixels left-and-up and right-and-down, respectively, on top of a background texture. The textures for the digits and background are different randomly shifted samples from a bank of 20 sinusoidal textures with different frequencies and orientations. Some examples from this dataset are presented in Figure 5. We use a 50k training set, 10k validation set, and 10k test set to report the results. The dataset is expected to be difficult due to the heavy overlap of the objects in addition to the clutter caused by the textures. We also use a textured single-digit version (TextureMNIST1) without a shift to isolate the impact of texturing from that of multiple objects.

3.1 Training and evaluation

We train Tagger in an unsupervised manner by showing the network only the raw input example, not ground-truth masks or any class labels, using 4 groups and 3 iterations. We average the cost over iterations and use ADAM Kingma & Ba (2015) for optimization. On the Shapes dataset we trained for 100 epochs with a bit-flip probability of 0.2, and on the TextureMNIST dataset for 200 epochs with a corruption-noise standard deviation of 0.2. The models reported in this paper took approximately 3 and 11 hours of wall-clock time on a single Nvidia Titan X GPU for the Shapes and TextureMNIST2 datasets, respectively.

To understand how model size, the length of iterative inference, and the number of groups affect modeling performance, we evaluate the trained models using two metrics: first, the denoising cost on the validation set, and second, the quality of the segmentation into objects, measured by the adjusted mutual information (AMI) score Vinh et al. (2010), ignoring the background and overlap regions in the Shapes dataset (consistent with Greff et al. (2015)). Evaluations of the AMI score and classification results in semi-supervised tasks were performed using uncorrupted input. The system has no restrictions regarding the number of groups and iterations used for training and evaluation. The results improved in terms of both denoising cost and AMI score when iterating further, so we used 5 iterations for testing. Even though the system was trained with 4 groups and 3 shapes per training example, we could evaluate it with, for example, 2 groups and 3 shapes, or 4 groups and 4 shapes.
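The AMI evaluation can be reproduced with scikit-learn's implementation, assuming per-pixel masks and ground-truth object labels with ignored regions (background/overlap) marked, here by -1. The helper below is our own illustrative wrapper, not the paper's evaluation script:

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

def ami_score(m, true_labels):
    """Per-pixel segmentation quality: compare the group with the highest
    mask value at each pixel against the ground-truth object identity.
    m: masks of shape (K, D); true_labels: shape (D,), with -1 = ignored."""
    pred = m.argmax(axis=0)            # hard assignment from the soft masks
    keep = true_labels >= 0            # drop background/overlap pixels
    return adjusted_mutual_info_score(true_labels[keep], pred[keep])
```

AMI is invariant to permutations of the group indices, which matters here because the groups are symmetric and carry no fixed object identity.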

3.2 Unsupervised Perceptual Grouping

Iter 1 Iter 2 Iter 3 Iter 4 Iter 5
Denoising cost 0.094 0.068 0.063 0.063 0.063
AMI 0.58 0.73 0.77 0.79 0.79
Denoising cost* 0.100 0.069 0.057 0.054 0.054
AMI* 0.70 0.90 0.95 0.96 0.97
(a) Convergence of Tagger over iterative inference
AMI
RC Greff et al. (2015) 0.61 ± 0.005
Tagger 0.79 ± 0.034
Tagger* 0.97 ± 0.009
(b) Method comparison
Table 1: Table (a) shows how quickly the algorithm evaluation converges over inference iterations with the Shapes dataset. Table (b) compares segmentation quality to previous work on the Shapes dataset. The AMI score is defined in the range from 0 (guessing) to 1 (perfect match). The results with a star (*) are using LayerNorm Ba et al. (2016) instead of BatchNorm.

Table 1 shows the median performance of Tagger on the Shapes dataset over 20 seeds. Tagger achieves very fast convergence, as shown in (a). Over iterations, the network improves its denoising performance by grouping different objects into different groups. Compared to Greff et al. (2015), Tagger performs significantly better in terms of AMI score (see (b)).

Figure 4 and Figure 5 qualitatively show the learned unsupervised groupings for the Shapes and textured MNIST datasets. Tagger uses its TAG mechanism slightly differently for the two datasets. For Shapes, z_g represents filled-in objects and the masks m_g show which part of each object is actually visible. For textured MNIST, z_g represents the textures and the masks m_g the texture segments. In the case of two copies of the same digit or two identical shapes, Tagger can segment them into separate groups, and hence it performs instance segmentation. We used 4 groups for training even though there are only 3 objects in the Shapes dataset and 3 segments in the TextureMNIST2 dataset. The excess group is left empty by the trained system, but its presence seems to speed up the learning process.

Figure 4: Results for the Shapes dataset. Left column: 7 examples from the test set along with their resulting groupings, in descending AMI-score order, and 3 hand-picked examples (A, B, and C) to demonstrate generalization. A: Testing a 2-group model on 3-object data. B: Testing a 4-group model trained with 3-object data on 4 objects. C: Testing a 4-group model trained with 3-object data on 2 objects. Right column: Illustration of the inference process over iterations for four color-coded groups, showing z_g and m_g.
Figure 5: Results for the TextureMNIST2 dataset. Left column: 7 examples from the test set along with their resulting groupings, in descending AMI-score order, and 3 hand-picked examples (D, E1, E2). D: An example from the TextureMNIST1 dataset. E1-2: A hand-picked example from TextureMNIST2. E1 demonstrates typical inference, and E2 demonstrates how the system is able to estimate the input when a certain group (the topmost digit, a 4) is removed. Right column: Illustration of the inference process over iterations for four color-coded groups, showing z_g and m_g.

The hand-picked examples A-C in Figure 4 illustrate the robustness of the system when the number of objects changes in the evaluation dataset or when evaluation is performed using fewer groups.

Example E in Figure 5 is particularly interesting: E1 shows what the normal evaluation looks like, while E2 demonstrates how we can remove the topmost digit from the scene and let the system fill in the digit below it and the background. We do this by setting the corresponding group assignment probabilities to a large negative number just before the final softmax over groups in the last iteration.

To solve the textured two-digit MNIST task, the system has to combine texture cues with high-level shape information. The system first infers the background texture and mask, which are finalized on the first iteration. The second iteration then typically fixes the texture used for the topmost digit, while subsequent iterations clarify the occluded digit and its texture. This demonstrates the need for iterative inference of the grouping.

3.3 Classification

We investigate the role of grouping for the task of classification. We evaluate Tagger against four baseline models on the textured MNIST task. As our first baseline we use a fully connected network (FC) with ReLU activations and batch normalization after each layer. Our second baseline is a ConvNet (Conv) based on Model C from Springenberg et al. (2014), which has close to state-of-the-art results on CIFAR-10. We removed dropout, added batch normalization after each layer, and replaced the final pooling by a fully connected layer to improve its performance on this task. Furthermore, we compare with a fully connected Ladder Rasmus et al. (2015) network (FC Ladder).

All models use a softmax output and are trained with 50,000 samples to minimize the categorical cross-entropy error. In case there are two different digits in the image (most examples in the TextureMNIST2 dataset), the target is a probability of 0.5 for each of the two classes. We evaluate the models based on classification errors. For the two-digit case, we score the network based on the two highest predicted classes (top 2).

For Tagger, we first train the system in an unsupervised phase for 150 epochs and then add two fresh randomly initialized layers on top and continue training the entire system end to end using the sum of unsupervised and supervised cost terms for 50 epochs. Furthermore, the topmost layer has a per-group softmax activation that includes an added ’no class’ neuron for groups that do not contain any digit. The final classification is then performed by summing the softmax output over all groups for the true 10 classes and renormalizing this sum to add up to one.
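The per-group softmax readout with an extra 'no class' neuron, summed over groups and renormalized, can be sketched as follows (the logit layout, 10 digit classes plus one 'no class' entry per group, is an assumption for illustration):

```python
import numpy as np

def group_class_probs(logits):
    """logits: (K, 11) per-group class logits -- 10 digit classes plus an
    extra 'no class' neuron for groups that contain no digit. Returns a
    distribution over the 10 true classes by summing the per-group
    softmax outputs and renormalizing the sum to one."""
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    per_group = e / e.sum(axis=1, keepdims=True)  # softmax within each group
    summed = per_group[:, :10].sum(axis=0)        # drop 'no class', sum groups
    return summed / summed.sum()                  # renormalize over 10 classes
```

A group that fires its 'no class' neuron contributes almost nothing to the summed class probabilities, which is how empty groups are ignored by the classifier.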

The final results are summarized in Table 2. As shown in this table, Tagger performs significantly better than all the fully connected baseline models on both variants, but the improvement is more pronounced for the two-digit case. This result is expected because for cases with multi-object overlap, grouping becomes more important. It, moreover, confirms the hypothesis that grouping can help classification and is particularly beneficial for complex inputs. Remarkably, Tagger, despite being fully connected, is on par with the convolutional baseline for the TexturedMNIST1 dataset and even outperforms it in the two-digit case. We hypothesize that one reason for this result is that grouping allows for the construction of efficient invariant features already in the low layers without losing information about the assignment of features to objects. Convolutional networks solve this problem to some degree by grouping features locally through the use of receptive fields, but that strategy is expensive and can break down in cases of heavy overlap.

3.4 Semi-Supervised Learning

Training TAG does not rely on labels and is therefore directly usable in a semi-supervised context. For semi-supervised learning, the Ladder network Rasmus et al. (2015) is arguably one of the strongest baselines, with state-of-the-art results on permutation-invariant MNIST classification with both 1,000 and 60,000 labels. We follow the common practice of using 1,000 labeled samples and 49,000 unlabeled samples for training Tagger and the Ladder baselines. For completeness, we also report results of the convolutional (ConvNet) and fully connected (FC) baselines trained fully supervised on only 1,000 samples.

From the results in Table 2, it is obvious that all the fully supervised methods fail on this task with 1,000 labels. The best result for the single-digit case, approximately 52 % error, is achieved by ConvNet, which still performs only at chance level for two-digit classification. The best baseline result is achieved by the FC Ladder, which reaches 30.5 % error for one digit but 68.5 % for TextureMNIST2.

For both datasets, Tagger achieves by far the lowest error rates: 10.5 % and 24.9 %, respectively. Again, this difference is amplified for the two-digit case, where Tagger with 1,000 labels even outperforms the Ladder baseline with all 50k labels. This result matches our intuition that grouping can often segment out objects even of an unknown class and thus help select the relevant features for learning. This is particularly important in semi-supervised learning, where the inability to self-classify unlabeled samples can easily mean that the network fails to learn from them at all.

To put these results in context, we performed informal tests with five human subjects. The task turned out to be quite difficult and the subjects needed to have regular breaks to be able to maintain focus. The subjects improved significantly over training for a few days but there were also significant individual differences. The best performing subjects scored around 10 % error for TextureMNIST1 and 30 % error for TextureMNIST2. For the latter task, the test subject took over 30 seconds per sample.

Dataset Method Error (%) 50k labels Error (%) 1k labels Model details
TextureMNIST1 FC MLP 31.1 ± 2.2 89.0 ± 0.2 2000-2000-2000 / 1000-1000
FC Ladder 7.2 ± 0.1 30.5 ± 0.5 3000-2000-1000-500-250
FC Tagger (ours) 4.0 ± 0.3 10.5 ± 0.9 3000-2000-1000-500-250
ConvNet 3.9 ± 0.3 52.4 ± 5.3 based on Model C Springenberg et al. (2014)
TextureMNIST2 FC MLP 55.2 ± 1.0 79.4 ± 0.3 2000-2000-2000 / 1000-1000
FC Ladder 41.1 ± 0.2 68.5 ± 0.2 3000-2000-1000-500-250
FC Tagger (ours) 7.9 ± 0.3 24.9 ± 1.8 3000-2000-1000-500-250
ConvNet 12.6 ± 0.4 79.1 ± 0.8 based on Model C Springenberg et al. (2014)
Table 2: Test-set classification errors for textured one-digit MNIST (chance level: 90 %) and top-2 error on the textured two-digit MNIST dataset (chance level: 80 %). We report mean and sample standard deviation over 5 runs. FC = Fully Connected

4 Related work

Attention models have recently become very popular, and, similar to perceptual grouping, they help in dealing with complex structured inputs. These approaches are not, however, mutually exclusive and can benefit from each other. Overt attention models Schmidhuber & Huber (1991); Eslami et al. (2016) control a window (fovea) to focus on relevant parts of the inputs. Two of their limitations are that they are mostly tailored to the visual domain and are usually only suited to objects that are roughly the same shape as the window. But their ability to limit the field of view can help to reduce the complexity of the target problem and thus also help segmentation. Soft attention mechanisms Schmidhuber (1993a); Deco (2001); Yli-Krekola et al. (2009), on the other hand, use some form of top-down feedback to suppress inputs that are irrelevant for a given task. These mechanisms have recently gained popularity, first in machine translation Bahdanau et al. (2014) and then for many other problems such as image caption generation Xu et al. (2015). Because they re-weight all the inputs based on their relevance, they could benefit from a perceptual grouping process that can refine the precise boundaries of attention.

Our work is primarily built upon a line of research based on the concept that the brain uses synchronization of neuronal firing to bind object representations together. This view was introduced by von der Malsburg (1981) and has inspired many early works on oscillations in neural networks (see the survey von der Malsburg (1995) for a summary). Simulating the oscillations explicitly is costly and does not mesh well with modern neural network architectures (but see Meier et al. (2014)). Rather, complex values have been used to model oscillating activations, using the phase as a soft tag for synchronization Rao et al. (2008); Reichert & Serre (2013). In our model, we further abstract this by using discretized synchronization slots (our groups). Our model is most similar to those of Wersing et al. (2001), Hyvärinen & Perkiö (2006), and Greff et al. (2015). However, our work is the first to combine this idea with denoising autoencoders in an end-to-end trainable fashion.

Another closely related line of research Saund (1995); Ross & Zemel (2006) has focused on multi-causal modeling of the inputs. Many of the works in that area Le Roux et al. (2011); Tang et al. (2012); Sohn et al. (2013); Huang & Murphy (2015) build upon Restricted Boltzmann Machines. Each input is modeled as a mixture model with a separate latent variable for each object. Because exact inference is intractable, these models approximate the posterior with some form of expectation maximization Dempster et al. (1977) or sampling procedure. Our assumptions are very similar to these approaches, but we allow the model to learn the amortized inference directly (more in line with Goodfellow et al. (2013)).
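To make the contrast concrete, the non-amortized alternative can be sketched as a toy expectation-maximization loop that softly groups the elements of a vector under a mixture of scalar Gaussians. This is a generic illustration with made-up sizes and a fixed variance, not an implementation of any of the cited models:

```python
import numpy as np

rng = np.random.default_rng(0)

def em_grouping(x, K, iters=20, var=0.05):
    """Toy EM for grouping: each element of x is explained by one of K
    scalar 'object' means; returns soft assignments r (K, D) and means mu (K,)."""
    mu = rng.choice(x, size=K)                        # random init of group means
    for _ in range(iters):
        # E-step: responsibility of each group for each input element
        log_r = -0.5 * (x[None, :] - mu[:, None]) ** 2 / var
        r = np.exp(log_r - log_r.max(axis=0))
        r /= r.sum(axis=0, keepdims=True)
        # M-step: re-estimate each group mean from its responsibilities
        mu = (r * x[None, :]).sum(axis=1) / r.sum(axis=1)
    return r, mu

# Two "objects" (values near 0 and near 1) mixed into one input vector:
x = np.concatenate([np.zeros(5), np.ones(5)]) + 0.01 * rng.standard_normal(10)
r, mu = em_grouping(x, K=2)
```

Every E/M sweep here is an explicit inference iteration; an amortized model instead learns a parametric mapping that produces the assignments directly.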

Since recurrent neural networks (RNNs) are general-purpose computers, they can in principle implement arbitrary computable types of temporary variable binding Schmidhuber (1992b, 1993a), unsupervised segmentation Schmidhuber (1992a), and internal Schmidhuber (1993a) and external attention Schmidhuber & Huber (1991). For example, an RNN with fast weights Schmidhuber (1993a) can rapidly associate or bind the patterns to which the RNN currently attends. Similar approaches even allow for metalearning Schmidhuber (1993b), that is, learning a learning algorithm. Hochreiter et al. (2001), for example, learned fast online learning algorithms for the class of all quadratic functions of two variables. Unsupervised segmentation could therefore in principle be learned by any RNN as a by-product of data compression or any other given task.

The recurrent architecture most similar to Tagger is the Neural Abstraction Pyramid (NAP; Behnke, 1999), a convolutional neural network augmented with lateral connections that help resolve local ambiguities and feedback connections that allow the incorporation of high-level information. In early pioneering work, the NAP was trained for iterative image binarization Behnke (2003) and iterative image denoising Behnke (2001), much akin to the setup we use. Being recurrent, the NAP layers, too, could in principle learn perceptual grouping as a by-product. That does not, however, imply that every RNN will, through learning, easily discover and implement this tool. The main improvement our framework adds is an explicit mechanism for the network to split the input into multiple representations and thus quickly and efficiently learn a grouping mechanism. We believe this special case of computation to be important enough for many real-world tasks to justify the added complexity.

5 Future Work

So far we have assumed the groups to represent independent objects or events. This assumption is, however, unrealistic in many cases. Assuming only conditional independence would be considerably more reasonable, and could be implemented by allowing all groups to share the top layer of their Ladder networks.

The TAG framework assumes just one level of (global) groups, which does not reflect the hierarchical structure of the world. Another important future extension is therefore to use a hierarchy of local groupings, employing our model as a component of a larger system. This could be achieved by collapsing the groups of a Tagger network, summing them together at some hidden layer. This abstract representation could then serve as the input of another Tagger with new groupings at this higher level. We hypothesize that a hierarchical Tagger could also represent relations between objects, because relations are simply the couplings that remain after assuming independent objects.
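The collapsing step described above amounts to a sum over the group axis. A minimal NumPy sketch, with hypothetical shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 4, 16                       # number of groups, hidden-layer width (hypothetical)

# One hidden-layer activation vector per group of a lower-level Tagger.
group_states = rng.standard_normal((K, D))

# Collapse the groups by summing them at this hidden layer; the result is a
# single abstract representation that a higher-level Tagger could regroup.
collapsed = group_states.sum(axis=0)
```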

Movement is a strong segmentation cue, and a simple temporal extension of the TAG framework would be to allow information to flow forward in time between the higher layers, not just via the inputs. Iteration would then occur in time, alongside the changing inputs. We believe that these extensions will make it possible to scale the approach to video.

6 Conclusion

In this paper, we have argued that the ability to group input elements and internal representations is a powerful tool that can improve a system’s ability to handle complex multi-object inputs. We have introduced the TAG framework, which enables a network to directly learn the grouping and the corresponding amortized iterative inference in an unsupervised manner. The resulting iterative inference is very efficient and converges within five iterations. We have demonstrated the benefits of this mechanism for a heavily cluttered classification task, in which our fully connected Tagger even significantly outperformed a state-of-the-art convolutional network. More impressively, we have shown that our mechanism can greatly improve semi-supervised learning, exceeding conventional Ladder networks by a large margin. Our method makes minimal assumptions about the data and can be applied to any modality. With TAG, we have barely scratched the surface of a comprehensive integrated grouping mechanism, but we already see significant advantages. We believe grouping to be crucial to human perception and are convinced that it will help to scale neural networks to even more complex tasks in the future.

Acknowledgments

The authors wish to acknowledge useful discussions with Theofanis Karaletsos, Jaakko Särelä, Tapani Raiko, and Søren Kaae Sønderby. We further thank Rinu Boney, Timo Haanpää, and the rest of the Curious AI Company team for their support, computational infrastructure, and help with human testing. This research was supported by the EU project “INPUT” (H2020-ICT-2015 grant no. 687795).


Appendix A Supplementary Material

A.1 Notation

Symbol          Space           Description
D               ℕ               input dimensionality
K               ℕ               total number of groups
M               ℕ               input and output dimension of the parametric mapping
i               -               iteration index
j               -               input element index
k               -               group index
x               ℝ^D             input vector with elements x_j
x̃               ℝ^D             corrupted input
z_k             ℝ^D             the predicted mean of the input for group k
m_k             [0, 1]^D        probabilities for each input element to be assigned to group k
δz_k            ℝ^D             modeling error for group k
L_k             [0, 1]^D        group assignment likelihood ratio
C(x)            ℝ               the training loss for input x
v               ℝ_{>0}          variance of the input estimate; only used in the continuous case
W^h             -               projection weights from the Tagger inputs to the Ladder inputs
W^z, W^m        -               projection weights from the Ladder output to z_k and m_k
θ               -               all parameters of the Ladder
ReLU(·)         -               rectified linear activation function
σ(·)            -               logistic sigmoid activation function
softmax_k(·)    -               elementwise softmax over the groups
g_j             {1, …, K}       latent random variable that encodes which group x_j belongs to
g               {1, …, K}^D     a vector of all g_j
p(x | x̃)        -               posterior of the data given the corrupted data
q(x)            -               learnt approximation of p(x | x̃)
Figure 6: Illustration of the TAG framework used for training. Left: The system learns by denoising its input over iterations, using several groups to distribute the representation. Each group k, represented by several panels of the same color, maintains its own estimate z_k of a reconstruction of the input and a corresponding mask m_k, which encodes the parts of the input that this group is responsible for representing. These estimates are updated over iterations by the same network; that is, each group and iteration share the weights of the network and only the inputs to the network differ. In the case of images, z_k contains pixel values. Right: In each iteration, z_k and m_k from the previous iteration are used to compute a likelihood term L_k and a modeling error δz_k. These four quantities are fed to the parametric mapping to produce z_k and m_k for the next iteration. During learning, all inputs to the network are derived from the corrupted input x̃, as shown here. The unsupervised task for the network is to learn to denoise, i.e. to output an estimate of the original clean input x.

A.2 Input

In its basic form (without supervision), Tagger receives as input only a datapoint x. It is either a binary vector or a real-valued vector and is corrupted with bitflip or Gaussian noise, respectively. The training objective is the removal of this noise.

Bitflip Noise

In the case of binary inputs we use bitflip noise for corruption:

x̃ = x ⊕ b,

where ⊕ denotes componentwise XOR and b is Bernoulli-distributed noise with flip probability β. In our experiments on the Shapes dataset we use a fixed value of β.

Gaussian Noise

If the inputs are real-valued, we corrupt them using Gaussian noise:

x̃ = x + ε,   ε ~ N(0, σ²I),

where σ is the standard deviation of the input noise. We used a fixed value of σ in our experiments.
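Both corruption schemes are straightforward. A NumPy sketch, with placeholder values for the flip probability and the noise scale (the paper's actual settings are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_bitflip(x, beta):
    """Flip each bit of a binary input independently with probability beta."""
    b = rng.random(x.shape) < beta            # Bernoulli(beta) noise
    return np.logical_xor(x, b).astype(x.dtype)

def corrupt_gaussian(x, sigma):
    """Add isotropic Gaussian noise with standard deviation sigma."""
    return x + sigma * rng.standard_normal(x.shape)

x_bin = rng.integers(0, 2, size=20)
x_real = rng.standard_normal(20)

x_bin_tilde = corrupt_bitflip(x_bin, beta=0.2)      # placeholder beta
x_real_tilde = corrupt_gaussian(x_real, sigma=0.2)  # placeholder sigma
```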

A.3 Group Assignments

Within the TAG framework, the group assignment is represented by the vectors m_k, which contain one entry for each input element (or pixel). The entries m_{k,j} represent a discrete probability distribution over the K groups for each input element x_j. They therefore sum to one:

Σ_k m_{k,j} = 1.    (7)

Initialization

Similar to expectation maximization, the group assignment is initialized randomly, but such that Equation 7 holds. We first sample an auxiliary ξ_{k,j} from a standard Gaussian distribution and then normalize it using a softmax over the groups:

ξ_{k,j} ~ N(0, 1),    (8)
m_{k,j} = softmax_k(ξ_{k,j}) = exp(ξ_{k,j}) / Σ_{k′} exp(ξ_{k′,j}).    (9)
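This initialization can be sketched in a few lines of NumPy; the sizes below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_group_assignments(K, D):
    """Random initialization of the masks m (shape K x D): sample
    standard-Gaussian auxiliaries and softmax-normalize over the groups,
    so that the K entries for every input element sum to one (Eq. 7)."""
    xi = rng.standard_normal((K, D))
    e = np.exp(xi - xi.max(axis=0, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=0, keepdims=True)

m = init_group_assignments(K=4, D=10)
```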

A.4 Predicted Inputs

Tagger maintains an input reconstruction z_k for each group k.

Binary Case

In the binary case we apply a sigmoid activation function to obtain z_k and interpret it directly as the probability

p(x_j = 1 | g_j = k) = z_{k,j},    (11)

so that the element-wise likelihood is p(x_j | g_j = k) = z_{k,j}^{x_j} (1 − z_{k,j})^{1−x_j}. We can use it to compute the likelihood of the corrupted input, which will be used for the modeling error (Section A.5) and the group likelihood. Since the corruption flips each bit independently with probability β, we get

p(x̃_j = 1 | g_j = k) = z_{k,j}(1 − β) + (1 − z_{k,j})β.

Marginalizing over the group assignments with weights m_{k,j} gives the mixture likelihood of each input element. Therefore we have:

C(x) = −Σ_j log Σ_k m_{k,j} p(x_j | g_j = k).    (18)

Continuous Case

For the continuous case we interpret z_k as the means of an isotropic Gaussian with learned variance v:

p(x_j | g_j = k) = N(x_j; z_{k,j}, v).    (19)

Using the additivity of independent Gaussian random variables we directly get:

p(x̃_j | g_j = k) = N(x̃_j; z_{k,j}, v + σ²).    (20)

Initialization

For simplicity we initialize all z_k to the expectation of the data for all groups k; in our experiments this is a constant value per dataset for both the TextureMNIST datasets and the Shapes dataset.
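A minimal NumPy sketch of these per-element likelihoods and the resulting mixture cost, assuming the bitflip probability beta and the noise level sigma are known; the function names are ours:

```python
import numpy as np

def likelihood_binary(x, z, beta):
    """Element-wise likelihood p(x_j | g_j = k) of a binary input observed
    through bitflip noise with flip probability beta.
    x: (D,) in {0, 1}; z: (K, D) predicted means in (0, 1)."""
    zeta = z * (1.0 - beta) + (1.0 - z) * beta    # p(x_j = 1 | g_j = k)
    return np.where(x == 1, zeta, 1.0 - zeta)

def likelihood_gaussian(x, z, v, sigma):
    """Element-wise likelihood of a Gaussian-corrupted real input: the model
    variance v and the corruption variance sigma**2 simply add."""
    var = v + sigma ** 2
    return np.exp(-0.5 * (x - z) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def denoising_cost(likelihoods, m):
    """Negative log-likelihood of the group mixture; likelihoods and m
    have shape (K, D), the result is a scalar."""
    return -np.log((m * likelihoods).sum(axis=0)).sum()

# Tiny usage example with one group and two input elements:
lik = likelihood_binary(np.array([1, 0]), np.array([[0.9, 0.1]]), beta=0.0)
cost = denoising_cost(lik, np.ones((1, 2)))
```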

A.5 Modeling Error

As explained in Section 2.1, δz_k carries information about the remaining modeling error. During training as a denoiser, we can only allow information about the corrupted x̃ as input, but not about the original clean x. Therefore, we use the derivative of the cost evaluated on the corrupted input as helpful information for the parametric mapping. Since we work with the input elements individually, we omit the index j in the following:

δz_k = ∂C(x̃)/∂z_k.    (21)

More precisely, for a single iteration (omitting the index i), the chain rule gives

∂C(x̃)/∂z_k = L_k · ∂/∂z_k [−log p(x̃ | g = k)],    with L_k = m_k p(x̃ | g = k) / Σ_{k′} m_{k′} p(x̃ | g = k′),

where L_k is the posterior weight of group k under the mixture.
Continuous Case

For the continuous case, where p(x̃ | g = k) = N(x̃; z_k, v + σ²), this gives us

∂/∂z_k [−log p(x̃ | g = k)] = (z_k − x̃)/(v + σ²),

and hence

δz_k = L_k (z_k − x̃)/(v + σ²) ∝ L_k (z_k − x̃).

Note that since the network will multiply its inputs by weights, we can always omit any constant multipliers.

Binary Case

Let us denote the corruption bit-flip probability by β and define

ζ_k := p(x̃ = 1 | g = k) = z_k(1 − β) + (1 − z_k)β = β + (1 − 2β) z_k.

Then we get

∂/∂z_k [−log p(x̃ | g = k)] = (1 − 2β) [(1 − x̃)/(1 − ζ_k) − x̃/ζ_k],

and thus

δz_k = L_k (1 − 2β) (ζ_k − x̃) / (ζ_k (1 − ζ_k)),

which simplifies for x̃ = 1 to −L_k (1 − 2β)/ζ_k and for x̃ = 0 to L_k (1 − 2β)/(1 − ζ_k). Putting it back together and omitting the constant factor (1 − 2β):

δz_k ∝ L_k (ζ_k − x̃) / (ζ_k (1 − ζ_k)).
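A NumPy sketch of the two modeling-error terms, under the assumption that L denotes the posterior group weights and with the constant factors omitted as discussed:

```python
import numpy as np

def delta_z_continuous(x_tilde, z, L):
    """Modeling error for real-valued inputs: posterior-weighted residual,
    with the constant factor 1/(v + sigma**2) omitted."""
    return L * (z - x_tilde)

def delta_z_binary(x_tilde, z, L, beta):
    """Modeling error for binary inputs under bitflip corruption,
    with the constant factor (1 - 2*beta) omitted."""
    zeta = beta + (1.0 - 2.0 * beta) * z           # p(x_tilde = 1 | group)
    return L * (zeta - x_tilde) / (zeta * (1.0 - zeta))

# With beta = 0 and a maximally uncertain prediction z = 0.5:
dz = delta_z_binary(x_tilde=1.0, z=0.5, L=1.0, beta=0.0)
```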

A.6 Ladder Modifications

We mostly used the specifications of the Ladder network as described by Rasmus et al. (2015), but we made some minor modifications to fit it to the TAG framework. We found that the model becomes more stable over iterations when we add a sigmoid function to the gating variable (Rasmus et al., 2015, Equation 2) used in all the decoder layers with continuous outputs. None of the noise sources or denoising costs were in use (i.e., the denoising-cost weights in Eq. 3 of Rasmus et al. (2015) were set to zero for all layers), but the Ladder's classification cost was added to Tagger's cost for the semi-supervised tasks.

All four inputs (z_k, m_k, δz_k, and L_k) were concatenated and projected to a hidden representation that served as the input layer of the Ladder network. Subsequently, the values for the next iteration were simply read from the reconstruction (x̂ in Rasmus et al. (2015)) and projected linearly into z_k and, via a softmax over the groups, into m_k, to enforce the conditions in Equation 7. For the binary case, we used a logistic sigmoid activation for z_k.
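The wiring around the Ladder body can be sketched as follows; the layer sizes, the weight names, and the stand-in for the Ladder itself are our assumptions (any denoising network could take its place):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 32                                    # input width, hidden width (placeholders)

# Hypothetical projection weights.
W_in = rng.standard_normal((4 * D, H)) * 0.1    # Tagger inputs -> Ladder input
W_out = rng.standard_normal((H, 2 * D)) * 0.1   # Ladder output -> z and mask logits

def ladder(h):
    """Placeholder for the Ladder network body (a trivial stand-in)."""
    return np.maximum(h, 0.0)

def tagger_step_single_group(z, m, dz, L):
    """One group's update: concatenate the four inputs, run the (stand-in)
    ladder, and read out the next z and the pre-softmax mask logits."""
    h = np.concatenate([z, m, dz, L]) @ W_in
    u = ladder(h)
    out = u @ W_out
    z_next, m_logits = out[:D], out[D:]
    return z_next, m_logits     # m_logits are softmax-normalized across groups later

z1, logits = tagger_step_single_group(np.zeros(D), np.full(D, 0.25),
                                      np.zeros(D), np.full(D, 0.25))
```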

A.7 Pseudocode

In this section we put it all together and provide pseudocode for running Tagger on both binary inputs (Algorithm 3) and real-valued inputs (Algorithm 2). The code shows the steps needed to run for I iterations on a single example using K groups. We use three activation functions: ReLU is the rectified linear function, σ is the logistic sigmoid, and softmax is a softmax operation over the groups. All three include a batch-normalization operation, which we omit for clarity. Only the forward pass for a single example is shown; derivatives of the cost with respect to the parameters (the projection weights and θ) are computed using regular backpropagation through time. For training we use the Adam optimizer with mini-batches.

Data: clean input x, corrupted input x̃
Result: denoising cost C
begin Initialization:
       z_k ← E[x] for all k;
       m ← softmax(ξ) with ξ_{k,j} ~ N(0, 1);
       C ← 0;
end
for i = 1 to I do
       for k = 1 to K do
             L_k ← m_k ⊙ N(x̃; z_k, v + σ²) / Σ_{k′} m_{k′} ⊙ N(x̃; z_{k′}, v + σ²);
             δz_k ← L_k ⊙ (z_k − x̃);
             h_k ← ReLU(W^h [z_k, m_k, δz_k, L_k]);
             u_k ← Ladder(h_k; θ);
             z_k ← W^z u_k;  m̂_k ← W^m u_k;
       end for
       m ← softmax(m̂);
       C ← C − Σ_j log Σ_k m_{k,j} N(x_j; z_{k,j}, v + σ²);
end for
return C;
Algorithm 2: Pseudocode for running Tagger on a single real-valued example x.
Data: clean input x, corrupted input x̃
Result: denoising cost C
begin Initialization:
       z_k ← E[x] for all k;
       m ← softmax(ξ) with ξ_{k,j} ~ N(0, 1);
       C ← 0;
end
for i = 1 to I do
       for k = 1 to K do
             ζ_k ← β + (1 − 2β) z_k;
             L_k ← m_k ⊙ B(x̃; ζ_k) / Σ_{k′} m_{k′} ⊙ B(x̃; ζ_{k′}), where B(x̃; ζ) = ζ^{x̃} ⊙ (1 − ζ)^{1−x̃};
             δz_k ← L_k ⊙ (ζ_k − x̃) / (ζ_k ⊙ (1 − ζ_k));
             h_k ← ReLU(W^h [z_k, m_k, δz_k, L_k]);
             u_k ← Ladder(h_k; θ);  z_k ← σ(W^z u_k);  m̂_k ← W^m u_k;
       end for
       m ← softmax(m̂);
       C ← C − Σ_j log Σ_k m_{k,j} z_{k,j}^{x_j} (1 − z_{k,j})^{1−x_j};
end for
return C;
Algorithm 3: Pseudocode for running Tagger on a single binary example x.
Algorithm 3 Pseudocode for running Tagger on a single binary example .