Stacked Capsule Autoencoders

by Adam R. Kosiorek, et al.
University of Oxford

An object can be seen as a geometrically organized set of interrelated parts. A system that makes explicit use of these geometric relationships to recognize objects should be naturally robust to changes in viewpoint, because the intrinsic geometric relationships are viewpoint-invariant. We describe an unsupervised version of capsule networks, in which a neural encoder, which looks at all of the parts, is used to infer the presence and poses of object capsules. The encoder is trained by backpropagating through a decoder, which predicts the pose of each already discovered part using a mixture of pose predictions. The parts are discovered directly from an image, in a similar manner, by using a neural encoder, which infers parts and their affine transformations. The corresponding decoder models each image pixel as a mixture of predictions made by affine-transformed parts. We learn object capsules and their part capsules on unlabeled data, and then cluster the vectors of presences of object capsules. When told the names of these clusters, we achieve state-of-the-art results for unsupervised classification on SVHN (55%) and near state-of-the-art results on MNIST (98.5%).



1 Introduction

Convolutional neural networks (CNNs) work better than networks without weight-sharing because of their inductive bias: if a local feature is useful in one image location, the same feature is likely to be useful in other locations. It is tempting to exploit other effects of viewpoint changes by replicating features across scale, orientation and other affine degrees of freedom, but this quickly leads to cumbersome high-dimensional feature maps.

An alternative to replicating features across the non-translational degrees of freedom is to explicitly learn transformations between the natural coordinate frame of a whole object and the natural coordinate frames of each of its parts. Computer graphics relies on such object-part coordinate transformations to represent the geometry of an object in a viewpoint-invariant manner. Moreover, there is strong evidence that, unlike standard CNNs, human vision also relies on coordinate frames: imposing an unfamiliar coordinate frame on a familiar object makes it difficult to recognize the object or its geometry (Rock73; Hinton79).

A neural system can learn to reason about transformations between objects, their parts and the viewer, but each type of transformation is likely to require a different representation. An object-part relationship (OP) is viewpoint-invariant and is naturally coded by learned weights. The relationship of an object or part to the viewer changes with the viewpoint (it is viewpoint-equivariant) and is naturally coded using neural activations. (This may explain why accessing perceptual knowledge about objects, when they are not visible, requires creating a mental image of the object with a specific viewpoint.) With this representation, the pose of a single object is represented by its relationship to the viewer. Consequently, representing a single object does not necessitate replicating neural activations across space, unlike in CNNs. It is only processing two (or more) different instances of the same type of object in parallel that requires spatial replicas of both model parameters and neural activations.

In this paper we propose the Stacked Capsule Autoencoder (SCAE), which has two stages (Fig. 1). The first stage, the Part Capsule Autoencoder (PCAE), segments an image into constituent parts, infers their poses, and reconstructs each image pixel as a mixture of the pixels of transformed part templates. The second stage, the Object Capsule Autoencoder (OCAE), tries to organize discovered parts and their poses into a smaller set of objects that can explain the part poses using a separate mixture of predictions for each part. Every object capsule contributes components to each of these mixtures by multiplying its pose, the object-viewer relationship (OV), by the relevant object-part relationship (OP). (The type of a part capsule may determine which, if any, of an object's parts contribute to the mixture used to model the pose of an already discovered part.)

Stacked Capsule Autoencoders (Section 2) capture spatial relationships between whole objects and their parts when trained on unlabelled data. The vectors of presence probabilities for the object capsules tend to form tight clusters, and when we assign a class to each cluster we achieve state-of-the-art results for unsupervised classification on SVHN (55%) and near state-of-the-art results on MNIST (98.5%), which can be further improved to 67% and 99%, respectively, by learning fewer than 300 parameters. We also present promising proof-of-concept results on CIFAR10 (Section 3). We describe related work in Section 4 and discuss implications of our work and future directions in Section 5.

Figure 1: Stacked Capsule Autoencoder (SCAE): (a) part capsules segment the input into parts and their poses. The poses are then used to reconstruct the input by affine-transforming learned templates. (b) object capsules try to arrange inferred poses into objects, thereby discovering underlying structure. The SCAE is trained by maximizing image and part log-likelihoods subject to sparsity constraints.

2 Stacked Capsule Autoencoders (SCAE)

Segmenting an image into parts is non-trivial, so we begin by abstracting away pixels and the part-discovery stage, and develop the Constellation Autoencoder (CCAE) (Section 2.1). It uses two-dimensional points as parts, and their coordinates are given as the input to the system. The CCAE learns to model sets of points as arrangements of familiar constellations, each of which has been transformed by an independent similarity transform. It learns to assign individual points to their respective constellations without knowing the number of constellations or their individual shapes in advance. Next, in Section 2.2, we develop the Part Capsule Autoencoder (PCAE), which learns to infer parts and their poses from images. Finally, we stack the Object Capsule Autoencoder (OCAE), which closely resembles the CCAE, on top of the PCAE to form the Stacked Capsule Autoencoder (SCAE).

2.1 Constellation Autoencoder (CCAE)

Let $\mathbf{x}_{1:M}$ be a set of $M$ two-dimensional input points, where every point belongs to a constellation as in Figure 2. We first encode all input points (which take the role of part capsules) with Set Transformer (Lee2019set), a permutation-invariant encoder $h^{\mathrm{caps}}$ based on attention mechanisms, into $K$ object capsules. An object capsule $k$ consists of a capsule feature vector $\mathbf{c}_k$, its presence probability $a_k$ and an object-viewer relationship (OV) matrix, which represents the affine transformation between the object (constellation) and the viewer. Note that each object capsule can represent only one object at a time. Every object capsule uses a separate MLP $h_k^{\mathrm{part}}$ to predict $N$ part candidates from the capsule feature vector $\mathbf{c}_k$. Each candidate consists of the conditional probability $a_{k,n}$ that a given candidate part exists, an associated scalar standard deviation $\lambda_{k,n}$, and an object-part relationship (OP) matrix, which represents the affine transformation between the object capsule and the candidate part. (Deriving these matrices from the capsule feature vector allows for deformable objects. We model OPs as the sum of an input-dependent component and a constant bias. We encourage different capsules to specialize to different constellations by putting a strong penalty on the former.) Candidate predictions $\mu_{k,n}$ are given by the product of the object capsule OV and the candidate OP matrices. We then model each input part as a Gaussian mixture, where $\mu_{k,n}$ and $\lambda_{k,n}$ are the centers and standard deviations of the isotropic components. See Figures 5 and 1 for illustration; a formal description follows:

$$
\begin{aligned}
\mathrm{OV}_{1:K},\ \mathbf{c}_{1:K},\ a_{1:K} &= h^{\mathrm{caps}}(\mathbf{x}_{1:M}) && \text{encode object capsule parameters,} \\
\mathrm{OP}_{k,1:N},\ a_{k,1:N},\ \lambda_{k,1:N} &= h_k^{\mathrm{part}}(\mathbf{c}_k) && \text{decode candidate parameters from the } \mathbf{c}_k\text{'s,} \\
V_{k,n} &= \mathrm{OV}_k\, \mathrm{OP}_{k,n} && \text{decode a part pose candidate,} \\
p(\mathbf{x}_m \mid k, n) &= \mathcal{N}(\mathbf{x}_m \mid \mu_{k,n}, \lambda_{k,n}) && \text{turn candidates into mixture components.}
\end{aligned}
$$


The model is trained without supervision by maximizing the likelihood of part capsules in Equation 1 subject to sparsity constraints, cf. Section 2.4. A part capsule can be assigned to the object capsule with the highest posterior probability for it under the mixture model. (We treat parts as independent and evaluate their probability under the same mixture model. While there are no clear 1:1 connections between parts and predictions, it seems to work well in practice.)
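The mixture described in this section can be sketched in NumPy. This is a minimal illustration rather than the authors' implementation: it assumes that the mixing weights are proportional to the product of object presence and conditional candidate presence, and that the translation column of each 3×3 candidate matrix serves as the predicted point. The function name `ccae_log_likelihood` and its argument names are hypothetical.

```python
import numpy as np

def ccae_log_likelihood(points, ov, op, a_obj, a_cand, lam):
    """Log-likelihood of 2D points under a mixture of capsule predictions.

    points: (M, 2) input points
    ov:     (K, 3, 3) object-viewer matrices
    op:     (K, N, 3, 3) object-part matrices
    a_obj:  (K,) object presence probabilities
    a_cand: (K, N) conditional candidate presence probabilities
    lam:    (K, N) standard deviations of the isotropic components
    """
    # V[k, n] = OV[k] @ OP[k, n]; its translation column is used as the
    # predicted center (an assumption of this sketch).
    v = np.einsum("kij,knjl->knil", ov, op)
    mu = v[..., :2, 2]                                    # (K, N, 2)

    # Isotropic 2D Gaussian log-density of every point under every candidate.
    diff = points[:, None, None, :] - mu[None]            # (M, K, N, 2)
    sq_dist = np.sum(diff ** 2, axis=-1)                  # (M, K, N)
    log_gauss = -sq_dist / (2 * lam ** 2) - np.log(2 * np.pi * lam ** 2)

    # Mixing weights proportional to a_k * a_{k,n} (assumed normalization).
    w = a_obj[:, None] * a_cand
    log_w = np.log(w / w.sum())

    joint = log_w[None] + log_gauss                       # (M, K, N)
    flat = joint.reshape(len(points), -1)
    peak = flat.max(axis=1, keepdims=True)                # stable log-sum-exp
    log_px = peak[:, 0] + np.log(np.exp(flat - peak).sum(axis=1))

    # Assign each point to the object capsule that owns its best candidate.
    parent = joint.max(axis=2).argmax(axis=1)             # (M,)
    return log_px.sum(), parent
```

Summing the per-point log-probabilities reflects the treatment of parts as independent draws from the same mixture.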

Figure 2: Unsupervised segmentation of points belonging to up to three constellations of squares and triangles at different positions, scales and orientations. The model is trained to reconstruct the points (top row) under the CCAE mixture model. The bottom row colors the points based on the parent with highest posterior probability in the mixture model. The right-most column shows a failure case. Note that the model uses sets of points, not pixels, as its input; we use images only to visualize the constellation arrangements.

Empirical results show that this model is able to perform unsupervised instance-level segmentation of points belonging to different constellations, even in data which is difficult to interpret for humans. See Figure 2 for an example and Section 3.1 for details.

2.2 Part Capsule Autoencoder (PCAE)

Explaining images as geometrical arrangements of parts requires first inferring what parts the images are composed of, as well as the relationships of the parts to the viewer (which we call their poses). For the CCAE a part is just a 2D point, but here each part capsule has a six-DOF pose, a presence variable and a unique identity. We frame the part-discovery problem as auto-encoding: the encoder learns to infer the poses and presences of different part capsules, while the decoder learns an image template for each part (Fig. 3), similar to Tieleman2014thesis and Eslami2016. The templates corresponding to present parts are affine-transformed using their poses, and the pixels of these transformed templates are used to create a separate mixture model for each image pixel. The PCAE is followed by an Object Capsule Autoencoder (OCAE), which closely resembles the CCAE and is described in Section 2.3.

Let $\mathbf{y}$ be the image. We limit the maximum number of part capsules to $M$ and use an encoder to infer their poses $\mathbf{x}_m$, presence probabilities $d_m$, and special features $\mathbf{z}_m$, one per part capsule. The latter do not take part in direct image reconstruction, but inform the OCAE about special aspects of the corresponding part; they are trained by backpropagating derivatives from the OCAE.

At present, we do not allow multiple occurrences of the same type of part in an image, so the part capsules themselves are not replicated across space, though they could be. However, we do need to recognize the part wherever it occurs in the image, and therefore the encoder consists of a CNN with a bottom-up attention mechanism: for every part capsule, it predicts a feature map of capsule parameters over a grid of spatial locations, as well as a single-channel attention mask. The final parameters for that capsule are computed as the sum of the feature map over spatial locations, weighted by the softmax of the attention mask, where the softmax is taken along the spatial dimensions. This is similar to global average pooling, but allows some spatial locations to contribute to the final result more than others; we call this approach attention-based pooling. Its effect on the model performance is analyzed in Section 3.3.
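Attention-based pooling as described here can be sketched for a single part capsule as follows. The function name `attention_pool` and the array layout (height, width, parameters) are assumptions made for illustration.

```python
import numpy as np

def attention_pool(features, attn_logits):
    """Pool a feature map into one parameter vector.

    features:    (H, W, D) feature map of capsule parameters
    attn_logits: (H, W) single-channel attention mask before the softmax
    """
    flat = attn_logits.reshape(-1)
    weights = np.exp(flat - flat.max())
    weights /= weights.sum()              # softmax over all H * W locations
    # Weighted sum of the per-location parameter vectors.
    return weights @ features.reshape(-1, features.shape[-1])   # (D,)
```

With a constant attention mask the weights are uniform and the result equals global average pooling; a sharply peaked mask selects the parameters predicted at a single location.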

The image pixels are modelled as independent Gaussian mixtures. For every pixel, we take the corresponding pixels of the transformed templates and treat them as centers of isotropic Gaussian components with constant variance. Their mixing probabilities are proportional to both the presence probabilities of part capsules and a function $f_c$ of the color value at that location, where $c$ is the number of image channels. (Templates are assumed to be sparse; if there exists a template that has a non-zero value at a given location, then this template should be used.) More formally:

$$
\begin{aligned}
\mathbf{x}_{1:M},\ d_{1:M},\ \mathbf{z}_{1:M} &= h^{\mathrm{enc}}(\mathbf{y}) && \text{encode the image to part capsule parameters,} \\
\widehat{T}_m &= \mathrm{TransformImage}(T_m, \mathbf{x}_m) && \text{apply affine transforms to image templates,} \\
p^{y}_{m,i,j} &\propto d_m\, f_c(\widehat{T}_{m,i,j}) && \text{compute mixing probabilities,} \\
p(\mathbf{y}) &= \prod_{i,j} \sum_{m=1}^{M} p^{y}_{m,i,j}\, \mathcal{N}(y_{i,j} \mid \widehat{T}_{m,i,j}, \sigma^2_y) && \text{calculate image likelihood.}
\end{aligned}
$$
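The per-pixel mixture can be sketched for single-channel images as follows. The sketch assumes that the templates have already been affine-transformed, that the function of the color value is simply the template intensity itself, and that a small constant keeps the mixing weights well defined where every template is zero. None of these choices is stated in the text, and the name `pcae_image_log_likelihood` is hypothetical.

```python
import numpy as np

def pcae_image_log_likelihood(image, transformed_templates, presence, sigma=1.0):
    """Image log-likelihood under independent per-pixel Gaussian mixtures.

    image:                 (H, W) target image
    transformed_templates: (M, H, W) templates after their affine transforms
    presence:              (M,) part presence probabilities
    sigma:                 shared standard deviation of all components
    """
    eps = 1e-8  # assumed smoothing so that empty pixels keep valid weights
    # Mixing weights proportional to presence times template intensity
    # (identity choice for the function of the color value, an assumption).
    mix = presence[:, None, None] * transformed_templates + eps
    mix = mix / mix.sum(axis=0, keepdims=True)                   # (M, H, W)

    # Gaussian log-density of each pixel around each template pixel.
    log_gauss = (-0.5 * ((image[None] - transformed_templates) / sigma) ** 2
                 - 0.5 * np.log(2 * np.pi * sigma ** 2))

    joint = np.log(mix) + log_gauss
    peak = joint.max(axis=0)                                     # stable log-sum-exp
    return np.sum(peak + np.log(np.exp(joint - peak).sum(axis=0)))
```

An image that matches the union of the present templates scores higher than a blank image, which is the behaviour the reconstruction objective relies on.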

2.3 Object Capsule Autoencoder (OCAE)

The next step is to find objects in the already discovered parts. (Discovered objects are not used top-down to refine the presences or poses of the parts during inference. However, the derivatives backpropagated via the OCAE refine the lower-level encoder network that infers the parts.) To do so, we use concatenated poses $\mathbf{x}_m$, special features $\mathbf{z}_m$ and flattened templates $T_m$ (which convey the identity of the part capsule) as an input to the OCAE, which differs from the CCAE in the following ways. Firstly, we feed part capsule presence probabilities $d_m$ into the OCAE's encoder; these are used to bias the Set Transformer's attention mechanism to not take absent points into account. Secondly, the $d_m$ are also used to weigh the part capsules' log-likelihood, cf. Equation 1. Additionally, we stop gradient on all of the OCAE's inputs except the special features to improve training stability and avoid the problem of collapsing latent variables; see Rasmus2015ladder. Finally, parts discovered by the PCAE have independent identities (templates and special features rather than 2D points). Therefore, every part pose is explained as an independent mixture of predictions from object capsules, where every object capsule makes exactly $M$ candidate predictions, that is, exactly one candidate prediction per part. Consequently, the part-capsule likelihood is a product over parts of these per-part mixtures, with each part's log-likelihood weighted by its presence probability $d_m$.

Figure 3: Templates learned on MNIST (left) as well as Sobel-filtered SVHN (middle) and CIFAR10 (right). In each case templates converge to strokes. For SVHN they often take the form of double strokes; this is due to Sobel filtering, which effectively extracts edges.
Figure 4: MNIST (a) images and their (b) reconstructions from part capsules in red and object capsules in green, with overlapping regions in yellow. Only a few object capsules are activated for every input (c) a priori (left) and even fewer are needed to reconstruct it (right). The most active capsules (d) capture object identity and the majority of information about its appearance. Finally, (e) affine-transformed templates show how exactly parts are used to reconstruct the images.

2.4 Achieving Sparse and Diverse Capsule Presences

Stacked Capsule Autoencoders are trained to maximise pixel and part log-likelihoods. If not constrained, however, they tend either to use all of the part and object capsules to explain every data example, or to collapse onto always using the same subset of capsules, regardless of the input. We would like the model to use different sets of part capsules for different input examples and to specialize object capsules to particular arrangements of parts; to encourage this, we impose sparsity and entropy constraints. We evaluate their importance in Section 3.3.

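We first define prior and posterior object-capsule presence as follows. For a minibatch of size $B$ with $K$ object capsules and $M$ part capsules we define a minibatch of prior capsule presence $a^{\mathrm{prior}}$ with dimension $[B, K]$ and posterior capsule presence $a^{\mathrm{posterior}}$ with dimension $[B, K, M]$ as

$$
a^{\mathrm{prior}}_{k} = a_k \max_m a_{k,m}, \qquad a^{\mathrm{posterior}}_{k,m} = a_k\, a_{k,m}\, \mathcal{N}(\mathbf{x}_m \mid m, k),
$$

respectively; the former is the maximum presence probability among predictions from object capsule $k$ while the latter is the unnormalized mixing probability used to explain part capsule $m$.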
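One reading of these two definitions, written as array operations over a minibatch, is sketched below. It assumes that the prior presence is the object presence multiplied by the largest conditional candidate presence, and that the posterior presence multiplies object presence, candidate presence and the Gaussian density of each part. The function and argument names are hypothetical.

```python
import numpy as np

def capsule_presences(a_obj, a_cand, gauss):
    """Prior and posterior object-capsule presences for a minibatch.

    a_obj:  (B, K) object presence probabilities
    a_cand: (B, K, M) conditional presence of each candidate prediction
    gauss:  (B, K, M) Gaussian density of part m under capsule k
    """
    a_prior = a_obj * a_cand.max(axis=-1)              # (B, K)
    a_posterior = a_obj[..., None] * a_cand * gauss    # (B, K, M), unnormalized
    return a_prior, a_posterior
```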


Prior Sparsity

Consider the average presence probability of each object capsule among different training examples, and the sum of object capsule presence probabilities for a given example. If we assume that training examples contain objects from $C$ different classes uniformly at random and we would like to assign the same number of object capsules to every class, then each class would obtain $K/C$ capsules, where $K$ is the number of object capsules. Moreover, if we assume that only one object is present in every image, then $K/C$ object capsules should be present for every input example. To this end, we minimize a loss that penalizes deviations of these two quantities from the values implied by these assumptions.

Posterior Sparsity

Similarly, we normalize the posterior capsule presences so that they can be treated as probability distributions. We find it beneficial to minimize the within-example entropy of capsule posterior presence and to maximize its between-example entropy; the final loss combines these two entropy terms.
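The entropy objective can be illustrated with a small sketch. The exact normalization is not recoverable from the text above, so the following is only one plausible interpretation: posterior presences are summed over parts, the within-example term is the entropy of each example's distribution over object capsules, and the between-example term is the entropy of the batch-aggregated distribution over object capsules. All names are hypothetical.

```python
import numpy as np

def entropy(p, axis):
    """Entropy of p after normalizing it along the given axis."""
    p = p / p.sum(axis=axis, keepdims=True)
    return -np.sum(p * np.log(p + 1e-12), axis=axis)

def posterior_sparsity_loss(a_posterior):
    """a_posterior: (B, K, M) unnormalized posterior capsule presences."""
    per_capsule = a_posterior.sum(axis=-1)                   # (B, K)
    # Low within-example entropy: each example uses few object capsules.
    within = entropy(per_capsule, axis=1).mean()
    # High between-example entropy: the batch spreads over many capsules.
    between = entropy(per_capsule.sum(axis=0), axis=0)
    return within - between
```

Under this interpretation the loss is lowest when each example activates a single capsule and different examples activate different capsules. It is higher both when every example uses the same capsule and when every example uses all capsules equally.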

Every active object capsule should explain at least two parts

We say that an object capsule has 'won' a part if it has the highest posterior mixing probability for that part among all object capsules. We then create binary labels for each of the object capsules, where the label is 1 if the capsule wins at least two parts and 0 otherwise. The final loss takes the form of binary cross-entropy between the generated label and the prior capsule presence. This loss is used only for the stand-alone constellation model experiments on point data, cf. Sections 3.1 and 2.1.
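The label construction and the resulting loss can be sketched for a single example as follows. The function names are hypothetical, and the clipping constant in the cross-entropy is an implementation detail added here for numerical safety.

```python
import numpy as np

def win_labels(a_posterior):
    """a_posterior: (K, M) posterior mixing probabilities for one example.

    Returns a (K,) array with 1.0 for capsules that win at least two parts.
    """
    winners = a_posterior.argmax(axis=0)                        # winner per part
    wins = np.bincount(winners, minlength=a_posterior.shape[0])
    return (wins >= 2).astype(float)

def binary_cross_entropy(labels, a_prior, eps=1e-7):
    """Mean binary cross-entropy between labels and prior presences."""
    p = np.clip(a_prior, eps, 1 - eps)
    return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
```

A capsule that wins only one part receives the label 0, so the loss pushes its prior presence down.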

Figure 5: SCAE architecture.

Fig. 5 shows the schematic architecture of the SCAE. We optimize a weighted sum of image and part likelihoods and the auxiliary losses. The loss weight selection process as well as the values used for experiments are explained in Appendix A.

In order to make the values of presence probabilities closer to binary, we inject uniform noise into their logits, similar to Tieleman2014thesis. This forces the model to predict logits that are far from zero to avoid stochasticity and makes the predicted presence probabilities close to binary. Interestingly, it tends to work better in our case than using the Concrete distribution (Maddison2017concrete).
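The noise injection can be sketched as follows. The noise range is not given in the text above, so `noise_scale` is a hypothetical placeholder, as are the function name and the `training` switch.

```python
import numpy as np

def noisy_presence(logits, rng, noise_scale=2.0, training=True):
    """Presence probabilities from logits, with uniform noise during training.

    noise_scale is a placeholder value, not one taken from the text.
    """
    if training:
        noise = rng.uniform(-noise_scale, noise_scale, size=logits.shape)
        logits = logits + noise
    return 1.0 / (1.0 + np.exp(-logits))     # logistic sigmoid
```

Logits with a magnitude well above `noise_scale` produce nearly the same probability whatever noise is drawn, which is why the model is pushed towards confident, near-binary presences.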

3 Evaluation

The decoders in the SCAE use explicitly parameterised affine transformations that allow the encoders' inputs to be explained with a small set of transformed objects or parts. The following evaluations show how the embedded geometrical knowledge helps to discover patterns in data. Firstly, we show that the CCAE discovers underlying structures in arrangements of constellations made of two-dimensional points, thereby performing instance-level segmentation. Secondly, we pair an OCAE with a PCAE and investigate whether the resulting SCAE can discover structure in real images. Finally, we present an ablation study that shows which components of the model contribute to the results.

3.1 Discovering Constellations

We create arrangements of constellations online, where every input example consists of up to 11 two-dimensional points belonging to up to three different constellations (two squares and a triangle) as well as binary variables indicating presence of the points (points can be missing). Each constellation is included with a fixed probability and undergoes a similarity transformation, whereby it is randomly scaled, rotated by up to 180° and shifted. Finally, every input example is normalized such that all points lie within a fixed range. Note that we use sets of points, and not images, as inputs to our model.

We compare the CCAE against a baseline that uses the same encoder but a simpler decoder: the decoder uses the capsule parameter vector to directly predict the location, precision and presence probability of each of the four points as well as the presence probability of the whole corresponding constellation. Implementation details are listed in Section A.1.

Both models are trained unsupervised by maximizing the part log-likelihood. We evaluate them by trying to assign each input point to one of the object capsules. To do so, we assign every input point to the object capsule with the highest posterior probability for this point, cf. Section 2.1, and compute segmentation accuracy (the true-positive rate).

The CCAE consistently achieves a lower error than the best baseline, even though both used the same budget for hyperparameter search. This shows that wiring in an inductive bias towards modelling geometric relationships can help to bring down the error by an order of magnitude, at least in a toy setup where each set of points is composed of familiar constellations that have been independently transformed.

3.2 Unsupervised Class Discovery

| Method | MNIST | CIFAR10 | SVHN |
|---|---|---|---|
| kmeans (adc) | 53.49 | 20.8 | 12.5 |
| ae (ae) | 81.2 | 31.4 | - |
| gan (gan) | 82.8 | 31.5 | - |
| imsat (imsat) | 98.4 (0.4) | 45.6 (0.8) | 57.3 (3.9) |
| iic (iic) | 98.4 (0.6) | 57.6 (5.0) | - |
| adc (adc) | 98.7 (0.6) | 29.3 (1.5) | 38.6 (4.1) |
| max-act (SCAE) | 98.0 (.15) | 19.79 (1.0) | 49.07 (1.7) |
| clust-nn (SCAE) | 98.5 (.11) | 19.39 (1.5) | 53.0 (3.8) |
| lin-match (SCAE) | 98.5 (.10) | 25.01 (1.0) | 55.33 (3.4) |
| lin-pred (SCAE) | 98.9 (.07) | 33.48 (0.3) | 67.27 (4.5) |

Table 1: Unsupervised classification results in % with (standard deviation), averaged over 5 runs. Methods based on mutual information are shaded. Some baseline results use data augmentation or ImageNet-pretrained features instead of images, and some are taken from iic. We highlight the best results and those that are within its 98% confidence interval according to a two-sided t test.

To allow for multimodality in the appearance of objects of a specific class, we typically use more object capsules than the number of class labels. We expect that the vector of presence probabilities of object capsules should be highly informative of the class label. To test this hypothesis, we train the SCAE on MNIST, SVHN and CIFAR10 and try to assign class labels to vectors of object capsule presences. This is done with one of the following methods. max-act: we search for a training example that maximally activates a given object capsule and assign the corresponding label to this capsule. cluster-nn: we perform k-means clustering and then find the training example that is the closest to each cluster's centroid to assign a label to the cluster. lin-match: after finding 10 clusters with k-means (all considered datasets have 10 classes), we use bipartite graph matching (Kuhn1955hungarian) to find the permutation of cluster indices that minimizes the classification error; this is standard practice in unsupervised classification, see iic. lin-pred: we train a linear classifier with supervision given the presence vectors; this learns $10K$ weights and 10 biases, where $K$ is the number of object capsules, but it does not modify any parameters of the main model.
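The lin-match evaluation can be illustrated with a small pure-Python sketch. Note that it swaps the Hungarian algorithm cited above for an exhaustive search over permutations, which gives the same optimum but is only practical for a handful of clusters; with 10 clusters a proper assignment solver is the better choice. The function name and the toy data are hypothetical.

```python
from itertools import permutations

def lin_match_accuracy(cluster_ids, labels, num_classes):
    """Best accuracy over all one-to-one mappings from clusters to classes."""
    # counts[c][y] = number of examples in cluster c whose true label is y.
    counts = [[0] * num_classes for _ in range(num_classes)]
    for c, y in zip(cluster_ids, labels):
        counts[c][y] += 1

    best_perm, best_hits = None, -1
    for perm in permutations(range(num_classes)):
        # perm[c] is the class assigned to cluster c.
        hits = sum(counts[c][perm[c]] for c in range(num_classes))
        if hits > best_hits:
            best_perm, best_hits = perm, hits
    return best_hits / len(labels), best_perm
```

For example, with cluster indices `[0, 0, 1, 1, 2, 2]` and true labels `[2, 2, 0, 0, 1, 0]`, the best mapping sends cluster 0 to class 2, cluster 1 to class 0 and cluster 2 to class 1, and five of the six examples are then classified correctly.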

In agreement with previous work on unsupervised clustering (iic; imsat; Hjelm2019deepinfomax; adc), we train our models and report results on full datasets (train, valid and test splits). The linear transformation used in the lin-pred variant of our method is trained on the train split of the respective datasets, while its performance on the test split is reported.

We used a PCAE with 24 single-channel templates for MNIST and 24 and 32 three-channel templates for SVHN and CIFAR10, respectively. We used Sobel-filtered images as the reconstruction target for SVHN and CIFAR10, as in jaiswal, while using the raw pixel intensities as the input to the PCAE. The OCAE used 24, 32 and 64 object capsules, respectively. Further details on model architectures and hyper-parameter tuning are available in Appendix A. All results are presented in Table 1. The SCAE achieves competitive results in unsupervised object classification on MNIST and SVHN and under-performs on CIFAR10, which is further discussed in Section 5.

3.3 Ablation study

SCAEs have many moving parts; an ablation study shows which model components are important and to what degree. We train SCAE variants on MNIST as well as a padded-and-translated version of the dataset, where the original digits are translated up to 6 pixels in each direction. Trained models are tested on test splits of both datasets; additionally, we evaluate the model trained on MNIST on the test split of the affNIST dataset. Testing on affNIST shows whether the model can generalize to unseen viewpoints. This task was used by sparsecaps to evaluate Sparse Unsupervised Capsules. The SCAE achieves 92.2% (Table 2), which is higher than the accuracy reported there and indicates that it is better at viewpoint generalization. We choose the lin-match performance metric, since it is the one favoured by the unsupervised classification community.

| Group | Method | MNIST | MNIST | affNIST |
|---|---|---|---|---|
| | full model | 97.0 (.87) | 98.5 (.1) | 92.2 (.59) |
| a) | no posterior sparsity | 96.7 (.7) | 98.2 (.48) | 87.6 (1.63) |
| a) | no prior sparsity | 90.5 (7.56) | 94.0 (3.03) | 74.0 (4.94) |
| a) | no prior/posterior sparsity | 63.0 (13.48) | 62.7 (10.46) | 40.7 (6.81) |
| b) | no noise in object caps | 96.4 (1.41) | 97.8 (.67) | 90.8 (2.97) |
| b) | no noise in any caps | 84.8 (6.22) | 85.1 (13.13) | 76.3 (12.89) |
| b) | no noise in part caps | 83.9 (7.57) | 80.2 (9.1) | 73 (9.04) |
| c) | similarity transforms | 90.4 (13.78) | 97.4 (.99) | 90.1 (2.62) |
| c) | no deformations | 87.6 (6.13) | 95.2 (1.04) | 87.6 (1.26) |
| d) | linear part enc | 94.8 (3.0) | 98.1 (.26) | 76.3 (2.22) |
| d) | conv part enc | 96.3 (.85) | 97.8 (.95) | 80.1 (2.58) |
| e) | MLP enc for object caps | 73.0 (6.34) | 70.3 (11.2) | 52.5 (11.29) |
| f) | no special features | 63.1 (10.55) | 66.9 (23.59) | 50.5 (18.26) |

Table 2: Ablation study on MNIST. All used model components contribute to its final performance. affNIST results show out-of-distribution generalization properties and come from a model trained on MNIST. Numbers represent average % and (standard deviation) over 10 runs. We highlight the best results and those that are within its 98% confidence interval according to a two-sided t test.

Results are split into several groups and shown in Table 2. We describe each group in turn. Group a) shows that sparsity losses introduced in Section 2.4 increase model performance, but that the posterior loss might not be necessary. Group b) checks the influence of injecting noise into logits for presence probabilities, cf. Section 2.4. Injecting noise into part capsules seems critical, while noise in object capsules seems unnecessary—the latter might be due to sparsity losses. Group c) shows that using similarity (as opposed to affine) transforms in the decoder can be restrictive in some cases, while not allowing deformations hurts performance in every case.

Group d) evaluates the type of the part-capsule encoder. The linear encoder entails a CNN followed by a fully-connected layer, while the conv encoder predicts one feature map for every capsule parameter, followed by global-average pooling. The choice of part-capsule encoder seems not to matter much for within-distribution performance; however, our attention-based pooling does achieve much higher classification accuracy when evaluated on a different dataset, showing better generalization to novel viewpoints.

Additionally, e) using Set Transformer as the object-capsule encoder is essential. We hypothesize that this is due to the natural tendency of Set Transformer to find clusters, as reported in Lee2019set. Finally, f) using special features seems no less important, presumably due to the effects the high-level capsules have on the representation learned by the primary encoder.

4 Related Work

Capsule Networks  Our work combines ideas from Transforming Autoencoders (Hinton2011tae) and EM Capsules (Hinton2018capsule). Transforming autoencoders discover affine-aware capsule instantiation parameters by training an autoencoder to predict an affine-transformed version of the input image from the original image plus an extra input, which explicitly represents the transformation. By contrast, our model does not need any input other than the image.

Both EM Capsules and the preceding Dynamic Capsules (Sabour2017capsule) use the poses of parts and learned part-object relationships to vote for the poses of objects. When multiple parts cast very similar votes, the object is assumed to be present, which is facilitated by an iterative inference (routing) algorithm. Iterative routing is inefficient and has prompted further research. wang2018optimization formulated routing as an optimization of a clustering loss and a KL-divergence-based regularization term. zhang2018fast proposed a weighted kernel density estimation-based routing method. encapsule proposed approximating routing with two branches and sending feedback via optimal transport divergence between two distributions (lower and higher capsules). In contrast to prior work, we use objects to predict parts rather than vice versa, and can therefore dispense with iterative routing at inference time. The encoder of the OCAE learns how to group parts into objects and it respects the single-parent constraint, because it is trained using derivatives produced by a decoder that uses a mixture model of parts which assumes that each part must be explained by a single object.

Additionally, since it is the objects that predict parts, the parts are allowed to have fewer degrees of freedom in their poses than objects (as in the CCAE). Inference is still possible, because the OCAE encoder makes object predictions based on all the parts rather than an individual part.

A further advantage of our version of capsules is that it can perform unsupervised learning. Previous versions of capsules used discriminative learning, though sparsecaps used the reconstruction MLP introduced in Sabour2017capsule to train Dynamic Capsules without supervision and has shown that unsupervised training for capsule-conditioned reconstruction helps with generalization to affNIST classification; we further improve on their results, cf. Section 3.3.

Unsupervised Classification  There are two main approaches to unsupervised object category detection in computer vision. The first one is based on representation learning and typically requires discovering clusters or learning a classifier on top of the learned representation. Eslami2016 and Kosiorek2018sqair use an iterative procedure to infer a variable number of latent variables, one for every object in a scene, that are highly informative of object class, while Greff2019multi and Burgess2019monet perform unsupervised instance-level segmentation in an iterative fashion. While similar to our work, these approaches cannot decompose objects into their constituent parts and do not provide an explicit description of object shape (templates and their poses in our model).

The second approach targets classification explicitly by minimizing mutual information (MI)-based losses and directly learning class-assignment probabilities. IIC (iic) maximizes an exact estimator of MI between two discrete probability vectors describing (transformed) versions of the input image. DeepInfoMax (Hjelm2019deepinfomax) relies on negative samples and maximizes MI between the predicted probability vector and its input via noise-contrastive estimation (Gutmann2010nce). This class of methods directly maximizes the amount of information contained in an assignment to discrete clusters, and these methods hold state-of-the-art results on most unsupervised classification tasks. MI-based methods suffer from typical drawbacks of mutual information estimation: they require heavy data augmentation and large batch sizes. This is in contrast to our method, which achieves comparable performance with a batch size no bigger than 128 and with no data augmentation.

Geometrical Reasoning  Other attempts at incorporating geometrical knowledge into neural networks include exploiting equivariance properties of group transformations (Cohen2016group) or new types of convolutional filters (mallat; kocvok). Although they achieve significant parameter efficiency in handling rotations or reflections compared to standard CNNs, these methods cannot handle additional degrees of freedom of affine transformations, such as scale. lenssen combined capsule networks with group convolutions to guarantee equivariance and invariance in capsule networks. Spatial Transformers (ST; Jaderberg2015) apply affine transformations to the image sampling grid while steerable networks (Cohen2016steerable; Jacobsen2017dynamic) dynamically change convolutional filters. These methods are similar to ours in the sense that transformation parameters are predicted by a neural network, but differ in the sense that ST uses global transformations applied to the whole image while steerable networks use only local transformations. Our approach can use different global transformations for every object as well as local transformations for each of their parts.

5 Discussion

The main contribution of our work is a novel method for representation learning, in which highly structured decoder networks are used to train one encoder network that can segment an image into parts and their poses and another encoder network that can compose the parts into coherent wholes. Although our training objective is not concerned with classification or clustering, SCAE is the only method that achieves competitive results in unsupervised object classification without relying on mutual information (MI). This is significant, since unlike our method, MI-based methods require sophisticated data augmentation. It may be possible to further improve results by using an MI-based loss to train SCAE, where the vector of capsule probabilities could take the role of the discrete probability vectors in IIC (iic). SCAE under-performs on CIFAR10, possibly because it uses fixed templates, which are not expressive enough to model real data. This might be fixed by building deeper hierarchies of capsule autoencoders (complicated scenes in computer graphics are modelled as deep trees of affine-transformed geometric primitives) as well as by using input-dependent shape functions instead of fixed templates, both of which are promising directions for future work. It may also be possible to make a much better PCAE for learning the primary capsules by using a differentiable renderer in the generative model that reconstructs pixels from the primary capsules.

Finally, the SCAE could be the ‘figure’ component of a mixture model that also includes a versatile ‘ground’ component that can be used to account for everything except the figure. A complex image could then be analyzed using sequential attention to perceive one figure at a time.

6 Acknowledgements

We would like to thank Sandy H. Huang for help with editing the manuscript and making Figure 1. Additionally, we would like to thank S. M. Ali Eslami and Danijar Hafner for helpful discussions throughout the project. We also thank Hyunjik Kim, Martin Engelcke, Emilien Dupont and Simon Kornblith for feedback on initial versions of the manuscript.

Appendix A Model Details

A.1 Constellation Experiments

The CCAE uses a four-layer Set Transformer as its encoder. Every layer has four attention heads, 128 hidden units per head, and is followed by layer normalization (Ba2016LayerN). The encoder outputs three 32-dimensional vectors, one for each object capsule. The decoder uses a separate neural net for each object capsule to predict all parameters used to model its points: this includes four candidate part predictions per capsule for a total of 12 candidates. In this experiment, each object-part relationship OP is just a 2-D offset in the object’s frame of reference (instead of a matrix), and it is affine-transformed by the corresponding OV matrix to predict the 2-D point.
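The prediction of a 2-D point from an OP offset and an OV matrix can be sketched as follows. Representing the OV matrix as a 3x3 homogeneous matrix and the name `predict_point` are assumptions made for illustration.

```python
import numpy as np


def predict_point(ov, op_offset):
    """Map a 2-D offset in the object's frame to a 2-D point in the image frame.

    ov: 3x3 homogeneous affine matrix (object-viewer relationship, assumed form).
    op_offset: 2-D offset of the part in the object's frame of reference.
    """
    ov = np.asarray(ov, dtype=np.float64)
    offset_h = np.append(np.asarray(op_offset, dtype=np.float64), 1.0)
    return (ov @ offset_h)[:2]
```

For example, an OV matrix that rotates by 90 degrees and translates by (2, 3) sends the offset (1, 0) to the point (2, 4). With three object capsules and four offsets each, this yields the 12 candidate predictions mentioned above.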

A.2 Image Experiments

We use a convolutional encoder for part capsules and a Set Transformer encoder (Lee2019set) for object capsules. Decoding from object capsules to part capsules is done with MLPs, while the input image is reconstructed with affine-transformed learned templates. Details of the architectures we used are available in Table 3.

Table 3: Architecture details. S in the last column means that the entry is the same as for SVHN.

| Dataset | Constellation | MNIST | SVHN | CIFAR10 |
| --- | --- | --- | --- | --- |
| num templates | N/A | 24 | 24 | 32 |
| template size | N/A | | | S |
| num capsules | 3 | 24 | 32 | 64 |
| part CNN | N/A | 2x(128:2)-2x(128:1) | 2x(128:1)-2x(128:2) | S |
| set transformer | 4x(4-128)-32 | 3x(1-16)-256 | 3x(2-64)-128 | S |

We use ReLU nonlinearities except for presence probabilities, for which we use sigmoids. (128:2) for a CNN means 128 channels with a stride of two. All kernels are . For the set transformer, (1-16)-256 means one attention head, 16 hidden units and 256 output units; it uses layer normalization (Ba2016LayerN) as in the original paper (Lee2019set) but no dropout. All experiments (apart from constellations) used 16 special features per part capsule.
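The shorthand used in Table 3 can be unpacked mechanically. The sketch below is a hypothetical helper, not part of our code base, that expands a CNN entry into a list of (channels, stride) pairs and a set transformer entry into its four numbers.

```python
import re


def parse_cnn_spec(spec):
    """Expand e.g. '2x(128:2)-2x(128:1)' into a list of (channels, stride)."""
    layers = []
    for repeat, channels, stride in re.findall(r"(\d+)x\((\d+):(\d+)\)", spec):
        layers.extend([(int(channels), int(stride))] * int(repeat))
    return layers


def parse_set_transformer_spec(spec):
    """Parse e.g. '3x(1-16)-256' into layers, heads, hidden and output units."""
    match = re.fullmatch(r"(\d+)x\((\d+)-(\d+)\)-(\d+)", spec)
    if match is None:
        raise ValueError(f"unrecognized set transformer spec: {spec!r}")
    n_layers, n_heads, n_hidden, n_outputs = map(int, match.groups())
    return {
        "layers": n_layers,
        "heads": n_heads,
        "hidden_units": n_hidden,
        "output_units": n_outputs,
    }
```

For instance, `parse_cnn_spec("2x(128:2)-2x(128:1)")` gives two stride-two layers followed by two stride-one layers, each with 128 channels, and `parse_set_transformer_spec("4x(4-128)-32")` recovers the constellation encoder described in A.1: four layers, four heads, 128 hidden units and 32 output units.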

For SVHN and CIFAR10, we use normalized Sobel-filtered images as the reconstruction target to emphasize the importance of shape. Figure 6 in Appendix B shows examples of SVHN and CIFAR10 reconstructions. The filtering procedure is as follows: 1) apply Sobel filtering, 2) subtract the median color, 3) take the absolute value of the image, 4) normalize for image values to be .
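The four filtering steps can be sketched as below. Several details are assumptions rather than a description of our pipeline: Sobel filtering is taken to mean the per-channel gradient magnitude, the median color is taken per channel over the filtered image, borders use edge padding, and the final normalization is assumed to map values to [0, 1] by dividing by the maximum.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T


def _filter3x3(channel, kernel):
    """Cross-correlate a 2-D array with a 3x3 kernel, using edge padding."""
    height, width = channel.shape
    padded = np.pad(channel, 1, mode="edge")
    out = np.zeros((height, width), dtype=np.float64)
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * padded[i:i + height, j:j + width]
    return out


def sobel_target(image, eps=1e-8):
    """Turn an [H, W, C] image into a normalized edge map of the same shape."""
    image = np.asarray(image, dtype=np.float64)
    # 1) Sobel filtering (assumed: gradient magnitude, channel by channel).
    edges = np.stack(
        [
            np.hypot(
                _filter3x3(image[..., c], SOBEL_X),
                _filter3x3(image[..., c], SOBEL_Y),
            )
            for c in range(image.shape[-1])
        ],
        axis=-1,
    )
    # 2) Subtract the median color (assumed: per-channel median).
    edges = edges - np.median(edges, axis=(0, 1), keepdims=True)
    # 3) Take the absolute value.
    edges = np.abs(edges)
    # 4) Normalize (assumed target range: [0, 1]).
    return edges / (edges.max() + eps)
```

On an image with a single vertical step edge, the output is close to one along the edge and zero elsewhere, which is the sense in which the target emphasizes shape over color.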

All models are trained with the RMSProp optimizer (tieleman2012rms) and . Batch size is 64 for constellations and 128 for all other datasets. The learning rate was equal to for MNIST and constellation experiments (without any decay), while we ran a hyperparameter search for SVHN and CIFAR10: we searched learning rates in the range of to and exponential learning rate decay of 0.96 every or weight updates. A learning rate of was selected for both SVHN and CIFAR10; the number of decay steps was for SVHN and for CIFAR10. The lin-pred accuracy on a validation set is used as a proxy to select the best hyperparameters, including the weights on the different losses, reported in Table 4. Models were trained for up to iterations on single Tesla V100 GPUs, which took 40 minutes for constellation experiments and less than a day for CIFAR10.
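For reference, an exponential learning rate decay of 0.96 applied every fixed number of weight updates can be written as the short sketch below. The function name, the staircase behaviour, and the values in the usage note are hypothetical and are not the hyperparameters used in our experiments.

```python
def decayed_learning_rate(base_lr, step, decay_steps, decay_rate=0.96, staircase=True):
    """Learning rate after `step` weight updates under exponential decay.

    With staircase=True (an assumption), the rate is multiplied by
    `decay_rate` once every `decay_steps` updates; otherwise it decays
    continuously.
    """
    exponent = step // decay_steps if staircase else step / decay_steps
    return base_lr * decay_rate ** exponent
```

With a hypothetical base rate of 1.0 and `decay_steps=1000`, the rate stays at 1.0 for the first 999 updates, drops to 0.96 at update 1000, and to 0.96 squared at update 2000.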

Table 4: Loss weight values. The within and between qualifiers in the sparsity losses correspond to different terms of Equations 5 and 4.

| Dataset | Constellation | MNIST | SVHN | CIFAR10 |
| --- | --- | --- | --- | --- |
| part ll weight | 1 | 1 | 2.56 | 2.075 |
| image ll weight | N/A | 1 | 1 | 1 |
| prior within sparsity | 1 | 1 | 0.22 | 0.17 |
| prior between sparsity | 1 | 1 | 0.1 | 0.1 |
| posterior within sparsity | 0 | 10 | 8.62 | 1.39 |
| posterior between sparsity | 0 | 10 | 0.26 | 7.32 |
| too-few-active-capsules | 10 | 0 | 0 | 0 |

Appendix B Reconstructions

Figure 6: Ten sample SVHN and CIFAR10 reconstructions. The first row shows the Sobel-filtered target image. The second row shows the reconstruction directly from the part capsule layer. The third row shows the reconstruction obtained when the object predictions for the part poses are used instead of the part poses themselves. The templates in this model have the same number of channels as the image, but they have converged to black-and-white templates, so the reconstructions lack color diversity. The SCAE model is trained completely unsupervised, but the reconstructions tend to focus on the center digit in SVHN and filter out the rest of the clutter.