Convolutional neural networks (CNNs) work better than networks without weight-sharing because of their inductive bias: if a local feature is useful in one image location, the same feature is likely to be useful in other locations. It is tempting to exploit other effects of viewpoint changes by replicating features across scale, orientation and other affine degrees of freedom, but this quickly leads to cumbersome, high-dimensional feature maps.
An alternative to replicating features across the non-translational degrees of freedom is to explicitly learn transformations between the natural coordinate frame of a whole object and the natural coordinate frames of each of its parts. Computer graphics relies on such object→part coordinate transformations to represent the geometry of an object in a viewpoint-invariant manner. Moreover, there is strong evidence that, unlike standard CNNs, human vision also relies on coordinate frames: imposing an unfamiliar coordinate frame on a familiar object makes it difficult to recognize the object or its geometry (Rock73; Hinton79).
A neural system can learn to reason about transformations between objects, their parts and the viewer, but each type of transformation is likely to require a different representation. An object-part relationship (OP) is viewpoint-invariant and is naturally coded by learned weights. The relationship of an object or part to the viewer changes with the viewpoint (it is viewpoint-equivariant) and is naturally coded using neural activations. [Footnote 1: This may explain why accessing perceptual knowledge about objects, when they are not visible, requires creating a mental image of the object with a specific viewpoint.] With this representation, the pose of a single object is represented by its relationship to the viewer. Consequently, representing a single object does not necessitate replicating neural activations across space, unlike in CNNs. It is only processing two (or more) different instances of the same type of object in parallel that requires spatial replicas of both model parameters and neural activations.
In this paper we propose the Stacked Capsule Autoencoder (SCAE), which has two stages (Fig. 1). The first stage, the Part Capsule Autoencoder (PCAE), segments an image into constituent parts, infers their poses, and reconstructs each image pixel as a mixture of the pixels of transformed part templates. The second stage, the Object Capsule Autoencoder (OCAE), tries to organize discovered parts and their poses into a smaller set of objects that can explain the part poses using a separate mixture of predictions for each part. Every object capsule contributes components to each of these mixtures by multiplying its pose, the object-viewer relationship (OV), by the relevant object-part relationship (OP). [Footnote 2: The type of a part capsule may determine which, if any, of an object's parts contribute to the mixture used to model the pose of an already discovered part.]
Stacked Capsule Autoencoders (Section 2) capture spatial relationships between whole objects and their parts when trained on unlabelled data. The vectors of presence probabilities for the object capsules tend to form tight clusters, and when we assign a class to each cluster we achieve state-of-the-art results for unsupervised classification on SVHN (55%) and near state-of-the-art results on MNIST (98.5%), which can be further improved to 67% and 99%, respectively, by learning fewer than 300 parameters. We also present promising proof-of-concept results on CIFAR10 (Section 3). We describe related work in Section 4 and discuss implications of our work and future directions in Section 5.
2 Stacked Capsule Autoencoders (SCAE)
Segmenting an image into parts is non-trivial, so we begin by abstracting away pixels and the part-discovery stage, and develop the Constellation Capsule Autoencoder (CCAE) (Section 2.1). It uses two-dimensional points as parts, and their coordinates are given as the input to the system. The CCAE learns to model sets of points as arrangements of familiar constellations, each of which has been transformed by an independent similarity transform. It learns to assign individual points to their respective constellations without knowing the number of constellations or their individual shapes in advance. Next, in Section 2.2, we develop the Part Capsule Autoencoder (PCAE), which learns to infer parts and their poses from images. Finally, we stack the Object Capsule Autoencoder (OCAE), which closely resembles the CCAE, on top of the PCAE to form the Stacked Capsule Autoencoder (SCAE).
2.1 Constellation Autoencoder (CCAE)
Let $\{x_m \mid m = 1, \dots, M\}$ be a set of two-dimensional input points, where every point belongs to a constellation as in Figure 2. We first encode all input points (which take the role of part capsules) with a Set Transformer (Lee2019set), a permutation-invariant encoder $h^{caps}$ based on attention mechanisms, into $K$ object capsules. An object capsule $k$ consists of a capsule feature vector $c_k$, its presence probability $a_k \in [0, 1]$ and a $3 \times 3$ object-viewer relationship (OV) matrix, which represents the affine transformation between the object (constellation) and the viewer. Note that each object capsule can represent only one object at a time. Every object capsule uses a separate multilayer perceptron (MLP) $h_k^{part}$ to predict $N$ part candidates from the capsule feature vector $c_k$. Each candidate consists of the conditional probability $a_{k,n} \in [0, 1]$ that a given candidate part exists, an associated scalar standard deviation $\lambda_{k,n}$, and a $3 \times 3$ object-part relationship (OP) matrix, which represents the affine transformation between the object capsule and the candidate part. [Footnote 3: Deriving these matrices from the capsule feature vector $c_k$ allows for deformable objects. We model OPs as the sum of an input-dependent component and a constant bias. We encourage different capsules to specialize to different constellations by putting a strong penalty on the former.] Candidate predictions $\mu_{k,n}$ are given by the product of the object capsule OV and the candidate OP matrices. We then model each input part as a Gaussian mixture, where $\mu_{k,n}$ and $\lambda_{k,n}$ are the centers and standard deviations of the isotropic Gaussian components. See Figures 5 and 1 for illustration; the formal description follows:

$$
\begin{aligned}
\mathrm{OV}_{1:K},\ c_{1:K},\ a_{1:K} &= h^{caps}(x_{1:M}) && \text{encode object capsule parameters,}\\
\mathrm{OP}_{k,1:N},\ a_{k,1:N},\ \lambda_{k,1:N} &= h_k^{part}(c_k) && \text{decode candidate parameters from } c_k\text{'s,}\\
V_{k,n} &= \mathrm{OV}_k\, \mathrm{OP}_{k,n} && \text{decode a part pose candidate,}\\
p(x_m \mid k, n) &= \mathcal{N}(x_m \mid \mu_{k,n}, \lambda_{k,n}) && \text{turn candidates into mixture components,}\\
p(x_{1:M}) &= \prod_{m=1}^{M} \sum_{k=1}^{K} \sum_{n=1}^{N} \frac{a_k a_{k,n}}{\sum_i a_i \sum_j a_{i,j}}\, p(x_m \mid k, n). && (1)
\end{aligned}
$$

The model is trained without supervision by maximizing the likelihood of part capsules in Equation 1 subject to sparsity constraints, cf. Section 2.4. The part capsule $m$ can be assigned to the object capsule $k^\star$ as $k^\star = \arg\max_k a_k a_{k,n}\, p(x_m \mid k, n)$. [Footnote 4: We treat parts as independent and evaluate their probability under the same mixture model. While there are no clear 1:1 connections between parts and predictions, it seems to work well in practice.]
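To illustrate the mixture likelihood and the point-to-capsule assignment described above, here is a minimal sketch in plain Python; it is not the authors' implementation. It assumes, as in the constellation experiments of Appendix A.1, that each OP is a 2-D offset mapped through a 3×3 OV matrix. The function names `transform_point`, `gaussian_2d` and `ccae_log_likelihood`, and all toy values, are hypothetical.

```python
import math


def transform_point(ov, offset):
    """Apply a 3x3 affine OV matrix (homogeneous coordinates) to a 2-D offset."""
    x, y = offset
    return (ov[0][0] * x + ov[0][1] * y + ov[0][2],
            ov[1][0] * x + ov[1][1] * y + ov[1][2])


def gaussian_2d(x, mu, std):
    """Density of an isotropic 2-D Gaussian with centre mu and standard deviation std."""
    sq_dist = (x[0] - mu[0]) ** 2 + (x[1] - mu[1]) ** 2
    return math.exp(-0.5 * sq_dist / std ** 2) / (2.0 * math.pi * std ** 2)


def ccae_log_likelihood(points, ov, op, a_obj, a_part, std):
    """Return the log-likelihood of the points and the winning capsule per point.

    ov[k] is the OV matrix of capsule k; op[k][n], a_part[k][n] and std[k][n]
    describe candidate n of capsule k; a_obj[k] is the capsule presence.
    """
    K, N = len(ov), len(op[0])
    # Normaliser of the mixing probabilities: sum_k a_k * sum_n a_{k,n}.
    norm = sum(a_obj[k] * a_part[k][n] for k in range(K) for n in range(N))
    # Candidate centres mu_{k,n}: the OP offset mapped through the OV matrix.
    mu = [[transform_point(ov[k], op[k][n]) for n in range(N)] for k in range(K)]
    log_lik, winners = 0.0, []
    for x in points:
        # Unnormalised responsibilities a_k * a_{k,n} * p(x | k, n).
        resp = [[a_obj[k] * a_part[k][n] * gaussian_2d(x, mu[k][n], std[k][n])
                 for n in range(N)] for k in range(K)]
        log_lik += math.log(sum(sum(row) for row in resp) / norm)
        winners.append(max(range(K), key=lambda k: max(resp[k])))
    return log_lik, winners


# Toy example: two capsules with two candidates each (all values hypothetical).
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shifted = [[1.0, 0.0, 5.0], [0.0, 1.0, 5.0], [0.0, 0.0, 1.0]]
offsets = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.0), (0.0, 1.0)]]
ones = [[1.0, 1.0], [1.0, 1.0]]
stds = [[0.1, 0.1], [0.1, 0.1]]
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0)]
log_lik, winners = ccae_log_likelihood(
    points, [identity, shifted], offsets, [1.0, 1.0], ones, stds)
print(winners)
```

In this toy setup the first two points are claimed by the untransformed capsule and the last two by the shifted one, which is the instance-level segmentation the text refers to.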
2.2 Part Capsule Autoencoder (PCAE)
Explaining images as geometrical arrangements of parts requires first inferring what parts the images are composed of, as well as the relationships of the parts to the viewer (which we call their poses). For the CCAE a part is just a 2D point, but here each part capsule has a six-degree-of-freedom pose, a presence variable and a unique identity. We frame the part-discovery problem as auto-encoding: the encoder learns to infer the poses and presences of different part capsules, while the decoder learns an image template for each part (Fig. 3), similar to Tieleman2014thesis and Eslami2016. The templates corresponding to present parts are affine-transformed using their poses, and the pixels of these transformed templates are used to create a separate mixture model for each image pixel. The PCAE is followed by an Object Capsule Autoencoder (OCAE), which closely resembles the CCAE and is described in Section 2.3.
Let $y$ be the image. We limit the maximum number of part capsules to $M$ and use an encoder to infer their poses $x_m$, presence probabilities $d_m \in [0, 1]$, and special features $z_m$, one per part capsule. The latter do not take part in direct image reconstruction, but inform the OCAE about special aspects of the corresponding part; they are trained by backpropagating derivatives from the OCAE.
At present, we do not allow multiple occurrences of the same type of part in an image, so the part capsules themselves are not replicated across space, though they could be. However, we do need to recognize the part wherever it occurs in the image, and therefore the encoder consists of a CNN with a bottom-up attention mechanism: for every part capsule $k$, it predicts a feature map $e_k$ of capsule parameters with spatial dimensions $h_e \times w_e$, as well as a single-channel attention mask $a_k$. The final parameters for that capsule are computed as $\sum_i \sum_j e_{k,i,j}\, \mathrm{softmax}(a_k)_{i,j}$, where the softmax is taken along the spatial dimensions. This is similar to global average pooling, but allows some spatial locations to contribute to the final result more than others; we call this approach attention-based pooling. Its effect on the model performance is analyzed in Section 3.3.
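The attention-based pooling described above can be sketched as follows. This is a minimal illustration in plain Python under the assumption that the attention weights are a softmax over all spatial locations of the mask; the function name `attention_pool` and the toy feature map are hypothetical.

```python
import math


def attention_pool(feature_map, attention_logits):
    """Pool an h x w grid of parameter vectors with softmax attention weights."""
    flat_logits = [a for row in attention_logits for a in row]
    peak = max(flat_logits)  # subtract the maximum for numerical stability
    exps = [math.exp(a - peak) for a in flat_logits]
    total = sum(exps)
    weights = [e / total for e in exps]  # softmax over all spatial locations
    flat_feats = [f for row in feature_map for f in row]
    dim = len(flat_feats[0])
    return [sum(w * f[d] for w, f in zip(weights, flat_feats)) for d in range(dim)]


# A 2 x 2 feature map with two capsule parameters per location (toy values).
features = [[[1.0, 0.0], [3.0, 0.0]], [[5.0, 2.0], [7.0, 2.0]]]
uniform = attention_pool(features, [[0.0, 0.0], [0.0, 0.0]])
focused = attention_pool(features, [[0.0, 0.0], [0.0, 50.0]])
```

With a constant mask the result coincides with global average pooling, while a sharply peaked mask returns the parameters of a single location; this is the sense in which some locations can contribute more than others.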
The image pixels are modelled as independent Gaussian mixtures. For every pixel, we take the corresponding pixels of the transformed templates and treat them as centers of isotropic Gaussian components with constant variance. Their mixing probabilities are proportional to both the presence probabilities of part capsules and a function $f_c$ of the color value at that location, where $c$ is the number of image channels. [Footnote 5: Templates are assumed to be sparse; if there exists a template that has a non-zero value at a given location, then this template should be used.] More formally:

$$
\begin{aligned}
x_{1:M},\ d_{1:M},\ z_{1:M} &= h^{enc}(y) && \text{encode the image to part capsule parameters,}\\
\widehat{T}_m &= \mathrm{TransformImage}(T_m, x_m) && \text{apply affine transforms to image templates,}\\
p^y_{m,i,j} &\propto d_m\, f_c(\widehat{T}_{m,i,j}) && \text{compute mixing probabilities,}\\
p(y) &= \prod_{i,j} \sum_{m=1}^{M} p^y_{m,i,j}\, \mathcal{N}(y_{i,j} \mid \widehat{T}_{m,i,j}, \sigma_y^2) && \text{calculate the image likelihood.}
\end{aligned}
$$
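The per-pixel mixture above can be sketched in a few lines of plain Python. This is an illustrative sketch only: it assumes single-channel templates that have already been affine-transformed, takes the colour function to be the template value itself plus a small constant (an assumption; the text does not specify its form), and uses a hypothetical variance. The names `gaussian` and `image_log_likelihood` are not from the source.

```python
import math


def gaussian(y, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (y - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def image_log_likelihood(image, templates, presence, var=0.25, eps=1e-8):
    """Log-likelihood of a single-channel image under the per-pixel mixture.

    templates[m] is the already transformed template of part capsule m and
    presence[m] is its presence probability d_m.
    """
    log_lik = 0.0
    for i, row in enumerate(image):
        for j, y in enumerate(row):
            # Unnormalised mixing weights d_m * f(T_hat[m][i][j]); here f adds
            # eps to the template value so the weights never sum to zero.
            weights = [d * (t[i][j] + eps) for d, t in zip(presence, templates)]
            total = sum(weights)
            pixel = sum(w / total * gaussian(y, t[i][j], var)
                        for w, t in zip(weights, templates))
            log_lik += math.log(pixel)
    return log_lik


# Two sparse 2 x 2 templates and two candidate images (toy values).
templates = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
matching = [[1.0, 0.0], [0.0, 1.0]]
mismatching = [[0.0, 1.0], [1.0, 0.0]]
ll_match = image_log_likelihood(matching, templates, [1.0, 1.0])
ll_mismatch = image_log_likelihood(mismatching, templates, [1.0, 1.0])
```

An image that agrees with the templates where they are non-zero receives a higher likelihood than one that does not, which is the training signal for the templates and poses.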
2.3 Object Capsule Autoencoder (OCAE)
The next step is to find objects in the already discovered parts. [Footnote 6: Discovered objects are not used top-down to refine the presences or poses of the parts during inference. However, the derivatives backpropagated via the OCAE refine the lower-level encoder network that infers the parts.] To do so, we use concatenated poses $x_m$, special features $z_m$ and flattened templates $T_m$ (which convey the identity of the part capsule) as an input to the OCAE, which differs from the CCAE in the following ways. Firstly, we feed part capsule presence probabilities $d_m$ into the OCAE's encoder; these are used to bias the Set Transformer's attention mechanism to not take absent points into account. Secondly, the $d_m$'s are also used to weigh the part capsules' log-likelihood, cf. Equation 1. Additionally, we stop the gradient on all of the OCAE's inputs except the special features to improve training stability and avoid the problem of collapsing latent variables; see Rasmus2015ladder. Finally, parts discovered by the PCAE have independent identities (templates and special features rather than 2D points). Therefore, every part pose is explained as an independent mixture of predictions from object capsules, where every object capsule makes exactly $M$ candidate predictions $V_{k,1:M}$, or exactly one candidate prediction per part. Consequently, the part-capsule likelihood is given by

$$
p(x_{1:M}, d_{1:M}) = \prod_{m=1}^{M} \left[ \sum_{k=1}^{K} \frac{a_k a_{k,m}}{\sum_i a_i \sum_j a_{i,j}}\, p(x_m \mid k, m) \right]^{d_m}.
$$
2.4 Achieving Sparse and Diverse Capsule Presences
Stacked Capsule Autoencoders are trained to maximise pixel and part log-likelihoods ($\mathcal{L}_{ll} = \log p(y) + \log p(x_{1:M})$). If not constrained, however, they tend to either use all of the part and object capsules to explain every data example, or collapse onto always using the same subset of capsules, regardless of the input. We would like the model to use different sets of part capsules for different input examples and to specialize object capsules to particular arrangements of parts; to encourage this, we impose sparsity and entropy constraints. We evaluate their importance in Section 3.3.
We first define prior and posterior object-capsule presence as follows. For a minibatch of size $B$ with $K$ object capsules and $M$ part capsules we define a minibatch of prior capsule presence $a^{prior}_{1:K}$ with dimension $[B, K]$ and posterior capsule presence $a^{posterior}_{1:K,1:M}$ with dimension $[B, K, M]$ as

$$
a^{prior}_k = a_k \max_m a_{k,m}, \qquad a^{posterior}_{k,m} = a_k a_{k,m}\, p(x_m \mid k, m),
$$

respectively; the former is the maximum presence probability among predictions from object capsule $k$, while the latter is the unnormalized mixing probability used to explain part capsule $m$.
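For a single training example, the two presence quantities described above can be computed as in the sketch below. It is an illustration under the reading that the prior presence is the capsule presence times its largest candidate presence, and that the posterior presence is the unnormalized mixing probability; the function name `capsule_presences` and the toy numbers are hypothetical.

```python
def capsule_presences(a_obj, a_part, part_lik):
    """Prior and posterior presences for one example.

    a_obj[k] is the presence of object capsule k, a_part[k][m] the presence of
    its candidate for part m, and part_lik[k][m] the density of part m under
    that candidate's Gaussian.
    """
    prior = [a_k * max(a_km) for a_k, a_km in zip(a_obj, a_part)]
    posterior = [[a_k * a * p for a, p in zip(a_km, p_km)]
                 for a_k, a_km, p_km in zip(a_obj, a_part, part_lik)]
    return prior, posterior


prior, posterior = capsule_presences(
    a_obj=[0.9, 0.2],
    a_part=[[0.5, 1.0], [0.3, 0.1]],
    part_lik=[[2.0, 1.0], [1.0, 4.0]],
)
```

Stacking these values over a minibatch gives arrays of shape [B, K] and [B, K, M], which are the inputs to the sparsity losses that follow.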
- Prior sparsity
Let $\bar{u}_k$ be the average presence probability of object capsule $k$ across training examples, and $\hat{u}_b$ the sum of object capsule presence probabilities for a given example $b$. If we assume that training examples contain objects from $C$ different classes uniformly at random and we would like to assign the same number of object capsules to every class, then each class would obtain $K/C$ capsules. Moreover, if we assume that only one object is present in every image, then $K/C$ object capsules should be present for every input example. To this end, we minimize,
- Posterior sparsity
Similarly, let $\bar{v}_k$ and $\hat{v}_b$ be the normalized versions of $\bar{u}_k$ and $\hat{u}_b$, respectively. We find it beneficial to minimize the within-example entropy of capsule posterior presence $H(\bar{v}_k)$ and maximize its between-example entropy $H(\hat{v}_b)$, where $H$ is the entropy. The final loss reads as,
- Every active object capsule should explain at least two parts
We say that an object capsule has ‘won’ a part if it has the highest posterior mixing probability for that part among all object capsules. We then create binary labels for each of the $K$ object capsules, where the label is 1 if the capsule wins at least two parts and 0 otherwise. The final loss takes the form of binary cross-entropy between the generated label and the prior capsule presence. This loss is used only for the stand-alone constellation model experiments on point data, cf. Sections 3.1 and 2.1.
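The labelling rule and the resulting loss can be sketched as follows. This is a minimal illustration in plain Python; averaging the cross-entropy over capsules and the small constant `eps` inside the logarithms are assumptions, and the function name `won_part_loss` is hypothetical.

```python
import math


def won_part_loss(posterior, prior, eps=1e-8):
    """Binary labels from 'won' parts and their cross-entropy with the prior.

    posterior[k][m] is the posterior mixing probability of capsule k for part m;
    prior[k] is the prior presence of capsule k.
    """
    K, M = len(posterior), len(posterior[0])
    wins = [0] * K
    for m in range(M):
        # The capsule with the highest posterior mixing probability wins part m.
        winner = max(range(K), key=lambda k: posterior[k][m])
        wins[winner] += 1
    labels = [1.0 if w >= 2 else 0.0 for w in wins]
    bce = -sum(l * math.log(p + eps) + (1.0 - l) * math.log(1.0 - p + eps)
               for l, p in zip(labels, prior)) / K
    return labels, bce


labels, bce = won_part_loss(
    posterior=[[0.9, 0.8, 0.1], [0.1, 0.2, 0.7]],
    prior=[0.9, 0.1],
)
```

Here the first capsule wins two parts and is labelled 1, while the second wins only one and is labelled 0, so the loss pushes the second capsule's prior presence towards zero.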
Fig. 5 shows the schematic architecture of the SCAE. We optimize a weighted sum of image and part likelihoods and the auxiliary losses. The loss-weight selection process, as well as the values used in the experiments, is explained in Appendix A.
In order to make the values of presence probabilities ($a_k$, $a_{k,m}$ and $d_m$) closer to binary, we inject uniform noise into the logits, similar to Tieleman2014thesis. This forces the model to predict logits that are far from zero to avoid stochasticity and makes the predicted presence probabilities close to binary. Interestingly, it tends to work better in our case than using the Concrete distribution (Maddison2017concrete).
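The effect of the injected noise can be seen in a small simulation. The sketch below is illustrative only: the noise range of ±2 and the function name `noisy_presence` are assumptions chosen for the example, not values taken from the experiments.

```python
import math
import random


def noisy_presence(logit, noise_scale, rng):
    """Sigmoid of a logit perturbed by uniform noise in [-noise_scale, noise_scale]."""
    noise = rng.uniform(-noise_scale, noise_scale)
    return 1.0 / (1.0 + math.exp(-(logit + noise)))


rng = random.Random(0)
confident = [noisy_presence(10.0, 2.0, rng) for _ in range(1000)]
uncertain = [noisy_presence(0.0, 2.0, rng) for _ in range(1000)]
spread_confident = max(confident) - min(confident)
spread_uncertain = max(uncertain) - min(uncertain)
```

A logit far from zero yields almost the same presence probability regardless of the noise, whereas a logit near zero yields a widely varying one; a model that wants deterministic outputs is therefore pushed towards large-magnitude logits.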
3 Evaluation
The decoders in the SCAE use explicitly parameterised affine transformations that allow the encoders’ inputs to be explained with a small set of transformed objects or parts. The following evaluations show how the embedded geometrical knowledge helps to discover patterns in data. Firstly, we show that the CCAE discovers underlying structures in arrangements of constellations made of two-dimensional points, thereby performing instance-level segmentation. Secondly, we pair an OCAE with a PCAE and investigate whether the resulting SCAE can discover structure in real images. Finally, we present an ablation study that shows which components of the model contribute to the results.
3.1 Discovering Constellations
We create arrangements of constellations online, where every input example consists of up to 11 two-dimensional points belonging to up to three different constellations (two squares and a triangle) as well as binary variables indicating the presence of the points (points can be missing). Each constellation is included with a fixed probability and undergoes a similarity transformation, whereby it is randomly scaled, rotated by up to 180° and shifted. Finally, every input example is normalized such that all points lie within a fixed range. Note that we use sets of points, and not images, as inputs to our model.
We compare the CCAE against a baseline that uses the same encoder but a simpler decoder: the decoder uses the capsule parameter vector $c_k$ to directly predict the location, precision and presence probability of each of the four points as well as the presence probability of the whole corresponding constellation. Implementation details are listed in Section A.1.
Both models are trained unsupervised by maximizing the part log-likelihood. We evaluate them by trying to assign each input point to one of the object capsules. To do so, we assign every input point to the object capsule with the highest posterior probability for this point, cf. Section 2.1, and compute segmentation accuracy (the true-positive rate).
The CCAE consistently achieves a much lower error than the best baseline, using the same budget for hyperparameter search. This shows that wiring in an inductive bias towards modelling geometric relationships can help to bring down the error by an order of magnitude, at least in a toy setup where each set of points is composed of familiar constellations that have been independently transformed.
3.2 Unsupervised Class Discovery
To allow for multimodality in the appearance of objects of a specific class, we typically use more object capsules than the number of class labels. We expect that the vector of presence probabilities of object capsules should be highly informative of the class label. To test this hypothesis, we train the SCAE on MNIST, SVHN and CIFAR10 and try to assign class labels to vectors of object capsule presences. This is done with one of the following methods:
- max-act: we search for the training example that maximally activates a given object capsule and assign the corresponding label to this capsule;
- cluster-nn: we perform k-means clustering and then find the training example that is closest to each cluster’s centroid to assign a label to the cluster;
- lin-match: after finding 10 clusters with k-means [Footnote 7: All considered datasets have 10 classes.] we use bipartite graph matching (Kuhn1955hungarian) to find the permutation of cluster indices that minimizes the classification error; this is standard practice in unsupervised classification, see iic;
- lin-pred: we train a linear classifier with supervision given the presence vectors; this learns $K \times 10$ weights and 10 biases, where $K$ is the number of object capsules, but it does not modify any parameters of the main model.
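The cluster-to-label matching used by lin-match can be illustrated with a small example. Note the substitution: the sketch below searches over all permutations by brute force instead of running the Hungarian algorithm, which gives the same optimal matching but is practical only for a handful of clusters; with 10 classes a dedicated assignment solver would be used. The function name `lin_match_accuracy` and the toy data are hypothetical.

```python
from itertools import permutations


def lin_match_accuracy(cluster_ids, labels, num_classes):
    """Best accuracy over all one-to-one mappings from clusters to class labels."""
    # confusion[c][y] counts the examples in cluster c whose true label is y.
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for c, y in zip(cluster_ids, labels):
        confusion[c][y] += 1
    # Brute-force stand-in for bipartite matching: try every permutation.
    best = max(permutations(range(num_classes)),
               key=lambda perm: sum(confusion[c][perm[c]] for c in range(num_classes)))
    correct = sum(confusion[c][best[c]] for c in range(num_classes))
    return correct / len(labels), best


cluster_ids = [0, 0, 1, 1, 2, 2, 2]
true_labels = [2, 2, 0, 0, 1, 1, 0]
accuracy, mapping = lin_match_accuracy(cluster_ids, true_labels, num_classes=3)
```

In this example cluster 0 is mapped to class 2, cluster 1 to class 0 and cluster 2 to class 1, so six of the seven examples count as correct.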
In agreement with previous work on unsupervised clustering (iic; imsat; Hjelm2019deepinfomax; adc), we train our models and report results on full datasets (train, valid and test splits). The linear transformation used in the lin-pred variant of our method is trained on the train split of the respective datasets, while its performance is reported on the test split.
We used a PCAE with 24 single-channel templates for MNIST and 24 and 32 three-channel templates for SVHN and CIFAR10, respectively. We used Sobel-filtered images as the reconstruction target for SVHN and CIFAR10, as in jaiswal, while using the raw pixel intensities as the input to the PCAE. The OCAE used 24, 32 and 64 object capsules, respectively. Further details on model architectures and hyper-parameter tuning are available in Appendix A. All results are presented in Table 1. The SCAE achieves competitive results in unsupervised object classification on MNIST and SVHN and under-performs slightly on CIFAR10, which is further discussed in Section 5.
3.3 Ablation study
SCAEs have many moving parts; an ablation study shows which model components are important and to what degree. We train SCAE variants on MNIST as well as a padded-and-translated version of the dataset, where the original digits are translated up to 6 pixels in each direction. Trained models are tested on the test splits of both datasets; additionally, we evaluate the model trained on MNIST on the test split of the affNIST dataset. Testing on affNIST shows whether the model can generalize to unseen viewpoints. This task was used by sparsecaps to evaluate Sparse Unsupervised Capsules; the SCAE achieves higher accuracy, which indicates that it is better at viewpoint generalization. We choose the lin-match performance metric, since it is the one favoured by the unsupervised classification community.
Results are split into several groups and shown in Table 2. We describe each group in turn. Group a) shows that sparsity losses introduced in Section 2.4 increase model performance, but that the posterior loss might not be necessary. Group b) checks the influence of injecting noise into logits for presence probabilities, cf. Section 2.4. Injecting noise into part capsules seems critical, while noise in object capsules seems unnecessary—the latter might be due to sparsity losses. Group c) shows that using similarity (as opposed to affine) transforms in the decoder can be restrictive in some cases, while not allowing deformations hurts performance in every case.
Group d) evaluates the type of the part-capsule encoder. The linear encoder entails a CNN followed by a fully-connected layer, while the conv encoder predicts one feature map for every capsule parameter, followed by global-average pooling. The choice of part-capsule encoder seems not to matter much for within-distribution performance; however, our attention-based pooling does achieve much higher classification accuracy when evaluated on a different dataset, showing better generalization to novel viewpoints.
Additionally, e) using Set Transformer as the object-capsule encoder is essential. We hypothesize that this is due to the natural tendency of Set Transformer to find clusters, as reported in Lee2019set. Finally, f) using special features seems no less important, presumably due to the effects the high-level capsules have on the representation learned by the primary encoder.
4 Related Work
Capsule Networks Our work combines ideas from Transforming Autoencoders (Hinton2011tae) and EM Capsules (Hinton2018capsule). Transforming autoencoders discover affine-aware capsule instantiation parameters by training an autoencoder to predict an affine-transformed version of the input image from the original image plus an extra input, which explicitly represents the transformation. By contrast, our model does not need any input other than the image.
Both EM Capsules and the preceding Dynamic Capsules (Sabour2017capsule) use the poses of parts and learned part→object relationships to vote for the poses of objects. When multiple parts cast very similar votes, the object is assumed to be present, which is facilitated by an iterative inference (routing) algorithm. Iterative routing is inefficient and has prompted further research. wang2018optimization formulated routing as an optimization of a clustering loss and a KL-divergence-based regularization term. zhang2018fast proposed a weighted kernel density estimation-based routing method. encapsule proposed approximating routing with two branches and sending feedback via optimal transport divergence between two distributions (lower and higher capsules). In contrast to prior work, we use objects to predict parts rather than vice versa; therefore we can dispense with iterative routing at inference time. The encoder of the OCAE learns how to group parts into objects, and it respects the single-parent constraint because it is trained using derivatives produced by a decoder that uses a mixture model of parts, which assumes that each part must be explained by a single object.
Additionally, since it is the objects that predict parts, the parts are allowed to have fewer degrees of freedom in their poses than objects (as in the CCAE). Inference is still possible, because the OCAE encoder makes object predictions based on all the parts rather than an individual part.
A further advantage of our version of capsules is that it can perform unsupervised learning. Previous versions of capsules used discriminative learning, though sparsecaps used the reconstruction MLP introduced in Sabour2017capsule to train Dynamic Capsules without supervision and showed that unsupervised training for capsule-conditioned reconstruction helps with generalization to affNIST classification; we further improve on their results, cf. Section 3.3.
There are two main approaches to unsupervised object category detection in computer vision. The first one is based on representation learning and typically requires discovering clusters or learning a classifier on top of the learned representation. Eslami2016 and Kosiorek2018sqair use an iterative procedure to infer a variable number of latent variables, one for every object in a scene, that are highly informative of object class, while Greff2019multi and Burgess2019monet perform unsupervised instance-level segmentation in an iterative fashion. While similar to our work, these approaches cannot decompose objects into their constituent parts and do not provide an explicit description of object shape (templates and their poses in our model).
The second approach targets classification explicitly by minimizing mutual information (MI)-based losses and directly learning class-assignment probabilities. IIC (iic) maximizes an exact estimator of MI between two discrete probability vectors describing (transformed) versions of the input image. DeepInfoMax (Hjelm2019deepinfomax) relies on negative samples and maximizes MI between the predicted probability vector and its input via noise-contrastive estimation (Gutmann2010nce). This class of methods directly maximizes the amount of information contained in an assignment to discrete clusters, and they hold state-of-the-art results on most unsupervised classification tasks. MI-based methods suffer from typical drawbacks of mutual information estimation: they require heavy data augmentation and large batch sizes. This is in contrast to our method, which achieves comparable performance with a batch size no bigger than 128 and with no data augmentation.
Geometrical Reasoning Other attempts at incorporating geometrical knowledge into neural networks include exploiting equivariance properties of group transformations (Cohen2016group) or new types of convolutional filters (mallat; kocvok). Although they achieve significant parameter efficiency in handling rotations or reflections compared to standard CNNs, these methods cannot handle additional degrees of freedom of affine transformations, such as scale. lenssen combined capsule networks with group convolutions to guarantee equivariance and invariance in capsule networks. Spatial Transformers (ST; Jaderberg2015) apply affine transformations to the image sampling grid, while steerable networks (Cohen2016steerable; Jacobsen2017dynamic) dynamically change convolutional filters. These methods are similar to ours in the sense that transformation parameters are predicted by a neural network, but differ in the sense that ST uses global transformations applied to the whole image while steerable networks use only local transformations. Our approach can use different global transformations for every object as well as local transformations for each of their parts.
5 Discussion
The main contribution of our work is a novel method for representation learning, in which highly structured decoder networks are used to train one encoder network that can segment an image into parts and their poses and another encoder network that can compose the parts into coherent wholes. Despite the fact that our training objective is not concerned with classification or clustering, the SCAE is the only method that achieves competitive results in unsupervised object classification without relying on mutual information (MI). This is significant since, unlike our method, MI-based methods require sophisticated data augmentation. It may be possible to further improve results by using an MI-based loss to train the SCAE, where the vector of capsule probabilities could take the role of the discrete probability vectors in IIC (iic). The SCAE under-performs on CIFAR10, which could be because it uses fixed templates, which are not expressive enough to model real data. This might be fixed by building deeper hierarchies of capsule autoencoders (complicated scenes in computer graphics are modelled as deep trees of affine-transformed geometric primitives) as well as using input-dependent shape functions instead of fixed templates, both of which are promising directions for future work. It may also be possible to make a much better PCAE for learning the primary capsules by using a differentiable renderer in the generative model that reconstructs pixels from the primary capsules.
Finally, the SCAE could be the ‘figure’ component of a mixture model that also includes a versatile ‘ground’ component that can be used to account for everything except the figure. A complex image could then be analyzed using sequential attention to perceive one figure at a time.
We would like to thank Sandy H. Huang for help with editing the manuscript and making Figure 1. Additionally, we would like to thank S. M. Ali Eslami and Danijar Hafner for helpful discussions throughout the project. We also thank Hyunjik Kim, Martin Engelcke, Emilien Dupont and Simon Kornblith for feedback on initial versions of the manuscript.
Appendix A Model Details
A.1 Constellation Experiments
The CCAE uses a four-layer Set Transformer as its encoder. Every layer has four attention heads, 128 hidden units per head, and is followed by layer norm (Ba2016LayerN). The encoder outputs three 32-dimensional vectors, one for each object capsule. The decoder uses a separate neural net for each object capsule to predict all parameters used to model its points: this includes four candidate part predictions per capsule for a total of 12 candidates. In this experiment, each object→part relationship (OP) is just a 2-D offset in the object’s frame of reference (instead of a matrix), and it is affine-transformed by the corresponding OV matrix to predict the 2-D point.
A.2 Image Experiments
We use a convolutional encoder for part capsules and a Set Transformer encoder (Lee2019set) for object capsules. Decoding from object capsules to part capsules is done with MLPs, while the input image is reconstructed with affine-transformed learned templates. Details of the architectures we used are available in Table 3.
For SVHN and CIFAR10, we use normalized Sobel-filtered images as the target of the reconstruction to emphasize the importance of shape. Figure 6 in Appendix B shows examples of SVHN and CIFAR10 reconstructions. The filtering procedure is as follows: 1) apply Sobel filtering, 2) subtract the median color, 3) take the absolute value of the image, 4) normalize the image values to a fixed range.
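The four filtering steps can be sketched on a small single-channel image as follows. Several details are assumptions made for illustration: only the horizontal Sobel kernel is applied, the filter is evaluated on the unpadded (valid) region, and the final step rescales by the maximum so that values lie in [0, 1]. The function name `sobel_target` is hypothetical.

```python
import statistics

# Horizontal-gradient Sobel kernel (responds to vertical edges).
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]


def sobel_target(image):
    """Build a reconstruction target from a single-channel image (list of rows)."""
    h, w = len(image), len(image[0])
    # 1) Sobel filtering over the valid (unpadded) region.
    grad = [[sum(SOBEL_X[a][b] * image[i + a][j + b]
                 for a in range(3) for b in range(3))
             for j in range(w - 2)] for i in range(h - 2)]
    # 2) Subtract the median value of the filtered image.
    median = statistics.median(v for row in grad for v in row)
    # 3) Take the absolute value.
    magnitude = [[abs(v - median) for v in row] for row in grad]
    # 4) Rescale so that the largest value is 1 (assumed normalization).
    peak = max(v for row in magnitude for v in row) or 1.0
    return [[v / peak for v in row] for row in magnitude]


# A 4 x 9 image with a vertical edge between columns 3 and 4.
image = [[0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0] for _ in range(4)]
target = sobel_target(image)
```

The resulting target is zero in flat regions and one along the edge, which is why such targets emphasize shape over colour or brightness.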
All models are trained with the RMSProp optimizer (tieleman2012rms). Batch size is 64 for constellations and 128 for all other datasets. The learning rate was kept constant (without any decay) for MNIST and constellation experiments, while we ran a hyperparameter search for SVHN and CIFAR10 over the learning rate and over the interval, in weight updates, of an exponential learning rate decay of 0.96. The same learning rate was selected for both SVHN and CIFAR10; the number of decay steps was chosen separately for each. The lin-pred accuracy on a validation set is used as a proxy to select the best hyperparameters, including the weights on different losses, reported in Table 4. Models were trained on single Tesla V100 GPUs, which took 40 minutes for constellation experiments and less than a day for CIFAR10.
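An exponential learning rate decay of 0.96 applied every fixed number of weight updates can be written as below. This is a sketch only: the base learning rate and the decay interval in the example are placeholder values, not the ones used in the experiments, and whether the decay is applied in discrete steps (`staircase=True`) or continuously is an assumption.

```python
def decayed_learning_rate(base_lr, step, decay_steps, decay_rate=0.96, staircase=True):
    """Learning rate after `step` weight updates under exponential decay."""
    # With staircase=True the rate drops once every `decay_steps` updates;
    # otherwise it decays smoothly with every update.
    exponent = step // decay_steps if staircase else step / decay_steps
    return base_lr * decay_rate ** exponent


# Placeholder values for illustration only.
lr_start = decayed_learning_rate(1e-3, step=0, decay_steps=1000)
lr_later = decayed_learning_rate(1e-3, step=2500, decay_steps=1000)
lr_smooth = decayed_learning_rate(1e-3, step=2500, decay_steps=1000, staircase=False)
```

With these placeholder values the rate has been multiplied by 0.96 twice after 2500 updates in the staircase variant, and slightly more than twice in the smooth variant.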
|loss weight||constellation||MNIST||SVHN||CIFAR10|
|part ll weight||1||1||2.56||2.075|
|image ll weight||N/A||1||1||1|
|prior within sparsity||1||1||0.22||0.17|
|prior between sparsity||1||1||0.1||0.1|
|posterior within sparsity||0||10||8.62||1.39|
|posterior between sparsity||0||10||0.26||7.32|
Appendix B Reconstructions