1 Introduction
An attractive way to set up the problem of object recognition is hierarchically
, where an object is described in terms of its parts, and these parts are in turn composed of subparts, and so on. For example a face can be described in terms of the eyes, nose, mouth, hair, etc.; and a teapot can be described in terms of a body, handle, spout and lid parts. This approach has a long history in computer vision, see e.g. the work on Pictoral Structures by
Fischler and Elschlager (1973), and RecognitionbyComponents by Biederman (1987). More recently Felzenszwalb et al. (2009) used discriminativelytrained partsbased models to obtain stateoftheart results (at the time) for object recognition. Advantages of recognizing objects by first recognizing their constituent parts include tolerance to the occlusion of some parts, and that parts may vary less under a change of pose than the appearance of the whole object.Recent work by Hinton and collaborators (Sabour et al., 2017; Hinton et al., 2018) has developed capsule networks. These exploit the fact that if the pose^{1}^{1}1i.e. the location and rotation of the object in 2D or 3D.
of the object changes, this can have very complicated effects on the pixel intensities in an image, but the geometric transformation between the object and the parts can be described by a simple linear transformation (as used in computer graphics). In these papers a part in a lower level can vote for the pose of an object in the higher level, and an object’s presence is established by the agreement between votes for its pose.
Hinton et al. (2018, p. 1)use an iterative process called “routingbyagreement” which “updates the probability with which a part is assigned to a whole based on the proximity of the vote coming from that part to the votes coming from other parts that are assigned to that whole”. Subsequently
Kosiorek et al. (2019)framed inference for a capsule network in terms of an autoencoder, the Stacked Capsule Autoencoder (SCAE). Here, instead of the iterative routingbyagreement algorithm, a neural network
takes as input the set of input parts and outputs predictions for the object capsules’ instantiation parameters . Further networks are then used to predict part candidates from each .The objective function used in Hinton et al. (2018) (their eq. 4) is quite complex (involving four separate terms), and is not derived from first principles. In this paper we argue that the description in the paragraph above is backwards—it is more natural to describe the generative process by which an object gives rise to its parts, and that the appropriate routingbyagreement inference algorithm then falls out naturally from this principled formulation.
The contributions of this paper are to:

Derive a novel variational inference algorithm for routingbyagreement, based on a generative model of objectpart relationships, including a relaxation of the permutationmatrix formulation for matching object parts to observations;

Focus on the problem of inference for scenes containing multiple objects. Much of the work on capsules considers only single objects (although sec. 6 in Sabour et al. (2017) is a notable exception).

Demonstrate the effectiveness of our algorithm on (i) “constellations” data generated from multiple geometric objects (e.g. triangles, squares) at arbitrary translations, rotations and scales; and (ii) data of multiple faces from a novel partsbased model of faces;

Evaluate the performance of our algorithm and the RANSAC method vs. competitors on the constellations data.

Derive a learning algorithm for the object models, based on variational expectation maximization (Jordan et al., 1999).
This paper extends our previous work (Nazabal et al., 2022) by covering the learning of the models as well as inference.
The structure of the remainder of the paper is as follows: in section 2 we provide an overview of the method. Section 3 gives details of the generative model and the matching distribution between observed and model parts. The variational inference algorithm derived from this model is given in section 4, and RANSAC inference is described in sec. 5. Approaches to learning generative capsule models are given in section 6, and related work is discussed in section 7. The experiments are described in section 8, and results are given in sec. 9. We conclude with a discussion.
(a)  (b)  (c) 
2 Overview
Our images consist of a set of objects in different poses. Examples include images of handwritten digits, faces, or geometric shapes in 2D or 3D. An object can be defined as an instantiation of a specific model object (or template) along with a particular pose (or geometric transformation). Furthermore, objects, and thus templates, are decomposed in parts, which are the basic elements that comprise the objects. For example, faces can be decomposed into parts (e.g. mouth, nose etc.), or a geometric shape can be decomposed into vertices. These parts can have internal variability, (e.g. eyes open or shut).
More formally, let be the set of templates that are used to generate a scene. Each template is composed of parts . We assume that scenes can only be generated using the available templates. Furthermore, every scene can present a different configuration of objects, with some objects missing in some scenes. For example, in scenes that could potentially contain all digits from 0 to 9 once, and if only the digits 2 and 5 are in the image, we consider that the other digits are missing. If all the templates were employed in the scene, then the number of observed parts is equal to the sum of all the parts of all the templates .
Each observed template in a scene is then transformed by an independent transformation , different for each template, generating a transformed object
(1) 
The transformation includes both the geometric transformation of the template, and also the appearance variability in the parts.
Assume that we are given a scene composed of observed parts coming from multiple templates. The inference problem for involves a number of different tasks. We need to determine which objects from the set of templates were employed to generate the scene. Also, we need to infer what transformation was applied to each template to generate the objects. This allows us to infer the correspondences between the template parts and the scene parts.
We demonstrate our method on “constellations” data as shown in Fig. 1(a), and data containing multiple faces (Fig. 1(c)). In the constellations data, the real generators are triangle and square objects, but only their vertices are provided in the data. The faces data is generated from the partsbased model of faces shown in Fig. 1(b) and described in sec. 8.2.
3 A Generative Capsules Model (GCM)
We propose a generative model to describe the problem. Consider a template (or model) for the th object. is composed of parts .^{2}^{2}2For simplicity of notation we suppress the dependence of on for now. Each part is described in its reference frame by its geometry and its appearance . Each object also has associated latent variables which transform from the reference frame to the image frame, so is split into geometric variables and appearance variables .
Geometric transformations:
Here we consider 2D templates and a similarity transformation (translation, rotation and scaling) for each object, but this can be readily extended to allow 3D templates and a scenetoviewer camera transformation. We assume that contains the and locations of the part, and also its size and orientation relative to the reference frame.^{3}^{3}3For the constellations data, the size and orientation information is not present, nor are there any appearance features. The size and orientation are represented as the projected size of the part onto the any axes, as this allows us to use linear algebra to express the transformations (see below). Thus .
Consider a template with parts for that we wish to scale by a factor , rotate through with a clockwise rotation angle and translate by . We obtain a transformed object with geometric observations for the th part , where the and subscripts denote the projections of the scaled and rotated part onto the and axes respectively ( and are mnemonic for cosine and sine).
For each part in the template, the geometric transformation works as follows:
(2) 
Decoding the third equation, we see that using standard trigonometric identities. The equation is derived similarly. We shorten eq. 2 to , where is the
column vector, and
is the matrix to its left. Allowing Gaussian observation noise with precision we obtain(3) 
The prior distribution over similarity transformations is modelled with a Gaussian distribution with mean and covariance matrix :
(4) 
where denotes the set . Notice that modelling with a Gaussian distribution implies that we are modelling the translation in with a Gaussian distribution. If and then has a distribution, and
is uniformly distributed in its range
by symmetry. For more complex linear transformations (e.g. an affine transformation), we need only to increase the dimension of and change the form of , but the generative model in (4) would remain the same.Appearance transformations:
The appearance of part in the image depends on . For our faces data, is a vector latent variable which models the covariation of the appearance of the parts via a linear (factor analysis) model; see sec. 8.2 for a fuller description. Hence
(5) 
where maps from to the predicted appearance features in the image,
is a diagonal matrix of variances and
allows for the appearance features to have a nonzero mean. The dimensionality of the th part of the th template is . The prior for is taken to be a standard Gaussian, i.e. . Combining (4) and the prior for , we have that , where stacks and from the appearance, and is a diagonal matrix with blocks and .Joint distribution:
Let indicate whether observed part matches to part of object . The set of these match variables is denoted by . Every observation belongs uniquely to a tuple , or in other words, a point belongs uniquely to the part defined by acting on the template matrix . The opposite is also partially true; every tuple belongs uniquely to a point , or it is unassigned if part of template is missing in the scene.
The joint distribution of the variables in the model is given by
(6) 
where
is a Gaussian mixture model explaining how the points in a scene were generated from the templates
(7) 
where consists of the diagonal matrices and and consists of a zero vector for the geometric features stacked on top of the mean for the appearance features . Note that has blocks of zeros so that does not depend on , and does not depend on .
Annealing parameter:
During the fitting of the model to data, it is useful to modify the covariance matrix to , where is a parameter . The effect of this is to inflate the variances in , allowing greater uncertainty in the inferred matches early on in the fitting process, as used, e.g. in Revow et al. (1996). is increased according to an annealing schedule during the fitting.
Match distribution :
In a standard Gaussian mixture model, the assignment matrix is characterized by a Categorical distribution, where each point is assigned to one part
(8) 
with being a 0/1 vector with only one 1, and being the probability vector for each tuple . However, the optimal solution to our problem occurs when each part of a template belongs uniquely to one observed part in a scene. This means that should be a permutation matrix, where each point is assigned to a tuple and vice versa. Notice that a permutation matrix is a square matrix, so if , we add dummy rows to , which are assigned to missing points in the scene.
The set of permutation matrices of dimension is a discrete set containing permutation matrices. A discrete prior over permutation matrices assigns each valid matrix a probability :
(9) 
with and being the indicator function, equal to if and otherwise.
The number of possible permutation matrices increases as , which makes exact inference over permutations intractable, except for very small . An interesting property of
is that its first moment
is a doublystochastic (DS) matrix, a matrix of elements in whose rows and columns sum to 1. We propose to relax to a distributionthat is characterized by the doublystochastic matrix
with elements , such that :(10) 
is fully characterized by elements. In the absence of any prior knowledge of the affinities, a uniform prior over with can be used. However, note that can also represent a particular permutation matrix by setting the appropriate entries of to 0 or 1, and indeed we would expect this to occur during variational inference (see sec. 4) when the model converges to a correct solution.
Related models:
Our model for a single object has both geometric and appearance variability (see eqs. 3 and 5). A similar model but with geometric features only was developed by Revow et al. (1996). Fergus et al. (2003) described a “constellation of parts” model, that used a joint Gaussian model for locations of the parts, and an imagepatch model for each part appearance. However, the appearance model was a single Gaussian per part, without the correlations between parts afforded by the factor analysis model. This model was applied to images of single (foreground) objects, summing out over possible assignments . Rao and Ballard (1999) developed a hierarchical factor analysis model, but used it to model extended edges in natural image patches rather than correlations between the parts of an object. See sec. 6 for further discussion of partsbased models.
Hierarchical modelling:
above we have described a twolayer model with partobject relations. This can of course be extended to a deeper hierarchy; for example we would expect to find relationships between the objects in a scene, such as the relationships between a dining table and the dining chairs, or between a bed and its associated nightstand(s). Such scenelevel relationships can be formulated in terms of graphical models for the groups of objects. We believe the most difficult aspect is handling the assignment between model parts and observed parts (as covered in this paper), and that inference in the graphical model above is fairly standard, and is left for future work.
4 Variational Inference
Variational inference for the above model can be derived similarly to the Gaussian Mixture model case (Bishop, 2006, Chapter 10). The variational distribution under the mean field assumption is given by , where the optimal solutions for and under the generative model can be expressed as
(11)  
(12) 
For we obtain an expression with the same form as the prior in (10)
(13) 
where represents the unnormalized probability of point being assigned to tuple and vice versa. These unnormalized probabilities have a different form depending on whether we are considering a point that appears in the scene ()
(14) 
or whether we are considering a dummy row of the prior (),
(15) 
When a point is part of the scene (14), and thus the update of is similar to the Gaussian mixture model case. However, if a point is not part of the scene (15), and thus then the matrix is not updated and the returned value is the prior . The expectation term in (14) is given by:
(16) 
(17) 
where . The elements of matrix
represent the posterior probability of each point
being uniquely assigned to the partobject tuple and viceversa. This means that needs to be a DS matrix. This can be achieved by employing the SinkhornKnopp algorithm (Sinkhorn and Knopp, 1967) as given in Algorithm 1, which updates a square nonnegative matrix by normalizing the rows and columns alternately until the resulting matrix becomes doubly stochastic.The use of the SinkhornKnopp algorithm for approximating matching problems has also been described by Powell and Smith (2019) and Mena et al. (2020), but note that in our case we also need to alternate with inference for .
Furthermore, the optimal solution to the assignment problem occurs when is a permutation matrix itself. When this happens we exactly recover a discrete posterior (with the same form as (9)) over permutation matrices where one of them has probability one, with the others being zero.
The evidence lower error bound (ELBO) for this model is derived in Appendix A.
The distribution for is a Gaussian with
(18)  
(19)  
(20) 
where the updates for both and depend explicitly on the annealing parameter and the templates employed in the model. Note that the prediction from datapoint to the mean of depends on , i.e. a weighted sum of the predictions of each part with weights . These expressions remain the same when considering a Gaussian mixture prior such as (8).
Algorithm 2 summarizes the inference procedure for this model; the alternating updates of and carry out routingbyagreement.
4.1 Comparison with other objective functions
In Hinton et al. (2018) an objective function is defined (their eq. 1) which considers inference for the pose of a higherlevel capsule on pose dimension . Translating to our notation, combines the predictions from each datapoint for capsule as , where is a Gaussian, and is the “routing softmax assignment” between and . It is interesting to compare this with our equation (20). Firstly, note that the vote of to part in object is given explicitly by i.e. we do not require the introduction of an independent voting mechanism, this falls out directly from the inference. Secondly, note that our must keep track not only of assignments to objects, but also to parts of the objects. In our experiments with the constellations data each observed part could match any object/part combination of the models, so this is necessary. For the faces data, the observed parts are of identifiable type (e.g. nose, mouth), so in this case they only need to vote for the object. In contrast to Hinton et al. (2018), our inference scheme is derived from variational inference on the generative model for , rather than introducing an ad hoc objective function that corresponds to a clustering in space.
De Sousa Ribeiro et al. (2020) develop a variational Bayes extension of the model of Hinton et al. (2018). They write down a mixture model in space where the “datapoints” are the votes , and use Bayesian priors for the mixing proportions, and for the means and covariance matrices of the components. This can improve training stability, e.g. by reducing the problem of variancecollapse where a capsule claims sole custody of of a datapoint. However, this is still a model for clustering in space, and not a generative model for .
The specialization of the SCAE method of Kosiorek et al. (2019) to constellation data is called the “constellation capsule autoencoder (CCAE)” and discussed in their sec. 2.1. Under their equation 5, we have that
(21) 
where is the presence probability of capsule , is the conditional probability that a given candidate part exists, and is the predicted location of part . The s are predicted by the network , while the s and s are produced by separate networks for each part .
We note that (21) provides an autoencoder style reconstructive likelihood for , as the ’s and ’s depend on the data. To handle the arbitrary number of datapoints , the network employs a Set Transformer architecture (Lee et al., 2019). In comparison to our iterative variational inference, the CCAE is a “one shot” inference mechanism. This may be seen as an advantage, but in scenes with overlapping objects, humans may perform reasoning like “if that point is one of the vertices of a square, then this other point needs to be explained by a different object” etc, and it may be rather optimistic to believe this can be done in a simple forward pass. Also, the CCAE cannot exploit prior knowledge of the geometry of the objects, as it relies on an opaque network which requires extensive training.
5 A RANSAC Approach to Inference
A radical alternative to “routing by agreement” inference is to make use of a “random sample consensus” approach (RANSAC, Fischler and Bolles, 1981), where a minimal number of parts are used in order to instantiate an object. The original RANSAC fitted just one object, but Sequential RANSAC (see, e.g., Torr 1998; Vincent and Laganière 2001) repeatedly removes the parts associated with a detected object and reruns RANSAC, so as to detect all objects.
For the constellations problem, we can try matching any pair of points on one of the templates to every possible pair of
points in the scene. The key insight is that a pair of known points is sufficient to estimate the 4dimensional
vector in the case of similarity transformations. Using the transformation , we can then predict the location of the remaining parts of the template, and check if these are actually present. If so, this provides evidence for the existence of and . After considering the subsets, the algorithm then combines the identified instantiations to give an overall explanation of the scene.More details of the RANSAC approach are given in Algorithm 3. Assume we have chosen parts and as the basis for object , and that we have selected datapoints and as their hypothesized counterparts in the image. Let be the vector obtained by stacking and , and be the square matrix obtained by stacking and . Then . Finally, selects those points in that are close to with a given tolerance and add them to the output. Among them, the solution is given by the one that minimizes .
The above algorithm chooses a specific basis for each object, but one can consider all possible bases for each object. It is then efficient to use a hash table to store the predictions for each part, as used in Geometric Hashing (Wolfson and Rigoutsos, 1997). Geometric Hashing is traditionally employed in computer vision to match geometric features against previously defined models of such features. This technique works well with partially occluded objects, and is computationally efficient if the basis dimension is low.
For the faces data, each part has location, scale and orientation information, so a single part is sufficient to instantiate the whole object geometrically. For the appearance, we follow the procedure for inference in a factor analysis model with missing data, as given in Williams et al. (2019), to predict given the appearance of the single part.
It is interesting to compare the RANSAC and routingbyagreement approaches to identifying objects. In RANSAC a minimal basis is chosen so as to instantiate the object, which is then verified by finding the remaining parts in the predicted locations. In contrast routingbyagreement makes predictions from all of the observed parts, and then seeks to adjust the matrix so as to cluster these predictions into coherent objects. This is generally an iterative process, although the SCAE autoencoder architecture is “one shot”.^{4}^{4}4One can extend variational autoencoders to use iterative amortized inference, see Marino et al. (2018).
(a)  

(b)  
partseta  partsetb 
6 Learning Generative Capsule Models
Above we have assumed that the models are given, i.e. that and are known for each part and model . One advantage of the interpretable nature of the GCM is that one can use a “curriculum learning” approach (Bengio et al., 2009), where individual models can first be learned, and then composed together for inference in more complex situations.
For a given object, a key issue is the learning of the decomposition into parts. For the constellations data this is not an issue as the points (i.e. parts) are given, but for image data it must be addressed. An important issue is the degree of flexibility of each part—is it simply a fixed template, or are there internal degrees of flexibility? An example of the former is nonnegative matrix factorization (NMF) (Lee and Seung, 1999), where, for example, aligned images of faces were decomposed into nonnegative basis functions (parts) that were combined with nonnegative coefficients. An example of a richer partsbased model is by Ross and Zemel (2006), who developed “multiple cause factor analysis” (MCFA) and applied it to faces, to learn the regions governed by each part and the variability within each part.^{5}^{5}5In contrast to our work they did not have a higherlevel factor analyser to model correlations between part appearances, but did allow variability in the masks of the parts. This work was also carried out on aligned, vertically oriented face images, so it was not necessary to factor out geometric transformations. For our work on faces, we made use of the partsbased model from the “PhotoFitMe” work described in section 8.2. This provided a readymade parts decomposition, but we learned a factor analysis model on top to model the correlations between the parts.
Kosiorek et al. (2019)
developed a Part Capsule Autoencoder (PCAE) to learn parts from images, and applied it to MNIST images. Each PCAE part is a template which can undergo an affine transformation, and it has "special features" that were used to encode the colour of the part. If the overall model is to be equivariant to geometric transformations, it is vital that the input part decomposition is stable to such variation, otherwise the model is building on shaky foundations. However, we have observed that the parts detected by PCAE are not equivariant to rotation. Figure
2 shows that the PCAE part decompositions inferred for a digit 4 are not stable to different angles of rotation: notice e.g. in panel (a) how the part coloured white maps to different parts of the 4 for  and . The details of this experiment are described in Appendix D.As identified above, a strength of the GCM is that individual models can first be learned, before moving on to more complex situations. We can start with an initial guess for the object configuration, which can be chosen as one of the noisy configurations from the training set. We then run variational expectation maximization (variational EM; see sec. 6.2 in Jordan, Ghahramani, and Jaakkola 1999). In the Estep, variational inference is run as in sec. 4 with this model to infer the and variables on each training example. In the Mstep the locations of the template points are updated so as to increase the ELBO, summed over the training cases. Details of this update are given in Appendix B for the constellations dataset, and experimental results are given in sec. 9.2.
7 Related Work
The origins of capsule networks (including the name “capsule”) can be traced back at least as far as the work of Hinton et al. (2011) on Transforming Autoencoders. Capsules were further developed in Sabour et al. (2017). Above we have already discussed later work by Hinton and coworkers, including Hinton et al. (2018), Kosiorek et al. (2019), and also De Sousa Ribeiro et al. (2020).
There is some other work on capsules, for example Li and Zhu (2019)
develop a capsule Restricted Boltzmann Machine (RBM). This generative model contains a number of capsules, each having a binary existence variable and a vector of capsule variables similar to our
. However, these capsule variables are not used to model geometric transformations (the key idea behind capsule networks), but instead to handle appearance variability of the parts.Both Li et al. (2020) and Smith et al. (2021) propose a layered directed generative model. These both make use of the "single parent constraint", where a capsule in a layer can be connected to only one capsule (its parent) in the layer above. Such treestructured relationships are similar to those studied in Hinton et al. (2000) and Storkey and Williams (2003). Like Li and Zhu (2019), Li et al. (2020) do not explicitly consider modelling geometric transformations with their network. This aspect is explicit in Smith et al. (2021), where their capsule variables (similar to our s) do model pose, but not appearance variability.
However, most importantly, these recent capsule models do not properly handle the fact that the input to a capsule should be a set of parts; instead in their work the first layer capsules that interact with the input image model specific locationdependent features/templates in the image, and their second layer capsules have interactions with the specific first layer capsules (e.g. the fixed affinity of Smith et al. (2021) parent in layer for child ). But if we consider a secondlayer capsule that is, for example, detecting the conjunction of the 3 strokes comprising a letter “A”, then at different translations, scales and rotations of the A it will need to connect to different image level stroke features, so these connection affinities must depend on transformation variables of the parent. This point is illustrated in Fig. 2 for the PCAE parts, and is also made in sec. 5.4 of Smith et al. (2021), where it is shown that the parts decomposition used of a digit image is not stable wrt rotation of the digit.
8 Experimental details
Below we provide details of the data generators, inference methods and evaluation criteria for the constellations data in sec. 8.1, and the faces data in sec. 8.2.
8.1 Constellations data
In order to allow fair comparisons, we use the same dataset generator for geometric shapes employed by Kosiorek et al. (2019). We create a dataset of scenes, where each scene consists of a set of 2D points, generated from different geometric shapes. The possible geometric shapes (templates) are a square and an isosceles triangle, with parts being represented by the 2D coordinates of the vertices. We use the same dimensions for the templates as used by Kosiorek et al. (2019), side 2 for the square, and base and height 2 for the triangle. All templates are centered at . In every scene there are at most two squares and one triangle, so . Each shape is transformed with a random transformation to create a scene of 2D points given by the object parts. To match the evaluation of Kosiorek et al. (2019), all scenes are normalized so as the points lie in on both dimensions. When creating the scene, we select randomly (with probability 0.5) whether an object is going to be present or not, but delete empty scenes. A test set used for evaluation is comprised of 450460 nonempty scenes, based on 512 draws.
Additionally, we study how the methods compare when objects are created from noisy templates. We consider that the original templates used for the creation of the images are corrupted with Gaussian noise with standard deviation
. Once the templates are corrupted with noise, a random transformation is applied to obtain the object of the scene. As with the noisefree data, the objects are normalized to lie in [1,1] on both dimensions.The CCAE is trained by creating random batches of 128 scenes as described above and optimizing the objective function in (21). The authors run CCAE for epochs, and when the parameters of the neural networks are trained, they use their model on the test dataset to generate an estimation of which points belong to which capsule, and where the estimated points are located in each scene.
The variational inference approach allows us to model scenes where the points are corrupted with some noise. The annealing parameter controls the level of noise allowed in the model. We use an annealing strategy to fit , increasing it every time the ELBO has converged, up to a maximum value of
. We set the hyperparameters of the model to
, , and . We run Algorithm 2 with 5 different random initializations of and select the solution with the best ELBO. Similarly to Kosiorek et al. (2019), we incorporate a sparsity constraint in our model, that forces every object to explain at least two parts. Once our algorithm has converged, for a given if any and it means that the model has converged to a solution where object is assigned to less than 2 parts. In these cases, we rerun Algorithm 2 with a new initialization of . Notice that this is also related to the minimum basis size necessary in the RANSAC approach for the types of transformations that we are considering.The implementation of SubsetMatch in Algorithm 3 considers all matches between the predicted and the scene points where the distance between them is less than . Among them, it selects the matching with minimum distance between scene and predicted points.
For both the variational inference algorithm and the RANSAC algorithm, a training dataset is not necessary if we have prior knowledge of the target shapes. This contrasts with SCAE, which learns from whole scenes. If prior knowledge is not available, these shapes can be learned as described in sec. 6, and illustrated in sec. 9.2.
Unfortunately we do not have access to the code employed by Hinton et al. (2018), so we have been unable to make comparisons with it.
Evaluation: Three metrics are used to evaluate the performance of the different methods: variation of information (Meilă, 2003), adjusted Rand index (Hubert and Arabie, 1985) and segmentation accuracy (Kosiorek et al., 2019)
. They are based on partitions of the datapoints into those associated with each object, and those that are missing. Compared to standard clustering evaluation metrics, some modifications are needed to handle the missing objects. Details are provided in Appendix
C. We also use an average scene accuracy metric, where a scene is correct if the method returns the full original scene, and is incorrect otherwise.8.2 Partsbased face model
We have developed a novel hierarchical partsbased model for face appearances. It is based on five parts, namely eyes, nose, mouth, hair and forehead, and jaw (see Fig. 1(b)). Each part has a specified mask, and we have cropped the hair region to exclude highly variable hairstyles. This decomposition is based on the "PhotoFit Me" work and data of Prof. Graham Pike, see https://www.open.edu/openlearn/PhotoFitMe. For each part we trained a probabilistic PCA (PPCA) model to reduce the dimensionality of the raw pixels; the dimensionality is chosen so as to explain 95% of the variance. This resulted in dimensionalities of 24, 11, 12, 16 and 28 for the eyes, nose, mouth, jaw and hair parts respectively. We then add a factor analysis (FA) model on top with latent variables to model the correlations of the PPCA coefficients across parts. The dataset used (from PhotoFit Me) is balanced by gender (female/male) and by race (Black/Asian/Caucasian), hence the highlevel factor analyser can model regularities across the parts, e.g. wrt skin tone. is predicted from as as in (5). would have an effect on the part appearance, e.g. by scaling and rotation, but this can be removed by running the PPCA part detectors on copies of the input image that have been rescaled and rotated.
The “PhotoFit Me” project utilizes 7 different partimages for each gender/race group, for each of the five part types. As a result, we generated synthetic faces for each group, by combining these faceparts, which led to a total of faces. All faces were centered on a pixel canvas. For each synthetic face we created an appearance vector , which consisted of the stacked vectors from the 5 different principal component subspaces. Finally, we created a balanced subset from the generated faces ( images), which we used to train a FA model. We tuned the latent dimension of this model by training it multiple times with a different number of factors, and finally chose 12 factors, where a knee in the reconstruction loss on the face data was observed on a validation set.
To evaluate our inference algorithm we generated pixel scenes of faces. These consisted of 2, 3, 4 or 5 randomly selected faces from a balanced testset of synthetic faces, which were transformed with random similarity transformations. The faceobjects were randomly scaled down by a minimum of 50% and were also randomly translated and rotated, with the constraint that all the faces fit the scene and did not overlap each other. An example of such a scene is shown in Fig. 1(c), and further examples are shown in Figure 5. For each scene it is easy to determine the correct number of faces, as the number of faces present is equal to the count of each of the parts detected. Afterwards, these two constraints were dropped to test the ability of our model to perform inference with occluded parts, see Figure 5(e) for an example. These occluded scenes were comprised of 3 faces. In our experiments we assume that the face parts are detected accurately, i.e. as generated.
In the case of facial parts—and parts with both geometric and appearance features in general—it only makes sense to assign the observed parts to template parts of the same type (e.g. an observed “nose” part should be assigned only to a template “nose” part). We assume that this information is known, since the size of the appearance vector of each parttype is unique. Thus it no longer makes sense to initialize the assignment matrix uniformly for all entries, but rather only for the entries that correspond to templates of the observed part’s type. Consequently, (14) is only utilized for pairs of the same type. Similarly to the constellation experiments, we initialized the assignment matrix 5 times and selected the solution with the largest ELBO.
In the experiments we evaluated the part assignment accuracy of the algorithms. In a given scene, the assignment is considered correct if all the observed parts have been correctly assigned to their corresponding template parts with the highest probability. In order to evaluate the prediction of the appearance features, we measured the root mean square error (RMSE) between the input and generated scenes.
9 Results
We present results for inference on the constellations data in sec. 9.1, and for learning constellation models in sec. 9.2. Results for inference on the faces data are given in sec. 9.3.
9.1 Constellation Inference Experiments
In Table 1 we show a comparison between CCAE, the variational inference method with a Gaussian mixture prior (8) (GCMGMM), with a DS prior over permutation matrices (9) (GCMDS), and the RANSAC approach for scenes without noise and with noise levels of .^{6}^{6}6Code at: https://github.com/anazabal/GenerativeCapsules For GCMGMM and GCMDS we show the results where the initial . The effect of different initializations is discussed below.
Metric  Model  =0  =0.1  =0.25 

SA  CCAE  0.828  0.754  0.623 
GCMGMM  0.753  0.757  0.744  
GCMDS  0.899  0.882  0.785  
RANSAC  1  0.992  0.965  
ARI  CCAE  0.599  0.484  0.248 
GCMGMM  0.586  0.572  0.447  
GCMDS  0.740  0.699  0.498  
RANSAC  1  0.979  0.914  
VI  CCAE  0.481  0.689  0.988 
GCMGMM  0.478  0.502  0.677  
GCMDS  0.299  0.359  0.659  
RANSAC  0  0.034  0.135  
CCAE  0.365  0.138  0.033  
Scene  GCMGMM  0.179  0.173  0.132 
Acc  GCMDS  0.664  0.603  0.377 
RANSAC  1  0.961  0.843 
We see that GCMDS improves over CCAE and GCMGMM in all of the metrics, with GCMGMM being comparable to CCAE. Interestingly, for the noisefree scenarios, the RANSAC method achieves a perfect score for all of the metrics. Since there is no noise on the observations and the method searches over all possible solutions of , it can find the correct solution for any configuration of geometric shapes in a scene. For the noisy scenarios, all the methods degrade as increases. However, the relative performance between them remains the same, with RANSAC performing the best, followed by GCMDS and then GCMGMM.
Figure 3 shows some reconstruction examples from CCAE and GCMDS for the noisefree scenario. In columns (a) and (b) we can see that CCAE recovers the correct parts assignments but the object reconstruction is inaccurate. In (a) one of the squares is reconstructed as a triangle, while in (b) the assignment between the reconstruction and the ground truth is not exact. For GCMDS, if the <parts are assigned to the ground truth properly, and there is no noise, then the reconstruction of the object is perfect. In column (c) all methods work well. In column (d), CCAE fits the square correctly (green), but adds an additional red point. In this case GCMDS actually overlays two squares on top of each other. Both methods fail badly on column (e). Note that CCAE is not guaranteed to reconstruct an existing object correctly (square or triangle in this case). In column (f) we can see that CCAE fits an irregular quadrilateral (blue) to the assigned points, while GCMDS obtains the correct fit. Additional examples for noisy cases with are shown in Appendix E.
To assess the effect of in the initial value of , we considered 6 values: 0.005, 0.01, 0.05, 0.1, 0.2, and 0.5. We found that GCMDS is always better than CCAE and GCMGMM. As the initial is increased, GCMDS performs better across all the metrics. We found that the performance of GCMDS and GCMGMM degrades with .
We conducted paired ttests between CCAE and GCMGMM, GCMDS and RANSAC on the three clustering metrics for
and initial . The differences between CCAE and GCMDS are statistically significant with pvalues less than , and between CCAE and RANSAC with pvalues less than . For CCAE and GCMGMM the differences are not statistically significant.9.2 Learning Constellations
Making use of the interpretable nature of the GCM, we consider learning each template (triangle or square) individually from noisy data, with noise values of . (Note that the full dataset as generated contains 1/7 = 14% single triangles, and 28% single squares which can be selected simply based on the number of points.)
To compare the learned template to the groundtruth (GT), we have to bear in mind that the learned template could be a rotated version, as this will still be centered and have the correct scale. To remove this rotational degree of freedom, we compute the Procrustes rotation (see, e.g.,
Mardia, Kent, and Bibby 1979, sec. 14.7) that best aligns them. After this, we can compute the standardized mean squared error (SMSE) .We use examples to train each template, and use one random sample to seed the initial template (after translation, scaling and rotation). (In fact, for noisefree constellation data (), a single observed triangle or square configuration serves as a perfect template.) As is 3 or 4 in this case, we can use exact inference over permutations, rather than the variational approach. We initialized and increased it after each iteration by a factor of two, similarly to the annealing strategy utilized in Algorithm 2. The stopping criterion for the learning process was that the SMSE between two consecutive updates of the learned template should not be greater than .
For the triangle object, the SMSE of the learned template was at , and at . For the square object, the corresponding SMSEs were and respectively. Figure 4 shows example learned templates (after Procrustes rotation) compared to the GT templates. We see that with only examples, an accurate template can be learned for each object. As expected, the error increases with increasing . Note that gives quite noticeably distorted examples, see e.g. the examples in Fig. 6 (top row). In contrast, for the CCAE, the encoder network in the autoencoder is not able to benefit from curriculum learning, and must tackle the full problem for the start; Kosiorek et al. (2019) used 300k batches of 128 scenes to train this model.
square,  triangle, 
9.3 Face Experiments
Ground Truth  VI  RANSAC 

(a)  
(b)  
(c)  
(d)  
(e) 
Firstly, the VI algorithm was evaluated on scenes of multiple, randomly selected and transformed faces.^{7}^{7}7Code at: https://github.com/tsagkas/capsules For scenes with 2, 3, 4 and 5 faces, the assignment accuracy was 100%, 100%, 99.2% and 93.7% respectively (based on 250 scenes per experiment). RANSAC gave 100% accurate assignments in all four cases. This is to be expected, since from each part the pose of the whole can be predicted accurately. However, RANSAC’s ability to infer the appearance of the faces proved to be limited. More specifically, in 250 instances uniformly distributed across scenes of 2, 3, 4 and 5 faces, the VI algorithm had RMSE of , while RANSAC scored , with consistently higher error on all scenes. This is illustrated in the examples of Figure 5, where it is clear that RANSAC is less accurate in capturing key facial characteristics. If inference for is run as a postprocessing step for RANSAC using all detected parts in an object, this difference disappears.
The supplementary material contains a movie showing the fitting of the models to the data. It is not possible for us to make a fair comparison with the SCAE algorithm on the faces data, as the PCAE model used is not rich enough to model PCA subspaces.
Secondly, we evaluated the ability of our algorithm to perform inference in scenes where some parts have been occluded, either by overlapping with other faces or by extending outside of the scene. In 250 scenes with 3 partially occluded faces, both the VI and RANSAC algorithms were fully successful in assigning the observed parts to the corresponding template accurately; see Figure 5(e) for an example.
10 Discussion
In our experiments RANSAC was shown to often be an effective alternative to variational inference (VI). This is particularly the case when the basis in RANSAC is highly informative about the object. For the constellations experiments this was the case, even when the datapoints were corrupted by noise. However, as we saw in sec. 9.3 for the faces data, an individual part was less informative about the appearance than the geometry, and so led to worse reconstructions unless a postprocessing step using all of the detected parts was used. Also, RANSAC’s samplingbased inference may be less amenable to neuralnetwork style parallel implementations than VI.
Above we have described a generative capsules model. The key features of this approach are:

The model is interpretable, and thus admits the use of prior knowledge, e.g. if we already know some things about an object. The formulation is also composable in that models for individual objects can be learned separately, then combined together at inference time.

The variational inference algorithm is obtained directly from a generative model for the observations . In contrast other leading formulations set up an objective to produce clusters in space.

The interpretable structure of the GCM allows other inference methods to be used, as demonstrated by our use of RANSAC.

The GCM conforms to the view, as promoted in Kosiorek et al. (2019), that the input is regarded as a set of parts. This formulation ensures that if the parts can be detected equivariantly, then the inferences for the objects will also be equivariant. This was demonstrated in the constellations and faces experiments.
As noted above, for the GCM to be equivariant to large transformations of the input, the parts need to to be detected equivariantly. Some capsules papers have used the affNIST dataset^{8}^{8}8https://www.cs.toronto.edu/~tijmen/affNIST/, but this only used small rotations of up to . Hinton et al. (2018, sec. 5.1) did investigate the use of very different viewpoints on the smallNORB dataset; while their capsules results in Table 2 did outperform a competitor CNN, it is noticeable that there is still a performance gap between novel and familiar viewpoints. We have demonstrated (see Fig. 2) that the PCAE decomposition is not equivariant to large rotations, and similar observations have been made by Smith et al. (2021) for their model. Thus we believe that further work on the equivariant extraction of parts is necessary in order to achieve equivariant object recognition.
Acknowledgements
This work was supported in part by The Alan Turing Institute under EPSRC grant EP/N510129/1.
Appendix A Details for Variational Inference
The evidence lower bound (ELBO) for this model is decomposed in three terms:
(22) 
where
is the KullbackLeibler divergence between distributions
and . The first term indicates how well the generative model fits the observations under our variational model :(23) 
The KullbackLeibler divergence between the two Gaussian distributions and in our model has the following expression:
(24) 
where is the dimensionality of .
The expression for is given by
(25) 
Appendix B Learning for the Constellations Model
We specialize the ELBO given in Appendix A for one constellation, taking (and hence dropping the index ). As there are no appearance features, we drop the superscript on and . Also the mean is only needed for the appearance features and thus can be omitted. is specialized to , we set in this appendix and allow to vary, and we assume . Hence we obtain
(26) 
where , with and specialized from eqs. 19 and 20 as
(27) 
Our goal in learning is to adapt the template parameters so as to increase the variational log likelihood (ELBO) . Note that the terms and do not depend explicitly on , and hence any derivative of these KL terms wrt will be zero. Thus these terms can be omitted when optimizing the ELBO wrt .
The trace term in eq. 26 can be simplified using , to give
(28) 
It turns out that it is more convenient to write in terms of and the mean transformation parameters , where is the posterior mean of , and is the same for . Hence we have that
(29) 
Hence , where , and the quadratic form can be rewritten as .