Set-Structured Latent Representations

by Qian Huang, et al.

Unstructured data often has latent component structure, such as the objects in an image of a scene. In these situations, the relevant latent structure is an unordered collection or set. However, learning such representations directly from data is difficult due to the discrete and unordered structure. Here, we develop a framework for differentiable learning of set-structured latent representations. We show how to use this framework to naturally decompose data such as images into sets of interpretable and meaningful components and demonstrate how existing techniques cannot properly disentangle relevant structure. We also show how to extend our methodology to downstream tasks such as set matching, which uses set-specific operations. Our code is available at








Code Repositories


[NeurIPS 2020] Better Set Representations For Relational Reasoning (


1 Introduction

Modern deep learning models perform many tasks well, from speech recognition to object detection. However, despite their many successes, a criticism of deep learning is its limitation to low-level tasks as opposed to more sophisticated reasoning. This gap has drawn analogies to the difference between so-called “System 1” (i.e., low-level perception and intuitive knowledge) and “System 2” (i.e., reasoning, planning, and imagination) from cognitive psychology (Kahneman, 2011). Proposals for moving towards System 2 reasoning in learning systems involve creating new abilities for composition, combinatorial generalization, and disentanglement (Battaglia et al., 2018; van Steenkiste et al., 2019; Marra et al., 2019).

One approach for augmenting neural networks with these capabilities is the use of discrete structured representations, such as graphs or sets. In particular, this approach has been used to great effect in the computer vision community, for tasks such as visual question answering, image captioning, and video understanding (Santoro et al., 2017; Yang et al., 2019; Hussein et al., 2019). For example, in systems for visual question answering, a standard component involves identifying the set of objects in the scene (Desta et al., 2018). Similarly, a system for image captioning can benefit from incorporating a scene-graph representation (Yao et al., 2018).

Existing frameworks for operating with set-like structure revolve around two types of problems. In the first, a set object is the input and structure is placed on the model, such as permutation invariance with respect to the input (Qi et al., 2016; Zaheer et al., 2017; Lee et al., 2018). The second is producing a set-like structure as output; this is standard in object detection or neural set generation (Zhang et al., 2019; Vinyals et al., 2015; Rezatofighi et al., 2018).

In many cases, though, we expect that unstructured data such as images have latent discrete structure that we would like to learn “in the middle,” i.e., we would like to automatically learn a meaningful discrete representation (such as a set of vectors) that can be incorporated in a machine learning pipeline for some task. In this sense, the set structure is treated as latent information rather than explicitly imposed as an input or an output. However, constructing such a representation is a major challenge. Existing approaches do not respect discrete invariances (Santoro et al., 2017) or resort to generating discrete structure in a non-differentiable manner with separately trained submodules (Yao et al., 2018).

Part of the challenge is that meaningfully generating a set in a differentiable manner is difficult. For example, a standard approach is to produce a collection of vectors through separate “channels” (e.g., several multi-layer perceptrons) and then use the elements of the collection in an arbitrary order (Stewart et al., 2015; Rezatofighi et al., 2018). While this heuristic can be easy to implement, omitting an explicit permutation invariance structure leads to issues of discontinuity in the system (Zhang et al., 2020).

Here, we propose a framework for learning a set-structured latent representation (SSLR). Our main idea is to take unstructured data as input, enforce set structure as an intermediate representation, and use the set to make a prediction (Fig. 1). By making this entire pipeline differentiable, the intermediate step encourages automatic learning of a meaningful representation of set structure to capture from the unstructured input. As one example, we will consider a case where the input and prediction are the same image. This is a reconstruction task on which we enforce a set structure, and we show how our approach effectively disentangles the objects.

Our SSLR framework is general in the sense that it allows for different implementations of the set structure component, primarily relying only on a set generator function that produces a set from a data point. However, we demonstrate that the choice of the generator is extremely important for learning high-quality latent set representations as well as making good predictions. As a first pass, we consider using multiple MLP channels as described above as the generator and verify known issues with this approach within our framework. We then develop a novel Set Refiner Network (SRN) as a set generator function, which derives better sets and predictions by more explicitly enforcing set structure. The architecture for our SRN draws on the recently developed Deep Set Prediction Networks (Zhang et al., 2019), which outperform the multiple-MLP approach on producing set outputs.

We demonstrate our SSLR framework on a number of tasks on synthetic and real-world data. First, to cleanly illustrate our ideas, we generate synthetic images with multiple simple objects and use an image reconstruction task to learn latent sets. We find that by using the SRN, the learned latent set structure disentangles the objects and the learned set representations themselves have meaningful and interpretable structure in terms of the position and geometry of the objects. In contrast, the MLP approach produces set representations that suffer from discontinuity issues that are immediate when interpolating between two images, even if the reconstruction error is small. Essentially, the MLP approach cannot disentangle the objects in the set representation. We perform similar experiments on the CLEVR dataset with similar results.

Finally, we also demonstrate that our framework is particularly well-suited for downstream tasks that involve making predictions relevant to the set structure. More specifically, we consider a task of set matching: given two images with the same objects but in different places, determine the distances between the objects. Again, our methodology outperforms standard approaches.

2 Related Work

2.1 Unsupervised Object Detection and Segmentation

One line of work that has focused on a set representation is unsupervised object detection and segmentation (Greff et al., 2019; Lin et al., 2020). These methods demonstrate impressive results in decomposing scenes without explicit supervision. However, they require techniques specific to object detection and segmentation, such as bounding boxes and highly specific architectures. Greff et al. (2019) in particular requires an auto-encoder setup, which makes it unsuitable for learning of latent sets in general supervised learning tasks. Instead, we propose a general set extraction framework that naturally imposes a set-structured latent representation.

2.2 Set Encoders and Set Decoders

Both set encoders and set decoders have been well-studied (Zaheer et al., 2017; Zhang et al., 2019, 2020; Lee et al., 2018). Although they can be applied within our framework, our work primarily differs from previous set learning work in that we are focused on set structure of the latent space—set encoders focus on sets as input, while set decoders focus on sets as output. This means that we do not need any set-structured data to be able to use our latent set representations.

2.3 Incorporation of Discrete Structure

As previously mentioned, many approaches have incorporated discrete structures to obtain better performance on a given task (Yang et al., 2019; Chen et al., 2019). Some ignore the permutation-invariant nature of graphs and sets (Santoro et al., 2017), while others have non-differentiable methods (Yao et al., 2018). Our methodology addresses both of these issues.

3 Set-Structured Latent Representation Framework

In this section, we develop our set-structured latent representation (SSLR) framework. The framework is intentionally general, and the key generic components are a set generator function that can transform inputs into a set of vectors and a set function that can make a prediction from this set of vectors. We explore various ways of implementing the set generator using existing techniques and a novel Set Refiner Network (SRN). The set function that makes a prediction will be task-specific, and we will discuss the details later in our experiments.

It is worth reflecting on why we might desire a set-structured latent space in the first place. One immediate answer is that the underlying space factorizes into a set, such as in scenes with multiple objects. By imposing a set-structured latent space, we hope that the model is encouraged to take advantage of this structure. In some sense, the model must take advantage of this structure in order to maximize performance. Another reason to use a set-structured latent space is if one wants to perform set-specific operations; we explore this in set matching tasks in our experiments. Finally, since we force the dimension of each set element to be far lower than the total dimension of the set latent space (the product of the number of set elements and the dimension of each set element), we expect that the learned model will effectively decompose information across multiple set elements in order to achieve good performance.

3.1 Basic Model

Our general model is straightforward: we start with some unstructured data point (e.g., an image), then produce a set of vectors, and finally transform the set of vectors to make some prediction. Formally, we write this as follows:

$$S = G(x), \tag{1}$$
$$\hat{y} = F(S), \tag{2}$$

where $G$ is the set generator function, $S$ is the set of vectors, and $F$ is a function that maps that set to the prediction $\hat{y}$. Throughout the paper, we will focus on the case when $x$ is an image and assume that $F$ is a differentiable task-specific set function. The set generator $G$ will make use of neural networks designed to process the input data, such as convolutional neural networks for images; for example, we could use multiple networks in parallel to construct a latent set of vectors (indeed, this will be one approach we consider).
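As a minimal sketch of this two-stage pipeline, the following Python code composes a set generator with a permutation-invariant set function. The names `set_generator` and `set_function`, the random linear maps, and all sizes are illustrative assumptions rather than the architecture used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
N_ELEMENTS, ELEM_DIM, INPUT_DIM = 4, 3, 12

# Hypothetical parameters: one linear map per set element (generator)
# and one linear readout applied after pooling (set function).
W_gen = rng.normal(size=(N_ELEMENTS, ELEM_DIM, INPUT_DIM))
w_out = rng.normal(size=ELEM_DIM)


def set_generator(x):
    """Map a flat input vector to N_ELEMENTS vectors (one per row)."""
    return np.tanh(W_gen @ x)  # shape (N_ELEMENTS, ELEM_DIM)


def set_function(S):
    """Sum-pool the elements, then apply a linear readout.

    Sum pooling makes the output independent of the row order of S.
    """
    return float(w_out @ S.sum(axis=0))


x = rng.normal(size=INPUT_DIM)   # stand-in for an unstructured input
S = set_generator(x)             # latent set of vectors
y_hat = set_function(S)          # prediction made from the set
```

Because the readout only sees the pooled sum, reordering the rows of `S` leaves `y_hat` unchanged up to floating-point rounding, which is the invariance a set function is expected to have.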

However, the main challenge is designing $G$ to produce a set of vectors in a way that respects set invariants in a meaningful and differentiable manner. Given a dataset with data points $x_i$ and outputs $y_i$, we can learn the SSLR by optimizing some loss $\sum_i \ell(F(G(x_i)), y_i)$. In principle, if the prediction that we want to make can be effectively made from a set of vectors representing the unstructured data point, then a good choice of $G$ will encourage learning that set-structured latent representation. Again, it is useful to have images in mind. If $x$ is a scene and $\hat{y}$ is a prediction about the number of certain types of objects in the scene, then we would expect that an intermediate set representation wherein each element represents a distinct object would make it easier for $F$ to be a good predictor, as in Fig. 1.

3.2 Choice of set generator

In principle, the set generator can be any function that maps a data point to a set of vectors. We describe below the two types of generators that we consider.

Multilayer Perceptrons

Our first implementation of $G$ is to use a collection of Multilayer Perceptrons (MLPs). Given a target set size $n$, we can use $n$ MLPs to produce a collection of $n$ vectors given by the final layer of each MLP. At this point, these vectors can be considered a set $S$ and used for prediction by $F$. This type of approach has previously been used for making predictions where the output is a set (Rezatofighi et al., 2018; Stewart et al., 2015).

Although $F$ can process the vectors as a set, this approach is unsatisfying since no real set structure is imposed on $S$. In some sense, the outputs of the MLPs are just a list (as opposed to a set), and one hopes that using a set function to make predictions somehow enforces that the “list” has “set structure.”

This conceptual issue was recently formalized by Zhang et al. (2020) through the so-called “responsibility problem.” Each of the MLPs is “responsible” for producing one of the underlying set elements, but this responsibility changes discontinuously. In the context of images, when slowly swapping the position of two otherwise equal objects, there must be a point where the first MLP has to suddenly be responsible for predicting the other object and vice-versa. This can be seen in Fig. 5(b), where the MLP must slowly pass responsibility for the circle between two different set elements. Because MLPs are not well-suited for modeling these jump discontinuities, even exceedingly simple sets are often predicted incorrectly by the MLPs in set prediction tasks. We next describe a novel approach that can discontinuously change responsibility (Fig. 5(c)).
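The responsibility problem can be illustrated with a small numerical sketch. Two identical objects move smoothly around a circle until they have swapped places, so the set at the end equals the set at the start. The code encodes each set as a list using a lexicographic sort, which is only one example of a fixed slot assignment chosen here for illustration, and measures the largest change in any slot between consecutive time steps.

```python
import math


def two_object_set(t):
    """Two identical objects on opposite sides of the unit circle at angle t."""
    p = (math.cos(t), math.sin(t))
    q = (-p[0], -p[1])
    return {p, q}


def as_list(objects):
    """One possible list encoding of a set: sort elements lexicographically."""
    return sorted(objects)


def slot_jump(list_a, list_b):
    """Largest Euclidean distance between same-index slots of two lists."""
    return max(math.dist(a, b) for a, b in zip(list_a, list_b))


steps = 1000
ts = [math.pi * k / steps for k in range(steps + 1)]
lists = [as_list(two_object_set(t)) for t in ts]

# Largest slot change between consecutive (very close) time steps.
largest_jump = max(slot_jump(a, b) for a, b in zip(lists, lists[1:]))

# Start and end describe the same set, so their encodings agree.
start_end_gap = slot_jump(lists[0], lists[-1])
```

Although the objects move by a tiny amount per step, `largest_jump` is close to the full distance between the two objects, because at some point each slot has to switch to the other object. This is a toy demonstration of the phenomenon under the stated encoding, not a proof of the general result.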

(a) Original Image
(b) SSLR-MLP Interpolation
(c) SSLR-SRN Interpolation
Figure 5: We shift the input green circle from left to right by 10 pixels and visualize our models’ reconstructions. The colors correspond to the set element that is responsible for generating that part of the circle. The “responsibility” changes continuously for SSLR-MLP, which yields a poor representation during the transition. In contrast, SSLR-SRN is able to hand off “responsibility” discontinuously. A more detailed discussion appears in the Representation Interpolation part of Section 4.1.

Set Refiner Network

Zhang et al. (2019) proposed Deep Set Prediction Networks (DSPNs) to circumvent the responsibility problem for predicting output sets from vectors. The core idea is a meta-learning procedure. Given a vector representation of some data, the DSPN uses an inner optimization loop to search for a set of vectors that, when passed through a set encoder, is close to the original vector representation. By learning that set encoder, a mapping between the input vector and the output set is established. However, their approach needs ground truth sets for supervision.

Our goals are slightly different: we end up computing $F(S)$ to make a prediction rather than predicting $S$ itself, and we have no ground-truth sets for supervision. However, we can still use a similar architecture to take a data point $x$ to a set $S$ and then learn with a loss involving $F(S)$. For our tasks, we also noticed that performance hinged on finding a good initial guess for the inner optimization loop, rather than the shared initialization used in DSPN. Thus, we treat initialization as a function to be learned and refine the initial guess with a search procedure.

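We call this approach a Set Refiner Network (SRN). Putting everything together, the set generator $G$ is defined as follows:

$$z = H_{\text{embed}}(x), \tag{3}$$
$$S_0 = H_{\text{init}}(x), \tag{4}$$
$$G(x) = \operatorname{GradientDescent}_{S}\bigl(\lVert H_{\text{set}}(S) - z \rVert^2,\; S_0,\; T\bigr), \tag{5}$$

where $H_{\text{embed}}$ and $H_{\text{init}}$ are parameterized by neural networks designed to process $x$ (e.g., CNNs for images), $z$ is a vector, and $H_{\text{set}}$ is a permutation-invariant set function. The $\operatorname{GradientDescent}$ term in Eq. 5 is the meta-learning component and represents running gradient descent over the optimization variable $S$ for the loss $\lVert H_{\text{set}}(S) - z \rVert^2$ starting from the initial point $S_0$ for $T$ iterations. In our implementation, we also use a momentum term in the descent procedure.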

To summarize, SRN learns (i) a function $H_{\text{embed}}$ for a vector embedding $z$ of a data point $x$, (ii) a function $H_{\text{init}}$ for an initial guess $S_0$ of the latent set, and (iii) a set function $H_{\text{set}}$ that encourages refinement of the initial guess through an inner optimization loop. The functions are learned through supervision on the prediction $\hat{y}$. We show in our experiments that this approach avoids discontinuities in the set embedding that are prevalent with MLP approaches.
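The inner refinement loop can be sketched in a few lines of NumPy. This is a toy version under stated assumptions: the set encoder is a single tanh layer followed by sum pooling (not the MLP with FSPool used in the experiments), the inner loss is taken to be a squared Euclidean distance, the default number of steps is hypothetical, and the outer training that backpropagates through the loop is omitted. The default step size and momentum follow the values reported in Section 4.1.

```python
import numpy as np


def set_encoder(S, W):
    """Permutation-invariant encoder: per-element tanh layer, then sum pooling."""
    return np.tanh(S @ W).sum(axis=0)


def inner_loss_and_grad(S, z, W):
    """Squared distance between the encoded set and z, and its gradient in S."""
    A = np.tanh(S @ W)                             # (n, d_z)
    residual = A.sum(axis=0) - z                   # (d_z,)
    loss = float(residual @ residual)
    grad = (2.0 * residual * (1.0 - A ** 2)) @ W.T  # (n, d_elem)
    return loss, grad


def refine_set(S0, z, W, steps=5, step_size=0.1, momentum=0.5):
    """Gradient descent with momentum over S, starting from the guess S0."""
    S = S0.copy()
    velocity = np.zeros_like(S)
    for _ in range(steps):
        _, grad = inner_loss_and_grad(S, z, W)
        velocity = momentum * velocity - step_size * grad
        S = S + velocity
    return S


rng = np.random.default_rng(0)
W = 0.5 * rng.normal(size=(2, 4))   # hypothetical encoder weights
z = rng.normal(size=4)              # stand-in for the embedding of the input
S0 = rng.normal(size=(3, 2))        # stand-in for the learned initial guess

# Conservative settings so the loss decreases monotonically on this toy problem.
S = refine_set(S0, z, W, steps=20, step_size=0.01, momentum=0.0)
loss_before, _ = inner_loss_and_grad(S0, z, W)
loss_after, _ = inner_loss_and_grad(S, z, W)
```

In a full model, the initial guess and the encoder weights would be produced by trained networks, and the refined set would be passed on to the task-specific set function.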

4 Experiments

In this section, we conduct two different types of experiments. First, we consider an image reconstruction task, where the SSLR is expected to contain as much information about the original image as possible. We find on multiple datasets that using SRN as the set generator provides meaningful set representations that disentangle objects, whereas MLP-based approaches learn sets that entangle objects. Second, we demonstrate that our framework can be used to improve performance on a set matching task. Throughout, we refer to SSLR implemented with SRN as SSLR-SRN and to SSLR implemented with MLPs as SSLR-MLP. For each task, SSLR-SRN is implemented as SSLR-MLP with only the additional refinement step in Eq. 5; that is, SSLR-MLP uses the initial set directly as its output.

4.1 Circles Dataset Reconstruction


We first experiment on a synthetic “Circles Dataset.” Each image is pixels with RGB channels in the range 0 to 1 (see Fig. 11(a) for an example). An image contains 0 to 10 circles, each either green or red, with an integer radius of 1 to 11 pixels, with each of these components sampled uniformly at random. Each circle is fully contained in the image, with no overlap between circles of the same color. A red circle and a green circle may overlap, which produces a yellow color. In total, we use 64,000 images for training and 4,000 images for testing.
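A generator for images of this kind might look like the sketch below. The image side length `size=64`, the rejection-sampling scheme, and the retry limit `max_tries` are assumptions made for illustration; only the circle count, colors, radius range, containment, and same-color non-overlap rules come from the description above.

```python
import numpy as np


def sample_circles_image(rng, size=64, max_circles=10, max_tries=1000):
    """Draw one image of red/green circles; returns (image, circle list).

    If max_tries is exhausted, the image may hold fewer circles than drawn.
    """
    n_target = int(rng.integers(0, max_circles + 1))   # 0..10 circles
    circles = []  # entries: (cx, cy, radius, channel)
    for _ in range(max_tries):
        if len(circles) == n_target:
            break
        r = int(rng.integers(1, 12))                   # integer radius 1..11
        cx, cy = rng.integers(r, size - r, size=2)     # keep circle inside image
        ch = int(rng.integers(0, 2))                   # 0 = red, 1 = green
        # Reject candidates that touch an existing circle of the same color.
        clash = any(
            ch == c and (cx - x) ** 2 + (cy - y) ** 2 <= (r + q) ** 2
            for x, y, q, c in circles
        )
        if not clash:
            circles.append((int(cx), int(cy), r, ch))

    image = np.zeros((size, size, 3), dtype=np.float32)
    yy, xx = np.mgrid[:size, :size]
    for cx, cy, r, ch in circles:
        # Red and green may overlap, which gives yellow (both channels set).
        image[(xx - cx) ** 2 + (yy - cy) ** 2 <= r ** 2, ch] = 1.0
    return image, circles


rng = np.random.default_rng(0)
image, circles = sample_circles_image(rng)
```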

(a) Ground Truth Image
(b) SSLR-SRN Reconstruction (c) SSLR-SRN Decomposition (d) SSLR-MLP Reconstruction (e) SSLR-MLP Decomposition
Figure 11: Image reconstruction on the synthetic Circles Dataset. (a) Sample image from the Circles dataset. We use SSLR-SRN (b) and SSLR-MLP (d) to reconstruct the ground truth image. We also show the set of decoded images after applying attention and right before final summation for SSLR-SRN (c) and SSLR-MLP (e). SSLR-SRN naturally decomposes the image to the set of complete circles, while SSLR-MLP splits the circles.


To reconstruct the image, we first encode the input image to a global embedding vector of 100 dimensions using a standard CNN (this is the embedding function $H_{\text{embed}}$ in Eq. 3; see Appendix A for details). The initialization function $H_{\text{init}}$ (Eq. 4) shares the first several convolutional layers of $H_{\text{embed}}$, up to the final 512-channel feature map. We then split these channels into 32 groups and project each channel group to 16 dimensions using a shared fully connected layer. The set function $H_{\text{set}}$ (Eq. 5) processes each element individually with a 3-layer MLP, followed by FSPool (Zhang et al., 2020) as a pooling function. The inner optimization loop uses gradient descent with step size 0.1 and momentum 0.5 for a fixed number of iterations.

After obtaining the latent set, the prediction function $F$ decodes each element to an image independently through shared transposed-convolution layers. Finally, we weight the generated images by their softmax scores (to ensure that their sum lies in the range $[0, 1]$) and sum the result to obtain the final prediction.

We train the model with a least-squares reconstruction loss using the Adam optimizer with learning rate 3e-4.
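One way to realize the softmax-weighted sum is sketched below. It assumes that each set element is decoded into an image with values in $[0, 1]$ together with a per-pixel score map, and that the softmax is taken across set elements at each pixel; the exact form of the scores is not recoverable from the text, so this layout is hypothetical.

```python
import numpy as np


def combine_decoded(images, scores):
    """Blend per-element images with a softmax over the element axis.

    images: (n, H, W, C) array with values in [0, 1]
    scores: (n, H, W, 1) array of unnormalized scores (assumed layout)
    """
    shifted = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)         # sums to 1 per pixel
    return (weights * images).sum(axis=0)


rng = np.random.default_rng(0)
images = rng.uniform(size=(3, 8, 8, 3))   # stand-in decoded images
scores = rng.normal(size=(3, 8, 8, 1))    # stand-in score maps
y_hat = combine_decoded(images, scores)
```

Since the weights at each pixel are non-negative and sum to one, the result is a convex combination of the decoded images and therefore stays within the same range as the inputs.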

Decomposition Result

Figure 11 shows an example reconstruction and images decoded from the latent set elements. Although we only provide supervision on the entire image’s reconstruction, SSLR-SRN naturally disentangles most images into a set of the individual circles. We compare this to SSLR-MLP, which cannot disentangle the circles (Fig. 11(e)). Although SSLR-MLP does not succeed in disentangling the image into its component circles, it still has a set-structured latent space and achieves some amount of decomposition.

Figure 12: Interpolation between two latent set elements: one that decodes to an empty image and one that decodes to a single green circle. The first row is generated by SSLR-MLP, and the second row is generated by SSLR-SRN. The decoding of an interpolation of the MLP-based embeddings introduces multiple circles and circles of different colors. On the other hand, decoding an interpolation of the SRN-based embeddings smoothly moves from the empty image to the full green circle by interpolating the circle’s size.

% of images completely disentangled

| # of circles | SSLR-MLP | SSLR-SRN |
|---|---|---|
| 1 | 45.1% | 95.8% |
| 2 | 16.5% | 88.7% |
| 3 | 7.8% | 79.9% |
| 4 | 3.2% | 72.7% |
| 5 | 1.9% | 65.2% |

Table 1: Success rate of complete disentanglement in synthetic data. Using our SRN to implement the set generator has much higher rates of disentanglement compared to the MLP approach.

To explore this disentanglement further, we first create 1000 images with $k$ circles for each $k \in \{1, \dots, 5\}$. We then see how often SSLR-SRN and SSLR-MLP successfully disentangle the circles by measuring the number of non-empty set elements. SSLR-SRN has much higher rates of disentanglement (Table 1). With just one circle, SSLR-SRN succeeds more than 95% of the time, while SSLR-MLP succeeds less than half of the time. With five circles, SSLR-SRN still succeeds 65% of the time, while SSLR-MLP succeeds less than 2% of the time.
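A check of this kind could be sketched as follows. The criterion used here (a decoded element counts as non-empty when its brightest pixel exceeds a threshold, and an image counts as disentangled when the number of non-empty elements equals the number of circles) and the threshold value are assumptions, since the text does not spell out the exact rule.

```python
import numpy as np


def count_nonempty(element_images, threshold=0.1):
    """Count decoded set elements whose brightest pixel exceeds the threshold."""
    peak = element_images.reshape(len(element_images), -1).max(axis=1)
    return int((peak > threshold).sum())


def is_disentangled(element_images, n_circles, threshold=0.1):
    """Assumed criterion: exactly one non-empty element per circle."""
    return count_nonempty(element_images, threshold) == n_circles


# Toy example: four decoded elements, only one of which draws anything.
element_images = np.zeros((4, 8, 8, 3))
element_images[1, 2:5, 2:5, 1] = 1.0   # a green patch in element 1
n_nonempty = count_nonempty(element_images)
```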

(a) X coordinate
(b) Y coordinate
(c) Color
Figure 16: Two-dimensional t-SNE plots of latent set elements. Subfigures (a), (b) and (c) are color coded by X coordinate, Y coordinate, and color value, respectively. The first row is generated by SSLR-MLP, and the second row is generated by SSLR-SRN. The representations generated by MLPs are separated into discontinuous clusters, and only within each cluster are the representations consistent. On the other hand, SRN is able to generate a much more globally consistent representation.
(a) X coordinate
(b) Y coordinate
(c) Radius
(d) Color
Figure 21: Three-dimensional t-SNE embeddings of set elements generated by our SSLR-SRN method, colored by various parameters of the circles: X coordinate (a), Y coordinate (b), radius (c), and color (d). The representations have clear continuous structure.

Representation Interpolation

We next consider an interpolation experiment to demonstrate the differences between SRN and MLP. First, we perform an interpolation experiment in the latent space. We choose a latent set element that decodes to a green circle and a latent set element that decodes to an empty image, and we interpolate between the two. As we see in Fig. 12, the interpolation for SSLR-MLP does not look like a circle for most of the interpolation and even worse, has red artifacts. Meanwhile, the interpolation for SSLR-SRN smoothly grows from empty space to the fully sized circle. In other words, interpolation between an empty image and a circle in the SRN latent space corresponds to size interpolation.

Next, we interpolate in image space by gradually moving the location of a circle to the right. Due to the responsibility problem, we know that SSLR-MLP cannot generate a completely unordered set.

In Figure 5, we visualize what part of the image each set element is responsible for by associating each set element with a distinct color. If the circle is one color, that means that one set element is responsible for generating the whole circle. If the circle is split into multiple colors, then that means that set elements are splitting the responsibility for generating the circle.

With SSLR-MLP, the responsibility for the circle gradually shifts from one set element to another. This demonstrates that the set generated by SSLR-MLP corresponds to locations, as opposed to discrete circles. This is due to the fact that MLPs have difficulty expressing the discontinuity required to represent a proper set generation function.

With our Set Refiner Network, SSLR-SRN overcomes the responsibility problem. As we interpolate the circle from left to right, a single set element is always responsible for generating the entire circle. In particular, even though we move the circle by a small amount, the responsibility shifts completely from one set element to another. This discontinuity is what allows SSLR-SRN to avoid the responsibility problem and is key to generating a good set-structured latent representation. As a result, SSLR-SRN successfully encodes the circle in one set element throughout the entire interpolation.

t-SNE Visualization

We now visualize how our set-structured latent sets capture meaningful structure. To this end, we grid sample images with one red or green circle at different locations and plot the latent space of the set elements corresponding to the circle using a two-dimensional t-SNE embedding with default scikit-learn settings (Fig. 16). Consistent with the previous interpolation experiments, we see that the set elements generated by SSLR-MLP are in discontinuous clusters. On the other hand, SSLR-SRN shows the desired grid structure, with the two clear factors of variation—x and y coordinates. We also visualize coordinates, radius and color with a three-dimensional t-SNE embedding (Fig. 21), which shows that our latent representation is both globally and locally consistent, as well as highly meaningful. On the other hand, the latent space of SSLR-MLP is only consistent within each cluster. This is unsurprising in light of the responsibility problem.

Recall that the model was never explicitly encouraged to disentangle the objects, yet SSLR-SRN can learn the correct underlying structure, effectively reducing the task of modeling the complicated scene into modeling the individual components. Overall, this demonstrates that simply incorporating a set structure and generating the set properly can naturally decompose images into meaningful components. We now turn to a similar task on a more realistic dataset.

(a) Input Image
(b) SSLR-SRN Reconstruction
(c) SSLR-SRN Decomposition
(d) Ground Truth Image
(e) SSLR-MLP Reconstruction
(f) SSLR-MLP Decomposition
Figure 28: Example reconstruction results on the CLEVR dataset, analogous to the results on the synthetic circles dataset in Figure 11. Again, we see that our SRN decomposition creates disentangled latent set structure, whereas the green objects are entangled between multiple set elements with MLP.

4.2 CLEVR Object Reconstruction


We next use the CLEVR dataset (Johnson et al., 2016) to show that such object decomposition holds in more complicated settings. The dataset contains 70,000 training and 15,000 validation images. The original dataset does not contain images of the masked objects, but we generate these through publicly available code (see Appendix B for details).


We again encode images using a CNN that takes 128 × 128 images as input and has four convolutional layers to produce a 512-dimensional image embedding. We use the same general architecture for SSLR-MLP and SSLR-SRN as for the circles dataset, just with different numbers of hidden dimensions (architectural and experimental details are in Appendix B).


Figure 28 shows a representative example of the reconstructed images and set of decomposed images. It again highlights how the SRN approach can disentangle objects while the MLP approach cannot. We also measured the decomposition quality using intersection over union (IoU) score (Table 2). We use two metrics for evaluation: the standard overall intersection-over-union (IoU) and a per-object intersection-over-union. The overall IoU is calculated pixel-wise over all objects in the foreground in the reconstructed image. The per-object IoU is the IoU over the Chamfer matching between ground-truth bounding boxes and the bounding boxes of the predicted decomposition. In both cases, SSLR-SRN performs better than SSLR-MLP. The low per-object IoU for the SSLR-MLP is due to the poor object decomposition and its inability to handle occluded objects. In contrast, SSLR-SRN can decompose the scene with the superposition of full objects, including the occluded parts, providing a more meaningful disentangled representation.
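The two building blocks behind these metrics can be sketched as below: a pixel-wise IoU between foreground masks and an IoU between two axis-aligned bounding boxes. The matching step between ground-truth and predicted boxes is omitted, and the `(x0, y0, x1, y1)` box format and the empty-mask convention are assumptions made for this example.

```python
import numpy as np


def mask_iou(pred_mask, true_mask):
    """Pixel-wise IoU between two boolean foreground masks."""
    union = np.logical_or(pred_mask, true_mask).sum()
    if union == 0:
        return 1.0  # convention chosen here: two empty masks agree
    return float(np.logical_and(pred_mask, true_mask).sum() / union)


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Toy masks: two 2-row bands of a 4x4 image that share one row.
pred = np.zeros((4, 4), dtype=bool)
true = np.zeros((4, 4), dtype=bool)
pred[0:2, :] = True
true[1:3, :] = True
overall = mask_iou(pred, true)                      # 4 shared / 12 covered pixels
per_box = box_iou((0, 0, 2, 2), (1, 1, 3, 3))       # 1 shared / 7 covered units
```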

| Metric | SSLR-MLP | SSLR-SRN |
|---|---|---|
| Overall IoU | 0.9193 | 0.9345 |
| Per-object IoU | 0.7737 | 0.8305 |

Table 2: Intersection over Union (IoU) results for CLEVR image reconstruction. Our SRN approach outperforms the MLP baseline.

4.3 Set Matching Task

So far, we have demonstrated that set-structured latent representation can provide a meaningful latent representation that disentangles objects. We now demonstrate another advantage of such representations, namely the ability to perform set-specific operations. More specifically, we consider the following set matching task: given two images containing the same set of objects, predict the sum of squared distances between each pair of objects (Appendix C contains experimental details).


For simplicity, we reuse the Circles Dataset, restricting to images with only 2 circles, and create a complementary image with the same two circles in different locations.

| Model | Relative error (%) | L2 error |
|---|---|---|
| Naive CNNs | 42.64 ± 1.89 | 0.027 ± 2.58e-4 |
| SSLR-MLP | 37.20 ± 0.56 | 0.021 ± 7.60e-4 |
| SSLR-SRN | 27.58 ± 0.73 | 0.016 ± 2.16e-4 |
| SSLR-SRN (r + init) | 23.86 ± 0.23 | 0.014 ± 1.00e-3 |

Table 3: Set Matching Task Result. Means and standard deviations are reported over three experiments. We find that SSLR outperforms a basic CNN approach, SRN outperforms MLP, and including a reconstruction loss with initialization from the reconstruction task can boost performance of SSLR-SRN.


We use the same architecture as in the previous reconstruction tasks. In addition, we predict a location embedding and an attribute embedding from each set element individually and then use Hungarian matching to match the elements based on the attribute embeddings. The location embeddings are then permuted accordingly and reduced by the sum of element-wise squared differences. Finally, we minimize the sum of the Hungarian matching error and the prediction error, both in the L2 norm. We again compare SSLR-MLP with SSLR-SRN and also consider a simple CNN as a baseline, which directly computes the L2 difference between the embeddings output by the image encoder $H_{\text{embed}}$.
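The match-then-reduce step can be sketched as follows. For brevity, the sketch finds the best assignment by brute force over permutations, which is a stand-in for Hungarian matching and only practical for very small sets such as the two-circle images used here. The function names and the toy embeddings are hypothetical.

```python
from itertools import permutations

import numpy as np


def match_by_attributes(attr_a, attr_b):
    """Return the assignment of B's elements to A's with lowest attribute cost.

    Brute force over permutations; Hungarian matching would give the same
    assignment and scale to larger sets.
    """
    cost = ((attr_a[:, None, :] - attr_b[None, :, :]) ** 2).sum(axis=-1)
    n = len(attr_a)
    return min(
        permutations(range(n)),
        key=lambda p: sum(cost[i, p[i]] for i in range(n)),
    )


def matched_squared_distance(loc_a, attr_a, loc_b, attr_b):
    """Permute B's locations to match A, then sum squared differences."""
    perm = list(match_by_attributes(attr_a, attr_b))
    return float(((loc_a - loc_b[perm]) ** 2).sum())


# Toy example: the second image lists the same two objects in swapped order.
attr_a = np.array([[1.0, 0.0], [0.0, 1.0]])
attr_b = np.array([[0.0, 1.0], [1.0, 0.0]])
loc_a = np.array([[0.0, 0.0], [3.0, 4.0]])
loc_b = np.array([[6.0, 8.0], [1.0, 0.0]])
total = matched_squared_distance(loc_a, attr_a, loc_b, attr_b)
```

In this example the first object moves by distance 1 and the second by distance 5, so the matched sum of squared distances is 1 + 25 = 26.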

Table 3 shows the results and reveals several important findings. First, imposing set structure (with MLP or SRN) improves over the CNN baseline. Second, SRN again outperforms MLP. Finally, we also considered appending the reconstruction loss along with an initialization of parameters from the reconstruction experiment. This can provide substantial improvements for the SSLR-SRN model.

5 Conclusion

We have introduced a general framework to incorporate set-structured latent representation (SSLR) in neural networks, which uses a novel Set Refiner Network as a set generator function. Our approach leads to natural decompositions of images into objects. Moreover, each object embedding in the set is meaningful and consistent, which is not the case for naive implementations with MLPs. We also showed how imposing SSLR is useful for set-based tasks such as set matching. One natural focus for future work is incorporating new set generation procedures into our framework.


This research was supported by NSF Award DMS-1830274, ARO Award W911NF19-1-0057, and ARO MURI.


  • P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. F. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, Ç. Gülçehre, H. F. Song, A. J. Ballard, J. Gilmer, G. E. Dahl, A. Vaswani, K. R. Allen, C. Nash, V. Langston, C. Dyer, N. M. O. Heess, D. Wierstra, P. Kohli, M. M. Botvinick, O. Vinyals, Y. Li, and R. Pascanu (2018) Relational inductive biases, deep learning, and graph networks. ArXiv abs/1806.01261. Cited by: §1.
  • T. Chen, M. Xu, X. Hui, H. Wu, and L. Lin (2019) Learning semantic-specific graph representation for multi-label image recognition. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.3.
  • M. T. Desta, L. Chen, and T. Kornuta (2018) Object-based reasoning in vqa. 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1814–1823. Cited by: §1.
  • K. Greff, R. L. Kaufman, R. Kabra, N. Watters, C. Burgess, D. Zoran, L. Matthey, M. M. Botvinick, and A. Lerchner (2019) Multi-object representation learning with iterative variational inference. In ICML, Cited by: §2.1.
  • N. Hussein, E. Gavves, and A. W. M. Smeulders (2019) VideoGraph: recognizing minutes-long human activities in videos. ArXiv abs/1905.05143. Cited by: §1.
  • J. E. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. B. Girshick (2016) CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1988–1997. Cited by: §4.2.
  • D. Kahneman (2011) Thinking, fast and slow. Macmillan. Cited by: §1.
  • J. Lee, Y. Lee, J. Kim, A. R. Kosiorek, S. Choi, and Y. W. Teh (2018) Set transformer: a framework for attention-based permutation-invariant neural networks. In ICML, Cited by: §1, §2.2.
  • Z. Lin, Y. Wu, S. V. Peri, W. Sun, G. Singh, F. Deng, J. Jiang, and S. Ahn (2020) SPACE: unsupervised object-oriented scene representation via spatial attention and decomposition. ArXiv abs/2001.02407. Cited by: §2.1.
  • G. Marra, F. Giannini, M. Diligenti, and M. Gori (2019) Integrating learning and reasoning with deep logic models. arXiv preprint arXiv:1901.04195. Cited by: §1.
  • J. Masci, U. Meier, D. Cireşan, and J. Schmidhuber (2011) Stacked convolutional auto-encoders for hierarchical feature extraction. In International conference on artificial neural networks, pp. 52–59. Cited by: §A.1.
  • C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2016) PointNet: deep learning on point sets for 3d classification and segmentation. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 77–85. Cited by: §1.
  • A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR abs/1511.06434. Cited by: §A.1.
  • S. H. Rezatofighi, R. Kaskman, F. T. Motlagh, Q. Shi, D. Cremers, L. Leal-Taixé, and I. Reid (2018) Deep perm-set net: learn to predict sets with unknown permutation and cardinality using deep neural networks. ArXiv. Cited by: §1, §1, §3.2.
  • A. Santoro, D. Raposo, D. G. T. Barrett, M. Malinowski, R. Pascanu, P. W. Battaglia, and T. P. Lillicrap (2017) A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, Cited by: §1, §1, §2.3.
  • R. J. Stewart, M. Andriluka, and A. Y. Ng (2015) End-to-end people detection in crowded scenes. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2325–2333. Cited by: §1, §3.2.
  • S. van Steenkiste, F. Locatello, J. Schmidhuber, and O. Bachem (2019) Are disentangled representations helpful for abstract visual reasoning?. In Advances in Neural Information Processing Systems, pp. 14222–14235. Cited by: §1.
  • O. Vinyals, S. Bengio, and M. Kudlur (2015) Order Matters: Sequence to sequence for sets. In International Conference on Learning Representations (ICLR), External Links: 1511.06391 Cited by: §1.
  • X. Yang, K. Tang, H. Zhang, and J. Cai (2019) Auto-encoding scene graphs for image captioning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.3.
  • T. Yao, Y. Pan, Y. Li, and T. Mei (2018) Exploring visual relationship for image captioning. In ECCV, Cited by: §1, §1, §2.3.
  • M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Póczos, R. Salakhutdinov, and A. J. Smola (2017) Deep sets. In NeurIPS, Cited by: §1, §2.2.
  • Y. Zhang, J. S. Hare, and A. Prügel-Bennett (2019) Deep set prediction networks. In NeurIPS, Cited by: §1, §1, §2.2, §3.2.
  • Y. Zhang, J. S. Hare, and A. Prügel-Bennett (2020) FSPool: learning set representations with featurewise sort pooling. In ICLR (to appear), Cited by: §A.1, §1, §2.2, §3.2, §4.1.

Appendix A Circles Dataset Reconstruction Experiment

A.1 Architecture Details

Set Generator. For SSLR-MLP, the set generator is a standard image encoder derived from Stacked Convolutional Auto-Encoders (Masci et al., 2011) and DCGAN (Radford et al., 2015) with additional processing at the end:

  1. Conv2d layer: 3 input channels, 64 filters, kernel size 4, stride 2, padding 1, no bias, ReLU activation.

  2. Conv2d layer: 64 input channels, 128 filters, kernel size 4, stride 2, padding 1, no bias, batch normalized, ReLU activation.

  3. Conv2d layer: 128 input channels, 256 filters, kernel size 4, stride 2, padding 1, no bias, batch normalized, ReLU activation.

  4. Conv2d layer: 256 input channels, 512 filters, kernel size 4, stride 4, padding 1, no bias, batch normalized, ReLU activation.

  5. Reshape to (set size, 2048 / set size) and transpose dimensions 1 and 2.

  6. Conv1d layer: (2048 / set size) input channels, number of filters equal to the element dimension, kernel size 1.

The output tensor then has one axis each for the batch size, the set size, and the dimension of each set element. This output is interpreted as a set of vector elements for each instance, with the corresponding set size and element dimension pre-specified. This generator is designed to process images with the convolutional layers. The output of step 4 can be seen as a set of feature maps, where each feature map has the same receptive field, which covers the whole image. We then use steps 5 and 6 to transform this set to the desired shape by grouping feature maps and processing each feature map group individually with a shared function. Overall, this ensures that each set element is produced from the same initial input (the whole image) and with the same architecture, although the weights might be different.
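As a sanity check on the layer list above, the spatial sizes can be traced with the standard convolution output-size formula. The sketch below assumes a 64×64 input image; this size is not stated in this appendix, but it is the size for which step 4 yields 512 × 2 × 2 = 2048 features. The set size of 4 and the helper names are hypothetical, chosen only for illustration.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution along one spatial axis (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1

# (filters, kernel, stride, padding) for steps 1-4 of the set generator.
LAYERS = [(64, 4, 2, 1), (128, 4, 2, 1), (256, 4, 2, 1), (512, 4, 4, 1)]

def encoder_feature_shape(image_size=64):  # 64 is an assumed input size
    """Return (channels, height, width) after step 4."""
    size, channels = image_size, 3
    for filters, kernel, stride, padding in LAYERS:
        size = conv_out(size, kernel, stride, padding)
        channels = filters
    return channels, size, size

def grouped_shape(set_size, image_size=64):
    """Per-instance shape after the reshape in step 5, before the transpose."""
    channels, height, width = encoder_feature_shape(image_size)
    total = channels * height * width
    assert total % set_size == 0, "set size must divide the feature count"
    return set_size, total // set_size

print(encoder_feature_shape())    # (512, 2, 2)
print(grouped_shape(set_size=4))  # (4, 512)
```

Under these assumptions, each of the 4 set elements is computed from its own group of 512 features, all of which see the whole image.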

For SSLR-SRN, the same architecture is used for . is shared with until step 4. After that, flattens the feature maps and projects them down to a 100-dimensional vector as the final embedding using a fully connected layer. processes each element in the input set individually with a 3-layer MLP with hidden dimension 512, followed by FSPool (Zhang et al., 2020) with 20 pieces and no relaxation.

Set Function. As described in the main paper, the prediction function decodes each element in the generated set to an image independently through shared transpose-convolution layers. We then aggregate the set of images with self-attention. Both SSLR-MLP and SSLR-SRN have the same image decoder (TransposeConvs) architecture:

  1. Fully connected layer, projecting from element dimension to feature maps of shape (1024, 4, 4). Apply batch norm before reshape and use relu activation.

  2. TransposeConv layer filters 512, kernel size 4, stride 2, padding 1, no bias, batch normalized, relu activation.

  3. TransposeConv layer filters 256, kernel size 4, stride 2, padding 1, no bias, batch normalized, relu activation.

  4. TransposeConv layer filters 3, kernel size 4, stride 4, padding 0, no bias.

This architecture first creates objects and then superimposes them for the final reconstruction. The result is permutation invariant, since the order of the elements in the set does not change the output. It also allows each element to be interpreted through its individually reconstructed object, leading to direct disentanglement.
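The decoder's output resolution follows from the usual transposed-convolution size formula. The following sketch traces steps 2 to 4 starting from the (1024, 4, 4) feature maps of step 1; it assumes no output padding and dilation 1, neither of which is specified above, and the helper names are hypothetical.

```python
def tconv_out(size, kernel, stride, padding):
    """Output length of a transposed convolution along one axis.

    Assumes no output padding and dilation 1.
    """
    return (size - 1) * stride - 2 * padding + kernel

# (filters, kernel, stride, padding) for steps 2-4 of the image decoder.
DECODER_LAYERS = [(512, 4, 2, 1), (256, 4, 2, 1), (3, 4, 4, 0)]

def decoder_output_shape(start=(1024, 4, 4)):
    """Return (channels, height, width) of the image decoded from one set element."""
    channels, size, _ = start
    for filters, kernel, stride, padding in DECODER_LAYERS:
        size = tconv_out(size, kernel, stride, padding)
        channels = filters
    return channels, size, size

print(decoder_output_shape())  # (3, 64, 64)
```

Under these assumptions each set element decodes to a 3-channel 64×64 image, so the per-element images can be aggregated pixel by pixel into a single reconstruction.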

A.2 Disentanglement Metric Specification

To measure whether an image was completely disentangled, we examine each of the individual images generated by each of the set elements. If any of the values were more than (out of ), we considered that set element to be “non-empty.” If the total number of non-empty set elements matched the total number of circles, we considered the image to be “completely disentangled.” Although it is possible that this results in false positives (for example, if the model splits one circle into 2 set elements and puts 2 other circles in one set element), we rarely observe this in practice.
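The check described above can be sketched in a few lines. The threshold value used here (10 on a 0 to 255 scale) is a placeholder chosen for illustration, not the value used in the experiments, and the function names and the nested-list image format are likewise hypothetical.

```python
def count_non_empty(element_images, threshold):
    """Count set elements whose decoded image has any pixel value above threshold."""
    return sum(
        1
        for image in element_images
        if any(value > threshold for row in image for value in row)
    )

def completely_disentangled(element_images, num_circles, threshold):
    """True if the number of non-empty set elements equals the number of circles."""
    return count_non_empty(element_images, threshold) == num_circles

# Toy example: three 2x2 single-channel element images for a scene with two circles.
elements = [
    [[0, 0], [0, 200]],  # non-empty
    [[3, 1], [0, 2]],    # only faint values, treated as empty
    [[90, 0], [0, 0]],   # non-empty
]
print(completely_disentangled(elements, num_circles=2, threshold=10))  # True
```

As noted above, a count-based check like this cannot distinguish a correct decomposition from one where a split circle and a merged pair happen to cancel out.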

A.3 Additional Results

Figures 29 and 30 show 10 randomly sampled images from the test set along with the latent sets learned by SSLR-SRN and SSLR-MLP. As shown in the figures, the decomposition is almost perfect for SSLR-SRN, whereas SSLR-MLP frequently cannot disentangle objects.

Figure 29: Five sampled reconstruction and decomposition results. From left to right, column-wise: original images, SSLR-SRN reconstruction, SSLR-SRN decomposition, SSLR-MLP reconstruction, SSLR-MLP decomposition.
Figure 30: Five sampled reconstruction and decomposition results. From left to right, column-wise: original images, SSLR-SRN reconstruction, SSLR-SRN decomposition, SSLR-MLP reconstruction, SSLR-MLP decomposition.

Appendix B CLEVR Object Reconstruction Results Experiment

B.1 Architecture Details

The architecture for the CLEVR experiment is nearly the same as for the Circles Dataset described above. The only difference is that the projection layers in and the decoder are modified to adapt to image size of .

B.2 IoU Metric Specification

To measure object disentanglement, we compute both the standard overall IoU and a per-object IoU. As mentioned in the paper, the per-object IoU is the IoU over the Chamfer matching between ground-truth bounding boxes and the bounding boxes of the predicted decomposition. In both cases, we first threshold the pixels valued in by (i.e., pixels with all channels smaller than are set to zero). For the overall IoU case, this threshold is applied on the final reconstruction, and in the per-object case it is applied on each set element. Due to rendering noise, it is difficult to obtain instance segmentation masks. Instead, we algorithmically generate bounding boxes with OpenCV’s findContours() function applied to the thresholded prediction. These bounding boxes are then compared to ground truth bounding boxes. We match the latent set of a prediction (where each element may contain multiple bounding boxes if a prediction fails to disentangle objects) with the set of generated bounding boxes using Chamfer matching, where the cost is the IoU between the pairs of elements: each prediction is matched with the closest ground truth bounding box. With this computed assignment, we then compute the average per-object IoU.
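The matching step can be illustrated with plain bounding-box arithmetic. The sketch below is a simplified version under stated assumptions: boxes are (x_min, y_min, x_max, y_max) tuples, the thresholding and contour extraction are taken as already done, and each prediction contributes a single box rather than possibly several. All function names are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def per_object_iou(predicted_boxes, ground_truth_boxes):
    """Match each predicted box to its best ground-truth box and average the IoUs."""
    if not predicted_boxes or not ground_truth_boxes:
        return 0.0
    best = [max(box_iou(p, g) for g in ground_truth_boxes) for p in predicted_boxes]
    return sum(best) / len(best)

predicted = [(0, 0, 10, 10), (20, 20, 30, 30)]
ground_truth = [(0, 0, 10, 10), (20, 20, 30, 40)]
print(per_object_iou(predicted, ground_truth))  # 0.75
```

Here the first prediction matches its ground-truth box exactly (IoU 1.0) and the second covers half of its box (IoU 0.5), giving an average of 0.75.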

B.3 Additional Results

Figures 31 and 32 show 10 randomly sampled images from the test set, along with the SSLR-SRN and SSLR-MLP latent sets. Figures 36, 37, and 38 each show one sample in more detail. Again, SSLR-SRN disentangled the objects and reasonably completed their occluded parts, while SSLR-MLP failed to disentangle objects.

Figure 31: Five sampled reconstruction and decomposition results. From left to right, column-wise: ground truth objects image, SSLR-SRN reconstruction, SSLR-SRN decomposition, SSLR-MLP reconstruction, SSLR-MLP decomposition.
Figure 32: Five sampled reconstruction and decomposition results. From left to right, column-wise: ground truth objects image, SSLR-SRN reconstruction, SSLR-SRN decomposition, SSLR-MLP reconstruction, SSLR-MLP decomposition.
(a) Ground Truth Image
(b) SSLR-SRN Reconstruction
(c) SSLR-MLP Reconstruction
Figure 36: The example ground truth and reconstructions for Figures 37 and 38.
Figure 37: The decomposition of SSLR-SRN for the example in Figure 36.
Figure 38: The decomposition of SSLR-MLP for the example in Figure 36.

Appendix C Set Matching Task Results Experiment

C.1 Architecture Details

The naive CNN considered in the main paper uses as the encoder. It then computes the squared difference of the embeddings from the two images and transforms it to the final scalar output through a linear layer.

For SSLR models, the set generators are the same as for the Circles Dataset reconstruction task. The set function is designed to solve the matching problem with Hungarian matching. Given the generated sets for the two input images, we use a linear layer to decompose each element embedding into two five-dimensional embeddings: one intended to capture location information and one intended to capture attribute information (e.g., circle color and radius). We compute a Hungarian matching of the set elements using the attribute embeddings. Finally, the prediction is the sum of distances between the location embeddings in this matching.
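A minimal sketch of this set function follows. It makes several assumptions that are not taken from the experiments: the matching cost is the squared Euclidean distance between attribute embeddings, the toy embeddings are lower-dimensional than the five dimensions used above, and an exhaustive search over permutations stands in for the Hungarian algorithm. Both return a minimum-cost assignment, but the exhaustive version is only practical for small sets.

```python
from itertools import permutations

def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def optimal_matching(attrs_a, attrs_b):
    """Minimum-cost one-to-one assignment between two equal-size sets.

    Exhaustive search used as a stand-in for the Hungarian algorithm.
    Returns a tuple perm where element i of set A is matched to perm[i] of set B.
    """
    n = len(attrs_a)
    return min(
        permutations(range(n)),
        key=lambda perm: sum(sq_dist(attrs_a[i], attrs_b[perm[i]]) for i in range(n)),
    )

def matched_location_distance(set_a, set_b):
    """Each set element is an (attribute_embedding, location_embedding) pair."""
    perm = optimal_matching([attr for attr, _ in set_a], [attr for attr, _ in set_b])
    return sum(sq_dist(set_a[i][1], set_b[j][1]) for i, j in enumerate(perm))

# Two toy sets with 1-D attributes and 2-D locations; set_b lists its elements in swapped order.
set_a = [((0.0,), (0.0, 0.0)), ((1.0,), (5.0, 5.0))]
set_b = [((1.0,), (5.0, 6.0)), ((0.0,), (1.0, 0.0))]
print(matched_location_distance(set_a, set_b))  # 2.0
```

Because the assignment is computed from attributes alone, reordering the elements of either set leaves the predicted distance unchanged.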

C.2 Training Details

The training loss is the sum of the distance error, the Hungarian matching cost, and (if used) the reconstruction loss. We used RMSprop with a learning rate of 5e-4 for all SSLR models. For the naive CNN, we used a learning rate of 1e-5. For SSLR-SRN with the extra reconstruction loss, we added a weight decay of 1e-4. We trained all models until convergence or overfitting (around 60 epochs) and selected the best models. The inner optimization loop uses gradient descent with step size 0.1 and momentum 0.5 for


Appendix D Datasets and Code Release

All datasets we used in this paper will be released together with our code, which will also contain the code and reference for generating the datasets.