BetterSetRepresentations
[NeurIPS 2020] Better Set Representations For Relational Reasoning (https://arxiv.org/abs/2003.04448)
Unstructured data often has latent component structure, such as the objects in an image of a scene. In these situations, the relevant latent structure is an unordered collection, or set. However, learning such representations directly from data is difficult due to the discrete and unordered structure. Here, we develop a framework for differentiable learning of set-structured latent representations. We show how to use this framework to naturally decompose data such as images into sets of interpretable and meaningful components, and we demonstrate how existing techniques cannot properly disentangle relevant structure. We also show how to extend our methodology to downstream tasks such as set matching, which uses set-specific operations. Our code is available at https://github.com/CUVL/SSLR.
Modern deep learning models perform many tasks well, from speech recognition to object detection. However, despite their many successes, a criticism of deep learning is its limitation to low-level tasks as opposed to more sophisticated reasoning. This gap has drawn analogies to the difference between so-called "System 1" (i.e., low-level perception and intuitive knowledge) and "System 2" (i.e., reasoning, planning, and imagination) from cognitive psychology (Kahneman, 2011; see also https://nips.cc/Conferences/2019/ScheduleMultitrack?event=15488). Proposals for moving towards System 2 reasoning in learning systems involve creating new abilities for composition, combinatorial generalization, and disentanglement (Battaglia et al., 2018; van Steenkiste et al., 2019; Marra et al., 2019). One approach for augmenting neural networks with these capabilities is the use of discrete structured representations, such as graphs or sets. In particular, this approach has been used to great effect in the computer vision community, for tasks such as visual question answering, image captioning, and video understanding
(Santoro et al., 2017; Yang et al., 2019; Hussein et al., 2019). For example, in systems for visual question answering, a standard component involves identifying the set of objects in the scene (Desta et al., 2018). Similarly, a system for image captioning can benefit from incorporating a scene-graph representation (Yao et al., 2018). Existing frameworks for operating with set-like structure revolve around two types of problems. In the first, a set object is the input and structure is placed on the model, such as permutation invariance with respect to the input (Qi et al., 2016; Zaheer et al., 2017; Lee et al., 2018). The second is producing a set-like structure as output; this is standard in object detection or neural set generation (Zhang et al., 2019; Vinyals et al., 2015; Rezatofighi et al., 2018).
In many cases, though, we expect that unstructured data such as images have latent discrete structure that we would like to learn “in the middle,” i.e., we would like to automatically learn a meaningful discrete representation—such as a set of vectors—that can be incorporated in a machine learning pipeline for some task. In this sense, the set structure is treated as latent information rather than explicitly imposed as an input or an output. However, constructing such a representation is a major challenge. Existing approaches do not respect discrete invariances
(Santoro et al., 2017) or resort to generating discrete structure in a non-differentiable manner with separately trained submodules (Yao et al., 2018). Part of the challenge is that meaningfully generating a set in a differentiable manner is difficult. For example, a standard approach is to produce a collection of vectors through separate "channels" (e.g., several multilayer perceptrons) and then use the elements of the collection in an arbitrary order
(Stewart et al., 2015; Rezatofighi et al., 2018). While this heuristic can be easy to implement, omitting an explicit permutation invariance structure leads to issues of discontinuity in the system
(Zhang et al., 2020). Here, we propose a framework for learning a set-structured latent representation (SSLR). Our main idea is to take unstructured data as input, enforce set structure as an intermediate representation, and use the set to make a prediction (Fig. 1). By making this entire pipeline differentiable, the intermediate step encourages the model to automatically learn a meaningful set-structured representation of the unstructured input. As one example, we will consider a case where the input and the prediction are the same image. This is a reconstruction task on which we enforce a set structure, and we show how our approach effectively disentangles the objects.
Our SSLR framework is general in the sense that it allows for different implementations of the set structure component, relying primarily on a set generator function that produces a set from a data point. However, we demonstrate that the choice of generator is extremely important both for learning high-quality latent set representations and for making good predictions. As a first pass, we consider using multiple MLP channels as described above as the generator and verify known issues with this approach within our framework. We then develop a novel Set Refiner Network (SRN) as a set generator function, which derives better sets and predictions by more explicitly enforcing set structure. The architecture for our SRN draws on the recently developed Deep Set Prediction Networks (Zhang et al., 2019), which outperform the multiple-MLP approach at producing set outputs.
We demonstrate our SSLR framework on a number of tasks on synthetic and real-world data. First, to cleanly illustrate our ideas, we generate synthetic images with multiple simple objects and use an image reconstruction task to learn latent sets. We find that by using the SRN, the learned latent set structure disentangles the objects, and the learned set representations themselves have meaningful and interpretable structure in terms of the position and geometry of the objects. In contrast, the MLP approach produces set representations that suffer from discontinuity issues, which become immediately apparent when interpolating between two images, even if the reconstruction error is small. Essentially, the MLP approach cannot disentangle the objects in the set representation. We perform similar experiments on the CLEVR dataset with similar results.
Finally, we also demonstrate that our framework is particularly well-suited for downstream tasks that involve making predictions relevant to the set structure. More specifically, we consider a set matching task: given two images with the same objects but in different places, determine the distances between the objects. Again, our methodology outperforms standard approaches.
One line of work that has focused on a set representation is unsupervised object detection and segmentation (Greff et al., 2019; Lin et al., 2020). These methods demonstrate impressive results in decomposing scenes without explicit supervision. However, they require techniques specific to object detection and segmentation, such as bounding boxes and highly specific architectures. Greff et al. (2019)
in particular requires an autoencoder setup, which makes it unsuitable for learning latent sets in general supervised learning tasks. Instead, we propose a general set extraction framework that naturally imposes a set-structured latent representation.
Both set encoders and set decoders have been well-studied (Zaheer et al., 2017; Zhang et al., 2019, 2020; Lee et al., 2018). Although they can be applied within our framework, our work primarily differs from previous set learning work in that we focus on set structure of the latent space: set encoders focus on sets as input, while set decoders focus on sets as output. This means that we do not need any set-structured data to use our latent set representations.
As previously mentioned, many approaches have incorporated discrete structures to obtain better performance on a given task (Yang et al., 2019; Chen et al., 2019). Some ignore the permutation-invariant nature of graphs and sets (Santoro et al., 2017), while others are non-differentiable (Yao et al., 2018). Our methodology addresses both of these issues.
In this section, we develop our set-structured latent representation (SSLR) framework. The framework is intentionally general; the key generic components are a set generator function that transforms inputs into a set of vectors and a set function that makes a prediction from this set of vectors. We explore various ways of implementing the set generator using existing techniques and a novel Set Refiner Network (SRN). The set function that makes a prediction is task-specific, and we discuss its details later in our experiments.
It is worth reflecting on why we might desire a set-structured latent space in the first place. One immediate answer is that the underlying space factorizes into a set, such as in scenes with multiple objects. By imposing a set-structured latent space, we hope that the model is encouraged to take advantage of this structure; in some sense, the model must take advantage of it in order to maximize performance. Another reason to use a set-structured latent space is to perform set-specific operations; we explore this in set matching tasks in our experiments. Finally, since we force the dimension of each set element to be far lower than the total dimension of the set latent space (the product of the number of set elements and the dimension of each element), we expect that the learned model will effectively decompose information across multiple set elements in order to achieve good performance.
Our general model is straightforward: we start with some unstructured data point (e.g., an image), then produce a set of vectors, and finally transform the set of vectors to make some prediction. Formally, we write this as follows:

S = g(x),   (1)
ŷ = F(S),   (2)

where g is the set generator function, S is the set of vectors, and F is a function that maps that set to the prediction ŷ. Throughout the paper, we will focus on the case when x is an image and assume that F is a differentiable, task-specific set function. The set generator g will make use of neural networks designed to process the input data, such as convolutional neural networks for images; for example, we could use multiple networks in parallel to construct a latent set of vectors (indeed, this will be one approach we consider).

However, the main challenge is designing g to produce a set of vectors in a way that respects set invariants in a meaningful and differentiable manner. Given a dataset with data points x and outputs y, we can learn the SSLR by optimizing some loss on the prediction ŷ. In principle, if the prediction that we want to make can be effectively made from a set of vectors representing the unstructured data point, then optimizing this loss will encourage learning that set-structured latent representation. Again, it is useful to have images in mind. If x is a scene and y is a prediction about the number of certain types of objects in the scene, then we would expect that an intermediate set representation wherein each element represents a distinct object would make it easier for F to be a good predictor, as in Fig. 1.
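To make the pipeline of Eqs. 1 and 2 concrete, here is a minimal numerical sketch: a toy set generator that slices an input vector into set elements, followed by a permutation-invariant prediction function. All weights, dimensions, and function names are illustrative stand-ins, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: n set elements of dimension d, from a flat in_dim input.
n, d, in_dim, out_dim = 4, 16, 32, 8
W_g = rng.normal(size=(in_dim, n * d))   # hypothetical generator weights
W_f = rng.normal(size=(d, out_dim))      # hypothetical prediction weights

def g(x):
    """Set generator: produce a set S of n d-dimensional vectors from x."""
    return (x @ W_g).reshape(n, d)       # shape (n, d)

def F(S):
    """Permutation-invariant prediction: sum-pool the set, then project."""
    return S.sum(axis=0) @ W_f           # shape (out_dim,)

x = rng.normal(size=(in_dim,))
S = g(x)
y_hat = F(S)

# F is invariant to the order of the set elements.
perm = rng.permutation(n)
assert np.allclose(F(S[perm]), y_hat)
```

Note that permutation invariance here lives entirely in F; nothing about g itself treats its output as unordered, which is exactly the issue examined below.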
In principle, the set generator can be any function that maps a data point to a set of vectors. We describe below the two types of generators that we consider.
Our first implementation of g is to use a collection of multilayer perceptrons (MLPs). Given a target set size n, we can use n MLPs to produce a collection of n vectors given by the final layer of each MLP. At this point, these vectors can be considered a set and used for prediction by F. This type of approach has previously been used for making predictions where the output is a set (Rezatofighi et al., 2018; Stewart et al., 2015). Although F can process the vectors as a set, this approach is unsatisfying since no real set structure is imposed on the collection. In some sense, the outputs of the MLPs are just a list (as opposed to a set), and one hopes that using a set function to make predictions somehow enforces that the "list" has "set structure."
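A minimal sketch of this multiple-channel construction, with each "MLP" reduced to a single linear layer for brevity (all weights here are hypothetical stand-ins):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, in_dim = 3, 4, 8

# One weight matrix per channel: each "MLP" (a single linear layer here)
# is responsible for producing one element of the collection.
channels = [rng.normal(size=(in_dim, d)) for _ in range(n)]

def g_mlp(x):
    # The output is really an ordered list; set structure is only imposed
    # downstream, by using a permutation-invariant prediction function.
    return np.stack([x @ W for W in channels])   # shape (n, d)

x = rng.normal(size=(in_dim,))
S = g_mlp(x)
assert S.shape == (n, d)
```

The fixed assignment of channel i to element i is what makes this a list rather than a set, and it is the root of the responsibility problem described next.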
This conceptual issue was recently formalized by Zhang et al. (2020) through the so-called "responsibility problem." Each of the MLPs is "responsible" for producing one of the underlying set elements, but this responsibility changes discontinuously. In the context of images, when slowly swapping the positions of two otherwise identical objects, there must be a point where the first MLP suddenly has to become responsible for predicting the other object, and vice versa. This can be seen in Fig. (b), where the MLP must slowly pass responsibility for the circle between two different set elements. Because MLPs are not well-suited for modeling these jump discontinuities, even exceedingly simple sets are often predicted incorrectly by the MLPs in set prediction tasks. We next describe a novel approach that can discontinuously change responsibility (Fig. (c)).

Zhang et al. (2019) proposed Deep Set Prediction Networks (DSPNs) to circumvent the responsibility problem for predicting output sets from vectors. The core idea is a meta-learning procedure. Given a vector representation of some data, the DSPN uses an inner optimization loop to search for a set of vectors that, when passed through a set encoder, is close to the original vector representation. By learning that set encoder, a mapping between the input vector and the output set is established. However, their approach needs ground-truth sets for supervision.
Our goals are slightly different: we end up computing F(S) to make a prediction rather than predicting the set S itself, and we have no ground-truth sets for supervision. However, we can still use a similar architecture to take a data point x to a set S and then learn with a loss involving F(S). For our tasks, we also noticed that performance hinged on finding a good initial guess for the inner optimization loop, rather than the shared initialization used in DSPN. Thus, we treat the initialization as a function to be learned and refine the initial guess with a search procedure.
We call this approach a Set Refiner Network (SRN). Putting everything together, g is defined as follows:

z = G_enc(x),   (3)
S^(0) = G_init(x),   (4)
S = GD_T(|| G_set(S) - z ||^2 ; S^(0)),   (5)

where G_enc and G_init are parameterized by neural networks designed to process x (e.g., CNNs for images), z is a vector, and G_set is a permutation-invariant set function. The GD_T term in Eq. 5 is the meta-learning component and represents running gradient descent over the optimization variable S for the loss || G_set(S) - z ||^2, starting from the initial point S^(0), for T iterations. In our implementation, we also use a momentum term in the descent procedure.
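The inner refinement loop of Eq. 5 can be sketched as follows. The linear permutation-invariant set encoder, the dimensions, and the iteration count are illustrative stand-ins for the learned components, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, z_dim, T = 3, 4, 6, 50

# Hypothetical learned set-encoder weights (random here for illustration).
# A linear sum over elements is permutation-invariant.
W_set = rng.normal(size=(d, z_dim)) * 0.3

def G_set(S):
    return (S @ W_set).sum(axis=0)          # order of elements does not matter

def refine(z, S0, lr=0.1, momentum=0.5):
    """Inner loop of Eq. 5: gradient descent on ||G_set(S) - z||^2 from S0,
    with a momentum term, for T iterations."""
    S, v = S0.copy(), np.zeros_like(S0)
    for _ in range(T):
        r = G_set(S) - z                           # residual in embedding space
        grad = 2.0 * np.tile(r @ W_set.T, (n, 1))  # gradient w.r.t. each element
        v = momentum * v - lr * grad
        S = S + v
    return S

z = rng.normal(size=(z_dim,))
S0 = rng.normal(size=(n, d))
S = refine(z, S0)
# The refined set matches the target embedding better than the initial guess.
assert np.linalg.norm(G_set(S) - z) < np.linalg.norm(G_set(S0) - z)
```

Because the loop only constrains the encoded sum, reorderings of S are equally good solutions, which is what lets responsibility jump between elements.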
To summarize, the SRN learns (i) a function G_enc that produces a vector embedding z of a data point x, (ii) a function G_init that produces an initial guess of the latent set, and (iii) a set function G_set that drives refinement of the initial guess through an inner optimization loop. These functions are learned through supervision on the prediction ŷ. We show in our experiments that this approach avoids the discontinuities in the set embedding that are prevalent with MLP approaches.
In this section, we conduct two types of experiments. First, we consider an image reconstruction task, where the SSLR is expected to contain as much information about the original image as possible. We find on multiple datasets that using the SRN as the set generator provides meaningful set representations that disentangle objects, whereas MLP-based approaches learn sets that entangle objects. Second, we demonstrate that our framework can be used to improve performance on a set matching task. Throughout, we refer to SSLR implemented with the SRN as SSLR-SRN and to SSLR implemented with MLPs as SSLR-MLP. For each task, SSLR-SRN is implemented as SSLR-MLP with only the additional refinement step of Eq. 5, using the MLP output as the initial guess.
We first experiment on a synthetic "Circles Dataset." Each image has RGB channels with values in the range 0 to 1 (see Fig. (a) for an example). An image contains 0 to 10 circles, each green or red, with an integer radius of 1 to 11 pixels, with each of these components sampled uniformly at random. Each circle is fully contained in the image, with no overlap between circles of the same color. A red and a green circle may overlap, which produces a yellow color. In total, we use 64,000 images for training and 4,000 images for testing.
To reconstruct the image, we first encode the input image to a global embedding vector of 100 dimensions using a standard CNN (this is the encoder in Eq. 3; see Appendix A for details). The initialization function (Eq. 4) shares the first several convolutional layers of the encoder, up to the final 512-channel feature map. We then group these channels into 32 groups and project each channel group to 16 dimensions using a shared fully connected layer. The set function (Eq. 5) processes each element individually with a 3-layer MLP, followed by FSPool (Zhang et al., 2020) as a pooling function. The inner optimization loop uses gradient descent with step size 0.1 and momentum 0.5 for a fixed number of iterations.
After obtaining the latent set, the prediction function F decodes each element to an image independently through shared transpose-convolution layers. Finally, we weight the generated images by their softmax scores (to ensure their sum lies in the range [0, 1]) and sum the result to obtain the final prediction:
(I_i, a_i) = Dec(s_i)  for each s_i in S,   (6)
ŷ = Σ_i softmax(a)_i ⊙ I_i,   (7)
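The softmax-weighted aggregation can be sketched as follows, with random arrays standing in for the decoded per-element images and scores; the per-pixel weighting shown here is one plausible reading of the aggregation step:

```python
import numpy as np

rng = np.random.default_rng(3)
n, H, W = 3, 8, 8

# Stand-ins for the decoded per-element images and their scores; in the
# real model, both come from a shared transpose-convolution decoder.
images = rng.uniform(0.0, 1.0, size=(n, H, W))
scores = rng.normal(size=(n, H, W))

def softmax(a, axis=0):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Per-pixel weights sum to 1 across set elements, so the weighted sum of
# per-element images (each in [0, 1]) also stays in [0, 1].
w = softmax(scores, axis=0)
prediction = (w * images).sum(axis=0)

assert prediction.shape == (H, W)
assert prediction.min() >= 0.0 and prediction.max() <= 1.0
```

Since the sum is over set elements and softmax is symmetric in its inputs, the aggregation is permutation-invariant, as required of F.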
We train the model with a least-squares reconstruction loss using the Adam optimizer with learning rate 3e-4.
Figure 11 shows an example reconstruction and the images decoded from the latent set elements. Although we only provide supervision on the entire image's reconstruction, SSLR-SRN naturally disentangles most images into a set of the individual circles. We compare this to SSLR-MLP, which cannot disentangle the circles (Fig. (e)). Although SSLR-MLP does not succeed in disentangling the image into its component circles, it still has a set-structured latent space and achieves some amount of decomposition.
Table 1: % of images completely disentangled.

# of Circles  SSLR-MLP  SSLR-SRN
1             45.1%     95.8%
2             16.5%     88.7%
3              7.8%     79.9%
4              3.2%     72.7%
5              1.9%     65.2%
To explore this disentanglement further, we first create 1,000 images with k circles for each k = 1, ..., 5. We then measure how often SSLR-SRN and SSLR-MLP successfully disentangle the circles by counting the number of non-empty set elements. SSLR-SRN has much higher rates of disentanglement (Table 1). With just one circle, SSLR-SRN succeeds more than 95% of the time, while SSLR-MLP succeeds less than half of the time. With five circles, SSLR-SRN still succeeds 65% of the time, while SSLR-MLP succeeds less than 2% of the time.







We next consider an interpolation experiment to demonstrate the differences between SRN and MLP. First, we perform an interpolation experiment in the latent space. We choose a latent set element that decodes to a green circle and a latent set element that decodes to an empty image, and we interpolate between the two. As we see in Fig. 12, the interpolation for SSLR-MLP does not look like a circle for most of the interpolation and, even worse, has red artifacts. Meanwhile, the interpolation for SSLR-SRN smoothly grows from empty space to the full-sized circle. In other words, interpolation between an empty image and a circle in the SRN latent space corresponds to size interpolation.
Next, we interpolate in image space by gradually moving the location of a circle to the right. Due to the responsibility problem, we know that SSLR-MLP cannot generate a completely unordered set.
In Figure 5, we visualize what part of the image each set element is responsible for by associating each set element with a distinct color. If the circle is one color, that means that one set element is responsible for generating the whole circle. If the circle is split into multiple colors, then that means that set elements are splitting the responsibility for generating the circle.
With SSLR-MLP, the responsibility for the circle gradually shifts from one set element to another. This demonstrates that the set generated by SSLR-MLP corresponds to locations, as opposed to discrete circles. This is due to the fact that MLPs have difficulty expressing the discontinuity required to represent a proper set generation function.

With our Set Refiner Network, SSLR-SRN overcomes the responsibility problem. As we interpolate the circle from left to right, a single set element is always responsible for generating the entire circle. In particular, even when we move the circle by only a small amount, the responsibility can shift completely from one set element to another. This discontinuity is what allows SSLR-SRN to avoid the responsibility problem and is key to generating a good set-structured latent representation. As a result, SSLR-SRN successfully encodes the circle in one set element throughout the entire interpolation.
We now visualize how our set-structured latent sets capture meaningful structure. To this end, we grid-sample images with one red or green circle at different locations and plot the latent space of the set elements corresponding to the circle using a two-dimensional t-SNE embedding with default scikit-learn settings (Fig. 16). Consistent with the previous interpolation experiments, we see that the set elements generated by SSLR-MLP fall into discontinuous clusters. On the other hand, SSLR-SRN shows the desired grid structure, with the two clear factors of variation: the x and y coordinates. We also visualize coordinates, radius, and color with a three-dimensional t-SNE embedding (Fig. 21), which shows that our latent representation is both globally and locally consistent, as well as highly meaningful. In contrast, the latent space of SSLR-MLP is only consistent within each cluster, which is unsurprising in light of the responsibility problem.
Recall that the model was never explicitly encouraged to disentangle the objects, yet SSLR-SRN learns the correct underlying structure, effectively reducing the task of modeling a complicated scene to modeling its individual components. Overall, this demonstrates that simply incorporating a set structure and generating the set properly can naturally decompose images into meaningful components. We now turn to a similar task on a more realistic dataset.






We next use the CLEVR dataset (Johnson et al., 2016) to show that such object decomposition holds in more complicated settings. The dataset contains 70,000 training and 15,000 validation images. The original dataset does not contain images of the masked objects, but we generate these through publicly available code (see Appendix B for details).
We again encode images using a CNN, which takes 128 × 128 images as input and has four convolutional layers, to produce a 512-dimensional image embedding. We use the same general architecture for SSLR-MLP and SSLR-SRN as for the Circles Dataset, just with different numbers of hidden dimensions (architectural and experimental details are in Appendix B).
Figure 28 shows a representative example of the reconstructed images and the set of decomposed images. It again highlights how the SRN approach can disentangle objects while the MLP approach cannot. We also measured decomposition quality with two metrics (Table 2): the standard overall intersection-over-union (IoU) and a per-object IoU. The overall IoU is calculated pixel-wise over all foreground objects in the reconstructed image. The per-object IoU is the IoU over the Chamfer matching between ground-truth bounding boxes and the bounding boxes of the predicted decomposition. In both cases, SSLR-SRN performs better than SSLR-MLP. The low per-object IoU for SSLR-MLP is due to its poor object decomposition and its inability to handle occluded objects. In contrast, SSLR-SRN can decompose the scene as a superposition of full objects, including the occluded parts, providing a more meaningful disentangled representation.
Table 2: Decomposition quality on CLEVR.

Metric          SSLR-MLP  SSLR-SRN
overall IoU     0.9193    0.9345
per-object IoU  0.7737    0.8305
So far, we have demonstrated that a set-structured latent representation can provide a meaningful latent representation that disentangles objects. We now demonstrate another advantage of such representations, namely the ability to perform set-specific operations. More specifically, we consider the following set matching task: given two images containing the same set of objects, predict the sum of squared distances between each pair of corresponding objects (Appendix C contains experimental details).
For simplicity, we reuse the Circles Dataset, restricting to images with only 2 circles, and create a complementary image with the same two circles in different locations.
Table 3: Set matching task results. Means and standard deviations are reported over three experiments. SSLR outperforms a basic CNN approach, SRN outperforms MLP, and including a reconstruction loss with initialization from the reconstruction task can boost the performance of SSLR-SRN.

Model                relative error (%)  L2 error
Naive CNN            42.64 ± 1.89        0.027 ± 2.58e-4
SSLR-MLP             37.20 ± 0.56        0.021 ± 7.60e-4
SSLR-SRN             27.58 ± 0.73        0.016 ± 2.16e-4
SSLR-SRN (r + init)  23.86 ± 0.23        0.014 ± 1.00e-3
We use the same architecture as in the previous reconstruction tasks. In addition, we predict a location embedding and an attribute embedding from each set element individually and then use Hungarian matching to match the elements based on the attribute embeddings. The location embeddings are then permuted accordingly, and the prediction is the sum of element-wise squared differences. Finally, we minimize the sum of the Hungarian matching error and the prediction error, both in the L2 norm. We again compare SSLR-MLP with SSLR-SRN and also consider a simple CNN baseline, which directly computes the L2 difference between the embeddings output by the image encoder.
Table 3 shows the results and reveals several important findings. First, imposing set structure (with MLP or SRN) improves over the CNN baseline. Second, SRN again outperforms MLP. Finally, we also considered adding the reconstruction loss along with initializing parameters from the reconstruction experiment, which provides substantial improvements for the SSLR-SRN model.
We have introduced a general framework for incorporating a set-structured latent representation (SSLR) in neural networks, which uses a novel Set Refiner Network as a set generator function. Our approach leads to natural decompositions of images into objects. Moreover, each object embedding in the set is meaningful and consistent, which is not the case for naive implementations with MLPs. We also showed how imposing an SSLR is useful for set-based tasks such as set matching. One natural focus for future work is incorporating new set generation procedures into our framework.
This research was supported by NSF Award DMS-1830274, ARO Award W911NF-19-1-0057, and ARO MURI.
Set Generator. For SSLR-MLP, the set generator is a standard image encoder derived from Stacked Convolutional Autoencoders (Masci et al., 2011) and DCGAN (Radford et al., 2015), with additional processing at the end:
2. Conv2d layer: 64 input channels, 128 filters, kernel size 4, stride 2, padding 1, no bias, batch norm, ReLU activation.
3. Conv2d layer: 128 input channels, 256 filters, kernel size 4, stride 2, padding 1, no bias, batch norm, ReLU activation.
4. Conv2d layer: 256 input channels, 512 filters, kernel size 4, stride 4, padding 1, no bias, batch norm, ReLU activation.
5. Reshape to (n, 2048 / n), where n is the size of the set, and transpose dimensions 1 and 2.
6. Conv1d layer: (2048 / n) input channels, number of filters equal to the element dimension, kernel size 1.
The output tensor is then of shape (b, n, d), where b is the batch size, n is the set size, and d is the dimension of each set element. This output is interpreted as a set of n d-dimensional vector elements for each instance, with the set size and element dimension pre-specified. This generator is designed to process images with the convolutional layers. The output of step 4 can be seen as a set of feature maps, where each feature map has an equal receptive field that covers the whole image. We then use steps 5 and 6 to transform this set to the desired shape by grouping feature maps and processing each feature map group individually with a shared function. Overall, this ensures that each set element is produced from the same initial input (the whole image) and with the same architecture, although the weights may differ.

For SSLR-SRN, the same architecture is used for the initialization function G_init. The encoder G_enc shares layers with G_init until step 4; after that, G_enc flattens the feature maps and projects down to a 100-dimensional vector as the final embedding using a fully connected layer. The set function G_set processes each element in the input set individually with a 3-layer MLP with hidden dimension 512, followed by FSPool (Zhang et al., 2020) with 20 pieces and no relaxation.
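Steps 5 and 6 (group the feature maps and project each group with a shared function) amount to a reshape followed by a shared linear map. This sketch uses random stand-ins for the feature map and projection weights, and replaces the kernel-size-1 Conv1d with its equivalent shared matrix multiplication:

```python
import numpy as np

rng = np.random.default_rng(4)
b, n, d = 2, 32, 16

# Stand-in for a 512-channel spatial feature map with 2048 values per image.
feat = rng.normal(size=(b, 512, 2, 2))

# Step 5: flatten and split the 2048 features into n groups of 2048/n.
groups = feat.reshape(b, n, 2048 // n)       # shape (b, n, 2048 / n)

# Step 6: a Conv1d with kernel size 1 applies the SAME projection to every
# group, here written as a shared linear map to d dimensions.
W_proj = rng.normal(size=(2048 // n, d))
S = groups @ W_proj                          # shape (b, n, d)

assert S.shape == (b, n, d)
```

Sharing W_proj across groups is what makes every set element the product of the same architecture, as the text above notes.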
Set Function. As described in the main paper, the prediction function F decodes each element in the generated set to an image independently through shared transpose-convolution layers. We then aggregate the set of images with self-attention. Both SSLR-MLP and SSLR-SRN use the same image decoder (TransposeConvs) architecture:
1. Fully connected layer, projecting from the element dimension to feature maps of shape (1024, 4, 4); batch norm is applied before the reshape, with ReLU activation.
2. TransposeConv layer: 512 filters, kernel size 4, stride 2, padding 1, no bias, batch norm, ReLU activation.
3. TransposeConv layer: 256 filters, kernel size 4, stride 2, padding 1, no bias, batch norm, ReLU activation.
4. TransposeConv layer: 3 filters, kernel size 4, stride 4, padding 0, no bias.
This architecture first creates objects and then superimposes them for the final reconstruction. This is permutation-invariant, as the order of the elements in the set does not change the output, and it allows each element to be interpreted via its individual reconstructed object, leading to direct disentanglement.
To measure whether an image was completely disentangled, we examine each of the individual images generated by each of the set elements. If any of the pixel values exceeded a small threshold, we considered that set element to be "non-empty." If the total number of non-empty set elements matched the total number of circles, we considered the image to be "completely disentangled." Although it is possible that this results in false positives (for example, if the model splits one circle into two set elements and puts two other circles in one set element), we rarely observe this in practice.
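This counting procedure can be sketched as follows; the threshold value here is a hypothetical placeholder, since the exact cutoff is elided above:

```python
import numpy as np

def num_nonempty(decoded, threshold=0.1):
    """Count set elements whose decoded image has any value above threshold.

    `threshold` is a hypothetical stand-in; the paper's exact cutoff is not
    reproduced here.
    """
    return int(sum(img.max() > threshold for img in decoded))

def completely_disentangled(decoded, n_circles, threshold=0.1):
    # Disentangled if non-empty elements exactly match the circle count.
    return num_nonempty(decoded, threshold) == n_circles

# Two clearly non-empty elements and one near-empty one.
decoded = [np.full((4, 4), 0.8), np.full((4, 4), 0.02), np.full((4, 4), 0.5)]
assert num_nonempty(decoded) == 2
assert completely_disentangled(decoded, 2)
```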
Figures 29 and 30 show 10 randomly sampled images from the test set along with the learned latent sets. As shown in the figures, the decomposition is almost perfect for SSLR-SRN, whereas SSLR-MLP frequently cannot disentangle objects.
The architecture for the CLEVR experiment is nearly the same as for the Circles Dataset described above. The only difference is that the projection layers in the encoder and the decoder are modified to accommodate an image size of 128 × 128.
To measure object disentanglement, we compute both the standard overall IoU and a per-object IoU. As mentioned in the paper, the per-object IoU is the IoU over the Chamfer matching between ground-truth bounding boxes and the bounding boxes of the predicted decomposition. In both cases, we first threshold the pixel values (pixels with all channels smaller than the threshold are set to zero). For the overall IoU, this threshold is applied to the final reconstruction; in the per-object case, it is applied to each set element. Due to rendering noise, it is difficult to obtain instance segmentation masks. Instead, we algorithmically generate bounding boxes with OpenCV's findContours() function applied to the thresholded prediction. These bounding boxes are then compared to ground-truth bounding boxes. We match the latent set of a prediction (where each element may contain multiple bounding boxes if a prediction fails to disentangle objects) with the set of generated bounding boxes using Chamfer matching, where the cost is the IoU between pairs of elements: each prediction is matched with the closest ground-truth bounding box. With this computed assignment, we then compute the average per-object IoU.
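A minimal sketch of the per-object IoU with Chamfer-style matching, where each predicted box is matched to its closest ground-truth box by IoU; the box format and helper names are illustrative:

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def per_object_iou(pred_boxes, gt_boxes):
    """Chamfer-style matching: each prediction is matched to its closest
    ground-truth box, then the matched IoUs are averaged."""
    return sum(max(box_iou(p, g) for g in gt_boxes)
               for p in pred_boxes) / len(pred_boxes)

gt = [(0, 0, 2, 2), (4, 4, 6, 6)]
pred = [(0, 0, 2, 2), (4, 4, 5, 6)]   # second box covers half of its target
assert per_object_iou(pred, gt) == (1.0 + 0.5) / 2
```

Note that, unlike Hungarian matching, Chamfer matching allows several predictions to share one ground-truth box, which is what makes it tolerant of failed disentanglement.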
Figures 31 and 32 show 10 randomly sampled images from the test set, along with the SSLR-SRN and SSLR-MLP latent sets. Figures 36, 37, and 38 each show one sample in more detail. Again, SSLR-SRN disentangled objects and completed occluded parts of the objects reasonably, while SSLR-MLP failed to disentangle objects.



The naive CNN considered in the main paper uses the same image encoder as the SSLR models. It then computes the squared difference of the embeddings from the two images and transforms it to the final scalar output through a linear layer.
For the SSLR models, the set generators are the same as for the Circles Dataset reconstruction task. The set function is designed to solve the matching problem with Hungarian matching. Given the generated sets for the two input images, we use a linear layer to decompose each element embedding into two five-dimensional embeddings: one intended to capture location information and one intended to capture attribute information (e.g., circle color and radius). We compute a Hungarian matching of the set elements using the attribute embeddings. Finally, the prediction is the sum of squared distances between the location embeddings under this matching.
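A sketch of the matching-based prediction: elements are matched by attribute embeddings, and the output is the sum of squared distances between the matched location embeddings. For clarity, this uses brute-force search over permutations instead of the Hungarian algorithm, which is only viable for small sets; all names and data here are illustrative:

```python
import numpy as np
from itertools import permutations

def match_and_score(loc_a, attr_a, loc_b, attr_b):
    """Match set elements by attribute similarity, then sum squared
    distances between the matched location embeddings.

    Brute-force matching over permutations is used for clarity; the paper
    uses the Hungarian algorithm, which scales to larger sets.
    """
    n = len(attr_a)
    best = min(
        permutations(range(n)),
        key=lambda p: sum(np.sum((attr_a[i] - attr_b[p[i]]) ** 2)
                          for i in range(n)),
    )
    return sum(float(np.sum((loc_a[i] - loc_b[best[i]]) ** 2))
               for i in range(n))

# Two elements whose attributes identify them regardless of order.
attr_a = np.array([[1.0, 0.0], [0.0, 1.0]])
attr_b = np.array([[0.0, 1.0], [1.0, 0.0]])   # same attributes, swapped order
loc_a = np.array([[0.0, 0.0], [3.0, 0.0]])
loc_b = np.array([[4.0, 0.0], [0.0, 1.0]])

# Element 0 of A matches element 1 of B (squared distance 1), and element 1
# matches element 0 (squared distance 1), so the score is 2.
assert match_and_score(loc_a, attr_a, loc_b, attr_b) == 2.0
```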
The training loss is the sum of the L2 distance error, the Hungarian matching cost, and (if used) the reconstruction loss. We used RMSprop with a learning rate of 5e-4 for all SSLR models. For the naive CNN, we used a learning rate of 1e-5. For SSLR-SRN with the extra reconstruction loss, we added 1e-4 weight decay. We trained all models until convergence or overfitting (around 60 epochs) and picked the best models. The inner optimization loop uses gradient descent with step size 0.1 and momentum 0.5 for a fixed number of iterations.

All datasets used in this paper will be released together with our code, which will also contain the code and references for generating the datasets.