Set Flow: A Permutation Invariant Normalizing Flow

09/06/2019 · Kashif Rasul et al. · Zalando

We present a generative model that is defined on finite sets of exchangeable, potentially high dimensional, data. As the architecture is an extension of Real NVP, it inherits all of its favorable properties, such as being invertible and allowing for exact log-likelihood evaluation. We show that this architecture is able to learn distributions over finite non-i.i.d. sets, learn statistical dependencies between entities of a set, and train and sample with variable set sizes in a computationally efficient manner. Experiments on 3D point clouds show state-of-the-art likelihoods.


1 Introduction

Most machine learning research concerns itself either with modeling independent and identically distributed (i.i.d.) data, or with modeling a full joint probability over a number of variables, i.e. $p(x_1, \dots, x_n)$. However, some data is conceptually best represented as a finite unordered set: e.g. point clouds of objects, voice data from a given speaker, or documents as bags-of-words. This is why there has been growing interest in set modeling, typically via composition of element-wise operations with permutation invariant reduction operations, such as averaging as in Zaheer et al. (2017) or taking the maximum as in Qi et al. (2017), which introduces a bottleneck in what information about the set can be extracted.

Formally, any finite joint probability distribution over $n$ exchangeable random variables $x_1, \dots, x_n$ (called "entities" from now on) must fulfill the following requirement for all permutations $\pi$:

$$p(x_1, \dots, x_n) = p(x_{\pi(1)}, \dots, x_{\pi(n)}). \quad (1)$$

It has been shown that finite exchangeable distributions can be written as a signed mixture of i.i.d. distributions Kerns and Székely (2006). Note that this differs from de Finetti's theorem, which is a similar statement for exchangeable processes, i.e. infinite sequences of random variables, and states that in this case the distribution is a mixture of i.i.d. processes, $p(x_1, \dots, x_n) = \int \prod_{i=1}^{n} p(x_i \mid \theta)\, dP(\theta)$, with $P$ a probability distribution. To illustrate this difference, take a distribution of two exchangeable random variables that sum to a fixed number (here commutativity ensures exchangeability): sampling of the two numbers cannot be written as conditionally independent given a $\theta$. Recent generative models build on top of de Finetti's result Korshunova et al. (2018); Bender et al. (2019), and hence assume an underlying infinite sequence of exchangeable variables.

In this work we present an architecture to explicitly model finite exchangeable data. In other words, we are concerned with data composed of sets which are sampled i.i.d. from our underlying data distribution, but where each set is a finite exchangeable set, i.e. possibly non-i.i.d. data samples in some arbitrary order. We develop a density estimation model that is permutation invariant and is able to model dependencies between the entities in this setting. We call the resulting architecture Set Flow, as it builds on ideas of normalizing flows, in particular compositions of bijections like Real NVP Dinh et al. (2017), and combines these ideas with set models Zaheer et al. (2017).

The paper is structured as follows. In Section 2 we review background concepts, and Section 3 discusses related work. Section 4 describes our model and how it is trained. In Section 5 we present experiments on synthetic and real data, and finally we conclude in Section 6 with a brief summary and a discussion of future directions.

2 Background

2.1 Sets

One straightforward approach to learning a set function is to treat the input as a sequence and train an RNN, augmented with all possible input permutations, in the hope that the RNN becomes invariant to the input order. This approach might be robust for small sequences, but for set sizes in the thousands it becomes hard to scale. Also, as described in Vinyals et al. (2016), the order of the sequence does matter and cannot be discarded.

A recently proposed neural network architecture which is invariant to the order of its inputs is the Deep Set architecture Zaheer et al. (2017). The key idea of this approach is to map each input to a learned feature representation, on which a pooling operation is performed (e.g. a sum), which then is passed through another function. With $2^{\mathcal{X}}$ being the set of all sets over $\mathcal{X}$ and $S \in 2^{\mathcal{X}}$ being a set, the deep set function $f$ can be written as $f(S) = \rho\big(\sum_{x \in S} \phi(x)\big)$, where $\rho$ and $\phi$ are chosen as neural networks.
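As a concrete illustration, the following is a minimal sketch of such a permutation-invariant Deep Set function in PyTorch; the layer sizes and the sum pooling are illustrative choices, not the configuration used later in the paper.

```python
import torch.nn as nn

class DeepSet(nn.Module):
    """Permutation-invariant set function f(S) = rho(sum_{x in S} phi(x))."""

    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        # phi: per-entity feature extractor, applied independently to each element
        self.phi = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, hidden_dim))
        # rho: acts on the pooled, order-independent representation
        self.rho = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, out_dim))

    def forward(self, x):
        # x: (batch, set_size, in_dim); sum pooling over the set dimension
        pooled = self.phi(x).sum(dim=1)
        return self.rho(pooled)
```

Since the sum over the set dimension is commutative, permuting the entities of a set leaves the output unchanged.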

Recent methods like Janossy pooling Murphy et al. (2019) express a permutation invariant function as the average of a permutation variant function applied to all reorderings of the input sequence, which allows the layer to leverage complicated permutation variant functions to construct permutation invariant ones. This is computationally demanding, but can be done in a tractable fashion via approximation of the ordering or via random permutations. One can also train a permutation optimization module that learns a canonical ordering Zhang et al. (2019) to permute a set and then use it in a permutation invariant fashion, typically by processing it via an RNN.

2.2 Density Estimation via Normalizing Flows

Real NVP Dinh et al. (2017) is a type of normalizing flow Tabak and Turner (2013) where densities $p_X$ in the input space $\mathcal{X}$ are transformed into some simple distribution space $\mathcal{Z}$, like an isotropic Gaussian, via a mapping $f: \mathcal{X} \to \mathcal{Z}$ composed of stacks of bijections or invertible mappings, with the property that the inverse $f^{-1}$ is easy to evaluate and the Jacobian determinant can be computed efficiently. Due to the change of variables formula we can evaluate $p_X(\mathbf{x})$ via the Gaussian density $p_Z$ by

$$p_X(\mathbf{x}) = p_Z\big(f(\mathbf{x})\big) \left| \det \left( \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}} \right) \right|. \quad (2)$$

The bijection introduced by Real NVP, called the coupling layer, satisfies the above two properties. It leaves part of its inputs unchanged and transforms the other part via functions of the un-transformed variables:

$$\mathbf{y}^{1:d} = \mathbf{x}^{1:d}, \qquad \mathbf{y}^{d+1:D} = \mathbf{x}^{d+1:D} \odot \exp\big(s(\mathbf{x}^{1:d})\big) + t(\mathbf{x}^{1:d}),$$

where $\odot$ is an element-wise product, $s$ is a scaling and $t$ a translation function from $\mathbb{R}^{d} \to \mathbb{R}^{D-d}$, given by neural networks. To model a complex nonlinear density map $f(\mathbf{x})$, a number of coupling layers $f_1, \dots, f_K$ are composed together, while alternating the dimensions which are unchanged and transformed. Via the change of variables formula the probability density function (PDF) of the flow given a data point $\mathbf{x}$ can be written as

$$\log p_X(\mathbf{x}) = \log p_Z(\mathbf{z}) + \sum_{i=1}^{K} \log \left| \det \left( \frac{\partial f_i}{\partial \mathbf{x}_{i-1}} \right) \right|, \quad (3)$$

where $\mathbf{z} = f_K \circ \cdots \circ f_1(\mathbf{x})$ and $\mathbf{x}_{i-1}$ denotes the output of the $(i-1)$-th coupling layer, with $\mathbf{x}_0 = \mathbf{x}$.

Note that the Jacobian of the Real NVP coupling layer is a block-triangular matrix, and thus the log-determinant simply becomes

$$\log \left| \det \left( \frac{\partial \mathbf{y}}{\partial \mathbf{x}} \right) \right| = \mathrm{sum}\Big( \log \big| \mathrm{diag}\big( \exp\big( s(\mathbf{x}^{1:d}) \big) \big) \big| \Big) = \mathrm{sum}\big( s(\mathbf{x}^{1:d}) \big), \quad (4)$$

where $\mathrm{sum}$ is the sum over all the vector elements, $\log$ is the element-wise logarithm and $\mathrm{diag}$ is the diagonal of the Jacobian. This model, parameterized by the weights of the scaling and translation neural networks $s$ and $t$, is then trained via stochastic gradient descent (SGD) on training data points, where for each batch the log likelihood given by (3) is maximized. One can trivially condition the PDF on some additional information $\mathbf{h}$ to model $p_X(\mathbf{x} \mid \mathbf{h})$ by concatenating $\mathbf{h}$ to the inputs of the scaling and translation function approximators, i.e. $s(\mathrm{concat}(\mathbf{x}^{1:d}, \mathbf{h}))$ and $t(\mathrm{concat}(\mathbf{x}^{1:d}, \mathbf{h}))$, which are modified to map $\mathbb{R}^{d + |\mathbf{h}|} \to \mathbb{R}^{D-d}$. This does not change the log-determinant of the coupling layers given by (4).
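For concreteness, here is a minimal sketch of an affine coupling layer with optional conditioning, in the spirit of (4); the hidden sizes and the fixed half/half split of the dimensions are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Real NVP-style coupling layer: keep the first d dimensions fixed and
    affinely transform the rest, optionally conditioned on extra features h."""

    def __init__(self, dim, cond_dim=0, hidden=64):
        super().__init__()
        self.d = dim // 2
        in_dim = self.d + cond_dim            # untouched half plus optional conditioning
        out_dim = dim - self.d
        self.scale = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, out_dim), nn.Tanh())
        self.translate = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, out_dim))

    def forward(self, x, h=None):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        inp = x1 if h is None else torch.cat([x1, h], dim=-1)
        s, t = self.scale(inp), self.translate(inp)
        y2 = x2 * torch.exp(s) + t
        log_det = s.sum(dim=-1)               # log|det J| = sum(s), cf. (4)
        return torch.cat([x1, y2], dim=-1), log_det

    def inverse(self, y, h=None):
        y1, y2 = y[:, :self.d], y[:, self.d:]
        inp = y1 if h is None else torch.cat([y1, h], dim=-1)
        s, t = self.scale(inp), self.translate(inp)
        return torch.cat([y1, (y2 - t) * torch.exp(-s)], dim=-1)
```

The inverse only requires evaluating $s$ and $t$ on the untouched half, which is what makes the bijection cheap to invert.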

In practice Batch Normalization Ioffe and Szegedy (2015) is applied, as a bijection, to the outputs of coupling layers to stabilize the training of the normalizing flow. This bijection implements the normalization procedure using a weighted moving average of the layer's mean and standard deviation values, which differ depending on whether we are training or doing inference.

3 Related Work

The Real NVP approach is generalized by the Masked Autoregressive Flow (MAF) Papamakarios et al. (2017), which models the random numbers used in each stack to generate data. Glow Kingma and Dhariwal (2018) augments Real NVP by the addition of an invertible 1×1 convolution, as well as by removing other components, thus simplifying the overall architecture to obtain qualitatively better samples for high dimensional data like images.

The BRUNO model Korshunova et al. (2018) performs exact Bayesian inference on sets of data such that the joint distribution over observations is permutation invariant, in an autoregressive fashion: new samples can be generated conditional on previous ones, and a stream of new data points can be easily incorporated at test time. This is easily possible for our method as well, where the network architecture is considerably simpler as it only draws upon ideas from normalizing flows. BRUNO, on the other hand, makes use of Student-$t$ processes, i.e. Bayesian models of real-valued functions admitting closed-form marginal likelihood and posterior predictive expressions Shah et al. (2014). The main issue with this building block is that inference typically scales cubically in the number of data points, although the Woodbury matrix inversion lemma can be used to alleviate this issue in the streaming data setting.

Similar to BRUNO, the PILET model Bender et al. (2019) utilizes an autoregressive model, built upon normalizing flow ideas Oliva et al. (2018) instead of Student-$t$ processes. This is combined with a permutation equivariant function to capture interdependence of entities in a set while maintaining exchangeability. The authors extend their method to make use of a latent code in an exchangeable variational autoencoder framework called PILET-VAE. Note that both BRUNO and PILET transform base distributions by applying bijections along the entity dimension.

Bayesian Sets Ghahramani and Heller (2006) also models exchangeable sets of binary features, but it is not reversible and therefore does not allow sampling.

4 Set Flow

In order to make a model invariant to input permutations, one can try to sort the input into some canonical order. While sorting is a very simple solution, for high dimensional points the ordering is in general not stable with respect to point perturbations and thus does not fully resolve the issue. This makes it hard for a model to learn a consistent mapping, even if we constrain the model to a fixed set size.

We propose a normalizing flow architecture called Set Flow in which each stack transforms each entity of the set conditioned on a shared global Gaussian noise vector, after which this noise vector is transformed via a symmetric function of all the transformed elements of the set, for example via a Deep Set Zaheer et al. (2017) layer.

Figure 1: Schematic of a single Set Flow stack, where a set of entities $\{\mathbf{x}_1^l, \dots, \mathbf{x}_M^l\}$, with $\mathbf{x}_i^l \in \mathbb{R}^D$, and a global Gaussian noise vector $\mathbf{z}^l$ are transformed via (5). See text for a detailed description.

Figure 1 shows a single Set Flow stack, which maps its input from layer $l$ to the next stack $l+1$. The block takes a set of entities $\{\mathbf{x}_1^l, \dots, \mathbf{x}_M^l\}$, where $\mathbf{x}_i^l \in \mathbb{R}^D$, and a global Gaussian noise vector $\mathbf{z}^l$, and transforms them to $\big(\{\mathbf{x}_i^{l+1}\}, \mathbf{z}^{l+1}\big)$ given by:

$$\mathbf{x}_i^{l+1} = g_l\big(\mathbf{x}_i^l; \mathbf{z}^l\big), \qquad \mathbf{z}^{l+1} = \mathbf{z}^l \odot \exp\Big( s_l\big(\phi_l(\{\mathbf{x}_i^{l+1}\})\big) \Big) + t_l\big(\phi_l(\{\mathbf{x}_i^{l+1}\})\big), \quad (5)$$

where $\phi_l$ is a permutation invariant function given via a Deep Set, $s_l$ and $t_l$ are deep neural network function approximators, and $g_l$ is a standard Real NVP conditioned on $\mathbf{z}^l$. All of these functions are layer specific and do not share weights across layers. By stacking such Set-Coupling layers we arrive at our Set Flow model. As one can see from the construction, this mapping is permutation equivariant due to the Deep Set layer and invertible via the bijections.

The inverse transformation starts by sampling a global noise vector as well as a set of the desired number of Gaussian sample entities, and then going through the flow model in reverse (or from top to bottom in Figure 1).
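The following is a rough sketch of a single Set Flow stack, reusing the DeepSet and AffineCoupling sketches from Section 2; the exact parameterization of the noise update is our reading of (5), not necessarily the authors' implementation.

```python
import torch
import torch.nn as nn

class SetFlowStack(nn.Module):
    """One Set Flow stack: entities are transformed by a coupling conditioned on
    the global noise z, then z is updated from a permutation-invariant summary
    of the transformed entities (reuses DeepSet and AffineCoupling from above)."""

    def __init__(self, entity_dim, z_dim, pool_dim, hidden=64):
        super().__init__()
        self.entity_coupling = AffineCoupling(entity_dim, cond_dim=z_dim, hidden=hidden)
        self.pool = DeepSet(entity_dim, hidden, pool_dim)
        self.z_scale = nn.Sequential(nn.Linear(pool_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, z_dim), nn.Tanh())
        self.z_translate = nn.Sequential(nn.Linear(pool_dim, hidden), nn.ReLU(),
                                         nn.Linear(hidden, z_dim))

    def forward(self, x, z):
        # x: (batch, set_size, entity_dim), z: (batch, z_dim)
        B, M, D = x.shape
        z_rep = z.unsqueeze(1).expand(B, M, z.size(-1)).reshape(B * M, -1)
        y, log_det = self.entity_coupling(x.reshape(B * M, D), z_rep)
        y = y.reshape(B, M, D)
        log_det = log_det.reshape(B, M).sum(dim=1)   # sum over entities
        summary = self.pool(y)                       # permutation invariant summary
        s, t = self.z_scale(summary), self.z_translate(summary)
        z_new = z * torch.exp(s) + t                 # affine update of the global noise
        return y, z_new, log_det + s.sum(dim=-1)
```

Because every entity is transformed by the same coupling conditioned on the same $\mathbf{z}^l$, and the noise update only sees the set through a Deep Set summary, the stack is permutation equivariant in the entities.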

As in the Real NVP case, we can also condition this model on some embedding $\mathbf{h}$ for each set of entities, by concatenating $\mathbf{h}$ to the inputs of the scaling and translation networks in (5), to obtain a set-conditioned model, for example when the entities of a set come from a particular category.

4.1 Training

We train the model by sampling batches where for each batch the size of the set is fixed, and construct $B$ sets where each set has $M$ entities as well as a global noise vector. We use Adam Kingma and Ba (2015) with standard parameters to maximize the log likelihood

$$\mathcal{L}(\theta) = \sum_{b=1}^{B} \log p_\theta\big(\mathbf{x}_{b,1}, \dots, \mathbf{x}_{b,M}, \mathbf{z}_b\big), \quad (6)$$

where for each term above (2) is employed to explicitly evaluate the likelihoods and calculate their derivatives with respect to $\theta$, which denotes all parameters of the Set Flow model.
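A hedged sketch of one such training step, assuming a `set_flow` model that maps a batch of sets and a global noise vector to base-space variables and returns the accumulated log-determinant (the `set_flow` interface and its `z_dim` attribute are illustrative assumptions, not the released code):

```python
import math
import torch

def training_step(set_flow, optimizer, x):
    """One optimization step on a batch of sets x of shape (B, M, D)."""
    B = x.size(0)
    z0 = torch.randn(B, set_flow.z_dim)     # sample the global noise vector (z_dim assumed)
    u, v, log_det = set_flow(x, z0)         # map (x, z0) to the base space
    # joint log-likelihood under standard-normal base distributions plus log-det, cf. (2) and (6)
    log_p = -0.5 * (u.pow(2).sum(dim=(1, 2)) + v.pow(2).sum(dim=1))
    log_p = log_p - 0.5 * math.log(2 * math.pi) * (u[0].numel() + v.size(-1))
    loss = -(log_p + log_det).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```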

Note that we use the same choice for the global noise distribution in all our experiments. As we are interested in the likelihoods of the sets, we subtract the entropy of the Gaussian over the global variables (their maximum likelihood solution) from the calculated likelihoods in (6) when reporting the test set likelihoods.
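For reference, the differential entropy of a univariate Gaussian with variance $\sigma^2$, which is the quantity entering this correction, is

$$H = \tfrac{1}{2} \log\big(2 \pi e \sigma^2\big),$$

so for isotropic global variables the correction scales linearly with their dimension.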

5 Experiments

Our first goal in the experiments is to demonstrate and analyze the ability of the proposed model to capture non-i.i.d. dependencies within finite sets. In a second set of experiments, we show that the model scales to much larger and more complex datasets by learning 3D point clouds.

5.1 Generation of Non-i.i.d. Exchangeable Data Sets

In order to best understand the ability of the model to capture dependencies between entities, we generate a toy dataset of finite sets with a non-i.i.d. structure: equidistant 2D points on circles with varying radius and position. The generative process of each set is as follows: first, the center position $(c_x, c_y)$, the radius $r$ and a rotation offset $\phi$ are sampled uniformly. Then $M$ points are generated with coordinates $x_i = c_x + (r + \epsilon_i^r)\cos(\phi + 2\pi i / M + \epsilon_i^\phi)$ and $y_i = c_y + (r + \epsilon_i^r)\sin(\phi + 2\pi i / M + \epsilon_i^\phi)$, where $i = 1, \dots, M$, with independent radial noise $\epsilon_i^r$ and angular noise $\epsilon_i^\phi$.
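A short sketch of this generative process; the uniform ranges and noise scales here are placeholders, since the exact values are not reproduced above.

```python
import numpy as np

def sample_circle_set(m, rng=None):
    """Sample one set of m roughly equidistant points on a random circle.
    The uniform ranges and noise scales below are placeholders, not the
    values used in the paper."""
    rng = np.random.default_rng() if rng is None else rng
    cx, cy = rng.uniform(-1.0, 1.0, size=2)           # placeholder center range
    r = rng.uniform(0.5, 1.5)                         # placeholder radius range
    phase = rng.uniform(0.0, 2 * np.pi)               # random rotation offset
    angles = phase + 2 * np.pi * np.arange(m) / m     # equidistant angles
    angles = angles + rng.normal(0.0, 0.01, size=m)   # angular noise (placeholder scale)
    radii = r + rng.normal(0.0, 0.01, size=m)         # radial noise (placeholder scale)
    return np.stack([cx + radii * np.cos(angles),
                     cy + radii * np.sin(angles)], axis=1)   # shape (m, 2)
```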

Figure 2 (left) shows sample sets drawn from this generative model, with colors indicating set membership. For the experiment, we trained a model on uniformly sampled set sizes, where each minibatch of sets contained the same set size. The second subfigure from the left shows that early in training the model groups elements of sets together in clusters, but fails to produce discernible circles with equidistant points on them. Later in training, the model reproduces the dataset more faithfully, as can be seen in the second subfigure from the right. The rightmost subfigure in Figure 2 (top) shows the distribution of inferred phases from circles fitted to sampled sets (the mean phase across the set is subtracted for alignment). It can be seen that the model (green) nicely captures the equidistant peaks, similar to the original data (blue). Note that this implies that the model has captured the generative process of the finite set; otherwise there would be more mass between the peaks. The variance, however, is larger than in the ground truth phases. Similarly, the model has a bias towards smaller circles, as can be seen in the distribution of inferred radii.

Figure 2: Non-i.i.d. analysis. The leftmost subfigure shows samples of sets with 2D equidistant entities drawn on circles with random positions and radii. The model initially captures the global position variance in the data (second from left) and later in learning captures the non-i.i.d. equidistance property on the circles (second from right). The rightmost subfigure shows that the model nicely captures equidistant phases on the sampled circles, but has a bias towards smaller circles than in the original dataset.

5.2 3D Point Cloud Experiments

We train Set Flow on point clouds of the Airplane and Chair classes of the ModelNet40 Wu et al. (2015) dataset, where for each model we sample random points from its point cloud of 10,000 points to construct a set of the chosen class. We split the model files into a training and test set via an 80% split. We train the model on the airplane and chair classes separately and report the mean test likelihoods in Table 1. We also show some generated point clouds in Figure 3 for different set sizes.

We also train the model on three classes (airplane, chair and lamp) together. Given two sets, we obtain their noise vectors by passing the sets through our model. We can then linearly interpolate between these two sets and generate samples by passing the interpolated noise, both the global vector and the entity noise, backwards through the flow. Figure 4 shows the results of this experiment for a chair to another chair, a chair to a lamp, and a chair to an airplane.

Finally, we train the model on all 40 classes, both without supplying the class labels and with class labels provided via a class embedding vector. We report the mean test log-likelihoods over each entity in the set in Table 1, together with results from other methods.

We have implemented all the experiments in PyTorch Paszke et al. (2017) and will make the code available after the review process at https://www.github.com/xxx/xxx. We used a single set of hyperparameters (batch size, global noise vector dimension, size of the Deep Set pooling output, size of the conditioning embedding vector, number of Set Flow stacks, number of random entities in a set, and learning rate) for all our experiments.

Model       Airplane   Chair    All      All with labels
PILET-VAE   4.08       2.03     2.13     2.29
PILET
BRUNO
Set Flow    4.1        2.045    2.143    2.311

Table 1: Mean test log-likelihoods for the ModelNet40 Wu et al. (2015) dataset models, with two times the standard error for our method.

Figure 3: Samples with different numbers of entities from Set Flow trained on chair and airplane models separately.



Figure 4: Generated samples from interpolating the noise vectors obtained from two models (left-most and right-most) with Set Flow trained on chair, lamp and airplane models together.

6 Discussion and Conclusions

We have introduced a simple generative architecture for learning and sampling from exchangeable data of finite sets via a normalizing flow built from permutation invariant functions such as Deep Sets. As shown in the experiments, our model captures dependencies between entities of a set in a computationally feasible manner. We demonstrated the capability of the model to capture finite exchange-invariant generative processes on toy data, and demonstrated state-of-the-art performance for generative modeling of 3D point clouds. In principle, the proposed model can be applied to higher dimensional data points, for example sets of images such as those making up an outfit.

In future work we will further explore alternative architectures for these models, utilize them to learn on sets of images, and investigate whether these methods can be used to learn correlations in time series data across a large number of entities.

References