In this paper, we study the problem of designing objective functions for machine learning problems defined on finite sets. In contrast to traditional objective functions defined for machine learning problems operating on finite dimensional vectors, the new objective functions we propose operate on finite sets and are invariant to permutations. Such problems are widespread, ranging from estimation of population statistics Poczos et al. (2013), via anomaly detection in piezometer data of embankment dams Jung et al. (2015), to cosmology Ntampaka et al. (2016); Ravanbakhsh et al. (2016a). Our main theorem characterizes the permutation invariant objective functions and provides a family of functions to which any permutation invariant objective function must belong. This family of functions has a special structure which enables us to design a deep network architecture that can operate on sets and which can be deployed in a variety of scenarios including both unsupervised and supervised learning tasks. We demonstrate the applicability of our method on population statistic estimation, point cloud classification, set expansion, and image tagging.
A typical machine learning algorithm, like regression or classification, is designed for fixed dimensional data instances. Extending such algorithms to handle inputs or outputs that are permutation invariant sets rather than fixed dimensional vectors is not trivial, and researchers have only recently started to investigate such extensions Oliva et al. (2013); Szabo et al. (2016); Muandet et al. (2013, 2012). In this paper, we present a generic framework to deal with the setting where input and possibly output instances in a machine learning task are sets.
Similar to fixed dimensional data instances, we can characterize two learning paradigms in case of sets. In supervised learning, we have an output label for a set that is invariant or equivariant to the permutation of set elements. Examples include tasks like estimation of population statistics Poczos et al. (2013), where applications range from gigascale cosmology Ntampaka et al. (2016); Ravanbakhsh et al. (2016a) to nanoscale quantum chemistry Faber et al. (2016).
Next, there can be the unsupervised setting, where the “set” structure needs to be learned, e.g. by leveraging the homophily/heterophily tendencies within sets. An example is the task of set expansion (a.k.a. audience expansion), where given a set of objects that are similar to each other (e.g. set of words {lion, tiger, leopard}), our goal is to find new objects from a large pool of candidates such that the selected new objects are similar to the query set (e.g. find words like jaguar or cheetah among all English words). This is a standard problem in similarity search and metric learning, and a typical application is to find new image tags given a small set of possible tags. Likewise, in the field of computational advertisement, given a set of highvalue customers, the goal would be to find similar people. This is an important problem in many scientific applications, e.g. given a small set of interesting celestial objects, astrophysicists might want to find similar ones in large sky surveys.
In this paper, (i) we propose a fundamental architecture, DeepSets, to deal with sets as inputs and show that the properties of this architecture are both necessary and sufficient (Sec. 2). (ii) We extend this architecture to allow for conditioning on arbitrary objects, and (iii) based on this architecture we develop a deep network that can operate on sets with possibly different sizes (Sec. 3). We show that a simple parameter-sharing scheme enables a general treatment of sets within supervised and semi-supervised settings. (iv) Finally, we demonstrate the wide applicability of our framework through experiments on diverse problems (Sec. 4).
A function $f$ transforms its domain $\mathcal{X}$ into its range $\mathcal{Y}$. Usually, the input domain is a vector space $\mathbb{R}^d$ and the output response range is either a discrete space, e.g. $\{0, 1\}$ in case of classification, or a continuous space $\mathbb{R}$ in case of regression. Now, if the input is a set $X = \{x_1, \ldots, x_M\}$, $x_m \in \mathfrak{X}$, i.e., the input domain is the power set $\mathcal{X} = 2^{\mathfrak{X}}$, then we would like the response of the function to be “indifferent” to the ordering of the elements. In other words, a function acting on sets must be permutation invariant to the order of objects in the set, i.e. for any permutation $\pi$: $f(\{x_1, \ldots, x_M\}) = f(\{x_{\pi(1)}, \ldots, x_{\pi(M)}\})$.
In the supervised setting, given $N$ examples $X^{(1)}, \ldots, X^{(N)}$ as well as their labels $y^{(1)}, \ldots, y^{(N)}$, the task would be to classify/regress (with a variable number of predictors) while being permutation invariant w.r.t. the predictors. Under the unsupervised setting, the task would be to assign high scores to valid sets and low scores to improbable sets. These scores can then be used for set expansion tasks, such as image tagging or audience expansion in the field of computational advertisement. In the transductive setting, each instance $x_m^{(n)}$ has an associated label $y_m^{(n)}$. Then, the objective would be instead to learn a permutation equivariant function $\mathbf{f}: \mathfrak{X}^M \to \mathcal{Y}^M$ that upon permutation of the input instances permutes the output labels, i.e. for any permutation $\pi$:
$$\mathbf{f}([x_{\pi(1)}, \ldots, x_{\pi(M)}]) = [f_{\pi(1)}(\mathbf{x}), \ldots, f_{\pi(M)}(\mathbf{x})] \qquad (1)$$
We want to study the structure of functions on sets. Their study in total generality is extremely difficult, so we analyze them case by case. We begin by analyzing the invariant case when $\mathfrak{X}$ is a countable set and $\mathcal{Y} = \mathbb{R}$, where the next theorem characterizes its structure.
Theorem. A function $f(X)$ operating on a set $X$ having elements from a countable universe is a valid set function, i.e., invariant to the permutation of instances in $X$, iff it can be decomposed in the form $\rho\left(\sum_{x \in X} \phi(x)\right)$, for suitable transformations $\phi$ and $\rho$.
For the extension to the case when $\mathfrak{X}$ is uncountable, like $\mathfrak{X} = \mathbb{R}$, we could only prove that $f(X) = \rho\left(\sum_{x \in X} \phi(x)\right)$ holds for sets of fixed size. The proofs and the difficulties in handling the uncountable case are discussed in Appendix A. However, we still conjecture that exact equality holds in general.
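Next, we analyze the equivariant case when $\mathfrak{X} = \mathcal{Y} = \mathbb{R}$ and $\mathbf{f}$ is restricted to be a neural network layer. The standard neural network layer is represented as $\mathbf{f}_\Theta(\mathbf{x}) = \sigma(\Theta \mathbf{x})$, where $\Theta \in \mathbb{R}^{M \times M}$ is the weight matrix and $\sigma: \mathbb{R} \to \mathbb{R}$ is a nonlinearity such as the sigmoid function. The following lemma states the necessary and sufficient conditions for permutation equivariance in this type of function.
Lemma. The function $\mathbf{f}_\Theta: \mathbb{R}^M \to \mathbb{R}^M$ defined above is permutation equivariant iff all the off-diagonal elements of $\Theta$ are tied together and all the diagonal elements are equal as well. That is, $\Theta = \lambda I + \gamma (\mathbf{1}\mathbf{1}^T)$ with $\lambda, \gamma \in \mathbb{R}$, where $\mathbf{1} = [1, \ldots, 1]^T \in \mathbb{R}^M$ and $I \in \mathbb{R}^{M \times M}$ is the identity matrix. This result can be easily extended to higher dimensions, i.e., when the set elements are vectors, $\lambda$ and $\gamma$ can be matrices.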
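The tied-weight condition of the lemma can be checked numerically. The following is a minimal sketch, not code from the paper: it assumes the weight matrix has the form λI + γ(11ᵀ), uses a sigmoid nonlinearity, and the helper names `tied_weights` and `layer` are hypothetical. It compares the layer output on a permuted input against the permuted output, for a tied matrix and for an unconstrained random matrix.

```python
import numpy as np

def tied_weights(M, lam, gamma):
    """Build Theta = lam * I + gamma * (1 1^T) for a set of size M (assumed form)."""
    return lam * np.eye(M) + gamma * np.ones((M, M))

def layer(theta, x):
    """Standard layer sigma(Theta x) with a sigmoid nonlinearity."""
    return 1.0 / (1.0 + np.exp(-(theta @ x)))

rng = np.random.default_rng(0)
M = 5
x = rng.normal(size=M)
perm = rng.permutation(M)

theta_tied = tied_weights(M, lam=0.7, gamma=-0.2)
theta_free = rng.normal(size=(M, M))  # unconstrained weights, for contrast

# Equivariance: permuting the input should permute the output the same way.
tied_is_equivariant = np.allclose(layer(theta_tied, x[perm]),
                                  layer(theta_tied, x)[perm])
free_is_equivariant = np.allclose(layer(theta_free, x[perm]),
                                  layer(theta_free, x)[perm])
print(tied_is_equivariant, free_is_equivariant)
```

With tied weights the product reduces to λx plus γ times the sum of the inputs, which is why the permutation commutes with the layer; a generic dense matrix has no such structure.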
The general form of Theorem 2.2 is closely related with important results in different domains. Here, we quickly review some of these connections.
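A related concept is that of an exchangeable model in Bayesian statistics. It is backed by de Finetti’s theorem, which states that any exchangeable model can be factored as
$$p(X \mid \alpha, M_0) = \int \Big[\prod_{m=1}^{M} p(x_m \mid \theta)\Big]\, p(\theta \mid \alpha, M_0)\, d\theta, \qquad (2)$$
where $\theta$ is some latent feature and $\alpha, M_0$ are the hyperparameters of the prior. To see that this fits into our result, let us consider exponential families with conjugate priors, where we can analytically calculate the integral of (2). In this special case $p(x \mid \theta) = \exp\left(\langle \phi(x), \theta \rangle - g(\theta)\right)$ and $p(\theta \mid \alpha, M_0) = \exp\left(\langle \theta, \alpha \rangle - M_0\, g(\theta) - h(\alpha, M_0)\right)$. Now if we marginalize out $\theta$, we get a form which looks exactly like the one in Theorem 2.2:
$$p(X \mid \alpha, M_0) = \exp\Big(h\Big(\alpha + \sum_{m} \phi(x_m),\; M_0 + M\Big) - h(\alpha, M_0)\Big). \qquad (3)$$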
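The marginalization can be spelled out step by step. This is a sketch under an assumed parametrization: the likelihood is $p(x \mid \theta) = \exp(\langle \phi(x), \theta \rangle - g(\theta))$, the conjugate prior is $p(\theta \mid \alpha, M_0) = \exp(\langle \theta, \alpha \rangle - M_0\, g(\theta) - h(\alpha, M_0))$, and $h$ denotes the log-normalizer of the prior, so that $\int \exp(\langle \theta, a \rangle - m\, g(\theta))\, d\theta = \exp(h(a, m))$.

```latex
\begin{align*}
p(X \mid \alpha, M_0)
 &= \int \Big[\prod_{m=1}^{M} p(x_m \mid \theta)\Big]\, p(\theta \mid \alpha, M_0)\, d\theta \\
 &= \int \exp\Big( \Big\langle \theta,\ \alpha + \sum_{m=1}^{M} \phi(x_m) \Big\rangle
      - (M_0 + M)\, g(\theta) - h(\alpha, M_0) \Big)\, d\theta \\
 &= \exp\Big( h\Big(\alpha + \sum_{m=1}^{M} \phi(x_m),\ M_0 + M\Big) - h(\alpha, M_0) \Big).
\end{align*}
```

The data enter only through the sum of the sufficient statistics $\phi(x_m)$ and the set size $M$, which is the sum-then-transform structure of the invariant decomposition.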
Support distribution machines use $f(p) = \sum_i \alpha_i y_i K(p_i, p)$ as the prediction function Muandet et al. (2012); Poczos et al. (2012), where $p_i, p$ are distributions and $\alpha_i \in \mathbb{R}$. In practice, the distributions are never given to us explicitly; usually only i.i.d. sample sets are available from these distributions, and therefore we need to estimate the kernel $K(p, q)$ using these samples. A popular approach is to use $\hat{K}(p, q) = \frac{1}{M M'} \sum_{i,j} k(x_i, y_j)$, where $k$ is another kernel operating on the samples $\{x_i\}_{i=1}^{M} \sim p$ and $\{y_j\}_{j=1}^{M'} \sim q$. Now, these prediction functions can be seen to fit into the structure of our Theorem.
A consequence of the polynomial decomposition is that spectral methods Anandkumar et al. (2012) can be viewed as a special case of the mapping $\rho \circ \phi(X)$: in that case one can compute polynomials, usually only up to a relatively low degree (such as $k = 3$), to perform inference about statistical properties of the distribution. The statistics are exchangeable in the data, hence they could be represented by the above map.
Invariant model. The structure of permutation invariant functions in Theorem 2.2 hints at a general strategy for inference over sets of objects, which we call DeepSets. Replacing $\phi$ and $\rho$ by universal approximators leaves matters unchanged, since, in particular, $\phi$ and $\rho$ can be used to approximate arbitrary polynomials. Then, it remains to learn these approximators, yielding the following model:
Each instance $x_m$ is transformed (possibly by several layers) into some representation $\phi(x_m)$.
The representations $\phi(x_m)$ are added up and the output is processed using the $\rho$ network in the same manner as in any deep network (e.g. fully connected layers, nonlinearities, etc.).
Optionally: If we have additional meta-information $z$, then the above mentioned networks could be conditioned to obtain the conditioning mapping $\phi(x_m \mid z)$.
In other words, the key is to add up all representations and then apply nonlinear transformations.
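The recipe of transforming each element, summing, and then transforming the sum can be sketched in a few lines of NumPy. This is an illustrative sketch only: the layer sizes, the random untrained weights, and the helper names (`init_mlp`, `mlp`, `deepsets_invariant`) are assumptions, not the architecture used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """Random weights for a small fully connected network (illustrative only)."""
    return [(rng.normal(scale=0.5, size=(a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def mlp(params, h):
    """Apply linear layers with ReLU in between (no ReLU after the last layer)."""
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)
    return h

phi_params = init_mlp([3, 16, 8])   # element-wise encoder (plays the role of phi)
rho_params = init_mlp([8, 16, 1])   # set-level network (plays the role of rho)

def deepsets_invariant(X):
    """X has shape (set_size, 3); returns a single output for the whole set."""
    pooled = mlp(phi_params, X).sum(axis=0)   # sum-pool the element representations
    return mlp(rho_params, pooled)

X = rng.normal(size=(7, 3))
out = deepsets_invariant(X)
out_shuffled = deepsets_invariant(X[rng.permutation(7)])  # same set, new order
out_smaller = deepsets_invariant(X[:4])                   # a set of different size
print(out, out_shuffled, out_smaller)
```

Because the pooling is a sum over the set axis, the same parameters handle sets of any size, and reordering the rows of `X` leaves the output unchanged up to floating-point rounding.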
Equivariant model. Our goal is to design neural network layers that are equivariant to the permutations of elements in the input $\mathbf{x}$. Based on Lemma 2.2, a neural network layer $\mathbf{f}_\Theta(\mathbf{x})$ is permutation equivariant if and only if all the off-diagonal elements of $\Theta$ are tied together and all the diagonal elements are equal as well, i.e., $\Theta = \lambda I + \gamma (\mathbf{1}\mathbf{1}^T)$ for $\lambda, \gamma \in \mathbb{R}$, where $\mathbf{1} = [1, \ldots, 1]^T \in \mathbb{R}^M$ and $I \in \mathbb{R}^{M \times M}$ is the identity matrix. This function is simply a nonlinearity applied to a weighted combination of (i) its input $I\mathbf{x}$ and (ii) the sum of input values $(\mathbf{1}\mathbf{1}^T)\mathbf{x}$. Since summation does not depend on the permutation, the layer is permutation equivariant. We can further manipulate the operations and parameters in this layer to get other variations, e.g.:
$$\mathbf{f}(\mathbf{x}) = \sigma\left(\lambda I \mathbf{x} + \gamma\, \mathrm{maxpool}(\mathbf{x})\, \mathbf{1}\right), \qquad (4)$$
where the max-pooling operation over elements of the set (similar to sum) is commutative. In practice, this variation performs better in some applications. This may be due to the fact that for $\gamma = -\lambda$, the input to the nonlinearity is max-normalized. Since the composition of permutation equivariant functions is also permutation equivariant, we can build DeepSets by stacking such layers.
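Both the sum-based layer and its max-pooling variation can be sketched for scalar set elements. This is a minimal sketch under stated assumptions: the nonlinearity is taken to be tanh, the parameter values are arbitrary, and the function names are hypothetical. The last lines illustrate the max-normalization remark for the case where γ equals −λ.

```python
import numpy as np

def equivariant_sum(x, lam, gamma):
    """sigma(lam * x + gamma * sum(x)); x has shape (M,), tanh is an assumed sigma."""
    return np.tanh(lam * x + gamma * np.sum(x))

def equivariant_max(x, lam, gamma):
    """Max-pooling variation: sigma(lam * x + gamma * max(x))."""
    return np.tanh(lam * x + gamma * np.max(x))

def stack(x, layers):
    """Compose several equivariant layers; the composition stays equivariant."""
    for layer_fn, lam, gamma in layers:
        x = layer_fn(x, lam, gamma)
    return x

layers = [(equivariant_sum, 0.8, -0.1),
          (equivariant_max, 1.5, -1.5),
          (equivariant_sum, 0.5, 0.2)]

rng = np.random.default_rng(1)
x = rng.normal(size=6)
perm = rng.permutation(6)

stack_is_equivariant = np.allclose(stack(x[perm], layers),
                                   stack(x, layers)[perm])

# With gamma = -lam the pre-activation is lam * (x - max(x)): never positive,
# and exactly zero at the largest element ("max-normalized").
pre_activation = 1.5 * x - 1.5 * np.max(x)
print(stack_is_equivariant, pre_activation.max())
```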
Several recent works study equivariance and invariance in deep networks w.r.t. general groups of transformations Gens and Domingos (2014); Cohen and Welling (2016); Ravanbakhsh et al. (2017). For example, Chen et al. (2014) construct deep permutation invariant features by pairwise coupling of features at the previous layer, where $f(x_1, x_2) = [\,|x_1 - x_2|,\; x_1 + x_2\,]$ is invariant to transposition of $x_1$ and $x_2$. Pairwise interactions within sets have also been studied in Chang et al. (2016); Guttenberg et al. (2016). Vinyals et al. (2015) approach unordered instances by finding “good” orderings.
The idea of pooling a function across set members is not new. In Lopez-Paz et al. (2016), pooling was used for a binary classification task for causality on a set of samples. Shi et al. (2015) use pooling across a panoramic projection of a 3D object for classification, while Su et al. (2015) perform pooling across multiple views. Hartford et al. (2016) observe the invariance of the payoff matrix in normal form games to the permutation of its rows and columns (i.e. player actions) and leverage pooling to predict the player action. The need for permutation equivariance also arises in deep learning over sensor networks and multi-agent settings, where a special case of Lemma 2.2 has been used as the architecture Sukhbaatar et al. (2016).
In light of these related works, we would like to emphasize our novel contributions: (i) the universality result of Theorem 2.2 for permutation invariance, which also relates DeepSets to other machine learning techniques, see Sec. 3; (ii) the permutation equivariant layer of (4), which, according to Lemma 2.2, identifies the necessary and sufficient form of parameter sharing in a standard neural layer; and (iii) novel application settings that we study next.
We present a diverse set of applications for DeepSets. For the supervised setting, we apply DeepSets to estimation of population statistics, sum of digits and classification of point clouds, and regression with clustering side information. The permutation-equivariant variation of DeepSets is applied to the task of outlier detection. Finally, we investigate the application of DeepSets to unsupervised set expansion, in particular, concept-set retrieval and image tagging. In most cases we compare our approach with the state of the art and report competitive results.




In the first experiment, we learn entropy and mutual information of Gaussian distributions, without providing any information about Gaussianity to DeepSets. The Gaussians are generated as follows:
Rotation: We randomly chose a covariance matrix , and then generated sample sets from of size for random values of . Our goal was to learn the entropy of the marginal distribution of first dimension. is the rotation matrix.
Correlation: We randomly chose a covariance matrix for , and then generated sample sets from of size for random values of . Goal was to learn the mutual information of among the first and last dimension.
Rank 1: We randomly chose and then generated a sample sets from of size for random values of . Goal was to learn the mutual information.
Random: We chose random covariance matrices for , and using each, generated a sample set from of size . Goal was to learn the mutual information.
We train using an $L_2$ loss with a DeepSets architecture having 3 fully connected layers with ReLU activation for both transformations $\phi$ and $\rho$. We compare against Support Distribution Machines (SDM) using an RBF kernel Poczos et al. (2012), and analyze the results in Fig. 1.
Next, we examine what happens if our set data is treated as a sequence. We consider the task of finding the sum of a given set of digits. We consider two variants of this experiment:
We randomly sample a subset of maximum digits from this dataset to build “sets” of training images, where the setlabel is sum of digits in that set. We test against sums of digits, for starting from 5 all the way up to 100 over another examples.
MNIST8m Loosli et al. (2007) contains 8 million instances of greyscale stamps of digits in . We randomly sample a subset of maximum images from this dataset to build “sets” of training and sets of test images, where the setlabel is the sum of digits in that set (i.e. individual labels per image is unavailable). We test against sums of images of MNIST digits, for starting from 5 all the way up to 50.
We compare against recurrent neural networks: LSTM and GRU. All models are defined to have a similar number of layers and parameters. The output of all models is a scalar, predicting the sum of $N$ digits. Training is done on tasks of length 10 at most, while at test time we use examples of length up to 100. The accuracy, i.e. exact equality after rounding, is shown in Fig. 2. DeepSets generalize much better. Note that for the image case, the best classification error for a single digit is around $p = 0.01$ for MNIST8m, so in a collection of $N$ images the probability that at least one image is misclassified is $1 - (1 - p)^N$, which is 40% for $N = 50$. This matches closely with the observed value in Fig. 2(b).
Model  |  Instance Size  |  Representation  |  Accuracy
3DShapeNets Wu et al. (2015)  |    |  voxels (using convolutional deep belief net)  |  
VoxNet Maturana and Scherer (2015)  |    |  voxels (voxels from pointcloud + 3D CNN)  |  
MVCNN Su et al. (2015)  |    |  multi-view images (2D CNN + view pooling)  |  
VRN Ensemble Brock et al. (2016)  |    |  voxels (3D CNN, variational autoencoder)  |  
3D GAN Wu et al. (2016)  |    |  voxels (3D CNN, generative adversarial training)  |  
DeepSets  |    |  pointcloud  |  
DeepSets  |    |  pointcloud  |  
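The back-of-the-envelope estimate used in the digit-sum experiment, namely the chance that at least one image in a set is misclassified, can be reproduced directly. This is a small sketch that assumes independent errors with a per-image error rate of 0.01 and a set of 50 images; both values are assumptions chosen to be consistent with the roughly 40% figure quoted for that experiment, and the function name is hypothetical.

```python
def prob_at_least_one_error(p, n):
    """Probability that at least one of n independent predictions is wrong,
    given a per-item error probability p: 1 - (1 - p)**n."""
    return 1.0 - (1.0 - p) ** n

rate = prob_at_least_one_error(p=0.01, n=50)
print(round(rate, 3))
```

Under these assumptions the estimate is about 0.395, i.e. close to 40%, so an exact-sum accuracy near 60% at that set length would be in line with a per-digit classifier of this quality.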
A point cloud is a set of low-dimensional vectors. This type of data is frequently encountered in various applications like robotics, vision, and cosmology. In these applications, existing methods often convert the point-cloud data to a voxel or mesh representation as a preprocessing step, e.g. Maturana and Scherer (2015); Ravanbakhsh et al. (2016b); Lin et al. (2004). Since the output of many range sensors, such as LiDAR, is in the form of a point cloud, direct application of deep learning methods to point clouds is highly desirable. Moreover, it is easier and cheaper to apply transformations, such as rotation and translation, when working with point clouds than with voxelized 3D objects.
As point-cloud data is just a set of points, we can use DeepSets to classify the point-cloud representation of a subset of ShapeNet objects Chang et al. (2015), called ModelNet40 Wu et al. (2015). This subset consists of 3D representations of 9,843 training and 2,468 test instances belonging to 40 classes of objects. We produce point clouds with 100, 1000 and 5000 particles each ($x, y, z$ coordinates) from the mesh representation of objects using the point-cloud library’s sampling routine Rusu and Cousins (2011). Each set is normalized by the initial layer of the deep network to have zero mean (along individual axes) and unit (global) variance. Tab. 1 compares our method using three permutation equivariant layers against the competition; see Appendix H for details.
An important regression problem in cosmology is to estimate the redshift of galaxies, corresponding to their age as well as their distance from us Binney and Merrifield (1998), based on photometric observations. One way to estimate the redshift from photometric observations is using a regression model Connolly et al. (1995) on the galaxy clusters. The prediction for each galaxy does not change by permuting the members of the galaxy cluster. Therefore, we can treat each galaxy cluster as a “set” and use DeepSets to estimate the individual galaxy redshifts. See Appendix G for more details.
Method  Scatter 

MLP  0.026 
redMaPPer  0.025 
DeepSets  0.023 
For each galaxy, we have photometric features from the redMaPPer galaxy cluster catalog Rozo and Rykoff (2014) that contains photometric readings for 26,111 red galaxy clusters. Each galaxycluster in this catalog has between galaxies – i.e. , where is the clustersize. The catalog also provides accurate spectroscopic redshift estimates for a subset of these galaxies.
We randomly split the data into 90% training and 10% test clusters, and minimize the squared loss of the prediction for available spectroscopic redshifts. As is customary in the cosmology literature, we report the average scatter $\frac{|z_{\mathrm{spec}} - z|}{1 + z_{\mathrm{spec}}}$, where $z_{\mathrm{spec}}$ is the accurate spectroscopic measurement and $z$ is a photometric estimate, in Tab. 2.
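For concreteness, the scatter metric can be computed as follows. This is a sketch assuming the common photometric-redshift definition, the mean of |z_spec − z| / (1 + z_spec) over galaxies with spectroscopic measurements; the function name and the toy redshift values are hypothetical and not taken from the catalog.

```python
import numpy as np

def average_scatter(z_spec, z_photo):
    """Mean of |z_spec - z_photo| / (1 + z_spec) over all galaxies."""
    z_spec = np.asarray(z_spec, dtype=float)
    z_photo = np.asarray(z_photo, dtype=float)
    return float(np.mean(np.abs(z_spec - z_photo) / (1.0 + z_spec)))

# Toy values, purely for illustration.
z_spec = [0.10, 0.30]
z_photo = [0.12, 0.27]
scatter = average_scatter(z_spec, z_photo)
print(scatter)
```

Dividing by 1 + z_spec makes the error relative to the stretch factor of the observed wavelengths, so errors at different redshifts are comparable.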
In the set expansion task, we are given a set of objects that are similar to each other and our goal is to find new objects from a large pool of candidates such that the selected new objects are similar to the query set. To achieve this one needs to reason out the concept connecting the given set and then retrieve words based on their relevance to the inferred concept. It is an important task due to wide range of potential applications including personalized information retrieval, computational advertisement, tagging large amounts of unlabeled or weakly labeled datasets.
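Going back to de Finetti’s theorem in Sec. 3.2, where we consider the marginal probability of a set of observations, the marginal probability allows for a very simple metric for scoring additional elements to be added to $X$. In other words, this allows one to perform set expansion via the following score:
$$s(x \mid X) = \log p(X \cup \{x\} \mid \alpha) - \log \big[ p(X \mid \alpha)\, p(\{x\} \mid \alpha) \big]. \qquad (5)$$
Note that $s(x \mid X)$ is the pointwise mutual information between $x$ and $X$. Moreover, due to exchangeability, it follows that regardless of the order of elements we have
$$S(X) = \sum_{m} s\big(x_m \mid \{x_{m-1}, \ldots, x_1\}\big) = \log p(X \mid \alpha) - \sum_{m=1}^{M} \log p(\{x_m\} \mid \alpha). \qquad (6)$$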
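The order-independence of the aggregate score follows from a telescoping sum. The derivation below is a sketch that assumes the score of adding $x$ to $X$ is the pointwise mutual information $s(x \mid X) = \log p(X \cup \{x\} \mid \alpha) - \log p(X \mid \alpha) - \log p(\{x\} \mid \alpha)$, with the convention $\log p(\emptyset \mid \alpha) = 0$; the symbols are assumed notation.

```latex
\begin{align*}
S(X) &= \sum_{m=1}^{M} s\big(x_m \mid \{x_{m-1}, \dots, x_1\}\big) \\
     &= \sum_{m=1}^{M} \Big[ \log p(\{x_1, \dots, x_m\} \mid \alpha)
        - \log p(\{x_1, \dots, x_{m-1}\} \mid \alpha)
        - \log p(\{x_m\} \mid \alpha) \Big] \\
     &= \log p(X \mid \alpha) - \sum_{m=1}^{M} \log p(\{x_m\} \mid \alpha).
\end{align*}
```

The intermediate joint terms cancel in pairs, and the final expression depends on $X$ only as a set, since the marginal $p(X \mid \alpha)$ is exchangeable.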
When inferring sets, our goal is to find set completions for an initial set of query terms, such that the aggregate set is coherent. This is the key idea of the Bayesian Set algorithm Ghahramani and Heller (2005) (details in Appendix D). Using DeepSets, we can solve this problem in more generality as we can drop the assumption that the data belong to a certain exponential family.
For learning the score $s(x \mid X)$, we take recourse to large-margin classification with structured loss functions Taskar et al. (2004) to obtain the relative loss objective $l(x, x' \mid X) = \max\big(0,\; s(x' \mid X) - s(x \mid X) + \Delta(x, x')\big)$. In other words, we want to ensure that $s(x \mid X) \geq s(x' \mid X) + \Delta(x, x')$ whenever $x$ should be added and $x'$ should not be added to $X$.
Conditioning. Often machine learning problems do not exist in isolation. For example, a task like tag completion from a given set of tags is usually related to an object $z$, for example an image, that needs to be tagged. Such metadata are usually abundant, e.g. author information in the case of text, contextual data such as the user click history, or extra information collected with a LiDAR point cloud.
Conditioning graphical models with metadata is often complicated. For instance, in the Beta-Binomial model we need to ensure that the counts are always nonnegative, regardless of the metadata. Fortunately, DeepSets does not suffer from such complications and the fusion of multiple sources of data can be done in a relatively straightforward manner. Any of the existing methods in deep learning, including feature concatenation by averaging, or by max-pooling, can be employed. Incorporating these metadata often leads to significantly improved performance, as will be shown in the experiments; Sec. 4.2.2.
Method  |  LDA (Vocab = )  |  LDA (Vocab = )  |  LDA (Vocab = )
  |  Recall@10 (%)  |  Recall@100 (%)  |  Recall@1k (%)  |  MRR  |  Med.  |  Recall@10 (%)  |  Recall@100 (%)  |  Recall@1k (%)  |  MRR  |  Med.  |  Recall@10 (%)  |  Recall@100 (%)  |  Recall@1k (%)  |  MRR  |  Med.
Random  0.06  0.6  5.9  0.001  8520  0.02  0.2  2.6  0.000  28635  0.01  0.2  1.6  0.000  30600 
Bayes Set  1.69  11.9  37.2  0.007  2848  2.01  14.5  36.5  0.008  3234  1.75  12.5  34.5  0.007  3590 
w2v Near  6.00  28.1  54.7  0.021  641  4.80  21.2  43.2  0.016  2054  4.03  16.7  35.2  0.013  6900 
NNmax  4.78  22.5  53.1  0.023  779  5.30  24.9  54.8  0.025  672  4.72  21.4  47.0  0.022  1320 
NNsumcon  4.58  19.8  48.5  0.021  1110  5.81  27.2  60.0  0.027  453  4.87  23.5  53.9  0.022  731 
NNmaxcon  3.36  16.9  46.6  0.018  1250  5.61  25.7  57.5  0.026  570  4.72  22.0  51.8  0.022  877 
DeepSets  5.53  24.2  54.3  0.025  696  6.04  28.5  60.7  0.027  426  5.54  26.1  55.5  0.026  616 
In text concept set retrieval, the objective is to retrieve words belonging to a ‘concept’ or ‘cluster’, given few words from that particular concept. For example, given the set of words {tiger, lion, cheetah}, we would need to retrieve other related words like jaguar, puma, etc, which belong to the same concept of big cats. This task of concept set retrieval can be seen as a set completion task conditioned on the latent semantic concept, and therefore our DeepSets form a desirable approach.
We construct a large dataset containing sets of related words by extracting topics from latent Dirichlet allocation Pritchard et al. (2000); Blei et al. (2003), taken out of the box (footnote 1: github.com/dmlc/experimentallda). To compare across scales, we consider three settings, giving us three LDA datasets with different vocabulary sizes.
We learn this using a margin loss with a DeepSets architecture having 3 fully connected layers with ReLU activation for both transformations $\phi$ and $\rho$. Details of the architecture and training are in Appendix E. We compare to several baselines: (a) Random picks a word from the vocabulary uniformly at random. (b) Bayes Set Ghahramani and Heller (2005). (c) w2v Near computes the nearest neighbors in the word2vec Mikolov et al. (2013) space. Note that both Bayes Set and w2v Near are strong baselines. The former runs Bayesian inference using the Beta-Binomial conjugate pair, while the latter uses the powerful word2vec embeddings trained on the billion-word GoogleNews corpus (footnote 2: code.google.com/archive/p/word2vec/). (d) NNmax uses a similar architecture as our DeepSets but uses max pooling to compute the set feature, as opposed to sum pooling. (e) NNmaxcon uses max pooling on set elements but concatenates this pooled representation with that of the query for a final set feature. (f) NNsumcon is similar to NNmaxcon but uses sum pooling followed by concatenation with the query representation.
We consider the standard retrieval metrics for evaluation: recall@K, median rank and mean reciprocal rank. To elaborate, recall@K measures the number of true labels that were recovered in the top K retrieved words. We use three values of K. The other two metrics, as the names suggest, are the median of the true label ranks and the mean of their reciprocals, respectively. Each dataset is split into TRAIN, VAL and TEST. We learn models using TRAIN and evaluate on TEST, while VAL is used for hyperparameter selection and early stopping.
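The three retrieval metrics can be computed from the rank of the true label for each query. The following is a minimal sketch under the assumption that each query has a single held-out true label and that recall@K is reported as the fraction of queries whose true label appears in the top K; the function names and the toy ranks are hypothetical.

```python
import numpy as np

def rank_of_true_label(scores, true_index):
    """1-based rank of the true item when candidates are sorted by descending score."""
    scores = np.asarray(scores, dtype=float)
    return int(np.sum(scores > scores[true_index])) + 1

def retrieval_metrics(ranks, ks=(10, 100, 1000)):
    """Recall@K, mean reciprocal rank and median rank from 1-based ranks."""
    ranks = np.asarray(ranks, dtype=float)
    metrics = {f"recall@{k}": float(np.mean(ranks <= k)) for k in ks}
    metrics["mrr"] = float(np.mean(1.0 / ranks))
    metrics["median_rank"] = float(np.median(ranks))
    return metrics

# Toy example: ranks of the true label for four queries.
ranks = [1, 5, 20, 2000]
metrics = retrieval_metrics(ranks)
example_rank = rank_of_true_label([0.1, 0.9, 0.5], true_index=2)
print(metrics, example_rank)
```

Higher recall@K and MRR are better, while a lower median rank is better, which is worth keeping in mind when reading the columns of a results table side by side.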
As seen in Tab. 3: (a) Our DeepSets model outperforms all other approaches on the two larger LDA datasets by any metric, highlighting the significance of the permutation invariance property. (b) On the smallest LDA dataset, our model does not perform as well as w2v Near. We hypothesize that this is because the dataset is too small to train a high-capacity neural network, while w2v Near has been trained on a billion-word corpus. Nevertheless, our approach comes the closest to w2v Near amongst the other approaches, and is only 0.5% lower by Recall@10.
Method  ESP game  IAPRTC12.5  

P  R  F1  N+  P  R  F1  N+  
Least Sq.  35  19  25  215  40  19  26  198 
MBRM  18  19  18  209  24  23  23  223 
JEC  24  19  21  222  29  19  23  211 
FastTag  46  22  30  247  47  26  34  280 
Least Sq.(D)  44  32  37  232  46  30  36  218 
FastTag(D)  44  32  37  229  46  33  38  254 
DeepSets  39  34  36  246  42  31  36  247 
We next experiment with image tagging, where the task is to retrieve all relevant tags corresponding to an image. Images usually have only a subset of relevant tags, therefore predicting other tags can help enrich information that can further be leveraged in a downstream supervised task. In our setup, we learn to predict tags by conditioning DeepSets on the image, i.e., we train to predict a partial set of tags from the image and remaining tags. At test time, we predict tags from the image alone.
We report results on the following three datasets: ESPGame, IAPRTC12.5 and our in-house dataset, COCOTag. We refer the reader to Appendix F for more details about the datasets.
The setup for DeepSets to tag images is similar to that described in Sec. 4.2.1. The only difference is the conditioning on the image features, which are concatenated with the set feature obtained from pooling individual element representations.
We perform comparisons against several baselines, previously reported in Chen et al. (2013). Specifically, we have Least Sq., a ridge regression model, MBRM Feng et al. (2004), JEC Makadia et al. (2008) and FastTag Chen et al. (2013). Note that these methods do not use deep features for images, which could lead to an unfair comparison. As there is no publicly available code for MBRM and JEC, we cannot get the performance of these models with ResNet-extracted features. However, we report results with deep features for FastTag and Least Sq., using code made available by the authors (footnote 3: http://www.cse.wustl.edu/~mchen/).
For ESPgame and IAPRTC12.5, we follow the evaluation metrics of Guillaumin et al. (2009): precision (P), recall (R), F1 score (F1), and number of tags with nonzero recall (N+). These metrics are evaluated for each tag and the mean is reported (see Guillaumin et al. (2009) for further details). For COCOTag, however, we use recall@K for three values of K, along with median rank and mean reciprocal rank (see evaluation in Sec. 4.2.1 for metric details).
Method  |  Recall@10  |  Recall@100  |  Recall@1k  |  MRR  |  Med.
w2v NN (blind)  5.6  20.0  54.2  0.021  823 
DeepSets (blind)  9.0  39.2  71.3  0.044  310 
DeepSets  31.4  73.4  95.3  0.131  28 
Tab. 4 shows results of image tagging on ESPgame and IAPRTC12.5, and Tab. 5 on COCOTag. Here are the key observations from Tab. 4: (a) performance of our DeepSets model is comparable to the best approaches on all metrics but precision, (b) our recall beats the best approach by 2% in ESPgame. On further investigation, we found that the DeepSets model retrieves more relevant tags, which are not present in list of ground truth tags due to a limited tag annotation. Thus, this takes a toll on precision while gaining on recall, yet yielding improvement on F1. On the larger and richer COCOTag, we see that the DeepSets approach outperforms other methods comprehensively, as expected. Qualitative examples are in Appendix F.
The objective here is to find the anomalous face in each set, simply by observing examples and without any access to the attribute values. The CelebA dataset Liu et al. (2015) contains 202,599 face images, each annotated with 40 boolean attributes. We build sets of stamps using these attributes, each containing 16 images (on the training set), as follows: randomly select 2 attributes, draw 15 images having those attributes, and a single target image where both attributes are absent. Using a similar procedure we build sets on the test images. No individual person's face appears in both train and test sets.
Our deep neural network consists of 9 2D-convolution and max-pooling layers followed by 3 permutation-equivariant layers, and finally a softmax layer that assigns a probability value to each set member (note that one could identify an arbitrary number of outliers using a sigmoid activation at the output). Our trained model successfully finds the anomalous face in 75% of test sets. Visually inspecting these instances suggests that the task is nontrivial even for humans; see Fig. 3.
As a baseline, we repeat the same experiment by using a set-pooling layer after the convolution layers, and replacing the permutation-equivariant layers with fully connected layers of the same size, where the final layer is a 16-way softmax. The resulting network shares the convolution filters for all instances within all sets; however, the input to the softmax is not equivariant to the permutation of input images. Permutation equivariance seems to be crucial here, as the baseline model achieves a training and test accuracy no better than random selection (which is $1/16 = 6.25\%$ for sets of 16 images). See Appendix I for more details.
In this paper, we develop DeepSets, a model based on powerful permutation invariance and equivariance properties, along with the theory to support its performance. We demonstrate the generalization ability of DeepSets across several domains by extensive experiments, and show both qualitative and quantitative results. In particular, we explicitly show that DeepSets outperforms other intuitive deep networks, which are not backed by theory (Sec. 4.2.1, Sec. 4.1.2). Last but not least, it is worth noting that the stateoftheart we compare to is a specialized technique for each task, whereas our one model, i.e., DeepSets, is competitive across the board.
Multiview convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 945–953, 2015.
Learning multiagent communication with backpropagation. In Neural Information Processing Systems, pages 2244–2252, 2016.
Training invariant support vector machines using selective sampling. In Léon Bottou, Olivier Chapelle, Dennis DeCoste, and Jason Weston, editors, Large Scale Kernel Machines, pages 301–320. MIT Press, Cambridge, MA., 2007.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.
A function $f$ transforms its domain $\mathcal{X}$ into its range $\mathcal{Y}$. Usually, the input domain is a vector space $\mathbb{R}^d$ and the output response range is either a discrete space, e.g. $\{0, 1\}$ in case of classification, or a continuous space $\mathbb{R}$ in case of regression.
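Now, if the input is a set $X = \{x_1, \ldots, x_M\}$, $x_m \in \mathfrak{X}$, i.e. $\mathcal{X} = 2^{\mathfrak{X}}$, then we would like the response of the function not to depend on the ordering of the elements in the set. In other words,
Property 1 A function $f: 2^{\mathfrak{X}} \to \mathbb{R}$ acting on sets must be permutation invariant to the order of objects in the set, i.e.
$$f(\{x_1, \ldots, x_M\}) = f(\{x_{\pi(1)}, \ldots, x_{\pi(M)}\}) \qquad (7)$$
for any permutation $\pi$.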
Now, roughly speaking, we claim that such functions must have a structure of the form for some functions and . Over the next two sections we try to formally prove this structure of the permutation invariant functions.
Theorem 2 Assume the elements are countable, i.e. . A function operating on a set can be a valid set function, i.e. it is permutation invariant to the elements in , if and only if it can be decomposed in the form , for suitable transformations and .
Proof. Permutation invariance follows from the fact that sets have no particular order, hence any function on a set must not exploit any particular order either. The sufficiency follows by observing that the function $\rho\left(\sum_{x \in X} \phi(x)\right)$ satisfies the permutation invariance condition, because a sum does not depend on the order of its terms.
To prove necessity, i.e. that all functions can be decomposed in this manner, we begin by noting that there must be an injective mapping from the elements to the natural numbers, since the elements are countable. Let this mapping be denoted by $c: \mathfrak{X} \to \mathbb{N}$. Now if we let $\phi(x) = 4^{-c(x)}$ then $\sum_{x \in X} \phi(x)$ constitutes a unique representation for every set $X \in 2^{\mathfrak{X}}$. Now a function $\rho: \mathbb{R} \to \mathbb{R}$ can always be constructed such that $f(X) = \rho\left(\sum_{x \in X} \phi(x)\right)$. ∎
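The necessity argument can be checked on a small example. The sketch below is illustrative only: the five-element universe, the labelling `c`, and the helper names `encode` and `make_rho` are hypothetical, and the inner map is assumed to be $\phi(x) = 4^{-c(x)}$ (any base of at least 2 would also separate distinct finite sets). Exact rationals are used so that uniqueness of the codes is not blurred by floating-point rounding.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical toy universe; c is an arbitrary injective labelling into the naturals.
universe = ["a", "b", "c", "d", "e"]
c = {x: i + 1 for i, x in enumerate(universe)}

def phi(x):
    """Assumed inner map phi(x) = 4^(-c(x)), kept exact with rationals."""
    return Fraction(1, 4 ** c[x])

def encode(subset):
    """Permutation-invariant code: the sum of phi over the elements."""
    return sum((phi(x) for x in subset), Fraction(0))

# Enumerate every subset and index it by its code.
subsets = [frozenset(s) for r in range(len(universe) + 1)
           for s in combinations(universe, r)]
codes = {encode(s): s for s in subsets}

def make_rho(f):
    """Build rho as a lookup table so that rho(encode(X)) == f(X)."""
    table = {encode(s): f(s) for s in subsets}
    return lambda z: table[z]
```

Because every subset receives a distinct code, `make_rho(f)` reproduces any set function `f` on this universe, which is the content of the necessity direction in the finite case.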
The extension to the case when $\mathfrak{X}$ is uncountable, e.g. $\mathfrak{X} = \mathbb{R}$, is not so trivial. We could only prove, in the case of fixed set size, e.g. $\mathfrak{X}^M$ instead of $2^{\mathfrak{X}}$, that any permutation invariant continuous function can be expressed as $\rho\left(\sum_{x \in X} \phi(x)\right)$. Also, we show that there is a universal approximator of the same form. These results are discussed below.
To illustrate the uncountable case, we assume a fixed set size of $M$. Without loss of generality we can let $\mathfrak{X} = [0, 1]$. Then the domain becomes $[0, 1]^M$. Also, to handle ambiguity due to permutation, we often define the domain to be the set $\mathcal{X} = \{(x_1, \ldots, x_M) \in [0, 1]^M : x_1 \le x_2 \le \cdots \le x_M\}$ for some ordering of the elements in $\mathfrak{X}$.
The proof builds on the famous Newton-Girard formulae, which connect moments of a sample set (sum-of-power) to the elementary symmetric polynomials. But first we present some results needed for the proof. The first result establishes that the sum-of-power mapping is injective.
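Lemma 4 Let $\mathcal{X} = \{(x_1, \ldots, x_M) \in [0, 1]^M : x_1 \le x_2 \le \cdots \le x_M\}$. The sum-of-power mapping $E: \mathcal{X} \to \mathbb{R}^{M+1}$ defined by the coordinate functions
$$Z_q = E_q(X) := \sum_{m=1}^{M} (x_m)^q, \quad q = 0, \ldots, M \tag{8}$$
is injective.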
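Proof. Suppose for some $u, v \in \mathcal{X}$, we have $E(u) = E(v)$. We will now show that it must be the case that $u = v$. Construct two polynomials as follows:
$$P_u(x) = \prod_{m=1}^{M} (x - u_m), \qquad P_v(x) = \prod_{m=1}^{M} (x - v_m). \tag{9}$$
If we expand the two polynomials we obtain:
$$P_u(x) = x^M - a_1 x^{M-1} + \cdots + (-1)^{M-1} a_{M-1} x + (-1)^M a_M, \qquad P_v(x) = x^M - b_1 x^{M-1} + \cdots + (-1)^{M-1} b_{M-1} x + (-1)^M b_M, \tag{10}$$
with coefficients being elementary symmetric polynomials in $u$ and $v$ respectively, i.e.
$$a_m = \sum_{1 \le j_1 < j_2 < \cdots < j_m \le M} u_{j_1} u_{j_2} \cdots u_{j_m}, \qquad b_m = \sum_{1 \le j_1 < j_2 < \cdots < j_m \le M} v_{j_1} v_{j_2} \cdots v_{j_m}. \tag{11}$$
These elementary symmetric polynomials can be uniquely expressed as a function of $E(u)$ and $E(v)$ respectively, by the Newton-Girard formula. The $m$th coefficient is given by the determinant of an $m \times m$ matrix having terms from $E(u)$ and $E(v)$ respectively:
$$a_m = \frac{1}{m!} \det \begin{pmatrix} E_1(u) & 1 & 0 & \cdots & 0 \\ E_2(u) & E_1(u) & 2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ E_{m-1}(u) & E_{m-2}(u) & \cdots & E_1(u) & m-1 \\ E_m(u) & E_{m-1}(u) & \cdots & E_2(u) & E_1(u) \end{pmatrix}, \qquad b_m \text{ likewise with } E_q(v) \text{ in place of } E_q(u). \tag{12}$$
Since we assumed $E(u) = E(v)$, it follows that $[a_1, \ldots, a_M] = [b_1, \ldots, b_M]$, which in turn implies that the polynomials $P_u$ and $P_v$ are the same. Therefore, their roots must be the same and, since both $u$ and $v$ are listed in sorted order, $u = v$. ∎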
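The injectivity argument is constructive, and a small numerical sketch can make it concrete. The helper names below (`power_sums`, `elementary_from_power_sums`, `invert_power_sums`) are hypothetical, and the code uses the recursive form of the Newton-Girard identities rather than the determinant form; both give the same elementary symmetric polynomials. Root finding in floating point is ill-conditioned for repeated or nearly equal elements, so this is an illustration and not a robust inverse.

```python
import numpy as np

def power_sums(x):
    """E_q(x) = sum_m x_m**q for q = 1..M (E_0 = M carries no extra information)."""
    x = np.asarray(x, dtype=float)
    return np.array([np.sum(x ** q) for q in range(1, len(x) + 1)])

def elementary_from_power_sums(p):
    """Newton-Girard recursion: m * e_m = sum_{k=1}^{m} (-1)^(k-1) * e_{m-k} * p_k."""
    M = len(p)
    e = np.zeros(M + 1)
    e[0] = 1.0
    for m in range(1, M + 1):
        acc = 0.0
        for k in range(1, m + 1):
            acc += (-1) ** (k - 1) * e[m - k] * p[k - 1]
        e[m] = acc / m
    return e

def invert_power_sums(p):
    """Recover the sorted point set from its power sums via polynomial roots."""
    e = elementary_from_power_sums(p)
    # P(x) = x^M - e_1 x^(M-1) + ... + (-1)^M e_M, highest degree first.
    coeffs = [(-1) ** m * e[m] for m in range(len(e))]
    return np.sort(np.roots(coeffs).real)
```

Calling `invert_power_sums(power_sums(x))` returns the elements of `x` in sorted order regardless of the order in which they were supplied, mirroring the statement that the sum-of-power mapping is injective on the sorted domain.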
The second result, which we borrow from Ćurgus and Mascioni [2006], establishes a homeomorphism between coefficients and roots of a polynomial.
Theorem 5 Ćurgus and Mascioni [2006] The function $f$, which associates every $a \in \mathbb{C}^M$ to the multiset of roots, $f(a)$, of the monic polynomial formed using $a$ as the coefficients, i.e. $x^M - a_1 x^{M-1} + \cdots + (-1)^{M-1} a_{M-1} x + (-1)^M a_M$, is a homeomorphism.
Among other things, this implies that the (complex) roots of a polynomial depend continuously on the coefficients. We will use this fact for our next lemma.
Finally, we establish a continuous inverse mapping for the sum-of-power function.
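Lemma 6 Let $\mathcal{X} = \{(x_1, \ldots, x_M) \in [0, 1]^M : x_1 \le x_2 \le \cdots \le x_M\}$. We define the sum-of-power mapping $E: \mathcal{X} \to \mathcal{Z}$ by the coordinate functions
$$Z_q = E_q(X) := \sum_{m=1}^{M} (x_m)^q, \quad q = 0, \ldots, M \tag{13}$$
where $\mathcal{Z}$ is the range of the function. The function $E$ has a continuous inverse mapping.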
Proof. First of all note that $\mathcal{Z}$, the range of $E$, is a compact set. This follows from the following observations:
The domain of $E$ is a closed and bounded polytope (i.e. a compact set),
$E$ is a continuous function, and
the image of a compact set under a continuous function is a compact set.
To show the continuity of the inverse mapping, we establish a connection to the continuous dependence of the roots of a polynomial on its coefficients.
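As in Lemma 4, for any $u \in \mathcal{X}$, let $z = E(u)$ and construct the polynomial:
$$P_u(x) = \prod_{m=1}^{M} (x - u_m). \tag{14}$$
If we expand the polynomial we obtain:
$$P_u(x) = x^M - a_1 x^{M-1} + \cdots + (-1)^{M-1} a_{M-1} x + (-1)^M a_M, \tag{15}$$
with coefficients being elementary symmetric polynomials in $u$, i.e.
$$a_m = \sum_{1 \le j_1 < j_2 < \cdots < j_m \le M} u_{j_1} u_{j_2} \cdots u_{j_m}. \tag{16}$$
These elementary symmetric polynomials can be uniquely expressed as a function of $z$ by the Newton-Girard formula:
$$a_m = \frac{1}{m!} \det \begin{pmatrix} z_1 & 1 & 0 & \cdots & 0 \\ z_2 & z_1 & 2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ z_{m-1} & z_{m-2} & \cdots & z_1 & m-1 \\ z_m & z_{m-1} & \cdots & z_2 & z_1 \end{pmatrix}. \tag{17}$$
Since determinants are just polynomials, $a$ is a continuous function of $z$. Thus to show continuity of the inverse mapping of $E$, it remains to show continuity from $a$ back to the roots $u$. In this regard, we invoke Theorem 5. Note that a homeomorphism implies that the mapping as well as its inverse is continuous. Thus, restricting to the compact set $\mathcal{Z}$ where the map from coefficients to roots only goes to the reals, the desired result follows. To explicitly check the continuity, note that the limit of $E^{-1}(z)$, as $z$ approaches $z^*$ from inside $\mathcal{Z}$, always exists and is equal to $E^{-1}(z^*)$ since it does so in the complex plane. ∎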
With the lemmas developed above, we are in a position to tackle the main theorem.
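Theorem 7 A function $f: [0, 1]^M \to \mathbb{R}$ is a permutation invariant continuous function iff it has the representation
$$f(x_1, \ldots, x_M) = \rho\left(\sum_{m=1}^{M} \phi(x_m)\right) \tag{18}$$
for some continuous outer and inner functions $\rho: \mathbb{R}^{M+1} \to \mathbb{R}$ and $\phi: \mathbb{R} \to \mathbb{R}^{M+1}$ respectively. The inner function $\phi$ is independent of the function $f$.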
Proof. The sufficiency follows by observing that the function $\rho\left(\sum_{m=1}^{M} \phi(x_m)\right)$ satisfies the permutation invariance condition.
To prove necessity, i.e. that all permutation invariant continuous functions over the compact set can be expressed in this manner, we divide the proof into two parts, with outline in Fig. 4. We begin by looking at the continuous embedding formed by the inner function: $E(X) = \sum_{m=1}^{M} \phi(x_m)$. Consider $\phi: [0, 1] \to \mathbb{R}^{M+1}$ defined as $\phi(x) = [1, x, x^2, \ldots, x^M]$. Now as $E$ is a polynomial, the image of $[0, 1]^M$ in $\mathbb{R}^{M+1}$ under $E$ is a compact set as well; denote it by $\mathcal{Z}$. Then by definition, the embedding $E: [0, 1]^M \to \mathcal{Z}$ is surjective. Using Lemmas 4 and 6, we know that upon restricting the permutations, i.e. replacing $[0, 1]^M$ with $\mathcal{X}$, the embedding $E: \mathcal{X} \to \mathcal{Z}$ is injective with a continuous inverse. Therefore, combining these observations, we get that $E$ is a homeomorphism between $\mathcal{X}$ and $\mathcal{Z}$. Now it remains to show that we can map the embedding to the desired target value, i.e. to show the existence of the continuous map $\rho: \mathcal{Z} \to \mathbb{R}$ such that $\rho(E(X)) = f(X)$. In particular consider the map $\rho(z) = f(E^{-1}(z))$. The continuity of $\rho$ follows directly from the fact that a composition of continuous functions is continuous. Therefore we can always find continuous functions $\phi$ and $\rho$ to express any permutation invariant function $f$ as $\rho\left(\sum_{m=1}^{M} \phi(x_m)\right)$. ∎
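As a worked illustration of the construction $\rho = f \circ E^{-1}$, consider the smallest nontrivial case $M = 2$ with $f(x_1, x_2) = \max(x_1, x_2)$. The derivation below is a sketch under the assumption that the inner function is the power map $\phi(x) = [1, x, x^2]$; the names $z_0, z_1, z_2$ for the coordinates of the embedding are introduced here for illustration.

```latex
\begin{align*}
E(x_1, x_2) &= \phi(x_1) + \phi(x_2) = [\,2,\; x_1 + x_2,\; x_1^2 + x_2^2\,] =: [z_0, z_1, z_2], \\
(x_1 - x_2)^2 &= 2(x_1^2 + x_2^2) - (x_1 + x_2)^2 = 2 z_2 - z_1^2 \;\ge\; 0, \\
E^{-1}(z) &= \left( \tfrac{1}{2}\bigl(z_1 - \sqrt{2 z_2 - z_1^2}\bigr),\; \tfrac{1}{2}\bigl(z_1 + \sqrt{2 z_2 - z_1^2}\bigr) \right), \\
\rho(z) &= f\bigl(E^{-1}(z)\bigr) = \tfrac{1}{2}\bigl(z_1 + \sqrt{2 z_2 - z_1^2}\bigr).
\end{align*}
```

Here $\rho$ is continuous on the image of $E$ because the quantity under the square root is nonnegative there, and $\rho(\phi(x_1) + \phi(x_2))$ returns the larger of the two inputs in either order.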
A very similar but more general result holds in the case of any continuous function (not necessarily permutation invariant). The result is known as the Kolmogorov–Arnold representation theorem [Khesin and Tabachnikov, 2014, Chap. 17], which we state below:
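Theorem 8 (Kolmogorov–Arnold representation) A function $f: [0, 1]^M \to \mathbb{R}$ is an arbitrary multivariate continuous function iff it has the representation
$$f(x_1, \ldots, x_M) = \rho\left(\sum_{m=1}^{M} \lambda_m \phi(x_m)\right) \tag{19}$$
with continuous outer and inner functions $\rho: \mathbb{R}^{2M+1} \to \mathbb{R}$ and $\phi: \mathbb{R} \to \mathbb{R}^{2M+1}$. The inner function $\phi$ is independent of the function $f$.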
This theorem essentially states a representation theorem for any multivariate continuous function. The representation is very similar to the one we proved, except for the dependence of the inner transformation on the coordinate through $\lambda_m$. Thus it is reassuring that behind all the beautiful mathematics something intuitive is happening. If the function is permutation invariant, this dependence of the inner transformation on the coordinate gets dropped!
Further, we can show that an arbitrarily close approximator of the same form can be obtained for continuous permutation-invariant functions.
Theorem 9 Assume the elements are from a compact set in $\mathbb{R}^d$, i.e. possibly uncountable, and the set size is fixed to $M$. Then any continuous function operating on a set $X$, i.e. $f: \mathbb{R}^{d \times M} \to \mathbb{R}$, which is permutation invariant to the elements in $X$ can be approximated arbitrarily closely in the form of $\rho\left(\sum_{x \in X} \phi(x)\right)$, for suitable transformations $\phi$ and $\rho$.
Proof. Permutation invariance follows from the fact that sets have no particular order, hence any function on a set must not exploit any particular order either. The sufficiency follows by observing that the function $\rho\left(\sum_{x \in X} \phi(x)\right)$ satisfies the permutation invariance condition.
To prove necessity, i.e. that all continuous functions over the compact set can be approximated arbitrarily closely in this manner, we begin by noting that polynomials are universal approximators by the Stone–Weierstrass theorem [Marsden and Hoffman, 1993, sec. 5.7]. In this case the Chevalley-Shephard-Todd (CST) theorem [Bourbaki, 1990, chap. V, theorem 4], or more precisely, its special case, the Fundamental Theorem of Symmetric Functions, states that symmetric polynomials are given by a polynomial of homogeneous symmetric monomials. The latter are given by the sum over monomial terms, which is all that we need since it implies that all symmetric polynomials can be written in the form required by the theorem. ∎
Finally, we still conjecture that even in the case of sets of all sizes, i.e. when the domain is $2^{[0,1]}$, a representation of the form $f(X) = \rho\left(\sum_{x \in X} \phi(x)\right)$ should exist for all “continuous” permutation invariant functions for some suitable transformations $\rho$ and $\phi$. However, in this case even what a “continuous” function means is not clear, as the space $2^{[0,1]}$ does not have any natural topology. As future work, we want to study this further by defining various topologies, for example using the Fréchet distance as in Ćurgus and Mascioni [2006] or the MMD distance. Our preliminary findings in this regard hint that, using the MMD distance, the conjecture seems to be provable if the representation is allowed to be in $\ell^2$ instead of being finite dimensional. Thus, this direction clearly needs further exploration. We end this section by providing some examples:
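$x_1 x_2 (x_1 + x_2 + 3)$: Consider $\phi(x) = [x, x^2, x^3]$ and $\rho([u, v, w]) = uv - w + 3(u^2 - v)/2$, then $\rho(\phi(x_1) + \phi(x_2))$ is the desired function.
$x_1 x_2 x_3 + x_1 + x_2 + x_3$: Consider $\phi(x) = [x, x^2, x^3]$ and $\rho([u, v, w]) = (u^3 + 2w - 3uv)/6 + u$, then $\rho(\phi(x_1) + \phi(x_2) + \phi(x_3))$ is the desired function.
$\frac{1}{n}(x_1 + x_2 + \cdots + x_n)$: Consider $\phi(x) = [1, x]$ and $\rho([u, v]) = v/u$, then $\rho(\phi(x_1) + \cdots + \phi(x_n))$ is the desired function.
$\max\{x_1, \ldots, x_n\}$: Consider $\phi(x) = [e^{\alpha x}, x e^{\alpha x}]$ and $\rho([u, v]) = v/u$, then as $\alpha \to \infty$ we have $\rho(\phi(x_1) + \cdots + \phi(x_n))$ approaching the desired function.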
Second largest among , Consider and , then as , we have approaching the desired function.
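The polynomial, mean, and maximum examples can be checked numerically. The sketch below assumes the inner and outer functions $\phi(x) = [x, x^2, x^3]$ with $\rho([u,v,w]) = uv - w + 3(u^2 - v)/2$ and $\rho([u,v,w]) = (u^3 + 2w - 3uv)/6 + u$ for the two polynomials, $\phi(x) = [1, x]$ with $\rho([u,v]) = v/u$ for the mean, and $\phi(x) = [e^{\alpha x}, x e^{\alpha x}]$ with the same $\rho$ for the maximum; the helper name `deep_sum` and the variable names are hypothetical. The second-largest example is not covered here.

```python
import numpy as np

def deep_sum(phi, rho, xs):
    """Evaluate rho(sum_m phi(x_m)) for a list of scalars xs."""
    return rho(sum(phi(x) for x in xs))

# Polynomial examples: phi collects the first three powers of x.
phi_poly = lambda x: np.array([x, x**2, x**3])
# Target x1*x2*(x1 + x2 + 3) for two inputs.
rho_a = lambda z: z[0] * z[1] - z[2] + 3 * (z[0]**2 - z[1]) / 2
# Target x1*x2*x3 + x1 + x2 + x3 for three inputs.
rho_b = lambda z: (z[0]**3 + 2 * z[2] - 3 * z[0] * z[1]) / 6 + z[0]

# Mean: phi counts the elements and sums them; rho divides.
phi_mean = lambda x: np.array([1.0, x])
rho_ratio = lambda z: z[1] / z[0]

# Max: an exponentially weighted average, exact only as alpha grows without bound.
def phi_max(alpha):
    return lambda x: np.array([np.exp(alpha * x), x * np.exp(alpha * x)])
```

For instance, `deep_sum(phi_poly, rho_a, [2.0, 5.0])` evaluates $2 \cdot 5 \cdot (2 + 5 + 3)$, and `deep_sum(phi_max(50.0), rho_ratio, xs)` is close to `max(xs)` when the inputs lie in $[0, 1]$ and the largest element is well separated from the rest. Large values of `alpha` combined with large inputs would overflow `np.exp`, so the sketch is only meant for small, bounded inputs.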
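Our goal is to design neural network layers that are equivariant to permutations of elements in the input $\mathbf{x}$. The function $\mathbf{f}: \mathfrak{X}^M \to \mathcal{Y}^M$ is equivariant to the permutation of its inputs iff
$$\mathbf{f}(\pi \mathbf{x}) = \pi \mathbf{f}(\mathbf{x}) \quad \forall \pi \in S_M,$$
where the symmetric group $S_M$ is the set of all permutations of indices $1, \ldots, M$.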
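Consider the standard neural network layer
$$\mathbf{f}_\Theta(\mathbf{x}) = \sigma(\Theta \mathbf{x}) \tag{20}$$
where $\Theta \in \mathbb{R}^{M \times M}$ is the weight matrix and $\sigma: \mathbb{R} \to \mathbb{R}$ is a nonlinearity such as the sigmoid function, applied elementwise. The following lemma states the necessary and sufficient conditions for permutation-equivariance in this type of function.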
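Lemma 3 The function $\mathbf{f}_\Theta: \mathbb{R}^M \to \mathbb{R}^M$ as defined in (20) is permutation equivariant if and only if all the off-diagonal elements of $\Theta$ are tied together and all the diagonal elements are equal as well. That is,
$$\Theta = \lambda \mathbf{I} + \gamma (\mathbf{1}\mathbf{1}^\top), \quad \lambda, \gamma \in \mathbb{R}, \quad \mathbf{1} = [1, \ldots, 1]^\top \in \mathbb{R}^M,$$
where $\mathbf{I} \in \mathbb{R}^{M \times M}$ is the identity matrix.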
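Proof. From the definition of permutation equivariance, $\mathbf{f}_\Theta(\pi \mathbf{x}) = \pi \mathbf{f}_\Theta(\mathbf{x})$, and the definition of $\mathbf{f}_\Theta$ in (20), the condition becomes $\sigma(\Theta \pi \mathbf{x}) = \pi \sigma(\Theta \mathbf{x})$. Since $\sigma$ acts elementwise, $\pi \sigma(\Theta \mathbf{x}) = \sigma(\pi \Theta \mathbf{x})$, so (assuming the sigmoid is a bijection) the condition is equivalent to $\Theta \pi = \pi \Theta$. Therefore we need to show that the necessary and sufficient conditions for the matrix $\Theta \in \mathbb{R}^{M \times M}$ to commute with all permutation matrices $\pi \in S_M$ are those given by the lemma. We prove this in both directions: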
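To see why $\Theta = \lambda \mathbf{I} + \gamma (\mathbf{1}\mathbf{1}^\top)$ commutes with any permutation matrix, first note that commutativity is linear, that is,
$$\Theta_1 \pi = \pi \Theta_1 \;\wedge\; \Theta_2 \pi = \pi \Theta_2 \;\Rightarrow\; (a \Theta_1 + b \Theta_2) \pi = \pi (a \Theta_1 + b \Theta_2).$$
Since both the identity matrix $\mathbf{I}$ and the constant matrix $\mathbf{1}\mathbf{1}^\top$ commute with any permutation matrix, so does their linear combination $\Theta = \lambda \mathbf{I} + \gamma (\mathbf{1}\mathbf{1}^\top)$.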
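We need to show that in a matrix $\Theta$ that commutes with “all” permutation matrices:
All diagonal elements are identical: Let $\pi_{k,l}$ for $1 \le k, l \le M$, $k \ne l$, be a transposition (i.e. a permutation that only swaps two elements). The inverse permutation matrix of $\pi_{k,l}$ is the permutation matrix of $\pi_{l,k} = \pi_{k,l}^\top$. We see that commutativity of $\Theta$ with the transposition $\pi_{k,l}$ implies that $\Theta_{k,k} = \Theta_{l,l}$:
$$\pi_{k,l} \Theta = \Theta \pi_{k,l} \;\Rightarrow\; \pi_{k,l} \Theta \pi_{l,k} = \Theta \;\Rightarrow\; (\pi_{k,l} \Theta \pi_{l,k})_{l,l} = \Theta_{l,l} \;\Rightarrow\; \Theta_{k,k} = \Theta_{l,l}.$$
Therefore, since $\pi$ and $\Theta$ commute for any permutation $\pi$, they also commute for any transposition $\pi_{k,l}$, and therefore $\Theta_{i,i} = \lambda$ for all $i$.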
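All off-diagonal elements are identical: We show that since $\Theta$ commutes with any product of transpositions, any choice of two off-diagonal elements should be identical. Let $(i, j)$ and $(i', j')$ be the indices of two off-diagonal elements (i.e. $i \ne j$ and $i' \ne j'$). Moreover, for now assume $i \ne i'$ and $j \ne j'$. Application of the transposition $\pi_{i,i'}$ on the left, $\pi_{i,i'} \Theta$, swaps the rows $i, i'$ in $\Theta$. Similarly, $\Theta \pi_{j,j'}$ switches the $j$th column with the $j'$th column. From the commutativity property of $\Theta$ and $\pi \in S_M$ we have
$$\pi_{j',j} \pi_{i,i'} \Theta = \Theta \pi_{j',j} \pi_{i,i'} \;\Rightarrow\; \pi_{j',j} \pi_{i,i'} \Theta (\pi_{j',j} \pi_{i,i'})^{-1} = \Theta \;\Rightarrow\; (\pi_{j',j} \pi_{i,i'} \Theta \pi_{i',i} \pi_{j,j'})_{i,j} = \Theta_{i,j} \;\Rightarrow\; \Theta_{i',j'} = \Theta_{i,j},$$
where in the last step we used our assumptions that $i \ne i'$, $j \ne j'$, $i \ne j$ and $i' \ne j'$. In the cases where either $i = i'$ or $j = j'$, we can use the above to show that $\Theta_{i,j} = \Theta_{i'',j''}$ and $\Theta_{i',j'} = \Theta_{i'',j''}$, for some $i'' \ne i, i'$ and $j'' \ne j, j'$, and conclude $\Theta_{i,j} = \Theta_{i',j'}$.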
∎
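The statement of the lemma is easy to probe numerically. The following sketch is a sanity check rather than a proof: it builds the tied weight matrix $\Theta = \lambda \mathbf{I} + \gamma \mathbf{1}\mathbf{1}^\top$, applies the layer $\sigma(\Theta \mathbf{x})$ with a sigmoid nonlinearity, and compares the two sides of the equivariance condition for a given permutation. The function names are hypothetical.

```python
import numpy as np

def equivariant_weights(M, lam, gamma):
    """Tied weight matrix Theta = lam * I + gamma * 1 1^T."""
    return lam * np.eye(M) + gamma * np.ones((M, M))

def layer(theta, x):
    """f(x) = sigmoid(Theta x) for a vector x of M scalars."""
    return 1.0 / (1.0 + np.exp(-(theta @ x)))

def is_equivariant(theta, x, perm):
    """Check f(pi x) == pi f(x) for one permutation given as an index array."""
    return np.allclose(layer(theta, x[perm]), layer(theta, x)[perm])
```

With tied weights the check passes for every permutation, whereas a generic dense matrix typically fails it already for a single transposition. Note that the tied layer has only two free parameters regardless of the set size $M$.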
The structure of permutation invariant functions in Theorem 2.2 hints at a general strategy for inference over sets of objects, which we call deep sets. Replacing $\phi$ and $\rho$ by universal approximators leaves matters unchanged, since, in particular, $\phi$ and $\rho$ can be used to approximate arbitrary polynomials. Then, it remains to learn these approximators. This yields the following model:
Each instance $x_m$ ($1 \le m \le M$) is transformed (possibly by several layers) into some representation $\phi(x_m)$.
The sum of these representations, $\sum_m \phi(x_m)$, is processed using the $\rho$ network in very much the same manner as in any deep network (e.g. fully connected layers, nonlinearities, etc.).
Optionally: If we have additional meta-information $z$, then the above-mentioned networks could be conditioned to obtain the conditioning mapping $\phi(x_m | z)$.
In other words, the key to deep sets is to add up all representations and then apply nonlinear transformations.
The overall model structure is illustrated in Fig. 7.
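The model described above can be sketched in a few lines of NumPy. This is a minimal, untrained illustration and not the implementation used in the experiments: the layer sizes, the tanh nonlinearity, the random initialization, and the names `init_mlp`, `mlp`, and `deep_sets` are all assumptions made for the example, and the optional conditioning on meta-information is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """Random weights for a small fully connected network (untrained)."""
    return [(rng.normal(scale=0.5, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, h):
    """Apply the layers with tanh between them and a linear final layer."""
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.tanh(h)
    return h

phi_params = init_mlp([3, 16, 8])   # per-element encoder phi
rho_params = init_mlp([8, 16, 1])   # set-level decoder rho

def deep_sets(X):
    """X has shape (M, 3): a set of M elements with 3 features each."""
    pooled = mlp(phi_params, X).sum(axis=0)   # add up all element representations
    return mlp(rho_params, pooled)            # nonlinear transformation of the sum
```

Because the only interaction between elements is the sum over the set dimension, the output is unchanged when the rows of `X` are shuffled, and the same parameters apply to sets of any size `M`.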
This architecture has a number of desirable properties in terms of universality and correctness. We assume in the following that the networks we choose are, in principle, universal approximators. That is, we assume that they can represent any functional mapping. This is a well established property (see e.g. Micchelli [1986]
for details in the case of radial basis func