We propose convolutional deep averaging networks (CDANs) for classifying and learning feature representations of datasets containing instances with unordered features, where each feature is considered a tuple composed of one or more values. CDANs accept variable-size input and are invariant to permutations of the input’s order. In addition, as a side-effect of the training process, CDANs learn discriminative, nonlinear embeddings of individual input elements into a space of chosen dimensionality. Contrary to their name, which is inspired by the work of Iyyer et al., CDANs could perhaps be more accurately termed convolutional deep pooling networks, as we also consider the effects of functions other than averaging such as taking element-wise maximums or sums.
We propose CDANs for classifying unordered feature sets. We show that a CDAN with nonlinear embeddings is competitive with and perhaps even superior to recurrent neural networks and known permutation-invariant architectures for classifying instances containing variable-size sets of unordered features. We also find that the type of pooling plays a significant role in determining the efficacy of the network with sum-pooling clearly outperforming max- and average-pooling.
I-B Related Research
Sets, particularly those without an inherent ordering, comprise a class of data for which an obvious deep learning treatment is somewhat elusive. A simple feed-forward neural network such as a multi-layer perceptron (MLP) is insufficient without enormous amounts of data and even more so if the sets are not of constant size. In addition, recurrent neural networks (RNNs) are generally insufficient since the order of the elements may be unreliable or bias the network toward certain spurious or transient patterns. Recently, the deep learning community has begun to explicitly consider architectures specifically made to address the unique challenges posed by sets and other unusually structured data such as graphs and ordered sequences. These architectures usually work by exploiting or preserving symmetries in the data (see Gens and Domingos  or  for general frameworks). In the remainder of the section, we focus on work more directly related to our own.
Iyyer et al. proposed deep averaging networks (DANs) for classifying text from an unordered list of words and showed that this rivaled more complex network architectures for the same task. A DAN is essentially a traditional feed-forward neural network whose main distinguishing feature lies in the nature of its input: the element-wise average of word embeddings in a vector space. Iyyer et al. did not consider learning word embeddings as part of the architecture, instead opting to use a set of predefined embeddings. In addition, only averaging was considered as a means of aggregating the word embeddings.
Hill et al.  considered learning linear embeddings as part of the network architecture and summing instead of averaging the embeddings. The resulting network was cast as an RNN with identity weight matrices and served as a baseline against the article’s primary architectures. We show that linear embeddings are not sufficient for all tasks and indeed are unnecessary with certain pooling operations including averaging and summing.
Richard and Gall develop a neural bag-of-words model that is equivalent to a single-layer-embedding CDAN with average pooling. Each dimension of the embedding is interpreted as the probability of a Gaussian-distributed visual word given the embedded element. Consequently, the embedding is constrained by a softmax output. Richard and Gall do not appear to explicitly treat instances as sets rather than sequences, but their architecture is nevertheless permutation invariant. A specialized layer representing a support vector machine (SVM) with certain types of nonlinear kernels is incorporated after pooling.
Permutation equivariance is closely related to the concept of invariance. Whereas invariance prescribes that the output of a function is unchanged when the input is permuted, equivariance indicates that the output (presumed to be a sequence or set of the same cardinality as the input) is permuted in the same manner as the input. In other words, equivariance dictates that when a function $f$ is given $x = (x_1, \dots, x_n)$ permuted by any $\pi \in S_n$, where $S_n$ is the symmetric group on $n$ symbols, then
$$f(\pi x) = \pi f(x). \tag{1}$$
Note that invariance means that
$$f(\pi x) = f(x). \tag{2}$$
Ravanbakhsh et al. propose a computationally efficient permutation equivariant layer accomplished via a precise pattern of weight sharing. The following equation computes the output of a recommended version of this layer given an $n$ element $d$-dimensional input set represented as a matrix $X \in \mathbb{R}^{n \times d}$,
$$Y = \sigma\left(\mathbf{1}\beta^\top + \left(X - \mathbf{1}x_{\max}^\top\right)\Gamma\right), \tag{3}$$
where $\sigma$ is some nonlinear activation function, $x_{\max}$ is a vector of the column-wise maximum values of $X$, $\Gamma$ is a weight matrix, $\beta$ is a bias, and $\mathbf{1}$ is a vector of ones. Guttenberg et al. also propose a permutation-equivariant layer for dynamics prediction but base their version on applying an arbitrary function to all pairwise combinations of input elements and averaging (pooling) the output, i.e. given inputs $x_1, \dots, x_n$ and a function $f$, the $i$-th index of the output is given by
$$y_i = \frac{1}{n}\sum_{j=1}^{n} f(x_i, x_j). \tag{4}$$
As noted by Ravanbakhsh et al. , permutation invariance can be obtained from a permutation equivariant function by pooling over its output.
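The two permutation-equivariant layers described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the cited authors' code: the function names, the tanh activation, the particular pairwise function `pair_fn`, and the random weights are all hypothetical choices made for the example. The sketch checks numerically that permuting the rows of the input permutes the rows of the output in the same way, and that sum-pooling the output removes the dependence on order.

```python
import numpy as np

def perm_equivariant_layer(X, Gamma, beta, activation=np.tanh):
    """Map an (n, d) set matrix to (n, k); each row is one set element."""
    x_max = X.max(axis=0, keepdims=True)          # (1, d) column-wise maxima
    return activation(beta + (X - x_max) @ Gamma)

def pairwise_layer(X, f):
    """Row i of the output is the average of f(x_i, x_j) over all j."""
    n = X.shape[0]
    return np.stack([
        np.mean([f(X[i], X[j]) for j in range(n)], axis=0)
        for i in range(n)
    ])

def pair_fn(a, b):
    # Hypothetical pairwise function chosen only for illustration.
    return np.tanh(a - b)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))       # 5 elements, 3 dimensions each
Gamma = rng.normal(size=(3, 4))   # assumed weight matrix
beta = rng.normal(size=4)         # assumed bias
perm = rng.permutation(5)

Y = perm_equivariant_layer(X, Gamma, beta)
Y_perm = perm_equivariant_layer(X[perm], Gamma, beta)
Z = pairwise_layer(X, pair_fn)
Z_perm = pairwise_layer(X[perm], pair_fn)

print(np.allclose(Y_perm, Y[perm]), np.allclose(Z_perm, Z[perm]))
print(np.allclose(Y_perm.sum(axis=0), Y.sum(axis=0)))
```

Note that the pairwise layer evaluates `f` on every ordered pair, so its cost grows quadratically with the set size, whereas the weight-sharing layer needs only one matrix product per set.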
Edwards and Storkey propose a variational autoencoder for learning statistics of independent and identically distributed data. This work is perhaps the most similar to our own in that the proposed statistical network implicitly contains a CDAN as part of its structure. The application of the implicit CDAN is distinguished from ours in that it is applied at the instance level rather than the feature level. Whereas we are embedding individual features, Edwards and Storkey embed instances. In addition, Edwards and Storkey appear to focus solely on average pooling.
II Convolutional Deep Averaging Networks
Suppose we have a dataset composed of subsets $X_i$, $i = 1, \dots, N$, of some set $\mathcal{X}$ (theoretically, each $X_i$ may in fact be a multiset). Let us assume $\mathcal{X} = \mathbb{R}^d$ so that a given subset $X$ contains arbitrarily indexed vectors $x_j \in \mathbb{R}^d$, $j = 1, \dots, |X|$. Our objective is to design a neural network architecture capable of converting each of these variable-size subsets into a fixed-size representation that is useful for machine learning tasks such as classification.
One could certainly use an RNN by treating each as a sequence. However, if there is no inherent ordering to the elements, then an RNN possesses some significant disadvantages. The RNN may learn or be biased towards spurious patterns that are a result of the chosen ordering scheme. In addition, the removal of an element in the middle of the sequence could lead to unexpected results.
We reason that the ideal architecture for this problem is invariant to the order of the input, and we propose augmenting the DAN architecture by directly incorporating the embedding function $\phi : \mathbb{R}^d \to \mathbb{R}^m$ into the structure of the network, where $m$ is the chosen size of the embedding. We call the resulting architecture a convolutional deep averaging network due to its similarity to a convolutional neural network (CNN), which will become apparent shortly.
In theory, we place no restriction on the form that the embedding function $\phi$ may take except that it be parameterized in a manner compatible with backpropagation-based training. For the sake of simplicity, we assume that $\phi$ can be represented by an MLP, although an RNN is also conceivable if elements of the underlying set $\mathcal{X}$ are sequences or time series. When given a set $X$, the embedding function is applied separately to each $x \in X$. One could informally interpret the embedding layer as a sort of convolution of $\phi$ with the elements of $X$. The embeddings are then combined in a manner that does not depend on their order, e.g. through a binary, commutative, and associative operator. To borrow familiar language from CNNs, the embeddings are pooled. Let $\rho$ denote the pooling function and note that usually $\rho$ returns a vector in $\mathbb{R}^m$, where $m$ is the embedding size, as is the case for typical pooling operations such as summation. A CDAN is then defined by the function $F(X) = g(\rho(\{\phi(x) : x \in X\}))$, where $X \subseteq \mathcal{X}$ and $g$ represents a neural network with arbitrary structure. A CDAN with single-layer $\phi$ can be cast as a special type of CNN by considering each set $X$ as an image where $\phi$ is a bank of filters. Alternatively, simply removing the column-wise maximum term $\mathbf{1}x_{\max}^\top$ from (3) yields an equivalent layer. CDANs with MLP embeddings may also be considered CNNs with multiple convolutional layers. See Figure 1 for an illustration of the proposed architecture.
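The embed, pool, and classify pipeline described above can be sketched as a plain NumPy forward pass. This is an illustrative sketch under assumptions rather than the implementation used in the experiments: the ReLU activations, the layer sizes, the random untrained weights, and the names `embed`, `POOLS`, `cdan_forward`, and `init` are all hypothetical. It shows that sets of different sizes map to outputs of the same size and that reordering a set leaves the output unchanged.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def embed(X, layers):
    """Apply the same MLP to every row (set element) of X independently."""
    H = X
    for W, b in layers:
        H = relu(H @ W + b)
    return H

POOLS = {
    "sum": lambda H: H.sum(axis=0),
    "average": lambda H: H.mean(axis=0),
    "max": lambda H: H.max(axis=0),
}

def cdan_forward(X, embed_layers, head_layers, pool="sum"):
    """Embed each element, pool over the set axis, then apply the head network."""
    h = POOLS[pool](embed(X, embed_layers))        # fixed-size pooled vector
    for W, b in head_layers[:-1]:
        h = relu(h @ W + b)
    W, b = head_layers[-1]
    logits = h @ W + b
    e = np.exp(logits - logits.max())              # numerically stable softmax
    return e / e.sum()

def init(sizes, rng):
    """Random weights for a stack of dense layers (untrained, for illustration)."""
    return [(rng.normal(scale=0.5, size=(i, o)), np.zeros(o))
            for i, o in zip(sizes[:-1], sizes[1:])]

rng = np.random.default_rng(0)
embed_layers = init([3, 11, 11], rng)   # 3-D points -> 11-D embeddings
head_layers = init([11, 11, 5], rng)    # pooled vector -> 5 class scores

X_small = rng.normal(size=(4, 3))       # a set with 4 points
X_large = rng.normal(size=(9, 3))       # a set with 9 points
p_small = cdan_forward(X_small, embed_layers, head_layers)
p_large = cdan_forward(X_large, embed_layers, head_layers)
p_shuffled = cdan_forward(X_small[::-1], embed_layers, head_layers)

print(p_small.shape, p_large.shape, np.allclose(p_small, p_shuffled))
```

Because the embedding is applied row by row with shared weights, the embedding step is equivalent to a stack of size-one convolutions along the set axis, which is the CNN analogy drawn in the text.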
In an alternative interpretation of the embedding, we posit that the embedding function $\phi$ effectively performs a type of bin or bucket sort of the set elements by allocating them to $m$ bins. Each dimension of the embedding is thus associated with a certain region within the input space $\mathcal{X}$. Unlike an actual bucket sort, we do not require the bins to be disjoint. By constraining the output of $\phi$ with a softmax function, however, one could produce a probability distribution over the bins. This interpretation generalizes the neural bag-of-words model of Richard and Gall by allowing the distribution of each visual word to be learned rather than constrained to be Gaussian. In a sense, such a network computes a probabilistic $k$-means with non-linear clusters. Depending on the dimensionality of $\mathcal{X}$, this interpretation provides us one way to visualize and examine the embedded feature space by plotting the activation of a bin or the distribution of the visual word in the input space.
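The soft-binning interpretation above can be illustrated with a small NumPy sketch. It assumes a hypothetical two-layer embedding with a tanh hidden layer and untrained random weights; the names `soft_bins` and `histogram` are introduced only for this example. Each row of the embedding output is a probability distribution over the bins, and average-pooling those rows gives a soft bag-of-words histogram for the set.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)    # stabilize before exponentiating
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def soft_bins(X, W1, b1, W2, b2):
    """Nonlinear embedding whose output rows are distributions over the bins."""
    H = np.tanh(X @ W1 + b1)
    return softmax(H @ W2 + b2)

rng = np.random.default_rng(1)
d, hidden, m = 2, 8, 4                      # input size, hidden size, number of bins
W1, b1 = rng.normal(size=(d, hidden)), np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, m)), np.zeros(m)

X = rng.normal(size=(6, d))                 # a set of six 2-D elements
P = soft_bins(X, W1, b1, W2, b2)            # (6, m) soft assignment per element
histogram = P.mean(axis=0)                  # average-pooled distribution over bins

print(P.sum(axis=1), histogram.sum())
```

For two-dimensional input, evaluating `soft_bins` on a regular grid and plotting one column of the result is one way to visualize the region associated with a bin.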
The form of the embedding function plays a significant role in the performance of the network. In the following subsections, we show that nonlinear embeddings are generally preferable to linear.
II-A Disadvantage of Linear Embeddings
Consider a linear embedding defined by
$$\phi(x) = Wx + b,$$
where $W$ is an $m \times d$ weight matrix and $b$ is a bias vector. Assume that the pooling layer consists of an average operation. The output of the pooling layer and input to the deep portion of the network given $X$ is then
$$\frac{1}{|X|}\sum_{x \in X}\phi(x) = W\left(\frac{1}{|X|}\sum_{x \in X} x\right) + b.$$
We see that we could have simply pooled the input elements directly. In addition, if $W'$ and $b'$ are the weights and bias of the first post-pooling layer, then $\phi$ could be merged into the layer by substituting $W'$ and $b'$ with $W'W$ and $W'b + b'$. In other words, the linear embedding is computationally unnecessary and can be eliminated. A similar conclusion may be reached if sum-pooling is used instead (or any linear operation). Max-pooling is an exception as it introduces a nonlinearity. However, max-pooling with linear embeddings still has potential issues with ambiguity.
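The argument above can be checked numerically. The following NumPy sketch uses random weights and hypothetical variable names (`W`, `b` for the linear embedding and `W1`, `b1` for the first post-pooling layer, before its activation). It verifies that averaging commutes with a linear embedding and that the embedding can be folded into the next layer's weights and bias.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, k = 3, 11, 7
W, b = rng.normal(size=(m, d)), rng.normal(size=m)      # linear embedding
W1, b1 = rng.normal(size=(k, m)), rng.normal(size=k)    # first post-pooling layer

X = rng.normal(size=(6, d))                             # rows are set elements

pooled_embeddings = (X @ W.T + b).mean(axis=0)          # embed, then average
embedded_mean = W @ X.mean(axis=0) + b                  # average, then embed

# Fold the embedding into the next layer's weights and bias.
W_merged, b_merged = W1 @ W, W1 @ b + b1
with_embedding = W1 @ pooled_embeddings + b1
without_embedding = W_merged @ X.mean(axis=0) + b_merged

print(np.allclose(pooled_embeddings, embedded_mean))
print(np.allclose(with_embedding, without_embedding))
```

With sum-pooling the same folding works except that the bias contribution scales with the set size, since each of the pooled elements contributes one copy of `b`.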
II-B Nonlinear Embeddings Mitigate Ambiguity
Based on the previous subsection’s result, one may consider simply skipping a learned embedding and working directly with the input points as the plain DAN of Iyyer et al. suggests. In general, though, this course of action may be unwise. In particular, suppose there are two sets $X_1$, $X_2$ with $X_1 \neq X_2$ such that
$$\frac{1}{|X_1|}\sum_{x \in X_1} x = \frac{1}{|X_2|}\sum_{x \in X_2} x.$$
One could even construct a situation wherein both sets also have the same element-wise maximums by choosing $X_1$ and $X_2$ to have the same convex hull. In such an event, $X_1$ and $X_2$ are indistinguishable under linear embeddings with max-pooling since the maximum (and minimum) of a linear function will always lie on the boundary (i.e. vertices) of a convex set. Regardless of the cause of the ambiguity, the consequence is that instances with potentially significant differences are functionally identical from the network’s perspective. The primary issue, though, is the fact that these ambiguities are not caused by particularly exotic circumstances.
A nonlinear embedding allows the network to learn functions that can differentiate sets that are ambiguous under linear pooling. Note that ambiguity is still possible with a nonlinear embedding. However, since the embedding is learned to satisfy some objective, one can expect these ambiguities to either be benign or to indicate some inherent similarity between the ambiguous instances. For example, consider the sets of black and white points in Figure 2(a) that are ambiguous under sum- and max-pooling. Using a pair of sigmoidal activation functions each defined by
with inputs , , where , are each small and positive, we can compute nonlinear embeddings that are unambiguous under sum-pooling.
The nonlinear embedding of the entire set is the key point; linear point-wise embeddings followed by max-pooling may be sufficient when equivalent convex hulls are rare. However, we hypothesize that nonlinear embeddings are inherently more powerful and thus more useful since they have greater representational capacity.
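The kind of ambiguity discussed in this subsection can be reproduced with a toy example. The sketch below does not reproduce the point sets of Figure 2; it uses two hypothetical one-dimensional sets that share the same mean and the same convex hull, together with a random linear embedding and a sigmoid applied on top of it. Under these assumptions the two sets are identical after a linear embedding with average- or max-pooling, but differ after a nonlinear embedding with sum-pooling.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Two different 1-D sets with the same mean (0) and the same convex hull [-1, 1].
A = np.array([[-1.0], [1.0]])
B = np.array([[-1.0], [0.0], [1.0]])

rng = np.random.default_rng(3)
W, b = rng.normal(size=(1, 4)), rng.normal(size=4)   # embedding into 4 dimensions

def linear(X):
    return X @ W + b

def nonlinear(X):
    return sigmoid(X @ W + b)

lin_avg = (linear(A).mean(axis=0), linear(B).mean(axis=0))
lin_max = (linear(A).max(axis=0), linear(B).max(axis=0))
nl_sum = (nonlinear(A).sum(axis=0), nonlinear(B).sum(axis=0))

print(np.allclose(*lin_avg), np.allclose(*lin_max), np.allclose(*nl_sum))
```

The interior point of `B` never attains the maximum of a linear function over the hull, which is why max-pooling cannot see it, while the sigmoid embedding gives it a strictly positive contribution to the sum.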
We conduct experiments to evaluate the performance of CDANs against alternative architectures as well as examine the effects of different pooling operations. Our experiments focus especially on variable-size sets, which do not seem to have many existing results in the literature. All models were implemented and tested using the Keras deep learning framework with the Theano backend.
III-A Posture Recognition from Point Sets
A motion capture dataset of hand postures provides the primary basis for our evaluation. (The dataset along with further documentation is available at http://www.latech.edu/~jkanno/collaborative.htm.) The dataset consists of variable-size point sets representing five hand postures captured from 12 users. The size of each point set ranges from 3 to 12, although it should be noted that only 11 markers were physically present. Each point set shares the same coordinate system, so no rotations or translations should be required to process the data. Regardless, we center each point set to have zero mean in each dimension. The goal is to classify each point set as one of the five postures.
In order to make the problem more challenging, we do a leave-one-user-out evaluation where all but one user contribute to the training and validation sets and the test set is drawn exclusively from the left-out user. Each user is iteratively left out, and the resulting test accuracies are averaged to obtain a reasonable evaluation of the tested classifier’s generalization error. Training, validation, and test sets are disjoint and each consist of 75 uniformly randomly selected instances per class per user without replacement. This process is repeated five times in order to obtain some measure of confidence in the results.
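The leave-one-user-out protocol described above can be sketched as follows. This is a simplified illustration on toy data, not the evaluation script used for the reported results: it covers only the train/test split and the per-class, per-user sampling without replacement, and omits the validation set and the five repetitions. The helper names `leave_one_user_out` and `sample_per_class_per_user` are hypothetical.

```python
import numpy as np

def leave_one_user_out(user_ids):
    """Yield (held_out_user, train_idx, test_idx) for each distinct user."""
    user_ids = np.asarray(user_ids)
    for user in np.unique(user_ids):
        test_mask = user_ids == user
        yield user, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

def sample_per_class_per_user(idx, users, labels, per_group, rng):
    """Draw per_group indices without replacement for each (user, class) pair."""
    chosen = []
    for u in np.unique(users[idx]):
        for c in np.unique(labels[idx]):
            group = idx[(users[idx] == u) & (labels[idx] == c)]
            chosen.append(rng.choice(group, size=per_group, replace=False))
    return np.concatenate(chosen)

rng = np.random.default_rng(5)
users = np.repeat(np.arange(4), 6)     # 4 toy users with 6 instances each
labels = np.tile(np.arange(3), 8)      # 3 toy classes, 2 instances per user and class

splits = list(leave_one_user_out(users))
held_out, train_idx, test_idx = splits[0]
train_sample = sample_per_class_per_user(train_idx, users, labels, 1, rng)

print(held_out, len(train_idx), len(test_idx), len(train_sample))
```

In a full run, a model would be trained on each split's training sample and the per-user test accuracies averaged, as the text describes.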
III-B Model Specification
We compare a variety of CDAN architectures for this task, including linear embeddings with max-pooling, linear embedding with sum-pooling (i.e. no embedding), and nonlinear embeddings with average-, sum-, and max-pooling. These models are compared against an RNN with gated recurrent units as well as an experimental variant of the CDAN architecture with recurrent connections between the embeddings, which we call a recurrent deep averaging network (RDAN). In an RDAN, we trade permutation invariance and independent embeddings for increased functional capacity. Note that an RDAN is effectively just an RNN that is pooled over the entire time axis. Finally, we implement the permutation-equivariant layers of Guttenberg et al. (defined by (4)) and Ravanbakhsh et al. (defined by (3)) for an external, contemporaneous comparison. One or more permutational layers enable one to obtain dependent nonlinear embeddings that are permutation invariant (after pooling) as opposed to the RNNs. From this point forth, we refer to the permutational layer of Guttenberg et al. as a pairwise layer due to its structure and to distinguish it from the permutation-equivariant layer of Ravanbakhsh et al. We refer to the respective types of model as pairwise convolutional deep averaging networks (PCDANs) and permutational deep averaging networks (PDANs) for brevity. Despite the fact that the nonlinear embeddings of a CDAN are not always technically convolutions, we will sometimes refer to them as convolutional layers when compared against the recurrent, pairwise, and permutational layers of competing architectures.
Given the incredibly diverse array of architectural and training options available in the literature, we tried to make our architectures as uniform as possible in order to enable fair comparison. Since the recurrent architectures depend on the order of the input, points in each set were lexicographically sorted by their $x$-, $y$-, and $z$-coordinates. Gaussian noise with a standard deviation of 20 (millimeters, which is the scale of the input) was applied to the input as a form of regularization for each network. Dropout of 10% was applied to the hidden layers of each network, and regularization with a magnitude of 0.001 was applied to the weights of each layer. We did not apply dropout to the input, but we did adopt the simultaneous dropout suggested by Ravanbakhsh et al. for the PDANs, which consists of dropping a feature simultaneously in all elements of an input set rather than independently. A default embedding size of 11 was chosen for computational expedience as well as to let each embedded dimension hypothetically represent one of the physical markers. Two special (i.e. convolutional, recurrent, etc.) layers with 11 neurons each were used in each architecture. Each tested recurrent network was bidirectional with the forward and backward RNN outputs concatenated at each timestep (i.e. 22-dimensional output). To clarify, the final timesteps in each direction were concatenated in the case of the plain RNN without pooling. Except in the case of linear embeddings, maxout activations with 2 pieces were used in each layer except for the network output, which incorporated a softmax activation. Each model used the same post-embedding architecture, which consisted of one 11-neuron layer with a residual connection followed by the 5-neuron (one per class) softmax layer.
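A few of the preprocessing and layer choices described above can be sketched in NumPy. This is an illustration under assumptions, not the Keras code used in the experiments: the helper names `center`, `lexsort_points`, and `maxout` are hypothetical, the weights are random, and the coordinate order is assumed to be the first, second, and third columns of each point. It shows centering, lexicographic sorting of points, and a maxout layer with 2 pieces applied to each point.

```python
import numpy as np

def center(P):
    """Shift a point set so that it has zero mean in each dimension."""
    return P - P.mean(axis=0)

def lexsort_points(P):
    """Sort rows by column 0, then 1, then 2 (np.lexsort treats its last key as primary)."""
    order = np.lexsort((P[:, 2], P[:, 1], P[:, 0]))
    return P[order]

def maxout(X, W, b):
    """Maxout layer: W has shape (pieces, d_in, d_out), b has shape (pieces, d_out)."""
    Z = np.einsum("nd,pdo->pno", X, W) + b[:, None, :]   # one affine map per piece
    return Z.max(axis=0)                                  # element-wise max over pieces

P = np.array([[2.0, 0.0, 0.0],
              [1.0, 5.0, 0.0],
              [1.0, 2.0, 3.0],
              [1.0, 2.0, 1.0]])
P_sorted = lexsort_points(P)

rng = np.random.default_rng(6)
W = rng.normal(size=(2, 3, 11))      # 2 pieces, 3-D input, 11 output units
b = rng.normal(size=(2, 11))
H = maxout(center(P_sorted), W, b)   # (4, 11) embeddings, one row per point

print(P_sorted)
print(H.shape)
```

Sorting matters only for the order-dependent recurrent baselines; the pooled output of an order-invariant model is unaffected by it.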
CDANs offer significant computational and practical advantages over the other architectures that arise primarily from the fact that the embeddings are independent. Unlike an RNN or RDAN, the embeddings can be computed in parallel rather than sequentially. In addition, for a set of $n$ elements, only $n$ embedding function evaluations are required as opposed to $n^2$ function evaluations for a PCDAN. The fact that the embeddings are independent also enables their re-use in intersecting sets whereas recurrent or permutational architectures must re-evaluate each point. PDANs are of similar complexity, although they lack any advantages derived from independent embeddings. For these reasons we were able to experiment with CDANs and PDANs with embedding sizes that are an order of magnitude higher than the other models (100 to be precise) yet still require less computation.
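The re-use of independent embeddings across intersecting sets can be sketched with a simple cache. This is a hypothetical illustration: the class `CachedEmbedder`, the fixed function `phi`, and the toy points are assumptions introduced for the example, and the pairwise count simply tallies all ordered pairs within each set.

```python
import numpy as np

class CachedEmbedder:
    """Memoize an element-wise embedding so shared elements are embedded once."""

    def __init__(self, phi):
        self.phi = phi
        self.cache = {}
        self.calls = 0

    def embed(self, point):
        key = tuple(point)
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = self.phi(np.asarray(point, dtype=float))
        return self.cache[key]

    def sum_pool(self, points):
        return np.sum([self.embed(p) for p in points], axis=0)

def phi(x):
    # Hypothetical fixed nonlinear embedding used only for this illustration.
    return np.tanh(np.array([x.sum(), x[0] - x[1], x[2]]))

set_a = [(0.0, 1.0, 2.0), (1.0, 1.0, 0.0), (2.0, 0.0, 1.0)]
set_b = [(1.0, 1.0, 0.0), (2.0, 0.0, 1.0), (3.0, 2.0, 1.0)]   # shares two points with set_a

embedder = CachedEmbedder(phi)
pooled_a = embedder.sum_pool(set_a)
pooled_b = embedder.sum_pool(set_b)

independent_calls = embedder.calls                      # one call per distinct point
pairwise_calls = len(set_a) ** 2 + len(set_b) ** 2      # all ordered pairs in both sets

print(independent_calls, pairwise_calls)
```

A pairwise or recurrent layer cannot use such a cache because the value computed for one element depends on the other elements of the set.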
III-C Results and Discussion
Results are presented in Table I, where we can see that the highest average accuracy was achieved by a CDAN with sum-pooling and 100-dimensional nonlinear embeddings. We also immediately notice a significant difference between types of pooling for permutation invariant architectures. RDANs, on the other hand, appear to be robust to changes in the pooling mode and just as effective if not marginally better than the RNN.
In general, we note that the highest accuracies achieved for each type are clustered around 90%. Our results do not provide enough confidence to say that the best-performing models are significantly different (in a statistical sense) from one another, but they do suggest a potential advantage to certain CDANs and a disadvantage to PDANs. The PDANs’ slightly inferior performance may be explained by the fact that their permutation-equivariant layers are slightly more constrained than the competition. Furthermore, we tested only a portion of the possible architectures proposed by the framework of Ravanbakhsh et al. Regardless of whether the best CDAN does achieve significantly higher accuracy than the competition, the computational advantages of a CDAN over RNNs and PCDANs certainly warrant its use. In particular, reducing the embedding size to 11 renders a significantly more efficient classifier (with relatively few parameters) with only a marginal drop in accuracy.
We can hypothesize potential reasons for the pattern of results induced by different pooling modes. Note first that the difference between average- and sum-pooling must arise from the fact that the input sets are not of constant size. If the size were constant, then both pooling modes would be the same but for a constant factor. A potential cause for their difference here may thus arise from the fact that average-pooling effectively removes information (the implicitly encoded size of the set) and introduces ambiguity between certain set embeddings. On the other hand, max-pooling’s relatively poor performance may be partially due to the choice of the maxout activation function. We noted in some exploratory trials that its accuracy significantly improved when paired with rectified linear unit (ReLU) activations. Some theoretical basis for sum-pooling’s apparent advantage may be given by a probabilistic interpretation. Though embeddings were not constrained by a softmax output, we may interpret them as the logarithm of unscaled posterior probabilities as indicated by a neural bag-of-words model. The sum of the embeddings then gives the log-likelihood (shifted by some amount) for the parameters associated with each visual word given the point set.
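The point about average-pooling discarding the set size can be illustrated with a short NumPy sketch. The embeddings here are random stand-ins rather than learned values, and the variable names are hypothetical. A multiset in which every element appears twice has the same average as the original but twice the sum, and for a fixed set size the two pooling modes differ only by a constant factor.

```python
import numpy as np

rng = np.random.default_rng(4)
H = rng.normal(size=(3, 11))          # stand-in embeddings of a 3-element set
H_twice = np.concatenate([H, H])      # multiset with every element repeated

avg_small, avg_large = H.mean(axis=0), H_twice.mean(axis=0)
sum_small, sum_large = H.sum(axis=0), H_twice.sum(axis=0)

print(np.allclose(avg_small, avg_large))       # average-pooling cannot tell them apart
print(np.allclose(sum_small, sum_large))       # sum-pooling can
print(np.allclose(sum_small, 3 * avg_small))   # fixed size: a constant factor apart
```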
We also show that nonlinear embeddings can yield significant gains over linear or identity (i.e. no) embeddings. Indeed, for this problem the identity embedding yields a network no better than guessing. The linear embedding with max-pooling, on the other hand, is competitive with its counterpart nonlinear embedding. However, whereas the nonlinear embedding could potentially be improved by adding more layers or changing its activation functions, the linear embedding is already exhausting its functional capacity.
We introduced the CDAN, a class of neural networks designed for classifying instances containing unordered, variable-size feature sets. The proposed architecture works by directly incorporating a function into the network’s structure that embeds the features in a high-dimensional space and pooling the subsequent embeddings. As the name implies, an equivalence can be drawn between the convolution operation in CNNs and the application of the embedding function. Experiments show that in terms of accuracy, CDANs are competitive with recurrent and permutation-equivariant architectures. CDANs are also computationally efficient compared to alternative architectures, favoring parallel implementations and re-use of prior results since feature embeddings are set-invariant. In addition, the learned feature embeddings are a useful by-product that can potentially solve related problems such as nonlinear clustering.
Future work may include further exploration of CDAN properties and optimal architectures when applied to other problems or datasets. One should note that networks incorporating convolutional, recurrent, or permutation-equivariant layers need not be mutually exclusive. Architectures that perform a convolutional embedding prior to a permutation-equivariant layer (or vice-versa) may be worth exploring and could be capable of achieving results superior to either method when used alone.
- Chen et al.  X. Chen, X. Cheng, and S. Mallat. Unsupervised deep Haar scattering on graphs. arXiv:1406.2390, 2014. URL https://arxiv.org/abs/1406.2390.
- Chollet  F. Chollet. Keras. https://github.com/fchollet/keras, Nov. 2016. Version 1.2.2.
- Chung et al.  J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555, 2014. URL https://arxiv.org/abs/1412.3555.
- Cohen and Welling  T. S. Cohen and M. Welling. Group equivariant convolutional networks. In Proceedings of the 33rd International Conference on Machine Learning, volume 48 of ICML ’16, pages 2990–2999. JMLR.org, May 2016. URL https://arxiv.org/abs/1602.07576.
- Edwards and Storkey  H. Edwards and A. Storkey. Towards a neural statistician. arXiv:1606.02185, 2016. URL https://arxiv.org/abs/1606.02185.
- Gens and Domingos  R. Gens and P. M. Domingos. Deep symmetry networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2537–2545. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5424-deep-symmetry-networks.pdf.
- Goodfellow et al.  I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In S. Dasgupta and D. McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of ICML ’13, pages 1319–1327, Atlanta, Georgia, USA, Jun. 2013. PMLR. URL http://proceedings.mlr.press/v28/goodfellow13.html.
- Guttenberg et al.  N. Guttenberg, N. Virgo, O. Witkowski, H. Aoki, and R. Kanai. Permutation-equivariant neural networks applied to dynamics prediction. arXiv:1612.04530, Dec. 2016. URL https://arxiv.org/abs/1612.04530.
- He et al.  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In doi: 10.1109/CVPR.2016.90.
- Hill et al.  F. Hill, K. Cho, and A. Korhonen. Learning distributed representations of sentences from unlabelled data. Transactions of the Association for Computational Linguistics, 4:17–30, Feb. 2016.
- Iyyer et al.  M. Iyyer, V. Manjunatha, J. L. Boyd-Graber, and H. Daumé III. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, ACL ’15, pages 1681–1691. The Association for Computer Linguistics, 2015. ISBN 978-1-941643-72-3. URL http://dblp.uni-trier.de/db/conf/acl/acl2015-1.html#IyyerMBD15.
- Jain et al.  A. K. Jain, J. Mao, and K. Mohiuddin. Artificial neural networks: A tutorial. IEEE Computer, 29:31–44, 1996.
- Kingma and Welling  D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations, ICLR ’14, 2014. URL https://arxiv.org/abs/1312.6114.
- LeCun et al.  Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, May 2015. ISSN 0028-0836. URL http://dx.doi.org/10.1038/nature14539.
- Ravanbakhsh et al.  S. Ravanbakhsh, J. Schneider, and B. Poczos. Deep learning with sets and point clouds. arXiv:1611.04500v3, 2017. URL https://arxiv.org/abs/1611.04500.
- Richard and Gall  A. Richard and J. Gall. A bag-of-words equivalent recurrent neural network for action recognition. Computer Vision and Image Understanding, 156:79–91, 2017. ISSN 1077-3142. doi: https://doi.org/10.1016/j.cviu.2016.10.014. URL http://www.sciencedirect.com/science/article/pii/S1077314216301680. Image and Video Understanding in Big Data.
- Schuster and Paliwal  M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, Nov. 1997. ISSN 1053-587X. doi: 10.1109/78.650093.
- Srivastava et al.  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014. URL http://jmlr.org/papers/v15/srivastava14a.html.
- Theano Development Team  Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv:1605.02688, May 2016. URL http://arxiv.org/abs/1605.02688. Version 0.8.2.
- Tieleman and Hinton  T. Tieleman and G. Hinton. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.
- Vinyals et al.  O. Vinyals, S. Bengio, and M. Kudlur. Order matters: Sequence to sequence for sets. arXiv:1511.06391, Feb. 2016. URL https://arxiv.org/abs/1511.06391.