Classifying Unordered Feature Sets with Convolutional Deep Averaging Networks

Unordered feature sets are a nonstandard data structure that traditional neural networks are incapable of addressing in a principled manner. Providing a concatenation of features in an arbitrary order may lead to the learning of spurious patterns or biases that do not actually exist. Another complication is introduced if the number of features varies between each set. We propose convolutional deep averaging networks (CDANs) for classifying and learning representations of datasets whose instances comprise variable-size, unordered feature sets. CDANs are efficient, permutation-invariant, and capable of accepting sets of arbitrary size. We emphasize the importance of nonlinear feature embeddings for obtaining effective CDAN classifiers and illustrate their advantages in experiments versus linear embeddings and alternative permutation-invariant and -equivariant architectures.


I Introduction

We propose convolutional deep averaging networks (CDANs) for classifying and learning feature representations of datasets containing instances with unordered features, where each feature is considered a tuple composed of one or more values. CDANs accept variable-size input and are invariant to permutations of the input’s order. In addition, as a side-effect of the training process, CDANs learn discriminative, nonlinear embeddings of individual input elements into a space of chosen dimensionality. Contrary to their name, which is inspired by the work of Iyyer et al. [11], CDANs could perhaps be more accurately termed convolutional deep pooling networks, as we also consider the effects of pooling functions other than averaging, such as taking element-wise maximums or sums.

I-A Contributions

We propose CDANs for classifying unordered feature sets. We show that a CDAN with nonlinear embeddings is competitive with and perhaps even superior to recurrent neural networks (RNNs) and known permutation-invariant architectures for classifying instances containing variable-size sets of unordered features. We also find that the type of pooling plays a significant role in determining the efficacy of the network, with sum-pooling clearly outperforming max- and average-pooling.

I-B Related Research

Sets, particularly those without an inherent ordering, comprise a class of data for which an obvious deep learning [14] treatment is somewhat elusive. A simple feed-forward neural network such as a multi-layer perceptron (MLP) [12] is insufficient without enormous amounts of data, and even more so if the sets are not of constant size. In addition, RNNs are generally insufficient since the order of the elements may be unreliable or bias the network toward certain spurious or transient patterns. Recently, the deep learning community has begun to explicitly consider architectures specifically made to address the unique challenges posed by sets and other unusually structured data such as graphs [1] and ordered sequences [21]. These architectures usually work by exploiting or preserving symmetries in the data (see Gens and Domingos [6] or [4] for general frameworks). In the remainder of the section, we focus on work more directly related to our own.

Iyyer et al. [11] proposed deep averaging networks (DANs) for classifying text from an unordered list of words and showed that this rivaled more complex network architectures for the same task. A DAN is essentially a traditional feed-forward neural network whose main distinguishing feature lies in the nature of its input: the element-wise average of word embeddings in a vector space. Iyyer et al. did not consider learning word embeddings as part of the architecture, instead opting to use a set of predefined embeddings. In addition, only averaging was considered as a means of aggregating the word embeddings.

Hill et al. [10] considered learning linear embeddings as part of the network architecture and summing instead of averaging the embeddings. The resulting network was cast as an RNN with identity weight matrices and served as a baseline against the article’s primary architectures. We show that linear embeddings are not sufficient for all tasks and indeed are unnecessary with certain pooling operations including averaging and summing.

Richard and Gall [16] develop a neural bag-of-words model that is equivalent to a single-layer-embedding CDAN with average pooling. Each dimension of the embedding is interpreted as the probability of a Gaussian-distributed visual word given the embedded element. Consequently, the embedding is constrained by a softmax output. Richard and Gall do not appear to explicitly treat instances as sets rather than sequences, but their architecture is nevertheless permutation invariant. A specialized layer representing a support vector machine (SVM) with certain types of nonlinear kernels is incorporated after pooling.

Permutation equivariance is closely related to the concept of invariance. Whereas invariance prescribes that the output of a function is unchanged when the input is permuted, equivariance indicates that the output (presumed to be a sequence or set of the same cardinality as the input) is permuted in the same manner as the input. In other words, equivariance dictates that when a function $f$ is given an input $\mathbf{x} = (x_1, \ldots, x_n)$ permuted by any $\pi \in S_n$, where $S_n$ is the symmetric group on $n$ symbols, then

$f(\pi\mathbf{x}) = \pi f(\mathbf{x})$.   (1)

Note that invariance means that

$f(\pi\mathbf{x}) = f(\mathbf{x})$.   (2)

Ravanbakhsh et al. [15] propose a computationally efficient permutation-equivariant layer accomplished via a precise pattern of weight sharing. The following equation computes the output of a recommended version of this layer given an $n$-element, $d$-dimensional input set represented as a matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$,

$\mathbf{Y} = \sigma\left(\mathbf{1}\boldsymbol{\beta}^\top + \left(\mathbf{X} - \mathbf{1}\mathbf{x}_{\max}^\top\right)\boldsymbol{\Gamma}\right)$,   (3)

where $\sigma$ is some nonlinear activation function, $\mathbf{x}_{\max}$ is a vector of the column-wise maximum values of $\mathbf{X}$, $\boldsymbol{\Gamma}$ is a weight matrix, $\boldsymbol{\beta}$ is a bias, and $\mathbf{1}$ is a vector of ones. Guttenberg et al. [8] also propose a permutation-equivariant layer for dynamics prediction but base their version on applying an arbitrary function to all pairwise combinations of input elements and averaging (pooling) the output, i.e. given inputs $\mathbf{x}_1, \ldots, \mathbf{x}_n$ and a function $g$, the $i$-th index of the output is given by

$\mathbf{y}_i = \frac{1}{n}\sum_{j=1}^{n} g(\mathbf{x}_i, \mathbf{x}_j)$.   (4)

As noted by Ravanbakhsh et al. [15], permutation invariance can be obtained from a permutation equivariant function by pooling over its output.
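For concreteness, the following PyTorch sketch gives one reading of the two layer types in (3) and (4). The activation functions, dimensions, and framework are illustrative choices of our own rather than the configurations used by the cited authors.

```python
import torch
import torch.nn as nn


class EquivariantMaxLayer(nn.Module):
    """Permutation-equivariant layer in the spirit of (3): each element is
    shifted by the column-wise maximum of the set before a shared linear map."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)

    def forward(self, x):                            # x: (n, d_in), one set per call
        x_max = x.max(dim=0, keepdim=True).values    # column-wise maximums
        return torch.tanh(self.linear(x - x_max))    # shared weights and bias


class PairwiseLayer(nn.Module):
    """Pairwise layer in the spirit of (4): a small network g is applied to
    every ordered pair (x_i, x_j) and the results are averaged over j."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(2 * d_in, d_out), nn.ReLU())

    def forward(self, x):                            # x: (n, d_in)
        n = x.shape[0]
        xi = x.unsqueeze(1).expand(n, n, -1)         # xi[i, j] = x[i]
        xj = x.unsqueeze(0).expand(n, n, -1)         # xj[i, j] = x[j]
        pairs = torch.cat([xi, xj], dim=-1)          # all ordered pairs
        return self.g(pairs).mean(dim=1)             # average over j


# Permuting the input permutes the output in the same way (equivariance);
# pooling over the output then yields a permutation-invariant vector.
x = torch.randn(5, 3)
perm = torch.randperm(5)
layer = EquivariantMaxLayer(3, 4)
assert torch.allclose(layer(x)[perm], layer(x[perm]), atol=1e-6)
```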

Edwards and Storkey [5] propose a variational autoencoder [13] for learning statistics of independent and identically distributed data. This work is perhaps the most similar to our own in that the proposed statistical network implicitly contains a CDAN as part of its structure. The application of the implicit CDAN is distinguished from ours in that it is applied at the instance level rather than the feature level. Whereas we are embedding individual features, Edwards and Storkey embed instances. In addition, Edwards and Storkey appear to focus solely on average pooling.

II Convolutional Deep Averaging Networks

Suppose we have a dataset composed of subsets $X_i$, $i = 1, \ldots, m$, of some set $\mathcal{X}$ (theoretically, each $X_i$ may in fact be a multiset). Let us assume $\mathcal{X} \subseteq \mathbb{R}^d$ so that a given subset $X_i$ contains arbitrarily indexed vectors $\mathbf{x}_{i,j} \in \mathbb{R}^d$, $j = 1, \ldots, n_i$. Our objective is to design a neural network architecture capable of converting each of these variable-size subsets into a fixed-size representation that is useful for machine learning tasks such as classification.

One could certainly use an RNN by treating each $X_i$ as a sequence. However, if there is no inherent ordering to the elements, then an RNN possesses some significant disadvantages. The RNN may learn or be biased towards spurious patterns that are a result of the chosen ordering scheme. In addition, the removal of an element in the middle of the sequence could lead to unexpected results.

We reason that the ideal architecture for this problem is invariant to the order of the input, and we propose augmenting the DAN architecture by directly incorporating the embedding function $\phi: \mathcal{X} \to \mathbb{R}^k$ into the structure of the network, where $k$ is the chosen size of the embedding. We call the resulting architecture a convolutional deep averaging network due to its similarity to a convolutional neural network (CNN), which will become apparent shortly.

In theory, we place no restriction on the form that $\phi$ may take except that it be parameterized in a manner compatible with backpropagation-based training. For the sake of simplicity, we assume that $\phi$ can be represented by an MLP, although an RNN is also conceivable if elements of $\mathcal{X}$ are sequences or time series. When given a set $X_i$, the embedding function is applied separately to each $\mathbf{x}_{i,j} \in X_i$. One could informally interpret the embedding layer as a sort of convolution of $\phi$ with the elements of $X_i$. The embeddings are then combined in a manner that does not depend on their order, e.g. through a binary, commutative, and associative operator. To borrow familiar language from CNNs, the embeddings are pooled. Let $\rho$ denote the pooling function and note that usually $\rho(\phi(\mathbf{x}_{i,1}), \ldots, \phi(\mathbf{x}_{i,n_i})) \in \mathbb{R}^k$, as is the case for typical pooling operations such as summation. A CDAN is then defined by the function $f(X_i) = g(\mathbf{h}_i)$, where $\mathbf{h}_i = \rho(\phi(\mathbf{x}_{i,1}), \ldots, \phi(\mathbf{x}_{i,n_i}))$ and $g$ represents a neural network with arbitrary structure. A CDAN with a single-layer $\phi$ can be cast as a special type of CNN by considering each set $X_i$ as an image where $\phi$ is a bank of filters. Alternatively, simply removing $\mathbf{x}_{\max}$ from (3) yields an equivalent layer. CDANs with MLP embeddings may also be considered CNNs with multiple convolutional layers. See Figure 1 for an illustration of the proposed architecture.

Fig. 1: An illustration of a generic CDAN. The inputs are arbitrarily indexed from 1 to $n$, where $n$ is the presumed cardinality of the input set. An embedding function is convolved with the input elements to produce a dynamically learned embedding in some potentially high-dimensional space.
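To make the definition concrete, the following PyTorch sketch implements a CDAN along the lines of Figure 1: a shared MLP plays the role of $\phi$, a permutation-invariant pooling operation plays the role of $\rho$, and a small feed-forward network plays the role of $g$. The layer sizes, activations, and framework are illustrative choices, not the Keras/Theano configuration used in our experiments.

```python
import torch
import torch.nn as nn


class CDAN(nn.Module):
    """Sketch of a convolutional deep averaging network: a shared nonlinear
    embedding applied to every set element, followed by order-independent
    pooling and a conventional feed-forward classifier."""

    def __init__(self, d_in, d_embed, n_classes, pooling="sum"):
        super().__init__()
        self.phi = nn.Sequential(                 # per-element embedding
            nn.Linear(d_in, d_embed), nn.ReLU(),
            nn.Linear(d_embed, d_embed), nn.ReLU(),
        )
        self.g = nn.Sequential(                   # post-pooling network
            nn.Linear(d_embed, d_embed), nn.ReLU(),
            nn.Linear(d_embed, n_classes),
        )
        self.pooling = pooling

    def forward(self, x):                         # x: (n, d_in), one set
        e = self.phi(x)                           # embed each element independently
        if self.pooling == "sum":
            z = e.sum(dim=0)
        elif self.pooling == "average":
            z = e.mean(dim=0)
        else:                                     # "max"
            z = e.max(dim=0).values
        return self.g(z)                          # class scores (logits)


# The output is unchanged under any reordering of the set elements.
model = CDAN(d_in=3, d_embed=11, n_classes=5)
x = torch.randn(7, 3)                             # a set of 7 three-dimensional points
assert torch.allclose(model(x), model(x[torch.randperm(7)]), atol=1e-5)
```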

In an alternative interpretation of the embedding, we posit that $\phi$ effectively performs a type of bin or bucket sort of the set elements by allocating them to $k$ bins. Each dimension of the embedding is thus associated with a certain region within the input space $\mathcal{X}$. Unlike an actual bucket sort, we do not require the bins to be disjoint. By constraining the output of $\phi$ with a softmax function, however, one could produce a probability distribution over the bins. This interpretation generalizes the neural bag-of-words model of Richard and Gall [16] by allowing the distribution of each visual word to be learned rather than constrained to be Gaussian. In a sense, such a network computes a probabilistic $k$-means with non-linear clusters. Depending on the dimensionality of $\mathcal{X}$, this interpretation provides us one way to visualize and examine the embedded feature space by plotting the activation of a bin or the distribution of the visual word in the input space.
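As a toy illustration of this soft-binning view (using a softmax constraint in the style of Richard and Gall rather than the unconstrained embeddings used in our experiments), summing softmax-normalized embeddings over a set yields a soft histogram of visual-word counts:

```python
import torch
import torch.nn as nn

k = 11                                        # number of "bins" / visual words
phi = nn.Linear(3, k)                         # hypothetical linear scores per bin

x = torch.randn(9, 3)                         # a set of 9 points
p = torch.softmax(phi(x), dim=-1)             # each row: a distribution over bins
soft_counts = p.sum(dim=0)                    # sum-pooling -> soft histogram

# The soft counts total the set's cardinality, mirroring a bucket sort
# whose bins are allowed to overlap.
assert torch.isclose(soft_counts.sum(), torch.tensor(9.0), atol=1e-5)
```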

The form of the embedding function plays a significant role in the performance of the network. In the following subsections, we show that nonlinear embeddings are generally preferable to linear.

II-A Disadvantage of Linear Embeddings

Consider a linear embedding defined by

$\phi(\mathbf{x}) = \mathbf{W}\mathbf{x} + \mathbf{b}$,   (5)

where $\mathbf{W}$ is a $k \times d$ weight matrix and $\mathbf{b}$ is a bias vector. Assume that the pooling layer consists of an average operation. The output of the pooling layer and input to the deep portion of the network given $X_i$ is then

$\frac{1}{n_i}\sum_{j=1}^{n_i}\left(\mathbf{W}\mathbf{x}_{i,j} + \mathbf{b}\right) = \mathbf{W}\left(\frac{1}{n_i}\sum_{j=1}^{n_i}\mathbf{x}_{i,j}\right) + \mathbf{b}$.   (6)

We see that we could have simply pooled the input elements directly. In addition, if $\mathbf{V}$ and $\mathbf{c}$ are the weights and bias of the first post-pooling layer, then $\phi$ could be merged into that layer by substituting $\mathbf{V}$ and $\mathbf{c}$ with $\mathbf{V}\mathbf{W}$ and $\mathbf{V}\mathbf{b} + \mathbf{c}$. In other words, the linear embedding is computationally unnecessary and can be eliminated. A similar conclusion may be reached if sum-pooling is used instead (or any linear operation). Max-pooling is an exception as it introduces a nonlinearity. However, max-pooling with linear embeddings still has potential issues with ambiguity.
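A quick numerical check of (6): averaging linearly embedded points gives exactly the same vector as linearly embedding the averaged points, so the embedding can be folded into the first post-pooling layer. The dimensions below are arbitrary illustrative choices.

```python
import torch

n, d, k = 6, 3, 11
W = torch.randn(k, d)                          # linear embedding weights
b = torch.randn(k)                             # linear embedding bias
X = torch.randn(n, d)                          # one input set of n points

pooled_embeddings = (X @ W.T + b).mean(dim=0)  # embed each point, then average
embedded_mean = W @ X.mean(dim=0) + b          # average the points, then embed

# Both orders give the same vector, so the linear embedding adds nothing
# beyond what the first post-pooling layer can already absorb.
assert torch.allclose(pooled_embeddings, embedded_mean, atol=1e-5)
```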

II-B Nonlinear Embeddings Mitigate Ambiguity

Based on the previous subsection’s result, one may consider simply skipping a learned embedding and working directly with the input points, as the plain DAN of Iyyer et al. [11] suggests. In general, though, this course of action may be unwise. In particular, suppose there are two distinct sets $X_1$, $X_2$ such that

$\sum_{\mathbf{x} \in X_1} \mathbf{x} = \sum_{\mathbf{x} \in X_2} \mathbf{x}$.   (7)

One could even construct a situation wherein both sets also have the same element-wise maximums by choosing $X_1$ and $X_2$ to have the same convex hull. In such an event, $X_1$ and $X_2$ are indistinguishable under linear embeddings with max-pooling since the maximum (and minimum) of a linear function will always lie on the boundary (i.e. vertices) of a convex set. Regardless of the cause of the ambiguity, the consequence is that instances with potentially significant differences are functionally identical from the network’s perspective. The primary issue, though, is the fact that these ambiguities are not caused by particularly exotic circumstances.

A nonlinear embedding allows the network to learn functions that can differentiate sets that are ambiguous under linear pooling. Note that ambiguity is still possible with a nonlinear embedding. However, since the embedding is learned to satisfy some objective, one can expect these ambiguities to either be benign or to indicate some inherent similarity between the ambiguous instances. For example, consider the sets of black and white points in Figure 2(a) that are ambiguous under sum- and max-pooling. Using a pair of sigmoidal activation functions, each defined by

$\sigma(z; t) = \left(1 + e^{-z/t}\right)^{-1}$,   (8)

with inputs $z_1 = \mathbf{w}_1^\top\mathbf{x} - c_1$ and $z_2 = \mathbf{w}_2^\top\mathbf{x} - c_2$, where $t_1$, $t_2$ are each small and positive, we can compute nonlinear embeddings that are unambiguous under sum-pooling.
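The sketch below constructs a pair of 2-D sets with identical coordinate-wise sums and maximums (the points and sigmoid parameters are our own illustrative choices, not those drawn in Figure 2) and shows that a pair of steep sigmoidal embedding dimensions separates the two sets under sum-pooling:

```python
import torch

def embed(points, w, c, t=0.1):
    """One nonlinear embedding dimension: a steep sigmoid of a linear score."""
    return torch.sigmoid((points @ w - c) / t)

# Two small 2-D sets with identical coordinate-wise sums and maximums.
black = torch.tensor([[0.0, 0.0], [1.0, 1.0]])
white = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
assert torch.equal(black.sum(dim=0), white.sum(dim=0))
assert torch.equal(black.max(dim=0).values, white.max(dim=0).values)

# A pair of sigmoidal embedding dimensions separates the two sets under
# sum-pooling, even though no linear embedding can.
w1, w2, c = torch.tensor([1.0, 2.0]), torch.tensor([2.0, 1.0]), 2.5
black_code = torch.stack([embed(black, w1, c), embed(black, w2, c)]).sum(dim=1)
white_code = torch.stack([embed(white, w1, c), embed(white, w2, c)]).sum(dim=1)
print(black_code, white_code)   # clearly different pooled embeddings
```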

The nonlinear embedding of the entire set is the key point; linear point-wise embeddings followed by max-pooling may be sufficient when equivalent convex hulls are rare. However, we hypothesize that nonlinear embeddings are inherently more powerful and thus more useful since they have greater representational capacity.

Fig. 2: An example of simultaneous sum-, average-, and max-pooling ambiguity and its partial resolution via a nonlinear embedding. (a) The set of black points and the set of white points shown have the same coordinate-wise sums and maximums. The shading shows the activation of two sigmoidal functions that can be used to construct nonlinear 2D embeddings (b and c) that distinguish the two sets under sum- and average-pooling. (b) The embedding of the black points. (c) The embedding of the white points. Note that two points share nearly the same embedding.

III Experiments

We conduct experiments to evaluate the performance of CDANs against alternative architectures as well as examine the effects of different pooling operations. Our experiments focus especially on variable-size sets, which do not seem to have many existing results in the literature. All models were implemented and tested using the Keras [2] deep learning framework with the Theano [19] backend.

III-A Posture Recognition from Point Sets

A motion capture dataset of hand postures provides the primary basis for our evaluation (the dataset along with further documentation is available at http://www.latech.edu/~jkanno/collaborative.htm). The dataset consists of variable-size point sets representing five hand postures captured from 12 users. The size of each point set ranges from 3 to 12, although it should be noted that only 11 markers were physically present. Each point set shares the same coordinate system, so no rotations or translations should be required to process the data. Regardless, we center each point set to have zero mean in each dimension. The goal is to classify each point set as one of the five postures.

In order to make the problem more challenging, we do a leave-one-user-out evaluation where all but one user contribute to the training and validation sets and the test set is drawn exclusively from the left-out user. Each user is iteratively left out, and the resulting test accuracies are averaged to obtain a reasonable evaluation of the tested classifier’s generalization error. Training, validation, and test sets are disjoint and each consist of 75 uniformly randomly selected instances per class per user without replacement. This process is repeated five times in order to obtain some measure of confidence in the results.
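A sketch of this splitting protocol in plain Python (the data layout and helper name are hypothetical, chosen only to illustrate the per-user, per-class sampling described above):

```python
import random
from collections import defaultdict

def leave_one_user_out_split(instances, held_out_user, per_class=75, seed=0):
    """Sketch of the splitting protocol: the test set is drawn only from the
    held-out user; training and validation come from the remaining users.
    `instances` is assumed to be a list of (user_id, class_label, point_set)."""
    rng = random.Random(seed)
    by_user_class = defaultdict(list)
    for user, label, points in instances:
        by_user_class[(user, label)].append((label, points))

    train, valid, test = [], [], []
    for (user, _label), items in by_user_class.items():
        rng.shuffle(items)                            # uniform random selection
        if user == held_out_user:
            test += items[:per_class]                 # test: held-out user only
        else:
            train += items[:per_class]                # disjoint draws without
            valid += items[per_class:2 * per_class]   # replacement per class/user
    return train, valid, test
```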

III-B Model Specification

We compare a variety of CDAN architectures for this task, including linear embeddings with max-pooling, linear embeddings with sum-pooling (i.e. no embedding), and nonlinear embeddings with average-, sum-, and max-pooling. These models are compared against an RNN with gated recurrent units [3] as well as an experimental variant of the CDAN architecture with recurrent connections between the embeddings, which we call a recurrent deep averaging network (RDAN). In an RDAN, we trade permutation invariance and independent embeddings for increased functional capacity. Note that an RDAN is effectively just an RNN that is pooled over the entire time axis. Finally, we implement the permutation-equivariant layers of Guttenberg et al. [8] (defined by (4)) and Ravanbakhsh et al. [15] (defined by (3)) for an external, contemporaneous comparison. One or more permutational layers enable one to obtain dependent nonlinear embeddings that, unlike those of the RNNs, are permutation invariant after pooling. From this point forth, we refer to the permutational layer of Guttenberg et al. as a pairwise layer due to its structure and to distinguish it from the permutation-equivariant layer of Ravanbakhsh et al. We refer to the respective types of model as pairwise convolutional deep averaging networks (PCDANs) and permutational deep averaging networks (PDANs) for brevity. Despite the fact that the nonlinear embeddings of a CDAN are not always technically convolutions, we will sometimes refer to them as convolutional layers when compared against the recurrent, pairwise, and permutational layers of competing architectures.

Given the incredibly diverse array of architectural and training options available in the literature, we tried to make our architectures as uniform as possible in order to enable fair comparison. Since the recurrent architectures depend on the order of the input, points in each set were lexicographically sorted by their $x$-, $y$-, and $z$-coordinates. Gaussian noise with a standard deviation of 20 (millimeters, which is the scale of the input) was applied to the input as a form of regularization for each network. Dropout [18] of 10% was applied to the hidden layers of each network, and regularization with a magnitude of 0.001 was applied to the weights of each layer. We did not apply dropout to the input, but we did adopt the simultaneous dropout suggested by Ravanbakhsh et al. [15] for the PDANs, which consists of dropping a feature simultaneously in all elements of an input set rather than independently. A default embedding size of 11 was chosen for computational expedience as well as to let each embedded dimension hypothetically represent one of the physical markers. Two special (i.e. convolutional, recurrent, etc.) layers with 11 neurons each were used in each architecture. Each tested recurrent network was bidirectional [17] with the forward and backward RNN outputs concatenated at each timestep (i.e. 22-dimensional output). To clarify, the final timesteps in each direction were concatenated in the case of the plain RNN without pooling. Except in the case of linear embeddings, maxout activations [7] with 2 pieces were used in each layer except for the network output, which incorporated a softmax activation. Each model used the same post-embedding architecture, which consisted of one 11-neuron layer with a residual connection [9] followed by the 5-neuron (one per class) softmax layer.
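For reference, a minimal sketch of the simultaneous dropout described above, in which a feature is dropped in every element of a set at once. This is an illustrative PyTorch rendering of the idea, not the Keras implementation used in our experiments.

```python
import torch

def simultaneous_dropout(x, rate=0.1, training=True):
    """Drop each feature in all elements of the set at once (rather than
    independently per element), rescaling survivors as in ordinary dropout."""
    if not training or rate == 0.0:
        return x
    keep = (torch.rand(1, x.shape[1]) > rate).float()   # one mask per feature
    return x * keep / (1.0 - rate)                       # broadcast over the set

x = torch.randn(7, 11)                # a set of 7 elements with 11 features
y = simultaneous_dropout(x, rate=0.1)
# A dropped column is zero for every element of the set simultaneously.
```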

CDANs offer significant computational and practical advantages over the other architectures, arising primarily from the fact that the embeddings are independent. Unlike an RNN or RDAN, the embeddings can be computed in parallel rather than sequentially. In addition, only $n_i$ embedding function evaluations are required as opposed to $n_i^2$ function evaluations for a PCDAN. The fact that the embeddings are independent also enables their re-use in intersecting sets, whereas recurrent or permutational architectures must re-evaluate each point. PDANs are of similar complexity, although they lack any advantages derived from independent embeddings. For these reasons we were able to experiment with CDANs and PDANs with embedding sizes that are an order of magnitude higher than in the other models (100 to be precise) yet still require less computation.

The RMSProp [20] implementation provided by Keras [2] was used with a learning rate of 0.001 and minibatches of size 64. Training was terminated for a model if the validation loss did not improve after 40 epochs.

III-C Results and Discussion

Results are presented in Table I, where we can see that the highest average accuracy was achieved by a CDAN with sum-pooling and 100-dimensional nonlinear embeddings. We also immediately notice a significant difference between types of pooling for permutation-invariant architectures. RDANs, on the other hand, appear to be robust to changes in the pooling mode and to be just as effective as, if not marginally better than, the RNN.

Type    Embedding       Embedding Size   Pooling   Accuracy
CDAN    None            N/A              sum
CDAN    Linear          11               max
CDAN    Linear          100              max
CDAN    Nonlinear       11               average
CDAN    Nonlinear       11               max
CDAN    Nonlinear       11               sum
CDAN    Nonlinear       100              average
CDAN    Nonlinear       100              max
CDAN    Nonlinear       100              sum
RDAN    Recurrent       11               average
RDAN    Recurrent       11               max
RDAN    Recurrent       11               sum
RNN     Recurrent       11               N/A
PCDAN   Pairwise        11               average
PCDAN   Pairwise        11               max
PCDAN   Pairwise        11               sum
PDAN    Permutational   11               average
PDAN    Permutational   11               max
PDAN    Permutational   11               sum
PDAN    Permutational   100              average
PDAN    Permutational   100              max
PDAN    Permutational   100              sum

TABLE I: Average accuracies and standard deviations over five leave-one-user-out evaluations.

In general, we note that the highest accuracies achieved for each type are clustered around 90%. Our results do not provide enough confidence to say that the best-performing models are significantly different (in a statistical sense) from one another, but they do suggest a potential advantage to certain CDANs and a disadvantage to PDANs. The PDANs' slightly inferior performance may be explained by the fact that their permutation-equivariant layers are slightly more constrained than the competition. Furthermore, we tested only a portion of the possible architectures proposed by the framework of Ravanbakhsh et al. [15]. Regardless of whether the best CDAN achieves significantly higher accuracy than the competition, the computational advantages of CDANs over RNNs and PCDANs certainly warrant their use. In particular, reducing the embedding size to 11 renders a significantly more efficient classifier (with relatively few parameters) with only a marginal drop in accuracy.

We can hypothesize potential reasons for the pattern of results induced by different pooling modes. Note first that the difference between average- and sum-pooling must arise from the fact that the input sets are not of constant size. If the size were constant, then both pooling modes would be the same but for a constant factor. A potential cause for their difference here may thus arise from the fact that average-pooling effectively removes information (the implicitly encoded size of the set) and introduces ambiguity between certain set embeddings. On the other hand, max-pooling's relatively poor performance may be partially due to the choice of the maxout activation function. We noted in some exploratory trials that its accuracy significantly improved when paired with rectified linear unit (ReLU) activations. Some theoretical basis for sum-pooling's apparent advantage may be given by a probabilistic interpretation. Though the embeddings were not constrained by a softmax output, we may interpret them as the logarithm of unscaled posterior probabilities as indicated by a neural bag-of-words model [16]. The sum of the embeddings then gives the log-likelihood (shifted by some amount) for the parameters associated with each visual word given the point set.
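To spell this interpretation out in the notation of Section II (assuming Bayes' rule with a visual-word prior $p(w_k)$, which is our framing rather than an explicit model in [16]): if the $k$-th embedding dimension behaves as $\phi_k(\mathbf{x}) \approx \log p(w_k \mid \mathbf{x}) + c$ for some constant $c$, then sum-pooling gives

$\sum_{j=1}^{n_i} \phi_k(\mathbf{x}_{i,j}) \approx \log \prod_{j=1}^{n_i} p(\mathbf{x}_{i,j} \mid w_k) + n_i\left(c + \log p(w_k)\right) - \sum_{j=1}^{n_i} \log p(\mathbf{x}_{i,j})$,

where the final term is identical for every visual word given the set, so up to a prior term and a common shift, the pooled value is the log-likelihood of the point set under visual word $w_k$.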

We also show that nonlinear embeddings can yield significant gains over linear or identity (i.e. no) embeddings. Indeed, for this problem the identity embedding yields a network no better than guessing. The linear embedding with max-pooling, on the other hand, is competitive with its counterpart nonlinear embedding. However, whereas the nonlinear embedding could potentially be improved by adding more layers or changing its activation functions, the linear embedding is already exhausting its functional capacity.

IV Conclusion

We introduced the CDAN, a class of neural networks designed for classifying instances containing unordered, variable-size feature sets. The proposed architecture works by directly incorporating into the network's structure a function that embeds the features in a high-dimensional space, and then pooling the resulting embeddings. As the name implies, an equivalence can be drawn between the convolution operation in CNNs and the application of the embedding function. Experiments show that, in terms of accuracy, CDANs are competitive with recurrent and permutation-equivariant architectures. CDANs are also computationally efficient compared to these alternatives, favoring parallel implementations and re-use of prior results since each feature embedding does not depend on the rest of the set. In addition, the learned feature embeddings are a useful by-product that can potentially solve related problems such as nonlinear clustering.

Future work may include further exploration of CDAN properties and optimal architectures when applied to other problems or datasets. One should note that networks incorporating convolutional, recurrent, or permutation-equivariant layers need not be mutually exclusive. Architectures that perform a convolutional embedding prior to a permutation-equivariant layer (or vice-versa) may be worth exploring and could be capable of achieving results superior to either method when used alone.

References