Neural networks have revolutionized many areas of machine learning, including image and natural language processing. However, one of the major challenges for neural networks is that they are still black boxes to the user. It is not quite clear how network internals map from inputs to outputs or how to interpret the features learned by the nodes. This is mainly because the features are not constrained to have specific structure or characteristics. Existing regularizations constrain the learned code to have certain properties. However, they are not designed to specifically aid in interpretation of the latent encoding. For example, $\ell_1$ regularization induces sparsity in the activations but does not impose specific structure between dimensions.
Here, we introduce a new class of regularizations called Graph Spectral Regularizations that result in activations that are filtered on a predefined graph. We define specific members of this class for applications. First, we introduce a (graph) Laplacian smoothing regularization which enforces smoothly varying activations while at the same time reconstructing the data. This regularization is useful for learning features with specific topologies. For instance, we show that on a cluster-structured topology, where features correspond to hierarchical cluster structure in the data, it reflects the abstract grouping of features. We also show it is useful for inducing feature consistency between nodes of capsule networks [sabour2017capsules]. The graph regularization semantically aligns the features such that they appear in the same order in each capsule. When trained on MNIST digits, we find that each of our 10 capsules consisting of 16 nodes encodes the same transformation (rotation, scale, skew, etc.) of a particular digit in the same node.
While the Laplacian smoothing regularization is useful in contexts where the features of the data have a recognizable topology, often we don’t know the explicit structure of the data. Instead, we would like to extract the topology of the data itself. Thus, we design a filter that encourages the graph-structured layer to learn data-shape features. We achieve this by using a spatially localized Gaussian filter to localize the activations for any particular data point. We ensure that only one of a dictionary of localized filters is chosen as the activation via a spectral bottleneck layer preceding the graph-structured layer. We show that spatially-localized filter regularizations are useful for detecting circular and linear topologies of data that are not immediately reflected by the observed features. We also explore a biological system – a single-cell protein expression dataset depicting T cell development in the thymus – that has continuous progression structure. The graph-structured layer (with a ring graph) reveals the data to have a Y-shaped topology reflecting the bifurcation into CD4 (regulatory) and CD8 (cytotoxic) T cells, confirming known T cell biology.
Finally, we show that the graph-structured layer, when imposing a 2D grid, creates a “pseudo” image that can be analyzed by convolution layers. We show that such re-encoded images of MNIST digits have localized receptive fields that can be used for classification and visual interpretability. Interestingly, we find that the convolution obviates the need for a spectral bottleneck as the convolution and max pooling themselves may provide that function.
Our contributions are as follows:
A framework for imposing graph structure on latent layers using graph spectral regularizations.
A Laplacian graph smoothing regularization and its applications in learning feature smoothness and consistency.
Spatially localized graph regularizations using a spectral bottleneck based on a dictionary of Gaussian kernels and its application in recognizing data topology.
Applications of graph spectral regularizations to natural and biological datasets to demonstrate feature interpretability and data topology.
2 Graph-structured layer
Consider a given layer in an artificial neural network, and let $\mathcal{N}$ be the neurons in this layer. These neurons can essentially be regarded as functions of inputs to the neural network, and map each input to an $|\mathcal{N}|$-dimensional vector. Typically, no particular structure is directly imposed on the range of $\mathcal{N}$ beyond general notions of bounded norm (e.g., Euclidean norm or $\ell_1$ norm as a proxy for sparsity). This is clear with fully connected layers, but even with convolutional ones that introduce some relation between neurons within each layer, there is still no clear topological structure for the representations obtained by the entire layer as a whole. Indeed, convolutional layers typically learn multiple channels (or filters) from input signals, and while neurons within each channel can be organized in spatiotemporal coordinates, there are no imposed relations between the different channels.
Since many applications of neural networks essentially focus on supervised learning tasks (whether predictive or generative), where hidden layer neurons are only used as intermediate computational units, their unstructured nature is not considered an important issue. However, when using neural networks for unsupervised and exploratory tasks, in which the hidden layers are treated as latent data representations, their lack of structure makes the interpretation of the resulting representations challenging if not impossible. To address this challenge, autoencoders and similar unsupervised deep models typically restrict their bottleneck layer to have two or three neurons, thus mapping data points to $\mathbb{R}^2$ or $\mathbb{R}^3$, where the entire data can be visualized as a 2D or 3D cloud of points. While this approach is useful for getting a general understanding of structures (e.g., clustering or trends) in the data, it significantly limits the amount of information that can be captured by such latent representations. Furthermore, it underutilizes the ability of a human observer to identify rich patterns in 2D displays. Indeed, human perception in natural settings is not tuned to observe particle clouds, but rather to recognize shapes, identify textures, and assess relative sizes of objects. Such elements are often used by visualization techniques in various fields, such as TreeMap [shneiderman1992treemaps], audio spectrograms [flanagan1972speech], and even classic box plots and histograms [tukey1977exploratory].
Here, we propose a new approach for producing human-interpretable patterns in the latent representation obtained by hidden layers of neural networks. To this end, we impose a graph topology on the neurons in $\mathcal{N}$, which enables us to control spectral properties (e.g., regularity and locality) of latent representations obtained by them to produce recognizable patterns. Formally, we consider the neurons as vertices of a graph with weighted edges between them defined by an adjacency matrix $W$. Then, for any given input $x$ to the network, we consider the activations $\mathcal{N}(x)$ as a signal over this neuron graph. We propose here three ways to utilize this graph signal structure of a neural network layer. First, by considering graph signal processing notions, we can define spectral regularizations that control the regularity or smoothness of learned signals, as explained in Section 2.1. Second, by utilizing sparse graph wavelet dictionaries, we can define a new type of bottleneck that forces autoencoder middle layers to encode inputs by a sparse set of localized dictionary atoms (e.g., as done in compressive sensing and dictionary learning), as explained in Section 2.2. Finally, by utilizing simple graph structures, such as a ring graph or a 2D lattice, we can force downstream layers to be convolutional layers, regardless of whether the input of the network was structured or not. Section 3 demonstrates the utility of each of these design choices for several data exploration applications, in both supervised and unsupervised settings.
2.1 Spectral graph regularization
The graph structure defined by the weighted adjacency matrix $W$ naturally provides a notion of locality, based on local graph neighborhoods, or geodesic distances on the graph. However, recent works on graph signal processing [e.g., shuman2013emerging, and references therein] also explored spectral notions provided by graph structures, which extend and generalize traditional signal processing and harmonic analysis notions. These notions are based on the definition of a graph Laplacian as

$$L = D - W, \qquad D = \mathrm{diag}(d_1, \ldots, d_{|\mathcal{N}|}), \quad d_i = \sum_j W_{ij},$$

which provides (either directly or via proper normalization) a discrete version of well-studied manifold Laplace operators from geometric harmonic analysis [e.g., belkin2002laplacian, coifman2006diffusion]. Then, the eigenvectors $\psi_1, \ldots, \psi_{|\mathcal{N}|}$ of $L$ (indexed by convention in ascending order of the corresponding eigenvalues $\lambda_1 \leq \cdots \leq \lambda_{|\mathcal{N}|}$) are considered graph Fourier harmonics. These can be shown to converge to discrete Fourier harmonics when considering a ring graph, in which case their associated eigenvalues correspond to squared frequencies. The graph Fourier transform is then defined via the matrix $\Psi$ whose columns are the graph harmonics [shuman2013emerging]. When applied to a neuron-activation signal $\mathcal{N}(x)$ in our setting we get its graph Fourier coefficients as $\hat{\mathcal{N}}(x) = \Psi^T \mathcal{N}(x)$, where $\hat{\mathcal{N}}(x)[i]$ is associated with the graph (squared) frequency $\lambda_i$. Similarly, the inverse Fourier transform is also defined via $\Psi$, as it can be verified that $\mathcal{N}(x) = \Psi \hat{\mathcal{N}}(x)$, since the graph Laplacian (for all graphs considered in this work) yields a full orthonormal set of eigenvectors, and thus $\Psi$ is an orthogonal matrix [shuman2013emerging, hammond2011wavelets].
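As a concrete illustration of the graph Fourier transform described above, the following is a minimal NumPy sketch, assuming an unweighted ring graph and the unnormalized Laplacian (degree matrix minus adjacency). The helper names `ring_adjacency` and `graph_fourier_basis`, the graph size, and the random stand-in signal are illustrative choices and are not taken from the original experiments.

```python
import numpy as np

def ring_adjacency(n):
    """Unweighted ring graph: node i is connected to nodes i-1 and i+1 (mod n)."""
    W = np.zeros((n, n))
    idx = np.arange(n)
    W[idx, (idx + 1) % n] = 1.0
    W[(idx + 1) % n, idx] = 1.0
    return W

def graph_fourier_basis(W):
    """Eigenvalues (ascending) and orthonormal eigenvectors of L = D - W."""
    L = np.diag(W.sum(axis=1)) - W
    # eigh is meant for symmetric matrices and returns ascending eigenvalues.
    lam, Psi = np.linalg.eigh(L)
    return lam, Psi

n = 20                                # hypothetical number of neurons
W = ring_adjacency(n)
lam, Psi = graph_fourier_basis(W)

rng = np.random.default_rng(0)
signal = rng.normal(size=n)           # stand-in for one activation vector
coeffs = Psi.T @ signal               # graph Fourier transform
recovered = Psi @ coeffs              # inverse graph Fourier transform
```

Because the eigenvector matrix is orthogonal, `recovered` matches `signal` up to floating-point error, and on the ring the eigenvalues take the closed form 2 - 2cos(2πk/n), which is the sense in which they behave like squared frequencies.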
The graph Fourier transform allows us to consider the activations $\mathcal{N}(x)$ for a given input $x$ of the network in two domains. First, in their original form, these activations form a signal over the neuron graph, which can essentially be considered as a function of individual neurons, with $n \mapsto \mathcal{N}(x)[n]$, $n \in \mathcal{N}$, which we refer to as the neuron-domain representation. Alternatively, we can consider this signal in the spectral domain, via its Fourier coefficients $\hat{\mathcal{N}}(x)$, as a function $i \mapsto \hat{\mathcal{N}}(x)[i]$, $i = 1, \ldots, |\mathcal{N}|$, of graph-harmonic indices. This allows us to pose a new set of regularizations that are defined in the spectral domain, rather than in the neuron domain, in order to directly enforce spectral properties of the neuron activation signal.
We note that one of the most popular regularizations traditionally used in deep learning is the $\ell_2$ regularization, which essentially adds the squared Euclidean norm of the activations in the neuron domain (i.e., $\|\mathcal{N}(x)\|_2^2$) as another term in the main optimization target for gradient descent. Such regularization encourages the activations to all have equivalently small values, essentially providing a global notion of smoothness that is then balanced with other loss terms but typically provides stability to the optimization process [goodfellow2016dl]. Since it can be verified that the graph Fourier transform is energy preserving (i.e., $\|\hat{\mathcal{N}}(x)\|_2 = \|\mathcal{N}(x)\|_2$, due to the orthonormality of graph harmonics), this regularization can equivalently be considered in the spectral domain by using $\|\hat{\mathcal{N}}(x)\|_2^2$ instead of $\|\mathcal{N}(x)\|_2^2$. However, unlike the neuron domain, in the spectral domain we can associate with each element $\hat{\mathcal{N}}(x)[i]$ a harmonic interpretation given by $\lambda_i$, and therefore we can also generalize the spectral regularization to be weighted by functions of these squared frequencies. Namely, given weights $w_i = f(\lambda_i)$, $i = 1, \ldots, |\mathcal{N}|$, for some function $f$, we define the spectral regularization as adding the term

$$\sum_{i=1}^{|\mathcal{N}|} w_i \, \hat{\mathcal{N}}(x)[i]^2 = \hat{\mathcal{N}}(x)^T \, \Omega \, \hat{\mathcal{N}}(x),$$

where $\Omega = \mathrm{diag}(w_1, \ldots, w_{|\mathcal{N}|})$, to the optimization loss of the neural network. Notice that such weighting cannot be directly defined in the neuron domain, as individual neurons are not associated with any a priori interpretation. Instead, to apply the spectral regularization in the neuron domain, we consider the matrix form on the RHS, and by combining it with the definition of the graph Fourier transform, $\hat{\mathcal{N}}(x) = \Psi^T \mathcal{N}(x)$, we can write the spectral regularization term in the neuron domain as the quadratic form $\mathcal{N}(x)^T M \, \mathcal{N}(x)$, where $M = \Psi \Omega \Psi^T$ is a matrix that is independent of the network input $x$, and can thus be directly computed in advance from the neuron graph structure and the predetermined spectral weights in $\Omega$. Finally, while in this work we focus on the $\ell_2$ form of spectral regularizations, in general other weighted norms (e.g., $\ell_1$ or, more generally, $\ell_p$) can also be used to define spectral regularizations based on the harmonic structure induced by the neuron graph.
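The equivalence between the spectral-domain and neuron-domain forms of the regularization can be checked numerically. The sketch below is only an illustration under stated assumptions: the path graph on eight neurons, the quadratic weight function `weight_fn`, and the helper name `spectral_penalty_matrix` are hypothetical choices, not the settings used in the experiments.

```python
import numpy as np

def weight_fn(lam):
    """Hypothetical spectral weight function f(lambda) = lambda^2."""
    return lam ** 2

def spectral_penalty_matrix(W, f):
    """Precompute M = Psi diag(f(lambda)) Psi^T from the neuron graph alone."""
    L = np.diag(W.sum(axis=1)) - W
    lam, Psi = np.linalg.eigh(L)
    lam = np.clip(lam, 0.0, None)     # remove tiny negative round-off
    M = Psi @ np.diag(f(lam)) @ Psi.T
    return M, lam, Psi

# Path graph on 8 neurons (hypothetical example graph).
n = 8
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0

M, lam, Psi = spectral_penalty_matrix(W, weight_fn)

rng = np.random.default_rng(1)
a = rng.normal(size=n)                # stand-in activation vector

spectral_form = np.sum(weight_fn(lam) * (Psi.T @ a) ** 2)  # weighted sum of squared coefficients
neuron_form = a @ M @ a                                    # quadratic form in the neuron domain
```

The two values agree up to round-off, and the matrix `M` depends only on the graph and the weights, so in a training loop it would be computed once before optimization starts.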
Laplacian smoothing regularization:
We now focus on the class of spectral regularizations with nonnegative weights, i.e., $w_i \geq 0$. In such cases, the chosen weights determine which harmonic bands to penalize, and by how much, in the resulting regularization. Therefore, these spectral regularizations enable us to guide the latent representation provided by the graph-structured layer towards, or away from, certain harmonic patterns. In particular, this enables us to encourage smooth latent representations (in the neuron graph sense) by using weights that penalize high frequencies. A natural choice for such weights is to simply set them by the identity $f(\lambda) = \lambda$, which results in $w_i = \lambda_i$. The quadratic form of this loss is thus $\mathcal{N}(x)^T L \, \mathcal{N}(x)$, since $\Psi \, \mathrm{diag}(\lambda_1, \ldots, \lambda_{|\mathcal{N}|}) \, \Psi^T = L$. This regularization has been proposed in various forms in the graph signal processing literature [see, e.g., belkin2004regularization, zhou2004regularization, shuman2013emerging]. We refer to this regularization as Laplacian smoothing and demonstrate its utility for producing interpretable latent representations in hidden layers of neural networks in Section 3.
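To see why this penalty favors smooth activations, the quadratic form can be expanded edge by edge. The derivation below is a standard identity, stated here under the assumptions that the adjacency matrix $W$ is symmetric and that the Laplacian is the unnormalized one, $L = D - W$; the shorthand $a_i$ for the activation of neuron $i$ and $d_i = \sum_j W_{ij}$ for its degree is introduced only for this illustration.

```latex
\begin{aligned}
a^T L \, a
  &= a^T D \, a - a^T W a \\
  &= \sum_i d_i a_i^2 - \sum_{i,j} W_{ij} a_i a_j \\
  &= \frac{1}{2} \sum_{i,j} W_{ij} \left( a_i^2 - 2 a_i a_j + a_j^2 \right) \\
  &= \frac{1}{2} \sum_{i,j} W_{ij} \left( a_i - a_j \right)^2 .
\end{aligned}
```

The penalty is therefore zero exactly when connected neurons carry equal activations, and it grows with the squared difference across each weighted edge.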
2.2 Spectral bottleneck
The spectral regularization defined in Section 2.1 is based on representing an activation signal as a linear combination of graph harmonics, and then formulating a regularization over the corresponding coefficients of the combination, which are given by the graph Fourier transform. This principle can be extended by considering more general notions of dictionary learning [olshausen1997sparse]. In general, such methods seek to define a dictionary of representative atoms that can be used to effectively represent signals, while capturing (in each atom) certain patterns, such as spatial or spectral locality. Under this terminology, the graph Fourier transform is based on a dictionary consisting of the graph harmonics as atoms that are extremely local in frequency. However, it can be shown that these atoms are often not spatially local over the neuron graph. To extend our approach to also include notions of spatial locality, we propose to also consider dictionaries that are based on bandlimited filters, such as graph wavelets or translated Gaussians [hammond2011wavelets, shuman2016vertex]. Let $\mathcal{D}$ be such a dictionary, and let $\Phi$ be a matrix whose columns are the atoms in the dictionary. We note that while, in general, the atoms in the dictionary need not be orthonormal, we assume they are chosen such that the matrix $\Phi$ has a suitable pseudoinverse $\Phi^{\dagger}$ such that $\Phi^{\dagger} \Phi = I$. Then, the best approximation of a neuron signal $\mathcal{N}(x)$, for network input $x$, by a linear combination of dictionary atoms is given by $\Phi \Phi^{\dagger} \mathcal{N}(x)$, which can be written directly as a linear combination with the dictionary coefficients given by $\Phi^{\dagger} \mathcal{N}(x)$.
The dictionary coefficients computed by $\Phi^{\dagger}$ provide a dictionary-based extension of the Fourier coefficients computed by the graph Fourier transform. Therefore, the same spectral regularization discussed in the previous section can be directly generalized to dictionary-based regularization. However, the dictionary coefficients also provide an alternative utilization as a new type of information bottleneck in the neural network, inspired by sparse representations commonly used in compressive sensing [qaisar2013compressive]. In particular, here we consider a bottleneck that forces the network to project its neuron activation signal on a single dictionary atom, which would be chosen adaptively depending on the input $x$, and then only pass this atom to subsequent layers. To achieve this bottleneck, we split the graph-structured layer into two parts $\mathcal{N}_{\mathrm{pre}}$ and $\mathcal{N}_{\mathrm{post}}$, before and after the bottleneck (correspondingly), such that

$$\mathcal{N}_{\mathrm{post}}(x) = \Phi \, \mathrm{softmax}\!\left( \Phi^{\dagger} \mathcal{N}_{\mathrm{pre}}(x) \right),$$

where softmax is defined as a function that maps vectors in $\mathbb{R}^{|\mathcal{D}|}$ to unit-norm, approximately one-hot vectors [bishop2006]. Therefore, previous layers in the network feed into the neuron graph signal $\mathcal{N}_{\mathrm{pre}}(x)$, which is then passed through the bottleneck to produce a filtered signal $\mathcal{N}_{\mathrm{post}}(x)$ based on approximately one atom out of the provided ones in the dictionary, and then the activations from this new signal are passed on to subsequent layers. In Section 3, we show that by using a spectral bottleneck that is based on graph-translated Gaussians as dictionary atoms, we can force individual inputs processed by the network to be projected on local regions of the neuron graph. This, in turn, allows us to organize the latent representation in the graph-structured layer into “receptive fields” over the neuron graph, which capture local regions in the input data and uncover trends in it based on the imposed graph structure.
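The bottleneck described above can be sketched in a few lines of NumPy. This is a simplified illustration rather than the authors' implementation: the atoms here are Gaussian bumps of the hop distance on a ring (a stand-in for graph-translated Gaussians), and the function names, the atom centers, the width `sigma`, and the `temperature` parameter are all assumptions introduced for the example.

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax; output is nonnegative and sums to one."""
    e = np.exp(v - v.max())
    return e / e.sum()

def gaussian_atoms_on_ring(n, centers, sigma):
    """Matrix whose columns are unit-norm Gaussian bumps centered on ring nodes."""
    idx = np.arange(n)
    atoms = []
    for c in centers:
        d = np.abs(idx - c)
        d = np.minimum(d, n - d)               # wrap-around (ring) distance
        atoms.append(np.exp(-d ** 2 / (2 * sigma ** 2)))
    Phi = np.stack(atoms, axis=1)
    return Phi / np.linalg.norm(Phi, axis=0)

def spectral_bottleneck(pre, Phi, temperature=1.0):
    """Project onto the dictionary, gate with softmax, map back to neurons."""
    coeffs = np.linalg.pinv(Phi) @ pre         # dictionary coefficients
    gate = softmax(coeffs / temperature)       # approximately one-hot selection
    return Phi @ gate, gate

n = 20
Phi = gaussian_atoms_on_ring(n, centers=[0, 5, 10, 15], sigma=1.5)
pre = 4.0 * Phi[:, 2]                          # signal dominated by the third atom
post, gate = spectral_bottleneck(pre, Phi, temperature=0.1)
```

With a signal dominated by one atom and a low temperature, `gate` is close to one-hot and `post` is close to the selected atom, so only a localized region of the ring is active after the bottleneck.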
3 Empirical Results
Topological Inference Using Laplacian Smoothing Regularization
First, as a sanity check, we demonstrate graph spectral regularization on data that is generated with a specific topology. Our data has a hierarchical cluster structure, where there are 3 large-scale structures, each comprising two Gaussian subclusters generated in 15 dimensions (see Figure 1). We use a graph-structured layer with 6 nodes forming 3 connected node pairs and employ the Laplacian smoothing regularization. After training, we find that each node pair acts as a “supernode” that detects one large-scale cluster. Within each supernode, each of the two nodes encodes one of the two Gaussian subclusters. Thus, this specific graph topology is able to extract the hierarchical topology of the data.
Semantic Feature Organization in Capsule Networks
Next, we demonstrate Laplacian smoothing regularization on a natural dataset. Here, instead of using an autoencoder framework, we use a capsule network consisting of 10 capsules of 16 nodes. In the original capsule network paper, [sabour2017capsules] construct an architecture that is able to represent each digit in a 16 dimensional vector using a reconstruction penalty with a decoder. They notice that some of these 16 dimensions turn out to represent semantically meaningful differences in digit representation such as digit scale, skew, and width. We train the capsule net on the MNIST handwritten digit dataset with the Laplacian smoothing regularization applied between the matching ordinal nodes of each capsule using fully connected graphs. We show in Figure 2 that without the regularization each individual capsule in the network derives its own ordering of features that it learns. However, with the graph regularization we obtain a consistent feature ordering, e.g. node 5 corresponds to line thickness across all digits. Thus, the Laplacian smoothing regularization enforces a more interpretable encoding with “tunable” knobs that can be used to generate data with specific properties, as shown in Figure 2.
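One way to read the capsule regularization above is as a separate fully connected graph for each node index, joining the nodes that share that index across capsules. The sketch below follows that reading under explicit assumptions: unit edge weights, a NumPy array of shape (capsules, nodes per capsule) holding the activations for one input, and the hypothetical helper names `complete_graph_laplacian` and `capsule_alignment_penalty`. It is not the training code used for the experiments.

```python
import numpy as np

def complete_graph_laplacian(m):
    """Laplacian of the fully connected graph on m vertices with unit weights."""
    W = np.ones((m, m)) - np.eye(m)
    return np.diag(W.sum(axis=1)) - W

def capsule_alignment_penalty(caps):
    """caps has shape (num_capsules, nodes_per_capsule) for a single input.

    Returns the sum over node indices k of caps[:, k]^T L caps[:, k], where L
    joins the k-th node of every capsule to the k-th node of every other one.
    """
    L = complete_graph_laplacian(caps.shape[0])
    return np.einsum("ik,ij,jk->", caps, L, caps)

rng = np.random.default_rng(0)
caps = rng.normal(size=(10, 16))                       # 10 capsules of 16 nodes
penalty = capsule_alignment_penalty(caps)

aligned = np.tile(rng.normal(size=16), (10, 1))        # every capsule identical
aligned_penalty = capsule_alignment_penalty(aligned)   # zero up to round-off
```

Under these assumptions the penalty equals the sum of squared differences between every pair of capsules, so it vanishes only when matching nodes agree across capsules, which is the pressure toward a consistent feature ordering.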
Spectral Bottleneck Regularization
Next, we impose a linear graph-structure on a dataset generated to have a one-dimensional progression in 20 ambient dimensions (see Figure 3). Here we see that without any graph structure regularization the encoding by the features is arbitrary. Once the Laplacian smoothing regularization is enforced, subsequent points in the progression have smoothly varying changes. Next, in order to make features correspond to data topology, we introduce a spectral bottleneck using an additional layer preceding the spectral regularization layer. This bottleneck layer, using a softmax, effectively chooses one atom from a dictionary of Gaussian kernel-shaped filters for the activations. We see that with the addition of this regularization, we have features of the layer encoding (and activating for) different parts of the graph.
Topological Analysis of T cell Development
Next, we show that spatially-localized filter regularizations are useful for learning characteristic features of different parts of the data topology. We test this ability on biological mass cytometry data, which is high-dimensional, single-cell protein data, measured on differentiating T cells from the thymus [setty2016wishbone]. The T cells lie along a bifurcating progression where the cells eventually diverge into two lineages (CD4+ and CD8+). Here, we see that the spectral bottleneck compactly encodes the branches in specific nodes and thereby creates a receptive field for the data topology. Examination of these nodes reveals the input protein features that characterize the different parts of the progression. From the activation heatmap, we see that one major difference between the blue and green branches is the activation of node 18. We see from the heatmap that correlates nodes to gene activations that node 18 is positively correlated with CD8 and negatively correlated with CD4, and thus this node is the switch between the two lineages. Nodes 6-12 are low in the red branch and high in the other two branches. Since these nodes are positively correlated with CD3, TCRb and CD127, this indicates that cells further along in differentiation (blue and green branches) have acquired higher levels of canonically mature T cell markers (CD3 and TCRb) as well as the naive T cell marker CD127. Although this analysis was done on a relatively low-dimensional dataset which could be analyzed using other methods, it corroborates that the receptive fields produced by the spectral bottleneck offer meaningful features that can be examined to characterize parts of the data topology and can be applied to more complex, higher-dimensional datasets.
Pseudo-Images and Convolutions for Human Interpretable Encodings
Finally, we show the capability of graph-structured regularizations to create pseudo-images from data. Without graph-structured regularization, activations appear unstructured to the human eye and as a result are hard to interpret (See Figure 5). However, using Laplacian smoothing over a 2D lattice graph we can make this representation more visually distinguishable. Since we now take this embedding as an image, it is possible to use a standard convolutional architecture in subsequent layers in order to further filter the encodings. When we add 3 layers of 3x3 2D convolutions with 2x2 max pooling we see that representations for each digit are compressed into specific areas of the image (Figure 5). Now, by visual inspection of this high dimensional embedding layer, we are able to quickly visually categorize inputs. We show that the layer is segmentable, with receptive fields for each digit, thus making the layer amenable for classification. Further, we note that the classification penalty along with the graph-structured layer by itself induces spatial localization, without a spectral bottleneck layer or localized filter regularization. We speculate that this is a result of the convolutions combined with max pooling inducing spatially localized features.
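For the 2D lattice case, the Laplacian can be assembled from two path-graph Laplacians with a Kronecker sum. The following NumPy sketch assumes a 4-neighbour lattice with unit edge weights and row-major ordering of the neurons; the grid size and the helper names `path_laplacian`, `grid_laplacian`, and `lattice_penalty` are illustrative assumptions, since the layer dimensions are not stated in this section.

```python
import numpy as np

def path_laplacian(n):
    """Laplacian of an unweighted path graph on n nodes."""
    W = np.zeros((n, n))
    i = np.arange(n - 1)
    W[i, i + 1] = 1.0
    W[i + 1, i] = 1.0
    return np.diag(W.sum(axis=1)) - W

def grid_laplacian(rows, cols):
    """Laplacian of a rows x cols 4-neighbour lattice, nodes in row-major order."""
    vertical = np.kron(path_laplacian(rows), np.eye(cols))    # edges between rows
    horizontal = np.kron(np.eye(rows), path_laplacian(cols))  # edges within a row
    return vertical + horizontal

def lattice_penalty(image, L):
    """Laplacian smoothing penalty of a 2D activation map."""
    a = image.reshape(-1)          # row-major flattening matches grid_laplacian
    return a @ L @ a

rows, cols = 6, 8                  # hypothetical layer reshaped as a 6 x 8 grid
L = grid_laplacian(rows, cols)
rng = np.random.default_rng(0)
img = rng.normal(size=(rows, cols))
penalty = lattice_penalty(img, L)
```

The resulting penalty equals the sum of squared differences between horizontally and vertically adjacent entries, so minimizing it pushes the activation map toward a locally smooth pseudo-image.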
Here, we have introduced a class of graph spectral regularizations that impose graph structure on the activations of hidden layers and show they allow for more interpretable encodings. These include a Laplacian smoothing regularization that creates locally smooth activation patterns which can reflect structure and progression in the associated data as well as consistency of features as demonstrated on capsule nets. Next, we show that if we constrain the node activations to be more spatially localized on the imposed graph structure, using wavelet-like filters, we enable the hidden layers to learn features associated with different parts of the data topology. For example, we can extract biologically meaningful characterizations of a bifurcating differentiation structure in mass cytometry data measuring T cell differentiation. Finally, we show that graph structured regularizations can create pseudo-images when the underlying graph is a grid, making the data amenable to convolutions and other image-processing techniques such as segmenting. We show that such segmentation gives receptive fields that allow for human interpretable activations of high dimensional hidden layers. Normally, visualization comes at the cost of dimensionality as only layers containing two or three dimensions can be visualized. Finally, we note that graph structured regularizations encode datapoints as signals on the graph and thus graph signal processing may be used in future work to analyze such data.
Appendix A Artificial Data Generation
Hierarchical Cluster Dataset
We simulate three gene modules, each with two sub-modules, for a total of six clusters. There are 15 genes, where five are associated with each larger module. For every datapoint exactly one of the three gene modules is “active”. This is represented by a mean shift of 10 in all genes associated with the module. To distinguish the sub-modules within larger modules, one of the sub-modules has an additional mean shift of 10 in two of the five genes when active. We then add Gaussian noise independently to each feature with mean zero and standard deviation one.
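The generation procedure above can be sketched as follows. Several details are not specified in the text and are assumptions of this sketch: modules and sub-modules are sampled uniformly, the extra shift is applied to the first two genes of the active module, and the function name `make_hierarchical_clusters` and the sample count are hypothetical.

```python
import numpy as np

def make_hierarchical_clusters(n_points, seed=0):
    """Simulate 15 genes in 3 modules of 5 genes, each with 2 sub-modules."""
    rng = np.random.default_rng(seed)
    module = rng.integers(0, 3, size=n_points)   # active module (assumed uniform)
    sub = rng.integers(0, 2, size=n_points)      # active sub-module (assumed uniform)
    X = np.zeros((n_points, 15))
    for i in range(n_points):
        start = 5 * module[i]
        X[i, start:start + 5] += 10.0            # mean shift for the active module
        if sub[i] == 1:
            # Assumption: the extra shift affects the first two genes of the module.
            X[i, start:start + 2] += 10.0
    X += rng.normal(0.0, 1.0, size=X.shape)      # independent unit-variance noise
    return X, module, sub

X, module, sub = make_hierarchical_clusters(600)
```

Each of the six (module, sub-module) combinations then forms one Gaussian cluster, with the two sub-modules of a module differing only in two coordinates.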
We simulate a linear dataset by sequential feature activation. We generate labels by uniformly sampling numbers between zero and ten. Looking feature by feature, the first feature has values approximately equal to the probability density function (pdf) of the normal distribution with mean one and standard deviation one. The second feature has values approximating the pdf of, and the third feature . We generate 60,000 data points in this way and add independent Gaussian noise with mean zero and standard deviation of 0.001.
Effect of increasing Laplacian smoothing regularization
We analyze both Laplacian smoothing and spatially-localized representations on images of a rotating teapot (see Figure 6) [weinberger2004learning]. We include a graph spectral layer with a ring topology of 20 nodes. We gradually increase the smoothness coefficient (from 0 to 0.1) and show that, at intermediate levels of the coefficient (between 0.0001 and 0.001), we obtain an activation pattern that is smooth on the imposed ring graph. At 0 regularization there is no smoothness on the graph topology (horizontal smoothness in the activation heatmap) and the activations appear to be randomly ordered. At intermediate smoothing we can observe smooth structures with seemingly ordered activations. At high smoothing (0.01 and higher) the activations all become the same, effectively creating one node/dimension in the activations.
Appendix B Experiment Specifics
We use leaky ReLUs with a coefficient of 0.2 [maas2013rectifier] for all layers except for the embedding and output layers unless otherwise specified. We use the Adam optimizer with default parameters [kingma2014adam].
Laplacian Smoothing on an Autoencoder
We use an autoencoder with five fully connected layers. The layers have widths [50, 50, 20, 50, 50]. To perform Laplacian smoothing on this autoencoder we add a term to the loss function. Let $\mathcal{N}(x)$ be the activation vector on the embedding layer for input $x$, and let $L$ be the Laplacian of the neuron graph; then we add a penalty term $\alpha \, \mathcal{N}(x)^T L \, \mathcal{N}(x)$, where $\alpha$ is a weighting hyperparameter, to the standard mean squared error loss.
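The combined objective can be written out as a short function. The sketch below uses plain NumPy rather than a deep learning framework, averages the penalty over the batch, and uses a ring graph on the 20 embedding neurons; the batch averaging, the ring choice, the input width, and the names `ring_laplacian`, `laplacian_smoothing_loss`, and `alpha` are assumptions made for illustration.

```python
import numpy as np

def ring_laplacian(n):
    """Unnormalized Laplacian of an unweighted ring graph on n nodes."""
    idx = np.arange(n)
    W = np.zeros((n, n))
    W[idx, (idx + 1) % n] = 1.0
    W[(idx + 1) % n, idx] = 1.0
    return np.diag(W.sum(axis=1)) - W

def laplacian_smoothing_loss(x, x_hat, z, L, alpha):
    """Mean squared reconstruction error plus alpha times the mean of z_b^T L z_b.

    x, x_hat: (batch, features) inputs and reconstructions.
    z:        (batch, neurons) embedding-layer activations.
    """
    mse = np.mean((x - x_hat) ** 2)
    smooth = np.mean(np.einsum("bi,ij,bj->b", z, L, z))
    return mse + alpha * smooth

L = ring_laplacian(20)                        # embedding layer of width 20
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 15))                  # hypothetical batch of inputs
z_flat = np.ones((4, 20))                     # constant activations: no penalty
z_rough = np.tile([1.0, -1.0], (4, 10))       # alternating signs: maximal roughness

loss_flat = laplacian_smoothing_loss(x, x, z_flat, L, alpha=0.001)
loss_rough = laplacian_smoothing_loss(x, x, z_rough, L, alpha=0.001)
```

With a perfect reconstruction, the constant embedding incurs no penalty while the alternating embedding pays four units on each of the twenty ring edges, which is the behaviour the weighting hyperparameter trades off against reconstruction error.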
Spectral Bottlenecking in an Autoencoder
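To build a spectral bottleneck layer, let $\Phi$ be an $n \times m$ matrix, for $n$ neurons in the layer and $m$ dictionary atoms, whose columns are the atoms in the dictionary. Then we replace the embedding layer with a layer that computes $\Phi \, \mathrm{softmax}(\Phi^{\dagger} \mathcal{N}(x))$, where $\mathcal{N}(x)$ is the activation vector for input $x$ and $\Phi^{\dagger}$ is the pseudoinverse of $\Phi$. Effectively, we transform the activation vector into the spectral domain, compute the softmax function on it, and restore the output to activations in the neuron domain. To encourage the network to learn low-frequency filters over high-frequency ones, we also apply Laplacian smoothing when using a spectral bottleneck.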
MNIST Classifier Architecture
The basic classifier that we use consists of two convolution and max pooling layers followed by the dense layer where we apply Laplacian smoothing. We use the cross entropy loss to train the classification network in this case. Note that while we use convolutions before this layer for the MNIST example, in principle, techniques applied here could be applied to non image data by using only dense layers until the Laplacian smoothing layer which constructs an image for each datapoint. Table 1 shows the architecture when no convolutions are used. Table 2 exhibits the architecture when convolution and max pooling layers are used after the Laplacian smoothing layer constructs a 2D image.