Multislice PHATE for tensor embeddings
Understanding why and how certain neural networks outperform others is key to guiding future development of network architectures and optimization methods. To this end, we introduce a novel visualization algorithm that reveals the internal geometry of such networks: Multislice PHATE (M-PHATE), the first method designed explicitly to visualize how a neural network's hidden representations of data evolve throughout the course of training. We demonstrate that our visualization provides intuitive, detailed summaries of the learning dynamics beyond simple global measures (i.e., validation loss and accuracy), without the need to access validation data. Furthermore, M-PHATE better captures both the dynamics and community structure of the hidden units as compared to visualization based on standard dimensionality reduction methods (e.g., ISOMAP, t-SNE). We demonstrate M-PHATE with two vignettes: continual learning and generalization. In the former, the M-PHATE visualizations display the mechanism of "catastrophic forgetting" which is a major challenge for learning in task-switching contexts. In the latter, our visualizations reveal how increased heterogeneity among hidden units correlates with improved generalization performance. An implementation of M-PHATE, along with scripts to reproduce the figures in this paper, is available at https://github.com/scottgigante/M-PHATE.
Despite their massive increase in popularity in recent years, deep networks are still regarded as opaque and difficult to interpret or analyze. Understanding how and why certain neural networks perform better than others remains an art. The design of neural networks and their training (choice of architectures, regularization, activation functions, and hyperparameters), while informed by theory and prior work, is often driven by intuition and tuned manually (shahriari:bayesian_optimization). The combination of these intuition-driven selections and long training times even on high-performance hardware (e.g., 3 weeks on 8 GPUs for the popular ResNet-200 network for image classification) means that the combinatorial task of testing all possible choices is impossible, and must be guided by more principled evaluations and explorations.
A natural and widely used measure of evaluation for the difference between network architectures and optimizers is the validation loss. In some situations, the validation loss lacks a clearly defined global meaning, i.e., when the loss function itself is learned, and other evaluations are required(salimans:inception_score; lucic:inception_distance). While such scores are useful for ranking models on the basis of performance, they crucially do not explain why one model outperforms another. To provide additional insight, visualization tools have been employed, for example to analyze the “loss landscape” of a network. Specifically, these visualizations depict how architectural choices modify the smoothness of local minima (goodfellow:visualizing_optimization; li:visualizing_loss_landscape) — a quality assumed to be related to generalization abilities.
Local minima smoothness, however, is only one possible correlate of performance. Another internal quality that can be quantified is the hidden representations of inputs provided by the hidden unit activations. The multi-layered hidden representations of data are, in effect, the single most important feature distinguishing neural networks from classical machine learning techniques in generalization (lecun:deep-learning-review; bengio:curse-highly-variable; bengio:learning-deep-architectures; montufar:mixture-of-products; montufar:linear-regions). We can view the changes in representation by stochastic gradient descent as a dynamical system evolving from its random initialization to a converged low-energy state. Observing the progression of this dynamical system gives more insight into the learning process than simply observing it at a single point in time (e.g., after convergence). In this paper, we contribute a novel method of inspecting a neural network’s learning: we visualize the evolution of the network’s hidden representation during training to isolate key qualities predictive of improved network performance.
Analyzing extremely high-dimensional objects such as deep neural networks requires methods that can reduce these large structures into more manageable representations that are efficient to manipulate and visualize. Dimensionality reduction is a class of machine learning techniques which aim to reduce the number of variables under consideration in high-dimensional data while maintaining the structure of a dataset. There exists a wide array of dimensionality reduction techniques designed for visualization, which aim to capture the structure of a dataset in two or three dimensions for the purposes of human interpretation, e.g., MDS (cox:MDS), t-SNE (vandermaaten:tsne), and Isomap (tenenbaum:isomap). In this paper, we employ PHATE (moon:PHATE), a kernel-based dimensionality reduction method designed specifically for visualization which uses multidimensional scaling (MDS) (cox:MDS) to effectively embed the diffusion geometry (coifman2006diffusion) of a dataset in two or three dimensions.
In order to visualize the evolution of the network’s hidden representation, we take advantage of the longitudinal nature of the data; we have in effect many observations of an evolving dynamical system, which lends itself well to building a graph from the data connecting observations across different points in time. We construct a weighted multislice graph (where a “slice” refers to the network state at a fixed point in time) by creating connections between hidden representations obtained from a single unit across multiple epochs, and from multiple units within the same epoch. A pairwise affinity kernel on this graph reflects the similarity between hidden units and their evolution over time. This kernel is then dimensionality reduced with PHATE and visualized in two dimensions.
The main contributions of this paper are as follows. We present Multislice PHATE (M-PHATE), which combines a novel multislice kernel construction with the PHATE visualization (moon:PHATE). Our kernel captures the dynamics of an evolving graph structure that, when visualized, gives unique intuition about the evolution of a neural network over the course of training and re-training. We compare M-PHATE to other dimensionality reduction techniques, showing that the combined construction of the multislice kernel and the use of PHATE provide significant improvements to visualization. In two vignettes, we demonstrate the use of M-PHATE on established training tasks and learning methods in continual learning, and in regularization techniques commonly used to improve generalization performance. These examples draw insight into the reasons certain methods and architectures outperform others, and demonstrate how visualizing the hidden units of a network with M-PHATE provides additional information to a deep learning practitioner over classical metrics such as validation loss and accuracy, all without the need to access validation data.
Diffusion maps (DMs) (coifman2006diffusion) are an important nonlinear dimensionality reduction method that has been used to extract complex relationships between high-dimensional data (He2009; Farbman:2010; Talmon2012; Mishne2013; coifman:diffusion_changing_data; Mishne2016; Banisch2017). PHATE (moon:PHATE) aims to optimize diffusion maps for data visualization. We briefly review the two approaches.
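Given a high-dimensional dataset $\{x_i\}$, DMs operate on a pairwise similarity matrix $W$ (e.g., computed via a Gaussian kernel $W(x_i, x_j) = \exp\{-\|x_i - x_j\|^2 / \epsilon\}$) and return an embedding of the data in a low-dimensional Euclidean space. To compute this embedding, the rows of $W$ are normalized by $P = D^{-1} W$, where $D_{ii} = \sum_j W_{ij}$. The resulting matrix $P$ can be interpreted as the transition matrix of a Markov chain over the dataset, and powers of the matrix, $P^t$, represent running the Markov chain forward $t$ steps. The matrix $P$ thus has a complete sequence of bi-orthogonal left and right eigenvectors, $\phi_i$ and $\psi_i$ respectively, and a corresponding sequence of eigenvalues $1 = \lambda_0 \geq |\lambda_1| \geq |\lambda_2| \geq \ldots$. Due to the fast spectrum decay of $\{\lambda_i\}$, we can obtain a low-dimensional representation of the data using only the top $\ell$ eigenvectors. Diffusion maps, defined as $\Psi_t(x) = (\lambda_1^t \psi_1(x), \lambda_2^t \psi_2(x), \ldots, \lambda_\ell^t \psi_\ell(x))$, embed the data points into a Euclidean space $\mathbb{R}^\ell$ where the Euclidean distance approximates the diffusion distance:

$$D_t^2(x_i, x_j) = \sum_{x_k} \left(p_t(x_i, x_k) - p_t(x_j, x_k)\right)^2 \frac{1}{\phi_0(x_k)} \approx \|\Psi_t(x_i) - \Psi_t(x_j)\|_2^2,$$

where $p_t(x_i, x_k)$ is the $(i, k)$ entry of $P^t$. Note that $\psi_0$ is neglected because it is a constant vector.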
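The construction above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `diffusion_map` and its default parameters are hypothetical, and the eigendecomposition is carried out on the symmetric conjugate $D^{-1/2} W D^{-1/2}$ of $P$ for numerical convenience, which is an implementation choice not discussed in the text.

```python
import numpy as np

def diffusion_map(X, epsilon=1.0, t=1, n_components=2):
    """Embed the rows of X with a basic diffusion map (illustrative sketch)."""
    # Gaussian kernel W on squared Euclidean distances.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / epsilon)
    d = W.sum(axis=1)  # diagonal of D
    # P = D^{-1} W shares its eigenvalues with the symmetric matrix
    # S = D^{-1/2} W D^{-1/2}, which is easier to decompose.
    S = W / np.sqrt(np.outer(d, d))
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(-eigvals)  # sort eigenvalues in decreasing order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Right eigenvectors of P are psi = D^{-1/2} v.
    psi = eigvecs / np.sqrt(d)[:, None]
    # Skip the constant eigenvector psi_0 and scale the rest by lambda^t.
    keep = slice(1, n_components + 1)
    return (eigvals[keep] ** t) * psi[:, keep]
```

With this normalization of the eigenvectors, keeping all non-trivial components reproduces the diffusion distance weighted by $1/D_{kk}$, which differs from the $1/\phi_0(x_k)$ weighting only by a constant factor.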
To enable successful data visualization, a method must reduce the dimensionality to two or three dimensions; diffusion maps, however, reduce only to the intrinsic dimensionality of the data, which may be much higher. Thus, to calculate a 2D or 3D representation of the data, PHATE applies MDS (cox:MDS) to the informational distance between rows $i$ and $j$ of the diffusion kernel $P^t$, defined as

$$\Phi_t(i, j) = \|\log P^t(i) - \log P^t(j)\|_2,$$

where $t$ is selected automatically as the knee point of the Von Neumann Entropy of the diffusion operator. For further details, see moon:PHATE.
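As a rough illustration of this step, the sketch below computes the log-transformed distance between rows of a powered transition matrix and embeds it with classical (Torgerson) MDS. Several simplifications are assumed here: the diffusion time `t` is passed in by hand rather than chosen by the entropy knee point, a small constant `eps` (a hypothetical parameter) guards against taking the logarithm of zero, and classical MDS stands in for whichever MDS variant PHATE uses internally.

```python
import numpy as np

def informational_distance(P, t, eps=1e-7):
    """Pairwise L2 distance between log-transformed rows of P^t."""
    log_Pt = np.log(np.linalg.matrix_power(P, t) + eps)  # eps avoids log(0)
    diff = log_Pt[:, None, :] - log_Pt[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

def classical_mds(D, n_components=2):
    """Classical (Torgerson) MDS of a symmetric distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(-eigvals)[:n_components]
    # Clip tiny negative eigenvalues that arise from non-Euclidean distances.
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))
```

A two-dimensional embedding of a row-stochastic matrix `P` would then be obtained as `classical_mds(informational_distance(P, t=3))`.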
We consider the evolving state of a neural network’s hidden units as a dynamical system which can be represented as a multislice graph on which we construct a pairwise affinity kernel. Such a kernel considers both similarities between hidden units in the same epoch or time-slice (denoted intraslice similarities) and similarities of a hidden unit to itself across different time-slices (denoted interslice similarities). The concept of constructing a graph for data changing over time is motivated by prior work both in harmonic analysis (coifman:diffusion_changing_data; lindenbaum2015multiview; lederman2018learning; marshall2018time; Banisch2017) and network science (mucha:multiscale_community). For example, coifman:diffusion_changing_data suggest an algorithm for jointly analyzing DMs built over data points that are changing over time by aligning the separately constructed DMs, while mucha:multiscale_community suggest an algorithm for community detection in multislice networks by connecting each node in one network slice to itself in other slices, with identical fixed weights for all interslice connections. In both cases, such techniques are designed to detect changes in intraslice dynamics over time, yet interslice dynamics are not incorporated into the model.
Let $F$ be a neural network with a total of $m$ hidden units applied to $d$-dimensional input data. Let $F_i : \mathbb{R}^d \to \mathbb{R}$ be the activation of the $i$th hidden unit of $F$, and $F^{(\tau)}$ be the representation of the network after being trained for $\tau \in \{1, \ldots, n\}$ epochs on training data $X$ sampled from a dataset $\mathcal{X}$.
A natural feature space for the hidden units of $F$ is the activations of the units with respect to the input data. Let $Y \subset \mathcal{X}$ be a representative sample of $p \ll |X|$ points. (In this paper, we use points not used in training; however, this is not necessary. Further discussion of this is given in Section S2.) Let $Y_k$ be the $k$th sample in $Y$. We use the hidden unit activations $F(Y)$ to compute a shared feature space of dimension $p$ for the hidden units. We can then calculate similarities between units from all layers. Note that one may instead consider the hidden units’ learned parameters (e.g., weight matrices and bias terms); however, these are not suitable for our purposes as they are not necessarily the same shape between hidden layers, and additionally the parameters may contain information not relevant to the data (for example, in dimensions of $X$ containing no relevant information).
We denote the time trace $T$ of the network as an $n \times m \times p$ tensor containing the activations at each epoch $\tau$ of each hidden unit $F_i$ with respect to each sample $Y_k \in Y$. We note that in practice, the major driver of variation in $T$ is the bias term contributing a fixed value to the activation of each hidden unit. Further, we note that the absolute values of the differences in activation of a hidden unit are not strictly meaningful, since any differences in activation can simply be magnified by a larger kernel weight in the following layer. Therefore, to calculate more meaningful similarities, we first $z$-score the activations of each hidden unit at each epoch $\tau$:

$$T(\tau, i, k) = \frac{F_i^{(\tau)}(Y_k) - \frac{1}{p} \sum_{\ell} F_i^{(\tau)}(Y_\ell)}{\sqrt{\mathrm{Var}_{\ell}\, F_i^{(\tau)}(Y_\ell)}}.$$
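The normalization step can be sketched as follows, assuming the raw activations have already been collected into an array of shape (epochs, hidden units, samples). The function name `zscore_time_trace` and the small constant `eps`, which maps units with constant activation to zero instead of dividing by zero, are assumptions made for this illustration.

```python
import numpy as np

def zscore_time_trace(activations, eps=1e-12):
    """Z-score each hidden unit at each epoch across the sample axis.

    activations: array of shape (n_epochs, n_units, n_samples).
    """
    mean = activations.mean(axis=2, keepdims=True)
    std = activations.std(axis=2, keepdims=True)
    # eps keeps "dead" units (zero variance) finite; they become all zeros.
    return (activations - mean) / (std + eps)
```

After this step every unit at every epoch has zero mean and (approximately) unit variance over the sample set, so distances between units reflect the pattern of their responses rather than their scale or offset.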
The time trace $T$ gives us a natural substrate from which to construct a visualization of the network’s evolution. We construct a kernel over $T$ utilizing our prior knowledge of the temporal aspect of $T$ to capture its dynamics. Let $K$ be an $nm \times nm$ kernel matrix between all hidden units at all epochs (the $((\tau - 1)m + j)$th row or column of $K$ refers to the $j$th unit at epoch $\tau$). We henceforth refer to the $((\tau - 1)m + j)$th row of $K$ as $K((\tau, j), :)$ and the $((\tau - 1)m + j)$th column of $K$ as $K(:, (\tau, j))$.
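To capture both the evolution of a hidden unit throughout training as well as its community structure with respect to other hidden units, we construct a multislice kernel matrix which reflects both affinities between hidden units $i$ and $j$ in the same epoch $\tau$, or intraslice affinities

$$K^{(\tau)}_{\text{intraslice}}(i, j) = \exp\left(-\|T(\tau, i) - T(\tau, j)\|_2^{\alpha} / \sigma_{(\tau, i)}^{\alpha}\right),$$

as well as affinities between a hidden unit $i$ and itself at different epochs, or interslice affinities

$$K^{(i)}_{\text{interslice}}(\tau, \upsilon) = \exp\left(-\|T(\tau, i) - T(\upsilon, i)\|_2^{2} / \epsilon^{2}\right),$$

where $\sigma_{(\tau, i)}$ is the intraslice bandwidth for unit $i$ at epoch $\tau$, $\epsilon$ is the fixed interslice bandwidth, and $\alpha$ is the adaptive bandwidth decay parameter.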
In order to maintain connectivity while increasing robustness to parameter selection for the intraslice affinities $K^{(\tau)}_{\text{intraslice}}$, we use an adaptive-bandwidth Gaussian kernel (termed the alpha-decay kernel (moon:PHATE)), with bandwidth $\sigma_{(\tau, i)}$ set to be the distance of unit $i$ at epoch $\tau$ to its $k$th nearest neighbor across units at that epoch: $\sigma_{(\tau, i)} = d_k(T(\tau, i), T(\tau, :))$, where $d_k(z, Z)$ denotes the distance from $z$ to its $k$th nearest neighbor in $Z$. Note that the use of the adaptive bandwidth means that the kernel is not symmetric and will require symmetrization. In order to allow the kernel to represent changing dynamics of units over the course of learning, we use a fixed-bandwidth Gaussian kernel in the interslice affinities $K^{(i)}_{\text{interslice}}$, where $\epsilon$ is the average across all epochs and all units of the distance of unit $i$ at epoch $\tau$ to its $\kappa$th nearest neighbor among the set consisting of the same unit $i$ at all other epochs:

$$\epsilon = \frac{1}{nm} \sum_{\tau = 1}^{n} \sum_{i = 1}^{m} d_\kappa(T(\tau, i), T(:, i)).$$
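The two kernels and their bandwidths can be sketched as below. This is an illustrative reading of the description above, not the reference implementation: the helper names are hypothetical, the default values of `k`, `alpha` and `kappa` are placeholders rather than the settings used in the experiments, and duplicate points (which would give a zero bandwidth) are assumed not to occur.

```python
import numpy as np

def pairwise_l2(Z):
    """Euclidean distance between all pairs of rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

def kth_neighbor_distance(D, k):
    """Distance from each point to its k-th nearest neighbor (self excluded)."""
    # After sorting, column 0 of each row is the zero self-distance.
    return np.sort(D, axis=1)[:, k]

def intraslice_kernel(T_tau, k=2, alpha=5.0):
    """Alpha-decay kernel between all units at one epoch; T_tau is (m, p)."""
    D = pairwise_l2(T_tau)
    sigma = kth_neighbor_distance(D, k)      # one adaptive bandwidth per row
    return np.exp(-(D / sigma[:, None]) ** alpha)

def interslice_bandwidth(T, kappa=2):
    """Average kappa-th neighbor distance of each unit to itself across epochs."""
    n_units = T.shape[1]
    dists = [kth_neighbor_distance(pairwise_l2(T[:, i, :]), kappa)
             for i in range(n_units)]
    return float(np.mean(dists))

def interslice_kernel(T_unit, epsilon):
    """Fixed-bandwidth Gaussian kernel of one unit across epochs; T_unit is (n, p)."""
    return np.exp(-(pairwise_l2(T_unit) / epsilon) ** 2)
```

Because each row of the intraslice kernel uses its own bandwidth, the result is generally asymmetric, whereas the interslice kernel is symmetric by construction.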
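Finally, the multislice kernel matrix $K$ contains one row and column for each unit at each epoch, such that the intraslice affinities form a block diagonal matrix and the interslice affinities form off-diagonal blocks composed of diagonal matrices (see Figures S1 and S2 for a diagram):

$$K((\tau, i), (\upsilon, j)) = \begin{cases} K^{(\tau)}_{\text{intraslice}}(i, j), & \text{if } \tau = \upsilon; \\ K^{(i)}_{\text{interslice}}(\tau, \upsilon), & \text{if } i = j; \\ 0, & \text{otherwise.} \end{cases}$$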
We symmetrize this kernel as $K' = \frac{1}{2}(K + K^T)$, and row normalize it to obtain $P = D^{-1} K'$ with $D_{ii} = \sum_j K'_{ij}$, which represents a random walk over all units across all epochs, where propagating from $(\tau, i)$ to $(\upsilon, j)$ is conditional on the transition probabilities between epochs $\tau$ and $\upsilon$. PHATE (moon:PHATE) is applied to $P$ to visualize the time trace $T$ in two or three dimensions.
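To make the block structure concrete, the following sketch assembles the multislice kernel from precomputed intraslice and interslice kernels, symmetrizes it, and row-normalizes it into a transition matrix. The function name and the array layouts are assumptions for this example; rows are indexed from zero, so unit `i` at epoch `tau` occupies row `tau * m + i`.

```python
import numpy as np

def multislice_transition_matrix(intraslice, interslice):
    """Assemble the symmetrized multislice kernel and its random-walk matrix.

    intraslice: array (n, m, m), one unit-by-unit kernel per epoch.
    interslice: array (m, n, n), one epoch-by-epoch kernel per unit.
    """
    n, m, _ = intraslice.shape
    K = np.zeros((n * m, n * m))
    # Intraslice kernels fill the block diagonal.
    for tau in range(n):
        block = slice(tau * m, (tau + 1) * m)
        K[block, block] = intraslice[tau]
    # Interslice kernels fill the diagonals of the off-diagonal blocks.
    off_diag = ~np.eye(n, dtype=bool)
    for i in range(m):
        idx = np.arange(n) * m + i           # rows of unit i at every epoch
        rows, cols = np.meshgrid(idx, idx, indexing="ij")
        K[rows[off_diag], cols[off_diag]] = interslice[i][off_diag]
    K_sym = 0.5 * (K + K.T)                  # symmetrize
    P = K_sym / K_sym.sum(axis=1, keepdims=True)  # row-normalize
    return K_sym, P
```

Entries linking two different units at two different epochs stay at zero, so a random walk can only move between units within an epoch or follow a single unit through time.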
To demonstrate our visualization, we train a feedforward neural network with 3 layers of 64 hidden units to classify digits in MNIST (lecun:MNIST). The visualization is built on the time trace evaluated on the network over a single round of training that lasted 300 epochs and reached 96% validation accuracy.
We visualize the network using M-PHATE (Fig. 1) colored by epoch, hidden layer and the digit for which examples of that digit most strongly activate the hidden unit. The embedding is clearly organized longitudinally by epoch, with larger jumps between early epochs and gradually smaller steps as the network converges. Additionally, increased structure emerges in the latter epochs as the network learns meaningful representations of the digits, and groups of neurons activating on the same digits begin to co-localize. Neurons of different layers frequently co-localize, showing that our visualization allows meaningful comparison of hidden units in different hidden layers.
To evaluate the quality of the M-PHATE visualization, we compare to three established visualization methods: diffusion maps, t-SNE and ISOMAP. We also compare our multislice kernel to the standard formalism of these visualization techniques, by computing pairwise distances or affinities between all units at all time points without taking into account the multislice nature of the data.
Figure 2 shows the standard and multislice visualizations for all four dimensionality reduction techniques of the network in Section 4.1. For implementation details, see Section S3. Only the Multislice PHATE visualization reveals any meaningful evolution of the neural network over time. To quantify the quality of the visualization, we compare both interslice and intraslice neighborhoods in the embedding to the equivalent neighborhoods in the original data. Specifically, for a visualization $V$ we define the intraslice neighborhood preservation of a point $V(\tau, i) \in V$ as

$$\frac{1}{k} \left| N^k_{V(\tau, :)}(V(\tau, i)) \cap N^k_{T(\tau, :)}(T(\tau, i)) \right|$$

and the interslice neighborhood preservation of $V(\tau, i)$ as

$$\frac{1}{k} \left| N^k_{V(:, i)}(V(\tau, i)) \cap N^k_{T(:, i)}(T(\tau, i)) \right|,$$

where $N^k_Z(z)$ denotes the $k$ nearest neighbors of $z$ in $Z$. We also calculate the Spearman correlation of the rate of change of each hidden unit with the rate of change of the validation loss to quantify the fidelity of the visualization to the diminishing rate of convergence towards the end of training.
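The neighborhood preservation scores can be sketched as the fraction of $k$-nearest neighbors shared between the embedding and the original time trace, averaged over points. The function names and the array layout (epochs, units, features) are assumptions for this illustration, and Euclidean distance is assumed for the neighbor search in both spaces.

```python
import numpy as np

def knn_indices(Z, k):
    """Indices of the k nearest neighbors of each row of Z (self excluded)."""
    D = np.sqrt(np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1))
    np.fill_diagonal(D, np.inf)  # a point is not its own neighbor
    return np.argsort(D, axis=1)[:, :k]

def neighborhood_preservation(V_part, T_part, k):
    """Mean fraction of k-nearest neighbors shared between embedding and data."""
    nn_V = knn_indices(V_part, k)
    nn_T = knn_indices(T_part, k)
    shared = [len(set(a) & set(b)) for a, b in zip(nn_V, nn_T)]
    return float(np.mean(shared)) / k

def intraslice_preservation(V, T, k):
    """Neighbors are searched among the units of a single epoch."""
    return float(np.mean([neighborhood_preservation(V[t], T[t], k)
                          for t in range(V.shape[0])]))

def interslice_preservation(V, T, k):
    """Neighbors are searched among the epochs of a single unit."""
    return float(np.mean([neighborhood_preservation(V[:, i], T[:, i], k)
                          for i in range(V.shape[1])]))
```

Both scores lie in the interval [0, 1], with 1 meaning that every point keeps exactly the same neighbors in the visualization as in the original data.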
M-PHATE achieves the best neighborhood preservation on all measures except the interslice neighborhood preservation, in which it performs on par with standard t-SNE. Additionally, the multislice kernel construction outperforms the corresponding standard kernel construction for all methods and all measures, except again in the case of t-SNE for interslice neighborhood preservation. M-PHATE also has the highest correlation with change in loss, making it the most faithful display of network convergence.
An ongoing challenge in artificial intelligence is in making a single model perform well on many tasks independently. The capacity to succeed at dynamically changing tasks is often considered a hallmark of genuine intelligence, and is thus crucial to develop in artificial intelligence (parisi:continual-learning). Continual learning is one attempt at achieving this goal by sequentially training a single network on different tasks, with the aim of instilling the network with new abilities as data becomes available.
To assess networks designed for continual learning tasks, a set of training baselines has been proposed. hsu:continual-learning-baselines define three types of continual learning scenarios for classification: incremental task learning, in which a separate binary output layer is used for each task; incremental domain learning, in which a single binary output layer performs all tasks; and incremental class learning, in which a single 10-unit output layer is used, with each pair of output units used for just a single task. Further details are given in Section S4.
We implemented a 2-layer MLP with 400 units in each hidden layer to perform incremental task, domain and class learning using the three described baselines: standard training with Adagrad (duchi:adagrad) and Adam (kingma:adam), and an experience replay training scheme called Naive Rehearsal (hsu:continual-learning-baselines) in which a small set of training examples from each task is retained and replayed to the network during subsequent tasks. Each network was trained for 4 epochs before switching to the next task. Overall, we find that validation performance is fairly consistent with results reported in hsu:continual-learning-baselines, with Naive Rehearsal performing best, followed by Adagrad and Adam. Class learning was the most challenging, followed by domain learning and task learning.
Figure 3 shows M-PHATE visualizations of learning in networks trained in each of three baselines, with network slices taken every 50 batches rather than every epoch for increased resolution. Notably, we observe a stark difference in how structure is preserved over training between networks, which is predictive of task performance. The highest-performing networks all tend to preserve representational structure across changing tasks. On the other hand, networks trained with Adam — the worst performing combinations — tend to have a structural “collapse”, or rapid change in connectivity, as the tasks switch, consistent with the rapid change (and eventual increase) in validation loss.
Further, the frequency of neighborhood changes for hidden units throughout training (appearing as a crossing of unit trajectories in the visualization) corresponds to an increase in validation loss; this is due to a change in function of the hidden units, corrupting the intended use of such units for earlier tasks. We quantify this effect by calculating the Adjusted Rand Index (ARI) on cluster assignments computed on the subset of the visualization corresponding to the hidden units pre- and post-task switch, and find that the average ARI is strongly negatively correlated with the network’s final validation loss averaged over all tasks ().
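For reference, the Adjusted Rand Index between two cluster assignments of the same hidden units (for example, before and after a task switch) can be computed from their contingency table. The sketch below is a plain NumPy version of the standard ARI formula; the clustering step itself is not shown, and the function names are chosen for this illustration only.

```python
import numpy as np

def pairs(x):
    """Number of unordered pairs that can be drawn from x items."""
    return x * (x - 1) / 2.0

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same items."""
    a = np.unique(np.asarray(labels_a).ravel(), return_inverse=True)[1]
    b = np.unique(np.asarray(labels_b).ravel(), return_inverse=True)[1]
    # Contingency table: items shared by cluster r of a and cluster c of b.
    contingency = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(contingency, (a, b), 1)
    sum_cells = pairs(contingency).sum()
    sum_rows = pairs(contingency.sum(axis=1)).sum()
    sum_cols = pairs(contingency.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / pairs(len(a))
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:  # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

An ARI of 1 indicates that units are grouped identically before and after the switch, while values near zero or below indicate that the grouping has been reshuffled.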
Looking for such signatures, including rapid changes in hidden unit structure and crossing of unit trajectories, can thus be used to understand the efficiency of continual learning architectures.
Despite being massively overparametrized, neural networks frequently exhibit astounding generalization performance (zhang:rethinking-generalization; allen-zhu:overparametrized-learning). Recent work has shown that, despite having the capacity to memorize, neural networks tend to learn abstract, generalizable features rather than memorizing each example, and that this behaviour is qualitatively different in gradient descent compared to memorization (arpit:memorization).
In order to demonstrate the difference between networks that learn to generalize and networks that learn to memorize, we train a 3-layer MLP with 128 hidden units in each layer to classify MNIST with: no regularization; L1/L2 weight regularization; L1/L2 activity regularization; and dropout. Additionally, we train the same network to classify MNIST with random labels, as well as to classify images with randomly valued pixels, such networks being examples of pure memorization. Each network was trained for 300 epochs, and the discrepancy between train and validation loss reported.
We note that in Figure 4, the networks with the poorest generalization (i.e., those with the greatest divergence between train and validation loss), especially Activity L1 and Activity L2, display less heterogeneity in the visualization. To quantify this, we calculate the sum of the variance for all time slices of each embedding and regress this against the memorization error of each network, defined as the discrepancy between train and test loss after 300 epochs (Table 3), achieving a Spearman correlation of .
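One way to compute this heterogeneity score is sketched below, assuming each embedding is stored as an array of shape (time slices, hidden units, embedding dimensions). The function names are hypothetical, the variance is taken across units within each slice and summed over dimensions and slices, and the rank correlation helper assumes there are no tied values.

```python
import numpy as np

def summed_slice_variance(embedding):
    """Sum over time slices of the variance of unit coordinates.

    embedding: array of shape (n_slices, n_units, n_dims).
    """
    return float(embedding.var(axis=1).sum())

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of ranks (no ties assumed)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])
```

Given one variance score and one memorization error per network, `spearman(scores, errors)` then measures how consistently lower heterogeneity goes with a larger train-test gap.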
To understand this phenomenon, we consider the random labels network. In order to memorize random labels, the neural network must home in on minute differences between images of the same true class in order to classify them differently. Since most images won’t satisfy such specific criteria, most nodes will not respond to any given image, leading to low activation heterogeneity and high similarities between hidden units. The M-PHATE visualization clearly exposes this intuition visually, depicting very little difference between these hidden units. Similar intuition can be drawn from the random pixels network, in which the difference between images is purely random. We hypothesize that applying L1 or L2 regularization over the activations has a qualitatively similar effect, reducing the variability in activations and effectively over-emphasizing small differences in the hidden representation. This behavior effectively mimics the effects of memorization.
On the other hand, we consider the dropout network, which displays the greatest heterogeneity. Initial intuition evoked the idea that dropout emulates an ensemble method within a single network; by randomly removing units from the network during training, the network learns to combine the output of many sub-networks, each of which is capable of correctly classifying the input (srivastava:dropout). M-PHATE visualization of training with dropout suggests a more mechanistic version of this intuition: dropped-out nodes are protected from receiving the exact same gradient signals and diverge to a more expressive representation. The resulting heterogeneity in the network reduces the reliance on small differences between training examples and heightens the network’s capacity to generalize. This intuition falls in line with other theoretical explorations, such as viewing dropout as a form of Bayesian regularization (gal2016dropout) or stochastic gradient descent (baldi2013understanding), and reinforces our understanding of why dropout induces generalization.
We note that while this experiment uses validation data as input to M-PHATE, we have repeated this experiment using only training data in Section S2 and show equivalent results. In doing so, we provide a mechanism to understand the generalization performance of a network without requiring access to validation data.
Here we have introduced a novel approach to examining the process of learning in deep neural networks through a visualization algorithm we call M-PHATE. M-PHATE takes advantage of the dynamic nature of the hidden unit activations over the course of training to provide an interpretable visualization otherwise unattainable with standard visualizations. We demonstrate M-PHATE with two vignettes in continual learning and generalization, drawing conclusions that are not apparent without such a visualization, and providing insight into the performance of networks without necessarily requiring access to validation data. In doing so, we demonstrate the utility of such a visualization to the deep learning practitioner.
In Section 3, we describe a multislice affinity kernel built from an intraslice kernel, which connects hidden units in the same epoch, and an interslice kernel, which connects each hidden unit to itself at different epochs. We further clarify the intuition behind such an affinity kernel in two schematics.
Figure S1 displays a graph of 10 hidden units in a dynamically changing graph structure over the course of four time slices. Each hidden unit’s local neighborhood within its own time slice (its intraslice affinities) changes as the system evolves, with connectivity shown as black lines. Additionally, each hidden unit is connected to itself across different epochs, with strength of these interslice connections (shown as dotted lines) also dependent on similarities (rather than simply a fixed-weight connection).
Figure S2 displays the top left corner of an example of a multislice affinity kernel. The full multislice kernel ($K$, left) is composed of the intraslice kernels placed down the block diagonal ($K_{\text{intraslice}}$, middle) and the interslice kernels forming the diagonals of each off-diagonal block ($K_{\text{interslice}}$, right).
In Section 3, we state that the representative subset is taken from points not used in training. However, there is no reason why this should be the case. To demonstrate that M-PHATE can be used successfully without accessing data external to the training set, we show in Figure S3 a repetition of the generalization experiment, using only training data to build the visualization. Using the same quantification of variance and memorization as in Section 4.4, we obtain an equally strong correlation (Spearman’s , Table S1). Further, we note that the visualizations are qualitatively very similar to those obtained using validation data, indicating that M-PHATE can be used to understand the generalization performance of a network without having access to an external validation set.
In Section 4.2, we compare M-PHATE to Diffusion Maps, t-SNE and Isomap in both a standard and multislice context. Since t-SNE and Isomap require distance matrices, not affinity matrices, we convert the multislice kernel to geodesic distances by computing the shortest-path over the graph with the distance . For standard application of Isomap and t-SNE, we use the default parameters in sklearn [sklearn]. Since diffusion maps can be applied to any symmetric non-negative affinity kernel and do not have a reference implementation, we apply diffusion maps to the adaptive bandwidth kernel built in PHATE.
hsu:continual-learning-baselines describe three schemes of continual learning commonly used in the literature.
Incremental task learning describes the process of learning shared hidden units for separated output layers for each task; the output units for task $i$ are therefore protected from gradient signals during the training of task $j \neq i$. This is akin to the standard model of transfer learning, in which all but the final layer of a network are copied for a new task, with a fresh output layer attached for the new task.
Incremental domain learning describes the process of learning an entirely shared network which learns to perform all tasks separately, but with the same units; in this case the output units for task $i$ are the same units that are used in task $j$ and must learn to correctly classify training examples from separate tasks as though they were the same class.
Incremental class learning describes the process of learning an entirely shared network which learns to perform all tasks at once, with no knowledge of which task is currently being performed. The network contains separate output units for each task, but must select which output units to use, in contrast to incremental task learning in which the task is specified. This is by far the most difficult setting, since in training any one task, the optimal solution is to never predict the output classes of any other task; this strongly encourages catastrophic forgetting.
Figure S4 demonstrates these three architectures on Split MNIST.
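The difference between the three schemes can be made concrete by looking at how a digit label is turned into a training target. The sketch below assumes the common Split MNIST layout of five two-digit tasks, (0, 1), (2, 3), and so on up to (8, 9); this pairing, the function name and the return format are assumptions for illustration and are not taken from the experiments described here.

```python
def split_mnist_target(digit, scenario):
    """Return (output head, target index) for a digit under a given scenario.

    The head is None when a single shared output layer is used.
    """
    task = digit // 2     # assumed task split: (0, 1), (2, 3), ..., (8, 9)
    within = digit % 2    # position of the digit inside its task
    if scenario == "task":
        # Task identity is given; it selects a separate 2-unit output layer.
        return task, within
    if scenario == "domain":
        # One shared 2-unit output layer is reused by every task.
        return None, within
    if scenario == "class":
        # One shared 10-unit output layer; each task owns a pair of units.
        return None, digit
    raise ValueError(f"unknown scenario: {scenario}")
```

For example, the digit 7 maps to unit 1 of head 3 in task learning, to unit 1 of the shared binary layer in domain learning, and to unit 7 of the shared 10-unit layer in class learning.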
The networks in Section 4.3 are trained as follows. Input data is scaled from 0 to 1. All networks consist of an MLP with 2 layers of 400 units with ReLU activation, and a softmax classification output layer. All networks are trained with a batch size of 128, split to batches of 64 new data and 64 rehearsal data in the case of Naive Rehearsal. For the Adam optimizer, we use a learning rate of . For the Adagrad optimizer, we use a learning rate of
Figure 3 shows the visualizations of the continual learning networks for a subset of 100 hidden units from each layer of the MLP with 2 layers of 400 units. Figures S5 and S6 show the full embedding of layers 1 and 2 respectively. In all cases, the visualizations are computed on all hidden units and subsampled for plotting purposes only.
We note the striking difference between layer 1 and layer 2 in all visualizations. In every case, there is less “structural collapse” (see Section 4.3) in layer 2 than in layer 1. Also, the vertical patterning in layer 2 is perfectly associated with time-slice; that is, in each task (composed of 16 time-slices), the majority of change in hidden representations in layer 2 occurs within the first two or three time slices. On the other hand, layer 1 continues to change throughout the task.
The networks in Section 4.4 are trained as follows. Input data is scaled from 0 to 1. All networks consist of an MLP with 3 layers of 128 units with Leaky ReLU activation with , and a softmax classification output layer. All networks are trained with a batch size of 256 with the Adam optimizer and a learning rate of . All regularizations are applied with a weight of . Dropout is applied with . For the scrambled network, we randomly permute the output labels of the training data, leaving the validation data intact. All networks are built and trained in Keras [chollet:keras] using a TensorFlow [abadi:tensorflow] backend.
All multislice graphs are built with , and . We apply PHATE on the multislice affinity matrix with PHATE parameters and , and use the automatically selected parameter of provided by the PHATE algorithm.
All computation was done on a single 36-core workstation running Arch Linux with an NVIDIA TITAN X graphics card and 1TB of RAM.