The learned weights of a neural network are often considered devoid of scrutable internal structure. In order to attempt to discern structure in these weights, we introduce a measurable notion of modularity for multi-layer perceptrons (MLPs), and investigate the modular structure of MLPs trained on datasets of small images. Our notion of modularity comes from the graph clustering literature: a "module" is a set of neurons with strong internal connectivity but weak external connectivity. We find that MLPs that undergo training and weight pruning are often significantly more modular than random networks with the same distribution of weights. Interestingly, they are much more modular when trained with dropout. Further analysis shows that this modularity seems to arise mostly for networks trained on learnable datasets. We also present exploratory analyses of the importance of different modules for performance and how modules depend on each other. Understanding the modular structure of neural networks, when such structure exists, will hopefully render their inner workings more interpretable to engineers.
Modularity is a common property of biological and engineered systems (Clune et al., 2013; Baldwin and Clark, 2000; Booch et al., 2007). Reasons given for modularity include adaptability, the ability to handle different situations with common sub-problems, and the importance of minimizing total connection (Sterling and Laughlin, 2015). It also seems desirable from a perspective of transparency: modular systems allow those analyzing the system to inspect the function of individual modules, and combine their understanding of individual modules into an understanding of the entire system.
Neural networks are composed of distinct layers that are modular in the sense that, when defining a network, different layers are individually configurable. However, they are not modular from the network science perspective, which requires modules to be highly internally connected (Wagner et al., 2007), since neurons inside one layer are not directly connected to each other. Nor are they modular by the dictionary definition, due to their lack of specific legible functionality beyond the little that can be glimpsed from the kernels of convolutional layers. (Wiktionary (2020) defines the word ‘modular’, as used in this paper, as “[c]onsisting of separate modules; especially where each module performs or fulfills some specified function and could be replaced by a similar module for the same function, independently of the other modules.”)
In this work, we develop a measurable notion of the modularity of neural networks: roughly, a network is modular to the extent that it can be partitioned into sets of neurons where each set is highly internally connected, but between-set connectivity is low. This definition refers only to the learned weights of the network, not the data distribution, nor the distribution of outputs or activations of the model. More specifically, we use a graph clustering algorithm to decompose trained networks into clusters, and consider each cluster a ‘module’. We then conduct an empirical investigation into the modularity structure of small MLPs trained on small image datasets. This investigation shows that networks trained with a final phase of weight-based pruning (Han et al., 2015) are somewhat modular, and are often more modular than approximately 99.7% of random networks with the same sparsity and distribution of weights. We also find that networks trained with dropout are much more modular, and that we observe modularity when networks increase their accuracy over training, but much less when they train without increasing their accuracy, as happens in the early stages of training on random data. Finally, we perform a preliminary investigation into the importance and dependency structure of the different modules.
In section 2, we define our notion of modularity and our measure thereof. We then describe in section 3 the degree of modularity of networks trained in different fashions, and discuss experiments done to evaluate the importance of and relationships between different modules in section 4. After that, section 5 gives an introduction to the areas of research related to this paper, and section 6 contains a summary of our findings and a list of future directions.
In this section, we will give a graph-theoretic definition of network modularity, and then give an overview of the algorithm we use to measure it. The whole section will draw heavily from von Luxburg (2007).
We will represent our neural network by a weighted undirected graph $G$. To do this, we identify each neuron, including the inputs and outputs, with an integer between $1$ and $N$, where $N$ is the total number of neurons, and take the set of neurons to be the set $V$ of vertices of $G$. Two neurons will have an undirected edge between them if they are in adjacent layers, and the weight of the edge will be equal to the absolute value of the weight of the connection between the two neurons. We will represent the set of weights by the adjacency matrix $A$, defined by $A_{ij} :=$ the edge weight between neurons $i$ and $j$. If there is no edge between $i$ and $j$, then $A_{ij} := 0$. As such, $A$ encodes all the weight matrices of the neural network, but not the biases.
The degree of a neuron $i$ is defined by $d_i := \sum_j A_{ij}$. The degree matrix $D$ is a diagonal matrix, where the diagonal elements are the degrees: $D_{ii} := d_i$. We also define the volume of a set of neurons $X \subseteq V$ by $\mathrm{vol}(X) := \sum_{i \in X} d_i$, and the weight between two disjoint sets of neurons $X, Y \subseteq V$ as $W(X, Y) := \sum_{i \in X, j \in Y} A_{ij}$. If $X \subseteq V$ is a set of neurons, then we denote its complement as $\bar{X} := V \setminus X$.
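The graph construction described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the function names `mlp_to_adjacency` and `degrees` are hypothetical, and the sketch assumes each weight matrix has shape (neurons in layer l, neurons in layer l+1), as in a Keras `Dense` kernel.

```python
import numpy as np

def mlp_to_adjacency(weight_matrices):
    """Build the symmetric adjacency matrix of an MLP from its weight matrices.

    Assumption: weight_matrices[l] has shape (n_l, n_{l+1}), with rows indexing
    neurons of layer l and columns indexing neurons of layer l+1.
    Biases are ignored, as in the text.
    """
    sizes = [weight_matrices[0].shape[0]] + [w.shape[1] for w in weight_matrices]
    offsets = np.concatenate([[0], np.cumsum(sizes)])
    n = int(offsets[-1])
    adj = np.zeros((n, n))
    for l, w in enumerate(weight_matrices):
        r0, r1 = offsets[l], offsets[l + 1]
        c0, c1 = offsets[l + 1], offsets[l + 2]
        adj[r0:r1, c0:c1] = np.abs(w)      # edge weight = |connection weight|
        adj[c0:c1, r0:r1] = np.abs(w).T    # undirected graph: mirror the block
    return adj

def degrees(adj):
    """Degree of each neuron: the sum of the weights of its incident edges."""
    return adj.sum(axis=1)
```

Neurons in the same layer never share an edge, so the diagonal blocks of the matrix stay zero.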
A partition of the network will be a collection of subsets $X_1, \dots, X_k \subseteq V$ that are disjoint, $X_i \cap X_j = \emptyset$ if $i \neq j$, and whose union forms the whole vertex set, $\bigcup_{i=1}^{k} X_i = V$. Our ‘goodness measure’ of a partition is the normalized cut metric (Shi and Malik, 2000), defined as $\text{n-cut}(X_1, \dots, X_k) := \sum_{i=1}^{k} W(X_i, \bar{X}_i) / \mathrm{vol}(X_i)$, which we will call ‘n-cut’ in text (note that this is a factor of 2 different from the standard definition). This metric will be low if neurons in the same partition element share high-weight edges, and those in different partition elements share low-weight edges, as long as the sums of degrees of neurons in each partition element are roughly balanced.
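As a sketch of how the normalized cut of a given partition could be evaluated, the following hypothetical helper `n_cut` takes a symmetric adjacency matrix and one cluster label per vertex. It assumes every cluster has positive volume and follows the convention in the text, without the factor of one half used in the standard definition.

```python
import numpy as np

def n_cut(adj, labels):
    """Sum over clusters of (weight leaving the cluster) / (volume of the cluster)."""
    adj = np.asarray(adj, dtype=float)
    labels = np.asarray(labels)
    deg = adj.sum(axis=1)
    total = 0.0
    for c in np.unique(labels):
        inside = labels == c
        vol = deg[inside].sum()                    # sum of degrees in the cluster
        cut = adj[np.ix_(inside, ~inside)].sum()   # weight to the complement
        total += cut / vol                         # assumes vol > 0
    return total
```

Because both the cut weight and the volume scale linearly with the edge weights, the value is unchanged when all weights are multiplied by the same positive constant.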
Finally, the graph Laplacian is defined as $L := D - A$, and the normalized Laplacian as $L_{\mathrm{norm}} := D^{-1} L$ (for the connection to the second derivative operator on $\mathbb{R}^n$, see Czaja (2015) and von Luxburg (2007)). $L_{\mathrm{norm}}$ is a positive semi-definite matrix with $N$ real-valued non-negative eigenvalues (von Luxburg, 2007). The eigenvectors and eigenvalues of $L_{\mathrm{norm}}$ are the generalized eigenvectors and eigenvalues of the generalized eigenvalue problem $L u = \lambda D u$.
To estimate the ‘clusterability’ of a graph, we use a spectral clustering algorithm to compute a partition, which we will call a clustering, and evaluate the n-cut of that clustering. Roughly speaking, the spectral clustering algorithm we use (Shi and Malik, 2000) solves a relaxation of the problem of finding a clustering that minimizes the n-cut (von Luxburg, 2007). It does this by taking the eigenvectors of $L_{\mathrm{norm}}$ with the $k$ least eigenvalues, and using them to embed each vertex into $\mathbb{R}^k$ (since each eigenvector of $L_{\mathrm{norm}}$ is $N$-dimensional, having one real value for every vertex of the graph), then using $k$-means clustering on the points in $\mathbb{R}^k$. It is detailed in algorithm 1, which is adapted from von Luxburg (2007). We use the scikit-learn implementation (Pedregosa et al., 2011) using the LOBPCG eigenvalue solver with AMG preconditioning (Knyazev, 2001; Borzì and Borzì, 2006).
We will define the n-cut of a network as the n-cut of the clustering that algorithm 1 returns. As such, since n-cut is low when the network is clusterable or modular (we will use the words ‘clusterable’ and ‘modular’ more or less interchangeably), we will describe a decrease in n-cut as an increase in modularity or clusterability, and vice versa. Despite our informal use of ‘modularity’ here, note that modularity also has a technical definition in the network science literature (Newman, 2006), which we are not using.
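The procedure described above (embed each vertex using the eigenvectors with the smallest eigenvalues, then run k-means) can be sketched as follows. This is an illustrative dense-matrix version, not the scikit-learn implementation with LOBPCG and AMG preconditioning that the paper uses: it swaps in `numpy.linalg.eigh` on the symmetrically normalized Laplacian and a small hand-written Lloyd's k-means with a deterministic initialisation. All function names are hypothetical, and every vertex is assumed to have positive degree.

```python
import numpy as np

def spectral_embedding(adj, k):
    """Embed each vertex into R^k using the k smallest generalized eigenvectors."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)            # assumes every degree is positive
    lap = np.diag(deg) - adj                   # L = D - A
    sym = d_inv_sqrt[:, None] * lap * d_inv_sqrt[None, :]   # D^{-1/2} L D^{-1/2}
    _, vecs = np.linalg.eigh(sym)              # eigenvalues in ascending order
    # If v is an eigenvector of the symmetric matrix, u = D^{-1/2} v
    # solves the generalized problem L u = lambda D u.
    return d_inv_sqrt[:, None] * vecs[:, :k]

def k_means(points, k, n_iter=100):
    """Plain Lloyd's algorithm with farthest-point initialisation."""
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def spectral_clustering(adj, k):
    """Return one cluster label per vertex."""
    adj = np.asarray(adj, dtype=float)
    return k_means(spectral_embedding(adj, k), k)
```

The dense eigendecomposition is only practical for small graphs; an iterative solver is the usual choice for graphs with thousands of vertices.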
In this section, we report the results of experiments designed to determine the degree of clusterability of small trained neural networks. In general, for each experiment, we train an MLP with 4 hidden layers, each of width 256, for 20 epochs of Adam (Kingma and Ba, 2014) with batch size 128. We then run weight pruning on a polynomial decay schedule (Zhu and Gupta, 2017) up to 90% sparsity for an additional 20 epochs. Pruning is used since the pressure to minimize connections plausibly causes sparsity in biological systems (Clune et al., 2013). Training is done using TensorFlow’s implementation of the Keras API (Abadi et al., 2015; Chollet and others, 2015). For details on hyperparameters, see appendix A.1.
Once the network is trained, we convert it into a graph using the method described in section 2.1. We then run algorithm 1 with four clusters (i.e. $k = 4$), a number chosen to avoid the partitioning being overly simple while minimizing the computational budget, and evaluate the n-cut of the resulting clustering. Next, we draw 320 random networks by taking the trained network and randomly shuffling each weight matrix. We convert these networks to graphs, cluster them, and find their n-cuts. We then compare the n-cut of the actual network to the sampled n-cuts, estimating the one-sided $p$-value (North et al., 2002). This determines whether the actual network is more clusterable than one would predict purely based on the sparsity and distribution of weights.
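The shuffle test described above can be sketched as two small helpers. The names are hypothetical, and the p-value uses the (r + 1) / (n + 1) Monte Carlo estimator, which is assumed here to be the estimator intended by the citation of North et al. (2002).

```python
import numpy as np

def shuffle_weight_matrix(w, rng):
    """Return a copy of w with all of its entries randomly permuted."""
    return rng.permutation(w.ravel()).reshape(w.shape)

def one_sided_p_value(observed, samples):
    """Monte Carlo p-value for 'observed is unusually low'.

    r counts the sampled values that are at most the observed value,
    and the estimate is (r + 1) / (n + 1).
    """
    samples = np.asarray(samples)
    r = int(np.sum(samples <= observed))
    return (r + 1) / (len(samples) + 1)
```

Under this estimator, a network whose n-cut is lower than all 320 shuffled samples receives the smallest attainable value, 1/321.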
We use three main datasets: MNIST (LeCun et al., 1998), Fashion-MNIST (Xiao et al., 2017), and CIFAR-10 (Krizhevsky, 2009). MNIST and Fashion-MNIST are used as-is, while CIFAR-10 is down-sampled to $28 \times 28$ pixels and grayscaled to fit in the same format as the other two. MNIST and Fashion-MNIST have 60,000 training examples and 10,000 test examples, while CIFAR-10 has 50,000 training examples and 10,000 test examples. On CIFAR-10, we prune for 40 epochs rather than 20.
We train 10 times on each dataset. Our results are shown in table 1. As can be seen, we train to approximate test accuracies of 98% on MNIST, 90% on Fashion-MNIST, and 40% on CIFAR-10. The pruned networks are usually significantly more clusterable than random networks, often having lower n-cuts than all 320 random samples. The frequency of this depends on the dataset: every network trained on Fashion-MNIST had a lower n-cut than all 320 random permutations, while only 7 of 10 networks trained on MNIST did, and only 1 out of 10 networks trained on CIFAR-10 did. In fact, 5 of the 10 networks trained on CIFAR-10 were less clusterable than all randomly sampled permutations of themselves. More complete data is available in appendix A.2.
Dataset | Prop. sig. | Mean train acc. | Mean test acc. | N-cut mean | N-cut std. | Dist. n-cut mean | Dist. n-cut std. |
---|---|---|---|---|---|---|---|
MNIST | 0.7 | 1.00 | 0.984 | 2.000 | 0.035 | 2.042 | 0.017 |
Fashion | 1.0 | 0.983 | 0.893 | 1.880 | 0.030 | 1.992 | 0.018 |
CIFAR-10 | 0.1 | 0.650 | 0.415 | 2.06 | 0.14 | 2.001 | 0.017 |
Table 1: “Prop. sig.” means the proportion of the 10 trained networks whose n-cut was significant at the lowest attainable one-sided $p$-level (ignoring Bonferroni corrections), i.e. had a lower n-cut than all sampled networks. “N-cut mean” means the sample mean of the n-cuts of the 10 trained networks, and “N-cut std.” means the sample standard deviation. Distribution mean and standard deviation are over all shuffles of all models, not the shuffles of individual models which are used to test significance.
Results for the n-cuts and test accuracies of all trained networks both pre- and post-pruning (as well as when trained with dropout, see subsection 3.1) are shown in figure 2.
To aid in the interpretation of these n-cut values, 200 randomly-initialized networks were clustered, and the distribution of their n-cuts is shown in figure 3. As is shown, the support lies entirely between 2.3 and 2.45 (note that n-cut is invariant to an overall scaling of the weights of a graph).
Dropout is a method of training networks that reduces overfitting without adding a regularization term to the loss, by randomly setting neuronal activations to zero during training (Srivastava et al., 2014). One might expect this to decrease clusterability, since it encourages neurons to place weight on a large number of previous neurons, rather than relying on a few that might get dropped out. However, we see the reverse.
We run exactly the same experiments as described earlier, but apply dropout to all layers other than the output with a rate of 0.5. As shown in table 2, we consistently get statistically significant n-cuts, and the n-cut values’ means and standard deviations are lower than those in table 1. (The Z-statistics of the n-cut distributions with and without dropout training are for MNIST and Fashion-MNIST and for CIFAR-10. Since the underlying distributions may well not be normal, these numbers should be treated with caution.) As figure 2 shows, there is essentially no overlap between the distributions of n-cuts of networks trained with and without dropout, holding fixed the training dataset and whether or not pruning has occurred.
Dataset | Prop. sig. | Mean train acc. | Mean test acc. | N-cut mean | N-cut std. | Dist. n-cut mean | Dist. n-cut std. |
---|---|---|---|---|---|---|---|
MNIST | 1.0 | 0.967 | 0.979 | 1.840 | 0.015 | 2.039 | 0.019 |
Fashion | 1.0 | 0.863 | 0.869 | 1.726 | 0.022 | 2.013 | 0.017 |
CIFAR-10 | 0.9 | 0.427 | 0.422 | 1.840 | 0.089 | 1.997 | 0.014 |
There are two potential dataset-agnostic explanations for why clusterability would increase during training. The first is that Adam naturally increases clusterability as a byproduct. The second is that in order to accurately classify inputs, networks adopt relatively clusterable structure. In order to test these two explanations, we run three experiments on a dataset of random $28 \times 28$ grayscale images with random labels between 0 and 9.
In the unlearnable random dataset experiment, we train on 60,000 random images with usual hyperparameters, 10 runs with dropout and 10 runs without. We report the clusterability of the network both pre- and post-pruning. Since the network is unable to memorize the labels, this tests whether Adam can produce clusterability without accuracy gain. For reference, we compare the n-cuts of the unpruned networks against the distribution of randomly initialized networks, to check whether Adam training without pruning increases clusterability. We also compare the n-cuts of the pruned networks against the distribution of n-cuts of shuffles of those networks, to check if Adam increases clusterability in the presence of pruning more than would be predicted purely based on the increase in sparsity.
In the kilo-epoch random dataset experiment, we modify the unlearnable random dataset experiment to remove the pruning and train for 1000 epochs instead of 20, so as to check if clusterability simply takes longer to emerge from training when the dataset is random. Note that even in this case, the network is unable to classify the training set better than random.
In the memorization experiment, we modify the random dataset and training method to be more easily learnable. To do this, we reduce the number of training examples to 3,000, train without pruning for 100 epochs and then with pruning for 100 more epochs, and refrain from shuffling the dataset between epochs. As a result, the network is able to memorize the labels only when dropout is not applied, letting us observe whether Adam, pruning, and learning can increase clusterability on an arbitrary dataset without dropout training.
As is shown in table 3, the unlearnable random dataset experiment shows no increase in clusterability before pruning relative to the initial distribution shown in figure 3, suggesting that it is not a result of Adam alone. We do see an increase in clusterability after pruning, but the eventual clusterability is no more than would be expected of a random network with the same distribution of weights.
Dropout | Mean train acc. | Unp. n-cut mean | Unp. n-cut std. | N-cut mean | N-cut std. | Dist. n-cut mean | Dist. n-cut std. |
---|---|---|---|---|---|---|---|
No | 0.102 | 2.338 | 0.030 | 2.093 | 0.023 | 2.081 | 0.014 |
Yes | 0.102 | 2.323 | 0.020 | 2.053 | 0.015 | 2.061 | 0.015 |
The results from the kilo-epoch random dataset experiment are shown in table 4. From the means and standard deviations, it appears that Adam caused no increase in clusterability relative to the distribution shown in figure 3 even after a long period of training, while pruning increased clusterability only through sparsification. However, 1 of 10 runs without dropout had a significantly low n-cut after pruning by the one-sided test (without Bonferroni correction), as did 3 of 10 runs with dropout, suggesting that some outliers were indeed abnormally clusterable. As such, the results of this experiment should be treated as somewhat ambiguous.
Dropout | Mean train acc. | Unp. n-cut mean | Unp. n-cut std. | N-cut mean | N-cut std. | Dist. n-cut mean | Dist. n-cut std. |
---|---|---|---|---|---|---|---|
No | 0.102 | 2.342 | 0.022 | 2.082 | 0.017 | 2.082 | 0.014 |
Yes | 0.102 | 2.329 | 0.020 | 2.043 | 0.028 | 2.061 | 0.017 |
The results of the memorization experiment, shown in table 5, are different for the networks trained with and without dropout. Networks trained with dropout mostly did not memorize the dataset, and by and large seem to have n-cuts in line with the random distribution, although 3 of them had statistically significant n-cuts. Those trained without dropout did memorize the dataset, and were all statistically significantly clusterable. In fact, their degree of clusterability is similar to that of those trained on the Fashion-MNIST dataset without dropout, showing the same mean n-cut. Before the onset of pruning, their n-cuts are not particularly lower than the distribution of those of randomly initialized MLPs shown in figure 3.
Dropout | Mean train acc. | Unp. n-cut mean | Unp. n-cut std. | N-cut mean | N-cut std. | Dist. n-cut mean | Dist. n-cut std. |
---|---|---|---|---|---|---|---|
No | 1.00 | 2.464 | 0.014 | 1.880 | 0.017 | 2.008 | 0.014 |
Yes | 0.113 | 2.333 | 0.018 | 2.038 | 0.033 | 2.055 | 0.023 |
By and large, the results suggest that networks are more clusterable when they learn to predict the training set, and that merely running the Adam algorithm without an improvement in training accuracy does not produce as much clusterability.
Since Adam training alone does not appear to increase clusterability, one might suppose that the pruning process is key: that perhaps, the increase in clusterability relative to random networks is due to the pruning producing a clusterable topology, and that the values of the weights given the topology are unimportant. To test this, we compare each trained network to a new distribution: instead of randomly shuffling all elements of each weight matrix, we only shuffle the non-zero elements, thereby preserving the topology of the network. Figure 4 shows the n-cuts of some representative networks compared to the distribution of n-cuts of all shuffled networks, and also the distribution of n-cuts of the topology-preserving shuffles. We see three things: firstly, that by and large the topology really does improve clusterability, suggesting that the pruning process is removing the right weights to promote clusterability; secondly, that with dropout, the distribution of topology-preserving shuffles has much lower n-cuts than the distribution of all shuffles; and thirdly, that our networks are more clusterable than would be expected even given their topology.
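A topology-preserving shuffle of the kind described above can be sketched as follows. The helper name `shuffle_nonzero` is hypothetical; it permutes only the non-zero entries of a weight matrix, so the positions of the zeros left by pruning stay fixed.

```python
import numpy as np

def shuffle_nonzero(w, rng):
    """Permute the non-zero entries of w among themselves, leaving zeros in place."""
    out = w.copy()
    mask = out != 0
    out[mask] = rng.permutation(out[mask])
    return out
```

Comparing a trained network against both this distribution and the full shuffle separates the effect of which connections survive pruning from the effect of the values placed on them.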
One hypothesis that the authors initially considered was that modularity was a result of different regions of the network processing different types of inputs, or aspects of each input. To test this hypothesis, we develop mixture datasets composed of two original datasets. These datasets are either of the separate type, where one original dataset will have only classes 0 through 4 included and the other will have classes 5 through 9 included; or the overlapping type, where both original datasets will contribute examples of all classes. The datasets that we mix are MNIST, CIFAR-10, and LINES, which consists of $28 \times 28$ images of white vertical lines on a black background, labeled with the number of vertical lines.
If modularity were a result of different regions of the network specializing in processing different types of information, we would expect that networks trained on mixture datasets would have n-cuts lower than those trained on either constituent dataset. In fact, our results are ambiguous: for some datasets, the n-cuts of networks trained on the mixture datasets are lower than those trained on the constituent datasets, but for others, the n-cuts of networks trained on the mixture datasets are in-between those trained on the constituent datasets. As such, no particular conclusion can be drawn. Full results are presented and discussed in appendix A.3.
We have seen that trained networks tend to be more clusterable than chance would predict. However, this fact on its own does not tell us much about the function or structure of the modules. In this section, we describe the results of lesion experiments that shed some light on this question.
In the field of neuroscience, one way to determine the function of a region of the brain is to study the behavior of patients whose brain has been accidentally damaged in that region (Adolphs, 2016). Similarly, we have the ability to study the importance of modules by setting the activations of their neurons to 0, and seeing how accurately the network can classify inputs. In the next two subsections, we describe two types of ‘lesion experiment’ we ran and their results.
In this subsection and subsection 4.2, the atomic units of lesioning are the intersections of each module (i.e. cluster) with each hidden layer, which we will call “sub-modules”. This is for two reasons: firstly, because whole modules are large enough that it would be difficult to analyze them, since lesioning any such large set would cause significant damage; and secondly, in order to control for the difference in the types of representations learned between earlier and later layers.
Here, we investigate the importance of single sub-modules. To do so, for each sub-module, we set all weights incoming to the constituent neurons to 0, while leaving the rest of the network untouched. We then determine the damaged network’s test-set accuracy, and in particular how much lower it is than the accuracy of the whole network. In order to assess how meaningful the sub-module is, we use three criteria: firstly, the drop in accuracy should be non-trivial (we somewhat arbitrarily draw the bar at 1 percentage point); secondly, the drop should not simply be due to the number of damaged neurons; and thirdly, the sub-module should be at least 5% of the neurons of the layer. To evaluate the second criterion, we randomly pick that many neurons from the layer to lesion, repeat this 100 times, and record the distribution of drops in accuracy. We use this distribution to compute a one-sided $p$-value for the importance of each sub-module, and say that the second criterion is passed at the smallest attainable $p$-value: that is, if the drop in accuracy due to lesioning the sub-module is greater than all 100 sampled drops.
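The single-lesion procedure described above can be sketched on a toy network. Everything here is illustrative: the function names are hypothetical, and the bias-free ReLU MLP in `accuracy` is a stand-in for the trained network rather than the authors' model.

```python
import numpy as np

def accuracy(weights, x, y):
    """Accuracy of a bias-free ReLU MLP (a toy stand-in for the trained network)."""
    h = x
    for w in weights[:-1]:
        h = np.maximum(h @ w, 0.0)
    return float(np.mean((h @ weights[-1]).argmax(axis=1) == y))

def lesion(weights, layer, neurons):
    """Zero all weights incoming to `neurons` of hidden layer `layer` (0-indexed)."""
    damaged = [w.copy() for w in weights]
    damaged[layer][:, neurons] = 0.0
    return damaged

def lesion_test(weights, x, y, layer, neurons, rng, n_samples=100):
    """Return the accuracy drop and whether it exceeds all random same-size lesions."""
    base = accuracy(weights, x, y)
    drop = base - accuracy(lesion(weights, layer, neurons), x, y)
    width = weights[layer].shape[1]
    random_drops = []
    for _ in range(n_samples):
        picked = rng.choice(width, size=len(neurons), replace=False)
        random_drops.append(base - accuracy(lesion(weights, layer, picked), x, y))
    return drop, drop > max(random_drops)
```

The other two criteria in the text (a drop of more than 1 percentage point, and a sub-module covering at least 5% of the layer) are simple threshold checks on top of these two return values.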
In figure 5, we show data on the importance of sub-modules of an MLP trained on Fashion-MNIST with dropout that has been clustered into 10 modules (the number 10 was chosen to increase the granularity of analysis). It shows that many sub-modules are too small to be counted as important, and that many are statistically significantly impactful but not practically significant. However, some sub-modules clearly are practically important for the functioning of the network.
Now that we know which sub-modules are important, it would be ideal to understand how the important sub-modules depend on each other. To do this, we conduct experiments where we lesion two different important sub-modules, which we’ll call $X$ and $Y$, in different layers. First of all, we measure the loss in accuracy when both are lesioned, which we’ll call $\ell(X \cup Y)$. We then compare $\ell(X \cup Y)$ to the loss in accuracy $\ell(X \cup Y')$ if we take a random subset $Y'$ of neurons of size $|Y|$ from the same layer as $Y$, and check if $\ell(X \cup Y)$ is larger than 50 random samples of $\ell(X \cup Y')$. This tests if the damage from lesioning $Y$ is statistically significant given how many neurons are contained in $Y$, and given that we are already lesioning $X$. We also calculate $\delta(Y, X) := \ell(X \cup Y) - \ell(X)$, which is the additional damage from lesioning $Y$ given that $X$ has been lesioned. If $\ell(X \cup Y)$ is statistically significantly different to the distribution of $\ell(X \cup Y')$, and if $\delta(Y, X)$ is larger than 1 percentage point, we say that sub-module $Y$ is important conditioned on sub-module $X$. Similarly, we test if $X$ is important conditioned on $Y$ by comparing $\ell(X \cup Y)$ to the distribution of $\ell(X' \cup Y)$, and by determining the size of $\delta(X, Y)$. Figure 6 plots the $\delta$ values and importances of different pairs of clusters. Data for significance of all sub-modules, not merely the important ones, are presented in appendix A.4.
By examining the importances of sub-modules conditioned on each other, we can attempt to construct a dependency graph of sub-modules by determining which sub-modules send information to which others. Consider a pair of sub-modules $(X, Y)$ where $X$ is in an earlier layer than $Y$, and where both are individually important.
If $X$ is not important conditioned on $Y$, and $Y$ is not important conditioned on $X$, we reason that all of the information from $X$ is sent to $Y$ (since otherwise lesioning $X$ would damage accuracy even conditioned on $Y$ being lesioned), and that the only information that $Y$ receives is sent via $X$ (since otherwise lesioning $Y$ would damage accuracy even conditioned on $X$ being lesioned).
If $X$ is not important conditioned on $Y$ but $Y$ is important conditioned on $X$, then we reason that $X$ sends all of its information to $Y$, and that $Y$ also receives information from other sub-modules.
If $Y$ is not important conditioned on $X$ but $X$ is important conditioned on $Y$, we reason that $Y$ receives all of its information from $X$, which also sends information to other sub-modules.
We can conclude nothing if both $X$ and $Y$ are important conditioned on the other.
These assumptions let us conclude that for any two important sub-modules $X$ and $Y$, if one is unimportant conditioned on the other, then information flows from one to the other. This, combined with the data shown in figure 6, lets us draw some edges in a dependency graph of sub-modules, which is shown in figure 7. Note that sub-modules of module 2 seem to send information to each other, which is what we would expect if modules were internally connected.
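The case analysis above can be written down as a small decision procedure. This is a sketch with hypothetical names: `important_conditioned_on` encodes the two conditions from the text (the joint accuracy loss exceeds every sampled random-lesion loss, and the additional damage exceeds 1 percentage point), and `infer_flow` applies the four cases to a pair in which the first sub-module sits in an earlier layer than the second. Losses are assumed to be expressed in percentage points.

```python
def important_conditioned_on(loss_both, loss_other_alone, sampled_losses, min_delta=1.0):
    """Is a sub-module important given that the other one is already lesioned?

    loss_both: accuracy loss when both sub-modules are lesioned.
    loss_other_alone: accuracy loss when only the other sub-module is lesioned.
    sampled_losses: losses when the other sub-module is lesioned together with
        random same-size subsets drawn from this sub-module's layer.
    """
    delta = loss_both - loss_other_alone       # additional damage
    return all(loss_both > s for s in sampled_losses) and delta > min_delta

def infer_flow(earlier_important_given_later, later_important_given_earlier):
    """Apply the four cases to an (earlier, later) pair of important sub-modules."""
    if earlier_important_given_later and later_important_given_earlier:
        return None                            # nothing can be concluded
    return {
        "edge": ("earlier", "later"),
        "earlier_sends_only_to_later": not earlier_important_given_later,
        "later_receives_only_from_earlier": not later_important_given_earlier,
    }
```

An edge is drawn whenever at least one of the two sub-modules is unimportant conditioned on the other; the two boolean fields record which of the stronger conclusions the data supports.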
Previous research has explored modularity in neural networks. Watanabe et al. (2018a) demonstrate a different way of grouping neurons into modules, the interpretation of which is discussed by Watanabe et al. (2018b) and Watanabe (2019). Davis et al. (2019) cluster small networks based on statistical properties, rather than on the weights of the network. More broadly, Clune et al. (2013) discuss modularity in biological networks, and give an overview of the biological literature on modularity.
Our work relates more broadly to the study of graph clustering and network science. von Luxburg (2007) gives an overview of spectral clustering, the technique we use, as well as a large amount of work on the topic. Since it was published, Lee et al. (2014) derive an analogue of the Cheeger inequalities (Alon and Milman, 1985; Dodziuk, 1984) for multi-way partitioning, providing a further justification for the spectral clustering algorithm. Barabási and others (2016) give an introduction to the field of network science, which has produced studies such as those by Girvan and Newman (2002) and Newman and Girvan (2004) that study graph clustering under the name ‘community detection’.
A number of papers have investigated some aspect of the structure of neural networks. For instance, Frankle and Carbin (2019) discover that neural networks contain efficiently-trainable subnetworks, inspiring several follow-up papers (Zhou et al., 2019; Frankle et al., 2019), and Ramanujan et al. (2019) show that randomly-initialized neural networks contain subnetworks that achieve high performance on image classification problems. In general, the field of interpretability (sometimes known as transparency or explainable AI) investigates techniques to understand the functioning of neural networks. Books by Molnar (2019) and Samek et al. (2019) offer an overview of this research. Perhaps most similar to our own work within the field of interpretability are Olah et al. (2017), Olah et al. (2018), and Carter et al. (2019), which visualize the features learned in neural networks without relying on detailed information about the data distribution.
Molchanov et al. (2017) discuss how variational dropout can sparsify neural networks. Although this is related to our dropout results, it is important to note that their work relies on learned per-neuron dropout rates causing sparsity, while our work uses standard dropout, and finds an unusually low n-cut (that is, unusually high clusterability) controlling for sparsity.
This paper has shown that MLPs trained with weight pruning are often more modular than would be expected given only their overall connectivity and distribution of weights. Further, it has shown that dropout significantly accentuates the modularity, and that the modularity is closely associated with learning. It has also exhibited an exploratory analysis of the module structure of the networks.
Several questions remain: Do these results extend to larger networks trained on larger datasets? What is the appropriate notion of modularity for convolutional neural networks? Why are networks so clusterable, and why does dropout have the effect that it does? Are clusters akin to functional regions of the human brain, and how can we tell? If modularity is desirable, can it be directly regularized for? We hope that follow-up research will shed light on these puzzles.
The authors would like to thank Open Philanthropy for their support of this research, and the researchers at UC Berkeley’s Center for Human-Compatible AI for their advice. Daniel Filan would like to thank Paul Christiano, Rohin Shah, Matthew ‘Vaniver’ Graves, and Buck Shlegeris for valuable discussions that helped shape this research, as well as Andrei Knyazev for his work in helping debug scikit-learn’s implementation of spectral clustering. Shlomi Hod would like to thank Dmitrii Krasheninnikov for fruitful conversations throughout the summer of 2019.
Samek et al. (2019). Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. Lecture Notes in Artificial Intelligence, Springer Nature.
During training, we use the Adam algorithm with the standard Keras hyperparameters: learning rate $0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, no amsgrad. For pruning, our initial sparsity is 0.5, our final sparsity is 0.9, the pruning frequency is 10 steps, and we use a cubic pruning schedule (see Zhu and Gupta (2017)). Initial and final sparsities were chosen due to their use in the TensorFlow Model Optimization Tutorial (URL: https://web.archive.org/web/20190817115045/https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_keras). Pruning typically only starts after 20 epochs, and runs for 20 epochs.
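For intuition about the cubic pruning schedule, here is a small sketch of the target sparsity as a function of the training step, following the polynomial form given by Zhu and Gupta (2017). The function name and the step bookkeeping are assumptions made for illustration; this is not the TensorFlow Model Optimization implementation.

```python
def polynomial_sparsity(step, begin_step, end_step,
                        initial_sparsity=0.5, final_sparsity=0.9, power=3):
    """Target sparsity at `step` under a polynomial (here cubic) decay schedule.

    The target rises quickly at first and flattens out as it approaches
    final_sparsity. Steps outside [begin_step, end_step] are clamped.
    """
    progress = (step - begin_step) / (end_step - begin_step)
    progress = min(max(progress, 0.0), 1.0)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** power
```

As a usage example under stated assumptions: with 60,000 training examples and batch size 128 there are roughly 469 steps per epoch, so pruning that starts after 20 epochs and runs for 20 epochs would correspond to `begin_step = 20 * 469` and `end_step = 40 * 469`, with the mask updated every 10 steps.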
Tables A.1, A.2, A.3, A.4, A.5, A.6, A.7, A.8, A.9, A.10, A.11, and A.12 show detailed information about clusterability of each network trained on MNIST, Fashion-MNIST, and CIFAR-10, with and without dropout, both before pruning was applied and at the end of training. They include statistics for how each network compares to the distribution of shuffles of that network’s weights, as well as the distribution of shuffles of the non-zero weights that preserve the network topology. They also include the Cohen’s $d$ statistic of the normalized difference between the mean of the standard shuffle distribution and the mean of the non-zero shuffle distribution for each trained network. If $\bar{x}_1$ and $\bar{x}_2$ are the respective sample means, $s_1$ and $s_2$ are the respective sample standard deviations, and $n_1$ and $n_2$ are the respective sample sizes, then Cohen’s $d$ is defined as $d := (\bar{x}_1 - \bar{x}_2) / s$, where $s := \sqrt{\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}}$.
Note that different samples were drawn for the generation of this table than were used to calculate the statistics in the main text, so there may be minor discrepancies. The run ID of networks is the same before and after pruning, so each row in a table of networks after pruning refers to the pruned version of the network whose information is displayed in the corresponding row of the corresponding table of networks before pruning.
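For readers who wish to recompute the effect sizes from summary statistics, a minimal sketch follows. It assumes the usual pooled-standard-deviation form of Cohen's $d$; the function name `cohens_d` and the sample size of 50 are hypothetical and chosen only for illustration.

```python
import math


def cohens_d(mean_1, std_1, n_1, mean_2, std_2, n_2):
    """Cohen's d between two samples, using the pooled standard deviation."""
    pooled_var = ((n_1 - 1) * std_1 ** 2 + (n_2 - 1) * std_2 ** 2) / (n_1 + n_2 - 2)
    return (mean_1 - mean_2) / math.sqrt(pooled_var)


# Rounded summary statistics of the kind reported in the tables:
# standard-shuffle mean and std. first, non-zero-shuffle mean and std. second.
# The sample size (50 per distribution) is an assumption for this example.
d = cohens_d(2.035, 0.013, 50, 1.908, 0.004, 50)
print(round(d, 1))
```

With equal sample sizes the pooled variance is simply the average of the two variances, so the result does not depend on the assumed sample size. Values recomputed from the rounded table entries will differ slightly from the tabulated ones.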
In figure A.1, we graphically show the Cohen’s $d$ statistics documented in tables A.1 to A.12. Since networks should have very few weights that are exactly zero before pruning, we expect the $d$ statistics to be approximately zero for unpruned networks, and indeed this is what we see. Note the negative average $d$ statistic for networks trained on MNIST without dropout, indicating that the actual trained topologies are less clusterable on average than random shuffles of the networks.
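The difference between the two kinds of shuffle can be sketched as follows. This is an illustrative example only: it treats a single weight matrix as a list of rows, and the function names `shuffle_all` and `shuffle_nonzero` are hypothetical rather than taken from our code.

```python
import random


def shuffle_all(weights, rng):
    """Permute every entry of the matrix, zeros included."""
    flat = [w for row in weights for w in row]
    rng.shuffle(flat)
    n_cols = len(weights[0])
    return [flat[i:i + n_cols] for i in range(0, len(flat), n_cols)]


def shuffle_nonzero(weights, rng):
    """Permute only the non-zero entries; zeros stay where they are."""
    nonzero = [w for row in weights for w in row if w != 0.0]
    rng.shuffle(nonzero)
    values = iter(nonzero)
    return [[next(values) if w != 0.0 else 0.0 for w in row] for row in weights]


rng = random.Random(0)
pruned = [[0.5, 0.0, -1.2], [0.0, 0.3, 0.0]]
print(shuffle_all(pruned, rng))      # zeros may move: topology can change
print(shuffle_nonzero(pruned, rng))  # zeros fixed: topology is preserved
```

When a matrix contains no exact zeros, `shuffle_nonzero` permutes every entry just as `shuffle_all` does, which is why the two shuffle distributions nearly coincide for unpruned networks.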
The distributions of n-cuts for shuffles of networks trained on MNIST, the distributions of n-cuts for shuffles of the non-zero elements, and the n-cut of the actual network are shown in a series of violin plots. Figure A.2 shows the distributions for networks trained without dropout before the onset of pruning, figure A.3 shows the distributions for networks trained with dropout before the onset of pruning, figure A.4 shows the distributions for networks trained without dropout at the end of training, and figure A.5 shows the distributions for networks trained with dropout at the end of training. Note that since networks have almost no weights that are exactly 0 before pruning, the blue and orange distributions in figures A.2 and A.3 are nearly identical. Note also that the distributions for shuffles of unpruned networks trained without dropout tend to be bimodal, and the distributions for shuffles of unpruned networks trained with dropout tend to have long right tails. We do not have an explanation for either of these phenomena.

Table A.1

Run | N-cut | $p$-value | Dist. mean | Dist. std. | $p$-value (non-zero) | Non-zero dist. mean | Non-zero dist. std. | Cohen’s $d$ |
---|---|---|---|---|---|---|---|---|
0 | 2.423 | 0.065 | 2.456 | 0.038 | 0.065 | 2.459 | 0.043 | -0.087 |
1 | 2.419 | 0.003 | 2.474 | 0.050 | 0.003 | 2.456 | 0.020 | 0.468 |
2 | 2.445 | 0.346 | 2.465 | 0.045 | 0.346 | 2.479 | 0.056 | -0.280 |
3 | 2.453 | 0.439 | 2.474 | 0.069 | 0.502 | 2.481 | 0.066 | -0.104 |
4 | 2.432 | 0.190 | 2.467 | 0.058 | 0.190 | 2.452 | 0.038 | 0.297 |
5 | 2.443 | 0.159 | 2.483 | 0.056 | 0.190 | 2.477 | 0.051 | 0.115 |
6 | 2.453 | 0.595 | 2.465 | 0.058 | 0.657 | 2.469 | 0.064 | -0.067 |
7 | 2.433 | 0.252 | 2.459 | 0.067 | 0.190 | 2.465 | 0.066 | -0.090 |
8 | 2.468 | 0.751 | 2.479 | 0.055 | 0.657 | 2.483 | 0.059 | -0.089 |
9 | 2.447 | 0.502 | 2.450 | 0.033 | 0.097 | 2.461 | 0.025 | -0.344 |

Table A.2

Run | N-cut | $p$-value | Dist. mean | Dist. std. | $p$-value (non-zero) | Non-zero dist. mean | Non-zero dist. std. | Cohen’s $d$ |
---|---|---|---|---|---|---|---|---|
0 | 2.219 | 0.003 | 2.348 | 0.019 | 0.003 | 2.356 | 0.032 | -0.310 |
1 | 2.247 | 0.003 | 2.350 | 0.020 | 0.003 | 2.351 | 0.022 | -0.047 |
2 | 2.221 | 0.003 | 2.345 | 0.016 | 0.003 | 2.348 | 0.020 | -0.142 |
3 | 2.232 | 0.003 | 2.345 | 0.017 | 0.003 | 2.344 | 0.002 | 0.145 |
4 | 2.214 | 0.003 | 2.346 | 0.018 | 0.003 | 2.346 | 0.016 | -0.035 |
5 | 2.224 | 0.003 | 2.345 | 0.014 | 0.003 | 2.353 | 0.029 | -0.334 |
6 | 2.216 | 0.003 | 2.342 | 0.003 | 0.003 | 2.349 | 0.024 | -0.414 |
7 | 2.233 | 0.003 | 2.346 | 0.013 | 0.003 | 2.343 | 0.005 | 0.365 |
8 | 2.217 | 0.003 | 2.348 | 0.021 | 0.003 | 2.351 | 0.026 | -0.145 |
9 | 2.319 | 0.003 | 2.349 | 0.019 | 0.003 | 2.349 | 0.018 | 0.027 |

Table A.3

Run | N-cut | $p$-value | Dist. mean | Dist. std. | $p$-value (non-zero) | Non-zero dist. mean | Non-zero dist. std. | Cohen’s $d$ |
---|---|---|---|---|---|---|---|---|
0 | 2.073 | 0.938 | 2.050 | 0.012 | 0.097 | 2.094 | 0.020 | -2.678 |
1 | 1.969 | 0.003 | 2.040 | 0.018 | 0.003 | 2.024 | 0.012 | 1.039 |
2 | 2.001 | 0.003 | 2.041 | 0.017 | 0.003 | 2.084 | 0.031 | -1.726 |
3 | 1.960 | 0.003 | 2.041 | 0.021 | 0.128 | 2.056 | 0.087 | -0.243 |
4 | 1.977 | 0.003 | 2.044 | 0.016 | 0.003 | 2.051 | 0.017 | -0.417 |
5 | 2.016 | 0.065 | 2.040 | 0.016 | 0.003 | 2.088 | 0.018 | -2.812 |
6 | 1.983 | 0.003 | 2.041 | 0.016 | 0.003 | 2.057 | 0.019 | -0.853 |
7 | 1.982 | 0.003 | 2.043 | 0.021 | 0.003 | 2.083 | 0.032 | -1.473 |
8 | 2.037 | 0.346 | 2.040 | 0.016 | 0.065 | 2.076 | 0.026 | -1.659 |
9 | 2.005 | 0.003 | 2.037 | 0.017 | 0.003 | 2.069 | 0.023 | -1.547 |

Table A.4

Run | N-cut | $p$-value | Dist. mean | Dist. std. | $p$-value (non-zero) | Non-zero dist. mean | Non-zero dist. std. | Cohen’s $d$ |
---|---|---|---|---|---|---|---|---|
0 | 1.845 | 0.003 | 2.035 | 0.013 | 0.003 | 1.908 | 0.004 | 13.015 |
1 | 1.846 | 0.003 | 2.044 | 0.013 | 0.003 | 1.904 | 0.009 | 12.461 |
2 | 1.822 | 0.003 | 2.033 | 0.012 | 0.003 | 1.897 | 0.009 | 12.687 |
3 | 1.845 | 0.003 | 2.040 | 0.014 | 0.003 | 1.893 | 0.008 | 13.135 |
4 | 1.818 | 0.003 | 2.032 | 0.012 | 0.003 | 1.914 | 0.008 | 11.401 |
5 | 1.827 | 0.003 | 2.038 | 0.023 | 0.003 | 1.914 | 0.023 | 5.387 |
6 | 1.843 | 0.003 | 2.034 | 0.012 | 0.003 | 1.909 | 0.050 | 3.417 |
7 | 1.864 | 0.003 | 2.038 | 0.020 | 0.034 | 1.895 | 0.081 | 2.417 |
8 | 1.839 | 0.003 | 2.040 | 0.018 | 0.003 | 1.916 | 0.011 | 8.101 |
9 | 1.856 | 0.003 | 2.037 | 0.012 | 0.003 | 1.897 | 0.010 | 12.485 |

Table A.5

Run | N-cut | $p$-value | Dist. mean | Dist. std. | $p$-value (non-zero) | Non-zero dist. mean | Non-zero dist. std. | Cohen’s $d$ |
---|---|---|---|---|---|---|---|---|
0 | 2.185 | 0.003 | 2.335 | 0.027 | 0.003 | 2.330 | 0.021 | 0.242 |
1 | 2.167 | 0.003 | 2.324 | 0.028 | 0.003 | 2.320 | 0.024 | 0.163 |
2 | 2.171 | 0.003 | 2.321 | 0.020 | 0.003 | 2.324 | 0.047 | -0.094 |
3 | 2.178 | 0.003 | 2.321 | 0.020 | 0.003 | 2.318 | 0.020 | 0.171 |
4 | 2.182 | 0.003 | 2.276 | 0.026 | 0.003 | 2.281 | 0.026 | -0.199 |
5 | 2.183 | 0.003 | 2.322 | 0.020 | 0.003 | 2.319 | 0.024 | 0.130 |
6 | 2.179 | 0.003 | 2.325 | 0.022 | 0.003 | 2.338 | 0.042 | -0.379 |
7 | 2.189 | 0.003 | 2.333 | 0.016 | 0.003 | 2.334 | 0.022 | -0.012 |
8 | 2.180 | 0.003 | 2.315 | 0.024 | 0.003 | 2.312 | 0.026 | 0.134 |
9 | 2.172 | 0.003 | 2.333 | 0.020 | 0.003 | 2.333 | 0.020 | -0.027 |

Table A.6

Run | N-cut | $p$-value | Dist. mean | Dist. std. | $p$-value (non-zero) | Non-zero dist. mean | Non-zero dist. std. | Cohen’s $d$ |
---|---|---|---|---|---|---|---|---|
0 | 2.140 | 0.003 | 2.313 | 0.026 | 0.003 | 2.304 | 0.024 | 0.329 |
1 | 2.155 | 0.003 | 2.356 | 0.045 | 0.003 | 2.357 | 0.041 | -0.014 |
2 | 2.143 | 0.003 | 2.308 | 0.031 | 0.003 | 2.316 | 0.026 | -0.280 |
3 | 2.155 | 0.003 | 2.350 | 0.038 | 0.003 | 2.359 | 0.053 | -0.196 |
4 | 2.147 | 0.003 | 2.335 | 0.055 | 0.003 | 2.325 | 0.026 | 0.215 |
5 | 2.147 | 0.003 | 2.329 | 0.020 | 0.003 | 2.326 | 0.030 | 0.113 |
6 | 2.136 | 0.003 | 2.292 | 0.033 | 0.003 | 2.291 | 0.028 | 0.032 |
7 | 2.148 | 0.003 | 2.322 | 0.033 | 0.003 | 2.323 | 0.045 | -0.029 |
8 | 2.147 | 0.003 | 2.303 | 0.032 | 0.003 | 2.309 | 0.025 | -0.207 |
9 | 2.132 | 0.003 | 2.318 | 0.029 | 0.003 | 2.312 | 0.030 | 0.202 |

Table A.7

Run | N-cut | $p$-value | Dist. mean | Dist. std. | $p$-value (non-zero) | Non-zero dist. mean | Non-zero dist. std. | Cohen’s $d$ |
---|---|---|---|---|---|---|---|---|
0 | 1.876 | 0.003 | 2.005 | 0.013 | 0.034 | 1.994 | 0.103 | 0.149 |
1 | 1.871 | 0.003 | 1.992 | 0.014 | 0.065 | 1.942 | 0.072 | 0.958 |
2 | 1.839 | 0.003 | 1.986 | 0.018 | 0.034 | 1.870 | 0.021 | 5.928 |
3 | 1.852 | 0.003 | 1.992 | 0.021 | 0.034 | 2.011 | 0.139 | -0.184 |
4 | 1.909 | 0.003 | 1.987 | 0.016 | 0.034 | 1.963 | 0.094 | 0.344 |
5 | 1.868 | 0.003 | 1.989 | 0.018 | 0.128 | 1.892 | 0.033 | 3.664 |
6 | 1.936 | 0.003 | 1.986 | 0.017 | 0.159 | 2.016 | 0.116 | -0.363 |
7 | 1.912 | 0.003 | 2.003 | 0.021 | 0.065 | 1.931 | 0.015 | 3.910 |
8 | 1.870 | 0.003 | 1.990 | 0.013 | 0.221 | 1.877 | 0.073 | 2.137 |
9 | 1.865 | 0.003 | 1.997 | 0.019 | 0.252 | 1.930 | 0.151 | 0.619 |

Table A.8

Run | N-cut | $p$-value | Dist. mean | Dist. std. | $p$-value (non-zero) | Non-zero dist. mean | Non-zero dist. std. | Cohen’s $d$ |
---|---|---|---|---|---|---|---|---|
0 | 1.747 | 0.003 | 2.018 | 0.016 | 0.003 | 1.822 | 0.008 | 15.732 |
1 | 1.709 | 0.003 | 2.015 | 0.015 | 0.003 | 1.779 | 0.007 | 20.336 |
2 | 1.722 | 0.003 | 2.017 | 0.018 | 0.034 | 1.808 | 0.023 | 10.057 |
3 | 1.744 | 0.003 | 2.014 | 0.016 | 0.097 | 1.806 | 0.044 | 6.330 |
4 | 1.730 | 0.003 | 2.015 | 0.018 | 0.003 | 1.783 | 0.008 | 16.432 |
5 | 1.682 | 0.003 | 2.023 | 0.017 | 0.003 | 1.762 | 0.005 | 20.306 |
6 | 1.713 | 0.003 | 2.014 | 0.019 | 0.003 | 1.791 | 0.016 | 12.580 |
7 | 1.751 | 0.003 | 2.016 | 0.014 | 0.003 | 1.803 | 0.006 | 20.116 |
8 | 1.746 | 0.003 | 2.008 | 0.018 | 0.034 | 1.800 | 0.027 | 9.150 |
9 | 1.721 | 0.003 | 2.010 | 0.015 | 0.003 | 1.784 | 0.009 | 17.827 |

Table A.9

Run | N-cut | $p$-value | Dist. mean | Dist. std. | $p$-value (non-zero) | Non-zero dist. mean | Non-zero dist. std. | Cohen’s $d$ |
---|---|---|---|---|---|---|---|---|
0 | 2.405 | 1.000 | 2.247 | 0.013 | 1.000 | 2.249 | 0.018 | -0.118 |
1 | 2.433 | 1.000 | 2.249 | 0.014 | 1.000 | 2.251 | 0.015 | -0.077 |
2 | 2.419 | 1.000 | 2.247 | 0.014 | 1.000 | 2.245 | 0.013 | 0.122 |
3 | 2.438 | 1.000 | 2.250 | 0.015 | 1.000 | 2.252 | 0.018 | -0.097 |
4 | 2.429 | 1.000 | 2.238 | 0.013 | 1.000 | 2.242 | 0.016 | -0.244 |
5 | 2.451 | 1.000 | 2.255 | 0.025 | 1.000 | 2.254 | 0.023 | 0.054 |
6 | 2.453 | 1.000 | 2.247 | 0.016 | 1.000 | 2.250 | 0.016 | -0.224 |
7 | 2.409 | 1.000 | 2.247 | 0.017 | 1.000 | 2.246 | 0.015 | 0.072 |
8 | 2.449 | 1.000 | 2.244 | 0.011 | 1.000 | 2.246 | 0.015 | -0.171 |
9 | 2.453 | 1.000 | 2.258 | 0.014 | 1.000 | 2.254 | 0.014 | 0.289 |

Table A.10

Run | N-cut | $p$-value | Dist. mean | Dist. std. | $p$-value (non-zero) | Non-zero dist. mean | Non-zero dist. std. | Cohen’s $d$ |
---|---|---|---|---|---|---|---|---|
0 | 2.149 | 0.003 | 2.233 | 0.002 | 0.003 | 2.233 | 0.002 | 0.007 |
1 | 2.132 | 0.003 | 2.234 | 0.002 | 0.003 | 2.234 | 0.001 | -0.033 |
2 | 2.137 | 0.003 | 2.234 | 0.003 | 0.003 | 2.235 | 0.002 | -0.463 |
3 | 2.198 | 0.003 | 2.238 | 0.002 | 0.003 | 2.239 | 0.002 | -0.242 |
4 | 2.138 | 0.003 | 2.230 | 0.002 | 0.003 | 2.230 | 0.004 | 0.0787 |
5 | 2.145 | 0.003 | 2.234 | 0.002 | 0.003 | 2.235 | 0.002 | -0.265 |
6 | 2.158 | 0.003 | 2.234 | 0.005 | 0.003 | 2.236 | 0.002 | -0.407 |
7 | 2.137 | 0.003 | 2.233 | 0.002 | 0.003 | 2.233 | 0.002 | 0.172 |
8 | 2.148 | 0.003 | 2.235 | 0.002 | 0.003 | 2.235 | 0.002 | -0.002 |
9 | 2.151 | 0.003 | 2.235 | 0.003 | 0.003 | 2.235 | 0.003 | 0.128 |

Table A.11

Run | N-cut | $p$-value | Dist. mean | Dist. std. | $p$-value (non-zero) | Non-zero dist. mean | Non-zero dist. std. | Cohen’s $d$ |
---|---|---|---|---|---|---|---|---|
0 | 2.049 | 1.000 | 2.006 | 0.016 | 0.751 | 2.043 | 0.055 | -0.901 |
1 | 2.296 | 1.000 | 1.999 | 0.016 | 0.688 | 2.354 | 0.327 | -1.533 |
2 | 2.037 | 1.000 | 2.009 | 0.011 | 0.907 | 2.019 | 0.012 | -0.828 |
3 | 1.993 | 0.377 | 1.998 | 0.015 | 0.688 | 1.986 | 0.011 | 0.945 |
4 | 2.014 | 0.844 | 1.997 | 0.014 | 0.907 | 1.997 | 0.027 | 0.039 |
5 | 2.015 | 0.657 | 2.010 | 0.016 | 0.252 | 2.022 | 0.010 | -0.893 |
6 | 2.301 | 1.000 | 1.995 | 0.017 | 1.000 | 1.844 | 0.238 | 0.891 |
7 | 1.833 | 0.003 | 1.997 | 0.013 | 0.502 | 1.898 | 0.404 | 0.347 |
8 | 2.054 | 1.000 | 2.012 | 0.017 | 1.000 | 2.021 | 0.013 | -0.585 |
9 | 2.008 | 0.439 | 2.008 | 0.014 | 0.439 | 2.012 | 0.008 | -0.371 |

Table A.12

Run | N-cut | $p$-value | Dist. mean | Dist. std. | $p$-value (non-zero) | Non-zero dist. mean | Non-zero dist. std. | Cohen’s $d$ |
---|---|---|---|---|---|---|---|---|
0 | 2.040 | 1.000 | 1.990 | 0.009 | 0.907 | 1.873 | 0.128 | 1.298 |
1 | 1.827 | 0.003 | 1.985 | 0.014 | 0.252 | 1.849 | 0.038 | 4.791 |
2 | 1.802 | 0.003 | 1.990 | 0.012 | 0.190 | 1.833 | 0.050 | 4.311 |
3 | 1.915 | 0.003 | 2.021 | 0.008 | 0.003 | 1.939 | 0.033 | 3.420 |
4 | 1.733 | 0.003 | 1.991 | 0.014 | 0.003 | 1.767 | 0.015 | 15.043 |
5 | 1.780 | 0.003 | 1.996 | 0.009 | 0.065 | 1.862 | 0.062 | 3.054 |
6 | 1.828 | 0.003 | 2.003 | 0.007 | 0.128 | 1.936 | 0.128 | 0.741 |
7 | 1.853 | 0.003 | 1.989 | 0.011 | 0.844 | 1.841 | 0.090 | 2.313 |
8 | 1.874 | 0.003 | 1.998 | 0.009 | 0.097 | 1.985 | 0.100 | 0.174 |
9 | 1.750 | 0.003 | 1.993 | 0.013 | 0.097 | 1.823 | 0.071 | 3.322 |

Table A.13 shows n-cuts and accuracies for networks trained without dropout, while table A.14 shows the same for networks trained with dropout. The broad pattern is that networks trained on mixtures between MNIST and CIFAR-10 have lower n-cuts than those trained on either individually, but networks trained on mixtures between LINES and another dataset have n-cuts intermediate between those trained on LINES and those trained on the other dataset. The one exception is that networks trained on LINES-CIFAR-10-SEP without dropout have lower n-cuts than both those trained on LINES without dropout and those trained on CIFAR-10 without dropout. Since LINES is a very artificial dataset, it is possible that the results obtained for mixtures between MNIST and CIFAR-10 are more representative of other natural datasets.

Table A.13

Dataset | N-cut mean | N-cut std. | Mean train acc. | Mean test acc. |
---|---|---|---|---|
MNIST | 2.000 | 0.035 | 1.00 | 0.984 |
CIFAR-10 | 2.06 | 0.14 | 0.650 | 0.415 |
LINES | 1.599 | 0.049 | 1.00 | 1.00 |
LINES-MNIST | 1.760 | 0.039 | 1.00 | 0.991 |
LINES-CIFAR-10 | 1.632 | 0.032 | 0.784 | 0.708 |
MNIST-CIFAR-10 | 1.822 | 0.025 | 0.805 | 0.697 |
LINES-MNIST-SEP | 1.842 | 0.046 | 1.00 | 0.994 |
LINES-CIFAR-10-SEP | 1.536 | 0.033 | 0.941 | 0.817 |
MNIST-CIFAR-10-SEP | 1.587 | 0.041 | 0.964 | 0.810 |

Table A.14

Dataset | N-cut mean | N-cut std. | Mean train acc. | Mean test acc. |
---|---|---|---|---|
MNIST | 1.840 | 0.015 | 0.967 | 0.979 |
CIFAR-10 | 1.840 | 0.089 | 0.427 | 0.422 |
LINES | 1.144 | 0.036 | 0.913 | 0.312 |
LINES-MNIST | 1.272 | 0.056 | 0.875 | 0.635 |
LINES-CIFAR-10 | 1.418 | 0.031 | 0.682 | 0.459 |
MNIST-CIFAR-10 | 1.784 | 0.032 | 0.714 | 0.694 |
LINES-MNIST-SEP | 1.517 | 0.020 | 0.984 | 0.885 |
LINES-CIFAR-10-SEP | 1.375 | 0.038 | 0.828 | 0.817 |
MNIST-CIFAR-10-SEP | 1.517 | 0.029 | 0.853 | 0.826 |
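
The pattern described for the mixture datasets can be checked directly against the tabulated n-cut means. The sketch below is illustrative only: the values are copied from tables A.13 (without dropout) and A.14 (with dropout), and the helper name `relation` is hypothetical.

```python
# N-cut means copied from tables A.13 (no dropout) and A.14 (dropout).
NCUT = {
    "no dropout": {
        "MNIST": 2.000, "CIFAR-10": 2.06, "LINES": 1.599,
        "LINES-MNIST": 1.760, "LINES-CIFAR-10": 1.632, "MNIST-CIFAR-10": 1.822,
        "LINES-MNIST-SEP": 1.842, "LINES-CIFAR-10-SEP": 1.536,
        "MNIST-CIFAR-10-SEP": 1.587,
    },
    "dropout": {
        "MNIST": 1.840, "CIFAR-10": 1.840, "LINES": 1.144,
        "LINES-MNIST": 1.272, "LINES-CIFAR-10": 1.418, "MNIST-CIFAR-10": 1.784,
        "LINES-MNIST-SEP": 1.517, "LINES-CIFAR-10-SEP": 1.375,
        "MNIST-CIFAR-10-SEP": 1.517,
    },
}
PAIRS = [("LINES", "MNIST"), ("LINES", "CIFAR-10"), ("MNIST", "CIFAR-10")]


def relation(table, first, second, suffix=""):
    """Compare a mixture's mean n-cut with those of its two constituents."""
    mixed = table[f"{first}-{second}{suffix}"]
    low, high = sorted((table[first], table[second]))
    if mixed < low:
        return "below both"
    if mixed > high:
        return "above both"
    return "intermediate"


for condition, table in NCUT.items():
    for first, second in PAIRS:
        for suffix in ("", "-SEP"):
            name = f"{first}-{second}{suffix}"
            print(condition, name, relation(table, first, second, suffix))
```

This comparison uses means only and ignores the reported standard deviations, so it should be read as a consistency check on the description above rather than as a significance test.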
Table A.15 shows data on the importance of sub-modules in the single lesion experiments, and is plotted in figure 5. Table A.16 shows the conditional importance data of important pairs of sub-modules that is plotted in figure 6. The conditional importance data of all pairs of sub-modules is shown in figure A.6.

Table A.15

Layer | Label | Acc. diff. | $p$-value | Proportion | Type | Acc. diff. mean | Acc. diff. std. |
---|---|---|---|---|---|---|---|
1 | 0 | -0.0552 | 0.0099 | 0.250 | important | -0.0034 | 0.0026 |
1 | 1 | -0.0105 | 0.0099 | 0.123 | important | -0.0011 | 0.0015 |
1 | 2 | 0.0002 | 0.64 | 0.015 | small | -0.00007 | 0.00057 |
1 | 3 | -0.1870 | 0.0099 | 0.245 | important | -0.0037 | 0.0031 |
1 | 4 | -0.0005 | 0.38 | 0.049 | small | -0.0002 | 0.0010 |
1 | 5 | -0.1024 | 0.0099 | 0.319 | important | -0.0052 | 0.0030 |
2 | 0 | 0.0001 | 0.52 | 0.020 | small | 0.00001 | 0.00051 |
2 | 1 | 0.0001 | 0.55 | 0.012 | small | 0.00001 | 0.00052 |
2 | 2 | -0.0350 | 0.0099 | 0.461 | important | -0.0042 | 0.0034 |
2 | 3 | 0.0002 | 0.71 | 0.008 | small | -0.000003 | 0.00031 |
2 | 5 | -0.0027 | 0.020 | 0.109 | other | -0.0004 | 0.0010 |
2 | 7 | -0.0033 | 0.0099 | 0.098 | sig-but-not-diff | -0.00030 | 0.00095 |
2 | 8 | -0.0063 | 0.0099 | 0.242 | sig-but-not-diff | -0.0013 | 0.0016 |
2 | 9 | -0.0002 | 0.39 | 0.051 | other | -0.00006 | 0.00074 |
3 | 2 | -0.0433 | 0.0099 | 0.527 | important | 0.0019 | 0.0021 |
3 | 6 | 0.0001 | 0.55 | 0.004 | small | -0.00007 | 0.00027 |
3 | 7 | -0.0085 | 0.0099 | 0.113 | sig-but-not-diff | 0.0011 | 0.0010 |
3 | 8 | -0.0127 | 0.0099 | 0.234 | important | 0.0015 | 0.0011 |
3 | 9 | -0.0016 | 0.0099 | 0.121 | sig-but-not-diff | 0.00096 | 0.00089 |
4 | 2 | -0.0279 | 0.0099 | 0.414 | important | -0.0076 | 0.0035 |
4 | 6 | -0.0006 | 0.69 | 0.082 | other | -0.00104 | 0.00072 |
4 | 7 | -0.0049 | 0.0099 | 0.125 | sig-but-not-diff | -0.0016 | 0.00083 |
4 | 8 | -0.0054 | 0.059 | 0.219 | other | -0.0029 | 0.0014 |
4 | 9 | -0.0136 | 0.0099 | 0.160 | important | -0.0019 | 0.0010 |

Table A.16

1-0 | 2-2 | 0.166 | 0.146 | ||
1-0 | 3-2 | 0.099 | 0.087 | ||
1-0 | 3-8 | 0.033 | -0.009 | ||
1-0 | 4-2 | 0.072 | 0.044 | ||
1-0 | 4-9 | 0.069 | 0.027 | ||
1-1 | 2-2 | 0.007 | 0.031 | ||
1-1 | 3-2 | 0.053 | 0.086 | ||
1-1 | 3-8 | 0.010 | 0.012 | ||
1-1 | 4-2 | 0.010 | 0.027 | ||
1-1 | 4-9 | 0.012 | 0.015 | ||
1-3 | 2-2 | 0.211 | 0.059 | ||
1-3 | 3-2 | 0.178 | 0.034 | ||
1-3 | 3-8 | 0.165 | -0.009 | ||
1-3 | 4-2 | 0.191 | 0.032 | ||
1-3 | 4-9 | 0.170 | -0.003 | ||
1-5 | 2-2 | 0.070 | 0.002 | ||
1-5 | 3-2 | 0.168 | 0.109 | ||
1-5 | 3-8 | 0.103 | 0.013 | ||
1-5 | 4-2 | 0.117 | 0.042 | ||
1-5 | 4-9 | 0.091 | 0.002 | ||
2-2 | 3-2 | 0.045 | 0.054 | ||
2-2 | 3-8 | 0.145 | 0.123 | ||
2-2 | 4-2 | 0.035 | 0.028 | ||
2-2 | 4-9 | 0.032 | 0.011 | ||
3-2 | 4-2 | 0.214 | 0.199 | ||
3-2 | 4-9 | 0.071 | 0.041 | ||
3-8 | 4-2 | 0.050 | 0.065 | ||
3-8 | 4-9 | -0.003 | -0.002 |