Recent findings suggest that neural networks can be “pruned” by 90% or more, eliminating unnecessary weights while maintaining performance similar to that of the original network. Similarly, the lottery ticket hypothesis frankle2018lottery proposes that neural networks contain subnetworks, called winning tickets, that can be trained in isolation to reach the performance of the original. These results suggest that neural networks may rely on lucky initializations to learn a good solution. Rather than extensively exploring weight-space, networks trained with gradient-based optimizers may converge quickly to local minima near the initialization, many of which will be poor estimators of the dataset distribution. If some subset of the weights must be in a winning configuration for a neural network to learn a good solution to a problem, then neural networks initialized with random weights must be significantly larger than the minimal network configuration that would solve the problem, in order to maximize the chance of having a winning initialization. Furthermore, small networks with winning initial configurations may be sensitive to small perturbations.
Similarly, gradient-based optimizers sample the gradient of the loss function with respect to the weights by averaging the gradient at a few elements of the dataset. Thus, a biased training dataset may bias the gradient in a way that can be detrimental to the success of the network. Here we examine how the distribution of the training dataset affects the network’s ability to learn.
In this paper, we explore how effectively small neural networks learn to take as input a configuration of Conway’s Game of Life (Life) berlekamp2018winning and output the configuration n steps in the future. Since this task can be implemented minimally by a small convolutional neural network, a neural network with an identical architecture should, in principle, be able to learn a similar solution. Nonetheless, we find that networks of this architecture rarely find solutions. We show that the number of weights necessary for networks to reliably converge on a solution increases quickly with n. Additionally, we show that the probability of convergence is highly sensitive to small perturbations of the initial weights. Finally, we explore properties of the training data that significantly increase the probability that a network will converge to a correct solution. While Life is a toy problem, we believe that these studies give insight into more general issues with training neural networks. In particular, we expect that other neural network architectures and problems exhibit similar issues: networks likely require a large number of parameters to learn any domain, and small networks likely exhibit similar sensitivities to small perturbations of their weights. Furthermore, optimal training datasets may be highly particular to certain parameters. Thus, with the growing interest in efficient neural networks NIPS2015_5784; hassibi1993second; hinton2015distilling; lecun1990optimal; li2016pruning, these results serve as an important step toward developing ideal training conditions.
1.1 Conway’s Game of Life
Prior studies have applied neural networks to model physical phenomena in applications including weather simulation and fluid dynamics baboo2010efficient; maqsood2004ensemble; mohan2018deep; shrivastava2012application. Similarly, neural networks have been trained to learn computational tasks such as addition and multiplication kaiser2015neural; graves2014neural; joulin2015inferring; trask2018neural. All of these tasks require the network to learn a hidden-step process: some update rule that can be generalized to perform multi-step computation.
Conway’s Life is a two-dimensional cellular automaton with a simple local update rule that can produce complex global behavior. In a Life configuration, each cell of a square grid is either alive or dead (represented by 1 or 0, respectively). To determine the state of a given cell at the next step, Life considers the 3×3 neighborhood around the cell. At every step, cells with exactly two alive neighbors maintain their state, cells with exactly three alive neighbors become alive, and cells with any other number of alive neighbors die (Figure 1). We consider a variant of Life in which cells outside of the grid are always considered dead. Despite the simplicity of the update rule, Life can produce complex output over time, and thus serves as an idealized problem for modeling hidden-step behavior.
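For reference, the update rule (with the dead-boundary variant described above) can be sketched in a few lines of numpy; the helper name `life_step` is ours:

```python
import numpy as np

def life_step(board):
    """One step of Conway's Life on a 0/1 array; cells beyond the edge are dead."""
    padded = np.pad(board, 1)  # zero padding = dead border
    n, m = board.shape
    # count the alive neighbors of every cell
    neighbors = sum(
        padded[di:di + n, dj:dj + m]
        for di in range(3) for dj in range(3)
        if (di, dj) != (1, 1)
    )
    # alive next step: exactly 3 alive neighbors, or alive with exactly 2
    return ((neighbors == 3) | ((board == 1) & (neighbors == 2))).astype(int)
```

A horizontal blinker, for example, becomes a vertical one after a single call.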
2 Related Work
Prior research has examined whether neural networks can learn particular tasks. Joulin et al. joulin2015inferring argue that certain recurrent neural networks cannot learn addition in a way that generalizes to an arbitrary number of bits. Theoretical work has shown that sufficiently overparameterized neural networks converge to global minima 9081945; du2018gradient, and further theoretical work has found methods to eliminate local minima kawaguchi2019elimination; nguyen2017loss; kawaguchi2016deep. Nye et al. nye2018efficient show that minimal networks for the parity function and fast Fourier transform do not converge to a solution unless they are initialized close to one.
Increasing the depth and number of parameters of neural networks has been shown to increase the speed at which networks converge and their testing performance arora_optimization_2018; park_effect_2019. Frankle et al. frankle2018lottery find that increasing parameter count can increase the chance of convergence to a good solution, while Li et al. li_measuring_2018 and Neyshabur et al. neyshabur_towards_2018 find that training near-minimal networks leads to poor performance. Choromanska et al. choromanska_loss_2015 provide some theoretical insight into why small networks are more likely to find poor local minima.
Weight initialization has been shown to matter in training deep neural networks. Glorot et al. glorot_understanding_2010 find that initial weights should be normalized with respect to the size of each layer. Dauphin et al. dauphin_metainit_2019 find that tuning weight norms prior to training can increase training performance. Similarly, Mishkin et al. mishkin_all_2016 propose a method for finding a good weight initialization for learning. Zhou et al. zhou_deconstructing_2020 find that the sign of initial weights can determine if a particular subnetwork will converge to a good solution.
There is significant research into weight pruning and developing efficient networks lecun1990optimal; hassibi1993second; NIPS2015_5784; li2016pruning; hinton2015distilling; li_measuring_2018, including the lottery ticket hypothesis, which suggests that gradient descent allows lucky subnetworks to quickly converge to a solution frankle2018lottery.
Finally, there is interest in learning hidden step computational processes including algorithms and arithmetic kaiser2015neural; graves2014neural; joulin2015inferring; trask2018neural, fluid dynamics baboo2010efficient; maqsood2004ensemble; mohan2018deep, and weather simulation shrivastava2012application.
The studies above show that weight initialization, overparameterization, and training-dataset statistics can each determine whether a neural network converges to a good solution. In this paper, however, we evaluate these factors on the Game of Life, a non-trivial but simple toy problem for which an exact minimal solution is known. This allows us to derive insights relative to that minimal solution, which is not possible in vision or other similar domains.
3 Experiments and Results
We define the Life problem as a function-learning problem. In particular, if x is a grid of 0s and 1s, define Life(x) to be the next step of Life according to the previously described update rules. Then we define the Life problem as the problem of predicting Life(x) given x. Similarly, we define the n-step-Life problem as the problem of learning to predict the configuration n steps in the future given x. Since Life has a local update rule that considers a 3×3 neighborhood to determine the next state of the center cell, we can model Life with an entirely convolutional neural network, i.e., a neural network without any fully connected or pooling layers. A 3×3 convolutional layer with two filters that feeds into a 1×1 convolutional layer with one filter can solve the 1-step-Life problem, and does so minimally: any fewer layers or convolutional filters would yield an architecture that cannot implement 1-step-Life. The second convolutional layer feeds into a final 1×1 convolutional output layer with one filter and a sigmoid activation function. This forces all outputs to approximate either 0 or 1 but does not, on its own, perform meaningful computation; it is included only for this convenience. With appropriate weights, this three-layer convolutional neural network solves the 1-step-Life problem. We generalize this architecture to solve the n-step-Life problem by stacking n copies of the first two layers, as shown in Figure 2 (right).
We have hand-engineered weights for these architectures that implement the underlying rule and thus solve the n-step-Life problem with perfect accuracy. We conclude that this minimal architecture, a (2n+1)-layer convolutional neural network with 23n + 2 trainable parameters (counting biases), can solve the n-step-Life problem. In principle, a neural network with an identical architecture should be able to learn a similar solution.
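One possible hand construction (our own sketch; the exact values used in our experiments may differ) exploits the quantity t = s + c/2, where s is the alive-neighbor count and c the current cell state: t falls in {5/2, 3, 7/2} exactly when the cell is alive at the next step, and two ReLU features of t isolate that interval. In numpy:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv3x3(board, kernel, bias):
    """3x3 convolution with zero padding (cells outside the grid are dead)."""
    p = np.pad(board.astype(float), 1)
    n, m = board.shape
    out = np.full((n, m), bias, dtype=float)
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * p[di:di + n, dj:dj + m]
    return out

# Shared kernel: neighbors weighted 1, center weighted 1/2, so the
# pre-activation is t = (alive neighbors) + (center state)/2.
K = np.ones((3, 3))
K[1, 1] = 0.5

def minimal_life_net(board):
    h1 = relu(conv3x3(board, K, -2.0))   # > 0 iff t > 2
    h2 = relu(conv3x3(board, K, -3.5))   # > 0 iff t > 3.5
    z = relu(h1 - 4.0 * h2 - 0.25)       # 1x1 layer: > 0 iff t in {2.5, 3, 3.5}
    return sigmoid(100.0 * (z - 0.1))    # output layer: saturate to ~0 / ~1
```

The first two convolutions correspond to the repeated block of the architecture; the final scaled sigmoid is the decoding layer.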
3.1 Life Architecture
We construct a class of architectures to measure how effectively networks of varying sizes solve a hidden-step computational problem. In particular, we employ an architecture similar to the one described in the previous section: an entirely convolutional neural network with n copies of a 3×3 convolutional layer that feeds into a 1×1 convolutional layer, followed by a final 1×1 convolutional layer with a single filter and sigmoid activation to decode the output into a Life configuration. When the architecture has n copies of the described layers, we say that it is an n-step architecture. In the minimal solution, each repeated 3×3 convolutional layer has two filters and each repeated 1×1 convolutional layer has one filter. (A reviewer helpfully pointed out that an even smaller network can be constructed to solve Life, with a single convolution that counts neighbors, outputs a 0 if there are two neighbors, 1 if there are three neighbors, and −1 otherwise, and is then added to the input and fed through a Heaviside activation function. However, our model is minimal given the constraint that we are using a traditional feedforward network with ReLU activations.) When a similar architecture has 2m and m filters in the respective repeated layers, we say that it is m-times overcomplete with respect to the minimal architecture, and we refer to it as the n-step, m-times-overcomplete architecture.
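As an illustration, such an architecture might be constructed in Keras as follows (a sketch under our assumptions; the board size of 32 and the use of `padding="same"` zero padding, which matches the dead-boundary variant of Life, are our choices):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_life_net(n_steps, m, board_size=32):
    """n-step, m-times-overcomplete architecture: n blocks of a 3x3 conv
    with 2m filters feeding a 1x1 conv with m filters, then a 1x1 sigmoid
    output layer that decodes the board."""
    inp = layers.Input(shape=(board_size, board_size, 1))
    x = inp
    for _ in range(n_steps):
        x = layers.Conv2D(2 * m, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(m, 1, activation="relu")(x)
    out = layers.Conv2D(1, 1, activation="sigmoid")(x)
    model = tf.keras.Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```

With n_steps = 1 and m = 1 this yields the minimal three-layer network (25 trainable parameters, counting biases).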
Each architecture is implemented on top of TensorFlow tensorflow2015-whitepaper and trained using the Adam optimizer kingma2014adam with a binary cross-entropy loss on the output of the model. Each instance is trained with 1 million randomly generated training examples, separated into 100 epochs of 10,000 training examples each, with a batch size of 8. Each training and testing example is generated as follows: first, we uniformly draw a density p from [0, 1], and then generate a board in which each cell is alive with probability p. It is extremely unlikely that the network will ever see the same training example twice; thus, separating held-out data into a testing set and a validation set is unnecessary, since novel data can be generated on the fly. To improve computational efficiency, all networks with identical parameters are implemented so that they can be trained in parallel on the same randomly generated dataset.
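The sampling scheme can be sketched as follows (our illustration; the board size of 32 is our assumption, and targets are produced by applying the update rule n times to each sampled board):

```python
import numpy as np

def sample_board(rng, size=32):
    """Draw a density p ~ Uniform(0, 1), then an iid board whose cells
    are alive with probability p (the uniform-density dataset)."""
    p = rng.uniform(0.0, 1.0)
    return (rng.random((size, size)) < p).astype(np.int8)
```

Because every batch is freshly sampled, the training stream never repeats.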
3.2 The Difficulty of Life
To quantify the effectiveness of a given neural network architecture, we measure the probability that a random initialization of the network converges to a solution after being shown one million training examples. Because each step of computation in these architectures can only implement a 3×3 local update rule, an architecture must learn the underlying rule in order to solve the n-step-Life problem. Thus, we consider an instance to be successful when it learns the correct underlying rule, and can therefore predict the configuration n steps in the future with perfect accuracy for all initial states. Any instance that does not reach perfect accuracy did not learn the underlying rule and is considered unsuccessful. We wish to determine the probability that a randomly initialized instance of each architecture is successful.
To accomplish this, we train 64 instances of the n-step, m-times-overcomplete architecture for a range of values of n and m, omitting certain combinations due to computational limitations. In Figure 3 we plot the percentage of instances that successfully learn the n-step-Life problem.
We observe that among the minimal (m = 1) architectures, only instances trained on the 1-step-Life problem converged on a solution, with a success rate of approximately 4.7%. Instances of architectures for the one- and two-step-Life problems had a greater than 50% chance of converging to a solution only once sufficiently overcomplete. Architectures for the n-step-Life problem with larger n require an overcompleteness greater than 24, the highest degree we tested due to computational constraints. This explosive growth suggests that the degree of overcompleteness required for consistent convergence grows quickly with n.
Strikingly, for the largest values of n we test, we do not observe the hypothesized scaling behavior. Rather, at high overcompleteness, some deeper architectures perform as well as or better than shallower ones, even though we would expect the required overcompleteness to increase strictly with depth, and even though all of them require many more parameters than the minimal architecture to consistently converge. We have multiple hypotheses: first, this result may be due to noise or dataset artifacts; second, our parameterization of Life may produce consistent behavior across these values of n, making the difficulty of learning any number of steps similar.
We plot the loss of instances of representative architectures in Figure 4 to illustrate typical rates at which the networks converge to a solution. In addition, we compute the average earliest point of convergence for converged networks (Figure 4). The earliest point of convergence is the earliest epoch in which the loss of a convergent network falls below 0.01, indicating that the network has learned a solution to the n-step-Life problem. We exclude non-converged networks from this metric.
3.3 Weight Perturbations and Learning
To assess the robustness of weight initializations and of learned solutions, we perturb successful weight initializations and converged solutions of the minimal 1-step-Life architecture. In particular, we perform two kinds of perturbation: the k-sign perturbation and the uniform perturbation. The k-sign perturbation selects k weights uniformly at random and replaces each chosen weight with a weight of the same magnitude but of opposite sign. The uniform perturbation adds to each weight a value selected uniformly from [−δ, δ] for a given perturbation magnitude δ. We choose a weight initialization of a network that converges to a solution to the 1-step-Life problem, and we initialize and train instances of the minimal architecture with these weights perturbed by k-sign perturbations for a range of k and by uniform perturbations for a range of δ. Similarly, we initialize and train instances of the minimal architecture with the described k-sign and uniform perturbations of the converged solution of this network. For each perturbation type, we train 128 instances.
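The two perturbations can be sketched as follows (our illustration, operating directly on a weight array):

```python
import numpy as np

def sign_perturb(weights, k, rng):
    """k-sign perturbation: flip the sign of k randomly chosen weights,
    preserving their magnitudes."""
    out = weights.copy()
    flat = out.ravel()
    idx = rng.choice(flat.size, size=k, replace=False)
    flat[idx] *= -1  # modifies `out` through the view
    return out

def uniform_perturb(weights, delta, rng):
    """Uniform perturbation: add independent noise from [-delta, delta]
    to every weight."""
    return weights + rng.uniform(-delta, delta, size=weights.shape)
```

Note that a k-sign perturbation leaves every weight magnitude unchanged, which is why the sensitivity reported below is striking.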
We plot the fraction of successful networks for each perturbation type in Figure 5. Notably, a single-sign perturbation of the original initial weights of the successful network causes the network to fail to converge approximately 20% of the time, and only 4–6 sign perturbations are required to drop the success rate below 50%. This suggests that for minimal networks, the weight initialization is highly sensitive to perturbation. This sensitivity is not unique to sign perturbations: even a relatively small uniform perturbation of magnitude 0.25 (where each weight changes by 0.125 in expectation) causes the tested networks to fail to learn approximately 36% of the time. Finally, we observe that even a single-sign perturbation of an already converged solution causes approximately 90% of models to fail to learn, suggesting that converged solutions are very sensitive to sign perturbations. Furthermore, since typical weights in the converged solution are small in magnitude, sign perturbations do not represent large-magnitude changes.
3.4 An Optimal Training Dataset
Many deep learning systems are restricted by the dataset available for training. We examine how a class of training datasets affects the success rate of near-minimal networks learning the n-step-Life problem. In particular, we construct a class of training datasets, the p-density datasets, in which each cell is independently chosen to be alive with probability p. Note that the dataset described in Section 3.1 is a generalization of this class in which p is itself chosen uniformly; we call it the uniform-density dataset. We show examples of these datasets in Figure 6.
We train 128 instances of near-minimal architectures on p-density datasets over a range of values of p. Surprisingly, we find a sharp spike in the probability of success for values of p between approximately 0.3 and 0.4. At the best-performing density, we observe a 14% success rate for the minimal model, a strikingly high rate, roughly two to three times the success rates observed at neighboring densities (Figure 6). The same result, though less pronounced, appears for larger architectures (Figure 6). The narrow range in which performance increases significantly suggests that there is a critical density p* such that the p*-density dataset is, in this sense, optimal. We hypothesize that p* coincides with the peak of the curve shown in Figure 6, which plots, as a function of p, the probability that a cell is alive after one step of Life given that the initial configuration is drawn from a p-density dataset.

We predict that an optimal dataset must satisfy a condition in which the probability of observing each possible local configuration of Life reaches an equilibrium that allows the average gradient of the loss with respect to the weights, computed over training examples, to direct the weights quickly to a solution. For example, for very small p, we expect most cells to be dead after one step of Life given an initial state sampled from the p-density dataset; in this case, the average gradient will tend to drive the network towards a solution that predicts most cells to be dead. The density that maximizes the probability that a cell is alive after one step of Life also maximizes the occurrence of cells with exactly three alive neighbors and of alive cells with exactly two alive neighbors, any instance of which increases the number of cells alive at the next step. We hypothesize that frequent observation of these configurations is critical for a near-minimal network to solve the n-step-Life problem.
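This hypothesis can be checked with a back-of-the-envelope calculation (our own sketch, treating each cell and its eight neighbors as independent and ignoring boundary effects): a cell is alive after one step if exactly three of its eight neighbors are alive, or if it is alive with exactly two alive neighbors, giving the polynomial below, which peaks near p ≈ 0.36.

```python
import numpy as np

def p_alive_next(p):
    """P(cell alive after one step) on an iid density-p board, ignoring edges:
    C(8,3) p^3 (1-p)^5  +  p * C(8,2) p^2 (1-p)^6."""
    return 56 * p**3 * (1 - p)**5 + 28 * p**3 * (1 - p)**6

ps = np.linspace(0.0, 1.0, 100001)
p_star = ps[np.argmax(p_alive_next(ps))]  # peak density, roughly 0.36
```

That this peak falls inside the observed 0.3–0.4 spike is consistent with the hypothesis, though the true optimum will shift slightly on finite boards with dead boundaries.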
4 Discussion

The lottery ticket hypothesis frankle2018lottery proposes that when training a convolutional neural network, small lucky subnetworks quickly converge on a solution. This suggests that rather than searching extensively through weight-space for an optimal solution, gradient-descent optimization may rely on lucky initializations of weights that happen to position a subnetwork close to a reasonable local minimum to which the network converges. This would make convolutional neural networks, especially those with near-minimal architectures, extremely sensitive to weight initializations and to other parameters that affect the search space of the network, such as the distribution of the dataset.
To determine the significance of weight initializations in neural networks, we examined how the initial weight configurations of small convolutional neural networks trained to solve the n-step-Life problem affect their ability to converge on a correct solution. We find that although the n-step-Life problem can be implemented minimally by a small convolutional architecture, when networks of this architecture are trained with a standard state-of-the-art gradient-based optimizer, Adam kingma2014adam, they rarely learn a successful solution. To determine how the number of parameters a network requires to consistently learn a solution to n-step-Life scales, we test the probability with which neural networks of varying degrees of overcompleteness learn a solution. Our results suggest that the required degree of overcompleteness is large, a characteristic predicted by the lottery ticket hypothesis.
While Conway’s Game of Life itself is a toy problem with few direct applications, the results we report here have implications for similar tasks in which a neural network is trained to predict an outcome that requires the network to follow a set of local rules with multiple hidden steps. Examples of such problems include, but are not limited to, machine-learning-based logic or math solvers, weather and fluid-dynamics simulations, and logical deduction in language or image processing. In these instances, without enormously overcomplete networks, gradient-descent-based optimization methods may not suffice to learn solutions. Such a result may also generalize to problems that do not explicitly involve local hidden-step processes, such as classification of images and audio, and virtually every other application of machine learning. In addition, significant effort has gone into developing faster and smaller networks with performance similar to their larger counterparts. Our results suggest that these smaller networks may necessarily require alternative training methods, or methods to identify optimal weight initializations.
Additionally, we measure the robustness of the initial and converged weights to k-sign and uniform perturbations. In accordance with our prediction based on the lottery ticket hypothesis, we find that the weights of minimal architectures are highly sensitive to even tiny perturbations.
Finally, we explore the role of the dataset in learning a solution to n-step-Life for near-minimal architectures. We find a sharp increase in success rate for datasets in which cells are alive with a probability in a narrow band, which appears to coincide with the density that maximizes the probability that any given cell is alive after one step of Life, given an initial configuration drawn from the p-density dataset. We hypothesize that the narrow range in which the probability of success increases drastically results from the precise conditions that maximize the probability of certain local configurations of Life that are critical for learning, such as cells with exactly two or three alive neighbors.
We suspect that the narrowness of the range of densities that improves learning is specific to Conway’s Game of Life and the particular architectures we train. However, other neural networks, especially small ones, may suffer from similar problems. Even datasets that seem intuitively reasonable may contain glaring biases that prevent neural networks from learning the underlying rules, and in many instances the dataset parameters may need to be tuned nearly perfectly to observe increases in performance.
In conclusion, we find that networks trained to predict the configuration of Life after n steps, given an arbitrary initial configuration, require a degree of overcompleteness that scales quickly with n in order to consistently learn the rules of Life. Similarly, we show that weight initializations and converged solutions are extremely sensitive to small perturbations. Finally, we find that a significant increase in success probability depends on very strict conditions on the dataset distribution. These observations are consistent with the predictions of the lottery ticket hypothesis and have important consequences for the field.
5 Broader Impact
This paper provides insight into the lottery ticket hypothesis and why neural networks may fail to learn particular tasks. However, the paper can similarly be interpreted to provide prescriptive claims about how to train neural networks. In particular, gradient-based optimization methods require networks to have large degrees of overcompleteness; thus to learn complex problems, neural networks should increase in size. Additionally, neural networks may be highly sensitive to the dataset distribution; thus datasets should be constructed and selected based on parameters that optimize this distribution for learning.
These prescriptive claims, however, may have adverse implications in the world. For example, increasing the number of weights of a network is not free, both financially and in terms of energy consumption and carbon emission. Our result may incentivize greater carbon emission which can contribute negatively to our global environment and increase the rate of climate change. Furthermore, we suggest developing finely tuned datasets that are optimal for learning. This may incentivize organizations to collect inappropriate amounts of invasive data on individuals to facilitate machine learning.
Instead, we hope that this paper will promote research into the limitations of neural networks so that we can better understand the flaws that necessitate overcomplete networks for learning. We hope that our result will drive development into better learning algorithms that do not face the drawbacks of gradient-based learning.
Appendix A
A.1 Weights for the minimal architecture
We describe weights that solve Life for the minimal architecture described in Section 3.1.
The first layer has two 3×3 convolutional filters, with weights W1 and W2 and biases b1 and b2, respectively. Each output is fed through a ReLU function.
The second layer has a single 1×1 filter, with bias, that combines the outputs of the two filters from the previous layer: its first component weights the output of the first first-layer filter and its second component the output of the second. Each output is fed through a ReLU function.
We add one more layer as a convenience for learning: a single 1×1 filter that scales its input by an arbitrarily large constant N. The output is then fed through a sigmoid function, which saturates every cell to approximately 0 or 1.
Thus, given a Life input x, the entire architecture computes the next step of Life by composing the three layers above.
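For concreteness, one consistent assignment of these weights (our own reconstruction under the structure above, not necessarily the values originally reported) is the following. Let s denote a cell's number of alive neighbors and c its current state, so that the shared 3×3 kernel below computes t = s + c/2:

```latex
\begin{aligned}
W_1 = W_2 &= \begin{pmatrix} 1 & 1 & 1 \\ 1 & \tfrac{1}{2} & 1 \\ 1 & 1 & 1 \end{pmatrix},
\qquad b_1 = -2, \qquad b_2 = -\tfrac{7}{2}, \\
h_1 &= \mathrm{ReLU}(t - 2), \qquad h_2 = \mathrm{ReLU}\!\left(t - \tfrac{7}{2}\right), \\
z &= \mathrm{ReLU}\!\left(h_1 - 4 h_2 - \tfrac{1}{4}\right), \qquad
\hat{y} = \sigma\!\left(N\left(z - \tfrac{1}{10}\right)\right).
\end{aligned}
```

Here z is positive exactly when t ∈ {5/2, 3, 7/2}, i.e., exactly when the cell is alive at the next step, so for large N (e.g., N = 100) the sigmoid output saturates to the correct 0/1 board.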