Can we produce Deep Neural Networks (DNNs) that perform well without applying arithmetic operations to their weights? Conventional algorithms for training DNNs, such as Stochastic Gradient Descent (SGD), aim to find appropriate numerical values for a set of predetermined parameter vectors $\theta$. These algorithms apply changes $\Delta\theta$ to $\theta$ at each iteration. Denoting the parameter vectors at the $t$th iteration as $\theta_t$, they use the following update rule: $\theta_t = \theta_{t-1} + \Delta\theta_t$. In contrast, we develop a method to train DNNs using the following update rule: $\theta_t = \sigma_t(\theta_{t-1})$ for a permutation operation $\sigma_t$. In essence, we consider weighted connections as a limited resource which we cannot create or modify but only reallocate: we train DNNs by “reconnecting” their neurons. This approach not only eliminates the need to use an ad hoc weight regularization mechanism, e.g., weight decay (Krogh & Hertz, 1992), but also allows us to efficiently search for a good neuron connectivity in sparsely connected DNNs.
Related work has tackled the question posed at the beginning of this section using mainly two types of approaches: architecture-search-based weight deemphasizing and weight pruning. Weight Agnostic Neural Networks (WANNs) (Gaier & Ha, 2019) deemphasize weights by creating DNN architectures with strong inductive biases, which succeeded in performing various tasks without any weight training. The second type of approach (Zhou et al., 2019; Ramanujan et al., 2019), inspired by the Lottery Ticket Hypothesis (Frankle & Carbin, 2019), “trains” randomly weighted DNNs by pruning, i.e., it finds supermasks within standard DNN architectures. Essentially, the weight deemphasizing approach learns new DNN architectures by adding more connections, whereas the weight pruning approach does so by removing connections. In this paper, we propose a new type of approach which rewires the neuron connections. We show that our method can reach efficiency and performance similar to that of conventional training algorithms.
Before explaining our method in detail, we first define several simple concepts which are the keys to understanding it. We then describe an interesting phenomenon in how weights are distributed in trained DNNs which inspired and motivated the creation of our method.
1.1 Neuron Connections, and Weight Matrices
DNNs perform chains of mathematical transformations from their inputs to their outputs. At the core of these transformations are feature extractions, where separate features are weighted, combined, and transformed into composite features. Artificial neurons, the elementary vector-to-scalar functions in DNNs, are where feature extractions take place. The incoming weighted connections of a neuron can be represented using a one-dimensional vector (flattened if it has higher dimension, e.g., a convolutional kernel), called the weight vector. All the neuron connections between two DNN layers can be represented by a weight matrix, where each column is a weight vector.
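As a small illustration of this representation, the sketch below flattens a convolutional kernel into a weight matrix whose columns are weight vectors. The helper name `to_weight_matrix` and the storage layout (output units on the last axis) are assumptions made for the example, not something the text specifies.

```python
import numpy as np

def to_weight_matrix(kernel: np.ndarray) -> np.ndarray:
    """Flatten a layer's weights so that each column is one neuron's weight vector.

    Assumption: the last axis indexes the output neurons, e.g. a conv kernel
    stored as (height, width, in_channels, out_channels).
    """
    out_units = kernel.shape[-1]
    return kernel.reshape(-1, out_units)

rng = np.random.default_rng(0)
conv_kernel = rng.normal(size=(3, 3, 16, 32))  # hypothetical 3x3 conv layer
W = to_weight_matrix(conv_kernel)
print(W.shape)  # 3 * 3 * 16 = 144 incoming weights for each of the 32 neurons
```

With this layout, column `j` of `W` holds exactly the flattened kernel of output neuron `j`.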
1.2 Similarity of Weight Profiles
To understand the contents of a weight matrix, we visualize it by drawing one scatter plot for each column and stacking the scatter plots of all columns on the same figure. Concretely, Figure 1(a) shows such a visualization of a weight matrix () in VGG16 (Simonyan & Zisserman, 2015) trained on ImageNet (Deng et al., 2009). It appears roughly zero-mean but is otherwise a jumble of random-looking points.
Nothing is particularly noticeable until we sort this weight matrix column-wise to obtain Figure 1(b). The pattern shown in this figure implies that all 4096 weight vectors have almost identical shapes, centers, and ranges, as the scatter plots closely overlap with each other. In the forthcoming discussion, we use the term weight profile to describe a sorted weight vector.
Although the shapes may vary, similar patterns are observed not only in every layer of VGG16 but also in many other trained DNN architectures, such as ResNet50 (He et al., 2016) (Figure 2) trained on ImageNet. We call this phenomenon, where the weight profiles follow such a pattern, the similarity of weight profiles. It reveals the beauty and simplicity in how DNNs extract and store information.
Similarity is where two or more objects lack differentiating features and information. Similarity in the shapes of profiles (sorted weight vectors) implies that much information is lost after sorting. In other words, that information was mostly stored in the order of the weights before sorting.
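The notion of a weight profile can be made concrete with a short sketch. The matrix below is a random stand-in (not trained VGG16 weights), and the spread measure is an illustrative choice rather than a metric used in the paper; independent random columns already produce similar profiles, so the example shows only the mechanics.

```python
import numpy as np

def weight_profiles(W: np.ndarray) -> np.ndarray:
    """Sort every column (weight vector) independently, in ascending order."""
    return np.sort(W, axis=0)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.05, size=(1024, 32))  # stand-in for a weight matrix

P = weight_profiles(W)

# Mean absolute deviation of each column from the column-average, row by row.
# Before sorting the columns look unrelated; after sorting they nearly overlap.
raw_spread = np.abs(W - W.mean(axis=1, keepdims=True)).mean()
profile_spread = np.abs(P - P.mean(axis=1, keepdims=True)).mean()
print(raw_spread, profile_spread)
```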
1.3 Main Contribution
One primary question arose that will be addressed in this paper: given the weight profiles, can we reorder the weights to reproduce a trained network? More generally,
Given only statistical properties of randomly weighted connections, can we permute them so as to train a DNN?
The remainder of this paper answers this question in the affirmative.
We present a new framework of algorithms to train DNNs using permutation, named Permute to Train (P2T). We introduce two implementations of P2T, namely, Stochastic Gradient Permutation (SGPerm) and Lookahead Permutation (LAPerm). SGPerm, which utilizes gradient-based information to explicitly calculate the valid permutations, has a high degree of freedom in finding and choosing the permutations. On the other hand, LAPerm, which depends largely on another optimizer to derive the permutation, is computationally efficient and straightforward to tune. Both methods succeeded in training randomly initialized DNNs. In particular, in Section 5.2.2, we show that convolutional neural networks can be trained up to over 90% validation accuracy on CIFAR-10 using P2T.
2 Stochastic Gradient Permutation: Training DNNs Like Solving Picture Puzzles
In this section, we discuss how to train DNNs like solving picture puzzles. A picture puzzle game involves a picture frame and a set of picture fragments (the puzzle pieces) given in random order. The picture frame has enough slots to hold all the picture fragments. The goal of the game is to permute this set of puzzle pieces within the picture frame to obtain a complete picture. As demonstrated in Figure 3, we first propose a four-step routine to solve such puzzles using human intuition.
Step 1: Provide recommendation: For each slot on the picture frame, we provide intuitive recommendations for moving puzzle pieces over. In other words, for each slot, we list a set of puzzle piece candidates which we would like to move to this slot.
Step 2: Graph building: We connect these intuitions to form a directed graph which contains all recommended movements as shown in Figure 3(2).
Step 3: Cycle finding: We find appropriate cycles in the directed graph built in Step 2.
Step 4: Permutation: We perform permutation following the cycles.
If we are satisfied with the resulting picture from Step 4, we return it; otherwise, we repeat this routine from Step 1, continuing from the resulting picture.
SGPerm, as shown in Algorithm 1, uses the aforementioned simple four-step routine to train DNNs, except that the intuitive recommendations (Step 1) are computed using gradient-based information, and the puzzle pieces are the weighted connections associated with each neuron. Nevertheless, the “puzzle of a DNN” is often high dimensional and has no straightforward answer. In order to efficiently train DNNs, every step of this routine needs to be elaborated.
2.1 Permissibility Graph
Two major challenges in applying the aforementioned four-step routine to training DNNs are: (1) Given the graph in Step 2, the number of possible combinations of the cycles found in Step 3 grows exponentially w.r.t. the size of the graph. For example, the number of cycles that can be found in a complete graph of size is , which grows faster than . Therefore, it is infeasible to first enumerate all possible cycles and then decide which ones to use for the permutation. (2) It is difficult to know whether a permutation will result in effective learning of the DNN.
To make the cycle finding (Step 3) efficient and effective, in this section we focus on carefully building the graph of recommended movements (Step 2) such that it is small (fewer vertices), sparse (fewer edges), and contains cycles that trigger effective learning. We control three important ingredients of the graph building (Step 2): the recommendations, the aggressiveness, and the graph partitioning.
Moreover, we use the term weighted connection interchangeably with weight. Additionally, we use the term parameter to describe a slot that holds a weight, and we allow permutation of weights to happen only among parameters associated with the same neuron.
2.1.1 Permissible movements
For each parameter , we define a set of permissible movements, where each permissible movement implies an assignment of the weight from parameter to parameter . From the sets of permissible movements of all parameters associated with a neuron, we derive the permissibility graph , where the parameters and the permissible movements are represented using vertices and directed edges , respectively. A weight permutation must consist of at least two weight movements; however, not all movements are permissible.
A permissible movement has one of exactly two effects, namely, to increase or to decrease the weight that the parameter holds. The desired effect of a movement, to either increase or decrease the weight value for a parameter, is determined by a real number which we refer to as the recommendation of that parameter. The loss of a DNN is reduced by when a parameter’s weight changes toward the opposite direction of by a step of size , where is negatively correlated with . In the implementation of SGPerm discussed in this paper, we use momentum accelerated gradients (Polyak, 1964) as . Other choices, such as plain gradients and the weight differences learned by another optimizer (Nichol et al., 2018; Zhang et al., 2019), were also tested and found to be valid, but were less effective because they were difficult to tune. More importantly, instead of using the exact direction pointed to by the momentum vector, we permit movements as long as they are within the same orthant as this direction. In other words, needs to be satisfied for permissible movements , where parameter owns weight and recommendation and parameter owns weight .
After deciding the effect of permissible movements for parameters in every weight vector , we restrict these movements from incurring updates that are too large or too small by setting a lower bound and an upper bound for each movement, such that . The farther the distance between and , the more aggressive movements are allowed.
We define the value range of the weight vector to be the absolute difference between its largest and its smallest weight values. To control the aforementioned aggressiveness of permissible movements happening within , we introduce a hyper-parameter aggressiveness , and let . Setting not only prevents movements that incur excessively large updates from being permitted, but also makes the permissibility graph sparser.
Since only defines the distance between and but not their actual values, we discuss how to determine in Section 2.1.4, so that we can directly derive .
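The conditions of this subsection can be sketched in a few lines of NumPy. This is one illustrative reading, not the authors' implementation: the function name `permissible_movements`, the arguments `d_min` and `d_max` (standing in for the lower and upper bounds), and the exact form of the sign test are assumptions. The sign test reflects that moving weight `w[j]` into slot `i` changes that slot by `w[j] - w[i]`, and that this change should oppose the recommendation `r[i]`.

```python
import numpy as np

def permissible_movements(w, r, d_min, d_max):
    """Return adj, where adj[i] lists every j such that moving w[j] into slot i
    is permissible under the assumed conditions."""
    w = np.asarray(w, dtype=float)
    r = np.asarray(r, dtype=float)
    diff = w[None, :] - w[:, None]             # diff[i, j] = w[j] - w[i]
    opposes = diff * r[:, None] < 0            # change opposes the recommendation
    dist = np.abs(diff)
    bounded = (dist >= d_min) & (dist <= d_max)  # neither too small nor too large
    ok = opposes & bounded
    return [np.flatnonzero(ok[i]).tolist() for i in range(len(w))]

w = [0.1, -0.2, 0.3, 0.0]     # toy weight vector
r = [1.0, -1.0, 1.0, -0.5]    # toy recommendations (e.g. momentum values)
adj = permissible_movements(w, r, d_min=0.05, d_max=0.35)
print(adj)  # [[1, 3], [0, 3], [0, 3], [0, 2]]
```

Slot 0 has a positive recommendation, so it only accepts smaller weights within the allowed distance; slot 1 has a negative one and only accepts larger weights, and so on.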
2.1.4 Graph partitioning
Note that the number of vertices in a permissibility graph can be large. For example, the largest weight vector in ResNet101 has approximately 4600 weights. Even though the sparsity of a permissibility graph can be adjusted using the aggressiveness (Section 2.1.3), a large number of vertices can make cycle finding a performance bottleneck for SGPerm. To address this issue, we introduce random partitioning. Aside from its most obvious effect of making the permissibility graph smaller, it has another effect: it allows us to implicitly control the value of so that excessively close movements, which would cause updates that are too small and thus inefficient, are avoided.
First, to control the size of partitioning, we introduce a hyper-parameter partition ratio . At each mini-batch, before we build the permissibility graphs , for each weight vector , we randomly partition into disjoint subsets , , for , each of size approximately equal to . We then build a permissibility subgraph for each disjoint subset of vertices, such that there exist no edges between different subgraphs.
Consequently, since random partitioning permits movements only within each disjoint subset, it increases the average weight difference between reachable vertices in . We thus have , which is approximately on average, depending on the partition ratio and the range . For example, within the partitioned subset marked by the red boxes on Figure 4 (left) and the red dots on Figure 4 (right), we observe that the average weight distance is increased.
Finally, from a long-term perspective of training, repeated random partitioning at each mini-batch helps weights to reach any parameter. Otherwise, weights are likely to be trapped in regions where only weights with similar values are present, e.g., the dark upper region on Figure 4 (left). In essence, random partitioning can be understood as a technique to improve fluidity of weights.
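Random partitioning and its effect on weight distances can be illustrated as follows. The sketch assumes that each subset holds roughly `ratio * n` parameters (the exact size formula is not recoverable from the text), and the helper names `random_partition` and `mean_neighbor_gap` are hypothetical.

```python
import numpy as np

def random_partition(n, ratio, rng):
    """Split indices 0..n-1 into disjoint random subsets.

    Assumption: each subset holds about ratio * n indices.
    """
    size = max(2, int(round(ratio * n)))
    n_parts = max(1, n // size)
    return np.array_split(rng.permutation(n), n_parts)

def mean_neighbor_gap(values):
    """Average distance between neighbouring values after sorting."""
    return float(np.mean(np.diff(np.sort(values))))

rng = np.random.default_rng(0)
w = rng.uniform(-1.0, 1.0, size=1000)   # toy weight vector
parts = random_partition(len(w), ratio=0.1, rng=rng)

print(len(parts), mean_neighbor_gap(w), mean_neighbor_gap(w[parts[0]]))
```

Within a subset, the closest available weights are farther apart than in the full vector, which is the sense in which partitioning filters out excessively small movements. Calling `random_partition` again at the next mini-batch yields a different split.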
2.2 Finding the Permutations
Given the permissibility subgraphs built following the steps described in Section 2.1, finding the permutation that efficiently reduces the loss is equivalent to finding a proper set of disjoint cycles in these subgraphs.
We use a heuristic approach: we use depth-first search (DFS), starting from the parameter (vertex) with the largest absolute recommendation, following the directed edges in order, with backtracking. During the search, as soon as a cycle is found, we memorize the vertices included in this cycle so that they will not be traversed again. Otherwise, if no cycle is detected after exhausting all possible paths from the vertex where we began the search, we memorize only this vertex, since we know it will not be included in any cycle in future traversals. After updating the memorized set, we start again from the parameter with the largest absolute recommendation that has not yet been memorized. We repeat this process until all vertices are memorized. This DFS-based cycle finding algorithm is feasible even for weight vectors with many elements, thanks to the high sparsity and small size of the permissibility subgraphs built using the techniques introduced in Section 2.1.
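A minimal sketch of such a greedy search is given below. It is an interpretation of the description above, not Algorithm 1 itself: it looks only for cycles that pass through the current start vertex, marks vertices visited within one search so that each search is linear in the subgraph size, and assumes that `adj[i]` lists the slots `j` whose weight may move into slot `i`. The function names are hypothetical.

```python
def cycle_through(start, adj, done):
    """Iterative DFS for a cycle start -> ... -> start that avoids `done` vertices."""
    path, iters, seen = [start], [iter(adj[start])], {start}
    while iters:
        for nxt in iters[-1]:
            if nxt == start and len(path) > 1:
                return list(path)            # closed a cycle back to the start
            if nxt in done or nxt in seen:
                continue
            seen.add(nxt)
            path.append(nxt)
            iters.append(iter(adj[nxt]))
            break
        else:                                # no edge left here: backtrack
            path.pop()
            iters.pop()
    return None

def find_cycles(adj, priority):
    """Visit start vertices in `priority` order and collect vertex-disjoint cycles."""
    done, cycles = set(), []
    for start in priority:
        if start in done:
            continue
        cycle = cycle_through(start, adj, done)
        if cycle is None:
            done.add(start)                  # memorize only the start vertex
        else:
            done.update(cycle)               # memorize the whole cycle
            cycles.append(cycle)
    return cycles

def apply_cycles(w, cycles):
    """Slot cyc[k] receives the weight currently held by slot cyc[k + 1]."""
    new_w = list(w)
    for cyc in cycles:
        for k, slot in enumerate(cyc):
            new_w[slot] = w[cyc[(k + 1) % len(cyc)]]
    return new_w

w = [0.1, -0.2, 0.3, 0.0]                    # toy weights
adj = [[1, 3], [0, 3], [0, 3], [0, 2]]       # toy permissibility subgraph
priority = [0, 1, 2, 3]                      # by decreasing |recommendation|
cycles = find_cycles(adj, priority)
print(cycles, apply_cycles(w, cycles))
```

Because every edge of a cycle is a permissible movement, applying the cycles permutes the weights without creating or changing any value.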
2.2.1 Priority of movements
We are left with the task of deciding the priority of the DFS traversal. For every parameter in the permissibility subgraph , we obtain a set of permissible movements , for distinct parameters . Based on these permissible movements, we define the adjacency list of to be .
Starting from , our DFS-based cycle finding algorithm (Section 2.2) proceeds from the foremost unvisited vertex in the adjacency list of . Since this algorithm stops as soon as a cycle is found, the order of the elements in the adjacency list of each parameter determines the priority of traversal and, thus, the cycles that our cycle finding algorithm finds. Among several options, we choose the one with the best empirical performance: we prioritize the weights that are closer, i.e., for , . Other options include prioritizing the farther weights or randomly selected weights. Future work may devise rules that set the priority of movements based on other criteria.
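The closest-first priority amounts to sorting each adjacency list by weight distance. The following sketch shows this on a toy subgraph; the helper name `order_by_closeness` and the data are illustrative assumptions.

```python
def order_by_closeness(adj, w):
    """Sort each adjacency list so that candidates with the closest weights come first."""
    return [
        sorted(nbrs, key=lambda j, i=i: abs(w[j] - w[i]))
        for i, nbrs in enumerate(adj)
    ]

w = [0.1, -0.2, 0.3, 0.0]                  # toy weights
adj = [[1, 3], [0, 3], [0, 3], [0, 2]]     # toy permissible movements
ordered = order_by_closeness(adj, w)
print(ordered)  # [[3, 1], [3, 0], [0, 3], [0, 2]]
```

Prioritizing the farther weights would correspond to passing `reverse=True` to `sorted`.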
An example of the cycles found in a permissibility subgraph built from a subset of size 37 is shown in Figure 5.
For weight vectors , we find empirically that it helps convergence to reduce at each batch iteration, and optionally to reduce or keep it constant. Therefore, as shown in Algorithm 1, we apply an exponential decay term and an exponential growth term to the aggressiveness and partition ratio , respectively. Additionally, we use to prevent from becoming so small that the permissibility graphs become overly sparse, and use to keep the partition size from growing so large that the cycles cannot be found efficiently. Finally, the weight profiles are also hyper-parameters, which will be discussed in Section 4.
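One way to realize such a schedule is sketched below. All names and numeric values (`decay`, `growth`, `a_min`, `rho_max`, and the defaults) are placeholders chosen for illustration, not the settings of Algorithm 1.

```python
def schedule(step, a0=0.1, rho0=0.05, decay=0.99, growth=1.01,
             a_min=0.01, rho_max=0.1):
    """Return (aggressiveness, partition ratio) at a given batch iteration.

    The aggressiveness decays exponentially but never falls below a_min;
    the partition ratio grows exponentially but never exceeds rho_max.
    """
    aggressiveness = max(a_min, a0 * decay ** step)
    partition_ratio = min(rho_max, rho0 * growth ** step)
    return aggressiveness, partition_ratio

for step in (0, 50, 1000):
    print(step, schedule(step))
```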
2.4 Computational Complexity
SGPerm introduces a constant computational overhead, mainly due to the permissibility graph building (Section 2.1) and cycle finding (Section 2.2). Graph building involves allocating a 2D array (vertex adjacency list) for each partitioned subset. There are roughly vertices in the th partition of the th weight vector, and each vertex is associated with an adjacency list of average length . Therefore, building the permissibility graphs for all weight vectors has time complexity .
Cycle finding, in the worst case, involves traversing all edges for each vertex. The computational complexity of depth-first search is $O(|V| + |E|)$. Therefore, based on the analysis of graph building, it can be shown that finding cycles in all the permissibility subgraphs has time complexity .
Note that SGPerm is parallelizable, as each permissibility subgraph is built independently and cycles are searched separately in each of these subgraphs. Furthermore, and are usually set to less than 0.1, which improves the speed of this algorithm in practice.
We empirically validate SGPerm in Section 5.1.
3 Lookahead Permutation
With the basic ideas and intuitions explained in Sections 1 and 2, we introduce another P2T algorithm, inspired by the Reptile optimizer (Nichol et al., 2018) and the Lookahead optimizer (Zhang et al., 2019), named Lookahead Permutation (LAPerm). Pseudocode is shown in Algorithm 2. This method is computationally efficient and easy to tune.
Similar to the Lookahead optimizer, LAPerm uses an inner loop that runs another optimizer for steps at each iteration, but it differs substantially in how the initialization and synchronization are done.
For the initialization, how to obtain will be discussed in Section 4. Additionally, before training, we create a copy of and sort all the weight vectors in this copy to obtain , which serves as a preparation for the synchronization step during training. We will return to synchronization in greater detail in Section 3.1.
At each iteration, the lookahead and synchronization are performed in the following order: at the beginning of the th iteration, the inner optimizer looks ahead, starting from , for random mini-batches and arrives at the destination . We then synchronize w.r.t. to obtain using permutation.
Note in Algorithm 2, no copy is necessary for , since is a different symbol to represent weights in the inner loop.
For every weight vector in , there is a corresponding weight vector in of the same size. Synchronizing w.r.t. is to permute each weight vector in according to its counterpart in such that and have the same ranking for every .
We define the ranking of the weight vector as a vector of distinct consecutive integers from 0 to such that if and only if for all integers . Here denotes the th element of the vector . Since the weight vectors are already sorted in , synchronization is performed by directly indexing using the ranking of for every .
In essence, the vector obtained at the end of the inner loop is used as a reference for correcting the order of the weights in the weight vectors of . Only the relative rankings of the weights are of interest, not how much their values have changed. To visualize the permutations performed by LAPerm, we directly compare how the order of the weights in each weight vector changes between two consecutive iterations and deduce the permutations. An example of such permutations is shown in Figure 6.
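Synchronization by ranking, together with a toy version of the surrounding loop, can be sketched as follows. This is an illustration under stated assumptions rather than Algorithm 2: the inner optimizer is plain mini-batch SGD on a least-squares problem (the experiments in this paper use Adam), there is a single weight vector, and the names `ranking`, `synchronize`, and `laperm_toy` as well as all numeric settings are hypothetical.

```python
import numpy as np

def ranking(v):
    """rank[i] = position of v[i] in the sorted order of v (values assumed distinct)."""
    order = np.argsort(v, kind="stable")
    rank = np.empty(len(v), dtype=np.intp)
    rank[order] = np.arange(len(v))
    return rank

def synchronize(profile, reference):
    """Index the sorted weights by the ranking of the reference vector."""
    return profile[ranking(reference)]

def laperm_toy(X, y, w0, k=5, iterations=20, lr=0.05, batch=16, seed=0):
    rng = np.random.default_rng(seed)
    profile = np.sort(w0)                    # sorted copy, fixed for the whole run
    w = w0.copy()
    for _ in range(iterations):
        fast = w.copy()                      # inner-loop weights start from w
        for _ in range(k):                   # look ahead for k mini-batches
            idx = rng.choice(len(X), size=batch, replace=False)
            residual = X[idx] @ fast - y[idx]
            grad = 2.0 * (X[idx].T @ residual) / batch
            fast -= lr * grad
        w = synchronize(profile, fast)       # keep only the ranking of `fast`
    return w

profile = np.array([-0.3, -0.1, 0.2, 0.5])
reference = np.array([0.7, -1.2, 0.1, 0.4])
synced = synchronize(profile, reference)     # largest reference entry gets 0.5, etc.

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
y = X @ rng.normal(size=8)
w0 = rng.normal(size=8)
w_final = laperm_toy(X, y, w0)
print(synced, np.sort(w_final) - np.sort(w0))
```

However many iterations are run, `w_final` contains exactly the values of `w0`; only their positions differ.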
3.2 Computational Complexity
LAPerm introduces computational overhead mainly due to the synchronization step, where a sorting operation is required to get the ranking of values. Assuming that we use a linearithmic sorting algorithm for weight vectors of size , and use an inner optimizer with time complexity , for synchronization period , the computational complexity for one LAPerm update ( mini-batches) is .
Assuming appropriately chosen , the value distribution and range of are similar to those of the initial weights . Moreover, in modern DNN architectures, the sizes of most weight vectors are usually under , e.g., the average weight vector sizes in VGG16, ResNet50, and MobileNet (Howard et al., 2017) are approximately 4170, 1017, and 1809, respectively. Therefore, the performance of sorting can be further improved by adopting algorithms such as bucket sort, especially when weights are initialized with near-uniform distributions.
We empirically validate LAPerm in Section 5.2.
4 Weights as Hyper-parameters
In this section, we discuss how to initialize weights for both SGPerm and LAPerm. Since P2T algorithms only alter the order of the weights and never their statistical properties, the weights are naturally regularized. During training, we could adjust the weight profiles according to our needs or keep them unchanged. In the former case, we obtain a new dimension for controlling the training; in the latter case, the initialization has a direct effect on the whole training process.
4.1 Walking on the Surface of Hyperspheres
SGD and its variants have infinite freedom when exploring the tremendous parameter space of a DNN. However, given a fully connected multilayer DNN with weight vectors , the P2T implementations introduced in this paper can choose from precisely possible permutation configurations (assuming that each weight vector contains only distinct weight values and temporarily ignoring non-weight parameters such as biases). Since permuting never changes its Euclidean norm, P2T walks strictly on the surface of a hypersphere of radius .
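Both observations can be checked numerically. The sketch below assumes that the number of configurations is the product of the factorials of the weight vector sizes (weights are permuted only within a weight vector and are distinct); the function name and the 784-500-300-10 layer sizes, borrowed from the fully connected network of Section 5.1.2, are used purely as an example.

```python
import math
import numpy as np

def log10_configurations(sizes):
    """log10 of prod_j (n_j!), where n_j is the size of the j-th weight vector."""
    return sum(math.lgamma(n + 1) for n in sizes) / math.log(10.0)

# 500 neurons with 784 inputs, 300 with 500 inputs, 10 with 300 inputs.
sizes = [784] * 500 + [500] * 300 + [300] * 10
print(f"roughly 10^{log10_configurations(sizes):.0f} configurations")

# Permuting a weight vector keeps it on the same hypersphere.
rng = np.random.default_rng(0)
w = rng.normal(size=784)
w_perm = rng.permutation(w)
print(np.linalg.norm(w), np.linalg.norm(w_perm))
```

The count is finite but astronomically large, while the norm of each weight vector is fixed by its initialization.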
In general, we would like the initial weight vector to allow its possible permutations to uniformly cover the hypersphere, so that the combinations of such permutations when forming a weight matrix are flexible enough to represent a variety of functions. Spruill (2007) observed that the coordinates of a point chosen uniformly at random on the $\ell_2(n)$-sphere of radius $\sqrt{n}$ have approximately a normal distribution when $n$ is large, which encourages the use of a normal distribution for the initial weights for P2T.
4.2 From Hesitant to Confident
Aside from the theoretical motivations, given the center and scale, a random distribution governs the probabilities of occurrence of different weight values. The weighted connections of a neuron then determine how much each individual feature contributes to its activation. In Figure 1(b), the profile of a well trained weight matrix demonstrates properties similar to those of a double exponential distribution, where few parameters have comparatively large weights and exponentially more parameters have weights that are close to zero. We consider such a profile to be confident, since it largely attributes its activation to very few input features. Conversely, the uniform random distribution is considered to be hesitant, since it contains weight values that are selected with equal probability within the range, and thus is forced to take more input features into account when forming its activation. Ideally, we want the weight profiles to be neither too hesitant nor too confident, e.g., following the normal distribution.
Nevertheless, empirical results showed that P2T has a high tolerance to weight distributions. For example, P2T can permute not only random weights but also “trained weights”, e.g., the weight profiles of a well trained DNN. This property can be utilized to further improve and understand the performance of P2T, which will be studied in depth in the future. For now, we use uniform random weights for SGPerm because of their evenly spaced weight values. This makes it easy to find similar numbers of permissible movements for every parameter in all weight vectors by using a common aggressiveness (Section 2.1.3).
4.3 Center and scale
A properly centered and scaled weight initialization is crucial, as it ensures a stable forward and backward information flow in DNNs. Many effective initialization schemes for SGD and its variants are available, such as Kaiming initialization (He et al., 2015). For now, we directly adopt these methods. Future work may devise better initialization methods specifically for P2T, e.g., by imitating the desirable features of the weights of well trained DNNs.
4.4 Rewiring Neuron Connections
We call a DNN sparsely connected if the weight matrices between every two neighboring layers contain mostly zeros. P2T trains DNNs using permutation, which modifies only to which neurons in the th layer each neuron in the th connects. Therefore, as demonstrated in Figure 7, when a DNN is sparsely connected, P2T rewires its neurons so as to adapt to the given learning task. In contrast, traditional training algorithms can only change the weights of existing neuron connections. In Section 5.2.3, we exploit the plasticity in neuron connections that P2T provides and attempt to learn a good neuron connectivity for the MNIST dataset, given a randomly pruned (90% of all connections are removed), randomly initialized DNN.
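The effect of permutation on a sparsely connected layer can be seen in a small sketch. Here a random permutation stands in for the permutation that P2T would learn, and the helper `random_prune` and the matrix size are assumptions made for the example.

```python
import numpy as np

def random_prune(W, fraction, rng):
    """Set a uniformly random `fraction` of the entries of W to zero."""
    flat = W.ravel().copy()
    n_zero = int(round(fraction * flat.size))
    flat[rng.choice(flat.size, size=n_zero, replace=False)] = 0.0
    return flat.reshape(W.shape)

rng = np.random.default_rng(0)
W = rng.normal(size=(50, 4))              # columns are weight vectors
W_sparse = random_prune(W, 0.9, rng)      # keep 10% of the connections

# Permute each weight vector independently (stand-in for a learned permutation).
W_rewired = np.stack(
    [rng.permutation(W_sparse[:, j]) for j in range(W_sparse.shape[1])], axis=1
)

print((W_sparse != 0).sum(axis=0), (W_rewired != 0).sum(axis=0))
```

Each neuron keeps the same number of nonzero connections and the same weight values, but the inputs to which those connections attach may change. A gradient-based update with a fixed sparsity mask cannot do this.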
In this section, we empirically evaluate our methods on three image classification tasks using several different DNN architectures. We train DNNs using SGPerm and LAPerm to minimize the cross entropy loss between the network output and the ground truth labels, given the corresponding images. For each task, the change in validation accuracy during training is shown. We also run Adam (Kingma & Ba, 2015) on the same tasks to understand the performance of P2T in comparison with a state-of-the-art conventional training algorithm.
In this section, we test SGPerm on MNIST and Fashion-MNIST (Xiao et al., 2017). Both datasets consist of 70,000 grayscale images of size $28 \times 28$ in ten different categories. We train our models on 60,000 sample images and validate on 10,000 test images. Hyper-parameters are searched over a sparse grid, as the chosen datasets and architectures are relatively straightforward. We report experiment results using the best settings.
5.1.1 Logistic Regression on MNIST
We first use SGPerm to estimate a set of parameters for a logistic model to classify MNIST handwritten digits. We use a network with 784 input units and 10 output units, with no hidden layers. We use Kaiming’s method to initialize the weights and biases with uniform random weights, and let, , , , and . The batch size is set to 64 for 800 batches.
We observe that SGPerm progressed quickly in the first 25 updates and plateaued as the validation accuracy reached approximately 87%, whereas Adam continued to improve. The quick jump-start and the accuracy bottleneck can be partially attributed to the strongly regularized but aggressive behavior of SGPerm. By monitoring weight differences in the first 50 updates for both SGPerm and Adam, we find that SGPerm creates updates that are on average approximately 4 times sparser but 80 times stronger. Another potential reason is that the logistic model has far fewer parameters than regular DNNs, which results in an insufficient number of permutation configurations and makes it difficult for SGPerm to reach high performance.
Logistic regression has a well-studied convex objective, which makes it suitable for understanding what SGPerm does to the weights. The logistic model has 10 neurons (not including the input neurons), where each neuron is mainly responsible for identifying one digit. The heatmap of the weight vector associated with each neuron before and after training is shown in Figure 9. Colors from brighter to darker indicate weight values from 0.0875 to -0.0875. It can be seen that SGPerm moved large positive weights to locations based on the interests of each neuron, in this case the strokes of each digit, whereas large negative weights were moved to the surroundings of these strokes. This behavior is very similar to completing the picture puzzles described at the beginning of Section 2.
5.1.2 Fully Connected Multilayer Neural Networks on MNIST and Fashion-MNIST
Fully Connected Multilayer Neural Networks (FC) are essential building blocks for modern DNN architectures, where each neuron in the hidden layer is by default connected with all neurons in the previous layer. We use a network with a hidden layer of size 500 followed by a hidden layer of size 300. The model uses ReLU as nonlinearities and a softmax output layer on top. The same DNN architecture with the same weight initialization is trained separately on MNIST and Fashion-MNIST. In other words, we ask SGPerm to solve two different problems by permuting the weighted connections of the same network.
For hyper-parameters, the weights and biases are initialized using Kaiming uniform, and we let , , , , and for both experiments. The batch size is set to 64 for 800 batches. The results are shown in Figure 10. To regularize the aggressive behavior mentioned in Section 5.1.1, we tried a times smaller initial . We observe that this time, in contrast to Section 5.1.1, SGPerm slightly outperformed Adam within the first 800 updates. With two added hidden layers, the number of possible neuron connection configurations of the DNN model grows super-exponentially, which lets SGPerm quickly fit the data even though permutation is the only operation allowed.
In this section, we tackle a more difficult image classification task, CIFAR-10, using LAPerm on convolutional neural networks. The CIFAR-10 dataset consists of 60,000 $32 \times 32$ color images in 10 different classes. We train our models on 50,000 training images and validate on 10,000 test images. In all experiments, we z-score normalize all images and use real-time random data augmentation with rotation of up to 15 degrees and width and height shifts of up to 10% of the original image size. Since we have yet to come up with a P2T method for the batch normalization (Ioffe & Szegedy, 2015) layers, for now they are trained using only the inner optimizer of LAPerm.
ResNet is an effective variant of the convolutional neural network (CNN). It made it possible to significantly increase the depth of DNNs while still achieving compelling performance. We validate LAPerm on the ResNet20 model, which is 20 layers deep and has approximately 270,000 trainable parameters. The model weights and biases are initialized with normal random weights following Kaiming’s method. The model is trained separately using LAPerm with Adam as the inner optimizer and using Adam alone, with a batch size of 64 for 200 epochs. The synchronization period is set to 10 for LAPerm. The learning rates for Adam in both experiments start at 0.001 and are divided by 10 at the 80th, 120th, 160th, and 180th epochs. Regularization follows the original ResNet paper.
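The step schedule described above can be written as a small function. The name `learning_rate` and the convention that epochs are counted from 0, with each reduction taking effect from the listed epoch onward, are assumptions made for this sketch.

```python
def learning_rate(epoch, base=1e-3, milestones=(80, 120, 160, 180)):
    """Step schedule: divide the base rate by 10 at every milestone reached."""
    reductions = sum(epoch >= m for m in milestones)
    return base / 10 ** reductions

for epoch in (0, 79, 80, 120, 160, 180, 199):
    print(epoch, learning_rate(epoch))
```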
The result is shown in Figure 11. LAPerm achieved a maximum validation accuracy of 87.41%, which is 4.33% lower than that of Adam. The accuracy trend of LAPerm shows that it made no progress at all in the last 100 epochs. Training usually benefits from a learning rate reduction when progress slows or stops, gaining a boost in accuracy. However, at around the 80th epoch, while Adam benefited greatly from the reduction of the learning rate, LAPerm barely responded. Could this optimization difficulty be partially due to the complicated architecture of ResNet?
5.2.2 VGG-style Neural Network
The frustration experienced in training ResNet in Section 5.2.1 encouraged us to test on a shallower DNN. We choose a VGG-style CNN with approximately 300,000 trainable parameters, which is similar to ResNet20 but has only 7 layers: [32 CONV ][MaxPool Stride 2] [64 CONV ][MaxPool Stride 2] [128 CONV ][MaxPool Stride 2][FC 10]. The model uses ReLU as nonlinearities and a softmax output layer on top. The model is trained for 1000 epochs using a batch size of 64. We compare LAPerm with Adam by running the following two experiments: (1) We run LAPerm using Adam as the inner optimizer from the beginning and set the synchronization period . (2) We run Adam from the beginning. For better performance of Adam, we add regularization: we use 0.0001 L2 weight decay and apply 20%, 30%, and 40% dropout (Krogh & Hertz, 1991) to the inputs after the first, second, and third max pooling layers, respectively. For both experiments, the learning rate of Adam starts at 0.001 and is divided by 10 at the 500th and 700th epochs.
The result is shown in Figure 12. LAPerm with Adam as the inner optimizer achieved a maximum validation accuracy of 90.03% and outperformed Adam alone.
By comparing with the results in Section 5.2.1, we conjecture that, for P2T, the more complicated the DNN architecture (more layers and skip connections (He et al., 2016)), the more careful the weight initialization needs to be, because, unlike Adam, P2T cannot modify any weight values. Instead of simply adopting the random weight initialization scheme for SGD and its variants (as we did for all experiments in this paper), a well-designed initialization method for P2T could help improve its accuracy on DNNs with complicated architectures. Moreover, an insufficient number of inner-loop iterations could cause the permutation to happen at an improper time, interrupting the progress of the inner optimizer. Other potential causes of inefficient training include an improper choice of inner optimizer and insufficient tuning of the learning rate. The reasons for these optimization difficulties will be studied in future work.
5.2.3 Rewiring DNN using LAPerm
Continuing from Section 4.4, we demonstrate how the plasticity in neuron connections that P2T provides allows us to learn a better neuron connectivity for a sparsely connected DNN than a random connectivity.
The FC model in Section 5.1.2, with 550,000 trainable parameters, was over-parameterized for the MNIST dataset. We first initialize this model with normal random values following Kaiming’s method. To reduce the size of this network, we then randomly select 90% of these weights and set them to zero. This step can be understood as obtaining a randomly connected subnetwork. We train this subnetwork, starting from the same weight initialization, in the following two experiments on the MNIST dataset:
Experiment 1: Phase 1: The model is trained using LAPerm with Adam as inner optimizer for the first 1500 mini-batches. Phase 2: While fixing the zero weights, we use Adam, which inherits all inner states from the inner optimizer of LAPerm used in Phase 1, to continue training this model. We run this experiment for a total of 48 epochs including the first 1500 mini-batches using LAPerm in Phase 1.
Experiment 2: The model, while fixing the zero weights, is trained using Adam for 48 epochs.
For both experiments, we use a batch size of 64 for 48 epochs. The learning rate of Adam is set to 0.001 and is divided by 10 at the 28th and 38th epochs. The synchronization period for LAPerm is set to 30.
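The two ingredients shared by both experiments, random pruning and training "while fixing the zero weights", can be sketched on a flat list of weights. This is a toy illustration under stated assumptions: `random_prune` and `masked_step` are hypothetical helper names, the weights are drawn from a unit normal rather than Kaiming’s scheme, and a plain gradient step stands in for Adam.

```python
import random


def random_prune(weights, sparsity=0.9, seed=0):
    """Zero out a randomly chosen `sparsity` fraction of `weights`.

    Returns the pruned copy and a 0/1 mask (1 = connection kept).
    """
    rng = random.Random(seed)
    n = len(weights)
    zero_idx = set(rng.sample(range(n), round(sparsity * n)))
    mask = [0 if i in zero_idx else 1 for i in range(n)]
    pruned = [w * m for w, m in zip(weights, mask)]
    return pruned, mask


def masked_step(weights, grads, mask, lr=0.001):
    """One gradient step that keeps pruned entries at zero.

    Assumption: plain SGD is used here only as a stand-in for Adam.
    """
    return [w - lr * g * m for w, g, m in zip(weights, grads, mask)]


rng = random.Random(1)
weights = [rng.gauss(0.0, 1.0) for _ in range(1000)]
pruned, mask = random_prune(weights)            # 90% of entries become zero
grads = [rng.gauss(0.0, 1.0) for _ in range(1000)]
updated = masked_step(pruned, grads, mask)      # zeros stay zero
print(sum(mask), "connections kept out of", len(mask))
```

Multiplying the gradient by the mask is one simple way to hold the pruned connections fixed; the surviving connections are updated as usual.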
The validation accuracies for both experiments are shown in Figure 15 (a) and (b), where (b) shows the validation accuracy only for the first 2700 mini-batches. For Experiment 1, we use different colors to differentiate Phase 1 and Phase 2, which are labeled “LAPerm” and “Adam (LAPerm reconnected)”, respectively, in the figure.
We first observe the heatmap of the weight matrix between the input layer and the first hidden layer of our model in Figure 13 and Figure 14. In both figures, image (a) represents the random weights (90% pruned) before the experiments, (b) represents the resulting weights from Experiment 1, and (c) represents the resulting weights from Experiment 2. The heatmap of the weight matrix between two layers can be understood as the connection pattern of the neurons between these layers, since we can learn where the neurons are connected by simply observing whether their connections have zero weight values. Comparing Figure 13 (b) with (c), we observe that in (b) the connection pattern is modified, whereas between (a) and (c) the weights changed in value but never in location. These observations suggest that LAPerm rewired the neurons.
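The reading of the heatmaps above, where a connection exists exactly where a weight is nonzero, can be made concrete with a small check. The sketch below uses hand-written toy matrices rather than the trained weights, and the helper names are hypothetical.

```python
def connection_pattern(weights):
    """True where a connection exists, i.e. where the weight is nonzero."""
    return [[w != 0 for w in row] for row in weights]


def same_connectivity(a, b):
    """Whether two weight matrices connect the same pairs of neurons."""
    return connection_pattern(a) == connection_pattern(b)


def same_weight_multiset(a, b):
    """Whether two weight matrices hold the same values, ignoring location."""
    def flat(m):
        return sorted(w for row in m for w in row)
    return flat(a) == flat(b)


# Toy matrices (illustrative values only).
initial = [[0.5, 0.0, 0.0], [0.0, -0.3, 0.0]]
permuted = [[0.0, 0.5, 0.0], [0.0, 0.0, -0.3]]   # same values, new locations
finetuned = [[0.7, 0.0, 0.0], [0.0, -0.1, 0.0]]  # new values, same locations

print(same_connectivity(initial, permuted), same_weight_multiset(initial, permuted))
print(same_connectivity(initial, finetuned), same_weight_multiset(initial, finetuned))
```

Here `permuted` mimics the effect of a permutation-only phase (the pattern moves, the values do not), while `finetuned` mimics masked gradient training (the values move, the pattern does not).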
Indeed, in Phase 1 of Experiment 1, our goal was to rewire the neuron connections of this model using LAPerm so that its connectivity is suitable for learning the given task (MNIST, in this case). This phase can be understood as finding a proper neuron connectivity and training the model at the same time. In Phase 2 of Experiment 1, we aim to use Adam to fine-tune the subnetwork learned by LAPerm in Phase 1. In Figure 15, we observe that the experimental results are in accordance with our goals, as the LAPerm-reconnected subnetwork in Experiment 1 clearly outperforms the random subnetwork used in Experiment 2 after the 25th epoch. While both experiments are trained for exactly the same number of batches, using LAPerm in Experiment 1 learns a better neuron connectivity than in Experiment 2 and thus reaches a better accuracy at the end of training.
The early validation accuracy trend, covering Phase 1 and the transition to Phase 2, is shown in Figure 15 (b). There, we observe that LAPerm plateaued at approximately 60% validation accuracy and progressed much more slowly than Adam alone. This is because after removing 90% of the weighted connections, only approximately 50,000 connections are available, which reduces the number of possible permutation configurations of weighted connections in this network by approximately times. On the other hand, this significant reduction in the freedom of permutation forced LAPerm to search for a better neuron connectivity.
6 Conclusion and Future Work
In this paper, we presented Permute to Train (P2T), a family of algorithms that train DNNs by permuting neuron connections and that show very different virtues compared with traditional training algorithms. We demonstrated that P2T can obtain performance comparable to state-of-the-art optimizers on the MNIST, Fashion-MNIST, and CIFAR-10 datasets. Additionally, we briefly discussed the innate regularization effect of P2T and its ability to rewire a DNN.
The P2T implementations introduced in this paper still have much room for improvement, and this work has not explored the full range of their applications. As demonstrated in Section 5.1.2, P2T can permute the same set of random weights to learn two completely different machine learning tasks. An application immediately derived from this property would be a simplified implementation of physical deep neural networks. Using P2T, we could replace sophisticated memristive-nanodevice-based neuron connection implementations (Snider, 2007) with simple fixed-weight devices. Since P2T does not need to perform any arithmetic operations on the weights, we can add a permutation circuit before each layer to permute the inputs, which is equivalent to permuting the weighted connections. Moreover, in this work, we have only explored neuron reconnection (permutation) between two neighboring DNN layers, because that is where the similarity of weight profiles is observed. Future work could attempt to reconnect neurons across different layers.
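The equivalence noted above, between permuting weighted connections and permuting inputs, can be checked numerically for a single neuron. This is a minimal sketch with illustrative names and random values: permuting the weight vector by `perm` gives the same pre-activation as keeping the weights fixed and routing the inputs through the inverse permutation.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


rng = random.Random(0)
n = 6
w = [rng.gauss(0.0, 1.0) for _ in range(n)]   # fixed weights of one neuron
x = [rng.gauss(0.0, 1.0) for _ in range(n)]   # inputs to that neuron
perm = list(range(n))
rng.shuffle(perm)

# Option A: permute the weights, w_permuted[i] = w[perm[i]].
w_permuted = [w[perm[i]] for i in range(n)]

# Option B: keep the weights in place and reroute the inputs with the
# inverse permutation, x_rerouted[j] = x[inv[j]] where inv[perm[i]] = i.
inv = [0] * n
for i, p in enumerate(perm):
    inv[p] = i
x_rerouted = [x[inv[j]] for j in range(n)]

lhs = dot(w_permuted, x)
rhs = dot(w, x_rerouted)
equal = math.isclose(lhs, rhs, rel_tol=1e-9, abs_tol=1e-12)
print(equal)
```

Both sums contain the same products in a different order, so they agree up to floating-point rounding, which is why the comparison uses `math.isclose` rather than exact equality.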
Moreover, since gradient descent moves like a man walking down a hill, one of the greatest nemeses of this type of algorithm is suboptimal local minima. P2T, on the other hand, involves many random jumps and restarts (Sections 2.2 and 3.1) while still being able to properly train DNNs. The uncertainties in the behavior of P2T break the continuity of its training process and thus show potential for avoiding getting trapped in local minima. Finally, we believe that a deeper study of P2T in both its theoretical and empirical aspects will lead to a better understanding of the capacity and trainability of DNNs.
The authors are grateful for the comments and feedback provided by Vorapong Suppakitpaisarn and Farley Oliveira from Graduate School of Information Science and Technology, The University of Tokyo, and Khoa Tran from Department of Mathematics, University of California San Diego.
- Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
- Frankle & Carbin (2019) Frankle, J. and Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJl-b3RcF7.
- Gaier & Ha (2019) Gaier, A. and Ha, D. Weight agnostic neural networks. In Advances in Neural Information Processing Systems 32, pp. 5364–5378. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/8777-weight-agnostic-neural-networks.pdf.
- He et al. (2015) He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV ’15, pp. 1026–1034, USA, 2015. IEEE Computer Society. ISBN 9781467383912. doi: 10.1109/ICCV.2015.123. URL https://doi.org/10.1109/ICCV.2015.123.
- He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770–778. IEEE Computer Society, 2016. doi: 10.1109/CVPR.2016.90. URL https://doi.org/10.1109/CVPR.2016.90.
- Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, pp. 448–456. JMLR.org, 2015.
- Kingma & Ba (2015) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Bengio, Y. and LeCun, Y. (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.
- Krogh & Hertz (1991) Krogh, A. and Hertz, J. A. A simple weight decay can improve generalization. In Proceedings of the 4th International Conference on Neural Information Processing Systems, NIPS’91, pp. 950–957, San Francisco, CA, USA, 1991. Morgan Kaufmann Publishers Inc. ISBN 1558602224.
- Krogh & Hertz (1992) Krogh, A. and Hertz, J. A. A simple weight decay can improve generalization. In Moody, J. E., Hanson, S. J., and Lippmann, R. P. (eds.), Advances in Neural Information Processing Systems 4, pp. 950–957. Morgan-Kaufmann, 1992. URL http://papers.nips.cc/paper/563-a-simple-weight-decay-can-improve-generalization.pdf.
- Nichol et al. (2018) Nichol, A., Achiam, J., and Schulman, J. On first-order meta-learning algorithms. CoRR, abs/1803.02999, 2018. URL http://arxiv.org/abs/1803.02999.
- Polyak (1964) Polyak, B. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4:1–17, 12 1964. doi: 10.1016/0041-5553(64)90137-5.
- Ramanujan et al. (2019) Ramanujan, V., Wortsman, M., Kembhavi, A., Farhadi, A., and Rastegari, M. What’s hidden in a randomly weighted neural network? arXiv preprint arXiv:1911.13299, 2019.
- Simonyan & Zisserman (2015) Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Bengio, Y. and LeCun, Y. (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1409.1556.
- Snider (2007) Snider, G. Self-organized computation with unreliable, memristive nanodevices. Nanotechnology, 18:365202–13, 09 2007. doi: 10.1088/0957-4484/18/36/365202.
- Spruill (2007) Spruill, M. Asymptotic distribution of coordinates on high dimensional spheres. Electron. Commun. Probab., 12:234–247, 2007. doi: 10.1214/ECP.v12-1294. URL https://doi.org/10.1214/ECP.v12-1294.
- Xiao et al. (2017) Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747, 2017. URL http://arxiv.org/abs/1708.07747.
- Zhang et al. (2019) Zhang, M., Lucas, J., Ba, J., and Hinton, G. E. Lookahead optimizer: k steps forward, 1 step back. In Advances in Neural Information Processing Systems 32, pp. 9597–9608. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/9155-lookahead-optimizer-k-steps-forward-1-step-back.pdf.
- Zhou et al. (2019) Zhou, H., Lan, J., Liu, R., and Yosinski, J. Deconstructing lottery tickets: Zeros, signs, and the supermask. In Advances in Neural Information Processing Systems 32, pp. 3597–3607. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/8618-deconstructing-lottery-tickets-zeros-signs-and-the-supermask.pdf.