I Introduction
In recent years, deep learning has advanced remarkably and has been successfully applied in numerous fields [12]. A key mechanism behind the success of deep neural networks is their capability to extract useful information progressively through their layered structures. Increasingly complex deep neural network structures are being developed in order to solve challenging real-world problems, e.g., [6]. Training of deep neural networks is based on backpropagation in most cases, which works in a sequential and synchronous manner. During the feedforward pass, the input data is processed through the hidden layers to produce the network output; during the feedback pass, the error gradient is propagated back through the layers to update each layer’s weight parameters. Therefore, training of each layer depends on all the other layers, which causes the issue of locking [8]. This is undesirable in some cases, e.g., a system consisting of several interacting models or a model distributed across multiple computing nodes.
There have been attempts to remove the locking constraint. In [1], the method of auxiliary coordinates (MAC) is proposed. It replaces the original loss minimization problem with an equality-constrained optimization problem by introducing an auxiliary variable for each data point and each hidden unit. Then, solving the problem is formulated as iteratively solving several subproblems independently. A similar approach using the alternating direction method of multipliers (ADMM) is proposed in [15]. It also employs an equality-constrained optimization but with different auxiliary variables, so that the resulting subproblems have closed-form solutions. However, these methods are not scalable to deep learning architectures such as convolutional neural networks (CNNs).
The method proposed in [8], called decoupled neural interface (DNI), directly synthesizes estimated error gradients, called synthetic gradients, using an additional small neural network for training a layer’s weight parameters. As long as the synthetic gradients are close to the actual backpropagated gradients, each layer does not need to wait until the error at the output layer is backpropagated through the preceding layers, which allows independent training of each layer. However, this method suffers from performance degradation when compared to regular backpropagation [3]. The idea of having additional modules supporting the layers of the main model is also adopted in [3], where the additional modules are trained to approximate the main model’s outputs instead of error gradients. Because of this, however, the method does not resolve the issue of update locking, and in fact, the work does not intend to design a non-sequential learning algorithm.
In this paper, we propose a novel approach for non-sequential learning, called local critic training. The key idea is to employ additional modules besides the main neural network model, which we call local critics, in order to indirectly deliver error gradients to the main model for training without backpropagation. In other words, a local critic located at a certain layer group is trained in such a way that the derivative of its output serves as the error gradient for training of the corresponding layers’ weight parameters. Thus, the error gradient does not need to be backpropagated, and the feedforward operations and gradient-descent learning can be performed independently. Through extensive experiments, we examine the influences of the network structure, update frequency, and total number of local critics, which provide not only insight into operation characteristics but also guidelines for performance optimization of the proposed method.
In addition to the capability of implementing training without locking, the proposed approach can be exploited for additional important applications. First, we show that applying the proposed method automatically performs structural optimization of neural networks for a given problem, which has been a challenging issue in the machine learning field. Second, a progressive inference algorithm using the network trained with the proposed method is presented, which can adaptively reduce the computational complexity during the inference process (i.e., test phase) depending on the given data. Third, the network trained by the proposed method naturally enables ensemble inference that can improve the classification performance.
II Proposed Approach
II-A Local Critic Training
The basic idea of the proposed approach is to introduce additional local networks, which we call local critics, besides the main network model, so that they eventually provide estimates of the output of the main network. Each local critic network can serve a group of layers of the main model by being attached to the last layer of the group. The proposed architecture is illustrated in Figure 1, where $f_i$ is the $i$th layer group (containing one or more layers), $h_i$ is the output of $f_i$, and $h_N$ is the final output of the main model having $N$ layer groups:
$$h_i = f_i(h_{i-1}) \tag{1}$$
$c_i$ is the local critic network for $f_i$, which is expected to approximate $h_N$ based on $h_i$, i.e.,
$$c_i(h_i) \approx h_N \tag{2}$$
Then, this can be used to approximate the loss function of the main network, $L_N = l(h_N, y)$, which is used to train $f_i$, by
$$L_i = l(c_i(h_i), y) \tag{3}$$
for $i = 1, \dots, N-1$, i.e.,
$$L_i \approx L_N \tag{4}$$
where $y$ is the training target and $l$ is the loss function such as cross-entropy or mean-squared error. Then, the error gradient for training $f_i$ is obtained by differentiating $L_i$ with respect to $h_i$, i.e.,
$$\delta_i = \frac{\partial L_i}{\partial h_i} \tag{5}$$
which can be used to train the weight parameters of $f_i$, denoted by $\theta_i$, via a gradient-descent rule:
$$\theta_i \leftarrow \theta_i - \eta\, \delta_i \frac{\partial h_i}{\partial \theta_i} \tag{6}$$
where $\eta$ is a learning rate. Note that the final layer group $f_N$ does not require a local critic network and can be trained using regular backpropagation because the final output of the main network is directly available. Therefore, the update of $f_i$ does not need to wait until its output propagates to the end of the main network and the error gradient is backpropagated; it can be performed as soon as the operations from (2) to (5) are done. For $c_i$, we usually use a simple model so that the operations through $c_i$ are simpler than those through $f_{i+1}$ till $f_N$.
While the dependency of $f_i$ on $f_j$ ($j > i$) during training is resolved in this way, there still exists the dependency of $c_i$ on $f_j$ ($j > i$), because training $c_i$ requires its ideal target, i.e., $h_N$, which is available from $f_N$ only after the feedforward pass is complete. In order to resolve this problem, we use an indirect, cascaded approach, where $c_i$ is trained so that its training loss targets the training loss for $c_{i+1}$ (we found that this is more effective than directly forcing $c_i$ to approximate $h_N$):
$$L_{c_i} = l_c(L_i, L_{i+1}) \tag{7}$$
In other words, training of $c_i$ can be performed once the loss for $c_{i+1}$ is available.
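The update rules in (3) and (5) to (7) can be illustrated with a minimal sketch. It assumes a toy scalar model with two layer groups and one local critic, a squared-error loss in (3), and an absolute-difference (L1) loss in (7). The names `Params`, `local_critic_step`, `w1`, `w2`, and `v` are hypothetical, the gradients are derived by hand for this toy case, and the main-network loss is treated as a fixed target when the critic is updated. It is not the implementation used in the experiments.

```python
from dataclasses import dataclass


@dataclass
class Params:
    w1: float  # weight of the first layer group (hypothetical scalar layer)
    w2: float  # weight of the final layer group
    v: float   # weight of the local critic attached to the first group


def squared_error(pred, target):
    return (pred - target) ** 2


def sign(z):
    return (z > 0) - (z < 0)


def local_critic_step(p, x, y, lr=0.1):
    """One training step for the toy model; returns (new params, critic-side loss, main loss)."""
    # Forward pass through the main model, as in (1).
    h1 = p.w1 * x
    h2 = p.w2 * h1

    # Loss estimated through the local critic, as in (3), and the true main loss.
    loss_critic_side = squared_error(p.v * h1, y)
    loss_main = squared_error(h2, y)

    # Error gradient for the first group, as in (5): derivative of the
    # critic-side loss with respect to h1. It needs only h1 and the critic.
    delta1 = 2.0 * (p.v * h1 - y) * p.v

    # Gradient-descent update of the first group, as in (6); dh1/dw1 = x.
    new_w1 = p.w1 - lr * delta1 * x

    # The final group has direct access to the output and uses the regular gradient.
    new_w2 = p.w2 - lr * 2.0 * (h2 - y) * h1

    # Critic update, as in (7), with an L1 loss |L1 - L2| and L2 held fixed.
    grad_v = sign(loss_critic_side - loss_main) * 2.0 * (p.v * h1 - y) * h1
    new_v = p.v - lr * grad_v

    return Params(new_w1, new_w2, new_v), loss_critic_side, loss_main
```

In this sketch the update of `w1` never reads `w2` or the final output, which is the decoupling that the local critic is meant to provide; only the critic update consumes the loss of the following stage.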
Figure 1 compares the proposed architecture with the existing DNI approach, which also employs local networks besides the main network to resolve the issue of locking [8]. In DNI, the local network $c_i$ attached to layer group $f_i$ directly estimates the error gradient of the main network’s loss $L_N$ with respect to the group output $h_i$, i.e.,
$$c_i(h_i) \approx \frac{\partial L_N}{\partial h_i} \tag{8}$$
so that each layer group of the main model can be updated without waiting for the forward and backward propagations in the subsequent layers. To update $c_i$, the error gradient for $h_{i+1}$ estimated by $c_{i+1}$ is backpropagated through $f_{i+1}$ and is used as the (estimated) target for $c_i$. Therefore, all the necessary computations in the forward and backward passes can be locally confined. The performance of the two methods will be compared in Section III.
II-B Structural optimization
In many cases, determining an appropriate structure of neural networks for a given problem is not straightforward. This is usually done through trial-and-error, which is extremely time-consuming. There have been studies to automate the structural optimization process [2, 4, 10, 13], but this issue still remains very challenging.
In deep learning, the problem of structural optimization is even more critical. Large-sized networks may easily show overfitting. Even if large networks may produce high accuracy, they take significantly large amounts of memory and computation, which is undesirable especially for resource-constrained cases such as embedded and mobile systems. Therefore, it is highly desirable to find an optimal network structure that is sufficiently small while the performance is kept reasonably good.
During local critic training, each local critic network is trained to estimate the output of the main network eventually. Therefore, once the training of the proposed architecture finishes, we obtain different networks that are supposed to have similar input-output mappings but have different structures and possibly different accuracy, i.e., multiple submodels and one main model (see Figure 1(b)). Here, a submodel is composed of the layers on the path from the input to a certain hidden layer and its local critic network. Among the submodels, we can choose one as a structure-optimized network by considering the trade-off relationship between the complexity and performance.
It is worth mentioning that our structural optimization approach can be performed instantly after training of the model, whereas many existing methods for structural optimization require iterative search processes, e.g., [16].
II-C Progressive inference
We propose another simple but effective way to utilize the submodels obtained by the proposed approach for computational efficiency, which we call progressive inference. Although small submodels (e.g., submodel 1) tend to show low accuracy, they would still perform well for some data. For such data, we do not need to perform the full feedforward pass but can take the classification decision made by the submodels. Thus, the basic idea of progressive inference is to finish inference (i.e., classification) with a small submodel if its confidence in the classification result is high enough, instead of completing the full feedforward pass with the main model, which can reduce the computational complexity. Here, the softmax outputs for all classes are compared and the maximum probability is used as the confidence level. If it is higher than a threshold, we take the decision by the submodel; otherwise, the feedforward pass continues. The proposed progressive inference method is summarized in Algorithm 1. (Our method shares some similarity with the anytime prediction scheme [11, 7] that produces outputs according to the given computational budget. However, ours does not require particular network structures, such as the multi-scale dense network [7] or FractalNet [11], but works with generic CNNs.)
II-D Ensemble inference
In recent deep learning systems, it is popular to use ensemble approaches to improve performance in comparison to single models, where multiple networks are combined for producing final results, e.g., [5, 14]. The submodels and main model obtained by applying the proposed local critic training approach can be used for ensemble inference. Figure 3(a) depicts how the submodels and the main model can work together to form an ensemble classifier. We take the simplest way to combine them, i.e., summation of the networks’ outputs.
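The summation rule can be sketched in a few lines, assuming each model has already produced a list of per-class output scores for the same input. The function name `ensemble_predict` and the example scores are hypothetical and serve only to illustrate the combination step.

```python
def ensemble_predict(model_outputs):
    """Sum per-class outputs over all models and return (predicted class, summed scores)."""
    num_classes = len(model_outputs[0])
    summed = [sum(out[k] for out in model_outputs) for k in range(num_classes)]
    predicted = max(range(num_classes), key=lambda k: summed[k])
    return predicted, summed


# Illustrative scores from three models for a three-class problem.
outputs = [
    [0.6, 0.4, 0.0],  # this model alone would pick class 0
    [0.1, 0.5, 0.4],
    [0.2, 0.5, 0.3],
]
predicted_class, summed_scores = ensemble_predict(outputs)
```

Here the first model on its own favors class 0, while the summed scores favor class 1, which shows how models that disagree can change the combined decision.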
III Experiments
We conduct extensive experiments to examine the performance of the proposed method in various aspects. We use the CIFAR-10 and CIFAR-100 datasets [9] with data augmentation. We employ a VGG-like CNN architecture with batch normalization and ReLU activation functions, which is shown in Figure 1(a). Note that this structure is the same as that used in [3]. It has three local critic networks, thus forming four layer groups that can be trained independently (i.e., the number of layer groups is 4). The local critic networks are also CNNs, and their structures are kept as simple as possible in order to minimize the computational complexity for computing the estimated error gradient given by (5).
We use stochastic gradient descent with a momentum of 0.9 for the main network and Adam optimization with a fixed learning rate for the local networks. L2 regularization is used for the main network. For the loss functions in (3) and (7), the cross-entropy and the L1 loss are used, respectively, which is determined empirically. The batch size is set to 128, and the maximum training iteration is set to 80,000. The learning rate for the main network is initialized to 0.1 and dropped by an order of magnitude after 40,000 and 60,000 iterations. The Xavier method is used for initialization of the network parameters. All experiments are performed using TensorFlow. We conduct all the experiments five times with different random seeds and report the average accuracy.
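The step schedule for the main network’s learning rate can be written as a small helper. This is a sketch under the assumption that the drop takes effect from iteration 40,000 and 60,000 onward (the exact boundary convention is not stated in the text); the function name `main_network_lr` is hypothetical.

```python
def main_network_lr(iteration):
    """Learning rate of the main network: 0.1, divided by 10 at 40,000 and again at 60,000."""
    if iteration < 40_000:
        return 0.1
    if iteration < 60_000:
        return 0.01
    return 0.001
```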
III-A Performance evaluation
Table I: Average accuracy (%) of backpropagation (BP), DNI, critic training (Critic), and proposed local critic training (LC). The numbers of local networks used are shown in the parentheses. The standard deviation values are also shown.

| Dataset | BP | DNI (3) | Critic (3) | LC (1) | LC (3) | LC (5) |
|---|---|---|---|---|---|---|
| CIFAR-10 | 93.93 ± 0.20 | 64.86 ± 0.42 | 91.92 ± 0.30 | 92.06 ± 0.20 | 92.39 ± 0.09 | 91.38 ± 0.20 |
| CIFAR-100 | 75.14 ± 0.18 | 36.53 ± 0.64 | 69.07 ± 0.25 | 73.61 ± 0.31 | 69.91 ± 0.50 | 63.53 ± 0.24 |

Figure 3 shows how the loss values of the main network and each local critic network, i.e., the losses defined in (3), evolve with respect to the training iteration. The graphs show that the local critic networks successfully learn to approximate the main network’s loss with high accuracy during the whole training process. The local critic network farthest from the output side (i.e., the one attached to the first layer group) shows larger loss values than the others, which is due to the inaccuracy accumulated through the cascaded approximation.
The classification performance of the proposed local critic training approach is evaluated in Table I. For comparison, the performance of the regular backpropagation, DNI [8], and critic training [3] is also evaluated. Although the critic training method is not for removing update locking, we include its result because it shares some similarity with our approach, i.e., additional modules to estimate the main network’s output. In all three methods, each additional module is composed of a convolutional layer and an output layer. In the case of the proposed method, we test different numbers of local critic networks. Figure 1(a) shows the structure with three local critic networks. When only one local network is used, it is located at the place of LC2 in Figure 1(a). When five local networks are used, they are placed after every two layers of the main network.
When compared to the result of backpropagation, the proposed approach successfully decouples training of the layer groups at the expense of a small decrease in accuracy (note that the performance of the proposed method can be made closer to that of backpropagation using different structures, as will be shown in Table II and Figure 3(b)). The degradation of the accuracy of our method is larger for CIFAR-100, which implies that the influence of gradient estimation is larger for more complex problems. When more local critic networks are used, the accuracy tends to decrease more due to higher reliance on predicted gradients rather than true gradients, while more layer groups can be trained independently. Thus, there exists a trade-off between the accuracy and the unlocking effect. The DNI method shows poor performance as in [3]. The proposed method shows similar (or even slightly better) performance compared to the critic training method, which shows the efficacy of the cascaded learning scheme of the local networks in our method.
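III-B Structures of local critic networks
Table II: Average accuracy (%) with standard deviation for different numbers of convolutional layers in the three local critic networks.

| Dataset | [1,1,1] (default) | [3,3,3] | [5,5,5] | [3,2,1] | [1,2,3] | [5,4,3] | [3,4,5] |
|---|---|---|---|---|---|---|---|
| CIFAR-10 | 92.39 ± 0.09 | 92.36 ± 0.22 | 91.72 ± 0.19 | 92.07 ± 0.21 | 92.20 ± 0.12 | 92.10 ± 0.16 | 91.90 ± 0.16 |
| CIFAR-100 | 69.91 ± 0.50 | 70.02 ± 0.29 | 70.34 ± 0.16 | 70.06 ± 0.64 | 69.81 ± 0.33 | 70.87 ± 0.40 | 69.93 ± 0.56 |
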
We examine the influence of the structures of the local critic networks in our method. Two aspects are considered, one about the influence of the overall complexity of the local networks and the other about the relative complexities of the local networks for good performance. For this, we change the number of convolutional layers in each local critic network, while keeping the other structural parameters unchanged.
The results for various structure combinations of the three local critic networks are shown in Table II. As the number of convolutional layers increases for all local networks (the first three cases in the table), the accuracy for CIFAR-100 slightly increases from 69.91% (with one convolutional layer) to 70.02% (three convolutional layers) and 70.34% (five convolutional layers), whereas for CIFAR-10 the accuracy slightly decreases when five convolutional layers are used. A more complex local network can better learn the target input-output relationship, which leads to the performance improvement for CIFAR-100. For CIFAR-10, on the other hand, the network structure with five convolutional layers seems too complex for the data to be learned, which causes the performance drop.
Next, the numbers of layers of the local networks are adjusted differently in order to investigate which local networks should be more complex for good performance. The results are shown in the last four columns of Table II. Overall, it is more desirable to use more complex structures for the local networks closer to the input side of the main model. For instance, LC1 and LC3 are supposed to learn the relationship from the output of the first layer group to the final output and that from the output of the third layer group to the final output, respectively. More layers are involved between the first layer group’s output and the final output in the main network, so the mapping that LC1 should learn would be more complicated, requiring a network structure with sufficient modeling capability.
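III-C Periodic update of local critic networks
Table III: Average accuracy (%) with standard deviation for different update frequencies of the local critic networks relative to the main network.

| Dataset | 1/1 | 1/2 | 1/3 | 1/4 | 1/5 |
|---|---|---|---|---|---|
| CIFAR-10 | 92.39 ± 0.09 | 91.91 ± 0.19 | 91.78 ± 0.18 | 91.57 ± 0.12 | 91.35 ± 0.17 |
| CIFAR-100 | 69.91 ± 0.50 | 67.99 ± 0.49 | 67.76 ± 0.19 | 66.74 ± 0.41 | 66.39 ± 0.39 |
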
A way to increase the efficiency of the proposed approach is to update the local critic networks not at every iteration but only periodically. This may degrade the accuracy but has two benefits. First, the amount of computation required to update the local networks can be reduced. Second, the burden of the communication between the layer groups also can be reduced. These benefits will be more significant when the local networks have larger sizes.
For the default structure shown in Figure 1(a), we compare different update frequencies in Table III. The accuracy decreases only slightly as the frequency decreases. When the update frequency is half of that for the main network (i.e., 1/2), the accuracy drops by 0.48% and 1.92% for CIFAR-10 and CIFAR-100, respectively. The additional decrease in accuracy is only 0.56% for CIFAR-10 and 1.60% for CIFAR-100 when the update frequency decreases from 1/2 to 1/5.
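The periodic update can be expressed as a simple scheduling rule. The sketch below assumes that an update frequency of 1/k means the local critic networks are updated on every k-th iteration of the main network, starting from iteration 0; the names `critic_update_due`, `count_critic_updates`, and `period` are hypothetical.

```python
def critic_update_due(iteration, period):
    """True when the local critic networks should be updated at this iteration."""
    return iteration % period == 0


def count_critic_updates(total_iterations, period):
    """Number of critic updates performed over a full training run."""
    return sum(1 for it in range(total_iterations) if critic_update_due(it, period))
```

With the 80,000 training iterations used here, `count_critic_updates(80_000, 2)` gives 40,000 critic updates and `count_critic_updates(80_000, 5)` gives 16,000, which indicates how much critic-side computation and communication the lower frequencies save.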
III-D Structural optimization
Table IV compares the performance of the submodels, and Table V shows the complexities of the submodels in terms of the amount of computation for a feedforward pass and the number of weight parameters. A larger network (e.g., submodel 3) shows better performance than a smaller network (e.g., submodel 1), which is reasonable due to the difference in learning capability with respect to the model size. The largest submodel (submodel 3) shows similar accuracy to the main model (92.29% vs. 92.39% for CIFAR-10 and 67.54% vs. 69.91% for CIFAR-100), while the complexity is significantly reduced. For CIFAR-10, the computational complexity in terms of the number of floating-point operations (FLOPs) and the memory complexity are reduced to only about 30% (15.72 to 4.52 million FLOPs, and 7.87 to 2.26 million parameters), as shown in Table V. If an absolute accuracy reduction of 1.86% (from 92.39% to 90.53%) is allowed by taking submodel 2, the reduction of complexity is even more remarkable, up to about one ninth.
In addition, Table IV also shows the accuracy of the networks that have the same structures as the submodels but are trained using regular backpropagation. Surprisingly, such networks do not easily reach accuracy comparable to that of the submodels obtained by local critic training, particularly for smaller networks (e.g., 74.46% vs. 85.24% with submodel 1 for CIFAR-10). We think that joint training of the submodels in local critic training helps them to find better solutions than those reached by independent regular backpropagation.
Therefore, these results demonstrate that a structurally optimized network can be obtained at a cost of a small loss in accuracy by local critic training, which may not be attainable by trial-and-error with backpropagation.
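The choice of a structure-optimized network from the trained models can be sketched as a selection under an accuracy budget. The snippet below uses the CIFAR-10 FLOPs and local critic training accuracies reported in Tables IV to VI; the selection rule itself (cheapest model whose accuracy drop stays within a tolerance) and the names `CANDIDATES` and `select_structure` are assumptions made for illustration, not a procedure prescribed by the method.

```python
# (name, FLOPs in millions, CIFAR-10 accuracy in %) taken from the reported tables.
CANDIDATES = [
    ("Submodel 1", 2.85, 85.24),
    ("Submodel 2", 1.76, 90.53),
    ("Submodel 3", 4.52, 92.29),
    ("Main model", 15.72, 92.39),
]


def select_structure(candidates, max_accuracy_drop):
    """Return the cheapest candidate whose accuracy is within the allowed drop from the best."""
    best_accuracy = max(acc for _, _, acc in candidates)
    feasible = [c for c in candidates if best_accuracy - c[2] <= max_accuracy_drop]
    return min(feasible, key=lambda c: c[1])
```

For example, allowing a drop of 0.5 percentage points selects submodel 3, and allowing 2 points selects submodel 2, which matches the two operating points discussed above.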
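Table IV: Average accuracy (%) with standard deviation of the submodels obtained by local critic training (LC) and of networks having the same structures trained by regular backpropagation (BP).

| Dataset | BP sub 1 | LC sub 1 | BP sub 2 | LC sub 2 | BP sub 3 | LC sub 3 |
|---|---|---|---|---|---|---|
| CIFAR-10 | 74.46 ± 0.91 | 85.24 ± 0.49 | 88.03 ± 0.87 | 90.53 ± 0.15 | 92.05 ± 0.24 | 92.29 ± 0.09 |
| CIFAR-100 | 47.58 ± 1.10 | 55.39 ± 0.57 | 61.79 ± 0.92 | 63.62 ± 0.31 | 67.81 ± 0.22 | 67.54 ± 0.70 |

Table V: Complexities of the submodels and the main model in terms of FLOPs for a feedforward pass and the number of weight parameters.

| Model | FLOPs | # of parameters |
|---|---|---|
| Submodel 1 | 2.85M | 1.42M |
| Submodel 2 | 1.76M | 0.88M |
| Submodel 3 | 4.52M | 2.26M |
| Main model | 15.72M | 7.87M |

Table VI: Average FLOPs and accuracy of progressive inference for CIFAR-10. The thresholds are shown in the parentheses.

| Method | FLOPs | Accuracy (%) |
|---|---|---|
| Progressive inference (0.9) | 2.90M | 91.18 ± 0.10 |
| Progressive inference (0.95) | 3.05M | 91.75 ± 0.16 |
| Main model | 15.72M | 92.39 ± 0.09 |
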
III-E Progressive inference
We apply the progressive inference algorithm shown in Algorithm 1 to the trained default network for CIFAR-10 with the threshold set to 0.9 or 0.95. The results are shown in Table VI. The feedforward pass ends at different submodels for different test data, and the average FLOPs over all test data are shown. When the threshold is 0.9, with only a slight loss of accuracy (92.39% to 91.18%), the computational complexity is reduced significantly, to only 18.45% of that of the main model. When the threshold increases to 0.95, the accuracy loss becomes smaller (only 0.64%), while the complexity reduction remains almost the same (19.40% of the main model’s complexity).
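The early-exit rule behind progressive inference can be sketched as follows. This is an illustration of the confidence test described in Section II-C rather than a reproduction of Algorithm 1: each stage is assumed to be a pair of callables, a layer group and an output head (a local critic for intermediate stages, the main model’s output for the last one), and the names `softmax`, `progressive_inference`, and `stages` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def progressive_inference(x, stages, threshold):
    """Run stages in order and stop at the first sufficiently confident one.

    Returns (predicted class, index of the stage where inference ended).
    """
    h = x
    last = len(stages) - 1
    for index, (layer_group, head) in enumerate(stages):
        h = layer_group(h)              # continue the feedforward pass
        probs = softmax(head(h))        # class probabilities of this submodel
        confidence = max(probs)
        # Exit early if confident enough; the final stage always decides.
        if confidence > threshold or index == last:
            return probs.index(confidence), index
```

Inputs on which an early submodel is confident never reach the later layer groups, which is where the reduction in average FLOPs reported in Table VI comes from.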
III-F Ensemble inference
The results of ensemble inference using the submodels and main model are shown in Figure 3(b). Using an ensemble of the three submodels, we observe improved classification accuracy (92.68% and 70.86% for the two datasets, respectively) in comparison to the main model. The performance is further enhanced by an ensemble of both the three submodels and the main model (92.79% and 71.86%). The improvement comes from the complementarity among the models, particularly between the models sharing a smaller number of layers. For instance, we found that submodel 3 and the main model tend to show coincident classification results for a large portion of test data, so their complementarity is not significant; on the other hand, more data are classified differently by submodel 1 and the main model, where we mainly observe performance improvement. Instead of the simple summation, there could be a better method to combine the models, which is left for future work.

IV Conclusion
In this paper, we proposed the local critic training approach for removing the inter-layer locking constraint in training of deep neural networks. In addition, we proposed three applications of the local critic training method: structural optimization of neural networks, progressive inference, and ensemble classification. It was demonstrated that the proposed method can successfully train CNNs with local critic networks having extremely simple structures. The performance of the method was also evaluated in various aspects, including effects of structures and update intervals of local critic networks and influences of the sizes of layer groups. Finally, it was shown that structural optimization, progressive inference, and ensemble classification can be performed directly using the models trained with the proposed approach without additional procedures.
References

[1] M. A. Carreira-Perpiñán and W. Wang. Distributed optimization of deeply nested systems. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 10–19, Reykjavik, Iceland, 2014.
[2] C. Cortes, X. Gonzalvo, V. Kuznetsov, M. Mohri, and S. Yang. AdaNet: Adaptive structural learning of artificial neural networks. In International Conference on Machine Learning (ICML), pages 874–883, Sydney, Australia, 2017.
[3] W. M. Czarnecki, S. Osindero, M. Jaderberg, G. Swirszcz, and R. Pascanu. Sobolev training for neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 4278–4287, Long Beach, CA, 2017.
[4] J. Feng and T. Darrell. Learning the structure of deep convolutional networks. In International Conference on Computer Vision (ICCV), pages 2749–2757, Santiago, Chile, 2015.
[5] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, Las Vegas, NV, 2016.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision (ECCV), pages 630–645, Amsterdam, The Netherlands, 2016.
[7] G. Huang, D. Chen, T. Li, F. Wu, L. Maaten, and K. Q. Weinberger. Multi-scale dense networks for resource efficient image classification. In International Conference on Learning Representations (ICLR), Vancouver, Canada, 2018.
[8] M. Jaderberg, W. M. Czarnecki, S. Osindero, O. Vinyals, A. Graves, D. Silver, and K. Kavukcuoglu. Decoupled neural interfaces using synthetic gradients. In International Conference on Machine Learning (ICML), pages 1627–1635, Sydney, Australia, 2017.
[9] A. Krizhevsky. Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto, 2009.
[10] T. Y. Kwok and D. Y. Yeung. Constructive algorithms for structure learning in feedforward neural networks for regression problems. IEEE Transactions on Neural Networks, 8(3):630–645, 1997.
[11] G. Larsson, M. Maire, and G. Shakhnarovich. FractalNet: Ultra-deep neural networks without residuals. In International Conference on Learning Representations (ICLR), Toulon, France, 2017.
[12] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521:436–444, 2015.
[13] R. Reed. Pruning algorithms - a survey. IEEE Transactions on Neural Networks, 4(5):730–747, 1993.
[14] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, Boston, MA, 2015.
[15] G. Taylor, R. Burmeister, Z. Xu, B. Singh, A. Patel, and T. Goldstein. Training neural networks without gradients: A scalable ADMM approach. In International Conference on Machine Learning (ICML), pages 2722–2731, New York, NY, 2016.
[16] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations (ICLR), Toulon, France, 2017.