1 Introduction
Deep neural networks have achieved great success in computer vision and machine learning. Mathematically, training a layer neural network can be formulated as: where denotes the training sample and denotes the labelled set, is the regularization in the th layer, and is the th layer’s activation. The absence of yields the fully connected networks. The main workhorse of deep neural network training is SGD [1] and its variants. The core part of SGD for training neural networks is computing the gradient with respect to , i.e., backpropagation [2]. For simplicity of presentation, we assume is a mapping. According to the chain rule, it follows:
If , the gradient decays fast or even vanishes when is large. This phenomenon, which has been mentioned in [3], hurts the speed and performance of SGD for deep networks. On the other hand, the convergence of stochastic training methods relies on the assumption that the gradient is Lipschitz continuous, which fails to hold in various applications. To overcome these drawbacks, [4, 5, 6, 7] propose gradient-free methods based on the Alternating Direction Method of Multipliers (ADMM) or Alternating Minimization (AM). The core idea is to decompose the training task into a sequence of substeps, each of which involves only one layer's activations. Because each substep can always be solved to its global minimizer, gradient-free methods can achieve notable speedups [4, 5, 6]. Another advantage of gradient-free methods is parallelism, since ADMM and AM lend themselves naturally to concurrent implementation. Besides acceleration and parallelism, ADMM and AM also converge under milder theoretical assumptions than SGD: the theory of SGD depends heavily on the global smoothness of with respect to , which usually fails to hold.
This paper studies an alternating minimization based method, namely Deep Learning Alternating Minimization (DLAM), and aims to develop an improved version of it. The training task can be reformulated as a nonlinear constrained optimization problem, and DLAM is the AM method applied to its penalty formulation. In [8], the authors propose a new framework that uses convex functions to approximate a nondecreasing activation; AM is the state-of-the-art solver for the resulting convex optimization problem. In [9], the authors extend DLAM to online learning with a co-activation memory technique. By representing the activation function in proximal operator form, [10] proposes a new penalty framework, which can also be minimized by AM. By rewriting activation functions as equivalent biconvex constraints, [11] proposes AM for Fenchel lifted networks. By lifting the ReLU function to a higher dimensional space, [6] develops a smooth multi-convex formulation solved with AM. In [12], the authors develop the AM methodology for nested neural networks.
The contributions of this paper can be summarized as follows: 1) A novel algorithm. We develop a new algorithm for training deep neural networks. In each iteration, the inertial technique is employed to predict an auxiliary point from the current and last iterates. For acceleration, we use a small penalty parameter in the first iterations and then increase it to a larger one. DLAM can be regarded as a special case of this scheme. 2) Sound results. Various proximal operators are widely used in the statistical learning community [13], so a wide range of application problems could potentially be solved by the proposed algorithm. We present numerical experiments to demonstrate the performance of our algorithm. The convergence is verified, and comparisons with the latest classical solvers are presented.
2 Problem Formulation and Algorithms
Notation: Given an integer , . Let and be positive, means . We denote that and . Similar notations are defined for and . represents the loss function and is the L2 norm. For a map and vectors , we denote where is a proximal parameter. If , then ; if is set to be the ReLU function [14], is one of these three items minimizing [15]. For a loss function with a regularization term, we denote an operator as . If is set to be , then ; if is set to be the hinge loss , then is one of these three items . More proximal operators are widely used in the statistical learning community [13].
Penalty formulation: We reformulate the layer training task as a linearly constrained optimization problem by introducing and :
(1)  
s.t. 
If , we need to minimize a function with linear constraints, a problem that ADMM solves efficiently. When , Problem (1) is a nonlinear constrained problem, which is difficult to solve directly. Thus, one considers its penalty problem instead. Given a very large penalty parameter , we aim to solve the reformulated problem:
(2)  
An extreme case is setting , in which (2) is identical to (1). In fact, even for linearly constrained nonconvex minimization, the penalty method is a good choice because it enjoys much milder convergence assumptions and more easily set parameters than nonconvex ADMM [16]. In the numerical experiments, we update from iteration to iteration, and is set to be at the end of each iteration.
Note that the formulation in this work differs from that of Zeng et al. [15]: there is no activation before the last layer, and the further analysis is based on this formulation.
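As a minimal sketch of the ReLU case mentioned in the notation, assume the scalar subproblem has the form min over u of (max(u, 0) - a)^2 + r (u - b)^2 with r > 0. This objective and the names `prox_relu`, `a`, `b`, and `r` are assumptions for illustration, not necessarily the exact definition used in this paper. Splitting on the sign of u gives one candidate per branch, and the minimizer is whichever candidate has the lower objective value.

```python
def relu(u):
    return max(u, 0.0)


def objective(u, a, b, r):
    # Assumed scalar subproblem: (ReLU(u) - a)^2 + r * (u - b)^2
    return (relu(u) - a) ** 2 + r * (u - b) ** 2


def prox_relu(a, b, r):
    """Closed-form minimizer of the assumed scalar ReLU subproblem (r > 0)."""
    # Branch u >= 0: convex quadratic (u - a)^2 + r (u - b)^2, clipped to u >= 0.
    cand_pos = max((a + r * b) / (1.0 + r), 0.0)
    # Branch u <= 0: ReLU(u) = 0, so only r (u - b)^2 varies; clip b to u <= 0.
    cand_neg = min(b, 0.0)
    # Keep whichever branch minimizer gives the smaller objective.
    return min((cand_pos, cand_neg), key=lambda u: objective(u, a, b, r))


if __name__ == "__main__":
    for a, b, r in [(-1.0, 2.0, 1.0), (2.0, -1.0, 1.0), (0.5, 0.3, 4.0)]:
        print(a, b, r, prox_relu(a, b, r))
```

Because each branch is a one-dimensional quadratic, no iterative solver is needed, which is what makes such substeps cheap inside an alternating scheme.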
Inertial methods: DLAM is the alternating minimization method applied to (2). In the nonconvex optimization community, the inertial technique [17] (also called heavy-ball or momentum) is widely used and has proved to be algorithmically efficient [18, 19, 20, 21]. Besides acceleration and good practical performance on nonconvex problems, another advantage of the inertial technique is that it avoids saddle points under weaker conditions [22]. The procedure of the inertial method is quite simple: it uses a linear combination of the current and last points for the next iteration. For example, gradient descent for minimizing a smooth function $f$ employs the inertial term ($\beta(x^{k}-x^{k-1})$) as $x^{k+1} = x^{k} - \gamma\nabla f(x^{k}) + \beta(x^{k}-x^{k-1})$.
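The heavy-ball idea can be illustrated on a one-dimensional quadratic. The following is a minimal sketch, not the algorithm of this paper: the function name `heavy_ball`, the step size `step`, the inertial weight `beta`, and the test function are all chosen here for illustration.

```python
def heavy_ball(grad, x0, step, beta, iters):
    """Gradient descent with an inertial (heavy-ball) term.

    Each update combines a gradient step at the current point with
    beta times the difference between the current and last points.
    """
    x_prev = x0
    x = x0
    for _ in range(iters):
        x_next = x - step * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x


if __name__ == "__main__":
    # Illustrative objective f(x) = 0.5 * (x - 3)^2, whose gradient is x - 3.
    grad = lambda x: x - 3.0
    print(heavy_ball(grad, x0=0.0, step=0.1, beta=0.5, iters=200))
    # Setting beta = 0 recovers plain gradient descent.
    print(heavy_ball(grad, x0=0.0, step=0.1, beta=0.0, iters=200))
```

With `beta = 0` the loop is ordinary gradient descent, which mirrors how an inertial scheme contains its non-inertial counterpart as a special case.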
Algorithm: We employ the alternating minimization method and use the inertial technique for , and ,
If , the algorithm above reduces to DLAM. We first use linear combinations to predict two new points (inertial step). In the substeps of updating , and , we use a proximal-point approach, i.e., we add the regularization terms , and to the minimizations to obtain sufficient descent. Let be the width of the layer, and we have .
As mentioned at the beginning of this section, is the penalty parameter, which must be large for the penalty problem to approximate the original one sufficiently closely. A natural difficulty is that a large leads to small changes of and , which slows the algorithm down. To overcome this drawback, we initialize with a small value and then increase it over the iterations as , where . Such techniques have been used in image processing and nonconvex linearly constrained problems [23, 24]. With the increasing penalty parameter strategy, our algorithm is presented in Algorithm 1.
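The effect of an increasing penalty parameter can be seen on a toy problem. The sketch below is only an illustration under stated assumptions: it minimizes 0.5 (x - 1)^2 + 0.5 (z + 1)^2 subject to x = z (solution x = z = 0) by alternating exact minimization of the penalized objective, and it assumes a geometric update `rho = min(theta * rho, rho_max)`. The update rule, the cap `rho_max`, and all names are hypothetical and are not taken from Algorithm 1.

```python
def penalty_am(rho0, theta, rho_max, iters, x=1.0, z=-1.0):
    """Alternating minimization of
    0.5*(x - 1)^2 + 0.5*(z + 1)^2 + (rho / 2)*(x - z)^2
    with an (assumed) geometrically increasing penalty parameter rho.
    """
    rho = rho0
    for _ in range(iters):
        # Exact minimizer over x with z fixed.
        x = (1.0 + rho * z) / (1.0 + rho)
        # Exact minimizer over z with the new x fixed.
        z = (-1.0 + rho * x) / (1.0 + rho)
        # Assumed continuation rule: grow rho by theta, capped at rho_max.
        rho = min(theta * rho, rho_max)
    return x, z


if __name__ == "__main__":
    # Small initial penalty that grows to 100.
    print(penalty_am(rho0=1.0, theta=1.5, rho_max=100.0, iters=50))
    # Penalty fixed at 100 from the start (theta = 1).
    print(penalty_am(rho0=100.0, theta=1.0, rho_max=100.0, iters=50))
```

On this toy problem and with the same iteration budget, the run that starts from a small penalty ends closer to the constrained solution than the run with a large fixed penalty, because early iterations are free to move far before the coupling becomes stiff.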
3 Numerical Experiments
In this section, we present the numerical results of our algorithm. We follow the experimental setup introduced by [7]. Specifically, we consider the DNN training model (1) with ReLU activation, the squared loss, and an MLP architecture with hidden layers, on two datasets, MNIST [25] and Fashion MNIST [26]. The specific settings are summarized as follows:
For the MNIST data set, we implemented a 784-(1500×2)-10 MLP (i.e., the input dimension d0 = 28 × 28 = 784, the output dimension d3 = 10, and the numbers of hidden units are all 1500), and set with different values to test the proposed algorithm iPDLAM. The sizes of the training and test sets are 60000 and 10000, respectively.

The parameters of DLAM are set to the default values given in [15]. The learning rate of SGD and its variants (i.e., RMSprop, Adam, Adadelta, AMSGrad, and Adamax) is 0.001 (a very conservative learning rate to see whether SGD can train the DNNs). More aggressive learning rates such as 0.01 and 0.05 have also been used, and similar training results are observed.

For each experiment, we used the same mini-batch size (512) and initialization for all algorithms. Specifically, all the weights are randomly initialized from a Gaussian distribution with a standard deviation of 0.01 and the bias vectors are initialized as vectors of all 0.1, while the auxiliary variables and state variables are initialized by a single forward pass.
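The initialization described above can be sketched in plain Python for a single input vector. This is an illustrative sketch rather than the released implementation: the function name `init_variables`, the reading of the auxiliary variables as pre-activations and the state variables as post-activations, and the choice to leave the output layer without an activation are assumptions. The layer sizes in the usage lines are kept small on purpose.

```python
import random


def relu(v):
    return [max(t, 0.0) for t in v]


def init_variables(layer_sizes, x, std=0.01, bias=0.1, seed=0):
    """Gaussian weights (std 0.01), constant biases (0.1), and a single
    forward pass that fills the per-layer auxiliary and state variables."""
    rng = random.Random(seed)
    weights, biases, pre_acts, acts = [], [], [], []
    a = list(x)
    n_layers = len(layer_sizes) - 1
    for l in range(n_layers):
        d_in, d_out = layer_sizes[l], layer_sizes[l + 1]
        W = [[rng.gauss(0.0, std) for _ in range(d_in)] for _ in range(d_out)]
        b = [bias] * d_out
        # Assumed auxiliary variable: the pre-activation W a + b.
        u = [sum(w * t for w, t in zip(row, a)) + b_j for row, b_j in zip(W, b)]
        # Assumed state variable: ReLU on hidden layers, identity on the output.
        a = u if l == n_layers - 1 else relu(u)
        weights.append(W)
        biases.append(b)
        pre_acts.append(u)
        acts.append(a)
    return weights, biases, pre_acts, acts


if __name__ == "__main__":
    W, b, U, V = init_variables([8, 6, 6, 3], [0.5] * 8)
    print([len(u) for u in U], V[-1])
```

Initializing the auxiliary and state variables by a forward pass means the constraints linking consecutive layers hold exactly at the starting point.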
The experiments are conducted on a desktop computer with an Intel® Core™ i7-8700K CPU @ 4.70 GHz and 16 GB memory, running Ubuntu 18.04 Server OS. The CPU provides 12 logical cores (6 physical cores). The proposed algorithm is implemented in PyTorch 1.5.0.
3.1 The closed-form solutions of substeps in iPDLAM
With the ReLU activation, the norm loss, set to 0, and a bias parameter added to the network, this subsection presents the closed-form solutions of the substeps in iPDLAM. The parameters are discussed in the following subsections. Besides the proximal operators on for , the closed-form solutions for and for are:
where is the input data.
In the following, we present numerical experiments that verify the theoretical analysis and demonstrate iPDLAM’s superior performance over the baselines under different and . Furthermore, we discuss the impact of the variable update order on the final performance.
3.2 Comparisons with DLAM
First, we present the objective function values, training accuracy, and test accuracy when for the two datasets; is set as . The maximum number of iterations is set as . The results of test 1 are given in Figure 1. iPDLAM outperforms DLAM: it achieves smaller objective function values and higher training and test accuracy. It is worth mentioning that the curves of the function values are not always decreasing. This does not contradict our theory, because we prove that is decreasing after sufficiently many iterations (Lemma 1), rather than the function values themselves.
From Table 1, we can see that the inertial term contributes when . When , iPDLAM performs poorly. This phenomenon coincides with our theoretical findings.
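Table 1. FV: objective function value; TrA: training accuracy; TeA: test accuracy.

| Inertial parameter | MNIST FV | MNIST TrA | MNIST TeA | Fashion MNIST FV | Fashion MNIST TrA | Fashion MNIST TeA |
|---|---|---|---|---|---|---|
| 0.1 | 0.14 | 95.4% | 88.2% | 0.14 | 94.6% | 93.2% |
| 0.2 | 0.12 | 95.7% | 88.9% | 0.12 | 95.0% | 93.4% |
| 0.3 | 0.13 | 95.8% | 89.2% | 0.13 | 95.5% | 93.7% |
| 0.4 | 0.14 | 96.2% | 89.5% | 0.14 | 95.7% | 93.9% |
| 0.5 | 0.14 | 96.4% | 89.3% | 0.14 | 96.2% | 94.1% |
| 0.6 | 0.10 | 96.6% | 89.2% | 0.10 | 96.6% | 94.2% |
| 0.7 | 0.03 | 96.8% | 89.6% | 0.09 | 96.3% | 94.2% |
| 0.8 | 0.28 | 92.0% | 85.8% | 0.29 | 91.6% | 91.8% |
| 0.9 | 0.61 | 95.1% | 87.6% | 0.64 | 94.6% | 93.0% |
| 1.0 | 5510 | 89.0% | 78.1% | 5600 | 89.1% | 86.2% |
| 1.1 | nan | nan | nan | nan | 10.4% | 10.4% |
| 1.2 | nan | nan | nan | nan | 10.4% | 10.4% |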
3.3 Robust performance on different values of
In the second test, we use , and the parameter is set as . The dataset is MNIST. The results of the second test are given in Figure 2. In all five cases, the training and test accuracy versus the iterations behave very similarly. The results show that the algorithm is insensitive to .
3.4 Results against classical Deep Learning optimizers
We compare iPDLAM with SGD, AdaGrad [27], Adadelta [28], Adam, and Adamax [29]. The training and test accuracy versus the iteration (epoch) for the different algorithms are reported in Figure 3. Although our algorithm cannot beat the classical algorithms on training accuracy, it performs better than most of them on test accuracy. Our algorithm can learn quite good parameters in very few iterations (fewer than 10). Moreover, the proposed iPDLAM is always better than the classical optimizers during the first 40 iterations, in both training and test accuracy. It is also the fastest to stabilize, even when SGD fails to converge (the blue curve at the bottom of Figure 3).
Based on the experimental results, we recommend adopting iPDLAM in the first stage of neural network training as a warm-up strategy, which is expected to greatly reduce the total training time.
3.5 The update order of variables
We compare the alternating minimization methods and use the following cyclic orders to update the variables:

Reverse:

Increase:

Nested Reverse (NR):
The numerical results are reported in Figure 4. The variable update order has little impact on the optimization process, which suggests that iPDLAM could be further applied in parallel and distributed settings. Further experimental results are provided in the supplementary material, and the implementation of this work is publicly available.
4 Conclusion
In this paper, we propose an improved alternating minimization method, named iPDLAM, for neural network training. The algorithm is developed by applying inertial techniques to the penalty formulation of the training task. Unlike stochastic training methods, our algorithm enjoys a solid convergence guarantee, and the numerical results show that it takes fewer iterations to reach the same training and test accuracy than various classical training algorithms.
References
[1] H. Robbins and S. Monro, “A stochastic approximation method,” The annals of mathematical statistics, pp. 400–407, 1951.
[2] D. E. Rumelhart, G. E. Hinton, R. J. Williams, et al., “Learning representations by back-propagating errors,” Cognitive modeling, vol. 5, no. 3, p. 1, 1988.
[3] S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber, et al., “Gradient flow in recurrent nets: the difficulty of learning long-term dependencies,” 2001.
[4] G. Taylor, R. Burmeister, Z. Xu, B. Singh, A. Patel, and T. Goldstein, “Training neural networks without gradients: A scalable admm approach,” in International conference on machine learning, pp. 2722–2731, 2016.
[5] Z. Zhang, Y. Chen, and V. Saligrama, “Efficient training of very deep neural networks for supervised hashing,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1487–1495, 2016.
[6] Z. Zhang and M. Brand, “Convergent block coordinate descent for training tikhonov regularized deep neural networks,” in Advances in Neural Information Processing Systems, pp. 1721–1730, 2017.
[7] J. Wang, F. Yu, X. Chen, and L. Zhao, “Admm for efficient deep learning with global convergence,” arXiv preprint arXiv:1905.13611, 2019.
[8] A. Askari, G. Negiar, R. Sambharya, and L. E. Ghaoui, “Lifted neural networks,” arXiv preprint arXiv:1805.01532, 2018.
[9] A. Choromanska, E. Tandon, S. Kumaravel, R. Luss, I. Rish, B. Kingsbury, R. Tejwani, and D. Bouneffouf, “Beyond backprop: Alternating minimization with co-activation memory,” stat, vol. 1050, p. 24, 2018.
[10] J. Li, C. Fang, and Z. Lin, “Lifted proximal operator machines,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 4181–4188, 2019.
[11] F. Gu, A. Askari, and L. E. Ghaoui, “Fenchel lifted networks: A lagrange relaxation of neural network training,” arXiv preprint arXiv:1811.08039, 2018.
[12] M. Carreira-Perpinan and W. Wang, “Distributed optimization of deeply nested systems,” in Artificial Intelligence and Statistics, pp. 10–19, 2014.
[13] P. L. Combettes and J.-C. Pesquet, “Proximal splitting methods in signal processing,” in Fixed-point algorithms for inverse problems in science and engineering, pp. 185–212, Springer, 2011.
[14] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814, 2010.
[15] J. Zeng, T. T.-K. Lau, S. Lin, and Y. Yao, “Global convergence of block coordinate descent in deep learning,” vol. 97 of Proceedings of Machine Learning Research, (Long Beach, California, USA), pp. 7313–7323, PMLR, 09–15 Jun 2019.
[16] T. Sun, D. Li, H. Jiang, and Z. Quan, “Iteratively reweighted penalty alternating minimization methods with continuation for image deblurring,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3757–3761, IEEE, 2019.
[17] B. T. Polyak, “Some methods of speeding up the convergence of iteration methods,” USSR Computational Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1–17, 1964.
[18] P. Ochs, Y. Chen, T. Brox, and T. Pock, “ipiano: Inertial proximal algorithm for nonconvex optimization,” SIAM Journal on Imaging Sciences, vol. 7, no. 2, pp. 1388–1419, 2014.
[19] T. Pock and S. Sabach, “Inertial proximal alternating linearized minimization (ipalm) for nonconvex and nonsmooth problems,” SIAM Journal on Imaging Sciences, vol. 9, no. 4, pp. 1756–1787, 2016.
[20] N. Loizou and P. Richtárik, “Momentum and stochastic momentum for stochastic gradient, newton, proximal point and subspace descent methods,” arXiv preprint arXiv:1712.09677, 2017.
[21] N. Loizou and P. Richtárik, “Accelerated gossip via stochastic heavy ball method,” in 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 927–934, IEEE, 2018.
[22] T. Sun, D. Li, Z. Quan, H. Jiang, S. Li, and Y. Dou, “Heavy-ball algorithms always escape saddle points,” IJCAI, 2019.
[23] Y. Wang, J. Yang, W. Yin, and Y. Zhang, “A new alternating minimization algorithm for total variation image reconstruction,” SIAM Journal on Imaging Sciences, vol. 1, no. 3, pp. 248–272, 2008.
[24] T. Sun, L. Qiao, and D. Li, “Bregman reweighted alternating minimization and its application to image deblurring,” Information Sciences, 2019.
[25] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al., “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[26] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,” arXiv preprint arXiv:1708.07747, 2017.
[27] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.
[28] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012.
[29] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” ICLR, 2014.