Backpropagation (backprop) Rumelhart et al. (1986) has been the workhorse of neural net learning for several decades, and its practical effectiveness is demonstrated by the recent successes of deep learning in a wide range of applications. Backprop (chain-rule differentiation) is used to compute gradients in state-of-the-art learning algorithms such as stochastic gradient descent (SGD) Robbins & Monro (1985) and its variations Duchi et al. (2011); Tieleman & Hinton (2012); Zeiler (2012); Kingma & Ba (2014).
However, backprop has several drawbacks as well, including the well-known vanishing-gradient issue, resulting from the recursive application of the chain rule through multiple layers of deep and/or recurrent networks Bengio et al. (1994); Riedmiller & Braun (1993); Hochreiter & Schmidhuber (1997); Pascanu et al. (2013); Goodfellow et al. (2016). Although several approaches were proposed to address this issue, including Long Short-Term Memory Hochreiter & Schmidhuber (1997), RPROP Riedmiller & Braun (1993), and rectified linear units Nair & Hinton (2010), the fundamental problem of computing gradients of a deeply nested objective function remains. Moreover, backpropagation cannot handle non-differentiable nonlinearities and does not allow parallel weight updates across the layers. Finally, backpropagation is often criticized for being biologically implausible, since its error-propagation mechanism does not influence neural activity, unlike known biological feedback mechanisms. For more details on the limitations of backpropagation and SGD, both numerical and conceptual (e.g., their biological (im)plausibility), see Le et al. (2011); Carreira-Perpinan & Wang (2014); Taylor et al. (2016) and Lee et al. (2015); Bartunov et al. (2018), respectively.
The issues discussed above continue to motivate research on alternative algorithms for neural net learning. Several approaches were proposed recently that introduce auxiliary variables associated with hidden unit activations in order to decompose the highly coupled problem of optimizing a nested loss function into multiple, loosely coupled, simpler subproblems. These include the alternating direction method of multipliers (ADMM) Taylor et al. (2016); Zhang et al. (2016) and alternating-minimization, or block coordinate descent (BCD), methods Carreira-Perpinan & Wang (2014); Zhang & Brand (2017); Zhang & Kleijn (2017); Askari et al. (2018); Zeng et al. (2018); Lau et al. (2018). A similar formulation, using Lagrange multipliers, was proposed earlier in LeCun (1986); Yann (1987); LeCun et al. (1988), where a constrained formulation involving activations required the output of the previous layer to be equal to the input of the next layer. This gave rise to the approach known as target propagation and its recent extensions Lee et al. (2015); Bartunov et al. (2018); target propagation is, however, a somewhat different method from the above BCD and ADMM methods, instead training layer-wise inverses of the forward mappings.
While the BCD and ADMM methods discussed above assume an offline (batch) setting, requiring the full training dataset at each iteration, we focus instead on the online, incremental learning approach, performing online alternating minimization (AM) over the network weights and auxiliary activation variables, which yields scalability to arbitrarily large datasets and applicability in incremental/online learning settings.
Herein, we introduce a general online alternating minimization framework, provide a theoretical convergence analysis, and propose two specific algorithms: AM-Adam, which uses Adam locally (no gradient chains) for the weight updates in the optimization subproblems at each layer, and AM-mem, based on a surrogate-function approach similar to the online dictionary learning of Mairal et al. (2009), which relies on the accumulation of second-order information (activation covariances and cross-covariances). Both methods demonstrate promising empirical results, clearly outperforming the corresponding offline alternating minimization, as well as the closely related approach of Taylor et al. (2016); we observe, however, that AM-Adam consistently outperforms AM-mem; often, it also converges faster than Adam and SGD initially, while typically matching the two as learning progresses.
Note that, unlike the ADMM-based methods Taylor et al. (2016); Zhang et al. (2016) and some previously proposed BCD methods Zeng et al. (2018), our approach does not require Lagrange multipliers, and (in the case of AM-Adam) needs no more auxiliary variables than the layer-wise activations, i.e., it is as memory-efficient as standard SGD, which stores activations for gradient computations. We allow arbitrary loss functions and nonlinearities (unlike, for example, the formulation of Zhang & Brand (2017), which relies on a ReLU assumption). Moreover, our empirical evaluation extends beyond the fully-connected networks commonly used to evaluate auxiliary-variable methods, and includes CNNs (LeNet-5), RNNs and discrete (binary) networks, on several datasets including the multi-million-sample HIGGS dataset.
In summary, our contributions are:
algorithm: a novel online (mini-batch) auxiliary-variable approach for training neural networks without the chain-rule gradient computation of backprop; as opposed to prior offline auxiliary-variable algorithms, our online method is scalable to arbitrarily large datasets and applicable to continual learning.
theory: to the best of our knowledge, we provide the first general theoretical convergence guarantees for alternating minimization in the stochastic setting. We show that the error of AM decays at a sublinear rate as a function of the iteration number.
extensive empirical evaluation on a variety of network architectures and datasets, demonstrating significant advantages of our method vs. offline counterparts, as well as faster initial learning than Adam and SGD, followed by similar asymptotic performance.
our online method inherits common advantages of similar offline auxiliary-variable methods, including (1) no vanishing gradients; (2) easier handling of non-differentiable nonlinearities in local subproblems; (3) the possibility of parallelizing weight updates across layers; and (4) a potentially higher level of biological plausibility, due to explicit neuronal activation propagation, as compared to backprop.
2 Alternating Minimization: Breaking Gradient Chains with Auxiliary Variables
We denote by $\{(x_t, y_t)\}$ a dataset of labeled samples, where $x_t$ and $y_t$ are the sample and its (vector) label at time $t$, respectively (e.g., $y_t$ can be a one-hot $K$-dimensional vector encoding discrete labels with $K$ possible values). We assume the inputs and labels are real-valued vectors of fixed dimension. Given a fully-connected neural network with $L$ hidden layers, $W_l$ denotes the link weight matrix associated with the links from layer $l-1$ to layer $l$, where $n_l$ is the number of nodes in layer $l$. $W_{L+1}$ denotes the weight matrix connecting the last hidden layer with the output. We denote the set of all weights $\mathbf{W} = \{W_l\}_{l=1}^{L+1}$.
Optimization problem. Training a fully-connected neural network with $L$ hidden layers consists of minimizing, with respect to the weights $\mathbf{W}$, a loss involving a nested function of the input; this can be re-written as the constrained optimization
$$\min_{\mathbf{W},\, \mathbf{c}} \;\; \mathcal{L}(W_{L+1}, a_L;\, y) \quad \text{s.t.} \quad c_l = W_l\, a_{l-1}, \;\; l = 1, \dots, L. \eqno(1)$$
In the above formulation, we use $a_l = \sigma_l(c_l)$ as shorthand (not a new variable) denoting the activation vector of the hidden units in layer $l$ (with $a_0 = x$), where $\sigma_l$ is a nonlinear activation function (e.g., ReLU) applied to the code $c_l$, a new auxiliary variable that must be equal to a linear transformation of the previous-layer activations.
For classification problems, we use the multinomial loss as our objective function:
$$\mathcal{L}(W_{L+1}, a_L;\, y) = -\sum_{k=1}^{K} y^{(k)} \log p_k, \eqno(2)$$
where $w_k$ is the $k$-th column of $W_{L+1}$, $y^{(k)}$ is the $k$-th entry of the one-hot vector encoding $y$, and the class likelihood is modeled as $p_k = \exp(w_k^\top a_L) \,/\, \sum_{j=1}^{K} \exp(w_j^\top a_L)$.
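As a concrete illustration, the softmax class likelihood and the resulting negative log-likelihood can be computed as in the following minimal NumPy sketch (the function names are ours, not from the paper's code):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def multinomial_loss(V, a, y_onehot):
    # Negative log-likelihood of the label under p = softmax(V @ a),
    # where the rows of V play the role of the class weight vectors w_k.
    p = softmax(V @ a)
    return -float(np.sum(y_onehot * np.log(p)))
```

Only the entry of `p` corresponding to the true class contributes to the loss, since `y_onehot` has a single nonzero entry.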
Offline Alternating Minimization. We start with an offline optimization problem formulation, for a given dataset of $T$ samples, which is similar to Carreira-Perpinan & Wang (2014) but uses the multinomial instead of the quadratic loss, and a different set of auxiliary variables. Namely, we use the following relaxation of the constrained formulation in eq. 1:
$$\min_{\mathbf{W},\,\mathbf{c}} \; \sum_{t=1}^{T} \Big[\, \mathcal{L}(W_{L+1}, a_L^t;\, y_t) + \mu \sum_{l=1}^{L} \| c_l^t - W_l\, a_{l-1}^t \|_2^2 \,\Big]. \eqno(3)$$
This problem can be solved by alternating minimization (AM), or block-coordinate descent (BCD), over the weights $\mathbf{W}$ and the codes $\mathbf{c}$. Each AM iteration involves optimizing $\mathbf{W}$ for fixed $\mathbf{c}$, followed by fixing $\mathbf{W}$ and optimizing $\mathbf{c}$. The parameter $\mu$ acts as a regularization weight. As in Carreira-Perpinan & Wang (2014), we use an adaptive scheme that gradually increases $\mu$ with the number of iterations. (Note that sparsity, i.e., $\ell_1$ regularization, on both $\mathbf{W}$ and $\mathbf{c}$ could easily be added to the objective in eq. 3 without changing the computational complexity of the algorithms detailed below, since proximal methods can replace gradient methods; however, we have not yet seen much benefit from it in our experiments and, for clarity, removed it from all formulations.)
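To make the alternation concrete, the following toy sketch runs offline AM on a single-hidden-layer network, substituting a quadratic output loss for the paper's multinomial loss purely to keep the example short (the ridge term in the least-squares solves is our addition for numerical safety; all names are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def ridge_fit(T, S, lam=0.1):
    # least-squares solve of min_M ||T - M S||^2 (+ small ridge for stability)
    return T @ S.T @ np.linalg.inv(S @ S.T + lam * np.eye(S.shape[0]))

def am_offline(X, Y, h, mu=1.0, iters=20, code_lr=0.005, seed=0):
    """Toy offline AM for one hidden layer and quadratic losses:
       min_{W, V, C}  ||Y - V relu(C)||^2 + mu * ||C - W X||^2 ."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.5, size=(h, X.shape[0]))
    V = rng.normal(scale=0.5, size=(Y.shape[0], h))
    C = W @ X                        # initialise codes with a forward pass
    hist = []
    for _ in range(iters):
        # code step: a few gradient updates on the two terms C appears in
        for _ in range(5):
            A = relu(C)
            g = -(V.T @ (Y - V @ A)) * (C > 0) + mu * (C - W @ X)
            C -= code_lr * g
        # weight step: with C fixed, the problem splits into two independent
        # least-squares subproblems, one per weight matrix
        W = ridge_fit(C, X)
        V = ridge_fit(Y, relu(C))
        hist.append(float(np.sum((Y - V @ relu(C)) ** 2)
                          + mu * np.sum((C - W @ X) ** 2)))
    return W, V, hist
```

Note how, once the codes `C` are fixed, the two weight updates are completely decoupled and could run in parallel.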
Online Alternating Minimization. The offline alternating minimization outlined above is not scalable to extremely large datasets (even data-parallel methods, such as Taylor et al. (2016), are inherently limited by the number of cores available). Furthermore, offline AM is not suitable for online learning, where the input samples arrive incrementally, one at a time or in small mini-batches, from a potentially infinite data stream. This scenario is also typical in practical settings such as continual (or lifelong) learning Ring (1994); Thrun (1995, 1998), which can be viewed as online learning from nonstationary inputs, e.g. a (potentially infinite) sequence of different tasks/datasets.
To overcome those limitations, we propose a general online AM algorithmic scheme and present two specific algorithms, which differ in the optimization approach used for the weight updates; both algorithms are later evaluated and compared empirically.
Hyperparameters of the method include the $\mu$ schedule, hyperparameters controlling the number of iterations in the optimization subroutines, and several others; we will make our code available online. As input, the method takes initial (e.g., random) weights $\mathbf{W}$, an initial penalty weight $\mu$, a learning rate for the predictive layer, and a Boolean flag indicating which optimization method to use for the weight updates; if the memory-based approach (discussed below) is selected, initial memory matrices (described below) are also provided (typically, both are initialized to all zeros, unless we want to retain the memory of some prior learning experience, e.g., in a continual learning scenario). The algorithm processes samples one at a time (but can easily be generalized to mini-batches); the current sample is encoded into its representations at each layer (encodeInput procedure, Algorithm 2), and the output prediction is made based on these encodings. The prediction error is computed, and the backward code updates follow, as shown in the updateCodes procedure, where the code vector at each layer is optimized with respect to the only two parts of the global objective in which the code variables participate. Once the codes are updated, the weights can be optimized in parallel across the layers (in the updateWeights procedure, Algorithm 3), since fixing the codes breaks the weight optimization problem into layer-wise independent subproblems. We next discuss each step in detail.
Activation propagation: forward and backward passes. In the online setting, we only have access to the current sample at time $t$, and thus can only compute the corresponding codes using the weights computed so far. Namely, given input $x_t$, we compute the last-layer activations in a forward pass, propagating activations from the input to the last layer, and make a prediction about $y_t$, incurring the corresponding loss. We then propagate this error back to all activations by solving a sequence of per-layer optimization problems, updating each code vector, from the top layer down, with respect to the only two parts of the global objective in which it participates: the term coupling it to the layer above (or, at the top, the prediction loss) and the term coupling it to the layer below.
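A minimal sketch of such a backward code-update pass (NumPy, ReLU activations; the function names and the `grad_out` callback supplying the output-loss gradient at the top layer are our own illustrative choices, not the paper's interface):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def update_codes(codes, weights, x, grad_out, mu=1.0, lr=0.1, steps=5):
    """Update codes c_1..c_L from the top layer down; weights[l] maps the
    activations of layer l-1 to the codes of layer l. Each c_l takes a few
    gradient steps on the only two penalty terms it appears in."""
    L = len(codes)
    for l in reversed(range(L)):
        prev_act = relu(codes[l - 1]) if l > 0 else x
        for _ in range(steps):
            if l == L - 1:
                up = grad_out(codes[l])      # gradient of the output loss
            else:
                # coupling to the layer above: mu*||c_{l+1} - W_{l+1} relu(c_l)||^2
                r = codes[l + 1] - weights[l + 1] @ relu(codes[l])
                up = -mu * (weights[l + 1].T @ r) * (codes[l] > 0)
            # coupling to the layer below: mu*||c_l - W_l relu(c_{l-1})||^2
            down = mu * (codes[l] - weights[l] @ prev_act)
            codes[l] = codes[l] - lr * (up + down)
    return codes
```

Note that no gradient chain spans more than one layer: each code update only touches the two terms adjacent to that layer.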
Weights Update Step. Different online (stochastic) optimization methods can be applied to update the weights at each layer, using a surrogate objective function defined more generally than in Mairal et al. (2009): the objective of eq. 3, evaluated on the codes computed at previous iterations for the samples observed over a recent time window. When the window contains only the current time $t$, the surrogate is the same as the true objective on the current-time codes. The surrogate objective decomposes into independent layer-wise terms, which allows for parallel weight optimization across all layers. For the intermediate layers $l = 1, \dots, L$, each term is a quadratic subproblem of the form $\min_{W_l} \sum_i \| c_l^i - W_l\, a_{l-1}^i \|_2^2$, while the output layer minimizes the prediction loss.
In general, computing a surrogate function over a time window would require storing all samples and codes in that interval. Thus, for the output-layer update, we always use only the current sample, and optimize via stochastic gradient descent (SGD) (step 2 in memoryW, Algorithm 3). However, in the case of the quadratic loss (intermediate layers), we have more options. One is to use SGD again, or an adaptive-rate version such as Adam; this option is selected when the memory flag passed to the updateWeights function in Algorithm 3 is off. We call the resulting method AM-Adam.
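The layer-local AM-Adam weight step amounts to running a standard Adam update on the gradient of the quadratic term $\|c_l - W_l\, a_{l-1}\|^2$ alone, with no chain rule across layers. A self-contained sketch (minimal Adam implementation with the standard update rule; names are illustrative):

```python
import numpy as np

class Adam:
    """Minimal Adam optimiser (standard first/second-moment update rule)."""
    def __init__(self, shape, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
        self.lr, self.b1, self.b2, self.eps = lr, b1, b2, eps
        self.m = np.zeros(shape)
        self.v = np.zeros(shape)
        self.t = 0

    def step(self, param, grad):
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * grad
        self.v = self.b2 * self.v + (1 - self.b2) * grad ** 2
        mhat = self.m / (1 - self.b1 ** self.t)
        vhat = self.v / (1 - self.b2 ** self.t)
        return param - self.lr * mhat / (np.sqrt(vhat) + self.eps)

def local_weight_step(W, opt, c, a_prev):
    # gradient of the layer-local quadratic term ||c - W a_prev||^2
    grad = -np.outer(c - W @ a_prev, a_prev)
    return opt.step(W, grad)
```

Because the subproblem is local, each layer can maintain its own optimizer state and run these steps in parallel.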
Alternatively, we can use the memory-efficient surrogate-function computation of Mairal et al. (2009), in which the surrogate accumulates the memory of all previous samples and codes, as described below; we hypothesize that such an approach, here called AM-mem, can be useful in continual learning as a potential mechanism for alleviating the catastrophic-forgetting issue.
Co-Activation Memory. We now summarize the memory-based approach. Denoting the activation of layer $l-1$ as $a_{l-1}$ and the code of layer $l$ as $c_l$, and following Mairal et al. (2009), we can rewrite the above objective in Eq. 7 as the quadratic form
$$\hat{f}(W_l) = \mathrm{tr}(W_l^\top W_l A_l) - 2\,\mathrm{tr}(W_l^\top B_l) + \text{const}, \eqno(8)$$
where $A_l$ and $B_l$ are the “memory” matrices (i.e., co-activation memories), compactly representing the accumulated strength of co-activations within each layer (matrices $A_l$, i.e., covariances) and across consecutive layers (matrices $B_l$, or cross-covariances). At each iteration $t$, once the new input sample is encoded, the matrices are updated (updateMemory function, Algorithm 3) as $A_l \leftarrow A_l + a_{l-1}\, a_{l-1}^\top$ and $B_l \leftarrow B_l + c_l\, a_{l-1}^\top$.
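In code, the memory updates and the minimizer of the accumulated surrogate look as follows (NumPy sketch; we solve the surrogate in closed form here, whereas the paper uses block-coordinate descent, and the small ridge term `lam` is our addition for numerical safety):

```python
import numpy as np

def update_memory(A, B, a_prev, c):
    """Accumulate co-activation statistics for one layer:
       A <- A + a_prev a_prev^T   (within-layer covariance)
       B <- B + c a_prev^T        (cross-layer covariance)."""
    return A + np.outer(a_prev, a_prev), B + np.outer(c, a_prev)

def solve_surrogate(A, B, lam=1e-6):
    # Minimiser of sum_i ||c_i - W a_i||^2 = tr(W A W^T) - 2 tr(B W^T) + const,
    # i.e. W = B A^{-1} (ridge-regularised inverse for stability).
    return B @ np.linalg.inv(A + lam * np.eye(A.shape[0]))
```

The memory footprint is fixed: two matrices per layer, regardless of how many samples have been accumulated.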
It is important to note that, using the memory matrices, we effectively optimize the weights at iteration $t$ with respect to all previous samples and their previous linear activations at all layers, without explicitly storing these examples. Clearly, AM-Adam is even more memory-efficient, since it does not require any memory matrices. Finally, to optimize the quadratic surrogate in Eq. 8, we follow Mairal et al. (2009) and use block-coordinate descent, iterating over the columns of the corresponding weight matrices; however, rather than always iterating until convergence, we treat the number of such iterations as an additional hyperparameter.
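The column-wise block-coordinate descent on the quadratic surrogate can be sketched as follows (NumPy; following the dictionary-update style of Mairal et al. (2009), with the number of sweeps `n_iter` exposed as the hyperparameter mentioned above):

```python
import numpy as np

def bcd_columns(W, A, B, n_iter=1, eps=1e-12):
    """Block-coordinate descent over the columns of W for the surrogate
    tr(W A W^T) - 2 tr(B W^T); each column update is an exact minimisation
    with the other columns held fixed."""
    W = W.copy()
    for _ in range(n_iter):
        for j in range(W.shape[1]):
            ajj = A[j, j]
            if ajj > eps:                       # skip unused coordinates
                W[:, j] += (B[:, j] - W @ A[:, j]) / ajj
    return W
```

With a well-conditioned memory matrix `A`, a handful of sweeps typically suffices, which is why the sweep count can be treated as a cheap hyperparameter.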
3 Theoretical analysis
We next provide a theoretical convergence analysis for general alternating minimization (AM) schemes. Under certain assumptions that we discuss below, the proposed AM algorithms fall into the category of approaches covered by these guarantees, although the theory itself is more general. To the best of our knowledge, we provide the first theoretical convergence guarantees for AM in the stochastic setting.
Setting. Let $f(\theta_1, \dots, \theta_b)$ denote the function to be optimized using AM, where in the $j$-th step of each AM round the algorithm optimizes $f$ with respect to $\theta_j$ while keeping the other arguments fixed; $b$ denotes the total number of arguments (blocks). For the theoretical analysis, we consider a smooth approximation to $f$, as done in the literature Schmidt et al. (2007); Lange et al. (2014).
Let $\theta^* = (\theta_1^*, \dots, \theta_b^*)$ denote the global optimum of $f$ computed on the entire data population. For the sake of the theoretical analysis, we assume that the algorithm knows a lower bound $r$ on the radii of convergence around $\theta_1^*, \dots, \theta_b^*$. (This assumption is potentially easy to eliminate with a more careful choice of the step size in the first iterations.) Let $\nabla_{\theta_j} f$ denote the gradient of $f$ with respect to $\theta_j$ computed for a single data sample or code; we refer to the gradient computed for the entire data population, i.e., an infinite number of samples, as the “oracle gradient”. We assume that in the $j$-th step, the AM algorithm performs the projected stochastic gradient update
$$\theta_j \leftarrow \Pi_{\mathcal{B}_2(r,\, \theta_j^0)}\big(\theta_j - \eta\, \nabla_{\theta_j} f\big),$$
where $\Pi_{\mathcal{B}_2(r,\, \theta_j^0)}$ denotes the projection onto the Euclidean ball of radius $r$ centered at the initial iterate $\theta_j^0$, and $\eta$ is the step size. Thus, given any initial iterate within distance $r$ of $\theta_j^*$, we are guaranteed that all iterates remain within a bounded ball around $\theta_j^*$; this holds for every block $j$.
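The projection step used in the update above is the standard Euclidean-ball projection (a short sketch; variable names are ours):

```python
import numpy as np

def project_ball(theta, center, radius):
    """Euclidean projection of theta onto the ball B(center, radius):
    points inside are unchanged, points outside are rescaled onto the
    boundary along the ray from the center."""
    d = theta - center
    norm = np.linalg.norm(d)
    if norm <= radius:
        return theta
    return center + radius * d / norm
```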
This scheme is more difficult to analyze theoretically and leads to worst-case guarantees with respect to the original setting of Algorithm 1; i.e., we expect the convergence rate of the original setting to be no worse than the one dictated by the obtained guarantees. This is because we allow only a single stochastic update (i.e., computed on a single data point) with respect to the appropriate argument (keeping the other arguments fixed) in each step of AM, whereas in Algorithm 1 and related schemes in the literature, one may increase the size of the data mini-batch in each AM step (a semi-stochastic setting); the convergence rate in the latter case is typically more advantageous Nesterov (2014). Finally, note that the analysis does not consider running the optimizer more than once before changing the argument of an update, e.g., when obtaining a sparse code for a given data point and fixed dictionary. We expect this to have a minor influence on the convergence rate, since our analysis specifically considers a local convergence regime, where we expect that running the optimizer once produces good enough parameter approximations. Moreover, by preventing each AM step from being run multiple times, we analyze a noisier version of the parameter updates.
Statistical guarantees for AM algorithms. The theoretical analysis we provide here is an extension to the AM setting of recent work on statistical guarantees for the EM algorithm Balakrishnan et al. (2017).
We first discuss the necessary assumptions. Let $\Theta_1, \dots, \Theta_b$ denote non-empty compact convex sets such that $\theta_j \in \Theta_j$ for every $j$. The following three assumptions are made on these sets and on the objective function $f$.
Assumption 3.1 (Strong concavity).
The function $f$ is strongly concave for all pairs $(\theta, \theta')$ in the neighborhood of $\theta^*$; that is,
$$f(\theta') - f(\theta) - \langle \nabla f(\theta),\, \theta' - \theta \rangle \le -\tfrac{\lambda}{2}\, \|\theta' - \theta\|_2^2,$$
where $\lambda > 0$ is the strong concavity modulus.
Assumption 3.2 (Smoothness).
The function $f$ is $\beta$-smooth for all pairs $(\theta, \theta')$; that is,
$$f(\theta') - f(\theta) - \langle \nabla f(\theta),\, \theta' - \theta \rangle \ge -\tfrac{\beta}{2}\, \|\theta' - \theta\|_2^2,$$
where $\beta$ is the smoothness constant.
Next, we introduce the gradient stability (GS) condition, which must hold for every block $j$ from $1$ to $b$.

Assumption 3.3 (Gradient stability (GS)).
We assume $f$ satisfies the GS($\gamma$) condition with $\gamma \ge 0$ over Euclidean balls centered at the optimum: the gradient with respect to $\theta_j$, evaluated with the remaining blocks perturbed away from their optimal values, differs from the gradient at the optimum by at most $\gamma$ times the total perturbation.
[Figure: MNIST (fully-connected nets, 2 layers): online methods, first epoch; 50 mini-batches, 200 samples each.]
Next, we introduce the population-gradient AM operator, which applies to each block the update $\theta_j \leftarrow \theta_j - \eta\, \nabla_{\theta_j} F(\theta)$, where $F$ is the population (oracle) objective and $\eta$ is the step size. We also define a bound $\sigma^2$ on the expected squared norm of the stochastic gradient of our objective function (a common assumption made in stochastic gradient descent convergence theorems as well).
The following theorem gives a recursion on the expected error obtained at each iteration of Algorithm 1.

Theorem 3.1. Given the stochastic AM gradient iterates of Algorithm 1 with decaying step size, the expected error at iteration $t$ satisfies a recursion in terms of the error at iteration $t-1$, the step size, and the bound $\sigma^2$.

The recursion in Theorem 3.1 is expanded in the Supplementary Material to prove the final convergence theorem for Algorithm 1, which states the following:

Theorem 3.2. Given the stochastic AM gradient iterates of Algorithm 1 with decaying step size, and assuming the gradient-stability constant $\gamma$ is sufficiently small relative to the strong-concavity modulus $\lambda$, the expected error at iteration $t$ decays at a sublinear rate.
4 Experiments

We compared, on several datasets (MNIST, CIFAR10, HIGGS), our online alternating minimization algorithms, AM-mem and AM-Adam (using mini-batches instead of single samples at each time point), against the backprop-based online methods SGD and Adam Kingma & Ba (2014), against the offline auxiliary-variable ADMM method of Taylor et al. (2016), using the code provided by the authors (we chose Taylor's ADMM among the several auxiliary-variable methods proposed recently since it was the only one capable of handling very large datasets, due to massive data parallelization; also, some other methods were not designed for the classification task, e.g., Carreira-Perpinan & Wang (2014) trained autoencoders and Zhang et al. (2016) learned hashing), and against two offline versions of our methods, AM-Adam-off and AM-mem-off, which simply treated the training dataset as a single minibatch, i.e., one AM iteration was equivalent to one epoch over the training dataset. All our algorithms were implemented in PyTorch Paszke et al. (2017); we also used the PyTorch implementations of SGD and Adam. Hyperparameters for each method were optimized by grid search on a validation subset of the training data. Most results were averaged over at least 5 different weight initializations.
Note that most of the prior auxiliary-variable methods were evaluated only on fully-connected networks (Carreira-Perpinan & Wang, 2014; Taylor et al., 2016; Zhang et al., 2016; Zhang & Brand, 2017; Zeng et al., 2018; Askari et al., 2018), while we also experimented with RNNs and CNNs, as well as with discrete (nondifferentiable) networks.
Fully-connected nets: MNIST, CIFAR10, HIGGS. We experimented with fully-connected networks on the standard MNIST LeCun (1998) dataset, consisting of gray-scale images of hand-drawn digits, with a training set of 50K samples and a test set of 10K samples. We evaluated two different 2-hidden-layer architectures, with equal hidden layer sizes of 100 and 500, and ReLU activations. Figure 2 zooms in on the performance of the online methods, AM-Adam, AM-mem, SGD and Adam, over 50 minibatches of size 200 each. We observe that, on both architectures, our AM-Adam clearly dominates both SGD and Adam, while AM-mem is comparable to them on the larger architecture and falls between SGD and Adam on the smaller one. Next, Figure 2 continues to 50 epochs, now including the offline methods (which, by definition, require at least one epoch over the full dataset per iteration). Our AM-Adam matches SGD and Adam, reaching 0.98 accuracy. Our second method, AM-mem, only yields 0.91 and 0.96 on the 100-node and 500-node networks, respectively. All offline methods are significantly outperformed by the online ones; e.g., Taylor's ADMM learns very slowly until about 10 epochs, being greatly outperformed even by our offline versions; it later catches up with offline AM-mem on the 100-node network, but remains inferior to all other methods on the 500-node architecture.
Figures 3 and 4 show similar results for the same experimental setting on the CIFAR10 dataset (5000 training and 10000 test samples). Again, our AM-Adam outperforms both SGD and Adam on the first 50 minibatches (same size of 200 as before), and even over 50 epochs for the 1-100 architecture, reaching 0.53 accuracy vs. 0.49 for SGD and Adam, but falls slightly behind on the larger 1-500 architecture, with 0.51 vs. 0.53 and 0.56, respectively. Our second algorithm, AM-mem, is clearly dominated by all three methods above. We also ran the two offline AM versions, which were again greatly outperformed by the online methods. In the remaining experiments, we focus on our best-performing method, online AM-Adam.
HIGGS, fully-connected, 1-300 ReLU network. In Figure 7, we compare our online AM-Adam approach against SGD and Adam and, again, against the offline ADMM method of Taylor et al. (2016), on the very large HIGGS dataset, containing 10,500,000 training samples (28 features each) and 500,000 test samples. Each datapoint is labeled as either a signal process producing a Higgs boson or a background process which does not. We use the same architecture (a single-hidden-layer network with ReLU activations and 300 hidden nodes) as in Taylor et al. (2016), and the same training/test datasets. For all online methods, we use minibatches of size 200, so one epoch over the 10.5M dataset corresponds to 52,500 iterations.
While Taylor's method was reported to achieve 0.64 accuracy on the whole dataset (using data parallelization over 7200 cores to handle the whole dataset as a batch) Taylor et al. (2016), the online methods achieve the same accuracy much faster: in less than 1,000 iterations (200K samples) for our AM-Adam, and in less than 2,000 iterations for SGD and Adam. Within only 20,000 iterations (less than half of the training samples), AM-Adam, SGD and Adam reach 0.70, 0.69 and 0.71, respectively, and continue to improve slowly, reaching 0.71, 0.71 and 0.72, respectively, after one epoch. (Our AM-mem version quickly reached 0.6 together with AM-Adam, but then slowed down, reaching only 0.61 by the end of the 1st epoch.)
In summary, our online AM-Adam on HIGGS greatly outperforms Taylor's offline ADMM (0.70 vs. 0.64) in less than half of the 1st epoch, quickly reaching the 0.64 benchmark on a tiny fraction (less than 0.01%) of the 10.5M dataset; moreover, our algorithm initially learns faster than SGD and on par with, or slightly faster than, Adam (inset in Figure 7).
RNN on MNIST. Next, we evaluate our method on Sequential MNIST Le et al. (2015), where each image is vectorized and fed to the RNN as a sequence of pixels. We use the standard Elman RNN architecture, with nonlinear activations among the hidden states and ReLU applied to the output sequence before making a prediction (we use larger minibatches of 1024 samples to reduce training time). AM-Adam was adapted to work on this RNN architecture (see Appendix for details). Figure 7 shows the results, averaged over several weight initializations, for 10 epochs, with a zoom-in on the first epoch shown in the inset (see Appendix for the remaining configurations). Again, our AM-Adam learns faster than Adam for about the first half of the 1st epoch, and much faster than SGD up to epoch 6, matching the latter afterwards.
CNN (LeNet-5), MNIST. Next, we experimented with CNNs, using LeNet-5 LeCun et al. (1998) on MNIST (Figure 7). Similarly to the RNN result, our AM-Adam greatly outperforms SGD, while falling slightly behind Adam but catching up with it at 10 epochs, where our method and Adam reach 0.987 and 0.989 accuracy, respectively.
Binary nets (non-differentiable activations), MNIST. Finally, to investigate the ability of our method to handle non-differentiable networks, we considered an architecture originally investigated in Lee et al. (2015) to evaluate another type of auxiliary-variable approach, called Difference Target Propagation (DTP). The model is a 2-hidden-layer fully-connected network (784-500-500-10) whose first hidden layer is endowed with a non-differentiable step transfer function (while the second hidden layer uses a differentiable nonlinearity).
The target propagation family of approaches was motivated by the goal of finding more biologically plausible mechanisms for credit assignment in the brain's neural networks, as compared to standard backprop, which, among multiple other biologically implausible aspects, does not model neuronal activation propagation explicitly and does not handle non-differentiable binary activations (spikes) Lee et al. (2015); Bartunov et al. (2018). In Lee et al. (2015), DTP was applied to the above discrete network and compared to the backprop-based straight-through estimator (STE), which simply ignores the derivative of the step function (which is 0 or infinite) in the back-propagation phase. While it took about 200 epochs for DTP to reach 0.2 error, matching the STE performance (Figure 3 in Lee et al. (2015)), our AM-Adam with binary activations reached the same error within fewer than 20 epochs (Figure 8).
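For reference, the STE baseline can be sketched as follows (NumPy; gating the backward pass to $|c| \le 1$ is one common STE variant, an assumption on our part rather than a detail from Lee et al. (2015)):

```python
import numpy as np

def step_forward(c):
    # non-differentiable binary activation (Heaviside step)
    return (c > 0).astype(float)

def ste_backward(grad_a, c):
    """Straight-through estimator: pass the upstream gradient through the
    step function as if it were the identity, zeroed where |c| > 1."""
    return grad_a * (np.abs(c) <= 1.0)
```

The forward pass uses the true binary activation, while the backward pass substitutes a surrogate derivative of 1 in place of the step function's true derivative (0 or infinite).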
We proposed a novel online alternating-minimization approach for neural network training; it builds upon previously proposed offline methods that break the nested objective into easier-to-solve local subproblems by inserting auxiliary variables corresponding to the activations in each layer. Such methods avoid gradient chain computations and thus vanishing gradients, allow weight-update parallelization, handle non-differentiable nonlinearities, and are a step closer than backprop to a biologically plausible learning mechanism.
However, unlike prior art, our approach is online (mini-batch) and can thus handle arbitrarily large datasets and continual learning settings. We proposed two variants, AM-mem and AM-Adam, and found that AM-Adam works better. AM-Adam greatly outperforms the offline methods on several datasets and architectures; compared to state-of-the-art backprop-based methods, AM-Adam learns faster initially, consistently outperforming SGD and Adam, and then matches, or nearly matches (and sometimes improves upon), their performance over multiple epochs. It also converges faster than another related method, difference target propagation, on a discrete (non-differentiable) network. Finally, to the best of our knowledge, we are the first to provide theoretical guarantees for a wide class of online alternating minimization approaches, including ours.
- Askari et al. (2018) Askari, A., Negiar, G., Sambharya, R., and El Ghaoui, L. Lifted neural networks. arXiv:1805.01532 [cs.LG], 2018.
- Balakrishnan et al. (2017) Balakrishnan, S., Wainwright, M. J., and Yu, B. Statistical guarantees for the EM algorithm: From population to sample-based analysis. Ann. Statist., 45(1):77–120, 02 2017. doi: 10.1214/16-AOS1435. URL https://doi.org/10.1214/16-AOS1435.
- Bartunov et al. (2018) Bartunov, S., Santoro, A., Richards, B., Marris, L., Hinton, G. E., and Lillicrap, T. Assessing the scalability of biologically-motivated deep learning algorithms and architectures. In Advances in Neural Information Processing Systems, pp. 9390–9400, 2018.
- Bengio et al. (1994) Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166, 1994.
- Carreira-Perpinan & Wang (2014) Carreira-Perpinan, M. and Wang, W. Distributed optimization of deeply nested systems. In Artificial Intelligence and Statistics, pp. 10–19, 2014.
- Duchi et al. (2011) Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
- Goodfellow et al. (2016) Goodfellow, I., Bengio, Y., Courville, A., and Bengio, Y. Deep learning, volume 1. MIT press Cambridge, 2016.
- Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Lange et al. (2014) Lange, M., Zühlke, D., Holz, O., and Villmann, T. Applications of lp-norms and their smooth approximations for gradient based learning vector quantization. In ESANN, 2014.
- Lau et al. (2018) Lau, T. T.-K., Zeng, J., Wu, B., and Yao, Y. A proximal block coordinate descent algorithm for deep neural network training. arXiv preprint arXiv:1803.09082, 2018.
- Le et al. (2011) Le, Q. V., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., and Ng, A. Y. On optimization methods for deep learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pp. 265–272. Omnipress, 2011.
- Le et al. (2015) Le, Q. V., Jaitly, N., and Hinton, G. E. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.
- LeCun (1986) LeCun, Y. Learning process in an asymmetric threshold network. In Disordered systems and biological organization, pp. 233–240. Springer, 1986.
- LeCun (1998) LeCun, Y. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
- LeCun et al. (1988) LeCun, Y., Touresky, D., Hinton, G., and Sejnowski, T. A theoretical framework for back-propagation. In Proceedings of the 1988 connectionist models summer school, pp. 21–28. CMU, Pittsburgh, Pa: Morgan Kaufmann, 1988.
- LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Lee et al. (2015) Lee, D.-H., Zhang, S., Fischer, A., and Bengio, Y. Difference target propagation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 498–515. Springer, 2015.
- Mairal et al. (2009) Mairal, J., Bach, F., Ponce, J., and Sapiro, G. Online dictionary learning for sparse coding. In Proceedings of the 26th annual international conference on machine learning, 2009.
- Nair & Hinton (2010) Nair, V. and Hinton, G. E. Rectified linear units improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814, 2010.
- Nesterov (2014) Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course. Springer Publishing Company, Incorporated, 1 edition, 2014. ISBN 1461346916, 9781461346913.
- Pascanu et al. (2013) Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pp. 1310–1318, 2013.
- Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. 2017.
- Riedmiller & Braun (1993) Riedmiller, M. and Braun, H. A direct adaptive method for faster backpropagation learning: The rprop algorithm. In Neural Networks, 1993., IEEE International Conference on, pp. 586–591. IEEE, 1993.
- Ring (1994) Ring, M. B. Continual learning in reinforcement environments. PhD thesis, University of Texas at Austin, 1994.
- Robbins & Monro (1985) Robbins, H. and Monro, S. A stochastic approximation method. In Herbert Robbins Selected Papers, pp. 102–109. Springer, 1985.
- Rumelhart et al. (1986) Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
- Schmidt et al. (2007) Schmidt, M., Fung, G., and Rosales, R. Fast optimization methods for l1 regularization: A comparative study and two new approaches. In Kok, J. N., Koronacki, J., Mantaras, R. L. d., Matwin, S., Mladenič, D., and Skowron, A. (eds.), ECML, 2007.
- Taylor et al. (2016) Taylor, G., Burmeister, R., Xu, Z., Singh, B., Patel, A., and Goldstein, T. Training neural networks without gradients: A scalable admm approach. In International conference on machine learning, pp. 2722–2731, 2016.
- Thrun (1995) Thrun, S. A lifelong learning perspective for mobile robot control. In Intelligent Robots and Systems, pp. 201–214. Elsevier, 1995.
- Thrun (1998) Thrun, S. Lifelong learning algorithms. In Learning to learn, pp. 181–209. Springer, 1998.
- Tieleman & Hinton (2012) Tieleman, T. and Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.
- Xiao et al. (2017) Xiao, H., Rasul, K., and Vollgraf, R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
- Yann (1987) Yann, L. Modèles connexionnistes de l’apprentissage (Connectionist models of learning). PhD thesis, Université Paris 6, 1987.
- Zeiler (2012) Zeiler, M. D. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Zeng et al. (2018) Zeng, J., Lau, T. T.-K., Lin, S., and Yao, Y. Global convergence in deep learning with variable splitting via the kurdyka-lojasiewicz property. arXiv preprint arXiv:1803.00225, 2018.
- Zhang & Kleijn (2017) Zhang, G. and Kleijn, W. B. Training deep neural networks via optimization over graphs. arXiv:1702.03380 [cs.LG], 2017.
- Zhang & Brand (2017) Zhang, Z. and Brand, M. Convergent block coordinate descent for training Tikhonov regularized deep neural networks. In Advances in Neural Information Processing Systems, pp. 1719–1728, 2017.
- Zhang et al. (2016) Zhang, Z., Chen, Y., and Saligrama, V. Efficient training of very deep neural networks for supervised hashing. In Advances in Neural Information Processing Systems, pp. 1487–1495, 2016.
Appendix A Proofs
The next theorem also holds for any step size $\alpha$ from $0$ to $2/(\lambda+\mu)$. Let $\lambda$ denote the strong-concavity constant and $\mu$ the smoothness constant below.

Theorem A.1. For some radius $r>0$ and a triplet $(\gamma,\lambda,\mu)$ such that $0\le\gamma<\lambda\le\mu$, suppose that the function $q(\cdot\,|\,\theta^*)$ is $\lambda$-strongly concave (Assumption 3.1) and $\mu$-smooth (Assumption 3.2), and that the GS($\gamma$) condition of Assumption 3.3 holds. Then the population gradient AM operator $G(\theta)=\theta+\alpha\nabla q(\theta\,|\,\theta^*)$ with step $\alpha$ such that $0<\alpha\le 2/(\lambda+\mu)$ is contractive over a ball $B(r,\theta^*)$, i.e.

$\|G(\theta)-\theta^*\|_2 \;\le\; \xi\,\|\theta-\theta^*\|_2,$

where $\xi = \sqrt{1-\tfrac{2\alpha\lambda\mu}{\lambda+\mu}}+\alpha\gamma$, and $\xi<1$ whenever $\gamma<\tfrac{1}{\alpha}\big(1-\sqrt{1-\tfrac{2\alpha\lambda\mu}{\lambda+\mu}}\big)$.
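As a sanity check, the contraction claim can be probed numerically on the simplest instance where the constants are explicit: a concave quadratic whose Hessian eigenvalues lie in $[\lambda,\mu]$, with $\gamma=0$. The sketch below is illustrative only; the names `lam`, `mu`, `alpha`, and `theta_star` mirror the theorem's symbols and are not from the paper's code.

```python
import numpy as np

# Hypothetical sanity check of the theorem's contraction rate on a concave
# quadratic q(theta) = -0.5 (theta - theta*)' H (theta - theta*):
# H's eigenvalues lie in [lam, mu], so q is lam-strongly concave and
# mu-smooth, and gamma = 0 (no gradient-stability error term).
lam, mu = 1.0, 4.0
H = np.diag([lam, 2.0, mu])               # eigenvalues inside [lam, mu]
alpha = 2.0 / (lam + mu)                  # largest step allowed by the theorem
theta_star = np.zeros(3)                  # the maximizer of q
xi = (1.0 - 2.0 * alpha * lam * mu / (lam + mu)) ** 0.5  # contraction factor

rng = np.random.default_rng(0)
theta = rng.standard_normal(3)
for _ in range(20):
    grad = -H @ (theta - theta_star)      # gradient of q at theta
    theta_next = theta + alpha * grad     # gradient AM operator G(theta)
    # each application of G must shrink the distance to theta* by at least xi
    assert np.linalg.norm(theta_next - theta_star) <= \
        xi * np.linalg.norm(theta - theta_star) + 1e-12
    theta = theta_next
print("distance to theta* after 20 steps:", np.linalg.norm(theta - theta_star))
```

With $\alpha=2/(\lambda+\mu)$ and $\gamma=0$ the factor evaluates to $\xi=(\mu-\lambda)/(\mu+\lambda)$, the classical rate for gradient steps on smooth strongly concave objectives.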
A.1 Proof of Theorem A.1
A.2 Proof of Theorem 3.1
Let $\tilde{\theta}^{t+1}=\theta^t+\alpha\,\hat{G}(\theta^t)$, where $\hat{G}(\theta^t)$ is the gradient computed with respect to a single data sample and $\tilde{\theta}^{t+1}$ is the update vector prior to the projection onto the ball $B(r,\theta^*)$, so that $\theta^{t+1}=\Pi_{B(r,\theta^*)}(\tilde{\theta}^{t+1})$. Let $\Delta^t=\theta^t-\theta^*$ and $\tilde{\Delta}^{t+1}=\tilde{\theta}^{t+1}-\theta^*$. Thus, since the projection onto a convex set is non-expansive,

$\|\theta^{t+1}-\theta^*\|_2 \;\le\; \|\tilde{\theta}^{t+1}-\theta^*\|_2.$

Let $\varepsilon^t=\hat{G}(\theta^t)-G(\theta^t)$ denote the single-sample gradient noise. Then we have that $\tilde{\theta}^{t+1}=\theta^t+\alpha\,G(\theta^t)+\alpha\,\varepsilon^t$. We combine it with Equation A.2 and obtain:
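The argument above leans on the Euclidean projection onto the ball $B(r,\theta^*)$ being non-expansive, which is what lets the projected stochastic update inherit the contraction of the unprojected one. A minimal sketch, assuming the standard closed form for ball projection; the helper name `project_ball` is hypothetical, not from the paper.

```python
import numpy as np

# Illustrative sketch of the projection step Pi_{B(r, theta*)} used in the
# update: Euclidean projection onto the ball of radius r centered at `center`.
def project_ball(v, center, r):
    d = v - center
    n = np.linalg.norm(d)
    # points outside the ball are rescaled to its boundary; points inside stay
    return center + d * (r / n) if n > r else v

rng = np.random.default_rng(1)
center, r = np.zeros(4), 2.0
u = 3.0 * rng.standard_normal(4)
v = 3.0 * rng.standard_normal(4)
pu, pv = project_ball(u, center, r), project_ball(v, center, r)

# projection onto a convex set is non-expansive: distances cannot grow,
# so projecting the stochastic update back onto B(r, theta*) cannot
# increase the error norm
assert np.linalg.norm(pu - pv) <= np.linalg.norm(u - v) + 1e-12
print("projections inside ball:", np.linalg.norm(pu - center) <= r + 1e-12)
```

The same non-expansiveness holds for projection onto any closed convex set, so the choice of a Euclidean ball here is only for concreteness.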