1 Introduction
The continued increase in recent years in data availability and processing power has enabled the development and practical applicability of ever more powerful models in statistical machine learning, for example to recognize faces or speech, or to translate natural language
(Bishop, 2006). However, physical limitations in serial computation suggest that scalable processing will require algorithms that can be massively parallelized, so they can profit from the thousands of inexpensive processors available in cloud computing. We focus on hierarchical processing architectures such as deep neural nets (fig. 1), which were originally inspired by biological systems such as the visual and auditory cortex in the mammalian brain (Riesenhuber and Poggio, 1999; Serre et al., 2007; Gold and Morgan, 2000), and which have been proven very successful at learning sophisticated tasks, such as recognizing faces or speech, when trained on data. A typical neural net defines a hierarchical, feedforward, parametric mapping from inputs to outputs. The parameters (weights) are learned given a dataset by numerically minimizing an objective function. The outputs of the hidden units at each layer are obtained by transforming the previous layer’s outputs by a linear operation with the layer’s weights followed by a nonlinear elementwise mapping (e.g. sigmoid). Deep, nonlinear neural nets are universal approximators, that is, they can approximate any target mapping (from a wide class) to arbitrary accuracy given enough units (Bishop, 2006), and can have more representation power than shallow nets (Bengio and LeCun, 2007). The hidden units may encode hierarchical, distributed features that are useful to deal with complex sensory data. For example, when trained on images, deep nets can learn lowlevel features such as edges and Tjunctions and highlevel features such as parts decompositions. Other examples of hierarchical processing systems are cascades for object recognition and scene understanding in computer vision
(Serre et al., 2007; Ranzato et al., 2007a) or for phoneme classification in speech processing (Gold and Morgan, 2000; Saon and Chien, 2012), wrapper approaches to classification or regression (e.g. based on dimensionality reduction; Wang and CarreiraPerpiñán, 2012), or kinematic chains in robotics (Craig, 2004). These and other architectures share a fundamental design principle: mathematically, they construct a deeply nested mapping from inputs to outputs.The ideal performance of a nested system arises when all the parameters at all layers are jointly trained to minimize an objective function for the desired task, such as classification error (indeed, there is evidence that plasticity and learning probably occurs at all stages of the ventral stream of primate visual cortex;
Riesenhuber and Poggio, 1999; Serre et al., 2007). However, this is challenging because nesting (i.e., function composition) produces inherently nonconvex functions. Joint training is usually done with the backpropagation algorithm
(Rumelhart et al., 1986a; Werbos, 1974), which recursively computes the gradient with respect to each parameter using the chain rule. One can then simply update the parameters with a small step in the negative gradient direction as in gradient descent and stochastic gradient descent (SGD), or feed the gradient to a nonlinear optimization method that will compute a better search direction, possibly using secondorder information
(Orr and Müller, 1998). This process is repeated until a convergence criterion is satisfied. Backprop in any of these variants suffers from the problem of vanishing gradients (Rögnvaldsson, 1994; Erhan et al., 2009), where the gradients for lower layers are much smaller than those for higher layers, which leads to tiny steps, slowly zigzagging down a curved valley, and a very slow convergence. This problem worsens with the depth of the net and led researchers in the 1990s to give up in practice with nets beyond around two hidden layers (with special architectures such as convolutional nets (LeCun et al., 1998) being an exception) until recently, when improved initialization strategies (Hinton and Salakhutdinov, 2006; Bengio et al., 2007) and much faster computers—but not really any improvement in the optimization algorithms themselves—have renewed interest in deep architectures. Besides, backprop does not parallelize over layers (and, with nonconvex problems, is hard to parallelize over minibatches if using SGD), is only applicable if the mappings are differentiable with respect to the parameters, and needs careful tuning of learning rates. In summary, after decades of research in neural net optimization, simple backpropbased algorithms such as stochastic gradient descent remain the stateoftheart, particularly when combined with good initialization strategies (Orr and Müller, 1998; Hinton and Salakhutdinov, 2006). In addition, selecting the best architecture, for example the number of units in each layer of a deep net, or the number of filterbanks in a speech frontend processing, requires a combinatorial search. In practice, this is approximated with a manual trialanderror procedure that is very costly in effort and expertise required, and leads to suboptimal solutions (where often the parameters of each layer are set irrespective of the rest of the cascade).We describe a general optimization strategy for deeply nested functions that we call method of auxiliary coordinates (MAC)
, which partly alleviates the vanishing gradients problem, has embarrassing parallelization, and can reuse existing algorithms (possibly not gradientbased) that optimize single layers or individual units. Section
2 describes MAC, section 3 describes related work, section 4 gives experimental results that illustrate the different advantages of MAC, and the appendix gives formal theorem statements and proofs.2 The method of auxiliary coordinates (MAC)
2.1 The nested objective function
For definiteness, we describe the approach for a deep net such as that of fig. 1. Later sections will show other settings. Consider a regression problem of mapping inputs to outputs (both highdimensional) with a deep net given a dataset of pairs . A typical objective function to learn a deep net with hidden layers has the form (to simplify notation, we ignore bias parameters):
(1) 
where each layer function has the form , i.e., a linear mapping followed by a squashing nonlinearity ( applies a scalar function, such as the sigmoid
, elementwise to a vector argument, with output in
). Our method applies to loss functions other than squared error (e.g. crossentropy for classification), with fully or sparsely connected layers each with a different number of hidden units, with weights shared across layers, and with regularization terms on the weights
. The basic issue is the deep nesting of the mapping . The traditional way to minimize (1) is by computing the gradient over all weights of the net using backpropagation and feeding it to a nonlinear optimization method.2.2 The method of auxiliary coordinates (MAC)
We introduce one auxiliary variable per data point and per hidden unit and define the following equalityconstrained optimization problem:
(2) 
Each can be seen as the coordinates of on an intermediate feature space, or as the hidden unit activations for . Intuitively, by eliminating we see this is equivalent to the nested problem (1); we can prove under very general assumptions that both problems have exactly the same minimizers (see appendix A.2). Problem (2) seems more complicated (more variables and constraints), but each of its terms (objective and constraints) involve only a small subset of parameters and no nested functions. Below we show this reduces the illconditioning caused by the nesting, and partially decouples many variables, affording an efficient and distributed optimization.
2.3 MAC with quadraticpenalty (QP) optimization
The problem (2) may be solved with a number of constrained optimization approaches. To illustrate the advantages of MAC in the simplest way, we use the quadraticpenalty (QP) method (Nocedal and Wright, 2006). We optimize the following function over for fixed and drive :
(3) 
This defines a continuous path which, under some mild assumptions (see proof in appendix B), converges to a minimum of the constrained problem (2), and thus to a minimum of the original problem (1). In practice, we follow this path loosely.
The QP objective function can be seen as breaking the functional dependences in the nested mapping and unfolding it over layers. Every squared term involves only a shallow mapping; all variables are equally scaled, which improves the conditioning of the problem; and the derivatives required are simpler: we require no backpropagated gradients over , and sometimes no gradients over at all.
We now apply alternating optimization of the QP objective over and :
 step

Minimizing over for fixed results in a separate minimization over the weights of each hidden unit—each a singlelayer, singleunit problem that can be solved with existing algorithms. Specifically, for the unit , for (where we define ) and (assuming there are units in layer ), we have a nonlinear, leastsquares regression of the form , where is the weight vector (th row of ) that feeds into the th output unit of layer , and the corresponding scalar target for point .
 step

Minimizing over for fixed separates over the coordinates for each data point (omitting the subindex and weights):
(4) and can be solved using the derivatives w.r.t. of the singlelayer functions .
Thus, the step results in many independent, singlelayer singleunit problems that can be solved with existing algorithms, without extra programming cost. The step is new, however it always has the same form (4) of a “generalized” proximal operator (Rockafellar, 1976; Combettes and Pesquet, 2011). MAC reduces a complex, highlycoupled problem—training a deep net—to a sequence of simple, uncoupled problems (the step) which are coordinated through the auxiliary variables (the step). For a large net with a large dataset, this affords an enormous potential for parallel, distributed computation. And, because each  or step operates over very large, decoupled blocks of variables, the decrease in the QP objective function is large in each iteration, unlike the tiny decreases achieved in the nested function. These large steps are effectively shortcuts through space, instead of tiny steps along a curved valley in space.
Rather than an algorithm, the method of auxiliary coordinates is a mathematical device to design optimization algorithms suited for any specific nested architecture, that are provably convergent, highly parallelizable and reuse existing algorithms for nonnested (or shallow) architectures. The key idea is the judicious elimination of subexpressions in a nested function via equality constraints. The architecture need not be strictly feedforward (e.g. recurrent nets). The designer need not introduce auxiliary coordinates at every layer: there is a spectrum between no auxiliary coordinates (full nesting), through hybrids that use some auxiliary coordinates and some semideep nets, to every single hidden unit having an auxiliary coordinate. An auxiliary coordinate may replace any subexpression of the nested function (e.g. the input to a hidden unit, rather than its output, leading to a quadratic step). Other methods for constrained optimization may be used (e.g. the augmented Lagrangian rather than the quadraticpenalty method). Depending on the characteristics of the problem, the  and
steps may be solved with any of a number of nonlinear optimization methods, from gradient descent to Newton’s method, and using standard techniques such as warm starts, caching factorizations, inexact steps, stochastic updates using data minibatches, etc. In this respect, MAC is similar to other “metaalgorithms” such as expectationmaximization (EM) algorithms
(Dempster et al., 1977) and alternatingdirection method of multipliers (Boyd et al., 2011), which have become ubiquitous in statistics, machine learning, optimization and other areas.Fig. 2
illustrates MAC learning for a sigmoidal deep autoencoder architecture, introducing auxiliary coordinates for each hidden unit at each layer (see section
4.2 for details). Classical backpropbased techniques such as stochastic gradient descent and conjugate gradients need many iterations to decrease the error, but each MAC/QP iteration achieves a large decrease, particularly at the beginning, so that it can reach a pretty good network pretty fast. While MAC/QP’s serial performance is already remarkable, its parallel implementation achieves a linear speedup on the number of processors (fig. 6).Stopping criterion
Exactly optimizing for each follows the minima path strictly but is unnecessary, and one usually performs an inexact, faster optimization. Unlike in a general QP problem, in our case we have an accurate way to know when we should exit the optimization for a given . Since our real goal is to minimize the nested error , we exit when its value increases or decreases less than a set tolerance in relative terms. Further, as is common in neural net training, we use the validation error (i.e., measured on a validation set). This means we follow the path not strictly but only inasmuch as we approach a nested minimum, and our approach can be seen as a sophisticated way of taking a descent step in but derived from . Using this stopping criterion maintains our theoretical convergence guarantees, because the path still ends in a minimum of and we drive .
The postprocessing step
Once we have finished optimizing the MAC formulation with the QP method, we can apply a fast postprocessing step that both reduces the objective function, achieves feasibility and eliminates the auxiliary coordinates. We simply satisfy the constraints by setting , , , and keep all the weights the same except for the last layer, where we set by fitting to the dataset . One can prove the resulting weights reduce or leave unchanged the value of .
2.4 Jointly learning all the parameters in heterogeneous architectures
Another important advantage of MAC is that it is easily applicable to heterogeneous architectures, where each layer may perform a particular type of processing for which a specialized training algorithm exists, possibly not based on derivatives over the weights (so that backprop is not applicable or not convenient). For example, a quantization layer of an object recognition cascade, or the nonlinear layer of a radial basis function (RBF) network, often use a
means training to estimate the weights. Simply reusing this existing training algorithm as the step for that layer allows MAC to learn jointly the parameters of the entire network with minimal programming effort, something that is not easy or not possible with other methods.Fig. 3 illustrates MAC learning for an autoencoder architecture where both the encoder and the decoder are RBF networks, introducing auxiliary coordinates only at the coding layer (see section 4.3 for details). In the step, the basis functions of each RBF net are trained with means, and the weights in the remaining layers are trained by leastsquares. As before, MAC/QP achieves a large error decrease in a few iterations.
2.5 Model selection
A final advantage of MAC is that it enables an efficient search not just over the parameter values of a given architecture, but (to some extent) over the architectures themselves. Traditional model selection usually involves obtaining optimal parameters (by running an already costly numerical optimization) for each possible architecture, and then evaluating each architecture based on a criterion such as crossvalidation or a Bayesian Information Criterion (BIC), and picking the best (Hastie et al., 2009). This discretecontinuous optimization involves training an exponential number of models, so in practice one settles with a suboptimal search (e.g. fixing by hand part of the architecture based on an expert’s judgment, or selecting parts separately and then combining them). With MAC, model selection may be achieved “on the fly” by having the step do model selection separately for each layer, and then letting the step coordinate the layers in the usual way. Specifically, consider a model selection criterion of the form , where is the nested objective function (1) and is additive over the layers of the net:
(5) 
This is satisfied by many criteria, such as BIC, AIC or minimum description length (Hastie et al., 2009), in which is essentially proportional to the number of free parameters. While optimizing directly involves testing deep nets if we have choices for each layer, with MAC the step separates over layers, and requires testing only singlelayer nets at each iteration. While these model selection tests are still costly, they may be run in parallel, and they need not be run at each iteration. That is, we may alternate between running multiple iterations that optimize for a given architecture, and running a modelselection iteration, and we still guarantee a monotonic decrease of . In practice, we observe that a nearoptimal model is often found in early iterations. Thus, the ability of MAC to decouple optimizations reduces a search of an exponential number of complex problems to an iterated search of a linear number of simple problems.
Fig. 5 illustrates how to learn the architecture with MAC for the RBF autoencoder (see section 4.4 for details) by trying different values for the number of basis functions in each of the encoder and decoder (a search space of architectures). Because, early during the optimization, MAC/QP settles on an architecture that is quite smaller than the one used in fig. 3, the result is in fact achieved in even less time.
3 Related work
We believe we are the first to propose the MAC formulation in full generality for nested function learning as a provably equivalent, constrained problem that is to be optimized jointly in the space of parameters and auxiliary coordinates using quadraticpenalty, augmented Lagrangian or other methods. However, there exist several lines of work related to it, and MAC/QP can be seen as giving a principled setting that justifies previous heuristic but effective approaches, and opening the door for new, principled ways of training deep nets and other nested systems.
Updating the activations of hidden units separately from the weights of a neural net has been done in the past, from early work in neural nets (Grossman et al., 1988; Saad and Marom, 1990; Krogh et al., 1990; Rohwer, 1990) to recent work in learning sparse features (Olshausen and Field, 1996, 1997; Ranzato et al., 2007b; Kavukcuoglu et al., 2008) and dimensionality reduction (CarreiraPerpiñán and Lu, 2008, 2010, 2011; Wang and CarreiraPerpiñán, 2012). Interest in using the activations of neural nets as independent variables goes back to the early days of neural nets, where learning good internal representations was as important as learning good weights (Grossman et al., 1988; Saad and Marom, 1990; Krogh et al., 1990; Rohwer, 1990). In fact, backpropagation was presented as a method to construct good internal representations that represent important features of the task domain (Rumelhart et al., 1986b). This necessarily requires dealing explicitly with the hidden activations. Thus, while several papers proposed objective functions of both the weights and the activations, these were not intended to solve the nested problem or to achieve distributed optimization, but to help learn good representations. These algorithms typically did not converge at all, or did not converge to a solution of the nested problem, and were developed for a singlehiddenlayer net and tested in very small problems. More recent variations have similar problems (Ma et al., 1997; Castillo et al., 2006; Erdogmus et al., 2005). Nearly all this early work has focused on the case of a single hidden layer, which is easy enough to train by standard methods, so that no great advantage is obtained, and it does not reveal the parallel processing aspects of the problem, which become truly important in the deep net case. When extracting features and using overcomplete dictionaries, sparsity is often encouraged, which sometimes requires an explicit penalty over the features, but this has only been considered for a single layer (the one that extracts the features) and again does not minimize the nested problem (Olshausen and Field, 1996, 1997; Ranzato et al., 2007b; Kavukcuoglu et al., 2008). Some work for a single hidden layer net mentions the possibility of recovering backpropagation in a limit (Krogh et al., 1990; Kavukcuoglu et al., 2008), but this is not used to construct an algorithm that converges to a nested problem optimum. Recent works in deep net learning, such as pretraining (Hinton and Salakhutdinov, 2006) or greedy layerwise training (Bengio et al., 2007; Ngiam et al., 2011), do a single pass over the net from the input to the output layer, fixing the weights of each layer sequentially, but without optimizing a joint objective of all weights. While these heuristics can be used to achieve good initial weights, they do not converge to a minimum of the nested problem.
Auxiliary variables have been used before in statistics and machine learning, from early work in factor and homogeneity analysis (Gifi, 1990), to learn dimensionality reduction mappings given a dataset of highdimensional points . Here, one takes the latent coordinates of each data point as parameters to be estimated together with the reconstruction mapping that maps latent points to data space and minimize a leastsquares error function , often by alternating over and . Various nonlinear versions of this approach exist where is a spline (LeBlanc and Tibshirani, 1994), singlelayer neural net (Tan and Mavrovouniotis, 1995), radial basis function net (Smola et al., 2001), kernel regression (Meinicke et al., 2005) or Gaussian process (Lawrence, 2005). However, particularly with nonparametric functions, the error can be driven to zero by separating infinitely apart the , and so these methods need adhoc terms on to prevent this. The dimensionality reduction by unsupervised regression approach of CarreiraPerpiñán and Lu (2008, 2010, 2011) (generalized to supervised dimensionality reduction in Wang and CarreiraPerpiñán, 2012) solves this by optimizing instead jointly over , and the projection mapping (both RBF networks). This can be seen as a truncated version of our quadraticpenalty approach, where is kept constant, and limited to a singlehiddenlayer net. Therefore, the resulting estimate for the nested mapping is biased, as it does not minimize the nested error.
In summary, these works were typically concerned with singlehiddenlayer architectures, and did not solve the nested problem (1
). Instead, their goal was to define a different problem (having a different solution): one where the designer has explicit control over the net’s internal representations or features (e.g. to encourage sparsity or some other desired property). In MAC, the auxiliary coordinates are purely a mathematical construct to solve a welldefined, general nested optimization problem, with embarrassing parallelism suitable for distributed computation, and is not necessarily related to learning good hidden representations. Also, none of these works realize the possibility of using heterogeneous architectures with layerspecific algorithms, or of learning the architecture itself by minimizing a model selection criterion that separates in the
step.Finally, the MAC formulation is similar in spirit to the alternating direction method of multipliers (ADMM) (Boyd et al., 2011; Combettes and Pesquet, 2011) in that variables (the auxiliary coordinates) are introduced that decouple terms. However, ADMM splits an existing variable that appears in multiple terms of the objective function (which then decouple) rather than a functional nesting, for example becomes s.t. , or is split into nonnegative and nonpositive parts. In contrast, MAC introduces new variables to break the nesting. ADMM is known to be very simple, effective and parallelizable, and to be able to achieve a pretty good estimate pretty fast, thanks to the decoupling introduced and the ability to use existing optimizers for the subproblems that arise. MAC also has these characteristics with problems involving function nesting.
4 Experiments
Section 4.1 describes how we implemented the  and steps and sections 4.2, 4.3 and 4.4 show how MAC can learn a homogeneous architecture (deep sigmoidal autoencoder), a heterogeneous architecture (RBF autoencoder) and the architecture itself, respectively. In all cases, we show the speedup achieved with a parallel implementation of MAC as well.
4.1 Optimization of the MACconstrained problem using a quadratic penalty
We apply alternating optimization of the QP objective (3) over and :
 step

Minimizing over for fixed results in a separate nonlinear, leastsquares regression of the form for (where we define ) and , where is the weight vector (th row of ) that feeds into the th output unit of layer (assuming there are such units), and the corresponding scalar target for point . We solve each of these problems with a GaussNewton approach (Nocedal and Wright, 2006), which essentially approximates the Hessian by linearizing the function , solves a linear system of (the number of units feeding into ) to get a search direction, does a line search (we use backtracking with initial step size ), and iterates. In practice 1–2 iterations converge with high tolerance.
 step

Minimizing over for fixed separates over each for . The problem is also a nonlinear leastsquares fit, formally very similar to those of the step, because and enter the objective function in a nearly symmetric way through , but with additional quadratic terms . We optimize it again with the GaussNewton method, which usually spends 1–2 iterations.
This optimization of the MACconstrained problem, based on a quadraticpenalty method with GaussNewton steps, produces reasonable results and is simple to implement, but it is not intended to be particularly efficient. A more efficient optimization can be achieved by (1) using other methods for constrained optimization, such as the augmented Lagrangian method instead of the quadratic penalty method; and (2) by using more efficient  or steps, by combining standard techniques (inexact steps with warm starts, caching factorizations, stochastic updates using data minibatches, etc.) with unconstrained optimization methods such as LBFGS, conjugate gradients, gradient descent, alternating optimization or others. Exploring this is a topic of future research.
Parallel implementation of MAC/QP
Our parallel implementation of MAC/QP is extremely simple at present, yet it achieves large speedups (about faster if using 12 processors), which are nearly linear as a function of the number of processors for all experiments, as shown in fig. 6. Given that our code is in Matlab, we used the Matlab Parallel Processing Toolbox. The programming effort is insignificant: all we do is replace the “for” loop over weight vectors (in the step) or over auxiliary coordinates (in the step) with a “parfor” loop. Matlab then sends each iteration of the loop to a different processor. We ran this in a sharedmemory multiprocessor machine^{1}^{1}1An Aberdeen Stirling 148 computer having 4 physical CPUs (Intel Xeon CPU L7555@ 1.87GHz), each with 8 individual processing cores (thus a total of 32 actual processors), and a total RAM size of 64 GB., using up to 12 processors (a limit imposed by our Matlab license) and obtained the results reported in the paper. While simple, the Matlab Parallel Processing Toolbox is quite inefficient. Larger speedups would be achievable with other parallel computation models such as MPI in C, and using a distributed architecture (so that cache and other overheads are reduced).
4.2 Homogeneous training: deep sigmoidal autoencoder
We use a dataset of handwritten digit images to train a deep autoencoder architecture that maps the input image to a lowdimensional coding layer and then tries to reconstruct the image from it. We used MAC/QP introducing auxiliary coordinates for each hidden unit at each layer. Fig. 2 shows the learning curves.
The USPS dataset (Hull, 1994), a commonly used machine learning benchmark, contains grayscale images of handwritten digits, i.e., 256D vectors with values in . We use images for training and for validation, both randomly selected equally over all digits.
The autoencoder architecture is 256–300–100–20–100–300–256, for a total of over weights, with all hidden layers being logistic sigmoid units and the output layer being linear. The initial weights are uniformly sampled from for layers , respectively, where is the input dimension (fanin) to each hidden layer (Orr and Müller, 1998). When using initial random weights, a large (but easy) decrease in the error can be achieved simply by adjusting the biases of the output layer so the network output matches the mean of the target data, and all algorithms attain this in the first few iterations, giving the impression of a large error decrease followed by a very slow subsequent decrease. To focus the plots on the subsequent, more difficult error decreases, we always apply a single step of gradient descent to the random initial weights, which mostly adjusts the biases as described. The resulting weights are given as initial weights to each optimization method.
MAC/QP runs the step with 1 GaussNewton iteration and the step with up to 3 GaussNewton iterations. We also use a small regularization on in the first few iterations, which we drop once , since we find that this tends to lead to a better local optimum. For each value of , we optimize in an inexact way as described in section 2, exiting when the value of the nested error , evaluated on a validation set, increases or decreases less than a set tolerance in relative terms. We use a tolerance of and increase rather aggressively, to .
We show the learning curves for two classical backpropbased techniques, stochastic gradient descent and conjugate gradients. Stochastic gradient descent (SGD) has several parameters that should be carefully set by the user to ensure that convergence occurs, and that convergence is as fast as possible. We did a grid search for the minibatch size in and learning rate in . We found minibatches of 20 samples and a learning rate of
were best. We randomly permute the training set at the beginning of each epoch. For nonlinear conjugate gradients (CG), we used the PolakRibière version, widely regarded as the best
(Nocedal and Wright, 2006). We use Carl Rasmussen’s implementation minimize.m (available at http://learning.eng.cam.ac.uk/carl/code/minimize/minimize.m), which uses a line search based on cubic interpolation that is more sophisticated than backtracking, and allows steps longer than 1. We found a minibatch of
(i.e., batch mode) worked best. We used restarts every 100 steps.Figure 2(left) plots the mean squared training error for the nested objective function (1) vs run time for the different algorithms after the initialization. The validation error follows closely the training error and is not plotted. Markers are shown every iteration (MAC), every 100 iterations (CG), or every 20 epochs (SGD, one epoch is one pass over the training set). The change points of the quadratic penalty parameter are indicated using filled red circles at the beginning of each iteration when they happened. The learning curve of the parallelized version of MAC (using 12 processors) is also shown in blue. Fig. 2(right) shows some USPS images and their reconstruction at the end of training for each algorithm.
SGD and CG need many iterations to decrease the error, but each MAC/QP iteration achieves a large decrease, particularly at the beginning, so that it can reach a pretty good network pretty fast. While MAC/QP’s serial performance is already remarkable, its parallel implementation achieves a linear speedup on the number of processors (fig. 6).


4.3 Heterogeneous training: radial basis functions autoencoder


We use a dataset of object images to train an autoencoder architecture where both the encoder and decoder are RBF networks but, rather than using a gradientbased optimization for each subnet, we use means to train the basis functions of each RBF in the step (while the weights in the remaining layers are trained by leastsquares). We used MAC/QP introducing auxiliary coordinates only at the coding layer. Backpropbased algorithms are incompatible with means training, so instead we compare with alternating optimization. Fig. 3 shows the learning curves.
The COIL–20 image dataset (Nene et al., 1996), commonly used as a benchmark to test dimensionality reduction algorithms, contains rotation sequences of 20 different objects every 5 degrees (i.e., 72 images per object), each a grayscale image with pixel intensity in . We resize the images to . Thus, the data contain 20 closed, nonlinear 1D manifolds in a –dimensional space. We pick half of the images from objects 1 (duck) and 4 (cat) as validation set, which leaves a training set containing images.
The autoencoder architecture is as follows. The bottleneck layer of lowdimensional codes has only 2 units, so that we can visualize the data. The encoder reduces the dimensionality of the input image to 2D, while the decoder reconstructs the image as best as possible from the 2D representation. Both the encoder and the decoder are radial basis function (RBF) networks, each having a single hidden layer. The first one (encoder) has the form , where the vector has elements (basis functions) of the form , , with , and maps an image to a 2D space . The second one (decoder) has the form , where the vector has elements (basis functions) of the form , , with , and maps a 2D point to a D image. Thus, the complete autoencoder is the concatenation of the two Gaussian RBF networks, it has hidden layers with sizes 1 024–1 368–2–1 368–1 024, and a total of almost 3 million weights. As is usual with RBF networks, we applied a quadratic regularization to the linearlayer weights with a small value (). The nested problem is then to minimize the following objective function, which is a leastsquares error plus a quadratic regularization on the linearlayer weights:
(6) 
In practice, RBF networks are trained in two stages (Bishop, 2006). Consider the encoder, for example. First, one trains the centers using a clustering algorithm applied to the inputs , typically means or (when the number of centers is large) simply by fixing them to be a random subset of the inputs. Second, having determined , one obtains from a linear leastsquares problem, by solving a linear system. The reason why this is preferred to a fully nonlinear optimization over centers and weights
is that it achieves nearoptimal nets with a simple, noniterative procedure. This type of twostage noniterative strategy to obtain nonlinear networks is widely applied beyond RBF networks, for example with support vector machines
(Schölkopf and Smola, 2001), kernel PCA (Schölkopf et al., 1998), slice inverse regression (Li, 1991) and others.We wish to capitalize on this attractive property to train deep autoencoders constructed by concatenating RBF networks. However, backpropbased algorithms are incompatible with this twostage training procedure, since it does not use derivatives to optimize over the centers. This leads us to the two following optimization methods: an alternating optimization approach, and MAC.
We can use an alternating optimization approach where we alternate the following two steps: (1) A step where we fix and train , by applying means to and a linear system to . This step is identical to the step in MAC over . (2) A step where we fix and train , by applying means to and a nonlinear optimization to (we use nonlinear conjugate gradients with 10 steps). This is because no longer appears linearly in the objective function, but is nonlinearly embedded as the argument of the decoder. This step is significantly slower than the step in MAC over .
We define the MACconstrained problem as follows. We introduce auxiliary coordinates only at the coding layer (rather than at all hidden layers). This allows the step to become the desired means plus linear system training for the encoder and decoder separately. It requires no programming effort; we simply call an existing, meansbased RBF training algorithm for each of the encoder and decoder separately. We start with a quadratic penalty parameter and increase it to after 70 iterations.
Since we use as many centers as data points (), the means step simplifies (for both methods) to setting each basis function center to an input point.
In this experiment, instead of using random initial weights, we obtained initial values for the coordinates by running a nonlinear dimensionality reduction method, the elastic embedding (EE) (CarreiraPerpiñán, 2010); this gives significantly better embeddings than spectral methods; (Tenenbaum et al., 2000; Roweis and Saul, 2000). This takes as input a matrix of similarity values between every pair of COIL images , and produces a nonlinear projection in 2D for each . We used Gaussian similarities with a kernel width of and run EE for 200 iterations using as its user parameter. All the optimization algorithms were initialized from these projections.
Fig. 3(left) shows the nested function error (6) (normalized per data point, i.e., divided by ). As before, MAC/QP achieves a large error decrease in a few iterations. Alternating optimization is much slower. Again, a parallel implementation of MAC/QP achieves a large speedup, which is nearly linear on the number of processors (fig. 6). Fig. 3(right) shows some COIL images and their reconstruction at the end of training for each algorithm. Fig. 4 shows the initial projections (from the elastic embedding algorithm) and the final projections (after running MAC/QP). Most of the manifolds have improved, for example opening up loops that were folded.
4.4 Learning the architecture: RBF autoencoder
We repeat the experiment of the RBF autoencoder of the previous section, but now we learn its architecture. We jointly learn the architecture of the encoder and of the decoder by trying different values for the number of basis functions in each (a search space of architectures). We define the following objective function over architectures and their weights:
(7) 
where is the nested error from eq. (6) (including regularization terms), and is the model selection term. We use the wellknown AIC criterion (Hastie et al., 2009). This is defined as
(8) 
(times a constant , which we omit) where SSE is the sum of squared errors achieved in the training set with a model having parameters , is the mean squared error achieved by a lowbias model (typically estimated by training a model with many parameters) and is the number of free parameters in the model. In our case, this means we use defined as
(9) 
where and are the numbers of centers for the encoder and decoder, respectively (first and third hidden layers), and and are the input and output dimension of the encoder, respectively (equivalently, the output and input dimension of the decoder). The total number of free parameters (centers and linear weights) in the autoencoder is thus .
We choose each of the numbers of centers and from a discrete set consisting of the equispaced values in the range to (a total of different architectures). We estimated from the result of the RBF autoencoder of section 4.3, which had a large number of parameters and thus a low bias. As in that section, the centers of each network are constrained to be equal to a subset of the input points (chosen at random). We set , and . We start the MAC/QP optimization from the most complex model, having centers (i.e., the model of the previous section). While every iteration optimizes the MAC/QP objective function (3) over , we run a model selection step only every 10 iterations. This selects separately for each net the best value and potentially changes the size of . Thus, every 11th iteration is a model selection step, which may or may not change the architecture.
Figure 5 shows the total error of eq. (7) (the nested function error plus the model cost). Model selection steps are indicated with green markers (solid if the architecture did change and empty if it did not change), annotated with the resulting value of . Other details are as in fig. 3. The first change of architecture moves to a far smaller model , achieving an enormous decrease in objective. This is explained by the strong penalty that AIC imposes on the number of parameters, favoring simpler models. Then, this is followed by a few minor changes of architecture interleaved with a continuous optimization of its weights. The final architecture has , for a total of million weights. While this architecture incurs a larger training error than that of the previous section, it uses a much simpler model and has a lower value for the overall objective function of eq. (7). Because, early during the optimization, MAC/QP settles on an architecture that is quite smaller than the one used in fig. 3, the result is in fact achieved in even less time. And, again, the parallel implementation is trivial and achieves an approximately linear speedup on the number of processors (fig. 6).
5 Conclusion
MAC drastically facilitates, in runtime and human effort, the practical design and estimation of nonconvex, nested problems by jointly optimizing over all parameters, reusing existing algorithms, searching automatically over architectures and affording massively parallel computation, while provably converging to a solution of the nested problem. It could replace or complement backpropagationbased algorithms in learning nested systems both in the serial and parallel settings. It is particularly timely given that serial computation is reaching a plateau and cloud computing is becoming a commodity, and intelligent data processing is finding its way into mainstream devices (phones, cameras, etc.), thanks to increases in computational power and data availability. An important area of application may be the joint, automatic tuning of all stages of a complex, intelligentprocessing system in datarich disciplines, such as those found in computer vision and speech, in a distributed cloud computing environment. MAC also opens many questions, such as the optimal way to introduce auxiliary coordinates in a given problem, and the choice of specific algorithms to optimize the  and steps.
6 Acknowledgments
Work funded in part by NSF CAREER award IIS–0754089.
Appendix A Theorems and proofs
a.1 Definitions
Consider a regression problem of mapping inputs to outputs (both highdimensional) with a deep net given a dataset of pairs . We define the nested objective function to learn a deep net with hidden layers, like that in fig. 1, as (to simplify notation, we ignore bias parameters and assume each hidden layer has units):
(1) 
where each layer function has the form , i.e., a linear mapping followed by a squashing nonlinearity ( applies a scalar function, such as the sigmoid , elementwise to a vector argument, with output in ).
In the method of auxiliary coordinates (MAC), we introduce one auxiliary variable per data point and per hidden unit (so , with ) and define the following equalityconstrained optimization problem:
(2) 
Sometimes, for notational convenience (in particular in theorem B.3), we will write the constraints for the th point as a single vector constraint (with an obvious definition for ). We will also call the feasible set of the MACconstrained problem, i.e.,
(10) 
a.2 Equivalence of the MAC and nested formulations
First, we give a theorem that holds under very general assumptions. In particular, it does not require the functions to be smooth, it holds for any loss function beyond the leastsquares one, and it holds if the nested problem is itself subject to constraints.
Theorem A.1.
Proof.
Let us prove that any minimizer of the nested problem is associated with a unique minimizer of the MACconstrained problem , and vice versa . Recall the following definitions (Nocedal and Wright, 2006): (i) For an unconstrained minimization problem , is a local minimizer if there exists a nonempty neighborhood of such that . (ii) For a constrained minimization problem s.t. , is a local minimizer if and there exists a nonempty neighborhood of such that .
Define the “forwardpropagation” function as the result of mapping for . This maps each to a unique , and satisfies for , and therefore that for any .
Let be a local minimizer of the nested problem (1). Then, there exists a nonempty neighborhood of such that . Let and call , which is a nonempty neighborhood of in space. Now, for any we have that . Hence is a local minimizer of the MACconstrained problem.
Let be a local minimizer of the MACconstrained problem (2). Then, there exists a nonempty neighborhood of such that . Note that , and this applies in particular to (which, being a solution, is feasible and thus belongs to ). Calling , we have that . Hence is a local minimizer of the nested problem.
Finally, one can see that the proof holds if the nested problem uses a loss function that is not the leastsquares one, and if the nested problem is itself subject to constraints. ∎
Obviously, the theorem holds if we exchange with everywhere (thus exchanging nonstrict with strict minimizers), and if we exchange “min” with “max” (hence the maximizers of the MAC and nested formulations are in a onetoone correspondence as well). Figure 7 illustrates the theorem. Essentially, the nested objective function stretches along the manifold defined by preserving the minimizers and maximizers. The projection on space of the part of that sits on top of that manifold recovers .
a.3 KKT conditions
We now show that the firstorder necessary (KarushKuhnTucker, KKT) conditions of both problems (nested and MACconstrained) have the same stationary points. For simplicity and clarity of exposition, we give a proof for the special case of . The proof for layers follows analogously. We assume the functions and have continuous first derivatives w.r.t. both its input and its weights. indicates the Jacobian of w.r.t. its input. To simplify notation, we sometimes omit the dependence on the weights; for example, we write as , and as .
Theorem A.2.
Proof.
The nested problem for a nested function is:
Then we have the stationary point equation (firstorder necessary conditions for a minimizer):
(11)  
(12) 
which is satisfied by all the minima, maxima and saddle points.
The MACconstrained problem is
with Lagrangian
and KKT conditions
(13)  
(14)  
(15)  
(16) 
Substituting from eq. (15) and from eq. (16):
into eqs. (13)–(14) we recover eqs. (11)–(12), thus a KKT point of the constrained problem is a stationary point of the nested problem. Conversely, given a stationary point of the nested problem, and defining and as in eqs. (15’)–(16’), then satisfies eqs. (13)–(16) and so is a KKT point of the constrained problem. Hence, there is a onetoone correspondence between the stationary points of the nested problem and the KKT points of the MACconstrained problem. ∎
Appendix B Convergence of the quadraticpenalty method for MAC
Let us first give convergence conditions for the general equalityconstrained minimization problem:
(17) 
and the quadraticpenalty (QP) function
(18) 
with penalty parameter . Given a positive increasing sequence , a nonnegative sequence , and a starting point , the QP method finds an approximate minimizer of for , so that the iterate satisfies . Given this algorithm, we have the following theorems:
Theorem B.1 (Nocedal and Wright, 2006, Th. 17.1).
Suppose that and . If each is the exact global minimizer of , then every limit point of the sequence is a global solution of the problem (17).
Theorem B.2 (Nocedal and Wright, 2006, Th. 17.2).
Suppose that and , and that is a limit point of . Then is a stationary point of the function . Besides, if the constraint gradients are linearly independent, then is a KKT point for the problem (17). For such points, we have for any infinite subsequence such that that , where is the multiplier vector that satisfies the KKT conditions for the problem (17).
If now we particularize these general theorems to our case, we can obtain stronger theorems. Theorem B.1 is generally not applicable, because optimization problems involving nested functions are typically not convex and have local minima. Theorem B.2 is applicable to prove convergence in the nonconvex case. We assume the functions in eq. (1) have continuous first derivatives w.r.t. both its input and its weights, so is differentiable w.r.t. and .
Theorem B.3 (Convergence of MAC/QP for nested problems).
Consider the constrained problem (2) and its quadraticpenalty function of (3). Given a positive increasing sequence , a nonnegative sequence , and a starting point , suppose the QP method finds an approximate minimizer of that satisfies for Then, , which is a KKT point for the problem (2), and its Lagrange multiplier vector has elements .
Proof.
It follows by applying theorem B.2 to the constrained problem (2) and by noting that exists and that the constraint gradients are linearly independent. We prove these two statements in turn.
The limit of the sequence exists because the objective function of the MACconstrained problem (hence the QP function ) are lower bounded and have continuous derivatives.
The constraint gradients are l.i. at any point and thus, in particular, at the limit . To see this, let us first compute the constraint gradients. There is one constraint for each point , layer and unit , where we define as the set of auxiliary coordinate indices for layer and . The gradient of this constraint is:
Now, we will show that these gradients are l.i. at any point . It suffices to look at the gradients w.r.t. . Call for short. Constructing a linear combination of them and setting it to zero: