1 Introduction
Deep learning algorithms have yielded impressive performance across a range of tasks, including object and voice recognition [1]. The workhorse underlying deep learning is error backpropagation [2, 3, 4], a decades-old algorithm that yields state-of-the-art performance on massive labeled datasets when combined with recent innovations such as rectifiers and dropout [5].
Backprop is gradient descent plus the chain rule. Gradient descent has convergence guarantees in settings that are smooth, convex, or both. However, modern convnets are neither smooth nor convex. Although it is well-known that convnets are not convex, it is perhaps underemphasized that the spectacular recent results obtained by convnets on benchmarks such as ImageNet [6] rely on architectures that are not smooth. Starting with AlexNet in 2012, every winner of the ImageNet classification challenge has used rectifier (also known as rectilinear) activation functions [7, 8, 9, 10]. Rectifiers and max-pooling are nonsmooth functions that are used in essentially all modern convnets [11, 12, 13, 14, 15, 16]. In fact, the representational power of rectifier nets derives precisely from their nondifferentiability: the number of nondifferentiable boundaries in the parameter space grows exponentially with depth [17]. It follows that none of the standard convergence guarantees from the optimization literature apply to modern convnets.
In this paper, we provide the first convergence rates for convolutional networks with rectifiers and max-pooling. To do so, we introduce a new class of gated games which generalize the convex games studied by Stoltz and Lugosi in [18]. We reformulate learning in convnets as gated games and adapt results on convergence to correlated equilibria from convex to gated games.
1.1 Open Questions in the Foundations of Deep Learning
Theoretical questions about deep learning can be loosely grouped into four categories:

Representational power
The set of functions approximated by neural networks has been extensively studied. Early results show that neural networks with a single hidden layer are universal function approximators [19, 20]. More recently, researchers have focused on the role of depth and rectifiers in function approximation [21, 22, 23].

Generalization guarantees
Standard guarantees from VC theory apply to neural nets, although these are quite loose [24, 25]. Recent work by Hardt et al shows that the convergence rates of stochastic gradient methods have implications for generalization bounds in both convex and nonconvex settings [26]. Unfortunately, their results rely on a smoothness assumption (a function $f$ is $\beta$-smooth if there exists $\beta < \infty$ such that $\|\nabla f(x) - \nabla f(y)\| \le \beta \|x - y\|$ for all $x, y$; rectifiers are not smooth for any $\beta$) that does not hold for rectifiers or max-pooling. Thus, although suggestive, the results do not apply to modern convnets. Feng et al have initiated a promising direction based on ensemble robustness [27], although robustness cannot be evaluated analytically.
A related problem is to better understand regularization methods such as dropout [28, 29]. Regret bounds for dropout have been found in the setting of prediction with expert advice [30]. However, it is unclear how to extend these results to neural nets.

Local and global minima
The third problem is to understand how far the critical points found by backpropagation are from local minima and the global minimum. The problem is challenging since neural networks are not convex. There has been theoretical work studying conditions under which gradient descent converges to local minima on nonconvex problems [31, 32]. The assumptions required for these results are quite strong, and include smoothness assumptions that do not hold for rectifiers. It has also been observed that saddles slow down training, even when the algorithm does not converge to a saddle point; designing algorithms that avoid saddles is an area of active research [33].
Recent work by Choromanska et al suggests that most local optima in neural nets have error rates that are reasonably close to the global optimum [34, 35]. Searching for good local optima may therefore be of less practical importance than ensuring rapid convergence.

Convergence rates
The last problem, and the focus of this paper, is to understand the convergence of gradient-based methods on neural networks. Speeding up the training time of neural nets is a problem of enormous practical importance. Although there is a large body of empirical work on optimizing neural nets, there are no theoretical guarantees that apply to the methods used to train rectifier convnets, since such networks are neither smooth nor convex.
Recent work has investigated the convergence of proximal methods for nonconvex nonsmooth problems [36, 37]. However, computing prox operators appears infeasible for neural nets. Interesting results have been derived for variance-reduced gradient optimization in the nonconvex setting [38, 39, 40], although smoothness is still required.
1.2 Outline
Training modern convnets, with rectifiers and max-pooling, entails searching over a rich subset of a universal class of function approximators with a loss function that is neither smooth nor convex. There is little hope of obtaining useful convergence results at this level of generality. It is therefore necessary to utilize the structure of rectifier networks.
Our strategy is to decompose neural nets into interacting optimizers that are easier to analyze individually than the net as a whole. In short, the strategy is to import techniques from game theory into deep learning.
1.2.1 Utilizing network structure
We make two observations about neural network structure. The first, in section 2, is to reformulate linear networks as convex games where the players are the units in the network. Although the loss is not a convex function of the weights of the network as a whole, it is a convex function of the weights of the individual players. The observation connects the dynamics of games under no-regret learning to the dynamics of linear networks under backpropagation.
Linear networks are a special case. It is natural to ask whether neural networks with nonlinearities are also convex games. Unfortunately, the answer is no: introducing any nonlinearity breaks convexity (in short, the nonlinearity $\sigma$ would have to be affine, since we need all linear combinations of $\sigma$ to be convex, including $\sigma$ and $-\sigma$). Although the situation seems hopeless, it turns out, remarkably, that game-theoretic convergence results can be imported, despite nonconvexity, for precisely the nonlinearities used in modern convnets.
The second observation, section 3.1, is that a rectifier network is a linear network equipped with gates that control whether units are active for a given input. If a unit is inactive during the feedforward sweep, it is also inactive during backpropagation, and therefore does not update its weights. This motivates generalizing convex games to gated games.
1.2.2 Gated games
In a classical game, each player chooses a series of actions and, on each round, incurs a convex loss. The regret of a player is the difference between its cumulative loss and what the cumulative loss would have been had the player chosen the best action in hindsight [41]. Players can be implemented with so-called no-regret algorithms that minimize their loss relative to the best action in hindsight. More precisely, a no-regret algorithm has sublinear cumulative regret. The regret per round therefore vanishes asymptotically.
Section 3 introduces gated games, where players only incur a convex loss on rounds for which they are active. After extending the definitions of regret and correlated equilibrium to gated games, proposition 3 shows that if players follow no-gated-regret strategies, then they converge to a correlated equilibrium. Gated players generalize the sleepy experts introduced by Blum in [42], see also [43].
A useful technical tool is path-sum games. These are games constructed over directed acyclic graphs with weighted edges. Lemma 4 shows that path-sums encode the dynamics of the feedforward and feedback sweeps of rectifier nets. Proposition 5 shows that path-sum games are gated games and proposition 6 extends the result to convolutional networks.
1.2.3 Summary of contribution
The main contributions of the paper are as follows:

Theorem 1: Rectifier convnets converge to a critical point under backpropagation at a rate controlled by the gated-regret of the units in the network.
Corollary 1 specializes the result to gradient descent. To the best of our knowledge, there are no previous convergence rates applicable to neural nets with rectifier nonlinearities and max-pooling. Finding conditions that guarantee convergence to local minima is deferred to future work.
The results derive from a detailed analysis of the internal structure of rectifier nets and their updates under backpropagation. They require no new ideas regarding optimization in general. Our methods provide the first rigorous explanation for how methods designed for convex optimization improve convergence rates on modern convnets. The results do not apply to all neural networks: they hold for precisely the neural networks that perform best empirically [7, 8, 9, 10, 11, 12, 13, 14, 15, 16].
The philosophy underlying the paper is to decompose training neural nets into two distinct tasks: communication and optimization. Communication is handled by backpropagation, which sends the correct gradient information to players (units) in the net. Optimization is handled locally by the individual players. Note that although this philosophy is widely applied when designing and implementing neural nets, it has been underutilized in the analysis of neural nets. The role of players in a convnet is encapsulated in the Gated Forecaster setting, Section 4. Our results provide a dictionary that translates the guarantees applicable to any no-regret algorithm into a convergence rate for the network as a whole.

Reformulate neural networks as games.
The primary conceptual contribution of the paper is to connect game theory to deep learning. An interesting consequence of the main result is corollary 2, which provides a compact description of the weights learned by a neural network via the signal underlying correlated equilibrium. More generally, neural nets are a basic example of a game with a structured communication protocol (the path-sums) which determines how players interact [44]. It may be fruitful to investigate broader classes of structured games.
It has been suggested that rectifiers perform well because they are nonnegative homogeneous which has implications for regularization [45] and robustness to changes in initialization [23]. Our results provide a complementary explanation. Rectifiers simultaneously (i) introduce a nonlinearity into neural nets providing them with enormous representational power and (ii) act as gates that select subnetworks of an underlying linear neural network, so that convex methods are applicable with guarantees.

Logarithmic regret algorithm.
As a concrete application of the gated forecaster framework, we adapt the Online Newton Step algorithm [46] to neural nets and show it has logarithmic regret, corollary 3. The resulting algorithm approximates Newton's method locally at the level of individual units of the network, rather than globally for the network as a whole. The local, unit-wise implementation reduces computational complexity and sidesteps the tendency of quasi-Newton methods to approach saddle points.

Conditional computation.
A secondary conceptual contribution is to introduce a framework for conditional computation. Up to this point, we assumed the gate is a fixed property of the game. Concretely, gates correspond to rectifiers and maxpooling in convolutional networks – which are baked into the architecture before the network is exposed to data. It is natural to consider optimizing when players in a gated game are active, section 4.5.
Recent work along these lines has applied reinforcement learning algorithms to find data-dependent dropout policies [47, 48]. Conditional computation is closely related to models of attention [49, 50]. Slightly further afield, long short-term memories (LSTMs) and Gated Recurrent Units (GRUs) use complicated sets of sigmoid gates to control activity within memory cells [51, 52]. Unfortunately, the resulting architectures are difficult to analyze; see [53] for a principled simplification of recurrent neural network architectures motivated by considerations similar to those in the present paper.
As a first step towards analyzing conditional computation in neural nets, we introduce the Conditional Gate (CoG) setting. CoGs are contextual bandits or contextual partial monitors that optimize when sets of units are active. CoGs are a second class of players that can be introduced into neural games, and may provide a useful tool when designing deep learning algorithms.
1.2.4 Caveat
Neural nets are typically trained on minibatches sampled i.i.d. from a dataset. In contrast, the analysis below provides guarantees in adversarial settings. Our results are therefore conservative. Extending the analysis to take advantage of stochastic settings is an important open problem. However, it is worth mentioning that neural nets are increasingly applied to data that is not i.i.d. sampled. For example, adversarially trained generative networks have achieved impressive performance [54, 55]. Similarly, there has been spectacular progress applying neural nets to reinforcement learning [56, 57].
Activity within a neural network is not i.i.d. even when the inputs are, a phenomenon known as internal covariate shift [58]. Two relevant developments are batch normalization [58] and optimistic mirror descent [59, 60, 61]. Batch normalization significantly reduces the training time of neural nets by actively reducing internal covariate shift. Optimistic mirror descent takes advantage of the fact that all players in a game are implementing no-regret learning to speed up convergence. It is interesting to investigate whether reducing internal covariate shift can be understood game-theoretically, and whether optimistic learning algorithms can be adapted to neural networks.
1.3 Related work
A number of papers have brought techniques from convex optimization into the analysis of neural networks. A line of work initiated by Bengio in [62] shows that allowing the learning algorithm to choose the number of hidden units can convert neural network optimization into a convex problem, see also [63]. A convex multilayer architecture is developed in [64, 65]. Although these methods are interesting, they have not achieved the practical success of convnets. In this paper, we analyze convnets as they are, rather than proposing a more tractable, but potentially less useful, model.
Game theory was developed to model interactions between humans [66]. However, it may be more directly applicable as a toolbox for analyzing machina economicus, that is, interacting populations of algorithms that are optimizing objective functions [67]. We go one step further, and develop a game-theoretic analysis of the internal structure of backpropagation.
2 Warmup: Linear Networks
The paper combines disparate techniques and notation from game theory, convex optimization and deep learning, and is therefore somewhat dense. To get oriented, we start with linear neural networks. Linear nets provide a simple but nontrivial worked example. They are not convex. Their energy landscapes and dynamics under backpropagation have been extensively studied and turn out to be surprisingly intricate [73, 74].
2.1 Neural networks
Consider a neural network with $L$ hidden layers. Let $x^0$ denote the input to the network. For each layer $l \in \{1, \dots, L\}$, set $a^l = W^l x^{l-1}$ and $x^l = \sigma(a^l)$, where $\sigma$ is a (typically nonlinear) function applied coordinatewise. Let $W = \{W^1, \dots, W^L\}$ denote the set of weight matrices. A convenient shorthand for the output of the network is $f_W(x)$.
For simplicity, suppose the output layer consists of a single unit so that the output is a scalar (the assumption is dropped in section 3). Let $(x, y)$ denote a sample of labeled data and let $\ell(f_W(x), y)$ be a loss function that is convex in the first argument. Training the neural network reduces to solving the optimization problem
$$\min_{W} \; \mathbb{E}_{(x, y) \sim P}\big[\ell(f_W(x), y)\big], \qquad (1)$$
where $P$ is the empirical distribution over the data. Training is typically performed using gradient descent
$$W \leftarrow W - \eta \cdot \nabla_W\, \ell(f_W(x), y), \qquad (2)$$
where $\eta$ is the learning rate. Since Eq. (1) is not a convex function of $W$, there are no guarantees that gradient descent will find the global minimum.
2.1.1 Backprop
Let us recall how backprop is extracted from gradient descent. Subscripts now refer to units, not layers. Writing $a_j = \langle w_j, x_j \rangle$ for the pre-activation of unit $j$, where $w_j$ is its weight vector and $x_j$ the vector of outputs of the units feeding into it, Eq. (2) can be written more concretely for a single weight $w_{jk}$ as
$$w_{jk} \leftarrow w_{jk} - \eta \cdot \frac{\partial \ell}{\partial w_{jk}}. \qquad (3)$$
By the chain rule the derivative decomposes as $\frac{\partial \ell}{\partial w_{jk}} = \delta_j \cdot x_k$, where $\delta_j := \frac{\partial \ell}{\partial a_j}$ is the backpropagated error and $x_k$ is the $k$-th input to unit $j$. Backprop computes $\delta_j$ recursively via
$$\delta_j = \sigma'(a_j) \cdot \sum_{i \in \mathrm{out}(j)} w_{ij}\, \delta_i, \qquad (4)$$
where $\mathrm{out}(j)$ denotes the set of units that unit $j$ feeds into.
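The recursion in Eqs. (3)-(4) can be sketched in layered form as follows. The mean-square error at the output, the identity activation used in the check, and all identifier names are our assumptions for illustration; the recursion itself is the standard one.

```python
import numpy as np

def backprop(weights, x, y, sigma, sigma_grad):
    """Per-layer weight gradients for the mean-square error via the
    recursion delta^l = sigma'(a^l) * ((W^{l+1})^T delta^{l+1})."""
    acts, pre = [x], []
    for W in weights:                    # forward sweep, caching a^l and x^l
        a = W @ acts[-1]
        pre.append(a)
        acts.append(sigma(a))
    delta = (acts[-1] - y) * sigma_grad(pre[-1])   # error at the output layer
    grads = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):      # backward sweep
        grads[l] = np.outer(delta, acts[l])        # dL/dW^l = delta^l (x^{l-1})^T
        if l > 0:
            delta = sigma_grad(pre[l - 1]) * (weights[l].T @ delta)
    return grads
```

With the identity activation this reproduces the analytic gradients of a linear net exactly, which is a convenient sanity check.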
2.2 Linear networks
In a linear network, the function $\sigma$ is the identity. It follows that the output of the network is
$$f_W(x) = W^L W^{L-1} \cdots W^1 x. \qquad (5)$$
For linear networks, the backpropagation computation in Eq. (4) reduces to
$$\delta_j = \sum_{i \in \mathrm{out}(j)} w_{ij}\, \delta_i. \qquad (6)$$
It is convenient for our purposes to decompose $\delta_j$ slightly differently, by factorizing it into $\frac{\partial \ell}{\partial f}$, the derivative of the loss with respect to the output, and $s_j$, the sensitivity of the network's output to unit $j$:
$$\delta_j = \frac{\partial \ell}{\partial f} \cdot s_j. \qquad (7)$$
We now reformulate the forward and back propagation in linear nets in terms of path-sums [34, 72]:
Definition 1 (path-sums in linear nets).
A path is a directed sequence of edges connecting two units. Let $P_{a \to b}$ denote the set of paths from unit $a$ to unit $b$. The weight $w_p$ of a path $p$ is the product of the weights of the edges along the path; if a path starts at an input unit, the corresponding input value is included in the product. Writing $\mathrm{in}$ for the input layer and $\mathrm{out}$ for the output unit, define the path-sums:

sum(paths from $\mathrm{in}$ to $j$):
$$\pi_{\mathrm{in} \to j} := \textstyle\sum_{p \in P_{\mathrm{in} \to j}} w_p \qquad (8)$$
sum(paths from $j$ to $\mathrm{out}$):
$$\pi_{j \to \mathrm{out}} := \textstyle\sum_{p \in P_{j \to \mathrm{out}}} w_p \qquad (9)$$
sum(paths from $\mathrm{in}$ to $\mathrm{out}$):
$$\pi_{\mathrm{in} \to \mathrm{out}} := \textstyle\sum_{p \in P_{\mathrm{in} \to \mathrm{out}}} w_p \qquad (10)$$
sum(paths avoiding $j$):
$$\pi_{\neg j} := \textstyle\sum_{p \in P_{\mathrm{in} \to \mathrm{out}},\; j \notin p} w_p \qquad (11)$$
Proposition 1 (structure of linear nets).
Let $w_j$ denote the weight vector of unit $j$ and $x_j$ the vector of outputs of the units feeding into $j$.
For a linear network as above,

1. Feedforward computation of outputs.
The output of unit $j$ is the path-sum $\pi_{\mathrm{in} \to j}$.

2. Sensitivity of network output.
The sensitivity of the network's output to unit $j$, denoted $s_j$, is the sum of the weights of all paths from unit $j$ to the output unit:
$$s_j = \pi_{j \to \mathrm{out}}. \qquad (12)$$

3. Decomposition of network output.
The output of a linear network decomposes, with respect to unit $j$, as
$$f_W(x) = s_j \cdot \langle w_j, x_j \rangle + \pi_{\neg j}. \qquad (13)$$

4. Backpropagated errors.
Let $\frac{\partial \ell}{\partial f}$ denote the derivative of $\ell$ with respect to the output of the network. The backpropagated error signal received by unit $j$ is $\delta_j = \frac{\partial \ell}{\partial f} \cdot s_j$.

5. Error gradients.
Finally,
$$\frac{\partial \ell}{\partial w_j} = \frac{\partial \ell}{\partial f} \cdot s_j \cdot x_j. \qquad (14)$$

Note that $f_W(x)$ is an affine function of the weight vector $w_j$ of unit $j$, and that neither the path-sums from $j$ nor the path-sums avoiding $j$ depend on $w_j$.
Proof.
Direct computations. ∎
The output of a linear neural network is a polynomial function of its weights. This can be seen from the pathsum perspective by noting that the output of a linear net is the sum over all paths from the input layer to the output unit.
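The path-sum perspective can be checked numerically. In the sketch below (an illustrative two-layer linear net with arbitrarily chosen weights and input), the matrix-product view of the output coincides with the sum of path weights over all input-to-output paths.

```python
import numpy as np
from itertools import product

# Illustrative two-layer linear net: output = W2 @ W1 @ x.
W1 = np.array([[1.0, -0.5], [0.3, 0.8]])
W2 = np.array([[0.7, -0.2]])
x = np.array([2.0, 1.0])

# Matrix-product view of the output.
matrix_output = float(W2 @ W1 @ x)

# Path-sum view: each path picks one input unit i and one hidden unit h;
# its weight is the input value times the product of edge weights along it.
path_sum = sum(x[i] * W1[h, i] * W2[0, h]
               for i, h in product(range(2), range(2)))
```

The agreement of the two quantities is exactly the statement that the output is the sum over all paths from the input layer to the output unit.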
2.3 Game theory and online learning
In this subsection we reformulate linear neural networks as convex games.
Definition 2 (convex game).
A convex game consists of a set of players $[N] = \{1, \dots, N\}$, action sets $A_1, \dots, A_N$, and a loss vector $\ell = (\ell_1, \dots, \ell_N)$. Player $i$ picks actions from a convex compact set $A_i \subset \mathbb{R}^{d_i}$. Player $i$'s loss $\ell_i(a_1, \dots, a_N)$ is convex in the $i$-th argument.
The classical games of von Neumann and Morgenstern [66] are a special case of convex games where $A_i$ is the probability simplex over a finite set of actions available to each agent $i$, and the loss is multilinear.
It is well known that, even in the linear case, the loss of a neural network is not a convex function of its weights or of the weights of its individual layers. However, the loss of a linear network is a convex function of the weights of the individual units.
Proposition 2 (linear networks are convex games).
Linear networks are convex games: the players correspond to the units, where we impose that the weight vector of each unit $j$ is chosen from a compact, convex set $A_j \subset \mathbb{R}^{d_j}$, where $d_j$ is the in-degree of unit $j$.
Let $w_{-j}$ denote the set of all weights in the neural network except those of unit $j$. Define the loss of unit $j$ as $\ell_j(w_j, w_{-j}) := \ell(f_W(x), y)$. Then $\ell_j$ is a convex function of $w_j$ for all $j$.
Proof.
Note that the loss of every unit is the same and corresponds to the loss of the network as a whole; the notation $\ell_j(w_j, w_{-j})$ is introduced to emphasize the relevant parameters. By proposition 1.3, the loss can be written
$$\ell_j(w_j, w_{-j}) = \ell\big(s_j \cdot \langle w_j, x_j \rangle + \pi_{\neg j},\; y\big), \qquad (15)$$
where $s_j$, $x_j$ and $\pi_{\neg j}$ are functions of $w_{-j}$ and the input, and so constant with respect to $w_j$. It follows that the loss is the composite of an affine function of $w_j$ with a convex function, and so is convex. ∎
Remark 1 (any neural network is a game).
Any neural network can be reformulated as a game by treating the individual units as players. However, in general the loss will not be a convex function of the players’ actions and so convergence guarantees are not available. The main conceptual contribution of the paper is to show that modern convnets form a class of games which, although not convex, are close enough that convergence results from game theory can be adapted to the setting.
As a concrete example, consider a network equipped with the mean-square error. The loss of unit $j$ is
$$\ell_j(w_j, w_{-j}) = \frac{1}{2}\big(s_j \cdot \langle w_j, x_j \rangle + \pi_{\neg j} - y\big)^2. \qquad (16)$$
Define the residue $r := y - \pi_{\neg j}$. Unit $j$'s loss can be rewritten
$$\ell_j(w_j, w_{-j}) = \frac{1}{2}\big(s_j \cdot \langle w_j, x_j \rangle - r\big)^2. \qquad (17)$$
Thus, unit $j$ performs linear regression on the residue, amplified by a scalar $s_j$ that reflects the network's sensitivity to $j$'s output.
The goal of each player in a game is to minimize its loss. Unfortunately, this is not realistic, since the loss depends on the actions of the other players. If the game is repeated, then an attainable goal is for players to minimize their regret. A player's cumulative regret is the difference between the loss incurred over a series of plays and the loss that would have been incurred had the player consistently chosen the best play in hindsight:
$$R_i(T) := \sum_{t=1}^{T} \ell_i^t(a_i^t) - \min_{a \in A_i} \sum_{t=1}^{T} \ell_i^t(a), \qquad (18)$$
with per-round average
$$\bar R_i(T) := \frac{1}{T} R_i(T). \qquad (19)$$
An algorithm has no-regret if $\bar R_i(T) \to 0$. That is, an algorithm has no-regret (asymptotically) if $R_i(T)$ grows sublinearly in $T$. It is important to note that no-regret guarantees hold against any sequence of actions by the other players in the game, be they stochastic, adversarial, or something else. A player with no-regret plays optimally given the actions of the other players. Examples of no-regret algorithms on convex losses include online gradient descent, the exponential weights algorithm, follow the regularized leader, AdaGrad, and online mirror descent.
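No-regret learning can be illustrated concretely. The sketch below runs online gradient descent on a sequence of one-dimensional quadratic (hence strongly convex) losses and measures cumulative regret against the best fixed action in hindsight; the loss sequence, the step-size schedule, and the candidate grid are all illustrative choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.uniform(-1, 1, size=2000)        # the adversary's sequence (random here)

def run_ogd(z):
    """Online gradient descent on the convex losses l_t(w) = (w - z_t)^2,
    with step size eta_t = 1/(2t), the standard choice for 2-strongly-convex
    losses; the gradient is 2(w - z_t)."""
    w, cum_loss = 0.0, 0.0
    for t, zt in enumerate(z, start=1):
        cum_loss += (w - zt) ** 2
        w -= (1.0 / t) * (w - zt)        # eta_t * gradient = (w - z_t)/t
        w = float(np.clip(w, -1.0, 1.0)) # project back onto the action set
    return cum_loss

# Best fixed action in hindsight, approximated on a fine grid.
best_in_hindsight = min(float(((w - z) ** 2).sum()) for w in np.linspace(-1, 1, 2001))
regret = run_ogd(z) - best_in_hindsight
```

For strongly convex losses this schedule enjoys logarithmic cumulative regret, so over 2000 rounds the regret stays far below any linear growth.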
It was observed by Foster and Vohra [75] that, if players play according to no-regret online learning rules, then the average of the sequence of plays converges to a correlated equilibrium [76]. Proposition 3 below shows a more general result: no-gated-regret algorithms converge to correlated equilibrium at a rate that depends on the gated-regret.
Let us briefly recall the relevant notion of correlated equilibrium. A distribution $\mu$ over joint actions is an $\epsilon$-coarse correlated equilibrium if, for every player $i$, it holds that
$$\mathbb{E}_{a \sim \mu}\big[\ell_i(a_i, a_{-i})\big] \le \min_{a' \in A_i} \mathbb{E}_{a \sim \mu}\big[\ell_i(a', a_{-i})\big] + \epsilon. \qquad (20)$$
When $\epsilon = 0$ we refer to a coarse correlated equilibrium. The term $\epsilon$ in Eq. (20) quantifies the deviation of $\mu$ from a coarse correlated equilibrium [77]. The notion of correlated equilibrium is weaker than Nash equilibrium: the set of correlated equilibria contains the convex hull of the set of Nash equilibria as a subset.
We thus have two perspectives on linear nets: as networks or as games. To train a network, we use algorithms such as gradient descent implemented via backpropagation. To play a game, the players use no-regret algorithms. Sections 3 and 4 show the two perspectives are equivalent in the more general setting of modern convnets. In particular, correlated equilibria of games map to critical points of energy landscapes. Our strategy is then to convert results about the convergence of convex games to correlated equilibria into results about the convergence of backpropagation on neural nets.
3 Gated Games and Convolutional Networks
This section presents a detailed analysis of rectifier nets. The key observation is that rectifiers act as gates, which leads directly to gated games. Gated games are not convex. However, they are close enough that results on convergence to correlated equilibria can easily be adapted to the setting.
The main technical work of the section is to introduce notation to handle the interaction between path-sums and gates. Path-sum games are then introduced as a class of gated games capturing the dynamics of rectifier nets, see proposition 5. Finally, we show how to extend the results to convnets.
3.1 Rectifier networks
Historically, neural networks typically used sigmoid or tanh nonlinearities. Alternatives were investigated by Jarrett et al in [11], who found that rectifiers often perform much better than sigmoids in practice. Rectifiers, $\rho(a) := \max(a, 0)$, are now the default nonlinearity in convnets [12, 13, 14, 15, 16]. There are many variants on the theme, including noisy rectifiers, which add zero-mean noise to the pre-activation before rectifying, and leaky rectifiers
$$\rho_\alpha(a) := \max(\alpha a, a) \quad \text{for small } \alpha > 0. \qquad (21)$$
3.1.1 Rectifiers gate error backpropagation
The rectifier $\rho(a) = \max(a, 0)$ is convex, and differentiable except at $a = 0$, with subgradient
$$\rho'(a) = \mathbb{1}_{[a > 0]} := \begin{cases} 1 & \text{if } a > 0 \\ 0 & \text{else.} \end{cases} \qquad (22)$$
The subgradient acts as an indicator function, which motivates the choice of notation. Substituting $\mathbb{1}_{[a_j > 0]}$ for $\sigma'(a_j)$ in the recursive backprop computation, Eq. (4), yields
$$\delta_j = \mathbb{1}_{[a_j > 0]} \cdot \sum_{i \in \mathrm{out}(j)} w_{ij}\, \delta_i. \qquad (23)$$
Rectifiers act as gates: the only difference between backprop in a linear network, Eq. (6), and a rectifier network, Eq. (23), is that some units are zeroed out during both the forward and backward sweeps. In the forward sweep, rectifiers zero out units which would have produced negative outputs; on the backward sweep, the rectifier subgradients zero out the exact same units by acting as indicator functions.
Zeroed out (or inactive) units do not contribute to the feedforward sweep and do not receive an error signal during backpropagation. In effect, the rectifiers select a linear subnetwork of active units for use during forward and backpropagation.
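The gating behavior can be verified directly. In the manual backprop sketch below, a hidden unit with negative pre-activation outputs zero in the forward sweep and its incoming weights receive exactly zero gradient in the backward sweep. The tiny network, its weights, and the mean-square error are arbitrary illustrative choices.

```python
import numpy as np

relu = lambda a: np.maximum(a, 0.0)
gate = lambda a: (a > 0).astype(float)   # rectifier subgradient = indicator

# Hidden unit 1 is gated off for this input: its pre-activation is negative.
W1 = np.array([[1.0, 1.0],     # unit 0: a =  3 > 0, active
               [-1.0, -1.0]])  # unit 1: a = -3 < 0, inactive
W2 = np.array([[0.5, 0.5]])
x = np.array([1.0, 2.0])
y = np.array([2.0])

a1 = W1 @ x
h = relu(a1)                   # forward sweep: the inactive unit outputs 0
f = W2 @ h                     # linear output unit
delta_out = f - y              # mean-square error gradient at the output
delta_hidden = gate(a1) * (W2.T @ delta_out).ravel()   # gated backprop, Eq. (23)
grad_W1 = np.outer(delta_hidden, x)
```

The indicator in the backward sweep zeroes out exactly the rows of the gradient belonging to units that were zeroed out in the forward sweep.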
3.2 Gated Games
We have seen that linear networks are convex games. Extending the result to rectifier networks requires generalizing convex games to the setting where only a subset of players are active on each round.
Definition 3 (gated games).
Let $2^{[N]}$ denote the powerset of the set of players $[N]$. A gated game is a convex game equipped with a gate that designates, on each round, a subset $G \in 2^{[N]}$ of players as active. Each active player incurs a convex loss $\ell_i$ that depends on its action and the actions of the other active players, i.e. $\ell_i(a_i, a_{G \setminus \{i\}})$. Inactive players incur no loss.
The gated forecaster setting, presented in section 4, formalizes the perspective of a player in a gated game.
In neural nets, rectifier functions, max-pooling, and dropout all act as forms of gates that control (deterministically or probabilistically) whether units actively respond to an input. Importantly, inactive units do not receive any error signal under backpropagation, as discussed in section 3.1.
In the gated setting, players only experience regret with respect to their actions on rounds when they are active. We therefore introduce the gated-regret
$$R_i^{\mathrm{gated}}(T) := \sum_{t :\, i \text{ active}} \ell_i^t(a_i^t) - \min_{a \in A_i} \sum_{t :\, i \text{ active}} \ell_i^t(a), \qquad (24)$$
with per-round average
$$\bar R_i^{\mathrm{gated}}(T) := \frac{1}{T_i} R_i^{\mathrm{gated}}(T), \qquad (25)$$
where $T_i$ is the number of rounds in which player $i$ is active.
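Gated-regret can be sketched as follows: a player's loss is accumulated, and the best fixed action in hindsight is computed, over active rounds only. The quadratic losses, the activity pattern, and the candidate grid standing in for the convex action set are all illustrative choices.

```python
import numpy as np

def gated_regret(losses, actions, active, candidates):
    """Cumulative loss over active rounds minus the loss of the best fixed
    action in hindsight, evaluated on the same active rounds only."""
    rounds = [t for t in range(len(losses)) if active[t]]
    incurred = sum(losses[t](actions[t]) for t in rounds)
    best = min(sum(losses[t](a) for t in rounds) for a in candidates)
    return incurred - best

# Player active on even rounds only; losses on odd rounds are ignored.
losses = [(lambda tt: (lambda a: (a - (tt % 3)) ** 2))(t) for t in range(6)]
actions = [0.5] * 6
active = [t % 2 == 0 for t in range(6)]
candidates = np.linspace(0.0, 2.0, 201)   # grid stands in for the convex set
```

Note that a permanently inactive player trivially has zero gated-regret, which is exactly the loophole discussed in Remark 2 below.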
Remark 2 (permanently inactive units).
If a player is permanently inactive then, trivially, it has no gated-regret. This suggests there is a loophole in the definition that players can exploit. We make two comments. Firstly, players do not control when they are inactive; rather, they optimize over the rounds they are exposed to.
Secondly, in practice, some units in rectifier networks do become inactive. The problem is mild: rectifier nets still outperform other architectures. Reducing the number of inactive units was one of the motivations for maxout units [78].
The next step is to extend correlated equilibrium to gated games. The intuition behind correlated equilibrium is that a signal is sent to all players which guides their behavior. However, inactive players do not observe the signal. The signal received by player $i$ when active is the conditional distribution
$$\mu_i := \mu\big(\,\cdot \mid i \text{ active}\,\big). \qquad (26)$$
The following proposition extends the result that no-regret learning leads to coarse correlated equilibria from convex to gated games:
Proposition 3 (no gated-regret implies correlated equilibrium).
Let $\mathcal{G}$ be a gated game and suppose that the players follow strategies with gated-regret at most $\epsilon$. Then the empirical distribution of the actions played is an $\epsilon$-coarse correlated equilibrium.
The rate at which the gated-regret of the players decays thus controls the rate of convergence to a correlated equilibrium.
Proof.
We adapt a theorem for two-player convex games by Hazan and Kale in [79] to our setting. Since player $i$ has gated-regret at most $\epsilon$, it follows that
$$\frac{1}{T_i} \sum_{t :\, i \text{ active}} \ell_i^t(a_i^t) \;\le\; \min_{a \in A_i} \frac{1}{T_i} \sum_{t :\, i \text{ active}} \ell_i^t(a) + \epsilon. \qquad (27)$$
The empirical distribution $\hat\mu_i$ assigns probability $1/T_i$ to each joint action occurring while player $i$ is active. We can therefore rewrite the above inequality as
$$\mathbb{E}_{a \sim \hat\mu_i}\big[\ell_i(a_i, a_{-i})\big] \;\le\; \min_{a' \in A_i} \mathbb{E}_{a \sim \hat\mu_i}\big[\ell_i(a', a_{-i})\big] + \epsilon, \qquad (28)$$
and the result follows. ∎
3.3 Path-Sum Games
Let $\mathcal{G}$ be a directed acyclic graph corresponding to a rectifier neural network with $N$ units that are not input units. We provide an alternate description of the dynamics of the feedforward and feedback sweeps on the neural net in terms of path-sums. The definitions are somewhat technical; the underlying intuition can be found in the discussions of linear and rectifier networks above.
Let $d_j$ denote the in-degree of node $j$. Every edge is assigned a weight. In addition, each source node $s$ (with no incoming edges) is assigned a weight $w_s$. The weights assigned to source nodes are used to encode the input to the neural net.
Recall that given a path $p$, we write $p : a \to b$ if $p$ starts at node $a$ and finishes at node $b$. Given a set of nodes $S$, write $p \subseteq S$ if all the nodes along $p$ are elements of $S$.
Definition 4 (path-sums in rectifier nets).
The weight $w_p$ of a path $p$ is the product of the weights of the edges along the path. If a path starts at a source node $s$, then the source weight $w_s$ is included in the product.
Given a set of nodes $S$ and a node $j$, the path-sum $\pi_S(j)$ is the sum of the weights of all paths in $S$ from source nodes to $j$:
$$\pi_S(j) := \sum_{\substack{p : s \to j,\ s \text{ a source} \\ p \subseteq S}} w_p. \qquad (29)$$
By convention, $\pi_S(j)$ is zero if no such path exists (for example, if $j \notin S$).
The set of active units $A$ is defined inductively on the source-path-length $\mathrm{spl}(j)$, which tracks the length of the longest path from source units to a given unit:
$$\mathrm{spl}(j) := \max\big\{ \mathrm{length}(p) : p \text{ a path from a source node to } j \big\}. \qquad (30)$$
Source units are always active, so set $\mathrm{spl}(s) := 0$ for every source $s$ and include the sources in $A$. Suppose unit $j$ has source-path-length $n$ and the elements of $A$ with source-path-length less than $n$ have been identified. Then $j$ is active if it corresponds to

a linear unit or

a rectifier with $\pi_A(j) > 0$.

For simplicity we suppress that $A$ is a function of the weights from the notation. It is also convenient to drop the subscript via the shorthand $\pi(j) := \pi_A(j)$.
The following proposition connects active path-sums to the feedforward and feedback sweeps in a neural network:
Proposition 4 (structure of rectifier nets).
Let $A$ denote the set of active units and $\pi(j) = \pi_A(j)$ the active path-sum into unit $j$. Further, let $\pi(j \to \mathrm{out})$ denote the sum of the weights of active paths from $j$ to the output layer. Then

1. Feedforward outputs.
If inputs to the network are encoded in source weights as above, then the output of unit $j$ in the neural network is determined by $\pi(j)$. Specifically, if $j$ is linear then its output is $\pi(j)$; if $j$ is a rectifier then its output is $\max\big(0, \pi(j)\big)$.

2. Decomposition of network output.
The output of the network decomposes, with respect to an active unit $j$, as
$$f_W(x) = \pi(j \to \mathrm{out}) \cdot \langle w_j, x_j \rangle + \pi(\neg j), \qquad (31)$$
where $\pi(\neg j)$ is the sum over active paths from sources to outputs that do not intersect $j$.

3. Backpropagated errors.
Suppose the network is equipped with error function $\ell(f_W(x), y)$. Let $\frac{\partial \ell}{\partial f}$ denote the gradient of $\ell$ with respect to the output. The backpropagated error signal received by an active unit $j$ is $\delta_j = \frac{\partial \ell}{\partial f} \cdot \pi(j \to \mathrm{out})$.

4. Error gradients.
Finally, for active units
$$\frac{\partial \ell}{\partial w_j} = \frac{\partial \ell}{\partial f} \cdot \pi(j \to \mathrm{out}) \cdot x_j, \qquad (32)$$
while for inactive units
$$\frac{\partial \ell}{\partial w_j} = 0. \qquad (33)$$

Proof.
Direct computation, paralleling proposition 1. ∎
The output of a rectifier network is a piecewise polynomial function of its weights. To see this, observe that the output of a rectifier net is the sum over all active paths from the input layer to the output unit, see also [34].
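The active-path description can again be checked numerically: with a rectifier in the hidden layer, the network output equals the sum of path weights over paths through active units only. The weights and input below are arbitrary illustrative choices, picked so that one hidden unit is gated off.

```python
import numpy as np
from itertools import product

relu = lambda a: np.maximum(a, 0.0)

# Illustrative rectifier net; the input x plays the role of source weights.
W1 = np.array([[1.0, -0.5], [-0.3, -0.8]])
W2 = np.array([[0.7, -0.2]])
x = np.array([2.0, 1.0])

# Standard forward sweep.
a1 = W1 @ x
net_output = float(W2 @ relu(a1))

# Path-sum view: only paths through *active* hidden units contribute.
active = a1 > 0                       # here unit 1 has a = -1.4 and is gated off
path_sum = sum(x[i] * W1[h, i] * W2[0, h]
               for i, h in product(range(2), range(2)) if active[h])
```

Dropping the paths through the inactive unit is exactly the gating described above; on the active region the output is the same polynomial in the weights as for a linear net.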
The next step is to construct a game played by the units of the neural network. It turns out there are two ways of doing so:
Definition 5 (path-sum games).
The set of players is $\{0, 1, \dots, N\}$. The zeroth player corresponds to the environment and is always active. The environment plays labeled datapoints $(x, y)$ and suffers no loss. The remaining players correspond to the non-source units of the graph. Player $j$ plays weight vector $w_j$ in a compact convex set $A_j \subset \mathbb{R}^{d_j}$.
The losses in the two games are:

Path-sum prediction game (PSPred).
Player $j$ incurs the network loss $\ell\big(f_W(x), y\big)$ when active and no loss when inactive. 
Path-sum gradient game (PSGrad).
Player $j$ incurs the linearized loss $\langle g_j, w_j \rangle$ when active, where $g_j := \frac{\partial \ell}{\partial w_j}$, and no loss when inactive.
PSPred and PSGrad are analogs of prediction with expert advice and the hedge setting, respectively. In the hedge setting, players receive linear losses and choose actions from the simplex; in PSGrad, players receive linear losses. The results below hold for both games, although our primary interest is in PSPred. Note that PSGrad has the important property that the loss of player $j$ is a linear function of player $j$'s action when it is active:
$$\tilde\ell_j(w_j) = \Big\langle \frac{\partial \ell}{\partial w_j},\; w_j \Big\rangle. \qquad (34)$$
Finally, observe that the regret when playing PSGrad upper bounds the regret when playing PSPred, since regret bounds for linear losses are the worst case amongst convex losses.
Remark 3 (minibatch games).
It is possible to construct batch or minibatch games by allowing the environment to play sequences of moves on each round.
Proposition 5 (pathsum games are gated games).
PSPred and PSGrad are gated games if the error function is convex in its first argument. That is, rectifier nets are gated games.
The gating structure is essential; pathsum games are not convex, even for rectifiers with the mean-squared error: composing a rectifier with a quadratic can yield the nonconvex function . Even more simply, the negative of a rectifier is not convex.
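The non-convexity is easy to verify numerically; the specific function f(w) = (max(0, w) - 1)^2, a rectifier composed with a quadratic, is our own illustrative choice:

```python
# Composing a rectifier with a (convex) quadratic yields a non-convex function:
# f(w) = (max(0, w) - 1)^2 violates the midpoint inequality for convexity.
def f(w):
    return (max(0.0, w) - 1.0) ** 2

a, b = -1.0, 1.0
mid = f((a + b) / 2)            # f(0) = 1
chord = (f(a) + f(b)) / 2       # (1 + 0) / 2 = 0.5
assert mid > chord              # convexity would require mid <= chord
```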
Proof.
It is required to show that the losses under PSPred and PSGrad, that is  and , are convex functions of  when player  is active. Clearly each loss is a scalar-valued function.
By proposition 4.2, when player is active the network loss has the form
(35)  
(36) 
The terms ,  and  are all constants with respect to . Thus, the network loss is an affine transformation of  (a dot product followed by multiplication by a constant and addition of a constant) composed with a convex function, and is therefore convex.
By proposition 4.4, the gradient loss has the form
(37) 
when player is active – which is linear in since all the other terms are constants with respect to . ∎
Remark 4 (dependence of loss on other players).
We have shown that the loss is a convex function of player ’s action when player  is active. Note that: (i) the loss of player  depends on the actions chosen by other players in the game and (ii) the loss is not a convex function of the joint action of all the players. It is for these reasons that the game-theoretic analysis is essential.
The proposition does not merely hold for toy cases. The next section extends the result to maxout units, DropOut, DropConnect, and convolutional networks with shared weights and maxpooling. Proposition 5 thus applies to convolutional networks as they are used in practice. Finally, note that proposition 5 does not hold for leaky rectifier units [14] or units that are not piecewise linear, such as sigmoid or .
3.4 Convolutional Networks
We extend proposition 5 from rectifier nets to convnets.
Proposition 6 (convnets are gated games).
Let be a convolutional network with any combination of linear, rectifier, maxout and maxpooling units. Then, is a gated game.
The proof consists in identifying the relevant players and gates for each case (maxout units, maxpooling, weighttying in convolutional layers, dropout and dropconnect) in turn. We sketch the result below.
3.4.1 Maxout units
Maxout units were introduced in [78] both to complement dropout and to address the problem that rectifier units sometimes saturate at zero, leaving them insufficiently active. A maxout unit has  weight vectors and, given input , outputs
(38) 
Construct a new graph, , which has: one node per input, linear and rectifier unit; and  nodes per maxout unit. Players correspond to nodes of  and are denoted by Greek letters. The extended graph inherits its edge structure from : there is a connection between players in  iff the underlying units in  are connected. Path weights and pathsums are defined exactly as before, except that we work on  instead of . The definition of active units is modified as follows:
The set of active players for maxout units is defined inductively. Let  denote the active players with longest source-path . Source players are active ().
Player  with source-path length  is active if

it corresponds to a linear unit; or

a rectifier with ; or

a maxout unit with for all corresponding to the same maxout unit.
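A minimal sketch of the maxout gating rule above; splitting one maxout unit into k players follows the construction in the text, while the dimensions and numbers are illustrative:

```python
import numpy as np

# Hypothetical maxout unit with k weight vectors: output = max_k <w_k, x>.
# In the gated-game view the unit splits into k players, and exactly the
# argmax player is active on a given input.
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))        # k = 4 weight vectors, 3-dim input
x = rng.normal(size=3)

pre = W @ x
out = pre.max()                    # maxout output
active_player = int(pre.argmax())  # the single active player for this unit

assert out == pre[active_player]
```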
3.4.2 Maxpooling
Maxpooling is heavily used in convnets as a form of dimensionality reduction. A maxpooling unit has no parameters and outputs the maximum of the outputs of the units from which it receives inputs:
(39) 
Gates can be extended to maxpooling by adding the condition that, to be active, the output of unit must be greater than any other unit that feeds (directly) into the same pooling unit.
A unit may thus produce an output and still count as inactive because it is ignored by the maxpooling layer, and so has no effect on the output of the neural network. In particular, units that are ignored by maxpooling do not update their weights under backpropagation.
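The gating behavior of maxpooling can be sketched as follows; in particular, the (sub)gradient is zero at every ignored unit, so only the winner's weights are updated:

```python
import numpy as np

# Maxpooling as a gate: only the unit achieving the max is active; units that
# are ignored by the pooling operation receive zero (sub)gradient.
h = np.array([0.2, 1.5, 0.7])       # outputs of units feeding one pooling unit
winner = int(h.argmax())
pooled = h[winner]

# Subgradient of the pooled output w.r.t. the inputs: 1 at the winner, else 0,
# so backpropagation leaves the ignored units' weights untouched.
grad = np.zeros_like(h)
grad[winner] = 1.0

assert pooled == 1.5
assert grad.tolist() == [0.0, 1.0, 0.0]
```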
3.4.3 Convolutional layers
Units in a convolutional layer share weights. In contrast to maxout units, each of which corresponds to several players, weight-sharing units correspond to a single composite player.
Suppose that rectifier units share weight vector . Let denote active players in lower layers and define
(40) 
Component  in layer  is active if . Notice that, since composite players correspond to many units, two players may be connected by more than one edge. Player  is active if any of its components is active, i.e. if . The output of player  is the sum of its active components:
(41) 
The loss incurred by player is per Definition 5.
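A sketch of the composite player, under the assumption (ours, for illustration) of a one-dimensional convolution with five patches sharing a single filter:

```python
import numpy as np

# Composite player for a convolutional layer: several rectifier units share
# one weight vector w; the player's output is the sum of its active
# components (those with positive pre-activation).
rng = np.random.default_rng(2)
w = rng.normal(size=3)                 # shared filter weights
patches = rng.normal(size=(5, 3))      # one input patch per component

pre = patches @ w                      # pre-activation of each component
active = pre > 0
player_output = pre[active].sum()      # sum of active components
player_is_active = bool(active.any())

assert np.isclose(player_output, np.maximum(pre, 0).sum())
```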
3.4.4 Dropout and Dropconnect
Under dropout [5], units become inactive with some probability (typically ) during training. In other words, there is a stochastic component to whether or not a player is active. Gated games are easily extended to incorporate dropout by allowing gates to switch off stochastically. That is, the gating function takes values in the set of distributions over the set of units, , from which the active units are sampled.
Dropconnect is a refinement of dropout where connections, instead of units, are dropped out during training [80]. Dropconnect requires extending the notion of gate so that its range is distributions over subsets of edges instead of subsets of units: .
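Both stochastic gates can be sketched in a few lines; the drop probability p = 0.5 and all variable names are illustrative assumptions:

```python
import numpy as np

# Stochastic gating: under dropout each unit is independently switched off
# with probability p; under dropconnect individual edges are dropped instead.
rng = np.random.default_rng(3)
p = 0.5

h = rng.normal(size=8)                  # unit outputs
unit_gate = rng.random(8) >= p          # dropout: gate whole units
h_dropout = np.where(unit_gate, h, 0.0)

W = rng.normal(size=(4, 8))             # weights into the next layer
edge_gate = rng.random(W.shape) >= p    # dropconnect: gate individual edges
h_next = (W * edge_gate) @ h            # dropped edges contribute nothing

assert np.all(h_dropout[~unit_gate] == 0.0)
assert h_next.shape == (4,)
```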
4 Deep Online Convex Optimization
We now explore some implications of the connection between pathsum games and deep learning. Theorem 1 shows that the convergence rates of gated forecasters in a pathsum game (that is, a rectifier net or convnet) control the convergence rate of the network as a whole. As an immediate corollary, we obtain the first convergence rates for gradient descent on rectifier nets. A second corollary, of more conceptual than practical importance, shows that the signal underpinning the correlated equilibrium can be used to describe the representation learned by the neural network. Finally, we present an algorithm with logarithmic regret.
4.1 A local-to-global convergence guarantee
Our main result is that gated-regret controls the rate of convergence to critical points in rectifier convnets with a loss function that is convex in the output of the net. The result holds assuming weight vectors are restricted to compact convex sets. Weights are not usually hard-constrained when training neural networks, although they are frequently regularized. Hinton has recently argued that the weights of rectifier layers quickly stabilise on similar values, suggesting this is not an issue in practice (for example, see http://bit.ly/1KN8e85, starting at 24:00).
It is important to note that the theorem applies to convolutional nets as used in practice. Rectifiers have replaced sigmoids as the nonlinearity of choice because they consistently yield better empirical performance. Loss functions are convex in almost all applications: the logistic loss, hinge loss, and mean-squared error are all convex functions of the network's output.
Theorem 1 (local-to-global convergence guarantee).
Let  be a rectifier convnet trained by using backpropagation to compute gradients and a no-regret algorithm (such as gradient descent, Adagrad, or mirror descent) to update weights given the gradients.
Let  denote the gated-regret of player  after  rounds. Then, the empirical distribution of the weights arising when training the network over  rounds converges to a correlated equilibrium. That is,
(42) 
for all players . Consequently, the gated-regret of the players controls the rate of convergence of backpropagation to critical points when training rectifier nets.
An important class of games, introduced by Monderer and Shapley in [81], is the class of potential games. A game is a potential game if the loss functions of all the players arise from a single function, referred to as the potential function. Rectifier nets are gated potential games where the potential function is the loss of the network: the loss incurred by each player, when active, is the loss of the network. Potential games are more amenable to analysis and computation than general games. Local minima of the potential function are pure Nash equilibria. Moreover, simple algorithms such as fictitious play and regret-matching converge to Nash equilibria in potential games [82, 83].
Proof.
The output of a neural net is a continuous piecewise polynomial function of its weights; recall the remark after proposition 4. The potential function of a neural net is therefore the composite of a piecewise polynomial function with a convex function. It follows that no-regret algorithms will either converge to a point where the gradient is zero or to a point where the gradient does not exist. Thus, the network converges to a correlated equilibrium that is a Dirac distribution concentrated on a critical point of the loss function. ∎
The theorem provides the first rigorous justification for applying convex methods to convnets: although convnets are not convex, individual units perform convex optimizations when active. It also provides a generic conversion from regret guarantees for convex methods to convergence rates for rectifier networks. Corollaries 1 and 3 provide algorithms for which  and  respectively.
4.2 Gradient descent
A special case of theorem 1 is when the no-regret algorithm is gradient descent; see algorithm 2. The algorithm differs from standard backpropagation by introducing a projection step
(43) 
that forces the updated weight to lie in the set of actions available to player . If the diameter of  is sufficiently large then the projection step makes no difference in practice. It can be thought of as analogous to gradient clipping, which is sometimes used when training neural nets.
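A minimal sketch of the projected update, assuming (for illustration) that the action set is an L2 ball:

```python
import numpy as np

# Per-unit projected online gradient descent, assuming the action set is an
# L2 ball of radius R (a stand-in for the compact convex set in the text).
def project_l2_ball(w, radius):
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def unit_update(w, grad, step, radius):
    return project_l2_ball(w - step * grad, radius)

w = np.array([3.0, 4.0])            # norm 5, outside the unit ball
w_new = unit_update(w, np.zeros(2), 0.1, radius=1.0)
assert np.isclose(np.linalg.norm(w_new), 1.0)
```

When the ball is large enough to contain every iterate, the projection is the identity and the update coincides with plain gradient descent.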
Corollary 1 (convergence for gradient descent).
Suppose a neural network has a loss function that is convex in its output. Suppose that  has diameter . Further suppose that the backpropagated errors received by  and the inputs to  are bounded by  and  respectively.
Then unit ’s gated-regret under online gradient descent is bounded by
(44) 
where is the number of rounds where is active.
The learning rate in algorithm 2 decays according to the number of active steps  rather than the total number of steps . An important insight of the gated-game formulation is that learning only occurs on active rounds.
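A sketch of the active-round schedule; the constants D and G and the 1/sqrt(s) form follow the standard online gradient descent analysis cited below, and the exact schedule in algorithm 2 may differ:

```python
import math

# Step size decays with the unit's *active* round count s, not the global
# round count: eta = D / (G * sqrt(s)). The constants are assumptions.
def step_size(active_rounds, D=1.0, G=1.0):
    return D / (G * math.sqrt(active_rounds))

active = 0
for t in range(1, 11):
    unit_is_active = (t % 2 == 0)      # toy gating pattern: active every 2nd round
    if unit_is_active:
        active += 1
        eta = step_size(active)        # only update the schedule when active

assert active == 5
assert math.isclose(eta, 1.0 / math.sqrt(5))
```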
Proof.
By Lemma 4, weight updates under error backpropagation coincide with players performing online gradient descent, when active, on either loss in Definition 5. The gradient depends on player ’s input, upper bounded by , and its backpropagated error, upper bounded by . The result follows from a standard analysis of online gradient descent [84]. ∎
The bound is a function of constants,  and , that depend on the behavior of other units in the neural net. The dependence arises for any gradient-descent-based algorithm where the weight updates depend on the backpropagated error; the corollary precisely characterises it.
4.3 Signals and representations
Recall that a correlated equilibrium requires a signal to guide the behavior of the players. In the case of a rectifier net, the relevant signal is the empirical distribution over the joint actions of the players. As a corollary of theorem 1, we show that the signal provides a compact description of the representations learned by deep networks. There is thus a direct connection between correlated equilibria and representation learning.
Given a distribution on the set of joint actions (recall that a joint action in PSPred specifies the input to the network, its label, and every weight vector), define the expected gain of player as
(45) 
where  denotes the moves of the players other than . Note that in Eq. (45), the moves of all players except  are drawn from , which determines which players are active; player ’s move (if active) is treated as a free variable.
Let  denote the empirical distribution – or signal in game-theoretic terminology – on joint actions up to round  of a neural network trained by error backpropagation, and  the empirical signal observed by player . For notational convenience, it is useful to incorporate the learning rate, number of rounds and initial weight vector into the gain, and define
(46) 
where is the number of rounds where unit is active. We then have
Corollary 2 (signals representations).
Construct the empirical gain of unit after rounds from the signal (empirical distribution) per (46).
Then if a rectifier net implements gradient descent with fixed learning rate and unconstrained weights, it holds that

unit ’s weight vector at time is the gradient of the gain:
(47) 
the output of unit on round is the directional derivative of the gain w.r.t. ’s input:
(48)
Corollary 2 succinctly describes the representations learned by a neural network via the game-theoretic notation developed above. The corollary does not eliminate the complexity of deep representations. Rather, it demonstrates their direct dependence on the empirical signal , which is itself an extremely complicated object.
4.4 Logarithmic convergence
As a second application of theorem 1, we adapt the Online Newton Step (ONS) algorithm [46] to neural networks; see NProp in Algorithm 3. Newton’s method is computationally expensive since it involves inverting the Hessian. In particular, Online Newton Step scales as  where  is the dimension [85]. Moreover, quasi-Newton methods tend to converge on saddle points. A naive implementation of a quasi-Newton method in neural networks based on the global Hessian is therefore problematic: the number of parameters is huge, and saddle points are abundant [86, 33].
The NProp algorithm sidesteps both problems since it is implemented unitwise. The computational cost is reduced, since an approximation to a local Hessian is computed for each unit; the cost thus scales quadratically with the size of the largest layer, rather than with the size of the network. Similarly, since NProp is implemented unitwise, the Newton approximation is not exposed to the curvature of the neural net as a whole. Instead, NProp simultaneously leverages the linear structure of active pathsums and the exp-concave structure (curvature) of the external loss.
Let  be a nonempty compact convex set. A function  is exp-concave if  is a concave function of . Many commonly used loss functions are exp-concave, including the mean-squared error, , the log loss, , and the logistic loss, , for suitably restricted  and .
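Exp-concavity is straightforward to check numerically; here we verify the midpoint concavity inequality for exp(-alpha * f) with f(x) = (x - 1)^2 and alpha = 1/2 on [0, 2] (our illustrative choice of loss and constants):

```python
import math

# f(x) = (x - 1)^2 is alpha-exp-concave on [0, 2] for alpha = 1/2:
# g(x) = exp(-alpha * f(x)) satisfies the midpoint concavity inequality there.
alpha = 0.5
def g(x):
    return math.exp(-alpha * (x - 1.0) ** 2)

pts = [i / 10 for i in range(21)]      # grid on [0, 2]
for a in pts:
    for b in pts:
        assert g((a + b) / 2) >= (g(a) + g(b)) / 2 - 1e-12
```

On a larger interval the check fails, which is why exp-concavity requires suitably restricted domains, as noted above.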
Recall that
(49) 
Given vectors and , let denote their outerproduct. If and are and dimensional respectively then is a matrix.
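For instance, using NumPy's built-in outer product:

```python
import numpy as np

# The outer product of an m-vector and an n-vector is an (m x n) matrix.
u = np.array([1.0, 2.0, 3.0])       # m = 3
v = np.array([4.0, 5.0])            # n = 2
A = np.outer(u, v)                  # A[i, j] = u[i] * v[j]

assert A.shape == (3, 2)
assert A[1, 0] == 2.0 * 4.0
```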
Corollary 3 (NProp has logarithmic gated-regret).
Suppose that a neural network has a loss function that is exp-concave in its output. Suppose that  has diameter . Further suppose that the backpropagated errors and inputs to  are bounded by  and  respectively.
Then, unit ’s gated-regret under NProp is bounded by
(50) 
where is the number of rounds that is active and is its indegree.
We first prove the following lemma.
Lemma 1.
Let be an expconcave function. Suppose that is a nonempty compact convex set with , and that and are a matrix and a vector satisfying . Suppose that .
Define as . If then for all it holds that
(51)  
(52) 
Proof.
We are now ready to prove the Theorem.
Proof.
The proof follows the same logic as Theorem 2 in [46] after replacing Lemma 3 there with our Lemma 1. We omit details, except to show how the setting of Lemma 1 connects to neural networks. Let
(55) 
Let the dimensional matrix denote the outer product. By Lemma 4.2,
(56) 
Since
(57) 
we have that
(58) 
and the remainder of the argument is standard. ∎
To the best of our knowledge, NProp is the first logarithmic-regret algorithm for neural networks. NProp is computationally more efficient than second-order methods, since it does not require computing the Hessian, and there are efficient ways to iteratively compute  without directly inverting the matrix. Nevertheless, NProp’s memory usage and computational complexity are prohibitive [85]; it is worth investigating whether there are more efficient algorithms that achieve logarithmic regret in this setting, for example based on the Sketched Online Newton algorithm proposed by Luo et al. [87], which has linear runtime. Finally, NProp does not take advantage of the fact that some experts are inactive on each round, suggesting a second direction in which it may be improved.
4.5 Conditional computation
Convnets are pathsum games played between gated convex players. The criterion for activating a unit is either a max operator (rectifiers, maxout units, and maxpooling) or random (dropout and dropconnect). These criteria have been shown to work well in practice. It is nevertheless natural to ask whether they are optimal. This section introduces a framework for tackling the question.
Analyzing and optimizing the gates requires a new kind of player, the Conditional Gate (CoG), which controls when players are active. A CoG experiences regret about not activating the optimal subset of players. More precisely, a CoG activates a subset of players on each round. The CoG’s context is the weights of the players and their inputs. In PSPred, a CoG incurs scalar loss . In PSGrad, a CoG incurs loss vector .