# Deep Online Convex Optimization with Gated Games

Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout: that is, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that captures the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.


## 1 Introduction

Deep learning algorithms have yielded impressive performance across a range of tasks, including object and voice recognition [1]. The workhorse underlying deep learning is error backpropagation [2, 3, 4] – a decades old algorithm that yields state-of-the-art performance on massive labeled datasets when combined with recent innovations such as rectifiers and dropout [5].

Backprop is gradient descent plus the chain rule. Gradient descent has convergence guarantees in settings that are smooth or convex or both. However, **modern convnets are neither smooth nor convex**. Although it is well-known that convnets are not convex, it is perhaps under-emphasized that the spectacular recent results obtained by convnets on benchmarks such as ImageNet [6] rely on architectures that are not smooth. Starting with AlexNet in 2012, every winner of the ImageNet classification challenge has used rectifier (also known as rectilinear) activation functions [7, 8, 9, 10].

Rectifiers and max-pooling are non-smooth functions that are used in essentially all modern convnets [11, 12, 13, 14, 15, 16]. In fact, the representational power of rectifier nets derives precisely from their nondifferentiability: the number of nondifferentiable boundaries in the parameter space grows exponentially with depth [17]. It follows that none of the standard convergence guarantees from the optimization literature apply to modern convnets.

In this paper, we provide the first convergence rates for convolutional networks with rectifiers and max-pooling. To do so we introduce a new class of gated games which generalize the convex games studied by Stoltz and Lugosi in [18]. We reformulate learning in convnets as gated games and adapt results on convergence to correlated equilibria from convex to gated games.

### 1.1 Open Questions in the Foundations of Deep Learning

Theoretical questions about deep learning can be loosely grouped into four categories:

1. Representational power

The set of functions approximated by neural networks has been extensively studied. Early results show that neural networks with a single hidden layer are universal function approximators [19, 20]. More recently, researchers have focused on the role of depth and rectifiers in function approximation [21, 22, 23].

2. Generalization guarantees

Standard guarantees from VC-theory apply to neural nets, although these are quite loose [24, 25]. Recent work by Hardt et al shows that the convergence rate of stochastic gradient methods has implications for generalization bounds in both convex and nonconvex settings [26]. Unfortunately, their results rely on a smoothness assumption (a function $f$ is $\beta$-smooth if $\|\nabla f(x) - \nabla f(y)\| \leq \beta \cdot \|x - y\|$ for all $x, y$; rectifiers are not $\beta$-smooth for any $\beta$) that does not hold for rectifiers or max-pooling. Thus, although suggestive, the results do not apply to modern convnets. Feng et al have initiated a promising direction based on ensemble robustness [27], although robustness cannot be evaluated analytically.

A related problem is to better understand regularization methods such as dropout [28, 29]. Regret-bounds for dropout have been found in the setting of prediction with expert advice [30]. However, it is unclear how to extend these results to neural nets.

3. Local and global minima

The third problem is to understand how far the critical points found by backpropagation are from local minima and the global minimum. The problem is challenging since neural networks are not convex. There has been theoretical work studying conditions under which gradient descent converges to local minima on nonconvex problems [31, 32]. The assumptions required for these results are quite strong, and include smoothness assumptions that do not hold for rectifiers. It has also been observed that saddles slow down training, even when the algorithm does not converge to a saddle point; designing algorithms that avoid saddles is an area of active research [33].

Recent work by Choromanska et al suggests that most local optima in neural nets have error rates that are reasonably close to the global optimum [34, 35]. Searching for good local optima may therefore be of less practical importance than ensuring rapid convergence.

4. Convergence rates

The last problem, and the focus of this paper, is to understand the convergence of gradient-based methods on neural networks. Speeding up the training time of neural nets is a problem of enormous practical importance. Although there is a large body of empirical work on optimizing neural nets, there are no theoretical guarantees that apply to the methods used to train rectifier convnets since they are neither smooth nor convex.

Recent work has investigated the convergence of proximal methods for nonconvex nonsmooth problems [36, 37]. However, computing prox-operators appears infeasible for neural nets. Interesting results have been derived for variance-reduced gradient optimization in the nonconvex setting [38, 39, 40], although smoothness is still required.

### 1.2 Outline

Training modern convnets, with rectifiers and max-pooling, entails searching over a rich subset of a universal class of function approximators with a loss function that is neither smooth nor convex. There is little hope of obtaining useful convergence results at this level of generality. It is therefore necessary to utilize the structure of rectifier networks.

Our strategy is to decompose neural nets into interacting optimizers that are easier to analyze individually than the net as a whole. In short, the strategy is to import techniques from game theory into deep learning.

#### 1.2.1 Utilizing network structure

We make two observations about neural network structure. The first, section 2, is to reformulate linear networks as convex games where the players are the units in the network. Although the loss is not a convex function of the weights of the network, it is a convex function of the weights of the individual players. The observation connects the dynamics of games under no-regret learning to the dynamics of linear networks under backpropagation.

Linear networks are a special case. It is natural to ask whether neural networks with nonlinearities are also convex games. Unfortunately, the answer is no: introducing any nonlinearity breaks convexity (in short, the nonlinearity would have to be affine, since convexity must be preserved under all linear combinations of the nonlinearity, including its negation). Although the situation seems hopeless it turns out, remarkably, that game-theoretic convergence results can be imported – despite nonconvexity – for precisely the nonlinearities used in modern convnets.

The second observation, section 3.1, is that a rectifier network is a linear network equipped with gates that control whether units are active for a given input. If a unit is inactive during the feedforward sweep, it is also inactive during backpropagation, and therefore does not update its weights. This motivates generalizing convex games to gated games.

#### 1.2.2 Gated games

In a classical game, each player chooses a series of actions and, on each round, incurs a convex loss. The regret of a player is the difference between its cumulative loss and what the cumulative loss would have been had the player chosen the best action in hindsight [41]. Players can be implemented with so-called no-regret algorithms that minimize their loss relative to the best action in hindsight. More precisely, a no-regret algorithm has sublinear cumulative regret. The regret per round therefore vanishes asymptotically.

Section 3 introduces gated games where players only incur a convex loss on rounds for which they are active. After extending the definitions of regret and correlated equilibrium to gated games, proposition 3 shows that if players follow no-gated-regret strategies, then they converge to a correlated equilibrium. Gated players generalize the sleepy experts introduced by Blum in [42], see also [43].

A useful technical tool is path-sum games. These are games constructed over directed acyclic graphs with weighted edges. Lemma 4 shows that path-sums encode the dynamics of the feedforward and feedback sweeps of rectifier nets. Proposition 5 shows that path-sum games are gated games and proposition 6 extends the result to convolutional networks.

#### 1.2.3 Summary of contribution

The main contributions of the paper are as follows:

1. Theorem 1: Rectifier convnets converge to a critical point under backpropagation at a rate controlled by the gated-regret of the units in the network.

Corollary 1 specializes the result to gradient descent. To the best of our knowledge, there are no previous convergence rates applicable to neural nets with rectifier nonlinearities and max-pooling. Finding conditions that guarantee convergence to local minima is deferred to future work.

The results derive from a detailed analysis of the internal structure of rectifier nets and their updates under backpropagation. They require no new ideas regarding optimization in general. Our methods provide the first rigorous explanation for how methods designed for convex optimization improve convergence rates on modern convnets. The results do not apply to all neural networks: they hold for precisely the neural networks that perform best empirically [7, 8, 9, 10, 11, 12, 13, 14, 15, 16].

The philosophy underlying the paper is to decompose training neural nets into two distinct tasks: communication and optimization. Communication is handled by backpropagation which sends the correct gradient information to players (units) in the net. Optimization is handled locally by the individual players. Note that although this philosophy is widely applied when designing and implementing neural nets, it has been under-utilized in the analysis of neural nets. The role of players in a convnet is encapsulated in the Gated Forecaster setting, Section 4. Our results provide a dictionary that translates the guarantees applicable to any no-regret algorithm into a convergence rate for the network as a whole.

2. Reformulate neural networks as games.

The primary conceptual contribution of the paper is to connect game theory to deep learning. An interesting consequence of the main result is corollary 2 which provides a compact description of the weights learned by a neural network via the signal underlying correlated equilibrium. More generally, neural nets are a basic example of a game with a structured communication protocol (the path-sums) which determines how players interact [44]. It may be fruitful to investigate broader classes of structured games.

It has been suggested that rectifiers perform well because they are nonnegative homogeneous which has implications for regularization [45] and robustness to changes in initialization [23]. Our results provide a complementary explanation. Rectifiers simultaneously (i) introduce a nonlinearity into neural nets providing them with enormous representational power and (ii) act as gates that select subnetworks of an underlying linear neural network, so that convex methods are applicable with guarantees.

3. Logarithmic regret algorithm.

As a concrete application of the gated forecaster framework, we adapt the Online Newton Step algorithm [46] to neural nets and show it has logarithmic-regret, corollary 3. The resulting algorithm approximates Newton’s method locally at the level of individual units of the network – rather than globally for the network as a whole. The local, unit-wise implementation reduces computational complexity and sidesteps the tendency of quasi-Newton methods to approach saddle points.

4. Conditional computation.

A secondary conceptual contribution is to introduce a framework for conditional computation. Up to this point, we assumed the gate is a fixed property of the game. Concretely, gates correspond to rectifiers and max-pooling in convolutional networks – which are baked into the architecture before the network is exposed to data. It is natural to consider optimizing when players in a gated game are active, section 4.5.

Recent work along these lines has applied reinforcement learning algorithms to find data-dependent dropout policies [47, 48]. Conditional computation is closely related to models of attention [49, 50]. Slightly further afield, long short-term memories (LSTMs) and Gated Recurrent Units (GRUs) use complicated sets of sigmoid-gates to control activity within memory cells [51, 52]. Unfortunately the resulting architectures are difficult to analyze; see [53] for a principled simplification of recurrent neural network architectures motivated by similar considerations to the present paper.

As a first step towards analyzing conditional computation in neural nets, we introduce the Conditional Gate (CoG) setting. CoGs are contextual bandits or contextual partial monitors that optimize when sets of units are active. CoGs are a second class of players that can be introduced into neural games, and may provide a useful tool when designing deep learning algorithms.

#### 1.2.4 Caveat

Neural nets are typically trained on minibatches sampled i.i.d. from a dataset. In contrast, the analysis below provides guarantees in adversarial settings. Our results are therefore conservative. Extending the analysis to take advantage of stochastic settings is an important open problem. However, it is worth mentioning that neural nets are increasingly applied to data that is not i.i.d. sampled. For example, adversarially trained generative networks have achieved impressive performance [54, 55]. Similarly, there has been spectacular progress applying neural nets to reinforcement learning [56, 57].

Activity within a neural network is not i.i.d. even when the inputs are, a phenomenon known as internal covariate shift [58]. Two relevant developments are batch-normalization [58] and optimistic mirror descent [59, 60, 61]. Batch normalization significantly reduces the training time of neural nets by actively reducing internal covariate shift. Optimistic mirror descent takes advantage of the fact that all players in a game are implementing no-regret learning to speed up convergence. It is interesting to investigate whether reducing internal covariate shift can be understood game-theoretically, and whether optimistic learning algorithms can be adapted to neural networks.

### 1.3 Related work

A number of papers have brought techniques from convex optimization into the analysis of neural networks. A line of work initiated by Bengio in [62] shows that allowing the learning algorithm to choose the number of hidden units can convert neural network optimization into a convex problem, see also [63]. A convex multi-layer architecture is developed in [64, 65]. Although these methods are interesting, they have not achieved the practical success of convnets. In this paper, we analyze convnets as they are rather than proposing a more tractable, but potentially less useful, model.

Game theory was developed to model interactions between humans [66]. However, it may be more directly applicable as a toolbox for analyzing machina economicus – that is, interacting populations of algorithms that are optimizing objective functions [67]. We go one step further, and develop a game-theoretic analysis of the internal structure of backpropagation.

The idea of decomposing deep learning algorithms into cooperating modules dates back to at least the work of Bottou [68]. A related line of work modeling biological neural networks from a game-theoretic perspective can be found in [69, 70, 71, 72].

## 2 Warmup: Linear Networks

The paper combines disparate techniques and notation from game theory, convex optimization and deep learning, and is therefore somewhat dense. To get oriented, we start with linear neural networks. Linear nets provide a simple but nontrivial worked example. They are not convex. Their energy landscapes and dynamics under backpropagation have been extensively studied and turn out to be surprisingly intricate [73, 74].

### 2.1 Neural networks

Consider a neural network with $L$ hidden layers. Let $x$ denote the input to the network. For each layer $l$, set $a_l = W_l \cdot x_{l-1}$ and $x_l = s(a_l)$, where $x_0 = x$ and $s$ is a (typically nonlinear) function applied coordinatewise. Let $W = \{W_1, \ldots, W_L\}$ denote the set of weight matrices. A convenient shorthand for the output of the network is $f_W(x)$.
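As a minimal sketch of this layer recursion (the two-layer shapes and the choice $s = \tanh$ are our illustrative assumptions, not fixed by the text):

```python
import numpy as np

def forward(weights, x, s=np.tanh):
    """Feedforward sweep: a_l = W_l @ x_{l-1}, x_l = s(a_l)."""
    for W in weights:
        x = s(W @ x)
    return x

# Toy network: 3 inputs -> 4 hidden units -> 2 outputs.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
x0 = rng.standard_normal(3)
out = forward(weights, x0)
assert out.shape == (2,)
```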

For simplicity, suppose the output layer consists of a single unit, so the output is a scalar (the assumption is dropped in section 3). Let $(x, y)$ denote a sample of labeled data and let $\ell(f_W(x), y)$ be a loss function that is convex in the first argument. Training the neural network reduces to solving the optimization problem

$$W^* = \operatorname*{argmin}_{W} \mathbb{E}_{(x,y)\sim \hat{P}}\big[\ell(f_W(x), y)\big], \tag{1}$$

where $\hat{P}$ is the empirical distribution over the data. Training is typically performed using gradient descent

$$W \leftarrow W - \mathbb{E}_{(x,y)\sim \hat{P}}\big[\eta \cdot \nabla_W \ell(f_W(x), y)\big]. \tag{2}$$

Since the objective in Eq. (1) is not a convex function of $W$, there are no guarantees that gradient descent will find the global minimum.

#### 2.1.1 Backprop

Let us recall how backprop is extracted from gradient descent. Subscripts now refer to units, not layers. Setting $\mathcal{E} := \ell(f_W(x), y)$, Eq. (2) can be written more concretely for a single weight $w_{ij}$ as

$$w_{ij} \leftarrow w_{ij} - \eta \cdot \frac{\partial \mathcal{E}}{\partial w_{ij}}. \tag{3}$$

By the chain rule the derivative decomposes as $\frac{\partial \mathcal{E}}{\partial w_{ij}} = \delta_j \cdot h_i$, where $h_i$ is the output of unit $i$ and $\delta_j := \frac{\partial \mathcal{E}}{\partial a_j}$ is the backpropagated error. Backprop computes $\delta_j$ recursively via

 ∂E∂ajδj=∑{k:j→k}δk∂E∂ak⋅∂ak∂ajδj=∑{k:j→k}δk⋅wjkh′jδj (4)

### 2.2 Linear networks

In a linear network, the function $s$ is the identity. It follows that the output of the network is

$$f(x) = f_W(x) = \Big(\prod_{l=1}^{L} W_l\Big)\cdot x. \tag{5}$$

For linear networks, the backpropagation computation in Eq. (4) reduces to

$$\delta_j = \sum_{\{k:\, j\to k\}} \delta_k \cdot w_{jk}. \tag{6}$$

It is convenient for our purposes to decompose $\delta_j$ slightly differently, by factorizing it into $\beta := \frac{\partial \mathcal{E}}{\partial f}$, the derivative of the loss with respect to the output, and $\frac{\partial f}{\partial a_j}$, the sensitivity of the network’s output to unit $j$:

$$\delta_j = \frac{\partial \mathcal{E}}{\partial a_j} = \frac{\partial \mathcal{E}}{\partial f} \cdot \frac{\partial f}{\partial a_j} = \beta \cdot \frac{\partial f}{\partial a_j}. \tag{7}$$

We now reformulate the forward- and back-propagation in linear nets in terms of path-sums [34, 72]:

###### Definition 1 (path-sums in linear nets).

A path is a directed sequence of edges connecting two units. Let $(i \leadsto j)$ denote the set of paths from unit $i$ to unit $j$. The weight of a path is the product of the weights of the edges along the path.

• sum of paths from unit $i$ to unit $j$:

$$\sigma_{i\leadsto j} := \sum_{\rho\in(i\leadsto j)} \operatorname{weight}(\rho) \tag{8}$$

• sum of paths from the input layer to unit $j$:

$$\sigma_{\bullet\leadsto j} := \sum_{s\in\mathrm{in}} x_s \cdot \Bigg(\sum_{\rho\in(s\leadsto j)} \operatorname{weight}(\rho)\Bigg) \tag{9}$$

• sum of paths from unit $j$ to the output unit:

$$\sigma_{j\leadsto\bullet} := \sum_{\rho\in(j\leadsto\bullet)} \operatorname{weight}(\rho) \tag{10}$$

• sum of paths from the input layer to the output that avoid unit $j$:

$$\sigma_{-j} := \sum_{s\in\mathrm{in}} x_s \cdot \Bigg(\sum_{\{\rho\,|\,\rho\in(s\leadsto\bullet)\text{ and } j\notin\rho\}} \operatorname{weight}(\rho)\Bigg) \tag{11}$$
###### Proposition 1 (structure of linear nets).

Let $\sigma_{\mathrm{in}(j)} := (\sigma_{\bullet\leadsto i})_{i\in\mathrm{in}(j)}$ denote the vector of path-sums feeding into unit $j$. For a linear network as above,

1. Feedforward computation of outputs.
The output of unit $j$ is $a_j = \langle w_j, \sigma_{\mathrm{in}(j)}\rangle$.

2. Sensitivity of network output.
The sensitivity of the network’s output to unit $j$, denoted $\frac{\partial f_W}{\partial a_j}$, is the sum of the weights of all paths from unit $j$ to the output unit:

$$\frac{\partial f_W}{\partial a_j} = \sigma_{j\leadsto\bullet}. \tag{12}$$

3. Decomposition of network output.
The output of a linear network decomposes, with respect to unit $j$, as

$$f_W(x) = \langle w_j, \sigma_{\mathrm{in}(j)}\rangle \cdot \sigma_{j\leadsto\bullet} + \sigma_{-j} = a_j \cdot \sigma_{j\leadsto\bullet} + \sigma_{-j}. \tag{13}$$

4. Backpropagated errors.
Let $\beta$ denote the derivative of the loss with respect to the output of the network. The backpropagated error signal received by unit $j$ is $\delta_j = \beta \cdot \sigma_{j\leadsto\bullet}$.

Finally,

$$\nabla_{w_j}\mathcal{E} = \delta_j \cdot \sigma_{\mathrm{in}(j)} = \beta \cdot \sigma_{j\leadsto\bullet} \cdot \sigma_{\mathrm{in}(j)}. \tag{14}$$

Note that $f_W(x)$ is a linear function of the weight vector $w_j$ of unit $j$, and that neither the path-sums from $j$ nor the path-sums avoiding $j$ depend on $w_j$.

###### Proof.

Direct computations. ∎

The output of a linear neural network is a polynomial function of its weights. This can be seen from the path-sum perspective by noting that the output of a linear net is the sum over all paths from the input layer to the output unit.
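The path-sum perspective is easy to check numerically. The following sketch (our toy one-hidden-layer linear net, not an example from the text) verifies that the matrix product in Eq. (5) agrees with the sum over weighted paths:

```python
import numpy as np

rng = np.random.default_rng(2)
W1, W2 = rng.standard_normal((2, 3)), rng.standard_normal((1, 2))
x = rng.standard_normal(3)

# Matrix form of the linear network output (Eq. 5): f(x) = W2 @ W1 @ x.
matrix_out = float(W2 @ W1 @ x)

# Path-sum form: sum over every path s -> h -> output of
# x_s * weight(s -> h) * weight(h -> output).
path_out = sum(
    x[s] * W1[h, s] * W2[0, h]
    for s in range(3)
    for h in range(2)
)
assert np.isclose(matrix_out, path_out)
```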

### 2.3 Game theory and online learning

In this subsection we reformulate linear neural networks as convex games.

###### Definition 2 (convex game).

A convex game consists of a set of players $[N] = \{1,\ldots,N\}$, actions, and losses $(\ell_1,\ldots,\ell_N)$. Player $j$ picks actions from a convex compact set $H_j$. Player $j$’s loss $\ell_j(w_1,\ldots,w_N)$ is convex in the $j^{\text{th}}$ argument.

The classical games of von Neumann and Morgenstern [66] are a special case of convex games where $H_j$ is the probability simplex over a finite set of actions available to each agent $j$, and the loss is multilinear.

It is well known that, even in the linear case, the loss of a neural network is not a convex function of its weights or the weights of its individual layers. However, the loss of a linear network is a convex function of the weights of the individual units.

###### Proposition 2 (linear networks are convex games).

The players correspond to the units, where we impose that weight vectors $w_j$ are chosen from a compact, convex set $H_j \subset \mathbb{R}^{d_j}$ for each unit $j$, where $d_j$ is the in-degree of unit $j$.

Let $w_{-j}$ denote the set of all weights in a neural network except those of unit $j$. Define the loss of unit $j$ as $\ell_j(w_j, w_{-j}, x, y) := \ell(f_W(x), y)$. Then $\ell_j$ is a convex function of $w_j$ for all $j$.

###### Proof.

Note that the loss of every unit is the same and corresponds to the loss of the network as a whole; the notation $\ell_j$ is introduced to emphasize the relevant parameters. By proposition 1.3, the loss can be written

$$\ell_j(w_j, w_{-j}, x, y) = \ell\big(\langle w_j, \sigma_{\mathrm{in}(j)}\rangle \cdot \sigma_{j\leadsto\bullet} + \sigma_{-j},\; y\big), \tag{15}$$

where $\sigma_{\mathrm{in}(j)}$, $\sigma_{j\leadsto\bullet}$ and $\sigma_{-j}$ are functions of $w_{-j}$ and $x$, and so constant with respect to $w_j$. It follows that the loss is the composite of an affine function of $w_j$ with a convex function, and so is convex. ∎
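The fact underlying the proof, namely that the network output is affine in any single unit's weight vector, can be checked numerically. The toy two-layer linear network below is our construction, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(3)
W1, W2 = rng.standard_normal((2, 3)), rng.standard_normal((1, 2))
x = rng.standard_normal(3)

def f(W1, W2, x):
    """Output of the two-layer linear network."""
    return float(W2 @ (W1 @ x))

# Vary only the incoming weights of hidden unit 0 (row 0 of W1)
# along a fixed direction d, in equal steps.
d = rng.standard_normal(3)
outs = []
for t in (0.0, 1.0, 2.0):
    W1t = W1.copy()
    W1t[0] += t * d
    outs.append(f(W1t, W2, x))

# Affine in w_0: equal steps in w_0 produce equal steps in the output.
assert np.isclose(outs[1] - outs[0], outs[2] - outs[1])
```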

###### Remark 1 (any neural network is a game).

Any neural network can be reformulated as a game by treating the individual units as players. However, in general the loss will not be a convex function of the players’ actions and so convergence guarantees are not available. The main conceptual contribution of the paper is to show that modern convnets form a class of games which, although not convex, are close enough that convergence results from game theory can be adapted to the setting.

As a concrete example, consider a network equipped with the mean-square error. The loss of unit $j$ is

$$\ell_j(w) = \big(\sigma_{j\leadsto\bullet} \cdot \langle w_j, \sigma_{\mathrm{in}(j)}\rangle - (y - \sigma_{-j})\big)^2. \tag{16}$$

Define the residue $y_{-j} := \frac{y - \sigma_{-j}}{\sigma_{j\leadsto\bullet}}$. Unit $j$’s loss can be rewritten

$$\ell_j(w) = \begin{cases} \sigma_{j\leadsto\bullet}^2 \cdot \big(\langle w_j, \sigma_{\mathrm{in}(j)}\rangle - y_{-j}\big)^2 & \text{if } \sigma_{j\leadsto\bullet} \neq 0 \\ 0 & \text{else.} \end{cases} \tag{17}$$

Thus, unit $j$ performs linear regression on the residue, amplified by a scalar that reflects the network’s sensitivity to unit $j$’s output.

The goal of each player $j$ in a game is to minimize its loss $\ell_j$. Unfortunately, this is not realistic, since $\ell_j$ depends on the actions of other players. If the game is repeated, then an attainable goal is for players to minimize their regret. A player’s cumulative regret is the difference between the loss incurred over a series of plays and the loss that would have been incurred had the player consistently chosen the best play in hindsight:

$$\operatorname{Regret}_j(T) = \sup_{w_j\in H_j} \frac{1}{T}\sum_{t=1}^{T}\Big(\ell_j(w_1^t,\ldots,w_j^t,\ldots,w_N^t) - \ell_j(w_1^t,\ldots,w_j,\ldots,w_N^t)\Big). \tag{18}$$

An algorithm has no-regret if $\operatorname{Regret}_j(T) \to 0$ as $T \to \infty$; that is, an algorithm has no-regret (asymptotically) if its cumulative regret grows sublinearly in $T$. It is important to note that no-regret guarantees hold against any sequence of actions by the other players in the game – be they stochastic, adversarial, or something else. A player with no-regret plays optimally given the actions of the other players. Examples of no-regret algorithms on convex losses include online gradient descent, the exponential weights algorithm, follow the regularized leader, AdaGrad, and online mirror descent.
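As a sketch of the no-regret property, the following toy experiment runs projected online gradient descent on a sequence of one-dimensional quadratic losses and checks that the average regret against the best fixed action in hindsight is small (the loss sequence, step sizes, and threshold are our choices):

```python
import numpy as np

rng = np.random.default_rng(4)
T = 2000
targets = rng.uniform(-1, 1, size=T)   # an arbitrary sequence of losses (w - z_t)^2

w, played = 0.0, []
for t in range(1, T + 1):
    played.append(w)
    grad = 2 * (w - targets[t - 1])    # gradient of the round-t loss
    # Projected OGD step with decaying step size 1/sqrt(t).
    w = float(np.clip(w - grad / np.sqrt(t), -1.0, 1.0))

losses = (np.array(played) - targets) ** 2
best_w = targets.mean()                # best fixed action in hindsight
best_losses = (best_w - targets) ** 2
avg_regret = (losses.sum() - best_losses.sum()) / T
assert avg_regret < 0.2                # per-round regret is small, i.e. sublinear in T
```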

It was observed by Foster and Vohra [75] that, if players play according to no-regret online learning rules, then the average of the sequence of plays converges to a correlated equilibrium [76]. Proposition 3 below shows a more general result: no-gated-regret algorithms converge to correlated equilibrium at a rate that depends on the gated-regret.

Let us briefly recall the relevant notion of correlated equilibrium. A distribution $P$ is an $\epsilon$-coarse correlated equilibrium if, for every player $j$ and every $w_j \in H_j$, it holds that

$$\mathbb{E}_{w\sim P}\big[\ell_j(w)\big] \le \mathbb{E}_{w\sim P}\big[\ell_j(w_j, w_{-j})\big] + \epsilon. \tag{20}$$

When $\epsilon = 0$ we refer to a coarse correlated equilibrium. The $\epsilon$-term in Eq. (20) quantifies the deviation of $P$ from a coarse correlated equilibrium [77]. The notion of correlated equilibrium is weaker than Nash equilibrium. The set of correlated equilibria contains the convex hull of the set of Nash equilibria as a subset.

We thus have two perspectives on linear nets: as networks or as games. To train a network, we use algorithms such as gradient descent implemented via backpropagation. To play a game, the players use no-regret algorithms. Sections 3 and 4 show the two perspectives are equivalent in the more general setting of modern convnets. In particular, correlated equilibria of games map to critical points of energy landscapes. Our strategy is then to convert results about the convergence of convex games to correlated equilibria into results about the convergence of backpropagation on neural nets.

## 3 Gated Games and Convolutional Networks

This section presents a detailed analysis of rectifier nets. The key observation is that rectifiers act as gates, which leads directly to gated games. Gated games are not convex. However, they are close enough that results on convergence to correlated equilibria can easily be adapted to the setting.

The main technical work of the section is to introduce notation to handle the interaction between path-sums and gates. Path-sum games are then introduced as a class of gated games capturing the dynamics of rectifier nets, see proposition 5. Finally, we show how to extend the results to convnets.

### 3.1 Rectifier networks

Historically, neural networks typically used sigmoid or tanh nonlinearities. Alternatives were investigated by Jarrett et al in [11], who found that rectifiers often perform much better than sigmoids in practice. Rectifiers are now the default nonlinearity in convnets [12, 13, 14, 15, 16]. There are many variants on the theme, including noisy rectifiers, where Gaussian noise is added before thresholding, and leaky rectifiers

$$\rho_L(a) = \begin{cases} a & \text{if } a > 0 \\ 0.01\,a & \text{else,} \end{cases} \tag{21}$$

introduced in [12] and [14] respectively.
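For concreteness, the plain and leaky rectifiers can be written as (a minimal sketch; the function names are ours):

```python
import numpy as np

def rectifier(a):
    """Plain rectifier: rho(a) = max(a, 0)."""
    return np.maximum(a, 0.0)

def leaky_rectifier(a, slope=0.01):
    """Leaky rectifier (Eq. 21): a if a > 0, else slope * a."""
    return np.where(a > 0, a, slope * a)

a = np.array([-2.0, -0.5, 0.0, 1.5])
assert np.allclose(rectifier(a), [0.0, 0.0, 0.0, 1.5])
assert np.allclose(leaky_rectifier(a), [-0.02, -0.005, 0.0, 1.5])
```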

#### 3.1.1 Rectifiers gate error backpropagation

The rectifier $\rho(a) := \max(a, 0)$ is convex, and differentiable except at $a = 0$, with subgradient

$$\mathbb{1}(a) = \frac{d\rho}{da} = \begin{cases} 1 & \text{if } a \ge 0 \\ 0 & \text{else.} \end{cases} \tag{22}$$

The subgradient acts as an indicator function, which motivates the choice of notation. Substituting $\mathbb{1}_j := \mathbb{1}(a_j)$ for $h'_j$ in the recursive backprop computation, Eq. (4), yields

$$\delta_j = \frac{\partial \ell}{\partial a_j} = \sum_{\{k:\, j\to k\}} \delta_k \cdot w_{jk} \cdot \mathbb{1}_j. \tag{23}$$

Rectifiers act as gates: the only difference between backprop in a linear network, Eq. (6), and a rectifier network, Eq. (23), is that some units are zeroed out during both the forward and backward sweeps. In the forward sweep, rectifiers zero out units which would have produced negative outputs; on the backward sweep, the rectifier subgradients zero out the exact same units by acting as indicator functions.

Zeroed out (or inactive) units do not contribute to the feedforward sweep and do not receive an error signal during backpropagation. In effect, the rectifiers select a linear subnetwork of active units for use during forward and backpropagation.
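The subnetwork-selection view is easy to verify numerically: a rectifier forward sweep agrees with a linear forward sweep through the active units only, and inactive units receive zero backpropagated error. The one-hidden-layer toy network below is our construction:

```python
import numpy as np

rng = np.random.default_rng(5)
W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((1, 4))
x = rng.standard_normal(3)

# Rectifier forward sweep.
a1 = W1 @ x
rect_out = float(W2 @ np.maximum(a1, 0.0))

# The gates selected by this input: 1 for active units, 0 for inactive.
gate = (a1 > 0).astype(float)

# Same output from the *linear* subnetwork of active units.
linear_out = float(W2 @ (gate * a1))
assert np.isclose(rect_out, linear_out)

# Inactive units receive no backpropagated error (Eq. 23);
# here we take the derivative of the loss w.r.t. the output to be 1.
delta = W2.flatten() * gate
assert np.all(delta[gate == 0] == 0)
```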

### 3.2 Gated Games

We have seen that linear networks are convex games. Extending the result to rectifier networks requires generalizing convex games to the setting where only a subset of players are active on each round.

###### Definition 3 (gated games).

Let $2^{[N]}$ denote the powerset of the set of players $[N]$. A gated game is a convex game equipped with a gate $A: H_1 \times \cdots \times H_N \to 2^{[N]}$. Players in $A(w)$ are active. Each active player $j \in A(w)$ incurs a convex loss $\ell_j$ that depends on its action and the actions of the other active players; inactive players incur no loss.

The gated forecaster setting formalizes the perspective of a player in a gated game:

In neural nets, rectifier functions, max-pooling, and dropout all act as forms of gates that control (deterministically or probabilistically) whether units actively respond to an input. Importantly, inactive units do not receive any error under backpropagation, as discussed in section 3.1.

In the gated setting, players only experience regret with respect to their actions when active. We therefore introduce gated-regret:

$$\operatorname{GRegret}_i(T) = \sup_{w_i\in H_i} \frac{1}{T_i}\sum_{\{t\in[T]:\, i\in A(w^t)\}}\Big(\ell_i^t(w_i^t, w_{-i}^t) - \ell_i^t(w_i, w_{-i}^t)\Big), \tag{24}$$

where $T_i$ is the number of rounds in which player $i$ is active.

###### Remark 2 (permanently inactive units).

If a player is permanently inactive then, trivially, it has no gated-regret. This suggests there is a loophole in the definition that players can exploit. We make two comments. Firstly, players do not control when they are inactive. Rather, they optimize over the rounds they are exposed to.

Secondly, in practice, some units in rectifier networks do become inactive. The problem is mild: rectifier nets still outperform other architectures. Reducing the number of inactive units was one of the motivations for maxout units [78].

The next step is to extend correlated equilibrium to gated games. The intuition behind correlated equilibrium is that a signal is sent to all players which guides their behavior. However, inactive players do not observe the signal. The signal received by player $j$ when active is the conditional distribution

$$P_j := P(w \,|\, j\in A(w)) = \begin{cases} \dfrac{P(w)}{P(\{w \,|\, j\in A(w)\})} & \text{if } j\in A(w) \\ 0 & \text{else.} \end{cases} \tag{26}$$

The following proposition extends the result that no-regret learning leads to coarse correlated equilibria from convex to gated games:

###### Proposition 3 (no gated-regret → correlated equilibrium).

Consider a gated game and suppose that the players follow strategies with gated-regret at most $\epsilon$. Then, the empirical distribution of the actions played is an $\epsilon$-coarse correlated equilibrium.

The rate at which the gated-regret of the players decays thus controls the rate of convergence to a correlated equilibrium.

###### Proof.

We adapt a theorem for two-player convex games by Hazan and Kale in [79] to our setting. Since player $j$ has gated-regret $\epsilon$, it follows that

$$\frac{1}{T_j}\sum_{\{t\in[T]:\, j\in A(\mathbf{w}^t)\}}\Big(\ell_j(\mathbf{w}^t) - \ell_j(w_j, \mathbf{w}^t_{-j})\Big) \le \epsilon \qquad \forall w_j\in H_j. \tag{27}$$

The empirical distribution $\hat P_j$ assigns probability $1/T_j$ to each joint action occurring while player $j$ is active. We can therefore rewrite the above inequality as

$$\mathbb{E}_{\mathbf{w}\sim\hat P_j}\big[\ell_j(\mathbf{w})\big] - \mathbb{E}_{\mathbf{w}\sim\hat P_j}\big[\ell_j(w_j, \mathbf{w}_{-j})\big] \le \epsilon \qquad \forall w_j\in H_j, \tag{28}$$

and the result follows. ∎

### 3.3 Path-Sum Games

Let $G$ be a directed acyclic graph corresponding to a rectifier neural network with $N$ units that are not input units. We provide an alternate description of the dynamics of the feedforward and feedback sweep on the neural net in terms of path-sums. The definitions are somewhat technical; the underlying intuition can be found in the discussions of linear and rectifier networks above.

Let $d_j$ denote the indegree of node $j$. Every edge $(i\to j)$ is assigned a weight $w_{ij}$. In addition, each source node $s$ (with no incoming edges) is assigned a weight $w_s$. The weights assigned to source nodes are used to encode the input to the neural net.

Recall that, given a path $\rho$, we write $\rho\in(i\leadsto j)$ if $\rho$ starts at node $i$ and finishes at node $j$. Given a set of nodes $A$, write $\rho\subset A$ if all the nodes along $\rho$ are elements of $A$.

###### Definition 4 (path-sums in rectifier nets).

The weight of a path $\rho$ is the product of the weights of the edges along the path. If a path starts at a source node $s$, then the source weight $w_s$ is included in the product.

Given a set of nodes $A$ and a node $j$, the path-sum $\sigma^A_{\bullet\leadsto j}$ is the sum of the weights of all paths in $A$ from source nodes to $j$:

$$\sigma^A_{\bullet\leadsto j} := \sum_{s\in\mathrm{source}}\;\sum_{\{\rho\,\mid\,\rho\subset A \text{ and } \rho\in(s\leadsto j)\}} \mathrm{weight}(\rho). \tag{29}$$

By convention, $\sigma^A_{\bullet\leadsto j}$ is zero if no such path exists (for example, if $j\notin A$).

The set of active units is defined inductively on $\kappa$, which tracks the length of the longest path from source units to a given unit:

$$\text{Let } A_\kappa = \{\text{active units with longest path from a source} \le \kappa\}. \tag{30}$$

Source units are always active, so set $A_0 := \{\text{source units}\}$. Suppose unit $j$ has source-path-length $\kappa$ and the elements of $A_{\kappa-1}$ have been identified. Then $j$ is active if it corresponds to

• a linear unit or

• a rectifier with $\sigma^{A_{\kappa-1}}_{\bullet\leadsto j} > 0$.

For simplicity we suppress that $A$ is a function of the weights from the notation. It is also convenient to drop the superscript via the shorthand $\varsigma := \sigma^A$.
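The definitions can be sanity-checked numerically: on a tiny network, the sum over active paths reproduces the ordinary ReLU forward pass. A minimal sketch with illustrative weights:

```python
import itertools

# Sketch: on a tiny network (2 inputs -> 2 rectifiers -> 1 linear output),
# check that the sum over *active* paths (Definition 4) reproduces the
# ordinary ReLU forward pass. All weights are illustrative.
x = [1.0, -2.0]                       # inputs, stored as source weights
W1 = [[0.5, -1.0], [1.5, 0.25]]       # W1[h][i]: edge weight, input i -> rectifier h
w2 = [2.0, -3.0]                      # edge weight, rectifier h -> linear output

# Standard forward pass.
pre = [sum(W1[h][i] * x[i] for i in range(2)) for h in range(2)]
hidden = [max(0.0, p) for p in pre]
out_forward = sum(w2[h] * hidden[h] for h in range(2))

# Path-sum forward pass: a path (input i -> rectifier h -> output) has weight
# x[i] * W1[h][i] * w2[h]; only paths through active rectifiers contribute.
active = [h for h in range(2) if pre[h] > 0]          # gates, decided layer by layer
out_paths = sum(x[i] * W1[h][i] * w2[h]
                for i, h in itertools.product(range(2), active))

assert abs(out_forward - out_paths) < 1e-12
```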

The following proposition connects active path-sums to the feedforward and feedback sweeps in a neural network:

###### Proposition 4 (structure of rectifier nets).

Let $\varsigma_j := \varsigma_{\bullet\leadsto j}$ and let $\varsigma_{\mathrm{in}(j)}$ collect the path-sums of the units feeding into $j$. Further, introduce the notation $\varsigma_{\mathrm{out}}$ for the output layer. Then

1. Feedforward outputs.
If the inputs to the network are encoded in source weights as above then the output of unit $j$ in the neural network is $\varsigma_j$. Specifically, if $j$ is linear then $\varsigma_j = \langle w_j, \varsigma_{\mathrm{in}(j)}\rangle$; if $j$ is a rectifier then $\varsigma_j = \max\big(0, \langle w_j, \varsigma_{\mathrm{in}(j)}\rangle\big)$.

2. Decomposition of network output.
Let $\varsigma_{j\leadsto\mathrm{out}}$ denote the sum of active path weights from $j$ to the output layer. The output of the network decomposes as

$$\varsigma_{\mathrm{out}} = \varsigma_{j\leadsto\mathrm{out}}\cdot\langle w_j, \varsigma_{\mathrm{in}(j)}\rangle + \sigma^{A\setminus\{j\}}_{\mathrm{out}}, \tag{31}$$

where $\sigma^{A\setminus\{j\}}_{\mathrm{out}}$ is the sum over active paths from sources to outputs that do not intersect $j$.

3. Backpropagated errors.
Suppose the network is equipped with error function $\ell(\varsigma_{\mathrm{out}}, y)$. Let $g := \nabla_{\varsigma_{\mathrm{out}}}\ell$ denote the gradient of $\ell$. The backpropagated error signal received by unit $j$ is $\delta_j = \langle g, \varsigma_{j\leadsto\mathrm{out}}\rangle$.

4. Gradients.
Finally,

$$\langle \nabla_{w_j}\ell(\varsigma_{\mathrm{out}}, y),\, w_j\rangle = \langle g,\, \varsigma_{\mathrm{out}} - \sigma^{A\setminus\{j\}}_{\mathrm{out}}\rangle = \langle g, \varsigma_{j\leadsto\mathrm{out}}\rangle\cdot\langle w_j, \varsigma_{\mathrm{in}(j)}\rangle = \delta_j\cdot\varsigma_{\bullet\leadsto j}. \tag{32, 33}$$
###### Proof.

Direct computation, paralleling proposition 1. ∎
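Proposition 4.2 can likewise be checked numerically: the network output splits into the paths through a given active unit plus the paths that avoid it. A sketch on the same style of toy network, with illustrative weights:

```python
# Sketch verifying the decomposition of the network output (Proposition 4.2):
# output = (active path weight j -> out) * <w_j, input path-sums> + (paths avoiding j).
x = [1.0, -2.0]                       # inputs, stored as source weights
W1 = [[0.5, -1.0], [1.5, 0.25]]       # input -> rectifier edge weights
w2 = [2.0, -3.0]                      # rectifier -> output edge weights

pre = [sum(W1[h][i] * x[i] for i in range(2)) for h in range(2)]
out = sum(w2[h] * max(0.0, pre[h]) for h in range(2))

j = 0                                 # an active rectifier unit
assert pre[j] > 0
sigma_in_j = [x[0], x[1]]             # path-sums feeding into j (here, the inputs)
varsigma_j_out = w2[j]                # active path weight from j to the output
rest = sum(w2[h] * max(0.0, pre[h]) for h in range(2) if h != j)  # paths avoiding j

lhs = out
rhs = varsigma_j_out * sum(W1[j][i] * sigma_in_j[i] for i in range(2)) + rest
assert abs(lhs - rhs) < 1e-12
```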

The output of a rectifier network is a piecewise polynomial function of its weights. To see this, observe that the output of a rectifier net is the sum of the weights of all active paths from the input layer to the output layer; see also [34].

The next step is to construct a game played by the units of the neural network. It turns out there are two ways of doing so:

###### Definition 5 (path-sum games).

The set of players is $\{0, 1, \dots, N\}$. The zeroth player corresponds to the environment and is always active. The environment plays labeled datapoints $(x, y)$ and suffers no loss. The remaining players correspond to non-source units of $G$. Player $j$ plays weight vector $w_j$ in a compact convex set $H_j$.

The losses in the two games are:

• Path-sum prediction game (PS-Pred).
Player $j$ incurs the loss of the network, $\ell_j(\mathbf{w}) := \ell(\varsigma_{\mathrm{out}}, y)$, when active and no loss when inactive.

• Path-sum gradient game (PS-Grad).
Player $j$ incurs the linear loss $\tilde\ell_j(\mathbf{w}) := \langle w_j,\, \delta_j\cdot\varsigma_{\mathrm{in}(j)}\rangle$ when active, where $\delta_j$ is the backpropagated error, and no loss when inactive.

PS-Pred and PS-Grad are analogs of prediction with expert advice and the hedge setting. In the hedge setting, players receive linear losses and choose actions from the simplex; in PS-Grad, players receive linear losses and choose actions from compact convex sets. The results below hold for both games, although our primary interest is in PS-Pred. Note that PS-Grad has the important property that the loss of player $j$ is a linear function of player $j$'s action when it is active:

$$\tilde\ell_j(\mathbf{w}) = \langle w_j,\, \delta_j\cdot\varsigma_{\mathrm{in}(j)}\rangle. \tag{34}$$

Finally, observe that the regret when playing PS-Grad upper bounds the regret in PS-Pred, since regret bounds for linear losses are the worst case amongst convex losses.

###### Remark 3 (minibatch games).

It is possible to construct batch or minibatch games, by allowing the environment to play sequences of moves on each round.

###### Proposition 5 (path-sum games are gated games).

PS-Pred and PS-Grad are gated games if the error function $\ell(\cdot, y)$ is convex in its first argument. That is, rectifier nets are gated games.

The gating structure is essential; path-sum games are not convex, even for rectifiers with the mean-squared error: composing a rectifier with a quadratic can yield a nonconvex function such as $x\mapsto(\max(0,x)-1)^2$. Even simpler, the negative of a rectifier is not convex.

###### Proof.

It is required to show that the losses under PS-Pred and PS-Grad, that is $\ell_j(\mathbf{w})$ and $\tilde\ell_j(\mathbf{w})$, are convex functions of $w_j$ when player $j$ is active. Clearly each loss is a scalar-valued function.

By proposition 4.2, when player $j$ is active the network loss has the form

$$\ell_j(\mathbf{w}) = \ell\big(\varsigma_{j\leadsto\mathrm{out}}\cdot\langle w_j, \varsigma_{\mathrm{in}(j)}\rangle + \sigma^{A\setminus\{j\}}_{\mathrm{out}},\, y\big) = \ell\big(c_1\cdot\langle w_j, \varsigma_{\mathrm{in}(j)}\rangle + c_2,\, y\big). \tag{35, 36}$$

The terms $\varsigma_{\mathrm{in}(j)}$, $c_1$ and $c_2$ are all constants with respect to $w_j$. Thus, the network loss is an affine transformation of $w_j$ (dot-product followed by multiplication by a constant and adding a constant) composed with a convex function, and so convex.

By proposition 4.4, the gradient loss has the form

$$\tilde\ell_j(\mathbf{w}) = \langle w_j,\, \delta_j\cdot\varsigma_{\mathrm{in}(j)}\rangle \tag{37}$$

when player $j$ is active, which is linear in $w_j$ since all the other terms are constants with respect to $w_j$. ∎

###### Remark 4 (dependence of loss on other players).

We have shown that the loss of player $j$ is a convex function of player $j$'s action when player $j$ is active. Note that: (i) the loss of player $j$ depends on the actions chosen by other players in the game and (ii) the loss is not a convex function of the joint action of all the players. It is for these reasons that the game-theoretic analysis is essential.

The proposition does not merely hold for toy cases. The next section extends the result to maxout units, DropOut, DropConnect, and convolutional networks with shared weights and max-pooling. Proposition 5 thus applies to convolutional networks as they are used in practice. Finally, note that proposition 5 does not hold for leaky rectifier units [14] or for units that are not piecewise linear, such as sigmoid or tanh units.

### 3.4 Convolutional Networks

We extend proposition 5 from rectifier nets to convnets.

###### Proposition 6 (convnets are gated games).

Let $G$ be a convolutional network with any combination of linear, rectifier, maxout and max-pooling units. Then $G$ is a gated game.

The proof consists in identifying the relevant players and gates for each case (maxout units, max-pooling, weight-tying in convolutional layers, dropout and dropconnect) in turn. We sketch the result below.

#### 3.4.1 Maxout units

Maxout units were introduced in [78] to complement dropout and to address the problem that rectifier units sometimes saturate at zero, leaving them insufficiently active. A maxout unit has $k_j$ weight vectors $w_{j,1},\dots,w_{j,k_j}$ and, given input $f_j$, outputs

$$\mathrm{maxout:}\quad m(f_j) := \max_{c\in[k_j]}\langle w_{j,c},\, f_j\rangle. \tag{38}$$

Construct a new graph, $\tilde G$, which has: one node per input, linear and rectifier unit; and $k_j$ nodes per maxout unit. Players correspond to nodes of $\tilde G$ and are denoted by greek letters. The extended graph inherits its edge structure from $G$: there is a connection between players in $\tilde G$ iff the underlying units in $G$ are connected. Path weights and path-sums are defined exactly as before, except that we work on $\tilde G$ instead of $G$. The definition of active units is modified as follows:

The set of active players for maxout units is defined inductively. Let $A_\kappa$ denote the active players with longest source-path $\le\kappa$. Source players are active ($A_0 = \{\text{sources}\}$).

Player $\alpha$ with source-path-length $\kappa$ is active if

• it corresponds to a linear unit; or

• a rectifier with $\sigma^{A_{\kappa-1}}_{\bullet\leadsto\alpha} > 0$; or

• a maxout unit with $\sigma^{A_{\kappa-1}}_{\bullet\leadsto\alpha} \ge \sigma^{A_{\kappa-1}}_{\bullet\leadsto\beta}$ for all players $\beta$ corresponding to the same maxout unit.
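A maxout unit can be sketched directly from Eq. (38): the $k$ branches act as separate linear players, of which the gate activates exactly the argmax branch. Illustrative numbers only:

```python
# Sketch: a maxout unit (Eq. 38) as several gated linear players, of which
# exactly one (the argmax branch) is active on a given input.
f = [1.0, 2.0]                                       # input to the maxout unit
branches = [[0.5, 0.5], [1.0, -0.25], [-2.0, 3.0]]   # k = 3 weight vectors w_{j,c}

scores = [sum(w * x for w, x in zip(wc, f)) for wc in branches]
maxout_output = max(scores)
active_branch = max(range(len(branches)), key=lambda c: scores[c])

# Only the active branch contributes to the output (and receives gradient).
assert maxout_output == scores[active_branch]
```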

#### 3.4.2 Max-pooling

Max-pooling is heavily used in convnets as a form of dimensionality reduction. A max-pooling unit has no parameters and outputs the maximum of the outputs of the units from which it receives inputs:

$$\text{max-pooling:}\quad \max_{\{i:\, i\to j\}}\, \sigma^A_{\bullet\leadsto i}. \tag{39}$$

Gates can be extended to max-pooling by adding the condition that, to be active, the output of unit $i$ must be greater than that of any other unit that feeds (directly) into the same pooling unit.

A unit may thus produce an output and still count as inactive because it is ignored by the max-pooling layer, and so has no effect on the output of the neural network. In particular, units that are ignored by max-pooling do not update their weights under backpropagation.
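A sketch of the max-pooling gate: only the winning input unit receives the upstream error, exactly as in backpropagation through a max. Toy values:

```python
# Sketch: under max-pooling (Eq. 39), only the argmax input unit is active;
# the units it shadows receive zero backpropagated error.
inputs = [0.3, 1.7, -0.5]          # outputs of units feeding the pooling unit
pooled = max(inputs)
winner = inputs.index(pooled)

upstream_error = 2.0               # error arriving at the pooling unit
deltas = [upstream_error if i == winner else 0.0 for i in range(len(inputs))]

assert pooled == 1.7 and winner == 1
assert deltas == [0.0, 2.0, 0.0]   # shadowed units do not update their weights
```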

#### 3.4.3 Convolutional layers

Units in a convolutional layer share weights. In contrast to maxout units, each of which corresponds to several players, weight-sharing units correspond to a single composite player.

Suppose that $L$ rectifier units share weight vector $w_j$. Let $A$ denote the active players in lower layers and define

$$A_j := \big\{\alpha\in[L] : \langle w_j,\, \sigma^A_{\bullet\leadsto\mathrm{in}(\alpha)}\rangle > 0\big\}. \tag{40}$$

Component $\alpha$ in layer $j$ is active if $\alpha\in A_j$. Notice that, since players correspond to many units, two players may be connected by more than one edge. Player $j$ is active if any of its components is active, i.e. if $A_j\neq\emptyset$. The output of player $j$ is the sum of its active components:

$$\varsigma_j = \Big\langle w_j,\, \sum_{\alpha\in A_j}\sigma^A_{\bullet\leadsto\mathrm{in}(\alpha)}\Big\rangle. \tag{41}$$

The loss incurred by player $j$ is per Definition 5.

#### 3.4.4 Dropout and Dropconnect

In training with dropout [5], units become inactive with some probability (typically $\frac{1}{2}$) during training. In other words, there is a stochastic component to whether or not a player is active. Gated games are easily extended to incorporate dropout by allowing gates to switch off stochastically. That is, the gating function takes values in the set of distributions over subsets of units, from which the active units are sampled.

Dropconnect is a refinement of dropout where connections, instead of units, are dropped during training [80]. Dropconnect requires extending the notion of gate so that its range is distributions over subsets of edges instead of subsets of units.
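A dropout gate can be sketched as sampling the active set from a product Bernoulli distribution; this is a minimal illustration of the stochastic gating function, not the paper's algorithm:

```python
import random

# Sketch: dropout as a stochastic gating function. Each unit is independently
# active with probability p; inactive units neither fire nor learn.
random.seed(1)
p = 0.5
num_units = 6

def sample_active(num_units, p):
    """Sample the active set A from the product Bernoulli(p) distribution."""
    return {j for j in range(num_units) if random.random() < p}

A = sample_active(num_units, p)
assert A <= set(range(num_units))   # a subset of the units, possibly empty
```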

## 4 Deep Online Convex Optimization

We now explore some implications of the connection between path-sum games and deep learning. Theorem 1 shows that the convergence rates of gated forecasters in a path-sum game (that is, a rectifier net or convnet) control the convergence rate of the network as a whole. As an immediate corollary, we obtain the first convergence rates for gradient descent on rectifier nets. A second corollary, of more conceptual than practical importance, shows that the signal underpinning the correlated equilibrium can be used to describe the representation learned by the neural network. Finally, we present an algorithm with logarithmic regret.

### 4.1 A local-to-global convergence guarantee

Our main result is that gated-regret controls the rate of convergence to critical points in rectifier convnets with a loss function that is convex in the output of the net. The result holds assuming weight vectors are restricted to compact convex sets. Weights are not usually hard-constrained when training neural networks, although they are frequently regularized. Hinton has recently argued that the weights of rectifier layers quickly stabilise on similar values, suggesting this is not an issue in practice (see, for example, http://bit.ly/1KN8e85 starting at 24:00).

It is important to note that the theorem applies to convolutional nets as used in practice. Rectifiers have replaced sigmoids as the nonlinearity of choice as they consistently yield better empirical performance. Loss functions are convex in almost all applications: the logistic loss, hinge loss, and mean-squared error are all convex functions of output weights.

###### Theorem 1 (local-to-global convergence guarantee).

Let $G$ be a rectifier convnet trained by using backpropagation to compute gradients and a no-regret algorithm (such as gradient descent, Adagrad, mirror descent) to update weights given the gradients.

Let $\mathrm{GRegret}_j(T)$ denote the gated-regret of player $j$ after $T$ rounds. Then the empirical distribution of the weights arising while training the network over $T$ rounds converges to a correlated equilibrium. That is,

$$\mathbb{E}_{\mathbf{w}\sim\hat P^T_j}\big[\ell_j(\mathbf{w})\big] \le \min_{w_j\in H_j}\mathbb{E}_{\mathbf{w}\sim\hat P^T_j}\big[\ell_j(w_j, \mathbf{w}_{-j})\big] + \mathrm{GRegret}_j(T) \tag{42}$$

for all players $j$. Consequently, the gated-regret of the players controls the rate of convergence of backpropagation to critical points when training rectifier nets.

An important class of games, introduced by Monderer and Shapley in [81], is potential games. A game is a potential game if the loss functions of all the players arise from a single function, referred to as the potential function. Rectifier nets are gated potential games where the potential function is the loss of the network. That is, the loss incurred by each player, when active, is the loss of the network. Potential games are more amenable to analysis and computation than general games. Local minima of the potential function are pure Nash equilibria. Moreover, simple algorithms such as fictitious play and regret-matching converge to Nash equilibria in potential games [82, 83].

###### Proof.

Propositions 3, 5 and 6 together imply Eq. (42).

The output of a neural net is a continuous piecewise polynomial function of its weights; recall the remark after proposition 4. The potential function of a neural net is therefore the composite of a piecewise polynomial with a convex function. It follows that no-regret algorithms will either converge to a point where the gradient is zero or to a point where the gradient does not exist. Thus, the network converges to a correlated equilibrium that is a Dirac distribution concentrated on a critical point of the loss function. ∎

The theorem provides the first rigorous justification for applying convex methods to convnets: although they are not convex, individual units perform convex optimizations when active. The theorem provides a generic conversion from regret guarantees for convex methods to convergence rates for rectifier networks. Corollaries 1 and 3 provide algorithms for which $\mathrm{GRegret}_j(T) = O(1/\sqrt{T_j})$ and $\mathrm{GRegret}_j(T) = O(\log T_j/T_j)$ respectively.

A special case of theorem 1 is when the no-regret algorithm is gradient descent, see algorithm 2. The algorithm differs from standard backpropagation by introducing a projection step

$$\mathrm{Proj}_{H_j}(w') := \operatorname*{argmin}_{w\in H_j}\|w - w'\|^2_2 \tag{43}$$

that forces the updated weight to lie in the set $H_j$ of actions available to player $j$. If the diameter of $H_j$ is sufficiently large then the projection step makes no difference in practice. It can be thought of as analogous to gradient clipping, which is sometimes used when training neural nets.
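The projection in Eq. (43) has a closed form when $H_j$ is a Euclidean ball, the simplest choice of compact convex action set. A sketch (the ball radius is an assumption for illustration):

```python
import numpy as np

# Sketch of the projection step (Eq. 43): Euclidean projection onto a ball
# of a chosen radius, the simplest compact convex H_j.
def project_onto_ball(w, radius):
    """Return the closest point to w in the L2 ball of the given radius."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

w = np.array([3.0, 4.0])            # norm 5, outside the unit ball
w_proj = project_onto_ball(w, 1.0)
assert abs(np.linalg.norm(w_proj) - 1.0) < 1e-12
assert np.allclose(w_proj, [0.6, 0.8])
```

When the radius exceeds the norm of every iterate, the projection is the identity, matching the remark above that a large enough diameter makes it a no-op in practice.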

###### Corollary 1 (convergence for gradient descent).

Suppose a neural network has a loss function that is convex in its output. Suppose that $H_j$ has diameter $D$. Further suppose that the backpropagated errors received by $j$ and the inputs to $j$ are bounded by $G$ and $B$ respectively.

Then unit ’s gated-regret under online gradient descent is bounded by

$$\mathrm{GRegret}_{\mathrm{Backprop}}(T) \le \frac{3}{2}\, D G B\, \frac{1}{\sqrt{T_j}}, \tag{44}$$

where $T_j$ is the number of rounds where $j$ is active.

The learning rate in algorithm 2 decays according to the number of active steps $T_j$ rather than the total number of steps $T$. An important insight of the gated-game formulation is that learning only occurs on active rounds.

###### Proof.

By Lemma 4, weight updates under error backpropagation coincide with players performing online gradient descent, when active, on either loss in Definition 5. The gradient depends on player $j$'s input, upper bounded by $B$, and its backpropagated error, upper bounded by $G$. The result follows from a standard analysis of online gradient descent [84]. ∎

The bound is a function of the constants $B$ and $G$, which depend on the behavior of other units in the neural net. The dependence arises for any gradient-descent-based algorithm where the weight updates depend on the backpropagated error. The corollary precisely characterises the dependence.

### 4.3 Signals and representations

Recall that a correlated equilibrium requires a signal to guide the behavior of the players. In the case of a rectifier net, the relevant signal is the empirical distribution over the joint actions of the players. As a corollary of theorem 1, we show that the signal provides a compact description of the representations learned by deep networks. There is thus a direct connection between correlated equilibria and representation learning.

Given a distribution $P$ on the set of joint actions (recall that a joint action in PS-Pred specifies the input to the network, its label, and every weight vector), define the expected gain of player $j$ as

$$G_j(w_j; P) := -\,\mathbb{E}_{\mathbf{w}_{-j}\sim P}\big[\langle w_j,\, \delta_j\cdot\varsigma_{\mathrm{in}(j)}\rangle\big], \tag{45}$$

where $\mathbf{w}_{-j}$ denotes the moves of players other than $j$. Note that in Eq. (45), the moves of all players except $j$ are drawn from $P$, which determines which players are active; player $j$'s move (if active) is treated as a free variable.

Let $\hat P^T$ denote the empirical distribution, or signal in game-theoretic terminology, on joint actions up to round $T$ of a neural network trained by error backpropagation, and $\hat P^T_j$ the empirical signal observed by player $j$. For notational convenience, it is useful to incorporate the learning rate, number of rounds and initial weight vector into the gain, and define

$$\hat G_j(w_j) := \eta\, T_j\cdot G_j(w_j; \hat P^T_j) + \langle w_j, w^1_j\rangle, \tag{46}$$

where $T_j$ is the number of rounds where unit $j$ is active. We then have

###### Corollary 2 (signals ↔ representations).

Construct the empirical gain $\hat G_j$ of unit $j$ after $T$ rounds from the signal (empirical distribution) per (46). Then if a rectifier net implements gradient descent with fixed learning rate $\eta$ and unconstrained weights, it holds that

• unit $j$'s weight vector at time $T+1$ is the gradient of the gain:

$$w^{T+1}_j = \nabla_j \hat G_j \tag{47}$$

• the output of unit $j$ on round $T+1$ is the directional derivative of the gain w.r.t. $j$'s input:

$$\varsigma^{T+1}_j = \begin{cases} D_{\varsigma^{T+1}_{\mathrm{in}(j)}} \hat G_j & \text{if positive} \\ 0 & \text{else.} \end{cases} \tag{48}$$

Corollary 2 succinctly describes the representations learned by a neural network via the game-theoretic notation developed above. The corollary does not eliminate the complexity of deep representations. Rather, it demonstrates their direct dependence on the empirical signal $\hat P^T$, which is itself an extremely complicated object.

### 4.4 Logarithmic convergence

As a second application of theorem 1, we adapt the Online Newton Step (ONS) algorithm [46] to neural networks; see NProp in Algorithm 3. Newton's method is computationally expensive since it involves inverting the Hessian. In particular, Online Newton Step scales quadratically with the dimension [85]. Moreover, quasi-Newton methods tend to converge on saddle points. A naive implementation of a quasi-Newton method in neural networks based on the global Hessian is therefore problematic: the number of parameters is huge, and saddle points are abundant [86, 33].

The NProp algorithm sidesteps both problems since it is implemented unit-wise. The computational cost is reduced, since an approximation to a local Hessian is computed for each unit. Thus, the computational cost scales quadratically with the largest layer, rather than with the entire network. Similarly, since NProp is implemented unit-wise, the Newton approximation is not exposed to the curvature of the neural net. Instead, NProp simultaneously leverages the linear structure of active path-sums and the exp-concave structure (curvature) of the external loss.

Let $H$ be a non-empty compact convex set. A function $f: H\to\mathbb{R}$ is $\alpha$-exp-concave if $e^{-\alpha f(w)}$ is a concave function of $w\in H$. Many commonly used loss functions are exp-concave, including the mean-squared error, the log loss, and the logistic loss, for suitably restricted domains and labels.
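Exp-concavity can be probed numerically via midpoint concavity of $e^{-\alpha f}$. A sketch for the squared loss on a bounded interval (the constants are illustrative, chosen so that concavity holds on the whole interval):

```python
import numpy as np

# Sketch: numerically probe alpha-exp-concavity of the squared loss
# f(w) = (w - y)^2 on [-1, 1], by checking midpoint concavity of exp(-alpha f).
alpha, y = 0.5, 0.0
f = lambda w: (w - y) ** 2
h = lambda w: np.exp(-alpha * f(w))

ws = np.linspace(-1.0, 1.0, 201)
for a, b in zip(ws[:-1], ws[1:]):
    mid = 0.5 * (a + b)
    # Concavity of h at midpoints: h(mid) >= average of endpoint values.
    assert h(mid) >= 0.5 * (h(a) + h(b)) - 1e-12
```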

Recall that

$$\mathrm{Proj}_{H,A}(w) := \operatorname*{argmin}_{v\in H}\,\langle v - w,\, A\cdot(v - w)\rangle. \tag{49}$$

Given vectors $u$ and $v$, let $u\otimes v$ denote their outer product. If $u$ and $v$ are $m$- and $n$-dimensional respectively then $u\otimes v$ is an $(m\times n)$-matrix.

###### Corollary 3 (NProp has logarithmic gated-regret).

Suppose that a neural network has a loss function that is $\alpha$-exp-concave in its output. Suppose that $H_j$ has diameter $D$. Further suppose that the backpropagated errors and the inputs to $j$ are bounded by $G$ and $B$ respectively.

Then, unit ’s gated-regret under NProp is bounded by

$$\mathrm{GRegret}_{\mathrm{NProp}}(T) \le 5\, d_j\Big(\frac{1}{\alpha} + B D G\Big)\frac{\log T_j}{T_j}, \tag{50}$$

where $T_j$ is the number of rounds that $j$ is active and $d_j$ is its indegree.

We first prove the following lemma.

###### Lemma 1.

Let $f$ be an $\alpha$-exp-concave function. Suppose that $H$ is a nonempty compact convex set with diameter $D$, and that $A$ and $b$ are an $(n\times d)$-matrix and an $n$-vector satisfying $\|A^\intercal\nabla f\|\le G$. Suppose that $2\beta \le \min\big\{\frac{1}{4GD}, \alpha\big\}$.

Define $g: H\to\mathbb{R}$ as $g(w) := f(A\cdot w + b)$. Then for all $v, w\in H$ it holds that

$$g(v) \ge g(w) + \langle\nabla g(w),\, v - w\rangle + \frac{\beta}{2}\big\langle w - v,\, \nabla g(w)\otimes\nabla g(w)\cdot(w - v)\big\rangle. \tag{51, 52}$$
###### Proof.

It is shown in Lemma 3 of [46] that, since $f$ is $\alpha$-exp-concave and $2\beta \le \min\big\{\frac{1}{4GD}, \alpha\big\}$,

$$f(x) - \frac{1}{2\beta}\log\big(1 - 2\beta\langle\nabla f(x),\, y - x\rangle\big) \le f(y). \tag{53}$$

By the chain rule, $\nabla g(w) = A^\intercal\nabla f(A\cdot w + b)$, and so Eq. (53) can be rewritten as

$$g(w) - \frac{1}{2\beta}\log\big(1 - 2\beta\langle\nabla g(w),\, v - w\rangle\big) \le g(v). \tag{54}$$

By construction, $2\beta\,|\langle\nabla g(w),\, v - w\rangle| \le \frac{1}{4}$, and the result follows by the reasoning in [46]. ∎

We are now ready to prove the Theorem.

###### Proof.

The proof follows the same logic as Theorem 2 in [46] after replacing Lemma 3 there with our Lemma 1. We omit details, except to show how the setting of Lemma 1 connects to neural networks. Let

$$x^t := \varsigma^t_{\mathrm{in}(j)}, \qquad \pi^t := \varsigma^t_{j\leadsto\mathrm{out}} \qquad\text{and}\qquad b^t := \sigma^{A\setminus\{j\}}_{\mathrm{out}}. \tag{55}$$

Let the $(n\times d_j)$-dimensional matrix $A^t := \pi^t\otimes x^t$ denote the outer product. By Lemma 4.2,

$$\varsigma^t_{\mathrm{out}} = A^t\cdot w^t_j + b^t. \tag{56}$$

Since

$$(A^t)^\intercal\nabla\ell = x^t\cdot\langle\pi^t, \nabla\ell\rangle = \delta^t_j\cdot x^t, \tag{57}$$

we have that

$$\|(A^t)^\intercal\nabla\ell\| \le |\delta^t_j|\cdot\|x^t\| \le B G, \tag{58}$$

and the remainder of the argument is standard. ∎

To the best of our knowledge, NProp is the first logarithmic-regret algorithm for neural networks. NProp is computationally more efficient than second-order methods, since it does not require computing the full Hessian, and there are efficient ways to iteratively compute the required inverse without directly inverting the matrix. Nevertheless, NProp's memory usage and computational complexity are prohibitive [85]; it is worth investigating whether there are more efficient algorithms that achieve logarithmic regret in this setting, for example based on the Sketched Online Newton algorithm proposed by Luo et al. [87], which has linear runtime. Finally, NProp does not take advantage of the fact that some experts are inactive on each round, suggesting a second direction in which it may be improved.
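Algorithm 3 itself is not reproduced here; the following is a sketch of the standard Online Newton Step update applied per unit with gating, maintaining the inverse via a Sherman-Morrison rank-one update, under illustrative constants:

```python
import numpy as np

# Sketch: an ONS-style update for a single gated unit, following the standard
# Online Newton Step of Hazan et al. The gating, dimensions, loss, and
# constants are all illustrative; the projection step is omitted.
rng = np.random.default_rng(0)
d, beta, eps = 3, 0.5, 1.0
w = np.zeros(d)
A_inv = np.eye(d) / eps                          # inverse of A_0 = eps * I

for t in range(50):
    if rng.random() < 0.3:                       # gated: skip inactive rounds
        continue
    x = rng.standard_normal(d)                   # the unit's input on this round
    g = (w @ x - 1.0) * x                        # gradient of a squared loss (toy)
    Ag = A_inv @ g
    A_inv -= np.outer(Ag, Ag) / (1.0 + g @ Ag)   # Sherman-Morrison rank-one update
    w = w - (1.0 / beta) * (A_inv @ g)           # Newton-style step

assert np.isfinite(w).all()
```

Maintaining `A_inv` directly is what avoids the explicit matrix inversion mentioned above; each update costs $O(d^2)$ per active round.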

### 4.5 Conditional computation

Convnets are path-sum games played between gated convex players. The criterion for activating a unit is either a max operator (rectifiers, maxout units, and max-pooling) or random (dropout and dropconnect). These have been shown to work well in practice. It is nevertheless natural to ask whether they are optimal. This section introduces a framework for tackling the question.

Analyzing and optimizing the gates requires a new kind of player, the Conditional Gate (CoG), that controls when players are active. A Conditional Gate experiences regret for not activating the optimal subset of players. More precisely, a CoG activates a subset of players on each round. The CoG's context is the weights of the players and their inputs. In PS-Pred, a CoG incurs the scalar loss of the network; in PS-Grad, a CoG incurs a loss vector.