0 Introduction
Deep learning has achieved remarkable successes in object and voice recognition, machine translation, reinforcement learning and other tasks [1, 2, 3, 4, 5]. From a practical standpoint the problem of supervised learning is well-understood and has largely been solved – at least in the regime where both labeled data and computational power are abundant. The workhorse underlying most deep learning algorithms is error backpropagation [6, 7, 8, 9], which is simply gradient descent distributed across a neural network via the chain rule.
Gradient descent and its variants are well-understood when applied to convex or nearly convex objectives [10, 11, 12, 13]. In particular, they have strong performance guarantees in the stochastic and adversarial settings [14, 15, 16, 17]. The reasons for the success of gradient descent in non-convex settings are less clear, although recent work has provided evidence that most local minima are good enough [18, 19]; that modern convolutional networks are close enough to convex for many results on rates of convergence to apply [20]; and that the rate of convergence of gradient descent can control generalization performance, even in non-convex settings [21].
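To fix intuitions, the primitive underlying all of this can be sketched in a few lines. The quadratic objective below is a hypothetical stand-in for a convex loss, and the step size and iteration count are arbitrary choices.

```python
# Minimal gradient descent loop: repeatedly step against the gradient.
def grad_descent(grad, theta, lr=0.1, steps=100):
    for _ in range(steps):
        theta -= lr * grad(theta)
    return theta

# Toy convex objective R(theta) = (theta - 3)^2, with gradient 2 * (theta - 3).
theta_star = grad_descent(lambda t: 2 * (t - 3), theta=0.0)
```

On a convex objective like this the iterates contract toward the unique optimum; in the non-convex settings discussed above, the same loop only finds a local optimum.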
Taking a step back, gradient-based optimization provides a well-established set of computational primitives [22], with theoretical backing in simple cases and empirical backing in others. First-order optimization thus falls in broadly the same category as computing an eigenvector or inverting a matrix: given sufficient data and computational resources, we have algorithms that reliably find good enough solutions for a wide range of problems.
This essay proposes to abstract out the optimization algorithms used for weight updates and focus on how the components of deep learning algorithms interact. Treating optimization as a computational primitive encourages a shift from low-level algorithm design to higher-level mechanism design: we can shift attention to designing architectures that are guaranteed to learn distributed representations suited to specific objectives. The goal is to introduce a language at a level of abstraction where designers can focus on formal specifications (grammars) that specify how plug-and-play optimization modules combine into larger learning systems.
0.1 What is a representation?
Let us recall how representation learning is commonly understood. Bengio et al. describe representation learning as “learning transformations of the data that make it easier to extract useful information when building classifiers or other predictors” [23]. More specifically, “a deep learning algorithm is a particular kind of representation learning procedure that discovers multiple levels of representation, with higher-level features representing more abstract aspects of the data” [24]. Finally, LeCun et al. state that multiple levels of representation are obtained “by composing simple but nonlinear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. With the composition of enough such transformations, very complex functions can be learned. For classification tasks, higher layers of representation amplify aspects of the input that are important for discrimination and suppress irrelevant variations” [5].

The quotes describe the operation of a successful deep learning algorithm. What is lacking is a characterization of what makes a deep learning algorithm work in the first place. What properties must an algorithm have to learn layered representations? What does it mean for the representation learned by one layer to be useful to another? What, exactly, is a representation?
In practice, almost all deep learning algorithms rely on error backpropagation to “align” the representations learned by different layers of a network. This suggests that the answers to the above questions are tightly bound up in first-order (that is, gradient-based) optimization methods. It is therefore unsurprising that the bulk of the paper is concerned with tracking the flow of first-order information. The framework is intended to facilitate the design of more general first-order algorithms than backpropagation.
Semantics. To get started, we need a theory of the meaning or semantics encoded in neural networks. Since there is nothing special about neural networks, the approach taken is inclusive and minimalistic. Definition 1 states that the meaning of any function is how it implicitly categorizes inputs by assigning them to outputs. The next step is to characterize those functions whose semantics encode knowledge, and for this we turn to optimization [25].
Representations from optimizations. Nemirovski and Yudin developed the black-box computational model to analyze the computational complexity of first-order optimization methods [26, 27, 28, 29]. The black-box model is a more abstract view on optimization than the Turing machine model: it specifies a communication protocol that tracks how often an algorithm makes queries about the objective. It is useful to refine Nemirovski and Yudin’s terminology by distinguishing between black-boxes, which respond with zeroth-order information (the value of a function at the query point), and gray-boxes (“gray” for gradient), which respond with zeroth- and first-order information (the gradient or subgradient).

With these preliminaries in hand, Definition 4 proposes that a representation is a function that is a local solution to an optimization problem. Since we do not restrict to convex problems, finding global solutions is not feasible. Indeed, recent experience shows that global solutions are often not necessary in practice [1, 2, 3, 4, 5]. The local solution has similar semantics to – that is, it represents – the ideal solution. The ideal solution usually cannot be found: due to computational limitations, because the problem is non-convex, because we only have access to a finite sample from an unknown distribution, and so on.
To see how Definition 4 connects with representation learning as commonly understood, it is necessary to take a detour through distributed optimization and game theory.
0.2 Distributed representations
Game theory provides tools for analyzing distributed optimization problems where a set of players aim to minimize losses that depend not only on their own actions, but also on the actions of all the other players in the game [30, 31]. Game theory has traditionally focused on convex losses since they are more amenable to theoretical analysis. Here, the only restriction imposed on losses is that they are differentiable almost everywhere.
Allowing non-convex losses means that error backpropagation can be reformulated as a game. Interestingly, there is enormous freedom in choosing the players. They can correspond to individual units, layers, entire neural networks, and a variety of other, intermediate choices. An advantage of the game-theoretic formulation is thus that it applies at many different scales.
Non-convex losses and local optima are essential to developing a scale-free formalism. Even when it turns out that particular units or a particular layer of a neural network are solving a convex problem, convexity is destroyed as soon as those units or layers are combined to form larger learning systems. Convexity is not a property that is preserved in general when units are combined into layers or layers into networks. It is therefore convenient to introduce a computational primitive that denotes the output of a first-order optimization procedure, see Definition 4.
A concern about excessive generality. A potential criticism is that the formulation is too broad. Very little can be said about non-convex optimization in general; introducing games where many players jointly optimize a set of arbitrary non-convex functions only compounds the problem.
Additional structure is required. A successful case study can be found in [20], which presents a detailed game-theoretic analysis of rectifier neural networks. The key to the analysis is that rectifier units are almost convex. The main result is that the rate of convergence of a neural network to a local optimum is controlled by the (waking) regret of the algorithms applied to compute weight updates in the network.
Whereas [20] relied heavily on specific properties of rectifier nonlinearities, this paper considers a wide range of deep learning architectures. Nevertheless, it is possible to carve out an interesting subclass of non-convex games by identifying the composition of simple functions as an essential feature common to deep learning architectures. Compositionality is formalized via distributed communication protocols and grammars.
Grammars for games. Neural networks are constructed by composing a series of elementary operations. The resulting feedforward computation is captured as a computation graph [32, 33, 34, 35, 36, 37]. Backpropagation traverses the graph in reverse and recursively computes the gradient with respect to the parameters at each node.
Section 2 maps the feedforward and feedback computations onto the queries and responses that arise in Nemirovski and Yudin’s model of optimization. However, queries and responses are now highly structured. In the query phase, players feed parameters into a computation graph (the Query graph) that performs the feedforward sweep. In the response phase, oracles reveal first-order information that is fed into a second computation graph (the Response graph).
In most cases the Response graph simply implements backpropagation. However, there are examples where it does not. Three are highlighted here, see section 2.3, and especially sections 2.4 and 2.5. Other algorithms where the Response graphs do not simply implement backprop include difference target propagation [38] and feedback alignment [39] (both discussed briefly in section 2.5) and truncated backpropagation through time [40, 41, 42], where a choice is made about where to cut backprop short. Examples where the query and response graph differ are of particular interest, since they point towards more general classes of deep learning algorithms.
A distributed communication protocol is a game with additional structure: the Query and Response graphs, see Definition 7. The graphs capture the compositional structure of the functions learned by a neural network and the compositional structure of the learning procedure respectively. It is important for our purposes that (i) the feedforward and feedback sweeps correspond to two distinct graphs and (ii) the communication protocol is kept distinct from the optimization procedure. That is, the communication protocol specifies how information flows through the network without specifying how players make use of it. Players can be treated as plug-and-play rational agents that are provided with carefully constructed and coordinated first-order information to optimize as they see fit [43, 44].
Finally, a grammar is a distributed communication protocol equipped with a guarantee that the Response graph encodes sufficient information for the players to jointly find a local optimum of an objective function. The paradigmatic example of a grammar is backpropagation. A grammar is thus a game designed to perform a task. A representation learned by one (p)layer is useful to another if the game is guaranteed to converge on a local solution to an objective – that is, if the players interact through a grammar. It follows that the players build representations that jointly encode knowledge about the task.
Caveats. What follows is provisional. The definitions are a first attempt to capture an interesting, and perhaps useful, perspective on deep learning. The essay contains no new theorems, algorithms or experiments; see [45, 20, 46] for “real work” based on the ideas presented here. The essay is not intended to be comprehensive. Many details are left out and many important aspects are not covered: most notably, probabilistic and Bayesian formulations, and various methods for unsupervised pretraining.
A series of worked examples. In line with its provisional nature, much of the essay is spent applying the framework to worked examples: error backpropagation as a supervised model [8]; variational autoencoders [47] and generative adversarial networks [48] for unsupervised learning; the Deviator-Actor-Critic (DAC) model for deep reinforcement learning [46]; and kickback, a biologically plausible variant of backpropagation [45]. The examples were chosen, in part, to maximize variety and, in part, based on familiarity. The discussions are short; the interested reader is encouraged to consult the original papers to fill in the gaps.

The last two examples are particularly interesting since their Response graphs differ substantially from backpropagation. The DAC model constructs a zeroth-order black-box to estimate gradients rather than querying a first-order gray-box. Kickback prunes backprop’s Response graph by replacing most of its gray-boxes with black-boxes and approximating the chain rule with (primarily) local computations.
0.3 Related work
Bottou and Gallinari proposed to decompose neural networks into cooperating modules [49, 50]. Decomposing more general algorithms or models into collections of interacting agents dates back to the shrieking demons that comprised Selfridge’s Pandemonium [51] and a long line of related work [52, 53, 54, 55, 56, 57, 58, 59]. The focus developed here on components of neural networks as players, or rational agents, in their own right derives from work aimed at modeling biological neurons game-theoretically, see [60, 61, 62, 63, 64]. A related approach to semantics based on general value functions can be found in Sutton et al. [65], see Remark 1. Computation graphs as applied to backprop are the basis of the Python library Theano [34, 35, 36] and provide the backbone for automatic/algorithmic differentiation [32, 33].

Grammars are a technical term in the theory of formal languages relating to the Chomsky hierarchy [66]. There is no apparent relation between that notion of grammar and the one presented here, aside from both relating to structural rules governing composition. Formal languages and deep learning are sufficiently disparate fields that there is little risk of terminological confusion. Similarly, the notion of semantics introduced here is distinct from semantics in the theory of programming languages.
Although game theory was originally developed to model human interactions [30], it has been pointed out that it may be more directly applicable to interacting populations of algorithms, so-called machina economicus [67, 68, 69, 70, 71, 72]. This paper goes one step further to propose that games played over first-order communication protocols are a key component of the foundations of deep learning.
A source of inspiration for the essay is Bayesian networks and Markov random fields. Probabilistic graphical models and factor graphs provide simple, powerful ways to encode a multivariate distribution’s independencies into a diagram [73, 74, 75]. They have greatly facilitated the design and analysis of probabilistic algorithms. However, there is no comparable framework for distributed optimization and deep learning. The essay is intended as a first step in this direction.

1 Semantics and Representations
This section defines semantics and representations. In short, the semantics of a function is how it categorizes its inputs; a function is a representation if it is selected to optimize an objective. The connection between the definition of representation below and “representation learning” is clarified in section 2.1.
Possible world semantics was introduced by Lewis to formalize the meaning of sentences in terms of counterfactuals [76]. Let $p$ be a proposition about the world. Its truth depends on its content and the state of the world. Rather than allowing the state of the world to vary, it is convenient to introduce the set $\mathcal{W}$ of all possible worlds.

Let us denote proposition $p$ applied in world $w$ by $p_w$. The meaning of $p$ is then the mapping $\mathcal{W} \to \{0, 1\}$ which assigns 1 or 0 to each $w \in \mathcal{W}$ according to whether or not proposition $p_w$ is true. Equivalently, the meaning of the proposition is the ordered pair consisting of all worlds, and the subset of worlds where it is true:

$[\![p]\!] := \big(\mathcal{W},\; \{w \in \mathcal{W} : p_w \text{ is true}\}\big).$    (1)

For example, the meaning of “that is blue” is the subset of possible worlds where I am pointing at a blue object. The concept of blue is rendered explicit in an exhaustive list of possible examples.
A simple extension of possible world semantics from propositions to arbitrary functions is as follows [77]:
Definition 1 (semantics).
Given function $f: X \to Y$, the semantics or meaning of output $y \in Y$ is the ordered pair of sets

$[\![y]\!] := \big(X,\; f^{-1}(y)\big).$    (2)

Functions implicitly categorize inputs by assigning outputs to them; the meaning of an output is the category.
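For finite input sets, Definition 1 can be computed directly: the meaning of an output is the pair (all inputs, the preimage of that output). A small sketch, using parity as a hypothetical example function:

```python
# Semantics of an output y under f: the ordered pair (X, preimage of y).
def semantics(f, inputs, y):
    preimage = {x for x in inputs if f(x) == y}
    return set(inputs), preimage

# Parity implicitly categorizes integers; the meaning of output 0 is "even".
universe, evens = semantics(lambda x: x % 2, range(-3, 4), 0)
```

The function never mentions the category explicitly; the category is recovered from the function's input-output behavior, exactly as in the definition.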
Whereas propositions are true or false, the output of a function is neither. However, if two functions both optimize a criterion, then one can refer to how accurately one function represents the other. Before we can define representations we therefore need to take a quick detour through optimization:
Definition 2 (optimization problem).
An optimization problem is a pair $(\Theta, R)$ consisting in parameter space $\Theta$ and objective $R: \Theta \to \mathbb{R}$ that is differentiable almost everywhere.

The solution to the global optimization problem is

$\theta^* = \operatorname{arg\,opt}_{\theta \in \Theta} R(\theta),$    (3)

which is either a maximum or minimum according to the nature of the objective.

The solution may not be unique; it also may not exist unless further restrictions are imposed. Such details are ignored here.
Next recall the black-box optimization framework introduced by Nemirovski and Yudin [26, 27, 28, 29].
Definition 3 (communication protocol).
A communication protocol for optimizing an unknown objective $R: \Theta \to \mathbb{R}$ consists in a User (or Player) and an Oracle. On each round, User presents a query $\theta \in \Theta$. Oracle can respond in one of two ways, depending on the nature of the protocol:

Black-box (zeroth-order) protocol.
Oracle responds with the value $R(\theta)$.

Gray-box (first-order) protocol.
Oracle responds with either the gradient $\nabla R(\theta)$, or with the gradient together with the value.
The protocol specifies how Player and Oracle interact without specifying the algorithm used by Player to decide which points to query. The next section introduces distributed communication protocols as a general framework that includes a variety of deep learning architectures as special cases – again without specifying the precise algorithms used to perform weight updates.
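The two response types in Definition 3 can be sketched as a query loop in which the Player sees only the Oracle's responses, never the objective itself; the quadratic objective and the Player's update rule are hypothetical choices, since the protocol deliberately leaves the latter unspecified.

```python
# A gray-box Oracle wraps a hidden objective and answers queries with the
# value and the gradient; a black-box would return the value alone.
def make_graybox(value, gradient):
    def oracle(theta):
        return value(theta), gradient(theta)
    return oracle

oracle = make_graybox(lambda t: (t - 1) ** 2, lambda t: 2 * (t - 1))

# Player's side of the protocol: query, receive first-order information, update.
theta = 5.0
for _ in range(200):
    _, g = oracle(theta)   # response phase
    theta -= 0.1 * g       # the Player's update rule is outside the protocol
```

Separating the protocol (the loop structure) from the Player's algorithm (the update line) is exactly the division the next section scales up to distributed protocols.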
Unlike [26, 28] we do not restrict to convex problems. Finding a global optimum is not always feasible, and in practice often unnecessary.
Definition 4 (representation).
Let $\mathcal{F}$ be a function space and

$\Theta \to \mathcal{F}: \theta \mapsto f_\theta$    (4)

be a map from parameter space to functions. Further suppose that objective function $R: \mathcal{F} \to \mathbb{R}$ is given.

A representation is a local solution $f_{\theta^*}$ to the optimization problem

$\theta^* = \operatorname{arg\,opt}_{\theta \in \Theta} R(f_\theta),$    (5)

corresponding to a local maximum or minimum according to whether the objective is maximized or minimized.

Intuitively, the objective quantifies the extent to which functions in $\mathcal{F}$ categorize their inputs similarly. The $\operatorname{arg\,opt}$ operation applies a first-order method to find a function whose semantics resembles that of the ideal, globally optimal solution.
In short, representations are functions with useful semantics, where usefulness is quantified using a specific objective: the lower the loss or higher the reward associated with a function, the more useful it is. The relation between Definition 4 and representations as commonly understood in the deep learning literature is discussed in section 2.1 below.
Remark 1 (value function semantics).
In related work, Sutton et al. [65] proposed that semantics – i.e. knowledge about the world – can be encoded in general value functions that provide answers to specific questions about expected rewards. Definition 1 is more general than their approach since it associates a semantics to any function. However, the function must arise from optimizing an objective for its semantics to accurately represent a phenomenon of interest.
1.1 Supervised learning
The main example of a representation arises under supervised learning.
Representation 1 (supervised learning).
Let $X$ and $Y$ be an input space and a set of labels, and let $\ell: Y \times Y \to \mathbb{R}$ be a loss function. Suppose that $\{f_\theta: X \to Y \,|\, \theta \in \Theta\}$ is a parametrized family of functions.

Nature samples labeled pairs $(x, y)$ i.i.d. from distribution $P$, singly or in batches.

Predictor chooses parameters $\theta \in \Theta$.

Objective is

$R(\theta) = \mathbb{E}_{(x,y) \sim P}\big[\ell(f_\theta(x), y)\big].$    (6)

The query and response phases can be depicted graphically as Query and Response computation graphs. The predictor $f_{\hat\theta}$ is then a representation of the optimal predictor $f_{\theta^*}$.
A commonly used mapping from parameters to functions is

$f_\theta(x) = \langle \theta, \phi(x) \rangle,$    (7)

where a feature map $\phi: X \to \mathbb{R}^d$ is fixed.
The setup admits a variety of complications in practice. Firstly, it is typically infeasible even to find a local optimum. Instead, a solution that is within some small $\epsilon$ of a local optimum suffices. Secondly, the distribution $P$ is unknown, so the expectation is replaced by a sum over a finite sample. The quality of the resulting representation has been extensively studied in statistical learning theory [78]. Finally, it is often convenient to modify the objective, for example by incorporating a regularizer. Thus, a more detailed presentation would conclude that

$\hat\theta = \operatorname{arg\,min}_{\theta \in \Theta} \Big[\tfrac{1}{n} \textstyle\sum_{i=1}^n \ell(f_\theta(x_i), y_i) + \lambda \cdot \Omega(\theta)\Big]$    (8)

yields a representation $f_{\hat\theta}$ of the solution to the original problem (6). To keep the discussion and notation simple, we do not consider any of these important details.
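A hedged sketch of Representation 1 with the complications above in place: the expectation over Nature is replaced by a finite sample, and a local solution is found by gradient descent. The scalar linear model, squared loss, and constants are illustrative assumptions, not part of the formal setup (the regularizer is omitted).

```python
# Empirical risk minimization for a scalar linear predictor f_theta(x) = theta * x
# with squared loss, in place of the expectation over Nature's distribution P.
def erm(sample, lr=0.01, steps=2000):
    theta = 0.0
    for _ in range(steps):
        # gradient of (1/n) * sum_i (theta * x_i - y_i)^2 with respect to theta
        g = sum(2 * (theta * x - y) * x for x, y in sample) / len(sample)
        theta -= lr * g
    return theta

# A noiseless sample from y = 2x: the learned predictor represents the ideal one.
sample = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
theta_hat = erm(sample)
```

The fitted function has (approximately) the same semantics as the ideal predictor on the sampled inputs, which is exactly the sense in which it represents it.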
It is instructive to unpack the protocol, by observing that the objective is a composite function involving Nature, Predictor and the loss. The gradient notation used there is borrowed from backpropagation: it is shorthand for the derivative of the objective with respect to the parameters $\theta$.

Nature is not a deterministic black-box since it is not queried directly: Nature produces pairs $(x, y)$ stochastically, rather than in response to specific inputs. Our notion of black-box can be extended to stochastic black-boxes, see e.g. [37]. However, once again we prefer to keep the exposition as simple as possible.
1.2 Unsupervised learning
The second example concerns fitting a probabilistic or generative model to data. A natural approach is to find the distribution under which the observed data is most likely:
Representation 2 (maximum likelihood estimation).
Let $X$ be a data space.

Nature samples points $x_i$ i.i.d. from distribution $P$.

Estimator chooses parameters $\theta \in \Theta$ of a model distribution $p_\theta$ on $X$.

Operator $-\log$ acts as a loss. The objective is to minimize

$R(\theta) = \mathbb{E}_{x \sim P}\big[-\log p_\theta(x)\big].$    (9)

The estimate $p_{\hat\theta}$, where $\hat\theta$ is a local optimum of the objective, is a representation of the optimal solution, and can also be considered a representation of $P$. The setup extends easily to maximum a posteriori estimation.

As for supervised learning, the protocol can be unpacked by observing that the objective has a compositional structure.
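Representation 2 can be sketched for the simplest possible model, a Gaussian with unknown mean and unit variance, where minimizing the empirical negative log-likelihood by gradient descent recovers the sample mean. The model choice and constants are illustrative assumptions.

```python
# Maximum likelihood for the mean of a unit-variance Gaussian: minimize the
# empirical negative log-likelihood, whose gradient in mu is mean(mu - x_i).
def mle_mean(samples, lr=0.1, steps=500):
    mu = 0.0
    for _ in range(steps):
        g = sum(mu - x for x in samples) / len(samples)
        mu -= lr * g
    return mu

mu_hat = mle_mean([1.0, 2.0, 3.0])  # the analytic optimum is the sample mean
```

The estimate is a representation of Nature's distribution in the sense of Definition 4: it is a local solution to the optimization problem posed by the negative log-likelihood.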
1.3 Reinforcement learning
The third example is taken from reinforcement learning [79]. We will return to reinforcement learning in section 2.4, so the example is presented in some detail. In reinforcement learning, an agent interacts with its environment, which is often modeled as a Markov decision process consisting of state space $S$, action space $A$, initial distribution $p(s_0)$ on states, stationary transition distribution $p(s_{t+1} \,|\, s_t, a_t)$ and reward function $r: S \times A \to \mathbb{R}$. The agent chooses actions based on a policy: a function $\pi$ from states to actions. The goal is to find the optimal policy.

Actor-critic methods break up the problem into two pieces [80]. The critic estimates the expected value of state-action pairs given the current policy, and the actor attempts to find the optimal policy using the estimates provided by the critic. The critic is typically trained via temporal difference methods [81, 82].
Let $p(s \to s', t, \pi)$ denote the distribution on states $s'$ at time $t$ given policy $\pi$ and initial state $s$ at $t = 0$, and let $\gamma \in [0, 1)$ be a discount rate. Let $r_t^\gamma = \sum_{k=t}^{\infty} \gamma^{k-t}\, r(s_k, a_k)$ be the discounted future reward. Define the value of a state-action pair as

$Q^\pi(s, a) := \mathbb{E}\big[r_t^\gamma \,\big|\, s_t = s,\, a_t = a\big].$    (10)

Unfortunately, the value function cannot be queried. Instead, temporal difference methods take a bootstrapped approach by minimizing the Bellman error:

$B(\theta) = \mathbb{E}\Big[\big(r(s_t, a_t) + \gamma\, Q_\theta(s_{t+1}, a_{t+1}) - Q_\theta(s_t, a_t)\big)^2\Big],$    (11)

where $s_{t+1}$ is the state subsequent to $s_t$.
Representation 3 (temporal difference learning).
Critic interacts with black-boxes Actor and Nature. (Nature’s outputs depend on Actor’s actions, so the Query graph should technically have an additional arrow from Actor to Nature.)

Critic plays parameters $\theta$ of a value estimate $Q_\theta$.

Operators estimate the value function and compute the Bellman error. In practice, it turns out to be useful to clone the value estimate periodically and compute a slightly modified Bellman error:

$B(\theta) = \mathbb{E}\Big[\big(r(s_t, a_t) + \gamma\, Q_{\tilde\theta}(s_{t+1}, a_{t+1}) - Q_\theta(s_t, a_t)\big)^2\Big],$    (12)

where $Q_{\tilde\theta}$ is the cloned estimate. Cloning improves the stability of TD-learning [4]. A nice conceptual side-effect of cloning is that TD-learning reduces to gradient descent.

The estimate $Q_{\hat\theta}$ is a representation of the true value function $Q^\pi$.
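The effect of cloning in equation (12) can be sketched on a toy chain with two states; the environment, step size and cloning period below are hypothetical choices, not taken from the cited papers.

```python
# TD-learning with a periodically cloned target on a toy two-state chain:
# state 0 -> state 1 (reward 0), state 1 -> terminal (reward 1), gamma = 0.9.
def td_with_cloning(episodes=2000, alpha=0.1, gamma=0.9, clone_every=50):
    V = [0.0, 0.0]        # current value estimates
    V_clone = list(V)     # frozen copy used in the bootstrap target
    for ep in range(episodes):
        if ep % clone_every == 0:
            V_clone = list(V)   # periodically clone the estimate
        V[0] += alpha * (0.0 + gamma * V_clone[1] - V[0])  # transition 0 -> 1
        V[1] += alpha * (1.0 - V[1])                       # transition 1 -> end
    return V

V = td_with_cloning()  # true values are V[1] = 1 and V[0] = gamma * 1 = 0.9
```

Because the bootstrap target is frozen between clonings, each update is a plain gradient step on a fixed squared error, which is the conceptual side-effect noted above.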
Remark 2 (on temporal difference learning as a first-order method).
Temporal difference learning is not strictly speaking a gradientbased method [82]. The residual gradient method performs gradient descent on the Bellman error, but suffers from double sampling [83]. Projected fixpoint methods minimize the projected Bellman error via gradient descent and have nice convergence properties [84, 85, 86]. An interesting recent proposal is implicit TD learning [87], which is based on implicit gradient descent [88].
Section 2.4 presents the Deviator-Actor-Critic model, which simultaneously learns a value-function estimate and a locally optimal policy.
2 Protocols and Grammars
It is often useful to decompose complex problems into simpler subtasks that can be handled by specialized modules. Examples include variational autoencoders, generative adversarial networks and actor-critic models. Neural networks are particularly well-adapted to modular designs, since units, layers and even entire networks can easily be combined, analogously to lego bricks [49].
However, not all configurations are viable models. A methodology is required to distinguish good designs from bad. This section provides a basic language for describing how bricks are glued together, which may prove a useful design tool. The idea is to extend the definitions of optimization problems, protocols and representations from section 1 from single-player to multi-player optimization problems.
Definition 5 (game).
A distributed optimization problem or game is a set of players $[N] = \{1, \ldots, N\}$, a parameter space $\Theta = \prod_{i=1}^N \Theta_i$, and a loss vector $\boldsymbol{\ell} = (\ell_1, \ldots, \ell_N)$ with $\ell_i: \Theta \to \mathbb{R}$. Player $i$ picks moves from $\Theta_i$ and incurs loss determined by $\ell_i$. The goal of each player is to minimize its loss, which depends on the moves of the other players.

The classic example is a finite game [30], where player $i$ has a menu of actions and, on each round, chooses a distribution over actions. Losses are specified for individual actions, and extended linearly to distributions over actions. A natural generalization of finite games is convex games, where the parameter spaces are compact convex sets and each loss is a convex function in its player’s argument [89]. It has been shown that players implementing no-regret algorithms are guaranteed to converge to a correlated equilibrium in convex games [90, 91, 89].
The notion of game in Definition 5 is too general for our purposes. Additional structure is required.
Definition 6 (computation graph).
A computation graph is a directed acyclic graph with two kinds of nodes:

Inputs are set externally (in practice by Players or Oracles).

Operators produce outputs that are a fixed function of their parents’ outputs.
Computation graphs are a useful tool for calculating derivatives [32, 34, 35, 36, 33]. For simplicity, we restrict to deterministic computation graphs. More general stochastic computation graphs are studied in [37].
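A deterministic computation graph and its reverse traversal can be sketched for a two-operator graph computing z = (a + b) * c; the function is a hypothetical example chosen for brevity.

```python
# Query sweep: evaluate the graph z = (a + b) * c node by node.
def forward(a, b, c):
    s = a + b        # Operator 1: sum node
    z = s * c        # Operator 2: product node
    return z, s

# Response sweep: traverse in reverse, accumulating derivatives by the chain rule.
def backward(s, c):
    dz_ds = c        # through the product node
    dz_dc = s
    dz_da = dz_ds    # through the sum node (derivative 1 in each input)
    dz_db = dz_ds
    return dz_da, dz_db, dz_dc

z, s = forward(2.0, 3.0, 4.0)
grads = backward(s, 4.0)
```

Each Operator's output is a fixed function of its parents' outputs, and the reverse sweep only needs values already produced by the forward sweep.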
A distributed communication protocol extends the communication protocol in Definition 3 to multiplayer games using two computation graphs.
Definition 7 (distributed communication protocol).
A distributed communication protocol is a game where each round has two phases, determined by two computation graphs:

Query phase. Players provide inputs to the Query graph, which Operators transform into outputs.

Response phase. Operators in the Query graph act as Oracles in the Response graph: they input subgradients that are transformed and communicated to the Players.
The moves chosen by Players depend only on their prior moves and the information communicated to them by the Response graph.
The protocol specifies how Players and Oracles communicate without specifying the optimization algorithms used by the Players. The addition of a Response graph allows more general computations than simply backpropagating the gradients of the Query phase. The additional flexibility allows the design of new algorithms, see sections 2.4 and 2.5 below. It is also sometimes necessary for computational reasons. For example, backpropagation through time on recurrent networks typically runs over a truncated Response graph [40, 41, 42].
Suppose that we wish to optimize an objective function that depends on all the moves of all the players. Finding a global optimum is clearly not feasible. However, we may be able to construct a protocol such that the players are jointly able to find local optima of the objective. In such cases, we refer to the protocol as a grammar:
Definition 8 (grammar).
A grammar for objective $R$ is a distributed communication protocol where the Response graph provides sufficient first-order information to find a local optimum of $R$.
The guarantee ensures that the representations constructed by Players in a grammar can be combined into a coherent distributed representation. That is, it ensures that the representations constructed by the Players transform data in a way that is useful for optimizing the shared objective $R$.
The Players’ losses need not be explicitly computed. All that is necessary is that the Response phase communicate the gradient information needed for Players to locally minimize their losses – and that doing so yields a local optimum of the objective.
Basic building blocks: function composition and the chain rule.
Functions can be inserted into grammars as lego-like building blocks, via function composition during queries and the chain rule during responses. Let $f$ be a function that takes inputs $\theta$ and $h$, provided by a Player and by upstream computations respectively. The output of $f$ is communicated downstream in the Query phase.

The chain rule is implemented in the Response phase as follows. Oracle $f$ reports the gradient in the Response phase. A product Operator computes the products of the reported gradient with the downstream gradient via matrix multiplication. The projections of the product onto the first and second components are reported to the Player and upstream respectively. (Alternatively, to avoid having the product Operator produce two outputs, the entire vector can be reported in both directions with the irrelevant components ignored.)
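For scalar inputs the building block reduces to a few multiplications. The sketch below fixes a hypothetical operator f(theta, h) = theta * h and shows the two projections the Response phase reports.

```python
# Response-phase step for the operator f(theta, h) = theta * h: the downstream
# gradient delta = dL/df is multiplied by the partial derivatives of f and
# projected so each party receives its own component.
def respond(delta, theta, h):
    to_player = delta * h        # dL/dtheta, reported to the Player
    to_upstream = delta * theta  # dL/dh, passed further upstream
    return to_player, to_upstream

g_theta, g_h = respond(delta=2.0, theta=3.0, h=5.0)
```

The Player sees only its own projection; what it does with that first-order information is left to the Player, as required by the protocol.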
Summary of guarantees. A selection of examples is presented below. Guarantees fall under the following broad categories:

Exact gradients.
Under error backpropagation the Response graph implements the chain rule, which guarantees that Players receive the gradients of their loss functions; see section 2.1. 
Surrogate objectives.
The variational autoencoder uses a surrogate objective: the variational lower bound. Maximizing the surrogate is guaranteed to also maximize the true objective, which is computationally intractable; see section 2.2.
Remark 3 (fine and coarsegraining).
There is considerable freedom regarding the choice of players. In the examples below, players are typically chosen to be layers or entire neural networks to keep the diagrams simple. It is worth noting that zooming in, such that players correspond to individual units, has proven to be a useful tool when analyzing neural networks [45, 20, 46].
The game-theoretic formulation is thus scale-free and can be coarse- or fine-grained as required. A mathematical language for tracking the structure of hierarchical systems at different scales is provided by operads, see [92] and the references therein, which are the natural setting to study the composition of operators that receive multiple inputs.
2.1 Error backpropagation
The main example of a grammar is a neural network using error backpropagation to perform supervised learning. Layers in the network can be modeled as players in a game. Setting each (p)layer’s objective to be the network’s loss, which it minimizes using gradient descent, yields backpropagation.
Grammar 1 (backpropagation).
A neural network with L layers can be reformulated as a game played between L + 1 players, corresponding to Nature and the L Layers of the network. The query graph for a 3-layer network is:

Nature plays datapoints (x, y) sampled i.i.d. from the data distribution P and acts as the zeroth player.

Each Layer j plays a weight matrix W_j.

Operators compute the output of each layer from its weights and inputs, along with the loss ℓ.
The response graph performs error backpropagation:
The protocol can be extended to convolutional networks by replacing the matrix multiplications performed by the operators with convolutions and adding parameterless max-pooling operators [93].
Guarantee. The loss of every (p)layer is the loss of the network as a whole:
(13)  ℓ_j(W_1, …, W_L) = ℓ( f_L ∘ ⋯ ∘ f_1(x), y )  for every (p)layer j,
where f_j denotes the function computed by layer j. It follows by the chain rule that the response graph communicates the gradient of this shared loss to each player.
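The guarantee can be illustrated concretely. The following sketch (a hypothetical rectifier network in NumPy, not from the text) shows the query phase computing layer outputs and the response phase delivering to each (p)layer the gradient of the single shared loss:

```python
import numpy as np

def forward(Ws, x):
    """Query phase: each Layer i plays a weight matrix Ws[i]; the operators
    compute h_i = relu(W_i @ h_{i-1}), with a linear final layer."""
    hs = [x]
    for W in Ws[:-1]:
        hs.append(np.maximum(0.0, W @ hs[-1]))
    hs.append(Ws[-1] @ hs[-1])
    return hs

def backward(Ws, hs, grad_out):
    """Response phase: the chain rule delivers to every (p)layer the
    gradient of the single shared loss w.r.t. its own weight matrix."""
    grads, g = [], grad_out
    for i in range(len(Ws) - 1, -1, -1):
        if i < len(Ws) - 1:
            g = g * (hs[i + 1] > 0)       # undo layer i's relu
        grads.append(np.outer(g, hs[i]))  # component reported to Layer i
        g = Ws[i].T @ g                   # component passed upstream
    return grads[::-1]

rng = np.random.default_rng(1)
Ws = [rng.standard_normal((5, 4)),
      rng.standard_normal((3, 5)),
      rng.standard_normal((2, 3))]
x, y = rng.standard_normal(4), rng.standard_normal(2)
hs = forward(Ws, x)
loss = 0.5 * np.sum((hs[-1] - y) ** 2)  # the loss shared by every (p)layer
grads = backward(Ws, hs, hs[-1] - y)    # one gradient per Layer
```

Each Layer then performs gradient descent on its own `grads[i]`, even though all gradients derive from the one shared loss.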
Representation learning. We are now in a position to relate the notion of representation in definition 4 with the standard notion of representation learning in neural networks. In the terminology of section 1, each player learns a representation. The representations learned by the different players form a coherent distributed representation because they jointly optimize a single objective function.
Abstractly, the objective can be written as
(14)  ℓ(W_1, …, W_L) = E_{(x, y) ∼ P} [ ℓ( f_W(x), y ) ],
where f_W = f_L ∘ f_{L−1} ∘ ⋯ ∘ f_1. The goal is to minimize the composite objective.
If we set f = f_W then the function f fits the definition of representation above. Moreover, the compositional structure of the network implies that f is composed of subrepresentations corresponding to the optimizations performed by the different players in the grammar: each function f_j is a local optimum, optimized to transform its inputs into a form that is useful to the network as a whole.
Detailed analysis of convergence rates. Little can be said in general about the rate of convergence of the layers in a neural network since the loss is not convex. However, neural networks can be decomposed further by treating the individual units as players. When the units are linear or rectilinear, it turns out that the network is a circadian game. The circadian structure provides a way to convert results about the convergence of convex optimization methods into results about the global convergence of a rectifier network to a local optimum, see [20].
2.2 Variational autoencoders
The next example extends the unsupervised setting described in section 1.2. Suppose that observations x are sampled i.i.d. from a two-step stochastic process: a latent value z is sampled from p(z), after which x is sampled from p(x | z).
The goal is to (i) find the maximum likelihood estimator for the observed data and (ii) estimate the posterior distribution on z conditioned on an observation x. A straightforward approach is to maximize the marginal likelihood
(15)  p(x) = ∫ p(x | z) p(z) dz
and then compute the posterior
(16)  p(z | x) = p(x | z) p(z) / p(x).
However, the integral in Eq. (15) is typically intractable, so a more roundabout tactic is required. The approach proposed in [47] is to construct two neural networks: a decoder that learns a generative model approximating p(x | z), and an encoder that learns a recognition model, or approximate posterior, approximating p(z | x).
It turns out to be useful to replace the encoder with a deterministic function g and a noise source ε that are compatible. Here, compatible means that sampling z from the recognition model is equivalent to sampling ε and computing z = g(x, ε).
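For a Gaussian recognition model this compatibility is the familiar reparameterization trick. A minimal sketch (the numerical values are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma = 1.5, 0.7   # illustrative encoder outputs for one observation x

def g(eps, mu, sigma):
    """Deterministic map compatible with sampling z from N(mu, sigma^2)."""
    return mu + sigma * eps

# Sampling z directly and sampling noise then applying g give the same law:
z_direct = rng.normal(mu, sigma, size=100_000)
eps = rng.standard_normal(100_000)   # the Noise player's i.i.d. samples
z_reparam = g(eps, mu, sigma)
print(z_direct.mean(), z_reparam.mean())  # both close to 1.5
print(z_direct.std(), z_reparam.std())    # both close to 0.7
```

The point of the substitution is that g is deterministic and differentiable in its parameters, so gradients can flow through it while all randomness is isolated in the Noise player.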
Grammar 2 (variational autoencoder).
A variational autoencoder is a game played between Encoder, Decoder, Noise and Environment. The query graph is

Environment plays i.i.d. samples x from the data distribution.

Noise plays i.i.d. samples ε from a fixed distribution. It also communicates its density function, which is analogous to a gradient and is the reason that Noise is gray rather than black-box.

Encoder and Decoder play parameters φ and θ respectively.

The encoding operator is a neural network that encodes samples x into latent variables z.

The decoding operator is a neural network that estimates the probability of x conditioned on z.

The remaining operators compute the (negative) variational lower bound
(17)  −L(θ, φ; x) = KL( q(z | x) ‖ p(z) ) − E_{q(z | x)} [ log p(x | z) ]
The response graph implements backpropagation:
Guarantee. The guarantee has two components:

Maximizing the variational lower bound yields (i) a maximum likelihood estimator and (ii) an estimate of the posterior on the latent variable [47].

The chain rule ensures that the correct gradients are communicated to Encoder and Decoder.
The first guarantee is that the surrogate objective computed by the query graph yields good solutions. The second guarantee is that the response graph communicates the correct gradients.
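To make the surrogate concrete, the following sketch computes a negative variational lower bound of the form in Eq. (17), assuming a diagonal-Gaussian recognition model with standard-normal prior and a Bernoulli decoder, the closed forms used in [47]; the function and argument names are illustrative:

```python
import numpy as np

def neg_variational_bound(x, mu, log_var, x_recon_prob):
    """Negative variational lower bound, assuming q(z|x) = N(mu, diag(exp(log_var))),
    prior p(z) = N(0, I), and a Bernoulli decoder."""
    # KL( q(z|x) || p(z) ): closed form for diagonal Gaussians
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))
    # One-sample Monte Carlo estimate of -E_q[ log p(x|z) ] (Bernoulli likelihood)
    recon = -np.sum(x * np.log(x_recon_prob) + (1 - x) * np.log(1 - x_recon_prob))
    return kl + recon

# When q equals the prior, the KL term vanishes; with p = 0.5 per pixel the
# reconstruction term is 2 * log 2 for two binary pixels.
val = neg_variational_bound(np.array([1.0, 0.0]), np.zeros(3), np.zeros(3),
                            np.full(2, 0.5))
```

Minimizing this quantity jointly in the Encoder's and Decoder's parameters simultaneously improves the reconstruction and keeps the recognition model close to the prior.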
2.3 Generative-adversarial networks
A recent approach to designing generative models is to construct an adversarial game between Forger and Curator [48]. Forger generates samples; Curator aims to discriminate the samples produced by Forger from those produced by Nature. Forger aims to create samples realistic enough to fool Curator.
If Forger plays parameters θ and Curator plays φ, then the game is described succinctly via
(18)  min_φ max_θ  ℓ(θ, φ),
where ℓ is the loss in Eq. (19) below, G_θ is a neural network that converts noise into samples, and C_φ is a neural network that classifies samples as fake or not.
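A sketch of the adversarial loss, under the convention above that Curator reports the probability that a sample is fake (the exact expression in Eq. (18) may differ in form):

```python
import numpy as np

def curator_loss(p_fake_on_real, p_fake_on_forged):
    """Cross-entropy loss when Curator outputs the probability that an image
    is fake. Curator minimizes this; the response graph makes Forger maximize it."""
    real_term = -np.mean(np.log(1.0 - p_fake_on_real))  # real images: want p_fake -> 0
    fake_term = -np.mean(np.log(p_fake_on_forged))      # forged images: want p_fake -> 1
    return real_term + fake_term

# A Curator at chance (p_fake = 0.5 everywhere) incurs loss 2 * log 2; a
# near-perfect Curator drives the loss toward zero.
chance = curator_loss(np.array([0.5]), np.array([0.5]))
sharp = curator_loss(np.array([0.01]), np.array([0.99]))
```

Forger improves by pushing Curator back toward the chance value, which is exactly the adversarial dynamic of the minimax game.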
Grammar 3 (generative adversarial networks).
Construct a game played between Forger and Curator, with ancillary players Noise and Environment:

Environment samples images i.i.d. from the data distribution.

Noise samples ε i.i.d. from a fixed noise distribution.

Forger and Curator play parameters θ and φ respectively.

Operator G_θ is a neural network that produces a fake image from a noise sample.

Operator C_φ is a neural network that estimates the probability that an image is fake.

The remaining operators compute a loss that Curator minimizes and Forger maximizes:
(19)  ℓ(θ, φ) = −E_{x ∼ P} [ log(1 − C_φ(x)) ] − E_ε [ log C_φ(G_θ(ε)) ]
Note there are two copies of the classifier operator in the query graph: one applied to real images from Environment and one to the fake images produced by Forger.
The response graph implements the chain rule, with a tweak: the gradient communicated to Forger is multiplied by −1 to ensure that Forger maximizes the loss that Curator is minimizing.
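The sign tweak can be sketched in two lines (a toy example with hypothetical names): negating the gradient turns Curator's descent rule into an ascent rule for Forger.

```python
def descent_step(param, grad, lr=0.1):
    """Curator's update: plain gradient descent on the shared loss."""
    return param - lr * grad

def forger_step(param, grad, lr=0.1):
    """Forger's update: the response graph negates the gradient, so the
    same descent rule ascends the loss Curator is descending."""
    return param - lr * (-grad)

# Toy loss l(p) = p**2 with gradient 2*p: Curator moves toward the minimum
# at 0, while Forger moves away from it.
```

This is the sense in which the grammar reuses the backpropagation response graph unchanged: only a single sign flip separates the adversarial game from joint minimization.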