Semantics, Representations and Grammars for Deep Learning

by   David Balduzzi, et al.

Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.




0 Introduction

Deep learning has achieved remarkable successes in object and voice recognition, machine translation, reinforcement learning and other tasks [1, 2, 3, 4, 5]. From a practical standpoint the problem of supervised learning is well-understood and has largely been solved – at least in the regime where both labeled data and computational power are abundant. The workhorse underlying most deep learning algorithms is error backpropagation [6, 7, 8, 9], which is simply gradient descent distributed across a neural network via the chain rule.

Gradient descent and its variants are well-understood when applied to convex or nearly convex objectives [10, 11, 12, 13]. In particular, they have strong performance guarantees in the stochastic and adversarial settings [14, 15, 16, 17]. The reasons for the success of gradient descent in non-convex settings are less clear, although recent work has provided evidence that most local minima are good enough [18, 19]; that modern convolutional networks are close enough to convex for many results on rates of convergence to apply [20]; and that the rate of convergence of gradient descent can control generalization performance, even in nonconvex settings [21].

Taking a step back, gradient-based optimization provides a well-established set of computational primitives [22], with theoretical backing in simple cases and empirical backing in others. First-order optimization thus falls in broadly the same category as computing an eigenvector or inverting a matrix: given sufficient data and computational resources, we have algorithms that reliably find good enough solutions for a wide range of problems.

This essay proposes to abstract out the optimization algorithms used for weight updates and focus on how the components of deep learning algorithms interact. Treating optimization as a computational primitive encourages a shift from low-level algorithm design to higher-level mechanism design: we can shift attention to designing architectures that are guaranteed to learn distributed representations suited to specific objectives. The goal is to introduce a language at a level of abstraction where designers can focus on formal specifications (grammars) that specify how plug-and-play optimization modules combine into larger learning systems.

0.1 What is a representation?

Let us recall how representation learning is commonly understood. Bengio et al. describe representation learning as “learning transformations of the data that make it easier to extract useful information when building classifiers or other predictors” [23]. More specifically, “a deep learning algorithm is a particular kind of representation learning procedure that discovers multiple levels of representation, with higher-level features representing more abstract aspects of the data” [24]. Finally, LeCun et al. state that multiple levels of representations are obtained “by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. With the composition of enough such transformations, very complex functions can be learned. For classification tasks, higher layers of representation amplify aspects of the input that are important for discrimination and suppress irrelevant variations” [5].

The quotes describe the operation of a successful deep learning algorithm. What is lacking is a characterization of what makes a deep learning algorithm work in the first place. What properties must an algorithm have to learn layered representations? What does it mean for the representation learned by one layer to be useful to another? What, exactly, is a representation?

In practice, almost all deep learning algorithms rely on error backpropagation to “align” the representations learned by different layers of a network. This suggests that the answers to the above questions are tightly bound up in first-order (that is, gradient-based) optimization methods. It is therefore unsurprising that the bulk of the paper is concerned with tracking the flow of first-order information. The framework is intended to facilitate the design of more general first-order algorithms than backpropagation.

Semantics. To get started, we need a theory of the meaning or semantics encoded in neural networks. Since there is nothing special about neural networks, the approach taken is inclusive and minimalistic. Definition 1 states that the meaning of any function is how it implicitly categorizes inputs by assigning them to outputs. The next step is to characterize those functions whose semantics encode knowledge, and for this we turn to optimization [25].

Representations from optimizations. Nemirovski and Yudin developed the black-box computational model to analyze the computational complexity of first-order optimization methods [26, 27, 28, 29]. The black-box model is a more abstract view on optimization than the Turing machine model: it specifies a communication protocol that tracks how often an algorithm makes queries about the objective. It is useful to refine Nemirovski and Yudin’s terminology by distinguishing between black-boxes, which respond with zeroth-order information (the value of a function at the query-point), and gray-boxes (gray for gradient), which respond with zeroth- and first-order information (the gradient or subgradient).

With these preliminaries in hand, Definition 4 proposes that a representation is a function that is a local solution to an optimization problem. Since we do not restrict to convex problems, finding global solutions is not feasible. Indeed, recent experience shows that global solutions are often not necessary in practice [1, 2, 3, 4, 5]. The local solution has similar semantics to – that is, it represents – the ideal solution. The ideal solution usually cannot be found, for several reasons: computational limitations, the nonconvexity of the problem, access to only a finite sample from an unknown distribution, and so on.

To see how Definition 4 connects with representation learning as commonly understood, it is necessary to take a detour through distributed optimization and game theory.

0.2 Distributed representations

Game theory provides tools for analyzing distributed optimization problems where a set of players aim to minimize losses that depend not only on their own actions, but also on the actions of all other players in the game [30, 31]. Game theory has traditionally focused on convex losses since they are more theoretically amenable. Here, the only restriction imposed on losses is that they are differentiable almost everywhere.

Allowing nonconvex losses means that error-backpropagation can be reformulated as a game. Interestingly, there is enormous freedom in choosing the players. They can correspond to individual units, layers, entire neural networks, and a variety of other, intermediate choices. An advantage of the game-theoretic formulation is thus that it applies at many different scales.

Nonconvex losses and local optima are essential to developing a scale-free formalism. Even when it turns out that particular units or a particular layer of a neural network are solving a convex problem, convexity is destroyed as soon as those units or layers are combined to form larger learning systems. Convexity is not a property that is preserved in general when units are combined into layers or layers into networks. It is therefore convenient to introduce the computational primitive $\operatorname{argopt}$ to denote the output of a first-order optimization procedure, see Definition 4.

A concern about excessive generality. A potential criticism is that the formulation is too broad. Very little can be said about nonconvex optimization in general; introducing games where many players jointly optimize a set of arbitrary nonconvex functions only compounds the problem.

Additional structure is required. A successful case study can be found in [20], which presents a detailed game-theoretic analysis of rectifier neural networks. The key to the analysis is that rectifier units are almost convex. The main result is that the rate of convergence of a neural network to a local optimum is controlled by the (waking-)regret of the algorithms applied to compute weight updates in the network.

Whereas [20] relied heavily on specific properties of rectifier nonlinearities, this paper considers a wide range of deep learning architectures. Nevertheless, it is possible to carve out an interesting subclass of nonconvex games by identifying the composition of simple functions as an essential feature common to deep learning architectures. Compositionality is formalized via distributed communication protocols and grammars.

Grammars for games. Neural networks are constructed by composing a series of elementary operations. The resulting feedforward computation is captured as a computation graph [32, 33, 34, 35, 36, 37]. Backpropagation traverses the graph in reverse and recursively computes the gradient with respect to the parameters at each node.

Section 2 maps the feedforward and feedback computations onto the queries and responses that arise in Nemirovski and Yudin’s model of optimization. However, queries and responses are now highly structured. In the query phase, players feed parameters into a computation graph (the Query graph) that performs the feedforward sweep. In the response phase, oracles reveal first-order information that is fed into a second computation graph (the Response graph).

In most cases the Response graph simply implements backpropagation. However, there are examples where it does not. Three are highlighted here, see section 2.3, and especially sections 2.4 and 2.5. Other algorithms where the Response graphs do not simply implement backprop include difference target propagation [38] and feedback alignment [39] (both discussed briefly in section 2.5) and truncated backpropagation through time [40, 41, 42], where a choice is made about where to cut backprop short. Examples where the query and response graph differ are of particular interest, since they point towards more general classes of deep learning algorithms.

A distributed communication protocol is a game with additional structure: the Query and Response graphs, see Definition 7. The graphs capture the compositional structure of the functions learned by a neural network and the compositional structure of the learning procedure respectively. It is important for our purposes that (i) the feedforward and feedback sweeps correspond to two distinct graphs and (ii) the communication protocol is kept distinct from the optimization procedure. That is, the communication protocol specifies how information flows through the networks without specifying how players make use of it. Players can be treated as plug-and-play rational agents that are provided with carefully constructed and coordinated first-order information to optimize as they see fit [43, 44].

Finally, a grammar is a distributed communication protocol equipped with a guarantee that the response graph encodes sufficient information for the players to jointly find a local optimum of an objective function. The paradigmatic example of a grammar is backpropagation. A grammar is thus a game designed to perform a task. A representation learned by one (p)layer is useful to another if the game is guaranteed to converge on a local solution to an objective – that is, if the players interact through a grammar. It follows that the players build representations that jointly encode knowledge about the task.

Caveats. What follows is provisional. The definitions are a first attempt to capture an interesting, and perhaps useful, perspective on deep learning. The essay contains no new theorems, algorithms or experiments, see [45, 20, 46] for “real work” based on the ideas presented here. The essay is not intended to be comprehensive. Many details are left out and many important aspects are not covered: most notably, probabilistic and Bayesian formulations, and various methods for unsupervised pre-training.

A series of worked examples. In line with its provisional nature, much of the essay is spent applying the framework to worked examples: error backpropagation as a supervised model [8]; variational autoencoders [47] and generative adversarial networks [48] for unsupervised learning; the deviator-actor-critic (DAC) model for deep reinforcement learning [46]; and kickback, a biologically plausible variant of backpropagation [45]. The examples were chosen, in part, to maximize variety and, in part, based on familiarity. The discussions are short; the interested reader is encouraged to consult the original papers to fill in the gaps.

The last two examples are particularly interesting since their Response graphs differ substantially from backpropagation. The DAC model constructs a zeroth-order black-box to estimate gradients rather than querying a first-order gray-box. Kickback prunes backprop’s Response graph by replacing most of its gray-boxes with black-boxes and approximating the chain rule with (primarily) local computations.

0.3 Related work

Bottou and Gallinari proposed to decompose neural networks into cooperating modules [49, 50]. Decomposing more general algorithms or models into collections of interacting agents dates back to the shrieking demons that comprised Selfridge’s Pandemonium [51] and a long line of related work [52, 53, 54, 55, 56, 57, 58, 59]. The focus on components of neural networks as players, or rational agents, in their own right developed here derives from work aimed at modeling biological neurons game-theoretically, see [60, 61, 62, 63, 64].

A related approach to semantics based on general value functions can be found in Sutton et al. [65], see remark 1. Computation graphs as applied to backprop are the basis of the Python library Theano [34, 35, 36] and provide the backbone for automatic/algorithmic differentiation [32, 33].

Grammars are a technical term in the theory of formal languages relating to the Chomsky hierarchy [66]. There is no apparent relation between that notion of grammar and the one presented here, aside from both relating to structural rules governing composition. Formal languages and deep learning are sufficiently disparate fields that there is little risk of terminological confusion. Similarly, the notion of semantics introduced here is distinct from semantics in the theory of programming languages.

Although game theory was originally developed to model human interactions [30], it has been pointed out that it may be more directly applicable to interacting populations of algorithms, so-called machina economicus [67, 68, 69, 70, 71, 72]. This paper goes one step further to propose that games played over first-order communication protocols are a key component of the foundations of deep learning.

A source of inspiration for the essay is Bayesian networks and Markov random fields. Probabilistic graphical models and factor graphs provide simple, powerful ways to encode a multivariate distribution’s independencies into a diagram [73, 74, 75]. They have greatly facilitated the design and analysis of probabilistic algorithms. However, there is no comparable framework for distributed optimization and deep learning. The essay is intended as a first step in this direction.

1 Semantics and Representations

This section defines semantics and representations. In short, the semantics of a function is how it categorizes its inputs; a function is a representation if it is selected to optimize an objective. The connection between the definition of representation below and “representation learning” is clarified in section 2.1.

Possible world semantics was introduced by Lewis to formalize the meaning of sentences in terms of counterfactuals [76]. Let $P$ be a proposition about the world. Its truth depends on its content and the state of the world. Rather than allowing the state of the world to vary, it is convenient to introduce the set $W$ of all possible worlds.

Let us denote proposition $P$ applied in world $w \in W$ by $P(w)$. The meaning of $P$ is then the mapping $W \to \{0, 1\}$ which assigns 1 or 0 to each $w \in W$ according to whether or not proposition $P(w)$ is true. Equivalently, the meaning of the proposition is the ordered pair consisting of: all worlds, and the subset of worlds where it is true:

$$\big(W,\ \{w \in W : P(w) = 1\}\big).$$

For example, the meaning of “that is blue” is the subset of possible worlds where I am pointing at a blue object. The concept of blue is rendered explicit in an exhaustive list of possible examples.

A simple extension of possible world semantics from propositions to arbitrary functions is as follows [77]:

Definition 1 (semantics).

Given function $f: X \to Y$, the semantics or meaning of output $y \in Y$ is the ordered pair of sets

$$\big(X,\ f^{-1}(y)\big), \qquad f^{-1}(y) = \{x \in X : f(x) = y\}.$$
Functions implicitly categorize inputs by assigning outputs to them; the meaning of an output is the category.
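On a finite input set, this notion of meaning can be made concrete by grouping inputs according to the output they receive. The following is a minimal sketch, not part of the original framework; the names `semantics`, `parity` and `meaning` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def semantics(f, inputs):
    """Group a finite collection of inputs by the output that f assigns to them.

    Returns a dict mapping each output y to the set of inputs sent to y,
    that is, the category that gives y its meaning.
    """
    categories = defaultdict(set)
    for x in inputs:
        categories[f(x)].add(x)
    return dict(categories)


# Hypothetical example: parity categorizes integers into evens and odds.
parity = lambda x: x % 2
meaning = semantics(parity, range(6))
print(meaning)
```

Two functions with the same partition of inputs carry the same semantics in this sense, even if they label the categories differently.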

Whereas propositions are true or false, the output of a function is neither. However, if two functions both optimize a criterion, then one can refer to how accurately one function represents the other. Before we can define representations we therefore need to take a quick detour through optimization:

Definition 2 (optimization problem).

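An optimization problem is a pair $(\Theta, R)$ consisting in parameter-space $\Theta$ and objective $R: \Theta \to \mathbb{R}$ that is differentiable almost everywhere.

The solution to the global optimization problem is

$$\theta^* \in \underset{\theta \in \Theta}{\operatorname{arg\,max}}\ R(\theta) \quad \text{or} \quad \theta^* \in \underset{\theta \in \Theta}{\operatorname{arg\,min}}\ R(\theta),$$

which is either a maximum or minimum according to the nature of the objective.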

The solution may not be unique; it also may not exist unless further restrictions are imposed. Such details are ignored here.

Next recall the black-box optimization framework introduced by Nemirovski and Yudin [26, 27, 28, 29].

Definition 3 (communication protocol).

A communication protocol for optimizing an unknown objective $R: \Theta \to \mathbb{R}$ consists in a User (or Player) and an Oracle. On each round, User presents a query $\theta \in \Theta$. Oracle can respond in one of two ways, depending on the nature of the protocol:

  • Black-box (zeroth-order) protocol.
    Oracle responds with the value $R(\theta)$.

  • Gray-box (first-order) protocol.
    Oracle responds with either the gradient $\nabla R(\theta)$ or with the gradient together with the value.




The protocol specifies how Player and Oracle interact without specifying the algorithm used by Player to decide which points to query. The next section introduces distributed communication protocols as a general framework that includes a variety of deep learning architectures as special cases – again without specifying the precise algorithms used to perform weight updates.
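The separation between protocol and algorithm can be sketched in a few lines. In the example below, the two oracle constructors fix what information is returned per query, while `user_gradient_descent` is just one possible User. All names, the quadratic objective and the step size are assumptions made for illustration.

```python
def make_black_box(objective):
    """Zeroth-order Oracle: answers a query with the objective's value only."""
    def oracle(theta):
        return objective(theta)
    return oracle


def make_gray_box(objective, gradient):
    """First-order Oracle: answers with the value together with the gradient."""
    def oracle(theta):
        return objective(theta), gradient(theta)
    return oracle


def user_gradient_descent(oracle, theta, step=0.1, rounds=100):
    """One possible User: plain gradient descent against a gray-box."""
    for _ in range(rounds):
        _, grad = oracle(theta)       # present a query, read the response
        theta = theta - step * grad
    return theta


# Hypothetical objective with its minimum at theta = 3.
objective = lambda t: (t - 3.0) ** 2
gradient = lambda t: 2.0 * (t - 3.0)

black_box = make_black_box(objective)
gray_box = make_gray_box(objective, gradient)
theta_hat = user_gradient_descent(gray_box, theta=0.0)
```

Swapping in a different User (for example one with momentum) leaves the oracles untouched, which is the sense in which the protocol does not specify the algorithm.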

Unlike [26, 28] we do not restrict to convex problems. Finding a global optimum is not always feasible, and in practice often unnecessary.

Definition 4 (representation).

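Let $\mathcal{F}$ be a function space and

$$f: \Theta \to \mathcal{F}, \qquad \theta \mapsto f_\theta$$

be a map from parameter-space to functions. Further suppose that objective function $R: \mathcal{F} \to \mathbb{R}$ is given.

A representation is a local solution to the optimization problem

$$f_{\hat\theta} \quad \text{where} \quad \hat\theta \in \underset{\theta \in \Theta}{\operatorname{argopt}}\ R(f_\theta),$$

corresponding to a local minimum or maximum according to whether the objective is minimized or maximized.

Intuitively, the objective quantifies the extent to which functions in $\mathcal{F}$ categorize their inputs similarly. The operation $\operatorname{argopt}$ applies a first-order method to find a function $f_{\hat\theta}$ whose semantics resembles the optimal solution $f_{\theta^*}$, where $\theta^*$ is a global solution.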

In short, representations are functions with useful semantics, where usefulness is quantified using a specific objective: the lower the loss or higher the reward associated with a function, the more useful it is. The relation between Definition 4 and representations as commonly understood in the deep learning literature is discussed in section 2.1 below.
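The difference between a local solution found by a first-order method and the global optimum can be illustrated on a one-dimensional nonconvex objective. This is a sketch under stated assumptions: the quartic objective, the step size and the helper name `local_argopt` are all hypothetical, and plain gradient descent stands in for whatever first-order method is actually used.

```python
def objective(theta):
    """A nonconvex objective with two local minima (chosen for illustration)."""
    return theta ** 4 - 3.0 * theta ** 2 + theta


def gradient(theta):
    return 4.0 * theta ** 3 - 6.0 * theta + 1.0


def local_argopt(theta, step=0.01, rounds=2000):
    """Run gradient descent from theta and return the point it settles at."""
    for _ in range(rounds):
        theta -= step * gradient(theta)
    return theta


left = local_argopt(-2.0)    # settles in the basin on the negative side
right = local_argopt(2.0)    # settles in the basin on the positive side
print(left, objective(left))
print(right, objective(right))
```

Both outputs are stationary points, but only one attains the lower objective value. Which one is returned depends on the starting point, so the output of the procedure is a local rather than a global solution.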

Remark 1 (value function semantics).

In related work, Sutton et al [65] proposed that semantics – i.e. knowledge about the world – can be encoded in general value functions that provide answers to specific questions about expected rewards. Definition 1 is more general than their approach since it associates a semantics to any function. However, the function must arise from optimizing an objective for its semantics to accurately represent a phenomenon of interest.

1.1 Supervised learning

The main example of a representation arises under supervised learning.

Representation 1 (supervised learning).

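Let $X$ and $Y$ be an input space and a set of labels and $\ell: Y \times Y \to \mathbb{R}$ be a loss function. Suppose that $\{f_\theta: X \to Y \mid \theta \in \Theta\}$ is a parametrized family of functions.

  • Nature samples labeled pairs $(x, y)$ i.i.d. from distribution $P_{X \times Y}$, singly or in batches.

  • Predictor chooses parameters $\theta \in \Theta$.

  • Objective is

$$R(\theta) = \mathbb{E}_{(x, y) \sim P_{X \times Y}}\big[\ell(f_\theta(x), y)\big].$$

The query and response phases can be depicted graphically as

[Figure: Query and Response graphs for supervised learning]

The predictor $f_{\hat\theta}$, where $\hat\theta$ is a local optimum of the objective, is then a representation of the optimal predictor $f_{\theta^*}$, where $\theta^*$ is a global optimum.

A commonly used mapping from parameters to functions is

$$f_\theta(x) = \langle \phi(x), \theta \rangle$$

where a feature map $\phi: X \to \mathbb{R}^d$ is fixed.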

The setup admits a variety of complications in practice. Firstly, it is typically infeasible even to find a local optimum. Instead, a solution that is within some small $\epsilon$ of the local optimum suffices. Secondly, the distribution from which Nature samples is unknown, so the expectation is replaced by a sum over a finite sample. The quality of the resulting representation has been extensively studied in statistical learning theory [78]. Finally, it is often convenient to modify the objective, for example by incorporating a regularizer. Thus, a more detailed presentation would conclude that


yields a representation of the solution to . To keep the discussion and notation simple, we do not consider any of these important details.
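As a concrete and deliberately small instance of the supervised setup, the sketch below fits a linear predictor over a fixed feature map by gradient descent on the empirical risk, with the expectation replaced by an average over a finite sample. The feature map `phi`, the squared loss, the synthetic noise-free data and the step size are all assumptions for illustration.

```python
def phi(x):
    """Fixed feature map (assumed): the input plus a constant bias feature."""
    return (x, 1.0)


def predict(theta, x):
    """Linear predictor: inner product of the features with the parameters."""
    return sum(t * p for t, p in zip(theta, phi(x)))


def empirical_risk(theta, sample):
    """Mean squared loss over a finite sample, standing in for the expectation."""
    return sum((predict(theta, x) - y) ** 2 for x, y in sample) / len(sample)


def risk_gradient(theta, sample):
    """Gradient of the empirical risk with respect to the parameters."""
    grad = [0.0] * len(theta)
    for x, y in sample:
        residual = predict(theta, x) - y
        for j, p in enumerate(phi(x)):
            grad[j] += 2.0 * residual * p / len(sample)
    return grad


# Synthetic sample playing the role of Nature: y = 2x + 1 on a grid in [-1, 1].
sample = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]

theta = [0.0, 0.0]
for _ in range(500):
    grad = risk_gradient(theta, sample)
    theta = [t - 0.1 * g for t, g in zip(theta, grad)]
```

Because this particular objective is convex, the local solution coincides with the global one; the point of the framework is that the same query and response pattern applies when it does not.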

It is instructive to unpack the protocol, by observing that the objective is a composite function involving , and :








The notation is borrowed from backpropagation. It is shorthand for the derivative of the objective with respect to parameters .

Nature is not a deterministic black-box since it is not queried directly: Nature produces pairs stochastically, rather than in response to specific inputs. Our notion of black-box can be extended to stochastic black-boxes, see e.g. [37]. However, once again we prefer to keep the exposition as simple as possible.

1.2 Unsupervised learning

The second example concerns fitting a probabilistic or generative model to data. A natural approach is to find the distribution under which the observed data is most likely:

Representation 2 (maximum likelihood estimation).

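Let $X$ be a data space.

  • Nature samples points $x \in X$ from distribution $P_X$.

  • Estimator chooses parameters $\theta \in \Theta$.

  • Operator $Q(\cdot\,; \theta)$ computes a probability density on $X$ that depends on parameter $\theta$.

  • Operator $-\log(\cdot)$ acts as a loss. The objective is to minimize

$$R(\theta) = \mathbb{E}_{x \sim P_X}\big[-\log Q(x; \theta)\big].$$

The estimate $Q(\cdot\,; \hat\theta)$, where $\hat\theta$ is a local minimum of the objective, is a representation of the optimal solution, and can also be considered a representation of $P_X$. The setup extends easily to maximum a posteriori estimation.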
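A minimal numerical sketch of maximum likelihood estimation as a first-order procedure: the model family is assumed to be unit-variance Gaussians with unknown mean, and the average negative log-likelihood is minimized by gradient descent. The data values, step size and function names are hypothetical.

```python
import math


def neg_log_density(x, mu):
    """Negative log of a unit-variance Gaussian density with mean mu at x."""
    return 0.5 * (x - mu) ** 2 + 0.5 * math.log(2.0 * math.pi)


def objective(mu, data):
    """Average negative log-likelihood of the data under mean mu."""
    return sum(neg_log_density(x, mu) for x in data) / len(data)


def objective_gradient(mu, data):
    """Derivative of the average negative log-likelihood with respect to mu."""
    return sum(mu - x for x in data) / len(data)


data = [1.2, 0.7, 1.9, 1.4, 0.8]   # hypothetical observations from Nature

mu = 0.0
for _ in range(200):
    mu -= 0.5 * objective_gradient(mu, data)
```

For this model family the procedure recovers the sample mean, which is the closed-form maximum likelihood estimate.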

As for supervised learning, the protocol can be unpacked by observing that the objective has a compositional structure:








1.3 Reinforcement learning

The third example is taken from reinforcement learning [79]. We will return to reinforcement learning in section 2.4, so the example is presented in some detail. In reinforcement learning, an agent interacts with its environment, which is often modeled as a Markov decision process consisting of state space $S$, action space $A$, initial distribution $P_1(s)$ on states, stationary transition distribution $P(s' \mid s, a)$ and reward function $r: S \times A \to \mathbb{R}$. The agent chooses actions based on a policy: a function from states to actions. The goal is to find the optimal policy.

Actor-critic methods break up the problem into two pieces [80]. The critic estimates the expected value of state-action pairs given the current policy, and the actor attempts to find the optimal policy using the estimates provided by the critic. The critic is typically trained via temporal difference methods [81, 82].

Let denote the distribution on states at time given policy and initial state at and let . Let be the discounted future reward. Define the value of a state-action pair as


Unfortunately, the value-function cannot be queried. Instead, temporal difference methods take a bootstrapped approach by minimizing the Bellman error:


where is the state subsequent to .

Representation 3 (temporal difference learning).

Critic interacts with black-boxes Actor and Nature. (Nature’s outputs depend on Actor’s actions, so the Query graph should technically have an additional arrow from Actor to Nature.)

  • Critic plays parameters .

  • Two Operators estimate the value function and compute the Bellman error, respectively. In practice, it turns out to be useful to clone the value-estimate periodically and compute a slightly modified Bellman error:


    where is the cloned estimate. Cloning improves the stability of TD-learning [4]. A nice conceptual side-effect of cloning is that TD-learning reduces to gradient descent.









The estimate is a representation of the true value function.
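The effect of cloning can be sketched on a toy problem. The example below assumes a two-state deterministic chain under a fixed policy, so that state-action values reduce to one value per state, and uses a tabular estimate. Between clone refreshes the clone is held fixed, so the inner loop is ordinary gradient descent on a squared error. The chain, the discount factor and all names are assumptions for illustration.

```python
GAMMA = 0.5
NEXT_STATE = {0: 1, 1: 0}      # deterministic transitions under a fixed policy
REWARD = {0: 1.0, 1: 0.0}      # reward received on leaving each state


def modified_bellman_gradient(q, q_clone):
    """Gradient in q of sum_s (r(s) + GAMMA * q_clone[s'] - q[s])**2.

    The clone is treated as a constant, so no gradient flows through the target.
    """
    grad = {}
    for s in q:
        target = REWARD[s] + GAMMA * q_clone[NEXT_STATE[s]]
        grad[s] = -2.0 * (target - q[s])
    return grad


q = {0: 0.0, 1: 0.0}
for _ in range(60):                 # outer loop: refresh the clone
    q_clone = dict(q)
    for _ in range(20):             # inner loop: plain gradient descent
        grad = modified_bellman_gradient(q, q_clone)
        q = {s: q[s] - 0.25 * grad[s] for s in q}
```

For this chain the true values solve q0 = 1 + 0.5 q1 and q1 = 0.5 q0, giving q0 = 4/3 and q1 = 2/3, which the estimate approaches as the clone is refreshed.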

Remark 2 (on temporal difference learning as first-order method).

Temporal difference learning is not strictly speaking a gradient-based method [82]. The residual gradient method performs gradient descent on the Bellman error, but suffers from double sampling [83]. Projected fixpoint methods minimize the projected Bellman error via gradient descent and have nice convergence properties [84, 85, 86]. An interesting recent proposal is implicit TD learning [87], which is based on implicit gradient descent [88].

Section 2.4 presents the Deviator-Actor-Critic model which simultaneously learns a value-function estimate and a locally optimal policy.

2 Protocols and Grammars

It is often useful to decompose complex problems into simpler subtasks that can be handled by specialized modules. Examples include variational autoencoders, generative adversarial networks and actor-critic models. Neural networks are particularly well-adapted to modular designs, since units, layers and even entire networks can easily be combined analogously to bricks of Lego [49].

However, not all configurations are viable models. A methodology is required to distinguish good designs from bad. This section provides a basic language to describe how bricks are glued together that may be a useful design tool. The idea is to extend the definitions of optimization problems, protocols and representations from section 1 from single to multi-player optimization problems.

Definition 5 (game).

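A distributed optimization problem or game is a set $[N] = \{1, \ldots, N\}$ of players, a parameter space $\Theta = \prod_{i=1}^{N} \Theta_i$, and loss vector $\ell = (\ell_1, \ldots, \ell_N): \Theta \to \mathbb{R}^N$. Player $i$ picks moves from $\Theta_i$ and incurs loss determined by $\ell_i$. The goal of each player is to minimize its loss, which depends on the moves of the other players.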

The classic example is a finite game [30], where player $i$ has a menu of $d_i$ actions and chooses a distribution over actions on each round. Losses are specified for individual actions, and extended linearly to distributions over actions. A natural generalization of finite games is convex games where the parameter spaces are compact convex sets and each loss $\ell_i$ is a convex function in its $i$-th argument [89]. It has been shown that players implementing no-regret algorithms are guaranteed to converge to a correlated equilibrium in convex games [90, 91, 89].

The notion of game in Definition 5 is too general for our purposes. Additional structure is required.

Definition 6 (computation graph).

A computation graph is a directed acyclic graph with two kinds of nodes:

  • Inputs are set externally (in practice by Players or Oracles).

  • Operators produce outputs that are a fixed function of their parents’ outputs.

Computation graphs are a useful tool for calculating derivatives [32, 34, 35, 36, 33]. For simplicity, we restrict to deterministic computation graphs. More general stochastic computation graphs are studied in [37].
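The two kinds of nodes can be written down directly. The sketch below is a deliberately minimal deterministic computation graph with forward evaluation only; the class names `Input` and `Operator` mirror the definition, while the particular graph is a hypothetical example.

```python
class Input:
    """Node whose value is set externally (by a Player or an Oracle)."""

    def __init__(self, value=None):
        self.value = value

    def evaluate(self):
        return self.value


class Operator:
    """Node whose output is a fixed function of its parents' outputs."""

    def __init__(self, fn, *parents):
        self.fn = fn
        self.parents = parents

    def evaluate(self):
        # Recursively evaluate parents, then apply the fixed function.
        return self.fn(*(p.evaluate() for p in self.parents))


# Hypothetical graph computing (theta * x - 1)^2.
theta = Input(2.0)
x = Input(3.0)
product = Operator(lambda a, b: a * b, theta, x)
loss = Operator(lambda p: (p - 1.0) ** 2, product)

loss_value = loss.evaluate()
```

The recursive evaluation re-computes shared parents each time they are reached; a practical implementation would visit nodes once in topological order and cache the outputs for the reverse sweep.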

A distributed communication protocol extends the communication protocol in Definition 3 to multiplayer games using two computation graphs.

Definition 7 (distributed communication protocol).

A distributed communication protocol is a game where each round has two phases, determined by two computation graphs:

  • Query phase. Players provide inputs to the Query graph that Operators transform into outputs.

  • Response phase. Operators in the Query graph act as Oracles in the Response graph: they input subgradients that are transformed and communicated to the Players.

The moves chosen by Players depend only on their prior moves and the information communicated to them by the Response graph.

The protocol specifies how Players and Oracles communicate without specifying the optimization algorithms used by the Players. The addition of a Response graph allows more general computations than simply backpropagating the gradients of the Query phase. The additional flexibility allows the design of new algorithms, see sections 2.4 and 2.5 below. It is also sometimes necessary for computational reasons. For example, backpropagation through time on recurrent networks typically runs over a truncated Response graph [40, 41, 42].

Suppose that we wish to optimize an objective function that depends on all the moves of all the players. Finding a global optimum is clearly not feasible. However, we may be able to construct a protocol such that the players are jointly able to find local optima of the objective. In such cases, we refer to the protocol as a grammar:

Definition 8 (grammar).

A grammar for objective is a distributed communication protocol where the Response graph provides sufficient first-order information to find a local optimum of .

The guarantee ensures that the representations constructed by Players in a grammar can be combined into a coherent distributed representation. That is, it ensures that the representations constructed by the Players transform data in a way that is useful for optimizing the shared objective.

The Players’ losses need not be explicitly computed. All that is necessary is that the Response phase communicate the gradient information needed for Players to locally minimize their losses – and that doing so yields a local optimum of the objective.

Basic building blocks: function composition and the chain rule. Functions can be inserted into grammars as lego-like building blocks via function composition during queries and the chain rule during responses. Consider a function that takes two inputs, one provided by a Player and the other by upstream computations. The output of the function is communicated downstream in the Query phase:






The chain rule is implemented in the Response phase as follows. The Oracle reports the gradient of the function with respect to both of its inputs. An Operator multiplies this gradient by the error received from downstream, via matrix multiplication. The projections of the product onto the first and second components are reported to the Player and upstream respectively. (Alternatively, to avoid having this Operator produce two outputs, the entire vector can be reported in both directions, with the irrelevant components ignored.)
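As a minimal sketch of this building block, consider a scalar function of a Player's parameter and an upstream input. The function `G`, the names `oracle_gradient` and `respond`, and the downstream loss are assumptions chosen for illustration; the essay does not fix a particular function.

```python
import math


def G(theta, x):
    """Query phase: combine the Player's move theta with the upstream input x."""
    return math.tanh(theta * x)


def oracle_gradient(theta, x):
    """Oracle: gradient of G with respect to (theta, x)."""
    y = G(theta, x)
    slope = 1.0 - y * y                  # derivative of tanh at theta * x
    return (slope * x, slope * theta)


def respond(delta, theta, x):
    """Response phase: multiply the downstream error by the Oracle's gradient,
    then project the product onto its two components."""
    product = tuple(delta * g for g in oracle_gradient(theta, x))
    to_player, to_upstream = product
    return to_player, to_upstream


theta, x = 0.7, -1.3
y = G(theta, x)
delta = y                                # error from an assumed loss 0.5 * y**2
to_player, to_upstream = respond(delta, theta, x)
```

The first component is what the Player needs to update `theta`; the second is passed upstream, where it plays the role of `delta` for the preceding block.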

Summary of guarantees. A selection of examples is presented below. Guarantees fall under the following broad categories:

  1. Exact gradients.
    Under error backpropagation the Response graph implements the chain rule, which guarantees that Players receive the gradients of their loss functions; see section 2.1.

  2. Surrogate objectives.
    The variational autoencoder uses a surrogate objective: the variational lower bound. Maximizing the surrogate is guaranteed to also maximize the true objective, which is computationally intractable; see section 2.2.

  3. Learned objectives.
    In the case of generative adversarial networks and the DAC-model, some of the players learn a loss that is guaranteed to align with the true objective, which is unknown; see sections 2.3 and 2.4.

  4. Estimated gradients.
    In the DAC-model and kickback, gradient estimates are substituted for the true gradient; see sections 2.4 and 2.5. Guarantees are provided on the estimates.

Remark 3 (fine- and coarse-graining).

There is considerable freedom regarding the choice of players. In the examples below, players are typically chosen to be layers or entire neural networks to keep the diagrams simple. It is worth noting that zooming in, such that players correspond to individual units, has proven to be a useful tool when analyzing neural networks [45, 20, 46].

The game-theoretic formulation is thus scale-free and can be coarse- or fine-grained as required. A mathematical language for tracking the structure of hierarchical systems at different scales is provided by operads, see [92] and the references therein, which are the natural setting to study the composition of operators that receive multiple inputs.

2.1 Error backpropagation

The main example of a grammar is a neural network using error backpropagation to perform supervised learning. Layers in the network can be modeled as players in a game. Setting each (p)layer’s objective as the network’s loss, which it minimizes using gradient descent, yields backpropagation.

Grammar 1 (backpropagation).

An L-layer neural network can be reformulated as a game played between L + 1 players, corresponding to Nature and the Layers of the network. The query graph for a 3-layer network is:







  • Nature plays datapoints sampled i.i.d. from the data distribution and acts as the zeroth player.

  • Each Layer plays its weight matrix.

  • Operators compute the output of each layer, along with the loss.

The response graph performs error backpropagation:









The protocol can be extended to convolutional networks by replacing the matrix multiplications performed by each operator with convolutions and adding parameterless max-pooling operators.


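Guarantee. The loss of every (p)layer is the loss of the network. It follows by the chain rule that the Response graph communicates to each player the gradient of this loss with respect to that player’s weights.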
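The backpropagation grammar can be sketched for a 3-layer network as follows. To keep the example short, each Layer plays a single scalar weight rather than a weight matrix, Nature replays one fixed datapoint, and the tanh nonlinearity, squared loss and step size are all assumptions made for illustration.

```python
import math


def query(weights, x, y):
    """Query graph: Operators compute each layer's output and the loss."""
    w1, w2, w3 = weights
    h1 = math.tanh(w1 * x)
    h2 = math.tanh(w2 * h1)
    out = w3 * h2
    loss = 0.5 * (out - y) ** 2
    return h1, h2, out, loss


def response(weights, x, y):
    """Response graph: the chain rule sends each Layer its own gradient."""
    w1, w2, w3 = weights
    h1, h2, out, _ = query(weights, x, y)
    d_out = out - y                       # d loss / d out
    g3 = d_out * h2                       # message to Layer 3
    d_a2 = d_out * w3 * (1.0 - h2 * h2)   # error at Layer 2's pre-activation
    g2 = d_a2 * h1                        # message to Layer 2
    d_a1 = d_a2 * w2 * (1.0 - h1 * h1)    # error at Layer 1's pre-activation
    g1 = d_a1 * x                         # message to Layer 1
    return [g1, g2, g3]


x, y = 1.0, 0.3                           # Nature's move (held fixed here)
weights = [0.5, -0.8, 1.2]                # the Layers' initial moves
initial_weights = list(weights)
losses = []
for _ in range(100):
    losses.append(query(weights, x, y)[3])
    messages = response(weights, x, y)
    # Every (p)layer runs gradient descent on the shared loss.
    weights = [w - 0.1 * g for w, g in zip(weights, messages)]
```

No Layer ever evaluates the loss itself: each update uses only the Layer's previous weight and the message it receives from the Response graph.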

Representation learning. We are now in a position to relate the notion of representation in definition 4 with the standard notion of representation learning in neural networks. In the terminology of section 1, each player learns a representation. The representations learned by the different players form a coherent distributed representation because they jointly optimize a single objective function.

Abstractly, the objective can be written as


where . The goal is to minimize the composite objective.

If we set then the function fits the definition of representation above. Moreover, the compositional structure of the network implies that is composed of subrepresentations corresponding to the optimizations performed by the different players in the grammar: each function is a local optimum – where is optimized to transform its inputs into a form that is useful to the network as a whole.

Detailed analysis of convergence rates. Little can be said in general about the rate of convergence of the layers in a neural network since the loss is not convex. However, neural networks can be decomposed further by treating the individual units as players. When the units are linear or rectilinear, it turns out that the network is a circadian game. The circadian structure provides a way to convert results about the convergence of convex optimization methods into results about the global convergence of a rectifier network to a local optimum, see [20].

2.2 Variational autoencoders

The next example extends the unsupervised setting described in section 1.2. Suppose that observations $x$ are sampled i.i.d. from a two-step stochastic process: a latent value $z$ is sampled from $p(z)$, after which $x$ is sampled from $p(x \mid z)$.

The goal is to (i) find the maximum likelihood estimator for the observed data and (ii) estimate the posterior distribution on $z$ conditioned on an observation $x$. A straightforward approach is to maximize the marginal likelihood

$$p(x) = \int p(z)\, p(x \mid z)\, dz \qquad (15)$$

and then compute the posterior

$$p(z \mid x) = \frac{p(z)\, p(x \mid z)}{p(x)}.$$

However, the integral in Eq. (15) is typically intractable, so a more roundabout tactic is required. The approach proposed in [47] is to construct two neural networks, a decoder that learns a generative model approximating $p(x \mid z)$, and an encoder that learns a recognition model or posterior approximating $p(z \mid x)$.

It turns out to be useful to replace the encoder with a deterministic function and a noise source that are compatible. Here, compatible means that sampling a latent value from the encoder given an observation is equivalent to sampling from the noise source and computing the latent value by applying the deterministic function to the observation and the noise sample.
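The compatibility condition can be illustrated with a Gaussian encoder. In the sketch below, the functions `encoder_mean` and `encoder_std` are hypothetical stand-ins for a neural network, and the Gaussian form is an assumption; the point is only that the two sampling routes produce the same distribution.

```python
import random
import statistics


def encoder_mean(x):
    """Hypothetical deterministic map from an observation to a mean."""
    return 0.5 * x


def encoder_std(x):
    """Hypothetical deterministic map from an observation to a std. dev."""
    return 0.2


def sample_directly(x, rng):
    """Sample the latent value straight from the encoder's distribution."""
    return rng.gauss(encoder_mean(x), encoder_std(x))


def sample_via_noise(x, rng):
    """Sample noise first, then apply a deterministic function of (x, noise).
    The latent value is now differentiable in the encoder's outputs."""
    eps = rng.gauss(0.0, 1.0)
    return encoder_mean(x) + encoder_std(x) * eps


x = 1.0
direct = [sample_directly(x, random.Random(1)) for _ in range(1)]  # one draw
rng_a, rng_b = random.Random(1), random.Random(2)
direct = [sample_directly(x, rng_a) for _ in range(20000)]
via_noise = [sample_via_noise(x, rng_b) for _ in range(20000)]

mean_gap = abs(statistics.mean(direct) - statistics.mean(via_noise))
std_gap = abs(statistics.stdev(direct) - statistics.stdev(via_noise))
```

Both gaps shrink as the number of samples grows, which is the empirical counterpart of the two procedures being equivalent in distribution.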

Grammar 2 (variational autoencoder).

A variational autoencoder is a game played between Encoder, Decoder, Noise and Environment. The query graph is






  • Environment plays i.i.d. samples from the data distribution.

  • Noise plays i.i.d. samples from its distribution. It also communicates its density function, which is analogous to a gradient and is the reason that Noise is gray rather than black-box.

  • Encoder and Decoder play the parameters of their respective networks.

  • Encoder’s Operator is a neural network that encodes samples into latent variables.

  • Decoder’s Operator is a neural network that estimates the probability of an observation conditioned on its latent variable.

  • The remaining operators compute the (negative) variational lower bound


The response graph implements backpropagation:









Guarantee. The guarantee has two components:

  1. Maximizing the variational lower bound yields (i) a maximum likelihood estimator and (ii) an estimate of the posterior on the latent variable [47].

  2. The chain rule ensures that the correct gradients are communicated to Encoder and Decoder.

The first guarantee is that the surrogate objective computed by the query graph yields good solutions. The second guarantee is that the response graph communicates the correct gradients.
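The first guarantee can be checked numerically on a model where the marginal likelihood is known in closed form. The sketch below assumes a one-dimensional linear-Gaussian model (latent value standard normal, observation normal with mean `w * z` and standard deviation `s`) and a Gaussian encoder with mean `m` and variance `v`; none of these choices come from the essay, and the function names are hypothetical.

```python
import math


def log_marginal(x, w, s):
    """Exact log-likelihood: x is normal with mean 0 and variance w^2 + s^2."""
    var = w * w + s * s
    return -0.5 * math.log(2.0 * math.pi * var) - x * x / (2.0 * var)


def lower_bound(x, w, s, m, v):
    """Variational lower bound for a Gaussian encoder N(m, v)."""
    expected_log_lik = (
        -0.5 * math.log(2.0 * math.pi * s * s)
        - ((x - w * m) ** 2 + w * w * v) / (2.0 * s * s)
    )
    kl_to_prior = 0.5 * (v + m * m - 1.0 - math.log(v))
    return expected_log_lik - kl_to_prior


def exact_posterior(x, w, s):
    """Mean and variance of the true posterior on the latent value."""
    var = w * w + s * s
    return w * x / var, s * s / var


x, w, s = 1.0, 1.0, 1.0
target = log_marginal(x, w, s)
m_star, v_star = exact_posterior(x, w, s)
bound_at_posterior = lower_bound(x, w, s, m_star, v_star)
bound_at_prior = lower_bound(x, w, s, 0.0, 1.0)
```

In this toy model the bound is tight exactly when the encoder matches the true posterior and is strictly smaller otherwise, so pushing the bound up improves the encoder and the likelihood together. The query graph in the grammar computes the negative of this quantity as a loss.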

2.3 Generative-Adversarial networks

A recent approach to designing generative models is to construct an adversarial game between Forger and Curator [48]. Forger generates samples; Curator aims to discriminate the samples produced by Forger from those produced by Nature. Forger aims to create samples realistic enough to fool Curator.

If Forger plays parameters and Curator plays then the game is described succinctly via


where Forger’s neural network converts noise into samples and Curator’s neural network classifies samples as fake or not.

Grammar 3 (generative adversarial networks).

Construct a game played between Forger and Curator, with ancillary players Noise and Environment:

  • Environment samples images i.i.d. from the data distribution.

  • Noise draws i.i.d. samples from its distribution.

  • Forger and Curator play the parameters of their respective networks.

  • Forger’s Operator is a neural network that produces a fake image from a noise sample.

  • Curator’s Operator is a neural network that estimates the probability that an image is fake.

  • The remaining operators compute a loss that Curator minimizes and Forger maximizes







Note there are two copies of Curator’s Operator in the Query graph, one applied to real images and one to fake images. The response graph implements the chain rule, with a tweak that multiplies the gradient communicated to Forger by −1 to ensure that Forger maximizes the loss that Curator is minimizing.
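The sign flip can be sketched with scalar players. Everything concrete below is an assumption made for illustration: the one-parameter Forger and Curator, the cross-entropy form of the loss, and the numerical values. Both players apply the same descent update to the message they receive; the Response graph alone decides who minimizes and who maximizes.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def query(theta, phi, real, noise):
    """Query graph: Forger's Operator, two copies of Curator's Operator, loss."""
    fake = theta + noise                     # Forger's (toy) Operator
    d_real = sigmoid(phi * real)             # Curator's Operator on the real sample
    d_fake = sigmoid(phi * fake)             # second copy, on the fake sample
    # Curator's cross-entropy, with outputs read as "probability of fake".
    loss = -math.log(1.0 - d_real) - math.log(d_fake)
    return fake, d_real, d_fake, loss


def response(theta, phi, real, noise):
    """Response graph: chain rule, with Forger's gradient multiplied by -1."""
    fake, d_real, d_fake, _ = query(theta, phi, real, noise)
    grad_phi = d_real * real - (1.0 - d_fake) * fake
    grad_theta = -(1.0 - d_fake) * phi
    return {"Curator": grad_phi, "Forger": -1.0 * grad_theta}


theta, phi, real, noise, step = 0.5, 1.0, 2.0, 0.1, 0.1
messages = response(theta, phi, real, noise)
loss_before = query(theta, phi, real, noise)[3]
# Each player takes an ordinary descent step on its own message.
loss_after_forger = query(theta - step * messages["Forger"], phi, real, noise)[3]
loss_after_curator = query(theta, phi - step * messages["Curator"], real, noise)[3]
```

With these values, Forger's step raises the loss and Curator's step lowers it, even though the two players run the identical update rule.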