# A Separation Principle for Control in the Age of Deep Learning

We review the problem of defining and inferring a "state" for a control system based on complex, high-dimensional, highly uncertain measurement streams such as videos. Such a state, or representation, should contain all and only the information needed for control, and discount nuisance variability in the data. It should also have finite complexity, ideally modulated depending on available resources. This representation is what we want to store in memory in lieu of the data, as it "separates" the control task from the measurement process. For the trivial case with no dynamics, a representation can be inferred by minimizing the Information Bottleneck Lagrangian in a function class realized by deep neural networks. The resulting representation has much higher dimension than the data, already in the millions, but it is smaller in the sense of information content, retaining only what is needed for the task. This process also yields representations that are invariant to nuisance factors and have maximally independent components. We extend these ideas to the dynamic case, where the representation is the posterior density of the task variable given the measurements up to the current time, which is in general much simpler than the prediction density maintained by the classical Bayesian filter. Again, this can be finitely parametrized using a deep neural network, and some applications are already beginning to emerge. No explicit assumption of Markovianity is needed; instead, complexity trades off approximation of an optimal representation, including the degree of Markovianity.


## 1 Introduction

Say you have a time series of data and wish to store a function of it having constant complexity, which we call a representation, that is useful for a prediction or control task. “Useful” means that the representation retains all the information the data contain about the task. (For such a representation to exist, we must assume that the data satisfy a Markov model, an assumption to which we will return later.)

For example, if $y_t$ is the output of a linear time-invariant system with a “true” finite-dimensional state, driven by white zero-mean Gaussian noise, then a sufficient representation of the data $y^t$ for the task $z_t$, for instance prediction of future measurements, is the mean and covariance of the posterior density of the state given the data kalman60, or equivalently the posterior itself (more on this equivalence later). This representation summarizes all past history of the data for the purpose of the task, in this case predicting its future. (If the model is not known, the representation includes a constant component that belongs to an equivalence class of realizations arun1990balanced, and there is an elegant geometry that is exploited in subspace system identification lindquist1979stochastic. Even if there is no “true” finite-dimensional state, under the Markovian assumption with Gaussian inputs of known dimension, one can infer a finite-dimensional predictor along with the state and model parameters akaike1974new.) In other words, given the representation, past data is independent of future data. Such independence makes it possible to separate inference of the state given the measurements from control design given the estimated state lewis2012optimal.
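As an illustration of a constant-complexity sufficient representation in the linear-Gaussian case, the following is a minimal sketch of a scalar Kalman filter. The model coefficients `a`, `c` and the noise variances `q`, `r` are hypothetical values chosen for the example, not taken from the text. The point is that the pair `(mean, var)` is all that is carried from one step to the next, however long the measurement history grows.

```python
def kalman_step(mean, var, y, a=0.9, c=1.0, q=0.1, r=0.5):
    """One predict/update step for the scalar model
    state_{t+1} = a * state_t + process noise (variance q),
    y_t         = c * state_t + measurement noise (variance r).
    All coefficients are illustrative assumptions."""
    # Predict: propagate the current posterior through the dynamics.
    mean_pred = a * mean
    var_pred = a * a * var + q
    # Update: correct the prediction using the new measurement y.
    gain = var_pred * c / (c * c * var_pred + r)
    mean_new = mean_pred + gain * (y - c * mean_pred)
    var_new = (1.0 - gain * c) * var_pred
    return mean_new, var_new


# The representation is the pair (mean, var); past data are discarded.
mean, var = 0.0, 1.0
for y in [0.8, 1.1, 0.9, 1.3]:
    mean, var = kalman_step(mean, var, y)
```

After the loop, `(mean, var)` summarizes all four measurements in two numbers, and the same two numbers would be stored after a thousand measurements.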

Such a separation principle has served the practitioner well over the years, but has left us with few tools for when the underlying assumptions are not satisfied: What if noise is not additive or Gaussian? What if the “true state” is high- or even infinite-dimensional? What if the data is also high-dimensional and almost all of it irrelevant for the control task?

Unfortunately, these are conditions the practitioner of Robotics and Autonomous Systems faces routinely: The task may include navigation in an unknown environment populated by objects whose shape is described by (infinite-dimensional) surfaces and reflectance functions; the data may include images with millions of channels (pixels), predicting most of which is irrelevant to the navigation task; nuisance factors may include occlusion, changes of illumination and pose, which are far from white, zero-mean Gaussian “noise.” Does a separation principle exist for this kind of scenario? Is it still possible to infer a bounded-complexity function of the data, that can be stored in memory in lieu of the data, with no information loss?

To be sure, there have been many attempts to answer these questions. In sufficient dimensionality reduction chiaromonte2002sufficient ; shyr2010sufficient one aims to identify small-dimensional statistics (e.g., projections) that summarize the data. Similarly, the classical use of invariants in image analysis is to remove redundancy from the data by mapping it to the quotient space, which for images can be a thin set sundaramoorthiPVS09. These approaches have had limited impact in Robotics and Autonomous Systems. Sufficient reductions are either too restrictive (linear projections) or too hard to compute and difficult to use. But what if we go the opposite way of dimensionality reduction? What if we instead increase the dimensionality beyond that of the data, already in the millions? What if, instead of computing statistics (deterministic functions of the data), we allow representations to be arbitrary stochastic functions?

At least for simpler tasks, such as classification krizhevsky2012imagenet or control in a finite setting, deep neural networks (DNNs) with hundreds of millions of parameters have shown remarkable empirical success. Can we leverage this success to infer representations of time series specifically for filtering and control tasks? Is there a theoretical framework that explains why large networks would work well for control?

Using a large network seems ill-advised at first: The bias-variance dilemma cover2012elements states that as we increase the complexity of a model inferred from finitely sampled data, its ability to capture the underlying distribution degrades, a phenomenon referred to as overfitting that seems at odds with the empirical success of DNNs zhang2016understanding. However, if the network complexity is measured by information content, rather than dimension, a well-trained DNN for classification faithfully obeys the bias-variance tradeoff achille2017emergence, and relaxing representations to be stochastic functions has the double advantage of simplifying the computation of information quantities and the analysis of the properties of the resulting representations.

In this paper, we study representations for Robotics and Autonomous Systems, using tools from statistics and information theory, and deep neural networks as the class of functions implementing them (realizations). We first review the simple case of a model with trivial dynamics, to introduce the machinery of deep neural networks, and then extend it to the dynamic case.

### 1.1 Outline of the Paper

In Section 2, we introduce the defining properties of representations and formalize the notions of sufficiency, minimality, invariance and separation. Since representations that are minimal and sufficient do not exist in general in finite dimensions jeffreys1960extension , we start from the posterior, which is minimal-sufficient bahadur1954sufficiency but infinite-dimensional, and frame the problem of learning representations as an approximation problem where complexity is modulated explicitly in the Information Bottleneck Lagrangian (IBL). This is a cost functional to be minimized with respect to a class of functions in a sufficiently rich set.

Deep neural networks (DNNs) are universal approximants in the limit where the number of parameters goes to infinity, like many other function classes. However, they enjoy a peculiar coupling between the model parameters and the properties of the learned representation that makes them better than most when we want the representation to be invariant to nuisance variability and to have components that are maximally independent (“disentanglement”). In Section 4 we give a succinct introduction to DNNs, to the extent needed to follow the rest of the paper.

In Section 3 we present a series of core results that explain in what sense deep networks are approximations of optimal representations for static systems achille2016information . Specifically, through the use of the IB Lagrangian, we formalize the trade-off between the complexity of the data representation, and the error we commit when we use this representation to solve the task in lieu of the original data.

However, at first sight the IB Lagrangian does not address two important properties of a representation, invariance and disentanglement, which we would then have to deal with separately. Nonetheless, we show that, given sufficiency, invariance is equivalent to minimality. We also show that the IBL is equivalent to the cross-entropy loss typically used for classification tasks in deep learning except for an added regularizer, thus creating an important link between the (information-theoretic) optimal representations and Deep Learning practice. Furthermore, it has been shown that some heuristic methods used in optimizing deep networks (stochastic gradient descent, Dropout and its variants) approximate this regularizer achille2017emergence. We then show that, somewhat counter-intuitively, stacking layers of neural networks increases minimality of the representation, and therefore invariance. This is tied to the architecture of deep networks and partly explains their success.

Architecture design is also critical in coupling the optimization process where the IBL is minimized with respect to the parameters of the network (weights), with the desirable properties of the resulting representations (activations), which we outline in Section 2. Specifically, in Section 5 we describe an inequality that links the activations of a deep network (a representation of the test datum) and its weights (a representation of the training set). This duality also sheds light on the generalization properties of DNNs.

In Section 6 we extend the model to dynamical systems. By explicitly introducing a task variable, which is in general separate from the data, we open the possibility for drastically more efficient representations than those sufficient for future data prediction, while still learning end-to-end with a simple filter.

In Section 8 we discuss some properties and limitations of the proposed representation. Specifically, we discuss the limitations of the Markovianity assumptions in classical models, and how the proposed model partially overcomes them by trading Markovianity off against complexity costs.

### 1.2 Related work

Deep Learning is impacting many areas of engineering and science, including time series forecasting koren2010collaborative, showing promise especially when some of the most common hypotheses underlying conventional methods, such as Markovianity, are not satisfied wu2017recurrent. Several have studied extensions of classical Bayesian filtering, including using neural networks, but the resulting approaches had drawbacks: (a) the complexity of the update rule, which in the classical Bayesian setting requires computing the posterior of the data given the hidden state, is problematic for high-dimensional data types such as images; (b) the only task considered is prediction of the data, which is usually overkill when the actual task is, say, control: One does not need to model the complex reflectance properties of the world to decide whether to steer a vehicle to the right. Finally, (c) existing methods do not allow an explicit trade-off between the complexity of the hidden state and the quality of the prediction error. Notice, however, that Variational Bayesian methods can be used as a partial solution to the first problem krishnan2015deep ; raiko2009variational, as long as the function class covers the underlying data distribution, which remains an open problem for natural images.

Among other methods, the Deep Kalman filter krishnan2015deep assumes the existence of a Gaussian latent state and non-linear transformations that explain the observations, and uses a variational auto-encoder (VAE) to infer such a model. This choice is restrictive, as the only task allowed is the reconstruction of the measurements. Also, the method focuses on batch system identification, whereas we are interested in a causal, on-line scheme.

More directly related to our approach, langford2009learning suggest that, rather than finding a (generally complex) hidden state, we could focus on finding a statistic of the past data that separates past data from future predictions. Such a statistic can be learned and updated without using Bayes’ rule, thereby avoiding the complex computations of the data posterior. However, their analysis is restricted to linear models, and the task is restricted to the prediction of future data.

Constructing deterministic functions of the data (statistics) that separate the past from the future requires $O(n^2)$ (embedding) dimensions (the mean and covariance maintained by a Kalman filter jazwinski2007stochastic, where $n$ is the dimension of the state space); however, as we will show, a stochastic representation can make do with $n$ dimensions, at the cost of maintaining samples from the distribution, as in a particle filter. Sigma-point filters wan2000unscented are deterministic sample-based representations that fall in between these two cases. In this case as well, the task is prediction of the measurements. We allow the task to be more general, including the case whereby one does not care to be able to reproduce every channel of the measurements (e.g., the color of every pixel in the image) but instead cares only for a small projection or quotient of the data with respect to the action of nuisances. Our model also allows even more flexibility relative to the strong assumption of Markovianity implicit in the classical filtering equations. The inferred state can be thought of as a separator but only for the measurements, as opposed to a more general task, which makes the problem less tractable than in our case, as the statistics that matter for control typically have far smaller complexity than the data. Also, the proofs provided only apply to minimal realizations. In our model we trade off Markovianity and complexity, which is not contemplated in the classical filtering equations.

The trade-off we seek between complexity of the representation and sufficiency for future prediction is also closely related to the Minimum-Information LQG Control of fox2016minimum , which explicitly accounts for the agent having a representation of bounded complexity. fox2016minimum only addresses the linear-quadratic Gaussian case, but gives a complete account of that. Similarly tiomkin2017control ; rubin2012trading deal with capacity costs, but in continuous time. The general principles are laid out in fox2016principled .

The theory we describe here emphasizes that, in order to obtain efficient representations of the data, we should focus on a specific task, such as control, rather than predicting high-dimensional future data. dosovitskiy2016learning assume that there is a low-dimensional vector of “measurements” separate from the actual measured data, which can be easily obtained, and on which the control loss depends linearly. In our parlance, predicting the future is a task sufficient for control, and therefore allows us to learn a sufficient representation for control. In particular, optimal control reduces to minimum-prediction error, and one can simply train a network to predict future measurements conditioned on the current policy and a given control action taken at present. Using this technique for control shows state-of-the-art performance on video games.

Finally, we study the role played by the quantity of information that the observed data contain about the parameters of the system. This quantity is considered in houthooft2016variational, which uses a variational approximation similar to ours to measure the information content in the data using a neural network, and then learns a control policy for exploration that maximizes this information quantity. In this case the model is a constant parameter, akin to an assumption of time-invariance and equivalent to Markovianity.

### 1.3 Preliminaries

We denote the history of a process $y$ from time $t_1$ to $t_2$ by $y_{t_1}^{t_2}$, where we omit the subscript when the history starts at the initial time. Thus, $y^t$ denotes the measured data up to time $t$, while $z_t$ denotes the quantity of interest (task) at that time, which could be the value of the measurements at a future time. We will consider trivial dynamics at first, with each pair $(y_t, z_t)$ independent and identically distributed (i.i.d.) from an unknown density $p(y, z)$.

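We denote the expectation of $f$ with respect to the measure $p$ by $\mathbb{E}_p[f]$, Shannon’s entropy by $H(x)$, conditional entropy by $H(x|y)$, (conditional) mutual information by $I(x;y|z)$, the Kullback-Leibler (KL) divergence by $\mathrm{KL}(p \,\|\, q)$, cross-entropy by $H_{p,q}(x)$, and total correlation by $\mathrm{TC}(x)$, defined as

$$\mathrm{TC}(x) = \mathrm{KL}\Big(p(x) \,\Big\|\, \prod_i p(x_i)\Big),$$

where $p(x_i)$ are the marginal distributions of the components of $x$; $\mathrm{TC}(x)$ is zero if and only if the components of $x$ are independent, in which case we say that $x$ is disentangled. We make use of the following identity:

$$I(x;y) = \mathbb{E}_{y \sim p(y)}\, \mathrm{KL}\big(p(x|y) \,\|\, p(x)\big).$$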
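The identity above and the definition of total correlation can be checked numerically on a small discrete example. The sketch below uses hypothetical 2×2 joint tables and helper names (`kl`, `marginals`, and so on) introduced only for illustration; for a two-component variable, total correlation coincides with the mutual information between the two components.

```python
import math


def kl(p, q):
    """KL divergence (in nats) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def marginals(joint):
    """Row and column marginals of a joint table joint[i][j]."""
    rows = [sum(r) for r in joint]
    cols = [sum(c) for c in zip(*joint)]
    return rows, cols


def mutual_information(joint):
    """I(x;y) computed directly as KL(p(x,y) || p(x)p(y))."""
    px, py = marginals(joint)
    flat_joint = [p for row in joint for p in row]
    flat_prod = [a * b for a in px for b in py]
    return kl(flat_joint, flat_prod)


def mutual_information_via_identity(joint):
    """I(x;y) computed as E_{y ~ p(y)} KL(p(x|y) || p(x))."""
    px, py = marginals(joint)
    total = 0.0
    for j, pyj in enumerate(py):
        cond = [joint[i][j] / pyj for i in range(len(px))]  # p(x | y = j)
        total += pyj * kl(cond, px)
    return total


def total_correlation(joint):
    """TC of a two-component variable (x_1, x_2) with table joint[x_1][x_2]."""
    return mutual_information(joint)


dependent = [[0.4, 0.1], [0.1, 0.4]]        # components are correlated
independent = [[0.35, 0.15], [0.35, 0.15]]  # product of its marginals

print(mutual_information(dependent), mutual_information_via_identity(dependent))
print(total_correlation(independent))
```

The two mutual-information routines agree on the dependent table, and the total correlation of the product table vanishes up to rounding, which is the sense in which that variable is disentangled.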

We say that $x, y, z$ form a Markov chain, indicated with $x \to y \to z$, if $p(z|x,y) = p(z|y)$. The Data Processing Inequality (DPI) for a Markov chain $x \to y \to z$ ensures that $I(x;y) \ge I(x;z)$.

We define a nuisance to be any random variable $n$ that affects the observed data $y$, but is not related to the task, $I(n;z) = 0$, or equivalently $n \perp z$. Similarly, we say that the representation $x$ is invariant to the nuisance $n$ if $x \perp n$, or $I(x;n) = 0$. When $x$ is not strictly invariant but minimizes $I(x;n)$ among all sufficient representations, we say it is maximally insensitive to $n$. It can be shown achille2017emergence that the data can always be written as a function of the task and of all nuisances affecting it. Specifically, given a joint distribution $p(y,z)$ where $z$ is a discrete random variable, we can always find a random variable $n$ independent of $z$ such that $y = f(z,n)$, for some deterministic function $f$.

Given random variables $x$, $y$, and $z$ with joint density $p(x,y,z)$, we say that $x$ is sufficient of $y$ for the task $z$ if we have the Markov chain $y \to x \to z$, i.e., if $p(z|x,y) = p(z|x)$. We say instead that the posterior of $x$, $p(z|x)$, is sufficient of $y$ for $z$ if . Notice that the posterior of a sufficient representation is in turn always sufficient, since . The converse does not hold in general, but holds in the important case in which $x$ is a deterministic function of $y$. In this work we will focus on the general case of sufficient posteriors and abuse the notation to refer to both the random variable and the posterior as being sufficient. (An equivalent characterization using conditional expectations is to say that $x$ is sufficient of $y$ for $z$ if for any measurable function , and similarly the posterior is sufficient if for any .)

## 2 Desiderata for representations

We call a representation $x$ of the data $y$ any stochastic function of $y$. Ideally, we would like $x$ to be (a) sufficient for the task $z$, that is, all the information that $y$ contains about the task should also be contained in $x$, or $I(x;z) = I(y;z)$. In order to not squander resources, the representation should also be (b) minimal, that is, $I(x;y)$ is smallest among all sufficient representations $x$. Note that we are defining “small” in terms of information content, not dimension, of $x$. Moreover, we would like it to be (c) invariant to a nuisance $n$, or, if that is not possible, at least maximally insensitive to it, i.e., $I(x;n)$ is minimized. Note that we require invariants to be uninformative, not constant, with respect to nuisance variability. We impose no requirement on identifiability and harbor no hope of uniqueness of representations. However, to facilitate their use, we do wish for the components of $x$ to be (d) maximally disentangled, that is, we want $\mathrm{TC}(x)$ to be minimized.

The first two properties are satisfied by any minimal sufficient representation, which can be found by solving

$$\min_{p(x|y)} \; I(y;x) \quad \text{s.t.} \quad H(z|x) = H(z|y)$$

or minimizing the corresponding Information Bottleneck Lagrangian (IBL) tishby2000information :

 L=H(z|x)cross-entropy+ βI(x;y).regularizer (1)

The IBL trades off sufficiency and minimality, regulated by $\beta$, and can be optimized efficiently when $p(x|y)$ is parametrized by a neural network achille2016information ; alemi2016deep. However, we are also interested in the other properties, invariance and disentanglement, that are not explicit in the IBL and are the focus of the next section.
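To make the trade-off concrete, here is a minimal sketch that evaluates the IB Lagrangian exactly for small discrete variables. The joint table `p_yz`, the two candidate encoders, and all function names are illustrative assumptions; in practice the encoder would be a neural network and the terms would be estimated rather than enumerated.

```python
import math


def conditional_entropy(joint_ab):
    """H(b|a) in nats for a joint table joint_ab[a][b]."""
    h = 0.0
    for row in joint_ab:
        pa = sum(row)
        for pab in row:
            if pab > 0:
                h -= pab * math.log(pab / pa)
    return h


def mutual_information(joint_ab):
    """I(a;b) in nats for a joint table joint_ab[a][b]."""
    pa = [sum(row) for row in joint_ab]
    pb = [sum(col) for col in zip(*joint_ab)]
    return sum(
        p * math.log(p / (pa[i] * pb[j]))
        for i, row in enumerate(joint_ab)
        for j, p in enumerate(row)
        if p > 0
    )


def ib_lagrangian(p_yz, encoder, beta):
    """H(z|x) + beta * I(x;y), with p_yz[y][z] and encoder[y][x] = p(x|y)."""
    ny, nz, nx = len(p_yz), len(p_yz[0]), len(encoder[0])
    p_y = [sum(row) for row in p_yz]
    # p(x, y) = p(y) p(x|y)
    p_xy = [[p_y[y] * encoder[y][x] for y in range(ny)] for x in range(nx)]
    # p(x, z) = sum_y p(y, z) p(x|y), since x depends on z only through y
    p_xz = [
        [sum(p_yz[y][z] * encoder[y][x] for y in range(ny)) for z in range(nz)]
        for x in range(nx)
    ]
    return conditional_entropy(p_xz) + beta * mutual_information(p_xy)


p_yz = [[0.45, 0.05], [0.05, 0.45]]   # data y is a noisy copy of the task z
identity = [[1.0, 0.0], [0.0, 1.0]]   # x = y: sufficient, but stores all of y
constant = [[1.0], [1.0]]             # x carries no information at all

for beta in (0.1, 1.0):
    print(beta, ib_lagrangian(p_yz, identity, beta), ib_lagrangian(p_yz, constant, beta))
```

With a small `beta` the sufficient encoder attains the lower Lagrangian; with a large `beta` the information cost dominates and the trivial encoder wins, which is the trade-off that $\beta$ regulates.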

## 3 Learning invariant and disentangled representations

We now present a key result connecting minimality of a representation and invariance to nuisances achille2017emergence .

Let $n$ be a nuisance affecting the data $y$. Then, for any representation $x$ of $y$ we have:

$$\underbrace{I(x;n)}_{\text{invariance}} \le \underbrace{I(x;y)}_{\text{minimality}} - \underbrace{I(y;z)}_{\text{constant}},$$

where the RHS is minimized when $x$ is minimal. Moreover, there always exists a particular nuisance $n$ such that equality holds up to a (generally small) residual $\epsilon$, that is

$$I(x;n) = I(x;y) - I(y;z) - \epsilon,$$

where $\epsilon := I(x;z|n) - I(y;z)$. In particular $0 \le \epsilon \le H(z|y)$, and $\epsilon = 0$ whenever $z$ is a deterministic function of $y$. (Notice that $\epsilon \le H(z|y)$, and usually $H(z|y)$ is zero or very small, so we can generally ignore the extra term.) Under these conditions, a sufficient statistic $x$ is invariant (maximally insensitive) to nuisances if and only if it is minimal.

This result implies that, rather than manually imposing invariance to nuisances in the representation, which is usually difficult, we can construct invariants by simply reducing the amount of information that $x$ contains about $y$, while retaining sufficiency. As we discussed, this can be done by minimizing the IB Lagrangian using a neural network achille2016information.

While we will analyze deep networks in Section 4, this result, together with the Data Processing Inequality, already suggests an advantage in stacking multiple intermediate representations into “layers.” In fact, suppose that we have a Markov chain of representations

$$y \to x_1 \to x_2,$$

such that there is an information bottleneck between $x_1$ and $x_2$, that is, $I(y;x_2) < I(y;x_1)$. Then, if $x_2$ is still sufficient, it is necessarily more minimal, and therefore more invariant to nuisances, than $x_1$. Notice that bottlenecks are easy to create, either by reducing the dimension so that $\dim(x_2) < \dim(x_1)$, or by introducing noise between $x_1$ and $x_2$. This is indeed common practice in designing and training deep networks, which concatenate multiple layers

$$y \to x_1 \to x_2 \to \ldots \to x_L,$$

so that, whenever layer $x_L$ is sufficient of $y$ for $z$ (which is imposed by the training loss), then $x_L$ is more insensitive to nuisances than all the preceding layers.
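The information-reducing half of this argument can be illustrated with a toy cascade in which each "layer" only injects noise. The sketch below assumes a uniform binary input and layers modeled as binary symmetric channels with hypothetical flip probabilities; it shows the information about the input shrinking with depth, and it says nothing about sufficiency, which in a real network is enforced by the training loss.

```python
import math


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def info_through_layers(flip_probs):
    """I(y; x_k) after each layer, for a uniform bit y passed through a
    cascade of binary symmetric channels with the given flip probabilities."""
    infos = []
    total_flip = 0.0
    for f in flip_probs:
        # Two binary symmetric channels in series form another one.
        total_flip = total_flip * (1.0 - f) + (1.0 - total_flip) * f
        infos.append(math.log(2) - binary_entropy(total_flip))
    return infos


infos = info_through_layers([0.1, 0.1, 0.1])
print(infos)  # information about y retained by x_1, x_2, x_3
```

Each entry is strictly smaller than the previous one, as the Data Processing Inequality requires once noise is injected between layers.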

This also relates to the notion of Actionable Information soatto2013actionable , which is : In the special case when is deterministic, a representation that minimizes the IB Lagrangian also maximizes Actionable Information achille2017emergence .

Finally, it has been shown in achille2016information that, when assuming a factorized prior in the activations, the IBL also bounds Total Correlation, so minimizing it yields a representation that trades off sufficiency with complexity, invariance, and disentanglement.

## 4 Learning with Deep Neural Networks

In this section we sketch the very basics of deep learning, first by describing the class of functions realized by deep neural networks, and then the choice of functionals and optimization schemes used for determining their parameters. In the following section we show how this process, despite being agnostic of the desirable properties of the representations outlined in the previous sections, manages to achieve just that by exploiting a peculiar information duality between the weights and the activations of the network.

### 4.1 Function class of DNNs

A deep neural network (DNN) is a parametrized class of non-linear functions obtained by composing multiple layers: Each layer implements a linear transformation of its input, which is the output of the previous layer, followed by a (generally element-wise) non-linearity. Specifically, let $x_0 = y$ be the input data, and let $W_k$, $k = 1, \ldots, K$, be a matrix. Then, we define the “activations” (output) of the $k$-th layer as $x_k = \phi_k(W_k x_{k-1})$, where $\phi_k$ is a nonlinear function. A common choice for the non-linearity is $\phi(x) = \max(0, x)$, also called a Rectified Linear Unit (ReLU). The output of a network with $K$ layers is the function

$$F(y;w) = \phi_K(W_K \phi_{K-1}(W_{K-1} \ldots \phi_1(W_1 y) \ldots)),$$

where $w = \{W_1, \ldots, W_K\}$ is the set of parameters, or weights, of the network. Each $x_k$ can be considered as a representation of the original input $y$, and its components are generally called features (or feature maps, or activations, or responses). By the data processing inequality, each $x_k$ contains no more information than $y$; however, as we will see, in a well-trained network we expect $x_K$ to contain all (and only) the information necessary for the task. Since the output of the network is often a (conditional) probability distribution (e.g., the probability $p(z|y)$ of a label $z$ given the image $y$), the last non-linearity is usually a Softmax non-linearity, $\phi_K(a)_i = e^{a_i} / \sum_j e^{a_j}$, which ensures that the output of the network is positive and sums to one.
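The composition of layers described above can be sketched in a few lines. The weight values and layer sizes below are arbitrary illustrative choices, and the helper names (`relu`, `softmax`, `matvec`, `forward`) are introduced only for this example; real networks use tensor libraries, bias terms, and far larger matrices.

```python
import math


def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]


def softmax(v):
    """Softmax, shifted by the maximum for numerical stability."""
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]


def matvec(W, v):
    """Matrix-vector product with W given as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def forward(y, weights):
    """Return the activations x_1, ..., x_K of a network with ReLU hidden
    layers and a softmax output layer."""
    x = y
    activations = []
    for k, W in enumerate(weights):
        phi = softmax if k == len(weights) - 1 else relu
        x = phi(matvec(W, x))
        activations.append(x)
    return activations


W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]   # maps 2 inputs to 3 features
W2 = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]      # maps 3 features to 2 classes
x1, x2 = forward([1.0, 2.0], [W1, W2])
print(x1, x2)
```

Here `x1` is the hidden representation and `x2` is a probability vector over two classes, playing the role of the network output $F(y;w)$.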

When the input has some particular structure, such as an image, the linear transformation can be chosen to exploit this structure. For example, when is an image, it is common to choose to be a set of convolutions. Networks using convolutional maps, known as convolutional neural networks (CNNs), have the notable property that their features are invariant to translations lecun1990handwritten , and have considerably fewer parameters (the number of parameters depends only on the size of the filters, which are generally small, rather than on the size of the image). Aside from reducing the size of the parameter space, the use of convolutions has a drastic, and not yet fully understood, effect in achieving desirable properties of the networks when operating on imaging data soatto2016visual .

### 4.2 Loss function and optimization

The output of a network is usually interpreted as a probability distribution over the inference target (e.g., the label of an image, the position of an object). Per soatto2016visual , if that approached the true posterior, it would be a minimal sufficient representation.

When $z$ is a discrete random variable, this identification can be done directly by letting the output of the network be a probability vector (or an un-normalized likelihood function). When $z$ is continuous we can choose a family of parametrized distributions, and let the network output the parameters (e.g., mean and variance for a normal distribution). In both cases, we will think of a deep network as a map $y \mapsto q(z|y,w)$ where, absent any system dynamics, the parameters $w$ are constant and usually determined by maximizing the log-likelihood of the observed data, which leads to the cross-entropy loss

$$L(w) = H_{p,q}(z|y,w) = \frac{1}{t} \sum_{i=1}^{t} -\log q(z_i|y_i,w).$$

Notice that the cross-entropy loss can be decomposed as

$$H_{p,q}(z|y,w) = H_p(z|y) + \mathrm{KL}\big(p(z|y) \,\|\, q(z|y,w)\big).$$

Since all terms are non-negative, and only the KL divergence depends on $w$, we can conclude that $L(w)$ is minimized if and only if $q(z|y,w) = p(z|y)$ on all observed samples, giving an alternative justification for the use of this loss. Moreover, in the special case where $q(z|y,w)$ is a normal distribution with fixed variance, the cross-entropy reduces to the usual squared-error ($L_2$) loss for regression.
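The decomposition of the cross-entropy into an entropy term plus a KL divergence can be verified numerically for a single input. The two distributions below are hypothetical values standing in for the true posterior and the network output on one sample.

```python
import math


def cross_entropy(p, q):
    """H_{p,q} = -sum_z p(z) log q(z), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def entropy(p):
    """Shannon entropy, i.e. the cross-entropy of p with itself."""
    return cross_entropy(p, p)


def kl(p, q):
    """KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p_true = [0.7, 0.2, 0.1]    # stands in for p(z|y)
q_model = [0.5, 0.3, 0.2]   # stands in for q(z|y,w)

lhs = cross_entropy(p_true, q_model)
rhs = entropy(p_true) + kl(p_true, q_model)
print(lhs, rhs)
```

The two printed values coincide, and the cross-entropy reaches its floor `entropy(p_true)` only when the model distribution equals the true one.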

Minimization of the loss $L(w)$, and hence determination of the weights $w$, is usually done using stochastic gradient descent (SGD): We start by randomly initializing the parameters glorot2010understanding. Then, at each step $k$, a random subset (“mini-batch”) of size $b$, with $b \ll t$, is sampled from the observed data and we compute the gradient relative to the mini-batch

$$g_k = \nabla_w H_{p,q}\big(z_{i_k+1}^{i_k+b} \,\big|\, y_{i_k+1}^{i_k+b}, w\big) = -\frac{1}{b} \sum_{j=1}^{b} \nabla_w \log q(z_{i_k+j}|y_{i_k+j},w).$$

Since $\mathbb{E}[g_k] = \nabla_w L(w)$, we can see $g_k$ as an unbiased (but high-variance, or “noisy”) estimate of the real gradient of the original loss function with respect to $w$. This can be computed efficiently since it requires computing the gradients on only $b$ samples rather than the whole collection of observed data, which can number in the millions. The weights are now updated using $w_{k+1} = w_k - \eta_k g_k$, where $\eta_k$ is called the learning rate. It is known nesterov2013introductory that when the loss function is strongly convex, the gradients are Lipschitz, and the learning rate decreases as $1/k$, then SGD converges to the global optimum of the loss with convergence rate $O(1/k)$.
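The update rule can be sketched on a deliberately small problem: a one-dimensional logistic model trained on synthetic data. Everything specific here is an assumption made for illustration, including the data generator, the batch size `b`, the step-size schedule `eta0 / (1 + 0.01 k)`, and the random seeds; it is not the training setup of any network discussed in the text.

```python
import math
import random


def make_data(n=400, seed=0):
    """Synthetic pairs (y, z): z is 1 when a noisy copy of y is positive."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.uniform(-2.0, 2.0)
        z = 1 if y + rng.gauss(0.0, 0.5) > 0 else 0
        data.append((y, z))
    return data


def q1(w, y):
    """Model probability q(z=1 | y, w) with weights w = (slope, bias)."""
    return 1.0 / (1.0 + math.exp(-(w[0] * y + w[1])))


def loss(w, data):
    """Empirical cross-entropy: the average of -log q(z_i | y_i, w)."""
    total = 0.0
    for y, z in data:
        p = q1(w, y)
        total -= math.log(p if z == 1 else 1.0 - p)
    return total / len(data)


def sgd(data, steps=500, b=16, eta0=1.0, seed=1):
    """Mini-batch SGD with a step size that decays roughly as 1/k."""
    rng = random.Random(seed)
    w = [0.0, 0.0]
    for k in range(1, steps + 1):
        batch = rng.sample(data, b)          # random mini-batch of size b
        g = [0.0, 0.0]
        for y, z in batch:
            err = q1(w, y) - z               # d(-log q)/d(logit)
            g[0] += err * y / b
            g[1] += err / b
        eta = eta0 / (1.0 + 0.01 * k)
        w = [w[0] - eta * g[0], w[1] - eta * g[1]]
    return w


data = make_data()
w = sgd(data)
print(loss([0.0, 0.0], data), loss(w, data))
```

Each step touches only `b` samples, yet the loss on the full dataset drops well below its initial value of log 2; this problem is convex, so it does not exhibit the local-minimum issues that arise for deep networks.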

There are two main challenges one faces in carrying out this optimization: (i) the loss function is highly non-convex, therefore SGD can get stuck in a sub-optimal local minimum, and (ii) even if a global minimum is found, the parameters could be overfitting the data, meaning that while $w$ minimizes the loss on the observed data, the loss evaluated on unseen (future) data could be much larger.

The first problem (i) is partly addressed by SGD itself: Because of the noise added in the computation of the gradient by SGD, the optimization typically settles on extrema that are close to the global minimum in value. Variants of SGD include using Nesterov’s momentum nesterov2013introductory, which generally yields faster training and improved performance of the network. Other algorithms, like RMSProp and Adam, use the gradient history to reduce the variance in the estimate of the gradient, which is also adapted to the local geometry of the loss function. While in some tasks, such as stochastic optimal control (“Reinforcement Learning”) mnih2015human, these algorithms show drastically improved performance as expected, on image classification and similar tasks the simpler SGD with momentum can still outperform them, suggesting that the noise added by SGD plays an important, positive role in the optimization. There is at present a considerable amount of activity, but a dearth of results, in characterizing the topological and geometric properties of the loss function and designing algorithms that can exploit them to converge to minima that yield good generalization performance, as we discuss in Sect. 5.1. Generalization, or lack thereof (“overfitting”), is the second problem (ii), which we discuss in more detail next.

## 5 Duality and generalization

One of the main problems in optimizing a DNN is that the cross-entropy loss is notoriously prone to overfitting: The loss is small for (past) training data (thus optimization is successful), but large on (future) test data, indicating that the training process has converged to a function that is far from being an optimal representation.

We can gain insight about the possible causes of this phenomenon by looking at the following decomposition of the cross-entropy achille2017emergence :

$$H_{p,q}(z^t|y^t,w) = H_p(z^t|y^t,\theta) + I(\theta;z^t|y^t,w) + \mathrm{KL}\big(q(z^t|y^t,w) \,\|\, p(z^t|y^t,w)\big) - I(z^t;w|y^t,\theta), \tag{2}$$

where . The first term of the right-hand side of (2) relates to the intrinsic error and depends only on ; the second term measures how much of the information that past data contain about the parameter is captured by the weights; the third term relates to the efficiency of the model and the class of functions with respect to which the loss is optimized. The last, and only negative, term relates to how much information about the labels is memorized in the weights, regardless of the underlying data distribution. Absent any intervention, the left-hand side (LHS) of (2) can be minimized by just maximizing the last term, i.e., by memorizing the dataset, which amounts to overfitting and yields poor generalization. Traditional machine learning practice suggests that this problem can be avoided by reducing the complexity of the model, or by regularizing its parameters. However, it has been shown zhang2016understanding that common architectures and regularization methods still allow the network to memorize a dataset.

Memorization can however be prevented by adding the last term back to the loss function, leading to a regularized loss , where the negative term on the RHS is canceled. However, computing, or even approximating, the value of is at least as difficult as fitting the model itself.

To overcome this problem, consider , the collection of all past data that we are using to infer the model parameters . Notice that to successfully learn the distribution , we only need to memorize in the information about the latent parameters , that is we need , which is bounded above by a constant. On the other hand, to overfit, the term needs to grow linearly with the number of training samples . We can exploit this fact to prevent overfitting by adding a Lagrange multiplier to make the amount of information constant with respect to , leading to the regularized loss function

$$\mathcal{L}(p(w \mid \mathcal{D})) = H_{p,q}(z \mid y, w) + \beta\, I(w; \mathcal{D}), \tag{3}$$

which is, remarkably, the same IB Lagrangian in (1), but now interpreted as a function of rather than . Under appropriate assumptions on the form of the posterior , the term can be computed in closed form, and we can optimize eq. 3 efficiently kingma2015variational ; achille2017emergence .
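As an illustration of how the regularized loss (3) can be evaluated, here is a minimal sketch under assumptions that are ours and not necessarily those of the cited works: the posterior over the weights is a diagonal Gaussian, the prior is a standard normal, the model is a linear softmax classifier, and the information term is replaced by the closed-form KL divergence between posterior and prior, which upper-bounds the mutual information between weights and dataset when averaged over datasets. All function names are hypothetical.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats, in closed form."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy (nats) between integer labels and softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def ib_weight_loss(mu, log_var, y, z, beta, rng, n_samples=8):
    """Monte Carlo estimate of cross-entropy + beta * KL for the model z ~ softmax(y @ w)."""
    ce = 0.0
    for _ in range(n_samples):
        # Sample weights from the (assumed Gaussian) posterior via reparametrization.
        w = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
        ce += softmax_cross_entropy(y @ w, z)
    return ce / n_samples + beta * gaussian_kl_to_standard_normal(mu, log_var)

# Usage on random data (4 input features, 3 classes).
rng = np.random.default_rng(0)
y = rng.standard_normal((32, 4))
z = rng.integers(0, 3, size=32)
mu, log_var = np.zeros((4, 3)), np.zeros((4, 3))
print(ib_weight_loss(mu, log_var, y, z, beta=0.1, rng=rng))
```

With `beta = 1` and a summed rather than averaged cross-entropy this reduces to the usual negative variational bound; other values of `beta` change how strongly information stored in the weights is penalized.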

Thus, as we have seen, the IB Lagrangian emerges as a natural criterion both for inferring a representation of the test datum that is sufficient and invariant (with no explicit notion of overfitting), and for inferring a representation of the training dataset (past data) that avoids overfitting (with no explicit notion of invariance). A natural question, which we address in Section 5.2, is whether, and how, these two representations, and their corresponding IB Lagrangians, are related to each other.

An alternative approach to the generalization problem is to use a Bayesian framework, and to find a posterior distribution over the weights that maximizes the marginal log-likelihood of the data, subject to a given prior . Rather than optimizing directly, which would be computationally expensive, one can maximize the Variational Lower Bound (VLBO)

$$\log p(z^t \mid y^t) \ge -\Big[H_{p,q}(z^t \mid y^t, w) + \mathrm{KL}\big(q(w \mid \mathcal{D})\,\|\,p(w)\big)\Big].$$

The IB Lagrangian eq. 3 can be seen as a generalization of Bayesian learning, where we have increased flexibility in selecting the regularizer by the added multiplier $\beta$.

### 5.1 Information, generalization, and flat minima

Thus far we have suggested that adding the explicit information regularizer $I(w;\mathcal{D})$ prevents the network from memorizing the dataset and thus avoids overfitting, which is also confirmed empirically in achille2017emergence . However, networks are not commonly trained with this regularizer, which seemingly undermines the theory. Yet even when not explicitly controlled, $I(w;\mathcal{D})$ is implicitly regularized by the use of SGD. In particular, empirical evidence suggests that SGD biases the optimization toward “flat minima”: local minima whose Hessian has mostly small eigenvalues dinh2017sharp . These minima can be interpreted exactly as having low information $I(w;\mathcal{D})$, as suggested early on by hochreiter1997flat . As a consequence of the previous claims, flat minima can then be seen as having better generalization properties.

More precisely, let $\hat{w}$ be a local minimum of the cross-entropy loss $H_{p,q}$, and let $H$ be the Hessian at that point. Then, under suitable assumptions on the form of the posterior, for the optimal choice of the posterior parameters we have achille2017emergence :

$$I(w;\mathcal{D}) \le \frac{1}{2} K \Big[\log \|\hat{w}\|_2^2 + \log \|H\|_* - K \log\big(K^2 \beta / 2\big)\Big],$$

where $K = \dim(w)$ and $\|\cdot\|_*$ denotes the nuclear norm of the matrix. Therefore the information in the weights is upper-bounded by an increasing function of the nuclear norm (and hence the “flatness”) of the Hessian. Notice that the converse, that is, that low information implies flatness, need not hold, so sharp minima can in principle generalize well, as shown by dinh2017sharp .
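The dependence of the bound on the Hessian can be checked numerically. The sketch below transcribes the right-hand side of the inequality above literally and compares a "flat" and a "sharp" minimum with the same weights; the function name and the two toy Hessians are assumptions made for illustration.

```python
import numpy as np

def information_bound(w_hat, hessian, beta):
    """Right-hand side of the bound: (K/2) [log ||w||_2^2 + log ||H||_* - K log(K^2 beta / 2)]."""
    K = w_hat.size
    nuclear_norm = np.linalg.norm(hessian, ord="nuc")  # sum of singular values
    return 0.5 * K * (
        np.log(np.sum(w_hat ** 2))
        + np.log(nuclear_norm)
        - K * np.log(K ** 2 * beta / 2.0)
    )

w_hat = np.array([1.0, 2.0, 2.0])
flat = 0.1 * np.eye(3)    # small curvature in every direction
sharp = 10.0 * np.eye(3)  # large curvature in every direction

print(information_bound(w_hat, flat, beta=0.1))
print(information_bound(w_hat, sharp, beta=0.1))
```

With everything else fixed, the bound grows with the logarithm of the nuclear norm, so the flatter minimum receives the smaller bound.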

In the next section we show that the quantity of information on the weights is connected not only to the geometry of the loss function, but also to the minimality (invariance) and disentanglement of the activations. In particular, this shows that weight regularization, whether implicit (SGD) or explicit (IB Lagrangian), biases the optimization towards good representations.

### 5.2 Duality of the representations

The core link between information in the weights, and hence flatness of the local minima, minimality of the representation, and disentanglement can be described by the following proposition from achille2017emergence :

Let be a single layer of a network. Under suitable hypotheses on the form of , we can find a strictly increasing function $g(\alpha)$ such that we have the uniform bound

$$g(\alpha) \le \frac{I(y;x) + TC(x)}{\dim(x)} \le g(\alpha) + c,$$

where , and is related to by . In particular, is tightly bounded by and increases strictly with it.

Using the Markov property of the layers, we can now easily extend this bound to multiple layers. Let for be weight matrices, and let , where and is any nonlinearity. Under the same assumptions as the previous result, one can prove

 I(xL;y)≤mink

where .

Together with the results of Section 3, this implies that regularized networks containing low information in the weights automatically learn a representation of the input which is both more invariant to nuisances and more disentangled. Moreover, by Section 5.1, SGD is biased toward such representations.

This result is important because it establishes connections between the weights, which are a representation of past data, given and used to optimize a loss function that knows nothing about sufficiency, minimality, invariance and disentanglement, and the representations of future data, which turn out to have precisely those properties. Such connections are peculiar to the class of functions implemented by deep neural networks and do not apply to any generic function class.

Finally, we have all the elements to extend the notion of representation, and the optimization involved in inferring it (which encompasses system identification and filtering) to a dynamic setting.

## 6 Representing time series

In this section we consider the case where the data are not drawn i.i.d. from a distribution with constant underlying parameters. Instead, we assume that the representation can evolve over time according to a probability law that itself does not change over time. More details can be found in achille2017deep .

### 6.1 Hidden state dynamic model

Many standard models for filtering and control assume the existence of a hidden state which evolves following a Markov process through some state transition probability , where we made the dependency on the control action explicit. The observations are sampled from the hidden state with some distribution , as described by the following graphical model:

The fundamental assumption of this model is that there is a random variable of bounded complexity, the state , that separates new observations from all past ones . The advantage of having such a variable is apparent in the classical filtering equations:

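$$\begin{aligned}
p(x_{t+1} \mid y^{t+1}, u^t) &\propto p(y_{t+1} \mid x_{t+1}) \int p(x_{t+1} \mid x_t, u_t)\, p(x_t \mid y^t, u^{t-1})\, dx_t, && (4) \\
p(y_{t+1} \mid y^t, u^t) &= \int p(y_{t+1} \mid x_{t+1})\, p(x_{t+1} \mid y^t, u^t)\, dx_{t+1}. && (5)
\end{aligned}$$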

Here, all the information about the past data is contained in the (relatively small) posterior . In this sense, the posterior is sufficient for the state update, i.e., for computing , and for the prediction of the data, i.e., computing .
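For finite state, action and observation sets the integrals in eqs. (4) and (5) become sums, and one filtering step can be written in a few lines. The following is a minimal sketch; the array layout, the function name and the numbers in the usage example are assumptions made for illustration.

```python
import numpy as np

def filter_step(belief, y_next, u, trans, emit):
    """One step of the classical filtering equations for finite alphabets.

    belief[i]      = p(x_t = i | y^t, u^{t-1})
    trans[u][i, j] = p(x_{t+1} = j | x_t = i, u_t = u)
    emit[j, y]     = p(y_{t+1} = y | x_{t+1} = j)
    """
    predicted = belief @ trans[u]         # p(x_{t+1} | y^t, u^t)
    data_pred = predicted @ emit          # eq. (5): p(y_{t+1} | y^t, u^t) for every y
    unnorm = emit[:, y_next] * predicted  # eq. (4) before normalization (Bayes' rule)
    return unnorm / unnorm.sum(), data_pred

trans = np.array([[[0.9, 0.1], [0.2, 0.8]],   # action 0
                  [[0.5, 0.5], [0.5, 0.5]]])  # action 1
emit = np.array([[0.8, 0.2],
                 [0.3, 0.7]])
belief = np.array([0.5, 0.5])

belief, data_pred = filter_step(belief, y_next=1, u=0, trans=trans, emit=emit)
print(belief, data_pred)
```

Note that the normalizing constant in eq. (4) is exactly the predictive probability of the new observation from eq. (5), which is why the classical filter needs a model of the data.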

In a hidden Markov model or Kalman filter, the transitions are assumed to be linear, and the state and observations either Gaussian or discrete. In these cases, the posterior can be updated easily and there are efficient algorithms to infer the model parameters of the system. However, for many real problems the integrals in the filtering equations are not tractable since the transition operator is often non-linear. In this case the complexity of updating the posterior may grow exponentially bar-shalomF87 . Furthermore, the data generating distribution is difficult to compute, or even to approximate. Finally, while we can always artificially ignore long-term dependencies and consider the system Markovian by augmenting the state , the resulting state may be too complex to handle.

While there is no obvious solution to these problems in general, it is often the case that we are not interested in predicting the data, but only the control action, which can be quite different from, and far lower-dimensional than, the data. In the next section we see that this can guide the design of efficient filters.

### 6.2 Separating representation

Rather than explicitly looking for a Markovian state that can generate the observed data , i.e., inferring representations for prediction of the data, we focus on finding a representation (proxy state) to predict a task variable , for instance a control input, which is generally far lower-dimensional than the data, and that allows causal and recursive posterior update using only the latest measurements. In this sense, this section is about inferring representations for control.

Motivated by eq. 4, we define the variable through its posterior distribution , where we use $q$ to distinguish our model distribution from the (unknown) data distribution $p$, and we require that it satisfies the following:

1. prediction: the posterior of is sufficient of and for , that is, for each , we have

   $$p(z_{t+k} \mid y^t, u^{t+k-1}) = \int q(z_{t+k} \mid x_t, u_t^{t+k})\, q(x_t \mid y^t, u^{t-1})\, dx_t;$$

2. update: the posterior of is sufficient of and for

   $$q(x_{t+1} \mid y^{t+1}, u^t) = \int q(x_{t+1} \mid x_t, y_{t+1}, u_t)\, q(x_t \mid y^t, u^{t-1})\, dx_t.$$

Note that, like the classical filtering equations, this density propagation is exact. However, unlike the filtering equations, we can directly learn the transition probability rather than use Bayes’ rule, and therefore there is no need to compute the posterior , which is generally intractable for high-dimensional and complex data such as natural images. The separator, in this case, is not the random variable , but the posterior density , which is in general infinite-dimensional. This model allows us to explicitly modulate complexity of the representation with the fidelity of the separation.

[Kalman filter] The method we propose should reduce, in the linear Gaussian case, to the Kalman filter. Indeed, it does so in two different ways. First, let be the state of a linear time-invariant Gaussian state-space model, and let the task be one-step prediction, . Now, let be a random variable such that , where is the posterior computed by the Kalman filter, and let . Then, trivially, , so the posterior computed by the Kalman filter is sufficient for predicting future measurements. Moreover, by letting , we see that is also sufficient for the update. Therefore the posterior computed by the Kalman filter satisfies both the prediction (1) and update (2) conditions. Notice, however, that this is not the only option. Instead, let be the mean and covariance of the innovation computed by the Kalman filter. Then is a deterministic sufficient statistic (function of the past ). Notice that, in this case, the dimension of the representation is larger and the update equation is given by the more complex Riccati equation. So, by adopting a deterministic representation, we have had to increase its computational complexity.
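The deterministic representation in the linear Gaussian case can be sketched as follows. This is the standard textbook Kalman recursion written for a model $x_{t+1} = A x_t + \text{noise}$, $y_t = C x_t + \text{noise}$; the matrix names `A`, `C`, `Q`, `R` and the scalar usage example are assumptions for illustration, and control inputs are omitted for brevity.

```python
import numpy as np

def kalman_step(mean, cov, y_next, A, C, Q, R):
    """Update the Gaussian posterior N(mean, cov) of the state with one new measurement."""
    # Prediction through the linear dynamics.
    mean_pred = A @ mean
    cov_pred = A @ cov @ A.T + Q
    # Innovation covariance and Kalman gain.
    S = C @ cov_pred @ C.T + R
    gain = cov_pred @ C.T @ np.linalg.inv(S)
    # Measurement update; the covariance recursion is the (discrete-time) Riccati equation.
    mean_new = mean_pred + gain @ (y_next - C @ mean_pred)
    cov_new = (np.eye(len(mean)) - gain @ C) @ cov_pred
    return mean_new, cov_new

# Scalar example: static state, unit measurement noise, unit prior variance.
A = C = np.array([[1.0]])
Q, R = np.array([[0.0]]), np.array([[1.0]])
mean, cov = kalman_step(np.array([0.0]), np.array([[1.0]]), np.array([2.0]), A, C, Q, R)
print(mean, cov)
```

The pair `(mean, cov)` is a finite-dimensional, deterministic function of the past measurements, and it is all that needs to be stored between time steps.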

While in eq. 4 we need to use the prediction probability to update the posterior, which is not tractable when the data is high-dimensional, using conditions (1) and (2) we have the simple iterative update rules:

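$$\begin{aligned}
q(z_t \mid y^t, u^t) &= \int q(z_t \mid x_t)\, q(x_t \mid y^t, u^t)\, dx_t, \\
q(x_{t+1} \mid y^{t+1}, u^{t+1}) &= \int q(x_{t+1} \mid x_t, y_{t+1}, u_{t+1})\, q(x_t \mid y^t, u^t)\, dx_t.
\end{aligned}$$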

Unlike eq. 4, these update equations only involve distributions over and , which are assumed to have lower effective dimension than the data , or at least have a simpler distribution (e.g., discrete or Gaussian).
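To illustrate why these updates avoid Bayes' rule, the sketch below uses a finite proxy state and a directly parametrized transition kernel that is conditioned on the new measurement. The random kernel stands in for a trained network; its shape, the function names and the toy sequence of measurements and actions are all assumptions for illustration.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def update_posterior(post, y_next, u, transition):
    """Posterior over x_{t+1}: sum_i transition[u, y_next, i, :] * post[i].

    transition[u, y, i, j] = q(x_{t+1} = j | x_t = i, y_{t+1} = y, u)
    Each row already sums to one, so neither a likelihood p(y | x)
    nor a normalizing constant has to be computed.
    """
    return post @ transition[u, y_next]

def predict_task(post, readout):
    """Posterior over the task: sum_i readout[i, :] * post[i], readout[i, k] = q(z_t = k | x_t = i)."""
    return post @ readout

rng = np.random.default_rng(0)
# Hypothetical kernel with 2 actions, 3 observation symbols, 4 proxy states, 2 task values.
transition = softmax(rng.standard_normal((2, 3, 4, 4)))
readout = softmax(rng.standard_normal((4, 2)))

post = np.full(4, 0.25)
for y_next, u in [(0, 1), (2, 0), (1, 1)]:
    post = update_posterior(post, y_next, u, transition)
task_pred = predict_task(post, readout)
print(post, task_pred)
```

Only the low-dimensional posterior is carried forward in time; the measurement enters as a conditioning index of the kernel rather than through a generative model of the data.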

Moreover, notice that if we restrict to be degenerate (i.e., a Dirac delta), so that is a deterministic function of the past history of the measurements, which this framework allows, then the integrals are trivial and all updates can be computed exactly. On the other hand, allowing a more complex form for could drastically simplify the computation of both and , so there is a trade-off between the cost of computing the integrals for and the complexity of the prediction and update rules, as seen in the case of the Kalman filter. More specifically, when the model is linear and the driving input white, zero-mean Gaussian and i.i.d., the posterior is Gaussian. Thus one can consider either the posterior itself or the parameters that represent it (mean and covariance matrix) as the separator, with the latter being a deterministic representation.

While the complexity of a posterior sufficient for the task is generally much smaller than that of what would be required to predict , a representation that satisfies all the required properties may still have a high dimension or high complexity. What we are after is an explicit way to trade off complexity with quality of the representation, represented by its “degree of sufficiency and Markovianity.” As we have already seen in the static case, this trade-off can be expressed by the IB Lagrangian, which is now

$$\mathcal{L} = \frac{1}{T} \sum_{t=1}^{T} \sum_{k=0}^{n} H_{p,q}(z_{t+k} \mid y^t, u^{t+k-1}) + \beta\, I(x_t; y^t, u^t),$$

where $H_{p,q}$ is the cross-entropy between the real data distribution and our model distribution previously defined.
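The cross-entropy part of this Lagrangian can be estimated from a single trajectory by averaging negative log-probabilities. The sketch below assumes a finite task alphabet and a hypothetical container `pred[t][k]` holding the model's predictive distribution for $z_{t+k}$ computed at time $t$; terms that would look beyond the end of the trajectory are skipped, which is our choice and is not specified in the text.

```python
import numpy as np

def n_step_prediction_loss(pred, z, n):
    """Empirical n-step cross-entropy (nats), averaged over the T time steps.

    pred[t][k] : array of probabilities over task values, the model's prediction
                 of z[t + k] made from the data available at time t
    z[t]       : integer task value observed at time t
    """
    T = len(z)
    total = 0.0
    for t in range(T):
        for k in range(n + 1):
            if t + k < T:  # skip predictions past the end of the trajectory
                total += -np.log(pred[t][k][z[t + k]])
    return total / T

# Usage: uniform predictions over two task values, T = 3, n = 1.
z = [0, 1, 1]
uniform = [[np.array([0.5, 0.5]) for _ in range(2)] for _ in range(3)]
print(n_step_prediction_loss(uniform, z, n=1))
```

The information penalty weighted by $\beta$ would be added separately, for instance through a closed-form or variational bound, depending on how the posterior is parametrized.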

[$n$-step prediction loss] Given and , define as above. Then the cross-entropy loss

$$\mathcal{L} = \frac{1}{T} \sum_{t=1}^{T} \sum_{k=0}^{n} H_{p,q}(z_{t+k} \mid y^t, u^{t+k-1})$$

is minimized if and only if the posterior of separates from the past data , meaning that for almost all .

###### Proof.

To simplify the notation, we only consider the case (that corresponds to smoothing), the general case being identical. Recall that , so

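$$\begin{aligned}
\mathcal{L} &= \frac{1}{T} \sum_{t=1}^{T} H_{p,q}(z_t \mid y^t) \\
&= \frac{1}{T} \sum_{t=1}^{T} H_p(z_t \mid y^t) + \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{y^t}\, \mathrm{KL}\big(p(z_t \mid y^t)\,\|\,q(z_t \mid y^t)\big) \\
&\ge \frac{1}{T} \sum_{t=1}^{T} H_p(z_t \mid y^t).
\end{aligned}$$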

Since the degenerate representation trivially reaches the lower bound, for any representation minimizing the loss function we must have for all and a.e. . In particular, for a.e.  we have . ∎
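The decomposition used in the proof, cross-entropy equals entropy plus a non-negative KL term, can be verified numerically for any pair of discrete distributions. The two distributions below are arbitrary values chosen for illustration.

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def cross_entropy_pq(p, q):
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])  # stands in for the data distribution
q = np.array([0.5, 0.3, 0.2])  # stands in for the model distribution

# Cross-entropy splits into an irreducible term and a model-mismatch term.
print(cross_entropy_pq(p, q), entropy(p) + kl_divergence(p, q))
```

The gap between the loss and its lower bound is exactly the KL term, which vanishes only when the model distribution matches the data distribution.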

Notice that it is not the random variable that separates from , i.e. , as it was in the static case. Instead, it is its (posterior) distribution that acts as the separator. However, if the latter is finitely-parametrized, where is a parametrized family of probability distributions and is a function, then , i.e., the parameters of the distributions can be interpreted as a finite-dimensional representation that separates the past data from the task.

Suppose that there exists a separating variable of finite complexity. Then, in the limit $\beta \to 0$, the IB Lagrangian recovers a separating representation.

###### Proof.

Since has finite complexity, as . Therefore, in the limit the minimum of the Lagrangian is exactly

$$\mathcal{L}' = \frac{1}{T} \sum_{t=1}^{T} \sum_{k=0}^{n} H_{p,q}(z_{t+k} \mid y^t, u^{t+k-1}).$$

Therefore any other minimizer of the IB Lagrangian must also minimize , and, by the previous proposition, it must be a separating distribution. ∎

## 7 A separation principle for control

In the previous section we have seen that, given a task (a random variable to predict), it is possible to infer a representation that trades off complexity with sufficiency and Markovianity. We now specialize this program for a control task, so that a controller operating on the representation behaves as if it had access to the entire past history of the data, analogously to the separation principle in linear-quadratic Gaussian (LQG) optimal control. Unlike LQG, however, in general there is no finite-dimensional sufficient statistic, and therefore, following the program above, we seek a representation that trades off complexity with fidelity.

To this end, assume that our control task consists of minimizing a control loss such that

$$R = \sum_{t=1}^{T} r_t(y_t, u_t),$$

where is a possibly stochastic function of the true (global) state of the system and the actions. Notice that even if the system is not Markovian, we can always assume such a global state exists, in the worst case . Notice that LQG and other standard control losses can be written in this form. To simplify, we consider a finite horizon $T$.

We claim that if the posterior of is a sufficient representation of the data for the task , then there exists an optimal control policy which is a function of the posterior alone.

Let be such that the posterior of is sufficient of for , meaning that

$$p(r_{t+k} \mid y^t, u^{t+k}) = \int q(r_{t+k} \mid x_t, u_t^{t+k})\, q(x_t \mid y^t, u^t)\, dx_t.$$

Then, there exists an optimal control policy that minimizes the expected risk and depends on the past data only through .

###### Proof.

Adopting standard reinforcement learning notation, let be the expected value of when following the policy for the last steps given the observation history and action history until now. Define the optimal Q-function .

Recall that, given , the optimal policy is given by . Therefore, to prove that the optimal policy depends only on , it suffices to prove that , i.e. that we can compute the optimal Q-function given alone instead of the whole history . This follows trivially from the fact that

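$$\begin{aligned}
Q^*_{>t}(y^t, u^t, u) &= \min_{u_{t+1}^{T} :\, u_{t+1} = u}\; \sum_{k=0}^{T-t} \mathbb{E}\big[r_{t+k} \mid y^t, u^{t+k}\big] \\
&= \min_{u_{t+1}^{T} :\, u_{t+1} = u}\; \int \Big\{ \sum_{k=0}^{T-t} r_{t+k}\, q(r_{t+k} \mid x_t, u_t^{t+k}) \Big\}\, dq(x_t \mid y^t, u^t).
\end{aligned}$$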

Notice that this proposition does not give an explicit way of learning a policy (since a naïve application would require a brute-force optimization over all possible actions). Rather, the usefulness of the theorem is that it proves that any representation sufficient to predict the rewards , a fairly general condition, is also sufficient for control. In particular, if is a function of some (low-dimensional) measurement , as dosovitskiy2016learning suggests, a representation trained to predict those measurements will also be sufficient for and hence for control. This fact is implicitly exploited in dosovitskiy2016learning to learn a state-of-the-art control policy for complex tasks and high-dimensional data.
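The brute-force optimization mentioned above can be made concrete on a toy problem. In the sketch below the policy receives only the posterior over a finite proxy state, never the raw history, and enumerates all future action sequences. The factorization of the reward model into a per-action transition matrix and a one-step expected cost table, as well as all names and numbers, are simplifying assumptions for illustration and not the construction used in the cited works.

```python
import itertools
import numpy as np

def open_loop_q(belief, first_action, trans, cost, horizon):
    """Smallest expected cumulative cost over action sequences that start with first_action.

    belief[i]      = posterior probability of proxy state i (the separator)
    trans[u][i, j] = probability of moving from proxy state i to j under action u
    cost[i, u]     = expected one-step cost in proxy state i under action u
    """
    n_actions = cost.shape[1]
    best = np.inf
    for rest in itertools.product(range(n_actions), repeat=horizon - 1):
        b, total = belief, 0.0
        for u in (first_action, *rest):
            total += b @ cost[:, u]  # expected cost under the current belief
            b = b @ trans[u]         # propagate the belief under the chosen action
        best = min(best, total)
    return best

def policy(belief, trans, cost, horizon):
    """Greedy action with respect to the brute-force Q-values; a function of the belief alone."""
    q_values = [open_loop_q(belief, u, trans, cost, horizon) for u in range(cost.shape[1])]
    return int(np.argmin(q_values))

trans = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.5, 0.5]]])
cost = np.array([[0.0, 1.0],
                 [1.0, 0.0]])

print(policy(np.array([0.9, 0.1]), trans, cost, horizon=3))
```

The enumeration grows exponentially with the horizon, which is why practical methods approximate the expected loss for a given action directly instead of searching over sequences.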

## 8 Discussion

We have framed the problem of system identification for the purpose of control as that of inferring not deterministic statistics of sufficiently-exciting time series, but rather an approximation of the posterior of the control loss given past measurements. While this is in general infinite-dimensional, universally-approximating function classes, such as neural networks, can be employed in the inference. This yields some nice properties relative to classical Bayesian filtering: First, we do not need to apply Bayes’ rule, and therefore there is no partition function to compute, to the benefit of computational complexity. Second, we do not need to make a strict assumption of Markovianity. Instead, we can explicitly trade off complexity with fidelity of the approximation of the posterior. Such a posterior is the separator that plays a role analogous to that of the state of a Gaussian linear model in classical linear identification.

The good news is that the representations learned by generic stochastic gradient descent, while being agnostic of desirable properties of the resulting representation, end up enforcing them through implicit regularization, as we show for the static case.

Now for a few caveats. First, the representations we aim to infer are optimal when they are as good as the data for the chosen task. This does not mean they are good: If the data is uninformative (or not sufficiently exciting), no guarantee can be made on the quality of the representation, other than that it is sufficient, meaning as good as the data (it can be no more, per the Data Processing Inequality). A completely independent problem is how to get data that is as exciting as possible, which is the problem of Active Learning or Experiment Design. It can be framed as an optimal control problem, which we do not address here.

Second, we are not suggesting that the model we propose is tractable in its most general form, or that training a neural network to minimize the IBL proposed is easy. However, we show that minimizing a simple cross entropy for a particular task (the control loss) leads to a representation which is sufficient for control. One should notice that this approach has strong links with dosovitskiy2016learning , but also with reinforcement learning. Indeed, both can be seen as ways of making the algorithm tractable by directly approximating the expected loss for a given action.

More importantly, this class of tools opens a number of potentially exciting research avenues, both applied – making use of the power of these representations, and implementing efficient algorithms to infer them – and theoretical, as little is known about the properties of these representations and their approximation bounds. This approach promises to re-open a field that has been shackled between the linear case, which is nice and elegant and for which a plethora of results are known, but that has very limited applicability, and the general case, where there is little to say, and little that works in practice.