1.1 The Setting
The field of machine learning has seen enormous advances over the last decade. These advances have been driven by two key elements: (i) the introduction of flexible architectures which have the expressive power needed to efficiently represent the input-output maps encountered in practice; (ii) the development of smart optimization tools which train the free parameters in these input-output maps to match data. The text  overviews the start-of-the-art.
While there is little work in the field of derivative-free, paralellizable methods for machine learning tasks, such advancements are greatly needed. Variants on the Robbins-Monro algorithm , such as stochastic gradient descent (SGD), have become state-of-the-art for practitioners in machine learning  and an attendant theory [14, 3, 70, 48, 37] is emerging. However the approach faces many challenges and limitations [20, 60, 78]. New directions are needed to overcome them, especially for parallelization, as attempts to parallelize SGD have seen limited success .
A step in the direction of a derivative-free, parallelizable algorithm for the training of neural networks was attempted in  by use of the the method of auxiliary coordinates (MAC). Another approach using the alternating direction method of multipliers (ADMM) and a Bregman iteration is attempted in 
. Both methods seem successful but are only demonstrated on supervised learning tasks with shallow, dense neural networks that have relatively few parameters. In reinforcement learning, genetic algorithm have seen some success (see and references therein), but it is not clear how to deploy them outside of that domain.
To simultaneously address the issues of parallelizable and derivative-free optimization, we demonstrate in this paper the potential for using ensemble Kalman methods to undertake machine learning tasks. Optimizing neural networks via Kalman filtering has been attempted before (see  and references therein), but most have been through the use of Extended or Unscented Kalman Filters. Such methods are plagued by inescapable computational and memory constraints and hence their application has been restricted to small parameter models. A contemporaneous paper by Haber et al  has introduced a variant on the ensemble Kalman filter, and applied it to the training of neural networks; our paper works with a more standard implementations of ensemble Kalman methods for filtering and inversion [44, 35] and demonstrates potential for these methods within a wide range of machine learning tasks.
1.2 Our Contribution
The goal of this work is two-fold:
First we show that many of the common tasks considered in machine learning can be formulated in the unified framework of Bayesian inverse problems. The advantage of this point of view is that it allows for the transfer of theory and algorithms developed for inverse problems to the field of machine learning, in a manner accessible to the inverse problems community. To this end we give a precise, mathematical description of the most common approximation architecture in machine learning, the neural network (and its variants); we use the language of dynamical systems, and avoid references to the neurobiological language and notation more common-place in the applied machine learning literature.
Secondly, adopting the inverse problem point of view, we show that variants of ensemble Kalman methods (EKI, EnKF) can be just as effective at solving most machine learning tasks as the plethora of gradient-based methods that are widespread in the field. We borrow some ideas from machine learning and optimization to modify these ensemble methods, to enhance their performance.
Our belief is that by formulating machine learning tasks as inverse problems, and by demonstrating the potential for methodologies to be transferred from the field of inverse problems to machine learning, we will open up new ways of thinking about machine learning which may ultimately lead to deeper understanding of the optimization tasks at the heart of the field, and to improved methodology for addressing those tasks. To substantiate the second assertion we give examples of the competitive application of ensemble methods to supervised, semi-supervised, and online learning problems with deep dense, convolutional, and recurrent neural networks. To the best of our knowledge, this is the first paper to successfully apply ensemble Kalman methods to such a range of relatively large scale machine learning tasks. Whilst we do not attempt parallelization, ensemble methods are easily parallelizable and we give references to relevant literature. Our work leaves many open questions and future research directions for the inverse problems community.
1.3 Notation and Overview
We adopt the notation for the real axis, the subset of non-negative reals, and for the set of natural numbers. For any set , we use to denote its -fold Cartesian product for any . For any function , we use to denote its image. For any subset of a linear space , we let denote the dimension of the smallest subspace containing . For any Hilbert space , we adopt the notation and to be its associated norm and inner-product respectively. Furthermore for any symmetric, positive-definite operator , we use the notation and . For any two topological spaces , we let denote the set of continuous functions from to . We define
the set of
Section 2 delineates the learning problem, starting from the classical, optimization-based framework, and shows how it can be formulated as a Bayesian inverse problem. Section 3 gives a brief overview of modern neural network architectures as dynamical systems. Section 4 outlines the state-of-the-art algorithms for fitting neural network models, as well as the EKI method and our proposed modifications of it. Section 5 presents our numerical experiments, comparing and contrasting EKI methods with the state-of-the-art. Section 6 gives some concluding remarks and possible future directions for this line of work.
2 Problem Formulation
Subsection 2.1 overviews the standard formulation of machine learning problems with subsections 2.1.1, 2.1.2, and 2.1.3 presenting supervised, semi-supervised, and online learning respectively. Subsection 2.2 sets forth the Bayesian inverse problem interpretation of these tasks and gives examples for each of the previously presented problems.
2.1 Classical Framework
The problem of learning is usually formulated as minimizing an expected cost over some space of mappings relating the data [21, 81, 53]. More precisely, let , be separable Hilbert spaces and let be a probability measure on the product space . Let be a positive-definite function and define to be the set of mappings on which the composition is -measurable for all in . Then we seek to minimize the functional
across all mappings in . This minimization may not be well defined as there could be infimizing sequences not converging in . Thus further constraints (regularization) are needed to obtain an unambiguous optimization problem. These are generally introduced by working with parametric forms of . Additional, explicit regularization is also often added to parameterized versions of (1).
Usually is called the loss or cost function and acts as a metric-like function on ; however it is useful in applications to relax the strict properties of a metric, and we, in particular, do not require to be symmetric or subadditive. With this interpretation of as a cost, we are seeking a mapping with lowest cost, on average with respect to . There are numerous choices for used in applications ; some of the most common include the squared-error loss used for regression tasks, and the cross-entropy loss used for classification tasks. In both these cases we often have , and, for classification, we may restrict the class of mappings to those taking values in .
Most of our focus will be on parametric learning where we approximate by a parametric family of models where is the parameter and is a separable Hilbert space. This allows us to work with a computable class of functions and perform the minimization directly over . Much of the early work in machine learning focuses on model classes which make the associated minimization problem convex [9, 31, 53], but the recent empirical success of neural networks has driven research away from this direction [45, 21]. In Section 3, we give a brief overview of the model space of neural networks.
While the formulation presented in (1) is very general, it is not directly transferable to practical applications as, typically, we have no direct access to . How we choose to address this issue depends on the information known to us, usually in the form of a data set, and defines the type of learning. Typically information about is accessible only through our sample data. The next three subsections describe particular structures of such sample data sets which arise in applications, and the minimization tasks which are constructed from them to determine the parameter .
2.1.1 Supervised Learning
Suppose that we have a dataset assumed to be i.i.d. samples from . We can thus replace the integral (1) with its Monte Carlo approximation, and add a regularization term, to obtain the following minimization problem:
Here is a regularizer on the parameters designed to prevent overfitting or address possible ill-posedness. We use the notation , and analogously , for concatenation of the data in the input and output spaces respectively.
A common choice of regularizer is where is a tunable parameter. This choice is often called weight decay in the machine learning literature. Other choices, such as sparsity promoting norms, are also employed; carefully selected choices of the norm can induce desired behavior in the parameters [10, 82]. We note also that Monte Carlo approximation is itself a form of regularization of the minimization task (1).
2.1.2 Semi-Supervised Learning
Suppose now that we only observe a small portion of the data in the image space; specifically we assume that we have access to data , where , and where with . Clearly this can be turned into supervised learning by ignoring all data indexed by , but we would like to take advantage of all the information known to us. Often the data in is known as unlabeled data, and the data in as labeled data; in particular the labeled data is often in the form of categories. We use the terms labeled and unlabeled in general, regardless or whether the data in is categorical; however some of our illustrative discussion below will focus on the binary classification problem. The objective is to assign a label to every . This problem is known as semi-supervised learning.
One approach to the problem is to seek to minimize
where the regularizer may use the unlabeled data in , but the loss term involves only labeled data in .
There are a variety of ways in which one can construct the regularizer including graph-based and low-density separation methods [7, 6]. In this work, we will study a nonparametric graph approach where we think of as indexing the nodes on a graph. To illustrate ideas we consider the case of binary outputs, take and restrict attention to mappings which take values in ; we sometimes abuse notation and simply take , so that is no longer a Hilbert space. We assume that comprises real-valued functions on the nodes of the graph, equivalently vectors in . We specify that for all
, and take, for example, the probit or logistic loss function[62, 7]. Once we have found an optimal parameter value for , application of to will return a labeling over all nodes in . In order to use all the unlabeled data we introduce edge weights which measure affinities between nodes of a graph with vertices , by means of a weight function on . We then compute the graph Laplacian and use it to define a regularizer in the form
Here is the identity operator, and with are tunable parameters. Further details of this method are in the following section. Applications of semi-supervised learning can include any situation where data in the image space is hard to come by, for example because it requires expert human labeling; a specific example is medical imaging .
2.1.3 Online Learning
Our third and final class of learning problems concerns situations where samples of data are presented to us sequentially and we aim to refine our choice of parameters at each step. We thus have the supervised learning problem (2) and we aim to solve it sequentially as each pair of data points
is delivered. To facilitate cheap algorithms we impose a Markovian structure in which we are allowed to use only the current data sample, as well as our previous estimate of the parameters, when searching for the new estimate. We look for a sequencesuch that as where, in the perfect scenario, will be a minimizer of the limiting learning problem (1). To make the problem Markovian, we may formulate it as the following minimization task
where is again a regularizer that could enforce a closeness condition between consecutive parameter estimates, such as
Furthermore this regularization need not be this explicit, but could rather be included in the method chosen to solve (6). For example if we use an iterative method for the minimization, we could simply start the iteration at .
This formulation of supervised learning is known as online learning. It can be viewed as reducing computational cost as a cheaper, sequential way of estimating a solution to (1); or it may be necessitated by the sequential manner in which data is acquired.
2.2 Inverse Problems
The preceding discussion demonstrates that, while the goal of learning is to find a mapping which generalizes across the whole distribution of possible data, in practice, we are severely restricted by only having access to a finite data set. Namely formulations (2), (4), (6) can be stated for any input-output pair data set with no reference to by simply assuming that there exists some function in our model class that will relate the two. In fact, since is positive-definite, its dependence also washes out when ones takes a function approximation point of view. To make this precise, consider the inverse problem of finding such that
here is a concatenation and is a
-valued random variable distributed according to a measurethat models possible noise in the data, or model error. In order to facilitate a Bayesian formulation of this inverse problem we let
denote a prior probability measure on the parameters. Then supposing
we see that (2) corresponds to the standard MAP
estimator arising from a Bayesian formulation of (8). The semi-supervised learning problem (4) can also be viewed as a MAP estimator by restricting (8) to and using to build . This is the perspective we take in this work and we illustrate with an example for each type of problem.
Suppose that and are Euclidean spaces and let and be Gaussian with positive-definite covariances where is block-diagonal with identical blocks . Computing the MAP estimator of (8), we obtain that and .
Suppose that and with the data . We will take the model class to be a single function depending only on the index of each data point and defined by . As mentioned, we think of as the nodes on a graph and construct the edge set where is a symmetric function. This allows construction of the associated graph Laplacian . We shift it and remove its null space and consider the symmetric, positive-definite operator from which we can define the Gaussian measure . For details on why this construction defines a reasonable prior we refer to . Letting , we restrict (8) to the inverse problem
With the given definitions, letting , the associated MAP estimator has the form of (4), namely
The infimum for this functional is not achieved , but the ensemble based methods we employ to solve the problem implicitly apply a further regularization which circumvents this issue.
Lastly we turn to the online learning problem (6). We assume that there is some unobserved, fixed in time parameter of our model that will perfectly match the observed data up to a noise term. Our goal is to estimate this parameter sequentially. Namely, we consider the stochastic dynamical system,
where the sequence are -valued i.i.d. random variables that are also independent from the data. This is an instance of the classic filtering problem considered in data assimilation . We may view this as solving an inverse problem at each fixed time with increasingly strong prior information as time unrolls. With the appropriate assumptions on the prior and the noise model, we may again view (6) as the MAP estimators of each fixed inverse problem. Thus we may consider all problems presented here in the general framework of (8).
3 Approximation Architectures
In this section, we outline the approximation architectures that we will use to solve the three machine learning tasks outlined in the preceding section. For supervised and online learning these amount to specifying the dependence of on ; for semi-supervised learning this corresponds to determining a basis in which to seek the parameter . We do not give further details for the semi-supervised case as our numerics fit in the context of Example 2.2, but we refer the reader to  for a detailed discussion.
details feed-forward neural networks with subsections3.1.1 and 3.1.2 showing the parameterizations of dense and convolutional networks respectively. Subsection 3.2 presents basic recurrent neural networks.
3.1 Feed-Forward Neural Networks
Feed-forward neural networks are a parametric model class defined as discrete time, nonautonomous, semi-dynamical systems of an unusual type. Each map in the composition takes a specific parametrization and can change the dimension of its input while the whole system is computed only up to a fixed time horizon. To make this precise, we will assume, and define a neural network with hidden layers as the composition
where and are nonlinear maps, referred to as layers, depending on parameters respectively, is an affine map with parameters , and is a concatenation. The map is fixed and thought of as a projection or thresholding done to move the output to the appropriate subset of data space. The choice of is dependent on the problem at hand. If we are considering a regression task and then can simply be taken as the identity. On the other hand, if we are considering a classification task and , the set of probability vectors in , then is often taken to be the softmax function defined as
From this perspective, the neural network approximates a categorical distribution of the input data and the softmax arises naturally as the canonical response function of the categorical distribution (when viewed as belonging to the exponential family of distributions) [51, 65]. If we have some specific bounds for the output data, for example then can be a point-wise hyperbolic tangent.
What makes this dynamic unusual is the fact that each map can change the dimension of its input unlike a standard dynamical system which operates on a fixed metric space. However, note that the sequence of dimension changes is simply a modeling choice that we may alter. Thus let and consider the solution map generated by the nonautonomous difference equation
where each map is such that ; then is given that . We may then define a neural network as
where is a projection operator and is again an affine map. While this definition is mathematically satisfying and potentially useful for theoretical analysis as there is a vast literature on nonautonomus semi-dynamical systems , in practice, it is more useful to think of each map as changing the dimension of its input. This is because it allows us to work with parameterizations that explicitly enforce the constraint on the dimension of the image. We take this point of view for the rest of this section to illustrate the practical uses of neural networks.
3.1.1 Dense Networks
A key feature of neural networks is the specific parametrization of each map . In the most basic case, each is an affine map followed by a point-wise nonlinearity, in particular,
where , are the parameters i.e. and is non-constant, bounded, and monotonically increasing; we extend to a function on by defining it point-wise as for any vector . This layer type is referred to as dense, or fully-connected, because each entry in is a parameter with no global sparsity assumptions and hence we can end up with a dense matrix. A neural network with only this type of layer is called dense or fully-connected (DNN).
The nonlinearity , called the activation function. Note that ReLU is unbounded and hence does not satisfy the assumptions for the classical universal approximation theorem 
, but it has shown tremendous numerical success when the associated inverse problem is solved via backpropagation (method of adjoints).
3.1.2 Convolutional Networks
Instead of seeking the full representation of a linear operator at each time step, we may consider looking only for the parameters associated to a pre-specified type of operator. Namely we consider
where can be fully specified by the parameter . The most commonly considered operator is the one arising from a discrete convolution . We consider the input as a function on the integers with period then we may define as the circulant matrix arising as the kernel of the discrete circular convolution with convolution operator . Exact construction of the operator is a modeling choice as one can pick exactly which blocks of to compute the convolution over. Usually, even with maximally overlapping blocks, the operation is dimension reducing, but can be made dimension preserving, or even expanding, by appending zero entries to . This is called padding. For brevity, we omit exact descriptions of such details and refer the reader to . The parameter is known as the stencil. Neural networks following this construction are called convolutional (CNN).
In practice, a CNN computes a linear combination of many convolutions at each time step, namely
for where with each entry known as a channel and if no convolutions were computed at the previous iteration. Finally we define . The number of channels at each time step, the integer , is a design choice which, along with the choice for the size of the stencils , the dimension of the input, and the design of determine the dimension of the image space for the map .
When employing convolutions, it is standard practice to sometimes place maps which compute certain statistics from the convolution. These operations are commonly referred to as pooling 
. Perhaps the most common such operation is known as max-pooling. To illustrate supposeare the channels computed as the output of a convolution (dropping the dependence for notational convenience). In this context, it is helpful to change perspective slightly and view each as a matrix whose concatenation gives the vector . Each of these matrices is a two-dimensional grid whose value at each point represents a linear combination of convolutions each computed at that spatial location. We define a maximum-based, block-subsampling operation
where the tuple is called the pooling kernel and the tuple is called the stride, each a design choice for the operation. It is common practice to take . We then define the full layer as . There are other standard choices for pooling operations including average pooling, -pooling, fractional max-pooling, and adaptive max-pooling where each of the respective names are suggestive of the operation being performed; details may be found in [79, 23, 24, 27]. Note that pooling operations are dimension reducing and are usually thought of as a way of extracting the most important information from a convolution. When one chooses the kernel such that , the per channel output of the pooling is a scalar and the operation is called global pooling.
|Convolutional Neural Network|
A four layer convolutional neural network for classifying images in the MNIST data set. The middle column shows a description typical of the machine learning literature. The other two columns connect this jargon to the notation presented here. No padding is added and the convolutions are computed over maximally overlapping blocks (stride of one). The nonlinearityis the ReLU and is the same for every layer.
Designs of feed-forward neural networks usually employ both convolutional (with and without pooling) and dense layers. While the success of convolutional networks has mostly come from image classification or object detection tasks , they can be useful for any data with spatial correlations [8, 21]. To connect the complex notation presented in this section with the standard in machine learning literature, we will give an example of a deep convolutional neural network. We consider the task of classifying images of hand-written digits given in the MNIST dataset . These are grayscale images of which there are and 10 overall classes hence we consider and the space of probability vectors over . Figure 1 show a typical construction of a deep convolutional neural network for this task. The word deep is generally reserved for models with . Once the model has been fit, Figure 2 shows the output of each map on an example image. Starting with the digitized digit , the model computes its important features, through a sequence of operations involving convolutional layers, culminating in the second to last plot, the output of the affine map . This plot shows model determining that the most likely digit is , but also giving substantial probability weight on the digit . This makes sense, as the digits and can look quite similar, especially when hand-written. Once the softmax is taken (because it exponentiates), the probability of the image being a is essentially washed out, as shown in the last plot. This is a short-coming of the softmax as it may not accurately retain the confidence of the model’s prediction. We stipulate that this may be a reason for the emergence of highly-confident adversarial examples [77, 22], but do not pursue suitable modifications in this work.
3.2 Recurrent Neural Networks
Recurrent neural networks are models for time-series data defined as discrete time, nonautonomous, semi-dynamical systems that are parametrized by feed-forward neural networks. To make this precise, we first define a layer of two-inputs simply as the sum of two affine maps followed by a point-wise nonlinearity, namely for define by
where , and ; the parameters are then given by the concatenation . The dimension is a design choice that we can pick on a per-problem basis. Now define the map by composing along the first component
where is a concatenation. Now suppose is an observed time series and define the dynamic
up to time . We can think of this as a nonautonomous, semi-dynamical system on with parameter . Let be the solution map generated by this difference equation. We can finally define a recurrent neural network by
where are affine maps, is a thresholding (such as softmax) as previously discussed, and a concatenation of the parameters as well as the parameters for all of the affine maps. Usually ones takes , but randomly generated initial conditions are also used in practice.
The construction presented here is the most basic recurrent neural network. Many others architectures such as Long Short-Term Memory (LSTM), recursive, and bi-recurrent networks have been proposed in the literature[69, 30, 29, 21], but they are all slight modifications to the above dynamic. These architectures can be used as sequence to sequence maps, or, if we only consider the output at the last time that is , as predicting or classifying the sequence . We refer the reader to  for an overview of the applications of recurrent neural networks.
Subsection 4.1 describes the choice of loss function. Subsection 4.2 outlines the state-of-the-art derivative based optimization, with subsection 4.2.1 presenting the algorithms and subsection 4.2.2 presenting tricks for better convergence. Subsection 4.3 defines the EKI method, with subsequent subsections presenting our various modifications.
4.1 Loss Function
Before delving into the specifics of optimization methods used, we discuss the general choice of loss function . While the machine learning literature contains a wide variety of loss functions that are designed for specific problems, there are two which are most commonly used and considered first when tackling any regression and classification problems respectively, and on which we focus our work in this paper. For regression tasks, the squared-error loss
is standard and is well known to the inverse problems community; it arises from an additive Gaussian noise model. When the task at hand is classification, the standard choice of loss is the cross-entropy
with the computed point-wise and where we consider . This loss is well-defined on the space . It is consistent with the the projection map of the neural network model being the softmax as . A simple Lagrange multiplier argument shows that is indeed infimized over by sequence and hence the loss is consistent with what we want our model output to be. 111Note that the infimum is not, in general, attained in as defined, because perfectly labeled data may take the form which is in the closure of but not in itself. From a modeling perspective, the choice of softmax as the output layer has some drawbacks as it only allows us to asymptotically match the data. However it is a good choice if the cross-entropy loss is used to solve the problem; indeed, in practice, the softmax along with the cross-entropy loss has seen the best numerical results when compared to other choices of thresholding/loss pairs .
The interpretation of the cross-entropy loss is to think of our model as approximating a categorical distribution over the input data and, to get this approximation, we want to minimize its Shannon cross-entropy with respect to the data. Note, however, that there is no additive noise model for which this loss appears in the associated MAP estimator simply because cannot be written purely as a function of the residual .
4.2 Gradient Based Optimization
4.2.1 The Iterative Technique
The current state of the art for solving optimization problems of the form (2), (4), (6) is based around the use of stochastic gradient descent (SGD) [63, 39, 66]. We will describe these methods starting from a continuous time viewpoint, for pedagogical clarity. In particular, we think of the unknown parameter as the large time limit of a smooth function of time . Let then gradient descent imposes the dynamic
which moves the parameter in the steepest descent direction with respect to the regularized loss function, and hence will converge to a local minimum for Lebesgue almost all initial data, leading to bounded trajectories [48, 71].
For the practical implementations of this approach in machine learning, a number of adaptations are made. First the ODE is discretized in time, typically by a forward Euler scheme; the time-step is referred to as the learning rate. The time-step is often, but not always, chosen to be a decreasing function of the iteration step [14, 63]. Secondly, at each step of the iteration, only a subset of the data is used to approximate the full gradient. In the supervised case, for example,
where is a random subset of cardinality usually with . A new is drawn at each step of the Euler scheme without replacement until the full dataset has been exhausted. One such cycle through all of the data is called an epoch
. The number of epochs it takes to train a model varies significantly based on the model and data at hand but is usually within the range 10 to 500. This idea, calledmini-batching, leads to the terminology stochastic gradient descent (SGD). Recent work has suggested that adding this type of noise helps preferentially guide the gradient descent towards places in parameter space which generalize better than standard descent methods [12, 13].
A third variant on basic gradient descent is the use of momentum-augmented methods utilized to accelerate convergence 
. The continuous time dynamic associated with the Nesterov momentum method is
We note, however, that, while still calling it Nesterov momentum, this is not the dynamic machine learning practitioners discretize. In fact, what has come to be called Nesterov momentum in the machine learning literature is nothing more than a strange discretization of a rescaled version the standard gradient flow (10). To see this, we note that via the approximation one may discretize (11), as is done in , to obtain
with and where the sequence gives the step sizes. However, what machine learning practitioners use is the algorithm
for some fixed . This may be written as
If we rearrange and divide by , we can obtain
which is easily seen as a discretization of a rescaled version of the gradient flow (10), namely
However, there is a sense in which this discretization introduces momentum, but only to order whereas classically the momentum term would be on the order . To see this, we can again rewrite the discretization as
Rearranging this and dividing through by we can obtain
which may be seen as a discretization of the equation
where is a continuous time version of the sequence ; in particular, if we see that whilst momentum is approximately present, it is only a small effect.
From these variants on continuous time gradient descent have come a plethora of adaptive first-order optimization methods that attempt to solve the learning problem. Some of the more popular include Adam, RMSProp, and Adagrad[40, 15]. There is no consensus on which methods performs best although some recent work has argued in favor of SGD and momentum SGD .
Lastly, the online learning problem (6) is also commonly solved via a gradient descent method dubbed online gradient descent (OGD). The dynamic is
which can be extended to the momentum case in the obvious way. It is common that only a single step of the Euler scheme is computed. The process of letting all these ODE(s) evolve in time is called training.
4.2.2 Initialization and Normalization
Two major challenges for the iterative methods presented here are: 1) finding a good starting point for the dynamic, and 2) constraining the distribution of the outputs of each map in the neural network. The first is usually called initialization, while the second is termed normalization; the two, as we will see, are related.
Historically, initialization was first dealt with using a technique called layer-wise pretraining . In this approach the parameters are initialized randomly. Then the parameters of all but first layer are held fixed and SGD is used to find the parameters of the first layer. Then all but the parameters of the second layer are held fixed and SGD is used to find the parameters of the second layer. Repeating this for all layers yields an estimate for all the parameters, and this is then used as an initialization for SGD in a final step called fine-tuning. Development of new activation functions, namely the ReLU, has allowed simple random initialization (from a carefully designed prior measure) to work just as well, making layer-wise pretraining essentially obsolete. There are many proposed strategies in the literature for how one should design this prior [20, 52]
. The main idea behind all of them is to somehow normalize the output mean and variance of each map. One constructs the product probability measure
where each is usually a centered, isotropic probability measure with covariance scaling . Each such measure corresponds to the distribution of the parameters of each respective layer with attached to the parameters of the map . A common strategy called Xavier initialization  proposes that the inverse covariance (precision) is determined by the average of the input and output dimensions of each layer:
When the layer is convolutional, and are instead taken to be the number of input and output channels respectively. Usually each is then taken to be a centered Gaussian or uniform probability measure. Once this prior is constructed one initializes SGD by a single draw.
As we have seen, initialization strategies aim to normalize the output distribution of each layer. However, once SGD starts and the parameters change, this normalization is no longer in place. This issue has been called the internal covariate shift. To address it, normalizing parameters are introduced after the output of each layer. The most common strategy for finding these parameters is called batch-normalization , which, as the name implies, relies on computing a mini-batch of the data. To illustrate the idea, suppose are the outputs of the map at inputs . We compute the mean and variance
and normalize these outputs so that the inputs to the map are
where is used for numerical stability while are new parameters to be estimated, and are termed the scale and shift respectively; they are found by means of the SGD optimization process. It is not necessary to introduce the new parameters but is common in practice and, with them, the operation is called affine batch-normalization. When an output has multiple channels, separate normalization is done per channel. During training a running mean of each is kept and the resulting values are used for the final model. A clear drawback to batch normalization is that it relies on batches of the data to be computed and hence cannot be used in the online setting. Many similar strategies have been proposed [80, 2] with no clear consensus on which works best. Recently a new activation function called SeLU  has been claimed to perform the required normalization automatically.
4.3 Ensemble Kalman Inversion
The Ensemble Kalman Filter (EnKF) is a method for estimating the state of a stochastic dynamical system from noisy observations . Over the last decade the method has been systematically developed as an iterative method for solving general inverse problems; in this context, it is sometimes referred to as Ensemble Kalman Inversion (EKI) . Viewed as a sequential Monte Carlo method , it works on an ensemble of parameter estimates (particles) transforming them from the prior into the posterior. Recent work has established, however, that unless the forward operator is linear and the additive noise is Gaussian , the correct posterior is not obtained . Nevertheless there is ample numerical evidence that shows EKI works very well as a derivative-free optimization method for nonlinear least-squares problems [38, 5]. In this paper, we view it purely through the lens of optimization and propose several modifications to the method that follow from adopting this perspective within the context of machine learning problems.
Consider the general inverse problem
where represent noise, and let be a prior measure on the parameter . Note that the supervised, semi-supervised, and online learning problems (8), (9) can be put into this general framework by adjusting the number of data points in the concatenations , and letting be absorbed into the definition of . Let be an ensemble of parameter estimates which we will allow to evolve in time through interaction with one another and with the data; this ensemble may be initialized by drawing independent samples from , for example. The evolution of is described by the EKI dynamic 
and is the empirical cross-covariance operator
Viewing the difference of from its mean, appearing in the left entry of the inner-product, as a projected approximate derivative of , it is possible to understand (12) as an approximate gradient descent.
Rigorous analysis of the long-term properties of this dynamic for a finite are poorly understood except in the case where is linear . In the linear case, we obtain that as where minimizes the functional
in the subspace , and where is the mean of the initial ensemble . This follows from the fact that, in the linear case, we may re-write (12) as
where is an empirical covariance operator
Hence each particle performs a gradient descent with respect to and projects into the subspace .
To understand the nonlinear setting we use linearization. Note from (12) that