# Convolutional Neural Nets: Foundations, Computations, and New Applications

We review mathematical foundations of convolutional neural nets (CNNs) with the goals of: i) highlighting connections with techniques from statistics, signal processing, linear algebra, differential equations, and optimization, ii) demystifying underlying computations, and iii) identifying new types of applications. CNNs are powerful machine learning models that highlight features from grid data to make predictions (regression and classification). The grid data object can be represented as vectors (in 1D), matrices (in 2D), or tensors (in 3D or higher dimensions) and can incorporate multiple channels (thus providing high flexibility in the input data representation). For example, an image can be represented as a 2D grid data object that contains red, green, and blue (RGB) channels (each channel is a 2D matrix). Similarly, a video can be represented as a 3D grid data object (two spatial dimensions plus time) with RGB channels (each channel is a 3D tensor). CNNs highlight features from the grid data by performing convolution operations with different types of operators. The operators highlight different types of features (e.g., patterns, gradients, geometrical features) and are learned by using optimization techniques. In other words, CNNs seek to identify optimal operators that best map the input data to the output data. A common misconception is that CNNs are only capable of processing image or video data but their application scope is much wider; specifically, datasets encountered in diverse applications can be expressed as grid data. Here, we show how to apply CNNs to new types of applications such as optimal control, flow cytometry, multivariate process monitoring, and molecular simulations.

## 1 Introduction

Convolutional neural nets (CNNs) are powerful machine learning models that highlight (extract) features from data to make predictions (regression and classification). The input data object processed by CNNs has a grid-like topology (e.g., an image is a matrix and a video is a tensor); features of the input object are extracted using a mathematical operation known as convolution. Convolutions are applied to the input data with different operators (also known as filters) that seek to extract different types of features (e.g., patterns, gradients, geometrical features). The goal of the CNN is to learn optimal operators (and associated features) that best map the input data to the output data. For instance, in recognizing an image (the input), the CNN seeks to learn the patterns of the image that best explain the label given to the image (the output).

The earliest version of a CNN was proposed in 1980 by Kunihiko Fukushima [8] and was used for pattern recognition. In the late 1980s, the LeNet model proposed by LeCun et al. introduced the concept of backward propagation, which streamlined learning computations using optimization techniques [19]. Although the LeNet model had a simple architecture, it was capable of recognizing hand-written digits with high accuracy. In 1998, Rowley et al. proposed a CNN model capable of performing face recognition tasks (this work revolutionized object classification and detection) [29]. The complexity of CNN models (and their predictive power) has dramatically expanded with the advent of parallel computing architectures such as graphics processing units [25]. Modern CNN models for image recognition include SuperVision [18], GoogLeNet [35], VGG [32], and ResNet [10]. New models are currently being developed to perform diverse computer vision tasks such as object detection [27], semantic segmentation [20], action recognition [31], and 3D analysis [15]. Nowadays, CNNs are routinely used in applications such as the face-unlock feature of smartphones [24].

CNNs tend to outperform other machine learning models (e.g., support vector machines and decision trees) but their behavior is difficult to explain. For instance, it is not always straightforward to determine the features that the operators are seeking to extract. As such, researchers have devoted significant effort to understanding the mathematical properties of CNNs. For instance, Cohen et al. established an equivalence between CNNs and hierarchical tensor factorizations [5]. Bruna et al. analyzed feature extraction capabilities [2] and showed that these satisfy translation invariance and deformation stability (important concepts in determining geometric features from data).

While CNNs were originally developed to perform computer vision tasks, the grid data representation used by CNNs is flexible and can be used to process datasets arising in many different applications. For instance, in the field of chemistry, Hirohara et al. proposed a matrix representation of SMILES strings (which encode molecular topology) by using a technique known as one-hot encoding [13]. The authors used this representation to train a CNN that could predict the toxicity of chemicals; it was shown that the CNN outperformed traditional models based on fingerprints (an alternative molecular representation). Via analysis of the learned filters, the authors also determined chemical structures (features) that drive toxicity. In the realm of biology, Xie et al. applied CNNs to count and detect cells from micrographs [39]. In the area of materials science, Smith et al. have used CNNs to extract features from optical micrographs of liquid crystals to design chemical sensors [33].

While there has been extensive research on CNNs and the number of applications in science and engineering is rapidly growing, there are limited reports available in the literature that outline the mathematical foundations and operations behind CNNs. As such, important connections between CNNs and other fields such as statistics, signal processing, linear algebra, differential equations, and optimization remain under-appreciated. We believe that this disconnect limits the ability of researchers to propose extensions and identify new applications. For example, a common misconception is that CNNs are only applicable to computer vision tasks; however, CNNs can operate on general grid data objects (vectors, matrices, and tensors). Moreover, operators learned by CNNs can be potentially explained when analyzed from the perspective of calculus and statistics. For instance, certain types of operators seek to extract gradients (derivatives) of a field or seek to extract correlation structures. Establishing connections with other mathematical fields is important in gaining interpretability of CNN models. Understanding the operations that take place in a CNN is also important in order to understand their inherent limitations, in proposing potential modeling and algorithmic extensions, and in facilitating the incorporation of CNNs in computational workflows of interest to engineers (e.g., process monitoring, control, and optimization).

In this work, we review the mathematical foundations of CNNs; specifically, we provide concise derivations of input data representations and of convolution operations. We explain the origins of certain types of operators and the data transformations that they induce to highlight features that are hidden in the data. Moreover, we provide concise derivations for forward and backward propagations that arise in CNN training procedures. These derivations provide insight into how information flows in the CNN and help understand computations involved in the learning process (which seeks to solve an optimization problem that minimizes a loss function). We also explain how derivatives of the loss function can be used to understand key features that the CNN searches for to make predictions. We illustrate the concepts by applying CNNs to new types of applications such as optimal control (1D), flow cytometry (2D), multivariate process monitoring (2D), and molecular simulations (3D). Specifically, we focus our attention on how to convert raw data into a suitable grid-like representation that the CNN can process.

## 2 Convolution Operations

Convolution is a mathematical operation that involves an input function and an operator function (these functions are also known as signals). A related operation that is often used in CNNs is cross-correlation. Although technically different, the essence of these operations is similar (we will see that they are mirror representations). One can think of a convolution as an operation that seeks to transform the input function in order to highlight (or extract) features. The input and operator functions can live on arbitrary dimensions (1D, 2D, 3D, and higher). The features highlighted by an operator are defined by its design; for instance, we will see that one can design operators that extract specific patterns, derivatives (gradients), or frequencies. These features encode different aspects of the data (e.g., correlations and geometrical patterns) and are often hidden. Input and operator functions can be continuous or discrete; discrete functions facilitate computational implementation but continuous functions can help understand and derive mathematical properties. For example, families of discrete operators can be generated by using a single continuous function (a kernel function). We begin our discussion with 1D convolution operations and we later extend the analysis to the 2D case; extensions to higher dimensions are straightforward once the basic concepts are outlined. We highlight, however, that convolutions in high dimensions are computationally expensive (and sometimes intractable).

### 2.1 1D Convolution Operations

The convolution of scalar continuous functions $u:\mathbb{R}\to\mathbb{R}$ and $v:\mathbb{R}\to\mathbb{R}$ is denoted as $\psi = u * v$ and their cross-correlation is denoted as $\phi = u \star v$. These operations are given by:

$$\psi(x) = (u * v)(x) = \int_{-\infty}^{\infty} u(x') \cdot v(x - x')\, dx', \quad x \in (-\infty, \infty) \tag{1a}$$

$$\phi(x) = (u \star v)(x) = \int_{-\infty}^{\infty} u(x') \cdot v(x + x')\, dx', \quad x \in (-\infty, \infty). \tag{1b}$$

We will refer to function $u$ as the convolution operator and to $v$ as the input signal. The output of the convolution operation is a scalar continuous function $\psi$ that we refer to as the convolved signal. The output of the cross-correlation operation is also a continuous function $\phi$ that we refer to as the cross-correlated signal.

The convolution and cross-correlation are applied to the signal by spanning the domain $x \in (-\infty, \infty)$; we can see that convolution is applied by looking backward, while cross-correlation is applied by looking forward. One can show that cross-correlation is equivalent to convolution that uses an operator rotated (flipped) by 180 degrees; in other words, if we define the rotated operator as $u'(x) = u(-x)$ then $u \star v = u' * v$. We thus have that convolution and cross-correlation are mirror operations; consequently, convolution and cross-correlation are often both referred to as convolution. In what follows we use the terms convolution and cross-correlation interchangeably; we highlight one operation over the other when appropriate. Modern CNN packages such as PyTorch [26] use cross-correlation in their implementation (this is more intuitive and easier to implement).
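The mirror relationship between the two operations can be checked numerically. The following is a minimal sketch using NumPy (not part of the original text); the operator and signal values are hypothetical, and `np.convolve` / `np.correlate` are used as stand-ins for the discrete versions of (1a) and (1b).

```python
import numpy as np

# Hypothetical example data: an asymmetric operator u and a short signal v.
u = np.array([1.0, 2.0, 3.0])
v = np.array([0.0, 1.0, 0.5, -1.0, 2.0, 0.0])

# np.convolve flips the operator before sliding it (convolution).
conv = np.convolve(v, u, mode="valid")

# np.correlate slides the operator as given (cross-correlation).
xcorr = np.correlate(v, u, mode="valid")

# Cross-correlation coincides with convolution by the flipped operator.
conv_flipped = np.convolve(v, u[::-1], mode="valid")
print(np.allclose(xcorr, conv_flipped))   # True

# For a symmetric operator the two operations give the same output.
u_sym = np.array([1.0, 2.0, 1.0])
print(np.allclose(np.convolve(v, u_sym, mode="valid"),
                  np.correlate(v, u_sym, mode="valid")))   # True
```

For the asymmetric operator the two outputs differ, which is why the distinction matters when an operator is hand-designed rather than learned.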

One can think of convolution and cross-correlation as weighting operations (with weighting function defined by the operator). The weighting function can be designed in different ways to perform different operations to the signal such as averaging, smoothing, differentiation, and pattern recognition. In Figure 1 we illustrate the application of a Gaussian operator $u$ to a noisy rectangular signal $v$. We note that the convolved and cross-correlated signals are identical (because the operator is symmetric and thus $u(x) = u(-x)$). We also note that the output signals are smooth versions of the input signal; this is because a Gaussian operator has the effect of extracting frequencies from the signal (a fact that can be established from Fourier analysis). Specifically, we recall that the Fourier transform of the convolved signal $\psi$ satisfies:

$$\mathcal{F}\{\psi\} = \mathcal{F}\{u * v\} = \mathcal{F}\{u\} \cdot \mathcal{F}\{v\}, \tag{2}$$

where $\mathcal{F}\{w\}$ is the Fourier transform of scalar function $w$. Here, the product $\mathcal{F}\{u\} \cdot \mathcal{F}\{v\}$ has the effect of preserving the frequencies in the signal $v$ that are also present in the operator $u$. In other words, convolution acts as a filter of frequencies; as such, one often refers to operators as filters. By comparing the output signals of convolution and cross-correlation obtained with the triangular operator, we can confirm that one is the mirror version of the other. From Figure 1, we also see that the output signals are smooth variants of the input signal; this is because the triangular operator also extracts frequencies from the input signal (but the extraction is not as clean as that obtained with the Gaussian operator). This highlights the fact that different operators have different frequency content (known as the frequency spectrum).
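The Fourier property in (2) can be verified on discrete signals. The sketch below is an illustration under stated assumptions: the noisy signal, the 9-point Gaussian operator, and its width are all hypothetical choices, and zero-padding to the full output length is used so that the FFT product matches the linear convolution.

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=64)                              # hypothetical noisy input signal
u = np.exp(-0.5 * (np.arange(-4, 5) / 1.5) ** 2)     # Gaussian operator sampled on 9 points
u /= u.sum()                                         # normalize so the weights sum to one

n_full = len(u) + len(v) - 1                         # length of the full linear convolution
psi_direct = np.convolve(u, v, mode="full")

# Zero-pad both signals to n_full so that the circular convolution implied
# by the discrete Fourier transform coincides with the linear convolution.
u_hat = np.fft.rfft(u, n_full)
v_hat = np.fft.rfft(v, n_full)
psi_fft = np.fft.irfft(u_hat * v_hat, n_full)

print(np.allclose(psi_direct, psi_fft))              # True

# Magnitude spectrum of the operator: close to one at low frequency and
# close to zero at the highest frequency, so high frequencies are damped.
u_spectrum = np.abs(u_hat)
```

Inspecting `u_spectrum` shows which frequencies of the signal survive the product in (2).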

Convolution and cross-correlation are implemented computationally by using discrete representations; these signals are represented by column vectors $u, v \in \mathbb{R}^{n}$ with entries denoted as $u[x]$, $v[x]$, $x = -N, \ldots, N$ (note that $n = 2N + 1$). Here, we note that the input and operator are defined over a 1D grid. The discrete convolution results in vector $\psi \in \mathbb{R}^{n}$ with entries given by:

$$\psi[x] = \sum_{x'=-N}^{N} u[x'] \cdot v[x + x'], \quad x \in \{-N, N\}. \tag{3}$$

Here, we use $\{-N, N\}$ to denote the integer sequence $-N, -N+1, \ldots, N-1, N$. We note that computing $\psi[x]$ at a given location $x$ requires $n$ operations and thus computing the entire output signal (spanning $x \in \{-N, N\}$) requires $n^2$ operations; consequently, convolution becomes expensive when the signal and operator are long vectors; as such, the operator $u$ is often a vector of low dimension (compared to the dimension of the signal $v$). Specifically, consider the operator $u \in \mathbb{R}^{n_u}$ with entries $u[x]$, $x \in \{-N_u, N_u\}$ (thus $n_u = 2N_u + 1$) and assume $n_u \le n$ (typically $n_u \ll n$). The convolution is given by:

$$\psi[x] = \sum_{x'=-N_u}^{N_u} u[x'] \cdot v[x + x'], \quad x \in \{-N, N\}. \tag{4}$$

This reduces the number of operations needed from $n^2$ to $n \cdot n_u$.

The convolved signal is obtained by using a moving window starting at the boundary $x = -N$ and by moving forward as $x + 1$ until reaching $x = N$. We note that the convolution is not properly defined close to the boundaries (because the window lies outside the domain of the signal). This situation can be remedied by starting the convolution at an inner entry of $v$ such that the full window is within the signal (this gives a proper convolution). However, this approach has the effect of returning a signal that is smaller than the original signal $v$. This issue can be overcome by artificially padding the signal by adding zero entries (i.e., adding ghost entries to the signal at the boundaries). This is an improper convolution but returns a signal that has the same dimension as $v$ (this is often convenient for analysis and computation). Figure 2 illustrates the difference between convolutions with and without padding. We also highlight that it is possible to advance the moving window by multiple entries as $x + s$ (here, $s$ is known as the stride). This has the effect of reducing the number of operations (by skipping some entries in the signal) but returns a signal that is smaller than the original one.
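The effect of padding and stride on the output size can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name `conv1d`, its `pad` and `stride` arguments, and the example data are hypothetical, and the cross-correlation form is assumed.

```python
import numpy as np

def conv1d(v, u, pad=0, stride=1):
    """1D convolution (cross-correlation form) with zero padding and stride."""
    v_pad = np.pad(v, pad)                        # add `pad` ghost zeros on each side
    n_out = (len(v_pad) - len(u)) // stride + 1   # number of full windows
    return np.array([np.dot(u, v_pad[k * stride : k * stride + len(u)])
                     for k in range(n_out)])

v = np.arange(1.0, 9.0)      # hypothetical signal with 8 entries
u = np.ones(3) / 3.0         # averaging operator with 3 entries

valid = conv1d(v, u)                      # no padding: 8 - 3 + 1 = 6 entries
same = conv1d(v, u, pad=1)                # one ghost entry per side: 8 entries
strided = conv1d(v, u, pad=1, stride=2)   # every second window: 4 entries
```

With a stride of 2 the output coincides with every second entry of the padded result, which makes explicit that striding skips windows rather than changing the operator.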

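By defining signals $u \in \mathbb{R}^{n_u}$ and $v \in \mathbb{R}^{n_v}$ with entries $u[x]$, $x \in \{1, n_u\}$ and $v[x]$, $x \in \{1, n_v\}$, one can also express the valid convolution operation in an asymmetric form as:

$$\psi[x] = \sum_{x'=1}^{n_u} u[x'] \cdot v[x + x' - 1], \quad x \in \{1, n_v - n_u + 1\}. \tag{5}$$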

In CNNs, one often applies multiple operators to a single input signal or one analyzes input signals that have multiple channels. This gives rise to the concepts of single-input single-output (SISO), single-input multi-output (SIMO), and multi-input multi-output (MIMO) convolutions.

In SISO convolution, we take a 1-channel input vector $v$ and output a vector $\psi$ (a 1-channel signal), as in (5). In SIMO convolution, one uses a single input $v$ and a collection of $q$ operators $U_{(j)}$ with $j \in \{1, q\}$. Here, every element $U_{(j)}$ is an $n_u$-dimensional vector and we represent the collection as the object $U \in \mathbb{R}^{n_u \times q}$. SIMO convolution yields a collection of convolved signals $\Psi_{(j)}$ with $j \in \{1, q\}$ and with entries given by:

$$\Psi_{(j)}[x] = \sum_{x'=1}^{n_u} U_{(j)}[x'] \cdot v[x + x' - 1], \quad x \in \{1, n_v - n_u + 1\}. \tag{6}$$

In MIMO convolution, we consider a multi-channel input given by the collection $V_{(i)}$ with $i \in \{1, p\}$ ($p$ is the number of channels); the collection is represented as the object $V \in \mathbb{R}^{n_v \times p}$. We also consider a collection of operators $U_{(i,j)}$ with $i \in \{1, p\}$ and $j \in \{1, q\}$; in other words, we have $q$ operators per input channel and we represent the collection as the object $U \in \mathbb{R}^{n_u \times p \times q}$. MIMO convolution yields the collection of convolved signals $\Psi_{(j)}$ with $j \in \{1, q\}$ and with entries given by:

$$\begin{aligned} \Psi_{(j)}[x] &= \sum_{i=1}^{p} \left(U_{(i,j)} * V_{(i)}\right)[x] \\ &= \sum_{i=1}^{p} \sum_{x'=1}^{n_u} U_{(i,j)}[x'] \cdot V_{(i)}[x + x' - 1], \quad x \in \{1, n_v - n_u + 1\}. \end{aligned} \tag{7}$$

We see that, in MIMO convolution, we add the contribution of all channels to obtain the output object $\Psi$, which contains $q$ channels given by vectors of dimension $n_v - n_u + 1$. Channel combination loses information but saves computer memory.
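A direct (unoptimized) sketch of the 1D MIMO convolution may help fix the index conventions. The array layout used here, operators stored as an array of shape `(n_u, p, q)` and the input as an array of shape `(n_v, p)`, is an assumption made for illustration, as is the function name `mimo_conv1d`; indices are 0-based in the code.

```python
import numpy as np

def mimo_conv1d(U, V):
    """1D MIMO convolution in valid, cross-correlation form.

    U: operators with shape (n_u, p, q); V: input with shape (n_v, p).
    Returns Psi with shape (n_v - n_u + 1, q).
    """
    n_u, p, q = U.shape
    n_v = V.shape[0]
    Psi = np.zeros((n_v - n_u + 1, q))
    for j in range(q):          # one output channel per operator index j
        for i in range(p):      # contributions of all input channels are added
            Psi[:, j] += np.correlate(V[:, i], U[:, i, j], mode="valid")
    return Psi

rng = np.random.default_rng(1)
U = rng.normal(size=(3, 2, 4))   # hypothetical sizes: n_u = 3, p = 2, q = 4
V = rng.normal(size=(10, 2))     # hypothetical sizes: n_v = 10, p = 2
Psi = mimo_conv1d(U, V)          # shape (8, 4)
```

Setting `p = 1` recovers the SIMO case and setting `p = q = 1` recovers the SISO case.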

### 2.2 2D Convolution Operations

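Convolution and cross-correlation operations in 2D are analogous to those in 1D; the convolution and cross-correlation of a continuous operator $u:\mathbb{R}^2\to\mathbb{R}$ and an input signal $v:\mathbb{R}^2\to\mathbb{R}$ are denoted as $\psi = u * v$ and $\phi = u \star v$ and are given by:

$$\psi(x_1, x_2) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} u(x_1', x_2') \cdot v(x_1 - x_1', x_2 - x_2')\, dx_1'\, dx_2', \quad x_1, x_2 \in (-\infty, \infty) \tag{8a}$$

$$\phi(x_1, x_2) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} u(x_1', x_2') \cdot v(x_1 + x_1', x_2 + x_2')\, dx_1'\, dx_2', \quad x_1, x_2 \in (-\infty, \infty). \tag{8b}$$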

As in the 1D case, the terms convolution and cross-correlation are used interchangeably; here, we will use the cross-correlation form (typically used in CNNs).

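In the discrete case, the input signal and the convolution operator are matrices $V \in \mathbb{R}^{n_V \times n_V}$ and $U \in \mathbb{R}^{n_U \times n_U}$ with entries $V[x_1, x_2]$, $x_1, x_2 \in \{-N_V, N_V\}$ and $U[x_1, x_2]$, $x_1, x_2 \in \{-N_U, N_U\}$ (thus $n_V = 2N_V + 1$ and $n_U = 2N_U + 1$). For simplicity, we assume that these are square matrices and we note that the input and operator are defined over a 2D grid. The convolution of the input and operator matrices results in a matrix $\Psi$ with entries:

$$\Psi[x_1, x_2] = \sum_{x_1'=-N_U}^{N_U} \sum_{x_2'=-N_U}^{N_U} U[x_1', x_2'] \cdot V[x_1 + x_1', x_2 + x_2'], \quad x_1, x_2 \in \{-N_V, N_V\}. \tag{9}$$

In Figure 3 we illustrate 2D convolutions with and without padding. The 2D convolution (in valid and asymmetric form) can be expressed as:

$$\Psi[x_1, x_2] = \sum_{x_1'=1}^{n_U} \sum_{x_2'=1}^{n_U} U[x_1', x_2'] \cdot V[x_1 + x_1' - 1, x_2 + x_2' - 1], \tag{10}$$

where $x_1, x_2 \in \{1, n_V - n_U + 1\}$, $U \in \mathbb{R}^{n_U \times n_U}$, and $V \in \mathbb{R}^{n_V \times n_V}$.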
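The valid 2D convolution in asymmetric form can be written as two nested loops over the output entries. The sketch below is illustrative only; the function name `conv2d_valid` and the example matrices are hypothetical, indices are 0-based, and padding is shown with `np.pad` (zero ghost entries on every side).

```python
import numpy as np

def conv2d_valid(U, V):
    """Valid 2D convolution (cross-correlation form) of a square operator U with V."""
    n_U = U.shape[0]
    n_out = V.shape[0] - n_U + 1
    Psi = np.empty((n_out, n_out))
    for x1 in range(n_out):
        for x2 in range(n_out):
            # weighted sum of the n_U x n_U window whose top-left corner is (x1, x2)
            Psi[x1, x2] = np.sum(U * V[x1:x1 + n_U, x2:x2 + n_U])
    return Psi

V = np.arange(25.0).reshape(5, 5)    # hypothetical 5x5 input
U = np.ones((3, 3)) / 9.0            # 3x3 averaging operator

Psi = conv2d_valid(U, V)                   # without padding: shape (3, 3)
Psi_same = conv2d_valid(U, np.pad(V, 1))   # with zero padding: shape (5, 5)
```

Because this particular input is linear in both indices, the averaging operator returns the center value of each window, so `Psi` equals the interior block `V[1:4, 1:4]`.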

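The SISO convolution of an input $V \in \mathbb{R}^{n_V \times n_V}$ and an operator $U \in \mathbb{R}^{n_U \times n_U}$ is given by (10) and outputs a matrix $\Psi$. In SIMO convolution, we are given a collection of operators $U_{(j)} \in \mathbb{R}^{n_U \times n_U}$ with $j \in \{1, q\}$. A convolved matrix $\Psi_{(j)}$ is obtained by applying the $j$-th operator $U_{(j)}$ to the input $V$:

$$\Psi_{(j)}[x_1, x_2] = \sum_{x_1'=1}^{n_U} \sum_{x_2'=1}^{n_U} U_{(j)}[x_1', x_2'] \cdot V[x_1 + x_1' - 1, x_2 + x_2' - 1] \tag{11}$$

for $j \in \{1, q\}$. The collection of convolved matrices is represented as the object $\Psi$.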

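In MIMO convolution, we are given a $p$-channel input collection $V_{(i)} \in \mathbb{R}^{n_V \times n_V}$ with $i \in \{1, p\}$ and represented as $V$. We convolve this input with the operator object $U$, which is a collection $U_{(i,j)} \in \mathbb{R}^{n_U \times n_U}$ with $i \in \{1, p\}$ and $j \in \{1, q\}$. This results in an object $\Psi$ given by the collection $\Psi_{(j)}$ for $j \in \{1, q\}$ and with entries given by:

$$\begin{aligned} \Psi_{(j)}[x_1, x_2] &= \sum_{i=1}^{p} \left(U_{(i,j)} * V_{(i)}\right)[x_1, x_2] \\ &= \sum_{i=1}^{p} \sum_{x_1'=1}^{n_U} \sum_{x_2'=1}^{n_U} U_{(i,j)}[x_1', x_2'] \cdot V_{(i)}[x_1 + x_1' - 1, x_2 + x_2' - 1], \end{aligned} \tag{12}$$

for $j \in \{1, q\}$ and $x_1, x_2 \in \{1, n_V - n_U + 1\}$. For convenience, the input object is often represented as a 3D tensor $V \in \mathbb{R}^{n_V \times n_V \times p}$, the convolution operator is represented as the 4D tensor $U \in \mathbb{R}^{n_U \times n_U \times p \times q}$, and the output signal is represented as the 3D tensor $\Psi \in \mathbb{R}^{n_\Psi \times n_\Psi \times q}$ with $n_\Psi = n_V - n_U + 1$. We note that, if $p = 1$ and $q = 1$ (single-channel input and single-channel output), these tensors become matrices. Tensors are high-dimensional quantities that require significant computer memory to store and significant power to process.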

MIMO convolutions in 2D are typically used to process RGB images (3-channel input), as shown in Figure 4. Here, the RGB image is the object $V \in \mathbb{R}^{n_V \times n_V \times 3}$ and each input channel $V_{(i)}$ is a matrix; each of these matrices is convolved with an operator $U_{(i,1)}$. Here we assume $q = 1$ (one operator per channel). The collection of convolved matrices is combined to obtain a single matrix $\Psi_{(1)}$. If we consider a collection of operators $U_{(i,j)}$ with $i \in \{1, 3\}$, $j \in \{1, q\}$, and $q > 1$, the output of MIMO convolution returns the collection of matrices $\Psi_{(j)}$ with $j \in \{1, q\}$, which is assembled in the tensor $\Psi$. Convolution with multiple operators allows for the extraction of different types of features from different channels.

We highlight that the use of channels is not restricted to images; specifically, channels can be used to input data of different variables in a grid (e.g., temperature, concentration, density). As such, channels provide a flexible framework to express multivariate data.

We also highlight that a grayscale image is a 1-channel input matrix (the RGB channels are combined in a single channel). In a grayscale image, every pixel (an entry in the associated matrix) has a certain light intensity value; whiter pixels have higher intensity and darker pixels have a lower intensity (or the other way around). The resolution of the image is dictated by the number of pixels and thus dictates the size of the matrix; the size of the matrix dictates the amount of memory needed for storage and computations needed for convolution. It is also important to emphasize that any matrix can be visualized as a grayscale image (and any grayscale image has an underlying matrix). This duality is important because visualizing large matrices (e.g., to identify patterns) is difficult if one directly inspects the numerical values; as such, a convenient approach to analyze patterns in large matrices consists of visualizing them as images.

Finally, we highlight that convolution operators can be applied to any grid data object in higher dimensions (e.g., 3D and 4D) in a similar manner. Convolutions in 3D can be used to process video data (each time frame is an image). 3D data objects can also be used to represent data distributed over 3D Euclidean spaces (e.g., density or flow in a 3D domain). However, the complexity of convolution operations in 3D (and higher dimensions) is substantial. In the discussion that follows we focus our attention on 2D convolutions; in Section 6 we illustrate the use of 3D convolutions in a practical application.

## 3 Convolution Operators

Convolution operators are the key functional units that a CNN uses to extract features from input data objects. Some commonly used operators and their transformation effect are shown in Figure 4. When the input is convolved with a Sobel operator, the output highlights the edges (gradients of intensity). The reason for this is that the Sobel operator is a differential operator. To explain this concept, consider a discrete 1D signal $v$; its derivative at entry $x$ can be computed using the finite difference:

$$v'[x] = \frac{v[x+1] - v[x-1]}{2}. \tag{13}$$

This indicates that we can compute the derivative signal $v'$ by applying a convolution of the form $v' = u * v$ with operator $u = (1/2, 0, -1/2)$. Here, the operator can be scaled as $u = (1, 0, -1)$ as this does not alter the nature of the operation performed (just changes the scale of the entries of $v'$).
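As a quick numerical check of the finite-difference interpretation, the sketch below applies the three-entry operator $(1/2, 0, -1/2)$ to samples of a sine wave. The sampled function and the grid are hypothetical choices; the division by the grid spacing `h` is an added assumption, since (13) is stated for unit spacing; and `np.convolve` is used because it applies the flipped-operator (true convolution) convention.

```python
import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 201)   # hypothetical uniform grid
h = x[1] - x[0]                          # grid spacing
v = np.sin(x)                            # sampled signal with known derivative cos(x)

u = np.array([0.5, 0.0, -0.5])           # differential operator (true-convolution convention)

# Entry k of the valid convolution is 0.5 * v[k + 2] - 0.5 * v[k],
# i.e., the central difference at the interior point k + 1 for unit spacing.
dv = np.convolve(v, u, mode="valid") / h

err = np.max(np.abs(dv - np.cos(x[1:-1])))   # small, shrinks as h**2
```

Using the scaled operator `(1, 0, -1)` would simply double every entry of `dv`, which is why the scaling does not change the features that are highlighted.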

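The Sobel operator shown in Figure 4 is a $3 \times 3$ matrix of the form:

$$U = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix} \begin{bmatrix} 1 & 2 & 1 \end{bmatrix}, \tag{14}$$

where the vector $(1, 0, -1)^T$ approximates the first derivative in the vertical direction, and $(1, 2, 1)$ is a binomial operator that smooths the input matrix.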
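The factorization in (14) and the edge-highlighting effect can be reproduced with a small example. This is a sketch under stated assumptions: the test image (dark top half, bright bottom half) is hypothetical, the cross-correlation form is used, and NumPy's `sliding_window_view` is used only as a convenient way to enumerate the $3 \times 3$ windows.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

d = np.array([1.0, 0.0, -1.0])   # first difference in the vertical direction
s = np.array([1.0, 2.0, 1.0])    # binomial smoothing in the horizontal direction
U = np.outer(d, s)               # 3x3 Sobel operator

# Hypothetical test image with a horizontal edge between rows 2 and 3.
V = np.zeros((6, 6))
V[3:, :] = 1.0

# windows[x1, x2] is the 3x3 patch of V whose top-left corner is (x1, x2).
windows = sliding_window_view(V, (3, 3))
Psi = np.einsum("xyab,ab->xy", windows, U)   # valid cross-correlation, shape (4, 4)
```

The output is nonzero only in the rows whose window straddles the edge (magnitude 4 here), and zero where the image is flat.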

Another operator commonly used to detect edges in images is the Laplacian operator; this operator is an approximation of the continuous Laplacian operator $\nabla^2 = \frac{\partial^2}{\partial x_1^2} + \frac{\partial^2}{\partial x_2^2}$. The convolution of a matrix $V$ using a $3 \times 3$ Laplacian operator $U$ is:

$$(U * V)[x_1, x_2] = V[x_1 - 1, x_2] + V[x_1 + 1, x_2] + V[x_1, x_2 - 1] + V[x_1, x_2 + 1] - 4 \cdot V[x_1, x_2]. \tag{15}$$

This reveals that the convolution is an approximation of $\nabla^2 V$ that uses a 2D finite difference scheme; this scheme has an operator of the form:

$$U = \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix}. \tag{16}$$

The transformation effect of the Laplacian operator is shown in Figure 4. In the partial differential equations (PDE) literature, the non-zero structure of a finite-difference operator is known as the stencil. As expected, a wide range of finite difference approximations (and corresponding operators) can be envisioned. Importantly, since the Laplacian operator computes the second derivative of the input, this is suitable to detect locations of minimum or maximum intensity in an image (e.g., peaks). This allows us to understand the geometry of matrices (e.g., geometry of images or 2D fields). Moreover, this also reveals connections between PDEs and convolution operations.
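The equivalence between the five-point finite-difference formula and convolution with the Laplacian stencil can be checked on a field with a known Laplacian. In this sketch the quadratic test field (sampled on a unit-spacing grid) is a hypothetical choice, made because its Laplacian equals 4 everywhere.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

U = np.array([[0.0, 1.0, 0.0],
              [1.0, -4.0, 1.0],
              [0.0, 1.0, 0.0]])   # Laplacian stencil

# Hypothetical field V = x1^2 + x2^2 on a 7x7 grid with unit spacing.
x1, x2 = np.meshgrid(np.arange(7.0), np.arange(7.0), indexing="ij")
V = x1 ** 2 + x2 ** 2

# Valid convolution with the stencil (the stencil is symmetric, so
# convolution and cross-correlation coincide).
lap_stencil = np.einsum("xyab,ab->xy", sliding_window_view(V, (3, 3)), U)

# Same quantity written directly as the five-point finite-difference formula.
lap_formula = (V[:-2, 1:-1] + V[2:, 1:-1] + V[1:-1, :-2] + V[1:-1, 2:]
               - 4.0 * V[1:-1, 1:-1])

print(np.allclose(lap_stencil, lap_formula))   # True
print(np.allclose(lap_stencil, 4.0))           # True
```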

The Gaussian operator is commonly used to smooth out images. A Gaussian operator is defined by the density function:

$$U(x_1, x_2) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{x_1^2 + x_2^2}{2\sigma^2}}, \tag{17}$$

where $\sigma > 0$ is the standard deviation. The standard deviation determines the spread of the operator and is used to control the frequencies removed from a signal. We recall that the density of a Gaussian is maximum at the center point and decays exponentially (and symmetrically) when moving away from the center. Discrete representations of the Gaussian operator are obtained by manipulating $\sigma$ and truncating the density in a window. For instance, the $3 \times 3$ Gaussian operator shown in Figure 4 is:

$$U = \frac{1}{16} \begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix}. \tag{18}$$

Here, we can see that the operator is square and symmetric, has a maximum value at the center point, and the values decay rapidly as one moves away from the center.
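One way to obtain a discrete Gaussian operator is to sample the density on an integer grid, truncate it to a window, and renormalize. The sketch below follows that recipe; the function name `gaussian_operator` and the renormalization step are assumptions made for illustration. The specific value $\sigma = 1/\sqrt{2\ln 2}$ is not taken from the text; it is chosen here because it makes the density halve at distance one from the center, which reproduces the $3 \times 3$ operator with entries $1, 2, 4$ scaled by $1/16$.

```python
import numpy as np

def gaussian_operator(n, sigma):
    """Sample the 2D Gaussian density on an n x n integer grid (n odd) and renormalize."""
    half = (n - 1) // 2
    g = np.arange(-half, half + 1)
    x1, x2 = np.meshgrid(g, g, indexing="ij")
    U = np.exp(-(x1 ** 2 + x2 ** 2) / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)
    return U / U.sum()   # truncation loses mass, so rescale the weights to sum to one

sigma = 1.0 / np.sqrt(2.0 * np.log(2.0))   # assumed value (about 0.85)
U_sampled = gaussian_operator(3, sigma)

# The same operator built as an outer product of binomial weights.
U_binomial = np.outer([1, 2, 1], [1, 2, 1]) / 16.0

print(np.allclose(U_sampled, U_binomial))   # True
```

Larger values of `sigma` (with a correspondingly larger window `n`) give a flatter operator that removes more high-frequency content.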

Convolution operators can also be designed to perform pattern recognition; for instance, consider the situation in which you want to highlight areas in a matrix that have a pattern (feature) of interest. Here, the structure of the operator dictates the 0-1 pattern sought. For instance, in Figure 5 we present an input matrix with 0-1 entries and we perform a convolution with an operator with a pre-defined 0-1 structure. The convolution highlights the areas in the input matrix in which the presence of the sought pattern is strongest (or weakest). As one can imagine, a huge number of operators could be designed to extract different patterns (including smoothing and differentiation operators); moreover, in many cases it is not obvious which patterns or features might be present in the image. This indicates that one requires a systematic approach to automatically determine which features of an input signal (and associated operators) are most relevant.
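A minimal sketch of pattern recognition with a 0-1 operator is given below. Both the pattern (a plus sign) and the input matrix are hypothetical and are not the ones shown in Figure 5; the cross-correlation form is assumed, so the response at each location counts how many entries of the window match the ones of the pattern.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Hypothetical 0-1 pattern (a plus sign) used as the operator.
U = np.array([[0, 1, 0],
              [1, 1, 1],
              [0, 1, 0]])

# Hypothetical 0-1 input: the pattern is embedded with its top-left corner at
# row 1, column 3, and two isolated entries act as distractors.
V = np.zeros((7, 7), dtype=int)
V[1:4, 3:6] = U
V[5, 1] = 1
V[6, 6] = 1

# Valid cross-correlation: response of every 3x3 window to the pattern.
Psi = np.einsum("xyab,ab->xy", sliding_window_view(V, (3, 3)), U)

# Location of the strongest response (top-left corner of the best-matching window).
best = np.unravel_index(np.argmax(Psi), Psi.shape)
```

The strongest response equals the number of ones in the pattern and occurs at the window where the pattern was embedded; partially overlapping windows give weaker responses.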

## 4 CNN Architectures

CNNs are hierarchical (layered) models that perform a sequence of convolution, activation, pooling, and flattening operations to extract features from an input data object. The output of this sequence of transformation operations is a vector that summarizes the feature information of the input; this feature vector is fed to a fully-connected neural net that makes final predictions. The goal of the CNN is to determine the features of a set of input data objects (input data samples) that best predict the corresponding output samples. Specifically, the input of the CNN is a set of sample objects (each identified by a sample index), and the output of the CNN is the set of corresponding predicted labels. The goal of a CNN is to determine convolution operators giving features that best match the predicted labels to the output labels; in other words, the CNN seeks to find operators (and associated features) that best map the inputs to the outputs.

In this section we discuss the different elements of a 2D CNN architecture. To facilitate the analysis, we construct a simple CNN of the form shown in Figure 6. This architecture contains a single layer that performs convolution, activation, pooling, and flattening. Generalizing the discussion to multiple layers and higher dimensions (e.g., 3D) is rather straightforward once the basic concepts are established. Also, in the discussion that follows, we consider a single input data sample (that we denote as $V$) and discuss the different transformation operations performed on it along the CNN to obtain a final prediction (that we denote as $\hat{y}$). We then discuss how to combine multiple samples to train the CNN.

### 4.1 Convolution Block

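An input sample of the CNN is the tensor $V \in \mathbb{R}^{n_V \times n_V \times p}$. The convolution block uses the operator $U \in \mathbb{R}^{n_U \times n_U \times p \times q}$ to conduct a MIMO convolution. The output of this convolution is the tensor $\Psi \in \mathbb{R}^{n_\Psi \times n_\Psi \times q}$ with entries given by:

$$\Psi_{(j)}[x_1, x_2] = b_c[j] + \sum_{i=1}^{p} \sum_{x_1'=1}^{n_U} \sum_{x_2'=1}^{n_U} U_{(i,j)}[x_1', x_2'] \cdot V_{(i)}[x_1 + x_1' - 1, x_2 + x_2' - 1], \tag{19}$$

where $x_1, x_2 \in \{1, n_\Psi\}$, $j \in \{1, q\}$, and $n_\Psi = n_V - n_U + 1$. A bias parameter $b_c[j] \in \mathbb{R}$ is added after the convolution operation; the bias helps adjust the magnitude of the convolved signal. To enable compact notation, we define the convolution block using the mapping $\Psi = f_c(V; U, b_c)$; here, we note that the mapping depends on the parameters $U$ (the operators) and $b_c$ (the bias). An example of a convolution block with a 3-channel input and 2-channel operator is shown in Figure 7.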
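The convolution block (MIMO convolution plus bias) can be sketched compactly with a window view and a tensor contraction. The function name `conv_block`, the array layouts `(n_V, n_V, p)` for the input and `(n_U, n_U, p, q)` for the operator, and the random example data are assumptions made for illustration; indices are 0-based in the code.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_block(V, U, b_c):
    """MIMO convolution with bias (valid, cross-correlation form).

    V: input (n_V, n_V, p); U: operators (n_U, n_U, p, q); b_c: bias (q,).
    Returns Psi with shape (n_V - n_U + 1, n_V - n_U + 1, q).
    """
    n_U = U.shape[0]
    # windows[x1, x2, i, a, b] = V[x1 + a, x2 + b, i]
    windows = sliding_window_view(V, (n_U, n_U), axis=(0, 1))
    # sum over input channels i and window offsets (a, b); one output channel per j
    return np.einsum("xyiab,abij->xyj", windows, U) + b_c

rng = np.random.default_rng(0)
V = rng.normal(size=(6, 6, 3))      # hypothetical 3-channel input
U = rng.normal(size=(3, 3, 3, 2))   # hypothetical operators: p = 3, q = 2
b_c = np.array([0.1, -0.2])         # hypothetical bias, one entry per output channel

Psi = conv_block(V, U, b_c)         # shape (4, 4, 2)
```

The bias is broadcast over the two spatial dimensions, so every entry of output channel `j` is shifted by the same amount `b_c[j]`.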

### 4.2 Activation Block

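The convolution block outputs the signal $\Psi$; this is passed through an activation block given by the mapping $h_c$ with $A = h_c(\Psi)$. Here, we define an activation mapping of the form $h_c: \mathbb{R}^{n_\Psi \times n_\Psi \times q} \to \mathbb{R}^{n_\Psi \times n_\Psi \times q}$ (the activated signal $A$ has the same dimensions as $\Psi$). The activation of $\Psi$ is conducted element-wise as:

$$A_{(i)}[x_1, x_2] = \alpha\left(\Psi_{(i)}[x_1, x_2]\right), \tag{20}$$

where $\alpha: \mathbb{R} \to \mathbb{R}$ is an activation function (a scalar function). Typical activation functions include the sigmoid, hyperbolic tangent (tanh), and Rectified Linear Unit (ReLU):

$$\begin{aligned} \alpha_{\text{sig}}(z) &= \frac{1}{1 + e^{-z}} \\ \alpha_{\tanh}(z) &= \tanh(z) \\ \alpha_{\text{ReLU}}(z) &= \max(0, z). \end{aligned} \tag{21}$$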

Figures 8 and 9 illustrate the transformation induced by the activation functions. These functions act as basis functions that, when combined in the CNN, enable capturing nonlinear behavior. A common problem with sigmoid and tanh functions is that they exhibit the so-called vanishing gradient effect [9]. Specifically, when the input values are large or small in magnitude, the gradient of the sigmoid and tanh functions is small (flat at both ends) and this makes the activation output insensitive to changes in the input. Furthermore, the sigmoid and tanh functions are sensitive to changes in the input only when the output is close to 1/2 and zero, respectively. The ReLU function is commonly used to avoid vanishing gradient effects and to increase sensitivity [23]. This activation function outputs a value of zero when the input is less than or equal to zero, but outputs the input value itself when the input is greater than zero. The function is linear when the input is greater than zero, which makes the CNN easier to optimize with gradient-based methods [9]. However, ReLU is not differentiable when the input is zero; in practice, CNN implementations assume that the gradient is zero when the input is zero.
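The vanishing gradient effect can be seen directly by evaluating the derivatives of the three activation functions at a few points. The sketch below uses the standard closed-form derivatives; the evaluation points are hypothetical, and the ReLU gradient at zero is set to zero following the convention described above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)              # maximum value 1/4 at z = 0

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2      # maximum value 1 at z = 0

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return (z > 0).astype(float)      # gradient taken as zero at z = 0

z = np.array([-10.0, 0.0, 10.0])      # hypothetical inputs: large negative, zero, large positive

g_sig = d_sigmoid(z)    # nearly zero at both ends, 0.25 at the center
g_tanh = d_tanh(z)      # nearly zero at both ends, 1 at the center
g_relu = d_relu(z)      # 0, 0, 1: does not decay for large positive inputs
```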

### 4.3 Pooling Block

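The pooling block is a transformation that reduces the size of the output obtained by convolution and subsequent activation. This block also seeks to make the representation approximately invariant to small translations of the inputs [9]. Max-pooling and average-pooling are the most common pooling operations (here we focus on max-pooling). The pooling operation can be expressed as a mapping $f_p$ and delivers a tensor $P = f_p(A)$ with $P \in \mathbb{R}^{n_P \times n_P \times q}$. To simplify the discussion, we take the pooling window to have the same size $n_U \times n_U$ as the convolution operator; the max-pooling operation with $i \in \{1, q\}$ is defined as:

$$P_{(i)}[x_1, x_2] = \max\left\{ A_{(i)}\left[(x_1 - 1)\, n_U + n_1',\; (x_2 - 1)\, n_U + n_2'\right] \;:\; n_1' \in \{1, n_U\},\; n_2' \in \{1, n_U\} \right\}, \tag{22}$$

where $x_1, x_2 \in \{1, n_P\}$ and $n_P = \lfloor n_A / n_U \rfloor$ (here $n_A \times n_A$ is the size of each channel $A_{(i)}$). In Figure 10, we illustrate the max-pooling and average-pooling operations on a 1-channel input.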
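Pooling over non-overlapping windows can be implemented by reshaping each channel into blocks and reducing over the within-block axes. The sketch below is an illustration only; the function name `pool`, the `n_pool` and `reduce` arguments, and the example input are hypothetical, and incomplete windows at the boundary are simply dropped.

```python
import numpy as np

def pool(A, n_pool, reduce=np.max):
    """Pool a tensor A of shape (n, n, q) over non-overlapping n_pool x n_pool windows.

    `reduce` is np.max for max-pooling or np.mean for average-pooling.
    Returns a tensor of shape (n // n_pool, n // n_pool, q).
    """
    n, _, q = A.shape
    n_P = n // n_pool
    A = A[: n_P * n_pool, : n_P * n_pool, :]          # drop incomplete windows
    # blocks[x1, a, x2, b, i] = A[x1 * n_pool + a, x2 * n_pool + b, i]
    blocks = A.reshape(n_P, n_pool, n_P, n_pool, q)
    return reduce(blocks, axis=(1, 3))

A = np.arange(16.0).reshape(4, 4, 1)   # hypothetical 1-channel 4x4 input

P_max = pool(A, 2)                     # max-pooling with a 2x2 window
P_avg = pool(A, 2, reduce=np.mean)     # average-pooling with a 2x2 window

print(P_max[:, :, 0])   # rows: [5, 7] and [13, 15]
print(P_avg[:, :, 0])   # rows: [2.5, 4.5] and [10.5, 12.5]
```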

Operations of convolution, activation, and pooling constitute a convolution unit and deliver an output object $P$. This object can be fed into another convolution unit to obtain a new output object and this recursion can be performed over multiple units. The recursion has the effect of highlighting different features of the input data object (e.g., that capture local and global patterns). In our discussion we consider a single convolution unit.

### 4.4 Flattening Block

The convolution, activation, and pooling blocks deliver a tensor that is flattened to a vector ; this vector is fed into a fully-connected (dense) block that performs the final prediction. The vector is typically known as the feature vector. The flattening block is represented as the mapping with that outputs . Note that this block is simply a vectorization of a tensor.

### 4.5 Prediction (Dense) Block

The feature vector is input into a prediction block that delivers the final prediction . This block is typically a fully-connected (dense) neural net with multiple weighting and activation units (here we only consider a single unit). The prediction block first mixes the elements of the feature vector as:

$$d = w^{T} v + b_d, \tag{23}$$

where $w$ is the weighting (parameter) vector and $b_d$ is the bias parameter. The scalar $d$ is normally known as the evidence. We denote the weighting operation using the mapping $f_d$ and thus $d = f_d(v; w, b_d)$. In Figure 11, we illustrate this weighting operation. An activation function $h_d$ (e.g., a sigmoidal or ReLU function) is applied to the evidence to obtain the final prediction $\hat{y} = h_d(d)$. We thus have that the prediction block has the form $\hat{y} = h_d(f_d(v; w, b_d))$. One can build a multi-layer fully-connected network by using a recursion of weighting and activation steps (here we consider one layer to simplify the exposition). Here, we assume that the CNN delivers a single (scalar) output $\hat{y}$. In practice, however, a CNN can also predict multiple outputs; in such a case, the weight parameter becomes a matrix and the bias is a vector. We also highlight that the predicted output can be an integer value (to perform classification tasks) or a continuous value (to perform regression tasks).
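The prediction block can be sketched as follows. This is a minimal NumPy illustration that assumes a sigmoid activation; the names `f_d` and `prediction_block` mirror the mappings in the text, while the matrix `W` and vector `b` used for the multi-output case are hypothetical names introduced only for this example.

```python
import numpy as np

def sigmoid(d):
    return 1.0 / (1.0 + np.exp(-d))

def f_d(v, w, b_d):
    # Weighting step of (23): evidence d = w^T v + b_d.
    return w @ v + b_d

def prediction_block(v, w, b_d, h_d=sigmoid):
    # Activation applied to the evidence gives the prediction.
    return h_d(f_d(v, w, b_d))

v = np.array([1.0, 2.0, 3.0])      # feature vector
w = np.array([0.5, -0.25, 0.0])    # weighting vector
d = f_d(v, w, 0.0)                 # 0.5 - 0.5 + 0.0 = 0.0
y_hat = prediction_block(v, w, 0.0)
print(d, y_hat)                    # 0.0 0.5

# Multiple outputs: the weight becomes a matrix and the bias a vector.
W = np.array([[0.5, -0.25, 0.0],
              [0.0, 0.0, 1.0]])
b = np.array([0.0, -3.0])
y_hat_multi = sigmoid(W @ v + b)
print(y_hat_multi)                 # [0.5 0.5]
```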

## 5 CNN Training

CNN training aims to determine the parameters (operators and biases) that best map the input data to the output data. The fundamental operations in CNN training are forward and backward propagation, which are used by an optimization algorithm to improve the parameters. In forward propagation, the input data is propagated forward through the CNN (block by block and layer by layer) to make a prediction and evaluate a loss function that captures the mismatch between the predicted and true output. In backward propagation, one computes the gradient of the loss function with respect to each parameter via a recursive application of the chain rule of differentiation (block by block and layer by layer). This recursion starts in the prediction block and proceeds backwards to the convolution block. Notably, we will see that the derivatives of the transformation blocks have explicit (analytical) representations.

### 5.1 Forward Propagation

To explain the process of forward propagation, we consider the CNN architecture shown in Figure 6 with an input and an output (prediction) . All parameters , , and are flattened and concatenated to form a parameter vector , where . We can express the entire sequence of operations carried out in the CNN as a composite mapping and thus . We call this mapping the forward propagation mapping (shown in Figure 6) and note that this can be decomposed sequentially (via lifting) as:

$$\begin{aligned}
\hat{y} &= F(V;\theta) \\
&= F(V; U, b_c, w, b_d) \\
&= h_d(f_d(f_f(f_p(h_c(f_c(V; U, b_c)))); w, b_d)).
\end{aligned} \tag{24}$$

This reveals the nested nature and the order of the CNN transformations on the input data. Specifically, we see that one conducts convolution, activation, pooling, flattening, and prediction.
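The nested structure of (24) can be made concrete with a small NumPy sketch. Several choices below are assumptions made only for illustration: a single-channel input and a single square operator, a valid sliding-window form of the convolution, ReLU for `h_c`, sigmoid for `h_d`, and a non-overlapping pooling window of hypothetical size `n_p`.

```python
import numpy as np

def f_c(V, U, b_c):
    """Valid sliding-window convolution of V with a square operator U, plus bias."""
    n_U = U.shape[0]
    Psi = np.empty((V.shape[0] - n_U + 1, V.shape[1] - n_U + 1))
    for x1 in range(Psi.shape[0]):
        for x2 in range(Psi.shape[1]):
            Psi[x1, x2] = np.sum(U * V[x1:x1 + n_U, x2:x2 + n_U]) + b_c
    return Psi

def h_c(Psi):
    return np.maximum(Psi, 0.0)              # ReLU activation

def f_p(A, n_p):
    r, c = A.shape[0] // n_p, A.shape[1] // n_p
    return A[:r * n_p, :c * n_p].reshape(r, n_p, c, n_p).max(axis=(1, 3))

def f_f(P):
    return P.reshape(-1)                     # flattening (vectorization)

def f_d(v, w, b_d):
    return w @ v + b_d                       # evidence

def h_d(d):
    return 1.0 / (1.0 + np.exp(-d))          # sigmoid activation

def F(V, U, b_c, w, b_d, n_p=2):
    # Same nesting as (24): convolution, activation, pooling, flattening, prediction.
    return h_d(f_d(f_f(f_p(h_c(f_c(V, U, b_c)), n_p)), w, b_d))

rng = np.random.default_rng(0)
V = rng.normal(size=(6, 6))   # input: 6 x 6
U = rng.normal(size=(3, 3))   # operator: 3 x 3, so Psi is 4 x 4 and P is 2 x 2
w = rng.normal(size=4)        # one weight per entry of the flattened P
y_hat = F(V, U, 0.1, w, -0.2)
print(y_hat)
```

Each function corresponds to one block, so replacing a block (for example, average-pooling in place of max-pooling) only changes one function in the composition.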

To perform training, we collect a set of input samples and corresponding output labels with sample index . Our aim is to compare the true output labels to the CNN predictions . To do so, we define the loss function that we aim to minimize. For classification tasks, the output is binary and a common loss function is the cross-entropy [14]:

$$L(\hat{y}) = -\frac{1}{n}\sum_{i=1}^{n}\left[\, y[i]\log(\hat{y}[i]) + (1-y[i])\log(1-\hat{y}[i]) \,\right]. \tag{25}$$

For regression tasks, we have that the output is continuous and a common loss function is the mean squared error:

$$L(\hat{y}) = \frac{1}{n}\sum_{i=1}^{n}\left(y[i]-\hat{y}[i]\right)^2. \tag{26}$$

In compact form, we can write the loss function as . Here, contains all the input samples for . The loss function can be decomposed into individual samples and thus we can express it as:

$$L(\theta) = \frac{1}{n}\sum_{i=1}^{n} L^{(i)}(\theta), \tag{27}$$

where $L^{(i)}(\theta)$ is the contribution associated with the $i$-th data sample.
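The loss functions (25) and (26) can be sketched in NumPy as shown below. The clipping constant `eps` is an assumption added to keep the logarithms finite when a prediction is exactly zero or one; it is not part of the definitions in the text.

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    # Binary cross-entropy (25); eps is an assumed safeguard against log(0).
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

def mean_squared_error(y, y_hat):
    # Mean squared error (26).
    return np.mean((y - y_hat) ** 2)

# Classification: two samples predicted with probability 0.5 give a loss of log(2).
print(cross_entropy(np.array([1.0, 0.0]), np.array([0.5, 0.5])))

# Regression: the loss is the average of the per-sample contributions.
y = np.array([1.0, 2.0])
y_hat = np.array([1.0, 4.0])
per_sample = (y - y_hat) ** 2                          # [0. 4.]
print(mean_squared_error(y, y_hat), per_sample.mean()) # 2.0 2.0
```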

### 5.2 Optimization Algorithms

The parameters of the CNN are obtained by finding a solution to the optimization problem:

$$\min_{\theta}\; L(\theta). \tag{28}$$

The loss function embeds the forward propagation mapping (which is a highly nonlinear mapping); as such, the loss function might contain many local minima (and such minima tend to be flat). We recall that the appearance of flat minima is often a manifestation of model overparameterization; this is due to the fact that the parameter vector can contain hundreds of thousands to millions of values. As such, the training of CNNs needs to be equipped with appropriate validation procedures (to ensure that the model generalizes well). Most optimization algorithms used for CNNs are gradient-based (first-order); this is because second-order optimization methods (e.g., Newton-based) involve calculating the Hessian of the loss function with respect to the parameter vector $\theta$, which is computationally expensive (or intractable). Quasi-Newton methods (e.g., limited-memory BFGS) are also often used to train CNNs.

Gradient descent (GD) is the most commonly used algorithm to train CNNs. This algorithm updates the parameters as:

$$\theta \leftarrow \theta - \eta\cdot\Delta\theta, \tag{29}$$

where $\eta$ is the learning rate (step length) and $\Delta\theta$ is the step direction. In GD, the step direction is set to $\Delta\theta = \nabla_\theta L(\theta)$ (the gradient of the loss). Since the loss function is a sum over samples, the gradient can be decomposed as:

$$\nabla_\theta L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta L^{(i)}(\theta). \tag{30}$$

Although GD is easy to compute and implement, $\theta$ is not updated until the gradient over the entire dataset is calculated (and the dataset can contain millions of samples). In other words, the parameters are not updated until all the gradient components are available. This leads to load-imbalancing issues and limits the parallel scalability of the algorithm.

Stochastic gradient descent (SGD) is a variant of GD that updates $\theta$ more frequently (using gradients from only part of the samples). Specifically, $\theta$ changes after calculating the loss for each sample. SGD updates the parameters by using the step $\Delta\theta = \nabla_\theta L^{(i)}(\theta)$, where sample $i$ is selected at random. Since $\theta$ is updated frequently, SGD requires less memory and has better parallel scalability [17]. Moreover, it has been shown empirically that this algorithm has the tendency to escape local minima (it explores the parameter space better). However, if the training samples have high variance, SGD will converge slowly.

Mini-batch GD is an enhancement of SGD; here, the entire dataset is divided into batches and $\theta$ is updated after calculating the gradient of each batch. This results in a step direction of the form:

$$\Delta\theta = \frac{1}{b}\sum_{j\in B^{(i)}} \nabla_\theta L^{(j)}(\theta), \tag{31}$$

where $B^{(i)}$ is the set of samples corresponding to batch $i$ and the entries of the batches are selected at random. Mini-batch GD updates the model parameters frequently and has faster convergence in the presence of high variance [37].
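The three update rules differ only in how many samples contribute to each step, which the following sketch makes explicit. As a stand-in for a CNN, it assumes a linear model with per-sample loss $L^{(i)}(\theta) = \tfrac{1}{2}(x_i^T\theta - y_i)^2$ and noise-free data; the function names, learning rates, and the use of a fresh random permutation of the samples in every epoch are illustrative assumptions.

```python
import numpy as np

def sample_grads(theta, X, y):
    """Per-sample gradients of 0.5 * (x_i . theta - y_i)^2, one row per sample."""
    residual = X @ theta - y
    return residual[:, None] * X

def train(X, y, eta, epochs, batch_size, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(n)                 # random assignment to batches
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            # Step direction: average gradient over the batch, as in (31).
            step = sample_grads(theta, X[batch], y[batch]).mean(axis=0)
            theta = theta - eta * step             # update (29)
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true                                 # noise-free outputs

theta_gd = train(X, y, eta=0.1, epochs=200, batch_size=200)   # GD: one update per epoch
theta_sgd = train(X, y, eta=0.05, epochs=50, batch_size=1)    # SGD: one update per sample
theta_mb = train(X, y, eta=0.1, epochs=100, batch_size=20)    # mini-batch GD
print(theta_gd, theta_sgd, theta_mb)
```

Setting `batch_size` to the dataset size recovers GD and setting it to one recovers SGD, so the batch size is the single knob that trades update frequency against gradient variance.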

### 5.3 Backward Propagation

Backward propagation (backpropagation) seeks to compute elements of the gradient of the loss $\nabla_\theta L(\theta)$ by recursive use of the chain rule. This approach exploits the nested nature of the forward propagation mapping:

$$\hat{y} = h_d(f_d(f_f(f_p(h_c(f_c(V; U, b_c)))); w, b_d)). \tag{32}$$

This mapping can be expressed in backward form as:

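\begin{align}
\hat{y} &= h_d(d) \tag{33a} \\
d &= f_d(v; w, b_d) \tag{33b} \\
v &= f_f(P) \tag{33c} \\
P &= f_p(A) \tag{33d} \\
A &= h_c(\Psi) \tag{33e} \\
\Psi &= f_c(V; U, b_c) \tag{33f}
\end{align}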

An illustration of this backward sequence is presented in Figure 6. Exploiting this structure is essential to enable scalability (particularly for saving memory). To facilitate the explanation, we consider the squared error loss defined for a single sample (generalization to multiple samples is trivial due to the additive nature of the loss):

$$L(\theta) = \frac{1}{2}(\hat{y}-y)^2. \tag{34}$$

Our goal is to compute the gradient (the parameter step direction) ; here, we have that parameter .

#### Prediction Block Update.

The step direction for the parameters (the gradient) in the prediction block is defined as and its entries are given by:

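$$\begin{aligned}
\Delta w[i] &= \frac{\partial L}{\partial w[i]} \\
&= \frac{\partial L}{\partial \hat{y}}\,\frac{\partial \hat{y}}{\partial w[i]} \\
&= (\hat{y}-y)\cdot\frac{\partial \alpha(d)}{\partial d}\,\frac{\partial d}{\partial w[i]} \\
&= (\hat{y}-y)\cdot\hat{y}\cdot(1-\hat{y})\cdot\frac{\partial\left(\sum_{k=1}^{n_v} w[k]\cdot v[k] + b_d\right)}{\partial w[i]} \\
&= (\hat{y}-y)\cdot\hat{y}\cdot(1-\hat{y})\cdot v[i],
\end{aligned} \tag{35}$$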

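with $i = 1,\dots,n_v$. If we define $\Delta\hat{y} = (\hat{y}-y)\cdot\hat{y}\cdot(1-\hat{y})$, then we can establish that the update has the simple form $\Delta w[i] = \Delta\hat{y}\cdot v[i]$ and thus: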

$$\Delta w = \Delta\hat{y}\cdot v. \tag{36}$$

The gradient for the bias is given by:

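$$\begin{aligned}
\Delta b_d &= \frac{\partial L}{\partial b_d} \\
&= \frac{\partial L}{\partial \hat{y}}\,\frac{\partial \hat{y}}{\partial b_d} \\
&= (\hat{y}-y)\cdot\hat{y}\cdot(1-\hat{y})\cdot\frac{\partial\left(\sum_{k=1}^{n_v} w[k]\cdot v[k] + b_d\right)}{\partial b_d} \\
&= (\hat{y}-y)\cdot\hat{y}\cdot(1-\hat{y}),
\end{aligned} \tag{37}$$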

and thus:

$$\Delta b_d = \Delta\hat{y}. \tag{38}$$
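The analytical expressions (36) and (38) can be verified against finite differences. The sketch below assumes a sigmoid activation (consistent with the factor $\hat{y}\cdot(1-\hat{y})$ in the derivation) and the single-sample squared error loss (34); the helper `finite_difference` and the numerical values are illustrative.

```python
import numpy as np

def sigmoid(d):
    return 1.0 / (1.0 + np.exp(-d))

def loss(w, b_d, v, y):
    # Squared error loss (34) for a single sample.
    y_hat = sigmoid(w @ v + b_d)
    return 0.5 * (y_hat - y) ** 2

def prediction_block_grads(w, b_d, v, y):
    y_hat = sigmoid(w @ v + b_d)
    delta_y_hat = (y_hat - y) * y_hat * (1.0 - y_hat)
    delta_w = delta_y_hat * v      # (36)
    delta_b_d = delta_y_hat        # (38)
    return delta_w, delta_b_d

def finite_difference(fun, x, h=1e-6):
    # Central differences, one coordinate at a time.
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = h
        g[k] = (fun(x + e) - fun(x - e)) / (2.0 * h)
    return g

v = np.array([0.3, -1.2, 0.8])
w = np.array([0.5, 0.1, -0.4])
b_d, y = 0.2, 1.0

delta_w, delta_b_d = prediction_block_grads(w, b_d, v, y)
fd_w = finite_difference(lambda w_: loss(w_, b_d, v, y), w)
fd_b = finite_difference(lambda b_: loss(w, b_[0], v, y), np.array([b_d]))[0]
print(np.max(np.abs(delta_w - fd_w)), abs(delta_b_d - fd_b))
```

Both differences should be close to zero, which confirms that the chain-rule expressions match the numerical derivatives of the loss.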

#### Convolutional Block.

An important feature of CNNs is that some of their parameters (the convolution operators) are tensors, and we thus need to compute gradients with respect to tensors. While this sounds complicated, the gradient of the convolution operator has a remarkably intuitive structure (provided by the grid nature of the data and the structure of the convolution operation). To see this, we begin the recursion at:

$$v = f_f(f_p(h_c(f_c(V; U, b_c)))). \tag{39}$$

Before computing , we first obtain the gradient for the feature vector as:

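$$\begin{aligned}
\Delta v[i] &= \frac{\partial L}{\partial v[i]} \\
&= \frac{\partial L}{\partial \hat{y}}\cdot\frac{\partial \hat{y}}{\partial v[i]} \\
&= (\hat{y}-y)\cdot\hat{y}\cdot(1-\hat{y})\cdot\frac{\partial\left(\sum_{k=1}^{n_v} w[k]\cdot v[k] + b_d\right)}{\partial v[i]} \\
&= (\hat{y}-y)\cdot\hat{y}\cdot(1-\hat{y})\cdot w[i] \\
&= \Delta\hat{y}\cdot w[i],
\end{aligned} \tag{40}$$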

with and thus:

$$\Delta v = \Delta\hat{y}\cdot w. \tag{41}$$

If we define the inverse mapping of the flattening operation as $f_f^{-1}$, we can express the update of the pooling block as:

$$\Delta P = f_f^{-1}(\Delta v). \tag{42}$$

The gradient from the pooling block is passed to the activation block via up-sampling. If we have applied max-pooling with a pooling operator, then:

 (43)

where calculates the remainder obtained after dividing by .
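The passage of the gradient from the feature vector back to the activation block can be sketched as follows. This is a hedged illustration for a single channel: `unflatten` plays the role of the inverse flattening in (42), and `maxpool_backward` assumes the common convention that each entry of the pooled gradient is routed to the position that attained the maximum in its window (all other positions receive zero, and ties go to the first maximum). The function names and the window size `n_p` are hypothetical.

```python
import numpy as np

def unflatten(delta_v, shape):
    # Inverse of the flattening operation: reshape the vector back into the pooled grid.
    return delta_v.reshape(shape)

def maxpool_backward(A, delta_P, n_p):
    """Up-sample delta_P to the shape of A, assuming non-overlapping n_p x n_p windows."""
    delta_A = np.zeros_like(A, dtype=float)
    for x1 in range(delta_P.shape[0]):
        for x2 in range(delta_P.shape[1]):
            window = A[x1 * n_p:(x1 + 1) * n_p, x2 * n_p:(x2 + 1) * n_p]
            # Location of the maximum inside the window (first one if there are ties).
            k1, k2 = np.unravel_index(np.argmax(window), window.shape)
            delta_A[x1 * n_p + k1, x2 * n_p + k2] = delta_P[x1, x2]
    return delta_A

A = np.arange(16, dtype=float).reshape(4, 4)   # activation output
delta_v = np.array([1.0, 2.0, 3.0, 4.0])       # gradient of the feature vector
delta_P = unflatten(delta_v, (2, 2))
delta_A = maxpool_backward(A, delta_P, n_p=2)
print(delta_A)
```

In this example the maximum of every 2 x 2 window sits in its bottom-right corner, so the four gradient values land at positions (1, 1), (1, 3), (3, 1), and (3, 3) in zero-based indexing.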

The gradient with respect to the convolution operator has entries of the form:

 ΔU(i,j)[x1,x2] =∂L∂U(i,j)[x1,x2] (44) =nΨ∑x′1=1nΨ∑x′2=1∂L∂A(j)[x′1,x′2]⋅∂A(j)[x′1,x′2]∂Ψ(j)[x′1,x′2]⋅∂Ψ(j)[x′1,x′2]∂U(i,j)[x1,x2] =nΨ∑x′1=1nΨ∑