A Selective Overview of Deep Learning

04/10/2019 ∙ by Jianqing Fan, et al. ∙ 0

Deep learning has arguably achieved tremendous success in recent years. In simple words, deep learning uses the composition of many nonlinear functions to model the complex dependency between input features and labels. While neural networks have a long history, recent advances have greatly improved their performance in computer vision, natural language processing, etc. From the statistical and scientific perspective, it is natural to ask: What is deep learning? What are the new characteristics of deep learning, compared with classical methods? What are the theoretical foundations of deep learning? To answer these questions, we introduce common neural network models (e.g., convolutional neural nets, recurrent neural nets, generative adversarial nets) and training techniques (e.g., stochastic gradient descent, dropout, batch normalization) from a statistical point of view. Along the way, we highlight new characteristics of deep learning (including depth and over-parametrization) and explain their practical and theoretical benefits. We also sample recent results on theories of deep learning, many of which are only suggestive. While a complete understanding of deep learning remains elusive, we hope that our perspectives and discussions serve as a stimulus for new statistical research.




1 Introduction

Modern machine learning and statistics deal with the problem of learning from data: given a training dataset $\{(y_i, x_i)\}_{1 \le i \le n}$, where $x_i \in \mathbb{R}^d$ is the input and $y_i \in \mathbb{R}$ is the output, one seeks a function $f: \mathbb{R}^d \to \mathbb{R}$ from a certain function class $\mathcal{F}$ that has good prediction performance on test data. (When the label $y$ is given, this problem is often known as supervised learning. We mainly focus on this paradigm throughout this paper and remark sparingly on its counterpart, unsupervised learning, where $y$ is not given.) This problem is of fundamental significance and finds applications in numerous scenarios. For instance, in image recognition, the input $x$ (resp. the output $y$) corresponds to the raw image (resp. its category) and the goal is to find a mapping $f(\cdot)$ that can classify future images accurately. Decades of research efforts in statistical machine learning have been devoted to developing methods to find $f(\cdot)$ efficiently with provable guarantees. Prominent examples include linear classifiers (e.g., linear/logistic regression, linear discriminant analysis), kernel methods (e.g., support vector machines), tree-based methods (e.g., decision trees, random forests), nonparametric regression (e.g., nearest neighbors, local kernel smoothing), etc. Roughly speaking, each aforementioned method corresponds to a different function class $\mathcal{F}$ from which the final classifier is chosen.

Deep learning [70], in its simplest form, proposes the following compositional function class:

$$\left\{ f(x; \theta) = W_L \sigma_L\big(W_{L-1} \cdots \sigma_2(W_2 \sigma_1(W_1 x))\big) \;\middle|\; \theta = \{W_1, \ldots, W_L\} \right\}. \tag{1}$$

Here, for each $1 \le l \le L$, $\sigma_l(\cdot)$ is some nonlinear function, and $\theta = \{W_1, \ldots, W_L\}$ consists of matrices with appropriate sizes. Though simple, deep learning has made significant progress towards addressing the problem of learning from data over the past decade. Specifically, it has performed close to or better than humans in various important tasks in artificial intelligence, including image recognition [50], game playing [114], and machine translation [132]. Owing to its great promise, the impact of deep learning is also growing rapidly in areas beyond artificial intelligence; examples include statistics [15, 111, 76, 104, 41], applied mathematics [130, 22], clinical research [28], etc.

Model Year # Layers # Params Top-5 error
AlexNet M
GoogleNet M
ResNet- M
Table 1: Winning models for ILSVRC image classification challenge.

To get a better idea of the success of deep learning, let us take the ImageNet Challenge [107] (also known as ILSVRC) as an example. In the classification task, one is given a training dataset consisting of 1.2 million color images with 1000 categories, and the goal is to classify images based on the input pixels. The performance of a classifier is then evaluated on a test dataset of 100 thousand images, and in the end the top-5 error is reported (the algorithm makes an error if the true label is not contained in the 5 predictions made by the algorithm). Table 1 highlights a few popular models and their corresponding performance. As can be seen, deep learning models (the second to the last rows) have a clear edge over shallow models (the first row) that fit linear models/tree-based models on handcrafted features. This significant improvement raises a foundational question:
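The top-5 error described above can be sketched in a few lines of NumPy. This is an illustrative helper only: the name `top_k_error` and the layout of `scores` (one row of class scores per image) are assumptions, not part of the ILSVRC tooling.

```python
import numpy as np

def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is not among the k highest scores."""
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    # Indices of the k largest scores in each row (order inside the top k is irrelevant).
    top_k = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    # An example is a "hit" if its true label appears among its k predictions.
    hit = (top_k == labels[:, None]).any(axis=1)
    return 1.0 - hit.mean()
```

With `k=1` the same function returns the usual misclassification rate.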

  • Why is deep learning better than classical methods on tasks like image recognition?

Figure 1: Visualization of trained filters in the first layer of AlexNet. The model is pre-trained on ImageNet and is downloadable via the PyTorch package torchvision.models. Each filter contains $11 \times 11 \times 3$ parameters and is shown as an RGB color map of size $11 \times 11$.

1.1 Intriguing new characteristics of deep learning

It is widely acknowledged that two indispensable factors contribute to the success of deep learning, namely (1) huge datasets that often contain millions of samples and (2) immense computing power resulting from clusters of graphics processing units (GPUs). Admittedly, these resources have only recently become available: the latter makes it possible to train larger neural networks, which reduces bias, and the former enables variance reduction. However, these two alone are not sufficient to explain the mystery of deep learning due to some of its “dreadful” characteristics: (1) over-parametrization: the number of parameters in state-of-the-art deep learning models is often much larger than the sample size (see Table 1), which gives them the potential to overfit the training data, and (2) nonconvexity: even with the help of GPUs, training deep learning models is still NP-hard [8] in the worst case due to the highly nonconvex loss function to minimize. In reality, these characteristics are far from nightmares. This sharp difference motivates us to take a closer look at the salient features of deep learning, a few of which we single out below.

1.1.1 Depth

Deep learning expresses complicated nonlinearity through composing many nonlinear functions; see (1). The rationale for this multilayer structure is that, in many real-world datasets such as images, there are different levels of features and lower-level features are building blocks of higher-level ones. See [134] for a visualization of trained features of convolutional neural nets; here in Figure 1, we sample and visualize weights from a pre-trained AlexNet model. This intuition is also supported by empirical results from physiology and neuroscience [56, 2]. The use of function composition marks a sharp difference from traditional statistical methods such as projection pursuit models [38] and multi-index models [73, 27]. It is often observed that depth helps efficiently extract features that are representative of a dataset. In comparison, increasing width (e.g., number of basis functions) in a shallow model leads to less improvement. This suggests that deep learning models excel at representing a very different function space that is suitable for complex datasets.

1.1.2 Algorithmic regularization

The statistical performance of neural networks (e.g., test accuracy) depends heavily on the particular optimization algorithms used for training [131]. This is very different from many classical statistical problems, where the related optimization problems are less complicated. For instance, when the associated optimization problem has a relatively simple structure (e.g., convex objective functions, linear constraints), the solution to the optimization problem can often be unambiguously computed and analyzed. However, in deep neural networks, due to over-parametrization, there are usually many local minima with different statistical performance [72]. Nevertheless, common practice runs stochastic gradient descent with random initialization and finds model parameters with very good prediction accuracy.

1.1.3 Implicit prior learning

It is well observed that deep neural networks trained with only the raw inputs (e.g., pixels of images) can provide a useful representation of the data. This means that after training, the units of deep neural networks can represent features such as edges, corners, wheels, eyes, etc.; see [134]. Importantly, the training process is automatic in the sense that no human knowledge is involved (other than hyper-parameter tuning). This is very different from traditional methods, where algorithms are designed after structural assumptions are posited. It is likely that training an over-parametrized model efficiently learns and incorporates the prior distribution $p(x)$ of the input, even though deep learning models are themselves discriminative models. With automatic representation of the prior distribution, deep learning typically performs well on similar datasets (but not very different ones) via transfer learning.

(a) MNIST images (b) training and test accuracies
Figure 2: (a) shows the images in the public dataset MNIST; and (b) depicts the training and test accuracies along the training dynamics. Note that the training accuracy is approaching 100% and the test accuracy is still high (no overfitting).

1.2 Towards theory of deep learning

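Despite the empirical success, theoretical support for deep learning is still in its infancy. Setting the stage, for any classifier $f$, denote by $\mathcal{E}(f)$ the expected risk on a fresh sample (a.k.a. test error, prediction error or generalization error), and by $\mathcal{E}_n(f)$ the empirical risk/training error averaged over a training dataset. Arguably, the key theoretical question in deep learning is

why is $\mathcal{E}(\hat f_n)$ small, where $\hat f_n$ is the classifier returned by the training algorithm?

We follow the conventional approximation-estimation decomposition (sometimes, also bias-variance tradeoff) to decompose the term $\mathcal{E}(\hat f_n)$ into two parts. Let $\mathcal{F}$ be the function space expressible by a family of neural nets. Define $f^* = \mathrm{argmin}_f \, \mathcal{E}(f)$ to be the best possible classifier and $f^*_{\mathcal{F}} = \mathrm{argmin}_{f \in \mathcal{F}} \, \mathcal{E}(f)$ to be the best classifier in $\mathcal{F}$. Then, we can decompose the excess error $\mathcal{E}(\hat f_n) - \mathcal{E}(f^*)$ into two parts:

$$\mathcal{E}(\hat f_n) - \mathcal{E}(f^*) = \underbrace{\mathcal{E}(f^*_{\mathcal{F}}) - \mathcal{E}(f^*)}_{\text{approximation error}} + \underbrace{\mathcal{E}(\hat f_n) - \mathcal{E}(f^*_{\mathcal{F}})}_{\text{estimation error}}.$$

Both errors can be small for deep learning (cf. Figure 2), which we explain below.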

  • The approximation error is determined by the function class $\mathcal{F}$. Intuitively, the larger the class, the smaller the approximation error. Deep learning models use many layers of nonlinear functions (Figure 3) that can drive this error small. Indeed, in Section 5, we provide recent theoretical progress on their representation power. For example, deep models allow efficient representation of interactions among variables while shallow models cannot.

  • The estimation error reflects the generalization power, which is influenced by both the complexity of the function class $\mathcal{F}$ and the properties of the training algorithms. Interestingly, for over-parametrized deep neural nets, stochastic gradient descent typically results in a near-zero training error (i.e., $\mathcal{E}_n(\hat f_n) \approx 0$; see, e.g., Figure 2(b)). Moreover, its generalization error remains small or moderate. This “counterintuitive” behavior suggests that for over-parametrized models, gradient-based algorithms enjoy benign statistical properties; we shall see in Section 7 that gradient descent enjoys implicit regularization in the over-parametrized regime even without explicit regularization (e.g., $\ell_2$ regularization).

The above two points lead to the following heuristic explanation of the success of deep learning models. The large depth of deep neural nets and heavy over-parametrization lead to small or zero training errors, even when running simple algorithms with a moderate number of iterations. In addition, these simple algorithms with a moderate number of steps do not explore the entire function space and thus have limited complexities, which results in small generalization error with a large sample size. Combining the two aspects explains heuristically why the test error is also small.

1.3 Roadmap of the paper

We first introduce basic deep learning models in Sections 2–4, and then examine their representation power via the lens of approximation theory in Section 5. Section 6 is devoted to training algorithms and their ability to drive the training error small. Then we sample recent theoretical progress towards demystifying the generalization power of deep learning in Section 7. Along the way, we provide our own perspectives, and at the end we identify a few interesting questions for future research in Section 8. The goal of this paper is to present suggestive methods and results, rather than giving conclusive arguments (which is currently unlikely) or a comprehensive survey. We hope that our discussion serves as a stimulus for new statistics research.

2 Feed-forward neural networks

Before introducing the vanilla feed-forward neural nets, let us set up necessary notations for the rest of this section. We focus primarily on classification problems, as regression problems can be addressed similarly. Given the training dataset $\{(y_i, x_i)\}_{1 \le i \le n}$ where $y_i \in [K] = \{1, 2, \ldots, K\}$ and $x_i \in \mathbb{R}^d$ are independent across $i \in [n]$, supervised learning aims at finding a (possibly random) function $\hat f(x)$ that predicts the outcome $y$ for a new input $x$, assuming $(y, x)$ follows the same distribution as $(y_i, x_i)$. In the terminology of machine learning, the input $x_i$ is often called the feature, the output $y_i$ is called the label, and the pair $(y_i, x_i)$ is an example. The function $\hat f$ is called the classifier, and estimation of $\hat f$ is training or learning. The performance of $\hat f$ is evaluated through the prediction error $\mathbb{P}(y \ne \hat f(x))$, which can often be estimated from a separate test dataset.

As with classical statistical estimation, for each $k \in [K]$, a classifier approximates the conditional probability $\mathbb{P}(y = k \mid x)$ using a function $f_k(x; \theta_k)$ parametrized by $\theta_k$. Then the category with the highest probability is predicted. Thus, learning is essentially estimating the parameters $\theta_k$. In statistics, one of the most popular methods is (multinomial) logistic regression, which stipulates a specific form for the functions $f_k$: let $z_k = x^\top \beta_k + \alpha_k$ and $f_k(x; \theta_k) = Z^{-1} \exp(z_k)$, where $Z = \sum_{k=1}^{K} \exp(z_k)$ is a normalization factor to make $\{f_k(x; \theta_k)\}_{1 \le k \le K}$ a valid probability distribution. It is clear that logistic regression induces linear decision boundaries in $\mathbb{R}^d$, and hence it is restrictive in modeling nonlinear dependency between $y$ and $x$. The deep neural networks we introduce below provide a flexible framework for modeling nonlinearity in a fairly general way.
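As a small illustration of the multinomial logistic model just described, the sketch below computes the class probabilities $Z^{-1}\exp(z_k)$ for a batch of inputs. The function names and the array layout (one column of `beta` per class) are assumptions made for this example.

```python
import numpy as np

def softmax(z):
    """Normalize scores into probabilities along the last axis."""
    z = z - z.max(axis=-1, keepdims=True)   # subtracting the max avoids overflow
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def logistic_predict_proba(X, beta, alpha):
    """X: (n, d) inputs, beta: (d, K) slopes, alpha: (K,) intercepts -> (n, K) probabilities."""
    return softmax(X @ beta + alpha)
```

Because the predicted class is the argmax of the linear scores `X @ beta + alpha`, the boundary between any two classes is a hyperplane, which is the linearity restriction noted in the text.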

2.1 Model setup

From the high level, deep neural networks (DNNs) use composition of a series of simple nonlinear functions to model nonlinearity

$$h^{(L)} = g^{(L)} \circ g^{(L-1)} \circ \cdots \circ g^{(1)}(x),$$

where $\circ$ denotes composition of two functions and $L$ is the number of hidden layers, and is usually called depth of a NN model. Letting $h^{(0)} = x$, one can recursively define $h^{(\ell)} = g^{(\ell)}(h^{(\ell-1)})$ for all $\ell = 1, 2, \ldots, L$. The feed-forward neural networks, also called the multilayer perceptrons (MLPs), are neural nets with a specific choice of $g^{(\ell)}$: for $\ell = 1, \ldots, L$, define

$$h^{(\ell)} = g^{(\ell)}\big(h^{(\ell-1)}\big) = \sigma\big(W^{(\ell)} h^{(\ell-1)} + b^{(\ell)}\big), \tag{3}$$

where $W^{(\ell)}$ and $b^{(\ell)}$ are the weight matrix and the bias/intercept, respectively, associated with the $\ell$-th layer, and $\sigma(\cdot)$ is usually a simple given (known) nonlinear function called the activation function. In words, in each layer $\ell$, the input vector $h^{(\ell-1)}$ goes through an affine transformation first and then passes through a fixed nonlinear function $\sigma(\cdot)$. See Figure 3 for an illustration of a simple MLP with two hidden layers. The activation function $\sigma(\cdot)$ is usually applied element-wise, and a popular choice is the ReLU (Rectified Linear Unit) function:

$$[\sigma(z)]_j = \max\{z_j, 0\}.$$

Other choices of activation functions include leaky ReLU, the tanh function [79] and the classical sigmoid function $(1 + e^{-z})^{-1}$, which is less used now.

Figure 3: A feed-forward neural network with an input layer, two hidden layers and an output layer. The input layer represents raw features $x$. Both hidden layers compute an affine transform (a.k.a. indices) of the input and then apply an element-wise activation function $\sigma(\cdot)$. Finally, the output returns a linear transform followed by the softmax activation (resp. simply a linear transform) of the hidden layers for the classification (resp. regression) problem.

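Given an output $h^{(L)}$ from the final hidden layer and a label $y$, we can define a loss function to minimize. A common loss function for classification problems is the multinomial logistic loss. Using the terminology of deep learning, we say that $h^{(L)}$ goes through an affine transformation and then the soft-max function:

$$f_k(x; \theta) = \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)}, \quad \forall\, k \in [K], \qquad \text{where } z = W^{(L+1)} h^{(L)} + b^{(L+1)} \in \mathbb{R}^K.$$

Then the loss is defined to be the cross-entropy between the label $y$ (in the form of an indicator vector) and the score vector $(f_1(x; \theta), \ldots, f_K(x; \theta))^\top$, which is exactly the negative log-likelihood of the multinomial logistic regression model:

$$\mathcal{L}(f(x; \theta), y) = -\sum_{k=1}^{K} \mathbb{1}\{y = k\} \log p_k, \tag{5}$$

where $p_k = f_k(x; \theta)$. As a final remark, the number of parameters scales with both the depth $L$ and the width (i.e., the dimensionality of $W^{(\ell)}$), and hence it can be quite large for deep neural nets.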
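The forward pass and the cross-entropy loss of this subsection can be sketched as follows. This is a minimal NumPy illustration, not a reference implementation: the helper names, the list-of-`(W, b)` parameter layout, and the Gaussian initialization scale are all assumptions made for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def init_params(widths, rng):
    """widths = [d, N_1, ..., N_L, K]; returns one (W, b) pair per layer.
    The scale sqrt(2 / fan_in) is an assumed initialization, not taken from the text."""
    return [(rng.normal(scale=np.sqrt(2.0 / m), size=(n, m)), np.zeros(n))
            for m, n in zip(widths[:-1], widths[1:])]

def forward(x, params):
    """Hidden layers h = relu(W h + b), followed by a final affine map to K scores z."""
    h = x
    for W, b in params[:-1]:
        h = relu(W @ h + b)
    W, b = params[-1]
    return W @ h + b

def cross_entropy(z, y):
    """-log softmax(z)[y], computed with the log-sum-exp trick for stability."""
    z = z - z.max()
    return np.log(np.exp(z).sum()) - z[y]

def num_params(params):
    """Total number of weights and biases; grows with both depth and width."""
    return sum(W.size + b.size for W, b in params)
```

For instance, `init_params([4, 8, 8, 3], rng)` builds a network with two hidden layers of width 8, and `num_params` shows how quickly the parameter count grows as layers are added or widened.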

2.2 Back-propagation in computational graphs

Training neural networks follows the empirical risk minimization paradigm that minimizes the loss (e.g., (5)) over all the training data. This minimization is usually done via stochastic gradient descent (SGD). In a way similar to gradient descent, SGD starts from a certain initial value $\theta^0$ and then iteratively updates the parameters $\theta^t$ by moving them in the direction of the negative gradient. The difference is that, in each update, a small subsample $\mathcal{B} \subset \{1, \ldots, n\}$ called a mini-batch (typically of size 32–512) is randomly drawn and the gradient calculation is only on $\mathcal{B}$ instead of the full batch $\{1, \ldots, n\}$. This considerably reduces the computational cost of calculating the gradient. By the law of large numbers, this stochastic gradient should be close to the full sample one, albeit with some random fluctuations. A pass of the whole training set is called an epoch. Usually, after several or tens of epochs, the error on a validation set levels off and training is complete. See Section 6 for more details and variants on training algorithms.
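To make the mini-batch update concrete, here is a minimal sketch of SGD on a least-squares problem rather than a neural network, chosen so the gradient has a closed form. The function name and default hyper-parameters are assumptions, and the mini-batches are obtained by shuffling once per epoch, which is one common way to draw them.

```python
import numpy as np

def sgd_least_squares(X, y, lr=0.1, batch_size=32, epochs=50, seed=0):
    """Minimize the mean of 0.5 * (x_i^T theta - y_i)^2 with mini-batch SGD."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)                       # initial value
    for _ in range(epochs):
        perm = rng.permutation(n)             # one epoch = one pass over the data
        for start in range(0, n, batch_size):
            B = perm[start:start + batch_size]
            residual = X[B] @ theta - y[B]
            grad = X[B].T @ residual / len(B)  # gradient computed on the mini-batch only
            theta -= lr * grad                 # move along the negative gradient
    return theta
```

Each update touches only `batch_size` rows of `X`, which is where the computational saving over full-batch gradient descent comes from.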

The key to the above training procedure, namely SGD, is the calculation of the gradient $\nabla \ell_{\mathcal{B}}(\theta)$, where

$$\ell_{\mathcal{B}}(\theta) = \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathcal{L}(f(x_i; \theta), y_i). \tag{6}$$

Gradient computation, however, is in general nontrivial for complex models, and it is susceptible to numerical instability for a model with large depth. Here, we introduce an efficient approach, namely back-propagation, for computing gradients in neural networks.

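Back-propagation [106] is a direct application of the chain rule in networks. As the name suggests, the calculation is performed in a backward fashion: one first computes $\partial \ell_{\mathcal{B}} / \partial h^{(L)}$, then $\partial \ell_{\mathcal{B}} / \partial h^{(L-1)}$, $\ldots$, and finally $\partial \ell_{\mathcal{B}} / \partial h^{(1)}$. For example, in the case of the ReLU activation function (the issue of non-differentiability at the origin is often ignored in implementation), we have the following recursive/backward relation

$$\frac{\partial \ell_{\mathcal{B}}}{\partial h^{(\ell-1)}} = \big(W^{(\ell)}\big)^\top \mathrm{diag}\Big(\mathbb{1}\big\{W^{(\ell)} h^{(\ell-1)} + b^{(\ell)} \ge 0\big\}\Big) \frac{\partial \ell_{\mathcal{B}}}{\partial h^{(\ell)}},$$

where $\mathrm{diag}(\cdot)$ denotes a diagonal matrix with elements given by the argument. Note that the calculation of $\partial \ell_{\mathcal{B}} / \partial h^{(\ell-1)}$ depends on $\partial \ell_{\mathcal{B}} / \partial h^{(\ell)}$, which is the partial derivatives from the next layer. In this way, the derivatives are “back-propagated” from the last layer to the first layer. These derivatives are then used to update the parameters. For instance, the gradient update for $W^{(\ell)}$ is given by

$$W^{(\ell)} \leftarrow W^{(\ell)} - \eta \frac{\partial \ell_{\mathcal{B}}}{\partial W^{(\ell)}}, \qquad \text{where } \frac{\partial \ell_{\mathcal{B}}}{\partial W^{(\ell)}_{jm}} = \frac{\partial \ell_{\mathcal{B}}}{\partial h^{(\ell)}_j} \cdot \sigma' \cdot h^{(\ell-1)}_m,$$

where $\sigma' = 1$ if the $j$-th element of $W^{(\ell)} h^{(\ell-1)} + b^{(\ell)}$ is nonnegative, and $\sigma' = 0$ otherwise. The step size $\eta > 0$, also called the learning rate, controls how much parameters are changed in a single update.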
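The backward recursion can be sketched for a ReLU network on a single example. For brevity, this illustration omits the bias terms and uses a softmax cross-entropy output; the function names and the list-of-matrices layout are assumptions made for the example.

```python
import numpy as np

def forward(x, Ws):
    """Store pre-activations us[l] = Ws[l] @ hs[l] and activations hs[l+1] = relu(us[l])."""
    hs, us = [x], []
    for W in Ws[:-1]:
        us.append(W @ hs[-1])
        hs.append(np.maximum(us[-1], 0.0))
    z = Ws[-1] @ hs[-1]                      # output scores
    return us, hs, z

def loss(x, y, Ws):
    """Cross-entropy of the softmax scores against the label y."""
    _, _, z = forward(x, Ws)
    z = z - z.max()
    return np.log(np.exp(z).sum()) - z[y]

def backprop(x, y, Ws):
    """Gradients of the loss with respect to every weight matrix."""
    us, hs, z = forward(x, Ws)
    p = np.exp(z - z.max())
    p /= p.sum()
    dz = p.copy()
    dz[y] -= 1.0                             # d loss / d scores for softmax + cross-entropy
    grads = [None] * len(Ws)
    grads[-1] = np.outer(dz, hs[-1])
    dh = Ws[-1].T @ dz                       # d loss / d (last hidden layer)
    for l in range(len(Ws) - 2, -1, -1):
        du = dh * (us[l] >= 0)               # apply diag(1{pre-activation >= 0})
        grads[l] = np.outer(du, hs[l])       # hs[l] is the input to layer l
        dh = Ws[l].T @ du                    # pass the derivative to the previous layer
    return grads
```

A gradient step is then `Ws[l] -= eta * grads[l]` for each layer. Comparing `backprop` against central finite differences of `loss` is a simple way to check such code.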

Figure 4: The computational graph illustrates the loss (9). For simplicity, we omit the bias terms. Symbols inside nodes represent functions, and symbols outside nodes represent function outputs (vectors/scalars). matmul is matrix multiplication, relu is the ReLU activation, cross entropy is the cross entropy loss, and SoS is the sum of squares.

A more general way to think about neural network models and training is to consider computational graphs. Computational graphs are directed acyclic graphs that represent functional relations between variables. They are very convenient and flexible to represent function composition, and moreover, they also allow an efficient way of computing gradients. Consider an MLP with a single hidden layer and an $\ell_2$ regularization:

$$\ell_n^{\lambda}(\theta) = \ell_n(\theta) + r_\lambda(\theta) = \ell_n(\theta) + \lambda \Big( \sum_{j, j'} \big(W^{(1)}_{j, j'}\big)^2 + \sum_{j, j'} \big(W^{(2)}_{j, j'}\big)^2 \Big), \tag{9}$$

where $\ell_n(\theta)$ is the same as (6) taken over the full batch, and $\lambda \ge 0$ is a tuning parameter. A similar example is considered in [45]. The corresponding computational graph is shown in Figure 4. Each node represents a function (inside a circle), which is associated with an output of that function (outside a circle). For example, we view the term $\ell_n(\theta)$ as a result of four compositions: first the input data $X$ multiplies the weight matrix $W^{(1)}$ resulting in $u^{(1)}$, then it goes through the ReLU activation function relu resulting in $h^{(1)}$, then it multiplies another weight matrix $W^{(2)}$ leading to $p$, and finally it produces the cross-entropy with label $y$ as in (5). The regularization term is incorporated in the graph similarly.

A forward pass is complete when all nodes are evaluated starting from the input $X$. A backward pass then calculates the gradients of $\ell_n^{\lambda}$ with respect to all other nodes in the reverse direction. Due to the chain rule, the gradient calculation for a variable (say, $\partial \ell_n^{\lambda} / \partial u^{(1)}$) is simple: it only depends on the gradient value of the variables ($\partial \ell_n^{\lambda} / \partial h^{(1)}$) the current node points to, and the function derivative evaluated at the current variable value ($\sigma'(u^{(1)})$). Thus, in each iteration, a computation graph only needs to (1) calculate and store the function evaluations at each node in the forward pass, and then (2) calculate all derivatives in the backward pass.
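The forward/backward mechanics on a computational graph can be illustrated with a tiny scalar reverse-mode differentiator. This is a hypothetical minimal design for exposition (the `Node` class and helper names are invented here) and is far simpler than what TensorFlow or PyTorch do internally.

```python
class Node:
    """A scalar node: stores its forward value and links to its parent nodes."""
    def __init__(self, value, parents=()):
        self.value = value        # computed and stored during the forward pass
        self.parents = parents    # tuple of (parent_node, local_derivative)
        self.grad = 0.0           # filled in during the backward pass

def add(a, b):
    return Node(a.value + b.value, ((a, 1.0), (b, 1.0)))

def mul(a, b):
    return Node(a.value * b.value, ((a, b.value), (b, a.value)))

def relu(a):
    return Node(max(a.value, 0.0), ((a, 1.0 if a.value >= 0 else 0.0),))

def backward(out):
    """Visit nodes in reverse topological order and accumulate gradients by the chain rule."""
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(out)
    out.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local

# Scalar analogue of the regularized loss: relu(w1 * x) * w2 + lam * (w1^2 + w2^2).
x, w1, w2, lam = Node(2.0), Node(0.5), Node(-3.0), Node(0.1)
out = add(mul(relu(mul(w1, x)), w2), mul(lam, add(mul(w1, w1), mul(w2, w2))))
backward(out)
```

Each node needs only its stored forward value and the gradient arriving from the nodes it feeds into, which mirrors the two-step recipe in the paragraph above.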

Back-propagation in computational graphs forms the foundation of popular deep learning software, including TensorFlow [1] and PyTorch [92], which allow more efficient building and training of complex neural net models.

3 Popular models

Moving beyond vanilla feed-forward neural networks, we introduce two other popular deep learning models, namely, the convolutional neural networks (CNNs) and the recurrent neural networks (RNNs). One important characteristic shared by the two models is weight sharing, that is, some model parameters are identical across locations in CNNs or across time in RNNs. This is related to the notion of translational invariance in CNNs and stationarity in RNNs. At the end of this section, we introduce a modular way of thinking for constructing more flexible neural nets.

3.1 Convolutional neural networks

The convolutional neural network (CNN) [71, 40] is a special type of feed-forward neural network that is tailored for image processing. More generally, it is suitable for analyzing data with salient spatial structures. In this subsection, we focus on image classification using CNNs, where the raw input (image pixels) and features of each hidden layer are represented by a 3D tensor $X \in \mathbb{R}^{d_1 \times d_2 \times d_3}$. Here, the first two dimensions $d_1, d_2$ of $X$ indicate spatial coordinates of an image while the third $d_3$ indicates the number of channels. For instance, $d_3$ is 3 for the raw inputs due to the red, green and blue channels, and $d_3$ can be much larger (say, 256) for hidden layers. Each channel is also called a feature map, because each feature map is specialized to detect the same feature at different locations of the input, which we will soon explain. We now introduce two building blocks of CNNs, namely the convolutional layer and the pooling layer.

  1. Convolutional layer (CONV). A convolutional layer has the same functionality as described in (3), where the input feature $X \in \mathbb{R}^{d_1 \times d_2 \times d_3}$ goes through an affine transformation first and then an element-wise nonlinear activation. The difference lies in the specific form of the affine transformation. A convolutional layer uses a number of filters to extract local features from the previous input. More precisely, each filter is represented by a 3D tensor $F_j \in \mathbb{R}^{w \times w \times d_3}$ ($1 \le j \le \tilde d_3$), where $w$ is the size of the filter (typically 3 or 5) and $\tilde d_3$ denotes the total number of filters. Note that the third dimension $d_3$ of $F_j$ is equal to that of the input feature $X$. For this reason, one usually says that the filter has size $w \times w$, while suppressing the third dimension $d_3$. Each filter $F_j$ then convolves with the input feature $X$ to obtain one single feature map $O^j \in \mathbb{R}^{(d_1 - w + 1) \times (d_2 - w + 1)}$, where (to simplify notation, we omit the bias/intercept term associated with each filter)

    $$O^j_{ik} = \big\langle [X]_{ik}, F_j \big\rangle = \sum_{i'=1}^{w} \sum_{j'=1}^{w} \sum_{l=1}^{d_3} [X]_{i+i'-1,\, k+j'-1,\, l} \, [F_j]_{i', j', l}. \tag{10}$$

    Here $[X]_{ik} \in \mathbb{R}^{w \times w \times d_3}$ is a small “patch” of $X$ starting at location $(i, k)$. See Figure 5 for an illustration of the convolution operation. If we view the 3D tensors $[X]_{ik}$ and $F_j$ as vectors, then each filter essentially computes its inner product with a part of $X$ indexed by $i, k$ (which can be also viewed as convolution, as its name suggests). One then packs the resulting feature maps $\{O^j\}$ into a 3D tensor $O$ with size $(d_1 - w + 1) \times (d_2 - w + 1) \times \tilde d_3$, where

    $$[O]_{ikj} = [O^j]_{ik}.$$

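    Figure 5: $X \in \mathbb{R}^{d_1 \times d_2 \times 3}$ represents the input feature consisting of $d_1 \times d_2$ spatial coordinates in a total number of 3 channels / feature maps. $F_j \in \mathbb{R}^{w \times w \times 3}$ denotes the $j$-th filter with size $w \times w$. The third dimension 3 of the filter automatically matches the number of channels in the previous input. Every 3D patch of $X$ gets convolved with the filter $F_j$ and this as a whole results in a single output feature map with size $(d_1 - w + 1) \times (d_2 - w + 1)$. Stacking the outputs of all the filters $\{F_j\}_{1 \le j \le \tilde d_3}$ will lead to the output feature with size $(d_1 - w + 1) \times (d_2 - w + 1) \times \tilde d_3$.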

    The outputs of convolutional layers are then followed by nonlinear activation functions. In the ReLU case, we have

    $$\tilde X_{ikj} = \sigma(O_{ikj}) = \max\{O_{ikj}, 0\}. \tag{12}$$

    The convolution operation (10) and the ReLU activation (12) work together to extract features $\tilde X$ from the input $X$. Different from feed-forward neural nets, the filters $F_j$ are shared across all locations $(i, k)$. A patch $[X]_{ik}$ of an input responds strongly (that is, producing a large value) to a filter $F_j$ if they are positively correlated. Therefore intuitively, each filter $F_j$ serves to extract features similar to $F_j$.

    As a side note, after the convolution (10), the spatial size $d_1 \times d_2$ of the input $X$ shrinks to $(d_1 - w + 1) \times (d_2 - w + 1)$ of $\tilde X$. However, one may want the spatial size unchanged. This can be achieved via padding, where one appends zeros to the margins of the input $X$ to enlarge the spatial size to $(d_1 + w - 1) \times (d_2 + w - 1)$. In addition, a stride in the convolutional layer determines the gap $i' - i$ and $k' - k$ between two patches $[X]_{ik}$ and $[X]_{i'k'}$: in (10) the stride is 1, and a larger stride would lead to feature maps with smaller sizes.

  2. Pooling layer (POOL). A pooling layer aggregates the information of nearby features into a single one. This downsampling operation reduces the size of the features for subsequent layers and saves computation. One common form of the pooling layer is composed of the $2 \times 2$ max-pooling filter. It computes $\max\{X_{i,j,k}, X_{i+1,j,k}, X_{i,j+1,k}, X_{i+1,j+1,k}\}$, that is, the maximum of the $2 \times 2$ neighborhood in the spatial coordinates; see Figure 6 for an illustration. Note that the pooling operation is done separately for each feature map $k$. As a consequence, a $2 \times 2$ max-pooling filter acting on $X \in \mathbb{R}^{d_1 \times d_2 \times d_3}$ will result in an output of size $d_1/2 \times d_2/2 \times d_3$. In addition, the pooling layer does not involve any parameters to optimize. Pooling layers serve to reduce redundancy since a small neighborhood around a location in a feature map is likely to contain the same information.

Figure 6: A max pooling layer extracts the maximum of 2 by 2 neighboring pixels/features across the spatial dimension.
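The two building blocks above can be sketched directly from their definitions. The loop-based NumPy code below is written for readability rather than speed; the function names, the `(n_filters, w, w, d3)` filter layout, and the convention that `pad` counts zeros added on each side are assumptions made for this illustration.

```python
import numpy as np

def conv_layer(X, filters, stride=1, pad=0):
    """X: (d1, d2, d3) input; filters: (n_filters, w, w, d3). Returns ReLU feature maps."""
    if pad > 0:
        X = np.pad(X, ((pad, pad), (pad, pad), (0, 0)))   # zero padding on the margins
    d1, d2, _ = X.shape
    n_filters, w = filters.shape[0], filters.shape[1]
    out1 = (d1 - w) // stride + 1
    out2 = (d2 - w) // stride + 1
    O = np.empty((out1, out2, n_filters))
    for i in range(out1):
        for k in range(out2):
            patch = X[i * stride:i * stride + w, k * stride:k * stride + w, :]
            # Inner product of the patch with every filter; the filters are shared across (i, k).
            O[i, k, :] = np.tensordot(filters, patch, axes=([1, 2, 3], [0, 1, 2]))
    return np.maximum(O, 0.0)

def max_pool_2x2(X):
    """Maximum over non-overlapping 2x2 spatial blocks, separately for each feature map."""
    d1, d2, d3 = X.shape
    X = X[:d1 - d1 % 2, :d2 - d2 % 2, :]                  # drop a trailing odd row/column
    return X.reshape(d1 // 2, 2, d2 // 2, 2, d3).max(axis=(1, 3))
```

With an $8 \times 8 \times 3$ input and four $3 \times 3$ filters, the output is $6 \times 6 \times 4$ without padding, $8 \times 8 \times 4$ with `pad=1`, and $3 \times 3 \times 4$ with `stride=2`, matching the size rules in the text.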

In addition, we also use fully-connected layers as building blocks, which we have already seen in Section 2. Each fully-connected layer treats the input tensor $X$ as a vector $\mathrm{Vec}(X)$, and computes $\tilde X = \sigma(W \, \mathrm{Vec}(X))$. A fully-connected layer does not use weight sharing and is often used in the last few layers of a CNN. As an example, Figure 7 depicts the well-known LeNet 5 [71], which is composed of two sets of CONV-POOL layers and three fully-connected layers.

Figure 7: LeNet is composed of an input layer, two convolutional layers, two pooling layers and three fully-connected layers. Both convolutions are valid and use filters with size $5 \times 5$. In addition, the two pooling layers use average pooling.
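As a quick check on how the CONV-POOL-CONV-POOL-FC-FC-FC pipeline changes feature sizes, the sketch below traces shapes through a LeNet-style network. The $32 \times 32$ input, the filter counts (6 and 16) and the fully-connected widths (120, 84, 10) follow the commonly cited LeNet-5 configuration and are assumptions here, since the text above does not list them.

```python
def conv_out(size, w, stride=1, pad=0):
    """Spatial output size of a convolution with filter size w."""
    return (size + 2 * pad - w) // stride + 1

def lenet5_shapes(input_size=32):
    """Return (layer name, output shape) pairs for a LeNet-style network."""
    shapes = []
    s = conv_out(input_size, 5)          # valid 5x5 convolution
    shapes.append(("CONV1", (s, s, 6)))
    s //= 2                              # 2x2 pooling halves each spatial dimension
    shapes.append(("POOL1", (s, s, 6)))
    s = conv_out(s, 5)
    shapes.append(("CONV2", (s, s, 16)))
    s //= 2
    shapes.append(("POOL2", (s, s, 16)))
    shapes.append(("FC1", (120,)))       # acts on the flattened POOL2 output
    shapes.append(("FC2", (84,)))
    shapes.append(("FC3", (10,)))
    return shapes
```

Under these assumptions the first fully-connected layer receives a vector of length $5 \times 5 \times 16 = 400$.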

3.2 Recurrent neural networks

Recurrent neural nets (RNNs) are another family of powerful models, which are designed to process time series data and other sequence data. RNNs have successful applications in speech recognition [108], machine translation [132], genome sequencing [21], etc. The structure of an RNN naturally forms a computational graph, and can be easily combined with other structures such as CNNs to build large computational graph models for complex tasks. Here we introduce vanilla RNNs and improved variants such as long short-term memory (LSTM).

(a) One-to-many (b) Many-to-one (c) Many-to-many
Figure 8: Vanilla RNNs with different inputs/outputs settings. (a) has one input but multiple outputs; (b) has multiple inputs but one output; (c) has multiple inputs and outputs. Note that the parameters are shared across time steps.

3.2.1 Vanilla RNNs

Suppose we have general time series inputs $x_1, x_2, \ldots, x_T$. A vanilla RNN models the “hidden state” at time $t$ by a vector $h_t$, which is subject to the recursive formula

$$h_t = f_\theta(h_{t-1}, x_t). \tag{13}$$

Here, $f_\theta$ is generally a nonlinear function parametrized by $\theta$. Concretely, a vanilla RNN with one hidden layer has the following form (similar to the activation function $\sigma(\cdot)$, the function $\tanh(\cdot)$ means element-wise operations):

$$h_t = \tanh\big(W_{hh} h_{t-1} + W_{xh} x_t + b_h\big), \qquad z_t = \sigma\big(W_{hy} h_t + b_z\big),$$

where $W_{hh}, W_{xh}, W_{hy}$ are trainable weight matrices, $b_h, b_z$ are trainable bias vectors, and $z_t$ is the output at time $t$. Like many classical time series models, those parameters are shared across time. Note that in different applications, we may have different input/output settings (cf. Figure 8). Examples include

  • One-to-many: a single input with multiple outputs; see Figure 8(a). A typical application is image captioning, where the input is an image and outputs are a series of words.

  • Many-to-one: multiple inputs with a single output; see Figure 8(b). One application is text sentiment classification, where the input is a series of words in a sentence and the output is a label (e.g., positive vs. negative).

  • Many-to-many: multiple inputs and outputs; see Figure 8(c). This is adopted in machine translation, where inputs are words of a source language (say Chinese) and outputs are words of a target language (say English).
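The recursion and the input/output settings listed above can be sketched with a single forward routine. This is a minimal NumPy illustration with assumed names; it returns the output scores $W_{hy} h_t + b_z$ at every step, before any output activation is applied.

```python
import numpy as np

def rnn_forward(xs, h0, W_hh, W_xh, W_hy, b_h, b_z):
    """Run h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h) over the sequence xs."""
    h, hs, zs = h0, [], []
    for x in xs:                       # the same parameters are reused at every time step
        h = np.tanh(W_hh @ h + W_xh @ x + b_h)
        hs.append(h)
        zs.append(W_hy @ h + b_z)      # output scores at time t
    return np.array(hs), np.array(zs)
```

In a many-to-many setting one would use all rows of `zs`, whereas a many-to-one task such as sentiment classification would keep only the last row, `zs[-1]`.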

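As is the case with feed-forward neural nets, we minimize a loss function using back-propagation, where the loss is typically

$$\ell_{\mathcal{T}} = \sum_{t \in \mathcal{T}} \mathcal{L}(y_t, z_t) = -\sum_{t \in \mathcal{T}} \sum_{k=1}^{K} \mathbb{1}\{y_t = k\} \log\left( \frac{\exp([z_t]_k)}{\sum_{k'=1}^{K} \exp([z_t]_{k'})} \right),$$

where $K$ is the number of categories for classification (e.g., size of the vocabulary in machine translation), and $\mathcal{T} \subset [T]$ is the set of time steps in the output sequence. During the training, the gradients are computed in the reverse time order (from $T$ to $1$). For this reason, the training process is often called back-propagation through time.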

One notable drawback of vanilla RNNs is that they have difficulty in capturing long-range dependencies in sequence data when the length of the sequence is large. This is sometimes due to the phenomenon of exploding/vanishing gradients. Take Figure 8(c) as an example. Computing $\partial \ell_{\mathcal{T}} / \partial h_1$ involves the product $\prod_{t=1}^{T-1} (\partial h_{t+1} / \partial h_t)$ by the chain rule. However, if the sequence is long, the product will be the multiplication of many Jacobian matrices, which usually results in exponentially large or small singular values. To alleviate this issue, in practice, the forward pass and backward pass are implemented in a shorter sliding window $\{t_1, t_1 + 1, \ldots, t_2\}$, instead of the full sequence $\{1, 2, \ldots, T\}$. Though effective in some cases, this technique alone does not fully address the issue of long-term dependency.
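The growth or decay of the Jacobian product can be seen numerically. The sketch below assumes the tanh recursion $h_t = \tanh(W_{hh} h_{t-1} + \cdots)$, for which $\partial h_t / \partial h_{t-1} = \mathrm{diag}(1 - h_t^2) W_{hh}$; the function name is hypothetical.

```python
import numpy as np

def jacobian_product_norm(W_hh, hs):
    """Largest singular value of the product of diag(1 - h_t^2) W_hh over the states in hs."""
    J = np.eye(W_hh.shape[0])
    for h in hs:
        J = (np.diag(1.0 - h ** 2) @ W_hh) @ J   # chain rule: multiply one more Jacobian
    return np.linalg.norm(J, 2)
```

For a toy case where all hidden states are zero (so the tanh derivative is 1) and $W_{hh} = c I$, the norm after $T$ steps is exactly $c^T$: it vanishes geometrically for $c < 1$ and explodes for $c > 1$.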

3.2.2 GRUs and LSTM

There are two improved variants that alleviate the above issue: gated recurrent units (GRUs) [26] and long short-term memory (LSTM) [54].

  • A GRU refines the recursive formula (13) by introducing gates, which are vectors of the same length as $h_t$. The gates, which take values in $[0, 1]$ elementwise, multiply with $h_{t-1}$ elementwise and determine how much of the old hidden states is kept.

  • An LSTM similarly uses gates in the recursive formula. In addition to $h_t$, an LSTM maintains a cell state, which takes values in $\mathbb{R}$ elementwise and is analogous to a counter.

Here we only discuss LSTM in detail. Denote by $\odot$ the element-wise multiplication. We have a recursive formula in place of (13):

$$\begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} W \begin{pmatrix} h_{t-1} \\ x_t \\ 1 \end{pmatrix}, \qquad c_t = f_t \odot c_{t-1} + i_t \odot g_t, \qquad h_t = o_t \odot \tanh(c_t),$$

where $\sigma$ here denotes the sigmoid function and $W$ is a big weight matrix with appropriate dimensions. The cell state vector $c_t$ carries information of the sequence (e.g., singular/plural form in a sentence). The forget gate $f_t$ determines how much the values of $c_{t-1}$ are kept for time $t$, the input gate $i_t$ controls the amount of update to the cell state, and the output gate $o_t$ gives how much $c_t$ reveals to $h_t$. Ideally, the elements of these gates have nearly binary values. For example, an element of $f_t$ being close to 1 may suggest the presence of a feature in the sequence data. Similar to the skip connections in residual nets, the cell state $c_t$ has an additive recursive formula, which helps back-propagation and thus captures long-range dependencies.
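One LSTM update can be sketched as below. The code assumes the common formulation in which a single matrix `W` of shape `(4n, n + d + 1)` acts on the stacked vector `(h_prev, x, 1)` and the gates are ordered input, forget, output, candidate; the function names and that ordering are conventions chosen for this illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM step; returns the new hidden state h and cell state c."""
    n = h_prev.shape[0]
    a = W @ np.concatenate([h_prev, x, [1.0]])   # the trailing 1 absorbs the bias
    i = sigmoid(a[:n])             # input gate, values in [0, 1]
    f = sigmoid(a[n:2 * n])        # forget gate
    o = sigmoid(a[2 * n:3 * n])    # output gate
    g = np.tanh(a[3 * n:])         # candidate update
    c = f * c_prev + i * g         # additive recursion for the cell state
    h = o * np.tanh(c)
    return h, c
```

The additive form of the `c` update is what lets gradients flow across many time steps without passing through a squashing nonlinearity at each step.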

Figure 9: A vanilla RNN with two hidden layers. Higher-level hidden states $h_t^{\ell}$ are determined by the old states $h_{t-1}^{\ell}$ and lower-level hidden states $h_t^{\ell-1}$. Multilayer RNNs generalize both feed-forward neural nets and one-hidden-layer RNNs.

3.2.3 Multilayer RNNs

Multilayer RNNs are a generalization of the one-hidden-layer RNN discussed above. Figure 9 shows a vanilla RNN with two hidden layers. In place of (13), the recursive formula for an RNN with $L$ hidden layers now reads

$$h_t^{\ell} = \tanh\left[ W^{\ell} \begin{pmatrix} h_t^{\ell-1} \\ h_{t-1}^{\ell} \\ 1 \end{pmatrix} \right] \quad \text{for all } \ell \in [L], \qquad h_t^{0} \triangleq x_t.$$

Note that a multilayer RNN has two dimensions: the sequence length $T$ and depth $L$. Two special cases are the feed-forward neural nets (where $T = 1$) introduced in Section 2, and RNNs with one hidden layer (where $L = 1$). Multilayer RNNs usually do not have very large depth, since $T$ is already very large.
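The two dimensions, time and depth, can be seen as two nested loops. The sketch below assumes a tanh activation, zero initial hidden states, and one weight matrix per layer acting on the stacked vector (lower-level state, old state, constant 1); the function names are hypothetical.

```python
import math


def affine_tanh(W, v):
    return [math.tanh(sum(w * u for w, u in zip(row, v))) for row in W]


def multilayer_rnn(weights, xs, hidden_size):
    """weights[l] maps (lower-level state; old state of layer l; 1) to the new state
    of layer l. xs is the input sequence. Returns the top-layer state at every time step."""
    depth = len(weights)
    h_prev = [[0.0] * hidden_size for _ in range(depth)]  # initial states set to zero
    top_states = []
    for x in xs:                    # loop over time
        below = x                   # the level-0 state is the input itself
        h_new = []
        for l in range(depth):      # loop over depth
            h = affine_tanh(weights[l], below + h_prev[l] + [1.0])
            h_new.append(h)
            below = h
        h_prev = h_new
        top_states.append(below)
    return top_states


# Two layers, hidden size 2, input size 2: every weight row has length 2 + 2 + 1.
weights = [[[0.1] * 5 for _ in range(2)] for _ in range(2)]
xs = [[1.0, -1.0], [0.5, 0.5], [0.0, 1.0]]
out = multilayer_rnn(weights, xs, hidden_size=2)
```

With a single input (one time step) the outer loop runs once and the code reduces to a feed-forward pass; with a single layer it reduces to the one-hidden-layer RNN.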

Finally, we remark that CNNs, RNNs, and other neural nets can be easily combined to tackle tasks that involve different sources of input data. For example, in image captioning, the images are first processed through a CNN, and then the high-level features are fed into an RNN as inputs. These neural nets combined together form a large computational graph, so they can be trained using back-propagation. This generic training method provides much flexibility in various applications.

3.3 Modules

Deep neural nets are essentially compositions of many nonlinear functions. A component function may be designed to have specific properties in a given task, and it may itself result from composing a few simpler functions. In LSTM, we have seen that the building block consists of several intermediate variables, including cell states and forget gates that can capture long-term dependency and alleviate numerical issues.

This leads to the idea of designing modules for building more complex neural net models. Desirable modules usually have low computational costs, alleviate numerical issues in training, and lead to good statistical accuracy. Since modules and the resulting neural net models form computational graphs, training follows the same principle briefly described in Section 2.

Here, we use the examples of Inception and skip connections to illustrate the ideas behind modules. Figure 10(a) is an example of “Inception” modules used in GoogleNet [123]. As before, all the convolutional layers are followed by the ReLU activation function. The concatenation of information from filters with different sizes gives the model great flexibility to capture spatial information. Note that a $1 \times 1$ filter is a $1 \times 1 \times d_3$ tensor (where $d_3$ is the number of feature maps), so its convolutional operation does not interact with other spatial coordinates, only serving to aggregate information from different feature maps at the same coordinate. This reduces the number of parameters and speeds up the computation. Similar ideas appear in other work [78, 57].
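To see why aggregating feature maps with small filters can save parameters, consider the following sketch. The channel counts are hypothetical, and the pattern of reducing the number of feature maps with a 1x1 layer before a larger filter is an assumption about how such modules are typically arranged, not a description of a specific layer in GoogleNet.

```python
def conv_params(kernel_h, kernel_w, in_maps, out_maps):
    """Number of weights (ignoring biases) in a convolutional layer."""
    return kernel_h * kernel_w * in_maps * out_maps


def one_by_one_conv(feature_maps, weights):
    """Apply a single 1x1 filter: a weighted sum over feature maps at each coordinate.
    feature_maps[c][i][j] is map c at pixel (i, j); weights[c] is one scalar per map."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    return [[sum(w * fm[i][j] for w, fm in zip(weights, feature_maps))
             for j in range(width)] for i in range(height)]


# Hypothetical sizes: 256 input maps, 64 maps after reduction, 128 output maps.
direct = conv_params(5, 5, 256, 128)
reduced = conv_params(1, 1, 256, 64) + conv_params(5, 5, 64, 128)

# Two 1x2 feature maps mixed with weights 1 and 10; no neighbouring pixels are involved.
mixed = one_by_one_conv([[[1, 2]], [[3, 4]]], [1, 10])
```

In this example the reduced path uses roughly a quarter of the weights of the direct 5x5 layer, and each output pixel of `one_by_one_conv` depends only on the same pixel across maps.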

Figure 10: (a) The “Inception” module from GoogleNet. Concat means combining all feature maps into a tensor. (b) Skip connections are added every two layers in ResNets.

Another module, usually called skip connections, is widely used to alleviate numerical issues in very deep neural nets, with additional benefits in optimization efficiency and statistical accuracy. Training very deep neural nets is generally more difficult, but the introduction of skip connections in residual networks [50, 51] has greatly eased the task.

The high-level idea of skip connections is to add an identity map to an existing nonlinear function. Let $F(x)$ be an arbitrary nonlinear function represented by a (fragment of a) neural net; then the idea of skip connections is simply to replace $F(x)$ with $x + F(x)$. Figure 10(b) shows a well-known structure from residual networks [50], in which an identity map is added for every two layers:

$$x \mapsto \sigma\big(x + F(x)\big) = \sigma\big(x + W' \sigma(W x + b) + b'\big), \tag{14}$$

where $x$ can be hidden nodes from any layer and $W, W', b, b'$ are corresponding parameters. By repeating (namely composing) this structure throughout all layers, [50, 51] are able to train neural nets with hundreds of layers easily, which overcomes well-observed training difficulties in deep neural nets. Moreover, deep residual networks also improve statistical accuracy, as the classification error on the ImageNet challenge dropped markedly from 2014 to 2015. As a side note, skip connections can be used flexibly. They are not restricted to the form in (14), and can be used between any pair of layers [55].
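A minimal sketch of the two-layer structure with an added identity map follows. It assumes ReLU as the activation and equal input and output dimensions so that the sum is well defined; the function names are illustrative.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def affine(W, b, v):
    return [sum(w * u for w, u in zip(row, v)) + bk for row, bk in zip(W, b)]


def residual_block(x, W1, b1, W2, b2):
    """Compute relu(x + W2 relu(W1 x + b1) + b2): two layers plus an identity map."""
    inner = affine(W2, b2, relu(affine(W1, b1, x)))
    return relu([xi + fi for xi, fi in zip(x, inner)])


zeros = [[0.0, 0.0], [0.0, 0.0]]
# With all weights and biases at zero, the block passes a nonnegative input through unchanged.
y = residual_block([1.0, 2.0], zeros, [0.0, 0.0], zeros, [0.0, 0.0])
```

The zero-weight case hints at why depth becomes easier to handle: a block that has learned nothing behaves like the identity rather than destroying its input.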

4 Deep unsupervised learning

In supervised learning, given a labelled training set $\{(y_i, x_i)\}$, we focus on discriminative models, which essentially represent $P(y \mid x)$ by a deep neural net $f(x; \theta)$ with parameters $\theta$. Unsupervised learning, in contrast, aims at extracting information from unlabeled data $\{x_i\}$, where the labels $\{y_i\}$ are absent. This information can be a low-dimensional embedding of the data $\{x_i\}$ or a generative model with latent variables to approximate the distribution $P_X(x)$. To achieve these goals, we introduce two popular unsupervised deep learning models, namely, autoencoders and generative adversarial networks (GANs). The first one can be viewed as a dimension reduction technique, and the second as a density estimation method. DNNs are the key elements of both models.

4.1 Autoencoders

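Recall that in dimension reduction, the goal is to reduce the dimensionality of the data and at the same time preserve its salient features. In particular, in principal component analysis (PCA), the goal is to embed the data $\{x_i\}_{1 \le i \le n}$ into a low-dimensional space via a linear function $f$ such that maximum variance can be explained. Equivalently, we want to find linear functions $f: \mathbb{R}^d \to \mathbb{R}^k$ and $g: \mathbb{R}^k \to \mathbb{R}^d$ ($k \le d$) such that the difference between $x_i$ and $g(f(x_i))$ is minimized. Formally, we let

$$f(x) = W_f x \triangleq h \quad \text{and} \quad g(h) = W_g h, \qquad \text{where } W_f \in \mathbb{R}^{k \times d} \text{ and } W_g \in \mathbb{R}^{d \times k}.$$

Here, for simplicity, we assume that the intercept/bias terms for $f$ and $g$ are zero. Then, PCA amounts to minimizing the quadratic loss function

$$\min_{W_f, W_g} \; \frac{1}{n} \sum_{i=1}^{n} \left\| x_i - W_g W_f x_i \right\|_2^2. \tag{15}$$

It is the same as minimizing $\|X - XW\|_F^2$ subject to $\mathrm{rank}(W) \le k$, where $X \in \mathbb{R}^{n \times d}$ is the design matrix and $W = W_f^\top W_g^\top$. The solution is given by the singular value decomposition of $X$ [44, Thm. 2.4.8], which is exactly what PCA does. It turns out that PCA is a special case of autoencoders, which is often known as the undercomplete linear autoencoder.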

More broadly, autoencoders are neural network models for (nonlinear) dimension reduction, which generalize PCA. An autoencoder has two key components, namely, the encoder function $f(\cdot)$, which maps the input $x \in \mathbb{R}^d$ to a hidden code/representation $h \triangleq f(x) \in \mathbb{R}^k$, and the decoder function $g(\cdot)$, which maps the hidden representation $h$ to a point $g(h) \in \mathbb{R}^d$. Both functions can be multilayer neural networks as in (3). See Figure 11 for an illustration of autoencoders. Let $L(x_1, x_2)$ be a loss function that measures the difference between $x_1$ and $x_2$ in $\mathbb{R}^d$. Similar to PCA, an autoencoder is used to find the encoder $f$ and decoder $g$ such that $L(x, g(f(x)))$ is as small as possible. Mathematically, this amounts to solving the following minimization problem

$$\min_{f, g} \; \frac{1}{n} \sum_{i=1}^{n} L\big(x_i, g(h_i)\big) \quad \text{with } h_i = f(x_i) \text{ for all } i \in [n]. \tag{16}$$

Figure 11: First an input $x$ goes through the encoder $f(\cdot)$, and we obtain its hidden representation $h = f(x)$. Then, we use the decoder $g(\cdot)$ to get $g(h)$ as a reconstruction of $x$. Finally, the loss is determined from the difference between the original input $x$ and its reconstruction $g(f(x))$.

One needs to make structural assumptions on the functions $f$ and $g$ in order to find useful representations of the data, which leads to different types of autoencoders. Indeed, if no assumption is made, choosing $f$ and $g$ to be identity functions clearly minimizes the above optimization problem. To avoid this trivial solution, one natural way is to require that the encoder $f$ maps the data onto a space with a smaller dimension, i.e., $k < d$. This is the undercomplete autoencoder that includes PCA as a special case. There are other structured autoencoders which add desired properties to the model such as sparsity or robustness, mainly through regularization terms. Below we present two other common types of autoencoders.

  • Sparse autoencoders. One may believe that the dimension $k$ of the hidden code $h_i$ is larger than the input dimension $d$, and that $h_i$ admits a sparse representation. As with LASSO [126] or SCAD [36], one may add a regularization term to the reconstruction loss $L$ in (16) to encourage sparsity [98]. A sparse autoencoder solves

    $$\min_{f, g} \; \frac{1}{n} \sum_{i=1}^{n} \Big[ L\big(x_i, g(h_i)\big) + \lambda \|h_i\|_1 \Big] \quad \text{with } h_i = f(x_i) \text{ for all } i \in [n].$$

    This is similar to dictionary learning, where one aims at finding a sparse representation of input data on an overcomplete basis. Due to the imposed sparsity, the model can potentially learn useful features of the data.

  • Denoising autoencoders. One may hope that the model is robust to noise in the data: even if the input data $x_i$ are corrupted by small noise $\xi_i$ or miss some components (the noise level or the missing probability is typically small), an ideal autoencoder should faithfully recover the original data. A denoising autoencoder [128] achieves this robustness by explicitly building noisy data $\tilde{x}_i = x_i + \xi_i$ as the new input, and then solves an optimization problem similar to (16) where $L(x_i, g(h_i))$ is replaced by $L(x_i, g(f(\tilde{x}_i)))$. A denoising autoencoder encourages the encoder/decoder to be stable in the neighborhood of an input, which is generally a good statistical property. An alternative way could be constraining $f$ and $g$ in the optimization problem, but that would be very difficult to optimize. Instead, sampling by adding small perturbations in the input provides a simple implementation. We shall see similar ideas in Section 6.3.3.
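The denoising objective can be sketched as follows. The corruption scheme (additive Gaussian noise followed by random zeroing of coordinates), the squared loss, and all function names are illustrative assumptions; the encoder and decoder are passed in as arbitrary callables rather than trained networks.

```python
import random


def corrupt(x, noise_std, drop_prob, rng):
    """Build a noisy copy of x: add Gaussian noise, then zero out each
    coordinate independently with probability drop_prob."""
    noisy = [xi + rng.gauss(0.0, noise_std) for xi in x]
    return [0.0 if rng.random() < drop_prob else v for v in noisy]


def squared_loss(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def denoising_loss(data, encoder, decoder, noise_std, drop_prob, seed=0):
    """Average reconstruction loss where the input is corrupted
    but the target is the clean data point."""
    rng = random.Random(seed)
    total = 0.0
    for x in data:
        x_tilde = corrupt(x, noise_std, drop_prob, rng)
        total += squared_loss(x, decoder(encoder(x_tilde)))
    return total / len(data)


data = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]
identity = lambda v: v
clean = denoising_loss(data, identity, identity, noise_std=0.0, drop_prob=0.0)
noisy = denoising_loss(data, identity, identity, noise_std=0.1, drop_prob=0.0)
```

The identity map achieves zero loss without corruption but a positive loss once noise is added, which shows how the corruption rules out the trivial solution and rewards maps that are stable around each input.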

4.2 Generative adversarial networks

Given unlabeled data $\{x_i\}_{1 \le i \le n}$, density estimation aims to estimate the underlying probability density function $P_X$ from which the data is generated. Both parametric and nonparametric estimators [115] have been proposed and studied under various assumptions on the underlying distribution. Different from these classical density estimators, where the density function is explicitly defined in relatively low dimension, generative adversarial networks (GANs) [46] can be categorized as an implicit density estimator in much higher dimension. The reasons are twofold: (1) GANs put more emphasis on sampling from the distribution $P_X$ than on estimation; (2) GANs define the density estimation implicitly through a source distribution $P_Z$ and a generator function $g(\cdot)$, which is usually a deep neural network. We introduce GANs from the perspective of sampling from $P_X$, and later we will generalize the vanilla GANs using their relation to density estimators.

4.2.1 Sampling view of GANs

Suppose the data at hand are all real images, and we want to generate new natural images. With this goal in mind, GAN models a zero-sum game between two players, namely, the generator $G$ and the discriminator $D$. The generator $G$ tries to generate fake images akin to the true images, while the discriminator $D$ aims at differentiating the fake ones from the true ones. Intuitively, one hopes to learn a generator $G$ whose images even the best discriminator cannot distinguish from real ones. Therefore the payoff is higher for the generator $G$ if the probability of the discriminator being wrong is higher, and correspondingly the payoff for the discriminator correlates positively with its ability to tell fake images from true ones.

Mathematically, the generator $G$ consists of two components, a source distribution $P_Z$ (usually a standard multivariate Gaussian distribution with hundreds of dimensions) and a function $g(\cdot)$ which maps a sample $z$ from $P_Z$ to a point $g(z)$ living in the same space as $x$. For generating images, $g(z)$ would be a 3D tensor. Here $g(z)$ is the fake sample generated from $G$. Similarly, the discriminator $D$ is composed of one function which takes an image $x$ (real or fake) and returns a number $d(x) \in [0, 1]$, the probability of $x$ being a real sample from $P_X$. Oftentimes, both the generating function $g(\cdot)$ and the discriminating function $d(\cdot)$ are realized by deep neural networks, e.g., CNNs introduced in Section 3.1. See Figure 12 for an illustration of GANs. Denote by $\theta_G$ and $\theta_D$ the parameters in $g(\cdot)$ and $d(\cdot)$, respectively. Then GAN tries to solve the following min-max problem:

$$\min_{\theta_G} \max_{\theta_D} \; \mathbb{E}_{x \sim P_X}\big[\log d(x)\big] + \mathbb{E}_{z \sim P_Z}\big[\log\big(1 - d(g(z))\big)\big]. \tag{17}$$

Figure 12: GANs consist of two components, a generator which generates fake samples and a discriminator which differentiates the true ones from the fake ones.

Recall that $d(x)$ models the belief / probability that the discriminator thinks that $x$ is a true sample. Fix the parameters $\theta_G$ and hence the generator $G$, and consider the inner maximization problem. We can see that the goal of the discriminator is to maximize its ability of differentiation. Similarly, if we fix $\theta_D$ (and hence the discriminator), the generator tries to generate more realistic images $g(z)$ to fool the discriminator.

4.2.2 Density estimation view of GANs

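Let us now take a density-estimation view of GANs. Fixing the source distribution $P_Z$, any generator $G$ induces a distribution $P_G$ over the space of images. Removing the restrictions on $d(\cdot)$, one can then rewrite (17) as

$$\min_{P_G} \max_{d(\cdot)} \; \mathbb{E}_{x \sim P_X}\big[\log d(x)\big] + \mathbb{E}_{x \sim P_G}\big[\log\big(1 - d(x)\big)\big]. \tag{18}$$

Observe that the inner maximization problem is solved by the likelihood ratio, i.e.

$$d^*(x) = \frac{P_X(x)}{P_X(x) + P_G(x)}.$$

Substituting $d^*$ gives the inner value $2\,\mathrm{JS}(P_X \,\|\, P_G) - \log 4$. As a result, (18) can be simplified as

$$\min_{P_G} \; \mathrm{JS}(P_X \,\|\, P_G), \tag{19}$$

where $\mathrm{JS}(\cdot \,\|\, \cdot)$ denotes the Jensen–Shannon divergence between two distributions

$$\mathrm{JS}(P_X \,\|\, P_G) = \frac{1}{2} \mathrm{KL}\Big(P_X \,\Big\|\, \frac{P_X + P_G}{2}\Big) + \frac{1}{2} \mathrm{KL}\Big(P_G \,\Big\|\, \frac{P_X + P_G}{2}\Big).$$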

In words, the vanilla GAN (17) seeks a density $P_G$ that is closest to $P_X$ in terms of the Jensen–Shannon divergence. This view allows one to generalize GANs to other variants by changing the distance metric. Examples include f-GAN [90], Wasserstein GAN (W-GAN) [6], MMD GAN [75], etc. We single out W-GAN because of its popularity. As the name suggests, it minimizes the Wasserstein distance between $P_X$ and $P_G$:

$$\min_{\theta_G} \mathrm{WS}(P_X \,\|\, P_G) = \min_{\theta_G} \sup_{f:\, f \text{ 1-Lipschitz}} \; \mathbb{E}_{x \sim P_X}\big[f(x)\big] - \mathbb{E}_{x \sim P_G}\big[f(x)\big], \tag{20}$$

where $f(\cdot)$ is taken over all Lipschitz functions with coefficient 1. Comparing W-GAN (20) with the original formulation of GAN (17), one finds that the Lipschitz function $f$ in (20) corresponds to the discriminator $D$ in (17) in the sense that they share similar objectives to differentiate the true distribution $P_X$ from the fake one $P_G$. Finally, we would like to mention that GANs are more difficult to train than supervised deep learning models such as CNNs [110]. Apart from the training difficulty, how to evaluate GANs objectively and effectively is an ongoing research question.
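To make the density-estimation view concrete, the sketch below checks numerically, on a small finite sample space, that the likelihood-ratio discriminator maximizes the vanilla GAN objective and that the resulting value equals twice the Jensen–Shannon divergence minus log 4. The two distributions are arbitrary illustrative choices and the function names are hypothetical.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    """Jensen-Shannon divergence: average KL to the midpoint distribution."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def gan_value(p_x, p_g, d):
    """E_{x~P_X}[log d(x)] + E_{x~P_G}[log(1 - d(x))] on a finite sample space."""
    return sum(px * math.log(dx) + pg * math.log(1 - dx)
               for px, pg, dx in zip(p_x, p_g, d))


p_x = [0.5, 0.3, 0.2]  # "true" distribution (illustrative)
p_g = [0.2, 0.3, 0.5]  # distribution induced by some generator (illustrative)
d_star = [px / (px + pg) for px, pg in zip(p_x, p_g)]

best = gan_value(p_x, p_g, d_star)
uninformed = gan_value(p_x, p_g, [0.5, 0.5, 0.5])  # a discriminator that always says 1/2
```

When the two distributions coincide, the optimal discriminator outputs 1/2 everywhere and the value drops to minus log 4, its minimum over generators.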

5 Representation power: approximation theory

Having seen the building blocks of deep learning models in the previous sections, it is natural to ask: what are the benefits of composing multiple layers of nonlinear functions? In this section, we address this question from an approximation-theoretic point of view. Mathematically, letting $\mathcal{H}$ be the space of functions representable by neural nets (NNs), how well can a function $f$ (with certain properties) be approximated by functions in $\mathcal{H}$? We first revisit universal approximation theories, which are mostly developed for shallow neural nets (neural nets with a single hidden layer), and then provide recent results that demonstrate the benefits of depth in neural nets. Other notable works include the Kolmogorov-Arnold superposition theorem [7, 120] and circuit complexity for neural nets [91].

5.1 Universal approximation theory for shallow NNs

The universal approximation theories study the approximation of $f$ in a space $\mathcal{F}$ by a function represented by a one-hidden-layer neural net

$$g(x) = \sum_{j=1}^{N} c_j \, \sigma_*\big(w_j^\top x + b_j\big), \tag{21}$$

where $\sigma_*$ is a certain activation function and $N$ is the number of hidden units in the neural net. For different spaces $\mathcal{F}$ and activation functions $\sigma_*$, there are upper bounds and lower bounds on the approximation error $\|f - g\|$. See [93] for a comprehensive overview. Here we present representative results.

First, as $N \to \infty$, any continuous function $f$ can be approximated by some $g$ under mild conditions. Loosely speaking, this is because each component $\sigma_*(w_j^\top x + b_j)$ behaves like a basis function and functions in a suitable space $\mathcal{F}$ admit a basis expansion. Given the above heuristics, the next natural question is: what is the rate of approximation for a finite $N$?
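The basis-function heuristic can be illustrated in one dimension. The sketch below assumes a sigmoid activation and scalar inputs; it combines two hidden units into a localized bump, the kind of building block from which more complicated functions can be assembled. The steepness value and the names are illustrative choices.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def one_hidden_layer(x, units):
    """Evaluate g(x) = sum_j c_j * sigmoid(w_j * x + b_j) for scalar x,
    where units is a list of (c_j, w_j, b_j) triples."""
    return sum(c * sigmoid(w * x + b) for c, w, b in units)


# Two steep sigmoids with opposite signs form a bump that is close to 1 on (-1, 1)
# and close to 0 outside; k controls how sharp the edges are.
k = 50.0
bump = [(1.0, k, k), (-1.0, k, -k)]

inside = one_hidden_layer(0.0, bump)
outside_right = one_hidden_layer(2.0, bump)
outside_left = one_hidden_layer(-2.0, bump)
```

Shifting and rescaling such bumps and adding them up gives a piecewise-constant approximation of a continuous function on an interval, at the cost of two hidden units per bump.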

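Let us restrict the domain of $x$ to a unit ball $B^d$ in $\mathbb{R}^d$. For $p \in [1, \infty)$ and integer $m \ge 1$, consider the $L^p$ space and the Sobolev space with standard norms

$$\|f\|_p = \Big[ \int_{B^d} |f(x)|^p \, dx \Big]^{1/p}, \qquad \|f\|_{m,p} = \Big[ \sum_{0 \le |k| \le m} \|D^k f\|_p^p \Big]^{1/p},$$

where $D^k f$ denotes partial derivatives indexed by $k = (k_1, \ldots, k_d) \in \mathbb{Z}_+^d$. Let $\mathcal{F} \triangleq \mathcal{F}_p^m$ be the space of functions $f$ in the Sobolev space with $\|f\|_{m,p} \le 1$. Note that functions in $\mathcal{F}$ have bounded derivatives up to $m$-th order, and that smoothness of functions is controlled by $m$ (larger $m$ means smoother). Denote by $\mathcal{H}_N$ the space of functions with the form (21). The following general upper bound is due to [85].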

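Theorem 1 (Theorem 2.1 in [85]).

Assume $\sigma_*: \mathbb{R} \to \mathbb{R}$ is such that $\sigma_*$ has arbitrary order derivatives in an open interval $I$, and that $\sigma_*$ is not a polynomial on $I$. Then, for any $p \in [1, \infty)$, $d \ge 2$, and integer $m \ge 1$,

$$\sup_{f \in \mathcal{F}_p^m} \inf_{g \in \mathcal{H}_N} \|f - g\|_p \le C_{d,m,p} \, N^{-m/d},$$

where $C_{d,m,p}$ is independent of $N$, the number of hidden units.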

In the above theorem, the condition on $\sigma_*$ is mainly technical. This upper bound is useful when the dimension $d$ is not large. It clearly implies that the one-hidden-layer neural net is able to approximate any smooth function with enough hidden units. However, it is unclear how to find a good approximator $g$; nor do we have control over the magnitude of the parameters (huge weights are impractical). While increasing the number of hidden units $N$ leads to better approximation, the exponent $-m/d$ suggests the presence of the curse of dimensionality. The following (nearly) matching lower bound is stated in [80].
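A quick calculation shows what the curse of dimensionality means here. Assuming an upper bound of the form C times N to the power -m/d, with d the input dimension, m the smoothness order, and the unknown constant C set to 1 purely for illustration, the number of hidden units needed to guarantee a target error grows exponentially in d.

```python
import math


def units_needed(eps, d, m, C=1.0):
    """Smallest N with C * N**(-m/d) <= eps, i.e. N >= (C / eps)**(d / m).
    C = 1.0 is an arbitrary placeholder for the unknown constant in the bound."""
    return math.ceil((C / eps) ** (d / m))


low_dim = units_needed(eps=0.1, d=2, m=1)     # on the order of 10**2
high_dim = units_needed(eps=0.1, d=20, m=1)   # on the order of 10**20
smoother = units_needed(eps=0.1, d=20, m=10)  # extra smoothness offsets the dimension
```

Only the ratio d/m matters: raising the smoothness order from 1 to 10 in dimension 20 brings the requirement back to the level of the two-dimensional case.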

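Theorem 2 (Theorem 5 in [80]).

Let $p \ge 1$, $m \ge 1$, and $d \ge 2$. If the activation function is the standard sigmoid function $\sigma(t) = (1 + e^{-t})^{-1}$, then

$$\sup_{f \in \mathcal{F}_p^m} \inf_{g \in \mathcal{H}_N} \|f - g\|_p \ge C'_{d,m,p} \, (N \log N)^{-m/d}, \tag{22}$$

where $C'_{d,m,p}$ is independent of $N$.

Results for other activation functions are also obtained by [80]. Moreover, the term $\log N$ can be removed if we assume an additional continuity condition [85].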

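For the natural space $\mathcal{F}_p^m$ of smooth functions, the exponential dependence on $d$ in the upper and lower bounds may look unappealing. However, [12] showed that for a different function space, there is a good dimension-free approximation by the neural nets. Suppose that a function $f: \mathbb{R}^d \to \mathbb{R}$ has a Fourier representation

$$f(x) = \int_{\mathbb{R}^d} e^{i \langle \omega, x \rangle} \tilde{f}(\omega) \, d\omega, \tag{23}$$

where $\tilde{f}(\omega) \in \mathbb{C}$. Assume that $f(0) = 0$ and that the following quantity is finite

$$C_f = \int_{\mathbb{R}^d} \|\omega\|_2 \, |\tilde{f}(\omega)| \, d\omega. \tag{24}$$

[12] uncovers the following dimension-free approximation guarantee.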

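Theorem 3 (Proposition 1 in [12]).

Fix a $C > 0$ and an arbitrary probability measure $\mu$ on the unit ball $B^d$ in $\mathbb{R}^d$. For every function $f$ with $C_f \le C$ and every $N \ge 1$, there exists some $g \in \mathcal{H}_N$ such that

$$\Big[ \int_{B^d} \big(f(x) - g(x)\big)^2 \, \mu(dx) \Big]^{1/2} \le \frac{2C}{\sqrt{N}}.$$

Moreover, the coefficients of $g$ may be restricted to satisfy $\sum_{j=1}^{N} |c_j| \le 2C$.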

The upper bound is now independent of the dimension $d$. However, $C_f$ may implicitly depend on $d$, as the formula in (24) involves an integration over $\mathbb{R}^d$ (so for some functions $C_f$ may depend exponentially on $d$). Nevertheless, this theorem does characterize an interesting function space with an improved upper bound. Details of the function space are discussed by [12]. This theorem can be generalized; see [81] for an example.

To help understand why a dimension-free approximation holds, let us appeal to a heuristic argument given by Monte Carlo simulations. It is well-known that Monte Carlo approximation errors are independent of dimensionality in the evaluation of high-dimensional integrals. Let us generate $\{\omega_j\}_{1 \le j \le N}$ randomly from a given density $p(\cdot)$ in $\mathbb{R}^d$. Consider the approximation to (23) by