Towards a Mathematical Understanding of the Difficulty in Learning with Feedforward Neural Networks

11/17/2016 ∙ by Hao Shen, et al. ∙ fortiss 0

Training deep neural networks for solving machine learning problems is one great challenge in the field, mainly due to its associated optimisation problem being highly non-convex. Recent developments have suggested that many training algorithms do not suffer from undesired local minima under certain scenario, and consequently led to great efforts in pursuing mathematical explanations for such observations. This work provides an alternative mathematical understanding of the challenge from a smooth optimisation perspective. By assuming exact learning of finite samples, sufficient conditions are identified via a critical point analysis to ensure any local minimum to be globally minimal as well. Furthermore, a state of the art algorithm, known as the Generalised Gauss-Newton (GGN) algorithm, is rigorously revisited as an approximate Newton's algorithm, which shares the property of being locally quadratically convergent to a global minimum under the condition of exact learning.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Deep Neural Networks (DNNs) have been successfully applied to solve challenging problems in pattern recognition, computer vision, and speech recognition

[1, 2, 3]. Despite this success, training DNNs is still one of the greatest challenges in the field [4]. In this work, we focus on training the classic feedforward Multi-Layer Perceptrons (MLPs). It is known that performance of MLPs is highly dependent on various factors in a very complicated way. For example, studies in [5, 6] identify the topology of MLPs as a determinative factor. Works in [7, 4]

demonstrate the impact of different activation functions to performance of MLPs. Moreover, a choice of error/loss functions is also shown to be influential as in


Even with a well designed MLP architecture, training a specific MLP both effectively and efficiently can be as challenging as constructing the network. The most popular method used in training MLPs is the well-known backpropagation (BP) algorithm [9]. Although the classic BP algorithm shares a great convenience of being very simple, they can suffer from two major problems, namely, (i) existence of undesired local minima, even if global optimality is assumed; and (ii) slow convergence speed. Early works as [10, 11] argue that such problems with BP algorithms are essentially due to their nature of being gradient descent algorithms, while an associated optimisation problem for MLP training is often highly non-convex and of large scale.

One major approach to address the problem of undesired local minima in MLP training is via an error/loss surface analysis [12, 13, 14]. Even for simple tasks, such as the classic XOR problem, the error surface analysis is surprisingly complicated and its results can be hard to conclude [15, 13, 16]. Early efforts in [17, 18, 19] try to identify general conditions on the topology of MLPs to eliminate undesired local minima, i.e., suboptimal local minima. Unfortunately, these attempts fail to provide complete solutions to general problems. On the other hand, although BP algorithms are often thought to be sensitive to initialisations [20], recent results reported in [21] suggest that modern MLP training algorithms can overcome the problem of suboptimal local minima conveniently. Such observations have triggered several very recent efforts to characterise global optimality of DNN training [22, 23, 24].

The work in [22] shows that both deep linear networks and deep nonlinear networks

with only the Rectified Linear Unit (ReLU) function in the hidden layers are free of suboptimal local minima. The attempted technique is not applicable for analysing

deep nonlinear networks with other activation functions, e.g. the Sigmoid and the SoftSign. Recent work in [23] studies a special case of the problem, where the training samples are assumed to be linearly independent, i.e., the number of samples is smaller than the dimension of the sample space. Unfortunately, such a strong assumption does not even hold for the classic XOR problem. Moreover, conditions constructed in its main result (Theorem 3.8 in [23]), i.e., the existence of non-degenerate global minima, can hardly be satisfied in practice (see Remark 5 in Section 4 of this paper for details). Most recently, by restricting the network output and its regularisation to be positively homogeneous with respect to the network parameters, the work in [24] develops sufficient conditions to guarantee all local minima to be globally minimal. So far, such results only apply to networks with either one hidden layer or multiple deep subnetworks connected in parallel. Moreover, it still fails to explain performance of MLPs with non-positively homogeneous activation functions.

To deal with slow convergence speed of the classic BP algorithm, various modifications have been developed, such as momentum based BP algorithm [25], conjugate gradient algorithm [26], BFGS algorithm [27]

, and ADAptive Moment estimation (ADAM) algorithm

[28]. Specifically, a Generalised Gauss-Newton (GGN) algorithm proposed in [29] has demonstrated its prominent performance over many state of the art training algorithms in practice [30, 31]. Unfortunately, justification for such performance is still mathematically vague [32]. Another popular solution to deal with slow convergence is to employ Newton’s method for MLP training. However, an implementation of exact Newton’s method is often computationally prohibitive. Hence, approximations of the Hessian matrix are needed to address the issue, such as a diagonal approximation structure [33], and a block diagonal approximation structure [34]

. However, without a complete evaluation of the true Hessian, performance of these heuristic approximations is hardly convincing. Although existing attempts

[35, 36] have characterised the Hessian by applying partial derivatives, these results still fail to provide further structural information of the Hessian, due to the limitations of partial derivatives.

In this work, we provide an alternative mathematical understanding of the difficulty in training MLPs from a smooth optimisation perspective. Sufficient conditions are identified via a critical point analysis to ensure all local minima to be globally minimal. Convergence properties of the GGN algorithm are rigorously analysed as an approximate Newton’s algorithm.

2 Learning with MLPs

Many machine learning tasks can be formulated as a problem of learning a task-specific ground truth mapping , where and denote an input space and an output space, respectively. The problem of interest is to approximate , given only a finite number of samples in either or . When only samples in are given, say , the problem of approximating is known as unsupervised learning. When both input samples and their desired outputs are provided, i.e., given training samples , the corresponding problem is referred to as supervised learning

. In this work, we only focus on the problem of supervised learning.

A popular approach to evaluate learning outcomes is via minimising an empirical error/loss function as


where denotes a hypothetical function space, where a minimiser is assumed to be reachable, and denotes a suitable error function that evaluates the estimate against some task-dependent prior knowledge . For supervised learning problems, such prior knowledge is simply the corresponding desired output , i.e., . In general, given only a finite number of samples, the ground truth mapping is hardly possible to be exactly learned as the solution . Nevertheless, we can still define a specific scenario of exact learning with respect to a finite number of samples.

Definition 1 (Exact learning).

Let be the ground truth mapping. Given samples , a mapping , which satisfies for all , is called a finite exact approximator of based on the samples.

For describing MLPs, we denote by the number of layers in an MLP structure, and by the number of processing units in the -th layer with . Specifically, by , we refer it to the input layer. Let be a unit activation function, which was traditionally chosen to be non-constant, bounded, continuous, and monotonically increasing. Recent choices, e.g. ReLU, SoftPlus, and Bent identity, are unbounded functions. In this work, we restrict activation functions to be smooth, and denote by and the first and second derivative of the activation function , respectively.

For the -th unit in an MLP architecture, referring to the -th unit in the -th layer, we define the corresponding unit mapping as


where denotes the output from the -th layer, and

are respectively a weight vector and a bias associated with the

-th unit. Note, that the bias is a free variable in general. However, through our analysis in this work, we fix it as a constant scalar as in Eq. (2) for the sake of convenience for presentation. Then, we can simply define the -th layer mapping by stacking all unit mappings in the layer as


with being the -th weight matrix. Specifically, let us denote by the input, then the output at the -th layer is defined as recursively. Note, that the last layer of an MLP commonly employs the identify map as the activation function, i.e., . Finally, by denoting the set of all parameter matrices in the MLP by , we compose all layer-wise mappings to define the overall MLP network mapping


With such a construction, we can define the set of parameterised mappings specified by a given MLP architecture as


More specifically, we denote by the MLP architecture specifying the number of units in each layer.

Now, let and , one can utilise an MLP to approximate the ground truth mapping . Then, an empirical total loss function of MLP based learning can be formulated as


If the error function is differentiable in , then the function is differentiable in the weights . For the convenience of presentation, in the rest of the paper we denote the sample-wise loss function by


It is important to notice that, even when the error function is constructed to be convex with respect to the first argument, the total loss function as defined in Eq. (6) is still non-convex in . Since the set of parameters is unbounded, when squashing activation functions are employed, exploding the norm of any weight matrix will not drive the function value of to infinity. Namely, the total loss function can be non-coercive [37]. Therefore, the existence and attainability of global minima of are not guaranteed in general. However, in the finite sample setting, when appropriate nonlinear activation functions, such as Sigmoid and , are employed in the hidden layer, a three-layer MLP with a sufficiently large number of units in the hidden layer can achieve exact learning of finite samples [38, 39].

In the rest of the paper, we assume the existence of an MLP structure that is capable of realising a finite exact approximator.

Assumption 1.

Let be the ground truth mapping. Given unique samples , there exists an MLP architecture , as defined in Eq. (5), and a weight , so that the corresponding MLP is a finite exact approximator of .

As exact learning of finite samples is assumed, a suitable error function is critical to ensure its attainability and uniqueness via an optimisation procedure. Optimistically, a finite exact approximator corresponds with a global minimum of the corresponding total loss function without undesired local minima. We propose the following assumption as a practical principle of choosing error function.

Principle 1 (Choice of error function).

The error function is differentiable with respect to its first argument. If the gradient of with respect to the first argument vanishes at , i.e., , then is a global minimum of .

Remark 1.

Typical examples of error function include the classic squared loss, smooth approximations of norm with , Blake-Zisserman loss, and Cauchy loss [40]. Moreover, by Principle 1, the weights as assumed in Assumption 1 is a global minimiser of the total loss function .

3 Critical point analysis of MLP training

In order to develop a gradient descent algorithm to minimise the cost function as in Eq. (6), the derivative of all unit mappings are building blocks for our computation. We define the derivative of the activation function in the -th unit as


and the collection of all derivatives of activation functions in the -th layer as


For simplicity, we denote by

. We apply the chain rule of multivariable derivative to compute the directional derivative of

with respect to in direction as


where and refer to the directional derivative of with respect to the first and the second argument, respectively. Explicitly, the first derivative of , i.e., as a linear map acting on direction , is


where the operator puts a vector into a diagonal matrix form, and the first derivative of with respect to the second parameter in direction as

Input : Samples , an MLP architecture , and initial weights ;
Output: Accumulation point ;
Step 1: For , feed samples through the MLP to compute           (1) and for ;           (2) ;
Step 2: For , compute           (1) ;           (2) ;
Step 3: Repeat from Step 1 until convergence ;
Algorithm 1 BP algorithm for supervised learning (batch learning).

The gradient of in the -th weight matrix with respect to the Euclidean metric can be computed as


where . The gradient is realised as a rank-one matrix, and acts as a linear map . By exploring the layer-wise structure of the MLP, the corresponding vector can be computed recursively backwards from the output layer , i.e.,


with . With such a backward mechanism in computing the gradient , we recover the classic BP algorithm, see Algorithm 1.

In what follows, we characterise critical points of the total loss function as defined in Eq. (6) by setting its gradient to zero, i.e., . Explicitly, the gradient of with respect to the -th weight is computed by


where and are respectively the -th layer output and the -th error vector as defined in Eq. (15). Similar to the recursive construction of the error vector as in Eq. (15), we construct a sequence of matrices as, for all ,


with . Then the vector form of the gradient can be written as


where puts a matrix into the vector form, and denotes the Kronecker product of matrices. Then, by applying the previous calculation to all samples, critical points of the total loss function are characterised as solutions of the following linear system in


where the superscript indicates the corresponding term for the -th sample, and is the collection of the Jacobian matrices of the MLP for all samples. Here, is the number of variables in the MLP, i.e.,


The above parameterised linear equation system in is strongly dependent on several factors, essentially all factors in designing an MLP, i.e., the MLP structure, the activation function, the error function, given samples, and the weight matrices. If the trivial solution is reachable at some weights , then a finite exact approximator is realised by the corresponding MLP, i.e., . Additionally, if the solution is the only solution of the linear equation system for all , then any local minimum of the loss function is a global minimum. Thus, we conclude the following theorem.

Theorem 1 (suboptimal local minima free condition).

Let an MLP architecture satisfy Assumption 1 for a specific learning task, and the error function satisfy Principle 1. If the following two conditions are fulfilled for all ,

  1. the matrix , as constructed in (19), is non-zero,

  2. the vector , as constructed in (19), lies in the row span of ,

then a finite exact approximator is realised at a global minimum , i.e., , and the loss function is free of suboptimal local minima, i.e., any local minimum of is a global minimum.

Obviously, condition (1) in Theorem 1 is quite easy to be ensured, while condition (2) is hardly possible to be realised, since it might require enormous efforts to design the space of MLPs . Nevertheless, if the rank of matrix is equal to for all , then the trivial solution zero is the only solution of the parameterised linear system. Hence, we have the following proposition as a special case of Theorem 1.

Proposition 1 (Strong suboptimal local minima free condition).

Let an MLP architecture satisfy Assumption 1 for a specific learning task, and the error function satisfy Principle 1. If the rank of matrix as constructed in (19) is equal to for all , then a finite exact approximator is realised at a global minimum , i.e., , and the loss function is free of suboptimal local minima.

Given the number of rows of being , we suggest the second principle of ensuring performance of MLPs.

Principle 2 (Choice of the number of NN variables).

The total number of variables in an MLP architecture needs to be greater than or equal to , i.e., .

In what follows, we investigate the possibility or difficulty to fulfil the condition, i.e., for all , required in Proposition 1. Let us firstly construct the two identically partitioned matrices ( partitions), by collecting the partitions ’s and ’s, as


Then, the matrix constructed on the left hand side of Eq. (19) is computed as the Khatri-Rao product of and , i.e., pairwise Kronecker products for all pairs of partitions in and , denoted by


Each row block of associated with a specific layer is by construction the Khatri-Rao product of the corresponding row blocks in and , i.e., with and . We firstly investigate the rank property of row blocks of in the following lemma222The proof is given in the provided supplements..

Proposition 2.

Given a collection of matrices and a collection of vectors , for , let and . Then the rank of the Khatri-Rao product is bounded from below by


If all matrices ’s and are of full rank, then the rank of has the following properties:

  1. If , then ;

  2. If and , then ;

  3. If and , then .

Unfortunately, stacking these row blocks for together to construct cannot bring better knowledge about the rank of .

Proposition 3.

For an MLP architecture , the rank of as defined in Eq. (22) is bounded from below by


Clearly, in order to have the rank of better bounded from below, it is important to ensure higher rank of all ’s and ’s. By choosing appropriate activation functions in hidden layers, all ’s can have full rank [38, 39]. Then, in what follows, we present three heuristics aiming to keep all ’s being of full rank as much as possible. By the construction of as specified in Eq. (17), i.e., matrix product of ’s and ’s, it is reasonable to ensure full rankness of all ’s and ’s.

Principle 3 (Constraints on MLP weights).

All weight matrices for all are of full rank.

Remark 2.

Note that, the full rank constraint on the weight matrices does not introduce new local minima of the constrained total loss function. However, whether such a constraint may exclude all global minima of the unconstrained total loss function for a given MLP architecture is still an open problem.

Principle 4 (Choice of activation functions).

The derivative of activation function is non-zero for all , i.e., .

Remark 3.

It is trivial to verify that most of popular differentiable activation functions, such as the Sigmoid, , SoftPlus, and SoftSign, satisfy Principle 4. Note, that the Identity activation function, which is employed in the output layer, also satisfies this principle. However, potentially vanishing gradient of squashing activation functions can be still an issue in practice due to finite machine precision. Therefore, activation functions without vanishing gradients, e.g. the Bent identity or the leaky ReLU (non-differentiable at the origin), might be preferred.

However, even when both Principles 3 and 4 are fulfilled, matrices ’s still cannot be guaranteed to have full rank, according to the Sylvester’s rank inequality [41]. Hence, we need to prevent loss of rank in each due to matrix product, i.e., to preserve the smaller rank of two matrices after a matrix product.

Principle 5 (Choice of the number of hidden units).

Given an MLP with hidden layers, the numbers of units in three adjacent layers, namely , , and , satisfy the following inequality with

Remark 4.

The condition together with implies , i.e., the inequality in Eq. (25) takes effect when there is more than one hidden layer, since Principle 3 and 4 are sufficient to ensure ’s and ’s to have full rank. However, for , the inequality in Eq. (25) together with Principle 3 and 4 ensures no loss of rank in all , i.e., .

4 Hessian analysis of MLP training

It is well known that gradient descent algorithms can suffer from slow convergence. Information from the Hessian matrix of the loss function is critical for developing efficient numerical algorithms. Specifically, definiteness of the Hessian matrix is an indicator to the isolatedness of the critical points, which will affect significantly the convergence speed of the corresponding algorithm.

We start with the Hessian of the sample-wise MLP loss function as defined in Eq. (7), which is a bilinear operator , computed by the second directional derivative of . For the sake of readability, we only present one component of the second directional derivative with respect to two specific layer indices and , i.e.,


where is the diagonal matrix with its diagonal entries being the second derivative of the activation functions. Since exact learning of finite samples is assumed at a global minimum , gradients of the error function at all samples are simply zero, i.e., for all . Consequently, the second summand in the last equation in Eq. (26) vanishes, as for all . Then, the Hessian evaluated at in direction is given as


where is the Hessian of the error function with respect to the output of the MLP . By a direct computation, we have the Hessian of the total loss function at a global minimum in a matrix form as


where is the Jacobian matrix of the MLP evaluated at as defined in Eq. (19), and is a block diagonal matrix of all Hessians of for all samples evaluated at . It is then trivial to conclude the following proposition about the rank of .

Proposition 4.

The rank of the Hessian of the total loss function at a global minimum is bounded from above by

Remark 5.

When Principle 2 is assumed, it is easy to see


Namely, when an MLP is designed from scratch without insightful knowledge and reaches exact learning, then it is very likely that the Hessian is degenerate, i.e., the classic BP algorithm will suffer significantly from slow convergence. Moreover, it is also obvious that conditions constructed in the main result of [23] (Theorem 3.8), i.e., the existence of non-degenerate global minima, is hardly possible to be satisfied for the setting considered in [23].

If Proposition 1 holds true, i.e., , then the rank of will depend on the rank of . To ensure to have full rank, we need to have a non-degenerate Hessian for the error function at global minima, i.e., is a Morse function [42]. Hence, in addition to Principle 1, we state the following principle on the choice of error functions.

Principle 1.a (Strong choice of error function).

In addition to Principle 1, the error function is Morse, i.e., the Hessian is non-degenerate for all .

5 Case studies

In this section, we evaluate our results in the previous sections by two case studies, namely loss surface analysis of training MLPs with one hidden layer, and development of an approximate Newton’s algorithm.

5.1 MLPs with one hidden layer

Firstly, we revisit some classic results on MLPs with only one hidden layer [18, 19], and exemplify our analysis using the classic XOR problem [15, 12, 13, 16]. For a learning task with unique training samples, a finite exact approximator is realisable with a two-layer MLP () having units in the hidden layer, and its training process exempts from suboptimal local minima.

Proposition 5.

Let an MLP architecture with one hidden layer satisfy Principle 1, 3, and 4. Then, for a learning task with unique training samples, if the following two conditions are fulfilled:

  1. There are units in the hidden layer, i.e., ,

  2. unique samples produce a basis in the output space of the hidden layer for all ,

then a finite exact approximator is realised at a global minimum , i.e., , and the loss function is free of suboptimal local minima.

When the scalar-valued bias in each unit map, as in Eq. (2), is set to be a free variable, a dummy unit is introduced in each layer, except the output layer. The dummy unit always feeds a constant input of one to its successor layer. Results in [18, 19] claim that, with the presence of dummy units, only units in the hidden layer are required to achieve exact learning and eliminate all suboptimal local minima. Such a statement has been shown to be false by counterexample utilising the XOR problem [16]. In the following proposition, we reinvestigate this problem as a concrete example of applying Proposition 3.

Proposition 6.

Let a two-layer MLP architecture with dummy units and satisfy Principle 1, 3, and 4, and . For a learning task with unique samples , we have

  1. if , then

  2. if , then


In the rest of this section, we examine these statements on the XOR problem [12, 13, 16]. Two specific MLP structures have been extensively studied in the literature, namely, the network and the network. Squashing activation functions are used in the hidden layer. Dummy units are introduced in both the input and hidden layer. The XOR problem has unique input samples


where the last row is due to the dummy unit in the input layer and . Their associated desired outputs are specified as


Then a squared loss function is defined as


where . In order to identify local minima of the loss function , we firstly compute , according to situation (2) in Proposition 6. Hence, we need to investigate the rank of the Jacobian matrix


Firstly, let us have a look at the XOR network. By considering the dummy unit in the hidden layer, the matrix has the last row as . Then, according to Proposition 2, the first row block in in Eq. (36) has the smallest rank of two, while the second row block has the smallest rank of one. Hence, the rank of is lower bounded by two, and local minima can exist for training the XOR network [13, 15].

Similarly, for the network, i.e., , although the first row block in has potentially the largest rank of four, it is still not immune from collapsing to lower rank of three. Meanwhile, the second row block in has also the smallest rank of one. Therefore, the rank of is lower bounded by three. As a sequel, there still exist undesired local minima in training the XOR network [16].

5.2 An approximate Netwon’s algorithm

It is important to notice that the Hessian is neither diagonal nor block diagonal, which differs from the existing approximate strategies of the Hessian in [33, 34]. A classic Newton’s method for minimising the total loss function requires to compute the exact Hessian of from its second directional derivative as in Eq. (26). The complexity of computing the second summand on the right hand side of the last equality in Eq. (26) is of order in the number of layers . Namely, an implementation of exact Newton’s method becomes much more expensive, when an MLP gets deeper. Motivated by the fact that this computationally expensive term vanishes at a global minimum, shown in Eq. (28), we propose to approximate the Hessian of at an arbitrary weight with the following expression


where , and is the Jacobian matrix of the MLP as defined in Eq. (19). With this approximation, we can construct an approximate Newton’s algorithm to minimise the total loss function . Specifically, for the -th iterate , an approximate Newton’s direction is computed by solving the following linear system for


where is the gradient of at . When Principle 1.a holds, the approximate Newton’s direction can be computed as with solving the following linear system


Then the corresponding Newton’s update is defined as


where is a suitable step size.

Input : Samples , an MLP architecture , and initial weights ;
Output: Accumulation point ;
Step 1: For , feed samples through the MLP to compute and for ;
Step 2: Compute           (1) ;           (2) ;           (3) ;
Step 3: For and , compute           (1) ;           (2) ;
Step 4: For , compute the gradient
Step 5: Compute the approximate Hessian according to Eq. (37) ;
Step 6: Compute the Newton update , where solves           ;
Step 7: Repeat from Step 1 until convergence;
Algorithm 2 An approximate Newton’s algorithm for supervised learning (batch learning).
((a)) Convergence with exact learning
((b)) Impact of vanishing gradient
((c)) Convergence without exact learning
Figure 1: Investigation of an approximate Newton’s algorithm, a.k.a. the Generalised Gauss-Newton algorithm.
Remark 6.

The approximate Hessian proposed in Eq. (37) is by construction positive semi-definite at arbitrary weights, while definiteness of the exact Hessian is not conclusive. It is trivial to see that the approximate Hessian coincides with the ground-truth Hessian as Eq. (28) at global minima. Hence, when , the corresponding approximate Newton’s algorithm induced by the update rule in Eq. (40) shares the same local quadratic convergence properties to a global minimum as the exact Newton’s method (see Section 6).

In general, computing the approximate Newton’s update as defined in Eq. (38) can be computationally expensive. Interestingly, the approximate Newton’s algorithm is indeed the state of the art GGN algorithm developed in [29]. Efficient implementations of the GGN algorithm have been extensively explored in [30, 43, 44]. In the next section, we investigate the theoretical convergence properties of the approximate Newton’s algorithm, i.e., the GGN algorithm.

6 Numerical evaluation

In this section, we investigate performance of the Approximate Newton’s (AN) algorithm, i.e., the GGN algorithm. We test the algorithm on the four regions classification benchmark, as originally proposed in [45]. In around the origin, we have a square area , and three concentric circles with their radiuses being , , and . Four regions/classes are interlocked and nonconvex, see [45] for further details of the benchmark. Samples are drawn in the box for training with their corresponding outputs being the classes .

We firstly demonstrate theoretical convergence properties of the AN/GGN algorithm with , by deploying an MLP architecture . Dummy units are introduced in both the input and hidden layer. Activation functions in the hidden layer are chosen to be the Bend identity, and the error function is the squared loss. With a set of randomly drawn training samples, we run the AN/GGN algorithm from randomly initialised weights. The convergence is measured by the distance of the accumulation point to all iterates , i.e., by an extension of the Frobenius norm of matrices to collections of matrices as