Understanding the Loss Surface of Neural Networks for Binary Classification

02/19/2018 · by Shiyu Liang, et al. · Facebook · University of Illinois at Urbana-Champaign

It is widely conjectured that training algorithms for neural networks are successful because all local minima lead to similar performance; see, for example, (LeCun et al., 2015; Choromanska et al., 2015; Dauphin et al., 2014). Performance is typically measured in terms of two metrics: training performance and generalization performance. Here we focus on the training performance of single-layered neural networks for binary classification, and provide conditions under which the training error is zero at all local minima of a smooth hinge loss function. Our conditions are roughly of the following form: the neurons have to be strictly convex and the surrogate loss function should be a smooth version of the hinge loss. We also provide counterexamples to show that when the loss function is replaced with the quadratic loss or the logistic loss, the result may not hold.


1 Introduction

Local search algorithms like stochastic gradient descent [4] and its variants have been hugely successful in training deep neural networks (see, for example, [5]; [6]; [7]). Despite the spurious saddle points and local minima on the loss surface [3], it has been widely conjectured that all local minima of the empirical loss lead to similar training performance [1, 2]. For example, [8] empirically showed that neural networks with identical architectures but different initialization points converge to local minima with similar classification performance. However, it remains a challenge to characterize the theoretical properties of the loss surface of neural networks.

In the setting of regression problems, theoretical justifications have been established to support the conjecture that all local minima lead to similar training performance. For shallow models, [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20] provide conditions under which local search algorithms are guaranteed to converge to the globally optimal solution of the regression problem. For deep linear networks, it has been shown that every local minimum of the empirical loss is a global minimum [21, 22, 23, 24, 25]. In order to characterize the loss surface of more general deep networks for regression tasks, [2] proposed an interesting approach: based on certain constructions on network models and additional assumptions, they relate the loss function to a spin glass model and show that almost all local minima have similar empirical loss and that the number of bad local minima decreases quickly with the distance to the global optimum. Despite these interesting results, it remains a concern to properly justify their assumptions. More recently, it has been shown [26, 27] that, when the dataset satisfies certain conditions, if one layer in the multilayer network has more neurons than the number of training samples, then a subset of local minima are global minima.

Although the loss surfaces arising in regression tasks have been well studied, the theoretical understanding of loss surfaces in classification tasks is still limited. [27, 28, 29] treat the classification problem as a regression problem by using the quadratic loss, and show that (almost) all local minima are global minima. However, the global minimum of the quadratic loss does not necessarily have zero misclassification error even in the simplest cases; for example, every global minimum of the quadratic loss can have non-zero misclassification error even when the dataset is linearly separable and the network is a linear network. This issue was mentioned in [26], where a different loss function was used, but their result only covers the linearly separable case and a subset of the critical points.

In view of the prior work, the context and contributions of our paper are as follows:

  • Prior work on quadratic and related loss functions suggests that one can achieve zero misclassification error at all local minima by over-parameterizing the neural network. The reason for over-parameterization is that the quadratic loss function tries to match the output of the neural network to the label of each training sample.

  • On the other hand, hinge loss-type functions only try to match the sign of the outputs with the labels, so it may be possible to achieve zero misclassification error without over-parameterization. We provide conditions under which the misclassification error of neural networks is zero at all local minima for hinge-loss-type functions.

  • Our conditions are roughly of the following form: the neurons have to be increasing and strictly convex, the neural network should either be single-layered or be multi-layered with a shortcut-like connection, and the surrogate loss function should be a smooth version of the hinge loss function.

  • We also provide counterexamples to show that when these conditions are relaxed, the result may not hold.

  • We establish our results under the assumption that either the dataset is linearly separable or the positively and negatively labeled samples are located on different subspaces. Whether this assumption is necessary is an open problem, except in the case of certain special neurons.

The outline of this paper is as follows. In Section 2, we present the necessary definitions. In Section 3, we present the main results, and in Section 4 we discuss each of the conditions. Conclusions are presented in Section 5. All proofs are provided in the Appendix.

2 Preliminaries

Network models. Given an input vector x of dimension d, we consider a neural network with L layers for binary classification. We denote by n_l the number of neurons in the l-th layer (note that n_0 = d and n_L = 1). We denote the neuron activation function by σ. Let W_l denote the weight matrix connecting the (l-1)-th layer to the l-th layer and b_l denote the bias vector for the neurons in the l-th layer. Therefore, the output of the network can be expressed as

f(x; θ) = W_L σ( W_{L-1} σ( ... σ( W_1 x + b_1 ) ... ) + b_{L-1} ) + b_L,

where θ denotes all parameters in the neural network.
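As a concrete reference, the sketch below computes such an output in NumPy; the argument names (weights, biases, sigma) are our own illustrative choices, and the last layer is taken to be linear with a single output unit, as in the binary-classification setup above.

```python
import numpy as np

def forward(x, weights, biases, sigma=np.tanh):
    """Feedforward pass: h_0 = x, h_l = sigma(W_l h_{l-1} + b_l), linear scalar output."""
    h = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigma(W @ h + b)                       # hidden layers
    return float(weights[-1] @ h + biases[-1])     # scalar output f(x; theta)

# Example: one hidden layer with 3 neurons on a 4-dimensional input.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 4)), rng.standard_normal(3)]
biases = [rng.standard_normal(3), 0.0]
print(forward(rng.standard_normal(4), weights, biases))
```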

Data distribution. In this paper, we consider binary classification tasks where each sample (x, y) is drawn from an underlying data distribution D defined on R^d × {-1, +1}. The sample is considered positive if y = +1, and negative otherwise. Let B denote a set of orthonormal basis vectors of the space R^d. Let B_+ and B_- denote two subsets of B such that all positive samples and all negative samples are located on the linear span of B_+ and of B_-, respectively. Let d_+ denote the size of the set B_+, d_- denote the size of the set B_-, and d_∩ denote the size of the set B_+ ∩ B_-, respectively.

Loss and error. Let S = {(x_i, y_i)}_{i=1}^n denote a dataset with n samples, each independently drawn from the distribution D. Given a neural network f parameterized by θ and a loss function ℓ in binary classification tasks (we note that, in regression tasks, the empirical loss is usually defined differently), we define the empirical loss L_n(θ) as the average loss of the network over the samples in the dataset S, i.e.,

L_n(θ) = (1/n) Σ_{i=1}^n ℓ( -y_i f(x_i; θ) ).

Furthermore, for a neural network f, we define a binary classifier of the form sign(f(x; θ)), where the sign function satisfies sign(z) = 1 if z >= 0 and sign(z) = -1 otherwise. We define the training error (also called the misclassification error) R_n(θ) as the misclassification rate of the neural network on the dataset S, i.e.,

R_n(θ) = (1/n) Σ_{i=1}^n 1{ sign(f(x_i; θ)) != y_i },

where 1{·} is the indicator function. The training error measures the classification performance of the network on the finite set of samples in the dataset S.
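The following sketch mirrors these two quantities in NumPy. The convention of evaluating the loss at -y·f(x) is our assumption here (chosen so that the conditions in Assumption 1 below read naturally), and the function names are illustrative.

```python
import numpy as np

def empirical_loss(f, theta, X, y, loss):
    """Average surrogate loss over the dataset, with the loss evaluated at z = -y * f(x)."""
    return float(np.mean([loss(-yi * f(xi, theta)) for xi, yi in zip(X, y)]))

def training_error(f, theta, X, y):
    """Misclassification rate: fraction of samples with sign(f(x; theta)) != y."""
    signs = np.array([1.0 if f(xi, theta) >= 0 else -1.0 for xi in X])
    return float(np.mean(signs != np.asarray(y)))
```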

3 Main Results

In this section, we present the main results. We first introduce several important conditions in order to derive the main results, and we will provide further discussions on these conditions in the next section.

3.1 Conditions

To fully specify the problem, we need to specify our assumptions on several components of the model, including: (1) the loss function, (2) the data distribution, (3) the network architecture and (4) the neuron activation function.

Assumption 1 (Loss function)

Let ℓ denote a loss function satisfying the following conditions: (1) ℓ is a surrogate loss function, i.e., ℓ(z) >= 1{z >= 0} for all z, where 1{·} denotes the indicator function; (2) ℓ has continuous derivatives up to a sufficiently high order on R; (3) ℓ is non-decreasing (i.e., ℓ'(z) >= 0 for all z), and there exists a positive constant c such that ℓ(z) = 0 if and only if z <= -c.

The first condition in Assumption 1 ensures that the training error is always upper bounded by the empirical loss, i.e., R_n(θ) <= L_n(θ). This guarantees that the neural network correctly classifies all samples in the dataset (i.e., R_n(θ) = 0) whenever it achieves zero empirical loss (i.e., L_n(θ) = 0). The second condition ensures that the empirical loss has continuous derivatives with respect to the parameters up to a sufficiently high order. The third condition ensures that the loss function is non-decreasing and that zero loss is achievable if and only if z <= -c. A simple example of a loss function satisfying all conditions in Assumption 1 is the polynomial hinge loss. We note that, in this paper, we write the empirical loss as L_n(θ) with the understanding that it depends on the choice of the loss function ℓ and on the parameters θ of the network. Further results on the impact of loss functions are presented in Section 4.
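One loss of this type is sketched below; the paper's exact polynomial hinge loss and its exponent are not reproduced here, so this is only an instance satisfying the listed conditions, again under the -y·f(x) convention assumed earlier.

```python
def poly_hinge(z, p=3):
    """A polynomial hinge-type surrogate: ell(z) = max(0, 1 + z)**p, evaluated at z = -y*f(x).

    It upper-bounds the 0-1 indicator (ell(z) >= 1 whenever z >= 0), has continuous
    derivatives up to order p - 1, is non-decreasing, and equals zero exactly when z <= -1.
    """
    return max(0.0, 1.0 + z) ** p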

Assumption 2 (Data distribution)

Assume that, for random vectors independently drawn from the conditional distribution of the positive samples and random vectors independently drawn from the conditional distribution of the negative samples, the matrices collecting their coordinates in the bases B_+ and B_-, respectively, are full-rank matrices with probability one.

Assumption 2 states that the support of the conditional distribution is sufficiently rich so that samples drawn from it will be linearly independent. In other words, by stating this assumption, we are avoiding trivial cases where all the positively labeled points are located in a very small subset of the linear span of B_+, and similarly for the negatively labeled samples.

Assumption 3 (Data distribution)

Assume that span(B_+) != span(B_-), i.e., the positively and negatively labeled samples are not located on the same linear subspace.

Assumption 3 states that the positive and negative samples are not located on the same linear subspace. Previous works [30, 31, 32] have observed that some classes of natural images (e.g., images of faces, handwritten digits, etc.) can be reconstructed from lower-dimensional representations. For example, using dimensionality reduction methods such as PCA, one can approximately reconstruct the original image from only a small number of principal components [30, 31]. Here, Assumption 3 states that both the positively and the negatively labeled samples have lower-dimensional representations, and that they do not lie in the same lower-dimensional subspace. We provide additional analysis in Section 4, showing how our main results generalize to other data distributions.
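For intuition, here is a small NumPy sketch of a distribution of this kind; the dimensions and the choice of disjoint basis subsets are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_pos, d_neg = 20, 6, 6                          # ambient and subspace dimensions (illustrative)

Q, _ = np.linalg.qr(rng.standard_normal((d, d)))    # an orthonormal basis of R^d
B_pos, B_neg = Q[:, :d_pos], Q[:, -d_neg:]          # disjoint basis subsets -> different subspaces

def sample(n_pos, n_neg):
    """Positive samples in span(B_pos), negative samples in span(B_neg)."""
    X_pos = rng.standard_normal((n_pos, d_pos)) @ B_pos.T
    X_neg = rng.standard_normal((n_neg, d_neg)) @ B_neg.T
    X = np.vstack([X_pos, X_neg])
    y = np.concatenate([np.ones(n_pos), -np.ones(n_neg)])
    return X, y

X, y = sample(5, 5)
# Gaussian coordinates are full rank with probability one (in the spirit of Assumption 2),
# and the two subspaces are distinct since the basis subsets are disjoint (Assumption 3).
assert np.linalg.matrix_rank(X[y > 0] @ B_pos) == min(5, d_pos)
```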

Figure 1: (a) The identity shortcut connection adopted in the residual network [33]. (b) The shortcut-like connection adopted in this paper.
Assumption 4 (Network architecture)

Assume that the neural network is a single-layered neural network or, more generally, has a shortcut-like connection as shown in Fig. 1(b), where g is a single-layer network and h is a feedforward network.

Shortcut connections are widely used in modern network architectures (e.g., Highway Networks [34], ResNet [33], DenseNet [35], etc.), where the skip connections allow the deep layers to have direct access to the outputs of shallow layers. For instance, in the residual network, each residual block has an identity shortcut connection, shown in Fig. 1(a), where the output of each residual block is the vector sum of its input and the output of a subnetwork.

Instead of using the identity shortcut connection, in this paper we first pass the input through a single-layer network g, where a weight vector collects its output weights, a weight matrix collects its input weights, and θ_g denotes the vector containing all parameters in g. We then add the output of this network to the output of a network h and use the sum as the output of the whole network, i.e., f(x; θ) = g(x; θ_g) + h(x; θ_h), where θ_h and θ denote the vectors containing all parameters in the network h and in the whole network f, respectively. We note that we do not restrict the number of layers and neurons in the network h: it can be a feedforward network as introduced in Section 2, a single-layer network, or even a constant. In fact, when the network h is a single-layer network or a constant, the whole network reduces to a single-layer network. Furthermore, we will show in Section 4 that if we remove this connection, or replace this shortcut-like connection with the identity shortcut connection, the main result does not hold.
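A minimal sketch of this architecture is given below; the names g, h, w, W and the bias term c are our own notation (the bias in particular is an assumption), and softplus stands in for any activation satisfying Assumption 5 below.

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def g(x, w, W, c):
    """Single-layer branch: g(x) = w^T sigma(W x + c)."""
    return float(w @ softplus(W @ x + c))

def f(x, theta_g, h, theta_h):
    """Whole network of Fig. 1(b): the single-layer output plus the output of h."""
    return g(x, *theta_g) + h(x, theta_h)

# h can be any feedforward network, a single-layer network, or even a constant:
h_const = lambda x, theta: theta                  # constant branch -> f reduces to a single-layer net
x = np.ones(4)
theta_g = (np.ones(3), np.eye(3, 4), np.zeros(3))
print(f(x, theta_g, h_const, 0.5))
```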

Assumption 5 (Neuron activation)

Assume that neurons in the network g are real analytic and satisfy σ''(z) > 0 for all z in R. Assume that neurons in the network h are real functions on R.

In Assumption 5, we assume that neurons in the network g are infinitely differentiable and have positive second-order derivatives on R, while neurons in the network h can be arbitrary real functions. We make these assumptions to ensure that the loss function is partially differentiable with respect to the parameters in the network g up to a sufficiently high order, which allows us to use Taylor expansions in the analysis. A few neurons that can be used in the network g are the softplus neuron, i.e., σ(z) = log(1 + e^z), the quadratic neuron, i.e., σ(z) = z^2, etc. We note that neurons in the networks g and h do not need to be of the same type, so a more general class of neurons can be used in the network h, e.g., the threshold neuron, the rectified linear unit σ(z) = max(z, 0), the sigmoid neuron σ(z) = 1/(1 + e^{-z}), etc. Further discussion of the effects of neurons on the main results is provided in Section 4.
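Before moving on, here is a quick numerical check of the positive-curvature requirement on the neurons of g (a sketch; the evaluation grid is arbitrary).

```python
import numpy as np

z = np.linspace(-5.0, 5.0, 101)

# Second derivatives of two admissible activations for the single-layer branch g:
softplus_dd  = np.exp(z) / (1.0 + np.exp(z)) ** 2   # sigma(z) = log(1 + e^z) -> sigma''(z) = e^z / (1 + e^z)^2
quadratic_dd = np.full_like(z, 2.0)                  # sigma(z) = z^2          -> sigma''(z) = 2

assert np.all(softplus_dd > 0) and np.all(quadratic_dd > 0)   # strictly positive curvature on R
```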

3.2 Main Results

We now present the following theorem, which shows that when Assumptions 1-5 are satisfied, every local minimum of the empirical loss function has zero training error if the number of neurons in the network g is chosen appropriately.

Theorem 1 (Linear subspace data)

Suppose that assumptions 1-5 are satisfied. Assume that samples in the dataset are independently drawn from the distribution . Assume that the number of neurons in the network satisfies , where . If is a local minimum of the loss function and , then holds with probability one.

Remark: (i) By setting the network h to a constant, it follows directly from Theorem 1 that if a single-layer network consists of neurons satisfying Assumption 5 and all other conditions in Theorem 1 are satisfied, then every local minimum of the empirical loss has zero training error. (ii) The positiveness of the quantity appearing in the neuron-count bound is guaranteed by Assumption 3. In the worst case, the number of neurons needs to be at least as large as the number of samples. However, when the two orthonormal basis sets B_+ and B_- differ significantly, the number of neurons required by Theorem 1 can be significantly smaller than the number of samples. In fact, we can show that, when the neuron has the quadratic activation function σ(z) = z^2, the assumption can be further relaxed so that the number of neurons is independent of the number of samples. We discuss this in the following proposition.

Proposition 1

Assume that assumptions 1-5 are satisfied. Assume that samples in the dataset are independently drawn from the distribution . Assume that neurons in the network satisfy and the number of neurons in the network satisfies . If is a local minimum of the loss function and , then holds with probability one.

Remark: Proposition 1 shows that if the number of neurons is greater than the dimension of the subspace, then every local minimum of the empirical loss function has zero training error. We note that although the result is stronger with quadratic neurons, this does not imply that the quadratic neuron has an advantage over other types of neurons (e.g., the softplus neuron). This is because, when the neuron has positive derivatives on R, the result in Theorem 1 extends to datasets where positive and negative samples are linearly separable; we provide the formal statement of this result in Theorem 2. However, when the neuron has a quadratic activation function, the result in Theorem 1 may not hold for linearly separable datasets, and we illustrate this with a counterexample in the next section.

As shown in Theorem 1, when the data distribution satisfies Assumptions 2 and 3, every local minimum of the empirical loss has zero training error. However, distributions satisfying these two assumptions need not be linearly separable. Therefore, to provide a complementary result to Theorem 1, we consider the case where the data distribution is linearly separable. Before presenting the result, we first state the following assumption on the data distribution.

Assumption 6 (Linear separability)

Assume that there exists a vector such that the data distribution satisfies .

In Theorem 2, we show that when the samples drawn from the data distribution are linearly separable and the network has a shortcut-like connection as shown in Figure 1(b), all local minima of the empirical loss function have zero training error if the type of neuron in the network g is chosen appropriately.

Theorem 2 (Linearly separable data)

Suppose that the loss function satisfies Assumption 1 and the network architecture satisfies Assumption 4. Assume that samples in the dataset are independently drawn from a distribution satisfying Assumption 6. Assume that the single layer network has neurons and neurons in the network are twice differentiable and satisfy for all . If is a local minimum of the loss function , , then holds with probability one.

Remark: Similar to Proposition 1, Theorem 2 does not require the number of neurons to scale with the number of samples. In fact, we make a weaker assumption here: the single-layer network g only needs to have at least one neuron, in contrast to the number of neurons required by Proposition 1. Furthermore, we note that, in Theorem 2, we assume that neurons in the network g have positive derivatives on R. This implies that Theorem 2 may not hold for a subset of the neurons considered in Theorem 1 (e.g., the quadratic neuron). We provide further discussion of the effects of neurons in the next section.

So far, we have provided results showing that under certain constraints on the (1) neuron activation function, (2) network architecture, (3) loss function and (4) data distribution, every local minimum of the empirical loss function has zero training error. In the next section, we will discuss the implications of these conditions on our main results.

4 Discussions

In this section, we discuss the effects of the (1) neuron activation, (2) shortcut-like connections, (3) loss function and (4) data distribution on the main results, respectively. We show that the result may not hold if these assumptions are relaxed.

4.1 Neuron Activations

To begin with, we discuss whether the results in Theorems 1 and 2 still hold if we vary the neuron activation function in the single-layer network g. Specifically, we consider the following five classes of neurons: (1) the softplus class, (2) the rectified linear unit (ReLU) class, (3) the leaky rectified linear unit (Leaky-ReLU) class, (4) the quadratic class and (5) the sigmoid class. For each class of neurons, we show whether the main results hold and provide counterexamples when certain conditions in the main results are violated. We summarize our findings in Table 1 and visualize some activation functions from these five classes in Fig. 2(a).

Figure 2: (a) Five types of neuron activations: the softplus neuron, ReLU, Leaky-ReLU, the sigmoid neuron and the quadratic neuron. (b) Four types of surrogate loss functions: the binary (0-1) loss, the polynomial hinge loss, the square loss and the logistic loss. Definitions of all neurons can be found in Section 4.1.

Softplus class contains neurons with real analytic activation functions σ satisfying σ'(z) > 0 and σ''(z) > 0 for all z in R. A widely used neuron in this class is the softplus neuron, σ(z) = log(1 + e^z), which is a smooth approximation of the ReLU. Neurons in this class satisfy the assumptions of both Theorem 1 and Theorem 2, so both theorems hold for neurons in this class.

ReLU class contains neurons whose activation satisfies σ(z) = 0 for all z < 0 and is piecewise continuous on R. Commonly adopted neurons in this class include threshold units, rectified linear units (ReLU), i.e., σ(z) = max(z, 0), and rectified quadratic units (ReQU), i.e., σ(z) = max(z, 0)^2. Neurons in this class satisfy the assumptions of neither Theorem 1 nor Theorem 2. In Proposition 2, we show that when the single-layer network g consists of neurons in the ReLU class, even if all other conditions in Theorem 1 or 2 are satisfied, the empirical loss function can have a local minimum with non-zero training error.

Proposition 2

Suppose that Assumptions 1 and 4 are satisfied. Assume that the activation of the neurons in the network g satisfies σ(z) = 0 for all z < 0 and is piecewise continuous on R. Then there exists a network architecture and a distribution satisfying the assumptions in Theorem 1 or 2 such that, with probability one, the empirical loss has a local minimum with non-zero training error, where the error is determined by n_+ and n_-, the numbers of positive and negative samples, respectively.

Remark: (i) We note that the above result holds in the over-parameterized case, where the number of neurons in the network g is larger than the number of samples in the dataset; in fact, all counterexamples shown in Section 4.1 hold in the over-parameterized case. (ii) Applying the same analysis, we can generalize the above result to a larger class of neurons, namely those whose activation is constant on the negative half-line and piecewise continuous on R. (iii) We note that the training error is strictly non-zero whenever the dataset contains both positive and negative samples, and this occurs with at least a fixed positive probability.

Theorem    Softplus   ReLU   Leaky-ReLU   Sigmoid   Quadratic
1          Yes        No     No           No        Yes
2          Yes        No     No           No        No
Table 1: Whether Theorems 1 and 2 hold for all neurons in each class. The definition of each class can be found in Section 4.1.

Leaky-ReLU class contains neurons whose activation satisfies σ(z) = z for all z >= 0 and is piecewise continuous on R. Commonly used neurons in this class include the ReLU, σ(z) = max(z, 0); the leaky rectified linear unit (Leaky-ReLU), σ(z) = z for z >= 0 and σ(z) = az for z < 0 and some constant a; and the exponential linear unit (ELU), σ(z) = z for z >= 0 and σ(z) = a(e^z - 1) for z < 0 and some constant a. None of the neurons in this class satisfy the assumptions in Theorem 1, while some neurons in this class satisfy the condition in Theorem 2 (e.g., the linear neuron σ(z) = z) and some do not (e.g., ReLU). In Proposition 2, we provided a counterexample showing that Theorem 2 does not hold for some neurons in this class (e.g., ReLU). Next, we present the following proposition to show that when the network g consists of neurons in the Leaky-ReLU class, even if all other conditions in Theorem 1 are satisfied, the empirical loss function has a local minimum with non-zero training error with high probability.

Proposition 3

Suppose that Assumptions 1 and 4 are satisfied. Assume that the activation of the neurons in the network g satisfies σ(z) = z for all z >= 0 and is piecewise continuous on R. Then there exists a network architecture and a distribution satisfying the assumptions in Theorem 1 such that, with high probability, the empirical loss has a local minimum with non-zero training error.

Remark: We note that, applying the same proof, we can generalize the above result to a larger class of neurons, namely those for which there exist two scalars a and b such that σ(z) = az + b for all z >= 0 and σ is piecewise continuous on R. In addition, we note that the ReLU neuron (but not every neuron in the ReLU class) satisfies the definitions of both the ReLU class and the Leaky-ReLU class, and therefore both Propositions 2 and 3 hold for the ReLU neuron.

Sigmoid class contains neurons for which σ(z) + σ(-z) is constant on R. A few commonly adopted neurons in this class are the sigmoid neuron, σ(z) = 1/(1 + e^{-z}), the hyperbolic tangent neuron, σ(z) = tanh(z), the arctangent neuron, σ(z) = arctan(z), and the softsign neuron, σ(z) = z/(1 + |z|). We note that all real odd functions (a real function σ is odd if σ(-z) = -σ(z) for all z) satisfy the condition of the sigmoid class.
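For reference, the standard formulas for these activations are sketched below; the symmetry check encodes our reading of the class condition (σ(z) + σ(-z) constant), which all of the listed neurons and all odd functions satisfy.

```python
import numpy as np

# The usual formulas for the neurons named above.
sigmoid  = lambda z: 1.0 / (1.0 + np.exp(-z))
tanh     = np.tanh
arctan   = np.arctan
softsign = lambda z: z / (1.0 + np.abs(z))

z = np.linspace(-4.0, 4.0, 9)
for name, act in [("sigmoid", sigmoid), ("tanh", tanh), ("arctan", arctan), ("softsign", softsign)]:
    # Each satisfies act(z) + act(-z) = const (1 for the sigmoid, 0 for the odd functions).
    print(name, np.allclose(act(z) + act(-z), act(0.0) * 2.0))
```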

None of the above neurons satisfy the assumptions in Theorem 1, since neurons in this class either fail to satisfy σ''(z) > 0 for all z or are not twice differentiable. For Theorem 2, some neurons in this class satisfy its condition (e.g., the sigmoid neuron) and some do not (e.g., the constant neuron σ(z) = c for all z). In Proposition 2, we provided a counterexample showing that Theorem 2 does not hold for some neurons in this class (e.g., the constant neuron). Next, we present the following proposition, which shows that when the network g consists of neurons in the sigmoid class, there always exists a data distribution satisfying the assumptions in Theorem 1 such that, with positive probability, the empirical loss has a local minimum with non-zero training error.

Proposition 4

Suppose that Assumptions 1 and 4 are satisfied. Assume that there exists a constant c such that the neurons in the network g satisfy the sigmoid-class condition above for all z (i.e., σ(z) + σ(-z) = c). Assume that the dataset has n samples. Then there exists a network architecture and a distribution satisfying the assumptions in Theorem 1 such that, with positive probability, the empirical loss function has a local minimum with non-zero training error, where the error depends on n_+ and n_-, the numbers of positive and negative samples in the dataset, respectively.

Remark: Proposition 4 shows that when the network g consists of neurons in the sigmoid class, even if all other conditions are satisfied, the result in Theorem 1 fails to hold with positive probability.

Quadratic class contains neurons whose activation σ is real analytic and strongly convex on R and attains a global minimum. A simple example of a neuron in this class is the quadratic neuron, σ(z) = z^2. It is easy to check that all neurons in this class satisfy the conditions in Theorem 1 but not those in Theorem 2. For Theorem 2, we present a counterexample showing that, when the network g consists of neurons in the quadratic class, even if the positive and negative samples are linearly separable, the empirical loss can have a local minimum with non-zero training error.

Proposition 5

Suppose that Assumptions 1 and 4 are satisfied. Assume that the activation of the neurons in g is strongly convex and twice differentiable on R and attains a global minimum. Then there exists a network architecture and a distribution satisfying the assumptions in Theorem 2 such that, with probability one, the empirical loss has a local minimum with non-zero training error, where the error depends on n_+ and n_-, the numbers of positive and negative samples in the dataset, respectively.

4.2 Shortcut-like Connections

In this subsection, we discuss whether the main results still hold if we remove the shortcut-like connections or replace them with the identity shortcut connections used in the residual network [33]. Specifically, we provide two counterexamples showing that the main results do not hold in either case.

Feed-forward networks. When the shortcut-like connection (i.e., the single-layer network g in Figure 1(b)) is removed, the architecture reduces to a standard feedforward neural network. We provide a counterexample showing that, for a feedforward network with ReLU neurons, even if the other conditions in Theorem 1 or 2 are satisfied, the empirical loss function can have a local minimum with non-zero training error. In other words, neither Theorem 1 nor Theorem 2 holds when the shortcut-like connection is removed.

Proposition 6

Suppose that Assumption 1 is satisfied. Assume that the feedforward network has at least one hidden layer and at least one neuron in each hidden layer. If the activation of the neurons in the network satisfies σ(z) = 0 for all z < 0 and is continuous on R, then for any dataset with n samples, the empirical loss has a local minimum whose training error is determined by n_+ and n_-, the numbers of positive and negative samples in the dataset, respectively.

Remark: The result holds for ReLUs, since it is easy to check that the ReLU neuron satisfies the above assumptions.

Identity shortcut connections. As stated earlier, adding shortcut-like connections to a network can improve the loss surface. However, the shortcut-like connection shown in Fig. 1(b) differs from some popular shortcut connections used in real-world applications, e.g., the identity shortcut connections in the residual network. Thus, a natural question arises: do the main results still hold if we use identity shortcut connections? To address this question, we provide the following counterexample, which shows that when we replace the shortcut-like connection with identity shortcut connections, even if the other conditions in Theorem 1 are satisfied, the empirical loss function can have a local minimum with non-zero training error. In other words, Theorem 1 does not hold for identity shortcut connections.

Proposition 7

Assume that is a feedforward neural network parameterized by and all neurons in are ReLUs. Define a network with identity shortcut connections as , . Then there exists a distribution satisfying the assumptions in Theorem 1 such that with probability at least , the empirical loss has a local minimum with non-zero training error.

4.3 Loss Functions

In this subsection, we discuss whether the main results still hold if we change the loss function. We mainly focus on the following two types of surrogate loss functions: quadratic loss and logistic loss. We will show that if the loss function is replaced with the quadratic loss or logistic loss, then neither Theorem 1 nor 2 holds. In addition, we show that when the loss function is the logistic loss and the network is a feedforward neural network, there are no local minima with zero training error in the real parameter space. In Fig. 2(b), we visualize some surrogate loss functions discussed in this subsection.

Quadratic loss. The quadratic loss has been well studied in prior work. It has been shown that, when the loss function is quadratic, under certain assumptions all local minima of the empirical loss are global minima. However, a global minimum of the quadratic loss does not necessarily have zero misclassification error, even in the realizable case (i.e., the case where there exists a set of parameters such that the network achieves zero misclassification error on the dataset or the data distribution). To illustrate this, we provide a simple example where the network is a simple linear model and the data distribution is linearly separable.

Example 1

Consider a linearly separable data distribution in which one of the classes is uniformly distributed on an interval. For a linear model f(x) = wx + b, every global minimum of the population quadratic loss has non-zero misclassification error.

Remark: The proof of the above result, given in Appendix B.7, is very straightforward. We provide it only because we were unable to find a reference that explicitly states such a result, and we would not be surprised if it is already known. This example shows that every global minimum of the quadratic loss has non-zero misclassification error, although the linear model is able to achieve zero misclassification error on this data distribution. Similarly, one can easily construct datasets for which all global minima of the quadratic loss have non-zero training error.
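The same phenomenon is easy to reproduce numerically. The dataset below is our own illustrative construction (not the distribution of Example 1): it is linearly separable, yet the least-squares fit of a linear model, which is the global minimizer of the quadratic loss, misclassifies a training point.

```python
import numpy as np

# Linearly separable 1-D dataset: 10 negatives at x = -1, one positive at x = 0.5,
# 10 positives at x = 20. The classifier sign(x) makes no mistakes on it.
x = np.concatenate([-np.ones(10), [0.5], 20.0 * np.ones(10)])
y = np.concatenate([-np.ones(10), [1.0], np.ones(10)])

A = np.column_stack([x, np.ones_like(x)])            # linear model f(x) = w*x + b
w, b = np.linalg.lstsq(A, y, rcond=None)[0]          # global minimum of the quadratic loss

print(np.sign(w * 0.5 + b))   # -1: the positive sample at x = 0.5 is misclassified
```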

In addition, we provide two examples in Appendix B.8 showing that, when the loss function is replaced with the quadratic loss, even if the other conditions in Theorem 1 or 2 are satisfied, every global minimum of the empirical loss has a training error larger than a fixed positive value with positive probability. In other words, our main results do not hold for the quadratic loss.

The following observation may be of independent interest. In contrast to the quadratic loss, the loss functions satisfying the conditions in Assumption 1 have the following two properties: (i) the minimum empirical loss is zero if and only if there exists a set of parameters achieving zero training error; (ii) every global minimum of the empirical loss has zero training error in the realizable case.

Proposition 8

Let denote a feedforward network parameterized by and let the dataset have samples. When the loss function satisfies Assumption 1 and , we have if and only if . Furthermore, if , every global minimum of the empirical loss has zero training error, i.e., .

Remark: We note that the network does not need to be a feedforward network. In fact, the same results hold for a large class of network architectures, including both architectures shown in Fig 1. We provide additional analysis in Appendix B.9.

Logistic loss. The logistic loss differs from the loss functions satisfying the conditions in Assumption 1, since the logistic loss does not attain a global minimum on R. Here we show that, for the logistic loss, even if the remaining assumptions in Theorem 1 hold, every critical point is a saddle point; in other words, Theorem 1 does not hold for the logistic loss. Additional analysis for Theorem 2 is provided in Appendix B.11.

Proposition 9

Assume that the loss function is the logistic loss, i.e., . Assume that assumptions 2-5 are satisfied. Assume that samples in the dataset are independently drawn from the distribution . Assume that the number of neurons in the network satisfies , where . If denotes a critical point of the empirical loss , then is a saddle point. In particular, there are no local minima.

Remark: We note here that the result can be generalized to every loss function which is real analytic and has a positive derivative on .

Furthermore, we provide the following result to show that when the dataset contains both positive and negative samples, if the loss is the logistic loss, then every critical point of the empirical loss function has non-zero training error.

Proposition 10

Assume the dataset consists of both positive and negative samples. Assume that is a feedforward network parameterized by . Assume that the loss function is logistic, i.e., . If the real parameters denote a critical point of the empirical loss , then .

Remark: We provide the proof in Appendix B.12. The above proposition implies that every critical point is either a local minimum with non-zero training error or a saddle point (also with non-zero training error). We note that, similar to Proposition 9, the result can be generalized to every loss function that is differentiable and has a positive derivative on R.

4.4 Open Problem: Datasets

In this paper, we have mainly considered a class of non-linearly separable distributions where positive and negative samples are located on different subspaces. We showed that if the samples are drawn from such a distribution, then under certain additional conditions all local minima of the empirical loss have zero training error. However, one may ask: how well does the result generalize to other non-linearly separable distributions or datasets? Here, we partially answer this question by presenting the following necessary condition on the dataset for Theorem 1 to hold.

Proposition 11

Suppose that Assumptions 1, 4 and 5 are satisfied. For any feedforward architecture, every local minimum of the empirical loss function has zero training error only if the associated matrix is neither positive definite nor negative definite for all admissible sequences.

Remark: The proposition implies that when the dataset does not meet this necessary condition, there exists a feedforward architecture such that the empirical loss function has a local minimum with non-zero training error. We use this implication to prove the counterexamples provided in Appendix B.14 for the cases where Assumption 2 or 3 on the dataset is not satisfied; therefore, Theorem 1 no longer holds when Assumption 2 or 3 is removed. We note that the necessary condition shown here is not equivalent to Assumptions 2 and 3. We now present the following result, which gives a necessary and sufficient condition on the dataset for Proposition 1 to hold.

Proposition 12

Suppose that the loss function satisfies Assumption 1 and the neurons in the network satisfy Assumption 5. Assume that the neurons in the single-layer network g are quadratic neurons, i.e., σ(z) = z^2. For any network architecture h, every local minimum of the empirical loss function has zero training error if and only if the associated matrix is indefinite for all admissible sequences.

Remark: (i) This sufficient and necessary condition implies that for any network architecture , there exists a set of parameters such that the network can correctly classify all samples in the dataset. This also indicates the existence of a set of parameters achieving zero training error, regardless of the network architecture of . We provide the proof in Appendix B.15. (ii) We note that Proposition 12 only holds for the quadratic neuron. The problem of finding the sufficient and necessary conditions for the other types of neurons is open.

5 Conclusions

In this paper, we studied the loss surface of a smooth version of the hinge loss function for binary classification problems. We provided conditions under which the neural network has zero misclassification error at all local minima, and we also provided counterexamples showing that when some of these assumptions are relaxed, the result may not hold. Future work includes exploiting our results to design efficient training algorithms for classification tasks using neural networks.

References

Appendix A Additional Results in Section 3

A.1 Proof of Lemma 1

Lemma 1 (Necessary condition)

Assume that neurons in the network g are twice differentiable and that the loss function has continuous derivatives on R up to the third order. If θ* denotes a local minimum of the loss function L_n, then the following necessary condition holds.

Proof. We first recall some notation defined in the paper. The output of the neural network is f(x; θ) = g(x; θ_g) + h(x; θ_h), where g is the single-layer neural network parameterized by θ_g, as defined in Section 3, and h is a deep neural network parameterized by θ_h. The empirical loss function is given by L_n(θ) = (1/n) Σ_{i=1}^n ℓ(-y_i f(x_i; θ)). Since the loss function has continuous derivatives on R up to the third order and the neurons in the network g are twice differentiable, the gradient vector and the Hessian matrix of L_n with respect to the parameters of g exist. Furthermore, since θ* is a local minimum of the loss function L_n, the first-order condition must hold: the gradient of L_n with respect to these parameters vanishes at θ*. (1)

Now we need to prove that the stated necessary condition holds whenever θ* is a local minimum. We prove it by contradiction: assume that the condition fails for some neuron. Then, by equation (1), the corresponding first-order term vanishes. Next, consider the Hessian matrix of L_n restricted to the parameters of that neuron. Since θ* is a local minimum of the loss function L_n, this matrix must be positive semidefinite at θ*. The remainder of the argument evaluates the associated quadratic form for particular choices of the free parameters and derives a contradiction with positive semidefiniteness.