On the decision boundary of deep neural networks
While deep learning models and techniques have achieved great empirical success, our understanding of the source of that success remains very limited in many aspects. In an attempt to bridge the gap, we investigate the decision boundary of a production deep learning architecture with weak assumptions on both the training data and the model. We demonstrate, both theoretically and empirically, that the last weight layer of a neural network converges to a linear SVM trained on the output of the last hidden layer, for both the binary case and the multi-class case with the commonly used cross-entropy loss. Furthermore, we show empirically that training a neural network as a whole, instead of only fine-tuning the last weight layer, may result in a better bias constant for the last weight layer, which is important for generalization. In addition to facilitating the understanding of deep learning, our result can be helpful for solving a broad range of practical problems of deep learning, such as catastrophic forgetting and adversarial attacks. The experiment code is available at https://github.com/lykaust15/NN_decision_boundary
In recent years, deep learning has achieved impressive success in various fields RN353. Not only has it boosted the performance of the state-of-the-art methods in various areas, such as computer vision RN349; RN401, it has also enabled machines to achieve human-level intelligence in specific tasks RN400. Despite its great empirical success, deep learning is often criticized for being used as a black box RN402, which refers to the well-known gap between its empirical power and the theoretical understanding of it RN387.
As suggested by RN387, a satisfactory theoretical understanding of deep learning should cover three aspects: 1) representation power, 2) optimization characteristics, and 3) generalization property. The representation power of deep learning has been extensively and rigorously discussed in RN385. In terms of the second aspect, that is, the convergence analysis of stochastic gradient descent (SGD) and the property of the minima obtained, numerous recent studies have provided promising answers RN355; RN388; RN351; RN364; RN361; RN367; RN403; RN360. For example, RN355 proves the conjecture of RN342, extending the result to deep nonlinear neural networks and showing the nonexistence of poor local minima. RN358 also shows that all local minima are globally optimal, given reasonable assumptions. RN403; RN360 prove the convergence of SGD given assumptions on the input distribution.
As for the generalization mystery, the studies are still at an early stage. Through systematic experiments, RN385 suggests that although explicit regularization, such as weight decay and dropout, may be helpful, the implicit regularization of SGD may be the key to generalization. Following that direction, RN360 provides a generalization guarantee for over-parameterized networks trained by SGD on linearly separable data. RN359 shows that, for linearly separable data, gradient descent (GD) on an unregularized logistic regression problem results in the max-margin (hard margin SVM) solution. On the other hand, RN398; RN399 try to demystify the generalization property by deriving generalization bounds.
In this paper, we follow the direction of RN385; RN360; RN359, investigating the implicit bias of GD and SGD. Unlike the previous studies, we do not oversimplify the model architecture. In fact, the architecture, which is shown in Fig. 1, is a practical one, which can reach state-of-the-art performance on CIFAR-10 if we use DenseNet RN407 as the transformation function. Moreover, we impose few requirements on the input data distribution, only assuming that the loss converges to zero. In the Main Result section and the Experiments section, we show, both theoretically and empirically, that the direction of the neural network's last weight layer converges to that of the SVM solution trained on the transformed data in the transformed space. In the Experiments section, we also show that the decision boundary of the last layer is closer to the SVM decision boundary if we train the whole network instead of only fine-tuning the last layer. We extend our result to the multi-class classification problem with cross-entropy loss, which is the most common scenario in practice, on MNIST and CIFAR-10. Our study bridges the gap between the purely theoretical side, which investigates over-simplified models and has strict requirements for the input distribution, and the practical usage of complex deep learning models. In practice, the superior performance of deep learning is usually attributed to the model's ability to learn the representation and the classifier simultaneously. We demystify the relationship between the learned representation and the classifier, and characterize the learned classifier in particular.
Unlike the setting of previous studies RN359; RN360; RN403, which assume the training data is linearly separable or follows a certain distribution, we do not have such a requirement. Formally, for binary classification, we consider a dataset $\{x_n, y_n\}_{n=1}^{N}$, with $x_n \in \mathbb{R}^{d}$, and binary labels $y_n \in \{-1, 1\}$. We use $X \in \mathbb{R}^{d \times N}$ to denote the data matrix. For multi-class classification, we have $y_n \in \{1, \dots, K\}$ and $K$ is the number of classes.
Regarding the neural network model, we do not restrict ourselves to any specific architecture either. Consider a neural network with the architecture shown in Fig. 1, which is essentially a production network in practical use. We divide the neural network into four components. The original space and the label space are the training interface. The transformation function combined with the transformed space (the output of the last hidden layer) is one of the reasons why the performance of deep learning is continuously improving. For the sake of analysis, we take the transformed space as an independent component which is fully connected with the label space. Formally, we denote the output of the last hidden layer on example $x_n$ as $\delta_n$, with $\delta_n \in \mathbb{R}^{p}$.
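We denote the entire parameter set of the network as $\theta$. The network defines a function $f(x; \theta): \mathbb{R}^{d} \to \mathbb{R}$ for the binary case. The transformation function is $\delta_n = g(x_n; \phi)$, where $\phi$ is the parameter set of the transformation function. Notice that from $\delta_n$ to the final output, the last weight layer defines a linear transformation, which has the following form:
$$f(x_n; \theta) = W \delta_n, \qquad (1)$$
where $W \in \mathbb{R}^{K \times p}$, with $p$ the dimension of the transformed space, is the weight of the last layer (notice that for the binary case, $K = 1$). We use $w_k^{\top}$ to denote the $k$-th row of it. So, we have $W = [w_1, \dots, w_K]^{\top}$.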
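In general, the empirical loss over the training dataset has the following form:
$$L(\theta) = \sum_{n=1}^{N} \ell\big(y_n f(x_n; \theta)\big), \qquad (2)$$
where $\ell$ is the specified loss function (e.g., exponential loss, cross-entropy, …). For example, with the exponential loss, $\ell(u) = e^{-u}$, the empirical risk is given by
$$L(\theta) = \sum_{n=1}^{N} \exp\big(-y_n f(x_n; \theta)\big) = \sum_{n=1}^{N} \exp\big(-y_n W \delta_n\big), \qquad (3)$$
where the second expression emphasizes the last weight layer $W$ acting on the transformed data $\delta_n$.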
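For multi-class classification, the commonly used loss function is the cross-entropy loss:
$$L(\theta) = -\sum_{n=1}^{N} \log \frac{\exp\big(w_{y_n}^{\top} \delta_n\big)}{\sum_{k=1}^{K} \exp\big(w_k^{\top} \delta_n\big)}, \qquad (4)$$
where $w_k$ is the $k$-th component of W, which is the weight for a certain class $k$; $w_{y_n}$ is the component of W for the class represented by $y_n$.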
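As a minimal sketch of the two losses restricted to the last weight layer, the following pure-Python functions evaluate the exponential loss and the cross-entropy loss on already-transformed data. The function names, the toy features, and the 0-based class labels are assumptions made for illustration; they are not taken from the released experiment code.

```python
import math

def exponential_loss(W, deltas, labels):
    """Binary case: sum_n exp(-y_n * <W, delta_n>) for a single weight row W."""
    total = 0.0
    for delta, y in zip(deltas, labels):
        score = sum(w_i * d_i for w_i, d_i in zip(W, delta))
        total += math.exp(-y * score)
    return total

def cross_entropy_loss(W, deltas, labels):
    """Multi-class case: -sum_n log softmax(W delta_n)[y_n]; W is a list of K rows."""
    total = 0.0
    for delta, y in zip(deltas, labels):
        scores = [sum(w_i * d_i for w_i, d_i in zip(row, delta)) for row in W]
        m = max(scores)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_norm - scores[y]
    return total

# Hypothetical transformed features (outputs of the last hidden layer).
deltas = [(1.0, 2.0), (-1.0, 0.5)]

# Scaling a separating weight vector drives the exponential loss toward zero.
print(exponential_loss((1.0, 0.0), deltas, [1, -1]))
print(exponential_loss((10.0, 0.0), deltas, [1, -1]))

# With all-zero weights every class is equally likely.
print(cross_entropy_loss([(0.0, 0.0)] * 3, deltas, [0, 2]))
```

The second print illustrates why the loss can only reach zero in the limit of an unbounded weight norm, which is the setting the convergence results rely on.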
In this section, we start with the result in RN359 for linearly separable data in logistic regression and then obtain the result for the neural network in Fig. 1. Finally, we extend the result from the binary case to the multi-class case.
In RN359 , the authors investigate the following problem.
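Definition 1. For a logistic regression problem, whose weight vector is $w$, the loss has the following form:
$$L(w) = \sum_{n=1}^{N} \ell\big(y_n w^{\top} x_n\big).$$
For this binary case, assuming all the labels are positive (we can re-define $y_n x_n$ as $x_n$), the GD update for that loss function at iteration $t$, with step size $\eta$, has the following form:
$$w(t+1) = w(t) - \eta \nabla L\big(w(t)\big) = w(t) - \eta \sum_{n=1}^{N} \ell'\big(w(t)^{\top} x_n\big)\, x_n.$$
Lemma 1 (RN359). The iterate $w(t)$ finally diverges:
$$\lim_{t \to \infty} \|w(t)\| = \infty.$$
But the direction of the above solution converges to that of the hard margin SVM solution RN359.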
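Lemma 2 (RN359). For any dataset which is linearly separable, any $\beta$-smooth decreasing loss function with an exponential tail (the loss function tail is bounded by two exponential functions), any step size $\eta < 2\beta^{-1}\sigma_{\max}^{-2}(X)$ and any starting point $w(0)$, the gradient descent iterations will behave as:
$$w(t) = \hat{w} \log t + \rho(t),$$
where $\hat{w}$ is the max margin vector:
$$\hat{w} = \arg\min_{w} \|w\|^{2} \quad \text{s.t.} \quad w^{\top} x_n \geq 1 \;\; \forall n,$$
and the residual grows at most as $\|\rho(t)\| = O(\log \log t)$, and so
$$\lim_{t \to \infty} \frac{w(t)}{\|w(t)\|} = \frac{\hat{w}}{\|\hat{w}\|}.$$
Furthermore, for almost all datasets (all except a set of measure zero), the residual $\rho(t)$ is bounded.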
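The behaviour described by this result can be illustrated numerically. The sketch below runs plain GD on the exponential loss for a hypothetical three-point dataset whose hard-margin solution is easy to work out by hand, and compares the direction of the iterate with the max-margin direction. The dataset, step size, and iteration counts are assumptions chosen for illustration only.

```python
import math

# Toy linearly separable data with labels already absorbed (y_n * x_n -> x_n).
X = [(2.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
# Hard-margin solution of min ||w||^2 s.t. <w, x_n> >= 1, worked out by hand:
# the first two points are the support vectors, which gives w_hat = (0.5, 1.0).
W_HAT = (0.5, 1.0)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

def gradient_descent(steps, eta=0.1):
    """Plain GD on L(w) = sum_n exp(-<w, x_n>), started from the origin."""
    w = [0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0]
        for x in X:
            coeff = -math.exp(-dot(w, x))
            grad[0] += coeff * x[0]
            grad[1] += coeff * x[1]
        w = [w[0] - eta * grad[0], w[1] - eta * grad[1]]
    return w

w_early = gradient_descent(200)
w_late = gradient_descent(20000)
print("norms:", math.hypot(*w_early), math.hypot(*w_late))
print("cosine to w_hat:", cosine(w_early, W_HAT), cosine(w_late, W_HAT))
```

The norm keeps growing while the cosine similarity to the max-margin direction creeps toward one. The convergence of the direction is slow because the residual is only divided by a logarithmically growing factor.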
As for our problem, we have the following assumption:
Assumption 1. The loss in Equation (2) converges to zero: $\lim_{t \to \infty} L(\theta(t)) = 0$.
This assumption is reasonable. It can be satisfied as long as the data is linearly or non-linearly separable, contains no wrongly labeled data points, and the model has enough capacity, which is usually the case for deep learning models. Based on Assumption 1, we have the following lemma:
Lemma 3. Under Assumption 1, the transformed data $\{\delta_n, y_n\}_{n=1}^{N}$, that is, the output of the last hidden layer, is linearly separable.
In fact, since the last weight layer is a linear transformation, if the transformed data is not linearly separable, the classification error can never reach zero, let alone the loss. Following Definition 1, let us re-define $y_n \delta_n$ as $\delta_n$. Based on Lemma 2 and Lemma 3, we obtain the first main result:
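Theorem 1. For any neural network for binary classification, any $\beta$-smooth decreasing loss function with an exponential tail, small enough step size $\eta$ and any starting point $W(0)$, as long as $\lim_{t \to \infty} L(\theta(t)) = 0$, the direction of the neural network's last weight layer converges:
$$\lim_{t \to \infty} \frac{W(t)}{\|W(t)\|} = \frac{\hat{W}}{\|\hat{W}\|},$$
where $\hat{W}$ is the max margin vector:
$$\hat{W} = \arg\min_{W} \|W\|^{2} \quad \text{s.t.} \quad W \delta_n \geq 1 \;\; \forall n,$$
in which $\delta_n$ is the re-defined input of the last weight layer.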
It is true that the convergence of the transformation function can also affect the last layer decision boundary. However, since the loss converges to zero, the variance of the transformation function is bounded after long enough training time, which makes the theorem hold.
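As for the multi-class classification problem, we have the following lemma from RN359:
Lemma 4. For a logistic regression problem in which we learn a predictor $w_k$ for each class $k \in \{1, \dots, K\}$ in a linearly separable multi-class dataset, any starting point $w_k(0)$ and any small enough step size, under most circumstances (i.e., except for a set of measure zero), the iterates of gradient descent on the cross-entropy loss will behave as:
$$w_k(t) = \hat{w}_k \log t + \rho_k(t),$$
where the residual $\rho_k(t)$ is bounded and $\hat{w}_k$ is the solution of the K-class SVM:
$$\arg\min_{w_1, \dots, w_K} \sum_{k=1}^{K} \|w_k\|^{2} \quad \text{s.t.} \quad \forall n, \forall k \neq y_n: \; w_{y_n}^{\top} x_n \geq w_k^{\top} x_n + 1.$$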
Similar to Theorem 1, we can derive the following result for the multi-class case with cross-entropy loss.
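Theorem 2. For any neural network, small enough step size and any starting point $W(0)$, as long as the dataset makes $\lim_{t \to \infty} L(\theta(t)) = 0$, the iterates of gradient descent on the cross-entropy loss of the last weight layer W will behave as:
$$w_k(t) = \hat{w}_k \log t + \rho_k(t),$$
where the residual $\rho_k(t)$ is bounded, $w_k(t)$ is the weight for class $k$ at iteration $t$ and $\hat{w}_k$ is the solution of the $K$-class SVM:
$$\arg\min_{w_1, \dots, w_K} \sum_{k=1}^{K} \|w_k\|^{2} \quad \text{s.t.} \quad \forall n, \forall k \neq y_n: \; w_{y_n}^{\top} \delta_n \geq w_k^{\top} \delta_n + 1,$$
in which $\delta_n$ is the input of the last weight layer for example $n$.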
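The multi-class statement can be checked on a small example in the same spirit. The sketch below trains only a last weight layer with GD on the cross-entropy loss, using three hypothetical transformed points placed 120 degrees apart so that the K-class SVM solution is known by symmetry (each class weight equals 2/3 of its class point). The data, initialization, step size, and iteration count are assumptions for illustration.

```python
import math

# One transformed point per class, 120 degrees apart on the unit circle.
ANGLES = [0.0, 2.0 * math.pi / 3.0, 4.0 * math.pi / 3.0]
DELTAS = [(math.cos(a), math.sin(a)) for a in ANGLES]
LABELS = [0, 1, 2]
# By symmetry the K-class SVM solution is w_hat_k = (2/3) * delta_k:
# <w_hat_y, delta_y> - <w_hat_k, delta_y> = 2/3 + 1/3 = 1 for every k != y.
W_HAT = [(2.0 / 3.0 * d[0], 2.0 / 3.0 * d[1]) for d in DELTAS]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_last_layer(steps, eta=0.1):
    """GD on the cross-entropy loss, from a deliberately asymmetric start."""
    W = [[0.2, -0.1], [0.0, 0.0], [0.0, 0.0]]
    for _ in range(steps):
        grad = [[0.0, 0.0] for _ in W]
        for delta, y in zip(DELTAS, LABELS):
            probs = softmax([dot(row, delta) for row in W])
            for k in range(len(W)):
                coeff = probs[k] - (1.0 if k == y else 0.0)
                grad[k][0] += coeff * delta[0]
                grad[k][1] += coeff * delta[1]
        W = [[row[0] - eta * g[0], row[1] - eta * g[1]] for row, g in zip(W, grad)]
    return W

def min_margin(W):
    """Smallest gap between the true-class score and any other class score."""
    return min(dot(W[y], d) - dot(W[k], d)
               for d, y in zip(DELTAS, LABELS)
               for k in range(len(W)) if k != y)

W = train_last_layer(20000)
m = min_margin(W)
# Rescale so that the smallest margin equals one, as in the SVM constraints.
W_scaled = [[v / m for v in row] for row in W]
max_dev = max(math.hypot(ws[0] - wh[0], ws[1] - wh[1])
              for ws, wh in zip(W_scaled, W_HAT))
print("min margin:", m, "max deviation from SVM solution:", max_dev)
```

After rescaling by the minimum margin, the trained weights should sit close to the symmetric K-class SVM solution, with the remaining gap coming from the bounded residual.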
There are seven datasets in our experiments, including five simulated 2D datasets and two real datasets. The five simulated datasets are shown in Fig. 2 (A1-A5). The first three (Plate, Blob, and Sector) are linearly separable. The last two (Sector not separable and Moon) are only non-linearly separable. There are 5000 points within each simulated dataset. The two real datasets are MNIST RN409 and CIFAR-10 RN410. Since MNIST and CIFAR-10 are multi-class datasets, for the binary classification case we randomly chose two of the 10 classes from each of them. We used the network architecture in Fig. 1 for all the experiments. The only difference is the transformation function. We used a fully connected layer with 2000 nodes as the transformation function for the five simulated datasets, ResNet RN406 for MNIST, and DenseNet RN407 for CIFAR-10. For visualization purposes, we set the dimension of the transformed space to 2. We used cross-entropy loss as the loss function and ReLU as the activation function. For the multi-class classification problem, we set the number of nodes in the output layer equal to the number of classes. We used GD for the simulated datasets and SGD for MNIST and CIFAR-10. We turned off all the commonly used explicit regularizers, such as weight decay and dropout, for all the experiments.
The results are summarized in Fig. 2 (additional results can be found in the Appendices). The decision boundary of the neural networks in the original input space is shown in Fig. 2 (B1-B5). The green and black dots are the training data points. We sampled test data points uniformly across the whole space so that we can visualize the decision boundary of the trained neural networks. The blue points are the ones predicted by the model with the same label as the black training data, while the red points are the ones predicted with the same label as the green training data. The curve that separates the blue points from the red points can be considered the decision boundary of the network. Although it is difficult to gain insight from the original space, the transformed space is more interesting, as suggested by the analysis in the Main Result section. Fig. 2 (D1-D5) shows the training data and testing data in the transformed space. As a comparison, we trained a linear SVM with the transformed training data and labeled the same testing data points with the SVM classifier; the results are shown in Fig. 2 (C1-C5). As shown in the figure, the direction of the neural network's last layer decision boundary trained with GD converges to that of the linear SVM solution, which verifies Theorem 1. Furthermore, the two kinds of decision boundaries are very close to each other, not only in the direction but also in the constant bias term. We further discuss this phenomenon in the next subsection.
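The visualization procedure described above (sample test points uniformly, label them with a classifier, and compare the labelings) can be sketched as follows. The two linear classifiers stand in for a last weight layer and a linear SVM in a 2D transformed space; their weights and biases are hypothetical values, not results from the experiments.

```python
def uniform_grid(lo, hi, n):
    """n x n test points spread uniformly over the square [lo, hi]^2."""
    step = (hi - lo) / (n - 1)
    return [(lo + i * step, lo + j * step) for i in range(n) for j in range(n)]

def predict(point, weight, bias):
    """Linear decision rule: +1 on one side of the boundary, -1 on the other."""
    score = weight[0] * point[0] + weight[1] * point[1] + bias
    return 1 if score >= 0 else -1

def agreement(points, clf_a, clf_b):
    """Fraction of test points that both classifiers label identically."""
    same = sum(1 for p in points if predict(p, *clf_a) == predict(p, *clf_b))
    return same / len(points)

grid = uniform_grid(-1.0, 1.0, 51)
last_layer = ((3.0, 6.0), 1.5)   # hypothetical last-layer weight and bias
svm_like = ((0.5, 1.0), 0.25)    # same direction and boundary, smaller norm
shifted = ((0.5, 1.0), 0.75)     # same direction, different bias constant

print(agreement(grid, last_layer, svm_like))
print(agreement(grid, last_layer, shifted))
```

Two classifiers that differ only in the norm of the weight label every grid point identically, whereas a change in the bias constant moves the boundary and produces disagreements. This is why the comparison in the figures considers both the direction and the bias term.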
After training a residual network with the MNIST data, we mapped the data into the transformed space. Within that space, we sampled test data uniformly and labeled those test data points using the last layer of the network in Fig. 1, which results in the decision boundary in Fig. 3 (A). Utilizing the training data in the transformed space, we trained a linear SVM classifier and plotted out the decision boundary of that classifier in Fig. 3 (B). As shown in the figures, after mapping the data into the transformed space, the direction of the first decision boundary is very close to that of the second decision boundary, which further supports Theorem 1. Furthermore, with the transformation function fixed, we reinitialized the last layer and retrained the last layer, whose result is shown in Fig. 3 (C). It suggests that our result still holds. On the other hand, the original boundary obtained by training the network as a whole is closer to the SVM boundary in terms of the bias constant, which suggests the whole network training may have better initialization for the last layer and thus make the model generalize better. Notice that although we turned off dropout and the model had been trained for a very long time to make it completely fit to the training data, the trained model still has very impressive generalization property with the testing accuracy being as high as 99.7%.
We trained a model with the DenseNet transformation function on the CIFAR-10 dataset. The decision boundary results for this dataset are shown in Fig. 4. As shown in the figure, similar to the result on MNIST, the directions of those two boundaries are very close to each other, which further supports Theorem 1. Furthermore, in addition to being close in terms of direction, the neural network boundary is very close to the midpoint of the two clusters, if it does not cross the midpoint, where the SVM boundary should pass theoretically. This phenomenon is consistent with the results on the simulated datasets and the MNIST dataset, suggesting that training the whole neural network using GD or SGD may result in a decision boundary with a good bias constant. In terms of the trained model's generalization property, although we turned off explicit regularizers, the model still reaches 92.6% testing accuracy on this CIFAR-10 dataset, which is within the performance range of a production deep learning model.
In practice, deep learning is usually used for multi-class classification with cross-entropy loss. We investigated the multi-class classification case in this section. We performed experiments on a simulated three-class Blob dataset. The neural network decision boundaries in the original space and in the transformed space are shown in Fig. 5 (A) and (B), respectively. As a comparison, the SVM decision boundary on the transformed data in the transformed space is shown in Fig. 5 (C). Those results, which show that the decision boundary direction of the neural network's last weight layer converges to that of the SVM, verify Theorem 2. We also performed this experiment on the MNIST data with the DenseNet transformation function. During training, we also tried optimizers other than SGD, such as Momentum. The results are shown in Fig. 5 (D, E). From the two figures, we can see that the corresponding decision boundary directions of the neural network's last layer and the SVM are very close to each other. Besides, similar to the previous result, the decision boundary of the neural network is very close to the midpoint between different clusters. Those experiments further support Theorem 2 and also suggest that our hypothesis may generalize to other optimizers, such as Momentum.
We also investigated the decision boundary of the DenseNet's last layer when it is used to perform 10-class classification on Fashion MNIST. We used the same architecture as RN407, except that we added an additional layer to place the last hidden layer in 2D space for visualization purposes. We turned off the commonly used techniques for improving performance, such as data augmentation and dropout. We used Momentum as the optimizer. After the model was trained for 1,000 epochs, the loss stopped decreasing and oscillated. The testing accuracy is around 91.8%, which is within the known performance range of deep learning models on this dataset. We show the decision boundary comparison of the network's last layer and the multi-class linear SVM solution in Fig. 6. As shown in the figure, although the experiment setting is not exactly the same as the assumptions in our main result, the decision boundary of the trained neural network is still worth investigating. In fact, the decision boundary shown in the upper left of Fig. 6 (A) is very similar to that of Fig. 6 (B). On the other hand, the transformed representation of the blue class has a very complex spatial relationship with the other three classes around it, which causes the neural network to get stuck in a local minimum and deviate from the multi-class linear SVM solution. Although this real task does not completely fit our assumption and our result, the experiment shows that margin theory has the potential to explain the generalization property of deep learning.
The result of this paper can be useful for solving several practical problems related to deep learning, such as catastrophic forgetting RN408 and the data-hungry challenge RN412. We take these two as examples. In addition, we believe that investigating the transformation function would be helpful for defending against adversarial attacks RN411, and that studying the last layer can open up new ways of introducing uncertainty into supervised deep learning RN413.
Catastrophic forgetting RN408, which means that a neural network cannot learn new knowledge without forgetting previously learned knowledge, is one of the bottlenecks of deep learning. Recently, a rehearsal framework, called SupportNet RN404, was proposed to deal with catastrophic forgetting when performing class incremental learning. In short, it maintains a subset of the old data, which is chosen based on the support vector information obtained by using an SVM to approximate the last layer, and feeds the subset together with the new data to the model when incorporating the new classes into the model. Despite the lack of theoretical analysis in the paper, the framework works quite well in practice, even achieving nearly optimal performance on some datasets. In fact, according to Lemma 1 and Theorem 2, we can write $W(t) = g(t)\hat{W} + \rho(t)$ such that $g(t) \to \infty$ and $\rho(t)$ is bounded. With the labels absorbed into the transformed data $\delta_n$, the gradient of the exponential loss for $W$ can then be formulated as:
$$-\nabla L(W) = \sum_{n=1}^{N} \exp\big(-W(t)\delta_n\big)\, \delta_n^{\top} = \sum_{n=1}^{N} \exp\big(-g(t)\hat{W}\delta_n\big) \exp\big(-\rho(t)\delta_n\big)\, \delta_n^{\top}.$$
When the model converges and $g(t) \to \infty$, only those data with the largest exponents, that is, those for which $\hat{W}\delta_n$ is the smallest, will contribute to the gradients. Those samples are exactly the support vectors of the SVM trained on the transformed data, which are the ones selected by SupportNet. Using those data for future tuning, the model is very likely to learn the same boundary for the old classes. Our result partially explains why that rehearsal method works very well in practice.
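The dominance of the support vectors in the gradient can be made concrete with a small numeric sketch. It assumes a hypothetical 2D transformed dataset with labels absorbed, a hand-derived max-margin vector, and it ignores the bounded residual, so the last layer is taken to be exactly a scaled copy of the max-margin vector.

```python
import math

# Hypothetical transformed data (labels absorbed) and its max-margin vector.
# Margins <W_HAT, delta_n> are 1, 1 and 3: the first two are support vectors.
DELTAS = [(2.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
W_HAT = (0.5, 1.0)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def gradient_shares(g):
    """Share of each sample in the exponential-loss gradient when W = g * W_HAT."""
    coeffs = [math.exp(-g * dot(W_HAT, d)) for d in DELTAS]
    total = sum(coeffs)
    return [c / total for c in coeffs]

for g in (1.0, 5.0, 10.0):
    print(g, gradient_shares(g))
```

As the scale grows, the share of the non-support sample vanishes exponentially fast, so keeping only the support vectors preserves almost all of the gradient signal for the old classes.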
It is always desirable to reduce the training data size for the data-hungry deep learning method without compromising performance too much. In practice, especially in the computer vision field, when the data size is not large enough, practitioners usually take advantage of transfer learning yosinski2014transferable, fine-tuning the last one or two layers of a pre-trained model with the training data. In fact, based on our result in the Main Result section and the analysis in the previous subsection, the mapping from the transformed space to the label space is not data-hungry, since only the support vector samples matter, which means the data-hungry property of deep learning comes from the transformation function component. The transfer learning technique, taking advantage of an existing transformation function and avoiding the data size requirement of that component, can thus learn a useful model with limited data.
Bridging the gap between the theoretical research and the practical power of deep learning is a fascinating research direction. In this paper, we investigate the decision boundary of a production deep learning architecture with weak assumptions on both the training data and the model. Through theoretical analysis and experiments, we show that the direction of the neural network's last weight layer converges to that of a linear SVM trained on the transformed data if the loss converges to zero, for both the binary case and the multi-class case with the commonly used cross-entropy loss. In addition, we show empirically that training a neural network as a whole may result in a better bias constant for the last weight layer, which is important for the generalization property of deep learning models. In addition to facilitating the understanding of deep learning and thus further improving its performance, our result can be useful for solving a broad range of practical problems in the deep learning field, such as catastrophic forgetting, reducing the data size requirement of deep learning, adversarial attacks, and introducing uncertainty into deep learning.
Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, 1989.
Journal of Machine Learning Research, 12(Aug):2493–2537, 2011.
Proceedings of the IEEE conference on computer vision and pattern recognition, 1(2):3, 2017.
We first want to clarify why, in the main text, we chose the data range of the simulated linearly separable datasets to be relatively large. Here we provide the results on small-range versions of the linearly separable datasets, which are shown in Fig. 7. At first glance, the decision boundary in the original space (Fig. 7 (B1-B3)) is very surprising because the highly over-parameterized multi-layer neural network seems to learn a linear decision boundary. We argue that this is because of the small range of the datasets combined with the shape of the activation function. As we know, a very large part of the ReLU activation function is linear. If the data range is very small, it is very likely that during training, the nonlinear part of the activation function is not used. As a result, the whole network becomes a linear classifier, which makes the decision boundary linear. We demonstrate that by performing an additional experiment on the small-range Blob dataset with the neural network having the following square activation function:
$$\sigma(z) = z^{2}.$$
Within this function, there is no linear part. So even if the data range is small, the decision boundary of the neural network should still be nonlinear. The experimental results of this setting are shown in the last column of Fig. 7. From Fig. 7 (B4), we can see that the decision boundary in the original space is a nonlinear one, as expected. On the other hand, we also show the decision boundary in the original space at different scales in Fig. 8. As shown in Fig. 8 (A, B), although the boundary is nonlinear globally (Fig. 8 (C)), it is very similar to a linear boundary if we only consider its local shape, which supports our explanation, that is, if the data range is small, the nonlinearity of the activation function is used only to a limited extent. This experiment demonstrates that the data range combined with the activation function can have a significant impact on the decision boundary in the original space. To eliminate the potential misunderstanding and misleading results caused by the datasets and to emphasize the main results, we chose the large-range datasets in the main text.
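The argument about the data range and the activation function can be illustrated with a one-dimensional sketch. It uses a hypothetical one-hidden-layer network with fixed, hand-picked weights and checks whether the network is affine between two inputs by comparing the endpoint outputs with the midpoint output. None of the weights or ranges come from the experiments.

```python
def relu(z):
    return max(z, 0.0)

def square(z):
    return z * z

# Hypothetical hidden units as (weight, bias) pairs, and output weights.
HIDDEN = [(1.0, 0.5), (-2.0, 1.0), (0.5, -0.8)]
OUT = [1.0, -1.5, 2.0]

def network(x, activation):
    """Scalar one-hidden-layer network with the given activation."""
    return sum(v * activation(w * x + b) for (w, b), v in zip(HIDDEN, OUT))

def midpoint_gap(a, b, activation):
    """Zero when the network behaves as an affine function between a and b."""
    mid = (a + b) / 2.0
    return abs(network(a, activation) + network(b, activation)
               - 2.0 * network(mid, activation))

print("ReLU, small range:", midpoint_gap(-0.1, 0.1, relu))
print("ReLU, large range:", midpoint_gap(-2.0, 2.0, relu))
print("square, small range:", midpoint_gap(-0.1, 0.1, square))
```

On the small range no hidden unit changes its on/off state, so the ReLU network is exactly affine there, while the same network is clearly nonlinear over the large range. The square activation stays nonlinear even on the small range.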
On the other hand, if we investigate the results of the neural network (Fig. 7 (D1-D4)) and the linear SVM (Fig. 7 (C1-C4)) in the transformed space on those small-range datasets, we can see that the results are similar to those on the large-range datasets in the main text, which further supports our main results.