# An Unconstrained Layer-Peeled Perspective on Neural Collapse

Neural collapse is a highly symmetric geometric pattern of neural networks that emerges during the terminal phase of training, with profound implications on the generalization performance and robustness of the trained networks. To understand how the last-layer features and classifiers exhibit this recently discovered implicit bias, in this paper, we introduce a surrogate model called the unconstrained layer-peeled model (ULPM). We prove that gradient flow on this model converges to critical points of a minimum-norm separation problem exhibiting neural collapse in its global minimizer. Moreover, we show that the ULPM with the cross-entropy loss has a benign global landscape for its loss function, which allows us to prove that all the critical points are strict saddle points except the global minimizers that exhibit the neural collapse phenomenon. Empirically, we show that our results also hold during the training of neural networks in real-world tasks when explicit regularization or weight decay is not used.


## 1 Introduction

Deep learning has achieved state-of-the-art performance in various applications [lecun2015deep], such as computer vision [krizhevsky2012imagenet], natural language processing [brown2020language], and scientific discovery [long2018pde, zhang2018deep]. Despite the empirical success of deep learning, how gradient descent or its variants bias deep neural networks toward solutions with good generalization performance on the test set remains a major open question. To develop a theoretical foundation for deep learning, many studies have investigated the implicit bias of gradient descent in different settings [li2018algorithmic, amari2020does, vaswani2020each, soudry2018implicit, lyu2019gradient, arora2019implicit].

It is well acknowledged that well-trained end-to-end deep architectures can effectively extract features relevant to a given label. Although theoretical analysis of deep learning has advanced considerably in recent years [dlbook, goldblum2019truth], most studies that analyze the properties of the final output function fail to address the features learned by neural networks. Recently, [papyan2020prevalence] observed that the features within the same class collapse to their mean, and the class means converge to an equiangular tight frame (ETF), during the terminal phase of training, that is, the stage after achieving zero training error. This phenomenon, named neural collapse [papyan2020prevalence], provides a clear view of how the last-layer features in the neural network evolve after interpolation and enables us to understand the benefit of continued training after zero training error for better generalization and robustness. To theoretically analyze the neural collapse phenomenon,

[fang2021layer] proposed the layer-peeled model (LPM) as a simple surrogate for neural networks, where the last-layer features are modeled as free optimization variables. In particular, in a balanced $K$-class classification problem using a neural network with $d$ neurons in the last hidden layer, the LPM takes the following form:

$$\min_{W,H}\ \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}(W h_i, y_i), \quad \text{s.t. } \frac12\|W\|_F^2 \le C_1,\ \frac12\|H\|_F^2 \le C_2, \qquad (1)$$

where $C_1$ and $C_2$ are positive constants and $N$ is the total number of data points. Here, $W \in \mathbb{R}^{K \times d}$ is the weight of the final linear classifier, $h_i \in \mathbb{R}^d$ is the last-layer feature of the $i$-th data point, and $y_i$ is the corresponding label. The intuition behind the LPM is that modern deep networks are often highly over-parameterized, with the capacity to learn any representation of the input data. It has been shown that a simplex equiangular tight frame (ETF), i.e., the feature configuration of neural collapse, is the only global optimum of the LPM objective (1) [fang2021layer, lu2020neural, wojtowytsch2020emergence, zhu2021geometric].

However, the feature constraints in LPMs are an artificial device that is never used in practice. In this study, we deal directly with the unconstrained model and show that gradient flow can find neural collapse solutions without the help of explicit constraints or regularization. To do this, we build a connection between neural collapse and recent theories on max-margin implicit regularization [lyu2019gradient, wei2018margin], and use it to prove convergence to a first-order stationary point of a minimum-norm separation problem. Furthermore, we show that the cross-entropy loss enjoys a benign global landscape in which all critical points are strict saddles in the tangent space, except for the global minimizers that exhibit the neural collapse phenomenon. Finally, we verify our insights via empirical experiments.

In contrast to previous theoretical works on neural collapse, our analysis does not incorporate any explicit regularization or constraint on the features. A comparison with other results can be found in Table 1, and we defer a detailed discussion to Section 5.2. The reasons we investigate the unregularized objective are summarized as follows:

1. Feature regularization or constraints are unrealistic and never used in practice. Moreover, previous studies have shown that neural networks continue to perform well without any regularization or constraint [zhang2021understanding].

2. As shown in this study, neural collapse emerges even in the unconstrained setting, which implies that its emergence should be attributed to gradient descent and the cross-entropy loss rather than to explicit regularization.

3. Feature regularization or constraints can be barriers for existing theories of neural networks [jacot2018neural, lyu2019gradient]. By allowing the features to be totally free, we hope our results can inspire further analyses that plug in a realistic neural network.

### 1.1 Contributions

The contributions of the present study can be summarized as follows.

• We build a relationship between the max-margin analysis [soudry2018implicit, nacson2019convergence, lyu2019gradient] and neural collapse, and provide an implicit-bias analysis of the features rather than of the output function. Although both the parameters and the features diverge to infinity, we prove that they converge in direction to a solution of the minimum-norm separation problem.

• Previous works [lyu2019gradient, ji2020gradient] only proved that gradient flow on homogeneous neural networks converges to a KKT point of the corresponding minimum-norm separation problem. However, the minimum-norm separation problem remains highly non-convex, and a local KKT point need not be a neural collapse solution. In this study, we give a more detailed characterization of the convergence direction via landscape analysis.

• Previous analyses of neural collapse rely on explicit regularization or constraints. In this study, we show that the implicit regularization effect of gradient flow is sufficient to produce a neural collapse solution: the emergence of neural collapse should be attributed to gradient descent and the loss function, rather than to explicit regularization or constraints. We provide a detailed discussion in Section 5.2.

### 1.2 Related Works

#### Implicit Bias of Gradient Descent:

To understand how gradient descent or its variants help deep learning find solutions with good generalization performance on the test set, a recent line of research has studied the implicit bias of gradient descent in different settings. For example, gradient descent is biased toward solutions with smaller weights under the squared loss [li2018algorithmic, amari2020does, vaswani2020each] and converges to the max-margin solution under the logistic loss [soudry2018implicit, nacson2019convergence, lyu2019gradient, chizat2020implicit, ji2020gradient]. For linear networks, [arora2019implicit, razin2020implicit, gidel2019implicit] have shown that gradient descent is biased toward low-rank solutions.

#### Loss Landscape Analysis:

Although the practical optimization problems encountered in machine learning are often non-convex, recent works have shown that their landscapes can enjoy benign properties that enable further analysis. In particular, these landscapes exhibit no spurious local minimizers or flat saddles and can be efficiently optimized via gradient-based methods [ge2015escaping]. Examples include phase retrieval [sun2018geometric], low-rank matrix recovery [ge2016matrix, ge2015escaping], dictionary learning [sun2016complete, qu2019analysis, laurent2018deep], and blind deconvolution [lau2019short].

## 2 Preliminaries and Problem Setup

### 2.1 Preliminaries

We consider a balanced dataset with $K$ classes, each containing $n$ data points. A standard fully connected neural network can be represented as:

$$f(x; W_{\mathrm{full}}) = b_L + W_L\,\sigma\big(b_{L-1} + W_{L-1}\,\sigma(\cdots\sigma(b_1 + W_1 x)\cdots)\big). \qquad (2)$$

Here $W_1, \dots, W_L$ denote the weight matrices of each layer, $b_1, \dots, b_L$ denote the bias terms, and $\sigma(\cdot)$ denotes the nonlinear activation function, for example, ReLU or sigmoid. Let $h_{k,i}$ denote the last-layer feature of the $i$-th data point in the $k$-th class, and let $\bar h_k = \frac{1}{n}\sum_{i=1}^n h_{k,i}$ denote the feature mean within the $k$-th class. To provide a formal definition of neural collapse, we first introduce the concept of a simplex equiangular tight frame (ETF):

###### Definition 2.1 (Simplex ETF).

A matrix $M = [m_1, \dots, m_K] \in \mathbb{R}^{d \times K}$ is said to be a simplex equiangular tight frame (ETF) if

$$M = \sqrt{\frac{K}{K-1}}\, Q\left(I_K - \frac{1}{K}\mathbf{1}_K\mathbf{1}_K^\top\right), \qquad (3)$$

where $Q \in \mathbb{R}^{d \times K}$ (with $d \ge K$) is a matrix with orthonormal columns, i.e., $Q^\top Q = I_K$.
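As a concrete illustration, the construction in (3) can be carried out numerically; the sketch below (shapes and the random seed are our own arbitrary choices) builds a simplex ETF and exposes its two defining properties: equal column norms and pairwise cosine $-1/(K-1)$.

```python
import numpy as np

def simplex_etf(K, d, seed=0):
    """Construct a d x K simplex ETF: M = sqrt(K/(K-1)) * Q (I_K - (1/K) 1 1^T),
    where Q has orthonormal columns (obtained here via QR of a random matrix)."""
    assert d >= K
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((d, K)))  # Q^T Q = I_K
    center = np.eye(K) - np.ones((K, K)) / K          # centering matrix
    return np.sqrt(K / (K - 1)) * Q @ center

M = simplex_etf(K=4, d=10)
norms = np.linalg.norm(M, axis=0)               # all columns have equal length
cos = (M.T @ M) / np.outer(norms, norms)        # off-diagonal entries: -1/(K-1)
```

With this normalization every column has unit norm, so the cosine matrix has ones on the diagonal and exactly $-1/(K-1)$ elsewhere.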

Let $W = [w_1, \dots, w_K]^\top$ be the weight of the final-layer classifier; the four criteria of neural collapse can then be formulated precisely as:

• (NC1) Variability collapse: as training progresses, the within-class variation of the activations becomes negligible as these activations collapse to their class means $\bar h_k$:

$$\|h_{k,i} - \bar h_k\| = 0, \quad 1 \le k \le K,\ 1 \le i \le n.$$
• (NC2) Convergence to simplex ETF: the vectors of the class means (after centering by their global mean) converge to having equal lengths, forming equal-sized angles between any given pair, and being the maximally pairwise-distanced configuration subject to the previous two properties:

$$\cos(\bar h_k, \bar h_j) = -\frac{1}{K-1}, \quad \|\bar h_k\| = \|\bar h_j\|, \quad \forall k \ne j.$$
• (NC3) Convergence to self-duality: the linear classifiers and the class means converge to align with each other, up to appropriate rescaling; that is, there exists a universal constant $C > 0$ such that

$$w_k = C\,\bar h_k, \quad k \in [K].$$
• (NC4) Simplification to nearest class center: for a given deepnet activation

$$h = \sigma\big(b_{L-1} + W_{L-1}\,\sigma(\cdots\sigma(b_1 + W_1 x)\cdots)\big) \in \mathbb{R}^d,$$

the network classifier converges to choosing whichever class has the nearest training class mean, i.e., $\arg\max_k w_k^\top h = \arg\min_k \|h - \bar h_k\|$.

In this paper, we say that a point $(W, H)$ satisfies the neural collapse conditions, or is a neural collapse solution, if the above four criteria are all satisfied for $(W, H)$.
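The four criteria above can be turned into numerical diagnostics. The following sketch (the helper name `nc_metrics`, the shapes, and the exactly collapsed test configuration are our own illustrative choices, not from the paper) measures NC1 via within-class variation, NC2 via deviation of pairwise cosines from $-1/(K-1)$, NC3 via the distance between the normalized classifier and normalized centered class means, and NC4 via disagreement with the nearest-class-center rule:

```python
import numpy as np

def nc_metrics(W, H):
    """Diagnostics for NC1-NC4; W: (K, d) classifier, H: (K, n, d) features."""
    K, n, d = H.shape
    means = H.mean(axis=1)                         # class means h_bar_k
    centered = means - means.mean(axis=0)          # center by the global mean
    nc1 = np.linalg.norm(H - means[:, None, :])    # within-class variation
    mn = np.linalg.norm(centered, axis=1)
    cos = (centered @ centered.T) / np.outer(mn, mn)
    off = cos[~np.eye(K, dtype=bool)]
    nc2 = np.abs(off + 1.0 / (K - 1)).max()        # deviation from ETF angles
    nc3 = np.linalg.norm(W / np.linalg.norm(W)     # classifier-feature duality
                         - centered / np.linalg.norm(centered))
    flat = H.reshape(-1, d)
    nc4 = (np.argmax(flat @ W.T, axis=1) !=        # classifier vs. nearest mean
           np.argmin(((flat[:, None, :] - means[None]) ** 2).sum(-1), axis=1)).mean()
    return nc1, nc2, nc3, nc4

# An exactly collapsed configuration built from a simplex ETF:
K, n, d = 4, 5, 6
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((d, K)))
M = (np.sqrt(K / (K - 1)) * Q @ (np.eye(K) - np.ones((K, K)) / K)).T  # (K, d)
H = np.repeat(M[:, None, :], n, axis=1)   # h_{k,i} equals its class mean (NC1)
W = M.copy()                              # self-dual classifier (NC3)
```

On this configuration all four diagnostics vanish; on a partially trained network they decay toward zero during the terminal phase of training.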

### 2.2 Problem Setup

We mainly focus on the neural collapse phenomenon, which concerns only the classifiers and features in the last layer. Since a general analysis of the highly non-smooth and non-convex neural network is difficult, we peel off the last layer of the neural network and propose the following unconstrained layer-peeled model (ULPM) as a simplification that captures the main characteristics of neural collapse during the training dynamics. A similar simplification is commonly used in previous theoretical works [lu2020neural, fang2021layer, wojtowytsch2020emergence, zhu2021geometric], but ours places no constraint or regularization on the features and is closer to realistic neural network models. It should be mentioned that although [mixon2020neural, han2021neural] also studied the unconstrained model, their analysis adopted an approximation to the real dynamics and was highly dependent on the $\ell_2$ loss function, which is rarely used in classification tasks. Compared with their works, we deal directly with the real training dynamics and cover the most popular cross-entropy loss in classification tasks.

Let $W = [w_1, \dots, w_K]^\top \in \mathbb{R}^{K \times d}$ and $H = [h_{1,1}, \dots, h_{1,n}, \dots, h_{K,n}] \in \mathbb{R}^{d \times nK}$ be the matrices of classifiers and features in the last layer, where $K$ is the number of classes and $n$ is the number of data points in each class. The ULPM is defined as follows:

$$\min_{W,H}\ L(W,H) = \min_{W,H} -\sum_{k=1}^{K}\sum_{i=1}^{n}\log\left(\frac{\exp(w_k^\top h_{k,i})}{\sum_{j=1}^{K}\exp(w_j^\top h_{k,i})}\right). \qquad (4)$$

Here, we place no constraint or regularization on the features, which corresponds to the absence of weight decay in deep learning training. The objective function (4) is generally non-convex in $(W, H)$, and we aim to study its landscape. Furthermore, we consider the gradient flow of the objective function:

$$\frac{dW(t)}{dt} = -\frac{\partial L(W(t), H(t))}{\partial W}, \quad \frac{dH(t)}{dt} = -\frac{\partial L(W(t), H(t))}{\partial H}.$$
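The ULPM objective (4) and its gradient flow are simple enough to simulate directly. The following sketch (a toy discretization; the shapes, seed, step size, and iteration count are arbitrary choices of ours) runs plain gradient descent on (4) with exact gradients and exhibits the behavior studied below: the loss decreases toward zero, the norms grow without bound, and the features become perfectly separated.

```python
import numpy as np

def ulpm_loss_and_grads(W, H):
    """Cross-entropy ULPM objective (4) and its exact gradients.
    W: (K, d) classifier rows w_k; H: (K, n, d) free features h_{k,i}."""
    K, n, d = H.shape
    logits = H @ W.T                                  # logits[k,i,j] = w_j . h_{k,i}
    logits = logits - logits.max(-1, keepdims=True)   # numerical stability
    P = np.exp(logits)
    P /= P.sum(-1, keepdims=True)                     # softmax probabilities
    correct = P[np.arange(K), :, np.arange(K)]        # P[k, i, k], shape (K, n)
    loss = -np.log(correct).sum()
    G = P.copy()
    G[np.arange(K), :, np.arange(K)] -= 1.0           # dL/dlogits = P - onehot
    gW = np.einsum('kij,kid->jd', G, H)               # dL/dw_j
    gH = G @ W                                        # dL/dh_{k,i}
    return loss, gW, gH

# Plain gradient descent: a discrete surrogate for the gradient flow.
K, n, d = 3, 4, 5
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((K, d))
H = 0.1 * rng.standard_normal((K, n, d))
losses = []
for _ in range(2000):
    loss, gW, gH = ulpm_loss_and_grads(W, H)
    losses.append(loss)
    W -= 0.1 * gW
    H -= 0.1 * gH
# The loss keeps decreasing while ||W||_F and ||H||_F keep growing;
# only the direction of (W, H) converges, which is what Theorem 3.2 characterizes.
```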

#### Notations.

$\|\cdot\|_F$ denotes the Frobenius norm, $\|\cdot\|_2$ denotes the matrix spectral norm, $\|\cdot\|_*$ denotes the nuclear norm, $\|\cdot\|$ denotes the vector $\ell_2$ norm, and $\operatorname{tr}(\cdot)$ is the trace of a matrix. We use $[n]$ to denote the set of indices $\{1, \dots, n\}$.

## 3 Main Results

In this section, we present our main results regarding the training dynamics and the landscape of (4). This section is organized as follows. First, in Section 3.1, we show the relationship between the margin and neural collapse in our surrogate model. Inspired by this relationship, we propose a minimum-norm separation problem (5) and connect the convergence direction of the gradient flow to the KKT points of (5). In addition, we explicitly solve for the global optimum of (5) and show that it must satisfy the neural collapse conditions. However, owing to the non-convexity, we present Example 3.1 in Section 3.2, which shows that there exist bad KKT points at which a simple gradient flow can get stuck without converging to the neural collapse solution, which is proved to be optimal in Theorem 3.3. We then present our second-order analysis in Theorem 3.4, showing that these bad points exhibit decreasing directions in the tangent space; thus, gradient descent and its variants can avoid these bad directions and converge only to neural collapse solutions [lee2016gradient, ge2015escaping].

### 3.1 Convergence To The First–Order Stationary Point

Heuristically speaking, the simplex ETF (Definition 2.1) gives a set of vectors with the maximum average angle between them. As a result, neural collapse implies that neural networks tend to maximize the angles between each class and the corresponding classifiers. At a high level, such behavior is quite similar to margin maximization, which is known to be an implicit regularization effect of gradient descent and has been extensively studied in [soudry2018implicit, nacson2019convergence, lyu2019gradient, ji2020gradient]. First, we illustrate the connection between the margin and neural collapse. Recall that the margin of a single data point with feature $h_{k,i}$ is $q_{k,i}(W,H) := \min_{j \ne k}\big(w_k^\top h_{k,i} - w_j^\top h_{k,i}\big)$. Then, the margin of the entire dataset is defined as

$$q_{\min}(W,H) := \min_{k \in [K],\, i \in [n]} q_{k,i}(W,H).$$

The following theorem demonstrates that neural collapse yields the maximum margin solution of our ULPM model:

###### Theorem 3.1 (Neural collapse as max-margin solution).

For the ULPM (4), the margin of the entire dataset always satisfies

$$q_{\min}(W,H) \le \frac{\|W\|_F^2 + \|H\|_F^2}{2(K-1)\sqrt n},$$

and equality holds if and only if $(W, H)$ satisfies the neural collapse conditions with $\|W\|_F = \|H\|_F$.
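This bound can be checked numerically. The sketch below (helper names, shapes, and seed are our own choices) evaluates both sides at a generic random point, where the inequality is strict, and at a neural collapse configuration built from a simplex ETF with $w_k = \sqrt n\, h_{k,i}$, where it holds with equality:

```python
import numpy as np

def qmin_margin(W, H):
    """q_min(W, H): minimum margin over the dataset; W: (K, d), H: (K, n, d)."""
    K = W.shape[0]
    logits = H @ W.T
    correct = logits[np.arange(K), :, np.arange(K)]
    rival = logits.copy()
    rival[np.arange(K), :, np.arange(K)] = -np.inf   # mask the true class
    return (correct - rival.max(-1)).min()

def margin_bound(W, H, n):
    K = W.shape[0]
    return (np.linalg.norm(W) ** 2 + np.linalg.norm(H) ** 2) / (2 * (K - 1) * np.sqrt(n))

K, n, d = 4, 3, 6
rng = np.random.default_rng(0)
# Generic point: strict inequality.
Wr = rng.standard_normal((K, d))
Hr = rng.standard_normal((K, n, d))
gap = margin_bound(Wr, Hr, n) - qmin_margin(Wr, Hr)          # strictly positive

# Neural collapse point: h_{k,i} = k-th ETF direction, w_k = sqrt(n) h_{k,i}.
Q, _ = np.linalg.qr(rng.standard_normal((d, K)))
M = (np.sqrt(K / (K - 1)) * Q @ (np.eye(K) - np.ones((K, K)) / K)).T  # (K, d)
Hnc = np.repeat(M[:, None, :], n, axis=1)
Wnc = np.sqrt(n) * M
equal_gap = margin_bound(Wnc, Hnc, n) - qmin_margin(Wnc, Hnc)  # essentially zero
```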

Based on this finding, we present an analysis of the convergence of the gradient flow on the ULPM (4). Following [lyu2019gradient], we link the gradient flow on cross-entropy loss to a minimum-norm separation problem.

###### Theorem 3.2.

For problem (4), let $(W(t), H(t))$ be the path of the gradient flow at time $t$. If there exists a time $t_0$ such that $L(W(t_0), H(t_0)) < \log 2$, then any limit point of

$$\left\{\big(\hat H(t), \hat W(t)\big) := \left(\frac{H(t)}{\sqrt{\|W(t)\|_F^2 + \|H(t)\|_F^2}},\ \frac{W(t)}{\sqrt{\|W(t)\|_F^2 + \|H(t)\|_F^2}}\right)\right\}$$

is along the direction (i.e., a constant multiple) of a Karush-Kuhn-Tucker (KKT) point of the following minimum-norm separation problem:

$$\begin{aligned} \min_{W,H}\ & \frac12\|W\|_F^2 + \frac12\|H\|_F^2 \\ \text{s.t. }\ & w_k^\top h_{k,i} - w_j^\top h_{k,i} \ge 1, \quad \forall j \ne k \in [K],\ i \in [n]. \end{aligned} \qquad (5)$$
###### Remark 3.1.

Here we write (5) as a constrained problem, but the constraint is introduced by the implicit regularization effect of gradient flow on the ULPM objective (4). Since the training dynamics diverge to infinity, an appropriate normalization is needed to show that the divergent direction is closely related to neural collapse, which is why (5) takes a constrained-optimization form. Our goal is to justify that the neural collapse phenomenon is caused by properties of the loss function and the training dynamics, rather than by an explicit regularization or constraint, which seems necessary in previous studies [fang2021layer, lu2020neural, wojtowytsch2020emergence].

###### Remark 3.2.

In Theorem 3.2, we assume that there exists a time $t_0$ such that $L(W(t_0), H(t_0)) < \log 2$. Note that the loss function can be rewritten as:

$$L(W,H) = \sum_{k=1}^{K}\sum_{i=1}^{n}\log\Big(1 + \sum_{j \ne k}\exp\big(w_j^\top h_{k,i} - w_k^\top h_{k,i}\big)\Big). \qquad (6)$$

The requirement $L(W(t_0), H(t_0)) < \log 2$ implies that $w_k^\top h_{k,i} > w_j^\top h_{k,i}$ for any $j \ne k$, which is equivalent to $q_{\min}(W(t_0), H(t_0)) > 0$; that is, every feature is separated perfectly by the classifier. This assumption is common in the study of implicit bias in the nonlinear setting [nacson2019lexicographic, lyu2019gradient, ji2020gradient], and its validity is justified by the fact that neural collapse is found only in the terminal phase of training, where the training accuracy has reached 100%. Removing this assumption and studying the early-stage training dynamics is an interesting direction, but it is beyond the scope of this study and we leave it to future exploration.

Theorem 3.2 indicates that the convergent direction of the gradient flow is restricted to the max-margin directions, which usually have good robustness and generalization performance. In general, the KKT conditions are not sufficient to obtain global optimality because the minimum-norm separation problem (5) is non-convex. On the other hand, we can precisely characterize its global optimum in the ULPM case based on Theorem 3.1:

###### Corollary 3.1.

Every global optimum of the minimum-norm separation problem (5) is also a KKT point that satisfies the neural collapse conditions.

With Theorem 3.2 bridging dynamics with KKT points of (5) and Corollary 3.1 associating global optimum of (5) with neural collapse, the remaining work is to close the gap between the KKT point and the global optimum.

### 3.2 Second–Order Landscape Analysis

In convex optimization problems, the KKT conditions are usually equivalent to global optimality. Unfortunately, owing to the non-convex nature of the objective (4), KKT points can also be saddle points or local optima other than the global optimum. In this section, we show via landscape analysis that this non-convex optimization problem is in fact benign: except for the global optima given by neural collapse, all other KKT points are saddle points that can be avoided by gradient flow.

In contrast to previous landscape analyses of non-convex problems, which aim to show that the objective has negative directional curvature around any stationary point [sun2015nonconvex, zhang2020symmetry], our analysis is slightly different. Since the model is unconstrained, once the features can be perfectly separated, the ULPM objective (4) always decreases along the direction of the current point, and the optimum is attained only at infinity. Although growing along any perfectly separating direction lets the loss decrease to 0, the speed of decrease differs across directions, and there exists an optimal direction with the fastest decrease. As shown in Section 3.1, a first-order analysis of the training dynamics fails to distinguish such an optimal direction among the KKT points, so a second-order analysis is needed to fully characterize the realistic training dynamics. First, we provide an example to illustrate the motivation and necessity of the second-order landscape analysis.

###### Example 3.1 (A Motivating Example).

Consider the case where $K = 4$, $n = 1$, and $d = 4$, and let $(W, H)$ be the following point:

$$W = H = \begin{bmatrix} 1 & -1 & 0 & 0 \\ -1 & 1 & 0 & 0 \\ 0 & 0 & 1 & -1 \\ 0 & 0 & -1 & 1 \end{bmatrix}.$$

One can easily verify that $(W, H)$ enables our model to classify all features perfectly. Furthermore, we can show that it is along the direction of a KKT point of the minimum-norm separation problem (5) by constructing the Lagrangian multipliers as follows:

$$\Lambda = \begin{bmatrix} 0 & 0 & \tfrac12 & \tfrac12 \\ 0 & 0 & \tfrac12 & \tfrac12 \\ \tfrac12 & \tfrac12 & 0 & 0 \\ \tfrac12 & \tfrac12 & 0 & 0 \end{bmatrix}.$$

To see this, simply write down the corresponding Lagrangian (note that we aim to justify that $(W, H)$ is along the direction of a KKT point of (5); to make it a true KKT point, one needs to multiply $(W, H)$ by an appropriate constant):

$$\mathcal{L}(W, H, \Lambda) = \frac14\|W\|_F^2 + \frac14\|H\|_F^2 - \sum_{i=1}^{4}\sum_{j \ne i}\lambda_{i,j}\Big(\frac12 w_i^\top h_i - \frac12 w_j^\top h_i - 1\Big).$$

Taking derivatives with respect to $W$ and $H$, we find that $(W, H, \Lambda)$ satisfies the KKT conditions. However, the gradient of $L$ at $(W, H)$ is:

$$\nabla_W L(W,H) = \nabla_H L(W,H) = -\frac{2 + 2e^{-2}}{2 + 2e^{-2} + 2e^{2}}\begin{bmatrix} 1 & -1 & 0 & 0 \\ -1 & 1 & 0 & 0 \\ 0 & 0 & 1 & -1 \\ 0 & 0 & -1 & 1 \end{bmatrix}.$$

We can see that the directions of the gradient and the parameters align with each other (i.e., $\nabla_W L$ is parallel to $W$ and $\nabla_H L$ is parallel to $H$), which implies that simple gradient descent may get stuck along this direction and only grow the parameter norm. However, if we construct:

$$W' = H'^{\top} = \frac{1}{\sqrt{1 + 2\alpha^2}}\begin{bmatrix} 1+\alpha & -1+\alpha & \alpha & \alpha \\ -1+\alpha & 1+\alpha & \alpha & \alpha \\ -\alpha & -\alpha & 1-\alpha & -1-\alpha \\ -\alpha & -\alpha & -1-\alpha & 1-\alpha \end{bmatrix},$$

then by a simple calculation we find that $(W', H')$ satisfies $\|W'\|_F^2 = \|W\|_F^2$ and $\|H'\|_F^2 = \|H\|_F^2$. Since $L(W', H') < L(W, H)$ for small $\alpha > 0$ and $(W', H') \to (W, H)$ as $\alpha \to 0$, this implies that for any $\epsilon > 0$, we can choose an appropriate $\alpha$ such that:

$$\|W'\|_F^2 = \|W\|_F^2, \quad \|H'\|_F^2 = \|H\|_F^2, \quad \|W' - W\|_F^2 + \|H' - H\|_F^2 < \epsilon, \quad L(W', H') < L(W, H).$$
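The example can be verified numerically. In the sketch below (helper names and the choice $\alpha = 0.1$ are ours), we check that the cross-entropy gradient at $(W, H)$ is an exact scalar multiple of the parameters, and that the perturbed family preserves the Frobenius norms while strictly lowering the loss:

```python
import numpy as np

# The block-diagonal point W = H from Example 3.1 (K = 4, n = 1, d = 4).
A = np.array([[1., -1., 0., 0.],
              [-1., 1., 0., 0.],
              [0., 0., 1., -1.],
              [0., 0., -1., 1.]])

def ce_loss_grad(W, feats):
    """ULPM cross-entropy loss and gradient w.r.t. W; row i of feats is h_i."""
    K = W.shape[0]
    logits = feats @ W.T
    logits = logits - logits.max(1, keepdims=True)
    P = np.exp(logits)
    P /= P.sum(1, keepdims=True)
    loss = -np.log(np.diag(P)).sum()   # label of feature i is class i
    G = P - np.eye(K)                  # dL/dlogits
    return loss, G.T @ feats           # dL/dW

loss0, gW = ce_loss_grad(A, A)         # at the base point, features = rows of H
c = np.trace(A.T @ gW) / np.linalg.norm(A) ** 2
parallel_resid = np.linalg.norm(gW - c * A)   # ~0: gradient parallel to (W, H)

def perturbed(a):
    """The family W' = H'^T from the example (rows of W' are the features)."""
    B = np.array([[1 + a, -1 + a, a, a],
                  [-1 + a, 1 + a, a, a],
                  [-a, -a, 1 - a, -1 - a],
                  [-a, -a, -1 - a, 1 - a]])
    return B / np.sqrt(1 + 2 * a ** 2)

Wp = perturbed(0.1)
loss1, _ = ce_loss_grad(Wp, Wp)        # equal norms, strictly lower loss
```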

Example 3.1 shows that some KKT points of the minimum-norm separation problem (5) are not globally optimal, but better points exist arbitrarily close to them; thus, gradient-based methods can easily avoid them (see [lee2016gradient, ge2015escaping] for a detailed discussion). In the following theorem, we show that the best directions are the neural collapse solutions, in the sense that the loss function is lowest among all growing directions.

###### Theorem 3.3.

The optimal value of the loss function (4) on a sphere $\{(W, H) : \|W\|_F^2 + \|H\|_F^2 = \rho^2\}$ is attained at $(W^*, H^*)$ (i.e., $L(W^*, H^*) \le L(W, H)$ for any $(W, H)$ on the sphere) if and only if $(W^*, H^*)$ satisfies the neural collapse conditions and $\|W^*\|_F = \|H^*\|_F$.

###### Remark 3.3.

Note that the second condition $\|W^*\|_F = \|H^*\|_F$ is necessary because the neural collapse conditions do not specify the norm ratio of $W$ and $H$. That is, if $(W, H)$ satisfies the neural collapse conditions, then $(cW, c'H)$ for any $c, c' > 0$ will also satisfy them, but only a certain norm ratio is optimal.

Now, we turn to the points that are not globally optimal. To formalize our discussion of the motivating Example 3.1, we first introduce the tangent space:

###### Definition 3.1 (tangent space).

The tangent space at $(W, H)$ is defined as the set of directions orthogonal to $(W, H)$:

$$T(W,H) = \left\{(\Delta W, \Delta H) \in \mathbb{R}^{K \times d} \times \mathbb{R}^{d \times nK} : \operatorname{tr}(W^\top \Delta W) + \operatorname{tr}(H^\top \Delta H) = 0\right\}.$$

Our next result justifies the observation in Example 3.1: for non-optimal points, there exists a direction in the tangent space such that moving along it leads to a lower objective value.

###### Theorem 3.4.

If $(W, H)$ is not an optimal solution in the sense of Theorem 3.3 (i.e., $(W, H)$ is not a neural collapse solution, or it is one but $\|W\|_F \ne \|H\|_F$), then there exist a direction $(\Delta W, \Delta H) \in T(W, H)$ and a constant $\delta_0 > 0$ such that for any $0 < \delta \le \delta_0$,

$$L(W + \delta \Delta W, H + \delta \Delta H) < L(W, H). \qquad (7)$$

Furthermore, this implies that for any $\epsilon > 0$ there exists a point $(W', H')$ such that:

$$\|W'\|_F^2 + \|H'\|_F^2 = \|W\|_F^2 + \|H\|_F^2, \quad \|W' - W\|_F^2 + \|H' - H\|_F^2 < \epsilon, \quad L(W', H') < L(W, H). \qquad (8)$$
###### Remark 3.4.

The result in (7) gives a decreasing direction orthogonal to the direction of $(W, H)$. As shown in Example 3.1, the gradient at these non-optimal points might be parallel to $(W, H)$; thus, a first-order analysis fails to explain the prevalence of neural collapse. Here, the decreasing direction must be obtained by analyzing the Hessian, which further indicates that these points are saddle points in the tangent space; this is why we call our analysis second-order landscape analysis. A formal statement and proof are presented in Appendix C. Previous works have shown that a large family of gradient-based methods avoid saddle points and converge only to minimizers [lee2016gradient, ge2015escaping, panageas2019first]; thus, our landscape analysis indicates that the gradient-flow dynamics find only neural collapse directions.

## 4 Empirical Results

To evaluate our theory, we trained VGG-13 [simonyan2014very] and ResNet-18 [he2016deep] on the FashionMNIST [xiao2017fashion] and CIFAR-10 [krizhevsky2009learning] datasets without weight decay and tracked the convergence of the last-layer features to the neural collapse solution every few epochs to see how it evolves during the terminal phase of training. All networks were trained for 500 epochs using stochastic gradient descent without weight decay. The results are plotted in Figure 1; detailed definitions of the metrics can be found in Appendix D. We observe that across all experiments, the variation of the norms became smaller than 0.1 after 500 training epochs (a non-decreasing variation of norms is also found in [papyan2020prevalence] and may be attributed to the network architecture and the characteristics of real-world datasets); the cosines between pairs of last-layer features and those of the classifiers converged to the equiangular state with maximal angles; the distance between the normalized centered classifiers and the normalized last-layer features decreased steadily after 100 epochs; and the within-class variation also decreased steadily after 100 epochs. All of these quantities either decrease or remain small during training, implying that neural collapse occurs given sufficiently many training epochs.

## 5 Conclusion and Discussion

### 5.1 Conclusion

To understand the implicit bias of neural features under gradient descent training, we built a connection between the max-margin implicit bias and the neural collapse phenomenon and studied the ULPM. We proved that the gradient flow of the ULPM converges to a KKT point of a minimum-norm separation problem whose global optimum satisfies the neural collapse conditions. Although the ULPM is non-convex, we showed that it has a benign landscape in which all stationary points are strict saddle points in the tangent space, except for the global neural collapse solutions. Our study helps demystify the neural collapse phenomenon and sheds light on generalization and robustness during the terminal phase of training deep networks on classification problems.

### 5.2 Relationship with Other Results on Neural Collapse

Theoretical analyses of neural collapse were first provided by [lu2020neural, wojtowytsch2020emergence, fang2021layer], who showed that the neural collapse solution is the only global minimum of a simplified non-convex objective function. In particular, [wojtowytsch2020emergence, lu2020neural] studied a continuous integral form of the loss function and showed that the learned features should be uniformly distributed on the sphere. A more realistic discrete setting was studied in [fang2021layer], where the constraint is placed on the entire feature matrix rather than on individual features. Our result exploits the implicit bias of the cross-entropy loss to remove the feature-norm constraint, which is not practical in real applications.

Although the global optimum can be fully characterized by the neural collapse conditions, the ULPM objective is still highly non-convex. Regarding optimization, [mixon2020neural, poggio2020explicit, han2021neural] analyzed the unconstrained feature model with the $\ell_2$ loss and established convergence of gradient descent to collapsed features. However, these results fail to generalize to the more practical cross-entropy loss used in classification tasks: the analysis relies heavily on the $\ell_2$ loss to obtain a closed-form gradient flow and still requires additional approximations to guarantee global convergence.

The most relevant study is the concurrent work [zhu2021geometric], which provides a landscape analysis of the regularized unconstrained feature model. [zhu2021geometric] turns the feature-norm constraint of [fang2021layer] into a feature-norm regularization and still preserves the neural collapse global optimum; at the same time, it shows that the modified regularized objective has a benign landscape, where all critical points are strict saddles except for the global one. Although our study and [zhu2021geometric] discover similar landscape results, we believe our characterization remains closer to the real algorithms, since we do not introduce any constraint or regularization on the feature norm, following the conventional setting in realistic training. The feature regularization introduced in [zhu2021geometric] also differs from weight-decay regularization [krogh1992simple]. Indeed, weight decay on homogeneous neural networks is equivalent to gradient descent with a scaled step size on an unregularized objective [li2019exponential, zhang2018deep]. Moreover, neural networks are found to perform well in the absence of weight decay, which highlights the importance of implicit regularization [zhang2021understanding]. As a result, instead of an explicit feature-norm constraint or regularization, in this study we consider the implicit regularization brought by gradient flow on the cross-entropy loss and show that it is sufficient for the dynamics to converge to the neural collapse solution.

## Acknowledgments

We are grateful to Qing Qu and X.Y. Han for helpful discussions and feedback on an early version of the manuscript. Wenlong Ji is partially supported by the elite undergraduate training program of the School of Mathematical Sciences at Peking University. Yiping Lu is supported by the Stanford Interdisciplinary Graduate Fellowship (SIGF).

## Appendix A Elements of optimization

In this section, we introduce some basic definitions and results from optimization. In the following discussion, we consider a standard-form inequality-constrained optimization problem:

$$\begin{aligned} \min_{x \in \mathbb{R}^d}\ & f(x) \\ \text{s.t. }\ & g_i(x) \le 0, \quad i \in [n]. \end{aligned} \qquad (9)$$

In addition, we assume that $f$ and the $g_i$, $i \in [n]$, are twice differentiable. A point $x$ is said to be feasible if and only if it satisfies all of the constraints in (9), i.e., $g_i(x) \le 0$ for all $i \in [n]$. The Lagrangian of problem (9) is defined as follows:

$$\mathcal{L}(x, \lambda) = f(x) + \sum_{i=1}^{n}\lambda_i g_i(x).$$

### A.1 Karush-Kuhn-Tucker Conditions

Let us first introduce the definitions of a Karush-Kuhn-Tucker (KKT) point and an approximate KKT point. Here we follow the definition of the $(\epsilon,\delta)$-KKT point in [lyu2019gradient].

###### Definition A.1 (Definition of KKT point).

A feasible point $x$ is said to be a KKT point of problem (9) if there exist $\lambda_1, \dots, \lambda_n \ge 0$ such that the following Karush-Kuhn-Tucker (KKT) conditions hold:

1. Stationarity: $\nabla f(x) + \sum_{i=1}^{n}\lambda_i \nabla g_i(x) = 0$;
2. Complementary slackness: $\lambda_i g_i(x) = 0$ for all $i \in [n]$.

###### Definition A.2 (Definition of (ϵ,δ)-KKT point).

For any $\epsilon, \delta > 0$, a feasible point $x$ of (9) is said to be an $(\epsilon,\delta)$-KKT point of problem (9) if there exist $\lambda_1, \dots, \lambda_n \ge 0$ such that:

1. $\left\|\nabla f(x) + \sum_{i=1}^{n}\lambda_i \nabla g_i(x)\right\| \le \epsilon$;
2. $\lambda_i\, |g_i(x)| \le \delta$ for all $i \in [n]$.

Generally speaking, the KKT conditions might not be necessary for global optimality; some additional regularity conditions are needed to make them so. For example, as shown in [Dutta2013ApproximateKP], we can require the problem to satisfy the following Mangasarian-Fromovitz constraint qualification (MFCQ):

###### Definition A.3 (Mangasarian–Fromovitz constraint qualification (MFCQ)).

For a feasible point $x$ of (9), problem (9) is said to satisfy the MFCQ at $x$ if there exists a vector $v\in\mathbb{R}^d$ such that, for every active constraint $i$ (i.e., every $i$ with $g_i(x)=0$),

$$\langle\nabla g_i(x),v\rangle<0. \tag{10}$$

Moreover, when the MFCQ holds, we can build a connection between approximate KKT points and KKT points; see Dutta2013ApproximateKP for a detailed proof:

###### Theorem A.1 (Relationship between Approximate KKT point and KKT point).

Let $\{x_t\}$ be a sequence of feasible points of (9), and let $\{\epsilon_t\}$ and $\{\delta_t\}$ be two sequences of nonnegative real numbers such that $x_t$ is an $(\epsilon_t,\delta_t)$-KKT point for every $t$, with $\epsilon_t\to 0$ and $\delta_t\to 0$ as $t\to\infty$. If $x_t\to x$ and the MFCQ (Definition A.3) holds at $x$, then $x$ is a KKT point of (9).
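The theorem can be seen at work on the toy problem from before (our example, not from the paper): for $\min x^2$ subject to $g(x)=1-x\le 0$, the feasible points $x_t=1+1/t$ are $(0,\delta_t)$-KKT points with $\delta_t\to 0$, and their limit $x=1$ is an exact KKT point with multiplier $\lambda=2$:

```python
# Toy illustration of Theorem A.1 (our example, not from the paper):
# for min x^2 subject to g(x) = 1 - x <= 0, the feasible points
# x_t = 1 + 1/t are (0, delta_t)-KKT points with delta_t -> 0, and
# their limit x = 1 is an exact KKT point (multiplier lambda = 2).
for t in [1, 10, 100, 1000]:
    x = 1.0 + 1.0 / t
    lam = 2.0 * x                    # chosen so stationarity is exact
    eps = abs(2.0 * x - lam)         # |f'(x) + lam * g'(x)| with g'(x) = -1
    delta = lam * abs(1.0 - x)       # complementary-slackness residual
    print(t, eps, delta)             # eps is always 0, delta shrinks like 2/t
```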

## Appendix B Omitted proofs from Section 3.1

Recall our ULPM problem:

$$\min_{W,H}\; L(W,H)=-\sum_{k=1}^{K}\sum_{i=1}^{n}\log\left(\frac{\exp(w_k^\top h_{k,i})}{\sum_{j=1}^{K}\exp(w_j^\top h_{k,i})}\right).$$

Let us review some basic notation defined in the main body. Let $s_{k,i,j}(W,H)=w_k^\top h_{k,i}-w_j^\top h_{k,i}$; the margin of a single feature $h_{k,i}$ is defined to be $\min_{j\ne k}s_{k,i,j}(W,H)$. We define the margin of the entire dataset as $q_{\min}(W,H)=\min_{k\in[K],\,i\in[n]}\min_{j\ne k}s_{k,i,j}(W,H)$. We first prove Theorem 3.1 from the main body.
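The bound in Theorem 3.1 can be checked numerically. The sketch below (our illustration, not the paper's code; the shapes and step size are arbitrary choices) runs plain gradient descent on the ULPM objective and verifies that the dataset margin never exceeds $(\|W\|_F^2+\|H\|_F^2)/(2(K-1)\sqrt{n})$:

```python
import numpy as np

# A minimal numerical sketch (our illustration, not the paper's code):
# run plain gradient descent on the ULPM objective L(W, H) and check
# that the margin bound of Theorem 3.1,
#   q_min(W, H) <= (||W||_F^2 + ||H||_F^2) / (2 (K - 1) sqrt(n)),
# holds along the trajectory.  Shapes and step size are hypothetical.
rng = np.random.default_rng(0)
K, n, d = 3, 4, 5                      # classes, samples per class, feature dim
W = 0.1 * rng.normal(size=(K, d))      # classifiers w_k
H = 0.1 * rng.normal(size=(K, n, d))   # features h_{k,i}

def loss_and_grads(W, H):
    logits = np.einsum('kd,cnd->cnk', W, H)   # logits[c, i, k] = w_k^T h_{c,i}
    shifted = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(shifted)
    p /= p.sum(axis=-1, keepdims=True)        # softmax probabilities
    labels = np.arange(K)
    loss = -np.log(p[labels, :, labels]).sum()
    err = p.copy()
    err[labels, :, labels] -= 1.0             # dL / dlogits
    return loss, np.einsum('cnk,cnd->kd', err, H), np.einsum('cnk,kd->cnd', err, W)

def q_min(W, H):
    logits = np.einsum('kd,cnd->cnk', W, H)
    return min((logits[c, :, c] - logits[c, :, j]).min()
               for c in range(K) for j in range(K) if j != c)

for _ in range(2000):                  # gradient descent as a crude proxy for gradient flow
    _, gW, gH = loss_and_grads(W, H)
    W -= 0.05 * gW
    H -= 0.05 * gH

bound = (np.sum(W**2) + np.sum(H**2)) / (2 * (K - 1) * np.sqrt(n))
print(q_min(W, H) <= bound + 1e-9)     # True: the bound holds for any (W, H)
```

Since the inequality of Theorem 3.1 holds for every $(W,H)$, the check passes at any point of the trajectory, not only at convergence.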

###### Proof of Theorem 3.1.

First, observe that the margin does not change if we subtract a common vector from all of the classifiers $w_k$. Hence, if we denote the classifier mean by $\bar{w}=\frac{1}{K}\sum_{k=1}^{K}w_k$ and set $\tilde{w}_k=w_k-\bar{w}$, then $q_{\min}(\tilde{W},H)=q_{\min}(W,H)$, that is:

$$\tilde{w}_k^\top h_{k,i}-\tilde{w}_j^\top h_{k,i}\ge q_{\min}(W,H),\quad\forall\, j\ne k\in[K],\; i\in[n]. \tag{11}$$

Note that $\sum_{k=1}^{K}\tilde{w}_k=0$; summing this inequality over $j\ne k$, we have:

$$(K-1)\tilde{w}_k^\top h_{k,i}-\sum_{j\ne k}\tilde{w}_j^\top h_{k,i}=K\tilde{w}_k^\top h_{k,i}\ge(K-1)q_{\min}(W,H),\quad\forall\, k\in[K],\; i\in[n].$$

By the Cauchy–Schwarz and AM–GM inequalities, we have:

$$\frac{1}{2}\left(\frac{1}{\sqrt{n}}\|\tilde{w}_k\|_2^2+\sqrt{n}\|h_{k,i}\|_2^2\right)\ge\tilde{w}_k^\top h_{k,i}\ge\frac{K-1}{K}q_{\min}(W,H). \tag{12}$$

Summing (12) over $k$ and $i$, we have:

$$\frac{\sqrt{n}}{2}\left(\|\tilde{W}\|_F^2+\|H\|_F^2\right)\ge n(K-1)q_{\min}(W,H). \tag{13}$$

On the other hand, we know that:

$$\|\tilde{W}\|_F^2=\sum_{k=1}^{K}\Big\|w_k-\frac{1}{K}\sum_{j=1}^{K}w_j\Big\|_2^2\le\sum_{k=1}^{K}\|w_k\|_2^2=\|W\|_F^2.$$

Then we can conclude that:

$$q_{\min}(W,H)\le\frac{\|W\|_F^2+\|H\|_F^2}{2(K-1)\sqrt{n}},$$

as desired. When equality holds, we first have $\|\tilde{W}\|_F=\|W\|_F$, which is equivalent to $\sum_{k=1}^{K}w_k=0$. Substituting this back into (13), all of the inequalities in (12) and (11) must hold with equality, which gives us:

$$w_k=\sqrt{n}\,h_{k,i},\qquad\|w_k\|_2^2=n\|h_{k,i}\|_2^2=\frac{\|W\|_F^2+\|H\|_F^2}{2K}.$$

Substituting this into (11), we have:

$$h_{k,i}=h_{k,i'},\qquad\sqrt{n}\,h_{k,i}^\top h_{j,i'}=\frac{w_k^\top w_j}{\sqrt{n}}=-\frac{\|W\|_F^2+\|H\|_F^2}{2K(K-1)\sqrt{n}},\quad\forall\, j\ne k,\; i,i'\in[n],$$

which are exactly the neural collapse conditions. ∎
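As a sanity check on the equality case (our own numerical illustration, not from the paper), the configuration just derived — centered simplex classifiers with $h_{k,i}=w_k/\sqrt{n}$ — attains the bound of Theorem 3.1 with equality; here $K$, $n$, and the scale `r` are arbitrary choices:

```python
import numpy as np

# Sanity check (our own illustration): the collapse configuration derived
# above -- centered simplex classifiers w_k and h_{k,i} = w_k / sqrt(n) --
# attains the margin bound of Theorem 3.1 with equality.  K, n and the
# scale r are arbitrary choices.
K, n, r = 4, 3, 2.0
U = np.eye(K) - np.ones((K, K)) / K            # rows sum to zero: simplex directions
W = r * U                                      # classifiers w_k (rows)
H = np.stack([np.tile(W[k] / np.sqrt(n), (n, 1)) for k in range(K)])  # h_{k,i}

logits = np.einsum('kd,cnd->cnk', W, H)
margin = min((logits[c, :, c] - logits[c, :, j]).min()
             for c in range(K) for j in range(K) if j != c)
bound = (np.sum(W**2) + np.sum(H**2)) / (2 * (K - 1) * np.sqrt(n))

print(np.isclose(margin, bound))               # True: equality is attained
```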

Now let us turn to the training dynamics and prove Theorem 3.2. Before starting the proof, we need some additional notation. Since $W$ and $H$ are both optimization variables here, we write $\theta=(W,H)$ for the whole parameter vector for simplicity, and all of the previous functions can be defined on $\theta$ by matching the corresponding parameters. Denote by $\rho(t)=\|\theta(t)\|_2$ the norm of $\theta(t)$ and by $\tilde{\gamma}(t)$ the normalized margin of $\theta(t)$, following lyu2019gradient. Now we can state our first lemma, which shows how the training dynamics of gradient flow on the ULPM objective (B) relates to a KKT point of (5).

###### Lemma B.1.

If there exists a time $t_0$ such that $L(\theta(t_0))<\log 2$, then for any $t\ge t_0$, the rescaled parameter $\tilde{\theta}(t)=\theta(t)/\sqrt{q_{\min}(t)}$ is an $(\epsilon,\delta)$-approximate KKT point of the minimum-norm separation problem (5). More precisely, we have

$$\epsilon=\sqrt{\frac{2(1-\beta(t))}{\tilde{\gamma}(t)}},\qquad\delta=\frac{K^2(K-1)n}{2e\,\tilde{\gamma}(t)\,q_{\min}(t)},$$

where:

$$\beta(t)=\left\langle\frac{\theta(t)}{\|\theta(t)\|_2},\;\frac{d\theta(t)/dt}{\|d\theta(t)/dt\|_2}\right\rangle$$

is the cosine of the angle between $\theta(t)$ and its time derivative $d\theta(t)/dt$.

###### Proof.

The training dynamics is given by gradient flow:

$$\frac{d\theta}{dt}=-\frac{\partial L(\theta)}{\partial\theta}.$$

Then by the chain rule we have:

$$-\frac{dL(\theta)}{dt}=-\left\langle\frac{\partial L}{\partial\theta},\frac{d\theta}{dt}\right\rangle=\left\|\frac{d\theta}{dt}\right\|_2^2. \tag{14}$$

It indicates that the loss function is monotonically decreasing along the gradient flow. If $L(\theta(t_0))<\log 2$, we therefore have $L(\theta(t))<\log 2$ for all $t\ge t_0$. On the other hand, note that

$$L(\theta(t))=\sum_{k=1}^{K}\sum_{i=1}^{n}\log\Big(1+\sum_{j\ne k}e^{-s_{k,i,j}(t)}\Big)\ge\log\big(1+\exp(-q_{\min}(t))\big), \tag{15}$$

which gives us $q_{\min}(t)>0$ for all $t\ge t_0$.

Let $s_{k,i,j}(\theta)=w_k^\top h_{k,i}-w_j^\top h_{k,i}$; note that we can rewrite the ULPM objective (B) as $L(\theta)=\sum_{k=1}^{K}\sum_{i=1}^{n}\log\big(1+\sum_{j\ne k}e^{-s_{k,i,j}(\theta)}\big)$. By the chain rule and the gradient flow equation, we have

$$g:=-\frac{\partial L}{\partial\theta}=\sum_{k=1}^{K}\sum_{i=1}^{n}\sum_{j\ne k}\frac{e^{-s_{k,i,j}}}{1+\sum_{l\ne k}e^{-s_{k,i,l}}}\,g_{k,i,j},$$

where $g_{k,i,j}$ is the gradient of $s_{k,i,j}$, i.e., $g_{k,i,j}=\nabla_\theta s_{k,i,j}(\theta)$. Now let $\tilde{\theta}=\theta/\sqrt{q_{\min}}$ and $\tilde{g}_{k,i,j}=g_{k,i,j}/\sqrt{q_{\min}}$, and construct the multipliers $\lambda_{k,i,j}=\frac{\rho}{\|g\|_2}\cdot\frac{e^{-s_{k,i,j}}}{1+\sum_{l\ne k}e^{-s_{k,i,l}}}$. We only need to show:

$$\Big\|\tilde{\theta}-\sum_{k=1}^{K}\sum_{i=1}^{n}\sum_{j\ne k}\lambda_{k,i,j}\tilde{g}_{k,i,j}\Big\|_2^2\le\frac{2(1-\beta)}{\tilde{\gamma}}, \tag{16}$$
$$\sum_{k=1}^{K}\sum_{i=1}^{n}\sum_{j\ne k}\lambda_{k,i,j}\big(s_{k,i,j}(\tilde{\theta})-1\big)\le\frac{K^2(K-1)n}{2e\,q_{\min}\tilde{\gamma}}. \tag{17}$$

To prove (16), we only need to compute (recall that $\tilde{\theta}=\theta/\sqrt{q_{\min}}$ and, by construction, $\sum_{k,i,j}\lambda_{k,i,j}\,g_{k,i,j}=\frac{\rho}{\|g\|_2}g$):

$$\Big\|\tilde{\theta}-\sum_{k=1}^{K}\sum_{i=1}^{n}\sum_{j\ne k}\lambda_{k,i,j}\tilde{g}_{k,i,j}\Big\|_2^2=\frac{\rho^2}{q_{\min}}\Big\|\frac{\theta}{\|\theta\|_2}-\frac{g}{\|g\|_2}\Big\|_2^2=\frac{\rho^2}{q_{\min}}(2-2\beta).$$

Note that this can be rewritten as:

$$\Big\|\tilde{\theta}-\sum_{k=1}^{K}\sum_{i=1}^{n}\sum_{j\ne k}\lambda_{k,i,j}\tilde{g}_{k,i,j}\Big\|_2^2=\frac{2(1-\beta)\rho^2}{q_{\min}}. \tag{18}$$

Then we have the following inequality:

$$\tilde{\gamma}\le\frac{q_{\min}}{\rho^2}. \tag{19}$$

Substituting this back into (18), we obtain (16) as desired.

To prove (17), first note that by our construction:

$$\sum_{k=1}^{K}\sum_{i=1}^{n}\sum_{j\ne k}\lambda_{k,i,j}\big(s_{k,i,j}(\tilde{\theta})-1\big)=\frac{\rho}{q_{\min}\|g\|_2}\sum_{k=1}^{K}\sum_{i=1}^{n}\sum_{j\ne k}\frac{e^{-s_{k,i,j}}}{1+\sum_{l\ne k}e^{-s_{k,i,l}}}\big(s_{k,i,j}-q_{\min}\big). \tag{20}$$

Note that $\langle g_{k,i,j},\theta\rangle=2s_{k,i,j}$ by the $2$-homogeneity of $s_{k,i,j}$, and since $\|g\|_2\ge\frac{1}{\rho}\langle g,\theta\rangle$ by the Cauchy–Schwarz inequality, we have:

$$\begin{aligned}\|g\|_2\ge\frac{1}{\rho}\langle g,\theta\rangle&=\frac{1}{\rho}\sum_{k=1}^{K}\sum_{i=1}^{n}\sum_{j\ne k}\frac{e^{-s_{k,i,j}}}{1+\sum_{l\ne k}e^{-s_{k,i,l}}}\langle g_{k,i,j},\theta\rangle \\ &=\frac{2}{\rho}\sum_{k=1}^{K}\sum_{i=1}^{n}\sum_{j\ne k}\frac{e^{-s_{k,i,j}}}{1+\sum_{l\ne k}e^{-s_{k,i,l}}}\,s_{k,i,j} \\ &\ge\frac{2q_{\min}}{\rho K}\sum_{k=1}^{K}\sum_{i=1}^{n}\sum_{j\ne k}e^{-s_{k,i,j}}\qquad\Big(\text{since } s_{k,i,j}\ge q_{\min}>0 \text{ and } 1+\textstyle\sum_{l\ne k}e^{-s_{k,i,l}}\le K\Big) \\ &\ge\frac{2q_{\min}}{\rho K}e^{-q_{\min}}. \end{aligned} \tag{21}$$

Taking this inequality back into (20), we have:

 K∑k