1 Introduction
Constructing deep neural network (DNN) models by stacking layers unlocked the field of deep learning, leading to early successes in computer vision such as AlexNet [Krizhevsky et al., 2012], ZFNet [Zeiler and Fergus, 2014], and VGG [Simonyan and Zisserman, 2015]. However, stacking more and more layers can degrade performance [He and Sun, 2015, Srivastava et al., 2015, He et al., 2016a]; thus, it is no longer a viable option for further improving DNN models. In fact, this degradation problem is caused not by overfitting but by worse training performance [He et al., 2016a]. When neural networks become sufficiently deep, optimization landscapes quickly transition from being nearly convex to being highly chaotic [Li et al., 2018]. As a result, DNN models built by stacking more and more layers can easily converge to poor local minima (see Figure 1 in [He et al., 2016a]).

To address this issue, the modern deep learning paradigm has shifted to designing DNN models based on blocks or modules of the same kind in cascade. A block or module comprises specific operations on a stack of layers that avoid the degradation problem and learn better representations. Examples include Inception modules in GoogLeNet
[Szegedy et al., 2015], residual blocks in the ResNet [He et al., 2016a, b, Zagoruyko and Komodakis, 2016, Kim et al., 2016, Xie et al., 2017, Xiong et al., 2018], dense blocks in the DenseNet [Huang et al., 2017], attention modules in the Transformer [Vaswani et al., 2017], Squeeze-and-Excitation (SE) blocks in the SE network (SENet) [Hu et al., 2018], and residual U-blocks [Qin et al., 2020] in U-Net. Among these examples, the most popular block design is the residual block, which merely adds a skip connection (or residual connection) between the input and output of a stack of layers. This modification has led to huge success in deep learning. Many modern DNN models in different applications also adopt residual blocks in their architectures, e.g., V-Net in medical image segmentation
[Milletari et al., 2016], the Transformer in machine translation [Vaswani et al., 2017], and residual LSTMs in speech recognition [Kim et al., 2017]. Empirical results have shown that ResNets can be scaled up to very large numbers of layers or bottleneck residual blocks and still improve performance [He et al., 2016b].

Despite this huge success, our understanding of ResNets is very limited. To the best of our knowledge, no theoretical results have addressed the following question: Is learning better ResNets as easy as stacking more blocks? The most recognized intuitive answer to this question is that a particular stack of layers can focus on fitting the residual between the target and the representation generated by the previous residual block; thus, adding more blocks always leads to no worse training performance. This intuition is indeed true for a constructively blockwise training procedure, but it is not clear whether it holds when the weights in a ResNet are optimized as a whole. Perhaps the theoretical works in the literature closest to the above question are recent results showing, for albeit modified and constrained ResNet models, that every local minimum is less than or equal to the empirical risk provided by the best linear predictor [Shamir, 2018, Kawaguchi and Bengio, 2019, Yun et al., 2019]. Although the aims of these works are different from our question, they actually prove a special case under these simplified models in which the final residual representation is better than the input representation for linear prediction. We note that the models considered in these works are very different from standard ResNets using pre-activation residual blocks [He et al., 2016b] due to the absence of the nonlinearities at the final residual representation that feeds into the final affine layer. Other noticeable simplifications include scalar-valued output [Shamir, 2018, Yun et al., 2019] and a single residual block [Shamir, 2018, Kawaguchi and Bengio, 2019]. In particular, Yun et al.
[2019] additionally showed that residual representations do not necessarily improve monotonically over subsequent blocks, which highlights a fundamental difficulty in analyzing their simplified ResNet models.
In this paper, we take a step towards answering the above-mentioned question by constructing practical and analyzable block-based DNN models. The main contributions of our paper are as follows:
Improved representation guarantees for wide ResNEsts with bottleneck residual blocks.
We define a ResNEst as a standard single-stage ResNet that simply drops the nonlinearities at the last residual representation (see Figure 2). We prove that sufficiently wide ResNEsts with bottleneck residual blocks under practical assumptions can always guarantee a desirable training property that ResNets with bottleneck residual blocks empirically achieve (but which is theoretically difficult to prove), i.e., adding more blocks does not decrease performance given the same arbitrarily selected basis. To be more specific, any local minimum obtained from a ResNEst comes with an improved representation guarantee under practical assumptions (see Remark 2 (a) and Corollary 1). Our results apply to loss functions that are differentiable and convex, and do not rely on any assumptions regarding datasets or the convexity/differentiability of the residual functions.
Basic vs. bottleneck.
In the original ResNet paper, He et al. [2016a] empirically pointed out that ResNets with basic residual blocks indeed gain accuracy from increased depth, but are not as economical as ResNets with bottleneck residual blocks (see Figure 1 in [Zagoruyko and Komodakis, 2016] for the different block types). Our Theorem 1 supports these empirical findings.
Generalized and analyzable DNN models.
ResNEsts are more general than the models considered in [Hardt and Ma, 2017, Shamir, 2018, Kawaguchi and Bengio, 2019, Yun et al., 2019] because we remove their simplifying ResNet settings. In addition, the ResNEst modifies the input by an expansion layer that expands the input space. This expansion turns out to be crucial for deriving theoretical guarantees on improved residual representations. We find that the importance of expanding the input space in standard ResNets with bottleneck residual blocks has not been well recognized in existing theoretical results in the literature.
Restricted basis function models.
We reveal a linear relationship between the output of the ResNEst and the input feature, as well as the feature vector going into the last affine layer in each of the residual functions. By treating each of the feature vectors as a basis element, we find that ResNEsts are basis function models handicapped by a coupling problem between basis learning and linear prediction that can limit performance.
Augmented ResNEsts.
As shown in Figure 1, we present a special architecture called the augmented ResNEst, or AResNEst, which introduces a new weight matrix on each of the feature vectors to solve the coupling problem that exists in ResNEsts. Due to this decoupling, every local minimum obtained from an AResNEst bounds the empirical risk of the associated ResNEst from below. AResNEsts also directly enable us to see how features are supposed to be learned: features must be linearly unpredictable if residual representations are to be strictly improved over blocks.
Wide ResNEsts with bottleneck residual blocks do not suffer from saddle points.
At every saddle point obtained from a ResNEst, we show that there exists at least one direction with strictly negative curvature, under the same assumptions used in the improved representation guarantee together with a squared loss and suitable assumptions on the last feature and the dataset.
Improved representation guarantees for DenseNEsts.
Although DenseNets [Huang et al., 2017] have shown better empirical performance than ResNets, we are not aware of any theoretical support for DenseNets. We define a DenseNEst (see Figure 4) as a simplified DenseNet model that only utilizes the dense connectivity of the DenseNet model, i.e., direct connections from every stack of layers to all subsequent stacks of layers. We show that any DenseNEst can be represented as a wide ResNEst with bottleneck residual blocks equipped with orthogonalities. Unlike ResNEsts, any DenseNEst exhibits the desirable property, i.e., adding more dense blocks does not decrease performance, without any special architectural redesign. Compared to AResNEsts, the way the features are generated in DenseNEsts makes linear predictability even more unlikely, suggesting better feature construction.
2 ResNEsts and augmented ResNEsts
In this section, we describe the proposed DNN models. These models and their new insights are preliminaries to our main results in Section 3. Section 2.1 recognizes the importance of the expansion layer and defines the ResNEst model. Section 2.2 points out the basis function modeling interpretation and the coupling problem in ResNEsts, and shows that the optimization on the set of prediction weights is nonconvex. Section 2.3 proposes the AResNEst to avoid the coupling problem and shows that the minimum empirical risk obtained from a ResNEst is bounded from below by the corresponding AResNEst. Section 2.4 shows that linearly unpredictable features are necessary for strictly improved residual representations in AResNEsts.
2.1 Dropping nonlinearities in the final representation and expanding the input space
The importance of expanding the input space via the expansion layer (see Figure 2) in standard ResNets has not been well recognized in recent theoretical results [Shamir, 2018, Kawaguchi and Bengio, 2019, Yun et al., 2019], although standard ResNets always include such an expansion, implemented by the first layer before the first residual block. Empirical results have even shown that a standard 16-layer wide ResNet outperforms a standard 1001-layer ResNet [Zagoruyko and Komodakis, 2016], which implies the importance of a wide expansion of the input space.
We consider the proposed ResNEst model shown in Figure 2, whose $i$-th residual block employs the following input–output relationship:

$$\mathbf{v}_i = \mathbf{v}_{i-1} + W_i\, G_{i-1}\left(\mathbf{v}_{i-1}\right) \qquad (1)$$

for $i = 1, 2, \ldots, L$. The second term on the right-hand side is a composition of a nonlinear function $G_{i-1}$ and a linear transformation $W_i$ (for any affine function, one can absorb the bias into the weights and discuss the corresponding linear function instead; all the results derived in this paper hold true regardless of the existence of bias parameters), which is generally known as a residual function. $W_i$ forms a linear transformation, and we consider $G_{i-1}$ as a function implemented by a neural network with its own parameters for all $i$. We define the expansion $\mathbf{v}_0 = W_0 \mathbf{x}$ for the input $\mathbf{x} \in \mathbb{R}^d$ to the ResNEst using a linear transformation with a weight matrix $W_0 \in \mathbb{R}^{m \times d}$. The output $\hat{\mathbf{y}}$ (or $\hat{\mathbf{y}}_L$ to indicate the number of blocks) of the ResNEst is defined as $\hat{\mathbf{y}}_L = W_{L+1}\mathbf{v}_L$, where $m$ is determined by the expansion factor and $n$ is the output dimension of the network. The number of blocks $L$ is a nonnegative integer. When $L = 0$, the ResNEst is a two-layer linear network $\hat{\mathbf{y}}_0 = W_1 W_0 \mathbf{x}$.

Notice that the ResNEst we consider in this paper (Figure 2) is more general than the models in [Hardt and Ma, 2017, Shamir, 2018, Kawaguchi and Bengio, 2019, Yun et al., 2019], because our residual space (the space where the addition is performed at the end of each residual block) is not constrained by the input dimension, due to the expansion we define. Intuitively, a wider expansion (larger $m$) is required for a ResNEst that has more residual blocks. This is because the information collected in the residual representation grows after each block, and the fixed dimension of the residual representation must be sufficiently large to avoid any loss of information. It turns out that a wider expansion in a ResNEst is crucial for deriving performance guarantees, because it assures the quality of local minima and saddle points (see Theorems 1 and 2).
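To make the block structure concrete, the following NumPy sketch implements the forward pass just described; the ReLU residual functions, the dimensions, and all variable names are illustrative choices of ours, not prescribed by the model:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def resnest_forward(x, W0, blocks, W_out):
    """Forward pass of a toy ResNEst.

    W0 expands the input into the wider residual space; each block adds
    W_i @ G(v) to the running residual representation v (the skip
    connection); W_out is the last linear layer, applied directly to the
    final residual representation with no nonlinearity in between.
    """
    v = W0 @ x                      # expansion layer
    for A_i, W_i in blocks:
        feature = relu(A_i @ v)     # toy nonlinear residual function G
        v = v + W_i @ feature       # skip connection + block output
    return W_out @ v

d, m, k, n, L = 4, 16, 8, 3, 2      # k < m: bottleneck-shaped features
x = rng.standard_normal(d)
W0 = rng.standard_normal((m, d))
blocks = [(rng.standard_normal((k, m)), rng.standard_normal((m, k)))
          for _ in range(L)]
W_out = rng.standard_normal((n, m))
y = resnest_forward(x, W0, blocks, W_out)
print(y.shape)                      # (3,)
```

With zero blocks the model collapses to the two-layer linear case described above: `resnest_forward(x, W0, [], W_out)` equals `W_out @ W0 @ x`.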
2.2 Basis function modeling and the coupling problem
The conventional input–output relationship of a standard ResNet is often not easy to interpret. We find that redrawing the standard ResNet block diagram [He et al., 2016a, b] from a different viewpoint, shown in Figure 2, can give us considerable new insight. As shown in Figure 2, the ResNEst reveals a linear relationship between the output and the features. With this observation, we can write down a useful input–output relationship for the ResNEst:
$$\hat{\mathbf{y}}_L = W_{L+1}\left( W_0 \mathbf{x} + \sum_{i=1}^{L} W_i\, G_{i-1}\left(\mathbf{v}_{i-1}\right) \right) \qquad (2)$$

where $\mathbf{v}_{i-1}$ is given by (1) for $i = 1, \ldots, L$. Note that we do not impose any requirements on each $G_{i-1}$ other than assuming that it is implemented by a neural network with a set of parameters. We define $\mathbf{x}$ as the linear feature and regard each $G_{i-1}(\mathbf{v}_{i-1})$ as a nonlinear feature of the input $\mathbf{x}$, since $G_{i-1}$ is in general nonlinear. The benefit of our formulation (2) is that the output of a ResNEst can now be viewed as a linear function of all these features. Our point of view of ResNEsts in (2) may be useful for explaining the finding that ResNets behave like ensembles of relatively shallow networks [Veit et al., 2016].
As opposed to traditional nonlinear methods such as basis function modeling (chapter 3 in the book by Bishop, 2006), where a linear function is often trained on a set of handcrafted features, the ResNEst jointly finds features and a linear predictor function by solving the empirical risk minimization (ERM) problem, denoted as (P), over all the weights. Indeed, one can view training a ResNEst as basis function modeling with a trainable (data-driven) basis, by treating each of the features as a basis vector (it is reasonable to assume all features are not linearly predictable; see Section 2.4). However, unlike basis function modeling, the linear predictor function in the ResNEst is not entirely independent of the basis generation process. We call this phenomenon a coupling problem; it can handicap the performance of ResNEsts. To see this, note that the feature (basis) vectors can change if any $W_i$ with $i < L$ is changed (the product $W_{L+1}W_i$ is the linear predictor acting on the $i$-th feature). Therefore, these parameters need to be fixed to guarantee that the basis does not change with different linear predictor functions. It follows that $W_L$ and $W_{L+1}$ are the only weights that can be adjusted without changing the features. We refer to $W_L$ and $W_{L+1}$ as prediction weights and to all the remaining weights as feature finding weights in the ResNEst. Obviously, the set of all the weights in the ResNEst is composed of the feature finding weights and the prediction weights.
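This linear-in-features decomposition, and the coupling it exposes, can be verified numerically. In the following toy sketch (ReLU residual functions and all dimensions are our own illustrative choices), the network output is reconstructed exactly as a linear combination of the features, where the coefficient on each feature is the product of the last weight matrix and the corresponding block matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda z: np.maximum(z, 0.0)

d, m, k, n, L = 4, 12, 6, 3, 2
x = rng.standard_normal(d)
W0 = rng.standard_normal((m, d))
blocks = [(rng.standard_normal((k, m)), rng.standard_normal((m, k)))
          for _ in range(L)]
W_out = rng.standard_normal((n, m))

# Forward pass, recording each nonlinear feature as it is produced.
v, feats = W0 @ x, []
for A_i, W_i in blocks:
    f = relu(A_i @ v)
    feats.append(f)
    v = v + W_i @ f
out = W_out @ v

# Basis-function view: the output is linear in the features, but the
# coefficient of each feature is the *product* of the last weight
# matrix and a block matrix -- the coupling discussed above.
recon = (W_out @ W0) @ x
for (A_i, W_i), f in zip(blocks, feats):
    recon = recon + (W_out @ W_i) @ f
print(np.allclose(out, recon))      # True
```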
Because each $G_{i}$ is quite general in the ResNEst, any direct characterization of the landscape of the ERM problem (P) seems intractable. Thus, we propose to utilize the basis function modeling point of view in the ResNEst and analyze the following ERM problem:

$$\min_{W_L,\, W_{L+1}} \; \frac{1}{N} \sum_{t=1}^{N} \ell\left(\hat{\mathbf{y}}_L(\mathbf{x}_t), \mathbf{y}_t\right) \qquad (3)$$

for any fixed feature finding weights, where we have used $\ell$ and $\{(\mathbf{x}_t, \mathbf{y}_t)\}_{t=1}^{N}$ to denote the loss function and the training data, respectively, and $\hat{\mathbf{y}}_L(\cdot)$ denotes a ResNEst using the fixed feature finding weights. Although (3) has fewer optimization variables and looks easier than (P), Proposition 1 shows that it is a nonconvex problem. Remark 1 explains why understanding (3) is valuable.
Remark 1.
Consider the set of all local minimizers of (3) under any possible feature finding weights, each equipped with those corresponding weights. This set is a superset of the set of all local minimizers of the original ERM problem (P). Any characterization of (3) can then be translated to (P) (see Corollary 2 for example).
Assumption 1.
The matrix formed by the last nonlinear feature across the training data is full rank.
Proposition 1.
If $\ell$ is the squared loss and Assumption 1 is satisfied, then (a) the objective function of (3) is nonconvex and nonconcave; (b) every critical point of (3) that is not a local minimizer is a saddle point.
The proof of Proposition 1 is deferred to Appendix A.1 in the supplementary material. Due to the product of prediction weights $W_{L+1}W_L$, our Assumption 1 is similar to one of the important data assumptions used in deep linear networks [Baldi and Hornik, 1989, Kawaguchi, 2016]. Assumption 1 is easy to satisfy, as we can always perturb the weights if the last nonlinear feature and the dataset do not fit the assumption. Although Proposition 1 (a) examines the nonconvexity for fixed feature finding weights, the result can be extended to the original ERM problem (P) for the ResNEst. That is, if there exists at least one choice of feature finding weights such that Assumption 1 is satisfied, then the objective function of (P) is also nonconvex and nonconcave, because there exists at least one point in the domain at which the Hessian is indefinite. As a result, this nonconvex loss landscape in (P) immediately raises concerns about suboptimal local minima. This leads to an important question: can we guarantee the quality of local minima with respect to some reference models that are known to be good enough?
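A minimal toy illustration of why products of prediction weights break convexity (this is not the paper's proof, just a two-parameter analogue): under squared loss, the function f(a, b) = (ab − 1)² has an indefinite Hessian at the origin, so it is neither convex nor concave, and the origin is a saddle point.

```python
import numpy as np

# f(a, b) = (a*b - 1)^2: the simplest squared-loss objective with a
# product of two "prediction weights". Its Hessian is
#   [[2b^2, 4ab - 2], [4ab - 2, 2a^2]],
# which at (a, b) = (0, 0) equals [[0, -2], [-2, 0]]: indefinite.
H = np.array([[0.0, -2.0], [-2.0, 0.0]])
eig = np.linalg.eigvalsh(H)
print(eig)          # one negative, one positive -> saddle at the origin
```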
2.3 Finding reference models: bounding empirical risks via augmentation
To avoid the coupling problem in ResNEsts, we propose a new architecture, shown in Figure 1, called the augmented ResNEst or AResNEst. An $L$-block AResNEst introduces another set of prediction matrices $H_0, H_1, \ldots, H_L$ to replace every bilinear map on each feature in (2) with a linear map:

$$\hat{\mathbf{y}}_L^{A} = H_0 \mathbf{x} + \sum_{i=1}^{L} H_i\, G_{i-1}\left(\mathbf{v}_{i-1}\right) \qquad (4)$$

Now the function $\hat{\mathbf{y}}_L^{A}$ is linear with respect to all the prediction weights $H_0, \ldots, H_L$. Note that the feature finding weights still exist and are now dedicated to feature finding. On the other hand, $W_L$ and $W_{L+1}$ are deleted, since they are not used in the AResNEst. As a result, the corresponding ERM problem (PA) is defined over the feature finding weights and $H_0, \ldots, H_L$. The prediction weights are now different from those of the ResNEst, as the AResNEst uses $H_0, \ldots, H_L$. Because any AResNEst prevents the coupling problem, it exhibits the nice property shown below.
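The decoupling can be sketched numerically. In this toy example (ReLU residual functions; `H0` and `H_feats` are our stand-ins for the AResNEst's free prediction matrices), setting each free matrix to the corresponding coupled product reproduces the ResNEst output exactly, so the AResNEst's predictor class contains every ResNEst predictor with the same feature finding weights:

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda z: np.maximum(z, 0.0)

d, m, k, n, L = 4, 12, 6, 3, 2
x = rng.standard_normal(d)
W0 = rng.standard_normal((m, d))
blocks = [(rng.standard_normal((k, m)), rng.standard_normal((m, k)))
          for _ in range(L)]
W_out = rng.standard_normal((n, m))

# Shared feature generation (identical to the ResNEst).
v, feats = W0 @ x, []
for A_i, W_i in blocks:
    f = relu(A_i @ v)
    feats.append(f)
    v = v + W_i @ f
resnest_out = W_out @ v

def aresnest_predict(H0, H_feats):
    """AResNEst prediction: each feature gets its own free matrix
    instead of the coupled product of ResNEst weights."""
    out = H0 @ x
    for H_i, f in zip(H_feats, feats):
        out = out + H_i @ f
    return out

# Choosing the free matrices as the coupled products recovers the
# ResNEst exactly, so the AResNEst's minimum risk is lower or equal.
matched = aresnest_predict(W_out @ W0, [W_out @ W_i for _, W_i in blocks])
print(np.allclose(resnest_out, matched))    # True
```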
Assumption 2.
The loss function is differentiable and convex in for any .
Proposition 2.
If Assumption 2 is satisfied, then, for the same arbitrarily chosen feature finding weights, any local minimum of the empirical risk in (PA) is less than or equal to the minimum empirical risk of the corresponding ResNEst.
The proof of Proposition 2 is deferred to Appendix A.2 in the supplementary material. According to Proposition 2, AResNEsts establish empirical risk lower bounds (ERLBs) for ResNEsts. Hence, for the same arbitrarily picked feature finding weights, an AResNEst is better than a ResNEst in terms of any pair of local minima in their loss landscapes. Assumption 2 is practical because it is satisfied by two commonly used loss functions in regression and classification, i.e., the squared loss and the cross-entropy loss. Other losses, such as the logistic loss and the smoothed hinge loss, also satisfy this assumption.
2.4 Necessary condition for strictly improved residual representations
What properties are fundamentally required for features to be good, i.e., able to strictly improve the residual representation over blocks? With AResNEsts, we can answer this question straightforwardly. A fundamental answer is that they need to be at least linearly unpredictable. The feature $G_{i-1}(\mathbf{v}_{i-1})$ must be linearly unpredictable by the preceding features if

$$\mathcal{R}^{A}_{i} < \mathcal{R}^{A}_{i-1} \qquad (7)$$

for any local minimum in (PA), where $\mathcal{R}^{A}_{j}$ denotes the empirical risk attained using only the first $j$ residual representations. In other words, the residual representation is not strictly improved over the previous representation if the new feature is linearly predictable by the previous features. Fortunately, linear unpredictability of the features is usually satisfied when $G_{i-1}$ is nonlinear, and the set of features can then be viewed as a basis. This viewpoint also suggests avenues for improving feature construction through the imposition of various constraints. By Proposition 2, the non-strict version of the relation in (7) always holds, i.e., the residual representation is guaranteed to be no worse than the previous one at any local minimizer obtained from an AResNEst.
3 Wide ResNEsts with bottleneck residual blocks always attain ERLBs
Assumption 3.
The dimension of the residual representation is at least the input dimension, which in turn is at least the output dimension of the network.
Assumption 4.
The linear inverse problem of recovering all the features from the final residual representation has a unique solution.
Theorem 1.
The proof of Theorem 1 is deferred to Appendix A.3 in the supplementary material. Theorem 1 (a) provides a sufficient condition for a critical point to be a global minimum of (3). Theorem 1 (b) gives an affirmative answer: every local minimum of (3) attains the ERLB. To be more specific, any pair of local minima obtained from the ResNEst and the AResNEst using the same arbitrary feature finding weights are equally good. In addition, the implication of Theorem 1 (b) is that every local minimum of (3) is also a global minimum despite its nonconvex landscape (Proposition 1), which suggests that there exist no suboptimal local minima for the optimization problem (3). One can also establish the same results for local minimizers of (P) under the same set of assumptions by replacing "(3) under any fixed feature finding weights" with just "(P)" in Theorem 1. Such a modification may gain clarity, but is more restricted than the original statement due to Remark 1. Note that Theorem 1 is not limited to fixing any weights during training; it applies both to normal training (training all the weights in a network as a whole) and to blockwise or layerwise training procedures.
3.1 Improved representation guarantees
Remark 2.
Although there may exist suboptimal local minima in the optimization problem (P), Remark 2 suggests that such minima still improve residual representations over blocks under practical conditions. Mathematically, Remark 2 (a) and Remark 2 (b) are described by Corollary 1 and the general version of Corollary 2, respectively. Corollary 1 compares the minimum empirical risk obtained at any two representations among the residual representations, from the first to the last, for any given network satisfying the assumptions; Corollary 2 extends this comparison to the input representation.
Corollary 1.
Suppose Assumptions 2, 3, and 4 are satisfied. Then, for any given feature finding weights, the minimum empirical risk achievable by linear prediction from a later residual representation is less than or equal to that achievable from any earlier residual representation.
The proof of Corollary 1 is deferred to Appendix A.4 in the supplementary material. Because Corollary 1 holds true for any properly given weights, one can apply it to proper local minimizers of (P). Corollary 2 ensures that ResNEsts are guaranteed to be no worse than the best linear predictor under practical assumptions. This property is useful because linear estimators are widely used in signal processing applications, and they can now be confidently replaced with ResNEsts.
Corollary 2.
Suppose Assumptions 2, 3, and 4 are satisfied. Then any local minimum of (P) yields an empirical risk that is less than or equal to the minimum empirical risk provided by the best linear predictor of the input.
The proof of Corollary 2 is deferred to Appendix A.5 in the supplementary material. To the best of our knowledge, Corollary 2 is the first theoretical guarantee for vector-valued ResNet-like models with arbitrary residual blocks to outperform any linear predictor. Corollary 2 is more general than the results in [Shamir, 2018, Kawaguchi and Bengio, 2019, Yun et al., 2019] because it is not limited by assumptions such as scalar-valued output or a single residual block. In fact, we can make an even more general statement: any local minimum obtained from (3) with random or arbitrary feature finding weights is better than the minimum empirical risk provided by the best linear predictor, under the same assumptions used in Corollary 2. This general version fully describes Remark 2 (b).
Theorem 1, Corollary 1, and Corollary 2 are quite general because they are not limited to specific loss functions, residual functions, or datasets. Note that we do not impose any assumptions such as differentiability or convexity on the neural networks implementing the residual functions. Assumption 3 is practical because the dimension of the residual representation is usually larger than the input dimension, and the output dimension is usually not larger than the input dimension for most supervised learning tasks using sensory input. Assumption 4 states that the features need to be uniquely invertible from the residual representation. Although such an assumption requires a special architectural design, we find that it is always satisfied empirically after random initialization or training when the "bottleneck condition" is satisfied.

3.2 How to design architectures with representational guarantees?
Notice that one must be careful with the ResNEst architectural design so as to enjoy Theorem 1, Corollary 1, and Corollary 2. A ResNEst needs to be wide enough to necessarily satisfy Assumption 4. We call such a sufficient condition on the width and the feature dimensionalities a bottleneck condition. Because each nonlinear feature dimension must be smaller than the dimensionality of the residual representation, each of these residual functions is a bottleneck design [He et al., 2016a, b, Zagoruyko and Komodakis, 2016] forming a bottleneck residual block. We now explicitly see the importance of the expansion layer. Without the expansion, the dimensionality of the residual representation is limited to the input dimension. As a result, Assumption 4 cannot be satisfied in general, and the analysis of a ResNEst with multiple residual blocks remains intractable or requires additional assumptions on the residual functions.
Loosely speaking, a sufficiently wide expansion, or satisfaction of the bottleneck condition, implies Assumption 4. If the bottleneck condition is satisfied, then ResNEsts are equivalent to AResNEsts for given feature finding weights, i.e., their minimum empirical risks coincide. If not (e.g., if basic blocks are used in a ResNEst), then a ResNEst can suffer from diminishing feature reuse or end up with poor performance even though it has excellent features that could be fully exploited by an AResNEst to yield better performance. From this viewpoint, Theorem 1 supports the empirical findings in [He et al., 2016a] that bottleneck blocks are more economical than basic blocks. Our results thus recommend AResNEsts over ResNEsts if the bottleneck condition cannot be satisfied.
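One plausible numerical reading of the bottleneck condition (our own sketch; the paper states the exact inequality in terms of the width and feature dimensionalities) is that the residual space must be wide enough for the stacked expansion and block matrices to have full column rank, which is what makes the features uniquely recoverable as in Assumption 4:

```python
import numpy as np

rng = np.random.default_rng(3)

def bottleneck_condition_holds(m, d, feature_dims):
    # Hypothetical reading: the m-dimensional residual space must be
    # able to hold the d-dimensional input part plus every nonlinear
    # feature without collisions.
    return m >= d + sum(feature_dims)

def features_uniquely_invertible(W0, Ws):
    # Features are uniquely recoverable from the residual representation
    # exactly when the stacked matrix [W0 | W1 | ... | WL] has full
    # column rank (random weights achieve this almost surely).
    stacked = np.concatenate([W0] + Ws, axis=1)
    return np.linalg.matrix_rank(stacked) == stacked.shape[1]

d, feature_dims = 4, [6, 6]
for m in (8, 16):                   # too narrow vs. wide enough
    W0 = rng.standard_normal((m, d))
    Ws = [rng.standard_normal((m, k)) for k in feature_dims]
    print(m, bottleneck_condition_holds(m, d, feature_dims),
          features_uniquely_invertible(W0, Ws))
```

With these dimensions, the narrow case (m = 8) fails both checks while the wide case (m = 16) passes both.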
3.3 Guarantees on saddle points
In addition to guarantees for the quality of local minima, we find that ResNEsts can easily escape from saddle points due to the nice property shown below.
Theorem 2.
The proof of Theorem 2 is deferred to Appendix A.6 in the supplementary material. In contrast to Theorem 1 (a), Theorem 2 (a) provides a necessary condition for a saddle point. Although (3) is a nonconvex optimization problem according to Proposition 1 (a), Theorem 2 (b) establishes a desirable property for saddle points in the loss landscape. Because there exists at least one direction with strictly negative curvature at every saddle point when the bottleneck condition is satisfied, second-order optimization methods can rapidly escape from saddle points [Dauphin et al., 2014]. If first-order methods are used, the randomness in stochastic gradients helps them escape from saddle points [Ge et al., 2015]. Again, we require the bottleneck condition to be satisfied in order to guarantee this property of saddle points. Note that Theorem 2 is not limited to fixing any weights during training; it applies to both normal training and blockwise training procedures due to Remark 1.
4 DenseNEsts are wide ResNEsts with bottleneck residual blocks equipped with orthogonalities
Instead of adding one nonlinear feature in each block while remaining in the same residual space, the DenseNEst model shown in Figure 3 preserves each of the features in its own subspace by a sequential concatenation at each block. For an $L$-block DenseNEst, we define the $i$-th dense block as a function of the form

$$\mathbf{u}_i = \mathbf{u}_{i-1} \oplus D_i\left(\mathbf{u}_{i-1}\right) \qquad (8)$$

for $i = 1, 2, \ldots, L$, where the dense function $D_i$ is a general nonlinear function and $\mathbf{u}_i$ is the output of the $i$-th dense block. The symbol $\oplus$ concatenates its two vector arguments into a higher-dimensional vector. We define $\mathbf{u}_0 = \mathbf{x}$, where $\mathbf{x}$ is the input to the DenseNEst. For all $i$, $D_i$ is a function implemented by a neural network with its own parameters. The output of a DenseNEst is a linear map applied to $\mathbf{u}_L$, which can be written as

$$\hat{\mathbf{y}} = W_{\mathrm{pred}}\left( \mathbf{x} \oplus D_1(\mathbf{u}_0) \oplus D_2(\mathbf{u}_1) \oplus \cdots \oplus D_L(\mathbf{u}_{L-1}) \right) \qquad (9)$$

where $D_i(\mathbf{u}_{i-1})$ for $i = 1, \ldots, L$ are regarded as nonlinear features of the input $\mathbf{x}$, and we define $\mathbf{x}$ as the linear feature. $W_{\mathrm{pred}}$ is the prediction weight matrix in the DenseNEst, as all the weights responsible for prediction sit in this single matrix from the viewpoint of basis function modeling. The ERM problem (PD) for the DenseNEst is defined over all the weights. To fix the features, the parameters of all the dense functions need to be fixed. Therefore, the DenseNEst ERM problem for any fixed features is fairly straightforward, as it only requires optimizing over a single weight matrix, i.e.,

$$\min_{W_{\mathrm{pred}}} \; \frac{1}{N} \sum_{t=1}^{N} \ell\left(\hat{\mathbf{y}}(\mathbf{x}_t), \mathbf{y}_t\right) \qquad (10)$$

Unlike in ResNEsts, there is no coupling between feature finding and linear prediction in DenseNEsts. Compared to ResNEsts or AResNEsts, the way the features are generated in DenseNEsts generally makes linear predictability even more unlikely. To see this, note that each dense function applies directly to the concatenation of all previous features, whereas each residual function applies to the sum of all previous features.
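A minimal sketch of the dense connectivity just described (toy ReLU dense functions; dimensions and names are illustrative): each block's feature is computed from the concatenation of everything before it, the representation grows by the feature width per block, and a single prediction matrix acts on the final concatenation:

```python
import numpy as np

rng = np.random.default_rng(4)
relu = lambda z: np.maximum(z, 0.0)

def densenest_forward(x, blocks, W_pred):
    """Toy DenseNEst: each dense block sees the concatenation of the
    input and all earlier features (not their sum) and appends its own
    feature; one matrix W_pred predicts from the final concatenation."""
    u = x
    for A_i in blocks:
        f = relu(A_i @ u)           # dense function on all previous parts
        u = np.concatenate([u, f])  # keep the feature in its own subspace
    return W_pred @ u

d, k, n, L = 4, 3, 2, 3
x = rng.standard_normal(d)
blocks, width = [], d
for _ in range(L):
    blocks.append(rng.standard_normal((k, width)))
    width += k                      # representation grows by k per block
W_pred = rng.standard_normal((n, width))
print(densenest_forward(x, blocks, W_pred).shape)   # (2,)
```

For fixed features, fitting `W_pred` is an ordinary linear least-squares-style (convex) problem, which is the decoupling the text describes.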
Different from a ResNEst, which requires Assumptions 2, 3, and 4 to guarantee its superiority with respect to the best linear predictor (Corollary 2), the corresponding guarantee for a DenseNEst, shown in Proposition 3, requires weaker assumptions.
Proposition 3.
If Assumption 2 is satisfied, then any local minimum of (PD) is smaller than or equal to the minimum empirical risk given by any linear predictor of the input.
The proof of Proposition 3 is deferred to Appendix A.7 in the supplementary material. Notice that no special architectural design is required in a DenseNEst to make sure it always outperforms the best linear predictor. Any DenseNEst is always better than any linear predictor when the loss function is differentiable and convex (Assumption 2). This advantage can be explained by the prediction matrix in the DenseNEst: because it is the only prediction weight matrix and it is applied directly to the concatenation of all the features, the fixed-feature problem (10) is a convex optimization problem. We point out the difference between the prediction weights of the ResNEst and the DenseNEst. In the ResNEst, the prediction weights need to interpret the features from the residual representation, while the prediction matrix in the DenseNEst directly accesses the features. That is why we require Assumption 4 in the ResNEst, to eliminate any ambiguity in feature interpretation.
Can a ResNEst and a DenseNEst be equivalent? Yes, Proposition 4 establishes a link between them.
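The linear-algebra fact behind this link is that a concatenation of two vectors equals a sum of the same vectors embedded into a common wider space with disjoint, hence orthogonal, supports. A quick NumPy check (our own illustration of the construction):

```python
import numpy as np

rng = np.random.default_rng(5)

u = rng.standard_normal(4)          # existing representation
f = rng.standard_normal(3)          # new feature
m = u.size + f.size

# Embeddings into the m-dimensional space with disjoint supports.
P_u = np.zeros((m, u.size)); P_u[:u.size, :] = np.eye(u.size)
P_f = np.zeros((m, f.size)); P_f[u.size:, :] = np.eye(f.size)

concat = np.concatenate([u, f])
added = P_u @ u + P_f @ f           # additive (ResNEst-style) update
print(np.allclose(concat, added))   # True
print(np.allclose(P_u.T @ P_f, 0))  # True: orthogonal column spaces
```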
Proposition 4.
Any DenseNEst can be represented as a wide ResNEst with bottleneck residual blocks equipped with orthogonalities such that the two models are equivalent.
The proof of Proposition 4 is deferred to Appendix A.8 in the supplementary material. Because the concatenation of two given vectors can be represented by an addition of the two vectors projected onto a higher-dimensional space with disjoint supports, one straightforward construction of an equivalent ResNEst is to sufficiently expand the input space and enforce orthogonality of all the column vectors in the expansion and block weight matrices. As a result, any DenseNEst can be viewed as a ResNEst that always satisfies Assumption 4, and of course the bottleneck condition, no matter how we train the DenseNEst or select its hyperparameters, leading to the desirable guarantee: any local minimum obtained when optimizing the prediction weights of the ResNEst induced by any DenseNEst always attains the lower bound. Thus, DenseNEsts are certified as being advantageous over ResNEsts by Proposition 4. For example, if a narrow expansion is chosen, the guarantee in Theorem 1 no longer holds for a generic ResNEst; however, the ResNEst induced by a DenseNEst always attains the lower bound. Hence, Proposition 4 can be regarded as theoretical support for why standard DenseNets [Huang et al., 2017] are in general better than standard ResNets [He et al., 2016b].

5 Related work
In this section, we discuss ResNet works that investigate properties of local minima and give more details on the important references that appear in the introduction. We focus on highlighting their results and the assumptions used, so as to compare them with our theoretical results derived from practical assumptions. The earliest theoretical work on ResNets dates back to [Hardt and Ma, 2017], which proved that a vector-valued ResNet-like model using a linear residual function in each residual block has no spurious local minima (local minima that give larger objective values than the global minima) under squared loss and near-identity region assumptions. Other results [Li and Yuan, 2017, Liu et al., 2019] proved that stochastic gradient descent can converge to the global minimum in scalar-valued two-layer ResNet-like models; however, such a desirable property relies on strong assumptions, including a single residual block and a Gaussian input distribution.
Li et al. [2018] visualized the loss landscapes of a ResNet and its plain counterpart (without skip connections), and showed that skip connections promote flat minimizers and prevent the transition to chaotic behavior. Liang et al. [2018] showed that scalar-valued, single-residual-block ResNet-like models can have zero training error at all local minima, by making strong assumptions on the data distribution and loss function for a binary classification problem. Instead of pursuing the claim that local minima are global in the empirical risk landscape under strong assumptions, Shamir [2018] took a different route and proved that a scalar-valued ResNet-like model with a direct skip connection from the input to the output layer (a single residual block) is better than any linear predictor under mild assumptions. To be more specific, he showed that every local minimum obtained in his model is no worse than the global minimum of any linear predictor, under fairly general residual functions and with no assumptions on the data distribution. He also pointed out that the analysis for the vector-valued case is nontrivial. Kawaguchi and Bengio [2019] overcame this difficulty and proved that vector-valued models with a single residual block are better than any linear predictor under weaker assumptions. Yun et al. [2019] extended the prior work by Shamir [2018] to multiple residual blocks. Although the model considered is closer to a standard ResNet than those of previous works, the model output is assumed to be scalar-valued. None of the above-mentioned works takes into account the first layer that appears before the first residual block in standard ResNets. As a result, the dimensionality of the residual representation in their simplified ResNet models is constrained to the input dimension.

Broader impact
One of the mysteries of ResNets and DenseNets is that learning better DNN models seems to be as easy as stacking more blocks. In this paper, we define three generalized and analyzable DNN architectures, i.e., ResNEsts, A-ResNEsts, and DenseNEsts, to answer this question. Our results not only establish guarantees for monotonically improved representations over blocks, but also assure that all linear (affine) estimators can be replaced by our architectures without harming performance. We anticipate that these models can be friendly options for researchers or engineers who value, or mostly rely on, linear estimators or performance guarantees in their problems. In fact, these models should yield much better performance, as they can be viewed as basis function models with data-driven bases that are guaranteed to be always better than the best linear estimator. Our contributions advance the fundamental understanding of ResNets and DenseNets, and promote their use cases through a certificate of attractive guarantees.
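The "basis function models with data-driven bases" view rests on a simple nesting argument: any estimator that retains the affine part and merely adds extra features can, at the least-squares optimum, never fit the training data worse than the best affine predictor. A minimal NumPy sketch of that nesting argument (synthetic data and random nonlinear features, purely illustrative, not the ResNEst architecture itself):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

def lsq_mse(F, y):
    # Least-squares fit on feature matrix F (affine: append a bias column).
    F1 = np.hstack([F, np.ones((len(F), 1))])
    w, *_ = np.linalg.lstsq(F1, y, rcond=None)
    return np.mean((F1 @ w - y) ** 2)

# Best affine predictor on the raw input.
mse_linear = lsq_mse(X, y)

# Augmenting with extra (here random nonlinear) features keeps the affine
# model as a special case, so the optimal fit can only improve or stay equal.
extra = np.tanh(X @ rng.standard_normal((5, 8)))
mse_aug = lsq_mse(np.hstack([X, extra]), y)

print(mse_aug <= mse_linear + 1e-12)  # True
```

In ResNEsts and A-ResNEsts the bases are learned rather than random, which is where the architectural guarantees in this paper come in; the sketch only illustrates why retaining the affine part makes "no worse than the best linear estimator" attainable.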
Acknowledgments and Disclosure of Funding
We would like to thank the anonymous reviewers for their constructive comments. This work was supported in part by NSF under Grant CCF-2124929 and Grant IIS-1838830, in part by NIH/NIDCD under Grant R01DC015436, Grant R21DC015046, and Grant R33DC015046, in part by Halıcıoğlu Data Science Institute, and in part by Wrethinking, the Foundation.
References

P. Baldi and K. Hornik. Neural networks and principal component analysis: learning from examples without local minima. Neural Networks, 2(1):53–58, 1989.
C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pp. 2933–2941, 2014.
R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points: online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pp. 797–842, 2015.
M. Hardt and T. Ma. Identity matters in deep learning. In International Conference on Learning Representations, 2017.
K. He and J. Sun. Convolutional neural networks at constrained time cost. In Conference on Computer Vision and Pattern Recognition, pp. 5353–5360, 2015.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645, 2016.
J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In Conference on Computer Vision and Pattern Recognition, pp. 7132–7141, 2018.
G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Conference on Computer Vision and Pattern Recognition, pp. 4700–4708, 2017.
K. Kawaguchi and Y. Bengio. Depth with nonlinearity creates no bad local minima in ResNets. Neural Networks, 118:167–174, 2019.
K. Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pp. 586–594, 2016.
J. Kim, M. El-Khamy, and J. Lee. Residual LSTM: design of a deep recurrent architecture for distant speech recognition. arXiv preprint arXiv:1701.03360, 2017.
J. Kim, J. K. Lee, and K. M. Lee. Accurate image super-resolution using very deep convolutional networks. In Conference on Computer Vision and Pattern Recognition, pp. 1646–1654, 2016.
A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, pp. 6389–6399, 2018.
Y. Li and Y. Yuan. Convergence analysis of two-layer neural networks with ReLU activation. In Advances in Neural Information Processing Systems, vol. 30, pp. 597–607, 2017.
S. Liang, R. Sun, Y. Li, and R. Srikant. Understanding the loss surface of neural networks for binary classification. In International Conference on Machine Learning, pp. 2835–2843, 2018.
T. Liu, M. Chen, M. Zhou, S. S. Du, E. Zhou, and T. Zhao. Towards understanding the importance of shortcut connections in residual networks. In Advances in Neural Information Processing Systems, 2019.
F. Milletari, N. Navab, and S.-A. Ahmadi. V-Net: fully convolutional neural networks for volumetric medical image segmentation. In International Conference on 3D Vision, pp. 565–571, 2016.
X. Qin, Z. Zhang, C. Huang, M. Dehghan, O. R. Zaiane, and M. Jagersand. U2-Net: going deeper with nested U-structure for salient object detection. Pattern Recognition, 106:107404, 2020.
O. Shamir. Are ResNets provably better than linear predictors? In Advances in Neural Information Processing Systems, pp. 507–516, 2018.
K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
A. Veit, M. Wilber, and S. Belongie. Residual networks behave like ensembles of relatively shallow networks. In Advances in Neural Information Processing Systems, pp. 550–558, 2016.
S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Conference on Computer Vision and Pattern Recognition, pp. 1492–1500, 2017.
W. Xiong, L. Wu, F. Alleva, J. Droppo, X. Huang, and A. Stolcke. The Microsoft 2017 conversational speech recognition system. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5934–5938, 2018.
C. Yun, S. Sra, and A. Jadbabaie. Are deep ResNets provably better than linear predictors? In Advances in Neural Information Processing Systems, pp. 15686–15695, 2019.
S. Zagoruyko and N. Komodakis. Wide residual networks. In British Machine Vision Conference (BMVC), pp. 87.1–87.12, 2016.
M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833, 2014.
Appendix A Proofs
A.1 Proof of Proposition 1
Proof.
Let
(11) 
for and
(12) 
where . The Hessian of in () is given by
(13) 
where
(14) 
We have used to denote the Kronecker product. See Appendix A.9 for the derivation of the Hessian. By the generalized Schur complement,
(15) 
which implies the projection of onto the range of is itself. As a result,
(16) 
where denotes the Moore-Penrose pseudoinverse. Substituting the submatrices in (13) into the above equation, we obtain
(17) 
which implies
(18) 
On the other hand, the above condition is also necessary for the Hessian to be negative semidefinite because which implies (16).
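For reference, the generalized Schur complement criterion invoked in this proof is the standard one for symmetric block matrices; stated for positive semidefiniteness in generic notation (not the paper's symbols), with $A^{+}$ the Moore-Penrose pseudoinverse (the version for negative semidefiniteness follows by applying it to the negated matrix):

```latex
\begin{bmatrix} A & B \\ B^{\top} & C \end{bmatrix} \succeq 0
\;\Longleftrightarrow\;
A \succeq 0 , \qquad
\bigl(I - A A^{+}\bigr) B = 0 , \qquad
C - B^{\top} A^{+} B \succeq 0 .
```

The middle condition is exactly the range statement used above: the projection of the off-diagonal block onto the range of the top-left block is the off-diagonal block itself, i.e., $A A^{+} B = B$.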
Now, using the assumption , notice that the condition in (18) is not satisfied for any point in the set
(19) 
Hence, there exist some points in the domain at which the Hessian is indefinite. The objective function in () is nonconvex and nonconcave. We have proved the statement (a).
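The pseudoinverse identity underlying (16)–(17), namely that positive semidefiniteness of a symmetric block matrix forces the off-diagonal block into the range of the top-left block, can be sanity-checked numerically. A minimal NumPy sketch (generic random matrices, not the paper's Hessian):

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a random PSD block matrix M = G^T G, with a deliberately
# rank-deficient top-left block A and off-diagonal block B.
G = rng.standard_normal((4, 6))
G[:, 0] = G[:, 1]          # force rank deficiency in the first block
M = G.T @ G
A, B = M[:3, :3], M[:3, 3:]

# PSD of M forces range(B) ⊆ range(A), i.e. A A^+ B = B,
# where A^+ is the Moore-Penrose pseudoinverse.
A_pinv = np.linalg.pinv(A)
print(np.allclose(A @ A_pinv @ B, B))  # True
```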
By the generalized Schur complement and the assumption that is full rank, we have
(20) 
where we have used the spectrum property of the Kronecker product and the positive definiteness of . Notice that this is a contradiction because any point with is in the set . Hence, there exists no point at which the Hessian is negative semidefinite. Because negative semidefiniteness is a necessary condition for a local maximum, every critical point is then either a local minimum or a saddle point. We have proved the statement (b). ∎
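The spectrum property of the Kronecker product used in this proof, that the eigenvalues of $A \otimes B$ are the pairwise products of the eigenvalues of $A$ and $B$, can be verified numerically; a minimal NumPy sketch with illustrative symmetric matrices (not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Random symmetric matrices A (3x3) and B (2x2).
A = rng.standard_normal((3, 3)); A = A + A.T
B = rng.standard_normal((2, 2)); B = B + B.T

# Eigenvalues of the Kronecker product A ⊗ B ...
eig_kron = np.sort(np.linalg.eigvalsh(np.kron(A, B)))
# ... equal all pairwise products of the eigenvalues of A and B.
eig_prod = np.sort(np.outer(np.linalg.eigvalsh(A), np.linalg.eigvalsh(B)).ravel())

print(np.allclose(eig_kron, eig_prod))  # True
```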
A.2 Proof of Proposition 2
Proof.
is convex in because it is a nonnegative weighted sum of convex functions composed with affine mappings. Thus, () is a convex optimization problem and is the best linear fit using . That is, for any local minimizer , it is always true that
(21) 
for arbitrary . ∎
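The composition rule used in this proof is standard convex analysis; in generic notation (ours, not the paper's symbols): if each $g_i$ is convex, each weight $\alpha_i \ge 0$, and each map $\theta \mapsto A_i \theta + b_i$ is affine, then

```latex
f(\theta) \;=\; \sum_{i} \alpha_i \, g_i\!\left(A_i \theta + b_i\right)
\quad \text{is convex in } \theta .
```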
A.3 Proof of Theorem 1
Proof.
By the convexity in Proposition 2, every critical point in () is a global minimizer. Since the objective function of () is differentiable, the firstorder derivative is a zero row vector at any critical point, i.e.,
(22) 
for . Again, we have used to denote the Kronecker product. According to (22), the point is a global minimizer in () if and only if the sum of rank-one matrices is a zero matrix for , i.e.,
(23)
Next, we show that every local minimizer of () establishes a corresponding global minimizer in () such that for .
At any local minimizer of (), the firstorder necessary condition with respect to is given by
(24) 
Equivalently, we can write the above firstorder necessary condition into a matrix form
(25) 
On the other hand, for the firstorder necessary condition with respect to , we obtain
(26) 
The corresponding matrix form of the above condition is given by
(27) 
When is full rank at a critical point, (25) implies because the null space of is degenerate according to Assumption 3. Then, applying such an implication to (27) along with Assumption 4, we obtain
(28) 
Note that all the column vectors in are linearly independent if and only if the linear inverse problem has a unique solution for . We have proved the statement (a).
On the other hand, when is not full rank at a local minimizer, then there exists a perturbation on such that the new point is still a local minimizer which has the same objective value. Let be any local minimizer of () for which is not full row rank. By the definition of a local minimizer, there exists some such that
(29) 
where is an open ball centered at with the radius . Then must also be a local minimizer for any nonzero and any sufficiently small nonzero such that . Substituting the minimizer in (27) yields
(30) 
Subtracting (27) from the above equation, we obtain
(31) 
Multiplying both sides by , we have
(32) 
because can be arbitrary as long as it is sufficiently small. As a result, (28) is also true when is not full row rank. We have proved the statement (b). ∎
A.4 Proof of Corollary 1
A.5 Proof of Corollary 2
Proof.
By Theorem 1 (b),
(35) 
for any local minimizer of () using feature finding parameters . Then, by the convexity in Proposition 2, every local minimizer is a global minimizer of () using . Hence, it must be true that
(36) 
for arbitrary due to the zero prediction weights for . We have proved the statement (a). If the inequality in (36) is strict, i.e.,
(37) 
then (36) implies
(38) 
We have proved the statement (b). ∎
A.6 Proof of Theorem 2
Proof.
By Theorem 1 (a), every critical point with full rank is a global minimizer of (). Therefore, must be rank-deficient at every saddle point. We have proved the statement (a).
We argue that the Hessian is neither positive semidefinite nor negative semidefinite at every saddle point. According to the proof of Proposition 1, there exists no point in the domain of the objective function of (<