In this paper, we analyze vector-output two-layer ReLU neural networks from an optimization perspective. These networks, while simple, are the building blocks of deep networks, which have been found to perform tremendously well on a variety of tasks. We find that vector-output networks regularized with standard weight decay have a convex semi-infinite strong dual: a convex program with infinitely many constraints. This strong dual admits a finite parameterization, though expressing this parameterization is non-trivial. In particular, we find that expressing a vector-output neural network as a convex program requires taking the convex hull of completely positive matrices. We thus find an intimate, novel connection between neural network training and copositive programs, i.e. programs over the set of completely positive matrices (anjos2011handbook). We describe algorithms which find the global minimum of the neural network training problem in polynomial time for data matrices of fixed rank, a setting which holds for convolutional architectures. We also demonstrate that under certain conditions we can provably find the optimal solution to the neural network training problem using a soft-thresholded Singular Value Decomposition (SVD). For the general case, we introduce a relaxation to parameterize the neural network training problem, which in practice we find to be tight in many circumstances.
1.1 Related work
Our analysis focuses on the optima of finite-width neural networks. This contrasts with approaches that analyze infinite-width neural networks, such as the Neural Tangent Kernel (jacot2018neural). Despite advancements in this direction, infinite-width neural networks do not exactly correspond to their finite-width counterparts, and thus this method of analysis is insufficient for fully explaining the success of practical networks (arora2019exact).
Other works optimize neural networks under assumptions on the data distribution. Of particular interest is ge2018learning, which demonstrates that a polynomial number of samples generated from a planted neural network model is sufficient for extracting its parameters using tensor methods, assuming the inputs are drawn from a symmetric distribution. If the input distribution to a simple convolutional neural network with one filter is Gaussian, it has also been shown that gradient descent can find the global optimum in polynomial time (brutzkus2017globally). In contrast to these works, we seek general principles for learning two-layer ReLU networks, regardless of the data distribution and without planted-model assumptions.
Another line of work aims to understand the success of neural networks via implicit regularization, analyzing how models trained with Stochastic Gradient Descent (SGD) find solutions which generalize well, even without explicit control of the optimization objective (gunasekar2017implicit; neyshabur2014search). In contrast, we consider the setting of explicit regularization, often used in practice in the form of weight decay, which penalizes the sum of squared norms of the network weights with a single regularization parameter and can be critical for neural network performance (golatkar2019time).
Our approach of analyzing finite-width neural networks with a fixed training dataset has been explored for networks with a scalar output (pilanci2020neural; ergen2020aistats; ergen2020training). In fact, our work here can be considered a generalization of these results. We consider a ReLU-activation two-layer network with a finite number of neurons:
where the activation is the elementwise ReLU, and the network is parameterized by its first-layer and second-layer weights. In the scalar-output case, the second-layer weights are scalars. pilanci2020neural find that the neural network training problem in this setting corresponds to a finite-dimensional convex program.
However, the setting of scalar-output networks is limited. In particular, it cannot account for tasks such as multi-class classification or multi-dimensional regression, which are among the most common uses of neural networks. In contrast, the vector-output setting is quite general, and even greedily training and stacking such shallow vector-output networks can match or exceed the performance of deeper networks on large classification datasets (belilovsky2019greedy). Extending the scalar-output analysis to the vector-output case proves exceedingly non-trivial and generates novel insights; generalizing the results of pilanci2020neural is thus an important step toward a more complete understanding of the behavior of neural networks in practice.
Certain works have also considered technical problems which arise in our analysis, though their applications are entirely different. Among these is the analysis of cone-constrained PCA, as explored by deshpande2014cone and asteris2014nonnegative. They consider the following optimization problem
This problem is in general NP-hard. asteris2014nonnegative provide an algorithm for a symmetric input matrix which finds the exact solution to (2) in time exponential only in the rank of the matrix. We leverage this result to show that the optimal value of the vector-output neural network training problem can be found in exponential time in the worst case, while for a data matrix of fixed rank our algorithm is polynomial-time. In particular, convolutional networks with fixed filter sizes correspond to the fixed-rank data case. In search of a polynomial-time approximation to (2), deshpande2014cone evaluate a relaxation of the above problem, given as
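To make the hardness concrete, the following brute-force sketch solves the simplest cone-constrained instance, nonnegative PCA, exactly: by the KKT conditions, the optimizer restricted to its support is an eigenvector of the corresponding principal submatrix, so enumerating all such candidates is exact. It is exponential in the dimension and is for illustration only; the function name and tolerances are our own, not the cited algorithm.

```python
import itertools
import numpy as np

def nonneg_pca(M):
    """Exact max of u^T M u over unit-norm u >= 0, by brute force.

    Enumerates the eigenvectors of every principal submatrix of M;
    the optimizer restricted to its support must be one of these,
    so taking the best nonnegative candidate is exact. O(2^n) time.
    """
    n = M.shape[0]
    best_val, best_u = -np.inf, None
    for r in range(1, n + 1):
        for S in itertools.combinations(range(n), r):
            S = list(S)
            vals, vecs = np.linalg.eigh(M[np.ix_(S, S)])
            for lam, v in zip(vals, vecs.T):
                v = -v if v.sum() < 0 else v  # resolve sign ambiguity
                if np.all(v >= -1e-12):       # nonnegative eigenvector
                    if lam > best_val:
                        u = np.zeros(n)
                        u[S] = np.clip(v, 0.0, None)
                        best_val, best_u = lam, u
    return best_val, best_u

# On this matrix the unconstrained top eigenvector has mixed signs,
# so the nonnegative optimum falls back to a coordinate direction.
val, u = nonneg_pca(np.array([[1.0, -2.0], [-2.0, 1.0]]))  # val == 1.0
```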
While the relaxation is not tight in all cases, the authors find that in practice it approximates the solution to the original optimization problem quite well. This relaxation corresponds to what we call a copositive relaxation, because it relaxes a set of cone-constrained rank-one matrices. When the cone is the nonnegative orthant and the norm constraint is removed, this set is the set of completely positive matrices (dur2010copositive). Optimizing over the set of completely positive matrices is NP-hard, as is optimizing over its convex hull:
Thus, optimizing over this convex hull is a convex optimization problem which is nevertheless NP-hard. Various relaxations have been proposed, such as the copositive relaxation used by deshpande2014cone above:
In fact, this relaxation is tight for matrices of sufficiently small dimension (burer2015gentle; kogan1993characterization). In higher dimensions, however, the inclusion is strict, so the copositive relaxation provides a lower bound in the general case. These theoretical results prove insightful for understanding the neural network training objective.
We find the semi-infinite convex strong dual for the vector-output two-layer ReLU neural network training problem, and prove that it has a finite-dimensional exact convex optimization representation.
We establish a new connection between vector-output neural networks, copositive programs and cone-constrained PCA problems, yielding new insights into the nature of vector-output neural network training, which extend upon the results of the scalar-output case.
We provide methods that globally solve the vector-output neural network training problem in polynomial time for data matrices of fixed rank; in the full-rank case, however, the complexity is necessarily exponential under standard complexity-theoretic assumptions.
We provide conditions on the training data and labels with which we can find a closed-form expression for the optimal weights of a vector-output ReLU neural network using soft-thresholded SVD.
We propose a copositive relaxation to establish a heuristic for solving the neural network training problem. This copositive relaxation is often tight in practice.
In this work, we consider fitting labels from inputs with a two-layer ReLU-activation neural network with a fixed number of neurons in the hidden layer. This network is trained with weight-decay regularization on all of its weights, with an associated regularization parameter. For a general loss function, this gives us the non-convex primal optimization problem
In the simplest case, with a fully-connected network trained with squared loss (Appendix A.6 contains extensions to general convex loss functions), this becomes:
However, alternative models can be considered. For example, ergen2020training consider two-layer CNNs with global average pooling, for which we can define patch matrices containing the patches that individual convolutions operate upon. Then, the vector-output neural network training problem with global average pooling becomes
We will show that in this convolutional setting, because the rank of the patch matrices cannot exceed the filter size of the convolutions, there exists an algorithm, polynomial in all problem dimensions, for finding the global optimum of this problem. We note that such matrices typically exhibit rapid singular value decay due to spatial correlations, which may also motivate replacing them with approximations of much smaller rank.
In the following section, we will demonstrate how the vector-output neural network problem has a convex semi-infinite strong dual. To understand how to parameterize this semi-infinite dual in a finite fashion, we must introduce the concept of hyperplane arrangements. We consider the set of diagonal matrices
This is a finite set of diagonal matrices, dependent on the data matrix, which indicates the possible activation patterns of the ReLU non-linearity: a value of 1 indicates that the neuron is active, while 0 indicates that it is inactive. In particular, we can enumerate this finite set of sign patterns, whose cardinality depends on the data matrix but is in general bounded by
(pilanci2020neural; stanley2004introduction). Note that for a data matrix of fixed rank, such as in the convolutional case above, the number of sign patterns is polynomial in the number of samples. Using these sign patterns, we can completely characterize the range space of the first layer after the ReLU:
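As an illustration, the sign patterns can be under-approximated in practice by sampling random hyperplane directions and recording the distinct ReLU activation patterns they induce; this is a common heuristic, and the function name and sample count here are our own.

```python
import numpy as np

def sample_sign_patterns(X, num_samples=10_000, seed=0):
    """Collect distinct ReLU activation patterns diag(1{X u >= 0})
    by sampling random directions u. This under-approximates the full
    hyperplane-arrangement enumeration but is cheap and often
    exhaustive for small problems."""
    rng = np.random.default_rng(seed)
    patterns = {tuple((X @ u >= 0).astype(int))
                for u in rng.standard_normal((num_samples, X.shape[1]))}
    return [np.diag(p) for p in sorted(patterns)]

# Three data points in two dimensions: three hyperplanes through the
# origin cut the plane into six regions, i.e. six sign patterns.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
D_list = sample_sign_patterns(X)  # 6 distinct diagonal patterns
```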
We also introduce a class of data matrices for which the analysis of scalar-output neural networks simplifies greatly, as shown in (ergen2020convex). These are called spike-free matrices: a matrix is spike-free if it holds that
When the data matrix is spike-free, the set of sign patterns reduces to a single sign pattern because of the identity in (7). The set of spike-free matrices includes (but is not limited to) diagonal matrices and whitened matrices, such as the output of Zero-phase Component Analysis (ZCA) whitening. Whitening the data matrix has been shown to improve the performance of neural networks, even in deeper settings where the whitening transformation is applied to batches of data at each layer (huang2018decorrelated). We will see that spike-free matrices yield polynomial-time algorithms for finding the global optimum of the scalar-output neural network training problem (ergen2020convex), though the same does not hold for vector-output networks.
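As a concrete sketch, one way to whiten a data matrix so that it satisfies the whitened-Gram condition for spike-freeness is to rescale its singular values to one; the function name is ours, and we assume a full-row-rank matrix with at least as many columns as rows.

```python
import numpy as np

def whiten_spike_free(X):
    """Rescale the singular values of X (n x d, n <= d, full row rank)
    to one: X = U S V^T -> U V^T. The result W satisfies W W^T = I,
    the whitened-Gram condition under which the data is spike-free."""
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 20))
Xw = whiten_spike_free(X)
# Xw @ Xw.T is the 5 x 5 identity (up to floating-point error)
```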
2.1 Warm-Up: Scalar-Output Networks
We first present strong duality results for the scalar-output case, i.e. the case where the output dimension is one.
Theorem (pilanci2020neural). Provided the number of neurons is sufficiently large, for all values of the regularization parameter, the neural network training problem (5) has a convex semi-infinite strong dual, given by
Furthermore, the neural network training problem has a convex, finite-dimensional strong bi-dual, given by
This is a finite-dimensional convex program whose numbers of variables and linear inequalities grow with the number of sign patterns. Solving it with standard interior-point solvers has complexity exponential in the rank of the data matrix, but polynomial in the number of samples for a fixed rank.
In the case of a spike-free data matrix, however, the dual problem simplifies to a single sign-pattern constraint. Then the convex strong bi-dual becomes (ergen2020convex)
This convex problem has a much simpler form, with far fewer linear inequality constraints and variables, and hence a polynomial complexity in all problem dimensions. We will see that these results for scalar-output ReLU neural networks are a special case of the vector-output results.
3 Strong Duality
3.1 Convex Semi-infinite duality
Provided the number of neurons is sufficiently large, for all values of the regularization parameter, the neural network training problem (5) has a convex semi-infinite strong dual, given by
Furthermore, the neural network training problem has a convex, finite-dimensional strong bi-dual, given by
for convex sets
The strong dual given in (11) is convex, albeit with infinitely many constraints. In contrast, (12) is a convex problem with finitely many constraints. This convex model learns a sparse set of locally linear models
which are constrained to lie in a convex set, for which group sparsity and low-rankness over hyperplane arrangements are induced by the sum-of-nuclear-norms penalty. The emergence of the nuclear norm penalty is particularly interesting, since similar norms have been used for rank minimization problems (candes2010power; recht2010guaranteed), proposed as implicit regularizers for matrix factorization models (gunasekar2017implicit), and are related to nuclear-norm regularization in multitask learning (argyriou2008convex; abernethy2009new) and to the trace Lasso (grave2011trace). We note the similarity of this result to that of pilanci2020neural, whose formulation is a special case of this result for scalar outputs, where the constraint sets reduce to
from which we can obtain the convex program presented by pilanci2020neural. Further, this result extends to CNNs with global average pooling, which is discussed in Appendix A.3.2.
It is interesting to observe that the convex program (12) can be interpreted as a piecewise low-rank model partitioned according to the set of hyperplane arrangements of the data matrix. In other words, a two-layer ReLU network with vector output is precisely a linear learner over locally masked features, where convex constraints and group nuclear-norm regularization are applied to the linear model weights. In the case of CNNs, the piecewise low-rank model is over the lower-dimensional patch matrices, which result in significantly fewer hyperplane arrangements, and therefore fewer local low-rank models.
3.2 Provably Solving the Neural Network Training Problem
In this section, we present a procedure for minimizing the convex program (12) for general output dimension. This procedure relies on Algorithm 5 for cone-constrained PCA from (asteris2014nonnegative) and on the Frank-Wolfe algorithm for constrained convex optimization (frank1956algorithm). Unlike SGD, which is a heuristic applied to a non-convex training problem, this approach is built upon results of convex optimization and provably finds the global minimum of the objective. In particular, we can solve the problem in epigraph form,
where we can perform bisection over the epigraph variable in an outer loop to determine the overall optimal value of (12). Then, we have the following algorithm to solve the inner minimization problem of (14):
For each iteration of the algorithm:
For each sign pattern, solve the following subproblem:
Define the pairs attaining the maxima of the above subproblems. This is a form of semi-nonnegative matrix factorization (semi-NMF) on the residual at the current iteration (ding2008convex). It can be solved via cone-constrained PCA, in time polynomial in the number of samples for a data matrix of fixed rank.
For the semi-NMF factorization obtaining the largest objective value, form the corresponding rank-one atom; for all other sign patterns, simply use a zero atom.
For the chosen step size, update
The derivations for the method and complexity of Algorithm 1 are found in Appendix A.4. We have thus described a Frank-Wolfe algorithm which provably minimizes the convex dual problem, where each step requires a semi-NMF operation that is polynomial-time for data of fixed rank.
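To illustrate the template, here is a minimal Frank-Wolfe sketch for a plain nuclear-norm-constrained least-squares problem. Algorithm 1 follows the same pattern, but replaces this rank-one linear minimization oracle by the cone-constrained PCA / semi-NMF step described above; the names and the toy problem instance are ours, not the paper's.

```python
import numpy as np

def frank_wolfe_nuc(A, Y, t, iters=300):
    """min_V ||A V - Y||_F^2 subject to ||V||_* <= t, by Frank-Wolfe.

    The linear minimization oracle over the nuclear-norm ball returns
    the rank-one atom -t * (top singular pair of the gradient); in
    Algorithm 1 this oracle is replaced by cone-constrained PCA.
    """
    V = np.zeros((A.shape[1], Y.shape[1]))
    for k in range(iters):
        G = 2.0 * A.T @ (A @ V - Y)            # gradient of the loss
        U, _, Wt = np.linalg.svd(G)
        atom = -t * np.outer(U[:, 0], Wt[0])   # LMO: best rank-one step
        gamma = 2.0 / (k + 2.0)                # standard FW step size
        V = (1.0 - gamma) * V + gamma * atom
    return V
```

Each iterate is a convex combination of rank-one atoms of nuclear norm at most t, so feasibility is maintained automatically; this is what makes Frank-Wolfe attractive when the constraint set only admits a cheap linear oracle.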
3.3 Spike-free Data Matrices and Closed-Form Solutions
As discussed in Section 2, if the data matrix is spike-free, the set of sign patterns reduces to a single pattern. Then, the convex program (12) becomes
This problem can also be solved with Algorithm 1. However, the asymptotic complexity of the algorithm is unchanged, due to the cone-constrained PCA step. If the cone constraint were removed, (15) would be identical to optimizing a linear-activation network. The additional cone constraint, however, allows for a more complex representation, which demonstrates that even in the spike-free case a ReLU-activation network is quite different from a linear-activation network.
Recalling that whitened data matrices are spike-free, for a further simplified class of data and label matrices we can find a closed-form expression for the optimal weights.
Consider a whitened data matrix and a label matrix with a given SVD. If the left singular vectors of the label matrix satisfy the cone constraint, there exists a closed-form solution for the optimal weights in problem (15), given by
The resulting model is a soft-thresholded SVD of the label matrix, which arises as the solution of maximum-margin matrix factorization (srebro2005maximum). The scenario in which all the left singular vectors of the label matrix satisfy the affine constraints occurs, for example, when all of the left singular vectors are nonnegative, as is the case when the label matrix is one-hot encoded. In this scenario, the ReLU constraint is not active, and therefore the solution of the ReLU-activation network training problem is identical to that of the linear-activation network. This linear-activation setting has been well studied, such as in matrix factorization models by (cabral2013unifying; li2017geometry), and in the context of the implicit bias of dropout by (mianjy2018implicit; mianjy2019dropout). This theorem thus provides a setting in which the ReLU-activation neural network performs identically to a linear-activation network.
4 Neural networks and copositive programs
4.1 An equivalent copositive program
We now present an alternative representation of the neural network training problem with squared loss, which has ties to copositive programming.
For all values of the regularization parameter, the neural network training problem (5) has a convex strong dual, given by
for convex sets given by
This is a minimization problem with a convex objective over convex, completely positive cones: a copositive program, which is NP-hard in general. There exists a cutting-plane algorithm that solves this problem in time polynomial in the number of samples for data matrices of fixed rank (see Appendix A.5). This formulation provides a framework for viewing ReLU neural networks as implicit copositive programs, and we can find conditions under which certain relaxations provide optimal solutions.
4.2 A copositive relaxation
We consider the copositive relaxation of the sets from (17). We denote this set
In general, the relaxed set is an outer approximation of the original one, with equality in sufficiently small dimensions (kogan1993characterization; dickinson2013copositive). We define the relaxed program as
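A quick numerical sanity check of the containment (variable names ours): any completely positive matrix, i.e. one of the form B Bᵀ with B entrywise nonnegative, satisfies both conditions kept by the relaxation, positive semidefiniteness and entrywise nonnegativity. For dimensions five and larger the converse fails, which is why the relaxation only provides a lower bound in general.

```python
import numpy as np

rng = np.random.default_rng(0)
# A completely positive matrix: M = B B^T with B entrywise nonnegative.
B = rng.random((5, 4))
M = B @ B.T

# The relaxation keeps only the tractable consequences of complete
# positivity: positive semidefiniteness and entrywise nonnegativity.
is_psd = np.linalg.eigvalsh(M).min() >= -1e-10
is_nonneg = M.min() >= 0.0
```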
Because of the enumeration over sign patterns, this relaxed program still has worst-case complexity exponential in the data rank, and thus does not improve upon the asymptotic complexity presented in Section 3.
4.3 Spike-free data matrices
If the data matrix is restricted to be spike-free, the convex program (19) becomes
With spike-free data matrices, the copositive relaxation yields a heuristic algorithm for the neural network training problem which is polynomial in all problem dimensions. This contrasts with the exact formulations (12) and (17), for which the neural network training problem is exponential even for spike-free data. Table 1 summarizes the complexities of the neural network training problem.
Table 1: Complexities of the neural network training problem for the scalar-output, vector-output (exact), and vector-output (relaxation) formulations.
5.1 Does SGD always find the global optimum for neural networks?
While SGD applied to the non-convex neural network training objective is a heuristic which works quite well in many cases, there may exist pathological cases where SGD fails to find the global minimum. Using Algorithm 1, we can now verify whether SGD finds the global optimum. In this experiment, we present one such case where SGD struggles to find the optimal solution. In particular, we generate random inputs whose elements are drawn i.i.d. from a standard Gaussian distribution. We then randomly initialize a data-generator neural network with 100 hidden neurons and an output dimension of 5, and generate labels using this model. We then attempt to fit these labels using a neural network trained with squared loss. We compare the results of training this network for 5 trials with 10 and 50 neurons to the global optimum found by Algorithm 1. With 10 neurons, none of the realizations of SGD converge to the global optimum found by Algorithm 1, but with 50 neurons, the loss is nearly identical to that found by Algorithm 1.
5.2 Maximum-Margin Matrix Factorization
In this section, we evaluate the performance of the soft-thresholded SVD closed-form solution presented in Theorem 2. To evaluate this method, we take a subset of 3000 points from each of the CIFAR-10 and CIFAR-100 datasets (krizhevsky2009learning). For each dataset, we first de-mean the data matrix, then whiten it using ZCA whitening. We seek to fit one-hot-encoded labels representing the class labels of these datasets.
We compare the results of our algorithm from Theorem 2 and those of SGD in Fig. 6. In both cases, the soft-thresholded SVD solution finds the same objective value as SGD in far less time. Appendix A.1.4 contains further details about this experiment, including test accuracy plots.
5.3 Effectiveness of the Copositive Program
In this section, we compare the objective values obtained by SGD, Algorithm 1, and the copositive program defined in (17). We use an artificially generated spiral dataset with 3 classes (see Fig. 9(a) for an illustration). In this case, since the output dimension is small, the copositive relaxation in (19) is tight. Across different values of the regularization parameter, we compare the solutions found by these three methods. As shown in Fig. 9, the copositive relaxation, the solution found by SGD, and the solution found by Algorithm 1 all coincide with the same loss across the various regularization levels. This verifies our theoretical proofs of the equivalence of (5), (12), and (19).
We studied the vector-output ReLU neural network training problem, and designed the first algorithms for finding the global optimum of this problem which are polynomial-time in the number of samples for a fixed data rank. We found novel connections between this vector-output ReLU neural network problem and a variety of other problems, including semi-NMF, cone-constrained PCA, soft-thresholded SVD, and copositive programming. These connections yield insights into the neural network training problem and leave room for further exploration. Of particular interest is extending these results to deeper networks, which would further explain the performance of neural networks as they are often used in practice. One way to extend the results in this paper to deeper networks is to greedily train and stack two-layer networks to create one deeper network, which has been shown to mimic the performance of deep networks trained end-to-end. An alternative approach would be to analyze deeper networks directly, which would require a hierarchical representation of sign patterns at each hidden layer and would complicate our analysis significantly. Some preliminary results on convex program equivalents of deeper training problems are presented under whitened input data assumptions in (ergen2020convexdeep). Another interesting research direction is investigating efficient relaxations of our vector-output convex programs for larger-scale simulations. Convex relaxations for scalar-output ReLU networks with approximation guarantees were studied in (bartan2019convex; ergen2019convexshallow; ergen2019convexcutting; d2020global). Furthermore, the landscapes of vector-output neural networks and the dynamics of gradient-descent-type methods can be analyzed by leveraging our results. In (lacotte2020all), an analysis of the landscape for scalar-output networks based on the convex formulation was given, which establishes a direct mapping between the non-convex and convex objective landscapes.
Finally, our copositive programming and semi-NMF representations of ReLU networks can be used to develop more interpretable neural models. An investigation of scalar output convex neural models for neural image reconstruction was given in (sahiner2020convex). Future works can explore these interesting and novel directions.
Appendix A Appendix
a.1 Additional experimental details
Unless otherwise stated, the neural networks were trained until full training-loss convergence with SGD, with a momentum parameter of 0.95 and a batch size equal to the size of the training set (i.e. full-batch gradient descent); the learning rate was decreased by a factor of 2 whenever the training loss reached a plateau. The initial learning rate was set as high as possible without causing training to diverge. All neural networks were initialized with Kaiming uniform initialization (he2015delving).
a.1.1 Additional Experiment: Copositive relaxation beyond the provably tight regime
The copositive relaxation of the neural network training problem described in (19) is not guaranteed to match the exact objective outside the provably tight regime. However, we find that in practice this relaxation is tight even in such settings. To demonstrate such an instance, we consider the problem of generating images from noise.
In particular, we initialize noise inputs element-wise i.i.d. from a standard Gaussian distribution. To analyze the spike-free setting, we whiten the inputs using ZCA whitening. Then, we attempt to fit images from the MNIST handwritten digits dataset (lecun1998gradient) and the CIFAR-10 dataset (krizhevsky2009learning). From each dataset, we select 100 random images with 10 samples from each class and flatten them into vectors. We allow the noise inputs to have the same shape as the output. Clearly, in these cases, the copositive relaxation (20) is not guaranteed to correspond exactly to the neural network training optimum.
However, we find, across a variety of regularization parameters, that the solutions found by SGD and by this copositive relaxation exactly correspond, as demonstrated in Figure 12.
While for the lowest regularization value the copositive relaxation does not exactly match the value obtained by SGD, we note that we showed the objective value of the copositive relaxation to be a lower bound on the neural network training objective, meaning that the differences seen in this plot are likely due to numerical optimization issues rather than fundamental ones. Non-convex SGD was trained for 60,000 epochs with 1000 neurons, while the copositive relaxation was trained using Adam (kingma2014adam) with the Geotorch library for constrained and manifold optimization in PyTorch, which allowed us to express the PSD constraint, with an additional hinge loss to penalize violations of the affine constraints. The copositive relaxation was also trained for 60,000 epochs, with learning rates tuned separately for CIFAR-10 and MNIST.
a.1.2 Additional Experiment: Comparing ReLU Activation to Linear Activation in the Case of Spike-Free Matrices
As discussed in Section 3.3, if the data matrix is spike-free, the resulting convex ReLU model (15) is similar to a linear-activation network, the only difference being an additional cone constraint on the weight matrix. One might wonder whether, for spike-free data matrices, the use of a ReLU network is necessary at all, and whether a linear-activation network would perform equally well.
In this experiment, we compare the performance of a ReLU-activation network to a linear-activation one, and demonstrate that even in the spike-free case there exist instances in which the ReLU-activation network is preferable. In particular, we take as our training data 3000 de-meaned and ZCA-whitened images from the CIFAR-10 dataset, which form a spike-free training matrix. We then generate continuous labels from a randomly initialized two-layer ReLU network with 4000 hidden units. We use this same label-generating network to generate labels for the full 10,000-sample test set of CIFAR-10 as well, after the test images are pre-processed with the same whitening transformation used on the training data.
Across different regularization values, we measured the training and generalization performance of both ReLU-activation and linear-activation two-layer neural networks trained with SGD on this dataset. Both networks used 4000 hidden units and were trained for 400 epochs. Our results are displayed in Figure 15.
For all regularization values, while the linear-activation network has equal or lower training loss than the ReLU-activation network, the ReLU-activation network generalizes significantly better, achieving orders of magnitude lower test loss. We note that for sufficiently large regularization values, both networks learn the zero network (i.e. all weights at the optimum are zero), so their training and test losses are identical. We also observe that the best-case test loss for the linear-activation network is achieved by simply learning the zero network, whereas at an intermediate regularization value the ReLU-activation network learns to generalize better than the zero network (achieving a test loss of 63038, compared to 125383 for the zero network).
These results demonstrate that even for spike-free data matrices, there are reasons to prefer a ReLU-activation network to a linear-activation network. In particular, because of the cone constraint on the dual weights, the ReLU network is induced to learn a more complex representation than the linear network, which would explain its better generalization performance.
The CIFAR-10 dataset consists of 50,000 training images and 10,000 test images with 3 RGB channels, in 10 classes (krizhevsky2009learning). These images were normalized by the per-channel training-set mean and standard deviation. To form our training set, we selected 3,000 training images at random, with each class equally represented. This data was then feature-wise de-meaned and transformed using ZCA. The same training-class mean and ZCA transformation were then also applied to the 10,000 test points for evaluation.
a.1.3 Does SGD always find the global optimum for neural networks?
For these experiments, SGD was trained for 20,000 epochs. The value of the constraint radius for Algorithm 1 was found by starting at the regularization penalty from the SGD solution, then refining this value with manual tuning. Algorithm 1 was run for 30,000 iterations, and took X seconds to solve.
a.1.4 Maximum-Margin Matrix Factorization
The CIFAR-10 and CIFAR-100 datasets consist of 50,000 training images and 10,000 test images with 3 RGB channels, with 10 and 100 classes respectively (krizhevsky2009learning). These images were normalized by the per-channel training-set mean and standard deviation. To form our training set, we selected 3,000 training images from these datasets at random, with each class equally represented. This data was then feature-wise de-meaned and transformed using ZCA. The same training-class mean and ZCA transformation were then also applied to the 10,000 test points for evaluation. Different regularization parameter values were used for CIFAR-10 and CIFAR-100.
SGD was trained for 400 epochs with 1000 neurons, with one-hot-encoded labels and squared loss. Figure 18 displays the test accuracy of the learned networks. Surprisingly, the whitened classification from only 3,000 images generalizes quite well in both cases, far exceeding the performance of the null classifier. For the CIFAR-10 experiment, the algorithm from Theorem 2 took only 0.018 seconds to solve, whereas for CIFAR-100 it took 0.36 seconds.
A.1.5 Effectiveness of the Copositive Program
For this classification problem, we use one-hot encoded labels and squared loss. For , SGD used a learning rate of , and otherwise a learning rate of . SGD was trained for 8,000 epochs with 1,000 neurons, while Algorithm 1 ran for 1,000 iterations. The copositive relaxation was optimized with CVXPY using a first-order solver on a CPU with 256 GB of RAM (diamond2016cvxpy), with a maximum of 20,000 iterations. This dataset had sign patterns. The value of for Algorithm 1 was chosen as the regularization penalty of the SGD solution.
A.2 Note on Data Matrices of a Fixed Rank
Consider the neural network training problem
$$p^* = \min_{W_1 \in \mathbb{R}^{d \times m},\, W_2 \in \mathbb{R}^{m \times c}} \mathcal{L}\big((XW_1)_+ W_2,\, Y\big) + \frac{\beta}{2}\left(\|W_1\|_F^2 + \|W_2\|_F^2\right).$$
Let $X = U\Sigma V^\top$ be the compact SVD of $X$ with rank $r$, where $U \in \mathbb{R}^{n \times r}$, $\Sigma \in \mathbb{R}^{r \times r}$, and $V \in \mathbb{R}^{d \times r}$. Let $\tilde{X} := U\Sigma$ and decompose $W_1 = V\tilde{W}_1 + V_\perp \bar{W}_1$, where the columns of $V_\perp$ span the orthogonal complement of the column space of $V$. We note that $XW_1 = \tilde{X}\tilde{W}_1$ and $\|W_1\|_F^2 = \|\tilde{W}_1\|_F^2 + \|\bar{W}_1\|_F^2$. Then, we can re-parameterize the problem as
$$p^* = \min_{\tilde{W}_1,\, \bar{W}_1,\, W_2} \mathcal{L}\big((\tilde{X}\tilde{W}_1)_+ W_2,\, Y\big) + \frac{\beta}{2}\left(\|\tilde{W}_1\|_F^2 + \|\bar{W}_1\|_F^2 + \|W_2\|_F^2\right).$$
We note that $\bar{W}_1$ appears only in the regularization term. Minimizing over $\bar{W}_1$ thus means simply setting it to $0$. Then, we have
$$p^* = \min_{\tilde{W}_1 \in \mathbb{R}^{r \times m},\, W_2} \mathcal{L}\big((\tilde{X}\tilde{W}_1)_+ W_2,\, Y\big) + \frac{\beta}{2}\left(\|\tilde{W}_1\|_F^2 + \|W_2\|_F^2\right).$$
We note that $\tilde{X} \in \mathbb{R}^{n \times r}$ and $\tilde{W}_1 \in \mathbb{R}^{r \times m}$. Thus, for $X$ of rank $r$, we can effectively reduce the dimension of the neural network training problem from $d$ to $r$ without loss of generality. This thus holds for all results concerning the complexity of the neural network training problem with data matrices of a fixed rank.
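The reduction above can be checked numerically: for a rank-$r$ data matrix with compact SVD $X = U\Sigma V^\top$, replacing $X$ by $\tilde{X} = U\Sigma$ and $W_1$ by $V^\top W_1$ leaves the network output unchanged. The sketch below uses our own variable names and random data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r, m, c = 20, 10, 3, 5, 2
# Construct a data matrix of rank r
X = rng.normal(size=(n, r)) @ rng.normal(size=(r, d))

# Compact SVD: X = U Sigma V^T, truncated to rank r
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]
X_tilde = U @ np.diag(s)          # n x r reduced data matrix

W1 = rng.normal(size=(d, m))
W2 = rng.normal(size=(m, c))
W1_tilde = Vt @ W1                # r x m reduced first-layer weights

relu = lambda Z: np.maximum(Z, 0.0)
out_full = relu(X @ W1) @ W2
out_reduced = relu(X_tilde @ W1_tilde) @ W2
# The two-layer ReLU network output is unchanged after the reduction
print(np.allclose(out_full, out_reduced))
```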
A.3.1 Proof of Theorem 1
We begin with the primal problem (5), repeated here for convenience:
$$p^* = \min_{W_1, W_2} \mathcal{L}\big((XW_1)_+ W_2,\, Y\big) + \frac{\beta}{2}\sum_{j=1}^m \left(\|w_{1j}\|_2^2 + \|w_{2j}\|_2^2\right).$$
We start by re-scaling the weights in order to obtain a slightly different, equivalent objective, as has been done previously in (pilanci2020neural; savarese2019infinite).
The primal problem is equivalent to the following optimization problem:
$$p^* = \min_{\|w_{1j}\|_2 \le 1}\ \min_{W_2} \mathcal{L}\big((XW_1)_+ W_2,\, Y\big) + \beta\sum_{j=1}^m \|w_{2j}\|_2.$$
Proof: Note that for any $\alpha_j > 0$, we can re-scale the parameters as $w_{1j} \to \alpha_j w_{1j}$, $w_{2j} \to w_{2j}/\alpha_j$. Noting that the network output is unchanged by this re-scaling scheme, since $(X \alpha_j w_{1j})_+ w_{2j}^\top/\alpha_j = (X w_{1j})_+ w_{2j}^\top$, we have the equivalent problem
$$p^* = \min_{\alpha_j > 0}\ \min_{W_1, W_2} \mathcal{L}\big((XW_1)_+ W_2,\, Y\big) + \frac{\beta}{2}\sum_{j=1}^m \left(\alpha_j^2\|w_{1j}\|_2^2 + \alpha_j^{-2}\|w_{2j}\|_2^2\right).$$
Minimizing with respect to $\alpha_j$ (the minimum is attained at $\alpha_j^2 = \|w_{2j}\|_2/\|w_{1j}\|_2$), we thus end up with
$$p^* = \min_{W_1, W_2} \mathcal{L}\big((XW_1)_+ W_2,\, Y\big) + \beta\sum_{j=1}^m \|w_{1j}\|_2\|w_{2j}\|_2.$$
We can thus set $\|w_{1j}\|_2 = 1$ without loss of generality, absorbing the scale into $w_{2j}$. Further, relaxing this constraint to $\|w_{1j}\|_2 \le 1$ does not change the optimal solution. In particular, for the problem
$$\min_{\|w_{1j}\|_2 \le 1}\ \min_{W_2} \mathcal{L}\big((XW_1)_+ W_2,\, Y\big) + \beta\sum_{j=1}^m \|w_{2j}\|_2,$$
the constraint will be active for all $j$ with non-zero $w_{2j}$: a $w_{1j}$ with $\|w_{1j}\|_2 < 1$ could be re-scaled to unit norm while shrinking $w_{2j}$ accordingly, which only decreases the objective. Thus, relaxing the constraint will not change the objective. This proves the Lemma.
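The $\alpha_j$-minimization step in the Lemma can be verified numerically: for scalars $a, b > 0$ standing in for $\|w_{1j}\|_2$ and $\|w_{2j}\|_2$, the minimum of $\tfrac{1}{2}(\alpha^2 a^2 + \alpha^{-2} b^2)$ over $\alpha > 0$ equals the product $ab$, by the AM-GM inequality.

```python
import numpy as np

a, b = 1.7, 0.4  # stand-ins for ||w1j||_2 and ||w2j||_2
alphas = np.linspace(0.05, 5.0, 200000)
vals = 0.5 * (alphas**2 * a**2 + b**2 / alphas**2)
# The minimized rescaled penalty equals the product of the two norms
print(np.isclose(vals.min(), a * b, atol=1e-4))
```

This is exactly why weight decay on both layers turns into a product-of-norms penalty, which in turn yields the group-lasso-style sum of second-layer norms once the first layer is constrained to the unit ball.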
Now, we are ready to prove the first part of Theorem 1, i.e. the equivalence to the semi-infinite program (11).
For all $m \ge m^*$, the primal neural network training problem (25) has a strong dual, in the form of
$$d^* = \max_{V \in \mathbb{R}^{n \times c}} -\mathcal{L}^*(-V) \quad \text{s.t.} \quad \max_{\|u\|_2 \le 1} \|V^\top (Xu)_+\|_2 \le \beta,$$
where $\mathcal{L}^*(\cdot)$ denotes the Fenchel conjugate of $\mathcal{L}(\cdot, Y)$.
Proof: We first form the Lagrangian of the primal problem, by first re-parameterizing the problem as
$$p^* = \min_{\|w_{1j}\|_2 \le 1,\, W_2,\, R} \mathcal{L}(R, Y) + \beta\sum_{j=1}^m \|w_{2j}\|_2 \quad \text{s.t.} \quad R = (XW_1)_+ W_2,$$
and then forming the Lagrangian as
$$p^* = \min_{\|w_{1j}\|_2 \le 1,\, W_2,\, R}\ \max_{V \in \mathbb{R}^{n \times c}} \mathcal{L}(R, Y) + \beta\sum_{j=1}^m \|w_{2j}\|_2 + \big\langle V,\, R - (XW_1)_+ W_2 \big\rangle.$$
By Sion's minimax theorem, we can switch the inner maximum and minimum, and minimize over $R$ and $W_2$. Minimizing over $R$ yields $-\mathcal{L}^*(-V)$, while minimizing over each $w_{2j}$ yields the constraint $\|V^\top (X w_{1j})_+\|_2 \le \beta$. This produces the following problem:
$$\min_{\|w_{1j}\|_2 \le 1}\ \max_{V:\ \|V^\top (X w_{1j})_+\|_2 \le \beta\ \forall j} -\mathcal{L}^*(-V).$$
We then simply need to interchange the outer min and the max to obtain the desired form. Note that this interchange does not change the objective value due to semi-infinite strong duality. In particular, for any $\beta > 0$, this problem is strictly feasible (simply let $V = 0$) and the objective value is bounded. Then, by Theorem 2.2 of (shapiro2009semi), we know that strong duality holds, and $p^* = d^*$.
Furthermore, by (shapiro2009semi), for a signed $\mathbb{R}^c$-valued measure $\mu$ supported on $\mathcal{B}_2$, we obtain the following strong dual of the dual program (11):
$$d^* = \min_{\mu} \mathcal{L}\Big(\int_{\mathcal{B}_2} (Xu)_+\, d\mu(u)^\top,\, Y\Big) + \beta\,\|\mu\|_{TV},$$
where $\mathcal{B}_2 := \{u \in \mathbb{R}^d : \|u\|_2 \le 1\}$ defines the unit $\ell_2$-ball. By discretization arguments in Section 3 of (shapiro2009semi), and by Helly's theorem, there exists some $m^*$ such that this is equivalent to
$$d^* = \min_{u_j \in \mathcal{B}_2,\, w_{2j}} \mathcal{L}\Big(\sum_{j=1}^{m^*} (Xu_j)_+ w_{2j}^\top,\, Y\Big) + \beta\sum_{j=1}^{m^*} \|w_{2j}\|_2.$$
Minimizing with respect to $w_{2j}$, and then with respect to $u_j$, we obtain the finite parameterization: the result is exactly the rescaled primal problem (25) restricted to $m^*$ neurons.
This proves that the semi-infinite dual admits a finitely supported solution with at most $m^*$ non-zero neurons. Thus, if the number of neurons of the primal problem satisfies $m \ge m^*$, strong duality holds. Now, we seek to show the second part of Theorem 1, namely the equivalence to (12). Starting from (11), we have that the dual constraint is given by
$$\max_{\|u\|_2 \le 1} \|V^\top (Xu)_+\|_2 \le \beta.$$
Using the concept of the dual norm, we can introduce a variable $g \in \mathbb{R}^c$ with $\|g\|_2 \le 1$ to further re-express this constraint as
$$\max_{\|u\|_2 \le 1,\, \|g\|_2 \le 1} g^\top V^\top (Xu)_+ \le \beta.$$
Then, enumerating over the sign patterns $D_i = \mathrm{diag}\big(\mathbb{1}[Xu \ge 0]\big)$, $i \in [P]$, we have
$$\max_{i \in [P]}\ \max_{\substack{\|u\|_2 \le 1,\, \|g\|_2 \le 1 \\ (2D_i - I_n)Xu \ge 0}} g^\top V^\top D_i X u \le \beta.$$
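In practice the sign patterns $D_i$ can be collected by sampling random directions $u$ and recording the distinct ReLU activation patterns of $Xu$; this sampling heuristic is our own illustration (for data of fixed rank, exhaustive enumeration is also possible in polynomial time).

```python
import numpy as np

def sample_sign_patterns(X, n_samples=1000, seed=0):
    """Sample distinct ReLU activation patterns D_i = diag(1[Xu >= 0])."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    patterns = set()
    for _ in range(n_samples):
        u = rng.normal(size=d)
        patterns.add(tuple((X @ u >= 0).astype(int)))
    return [np.diag(np.array(p)) for p in patterns]

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
Ds = sample_sign_patterns(X)
# Each sampled pattern is a diagonal 0/1 matrix, as required by the constraint set
print(all(set(np.diag(D)).issubset({0, 1}) for D in Ds))
```

For $n$ data points in $d$ dimensions in general position, the number of distinct patterns $P$ is polynomial in $n$ for fixed $d$, which is what makes the enumeration over $i \in [P]$ tractable for fixed-rank data.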
Now, we express this in terms of an inner product:
$$\max_{i \in [P]}\ \max_{\substack{\|u\|_2 \le 1,\, \|g\|_2 \le 1 \\ (2D_i - I_n)Xu \ge 0}} \big\langle V,\, D_i X u g^\top \big\rangle \le \beta.$$
Now, we can take the convex hull of the constraint set $\{u g^\top : \|u\|_2 \le 1,\, \|g\|_2 \le 1,\, (2D_i - I_n)Xu \ge 0\}$, noting that since the objective is affine in $u g^\top$, this does not change the objective value.