1 Introduction
In this paper, we analyze vector-output two-layer ReLU neural networks from an optimization perspective. These networks, while simple, are the building blocks of deep networks, which have been found to perform tremendously well on a variety of tasks. We find that vector-output networks regularized with standard weight decay have a convex semi-infinite strong dual: a convex program with infinitely many constraints. However, this strong dual has a finite parameterization, though expressing this parameterization is nontrivial. In particular, we find that expressing a vector-output neural network as a convex program requires taking the convex hull of completely positive matrices. Thus, we find an intimate, novel connection between neural network training and copositive programs, i.e. programs over the set of completely positive matrices (anjos2011handbook). We describe algorithms which find the global minimum of the neural network training problem in polynomial time for data matrices of fixed rank, a setting which holds for convolutional architectures. We also demonstrate that under certain conditions we can provably find the optimal solution to the neural network training problem using soft-thresholded Singular Value Decomposition (SVD). In the general case, we introduce a relaxation to parameterize the neural network training problem, which in practice we find to be tight in many circumstances.
1.1 Related work
Our analysis focuses on the optima of finite-width neural networks. This approach contrasts with approaches which analyze infinite-width neural networks, such as the Neural Tangent Kernel (jacot2018neural). Despite advancements in this direction, infinite-width neural networks do not exactly correspond to their finite-width counterparts, and thus this method of analysis is insufficient for fully explaining their success (arora2019exact).
Other works optimize neural networks under assumptions on the data distribution. Of particular interest is (ge2018learning)
, which demonstrates that a polynomial number of samples generated from a planted neural network model is sufficient for extracting its parameters using tensor methods, assuming the inputs are drawn from a symmetric distribution. If the input distribution to a simple convolutional neural network with one filter is Gaussian, it has also been shown that gradient descent can find the global optimum in polynomial time
(brutzkus2017globally). In contrast to these works, we seek general principles for learning two-layer ReLU networks, regardless of the data distribution and without planted model assumptions. Another line of work aims to understand the success of neural networks via implicit regularization, which analyzes how models trained with Stochastic Gradient Descent (SGD) find solutions which generalize well, even without explicit control of the optimization objective (gunasekar2017implicit; neyshabur2014search). In contrast, we consider the setting of explicit regularization, which is often used in practice in the form of weight decay. Weight decay regularizes the sum of squared norms of the network weights with a single regularization parameter, and can be critical for neural network performance (golatkar2019time).
Our approach of analyzing finitewidth neural networks with a fixed training dataset has been explored for networks with a scalar output (pilanci2020neural; ergen2020aistats; ergen2020training). In fact, our work here can be considered a generalization of these results. We consider a ReLUactivation twolayer network with neurons:
(1) 
where the function denotes the ReLU activation, are the first-layer weights of the network, and are the second-layer weights. In the scalar-output case, the second-layer weights are scalars. pilanci2020neural find that the neural network training problem in this setting corresponds to a finite-dimensional convex program.
However, the setting of scalar-output networks is limited. In particular, it cannot account for tasks such as multi-class classification or multi-dimensional regression, which are among the most common uses of neural networks. In contrast, the vector-output setting is quite general, and even greedily training and stacking such shallow vector-output networks can match or even exceed the performance of deeper networks on large classification datasets (belilovsky2019greedy). We find that extending the scalar case to the vector-output case is exceedingly nontrivial and generates novel insights. Thus, generalizing the results of pilanci2020neural is an important step toward a more complete understanding of the behavior of neural networks in practice.
Certain works have also considered technical problems which arise in our analysis, though their applications are entirely different. Among these is analysis of cone-constrained PCA, as explored by deshpande2014cone and asteris2014nonnegative. They consider the following optimization problem
(2) 
This problem is in general NP-hard. asteris2014nonnegative provide an algorithm which finds the exact solution to (2) for a symmetric matrix, with running time exponential in general. We leverage this result to show that the optimal value of the vector-output neural network training problem can be found in the worst case in exponential time, while in the case of a fixed-rank data matrix our algorithm is polynomial-time. In particular, convolutional networks with fixed filter sizes (e.g., convolutional kernels) correspond to the fixed-rank data case. In search of a polynomial-time approximation to (2), deshpande2014cone evaluate a relaxation of the above problem, given as
(3) 
While the relaxation is not tight in all cases, the authors find that in practice it works quite well for approximating the solution to the original optimization problem. This relaxation corresponds to what we call a copositive relaxation, because it consists of a relaxation of the set . When and the norm constraint is removed, this is the set of completely positive matrices (dur2010copositive). Optimizing over the set of completely positive matrices is NP-hard, as is optimizing over its convex hull:
Thus, optimizing over is a convex optimization problem which is nevertheless NP-hard. Various relaxations have been proposed, such as the copositive relaxation used by deshpande2014cone above:
In fact, this relaxation is tight, given that and (burer2015gentle; kogan1993characterization). However, for , so the copositive relaxation provides a lower bound in the general case. These theoretical results prove insightful for understanding the neural network training objective.
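To make the cone-constrained PCA problem (2) concrete, the following is a minimal sketch of an exact brute-force solver for the special case of nonnegative PCA, maximizing u^T A u subject to ||u||_2 = 1 and u >= 0. All function and variable names here are ours, and this is a simplified stand-in for the exact algorithm of asteris2014nonnegative, not their method:

```python
import numpy as np
from itertools import combinations

def nonneg_pca_bruteforce(A):
    """Exact nonnegative PCA: max u^T A u  s.t.  ||u||_2 = 1, u >= 0.

    At an optimum with support S, stationarity implies u_S is an
    eigenvector of the principal submatrix A[S, S]; enumerating all
    supports and eigenvectors is exponential in n but exact.
    """
    n = A.shape[0]
    best_val, best_u = -np.inf, None
    for k in range(1, n + 1):
        for S in combinations(range(n), k):
            S = list(S)
            # eigh returns real eigenpairs of the symmetric submatrix
            vals, vecs = np.linalg.eigh(A[np.ix_(S, S)])
            for i in range(len(S)):
                v = vecs[:, i]
                if np.all(v <= 1e-12):   # flip sign if entirely nonpositive
                    v = -v
                if np.all(v >= -1e-12):  # feasible: nonnegative direction
                    u = np.zeros(n)
                    u[S] = np.clip(v, 0.0, None)
                    u /= np.linalg.norm(u)
                    val = u @ A @ u
                    if val > best_val:
                        best_val, best_u = val, u
    return best_val, best_u

# With A = [[1, -2], [-2, 1]], the unconstrained top eigenvalue is 3
# (direction (1, -1)/sqrt(2)), but nonnegativity forces the optimum
# onto a single coordinate, giving value 1.
A = np.array([[1.0, -2.0], [-2.0, 1.0]])
val, u = nonneg_pca_bruteforce(A)
print(val)  # -> 1.0
```

The enumeration over supports makes this method exponential in the dimension, mirroring the worst-case hardness discussed above.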
1.2 Contributions

We find the semi-infinite convex strong dual of the vector-output two-layer ReLU neural network training problem, and prove that it has a finite-dimensional exact convex optimization representation.

We establish a new connection between vector-output neural networks, copositive programs, and cone-constrained PCA problems, yielding new insights into the nature of vector-output neural network training which extend the results of the scalar-output case.

We provide methods that globally solve the vector-output neural network training problem in polynomial time for data matrices of a fixed rank; for the full-rank case, the complexity is necessarily exponential in , assuming .

We provide conditions on the training data and labels under which we can find a closed-form expression for the optimal weights of a vector-output ReLU neural network using soft-thresholded SVD.

We propose a copositive relaxation to establish a heuristic for solving the neural network training problem. This copositive relaxation is often tight in practice.
2 Preliminaries
In this work, we consider fitting labels from inputs with a two-layer neural network with ReLU activation and neurons in the hidden layer. This network is trained with weight-decay regularization on all of its weights, with associated parameter
. For some general loss function
, this gives us the nonconvex primal optimization problem
(4)
In the simplest case, with a fully-connected network trained with squared loss (Appendix A.6 contains extensions to general convex loss functions), this becomes:
(5) 
However, alternative models can be considered. For example, ergen2020training consider two-layer CNNs with global average pooling, for which we can define the patch matrices , the patches upon which individual convolutions operate. Then, the vector-output neural network training problem with global average pooling becomes
(6) 
We will show that in this convolutional setting, because the rank of the set of patches
cannot exceed the filter size of the convolutions, there exists an algorithm, polynomial in all problem dimensions, which finds the global optimum of this problem. We note that such patch matrices typically exhibit rapid singular value decay due to spatial correlations, which may also motivate replacing them with approximations of much smaller rank.
In the following section, we will demonstrate that the vector-output neural network problem has a convex semi-infinite strong dual. To understand how to parameterize this semi-infinite dual in a finite fashion, we must introduce the concept of hyperplane arrangements. We consider the set of diagonal matrices
This is a finite set of diagonal matrices, dependent on the data matrix , which encode the possible activation patterns of the ReLU nonlinearity: a value of 1 indicates that the neuron is active, while 0 indicates that it is inactive. In particular, we can enumerate the set of sign patterns as , where depends on but is in general bounded by
for (pilanci2020neural; stanley2004introduction). Note that for a fixed rank , such as in the convolutional case above, is polynomial in . Using these sign patterns, we can completely characterize the range space of the first layer after the ReLU:
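These sign patterns can be probed numerically by sampling random directions u and recording the indicator patterns 1[Xu >= 0]. A hedged sketch (all names ours; random sampling can only lower-bound the true number of patterns, which for data in general position is capped by the classical hyperplane-arrangement count):

```python
import numpy as np
from math import comb

def sample_sign_patterns(X, n_samples=10000, seed=0):
    """Sample hyperplane-arrangement patterns 1[Xu >= 0] for random u."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    patterns = set()
    for _ in range(n_samples):
        u = rng.standard_normal(d)
        patterns.add(tuple((X @ u >= 0).astype(int)))
    return patterns

n, d = 8, 2
X = np.random.default_rng(1).standard_normal((n, d))
P = sample_sign_patterns(X)
# Classical bound on the number of regions cut out by n hyperplanes
# through the origin in R^d: 2 * sum_{k=0}^{d-1} C(n-1, k)
bound = 2 * sum(comb(n - 1, k) for k in range(d))
print(len(P), bound)
```

For fixed d this bound is polynomial in n, which is the source of the fixed-rank polynomial-time results discussed throughout the paper.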
We also introduce a class of data matrices for which the analysis of scalar-output neural networks simplifies greatly, as shown in (ergen2020convex). These matrices are called spike-free matrices. In particular, a matrix is called spike-free if it holds that
(7) 
When is spike-free, the set of sign patterns reduces to a single sign pattern, , because of the identity in (7). The set of spike-free matrices includes (but is not limited to) diagonal matrices and whitened matrices for which , such as the output of Zero-phase Component Analysis (ZCA) whitening. Whitening the data matrix has been shown to improve the performance of neural networks, even in deeper settings where the whitening transformation is applied to batches of data at each layer (huang2018decorrelated). We will see that spike-free matrices admit algorithms, polynomial in both and , for finding the global optimum of the neural network training problem (ergen2020convex), though the same does not hold for vector-output networks.
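As a quick numerical illustration of the whitened case, the following sketch (our own construction) builds a ZCA-style row whitening of a random data matrix and checks the identity XX^T = I; this identity is our reading of the elided whitening condition, valid when the number of samples is at most the dimension and the data has full row rank:

```python
import numpy as np

def zca_whiten_rows(X):
    """Row whitening via W = (X X^T)^{-1/2}, so that the whitened
    matrix Xw = W X satisfies Xw Xw^T = I (requires n <= d and full
    row rank)."""
    G = X @ X.T                                 # n x n Gram matrix
    vals, vecs = np.linalg.eigh(G)              # symmetric eigendecomposition
    W = vecs @ np.diag(vals ** -0.5) @ vecs.T   # inverse matrix square root
    return W @ X

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 20))                # n=5 samples, d=20 features
Xw = zca_whiten_rows(X)
print(np.allclose(Xw @ Xw.T, np.eye(5)))        # -> True
```

The inverse square root is computed through the eigendecomposition of the Gram matrix, which is the standard construction behind ZCA-type transforms.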
2.1 Warm-Up: Scalar-Output Networks
We first present strong duality results for the scalar-output case, i.e. the case where .
Theorem (pilanci2020neural) There exists an such that if , for all , the neural network training problem (5) has a convex semi-infinite strong dual, given by
(8) 
Furthermore, the neural network training problem has a convex, finitedimensional strong bidual, given by
(9) 
This is a convex program with variables and linear inequalities. Solving this problem with standard interior point solvers thus has a complexity of , which is exponential in , but polynomial in for a fixed rank.
In the case of a spike-free , however, the dual problem simplifies to a single sign pattern constraint . Then the convex strong bidual becomes (ergen2020convex)
(10) 
This convex problem has a much simpler form, with only linear inequality constraints and variables, and therefore a complexity of . We will see that these scalar-output results are a special case of the vector-output results.
3 Strong Duality
3.1 Convex Semi-infinite Duality
Theorem 1
There exists an such that if , for all , the neural network training problem (5) has a convex semi-infinite strong dual, given by
(11) 
Furthermore, the neural network training problem has a convex, finitedimensional strong bidual, given by
(12) 
for convex sets
(13) 
The strong dual given in (11) is convex, albeit with infinitely many constraints. In contrast, (12) is a convex problem with finitely many constraints. This convex model learns a sparse set of locally linear models
which are constrained to lie in a convex set, for which group sparsity and low-rankness over hyperplane arrangements are induced by the sum-of-nuclear-norms penalty. The emergence of the nuclear norm penalty is particularly interesting, since similar norms have also been used for rank minimization problems
(candes2010power; recht2010guaranteed), proposed as implicit regularizers for matrix factorization models (gunasekar2017implicit), and are similar to nuclear norm regularization in multi-task learning (argyriou2008convex; abernethy2009new) and trace Lasso (grave2011trace). We note the similarity of this result to that of pilanci2020neural, whose formulation is a special case of this result with , where the sets reduce to a simpler form, from which we can obtain the convex program presented by pilanci2020neural. Further, this result extends to CNNs with global average pooling, which is discussed in Appendix A.3.2.
Remark 1.1
It is interesting to observe that the convex program (12) can be interpreted as a piecewise low-rank model that is partitioned according to the set of hyperplane arrangements of the data matrix. In other words, a two-layer ReLU network with vector output is precisely a linear learner over the features , where convex constraints and group nuclear norm regularization are applied to the linear model weights. In the case of CNNs, the piecewise low-rank model is over the smaller-dimensional patch matrices , which yield significantly fewer hyperplane arrangements and, therefore, fewer local low-rank models.
3.2 Provably Solving the Neural Network Training Problem
In this section, we present a procedure for minimizing the convex program (12) for general output dimension . This procedure relies on Algorithm 5 for cone-constrained PCA from (asteris2014nonnegative), and the Frank-Wolfe algorithm for constrained convex optimization (frank1956algorithm). Unlike SGD, which is a heuristic method applied to a nonconvex training problem, this approach is built upon results from convex optimization and provably finds the global minimum of the objective. In particular, we can solve the problem in epigraph form,
(14) 
where we can perform bisection over in an outer loop to determine the overall optimal value of (12). Then, we have the following algorithm to solve the inner minimization problem of (14):
Algorithm 1:

Initialize .

For steps :

For each solve the following subproblem:
And define the pairs to be the argmaxes of the above subproblems. This is a form of semi-nonnegative matrix factorization (semi-NMF) on the residual at step (ding2008convex). It can be solved via cone-constrained PCA in time, where .

For the semi-NMF factorization obtaining the largest objective value, , form . For all other , simply let .

For step size , update

The derivations of the method and complexity of Algorithm 1 are found in Appendix A.4. We have thus described a Frank-Wolfe algorithm which provably minimizes the convex dual problem, where each step requires a semi-NMF operation, which can be performed in time.
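To illustrate the Frank-Wolfe template that Algorithm 1 instantiates, here is a minimal sketch on a simplified surrogate: minimizing a squared loss over a nuclear-norm ball, where the linear minimization oracle is a top singular pair of the gradient. In Algorithm 1 proper, the oracle is the cone-constrained semi-NMF step instead; we omit the cone constraints here, and all names are ours:

```python
import numpy as np

def frank_wolfe_nuclear(X, Y, t, steps=500):
    """Frank-Wolfe for  min_V ||X V - Y||_F^2  s.t.  ||V||_* <= t.

    The linear minimization oracle over the nuclear-norm ball is
    -t * u1 v1^T, where (u1, v1) is the top singular pair of the
    gradient."""
    n, d = X.shape
    c = Y.shape[1]
    V = np.zeros((d, c))
    for k in range(steps):
        grad = 2 * X.T @ (X @ V - Y)
        U, s, Vt = np.linalg.svd(grad, full_matrices=False)
        S = -t * np.outer(U[:, 0], Vt[0, :])  # extreme point minimizing <grad, .>
        gamma = 2.0 / (k + 2.0)               # standard Frank-Wolfe step size
        V = (1 - gamma) * V + gamma * S
    return V

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 10))
V_true = np.outer(rng.standard_normal(10), rng.standard_normal(4))  # rank-1 target
Y = X @ V_true
V = frank_wolfe_nuclear(X, Y, t=np.linalg.svd(V_true, compute_uv=False).sum())
# Relative residual; shrinks toward 0 as the number of steps grows.
print(np.linalg.norm(X @ V - Y) / np.linalg.norm(Y))
```

Each iteration adds one rank-one atom, mirroring how Algorithm 1 activates one sign pattern and one rank-one factor per step.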
3.3 Spikefree Data Matrices and ClosedForm Solutions
As discussed in Section 2, if is spike-free, the set of sign patterns is reduced to the single pattern . Then, the convex program (12) becomes
(15) 
This problem can also be solved with Algorithm 1. However, the asymptotic complexity of this algorithm is unchanged, due to the cone-constrained PCA step. If the constraint on were removed, (15) would be identical to optimizing a linear-activation network. However, the additional cone constraint on allows for a more complex representation, which demonstrates that even in the spike-free case, a ReLU-activation network is quite different from a linear-activation network.
Recalling that whitened data matrices where are spike-free, we can find a closed-form expression for the optimal weights for a further simplified class of data and label matrices.
Theorem 2
Consider a whitened data matrix where , and labels with SVD . If the left-singular vectors of satisfy , there exists a closed-form solution for the optimal in problem (15), given by
(16) 
The resulting model is a soft-thresholded SVD of , which arises as the solution of maximum-margin matrix factorization (srebro2005maximum). The scenario that all of the left-singular vectors of satisfy the affine constraints occurs, for example, when all of the left-singular vectors are nonnegative, which is the case when
is a one-hot-encoded matrix. In this scenario, where the left-singular vectors of
satisfy , we note that the ReLU constraint on is not active, and therefore the solution of the ReLU-activation network training problem is identical to that of the linear-activation network. This linear-activation setting has been well studied, such as in matrix factorization models by (cabral2013unifying; li2017geometry), and in the context of the implicit bias of dropout by (mianjy2018implicit; mianjy2019dropout). This theorem thus provides a setting in which the ReLU-activation neural network performs identically to a linear-activation network.
4 Neural Networks and Copositive Programs
4.1 An equivalent copositive program
We now present an alternative representation of the neural network training problem with squared loss, which has ties to copositive programming.
Theorem 3
For all , the neural network training problem (5) has a convex strong dual, given by
(17) 
for convex sets , given by
(18) 
This is a minimization problem with a convex objective over sets of convex, completely positive cones: a copositive program, which is NP-hard. There exists a cutting plane algorithm which solves this problem in , which is polynomial in for data matrices of rank (see Appendix A.5). This formulation provides a framework for viewing ReLU neural networks as implicit copositive programs, and we can find conditions under which certain relaxations provide optimal solutions.
4.2 A copositive relaxation
We consider the copositive relaxation of the sets from (17). We denote this set
In general, , with equality when (kogan1993characterization; dickinson2013copositive). We define the relaxed program as
(19) 
Because of the enumeration over sign patterns, this relaxed program still has a complexity of to solve, and thus does not improve upon the asymptotic complexity presented in Section 3.
4.3 Spike-free data matrices
If is restricted to be spike-free, the convex program (19) becomes
(20) 
With spike-free data matrices, the copositive relaxation provides a heuristic algorithm for the neural network training problem which is polynomial in both and . This contrasts with the exact formulations (12) and (17), for which the neural network training problem is exponential even for a spike-free . Table 1 summarizes the complexities of the neural network training problem.
Scalar-output  Vector-output (exact)  Vector-output (relaxation)

Spike-free
General
5 Experiments
5.1 Does SGD always find the global optimum for neural networks?
While SGD applied to the nonconvex neural network training objective is a heuristic which works quite well in many cases, there may exist pathological cases where SGD fails to find the global minimum. Using Algorithm 1, we can now verify whether SGD finds the global optimum. In this experiment, we present one case where SGD has trouble finding the optimal solution in certain circumstances. In particular, we generate random inputs , where the elements of
are drawn from an i.i.d. standard Gaussian distribution:
. We then randomly initialize a data-generating neural network with 100 hidden neurons and an output dimension of 5, and generate labels using this model. We then attempt to fit these labels using a neural network with squared loss and . We compare the results of training this network for 5 trials with 10 and 50 neurons to the global optimum found by Algorithm 1. In this circumstance, with 10 neurons, none of the realizations of SGD converge to the global optimum found by Algorithm 1, but with 50 neurons, the loss is nearly identical to that found by Algorithm 1.
5.2 Maximum-Margin Matrix Factorization
In this section, we evaluate the performance of the soft-thresholded SVD closed-form solution presented in Theorem 2. To evaluate this method, we take a subset of 3000 points from the CIFAR10 and CIFAR100 datasets
(krizhevsky2009learning). For each dataset, we first demean the data matrix , then whiten it using ZCA whitening. We seek to fit one-hot-encoded labels representing the class labels of these datasets. We compare the results of our algorithm from Theorem 2 and that of SGD in Fig. 6. As we can see, in both cases the soft-thresholded SVD solution finds the same value as SGD, yet in far less time. Appendix A.1.4 contains further details about this experiment, including test accuracy plots.
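The closed form evaluated here is a soft-thresholded SVD. As a sketch of the underlying operation (the precise constants of Theorem 2 are elided in our copy, so we show the generic proximal form min_Z 0.5||Z - Y||_F^2 + beta ||Z||_*, with names ours):

```python
import numpy as np

def soft_thresholded_svd(Y, beta):
    """Closed-form solution of  min_Z 0.5*||Z - Y||_F^2 + beta*||Z||_* :
    soft-threshold the singular values of Y by beta."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - beta, 0.0)) @ Vt

rng = np.random.default_rng(0)
Y = rng.standard_normal((8, 5))
beta = 0.5
Z = soft_thresholded_svd(Y, beta)

def objective(Z_):
    return 0.5 * np.linalg.norm(Z_ - Y) ** 2 \
        + beta * np.linalg.svd(Z_, compute_uv=False).sum()

# The closed form is the unique minimizer, so it beats perturbations of itself.
for _ in range(5):
    assert objective(Z) <= objective(Z + 1e-3 * rng.standard_normal(Z.shape))
```

Since singular value shrinkage is the proximal operator of the nuclear norm, the whole fit reduces to one SVD, which is why it runs in a fraction of a second in the experiments above.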
5.3 Effectiveness of the Copositive Program
In this section, we compare the objective values obtained by SGD, Algorithm 1, and the copositive program defined in (17). We use an artificially-generated spiral dataset, with and 3 classes (see Fig. 9(a) for an illustration). In this case, since , we note that the copositive relaxation (19) is tight. Across different values of , we compare the solutions found by these three methods. As shown in Fig. 9, the copositive relaxation, the solution found by SGD, and the solution found by Algorithm 1 all coincide, obtaining the same loss across various values of . This empirically verifies the equivalence of (5), (12), and (19).
6 Conclusion
We studied the vector-output ReLU neural network training problem, and designed the first algorithms for finding the global optimum of this problem which are polynomial-time in the number of samples for a fixed data rank. We found novel connections between this vector-output ReLU neural network problem and a variety of other problems, including semi-NMF, cone-constrained PCA, soft-thresholded SVD, and copositive programming. These connections yield insights into the neural network training problem and leave room for further exploration. Of particular interest is extending these results to deeper networks, which would further explain the performance of neural networks as they are often used in practice. One way to extend the results in this paper to deeper networks is to greedily train and stack two-layer networks to create one deeper network, which has been shown to mimic the performance of deep networks trained end-to-end. An alternative approach would be to analyze deeper networks directly, which would require a hierarchical representation of sign patterns at each hidden layer and would complicate our analysis significantly. Some preliminary results on convex program equivalents of deeper training problems are presented under whitened input data assumptions in (ergen2020convexdeep). Another interesting research direction is investigating efficient relaxations of our vector-output convex programs for larger-scale simulations. Convex relaxations for scalar-output ReLU networks with approximation guarantees were studied in (bartan2019convex; ergen2019convexshallow; ergen2019convexcutting; d2020global). Furthermore, the landscapes of vector-output neural networks and the dynamics of gradient-descent-type methods can be analyzed by leveraging our results. In (lacotte2020all), an analysis of the landscape for scalar-output networks based on the convex formulation was given, which establishes a direct mapping between the nonconvex and convex objective landscapes.
Finally, our copositive programming and semi-NMF representations of ReLU networks can be used to develop more interpretable neural models. An investigation of scalar-output convex neural models for neural image reconstruction was given in (sahiner2020convex). Future works can explore these interesting and novel directions.
References
Appendix A Appendix
A.1 Additional experimental details
All neural networks in the experiments were trained using the PyTorch deep learning library
(NEURIPS2019_9015), using a single NVIDIA GeForce GTX 1080 Ti GPU. Algorithm 1 was run on a CPU with 256 GB of RAM, as was the maximum-margin matrix factorization. Unless otherwise stated, the Frank-Wolfe method from Algorithm 1 used a step size of , and all methods were trained to minimize squared loss. Unless otherwise stated, the neural networks were trained until full training loss convergence with SGD with a momentum parameter of 0.95 and a batch size equal to the size of the training set (i.e. full-batch gradient descent), and the learning rate was reduced by a factor of 2 whenever the training loss reached a plateau. The initial learning rate was set as high as possible without causing the training to diverge. All neural networks were initialized with Kaiming uniform initialization (he2015delving).
A.1.1 Additional Experiment: Copositive relaxation when
The copositive relaxation for the neural network training problem described in (19) is not guaranteed to exactly correspond to the objective when . However, we find that in practice, this relaxation is tight even in such settings. To demonstrate such an instance, we consider the problem of generating images from noise.
In particular, we initialize elementwise from an i.i.d. standard Gaussian distribution. To analyze the spike-free setting, we whiten using ZCA whitening. Then, we attempt to fit images
from the MNIST handwritten digits dataset
(lecun1998gradient) and CIFAR10 dataset (krizhevsky2009learning), respectively. From each dataset, we select 100 random images with 10 samples from each class and flatten them into vectors, to form and . We allow the noise inputs to have the same shape as the output. Clearly, in these cases, with and respectively, the copositive relaxation (20) is not guaranteed to correspond exactly to the neural network training optimum. However, we find, across a variety of regularization parameters , that the solution found by SGD and this copositive relaxation correspond exactly, as demonstrated in Figure 12.
While for the lowest value of , the copositive relaxation does not exactly correspond with the value obtained by SGD, we note that we showed the objective value of the copositive relaxation to be a lower bound of the neural network training objective, meaning that the differences seen in this plot are likely due to a numerical optimization issue rather than a fundamental one. Nonconvex SGD was trained for 60,000 epochs with 1000 neurons with a learning rate of
, while the copositive relaxation was trained using Adam (kingma2014adam) with the Geotorch library for constrained and manifold optimization for deep learning in PyTorch, which allowed us to express the PSD constraint, with an additional hinge loss to penalize violations of the affine constraints. The copositive relaxation was trained for 60,000 epochs with a learning rate of for CIFAR10 and for MNIST, and , and as parameters for Adam.
A.1.2 Additional Experiment: Comparing ReLU Activation to Linear Activation in the Case of Spike-Free Matrices
As discussed in Section 3.3, if the data matrix is spike-free, the resulting convex ReLU model (15) is similar to a linear-activation network, with the only difference being an additional cone constraint on the weight matrix . One may wonder whether, in the case of spike-free data matrices, the use of a ReLU network is necessary at all, or whether a linear-activation network would perform equally well.
In this experiment, we compare the performance of a ReLU-activation network to a linear-activation one, and demonstrate that even in the spike-free case there exist instances in which the ReLU-activation network is preferable. In particular, we take as our training data 3000 demeaned and ZCA-whitened images from the CIFAR10 dataset to form our spike-free training data . We then generate continuous labels from a randomly-initialized two-layer ReLU network with 4000 hidden units. We use this same label-generating neural network to generate labels for images from the full 10,000-sample test set of CIFAR10 as well, after the test images are preprocessed with the same whitening transformation used on the training data.
Across different values of , we measured the training and generalization performance of both ReLU-activation and linear-activation two-layer neural networks trained with SGD on this dataset. Both networks used 4000 hidden units, and were trained for 400 epochs with a learning rate of and momentum of . Our results are displayed in Figure 15.
As we can see, for all values of , while the linear-activation network has equal or lesser training loss than the ReLU-activation network, the ReLU-activation network generalizes significantly better, achieving orders of magnitude better test loss. We note that for values of and above, both networks learn the zero network (i.e. all weights at the optimum are zero), so their training and test losses are identical. We can also observe that the best-case test loss for the linear-activation network is achieved by simply learning the zero network, whereas for a value of the ReLU-activation network can learn to generalize better than the zero network (achieving a test loss of 63038, compared to a test loss of 125383 for the zero network).
These results demonstrate that even for spike-free data matrices, there are reasons to prefer a ReLU-activation network to a linear-activation network. In particular, because of the cone constraint on the dual weights , the ReLU network is induced to learn a more complex representation than the linear network, which would explain its better generalization performance.
The CIFAR10 dataset consists of 50,000 training images and 10,000 test images with 3 RGB channels, across 10 classes (krizhevsky2009learning)
. These images were normalized by the per-channel training set mean and standard deviation. To form our training set, we selected 3,000 training images from this dataset at random, with each class equally represented. This data was then feature-wise demeaned and transformed using ZCA. The same training class mean and ZCA transformation were then also used on the 10,000 testing points for evaluation.
A.1.3 Does SGD always find the global optimum for neural networks?
For these experiments, SGD was trained with an initial learning rate of for 20,000 epochs. We used a regularization penalty value of . The value of for Algorithm 1 was found by first starting at the value of the regularization penalty from the SGD solution, then refining this value via manual tuning. A final value of was chosen. For this experiment, there were sign patterns. Algorithm 1 was run for 30,000 iterations, and took X seconds to solve.
A.1.4 Maximum-Margin Matrix Factorization
The CIFAR10 and CIFAR100 datasets consist of 50,000 training images and 10,000 test images with 3 RGB channels, with 10 and 100 classes respectively (krizhevsky2009learning). These images were normalized by the per-channel training set mean and standard deviation. To form our training set, we selected 3,000 training images from these datasets at random, with each class equally represented. This data was then feature-wise demeaned and transformed using ZCA. The same training class mean and ZCA transformation were then also used on the 10,000 testing points for evaluation. For CIFAR10, we used a regularization parameter value of , whereas for CIFAR100, we used a value of .
SGD was trained for 400 epochs with a learning rate of , using 1000 neurons, one-hot encoded labels, and squared loss. Figure 18 displays the test accuracy of the learned networks. Surprisingly, the classifiers trained on only 3,000 whitened images generalize quite well in both settings, far exceeding the performance of the null classifier. For the CIFAR-10 experiments, the algorithm from Theorem 2 took only 0.018 seconds to solve, whereas for CIFAR-100 it took 0.36 seconds.
A.1.5 Effectiveness of the Copositive Program
For this classification problem, we use one-hot encoded labels and squared loss. For , SGD used a learning rate of , and otherwise a learning rate of . SGD was trained for 8,000 epochs with 1000 neurons, while Algorithm 1 ran for 1,000 iterations. The copositive relaxation was optimized with CVXPY using a first-order solver, with a maximum of 20,000 iterations, on a CPU with 256 GB of RAM (diamond2016cvxpy). This dataset had sign patterns. The value of for Algorithm 1 was chosen as the regularization penalty from the SGD solution.
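For intuition on why a relaxation is needed here: a matrix is completely positive if it factors as $B B^\top$ with $B$ entrywise nonnegative, and optimizing over this cone is intractable in general. A standard tractable outer approximation keeps two necessary conditions, positive semidefiniteness and entrywise nonnegativity (the doubly nonnegative cone). The numpy sketch below is our own illustration of this containment, not the paper's exact program:

```python
import numpy as np

rng = np.random.default_rng(2)
# a completely positive matrix: Z = B B^T with B entrywise nonnegative
B = rng.random((5, 3))
Z = B @ B.T

# the doubly nonnegative relaxation keeps two necessary conditions:
# (i) positive semidefiniteness, (ii) entrywise nonnegativity
eigvals = np.linalg.eigvalsh(Z)
is_psd = bool(eigvals.min() >= -1e-10)
is_nonneg = bool(Z.min() >= 0)
```

Every completely positive matrix passes both checks; the converse can fail for matrices of size 5 or larger (the two cones coincide up to size 4), which is why the relaxation can in principle be loose even though it is observed to be tight in practice.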
A.2 Note on Data Matrices of Fixed Rank
Consider the neural network training problem
(21) $\min_{\{u_j, v_j\}_{j=1}^m} \mathcal{L}\Big(\sum_{j=1}^m (X u_j)_+ v_j^\top,\, Y\Big) + \frac{\beta}{2} \sum_{j=1}^m \big(\|u_j\|_2^2 + \|v_j\|_2^2\big)$
Let $X = U \Sigma V^\top$ be the compact SVD of $X$ with rank $r$, where $U \in \mathbb{R}^{n \times r}$, $\Sigma \in \mathbb{R}^{r \times r}$, and $V \in \mathbb{R}^{d \times r}$. Let $w_j = V^\top u_j$ and $z_j = V_\perp^\top u_j$, where $V_\perp$ spans the orthogonal complement of the column space of $V$. We note that $X u_j = U \Sigma w_j$ and $\|u_j\|_2^2 = \|w_j\|_2^2 + \|z_j\|_2^2$. Then, we can reparameterize the problem as
(22) $\min_{\{w_j, z_j, v_j\}} \mathcal{L}\Big(\sum_{j=1}^m (U \Sigma w_j)_+ v_j^\top,\, Y\Big) + \frac{\beta}{2} \sum_{j=1}^m \big(\|w_j\|_2^2 + \|z_j\|_2^2 + \|v_j\|_2^2\big)$
We note that $z_j$ appears only in the regularization term. Minimizing over $z_j$ thus means simply setting it to 0. Then, we have
(23) $\min_{\{w_j, v_j\}} \mathcal{L}\Big(\sum_{j=1}^m (U \Sigma w_j)_+ v_j^\top,\, Y\Big) + \frac{\beta}{2} \sum_{j=1}^m \big(\|w_j\|_2^2 + \|v_j\|_2^2\big)$
We note that $w_j \in \mathbb{R}^r$ and $U\Sigma \in \mathbb{R}^{n \times r}$. Thus, for $X$ of rank $r$, we can effectively reduce the dimension of the neural network training problem without loss of generality. This thus holds for all results concerning the complexity of the neural network training problem with data matrices of a fixed rank.
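This dimension reduction is easy to verify numerically: a first-layer weight acts on rank-$r$ data only through its $r$-dimensional projection onto the row space of the data matrix. In the numpy sketch below (variable names and dimensions are our own), a ReLU feature computed from the full $d$-dimensional problem matches the same feature computed in the reduced $r$-dimensional parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 50, 20, 3
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))  # rank-r data matrix

# compact SVD: only r singular values are (numerically) nonzero
U, S, Vt = np.linalg.svd(X, full_matrices=False)
U, S, Vt = U[:, :r], S[:r], Vt[:r]

# a first-layer weight w enters only through Vt @ w, an r-dimensional vector
w = rng.standard_normal(d)
relu_full = np.maximum(X @ w, 0)                  # original d-dimensional problem
relu_reduced = np.maximum((U * S) @ (Vt @ w), 0)  # reduced r-dimensional problem
```

The two feature vectors agree to numerical precision, so the training problem over $d$-dimensional weights collapses to one over $r$-dimensional weights.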
A.3 Proofs
A.3.1 Proof of Theorem 1
We begin with the primal problem (5), repeated here for convenience:
(24) $\min_{\{u_j, v_j\}_{j=1}^m} \mathcal{L}\Big(\sum_{j=1}^m (X u_j)_+ v_j^\top,\, Y\Big) + \frac{\beta}{2} \sum_{j=1}^m \big(\|u_j\|_2^2 + \|v_j\|_2^2\big)$
We start by rescaling the weights to obtain a slightly different but equivalent objective, a technique used previously in (pilanci2020neural; savarese2019infinite).
Lemma 4
The primal problem is equivalent to the following optimization problem:
(25) $\min_{\{u_j, v_j\}:\ \|u_j\|_2 \le 1} \mathcal{L}\Big(\sum_{j=1}^m (X u_j)_+ v_j^\top,\, Y\Big) + \beta \sum_{j=1}^m \|v_j\|_2$
Proof: Note that for any $\alpha_j > 0$, we can rescale the parameters as $u_j \to \alpha_j u_j$, $v_j \to v_j / \alpha_j$. Noting that the network output is unchanged by this rescaling, we have the equivalent problem
(26) $\min_{\{u_j, v_j\},\ \alpha_j > 0} \mathcal{L}\Big(\sum_{j=1}^m (X u_j)_+ v_j^\top,\, Y\Big) + \frac{\beta}{2} \sum_{j=1}^m \big(\alpha_j^2 \|u_j\|_2^2 + \alpha_j^{-2} \|v_j\|_2^2\big)$
Minimizing with respect to each $\alpha_j$ (by the AM-GM inequality, the minimum of $\alpha_j^2 \|u_j\|_2^2 + \alpha_j^{-2} \|v_j\|_2^2$ is $2\|u_j\|_2 \|v_j\|_2$), we thus end up with
(27) $\min_{\{u_j, v_j\}} \mathcal{L}\Big(\sum_{j=1}^m (X u_j)_+ v_j^\top,\, Y\Big) + \beta \sum_{j=1}^m \|u_j\|_2 \|v_j\|_2$
We can thus set $\|u_j\|_2 = 1$ without loss of generality. Further, relaxing this constraint to $\|u_j\|_2 \le 1$ does not change the optimal solution. In particular, for the problem
(28) $\min_{\{u_j, v_j\}:\ \|u_j\|_2 \le 1} \mathcal{L}\Big(\sum_{j=1}^m (X u_j)_+ v_j^\top,\, Y\Big) + \beta \sum_{j=1}^m \|v_j\|_2$
the constraint will be active for all nonzero $v_j$: if $\|u_j\|_2 < 1$ with $v_j \neq 0$, the rescaling $u_j \to u_j / \|u_j\|_2$, $v_j \to \|u_j\|_2 v_j$ preserves the output and strictly decreases the penalty. Thus, relaxing the constraint will not change the objective. This proves the Lemma.
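The key step in this proof, minimizing over the rescaling, is a direct AM-GM argument: for fixed weight norms, the minimum over the positive scale of the averaged penalty equals the product of the two norms. The numpy sketch below (with arbitrary vectors of our own choosing) checks this identity on a grid:

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.standard_normal(5)
v = rng.standard_normal(7)

# scan the rescaling alpha: penalty (alpha^2 ||u||^2 + alpha^-2 ||v||^2) / 2
alphas = np.linspace(0.1, 10.0, 100_000)
penalty = 0.5 * (alphas**2 * (u @ u) + (v @ v) / alphas**2)

# AM-GM: the minimum is ||u||_2 * ||v||_2, attained at alpha^2 = ||v||_2 / ||u||_2
best = np.linalg.norm(u) * np.linalg.norm(v)
```

The grid minimum of `penalty` matches `best` to within the grid resolution, which is why the squared-norm weight decay collapses to the product-of-norms penalty in (27).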
Now, we are ready to prove the first part of Theorem 1, i.e. the equivalence to the semiinfinite program (11).
Lemma 5
For all , the primal neural network training problem (25) has a strong dual, in the form of
(29) 
Proof: We form the Lagrangian of the primal problem by first reparameterizing the problem as
(30) 
and then forming the Lagrangian as
(31) 
By Sion’s minimax theorem, we can switch the inner maximum and minimum, and minimize over and . This produces the following problem:
(32) 
We then simply need to interchange the max and min to obtain the desired form. Note that this interchange does not change the objective value, due to semi-infinite strong duality. In particular, for any , this problem is strictly feasible (simply let ) and the objective value is bounded by . Then, by Theorem 2.2 of (shapiro2009semi), we know that strong duality holds, and
(33) 
as desired.
Furthermore, by (shapiro2009semi), for a signed measure , we obtain the following strong dual of the dual program (11):
(34) 
where defines the unit ball. By discretization arguments in Section 3 of (shapiro2009semi), and by Helly’s theorem, there exists some such that this is equivalent to
(35) 
Minimizing with respect to , we obtain
(36) 
which we can minimize with respect to to obtain the finite parameterization
(37) 
This proves that the semi-infinite dual admits a solution of finite support, with at most nonzero neurons. Thus, if the number of neurons of the primal problem is at least this number, strong duality holds. Now, we seek to show the second part of Theorem 1, namely the equivalence to (12). Starting from (11), we have that the dual constraint is given by
(38) 
Using the concept of the dual norm, we can introduce a variable to further re-express this constraint as
(39) 
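The dual-norm step above rests on the variational identity $\|z\| = \max_{\|v\|_* \le 1} v^\top z$; for the Euclidean norm, which is self-dual, the maximizer is simply $z / \|z\|_2$. A small numpy check of this identity (an illustration with vectors of our own choosing):

```python
import numpy as np

rng = np.random.default_rng(3)
z = rng.standard_normal(6)

# self-dual Euclidean norm: max_{||v||_2 <= 1} v^T z = ||z||_2,
# attained at v = z / ||z||_2
v_star = z / np.linalg.norm(z)
best = v_star @ z

# random unit vectors never exceed the dual-norm value
vs = rng.standard_normal((1000, 6))
vs /= np.linalg.norm(vs, axis=1, keepdims=True)
others = vs @ z
```

Here `best` equals the Euclidean norm of `z`, and no unit vector attains a larger inner product, which is exactly what licenses replacing the norm in the constraint with a maximization over the unit ball.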
Then, enumerating over sign patterns , we have
(40) 
Now, we express this in terms of an inner product.
(41) 
Letting :
(42) 
Now, we can take the convex hull of the constraint set, noting that since the objective is affine, this does not change the objective value.