Developing efficient algorithms for learning commonly used neural network architectures continues to be a core challenge in machine learning. The underlying difficulty arises from the highly non-convex nature of the optimization problems posed by neural networks. Obtaining provable guarantees for learning even very basic architectures remains open.
In this paper we consider a simple convolutional neural network with a single filter and overlapping patches followed by average pooling (Figure 1). More formally, for an input image $x \in \mathbb{R}^n$, we consider $k$ patches of size $r$ indicated by selection matrices $P_1, \ldots, P_k \in \{0,1\}^{r \times n}$, where each matrix has exactly one 1 in each row and at most one 1 in each column. The neural network is computed as $f_w(x) = \frac{1}{k} \sum_{i=1}^{k} \sigma(w^\top P_i x)$, where $\sigma$ is the activation function and $w \in \mathbb{R}^r$ is the weight vector corresponding to the convolution filter. We focus on ReLU and leaky ReLU activation functions.
1.1 Our Contributions
The main contribution of this paper is a simple, stochastic update algorithm Convotron (Algorithm 1) for provably learning the above convolutional architecture. The algorithm has the following properties:
Works for general classes of overlapping patches and requires mild distributional conditions.
Proper recovery of the unknown weight vector.
Stochastic in nature with a “gradient-like” update step.
Requires no special/random initialization scheme or tuning of the learning rate.
Tolerates noise and succeeds in the probabilistic concept model of learning.
Logarithmic convergence in $1/\epsilon$, where $\epsilon$ is the error parameter, in the realizable setting.
This is the first efficient algorithm for learning general classes of overlapping patches (and the first algorithm for any class of patches that succeeds under mild distributional assumptions). Prior work has focused on analyzing SGD in the realizable/noiseless setting with the caveat of requiring either disjoint patches [BG17, DLT17b] with Gaussian inputs or technical conditions linking the underlying true parameters and the “closeness of patches” [DLT17a].
In contrast, our conditions depend only on the patch structure itself and can be efficiently verified. Commonly used patch structures in computer vision applications such as 1D/2D grids satisfy our conditions. Additionally, we require only that the underlying distribution on samples is symmetric and induces a covariance matrix on the patches with polynomially bounded condition number. (Brutzkus and Globerson [BG17] proved that the problem, even with disjoint patches, is NP-hard in general, and so some distributional assumption is needed for efficient learning.) All prior work handles only continuous distributions. Another major difference from prior work is that we give guarantees using purely empirical updates. That is, we do not require an assumption that we have access to exact quantities such as the population gradient of the loss function.
We further show that in the commonly studied setting of Gaussian inputs and non-overlapping patches, updating with respect to a single non-overlapping patch is sufficient to guarantee convergence. This indicates that the Gaussian/no-overlap assumption is quite strong.
1.2 Our Approach
Our approach is to exploit the monotonicity of the activation function instead of the strong convexity of the loss surface. We use ideas from isotonic regression and extend them in the context of convolutional networks. These ideas have been successful for learning generalized linear models [KKSK11], improperly learning fully connected, depth-three neural networks [GK17b], and learning graphical models [KM17].
1.3 Related Work
It is known that in the worst case, learning even simple neural networks is computationally intractable. For example, in the non-realizable (agnostic) setting, it is known that learning a single ReLU (even for bounded distributions and unit norm hidden weight vectors) with respect to square-loss is as hard as learning sparse parity with noise [GKKT16], a notoriously difficult problem from computational learning theory. For learning one hidden layer convolutional networks, Brutzkus and Globerson [BG17] proved that distribution-free recoverability of the unknown weight vector is NP-hard, even if we restrict to disjoint patch structures.
As such, a major open question is to discover the mildest assumptions that lead to polynomial-time learnability for simple neural networks. In this paper, we consider the very popular class of convolutional neural networks (for a summary of other recent approaches for learning more general architectures see [GK17a]). For convolutional networks, all prior research has focused on analyzing conditions under which (Stochastic) Gradient Descent converges to the hidden weight vector in polynomial-time.
Along these lines, Brutzkus and Globerson [BG17] proved that with respect to the spherical Gaussian distribution and for disjoint (non-overlapping) patch structures, gradient descent recovers the weight vector in polynomial-time. Zhong et al. [ZSD17] showed that gradient descent combined with tensor methods can recover one hidden layer involving multiple weight vectors but still requires a Gaussian distribution and non-overlapping patches. Du et al. [DLT17b] proved that gradient descent recovers a hidden weight vector involved in a type of two-layer convolutional network under the assumption that the distribution is a spherical Gaussian, the patches are disjoint, and the learner has access to the true population gradient of the loss function.
We specifically highlight the work of Du, Lee, and Tian [DLT17a], who proved that gradient descent recovers a hidden weight vector in a one-layer convolutional network under certain technical conditions that are more general than the Gaussian/no-overlap patch scenario. Their conditions involve a certain “alignment” of the unknown patch structure, the hidden weight vector, and the (continuous) marginal distribution. However, it is unclear which concrete patch-structure/distributional combinations their framework captures. We also note that all of the above results assume there is no noise; i.e., they work in the realizable setting.
In contrast, we consider an alternative to gradient descent, namely Convotron, that is based on isotonic regression. The exploration of alternative algorithms to gradient descent is a feature of our work, as it may lead to new algorithms for learning deeper networks.
$\|\cdot\|$ corresponds to the $\ell_2$-norm for vectors and the spectral norm for matrices. The identity matrix is denoted by $I$. We denote the input-label distribution by over input drawn from and label drawn from . The marginal distribution on the input is denoted by and the corresponding probability density function is denoted by .
In this paper we consider a simple convolutional neural network with one hidden layer and average pooling. Given input $x \in \mathbb{R}^n$, the network computes $k$ patches of size $r$ where each patch's location is indicated by matrices $P_1, \ldots, P_k \in \{0,1\}^{r \times n}$. Each matrix has exactly one 1 in each row and at most one 1 in every column. As before, the neural network is computed as follows:
$$f_w(x) = \frac{1}{k} \sum_{i=1}^{k} \sigma(w^\top P_i x),$$
where $\sigma$ is the activation function and $w \in \mathbb{R}^r$ is the weight vector corresponding to the convolution filter.
We study the problem of learning the teacher network under the square loss, that is, we wish to find a such that
We make the following assumptions:
Learning Model: Probabilistic Concept Model [KS90], that is, for all , , for some unknown where is noise with and for some . Note we do not require that the noise is independent of the instance. (In the realizable setting, as in previous works, the noise is assumed to be zero.)
Distribution: The marginal distribution on the input space is a symmetric distribution about the origin, that is, for all , .
Patch Structure: The minimum eigenvalue of where and the maximum eigenvalue of are polynomially bounded.
Activation Function: The activation function has the following form:
$\sigma(x) = x$ if $x \ge 0$ and $\sigma(x) = \alpha x$ otherwise,
for some constant $\alpha \in [0, 1]$.
The distributional assumption includes common assumptions such as Gaussian inputs, but is far less restrictive. For example, we do not require the distribution to be continuous nor do we require it to have identity covariance. In Section 4, we show that commonly used patch schemes from computer vision satisfy our patch requirements. The assumption on activation functions is satisfied by popular activations such as ReLU ($\alpha = 0$) and leaky ReLU ($\alpha > 0$).
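To make the architecture concrete, the following is a minimal sketch of the forward computation: one filter applied to every patch, a (leaky) ReLU, then average pooling. It assumes contiguous 1D patches, so each selection matrix is represented by the list of input indices it selects. The names `leaky_relu`, `patch_indices`, `forward`, and the parameter `alpha` (the slope on negative inputs) are illustrative choices, not notation taken from the paper.

```python
def leaky_relu(z, alpha=0.0):
    """Return z for z >= 0 and alpha * z otherwise (alpha = 0 gives ReLU)."""
    return z if z >= 0 else alpha * z


def patch_indices(n, r, d):
    """Contiguous 1D patches of size r with stride d, as lists of input indices."""
    return [list(range(s, s + r)) for s in range(0, n - r + 1, d)]


def forward(w, x, patches, alpha=0.0):
    """Average over patches of the activation applied to <w, patch of x>."""
    total = 0.0
    for idx in patches:
        pre_activation = sum(wj * x[j] for wj, j in zip(w, idx))
        total += leaky_relu(pre_activation, alpha)
    return total / len(patches)


# Hypothetical toy values, chosen only for illustration.
patches = patch_indices(n=5, r=3, d=1)   # [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
w = [1.0, -1.0, 0.5]
x = [1.0, 2.0, 3.0, 4.0, 5.0]
print(forward(w, x, patches))            # 1.0
```

Representing each selection matrix by an index list is valid here because every such matrix has exactly one 1 per row; a general (non-contiguous) patch would just use a different index list.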
2.1 Some Useful Properties
The activations we consider in this paper have the following useful property:
For all ,
The loss function can be upper bounded by the $\ell_2$-norm distance of weight vectors using the following lemma.
For any , we have
For all and ,
The Gershgorin Circle Theorem, stated below, is useful for bounding the eigenvalues of matrices.
Theorem 1 ([Wei03]).
For an $n \times n$ matrix $A$, define $R_i := \sum_{j \neq i} |A_{i,j}|$. Each eigenvalue of $A$ must lie in at least one of the disks $\{z : |z - A_{i,i}| \le R_i\}$.
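As a quick illustration of how the Gershgorin Circle Theorem bounds eigenvalues, the sketch below computes the disk centers and radii row by row and, for a real symmetric matrix, the resulting interval that must contain every eigenvalue. The helper names and the example matrix are hypothetical and chosen only for illustration.

```python
def gershgorin_disks(A):
    """Return one (center, radius) pair per row of the square matrix A."""
    disks = []
    for i, row in enumerate(A):
        radius = sum(abs(v) for j, v in enumerate(row) if j != i)
        disks.append((row[i], radius))
    return disks


def real_eigenvalue_bounds(A):
    """For real symmetric A, every eigenvalue lies in [min(c - R), max(c + R)]."""
    disks = gershgorin_disks(A)
    return min(c - R for c, R in disks), max(c + R for c, R in disks)


# A small symmetric example matrix (hypothetical).
A = [[4, 3, 2],
     [3, 4, 3],
     [2, 3, 4]]
print(gershgorin_disks(A))        # [(4, 5), (4, 6), (4, 5)]
print(real_eigenvalue_bounds(A))  # (-2, 10)
```

For this matrix the eigenvalues are 2 and 5 ± √19 (roughly 0.64 and 9.36), so the upper bound of 10 is fairly tight while the lower bound of -2 says nothing useful about a positive definite matrix. This is one reason a lower bound on the minimum eigenvalue can need a more careful argument than the upper bound.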
Note: The proofs of lemmas in this section have been deferred to the Appendix.
3 The Convotron Algorithm
In this section we describe our main algorithm Convotron and give a proof of its correctness. Convotron is an iterative algorithm similar in flavor to SGD with a modified (aggressive) gradient update. Unlike SGD (Algorithm 3), Convotron comes with provable guarantees and also does not need a good initialization scheme for convergence.
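Algorithm 1 is not reproduced in this section, so the following is only a sketch of a Convotron-style iteration under stated assumptions. It assumes that the update is w ← w + η (y − f_w(x)) Σ_i P_i x, starting from w = 0, in the spirit of isotonic-regression methods. The function names, the toy teacher `w_star`, the step size, and the iteration count are all hypothetical.

```python
import random


def leaky_relu(z, alpha=0.0):
    return z if z >= 0 else alpha * z


def forward(w, x, patches, alpha=0.0):
    """Average-pooled single-filter network; patches are lists of input indices."""
    return sum(leaky_relu(sum(wj * x[j] for wj, j in zip(w, idx)), alpha)
               for idx in patches) / len(patches)


def convotron_sketch(sample, patches, r, eta, T, alpha=0.0):
    """Assumed update: w <- w + eta * (y - f_w(x)) * sum_i P_i x, from w = 0."""
    w = [0.0] * r
    for _ in range(T):
        x, y = sample()
        residual = y - forward(w, x, patches, alpha)
        # Coordinate-wise sum of all patches, i.e. sum_i P_i x.
        patch_sum = [sum(x[idx[j]] for idx in patches) for j in range(r)]
        w = [wj + eta * residual * sj for wj, sj in zip(w, patch_sum)]
    return w


# Toy realizable experiment: standard Gaussian inputs, ReLU teacher, stride 1.
rng = random.Random(0)
n, r = 6, 3
patches = [list(range(s, s + r)) for s in range(n - r + 1)]
w_star = [0.6, -0.3, 0.5]


def sample():
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return x, forward(w_star, x, patches)


w_hat = convotron_sketch(sample, patches, r, eta=0.02, T=8000)
err = sum((a - b) ** 2 for a, b in zip(w_hat, w_star)) ** 0.5
print(err)  # distance to the teacher filter; expected to be small
```

Note that the update multiplies the residual by the plain sum of patches rather than by the true gradient, which would also carry the derivative of the activation at each patch. Dropping that derivative is what lets the analysis rely on monotonicity of the activation instead of the shape of the loss surface.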
The following theorem describes the convergence rate of our algorithm:
If Assumptions 1 are satisfied then for and , with probability , the weight vector computed by Convotron satisfies
Define The dynamics of Convotron can be expressed as follows:
We need to bound the RHS of the above equation. We have,
Now we bound the variance of. Note that . Further,
Combining the above equations and taking expectation over , we get
for , and .
We set and break the analysis into two cases:
Case 1: . This implies that .
Case 2: .
Observe that once Case 2 is satisfied, we have . Hence, for any iteration , Case 2 will continue to hold true. This implies that either at each iteration decreases by a factor or it is less than . Thus if Case 1 is not satisfied for any iteration up to , then we have,
since at initialization . Setting and using Markov’s inequality, with probability , over the choice of ,
By using Lemma 2, we can get a bound on by appropriately scaling .
3.1 Convotron in the Realizable Case
For the realizable (no noise) setting, that is, for all , , for some unknown , Convotron achieves faster convergence rates.
If Assumptions 1 are satisfied with the learning model restricted to the realizable case, then for suitably chosen , after iterations, with probability , the weight vector computed by Convotron satisfies
Since the setting has no noise, . Setting that parameter in Theorem 2 gives us as tends to infinity as tends to 0 and taking the minimum removes this dependence from . Substituting this gives us the required result. ∎
Observe that the dependence of in the convergence rate is for the realizable setting, compared to the dependence in the noisy setting.
4 Which Patch Structures are Easy to Learn?
In this section, we will show that the commonly used convolutional filters in practice (“patch and stride”) have good eigenvalues giving us fast convergence by Theorem 2. We will start with the 1D case and then subsequently extend the result for the 2D case.
4.1 1D Convolution
Here we formally describe a patch and stride convolution in the one-dimensional setting. Consider a 1D image of dimension . Let the patch size be and stride be . Let the patches be indexed from 1 and let patch start at position and be contiguous through position . The matrix of dimension corresponding to patch looks as follows,
where indicates a matrix of dimension with all zeros and indicates the identity matrix of size .
Thus, the total number of patches is . We will assume that and . The latter condition is to ensure there is some overlap; the non-overlapping case, which is easier, is handled in the next section.
We will bound the extremal eigenvalues of . Simple algebra gives us the following structure for ,
For understanding, we show the matrix structure for and .
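The matrix structure can also be inspected numerically. The sketch below builds the selection matrices for a 1D patch-and-stride layout and sums all pairwise products. It assumes the matrix under study is P = Σ_{i,j} P_i P_j^T and that the number of patches is k = ⌊(n − r)/d⌋ + 1, with patches indexed from 0 for convenience; the function names and parameter values are illustrative.

```python
def patch_matrix(n, r, d, i):
    """r x n selection matrix of the i-th patch (0-indexed), starting at i * d."""
    return [[1 if col == i * d + row else 0 for col in range(n)]
            for row in range(r)]


def matmul_transpose(A, B):
    """Compute A @ B^T for matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(ra, rb)) for rb in B] for ra in A]


def overlap_matrix(n, r, d):
    """Sum of P_i P_j^T over all ordered pairs of patches (assumed definition)."""
    k = (n - r) // d + 1
    mats = [patch_matrix(n, r, d, i) for i in range(k)]
    P = [[0] * r for _ in range(r)]
    for Pi in mats:
        for Pj in mats:
            M = matmul_transpose(Pi, Pj)
            for a in range(r):
                for b in range(r):
                    P[a][b] += M[a][b]
    return P


print(overlap_matrix(n=6, r=3, d=1))  # [[4, 3, 2], [3, 4, 3], [2, 3, 4]]
print(overlap_matrix(n=6, r=3, d=2))  # [[2, 0, 1], [0, 2, 0], [1, 0, 2]]
print(overlap_matrix(n=6, r=2, d=2))  # [[3, 0], [0, 3]]
```

With stride 1 the entries fall off by one per step away from the diagonal, and with disjoint patches the matrix reduces to a multiple of the identity.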
4.1.1 Bounding Extremal Eigenvalues of
The following lemmas bound the extremal eigenvalues of .
Maximum eigenvalue of satisfies where and .
Using Theorem 1, we have . Observe that is bisymmetric thus and we can restrict to the top half of the matrix. The structure of indicates that in a fixed row, the diagonal entry is maximum and the non-zero entries decrease monotonically by 1 as we move away from the diagonal. Also, there can be at most non-zero entries in any row. Thus the sum is maximized when there are non-zero entries and the diagonal entry is the middle entry, that is at position . By simple algebra,
Minimum eigenvalue of satisfies .
We break the analysis into the following two cases:
In this case we directly bound the minimum eigenvalue of . Using Theorem 1, we know that . For , for some . The maximum value that can take is and since , must be either 0 or 1. Also, for any , there exists a unique such that since , thus there are exactly 2 non-zero entries in each row of , . This gives us, for each , . Thus, we get that .
Combining both, we get the required result. ∎
4.1.2 Learning Result for 1D
Augmenting the above analysis with Theorem 2 gives us learnability of 1D convolution filters.
If Assumptions 1(a),(b), and (d) are satisfied and the patches have a patch and stride structure with parameters , then for suitably chosen and , with probability , the weight vector output by Convotron satisfies
Combining the above Lemmas gives us that and . Observe that . Substituting these values in Theorem 2 gives us the desired result. ∎
Comparing with SGD, [BG17] showed that even for and , gradient descent can get stuck in a local minimum with probability .
4.2 2D Convolution
Here we formally define stride and patch convolutions in two dimensions. Consider a 2D image of dimension . Let the patch size be and stride in both directions be respectively. Enumerate patches such that patch starts at position and is a rectangle with diagonally opposite point . Let and . Let us vectorize the image row-wise into a dimension vector and enumerate each patch row-wise to get a dimensional vector.
Let be the indicator matrix of dimension with 1 at if the th location of patch is . More formally, for all for , , and else 0. Note that there are patches in total with the corresponding patch matrices being for .
4.2.1 Bounding Extremal Eigenvalues of
We will bound the extremal eigenvalues of . Let ’s be the patch matrices corresponding to the 1D convolution for parameters defined as in the previous section and let . Define ’s for and similarly with parameters instead of .
Intuitively and give the indices corresponding to the row and column of the 2D patch and the Kronecker product vectorizes it to give us the th patch. More formally, we will show that iff .
Let with , and with , . Then, iff and . We know that iff and iff . This gives us that , which is the same condition for . Thus . ∎
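The Kronecker-product structure can be checked directly on a small example. The sketch below builds the selection matrix of a 2D patch entry by entry, using row-wise vectorization of both image and patch, and compares it with the Kronecker product of the 1D selection matrices for the patch's row range and column range. The 0-based indexing, the function names, and the parameter values are assumptions made for illustration.

```python
def patch_matrix(n, r, d, i):
    """r x n selection matrix of the i-th 1D patch (0-indexed), starting at i * d."""
    return [[1 if col == i * d + row else 0 for col in range(n)]
            for row in range(r)]


def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]


def patch_matrix_2d(n1, n2, r1, r2, d1, d2, i, j):
    """(r1*r2) x (n1*n2) selection matrix of the patch with corner (i*d1, j*d2)."""
    Q = [[0] * (n1 * n2) for _ in range(r1 * r2)]
    for u in range(r1):
        for v in range(r2):
            # Patch location (u, v) reads image pixel (i*d1 + u, j*d2 + v).
            Q[u * r2 + v][(i * d1 + u) * n2 + (j * d2 + v)] = 1
    return Q


# Hypothetical sizes: 4 x 5 image, 2 x 3 patches, strides 1 and 2.
n1, n2, r1, r2, d1, d2 = 4, 5, 2, 3, 1, 2
k1, k2 = (n1 - r1) // d1 + 1, (n2 - r2) // d2 + 1
match = all(
    patch_matrix_2d(n1, n2, r1, r2, d1, d2, i, j)
    == kron(patch_matrix(n1, r1, d1, i), patch_matrix(n2, r2, d2, j))
    for i in range(k1) for j in range(k2)
)
print(match)  # True
```

Because the 2D matrices factor this way, eigenvalue bounds for the 2D case can be assembled from the two 1D bounds, which is what the subsequent argument relies on.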
We have and where and .
Since and are positive semi-definite, and . Using the lemmas from the previous section gives us the required result. ∎
Note that this technique can be extended to higher dimensional patch structures as well.
4.2.2 Learning Result for 2D
Similar to the 1D case, combining the above analysis with Theorem 2 gives us learnability of 2D convolution filters.
If Assumptions 1(a),(b), and (d) are satisfied and the patches have a 2D patch and stride structure with parameters , then for suitably chosen and , with probability , the weight vector output by Convotron satisfies
5 Non-overlapping Patches are Easy
In this section, we will show that if there is one patch that does not overlap with any other patch and the covariance matrix is the identity, then we can easily learn the filter even if the other patches have arbitrary overlaps. This includes the commonly used Gaussian assumption. WLOG we assume that is the patch that does not overlap with any other patch implying for all .
Observe that the algorithm ignores the directions of all other patches and yet succeeds. This indicates that with respect to a Gaussian distribution, in order to have an interesting patch structure (for one layer networks), it is necessary to avoid having even a single disjoint patch. The following theorem shows the convergence of Convotron-No-Overlap.
If Assumptions 1 are satisfied with , then for and , with probability , the weight vector output by Convotron-No-Overlap satisfies