Very little is understood about the exact class of functions learnable with deep neural networks. Note that we are differentiating between representability and learnability. The former simply means that there exists a deep network that can represent the function to be learned, while the latter means that it can be learned in a reasonable amount of time via a method such as gradient descent.
While several natural functions seem learnable, at least to some degree, by deep networks, we do not quite understand the specific properties they exhibit that make them learnable. The fact that they can be learned means that they can be represented by a deep network. Thus, it is important to understand the class of functions that can be represented by deep networks (as teacher networks) that are learnable (by any student network).
In this work we study the learnability of random deep networks. By random, we mean that all weights in the network are drawn i.i.d. from a Gaussian distribution. We theoretically analyze the level of learnability as a function of (teacher network) depth. Informally, we show the following statement in Section 4.2.
The learnability of a random deep network with activations drops exponentially with its depth.
Our proof proceeds by arguing that for a random network, a pair of non-collinear inputs progressively repel each other in terms of the angle between them. Thus, after a small number of layers, they become essentially orthogonal. While this statement appears intuitively true, proving it formally entails a delicate analysis of a stochastic non-linear recurrence, which forms the technical meat of our work. Inputs becoming orthogonal after passing through random layers means that their correlation with any fixed function becomes very small. We use this fact along with known lower bounds in the statistical query (SQ) model to argue that random networks of depth cannot be efficiently learned by SQ algorithms, which form a rich family that includes widely-used methods such as PCA, gradient descent, etc. Our proof technique showing non-learnability in the deep network setting could be of independent interest.
While we argue the random deep network becomes harder to learn as the depth exceeds where is the number of inputs, an interesting question is whether these functions are cryptographically hard to learn, i.e., are they cryptographic hash functions? While we do not definitively answer this question, we provide some evidence that may point in this direction. Informally we show the following in Section 6.
Outputs of random networks of -depth are -way independent.
To confirm that our bounds are not too pessimistic in practice, we take different random deep networks of varying depth and evaluate experimentally to what extent they are learnable using known, state-of-the-art deep learning techniques. Our experiments show that while these random deep networks are learnable at lower depths, their learnability drops sharply with increasing depth, supporting our theoretical lower bounds.
Several works investigate the learnability of simple networks where the network to be learned is presented as a black-box teacher network that is used to train a student network by taking several input-output pairs. If the inputs to the teacher network are drawn from a Gaussian distribution, then for networks of small depth (two), the works [7, 6, 16, 20, 18, 21] guarantee learning the function under certain assumptions. Our non-learnability results for higher depths are in contrast with these results.
Non-learnability results [4, 9, 2, 15] have been shown for depth-two networks assuming adversarial input distributions. Our work does not assume any adversarial input distribution but rather studies deep networks with random weights.
Another line of research borrows ideas from kernel methods and polynomial approximations to approximate the neural network by a linear function in a high-dimensional space and subsequently learn the same [19, 9, 10, 11]. Tensor decomposition methods [1, 13] have also been applied to learning these simple architectures. On the other hand, while our non-learnability results hold for non-linear activations, we show a matching learnability result with only linear functions.
2 Background and notation
In this paper we denote vectors by lower case, with subscripts denoting the components; we denote matrices by upper case. If is a scalar function and is a vector, we let denote the vector obtained by applying to each component of . For two vectors , let denote the angle between and and let denote their inner product and let
For a random variable , we denote its mean and standard deviation by and respectively. Let denote the unit normal distribution.
We consider deep networks with the following topology. There are inputs to the network, where each input is ; thus the domain of the network is . The network has depth . The th layer of the network () operates on an input and produces the output for the next layer, where is a non-linear activation function and is an matrix. Therefore, each internal layer of the network is fully connected and has the same width as the input (this assumption is not critical for the results; we make it for simplicity of exposition). By convention, corresponds to the input to the network and the top layer is a matrix so that the final network output is also .
In this paper the objects of study are random deep networks. We assume that each is an random matrix, where each entry is chosen uniformly at random in . Let be the function computed by a depth- network with activation and random weight matrices on an input . We will consider non-linear activation functions such as the sign function (), the rectified linear function (), and the sigmoid function ().
Our analysis will mainly focus on the activation function; in this case we will abbreviate the computed function as . The function satisfies the following well-known elegant geometric property:
For any two vectors , .
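The geometric property referred to here is, in its standard form, the fact that for a Gaussian random vector the probability that the signs of its inner products with two fixed vectors differ equals the angle between the two vectors divided by π. The following is a minimal numerical sketch of that fact, not the authors' code; the function names, the example vectors, and the sample size are illustrative assumptions.

```python
import numpy as np

def angle(x, y):
    """Angle between x and y in radians."""
    cos = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def sign_disagreement_rate(x, y, trials=200_000, seed=0):
    """Monte Carlo estimate of Pr_w[sgn(w.x) != sgn(w.y)] for w ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((trials, x.shape[0]))
    return float(np.mean(np.sign(w @ x) != np.sign(w @ y)))

# Two hypothetical +-1 inputs whose cosine is 1/2, i.e. angle pi/3.
x = np.array([1.0, 1.0, -1.0, 1.0])
y = np.array([1.0, -1.0, -1.0, 1.0])

estimate = sign_disagreement_rate(x, y)
predicted = angle(x, y) / np.pi  # angle / pi
print(estimate, predicted)
```

The estimate and the prediction should agree up to sampling noise, which shrinks as the number of trials grows.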
We now define a function that will play an important role in our analysis.
This function satisfies the following properties outlined below, which can be seen from the Taylor expansions of and respectively. The first part is an upper bound on the function whereas the second part says that the function can be upper bounded by a linear function for inputs sufficiently close to zero.
(i) For each , (ii) There is a sufficiently small constant such that for each , it holds that for some constant .
For and , let denote the distribution of the average of independent random variables such that each random variable is with probability and with probability . Let denote that the random variable is distributed as .
3 A repulsion property
In this section we prove a basic property of random deep networks. The key point is that whatever be the initial angle between the pair of inputs , after a few random layers, the vectors become close to orthogonal. Specifically the distribution of the angle becomes indistinguishable from the case had the inputs been chosen randomly.
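The repulsion property can be observed empirically before it is proved. The sketch below is an illustration under assumptions rather than the authors' experiment: it passes two ±1 inputs with initial cosine 0.9 through layers with i.i.d. Gaussian weights and sign activation, and records the normalized inner product after each layer. The width, depth, and seeds are hypothetical choices.

```python
import numpy as np

def cosine_per_layer(x, y, depth, seed=0):
    """Normalized inner product of the two outputs after each random sign layer."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    cosines = [float(x @ y) / n]  # +-1 vectors have norm sqrt(n)
    for _ in range(depth):
        W = rng.standard_normal((n, n))  # fresh Gaussian weights per layer
        x, y = np.sign(W @ x), np.sign(W @ y)
        cosines.append(float(x @ y) / n)
    return cosines

n = 1000
rng = np.random.default_rng(1)
x = rng.choice([-1.0, 1.0], size=n)
y = x.copy()
y[: n // 20] *= -1.0  # flip 5% of the coordinates, so the cosine is 0.9

cosines = cosine_per_layer(x, y, depth=20)
print([round(c, 3) for c in cosines])
```

In runs of this kind the cosine falls quickly toward zero and then fluctuates at a scale of roughly one over the square root of the width, which is the behavior the rest of this section quantifies.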
For the remainder of this section we fix a pair of inputs and that are not collinear and study the behavior of a random network on both these inputs. Let denote the angle between the outputs at layer and let . Our goal is to study the evolution of as a function of .
Recall that the th bit of , namely , is obtained by taking , where is the th row of . Similarly, . We now calculate the expected value of the product of the bits of and .
For any and any , it holds that .
Applying Lemma 1 to and with as the random vector, we obtain
We now measure the expected angle between and . Note that .
Taking the expectation of , applying Lemma 3, and using finishes the proof. ∎
To proceed, we next introduce the following definition.
Definition 5 (-mixing).
A pair of inputs is -mixed if , where are chosen uniformly at random in .
Our goal is to show that is mixed by studying the stochastic iteration defined by Corollary 4. Note that by symmetry, . Hence, to show -mixing, it suffices to show . We will in fact show that the iterative process -mixes for sufficiently large.
3.1 Mixing using random walks
Corollary 4 states a stochastic recurrence for ; it is a Markov process, where for each , . However the recurrence is non-linear and hence it is not straightforward to analyze. Our analysis will consist of two stages. In the first stage, we will upper bound the number of steps such that with overwhelming probability goes below a sufficiently small constant. In the second stage, we will condition that is below a small constant (assured by the first stage). At this point, to bound we upper bound it by a certain function of the distribution that tracks its level of asymmetry and show that this level of asymmetry drops by a constant factor in each step. Note that if the distribution is symmetric around the origin then .
Note that can take at most distinct values since the inner product between a pair of vectors takes only discrete values through . This leads to interpreting the recurrence in Corollary 4 as a random walk on a -node graph. (A minor issue is that there are two sink nodes corresponding to and . We will argue that as long as the inputs are not collinear, the probability of ending up in one of these sink nodes is exponentially small.)
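The random-walk view can be simulated directly on the scalar state, without materializing any weight matrices. The sketch below rests on an assumption drawn from the geometric property of the sign function: given the current normalized inner product, each of the output bits agrees on the two inputs independently with probability one minus the angle over π, so the next state is a rescaled binomial. The function name and parameters are hypothetical.

```python
import numpy as np

def simulate_chain(c0, n, steps, seed=0):
    """Simulate the normalized inner product as a Markov chain on n + 1 states."""
    rng = np.random.default_rng(seed)
    c = c0
    path = [c]
    for _ in range(steps):
        # Probability that a single output bit agrees on the two inputs.
        p_agree = 1.0 - np.arccos(np.clip(c, -1.0, 1.0)) / np.pi
        agreements = rng.binomial(n, p_agree)
        # Average of n values that are +1 on agreement and -1 otherwise.
        c = 2.0 * agreements / n - 1.0
        path.append(c)
    return path

path = simulate_chain(c0=0.9, n=1000, steps=20)
print([round(c, 3) for c in path])
```

Note that the states 1 and -1 are absorbing in this simulation (the agreement probability becomes 1 or 0), which matches the two sink nodes mentioned above.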
We now begin with the first stage of the analysis.
For and chosen according to Lemma 2, we have
For ease of understanding let us assume ; an analogous argument works for the negative case.
For convenience, we will track and show that rapidly goes to a value bounded away from 1. Initially, since the Hamming distance between and , which are not collinear, is at least .
This implies by a Chernoff bound that for any ,
If the above high probability event keeps happening for steps, then will keep increasing and will become less than in steps, where is from Lemma 2(ii). Further, the probability that any of these high probability events do not happen is, by a union bound, at most
since each . ∎
We now move to the second stage. For the rest of the section we will work with the condition that for all , where is given by Lemma 6. First we bound the probability that continues to stay small steps after .
For , .
If , from a Chernoff bound as in the proof of Lemma 6. A union bound finishes the proof. ∎
From Lemma 2(ii), we know that for , we have .
Next we make use of the following lemma, which we prove in the next subsection.
Conditioning on the absence of the low probability event in Lemma 7,
For any not collinear we have .
3.2 Exponential decay
We will now prove Lemma 8, i.e., that decays exponentially in . For ease of exposition, throughout this subsection we will implicitly condition on the event that right from the initial step through the duration of the execution of the Markov process (since the probability of this event failing is exponentially small, the conditioning alters the bounds only negligibly).
For convenience, let denote . Define
Note that if is an even function (i.e., the distribution is symmetric around the origin), then we have . We will show that
We first show this statement holds for all point-mass distributions. Suppose has support only at point . Recall that is obtained by first computing and then taking the distribution . By the conditioning that and from Lemma 2, we have . Without loss of generality assume . Then for , since this can be viewed as a biased coin that has a higher probability of heads than tails and therefore the probability of getting more heads than tails is more than the converse probability. So
where the last step uses for point-mass.
Note that when viewed as a function of the distribution is a convex function (since the absolute value operator is convex).
Next we use the statement for point-mass distributions to derive the same conclusion for two-point distributions with support at . Any such distribution is a convex combination of two distributions: one of them is symmetric and is supported on both points with equal mass and the residual is supported only at either or (whichever had a higher probability). The former distribution is symmetric and hence when taken to the next step will result again in a symmetric distribution which, as observed earlier, must have value of . The latter initially has value equal to and after one step will have value at most by our conclusion for point-mass distributions. The conclusion for two-point distributions then follows from the convexity of .
Finally, we use these steps to derive the same conclusion for all distributions. By the convexity of , any distribution can be expressed as a convex linear combination of two-point distributions . The distribution is also the same convex combination of the resulting distributions . Now if then . Now . So by the convexity of the function, we have
This immediately yields the following.
Applying Lemma 10 inductively completes the proof. ∎
In this section we discuss the implications of the repulsion property we showed in Section 3. We first calculate the statistical correlation of the function computed by a random network with some fixed function. We use the correlation bound to argue that random networks are hard to learn in the statistical query model. Finally we show an almost matching converse to the hardness result: we show upper bounds on the learnability of a random network.
4.1 Statistical correlations
We will now show that the statistical correlation , for random , with any fixed function is exponentially small in the depth of the network. Consider all the inputs that are in one halfspace; this ensures that no two inputs are collinear. Note that the number of such inputs is . Define for , the norm .
using Theorem 9. Note that this bound is when has bounded first and second norms and when . ∎
4.2 Statistical query hardness
In this section we show how the statistical correlation bound we obtained in Lemma 12 leads to hardness of learning in the statistical query (SQ) model. The SQ model captures many classes of powerful machine learning algorithms including gradient descent, PCA, etc.; see for example the survey by . In this model, instead of working with labeled examples, a learning algorithm can estimate probabilities involving the target concept. Let denote the space of inputs. Formally speaking, a statistical query oracle for a Boolean function with respect to a distribution on takes as input a query function and a tolerance and returns such that . A concept class is SQ-learnable with respect to a distribution if there is an algorithm that, for every , obtains an -approximation to with respect to in time for some polynomial . (A function is an -approximation to with respect to if .) Here is the number of parameters to describe any and is the number of parameters to describe any . We assume that the evaluation of the function that uses for the oracle takes time and the tolerance is , where and are also polynomials.
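To make the oracle definition concrete, here is a toy sketch of a statistical query oracle over the uniform distribution on a small Boolean cube. It assumes the standard formulation, in which the oracle may return any value within the tolerance of the true expectation of the query applied to an input and its label. The target concept, the query, and all names are hypothetical and chosen only for illustration.

```python
import itertools
import numpy as np

def exact_expectation(query, target, n):
    """E[query(x, target(x))] for x uniform over {-1, 1}^n, by enumeration."""
    points = itertools.product([-1.0, 1.0], repeat=n)
    values = [query(np.array(p), target(np.array(p))) for p in points]
    return float(np.mean(values))

def sq_oracle(query, target, n, tolerance, rng):
    """Return some value within `tolerance` of the true expectation."""
    noise = rng.uniform(-tolerance, tolerance)  # stands in for an arbitrary valid answer
    return exact_expectation(query, target, n) + noise

n = 8
rng = np.random.default_rng(0)
w = rng.standard_normal(n)

def target(x):
    """Hypothetical target concept: a random halfspace."""
    return 1.0 if w @ x >= 0 else -1.0

def query(x, label):
    """A correlational query: the label times a fixed function of x."""
    return label * x[0]

truth = exact_expectation(query, target, n)
answer = sq_oracle(query, target, n, tolerance=0.01, rng=rng)
print(truth, answer)
```

A learner in this model only ever sees values such as `answer`, never individual labeled examples, which is why a target that is nearly uncorrelated with every fixed query is hard to learn.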
It has been shown that for Boolean functions, the statistical query function can be restricted to be a correlation query function of the form , where . The following can be inferred from Lemma 12 and the arguments in Section 5.1 in .
For Boolean function , all statistical queries made by an SQ-learning algorithm can be restricted to be correlational queries.
Theorem 14 (, Theorem 31).
Let be a concept class that is SQ-learnable with respect to , particularly, there exists an algorithm that for every , uses at most queries of tolerance lower bounded by to produce a -approximation to . Let . There exists a collection of sets such that and the set -approximates with respect to . (A set -approximates with respect to distribution if for every there exists such that -approximates with respect to .)
Using these, we now show the hardness result.
If , then the class of functions is not SQ-learnable unless .
Given that the square of the correlation of a uniformly chosen random function is at most , by a Markov bound, with probability , the correlation with a fixed will also be at most . By a union bound, the correlation squared with all the functions in will then be at most ; i.e., the absolute value of the correlation is . However, for the desired tolerance and learnability in Theorem 14, there is a function such that -approximates , i.e., the correlation of with is . Thus, we need , which completes the proof. ∎
For any SQ algorithm for learning a random deep network of depth . In particular this means that to get accuracy that is a constant better than , the number of queries is exponential in ; if the number of queries is then the improvement over an accuracy of is .
In the next section, we will show a matching upper bound of for the achievable accuracy in polynomial time.
4.3 An upper bound
We will now show that , i.e., a random network of depth with activation, can indeed be learned up to an accuracy of , in fact, by a linear function. This essentially complements the hardness result in Section 4.2.
We will show that the linear function defined by is highly correlated with . We complete the proof by induction.
We study the output at each layer for both and . Let (resp., ) denote the vector of inputs to the th layer of (resp., ). Let and ; note that for . If , then we have . Now,
The key step is to express . Fix an and for brevity let . Since , we obtain
Here, the third equality follows since the random variables and are independent. The fourth equality follows since . The final equality follows since (i) and hence the expectation in the first term is of the form where , which is some absolute constant that depends on (for , is the mean of the folded unit normal distribution and is ), and (ii) the second term vanishes since . Plugging (4) into (3), we obtain
by an induction on and using the base case at the input layer where . Since the final output of is a bit, the average correlation with is . ∎
5 On extending to
We now extend some of the previous analysis of the activation function to the activation function. As we describe later in Appendix A for the (and ) teacher networks, to keep the output vectors at each layer bounded, we normalize the output of each layer to have mean along each dimension. This is analogous to the batch normalization operation commonly used in training; in our case, we refer to this operator as . Thus, the operation at each layer is the following: (i) multiply the previous layer's outputs by a random Gaussian matrix , (ii) apply , and (iii) apply the normalization operation .
As in the previous analysis, the key step is to show that whatever be the initial angle between the pair of inputs , after a few random layers with , the vectors quickly become close to orthogonal.
We fix a pair of input vectors and with length and inner product and study the behavior of this layer on the inner product of the corresponding output vectors. Recall that . In particular, we are interested in , where the batch normalization operator for is the function .
By 2-stability of Gaussians and since is a random matrix, and are also Gaussian random vectors. Note that we do not make any assumption that the input vectors at each layer are Gaussian distributed: and are arbitrary fixed vectors; however, multiplying them with a random matrix leads to Gaussian random vectors. We get the following (where we use and to refer to and respectively).
Thus, we need to analyze how the angle between two random Gaussian vectors changes when passed through a and the normalization operator . In particular, given two vectors at angle , we wish to find the expected cosine similarity () after applying one iteration. Furthermore, since all dimensions of and are independent, using an argument similar to that in Corollary 4 we can consider the expected dot product along only one dimension of the Gaussian vectors. Thus, we have
We now make the following claim (proof in Appendix).
In Figure 1 we plot the ratio of this expected value (which we denote as ) and via simulations, which shows that this ratio is always less than 1 for all values of . In addition to the plot for , we also plot the ratio for the function (obtained via simulations and via its closed-form expression given by (1)).
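A simulation of the kind described for Figure 1 can be sketched as follows. This is an illustrative reconstruction under assumptions, not the authors' script: it draws pairs of unit Gaussians with a prescribed correlation, applies the rectified linear function, normalizes each output to empirical mean 0 and variance 1 (the normalization constants are assumed here), and reports the ratio of the resulting correlation to the input correlation.

```python
import numpy as np

def relu_correlation(rho, samples=400_000, seed=0):
    """Correlation after rectification and normalization, for input correlation rho."""
    rng = np.random.default_rng(seed)
    g1 = rng.standard_normal(samples)
    # g2 is a unit Gaussian with correlation rho to g1.
    g2 = rho * g1 + np.sqrt(1.0 - rho**2) * rng.standard_normal(samples)
    a, b = np.maximum(g1, 0.0), np.maximum(g2, 0.0)
    # Assumed normalization: empirical mean 0 and variance 1 per output.
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return float(np.mean(a * b))

rhos = [-0.9, -0.5, 0.2, 0.5, 0.9]
ratios = {rho: relu_correlation(rho) / rho for rho in rhos}
for rho, ratio in ratios.items():
    print(f"input correlation {rho:+.1f}: ratio {ratio:.3f}")
```

For each sampled input correlation the ratio stays strictly between 0 and 1, consistent with the claim that one layer shrinks the expected cosine similarity.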
However, we are unable to account for the distribution around the expected value and prove that the magnitude of the expected value of the sum of such random variables drops exponentially with height. Hence, we leave this as a conjecture that we believe to be true.
For any not collinear we have for all .
As in the case, SQ-hardness for activation follows directly from the above conjecture, using Section 4.
6 -way independence
-additive-close to full independence: A set of bits is independent if the probability of seeing a specific combination of values is . It is close to fully independent if its probability distribution differs from the fully independent distribution in statistical distance by at most .
-additive-close to -way independence: A binary function is -additive-close to -way independent if the outputs on any inputs are -additive-close to fully independent.
-multiplicative-close to full independence: A set of bits is factor close to independent if the probability of seeing a specific combination of values is .
Now take a subset of binary inputs so that no pair is equal to one another or opposite; that is, . We will also exclude opposites among the inputs.
The function with depth is -additive close to -way-independent.
For a function of depth , with high probability of over the first layers, the output after the last layer is -multiplicative close to fully independent.
If outputs at a layer are orthogonal then after one layer the outputs form a set of fully independent dimensional vectors.
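The orthogonal case can be checked numerically at a single output node. The sketch below is an illustration with hypothetical inputs: it takes three mutually orthogonal ±1 vectors, draws a Gaussian weight vector many times, and tabulates how often each of the eight sign patterns occurs. For orthogonal inputs the three inner products are independent Gaussians, so each pattern should appear with frequency close to 1/8.

```python
import numpy as np

def sign_pattern_frequencies(inputs, trials=200_000, seed=0):
    """Empirical distribution of the k output bits at one node with Gaussian weights."""
    rng = np.random.default_rng(seed)
    k, n = inputs.shape
    w = rng.standard_normal((trials, n))       # one weight vector per trial
    bits = (w @ inputs.T > 0).astype(int)      # trials x k matrix of output bits
    codes = bits @ (1 << np.arange(k))         # encode each bit pattern as an integer
    counts = np.bincount(codes, minlength=2**k)
    return counts / trials

# Three mutually orthogonal +-1 inputs (rows of a 4 x 4 Hadamard matrix).
inputs = np.array([
    [1, 1, 1, 1],
    [1, -1, 1, -1],
    [1, 1, -1, -1],
], dtype=float)

freqs = sign_pattern_frequencies(inputs)
print(freqs)
```

Replacing the inputs with nearly orthogonal vectors gives frequencies that deviate slightly from uniform, which is the situation the following lemmas quantify.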
The main idea behind the following lemma is that the outputs become near orthogonal after layers.
After steps of the random walk with probability
We already know that becomes less than some constant after steps and after that it decreases by factor in expectation in each iteration. The sampling from can cause a perturbation of at most w.h.p. . So w.h.p. . Chasing this recursion gives that after steps, w.h.p. . ∎
We say that a pair of vectors is -near orthogonal if their normalized dot product satisfies .
For inputs , after layers, all pairs from the set of inputs are pairwise -near orthogonal w.h.p.
If inputs to a layer are -near orthogonal then after one layer the outputs at any node at the next layer are -multiplicative close to independent.
We are dotting the inputs with a random vector and then taking sign. If is the matrix formed by these inputs as rows , then consider the distribution of . If rows of were orthogonal, is a uniform Gaussian random vector in . If is near orthogonal, it will depart slightly from a uniform Gaussian. Consider the pdf.
is also dimensional normally distributed vector with covariance matrix .
So the pdf of will be for some constant . Suppose all the normalized dot products are at most . Then , where each entry in is at most . So the eigenvalues of are at most and those of are within . So the determinant is . So overall the distribution differs from that of a uniform normal distribution by . ∎
Since w.p. , the normalized dot product is , we know that conditioned on this, we get -multiplicative close to fully independent outputs. So overall it is -additive-close to fully independent. This gives the following.
The function after depth is -additive close to -way-independent.
We considered the problem of learnability of functions generated by deep random neural networks. We show theoretically that deep random networks with activations are hard to learn in the SQ model as their depth exceeds , where is the input dimension. Experiments show that functions generated by deep teacher networks (with activations) are not learnable by any reasonable deep student network, beyond a certain depth, even using the state-of-the-art training methods. Our work motivates the need to better understand the class of functions representable by deep teacher networks that make them amenable to being learnable.
Appendix A Experiments
We next describe experiments that support our theoretical results and show that random networks using or activations are not learnable at higher depths.
Experimental setup and data.
Since we wish to evaluate learnability of deep random networks, we rely on generating synthetic datasets for binary classification, where we can ensure that the training and evaluation dataset is explicitly generated by a deep neural network. In particular, we specify a teacher network parameterized by a choice of network depth , layer width , activation function , and input dimension . Each layer of the teacher network is a fully connected layer specified by the width and activation function. The last layer is a linear layer that outputs a single float, which is then thresholded to obtain a Boolean label. We initialize all parameters in the network with random Gaussian values. In particular, we initialize all the matrices .
In addition, as mentioned in Section 5, we re-normalize the output of each layer to have mean and variance along each dimension (since otherwise the function becomes trivially learnable as most outputs concentrate around with increasing depth). We generate million training examples for each teacher network. For each training example, we generate a uniform random input vector in the range as an input to this network, and use the Boolean output of the network to generate (positive or negative) labels. We generate different teacher networks by varying the activation functions (, , and ) and depth (from to ) for the network. We fix the width at and input dimension at .
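The data-generation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the depth, width, input dimension, sample count, and input range used here are hypothetical placeholders, the weights are drawn from a unit normal, and the per-dimension renormalization is computed over the generated batch to mean 0 and variance 1.

```python
import numpy as np

def make_teacher(depth, width, input_dim, activation, seed=0):
    """Build a random teacher network that maps a batch of inputs to Boolean labels."""
    rng = np.random.default_rng(seed)
    dims = [input_dim] + [width] * depth
    weights = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(depth)]
    top = rng.standard_normal(width)  # final linear layer producing a single float

    def teacher(X):
        H = X
        for W in weights:
            H = activation(H @ W.T)
            # Assumed renormalization: mean 0, variance 1 along each dimension.
            H = (H - H.mean(axis=0)) / (H.std(axis=0) + 1e-12)
        return H @ top > 0  # threshold the output to obtain a Boolean label

    return teacher

def relu(z):
    return np.maximum(z, 0.0)

# Hypothetical sizes, far smaller than a real experiment would use.
teacher = make_teacher(depth=4, width=32, input_dim=32, activation=relu)
X = np.random.default_rng(1).uniform(-1.0, 1.0, size=(10_000, 32))
labels = teacher(X)
print(labels.mean())
```

A student network would then be trained on the pairs `(X, labels)`; varying `depth` and `activation` in `make_teacher` yields the family of teachers the experiments sweep over.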
We try to learn the classification function represented by each teacher network, with a corresponding student network which is assumed to know the topology and activation function used by the teacher network, but not the choice of weights and biases in the network. Further, we use student networks for varying depths and widths . For the case of a teacher network with , since the activation function is not differentiable, we use a function in the corresponding student network. We configure dropout, batch-normalization, and residual/skip-connections at each layer of the teacher network, and tune all hyperparameters before reporting our experimental results.
Figure 2 illustrates the learnability of the functions generated by the teacher network for the different parameters mentioned above. In particular, we report results for , , . We observe that for all three activation functions, the test AUC of the student network starts close to at small depths and drops off rapidly to as we increase the depth. This is consistent with our theoretical results summarized in Section 4. While this behavior is more representative in the case of and teacher networks, the activation function seems to exhibit a threshold behavior in the AUC. In particular, even as the AUC stays as high as for , it drops sharply to for greater depths. We leave the analysis of this intriguing behavior as an open problem.
In another experiment, we studied the effect of overparameterizing the width and height parameters of the student network when compared to the teacher network. To this end, we varied the width of the student network as and the depth as . We did not observe any significant change in the learnability results for all combinations of these parameter values.
Appendix B Proof of Lemma 20
Without loss of generality, let