I Introduction
Consider a sample of points, each carrying a binary label, either +1 or −1. We have a feedforward neural network with a given architecture, but the weights are unknown. Sauer's lemma provides an upper bound on the number of possible labelings that could be generated by a hypothesis class (the growth function) in terms of the VC dimension of the hypothesis class.
We are interested in hypothesis classes corresponding to neural networks with a fixed architecture but unspecified weights. While it is hard to specify the VC dimension of this class exactly, upper bounds on the VC dimension and the growth function are easily derived; see for example [1, Section 6.2]. The growth function of a feedforward, linear threshold network is upper bounded by an expression that depends only on the number of neurons in the network and the number of weights.
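As a concrete illustration (ours, not the paper's), the Sauer bound on the growth function is directly computable; the function name below is an assumption for this sketch:

```python
from math import comb

def sauer_bound(m: int, d: int) -> int:
    """Sauer's lemma: a hypothesis class of VC dimension d generates
    at most sum_{i=0}^{d} C(m, i) distinct labelings of m points."""
    return sum(comb(m, i) for i in range(d + 1))

# For m well above d the bound grows polynomially in m, roughly (e*m/d)^d,
# rather than as 2^m.
print(sauer_bound(10, 3))  # 1 + 10 + 45 + 120 = 176
```

When the sample size does not exceed the VC dimension the bound is the trivial 2^m, consistent with the possibility of shattering.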
Our goal in this paper is to generate labelings of the sample uniformly at random from the set of all possible labelings that a given feedforward architecture can provide. We obtain polynomial time (in both the number of samples and the size of the network), near-uniform sampling from arbitrary feedforward networks. In the special case of a single neuron, we also provide a random walk based algorithm for perfectly uniform sampling, with polynomial mixing time for the random walk.
Aside from the theoretical interest in generating labelings, we are also motivated by questions in property testing. Namely, we want to estimate the statistics of all labelings generated by a given architecture. As an example, we may want to find the probability that a subset of samples are all labeled the same when labels are generated at random from the given architecture. In future work, we intend to leverage these insights into better initializations of neural networks during training.
We obtain these results by developing insights on random walks between chambers of intersecting hyperplanes in high dimensions. This is a well studied area; see for example [2]. General arrangements of these hyperplanes intersect in complicated ways, as in our problem, and random walks between these chambers are nontrivial. It is common to visualize the geometry of these arrangements by means of a chamber graph; see Chapter 7 of [3] for a synopsis of such chamber graphs. Random walks over hyperplane arrangements appear in contexts quite different from ours. For example, Bidigare, Hanlon and Rockmore modeled card shuffling with such random walks in [4]. Some other applications are in [5, 6, 7, 8]. The statistics of the random walks considered in the references above are different from ours. Typically, these authors provide an explicit expression for the eigenvalues of the random walk to bound the mixing time. In our paper, we use conductance to understand the mixing properties of our random walk, as in [9] and [10]. The more general problem of uniformly sampling geometric objects is extensively studied in the Markov Chain Monte Carlo (MCMC) literature, e.g., Dyer, Frieze and Kannan's work on estimating the volume of high dimensional convex bodies.
II Setup and Notations
We consider a feedforward linear threshold neural network with a fixed number of layers. The input to the network is a real vector and there is a single binary output label. Every neuron is a linear threshold unit: a neuron with weight vector w and threshold t outputs sgn(w·x − t) on an input x, where sgn(z) = +1 if z ≥ 0 and −1 otherwise. In subsequent work, we extend our results to more general activation functions.
Let G be the graph of the feedforward neural network with a fixed architecture, and let W denote its parameters (the weights and thresholds put together). Assigning the parameters W to the architecture G yields a concrete neural network, and we consider the function expressed by this network.
The inputs are vectors, and the network assigns a binary label to each input x. For a given sample, consider the set of all labelings that can be generated on it by the architecture. This set is finite, and its size is bounded by the growth function of the architecture [1, Section 6.2] (or [11]). When the sample is small enough relative to the VC dimension, the sample is potentially shattered.
Problem: For a given architecture and data sample, how can we randomly sample from the set of achievable labelings, in time polynomial in both the sample size and the network size, such that every achievable labeling appears with probability close to uniform?
Background
A hyperplane in R^d (a hyperplane in d dimensions) is the set of all points x satisfying w·x = 0 for some fixed nonzero vector w. Let the network be a single neuron, and as before consider a sample of points.
The weights and the threshold, taken together, form a vector that defines the parameters of the single neuron. For each sample point x_i, define H_i to be the hyperplane in the parameter space consisting of all parameter vectors at which the neuron's output on x_i changes sign.
We start with a visualization from [1].
Theorem 1.
All parameter vectors that belong to the same connected component of the complement of the hyperplane arrangement label the sample in the same way. Conversely, different components produce different labelings of the sample.
We recall a few standard terms regarding the hyperplane arrangement formed by these hyperplanes.

The connected components of the complement of the arrangement are called chambers (or regions).

The chamber graph is constructed as follows: assign a vertex to every chamber. Two vertices are connected if their associated chambers share a common face.

A hyperplane arrangement is centered if the intersection of the component hyperplanes contains the origin. In our case, each hyperplane always contains the origin, so the samples generate a centered arrangement in the parameter space.

A collection of m centered hyperplanes in d dimensions is in general position if, for every k < d, every intersection of k distinct hyperplanes forms a (d − k)-dimensional linear space, and any intersection of d or more hyperplanes contains only the origin. Randomly chosen hyperplanes are in general position almost surely.
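This definition reduces to a rank condition on the normal vectors, which can be checked directly. The sketch below (ours, with illustrative names and brute-force subset enumeration, so only suitable for small m) tests every subset of normals up to size d:

```python
import numpy as np
from itertools import combinations

def in_general_position(normals: np.ndarray) -> bool:
    """Check whether centered hyperplanes with the given normal vectors
    (rows of `normals`, shape m x d) are in general position: every
    k <= d of the normals must be linearly independent, so that k < d
    hyperplanes meet in a (d - k)-dimensional subspace and any d of
    them meet only at the origin."""
    m, d = normals.shape
    for k in range(2, min(m, d) + 1):
        for idx in combinations(range(m), k):
            if np.linalg.matrix_rank(normals[list(idx)]) < k:
                return False
    return True
```

For instance, the three coordinate hyperplanes in R^3 are in general position, while a repeated hyperplane is not.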
Pseudo-polynomial optimal training algorithm
A theoretically useful framework was introduced in Theorem 4.1 of [12] for ReLU networks, where the network size is treated as a constant and one looks at the dependency purely on the sample size (thereby treating the complexity as a polynomial in the sample size). We note that our near-uniform polynomial time sampling procedure implies a probabilistic, pseudo-polynomial training algorithm that attains the global minimum for any feedforward linear threshold neural network. This implication is immediate from the coupon collector problem: given any confidence level, generating a number of samples modestly larger than the number of achievable labelings guarantees that we have seen every possible labeling that can be produced.
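The coupon collector calculation behind this remark can be made explicit. The sketch below (ours; the function name is illustrative) computes the standard union-bound sufficient number of draws, n(ln n + ln(1/δ)), for seeing all n labelings with probability at least 1 − δ under uniform sampling:

```python
from math import ceil, log

def coupon_collector_draws(n_labelings: int, delta: float) -> int:
    """Number of uniform draws after which all n_labelings distinct
    labelings have been seen with probability >= 1 - delta.
    By a union bound, n * (ln n + ln(1/delta)) draws suffice."""
    n = n_labelings
    return ceil(n * (log(n) + log(1.0 / delta)))
```

Since the number of achievable labelings is at most polynomial in the sample size for a fixed architecture, the number of draws, and hence the training time, is pseudo-polynomial.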
III Properties of hyperplane arrangements
We summarize a few useful properties of hyperplane arrangements that we will use in our arguments in the paper.
Proposition 1 ([1, Theorem 3.1]).
The number of chambers in a centered arrangement of m hyperplanes in d dimensions in general position is 2·( C(m−1, 0) + C(m−1, 1) + … + C(m−1, d−1) ), where C(·,·) denotes the binomial coefficient.
In fact, sampling among all labelings of a sample of size m, even when the network consists of a single neuron, in time polynomial in both m and the dimension of the data points, is nontrivial. By Theorem 1, the number of chambers is the number of labelings of a size-m sample, which from the above Proposition is roughly m^(d−1). Clearly, trivial enumeration of labelings is out of the question. As we will see later in Section IV-A, this is not the only difficulty, even for a single neuron.
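The chamber count in Proposition 1 is simple to evaluate; the small sketch below (ours) implements it and illustrates the scale of the enumeration problem:

```python
from math import comb

def num_chambers(m: int, d: int) -> int:
    """Number of chambers cut out by m centered hyperplanes in general
    position in d dimensions (Proposition 1):
    2 * sum_{i=0}^{d-1} C(m-1, i)."""
    return 2 * sum(comb(m - 1, i) for i in range(d))

# Three lines through the origin in the plane cut it into 6 sectors.
print(num_chambers(3, 2))  # 6
```

When m ≤ d the formula gives 2^m, i.e. the sample is shattered; for m much larger than d it grows like m^(d−1).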
Proposition 2.
Let H_1, …, H_m form a centered hyperplane arrangement in d dimensions, and let v_i be a vector normal to the hyperplane H_i. If v_1, …, v_m have rank d, then any chamber in the hyperplane arrangement has at least d faces.
Proof.
Suppose the proposition is false. Then there exists a chamber with exactly k < d faces. Without loss of generality, let v_1, …, v_k be the normal vectors of the k faces of this chamber, oriented so that every point x within the chamber satisfies v_i·x > 0 for all i ≤ k.
Since the normal vectors of the arrangement have rank d > k, we can choose a normal vector v_j of the arrangement that is linearly independent of v_1, …, v_k.
We now show that the hyperplane determined by v_j is also a face of the chamber, by proving that there is a point x′ in the closure of the chamber satisfying
(1) v_j·x′ = 0 and v_i·x′ > 0 for all i ≤ k.
Since v_j is linearly independent of v_1, …, v_k, we can choose a vector y such that v_i·y = 0 for all i ≤ k but v_j·y ≠ 0. Now let x be any point in the chamber and set x′ = x − (v_j·x / v_j·y)·y. It is easy to verify that x′ satisfies (1). This contradicts the assumption that the chamber has exactly k faces, where k < d. ∎
Proposition 3.
The chamber graph of any centered arrangement of m hyperplanes in d dimensions in general position satisfies (i) the degree of any vertex is at least d and at most m, and (ii) any pair of vertices has graph distance at most m.
Proof.
(i) follows from Proposition 2, and (ii) from [3, Lemma 7.15]. ∎
IV Sampling labelings
Given the sample, consider the set of all possible labelings generated on it by the network. We would like to sample from this set uniformly.
In Section IV-A, we let the network be a single neuron, and even this turns out to be nontrivial. Inspired by the inductive approach for computing the number of chambers of a hyperplane arrangement [2, Chapter 2], we derive Algorithm RS (for Recursive Sampling) in Section IV-A that generates a labeling almost uniformly.
In Sections IV-B and IV-C we expand in two directions. In Section IV-B, we provide means to perfectly sample from all labelings of a single neuron using a random walk on the chamber graph with a perfectly uniform stationary distribution. This allows us to sample labelings perfectly uniformly. The mixing time of this random walk is as yet unproven, but we provide partial evidence (empirical, as well as proofs for small dimensions) that this random walk is fast mixing, with mixing time at most linear in the number of dimensions and at most quadratic in the number of samples.
In Section IV-C, we build on our RS approach to sample from arbitrary feedforward networks in time (truly) polynomial in the sample size, the network size, and the input dimension, showing that even for arbitrary networks we get near-uniform sampling of the possible labelings that could be produced by the network.
IV-A The recursive approach
From Theorem 1, to sample a labeling uniformly we only need to sample the weights uniformly over the connected components of the complement of the hyperplane arrangement.
However, even for a single neuron, this is not trivial. As already noted from Proposition 1, the basic combinatorial difficulty comes from the fact that there are roughly m^(d−1) labelings for almost all samples of size m in d dimensions; therefore the number of chambers is exponential in the dimension. Clearly one cannot simply enumerate all the possible components.
But a bigger difficulty comes from the fact that the arrangement of the hyperplanes can be very heterogeneous. The volume of some of the chambers can be arbitrarily small, and such chambers may therefore be difficult to find. We settle this problem by using a recursive sampling approach that is inspired by the inductive approach for computing the number of chambers of a hyperplane arrangement.
Our recursive algorithm RS takes as its input m unit vectors, interpreted as normal vectors of m distinct centered hyperplanes in d dimensions. For simplicity, the reader can assume that these hyperplanes are in general position, but they do not have to be. To sample a labeling, we therefore simply call RS on the normal vectors determined by the sample points.
The call to RS works recursively on the dimension d and the number of vectors m, by calling RS with a new set of at most m − 1 vectors in d − 1 dimensions. The base case is when RS is called with vectors in one dimension or with a single vector. When RS is called with vectors in one dimension, the problem is trivial, since the only centered hyperplane arrangement in one dimension is the origin. When RS is called with a single vector (no matter its dimension), the problem is also trivial, since one hyperplane has only two chambers.
To generate the vectors for the recursive call, we choose a hyperplane at random, say H, and compute the intersection of H with each of the remaining hyperplanes. These intersections are at most m − 1 centered hyperplanes within the (d − 1)-dimensional space H, and we pass their unit normal vectors (written in an orthonormal basis of H) to the recursive call.
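The recursion can be sketched in code. The following is a hedged sketch (ours, not the paper's pseudocode): the SVD-based basis of the chosen hyperplane, the adaptive nudge size, and all tolerances are illustrative choices.

```python
import numpy as np

def rs_sample(normals: np.ndarray, rng) -> np.ndarray:
    """Sketch of Algorithm RS: return a point lying strictly inside some
    chamber of the centered arrangement whose unit normal vectors are
    the rows of `normals` (shape m x d)."""
    m, d = normals.shape
    if d == 1:
        # Base case: the only centered hyperplane in 1-d is the origin.
        return np.array([rng.choice([-1.0, 1.0])])
    if m == 1:
        # Base case: one hyperplane has exactly two chambers.
        return rng.choice([-1.0, 1.0]) * normals[0]
    # Pick a hyperplane H_j and a side of it uniformly at random.
    j = int(rng.integers(m))
    side = rng.choice([-1.0, 1.0])
    v = normals[j]
    # Orthonormal basis of H_j (columns of B) from the SVD null space.
    _, _, Vt = np.linalg.svd(v[None, :], full_matrices=True)
    B = Vt[1:].T
    # Intersections of the other hyperplanes with H_j become centered
    # hyperplanes inside H_j; collect unit normals in B-coordinates.
    projected = []
    for i in range(m):
        if i == j:
            continue
        w = B.T @ normals[i]
        nw = np.linalg.norm(w)
        if nw > 1e-12:          # drop hyperplanes parallel to H_j
            projected.append(w / nw)
    if not projected:
        return side * v
    # Recurse in d-1 dimensions and lift the point back onto H_j ...
    y = B @ rs_sample(np.array(projected), rng)
    # ... then nudge it off H_j to the chosen side; the nudge is small
    # enough to preserve the sign against every other hyperplane.
    gaps = [abs(normals[i] @ y) for i in range(m)
            if i != j and abs(normals[i] @ y) > 1e-12]
    eps = 0.5 * min(gaps) if gaps else 1.0
    return y + side * eps * v
```

The returned point determines a labeling via its sign pattern against the normals; repeated calls yield the (near-uniform) samples analyzed in Theorem 2.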
Theorem 2.
Let the input to RS be m unit vectors whose rank is d, and let the chambers be the nonempty chambers induced by the centered hyperplanes orthogonal to these vectors. Algorithm RS runs in time polynomial in m and d, and samples any chamber of the induced hyperplane arrangement with probability bounded below, within a polynomial factor of uniform.
Proof.
(Outline only.) The algorithm runs at most d recursive iterations. In each iteration, we need to compute a basis of the null space (Step 3), and, in Step 7, the projection of each input vector onto the hyperplane chosen in Step 2. This yields the total complexity.
To see the probability lower bound, define the probability that a given chamber is returned at the current level of the recursion.
We now claim a recursive lower bound on this probability.
This is because any chamber has at least d faces by Proposition 2. For any chamber, we therefore have probability at least d/(2m) of choosing both a hyperplane that forms a face of the chamber and the direction of the hyperplane that faces the chamber. Conditioned on this choice of hyperplane and direction, we need to obtain the probability that the recursive call in Step 6 returns a point in the chosen face of the chamber.
Observe that this face lies in a (d − 1)-dimensional linear space. In Step 6, note that the rank of the projected vectors is exactly one less than before, though their number can be smaller than m − 1. The theorem follows by solving the recursive inequality, using standard approximations of binomial coefficients, and by noting that in the base case of a single hyperplane there are two chambers, each returned with probability one half. ∎
Note that the above probability is only a polynomial factor off the chamber-count bound in Proposition 1. Note also that if the input vectors have rank less than d, the above approach still works: we can effectively project the inputs down to their span by choosing a basis that contains vectors orthogonal to the span of the input vectors.
IV-B A random walk approach
To mitigate the fact that the recursive approach above only yields approximately uniform sampling, we introduce a random walk based algorithm that samples arbitrarily close to uniform. Specifically, we run Algorithm NRW on a lazy chamber graph, both outlined below. One component of Algorithm NRW is Algorithm Chamber, which determines the chamber to which an input point belongs.
Theorem 3.
Algorithm Chamber runs in time polynomial in both the number of hyperplanes and the dimension.
Proof.
The theorem follows since linear programming can be solved in polynomial time [13]. ∎
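The paper specifies Algorithm Chamber via linear programming; as a minimal stand-in (ours, not the paper's pseudocode), the sign pattern of a point against the normals already identifies its chamber:

```python
import numpy as np

def chamber_of(normals: np.ndarray, x: np.ndarray, tol: float = 1e-12):
    """Identify the chamber containing x by its sign vector: the i-th
    entry is the side of the i-th hyperplane (unit normal normals[i])
    on which x lies.  A zero entry means x lies on that hyperplane,
    i.e. on a boundary between chambers.  (The paper's Algorithm
    Chamber additionally uses linear programming, e.g. to certify
    membership; this sketch only computes the sign pattern.)"""
    s = normals @ x
    return tuple(0 if abs(v) <= tol else (1 if v > 0 else -1) for v in s)
```

By Theorem 1, this sign vector is exactly the labeling associated with the chamber, so the random walk can report labelings directly from it.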
Analysis
We first analyze the random walk defined by Algorithm NRW over the simple chamber graph, assuming the hyperplanes are in general position. With this assumption, any vertex in the chamber graph has degree at least d and at most m from Proposition 3. Furthermore, from Proposition 3, the graph is connected and the distance between any two vertices is at most m.
Since the random walk is a reversible Markov chain, the stationary distribution of the random walk is proportional to the degrees of the vertices [9, Chapter 1.6]. From the degree bounds in Proposition 3, the stationary probabilities of any two vertices therefore differ by a factor of at most m/d.
The more fundamental question is the mixing time of the random walk, that is, how quickly the walk generates stationary samples. While there are several approaches to analyzing the mixing time, we focus on Cheeger's inequality [9, Theorem 13.14], which bounds the spectral gap of the random walk's transition matrix using the conductance of the graph. Recall that the conductance of a graph is
Φ = min over S ⊆ V with vol(S) ≤ vol(V)/2 of |∂S| / vol(S),
where V is the vertex set, |∂S| is the size of the cut between S and its complement, and vol(S) is the sum of the degrees of the vertices in S. The following theorem gives a lower bound on the conductance of the chamber graph when the dimension is 2.
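For small graphs the conductance can be computed by brute force, which is useful for sanity-checking the bounds below. The sketch (ours; exponential in the number of vertices, so only for toy examples) follows the definition directly:

```python
from itertools import combinations

def conductance(adj):
    """Brute-force conductance of a small undirected graph given as an
    adjacency dict {v: set of neighbours}: minimise cut(S)/vol(S) over
    all S with vol(S) <= vol(V)/2."""
    verts = list(adj)
    deg = {v: len(adj[v]) for v in verts}
    total = sum(deg.values())
    best = float("inf")
    for r in range(1, len(verts)):
        for S in combinations(verts, r):
            S = set(S)
            vol = sum(deg[v] for v in S)
            if vol == 0 or vol > total / 2:
                continue
            cut = sum(1 for v in S for u in adj[v] if u not in S)
            best = min(best, cut / vol)
    return best

# 4-cycle: the best cut takes two adjacent vertices (cut 2, volume 4).
cycle4 = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(conductance(cycle4))  # 0.5
```

By Cheeger's inequality, a conductance lower bound of Φ translates into a mixing time upper bound of order 1/Φ² (up to a logarithmic factor in the stationary probabilities).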
Theorem 4.
The chamber graph of a 2-dimensional hyperplane arrangement of size m in general position has conductance lower bounded by the inverse of a polynomial in m.
Proof.
For any set S of vertices in the chamber graph with volume no greater than half the total volume, we will show that the conductance of S is lower bounded as claimed.
Among all such sets achieving the minimum conductance, let S be one with the smallest volume.
We first claim that S must be connected. If not, we can write S as the union of maximally connected components S_1, …, S_r (in particular, note that there are no edges between distinct S_i). The cut and the volume of S then split across the components, so the component with the smallest ratio of cut to volume has conductance no larger than that of S,
implying a set with no larger conductance than S that is smaller in size than S, a contradiction.
Let B be the boundary surface of the chambers corresponding to the vertices in S. Since S is connected, B must consist of piecewise line segments.
We now claim that B partitions the chamber graph into two connected components. Since S is connected, we just have to show that the complement of S is also connected.
Suppose not, and write the complement of S as the union of maximally connected components T_1, …, T_k with k ≥ 2. Summing cuts and volumes over these components, and
since S has volume at most half the total while its complement has at least half, we have
that some component T_j has ratio of cut to volume no larger than the conductance of S.
If T_j has volume at most half the total, then again we have a contradiction because of the following. If T_j is smaller than S in volume, we are done. If not, it means that every component of the complement has conductance equal to that of S. But if there are more than two components in the complement, then S has a larger cut than each of the components, and therefore must have a larger volume as well, contradicting the assumption on S.
If T_j has volume more than half the total, then consider the set S′, the complement of T_j. Note that S′ contains S.
Now the cut of S′ is no larger than the cut of S. This follows since there is no boundary between T_j and any of the other T_i, and the only boundary S′ has is with T_j. Furthermore, the volume of S′ is at least that of S, implying that S′ has conductance no larger than S with strictly larger volume, again a contradiction.
Now, we know that the boundary B between the chambers in S and the rest of the hyperplane arrangement is exactly a piecewise line segment that separates the plane into two connected components. There are only 3 possibilities, as shown in Figure 1. We now observe that the cut of S is exactly the number of one-dimensional faces of the arrangement that intersect B. Since there are at most m lines in the arrangement, there exists a line that intersects B (or the segment on it) in at least a 1/m fraction of these faces, see Figure 1. The number of faces in B is no less than the number of faces on that line, because any line that intersects it within B must also intersect B, and at most two lines can intersect at the same point on B by our general position assumption. The theorem now follows.
∎
For the general-dimension case, we have the following conjecture. See the Appendix for justification and partial proofs.
Conjecture 1.
The conductance of the chamber graph of any d-dimensional general position hyperplane arrangement of size m is lower bounded by the inverse of a polynomial in m and d.
Remark 1.
Note that the requirement of general position for the hyperplanes is necessary for the fast mixing given by the conjecture above. Otherwise, it is easy to construct a hyperplane arrangement whose mixing time is lower bounded by a much larger quantity. As shown in Figure 2, the cut made by the gray shaded top plane has only one boundary chamber, but the total number of chambers below the plane is large (already polynomial in the number of hyperplanes in two dimensions, and larger still in d dimensions), so the cut stays small while the volume grows.
Lazy Chamber graph
Algorithm NRW on the regular chamber graph will not give exactly uniform sampling, but is off by a factor of at most m/d as mentioned above. This is easily fixed by adding dummy vertices and dummy edges to each vertex in the chamber graph, raising the degree of every vertex in the original chamber graph to a common value. Call such a graph the lazy chamber graph.
We call the vertices of the original chamber graph chamber vertices and the added dummy vertices augmentation vertices. The stationary probability of the new random walk, restricted to the chamber vertices, is exactly uniform. If Algorithm NRW on the chamber graph is fast mixing, we can show that Algorithm NRW on the lazy chamber graph is also fast mixing:
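The augmentation step is mechanical; the sketch below (ours; the target degree is left as a parameter, since any common value at least the maximum original degree works for the construction) builds the lazy graph from an adjacency dict with integer vertex ids:

```python
def lazify(adj, target_degree):
    """Build a lazy chamber graph: attach enough dummy (augmentation)
    vertices to each chamber vertex so that every chamber vertex ends
    up with the same degree.  Each augmentation vertex hangs off a
    single chamber vertex."""
    lazy = {v: set(adj[v]) for v in adj}
    next_id = max(adj) + 1
    augmentation = set()
    for v in list(adj):
        while len(lazy[v]) < target_degree:
            lazy[v].add(next_id)
            lazy[next_id] = {v}   # dummy vertex of degree 1
            augmentation.add(next_id)
            next_id += 1
    return lazy, augmentation
```

Since all chamber vertices now have equal degree, the degree-proportional stationary distribution of the walk is uniform when restricted to chamber vertices, which is exactly the property used above.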
Lemma 1.
If the conductance of the chamber graph is Φ, the lazy chamber graph has conductance smaller than Φ by at most a polynomial factor.
Proof.
We only need to lower bound the conductance of an arbitrary subset S of vertices in the lazy chamber graph. We observe that if an augmentation vertex is in S, we may assume the chamber vertex attached to it is also included in S, since otherwise removing the augmentation vertex from S only lowers the conductance. Denote by C the set of all chamber vertices in S.
The vertices in C can be partitioned into two classes: C_1, the set of all chamber vertices that have all their attached augmentation vertices in S, and C_2, the complement of C_1 in C. Similarly, let S_1 and S_2 be these sets together with their attached augmentation vertices in S. The cut of S is at least the size of C_2,
since all vertices in C_2 are boundary vertices. Note that since any chamber vertex attaches at least one augmentation vertex in order to reach the common degree, the volume of S is at most a polynomial factor larger than the volume of C in the original chamber graph. Now, comparing by the definition of conductance: if C_2 is large relative to C_1, the cut of S is already a large fraction of its volume; otherwise the cut of C in the original chamber graph carries over to S, and its volume increased by at most a polynomial factor. In either case the conductance of S is within a polynomial factor of Φ.
Since the volume condition is preserved, the lemma follows. ∎
Combining all the results, we have
Theorem 5.
Assume Conjecture 1. For any given accuracy parameter ε and a sample in general position, Algorithm NRW run on the lazy chamber graph generated by the sample can generate labelings with distribution ε-close (in variational distance) to uniform, and runs in time polynomial in the sample size, the dimension, and log(1/ε).
IV-C Sampling for arbitrary neural networks
We now consider sampling for arbitrary neural networks. Given the samples, we choose the weights of the network layer by layer. At each layer we use the previously sampled weights of the earlier layers to generate the outputs of those layers on each input, a binary vector per sample point. For each neuron in the current layer, we independently sample its weights using Algorithm RS with these binary vectors as inputs.
To illustrate the idea more concretely, consider a neural network with one hidden layer. For each neuron in the hidden layer, we use Algorithm RS on the input samples to generate its weights independently. We then fix the weights sampled for the hidden layer and view the function expressed by the hidden layer as a map from the input space to binary vectors whose length is the number of neurons in the hidden layer. We now take the images of the sample points under this map as the new input sample for the output layer, and again use Algorithm RS to sample the weights of the output neuron.
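A hedged sketch of this layer-by-layer procedure is below. Here `sample_neuron_weights` is a placeholder that merely draws Gaussian weights where the actual procedure would invoke Algorithm RS on the layer's inputs; thresholds are omitted for brevity (they can be folded in by appending a constant coordinate to each input):

```python
import numpy as np

def sample_neuron_weights(inputs: np.ndarray, rng) -> np.ndarray:
    """Placeholder for Algorithm RS: draws Gaussian weights.  The real
    procedure would sample near-uniformly over the chambers of the
    arrangement induced by `inputs`."""
    return rng.normal(size=inputs.shape[1])

def sample_network_labels(samples: np.ndarray, hidden_sizes, rng):
    """Layer-by-layer sampling: fix the weights of each layer (one
    independent draw per neuron from that layer's inputs), push the
    sample through the +/-1 threshold activation, and repeat.  A final
    single output neuron produces one +/-1 label per sample point."""
    outputs = samples
    for width in list(hidden_sizes) + [1]:
        weights = np.stack([sample_neuron_weights(outputs, rng)
                            for _ in range(width)])        # width x dim
        outputs = np.where(outputs @ weights.T >= 0, 1.0, -1.0)
    return outputs[:, 0]
```

The point of the construction is that each layer's draw depends only on the (already fixed) binary outputs of the previous layers, which is what makes the per-neuron probabilities multiply in the proof of Theorem 6.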
Theorem 6.
For a neural network with a fixed architecture, a given number of neurons and parameters, the above sampling procedure runs in polynomial time. Given a sample, each labeling produced by this architecture appears with probability bounded below, within a polynomial factor of uniform.
Proof.
We use induction on the layers. For any given labeling produced by weights w, consider the probability that the outputs of the first several layers are consistent with the outputs under the weights w. This probability satisfies a product bound,
where each factor corresponds to one neuron of the layer, and the product form comes from Theorem 2 and independence. Note that the rank of the outputs may be reduced after passing through the previous layers; however, this only makes the probability larger, by Theorem 2. The theorem now follows by the same argument as in [1, Theorem 6.1] for bounding the VC dimension of linear threshold neural networks. ∎
V Simulations
We run our recursive algorithm on randomly chosen samples of varying dimension and size. For each pair of dimension and sample size, we run the sampling procedure many times to estimate the empirical distribution over labelings. We then compute the ratio of the maximum to the minimum empirical probability among the labelings that appeared, rounding the ratio to an integer. One can see from Figure 2 that for each sample size there is a peak in the probability ratio as the dimension of the sample increases. For a given dimension, when the sample size is small the ratio increases with the sample size, while for large sample sizes the ratio decreases as the sample size grows. We also ran our sampling procedure on a sample from the MNIST data set; the run time was on the order of minutes.
References
 [1] M. Anthony and P. L. Bartlett, Neural network learning: Theoretical foundations. Cambridge University Press, 2009.
 [2] R. P. Stanley et al., “An introduction to hyperplane arrangements,” Geometric combinatorics, vol. 13, pp. 389–496, 2004.
 [3] S. Ovchinnikov, Graphs and cubes. Springer Science & Business Media, 2011.
 [4] P. Bidigare, P. Hanlon, D. Rockmore et al., “A combinatorial description of the spectrum for the Tsetlin library and its generalization to hyperplane arrangements,” Duke Mathematical Journal, vol. 99, no. 1, pp. 135–174, 1999.
 [5] K. S. Brown and P. Diaconis, “Random walks and hyperplane arrangements,” Annals of Probability, pp. 1813–1854, 1998.
 [6] C. A. Athanasiadis and P. Diaconis, “Functions of random walks on hyperplane arrangements,” Advances in Applied Mathematics, vol. 45, no. 3, pp. 410–437, 2010.

 [7] J. Pike, “Eigenfunctions for random walks on hyperplane arrangements,” Ph.D. dissertation, University of Southern California, 2013.
 [8] A. Björner, “Random walks, arrangements, cell complexes, greedoids, and self-organizing libraries,” in Building bridges. Springer, 2008, pp. 165–203.
 [9] D. A. Levin and Y. Peres, Markov chains and mixing times. American Mathematical Soc., 2017, vol. 107.
 [10] N. Berestycki, “Mixing times of Markov chains: Techniques and examples,” ALEA, Latin American Journal of Probability and Mathematical Statistics, 2016.

 [11] S. Shalev-Shwartz and S. Ben-David, Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
 [12] R. Arora, A. Basu, P. Mianjy, and A. Mukherjee, “Understanding deep neural networks with rectified linear units,” arXiv preprint arXiv:1611.01491, 2016.
 [13] N. Megiddo et al., On the complexity of linear programming. IBM Thomas J. Watson Research Division, 1986.
 [14] V. Koltun, “The arrangement method for linear programming,” Computer Science Department, Stanford University, 2005.
Appendix A On the conductance conjecture
In order to provide a convincing reason why we believe Conjecture 1, we provide a proof of the following partial result: suppose we cut a general position hyperplane arrangement with another hyperplane. The conductance of such a cut is lower bounded by an inverse polynomial (no matter the number of dimensions).
Proposition 4.
Let H_1, …, H_m be a general position hyperplane arrangement in dimension d, and let P be another hyperplane. Then the number of chambers of the induced arrangement on P (viewed as a hyperplane arrangement in d − 1 dimensions) is bounded below by a comparable chamber count.
Proof.
A set of hyperplanes in a given dimension is said to be in almost full rank position if the normals of any k of them have rank at least k − 1. Note that the hyperplanes cut out on P are in almost full rank position, since the projection onto P can only reduce the rank by one. Note that two of the cut hyperplanes may be coincident, but we treat them as different hyperplanes.
Denote the cut hyperplanes by G_1, …, G_m; we show that the intersections of the G_i with any fixed G_j are also in almost full rank position. We only need to show that the normals of any k of these intersections have rank at least k − 1.
Suppose not; without loss of generality, assume some k of the intersections have normals of rank at most k − 2.
We will show that the normals of the corresponding hyperplanes among the G_i then have rank at most k − 1, obtaining a contradiction with the almost full rank position. To see this, extend a basis of the span of the intersection normals by a normal vector to G_j. The normals of the corresponding G_i all lie in the resulting space of dimension at most k − 1, meaning that they have rank at most k − 1.
The proposition now follows by induction. ∎
A random walk on vertices:
Consider a hyperplane arrangement in general position. We define the vertices of the arrangement to be the zero-dimensional intersections of the arrangement. By the general position assumption, the number of such vertices is determined exactly by the arrangement size and the dimension. Two vertices are said to be connected if they are joined by a one-dimensional face of the arrangement (an intersection of the appropriate number of hyperplanes). Each vertex is adjacent to a bounded number of edges. The graph defined by these vertices and edges is known as the arrangement graph and was studied in [14], where the author obtained the following conductance bound using a coupling argument:
Theorem 7 ([14, Theorem 4.3]).
The conductance of the arrangement graph is lower bounded by an inverse polynomial in the arrangement size and the dimension.
Note that Theorem 7 would imply Conjecture 1 if we also knew that the number of vertices in any cut of the chamber graph is not much greater than the number of faces. Proposition 4 shows that this is satisfied when the cut is a hyperplane, since in that case the number of faces in the cut is at least comparable to the number of vertices.