Almost Uniform Sampling From Neural Networks

by Changlong Wu, et al.
University of Hawaii

Given a length n sample from R^d and a neural network with a fixed architecture with W weights, k neurons, linear threshold activation functions, and binary outputs on each neuron, we study the problem of uniformly sampling from all possible labelings on the sample corresponding to different choices of weights. We provide an algorithm that runs in time polynomial both in n and W such that any labeling appears with probability at least (W/2ekn)^W for W<n. For a single neuron, we also provide a random walk based algorithm that samples exactly uniformly.








I Introduction

Consider a sample S = (x_1, …, x_n), where x_i ∈ R^d. We have a feedforward neural network with a given architecture (but the weights are unknown). Each sample point has a binary label, either +1 or -1. Sauer’s lemma provides an upper bound on the number of possible labelings that could be generated by a hypothesis class (the growth function) in terms of the VC dimension of the hypothesis class.

We are interested in hypothesis classes corresponding to neural networks with a fixed architecture but unspecified weights. While it is hard to exactly specify the VC dimension of this class, upper bounds on the VC dimension and the growth function are easily derived, see for example [1, Section 6.2]. The growth function for a feedforward, linear threshold network is upper bounded by (enk/W)^W, where k is the number of neurons in the network, and W the number of weights.
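As a quick sanity check on the scale of this bound, the following sketch (assuming the (enk/W)^W form quoted above; the helper name is ours) compares it to the trivial 2^n count of all labelings:

```python
import math

def growth_bound(n, k, W):
    """Upper bound (enk/W)^W on the number of labelings a linear threshold
    network with k neurons and W weights can produce on n points (W < n)."""
    return (math.e * n * k / W) ** W

# A single neuron on d = 3 inputs: k = 1, W = 4 weights (incl. threshold).
n, k, W = 100, 1, 4
print(growth_bound(n, k, W) < 2 ** n)  # True: far fewer than all 2^n labelings
```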

Our goal in this paper is to generate labels of the sample uniformly at random from the set of all possible labelings that a given feedforward architecture can provide. We obtain polynomial-time (in both the number of samples and the size of the network) near-uniform sampling from arbitrary feedforward networks. In the special case of a single neuron, we also provide a random walk based algorithm for perfectly uniform sampling, with polynomial mixing time for the random walk.

Aside from the theoretical interest in generating labelings, we are also motivated by questions in property testing. Namely, we want to estimate the statistics of all labelings generated by a given architecture. As an example, we may want to find the probability that a subset of samples are all labeled the same if all labels were generated at random from the given architecture. In future work, we intend to leverage these insights into better initializations of neural networks while training.

We obtain these results by developing insights on random walks between chambers of intersecting hyperplanes in high dimensions. This is a well studied area, see for example [2]. General arrangements of these hyperplanes intersect in complicated ways, as in our problem, and random walks between these chambers are nontrivial. It is common to visualize the geometry of these arrangements by means of a chamber graph; see Chapter 7 of [3] for a synopsis of such chamber graphs. Random walks over hyperplane arrangements appear in contexts quite different from ours. For example, Bidigare, Hanlon and Rockmore modeled card shuffling with such random walks in [4]. Some other applications are in [5, 6, 7, 8].

The statistics of the random walks considered in the references above are different from ours. Typically, these authors provide an explicit expression for the eigenvalues of the random walk to bound the mixing time. In our paper, we instead use conductance to understand the mixing properties of our random walk, as in [9] and [10].

The more general problem of uniformly sampling geometric objects is extensively studied in the Markov Chain Monte Carlo (MCMC) literature, e.g., Dyer, Frieze and Kannan’s work on estimating the volume of high-dimensional convex bodies.

II Setup and Notations

We consider a feed-forward linear threshold neural network with L layers. The input to the network is d-dimensional and there is a single binary output label. Namely, any neuron with parameters (w, t) (w ∈ R^m, t ∈ R) outputs σ(⟨w, x⟩ - t) on an input x ∈ R^m, where σ(z) = 1 if z ≥ 0 and σ(z) = -1 otherwise. In subsequent work, we extend our results to more general activation functions.

Let A be the graph of the feedforward neural network with a fixed architecture and W different parameters (the weights and thresholds put together). Let w ∈ R^W, and let N_w be the neural network which assigns the parameters of A to be w. For any given architecture A, let f_{A,w} be the function expressed by N_w.

The vectors x ∈ R^d are the inputs, and f_{A,w}(x) ∈ {-1, +1} are the labels assigned to x. For a length-n sample S = (x_1, …, x_n), let

L(A, S) = { (f_{A,w}(x_1), …, f_{A,w}(x_n)) : w ∈ R^W }

be the set of all labelings that can be generated on S by the architecture A. Note that the set L(A, S) is finite, and |L(A, S)| ≤ (enk/W)^W for W < n [1, Section 6.2] (or [11]).

When W ≥ n, |L(A, S)| can be as large as 2^n, i.e., S is potentially shattered.

Problem. For a given architecture A and data S, how can we randomly sample from L(A, S), in time polynomial in both n and W, such that any labeling appears with probability at least (W/2ekn)^W?


A hyperplane in R^m (or a hyperplane in m dimensions) is the set of all points u ∈ R^m satisfying ⟨u, v⟩ = 0 for some fixed vector v ∈ R^m. Let N be a single neuron with input dimension d. As before, S = (x_1, …, x_n) is a length-n sample.

Let w ∈ R^d and t ∈ R. Physically, the vector (w, t), in d + 1 dimensions, defines the parameters of the single neuron N. For each sample point x_i, define H_i to be the hyperplane in the parameter space R^{d+1}:

H_i = { (w, t) ∈ R^{d+1} : ⟨w, x_i⟩ - t = 0 }.

We start with a visualization from [1].

Theorem 1.

All parameter vectors that belong to the same connected component of R^{d+1} \ (H_1 ∪ ⋯ ∪ H_n) label S in the same way. Conversely, different components have different labelings on S.

We recall a few standard terms regarding hyperplane arrangements formed by H_1, …, H_n.

  • The connected components of R^{d+1} \ (H_1 ∪ ⋯ ∪ H_n) are called chambers (or regions).

  • The chamber graph is constructed as follows: assign a vertex to every chamber. Two vertices are connected if their associated chambers share a common face.

  • A hyperplane arrangement is centered if the intersection of the component hyperplanes contains the origin. In our case, each H_i always contains the origin, so the samples generate a centered arrangement in the parameter space.

  • A collection of centered hyperplanes in R^m is in general position if, for all j ≤ m - 1, every intersection of j distinct hyperplanes forms an (m - j)-dimensional linear space, and any intersection of m or more hyperplanes contains only the origin. Randomly chosen hyperplanes are in general position almost surely.
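The rank conditions above are easy to check numerically: j centered hyperplanes meet in an (m - j)-dimensional subspace exactly when their normals have full rank j. The sketch below (pure Python; the helper names are ours) tests general position of a centered arrangement this way:

```python
import itertools

def rank(rows, tol=1e-9):
    """Rank of a list of vectors via Gaussian elimination (pure Python)."""
    rows = [list(r) for r in rows]
    r = 0
    for c in range(len(rows[0])):
        piv = next((i for i in range(r, len(rows)) if abs(rows[i][c]) > tol), None)
        if piv is None:
            continue
        rows[r], rows[piv] = rows[piv], rows[r]
        for i in range(len(rows)):
            if i != r and abs(rows[i][c]) > tol:
                f = rows[i][c] / rows[r][c]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r

def in_general_position(normals, m):
    """Centered hyperplanes (given by their normals) in R^m are in general
    position iff every subset of j <= m normals has full rank j: then the j
    hyperplanes meet in an (m - j)-dimensional subspace, and any m of them
    meet only at the origin."""
    return all(rank(sub) == j
               for j in range(1, m + 1)
               for sub in itertools.combinations(normals, j))

print(in_general_position([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]], 3))  # True
print(in_general_position([[1, 0, 0], [0, 1, 0], [1, 1, 0]], 3))             # False
```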

Pseudo-polynomial optimal training algorithm

A theoretically useful framework was introduced in Theorem 4.1 of [12] for ReLU networks, where the network size W is treated as a constant, and we look at the dependency purely on the sample size n (thereby treating n^W as a polynomial).

We note that our near-uniform polynomial time sampling procedure implies a probabilistic, pseudo-polynomial training algorithm that attains the global minimum for any feedforward linear threshold neural network. This implication is immediate from the coupon collector problem: for any given confidence, generating a number of samples proportional to the inverse of the minimum labeling probability (times a logarithmic factor) guarantees that we have seen every possible labeling that can be produced.
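The coupon-collector step can be sketched as follows; the helper name and the union-bound form log(N/δ)/p are our choices, not the paper’s:

```python
import math
import random

def draws_to_see_all(num_labelings, min_prob, delta):
    """Union bound / coupon collector: after ceil(log(N/delta)/p) i.i.d. draws,
    each of N labelings (each appearing with prob. >= p per draw) has been
    seen at least once with probability >= 1 - delta."""
    return math.ceil(math.log(num_labelings / delta) / min_prob)

# Toy check: 8 labelings drawn uniformly (p = 1/8), failure prob. 1e-6.
rounds = draws_to_see_all(8, 1 / 8, 1e-6)
random.seed(0)
seen = {random.randrange(8) for _ in range(rounds)}
print(rounds, len(seen))
```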

III Properties of hyperplane arrangements

We summarize a few useful properties of hyperplane arrangements that we will use in our arguments in the paper.

Proposition 1 ([1, Theorem 3.1]).

The number of chambers in a centered hyperplane arrangement formed by n hyperplanes in m dimensions in general position is

C(n, m) = 2 ( C(n-1, 0) + C(n-1, 1) + ⋯ + C(n-1, m-1) ).

In fact, sampling the labels of a sample of size n, even when the network consists of a single neuron, in time polynomial in both n and the dimension d of the data points, is non-trivial. The number of chambers, by Theorem 1, is the number of labelings on a size-n sample, which from the above Proposition is roughly n^d. Clearly, trivial enumeration of the labels is out of the question. As we will see later in Section IV-A, this is not the only difficulty even for a single neuron.
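The chamber-count formula in the Proposition above can be verified by brute force in the plane. The sketch below (our construction: four lines through the origin with normals at 0°, 45°, 90°, 135°) probes directions inside every sector and recovers all 8 chambers:

```python
from math import comb, cos, sin, radians

def chambers(n, m):
    """Proposition count: 2 * sum_{i < m} C(n-1, i) chambers for n centered
    hyperplanes in general position in R^m."""
    return 2 * sum(comb(n - 1, i) for i in range(m))

# Brute force in the plane: 4 lines through the origin, whose normals point
# at 0, 45, 90 and 135 degrees, cut R^2 into 8 sectors.
normals = [(cos(radians(a)), sin(radians(a))) for a in (0, 45, 90, 135)]
patterns = set()
for k in range(16):  # probe two directions inside every 45-degree sector
    t = radians(11.25 + 22.5 * k)
    w = (cos(t), sin(t))
    patterns.add(tuple(1 if v[0] * w[0] + v[1] * w[1] > 0 else -1
                       for v in normals))
print(len(patterns), chambers(4, 2))  # 8 8
```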

Proposition 2.

Let H_1, …, H_n form a centered hyperplane arrangement in m dimensions. Let v_i be any vector normal to the hyperplane H_i. If v_1, …, v_n have rank r, then any chamber in the hyperplane arrangement has at least r faces.


Let r be the rank of v_1, …, v_n, and suppose the proposition is false. Then there exists a chamber with exactly l < r faces. Without loss of generality, let v_1, …, v_l be the normal vectors of the l different faces of this chamber, oriented such that for any point x within the chamber

⟨v_i, x⟩ > 0, for all i ≤ l.

Since the rank of v_1, …, v_n is r > l, we can choose a vector v_j (j > l) that is linearly independent of v_1, …, v_l.

We now show that the hyperplane determined by v_j also forms a face of the chamber, by proving that there is a point x' in the closure of the chamber satisfying

⟨v_i, x'⟩ > 0 for all i ≤ l, and ⟨v_j, x'⟩ = 0.  (1)

Since v_j is linearly independent of v_1, …, v_l, we can choose a vector y such that ⟨v_i, y⟩ = 0 for i ≤ l but ⟨v_j, y⟩ ≠ 0. Now let x be any point in the chamber and set x' = x + cy, where c = -⟨v_j, x⟩/⟨v_j, y⟩. It is easy to verify that x' satisfies (1). This contradicts the assumption that the chamber has exactly l faces, where l < r. ∎

Proposition 3.

The chamber graph of any arrangement of n hyperplanes in R^m in general position satisfies (i) the degree of any vertex is at least m and at most n, and (ii) any pair of vertices has graph distance at most n.


(i) follows from Proposition 2, (ii) from [3, Lemma 7.15]. ∎

IV Sampling labelings

For the sample S = (x_1, …, x_n), where x_i ∈ R^d, L(A, S) is the set of all possible labelings generated on S by the architecture A. We would like to sample from L(A, S) uniformly.

In Section IV-A, we let A be a single neuron, and even this case turns out to be non-trivial. Inspired by the inductive approach for computing the hyperplane partition number [2, Chapter 2], we derive Algorithm RS (for Recursive Sampling) in Section IV-A that generates a label from L(A, S) almost uniformly.

In Sections IV-B and IV-C we expand in two directions. In Section IV-B, we provide means to perfectly sample from all labelings of a single neuron using a random walk on the chamber graph with a perfectly uniform stationary distribution. This allows us to sample from L(A, S) perfectly uniformly. The mixing time of this random walk is as yet unproven, but we provide partial evidence (empirical as well as proofs for small dimensions) that this random walk is fast mixing, with mixing time at most linear in the number of dimensions and at most quadratic in the number of samples.

In Section IV-C, we build on our RS approach to sample from arbitrary feedforward networks in time (truly) polynomial in the sample size n, the network size W, and the input dimension d, showing that even for arbitrary networks, we get near-uniform sampling of the possible labels that could be produced by the network.

IV-A The recursive approach

From Theorem 1, to sample uniformly from L(A, S) we only need to sample the weights uniformly from the connected components of R^{d+1} \ (H_1 ∪ ⋯ ∪ H_n).

However, even for a single neuron, this is not trivial. As already noted after Proposition 1, the basic combinatorial difficulty comes from the fact that there are roughly n^d labelings for almost all samples S; the number of chambers is therefore exponential in the dimension d. Clearly one cannot simply enumerate all the possible components.

But a bigger difficulty comes from the fact that the arrangement of the hyperplanes can be very heterogeneous. The volume of some of the chambers can be arbitrarily small and therefore such chambers may be difficult to find. We settle this problem by using a recursive sampling approach that is inspired by the inductive approach for computing hyperplane partition number.

Our recursive algorithm RS(v_1, …, v_n) (see Algorithm 1) takes as its inputs n unit vectors v_1, …, v_n, all from, say, R^m. The vectors are interpreted as normal vectors of n distinct centered hyperplanes in R^m. For simplicity, the reader can assume that these hyperplanes are in general position, but they do not have to be. To sample from L(A, S), we would therefore simply call RS(v_1, …, v_n), where m = d + 1 and v_i is the unit normal of H_i.

The call RS(v_1, …, v_n) works recursively on the dimension m of the vectors and the number n of vectors, by calling RS with a new set of at most n - 1 vectors in R^{m-1}. The base case is when RS is called with vectors in 1 dimension or when n = 1. When RS is called with vectors in 1 dimension, the problem is trivial since there is only one centered hyperplane arrangement in one dimension, the origin. When RS is called with n = 1 (no matter the dimension of the single input vector), the problem is also trivial since there are only two chambers for one hyperplane.

To generate the vectors in R^{m-1}, we choose a hyperplane at random from H_1, …, H_n, say H_i, and compute the intersection of H_i with all the remaining hyperplanes. These intersections are at most n - 1 centered hyperplanes in H_i, and we let u_1, …, u_{n'} be the unit normal vectors of these hyperplanes (written in a specific orthonormal basis of H_i, indicated below).

Input: v_1, …, v_n ∈ R^m, interpreted as unit normal vectors of n (distinct) centered hyperplanes in an m-dimensional space.
Output: A point representing a chamber in the hyperplane arrangement formed by H_1, …, H_n.

Let H_j be the hyperplane in R^m orthogonal to v_j.

  • If m = 1, output -1 or 1 with equal probability. If n = 1 but m > 1, output v_1 or -v_1 with equal probability.

  • Uniformly choose an index i from {1, …, n}.

  • For hyperplane H_i, choose an arbitrary orthonormal basis B_i ∈ R^{(m-1)×m}. Note that H_i is an (m-1)-dimensional linear space in R^m, and the rows of B_i contain the orthonormal basis vectors, each being a vector in R^m.

  • Compute the intersection of H_i with H_j, for each j ≠ i.

  • Set u_j to be the unit vector in R^{m-1} normal to H_i ∩ H_j (written using the basis B_i), j ≠ i. Note u_j = B_i v_j / ‖B_i v_j‖.

  • w' ← RS(u_{j_1}, …, u_{j_{n'}}), where u_{j_1}, …, u_{j_{n'}} are the distinct vectors among {u_j : j ≠ i}. Note n' ≤ n - 1.

  • Compute the smallest distance δ of the lifted point B_i^T w' to the planes H_j with j ≠ i.

  • Let b be -1 or 1 with equal probability; output B_i^T w' + b (δ/2) v_i.

Algorithm 1 RS(v_1, …, v_n)
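A minimal Python sketch of Algorithm RS, assuming the input normals define distinct (non-parallel) centered hyperplanes; the function names, the Gram–Schmidt helper, and the numerical tolerances are ours:

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    s = math.sqrt(dot(v, v))
    return [x / s for x in v]

def hyperplane_basis(v):
    """Orthonormal basis (m - 1 rows) of the hyperplane orthogonal to unit v."""
    m = len(v)
    basis = []
    for i in range(m):
        if len(basis) == m - 1:
            break
        w = [0.0] * m
        w[i] = 1.0
        w = [a - v[i] * b for a, b in zip(w, v)]      # project e_i off v
        for b in basis:                               # Gram-Schmidt vs earlier rows
            c = dot(w, b)
            w = [a - c * bb for a, bb in zip(w, b)]
        if dot(w, w) > 1e-18:
            basis.append(unit(w))
    return basis

def canon(u):
    """Canonical form identifying u and -u (they define the same hyperplane)."""
    s = next((1.0 if x > 0 else -1.0 for x in u if abs(x) > 1e-7), 1.0)
    return tuple(round(s * x, 6) for x in u)

def RS(vectors):
    """Recursive sampling sketch: `vectors` are unit normals of distinct
    centered hyperplanes in R^m; returns a point inside some chamber."""
    m = len(vectors[0])
    if m == 1:                                        # only centered hyperplane is {0}
        return [random.choice((-1.0, 1.0))]
    if len(vectors) == 1:                             # one hyperplane: two chambers
        sgn = random.choice((-1.0, 1.0))
        return [sgn * x for x in vectors[0]]
    i = random.randrange(len(vectors))                # pick a hyperplane H_i
    vi = vectors[i]
    B = hyperplane_basis(vi)                          # rows span H_i
    seen, projected = set(), []
    for j, vj in enumerate(vectors):
        if j == i:
            continue
        u = unit([dot(b, vj) for b in B])             # v_j in the basis of H_i
        if canon(u) not in seen:                      # keep distinct directions only
            seen.add(canon(u))
            projected.append(u)
    wlow = RS(projected)                              # recurse in dimension m - 1
    w = [sum(wl * B[r][c] for r, wl in enumerate(wlow)) for c in range(m)]
    delta = min(abs(dot(vj, w)) for j, vj in enumerate(vectors) if j != i)
    sgn = random.choice((-1.0, 1.0))                  # step off H_i to either side
    return [a + sgn * (delta / 2) * b for a, b in zip(w, vi)]

random.seed(7)
normals = [[1.0, 0.0], [0.0, 1.0]]                    # the two coordinate axes of R^2
signs = set()
for _ in range(200):
    w = RS(normals)
    signs.add(tuple(1 if dot(v, w) > 0 else -1 for v in normals))
print(len(signs))  # all 4 quadrants appear
```

For the two coordinate axes the four chambers are the quadrants, and by symmetry each is sampled with probability 1/4, so 200 calls find all of them.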
Theorem 2.

Let V = {v_1, …, v_n} ⊂ R^m, where n ≥ m and the rank of V is m. Let C(V) be the set of non-empty chambers induced by the centered hyperplanes orthogonal to the vectors in V. Algorithm RS(v_1, …, v_n) runs in time O(nm^3), and any chamber in the hyperplane arrangement induced by V is sampled with probability at least

(m / 2en)^m.


(Outline only) The algorithm runs at most m recursive iterations. In each iteration, we need O(m^3) time to compute the orthonormal basis in Step 3 and O(nm^2) time to compute the projections of the input vectors onto the plane chosen in Step 2. This yields the total complexity O(nm^3).

To see the probability lower bound, define

q(n, m) = min_C P( RS outputs a point in chamber C ),

where the minimum is over all chambers C of arrangements of n distinct centered hyperplanes in R^m whose normals have rank m. We now claim that

q(n, m) ≥ (m / 2n) q(n-1, m-1).

This is because any chamber has at least m faces by Proposition 2. For any chamber C, we therefore have probability at least (m/n)(1/2) of choosing both a hyperplane that forms a face of C and the direction of the hyperplane that faces the chamber C. Conditioned on this choice of hyperplane and direction, we need to obtain the probability that the recursive call in Step 6 returns a point in the face of C.

Observe that the face of C is a full-dimensional subset of the chosen (m-1)-dimensional hyperplane. In Step 6, note that the rank of u_{j_1}, …, u_{j_{n'}} is exactly m - 1, but n' can be less than n - 1. The theorem follows by solving the recursive inequality, using standard approximations of binomial coefficients, and by noting that when m = 1 there are two chambers, thus yielding q(n, 1) = 1/2 for all n. ∎

Note that when n ≥ m, the above probability is (m/2en)^m, a factor 2^m off the hyperplane slicing bound in Proposition 1. Note also that if the input vectors in R^m have rank r < m, the above approach still works: we can effectively project down the inputs into R^r by choosing a basis for R^m that contains m - r vectors orthogonal to the span of the input vectors.

IV-B A random walk approach

To mitigate the fact that the recursive approach above only yields approximately uniform sampling, we introduce a random walk based algorithm that samples arbitrarily close to uniform. Specifically, we run Algorithm NRW on a lazy chamber graph, both outlined below. One component of Algorithm NRW is Algorithm Chamber, which determines the faces of the chamber an input point belongs to.

Input: walk length T and n hyperplanes H_1, …, H_n in R^m
Output: point w_T and chamber C_T

  • Initialize w_0 = v_1, where v_1 is a normal vector of H_1

  • Set C_0 = Chamber(w_0). C_0 will be the chamber in the arrangement that contains w_0.

  • For t = 1 through T, do

    • Uniformly choose a face of C_{t-1}

    • Set C_t to be the chamber adjacent to C_{t-1} and across the face chosen in step (a.)

    • Set w_t to any point in the chamber C_t

  • Output w_T and C_T

Algorithm 2 NRW

Input: point w and n hyperplanes H_1, …, H_n with normals v_1, …, v_n
Output: The faces of the chamber containing w.

  • Compute s_i = sign(⟨v_i, w⟩), i = 1, …, n.

  • For i = 1, …, n do:

    • Define a linear program with constraints

      ⟨v_i, x⟩ = 0, and s_j ⟨v_j, x⟩ ≥ 1 for j ≠ i.

    • If the linear program in step (a.) has a feasible solution, add H_i to the collection of faces.

Algorithm 3 Chamber
Theorem 3.

Algorithm Chamber runs in time polynomial in both n and m.


The theorem follows since linear programming can be solved in polynomial time [13]. ∎


We first analyze the random walk defined by Algorithm NRW over the simple chamber graph, assuming the hyperplanes are in general position. With this assumption any vertex in the chamber graph has degree at least m and at most n from Proposition 3. Furthermore, from Proposition 3 the graph is connected and the distance between any two vertices is at most n.

Since the random walk is a reversible Markov chain, the stationary distribution of the random walk is proportional to the degrees of the vertices [9, Chapter 1.6]. From our bounds on the degrees in Proposition 3, we therefore have, for any two vertices u and v,

π(u) / π(v) ≤ n/m.

The more fundamental question is the mixing time of the random walk, or how quickly the walk generates stationary samples. While there are several approaches to analyzing the mixing time, we focus on Cheeger’s inequality [9, Theorem 13.14], which bounds the spectral gap of the random walk’s transition matrix using the conductance of the graph. Recall that the conductance of a graph is

Φ = min_{S : vol(S) ≤ vol(V)/2} |∂S| / vol(S),

where V is the vertex set, |∂S| is the size of the cut between S and V \ S, and vol(S) is the sum of the degrees of the vertices in S. The following theorem gives a lower bound on the conductance of the chamber graph when the dimension is 2.
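For intuition, conductance can be computed exactly on small graphs by enumerating all subsets; a minimal sketch (ours, exponential time, toy graphs only):

```python
from itertools import combinations

def conductance(adj):
    """Conductance of an undirected graph given as {vertex: set(neighbors)}:
    min over S with vol(S) <= vol(V)/2 of |cut(S, V\\S)| / vol(S)."""
    verts = list(adj)
    vol_all = sum(len(adj[v]) for v in verts)
    best = float("inf")
    for r in range(1, len(verts)):
        for S in combinations(verts, r):
            S = set(S)
            vol = sum(len(adj[v]) for v in S)
            if vol == 0 or vol > vol_all / 2:
                continue
            cut = sum(1 for v in S for u in adj[v] if u not in S)
            best = min(best, cut / vol)
    return best

# 4-cycle: the best set is two adjacent vertices (cut 2, volume 4).
cycle4 = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(conductance(cycle4))  # 0.5
```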

Theorem 4.

The chamber graph of a 2-dimensional hyperplane arrangement of size n in general position has conductance lower bounded by Ω(1/n).


For any set A of vertices in the chamber graph with volume no greater than half the total, we will show that the conductance of A is lower bounded as follows:

Φ(A) = |∂A| / vol(A) = Ω(1/n).

Let A be a set attaining the minimum in the definition of conductance with the smallest volume, subject to vol(A) ≤ vol(V)/2.

We first claim that A must be connected. If not, we can write A as the union of maximally connected components, A = A_1 ∪ ⋯ ∪ A_r (in particular, note that there are no edges between distinct A_i). Then, since |∂A| = Σ_i |∂A_i| and vol(A) = Σ_i vol(A_i), there is some i with

|∂A_i| / vol(A_i) ≤ |∂A| / vol(A),

implying that A_i has conductance no larger than A and is smaller in size than A, a contradiction.

Let Γ be the boundary surface of the union of the chambers corresponding to the vertices in A. Since A is connected, Γ consists of piecewise line segments.

We now claim that Γ partitions the chamber graph into two connected components. Since A is connected, we just have to show that the complement A^c is also connected.

Suppose not, and let A^c = B_1 ∪ ⋯ ∪ B_s, where the B_i are maximally connected and s ≥ 2. Let β = |∂A| and ν = vol(A). Then we have

Σ_i |∂B_i| = β,

and since vol(A) ≤ vol(V)/2 and Σ_i vol(B_i) = vol(V) - ν, we have

Σ_i vol(B_i) ≥ ν.

Therefore, there must be some component B_i such that

|∂B_i| / vol(B_i) ≤ β/ν = Φ(A).

If B_i satisfies vol(B_i) ≤ vol(V)/2, then again we have a contradiction because of the following. If vol(B_i) < vol(A), we are done. If vol(B_i) ≥ vol(A), it means that every component of A^c attains conductance Φ(A). But if there are more than two components in A^c, then A has a larger cut than each of the components, and therefore must have a larger volume as well, contradicting the assumption on A.

If vol(B_i) > vol(V)/2, then consider the set A' = (B_i)^c, namely A together with the remaining components of A^c. Note that vol(A') ≤ vol(V)/2. Now |∂A'| ≤ β. This follows since there is no boundary between A and any of the other B_j beyond ∂A itself, and the only boundary A' has is with B_i. Furthermore, vol(A') > ν, implying that A' has lower conductance than A, again a contradiction.

Now, we know that the boundary Γ between the chambers in A and the rest of the hyperplane arrangement is exactly a piecewise line segment that separates the plane into two connected components. There are only 3 possibilities, as shown in Figure 1. We now observe that |∂A| is exactly the number of 1-dimensional faces of the arrangement that intersect Γ. Since there are at most n lines in the arrangement, there exists a line that intersects A in at least vol(A)/n many faces, see Figure 1. The number of faces on Γ is no less than the number of faces of this line inside A, because any line that intersects A must also intersect Γ, and at most two lines can intersect at the same point of Γ by our general position assumption. The theorem now follows.

Fig. 1: Possibilities for the piecewise linear partition

For the general dimension case, we have the following conjecture. See Appendix for justification and partial proofs.

Conjecture 1.

The chamber graph of any d-dimensional general position hyperplane arrangement of size n has conductance lower bounded by Ω(1/n).

Remark 1.

Note that the requirement of general position of the hyperplanes is necessary for the fast mixing given by the Conjecture above; otherwise it is easy to construct a hyperplane arrangement with much smaller conductance, and correspondingly larger mixing time. As shown in Figure 2, the cut made by the gray shaded top plane has only one boundary chamber, but the total number of chambers below the plane is roughly n in two dimensions (in d dimensions, the cut and the volume are of order n^{d-2} and n^{d-1}, respectively).

Fig. 2: Hyperplane arrangement with small conductance

Lazy Chamber graph

Algorithm NRW on the regular chamber graph will not give exact uniform sampling, but is off by a factor of at most n/m as mentioned above. This is easily fixed by adding dummy vertices and dummy edges to each vertex in the chamber graph, raising the degree of every vertex in the original chamber graph to n. Call such a graph the lazy chamber graph.

We call the vertices in the original chamber graph chamber vertices and the added dummy vertices augmentation vertices. The stationary probability of the new random walk, restricted to the chamber vertices, is exactly uniform. If Algorithm NRW on the chamber graph is fast mixing, we can show that Algorithm NRW on the lazy chamber graph is also fast mixing:
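A sketch of the lazy-graph construction (the pendant-vertex representation below is our reading of the dummy-vertex construction):

```python
def make_lazy(adj, n):
    """Attach pendant 'augmentation' vertices to every chamber vertex until
    each chamber vertex has degree exactly n; a random walk's stationary
    distribution (proportional to degree) is then identical across all
    chamber vertices."""
    lazy = {v: set(nb) for v, nb in adj.items()}
    for v in list(adj):
        for i in range(n - len(adj[v])):
            d = ("dummy", v, i)
            lazy[v].add(d)
            lazy[d] = {v}
    return lazy

# A toy chamber graph on 4 vertices with degrees 2, 3, 3, 2, padded to n = 4.
chamber = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
lazy = make_lazy(chamber, 4)
print(sorted(len(lazy[v]) for v in chamber))  # [4, 4, 4, 4]
```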

Lemma 1.

If the conductance of the chamber graph is Φ, the lazy chamber graph has conductance Ω(Φ m/n).


We only need to show that for any subset S of vertices of the lazy chamber graph with vol(S) at most half the total volume, we have |∂S|/vol(S) = Ω(Φ m/n). We may assume that whenever an augmentation vertex is in S, the chamber vertex attached to it is also included in S; otherwise removing the augmentation vertex from S only lowers the conductance. We denote by S̄ the set of all chamber vertices in S.

The vertices in S̄ can be partitioned into two classes S̄ = S_1 ∪ S_2, where S_1 is the set of all chamber vertices that have all their attached augmentation vertices in S, and S_2 is the complement of S_1 in S̄. We have

|∂S| ≥ |S_2|,

since all vertices in S_2 are boundary vertices: each is missing at least one attached augmentation vertex, and the corresponding pendant edge is cut. Moreover, every edge of the chamber graph leaving S̄ is also cut in the lazy graph, so |∂S| ≥ |∂S̄| as well. Since each chamber vertex attaches at most n - m augmentation vertices in order to reach degree n, we have vol(S) ≤ 2n|S̄|, while in the chamber graph vol(S̄) ≥ m|S̄|. If vol(S̄) is at most half the volume of the chamber graph, then |∂S̄| ≥ Φ · vol(S̄) ≥ Φ m |S̄|, and therefore

|∂S| / vol(S) ≥ Φ m |S̄| / (2n |S̄|) = Φ m / 2n.

Otherwise, the complement of S̄ carries a constant fraction of the volume of the lazy graph outside S, and repeating the argument on the complement yields the same bound up to constant factors. The lemma follows. ∎

Combining all the results, we have

Theorem 5.

Assume Conjecture 1. For any given parameter ε > 0 and H_1, …, H_n in general position, Algorithm NRW run on the lazy chamber graph generated by H_1, …, H_n generates labels from L(A, S) with distribution ε-close (in variational distance) to uniform, and runs in time polynomial in n and d and logarithmic in 1/ε.


By the relationship between mixing time and spectral gap [10, Theorem 2.2], we have

t_mix(ε) ≤ (1/γ) log( 1 / (ε π_min) ),

where γ is the spectral gap and π_min the minimum stationary probability. The theorem follows since the spectral gap is lower bounded by the square of the conductance (over two) by Cheeger’s inequality [9, Theorem 13.14]. ∎

IV-C Sampling for arbitrary neural networks

We now consider sampling for arbitrary neural networks. Let S = (x_1, …, x_n) be the samples; we choose the weights of the network layer by layer. At layer j we use the previously sampled weights in layers 1, …, j-1 to generate outputs y_1^{(j-1)}, …, y_n^{(j-1)}, where y_i^{(j-1)} is the output of layer j-1 on input x_i, a binary vector. For each neuron in layer j we independently sample weights using Algorithm RS with input y_1^{(j-1)}, …, y_n^{(j-1)}.

To illustrate the idea more concretely, consider neural networks with one hidden layer. Let x_1, …, x_n be the input samples of dimension d. For each neuron in the hidden layer, we use Algorithm RS to generate the weights independently. We now fix the weights sampled for the neurons in the hidden layer and view the function expressed by the hidden layer as some function g : R^d → {-1, +1}^h, where h is the number of neurons in the hidden layer. We then define g(x_1), …, g(x_n) to be the new input sample for the output layer, and again use Algorithm RS to sample the weights of the output neuron with input g(x_1), …, g(x_n).
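The layer-by-layer scheme can be sketched as follows. Note that the weight sampler below is a hypothetical placeholder (random Gaussian weights) standing in for Algorithm RS, which is what the paper actually uses; only the wiring of outputs into the next layer’s inputs is illustrated:

```python
import random

def threshold_neuron(w, t, x):
    """Linear threshold unit: sign(<w, x> - t), with outputs in {-1, +1}."""
    return 1 if sum(a * b for a, b in zip(w, x)) - t >= 0 else -1

def sample_layerwise(samples, layer_widths, sample_weights):
    """Layer-by-layer sampling: the outputs of layer j - 1 on the sample
    become the inputs from which the weights of layer j are drawn."""
    outputs = samples
    for width in layer_widths:
        neurons = [sample_weights(outputs) for _ in range(width)]
        outputs = [[threshold_neuron(w, t, x) for (w, t) in neurons]
                   for x in outputs]
    return [y[0] for y in outputs]  # final layer has width 1

def gaussian_weights(inputs):
    """Hypothetical placeholder sampler (NOT Algorithm RS): Gaussian weights."""
    d = len(inputs[0])
    return [random.gauss(0, 1) for _ in range(d)], random.gauss(0, 1)

random.seed(3)
xs = [[random.gauss(0, 1) for _ in range(2)] for _ in range(5)]
labels = sample_layerwise(xs, [3, 1], gaussian_weights)
print(labels)  # five labels, each in {-1, +1}
```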

Theorem 6.

For a neural network with a fixed architecture, k neurons and W parameters, the above sampling procedure runs in time polynomial in n and W. Given a sample S, each labeling in L(A, S) produced by this architecture appears with probability at least (W/2ekn)^W.


We use induction on the layers. For any given labeling produced by weights w, let p_j be the probability that the outputs of layers 1, …, j are consistent with the outputs under the weights w. We have

p_j ≥ p_{j-1} · ∏_i (W_i / 2en)^{W_i},

where W_i is the number of parameters (input dimension plus one) of the ith neuron in layer j, and the product term comes from Theorem 2 and independence. Note that the rank of the outputs may be reduced after passing through the previous layers; however, this only makes the probability larger than the bound from Theorem 2. Now, the theorem follows by the same argument as in [1, Theorem 6.1] for bounding the VC dimension of linear threshold neural networks. ∎

V Simulations

We run our recursive algorithm on randomly chosen samples with varying dimension d and size n; for each pair (d, n) we run the sampling procedure repeatedly to estimate the empirical distribution over the different labelings. We then compute the ratio of the maximum and minimum probabilities among the labelings that we sampled, rounding the ratio to an integer. One can see from Figure 3 that for each sample size n there is a peak in the probability ratio as the dimension of the sample increases. For a given dimension d, when n is small the ratio increases as n increases, and when n is large the ratio decreases as n increases. We also ran our sampling procedure on the MNIST data set; the run time was on the order of minutes.

Fig. 3: Ratio of maximum and minimal empirical probability


  • [1] M. Anthony and P. L. Bartlett, Neural network learning: Theoretical foundations.    Cambridge University Press, 2009.
  • [2] R. P. Stanley et al., “An introduction to hyperplane arrangements,” Geometric combinatorics, vol. 13, pp. 389–496, 2004.
  • [3] S. Ovchinnikov, Graphs and cubes.    Springer Science & Business Media, 2011.
  • [4] P. Bidigare, P. Hanlon, D. Rockmore et al., “A combinatorial description of the spectrum for the tsetlin library and its generalization to hyperplane arrangements,” Duke Mathematical Journal, vol. 99, no. 1, pp. 135–174, 1999.
  • [5] K. S. Brown and P. Diaconis, “Random walks and hyperplane arrangements,” Annals of Probability, pp. 1813–1854, 1998.
  • [6] C. A. Athanasiadis and P. Diaconis, “Functions of random walks on hyperplane arrangements,” Advances in Applied Mathematics, vol. 45, no. 3, pp. 410–437, 2010.
  • [7] J. Pike, “Eigenfunctions for random walks on hyperplane arrangements,” Ph.D. dissertation, University of Southern California, 2013.

  • [8] A. Björner, “Random walks, arrangements, cell complexes, greedoids, and self-organizing libraries,” in Building bridges.    Springer, 2008, pp. 165–203.
  • [9] D. A. Levin and Y. Peres, Markov chains and mixing times.    American Mathematical Soc., 2017, vol. 107.
  • [10] N. Berestycki, “Mixing times of markov chains: Techniques and examples,” Alea-Latin American Journal of Probability and Mathematical Statistics, 2016.
  • [11] S. Shalev-Shwartz and S. Ben-David, Understanding machine learning: From theory to algorithms.    Cambridge University Press, 2014.
  • [12] R. Arora, A. Basu, P. Mianjy, and A. Mukherjee, “Understanding deep neural networks with rectified linear units,” arXiv preprint arXiv:1611.01491, 2016.
  • [13] N. Megiddo et al., On the complexity of linear programming.    IBM Thomas J. Watson Research Division, 1986.
  • [14] V. Koltun, “The arrangement method for linear programming,” Computer Science Department, Stanford University, 2005.

Appendix A On the conductance conjecture

In order to provide a convincing reason as to why we believe Conjecture 1, we provide a proof of the following partial result: suppose we cut a general position hyperplane arrangement with another hyperplane; the conductance of such a cut is lower bounded by Ω(1/n) (no matter the number of dimensions).

Proposition 4.

Let H_1, …, H_n be a general position hyperplane arrangement in dimension d, and let H be another hyperplane. Then the number of chambers of H \ (H_1 ∪ ⋯ ∪ H_n) (viewed as a hyperplane arrangement in H) is lower bounded by c_d n^{d-1}, for a constant c_d depending only on d.


A set of hyperplanes in dimension m is said to be in almost full rank position if any j of the planes have rank at least min(j, m) - 1. Note that the hyperplanes H_1 ∩ H, …, H_n ∩ H (viewed within H) are in almost full rank position, since intersecting with H can reduce the rank by at most 1. Note that two of the planes H_i ∩ H and H_j ∩ H may be coincident, but we treat them as different planes.

Denote G_i = H_i ∩ H. We show that the further intersections G_i ∩ G_j, for i ≠ j, are also in almost full rank position, so that the argument can be iterated. We only need to show that any j of these planes have rank at least j - 1 (below the ambient dimension).

Suppose not; w.l.o.g. suppose some j of these planes have rank at most j - 2.

But we will show that the corresponding hyperplanes among H_1, …, H_n then violate the general position assumption, thus obtaining a contradiction. To see this, let b_1, …, b_r be a basis of the span of the normals of the j planes, and let u be a normal vector to H. The normal of each G_i lies in the span of the normal of H_i and u; hence a rank deficiency among the normals of the G_i translates, after removing the single direction u, into a rank deficiency for two different sets of normals of the H_i, meaning that some subset of the H_i has rank smaller than general position allows.

The proposition now follows by induction. ∎

A random walk on vertices:

For a hyperplane arrangement H_1, …, H_n in general position, we define the vertices of the arrangement to be all 0-dimensional intersections of d of the hyperplanes. By the general position assumption, there are exactly C(n, d) such vertices. Two vertices are said to be connected if they are joined by a 1-dimensional face (an intersection of d - 1 of the hyperplanes) of the arrangement. There are O(d · C(n, d)) edges, and each vertex is adjacent to at most 2d and at least d edges. The graph defined by these vertices and edges is known as the arrangement graph and was studied in [14], where the author obtained the following conductance bound using a coupling argument:

Theorem 7 ([14, Theorem 4.3]).

The conductance of the arrangement graph is lower bounded by an inverse polynomial in n and d.

Note that Theorem 7 would imply Conjecture 1 if we also knew that the number of vertices in any cut of the chamber graph is not much greater than the number of faces. Proposition 4 shows that this is satisfied when the cut is a plane, since such a cut has roughly n^{d-1} vertices but at least as many faces.