Restricted Boltzmann Machines (RBMs) (Smolensky 1986; Freund and Haussler 1994) are generative probability models defined by undirected stochastic networks with bipartite interactions between visible and hidden units.
These models are well known in machine learning applications, where they are used to infer distributed representations of data and to train the layers of deep neural networks (Hinton et al. 2006; Bengio 2009). The restricted connectivity of these networks allows them to be trained efficiently on the basis of cheap inference and finite Gibbs sampling (Hinton 2002; 2012), even when they are defined with many units and parameters. An RBM defines Gibbs-Boltzmann probability distributions over the observable states of the network, depending on the interaction weights and biases. An introduction is offered by Fischer and Igel (2012). The expressive power of these probability models has attracted much attention and has been studied in numerous papers, treating, in particular, their universal approximation properties (Younes 1996; Le Roux and Bengio 2008; Montúfar and Ay 2011), approximation errors (Montúfar et al. 2011), efficiency of representation (Martens et al. 2013; Montúfar and Morton 2015), and dimension (Cueto et al. 2010).
In certain applications, it is preferred to work with conditional probability distributions, instead of joint probability distributions.
For example, in a classification task, the conditional distribution may be used to indicate a belief about the class of an input, without modeling the probability of observing that input; in sensorimotor control, it can describe a stochastic policy for choosing actions based on world observations; and in the context of information communication, it can describe a channel. RBMs naturally define models of conditional probability distributions, called conditional restricted Boltzmann machines (CRBMs). These models inherit many of the nice properties of RBM probability models, such as cheap inference and efficient training. Specifically, a CRBM is defined by clamping the states of an input subset of the visible units of an RBM. For each input state one obtains a conditioned distribution over the states of the output visible units. See Figure 1 for an illustration of this architecture. Conditional models of this kind, and slight variants thereof, have seen success in many applications; for example, in classification (Larochelle and Bengio 2008), collaborative filtering (Salakhutdinov et al. 2007), motion modeling (Taylor et al. 2007; Zeiler et al. 2009; Mnih et al. 2012; Sutskever and Hinton 2007), and reinforcement learning (Sallans and Hinton 2004).
So far, however, there is not much theoretical work addressing the expressive power of CRBMs. We note that it is relatively straightforward to obtain some results on the expressive power of CRBMs from the existing theoretical work on RBM probability models. Nevertheless, an accurate analysis requires taking into account the specificities of the conditional case. Formally, a CRBM is a collection of RBMs, with one RBM for each possible input value. These RBMs differ in the biases of the hidden units, as these are influenced by the input values. However, these hidden biases are not independent across different inputs, and, moreover, the same interaction weights and biases of the visible units are shared by all inputs. This sharing of parameters substantially distinguishes CRBM models from independent tuples of RBM models.
In this paper we address the representational power of CRBMs, contributing theoretical insights into the optimal number of hidden units. Our focus lies on the classes of conditional distributions that can possibly be represented by a CRBM with a fixed number of inputs and outputs, depending on the number of hidden units. Having said this, we do not discuss the problem of finding the optimal parameters that give rise to a desired conditional distribution (although our derivations include an algorithm that does this), nor problems related to incomplete knowledge of the target conditional distributions and generalization errors. A number of training methods for CRBMs have been discussed in the references listed above, depending on the concrete applications. The problems that we deal with here are the following: 1) are distinct parameters of the model mapped to distinct conditional distributions; what is the smallest number of hidden units that suffices for obtaining a model that can 2) approximate any target conditional distribution arbitrarily well (a universal approximator); 3) approximate any target conditional distribution without exceeding a given error tolerance; 4) approximate selected classes of conditional distributions arbitrarily well? We provide non-trivial solutions to all of these problems. We focus on the case of binary units, but the main ideas extend to the case of discrete non-binary units.
This paper is organized as follows. Section 2 contains formal definitions and elementary properties of CRBMs. Section 3 investigates the geometry of CRBM models in three subsections. In Section 3.1 we study the dimension of the sets of conditional distributions represented by CRBMs and show that in most cases this is the dimension expected from counting parameters (Theorem 4). In Section 3.2 we address the universal approximation problem, deriving upper and lower bounds on the minimal number of hidden units that suffices for this purpose (Theorem 7). In Section 3.3 we analyze the maximal approximation errors of CRBMs (assuming optimal parameters) and derive an upper-bound for the minimal number of hidden units that suffices to approximate every conditional distribution within a given error tolerance (Theorem 11). Section 4 investigates the expressive power of CRBMs in two subsections. In Section 4.1 we describe how CRBMs can represent natural families of conditional distributions that arise in Markov random fields. In Section 4.2 we study the ability of CRBMs to approximate conditional distributions with restricted supports. This section addresses, especially, the approximation of deterministic conditional distributions (Theorem 21). In Section 5 we offer a discussion and an outlook. In order to present the main results in a concise way, we have deferred all proofs to the appendices. Nonetheless, we think that the proofs are interesting in their own right, and we have prepared them with a fair amount of detail.
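We will denote the set of probability distributions on $\{0,1\}^n$ by $\Delta_n$. A probability distribution $p \in \Delta_n$ is a vector of $2^n$ non-negative entries $p(y)$, $y \in \{0,1\}^n$, adding to one, $\sum_{y \in \{0,1\}^n} p(y) = 1$. The set $\Delta_n$ is a $(2^n - 1)$-dimensional simplex in $\mathbb{R}^{2^n}$.
We will denote the set of conditional distributions of a variable $y \in \{0,1\}^n$, given another variable $x \in \{0,1\}^k$, by $\Delta_{k,n}$. A conditional distribution $p(\cdot|\cdot) \in \Delta_{k,n}$ is a $2^k \times 2^n$ row-stochastic matrix with rows $p(\cdot|x) \in \Delta_n$, $x \in \{0,1\}^k$. The set $\Delta_{k,n}$ is a $2^k(2^n - 1)$-dimensional polytope in $\mathbb{R}^{2^k \times 2^n}$. It can be regarded as the $2^k$-fold Cartesian product $\Delta_{k,n} = \Delta_n \times \cdots \times \Delta_n$, where there is one probability simplex $\Delta_n$ for each possible input state $x \in \{0,1\}^k$. We will use the abbreviation $[N] := \{1, \ldots, N\}$, where $N$ is a natural number.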
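As a small numerical illustration of these definitions, the following sketch samples a conditional distribution as a row-stochastic matrix and counts the dimension of the polytope it lives in. The helper names `random_conditional` and `polytope_dimension`, and the use of `k` binary inputs and `n` binary outputs, are assumptions made for this example.

```python
import numpy as np

def random_conditional(k, n, rng):
    """Sample a 2^k x 2^n row-stochastic matrix (one output distribution per input)."""
    return rng.dirichlet(np.ones(2 ** n), size=2 ** k)

def polytope_dimension(k, n):
    """Dimension of the set of conditionals: 2^k simplices, each of dimension 2^n - 1."""
    return 2 ** k * (2 ** n - 1)

rng = np.random.default_rng(0)
P = random_conditional(k=2, n=3, rng=rng)
print(P.shape)                    # (4, 8)
print(P.sum(axis=1))              # every row sums to one (up to rounding)
print(polytope_dimension(2, 3))   # 28
```

Each row is free up to its normalization constraint, which is where the factor $2^n - 1$ per input comes from.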
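The conditional restricted Boltzmann machine (CRBM) with $k$ input units, $n$ output units, and $m$ hidden units, denoted $\operatorname{RBM}^k_{n,m}$, is the set of all conditional distributions in $\Delta_{k,n}$ that can be written as
\[ p(y|x) = \frac{1}{Z(W, b, Vx + c)} \sum_{z \in \{0,1\}^m} \exp\left(z^\top V x + z^\top W y + b^\top y + c^\top z\right), \quad \forall y \in \{0,1\}^n,\ x \in \{0,1\}^k, \]
with normalization function
\[ Z(W, b, Vx + c) = \sum_{y \in \{0,1\}^n} \sum_{z \in \{0,1\}^m} \exp\left(z^\top V x + z^\top W y + b^\top y + c^\top z\right), \quad \forall x \in \{0,1\}^k. \]
Here, $x$, $y$, and $z$ are column state vectors of the $k$ input units, $n$ output units, and $m$ hidden units, respectively, and $\top$ denotes transposition. The parameters of this model are the matrices of interaction weights $V \in \mathbb{R}^{m \times k}$, $W \in \mathbb{R}^{m \times n}$ and the vectors of biases $b \in \mathbb{R}^n$, $c \in \mathbb{R}^m$.
When there are no input units ($k = 0$), the model $\operatorname{RBM}^k_{n,m}$ reduces to the restricted Boltzmann machine probability model with $n$ visible units and $m$ hidden units, denoted $\operatorname{RBM}_{n,m}$.
We can view $\operatorname{RBM}^k_{n,m}$ as a collection of $2^k$ restricted Boltzmann machine probability models with shared parameters. For each input $x \in \{0,1\}^k$, the output distribution $p(\cdot|x)$ is the probability distribution represented by $\operatorname{RBM}_{n,m}$ for the parameters $W$, $b$, $(Vx + c)$. All $p(\cdot|x)$ have the same interaction weights $W$, the same biases $b$ for the visible units, and differ only in the biases $(Vx + c)$ for the hidden units. The joint behavior of these distributions with shared parameters is not trivial.
The model $\operatorname{RBM}^k_{n,m}$ can also be regarded as representing block-wise normalized versions of the joint probability distributions represented by $\operatorname{RBM}_{n+k,m}$. Namely, a joint distribution $p \in \Delta_{k+n}$ is an array with entries $p(x,y)$, $x \in \{0,1\}^k$, $y \in \{0,1\}^n$. Conditioning $p$ on $x$ is equivalent to considering the normalized $x$-th row $p(y|x) = p(x,y) / \sum_{y'} p(x,y')$, $y \in \{0,1\}^n$.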
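For small models the definition can be evaluated by brute-force enumeration. The sketch below assumes the usual binary CRBM parametrization, with unnormalized weights exp(z'Vx + z'Wy + b'y + c'z) summed over hidden states z; the names `V`, `W`, `b`, `c` and the helper functions are illustrative choices. It also checks the shared-parameter view numerically: each output distribution should coincide with that of an input-free RBM whose hidden biases are Vx + c.

```python
import itertools
import numpy as np

def binary_states(d):
    """All 2^d binary vectors of length d, as rows of a (2^d, d) array."""
    states = list(itertools.product([0, 1], repeat=d))
    return np.array(states, dtype=float).reshape(2 ** d, d)

def crbm_conditional(V, W, b, c):
    """Return the 2^k x 2^n matrix p(y|x) of a binary CRBM (assumed parametrization).

    Summing out the hidden units analytically gives the unnormalized weight
    exp(b'y) * prod_j (1 + exp((Vx + Wy + c)_j)).
    """
    m, k = V.shape
    n = W.shape[1]
    X, Y = binary_states(k), binary_states(n)
    # pre[x, y, j] is the total input to hidden unit j for the pair (x, y)
    pre = (X @ V.T)[:, None, :] + (Y @ W.T)[None, :, :] + c
    log_w = (Y @ b)[None, :] + np.logaddexp(0.0, pre).sum(axis=2)
    log_w -= log_w.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
k, n, m = 2, 2, 3
V, W = rng.normal(size=(m, k)), rng.normal(size=(m, n))
b, c = rng.normal(size=n), rng.normal(size=m)
P = crbm_conditional(V, W, b, c)

# Shared-parameter view: row x equals an input-free RBM with hidden biases Vx + c.
no_input = np.zeros((m, 0))
rows_match = all(
    np.allclose(P[i], crbm_conditional(no_input, W, b, V @ x + c)[0])
    for i, x in enumerate(binary_states(k))
)
print(P.sum(axis=1), rows_match)
```

Enumeration costs on the order of 2^(k+n) * m operations, so this is only meant for checking small cases.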
3 Geometry of Conditional Restricted Boltzmann Machines
In this section we investigate three basic questions about the geometry of CRBM models. First, what is the dimension of a CRBM model? Second, how many hidden units does a CRBM need in order to be able to approximate every conditional distribution arbitrarily well? Third, how accurate are the approximations of a CRBM, depending on the number of hidden units?
The model is defined by marginalizing out the hidden units of a graphical model. This implies that several choices of parameters may represent the same conditional distributions. In turn, the dimension of the set of representable conditional distributions may be smaller than the number of model parameters, in principle.
When the dimension of $\operatorname{RBM}^k_{n,m}$, the CRBM with $k$ input, $n$ output, and $m$ hidden units, is equal to the number of parameters, $(k+n)m + n + m$, or, otherwise, equal to the dimension of the ambient polytope of conditional distributions, $2^k(2^n - 1)$, then the model is said to have the expected dimension. In this section we show that $\operatorname{RBM}^k_{n,m}$ has the expected dimension for most triplets $(k, n, m)$. In particular, we show that this holds in all practical cases, where the number of hidden units $m$ is smaller than exponential with respect to the number of visible units $k + n$.
The dimension of a parametric model is given by the maximum of the rank of the Jacobian of its parametrization (assuming mild differentiability conditions). Computing the rank of the Jacobian is not easy in general. One way around this is to compute the rank only in the limit of large parameters, which corresponds to considering a piece-wise linearized version of the original model, called the tropical model. Cueto et al. (2010) used this approach to study the dimension of RBM probability models. Here we apply their ideas in order to study the dimension of CRBM conditional models.
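For very small models, the Jacobian rank can also be estimated numerically and compared with the expected dimension. The sketch below is not the tropical argument used in the paper; it is a finite-difference check under the assumed parametrization with weights `V` (m x k), `W` (m x n) and biases `b`, `c`, and the step size, tolerance, and number of random draws are arbitrary choices.

```python
import itertools
import numpy as np

def states(d):
    """All binary vectors of length d as rows of a (2^d, d) array."""
    return np.array(list(itertools.product([0, 1], repeat=d)), dtype=float).reshape(2 ** d, d)

def crbm(theta, k, n, m):
    """Flattened conditional p(y|x) for a parameter vector theta = (V, W, b, c)."""
    V = theta[: m * k].reshape(m, k)
    W = theta[m * k : m * (k + n)].reshape(m, n)
    b = theta[m * (k + n) : m * (k + n) + n]
    c = theta[m * (k + n) + n :]
    X, Y = states(k), states(n)
    pre = (X @ V.T)[:, None, :] + (Y @ W.T)[None, :, :] + c
    log_w = (Y @ b)[None, :] + np.logaddexp(0.0, pre).sum(axis=2)
    w = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    return (w / w.sum(axis=1, keepdims=True)).ravel()

def jacobian_rank(k, n, m, trials=5, h=1e-5, tol=1e-6, seed=0):
    """Largest numerical Jacobian rank found over a few random parameter draws."""
    rng = np.random.default_rng(seed)
    num_params = (k + n) * m + n + m
    best = 0
    for _ in range(trials):
        theta = rng.normal(scale=2.0, size=num_params)
        J = np.empty((2 ** (k + n), num_params))
        for i in range(num_params):
            e = np.zeros(num_params)
            e[i] = h
            # central difference approximation of the i-th partial derivative
            J[:, i] = (crbm(theta + e, k, n, m) - crbm(theta - e, k, n, m)) / (2 * h)
        best = max(best, int(np.linalg.matrix_rank(J, tol=tol)))
    return best

def expected_dimension(k, n, m):
    """Minimum of the parameter count and the dimension of the conditional polytope."""
    return min((k + n) * m + n + m, 2 ** k * (2 ** n - 1))

for k, n, m in [(1, 1, 1), (1, 2, 2), (2, 2, 2)]:
    print((k, n, m), jacobian_rank(k, n, m), expected_dimension(k, n, m))
```

The numerical rank can never exceed the expected dimension; whether it attains it for a given triplet is exactly the question addressed by the results of this section.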
The following functions from coding theory will be useful for phrasing the results:
Let $A(n, d)$ denote the cardinality of the largest subset of $\{0,1\}^n$ whose elements are at least Hamming distance $d$ apart. Let $K(n, d)$ denote the smallest cardinality of a subset of $\{0,1\}^n$ such that every element of $\{0,1\}^n$ is at most Hamming distance $d$ apart from that set.
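To make the two coding-theory quantities concrete, the following sketch computes them by exhaustive search for tiny block lengths. The function names `packing_number` and `covering_number` are chosen for this example, and the search is exponential in 2^n, so it is only usable for very small n.

```python
import itertools

def hamming(u, v):
    """Number of positions in which two binary words differ."""
    return sum(a != b for a, b in zip(u, v))

def packing_number(n, d):
    """Largest size of a subset of {0,1}^n with pairwise Hamming distance >= d."""
    words = list(itertools.product([0, 1], repeat=n))
    for size in range(len(words), 0, -1):
        for code in itertools.combinations(words, size):
            if all(hamming(u, v) >= d for u, v in itertools.combinations(code, 2)):
                return size
    return 0

def covering_number(n, r):
    """Smallest size of a subset of {0,1}^n whose radius-r Hamming balls cover {0,1}^n."""
    words = list(itertools.product([0, 1], repeat=n))
    for size in range(1, len(words) + 1):
        for code in itertools.combinations(words, size):
            if all(min(hamming(w, c) for c in code) <= r for w in words):
                return size
    return len(words)

print(packing_number(3, 3))   # 2, attained by {000, 111}
print(covering_number(3, 1))  # 2, attained by {000, 111}
```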
Cueto et al. (2010) showed that for , and for . It is known that and . In turn, the probability model has the expected dimension for most pairs . Noting that , we directly infer the following bounds for the dimension of conditional models:
These bounds are too loose and do not allow us to attest whether the conditional model has the expected dimension, unless . Hence we need to study the conditional model in more detail. We obtain the following result:
The conditional model has the expected dimension in the following cases:
We note the following practical version of the theorem, which results from inserting appropriate bounds on the functions and :
The conditional model has the expected dimension in the following cases:
These results show that, in all cases of practical interest, where is less than exponential in , the dimension of the CRBM model is indeed equal to the number of model parameters. In all these cases, almost every conditional distribution that can be represented by the model is represented by at most finitely many different choices of parameters.
On the other hand, the dimension alone is not very informative about the ability of a model to approximate target distributions. In particular, it may be that a high dimensional model covers only a tiny fraction of the set of all conditional distributions, or also that a low dimensional model can approximate any target conditional relatively well. We address the minimal dimension and number of parameters of a universal approximator in the next section. In the subsequent section we address the approximation errors depending on the number of parameters.
3.2 Universal Approximation
In this section we ask for the smallest number of hidden units for which the model can approximate every conditional distribution from arbitrarily well.
Note that each conditional distribution can be identified with the set of joint distributions of the form , with strictly positive marginals . In particular, by fixing a marginal distribution, we obtain an identification of and a subset of . Figure 2 illustrates this identification in the case and .
This implies that universal approximators of joint probability distributions define universal approximators of conditional distributions. We know that is a universal approximator whenever (see Montúfar and Ay 2011), and therefore:
The model can approximate every conditional distribution from arbitrarily well whenever .
This improves previous results by Younes (1996) and van der Maaten (2011). On the other hand, since conditional models do not need to model the input-state distribution, in principle it is possible that is a universal approximator even if is not a universal approximator. In fact, we obtain the following improvement of Proposition 6, which does not follow from corresponding results for RBM probability models:
The model can approximate every conditional distribution from arbitrarily well whenever
In fact, the model can approximate every conditional distribution from arbitrarily well whenever , where is any natural number satisfying , and and are functions (defined in Lemma 30 and Proposition 32) which tend to approximately and , respectively, as tends to infinity.
We note the following weaker but practical version of Theorem 7:
Let . The model can approximate every conditional distribution from arbitrarily well whenever .
These results are significant, because they reduce the bounds following from universal approximation results for probability models by an additive term of order , which corresponds precisely to the order of parameters needed in order to model the input-state distributions.
As expected, the asymptotic behavior of the theorem’s bound is exponential in the number of input and output units. This lies in the nature of the universal approximation property. A crude lower bound on the number of hidden units that suffices for universal approximation can be obtained by comparing the number of parameters of the model and the dimension of the conditional polytope:
If the model can approximate every conditional distribution from arbitrarily well, then necessarily .
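The parameter-counting argument can be written out as a short calculation. The sketch below assumes the parameter count (k + n)m + n + m (weights from inputs and outputs to hidden units, plus output and hidden biases) and the polytope dimension 2^k(2^n - 1); it returns the smallest number of hidden units for which the parameter count reaches that dimension. This is only the crude necessary condition described above, not a sufficient one.

```python
def num_parameters(k, n, m):
    """Assumed CRBM parameter count: (k + n) * m weights, n output and m hidden biases."""
    return (k + n) * m + n + m

def conditional_polytope_dim(k, n):
    """Dimension of the polytope of conditionals of n binary outputs given k binary inputs."""
    return 2 ** k * (2 ** n - 1)

def min_hidden_by_counting(k, n):
    """Smallest m with num_parameters(k, n, m) >= conditional_polytope_dim(k, n)."""
    deficit = conditional_polytope_dim(k, n) - n
    if deficit <= 0:
        return 0
    # integer ceiling of deficit / (k + n + 1)
    return -(-deficit // (k + n + 1))

for k, n in [(1, 1), (2, 2), (3, 3), (10, 10)]:
    print((k, n), min_hidden_by_counting(k, n))
```

For example, with two inputs and two outputs the polytope has dimension 12, and two hidden units give exactly 12 parameters.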
The results presented above highlight the fact that CRBM universal approximation may be possible with a drastically smaller number of hidden units than RBM universal approximation, for the same number of visible units. However, even with these reductions the universal approximation property requires an enormous number of hidden units. In order to provide a more informative description of the approximation capabilities of CRBMs, in the next section we investigate how the maximal approximation error decreases as hidden units are added to the model.
3.3 Maximal Approximation Errors
From a practical perspective it is not necessary to approximate conditional distributions arbitrarily well, but fair approximations suffice. This can be especially important if the number of required hidden units grows disproportionately with the quality of the approximation. In this section we investigate the maximal approximation errors of CRBMs depending on the number of hidden units. Figure 3 gives a schematic illustration of the maximal approximation error of a conditional model.
The Kullback-Leibler divergence of two probability distributions $p$ and $q$ in $\Delta_{k+n}$ is given by
\[ D(p\|q) = \sum_{x} \sum_{y} p(x,y) \log\frac{p(x,y)}{q(x,y)} = D(p_X\|q_X) + \sum_{x} p_X(x)\, D(p(\cdot|x)\|q(\cdot|x)), \]
where $p_X$ denotes the marginal distribution over $x \in \{0,1\}^k$.
The divergence of two conditional distributions $p(\cdot|\cdot)$ and $q(\cdot|\cdot)$ in $\Delta_{k,n}$ is given by
\[ D(p(\cdot|\cdot)\|q(\cdot|\cdot)) = \sum_{x} u_X(x)\, D(p(\cdot|x)\|q(\cdot|x)), \]
where $u_X$ denotes the uniform distribution over $x \in \{0,1\}^k$. Even if the divergence between two joint distributions does not vanish, the divergence between their conditional distributions may vanish.
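The divergence from a conditional distribution $p(\cdot|\cdot)$ to the set of conditional distributions defined by a model of joint probability distributions $\mathcal{M} \subseteq \Delta_{k+n}$ is given by
\[ D(p(\cdot|\cdot)\|\mathcal{M}) = \inf_{q \in \mathcal{M}} D(p(\cdot|\cdot)\|q(\cdot|\cdot)), \]
where $q(\cdot|\cdot)$ is the conditional distribution of the outputs given the inputs under $q$.
The maximum of the divergence from a conditional distribution to $\mathcal{M}$ satisfies
\[ \max_{p(\cdot|\cdot) \in \Delta_{k,n}} D(p(\cdot|\cdot)\|\mathcal{M}) \le \max_{p \in \Delta_{k+n}} D(p\|\mathcal{M}). \]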
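The relation between joint and conditional divergences can be checked numerically. The sketch below assumes the conditional divergence is the average of the row-wise Kullback-Leibler divergences under uniform input weights, and it builds two joint distributions that share a conditional but differ in their input marginals. The function and variable names are illustrative.

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) between two distributions on the same finite set."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def conditional_of(joint):
    """Row-normalize a joint array p(x, y) to obtain p(y | x)."""
    return joint / joint.sum(axis=1, keepdims=True)

def conditional_kl(P, Q):
    """Uniform average over inputs of the divergences between corresponding rows."""
    return float(np.mean([kl(p_row, q_row) for p_row, q_row in zip(P, Q)]))

rng = np.random.default_rng(0)
k, n = 2, 2
cond = rng.dirichlet(np.ones(2 ** n), size=2 ** k)   # shared conditional p(y|x)
marg_p = rng.dirichlet(np.ones(2 ** k))              # two different input marginals
marg_q = rng.dirichlet(np.ones(2 ** k))
joint_p = marg_p[:, None] * cond
joint_q = marg_q[:, None] * cond

print(kl(joint_p, joint_q))  # positive, since the input marginals differ
print(conditional_kl(conditional_of(joint_p), conditional_of(joint_q)))  # zero up to rounding
```

Because the two joints differ only in the input marginal, the whole joint divergence is carried by the marginal term of the chain rule.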
Hence we can bound the maximal divergence of a CRBM by the maximal divergence of an RBM (studied in Montúfar et al. 2011) and obtain the following:
If , then the divergence from any conditional distribution to the model is bounded by
This proposition implies the universal approximation result from Proposition 6 as the special case with vanishing approximation error, but it does not imply Theorem 7 in the same way. Taking more specific properties of the conditional model into account, we can improve the proposition and obtain the following:
Let . The divergence from any conditional distribution in to the model is bounded from above by
In fact, the divergence from any conditional distribution in to is bounded from above by , where is the largest integer with .
This theorem implies the universal approximation result from Theorem 7 as the special case with vanishing approximation error. We note the following weaker but practical version of Theorem 11 (analogous to Corollary 8):
Let and . The divergence from any conditional distribution in to the model is bounded from above by , whenever .
Given an error tolerance, we can use these bounds to find a sufficient number of hidden units that guarantees approximations within this error tolerance.
In plain terms, the results presented above show that the worst case approximation errors of CRBMs decrease at least with the logarithm of the number of hidden units. On the other hand, in practice one is not interested in approximating all possible conditional distributions, but only special classes. One can expect that CRBMs can approximate certain classes of conditional distributions better than others. This is the subject of the next section.
4 Representation of Special Classes of Conditional Models
In this section we ask about the classes of conditional distributions that can be compactly represented by CRBMs and whether CRBMs can approximate interesting conditional distributions using only a moderate number of hidden units.
The first part of the question is about familiar classes of conditional distributions that can be expressed in terms of CRBMs, which in turn would allow us to compare CRBMs with other models and to develop a more intuitive picture of Definition 1.
The second part of the question clearly depends on the specific problem at hand. Nonetheless, some classes of conditional distributions may be considered generally interesting, as they contain solutions to all instances of certain classes of problems. An example is the class of deterministic conditional distributions, which suffices to solve any Markov decision problem in an optimal way.
4.1 Representation of Conditional Markov Random Fields
In this section we discuss the ability of CRBMs to represent conditional Markov random fields, depending on the number of hidden units that they have. The main idea is that each hidden unit of an RBM can be used to model the pure interaction of a group of visible units. This idea appeared in previous work by Younes (1996), in the context of universal approximation.
Consider a simplicial complex on ; that is, a collection of subsets of such that implies for all , and . The random field with interactions is the set of probability distributions of the form
with normalization and parameters , .
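A random field of this kind is easy to evaluate explicitly for a handful of units. The sketch below assumes the common hierarchical form, in which the log-probability is a sum over the sets of the simplicial complex of a parameter times the product of the corresponding binary variables; this form, the zero-based unit indices, and the helper names are assumptions of the example. It builds the complex generated by a chain of three units and checks the Markov property numerically.

```python
import itertools
import numpy as np

def simplicial_complex(facets):
    """All non-empty subsets of the given facets (closure under taking subsets)."""
    faces = set()
    for facet in facets:
        facet = tuple(sorted(facet))
        for size in range(1, len(facet) + 1):
            faces.update(itertools.combinations(facet, size))
    return sorted(faces, key=lambda f: (len(f), f))

def random_field(theta, n):
    """Distribution on {0,1}^n with log-weight sum_faces theta[face] * prod_{i in face} x_i."""
    xs = list(itertools.product([0, 1], repeat=n))
    log_w = np.array([
        sum(t for face, t in theta.items() if all(x[i] for i in face))
        for x in xs
    ], dtype=float)
    w = np.exp(log_w - log_w.max())
    return w / w.sum()

rng = np.random.default_rng(0)
faces = simplicial_complex([(0, 1), (1, 2)])      # chain 0 - 1 - 2
theta = {face: rng.normal() for face in faces}
p = random_field(theta, n=3).reshape(2, 2, 2)     # axes: x0, x1, x2

# Markov property of the chain: x0 and x2 are independent given x1.
for x1 in (0, 1):
    block = p[:, x1, :] / p[:, x1, :].sum()
    print(np.allclose(block, np.outer(block.sum(axis=1), block.sum(axis=0))))
```

The complex has one parameter per face, here three single-unit terms and two pairwise interactions.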
We obtain the following result:
Let be a simplicial complex on . If , then the model can represent every conditional distribution of , given , that can be represented by .
An interesting special case is when each output distribution can be chosen arbitrarily from a given Markov random field:
Let be a simplicial complex on and for each let be some probability distribution from . If , then the model can represent the conditional distribution defined by , for all , for all .
We note the following direct implication for RBM probability models:
Let be a simplicial complex on . If , then can represent any probability distribution from .
Figure 4 illustrates a Markov random field and an RBM architecture that can represent it.
4.2 Approximation of Conditional Distributions with Restricted Supports
In this section we continue the discussion about the classes of conditional distributions that can be represented by CRBMs, depending on the number of hidden units. Here we focus on a hierarchy of conditional distributions defined by the total number of input-output pairs with positive probability.
For any $k$, $n$, and $0 \le l \le 2^k(2^n - 1)$, let $\Delta^l_{k,n}$ denote the union of all $l$-dimensional faces of $\Delta_{k,n}$; that is, the set of conditional distributions that have a total of $2^k + l$ or fewer non-zero entries, $\Delta^l_{k,n} := \{ p(\cdot|\cdot) \in \Delta_{k,n} : |\{(x,y) : p(y|x) > 0\}| \le 2^k + l \}$.
Note that $\Delta^{2^k(2^n-1)}_{k,n} = \Delta_{k,n}$. The vertices (zero-dimensional faces) of $\Delta_{k,n}$ are the conditional distributions which assign positive probability to only one output, given each input, and are called deterministic. By Carathéodory’s theorem, every element of $\Delta^l_{k,n}$ is a convex combination of $l + 1$ or fewer deterministic conditional distributions.
The sets $\Delta^l_{k,n}$ arise naturally in the context of reinforcement learning and partially observable Markov decision processes (POMDPs). Namely, every finite POMDP has an associated effective dimension $d$, which is the dimension of the set of all state processes that can be generated by stationary stochastic policies. Montúfar et al. (2014) showed that the policies represented by conditional distributions from the set $\Delta^d_{k,n}$ are sufficient to generate all the processes that can be generated by $\Delta_{k,n}$. In general, the effective dimension $d$ is relatively small, such that $\Delta^d_{k,n}$ is a much smaller policy search space than $\Delta_{k,n}$.
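The face structure described above can be explored directly for tiny cases. The sketch below enumerates all deterministic conditional distributions as 0/1 row-stochastic matrices and computes, for any conditional distribution, the dimension of the smallest face containing it as the number of non-zero entries minus the number of inputs. The helper names, and the indexing of inputs and outputs by integers, are assumptions of the example.

```python
import itertools
import numpy as np

def deterministic_conditionals(k, n):
    """Yield every 0/1 row-stochastic matrix of shape (2^k, 2^n)."""
    rows, cols = 2 ** k, 2 ** n
    for outputs in itertools.product(range(cols), repeat=rows):
        P = np.zeros((rows, cols))
        P[np.arange(rows), np.array(outputs)] = 1.0
        yield P

def face_dimension(P):
    """Dimension of the smallest face containing P: non-zero entries minus number of rows."""
    return int(np.count_nonzero(P)) - P.shape[0]

vertices = list(deterministic_conditionals(k=1, n=1))
print(len(vertices))                            # 4 deterministic conditionals
print({face_dimension(P) for P in vertices})    # {0}
print(face_dimension(np.full((2, 2), 0.5)))     # 2, the full polytope dimension
```

The number of vertices is (2^n)^(2^k), which grows doubly exponentially in the number of input units.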
We have the following result:
If , then the model can approximate every element from arbitrarily well.
This result shows the intuitive fact that each hidden unit of can be used to model the probability of an input-output pair. Since each conditional distribution has input-output probabilities that are completely determined by the other probabilities (due to normalization), it is interesting to ask whether the amount of hidden units indicated in the proposition is strictly necessary. Further below, Theorem 21 will show that, indeed, hidden units are required for modeling the positions of the positive probability input-output pairs, even if their specific values do not need to be modeled.
We note that certain structures of positive probability input-output pairs can be modeled with fewer hidden units than stated in Proposition 18. A simple example is the following direct generalization of Corollary 8:
If is divisible by and , then the model can approximate every element from arbitrarily well, when the set of positive-probability outputs is the same for all inputs.
In the following we will focus on deterministic conditional distributions. This is a particularly interesting and simple class of conditional distributions with restricted supports. It is well known that any finite Markov decision process (MDP) has an optimal policy defined by a stationary deterministic conditional distribution (see Bellman 1957; Ross 1983). Furthermore, Ay et al. (2013) showed that it is always possible to define simple two-dimensional manifolds that approximate all deterministic conditional distributions arbitrarily well.
Certain classes of conditional distributions (in particular deterministic conditionals) coming from feedforward networks can be approximated arbitrarily well by CRBMs:
The model can approximate every conditional distribution arbitrarily well, which can be represented by a feedforward network with input units, a hidden layer of linear threshold units, and an output layer of sigmoid units. In particular, the model can approximate every deterministic conditional distribution from arbitrarily well, which can be represented by a feedforward linear threshold network with input, hidden, and output units.
The representational power of feedforward linear threshold networks has been studied intensively in the literature. For example, Wenzel et al. (2000) showed that a feedforward linear threshold network with input, hidden, and output units, can represent the following:
Any Boolean function , when ; e.g., when .
The parity function , when .
The indicator function of any union of linearly separable subsets of .
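As an illustration of the second item, parity can be computed by a linear threshold network with one hidden layer. The sketch below uses a textbook construction with as many hidden units as inputs, where hidden unit j fires when at least j inputs are on and the output unit takes an alternating-sign sum. It is an assumed example and is not claimed to match the hidden-unit count of the cited result.

```python
import itertools
import numpy as np

def threshold(a):
    """Linear threshold activation: 1 where the pre-activation is >= 0, else 0."""
    return (np.asarray(a) >= 0).astype(int)

def parity_network(x):
    """Parity of a binary vector via one hidden layer of linear threshold units."""
    x = np.asarray(x)
    k = x.size
    js = np.arange(1, k + 1)
    # hidden unit j has all input weights equal to 1 and bias -j
    hidden = threshold(x.sum() - js)
    # output weights alternate +1, -1, ...; the sum is 1 for odd counts, 0 for even
    signs = np.where(js % 2 == 1, 1, -1)
    return int(threshold(hidden @ signs - 1))

for x in itertools.product([0, 1], repeat=3):
    print(x, parity_network(x))
```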
Although CRBMs can approximate this rich class of deterministic conditional distributions arbitrarily well, the next result shows that the number of hidden units required for universal approximation of deterministic conditional distributions is rather large:
The model can approximate every deterministic policy from arbitrarily well if and only if .
By this theorem, in order to approximate all deterministic conditional distributions arbitrarily well, a CRBM requires exponentially many hidden units, with respect to the number of input units.
This paper gives a theoretical description of the representational capabilities of conditional restricted Boltzmann machines (CRBMs) relating model complexity and model accuracy. CRBMs are based on the well studied restricted Boltzmann machine (RBM) probability models. We proved an extensive series of results that generalize recent theoretical work on the representational power of RBMs in a non-trivial way.
We studied the problem of parameter identifiability. We showed that every CRBM with up to exponentially many hidden units (in the number of input and output units) represents a set of conditional distributions of dimension equal to the number of model parameters. This implies that in all practical cases, CRBMs do not waste parameters, and, generically, only finitely many choices of the interaction weights and biases produce the same conditional distribution.
We addressed the classical problems of universal approximation and approximation quality. Our results show that a CRBM with hidden units can approximate every conditional distribution of output units, given input units, without surpassing a Kullback-Leibler approximation error of the form (assuming optimal parameters). Thus this model is a universal approximator whenever . In fact we provided tighter bounds depending on . For instance, if , then the universal approximation property is attained whenever . Our proof is based on an upper bound for the complexity of an algorithm that packs Boolean cubes with sequences of non-overlapping stars, for which improvements may be possible. It is worth mentioning that the set of conditional distributions for which the approximation error is maximal may be very small. This is a largely open and difficult problem. We note that our results can be plugged into certain analytic integrals (Montúfar and Rauh 2014) to produce upper-bounds for the expectation value of the approximation error when approximating conditional distributions drawn from a product Dirichlet density on the polytope of all conditional distributions. For future work it would be interesting to extend our (optimal-parameter) considerations by an analysis of the CRBM training complexity and the errors resulting from non-optimal parameter choices.
We also studied specific classes of conditional distributions that can be represented by CRBMs, depending on the number of hidden units. We showed that CRBMs can represent conditional Markov random fields by using each hidden unit to model the interaction of a group of visible variables. Furthermore, we showed that CRBMs can approximate all binary functions with input bits and output bits arbitrarily well if or and only if . In particular, this implies that there are exponentially many deterministic conditional distributions which can only be approximated arbitrarily well by a CRBM if the number of hidden units is exponential in the number of input units. This aligns with well known examples of functions that cannot be compactly represented by shallow feedforward networks, and reveals some of the intrinsic constraints of CRBM models that may prevent them from grossly over-fitting.
We think that the developed techniques can be used for studying other conditional probability models as well. In particular, for future work it would be interesting to compare the representational power of CRBMs and of combinations of CRBMs with feedforward nets (combined models of this kind include CRBMs with retroactive connections and recurrent temporal RBMs). Also, it would be interesting to apply our techniques to study stacks of CRBMs and other multilayer conditional models. Finally, although our analysis focuses on the case of binary units, the main ideas can be extended to the case of discrete non-binary units.
Appendix A Details on the Dimension
Proof of Proposition 3.
Each joint distribution of and has the form and the set of all marginals has dimension .
This shows the first statement. The items follow directly from the corresponding statements for the probability model.
Proof of Theorem 4. We will prove a stronger statement, where the condition on appearing in the first item is relaxed to the following: The set contains disjoint radius- Hamming balls whose union does not contain any set of the form for , and whose complement has full affine rank as a subset of .
The proof is based on the ideas developed in (Cueto et al. 2010) for studying the RBM probability model.
We consider the Jacobian of for the parametrization given in Definition 1. The dimension of is the maximum rank of the Jacobian over all possible choices of , . Let denote the most likely hidden state of given the visible state , depending on the parameter . After a few direct algebraic manipulations, we find that the maximum rank of the Jacobian is bounded from below by the maximum over of the dimension of the column-span of the matrix with rows
modulo vectors whose -th entries are independent of given . Here is the Kronecker product, which is defined by . The modulo operation has the effect of disregarding the input distribution in the joint distribution represented by the RBM. For example, from the first block of we can remove the columns that correspond to , without affecting the mentioned column-span. Summarizing, the maximal column-rank of modulo the vectors whose -th entries are independent of given is a lower bound for the dimension of .
Note that depends on in a discrete way; the parameter space is partitioned in finitely many regions where is constant. The piece-wise linear map thus emerging, with linear pieces represented by the , is the tropical CRBM morphism, and its image is the tropical CRBM model.
Each linear region of the tropical morphism corresponds to an inference function taking visible state vectors to the most likely hidden state vectors. Geometrically, such an inference function corresponds to slicings of the -dimensional unit hypercube. Namely, every hidden unit divides the visible space in two halfspaces, according to its preferred state.
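The inference function can be written down explicitly for small examples. The sketch below assumes that, given a visible state v, the hidden units are conditionally independent with pre-activations Wv + c, so that the most likely hidden state is a componentwise threshold; for a CRBM one would take v to be the concatenation of input and output states and W the concatenation of the two weight matrices. The names `W`, `c` and both helper functions are illustrative, and the brute-force version is included only as a cross-check.

```python
import itertools
import numpy as np

def most_likely_hidden(v, W, c):
    """Hidden unit j prefers state 1 exactly when (W v + c)_j > 0."""
    return (W @ v + c > 0).astype(int)

def most_likely_hidden_bruteforce(v, W, c):
    """Maximize z'(W v + c) over all binary hidden states z by enumeration."""
    m = W.shape[0]
    zs = np.array(list(itertools.product([0, 1], repeat=m)))
    return zs[np.argmax(zs @ (W @ v + c))]

rng = np.random.default_rng(0)
num_visible, num_hidden = 4, 3
W = rng.normal(size=(num_hidden, num_visible))
c = rng.normal(size=num_hidden)

# Each hidden unit slices the vertices of the hypercube with a hyperplane.
for v in itertools.product([0, 1], repeat=num_visible):
    print(v, most_likely_hidden(np.array(v), W, c))
```

Parameters that induce the same assignment of hidden states to all visible states lie in the same linear region of the piece-wise linear map.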
Each of these slicings defines a column block of the matrix . More precisely,
where is the matrix with rows for all , and is the same matrix, with rows multiplied by the indicator function of the set of points classified as positive by a linear classifier (slicing).
If we consider only linear classifiers that select rows of corresponding to disjoint Hamming balls of radius one (that is, such that the are disjoint radius-one Hamming balls), then the rank of is equal to the number of such classifiers times (which is the rank of each block ), plus the rank of (which is the remainder rank of the first block ). The column-rank modulo functions of is equal to the rank minus (which is the dimension of the functions of spanned by columns of ), minus at most the number of cylinder sets for some that are contained in . This completes the proof of the general statement in the first item.
The example given in the first item is a consequence of the following observations. Each cylinder set contains points. If a given cylinder set intersects a radius- Hamming ball but is not contained in it, then it also intersects the radius- Hamming sphere around . Choosing the radius- Hamming ball slicings to have centers at least Hamming distance apart, we can ensure that their union does not contain any cylinder set .
The second item follows from the second item of Proposition 3: when the probability model is full dimensional, the corresponding conditional model is full dimensional as well.
Appendix B Details on Universal Approximation
B.1 Sufficient Number of Hidden Units
This section contains the proof of Theorem 7 about the minimal size of CRBM universal approximators.
The proof is constructive; given any target conditional distribution, it proceeds by adjusting the weights of the hidden units successively until obtaining the desired approximation.
The idea of the proof is that each hidden unit can be used to model the probability of an output vector, for several different input vectors.
The probability of a given output vector can be adjusted at will by a single hidden unit, jointly for several input vectors, when these input vectors are in general position.
This comes at the cost of generating dependent output probabilities for all other inputs in the same affine space.
The main difficulty of the proof lies in the construction of sequences of successively conflict-free groups of affinely independent inputs, and in estimating the shortest possible length of such sequences exhausting all possible inputs.
The proof is composed of several lemmas and propositions. We start with a few definitions:
Given two probability distributions and on a finite set , the Hadamard product or renormalized entry-wise product is the probability distribution on defined by for all . When building this product, we assume that the supports of and are not disjoint, such that the normalization term does not vanish.
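A minimal sketch of this definition follows, with distributions stored as dictionaries from states to exact fractions. The function name `hadamard` and the example distributions are assumptions for illustration. The error raised for disjoint supports mirrors the requirement that the normalization term must not vanish.

```python
from fractions import Fraction


def hadamard(p, q):
    """Renormalized entry-wise product of two distributions on the same finite set."""
    if p.keys() != q.keys():
        raise ValueError("p and q must be defined on the same set")
    weights = {x: p[x] * q[x] for x in p}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("supports are disjoint; the product is undefined")
    return {x: w / total for x, w in weights.items()}


# Illustrative distributions on a three-element set.
p = {"a": Fraction(1, 2), "b": Fraction(1, 2), "c": Fraction(0)}
q = {"a": Fraction(1, 4), "b": Fraction(3, 4), "c": Fraction(0)}
r = hadamard(p, q)
print(r)
```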
The probability distributions that can be represented by RBMs can be described in terms of Hadamard products. Namely, for every probability distribution that can be represented by , the model with one additional hidden unit can represent precisely the probability distribution of the form , where is a mixture, with , of two strictly positive product distributions and . In other words, each additional hidden unit amounts to Hadamard-multiplying the distributions representable by an RBM with the distributions representable as mixtures of product distributions. The same result is obtained by considering only the Hadamard products with mixtures where is equal to the uniform distribution. In this case, the distributions are of the form , where is any strictly positive product distribution and is any weight in .
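The last claim can be checked directly. The symbols below are names chosen here for illustration and need not match the notation of the main text: $p$ is the distribution being multiplied, $u$ is the uniform distribution on a set of $N$ states, $s$ is a strictly positive product distribution, and $\mu \in [0,1]$ is the mixture weight.

```latex
\begin{align*}
\bigl(p * (\mu u + (1-\mu) s)\bigr)(x)
  &= \frac{p(x)\bigl(\tfrac{\mu}{N} + (1-\mu)\, s(x)\bigr)}{Z},
  \qquad Z = \frac{\mu}{N} + (1-\mu) \sum_{x'} p(x')\, s(x'), \\
  &= \lambda\, p(x) + (1-\lambda)\, (p * s)(x),
  \qquad \lambda = \frac{\mu / N}{Z}.
\end{align*}
```

Since $s$ is strictly positive, $\sum_{x'} p(x') s(x') > 0$, so $Z > 0$ and $\lambda$ varies continuously from $0$ to $1$ as $\mu$ runs from $0$ to $1$. Every weight in $[0,1]$ is therefore attained.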
A probability sharing step is a transformation taking a probability distribution to , for some strictly positive product distribution and some .
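The following sketch illustrates a sharing step on binary vectors. It assumes the step has the mixture form lam * p + (1 - lam) * (p * s), as suggested by the preceding discussion of Hadamard products with mixtures. The names `product_distribution`, `sharing_step`, `s`, and `lam` are hypothetical.

```python
from fractions import Fraction
from itertools import product


def product_distribution(marginals):
    """Product distribution on {0,1}^n; marginals[i] is P(x_i = 1)."""
    dist = {}
    for x in product((0, 1), repeat=len(marginals)):
        prob = Fraction(1)
        for xi, m in zip(x, marginals):
            prob *= m if xi == 1 else 1 - m
        dist[x] = prob
    return dist


def sharing_step(p, s, lam):
    """Assumed form of a sharing step: lam * p + (1 - lam) * (p * s)."""
    # The normalizer is positive because s is strictly positive.
    norm = sum(p[x] * s[x] for x in p)
    return {x: lam * p[x] + (1 - lam) * p[x] * s[x] / norm for x in p}


# Illustrative values: uniform start, product distribution favoring x_1 = 1.
uniform = product_distribution([Fraction(1, 2), Fraction(1, 2)])
s = product_distribution([Fraction(3, 4), Fraction(1, 2)])
shared = sharing_step(uniform, s, Fraction(1, 2))
print(shared)
```

In this example the step moves probability mass from the states with first coordinate 0 to those with first coordinate 1, while the total stays equal to one.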
We will need two more standard definitions from coding theory:
A radius-1 Hamming ball in is a set consisting of a length- binary vector and all its immediate neighbors; that is, for some , where denotes the Hamming distance between and . Here .
An -dimensional cylinder set in is a set of length- binary vectors with arbitrary values in coordinates and fixed values in the other coordinates; that is, for some and some with .
The geometric intuition is simple: a cylinder set corresponds to the vertices of a face of a unit cube, and a radius-1 Hamming ball corresponds to the vertices of a corner of a unit cube. The vectors in a radius-1 Hamming ball are affinely independent. See Figure 5A for an illustration.
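The two definitions and the affine independence claim can be checked on small cubes with the sketch below. All function names are hypothetical. Affine independence is tested by computing, with exact arithmetic, the rank of the differences between each point and the first one.

```python
from fractions import Fraction
from itertools import product


def hamming_ball(center):
    """Radius-1 Hamming ball: the center and its n one-bit-flip neighbors."""
    ball = [tuple(center)]
    for i in range(len(center)):
        flipped = list(center)
        flipped[i] = 1 - flipped[i]
        ball.append(tuple(flipped))
    return ball


def cylinder_set(n, fixed):
    """Vectors in {0,1}^n agreeing with `fixed`, a dict from index to bit."""
    return [x for x in product((0, 1), repeat=n)
            if all(x[i] == b for i, b in fixed.items())]


def rank(rows):
    """Rank of a matrix by exact Gaussian elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    for col in range(len(m[0]) if m else 0):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][col] / m[r][col]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


def affinely_independent(points):
    """True iff the differences from the first point are linearly independent."""
    base = points[0]
    diffs = [[a - b for a, b in zip(p, base)] for p in points[1:]]
    return rank(diffs) == len(diffs)


ball = hamming_ball((0, 1, 0))       # a corner of the 3-cube
face = cylinder_set(3, {0: 0})       # a 2-dimensional face of the 3-cube
print(ball, affinely_independent(ball))
print(face, affinely_independent(face))
```

The corner consists of four affinely independent points. The four vertices of the face lie in a common plane and are therefore affinely dependent.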
In order to prove Theorem 7, for each and we want to find an such that: for any given strictly positive conditional distribution , there exists and probability sharing steps taking to a strictly positive joint distribution with . The idea is that the starting distribution is represented by an RBM with no hidden units, and each sharing step is realized by adding a hidden unit to the RBM. In order to obtain these sequences of sharing steps, we will use the following technical lemma:
Lemma 26. Let be a radius-1 Hamming ball in and let be a cylinder subset of containing the center of . Let for all , let and let denote the Dirac delta on assigning probability one to . Let be a strictly positive probability distribution with conditionals and let
Then, for any , there is a probability sharing step taking to a joint distribution with conditionals satisfying for all .
Proof. We define the sharing step with a product distribution supported on .
Note that given any distribution on and a radius-1 Hamming ball whose center is contained in , there is a product distribution on such that .
In other words, the restriction of a product distribution to a radius-1 Hamming ball can be made proportional to any non-negative vector of length .
To see this, note that a product distribution is a vector with entries for all , with factor distributions .
Hence the restriction of to is given by the vector , where, without loss of generality, we chose centered at .
Now, by choosing the factor distributions appropriately, the vector can be made arbitrary in .
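The construction in this argument can be sketched as follows for a ball centered at the zero vector. The sketch assumes that the target weight on the center is positive; the case of zero weight on the center is not covered. The names `factors_for_ball`, `product_prob`, and `target` are hypothetical.

```python
from fractions import Fraction


def factors_for_ball(target):
    """Return P(x_i = 1) for each factor so that the product distribution,
    restricted to the radius-1 ball centered at the zero vector, is
    proportional to target = (t_0, t_1, ..., t_n).

    t_0 is the weight of the center and t_i the weight of the i-th unit
    vector. Assumes t_0 > 0 and t_i >= 0.
    """
    t0 = Fraction(target[0])
    if t0 <= 0:
        raise ValueError("this sketch needs a positive weight on the center")
    # Flipping bit i multiplies the probability by P(x_i=1) / P(x_i=0) = t_i / t_0.
    return [Fraction(t) / (t0 + Fraction(t)) for t in target[1:]]


def product_prob(x, marginals):
    """Probability of the binary vector x under the product distribution."""
    prob = Fraction(1)
    for xi, m in zip(x, marginals):
        prob *= m if xi == 1 else 1 - m
    return prob


target = [2, 1, 0, 5]  # illustrative weights for 000, 100, 010, 001
marginals = factors_for_ball(target)
ball = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
restricted = [product_prob(x, marginals) for x in ball]
ratios = [r / restricted[0] for r in restricted]
print(ratios)
```

Dividing by the probability of the center shows that the restricted vector is proportional to the target, since each ratio equals the corresponding target weight divided by the weight of the center.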
We have the following two implications of Lemma 26:
For any and for all , there is an such that, for any strictly positive joint distribution with conditionals satisfying for all , there are sharing steps taking to a joint distribution with conditionals satisfying for all , where is the Dirac delta on assigning probability one to the vector of zeros and
Proof. Consider any . We will show that the probability distribution