## 1 Introduction

In machine learning and statistics, the directed acyclic graph (DAG) is a common modelling choice for expressing relationships between objects. Prime examples of DAG-based graphical models include Bayesian networks, feed-forward neural networks, causal networks, deep belief networks, dynamic Bayesian networks and hidden Markov models, to name a few. Learning the unknown structure of these models presents a significant learning challenge, a task that is often avoided by fixing the structure to a large and hopefully sufficiently expressive model.

Structure learningis a model selection problem in which ones estimates the underlying graphical structure of the model. Over the years, researchers have explored a great variety of approaches to this problem

[jordan1998learning, schmidt2007learning, banerjee2015bayesian, kwok1997constructive, mansinghka2012structured, mohammadi2015bayesian, tervo2016toward], from frequentist to Bayesian, and some using pure heuristic-based search, but the vast majority is limited to finite parametric models.

Bayesian nonparametric learning methods are appealing alternatives to their parametric counterparts, because they offer more flexibility when dealing with generative models of unknown dimensionality [hjort2009invitation]. Instead of looking for specific finite-dimensional models, the idea is rather to define probability measures on infinite-dimensional spaces and then infer the finite subset of active dimensions explaining the data. Over the past years, there has been extensive work on constructing flexible Bayesian nonparametric models for various types of graphical models, allowing complex hidden structures to be learned from data. For instance, [jiang2013infinite] developed a model for infinite latent conditional random fields while [nbn2010] proposed an infinite mixture of fully observable finite-dimensional Bayesian networks. In the case of time series, [chatzis2010infinite] developed the infinite hidden Markov random field model and [doshi2011infinite] proposed an infinite dynamic Bayesian network with factored hidden states. Another interesting model is the infinite factorial dynamical model of [valera2015infinite] representing the hidden dynamics of a system with infinitely many independent hidden Markov models.

The problem of learning networks containing hidden structures with Bayesian nonparametric methods has also received attention. The cascading Indian Buffet Process (CIBP) of [adams2010a] is a Bayesian nonparametric prior over infinitely deep and infinitely broad layered network structures. However, the CIBP does not allow connections from non-adjacent layers, yielding a restricted prior over infinite DAGs. The extended CIBP (ECIBP) is an extension of the previous model which seeks to correct this limitation and support a larger set of DAG structures [dallaire2014learning]

. However, the ECIBP has some drawbacks: the observable nodes are confined to a unique layer placed at the bottom of the network, which prevents learning the order of the nodes or have observable inputs. An immediate consequence of this is the impossibility for an observable unit to be the parent of any hidden unit or any other observable unit, which considerably restricts the support of the prior over DAGs and make their application to deep learning very problematic.

In the context of deep learning, structure learning is often part of the optimization. Recently, [wen2016learning] proposed a method that enforces the model to dynamically learn more compact structures by imposing sparsity through regularization. While sparsity is obviously an interesting property for large DAG-based models, their method ignores the epistemic uncertainty about the structure. Structure learning for probabilistic graphical models can also be applied in deep learning. For instance, [rohekar2018constructing] have demonstrated that deep network structures can be learned throught the use of Bayesian network structure learning strategies. To our knowledge, no Bayesian nonparametric structure learning methods have been applied to deep learning models.

This paper introduces the Indian Chefs Process (ICP), a new Bayesian nonparametric prior for general DAG-based structure learning, which can equally be applied to perform Bayesian inference in probabilistic graphical models and deep learning. The proposed distribution has a support containing all possible DAGs, admits hidden and observable units, is layerless and enforces sparsity. We present its construction in Section

2and describe a learning method based on Markov Chain Monte Carlo in Section

3. In Section 4, we use the ICP as a prior in two Bayesian structure learning experiments: in the first, we compute the posterior distribution on the structure and parameters of a deep generative sigmoid networks and in the second we perform structure learning in convolutional neural networks.## 2 Bayesian nonparametric directed acylic graphs

In this section, we construct the probability distribution over DAGs and orders by adopting the methodology followed by [griffiths2006infinite]. We first define a distribution over finite-dimensional structures, while the final distribution is obtained by evaluating it as the structure size grows to infinity.

Let be a DAG where is the set of nodes and is the adjacency matrix. We introduce an ordering on the nodes so that the direction of an edge is determined by comparing the order value of each node. A connection is only allowed when , meaning that higher order nodes are parents and lower order nodes are children. Notice that this constraint is stronger than acyclicity since all combinations respecting the order value constraint are guaranteed to be acyclic, but an acyclic graph can violate the ordering constraint.

We assume that both the adjacency matrix and the ordering

are random variables and develop a Bayesian framework reflecting our uncertainty. Accordingly, we assign a popularity parameter

and an order value , called reputation, to every node in based on the following model:(1) | ||||

(2) | ||||

(3) |

Here, denotes the indicator function,

denotes the uniform distribution on interval

and is the set of*observed*nodes. In this model, the popularities reflected by control the outgoing connection probability of the nodes, while respecting the

*total order*imposed by . Moreover, the Beta prior parametrization in Eq. (2) is motivated by the Beta Process construction of [paisley10b], where Eq. (1) becomes the base distribution, and is convenient when evaluating the limit in Section 2.1. Also, and correspond to the usual parameters defining a Beta Process and the purpose of the new parameter is to control the popularity of the observable nodes and ensure a non-zero connection probability when required.

Under this model, the conditional probability of the adjacency matrix given the popularities and order values is:

(4) |

The adjacency matrix may contain connections for nodes that are not of interest, i.e. nodes that are not ancestors of any observable nodes. Formally, we define as the set of *active* nodes, which contains all observable nodes and the ones having a directed path ending at an observable node.

When solely considering connections from to , i.e. the adjacency submatrix of the -induced subgraph of , Eq. (4) simplifies to:

(5) |

where denotes the number of outgoing connections from node to any active nodes, denotes the number of active nodes having an order value strictly lower than and

. At this point, we marginalize out the popularity vector

in Eq. (5) with respect to the prior, by using the conjugacy of the Beta and Binomial distributions, and we get:

(6) | ||||

where is the Pochhammer symbol denoting the rising factorial and is the set of active hidden nodes.

The set of active nodes contains all observable nodes as well as their ancestors, which means there exists a part of the graph that is disconnected from . Let us denote by the set of *inactive* nodes. Considering that the -induced subgraph is effectively maximal, then this subgraph must be properly isolated by some envelope of no-connections containing only zeros. The joint probability of submatrices and is:

(7) |

where the number of negative Bernoulli trials depends on itself and . Notice that since the submatrices and contain uninteresting and unobserved binary events, they are trivially marginalized out of .

One way to simplify Eq. (2) is to marginalize out the order values of the inactive nodes with respect to (1). To do so, we first sort the active node orders ascendingly in vector and augment it with the extrema and , where we introduce to denote the number of active nodes. We slightly abuse notation here since these extrema do not refer to any nodes and are only used to compute interval lengths. This provides us with all relevant interval boundaries, including the absolute boundaries implied by Eq. (1). We refer to the ^{th} smallest value of this vector as . Based on the previous notation, the probability for an inactive node to lie between two active nodes is simply . Using this notation, we have the following marginal probability:

(8) | ||||

where denotes the number of inactive nodes and symbolizes the falling factorial. Due to the exchangeability of our model, the joint probability on both the adjacency matrix and active order values can cause problems regarding the index of the nodes. One way to simplify this is to reorder the adjacency matrix according to , which we denote . By using this many-to-one transformation, we obtain a probability distribution on an equivalence class of DAGs that is analog to the *lof* function used by [griffiths2006infinite]. The number of permutation mapping to this sorted representation is accounted for by the normalization constant .

### 2.1 From Finite to Infinite DAGs

An elegant way to construct Bayesian nonparametric models is to consider the infinite limit of finite parametric Bayesian models [orbanz2010bayesian]. Following this idea, we revisit the model of Section 2 so that now contains infinitely many nodes. To this end, we evaluate the limit as of Eq. (8), yielding the following probability distribution:

(9) |

where is the digamma function. Eq. (9) is the proposed marginal probability distribution on the joint space of infinite DAGs and continuous orders. The Indian Chefs Process^{1}^{1}1Described in the supplementary material along with some properties of the distribution. This content is not necessary to reproduce the experimental results. is a stochastic process for generating random DAGs from (9), making it an equivalent representative of this distribution.

### 2.2 The Indian Chefs Process

Now that we have the marginal probability distribution (9), we want to draw random active subgraphs from it. This section introduces the Indian chefs process (ICP), a stochastic process serving this purpose. In the ICP metaphor, chefs draw inspiration from other chefs, based on their *popularity* and *reputation*, to create the menu of their respective restaurant. This creates inspiration maps representable with directed acyclic graphs. ICP defines two types of chefs: 1) star chefs (corresponding to observable variables) which are introduced iteratively and 2) regular chefs (corresponding to hidden variables) which appear only when another chef selects them as a source of inspiration.

The ICP starts with an empty inspiration map as its initial state. The infinitely many chefs can be thought of as lying on a unit interval of reputations. Every chef has a fraction of the infinitely many chefs above him and this fraction is determined by the chef’s own reputation.

The general procedure at iteration is to introduce a new star chef, denoted , within a fully specified map of inspiration representing the connections of the previously processed chefs. The very first step is to draw a reputation value from to determine the position of the star chef in the reputation interval. Once chef is added, sampling the new inspiration connections is done in three steps.

#### Backward proposal

Step one consists in proposing star chef as an inspiration to all the chefs having a lower reputation than chef . To this end, we can first sample the total number of inspiration connections with:

(10) |

and then uniformly pick one of the possible configurations of inspiration connections.

#### Selecting existing chefs

In step two, chef considers any already introduced chefs of higher reputation. The probability for candidate chef to become an inspiration for is:

(11) |

where includes the currently processed chef .

#### Selecting new chefs

The third step allows chef to consider completely new regular chefs as inspirations in every single interval above . The number of new regular chefs to add in the ^{th} reputation interval above follows probability distribution:

(12) |

where the new regular chefs are independently assigned a random reputation drawn from . The regular chefs introduced during this step will be processed one by one using step two and three. Once all newly introduced regular chefs have been processed, the next iteration can begin with step one, a step reserved to star chefs only.

### 2.3 Some properties of the distribution

To better understand the effect of the hyperparameters on the graph properties, we performed an empirical study of some relations between the hyperparameters, the expected number of active nodes

and the expected number of active edges , where is the number of elements in . Figure 1 depicts level curves of for the case of only 1 observable placed at . The figure shows that several combinations of and leads to the same expected number of active nodes. Notice that fixing one hyperparameter, either or , and selecting the expected number of nodes, one can retrieve the second hyperparameter that matches the relationship. We used this fact in the construction of Figure 1 where the unshown parameter could be calculated. In Figure 1, we illustrate the effect of on which essentially shows that smaller values of increase the graph density.When using Bayesian nonparametric models, we are actually assuming that the generative model of the data is infinite-dimensional and that only a finite subset of the parameters are involved in producing a finite set of data. The effective number of parameters explaining the data corresponds to the model complexity and usually scales logarithmically with respect to the sample size. Unlike most Bayesian nonparametric models, the ICP prior scales according to the number of observed nodes added to the network. In Figure 1, we show how the expected number of active hidden nodes increases as function of the number of observable nodes.

### 2.4 Connection to the Indian Buffet Process

There exists a close connection between the Indian Chefs Process (ICP) and the Indian Buffet Process (IBP). In fact, our model can be seen as a generalization of the IBP. Firstly, all realizations of the IBP receive a positive probability under the ICP. Secondly, the two-parameter IBP [griffiths2011indian] is recovered, at least conceptually, when altering the prior on order values (see Eq. (1)) so that all observed nodes are set to reputation and all hidden nodes are set to reputation . This way, connections are prohibited between hidden nodes and between observable nodes, while hidden-to-observable connections are still permitted.

## 3 Structure Learning

In this section, we present a Markov Chain Monte Carlo (MCMC) algorithm approximating the exact ICP prior distribution with samples. We propose a reversible jump MCMC algorithm producing random walks on Eq. (9) [green2009reversible]. This algorithm works in three phases: the first resamples graph connections without adding or removing any nodes, the second phase is a birth-death process on nodes and the third one only involves the order.

The algorithm itself uses the notion of *singleton* and *orphan* nodes. A node is a singleton when it only has a unique active child. Thus, removing its unique connection would disconnect the node from the active subgraph. Moreover, a node is said to be an orphan if it does not have any parents.

#### Within model moves on adjacency matrix:

We begin by uniformly selecting a node from the active subgraph. Here, the set of parents to consider for comprises all non-singleton active nodes having an order value greater than . This set includes both current parents and candidate parents. Then, for each parent , we Gibbs sample the connections using the following conditional probability:

(13) |

where is the number of outgoing connections of node excluding connections going to node and has element removed. Also, all connections not respecting the order are prohibited and therefore have an occurrence probability of 0, along with (trans-dimentional) singleton parent moves.

#### Trans-dimensional moves on adjacency matrix:

We begin with a random uniform selection of node in the active subgraph and, with equal probability, propose either a birth or a death move.

In the birth case, we activate node by connecting it to node . Its order is determined by uniformly selecting an insertion interval above . Assuming node is also the ^{th} element in , we have possible intervals, including zero-length intervals. Let us assume that and are the two nodes between which is to be inserted. Then, we obtain the candidate order value of the new node by sampling . The Metropolis-Hastings acceptance ratio here is:

(14) |

where is the number of singleton-orphan parents of and is the number of active nodes above .

In the death case, we uniformly select one of the singleton-orphan parents of if and simply do nothing in case there exists no such node. Let be the parent to disconnect and consequently deactivate. The Metropolis-Hastings acceptance ratio for this move is:

(15) |

If accepted, node is removed from the active subgraph.

#### Moves on order values:

We re-sample the order value of randomly picked node . This operation is done by finding the lowest order valued parent of along with its highest order valued children, which we respectively denote and . Next, the candidate order value is sampled according to and accepted with Metropolis-Hasting acceptance ratio:

(16) |

which proposes a new total order respecting the partial order imposed by the rest of the DAG.

## 4 Experiments

The ICP distribution (9) can be used as a prior to learn the structure of any DAG-based model involving hidden units. In particular, one can introduce a priori knowledge about the structure by fixing the order values of some observed units. Feedforward neural networks, for instance, can be modelled by imposing for all input units and for the output units. On the other hand, generative models can be designed by placing all observed units at , preventing interconnections between them and forcing the above generative units to explain the data. In section 4.1, we use the ICP as a prior to learn the structures of a generative neural network by approximating the full posterior for 9 datasets. In section 4.2, we use the ICP to learn the structure of a convolutional neural network (CNN) in a Bayesian learning framework.

### 4.1 Bayesian nonparametric generative sigmoid network

The network used in this section is the Nonlinear Gaussian Belief Network (NLGBN) [frey1997continuous], which is basically a generative sigmoid network. In this model, the output of a unit depends on a weighted sum of its parents, where represents the weight of parent unit , indicates whether is a parent of and is a bias. The weighted sum is then corrupted by a zero mean Gaussian noise of precision , so that . The noisy preactivation

is then passed through a sigmoid nonlinearity, producing the output value

. It turns out that the density function of this random output can be represented in closed-form, a property used to form the likelihood function given the data. An ICP prior is placed on the structure represented by along with priors , and . To complete the prior on parameters, we specify , and .The inference is done with MCMC where structure operator are given in 3 and we refer to [adams2010a] for the parameter and activation operators. The Markov chain explores the space of structures by creating and destroying edge and nodes, which means that posterior samples are of varying size and shape, while remaining infinitely layered due to . We also simulate the random activations and add them into the chain state.

This experiment aims at reproducing the generative process of synthetic data sources. In the learning phase, we simulate the posterior distribution conditioned on 2000 training points. Fantasy data from the posterior are generated by first sampling a model from set of posterior network samples and then one point is generated from the selected model. Figure 2 shows 2000 test samples from the true distribution along with the samples generated from the posterior accounting for the model uncertainty.

Next, we compare the ICP (with observables at ) against other Bayesian nonparametric approaches: The Cascading Indian Buffet Process [adams2010a] and the Extended CIBP [dallaire2014learning]. The inference for these models was done with an MCMC algorithm similar to the one used for the ICP and we used similar priors for the parameters to ensure a fair comparison. The comparison metric used in this experiments is the Hellinger distance (HD), a function quantifying the similarity between two probability densities. Table 1 shows the HD’s between fantasy datasets generated and groundtruth.

Data set | Ring (2) | Two Moons (2) | Pinwheel (2) | Geyser (2) | Iris (4) | Yeast (8) | Abalone (9) | Cloud (10) | Wine (12) |
---|---|---|---|---|---|---|---|---|---|

ICP | 0.0402 | 0.0342 | 0.0547 | 0.0734 | 0.2666 | 0.3817 | 0.1379 | 0.1495 | 0.3629 |

CIBP | 0.0493 | 0.0469 | 0.0692 | 0.1246 | 0.2667 | 0.4056 | 0.1502 | 0.1713 | 0.4079 |

ECIBP | 0.0419 | 0.0450 | 0.0685 | 0.1171 | 0.2632 | 0.3840 | 0.1470 | 0.1501 | 0.3855 |

Baseline | 0.0312 | 0.0138 | 0.0436 | 0.0234 | 0.1930 | 0.3059 | 0.1079 | 0.1299 | 0.3387 |

### 4.2 Bayesian nonparametric convolutional neural networks

So far, we introduced the ICP as a prior on the space of directed acyclic graphs. In this section we will use this formalism in order to construct a prior on the space of convolutional neural architectures. The fundamental building blocks of (2D) convolutional networks are tensors

whose entries encode the presence of local features in the input image. A convolutional neural network can be described as a sequence of convolution operators acting on these tensors followed by entry-wise nonlinearity .In our nonparametric model, a convolutional network is constructed from a directed acyclic graph. Each node of the graph represents a tensor . The entries of this tensor are given by the following:

(17) |

where is a tensor of convolutional weights and is the discrete convolution operator. In most hand-crafted architectures, the spatial dimensions of the tensor are course-grained as the depth increases while the number of channels (each representing a local feature of the input) increases. In the ICP, the depth of a node is represented by its popularity . In order to encode the change of shape in the nonparametric prior, we set the number of channels to be a function of :

(18) |

where is the number of different possible tensor shapes and is the number of channels of the lowest layers. Similarly, the number of pixels is given by:

(19) |

where is the number of pixels in the original image. The shape of the weight tensors is determined by the shape of parent and child tensor.

In a classification problem, the nonparametric convolutional network is connected to the data through two observed nodes. The input node stores the input images and we set . On the other hand, for the output node we set , and have it receive input through fully connected layers:

(20) |

where is a tensor of weights. During sampling of a connection in the directed acyclic graph, the log acceptance ratio needs to be summed to the log marginal-likelihood ratio:

(21) |

In this paper, we use a point estimate of the log model evidence:

(22) |

where are the parameters of the network optimized using Adam (alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-08, eta=1.0) [kingma2014adam].

We performed architecture sampling on the MNIST dataset. For computational reasons, we restricted the dataset to the first three classes (1,2 and 3). We sampled the DAGs using the MCMC sampler introduced in section 3 with prior parameters , and . For each sampled DAG, we trained the induced convolutional architecture until convergence (540 iterations, batch size equal to 100). The number of bins in the partition of the range of the popularity was five and the number of channels of the first convolutional layer was four. We ran 15 independent Markov chains in order to sample from multiple modes. Each chain consisted of 300 accepted samples. After sampling, all chains were merged, resulting in a total of 4500 sampled architectures.

Figure 3

A shows accuracy and descriptive statistics of the sampled convolutional architectures. In all these statistics, we only considered nodes that receive input from the input node (directly or indirectly) as the remaining nodes do not contribute to the forward pass of the network. The network

*width*is quantified as the total number of directed paths between input and output nodes, while

*depth*is quantified as the maximal directed path length. The sampler reaches a wide range of different architectures, whose number of layers range from three to fifteen, and whose average degree range from one to four. Some examples of architectures are shown in Figure 3C. Interestingly, the correlation between the number of nodes, degree, width and depth and accuracy is very low. Most likely, this is due to the simple nature of the MNIST task. The ensemble accuracy (0.95), obtained by averaging the label probabilities over all samples, is higher that the average accuracy (0.91), but lower that the maximum accuracy (0.99). Figure 3

B shows the histograms of mean, standard deviation, minimum and maximum of the popularity values in the networks.

## 5 Conclusion and Future Work

This paper introduces the Indian Chefs Process (ICP), a Bayesian nonparametric distribution on the joint space of infinite directed acyclic graphs and orders. The model allows for a novel way of learning the structure of deep learning models. As a proof-of-concept, we have demonstrated how the ICP can be used to learn the architecture of convolutional deep networks trained on the MNIST data set. However. for more realistic applications, several efficiency improvements are required. Firstly, the inference procedure over the model parameters could be performed using stochastic Hamiltonian MCMC [ma2015complete]. This removes the need to fully train the network for every sampled DAG. Another possible improvement is to add deep learning specific sampling moves. For example, it is possible to include a increase-depth move that replaces a connection with a path comprised by two connections and a latent node. Future applications of the ICP may extend beyond deep learning architectures. For example, the ICP may serve as a basis for nonparametric causal inference, where a DAG structure is learned when the exact number of relevant variables is not known a priori, or when certain relevant input variables are not observed [Mohan2013].

Comments

There are no comments yet.