1 Introduction
Inference in probabilistic graphical models can be challenging in situations where there are a large number of hidden variables, each of which may take on one of several state values. The Expectation Maximization (EM) algorithm is widely applied for inference when hidden variables are present, however inference can quickly become intractable as the dimensionality of hidden states increases.
Expectation truncation (ET) (Lücke and Eggert, 2010) is a meta algorithm for accelerating EM learning in graphical models, which restricts the inference performed during the Estep to an “interesting” subset of states of the latent variables, chosen per data point according to a selection function. This subspace reduction can lead to a significant decrease in computation with very little loss of accuracy (compared with the full model). In previous work, functions to select states of high posterior mass were derived individually for each model of interest, e.g. by taking upper bounds or noiseless limits. (Lücke and Eggert, 2010; Shelton et al., 2012; Bornschein et al., 2013; Sheikh et al., 2014)
. The crucial underlying assumption remains that when EM has converged, the posterior mass is concentrated in small volumes of the latent state space. This property is observed to hold in many visual data, auditory data, and general pattern recognition settings.
For more complex graphical models, notably the hierarchical model of Dai and Lücke (2014), the design of suitable selection functions is extremely challenging: it requires both expert knowledge on the problem domain and considerable computational resources to implement (indeed, the design of such functions for particular problems has been a major contribution in previous work on the topic).
In the present work, we propose a generalization of the ET approach, where we avoid completely the challenge of problemspecific selection function design. Instead, we learn selection functions adaptively and nonparametrically from the data, while learning the model parameters simultaneously using EM. We emphasize that the selection function is used only to “guide” the underlying base inference algorithm to regions of high posterior probability, but is not itself used as an approximation to the posterior distribution. As such, the learned function does not have to be a completely accurate indication of latent variable predictivity, as long as the relative importance of states is preserved. We use Gaussian process regression
(Rasmussen and Williams, 2005) to learn the selection function, though other regression techniques could also be applied. The main advantage of GPs is that they have an analytic oneshot learning procedure, and that learning different functions based on the same inputs is computationally cheap, which makes adaptive learning of a changing target function efficient. We term this part of our approach GPselect. Our nonparametric generalization of ET may be applied as a blackbox meta algorithm for accelerating inference in generative graphical models, with no expert knowledge required.Our approach is the first to make ET a general purpose algorithm for discrete latents,whereas previously new versions of ET were required for every new latent variable model addressed. For instance, in Section 5.3 we will show that preselection is crucial for efficient inference in complex models. Although ET has already been successful in some models, this work shows that more complex models will crucially depend on an improved selection step and focuses on automating this step.
For empirical evaluation, we have applied GPselect in a number of experimental settings. First, we considered the case of sparse coding models (binary sparse coding, spikeandslab, nonlinear spikeandslab), where the relationship between the observed and latent variables is known to be complex and nonlinear.^{1}^{1}1Note that even when linear relations exist between the latents and outputs, a nonlinear regression may still be necessary in finding relevant variables, as a result of explaining away.
We show that GPselect can produce results with equal performance to the respective manuallyderived selection functions. Interestingly, we find it can be essential to use nonlinear GP regression in the spikeandslab case, and that simple linear regression is not sufficiently flexible in modeling the posterior shape. Second, we illustrate GPselect on a simple Gaussian mixture model, where we can explicitly visualize the form of the learned regression function. We find that even for a simple model, it can be be essential to learn a nonlinear mapping. Finally, we present results for a recently published hierarchical model for translation invariant occlusive components analysis
(Dai and Lücke, 2014). The performance of our inference algorithm matches that of the complex handengineered selection function of the previous work, while being straightforward to implement and having a far lower computational cost.2 Related work
The general idea of aiding inference in graphical models by learning a function that maps from the observed data to a property of the latent variables is quite old. Early work includes the Helmholtz machine (Dayan et al., 1995) and its bottomup connections trained using the wakesleep algorithm (Hinton et al., 1995)
. More recently, the idea has surfaced in the context of learning variational distributions with neural networks
(Kingma and Welling, 2014). A twostage inference has been discussed in the context of computer vision
(Yuille and Kersten, 2006) and neural inference (Körner et al., 1999). Recently, researchers (Mnih and Gregor, 2014) have generalized this idea to learning in arbitrary graphical models by training an “inference network” that efficiently implements sampling from the posterior distribution.GPs have recently been widely used to ”learn” the results of complicated models in order to accelerate inference and parameter selection. GP approximations have been used in lieu of solving complex partial differential equations
(Sacks et al., 1989; Currin et al., 1991), to learn datadriven kernel functions for recommendation systems (Schwaighofer et al., 2004), and recently for quantum chemistry (Rupp et al., 2012). Other work has used GPs to simplify computations in approximate Bayesian computation (ABC) methods: namely to model the likelihood function for inference (Wilkinson, 2014), to aid in making MetropolisHastings (MH) decisions (Meeds and Welling, 2014), and to model the discrepancies between simulated/observed data parameter space simplification (Gutmann and Corander, 2015). Recently, instead of the typical choice of GPs for large scale Bayesian optimization, neural networks have been used to learn an adaptive set of basis functions for Bayesian linear regression (Snoek et al., 2015).Our work follows the same high level philosophy in that we use GPs to approximate complex/intractable probabilistic models. None of the cited prior work address our problem setting, namely the selection of relevant latent variables by learning a nonparametric relevance function, for use in expectation truncation (ET).
3 Variable selection for accelerated inference
Notation. We denote the observed data by the matrix
, where each vector
is the th observation in a dimensional space. Similarly we define corresponding binary latent variables by the matrix where each is the th vector in the dimensional latent space, and for each individual hidden variable , the vector . Reduced latent spaces are denoted by , where . Note that although we restrict ourselves to binary latent variables here, the procedure could in principle be generalized to variables with higher cardinality (Exarchakis et al., 2012, e.g. see). We denote the prior distribution over the latent variables with and the likelihood of the data with . Using these expressions, the posterior distribution over latent variables is(1) 
3.1 Selection via Expectation Truncation in EM
Expectation Maximization (EM) is an iterative algorithm to optimize the model parameters of a given graphical model (see e.g. (Dempster et al., 1977; Neal and Hinton, 1998)). EM iteratively optimizes a lower bound on the data likelihood by inferring the posterior distribution over hidden variables given the current parameters (the Estep), and then adjusting the parameters to maximize the likelihood of the data averaged over this posterior (the Mstep). When the number of latent states to consider is large (e.g. exponential in the number of latent variables), the computation of the posterior distribution in the Estep becomes intractable and approximations are required.
Expectation truncation (ET) is a meta algorithm, which improves convergence of the expectation maximization (EM) algorithm (Lücke and Eggert, 2010). The main idea underlying ET is that the posterior probability mass is concentrated in a small subspace of the full latent space. This is the case, for instance, if for a given data point only a subset of the latent variables are relevant. Note, however, that the posterior may still be concentrated even when all latents are relevant, since most of the probability mass may be concentrated on few of these.
A selection function can be used to identify a subset of salient variables, denoted by where , which in turn is used to define a subset, denoted , of the possible state configurations of the space per data point. State configurations not in this space (of variables deemed to be nonrelevant) are fixed to (assigned zero probability mass). The posterior distribution (1) can then be approximated by a truncated posterior distribution, computed on the reduced support,
(2) 
where contains the latent states of the relevant variables for data point , and if and zero otherwise. In order words, Eq. (3.1) is proportional to Eq. (1) if (and zero otherwise). The set contains only states for which for all that are not selected, i.e. all states where for nonselected are assigned zero probability. The sum over is much more efficient than the sum for the full posterior, since it need only be computed over the reduced set of latent variable states deemed relevant: the state configurations of the irrelevant variables are fixed to be zero. The variable selection parameter H’ is selected based on compute resources available: i.e. as large as resources allow in order to be closer to true EM, although empirically it’s been shown that much smaller values suffice (see e.g. Sheikh et al., 2014, App. B on complexityaccuracy tradeoffs).
3.2 ET with affinity
One way of constructing a selection function is by first ranking the latent variables according to an affinity function which directly reflects the relevance of latent variable . A natural choice for such a function is the one that approximates the marginal posterior probability of each variable, e.g. we try to learn as follows:
(3) 
meaning that, for the relevant variables, the marginal posterior probability exceeds some threshold. See Figure 1 for a simplified illustration. When the latent variables in the marginal posterior probability are conditionally independent given a data point , this affinity function correctly isolates the most relevant variables in the posterior. Even when this strong assumption does not hold in practice (which is often the case), however, the affinity can still correctly highlight relevant variables, and has been empirically shown to be quite effective even when dependencies exist (see e.g. source separation tasks in (Sheikh et al., 2014)).
Next, using all from the affinity function , we define to simultaneously sort the indices of the latent variables in descending order and reduce the sorted set to the highest (most relevant) variables’ indices. To ensure that there is a nonzero probability of selecting each variable per EM iteration, of the indices are uniformly chosen from at random. This prevents the possible propagation of errors from continuously assigning small probabilities to a variable in early EM iterations. thus returns the selected variable indices deemed by the affinity to be relevant to the th data point. Finally, using the indices from , we define to return an dimensional subset of selected relevant latent states for each data point . All ‘nonrelevant’ variable states for all variables are effectively set to in Equation (3.1) by not being present in the state set .
Using , , and , we can define a selection function to select subsets per data point . Again, the goal is for the states to contain most of the probability mass and to be significantly smaller than the entire latent space. The affinity based selection function can be expressed as
(4) 
3.3 Inference in EM with selection
In each iteration of EM, the following occurs: prior to the Estep, the selection function in (4) is computed to select the most relevant states , which are then used to compute the truncated posterior distribution in (3.1). The truncated posterior can be computed using any standard inference method, such as exact inference or e.g. Gibbs sampling from if inference is still intractable or further computational reduction is desired. The result of the Estep is then used to update the model parameters in the Mstep.
4 GPSelect
In previous work, the selection function was a deterministic function derived individually for each model (see e.g. Shelton et al., 2011, 2012; Dai and Lücke, 2012a, b; Bornschein et al., 2013; Sheikh et al., 2014; Shelton et al., 2015). We now generalize the selection approach: instead of predefining the form of for variable selection, we want to learn it in a blackbox and modelfree way based on the data. We learn using Gaussian process (GP) regression (e.g. Rasmussen and Williams, 2005), which is a flexible nonparametric model and scales cubicly^{2}^{2}2If the scaling with is still too expensive, an incomplete Cholesky approximation is used, with cost linear in and quadratic in the rank of the approximation (see Section 5.3 for details). with the number of data points but linearly with the number of latent variables . We define the affinity function as being drawn from a Gaussian process model: , where is the covariance kernel, which can be flexibly parameterized to represent the relationship between variables. Again, we use to approximate the marginal posterior probability that . A nice property of Gaussian processes is that the kernel matrix
need only be computed once (until the kernel function hyperparameters are updated) to approximate
for the entire set of latent variables .Thus, prior to each Estep in each EM iteration, within each calculation of the selection function, we calculate the affinity using a GP to regress the expected values of the latent variables onto the observed data . Specifically, we train on from the previous EM iteration (where is equal to ), for training data of , where we recall that is the approximate posterior distribution for in Eq. (3.1). In the first EM iteration, the expectations are initialized randomly; in each subsequent EM iteration, the expectations w.r.t. the truncated posterior are used. The EM algorithm is run for iterations and the hyperparameters of the kernel are optimized by maximum likelihood every EM iterations.
For each data point and latent variable we compute the predicted mean of the GP by leaving this data point out of the training set and considering all others, which is called leaveoneout (LOO) prediction. It can be shown that this can be implemented efficiently (see Section 5.4.2 in Rasmussen and Williams, 2005), and we use this result to update the predicted affinity as follows:
(5) 
Equation (5) can be efficiently implemented for all latent variables and all data points using matrix operations, thereby requiring only one kernel matrix inversion for the entire dataset.
5 Experiments
We apply our GPselect inference approach to five different probabilistic generative models. First, we considered three sparse coding models (binary sparse coding, spikeandslab, nonlinear spikeandslab), where the relationship between the observed and latent variables is known to be complex and nonlinear. Second, we apply GPselect to a simple Gaussian mixture model, where we can explicitly visualize the form of the learned regression function. Finally, we apply our approach to a recent hierarchical model for translation invariant occlusive components analysis (Dai and Lücke, 2012a; Dai et al., 2013; Dai and Lücke, 2014).
5.1 Sparse coding models
Using handcrafted functions to preselect latent variables, a variety of sparse coding models have been successfully scaled to highdimensional latent spaces with use of selection (Henniges et al., 2010; Bornschein et al., 2013; Sheikh et al., 2014) and selection with Gibbs sampling (Shelton et al., 2011, 2012, 2015) inference approaches. In order to demonstrate our method, we consider three of these sparse generative models, and perform inference with our GPselect approach instead of a handcrafted selection function. The models are:
 A.

Binary sparse coding:
where denotes the dictionary elements and parameterizes the sparsity (see e.g. (Henniges et al., 2010)).
 B.

Spikeandslab sparse coding:
latents: observations: where the pointwise multiplication of the two latent vectors, i.e., generates a ‘spikeandslab’ distributed variable (), that has either continuous values or exact zero entries (e.g. (Titsias and LázaroGredilla, 2011; Goodfellow et al., 2013; Sheikh et al., 2014)).
 C.

Nonlinear Spikeandslab sparse coding:
latents: observations: for which the mean of the Gaussian for each is centered at , where is a nonlinearity that considers all latent components and takes the yielding the maximum value for (Lücke and Sahani, 2008; Shelton et al., 2012; Bornschein et al., 2013; Shelton et al., 2015), instead of centering the data at the linear combination of .
In the above models, inference with the truncated posterior of Equation (3.1) using manuallyderived selection functions has yielded results as good as or more robust than exact inference (converging less frequently to local optima than exact inference; see earlier references for details). For models A and C
, the selection function was the cosine similarity between the weights
(e.g. dictionary elements, components, etc.) associated with each latent variable and each data point : . For model B, the selection function was the data likelihood given a singleton state: , where represents a singleton state in which only the entry is nonzero.We generate data points consisting of observed dimensions and latent components according to each of the models AC: images of randomly selected overlapping ’bars’ with varying intensities for models B and C, and additive Gaussian noise parameterized by groundtruth and we choose , (e.g. following the spikeandslab prior). On average, each data point contains bars, i.e. groundtruth is , and we choose . With this choice, we can select sufficiently many latents for virtually all data points.
For each of the models considered, we run 10 repetitions of each of the following set of experiments: (1) selection using the respective handcrafted selection function, (2) GPselect using a linear covariance kernel, (3) GPselect using an RBF covariance kernel, and (4) GPselect using a kernel composed by adding the following kernels: RBF, linear, bias and white noise kernels, which we will term the
composition kernel. As hyperparameters of kernels are learned, the composition kernel (4) can adapt itself to the data and only use the kernel components required. See Rasmussen and Williams (2005, Chapter 4, Secton 4.2.4) for a discussion on kernel adaptation. Kernel parameters were modelselected via maximum marginal likelihood every EM iterations. For models A and B, inference was performed exactly using the truncated posterior (3.1), but as exact inference is analytically intractable in model C, inference was performed by drawing Gibbs samples from the truncated space (Shelton et al., 2011, 2012, 2015). We run all models until convergence.Results are shown in Figure 2. In all experiments, the GPselect approach was able to infer groundtruth parameters as well as the handcrafted function. For models where the cosine similarity was used (in A and C), GP regression with a linear kernel quickly learned the groundtruth parameters, and hence fewer iterations of EM were necessary. In other words, even without providing GPselect explicit weights as required for the handcrafted function, its affinity function using GP regression (5) learned a similar enough function to quickly yield identical results. Furthermore, in the model with a less straightforward handcrafted function (in the spikeandslab model of B), only GP regression with an RBF kernel was able to recover groundtruth parameters. In this case (model B), GPselect using an RBF kernel recovered the groundtruth ’bars’ in out of repetitions, whereas the handcrafted function recovered the bases in instances. For the remaining models, GPselect converged to the groundtruth parameters with the same average frequency as the handcrafted functions.
Finally, we have observed empirically that the composition kernel is flexible enough to subsume all other kernels: the variance of the irrelevant kernels dropped to zero in simulations. This suggests the composition kernel is a good choice for general use.
5.2 Gaussian mixture model
Next, we apply GPselect to a simple example, a Gaussian mixture model, where the flexibility of the approach can be easily and intuitively visualized. The model of the data likelihood is
(6) 
where is the number of mixture components; the task is to assign each data point to its latent cluster.
The training data used for GP regression was
, where the targets were the expected cluster responsibilities (posterior probability distribution for each cluster) for all data points,
, and we use onehot encoding for cluster identity. With this, we apply our GPselect approach to this model, computing the selection function according to Equation (
4) with affinity defined by GP regression (5) and following the approximate EM approach as in the previous experiments. In these experiments we consider two scenarios for EM learning of the data likelihood in Equation (6): GPselect with an RBF covariance kernel and a linear covariance kernel. We do not include the composition kernel suggested (based on experiments) in Section 4.1, as the goal of the current experiments is to show the effects of using the ’wrong’ kernel. These effects would further support the use of the flexible composition kernel in general, as it can subsume both kernels considered in the current experiments (RBF and linear).To easily visualize the output, we generate dimensional observed data () from clusters – first with randomly assigned cluster means, and second such that the means of the clusters lie roughly on a line. In the GPselect experiments, we select clusters from the full set, and run EM iterations for both kernel choices (linear and RBF). Note that for mixture models, the notation of selected clusters of the set is analogous to the selected latent variables from the full set, as described in the nonmixture model setting, and the GPselect algorithm proceeds unchanged. We randomly initialize the variance of the clusters and initialize the cluster means at randomly selected data points. Results are shown in Figure 3.
With cluster parameters initialized randomly on these data, the linear GP regression prediction cannot correctly assign the data to their clusters (as seen in Figure 3B), but the nonlinear approach successfully and easily finds the groundtruth clusters (Figure 3A). Furthermore, even when both approaches were initialized in the optimal solution, the cluster assignments from GPselect with a linear kernel quickly wandered away from the optimal solution and were identical to random initialization, converging to the same result shown in iteration 20 of Figure 3B). The RBF kernel cluster assignments remained at the optimal solution even with number of selected clusters set to .
These experiments demonstrate that the selection function needs to be flexible even for very simple models, and that nonlinear selection functions are an essential tool even in such apparently straightforward cases.
5.3 Translation Invariant Occlusive models
Now that we have verified that GPselect can be applied to various generative graphical models and converge to groundtruth parameters, we consider a more challenging model that addresses a problem in computer vision: translations of objects in a scene.
Model. Translation invariant models are particularly difficult to optimize because they must consider a massive latent variable space: evaluating multiple objects and locations in a scene leads a latent space complexity of the number of locations exponentiated by the number of objects. Inference in such a massive latent space heavily relies on the idea of variable selection to reduce the number of candidate objects and locations. In particular, handengineered selection functions that consider translational invariance have been successfully applied to this type of model (Dai and Lücke, 2012b, 2014; Dai et al., 2013). The selection function used so far for reducing latent space complexity in this model has been constructed as follows: first, the candidate locations of all the objects in the model are predicted, and then a subset of candidate objects that might appear in the image are selected according to the predicted locations. Next, the subset of states is constructed according to the combinatorics of the different locations of all the candidate objects. The posterior distribution is then computed following Equation (3.1).
This selection system is very costly: the selection function has parameters which need to be handtuned, e.g., the number of representative features, and it needs to scan through the entire image, considering all possible locations, which becomes computationally demanding for largescale experiments. In translation invariant models, instead of predicting the existence of a component, the selection function has to predict all possible locations a component could be. To maximally exploit the capabilities of GPselection function, we directly use the GP regression model to predict the possible locations of a component without introducing any knowledge of translation invariance into the selection function. In this work, a GP regression model is fitted from the input image to marginal posterior probabilities of individual components appearing at all possible locations. Therefore, the input to the GPselection function is the image to be inferred and the output is a score for each possible location of each component in the model. For example, when learning components in a pixel image patch, the output dimensionality of GPselect is . This task is computationally feasible, since GP models scale linearly with output dimensionality. The inference of components’ locations with GPselect is significantly faster than the selection function in the original work, as it avoids explicitly scanning through the image.
Although there are additional computations necessary for an automatic selection function like GPselect, for instance due to the adjustment of its parameters, there are many options to reduce computational costs. First, we may approximate the full Gram matrix by an incomplete Cholesky approximation (Fine and Scheinberg, 2001) resulting in a cost of , where is the rank of the Cholesky approximation. Second, we may reduce the frequency of kernel hyperparameter updates to only every EM iterations, where a represents a corresponding computation reduction. The combination of the Cholesky approximation plus infrequent updates will have the following benefits: a factor of five speedup for infrequent updates, and a factor of speedup from incomplete Cholesky, where is the rank of the Cholesky approximation and is the number of original data points.
COIL Dataset.
We apply our GPselection function to the Invariant Exclusive Component Analysis (InvECA) model (Dai and Lücke, 2012b; Dai et al., 2013). For our experiments, we consider an image dataset used in previous work: data were generated using objects from the COIL100 image dataset (Nene et al., 1996), taking different objects, downscaled to pixels and segmented out from the black background. A given image was generated by randomly selecting a subset of the objects, where each object has a probability of of appearing. The appearing objects were placed at random positions on a black image. When the objects overlap, they occlude each other with a different random depth order for each image. In total, images were generated in the dataset (examples shown in Figure 4).
The task of the InvECA model is to discover the visual components (i.e. the images of objects) from the image set without any label information. We compare the visual components learned by using four different selection functions in the InvECA model: the handcrafted selection function used in the original work by Dai and Lücke (2012b), GPselect updated every iteration, GPselect updated every iterations, and GPselect with incomplete Cholesky decomposition updated every iteration, or (in this manner we isolate the improvements due to Cholesky from those due to infrequent updates). In these experiments, the parameters of GPselect are optimized at the end of each EM iteration(s), using a maximum of gradient updates. The number of objects to be learned is and the algorithm preselects objects for each data point. The kernel used was the composition kernel, as suggested in Section 4.1, although after fitting the hyperparameters only the RBF kernel remained with large variance (i.e. a linear kernel alone would not have produced good variable selection, thus the flexible composition kernel was further shown to be a good choice).
Results.
All four versions of the InvECA model using each of the selection functions considered successfully recover each individual objects in our modified COIL image set. The learned object representations with GPselect are shown in Figure 5. Four additional components developed into representations, however these all had very low mask values, allowing them to easily be distinguished from other true components.
Next, we compare the accuracy of the four selection functions. For this, we collected the object locations (pixels) indicated by each selection function after all EM iterations, applied the selection functions (for the GP selection functions, this was using the final function learned after all EM iterations) to the entire image dataset again, then compared these results with the groundtruth location of all of the objects in the dataset. The accuracy of the predicted locations was then computed by comparing the distance of all groundtruth object location to the location of the top candidate locations from each selection function. See Figure 6 for a histogram of these distances and the corresponding accuracy for all selection functions. Note that the percentages in the histogram are plotted in log scale. Also, as a baseline verification, we computed and compared the pseudo log likelihood (Dai et al., 2013) of the original selection function to the three GPselect based ones. The pseudo log likelihood for all selection functions is shown in Figure 7. Figures 67 show that all four selection functions can very accurately predict the locations of all the objects in the dataset – the GPselect selection functions yields no loss in inference performance in comparison to the original handengineered selection function. Even those using speedconsiderate approximations (incomplete Cholesky decomposition of the kernel matrix (GP IChol) and updating kernel hyperparameters only every 5 EM iterations (GP every5)) have indistinguishable prediction accuracy on the task.
An analysis of the benefits indicate that, as GPselect avoids explicitly scanning through the image, the time to infer the location of an object is significantly reduced compared to the handcrafted function. GPselect requires seconds on a single CPU core to infer the locations of objects across the whole image set, while the handcrafted function requires seconds. In the original work, the selection function was implemented with GPU acceleration and parallelization. Although we must compute the kernel hyperparameters for GPselect, it is important to note that the hyperparameters need not be fit perfectly each iteration – for the purposes of our approach, a decent approximation suffices for excellent variable selection. In this experiment, updating the parameters of GPselect with gradient steps took about seconds for the fullrank kernel matrix. When we compute the incomplete Cholesky decomposition while inverting the covariance matrix, compute time was reduced to seconds (corresponding to the speedup, where is the rank of the Cholesky approximation), with minimal loss in accuracy. Furthermore, when updating the GPselect hyperparameters only every 5 iterations, average compute time was reduced by another one fifth, again without loss in accuracy.
6 Discussion
We have proposed a means of achieving fast EM inference in Bayesian generative models, by learning an approximate selection function to determine relevant latent variables for each observed variable. The process of learning the relevance functions is interleaved with the EM steps, and these functions are used in obtaining an approximate posterior distribution in the subsequent EM iteration. The functions themselves are learned via Gaussian process regression, and do not require domainspecific engineering, unlike previous selection functions. In experiments on mixtures and sparse coding models with interpretable output, the learned selection functions behaved in accordance with our expectations for the posterior distribution over the latents.
The significant benefit we show empirically is that by learning the selection function in a general and flexible nonparametric way, we can avoid using expensive handengineered selection functions. Cost reduction is both in terms of required expertise in the problem domain, and computation time in identifying the relevant latent variables. Inference using our approach required 22.1 seconds on a single CPU core, versus 1830.9 seconds with the original handcrafted function for the complex hierarchical model of (Dai et al., 2013).
A major area where further performance gains might be expected is in improving computational performance, since we expect the greatest advantages of GPselect to occur for complex models at large scale. For instance, kernel ridge regression may be parallelized
(Zhang et al., 2014), or the problem may be solved in the primal via random Fourier features (Le et al., 2013). Furthermore, there are many recent developments regarding the scaling up of GP inference to largescale problems, e.g., sparse GP approximation (Lawrence et al., 2002), stochastic variational inference (Hensman et al., 2013, 2012), using parallelization techniques and GPU acceleration (Dai et al., 2014), or in combination with stochastic gradient descent
(Bottou and Bousquet, 2008). For instance, for very large datasets where the main model is typically trained with minibatch learning, stochastic variational inference can be used for GP fitting as in (Hensman et al., 2013) and the kernel parameters can be efficiently updated each (or only every few) iteration with respect to a minibatch.Acknowledgments
We acknowledge funding by the German Research Foundation (DFG) under grant LU 1196/42 (JS), by the Cluster of Excellence EXC 1077/1 ”Hearing4all” (JL), and by the RADIANT and WYSIWYD (EU FP7ICT) projects (ZD).
References
 Bornschein et al. (2013) Bornschein, J., Henniges, M., and Lücke, J. (2013). Are V1 simple cells optimized for visual occlusions? A comparative study. PLoS Computational Biology, 9(6):e1003062.
 Bottou and Bousquet (2008) Bottou, L. and Bousquet, O. (2008). The tradeoffs of large scale learning. In Platt, J. C., Koller, D., Singer, Y., and Roweis, S. T., editors, Advances in Neural Information Processing Systems 20 (NIPS), Vancouver, British Columbia, Canada, pages 161–168. MIT Press.
 Currin et al. (1991) Currin, C., Mitchell, T., Morris, M., and Ylvisaker, D. (1991). Bayesian prediction of deterministic functions, with applications to the design and analysis of computer experiments. J. American Statistical Association, 86(416):953–963.
 Dai et al. (2014) Dai, Z., Damianou, A., Hensman, J., and Lawrence, N. (2014). Gaussian process models with parallelization and gpu acceleration. NIPS 2014, Workshop on Modern nonparametrics: automating the learning pipeline. In NIPS Workshop on Modern nonparametrics: automating the learning pipeline.
 Dai et al. (2013) Dai, Z., Exarchakis, G., and Lücke, J. (2013). What are the invariant occlusive components of image patches? a probabilistic generative approach. In Burges, C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger., K., editors, Advances in Neural Information Processing Systems, pages 243–251.
 Dai and Lücke (2012a) Dai, Z. and Lücke, J. (2012a). Autonomous cleaning of corrupted scanned documents – a generative modeling approach. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3338–3345.
 Dai and Lücke (2012b) Dai, Z. and Lücke, J. (2012b). Unsupervised learning of translation invariant occlusive components. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2400–2407.
 Dai and Lücke (2014) Dai, Z. and Lücke, J. (2014). Autonomous document cleaning – a generative approach to reconstruct strongly corrupted scanned texts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(10):1950–1962.
 Dayan et al. (1995) Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. (1995). The helmholtz machine. Neural Computation, 7:889–904.
 Dempster et al. (1977) Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society B, 39:1–38.
 Exarchakis et al. (2012) Exarchakis, G., Henniges, M., Eggert, J., and Lücke, J. (2012). Ternary sparse coding. In LVA/ICA, Lecture Notes in Computer Science, pages 204–212. Springer.

Fine and Scheinberg (2001)
Fine, S. and Scheinberg, K. (2001).
Efficient SVM training using lowrank kernel representations.
Journal of Machine Learning Research
, 2:243–264.  Goodfellow et al. (2013) Goodfellow, I. J., Courville, A., and Bengio, Y. (2013). Scaling up spikeandslab models for unsupervised feature learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1902–1914.
 Gutmann and Corander (2015) Gutmann, M. U. and Corander, J. (2015). Bayesian Optimization for LikelihoodFree Inference of SimulatorBased Statistical Models. Technical report, University of Helsinki. http://arxiv.org/abs/1501.03291.
 Henniges et al. (2010) Henniges, M., Puertas, G., Bornschein, J., Eggert, J., and Lücke, J. (2010). Binary Sparse Coding. In Proceedings LVA/ICA, LNCS 6365, pages 450–57. Springer.

Hensman et al. (2013)
Hensman, J., Fusi, N., and Lawrence, N. (2013).
Gaussian processes for big data.
In
Proceedings of the TwentyNinth Conference on Uncertainty in Artificial Intelligence, UAI 2013, Bellevue, WA, USA, August 1115, 2013
.  Hensman et al. (2012) Hensman, J., Rattray, M., and Lawrence, N. D. (2012). Fast Variational Inference in the Conjugate Exponential Family. In Bartlett, P. L., Pereira, F. C. N., Burges, C. J. C., Bottou, L., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 25 (NIPS), Lake Tahoe, Nevada, United States, pages 2897–2905.
 Hinton et al. (1995) Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995). The ‘wakesleep’ algorithm for unsupervised neural networks. Science, 268:1158 – 1161.
 Kingma and Welling (2014) Kingma, D. P. and Welling, M. (2014). Efficient gradientbased inference through transformations between bayes nets and neural nets. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 2126 June 2014, pages 1782–1790.
 Körner et al. (1999) Körner, E., Gewaltig, M. O., Körner, U., Richter, A., and Rodemann, T. (1999). A model of computation in neocortical architecture. Neural Networks, 12:989 – 1005.
 Lawrence et al. (2002) Lawrence, N., Seeger, M., and Herbrich, R. (2002). Fast sparse gaussian process methods: The informative vector machine. In Advances in Neural Information Processing Systems 15 (NIPS), Vancouver, British Columbia, Canada, pages 609–616. MIT Press.
 Le et al. (2013) Le, Q., Sarlos, T., and Smola, A. J. (2013). Fastfood — computing hilbert space expansions in loglinear time. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 1621 June 2013, pages 244–252.
 Lücke and Eggert (2010) Lücke, J. and Eggert, J. (2010). Expectation truncation and the benefits of preselection in training generative models. Journal of Machine Learning Research, 11:2855–900.
 Lücke and Sahani (2008) Lücke, J. and Sahani, M. (2008). Maximal causes for nonlinear component extraction. Journal of Machine Learning Research, 9:1227–67.
 Meeds and Welling (2014) Meeds, E. and Welling, M. (2014). GPSABC: gaussian process surrogate approximate bayesian computation. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence (UAI), Quebec City, Quebec, Canada, July 2327, 2014, pages 593–602.
 Mnih and Gregor (2014) Mnih, A. and Gregor, K. (2014). Neural variational inference and learning in belief networks. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 2126 June 2014, pages 1791–1799.
 Neal and Hinton (1998) Neal, R. and Hinton, G. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In Jordan, M. I., editor, Learning in Graphical Models. Kluwer.
 Nene et al. (1996) Nene, S. A., Nayar, S. K., and Murase, H. (1996). Columbia object image library (coil100). Technical report, CUCS00696.
 Rasmussen and Williams (2005) Rasmussen, C. E. and Williams, C. K. I. (2005). Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press.
 Rupp et al. (2012) Rupp, M., Tkatchenko, A., Müller, K.R., and von Lilienfeld, O. A. (2012). Fast and accurate modeling of molecular atomization energies with machine learning. Phys. Rev. Lett., 108:058301.
 Sacks et al. (1989) Sacks, J., Welch, W. J., Mitchell, T. J., and Wynn, H. P. (1989). Design and analysis of computer experiments. Statist. Sci., 4(4):433–435.
 Schwaighofer et al. (2004) Schwaighofer, A., Tresp, V., and Yu, K. (2004). Learning gaussian process kernels via hierarchical bayes. In Advances in Neural Information Processing Systems 17 (NIPS), Vancouver, British Columbia, Canada, pages 1209–1216. MIT Press.
 Sheikh et al. (2014) Sheikh, A.S., Shelton, J., and Lücke, J. (2014). A truncated EM approach for spikeandslab sparse coding. Journal of Machine Learning Research, 15:2653–2687.
 Shelton et al. (2011) Shelton, J., Bornschein, J., Sheikh, A.S., Berkes, P., and Lücke, J. (2011). Select and sample  A model of efficient neural inference and learning. In Advances in Neural Information Processing Systems 24 (NIPS), Granada, Spain., pages 2618–2626.
 Shelton et al. (2012) Shelton, J., Sterne, P., Bornschein, J., Sheikh, A.S., and Lücke, J. (2012). Why MCA? nonlinear sparse coding with spikeandslab prior for neurally plausible image encoding. In Advances in Neural Information Processing Systems 25 (NIPS), Lake Tahoe, Nevada, United States, pages 2285–2293.
 Shelton et al. (2015) Shelton, J. A., Sheikh, A.S., Bornschein, J., Sterne, P., and Lücke, J. (2015). Nonlinear spikeandslab sparse coding for interpretable image encoding. PLoS ONE, 10(5):1–25.
 Snoek et al. (2015) Snoek, J., Rippel, O., Swersky, K., Kiros, R., Satish, N., Narayanan Sundaram, M., Ali, M., Patwary, P., and Adams, R. (2015). Scalable bayesian optimization using deep neural networks. Technical report, Harvard University. http://arxiv.org/abs/1502.05700.
 Titsias and LázaroGredilla (2011) Titsias, M. and LázaroGredilla, M. (2011). Spike and slab variational inference for multitask and multiple kernel learning. In Advances in Neural Information Processing Systems 24 (NIPS), Granada, Spain, pages 2510–2518.
 Wilkinson (2014) Wilkinson, R. D. (2014). Accelerating ABC methods using gaussian processes. Technical report, University of Sheffield. http://arxiv.org/abs/1401.1436.

Yuille and Kersten (2006)
Yuille, A. and Kersten, D. (2006).
Vision as Bayesian inference: analysis by synthesis?
Trends in Cognitive Sciences, 10(7):301–308.  Zhang et al. (2014) Zhang, Y., Duchi, J. C., and Wainwright, M. J. (2014). Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. Technical report, University of California, Berkeley. http://arxiv.org/abs/1305.5029.