Mapping Energy Landscapes of Non-Convex Learning Problems

10/02/2014, by Maria Pavlovskaia et al.

In many statistical learning problems, the target functions to be optimized are highly non-convex in various model spaces and thus are difficult to analyze. In this paper, we compute Energy Landscape Maps (ELMs), which characterize and visualize an energy function with a tree structure in which each leaf node represents a local minimum and each non-leaf node represents the barrier between adjacent energy basins. The ELM also associates each node with the estimated probability mass and volume of the corresponding energy basin. We construct ELMs by adopting the generalized Wang-Landau algorithm and multi-domain sampler, which simulates a Markov chain traversing the model space by dynamically reweighting the energy function. We construct ELMs in the model spaces of two classic statistical learning problems: (i) clustering with Gaussian mixture models or Bernoulli templates; and (ii) bi-clustering. We propose a way to measure the difficulty (or complexity) of these learning problems and study how various conditions, such as the separability of the clusters, the number of examples, and the level of supervision, affect the landscape complexity; we also visualize the behaviors of different algorithms, such as K-means, EM, two-step EM and Swendsen-Wang cuts, in the energy landscapes.


1 Introduction

In many statistical learning problems, the energy functions to be optimized are highly non-convex. A large body of research has been devoted either to approximating the target function by convex optimization, such as replacing the $\ell_0$ norm with the $\ell_1$ norm in regression, or to designing algorithms that find a good local optimum, such as the EM algorithm for clustering. Much less work has been done on analyzing the properties of such non-convex energy landscapes.

In this paper, inspired by the success of visualizing the landscapes of Ising and spin-glass models by Becker and Karplus (1997) and Zhou (2011a), we compute Energy Landscape Maps (ELMs) in the high-dimensional model spaces (i.e. the hypothesis spaces in the machine learning literature) of some classic statistical learning problems: clustering and bi-clustering.

Figure 1: An energy function and the corresponding Energy Landscape Map (ELM). The y-axis of the ELM is the energy level, each leaf node is a local minimum and the leaf nodes are connected at the ridges of their energy basins.

The ELM is a tree structure, as Figure 1 illustrates, in which each leaf node represents a local minimum and each non-leaf node represents the barrier between adjacent energy basins. The ELM characterizes the energy landscape with the following information.

  • The number of local minima and their energy levels;

  • The energy barriers between adjacent local minima and their energy levels; and

  • The probability mass and volume of each local minimum (See Figure 3).

Such information is useful in the following tasks.

  1. Analyzing the intrinsic difficulty (or complexity) of the optimization problems, for either inference or learning tasks. For example, in bi-clustering, we divide the problem into the easy, hard, and impossible regimes under different conditions.

  2. Analyzing the effects of various conditions on the ELM complexity, for example, the separability in clustering, the number of training examples, the level of supervision (i.e. the percentage of examples that are labeled), and the strength of regularization (i.e. the prior model).

  3. Analyzing the behavior of various algorithms by showing their frequencies of visiting the various minima. For example, in the multi-Gaussian clustering problem, we find that when the Gaussian components are highly separable, K-means clustering works better than the EM algorithm Dempster, Laird and Rubin (1977), and the opposite is true when the components are less separable. In contrast to the frequent visits to local minima by K-means and EM, the Swendsen-Wang cut method Barbu and Zhu (2007) converges to the global minimum under all separability conditions.

Figure 2: (a) Energy Landscape for a 4-component 1-d GMM with all parameters fixed except two means. Level sets are highlighted in red. The local minima are shown in yellow dots and the first 200 MCMC samples are shown in black dots. (b) The resulting ELM and the correspondence between the leaves and the local minima from the energy landscape.
Figure 3: (a) The probability mass and (b) the volume of the energy basins for the 2-d landscape shown in Figure 2.

We start with a simple illustrative example in Figures 2 and 3. Suppose the underlying probability distribution is a 4-component Gaussian mixture model (GMM) in 1D space, and the components are well separated. The model space is 11-dimensional, with parameters $(\alpha_i, \mu_i, \sigma_i)$, $i = 1, \dots, 4$, denoting the weights, means and variances of the components (the weights contribute three free dimensions since they sum to one). We sampled data points from the GMM and constructed the ELM in the model space. We bound the model space to a finite range defined by the samples.

As we can only visualize 2D maps, we set all parameters equal to their true values except two of the component means, which are kept as the unknowns. Figure 2(a) shows the energy map over a bounded range of these two means. The asymmetry in the landscape is caused by the fact that the true model has different weights for the first and second components. Some shallow local minima, like E, F, G, H, are little "dents" caused by the finite data sample.

Figure 2(a) shows that all the local minima are identified. Additionally, it shows the first 200 MCMC samples that were accepted by the algorithm that we will discuss later. The samples are clustered around the local minima and cover all energy basins. They are not present in the high-energy areas away from the local minima, as desired. Figure 2(b) shows the resulting ELM and the correspondence between the leaves and the local minima in the energy landscape. Furthermore, Figures 3(a) and (b) show the probability mass and the volume of these energy basins.

In the literature, Becker and Karplus (1997) present the first work on visualizing multidimensional energy landscapes for the spin-glass model. Since then, statisticians have developed a series of MCMC methods for improving the efficiency of sampling algorithms traversing the state space. Most notably, Liang (2005a, b) generalizes the Wang-Landau algorithm Wang and Landau (2001) for random walks in the state space. Zhou (2011a) uses the generalized Wang-Landau algorithm to plot the disconnectivity graph of the Ising model with hundreds of local minima and proposes an effective way of estimating the energy barriers. Furthermore, Zhou and Wong (2008) construct the energy landscape for Bayesian inference of DNA sequence segmentation by clustering Monte Carlo samples.

In contrast to the above works that compute the landscapes in "state" spaces for inference problems, our work focuses on the landscapes in "model" spaces (the sets of all models, also called hypothesis spaces in the machine learning community) for statistical learning and model estimation problems. There are some new issues in plotting model-space landscapes. (i) Many of the basins have a flat bottom, for example basin A in Figure 2(a); this may result in a large number of false local minima. (ii) There are constraints among some parameters, for example the weights have to sum to one, $\sum_i \alpha_i = 1$; thus we may need to run our algorithm on a manifold.

2 ELM construction

In this section, we introduce the basic ideas for constructing the ELM and estimating its properties: mass, volume and complexity.

2.1 Space partition

Let $\Omega$ be the model space over which a probability distribution $p(x)$ and an energy $E(x) = -\log p(x)$ are defined. In this paper, we assume $\Omega$ is bounded using properties of the samples. $\Omega$ is partitioned into disjoint subspaces $D_1, \dots, D_K$ which represent the energy basins,

$$\Omega = \bigcup_{i=1}^{K} D_i, \qquad D_i \cap D_j = \emptyset \ \ \forall i \neq j. \qquad (1)$$

That is, any point $x \in D_i$ will converge to the same local minimum through gradient descent.

Figure 4: The model space $\Omega$ is partitioned into energy basins (along the x-axis), and the energy (the y-axis) is partitioned into uniform intervals $[u_j, u_{j+1})$.

As Figure 4 shows, the energy range is also partitioned into $L$ intervals $[u_1, u_2), \dots, [u_L, u_{L+1})$. Thus we obtain a set of bins as the quantized atomic elements in the product space of basins and energy intervals,

$$B_{ij} = \{x : x \in D_i,\ E(x) \in [u_j, u_{j+1})\}, \qquad i = 1, \dots, K,\ \ j = 1, \dots, L. \qquad (2)$$

The number of basins $K$ and the number of intervals $L$ are unknown and have to be estimated during the computing process in an adaptive and iterative manner.

2.2 Generalized Wang-Landau algorithm

The objective of the generalized Wang-Landau (GWL) algorithm is to simulate a Markov chain that visits all the bins with equal probability, and thus to effectively reveal the structure of the landscape.

Let $\phi$ be the mapping between the model space and the bin indices: $\phi(x) = (i, j)$ if $x \in B_{ij}$. Given any $x$, by gradient descent or its variants we can find and record the basin $D_i$ that it belongs to, compute its energy level $E(x)$, and thus find the index $\phi(x)$.

We define $\beta_{ij}$ to be the probability mass of a bin,

$$\beta_{ij} = \int_{B_{ij}} p(x)\, dx. \qquad (3)$$

Then, we can define a new probability distribution which has equal probability among all the bins,

$$p'(x) = \frac{1}{Z}\, \frac{p(x)}{\beta_{\phi(x)}}, \qquad (4)$$

with $Z$ being a scaling constant.

To sample from $p'(x)$, one can estimate $\beta_{ij}$ by a variable $\gamma_{ij}$. We define the working probability function to be

$$p_\gamma(x) \propto \frac{p(x)}{\gamma_{\phi(x)}}.$$

We start with an initial $\gamma^{(0)}$, and update $\gamma^{(t)}$ iteratively using stochastic approximation Liang, Liu and Carroll (2007). Suppose $x_t$ is the MCMC state at time $t$; then $\gamma^{(t)}$ is updated at an exponential rate,

$$\gamma^{(t+1)}_{ij} = \gamma^{(t)}_{ij} \cdot \exp\{\eta_t\, \mathbf{1}(x_t \in B_{ij})\}, \qquad (5)$$

where $\eta_t$ is the step size at time $t$. The step size is decreased over time, and the decreasing schedule is either pre-determined as in Liang, Liu and Carroll (2007) or determined adaptively as in Zhou (2011b).

Each iteration with a given $\gamma$ uses a Metropolis-Hastings step. Let $Q(x, y)$ be the proposal probability for moving from $x$ to $y$; then the acceptance probability is

$$\alpha(x, y) = \min\left(1,\ \frac{Q(y, x)\, p(y)\, \gamma_{\phi(x)}}{Q(x, y)\, p(x)\, \gamma_{\phi(y)}}\right). \qquad (6)$$

Intuitively, if bin $\phi(y)$ has accumulated a larger weight $\gamma_{\phi(y)}$ than bin $\phi(x)$, then the probability of visiting $y$ is reduced. For the purpose of exploring the energy landscape, the GWL algorithm improves upon conventional methods such as simulated annealing Geyer and Thompson (1995) and simulated tempering Marinari and Parisi (1992). The latter sample from the tempered distribution $p(x)^{1/T}$ and do not visit the bins with equal probability even at a high temperature $T$.
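To make the procedure concrete, here is a minimal, self-contained sketch of the GWL loop on a toy one-dimensional energy with two basins. The energy function, bin layout, proposal width and step-size schedule are illustrative choices, not the settings used in the paper, and the basin assignment is hard-coded instead of being computed by gradient descent.

```python
import numpy as np

# Toy 1-D energy with two basins; stands in for E(x) on the model space.
def energy(x):
    return 0.5 * (x**2 - 1.0)**2

def basin_of(x):
    # In the paper the basin is found by gradient descent; in this 1-D toy
    # example the sign of x identifies the basin directly.
    return 0 if x < 0 else 1

def bin_index(x, u_grid):
    # A bin is a (basin, energy interval) pair, cf. Equation (2).
    j = int(np.searchsorted(u_grid, energy(x)))
    return basin_of(x), min(j, len(u_grid) - 1)

def gwl(n_iter=20000, eta0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    u_grid = np.linspace(0.0, 5.0, 50)        # energy intervals [u_j, u_{j+1})
    log_gamma = np.zeros((2, len(u_grid)))    # log of the bin weights gamma_ij
    x = rng.uniform(-2.0, 2.0)
    for t in range(n_iter):
        eta = eta0 / (1.0 + t / 1000.0)       # decreasing step size (illustrative schedule)
        y = x + rng.normal(scale=0.25)        # symmetric random-walk proposal
        bx, by = bin_index(x, u_grid), bin_index(y, u_grid)
        # Acceptance ratio of the reweighted target p(x)/gamma_{phi(x)}, cf. Equation (6).
        log_alpha = (energy(x) - energy(y)) + (log_gamma[bx] - log_gamma[by])
        if np.log(rng.uniform()) < log_alpha:
            x, bx = y, by
        log_gamma[bx] += eta                  # Equation (5): exponential-rate update
    return log_gamma

if __name__ == "__main__":
    lg = gwl()
    w = np.exp(lg - lg.max())                 # normalized weights approximate beta_ij
    print("estimated relative mass per basin:", w.sum(axis=1) / w.sum())
```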

Figure 5: First two steps of projected gradient descent. The algorithm is initialized at an MCMC sample; at each step the gradient of the energy is computed at the current point, Armijo line search is used to determine the step size along the negative gradient, and the result is projected back onto the constraint subspace.

In performing gradient descent, we employ Armijo line search to determine the step size; if the model space is a manifold embedded in a Euclidean space, we perform projected gradient descent, as shown in Figure 5. To avoid erroneously identifying multiple local minima within the same basin (especially when there are large flat regions), we merge local minima identified by gradient descent based on the following criteria: (1) the distance between two local minima is smaller than a constant; or (2) there is no barrier along the straight line between the two local minima.
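The following sketch shows gradient descent with Armijo backtracking and a projection step, using the probability simplex (for mixture weights) as an example constraint set. The Armijo constants and tolerances are illustrative defaults, not the paper's settings.

```python
import numpy as np

def armijo_projected_descent(x0, energy, grad, project,
                             step0=1.0, shrink=0.5, c=1e-4,
                             tol=1e-6, max_iter=500):
    """Gradient descent with Armijo backtracking and a projection step.

    `project` maps a point back onto the constraint set; the Armijo constants
    and tolerances are illustrative defaults, not the paper's settings.
    """
    x = project(np.asarray(x0, dtype=float))
    for _ in range(max_iter):
        g = grad(x)
        step = step0
        # Backtrack until the Armijo sufficient-decrease condition holds.
        while energy(project(x - step * g)) > energy(x) - c * step * g.dot(g):
            step *= shrink
            if step < 1e-12:
                return x          # no admissible step: treat x as a local minimum
        x_new = project(x - step * g)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Example constraint set: the probability simplex (for mixture weights).
def project_simplex(w):
    u = np.sort(w)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(w) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(w - theta, 0.0)

if __name__ == "__main__":
    target = np.array([0.7, 0.3])
    E = lambda w: float(np.sum((w - target)**2))
    dE = lambda w: 2.0 * (w - target)
    print(armijo_projected_descent([0.5, 0.5], E, dE, project_simplex))
```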

Figure 6 (a) illustrates a sequence of Markov chain states over two energy basins. The dotted curves are the level sets of the energy function.

Figure 6: Sequential MCMC samples. For each sample, we perform gradient descent to determine which energy basin the sample belongs to. If two consecutive samples fall into different basins, we estimate or update the upper bound of the energy barrier between their respective basins.

2.3 Constructing the ELM

Suppose we have collected a chain of samples from the GWL algorithm. The ELM construction consists of the following two processes.

Figure 7: The ridge descent algorithm is used for estimating the energy barrier between two adjacent basins, initialized at a pair of consecutive MCMC samples that fall in the two different basins.

1. Finding the energy barriers between adjacent basins. For each pair of adjacent basins $D_i$ and $D_j$, we collect all pairs of consecutive MCMC states that move across the two basins,

$$\Omega_{ij} = \{(x_t, x_{t+1}) : x_t \in D_i,\ x_{t+1} \in D_j \ \text{or}\ x_t \in D_j,\ x_{t+1} \in D_i\}, \qquad (7)$$

and we choose the pair in $\Omega_{ij}$ with the lowest energy. Next, as Figure 7 illustrates, we iterate a ridge descent step that replaces each point of the pair with the lowest-energy point in its neighborhood that remains in the same basin, until the two points converge at the ridge between the basins. The neighborhood is defined by an adaptive radius. The converged pair then identifies the energy barrier, and its energy level is the energy level of the barrier. A discrete version of this ridge descent method was used in Zhou (2011a).

2. Constructing the tree structure. The tree structure of the ELM is constructed from the set of energy basins and the energy barriers between them via an iterative algorithm modified from hierarchical agglomerative clustering. Initially, the energy basins are represented by unconnected leaf nodes whose y-coordinates are the energy levels of the local minima of the basins. In each iteration, the two nodes representing the energy basins with the lowest barrier between them are connected by a new parent node, whose y-coordinate is the energy level of that barrier; the two basins are then regarded as merged, and the energy barrier between the merged basin and any other basin is simply the lower of the two original barriers. When all the energy basins are merged, we obtain the complete tree structure. For clarity, we can remove from the tree basins whose depth is less than a constant.
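A minimal sketch of this agglomerative construction is given below; the basin identifiers, input format and the tuple-based tree representation are of our own choosing, for illustration only.

```python
def build_elm_tree(minima, barriers):
    """Agglomerative construction of the ELM tree (a sketch).

    minima:   dict basin_id -> energy of the local minimum (leaf y-coordinate)
    barriers: dict frozenset({i, j}) -> estimated barrier energy between basins i and j
    Returns a nested tuple (barrier_energy, left_subtree, right_subtree);
    leaves are (min_energy, basin_id).
    """
    trees = {i: (e, i) for i, e in minima.items()}       # start with unconnected leaves
    barriers = dict(barriers)
    while len(trees) > 1:
        # Pick the pair of current (possibly merged) basins with the lowest barrier.
        pair, b = min(((p, v) for p, v in barriers.items()
                       if all(q in trees for q in p)), key=lambda kv: kv[1])
        i, j = tuple(pair)
        new_id = f"{i}+{j}"
        trees[new_id] = (b, trees.pop(i), trees.pop(j))  # parent node at the barrier level
        # Barrier to any other basin k is the lower of the barriers from i and j to k.
        for k in list(trees):
            if k == new_id:
                continue
            cand = [barriers[frozenset({m, k})] for m in (i, j)
                    if frozenset({m, k}) in barriers]
            if cand:
                barriers[frozenset({new_id, k})] = min(cand)
    return next(iter(trees.values()))

if __name__ == "__main__":
    minima = {"A": -3.0, "B": -2.5, "C": -1.0}
    barriers = {frozenset({"A", "B"}): -0.5,
                frozenset({"A", "C"}): 1.0,
                frozenset({"B", "C"}): 0.8}
    print(build_elm_tree(minima, barriers))
```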

2.4 Estimating the mass and volume of nodes in the ELM

In the ELM, we can estimate the probability mass and the volume of each energy basin. When the algorithm converges, the normalized value of $\gamma_{ij}$ approaches the probability mass of bin $B_{ij}$:

$$\hat{\beta}_{ij} = \frac{\gamma_{ij}}{\sum_{k,l} \gamma_{kl}} \approx \beta_{ij}.$$

Therefore the probability mass of a basin $D_i$ can be estimated by

$$\hat{P}(D_i) = \sum_{j=1}^{L} \hat{\beta}_{ij}. \qquad (8)$$

Suppose the energy is partitioned into sufficiently small intervals of size $du$. Based on the probability mass, we can then estimate the size (note that the size of a bin/basin in the model space is called its volume by Zhou and Wong (2008), but here we use the term "volume" to denote the capacity of a basin in the energy landscape) of the bins and basins in the model space $\Omega$. A bin $B_{ij}$ with energy interval $[u_j, u_j + du)$ can be seen as having energy $u_j$ and probability density $\frac{1}{Z} e^{-u_j}$ ($Z$ is a normalization factor). The size of bin $B_{ij}$ can be estimated by

$$\hat{S}(B_{ij}) = \frac{\hat{\beta}_{ij}}{\frac{1}{Z} e^{-u_j}} = Z\, \hat{\beta}_{ij}\, e^{u_j}.$$

The size of basin $D_i$ can be estimated by

$$\hat{S}(D_i) = \sum_{j=1}^{L} \hat{S}(B_{ij}). \qquad (9)$$

Further, we can estimate the volume of a basin in the energy landscape, which is defined as the amount of space contained in the basin in the product space of $\Omega$ and the energy,

$$\hat{V}(D_i) = \sum_{j} \hat{S}(B_{ij})\,(u_{\max} - u_j), \qquad (10)$$

where the upper energy level $u_{\max}$ depends on the definition of the basin. In a restricted definition, the basin only includes the volume under the closest barrier, as Figure 8 illustrates; the volume above two adjacent basins is shared by them and lies between their two energy barriers. Thus we define the volume of a non-leaf node in the ELM to be the sum of the volumes of its children plus the volume between their barriers.
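The estimates in Equations (8)-(10) can be computed directly from the learned bin weights; the following sketch assumes the weights are stored as a (basin x interval) array and returns sizes and volumes up to the unknown normalization constant Z.

```python
import numpy as np

def basin_statistics(log_gamma, u, u_top):
    """Estimate mass, size and volume of each basin from the learned bin weights.

    log_gamma : array (K basins, L intervals), log of the GWL weights gamma_ij
    u         : array (L,), lower edge u_j of each (small) energy interval
    u_top     : array (K,), upper energy level of each basin, e.g. its closest barrier
    The normalization constant Z of the density is unknown, so sizes and
    volumes are returned up to that common factor (cf. Equations (8)-(10)).
    """
    # Normalized gamma approximates the probability mass beta_ij of each bin.
    beta = np.exp(log_gamma - log_gamma.max())
    beta /= beta.sum()
    mass = beta.sum(axis=1)                              # Equation (8)
    # Size of a bin in the model space: mass divided by the density exp(-u_j)/Z.
    size_bins = beta * np.exp(u)[None, :]
    size = size_bins.sum(axis=1)                         # Equation (9), up to Z
    # Volume in the (model space x energy) product: each bin contributes its
    # size times the energy range above it and below the basin's barrier.
    volume = (size_bins * np.clip(u_top[:, None] - u[None, :], 0.0, None)).sum(axis=1)
    return mass, size, volume
```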

Figure 8: The volume of basins. Assuming that the interval size $du$ is sufficiently small, the volume of an energy basin can be approximated by the summation of the estimated volumes at each energy interval.

If our goal is to develop a scale-space representation of the ELM by repeatedly smoothing the landscape, then two adjacent basins will be merged into one basin at a certain scale, and the volume above the two basins will also be added to the new merged basin.

Note that the partition of the space into bins, rather than basins, facilitates the computation of the energy barriers and of the mass and volume of the basins.

Figure 9: Characterizing the difficulty of learning using the ELM. For two learning tasks with ELM I and ELM II, the colored bars show the frequencies with which a learning algorithm converges to the basins, from which two Error-Recall curves are plotted. The difficulty of the learning task, with respect to this algorithm, can be measured by the area under the curve within an acceptable maximum error.

2.5 Characterizing the difficulty (or complexity) of learning tasks

It is often desirable to measure the difficulty of a learning task by a single number. For example, we compare two ELMs in Figure 9. Learning in the landscape of ELM I looks easier than in that of ELM II. However, the difficulty also depends on the learning algorithm. Thus we run the learning algorithm many times and record the frequency with which it converges to each basin or minimum. The frequency is shown by the lengths of the colored bars under the leaf nodes.

Suppose that $x^*$ is the true model to be learned. In Figure 9, $x^*$ corresponds to a node in ELM I and a node in ELM II. In general, $x^*$ may not be the global minimum, or even a local minimum. We then measure the distance (or error) between $x^*$ and every other local minimum. As the error increases, we accumulate the frequencies to plot a curve. We call it the Error-Recall curve (ERC), as the horizontal axis is the error and the vertical axis is the frequency of recalling the solutions. This is analogous to the ROC (receiver operating characteristic) curve in Bayesian decision theory, pattern recognition and machine learning. By sliding the threshold $\epsilon_{\max}$, the maximum tolerable error, the curve characterizes the difficulty of the ELM with respect to the algorithm.

A single number that characterizes the difficulty is the area under the curve (AUC) for a given $\epsilon_{\max}$; this is illustrated by the shaded area in Figure 9(c) for ELM II. When the AUC is close to 1, the task is easy, and when it is close to 0, learning is impossible.
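A small sketch of the Error-Recall curve and its AUC is shown below; the error values and visit frequencies in the example are hypothetical.

```python
import numpy as np

def error_recall_auc(errors, freqs, eps_max):
    """Error-Recall curve and its area under the curve (AUC).

    errors : distance of each local minimum from the true model x*
    freqs  : frequency with which the learning algorithm converges to each minimum
    eps_max: maximum tolerable error
    Returns the curve (sorted errors, cumulative recall) and the AUC normalized to [0, 1].
    """
    errors, freqs = np.asarray(errors, float), np.asarray(freqs, float)
    order = np.argsort(errors)
    e, f = errors[order], freqs[order] / freqs.sum()
    recall = np.cumsum(f)                      # recall as the error threshold grows
    # Area under the step curve on [0, eps_max], normalized by eps_max.
    widths = np.diff(np.clip(np.append(e, eps_max), 0.0, eps_max))
    auc = float(np.sum(recall[:len(widths)] * widths) / eps_max)
    return (e, recall), auc

if __name__ == "__main__":
    # Hypothetical ELM with 4 minima: errors to x* and visit frequencies of an algorithm.
    print(error_recall_auc(errors=[0.0, 0.4, 1.2, 2.0],
                           freqs=[55, 25, 15, 5], eps_max=1.0)[1])
```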

In a learning problem, we can set different conditions which correspond to a range of ELMs. The difficulty measures of these ELMs can be visualized in the space of the condition parameters as a difficulty map. We will show such maps in Experiment III.

2.6 MCMC moves in the model space

To design the Markov chain moves in the model space $\Omega$, we use two types of proposals in the Metropolis-Hastings acceptance step of Equation (6).

1. A random-walk proposal in the neighborhood of the current model $x$.

2. Data augmentation. A significant portion of non-convex optimization problems involve latent variables. For example, in the clustering problem, the class label of each data point is latent. For such problems, we use data augmentation Tanner and Wong (1987) to improve the efficiency of sampling. In order to propose a new model, we first sample the values of the latent variables based on the current model, and then sample the new model based on the latent variables. The proposal is then either accepted or rejected based on the acceptance probability in Equation (6).

Note, however, that our goal in ELM construction is to traverse the model space rather than to sample from the probability distribution. When enough samples have been collected and the weights $\gamma$ have become large, the reweighted probability distribution differs significantly from the original distribution, and the rejection rate of the models proposed via data augmentation becomes high. Therefore, we use the data-augmentation proposal more often at the beginning and increasingly rely on the random-walk proposal as the weights become large.
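The sketch below illustrates a data-augmentation proposal for a one-dimensional GMM, together with a simple schedule that favors augmentation early on. It only resamples the component means, keeps the weights and variances fixed, and leaves the accept/reject step of Equation (6) to the caller, so it is a simplified stand-in for the proposal used in the paper.

```python
import numpy as np

def augmentation_proposal(theta, data, rng):
    """Data-augmentation proposal for a 1-D GMM (a simplified sketch).

    Sample labels Y ~ p(Y | data, theta), then propose new component means
    around the per-cluster sample means; weights and variances are kept fixed,
    and the accept/reject step of Equation (6) is applied by the caller.
    """
    means, variances, weights = (np.array(v, dtype=float) for v in theta)
    K = len(weights)
    # Responsibilities p(y_i = k | z_i, theta), then sample the labels.
    d2 = (data[:, None] - means[None, :])**2 / (2.0 * variances[None, :])
    logp = np.log(weights)[None, :] - d2 - 0.5 * np.log(2 * np.pi * variances)[None, :]
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    labels = np.array([rng.choice(K, p=pi) for pi in p])
    # Propose new means given the sampled labels.
    new_means = means.copy()
    for k in range(K):
        zk = data[labels == k]
        if len(zk) > 0:
            new_means[k] = rng.normal(zk.mean(), np.sqrt(variances[k] / len(zk)))
    return new_means, variances, weights

def augmentation_probability(t, t0=1000.0):
    """Use data augmentation often early on, random-walk proposals later (illustrative schedule)."""
    return 1.0 / (1.0 + t / t0)
```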

Figure 10: Two ELMs generated from two MCMC chains initialized at different starting points, after convergence.

2.7 ELM convergence analysis

The convergence of the GWL algorithm to a stationary distribution is a necessary but not sufficient condition for the convergence of the ELMs. As shown in Figure 10, the constructed ELMs may have minor variations due to two factors: (i) the left-right ambiguity when we plot the branches under a barrier; and (ii) the precision of the energy barriers, which affects the internal structure of the tree.

In the experiments, we first monitor the convergence of the GWL in the model space. We run multiple MCMC chains initialized with random starting values. After a burn-in period, we collect samples and project them into a 2- or 3-dimensional space using multi-dimensional scaling. We check whether the chains have converged to a stationary distribution using the multivariate extension of the Gelman and Rubin criterion (Gelman and Rubin, 1992; Brooks and Gelman, 1998).
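As an illustration, the following sketch computes the per-coordinate Gelman-Rubin statistic on a set of chains; the multivariate Brooks-Gelman version used in the paper is analogous.

```python
import numpy as np

def gelman_rubin(chains):
    """Per-coordinate Gelman-Rubin statistic R-hat for a set of MCMC chains.

    chains: array (n_chains, n_samples, dim) of post-burn-in samples (e.g. after
    projecting the models to 2-3 dimensions with multi-dimensional scaling).
    Values close to 1 indicate convergence; this is the univariate version.
    """
    m, n, _ = chains.shape
    chain_means = chains.mean(axis=1)                    # (m, dim)
    W = chains.var(axis=1, ddof=1).mean(axis=0)          # within-chain variance
    B = n * chain_means.var(axis=0, ddof=1)              # between-chain variance
    var_hat = (n - 1) / n * W + B / n
    return np.sqrt(var_hat / W)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    chains = rng.normal(size=(4, 5000, 2))               # four well-mixed toy chains
    print(gelman_rubin(chains))                          # expect values near 1.0
```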

Once the GWL is believed to have converged, we can monitor the convergence of the ELM by checking the convergence of the following two sets over time $t$.

  1. The set of leaf nodes of the tree, in which each element is a local minimum with its energy level. As $t$ increases, this set grows monotonically until no more local minima are found, as shown in Figure 11(a).

  2. The set of internal nodes of the tree, in which each element is an energy barrier at a certain energy level. As $t$ increases, we may find lower barriers as the Markov chain crosses different ridges between the basins. Thus the barrier estimates decrease monotonically until no barrier is updated during a certain time period.

We further calculate a distance measure between two ELMs constructed from two MCMC chains with different initializations. To do so, we compute a best node matching between the two trees; the distance is then defined on the differences of the matched leaf nodes and barriers, plus penalties on unmatched nodes. We omit the details of this definition as they are not important for this work. Figure 11(b) shows that the distance decreases as more samples are generated.

Figure 11: Monitoring the convergence of ELMs generated from two MCMC chains initialized at different starting points. (a) The number of local minima found versus the number of iterations for each chain. (b) The distance between the two ELMs versus the number of iterations.

3 Experiment I: ELMs of Gaussian Mixture Models

In this section, we compute the ELMs for learning Gaussian mixture models for two purposes: (i) to study the influence of different conditions, such as separability and level of supervision; and (ii) to compare the behaviors and performances of popular algorithms, including K-means clustering, EM (Expectation-Maximization), two-step EM, and the Swendsen-Wang cut. We use both synthetic data and real data in the experiments.

3.1 Energy and Gradient Computations

A Gaussian mixture model with $K$ components in $d$ dimensions has weights $\alpha_k$, means $\mu_k$ and covariance matrices $\Sigma_k$, for $k = 1, \dots, K$. Given a set of observed data points $\{z_1, \dots, z_n\}$, we write the energy function as

$$E(\theta) = -\log p(\theta) - \sum_{i=1}^{n} \log p(z_i \mid \theta), \qquad (11)$$

$$p(z_i \mid \theta) = \sum_{k=1}^{K} \alpha_k\, \mathcal{N}(z_i;\, \mu_k, \Sigma_k). \qquad (12)$$

$p(\theta)$ is the product of a Dirichlet prior on the weights and a Normal-Inverse-Wishart (NIW) prior on the means and covariances; its partial derivatives are trivial to compute. $p(z_i \mid \theta)$ is the likelihood of data point $z_i$, where $\mathcal{N}(z_i;\, \mu_k, \Sigma_k)$ is a Gaussian density.

For a sample $z$, we have the following partial derivatives of the log likelihood $\ell(z) = \log \sum_{k} \alpha_k\, \mathcal{N}(z;\, \mu_k, \Sigma_k)$ for calculating the gradient in the energy landscape.

a) Partial derivative with respect to each weight $\alpha_k$:

$$\frac{\partial \ell(z)}{\partial \alpha_k} = \frac{\mathcal{N}(z;\, \mu_k, \Sigma_k)}{\sum_{j} \alpha_j\, \mathcal{N}(z;\, \mu_j, \Sigma_j)}.$$

b) Partial derivative with respect to each mean $\mu_k$:

$$\frac{\partial \ell(z)}{\partial \mu_k} = \frac{\alpha_k\, \mathcal{N}(z;\, \mu_k, \Sigma_k)}{\sum_{j} \alpha_j\, \mathcal{N}(z;\, \mu_j, \Sigma_j)}\ \Sigma_k^{-1}(z - \mu_k).$$

c) Partial derivative with respect to each covariance $\Sigma_k$:

$$\frac{\partial \ell(z)}{\partial \Sigma_k} = \frac{\alpha_k\, \mathcal{N}(z;\, \mu_k, \Sigma_k)}{\sum_{j} \alpha_j\, \mathcal{N}(z;\, \mu_j, \Sigma_j)}\ \cdot\ \frac{1}{2}\left( \Sigma_k^{-1}(z - \mu_k)(z - \mu_k)^{T}\Sigma_k^{-1} - \Sigma_k^{-1} \right).$$
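A direct numpy implementation of these per-sample gradients is sketched below; it is unoptimized and assumes full covariance matrices.

```python
import numpy as np

def gmm_loglik_gradients(z, weights, means, covs):
    """Gradients of log p(z | theta) for one sample z of a d-dimensional K-component GMM.

    Returns (d_weights, d_means, d_covs) matching the formulas above; a direct
    (unoptimized) implementation for illustration.
    """
    K, d = means.shape
    dens = np.empty(K)
    inv = [np.linalg.inv(covs[k]) for k in range(K)]
    for k in range(K):
        diff = z - means[k]
        dens[k] = np.exp(-0.5 * diff @ inv[k] @ diff) / \
                  np.sqrt((2 * np.pi) ** d * np.linalg.det(covs[k]))
    mix = float(weights @ dens)                      # p(z | theta)
    d_weights = dens / mix
    d_means = np.stack([weights[k] * dens[k] / mix * inv[k] @ (z - means[k])
                        for k in range(K)])
    d_covs = np.stack([0.5 * weights[k] * dens[k] / mix *
                       (inv[k] @ np.outer(z - means[k], z - means[k]) @ inv[k] - inv[k])
                       for k in range(K)])
    return d_weights, d_means, d_covs
```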

During the computation, we need to restrict the covariance matrices so that each inverse $\Sigma_k^{-1}$ exists, in order to have a well-defined gradient. Each $\Sigma_k$ is positive semi-definite, so each eigenvalue is greater than or equal to zero. Consequently, we only need the minor restriction that each eigenvalue $\lambda$ of $\Sigma_k$ satisfies $\lambda \geq \epsilon$ for some small $\epsilon > 0$. However, it is possible that after one gradient descent step the new GMM parameters will be outside the valid GMM space, i.e. the new covariance matrices will not be symmetric positive definite. Therefore, we need to project each $\Sigma_k$ back into the symmetric positive definite space with the projection $\mathrm{proj}_\epsilon(\mathrm{sym}(\Sigma_k))$.

The function $\mathrm{sym}$ projects a matrix into the space of symmetric matrices by

$$\mathrm{sym}(\Sigma) = \frac{1}{2}\left(\Sigma + \Sigma^{T}\right).$$

Assuming that $\Sigma$ is symmetric, the function $\mathrm{proj}_\epsilon$ projects $\Sigma$ into the space of symmetric matrices with eigenvalues no smaller than $\epsilon$. Because $\Sigma$ is symmetric, it can be decomposed into $\Sigma = Q \Lambda Q^{T}$, where $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$ is the diagonal eigenvalue matrix and $Q$ is an orthonormal eigenvector matrix. Then the function

$$\mathrm{proj}_\epsilon(\Sigma) = Q\, \mathrm{diag}\big(\max(\lambda_1, \epsilon), \dots, \max(\lambda_d, \epsilon)\big)\, Q^{T}$$

ensures that the result is symmetric positive definite.
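The projection can be implemented in a few lines with an eigendecomposition; the value of epsilon below is an illustrative choice.

```python
import numpy as np

def project_spd(sigma, eps=1e-6):
    """Project a covariance candidate back into the symmetric positive definite cone:
    symmetrize, then clip the eigenvalues from below at eps (an illustrative value)."""
    sym = 0.5 * (sigma + sigma.T)                       # nearest symmetric matrix
    lam, Q = np.linalg.eigh(sym)                        # sym = Q diag(lam) Q^T
    return (Q * np.maximum(lam, eps)) @ Q.T             # eigenvalues floored at eps

if __name__ == "__main__":
    bad = np.array([[1.0, 2.0], [0.0, -0.5]])           # neither symmetric nor PD
    fixed = project_spd(bad)
    print(np.linalg.eigvalsh(fixed))                    # all eigenvalues >= eps
```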

3.2 Bounding the GMM space

From the data points $\{z_1, \dots, z_n\}$, we can estimate a boundary of the space of possible parameters if $n$ is sufficiently large.

Let $\bar{z}$ and $S$ be the sample mean and sample covariance matrix of all the points. We set a range for the means of the Gaussian components centered at $\bar{z}$, with a radius proportional to the spread of the data; the proportionality constant is selected in the experiments. To bound the covariance matrices $\Sigma_k$, let $S = Q \Lambda Q^{T}$ be the eigenvalue decomposition of $S$, with $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$. We take the largest eigenvalue of $S$ as an upper bound and bound all the eigenvalues of each $\Sigma_k$ by a multiple of it.

(a) unbounded GMM space
(b) bounded GMM space
Figure 12: We sampled 70 data points from a 1-dimensional 4-component GMM and ran the MCMC random walk of the ELM construction algorithm in the (a) unbounded and (b) bounded GMM space. The plots show the evolution of the locations of the centers of the 4 components over time. The width of each line represents the weight of the corresponding component.

Figures 12(a) and (b) compare the MCMC chains in the unbounded and bounded spaces, respectively. We sampled 70 data points from a 1-dimensional, 4-component GMM and ran the MCMC random walk of the ELM construction algorithm. The plots show the evolution of the locations of the component means over time. Notice that in Figure 12(a) the MCMC chain can move far from the center and spends the majority of the time outside of the bounded subspace. In Figure 12(b), by forcing the chain to stay within the boundary, we are able to explore the relevant subspace more efficiently.

3.3 Experiments on Synthetic Data

We start with synthetic data drawn from a low-dimensional GMM with a small number of components, and run our algorithm to plot the ELM under different settings.

Figure 13: ELMs for samples drawn from GMMs with low, medium and high separability, respectively. The circles represent the probability mass of the basins.

1) The effects of separability. The separability of a GMM measures the overlap between the components of the model and is defined in terms of the distances between component means relative to their standard deviations. It is often used in the literature to measure the difficulty of learning the true GMM model.

Figure 13 shows three representative ELMs with low, medium and high separability, respectively. It clearly shows that at low separability the model is hardly identifiable, with many local minima reaching similar energy levels. The energy landscape becomes increasingly simple as the separability increases. At high separability, a prominent global minimum dominates the landscape.

2) The effects of partial supervision. We assign ground-truth labels to a portion of the data points; for a labeled data point, its label indicates which component it belongs to. We fix the separability and sample 100 data points. Figure 14 shows the ELMs for different percentages of labeled data points. While unsupervised learning (0% labeled) is very challenging, the landscape becomes much simpler when a small portion of the data is labeled; with enough labels, the ELM has only one minimum. Figure 15 shows the number of local minima in the ELM as a function of the percentage of labeled samples. It shows a significant decrease in landscape complexity for the first 10% of labels, and diminishing returns from supervised input after the initial 10%.

Figure 14: ELMs of synthesized GMMs (nSamples = 100) with different percentages of labelled data points.
Figure 15: Number of local minima versus the percentage of labelled data points.

3) Behavior of Learning Algorithms. We compare the behaviors of the following algorithms under different separability conditions.

  • Expectation-Maximization (EM) is the most popular algorithm for learning GMMs in statistics.

  • K-means clustering is a popular algorithm in machine learning and pattern recognition.

  • Two-step EM is a variant of EM proposed by Dasgupta and Schulman (2000), who proved a performance guarantee under certain separability conditions. It starts with an excessive number of components and then prunes them.

  • The Swendsen-Wang cut (SW-cut) algorithm proposed in Barbu and Zhu (2007) and Barbu and Zhu (2005). This generalizes the SW method Swendsen and Wang (1987) from Ising/Potts models to arbitrary probabilities.

We modified EM, two-step EM and SW-cut in our experiments so that they minimize the energy function defined in Equation (11). K-means does not optimize our energy function, but it is frequently used as an approximate algorithm for learning GMMs and therefore we include it in our comparison.

For each synthetic dataset in the experiment, we first construct the ELM, and then run each of the algorithms many times and record which energy basin the algorithm lands in. We thereby obtain the visiting frequency of the basins for each algorithm, shown as bars of varying length at the leaf nodes in Figures 16 and 17.
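The following sketch shows how such visiting frequencies can be collected, using scikit-learn's KMeans and GaussianMixture as stand-ins for the (modified) algorithms in the paper, and a crude sorted-means matching to assign each run to the nearest recorded ELM minimum; the matching rule and run count are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def basin_visit_frequencies(data, elm_minima, n_runs=200, seed=0):
    """Count how often k-means and EM land near each recorded ELM minimum.

    elm_minima: array (M, K, d) of the component means of the M local minima.
    Each fitted model is assigned to the closest minimum by comparing
    coordinate-wise sorted component means (a crude matching, for illustration).
    """
    rng = np.random.default_rng(seed)
    M = len(elm_minima)
    K = elm_minima.shape[1]
    counts = {"kmeans": np.zeros(M, dtype=int), "em": np.zeros(M, dtype=int)}

    def closest_minimum(means):
        key = np.sort(means, axis=0)
        dists = [np.linalg.norm(key - np.sort(m, axis=0)) for m in elm_minima]
        return int(np.argmin(dists))

    for _ in range(n_runs):
        rs = int(rng.integers(1 << 31))
        km = KMeans(n_clusters=K, n_init=1, random_state=rs).fit(data)
        counts["kmeans"][closest_minimum(km.cluster_centers_)] += 1
        em = GaussianMixture(n_components=K, n_init=1, init_params="random",
                             random_state=rs).fit(data)
        counts["em"][closest_minimum(em.means_)] += 1
    return counts
```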

Figure 16: The performance of the k-means, EM and two-step EM algorithms on the ELM for 10 samples drawn from a GMM with low separability.
(a) EM
(b) k-means
(c) SW-cut
(d) EM
(e) k-means
(f) SW-cut
Figure 17: The performance of the EM, k-means, and SW-cut algorithms on the ELM. (a-c) Low separability. (d-f) High separability.

Figure 16 shows a comparison between the K-means, EM and two-step EM algorithms for samples drawn from a low-separability GMM. The results are scattered across different local minima regardless of the algorithm. This illustrates the difficulty of learning a model in a landscape with many local minima separated by large energy barriers.

Figure 17 shows a comparison of the EM, k-means, and SW-cut algorithms for samples drawn from low- and high-separability GMMs. The SW-cut algorithm performs best in each situation, always converging to the globally optimal solution. In the low-separability case, the k-means algorithm is quite random, while the EM algorithm almost always finds the global minimum and thus outperforms k-means. However, in the high-separability case, the k-means algorithm converges to the true model the majority of the time, while EM almost always converges to a local minimum with higher energy than the true model. This result confirms a recent theoretical result showing that the objective function of hard-EM (with k-means as a special case) contains an inductive bias in favor of high-separability models Tu and Honavar (2012); Samdani, Chang and Roth (2012). Specifically, we can show that the actual energy function of hard-EM is:

$$E(\theta) = -\log\big( p(X \mid \theta)\, p(\theta) \big) + \min_{q} \Big[ \mathrm{KL}\big(q(Y)\,\|\,p(Y \mid X, \theta)\big) + H_q(Y) \Big],$$

where $\theta$ is the model parameters, $X$ is the set of observable data points, $Y$ is the set of latent variables (the data point labels in a GMM), $q$ is an auxiliary distribution over $Y$, and $H_q(Y)$ is the entropy of $Y$ measured with $q$. The first term in the above formula is the standard energy function of clustering with a GMM. The second term is a posterior regularization term Ganchev et al. (2010), which essentially encourages the distribution $q$ to have a low entropy. In the case of GMMs, it is easy to see that a low entropy in $q$ implies high separability between the Gaussian components.

3.4 Experiments on Real Data

We ran our algorithm to plot the ELM for the well-known Iris data set from the UCI repository Blake and Merz (1998). The Iris data set contains 150 points in 4 dimensions and can be modeled by a 3-component GMM. The three components each represent a type of iris plant, and the true component labels are known. The points of the first component are linearly separable from the others, but the points of the remaining two components are not linearly separable.

Figure 18 shows the ELM of the Iris dataset. We visualize the local minima by plotting the ellipsoids of the covariance matrices centered at the means of each component in 2 of the 4 dimensions.

The 6 lowest-energy local minima are shown on the right and the 6 highest-energy local minima are shown on the left. The high-energy local minima are less accurate models than the low-energy local minima. The local minima (E), (B) and (D) have the first component split into two and the remaining two (non-separable) components merged into one. The local minima (A) and (F) have significant overlap between the 2nd and 3rd components, and (C) has these components overlapping completely. The low-energy local minima (G-L) all have the same 1st component and slightly different positions of the 2nd and 3rd components.

Figure 18: ELM and some of the local minima of the Iris dataset.

We ran the algorithm with 0, 5, 10, 50, 90, and 100 percent of the points assigned their ground-truth labels. Figure 19 shows the global minimum of the energy landscape for each of these cases.

Figure 19: Global minima for learning from the Iris dataset with 0, 5, 10, 50, 90, and 100% of the data labeled with the ground truth values. Unlabeled points are drawn in grey and labelled points are colorized in red, green or blue.

4 Experiment II: ELM of Bernoulli Templates

The synthetic data and Iris data in experiment I are in low dimensional spaces. In this section, we experiment with very high dimensional data for a learning task in computer vision and pattern recognition.

The objective is to learn a number of templates for object recognition. Figure 20 illustrates templates of animal faces. Each template consists of a number of sketches or edges in the image lattice and is denoted by a Boolean vector $T \in \{0, 1\}^n$, with $n$ being the number of quantized positions and orientations of the lattice, which is typically a large number. $T_d = 1$ if there is a sketch at location $d$, and $T_d = 0$ otherwise. Images are generated from one of the templates with noise. Suppose $z$ is an image generated from template $T$; then $z_d = T_d$ with probability $1 - q$ and $z_d \neq T_d$ with probability $q$. Thus we call these Bernoulli templates. For simplicity we assume $q$ is fixed for all the templates and all the locations.

The energy function that we use is the negative log of the posterior, $E(\theta) = -\log p(\theta \mid z_1, \dots, z_m)$, for examples $z_1, \dots, z_m$. The model parameter $\theta$ consists of the Boolean vectors $T_k$ and the mixture weights $\alpha_k$ for $k = 1, \dots, K$. By assuming a uniform prior we have

$$E(\theta) = -\sum_{i=1}^{m} \log \sum_{k=1}^{K} \alpha_k\, q^{\,d_H(z_i, T_k)}\, (1-q)^{\,n - d_H(z_i, T_k)} + \text{const},$$

where $d_H(z_i, T_k)$ is the Hamming distance between example $z_i$ and template $T_k$.
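A direct implementation of this energy (up to the additive constant) is sketched below; the templates, noise level and data in the example are synthetic placeholders.

```python
import numpy as np

def bernoulli_mixture_energy(examples, templates, weights, q):
    """Negative log-likelihood of Boolean examples under a mixture of Bernoulli templates.

    examples  : array (m, n) of 0/1 vectors
    templates : array (K, n) of 0/1 template vectors T_k
    weights   : array (K,) mixture weights alpha_k
    q         : probability of flipping a bit of the template
    """
    m, n = examples.shape
    # Hamming distance of every example to every template: shape (m, K).
    ham = (examples[:, None, :] != templates[None, :, :]).sum(axis=2)
    log_comp = np.log(weights)[None, :] + ham * np.log(q) + (n - ham) * np.log(1 - q)
    # log-sum-exp over components for each example, then negate and sum.
    mx = log_comp.max(axis=1, keepdims=True)
    loglik = mx[:, 0] + np.log(np.exp(log_comp - mx).sum(axis=1))
    return -loglik.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T = rng.integers(0, 2, size=(3, 50))                 # 3 hypothetical templates
    labels = rng.integers(0, 3, size=200)
    Z = np.where(rng.random((200, 50)) < 0.1, 1 - T[labels], T[labels])
    print(bernoulli_mixture_energy(Z, T, np.full(3, 1 / 3), q=0.1))
```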

In the following we present experiments on synthetic and real data.

(a) cat
(b) chinchilla
(c) dog
(d) elephant
(e) goat
(f) lion
(g) monkey
(h) mouse
(i) owl
(j) rabbit
Figure 20: Bernoulli templates for animal faces. These templates have low overlap and are well separable.

4.1 Experiments on Synthetic Data

Barbu, Wu and Wu (2014) propose a Two-Round EM algorithm for learning Bernoulli templates with a performance bound that depends on the number of components $K$, the Bernoulli template dimension $n$, and the noise level $q$. In this experiment, we examine how the ELM of the model space changes with these factors. We discretize the model space by allowing the weights to take only a finite set of values. In order to adapt the GWL algorithm to the discrete space, we use coordinate descent in lieu of gradient descent.

(a) Sketch dictionary
(b) Noisy dog
Figure 21: (a) Quantized dictionary with 18 sketches for each cell in the image lattice. (b) A sample from the dog animal-face template generated with noise.

We construct Bernoulli templates representing the 10 animal faces in Figure 20. Each animal face is aligned to a grid of cells, and each cell may contain several sketches. Within a cell, the sketches are quantized to discrete locations and orientations. More specifically, each sketch is a straight line connecting the endpoints or midpoints of the edges of a square cell, and the possible sketches in a cell are shown in Figure 21(a). They can well approximate the detected edges or Gabor sketches from real images. The Bernoulli template can therefore be represented as a high-dimensional binary vector. Figure 21(b) shows a noisy dog face generated from the corresponding template.

We compute the ELMs of the Bernoulli mixture model for varying numbers of samples and varying noise levels $q$. The number of local minima in each energy landscape is tabulated in Figure 22(b) and drawn as a heat map in Figure 22(a). As expected, the number of local minima increases as the noise level increases and decreases as the number of samples increases. In particular, with no noise the landscape is convex, and at high noise levels there are too many local minima and the algorithm does not converge.

(a) map
(b) number of local minima
Figure 22: ELM complexity for varying noise levels $q$ and numbers of samples in learning the Bernoulli templates.
Figure 23: Bernoulli templates for mouse faces with high overlap and low separability.

We repeat the same experiment using variants of a mouse face. We swap out each component of the mouse face (the eyes, ears, whiskers, nose, mouth, head top and head sides) with three different variants, thereby generating the Bernoulli templates in Figure 23, which have relatively high degrees of overlap. We generate the ELMs of various Bernoulli mixture models, each containing three of the templates, at a fixed noise level; in each Bernoulli mixture model, the three templates have different degrees of overlap. We then plot the number of local minima in the ELMs versus the degree of overlap, as shown in Figure 24. As expected, the number of local minima increases with the degree of overlap, and beyond a certain overlap there are too many local minima for the algorithm to converge.

Figure 24: Number of local minima found for varying degrees of overlap in the Bernoulli templates. Each marker corresponds to a Bernoulli mixture model that consists of three of the 20 Bernoulli templates.

4.2 Experiments on Real Data

We perform the Bernoulli template experiment on a set of real images of animal faces. We binarize the images by extracting the prominent sketches on a 9x9 grid. Gabor filters at eight different orientations, centered at the centers and corners of each cell, are applied to the image. The filters with a response above a fixed threshold correspond to edges detected in the image; these are mapped to the dictionary of sketch elements. Thus each animal face is represented as a high-dimensional binary vector. The Gabor filter responses on animal face pictures are shown in Figure 25, and the binarized animal faces are shown in Figure 26.

Figure 25: Animal face images and corresponding binary sketches; each sketch indicates the existence of a Gabor filter response above a fixed threshold.
Figure 26: Deer face sketches binarized from real images.

We chose 3 different animal types (deer, cat and mouse), with an equal number of images chosen from each category (Figure 27). The binarized versions of these images can be modeled as a mixture of 3 Bernoulli templates, each template corresponding to one animal-face type.

Figure 27: Animal face images of three categories.

The ELM is shown in Figure 28 along with the Bernoulli templates corresponding to three local minima separated by large energy barriers. We make two observations: (1) the templates corresponding to each animal type are clearly identifiable, and therefore the algorithm has converged on reasonable local minima; (2) the animal faces have differing orientations across the local minima (the deer face in the left-most local minimum is rotated and tilted to the right and the dog face in the same local minimum is rotated and tilted to the left), which explains the energy barriers between them.

Figure 28: ELM of the three animal faces dataset (dog, cat, and deer). We show the Bernoulli templates corresponding to three local minima separated by large energy barriers.
(a) SW-cut
(b) EM
(c) k-means
Figure 29: Comparison of the SW-cut, k-means, and EM algorithms on the ELM of the animal-face Bernoulli mixture model.

Figure 29 shows a comparison of the SW-cut, k-means, and EM algorithms as histograms on the ELM of the animal-face Bernoulli mixture model. The histogram is obtained by running each algorithm 200 times with random initializations and then finding the local minimum in the ELM closest to the output of the algorithm. The counts of the closest local minima are displayed as bar plots next to each local minimum. It can be seen that SW-cut always finds the global minimum, while k-means performs the worst, probably because of the high degree of overlap between the sketches of the three types of animal faces.

5 Experiment III: ELM of Bi-clustering

Bi-clustering is a learning task (see the survey by Madeira and Oliviera (2004)) that has been widely used in bioinformatics, e.g., finding genes with similar expression patterns under subsets of conditions (Cheng and Church (2000); Getz, Levine and Domany (2000); Cho et al. (2004)). It is also used in data mining, e.g., finding people who enjoy similar movies (Yang et al. (2002)), and in learning language models by finding co-occurring words and phrases in grammar rules (Tu and Honavar (2008)).

Figure 30(a) shows a bi-clustering model (with multiplicative coherence) in the form of a three-layer And-Or graph. The underlying pattern has two conjunction factors, $A$ and $B$. $A$ can choose from a number of alternative elements, each with some probability, and similarly $B$ can choose from its own set of elements with corresponding probabilities; $A$ and $B$ have shared elements. For comparison, we note that the clustering models in Experiments I and II can be seen as three-layer Or-And graphs, with a mixture (Or-node) on the top in which each component is a conjunction of multiple variables.

Figure 30: (a) A bi-clustering model. (b) The co-occurrence matrix with the theoretical frequencies of its elements.

From data sampled from this model, one can compute a co-occurrence matrix for the two elements chosen by $A$ and $B$; the theoretical co-occurrence frequencies are shown in Figure 30(b). When only a small number of observations are available, this matrix may have significant fluctuations. There may also be unwanted background elements in the matrix. The goal of bi-clustering is to identify the bi-cluster (one of the two submatrices in Figure 30(b)) from the noisy matrix. Note that this is a simple special case of the bi-clustering problem; in general the matrix may contain many bi-clusters that are not necessarily symmetrical.

We denote the bi-cluster to be identified by $(R, C)$, where $R$ is the set of rows and $C$ is the set of columns of the bi-cluster. Note that the goal of bi-clustering is not to explain all the data but to identify a subset of the data that exhibits certain properties (e.g., coherence). Therefore, instead of using the likelihood or posterior probability to define the energy function, we use an energy function adapted from Tu and Honavar (2008).

The energy function is defined in terms of the elements $a_{ij}$ at row $i$ and column $j$, the row sums $r_i$, the column sums $c_j$, and the total sum $s$ of the bi-cluster. Its first term measures the coherence of the bi-cluster and reaches its minimal value of 0 if the bi-cluster is perfectly multiplicatively coherent (i.e., the elements are perfectly proportional). The second term corresponds to the prior, which favors bi-clusters that cover more data; a small additional term is included to exclude rows and columns that are entirely zero from the bi-cluster.
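Since the exact formula from Tu and Honavar (2008) is not reproduced here, the sketch below uses a squared-deviation coherence term and a linear coverage prior as an illustrative stand-in; the parameters lam and delta are hypothetical.

```python
import numpy as np

def bicluster_energy(A, rows, cols, lam=0.1, delta=0.01):
    """Illustrative bi-clustering energy: coherence term plus coverage prior.

    A stand-in for the energy adapted from Tu and Honavar (2008); the squared
    deviation coherence measure and the parameters lam (prior strength) and
    delta (per-row/column cost) are hypothetical choices.
    """
    sub = A[np.ix_(rows, cols)].astype(float)
    r = sub.sum(axis=1, keepdims=True)        # row sums within the bi-cluster
    c = sub.sum(axis=0, keepdims=True)        # column sums within the bi-cluster
    s = sub.sum()                             # total sum of the bi-cluster
    if s <= 0:
        return 0.0                            # empty bi-cluster
    # Coherence: zero iff the sub-matrix is exactly proportional (a_ij = r_i c_j / s).
    coherence = float(np.sum((sub - r * c / s) ** 2))
    # Prior: favor covering more data; the delta term makes all-zero rows/columns costly.
    prior = -lam * s + delta * (len(rows) + len(cols))
    return coherence + prior

if __name__ == "__main__":
    A = np.outer([1, 2, 3, 0], [2, 4, 0, 1])  # a perfectly coherent block plus zeros
    print(bicluster_energy(A, rows=[0, 1, 2], cols=[0, 1, 3]))
```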

We experimented with synthetic bi-clustering models in which $A$ and $B$ each have several alternative elements. We varied the following factors to generate a set of different models: (i) the level of overlap between $A$ and $B$; and (ii) the level of random background noise. We generated data points from each model and constructed the co-occurrence matrix. For each data matrix, we ran our algorithm to plot the ELMs with different values of the strength of the prior.

Figure 31 shows some of the ELMs with the overlap being 0%, 20% and 40% respectively, for several prior strengths and a fixed noise level. The local minima corresponding to the correct bi-clusters (either the target bi-cluster or its transposition) are marked with solid red circles; the empty bi-cluster is marked with a gray circle; and the maximal bi-cluster containing the whole data matrix is marked with a solid green circle.

These ELMs can be divided into three regimes.

  • Regime I: the true model is easily learnable; the global minima correspond to the correct bi-clusters and there are fewer than 6 local minima.

  • Regime II: the prior is too strong; the ELM has a dominating minimum which is the maximal bi-cluster. Thus the model is biased and cannot recover the underlying bi-cluster.

  • Regime III: the prior is too weak, resulting in too many local minima at similar energy levels. The true model may not be easily learned, although it is possible to obtain approximately correct solutions.

Thus we convert the table in Figure 31 into a "difficulty map". Figure 32(a) shows the difficulty map with the three regimes in the noiseless case; Figure 32(b) shows the difficulty map in the presence of noise. Such difficulty maps visualize the effects of various conditions and parameters and thus can be useful for choosing and configuring bi-clustering algorithms.

Figure 31: Energy Landscape Maps for learning two bi-clusters with 0%, 20% and 40% overlap and different hyperparameter values. Red: correct bi-cluster; Grey: empty bi-cluster; Green: maximal bi-cluster.
(a) without noise
(b) with noise
Figure 32: Difficulty maps for bi-clustering (a) without noise and (b) with noise. Region I: the true model is easily learnable. Region II: the true model cannot be learned. Region III: approximations to the true model may be learned with some difficulty.

6 Conclusion and Discussion

We present a method for computing energy landscape maps (ELMs) in model spaces for cluster and bi-cluster learning, and thus visualize for the first time the non-convex energy minimization problems in statistical learning. By plotting the ELMs, we have shown how different problem settings, such as separability, level of supervision, level of noise, and strength of the prior, affect the complexity of the energy landscape. We have also compared the behaviors of different learning algorithms in the ELMs.

Our study leads to the following problems which are worth exploring in future work.

  1. If we repeatedly smooth the energy function, adjacent branches in the ELM will gradually merge, producing a series of ELMs that represent coarser structures of the landscape. These ELMs constitute the scale space of the landscape. From this scale-space ELM, we should be able to study the difficulty of the underlying learning problem.

  2. One way to control the scale space of the ELM is to design a learning strategy that starts with a lower-dimensional space, simple examples, and a proper amount of supervision, so that the ELM is almost convex. Once the learning algorithm reaches the global minimum of this ELM, we increase the number of hard examples and dimensions, and the ELM becomes increasingly complex. Hopefully, the minimum reached in the previous ELM will be close to the global minimum of the ELM at the next stage. We have studied a case of such curriculum learning on dependency grammars in Pavlovskaia, Tu and Zhu (2014).

  3. The clustering models are defined on an Or-And graph structure (here, 'Or' means mixture and 'And' means conjunction of dimensions), and the bi-clustering models are defined on an And-Or graph. In general, many advanced learning problems are defined on hierarchical and compositional graphs, which are summarized as multi-layer And-Or graphs in Zhu and Mumford (2006). Studying the ELMs of such models will be more challenging, but it is crucial for many learning tasks of practical importance.

References

  • Barbu, A., Wu, T. and Wu, Y. (2014). Learning Mixtures of Bernoulli Templates by Two-Round EM with Performance Guarantee. Electronic Journal of Statistics (under review).
  • Barbu, A. and Zhu, S.-C. (2005). Generalizing Swendsen-Wang to sampling arbitrary posterior probabilities. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 1239–1253.
  • Barbu, A. and Zhu, S.-C. (2007). Generalizing Swendsen-Wang for Image Analysis. Journal of Computational and Graphical Statistics 16 877–900.
  • Becker, O. M. and Karplus, M. (1997). The topology of multidimensional potential energy surfaces: Theory and application to peptide structure and kinetics. The Journal of Chemical Physics 106 1495–1517.
  • Blake, C. L. and Merz, C. J. (1998). UCI repository of machine learning databases. University of California, Irvine, Department of Information and Computer Sciences.
  • Brooks, S. P. and Gelman, A. (1998). General Methods for Monitoring Convergence of Iterative Simulations. Journal of Computational and Graphical Statistics 7 434–455.
  • Cheng, Y. and Church, G. M. (2000). Biclustering of Expression Data. In Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB-00) 93–103. AAAI Press, Menlo Park, CA.
  • Cho, H., Dhillon, I. S., Guan, Y. and Sra, S. (2004). Minimum Sum-Squared Residue Co-Clustering of Gene Expression Data. In Proceedings of the Fourth SIAM International Conference on Data Mining, Lake Buena Vista, Florida, USA, April 22-24, 2004.
  • Dasgupta, S. and Schulman, L. J. (2000). A two-round variant of EM for Gaussian mixtures. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI'00) 152–159. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
  • Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological) 1–38.
  • Ganchev, K., Graça, J., Gillenwater, J. and Taskar, B. (2010). Posterior Regularization for Structured Latent Variable Models. Journal of Machine Learning Research 11 2001–2049.
  • Gelman, A. and Rubin, D. B. (1992). Inference from Iterative Simulation Using Multiple Sequences. Statistical Science 7 457–472.
  • Getz, G., Levine, E. and Domany, E. (2000). Coupled two-way clustering analysis of gene microarray data. Proc Natl Acad Sci U S A 97 12079–12084.
  • Geyer, C. J. and Thompson, E. A. (1995). Annealing Markov chain Monte Carlo with applications to ancestral inference. Journal of the American Statistical Association 90 909–920.
  • Liang, F. (2005a). Generalized Wang-Landau algorithm for Monte Carlo computation. Journal of the American Statistical Association 100 1311–1327.
  • Liang, F. (2005b). A generalized Wang-Landau algorithm for Monte Carlo computation. Journal of the American Statistical Association 100 1311–1327.
  • Liang, F., Liu, C. and Carroll, R. J. (2007). Stochastic approximation in Monte Carlo computation. Journal of the American Statistical Association 102 305–320.
  • Madeira, S. C. and Oliviera, A. L. (2004). Biclustering algorithms for biological data analysis: a survey. IEEE Trans. on Comp. Biol. and Bioinformatics 1 24–45.
  • Marinari, E. and Parisi, G. (1992). Simulated tempering: a new Monte Carlo scheme. EPL (Europhysics Letters) 19 451.
  • Pavlovskaia, M., Tu, K. and Zhu, S.-C. (2014). Mapping the Energy Landscape for Curriculum Learning of Dependency Grammars. Technical Report, UCLA Statistics Department.
  • Samdani, R., Chang, M.-W. and Roth, D. (2012). Unified expectation maximization. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 688–698. Association for Computational Linguistics.
  • Swendsen, R. H. and Wang, J. S. (1987). Nonuniversal critical dynamics in Monte Carlo simulations. Physical Review Letters 58 86–88.
  • Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association 82 528–540.
  • Tu, K. and Honavar, V. (2008). Unsupervised Learning of Probabilistic Context-Free Grammar Using Iterative Biclustering. In Proceedings of the 9th International Colloquium on Grammatical Inference (ICGI 2008), LNCS 5278.
  • Tu, K. and Honavar, V. (2012). Unambiguity Regularization for Unsupervised Learning of Probabilistic Grammars. In Proceedings of the 2012 Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2012).
  • Wang, F. and Landau, D. (2001). Efficient multiple-range random walk algorithm to calculate the density of states. Physical Review Letters 86 2050–2053.
  • Yang, J., Wang, W., Wang, H. and Yu, P. S. (2002). δ-Clusters: Capturing Subspace Correlation in a Large Data Set. In ICDE 517–528. IEEE Computer Society.
  • Zhou, Q. (2011a). Random Walk over Basins of Attraction to Construct Ising Energy Landscapes. Physical Review Letters 106 180602.
  • Zhou, Q. (2011b). Multi-domain sampling with applications to structural inference of Bayesian networks. Journal of the American Statistical Association 106 1317–1330.
  • Zhou, Q. and Wong, W. H. (2008). Reconstructing the energy landscape of a distribution from Monte Carlo samples. The Annals of Applied Statistics 2 1307–1331.
  • Zhu, S.-C. and Mumford, D. (2006). A stochastic grammar of images. Foundations and Trends in Computer Graphics and Vision 2 259–362. http://dx.doi.org/10.1561/0600000018