Multiscale Energy Pyramid
In this work we consider discrete pair-wise minimization problems, defined over a (weighted) graph $(\mathcal{V}, \mathcal{E})$, of the form:
\begin{equation}
E(L) = \sum_{i \in \mathcal{V}} \varphi_i(l_i) + \sum_{(i,j) \in \mathcal{E}} w_{ij}\, V(l_i, l_j), \tag{1}
\end{equation}
where $\mathcal{V}$ is the set of variables, $\mathcal{E}$ is the set of edges, and the solution is discrete: $L \in \{1,\ldots,l\}^n$, with $n$ variables taking $l$ possible labels. Many problems in computer vision are cast in the form of (1) (see [Szeliski2008]). Furthermore, we do not restrict the energy to be submodular, and our framework is also applicable to more challenging non-submodular energies.
Our aim is to build an energy pyramid with a decreasing number of degrees of freedom. The key component in constructing such a pyramid is the interpolation method. The interpolation maps solutions between levels of the pyramid, and defines how to approximate the original energy with fewer degrees of freedom. We propose a novel principled energy-aware interpolation method. The resulting energy pyramid exposes the multiscale landscape of the energy, making low energy assignments apparent at coarse levels.
However, it is counter-intuitive to directly interpolate discrete values, since they usually have only a semantic interpretation. Therefore, we substitute an assignment $L$ by a binary matrix $U \in \{0,1\}^{n \times l}$. The rows of $U$ correspond to the variables, and the columns correspond to labels: $U_{i\alpha} = 1$ iff variable $i$ is labeled “$\alpha$” ($l_i = \alpha$). This representation allows us to interpolate discrete solutions, as will be shown in the subsequent sections. Expressing the energy (1) using $U$ yields a relaxed quadratic representation (along the lines of [Anand2000]) that forms the basis for our multiscale framework derivation:
\begin{equation}
E(U) = \operatorname{Tr}\left( D U^T + W U V U^T \right) \quad \text{s.t.} \quad U \in \{0,1\}^{n \times l},\ \sum_{\alpha=1}^{l} U_{i\alpha} = 1, \tag{2}
\end{equation}
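where $W \in \mathbb{R}^{n \times n}$ is sparse with entries $w_{ij}$, $D \in \mathbb{R}^{n \times l}$ s.t. $D_{i\alpha} = \varphi_i(\alpha)$, and $V \in \mathbb{R}^{l \times l}$ s.t. $V_{\alpha\beta} = V(\alpha, \beta)$, $\alpha, \beta \in \{1,\ldots,l\}$. A detailed derivation of (2) can be found in the section “Derivation of Eq. (2)”. An energy over $n$ variables with $l$ labels is now parameterized by $(V, D, W)$. We first describe the energy pyramid construction for a given interpolation matrix $P$, and defer the detailed description of our novel interpolation to the discussion of energy-aware interpolation.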
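The equivalence between the label form and the trace form of the energy can be checked numerically on a toy problem. The following is a minimal sketch in plain Python, not the authors' implementation: the helper names and the toy values are hypothetical, the unary costs, edge weights and label costs are stored as dense lists `D`, `W` and `V`, and the pair-wise sum is assumed to run over all ordered pairs with a non-zero weight (so a symmetric `W` counts each edge in both directions).

```python
def matmul(A, B):
    """Plain-Python product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def one_hot(labels, num_labels):
    """Binary assignment matrix: row i has a single 1 in column labels[i]."""
    return [[1.0 if a == lab else 0.0 for a in range(num_labels)] for lab in labels]

def energy_from_labels(labels, D, W, V):
    """Unary costs plus weighted pair-wise costs over all ordered pairs (i, j)."""
    n = len(labels)
    unary = sum(D[i][labels[i]] for i in range(n))
    pairwise = sum(W[i][j] * V[labels[i]][labels[j]]
                   for i in range(n) for j in range(n))
    return unary + pairwise

def energy_from_matrix(U, D, W, V):
    """Trace form: Tr(D U^T) + Tr(W U V U^T)."""
    Ut = transpose(U)
    return trace(matmul(D, Ut)) + trace(matmul(matmul(matmul(W, U), V), Ut))

# Toy problem (hypothetical values): 3 variables on a chain, 2 labels, Potts cost.
D = [[0.0, 1.0], [2.0, 0.5], [1.0, 1.5]]
W = [[0.0, 1.0, 0.0], [1.0, 0.0, 2.0], [0.0, 2.0, 0.0]]
V = [[0.0, 1.0], [1.0, 0.0]]
labels = [0, 1, 1]

e_labels = energy_from_labels(labels, D, W, V)
e_matrix = energy_from_matrix(one_hot(labels, 2), D, W, V)
```

Both evaluations agree for any labeling; the matrix form additionally accepts non-binary (relaxed) rows, which is what makes interpolation between scales possible.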
Energy coarsening by variables
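Let $(V, D, W)$ be the fine scale energy. We wish to generate a coarser representation $(V, D^c, W^c)$ with $n^c < n$. This representation approximates $E(U)$ using fewer variables: $U^c$ with only $n^c$ rows. An interpolation matrix $P \in [0,1]^{n \times n^c}$ s.t. $\sum_{j} P_{ij} = 1\ \forall i$ maps coarse to fine assignments through $U \approx P U^c$. Substituting into (2) gives:
\begin{equation}
E(U) = \operatorname{Tr}\left( D U^T + W U V U^T \right) \approx \operatorname{Tr}\left( D {U^c}^T P^T + W P U^c V {U^c}^T P^T \right) = \operatorname{Tr}\left( \left(P^T D\right) {U^c}^T + \left(P^T W P\right) U^c V {U^c}^T \right) = E^c(U^c). \tag{3}
\end{equation}
We have generated a coarse energy $E^c(U^c)$ parameterized by $(V, D^c = P^T D, W^c = P^T W P)$ that approximates the fine energy $E(U)$. This coarse energy is of the same form as the original energy, allowing us to apply the coarsening procedure recursively to construct an energy pyramid.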
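The variable-coarsening step can be illustrated with a short numerical check. This is a sketch under stated assumptions rather than the authors' code: the function name `coarsen_variables` and the toy matrices are hypothetical, and the coarse unary and weight matrices are assumed to be obtained by multiplying with the interpolation matrix as $P^T D$ and $P^T W P$.

```python
def matmul(A, B):
    """Plain-Python product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def energy(U, D, W, V):
    """Trace form of the energy, valid for binary and relaxed U alike."""
    Ut = transpose(U)
    return trace(matmul(D, Ut)) + trace(matmul(matmul(matmul(W, U), V), Ut))

def coarsen_variables(D, W, P):
    """Coarse unary term P^T D and coarse weights P^T W P (V is unchanged)."""
    Pt = transpose(P)
    return matmul(Pt, D), matmul(matmul(Pt, W), P)

# Toy fine-scale energy (hypothetical values): 3 variables, 2 labels.
D = [[0.0, 1.0], [2.0, 0.5], [1.0, 1.5]]
W = [[0.0, 1.0, 0.0], [1.0, 0.0, 2.0], [0.0, 2.0, 0.0]]
V = [[0.0, 1.0], [1.0, 0.0]]

# Interpolation from 2 coarse variables to 3 fine ones; every row sums to 1.
P = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
Uc = [[1.0, 0.0], [0.0, 1.0]]  # a coarse assignment

Dc, Wc = coarsen_variables(D, W, P)
fine_energy = energy(matmul(P, Uc), D, W, V)   # energy of the interpolated solution
coarse_energy = energy(Uc, Dc, Wc, V)          # energy evaluated at the coarse scale
```

The two values coincide because the trace is invariant under cyclic permutation of the matrix product; the approximation in the text enters only when the interpolated solution is rounded back to a binary assignment.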
Energy coarsening by labels
So far we have explored the reduction of the number of degrees of freedom by reducing the number of variables. However, we may just as well look at the problem from a different perspective: reducing the search space by decreasing the number of labels from $l$ to $l^c$ ($l^c < l$). It is well known that optimization algorithms (especially large move making, e.g., [Boykov2001]) suffer from significant degradation in performance as the number of labels increases ([Bleyer2010]). Here we propose a novel principled and general framework for reducing the number of labels at each scale.
Let $(V, D, W)$ be the fine scale energy. Looking at a different interpolation matrix $\hat{P} \in [0,1]^{l^c \times l}$, we may interpolate a coarse solution by $U \approx \hat{U} \hat{P}$. This time the interpolation matrix acts on the labels, i.e., the columns of $U$. The coarse labeling matrix $\hat{U}$ has the same number of rows (variables), but fewer columns (labels). We use the notation $\hat{U}$ to emphasize that the coarsening here affects the labels rather than the variables. Coarsening the labels yields:
\begin{equation}
E(U) = \operatorname{Tr}\left( D U^T + W U V U^T \right) \approx \operatorname{Tr}\left( D \hat{P}^T \hat{U}^T + W \hat{U} \hat{P} V \hat{P}^T \hat{U}^T \right) = \operatorname{Tr}\left( \hat{D} \hat{U}^T + W \hat{U} \hat{V} \hat{U}^T \right) = \hat{E}(\hat{U}). \tag{4}
\end{equation}
Again, we end up with the same type of energy, but this time it is defined over a smaller number of discrete labels: $(\hat{V}, \hat{D}, W)$, where $\hat{D} = D \hat{P}^T$ and $\hat{V} = \hat{P} V \hat{P}^T$.
The main theoretical contribution of this work is encapsulated in the multiscale “trick” of the variable-coarsening and label-coarsening equations. This formulation forms the basis of our unified framework, allowing us to coarsen the energy directly and exploit its multiscale landscape for efficient exploration of the solution space. This scheme moves the multiscale completely to the optimization side and makes it independent of any specific application. In practice, we can now approach a wide and diverse family of energies using the same multiscale implementation.
The effectiveness of the multiscale approximation heavily depends on the interpolation matrix $P$ ($\hat{P}$ resp.). Poorly constructed interpolation matrices will fail to expose the multiscale landscape of the functional. In the subsequent section we describe our principled energy-aware method for computing it.
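Label coarsening follows the same algebra, with the interpolation acting on the columns of the assignment matrix. The sketch below is illustrative only: the name `coarsen_labels`, the layout of the label interpolation matrix (one row per coarse label, one column per fine label) and all toy values are assumptions made for this example.

```python
def matmul(A, B):
    """Plain-Python product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def energy(U, D, W, V):
    """Trace form of the energy, valid for binary and relaxed U alike."""
    Ut = transpose(U)
    return trace(matmul(D, Ut)) + trace(matmul(matmul(matmul(W, U), V), Ut))

def coarsen_labels(D, V, P_hat):
    """Coarse unary term D P_hat^T and coarse label costs P_hat V P_hat^T (W is unchanged)."""
    Pt = transpose(P_hat)
    return matmul(D, Pt), matmul(matmul(P_hat, V), Pt)

# Toy energy (hypothetical values): 3 variables, 3 fine labels, linear label cost.
D = [[0.0, 1.0, 2.0], [2.0, 0.5, 1.0], [1.0, 1.5, 0.0]]
W = [[0.0, 1.0, 0.0], [1.0, 0.0, 2.0], [0.0, 2.0, 0.0]]
V = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]

# 2 coarse labels; the middle fine label is shared equally between them.
P_hat = [[1.0, 0.5, 0.0], [0.0, 0.5, 1.0]]
U_hat = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # a coarse labeling

D_hat, V_hat = coarsen_labels(D, V, P_hat)
fine_energy = energy(matmul(U_hat, P_hat), D, W, V)
coarse_energy = energy(U_hat, D_hat, W, V_hat)
```

As in the variable case, the coarse energy reproduces the fine energy of the interpolated (relaxed) solution exactly, while the number of variables stays the same and only the label dimension shrinks.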
In this section we use terms and notations for variable coarsening ($P$); however, the motivation and methods are applicable to label coarsening ($\hat{P}$) as well, due to the similar algebraic structure of the two coarsening equations. Our energy pyramid approximates the original energy using a decreasing number of degrees of freedom, thus excluding some solutions from the original search space at coarser scales. Which solutions are excluded is determined by the interpolation matrix $P$. A desired interpolation does not exclude low energy assignments at coarse levels. The matrix $P$ can be interpreted as an operator that aggregates fine-scale variables into coarse ones (see the accompanying figure). Aggregating fine variables $i$ and $j$ into a coarser one excludes from the search space all assignments for which $l_i \neq l_j$. This aggregation is undesired if assigning $i$ and $j$ to different labels yields low energy. However, when variables $i$ and $j$ are strongly correlated by the energy (i.e., assignments with $l_i = l_j$ yield low energy), aggregating them together allows efficient exploration of low energy assignments. A desired interpolation aggregates $i$ and $j$ when they are strongly correlated by the energy.
Measuring energy-aware correlations
We provide two correlation measures, one used in computing variable coarsening ($P$) and the other used for label coarsening ($\hat{P}$). Energy-aware correlations between variables:
A reliable estimation of the correlations between the variables allows us to construct a desirable $P$ that aggregates strongly correlated variables. A naïve approach would assume that neighboring variables are correlated (this assumption underlies [Felzenszwalb2006]). This assumption clearly does not hold in general and may lead to an undesired interpolation matrix $P$. [Kim2011] proposed several “closed form formulas” for energy-aware variable grouping. However, their formulas take into account either the unary term or the pair-wise term, not both. Indeed it is difficult to decide which term dominates and how to fuse these two terms together. Therefore, there is no “closed form” method that successfully integrates both of them. As opposed to these “closed form” methods, we propose a novel empirical scheme for correlation estimation. Empirical estimation of the correlations naturally accounts for and integrates the influence of both the unary and the pair-wise terms. Moreover, our method, inspired by [Ron2011, Livne2011], extends to all energies of the form (2): submodular, non-submodular, metric $V$, arbitrary $V$, arbitrary $W$, energies defined over regular grids and arbitrary graphs.
Variables $i$ and $j$ are correlated by the energy when $l_i = l_j$ yields a relatively low energy value. To estimate these correlations we empirically generate several “locally” low energy assignments, and measure the label agreement between neighboring variables $i$ and $j$. We use Iterated Conditional Modes (ICM) of [Besag1986] to obtain locally low energy assignments: starting with a random assignment, ICM chooses, at each iteration and for each variable, the label yielding the largest decrease of the energy function, conditioned on the labels assigned to its neighbors. Performing $t$ ICM iterations for each of $K$ random initializations provides $K$ locally low energy assignments $\{L^k\}_{k=1}^{K}$. Our empirical dissimilarity $d_{ij}$ between $i$ and $j$ is measured over these assignments, and their correlation $c_{ij}$ is derived from it: the lower the dissimilarity, the stronger the correlation.
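The empirical estimation described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function names, the toy energy and the parameter values are hypothetical, and two specific choices are assumptions of this sketch, namely that the dissimilarity between two neighboring variables is the pair-wise label cost averaged over the ICM samples, and that the correlation is an exponential kernel of that dissimilarity with a scale `sigma`.

```python
import math
import random

def icm(labels, D, W, V, sweeps):
    """Iterated Conditional Modes: greedily relabel one variable at a time."""
    n, num_labels = len(D), len(V)
    labels = list(labels)
    for _ in range(sweeps):
        for i in range(n):
            def local_cost(a):
                # Cost of giving label a to variable i, neighbors held fixed.
                cost = D[i][a]
                for j in range(n):
                    if j != i and (W[i][j] != 0 or W[j][i] != 0):
                        cost += W[i][j] * V[a][labels[j]] + W[j][i] * V[labels[j]][a]
                return cost
            labels[i] = min(range(num_labels), key=local_cost)
    return labels

def estimate_correlations(D, W, V, restarts, sweeps, sigma, seed=0):
    """Correlation between neighbors from several locally low energy assignments.

    Assumed forms (not taken from the text): dissimilarity = mean pair-wise
    label cost over the samples; correlation = exp(-dissimilarity / sigma).
    """
    rng = random.Random(seed)
    n, num_labels = len(D), len(V)
    samples = [icm([rng.randrange(num_labels) for _ in range(n)], D, W, V, sweeps)
               for _ in range(restarts)]
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and W[i][j] != 0:
                d = sum(V[s[i]][s[j]] for s in samples) / restarts
                C[i][j] = math.exp(-d / sigma)
    return C

# Toy chain (hypothetical values): variables 0 and 1 prefer label 0, variable 2 prefers label 1.
D = [[0.0, 5.0], [0.0, 5.0], [5.0, 0.0]]
W = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
V = [[0.0, 1.0], [1.0, 0.0]]

C = estimate_correlations(D, W, V, restarts=5, sweeps=2, sigma=1.0)
```

In this toy case the unary terms dominate, so every ICM run ends in the same assignment: variables 0 and 1 always agree and receive the maximal correlation, while variables 1 and 2 always disagree and receive a lower one. Each restart is independent, which is why this stage parallelizes easily.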
It is interesting to note that strong correlation between variables $i$ and $j$ usually implies that the pair-wise term binding them together ($w_{ij} V(l_i, l_j)$) is a smoothness-preserving type of relation. We assume that even for challenging energies with many contrast-enhancing pair-wise terms, there is still a significant number of smoothness-preserving terms to allow for effective coarsening. Energy-aware correlations between labels: Correlations between labels are easier to estimate, since this information is explicit in the matrix $V$ that encodes the “cost” (i.e., dissimilarity) between two labels. Deriving the label correlations directly from the entries of $V$, we get a “closed-form” expression for the correlations between labels.
From correlations to interpolation
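Using our measure for the variable correlations, $c_{ij}$, we follow the Algebraic Multigrid (AMG) method of [Brandt1986] to compute an interpolation matrix $P$ that softly aggregates strongly correlated variables. We begin by selecting a set of coarse representative variables $\mathcal{V}^c \subset \mathcal{V}$, such that every variable in $\mathcal{V} \setminus \mathcal{V}^c$ is strongly correlated with $\mathcal{V}^c$. That is, every variable in $\mathcal{V}$ is either in $\mathcal{V}^c$ or is strongly correlated to other variables in $\mathcal{V}^c$. A variable $i$ is considered strongly correlated to $\mathcal{V}^c$ if $\sum_{j \in \mathcal{V}^c} c_{ij} \geq \beta \sum_{j \in \mathcal{V}} c_{ij}$. The threshold $\beta$ affects the coarsening rate, i.e., the ratio $n^c / n$: a smaller $\beta$ results in a lower ratio. We perform this selection greedily and sequentially, starting with $\mathcal{V}^c = \emptyset$ and adding $i$ to $\mathcal{V}^c$ if it is not yet strongly correlated to $\mathcal{V}^c$. Given the selected coarse variables $\mathcal{V}^c$, $I(j)$ maps indices of variables from fine to coarse: $I(j)$ is the coarse index of the variable whose fine index is $j$ (illustrated in the accompanying figure). The interpolation matrix $P$ is defined by:
\begin{equation}
P_{i I(j)} = \begin{cases} c_{ij} & i \in \mathcal{V} \setminus \mathcal{V}^c,\ j \in \mathcal{V}^c \\ 1 & i \in \mathcal{V}^c,\ j = i \\ 0 & \text{otherwise} \end{cases} \tag{5}
\end{equation}
We further prune rows of $P$, leaving only the $\delta$ maximal entries. Each row is then normalized to sum to 1. Throughout our experiments we keep $\beta$ and $\delta$ fixed, with separate settings for computing $P$ and $\hat{P}$.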
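The greedy selection of coarse variables and the construction of the interpolation matrix can be sketched as below. This is an illustrative sketch under assumptions, not the authors' code: the names `select_coarse`, `build_interpolation`, `beta` and `delta` are hypothetical, the strong-correlation test is assumed to compare a variable's correlation mass toward the current coarse set against a fraction `beta` of its total correlation mass, and the correlation values in the example are made up.

```python
def select_coarse(C, beta):
    """Greedy, sequential selection of coarse representatives (beta > 0 assumed).

    Variable i joins the coarse set unless its correlation to the variables
    already selected reaches a fraction beta of its total correlation.
    """
    coarse = []
    for i in range(len(C)):
        total = sum(C[i])
        to_coarse = sum(C[i][j] for j in coarse)
        if total == 0 or to_coarse < beta * total:
            coarse.append(i)
    return coarse

def build_interpolation(C, coarse, delta):
    """Soft aggregation: each fine variable is spread over at most delta coarse ones."""
    index = {j: J for J, j in enumerate(coarse)}  # fine index -> coarse index
    P = [[0.0] * len(coarse) for _ in range(len(C))]
    for i in range(len(C)):
        if i in index:
            P[i][index[i]] = 1.0  # a coarse variable represents itself
            continue
        # Keep only the delta strongest correlations to coarse variables.
        strongest = sorted(((C[i][j], index[j]) for j in coarse if C[i][j] > 0),
                           reverse=True)[:delta]
        row_sum = sum(c for c, _ in strongest)
        for c, J in strongest:
            P[i][J] = c / row_sum  # normalize the row to sum to 1
    return P

# Toy correlations on a 4-variable chain (hypothetical values):
# pairs (0, 1) and (2, 3) are strongly correlated, pair (1, 2) only weakly.
C = [[0.0, 0.9, 0.0, 0.0],
     [0.9, 0.0, 0.1, 0.0],
     [0.0, 0.1, 0.0, 0.9],
     [0.0, 0.0, 0.9, 0.0]]

coarse = select_coarse(C, beta=0.2)
P_soft = build_interpolation(C, coarse, delta=2)
P_hard = build_interpolation(C, coarse, delta=1)
```

With `delta=1` every fine variable is attached to a single coarse variable, which corresponds to hard aggregation; larger values allow a fine variable to be interpolated from several coarse neighbors.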
Unified Discrete Multiscale Framework
Algorithm: Discrete multiscale optimization.
Input: energy $(V, D^0, W^0)$.
Init: $s \leftarrow 0$ (fine scale).
Energy pyramid construction (fine to coarse), repeated for each scale $s$:
  Estimate pair-wise correlations $c_{ij}$ at scale $s$ (see “Measuring energy-aware correlations”).
  Compute interpolation matrix $P^s$ (see “From correlations to interpolation”).
  Derive coarse energy $(V, D^{s+1}, W^{s+1})$ (variable-coarsening equation).
Coarse-to-fine optimization, repeated from the coarsest scale down to $s = 0$:
  $U^s \leftarrow \mathrm{Refine}(\tilde{U}^s)$.
  Interpolate a solution $\tilde{U}^{s-1} = P^{s-1} U^s$.
where $\mathrm{Refine}(\tilde{U}^s)$ uses an “off-the-shelf” algorithm to optimize the energy $(V, D^s, W^s)$ with $\tilde{U}^s$ as an initialization.

So far we have described the different components of our multiscale framework. The algorithm above puts them together into a multiscale minimization scheme. Given an energy $(V, D, W)$, our framework first works fine-to-coarse to compute interpolation matrices $\{P^s\}$ that generate an “energy pyramid”. Typically we end up at the coarsest scale with only a small number of variables. As a result, exploring the energy at this scale is robust to the initial assignment of the single-scale method used (in practice we use “winner-take-all” initialization as suggested by [Szeliski2008, §3.1]). Starting from the coarsest scale, a coarse solution at scale $s$ is interpolated to a finer scale $s-1$. At the finer scale it serves as a good initialization for an “off-the-shelf” single-scale optimization that refines this interpolated solution. These two steps are repeated for all scales from coarse to fine. The interpolated solution $\tilde{U}^{s-1}$, at each scale, might not satisfy the binary constraints of (2). We round each row of $\tilde{U}^{s-1}$ by setting the maximal element to $1$ and the rest to $0$.
The most computationally intensive modules of our framework are the empirical estimation of the variable correlations and the single-scale optimization used to refine the interpolated solutions. The complexity of the correlation estimation depends on $|\mathcal{E}|$, the number of non-zero elements in $W$, and on $l$, the number of labels. However, it is fairly straightforward to parallelize this module.
It is now easy to see how our framework generalizes [Felzenszwalb2006], [Komodakis2010] and [Kim2011]: they are restricted to hard aggregation in $P$. [Felzenszwalb2006] and [Komodakis2010] use a multiscale pyramid; however, their variable aggregation is not energy-aware, and is restricted to dyadic pyramids. On the other hand, [Kim2011] have limited energy-aware aggregation, applied to a two-level “pyramid” only. They only optimize at the coarse scale and cannot refine the solution on the fine scale.
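The coarse-to-fine stage, including the rounding of interpolated solutions, can be sketched as follows. This is a schematic illustration with hypothetical names (`coarse_to_fine`, `round_rows`, `refine`): the single-scale optimizer is passed in as a callback, and the indexing convention, in which `interpolations[s]` maps a solution at scale `s + 1` to scale `s`, is an assumption of this sketch.

```python
def matmul(A, B):
    """Plain-Python product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def round_rows(U):
    """Restore the binary constraints: 1 at the maximal entry of each row, 0 elsewhere."""
    rounded = []
    for row in U:
        k = max(range(len(row)), key=lambda a: row[a])
        rounded.append([1.0 if a == k else 0.0 for a in range(len(row))])
    return rounded

def coarse_to_fine(pyramid, interpolations, refine, U_init):
    """Refine at the coarsest scale, then repeatedly interpolate, round and refine.

    pyramid[s] holds the energy at scale s (0 is the finest scale);
    interpolations[s] maps a solution at scale s + 1 to scale s;
    refine(energy, U) is any single-scale optimizer that accepts an initialization.
    """
    U = refine(pyramid[-1], U_init)
    for s in range(len(interpolations) - 1, -1, -1):
        U_interp = round_rows(matmul(interpolations[s], U))
        U = refine(pyramid[s], U_interp)
    return U

# Toy usage (hypothetical values): two scales and a do-nothing "optimizer",
# so the result shows the effect of interpolation and rounding alone.
P = [[1.0, 0.0], [0.7, 0.3], [0.0, 1.0]]
identity_refine = lambda energy, U: U
U_fine = coarse_to_fine(pyramid=[None, None], interpolations=[P],
                        refine=identity_refine, U_init=[[1.0, 0.0], [0.0, 1.0]])
```

Passing the optimizer as a callback mirrors the way the framework treats ICM or large move making algorithms as interchangeable single-scale components.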
Our experiments have two main goals: first, to stress the difficulty of approximating non-submodular energies and to show the advantages of primal methods for this type of minimization problem. The other goal is to demonstrate how our unified multiscale framework improves the performance of existing single-scale primal methods. We evaluated our multiscale framework on a diversity of discrete optimization tasks (code available at www.wisdom.weizmann.ac.il/~bagon/matlab.html), ranging from challenging non-submodular synthetic and co-clustering energies to low-level submodular vision energies such as denoising and stereo. In addition we provide a comparison between the different methods for measuring variable correlations that were presented in the section on measuring energy-aware correlations. We conclude with a label coarsening experiment. In all of these experiments we minimize a given publicly available benchmark energy; we do not attempt to improve on the energy formulation itself.
We use ICM ([Besag1986]), $\alpha\beta$-swap and $\alpha$-expansion (large move making algorithms of [Boykov2001]) as representative single-scale “off-the-shelf” primal optimization algorithms. To help large move making algorithms overcome the non-submodularity of some of these energies we augment them with QPBO(I) of [Rother2007]. We follow the protocol of [Szeliski2008] that uses the lower bound of TRW-S ([Kolmogorov2006]) as a baseline for comparing the performance of different optimization methods for different energies. We report how close the results are (in percent) to the lower bound: closer to $100\%$ is better. We show a remarkable improvement for ICM combined with our multiscale framework compared with a single-scale scheme. For the large move making algorithms there is a smaller but consistent improvement of the multiscale over a single-scale scheme. TRW-S is a dual method and is considered state-of-the-art for discrete energy minimization [Szeliski2008]. However, we show that when it comes to non-submodular energies it lags behind the large move making algorithms and even ICM. For these challenging energies, multiscale gives a significant boost in optimization performance.
We begin with synthetic non-submodular energies defined over a 4-connected grid graph. The unary term $D$, the pair-wise term $V$ and the weights $w_{ij}$ are randomly generated. The parameter $\lambda$ controls the relative strength of the pair-wise term: a stronger pair-wise term (i.e., larger $\lambda$) results in energies that are more difficult to optimize (see [Kolmogorov2006]). The accompanying table shows results, averaged over 100 experiments. The resulting synthetic energies are non-submodular (since $w_{ij}$ may become negative). For these challenging energies, the state-of-the-art dual method (TRW-S) performs rather poorly (worse than single-scale ICM) and there is a significant gap between the lower bound and the energy of the actual primal solution provided. (We did not restrict the number of iterations, and let TRW-S run until no further improvement to the lower bound was made.) These results motivate our focus on primal methods and, among the primal methods used, especially on $\alpha\beta$-swap.
[Table: column headers “Ours” and “single” (twice), followed by “Ann.”; cell values not recoverable.]
[Figure caption, partially recovered: (... = lower than baseline).] Our multiscale approach combined with QPBO achieves consistently better energies than the baseline, with very low variance. TRW-S improves on only 25% of the instances, with very high variance in the results.
Chinese character inpainting
We further experiment with learned binary energies of [Nowozin2011, §5.2] (available at www.nowozin.net/sebastian/papers/DTF_CIP_instances.zip). These 100 instances of non-submodular pair-wise energies are defined over a 64-connected grid. These energies were designed and trained to perform the task of learning Chinese calligraphy, represented as a complex, non-local binary pattern. Despite the very large number of parameters involved in representing such complex energies, learning is conducted very efficiently using Decision Tree Field (DTF). The main challenge in these models becomes the inference at test time. Our experiments show how approaching these challenging energies using our unified multiscale framework allows for better approximations. The accompanying table and figure compare our multiscale framework to single-scale methods acting on the primal binary variables. Since the energies are binary, we use QPBO instead of large move making algorithms. We also provide an evaluation of a dual method (TRW-S) on these energies. In addition to the quantitative results, a further figure provides a visualization of some of the instances of the restored Chinese characters. These “real world” energies highlight the advantage primal methods have over dual ones when it comes to challenging non-submodular energies. It is further clear that a significant improvement is made by our multiscale framework.
The problem of co-clustering addresses the matching of superpixels within and across frames in a video sequence. Following [Bagon2012, §6.2], we treat co-clustering as a discrete minimization of a non-submodular Potts energy. We obtained 77 co-clustering energies, courtesy of [Glasner2011], used in their experiments. The number of variables in each energy ranges from 87 to 788. Their sparsity (percent of non-zero entries in $W$) varies between instances. The resulting energies are non-submodular, have no underlying regular grid, and are very challenging to optimize [Bagon2012]. The accompanying table compares our discrete multiscale framework combined with ICM and $\alpha\beta$-swap. For these energies we use a different baseline: the state-of-the-art results of [Glasner2011], obtained by applying a specially tailored convex relaxation method. (We do not use the lower bound of TRW-S here since it is far from being tight for these challenging energies.) Our multiscale framework improves on the state of the art for this family of challenging energies.
We further applied our multiscale framework to optimize less challenging semi-metric energies. We use the diverse low-level vision MRF energies from the Middlebury benchmark [Szeliski2008] (available at vision.middlebury.edu/MRF/). For these semi-metric energies, TRW-S (single scale) performs quite well and in fact, if enough iterations are allowed, its lower bound converges to the global optimum. As opposed to TRW-S, large move making and ICM do not always converge to the global optimum. Yet, we are able to show a significant improvement for primal optimization algorithms when used within our multiscale framework. The accompanying tables and figures show our multiscale results for the different submodular energies. One of the conclusions of the Middlebury challenge was that ICM is no longer a valid candidate for optimization. Integrating ICM into our multiscale framework puts it back on the right track. A further table exemplifies how our framework improved running times for two difficult energies (“Penguin” denoising and “Venus” stereo).
[Tables and figures for the Middlebury energies: each has three pairs of columns comparing “Ours” with “single scale” (one figure also shows the ground truth); the running-time tables list times in seconds for “ours” and for the single-scale method. Cell values and images not recoverable.]
Comparing variable correlation estimation methods
As explained in the discussion of energy-aware interpolation, the correlations between the variables are the most crucial component in constructing an effective multiscale scheme. In this experiment we compare our energy-aware correlation measure (see “Measuring energy-aware correlations”) to three methods proposed by [Kim2011]: “unary-diff”, “min-unary-diff” and “mean-compat”. These methods estimate the correlations based either on the unary term or the pair-wise term, but not both. We also compare to an energy-agnostic measure that treats all neighboring variables as equally correlated; this measure underlies [Felzenszwalb2006]. We use ICM within our framework to evaluate the influence these methods have on the resulting multiscale performance for four representative energies. The accompanying figure shows percent of lower bound for the different energies. Our measure consistently outperforms all other methods, and successfully balances the influence of the unary and the pair-wise terms.
$\alpha\beta$-swap does not scale gracefully with the number of labels. Coarsening an energy in the label domain (i.e., same number of variables, fewer labels) significantly improves the performance of $\alpha\beta$-swap, as shown in the accompanying table. For these examples constructing the energy pyramid took only milliseconds, due to the “closed form” formula for estimating label correlations. Our principled framework for coarsening labels improves $\alpha\beta$-swap performance for these energies.
This work presents a unified multiscale framework for discrete energy minimization that allows for efficient and direct exploration of the multiscale landscape of the energy. We propose two paths to expose the multiscale landscape of the energy: one in which coarser scales involve fewer and coarser variables, and another in which the coarser levels involve fewer labels. We also propose adaptive methods for energy-aware interpolation between the scales. Our multiscale framework significantly improves optimization results for challenging energies. Our framework provides the mathematical formulation that “bridges the gap” and relates multiscale discrete optimization and algebraic multiscale methods used in PDE solvers (e.g., [Brandt1986]). This connection allows for methods and practices developed for numerical solvers to be applied in multiscale discrete optimization as well.
Derivation of Eq. (2)
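In this work we consider discrete pair-wise minimization problems of the form:
\begin{equation}
E(L) = \sum_{i \in \mathcal{V}} \varphi_i(l_i) + \sum_{(i,j) \in \mathcal{E}} w_{ij}\, V(l_i, l_j). \tag{1}
\end{equation}
Using the following parameterizations: $D \in \mathbb{R}^{n \times l}$ s.t. $D_{i\alpha} = \varphi_i(\alpha)$, and $V \in \mathbb{R}^{l \times l}$ s.t. $V_{\alpha\beta} = V(\alpha, \beta)$, we claim that (1) is equivalent to:
\begin{equation}
E(U) = \operatorname{Tr}\left( D U^T + W U V U^T \right) \quad \text{s.t.} \quad U \in \{0,1\}^{n \times l},\ \sum_{\alpha=1}^{l} U_{i\alpha} = 1, \tag{2}
\end{equation}
assuming both $V$ and $W$ are symmetric (if they are not, we need to be slightly more careful with transposing them, but a roughly similar expression can be derived). Looking at the different components in (2):
\begin{equation*}
\left[ U V U^T \right]_{ij} = \sum_{\alpha, \beta} U_{i\alpha} V_{\alpha\beta} U_{j\beta} = V(l_i, l_j).
\end{equation*}
Looking at the trace of the second term:
\begin{equation*}
\operatorname{Tr}\left( W U V U^T \right) = \sum_{i} \left[ W U V U^T \right]_{ii} = \sum_{i} \sum_{j} w_{ij} \left[ U V U^T \right]_{ji} = \sum_{ij} w_{ij}\, V(l_j, l_i) = \sum_{ij} w_{ij}\, V(l_i, l_j),
\end{equation*}
where the sum runs over the non-zero entries of $W$, i.e., the edges $\mathcal{E}$. As for the unary term:
\begin{equation*}
\operatorname{Tr}\left( D U^T \right) = \sum_{i} \left[ D U^T \right]_{ii} = \sum_{i} \sum_{\alpha} D_{i\alpha} U_{i\alpha} = \sum_{i} D_{i l_i} = \sum_{i} \varphi_i(l_i).
\end{equation*}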
Note that the diagonal of $W$ is assumed to be zero: this is a reasonable assumption as $w_{ii}$ represents an interaction of variable $i$ with itself. This type of interaction is well represented by the unary term $D_{i\alpha}$. When coarsening the energy it may happen that $W^c$ will no longer have zeros on the diagonal. This case may arise when a single coarse variable represents neighboring fine scale variables. In that case the fine scale pair-wise interaction should be absorbed into the coarse scale unary term. It is easy to see that the term $w^c_{ii} V_{\alpha\alpha}$ should be added to the unary term $D^c_{i\alpha}$ whenever $W^c$ has non-zeros on the diagonal. After this rectification, the non-zero entries on the diagonal of $W^c$ can be set to zero.
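The rectification of a non-zero diagonal can be sketched in a few lines. This is a minimal illustration with hypothetical names and toy values, assuming the self-interaction of a variable labeled with a given label costs its diagonal weight times the corresponding diagonal entry of the label cost matrix, and that the energy sums the pair-wise cost over all ordered pairs, including a variable paired with itself.

```python
def energy_from_labels(labels, D, W, V):
    """Unary costs plus weighted pair-wise costs over all ordered pairs (i, j), i == j included."""
    n = len(labels)
    unary = sum(D[i][labels[i]] for i in range(n))
    pairwise = sum(W[i][j] * V[labels[i]][labels[j]]
                   for i in range(n) for j in range(n))
    return unary + pairwise

def absorb_diagonal(D, W, V):
    """Move self-interactions W[i][i] * V[a][a] into the unary term and zero the diagonal of W."""
    n, num_labels = len(D), len(V)
    D_new = [[D[i][a] + W[i][i] * V[a][a] for a in range(num_labels)] for i in range(n)]
    W_new = [[0.0 if i == j else W[i][j] for j in range(n)] for i in range(n)]
    return D_new, W_new

# Toy coarse energy (hypothetical values) whose weight matrix has a non-zero diagonal.
D = [[0.0, 1.0], [2.0, 0.5]]
W = [[0.5, 1.0], [1.0, 0.25]]
V = [[0.5, 1.0], [1.0, 0.25]]

D_fixed, W_fixed = absorb_diagonal(D, W, V)
```

For every binary labeling the energy is unchanged by this step, so single-scale optimizers that expect a zero diagonal can be applied to the rectified coarse energy.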
More General Energy Functions
This chapter has focused on the construction of an energy pyramid for energies of the form (1). However, this form does not cover all possible pair-wise energies. We have used this slightly restricted form throughout the paper since it emphasizes the simplicity of the algebraic derivation of our multiscale framework. Nevertheless, our multiscale framework can just as easily be applied to more general energies.
A general form for a pair-wise energy over a graph $(\mathcal{V}, \mathcal{E})$ can be written as
\begin{equation*}
E(L) = \sum_{i \in \mathcal{V}} \varphi_i(l_i) + \sum_{(i,j) \in \mathcal{E}} \varphi_{ij}(l_i, l_j).
\end{equation*}
In this more general form, the pair-wise term can be entirely different for each pair: $\varphi_{ij}(l_i, l_j)$. This is in contrast to the energy (1), where the pair-wise terms differ only by the scaling factor $w_{ij}$ of a single fixed term $V(l_i, l_j)$. The Photomontage energies of [Szeliski2008, §4.2] are an example of such general pair-wise energies. Instead of using a pair of matrices $V$ and $W$ to parameterize the pair-wise terms, we use a collection of matrices $\{V_{ij}\}_{(i,j) \in \mathcal{E}}$ for the same purpose in the general setting. Each matrix $V_{ij}$ is of size $l \times l$ and is defined as $\left[V_{ij}\right]_{\alpha\beta} = \varphi_{ij}(\alpha, \beta)$. A general energy is now parameterized by $(\mathcal{V}, \mathcal{E}, D, \{V_{ij}\})$.
Coarsening variables: The computation of the interpolation matrix $P$ is unchanged. ICM can be applied to energies of the general form to estimate agreement between neighboring variables. These agreements are then used to compute the interpolation matrix $P$ (see “From correlations to interpolation”). In order to write the coarsening of the pair-wise term we need to introduce some notation: let $i, j$ be variables at the fine scale, and let $I, J$ denote variables at the coarse scale. An entry, $P_{iI}$, in the interpolation matrix indicates how fine scale variable $i$ is affected by coarse scale variable $I$. Given an interpolation matrix $P$ we can coarsen the energy by
\begin{equation*}
D^c = P^T D, \qquad V^c_{IJ} = \sum_{(i,j) \in \mathcal{E}} P_{iI}\, P_{jJ}\, V_{ij}.
\end{equation*}
The coarse graph, $\mathcal{E}^c$, is defined by all the non-zero pair-wise terms $V^c_{IJ}$. The coarser energy is now parameterized by $(\mathcal{V}^c, \mathcal{E}^c, D^c, \{V^c_{IJ}\})$.
Coarsening labels: In this case, the computation of the interpolation matrix $\hat{P}$ is not trivial, since we no longer have a single matrix $V$ to derive the agreements from. We leave this issue of deriving an efficient interpolation matrix for the general case for future work. However, given a matrix $\hat{P}$ the coarsening of labels can be done easily:
\begin{equation*}
\hat{D} = D \hat{P}^T, \qquad \hat{V}_{ij} = \hat{P}\, V_{ij}\, \hat{P}^T \quad \forall (i,j) \in \mathcal{E},
\end{equation*}
yielding a coarser energy parameterized by $(\mathcal{V}, \mathcal{E}, \hat{D}, \{\hat{V}_{ij}\})$, with the same number of variables, $n$, but fewer labels $l^c < l$. It is fairly straightforward to see that the energy (1) is a special case of the more general form (with $V_{ij} = w_{ij} V$), and so the coarsening of variables and labels described in the main text can be seen as special cases of the two coarsening rules of this section.
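Variable coarsening for general pair-wise energies can be checked numerically in the same spirit as before. The sketch below is an illustration under assumptions, not the authors' code: per-edge cost matrices are stored in a dictionary keyed by the ordered variable pair, the names `energy_general` and `coarsen_general` are hypothetical, and each coarse pair-wise matrix is assumed to accumulate the fine matrices weighted by the product of the two corresponding interpolation entries.

```python
def matmul(A, B):
    """Plain-Python product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def energy_general(U, D, pair):
    """Relaxed energy with a separate cost matrix pair[(i, j)] for every edge."""
    unary = sum(D[i][a] * U[i][a] for i in range(len(U)) for a in range(len(U[0])))
    pairwise = sum(U[i][a] * Vij[a][b] * U[j][b]
                   for (i, j), Vij in pair.items()
                   for a in range(len(Vij)) for b in range(len(Vij[0])))
    return unary + pairwise

def coarsen_general(D, pair, P):
    """Coarse unary term P^T D and coarse per-pair matrices sum_ij P[i][I] P[j][J] V_ij."""
    n, nc, num_labels = len(P), len(P[0]), len(D[0])
    Dc = [[sum(P[i][I] * D[i][a] for i in range(n)) for a in range(num_labels)]
          for I in range(nc)]
    pair_c = {}
    for (i, j), Vij in pair.items():
        for I in range(nc):
            for J in range(nc):
                weight = P[i][I] * P[j][J]
                if weight == 0:
                    continue  # this fine edge does not touch the coarse pair (I, J)
                acc = pair_c.setdefault((I, J), [[0.0] * num_labels for _ in range(num_labels)])
                for a in range(num_labels):
                    for b in range(num_labels):
                        acc[a][b] += weight * Vij[a][b]
    return Dc, pair_c

# Toy general energy (hypothetical values): 3 variables, 2 labels, 2 different edge costs.
D = [[0.0, 1.0], [2.0, 0.5], [1.0, 1.5]]
pair = {(0, 1): [[0.0, 1.0], [2.0, 0.0]],
        (1, 2): [[0.0, 3.0], [1.0, 0.0]]}
P = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
Uc = [[1.0, 0.0], [0.0, 1.0]]

Dc, pair_c = coarsen_general(D, pair, P)
fine_energy = energy_general(matmul(P, Uc), D, pair)
coarse_energy = energy_general(Uc, Dc, pair_c)
```

The keys of `pair_c` play the role of the coarse edge set; entries of the form `(I, I)` correspond to fine edges that collapsed into a single coarse variable and can be absorbed into the unary term.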
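Discrete energies may involve terms that are beyond pair-wise: that is, terms that describe interactions between sets of variables. These energies are often referred to as high-order energies. Examples of such energies can be found in, e.g., [Kohli2009, Rother2009]. A high-order energy is defined over a hyper-graph $(\mathcal{V}, \mathcal{S})$, where the hyper-edges are subsets of variables $\{i_1, \ldots, i_k\} \in \mathcal{S}$ s.t. $\{i_1, \ldots, i_k\} \subseteq \mathcal{V}$:
\begin{equation*}
E(L) = \sum_{\{i_1, \ldots, i_k\} \in \mathcal{S}} \varphi_{i_1 \ldots i_k}\left(l_{i_1}, \ldots, l_{i_k}\right),
\end{equation*}
where the high-order terms are $k$-way discrete functions:
\begin{equation*}
\varphi_{i_1 \ldots i_k} : \{1, \ldots, l\}^k \rightarrow \mathbb{R}.
\end{equation*}
Energies of this form can be parameterized using tensors. Each high-order term, $\varphi_{i_1 \ldots i_k}$, is parameterized by a $k$-order tensor
\begin{equation*}
\left[ V_{i_1 \ldots i_k} \right]_{\alpha_1 \ldots \alpha_k} = \varphi_{i_1 \ldots i_k}(\alpha_1, \ldots, \alpha_k).
\end{equation*}
A high-order energy is now parameterized by $(\mathcal{V}, \mathcal{S}, \{V_{i_1 \ldots i_k}\})$.
Coarsening variables: The computation of the interpolation matrix $P$ is unchanged. ICM can be applied to high-order energies to estimate agreement between neighboring variables. These agreements are then used to compute the interpolation matrix $P$ (see “From correlations to interpolation”). Given an interpolation matrix $P$ we can coarsen the energy by
\begin{equation*}
V^c_{I_1 \ldots I_k} = \sum_{\{i_1, \ldots, i_k\} \in \mathcal{S}} \left( \prod_{m=1}^{k} P_{i_m I_m} \right) V_{i_1 \ldots i_k}
\end{equation*}
for all coarse variables $I_1, \ldots, I_k$ with non-zero entries $P_{i_m I_m}$. These non-zero interactions in $V^c$ define the coarse scale hyper-edges in $\mathcal{S}^c$. Note that when two (or more) fine-scale variables $i_m, i_{m'}$ are represented by the same coarse variable $I$, the size of the coarse scale hyper-edge is reduced relative to the size of the fine scale hyper-edge.
Coarsening labels: In this case, the computation of the interpolation matrix $\hat{P}$ is not trivial, since we no longer have a clear representation of the interactions between the different labels. We leave this issue of deriving an efficient interpolation matrix for the general case for future work. However, given a matrix $\hat{P}$ the coarsening of labels can be done easily. Using $\hat{\alpha}$ to denote coarse scale labels,
\begin{equation*}
\left[ \hat{V}_{i_1 \ldots i_k} \right]_{\hat{\alpha}_1 \ldots \hat{\alpha}_k} = \sum_{\alpha_1, \ldots, \alpha_k} \left( \prod_{m=1}^{k} \hat{P}_{\hat{\alpha}_m \alpha_m} \right) \left[ V_{i_1 \ldots i_k} \right]_{\alpha_1 \ldots \alpha_k},
\end{equation*}
yielding a coarser energy parameterized by $(\mathcal{V}, \mathcal{S}, \{\hat{V}_{i_1 \ldots i_k}\})$, with the same number of variables, $n$, but fewer labels $l^c < l$.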