1 Problem Formulation
The multi-class image labeling problem consists in finding, for each pixel in the image domain , a label which assigns one of class labels to so that the labeling function adheres to some local data fidelity as well as nonlocal spatial coherency constraints.
This problem class occurs in many applications, such as segmentation, multiview reconstruction, stitching, and inpainting [PCF06]. We consider the variational formulation
The data term assigns to each possible label a local cost , while the regularizer enforces the desired spatial coherency. We will in particular be interested in regularizers that penalize the weighted length of interfaces between regions of constant labeling. Minimizing is an inherently combinatorial problem, as there is a discrete decision to be made for each point in the image.
In the fully discrete setting, the problem can be expressed in terms of a Markov Random Field [Win06] with a discrete state space, where the data and regularization terms can be thought of as unary and binary potentials, respectively. For graph-based discretizations of , the resulting objective only contains terms that depend on the labels at one or two points, and the problem can be approached with fast graph cut-based methods. Unfortunately, this scheme introduces an anisotropy [Boy03] and thus does not represent isotropic regularizers well. Using ternary or higher-order terms, the metrication error can be reduced, but graph-based methods then cannot be directly applied.
However, it can be shown that even in the graph-based representation the problem is NP-hard for relatively simple [BVZ01], so we cannot expect to easily derive fast solvers for this problem. This is in part caused by the discrete nature of the feasible set. In the following, we will relax this set. This allows to solve the problem in a globally optimal way using convex optimization methods. On the downside, we cannot expect the relaxation to be exact for any problem instance, i.e. we might get non-discrete (or discrete but suboptimal) solutions.
There are several choices for the relaxation method, of which in our opinion the following is the most transparent (Fig. 1): Identify label with the -th unit vector , set , and solve
The data term is now linear in and fully described by the vector
Due to the linearization, the local costs may be arbitrarily complicated, possibly derived from a probabilistic model, without affecting the overall problem class. In this form, we relax the label set by allowing to take “intermediate” values in the convex hull of the original label set. This is just the standard simplex ,
The problem is then considered on the relaxed feasible set ,
The space of functions of bounded variation guarantees a minimal regularity of the discontinuities of , see Sect. 2.2. Assuming we can extend the regularizer to the whole relaxed set , we get the relaxed problem
If can be made convex, the overall problem is convex as well, and it may likely be computationally tractable. In addition, should ideally have a closed-form expression, or at least lead to a computationally tractable problem.
Whether these points are satisfied depends on the way a given regularizer is extended to the relaxed set. The prototypical example for such a regularizer is the total variation,
where denotes the Frobenius norm, in this case on . Note that may be discontinuous, so the gradient has to be understood in a distributional sense (Sect. 2.2). Much of the popularity of stems from the fact that it allows to include boundary-length terms: The total variation of the indicator function of a set ,
called the perimeter of , is just the classical length of the boundary .
In this paper, we will in more generality consider ways to construct regularizers which penalize interfaces between two adjacent regions with labels according to the perimeter (i.e. length or area) of the interface weighted by an interaction potential depending on the labels (in a slight abuse of notation the interaction potential is also denoted by , since there is rarely any ambiguity with respect to the ambient space dimension). The simplest case is the Potts model with the uniform metric iff and otherwise . In this case, the regularizer penalizes the total interface length, as seen above for the total variation.
As a prime motivation for our work, consider the two-class case with . As here the second component of is given by the first via , we may pose the relaxed problem in the form
where is a scalar. This formulation is also referred to as continuous cut in analogy to graph cut methods. It can be shown [CEN06] that while there may be non-discrete solutions of the relaxed problem, a discrete – i.e. – global optimal solution can be recovered from any solution of the relaxed problem. We can thus reduce the combinatorial problem to a convex problem. While there are reasons to believe that this procedure cannot be extended to the multi-class case, we may still hope for “nearly” discrete solutions.
1.1 Related Work
The difficulty of the labeling problem varies greatly with its precise definition. Formulations of the labeling problem can be categorized based on
whether they tackle the binary (two-class) or the much harder multiclass problem, and
whether they rely on a graph representation or are formulated in a spatially continuous framework.
An early analysis of a variant of the binary continuous cut problem and the associated dual maximal flow problem was done by Strang [Str83]. Chan et al. [CEN06] pointed out that by thresholding a nonbinary result of the relaxed, convex problem at almost any threshold one can generate binary solutions of the original, combinatorial problem (this can be carried over to any threshold and to slightly more general regularizers [Ber09, ZNF09]. The proof heavily relies on the coarea formula [FR60], which unfortunately does not transfer to the multiclass case. The binary continuous case is also closely related to the thoroughly analyzed Mumford-Shah [MS89] and Rudin-Osher-Fatemi (ROF) [ROF92] problems, where a quadratic data term is used.
For the graph-based discretization, the binary case can be formulated as a minimum-cut problem on a grid graph, which allows to solve the problem exactly and efficiently for a large class of metrics using graph cuts [KB05, BK04]. Graph cut algorithms have become a working horse in many applications as they are very fast for medium sized problems. Unfortunately they offer hardly any potential for parallelization. As mentioned in the introduction, the graph representation invariably introduces a grid bias [Boy03] (Fig. 2). While it is possible to reduce the resulting artifacts by using larger neighborhoods, or by moving higher-order potentials through factor graphs, this greatly increases graph size as well as processing time.
Prominent methods to handle the graph-based multiclass case rely on finding a local minimum by solving a sequence of binary graph cuts [BVZ01] (see [KT07] for a recent generalization). These methods have recently been extended to the continuous formulation [TPCB08] with similar theoretical performance [Ols09]. Our results can be seen as a continuous analogon to [Ish03], where it was shown that potentials of the form can be exactly formulated as a cut on a multi-layered graph. An early analysis can be found in [KT99]
, where the authors also derive suboptimality bounds of a linear programming relaxation for metric distances. All these methods rely on the graph representation with pairwise potentials.
In this paper, we will focus on the continuous multiclass setting with higher order potentials in the discretization. Closely related to our approach is [CCP08]. In contrast to approaches that rely on a linear ordering of the labels [Ish03, BT09], the authors represent labels in a higher-dimensional space. In a certain sense, [LLT06] can be seen as a predecessor of this approach: in this work, the authors represent the label assignment using a piecewise constant real-valued function, but parametrize this function using a set of polynomial basis functions, which enables them to employ the Potts regularizer.
The approach of [CCP08] allows to formulate interaction potentials of the form with nondecreasing, positive, concave . The authors provide a thorough analysis of the continuous model and propose a relaxation based on the convex envelope, which gives almost discrete solutions in many cases. We will extend this approach to the more general class of regularizers where is an arbitrary metric. The same authors proposed a “Fast Primal-Dual” algorithm with proven convergence to solve the associated saddle point problem [PCBC09a]. By lifting the objective to a higher dimensional space, it turns out that the same method can be used to solve many problems also with nonlinear data term [PCBC09b].
Our approach is a generalization of [ZGFN08, LKY09], where a similar linearization is used with the regularizer restricted to the Potts distance, and with less strong convergence results. These methods have also been extended to segmentation on manifolds [DFPH09].
Regarding optimization, several authors proposed smoothing of the primal or dual objective together with gradient descent [Ber09, BYT10]. In contrast, our approach does not require any a priori smoothing. Using Nesterov’s method [Nes04] for the labeling problem was proposed in [LKY09]. An earlier analysis of the method in the context of -norm and TV minimization can be found in [WABF07]. In [BBC09] the method is applied to a class of general regularized problems. In [GBO09] a predecessor of the proposed Douglas-Rachford approach was presented in the Split Bregman framework [Set09a] and restricted to the two-class case. We provide an extension to the multi-class case, with proof of convergence and a sound stopping criterion.
The paper is organized as follows:
Given such a metric, we study two possible approaches to extend regularizers on the relaxed set (Sect. 4):
The “Euclidean distance” approach (Sect. 4.4), which yields exact extensions for Euclidean metrics only but has a closed form expression. We review some methods for the approximation of non-Euclidean metrics.
We provide a unified continuous framework and show existence of a minimizer.
We provide and analyze two different methods that are capable of minimizing the specific class of saddle point problems (Sect. 6):
A specialization of a method for nonsmooth optimization as suggested by Nesterov (Sect. 6.2). The method is virtually parameter-free and provides explicit a priori and a posteriori optimality bounds.
A Douglas-Rachford splitting approach (Sect. 6.3). We show that the approach allows to compute a sequence of dual iterates that provide an optimality bound and stopping criterion in form of the primal-dual gap.
Finally, we illustrate and compare the above methods under varying conditions and demonstrate the applicability of the proposed approaches on real-world problems (Sect. 7).
In contrast to existing graph-based methods, we provide a continuous and isotropic formulation, while in comparison with existing continuous approaches, we provide a unified framework for arbitrary, non-uniform metrics . The Euclidean metric method and Nesterov optimization have been announced in less generality in [LKY09, LBS09].
2 Mathematical Preliminaries
In the following sections we provide a reference of the notation used, and a brief introduction to the concept of functions of bounded variation and corresponding functionals. We aim to provide the reader with the basic ideas. For more detailed expositions we refer to [AFP00, Zie89].
In the following, superscripts denote a collection of vectors or matrix columns, while subscripts denote vector components, i.e. we denote, for ,
An additional bracket indicates an element of a sequence . We will frequently make use of the Kronecker product [Gra81]
in order to formulate all results for arbitrary dimensions. The standard simplex in is denoted by , where .
is the identity matrix inand the usual Euclidean norm for vectors resp. the Frobenius norm for matrices. Analogously, the standard inner product extends to pairs of matrices as the sum over their elementwise product. denotes the ball of radius at , and the set of with
. The characteristic functionof a set is defined as iff and otherwise. By iff and otherwise we denote the corresponding indicator function. For a convex set , is the support function from convex analysis. denotes the classical Jacobian of .
is the space of -times continuously differentiable functions on with compact support, and the completion of under the supremum norm. As usual, denotes the -dimensional Lebesgue measure, while denotes the -dimensional Hausdorff measure. For some measure and set , denotes the restriction of to , i.e. .
2.2 Total Variation and
The total variation will be our main tool to construct the regularizer . For a differentiable scalar-valued function , the total variation is simply the integral over the norm of its gradient:
As is the designated labeling function, which ideally should be piecewise constant, the differentiability and continuity assumptions have to be dropped. In the following we will shortly review the general definition of the total variation and its properties.
We require the image domain to be a bounded open domain with compact Lipschitz boundary, that is the boundary can locally be represented as the graph of a Lipschitz-continuous function. For simplicity, we will assume in the following that .
We consider general vector-valued functions which are locally absolutely integrable, i.e. . As is bounded this is equivalent to being absolutely integrable, i.e. . For any such function , its total variation is defined in a dual way [AFP00, (3.4)] as
This definition can be derived for continuously differentiable by extending (12) to vector-valued ,
replacing the norm by its dual formulation and partial integration. If has finite total variation, i.e. , is said to be of bounded variation. The vector space of all such functions is denoted by :
Equivalently, iff and its distributional derivative corresponds to a finite Radon measure; i.e. and there exist -valued measures on the Borel subsets of such that [AFP00, p.118]
These measures form the distributional gradient , which is again a measure that vanishes on any -negligible set. If then , where is the total variation of the measure in the measure-theoretic sense [AFP00, 3.6]. The total variation of characteristic functions has an intuitive geometrical interpretation: For a Lebesgue-measurable subset , its perimeter is defined as the total variation of its characteristic function,
Assuming the boundary is sufficiently regular, is just the classical length () or area () of the boundary.
2.3 Properties of and Compactness
We review the most important ingredients for proving existence of minimizers for variational problems on involving involving .
Convexity. As is the pointwise supremum of a family of linear functions, it is convex and positively homogeneous, i.e. for .
Lower Semicontinuity. A functional is said to be lower semicontinuous with respect to some topology, if for any sequence converging to ,
It can be shown that for fixed , the total variation is well-defined on and lower semicontinuous in w.r.t. the topology [AFP00, 3.5,3.6]; hence also in due to the boundedness of .
Compactness. For , instead of the norm topology induced by
which makes a Banach space but is often too strong, one frequently uses the weak* convergence: Define weakly* iff
weakly* in measure, i.e.
For this is equivalent to in , and being uniformly bounded in , i.e. there exists a constant such that [AFP00, 3.13]. For the weak* topology in , a compactness result holds [AFP00, 3.23]: If and is uniformly bounded in , then contains a subsequence weakly*-converging to some .
2.4 General Functionals on
We will now review how general functionals depending on the distributional gradient can be defined. Recall that for any the distributional gradient is a measure. Moreover, it can be uniquely decomposed into three mutually singular measures
that is: An absolutely continuous part , the jump part , and the Cantor part . Mutual singularity refers to the fact that can be partitioned into three subsets, such that each of the measures is concentrated on exactly one of the sets. We will give a short intuitive explanation, see [AFP00, 3.91] for the full definitions.
The part is absolutely continuous with respect to the -dimensional Lebesgue measure , i.e. it vanishes on any -negligible set. It captures the “smooth” variations of : in any neighborhood where has a (possibly weak) Jacobian , the jump and Cantor parts vanish and
The jump part is concentrated on the set of points where locally jumps between two values and along a -dimensional hypersurface with normal (unique up to a change of sign). In fact, there exists a jump set of discontinuities of and Borel functions and such that [AFP00, 3.78, 3.90]
where denotes the restriction of the -dimensional Hausdorff measure on the jump set , i.e. for measurable sets . The Cantor part captures anything that is left.
As an important consequence of the mutual singularity, the total variation decomposes into . Using this idea, one can define functionals depending on the distributional gradient [AFP00, Prop. 2.34]. For and some convex, lower semi-continuous , define
Here is the recession function of , and denotes the polar decomposition of , which is the density of with respect to its total variation measure . If is positively homogeneous, holds, and
From (25) it becomes clear that the meaning of acting on the Jacobian of is extended to the jump set as acting on the difference of the left and right side limits of at the discontinuity. This is a key point: by switching to the measure formulation, one can handle noncontinuous functions as well.
3 Necessary Properties of the Interaction Potential
Before applying the methods above to the labeling problem, we start with some basic considerations about the regularizer and the interaction potential . We begin by formalizing the requirements on the regularizer of the relaxed problem as mentioned in the introduction. Let us assume we are given a general interaction potential . Intuitively, assigns a weight to switching between label and label . We require
but no other metric properties (i.e. symmetry or triangle inequality) for now. Within this work, we postulate that the regularizer should satisfy
is convex and positively homogeneous on .
for any constant , i.e. there is no penalty for constant labelings.
For any partition of into two sets with , and any ,
i.e. a change from label to label gets penalized proportional to as well as the perimeter of the interface. Note that this implies that is isotropic (i.e. rotationally invariant).
We require convexity in (P1) in order to render global optimization tractable. Indeed, if is convex, the overall objective function (6) will be convex as well due to the linearization of the data term. Positive homogeneity is included as it allows to be represented as a support function (i.e. its convex conjugate is an indicator function and for some closed convex set ), which will be exploited by our optimization methods.
Requirements (P3) and (P2) formalize the principle that the multilabeling problem should reduce to the classical continuous cut (9) in the two-class case. This allows to include boundary length-based terms in the regularizer that can additionally be weighted depending on the labels of the adjoining region (Fig. 4).
Together, these requirements pose a natural restriction on :
Let satisfy (P1) – (P3). Then must necessarily be a metric, i.e. for all ,
1. follows from (P2) and (P3) by choosing and with . Symmetry in 2. is obtained from (P3) by replacing with , as . To show 3., first note that for any constant and all , since . Fix any set with perimeter
Then, using the above mentioned fact and the positive homogeneity of ,
Note that if the requirement (27) is dropped, it is easy to show that if for some , then for any . In this case the classes and can be collapsed into a single class as far as the regularizer is concerned. The decision between label and is then completely local, i.e. depends only on the data term and can be postponed to a post-processing step by modifying the data term to
Thus (27) is not a real limitation and can be always assured. As a side note, it can be shown that, under some assumptions and in the space of piecewise constant functions, the subadditivity of already follows if is required to be lower semicontinuous [Bra02, p.88].
Proposition 1 implies that for non-metric , we generally cannot expect to find a regularizer satisfying (P1)–(P3). Note that here is not required to be of any particular form. In the following sections, we will show that on the other hand, if is metric as in Proposition 1, such a regularizer can always be constructed. This implies that the interaction potentials allowing for a regularizer that satisfies (P1)–(P3) are exactly the metrics.
4 Constructing Regularizers from the Interaction Potential
We study next how to construct regularizers on satisfying (P1)–(P3). As in (25) we set
We additionally require to be a support function, i.e. for some closed, convex ,
As a support function, coincides with its recession function , thus
Also, we have an equivalent dual formulation in analogy to (13),
For simplicity we will also assume that is rotationally invariant along the image dimensions, i.e. for any rotation matrix ,
This is equivalent to being isotropic.
We will now show under which circumstances a minimizer exists in , and then see how the approach can be used to construct regularizers for specific interaction potentials.
4.1 Existence of Minimizers
where as in (36),
for some closed convex , and
Note that is convex, as it is the pointwise supremum of affine functions (39). Again for simplicity we set . Then we have the following
Let be closed convex, , , and
Additionally assume that for some . Then is lower semicontinuous in with respect to convergence.
As the data term is continuous, it suffices to show that the regularizer is lower semicontinuous. This is an application of [AFP00, Thm. 5.47]. In fact, the theorem shows that is the relaxation of ,
on w.r.t. (thus here ) convergence and thus lower semicontinuous in . To apply the theorem, we have to show that is quasiconvex in the sense of [AFP00, 5.25], which holds as it is convex by construction. The other precondition is (at most) linear growth of , which holds with ∎
Let as in Prop. 2 and additionally assume that
for some . Then the problem
has at least one minimizer.
From the inner and outer bounds it holds that for any with . Moreover, the constraint implies that satisfies . Combining this with the positive homogeneity of , it follows from (26) that
the fact that , and boundedness of and , it follows that the data term is bounded from below on .
We now show coercivity of with respect to the norm: Let with . As the data term is bounded from below, and using the fact that , it follows that . Thus is coercive.
Equations (48) and (49) also show that is bounded from below. Thus we can choose a minimizing sequence . Due to the coercivity, the sequence must then be bounded from above. From this and [AFP00, Thm. 3.23] we conclude that there is a subsequence of weakly*- (and thus -) converging to some . With the lower semicontinuity from Prop. 2 and closedness of with respect to convergence, existence of a minimizer follows. ∎
4.2 Relation to the Interaction Potential
To relate such to the labeling problem in view of (P3), we have the following
Let as in Prop. 2. For some and vectors , let . Then for any vector with ,
In particular, if for some with , then fulfills (P3).
To show the first equality, (26) implies
We make use of the property , which is a direct consequence of the definition of the total variation measure and the fact that for any vector (note that are also vectors). Therefore
which by positive homogeneity of implies
Since the density function assumes values in , there exists, for a.e. , a rotation matrix mapping to . Together with the rotational invariance of from (40) this implies
which proves the first equality in (50). The second equality can be seen as follows:
Denote by a rotation matrix mapping to , i.e. , then
The rotational invariance of provides , therefore
As a consequence, if the relaxed multiclass formulation is restricted to two classes by parametrizing for , it essentially reduces to the scalar continuous cut problem: Solving
is equivalent to solving
which is just the classical binary continuous cut approach with data and regularizer weight , where is some arbitrary unit vector. For the multiclass case, assume that
for some partition with . Then the absolutely continuous and Cantor parts vanish [AFP00, Thm. 3.59, Thm. 3.84, Rem. 4.22], and only the jump part remains: