where is a proper, closed and convex function; is a nonempty, closed and convex set; and and are known. In the sequel, we develop efficient numerical methods to approximate an optimal solution to (1) and rigorously characterize how common structural assumptions on (1) affect the efficiency of the methods.
1.1 Scalable numerical methods for (1) and their limitations
In principle, we can obtain high accuracy solutions to (1) through an equivalent unconstrained problem [13, 54]. For instance, when is absent and is smooth, we can eliminate the linear constraint by using a projection onto the null-space of and then applying well-understood smooth minimization techniques. Whenever available, we can also exploit barrier representations of the constraints and avoid non-smooth via reformulations, such as lifting, as in the interior point method using disciplined convex programming [13, 33, 46, 48]. While the resulting smooth and unconstrained problems are simpler than (1
) in theory, the numerical efficiency of the overall strategy severely suffers from the curse-of-dimensionality as well as the loss of the numerical structures in the original formulation.
Alternatively, we can obtain low- or medium-accuracy solutions when we augment the objective with simple penalty functions on the constraints. For instance, we can solve
where is a penalty parameter. Despite the fundamental difficulties in choosing the penalty parameter, this approach enhances our computational capabilities as well as numerical robustness since we can apply modern proximal gradient, alternating direction, and primal-dual methods. Intriguingly, the scalability of virtually all these solution algorithms rely on three key structures that stand out among many others:
Structure 1 (Decomposability):
We say that the constrained problem (1) is -decomposable if the objective function and the feasible set can be represented as follows
where , , is proper, closed and convex for , and
. Decomposability immediately supports parallel and distributed implementations in synchronous hardware architectures. This structure arises naturally in linear programming, network optimization, multi-stages models and distributed systems. With decomposability, the problem (1) is also referred to as a monotropic convex program .
Structure 2 (Proximal tractability):
Unconstrained problems can still pose significant difficulties in numerical optimization when they include non-smooth terms. However, many non-smooth problems (e.g., of the form (2)) can be solved nearly as efficiently as smooth problems, provided that the computation of the proximal operator is tractable111It can be solved in a closed form, low computational cost or polynomial time. [4, 58, 62]:
where is a constant. While the proximal operators simply use in the canonical setting, we employ (4) to do away with the -feasibility of the algorithmic iterates. Many smooth and non-smooth functions support efficient proximal operators [18, 21, 43, 70]. Clearly, decomposability proves useful in the computation of (4).
Structure 3 (Special function classes):
Often times, the function in (1) or the individual terms in (3) possess additional properties that can enhance numerical efficiency. Table 1 highlights common properties that are typically (but not necessarily) associated with function smoothness. These structures provide iterative algorithms with analytic upper and lower bounds on the objective (or its gradient), and aid the theoretical design of their iterations as well as their practical step-size and momentum parameter selection [4, 13, 48, 54, 67].
On the basis of these structures, we can design algorithms featuring a full spectrum of (nearly) dimension-independent, global convergence rates for composite convex minimization problems with well-understood analytical complexities [4, 48, 53, 52, 67]. Unfortunately, the scalable, penalty-based approaches above invariably feature one or both of the following two drawbacks which blocks their full impact.
Limitation 1 (Non-ideal convergence characterizations):
Ideally, the convergence characterization of an algorithm for solving (1) must establish rates both on absolute value of the primal objective residual as well as the primal feasibility of its linear constraints , simultaneously on its iterates . The constraint feasibility is critical so that the primal convergence rate has any significance. Rates on weighted primal objective residual and feasibility gap is not necessarily meaningful since (1) is a constrained problem and can easily be negative at all times as compared to the unconstrained setting where we trivially have .
Table 2 demonstrates that the convergence results for some existing methods are far from ideal. Most algorithms have guarantees in the ergodic sense (i.e., on the averaged history of iterates without any weight) [15, 37, 38, 57, 64, 71] with non-optimal rates, which diminishes the practical performance; they rely on special function properties to improve convergence rates on the function and feasibility [56, 57], which reduces the scope of their applicability; they provide rates on dual functions , or a weighted primal residual and feasibility score , which does not necessarily imply convergence on the absolute value of the primal residual or the feasibility; or they obtain convergence rate on the gap function value sequence composed both the primal and dual variables via variational inequality and gap function characterizations [15, 37, 38], where the rate is scaled by a diameter parameter which is not necessary bounded.222We refer to the standard ADMM (see, e.g., ) and not the parallel ADMM variant or multi-block ADMM, which can have convergence guarantees given additional assumptions.
|ADMM||-decomposable||on the joint using a gap function||[15, 37, 38]|
|[Fast] ADMM||-decomposable and or||on the dual-objective|||
|Fast Linearized ADMM||-decomposable and or||and|||
|Primal-Dual Hybrid Gradient (PDHG)||Saddle point problem||based on gap function values composed both primal-dual variables|||
|[Inexact] augmented Lagrangian method||-decomposable||and (non-ergodic)||This work|
|Decomposition methods [Inexact] 1P2D and 2P1D||-decomposable||and (non-ergodic)||This work|
|-decomposable and||, , and (non-ergodic)|
|New ADMM and its preconditioned variants||-decomposable||and (non-ergodic)||This work|
Limitation 2 (Computational inflexibility):
Recent theoretical developments customize algorithms to exploit special function classes for scalability. We have indeed moved away from the black-box model of optimization, which forms the foundation of the interior point method’s flexibility, where, for instance, we restrict ourselves to compute solely the values and the (sub)gradients of the objective and the constraints at a point.
Unfortunately, specialized algorithms requires knowledge of function class parameters, do not address the full scope of (1) (e.g., with self-concordant functions or fully non-smooth decompositions), and often have complicated algorithmic implementations with backtracking steps, which create computational bottlenecks. Moreover, these issues are further compounded by their penalty parameter selection, such as in (2) (cf.,  for an extended discussion), which can significantly decrease numerical efficiency, as well as the inability to handle -decomposability in an optimal fashion, which rules out parallel architectures for their computation.
1.2 Our contributions
To this end, we address the following two questions in this paper: “Is it possible to efficiently solve (1) using only the proximal tractability assumption with global convergence guarantees?” and “Can we actually characterize the convergence rate of the primal objective residual and primal feasibility gap separately?” The answer is indeed positive provided that there exists a solution in a bounded primal feasible set .
Surprisingly, we can still exploit favorable function classes, such as and when available, optimally exploit -decomposability and its special -decomposable sub-case, and have a penalty parameter-free black-box optimization method. The second question is also important since in primal-dual framework, trade-off between the primal objective residual and the primal feasibility gap is crucial, which makes algorithm numerically stable, see, e.g.,  for numerical examples.
Primal-dual methods rely on strong duality in convex optimization  and are also related to many other methods for solving saddle points, monotone inclusions and variational inequalities . In our approach, we reformulate the optimality condition of (1) as a mixed-variational inequality and use the gap function as our main tool to develop the algorithms.
Smoothing techniques are widely used in optimization to replace non-smooth functions with differentiable approximations. In this work, we describe two smoothing strategies for the dual function of (1) in the Lagrange formulation based on Bregman distances and the augmented Lagrangian technique. We show that the augmented Lagrangian smoother preserves convergence properties for the algorithm to solve (1) and feature a convergence rate independent of the spectral norm of . In addition, the Bregman smoother allows us to handle -decomposability by only relying on the proximal tractability assumption.
Excessive gap function:
Excessive gap technique was introduced by Nesterov in  and has been used to develop primal-dual solution methods for solving nonsmooth unconstrained problems. In this paper, we exploit the same excessive gap idea but in a structured form for a variational inequality characterizing the optimality condition of (1). We then combine these three existing techniques in order to develop a unified primal-dual framework for solving (1) and analyze the convergence of its algorithmic instances under mild assumptions.
Our specific theoretical and practical contributions are as follows:
We present a unified primal-dual framework for solving constrained convex optimization problems of the form (1). This framework covers augmented Lagrangian method [39, 45], (preconditioned) ADMM , proximal-based decomposition  and decomposition method  as special cases, which we make explicit in Section 6.
We prove the convergence and establish rates for three variants (cf., Theorem 4.1) of our algorithmic framework without any need to select a penalty parameter. An important result is the convergence rate in a non-ergodic sense of both primal objective residual and the primal feasibility gap , where or . Our rates are considered optimal given our particular assumptions (cf., Table 2).
We consider an inexact variant of our algorithmic framework for the special case of -decomposability, which allows one to solve the subproblems up to given predetermined accuracy so that it still maintains the same worst-case analytical complexity as in the exact case provided that the accuracy of solving the subproblems is controlled appropriately. This variant allows us to handle -decomposability with only proximal tractability assumption.
We show how special function classes can be exploited and describe their convergence implications.
Our characterization is radically different from existing results such as in [5, 15, 23, 37, 38, 57, 64]. We clarify the importance of this result in Section 4 as well as Section 6 in the context of existing convergence results for ADMM and its variants. For the -decomposability, the variants corresponding to our Bregman smoothing technique can be implemented in a fully parallel and distributed manner, where the feasibility guarantee acts as a consensus rate. In special case, where , we propose a strategy to enhance the practical convergence rate by trading off the objective residual with the feasibility gap.
On the computational front, we test our algorithms on several well-studied numerical problems using both synthetic and real-world data, compare them to other existing state-of-the-art methods, and provide open-source code for each application. We also discuss the update of the smoothness parameters in order to enhance the performance of the algorithms by trading-off between the optimality gap and the feasibility gap. Numerical results show the advantages of our methods on several numerical tests.
1.3 Related work
Due to the generality of (1), there has been an explosion of interest in the convex optimization in developing solution algorithms for it. Unfortunately, it is impossible to provide a comprehensive summary of the ever-expanding literature in any reasonable space. Hence, this subsection attempts to relate some important algorithmic frameworks for solving (1) to our work with selected, representative citations in each.
One of the oldest primal-dual methods for solving (1) is the method-of-multipliers (MoM), which is based on Lagrange dualization . Without further assumptions on and , the dual step of this method can be viewed as a subgradient iteration, which features a provably slow convergence rate, i.e., , where is the iteration count. MoM is also known to be sensitive to the step-size selection rules for damping the search direction.
In order to overcome the difficulty of nonsmoothness in the dual function, several attempts have been made. For instance, we can add either a proximal term or an augmented term to the Lagrange function of (1) to smooth the dual function [20, 34, 35, 44, 45, 61]. Intriguingly, while the specific methods studied in [20, 34, 35, 61] are quite borad, no global convergence rate has been established so far.
The works in [44, 45] provide convergence rates by applying Nesterov’s accelerated scheme to the dual problem of (1). In recent paper , the authors shows that the method proposed in  has convergence rate . However, this convergence rate is a joint between the objective residual and the primal feasibility gap, i.e., for given. We note that this convergence rate on the weighted measure does not imply the convergence rate of and separately in constrained optimization.
In  the author studies several variants of the primal-dual algorithm and presented several applications in image processing. Convergence analysis of these variants are also presented in , however the global convergence rate has not been provided. In , the authors describe a primal-dual hybrid gradient (PDHG) algorithm, which can be considered as a variant of the same primal-dual algorithm. In 
, the authors also studied several heuristic strategies to update the parameters, and show that the convergence rate of this algorithm isin an ergodic sense with respect to a VIP gap function values.
Methods from monotone inclusions and variational inequalities:
The optimality condition of (1) can be viewed as a monotone inclusion or a mixed variational inequality (VIP) corresponding to both the primal and dual variables . As a result, we can leverage algorithms from these two respective fields to solve (1) [15, 28, 37, 38]. For instance, the work in  exploit the idea from variational inequality proposed in [47, 51]. Splitting methods including Douglas-Rachford and predictor-corrector methods considered [21, 22, 26, 36, 55] also belong to this direction. However, since monotone inclusions or variational inequalities are much more general than (1), using methods tailored for optimization purposes may be more efficient in practice for solving the specific optimization problem (1).
Augmented Lagrangian and alternating direction methods:
Augmented Lagrangian (AL) methods have come to offer an important computational perspective on a broad class of constrained convex problems of the form (1). In this setting, we first define the Lagrangian function associated with the linear constraint of (1) as . Then, we introduce the augmented Lagrangian function: for a given penalty parameter . Classical augmented Lagrangian method  solving (1) produces a sequence starting from as
Under a suitable choice of , it is well-known that method (5) converges to a global optimal of (1) at rate under mild assumptions, i.e., . In fact, this method can be accelerated by applying Nesterov’s accelerating scheme  to obtain convergence rate.
Within the class of augmented Lagrangian methods, perhaps the most famous variant is the alternating direction method of multipliers (ADMM), which appears in many guises in the literature. This method has been recognized as a special case of Douglas-Rachford splitting algorithm applying to its optimality condition [12, 26, 32]. In ADMM, given that and are separable with . This case also covers the composite minimization problem of the form , where both and are convex. By using a slack variable, we can reformulate the composite problem into (1) as subject to . In the ADMM context, the first problem in (5) can be solved iteratively as
The main computational difficulty of ADMM is the -update problem (i.e., the first subproblem) in (6). Indeed, we have to numerically solve this step in general except when
is efficiently diagonalizable. Interestingly, the diagonalization step in many cases can be done via Fourier Transform. Many notable applications support this feature, such as matrix completion wheremodels sub-sampled matrix entries, image deblurring where is a convolution operator, and total variation regularization where is a differential operator with periodic boundary conditions. We can also circumvent this computational difficulty by using a preconditioned ADMM variant .
ADMM is one of the most popular method in practice. However, its efficiency depends significantly on the choice of the penalty parameter . Unfortunately, theoretical guarantee for choosing this parameter is still an open problem and is not yet well-understood. When is strongly convex, we can drop the quadratic term in the first line of (6) in order to obtain an alternating minimization algorithm (AMA) . This method turns out to be a forward-backward splitting algorithm for its optimality inclusion .
A note on :
We note that the approach presented in this paper builds upon the excessive gap idea in . Technically, we use the same idea but in a much structured fashion, whereby we enforce a particular linear form in preserving the excessive gap as shown in Definition 3.2. This particular structure is key in obtaining our convergence rates.
Moreover, since our problem setting (1) is different from the general minmax formulation considered in , there are still several differences between our algorithmic framework and the methods studied in  as a result of the excessive gap technique. First, we use augmented Lagrangian functions and Bregman distances for smoothing the dual problem of (1). Second, we consider the Lagrangian primal-dual formulation for (1
) where we do not have the boundedness of the feasible set of the dual variable. In this case the key estimate[50, estimate (3.3)]
does not apply to our setting. Third, we update all algorithmic parameters simultaneously and do not need an odd-even switching strategy[49, Method 1: b) and c)]. Four, we do not assume that the objective function of (1) has Lipschitz gradient which is required in . Note that there are several important applications, where this assumption simply does not hold . Fifth, our method is applied to the constrained problem (1), which requires the feasibility gap characterization as opposed to unconstrained problems where we only need to worry about the primal-dual optimality.
1.4 Paper organization
The rest of this paper is organized as follows. In the next section, we recall basic concepts, and introduce a mixed-variational inequality formulation of (1). In Section 3, we propose two key smoothing techniques for (1), called the Bregman and augmented Lagrangian smoothing techniques. We also provide a formal definition for the excessive gap function from  and further investigate its properties. Section 4 presents the main primal-dual algorithmic framework for solving (1) and its convergence theory. Section 5 specifies different instances of our algorithmic framework for (1) under given assumptions. Section 6 makes further connections to existing methods in the literature. Section 7 is devoted to implementation issues and Section 8 presents numerical simulations. The appendix provides detail proofs of the theoretical results in the main text.
First we recall the well-known definition of the Bregman distance, the primal-dual formulation for (1), and a variational inequality characterization for the optimality condition of (1), which will be used in the sequel.
2.1 Basic notation
Given a proper, closed and convex function , we denote the domain of , the subdifferential of at . If is differentiable, denotes the gradient of at
. For given vector, we define the Euclidean norm of . We use a superscripted notation to denote the corresponding Lipschitz constant of a differentiable function . Similarly, we use a subscripted notation to denote the corresponding strong convexity constant of a convex function .
2.2 Proximity functions and Bregman distances
Given a nonempty, closed convex set , a nonnegative, continuous and -strongly convex function is called a proximity function (or prox-function) of if . For example, the simplest prox-function is for any and . Whenever unspecified, we use this specific prox-function with .
Given a smooth prox-function of with the parameter . We define
the Bregman distance between and with respect to . Given a matrix , we also define the projected prox-diameter of a given set with respect to as
Here, we project the set onto the range space of matrix . If is bounded, then . For , we have , which is indeed the Euclidean distance.
2.3 Primal-dual formulation
We write the min-max formulation of (1) based on the Lagrange dualization as follows:
where is the Lagrange function and is the dual variable. We write the dual function as
which leads to the following definition of the so-called dual problem
Let be a solution of (10) at a given . Corresponding to , we also define the domain of as
If is continuous on and if is compact, then exists for any . Unfortunately, the dual function is typically nonsmooth, and hence the numerical solutions of (11) are usually difficult . In general, we have , which is known as weak-duality in convex optimization. In order to guarantee strong duality, i.e., for (1) and (11), we require the following assumption:
Assumption A. 1
The constraint set and the solution set of (1) are nonempty. The function is proper, closed and convex. In addition, either is a polytope or the following Slater condition holds:
where is the relative interior of .
Under Assumption 1, the solution set of the dual problem (11) is also nonempty and bounded. Moreover, the strong duality holds, i.e., . Any point is a primal-dual solution to (1) and (11), and is also a saddle point of the Lagrange function , i.e., for all and . These inequalities lead to the following estimate
Our goal in this paper is to solve the primal constrained problem (1), while numerical algorithms only give an approximate solution up to a certain accuracy. Hence, we need to specify the concept of an approximate solution for (1).
Given a target accuracy , a point is said to be an -solution of (1) if and .
Here, we assume in Definition 2.1 that , i.e., is exactly feasible to . This requirement is reasonable in practice since is usually a “simple” set where the projection onto can be computed exactly. Moreover, we can use different accuracy levels for the absolute value of the primal objective residual and the primal feasibility gap in Definition 2.1.
2.4 Mixed-variational inequality formulation and gap function
Let be the primal-dual variable and be a partial Karush-Kuhn-Tucker mapping. Then, the optimality condition of (1) becomes
which is known as a mixed-variational inequality . If we define
Let . Then, by the definition of , we can see that
It is clear that if and only if , which is indeed the strong duality property.
3 Primal-dual smoothing techniques
This section shows how to use augmented Lagrangian functions and Bregman distances as a principled smoothing technique [48, 3] within our primal-dual framework. We can then obtain different algorithmic variants by simply choosing an appropriate prox-center at each iteration.
3.1 Dual function is a smoothable function
The dual function defined by (10) is convex but in general nonsmooth. We approximate this function by a smoothed function defined as:
where is a given Bregman distance with the strong convexity parameter , is the prox-center of , is a given consistent projection matrix and is a [primal] smoothness parameter. The following definition characterizes approximation properties of the smoothed function .
Definition 3.1 ()
The dual function defined by (10) is called a -smoothable function if there exist positive numbers , and and a concave and smooth function so that:
In addition, is Lipschitz continuous with a Lipschitz constant .
We call the -smoothed function of or simply the smoothed function of when these parameters are specified. We note that defined by (17) is not necessarily Lipschitz gradient for an arbitrary choice of and . We consider two cases as follows.
3.1.1 Smoothing via augmented Lagrangian
Let us choose , and so that . Then, we have trivially . As a result, the function defined by (17) becomes the augmented dual function, that is
Here, is exactly the augmented Lagrangian of (1) associated with the linear constraint . We denote by the solution of (19) and . It is well-known that is concave as well as smooth, and its gradient is Lipschitz continuous with a Lipschitz constant . We refer to as an augmented Lagrangian smoother (in short, AL smoother) of . The following lemma shows that is a smoothed function of , whose proof can be found, e.g., in .
3.1.2 Smoothing via Bregman distances
If we choose
to be the identity matrix of, then the smoothed function defined by (17) becomes
Let us denote by the solution of (21), which always exists. We refer to as a Bregman distance smoother (shortly, BD smoother) of . The following lemma summarizes the properties of (see, e.g., [50, 68]):
The function defined by (21) satisfies:
where is the prox-diameter of with respect to and is the solution of (10).
Moreover, is concave and smooth. Its gradient is given by for all , and satisfies
for . Consequently, is a -smoothed function of , where and is the strong convexity parameter of .
We note that if is bounded and is continuous (or ), then always exists for any . In this case, the prox-diameter of is finite. Consequently, (22) holds for all .
3.2 Smoothed gap function
As we observe from the previous section, the optimality condition of (1) can be represented as a variational inequality of the form (15). By using Auslender’s gap function defined by (16), we can show that is a primal-dual solution to (1) and (11). Since the gap function is generally nonsmooth, we smooth it by adding the following smoothing function:
where is a given Bregman distance, is a projection matrix and and are two positive smoothness parameters.
For simplicity of our analysis, we use a simple quadratic prox-function in (24) for the dual variable . However, we can replace this term by , where is a given Bregman distance and is a given point in . However, depending on the choice of , the dual variable may no longer have a closed form expression. However, the overall practical performance may be improved.
The smoothed gap function for is then defined as follows:
It is clear that the maximization problem (25) is a convex optimization problem. We denote by the solution of this problem. Then, by using the optimality condition of (25) we can easily check that is the optimal solution to (17) at , while can be computed explicitly as
Our goal is to generate two sequences and so that becomes firmly contractive. We formally encode this idea using the following definition.
Definition 3.2 (Model-based Excessive Gap)
Given and , a new point and so that is said to be firmly contractive w.r.t. defined by (25) if:
where , and are two given parameters.
Here, the parameter and the decay term will be specified accordantly with different algorithmic schemes.
In the context of excessive gap technique introduced by Nesterov, the smoothed gap function measures the excessive gap in [49, cf., (2.5) and (2.9)]). Hence, we will call Nesterov’s smoothed gap function customized for the constrained convex problem (1). We note that the excessive gap condition in [49, (3.2)] only requires . In our case, we structure this condition using the basic model in (27) so that we can manipulate and the new parameter simultaneously to analyze the convergence of our algorithms.
In the sequel, we often assume that the second parameter is nonnegative, which allows us to estimate the convergence rate of . However, the following remark shows that the sequence can still converge to even if is positive. However, we find the ensuing convergence analysis to be difficult.
Let and be sequences in Definition 3.2. If
then the sequence converges to .
Consequently, the rate of convergence of depends on the rate of and .
The next lemma shows the relation between problem (1) and its smoothed function and . The proof of this lemma can be found in the appendix.
where , which is the norm of a minimum norm dual solution. The estimate (33) hints that we can derive algorithms based on whose convergence rate depends directly on how we update the sequence .
4 The main algorithmic framework
The key objective in this section is to design a primal-dual update template from and to and so that the conditions in Definition 3.2 hold. We develop two distinct schemes to update and in the following two subsections.
4.1 An iteration scheme with two primal steps
Since the objective function is not necessary smooth, we consider the following mapping under Assumption 1: