In this paper, we deal with the structured nonsmooth nonconvex minimization problem
where we systematically assume the following hypotheses (see Section 2 for details): [requirements for composite minimization (1)]
is proper and lower semicontinuous (lsc);
is , which is -smooth relative to ;
is multi-block strictly convex, -coercive and essentially smooth;
the first-order oracle of , , and is available, , has a nonempty set of minimizers, i.e., , and .
Although problem (1) has a simple structure, it covers a broad range of optimization problems arising in signal and image processing, statistical and machine learning, and control and system identification. Consequently, a large body of algorithmic work addresses optimization problems of the form (1). Among these methodologies, we are interested in the class of alternating minimization algorithms such as block coordinate descent [beck2015cyclic, beck2013convergence, latafat2019block, nesterov2012efficiency, razaviyayn2013unified, richtarik2014iteration, tseng2001convergence, tseng2009coordinate], block coordinate [combettes2015stochastic, fercoq2019coordinate, latafat2019new], and Gauss-Seidel methods [auslender1976optimisation, bertsekas1989parallel, grippo2000convergence], which fix all blocks except one, solve the corresponding auxiliary problem with respect to this block, update it, and continue with the others. In particular, proximal alternating minimization has received much attention in the last few years; see, for example, [attouch2008alternating, attouch2007new, attouch2006inertia, attouch2010proximal, attouch2013convergence, beck2016alternating]. Recently, proximal alternating linearized minimization and its variants have been developed to handle (1); see, for example, [bolte2014proximal, pock2016inertial, shefi2016rate].
Traditionally, the Lipschitz (Hölder) continuity of partial gradients of in (1) has been a key tool in the convergence analysis of optimization algorithms; see, e.g., [bolte2014proximal, pock2016inertial]. It is, however, well known that the key role in such analysis is played not by the Lipschitz (Hölder) continuity of gradients itself, but by one of its consequences: an upper estimate of the function that includes a Bregman distance, known as the descent lemma; cf. [bauschke2016descent, lu2018relatively]. This idea is central to the convergence analysis of many optimization schemes requiring such an upper estimate; see, e.g., [ahookhosh2019bregman, bauschke2019linear, bauschke2016descent, bolte2018first, teboulle2018simplified, hanzely2018fastest, hanzely2018accelerated, lu2018relatively, nesterov2018implementable]. If , , is bi-strongly convex, , and is -smooth, alternating proximal point and alternating proximal gradient algorithms were suggested in [li2019provable] with a saddle-point avoidance guarantee. Further, if the -th block of is -smooth relative to a kernel function (), a block-coordinate proximal gradient method was recently proposed in [wang2018block], which comes with only a limited convergence analysis. Besides these two papers, to the best of our knowledge, there are no other alternating minimization methods for solving (1) under a relative smoothness assumption on , which motivates the quest for such algorithmic development.
1.1. Contribution and organization
(Bregman proximal alternating linearized minimization) We introduce BPALM, a multi-block generalization of the proximal alternating linearized minimization (PALM) [bolte2014proximal] using Bregman distances, and its adaptive version (A-BPALM). To do so, we extend the notion of relative smoothness [bauschke2016descent, lu2018relatively] to its multi-block counterpart to support a structured problem of the form (1). Owing to multi-block relative smoothness of , unlike PALM, our algorithm does not need to know the local Lipschitz moduli of partial gradients () and their lower and upper bounds, which are hard to provide in practice. Our framework recovers [wang2018block] by exploiting a sum separable kernel, and the corresponding algorithm in [li2019provable] is a special case of our algorithm if , , .
(Efficient framework for ONMF and SONMF) With a suitable kernel function for the Bregman distance, the objective functions of ONMF and SONMF turn out to be multi-block relatively smooth with respect to this kernel. Further, we show that the auxiliary problems of ONMF and SONMF can be solved in closed form, making BPALM and A-BPALM suitable for large-scale machine learning and data analysis problems.
This paper has three sections besides this introduction. In Section 2, we introduce the notion of multi-block relative smoothness and verify the fundamental properties of the Bregman proximal alternating linearized mapping. In Section 3, we introduce BPALM and A-BPALM and analyze the convergence of the sequences generated by these algorithms. In Section 4, we show that the objective functions of ONMF and SONMF satisfy our assumptions and that the related subproblems can be solved in closed form.
We denote by the extended-real line. For the identity matrix, we set such that . The open ball of radius centered at is denoted as . The set of cluster points of is denoted as . A function is proper if and , in which case its domain is defined as the set . For , is the -(sub)level set of ; and are defined similarly. We say that is level bounded if is bounded for all . A vector is a subgradient of at , and the set of all such vectors is called the subdifferential , i.e.
see [rockafellar2011variational, Definition 8.3].
2. Multi-block Bregman proximal alternating linearized mapping
We first establish the notion of multi-block relative smoothness, which is an extension of relative smoothness [bauschke2016descent, lu2018relatively] to problems of the form (1). We then introduce the Bregman proximal alternating linearized mapping and study some of its basic properties.
In order to extend the definition of Bregman distances for the multi-block problem (1), we first need to introduce the notion of multi-block kernel functions, which will coincide with the standard one (cf. [ahookhosh2019bregman, Definition 2.1]) if .
[multi-block convexity and kernel function]Let be a proper and lsc function with and such that . For a fixed vector , we define the function given by
Then, we say that is
multi-block (strongly/strictly) convex if the function is (strongly/strictly) convex for all and ;
multi-block locally strongly convex around if, for , there exists and such that
a multi-block kernel function if is multi-block convex and is -coercive for all and , i.e., ;
multi-block essentially smooth, if for every sequence converging to a boundary point of for all ;
of multi-block Legendre type if it is multi-block essentially smooth and multi-block strictly convex.
[popular kernel functions] There are many kernel functions satisfying def:kernel. For example, for , one has the energy, the Boltzmann-Shannon entropy, and the Fermi-Dirac entropy (cf. [bauschke2018regularizing, Example 2.3]), as well as several examples in [lu2018relatively, Section 2]; for , see the two examples in [li2019provable, Section 2]. Two important classes of multi-block kernels are sum separable kernels, i.e.,
and product separable kernels, i.e.,
see such a kernel for ONMF in pro:relSmoothNMF0.
We now give the definition of Bregman distances (cf. [bregman1967relaxation]) for multi-block kernels.
[Bregman distance] For a kernel function , the Bregman distance is given by
Fixing all blocks except the -th one, the Bregman distance with respect to this block is given by
which measures the proximity between and with respect to the -th block of variables. Moreover, the kernel is multi-block convex if and only if for all and and . Note that if is multi-block strictly convex, then () if and only if .
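As an illustration of the definition above, the following minimal Python sketch evaluates a Bregman distance for two of the kernels mentioned earlier in the single-block case. It assumes the standard formula D_h(x, y) = h(x) - h(y) - <∇h(y), x - y>; the function names and test points are hypothetical and chosen only for this example.

```python
import numpy as np

def bregman_distance(h, grad_h, x, y):
    """Standard Bregman distance D_h(x, y) = h(x) - h(y) - <grad h(y), x - y>."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return h(x) - h(y) - np.dot(grad_h(y), x - y)

# Energy kernel h(x) = 0.5 * ||x||^2 (assumed normalization).
def energy(x):
    return 0.5 * np.dot(x, x)

def grad_energy(x):
    return x

# Boltzmann-Shannon entropy h(x) = sum(x * log x) on the positive orthant.
def entropy(x):
    return np.sum(x * np.log(x))

def grad_entropy(x):
    return np.log(x) + 1.0

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 1.0, 4.0])

d_energy = bregman_distance(energy, grad_energy, x, y)     # half squared Euclidean distance
d_entropy = bregman_distance(entropy, grad_entropy, x, y)  # generalized Kullback-Leibler divergence
```

With the energy kernel the distance reduces to half the squared Euclidean distance, while the entropy kernel yields the generalized Kullback-Leibler divergence; both are nonnegative because the kernels are convex.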
We are now in a position to present the notion of multi-block relative smoothness, which is the central tool for our analysis in sec:convAnalysis.
[multi-block relative smoothness] Let be a multi-block kernel and let be a proper and lower semicontinuous function. If there exists () such that the functions given by
are convex for all and , then, is called -smooth relative to .
Note that if , multi-block relative smoothness reduces to standard relative smoothness, which was introduced only recently in [bauschke2016descent, lu2018relatively]. In this case, if is -Lipschitz continuous, then both and are convex, i.e., relative smoothness of generalizes the notion of Lipschitz continuity of gradients by means of Bregman distances. If , this definition reduces to the relative bi-smoothness given in [li2019provable] for .
We next characterize the notion of multi-block relative smoothness.
[characterization of multi-block relative smoothness] Let be a multi-block kernel and let be a proper lower semicontinuous function and . Then, the following statements are equivalent:
-smooth relative to ;
for all and ,
for all and ,
if and for all , then
Fixing all blocks except one, the result follows in the same way as [lu2018relatively, Proposition 1.1]. ∎
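The two-sided descent inequality implied by relative smoothness can be checked numerically in the single-block case. The sketch below uses an illustrative pair that is not taken from this paper: f(x) = x^4/4 and h(x) = x^4/4 + x^2/2 with L = 1, for which h - f = x^2/2 and h + f = x^4/2 + x^2/2 are both convex although f' is not globally Lipschitz. In the multi-block setting, the same check would be applied block by block with the remaining blocks held fixed.

```python
import numpy as np

L = 1.0  # assumed relative smoothness constant for this illustrative pair

def f(x):
    return 0.25 * x**4

def df(x):
    return x**3

def h(x):
    return 0.25 * x**4 + 0.5 * x**2

def dh(x):
    return x**3 + x

def bregman(x, y):
    # D_h(x, y) = h(x) - h(y) - h'(y) * (x - y), evaluated elementwise
    return h(x) - h(y) - dh(y) * (x - y)

rng = np.random.default_rng(0)
xs = rng.uniform(-10.0, 10.0, size=1000)
ys = rng.uniform(-10.0, 10.0, size=1000)

# Linearization error of f and the Bregman upper bound L * D_h
lin_err = f(xs) - f(ys) - df(ys) * (xs - ys)
bound = L * bregman(xs, ys)

# |f(x) - f(y) - f'(y)(x - y)| <= L * D_h(x, y) at all sampled pairs
holds = bool(np.all(np.abs(lin_err) <= bound + 1e-9))

# The Euclidean bound (L/2) * (x - y)^2 is violated for the same L
euclid_fails = bool(np.any(lin_err > 0.5 * L * (xs - ys) ** 2))
```

The second flag shows why a non-Euclidean kernel is needed here: no quadratic upper model with this constant majorizes f on the sampled interval.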
2.1. Bregman proximal alternating linearized mapping
Recall that if , for a kernel function and a proper lower semicontinuous function , the Bregman proximal mapping is given by
which is a generalization of the classical one using the Bregman distance (3) in place of the Euclidean distance; see, e.g., [chen1993convergence] and references therein. We note that
which implies . The function is -prox-bounded if there exists such that for some ; cf. [ahookhosh2019bregman]. We next extend this definition to our multi-block setting.
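For intuition on the single-block Bregman proximal mapping, here is a hedged sketch under the assumption that the mapping takes the usual form argmin_z { g(z) + (1/γ) D_h(z, x) }. With the Boltzmann-Shannon kernel h(z) = Σ z_j log z_j and a linear function g(z) = <c, z> (a hypothetical choice made only for this example), the minimizer has the closed form z = x ⊙ exp(-γ c), which the code verifies through the first-order optimality condition.

```python
import numpy as np

def entropy_prox_linear(x, c, gamma):
    """Closed-form minimizer of <c, z> + (1/gamma) * D_h(z, x)
    for the kernel h(z) = sum(z * log z) on the positive orthant."""
    x = np.asarray(x, dtype=float)
    c = np.asarray(c, dtype=float)
    return x * np.exp(-gamma * c)

x = np.array([0.5, 1.0, 2.0])   # current point (must be positive)
c = np.array([1.0, -0.5, 0.0])  # linear term defining the illustrative g
gamma = 0.7                     # step size

z = entropy_prox_linear(x, c, gamma)

# First-order optimality: c + (1/gamma) * (grad h(z) - grad h(x)) = 0,
# where grad h(u) = log(u) + 1, so the constant terms cancel.
residual = c + (np.log(z) - np.log(x)) / gamma
```

Because the objective is strictly convex on the positive orthant, a vanishing residual certifies that z is the unique minimizer; the update is multiplicative and keeps the iterate positive without any projection.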
[multi-block -prox-boundedness] A function is multi-block -prox-bounded if for each there exists and such that
The supremum of the set of all such is the threshold of the -prox-boundedness, i.e.,
For the problem (1), we have leading to
and we therefore denote . If is multi-block -prox-bounded for , so is for all . We next present equivalent conditions to this notion.
[characteristics of multi-block -prox-boundedness] For a multi-block kernel function and proper and lsc functions with , the following statements are equivalent:
is multi-block -prox-bounded;
for all and given in (2), is bounded below on for some ;
for all , .
Suppose and let . Then, for all , it holds that
Notice that is strictly convex and coercive, and as such is lower bounded. Conversely, suppose that . Then, from (9), we obtain
which is finite, owing to -coercivity of .
Suppose that . Since is -coercive, we have
Conversely, suppose . Then, there exists such that whenever . In particular
where the last inequality follows from coercivity of . Since owing to lower semicontinuity, we conclude that is lower bounded on . ∎
Let us now define the function as
and the set-valued Bregman proximal alternating linearized mapping as
which reduces to the Bregman forward-backward splitting mapping if ; cf. [bolte2018first, ahookhosh2019bregman].
[majorization model] Note that invoking fac:relSmoothEqvi2, the multi-block ()-relative smoothness assumption of entails a majorization model
In the next lemma, we show that the cost function decreases monotonically when the model (10) is minimized with respect to each block of variables.
[Bregman proximal alternating inequality] Let the conditions in ass:basic:fgh hold, and let with . Then,
for all .
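The monotone decrease stated in the lemma can be observed on a small example. The sketch below is a hypothetical two-block instance that is not taken from this paper: f(x1, x2) = (x1 x2 - m)^2 / 2 with g_1 = g_2 = 0 and the kernel h(x1, x2) = x1^2 x2^2 / 2 + x1^2 / 2 + x2^2 / 2. In each block, f is quadratic with curvature equal to the square of the other block, whereas h has that curvature plus one, so L_i h ± f is block-wise convex with L_1 = L_2 = 1. The step sizes are assumed to lie in (0, 1/L_i).

```python
import numpy as np

m = 2.0             # target product in the illustrative objective
gamma = (0.9, 0.9)  # assumed step sizes in (0, 1/L_i) with L_1 = L_2 = 1

def f(x):
    return 0.5 * (x[0] * x[1] - m) ** 2

def partial_grad(x, i):
    # Partial derivative of f with respect to block i
    other = x[1 - i]
    return (x[0] * x[1] - m) * other

def block_step(x, i):
    # Minimize over z the linearized model
    #   partial_grad * (z - x_i) + (1/gamma_i) * D_h^i(z, x_i),
    # where the block Bregman distance of this kernel is
    #   0.5 * (other^2 + 1) * (z - x_i)^2, giving a scaled gradient step.
    other = x[1 - i]
    z = x[i] - gamma[i] * partial_grad(x, i) / (other ** 2 + 1.0)
    x_new = list(x)
    x_new[i] = z
    return tuple(x_new)

x = (3.0, -1.5)
values = [f(x)]
for _ in range(50):
    for i in range(2):  # cyclic sweep over the two blocks
        x = block_step(x, i)
        values.append(f(x))

# Check that no block update increases the objective
monotone = all(b <= a + 1e-12 for a, b in zip(values, values[1:]))
```

Each block update scales the partial gradient by the local curvature of the kernel, so no Lipschitz modulus of the partial gradients has to be estimated along the iterations.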
Recall that a function with values is level-bounded in locally uniformly in if for each and there is a neighborhood of along with a bounded set such that for all , cf. [rockafellar2011variational]. Using this definition, the fundamental properties of the mapping are investigated in the subsequent result.
[properties of Bregman proximal alternating linearized mapping] Under conditions given in ass:basic:fgh for , the following statements are true:
is nonempty, compact, and outer semicontinuous (osc) for all ;
If , then ;
For a fixed and a vector , let us define the function given by
Since and are proper and lsc, so is on the set , for a constant . We show that is level-bounded in locally uniformly in . If it is not, then there exists , with , and such that with and . This guarantees that, for sufficiently large , , i.e., and
Setting , pro:proxBoundedness2 ensures that there exists a constant such that
Subtracting the last two inequalities, it holds that
Expanding , dividing both sides by , and taking limit from both sides of this inequality as , it can be deduced that
This leads to the contradiction , which implies that is level-bounded. Therefore, all assumptions of the parametric minimization theorem [kan2012moreau, Theorem 2.2 and Corollary 2.2] are satisfied, which yields pro:proxPro1. If , then lem:proxAltIneq implies that for , i.e., ; the second inclusion follows from ass:basic:argmin. ∎
[sum or product separable kernel] Let us observe the following.
If is an additively separable function, i.e., , then (11) can be written in the form