 # Multi-block Bregman proximal alternating linearized minimization and its application to sparse orthogonal nonnegative matrix factorization

We introduce and analyze BPALM and A-BPALM, two multi-block proximal alternating linearized minimization algorithms using Bregman distances for solving structured nonconvex problems. The objective function is the sum of a multi-block relatively smooth function (i.e., relatively smooth when all blocks but one are fixed [bauschke2016descent, lu2018relatively]) and block separable (nonsmooth) nonconvex functions. The sequences generated by our algorithms are subsequentially convergent to critical points of the objective function, and they are globally convergent under the Kurdyka-Łojasiewicz (KL) inequality assumption. The rate of convergence is further analyzed for functions satisfying the Łojasiewicz gradient inequality. We apply this framework to orthogonal nonnegative matrix factorization (ONMF) and sparse ONMF (SONMF), both of which satisfy all of our assumptions and whose subproblems are solved in closed form.

## Authors


## 1. Introduction

In this paper, we deal with the structured nonsmooth nonconvex minimization problem

$$\operatorname*{minimize}_{x=(x_1,\ldots,x_N)\in\mathbb{R}^{\sum_i n_i}}\ \ \varphi(x)\equiv f(x)+\sum_{i=1}^{N} g_i(x_i), \tag{1}$$

where we systematically assume the following hypotheses (see Section 2 for details): [requirements for composite minimization (1)]

is proper and lower semicontinuous (lsc);

is , which is -smooth relative to ;

is multi-block strictly convex, -coercive and essentially smooth;

the first-order oracle of , , and is available, , has a nonempty set of minimizers, i.e., , and .

Although problem (1) has a simple structure, it covers a broad range of optimization problems arising in signal and image processing, statistical and machine learning, control and system identification. Consequently, a large number of algorithmic studies address optimization problems of the form (1). Among these methodologies, we are interested in the class of alternating minimization algorithms such as block coordinate descent [beck2015cyclic, beck2013convergence, latafat2019block, nesterov2012efficiency, razaviyayn2013unified, richtarik2014iteration, tseng2001convergence, tseng2009coordinate], block coordinate [combettes2015stochastic, fercoq2019coordinate, latafat2019new], and Gauss-Seidel methods [auslender1976optimisation, bertsekas1989parallel, grippo2000convergence], which fix all blocks except one, solve the corresponding auxiliary problem with respect to this block, update it, and continue with the others. In particular, proximal alternating minimization has received much attention in the last few years; see for example [attouch2008alternating, attouch2007new, attouch2006inertia, attouch2010proximal, attouch2013convergence, beck2016alternating]. Recently, proximal alternating linearized minimization and its variants have been developed to handle (1); see for example [bolte2014proximal, pock2016inertial, shefi2016rate].
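The alternating pattern described above (fix all blocks but one, solve for that block, move on) can be sketched on a toy problem. The example below is only an illustration and is not one of the algorithms of this paper: it applies cyclic exact block minimization (Gauss-Seidel) to a strictly convex quadratic $\tfrac12 x^\top A x - b^\top x$ with scalar blocks. The function name `gauss_seidel_blocks` and the data `A`, `b` are hypothetical choices.

```python
def gauss_seidel_blocks(A, b, x0, sweeps=50):
    """Cyclic exact block minimization of 0.5*x^T A x - b^T x (scalar blocks)."""
    x = list(x0)
    n = len(x)
    for _ in range(sweeps):
        for i in range(n):
            # Fix every block except i and minimize exactly over x[i]:
            # setting the i-th partial derivative to zero gives this update.
            off_diag = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - off_diag) / A[i][i]
    return x


# Illustrative data (assumption): a symmetric positive definite 2x2 matrix.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = gauss_seidel_blocks(A, b, x0=[0.0, 0.0])
```

For this convex quadratic the sweeps approach the solution of $Ax=b$. The nonconvex, nonsmooth setting of (1) needs the proximal and linearized variants discussed in this section.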

Traditionally, the Lipschitz (Hölder) continuity of partial gradients of $f$ in (1) is a necessary tool for the convergence analysis of optimization algorithms; see, e.g., [bolte2014proximal, pock2016inertial]. It is, however, well known that it is not the Lipschitz (Hölder) continuity of gradients that plays the key role in such analysis, but one of its consequences: an upper estimation of $f$ including a Bregman distance, called the descent lemma; cf. [bauschke2016descent, lu2018relatively]. This idea is central to the convergence analysis of many optimization schemes requiring such an upper estimation; see, e.g., [ahookhosh2019bregman, bauschke2019linear, bauschke2016descent, bolte2018first, teboulle2018simplified, hanzely2018fastest, hanzely2018accelerated, lu2018relatively, nesterov2018implementable]. If , , is bi-strongly convex, , and is -smooth, alternating proximal point and alternating proximal gradient algorithms were suggested in [li2019provable] with a saddle-point avoidance guarantee. Further, if the $i$-th block of $f$ is $L_i$-smooth relative to a kernel function (), a block-coordinate proximal gradient method was recently proposed in [wang2018block], which involves only a limited convergence analysis. Besides these two relevant papers, to the best of our knowledge, there are no other alternating minimization methods for solving (1) under a relative smoothness assumption on $f$, which motivates the quest for such algorithmic development.

### 1.1. Contribution and organization

In this paper, we propose a Bregman proximal alternating linearized minimization (BPALM) algorithm and its adaptive version (A-BPALM) for (1). Our contribution is summarized as follows:

1. (Bregman proximal alternating linearized minimization) We introduce BPALM, a multi-block generalization of the proximal alternating linearized minimization (PALM) [bolte2014proximal] using Bregman distances, and its adaptive version (A-BPALM). To do so, we extend the notion of relative smoothness [bauschke2016descent, lu2018relatively] to its multi-block counterpart to support a structured problem of the form (1). Owing to the multi-block relative smoothness of $f$, unlike PALM, our algorithm does not need to know the local Lipschitz moduli of partial gradients () and their lower and upper bounds, which are hard to provide in practice. Our framework recovers [wang2018block] by exploiting a sum separable kernel, and the corresponding algorithm in [li2019provable] is a special case of our algorithm if , , .

2. (Efficient framework for ONMF and SONMF) Exploiting a suitable kernel function for the Bregman distance, it turns out that the objective functions of ONMF and SONMF are multi-block relatively smooth with respect to this kernel. Further, it is shown that the auxiliary problems of ONMF and SONMF are solved in closed form, making BPALM and A-BPALM suitable for large-scale machine learning and data analysis problems.

Besides this introduction, the paper has three sections. In Section 2, we introduce the notion of multi-block relative smoothness and verify the fundamental properties of the Bregman proximal alternating linearized mapping. In Section 3, we introduce BPALM and A-BPALM and analyze the convergence of the sequences generated by these algorithms. In Section 4, we show that the objective functions of ONMF and SONMF satisfy our assumptions and that the related subproblems can be solved in closed form.

### 1.2. Notation

We denote by the extended-real line. For the identity matrix , we set such that . The open ball of radius centered in is denoted as . The set of cluster points of is denoted as . A function is proper if and , in which case its domain is defined as the set . For , is the -(sub)level set of ; and are defined similarly. We say that is level bounded if is bounded for all . A vector $v\in\partial f(x)$ is a subgradient of $f$ at $x$, and the set of all such vectors is called the subdifferential $\partial f(x)$, i.e.

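$$\partial f(x)=\left\{v\in\mathbb{R}^p \;\middle|\; \exists\,(x^k,v^k)_{k\in\mathbb{N}} \text{ s.t. } x^k\to x,\ f(x^k)\to f(x),\ \hat\partial f(x^k)\ni v^k\to v\right\},$$

and $\hat\partial f(x)$ is the set of regular subgradients of $f$ at $x$, namely

$$\hat\partial f(x)=\left\{v\in\mathbb{R}^p \;\middle|\; f(z)\ge f(x)+\langle v,z-x\rangle+o(\|z-x\|),\ \forall z\in\mathbb{R}^p\right\};$$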

see [rockafellar2011variational, Definition 8.3].

## 2. Multi-block Bregman proximal alternating linearized mapping

We first establish the notion of multi-block relative smoothness, which is an extension of relative smoothness [bauschke2016descent, lu2018relatively] for problems of the form (1). We then introduce the Bregman proximal alternating linearized mapping and study some of its basic properties.

In order to extend the definition of Bregman distances to the multi-block problem (1), we first need to introduce the notion of multi-block kernel functions, which coincides with the standard one (cf. [ahookhosh2019bregman, Definition 2.1]) if $N=1$.

[multi-block convexity and kernel function]Let be a proper and lsc function with and such that . For a fixed vector , we define the function given by

$$h_x^i(z):=h\big(x+U_i(z-x_i)\big). \tag{2}$$

Then, we say that is

1. multi-block (strongly/strictly) convex if the function is (strongly/strictly) convex for all and ;

2. multi-block locally strongly convex around if, for , there exists and such that

   $$h_x^i(x_i)\ge h_x^i(y_i)+\langle\nabla_i h(y),x_i-y_i\rangle+\frac{\sigma_h^i}{2}\|x_i-y_i\|^2\qquad\forall x,y\in\mathrm{B}(x^\star;\delta);$$

3. a multi-block kernel function if is multi-block convex and is -coercive for all and , i.e., ;

4. multi-block essentially smooth, if for every sequence converging to a boundary point of for all ;

5. of multi-block Legendre type if it is multi-block essentially smooth and multi-block strictly convex.
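The block restriction (2) and the multi-block convexity in the first item can be sketched numerically. This is a minimal illustration under stated assumptions: blocks are scalars, so applying $U_i$ amounts to overwriting one coordinate, and the kernel `h` below is a hypothetical example chosen for illustration, not necessarily the kernel used later in the paper. The midpoint test is only a sanity check on a few points, not a proof of convexity.

```python
def block_restriction(h, x, i):
    """Return z -> h(x + U_i (z - x_i)) for scalar blocks: block i of x is replaced by z."""
    def h_i(z):
        y = list(x)
        y[i] = z
        return h(y)
    return h_i


def h(x):
    # Hypothetical kernel (assumption): 0.5*x1^2*x2^2 + 0.5*(x1^2 + x2^2).
    return 0.5 * x[0] ** 2 * x[1] ** 2 + 0.5 * (x[0] ** 2 + x[1] ** 2)


def midpoint_convex(fun, a, b):
    """Check the midpoint convexity inequality for a scalar function at (a, b)."""
    return fun(0.5 * (a + b)) <= 0.5 * (fun(a) + fun(b)) + 1e-12


x = [1.5, -2.0]
h_1 = block_restriction(h, x, 0)  # function of the first block only
h_2 = block_restriction(h, x, 1)  # function of the second block only
points = [-3.0, -1.0, 0.0, 0.5, 2.0]
blockwise_ok = all(
    midpoint_convex(h_i, a, b) for h_i in (h_1, h_2) for a in points for b in points
)

# Jointly in (x1, x2) the same midpoint inequality fails for this pair of points,
# so block-wise convexity does not imply joint convexity.
joint_fails = h([2.0, 2.0]) > 0.5 * (h([2.5, 1.5]) + h([1.5, 2.5]))
```

The design point is that multi-block convexity only constrains each restriction $h_x^i$, which leaves room for kernels with product terms that are not jointly convex.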

[popular kernel functions] There are many kernel functions satisfying def:kernel. For example, for , energy, Boltzmann-Shannon entropy, Fermi-Dirac entropy (cf. [bauschke2018regularizing, Example 2.3]) and several examples in [lu2018relatively, Section 2]; and for see two examples in [li2019provable, Section 2]. Two important classes of multi-block kernels are sum separable kernels, i.e.,

$$h(x_1,\ldots,x_N)=h_1(x_1)+\ldots+h_N(x_N),$$

and product separable kernels, i.e.,

$$h(x_1,\ldots,x_N)=h_1(x_1)\times\ldots\times h_N(x_N),$$

see such a kernel for ONMF in pro:relSmoothNMF0.

We now give the definition of Bregman distances (cf. [bregman1967relaxation]) for multi-block kernels.

[Bregman distance] For a kernel function $h$, the Bregman distance $D_h$ is given by

$$D_h(y,x):=h(y)-h(x)-\langle\nabla h(x),y-x\rangle. \tag{3}$$

Fixing all blocks except the -th one, the Bregman distance with respect to this block is given by

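$$\begin{aligned} D_h\big(x+U_i(y_i-x_i),x\big) &=h\big(x+U_i(y_i-x_i)\big)-h(x)-\langle\nabla h(x),U_i(y_i-x_i)\rangle\\ &=h_x^i(y_i)-h_x^i(x_i)-\langle\nabla_i h(x),y_i-x_i\rangle, \end{aligned}$$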

which measures the proximity between and with respect to the -th block of variables. Moreover, the kernel is multi-block convex if and only if for all and and . Note that if is multi-block strictly convex, then () if and only if .
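As a small numerical illustration of the Bregman distance, the sketch below evaluates $D_h(y,x)=h(y)-h(x)-\langle\nabla h(x),y-x\rangle$ for two standard sum separable kernels. The helper name `bregman` and the test vectors are assumptions made for this example. For the energy kernel the distance is half the squared Euclidean distance, and for the Boltzmann-Shannon entropy it is the generalized Kullback-Leibler divergence.

```python
import math


def bregman(h, grad_h, y, x):
    """D_h(y, x) = h(y) - h(x) - <grad h(x), y - x> for vectors given as lists."""
    inner = sum(g * (yi - xi) for g, yi, xi in zip(grad_h(x), y, x))
    return h(y) - h(x) - inner


# Energy kernel: h(x) = 0.5 * ||x||^2.
def energy(x):
    return 0.5 * sum(v * v for v in x)


def grad_energy(x):
    return list(x)


# Boltzmann-Shannon entropy: h(x) = sum_j x_j log x_j (positive entries only).
def entropy(x):
    return sum(v * math.log(v) for v in x)


def grad_entropy(x):
    return [math.log(v) + 1.0 for v in x]


x = [0.2, 0.5, 1.3]
y = [0.4, 0.1, 1.0]
d_energy = bregman(energy, grad_energy, y, x)
d_entropy = bregman(entropy, grad_entropy, y, x)

# Closed forms used for comparison.
half_sq_dist = 0.5 * sum((a - b) ** 2 for a, b in zip(y, x))
gen_kl = sum(a * math.log(a / b) - a + b for a, b in zip(y, x))

# Moving a single block: for a sum separable kernel only that block contributes.
y_block = [x[0], 0.1, x[2]]
d_block = bregman(entropy, grad_entropy, y_block, x)
d_block_scalar = 0.1 * math.log(0.1 / x[1]) - 0.1 + x[1]
```

The last two quantities agree because, for a sum separable kernel, the block-wise distance reduces to the Bregman distance of the single kernel $h_i$.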

We are now in a position to present the notion of multi-block relative smoothness, which is the central tool for our analysis in Section 3.

[multi-block relative smoothness] Let be a multi-block kernel and let be a proper and lower semicontinuous function. If there exists () such that the functions given by

$$\phi_x^i(z):=L_i\,h\big(x+U_i(z-x_i)\big)-f\big(x+U_i(z-x_i)\big)$$

are convex for all and , then $f$ is called $(L_1,\ldots,L_N)$-smooth relative to $h$.

Note that if $N=1$, multi-block relative smoothness reduces to standard relative smoothness, which was introduced only recently in [bauschke2016descent, lu2018relatively]. In this case, if $\nabla f$ is $L$-Lipschitz continuous, then both $\tfrac{L}{2}\|\cdot\|^2-f$ and $\tfrac{L}{2}\|\cdot\|^2+f$ are convex, i.e., the relative smoothness of $f$ generalizes the notion of Lipschitz continuity of gradients using Bregman distances. If $N=2$, this definition reduces to the relative bi-smoothness given in [li2019provable] for .

We next characterize the notion of multi-block relative smoothness.

[characterization of multi-block relative smoothness] Let be a multi-block kernel and let be a proper lower semicontinuous function and . Then, the following statements are equivalent:

-smooth relative to ;

for all and ,

$$f\big(x+U_i(y_i-x_i)\big)\le f(x)+\langle\nabla_i f(x),y_i-x_i\rangle+L_i D_h\big(x+U_i(y_i-x_i),x\big); \tag{4}$$

for all and ,

$$\langle\nabla_i f(x)-\nabla_i f(y),x_i-y_i\rangle\le L_i\langle\nabla_i h(x)-\nabla_i h(y),x_i-y_i\rangle; \tag{5}$$

if and for all , then

$$L_i\nabla^2_{x_i x_i}h(x)-\nabla^2_{x_i x_i}f(x)\succeq 0, \tag{6}$$

for .

###### Proof.

Fixing all the blocks except one of them, the results can be concluded in the same way as [lu2018relatively, Proposition 1.1]. ∎
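The block-wise descent inequality (4) can be checked numerically on a toy instance. Everything in this sketch is an assumption made for illustration: the scalar factorization objective $f(x,y)=\tfrac12(xy-a)^2$, the value `A_TARGET`, and the kernel $h(x,y)=\tfrac12x^2y^2+\tfrac12(x^2+y^2)$. For this pair, $\partial^2_{xx}h-\partial^2_{xx}f=(y^2+1)-y^2=1\ge 0$, so condition (6) holds for the first block with $L_1=1$, and by symmetry for the second block as well.

```python
import random

A_TARGET = 2.0  # hypothetical data value


def f(x, y):
    return 0.5 * (x * y - A_TARGET) ** 2


def grad_x_f(x, y):
    return y * (x * y - A_TARGET)


def h(x, y):
    return 0.5 * x * x * y * y + 0.5 * (x * x + y * y)


def grad_x_h(x, y):
    return x * y * y + x


def bregman_x(x_new, x, y):
    """D_h((x_new, y), (x, y)): only the first block changes."""
    return h(x_new, y) - h(x, y) - grad_x_h(x, y) * (x_new - x)


def descent_gap(x_new, x, y, L=1.0):
    """Right-hand side of (4) minus its left-hand side; nonnegative when (4) holds."""
    upper = f(x, y) + grad_x_f(x, y) * (x_new - x) + L * bregman_x(x_new, x, y)
    return upper - f(x_new, y)


random.seed(0)
gaps = [
    descent_gap(random.uniform(-3, 3), random.uniform(-3, 3), random.uniform(-3, 3))
    for _ in range(1000)
]
min_gap = min(gaps)
```

On this toy instance the gap equals $\tfrac12(x_{\text{new}}-x)^2$, so `min_gap` stays nonnegative up to rounding. A random check of this kind only supports the inequality and does not replace the Hessian argument.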

### 2.1. Bregman proximal alternating linearized mapping

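Recall that if $N=1$, for a kernel function $h$ and a proper lower semicontinuous function $g$, the Bregman proximal mapping is given by

$$\operatorname{prox}^{h}_{\gamma g}(x):=\operatorname*{argmin}_{z\in\mathbb{R}^n}\left\{g(z)+\frac{1}{\gamma}D_h(z,x)\right\}, \tag{7}$$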

which is a generalization of the classical one using the Bregman distance (3) in place of the Euclidean distance; see, e.g., [chen1993convergence] and references therein. We note that

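$$\operatorname{prox}^{h}_{\gamma g}(x)=\left\{y\in\operatorname{dom}g\cap\operatorname{dom}h \;\middle|\; g(y)+\frac{1}{\gamma}D_h(y,x)=\min_{z}\left\{g(z)+\frac{1}{\gamma}D_h(z,x)\right\}<+\infty\right\},$$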

which implies . The function is -prox-bounded if there exists such that for some ; cf. [ahookhosh2019bregman]. We next extend this definition to our multi-block setting.
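To make the Bregman proximal mapping (7) concrete, here is a hedged one-dimensional sketch. It assumes the entropy kernel $h(z)=z\log z-z$ on $z>0$ and a linear function $g(z)=cz$, both chosen only for illustration. Setting the derivative $c+\tfrac1\gamma(\log z-\log x)$ to zero gives the multiplicative update $z=x\,e^{-\gamma c}$, which the code compares with a generic ternary search on the same objective. All function names are hypothetical.

```python
import math


def bregman_prox_linear_entropy(c, x, gamma):
    """Closed-form minimizer of c*z + (1/gamma)*D_h(z, x) for h(z) = z*log(z) - z, z > 0."""
    return x * math.exp(-gamma * c)


def objective(z, c, x, gamma):
    d = z * math.log(z / x) - z + x  # D_h(z, x) for the entropy kernel
    return c * z + d / gamma


def ternary_min(fun, lo, hi, iters=200):
    """Minimize a unimodal scalar function on [lo, hi] by ternary search."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if fun(m1) < fun(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


c, x, gamma = 0.7, 1.5, 0.4
z_closed = bregman_prox_linear_entropy(c, x, gamma)
z_numeric = ternary_min(lambda z: objective(z, c, x, gamma), 1e-9, 10.0)
```

With the energy kernel in place of the entropy, the same construction gives the classical Euclidean proximal mapping, which is the sense in which (7) generalizes it.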

[multi-block -prox-boundedness] A function is multi-block -prox-bounded if for each there exists and such that

$$g^{h/\gamma_i}(x):=\min_{z\in\mathbb{R}^{n_i}}\left\{g\big(x+U_i(z-x_i)\big)+\frac{1}{\gamma_i}D_h\big(x+U_i(z-x_i),x\big)\right\}>-\infty.$$

The supremum of the set of all such is the threshold of the -prox-boundedness, i.e.,

 (8)

For the problem (1), we have leading to

$$g^{h/\gamma_i}(x)=\sum_{j\ne i}g_j(x_j)+\min_{z\in\mathbb{R}^{n_i}}\left\{g_i(z)+\frac{1}{\gamma_i}D_h\big(x+U_i(z-x_i),x\big)\right\}, \tag{9}$$

i.e., we therefore denote . If is multi-block -prox-bounded for , so is for all . We next present equivalent conditions to this notion.

[characteristics of multi-block -prox-boundedness] For a multi-block kernel function and proper and lsc functions with , the following statements are equivalent:

is multi-block -prox-bounded;

for all and given in (2), is bounded below on for some ;

for all , .

###### Proof.

Suppose and let . Then, for all , it holds that

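$$\begin{aligned} g_i(z)+r_i h_x^i(z) &=g_i(z)+\frac{1}{\gamma_i}D_h\big(x+U_i(z-x_i),x\big)+r_i h_x^i(z)-\frac{1}{\gamma_i}D_h\big(x+U_i(z-x_i),x\big)\\ &\ge g^{h/\gamma_i}(x)-\sum_{j\ne i}g_j(x_j)+\frac{r_i\gamma_i-1}{\gamma_i}h_x^i(z)+\frac{1}{\gamma_i}\big(h(x)+\langle\nabla_i h(x),z-x_i\rangle\big)=:\tilde g_i(z). \end{aligned}$$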

Notice that is strictly convex and coercive, and as such is lower bounded. Conversely, suppose that . Then, from (9), we obtain

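$$\begin{aligned} g^{h/\gamma_i}(x) &=\sum_{j\ne i}g_j(x_j)+\min_{z\in\mathbb{R}^{n_i}}\left\{g_i(z)+\frac{1}{\gamma_i}D_h\big(x+U_i(z-x_i),x\big)\right\}\\ &\ge\sum_{j\ne i}g_j(x_j)+\alpha_i+\inf_{z}\left\{-r_i h_x^i(z)+\frac{1}{\gamma_i}D_h\big(x+U_i(z-x_i),x\big)\right\}\\ &\ge\sum_{j\ne i}g_j(x_j)+\alpha_i-\frac{1}{\gamma_i}h(x)+\frac{1}{\gamma_i}\langle\nabla_i h(x),x_i\rangle+\inf_{z}\left\{\frac{1-\gamma_i r_i}{\gamma_i}h_x^i(z)-\frac{1}{\gamma_i}\langle\nabla_i h(x),z\rangle\right\}, \end{aligned}$$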

which is finite, owing to -coercivity of .

Suppose that . Since is -coercive, we have

$$\liminf_{\|z\|\to\infty}\frac{g_i(z)}{h_x^i(z)}\ge -r_i+\liminf_{\|z\|\to\infty}\frac{\alpha_i}{h_x^i(z)}=-r_i>-\infty.$$

Conversely, suppose . Then, there exists such that whenever . In particular

$$\inf_{\|z\|\ge M_i}\big\{g_i(z)+r_i h_x^i(z)\big\}\ge\inf_{\|z\|\ge M_i}h_x^i(z)\,(\ell_i+r_i)>-\infty,$$

where the last inequality follows from coercivity of . Since owing to lower semicontinuity, we conclude that is lower bounded on . ∎

Let us now define the function as

$$M_{h/\gamma}(z,x):=\langle\nabla f(x),z-x\rangle+\frac{1}{\gamma}D_h(z,x)+\sum_{i=1}^{N}g_i(z_i) \tag{10}$$

and the set-valued Bregman proximal alternating linearized mapping as

$$T^{i}_{h/\gamma_i}(x):=\operatorname*{argmin}_{z\in\mathbb{R}^{n_i}}M_{h/\gamma_i}\big(x+U_i(z-x_i),x\big), \tag{11}$$

which reduces to the Bregman forward-backward splitting mapping if $N=1$; cf. [bolte2018first, ahookhosh2019bregman].

[majorization model] Note that invoking fac:relSmoothEqvi2, the multi-block ()-relative smoothness assumption of entails a majorization model

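$$\begin{aligned} \varphi\big(x+U_i(y_i-x_i)\big) &\le f(x)+\langle\nabla_i f(x),y_i-x_i\rangle+L_i D_h\big(x+U_i(y_i-x_i),x\big)+g_i(y_i)+\sum_{j\ne i}g_j(x_j)\\ &\le f(x)+\langle\nabla_i f(x),y_i-x_i\rangle+\frac{1}{\gamma_i}D_h\big(x+U_i(y_i-x_i),x\big)+g_i(y_i)+\sum_{j\ne i}g_j(x_j), \end{aligned}$$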

for .

In the next lemma, we show that the cost function is monotonically decreasing by minimizing the model (10) with respect to each block of variables.

[Bregman proximal alternating inequality] Let the conditions in ass:basic:fgh hold, and let with . Then,

$$\varphi\big(x+U_i(\bar z-x_i)\big)\le\varphi(x)-\frac{1-\gamma_i L_i}{\gamma_i}D_h\big(x+U_i(\bar z-x_i),x\big), \tag{12}$$

for all .

###### Proof.

For , (11) is simplified in the form

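$$\begin{aligned} T^{i}_{h/\gamma_i}(x) &=\operatorname*{argmin}_{z\in\mathbb{R}^{n_i}}\left\{\langle\nabla f(x),U_i(z-x_i)\rangle+\frac{1}{\gamma_i}D_h\big(x+U_i(z-x_i),x\big)+g_i(z)+\sum_{j\ne i}g_j(x_j)\right\}\\ &=\operatorname*{argmin}_{z\in\mathbb{R}^{n_i}}\left\{\langle\nabla_i f(x),z-x_i\rangle+\frac{1}{\gamma_i}D_h\big(x+U_i(z-x_i),x\big)+g_i(z)\right\}. \end{aligned} \tag{13}$$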

Considering , we have

$$\langle\nabla_i f(x),\bar z-x_i\rangle+\frac{1}{\gamma_i}D_h\big(x+U_i(\bar z-x_i),x\big)+g_i(\bar z)\le g_i(x_i).$$

Since is -smooth relative to , it follows from fac:relSmoothEqvi2 for and that

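$$\begin{aligned} f\big(x+U_i(\bar z-x_i)\big) &\le f(x)+\langle\nabla_i f(x),\bar z-x_i\rangle+L_i D_h\big(x+U_i(\bar z-x_i),x\big)\\ &\le f(x)+L_i D_h\big(x+U_i(\bar z-x_i),x\big)+g_i(x_i)-g_i(\bar z)-\frac{1}{\gamma_i}D_h\big(x+U_i(\bar z-x_i),x\big)\\ &=f(x)+g_i(x_i)-g_i(\bar z)-\frac{1-\gamma_i L_i}{\gamma_i}D_h\big(x+U_i(\bar z-x_i),x\big), \end{aligned}$$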

giving (12). ∎
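The decrease property (12) can be observed on a toy run of alternating block updates of the form (13). The sketch below is not the BPALM algorithm as specified in this paper. It assumes $g_1=g_2=0$, the scalar objective $f(x,y)=\tfrac12(xy-a)^2$, the hypothetical kernel $h(x,y)=\tfrac12x^2y^2+\tfrac12(x^2+y^2)$ (for which $L_1=L_2=1$ satisfies (6)), and a fixed step $\gamma=0.5$. Since $h$ is quadratic in each block, the minimizer in (13) has the closed form used in `block_step`.

```python
A_TARGET = 2.0  # hypothetical data value


def phi(x, y):
    # Toy objective with g_1 = g_2 = 0 (assumption), so phi = f.
    return 0.5 * (x * y - A_TARGET) ** 2


def h(x, y):
    return 0.5 * x * x * y * y + 0.5 * (x * x + y * y)


def bregman_block(new, x, y):
    """D_h with only the first block moved from x to new (second block fixed at y)."""
    grad = x * y * y + x
    return h(new, y) - h(x, y) - grad * (new - x)


def block_step(x, y, gamma):
    """Minimize <grad_1 f, z - x> + (1/gamma) D_h over z (closed form: h is quadratic in z)."""
    grad = y * (x * y - A_TARGET)
    return x - gamma * grad / (y * y + 1.0)


L, gamma = 1.0, 0.5
x, y = 3.0, -1.0
history = [phi(x, y)]
holds = True
for _ in range(200):
    x_new = block_step(x, y, gamma)
    bound = phi(x, y) - (1 - gamma * L) / gamma * bregman_block(x_new, x, y)
    holds &= phi(x_new, y) <= bound + 1e-12
    x = x_new
    # f and h are symmetric in their two arguments, so the same step updates y.
    y_new = block_step(y, x, gamma)
    bound = phi(x, y) - (1 - gamma * L) / gamma * bregman_block(y_new, y, x)
    holds &= phi(x, y_new) <= bound + 1e-12
    y = y_new
    history.append(phi(x, y))
```

Here `holds` records whether every block update satisfied (12), and `history` stores the objective values after each full sweep, which should be nonincreasing whenever $\gamma_i<1/L_i$.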

Recall that a function with values is level-bounded in locally uniformly in if for each and there is a neighborhood of along with a bounded set such that for all , cf. [rockafellar2011variational]. Using this definition, the fundamental properties of the mapping are investigated in the subsequent result.

[properties of Bregman proximal alternating linearized mapping] Under conditions given in ass:basic:fgh for , the following statements are true:

1. is nonempty, compact, and outer semicontinuous (osc) for all ;

2. ;

3. If , then ;

###### Proof.

For a fixed and a vector , let us define the function given by

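$$\Phi_i(z,x,\gamma_i):=g_i(z)+\langle\nabla_i f(x),z-x_i\rangle+\begin{cases}\frac{1}{\gamma_i}D_h\big(x+U_i(z-x_i),x\big) & \text{if } \gamma_i\in(0,\gamma_i^0],\\ 0 & \text{if } \gamma_i=0 \text{ and } z=x_i,\\ +\infty & \text{otherwise.}\end{cases}$$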

Since and are proper and lsc, so is on the set , for a constant . We show that is level-bounded in locally uniformly in . If it is not, then there exists , with , and such that with and . This guarantees that, for sufficiently large , , i.e., and

$$g_i(z^k)+\langle\nabla_i f(x^k),z^k-x_i^k\rangle+\frac{1}{\gamma_i^k}D_h\big(x^k+U_i(z^k-x_i^k),x^k\big)\le\beta.$$

Setting , pro:proxBoundedness2 ensures that there exists a constant such that

$$g_i(z^k)+\frac{1}{\tilde\gamma_i}h\big(x^k+U_i(z^k-x_i^k)\big)\ge g_i(z^k)+r_i h\big(x^k+U_i(z^k-x_i^k)\big)\ge\tilde\beta.$$

Subtracting the last two inequalities, it holds that

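$$\langle\nabla_i f(x^k),z^k-x_i^k\rangle+\frac{1}{\gamma_i^k}D_h\big(x^k+U_i(z^k-x_i^k),x^k\big)-\frac{1}{\tilde\gamma_i}h\big(x^k+U_i(z^k-x_i^k)\big)\le\beta-\tilde\beta.$$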

Expanding , dividing both sides by , and taking limit from both sides of this inequality as , it can be deduced that

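$$\lim_{k\to\infty}\left(\left\langle\nabla_i f(x^k)-\frac{1}{\gamma_i^k}\nabla_i h(x^k),\frac{z^k-x_i^k}{\|z^k\|}\right\rangle-\frac{1}{\gamma_i^k}\frac{h(x^k)}{\|z^k\|}\right)+\left(\frac{1}{\gamma_i^k}-\frac{1}{\tilde\gamma_i}\right)\lim_{k\to\infty}\frac{h^i_{x^k}(z^k)}{\|z^k\|}\le\lim_{k\to\infty}\frac{\beta-\tilde\beta}{\|z^k\|}.$$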

This leads to the contradiction , which implies that is level-bounded. Therefore, all assumptions of the parametric minimization theorem [kan2012moreau, Theorem 2.2 and Corollary 2.2] are satisfied, i.e., pro:proxPro1. If , then lem:proxAltIneq implies that for , i.e., , the second inclusion follows from ass:basic:argmin. ∎

[sum or product separable kernel] Let us observe the following.

1. If is an additive separable function, i.e., , then (11) can be written in the form

 Ti\nicefrachγi(x) =\argminz∈\Rni\setgi(z)+\innprod∇if(x)z−xi+1γi(h(x+Ui(z−xi))−h(x)−\innprod∇ih(x)z−xi) =\argminz∈\Rni\setgi(z)+1γi(hi(z)−hi(xi)−\innprod∇hi(xi)−γi∇if(x)z−xi) =\argminz∈