# Online Convex Dictionary Learning

Dictionary learning is a dimensionality reduction technique widely used in data mining, machine learning and signal processing alike. Nevertheless, many dictionary learning algorithms such as variants of Matrix Factorization (MF) do not adequately scale with the size of available datasets. Furthermore, scalable dictionary learning methods lack interpretability of the derived dictionary matrix. To mitigate these two issues, we propose a novel low-complexity, batch online convex dictionary learning algorithm. The algorithm sequentially processes small batches of data maintained in a fixed amount of storage space, and produces meaningful dictionaries that satisfy convexity constraints. Our analytical results are two-fold. First, we establish convergence guarantees for the proposed online learning scheme. Second, we show that a subsequence of the generated dictionaries converges to a stationary point of the approximation-error function. Experimental results on synthetic and real world datasets demonstrate both the computational savings of the proposed online method with respect to convex non-negative MF, and performance guarantees comparable to those of online non-convex learning.

## Authors

• 7 publications
• 5 publications
• 41 publications
03/02/2021

### Online Orthogonal Dictionary Learning Based on Frank-Wolfe Method

Dictionary learning is a widely used unsupervised learning method in sig...
11/02/2017

11/05/2019

### Online matrix factorization for Markovian data and applications to Network Dictionary Learning

Online Matrix Factorization (OMF) is a fundamental tool for dictionary l...
05/03/2016

### Decentralized Dynamic Discriminative Dictionary Learning

We consider discriminative dictionary learning in a distributed online s...
01/22/2017

### Neurogenesis-Inspired Dictionary Learning: Online Model Adaption in a Changing World

In this paper, we focus on online representation learning in non-station...
10/15/2021

### NNK-Means: Dictionary Learning using Non-Negative Kernel regression

An increasing number of systems are being designed by first gathering si...
02/08/2016

### Compressed Online Dictionary Learning for Fast fMRI Decomposition

We present a method for fast resting-state fMRI spatial decomposi-tions ...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Dictionary learning is a widely used dimensionality reduction technique (Tosic & Frossard, 2011; Mairal et al., 2009) whose goal is to find a basis that allows for a sparse representation of the underlying data. Compared to other dimensionality reduction techniques based on eigendecomposition (Van Der Maaten et al., 2009), dictionary learning enforces fewer restrictions on the choice of the basis and hence ensures larger representation flexibility for complex datasets. At the same time, it provides a natural, application-specific interpretation for the dictionaries. In many settings, a sparse representation is sought to accommodate applications such as clustering, in which dictionary elements may be viewed as cluster centroids (Rubinstein et al., 2010; Dai et al., 2012).

Dictionary learning is a special instance of matrix factorization (MF) with sparsity constraints. Many matrix factorization problems are non-convex and NP-hard (Li et al., 2016; Lin, 2007; Vavasis, 2009), but can often be solved (suboptimally) using alternating optimization approaches for finding local optima (Lee & Seung, 2001). MF methods have also been studied under various modeling constraints (Ding et al., 2010; Srebro et al., 2005; Liu et al., 2012; Bach et al., 2008). The most frequently used constraints are non-negativity, semi-non-negativity, orthogonality and convexity (Ding et al., 2010). For data a priori known to be non-negative (Lee & Seung, 2001; Paatero & Tapper, 1994), non-negativity constraints lead to results that better capture relevant data properties (Wang et al., 2017; Févotte & Dobigeon, 2015). Furthermore, adding certain constraints can accelerate the convergence rate of MF solvers (Gillis & Glineur, 2012).

A dictionary learning method of special interest is convex MF (Ding, Li & Jordan, 2010). Convex MF requires the dictionary to be a convex combination of the observed data samples (Esser et al., 2012; Bouchard et al., 2013). In the context of clustering, the convexity constraint ensures that the cluster centroids are probabilistic combinations of the data samples, which is of great relevance for practical knowledge extraction. Unfortunately, MF and convex MF methods have scalability issues, as the number of matrix multiplications and convex optimization steps depends both on the number of data samples and their dimensionality. To address the scalability issue for MF techniques (Gribonval et al., 2015; Lee et al., 2007; Kasiviswanathan et al., 2012), the authors of (Mairal et al., 2010) introduced an online algorithm for dictionary learning that minimizes a surrogate function amenable to sequential optimization. Their online algorithm comes with strong performance guarantees, asserting that its solution converges almost surely to a local optima of the generalization loss.

No currently known dictionary learning algorithms are able to combine convexity constraints, needed for interpretability, and online processing approaches, needed for scaling. We propose the first dictionary learning method accounting for both constraints, termed batch online convex dictionary learning. The algorithm takes as it input sequential batches of samples and updates a running version of a convex dictionary. The method allows for both sparse data and sparse dictionary representations. In the latter context, sparsity refers to restricting each dictionary element to be a convex combination of a small number of data samples. In addition to providing a simple explanation for the dictionary, the sparse dictionary constraints also allows one to maintain small, fixed-size subsets of data samples that are typical of the underlying clusters. This is of special relevance in applications for which storing the entire dataset is prohibitively costly.

Our online convex dictionary learning method, described in Section 2 and Section 3, comes with provable performance guarantees. In Section 4, we show that the output of the algorithm converges and that a subsequence of the learned dictionaries converges to a stationary point of the approximation-error function. To demonstrate the practical utility of the learning method, we also empirically test its performance on both synthetic and real world datasets. The real world datasets include image, text, and other types of data retrieved from the UCI Machine Learning repository (Dua & Graff, 2017), and are described in Section 5. Our experimental results reveal that the online convex method runs significantly faster than its non-online counterpart, and produces quality clustering results and data sample representations.

## 2 Notation and Problem Formulation

Matrices are denoted by capital bold font letters, while random variables (RVs) are denote by capital letters. Random vectors are described by capital underlined letters, while realizations of random vectors are denoted by lower-case bold letters. Point sets (regions) are denoted by calligraphic letters. The set

for any positive integer is denoted by . Furthermore, the column of a matrix is denoted by , and the element in row , column is written as . Similarly, the coordinate of a vector is denoted by .

We also use to denote the set of columns of , and let stand for the convex hull of . With a slight abuse of notation, we let stand for the matrix obtained by stacking vectors in in an arbitrary order, column-wise.

Data samples are modeled as random vectors drawn from the union of disjoint, convex compact regions (clusters),

. Each cluster is independently selected as the support with probability

, and the vector is subsequently sampled from the chosen cluster according to a mildly constrained marginal distribution (described in detail in Section 4). Our goal is to find a dictionary in which corresponds to cluster such that every data sample can be (sparsely) approximated in the form . Here, stands for the coefficient vector of the dictionary.

The data approximation-error induced by the dictionary is defined as:

 g(D)≜EX––[minα∈Rk||X––−Dα||2F+λ||α||1],D∈D, (1)

where , and the expectation is taken with respect to the data distribution. We enforce a convexity constraint on the dictionaries by requiring to be a convex combination of points in the cluster . The parameter controls the sparsity of the coefficient vector , and it can be chosen according to the application at hand.

The online algorithm described in the next section outputs a dictionary that asymptotically minimizes the approximation-error via sequential processing of batches of data samples . We prove that the proposed algorithm converges almost surely. Moreover, we prove that there exists a subsequence of the generated sequence of dictionaries for which the approximation-error converges almost surely to a stationary point of .

## 3 Online Convex Dictionary Learning

We first introduce the approximation-error of a single data point as

 ℓ(x,D,α)≜12||x−Dα||2F+λ||α||1. (2)

Furthermore, we define the following functions of the approximation-error :

 ℓ(x,D)=minα∈Rkℓ(x,D,α), (3) ℓ(S,D)≜1|S|∑(x,αx)∈Sℓ(x,D,αx), (4) ℓ⋆(S,D)≜1|S|∑(x,αx)∈Sℓ(x,D), (5)

where denotes an arbitrary set of pairs of real-valued vectors, .

The original problem of minimizing the approximation-error in Eq. 1 cannot be efficiently solved in an iterative manner since the function

is non-convex and NP-hard to optimize. One way to reduce the complexity of the problem is to find a sub-optimal solution that sequentially optimizes its empirical estimate

,

 gt(D)≜1tt∑n=1ℓ⋆(Sn,D). (6)

Keeping a running dictionary estimate that optimizes involves updating the coefficient vectors for all the previous batches at each step . Hence, we introduce the notion of surrogate functions to make the task of updating both the coefficient vector and the dictionary estimate tractable.

The proposed online convex dictionary learning algorithm in Alg. 2 optimizes the surrogate function , defined below, to update the running estimate of the dictionary at the time , :

 ^gt(D)≜1tt∑n=1ℓ(Sn,D). (7)

Here, denotes a batch of pairs at step . Each pair contains data and it’s corresponding coefficient vector computed with respect to .

The surrogate function processes the data in batches, , such that a batch contains exactly one entry from each cluster. This is done to ensure that converge to a local optima (Lemma 2). To satisfy the convexity constraint on , , our algorithm constraints the dictionary using the observed data samples since the regions are unknown a priori. For each cluster , we dynamically maintain and update a set of a fixed number of data samples arranged in a matrix , and constrain to lie in . Clearly, and

For simplicity, we refer to both and as the representative region of . The representative region provides both an “estimate” of the unknown cluster and a small set of characteristic samples from the cluster which may be used for additional processing.

The initialization algorithm, described in Alg. 1, generates an initial estimate of the dictionary and of the representative regions . As part of the process, one first collects data samples, described by the matrix . Then, a procedure similar to the one described in (Ding et al., 2010)

is executed: it involves running the K-means algorithm on the collected points and creating a cluster indicator matrix

such that if data sample lies in cluster . The sizes of the obtained clusters equal to , so that . The convex hulls of the points within the clusters serve as the representative regions. The matrix equals where .

Two remarks are in order. First, note that initialization is performed using only a small set of data samples. Hence, it negligibly increases the overall complexity of the online algorithm. Second, to ensure that the online algorithm covers all cluster, at least one point per cluster needs to be identified. Nevertheless, collecting points is desirable for proper estimation of the representative regions in the first iteration of the algorithm, and for reducing the number of iterations in the update rules.

Upon initialization, for each sampling time , we collect a batch of points in Step 4. Step 4 calls Alg. 3 and terminates when the obtained batch of points contains at least one entry from each cluster. Alg. 3classifies the points in according to the procedure described in Step 7. Note that Step 7 in Alg. 3 may not always produce the correct classification. Nevertheless, for minimizing we do not necessarily require the assignment described in Eq. 10 to be error-free. This claim holds due to the following observations. Errors introduced in this step affect the constraint that . However, the volume of the estimated clusters produced by the algorithm in the presence of errors is always larger than that obtained without errors. Thus, the approximation-error can only be smaller. Although the generated dictionary elements may not satisfy , extensive simulations reveal that for sufficiently large values of and proper initialization, the algorithm still produces representative dictionaries for the clusters. For analytical tractability, we assume that the classification procedure is error-free.

Candidate data points for the representative region are obtained in Step 7 of Alg. 2. In Step 9 we update the representative regions and the running dictionary estimate by minimizing the approximation-error obtained by solving convex problems. Each convex problem involves one out of choices for the candidate representative region of cluster , .

In theory, the online convex dictionary learning method works for all values of and , and our analysis pertains to this unrestricted scenario. Nevertheless, in practice, the batch method and convex methods in general are only tractable for small and , in which case one may actually solve convex problems in each iteration. For large or , a greedy algorithm may be used instead; the greedy algorithm updates sequentially for each cluster using Alg. 4.

It is worth to point out that optimizing the surrogate function does not influence the asymptotic approximation-error with respect to the function , as iteratively optimizing the surrogate function with respect to produces a subsequence that converges to a stationary point of . Thus, asymptotically, optimizing over the representative region is equivalent to optimizing over the whole region .

## 4 Convergence Analysis

In what follows, we prove that Alg. 2 converges almost surely, and that using the surrogate function gives the same asymptotic result as if were used instead. More precisely, we show that there exists a subsequence of such that converges almost surely to a stationary point of .

First, from Eq. 8, we have,

 Dt=\operatornamewithlimitsargminD⟦i⟧∈cvx\lparen^X(i)t\rparen,i∈[k]^gt(D).

Second, we let stand for the global optima we wish to compute at each step, i.e.,

 D⋆t≜\operatornamewithlimitsargminD∈D^gt(D).

Finally, for an integer , we let

stand for the estimate of the dictionary obtained by restricting the column of the dictionary (henceforth referred to as the dictionary element), to lie in . Note that in this definition, has been replaced by . For , we have , and is the dictionary obtained when .

To establish the desired convergence results, we make the following modeling assumptions, summarized below. Using these assumptions, we perform our analysis as outlined in Section 4.1.

#### Assumptions

The data distribution on a compact support set

has bounded “skewness

. The compact support assumption naturally arises in audio, image, and video processing applications, and it is a consequence of the data acquisition process.

We assume that all clusters are equally likely, and that the following inequality holds for the distribution of the random vector given that it lies in ,

 P\lparen||X––−p||≤r\rparen≥κvolBC(r,p)vol(C\lpareni\rparen). (11)

Here, is a constant, stands for the ball of radius around . This assumption is satisfied for appropriately chosen and distributions of that are “close” to uniform.

The quadratic surrogate functions are strictly convex, and have Hessians that are lower-bounded by a positive constant.

The smallest eigenvalue of the positive semi-definite matrix

, defined in subroutine Alg. 3 of Alg. 2, is lower bounded by a constant . It is straightforward to enforce this assumption by adding a term to the surrogate or original objective function; this leads to replacing the positive semi-definite matrix of Alg. 3 by . For simplicity, we do not explicitly include the regularizer in our derivations.

The approximation-error function is “well-behaved”. We assume that the function defined in Eq. 3 is continuously differentiable, and its expectation is continuously differentiable and Lipschitz on the compact set . This assumption parallels the one made in (Mairal et al., 2010, Proposition 2), and it holds if the solution to the optimization problem Eq. 3 is unique. The uniqueness condition can be enforced by adding a regularization term () to the objective function in Eq. 2. This term makes the LARS problem strictly convex and hence ensures that it has a unique solution.

### 4.1 Main Results

The proof of our main result proceeds though the following steps. In Lemma 1 we show that the difference-sequence of the optimal dictionaries converges to zero with rate . Furthermore, when optimizing over the same representative region, the difference-sequence of the dictionaries exhibits the same behavior as that of the optimal dictionaries. Lemma 2 shows that restricting the optima to the representative region does not affect convergence to the asymptotic global optima . Lemma 4 establishes that our proposed algorithm converges almost surely. Finally, using the results in Lemma 4 and Lemma 5 allows us to establish that there exists a subsequence of dictionaries that converges to a stationary point of .

Note that the sums in the expressions for the surrogate functions at time and only differ in one term, . Thus, intuitively, the optima of these functions over the same optimization region are expected to be close to each other. Lemma 1 formally states this observation, which is used in Lemma 5 and Lemma 4 to establish the desired convergence result.

###### Lemma 1.
 ||D⋆t+1−D⋆t||=O(1t)

and

 ||D(t)t+1−Dt||=O(1t).
###### Proof.

The proof is similar to that of Lemma 1 of (Mairal et al., 2010) and hence omitted. ∎

###### Lemma 2.

There exists a subsequence of such that almost surely.

###### Proof.

We start by proving that converges almost surely. Lemma 2 easily follows from this result. To this end, we use the proposition below.

###### Proposition 3.

Let be a RV defined recursively according to

 Yt=min{Yt−1,Ut}+1t−1,

where are RVs satisfying with a constant . Then,

###### Proof.

By definition, we have

 Yt=min{Yt−1,Ut}+1t−1=mini∈[t] {Ui+t−1∑j=i−11j} ⟹E[ Yt]≤E[mini∈[t−n,t] Ui]+t−1∑j=t−n−11j, =O(1na)+t−1∑j=t−n−11j,  ∀n∈[t], =O(1taa+1),

where the last line is a consequence of the expectation constraint. ∎

To establish the almost sure convergence of the sequence , we invoke the quasi-martingale convergence theorem of (Fisk, 1965). In this setting, the necessary condition in the convergence theorem is of the form

 ∑tE[||Dt−D⋆t||t]=O(1).

Using the proposition, it can be shown that

 (12)

This establishes the validity of the necessary condition. The remainder of the proof follows from Eq. 12.

Let . Then,

 Δt=||Dt−D⋆t||^gt (13a) =min{||D(t−1)t−D⋆t||^gt,||D(t)t−D⋆t||^gt} (13b) ≤min{||D(t−1)t−Dt−1||^gt+||Dt−1−D⋆t−1||^gt+||D⋆t−1−D⋆t||^gt, ||D(t)t−D⋆t||^gt} (13c) ≤min{||Dt−1−D⋆t−1||^gt+κ1t−1, ||D(t)t−D⋆t||^gt} (13d)

where Eq. 13b follows from Alg. 2, since

 Dt=\operatornamewithlimitsargminD⟦i⟧∈cvx\lparen^X(i)t\rparen⋃cvx\lparen^X(i)t−1\rparen,i∈[k]^gt(D),

and Eq. 13d follows form the fact that is Lipschitz and the bound in Lemma 1. Consequently,

 ||Dt−1−D⋆t−1||^gt=∣∣^gt(Dt−1)−^gt(D⋆t−1)∣∣ ≤||Dt−1−D⋆t−1||^gt−1+∣∣∣ℓ(St,Dt−1)−ℓ(St,D⋆t−1)t∣∣∣ (14a) ≤||Dt−1−D⋆t−1||^gt−1+κ2t, (14b)

where Eq. 14b follows from Sections 4 and 4. Combining Eqs. 14b and 13d we get

 Δt≤min{Δt−1+κ3t−1,||Dt−D⋆t||^gt}. (15)

Thus, upper bounding the condition numbers of over all by , we have

 ||D(t)t−D⋆t||^gt≤c^g||Xt−D⋆t|| ≤c^g||Xt−D⋆t−1||+c^g||D⋆t−1−D⋆t|| (16)

where .

Combining Eqs. 16 and 15, we obtain

 Δt≤min{Δt−1,c^g||Xt−D⋆t−1||}+κt−1.

Next, let . Then, from Section 4

where . Since the RVs are independent, and invoking Proposition 3 establishes that

 E[Δt]=O(1t1/(1+m)).

Using Section 4 on the smallest eigenvalue of , we can show that

 Δt =|^gt(Dt)−^gt(D⋆t)|≥κ^g||Dt−D⋆t||2 ⟹E[||Dt−D⋆t||] ≤√E[||Dt−D⋆t||2]≤√1κ^gE[Δt] ⟹E[||Dt−D⋆t||] =O(1t1/2(1+m))

###### Lemma 4.

converges almost surely;

converges almost surely to ;

converges almost surely .

###### Proof.

To prove that converges almost surely, we follow the approach in (Mairal et al., 2010). Let . For Lemma 4, we prove almost sure convergence by showing that the positive stochastic process satisfies

 ∑tE[E[γt+1−γt|Ft]+]<∞, (17)

where denotes the filtration up to time . The above condition establishes that the process is a quasi-martingale (Fisk, 1965) that converges almost surely.

To compute the positive variations of needed in LABEL:{eq1_QMart_cond:as_conv}, we write

 γt+1−γt=^gt+1(Dt+1)−^gt(Dt) =^gt+1(Dt+1)−^gt+1(Dt)+^gt+1(Dt)−^gt(Dt) =^gt+1(Dt+1)−^gt+1(Dt)+ℓ⋆(St+1,Dt)−gt(Dt)t+1+ gt(Dt)−^gt(Dt)t+1 (18) ≤ℓ⋆(St+1,Dt)−gt(Dt)t+1, (19)

where Eqs. 19 and 18 follows since Alg. 2 guarantees

 ℓ(St+1,Dt)=ℓ⋆(St+1,Dt), ^gt+1(Dt+1)−^gt+1(Dt)≤0, gt(Dt)−^gt(Dt)≤0.

Therefore, we have,

 =g(Dt)−gt(Dt)t+1≤||g−gt||∞t+1, (20)

where Eq. 20 follows since the clusters are equiprobable. The differences are bounded and smooth, which allows us to use Donsker’s theorem (Van Der Vaart, 1998, lemma 19.36). Hence,

 E⎡⎣E[||g−gt||∞t+1]+⎤⎦≤κ2t32. (21)

To prove Lemmas 4 and 4, we use the quasi-martingale convergence theorem of Fisk (1965) and Eq. 18, so that

 ∑t|E[γt+1−γt|Ft]|<∞⟹∑t^gt(Dt)−gt(Dt)t+1<∞.

Therefore,

 ∞ >∑t^gt(Dt)−gt(Dt)t+1 ≥∑t^gt(D⋆t)−gt(D⋆t)t+1+∑tgt(D⋆t)−gt(Dt)t+1.

Using Lemma 2 and the Lipschitz continuity of , we have

 ∑t|gt(D⋆t)−gt(Dt)|t+1≤∑tκ2||D⋆t−Dt||t+1<∞ ⟹∑t^gt(D⋆t)−gt(D⋆t)t+1<∞.

Therefore, since and has a Lipschitz constant independent on , using Lemma 8 in (Mairal et al., 2010) we get

 ^gt(D⋆t)−gt(D⋆t)~{}a.s.−−−−→0.

The proofs of Lemmas 4 and 4 follow from the Glivenko-Cantelli’s theorem (Van Der Vaart, 1998, Thm 19.4) and Lemma 2. ∎

###### Lemma 5.

There exists a subsequence of the dictionary that, under Sections 4, 4 and 4 converges to some stationary point of .

###### Proof.

Consider a convergent subsequence such that . In this case, assume that converges to a matrix . Let be a matrix in . Since upperbounds for all ,

 ^gtj(Dtj+Q)≥gtj(Dtj+Q).

Letting , we get

 ^g∞(D∞+Q)≥g(D∞+Q).

Let be a sequence that converges to . Using the first order Taylor series expansions of and , and recalling the fact that