# Scalable sparse covariance estimation via self-concordance

We consider the class of convex minimization problems, composed of a self-concordant function, such as the metric, a convex data fidelity term h(·) and, a regularizing -- possibly non-smooth -- function g(·). This type of problems have recently attracted a great deal of interest, mainly due to their omnipresence in top-notch applications. Under this locally Lipschitz continuous gradient setting, we analyze the convergence behavior of proximal Newton schemes with the added twist of a probable presence of inexact evaluations. We prove attractive convergence rate guarantees and enhance state-of-the-art optimization schemes to accommodate such developments. Experimental results on sparse covariance estimation show the merits of our algorithm, both in terms of recovery efficiency and complexity.

## Authors

• 37 publications
• 3 publications
• 22 publications
• 67 publications
• ### A proximal Newton framework for composite minimization: Graph learning without Cholesky decompositions and matrix inversions

We propose an algorithmic framework for convex minimization problems of ...
01/08/2013 ∙ by Quoc Tran-Dinh, et al. ∙ 0

• ### Generalized Self-Concordant Functions: A Recipe for Newton-Type Methods

We study the smooth structure of convex functions by generalizing a powe...
03/14/2017 ∙ by Tianxiao Sun, et al. ∙ 0

• ### Stochastic model-based minimization of weakly convex functions

We consider an algorithm that successively samples and minimizes stochas...
03/17/2018 ∙ by Damek Davis, et al. ∙ 0

• ### Composite convex minimization involving self-concordant-like cost functions

The self-concordant-like property of a smooth convex function is a new a...
02/04/2015 ∙ by Quoc Tran-Dinh, et al. ∙ 0

• ### Composite Self-Concordant Minimization

We propose a variable metric framework for minimizing the sum of a self-...
08/13/2013 ∙ by Quoc Tran-Dinh, et al. ∙ 0

• ### A single-phase, proximal path-following framework

We propose a new proximal, path-following framework for a class of const...
03/05/2016 ∙ by Quoc Tran-Dinh, et al. ∙ 0

• ### Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation

The L1-regularized maximum likelihood estimation problem has recently be...
11/12/2012 ∙ by Dominique Guillot, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## Introduction

Convex -regularized divergence criteria have been proven to produce – both theoretically and empirically – consistent modeling in diverse top-notch applications. The literature on the setup and utilization of such criteria is expanding with applications in Gaussian graphical learning [Dahl, Vandenberghe, and Roychowdhury2008, Banerjee, El Ghaoui, and d’Aspremont2008, Hsieh et al.2011], sparse covariance estimation [Rothman2012], Poisson-based imaging [Harmany, Marcia, and Willett2012], etc.

In this paper, we focus on the sparse covariance estimation problem. Particularly, let be a collection of

-variate random vectors, i.e.,

, drawn from a joint probability distribution with covariance matrix

. In this context, assume there may exist unknown marginal independences among the variables to discover; we note that when the -th and -th variables are independent. Here, we assume is unknown and sparse, i.e., only a small number of entries are nonzero. Our goal is to recover the nonzero pattern of , as well as compute a good approximation, from a (possibly) limited sample corpus.

Mathematically, one way to approximate is by solving:

 Θ∗=argminΘ{−logdet(Θ)+h(Θ)+g(Θ)}, (1)

where is the optimization variable, where is the sample covariance and is a convex nonsmooth regularizer function, accompanied with an easily computable proximity operator [Combettes and Wajs2005]. and .

Whereas there are several works [Becker and Fadili2012, Lee, Sun, and Saunders2012] that compute the minimizer of such composite objective functions, where the smooth term is generally a Lipschitz continuous gradient function, in (1) we consider a more tedious task: The objective function has only locally Lipschitz continuous gradient. However, one can easily observe that (1) is self-concordant; we refer to some notation and definitions in the Preliminaries section. Within this context, [Tran-Dinh, Kyrillidis, and Cevher2013a] present a new convergence analysis and propose a series of proximal Newton schemes with provably quadratic convergence rate, under the assumption of exact algorithmic calculations at each step of the method.

Here, we extend the work of [Tran-Dinh, Kyrillidis, and Cevher2013a] to include inexact evaluations and study how these errors propagate into the convergence rate. As a by-product, we apply these changes to propose the inexact Self-Concordant OPTimization (iSCOPT) framework. Finally, we consider the sparse covariance estimation problem as a running example for our discussions. The contributions are:

• We consider locally Lipschitz continuous gradient convex problems, similar to (1), where errors are introduced in the calculation of the descent direction step. Our analysis indicates that inexact strategies achieve similar convergence rates as the corresponding exact ones.

• We present the inexact SCOPT solver (iSCOPT) for the sparse covariance estimation problem, with several variations that increase the convergence rate in practice.

## Preliminaries

Notation: We reserve to denote the vectorization operator which maps a matrix into a vector, by stacking its columns and, let be the inverse operation.

denotes the identity matrix.

###### Definition 1 (Self-concordant functions [Nesterov and Nemirovskii1994]).

A convex function is self-concordant if . A function is self-concordant if is self-concordant .

For , we define as the local norm around with respect to . The corresponding dual norm is . We define as , and as . Note that and are both nonnegative, strictly convex, and increasing.

Problem reformulation: We can transform the matrix formulation of (1) in the following vectorized problem:

 minx∈Rp{F(x):=−logdet(mat(x))+h(x)=f(x)+g(x)}, (2)

for , where is a convex, self-condordant function and is a proper, lower semi-continuous and non-smooth convex regularization term. For our discussions, we assume is -norm-based.

## The algorithm in a nutshell

For our convenience and without loss of generality, we use the vectorized reformulation in (2). Here, we describe the SCOPT optimization framework, proposed in [Tran-Dinh, Kyrillidis, and Cevher2013a]. SCOPT generates a sequence of putative solutions , according to:

 xi+1=(1−τi)xi+τiδ⋆i,τi=1λi+1, (3)

where is a descent direction, and is a step size along this direction. To compute , we minimize the non-smooth convex surrogate of around ; observe that assumes exact evaluations of :

 δ⋆i=argminδ∈Rp{U(δ,xi)+g(δ)}; (4)

is a quadratic approximation of such that where and denote the gradient (first-order) and Hessian (second-order) information of function around , respectively.

While quadratic approximations of smooth functions (of the form ) have become de facto approaches for general convex smooth objective functions, to the best of our knowledge, there are not many works considering a composite non-smooth and non-Lipschitz gradient minimization case with provable convergence guarantees under the presence of errors in the descent direction evaluations.

## Inexact solutions in (4)

An important ingredient for our scheme is the calculation of the descent direction through (4). For sparsity based applications, we use FISTA – a fast -norm regularized gradient method for solving (4) [Beck and Teboulle2009] – and describe how to efficiently implement such solver for the case of sparse covariance estimation where .

Given the current estimate , the gradient and the Hessian of around can be computed respectively as: Given the above, let . After calculations on (4), we easily observe that (4) is equivalent to:

 δi=argminδ{12δT∇2f(xi)δ+zTδφ(δ)+g(δ)}, (5)

where is smooth and convex with Lipschitz constant :

 L=1ρ+1λ2min(mat(xi)), (6)

where

denotes the minimum eigenvalue of a matrix. Combining the above quantities in a ISTA-like procedure

[Daubechies, Defrise, and De Mol2004], we have:

 δk+1=SλLρ(δk−1L∇φ(δk)), (7)

where we use superscript to denote the -th iteration of the ISTA procedure (as opposed to the subscript for the -th iteration of (3)). Here, and . Furthermore, to achieve an convergence rate, one can use acceleration techniques that lead to the FISTA algorithm, based on Nesterov’s seminal work [Nesterov1983]. We repeat and extend FISTA’s guarantees, as described in the next theorem; the proof is provided in the supplementary material.

###### Theorem 1.

Let be the sequence of estimates generated by FISTA. Moreover, define where is the minimizer with for some global constant . Then, to achieve a solution such that:

 G(δK)−G(δ⋆)≤ϵ, ϵ>0, (8)

the FISTA algorithm requires at least iterations. Moreover, it can be proved that:

 G(δK)−G(δ⋆)≥12∥δ⋆−δK∥2xi

We note that, given accuracy , satisfies (8) and in the recursion (3). In general, is not known apriori; in practice though, such a global constant can be found during execution, such that Theorem 1 is satisfied. A detailed description is given in the supplementary material.

For the sparse covariance problem, one can observe that and are precomputed once before applying FISTA iterations. Given , we compute in time complexity, while can be computed with time cost using the Kronecker product property . Similarly, can be iteratively computed in time cost. Overall, the FISTA algorithm for this problem has computational cost.

## iSCOPT: Inexact Scopt

Assembling the ingredients described above leads to Algorithm 1, which we call as the Inext Self-Concordant Optimization (iSCOPT) with the following convergence guarantees; our objective function satisfies the assumptions A.1, defined in [Tran-Dinh, Kyrillidis, and Cevher2013a]; the proof is provided in the supplementary material.

###### Theorem 2 (Global convergence guarantee).

Let where is the Newton decrement, is the solution of (4) and is the requested accuracy for solving (4). Assume and let the set be bounded. Then, iSCOPT generates such that satisfies:

 F(xi+1)≤F(xi)−ξ(τi),where

, , i.e., is a strictly non-increasing sequence.

### Quadratic convergence rate of iSCOPT algorithm

For strictly convex criteria with unique solution , the above proof guarantees convergence, i.e., for sufficiently large . Given this property, we prove the convergence rate towards the minimizer using local information in norm measures: as long as is away from , the algorithm has not yet converged to . On the other hand, as , the sequence converges to its minimum and , as increases.

In our analysis, we use the weighted distance to characterize the rate of convergence of the putative solutions. By (3) and given is a computable solution where , we observe:

 ∥xi+1−xi∥xi=∥τi(δi−xi)∥xi∝∥δi−xi∥xi:=εi.

This setting is nearly algorithmic: given and at each iteration, we can observe the behavior of through the evolution of and identify the region where this sequence decreases with a quadratic rate.

###### Definition 2.

We define the quadratic convergence region as such where satisfies , for some constant , and bounded and small constant .

The following lemma provides a first step for a concrete characterization of for the iSCOPT algorithm; the proof can be found in the supplementary material.

###### Lemma 1.

For any selection, , the iSCOPT algorithm generates the sequence such that (9) holds.

We provide a series of corollaries and lemmata that justify the local quadratic convergence of our approach in theory.

###### Corollary 1.

In the ideal case where is computable exactly, i.e., , the iSCOPT algorithm is identical to the SCOPT algorithm [Tran-Dinh, Kyrillidis, and Cevher2013a].

We apply the bound to simplify (9) as:

 εi+1≤21−2τiεi⋅1−τi+2τ2iεi1−8τiεi+8τ2iε2i⋅εi+√2ϵ (10)

Next, we describe the convergence rate of iSCOPT for the two distinct phases in our approach: full step size and damped step size; the proofs are provided in the supplementary material.

###### Theorem 3.

Assume . Then, iSCOPT satisfies:

 εi+1≤βε2i+c,

where , and is user-defined. I.e., iSCOPT has locally quadratic convergence rate where is small-valued and bounded. Moreover, for , .

###### Theorem 4.

Assume the damped-step case where . Then, iSCOPT satisfies:

 εi+1≤βε2i+c,

where and is user-defined. I.e., iSCOPT has locally quadratic convergence rate where is small-valued and bounded. Moreover, for , .

### An iSCOPT variant

Starting from a point far away from the true solution, Newton-like methods might not show the expected convergence behavior. To tackle this issue, we can further perform Forward Line Search (FLS) [Tran-Dinh, Kyrillidis, and Cevher2013a]: starting from the current estimate , one might perform a forward binary search in the range . The selection of the new step size is taken as the maximum-valued step size in , as long as decreases the objective function , while satisfying any constraints in the optimization. The supplementary material contains illustrative examples which we omit due to lack of space.

## Application to sparse covariance estimation

Covariance estimation is an important problem, found in diverse research areas. In classic portfolio optimization [Markowitz1952], the covariance matrix over the asset returns is unknown and even the estimation of the most significant dependencies among assets might lead to meaningful decisions for portfolio optimization. Other applications of the sparse covariance estimation include inference in gene dependency networks [Schäfer and Strimmer2005], fMRI imaging [Varoquaux et al.2010], data mining [Alqallaf et al.2002], etc. Overall, sparse covariance matrices come with nice properties such as natural graphical interpretation, whereas are easy to be transfered and stored.

To this end, we consider the following problem:

Problem I: Given -dimensional samples

, drawn from a joint probability density function with

unknown sparse covariance , we approximate as the solution to the following optimization problem for some :

 Θ⋆=argminΘ{12ρ∥Θ−ˆΣ∥2F−logdet(Θ)+λρ∥Θ∥1}

A summary of the related work on the sparse covariance problem is given in Table 1 and a more detailed discussion is provided in the supplementary material.

## Experiments

All approaches are carefully implemented in Matlab code with no C-coded parts. In all cases, we set and . A more extensive presentation of these results can be found in the supplementary material.

### Benchmarking iSCOPT: time efficiency

To the best of our knowledge, only [Rothman2012] considers the same objective function as in Problem I. There, the proposed algorithm follows similar motions with the graphical Lasso method [Friedman, Hastie, and Tibshirani2008].

To show the merits of our approach as compared with the state-of-the-art in [Rothman2012], we generate as a random positive definite covariance matrix with . In our experiments, we test sparsity levels such that and . Without loss of generality, we assume that the variables are drawn from a joint Gaussian probability distribution. Given , we generate random -variate vectors according to , where . Then, the sample covariance matrix is ill-conditioned in all cases with . We observe that the number of unknowns is ; in our testbed, this corresponds to estimation of up to variables. To compute in (6), we use a power method scheme with iterations. All algorithms under comparison are initialized with . As an execution wall time, we set seconds (1 hour). In all cases, we set .

Table 2 contains the summary of results. Overall, the proposed framework shows superior performance across diverse configuration settings, both in terms of time complexity and objective function minimization efficiency: both iSCOPT and iSCOPT FLS find solutions with lower objective function value, as compared to [Rothman2012], within the same time frame. The regular iSCOPT algorithm performs relatively well in terms of computational time as compared to the rest of the methods. However, its convergence rate heavily depends on the conservative selection. We note that (4) benefits from warm-start strategies that result in convergence in Step 3 of Algorithm 1 within a few steps.

### Benchmarking iSCOPT: reconstruction efficiency

We also measure the reconstruction efficacy by solving Problem I, as compared to other optimization formulations for sparse covariance estimation. We compare our estimate with: the Alternating Direction Method of Multipliers (ADMM) implementation [Xue, Ma, and Zou2012], and the coordinate descent algorithm [Wang2012].

Table 3 aggregates the experimental results in terms of the normalized distance and the captured sparsity pattern in . Without loss of generality, we fix for the case and, for the case . iSCOPT framework is at least as competitive with the state-of-the-art implementations for sparse covariance estimation. It is evident that the proposed iSCOPT variant, based on self-concordant analysis, is at least one order of magnitude faster than the rest of algorithms under comparison. In terms of reconstruction efficacy, using our proposed scheme, we can achieve marginally better reconstruction performance, as compared to [Xue, Ma, and Zou2012].