Matrix Infinitely Divisible Series: Tail Inequalities and Applications in Optimization

09/04/2018 · Chao Zhang et al. · Dalian University of Technology

In this paper, we study tail inequalities for the largest eigenvalue of a matrix infinitely divisible (i.d.) series, which is a finite sum of fixed matrices weighted by i.d. random variables. We obtain several types of tail inequalities, including Bennett-type and Bernstein-type inequalities. This allows us to further bound the expectation of the spectral norm of a matrix i.d. series. Moreover, by developing a new lower-bound function for $Q(s) = (s+1)\log(s+1) - s$, which appears in the Bennett-type inequality, we derive a tighter tail inequality for the largest eigenvalue of the matrix i.d. series than the Bernstein-type inequality when the matrix dimension is high. The resulting lower-bound function is of independent interest and can improve any Bennett-type concentration inequality that involves the function $Q(s)$. The class of i.d. probability distributions is large and includes Gaussian and Poisson distributions, among many others. Therefore, our results encompass the existing work of Tropp [1] on matrix Gaussian series as a special case. Lastly, we show that the tail inequalities of a matrix i.d. series have applications in several optimization problems, including the chance constrained optimization problem and the quadratic optimization problem with orthogonality constraints.


I Introduction

Random matrices have been widely used in many machine learning and information theory problems, e.g., compressed sensing [2, 3, 4], coding theory [5], kernel methods [6], estimation of covariance matrices [7, 8], and quantum information theory [9, 10, 11]. In particular, sums of random matrices and the tail behavior of their extreme eigenvalues (or singular values) are of significant interest in both theoretical studies and practical applications (cf. [12]). Ahlswede and Winter presented a large-deviation inequality for the extreme eigenvalues of sums of random matrices [13]. Tropp improved upon their results using Lieb's concavity theorem [1]. Hsu et al. provided tail inequalities for sums of random matrices that depend on intrinsic dimensions instead of explicit matrix dimensions [14]. By introducing the concept of effective rank, Minsker extended Bernstein's concentration inequality for random matrices [15] and refined the results in [14]. There have also been many other works on the eigenproblems of random matrices (cf. [16, 17, 18, 19, 20]), and the list provided here is necessarily incomplete.

A simple form of sums of random matrices can be expressed as $\sum_i \xi_i \mathbf{A}_i$, with random variables $\xi_i$ and fixed matrices $\mathbf{A}_i$. This form has played an important role in recent works on neural networks [21], kernel methods [22] and deep learning [23], where the original weight (or projection) matrices can be replaced with structured random matrices, such as circulant and Toeplitz matrices with Gaussian or Bernoulli entries. Note that these two distributions, along with uniform distributions and Rademacher distributions, belong to the family of sub-Gaussian distributions (a random variable $X$ is said to be sub-Gaussian if its moment generating function (mgf) satisfies $\mathbb{E}\, e^{\theta X} \le e^{C\theta^2}$ for all $\theta \in \mathbb{R}$, where $C$ is an absolute constant), and many techniques dedicated to sub-Gaussian random matrices have been developed (e.g., [1, 14]). However, to the best of our knowledge, random matrix research beyond this family is still very limited.
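As a quick numerical illustration of the sub-Gaussian mgf condition just recalled (a sketch for intuition only, not part of the paper's analysis; the constant $C = 1/2$ is an assumption that happens to suffice for the examples below), the following snippet compares empirical mgfs with the envelope $e^{C\theta^2}$:

```python
import numpy as np

# Numerical illustration of the sub-Gaussian mgf condition E[exp(theta*X)] <= exp(C*theta^2).
# C = 1/2 works for the zero-mean examples below (with equality for the standard Gaussian).
rng = np.random.default_rng(0)
thetas = np.linspace(-2.0, 2.0, 9)
C = 0.5

samples = {
    "rademacher": rng.choice([-1.0, 1.0], size=200_000),
    "uniform[-1,1]": rng.uniform(-1.0, 1.0, size=200_000),
    "gaussian": rng.standard_normal(200_000),
}

for name, x in samples.items():
    mgf = np.array([np.exp(t * x).mean() for t in thetas])   # empirical mgf
    envelope = np.exp(C * thetas**2)                          # sub-Gaussian envelope
    print(f"{name:>14s}: max mgf/envelope ratio = {np.max(mgf / envelope):.3f}")
```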

The tail behavior of $\|\sum_i \xi_i \mathbf{A}_i\|$, where $\|\cdot\|$ stands for the spectral norm of a matrix, is strongly related to several optimization problems, including the Procrustes problem and the quadratic assignment problem (cf. [24, 25]). Nemirovski analyzed efficiently computable solutions to these optimization problems [24], and showed that the tail behavior of $\|\sum_i \xi_i \mathbf{A}_i\|$ provides answers to 1) the safe tractable approximation of chance constrained linear matrix inequalities, and 2) the quality of semidefinite relaxations of a general quadratic optimization problem. He also proved a tail bound for $\|\sum_i \xi_i \mathbf{A}_i\|$, where the $\xi_i$ obey either distributions supported on $[-1, 1]$ or Gaussian distributions with unit variance, and presented a conjecture for the “optimal” expression of the tail bound [24]. Anthony So applied the non-commutative Khintchine inequality to resolve Nemirovski's conjecture [25]. Note that the aforementioned results assume that the $\xi_i$ obey distributions supported on $[-1, 1]$ or Gaussian distributions with unit variance. These assumptions will not always be satisfied in practice, and it is advantageous to explore whether these efficiently computable optimization solutions also hold in a broader setting. We answer this question in the affirmative in this paper.

In this work, we study and prove tail bounds for the random matrix $\sum_i \xi_i \mathbf{A}_i$, where the random variables $\xi_i$ obey infinitely divisible (i.d.) distributions. The class of i.d. distributions includes Gaussian distributions, Poisson distributions, stable distributions and compound Poisson distributions as special cases (cf. [26, 27]). In recent years, techniques developed for i.d. distributions have been employed in important applications in the fields of image processing [28] and kernel methods [29]. Note that there is no intersection between sub-Gaussian distributions and i.d. distributions except for Gaussian distributions (cf. Lemma 5.5 of [19]). We therefore believe that our work on random matrices with i.d. coefficients will complement earlier results for sub-Gaussian distributions and provide useful applications in the fields of learning and optimization, and beyond.

I-A Overview of the Main Results

There are three main contributions in this paper: 1) we obtain tail inequalities for the largest eigenvalue of the matrix infinitely divisible (i.d.) series $\sum_i \xi_i \mathbf{A}_i$, where the $\xi_i$ are i.d. random variables and the $\mathbf{A}_i$ are fixed self-adjoint matrices; 2) we construct a piecewise function to bound the function $Q(s) = (1+s)\log(1+s) - s$ from below on any given bounded interval, and the new lower-bound function is the tightest to date; and 3) we show that the tail inequalities of matrix i.d. series provide efficiently computable solutions to several optimization problems.

First, we develop a matrix moment-generating function (mgf) bound for i.d. distributions as the starting point for deriving the subsequent tail inequalities for the matrix i.d. series. Then, we derive the tail inequality given in (III.1) for the matrix i.d. series, which is difficult to compute because of the integral of an inverse function. Therefore, by introducing the additional condition that the Lévy measure has bounded support, we simplify the aforementioned result into a Bennett-type tail inequality [cf. (III.1)] that contains the function $Q(s)$, and we also replace $Q(s)$ with a quadratic lower bound to obtain a Bernstein-type tail inequality [cf. (III.2)] for the matrix i.d. series. In addition, we bound the expectation of the spectral norm of the matrix i.d. series.

Since the quadratic lower bound cannot bound $Q(s)$ from below sufficiently tightly when $s$ is large (cf. Fig. 1), we introduce another function $\phi$ [cf. (16)] that bounds $Q(s)$ from below more tightly on a bounded interval (cf. Remark III.3). Although $\phi$ is a piecewise function, all sub-functions of $\phi$ share the same simple form and thus have a low computational cost, and the subdomains of $\phi$ can be arbitrarily selected as long as the endpoints of the interval are included in the ordered sequence of partition points as its smallest and largest elements, respectively. Based on $\phi$, we obtain another type of tail inequality for matrix i.d. series that is tighter than the Bernstein-type result given in (III.2) when the deviation parameter is large. (In general, a tail inequality describes the probability of the event that a random variable exceeds a given positive constant; consequently, it provides more useful information for large deviations than for small ones.) We show that the tail result based on $\phi$ provides a tighter upper bound on the largest eigenvalue of a matrix i.d. series than is possible with the Bernstein-type result when the matrix dimension is high. The results regarding $\phi$ and the underlying lower-bound construction are applicable to any Bennett-type concentration inequality that involves the function $Q(s)$.

Using the resulting tail bounds for random i.d. series, we study the properties of two optimization problems: chance constrained optimization problems and quadratic optimization problems with orthogonality constraints; the latter covers several well-studied optimization problems as special cases, e.g., the Procrustes problem and the quadratic assignment problem. Although these problems have been exhaustively explored in the works [24, 25], their results are built under the assumption that the random variables involved obey either distributions supported on $[-1, 1]$ or Gaussian distributions with unit variance, which restricts the applicability of the results in practical problems that do not satisfy this assumption. By using the tail inequalities for random i.d. series to resolve an extension of Nemirovski's conjecture (cf. Conjecture IV.1), we show that the results obtained in [24, 25] remain valid in the i.d. scenario, where the random variables obey i.d. distributions instead of distributions supported on $[-1, 1]$ or Gaussian distributions.

The remainder of this paper is organized as follows. Section II introduces necessary preliminaries on i.d. distributions and Section III presents the main results of this paper. In Section IV, we study the application of random i.d. series in a number of optimization problems. Section V concludes the paper. In the appendix, we provide a detailed introduction to the Lévy measure (part A) and prove the main results of this paper (part B).

II Preliminaries on Infinitely Divisible Distributions

In this section, we first introduce several definitions related to infinitely divisible (i.d.) distributions and then present the matrix mgf inequality for i.d. distributions.

II-A Infinitely Divisible Distributions

A random variable $X$ has an i.d. distribution if for any $n \in \mathbb{N}$, there exists a sequence of independent and identically distributed (i.i.d.) random variables $X_1, \dots, X_n$ such that $X_1 + \cdots + X_n$ has the same distribution as $X$. Equivalently, i.d. distributions can be defined by means of a characteristic exponent, as follows.

Definition II.1

Let $\psi_X(t) := \log \mathbb{E}\, e^{itX}$, $t \in \mathbb{R}$, be the characteristic exponent of a random variable $X$. The distribution of $X$ is said to be i.d. if for any $n \in \mathbb{N}$, there exists a characteristic exponent $\psi_n$ such that $\psi_X(t) = n\, \psi_n(t)$ for all $t$.
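For intuition, here is a small Monte Carlo sketch (not from the paper) of the first definition above in the Poisson case: a Poisson($\lambda$) variable has the same distribution as the sum of $n$ i.i.d. Poisson($\lambda/n$) variables for every $n$.

```python
import numpy as np

# Infinite divisibility of the Poisson distribution: Poisson(lam) equals in
# distribution the sum of n i.i.d. Poisson(lam/n) variables, for any n.
rng = np.random.default_rng(1)
lam, n, size = 3.0, 7, 500_000

direct = rng.poisson(lam, size=size)
divided = rng.poisson(lam / n, size=(n, size)).sum(axis=0)

# Compare empirical probability mass functions on a common support.
support = np.arange(0, 15)
pmf_direct = np.array([(direct == k).mean() for k in support])
pmf_divided = np.array([(divided == k).mean() for k in support])
print("max PMF discrepancy:", np.abs(pmf_direct - pmf_divided).max())  # on the order of 1e-3
```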

Now, we need to introduce the concept of the Lévy measure.

Definition II.2 (Lévy Measure)

A Borel measure $\nu$ defined on $\mathbb{R}$ is said to be a Lévy measure if it satisfies $\nu(\{0\}) = 0$ and $\int_{\mathbb{R}} \min\{1, x^2\}\, \nu(\mathrm{d}x) < \infty$.

(1)

The Lévy measure describes the expected number of jumps of a certain height in a time interval of unit length; a more detailed explanation is given in Appendix A. The following theorem provides a necessary and sufficient condition for a distribution to be i.d.:

Theorem II.1 (Lévy-Khintchine Theorem)

A real-valued random variable $X$ is i.d. if and only if there exists a triplet $(b, \sigma^2, \nu)$ such that for any $t \in \mathbb{R}$, the characteristic exponent is of the form

$$\psi_X(t) = i b t - \frac{\sigma^2 t^2}{2} + \int_{\mathbb{R}} \big(e^{itx} - 1 - itx\, \mathbf{1}_{\{|x| \le 1\}}\big)\, \nu(\mathrm{d}x),$$

(2)

where $b \in \mathbb{R}$, $\sigma^2 \ge 0$, and $\nu$ is a Lévy measure.

This theorem states that an i.d. distribution can be characterized by the generating triplet $(b, \sigma^2, \nu)$. Refer to [26, 27] for details.
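As a sanity check of the Lévy–Khintchine representation (a sketch with an assumed triplet, using the truncation convention written in (2) above; none of the numerical values come from the paper), the snippet below compares the empirical characteristic function of $X = b + \sigma Z + 2N$, with $Z$ standard normal and $N \sim \mathrm{Poisson}(\lambda)$, against the closed form $\exp\!\big(ibt - \sigma^2 t^2/2 + \lambda(e^{2it} - 1)\big)$; since all jumps have size $2 > 1$, the compensator term vanishes.

```python
import numpy as np

# Levy-Khintchine check for an assumed triplet: drift b, Gaussian part sigma^2,
# and Levy measure nu = lam * delta_2 (jumps of size 2, so no compensation needed).
rng = np.random.default_rng(2)
b, sigma, lam, size = 0.5, 1.2, 0.8, 400_000

X = b + sigma * rng.standard_normal(size) + 2.0 * rng.poisson(lam, size=size)

ts = np.linspace(-2.0, 2.0, 9)
empirical = np.array([np.exp(1j * t * X).mean() for t in ts])
closed_form = np.exp(1j * b * ts - 0.5 * sigma**2 * ts**2 + lam * (np.exp(2j * ts) - 1.0))
print("max |empirical - closed form|:", np.abs(empirical - closed_form).max())  # small
```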

II-B Matrix Inequalities for Infinitely Divisible Distributions

Let the symbol $\preceq$ denote the semidefinite order on self-adjoint matrices. For any real functions $f$ and $g$, the transfer rule states that if $f(a) \le g(a)$ for any $a \in I$, then $f(\mathbf{A}) \preceq g(\mathbf{A})$ when the eigenvalues of the self-adjoint matrix $\mathbf{A}$ lie in $I$. Below, we present the matrix mgf bound for i.d. distributions as the starting point for deriving the desired tail results for matrix i.d. series.

Lemma II.1

Let $x$ be an i.d. random variable with the generating triplet $(b, \sigma^2, \nu)$ whose mgf exists on the relevant range. Given a fixed self-adjoint matrix $\mathbf{A}$, it holds that for any admissible $\theta > 0$,

(3)

where $\lambda_{\max}(\cdot)$ stands for the largest eigenvalue, and

(4)

The proof of this lemma is given in Appendix B-A. Note that if the Lévy measure $\nu$ is the zero measure, then the mgf result given in (3) reduces to a bound analogous to the corresponding mgf result when $x$ is Gaussian (cf. Lemma 4.3 of [1]).

III Tail Inequalities for Matrix Infinitely Divisible Series

In this section, we first present two types of tail inequalities for matrix i.d. series: Bennett-type and Bernstein-type inequalities. By analyzing the characteristics of the function $Q(s) = (1+s)\log(1+s) - s$ that appears in the Bennett-type result, we introduce a piecewise function to bound $Q(s)$ from below and thus obtain a new tail inequality for matrix i.d. series. We also study the upper bound of the expectation $\mathbb{E}\big\|\sum_i \xi_i \mathbf{A}_i\big\|$.

III-A Tail Inequalities for Matrix Infinitely Divisible Series

By using the matrix mgf bound (3), we first obtain the tail inequality for the matrix i.d. series $\sum_i \xi_i \mathbf{A}_i$:

Theorem III.1

Let $\mathbf{A}_1, \dots, \mathbf{A}_N$ be fixed $d$-dimensional self-adjoint matrices, and let $\xi_1, \dots, \xi_N$ be independent centered i.d. random variables with the generating triplet $(b, \sigma^2, \nu)$ satisfying the conditions of Lemma II.1. Define the variance parameter of the series as the largest eigenvalue of $\sum_{i=1}^N \mathbf{A}_i^2$. Then for all $t > 0$, we have

(5)

where the left limit and the inverse function appearing in the bound are those of the rate function defined in the proof.

The proof of this theorem is given in Appendix B-B.

Remark III.1

Since the matrices $\mathbf{A}_i$ ($i = 1, \dots, N$) are self-adjoint, the matrix $\sum_{i=1}^N \mathbf{A}_i^2$ is self-adjoint and positive semidefinite. Therefore, its largest eigenvalue is non-negative and the above result is non-trivial.

Considering the difficulties that arise in computing the function appearing in (5) and its inverse, we introduce the additional condition that the Lévy measure $\nu$ has bounded support to simplify the above result, which leads to the following corollary.

Corollary III.1

If the Lévy measure $\nu$ has bounded support, then for any $t > 0$,

(6)

where the remaining quantities are defined by

(7)

The proof of this corollary is given in Appendix B-C.

Roughly speaking, the condition that $\nu$ has bounded support means that arbitrarily large jumps cannot occur on the path of the Lévy process generated from the i.d. distribution with triplet $(b, \sigma^2, \nu)$. Refer to Appendix A for an explanation of this condition.

Note that the tail inequality (III.1) is similar in form to the matrix Bennett result (cf. Theorem 6.1 of [1]). Following the classical method of bounding $Q(s)$ from below, the Bernstein-type result can be derived based on the fact that

(8)

where

(9)

As shown in Fig. 1, the function in (9) can tightly bound $Q(s)$ from below when $s$ is close to the origin, whereas there is a large discrepancy between it and $Q(s)$ when $s$ is far from the origin. This is because (9) is derived from the Taylor expansion of $Q$ at the origin (cf. Chapter 2.7 of [30]). To facilitate the analysis, the function in (9) is further relaxed to a looser piecewise lower-bound function consisting of a quadratic sub-function near the origin and a linear sub-function away from it. Although this relaxed function does not bound $Q(s)$ sufficiently tightly, the result presented in (15) below shows that it captures the same rate of growth as $Q(s)$ when $s$ is close to the origin or approaches infinity. This phenomenon suggests that the coefficients of its two sub-functions are probably not sufficiently well-tuned.

Fig. 1: The function curves of $Q(s)$ and its lower-bound functions (panels (a) and (b)).
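To make the gap between $Q(s)$ and its Taylor-based relaxation concrete (a small numerical sketch; the expression $s^2/(2 + 2s/3)$ is the classical Bernstein-type lower bound and is assumed here in place of the exact form of (9)), the following snippet evaluates both functions on a range of $s$:

```python
import numpy as np

# Q(s) = (1+s)log(1+s) - s versus the classical Bernstein-type lower bound
# s^2 / (2 + 2s/3): tight near the origin, increasingly loose for large s.
def Q(s):
    return (1.0 + s) * np.log1p(s) - s

def bernstein_lb(s):
    return s**2 / (2.0 + 2.0 * s / 3.0)

for s in [0.1, 1.0, 5.0, 20.0, 100.0]:
    q, lb = Q(s), bernstein_lb(s)
    print(f"s = {s:6.1f}: Q(s) = {q:10.3f}, lower bound = {lb:10.3f}, ratio = {lb / q:.3f}")
```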
Corollary III.2

Let $\xi_1, \dots, \xi_N$ be independent i.d. random variables satisfying the conditions in Corollary III.1. Then for any $t > 0$,

(10)

This corollary shows that the probability of the event $\big\{\lambda_{\max}\big(\sum_i \xi_i \mathbf{A}_i\big) \ge t\big\}$ is bounded by a sub-exponential term when $t$ is large and that its upper bound is of sub-Gaussian form when $t$ is small.

Recalling Inequality (4.9) of [1], the expectation $\mathbb{E}\big\|\sum_i \gamma_i \mathbf{A}_i\big\|$ for a matrix Gaussian series (with standard Gaussian coefficients $\gamma_i$) is bounded by a term of order $\sqrt{\log d}$. In a similar way, we use the tail bound presented in (III.2) to obtain an upper bound on $\mathbb{E}\big\|\sum_i \xi_i \mathbf{A}_i\big\|$ for a random i.d. series.

Theorem III.2

Let $\xi_1, \dots, \xi_N$ be independent i.d. random variables satisfying the conditions in Corollary III.1. Then

(11)

Because of the existence of the Lévy measure $\nu$, the upper bound on $\mathbb{E}\big\|\sum_i \xi_i \mathbf{A}_i\big\|$ for a random i.d. series is larger than the Gaussian bound of order $\sqrt{\log d}$. Recalling the Lévy–Itô decomposition (cf. [27]), the higher expectation bound for a matrix i.d. series arises from the compound Poisson (with drift) components of the i.d. distribution.

Remark III.2

Note that the aforementioned tail results for matrix i.d. series can be generalized to the scenario of sums of independent i.d. random matrices, all of whose entries are i.d. random variables with the generating triplet $(b, \sigma^2, \nu)$. As a starting point, we first obtain the mgf bound for a self-adjoint i.d. random matrix satisfying the analogous conditions:

(12)

which can be proven in a manner similar to Lemma II.1. We then arrive at upper bounds on the largest eigenvalue and on the expected spectral norm of the sum with the same forms as those of the proposed results for matrix i.d. series, except that the variance term is replaced by its matrix counterpart [cf. (III.1), (III.2), (17) and (B-D)]. These results can also be regarded as an extension of the existing vector-version results (cf. [31, 32]).

III-B A Lower-Bound Function of $Q(s)$

As discussed above, both the function in (9) and its piecewise relaxation are lower-bound functions for $Q(s)$, but they do not bound $Q(s)$ sufficiently tightly when $s$ is far from the origin (cf. Fig. 1) because they stem from the Taylor expansion at the origin. We adopt a more direct strategy to analyze the behavior of the function $Q(s)$; for earlier discussions on this topic, refer to [33, 34].

We consider the following inequality:

(13)

where the parameter is expected to be a constant independent of $s$ such that the right-hand side bounds $Q(s)$ from below as tightly as possible. For any admissible choice of the parameter, define

(14)

Then, it follows from L’Hôpital’s rule that

(15)

The two limits in (15) suggest that a lower bound of the form in (13) indeed captures the rate of growth of the function $Q(s)$ as $s$ approaches either the origin or infinity.

Now, we must choose the parameter in (13). As shown in Fig. 2, the function in (14) is sensitive to this choice, and its value varies dramatically near the critical point if the parameter is not chosen well. Therefore, we should select the parameter such that the variation near that point is kept as small as possible, i.e., such that the discrepancy between the candidate bound and $Q(s)$ is minimized. The following lemma is also derived from L’Hôpital’s rule:

Lemma III.1

Let . Then,

This lemma shows that with the parameter choice identified in Lemma III.1, the point in question is a removable discontinuity of the function in (14); in other words, if we add a supplementary definition of the function at that point, the resulting function is continuous on its domain. Therefore, the parameter should be selected accordingly.

Fig. 2: The function curves of (14) w.r.t. different parameter settings.

By using the function in (14), we can develop another lower-bound function for $Q(s)$ as follows.

Proposition III.1

Given an arbitrary positive constant $\Gamma$ and an integer $n$, let $0 = s_0 < s_1 < \cdots < s_n = \Gamma$ be an ordered sequence, and define

(16)

where the coefficients are determined by the values of $Q$ at the partition points $s_k$ ($k = 0, 1, \dots, n$). Then, for all $s \in [0, \Gamma]$, we have $\phi(s) \le Q(s)$, with equality at the points specified in (16).

As suggested by this result, a piecewise function to bound $Q(s)$ from below on a bounded domain can be built by means of the following steps (a generic numerical sketch is given after the list):

  1. Select a positive constant $\Gamma$ to form the interval $[0, \Gamma]$.

  2. Select an integer $n$ and an ordered sequence $0 = s_0 < s_1 < \cdots < s_n = \Gamma$.

  3. On each subinterval $[s_{k-1}, s_k]$, define the corresponding sub-function of $\phi$ as specified in (16), using the values of $Q$ at the partition points.
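The exact sub-functions of $\phi$ are specified in (16); as a generic illustration of the construction pattern only (not the authors' $\phi$), the sketch below builds a piecewise lower bound of the convex function $Q$ on $[0, \Gamma]$ by using, on each subinterval, the tangent line of $Q$ at the left partition point, and verifies numerically that the result stays below $Q$:

```python
import numpy as np

GAMMA = 30.0                                          # right endpoint of the bounded domain
knots = np.array([0.0, 1.0, 3.0, 8.0, 15.0, GAMMA])   # ordered partition points s_0 < ... < s_n

def Q(s):
    return (1.0 + s) * np.log1p(s) - s

def dQ(s):
    return np.log1p(s)            # Q'(s) = log(1 + s)

def piecewise_lower_bound(s):
    # On the subinterval containing s, use the tangent of Q at the left partition
    # point. Convexity of Q guarantees that every tangent lies below Q, so this
    # piecewise function is a valid lower bound on [0, GAMMA] that touches Q at
    # the partition points s_0, ..., s_{n-1}. (Illustration only, not Eq. (16).)
    s = np.asarray(s, dtype=float)
    idx = np.clip(np.searchsorted(knots, s, side="right") - 1, 0, len(knots) - 2)
    left = knots[idx]
    return Q(left) + dQ(left) * (s - left)

grid = np.linspace(0.0, GAMMA, 2001)
gap = Q(grid) - piecewise_lower_bound(grid)
assert np.all(gap >= -1e-12)                          # never exceeds Q on [0, GAMMA]
print("largest gap to Q on [0, GAMMA]:", gap.max())
```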

The resulting function $\phi$ has the following characteristics:

  • There is no additional restriction on the choice of the constant $\Gamma$, the integer $n$ and the partition points other than the ordering requirement. This means that suitable parameters can be chosen in accordance with the requirements of various practical problems.

  • Although $\phi$ is a piecewise function, all parts of $\phi$ share the same coefficient, and the remaining parameters are the values of the function $Q$ at the partition points $s_k$ ($k = 0, 1, \dots, n$). Therefore, the computation of $\phi$ has a low cost.

  • For any choice of partition, the piecewise function $\phi$ has the same form on the first subinterval. In particular, the coarsest choice (i.e., $\phi$ with $n = 1$) is a continuous function on $[0, \Gamma]$, and the difference between this choice and any other is not significant (cf. Fig. 3). Hence, it can be adopted as the lower-bound function for $Q(s)$ if there are no additional requirements on the ordered sequence.

Fig. 3: The function curves of $\phi$ w.r.t. different partition settings. Although a finer partition brings $\phi$ closer to $Q$, the resulting curve is not continuous, and the discrepancy between the two settings is not significant.
Fig. 4: The function curves of $Q$, the Taylor-based lower bounds, and $\phi$ (panels (a) and (b)). The curves of $\phi$ and the Taylor-based bound intersect at approximately one point, and $\phi$ is closer to $Q$ beyond it.
Remark III.3

As shown in Fig. 4, the lower-bound function $\phi$ performs better than the Taylor-expansion-based function when $s$ is large; moreover, although the Taylor-based function bounds $Q$ more tightly than $\phi$ does when $s$ is small, there is only a slight discrepancy between the two on that interval. (The crossover range is obtained as the numerical solution to the inequality comparing the two lower bounds.) As a result, the method of bounding $Q(s)$ proposed in (13) is not only effective but also corrects the shortcoming of the Taylor-expansion-based method (8), i.e., the local approximation at the origin.

By recalling the tail inequality (III.1) and replacing the function $Q$ with $\phi$, we obtain, for any $t > 0$,

(17)

where the quantities are as in Corollary III.1. As shown in Fig. 5, the above result provides a bound that is tighter than the one achieved by the Bernstein-type result in (III.2) when $t$ is large, and is only slightly looser than the Bernstein-type bound when $t$ is small.

Fig. 5: The curves of the tail bounds based on $Q$, the Taylor-based relaxations, and $\phi$, where, for simplicity, the parameters are set to representative values.
Remark III.4

Since the function $\phi$ is defined on the bounded interval $[0, \Gamma]$, the result given in (17) cannot be used to analyze the asymptotic behavior of the tail as $t$ goes to infinity. However, since $\phi$ bounds $Q$ from below more tightly than the Taylor-based functions do on that bounded domain, the result given in (17) provides a more accurate description of the non-asymptotic behavior of $\lambda_{\max}\big(\sum_i \xi_i \mathbf{A}_i\big)$ in the corresponding range of $t$. The following alternative expressions for the Bernstein-type result given in (III.2) and the $\phi$-based result given in (17) can respectively be obtained: with probability at least $1 - \delta$,

(18)

and its $\phi$-based counterpart obtained from (17). These expressions suggest that $\lambda_{\max}\big(\sum_i \xi_i \mathbf{A}_i\big)$ is bounded by a term with a milder dependence on the matrix dimension $d$, which is a tighter bound than the right-hand side of the Bernstein-type result (18) when the matrix dimension is high.

IV Applications in Optimization

In this section, we will show that the derived tail inequalities for random i.d. series can be used to solve two types of optimization problems: chance constrained optimization problems and quadratic optimization problems with orthogonality constraints. These optimization problems are reviewed in Section IV-A, and Nemirovski’s conjecture [24] for efficiently computable solutions to these two optimization problems is introduced. We argue that the requirement in Nemirovski’s conjecture is not practical, generalize the requirement using matrix i.d. series, and provide a solution to the extended version of Nemirovski’s conjecture in Section IV-B. Lastly, we re-derive efficiently computable solutions to both types of optimization problems with a matrix i.d. series requirement in Section IV-C.

IV-A Relevant Optimization Problems

It has been pointed out in the pioneering work [24] that the behavior of $\big\|\sum_i \xi_i \mathbf{A}_i\big\|$ is strongly related to efficiently computable solutions to many optimization problems, e.g., the chance constrained optimization problem and the quadratic optimization problem with orthogonality constraints. Several well-studied optimization problems are included in the latter as special cases, such as the Procrustes problem and the quadratic assignment problem. We begin with a brief introduction of these optimization problems.

IV-A1 Chance Constrained Optimization Problem

Consider the following chance constrained optimization problem (cf. [25]): given an $m$-dimensional vector and a tolerance $\epsilon \in (0, 1)$, find

(19)

where the first mapping is an efficiently computable vector-valued function with convex components; the matrix-valued mappings are affine functions taking values in the space of symmetric matrices; and the random variables are independent with zero mean. The main challenge in solving this optimization problem lies in the chance constraint (19-b).
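To see why the chance constraint is hard to handle directly (a toy Monte Carlo sketch with hypothetical matrices, not the actual problem data), note that even for a fixed decision the violation probability can only be estimated by sampling, which motivates a deterministic sufficient condition such as the one introduced below:

```python
import numpy as np

# For a fixed decision x, the chance constraint asks for the probability that the
# random matrix A_0(x) + sum_i xi_i A_i(x) stays positive semidefinite. For fixed
# (hypothetical) matrices, this probability can only be estimated, e.g. by Monte Carlo.
rng = np.random.default_rng(6)
d, N, trials = 8, 5, 20_000

A0 = 3.0 * np.eye(d)                                            # "nominal" PSD part
As = [0.5 * (M + M.T) for M in rng.standard_normal((N, d, d))]  # symmetric perturbations

violations = 0
for _ in range(trials):
    xi = rng.standard_normal(N)                                 # zero-mean weights
    S = A0 + sum(x * A for x, A in zip(xi, As))
    if np.linalg.eigvalsh(S).min() < 0.0:
        violations += 1
print("estimated violation probability:", violations / trials)
```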

By introducing suitable shorthand notation, we have

It is subsequently necessary to find a sufficient condition for the inequality

(20)

and to guarantee that this condition is efficiently computable within the optimization. For example, So proposed the following condition [25]:

(21)

By using the Schur complement, it can be equivalently expressed as a linear matrix inequality:

(22)

If the constraint (19-b) is replaced with the inequality (22), the chance-constrained optimization problem will become tractable. To guarantee the validity of this replacement, it is necessary to consider the following problem:

(P1)

Is the condition (21) sufficient for the inequality (20)?
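As a generic numerical check of the Schur-complement step used to pass from (21) to the LMI (22) (random matrices for illustration only; the actual blocks are those of [25]), the snippet below confirms that a symmetric block matrix with a positive definite lower-right block is positive semidefinite exactly when the corresponding Schur complement is:

```python
import numpy as np

# Schur complement: for symmetric blocks with C positive definite,
#   M = [[A, B], [B.T, C]] is PSD   <=>   A - B C^{-1} B.T is PSD.
rng = np.random.default_rng(3)
n = 5

def min_eig(S):
    return np.linalg.eigvalsh(S).min()

for _ in range(200):
    B = rng.standard_normal((n, n))
    G = rng.standard_normal((n, n))
    C = G @ G.T + 0.5 * np.eye(n)                               # symmetric positive definite
    A = rng.standard_normal((n, n))
    A = 0.5 * (A + A.T) + rng.uniform(-2.0, 6.0) * np.eye(n)    # symmetric, sometimes PSD

    M = np.block([[A, B], [B.T, C]])
    schur = A - B @ np.linalg.solve(C, B.T)
    tol = 1e-9
    assert (min_eig(M) >= -tol) == (min_eig(schur) >= -tol)
print("Schur-complement equivalence held in all 200 random trials.")
```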

IV-A2 Quadratic Optimization Problems with Orthogonality Constraints

Let $\mathcal{M}$ be the space of real matrices equipped with the trace inner product $\langle \mathbf{A}, \mathbf{B} \rangle := \operatorname{tr}(\mathbf{A}^{T}\mathbf{B})$. Consider the following quadratic optimization problem:

(23)

where the mappings in the objective are self-adjoint linear mappings (note that they can be represented as symmetric matrices), some of which are positive semidefinite; the remaining mapping is linear (and can also be represented by symmetric matrices); and the constraint involves the spectral norm of the decision matrix. As addressed in [24], this optimization problem covers many well-studied optimization problems with orthogonality constraints as special cases, e.g., the Procrustes problem and the quadratic assignment problem. By exploiting the structure of these problems, the orthogonality constraint can be relaxed to the constraint (IV-A2-c) without loss of generality.

The optimization problem can be directly tackled by using the semidefinite programming (SDP) relaxation:

(24)

where the optimization is over the space of symmetric matrices; the data matrices are the symmetric matrices corresponding to the self-adjoint linear mappings in (23); and the additional linear mappings encode the constraints of (23). Refer to Section 3.1.1 of [25] for the details of these notations.

By using the ellipsoid method, the optimization problem (IV-A2) can be solved to within an additive error in polynomial time. That is, if we denote the optimal value of (IV-A2), then for any prescribed additive tolerance the ellipsoid method can be used to obtain, in polynomial time, a solution that is feasible for (IV-A2) and whose objective value is within that tolerance of the optimal value.
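In practice, such SDP relaxations are usually handed to an off-the-shelf conic solver rather than implemented via the ellipsoid method; the toy sketch below (hypothetical data, not the exact relaxation (24); it requires cvxpy with an SDP-capable solver such as SCS) solves a small trace-objective SDP with a positive semidefinite matrix variable and linear constraints:

```python
import numpy as np
import cvxpy as cp

# A toy SDP of the same flavor as the relaxation (24): maximize a trace objective
# over positive semidefinite matrices subject to linear constraints.
rng = np.random.default_rng(4)
n = 6
C = rng.standard_normal((n, n))
C = 0.5 * (C + C.T)                       # symmetric cost matrix (hypothetical data)

Y = cp.Variable((n, n), symmetric=True)   # stands in for the lifted matrix variable
constraints = [Y >> 0, cp.trace(Y) == 1, cp.diag(Y) <= 0.5]
problem = cp.Problem(cp.Maximize(cp.trace(C @ Y)), constraints)
problem.solve(solver=cp.SCS)

print("optimal value:", problem.value)
print("smallest eigenvalue of Y*:", np.linalg.eigvalsh(Y.value).min())  # >= -solver tolerance
```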

The solution to the optimization problem (IV-A2) can be achieved by using the SDP solution along with a degree of randomness. Since the SDP solution is positive semidefinite, there exists a positive semidefinite matrix whose square equals it; being symmetric, the SDP solution has a spectral decomposition with an orthogonal eigenvector matrix and a diagonal eigenvalue matrix. Let the random input be a vector whose entries are i.i.d. with zero mean and unit variance. The candidate solution is ultimately obtained by applying the square-root factor to this random vector. Alternatively, it can be expressed as

(25)

where the weights are determined by the random entries and by the column vectors of the orthogonal factor. To explore the quality of this randomized solution, the following problem should be considered:

(P2)

Does the randomized candidate act as a high-quality solution to the optimization problem (IV-A2) with a reasonable (at least bounded away from zero) probability?
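To make the randomized rounding step described above concrete (a minimal sketch with a hypothetical positive semidefinite matrix standing in for the SDP solution; the feasibility post-processing of the actual Procrustes or assignment setting is omitted), the snippet below draws the random vector, applies the square root of the matrix through its spectral decomposition, and checks that the rounded candidates reproduce the matrix in expectation:

```python
import numpy as np

# Randomized rounding from an SDP solution Y* (hypothetical PSD matrix here):
# draw xi with i.i.d. zero-mean unit-variance entries and set x = V sqrt(Lambda) xi,
# so that E[x x^T] = Y*.
rng = np.random.default_rng(5)
n = 6
G = rng.standard_normal((n, n))
Y_star = G @ G.T / n                         # stand-in for the SDP solution (PSD)

eigvals, V = np.linalg.eigh(Y_star)          # spectral decomposition Y* = V Lambda V^T
half = V @ np.diag(np.sqrt(np.clip(eigvals, 0.0, None)))   # square-root factor

num_draws = 200_000
xi = rng.choice([-1.0, 1.0], size=(n, num_draws))          # zero mean, unit variance
candidates = half @ xi                                      # each column is one rounded x

empirical_cov = candidates @ candidates.T / num_draws
print("||E[x x^T] - Y*||_F ~", np.linalg.norm(empirical_cov - Y_star))  # small
```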

IV-B An Extension of Nemirovski's Conjecture

Nemirovski [24] pointed out that the aforementioned two problems P1 and P2 can be reduced to a question about the behavior of the upper bound of $\big\|\sum_i \xi_i \mathbf{A}_i\big\|$, and that the “optimal” answer to this question can be achieved by resolving the following conjecture:

Conjecture IV.1

([24, 25]) Let $\xi_1, \dots, \xi_N$ be i.i.d. random variables with zero mean, each of which obeys either a distribution supported on $[-1, 1]$ or a Gaussian distribution with unit variance. Let $\mathbf{A}_1, \dots, \mathbf{A}_N$ be arbitrary matrices satisfying

Then, whenever $t$ exceeds the conjectured threshold, we have

(26)

where the two quantities appearing in (26) are absolute constants.

Nemirovski [24] showed that the inequality (26) holds for a certain (suboptimal) choice of the threshold, while there is a gap between this value and the conjectured one. Anthony So used a non-commutative Khintchine inequality to show that, under an improved threshold condition and for any admissible choice of the matrices (cf. [25]),

(27)

Note that these results are built under the assumption that the $\xi_i$ obey either Gaussian distributions or distributions supported on $[-1, 1]$. However, this assumption will not always be satisfied in practice. Therefore, we extend the conjecture to the i.d. scenario, i.e., we ask whether the inequality (26) is still valid when the $\xi_i$ are independent i.d. random variables with zero mean and unit variance. The following theorem provides a solution to this extended conjecture.