I Introduction
Random matrices have been widely used in many machine learning and information theory problems,
e.g., compressed sensing [2, 3, 4], coding theory [5], kernel method [6], estimation of covariance matrices
[7, 8], and quantum information theory [9, 10, 11]. In particular, sums of random matrices and the tail behavior of their extreme eigenvalues (or singular values) are of significant interest in theoretical studies and practical applications (cf. [12]). Ahlswede and Winter presented a large-deviation inequality for the extreme eigenvalues of sums of random matrices [13]. Tropp improved upon their results using Lieb's concavity theorem [1]. Hsu et al. provided tail inequalities for sums of random matrices that depend on intrinsic dimensions instead of explicit matrix dimensions [14]. By introducing the concept of effective rank, Minsker extended Bernstein's concentration inequality for random matrices [15] and refined the results in [14]. There have also been many other works on the eigenproblems of random matrices (cf. [16, 17, 18, 19, 20]), and the list provided here is incomplete.

A simple form of sums of random matrices can be expressed as $\sum_{i=1}^N \xi_i A_i$, with random variables $\xi_i$ and fixed matrices $A_i$. This form has played an important role in recent works on neural networks [21], kernel methods [22] and deep learning [23], where the original weight (or projection) matrices can be replaced with structured random matrices, such as circulant and Toeplitz matrices with Gaussian or Bernoulli entries. Note that these two distributions, along with uniform and Rademacher distributions, belong to the family of sub-Gaussian distributions^1, and many techniques dedicated to sub-Gaussian random matrices have been developed (e.g., [1, 14]). However, to the best of our knowledge, random matrix research beyond this family is still very limited.

^1 A random variable $x$ is said to be sub-Gaussian if its moment-generating function (mgf) satisfies $\mathbb{E}\, e^{\theta x} \le e^{C \theta^2}$ for all $\theta \in \mathbb{R}$, where $C$ is an absolute constant.

The tail behavior of $\|\sum_{i=1}^N \xi_i A_i\|$, where $\|\cdot\|$ stands for the spectral norm of a matrix, is strongly related to several optimization problems, including the Procrustes problem and the quadratic assignment problem (cf. [24, 25]). Nemirovski analyzed efficiently computable solutions to these optimization problems [24] and showed that the tail behavior of $\|\sum_i \xi_i A_i\|$ provides answers to 1) the safe tractable approximation of chance constrained linear matrix inequalities, and 2) the quality of semidefinite relaxations of a general quadratic optimization problem. He also proved a tail bound for $\|\sum_i \xi_i A_i\|$, where the $\xi_i$ obey either distributions supported on $[-1, 1]$ or Gaussian distributions with unit variance, and presented a conjecture for the "optimal" expression of the tail bound [24]. Anthony So applied the noncommutative Khintchine inequality to achieve a solution to Nemirovski's conjecture [25]. Note that the aforementioned results assume that the $\xi_i$ obey distributions supported on $[-1, 1]$ or Gaussian distributions with unit variance. These assumptions will not always be satisfied in practice, and it is advantageous to explore whether these efficiently computable optimization solutions also hold in a broader setting. We answer this question in the affirmative in this paper.
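The sub-Gaussian mgf condition in the footnote above can be checked numerically for the classical Rademacher example; the constant $C = 1/2$ below is the standard choice for Rademacher variables and is our illustrative assumption, not a quantity taken from this paper:

```python
import numpy as np

# mgf of a Rademacher variable (+1 or -1, each with probability 1/2)
theta = np.linspace(-5.0, 5.0, 101)
mgf = np.cosh(theta)

# sub-Gaussian bound exp(C * theta^2) with the assumed constant C = 1/2
bound = np.exp(0.5 * theta**2)

assert np.all(mgf <= bound + 1e-12)   # cosh(t) <= exp(t^2 / 2) for all t
```

The same check fails for heavy-tailed laws, whose mgfs (when they exist) grow faster than any $e^{C\theta^2}$.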
In this work, we study and prove tail bounds for the random matrix series $\sum_{i=1}^N \xi_i A_i$, where the random variables $\xi_i$ follow infinitely divisible distributions. The class of infinitely divisible (i.d.) distributions includes Gaussian distributions, Poisson distributions, stable distributions and compound Poisson distributions as special cases (cf. [26, 27]). In recent years, techniques developed for i.d. distributions have been employed in important applications in the fields of image processing [28] and kernel methods [29]. Note that there is no intersection between sub-Gaussian distributions and i.d. distributions except for Gaussian distributions (cf. Lemma 5.5 of [19]). We therefore believe that our work on random matrix series with i.d. coefficients will complement earlier results for sub-Gaussian distributions and provide useful applications in the fields of learning and optimization, and beyond.
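As a concrete illustration of the object studied here, the following sketch builds a small matrix series with centered Poisson coefficients (the Poisson law is i.d.); the dimensions, the symmetrized Gaussian choice of the fixed matrices, and the Poisson parameter are all illustrative assumptions, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 5, 20

# Fixed self-adjoint matrices A_i (symmetrized Gaussian draws, purely
# illustrative; the paper's A_i are arbitrary fixed self-adjoint matrices)
A = [(M + M.T) / 2 for M in rng.standard_normal((N, d, d))]

# Centered i.d. coefficients: Poisson(1) minus its mean, one concrete
# choice of an infinitely divisible distribution
xi = rng.poisson(1.0, size=N) - 1.0

S = sum(x * Ai for x, Ai in zip(xi, A))   # the matrix i.d. series
lam_max = np.linalg.eigvalsh(S)[-1]       # largest eigenvalue
spec_norm = np.linalg.norm(S, 2)          # spectral norm ||S||
```

The quantities `lam_max` and `spec_norm` are exactly the random objects whose tails the paper bounds.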
I-A Overview of the Main Results
There are three main contributions of this paper: 1) we obtain tail inequalities for the largest eigenvalue of the matrix infinitely divisible (i.d.) series $\sum_{i=1}^N \xi_i A_i$, where the $\xi_i$ are i.d. random variables; 2) we construct a piecewise function that bounds the Bennett function $Q(s)$ from below on any given bounded interval, and this new lower-bound function is the tightest to date; and 3) we show that the tail inequalities for matrix i.d. series provide efficiently computable solutions to several optimization problems.
First, we develop a matrix moment-generating function (mgf) bound for i.d. distributions as the starting point for deriving the subsequent tail inequalities for the matrix i.d. series. Then, we derive the tail inequality given in (III.1) for the matrix i.d. series, which is difficult to compute because of the integral of an inverse function. Therefore, by introducing the additional condition that the Lévy measure has bounded support, we simplify the aforementioned result into a Bennett-type tail inequality [cf. (III.1)] that contains the function $Q(s)$, and we then replace $Q(s)$ with its classical lower bound to obtain a Bernstein-type tail inequality [cf. (III.2)] for the matrix i.d. series. In addition, we bound the expectation of the spectral norm of the matrix i.d. series.
Since the Bernstein-type lower bound cannot bound $Q(s)$ from below sufficiently tightly when $s$ is large (cf. Fig. 1), we introduce another function [cf. (16)] that bounds $Q(s)$ from below more tightly on any bounded interval (cf. Remark III.3). Although the new bound is a piecewise function, all of its subfunctions share a simple common form and thus have a low computational cost, and its subdomains can be selected arbitrarily as long as the endpoints of the interval appear in the ordered sequence as the smallest and largest elements, respectively. Based on this function (especially its one-piece version), we obtain another type of tail inequality for matrix i.d. series that is tighter than the Bernstein-type result given in (III.2) when the deviation parameter $t$ is large.^2 We show that the tail result based on the new lower bound provides a tighter upper bound on the largest eigenvalue of a matrix i.d. series than is possible with the Bernstein-type result when the matrix dimension is high. The results regarding the new lower-bound functions are applicable to any Bennett-type concentration inequality that involves the function $Q(s)$.

^2 In general, a tail inequality describes the probability of the event that the value of a random variable exceeds a given positive constant $t$. Consequently, the tail inequality provides more useful information for large $t$ than for small $t$.
Using the resulting tail bounds for random i.d. series, we study the properties of two optimization problems: chance constrained optimization problems and quadratic optimization problems with orthogonality constraints; the latter covers several well-studied optimization problems as special cases, e.g., the Procrustes problem and the quadratic assignment problem. Although these problems have been thoroughly explored in [24, 25], those results are built under the assumption that the $\xi_i$ obey either distributions supported on $[-1, 1]$ or Gaussian distributions with unit variance, which restricts their applicability to practical problems that do not satisfy the assumption. By using the tail inequalities for random i.d. series to resolve an extension of Nemirovski's conjecture (cf. Conjecture IV.1), we show that the results obtained in [24, 25] remain valid in the i.d. scenario, where the $\xi_i$ obey i.d. distributions instead of distributions supported on $[-1, 1]$ or Gaussian distributions.
The remainder of this paper is organized as follows. Section II introduces necessary preliminaries on i.d. distributions and Section III presents the main results of this paper. In Section IV, we study the application of random i.d. series in a number of optimization problems. Section V concludes the paper. In the appendix, we provide a detailed introduction to the Lévy measure (part A) and prove the main results of this paper (part B).
II Preliminaries on Infinitely Divisible Distributions
In this section, we first introduce several definitions related to infinitely divisible (i.d.) distributions and then present the matrix mgf inequality for i.d. distributions.
II-A Infinitely Divisible Distributions
A random variable $x$ has an i.d. distribution if, for any $n \in \mathbb{N}$, there exists a sequence of independent and identically distributed (i.i.d.) random variables $x_1, \dots, x_n$ such that $x_1 + \cdots + x_n$ has the same distribution as $x$. Equivalently, i.d. distributions can be defined by means of a characteristic exponent, as follows.
Definition II.1
Let $\psi_x(\theta) := \log \mathbb{E}\, e^{i \theta x}$, $\theta \in \mathbb{R}$, be the characteristic exponent of a random variable $x$. The distribution of $x$ is said to be i.d. if for any $n \in \mathbb{N}$, there exists a characteristic exponent $\psi_n$ of some random variable such that $\psi_x(\theta) = n\, \psi_n(\theta)$ for all $\theta \in \mathbb{R}$.
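The definition can be verified numerically for a concrete i.d. law: the characteristic function of a Poisson($\lambda$) variable coincides with the $n$-fold product of the characteristic function of Poisson($\lambda/n$) variables. The values of $\lambda$ and $n$ below are arbitrary illustrative choices:

```python
import numpy as np

lam, n = 2.5, 7
theta = np.linspace(-3.0, 3.0, 11)

# Characteristic function of Poisson(lam): exp(lam * (e^{i theta} - 1))
phi = np.exp(lam * (np.exp(1j * theta) - 1))

# n-fold product of the characteristic function of Poisson(lam / n)
phi_n = np.exp((lam / n) * (np.exp(1j * theta) - 1)) ** n

assert np.allclose(phi, phi_n)   # Poisson is infinitely divisible
```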
Now, we need to introduce the concept of the Lévy measure.
Definition II.2 (Lévy Measure)
A Borel measure $\nu$ defined on $\mathbb{R}$ is said to be a Lévy measure if it satisfies
(1) $\nu(\{0\}) = 0$ and $\int_{\mathbb{R}} \min(1, u^2)\, \nu(du) < \infty$.
The Lévy measure describes the expected number of jumps of a certain height in a time interval of unit length; a more detailed explanation is given in Appendix A. The following theorem provides a sufficient and necessary condition for i.d. distributions:
Theorem II.1 (Lévy–Khintchine Theorem)
A real-valued random variable $x$ is i.d. if and only if there exists a triplet $(b, \sigma^2, \nu)$ such that for any $\theta \in \mathbb{R}$, the characteristic exponent is of the form
(2) $\psi_x(\theta) = i b \theta - \dfrac{\sigma^2 \theta^2}{2} + \displaystyle\int_{\mathbb{R}} \left( e^{i \theta u} - 1 - i \theta u\, \mathbf{1}_{\{|u| \le 1\}} \right) \nu(du),$
where $b \in \mathbb{R}$, $\sigma^2 \ge 0$, and $\nu$ is a Lévy measure.
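As a sanity check of the Lévy–Khintchine form, the characteristic exponent of a Poisson($\lambda$) variable, $\lambda(e^{i\theta} - 1)$, matches the representation (2) with the triplet $(\lambda, 0, \lambda \delta_1)$, i.e., a single unit jump of intensity $\lambda$; the comparison below assumes the standard truncation $|u| \le 1$:

```python
import numpy as np

lam = 1.8
theta = np.linspace(-2.0, 2.0, 9)

# Characteristic exponent of Poisson(lam), computed directly
psi = lam * (np.exp(1j * theta) - 1)

# Levy-Khintchine form with the triplet b = lam, sigma^2 = 0,
# nu = lam * delta_1: the unit jump lies inside the truncation region,
# so its compensation term appears under the integral.
psi_lk = 1j * lam * theta + lam * (np.exp(1j * theta) - 1 - 1j * theta)

assert np.allclose(psi, psi_lk)
```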
II-B Matrix Inequalities for Infinitely Divisible Distributions
Let the symbol $\preceq$ denote the semidefinite order on self-adjoint matrices: $A \preceq B$ means that $B - A$ is positive semidefinite. For any real functions $f$ and $g$, the transfer rule states that if $f(a) \le g(a)$ for all $a$ in an interval $I$, then $f(A) \preceq g(A)$ whenever the eigenvalues of the self-adjoint matrix $A$ lie in $I$. Below, we present the matrix mgf bound for i.d. distributions as the starting point for deriving the desired tail results for matrix i.d. series.
Lemma II.1
Let $x$ be an i.d. random variable with the triplet $(b, \sigma^2, \nu)$, and suppose that $\mathbb{E}\, x = 0$. Given a fixed self-adjoint matrix $A$, it holds that for any admissible $\theta > 0$,
(3) 
where $\lambda_{\max}(\cdot)$ stands for the largest eigenvalue, and
(4) 
III Tail Inequalities for Matrix Infinitely Divisible Series
In this section, we first present two types of tail inequalities for matrix i.d. series: Bennett-type and Bernstein-type inequalities. By analyzing the characteristics of the function $Q(s)$ that appears in the Bennett-type result, we introduce a piecewise function to bound $Q(s)$ from below and thus obtain a new tail inequality for matrix i.d. series. We also study the upper bound of the expectation of the spectral norm of a matrix i.d. series.
III-A Tail Inequalities for Matrix Infinitely Divisible Series
By using the matrix mgf bound (3), we first obtain the following tail inequality for the matrix i.d. series $\sum_{i=1}^N \xi_i A_i$:
Theorem III.1
Let $A_1, \dots, A_N$ be fixed $d$-dimensional self-adjoint matrices, and let $\xi_1, \dots, \xi_N$ be independent centered i.d. random variables with the common triplet $(b, \sigma^2, \nu)$. Define $S := \sum_{i=1}^{N} \xi_i A_i$. Then for all $t > 0$, we have
(5) 
where is the left limit at , and is the inverse of
The proof of this theorem is given in Appendix B-B.
Remark III.1
Since the matrices $A_i$ ($i = 1, \dots, N$) are self-adjoint, the matrix $\sum_{i=1}^{N} A_i^2$ is self-adjoint and positive semidefinite. Therefore, the corresponding variance term is nonnegative and the above result is nontrivial.
Considering the difficulties that arise in computing the function in (5) and its inverse, we introduce the additional condition that the Lévy measure $\nu$ has bounded support to simplify the above result, which leads to the following corollary.
Corollary III.1
If $\nu$ has bounded support with $R := \sup\{ |u| : u \in \operatorname{supp}(\nu) \}$, then for any $t > 0$,
(6) 
where the variance-type parameter is as in Theorem III.1, and
(7) $Q(s) := (1+s)\log(1+s) - s.$
The proof of this corollary is given in Appendix B-C.
Roughly speaking, the condition that $\nu$ has bounded support means that large jumps cannot occur on the paths of the Lévy process generated from the i.d. distribution with triplet $(b, \sigma^2, \nu)$. Refer to Appendix A for an explanation of this condition.
Note that the tail inequality (III.1) is similar in form to the matrix Bennett inequality (cf. Theorem 6.1 of [1]). Following the classical method of bounding $Q(s)$ from below, a Bernstein-type result can be derived based on the fact that
(8) $Q(s) \ge g(s)$ for all $s \ge 0$,
where
(9) $g(s) := \dfrac{s^2 / 2}{1 + s/3}.$
As shown in Fig. 1, the Bernstein-type function can tightly bound $Q(s)$ from below when $s$ is close to the origin, whereas there is a large discrepancy between the two when $s$ is far from the origin. This is because the Bernstein-type bound is derived from the Taylor expansion of $Q(s)$ at the point $s = 0$ (cf. Chapter 2.7 of [30]). To facilitate the analysis, this bound is often further relaxed to a looser piecewise lower-bound function whose subfunctions are proportional to $s^2$ when $s$ is small and to $s$ when $s$ is large. Although this relaxed function does not bound $Q(s)$ sufficiently tightly, the result presented in (15) below shows that $s \log(1+s)$ matches the rate of growth of $Q(s)$ both when $s$ is close to the origin and as $s$ approaches infinity. This phenomenon suggests that the coefficients of the piecewise subfunctions are probably not sufficiently well tuned.
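The discrepancy described above can be observed numerically. Assuming $Q$ is the standard Bennett function $(1+s)\log(1+s) - s$ and the Bernstein-type lower bound is the classical $\frac{s^2/2}{1+s/3}$ (both are assumptions about the paper's exact notation), the two agree near the origin but diverge for large $s$:

```python
import numpy as np

def bennett(s):
    # Bennett function Q(s) = (1+s)log(1+s) - s (assumed form)
    return (1 + s) * np.log1p(s) - s

def bernstein(s):
    # classical Taylor-based lower bound s^2/2 / (1 + s/3) (assumed form)
    return (s**2 / 2) / (1 + s / 3)

s = np.linspace(0.0, 50.0, 2001)[1:]   # skip 0 to avoid a 0/0 ratio below

# The bound is valid everywhere ...
assert np.all(bennett(s) >= bernstein(s) - 1e-12)

# ... and tight near the origin, but loose far from it: Q grows like
# s*log(s), while the Bernstein bound grows only linearly in s.
print(bennett(0.01) / bernstein(0.01))   # close to 1
print(bennett(50.0) / bernstein(50.0))   # noticeably larger than 1
```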
Corollary III.2
Let $\xi_1, \dots, \xi_N$ be independent i.d. random variables satisfying the conditions in Corollary III.1. Then for any $t > 0$,
(10)  
This corollary shows that the probability of the event that the largest eigenvalue exceeds $t$ decays exponentially in $t$ when $t$ is large, and that its upper bound takes a Gaussian-like form in $t^2$ when $t$ is small.
Recalling Inequality (4.9) of [1], the expectation of the spectral norm of a random Gaussian series is bounded by a term of order $\sqrt{\log d}$. In a similar way, we use the tail bound presented in (III.2) to obtain an upper bound on the expected spectral norm of a random i.d. series.
Theorem III.2
Let $\xi_1, \dots, \xi_N$ be independent i.d. random variables satisfying the conditions in Corollary III.1. Then
(11) 
Because of the existence of the Lévy measure $\nu$, the upper bound on the expected spectral norm of a random i.d. series differs from the Gaussian bound of order $\sqrt{\log d}$. Recalling the Lévy–Itô decomposition (cf. [27]), the higher expectation bound for a matrix i.d. series arises from the existence of the compound Poisson (with drift) components of the i.d. distribution.
Remark III.2
Note that the aforementioned tail results for matrix i.d. series can be generalized to the scenario of sums of independent i.d. random matrices, all of whose entries are i.d. random variables with the same generating triplet $(b, \sigma^2, \nu)$. As a starting point, we first obtain the mgf bound for a self-adjoint i.d. random matrix:
(12)
which can be proven in a manner similar to Lemma II.1. We then arrive at upper bounds on the largest eigenvalue and the expected spectral norm with the same forms as those of the proposed results for matrix i.d. series, except that one variance-type term is replaced by its matrix analogue [cf. (III.1), (III.1), (III.2), (17) and (B-D)]. These results can also be regarded as an extension of the existing vector-version results (cf. [31, 32]).

III-B A Lower-Bound Function of Q(s)
As discussed above, both the Bernstein-type lower bound and its piecewise relaxation are lower-bound functions for $Q(s)$, but they do not bound $Q(s)$ sufficiently tightly when $s$ is far from the origin (cf. Fig. 1) because they stem from the Taylor expansion at the origin. We adopt a more direct strategy to analyze the behavior of the function $Q(s)$; for earlier discussions on this topic, refer to [33, 34].
We consider the following inequality:
(13) $Q(s) \ge \gamma\, s \log(1+s), \qquad s \ge 0,$
where the parameter $\gamma > 0$ is expected to be a constant independent of $s$ such that $\gamma\, s \log(1+s)$ bounds $Q(s)$ from below as tightly as possible. For any $s > 0$, define
(14) $\Gamma(s) := \dfrac{Q(s)}{s \log(1+s)}.$
Then, it follows from L'Hôpital's rule that
(15) $\lim_{s \to 0^{+}} \Gamma(s) = \dfrac{1}{2} \quad \text{and} \quad \lim_{s \to \infty} \Gamma(s) = 1.$
The two limits in (15) suggest that the function $s \log(1+s)$ indeed captures the rate of growth of $Q(s)$ as $s$ approaches either the origin or infinity.
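The two limits in (15) can be verified symbolically, under the assumption that $Q(s) = (1+s)\log(1+s) - s$ and that the ratio in (14) is $Q(s)/(s \log(1+s))$ (our assumed forms):

```python
import sympy as sp

s = sp.symbols('s', positive=True)
Q = (1 + s) * sp.log(1 + s) - s          # Bennett function (assumed form)
ratio = Q / (s * sp.log(1 + s))          # assumed shape of the ratio in (14)

lim0 = sp.limit(ratio, s, 0)             # limit as s -> 0+
lim_inf = sp.limit(ratio, s, sp.oo)      # limit as s -> infinity

assert lim0 == sp.Rational(1, 2) and lim_inf == 1
```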
Now, we must choose the parameter in (13). As shown in Fig. 2, the resulting bound is sensitive to this choice, and the ratio between $Q(s)$ and the bound will vary dramatically near the left endpoint of the domain if the parameter is not chosen well. Therefore, we should select the parameter such that this variation is kept as small as possible, i.e., such that the discrepancy between $Q(s)$ and its lower bound is minimized. The following lemma is also derived from L'Hôpital's rule:
Lemma III.1
Let . Then,
This lemma shows that, with the appropriate parameter choice, the point in question is a removable discontinuity of the ratio function; i.e., the one-sided limits there coincide. In other words, if we add a supplementary definition at this point, the resulting function will be continuous on its whole domain. Therefore, the parameter should be selected to match this limiting value.
By using the ratio function defined in (14), we can develop another lower-bound function for $Q(s)$ as follows.
Proposition III.1
Given an arbitrary positive constant $s_K$ and an integer $K \ge 1$, let $\{s_j\}_{j=0}^{K}$ be an ordered sequence such that $0 = s_0 < s_1 < \cdots < s_K$, and define
(16) $h_K(s) := \Gamma(s_{j-1})\, s \log(1+s)$ for $s \in [s_{j-1}, s_j)$, $j = 1, \dots, K$ (with the last subinterval taken to be $[s_{K-1}, s_K]$),
where $\Gamma(s) := Q(s)/(s \log(1+s))$ for $s > 0$ and $\Gamma(0) := 1/2$. Then, for all $s \in [0, s_K]$, we have $h_K(s) \le Q(s)$, with equality at the points $s_0, s_1, \dots, s_{K-1}$.
As suggested by this result, a piecewise function to bound $Q(s)$ from below on a bounded domain can be built by means of the following steps:

(i) Let $s_0 = 0$, and select a constant $s_K > 0$ to form the interval $[0, s_K]$.

(ii) Select an integer $K$ and an ordered sequence $0 = s_0 < s_1 < \cdots < s_K$.

(iii) If $s = 0$, set the bound to $0$; if $s$ lies in the $j$th subinterval, set it to $\Gamma(s_{j-1})\, s \log(1+s)$, where $\Gamma$ is the ratio defined via (14) ($j = 1, \dots, K$).
The resulting function has the following characteristics:

1) There is no additional restriction on the choice of the constant $s_K$, the integer $K$ and the partition points beyond their ordering. This means that these parameters can be chosen in accordance with the requirements of various practical problems.

2) Although it is a piecewise function, all of its parts share the same form $\Gamma(s_{j-1})\, s \log(1+s)$, and the coefficients are simply the values of the ratio $\Gamma(s) = Q(s)/(s \log(1+s))$ at the partition points. Therefore, its computation has a low cost.

3) For any choice of the partition, the piecewise function has the same form on the first subinterval. In particular, the one-piece version $h_1$ (i.e., the case $K = 1$) is a continuous function on the whole interval $[0, s_K]$, and the difference between $h_1$ and the multi-piece versions is not significant for any other choice of the partition (cf. Fig. 3). Hence, $h_1$ can be adopted as the lower-bound function for $Q(s)$ if there are no additional requirements on the ordered sequence.
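A sketch of the piecewise construction in steps (i)-(iii), under the assumption that each subfunction takes the form $\Gamma(s_{j-1})\, s \log(1+s)$ with $\Gamma(s) = Q(s)/(s \log(1+s))$ evaluated at the left partition point (our reconstruction of (16); the paper's exact coefficients may differ):

```python
import numpy as np

def Q(s):
    # Bennett function (assumed form): Q(s) = (1+s)log(1+s) - s
    return (1 + s) * np.log1p(s) - s

def Gamma(s):
    # Ratio Q(s) / (s log(1+s)), extended by its limit 1/2 at s = 0;
    # this ratio is increasing in s, which makes the construction work.
    s = np.asarray(s, dtype=float)
    out = np.full_like(s, 0.5)
    nz = s > 0
    out[nz] = Q(s[nz]) / (s[nz] * np.log1p(s[nz]))
    return out

def piecewise_lower(s, knots):
    # On each subinterval [s_{j-1}, s_j), use Gamma(s_{j-1}) * s * log(1+s);
    # evaluating Gamma at the left endpoint keeps every piece below Q.
    j = np.clip(np.searchsorted(knots, s, side="right") - 1, 0, len(knots) - 2)
    return Gamma(knots[j]) * s * np.log1p(s)

knots = np.array([0.0, 1.0, 3.0, 7.0, 15.0])  # arbitrary ordered partition
s = np.linspace(0.0, 15.0, 1501)

assert np.all(piecewise_lower(s, knots) <= Q(s) + 1e-12)   # valid lower bound
assert np.allclose(piecewise_lower(knots[:-1], knots), Q(knots[:-1]))  # tight at knots
```

With a single knot ($K = 1$) the sketch reduces to $\tfrac{1}{2}\, s \log(1+s)$, the continuous one-piece bound discussed above.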
Remark III.3
As shown in Fig. 4, the lower-bound function $h_1$ defined via (16) performs better than the Bernstein-type function derived from the Taylor expansion when $s$ is large; moreover, although the Taylor-based function bounds $Q(s)$ more tightly when $s$ is small, there is only a slight discrepancy between the two bounds on that initial interval.^3 As a result, the method of bounding $Q(s)$ proposed in (13) is not only effective but also corrects the shortcoming of the Taylor-expansion-based method (8), i.e., the local approximation at the origin.

^3 The crossover point between the two regimes is the numerical solution to the inequality comparing the two lower bounds.
By recalling the tail inequality (III.1) and replacing the function $Q(s)$ with the piecewise lower bound $h_1$ from (16), we obtain, for any $t > 0$,
(17)  
where the notation is as in Corollary III.1. As shown in Fig. 5, the above result provides a bound that is tighter than the one achieved by the Bernstein-type result in (III.2) when $t$ is large, and is only slightly looser than the Bernstein-type bound when $t$ is small.
Remark III.4
Since the function $h_1$ is defined on the bounded interval $[0, s_K]$, the result given in (17) cannot be used to analyze the asymptotic behavior of the tail probability as $t$ goes to infinity. However, since $h_1$ bounds $Q(s)$ from below more tightly than the Bernstein-type bound does on this bounded domain, the result given in (17) provides a more accurate description of the non-asymptotic tail behavior there. The following alternative expressions for the Bernstein-type result given in (III.2) and the $h_1$-based result given in (17) can respectively be obtained: with probability at least $1 - \delta$,
(18)
and
These expressions suggest that the $h_1$-based bound on the largest eigenvalue has a milder dependence on the matrix dimension, making it tighter than the right-hand side of the Bernstein-type result (18) when the matrix dimension is high.
IV Applications in Optimization
In this section, we will show that the derived tail inequalities for random i.d. series can be used to solve two types of optimization problems: chance constrained optimization problems and quadratic optimization problems with orthogonality constraints. These optimization problems are reviewed in Section IVA, and Nemirovski’s conjecture [24] for efficiently computable solutions to these two optimization problems is introduced. We argue that the requirement in Nemirovski’s conjecture is not practical, generalize the requirement using matrix i.d. series, and provide a solution to the extended version of Nemirovski’s conjecture in Section IVB. Lastly, we rederive efficiently computable solutions to both types of optimization problems with a matrix i.d. series requirement in Section IVC.
IV-A Relevant Optimization Problems
It has been pointed out in the pioneering work [24] that the behavior of $\|\sum_i \xi_i A_i\|$ is strongly related to efficiently computable solutions to many optimization problems, e.g., the chance constrained optimization problem and the quadratic optimization problem with orthogonality constraints. Several well-studied optimization problems are included in the latter as special cases, such as the Procrustes problem and the quadratic assignment problem. We begin with a brief introduction to these optimization problems.
IV-A1 Chance Constrained Optimization Problem
Consider the following chance constrained optimization problem (cf. [25]): given an $n$-dimensional vector $c$ and a tolerance $\epsilon \in (0, 1)$, find
(19)
where the objective involves an efficiently computable vector-valued function with convex components; the constraint matrices are affine functions of the decision variable taking values in the space of symmetric matrices; and $\xi_1, \dots, \xi_N$ are independent random variables with zero mean. The main challenge in solving this optimization problem lies in the chance constraint (19b).
By letting , we have
It is subsequently necessary to find a sufficient condition for the inequality
(20) 
and to guarantee that the condition is efficiently computable within the optimization. For example, So proposed the following condition [25]:
(21) 
By using the Schur complement, it can be equivalently expressed as a linear matrix inequality:
(22) 
If the constraint (19b) is replaced with the inequality (22), the chance constrained optimization problem becomes tractable. To guarantee the validity of this replacement, it is necessary to consider the following problem:
 (P1)
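The Schur-complement step used to pass from (21) to the linear matrix inequality (22) rests on the standard block-matrix fact that, for $A \succ 0$, the matrix $\begin{pmatrix} A & B \\ B^{\top} & C \end{pmatrix}$ is positive semidefinite if and only if $C - B^{\top} A^{-1} B$ is. A numerical sketch with hypothetical blocks:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4

# Hypothetical blocks: A is made strictly positive definite so that the
# Schur complement C - B^T A^{-1} B is well defined.
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)
B = rng.standard_normal((n, n))

def is_psd(M):
    return np.linalg.eigvalsh((M + M.T) / 2).min() >= -1e-10

base = B.T @ np.linalg.solve(A, B)
for C in (base + np.eye(n), base - np.eye(n)):   # PSD / non-PSD complement
    M = np.block([[A, B], [B.T, C]])
    schur = C - B.T @ np.linalg.solve(A, B)
    assert is_psd(M) == is_psd(schur)            # the two conditions agree
```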
IV-A2 Quadratic Optimization Problems with Orthogonality Constraints
Let $\mathbb{R}^{m \times n}$ be the space of real $m \times n$ matrices equipped with the trace inner product $\langle X, Y \rangle := \operatorname{tr}(X^{\top} Y)$. Consider the following quadratic optimization problem:
(23) 
where the objective mappings are self-adjoint linear mappings (note that they can be represented as symmetric matrices), the constraint mappings are positive semidefinite, and the norm involved is the spectral norm. As addressed in [24], this optimization problem covers many well-studied optimization problems with orthogonality constraints as special cases, e.g., the Procrustes problem and the quadratic assignment problem. By exploiting the structure of these problems, the orthogonality constraint can be relaxed to the constraint (IV-A2c) without loss of generality.
The optimization problem can be directly tackled by using the semidefinite programming (SDP) relaxation:
(24) 
where the lifted variable ranges over the space of symmetric matrices, the data matrices are the symmetric matrices corresponding to the respective self-adjoint linear mappings, and the additional linear mappings encode the constraints of (IV-A2). Refer to Section 3.1.1 of [25] for the details of this notation.
By using the ellipsoid method, the solution to the optimization problem (IV-A2) can be obtained to within an additive error in polynomial time. That is, the ellipsoid method can be used, for any prescribed accuracy, to obtain in polynomial time a solution that is feasible for (IV-A2) and whose objective value is within the prescribed additive error of the optimal value of (IV-A2).
The solution to the optimization problem (IV-A2) can be obtained from the SDP solution along with a degree of randomness. Since the SDP solution $X^*$ is positive semidefinite, there exists a positive semidefinite matrix $V$ such that $X^* = V^2$. Since $V$ is also symmetric, it has a spectral decomposition $V = U \Lambda U^{\top}$, where $U$ is an orthogonal matrix and $\Lambda$ is a diagonal matrix. Let $u$ be a random vector whose entries are i.i.d. with zero mean and unit variance. The solution is ultimately obtained via $x = V u$. Alternatively, $x$ can be expressed as
(25) 
where $u_j$ is the $j$th entry of $u$ and $v_j$ is the $j$th column vector of the matrix $V$ ($j = 1, \dots, n$). To explore the quality of the solution $x$, the following problem should be considered:
 (P2)

Does $x$ act as a high-quality solution to the optimization problem (IV-A2) with a reasonable probability (at least bounded away from zero)?
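The randomized rounding described above can be sketched as follows; the PSD matrix standing in for the SDP optimum and the Rademacher choice for the zero-mean, unit-variance entries are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6

# Stand-in for the PSD optimum X* of the SDP relaxation (hypothetical data)
G = rng.standard_normal((n, n))
X_star = G @ G.T / n

# Symmetric square root V with V @ V = X*
w, U = np.linalg.eigh(X_star)
V = U @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ U.T

# Randomized rounding: x = V u with u a zero-mean, unit-variance vector
# (Rademacher here), so that E[x x^T] = V E[u u^T] V = X*
u = rng.choice([-1.0, 1.0], size=n)
x = V @ u
```

The design choice is that the randomness enters only through $u$, so the quality analysis of $x$ reduces to tail bounds for weighted sums of the $u_j$, which is where the matrix-series tail inequalities apply.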
IV-B An Extension of Nemirovski's Conjecture
Nemirovski [24] pointed out that the two problems P1 and P2 above can be reduced to a question about the behavior of the upper bound of $\|\sum_i \xi_i A_i\|$, and that the "optimal" answer to this question can be achieved by resolving the following conjecture:
Conjecture IV.1
Nemirovski [24] showed that the inequality (26) holds when the parameter is of order $m^{1/6}$, while there is a gap between this value and the conjectured value of order $\sqrt{\log m}$. Anthony So used a noncommutative Khintchine inequality to show that the conjectured order is achievable; specifically (cf. [25]),
(27) 
Note that these results are built under the assumption that the $\xi_i$ obey either Gaussian distributions or distributions supported on $[-1, 1]$. However, this assumption will not always be satisfied in practice. Therefore, we extend the content of the conjecture to the i.d. scenario, i.e., we ask whether the inequality (26) remains valid when the $\xi_i$ are independent i.d. random variables with zero mean and unit variance. The following theorem provides a solution to this extended conjecture.