# Covariance-engaged Classification of Sets via Linear Programming

Set classification aims to classify a set of observations as a whole, as opposed to classifying individual observations separately. To formally understand the unfamiliar concept of binary set classification, we first investigate the optimal decision rule under the normal distribution, which utilizes the empirical covariance of the set to be classified. We show that the number of observations in the set plays a critical role in bounding the Bayes risk. Under this framework, we further propose new methods of set classification. For the case where only a few parameters of the model drive the difference between two classes, we propose a computationally-efficient approach to parameter estimation using linear programming, leading to the Covariance-engaged LInear Programming Set (CLIPS) classifier. Its theoretical properties are investigated for both independent case and various (short-range and long-range dependent) time series structures among observations within each set. The convergence rates of estimation errors and risk of the CLIPS classifier are established to show that having multiple observations in a set leads to faster convergence rates, compared to the standard classification situation in which there is only one observation in the set. The applicable domains in which the CLIPS performs better than competitors are highlighted in a comprehensive simulation study. Finally, we illustrate the usefulness of the proposed methods in classification of real image data in histopathology.

Comments

There are no comments yet.

## Authors

• 16 publications
• 4 publications
• 17 publications
12/16/2014

### Estimation of Large Covariance and Precision Matrices from Temporally Dependent Observations

We consider the estimation of large covariance and precision matrices fr...
02/12/2018

### Sparse and Robust Reject Option Classifier using Successive Linear Programming

In this paper, we propose a new sparse and robust reject option classifi...
12/22/2020

### Estimation in nonparametric regression model with additive and multiplicative noise via Laguerre series

We look into the nonparametric regression estimation with additive and m...
01/11/2015

### Identifiability and optimal rates of convergence for parameters of multiple types in finite mixtures

This paper studies identifiability and convergence behaviors for paramet...
04/21/2021

### Discovering Classification Rules for Interpretable Learning with Linear Programming

Rules embody a set of if-then statements which include one or more condi...
02/28/2021

### Optimal Imperfect Classification for Gaussian Functional Data

Existing works on functional data classification focus on the constructi...
01/09/2017

### On Reject and Refine Options in Multicategory Classification

In many real applications of statistical learning, a decision made from ...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Classification is a useful tool in statistical learning with applications in many important fields. A classification method aims to train a classification rule based on the training data to classify future observations. Some popular methods for classification include linear discriminant analyses, quadratic discriminant analyses, logistic regressions, support vector machines, neural nets and classification trees. Traditionally, the task at hand is to classify an observation into a class label.

Advances in technology have eased the production of a large amount of data in various areas such as healthcare and manufacturing industries. Oftentimes, multiple samples collected from the same object are available. For example, it has become cheaper to obtain multiple tissue samples from a single patient in cancer prognosis (Miedema et al., 2012). To be explicit, Miedema et al. (2012) collected 348 independent cells, each contains observations of varying numbers (tens to hundreds) of nuclei. Here, each cell, rather than each nucleus, is labelled as either normal or cancerous. Each observation of nuclei contains 51 measurements of shape and texture features. A statistical task herein is to classify the whole set of observations from a single set (or all nuclei in a single cell) to normal or cancerous group. Such a problem was coined as set classification by Ning and Karypis (2009), studied in Wang et al. (2012) and Jung and Qiao (2014), and was seen in the image-based pathology literature (Samsudin and Bradley, 2010; Wang et al., 2010; Cheplygina et al., 2015; Shifat-E-Rabbi et al., 2020)

and in face recognition based on pictures obtained from multiple cameras, sometime called image set classification

(Arandjelovic and Cipolla, 2006; Wang et al., 2012). The set classification is not identical to the multiple-instance learning (MIL) (Maron and Lozano-Pérez, 1998; Chen et al., 2006; Ali and Shah, 2010; Carbonneau et al., 2018) as seen by Kuncheva (2010). A key difference is that in set classification a label is given to sets whereas observations in a set have different labels in the MIL setting.

While conventional classification methods predict a class label for each observation, care is needed in generalizing those for set classification. In principle, more observations should ease the task at hand. Moreover, higher-order statistics such as variances and covariances can now be exploited to help classification. Our approach to set classification is to use the extra information, available to us only when there are multiple observations. To elucidate this idea, we illustrate samples from three classes in Fig.

1. All three classes have the same mean, and Classes 1 and 2 have the same marginal variances. Classifying a single observation near the mean to any of these distributions seems difficult. On the other hand, classifying several independent observations from the same class should be much easier. In particular, a set classification method needs to incorporate the difference in covariances to differentiate these classes.

In this work, we study a binary set classification framework, where a set of observations is classified to either or

. In particular, we propose set classifiers that extend quadratic discriminant analysis to the set classification setting, and are designed to work well in set-classification of high-dimensional data whose distributions are similar to those in Fig.

1.

To provide a fundamental understanding of the set classification problem, we establish the Bayesian optimal decision rule under normality and homogeneity (i.i.d) assumptions. This Bayes rule utilizes the covariance structure of the testing set of future observations. We show in Section 2 that it becomes much easier to make accurate classification for a set when the set size, , increases. In particular, we demonstrate that the Bayes risk can be reduced exponentially in the set size . To the best of our knowledge, this is the first formal theoretical framework for set classification problems in the literature.

Built upon the Bayesian optimal decision rule, we propose new methods of set classification in Section 3. For the situation where the dimension of the feature vectors is much smaller than the total number of training samples, we demonstrate that a simple plug-in classifier leads to satisfactory risk bounds similar to the Bayes risk. Again, a large set size plays a key role in significantly reducing the risk. In high-dimensional situations where the number of parameters to be estimated () is large, we make an assumption that only a few parameters drive the difference of two classes. With this sparsity assumption, we propose to estimate the parameters in the classifier via linear programming, and the resulting classifiers are called Covariance-engaged LInear Programming Set (CLIPS) classifiers. Specifically, the quadratic and linear parameters in the Bayes rule can be efficiently estimated under the sparse structure, thanks to the extra observations in the training set due to having sets of observations. Our estimation approaches are closely related to and built upon the successful estimation strategies in Cai et al. (2011) and Cai and Liu (2011). In estimation of the constant parameter, we perform a logistic regression with only one unknown, given the estimates of quadratic and linear parameters. This allows us to implement CLIPS classifier with high computation efficiency.

We provide a thorough study of theoretical properties of CLIPS classifiers and establish an oracle inequality in terms of the excess risk, in Section 4

. In particular, the estimates from CLIPS are shown to be consistent, and the strong signals are always selected with high probability in high dimensions. Moreover, the excess risk can be reduced by having more observations in a set, one of the new phenomena for set classification, which are different from that obtained by naively having pooled observations.

In the conventional classification problem where , a special case of the proposed CLIPS classifier becomes a new sparse quadratic discriminant analysis (QDA) method (cf. Fan et al., 2015, 2013; Li and Shao, 2015; Jiang et al., 2018; Qin, 2018; Zou, 2019; Gaynanova and Wang, 2019; Pan and Mai, 2020). As a byproduct of our theoretical study, we show that the new QDA method enjoys better theoretical properties compared to state-of-the-art sparse QDA methods such as Fan et al. (2015).

The advantages of our set classifiers are further demonstrated in comprehensive simulation studies. Moreover, we provide an application to histopathology in classifying sets of nucleus images to normal and cancerous tissues in Section 5. Proofs of main results and technical lemmas can be found in the supplementary material. Also present in the supplementary material is a study on the case where observations in a set demonstrate certain spatial and temporal dependent structures. There, we utilize various (both short- and long-range) dependent time series structures within each set by considering a very general vector linear process model.

## 2 Set Classification

We consider a binary set-classification problem. The training sample contains sets of observations. Each set, , corresponds to one object, and is assumed to be from one of the two classes. The corresponding class label is denoted by . The number of observations within the th set is denoted by and can be different among different sets. Given a new set of observations , the goal of set classification is to predict accurately based on using a classification rule trained on the training sample.

To formally introduce set classification problem and study its fundamental properties, we start with a setting in which the sets in each class are homogeneous in the sense that all the observations in a class, regardless of the set membership, follow the same distribution independently. Specifically, we assume both the sets and the new set are generated in the same way as independently. To describe the generating process of , we denote the marginal class probabilities by and , and the marginal distribution of the set size by

. We assume that the random variables

and are independent. In other words, the class membership can not be predicted just based on the set size . Conditioned on and , observations in the set are independent and each distributed as .

### 2.1 Covariance-engaged Set Classifiers

Suppose that there are observations in the set that is to be classified (called testing set), and its true class label is . The Bayes optimal decision rule classifies the set to Class 1 if the conditional class probability of Class 1 is greater than that of Class 2, that is, . This is equivalent to

due to Bayes theorem and the independence assumption among

and . Let us now assume that the conditional distributions are both normal, that is, and . Then the Bayes optimal decision rule depends on the quantity

 g(x1,…,xm) =1mlog{π1pM(m)∏mj=1f1(xj)π2pM(m)∏mj=1f2(xj)} =1mlog(π1/π2)−12log(|Σ1|/|Σ2|)−12μT1Σ−11μ1+12μT2Σ−12μ2 +(Σ−11μ1−Σ−12μ2)T¯x+12¯xT(Σ−12−Σ−11)¯x+12tr{(Σ−12−Σ−11)S}. (2.1)

Here denotes the determinant of the matrix for , and are the sample mean and sample covariance of the testing set. Note that the realization implies both the number of observations and the i.i.d. observations for . The Bayes rule can be expressed as

 ϕB(X†) =2−1{g(x1,…,xm)>0}, where (2.2) g(x1,…,xm) =1mlog(π1/π2)+β0+βT¯x+¯xT∇¯x/2+tr(∇S)/2,

in which the constant coefficient , the linear coefficient vector and the quadratic coefficient matrix . The Bayes rule under the normal assumption in (2.2) uses the summary statistics , and of .

We refer to (2.2) and any estimated version of it as a covariance-engaged set classifier. In Section 3, several estimation approaches for , and will be proposed. In this section, we further discuss a rationale for considering (2.2).

The covariance-engaged set classifier (2.2) resembles the conventional QDA classifier. As a natural alternative to (2.2), one may consider the sample mean as a representative of the testing set and apply QDA to directly to make a prediction. In other words, one is about to classify this single observation to one of the two normal distributions, that is, and . This simple idea leads to

 ϕB,¯x(X†) =2−1{gQDA(¯x)>0}, where (2.3) gQDA(¯x) =1mlog(π1/π2)+β′0+βT¯x+¯xT∇¯x/2,

in which . One major difference between (2.2) and (2.3) is that the term is absent from (2.3). Indeed, the advantage of (2.2) over (2.3) comes from the extra information in the sample covariance of . In the regular classification setting, (2.2) coincides with (2.3) since vanishes when is a singleton.

Given multiple observations in the testing set, another natural approach is a majority vote applied to the QDA decisions of individual observations:

 ϕMV(X†)=2−1{1mm∑j=1sign[gQDA(xj)]>0}, (2.4)

where for and respectively. In contrast, since , our classifier (2.2) predicts the class label by a weighted vote of individual QDA decisions. In this sense, the majority voting scheme (2.4) can be viewed as a discretized version of (2.2). In Section 5, we demonstrate that our set classifier (2.2) performs significantly better than (2.4).

###### Remark 1.

We have assumed that and are independent in the setting. In fact, this assumption is not essential and can be relaxed. In a more general setting, there can be two different distributions of , and conditional on and respectively. Our analysis throughout the paper remains the same except that they would replace two identical factors in the first equality of (2.1). If and are dramatically different, then the classification is easier as one can make decision based on the observed value of . In this paper, we only consider the more difficult setting where and are independent.

### 2.2 Bayes Risk

We show below an advantage of having a set of observations for prediction, compared to having a single observation. For this, we suppose for now that the parameters and , , are known and make the following assumptions. Denote and

as the greatest and smallest eigenvalues of a symmetric matrix

.

###### Condition 1.

The spectrum of is bounded below and above: there exists some universal constant such that for .

###### Condition 2.

The support of is bounded between and , where and are universal constants and . In other words, for any integer or . The set size can be large or growing when a sequence of models are considered.

###### Condition 3.

The prior class probability is bounded away from and : there exists a universal constant such that .

We denote as the risk of the Bayes classifier (2.2) given . Let . For a matrix , we denote as its Frobenius norm, where is its th element. For a vector , we denote as its norm. The quantity plays an important role in deriving a convergence rate of the Bayes risk . Although the Bayes risk does not have a closed form, we show that under mild assumptions, it converges to zero at a rate on the exponent.

###### Theorem 1.

Suppose that Conditions 1-3 hold. If is sufficiently large, then for some small constant depending on , and only. In particular, as we have .

The significance of having a set of observations is illustrated by this fundamental theorem. When , which implies and , Theorem 1 provides a Bayes risk bound for the theoretical QDA classifier in the regular classification setting. To guarantee a small Bayes risk for QDA, it is clear that must be sufficiently large. In comparison, for the set classification to be successful, we may allow to be very close to zero, as long as is sufficiently large. The Bayes risk of can be reduced exponentially in because of the extra information from the set.

We have discussed an alternative classifier via using the sample mean as a representative of the testing set, leading to (2.3). The following proposition quantifies its risk, which has a slower rate than that of Bayes classifier .

###### Proposition 1.

Suppose that Conditions 1-3 hold. Denote the risk of classifier in (2.3) as . Assume is sufficiently large. Then for some small constant depending on , and only. In addition, the rate on the exponent cannot be improved in general, i.e., for some small constant .

###### Remark 2.

Compared to the result in Theorem 1, the above proposition implies that classifier needs a stronger assumption but has a slower rate of convergence when the mean difference is dominated by the covariance difference . After all, this natural

-based classification rule only relies on the first moment of the data set

while the sufficient statistics, the first two moments, are fully used by the covariance-engaged classifier in (2.2).

## 3 Methodologies

We now consider estimation procedures for based on training sets . In Section 3.1, we first consider a moderate-dimensional setting where with a sufficiently small constant . In this case we apply a naive plug-in approach using natural estimators of the parameters , and . A direct estimation approach using linear programming, suitable for high-dimensional data, is introduced in Section 3.2. Hereafter, and are considered as functions of as grows.

### 3.1 Naive Estimation Approaches

The prior class probabilities and can be consistently estimated by the class proportions in the training data, and , where . Let denote the total sample size for Class . The set membership is ignored at the training stage, due to the homogeneity assumption. Note and are random while is deterministic. One can obtain consistent estimators of and based on the training data and plug them in (2.2). It is natural to use the maximum likelihood estimators given ,

 ^μk=∑(i,j):Yi=kXij/nk and ^Σk=∑(i,j):Yi=k{(Xij−^μk)(Xij−^μk)T}/nk. (3.5)

For classification of with , , the set classifier (2.2) is estimated by

 ^ϕ(X†)=2−1{1mlog(^π1/^π2)+^β0+^βT¯x+¯xT^∇¯x/2+tr(^∇S)/2>0}, (3.6)

where , and . In (3.6) we have assumed so that is invertible.

The generalization error of set classifier (3.6) is where . The classifier itself depends on the training data and hence is random. In the equation above, is understood as the conditional probability given the training data. Theorem 2 reveals a theoretical property of in a moderate-dimensional setting which allows to grow jointly. This includes the traditional setting in which is fixed.

###### Theorem 2.

Suppose that Conditions 1-3 hold. For any fixed , if for some sufficiently large and , , for some sufficiently small constant , then with probability at least we have for some small constant depending on and .

In Theorem 2, large values of not only relax the assumption on but also reduce the Bayes risk exponentially in with high probability. A similar result for QDA, where and , was obtained in Li and Shao (2015) under a stronger assumption .

For the high-dimensional data where and hence with probability for by Condition 2, it is problematic to plug in the estimators (3.5) since is rank deficient with probability . A simple remedy is to use a diagonalized or enriched version of , defined by or , where and is a identity matrix. Both and are invertible. However, to our best knowledge, no theoretical guarantee has been obtained without some structural assumptions.

### 3.2 A Direct Approach via Linear Programming

To have reasonable classification performance in high-dimensional data analysis, one usually has to take advantage of certain extra information of the data or model. There are often cases where only a few elements in and truly drive the difference between the two classes. A naive plug-in method proposed in Section 3.1 has ignored such potential structure of the data. We assume that both and are known to be sparse such that only a few elements of those are nonzero. In light of this, the Bayes decision rule (2.2) implies the dimension of the problem can be significantly reduced, which makes consistency possible even in the high-dimensional setting.

We propose to directly estimate the quadratic term , the linear term and the constant coefficients respectively, taking advantage of the assumed sparsity. As the estimates are efficiently calculated by linear programming, the resulting classifiers are called Covariance-engaged Linear Programming Set (CLIPS) classifiers.

We first deal with the estimation of the quadratic term , which is the difference between the two precision matrices. We use some key techniques developed in the literature of precision matrix estimation (cf. Meinshausen and Bühlmann, 2006; Bickel and Levina, 2008; Friedman et al., 2008; Yuan, 2010; Cai et al., 2011; Ren et al., 2015). These methods estimate a single precision matrix with a common assumption that the underlying true precision matrix is sparse in some sense. For the estimation of the difference, we propose to use a two-step thresholded estimator.

As the first step, we adopt the CLIME estimator (Cai et al., 2011) to obtain initial estimators and of the precision matrices and . Let and be the vector norm and vector supnorm of a matrix respectively. The CLIME estimators are defined as

 ~Ωk=argminΩ∈Rp×p∥Ω∥1 subject to ∥^ΣkΩ−I∥∞<λ1,N, k=1,2, (3.7)

for some .

Having obtained and , in the second step, we take a thresholding procedure on their difference, followed by a symmetrization to obtain our final estimator where

 ~∇ij=min{˘∇ij,˘∇ji},˘∇ij=(~Ω2,ij−~Ω1,ij)1{∣∣~Ω2,ij−~Ω1,ij∣∣>λ′1,N}, (3.8)

for some thresholding level .

Although this thresholded CLIME difference estimator is obtained by first individually estimating , we emphasize that the estimation accuracy only depends on the sparsity of their difference rather than the sparsity of either or under a relatively mild bounded matrix norm condition. We will show in Theorem 3 in Section 4 that if the true precision matrix difference is negligible, with high probability. When , our method described in (3.12) becomes a linear classifier adaptively. The computation of (3.8) is fast, since the first step (CLIME) can be recast as a linear program and the second step is a simple thresholding procedure.

###### Remark 3.

As an alternative, one can also consider a direct estimation of that does not rely on individual estimates of . For example, by allowing some deviations from the identity , Zhao et al. (2014) proposed to minimize the vector norm of . Specifically, they proposed subject to , where is some thresholding level. This method, however, is computationally expensive (as it has number of linear constraints when casted to linear programming) and can only handle relatively small size of . See also Jiang et al. (2018). We chose to use (3.8) mainly because of fast computation.

Next we consider the estimation of the linear coefficient vector , where , . In the literature of sparse QDA and sparse LDA, typical sparsity assumptions are placed on and (see Li and Shao, 2015) or placed on both and (see, for instance Cai and Liu, 2011; Fan et al., 2015). In the latter case, is also sparse as it is the difference of two sparse vectors. For the estimation of , we propose a new method which directly imposes sparsity on , without specifying the sparsity for , or except for some relatively mild conditions (see Theorem 4 for details.)

The true parameter satisfies . However, due to the rank-deficiency of , there are either none or infinitely many ’s that satisfy an empirical equation . Here, and are defined in (3.5). We relax this constraint and seek a possibly non-sparse pair with the smallest norm difference. We estimate the coefficients by , where

 (~β1,~β2)=argmin(θ1,θ2):∥θk∥1≤L1∥θ1−θ2∥1\rm~{}subject to ∥^Σkθk−^μk∥∞<λ2,N, k=1,2, (3.9)

where is some sufficiently large constant introduced only to ease theoretical evaluations. In practice, the constraint can be removed without affecting the solution. Note that Jiang et al. (2018) proposed to estimate rather than . The direct estimation approach for above shares some similarities with that of Cai and Liu (2011), especially in the relaxed constraint. However Cai and Liu (2011) focused on a direct estimation of for linear discriminant analysis in which , while we target on instead. Our procedure (3.9) can be recast as a linear programming problem (see, for example, Candes and Tao, 2007; Cai and Liu, 2011) and is computationally efficient.

Finally, we consider the estimation of the constant coefficient . The conditional class probability that a set belongs to Class given

can be evaluated by the following logit function,

 log{η(x1,…,xm)1−η(x1,…,xm)}= logπ1π2+log{∏mi=1f1(xi)∏mi=1f2(xi)} = log(π1/π2)+m(β0+¯xTβ+12¯xT∇¯x+12tr(∇S)),

where and are the sample mean and covariance of the set respectively. Having obtained our estimators and from (3.8) and (3.9), and estimated and by and from the training data, we have only a scalar undecided. We may find an estimate by conducting a simple logistic regression with dummy independent variable and offset for the th set of observations in the training data, where , , and are sample size, sample mean, and sample covariance of the th set. In particular, we solve

 ~β0 =argminθ0∈R ℓ(θ0∣{(Xi,Yi)}Ni=1,~β,~∇),\rm where the negative log-likelihood is (3.10) ℓ(θ0∣{(Xi,Yi)}Ni=1,~β,~∇) (3.11) =1NN∑i=1((Yi−2)Mi(θ0+log(^π1/^π2)Mi+¯XTi~β+¯XTi~∇¯Xi/2+\rm tr(~∇Si)/2) +log[1+exp{Mi(θ0+log(^π1/^π2)Mi+¯XTi~β+¯XTi~∇¯Xi/2+\rm tr(~∇Si)/2)}])

Since there is only one independent variable in the logistic regression above, the optimization can be easily and efficiently solved.

For the purpose of evaluating theoretical properties, we apply the sample splitting technique (Wasserman and Roeder, 2009; Meinshausen and Bühlmann, 2010). Specifically, we randomly choose the first batch of and sets from two classes in the training data to obtain estimators and using (3.8) and (3.9). Then is estimated based on the second batch along with and using (3.10). We plug all the estimators in (3.8), (3.9) and (3.10) into the Bayes decision rule (2.2) and obtain the CLIPS classifier,

 ~ϕ(X†)=2−1{log(^π1/^π2)m+~β0+~βT¯x+¯xT~∇¯x/2+\rm tr(~∇S)/2>0}, (3.12)

where and are sample mean and covariance of and is its size.

## 4 Theoretical Properties of CLIPS

In this section, we derive the theoretical properties of the estimators from (3.8)–(3.10) as well as generalization errors for the CLIPS classifier (3.12). In particular, we demonstrate the advantages of having sets of independent observations in contrast to classical QDA setting with individual observations under the homogeneity assumption of Section 2. Parallel results under various time series structures can be found in the supplementary material.

To establish the statistical properties of the thresholded CLIME difference estimator defined in (3.8), we assume that the true quadratic parameter has no more than nonzero entries,

 ∇∈FM0(sq)={A=(aij)∈Rp×p,% \rm symmetric:p∑i,j=11{aij≠0}≤sq}. (4.13)

Denote as the support of the matrix . We summarize the estimation error and a subset selection result in the following theorem.

###### Theorem 3.

Suppose Conditions 1-3 hold. Moreover, assume , with some constant for and with some sufficiently small constant . Then for any fixed with probability at least , we have that

 ∥~∇−∇∥∞ ≤ 2λ′1,N, ∥~∇−∇∥F ≤ 2√sqλ′1,N, ∥~∇−∇∥1 ≤ 2sqλ′1,N,

as long as and in (3.8), where depends on and only. Moreover, we have

###### Remark 4.

The parameter space can be easily extended into an entry-wise ball or weak ball with (Abramovich et al., 2006) and the estimation results in Theorem 3 remain valid with appropriate sparsity parameters. The subset selection result also remains true and the support of contains those important signals of above the noise level . To simplify the analysis, we only consider balls in this work.

###### Remark 5.

Theorem 3 implies that both the error bounds of estimating under vector norm and Frobenius norm rely on the sparsity imposed on rather than those imposed on or . Therefore, even if both and are relatively dense, we still have an accurate estimate of as long as is very sparse and is not large.

The proof of Theorem 3, provided in the supplementary material, partially follows from Cai et al. (2011).

Next we assume is sparse in the sense that it belongs to the -sparse ball,

 β∈F0(sl)={α=(aj)∈Rp:p∑j=11{αj≠0}≤sl}. (4.14)

Theorem 4 gives the rates of convergence of the linear coefficient estimator in (3.9) under the and norms. Both depend on the sparsity of