Classification is a critically important technique for automated, data-driven machine intelligence, with numerous applications in fields ranging from science and technology to medicine, the military, and business. For classifying large-sample data, both theoretical understanding and practical applications have been well developed; nonetheless, when the data dimension or sample size becomes large, accuracy or scalability usually becomes an issue.
Modern data are increasingly high dimensional yet the sample size may be small. Such data usually have a large number of irrelevant entries; feature selection and dimension reduction techniques are often needed before applying classical classifiers that typically require a large sample size. Although some research efforts have been devoted to developing methods capable of classifying high-dimensional data without feature selection, classifiers with more desirable accuracy and efficiency are yet to be discovered.
Diverse areas of scientific research and everyday life are currently deluged with large-scale and/or imbalanced data. Achieving high classification accuracy, scalability, and a balance of precision and recall (in the case of imbalanced data) simultaneously is challenging. While several classic data analytics methods and classifiers have been adapted to the large-scale or imbalanced setting, their performance still falls short of what is desired. There is an urgent need for effective, scalable data mining and prediction techniques suitable for large-scale and/or imbalanced data.
In this paper, we introduce a discriminative regression approach to classification. It estimates a representation for a new example by minimizing the fitting error while explicitly incorporating discrimination information between classes. Based on the estimated representation, categorical information for the new example is subsequently derived. This new family of discriminative regression models can be regarded as an extension of existing regression models such as ridge, lasso, and group lasso regression; for particular parameter values, the new models fall back to several existing models. Because the existing regression models take no explicit account of discrimination between classes whereas the proposed models do, this new family of regressions is more geared toward classification. As a special case, we consider a quadratic model, called the discriminative regression machine (DRM), as the particular classifier of interest in this paper. The DRM admits a closed-form analytical solution, which is suitable for small-scale or high-dimensional data. For large-scale data, three optimization algorithms with lower computational cost are established for the DRM.
The family of discriminative regression-based classifiers is applicable to general types of data, including imagery or other high-dimensional data as well as imbalanced data. Extensive experiments on a variety of real-world data sets demonstrate that the classification accuracy of the DRM is comparable to that of the support vector machine (SVM), a commonly used classifier, on classic large-sample data, and superior to that of several existing state-of-the-art classifiers, including the SVM, on high-dimensional data and imbalanced data. The DRM with the linear kernel, called the linear DRM in analogy to the linear SVM, has classification accuracy superior to that of the linear SVM on large-scale data. The computational cost of the linear DRM algorithms is provably linear in the data size, on a par with that of the linear SVM. Consequently, the linear DRM is a scalable classifier.
As an outline, the main contributions of this paper can be summarized as follows:
A new approach to classification, and with it a family of new regression models and corresponding classifiers, is introduced; these explicitly incorporate the discriminative information for multi-class classification on general data. The formulation is strongly convex and admits a unique global optimum.
The DRM is constructed as a special case of the family of new classifiers. It involves only quadratic optimization, with simple formulation yet powerful performance.
A closed-form solution to the DRM optimization is obtained, which is computationally suited to classification of moderate-scale or high-dimensional data.
For large-scale data, three iterative algorithms are proposed for solving the DRM optimization substantially more efficiently. They all have theoretically proven convergence at a rate no worse than linear: the first two are at least linear and the third is quadratic.
The DRM with a general kernel demonstrates an empirical classification accuracy that is comparable to the SVM on classic (small- to mid-scale) large-sample data, while superior to the SVM or other state-of-the-art classifiers on high-dimensional data and imbalanced data.
The linear DRM demonstrates an empirical classification accuracy superior to the linear SVM on large-scale data, using any of the three iterative optimization algorithms. The cost of all three algorithms scales linearly in the sample size or the number of features.
In this paper, capital letters denote matrices while small letters denote variables or vectors. denotes the identity matrix whose size will be clear from the context. The transpose of (or ) is denoted by (or ). The superscript is the iteration counter. We use to represent a diagonal or block diagonal matrix, and the trace operator. The maximal and minimal singular values of are denoted by and , respectively. The spectral norm of is denoted by .
The rest of this paper is organized as follows. Section II discusses related research work. Sections III and IV introduce our formulations of discriminative regression and its kernel version. The discriminative regression-based classification is derived in Section V. In Section VI we construct the discriminative regression machine. Experimental results are presented in Section VII. Finally, Section VIII concludes this paper.
II Related Work
There is of course a vast literature on classification, and a variety of methods have been proposed. Here we briefly review only some closely related methods; more thorough accounts of various classifiers and their properties can be found in the literature.
Fisher’s linear discriminant analysis (LDA) has been commonly used in a variety of applications, for example, Fisherface and its variants for face recognition. It is afflicted, however, by the rank deficiency problem or covariance matrix estimation problem when the sample size is less than the number of features. For a given test example, the nearest neighbor method uses only the one closest training example (or, more generally, the k-nearest neighbor (KNN) method uses the k closest training examples) to supply categorical information, while the nearest subspace method uses all training examples to do so. These methods are affected much less by the rank deficiency problem than the LDA. Owing to the increasing availability of high-dimensional data such as text documents, images, and genomic profiles, analytics of such data have been actively explored for about a decade, including classifiers tailored to high-dimensional data. As the number of features is often in the tens of thousands while the sample size may be only a few tens, some traditional classifiers such as the LDA suffer from the curse of dimensionality. Several variants of the LDA have been proposed to alleviate this problem. For example, regularized discriminant analysis (RDA) uses regularization to address the rank deficiency problem, while penalized linear discriminant analysis (PLDA) adds a penalization term to improve covariance matrix estimation.
Support vector machine (SVM) is one of the most widely used classification methods, with a solid theoretical foundation and excellent practical applications. By minimizing a hinge loss with either an L-2 or L-1 penalty, it constructs a maximum-margin hyperplane between two classes in the instance space and extends to a nonlinear decision boundary in the feature space using kernels. Many variants have been proposed, including primal and dual solvers, and extensions to the multiclass case.
The size of the training sample can be small in practice. Besides reducing the number of features with feature selection, the size of the training sample can be increased with relevance feedback and active learning techniques, which may also help specify and refine the query information from the users. However, owing to the high cost of wet-lab experiments in biomedical research or the scarcity of rare events, it is often hard or even impractical to increase the sizes of minority classes in many real-world applications; therefore, imbalanced data sets are often encountered in practice. It has been noted that the random forest (RF) is relatively insensitive to imbalanced data. Furthermore, various methods have been developed specifically for handling imbalanced data, which may be classified into three categories: data-level, algorithm-level, and hybrid methods. Data-level methods change the training set to make the distributions of training examples over different classes more balanced, with under-sampling [46, 62], over-sampling [16, 12], and hybrid-sampling techniques. Algorithm-level methods may compensate for the bias of some existing learning methods towards majority classes. Typical methods include cost-sensitive learning [60, 47], which assigns greater weights to losses from minority classes; kernel perturbation techniques, which perturb the kernel to increase the densities of minority classes and thereby alleviate the adverse effect of imbalance; and multi-objective optimization approaches with, e.g., the bi-objective of accuracy and G-mean or the multi-objective of the accuracies of all classes. Multi-objective optimization approaches may naturally balance the conflicting accuracies of majority and minority classes; their optimizations, however, can only seek Pareto optimal solutions, with genetic programming often deployed. Hybrid methods attempt to combine the advantages of data-level and algorithm-level methods. In this paper we develop an algorithm-level method whose objective function is convex, leading to a high-quality, fast, scalable optimization and solution.
Large-scale data learning has been an active topic of research in recent years. For large-scale data, especially big data, the linear kernel is often adopted in learning techniques as a viable way to gain scalability. For classification on large-scale data, fast solvers for the primal SVM with the linear kernel, called the linear SVM, have been extensively studied. Early work uses finite Newton methods to train the L2 primal SVM. Because the hinge loss is not smooth, a generalized second-order derivative and a generalized Hessian matrix need to be used, which is not sufficiently efficient. A stochastic descent method has been proposed to solve the primal linear SVM for a wide class of loss functions. Another approach solves the L1 primal SVM with a cutting plane technique. Pegasos alternates between gradient descent and projection steps for training large-scale linear SVM. Another method, similar to Pegasos, also uses stochastic descent for solving the primal linear SVM and has been reported to achieve more competitive performance. For the L2 primal linear SVM, a Trust RegiOn Newton method (TRON) has been proposed, which converges at a fast rate though the arithmetic complexity at each iteration is high. It can also solve logistic regression problems. More efficient algorithms than Pegasos or TRON use coordinate descent methods for solving the L2 primal linear SVM: primal coordinate descent (PCD) works in the primal domain, while dual coordinate descent (DCD) works in the dual domain. Libraries of such methods are provided in the Liblinear and Libsvm toolboxes.
III Problem Statement and Discriminative Regression
III-A Notations and Problem Statement
Our setup is the usual multiclass setting where the training data set is denoted by , with observation and class label . Supposing the th class has observations, we may also denote these training examples in by . Obviously . Without loss of generality, we assume the examples are given in groups; that is, with , and . A data matrix is formed as of size . For traditional large-sample data, , with a particular case being big data where is big; while for high-dimensional data, . As will be clear later on, the proposed discriminative regressions have no restrictions on or , and thus are applicable to general data including both large-sample and high-dimensional data. Given a test example , the task is to decide what class this test example belongs to.
III-B Discriminative Regression
In this paper, we first consider a linear relationship among the examples in the instance space; subsequently, in Section IV, we account for nonlinearity by exploiting kernel techniques to map the data to a kernel space where a linear relationship may be better observed.
If the class information is available and , a linear combination of , , is often used in the literature to approximate in the instance space; see, for example,   for face and object recognition. By doing so, it is equivalent to assuming . In the absence of the class information for , a linear model can be obtained from all available training examples:
where is a vector of combining coefficients to be estimated, and is additive zero-mean noise. Note that we may re-index with in accordance with the groups of .
To estimate which is regarded as a representation of , classic ridge regression or lasso uses an optimization model,
with corresponding to ridge regression or to lasso. The first term of this model is for data fitting while the second for regularization. While this model has been widely employed, a potential limitation in terms of classification is that the discriminative information about the classes is not taken into account in the formulation.
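As a minimal sketch of the ridge case of this model, the snippet below estimates the combining coefficients of a test example from a training matrix. The names `A`, `x`, `lam`, and `ridge_representation`, the column-per-example layout, and the objective scaling `||x - A w||^2 + lam * ||w||^2` are illustrative assumptions and may differ from the paper's formulation by constant factors.

```python
import numpy as np

def ridge_representation(A, x, lam):
    """Return argmin_w ||x - A w||^2 + lam * ||w||^2.

    A   : (p, n) array whose columns are training examples (assumed layout)
    x   : (p,) test example
    lam : positive regularization weight (hypothetical name)
    """
    n = A.shape[1]
    # Normal equations of the ridge objective: (A^T A + lam I) w = A^T x
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ x)

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 8))               # 8 training examples, 50 features
x = A[:, 2] + 0.01 * rng.standard_normal(50)   # noisy copy of training example 2
w = ridge_representation(A, x, lam=1e-3)
print(np.argmax(np.abs(w)))                    # index of the dominant coefficient
```

Note that no class label enters this computation, which is the limitation the text points out.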
To gear the model more toward classification, we propose to incorporate class discriminativeness into the regression. To this end, we desire the following properties: 1) maximal between-class separability and 2) maximal within-class similarity.
The class-level regularization may help achieve between-class separability. Besides this, we desire maximal within-class similarity because it may help enhance accuracy and also robustness, for example, in combating class noise. To explicitly incorporate maximal within-class similarity into the objective function, inspired by the LDA, we propose to minimize the within-class dis-similarity induced by the combining vector and defined as
where are the weighted class mean values. Being a quadratic function in , this helps facilitate efficient optimization as well as scalability of our approach.
It is noted that defined above is starkly different from the classic LDA formulation of the scattering matrix. The LDA seeks a unit vector - a direction - to linearly project the features; whereas we formulate the within-class dis-similarity using the weighted instances by any vector . Consequently, the projection direction in the LDA is sought in the -dimensional feature space, while the weight vector of is defined in the -dimensional instance space.
We may also define -weighted between-class separability and total separability as follows,
where . With straightforward algebra the following equality can be verified.
, for any .
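The following sketch checks numerically the usual scatter decomposition (total = within + between) for examples weighted by a combining vector. It assumes LDA-style definitions as unscaled sums of squared distances to the weighted class means and to the overall weighted mean; the paper's own definitions may carry different constant factors. The function name `weighted_scatters` and the column-per-example layout of `A` are hypothetical.

```python
import numpy as np

def weighted_scatters(A, labels, w):
    """Within-, between-, and total scatter of the weighted examples w_j * a_j.

    A      : (p, n) array whose columns are training examples (assumed layout)
    labels : (n,) array of class labels
    w      : (n,) combining vector
    """
    Z = A * w                                   # column j becomes w_j * a_j
    m = Z.mean(axis=1, keepdims=True)           # overall weighted mean
    s_within = 0.0
    s_between = 0.0
    for c in np.unique(labels):
        Zc = Z[:, labels == c]
        mc = Zc.mean(axis=1, keepdims=True)     # weighted mean of class c
        s_within += np.sum((Zc - mc) ** 2)
        s_between += Zc.shape[1] * np.sum((mc - m) ** 2)
    s_total = np.sum((Z - m) ** 2)
    return s_within, s_between, s_total
```

Under these assumed definitions, the first two returned values sum to the third for any choice of `w`, up to floating-point error.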
Incorporating discriminative information between classes into the regression, we formulate the following unconstrained minimization problem,
where and are nonnegative balancing factors, is defined as , , , and is a nonnegative valued function. Often is chosen to be the identity function or other simple, convex functions to facilitate optimization.
1. When , model (4) is simply least squares regression.
2. When , , and , (4) reduces to standard ridge regression.
3. When , , and , (4) falls back to classic lasso type of sparse representation.
4. When , discriminative information is injected into the regression, hence the name of discriminative regression.
5. We consider mainly for regularizing the minimization.
IV Discriminative Regression in Kernel Space
When approximating from the training observations , the objective function of (4) is empowered with discriminative capability explicitly. This new discriminative regression model, nonetheless, takes no account of any nonlinearity in the input space. As shown in , the examples may reside (approximately) on a low-dimensional nonlinear manifold in an ambient space of . To capture the nonlinear effect, we allow this model to account for the nonlinearity in the input space by exploiting the kernel technique.
Consider a potentially nonlinear mapping where is the dimension of the image space after the mapping. To reduce computational complexity, the kernel trick is applied to calculate the Euclidean inner product in the kernel space, , where is a kernel function induced by . The kernel function satisfies the finitely positive semidefinite property: for any , the Gram matrix with element is symmetric and positive semidefinite (p.s.d.). If we can find a nonlinear mapping such that, after mapping into the kernel space, the examples approximately satisfy the linearity assumption, then (4) will be applicable in such a kernel space.
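To illustrate the finitely positive semidefinite property, the sketch below builds a Gram matrix with the Gaussian (RBF) kernel, chosen here only as a familiar example, and inspects its symmetry and smallest eigenvalue. The function name `rbf_gram`, the parameter `gamma`, and the column-per-example layout are assumptions made for illustration.

```python
import numpy as np

def rbf_gram(A, gamma):
    """Gram matrix K[i, j] = exp(-gamma * ||a_i - a_j||^2).

    A     : (p, n) array whose columns a_j are examples (assumed layout)
    gamma : positive kernel width parameter (hypothetical name)
    """
    sq = np.sum(A ** 2, axis=0)                       # squared norms ||a_j||^2
    d2 = sq[:, None] + sq[None, :] - 2.0 * (A.T @ A)  # pairwise squared distances
    return np.exp(-gamma * np.maximum(d2, 0.0))       # clip tiny negative round-off

rng = np.random.default_rng(2)
K = rbf_gram(rng.standard_normal((4, 10)), gamma=0.5)
print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min())
```

The inner products in the kernel space are obtained from kernel evaluations alone, without forming the mapped examples.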
Specifically, we consider the extended linear model (1) in the kernel space, , to minimize , , and the variance of as (4) does. Here , is obtained by replacing each with in (3), and . The kernel matrix of training examples is denoted by
Obviously the kernel matrix is p.s.d. by the property of . We may derive a simple matrix form of with straightforward algebra,
, and , .
The matrix is p.s.d.
Because is block diagonal, we need only to show that is p.s.d. with . For any , we have
The last inequality holds because the kernel is finitely p.s.d.; that is, for any and
Similarly we obtain a simple matrix form of ,
The matrix is p.s.d.
Let with in accordance with the groups of , i.e., the blocks of . Corresponding to , we define in a block-wise form as . Then
The last inequality holds due to the fact that is the kernel matrix of . ∎
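The two positive semidefiniteness claims can be checked numerically. The sketch below assumes the within-class and between-class quantities are unscaled sums of squared distances, which gives the quadratic forms `w^T H w` and `w^T G w` with `H = diag(K) - blockdiag(K_c / n_c)` and `G = blockdiag(K_c / n_c) - K / n`. These matrix forms are derived under that assumption and may differ from the paper's by constant factors; `scatter_matrices` is a hypothetical name.

```python
import numpy as np

def scatter_matrices(K, labels):
    """Return (H, G) with within-class scatter w^T H w and between-class w^T G w.

    K      : (n, n) kernel (Gram) matrix of the training examples
    labels : (n,) array of class labels (examples need not be pre-grouped)
    """
    n = K.shape[0]
    D = np.diag(np.diag(K))                     # diag(k(x_j, x_j))
    B = np.zeros((n, n))
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        # Class block K_c scaled by 1 / n_c; entries across classes stay zero.
        B[np.ix_(idx, idx)] = K[np.ix_(idx, idx)] / idx.size
    H = D - B                                   # within-class matrix
    G = B - K / n                               # between-class matrix
    return H, G
```

With any valid kernel matrix, the smallest eigenvalues of both returned matrices should be nonnegative up to round-off, in line with the two propositions.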
Having formulated (4) in the input space, we directly extend it to the discriminative regression in the kernel space
Plugging in the expression of and omitting the constant term, we obtain the discriminative regression model,
and , . Apparently we have , , .
With special types of -regularization such as linear, quadratic, and conic functions, the optimization in (9) is a quadratic (cone) programming problem that admits a global optimum. Regarding , we have the freedom to choose it so as to facilitate the optimization thanks to the following property.
With varying over the range of while fixed, the set of minimum points of (9) remains the same for any monotonically increasing function .
It can be shown by scalarization for multicriterion optimization similarly to Proposition 3.2 of . ∎
V Discriminative Regression-Based Classification
Having estimated from (9) the vector of optimal combining coefficients, denoted by , we use it to determine the category of . In this paper, we mainly consider the case in which the groups are exhaustive of the whole instance space; in the non-exhaustive case, the corresponding minimax decision rule can be used analogously.
In the exhaustive case, the decision function is built with the projection of onto the subspace spanned by each group. Specifically, the projection of onto the subspace of is , where represents restricting to in that , with being the indicator function. Similarly, denoting by , the projection of onto the subspace of is . Now define
which measures the dis-similarity between and the examples in class . Then the decision rule chooses the class with the minimal dis-similarity. This rule has a clear geometrical interpretation and it works intuitively: When truly belongs to , , while due to the class separability properties imposed by the discriminative regression, and thus, is approximately ; whereas for , , while as , and thus is approximately . Hence the decision rule picks for . In the kernel space the corresponding is derived in the minimax sense as follows ,
And the corresponding decision rule is
Now we obtain the discriminative regression-based classification outlined in Algorithm 1. Related parameters such as and , and a proper kernel can be chosen using the standard cross validation (CV).
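A minimal sketch of a minimal-dissimilarity decision rule in the kernel space is given below. The dissimilarity used here, the squared residual of the test example to its projection on one class plus the squared norm of the projection on the remaining classes, is an assumed form consistent with the geometric description above; it is not copied from the paper's displayed formula. All names (`classify`, `k_x`, `k_xx`) are hypothetical.

```python
import numpy as np

def classify(K, k_x, k_xx, w, labels):
    """Pick the class with the smallest assumed dissimilarity.

    K      : (n, n) Gram matrix of the training examples
    k_x    : (n,) vector with k_x[j] = k(x_j, x) for the test example x
    k_xx   : scalar k(x, x)
    w      : (n,) estimated combining coefficients
    labels : (n,) class labels of the training examples
    """
    classes = np.unique(labels)
    scores = []
    for c in classes:
        d = np.where(labels == c, w, 0.0)       # restrict w to class c
        rest = w - d                            # coefficients of all other classes
        # ||phi(x) - Phi d||^2 + ||Phi rest||^2, expanded with kernel evaluations
        r = k_xx - 2.0 * (d @ k_x) + d @ K @ d + rest @ K @ rest
        scores.append(r)
    scores = np.array(scores)
    return classes[int(np.argmin(scores))], scores
```

When the coefficients concentrate on the true class and reproduce the test example well, both terms are small for that class and large for the others, which is the behavior the decision rule relies on.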
How to solve optimization problem (9) efficiently in Step 3 is critical to our algorithm. We mainly intend to use such that is convex in to efficiently attain the global optimum of (9). Any standard optimization tool may be employed including, for example, the CVX package. In the following, we shall focus on a special case of for the sake of deriving efficient and scalable optimization algorithms for high-dimensional data and large-scale data; nonetheless, it is noted that the discriminative regression-based classification is constructed for general and .
VI Discriminative Regression Machine
Discriminative regression-based classification with a particular regularization of will be considered, leading to the DRM.
VI-A Closed-Form Solution to DRM
Let and , then we have the regularization term . The discriminative regression problem (9) reduces to
which is an unconstrained convex quadratic optimization problem leading to a closed-form solution,
with and . Hereafter we always require to ensure the existence of the inverse matrix, because is p.s.d. and is strictly positive definite. The minimization problem (13) can be regarded as a generalization of kernel ridge regression. Indeed, when kernel ridge regression is obtained. More interestingly, when the discriminative information is incorporated to go beyond the kernel ridge regression.
For clarity, the DRM with the closed-form formula is outlined as Algorithm 2.
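As a hedged sketch of a closed-form solve of this kind, the snippet below minimizes the quadratic `0.5 * w^T (K + alpha*H + beta*I) w - w^T k_x`. This scaling of the three terms is an assumption; the paper's objective may differ by constant factors. The names `drm_closed_form`, `alpha`, `beta`, and `k_x` are hypothetical, and `H` stands for any p.s.d. within-class matrix.

```python
import numpy as np

def drm_closed_form(K, H, k_x, alpha, beta):
    """Solve (K + alpha*H + beta*I) w = k_x.

    K     : (n, n) p.s.d. kernel matrix of the training examples
    H     : (n, n) p.s.d. within-class matrix
    k_x   : (n,) kernel evaluations between training examples and the test example
    alpha : nonnegative weight of the discriminative term
    beta  : positive ridge weight; it makes the system matrix positive definite
    """
    if beta <= 0:
        raise ValueError("beta must be positive so that the system is invertible")
    Q = K + alpha * H + beta * np.eye(K.shape[0])
    # Solving the linear system avoids forming the inverse explicitly.
    return np.linalg.solve(Q, k_x)
```

Setting `alpha=0` in this sketch gives the kernel ridge regression solution, matching the special case noted in the text.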
1. The computational cost of the closed-form solution (14) is , because calculating and from costs while, in general, the matrix inversion is by using, for example, QR decomposition or Gaussian elimination.
2. For high-dimensional data with , the cost is with Algorithm 2. This cost is scalable in when is moderate.
3. For large-sample data with , the complexity of using (14) is , which renders Algorithm 2 impractical with big . To have a more efficient and scalable, though possibly inexact, solution to (13), in the sequel we will construct three iterative algorithms suitable for large-scale data.
VI-B DRM Algorithm Based on Gradient Descent (DRM-GD)
For large-scale data with big and , the closed-form solution (14) may be computationally impractical in real applications, which calls for more efficient and scalable algorithms.
Three new algorithms will be established to alleviate this computational bottleneck. In light of the linear SVM on large-scale data classification, e.g., , we will consider the use of the linear kernel in the DRM, which turns out to have provable linear cost. The first algorithm relies on classic gradient descent optimization method, denoted by DRM-GD; the second hinges on an idea of proximal-point approximation (PPA) similar to  to eliminate the matrix inversion; and the third uses a method of accelerated proximal gradient line search for theoretically proven quadratic convergence. This section presents the algorithm of DRM-GD, and the other two algorithms will be derived in subsequent sections.
Denote by the objective function of the DRM,
Its gradient is
Suppose we have obtained at iteration . At next iteration, the gradient descent direction is , in which the optimal step size is given by the exact line search,
Thus, the updating rule using gradient descent is
The resulting DRM-GD is outlined in Algorithm 3. Its main computational burden is on which costs, in general, floating-point operations (flops)  for a general kernel. This paper mainly considers multiplications for flops. After getting , is obtained by matrix-vector product with flops, and so is with flops. Thus, the overall count of flops is about with a general kernel.
With the linear kernel the computational cost can be further reduced by exploiting its particular structure of . Given any , can be computed as which costs flops for any , and thus computing requires flops. Similarly, takes flops by matrix-vector product, since can be readily obtained from computation. As , the total count of flops is for getting . Therefore, computing needs flops. Each of Sections VI-B and 15 requires and types of computation, hence each iteration of updating to costs overall flops, including additional ones for computing and .
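The structure exploited by the linear kernel can be sketched as follows: the products with the kernel matrix and with the within-class matrix are computed through the data matrix, so neither n-by-n matrix is ever formed. The sketch assumes `K = A^T A` with examples as columns of `A` and `H = diag(K) - blockdiag(K_c / n_c)`; the exact form of `H` in the paper may differ by constants, and `linear_kernel_matvecs` is a hypothetical name.

```python
import numpy as np

def linear_kernel_matvecs(A, labels, v):
    """Return (K v, H v) for the linear kernel without forming K or H.

    A      : (p, n) data matrix whose columns are training examples
    labels : (n,) class labels
    v      : (n,) vector to multiply
    """
    Kv = A.T @ (A @ v)                          # two matrix-vector products
    Hv = np.sum(A ** 2, axis=0) * v             # diag(K) v, computed elementwise
    for c in np.unique(labels):
        mask = labels == c
        Ac = A[:, mask]
        # Subtract the class block (A_c^T A_c / n_c) applied to the class part of v.
        Hv[mask] -= Ac.T @ (Ac @ v[mask]) / mask.sum()
    return Kv, Hv
```

Each product touches every entry of `A` a constant number of times, so the number of multiplications grows linearly in the number of entries of the data matrix.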
As a summary, the property of the DRM-GD, including the cost, is given in Theorem VI.1.
Theorem VI.1 (Property of DRM-GD).
With any , generated by Algorithm 3 converges in function values to the unique minimum of with a linear rate. Each iteration costs flops with a general kernel, in particular, flops with the linear kernel.
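A minimal sketch of gradient descent with exact line search on a strongly convex quadratic is shown below. It assumes the objective can be written as `f(w) = 0.5 * w^T Q w - b^T w` with `Q` symmetric positive definite (for instance `K + alpha*H + beta*I`) and `b` the kernel vector of the test example; this scaling, the stopping rule, and the name `drm_gd` are assumptions rather than the paper's Algorithm 3.

```python
import numpy as np

def drm_gd(Q, b, w0, tol=1e-10, max_iter=10_000):
    """Steepest descent with exact line search on f(w) = 0.5 w^T Q w - b^T w.

    Q        : (n, n) symmetric positive definite matrix
    b        : (n,) linear term
    w0       : (n,) starting point
    tol      : stop when the gradient norm falls below this value
    max_iter : safety cap on the number of iterations
    """
    w = np.array(w0, dtype=float)               # work on a float copy
    for _ in range(max_iter):
        g = Q @ w - b                           # gradient of f at w
        gg = g @ g
        if gg <= tol ** 2:
            break
        step = gg / (g @ (Q @ g))               # exact minimizer of f along -g
        w -= step * g
    return w
```

For a quadratic objective the exact step size has this closed form, so each iteration needs only matrix-vector products with `Q`.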
VI-C DRM Algorithm Based on Proximal Point Approximation
PPA is a general method for finding a zero of a maximal monotone operator and solving non-convex optimization problems . Many algorithms have been shown to be its special cases, including the method of multipliers, the alternating direction method of multipliers, and so on. Our DRM optimization is a convex problem; nonetheless, we employ the idea of PPA in this paper for potential benefits of efficiency and scalability. At iteration , having obtained the minimizer , we construct an augmented function around it,
where is a constant satisfying . We minimize to update the minimizer,
As is a strongly convex, quadratic function, its minimizer is calculated directly by setting its first-order derivative equal to zero,
which results in
The computational cost of each iteration is reduced compared to the closed-form solution (14), which is formally stated in the following.
With a general kernel, the updating rule (18) costs about flops.
Proof: The cost of getting is , since it has elements and each costs flops. Subsequently, needs about flops. Hence computing needs about flops. Evidently needs flops.
As a summary, the PPA-based DRM, called DRM-PPA, is outlined in Algorithm 4. Starting from any initial point , repeatedly applying (18) in Algorithm 4 generates a sequence of points . The property of this sequence will be analyzed subsequently.
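The inversion-free idea can be sketched under stated assumptions. Suppose the objective is `f(w) = 0.5 * w^T Q w - b^T w` and the augmented function around the current iterate `w_t` is `f(w) + 0.5 * (w - w_t)^T (c*I - Q) (w - w_t)` with `c` at least the largest eigenvalue of `Q`. Setting its gradient to zero gives the update `w <- w_t - (Q w_t - b) / c`. This form of the augmented function, the constant `c`, and the name `drm_ppa` are assumptions and may differ from the paper's Algorithm 4.

```python
import numpy as np

def drm_ppa(Q, b, w0, c=None, n_iter=500):
    """Iterate w <- w - (Q w - b) / c, which needs no matrix inversion.

    Q      : (n, n) symmetric positive definite matrix
    b      : (n,) linear term
    w0     : (n,) starting point
    c      : constant no smaller than the largest eigenvalue of Q;
             defaults to the spectral norm of Q
    n_iter : number of iterations to run
    """
    if c is None:
        c = np.linalg.norm(Q, 2)                # largest singular value of Q
    w = np.array(w0, dtype=float)               # work on a float copy
    for _ in range(n_iter):
        w -= (Q @ w - b) / c                    # minimizer of the augmented function
    return w
```

In this sketch the error contracts by the factor `I - Q / c` at every step, which is the source of the linear rate; any cheaper upper bound on the largest eigenvalue, such as the trace of a p.s.d. `Q`, could replace the spectral norm at the price of slower progress.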
Theorem VI.2 (Convergence and optimality of DRM-PPA).
Before the proof, we point out two properties of which are immediate by the definition,
Proof of Theorem VI.2: By re-writing as
we know is lower bounded by . From the following chain of inequalities it is clear that is a monotonically non-increasing sequence,
Next we will further prove that exists and it is the unique minimizer of . By using the following equality
Here, the first inequality holds by the definition of the spectral norm, and the subsequent equality holds because . Because and is p.s.d., we have , and . Hence, as , for any ; that is, . The uniqueness is because of the strict convexity of .
As a consequence of the above proof, the convergence rate is also readily obtained.
Theorem VI.3 (Convergence rate of DRM-PPA).
The convergence rate of generated by Algorithm 4 is at least linear. Given a convergence threshold , the number of iterations is upper bounded by , which is approximately when