The idea of using quantiles in classification is relatively recent and largely unexplored. The median classifier for high-dimensional problems proposed by Hall, Titterington and Xue (2009), which calculates the
distance of the coordinates of a multivariate data point from componentwise medians (rather than centroids), is particularly advantageous when data exhibit heavy-tailed or skewed distributions. Building on Hall, Titterington and Xue's (2009) idea, Hennig and Viroli (2016a) proposed quantile classifiers, which hinge on the sum of distances from componentwise quantiles at some generic level $\theta \in (0,1)$. The ensemble quantile classifier by Lai and McLeod (2020)
assigns weights to the componentwise distances by minimising a regularised loss function, where the regularisation parameter is determined by cross-validation.
In all the studies mentioned above, quantiles are calculated marginally for each input variable (componentwise). This implies that their calculation ignores the possible interdependence among variables. In this study, we consider directional quantiles for multivariate distributions (Kong and Mizera, 2012) to address this limitation. Our choice is motivated by several reasons. First, as already mentioned, the dependence among variables is taken into account by computing linear combinations of the input variables. Second, directional quantiles have a simple interpretation since the projections' weights embody the relative importance of the variables involved in the classification problem. Finally, in the special case of canonical directions (with the number of directions equal to the number of variables), the use of directional quantiles leads to the componentwise quantile classifier (Hennig and Viroli, 2016a), and thus inherits asymptotic optimal properties as shown in the Appendix. Directional quantiles have already found application in risk classification problems (Geraci et al., 2020) and proved to be a worthwhile alternative to risk classification based on componentwise quantile thresholds.
In general, the application of our methods does not require any assumption on the shape of the population distributions. We derive asymptotic theoretical properties of the proposed classifier under the assumption that the distributions of the alternative populations differ by at most a location shift. While this assumption may be unrealistic in practice, empirical results support the merit of the proposed classifier also when the distributions differ by shape and not just by location.
The rest of the paper is organised as follows. In the next section, we introduce notation and basic definitions, followed by our proposal of directional quantile classifiers. Theoretical results are stated in Section 3. We report the results of a simulation study in Section 4 and of a real data analysis in Section 5. Concluding remarks are given in Section 6. All proofs of theoretical results are reported in Appendix A. A software implementation of our approach can be found in the package Qtools (Geraci, 2016), freely available on the Comprehensive R Archive Network (R Core Team, 2020).
2.1 Notation and definitions
Let $\mathbf{X}_1$ and $\mathbf{X}_2$ denote two $p$-variate random variables with absolutely continuous distributions $F_1$ and $F_2$, defined on the same space $\mathcal{X} \subseteq \mathbb{R}^p$ for two populations $\Pi_1$ and $\Pi_2$, respectively. The marginal distributions of the components of $\mathbf{X}_k$ are denoted by $F_{kj}$, for $j = 1, \ldots, p$ and $k = 1, 2$. Further, $I(\cdot)$ denotes the indicator function, which is equal to $1$ if its argument is true, and $0$ otherwise.
Our goal is to assign a new observation $\mathbf{x} = (x_1, \ldots, x_p)^\top$ to either $\Pi_1$ or $\Pi_2$ according to how close the point is to one or the other. In quantile-based classification (Hennig and Viroli, 2016a), the distance is first calculated for each component of $\mathbf{x}$ using the asymmetrically weighted loss function
$$\Phi_\theta\{x_j - q_{kj}(\theta)\} = \left[\theta - I\{x_j - q_{kj}(\theta) < 0\}\right]\{x_j - q_{kj}(\theta)\}, \qquad (1)$$
for $j = 1, \ldots, p$ and $k = 1, 2$, where $q_{kj}(\theta)$ is the componentwise quantile at level $\theta$ for the $k$th population, which can be obtained by inversion of $F_{kj}$. Subsequently, $\mathbf{x}$ is assigned to $\Pi_1$ if the discrepancy
$$\sum_{j=1}^{p}\left[\Phi_\theta\{x_j - q_{2j}(\theta)\} - \Phi_\theta\{x_j - q_{1j}(\theta)\}\right] \qquad (2)$$
is positive, and to $\Pi_2$ otherwise. The quantile classifier reduces to the componentwise median classifier of Hall, Titterington and Xue (2009) for $\theta = 0.5$. An extension of (2) to more than two populations is straightforward.
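To make the componentwise rule concrete, here is a minimal Python sketch of the loss in (1) and the sign rule in (2). The function names are illustrative only; the paper's own implementation is in the R packages Qtools and quantileDA.

```python
import numpy as np

def phi(theta, u):
    """Asymmetric loss (1): u * (theta - I(u < 0))."""
    return u * (theta - (u < 0))

def quantile_classifier(x, train1, train2, theta=0.5):
    """Assign x to population 1 if the discrepancy (2) is positive,
    i.e. if x is closer to the componentwise theta-quantiles of class 1."""
    q1 = np.quantile(train1, theta, axis=0)  # componentwise quantiles, class 1
    q2 = np.quantile(train2, theta, axis=0)  # componentwise quantiles, class 2
    d = np.sum(phi(theta, x - q2) - phi(theta, x - q1))
    return 1 if d > 0 else 2
```

With $\theta = 0.5$ this is exactly the componentwise median classifier; other levels shift the implicit decision boundary away from the medians.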
The classification rule based on (2) does not acknowledge the possible interdependence among the variables, since quantiles are obtained marginally for each variable. We address this limitation by using directional quantiles for multivariate data (Kong and Mizera, 2012). We now explain our idea informally and, in the next section, give a rigorous treatment.
Define $\mathbf{u} = (u_1, \ldots, u_p)^\top$ to be a vector with unit norm in $\mathbb{R}^p$. Throughout this paper, our focus will be on the projected random variables $Z_k = \mathbf{u}^\top \mathbf{X}_k$, $k = 1, 2$, defined on $\mathbb{R}$. By assumption, the $Z_k$'s are continuous. We denote the corresponding distribution and density functions with $G_k$ and $g_k$, respectively.
Our goal is to develop a classifier where the quantities in (1) are opportunely redefined on the corresponding projections along $\mathbf{u}$ to capture the multivariate nature of the distributions, namely
$$\Phi_\theta\{\mathbf{u}^\top \mathbf{x} - q_{k}(\theta, \mathbf{u})\}, \qquad (3)$$
for $k = 1, 2$, where $q_k(\theta, \mathbf{u})$ is the $\theta$th quantile of $Z_k$. The latter is obtained by inverting $G_k$ and it can be recognised as the $\theta$th directional quantile of $\mathbf{X}_k$ in the direction $\mathbf{u}$ (Kong and Mizera, 2012).
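A directional quantile along a fixed direction is simply a univariate quantile of the projected data, so the empirical version is one line of code. A minimal sketch (the helper name is hypothetical), using the empirical quantile as a stand-in for the inverse of $G_k$:

```python
import numpy as np

def directional_quantile(X, u, theta):
    """theta-th directional quantile of the sample X along direction u:
    the ordinary theta-quantile of the projections X @ u (u normalised)."""
    u = np.asarray(u, dtype=float)
    u = u / np.linalg.norm(u)  # enforce unit norm
    return np.quantile(X @ u, theta)
```

For a canonical direction (a single 1 and zeros elsewhere) this reduces to the componentwise quantile of the selected variable, consistent with the remark on canonical directions above.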
By working with projections, we essentially reduce a multivariate problem to a univariate one. Clearly, one difficulty to address is how many, and which, directions should be considered. To this end, note that not all directions are equally useful for classification. To exemplify, consider Figure 1, which depicts bivariate normal samples from two independent populations centred at (1,1) and (3,3), respectively, with the same variance. We want to assign a new observation $\mathbf{x}_0$ to one of the two populations. The log-density of $\mathbf{x}_0$ differs under the two populations, and the larger of the two values suggests the population from which $\mathbf{x}_0$ has more likely been generated. Now compute the distances $\Phi_\theta\{\mathbf{u}^\top \mathbf{x}_0 - q_k(\theta, \mathbf{u})\}$, $k = 1, 2$, as in (3) for four normalised directions. The results are reported in Table 1. Based on a minimum-distance principle, we assign $\mathbf{x}_0$ to the same population indicated by the maximum likelihood principle for three, though not all four, directions.
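The toy comparison above can be mimicked numerically. The sketch below uses made-up values (populations centred at (1,1) and (3,3) as in Figure 1, a hypothetical test point, and four arbitrary directions); it is not a reproduction of Table 1, only an illustration of why some directions are uninformative.

```python
import numpy as np

rng = np.random.default_rng(42)
A = rng.normal(loc=[1.0, 1.0], scale=1.0, size=(500, 2))  # sample, population 1
B = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(500, 2))  # sample, population 2
x0 = np.array([1.5, 1.5])                                 # made-up new observation
theta = 0.5

def phi(theta, u):
    """Asymmetric loss (1)."""
    return u * (theta - (u < 0))

for u in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]):
    u = np.asarray(u) / np.linalg.norm(u)
    d1 = phi(theta, x0 @ u - np.quantile(A @ u, theta))  # distance to pop. 1
    d2 = phi(theta, x0 @ u - np.quantile(B @ u, theta))  # distance to pop. 2
    # (1,-1)/sqrt(2) is orthogonal to the line joining the centres: both
    # projected medians sit near 0, so this comparison is essentially noise
    print(u.round(3), "-> assign to", 1 if d1 < d2 else 2)
```

The direction orthogonal to the segment joining the two centres carries no class information, which is precisely why the next section introduces weights on directions.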
2.2 Directional quantile classifier
Let $\Theta = \{\theta_1, \ldots, \theta_m\}$ be a set of $m$ distinct quantile levels on $(0,1)$. Also, define the set $U_h$ containing $s$ normalised directions associated with $\theta_h$, $h = 1, \ldots, m$, and let $U = \bigcup_{h=1}^{m} U_h$. (Note that for convenience one may set $U_h = U_{h'}$ for $h \neq h'$.)
As mentioned in the previous section, we need to be wary of particular directions that may lead us to a classification error. Therefore, we introduce weights associated with each direction to decrease (or increase) their relative importance. Let $\mathbf{w}$ denote the vector of all such weights. We propose the discrepancy
A difficulty associated with the calculation of (4) is the selection of the quantile levels, directions, and weights in the training data that give the best performance on the test data. For some prior probabilities $\pi_1$ and $\pi_2$, let
denote the population probability of correct classification by the DQC. Note that maximising (2.2) is equivalent to minimising the theoretical misclassification rate. For any given level $\theta$ and direction $\mathbf{u}$, the optimal misclassification rate is obtained when
which is equivalent to minimising
In the general problem with $K$ populations, the minimum misclassification rate is obtained when
Let $n = n_1 + n_2$. Given a sample of $n$ observations and corresponding class labels, we aim to solve
Problem (8) may seem daunting but, fortunately, we can solve for $\mathbf{w}$ rather easily. Given the quantile levels and directions, problem (8) is linear with unit-norm constraints and can be minimised by using the Lagrange multiplier method. The minimisation has a closed-form solution $\hat{\mathbf{w}}$, with generic element
We now turn to how to choose directions and quantile levels. A crude solution would consist in a multidimensional grid search over all dimensions. However, such a solution would become computationally prohibitive even at modest values of $p$. Thankfully, we are able to mitigate the computational cost of a naïve numerical solution with some theoretical results (Section 3); in particular, with Theorem 1, which guarantees that for each projection there exists (at least) one quantile level that leads to the optimal Bayes misclassification probability, and Theorem 2, which, conversely, identifies the best direction for a given quantile level. Unfortunately, a theoretical result for the simultaneous optimisation with respect to quantile levels and directions is not available. Nevertheless, we show that our DQC is asymptotically optimal (i.e. the misclassification rate goes to zero) when the number of directions increases with $n$ and $p$ (Theorem 3), under certain assumptions.
In summary, there are different possible approaches, including randomly selecting one or more directions and using the optimal quantile levels associated with those directions, or spanning a grid of quantile levels and using the optimal directions associated with those quantiles. After some empirical investigation, we found that a strategy that gives satisfactory results in different settings is as follows. First, we define a grid of $\theta$ values spanning the unit interval and, for each of these values, randomly draw a set of normalised directions from the hyperplane that is identified as optimal according to Theorem 2. The performance of a DQC based on each single $\theta$ value is evaluated using five-fold cross-validation. In the end, we use a single quantile level (optimal according to cross-validation), with the corresponding directions sampled from the optimal hyperplane. In particular, this strategy improves over the use of an asymptotically optimal quantile level when $n$ is small. Moreover, when $p$ is not too large, a similar strategy can be used to select an approximately optimal hyperplane.
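The selection strategy can be sketched as follows, with two deliberate simplifications relative to the paper: directions are drawn uniformly from the unit sphere rather than from the optimal hyperplane of Theorem 2, and the weights are fixed at one instead of using the closed-form solution. All names and settings are illustrative.

```python
import numpy as np

def dqc_predict(x, X1, X2, dirs, theta):
    """Classify x by summed directional quantile distances over a set of
    directions (unit weights; a simplified sketch of the full DQC)."""
    q1 = np.quantile(X1 @ dirs.T, theta, axis=0)  # directional quantiles, class 1
    q2 = np.quantile(X2 @ dirs.T, theta, axis=0)  # directional quantiles, class 2
    px = x @ dirs.T
    phi = lambda u: u * (theta - (u < 0))
    return 1 if np.sum(phi(px - q2) - phi(px - q1)) > 0 else 2

def cv_error(X, y, dirs, theta, folds=5, seed=0):
    """Five-fold cross-validated misclassification rate at a given theta."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    errs = 0
    for f in np.array_split(idx, folds):
        mask = np.ones(len(y), bool)
        mask[f] = False                      # hold out this fold
        X1, X2 = X[mask & (y == 1)], X[mask & (y == 2)]
        errs += sum(dqc_predict(X[i], X1, X2, dirs, theta) != y[i] for i in f)
    return errs / len(y)

# choose theta on a grid, with randomly drawn normalised directions
rng = np.random.default_rng(1)
dirs = rng.normal(size=(20, 5))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
X = np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(1, 1, (100, 5))])
y = np.repeat([1, 2], 100)
grid = np.linspace(0.1, 0.9, 9)
best_theta = min(grid, key=lambda t: cv_error(X, y, dirs, t))
```

Replacing the uniform sampling with sampling from the optimal hyperplane, and the unit weights with the closed-form $\hat{\mathbf{w}}$, would recover the procedure described above.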
3 Theoretical results
In this section, we present theoretical results concerning our DQC. The proofs of lemmas and theorems are reported in the Appendix.
3.1 Optimal quantile level
We derive the theoretical rate of correct classification as a function of $\theta$, for given $\mathbf{u}$. We assume $K = 2$ populations, although results can be generalised to $K > 2$.
For given $\mathbf{u}$, let $G_1$ denote the distribution of $Z_1 = \mathbf{u}^\top \mathbf{X}_1$, with corresponding inverse $G_1^{-1}$, density $g_1$, and prior probability $\pi_1$, and let $G_2$ denote the distribution of $Z_2 = \mathbf{u}^\top \mathbf{X}_2$, with corresponding inverse $G_2^{-1}$, density $g_2$, and prior probability $\pi_2$. The probability of correct classification of the directional quantile classifier is
Analogously, the theoretical misclassification rate is
Assume that the density functions $g_1$ and $g_2$ exist and are nonzero on the same compact domain $\mathcal{Z}$. Further assume that there is a point $z^*$ with $g_1(z^*) = g_2(z^*)$ so that $g_1(z) > g_2(z)$ for $z$ on one side of $z^*$ and $g_1(z) < g_2(z)$ for $z$ on the other side. Then the quantile classifier using the quantile level that minimises the theoretical misclassification probability achieves the optimal Bayes misclassification probability.
The consistency of the classifier may be illustrated with an example. Consider a two-class decision problem where one population is a location-shift version of the other. Figure 2 shows two distributions with the same right skewness. The quantiles of the two populations are marked by dashed lines. The median classifier (Hall, Titterington and Xue, 2009) in the upper panel leads to a non-optimal misclassification probability equal to 0.30. However, the misclassification probability is reduced to 0.28 by setting $\theta$ to the optimal level.
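The effect of the quantile level on the error rate can be reproduced qualitatively by Monte Carlo. The distributions used in Figure 2 are not specified here, so the sketch below substitutes location-shifted exponentials (an assumption); the error values therefore differ from the 0.30 and 0.28 quoted above, but the qualitative pattern, a non-median level beating $\theta = 0.5$ for skewed location-shift families, is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
z1 = rng.exponential(scale=1.0, size=n)        # population 1 (right-skewed)
z2 = 0.5 + rng.exponential(scale=1.0, size=n)  # population 2: location shift

def phi(theta, u):
    """Asymmetric loss (1)."""
    return u * (theta - (u < 0))

def error_rate(theta):
    """Monte Carlo misclassification rate of the univariate theta-quantile
    classifier with equal priors."""
    q1, q2 = np.quantile(z1, theta), np.quantile(z2, theta)
    wrong1 = np.mean(phi(theta, z1 - q1) > phi(theta, z1 - q2))  # z1 sent to 2
    wrong2 = np.mean(phi(theta, z2 - q2) > phi(theta, z2 - q1))  # z2 sent to 1
    return 0.5 * (wrong1 + wrong2)

thetas = np.linspace(0.05, 0.95, 19)
errors = [error_rate(t) for t in thetas]
best = thetas[int(np.argmin(errors))]
```

For this right-skewed pair, low quantile levels give a markedly smaller error than the median classifier, in line with Theorem 1.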
3.2 Optimal direction
The next lemma and theorem give the optimal direction that minimises the misclassification rate at a given $\theta$.
Let $z$ be a realisation of either $Z_1$ or $Z_2$; then
where and , .
Let $\mathbf{X}$ be a $p$-variate random variable whose distributions under the two populations differ by a location shift, and let $\mathbf{u}$ be a vector of constants with unit norm. Moreover, assume that $q_1(\theta, \mathbf{u}) \leq q_2(\theta, \mathbf{u})$, where $q_k(\theta, \mathbf{u})$ is the $\theta$-quantile of $Z_k$. (Notice that there is no loss of generality with this assumption, since the opposite case can be reformulated by swapping the population labels.) Under these assumptions, the normalised direction that minimises the misclassification error (2.2) is
The generalisation of Theorem 2 to $K$ populations involves an optimal direction for each of the possible pairwise comparisons.
3.3 Asymptotic misclassification rate
In this section, we show that, under certain assumptions, the correct classification probability converges to unity when the number of dimensions grows to infinity along with the sample size and the number of projections. The proof follows a strategy similar to that used in Hall, Titterington and Xue (2009, Theorem 2), although our premises start from milder assumptions. In particular, the projections are not required to obey the mixing condition of Bradley (2005), which is rather strict in practice. Our theorem is developed for any $\theta$, unit weights, and a given set of directions. Thus, the asymptotic result holds for the sub-components of the summation in (8), which are then weighted and summed to minimise the misclassification rate. Hence, the overall criterion inherits the optimal properties of its additive components.
As we did with the theorems in the previous sections, we present this theorem for $K = 2$ classes. Its extension to $K > 2$ classes requires contrasting each class against the remaining classes, consistently with (7).
Consider a set of $s$ directions sampled from a unit $p$-sphere and let $n = n_1 + n_2$, with $n_1$ and $n_2$ denoting the sample sizes of the two groups in the training set. Assume
For a constant , .
The variables each have the same distribution as , respectively. Moreover, and .
The first moments of the projections are uniformly bounded in a strong sense. This implies that and , with such that
For some , the proportion of values for which
multiplied by , say , is of larger order than , which means goes to zero as and increase.
Under the previous assumptions, the directional quantile classifier based on
makes the correct choice asymptotically. More specifically, as $p \to \infty$, the classifier makes the correct decision with probability converging to 1 if both $n$ and $s$ diverge with $p$, where $P_k$, $k = 1, 2$, denotes the probability computed under the assumption that the new observation is drawn from population $\Pi_k$.
4 Simulation study
We assessed the performance of the proposed classifier in a simulation study under three scenarios with two populations. In the first scenario, observations were generated independently from a multivariate Student's $t$ distribution with 3 degrees of freedom, with either uncorrelated or correlated variables. In the second scenario, observations were generated as in the first scenario, but each variable was subsequently transformed to induce asymmetry. In both cases, the two populations differed by a location shift equal to 0.4. Finally, in the third scenario, observations were generated as in the first scenario, but each variable was subsequently transformed according to one of two different transformations, depending on whether observations belonged to one or the other population.
Data were generated for each combination of overall sample size $n$ (with $n/2$ observations in each class) and dimension $p$. The scale matrix used in the multivariate $t$ distribution with correlated variables was generated randomly for each replication using the function rcorrmatrix with default settings, as provided in the package clusterGeneration (Qiu and Joe, 2015; Joe, 2006). This resulted in non-constant pairwise correlations. Observations in the training and test datasets were generated in the same way. Data generation under each setting was replicated 100 times.
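The data-generating mechanism of the first scenario can be sketched as follows. The paper uses R (clusterGeneration's rcorrmatrix); `random_corr` below is a simple numpy stand-in based on a normalised Wishart-type construction, not that package's algorithm, and the sizes are illustrative.

```python
import numpy as np

def random_corr(p, rng):
    """Random correlation matrix: normalise a random Gram matrix so the
    diagonal is 1 (a simple stand-in for rcorrmatrix)."""
    A = rng.normal(size=(p, 2 * p))
    S = A @ A.T                        # positive definite a.s.
    d = np.sqrt(np.diag(S))
    return S / np.outer(d, d)

def rmvt(n, mu, R, df, rng):
    """Multivariate Student's t draws: correlated normals scaled by
    sqrt(df / chi2_df)."""
    L = np.linalg.cholesky(R)
    Z = rng.normal(size=(n, len(mu))) @ L.T
    g = rng.chisquare(df, size=(n, 1))
    return mu + Z * np.sqrt(df / g)

rng = np.random.default_rng(7)
p, n, df = 10, 100, 3
R = random_corr(p, rng)
X1 = rmvt(n, np.zeros(p), R, df, rng)       # population 1
X2 = rmvt(n, np.full(p, 0.4), R, df, rng)   # population 2: 0.4 location shift
```

Applying a monotone nonlinear transformation to each column of `X1` and `X2` would produce data in the spirit of the second and third scenarios.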
We compared the directional quantile classifier (DQC) in terms of misclassification rate on the test data with that of the centroid classifier (Centroid) (Tibshirani et al., 2002), the median classifier (Median) (Hall, Titterington and Xue, 2009), the componentwise quantile classifier (CQC) (Hennig and Viroli, 2016a), the ensemble quantile classifier (EQC) (Lai and McLeod, 2020), Fisher's linear discriminant analysis (LDA), $k$-nearest neighbour (KNN) (Cover and Hart, 1967), penalised logistic regression (PLR) (Park and Hastie, 2008), support vector machines (SVM) (Cortes and Vapnik, 1995; Wang, Zhu and Zou, 2008), and the naïve Bayes classifier (Bayes) (Hand and Yu, 2001). Tuning parameters for PLR, KNN, and SVM were selected using cross-validation. For the CQC, the Galton correction was used to reduce skewness and the optimal quantile level was selected by minimising the error rate on the training set (Hennig and Viroli, 2016a).
We used the package Qtools (Geraci, 2016, 2020) for the directional quantile classifier; the package quantileDA (Hennig and Viroli, 2016b) for the centroid, median and componentwise quantile classifiers; the package eqc (Lai and McLeod, 2019) for the ensemble quantile classifier; the package MASS (Venables and Ripley, 2002) for linear discriminant analysis; the package class (Venables and Ripley, 2002) for $k$-nearest neighbour; the package e1071 (Meyer et al., 2019) for support vector machines and the naïve Bayes classifier; and the package stepPlr (Park and Hastie, 2018) for penalised logistic regression. All analyses were carried out in R version 4.0.0 (R Core Team, 2020).
The misclassification rates averaged over 100 replications for all simulation cases are reported in Tables 2-4. The results indicate that the performance of our proposed classifier improves as $n$ and $p$ increase, in agreement with the theoretical results. In the first two scenarios, our classifier outperforms the competitors when variables are uncorrelated. When variables are correlated, the proposed classifier still performs very well, even if it is not uniformly the best. In the third scenario, where class distributions have different shapes, the performance of our classifier is often, but not always, the best.
5 Clinical trial on Crohn’s disease
We analyse data from a matched case-control study in first-degree relatives (FDRs) of Crohn's disease (CD) patients originally published by Sorrentino et al. (2014). The goal of the study was to identify asymptomatic FDRs with early CD signs using several intestinal inflammatory markers. The latter included hemoglobin, erythrocyte sedimentation rate, C-reactive protein, fecal calprotectin, and average mature ileum score. In our analysis, we grouped subjects into two classes, one with signs of inflammation (subjects with early or frank CD) and one with normal values of markers (subjects with no signs of inflammation, including healthy controls). In a separate analysis, we augment the dataset with 45 artificial markers generated from independent standard normal distributions to investigate the impact of uninformative noise on the performance of the DQC. We approach data analysis with leave-one-out validation and evaluate the misclassification rate as the proportion of subjects that are misclassified when each is left out of the analysis.
We estimated the classification error for all the classifiers included in our simulation study (Section 4). The results are reported in Table 5. The proposed DQC outperforms its competitors in both the original and the noisy versions of the dataset.
6 Concluding remarks

We proposed directional quantile classifiers whose predictive ability is consistently good in both simulation and real data studies, on small- and large-dimensional classification problems. In particular, the empirical results show that our approach either outperforms its competitors or, when this is not the case, its performance is still in the ballpark of that of the best classifiers. Such reliable behaviour across different scenarios is not shared by the other selected classifiers. Moreover, the directional quantile classifiers enjoy optimal theoretical properties under certain assumptions.
A limitation of the approach is that the number of directions needed to span a $p$-sphere with a regular grid becomes prohibitive already at modest values of $p$. On the other hand, our theoretical results indicate that one can sample directions from an optimal hyperplane, thus reducing the computational burden without sacrificing the classifier's performance. Our strategy allows us to balance the importance of the quantile levels and directions used for classification by means of weights, which can be optimised using a convenient closed-form expression.
Appendix A - Proofs of Theorems
A.1 Proofs of Lemma 1 and Theorem 1
The proofs of Lemma 1 and Theorem 1 follow the arguments given in Hennig and Viroli (2016a, Supplementary Material). Here, we briefly sketch the main idea. The optimal value $\theta^*$ that minimises the theoretical misclassification probability can be obtained by setting the first derivative of (10) to zero, from which
By assumption, there exists a point $z^*$ such that $g_1(z^*) = g_2(z^*)$. Hence, the identity above is satisfied because $G_1^{-1}$ and $G_2^{-1}$ are continuous functions of $\theta$ that converge to the lower and upper bound of the domain for $\theta$ approaching either 0 or 1, respectively. Furthermore, under the assumptions of Theorem 1, the optimal Bayesian classifier has a single decision boundary at $z^*$. ∎
A.2 Proof of Lemma 2
Without loss of generality, assume $q_1(\theta, \mathbf{u}) \leq q_2(\theta, \mathbf{u})$. Let $z = \mathbf{u}^\top \mathbf{x}$ and consider three possible, distinct cases: $z < q_1(\theta, \mathbf{u})$, $q_1(\theta, \mathbf{u}) \leq z \leq q_2(\theta, \mathbf{u})$, and $z > q_2(\theta, \mathbf{u})$.
If $z < q_1(\theta, \mathbf{u})$, then
by definition. If $q_1(\theta, \mathbf{u}) \leq z \leq q_2(\theta, \mathbf{u})$, then
Finally, if $z > q_2(\theta, \mathbf{u})$, then
A.3 Proof of Theorem 2
A.4 Proof of Theorem 3
Let $\hat{q}_k(\theta, \mathbf{u})$ be the empirical quantile computed on the projected training data. We write
where . Let denote the vector of quantiles of , and put for and write . By the triangle inequality
where and satisfy , . Hence
where , and .
Given the convergence of the empirical quantiles to the respective population quantiles, it follows that
for any , where
Given , let denote the set of indices such that