Discriminant analysis is a widely used statistical tool for classification tasks. Classical discriminant analysis assumes that observations are drawn from Gaussian distributions, and its decision rule assigns each observation to the cluster that maximizes its likelihood. When the Gaussian assumption fails to hold, however, the impact on the results can be significant. In the early 1980s, several studies examined the impact of contamination and mislabelling on the performance of such methods, as well as how non-normality affects Quadratic Discriminant Analysis (QDA). To reduce this sensitivity, the use of robust M-estimators was suggested. Their major drawback is a low breakdown point in high dimensions, and a robust S-estimator was later proposed to alleviate this issue.
More recently, the Gaussian hypothesis on the underlying distributions was replaced by the more general case of the multivariate t-distribution. In 2015, discriminant analysis methods were generalized to elliptically symmetric (ES) distributions (see [ollila2012complex] for a review of these distributions). The resulting method, called Generalized QDA (GQDA), relies on the estimation of a threshold parameter whose optimal value is fixed for each sub-family of distributions; one particular value of this threshold corresponds to the Gaussian case. Finally, this work was improved by adding robust estimators, leading to the Robust GQDA (RGQDA) method.
All these methods assume that all clusters belong to the same distribution family. In practice, such a hypothesis may not hold. Inspired by [roizman2019flexible], this paper proposes a new method that does not assume any prior on the underlying distributions and allows each observation to be drawn from a different distribution family. Points in the same cluster do not need to be identically distributed, only to be drawn independently. The counterpart of this flexibility lies in the characteristics of the clusters: we assume that points in the same cluster are drawn from distributions that share the same mean and scatter matrix. As a consequence, when the covariance matrices exist, points in the same cluster only have proportional covariance matrices.
2 Flexible EM-inspired discriminant analysis
Statistical model: Let us assume that each observation is drawn from an ES distribution. The mean and scatter matrix depend on the cluster to which the point belongs, while the nuisance parameter may depend on both the observation and the class. We then have the following probability density function for each observation:
Expression of the log-likelihood and Maximum Likelihood estimators: Given independent observations in class , the log-likelihood of the sample can be rewritten as follows:
where and . Then, maximizing Eq. (1) w.r.t. , for fixed and leads to
Due to the assumptions on , the denominator always exists. Replacing by in Eq.(1) leads to:
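The substitution step can be sketched as follows. The notation is an assumption, since the symbols are not reproduced in this section: $m$ is the dimension, $n_k$ the number of observations in class $k$, $t_{i,k} = (x_i-\mu_k)^\top \Sigma_k^{-1} (x_i-\mu_k)$ the squared Mahalanobis distance, $\tau_{i,k}$ the nuisance scale, $g_i$ the density generator and $A_i$ its normalizing constant. This is a reconstruction of the argument described in the text, not a verbatim copy of Eq. (1).

```latex
\begin{align*}
\ell(x_1,\dots,x_{n_k})
  &= \sum_{i=1}^{n_k} \Big[ \log A_i - \tfrac{1}{2}\log|\Sigma_k|
     - \tfrac{m}{2}\log \tau_{i,k} + \log g_i\!\big(t_{i,k}/\tau_{i,k}\big) \Big], \\
\hat{\tau}_{i,k}
  &= \frac{t_{i,k}}{\arg\max_{s>0}\, s^{m/2} g_i(s)}, \\
\ell\,\big|_{\tau=\hat{\tau}}
  &= \sum_{i=1}^{n_k} \Big[ c_i - \tfrac{1}{2}\log|\Sigma_k|
     - \tfrac{m}{2}\log t_{i,k} \Big],
  \qquad c_i = \log A_i + \log \max_{s>0}\, s^{m/2} g_i(s).
\end{align*}
```

The second line follows from the change of variable $s = t_{i,k}/\tau_{i,k}$, which turns $-\tfrac{m}{2}\log\tau_{i,k} + \log g_i(t_{i,k}/\tau_{i,k})$ into $-\tfrac{m}{2}\log t_{i,k} + \log\big(s^{m/2} g_i(s)\big)$. Under this form, everything that depends on the generator is collected in $c_i$, which involves neither $\mu_k$ nor $\Sigma_k$.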
At this stage, one can notice that the flexibility in the choice of the covariance matrix scale allows us to make the impact of the generator function in the likelihood boil down to a multiplicative constant that does not depend on . One obtains robust estimators (derived however using MLE) for the mean and scatter matrix as follows:
Note that the mean estimator is insensitive to the scale of the scatter matrix estimator, and that if a scatter matrix is a solution to the fixed-point equation, any positive multiple of it is also a solution. The estimators obtained are close to robust M-estimators, but with weights inversely proportional to the squared Mahalanobis distance. The convergence of the two fixed-point equations has been analyzed in [roizman2019flexible].
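The two fixed-point equations can be illustrated with a minimal NumPy sketch. It assumes that each point is weighted by the inverse of its squared Mahalanobis distance, which is what maximizing a log-likelihood of the form "minus the sum of log squared Mahalanobis distances" gives. The function name `fixed_point_mean_scatter`, the trace normalization and the stopping rule are illustrative choices, not taken from the paper.

```python
import numpy as np


def fixed_point_mean_scatter(X, n_iter=200, tol=1e-8, eps=1e-12):
    """Jointly estimate a mean and a scatter matrix by fixed-point iteration.

    Assumption: weights are w_i = 1 / t_i, where t_i is the squared
    Mahalanobis distance of x_i under the current estimates.
    """
    X = np.asarray(X, dtype=float)
    n, m = X.shape

    # Start from the empirical estimators.
    mu = X.mean(axis=0)
    sigma = np.cov(X, rowvar=False)
    sigma *= m / np.trace(sigma)

    for _ in range(n_iter):
        diff = X - mu
        # Squared Mahalanobis distances t_i = (x_i - mu)^T sigma^{-1} (x_i - mu).
        t = np.einsum("ij,ij->i", diff, np.linalg.solve(sigma, diff.T).T)
        w = 1.0 / np.maximum(t, eps)  # eps guards against a point equal to mu

        # Weighted mean update.
        mu_new = w @ X / w.sum()

        # Weighted scatter update, using the same weights.
        diff = X - mu_new
        sigma_new = (m / n) * (diff * w[:, None]).T @ diff

        # The scatter matrix is only defined up to scale, so fix trace = m
        # (an arbitrary convention; it absorbs the m / n factor above).
        sigma_new *= m / np.trace(sigma_new)

        delta = max(np.abs(mu_new - mu).max(), np.abs(sigma_new - sigma).max())
        mu, sigma = mu_new, sigma_new
        if delta < tol:
            break

    return mu, sigma
```

Calling `fixed_point_mean_scatter(X_k)` on the training points `X_k` of one class returns one `(mu, sigma)` pair per class. Because far-away points receive small weights, a few gross outliers have limited influence on either estimate.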
Classification rule: Equipped with these estimators, which are used for the training part of discriminant analysis, one can now derive one of the contributions of this work, the classification rule. It is stated in the following proposition.
The decision rule for Flexible EM-inspired discriminant analysis (FEMDA) is given by
with and .
Note that parameters for class and for class have been learned on the training dataset using the previously derived estimators.
The key idea of the proof is that the log-likelihood depends on only through the term
This decision rule is close to the robust version of classic QDA, except that it compares the logarithms of the squared Mahalanobis distances rather than the squared Mahalanobis distances themselves. It is also insensitive to the scale of the scatter matrices.
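A minimal sketch of such a rule is given below. It assumes that, for each class, the score is the dimension times the log squared Mahalanobis distance plus the log-determinant of the scatter matrix, and that a point is assigned to the class with the smallest score. This form is derived from the description above and is not copied from the proposition; the function name `log_mahalanobis_predict` is hypothetical.

```python
import numpy as np


def log_mahalanobis_predict(X, means, scatters):
    """Assign each row of X to the class with the smallest score.

    Assumed score for class k: m * log(t_k) + log|Sigma_k|, where t_k is the
    squared Mahalanobis distance of the point to class k.
    """
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    scores = np.empty((n, len(means)))

    for k, (mu, sigma) in enumerate(zip(means, scatters)):
        diff = X - mu
        t = np.einsum("ij,ij->i", diff, np.linalg.solve(sigma, diff.T).T)
        _, logdet = np.linalg.slogdet(sigma)
        # Tiny floor avoids log(0) when a point coincides with a class mean.
        scores[:, k] = m * np.log(np.maximum(t, 1e-300)) + logdet

    return scores.argmin(axis=1)
```

Multiplying a scatter matrix by a constant `c` adds `m * log(c)` to its log-determinant and subtracts the same amount from `m * log(t)`, so the score is unchanged. This is the scale insensitivity mentioned above.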
3 Experiments on synthetic data
The proposed FEMDA method is compared to the following methods: classic QDA for Gaussian distributions, using either classic estimators or robust M-estimators (Robust QDA), QDA for t-distributions (t-QDA), GQDA and RGQDA.
Simulation settings: Means of clusters are drawn randomly on the unit, , , and .
Considered scenarios: Points are drawn from four different families of ES distributions.
|Distribution family||Stochastic representation|
stands for the uniform distribution on the unit-sphere. Shape parameter (resp. ) is drawn uniformly in (resp. ) for generalized Gaussian (resp. for -distributions and -distributions).
Data generation scenarios are identified by the proportion of points drawn from each family. For example, a scenario may specify that, for each cluster, 50% of the points are drawn from a generalized Gaussian distribution, 30% from a t-distribution and 20% from a K-distribution.
Concerning the parameters, we use the following color code: : same and are used for all points across the same cluster and : one different parameter is used for each point of each cluster.
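The two parameter regimes described above (one shape parameter shared by a whole cluster, or a different one for each point) can be illustrated with the multivariate t-distribution, using its standard compound-Gaussian representation. This is only a sketch for one of the families: the function name `sample_t_cluster` is hypothetical, and any range used for the degrees of freedom is an assumption, since the ranges used in the paper are not reproduced here.

```python
import numpy as np


def sample_t_cluster(n, mu, sigma, nu, rng):
    """Draw n points from multivariate t-distributions sharing mu and sigma.

    nu is either a scalar (same degrees of freedom for every point) or an
    array of length n (one value per point).
    """
    mu = np.asarray(mu, dtype=float)
    m = mu.shape[0]
    nu = np.broadcast_to(np.asarray(nu, dtype=float), (n,))

    # Gaussian part with covariance sigma.
    chol = np.linalg.cholesky(np.asarray(sigma, dtype=float))
    z = rng.standard_normal((n, m)) @ chol.T

    # Per-point scale: tau_i = nu_i / chi2(nu_i).
    tau = nu / rng.chisquare(nu)

    return mu + np.sqrt(tau)[:, None] * z
```

With `rng = np.random.default_rng(0)`, passing `nu=3.0` gives identically distributed points, while passing for instance `nu=rng.uniform(1.0, 10.0, size=n)` gives points that share the same mean and scatter matrix but are no longer identically distributed.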
While t-QDA and FEMDA rely on their own estimators, the other methods use either the classic empirical estimators (QDA and GQDA) or robust M-estimators (Robust QDA and RGQDA).
As expected, one can see on Fig. 1(a) that the classic empirical estimator is the fastest to compute, while the M-estimator is slower because it requires the estimation of more parameters at each step. For t-QDA though, since the estimation of the degrees of freedom is already optimized, the relative time gain will be smaller, making FEMDA and the other methods even faster than t-QDA. On Fig. 1(b), one observes that the speed of each decision rule is essentially the convergence speed of the estimators used to compute the likelihood, except for GQDA, which requires the estimation of an extra parameter.
Results for the classification
For several scenarios, specified in the first column, Table 1 displays the difference between the accuracy of each method and the accuracy of the best method on the corresponding scenario:
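[Table 1: accuracy gap to the best method for each scenario. Only the scenario label "GG - T - K" survives from the original table.]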
In Table 1, one can see that GQDA performs better than QDA in scenarios with mixtures of distributions, and comparably when only one type of distribution is used. However, GQDA does not compete with t-QDA and FEMDA. In most scenarios t-QDA is the best method, but with only a very slight improvement over FEMDA. This is due to the estimation of an extra parameter for t-QDA, namely the degrees of freedom. The counterpart is that t-QDA is slower than FEMDA. In Table 2, contaminated points, drawn from a separate contaminating distribution, are added to the data. One can see that FEMDA is the most robust to contamination. At a 25% contamination rate, t-QDA is outperformed in almost all scenarios. Indeed, t-QDA has more parameters to estimate and is thus more sensitive to the contamination.
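[Table 2: results with contaminated data. Only the scenario label "GG - T - K" survives from the original table.]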
4 Results on real datasets
4.1 Description of the datasets
In this section, we present results on real datasets obtained from the UCI machine learning repository: Spambase, where the objective is to classify emails as spam or non-spam, with attributes giving the frequency of usual words or characters; Ecoli, where one wants to predict the localization site of a protein among 8 possible sites using 7 attributes about the cell that contains the protein; and Statlog Landsat Satellite, which contains multi-spectral values of pixels in 3×3 neighbourhoods of a satellite image, where the goal is to predict the type of soil represented by the central pixel.
4.2 Classification accuracy results
The results are averaged over 100 simulations, and a new train/test split is drawn every 10 simulations.
We can see on Fig. 2(a) and Fig. 2(b) that for the Spambase and Ecoli datasets, GQDA slightly outperforms the other methods. FEMDA is better than t-QDA, which also suffers from higher variance. It is worth noting that for those two datasets, GQDA is outperformed by LDA, which shows that its good performance comes from its ability to neglect the covariances if needed. Fig. 2(c) displays the results obtained on the Statlog dataset. Again, GQDA slightly outperforms QDA but has much more variance. The two best methods are t-QDA and FEMDA, with smaller variance.
4.3 Performance under contaminated model
As detailed on Fig. 3, the amplitude of the contamination changes from one dataset to another. Even when the contamination rate is very high, the results remain good. This can be explained by the following two reasons:
Most of the datasets used have well-separated clusters: even a linear classifier (LDA) achieves very good performance.
Contamination is mild: it is random noise with no structure that could lead the classifier to treat all the noisy data as another cluster. These noisy points are well handled by robust estimators thanks to the weighting.
On Fig. 3(a), we can see that for the Spambase dataset, FEMDA starts to outperform GQDA at a 60% contamination rate, and t-QDA at a 20% contamination rate. The proposed method has fewer parameters to estimate and is thus less sensitive to noise and more robust. Concerning the Ecoli dataset, on Fig. 3(b), the methods are not much affected at low contamination rates, and FEMDA and t-QDA remain very close. At a 50% contamination rate, FEMDA starts to outperform both t-QDA and RGQDA. FEMDA manages to preserve its performance up to a 70% contamination rate, versus 50% for the other methods. Again, t-QDA is the method most sensitive to outliers and FEMDA is the most robust, being able to deal with much higher contamination rates. On the last dataset, Fig. 3(c), all methods obtain very similar results up to a 60% contamination rate. t-QDA is the least robust and its performance starts to erode quickly. FEMDA manages to maintain its performance up to an 80% contamination rate, being again the method most robust to noise.
In this paper, we presented a new, highly robust discriminant analysis method that outperforms several state-of-the-art methods on both simulated and real datasets. In this new approach, points within a cluster no longer share the same covariance matrix, but only the same shape matrix. Sacrificing the scale of the covariance matrix allows us to gain flexibility in order to deal with non-identically distributed observations. Moreover, this flexibility makes the approach particularly suitable for heavy-tailed and contaminated data. Tests performed on simulated data show that the new approach has a computational speed comparable to t-QDA or QDA with plug-in robust estimators. On clean data, its performance is almost as good as that of the best methods in various scenarios. When data are contaminated, the proposed FEMDA outperforms other robust methods in most scenarios. Experiments on real data lead to the same conclusions: FEMDA performs as well as other methods on clean data and shows remarkable robustness when data are contaminated, with the highest resilience to contamination. It can be seen as an enhancement of t-QDA: almost as accurate, but faster and much more robust.