1 Introduction
Discriminant analysis is a widely used statistical tool for classification tasks. Historical discriminant analysis [2] assumes that observations are drawn from Gaussian distributions, and the decision rule consists in choosing the cluster that maximizes the likelihood of the observation. However, when this assumption fails to hold, the impact on the result can be significant. In the early 1980s, [3] and [4] studied the impact of contamination and mislabelling on the performance of such methods, and [5] studied how non-normality impacts Quadratic Discriminant Analysis (QDA). To tackle this sensitivity, [8] suggests the use of robust M-estimators. Their major drawback is a low breakdown point in high dimensions, but [12] proposed a robust S-estimator to alleviate this issue.
More recently, [20] dropped the Gaussian hypothesis on the underlying distributions and replaced it with the more general case of the multivariate t-distribution. In 2015, [21] generalized discriminant analysis methods to elliptically symmetric (ES) distributions (see [ollila2012complex] for a review of these distributions). The new method, called Generalized QDA (GQDA), relies on the estimation of a threshold parameter, whose optimal value is fixed for each subfamily of distributions; one particular value of this threshold corresponds to the Gaussian case. Finally, [22] improved on this work by adding robust estimators, resulting in the Robust GQDA (RGQDA) method.
All these methods assume that all clusters belong to the same distribution family. In practice, such a hypothesis may not hold. Inspired by [roizman2019flexible], this paper proposes a new method that does not assume any prior on the underlying distributions and allows each observation to be drawn from a different family of distributions. Points in the same cluster do not need to be identically distributed, only to be drawn independently. The counterpart to such flexibility lies in the characteristics of the clusters: we assume that points in the same cluster are drawn from distributions that share the same mean and scatter matrix. However, provided the covariance matrices exist, points in the same cluster only have proportional covariance matrices.
2 Flexible EM-inspired discriminant analysis
Statistical model: Let us assume that each observation is drawn from an ES distribution. The mean and scatter matrix depend on the cluster to which the point belongs, while the nuisance parameter may depend on both the observation and the class. We then have the following probability density function for :
Expression of the log-likelihood and Maximum Likelihood estimators: Given independent observations in class , the log-likelihood of the sample can be rewritten as follows:
(1) 
where and . Then, maximizing Eq. (1) w.r.t. , for fixed and leads to
Due to the assumptions on , the denominator always exists. Replacing by in Eq.(1) leads to:
where
At this stage, one can notice that the flexibility in the choice of the covariance matrix scale makes the impact of the generator function on the likelihood boil down to a multiplicative constant that does not depend on . One obtains robust estimators (derived, however, by maximum likelihood estimation) for the mean and scatter matrix as follows:
(2) 
where .
Note that is insensitive to the scale of and if is a solution to the fixed-point equation, is also a solution. The estimators obtained are close to robust M-estimators, but with weights inversely proportional to the squared Mahalanobis distance. The convergence of the two fixed-point equations has been analyzed in [roizman2019flexible].
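The derivation above can be sketched in explicit form. The notation below is assumed for illustration and is not necessarily the authors' own: $x_i \in \mathbb{R}^m$ are the $n_k$ observations of class $k$, $(\mu_k, \Sigma_k)$ are the class mean and scatter matrix, $\tau_i$ is the per-observation nuisance scale, $g_i$ is the density generator, and $A_i$, $C$ are constants.

```latex
% Assumed ES density of an observation x_i of class k
\[
f_i(x_i) = A_i\,|\Sigma_k|^{-1/2}\,\tau_i^{-m/2}\,
           g_i\!\left(\frac{t_i}{\tau_i}\right),
\qquad
t_i = (x_i-\mu_k)^\top \Sigma_k^{-1} (x_i-\mu_k).
\]
% Maximizing over each tau_i for fixed (mu_k, Sigma_k)
\[
\hat{\tau}_i = \frac{t_i}{\arg\max_{s>0}\, s^{m/2} g_i(s)}.
\]
% Plugging tau_i-hat back in: the generators only contribute to the constant C
\[
\ell(\mu_k,\Sigma_k) = C - \frac{1}{2}\sum_{i=1}^{n_k}
   \bigl(\log|\Sigma_k| + m \log t_i\bigr).
\]
% Setting the gradients to zero gives two coupled fixed-point equations
\[
\hat{\mu}_k = \frac{\sum_{i} w_i\, x_i}{\sum_{i} w_i},
\qquad
\hat{\Sigma}_k = \frac{m}{n_k}\sum_{i} w_i\,
   (x_i-\hat{\mu}_k)(x_i-\hat{\mu}_k)^\top,
\qquad
w_i = \frac{1}{t_i}.
\]
```

Under these assumptions, multiplying $\hat{\Sigma}_k$ by a positive constant divides every $t_i$ by the same constant and leaves both equations satisfied, which is the scale insensitivity noted in the text.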
Classification rule: Equipped with these estimators, used for the training part of discriminant analysis, one can now derive one of the contributions of this work, the classification rule, stated in the following proposition.
Proposition 2.1.
The decision rule for Flexible EM-inspired discriminant analysis (FEMDA) is given by
(3) 
with and .
Note that parameters for class and for class have been learned on the training dataset using the previously derived estimators.
Proof.
The key idea of the proof is that the log-likelihood depends on only through the term
∎
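To spell the argument out, here is a sketch under assumed notation: $t_k(x)$ denotes the squared Mahalanobis distance of $x$ to class $k$ and $m$ the dimension. It assumes that the class-dependent part of the log-likelihood is $-\tfrac{1}{2}\bigl(\log|\Sigma_k| + m \log t_k(x)\bigr)$; this is a reconstruction consistent with the comparison of log squared Mahalanobis distances mentioned in the text, not a quotation of the authors' formula.

```latex
\[
x \in \mathcal{C}_k
\iff
\forall j \neq k:\quad
\log|\Sigma_k| + m \log t_k(x) \;\le\; \log|\Sigma_j| + m \log t_j(x)
\]
\[
\iff
\forall j \neq k:\quad
\log\frac{t_j(x)}{t_k(x)} \;\ge\; \frac{1}{m}\,\log\frac{|\Sigma_k|}{|\Sigma_j|},
\qquad
t_k(x) = (x-\mu_k)^\top \Sigma_k^{-1} (x-\mu_k).
\]
```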
Remark 2.2.
This decision rule is close to the robust version of classic QDA, except that we compare the logarithms of the squared Mahalanobis distances rather than the squared Mahalanobis distances themselves. Also, it is insensitive to the scale of the scatter matrices.
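As an illustration of how training and classification fit together, the following is a minimal NumPy sketch, not the authors' implementation. It assumes weights equal to the inverse squared Mahalanobis distance, a class score of $-\tfrac{1}{2}(\log|\Sigma_k| + m \log t_k(x))$, and a trace normalization of the scatter matrix (an arbitrary choice, since the scale is not identifiable). The function names `fit_class` and `predict` are hypothetical, and behaviour in very low dimension is not examined here.

```python
import numpy as np


def sq_mahalanobis(X, mu, sigma):
    """Squared Mahalanobis distance of each row of X to (mu, sigma)."""
    diff = X - mu
    return np.einsum("ij,ij->i", diff @ np.linalg.inv(sigma), diff)


def fit_class(X, n_iter=100, tol=1e-8, floor=1e-12):
    """Fixed-point estimates of the mean and scatter matrix of one class."""
    n, m = X.shape
    mu = X.mean(axis=0)
    sigma = np.cov(X, rowvar=False) + floor * np.eye(m)
    for _ in range(n_iter):
        # Assumed weights: inverse of the squared Mahalanobis distance.
        w = 1.0 / np.maximum(sq_mahalanobis(X, mu, sigma), floor)
        mu_new = w @ X / w.sum()
        diff = X - mu_new
        sigma_new = (m / n) * (diff * w[:, None]).T @ diff
        # Assumed normalization: fix the free scale by setting trace = m.
        sigma_new *= m / np.trace(sigma_new)
        converged = (np.linalg.norm(mu_new - mu) < tol
                     and np.linalg.norm(sigma_new - sigma) < tol)
        mu, sigma = mu_new, sigma_new
        if converged:
            break
    return mu, sigma


def predict(X, params, floor=1e-300):
    """Assign each row of X to the class with the largest score."""
    m = X.shape[1]
    scores = []
    for mu, sigma in params:
        t = np.maximum(sq_mahalanobis(X, mu, sigma), floor)
        _, logdet = np.linalg.slogdet(sigma)
        # The score only involves log|Sigma| and the log of the distance.
        scores.append(-0.5 * (logdet + m * np.log(t)))
    return np.argmax(np.column_stack(scores), axis=1)
```

A typical use would be `params = [fit_class(X_k) for X_k in training_sets]` followed by `predict(X_new, params)`. Rescaling any scatter matrix in `params` by a positive constant leaves the predicted labels unchanged, because the change in the log-determinant cancels the change in the log-distance term.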
3 Experiments on synthetic data
The proposed FEMDA method is compared with the following methods: classic QDA for Gaussian distributions, using classic and robust estimators (Robust QDA, see [8] for details), QDA for t-distributions (tQDA) [20], GQDA and RGQDA [22].
Simulation settings: Means of clusters are drawn randomly on the unit sphere and covariance matrices are generated with a random orthogonal matrix and random eigenvalues. The setup for the simulations is , , , and .
Considered scenarios: Points are drawn from four different families of ES distributions.
Distribution family  Stochastic representation
generalized Gaussian
t-distribution
K-distribution
stands for the uniform distribution on the unit sphere. Shape parameter (resp. ) is drawn uniformly in (resp. ) for generalized Gaussian distributions (resp. for t-distributions and K-distributions).
Data generation scenarios are identified by the share of points drawn from each family: for instance, 50% of the points of each cluster may be drawn from a generalized Gaussian distribution, 30% from a t-distribution and 20% from a K-distribution.
Concerning the shape parameters, a color code distinguishes two cases: either the same parameters are used for all points of a cluster, or a different parameter is used for each point of each cluster.
While tQDA and FEMDA rely on their own estimators, the other methods use either the classic empirical estimators (QDA and GQDA) or robust estimators (Robust QDA and RGQDA).
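The simulation setup described above can be sketched as follows. This is an illustrative NumPy sketch, not the authors' code: the eigenvalue range is an assumed placeholder, the function names are hypothetical, and only the multivariate t family is sampled, using its standard Gaussian-over-chi-square representation (which may differ from the parametrization used in the paper). Generalized Gaussian and K-distributions are omitted.

```python
import numpy as np


def random_cluster_parameters(m, rng, eig_low=0.5, eig_high=2.0):
    """Draw a mean on the unit sphere and a random covariance matrix."""
    mean = rng.standard_normal(m)
    mean /= np.linalg.norm(mean)  # normalized Gaussian vector: uniform on the sphere
    # The QR decomposition of a Gaussian matrix yields a random orthogonal matrix.
    q, _ = np.linalg.qr(rng.standard_normal((m, m)))
    eigenvalues = rng.uniform(eig_low, eig_high, size=m)  # assumed range
    covariance = (q * eigenvalues) @ q.T  # Q diag(eigenvalues) Q^T
    return mean, covariance


def sample_multivariate_t(n, mean, covariance, dof, rng):
    """Sample n points as mean + z / sqrt(chi2 / dof), with z Gaussian."""
    z = rng.multivariate_normal(np.zeros(len(mean)), covariance, size=n)
    scale = np.sqrt(rng.chisquare(dof, size=n) / dof)
    return mean + z / scale[:, None]
```

For example, `mean, cov = random_cluster_parameters(10, np.random.default_rng(0))` followed by `sample_multivariate_t(500, mean, cov, dof=3.0, rng=np.random.default_rng(1))` produces one heavy-tailed cluster. Drawing a different `dof` per point would mimic the setting in which every observation has its own shape parameter.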
As expected, one can see in Fig. 1(a) that the classic empirical estimator is the fastest to compute, while the tQDA estimator is slower because it requires the estimation of more parameters at each step. For tQDA, though, since the estimation of the degrees of freedom is already optimized, the relative time gain will be smaller, making FEMDA and the other methods even faster than tQDA. In Fig. 1(b), one observes that the speed of each decision rule is essentially the convergence speed of the estimators used to compute the likelihood, except for GQDA, which requires the estimation of an extra parameter.
Results for the classification: For several scenarios, specified in the first column, Table 1 displays, for each method, the difference between its accuracy and the accuracy of the best method on the corresponding scenario:
Scenario (GG  T  K)  QDA  tQDA  GQDA  FEMDA
76.27
76.74
76.43
76.39
77.08
77.12
80.85
80.59
80.79
79.75

In Table 1, one can see that GQDA performs better than QDA in scenarios with mixtures of distributions, and comparably when only one type of distribution is used. However, GQDA does not compete with tQDA and FEMDA; tQDA is the best method in most scenarios, but with only a very slight improvement over FEMDA. This is due to the estimation of an extra parameter for tQDA, namely the degrees of freedom. The counterpart is that tQDA is slower than FEMDA. In Table 2, we add contaminated data, each contaminated point being simulated from a contaminating distribution. One can see that FEMDA is the most robust to contamination. At a 25% contamination rate, tQDA is outperformed in almost all scenarios: it has more parameters to estimate and is thus more sensitive to the contamination.
Contamination rate  10%  25%
Scenario (GG  T  K)  tQDA  FEMDA  tQDA  FEMDA
70.41  61.25
71.70  62.00
70.80  61.47
70.03  61.29
70.98  61.51
71.01  61.52
75.53  65.43
74.72  64.55
74.09  64.42
73.44  63.45

4 Results on real datasets
4.1 Description of the datasets
In this section, we present results on real datasets obtained from the UCI machine learning repository [29]: Spambase, where the objective is to classify emails as spam or non-spam, with attributes containing the frequency of usual words or characters; Ecoli, where one wants to predict the localization site of a protein among 8 possible sites using 7 attributes about the cell that contains the protein; and Statlog Landsat Satellite, which contains multispectral values of pixels in 3×3 neighbourhoods of a satellite image, the goal being to predict the type of soil represented by the central pixel.
4.2 Classification accuracy results
The results have been averaged over 100 simulations; every 10 simulations, a new train and test split is drawn.
We can see in Fig. 2(a) and Fig. 2(b) that, for the Spambase and Ecoli datasets, GQDA slightly outperforms the other methods. FEMDA is better than tQDA, which also suffers from higher variance. It is worth noting that for those two datasets, GQDA is outperformed by LDA, which shows that its good performance comes from the ability to neglect the covariances if needed. Fig. 2(c) displays the results obtained on the Statlog dataset. Again, GQDA slightly outperforms QDA but has much more variance. The two best methods are tQDA and FEMDA, with smaller variance.
4.3 Performance under contaminated model
As detailed in Fig. 3, the amplitude of the contaminated data changes from one dataset to another. Even when the contamination rate is very high, the results remain good. This can be explained by the following two reasons:

Most datasets used have well-separated clusters: even a linear classifier (LDA) achieves very good performance.

Contamination is mild: it is random noise with no structure that could lead the classifier to consider all the noisy data as another cluster. These noisy points are well handled by robust estimators thanks to the weighting.
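To make the notion of unstructured contamination concrete, here is a hypothetical sketch of how a fraction of a dataset could be replaced by noise. The noise model (isotropic Gaussian noise centred on the data mean, with a spread controlled by `noise_scale`) and the function name `contaminate` are assumptions made for illustration; the source does not specify them.

```python
import numpy as np


def contaminate(X, rate, rng, noise_scale=10.0):
    """Replace a fraction `rate` of the rows of X with unstructured noise."""
    X_out = np.array(X, dtype=float)  # work on a copy, leave X untouched
    n, m = X_out.shape
    n_bad = int(round(rate * n))
    idx = rng.choice(n, size=n_bad, replace=False)
    # Assumed noise model: Gaussian noise around the data mean,
    # much more spread out than the data itself.
    center = X_out.mean(axis=0)
    spread = noise_scale * X_out.std(axis=0)
    X_out[idx] = center + spread * rng.standard_normal((n_bad, m))
    return X_out, idx
```

Since the noisy points share no common structure, they tend to lie far from every cluster and would receive small weights from estimators that down-weight points with large Mahalanobis distances.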
In Fig. 3(a), we can see that for the Spambase dataset, FEMDA starts to outperform GQDA at a 60% contamination rate, and tQDA at a 20% contamination rate. The proposed method has fewer parameters to estimate and is thus less sensitive to noise and more robust. Concerning the Ecoli dataset, in Fig. 3(b), the methods are not much affected at low contamination rates, and FEMDA and tQDA remain very close. At a 50% contamination rate, FEMDA starts to outperform both tQDA and RGQDA. FEMDA manages to preserve its performance up to a 70% contamination rate, versus 50% for the other methods. Again, tQDA is the method most sensitive to outliers and FEMDA is the most robust, being able to deal with much higher contamination rates. On the last dataset, Fig. 3(c), all methods obtain very similar results up to a 60% contamination rate. tQDA is the least robust and its performance starts to erode quickly. FEMDA manages to maintain its performance up to an 80% contamination rate, being again the method most robust to noise.
5 Conclusion
In this paper, we presented a new, highly robust discriminant analysis method that outperforms several state-of-the-art methods on both simulated and real datasets. In this new approach, the points of a cluster no longer share the same covariance matrix, but only the same shape matrix. Sacrificing the scale of the covariance matrix allows us to gain flexibility in order to deal with non-identically distributed observations. Moreover, the flexibility of this approach makes it particularly suitable for heavy-tailed and contaminated data. Tests performed on simulated data show that our new approach has a computational speed comparable to tQDA or QDA with plug-in robust estimators. On clean data, its performance is almost as good as that of the best methods in various scenarios. When data are contaminated, the proposed FEMDA outperforms other robust methods in most scenarios. Experiments on real data lead to the same conclusions: FEMDA performs as well as other methods on clean data and shows remarkable robustness when data are contaminated, with the highest resilience to contamination. It can be seen as an enhancement of tQDA: almost as accurate, but faster and much more robust.