For qualitative analysis, high-dimensional datasets provide rich information, but in many cases not all of the measured variables are useful for the qualitative model. In addition, traditional statistical methods require the number of variables to be smaller than the number of samples; otherwise they suffer from the curse of dimensionality [1]. To address these problems, the dimensionality of the dataset must be reduced before qualitative analysis. Dimension reduction methods such as PCA [2; 3; 4], LDA [5] and PLS [6; 7] are often used. These methods reduce or eliminate the statistical redundancy and noise among the components of high-dimensional vector data, yielding a lower-dimensional representation without significant loss of information.
In unsupervised data analysis, PCA is a useful tool for dimension reduction. Its main idea is to reduce the dimensionality of a dataset containing a large number of interrelated variables while retaining as much as possible of the variation present in the dataset [8]. However, PCA is unsupervised and does not use label information. Once sample labels are available, supervised methods are needed to analyze the dataset. LDA is a well-known supervised method for feature extraction and dimension reduction; it achieves maximum discrimination by maximizing the ratio of between-class distance to within-class distance [9]. An intrinsic limitation of classical LDA is the so-called small sample size problem [5], and different methods have been proposed to solve it [10; 11; 12]. One of the most successful approaches is subspace LDA, which applies an intermediate dimension reduction stage before LDA. Among the subspace LDA methods, PCA plus LDA (PCA-LDA [13]) and PLS plus LDA (PLS-LDA [14]) have received significant attention. Other approaches use PLS-based algorithms for the dimension reduction step.
The PLS algorithm can overcome the dimensionality and collinearity problems at the same time [15; 16], and has exhibited excellent performance on small sample size problems [17]. However, PLS also leaves some questions open, such as how to extract more useful information, how to enhance the robustness of the model, and how to eliminate redundancy and noise more accurately. One solution is ensemble learning, which is derived from the field of machine learning [18] and can be used for both classification and regression problems. In this study, we are mainly interested in dimensionality reduction and classification. Compared with a single model, ensemble models, including boosting [20; 21], bagging [22] and stacked regression [23; 31], have been reported to increase robustness and accuracy [19] and have been applied successfully in recent years. To overcome the over-fitting problem, Zhang et al. used the idea of boosting to combine a set of shrunken PLS models, each with only one PLS component, and called the result boosting PLS [20]. Building on boosting PLS, other researchers have modified it and applied it to spectroscopic quantitative analysis [25; 26]. With the bagging strategy [27; 28], many training sets are generated from the original dataset, bagging PLS trains a model on each of them, and the final model is obtained by averaging the coefficients B of the sub-models. To overcome the disadvantages of MWPLS and iPLS, Xu et al. presented a stacked PLS method using Monte Carlo cross-validation [29]. Ni et al. proposed two stacked PLS methods that build PLS models on all intervals of a set of spectra, exploiting the information in the whole spectrum by combining parallel models in a way that emphasizes the intervals most related to the target property [23].
After the PLS sub-models are established, various ensemble rules are available for fusing them into the final model, mainly average weighting, cross-validation error weighting, and minimum square error weighting. In this paper, following the bagging training scheme, the dataset is divided into a number of sub-training sets, and a PLS model is built on each of them. The coefficients of all the PLS sub-models are then fused into a joint matrix B, which is converted into a symmetric positive semi-definite matrix $\mathbf{B}\mathbf{B}^{T}$. Finally, as in PCA, an eigenvalue decomposition is performed, and the model with the largest variance is taken as the final projection model. The proposed method is termed Principal Model Analysis (PMA). In the subsequent sections, we discuss the relationship between the model parameters (the number of latent variables, the number of sub-models, and the number of retained dimensions) and the classification accuracy. The theory and experiments show that PMA increases the robustness and the generalization ability of the PLS algorithm. PMA also suggests a way of using the PLS algorithm for semi-supervised dimensionality reduction.
Boldface uppercase and lowercase letters are used to denote matrices and vectors, respectively. Lowercase italic letters denote the scalars. The detailed notations are as follows:
X matrix of samples
y vector of sample label
C covariance matrix of PCA
w vector of the PCA loading
B matrix of PLS regression coefficients
b vector of PLS1 regression coefficients
λ eigenvalue of PCA
n number of samples
p number of sample features
k number of components
2.2 Overview of PLS and Bagging-based PLS
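PLS projects the high-dimensional predictor variables onto a smaller set of latent variables that have maximal covariance with the responses. Given a training set $\{\mathbf{X}, \mathbf{y}\}$, the decomposition of the PLS algorithm is as follows [36; 37; 38; 40]:

$$\mathbf{X} = \sum_{i=1}^{k} \mathbf{t}_i \mathbf{p}_i^{T} + \mathbf{E}, \qquad \mathbf{y} = \sum_{i=1}^{k} \mathbf{u}_i \mathbf{q}_i^{T} + \mathbf{F},$$

where $\mathbf{t}_i$ and $\mathbf{u}_i$ are score vectors, and $\mathbf{p}_i$ and $\mathbf{q}_i$ are loading vectors of X and y, respectively. E and F are residual matrices, and $k$ is the number of feature vectors. The PLS inner relation between the projected score vectors is:

$$\mathbf{u}_i = d_i \mathbf{t}_i,$$

where $d_i$ is the internal regression coefficient.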
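The detailed procedure of the PLS algorithm is as follows:
1. Compute the weight vector $\mathbf{w} = \mathbf{X}^{T}\mathbf{u} / (\mathbf{u}^{T}\mathbf{u})$, with $\mathbf{u}$ initialized to $\mathbf{y}$, and normalize $\mathbf{w}$ to unit length.
2. Compute the input score vector $\mathbf{t} = \mathbf{X}\mathbf{w}$ and its loading vector $\mathbf{p} = \mathbf{X}^{T}\mathbf{t} / (\mathbf{t}^{T}\mathbf{t})$.
3. Compute the output loading vector $\mathbf{q} = \mathbf{y}^{T}\mathbf{t} / (\mathbf{t}^{T}\mathbf{t})$ and normalize $\mathbf{q}$ to unit length.
4. Compute the output score vector $\mathbf{u} = \mathbf{y}\mathbf{q}$.
5. Compute the internal regression coefficient $d = \mathbf{u}^{T}\mathbf{t} / (\mathbf{t}^{T}\mathbf{t})$.
6. Compute the residual matrices $\mathbf{E} = \mathbf{X} - \mathbf{t}\mathbf{p}^{T}$ and $\mathbf{F} = \mathbf{y} - d\,\mathbf{t}\mathbf{q}^{T}$.
7. Update $\mathbf{X}$ to $\mathbf{E}$ and $\mathbf{y}$ to $\mathbf{F}$, then return to step 1 until the expected number of latent variables is reached.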
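The iteration above can be illustrated with a minimal NumPy sketch for a single response (PLS1). It is an illustration under stated assumptions rather than the authors' implementation: the function name `pls1_nipals` is hypothetical, the inputs are assumed to be centered, and for a single response the normalized output loading and the internal regression coefficient are folded into one scalar per component. The final line uses the standard PLS identity that maps weights and loadings back to coefficients on the original variables.

```python
import numpy as np

def pls1_nipals(X, y, n_components):
    """PLS1 regression coefficients for column-centered X (n x p) and centered y (n,)."""
    X = np.array(X, dtype=float)   # working copies that are deflated step by step
    y = np.array(y, dtype=float)
    W, P, c = [], [], []
    for _ in range(n_components):
        w = X.T @ y                      # weight vector, using u = y
        norm = np.linalg.norm(w)
        if norm < 1e-12:                 # residual y no longer covaries with X
            break
        w = w / norm                     # normalize the weight vector
        t = X @ w                        # input score vector
        tt = t @ t
        p_load = X.T @ t / tt            # input loading vector
        c_a = y @ t / tt                 # output loading and inner coefficient combined
        X = X - np.outer(t, p_load)      # residual matrix E replaces X
        y = y - c_a * t                  # residual vector F replaces y
        W.append(w); P.append(p_load); c.append(c_a)
    if not W:
        return np.zeros(X.shape[1])
    W, P, c = np.column_stack(W), np.column_stack(P), np.array(c)
    # Map the component-wise regression back to the original variables.
    return W @ np.linalg.solve(P.T @ W, c)

# Noise-free toy data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X -= X.mean(axis=0)
beta = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ beta
b = pls1_nipals(X, y, n_components=4)
```

With as many components as variables and noise-free data, PLS1 coincides with ordinary least squares, so `b` should be close to `beta` in this toy case; with fewer components the coefficients are a regularized approximation.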
Ensemble learning aims to improve the accuracy and robustness of traditional algorithms by combining the results of multiple sub-models. Bagging is a simple ensemble learning strategy that is widely used for classification and regression problems, for example in bagging SVM and bagging PLS.
The general PLS method often gives poor or unstable results on data with a very large number of collinear x-variables or with very few training samples. With the bagging strategy, the bagging PLS model can reduce the variance of the original unstable model without increasing the bias. Bagging PLS therefore usually achieves more accurate and more stable results than the traditional PLS method.
Bagging-based PLS first generates several sub-training sets from the original training set by random sampling with replacement, then trains a PLS model on each sub-training set separately, and finally averages the regression coefficients of all PLS sub-models and uses the averaged coefficients for prediction. In detail, suppose that $M$ sub-training sets are generated by random sampling with replacement, and that the PLS regression coefficient vector corresponding to the $i$-th sub-training set is $\mathbf{b}_i$. The final regression coefficient vector of bagging-based PLS can be formulated as:

$$\mathbf{b}_{bag} = \frac{1}{M}\sum_{i=1}^{M}\mathbf{b}_i.$$
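The bootstrap-and-average scheme can be sketched as follows. This is a minimal illustration, not the authors' code: `bagging_coefficients` and its `fit` argument are hypothetical names, and the demonstration plugs in ordinary least squares as a stand-in for a PLS fitting routine so that the block stays self-contained.

```python
import numpy as np

def bagging_coefficients(X, y, fit, n_models=15, seed=0):
    """Average coefficient vectors fitted on bootstrap resamples of (X, y).

    `fit(Xc, yc)` is a hypothetical callable that takes centered data and
    returns a coefficient vector of length p (for example a PLS1 routine).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    coefs = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)          # sample n rows with replacement
        Xs, ys = X[idx], y[idx]
        coefs.append(fit(Xs - Xs.mean(axis=0), ys - ys.mean()))
    B = np.column_stack(coefs)                    # p x M matrix of sub-models
    return B.mean(axis=1), B                      # averaged model and all sub-models

# Stand-in fit for demonstration only: ordinary least squares, not PLS.
def ols_fit(Xc, yc):
    return np.linalg.lstsq(Xc, yc, rcond=None)[0]

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=60)
b_bag, B = bagging_coefficients(X, y, ols_fit, n_models=15)
```

Returning the full matrix `B` alongside the average keeps the individual sub-models available for other fusion rules.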
2.3 Overview of PCA
Principal component analysis (PCA) uses an orthogonal transformation to convert a number of possibly correlated variables into a smaller number of uncorrelated variables, called principal components, while preserving as much of the data variance as possible. Given a data matrix X with covariance matrix C, the projection directions of PCA can be obtained by solving:

$$\max_{\mathbf{w}} \ \mathbf{w}^{T}\mathbf{C}\mathbf{w} \quad \text{s.t.} \quad \mathbf{w}^{T}\mathbf{w} = 1.$$
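The above problem can be solved by eigendecomposition methods, for example via the singular value decomposition (SVD). The detailed procedure of PCA is as follows [41; 42; 43; 44]:
1. Center the data: $\tilde{\mathbf{X}} = \mathbf{X} - \mathbf{1}\bar{\mathbf{x}}^{T}$, where $\mathbf{X} \in \mathbb{R}^{n \times p}$ is the data matrix with $n$ samples and $p$ variables and $\bar{\mathbf{x}}$ is the vector of column means.
2. Compute the covariance matrix: $\mathbf{C} = \frac{1}{n-1}\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}}$.
3. Compute the eigendecomposition $\mathbf{C}\mathbf{w}_i = \lambda_i\mathbf{w}_i$. Denote the first $k$ eigenvalues as $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_k$; their corresponding eigenvectors $\mathbf{w}_1, \dots, \mathbf{w}_k$ are the principal components. The number of principal components can be decided by the cumulative contribution rate of the principal components, i.e., by choosing the smallest $k$ such that

$$\frac{\sum_{i=1}^{k}\lambda_i}{\sum_{i=1}^{p}\lambda_i} \geq \eta,$$

where $\eta$ is a preset threshold.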
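The PCA procedure can be sketched in a few lines of NumPy. This is a minimal illustration: the function name `pca`, the divisor $n-1$ in the covariance matrix, and the default threshold of 0.95 are assumptions made for the example and are not taken from the paper.

```python
import numpy as np

def pca(X, rate=0.95):
    """Return (W, eigvals, scores), keeping the smallest k whose cumulative
    contribution rate reaches `rate` (0.95 is an illustrative default)."""
    Xc = X - X.mean(axis=0)                       # center each variable
    C = Xc.T @ Xc / (X.shape[0] - 1)              # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)          # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]             # re-sort in descending order
    eigvals = np.clip(eigvals[order], 0.0, None)  # clip tiny negative round-off
    eigvecs = eigvecs[:, order]
    cum = np.cumsum(eigvals) / eigvals.sum()      # cumulative contribution rate
    k = min(int(np.searchsorted(cum, rate)) + 1, len(eigvals))
    W = eigvecs[:, :k]                            # leading k eigenvectors
    return W, eigvals, Xc @ W

# Toy data whose variance is dominated by the first variable.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([10.0, 1.0, 0.1])
W, eigvals, scores = pca(X, rate=0.95)
```

Here `np.linalg.eigh` is used because the covariance matrix is symmetric; an SVD of the centered data would give the same directions.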
3 Principal Model Analysis
3.1 Theory and Algorithm
Combining the bagging strategy and PLS, we propose a principal model analysis (PMA) method in this section. The proposed PMA consists of two steps. As in bagging PLS, the first step generates $M$ sub-training sets from the original training set by random sampling with replacement; the PLS regression coefficient vector of the $i$-th sub-training set is denoted by $\mathbf{b}_i$. Unlike bagging PLS, which simply averages the PLS sub-models, the second step uses the PLS sub-models as the input of the PCA algorithm to generate the final PMA model. The motivation is that PCA can effectively find the “major” elements, remove noise, and reveal the essential structure hidden behind complex data.
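The original PCA algorithm decomposes the covariance matrix C, which is symmetric and positive semi-definite. However, the joint regression coefficient matrix in the PMA algorithm, $\mathbf{B} = [\mathbf{b}_1, \dots, \mathbf{b}_M]$, is in general neither symmetric nor positive semi-definite, so it must first be converted into a symmetric positive semi-definite matrix. We therefore replace B by $\mathbf{B}\mathbf{B}^{T}$ in the eigenvalue decomposition and take the most representative models, called principal models, as the final PMA model. The optimization problem of the PMA algorithm can be expressed as:

$$\max_{\mathbf{w}} \ \mathbf{w}^{T}\mathbf{B}\mathbf{B}^{T}\mathbf{w} \quad \text{s.t.} \quad \mathbf{w}^{T}\mathbf{w} = 1.$$

The above problem can be solved with the singular value decomposition (SVD) algorithm:

$$\mathbf{B}\mathbf{B}^{T}\mathbf{w}_i = \lambda_i\mathbf{w}_i.$$

Denote the first $r$ eigenvalues as $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_r$; their corresponding eigenvectors $\mathbf{w}_1, \dots, \mathbf{w}_r$ are the principal components. The number of principal components can be decided by the cumulative contribution rate of the principal components, i.e., by choosing $r$ such that

$$\frac{\sum_{i=1}^{r}\lambda_i}{\sum_{i}\lambda_i} \geq \eta,$$

where the denominator sums over all eigenvalues and $\eta$ is a preset threshold.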
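As a brief check on why this replacement is legitimate, the following derivation assumes that the sub-model coefficient vectors are stacked as the columns of $\mathbf{B} \in \mathbb{R}^{p \times M}$ (an assumption about orientation) and writes its SVD as $\mathbf{B} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{T}$, where $\mathbf{U}$, $\boldsymbol{\Sigma}$ and $\mathbf{V}$ are symbols introduced here only for illustration.

```latex
\begin{align*}
(\mathbf{B}\mathbf{B}^{T})^{T} &= \mathbf{B}\mathbf{B}^{T}
  && \text{(symmetry)} \\
\mathbf{w}^{T}\mathbf{B}\mathbf{B}^{T}\mathbf{w}
  &= \lVert \mathbf{B}^{T}\mathbf{w} \rVert_2^{2} \;\geq\; 0
  && \text{(positive semi-definiteness)} \\
\mathbf{B}\mathbf{B}^{T}
  &= \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{T}\mathbf{V}\boldsymbol{\Sigma}^{T}\mathbf{U}^{T}
   = \mathbf{U}\,\boldsymbol{\Sigma}\boldsymbol{\Sigma}^{T}\,\mathbf{U}^{T}
  && \text{(link to the SVD of } \mathbf{B}\text{)}
\end{align*}
```

Under this assumption, the eigenvectors of $\mathbf{B}\mathbf{B}^{T}$ are the left singular vectors of $\mathbf{B}$ and its eigenvalues are the squared singular values, which is why the problem can be solved with either an eigenvalue decomposition or an SVD.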
The detailed procedure of the PMA method is as follows.
Input: Training set $\mathbf{X}$ and corresponding label vector $\mathbf{y}$, the number of PLS latent variables $k$, the number of sub-models $M$, the number of principal models (dim).
Output: The projection direction of PMA.
1. Preprocess the training set $\mathbf{X}$.
2. Divide $\mathbf{X}$ by random sampling with replacement and generate the PLS sub-models $\mathbf{b}_1, \dots, \mathbf{b}_M$.
3. Let $\mathbf{B} = [\mathbf{b}_1, \dots, \mathbf{b}_M]$, perform the eigenvalue decomposition in (3), sort the eigenvalues in descending order, and rearrange the corresponding eigenvectors accordingly.
4. Denote the rearranged eigenvector matrix as W and output the final PMA model $\mathbf{W}_{dim} = [\mathbf{w}_1, \dots, \mathbf{w}_{dim}]$, i.e., the first dim columns of W.
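Steps 3 and 4 can be sketched as below. This is a minimal illustration under stated assumptions: the sub-model coefficient vectors are assumed to be stored as the columns of a $p \times M$ matrix, the function name `principal_models` is hypothetical, and the toy matrix stands in for coefficients that would come from fitted PLS sub-models.

```python
import numpy as np

def principal_models(B, dim=1):
    """Eigen-decompose B B^T for a p x M coefficient matrix B (one sub-model
    per column) and return the leading `dim` eigenvectors and all eigenvalues."""
    B = np.asarray(B, dtype=float)
    S = B @ B.T                                   # p x p, symmetric PSD
    eigvals, eigvecs = np.linalg.eigh(S)          # ascending order
    order = np.argsort(eigvals)[::-1]             # sort in descending order
    eigvals = np.clip(eigvals[order], 0.0, None)  # clip tiny negative round-off
    W = eigvecs[:, order]                         # rearranged eigenvector matrix
    return W[:, :dim], eigvals

# Toy coefficient matrix: three noisy copies of one direction v.
rng = np.random.default_rng(0)
v = np.array([0.6, 0.8, 0.0])
B = np.outer(v, [1.0, 1.1, 0.9]) + 0.01 * rng.normal(size=(3, 3))
W_dim, eigvals = principal_models(B, dim=1)
scores = rng.normal(size=(5, 3)) @ W_dim          # project new samples
```

In this toy case the leading principal model should be nearly parallel to `v` (up to sign), since all sub-models point in almost the same direction.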
3.2 Determination of the number of Latent variables
The number of latent variables is an important parameter of the PLS model. There are many approaches to determining it, such as genetic algorithms, the F-test and cross-validation methods. Cross-validation methods include K-fold cross-validation (k-CV), leave-one-out cross-validation (LOOCV), Monte Carlo cross-validation (MCCV) and so on. In this paper, we use 10-fold cross-validation to determine the number of latent variables.
3.3 The sub-models selective rule
In ensemble strategies, a subset of well-performing sub-models can usually provide more diversity [27]. Zhou et al. therefore suggested that using part of the sub-models may be a better choice than using all of them [32]. Herein, the original training set is arbitrarily divided into three parts: calibration sets, validation sets and prediction sets [45]. We establish 100 PLS sub-models in the validation sets by sub-sampling and re-weighting the existing calibration samples, respectively. The proposed method directly constructs diverse models from virtual samples produced from the original calibration samples, which can increase the ensemble diversity when the calibration samples are insufficient [45]. Applying these 100 sub-models to the validation set gives 100 different classification accuracies. The sub-models are then sorted by classification accuracy in descending order, and those with the highest accuracy take part in the final ensemble.
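The ranking step can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function names are hypothetical, labels are assumed to be coded as 0 and 1, and each sub-model is assumed to classify by thresholding its linear prediction at 0.5, which the text does not specify.

```python
import numpy as np

def validation_accuracy(b, b0, X_val, y_val, threshold=0.5):
    """Accuracy of one sub-model on 0/1 labels, thresholding X b + b0."""
    pred = (X_val @ b + b0 > threshold).astype(int)
    return float(np.mean(pred == y_val))

def select_submodels(B, intercepts, X_val, y_val, n_keep=15):
    """Return the columns of B (p x M) with the highest validation accuracy."""
    acc = np.array([validation_accuracy(B[:, j], intercepts[j], X_val, y_val)
                    for j in range(B.shape[1])])
    order = np.argsort(-acc, kind="stable")       # descending, ties keep order
    keep = order[:n_keep]
    return B[:, keep], keep, acc

# Toy example with one feature and three candidate sub-models.
X_val = np.array([[0.0], [1.0], [2.0], [3.0]])
y_val = np.array([0, 0, 1, 1])
B = np.array([[0.5, -0.5, 0.0]])
intercepts = np.array([-0.25, 0.75, 0.6])
B_sel, keep, acc = select_submodels(B, intercepts, X_val, y_val, n_keep=2)
```

The selected columns `B_sel` can then be passed on to the fusion step in place of the full set of sub-models.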
3.4 Determination of Dimensions
PCA performs an eigenvalue decomposition (EIG) or SVD on a matrix and produces a matrix of eigenvalues. To select the principal components, only the first few eigenvalues are kept. How many eigenvalues should be taken? Usually the cumulative contribution rate is used to retain the useful eigenvalues automatically.
Using PMA for dimension reduction means obtaining scores by projecting new samples onto the directions of the principal models, so the number of final dimensions equals the number of selected principal models. In the experiments, if the fixed-dimension parameter, which is one of the inputs, is greater than or equal to 1, it is used directly as the number of principal models. Otherwise, if it is greater than 0 and less than 1, it is interpreted as a cumulative contribution rate for selecting the principal models.
As can be inferred from the selection of sub-models, not many principal models are needed. Because the selected sub-models have almost the same classification ability, a single principal model retains almost all of it. Therefore, in practical applications, we take only the eigenvector with the largest eigenvalue as the principal model.
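The rule for the dimension parameter described in this subsection can be written as a small helper. This is a sketch with a hypothetical function name, `n_principal_models`; capping a fixed dimension at the number of available eigenvalues is an added assumption.

```python
import numpy as np

def n_principal_models(eigvals, dim):
    """Number of principal models to keep for a given `dim` input parameter."""
    lam = np.sort(np.clip(np.asarray(eigvals, dtype=float), 0.0, None))[::-1]
    if dim >= 1:                                   # fixed number of dimensions
        return min(int(dim), len(lam))
    if 0 < dim < 1:                                # cumulative contribution rate
        cum = np.cumsum(lam) / lam.sum()
        return min(int(np.searchsorted(cum, dim)) + 1, len(lam))
    raise ValueError("dim must be positive")

# Eigenvalues 8, 1, 1 give cumulative rates 0.8, 0.9, 1.0.
fixed = n_principal_models([8.0, 1.0, 1.0], 2)
by_rate = n_principal_models([8.0, 1.0, 1.0], 0.85)
```

In this example `fixed` is 2 because the value is taken literally, and `by_rate` is also 2 because two eigenvalues are needed before the cumulative rate reaches 0.85.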
4 Experimental results
4.1 Data Sets
In order to evaluate the performance of the proposed PMA method, we compare it with the PLS, LDA, PLS-LDA and Bagging PLS methods on three types of data sets:
1. Four UCI datasets, i.e., Breast data, Spambase data, Gas data, Musk data (Version 1) (obtained from http://archive.ics.uci.edu/ml/datasets.html);
2. Small data and Imbalanced data;
3. Raman spectral data (Raman).
The details of these datasets are shown in Table 1. Before using the datasets, we remove non-numerical and missing input data and convert the class labels to a numeric type.
| Data Set | Number of Examples | Number of Attributes | Class label | Year |
|---|---|---|---|---|
| Breast | 569 | 30 | 1 and 2 | 1995 |
| Spambase | 4601 | 57 | 0 and 1 | 1999 |
| Gas | 4782 | 128 | 5 and 6 | 2012 |
| Musk (Version 1) | 168 | 476 | 0 and 1 | 1994 |
| small | 300 | 476 | 0 and 1 | 1994 |
| imbalanced | 7074 | 476 | 0 and 1 | 1994 |
| Raman | 925 | 101 | 0 and 4 | N/A |
The data sets “small” and “imbalanced” are randomly sampled from the data set “Musk (Version 1)”. The data set “small” is a typical data set with high dimensionality and few samples, in which the numbers of positive and negative samples are the same. The data set “imbalanced” is an imbalanced data set in which the ratio of positive to negative samples is 6:1.
The spectral data set “Raman” was obtained with a standard Raman spectroscope (HR LabRam invers, Jobin-Yvon-Horiba). The excitation wavelength of the laser (Nd:YAG, frequency doubled) is centered at 532 nm. There are 2545 spectra available for 20 different strains [34]. Herein we select two classes (B. subtilis DSM 10 and M. luteus DSM 348) and use the spectra in the region 1100-1200 in the calculations [35].
Five dimension reduction methods, i.e., PLS, LDA, PLS-LDA, Bagging PLS and PMA, are compared in our experiments. For the Bagging PLS algorithm, fifteen models are generated by random sampling with replacement, and the final model is obtained by averaging these fifteen sub-models. For the PMA method, 100 sub-models are generated from the validation set, and the fifteen sub-models with the highest accuracies are chosen for model fusion. For all methods except LDA, i.e., PLS, PLS-LDA, Bagging PLS and PMA, the number of latent variables is determined by 10-fold cross-validation. In the experiments, the dimensionality of the original data is reduced to 1. For a fair comparison, the linear Naive Bayes classifier is used to evaluate the results of the different dimension reduction methods.
For each data set, we randomly choose 49%, 30% and 21% of the samples to form the training set, test set, and validation set, respectively. Each experiment is repeated 20 times with random splits, and the averaged results are recorded.
5 Results and Discussion
5.1 Classification performance of different algorithms
This section investigates the classification performance of the various algorithms. We report the results on both the training and testing datasets. The classification accuracies are reported in Table 2 and Table 3, respectively.
| Data Set | PLS | LDA | PLS-LDA | Bagging PLS | PMA |
|---|---|---|---|---|---|
| Musk (Version 1) | 0.9119 | 0.7176 | 0.9039 | 0.9126 | **0.9176** |

The bold value indicates the maximum accuracy among the different methods.
| Data Set | PLS | LDA | PLS-LDA | Bagging PLS | PMA |
|---|---|---|---|---|---|
| Musk (Version 1) | 0.9052 | 0.7003 | 0.8980 | 0.9059 | **0.9108** |

The bold value indicates the maximum accuracy among the different methods.
The small sample size problem is often encountered in pattern recognition. It may lead to singularity of the within-class scatter matrix in LDA, so the LDA algorithm performs poorly on the data sets “small” and “imbalanced”. PLS shows good overall classification performance. The PLS-LDA algorithm first removes redundancy and noise from the data set with PLS and then applies LDA to the PLS-reduced features. PLS-LDA gives better results than LDA except on the data sets “Spambase” and “Gas”, but it still appears to over-fit on the data sets “small” and “imbalanced”. Because the PLS dimension reduction step may lose some information, the results of PLS-LDA are worse than those of PLS. Bagging PLS achieves better results than PLS on the data sets “Breast”, “Spambase”, “Raman” and “Musk (Version 1)”. Although many sub-models in Bagging PLS perform better than PLS, the improvement of Bagging PLS over PLS is not significant because of the averaging strategy. As Tables 2 and 3 show, all algorithms over-fit on the data set “small”. Nevertheless, the proposed PMA algorithm achieves the best results on both the training and testing sets except for the data set “imbalanced”. Its advantage is most evident on the data sets “Breast” and “Spambase”.
The above figures are box plots of the accuracy of the different algorithms. For the data sets “Breast”, “Spambase”, “Raman” and “Musk (Version 1)”, the results of the LDA algorithm are clearly much worse than those of the other methods. The results of PLS are also unstable on the data set “Spambase”. The results of PLS-LDA are worse than the others on the data sets “Spambase” and “Gas”. On the data set “small”, all algorithms over-fit. Except for the data set “small”, the PMA algorithm gives more stable results than the other methods.
5.2 Investigation on the number of sub-models
Figure 8 shows that the PMA model is not very sensitive to the number of sub-models. In general, the classification accuracy on each data set decreases as the number of sub-models increases, which indicates that not all of the sub-models are useful. It is therefore likely that the classification performance can be improved by choosing a subset of good sub-models. For the data sets “Breast” and “Raman”, the number of sub-models greatly affects the classification results. The number of sub-models can be determined empirically by cross-validation.
5.3 Impact of PMA dimensions
To investigate the effect of the PMA dimensionality, we show the classification results for dimensionalities ranging from 1 to 30. As can be seen from Figure 9, the classification accuracy does not improve with increasing dimensionality on any data set. A possible explanation is that the first principal component already contains most of the information in the entire data. The results on the data sets “Gas” and “Musk (Version 1)” are relatively stable.
5.4 Discussions of the proposed method
The proposed PMA algorithm extends the original Bagging PLS for qualitative analysis. The results on the six data sets show that the PMA algorithm can improve the classification accuracy to a certain extent. Model ensembles have many advantages, such as enhanced robustness. However, the number of sub-models and the number of dimensions must be chosen carefully.
In this paper, we have proposed a PMA method for classification. By means of an ensemble strategy, the proposed PMA method fuses the results of PLS sub-models and finds the principal model by performing PCA on the joint coefficient matrix of all sub-models. Experimental results demonstrate that the proposed PMA method can achieve better classification performance than the original PLS and Bagging PLS. Our future work will focus on finding more comprehensive evaluation criteria for the selection of sub-models. In addition, we will apply PMA to semi-supervised problems by adding a large amount of unlabeled data.
- (1) Afara, I., Singh, S., Oloyede, A.: Application of near infrared (nir) spectroscopy for determining the thickness of articular cartilage. Medical Engineering and Physics 35(1), 88–95 (2013)
- (2) Barker, M., Rayens, W.: Partial least squares for discrimination. Journal of Chemometrics 30(3), 446–452 (2012)
- (3) Bellman, R.: Adaptive Control Processes: A Guided Tour. The University Press (1961)
- (4) Bi, Y., Xie, Q., Peng, S., Tang, L., Hu, Y., Tan, J., Zhao, Y., Li, C.: Dual stacked partial least squares for analysis of near-infrared spectra. Analytica Chimica Acta 792(16), 19–27 (2013)
- (5) Bian, X., Li, S., Shao, X., Liu, P.: Variable space boosting partial least squares for multivariate calibration of near-infrared spectroscopy. Chemometrics and Intelligent Laboratory Systems 158 (2016)
- (6) Boulesteix, A.L.: Pls dimension reduction for classification with microarray data. Statistical Applications in Genetics and Molecular Biology 3(1), 392 (2004)
- (7) Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996)
- (8) Chen, L.F., Liao, H.Y.M., Ko, M.T., Lin, J.C., Yu, G.J.: A new lda-based face recognition system which can solve the small sample size problem. Pattern Recognition 33(10), 1713–1726 (2000)
- (9) Chiang, K.Y., Hsieh, C.J., Dhillon, I.S.: Robust principal component analysis with side information. In: International Conference on Machine Learning, pp. 2291–2299 (2016)
- (10) Efron, B.: An introduction to the bootstrap. Chapman and Hall (1995)
- (11) Ferrari, A.C., Robertson, J.: Interpretation of raman spectra of disordered and amorphous carbon. Physical Review B 61(20), 14095–14107 (2000)
- (12) Folch-Fortuny, A., Arteaga, F., Ferrer, A.: Pls model building with missing data: New algorithms and a comparative study. Journal of Chemometrics 31(1-2) (2017)
- (13) Ginkel, J.R.V., Kroonenberg, P.M.: Using generalized procrustes analysis for multiple imputation in principal component analysis. Journal of Classification 31(2), 242–269 (2014)
- (14) Goodhue, D.L., Lewis, W., Thompson, R.: Does pls have advantages for small sample size or non-normal data? Mis Quarterly 36(3), 981–1001 (2012)
- (15) Hu, Y., Peng, S., Peng, J., Wei, J.: An improved ensemble partial least squares for analysis of near-infrared spectra. Talanta 94, 301–307 (2012)
- (16) Huang, R., Liu, Q., Lu, H., Ma, S.: Solving the small sample size problem of lda 3, 29–32 (2002)
- (17) Jolliffe, I.T., Cadima, J.: Principal component analysis: a review and recent developments. Philosophical Transactions 374(2065), 20150202 (2016)
- (18) Kambhatla, N., Leen, T.: Dimension reduction by local principal component analysis. Neural Computation 9(7), 1493–1516 (1997)
- (19) Liu, Y., Rayens, W.: Pls and dimension reduction for classification. Computational Statistics 22(2), 189–208 (2007)
- (20) Long, C., Guizeng, W.: Soft sensing based on pls with iterated bagging method. Journal of Tsinghua University 48, 86–90 (2008)
- (21) Maclin, R., Opitz, D.: Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research 11, 169–198 (2011)
- (22) Marigheto, N.A., Kemsley, E.K., Defernez, M., Wilson, R.H.: A comparison of mid-infrared and raman spectroscopies for the authentication of edible oils. Journal of the American Oil Chemists’ Society 75(8), 987–992 (1998)
- (23) Martinez, A.M., Kak, A.C.: Pca versus lda. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(2), 228–233 (2001)
- (24) Mendes-Moreira J Soares C, J.A.M.: Ensemble approaches for regression: A survey. ACM Computing Surveys 45(1), 10 (2011)
- (25) Montanari, A.: Linear discriminant analysis and transvariation. Journal of Classification 21(1), 71–88 (2004)
- (26) Ni, W., Brown, S.D., Man, R.: Stacked partial least squares regression analysis for spectral calibration and prediction. Journal of Chemometrics 23(10), 505–517 (2010)
- (27) Peschke, K.D., Haasdonk, B., Ronneberger, O., Burkhardt, H., Harz, M.: Using transformation knowledge for the classification of raman spectra of biological samples. In: Proceedings of the 4th Iasted International Conference on Biomedical Engineering, pp. 288–293 (2006)
- (28) Qin, X., Gao, F., Chen, G.: Wastewater quality monitoring system using sensor fusion and machine learning techniques. Water Research 46(4), 1133–1144 (2012)
- (29) Ren, D., Qu, F., Lv, K., Zhang, Z., Xu, H., Wang, X.: A gradient descent boosting spectrum modeling method based on back interval partial least squares. Neurocomputing 171(C), 1038–1046 (2012)
- (30) Shao, X., Bian, X., Cai, W.: An improved boosting partial least squares method for near-infrared spectroscopic quantitative analysis. Analytica Chimica Acta 666(1-2), 32–37 (2010)
- (31) Shao-Hong, G.U., Wang, Y.S., Wang, G.X.: Application of principal component analysis model in data processing. Journal of Surveying and mapping 24(5), 387–390 (2007)
- (32) Tan, C., Wang, J., Wu, T., Qin, X., Li, M.: Determination of nicotine in tobacco samples by near-infrared spectroscopy and boosting partial least squares. Vibrational Spectroscopy 54(1), 35–41 (2010)
- (33) Trendafilov, N.T., Unkel, S., Krzanowski, W.: Exploratory factor and principal component analyses: some new aspects. Kluwer Academic Publishers (2013)
- (34) Wold, S., Esbensen, K., Geladi, P.: Principal component analysis. Chemometrics and Intelligent Laboratory Systems 2(1), 37–52 (1987)
- (35) Wold, S., Sjostrom, M., Eriksson, L.: Pls-regression: a basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems 58(2), 109–130 (2001)
- (36) Xu, L., Jiang, J.H., Zhou, Y.P., Wu, H.L., Shen, G.L., Yu, R.Q.: Mccv stacked regression for model combination and fast spectral interval selection in multivariate calibration. Chemometrics and Intelligent Laboratory Systems 87(2), 226–230 (2007)
- (37) Ye, J.: Least squares linear discriminant analysis. In: Proceedings of the 24 International Conference on Machine Learning, pp. 1087–1093 (2007)
- (38) Ye, J., Janardan, R., Li, Q.: Two-dimensional linear discriminant analysis. Advances in Neural Information Processing Systems pp. 1431–1441 (2005)
- (39) Zhang, M.H., Xu, Q.S., Massart, D.L.: Boosting partial least squares. Analytical Chemistry 77(5), 1423–1431 (2005)
- (40) Zheng, W., Zhao, L., Zou, C.: An efficient algorithm to solve the small sample size problem for lda. Pattern Recognition 37(5), 1077–1079 (2004)
- (41) Zhou, Z.H., Wu, J., Tang, W.: Ensembling neural networks: Many could be better than all. Artificial Intelligence 137(1), 239–263 (2002)