Feature extraction has become an important approach in data mining and pattern recognition. Traditional methods, however, usually suffer from intrinsic limitations that stem from the characteristics of the original data. The first is the high dimensionality of samples, which hinders efficient calculation; the prominent solutions come down to direct approaches to the decomposition of scatter matrices. Furthermore, there is broad interest in large-scale data mining in many real-world applications, which poses new challenges for feature analysis and reduction. In response to this demand, devising improved learning methods for data that are both high-dimensional and large in amount has become an active research topic.
In the literature, principal component analysis (PCA) and linear discriminant analysis (LDA) have become popular methods for pattern analysis in statistical learning theory. To address the high dimensionality of the original data, it is common practice to obtain the eigen-decomposition of scatter matrices through computational tricks based on multiplications of sub-matrices. Besides the traditional ratio-based LDA, there is another approach that performs discriminant analysis with a subtraction formalism, i.e., the maximum margin criterion (MMC), with which the null-denominator problem can be avoided. Nevertheless, as an old Chinese saying goes, “Something is lost whenever something is gained”. The sub-matrix trick is no longer available for MMC, and extra calculations are involved in general. In our previous work, a direct solution was proposed to handle the computational limitation of the large discriminant scatter of high-dimensional data for MMC. Discriminant analysis can then proceed straightforwardly, with both the between-class and within-class scatters considered together and efficiency preserved.
In this work, the direct MMC framework is further developed to conduct adaptive learning and to remain insensitive to the high dimensionality of data in any scenario. Furthermore, several extensions of MMC are proposed to conduct adaptive classification of different categories of data. The rest of this paper is organized as follows. The background of direct MMC is reviewed in Section 2, where the computational efficiency is discussed in theory, followed by the details of an improved MMC. Then, several extensions of MMC are proposed for different application scenarios in Section 3. A set of comparison experiments on discriminant learning is given in Section 4. Finally, the conclusion is drawn in Section 5.
2 Direct Maximum Margin Criterion
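The original MMC considers the subtraction of the discriminant scatters of the original data. Given a data set $X = [x_1, \dots, x_n] \in \mathbb{R}^{d \times n}$ in $c$ classes, the between-class and within-class scatters are defined as

$$S_b = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^T, \qquad S_w = \sum_{i=1}^{c} \sum_{j=1}^{n_i} (x_j^i - m_i)(x_j^i - m_i)^T,$$

where $m_i$ and $m$ respectively denote the mean of the $i$-th class and of the whole data, $n_i$ denotes the number of samples in the $i$-th class, and $x_j^i$ denotes the $j$-th sample of the $i$-th class. Besides, the discriminant scatters can also be described in graph form with the definition of Laplacian matrices $L_b$ and $L_w$, i.e., $S_b = X L_b X^T$ and $S_w = X L_w X^T$. As a result, MMC solves the following quadratic optimization objective to find the ideal projection $W$,

$$\max_{W^T W = I} \; \mathrm{tr}\left(W^T (S_b - \alpha S_w) W\right) = \max_{W^T W = I} \; \mathrm{tr}\left(W^T X L X^T W\right).$$

Here, $\alpha$ indicates the trade-off parameter that balances the between-class and within-class scatters, and $L = L_b - \alpha L_w$ refers to the graph Laplacian of MMC. The solution to this objective can be reached via the eigen-decomposition of $X L X^T$, i.e.,

$$X L X^T w = \lambda w.$$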
Compared with the traditional LDA framework, MMC is able to avoid the rank deficiency of scatters and to exceed the bound that the number of classes places on the number of extracted features. Nevertheless, it can hardly be applied to high-dimensional data, as eigen-analysis of a large matrix usually leads to memory overflow. In light of the kernel-view idea, we proposed a direct approach to efficient discriminant analysis, and the whole procedure is given in Algorithm 1.
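For reference, the standard (non-kernel) MMC solution discussed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not Algorithm 1: samples are stored as columns, the function names `mmc_scatters` and `mmc_fit` are hypothetical, and the trade-off parameter is exposed as `alpha`.

```python
import numpy as np

def mmc_scatters(X, y):
    """Between-class and within-class scatters of X (d x n, one sample per column)."""
    d = X.shape[0]
    mean_all = X.mean(axis=1, keepdims=True)
    S_b = np.zeros((d, d))
    S_w = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[:, y == c]
        mean_c = Xc.mean(axis=1, keepdims=True)
        diff = mean_c - mean_all
        S_b += Xc.shape[1] * (diff @ diff.T)   # n_i (m_i - m)(m_i - m)^T
        centered = Xc - mean_c
        S_w += centered @ centered.T           # sum_j (x_j - m_i)(x_j - m_i)^T
    return S_b, S_w

def mmc_fit(X, y, n_components, alpha=1.0):
    """Top eigenvectors of S_b - alpha * S_w (assumed orthonormal constraint on W)."""
    S_b, S_w = mmc_scatters(X, y)
    evals, evecs = np.linalg.eigh(S_b - alpha * S_w)
    order = np.argsort(evals)[::-1][:n_components]
    return evecs[:, order], evals[order]

# Toy data (illustrative only): 3 classes, 20 samples each, 5 dimensions.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 20)
X = 3.0 * rng.normal(size=(5, 3))[:, y] + rng.normal(size=(5, 60))
W, evals = mmc_fit(X, y, n_components=2)
Z = W.T @ X   # reduced 2 x 60 representation
```

Because the decomposed matrix is $d \times d$, this form is only practical when the dimensionality is moderate, which is the limitation the direct approach is meant to address.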
Though MMC involves discriminant scatters similar to those of LDA, the proposed direct approach is quite distinct from traditional ideas. To avoid extra branches of execution, a dimensional threshold is added to MMC. If the sample dimension is less than the given threshold, the standard procedure of MMC is followed. Otherwise, an efficient computational scheme is adopted, and the sample kernel is constructed to obtain the kernel scatter. As a consequence, the original MMC problem is transformed to

$$\max_{V} \; \mathrm{tr}\left(V^T K L K V\right),$$

where $K = X^T X$ denotes the sample kernel of the data matrix $X$, $L$ is the graph Laplacian of MMC, and $V$ contains the resulting orthogonal directions of MMC. It is noticeable that the size of the decomposed matrix is reduced to $n \times n$, with $n$ the number of samples. Then, the final results can be obtained via computational tricks of matrix decomposition.
The most distinguishing points mainly come from steps 3 and 4. In step 3, the SVD is performed on a different matrix from the one used by traditional LDA, which generally applies a similar operation to the within-class scatter. Furthermore, the obtained orthogonal vectors are sorted in descending order of discriminant power, although they actually correspond to the largest singular values. Thereafter, the final projective directions need to be the reversed vectors in step 4. In addition, a dimensional threshold is absorbed into the original MMC for adaptive dimensionality reduction. That is, the efficient calculation approach is used if the dimensionality of the original data is larger than the given threshold.
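The kernel-view idea can be illustrated with a short sketch that never forms a $d \times d$ matrix. It is one possible realization under stated assumptions rather than a transcription of Algorithm 1: the helper names are hypothetical, the graph matrices are taken as $L_w = I - B$ and $L_b = B - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ (with $B$ the block matrix of $1/n_i$ entries within each class), and an orthonormal basis of the sample span is obtained from the eigen-decomposition of the $n \times n$ sample kernel.

```python
import numpy as np

def mmc_graph_laplacian(y, alpha=1.0):
    """n x n matrix L such that S_b - alpha * S_w = X L X^T (samples in columns)."""
    n = y.size
    B = np.zeros((n, n))
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        B[np.ix_(idx, idx)] = 1.0 / idx.size
    L_b = B - np.full((n, n), 1.0 / n)
    L_w = np.eye(n) - B
    return L_b - alpha * L_w

def kernel_mmc(X, y, n_components, alpha=1.0, tol=1e-10):
    """MMC directions for d >> n using only n x n decompositions."""
    K = X.T @ X                                   # sample kernel, n x n
    s2, V = np.linalg.eigh(K)
    keep = s2 > tol * s2.max()                    # drop numerically null directions
    s, V = np.sqrt(s2[keep]), V[:, keep]
    U = (X @ V) / s                               # orthonormal basis of span(X)
    L = mmc_graph_laplacian(y, alpha)
    M = (s[:, None] * (V.T @ L @ V)) * s[None, :] # equals U^T X L X^T U
    M = 0.5 * (M + M.T)                           # guard against round-off asymmetry
    evals, Q = np.linalg.eigh(M)
    order = np.argsort(evals)[::-1][:n_components]
    return U @ Q[:, order], evals[order]

# Toy data (illustrative only): 200 dimensions, 30 samples, 3 classes.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 10)
X = 5.0 * rng.normal(size=(200, 3))[:, y] + rng.normal(size=(200, 30))
W, evals = kernel_mmc(X, y, n_components=2)
```

Every eigenvector of $X L X^T$ with a nonzero eigenvalue lies in the span of the samples, so restricting the problem to that span loses nothing while the decomposition cost depends on $n$ instead of $d$.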
It is demonstrated that the proposed MMC method suffices to deal with linear supervised learning in general, while computational efficiency is preserved. The computational cost of the spectral decomposition is mainly $O(n^3)$, compared with $O(d^3)$ for the original MMC, where $n$ and $d$ denote the number of samples and the data dimensionality, respectively. For convenience, this approach is simply called MMC in this context. Nevertheless, there also exist exceptions in which $n$ is still too large for direct calculation on big data, so that efficiency cannot be reached. To address this limitation, an improved MMC is designed in this work to supplement our previous work. The main improvement concerns the construction of the sample kernel in MMC: a subset of the whole samples is picked to form the kernel matrix for the following steps. Supposing that $r$ samples are selected, the whole procedure is summarized as random MMC (RMMC) in Algorithm 2. The computational complexity of the decomposition thus reduces to $O(r^3)$ in this improved procedure, and the theoretical basis can be derived.
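A rough sketch of the random-subset idea follows. It is only one plausible reading of the description above, not Algorithm 2: the assumption here is that the randomly selected samples define the subspace in which MMC is solved, so the kernel and the decomposed scatter are both small. The names `random_mmc`, `scatter_difference`, and `n_selected` are hypothetical.

```python
import numpy as np

def scatter_difference(Z, y, alpha=1.0):
    """S_b - alpha * S_w for data Z with one sample per column."""
    mean_all = Z.mean(axis=1, keepdims=True)
    S = np.zeros((Z.shape[0], Z.shape[0]))
    for c in np.unique(y):
        Zc = Z[:, y == c]
        mean_c = Zc.mean(axis=1, keepdims=True)
        diff = mean_c - mean_all
        centered = Zc - mean_c
        S += Zc.shape[1] * (diff @ diff.T) - alpha * (centered @ centered.T)
    return S

def random_mmc(X, y, n_components, n_selected, alpha=1.0, seed=0, tol=1e-10):
    """Approximate MMC restricted to the span of randomly selected samples."""
    rng = np.random.default_rng(seed)
    picked = rng.choice(X.shape[1], size=n_selected, replace=False)
    X_r = X[:, picked]
    K_r = X_r.T @ X_r                              # small kernel of the subset
    s2, V = np.linalg.eigh(K_r)
    keep = s2 > tol * s2.max()
    U_r = (X_r @ V[:, keep]) / np.sqrt(s2[keep])   # orthonormal basis of span(X_r)
    Z = U_r.T @ X                                  # all samples in the small subspace
    evals, Q = np.linalg.eigh(scatter_difference(Z, y, alpha))
    order = np.argsort(evals)[::-1][:n_components]
    return U_r @ Q[:, order], evals[order]

# Toy data (illustrative only): 100 dimensions, 60 samples, 3 classes.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 20)
X = 5.0 * rng.normal(size=(100, 3))[:, y] + rng.normal(size=(100, 60))
W, evals = random_mmc(X, y, n_components=2, n_selected=10)
```

In this reading the result is an approximation: the criterion values found in the random subspace cannot exceed those of the full problem, and they coincide when all samples are selected.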
3 Adaptive Learning of MMC
With the progress of information technology, the categories of data to be handled and the application scenarios have become diverse. The goal of adaptive learning is to exploit the intrinsic patterns of data under different analysis demands within a framework that is as unified as possible.
3.1 Layered MMC and 2D MMC
With a multi-layer structure, it is believed that hidden features can be exploited by enlarging the original ones from data. More specifically, an intermediate layer is added to transform each datum into a much higher-dimensional space, which is then reduced in the following steps with general feature learning. For a given datum $x$, the transformation can be formalized as a mapping $\phi$ from the original space of dimensionality $d$ to a space of, in general, much higher dimensionality. Such an approach is closely related to the kernel learning framework, and can be conducted as an intermediate learning step, e.g., with a set of linear transformation directions. For more specific scenarios, it can be defined as a collaborative learning scheme that combines a nonlinear mapping with a linear projection, as employed in the extreme learning machine (ELM). Surprisingly, it is found that there is little disparity among the different layered approaches for MMC. The whole procedure of layered MMC is given as the LMMC algorithm.
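As an illustration of the layered idea, the sketch below expands the data with a fixed random nonlinear layer in the style of ELM and then applies MMC to the expanded features. The choices here are assumptions made for the example, not the LMMC algorithm itself: a `tanh` activation, Gaussian random weights, and the hypothetical names `random_hidden_layer` and `mmc_directions`.

```python
import numpy as np

def random_hidden_layer(X, n_hidden, seed=0):
    """Map d x n data to n_hidden x n features with a fixed random tanh layer."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(n_hidden, X.shape[0])) / np.sqrt(X.shape[0])
    b = rng.normal(size=(n_hidden, 1))
    return np.tanh(A @ X + b)

def mmc_directions(H, y, n_components, alpha=1.0):
    """Top eigenvectors of S_b - alpha * S_w computed on the features H."""
    mean_all = H.mean(axis=1, keepdims=True)
    S = np.zeros((H.shape[0], H.shape[0]))
    for c in np.unique(y):
        Hc = H[:, y == c]
        mean_c = Hc.mean(axis=1, keepdims=True)
        diff = mean_c - mean_all
        centered = Hc - mean_c
        S += Hc.shape[1] * (diff @ diff.T) - alpha * (centered @ centered.T)
    evals, evecs = np.linalg.eigh(S)
    return evecs[:, np.argsort(evals)[::-1][:n_components]]

# Toy data (illustrative only): 4 dimensions expanded to 50 hidden features.
rng = np.random.default_rng(1)
y = np.repeat(np.arange(3), 20)
X = 2.0 * rng.normal(size=(4, 3))[:, y] + rng.normal(size=(4, 60))
H = random_hidden_layer(X, n_hidden=50)   # intermediate high-dimensional layer
W = mmc_directions(H, y, n_components=2)  # reduction step
Z = W.T @ H
```

The same seed has to be reused when mapping test data, so that training and testing samples pass through an identical hidden layer.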
On the other hand, many real-world applications intuitively involve multi-dimensional media information, e.g., images and videos, which lie on a 2D space. To handle such data directly, some 2D-based methods have been proposed to make learning succinct, e.g., two-dimensional PCA and two-dimensional LDA. In general, the 2D raw data are used to find the information reflected between rows of images. Since the original 2D methods calculate a single direction for feature extraction, two-directional two-dimensional methods, e.g., 2DPCA and 2DLDA, were proposed to address the limitation of single-directional learning.
With this consideration, 2DMMC is devised as a natural extension of the original MMC. The main difference between 2DMMC and the original MMC is that 2D data are used in the construction of scatters, and certain steps need to be modified correspondingly. For a given 2D datum $A$, it aims to find bi-directional projections $U$ and $V$ that yield a smaller 2D datum $Y$, e.g.,

$$Y = U^T A V.$$

The calculation of $U$ and $V$ is mainly based on the construction of 2D scatters with respect to MMC for the left and right directions, which involve the 2D class means, the 2D total mean, and the 2D samples of each class. Then, the desired $U$ and $V$ can be obtained with the standard process of MMC, and the whole procedure is summarized as the 2DMMC algorithm.
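A compact sketch of a bi-directional 2D variant is given below. Since the exact scatter definitions are not recoverable from the text, the sketch assumes the common non-iterative form in which the left scatters use products of the form $D D^T$ and the right scatters use $D^T D$, with $D$ a difference of 2D means or samples; the function names are hypothetical.

```python
import numpy as np

def two_d_scatter_difference(images, y, side, alpha=1.0):
    """S_b - alpha * S_w for 2D data of shape (N, m, n).

    side="left" returns an m x m matrix, side="right" an n x n matrix.
    """
    mean_all = images.mean(axis=0)
    size = images.shape[1] if side == "left" else images.shape[2]
    S = np.zeros((size, size))
    for c in np.unique(y):
        Ac = images[y == c]
        mean_c = Ac.mean(axis=0)
        D = mean_c - mean_all            # 2D class mean minus 2D total mean
        E = Ac - mean_c                  # 2D samples minus their class mean
        if side == "left":
            S_b = Ac.shape[0] * (D @ D.T)
            S_w = np.einsum("kij,klj->il", E, E)   # sum_k E_k E_k^T
        else:
            S_b = Ac.shape[0] * (D.T @ D)
            S_w = np.einsum("kij,kil->jl", E, E)   # sum_k E_k^T E_k
        S += S_b - alpha * S_w
    return S

def top_eigenvectors(S, k):
    evals, evecs = np.linalg.eigh(S)
    return evecs[:, np.argsort(evals)[::-1][:k]]

def two_d_mmc(images, y, n_left, n_right, alpha=1.0):
    U = top_eigenvectors(two_d_scatter_difference(images, y, "left", alpha), n_left)
    V = top_eigenvectors(two_d_scatter_difference(images, y, "right", alpha), n_right)
    return U, V

# Toy data (illustrative only): 30 images of size 8 x 6 in 3 classes.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 10)
images = rng.normal(size=(3, 8, 6))[y] + 0.1 * rng.normal(size=(30, 8, 6))
U, V = two_d_mmc(images, y, n_left=3, n_right=2)
Y = U.T @ images @ V     # each 8 x 6 image becomes a 3 x 2 feature matrix
```

The decomposed matrices have the size of the image sides rather than of the vectorized image, which is what keeps 2D methods inexpensive.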
Similarly, it is also feasible to employ a dimension threshold to ensure computational efficiency, especially when large 2D data are involved, e.g., high-resolution images. The dimension of a directional side is reduced with the data kernel if the original length (number of rows or columns) is larger than the threshold. For example, this can be done with randomly selected rows (or columns) of the 2D sample data. Furthermore, it is straightforward to extend the original 2D methods to layered ones. Due to limited space, only the layered 2DMMC is discussed here. The main steps are quite similar to the LMMC algorithm, that is, row and column projections are first transformed into a high-dimensional space, followed by the MMC approach. The differences come from the handling of the high dimensionality of the mapped 2D data, which can be treated similarly with the RMMC solution. Instead, only one sample datum is generally selected for economical calculation, and partial columns (or rows) are employed. The single-layered 2DMMC (L2DMMC) is summarized in Algorithm 5. As a consequence, multi-layered 2DMMC can be derived easily and can be carried out in hierarchical structures of sequential networks.
3.2 MMC Network
For each image, a patch of fixed size is taken around each pixel, in the manner of local binary patterns (LBP). As a consequence, a set of vectorized patches is collected from each image, and the patch mean is removed from each of them. Then, the class mean and the intra-class scatter of each category are defined over the mean-removed patches of its images. Similarly, the inter-class scatter of image patches belonging to different categories is defined with respect to the mean of the class means. The second stage almost repeats the same process as the first stage, and the output stage is composed of hashing and histograms of the input images.
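The first-stage filter learning described above can be sketched as follows. The details are assumptions modeled on the discriminant variant of PCANet rather than taken from this paper: patches are zero-padded so that every pixel has one, scatters are averaged over images and classes, and the filters are the leading eigenvectors of the scatter difference. The names `mean_removed_patches` and `mmc_net_filters` are hypothetical.

```python
import numpy as np

def mean_removed_patches(image, k):
    """Return a (k*k, m*n) matrix: one zero-padded k x k patch per pixel, mean removed."""
    m, n = image.shape
    padded = np.pad(image, k // 2)
    windows = np.lib.stride_tricks.sliding_window_view(padded, (k, k))
    P = windows[:m, :n].reshape(m * n, k * k).T
    return P - P.mean(axis=0, keepdims=True)

def mmc_net_filters(images, y, k, n_filters, alpha=1.0):
    """Learn n_filters k x k filters from labeled images with an MMC-style criterion."""
    patches = np.stack([mean_removed_patches(img, k) for img in images])
    classes = np.unique(y)
    class_means = np.stack([patches[y == c].mean(axis=0) for c in classes])
    S_w = np.zeros((k * k, k * k))
    for c, G in zip(classes, class_means):
        E = patches[y == c] - G                    # deviation from the class mean
        S_w += np.einsum("iap,ibp->ab", E, E) / E.shape[0]
    D = class_means - class_means.mean(axis=0)     # deviation from the mean of class means
    S_b = np.einsum("cap,cbp->ab", D, D) / len(classes)
    evals, evecs = np.linalg.eigh(S_b - alpha * S_w)
    top = evecs[:, np.argsort(evals)[::-1][:n_filters]]
    return top.T.reshape(n_filters, k, k)

# Toy data (illustrative only): 12 images of size 6 x 6 in 3 classes.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 4)
images = rng.normal(size=(3, 6, 6))[y] + 0.1 * rng.normal(size=(12, 6, 6))
filters = mmc_net_filters(images, y, k=3, n_filters=4)
```

Each learned filter would then be convolved with the input images to produce the feature maps fed to the next stage.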
Fig. 1. The results of different methods on three data sets. (a) Experimental results on the SUN database. (b) Experimental results on the MNIST database. (c) Experimental results on the STL-10 data set.
4 Experiments
In this section, several experiments are performed to evaluate the performance of the proposed MMC methods (the implementations are available at http://mch.one/resources). First of all, the ability of linear feature extraction is tested on three data sets, namely the SUN scene categorization database (http://vision.princeton.edu/projects/2010/SUN), the MNIST digit database (http://yann.lecun.com/exdb/mnist), and the STL-10 data set.
In the SUN database, the deep features of each image are extracted with the Keras toolkit (https://keras.io) using the VGG-16 model pre-trained on ImageNet, and a 512-dimensional feature is obtained to describe each image. Among all categories, 100 classes are randomly selected for the experiments, and the images of each class are randomly split in half for training and testing. For each digit in the MNIST data set, 2,000 images from the training data are randomly selected to form the training set, while 500 images from the testing data are used for testing. As a consequence, the training set consists of 20,000 images in total, while the testing set contains 5,000 images. Furthermore, the simple sparse coding features of the MNIST data are adopted to improve classification. For the STL-10 data set, the deep representation of each image with target coding is employed in the experiments, and a 255-dimensional feature is adopted for each datum. Similarly, 2,000 and 500 samples are randomly selected from the training and testing categories to form the training and testing sets, respectively.
The results on different data sets are shown in Fig. 1. For simplicity, the number of randomly selected samples is set to twice the reduced dimension for the RMMC algorithm. In terms of results, PCA presents stable results for the SUN and MNIST data sets, but cannot learn discriminative information from unsupervised features. Similarly, LDA is only capable of discriminant analysis for two data sets. The results of RMMC get close to the best ones, while MMC is less competent in pattern analysis than other linear methods on the SUN database. In particular, MMC can hardly be carried out for MNIST in our experiments, with both high dimensionality and a large number of samples, whereas RMMC can accomplish the task. Furthermore, RMMC reaches results close to those of MMC in most cases, with much higher efficiency. For the layered MMC algorithms, stable performance is still available, and deeper layers lead to better recognition performance except on the SUN data set.
In the second experiment, the discriminant ability of 2D features is evaluated on the ALOI (http://aloi.science.uva.nl) and MNIST databases. In the ALOI data set, the whole data are combined and the original order of the data is shuffled, and then a subset of 50 categories is randomly selected for the experiments. For each category of object, 18 images are randomly picked to form a small training set, and the remaining 54 images form the testing set. For each digit of MNIST, 2,000 and 500 images from the training and testing sets are randomly selected as 2D data, respectively. Since most methods give close results in different dimensions, a bar chart is adopted to illustrate the results in Fig. 2.
As the experimental results show, the discriminant abilities of the different 2D methods are quite close to each other on the ALOI data set. The best result is achieved by 2DLDA with 95.89%, followed by 2DPCA with 95.63%. Nevertheless, it is noticeable that 2DMMC can obtain a close result of 91.7%, and 2DMMC is able to reach 95.37%. In other words, MMC methods can attain performance similar to that of other methods. For the MNIST database, 2DPCA presents the best result of 95.44%, and other methods can hardly reach above 90%. In contrast, the results of 2DLDA and 2DMMC are the weakest among all algorithms, although 2DMMC can still get recognition results close to those of the 2DPCA method.
In the third experiment, different dimensionality reduction methods are adopted to learn deep neural network structures, i.e., the PCA, LDA, and MMC networks. Two data sets, MNIST and ALOI, are involved in this experiment. For each digit in MNIST, 100 images are randomly selected from the training and testing sets, while 30 categories are randomly selected from the ALOI data set, with images reshaped to a size of 30 × 30. To reduce computational complexity, only three stages are employed to learn the filter banks, and the number of filters is set to eight. With different patch sizes, it is possible to disclose their intrinsic effect on subspace neural networks. The experimental results for patch sizes of 3, 4, and 5 on the two data sets are shown in Fig. 3. In terms of the results, there are few differences among the three methods, and both LDA-Net and MMC-Net reach better results than PCA-Net in the stage of dimensionality reduction. Furthermore, it seems that quite similar performance can be obtained with small patch sizes.
5 Conclusion
As a classical learning method, MMC is quite popular in various fields of data mining and pattern analysis, owing to its universality in intelligent computing. In this work, a direct MMC approach is given first, and then several variants of MMC are designed for adaptive learning, i.e., random MMC, layered MMC, and 2D-based MMC. Inspired by the PCA network, an MMC network method is proposed to make simple deep learning applicable. Experiments on several data sets demonstrate the comparable performance of the proposed methods on different categories of data, and show that they are competent to learn the associated patterns in terms of adaptive recognition.
Acknowledgements. The authors would like to thank Universität zu Lübeck for sparse coding data set of MNIST, and the Chinese University of Hong Kong for target coding data set of STL-10. The corresponding author of this work is Dr. Miao Cheng.
References
-  Turk, M. A., Pentland, A. P.: Eigenfaces for Recognition. J. Cogn. Neurosci. 3(1), 71-86 (1991)
-  Belhumeur, P. N., Hespanha, J. P., and Kriegman, D. J.: Eigenfaces vs. Fisherfaces: Recognition using Class Specific Linear Projection. IEEE Trans. Patt. Anal. Machine Intell. 19(7), 711-720 (1997)
-  Bishop, C. M.: Pattern Recognition and Machine Learning. Springer (2011)
-  Yu, H. and Yang, J.: A Direct LDA Algorithm for High-dimensional Data - With Application to Face Recognition. Patt. Recog. 34(10), 2067-2070 (2001)
-  Li, H., Jiang, T., and Zhang, K.: Efficient and Robust Feature Extraction by Maximum Margin Criterion. IEEE Trans. Neural Networks 17(1), 157-165 (2006)
-  Liu, J., Chen, S., Tan, X., et al.: Comments on 'Efficient and Robust Feature Extraction by Maximum Margin Criterion'. IEEE Trans. Neural Networks 18(6), 1862-1864 (2007)
-  Cheng, M., Tang, Y. Y., Pun, C.-M.: Nonparametric Feature Extraction via Direct Maximum Margin Alignment. In: Proc. of IEEE Intern. Conf. Machine Learning and Appl., pp. 585-591, Hawaii, USA (2010)
-  Cheng, M., Fang, B., Tang, Y. Y., et al.: Incremental Embedding and Learning in the Local Discriminant Subspace with Application to Face Recognition. IEEE Trans. Sys. Man and Cybern., Part C: Appl. and Review. 40(5), 580-591 (2010)
-  Yan, S., Xu, D., Zhang, B., et al.: Graph Embedding and Extensions: A General Framework for Dimensionality Reduction. IEEE Trans. Patt. Anal. and Machine Intell. 29(1), 40-51 (2007)
-  Cheng, M., and Tsoi, A. C.: CRH: A Simple Benchmark Approach to Continuous Hashing. In: Proc. IEEE Global Conf. Signal and Info. Process., pp. 1076-1080, Orlando, USA (2015)
-  Huang, G.-B., Zhou, H., Ding, X., et al.: Extreme Learning Machine for Regression and Multiclass Classification. IEEE Trans. Sys. Man and Cybern. Part B: Cybern. 42(2), 513-529 (2012)
-  Yang, J., Zhang, D., Frangi, A. F., et al.: Two-dimensional PCA: A New Approach to Appearance-based Face Representation and Recognition. IEEE Trans. Patt. Anal. Machine Intell. 26(1), 131-137 (2004)
-  Kong, H., Li, X., Wang, L., et al.: Generalized 2D Principal Component Analysis. In: Proc. of IEEE Intern. Joint Conf. Neural Networks, Montreal, Canada (2005)
-  Li, M., and Yuan, B.: 2D-LDA: A Statistical Linear Discriminant Analysis for Image Matrix. Patt. Recog. Lett. 26(5), 527-532 (2005)
-  Xiong, H., Swamy, M. N. S., and Ahmad, M. O.: Two-dimensional FLD for Face Recognition. Patt. Recog. 38(7), 1121-1124 (2005)
-  Ye, J., Janardan, R., and Li, Q.: Two-dimensional Linear Discriminant Analysis. In: NIPS. (2004)
-  Chan, T.-H., Jia, K., Gao, S., et al.: PCANet: A Simple Deep Learning Baseline for Image Classification? IEEE Trans. Image Process. 24(12), 5017-5032 (2015)
-  Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns. IEEE Trans. Patt. Anal. Machine Intell. 24(7), 971-987 (2002)
-  Xiao, J., Ehinger, K. A., Hays, J., et al.: SUN Database: Exploring a Large Collection of Scene Categories. IJCV. 119(1), 3-22 (2016)
-  LeCun, Y., Bottou, L., Bengio, Y., et al.: Gradient-based Learning Applied to Document Recognition. Proc. of the IEEE. 86(11), 2278-2324 (1998)
-  Labusch, K., Barth, E., and Martinetz, T.: Simple Method for High-Performance Digit Recognition based on Sparse Coding. IEEE Trans. Neural Networks. 19(11), 1985-1989 (2008)
-  Coates, A., Lee, H., and Ng, A. Y.: An Analysis of Single-Layer Networks in Unsupervised Feature Learning. In: Proc. of International Conf. Artif. Intell. and Stat., pp. 215-223, Ft. Lauderdale, USA (2011)
-  Yang, S., Luo, P., Loy, C. C., et al.: Deep Representation Learning with Target Coding. In: AAAI, pp. 3848-3854, Austin Texas, USA (2015)
-  Geusebroek, J. M., Burghouts, G. J., and Smeulders, A. W. M.: The Amsterdam Library of Object Images. IJCV. 61(1), 103-112 (2005)