High-dimensional data is widely used in pattern recognition and data mining [ISI:000187181700003; ISI:000459699900007; ISI:000440853500001], which not only causes a huge waste of time and cost, but also gives rise to the "curse of dimensionality" [ISI:000085472300002]. Therefore, feature extraction is of great significance [ISI:000323503100002; ISI:000500691600053; ISI:000439363600022].
To address this problem, a large number of feature extraction methods have been proposed. These methods can be divided into unsupervised, semi-supervised, and supervised categories. In this article, we mainly study unsupervised and supervised feature extraction. Among unsupervised feature extraction models, the classical Principal Component Analysis (PCA) [ISI:000372380700010]
has proved its wide applicability and effectiveness. PCA seeks the projection that maximizes the variance of the samples in the subspace, which benefits subsequent classification, clustering, or other tasks. However, PCA has an obvious limitation: it is a linear feature extraction method and does not handle nonlinear data well. With the development of manifold learning, a large number of nonlinear feature extraction methods have effectively addressed this problem, such as Isometric Feature Mapping (ISOMAP) [ISI:000165995800049], Laplacian Eigenmaps (LE) [ISI:000180520100073], and Locally Linear Embedding (LLE) [Wang2012]. However, these nonlinear methods cannot be applied to new sample points because they directly obtain the low-dimensional representation of the samples without the help of a projection matrix. To solve this problem, many nonlinear methods based on the manifold assumption have been recast into linearized versions: Locality Preserving Projections (LPP) [He2003Locality] can be viewed as a linearized LE, Neighborhood Preserving Embedding (NPE) [inproceedings] as a linearized LLE, and isometric projection as a linearized ISOMAP. Although they all preserve manifold structure in the subspace, different feature extraction methods place different requirements on manifold learning. For instance, LPP builds a neighbor graph of the original data in advance and requires that the samples in the subspace maintain the same neighbor relationships, while NPE requires that each sample maintain, after feature extraction, the linear reconstruction relationship with its neighbors from the original space. However, these graph-based methods only take into account the local structure of the data while ignoring the global structure.
To address this issue, Sparsity Preserving Projections (SPP) [ISI:000270261500027], Collaborative Representation based Projections (CRP) [ISI:000344204000002], and Low-rank Preserving Embedding (LRPE) [ISI:000404497300009] have been proposed. SPP constructs an $\ell_1$-norm graph with adaptive neighbors by exploiting sparsity; CRP builds an $\ell_2$-norm graph by ridge-regularized linear reconstruction of each datum from the remaining data; LRPE constructs a nuclear-norm graph with adaptive neighbors via low-rank representation. These graph-based methods first learn an affinity graph via different metrics and then compute the projection from the graph. In particular, the above methods can be unified into a general graph embedding (GE) [ISI:000241988300004] framework, which integrates the local manifold structure into a regression model to learn the projection. However, unsupervised GE only keeps the originally similar samples close in the subspace, and does not consider the dissimilar samples.
Supervised feature extraction methods obtain more discriminant information by using sample labels. For example, Linear Discriminant Analysis (LDA) [ISI:000166933500013] seeks the minimum within-class scatter and the maximum between-class scatter of the samples in the subspace. But LDA, like PCA, is also a linear feature extraction method. In other words, if the samples within a class form several separate clusters (i.e., multiple modes), the performance of LDA may be poor. To overcome these problems, researchers proposed Local Fisher Discriminant Analysis (LFDA) [ISI:000248351700005] and Marginal Fisher Analysis (MFA) [ISI:000241988300004] based on GE, following the manifold-learning ideas used in unsupervised feature extraction. LFDA combines the ideas of LDA and LPP to construct the within-class and between-class scatters in a local manner. This allows LFDA to simultaneously maximize the between-class scatter and preserve the within-class local structure. The difference between MFA and LFDA is that MFA not only considers the local structure within classes, but also constructs the local structure between classes by considering samples on the margins of different classes. However, MFA suffers from a class-isolation problem: not all samples on heterogeneous margins have local neighbor relationships. In view of this shortcoming, researchers proposed Multiple Marginal Fisher Analysis (MMFA) [ISI:000480309400069], which selects the nearest-neighbor samples on all heterogeneous margins when constructing the between-class local relationships. Then, based on SPP, Sparsity Preserving Discriminant Projections (SPDP) [ISI:000369270800001] was proposed to keep the sparse reconstruction coefficients of the samples in the subspace. In fact, a supervised GE framework can be generated by introducing class information into unsupervised GE, and it can also be integrated into the general GE framework.
Recently, self-supervised learning has become a hot topic in deep learning. Self-supervised learning is a form of unsupervised learning: it uses the data to supervise itself by constructing positive and negative pairs. Self-supervised learning aims to learn more discriminative features, and it has been shown to effectively narrow the gap between unsupervised learning and supervised learning. As the main approach to self-supervised learning, contrastive learning has attracted extensive attention from researchers worldwide. Tian et al. proposed Contrastive Multiview Coding (CMC) to process multiview data [2019arXiv190605849T]
. CMC first takes the same sample in any two views as a positive pair and different samples as negative pairs, and then optimizes a neural network framework by minimizing a contrastive loss function that maximizes the similarity of the projected positive pairs. For single-view data, however, we do not have different representations of the same sample. Chen et al. proposed A Simple Framework for Contrastive Learning of Visual Representations (SimCLR) [2020arXiv200205709C] to address this problem. It first performs data augmentation to obtain different representations of the same sample, then takes the different representations of the same sample as positive pairs and the representations of any two different samples as negative pairs. Finally, like CMC, it optimizes the network by minimizing the contrastive loss. From these studies, we can identify two key problems in applying contrastive learning: one is how to construct positive and negative pairs; the other is what kind of tasks to apply it to. Inspired by these, we propose a unified feature extraction framework based on contrastive learning and graph embedding (CL-UFEF), which is suitable for both unsupervised and supervised feature extraction. Data augmentation increases the amount of data and hence the running time of the algorithm. Therefore, in this framework we do not perform data augmentation; instead, we construct a contrastive graph based on GE, which provides a new way to define positive and negative pairs. Then, through a contrastive loss function, we can consider not only similar samples but also dissimilar samples on top of unsupervised GE, so as to narrow the gap with supervised feature extraction, and the framework can also be effectively applied to supervised GE.
To verify the effectiveness of the proposed framework for unsupervised and supervised feature extraction, we improve the unsupervised GE method LPP (with locality preserving), the supervised GE method LDA (without locality preserving), and the supervised GE method LFDA (with locality preserving), yielding CL-LPP, CL-LDA, and CL-LFDA, respectively. Finally, we perform numerical experiments on five real datasets.
The main contributions of this paper are as follows:
This paper proposes a unified feature extraction framework based on contrastive learning from a new perspective, which is suitable for both unsupervised and supervised feature extraction. It can consider not only similar samples but also dissimilar samples on the basis of unsupervised GE, so as to narrow the gap with supervised feature extraction.
Combined with GE, this paper constructs a contrastive graph, which provides a new way to define positive and negative pairs in contrastive learning.
Based on the contrastive-learning feature extraction framework, we give three improved optimization models for LPP, LDA, and LFDA, and experiments on five real datasets demonstrate the advantages of our framework.
The rest of this paper is organized as follows: Section 2 briefly introduces GE and the models of LPP, LDA, and LFDA; Section 3 presents the unified feature extraction framework and the models of CL-LPP, CL-LDA, and CL-LFDA; Section 4 reports extensive experiments conducted on several real-world data sets; finally, Section 5 concludes the paper.
2 Related works
| Notation | Description |
|---|---|
| $X$ | Training sample set |
| $Y$ | Set of training samples in the low-dimensional space |
| $N$ | Number of training samples |
| $D$ | Dimensionality of samples in the original space |
| $d$ | Dimensionality of embedded features |
| $l_i$ | Labels of samples |
| $c$ | Number of classes |
| $W_{ij}$ | Similarity between $x_i$ and $x_j$ |
| $B_{ij}$ | Dissimilarity between $x_i$ and $x_j$ |
| $S_w$ | Within-class scatter matrix |
| $S_b$ | Between-class scatter matrix |
| $P_{ij}$ | Similarity of the positive pair $x_i$ and $x_j$ |
| $Q_{ij}$ | Dissimilarity of the negative pair $x_i$ and $x_j$ |
| $k$ | Number of neighbors |
| $\nabla_A f$ | Gradient of the objective $f$ with respect to the projection matrix $A$ |
| $T$ | Number of iterations |
Assume that a data set contains $N$ samples and each sample has $D$ features, so the data set can be written as $X = \{x_1, x_2, \ldots, x_N\} \subset \mathbb{R}^D$. $l_i$ denotes the class label of $x_i$, where $c$ is the number of classes in the data set, and let $N_j$ denote the number of samples belonging to the $j$th class. The purpose of feature extraction is to construct a low-dimensional embedding space to obtain discriminative features $Y = \{y_1, \ldots, y_N\} \subset \mathbb{R}^d$, where $d$ represents the dimensionality of the embedded features. $y_i$ is calculated as $y_i = A^{\top} x_i$ with projection matrix $A \in \mathbb{R}^{D \times d}$. For convenience, Table 1 summarizes the symbols used in this paper.
2.1 Graph Embedding (GE)
The GE framework integrates manifold embedding into a regression model to learn the projection. Define the intrinsic graph $G = \{X, W\}$ and the penalty graph $G^{p} = \{X, B\}$ as two undirected weighted graphs on the data set $X$, where $W$ and $B$ are the weight matrices of $G$ and $G^{p}$, respectively. $W_{ij}$ indicates the similarity between $x_i$ and $x_j$, and $B_{ij}$ measures the dissimilarity of $x_i$ and $x_j$. The matrices $W$ and $B$ can be formed using various similarity criteria, such as the Gaussian similarity based on Euclidean distance in LPP, the linear reconstruction coefficients in NPE, and the sparse reconstruction coefficients in SPP. Unsupervised GE requires that two original samples with greater similarity be closer in the subspace. Therefore, the GE framework is as follows
where $I$ is an identity matrix, $L$ is the Laplacian matrix of graph $G$, and $C$ is a constraint matrix for scale normalization, which can be the Laplacian matrix of the penalty graph $G^{p}$. $L$ and $C$ can be given as
where $D_{ii} = \sum_{j} W_{ij}$ and $D^{p}_{ii} = \sum_{j} B_{ij}$. It should be noted that $C = I$ in unsupervised GE. In fact, unsupervised GE only constructs an intrinsic graph to account for the similarity between samples and does not consider the dissimilarity between samples, while supervised GE considers both through class information.
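To make the graph construction above concrete, here is a minimal NumPy sketch of the unnormalized Laplacian $L = D - W$ used throughout; the function name is ours:

```python
import numpy as np

def graph_laplacian(W):
    """Unnormalized graph Laplacian L = D - W, where D is the diagonal
    degree matrix with D_ii = sum_j W_ij."""
    D = np.diag(W.sum(axis=1))
    return D - W

# Example: a 3-node path graph. Each row of a Laplacian sums to zero.
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
L = graph_laplacian(W)
```

The same helper applies to both the intrinsic graph (weight matrix $W$) and the penalty graph (weight matrix $B$).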
2.2 Locality Preserving Projection (LPP)
LPP is a commonly used unsupervised feature extraction method. Its goal is to find a projection matrix that preserves the local neighbor structure of the samples as much as possible. The idea of LPP is that samples that are nearest neighbors in the original feature space remain close in the subspace. The optimization problem of LPP can be expressed as
where $W$ is the similarity matrix of the intrinsic graph. $W_{ij}$ is large if $x_i$ and $x_j$ are close, and small if $x_i$ and $x_j$ are far apart. $W_{ij}$ can be defined as follows
where $t$ is the thermal parameter used to adjust the value range of the weight matrix $W$, $N_k(x_j)$ represents the $k$ nearest neighbors of $x_j$, and $k$ is a tuning parameter. The LPP optimization problem is simplified by algebraic expansion, and the constraint $A^{\top} X D X^{\top} A = I$ is added to prevent a trivial solution for $A$. Finally, the optimization problem simplifies as follows
Using Lagrange multipliers, the optimization problem can be transformed into solving the minimum generalized eigenvalue problem
The projection matrix $A$ is composed of the eigenvectors corresponding to the first $d$ minimum non-zero eigenvalues.
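The LPP pipeline described above (heat-kernel kNN graph, then the smallest generalized eigenvectors) can be sketched in Python as follows. This is an illustrative implementation under our own naming and regularization choices, not the paper's code:

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def lpp(X, d, k=5, t=1.0):
    """Sketch of LPP. X: (n_samples, n_features). Builds a heat-kernel
    kNN graph, then solves the generalized eigenproblem
    (X^T L X) a = lambda (X^T D X) a, keeping the d smallest eigenvectors.
    Returns a projection matrix of shape (n_features, d)."""
    n = X.shape[0]
    dist2 = cdist(X, X, 'sqeuclidean')
    W = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(dist2[i])[1:k + 1]        # k nearest neighbors (skip self)
        W[i, idx] = np.exp(-dist2[i, idx] / t)     # heat-kernel weights
    W = np.maximum(W, W.T)                         # symmetrize the graph
    D = np.diag(W.sum(axis=1))
    L = D - W                                      # graph Laplacian
    A = X.T @ L @ X
    B = X.T @ D @ X + 1e-8 * np.eye(X.shape[1])    # small ridge for stability
    vals, vecs = eigh(A, B)                        # ascending eigenvalues
    return vecs[:, :d]
```

New samples are then embedded as `Y = X_new @ P`, which is exactly the out-of-sample property that distinguishes LPP from LE.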
2.3 Linear Discriminant Analysis (LDA)
LDA, proposed by Fisher, is a classical supervised feature extraction method. Its basic idea is to find an optimal projection direction that makes the within-class scatter as small as possible and the between-class scatter as large as possible.
Concretely, in LDA, the within-class scatter matrix $S_w$ and the between-class scatter matrix $S_b$ are defined as
Consequently, the transformation matrix $A$ can be obtained by solving the following optimization problem:
The partial derivative with respect to the projection direction is taken, and the optimal direction is obtained by setting it to zero.
After algebraic simplification, we can get
Letting $\lambda$ denote the Lagrange multiplier, the optimization problem of LDA can be transformed into solving the generalized maximum eigenvalue problem
The projection matrix $A$ is formed by the eigenvectors corresponding to the first $d$ largest non-zero eigenvalues. According to (8), since the rank of $S_b$ is at most $c-1$, LDA can extract at most $c-1$ effective low-dimensional features, which may affect the subsequent classification and recognition performance.
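A minimal sketch of the LDA procedure above, assuming the standard scatter definitions and a small ridge term added to $S_w$ for invertibility (names and regularization are our choices):

```python
import numpy as np
from scipy.linalg import eigh

def lda(X, y, d):
    """Sketch of LDA. X: (n_samples, n_features), y: integer labels.
    Builds S_w and S_b, then solves S_b a = lambda S_w a and keeps the
    eigenvectors of the d largest eigenvalues (d <= c - 1)."""
    D = X.shape[1]
    mean = X.mean(axis=0)
    Sw = np.zeros((D, D))
    Sb = np.zeros((D, D))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)              # within-class scatter
        diff = (mc - mean).reshape(-1, 1)
        Sb += len(Xc) * diff @ diff.T              # between-class scatter
    Sw += 1e-8 * np.eye(D)                         # ridge for invertibility
    vals, vecs = eigh(Sb, Sw)                      # ascending eigenvalues
    return vecs[:, ::-1][:, :d]                    # keep the d largest
```

With $c$ classes, $S_b$ is a sum of $c$ rank-one terms with one linear dependency, which is why at most $c-1$ informative directions exist.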
2.4 Local Fisher Discriminant Analysis (LFDA)
LFDA is a supervised feature extraction method based on GE. It requires that the within-class local neighbor relationships be preserved in the subspace while the between-class scatter is maximized. The optimization problem can be written as follows:
where $N_k(x_j)$ represents the $k$ nearest neighbors of $x_j$ within the $j$th class. Zelnik-Manor and Perona demonstrated that local scaling with $k = 7$ works well on the whole, and LFDA employs the local scaling method with this heuristic. Its solution can still be obtained by solving a generalized eigenvalue problem.
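The local scaling heuristic can be sketched as follows. The function name is ours, and we assume the common form $W_{ij} = \exp(-\|x_i - x_j\|^2 / (\sigma_i \sigma_j))$, where $\sigma_i$ is the distance from $x_i$ to its 7th nearest neighbor:

```python
import numpy as np
from scipy.spatial.distance import cdist

def local_scaling_affinity(X, k=7):
    """Local-scaling similarity of Zelnik-Manor and Perona:
    W_ij = exp(-||x_i - x_j||^2 / (sigma_i * sigma_j)), with sigma_i the
    distance from x_i to its k-th nearest neighbor (k = 7 heuristic)."""
    dist = cdist(X, X)
    # After sorting each row, column 0 is the self-distance (0), so
    # column k is the distance to the k-th nearest neighbor.
    sigma = np.sort(dist, axis=1)[:, k]
    W = np.exp(-dist ** 2 / np.outer(sigma, sigma))
    np.fill_diagonal(W, 0.0)
    return W
```

Because each pair is scaled by both samples' neighborhood radii, the affinity adapts to locally varying density, which is the property LFDA exploits.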
3 The proposed method
In this section, we propose a unified feature extraction framework based on contrastive learning (CL-UFEF) that is suitable for both unsupervised and supervised feature extraction, and give the specific optimization problems of CL-LPP, CL-LDA, and CL-LFDA.
3.1 A Unified Feature Extraction Framework based on Contrastive Learning (CL-UFEF)
In order to apply contrastive learning to unsupervised and supervised feature extraction, we combine the intrinsic graph and the penalty graph to construct a contrastive learning graph (CLG), including a positive graph and a negative graph. In the CLG, we give a new definition of positive and negative pairs, and define a positive matrix $P$ and a negative matrix $Q$ to measure the similarity of the positive pairs and the dissimilarity of the negative pairs.
First, in the CLG, we define $x_i$ and $x_j$ as a positive pair if they are connected in the positive graph, and as a negative pair if they are connected in the negative graph. This means that in the unsupervised CLG we define local nearest neighbors as positive pairs and non-nearest neighbors as negative pairs, while in the supervised CLG we define local nearest neighbors within the same class as positive pairs and all other relationships as negative pairs. Next, we give the specific calculation of $P$ and $Q$.
Depending on the graph embedding method, the similarity can be calculated in many ways, such as by Euclidean distance, linear reconstruction coefficients, or sparse reconstruction coefficients.
Next, we embed the CLG into contrastive learning and use the positive matrix $P$ and the negative matrix $Q$ of the CLG as weight matrices. We require that positive pairs $x_i$ and $x_j$ with large $P_{ij}$ have greater similarity in the subspace, and that negative pairs with large $Q_{ij}$ have greater dissimilarity in the subspace. Therefore, CL-UFEF is proposed, and its optimization problem is as follows:
We use cosine similarity to calculate the similarity of the samples in the subspace, where $\tau$ is a positive parameter and $S$ is the whole similarity matrix.
CL-UFEF constructs the CLG based on the traditional unsupervised and supervised GE frameworks within contrastive learning, which improves GE from a new perspective. From the viewpoint of self-supervised learning, it can narrow the gap between unsupervised GE and supervised GE, and it can be applied to both unsupervised and supervised feature extraction problems. From the optimization problem (22), we can see that the key to CL-UFEF is how to construct the CLG. To show more intuitively how our framework applies to supervised and unsupervised feature extraction, and to verify its effectiveness, we improve the unsupervised graph embedding model LPP (with locality preserving), the supervised graph embedding model LDA (without locality preserving), and the supervised graph embedding model LFDA (with locality preserving).
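To illustrate how a contrastive graph could drive such an objective, here is a hypothetical sketch: given binary positive/negative masks and a projection matrix `A`, it evaluates an InfoNCE-style loss on the cosine similarities of the projected samples. The exact weighting and loss form in CL-UFEF may differ; this only illustrates the mechanism of pulling positive pairs together and pushing negative pairs apart:

```python
import numpy as np

def cosine_sim(Y):
    """Pairwise cosine similarity of row vectors."""
    Yn = Y / (np.linalg.norm(Y, axis=1, keepdims=True) + 1e-12)
    return Yn @ Yn.T

def contrastive_graph_loss(X, A, P_mask, Q_mask, tau=0.5):
    """Illustrative contrastive objective on a contrastive graph.
    P_mask / Q_mask: binary masks marking positive / negative pairs.
    Projects Y = X A, then averages an InfoNCE-style term per sample:
    -log( exp(s_pos/tau) / (exp(s_pos/tau) + sum_neg exp(s_neg/tau)) )."""
    S = np.exp(cosine_sim(X @ A) / tau)
    n = X.shape[0]
    loss = 0.0
    for i in range(n):
        pos = S[i][P_mask[i] > 0]
        neg = S[i][Q_mask[i] > 0].sum()
        if pos.size:
            loss += -np.log(pos / (pos + neg)).mean()
    return loss / n
```

Minimizing such a loss over `A` simultaneously increases the subspace similarity of positive pairs and decreases that of negative pairs, which is the behavior the framework asks for.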
3.2 Locality Preserving Projection based on Contrastive Learning (CL-LPP)
The CLG is constructed according to the graph in LPP, in which the similarity matrix $P$ and the dissimilarity matrix $Q$ are defined as follows:
The optimization problem of CL-LPP is obtained by substituting these $P$ and $Q$ into CL-UFEF.
3.3 Linear Discriminant Analysis based on Contrastive Learning (CL-LDA)
The CLG is constructed according to the label information in LDA, in which the similarity matrix $P$ and the dissimilarity matrix $Q$ are defined as follows:
In order to balance the importance between positive and negative pairs, we use 1 instead of the original weight to define the similarity of positive pairs in $P$. The optimization problem of CL-LDA is obtained by substituting these $P$ and $Q$ into CL-UFEF.
3.4 Local Fisher Discriminant Analysis based on Contrastive Learning (CL-LFDA)
The CLG is constructed according to the graph in LFDA, in which the similarity matrix $P$ and the dissimilarity matrix $Q$ are defined as follows:
In order to balance the importance between positive and negative pairs, we rescale the similarity of positive pairs in $P$. The optimization problem of CL-LFDA is obtained by substituting these $P$ and $Q$ into CL-UFEF.
3.5 Optimization algorithm
We use the Adam optimizer [2014arXiv1412.6980K]
to solve the optimization problems of CL-LPP, CL-LDA, and CL-LFDA. In this section, we take CL-LPP as an example and give the specific optimization algorithm, described in Algorithm 1. The Adam optimizer is an improvement on stochastic gradient descent and can quickly achieve good results. It computes adaptive learning rates for different parameters from estimates of the first and second moments of the gradient. The commonly recommended default parameters are the learning rate $\alpha = 0.001$, the exponential decay rate of the first-moment estimate $\beta_1 = 0.9$, the exponential decay rate of the second-moment estimate $\beta_2 = 0.999$, and the parameter $\epsilon = 10^{-8}$ to prevent division by zero in the implementation; we adopt these as well. The gradient of the loss function with respect to the projection matrix is given by (30).
We set the convergence condition of Algorithm 1 as $|f_{t+1} - f_t| \le \varepsilon$, where $f_t$ and $f_{t+1}$ are the function values obtained after the $t$th and $(t+1)$th gradient descent steps, respectively. The computational complexity of Algorithm 1 is therefore dominated by the gradient computation in the first step, repeated over $T$ iterations, where $T$ is the number of iterations.
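The update rule just described can be sketched as a generic Adam loop for a projection matrix, with the stopping rule based on the change in function value; `f` and `grad` here are placeholders for the model's loss and its gradient, not the paper's exact expressions:

```python
import numpy as np

def adam_minimize(f, grad, A0, lr=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-8, tol=1e-10, max_iter=5000):
    """Generic Adam loop for a matrix variable A, stopped when the
    change in function value falls below tol (relative to |f|)."""
    A = A0.copy()
    m = np.zeros_like(A)   # first-moment estimate
    v = np.zeros_like(A)   # second-moment estimate
    f_old = f(A)
    for t in range(1, max_iter + 1):
        g = grad(A)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)               # bias correction
        v_hat = v / (1 - beta2 ** t)
        A -= lr * m_hat / (np.sqrt(v_hat) + eps)
        f_new = f(A)
        if abs(f_new - f_old) <= tol * max(abs(f_old), 1.0):
            break
        f_old = f_new
    return A
```

For the contrastive models, `f` would be the CL-UFEF objective and `grad` its gradient with respect to the projection matrix $A$, as in (30).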
4 Experimental results
4.1 Data Descriptions and Experimental Setups
Five real datasets are used in our numerical experiments to demonstrate the performance advantages of our proposed CL-UFEF.
mfeat data set: This data set is from the UCI machine learning repository. It consists of six different feature sets extracted from handwritten numerals 0–9, with 200 patterns per class (2,000 patterns and ten classes in total). All patterns have been digitized in binary images. The six feature sets are fou (Fourier coefficients, 76 features), fac (profile correlations, 216 features), kar (Karhunen–Loeve coefficients, 64 features), pix (pixel averages in 2 × 3 windows, 240 features), zer (Zernike moments, 47 features), and mor (morphological features, 6 features).
Yale data set: This data set was created by the Yale University Computer Vision and Control Center. It contains 15 individuals, each with 11 frontal images taken under different lighting conditions. Each image is recropped to a uniform size with 256 gray levels per pixel.
COIL20 data set: The COIL20 data set was created by Columbia University in 1996. It contains 1,440 images of 20 objects, with 72 images per object. Each image is recropped to a uniform size, and each pixel has 256 gray levels.
MNIST data set: It contains 70,000 digital images of size 28 × 28. We randomly selected 2,000 images as experimental data, uniformly rescaled all images to a common size, and used a feature vector of 256-level gray-scale pixel values to represent each image.
USPS data set: It contains 9,298 handwritten digit images, each of size 16 × 16. We randomly selected 1,800 images as experimental data, and each image is represented by a feature vector of 256-level gray-scale pixel values.
In data processing, we first use PCA to pre-reduce the dimensionality of the Yale, COIL20, MNIST, and USPS datasets; the details are shown in Table 2. Then, to improve the convergence speed of the model, we standardize the five datasets separately. Finally, we compare CL-LPP, CL-LDA, and CL-LFDA with the traditional methods LPP, LDA, and LFDA to verify the advantages of the proposed feature extraction framework. The K-nearest-neighbor classifier (K = 1) is used in the experiments. Six samples of each class were randomly selected from Yale, COIL20, and each feature set of mfeat for training, and the remaining data were used for testing. Nine samples of each class were randomly selected from MNIST and USPS for training, and the remaining data were used for testing. All processes are repeated five times, and the final evaluation criteria are the average recognition accuracy and average recall rate over the five repeated runs. All experiments are implemented in Matlab R2018a on a computer with an Intel Core i5-9400 2.90 GHz CPU running the Windows 10 operating system.
| Data sets | # of instances | # of features | # of classes | # of features after PCA |
4.2 Parameters Setting
To evaluate the performance of the different feature extraction methods, some parameters must be set in advance. First, for all compared algorithms, we fix a search range for the number of neighbors $k$. Second, for CL-LPP, CL-LDA, and CL-LFDA, we fix a search range for the parameter $\tau$. Finally, the thermal parameter is set as $t = \sigma_i \sigma_j$, where $\sigma_i$ and $\sigma_j$ are the distances from $x_i$ and $x_j$ to their respective $k$th nearest neighbors.
4.3 Experimental Results Analysis
Experimental results for each feature set of mfeat (maximum mean recognition accuracy ± standard deviation, %) at the optimal dimensions.
In this section, we first report the maximum mean recognition accuracy and maximum mean recall rate (with standard deviations) under the optimal feature dimension on each feature set of mfeat in Table 3 and Table 4, respectively. We also present the average recognition accuracy of all methods under different reduced dimensions on each feature set of mfeat in Figure 1. From the above experimental results, we make the following observations.
Observing the highest recognition accuracy on each feature set of mfeat listed in Table 3 and Table 4, compared with the traditional methods (LPP, LDA, LFDA), only CL-LPP ties with LPP; in all other cases our methods are higher. Therefore, the best performance in each case is always achieved by our proposed contrastive-learning-based methods. At the same time, in Table 3, only the recognition accuracy of CL-LPP on the pix feature set is lower than that of LFDA; its accuracy on the other feature sets is higher than both LDA and LFDA, which also shows that our framework can narrow the gap between unsupervised GE and supervised GE. In Figure 1, our methods achieve the maximum mean recognition accuracy under most reduced dimensions on most feature sets. Although the superiority of the proposed methods is not obvious on the fou and mor feature sets, they dominate on the other feature sets.
Next, we give the maximum mean recognition accuracy and mean recall rate (with standard deviations) under the optimal feature dimension on the Yale, COIL20, MNIST, and USPS data sets in Table 5 and Table 6, respectively. We also present the average recognition accuracy of all methods under different reduced dimensions on the four data sets in Figure 2. From the above experimental results, we make the following observations.
Observing Table 5, in maximum mean recognition accuracy, only on the COIL20 data set is the traditional method LFDA higher than our CL-LFDA; in all other cases our proposed methods have the advantage. Observing Table 6, in maximum mean recall rate, the traditional method LPP is higher than our CL-LPP on the Yale data set, and the traditional method LFDA is higher than our CL-LFDA on the COIL20 data set, while in the remaining cases our proposed methods have the advantage. Therefore, the best performance in each case is mostly achieved by our proposed contrastive-learning-based methods. At the same time, in Table 5, the recognition accuracy of CL-LPP is higher than LDA and LFDA on the MNIST and USPS data sets, and CL-LPP narrows the gap with the traditional supervised GE methods on the Yale and COIL20 data sets. Observing Figure 2, our three proposed models outperform the traditional models in most feature dimensions on the Yale, MNIST, and USPS data sets. On the COIL20 data set, CL-LDA and CL-LFDA show no obvious advantage because the highest classification accuracies of LDA and LFDA are already 1.