I. Introduction
Feature extraction is currently the subject of intense research because of its wide applications. In many applications, such as visual recognition [1], scene analysis [2], [3], object recognition [4], multimodal learning [5], [6], speech recognition [7], image classification [8] and activity recognition [9], feature learning is not only an essential phase but a crucial procedure that simplifies subsequent tasks by obtaining an appropriate feature distribution from the input data.
Many modeling paradigms, such as autoencoders and energy-based models, have been applied to feature learning. The restricted Boltzmann machine (RBM) [10] is a popular energy-based model for unsupervised feature learning that aims to discover appropriate hidden features. The structure of an RBM is a bipartite graph consisting of a binary visible layer and a binary hidden layer; there are no connections among units within the visible layer or within the hidden layer. The most popular learning algorithms for RBMs, such as stochastic maximum likelihood [11] and contrastive divergence (CD) [12], are based on efficient Gibbs sampling. There are a large number of successful applications based on RBMs, e.g., speaker recognition [7], feature fusion [6], clustering [13], natural language understanding [14], classification [15], [16], [17], [18], [19], [20] and speech recognition [21]. Meanwhile, various variants of the RBM have been proposed, e.g., the pairwise-constraints restricted Boltzmann machine with Gaussian visible units (pcGRBM) [22], the classification RBM [23], the fuzzy restricted Boltzmann machine (FRBM) [24], the sparse restricted Boltzmann machine (SRBM) [25] and the spike-and-slab restricted Boltzmann machine (ssRBM) [26]. For real-valued data, the RBM with Gaussian visible units [5], [22], as the canonical energy model, has usually been applied to extract hidden features from image data. Unlike the standard RBM, the visible layer units of this model have Gaussian noise while the hidden layer still consists of binary units. CD learning can also be used to train the RBM with Gaussian visible units [27]. The hidden representations of traditional RBMs do not have explicit instance-level constraints, so Chu et al. presented the semi-supervised pcGRBM, in which pairwise constraints are fused into the reconstructed visible layer
[22]. However, labeled data is lacking in many applications and it is expensive to obtain more labels, so it is worth exploring unsupervised feature learning methods for RBMs that fuse external interventions. Obtaining a suitable feature distribution is a difficult task in machine learning, especially for unsupervised learning. In our previous research [22], we explored semi-supervised feature extraction based on the GRBM. In this paper, we further explore unsupervised feature extraction based on RBMs to constrict and disperse the hidden layer feature distribution for clustering tasks. Self-learning local supervisions from the visible layer are integrated into the contrastive divergence (CD) learning in the hidden layer and the reconstructed hidden layer, yielding a novel encoding framework. The self-learning local supervisions stem from unsupervised learning and a unanimous voting strategy, and they are fused into the hidden layer features and the reconstructed hidden layer features. Under this framework, we propose two instantiation models: the self-learning local supervision GRBM (slsGRBM), with Gaussian linear visible units and binary hidden units for modeling real-valued data, and the self-learning local supervision RBM (slsRBM), with binary visible and hidden units; the two models use different transformations for visible layer reconstruction. The contributions of our work are summarized below.

A novel self-learning local supervision encoding framework is presented, in which self-learning local supervisions from the visible layer are integrated into the CD learning of RBMs to constrict and disperse the distribution of the hidden layer features and reconstructed hidden layer features.

An instantiation model, the self-learning local supervision GRBM (slsGRBM) with Gaussian linear visible units and binary hidden units, is proposed under our encoding framework using a linear transformation for visible layer reconstruction.

An instantiation model, the self-learning local supervision RBM (slsRBM) with binary visible and hidden units, is proposed based on our encoding framework using a sigmoid transformation for visible layer reconstruction.

Using unsupervised clustering, we demonstrate that the self-learning local supervisions have a more positive impact on feature extraction than the traditional models without any external interventions: slsGRBM outperforms the traditional GRBM on the MSRA-MM 2.0 image data, and slsRBM outperforms the traditional RBM on the UCI data sets.
The remainder of the paper is organized as follows. Related work is reviewed in Section II. In Section III, the theoretical background is described. The self-learning local supervision encoding framework and the learning algorithms of the two instantiation models, slsGRBM and slsRBM, are proposed in Section IV. The experimental results are presented in Section V. Finally, our contributions are summarized in Section VI.
II. Related Work
In this section, we review the literature on supervised, semi-supervised and unsupervised feature learning based on RBMs and other models, together with voting strategies in supervised learning.
Supervised feature learning has proved to be an effective method in machine learning [8], [28], [29], [15], [30], [31], [32]. Amer et al. [28] proposed a multimodal discriminative CRBM (MMDCRBM) model based on conditional RBMs (an extension of the RBM); its training process consists of training each modality using labeled data and then training a fusion layer. For multimodal deep learning, Bu et al. [33] developed a supervised 3D feature learning framework in which an RBM is used to mine the deep correlations between different modalities. Cheng et al. [34] presented a novel duplex metric learning (DML) framework for feature learning and image classification. The main task of DML is to learn an effective hidden layer feature of a discriminative stacked autoencoder (DSAE); in the feature space of the DSAE, similar samples are mapped close to each other and dissimilar samples are mapped further apart. This framework is the work most closely related to our study, but it belongs to supervised feature learning, imposing metric learning on a DSAE layer-wise, and it is applied to image classification tasks. However, supervision information, e.g., labels, is scarce and it is expensive to obtain more labels in many applications. So, some works [35], [36], [37], [22] explored semi-supervised feature learning, which needs only a small number of labels. Chu et al. [22] presented the pcGRBM model, which fuses pairwise constraints into the reconstructed visible layer for clustering tasks. To mitigate the burden of annotation, Yesilbek and Sezgin [36] applied self-learning methods to build a system that can learn from large amounts of unlabeled data and few labeled examples for sketch recognition; the system performs self-learning by extending a small labeled set with new examples extracted from unlabeled sketches. Chen et al. [37] developed a deep sparse autoencoder network with supervised fine-tuning and unsupervised layer-wise self-learning for fault identification. As a whole, these works belong to semi-supervised learning methods.
Many unsupervised feature learning approaches based on RBMs have been proposed in previous research [38], [39], [40], [41], [42], [43], [44], [45], [13], [46]. Chopra and Yadav [44] presented a unique technique to extract fault features from noisy acoustic signals with an unsupervised RBM. Zhang et al. [45] proposed unsupervised feature learning based on a recursive autoencoder network (RAE) for image classification, using the spectral and spatial information of the original data to produce high-level features. Chen et al. [13] introduced a new graph-regularized RBM (GraphRBM) to extract hidden layer representations for unsupervised clustering and classification problems, taking the manifold structure of the original data into account. Xie et al. [46] showed a novel approach to optimize RBM pretraining by capturing the principal component directions of the input with principal component analysis. Al-Dmour and Al-Ani [47] proposed a fully automatic segmentation algorithm in which a neural network (NN) model is used to extract features of brain tissue images and is trained using clustering labels produced by three clustering algorithms, whose outputs are combined by majority voting. That study is closely related to our work, but in our encoding framework based on RBMs, the self-learning local supervisions, which stem from unsupervised clustering algorithms and a unanimous voting strategy, are integrated into the CD learning of RBMs to constrict and disperse the distribution of the hidden layer features and reconstructed hidden layer features. Stewart and Ermon [48] presented a new technique to supervise NNs with prior domain knowledge for computer vision tasks. It is related to our study; however, their work targets convolutional neural networks (CNNs), requires large amounts of prior domain knowledge, and encoding prior knowledge into the loss function of a CNN remains a new challenge.
Two existing voting strategies are often used in supervised learning. One is the max-voting scheme: for example, Azimi et al. [49] developed a deep learning method for low-carbon-steel microstructural classification via fully convolutional neural networks (FCNNs) accompanied by a max-voting scheme. The other is the majority voting scheme: for example, Seera et al. [50] applied a recurrent neural network (RNN) to extract features from Transcranial Doppler (TCD) signals for classification tasks, proposing an ensemble RNN model in which the majority voting scheme combines the predictions of single RNNs. Recently, various voting classifiers using majority voting have been proposed to enhance classification performance [51], [52], [53], [54], [55], [56]. In the following, we explore unsupervised feature learning based on RBMs to constrict and disperse the hidden layer feature distribution for clustering tasks.
III. Theoretical Background
III-A. Restricted Boltzmann Machine
An RBM [10] consists of a two-layer structure: a visible layer and a hidden layer of stochastic binary units, connected via symmetric weights. There are no intra-layer connections, either among the visible layer units or among the hidden layer units. The energy function of a joint configuration of the visible and hidden units takes the form:

$E(\mathbf{v},\mathbf{h}) = -\sum_{i} a_i v_i - \sum_{j} b_j h_j - \sum_{i,j} v_i w_{ij} h_j$,  (1)

where $\mathbf{v}$ is the visible layer vector and $v_i$ is the binary state of visible unit $i$, $\mathbf{h}$ is the hidden layer vector and $h_j$ is the binary state of hidden unit $j$, $a_i$ and $b_j$ are the biases of the visible layer and hidden layer respectively, and $w_{ij}$ is the symmetric connection weight between $v_i$ and $h_j$. The conditional probability distributions of the hidden layer and visible layer units of the RBM are given by:

$p(h_j = 1 \mid \mathbf{v}) = \sigma\Big(b_j + \sum_i v_i w_{ij}\Big)$  (2)

and

$p(v_i = 1 \mid \mathbf{h}) = \sigma\Big(a_i + \sum_j h_j w_{ij}\Big)$,  (3)

where $\sigma(x) = 1/(1+e^{-x})$ is the sigmoid function.
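To make Eqs. (2)-(3) concrete, the following is a minimal NumPy sketch of the two conditional distributions and one Gibbs half-step; the function and variable names are ours, not from the original implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, b):
    # Eq. (2): probabilities that each hidden unit is on, given visible v.
    return sigmoid(b + v @ W)

def p_v_given_h(h, W, a):
    # Eq. (3): probabilities that each visible unit is on, given hidden h.
    return sigmoid(a + h @ W.T)

def sample(p):
    # Draw binary states from element-wise Bernoulli probabilities.
    return (rng.random(p.shape) < p).astype(float)

# Toy configuration: 4 visible units, 3 hidden units.
W = rng.normal(0.0, 0.01, size=(4, 3))  # symmetric weights w_ij
a = np.zeros(4)                          # visible biases a_i
b = np.zeros(3)                          # hidden biases b_j
v = np.array([1.0, 0.0, 1.0, 1.0])
h = sample(p_h_given_v(v, W, b))         # one Gibbs half-step
```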
III-B. Linear Visible Units
The classical RBM was designed with binary units for both the hidden and visible layers [12]. For training on real-valued data, the visible layer instead consists of Gaussian linear units while the hidden layer still consists of binary units. The energy function of the RBM with Gaussian linear visible units takes the form:

$E(\mathbf{v},\mathbf{h}) = \sum_i \frac{(v_i - a_i)^2}{2\sigma_i^2} - \sum_j b_j h_j - \sum_{i,j} \frac{v_i}{\sigma_i} w_{ij} h_j$,  (4)

where $\sigma_i$ is the standard deviation of visible unit $v_i$ with Gaussian noise. In the visible layer, the conditional probability is defined by:

$p(v_i = v \mid \mathbf{h}) = \mathcal{N}\Big(v \,\Big|\, a_i + \sigma_i \sum_j h_j w_{ij},\; \sigma_i^2\Big)$,  (5)

where $\mathcal{N}(\cdot \mid \mu, \sigma^2)$ denotes the Gaussian density with mean $\mu$ and variance $\sigma^2$. The update rules of the parameters become simple when the linear visible units have unit-variance Gaussian noise. This type of noise-free visible unit is used for one of the proposed models (slsGRBM) based on our novel framework described later. The reconstructed values of the Gaussian linear visible units are then equal to their top-down input from the binary hidden units plus their biases.
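For the noise-free case ($\sigma_i = 1$) described above, the reconstruction reduces to a linear top-down pass; a one-line sketch, reusing the notation of the previous snippet (the function name is our own):

```python
def reconstruct_gaussian_visible(h, W, a):
    # Mean of Eq. (5) with sigma_i = 1: top-down input plus visible bias.
    return a + h @ W.T
```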
III-C. Contrastive Divergence Learning
To learn the symmetric connection weights by maximum-likelihood (ML) learning, the update rule is given by:

$\Delta w_{ij} = \epsilon\big(\langle v_i h_j\rangle_{data} - \langle v_i h_j\rangle_{model}\big)$,  (6)

where $\epsilon$ is the learning rate. However, it is very slow to obtain an unbiased sample of $\langle v_i h_j\rangle_{model}$. So, a faster learning algorithm [12] was proposed that approximates the gradient by contrastive divergence. The change of the symmetric connection weight with one-step CD is given by:

$\Delta w_{ij} = \epsilon\big(\langle v_i h_j\rangle_{data} - \langle v_i h_j\rangle_{recon}\big)$,  (7)

where the hidden layer units are driven by the visible data, $\langle v_i h_j\rangle_{data}$ is the fraction of times that visible layer unit $v_i$ and hidden layer unit $h_j$ are on together, and $\langle v_i h_j\rangle_{recon}$ denotes the corresponding statistic for the reconstructions. Similarly, the changes of the biases $a_i$ and $b_j$ with one-step CD are given by:

$\Delta a_i = \epsilon\big(\langle v_i\rangle_{data} - \langle v_i\rangle_{recon}\big)$  (8)

and

$\Delta b_j = \epsilon\big(\langle h_j\rangle_{data} - \langle h_j\rangle_{recon}\big)$.  (9)

So, the update rules of all parameters take the form

$w_{ij} \leftarrow w_{ij} + \Delta w_{ij}$,  (10)

$a_i \leftarrow a_i + \Delta a_i$  (11)

and

$b_j \leftarrow b_j + \Delta b_j$.  (12)

CD learning clearly improves learning efficiency.
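The following sketch implements one CD-1 update for a binary RBM according to Eqs. (7)-(12), reusing the helpers from the earlier snippet; the hyperparameter value is an illustrative assumption.

```python
def cd1_step(v0, W, a, b, eps=0.05):
    # One-step contrastive divergence on a mini-batch v0 of shape (n, d).
    h0 = p_h_given_v(v0, W, b)      # hidden probabilities driven by the data
    h0_s = sample(h0)               # stochastic binary hidden states
    v1 = p_v_given_h(h0_s, W, a)    # reconstruction probabilities
    h1 = p_h_given_v(v1, W, b)      # hidden probabilities for the reconstruction
    n = v0.shape[0]
    W += eps * (v0.T @ h0 - v1.T @ h1) / n  # Eqs. (7) and (10)
    a += eps * (v0 - v1).mean(axis=0)       # Eqs. (8) and (11)
    b += eps * (h0 - h1).mean(axis=0)       # Eqs. (9) and (12)
    return W, a, b
```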
IV. Self-Learning Local Supervision Encoding Framework and Learning Algorithms
In this section, we present a novel self-learning local supervision encoding framework based on RBMs, in which self-learning local supervisions from the visible layer are integrated into the CD learning of RBMs to constrict and disperse the distribution of the hidden layer features. In the framework, we use the sigmoid transformation to obtain the hidden layer and reconstructed hidden layer features from the visible layer and reconstructed visible layer units during the sampling procedure. The self-learning local supervisions contain local credible clusters, which stem from different unsupervised learning algorithms combined by a unanimous voting strategy (see the sketch below). For the same local cluster, the hidden features of the input and reconstructed data tend to constrict together; furthermore, the centers of different local clusters in the hidden layer tend to disperse during the encoding process. The structure of the framework is shown in Fig. 1.
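As a sketch of how local credible clusters could be extracted by unanimous voting, the snippet below combines two off-the-shelf clustering algorithms (our illustrative choice; the experiments later use algorithms such as DP, K-means and AP) and keeps only the samples on which the aligned labelings agree.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import AgglomerativeClustering, KMeans

def unanimous_local_clusters(X, n_clusters):
    # Two independent unsupervised labelings of the visible data X.
    la = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    lb = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X)
    # Align the second labeling to the first by maximizing overlap (Hungarian).
    overlap = np.zeros((n_clusters, n_clusters), dtype=int)
    for i, j in zip(la, lb):
        overlap[i, j] += 1
    rows, cols = linear_sum_assignment(-overlap)
    mapping = dict(zip(cols, rows))
    lb_aligned = np.array([mapping[j] for j in lb])
    # Unanimous voting: keep only samples where both labelings agree.
    agree = la == lb_aligned
    return [np.where(agree & (la == k))[0] for k in range(n_clusters)]
```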
IV-A. The Framework
Notation  Definition
$X$  Visible layer data set
$H$  Hidden layer feature set
$\tilde{X}$  Reconstructed visible layer set
$\tilde{H}$  Hidden layer feature set of the reconstructed visible layer
$x_n$  Visible layer row vector
$h_n$  Hidden layer feature row vector
$\tilde{x}_n$  Reconstructed visible layer row vector
$\tilde{h}_n$  Hidden layer feature row vector of the reconstructed data
$C_k$  Local cluster of $X$: all vectors of $C_k$ belong to the same cluster
$C_k^H$  Local cluster of $H$: all vectors of $C_k^H$ belong to the same cluster
$\tilde{C}_k$  Local cluster of $\tilde{X}$: all vectors of $\tilde{C}_k$ belong to the same cluster
$\tilde{C}_k^H$  Local cluster of $\tilde{H}$: all vectors of $\tilde{C}_k^H$ belong to the same cluster
$m_k$  The center of cluster $C_k$
$\mu_k$  The center of cluster $C_k^H$
$\tilde{m}_k$  The center of cluster $\tilde{C}_k$
$\tilde{\mu}_k$  The center of cluster $\tilde{C}_k^H$
Suppose that $X=\{x_1,\dots,x_N\}$ is the original data set, $H=\{h_1,\dots,h_N\}$ is the hidden layer feature set, $\tilde{X}=\{\tilde{x}_1,\dots,\tilde{x}_N\}$ is the reconstructed visible layer data set, and $\tilde{H}=\{\tilde{h}_1,\dots,\tilde{h}_N\}$ is the hidden feature set of the reconstructed data. Let $C_1,\dots,C_K$ be the local clusters of the visible layer set $X$ and $C_1^H,\dots,C_K^H$ be the local clusters they map to in $H$, respectively. Similarly, $\tilde{C}_1,\dots,\tilde{C}_K$ are the local clusters of the reconstructed visible layer set $\tilde{X}$ and $\tilde{C}_1^H,\dots,\tilde{C}_K^H$ are the local clusters they map to in $\tilde{H}$, respectively.
We use the gradient descent method to obtain approximately optimal parameters of the framework. It is expensive to compute the gradient of the log probability of RBMs; however, Karakida et al. [27] demonstrated that CD learning is simpler than ML learning in RBMs. Therefore, we apply the CD learning method to obtain an approximation of the log probability gradient of RBMs.
Then the objective function takes the form:

$\max_{\theta}\; f(\theta) = \mathcal{L}(\theta) - \lambda\Big[\sum_{k=1}^{K}\frac{1}{|C_k|}\sum_{h_n\in C_k^H}\|h_n-\mu_k\|_2^2 + \sum_{k=1}^{K}\frac{1}{|\tilde{C}_k|}\sum_{\tilde{h}_n\in \tilde{C}_k^H}\|\tilde{h}_n-\tilde{\mu}_k\|_2^2 - \frac{1}{M}\sum_{k<l}\big(\|\mu_k-\mu_l\|_2^2 + \|\tilde{\mu}_k-\tilde{\mu}_l\|_2^2\big)\Big]$,  (13)

where $\theta=\{W,a,b\}$ are the model parameters, $\lambda$ is a scale coefficient, $|C_k|$ is the cardinality of $C_k$, $M$ is the number of pairwise cluster centers, $\mathcal{L}$ is the average of the log-likelihood and $\|\cdot\|_2^2$ is the square of the 2-norm. For the same local cluster, the hidden layer features and the reconstructed hidden layer features tend to constrict together in the training procedure. Meanwhile, the centers of different local clusters tend to disperse in the hidden layer and the reconstructed hidden layer.
Let

$F_1 = \sum_{k=1}^{K}\frac{1}{|C_k|}\sum_{h_n\in C_k^H}\|h_n-\mu_k\|_2^2 + \sum_{k=1}^{K}\frac{1}{|\tilde{C}_k|}\sum_{\tilde{h}_n\in \tilde{C}_k^H}\|\tilde{h}_n-\tilde{\mu}_k\|_2^2$  (14)

and

$F_2 = \frac{1}{M}\sum_{k<l}\big(\|\mu_k-\mu_l\|_2^2 + \|\tilde{\mu}_k-\tilde{\mu}_l\|_2^2\big)$.  (15)

Then the objective function is as follows:

$\max_{\theta}\; f(\theta) = \mathcal{L}(\theta) - \lambda(F_1 - F_2)$.  (16)
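A small sketch of how the constriction term $F_1$ and dispersion term $F_2$ of Eqs. (14)-(15) could be computed for one feature set (call it once for $H$ and once for $\tilde{H}$ and add the results); the helper name and cluster representation are our assumptions.

```python
def constriction_dispersion(features, clusters):
    # features: array of hidden feature row vectors, shape (N, m).
    # clusters: list of index arrays, one per local credible cluster (K >= 2).
    centers = [features[idx].mean(axis=0) for idx in clusters]
    # Constriction: mean squared distance to the cluster center, summed over clusters.
    f1 = sum(((features[idx] - mu) ** 2).sum(axis=1).mean()
             for idx, mu in zip(clusters, centers))
    # Dispersion: average squared distance over the M pairs of cluster centers.
    K = len(centers)
    pairs = [(k, l) for k in range(K) for l in range(k + 1, K)]
    f2 = sum(((centers[k] - centers[l]) ** 2).sum() for k, l in pairs) / len(pairs)
    return f1, f2
```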
The next problems are how to get the gradients of $F_1$ and $F_2$. Firstly, we compute the gradient of $F_1$ with respect to $w_{ij}$. Because the squared 2-norm has another equivalent form:

$\|h_n-\mu_k\|_2^2 = (h_n-\mu_k)(h_n-\mu_k)^T$,  (17)

we can obtain:

$\frac{\partial F_1}{\partial w_{ij}} = \sum_{k=1}^{K}\frac{2}{|C_k|}\sum_{h_n\in C_k^H}(h_n-\mu_k)\Big(\frac{\partial h_n}{\partial w_{ij}}-\frac{\partial \mu_k}{\partial w_{ij}}\Big)^T + \sum_{k=1}^{K}\frac{2}{|\tilde{C}_k|}\sum_{\tilde{h}_n\in \tilde{C}_k^H}(\tilde{h}_n-\tilde{\mu}_k)\Big(\frac{\partial \tilde{h}_n}{\partial w_{ij}}-\frac{\partial \tilde{\mu}_k}{\partial w_{ij}}\Big)^T$.  (18)
From the above result, we can see that the following task is to compute $\partial h_n/\partial w_{ij}$, $\partial \tilde{h}_n/\partial w_{ij}$, $\partial \mu_k/\partial w_{ij}$ and $\partial \tilde{\mu}_k/\partial w_{ij}$. Next, each of them is solved separately.

$\frac{\partial h_n}{\partial w_{ij}} = \Big(\frac{\partial h_{n1}}{\partial w_{ij}}, \frac{\partial h_{n2}}{\partial w_{ij}}, \dots, \frac{\partial h_{nm}}{\partial w_{ij}}\Big)$  (19)

Obviously, $\partial h_n/\partial w_{ij}$ is a row vector, all components of which are independent of $w_{ij}$ except the $j$-th component. So,

$\frac{\partial h_n}{\partial w_{ij}} = \Big(0,\dots,0,\frac{\partial h_{nj}}{\partial w_{ij}},0,\dots,0\Big)$.  (20)

Because $h_{nj}=\sigma\big(b_j+\sum_i x_{ni}w_{ij}\big)$, the final result of $\partial h_{nj}/\partial w_{ij}$ has the following expression:

$\frac{\partial h_{nj}}{\partial w_{ij}} = h_{nj}(1-h_{nj})x_{ni}$.  (21)

Similarly, the final result of $\partial \tilde{h}_{nj}/\partial w_{ij}$ is as follows:

$\frac{\partial \tilde{h}_{nj}}{\partial w_{ij}} = \tilde{h}_{nj}(1-\tilde{h}_{nj})\tilde{x}_{ni}$.  (22)

Then, collecting the components, the final result of $\partial h_n/\partial w_{ij}$ is a vector with a single nonzero component:

$\frac{\partial h_n}{\partial w_{ij}} = \big(0,\dots,0,h_{nj}(1-h_{nj})x_{ni},0,\dots,0\big)$.  (23)

Similarly,

$\frac{\partial \tilde{h}_n}{\partial w_{ij}} = \big(0,\dots,0,\tilde{h}_{nj}(1-\tilde{h}_{nj})\tilde{x}_{ni},0,\dots,0\big)$,  (24)

$\frac{\partial \mu_k}{\partial w_{ij}} = \frac{1}{|C_k|}\sum_{h_n\in C_k^H}\frac{\partial h_n}{\partial w_{ij}}$  (25)

and

$\frac{\partial \tilde{\mu}_k}{\partial w_{ij}} = \frac{1}{|\tilde{C}_k|}\sum_{\tilde{h}_n\in \tilde{C}_k^H}\frac{\partial \tilde{h}_n}{\partial w_{ij}}$.  (26)
Eq. (23), Eq. (24), Eq. (25) and Eq. (26) are substituted into Eq. (18). Then we have

$\frac{\partial F_1}{\partial w_{ij}} = \sum_{k=1}^{K}\frac{2}{|C_k|}\sum_{h_n\in C_k^H}(h_{nj}-\mu_{kj})\Big(h_{nj}(1-h_{nj})x_{ni}-\frac{1}{|C_k|}\sum_{h_s\in C_k^H}h_{sj}(1-h_{sj})x_{si}\Big) + \sum_{k=1}^{K}\frac{2}{|\tilde{C}_k|}\sum_{\tilde{h}_n\in \tilde{C}_k^H}(\tilde{h}_{nj}-\tilde{\mu}_{kj})\Big(\tilde{h}_{nj}(1-\tilde{h}_{nj})\tilde{x}_{ni}-\frac{1}{|\tilde{C}_k|}\sum_{\tilde{h}_s\in \tilde{C}_k^H}\tilde{h}_{sj}(1-\tilde{h}_{sj})\tilde{x}_{si}\Big)$.  (27)

Using the same solution, we can obtain:

$\frac{\partial F_2}{\partial w_{ij}} = \frac{2}{M}\sum_{k<l}\Big[(\mu_k-\mu_l)\Big(\frac{\partial \mu_k}{\partial w_{ij}}-\frac{\partial \mu_l}{\partial w_{ij}}\Big)^T + (\tilde{\mu}_k-\tilde{\mu}_l)\Big(\frac{\partial \tilde{\mu}_k}{\partial w_{ij}}-\frac{\partial \tilde{\mu}_l}{\partial w_{ij}}\Big)^T\Big]$.  (28)
The following task is to obtain the gradients with respect to the hidden bias $b_j$. Because $h_{nj}=\sigma\big(b_j+\sum_i x_{ni}w_{ij}\big)$ and $\tilde{h}_{nj}=\sigma\big(b_j+\sum_i \tilde{x}_{ni}w_{ij}\big)$, the final results of $\partial h_{nj}/\partial b_j$ and $\partial \tilde{h}_{nj}/\partial b_j$ have the following expressions:

$\frac{\partial h_{nj}}{\partial b_j} = h_{nj}(1-h_{nj})$  (29)

and

$\frac{\partial \tilde{h}_{nj}}{\partial b_j} = \tilde{h}_{nj}(1-\tilde{h}_{nj})$.  (30)

So, the final result of $\partial F_1/\partial b_j$ is as follows:

$\frac{\partial F_1}{\partial b_j} = \sum_{k=1}^{K}\frac{2}{|C_k|}\sum_{h_n\in C_k^H}(h_{nj}-\mu_{kj})\Big(h_{nj}(1-h_{nj})-\frac{1}{|C_k|}\sum_{h_s\in C_k^H}h_{sj}(1-h_{sj})\Big) + \sum_{k=1}^{K}\frac{2}{|\tilde{C}_k|}\sum_{\tilde{h}_n\in \tilde{C}_k^H}(\tilde{h}_{nj}-\tilde{\mu}_{kj})\Big(\tilde{h}_{nj}(1-\tilde{h}_{nj})-\frac{1}{|\tilde{C}_k|}\sum_{\tilde{h}_s\in \tilde{C}_k^H}\tilde{h}_{sj}(1-\tilde{h}_{sj})\Big)$.  (31)

Similarly, the expression of $\partial F_2/\partial b_j$ is as follows:

$\frac{\partial F_2}{\partial b_j} = \frac{2}{M}\sum_{k<l}\Big[(\mu_{kj}-\mu_{lj})\Big(\frac{\partial \mu_{kj}}{\partial b_j}-\frac{\partial \mu_{lj}}{\partial b_j}\Big) + (\tilde{\mu}_{kj}-\tilde{\mu}_{lj})\Big(\frac{\partial \tilde{\mu}_{kj}}{\partial b_j}-\frac{\partial \tilde{\mu}_{lj}}{\partial b_j}\Big)\Big]$.  (32)
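The derivative in Eq. (29) is the standard sigmoid derivative; a quick finite-difference check, reusing p_h_given_v from the earlier sketch, can confirm it numerically (function name ours):

```python
def check_hidden_bias_gradient(x_n, W, b, j=0, eps=1e-6):
    # Compare Eq. (29) against a finite-difference approximation.
    h = p_h_given_v(x_n, W, b)
    b_pert = b.copy()
    b_pert[j] += eps
    numeric = (p_h_given_v(x_n, W, b_pert)[j] - h[j]) / eps
    analytic = h[j] * (1.0 - h[j])
    assert abs(numeric - analytic) < 1e-4
```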
Because $h_{nj}=\sigma\big(b_j+\sum_i x_{ni}w_{ij}\big)$ and $\tilde{h}_{nj}=\sigma\big(b_j+\sum_i \tilde{x}_{ni}w_{ij}\big)$ do not depend on the visible biases $a_i$ (the reconstruction $\tilde{x}_n$ is treated as fixed sampled data in CD learning), the final results are $\partial F_1/\partial a_i = 0$ and $\partial F_2/\partial a_i = 0$. Then we can obtain: $\partial(F_1-F_2)/\partial a_i = 0$.
Finally, following the CD learning and the gradients of $F_1$ and $F_2$, the update rule of the symmetric connection weights takes the form

$w_{ij} \leftarrow w_{ij} + \epsilon\big(\langle v_i h_j\rangle_{data} - \langle v_i h_j\rangle_{recon}\big) - \lambda\Big(\frac{\partial F_1}{\partial w_{ij}} - \frac{\partial F_2}{\partial w_{ij}}\Big)$.  (33)

The update rule of the hidden biases takes the form

$b_j \leftarrow b_j + \epsilon\big(\langle h_j\rangle_{data} - \langle h_j\rangle_{recon}\big) - \lambda\Big(\frac{\partial F_1}{\partial b_j} - \frac{\partial F_2}{\partial b_j}\Big)$,  (34)

and the update rule of the visible biases takes the form

$a_i \leftarrow a_i + \epsilon\big(\langle v_i\rangle_{data} - \langle v_i\rangle_{recon}\big)$.  (35)
Under this framework, we present two instantiation models with two different visible layer reconstructions. One is the self-learning local supervision GRBM (slsGRBM) model with Gaussian linear visible units and binary hidden units, which uses a linear transformation for visible layer reconstruction. The other is the self-learning local supervision RBM (slsRBM) model with binary visible and hidden units, which uses a sigmoid transformation for visible layer reconstruction.
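The distinction between the two instantiations amounts to the transformation used when reconstructing the visible layer; a minimal sketch (function name ours), reusing the earlier helpers:

```python
def reconstruct_visible(h, W, a, model="slsRBM"):
    top_down = a + h @ W.T
    if model == "slsGRBM":
        return top_down           # linear: noise-free Gaussian visible units
    return sigmoid(top_down)      # sigmoid: binary visible unit probabilities
```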
V. Experimental Results
The MSRA-MM 2.0 image data sets used in our experiments are summarized below.
No.  Dataset  Classes  Instances  Features
1  Book (BO)  3  896  892
2  Water (WA)  3  922  899
3  Weddingring (WR)  3  897  899
4  Birthdaycake (BC)  3  932  892
5  Vegetable (VE)  3  872  899
6  Ambulances (AM)  3  930  892
7  Vista (VI)  3  799  899
8  Wallpaper (WP)  3  919  899
9  Voituretuning (VT)  3  879  899
The UCI data sets used in our experiments are summarized below.
No.  Dataset  Classes  Instances  Features
1  Haberman's Survival (HS)  2  306  3
2  QSAR biodegradation (QB)  2  1055  41
3  SPECT Heart (SH)  2  267  22
4  Simulation Crashes (SC)  2  540  18
5  Breast Cancer Wisconsin (BCW)  2  569  32
6  Iris (IR)  3  150  4
Clustering results on the MSRA-MM 2.0 image data sets (mean ± standard deviation):
Dataset (No.)  DP  K-means  AP  DP+GRBM  K-means+GRBM  AP+GRBM  DP+slsGRBM  K-means+slsGRBM  AP+slsGRBM
BO (1)  0.4275±0.00000  0.4007±0.00068  0.4230±0.00000  0.4219±0.00014  0.3527±0.00012  0.4275±0.00001  0.4743±0.00340  0.4275±0.00009  0.4319±0.00033
WA (2)  0.4544±0.00000  0.4176±0.00007  0.3905±0.00000  0.4360±0.00046  0.4273±0.00001  0.4024±0.00017  0.4837±0.00290  0.4826±0.00180  0.4826±0.00240
WR (3)  0.4147±0.00000  0.4058±0.00005  0.4048±0.00000  0.5162±0.00009  0.4047±0.00120  0.4158±0.00000  0.5326±0.00076  0.5017±0.00210  0.4872±0.00002
BC (4)  0.4453±0.00000  0.4979±0.00000  0.4753±0.00000  0.4742±0.00055  0.4796±0.00055  0.4882±0.00075  0.5472±0.00140  0.5461±0.00083  0.5054±0.00330
VE (5)  0.5011±0.00000  0.4041±0.00110  0.4243±0.00000  0.4874±0.00015  0.4266±0.00009  0.4232±0.00000  0.5057±0.00230  0.5034±0.00880  0.4977±0.00440
AM (6)  0.5667±0.00000  0.3935±0.00650  0.3968±0.00000  0.5548±0.00110  0.4968±0.00790  0.3581±0.00003  0.5699±0.00008  0.5570±0.00045  0.5570±0.00033
VI (7)  0.5232±0.00000  0.4731±0.00001  0.4318±0.00000  0.4493±0.00025  0.4581±0.00023  0.4631±0.00038  0.5782±0.01220  0.5294±0.00510  0.5457±0.00340
WP (8)  0.5016±0.00000  0.4266±0.00029  0.4342±0.00000  0.4723±0.00019  0.4211±0.00001  0.4690±0.00011  0.5365±0.00720  0.5626±0.00121  0.5647±0.01460
VT (9)  0.4664±0.00000  0.3788±0.00130  0.4027±0.00000  0.4676±0.00017  0.3697±0.00170  0.4232±0.00021  0.5165±0.00000  0.6189±0.0000  0.6223±0.00002
Average  0.4779  0.4217  0.4207  0.4755  0.4263  0.4300  0.5276  0.5255  0.5216
Dataset (No.)  DP  K-means  AP  DP+GRBM  K-means+GRBM  AP+GRBM  DP+slsGRBM  K-means+slsGRBM  AP+slsGRBM
BO (1)  0.8778  0.8559  0.8731  0.8707  0.8785  0.8731  0.9014  0.8875  0.8945 
WA (2)  0.8376  0.8175  0.8230  0.8427  0.8167  0.8282  0.8645  0.8660  0.8660 
WR (3)  0.8089  0.8068  0.8028  0.8069  0.8056  0.8037  0.8297  0.8240  0.8298 
BC (4)  0.8218  0.7325  0.7694  0.8344  0.7413  0.7667  0.8560  0.8086  0.8191 
VE (5)  0.8339  0.8290  0.8327  0.8333  0.8317  0.8319  0.8591  0.8576  0.8589 
AM (6)  0.7625  0.7571  0.7525  0.7626  0.7425  0.7635  0.7908  0.7815  0.7815 
VI (7)  0.8490  0.8489  0.8493  0.8486  0.8482  0.8492  0.8780  0.8772  0.8778 
WP (8)  0.7811  0.7709  0.7829  0.7811  0.7731  0.7687  0.8131  0.8181  0.8155 
VT (9)  0.9179  0.9194  0.9201  0.9171  0.9196  0.9173  0.9495  0.9506  0.9510 
Average  0.8323  0.8154  0.8229  0.8330  0.8175  0.8223  0.8603  0.8523  0.8549 
Dataset (No.)  DP  K-means  AP  DP+GRBM  K-means+GRBM  AP+GRBM  DP+slsGRBM  K-means+slsGRBM  AP+slsGRBM
BO (1)  0.4471  0.3838  0.3999  0.4170  0.3767  0.4078  0.5110  0.4212  0.3992 
WA (2)  0.4731  0.3907  0.4001  0.4660  0.3932  0.4011  0.4907  0.4781  0.4781 
WR (3)  0.4093  0.4058  0.4104  0.4841  0.4053  0.4086  0.5281  0.4765  0.4676 
BC (4)  0.4803  0.4632  0.4288  0.5140  0.4537  0.4342  0.5215  0.5199  0.4783 
VE (5)  0.5044  0.4042  0.4149  0.4613  0.4052  0.4147  0.5117  0.4968  0.5046 
AM (6)  0.5887  0.4341  0.4271  0.5719  0.4771  0.4074  0.5508  0.5151  0.5151 
VI (7)  0.4963  0.4418  0.4357  0.5097  0.4422  0.4394  0.5600  0.5363  0.5552 
WP (8)  0.5718  0.4148  0.4154  0.5027  0.4078  0.4362  0.5336  0.6782  0.6743 
VT (9)  0.4644  0.4054  0.4212  0.4751  0.4041  0.4523  0.4964  0.6535  0.6557 
Average  0.4928  0.4160  0.4170  0.4891  0.4184  0.4224  0.5227  0.5306  0.5253 
Clustering results on the UCI data sets (mean ± standard deviation):
Dataset (No.)  DP  K-means  AP  DP+RBM  K-means+RBM  AP+RBM  DP+slsRBM  K-means+slsRBM  AP+slsRBM
HS (1)  0.5719±0.00000  0.5163±0.00013  0.5169±0.00001  0.5229±0.02400  0.5686±0.00140  0.5588±0.01620  0.6174±0.00480  0.6144±0.00041  0.5980±0.00390
QB (2)  0.5592±0.00000  0.5886±0.00000  0.5640±0.00000  0.6142±0.00003  0.5782±0.00087  0.5678±0.00095  0.6218±0.00000  0.6028±0.00016  0.6104±0.00003
SH (3)  0.6180±0.00000  0.5356±0.00000  0.5543±0.00000  0.5506±0.00810  0.5318±0.00018  0.5243±0.00011  0.7715±0.00200  0.5730±0.00018  0.5730±0.00250
SC (4)  0.6259±0.00000  0.5315±0.00011  0.5315±0.00000  0.8056±0.04390  0.5556±0.00014  0.5481±0.00200  0.8111±0.00210  0.5741±0.00001  0.5963±0.00004
BCW (5)  0.7909±0.00000  0.8541±0.00000  0.8541±0.00000  0.6362±0.00800  0.6309±0.00310  0.6309±0.00420  0.8524±0.02450  0.8682±0.00026  0.8664±0.00022
IR (6)  0.9067±0.00000