Self-learning Local Supervision Encoding Framework to Constrict and Disperse Feature Distribution for Clustering

12/05/2018 ∙ by Jielei Chu, et al. ∙ Sichuan University

Obtaining a suitable feature distribution is a difficult task in machine learning, especially for unsupervised learning. In this paper, we propose a novel self-learning local supervision encoding framework based on RBMs, in which self-learning local supervisions from the visible layer are integrated into the contrastive divergence (CD) learning of RBMs to constrict and disperse the distribution of the hidden layer features for clustering tasks. In the framework, we use a sigmoid transformation to obtain the hidden layer features and the reconstructed hidden layer features from the visible layer and the reconstructed visible layer units during the sampling procedure. The self-learning local supervisions contain local credible clusters, which stem from different unsupervised learning algorithms and a unanimous voting strategy. They are fused into the hidden layer features and the reconstructed hidden layer features. For the same local cluster, the hidden layer features and the reconstructed hidden layer features of the framework tend to constrict together. Furthermore, the hidden layer features of different local clusters tend to disperse in the encoding process. Under this framework, we present two instantiation models with two different visible layer reconstructions. One is the self-learning local supervision GRBM (slsGRBM) model with Gaussian linear visible units and binary hidden units, which uses a linear transformation for visible layer reconstruction. The other is the self-learning local supervision RBM (slsRBM) model with binary visible and hidden units, which uses a sigmoid transformation for visible layer reconstruction.


I Introduction

Feature extraction is currently the subject of intense research because of its wide applications. In many applications, such as visual recognition[1], scene analysis[2], [3], object recognition[4], multimodal learning[5], [6], speech recognition[7], image classification[8] and activity recognition[9], feature learning is not only an essential phase but also a crucial procedure that simplifies subsequent tasks by obtaining an appropriate feature distribution from the input data.

Many modeling paradigms, such as autoencoders and energy-based models, have been applied to feature learning. The restricted Boltzmann machine (RBM)[10] is a popular energy-based model for unsupervised feature learning and aims to explore appropriate hidden features. The structure of an RBM is a bipartite graph consisting of a binary visible layer and a binary hidden layer, with no connections within the visible layer or within the hidden layer. The most popular learning algorithms for RBMs, such as stochastic maximum likelihood[11] and contrastive divergence (CD)[12], are based on efficient Gibbs sampling. There are a large number of successful applications based on RBMs, e.g., speaker recognition[7], feature fusion[6], clustering[13], natural language understanding[14], classification[15], [16], [17], [18], [19], computer vision[20] and speech recognition[21]. Meanwhile, various variants of RBMs have been proposed, e.g., the pairwise constraints restricted Boltzmann machine with Gaussian visible units (pcGRBM)[22], the classification RBM[23], the fuzzy restricted Boltzmann machine (FRBM)[24], the sparse restricted Boltzmann machine (SRBM)[25] and the spike-and-slab restricted Boltzmann machine (ssRBM)[26]. For real-valued data, the RBM with Gaussian visible units[5], [22] is the canonical energy model and has usually been applied to extract hidden features from image data. Unlike the standard RBM, the visible layer units of this model have Gaussian noise, while the hidden layer still consists of binary units. CD learning can also be used to train the RBM with Gaussian visible units[27]. The hidden representations of traditional RBMs do not have explicit instance-level constraints, so Chu et al. presented the semi-supervised pcGRBM, in which pairwise constraints are fused into the reconstructed visible layer[22]. However, labeled data is scarce in many applications and it is expensive to obtain more labels. So, it is worth exploring unsupervised feature learning methods for RBMs that fuse external interventions.
Obtaining a suitable feature distribution is a difficult task in machine learning, especially for unsupervised learning. In our previous research[22], we explored semi-supervised feature extraction based on the GRBM. In this paper, we further explore unsupervised feature extraction based on RBMs to constrict and disperse the hidden layer feature distribution for clustering tasks. Self-learning local supervisions from the visible layer are integrated into contrastive divergence (CD) learning in the hidden layer and the reconstructed hidden layer, yielding a novel encoding framework. The self-learning local supervisions stem from unsupervised learning and a unanimous voting strategy. They are fused into the hidden layer features and the reconstructed hidden layer features. Under this framework, we propose two instantiation models: a self-learning local supervision GRBM (slsGRBM) with Gaussian linear visible units and binary hidden units for modeling real-valued data, and a self-learning local supervision RBM (slsRBM) with binary visible and hidden units; they use different transformations for visible layer reconstruction. The contributions of our work are summarized below.

  • A novel self-learning local supervision encoding framework is presented, in which the self-learning local supervisions from the visible layer are integrated into the CD learning of RBMs to constrict and disperse the distribution of the hidden layer features and the reconstructed hidden layer features.

  • An instantiation model, the self-learning local supervision GRBM (slsGRBM) with Gaussian linear visible units and binary hidden units, is proposed under our encoding framework; it uses a linear transformation for visible layer reconstruction.

  • An instantiation model, the self-learning local supervision RBM (slsRBM) with binary visible and hidden units, is proposed based on our encoding framework; it uses a sigmoid transformation for visible layer reconstruction.

  • We demonstrate, using unsupervised clustering, that the self-learning local supervisions bring a more positive impact on feature extraction than the corresponding models without any external interventions: slsGRBM versus the traditional GRBM on the MSRA-MM 2.0 image data, and slsRBM versus the traditional RBM on UCI data sets.


Fig. 1: The self-learning local supervisions from the visible layer are integrated into the hidden layer and the reconstructed hidden layer of the contrastive divergence (CD) learning of RBMs. In the encoding procedure, we use a sigmoid transformation to obtain the hidden layer features and the reconstructed hidden layer features during the sampling procedure. In the reconstruction procedure, using a linear transformation to obtain the reconstructed visible layer units from the hidden layer units yields the instantiation model slsGRBM (self-learning local supervision GRBM) with Gaussian linear visible units and binary hidden units. Similarly, using a sigmoid transformation to obtain the reconstructed visible layer units from the hidden layer units yields the instantiation model slsRBM (self-learning local supervision RBM) with binary visible and hidden units.

The remainder of the paper is organized as follows. Related work is reviewed in Section II. In Section III, the theoretical background is described. The self-learning local supervision encoding framework and the learning algorithms of the two instantiation models, slsGRBM and slsRBM, are proposed in Section IV. The experimental results are reported in Section V. Finally, our contributions are summarized in Section VI.

II Related Work

In this section, we review the literature on supervised, semi-supervised and unsupervised feature learning based on RBMs and other models, together with voting strategies in supervised learning.


Supervised feature learning has proved to be an effective method in machine learning[8], [28], [29], [15], [30], [31], [32]. Amer et al.[28] proposed a Multimodal Discriminative CRBM (MMDCRBM) model based on the Conditional RBM (an extension of the RBM). Its training process is composed of training each modality using labeled data and training a fusion layer. For multi-modality deep learning, Bu et al.[33] developed a supervised 3D feature learning framework in which an RBM is used to mine the deep correlations of different modalities. Cheng et al.[34] presented a novel duplex metric learning (DML) framework for feature learning and image classification. The main task of DML is to learn an effective hidden layer feature of a discriminative stacked autoencoder (DSAE). In the feature space of the DSAE, similar samples are mapped close to each other and dissimilar samples are mapped further apart. This framework is the work most closely related to our study, but it belongs to supervised feature learning with a DSAE, imposing a metric learning method layer-wise, and it is applied to image classification tasks.
However, supervision information, e.g., labels, is scarce and it is expensive to obtain more labels in many applications. So, some works[35], [36], [37], [22] explored semi-supervised feature learning, which only needs a small number of labels. Chu et al.[22] presented the pcGRBM model, which fuses pairwise constraints into the reconstructed visible layer for clustering tasks. To mitigate the burden of annotation, Yesilbek and Sezgin[36] applied self-learning methods to build a system that can learn from large amounts of unlabeled data and few labeled examples for sketch recognition. The system performs self-learning by extending a small labeled set with new examples extracted from unlabeled sketches. Chen et al.[37] developed a deep sparse auto-encoder network with supervised fine-tuning and unsupervised layer-wise self-learning for fault identification. As a whole, these works belong to semi-supervised learning methods.


Many unsupervised feature learning approaches based on RBMs have been proposed in previous research[38], [39], [40], [41], [42], [43], [44], [45], [13], [46]. Chopra and Yadav[44] presented a unique technique to extract fault features from noisy acoustic signals with an unsupervised RBM. Zhang et al.[45] proposed unsupervised feature learning based on a recursive autoencoder network (RAE) for image classification, using the spectral and spatial information of the original data to produce high-level features. Chen et al.[13] presented a graph regularized RBM (GraphRBM) that extracts hidden layer representations for unsupervised clustering and classification problems while taking the manifold structure of the original data into account. Xie et al.[46] showed a novel approach to optimize RBM pre-training by capturing the principal component directions of the input with principal component analysis. Al-Dmour and Al-Ani[47] proposed a fully-automatic segmentation algorithm in which a neural network (NN) model is used to extract features of brain tissue images and is trained using clustering labels produced by three clustering algorithms; the obtained classes are combined by majority voting. That study is closely related to our work, but our encoding framework based on RBMs is guided by self-learning local supervisions, which stem from unsupervised clustering algorithms and a unanimous voting strategy and are integrated into the CD learning of RBMs to constrict and disperse the distribution of the hidden layer features and the reconstructed hidden layer features. Stewart and Ermon[48] presented a new technique to supervise NNs with prior domain knowledge for computer vision tasks. It is related to our study; however, their work targets a convolutional neural network (CNN), requires large amounts of prior domain knowledge, and encoding prior knowledge into the loss functions of a CNN remains a challenge.


Two voting strategies have often been used for supervised learning in previous research. One is the max-voting scheme. For example, Azimi et al.[49] developed a deep learning method for low carbon steel microstructural classification via a fully convolutional neural network (FCNN) accompanied by a max-voting scheme. The other is the majority voting scheme. For example, Seera et al.[50] applied a recurrent neural network (RNN) to extract features from Transcranial Doppler (TCD) signals for classification tasks, proposing an ensemble RNN model in which the majority voting scheme is used to combine the predictions of single RNNs. Recently, various voting classifiers using majority voting have been proposed to enhance classification performance[51], [52], [53], [54], [55], [56].
In the following, we explore unsupervised feature learning based on RBMs to constrict and disperse the hidden layer feature distribution for clustering tasks.

III Theoretical Background

III-A Restricted Boltzmann Machine

An RBM[10] consists of a two-layer structure: a visible layer and a hidden layer of stochastic binary units, connected by symmetric weights. There are no connections within the visible layer or within the hidden layer. The energy function of a joint configuration of the visible and hidden units takes the form:

$$E(\mathbf{v},\mathbf{h}) = -\sum_{i} a_i v_i - \sum_{j} b_j h_j - \sum_{i,j} v_i w_{ij} h_j \qquad (1)$$

where $\mathbf{v}$ is the visible layer vector and $v_i$ is the binary state of visible unit $i$, $\mathbf{h}$ is the hidden layer vector and $h_j$ is the binary state of hidden unit $j$, $a_i$ and $b_j$ are the biases of the visible layer and hidden layer respectively, and $w_{ij}$ is the symmetric connection weight between $v_i$ and $h_j$.

The conditional probability distributions of the hidden and visible layer units of the RBM are given by:

$$p(h_j = 1 \mid \mathbf{v}) = \sigma\Big(b_j + \sum_i v_i w_{ij}\Big) \qquad (2)$$

and

$$p(v_i = 1 \mid \mathbf{h}) = \sigma\Big(a_i + \sum_j h_j w_{ij}\Big) \qquad (3)$$

where $\sigma(x) = 1/(1+e^{-x})$ is the sigmoid function.
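As a concrete illustration of Eqs. (2) and (3), the following minimal NumPy sketch (ours, not code from the paper) samples binary hidden units from visible data and binary visible units from hidden states; the array shapes, helper names and batch layout are our own assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, b, rng):
    """Eq. (2): p(h_j = 1 | v) = sigmoid(b_j + sum_i v_i w_ij)."""
    p_h = sigmoid(v @ W + b)                  # v: (batch, n_vis), W: (n_vis, n_hid)
    return p_h, (rng.random(p_h.shape) < p_h).astype(v.dtype)

def sample_visible(h, W, a, rng):
    """Eq. (3): p(v_i = 1 | h) = sigmoid(a_i + sum_j h_j w_ij)."""
    p_v = sigmoid(h @ W.T + a)                # h: (batch, n_hid)
    return p_v, (rng.random(p_v.shape) < p_v).astype(h.dtype)
```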

III-B Linear Visible Units

The classical RBM was designed with binary units for both the hidden and visible layers[12]. For training on real-valued data, the visible layer of the RBM consists of Gaussian linear units while the hidden layer still consists of binary units. The energy function of the RBM with Gaussian linear visible units takes the form:

$$E(\mathbf{v},\mathbf{h}) = \sum_{i} \frac{(v_i - a_i)^2}{2\sigma_i^2} - \sum_{j} b_j h_j - \sum_{i,j} \frac{v_i}{\sigma_i} w_{ij} h_j \qquad (4)$$

where $\sigma_i$ is the standard deviation of the Gaussian noise of visible unit $i$. In the visible layer, the conditional probability is defined by:

$$p(v_i = v \mid \mathbf{h}) = \mathcal{N}\Big(v \;\Big|\; a_i + \sigma_i \sum_j h_j w_{ij},\; \sigma_i^2\Big) \qquad (5)$$

where $\mathcal{N}(\cdot \mid \mu, \sigma^2)$ denotes the Gaussian density. The update rules of the parameters become simple when the linear visible units have unit-variance Gaussian noise. This type of noise-free visible unit is used for one of the proposed models (slsGRBM) based on our novel framework described later. Then the reconstructed values of the Gaussian linear visible units are equal to their top-down input from the binary hidden units plus their bias.
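For the unit-variance, noise-free case described above, the reconstruction step reduces to a linear top-down pass; a minimal sketch in the same style as the previous snippet (the names are our assumptions):

```python
def reconstruct_gaussian_visible(h, W, a):
    """Gaussian linear visible units with unit variance: the reconstruction is
    the top-down input from the binary hidden units plus the visible bias."""
    return h @ W.T + a                        # linear transformation, no sigmoid
```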

III-C Contrastive Divergence Learning

To learn the symmetric connection weights $w_{ij}$ by maximum-likelihood (ML) learning, the update rule is given by:

$$\Delta w_{ij} = \varepsilon\big(\langle v_i h_j\rangle_{\mathrm{data}} - \langle v_i h_j\rangle_{\mathrm{model}}\big) \qquad (6)$$

where $\varepsilon$ is the learning rate. However, it is very slow to obtain an unbiased sample of $\langle v_i h_j\rangle_{\mathrm{model}}$.
So, a faster learning algorithm[12] was proposed that approximates the gradient with contrastive divergence (CD). The change of the symmetric connection weight with one-step CD is given by:

$$\Delta w_{ij} = \varepsilon\big(\langle v_i h_j\rangle_{\mathrm{data}} - \langle v_i h_j\rangle_{\mathrm{recon}}\big) \qquad (7)$$

where the hidden layer units are driven by the visible data, $\langle v_i h_j\rangle_{\mathrm{data}}$ is the fraction of times that visible unit $i$ and hidden unit $j$ are on together, and $\langle v_i h_j\rangle_{\mathrm{recon}}$ is the corresponding statistic over the reconstructions. Similarly, the changes of the biases $a_i$ and $b_j$ with one-step CD are given by:

$$\Delta a_i = \varepsilon\big(\langle v_i\rangle_{\mathrm{data}} - \langle v_i\rangle_{\mathrm{recon}}\big) \qquad (8)$$

and

$$\Delta b_j = \varepsilon\big(\langle h_j\rangle_{\mathrm{data}} - \langle h_j\rangle_{\mathrm{recon}}\big) \qquad (9)$$

So, the update rules of all parameters take the form

$$w_{ij} \leftarrow w_{ij} + \Delta w_{ij} \qquad (10)$$
$$a_i \leftarrow a_i + \Delta a_i \qquad (11)$$

and

$$b_j \leftarrow b_j + \Delta b_j \qquad (12)$$

CD learning clearly improves learning efficiency.
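A minimal one-step CD update in NumPy, written as an illustration of Eqs. (7)-(12) rather than the authors' code; it reuses the hypothetical sample_hidden and sample_visible helpers from the sketch in Section III-A and uses mean-field probabilities for the reconstruction statistics.

```python
def cd1_update(v0, W, a, b, lr, rng):
    """One-step contrastive divergence (CD-1) update for a binary RBM."""
    p_h0, h0 = sample_hidden(v0, W, b, rng)       # positive phase, driven by the data
    p_v1, _  = sample_visible(h0, W, a, rng)      # reconstruction of the visible layer
    p_h1, _  = sample_hidden(p_v1, W, b, rng)     # hidden probabilities of the reconstruction

    n = v0.shape[0]
    dW = (v0.T @ p_h0 - p_v1.T @ p_h1) / n        # Eq. (7)
    da = (v0 - p_v1).mean(axis=0)                 # Eq. (8)
    db = (p_h0 - p_h1).mean(axis=0)               # Eq. (9)
    return W + lr * dW, a + lr * da, b + lr * db  # Eqs. (10)-(12)
```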

IV Self-learning Local Supervision Encoding Framework and Learning Algorithms

In this section, we present a novel self-learning local supervision encoding framework based on RBMs, in which the self-learning local supervisions from the visible layer are integrated into the CD learning of RBMs to constrict and disperse the distribution of the hidden layer features. In the framework, we use a sigmoid transformation to obtain the hidden layer features and the reconstructed hidden layer features from the visible layer and the reconstructed visible layer units during the sampling procedure. The self-learning local supervisions contain local credible clusters, which stem from different unsupervised learning algorithms and a unanimous voting strategy. For the same local cluster, the hidden features of the input and of the reconstructed data tend to constrict together. Furthermore, the centers of different local clusters in the hidden layer tend to disperse in the encoding process. The structure of the framework is shown in Fig. 1.
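The local credible clusters that serve as self-learning local supervisions are described as the output of different unsupervised clustering algorithms combined by a unanimous voting strategy. The sketch below is one plausible reading of that step, not the authors' implementation: it keeps only groups of samples that every base clustering places in the same cluster; the helper name and the particular base algorithms (two of the three clustering methods used in the experiments, since density peaks is not available in scikit-learn) are assumptions.

```python
import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans, AffinityPropagation

def local_credible_clusters(X, n_clusters=3, random_state=0):
    """Group samples whose memberships agree across all base clusterings
    (unanimous voting); each surviving group is one local credible cluster."""
    labelings = [
        KMeans(n_clusters=n_clusters, n_init=10,
               random_state=random_state).fit_predict(X),
        AffinityPropagation(random_state=random_state).fit_predict(X),
    ]
    groups = defaultdict(list)
    for idx, key in enumerate(zip(*labelings)):   # tuple of labels for each sample
        groups[key].append(idx)
    # keep only groups with more than one member as local supervisions
    return [np.array(idxs) for idxs in groups.values() if len(idxs) > 1]
```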

IV-A The Framework

Notation Definition
Visible layer data set
Hidden layer feature set
Reconstructed visible layer set
Hidden layer feature of reconstructed visible layer set
Visible layer row vector
Hidden layer feature row vector
Reconstructed visible layer row vector
Hidden layer feature row vector of reconstructed data
All vectors of belong to the same cluster.
All vectors of belong to the same cluster.
All vectors of belong to the same cluster.
All vectors of belong to the same cluster.
The center of cluster
The center of cluster
The center of cluster
The center of cluster
TABLE I: List of symbols.

Consider the original data set, its hidden layer feature set, the reconstructed visible layer data set and the hidden feature set of the reconstructed data, as listed in Table I. The local clusters of the visible layer set are mapped to corresponding local clusters of the hidden layer features; similarly, the local clusters of the reconstructed visible layer set are mapped to corresponding local clusters of the reconstructed hidden layer features. We use the gradient descent method to obtain approximately optimal parameters of the framework. It is expensive to compute the gradient of the log probability of RBMs; however, Karakida et al.[27] demonstrated that CD learning is simpler than ML learning in RBMs. Therefore, we apply CD learning to obtain an approximation of the log probability gradient of RBMs.
Then the objective function takes the form:

(13)

where the notation covers the model parameters, a scale coefficient, the cardinality of each local cluster, the number of pairs of cluster centers, the average of the log-likelihood and the squared 2-norm. For the same local cluster, the hidden layer features and the reconstructed hidden layer features tend to constrict together in the training procedure. Meanwhile, the centers of different local clusters tend to disperse in the hidden layer and the reconstructed hidden layer.
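The exact form of Eq. (13) is not recoverable from this copy, but its stated ingredients are a CD-approximated log-likelihood, a squared-2-norm constriction term that pulls the hidden (and reconstructed hidden) features of each local credible cluster toward their cluster center, and a dispersion term over pairs of cluster centers weighted by a scale coefficient. The sketch below computes one plausible penalty of that shape; it is our illustration under those assumptions, not the authors' objective.

```python
import numpy as np

def constrict_disperse_penalty(H, clusters, scale=1.0):
    """H: (n_samples, n_hid) hidden features; clusters: list of index arrays
    (the local credible clusters). Smaller values mean tighter clusters and
    more widely separated cluster centers."""
    centers = [H[idx].mean(axis=0) for idx in clusters]
    # constrict: mean squared 2-norm from each feature to its cluster center
    constrict = sum(np.sum((H[idx] - c) ** 2) / len(idx)
                    for idx, c in zip(clusters, centers))
    # disperse: mean squared 2-norm between every pair of cluster centers
    n_pairs = max(len(centers) * (len(centers) - 1) // 2, 1)
    disperse = sum(np.sum((centers[k] - centers[l]) ** 2)
                   for k in range(len(centers))
                   for l in range(k + 1, len(centers))) / n_pairs
    return constrict - scale * disperse
```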
Let

(14)

and

(15)

Then the objective function is as follows:

(16)

The next problem is how to obtain the gradients of the two terms defined in Eq. (14) and Eq. (15). Firstly, we compute the gradient of the first of these, which has another equivalent form:

(17)

Then we can obtain:

(18)

From the above result, we can see that the remaining task is to compute the partial derivatives that appear in it. Next, each of them is derived separately.

(19)

Obviously, the result is a row vector, all components of which are independent of the differentiation variable except one. So,

(20)

Because , the final result of has an expression as follows:

(21)

Similarly, the final result is expressed as follows:

(22)

Then, the final result of is a column vector:

(23)

Similarly,

(24)
(25)

and

(26)

Substituting Eq. (23), Eq. (24), Eq. (25) and Eq. (26) into Eq. (18), we have

(27)

Using the same approach, we can obtain:

(28)

The following task is to obtain the two remaining derivatives; their final results have the following expressions:

(29)

and

(30)

So, the final result of is as follows.

(31)

Similarly, the expression of is as follows.

(32)

Because and , the final result of and . Then we can obtain: .
Finally, following CD learning and the gradients derived above, the update rules of the symmetric connection weights take the form

(33)

The update rules of biases take the form

(34)

and the update rules of biases take the form

(35)

Under this framework, we present two instantiation models with two different visible layer reconstructions. One is the self-learning local supervision GRBM (slsGRBM) model with Gaussian linear visible units and binary hidden units, which uses a linear transformation for visible layer reconstruction. The other is the self-learning local supervision RBM (slsRBM) model with binary visible and hidden units, which uses a sigmoid transformation for visible layer reconstruction.
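To make the distinction between the two instantiation models concrete, here is a minimal sketch (our own naming, reusing the sigmoid helper from the earlier snippets): slsGRBM reconstructs Gaussian linear visible units with a linear transformation, slsRBM reconstructs binary visible units with a sigmoid transformation, and both re-encode the reconstruction with a sigmoid to obtain the reconstructed hidden layer features.

```python
def reconstruct_visible(h, W, a, model="slsGRBM"):
    """Visible-layer reconstruction used by the two instantiation models."""
    top_down = h @ W.T + a
    if model == "slsGRBM":        # Gaussian linear visible units: linear transformation
        return top_down
    if model == "slsRBM":         # binary visible units: sigmoid transformation
        return sigmoid(top_down)
    raise ValueError(f"unknown model: {model}")

def reconstructed_hidden(v_recon, W, b):
    """Both models re-encode the reconstruction with a sigmoid, as in Eq. (2)."""
    return sigmoid(v_recon @ W + b)
```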

No. Dataset Classes Instances Features
1 Book (BO) 3 896 892
2 Water (WA) 3 922 899
3 Weddingring (WR) 3 897 899
4 Birthdaycake (BC) 3 932 892
5 Vegetable (VE) 3 872 899
6 Ambulances (AM) 3 930 892
7 Vista (VI) 3 799 899
8 Wallpaper (WP) 3 919 899
9 Voituretuning (VT) 3 879 899
TABLE II: Summary of the experiment datasets I.
No. Dataset Classes Instances Features
1 Haberman’s Survival (HS) 2 306 3
2 QSAR biodegradation (QB) 2 1055 41
3 SPECT Heart (SH) 2 267 22
4 Simulation Crashes (SC) 2 540 18
5 Breast Cancer Wisconsin (BCW) 2 569 32
6 Iris (IR) 3 150 4
TABLE III: Summary of the experiment datasets II (UCI).
Dataset (No.) DP K-means AP DP+GRBM K-means+GRBM AP+GRBM DP+slsGRBM K-means+slsGRBM AP+slsGRBM
BO (1) 0.4275±0.00000 0.4007±0.00068 0.4230±0.00000 0.4219±0.00014 0.3527±0.00012 0.4275±0.00001 0.4743±0.00340 0.4275±0.00009 0.4319±0.00033
WA (2) 0.4544±0.00000 0.4176±0.00007 0.3905±0.00000 0.4360±0.00046 0.4273±0.00001 0.4024±0.00017 0.4837±0.00290 0.4826±0.00180 0.4826±0.00240
WR (3) 0.4147±0.00000 0.4058±0.00005 0.4048±0.00000 0.5162±0.00009 0.4047±0.00120 0.4158±0.00000 0.5326±0.00076 0.5017±0.00210 0.4872±0.00002
BC (4) 0.4453±0.00000 0.4979±0.00000 0.4753±0.00000 0.4742±0.00055 0.4796±0.00055 0.4882±0.00075 0.5472±0.00140 0.5461±0.00083 0.5054±0.00330
VE (5) 0.5011±0.00000 0.4041±0.00110 0.4243±0.00000 0.4874±0.00015 0.4266±0.00009 0.4232±0.00000 0.5057±0.00230 0.5034±0.00880 0.4977±0.00440
AM (6) 0.5667±0.00000 0.3935±0.00650 0.3968±0.00000 0.5548±0.00110 0.4968±0.00790 0.3581±0.00003 0.5699±0.00008 0.5570±0.00045 0.5570±0.00033
VI (7) 0.5232±0.00000 0.4731±0.00001 0.4318±0.00000 0.4493±0.00025 0.4581±0.00023 0.4631±0.00038 0.5782±0.01220 0.5294±0.00510 0.5457±0.00340
WP (8) 0.5016±0.00000 0.4266±0.00029 0.4342±0.00000 0.4723±0.00019 0.4211±0.00001 0.4690±0.00011 0.5365±0.00720 0.5626±0.001210 0.5647±0.01460
VT (9) 0.4664±0.00000 0.3788±0.00130 0.4027±0.00000 0.4676±0.00017 0.3697±0.00170 0.4232±0.00021 0.5165±0.00000 0.6189±0.0000 0.6223±0.00002
Average 0.4779 0.4217 0.4207 0.4755 0.4263 0.4300 0.5276 0.5255 0.5216
TABLE IV: The accuracies and variance of the proposed algorithms (DP+slsGRBM, K-means+slsGRBM and AP+slsGRBM) and contrastive algorithms (datasets I).
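Each column heading pairs a clustering algorithm (DP, K-means or AP) with the representation it clusters: the raw data, GRBM hidden features, or slsGRBM hidden features. A hedged sketch of this evaluation pipeline for one K-means column (the feature matrix is whatever representation is being evaluated; the function name and the number of runs are placeholders):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import fowlkes_mallows_score

def evaluate_representation(features, y_true, n_clusters, n_runs=10, seed=0):
    """Cluster a representation with K-means and report the mean Fowlkes-Mallows
    Index over several runs (accuracy and purity are computed analogously)."""
    scores = [fowlkes_mallows_score(
                  y_true,
                  KMeans(n_clusters=n_clusters, n_init=10,
                         random_state=seed + r).fit_predict(features))
              for r in range(n_runs)]
    return sum(scores) / len(scores)
```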
Fig. 2: Comparison of the proposed algorithms (DP+slsGRBM, K-means+slsGRBM and AP+slsGRBM) and contrastive algorithms using the evaluating indicator of the accuracy. The X axis indicates the serial number of data sets I.
Dataset (No.) DP K-means AP DP+GRBM K-means+GRBM AP+GRBM DP+slsGRBM K-means+slsGRBM AP+slsGRBM
BO (1) 0.8778 0.8559 0.8731 0.8707 0.8785 0.8731 0.9014 0.8875 0.8945
WA (2) 0.8376 0.8175 0.8230 0.8427 0.8167 0.8282 0.8645 0.8660 0.8660
WR (3) 0.8089 0.8068 0.8028 0.8069 0.8056 0.8037 0.8297 0.8240 0.8298
BC (4) 0.8218 0.7325 0.7694 0.8344 0.7413 0.7667 0.8560 0.8086 0.8191
VE (5) 0.8339 0.8290 0.8327 0.8333 0.8317 0.8319 0.8591 0.8576 0.8589
AM (6) 0.7625 0.7571 0.7525 0.7626 0.7425 0.7635 0.7908 0.7815 0.7815
VI (7) 0.8490 0.8489 0.8493 0.8486 0.8482 0.8492 0.8780 0.8772 0.8778
WP (8) 0.7811 0.7709 0.7829 0.7811 0.7731 0.7687 0.8131 0.8181 0.8155
VT (9) 0.9179 0.9194 0.9201 0.9171 0.9196 0.9173 0.9495 0.9506 0.9510
Average 0.8323 0.8154 0.8229 0.8330 0.8175 0.8223 0.8603 0.8523 0.8549
TABLE V: The purity of the proposed algorithms (DP+slsGRBM, K-means+slsGRBM and AP+slsGRBM) and contrastive algorithms (datasets I).
Fig. 3: Comparison of the proposed algorithms (DP+slsGRBM, K-means+slsGRBM and AP+slsGRBM) and contrastive algorithms using the evaluating indicator of the purity. The X axis indicates the serial number of data sets I.
Dataset (No.) DP K-means AP DP+GRBM K-means+GRBM AP+GRBM DP+slsGRBM K-means+slsGRBM AP+slsGRBM
BO (1) 0.4471 0.3838 0.3999 0.4170 0.3767 0.4078 0.5110 0.4212 0.3992
WA (2) 0.4731 0.3907 0.4001 0.4660 0.3932 0.4011 0.4907 0.4781 0.4781
WR (3) 0.4093 0.4058 0.4104 0.4841 0.4053 0.4086 0.5281 0.4765 0.4676
BC (4) 0.4803 0.4632 0.4288 0.5140 0.4537 0.4342 0.5215 0.5199 0.4783
VE (5) 0.5044 0.4042 0.4149 0.4613 0.4052 0.4147 0.5117 0.4968 0.5046
AM (6) 0.5887 0.4341 0.4271 0.5719 0.4771 0.4074 0.5508 0.5151 0.5151
VI (7) 0.4963 0.4418 0.4357 0.5097 0.4422 0.4394 0.5600 0.5363 0.5552
WP (8) 0.5718 0.4148 0.4154 0.5027 0.4078 0.4362 0.5336 0.6782 0.6743
VT (9) 0.4644 0.4054 0.4212 0.4751 0.4041 0.4523 0.4964 0.6535 0.6557
Average 0.4928 0.4160 0.4170 0.4891 0.4184 0.4224 0.5227 0.5306 0.5253
TABLE VI: The Fowlkes and Mallows Index (FMI) of the proposed algorithms (DP+slsGRBM, K-means+slsGRBM and AP+slsGRBM) and contrastive algorithms (datasets I).
Fig. 4: Comparison of the proposed algorithms (DP+slsGRBM, K-means+slsGRBM and AP+slsGRBM) and contrastive algorithms using the evaluating indicator of the Fowlkes and Mallows Index. The X axis indicates the serial number of data sets I.
Fig. 5: Comparison of the proposed algorithms (DP+slsGRBM, K-means+slsGRBM and AP+slsGRBM) and contrastive algorithms using average accuracy, purity and Fowlkes and Mallows Index (datasets I). The X axis indicates the name of contrastive algorithms.
Dataset (No.) DP K-means AP DP+RBM K-means+RBM AP+RBM DP+slsRBM K-means+slsRBM AP+slsRBM
HS (1) 0.5719±0.00000 0.5163±0.00013 0.5169±0.00001 0.5229±0.02400 0.5686±0.00140 0.5588±0.01620 0.6174±0.00480 0.6144±0.000410 0.5980±0.00390
QB (2) 0.5592±0.00000 0.5886±0.00000 0.5640±0.00000 0.6142±0.00003 0.5782±0.00087 0.5678±0.00095 0.6218±0.00000 0.6028±0.00016 0.6104±0.00003
SH (3) 0.6180±0.00000 0.5356±0.00000 0.5543±0.00000 0.5506±0.00810 0.5318±0.00018 0.5243±0.00011 0.7715±0.00200 0.5730±0.00018 0.5730±0.00250
SC (4) 0.6259±0.00000 0.5315±0.00011 0.5315±0.00000 0.8056±0.04390 0.5556±0.00014 0.5481±0.00200 0.8111±0.00210 0.5741±0.00001 0.5963±0.00004
BCW (5) 0.7909±0.00000 0.8541±0.00000 0.8541±0.00000 0.6362±0.00800 0.6309±0.00310 0.6309±0.00420 0.8524±0.02450 0.8682±0.00026 0.8664±0.00022
IR (6) 0.9067±0.00000