The restricted Boltzmann machine (RBM), an energy-based model that defines an input distribution, is widely used to extract latent features before classification. Such an approach combines unsupervised learning for feature modeling with supervised learning for classification. Two training steps are needed. The first step, called pre-training, models the features used for classification; this can be done by training an RBM that captures the distribution of the input. The second step, called fine-tuning, trains a separate classifier based on the features from the first step [Larochelle et al.2012]. This two-phase training approach is also used for deep networks. Deep belief networks (DBN) are built with stacked RBMs and trained in a layer-wise manner [Hinton and Salakhutdinov2006]. Two-phase training based on a deep network consists of a DBN and a classifier on top of it.
The two-phase training strategy has three possible problems. 1) It requires two training processes: one for training the RBMs and one for training a classifier. 2) It is not guaranteed that the features modeled in the first step are useful in the classification phase, since they are obtained independently of the classification task. 3) Deciding which classifier is best for each problem requires additional effort. Therefore, there is a need for a method that can conduct feature modeling and classification concurrently [Larochelle et al.2012].
To resolve these problems, recent papers suggest transforming the RBM into a model that can perform both unsupervised and supervised learning. Since an RBM calculates the joint and conditional probabilities, the suggested models combine a generative and a discriminative RBM. This hybrid discriminative RBM is trained concurrently for both objectives by summing the two contributions [Larochelle and Bengio2008, Larochelle et al.2012]. In a similar way, a self-contained RBM for classification was developed by applying a free-energy-based approximation to the RBM, a technique originally used in a supervised learning setting, reinforcement learning [Elfwing et al.2015]. However, these approaches are limited to transforming the RBM, which is a shallow network.
In this study, we develop alternative models to solve a classification problem based on DBN. Viewing the two-phase training as two separate optimization problems, we apply optimization modeling techniques in developing our models. Our first approach is to design new objective functions: we design an expected loss function based on the conditional distribution of the hidden variables built by the DBN and the loss function of the classifier. Second, we introduce constraints that bound the DBN weights in the feed-forward phase. The constraints keep a good representation of the input and regularize the weights during updates. Third, we apply bilevel programming to the two-phase training method. The bilevel model has the loss function of the classifier in its objective function, but it constrains the DBN parameters to be optimal for the phase-1 objective. This model searches for solutions to the classification objective only among solutions that are optimal for the DBN objective.
Our main contributions are several classification models combining DBN and a loss function in a coherent way. In the computational study we verify that the suggested models perform better than the two-phase method.
2 Literature Review
The two-phase training strategy has been applied to many classification tasks on different types of data. Two-phase training with an RBM and a support vector machine (SVM) has been explored in classification tasks on images, documents, and network intrusion data [Xing et al.2005, Norouzi et al.2009, Salama et al.2011, Dahl et al.2012]. Logistic regression replacing the SVM has also been explored [Mccallum et al.2006, Cho et al.2011].
Gehler et al. [Gehler et al.2006] used the 1-nearest neighbor classifier with an RBM to solve a document classification task. Hinton and Salakhutdinov [Hinton and Salakhutdinov2006] suggested the DBN, consisting of stacked RBMs trained in a layer-wise manner. The two-phase method using a DBN and a deep neural network has been studied to solve various classification problems such as image and text recognition [Hinton and Salakhutdinov2006, Bengio and Lamblin2007, Sarikaya et al.2014]. Recently, this approach has been applied to motor imagery classification in the area of brain–computer interfaces [Lu et al.2017] and, in biomedical research, to the classification of Cytochrome P450 1A2 inhibitors and non-inhibitors [Yu et al.2017]. All these papers rely on two distinct phases, while our models assume a holistic view of both aspects.
Many studies have been conducted to address the problems of two-phase training. Most of this research has focused on transforming the RBM so that the modified model can achieve the generative and discriminative objectives at the same time. Schmah et al. [Schmah et al.2009] proposed a discriminative RBM method in which classification is done in the manner of a Bayes classifier. However, this method cannot capture relationships between the classes, since the RBM of each class is trained separately. Larochelle et al. [Larochelle and Bengio2008, Larochelle et al.2012] proposed a self-contained discriminative RBM framework whose objective function consists of the generative learning objective, $\log p(x, y)$, and the discriminative learning objective, $\log p(y \mid x)$. Both distributions are derived from the RBM. Similarly, a self-contained discriminative RBM method for classification was proposed in [Elfwing et al.2015]; a free-energy-based approximation, initially suggested for reinforcement learning, is applied in its development. This prior work relies on the RBM conditional probability, while we handle general loss functions. Our models also hinge on completely different principles.
Restricted Boltzmann Machines.
An RBM is an energy-based probabilistic model, a restricted version of the Boltzmann machine (BM), which is a log-linear Markov random field. It has visible nodes corresponding to the input and hidden nodes matching the latent features. The joint distribution of the visible units $v$ and hidden units $h$ is defined as
$$
P(v, h) = \frac{1}{Z} e^{-E(v, h)}, \qquad E(v, h) = -b^\top v - c^\top h - v^\top W h,
$$
where $W$, $b$, and $c$ are the model parameters and $Z = \sum_{v, h} e^{-E(v, h)}$ is the partition function. Since the units in a layer are conditionally independent in an RBM, the conditional distributions factorize as
$$
P(h \mid v) = \prod_j P(h_j \mid v), \qquad P(v \mid h) = \prod_i P(v_i \mid h).
$$
For binary units, where $v_i \in \{0, 1\}$ and $h_j \in \{0, 1\}$, we can write $P(h_j = 1 \mid v) = \sigma(c_j + v^\top W_{\cdot j})$ and $P(v_i = 1 \mid h) = \sigma(b_i + W_{i \cdot} h)$, where $\sigma(x) = 1 / (1 + e^{-x})$ is the sigmoid function. In this sense an RBM with binary units is an unsupervised neural network with a sigmoid activation function. The model calibration of an RBM can be done by minimizing the negative log-likelihood through gradient descent. The RBM benefits from the above conditional probabilities, which make it easy to obtain model samples through Gibbs sampling. Contrastive divergence (CD) makes Gibbs sampling even simpler: 1) start a Markov chain at a training sample, and 2) stop after k steps to obtain a model sample. It has been shown that CD with a few steps performs effectively [Bengio2009, Hinton2002].
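As an illustration, the CD procedure with k = 1 can be sketched as follows for a binary RBM. This is a minimal sketch, not the paper's implementation; the network sizes, learning rate, and random toy data are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.1):
    """One CD-1 update on a minibatch v0 of shape (n, d)."""
    # Positive phase: P(h = 1 | v0) and a sampled hidden state.
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step back to the visibles, then to the hiddens.
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    n = v0.shape[0]
    # CD-1 approximates the log-likelihood gradient by <v h>_data - <v h>_model.
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / n
    b += lr * (v0 - v1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c

d, m = 6, 4  # toy visible and hidden sizes (assumed for the example)
W = 0.01 * rng.standard_normal((d, m))
b = np.zeros(d)
c = np.zeros(m)
data = (rng.random((32, d)) < 0.5).astype(float)  # toy binary inputs
for _ in range(100):
    W, b, c = cd1_step(data, W, b, c)
```

The same update rule applies for any k: run k Gibbs steps before collecting the negative-phase statistics.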
Deep Belief Networks.
A DBN is a generative graphical model consisting of stacked RBMs. Owing to its deep structure, a DBN can capture a hierarchical representation of the input data. Hinton et al. [Hinton et al.2006] introduced the DBN with a training algorithm that greedily trains one layer at a time. Given visible unit $v$ and $\ell$ hidden layers $h^1, \ldots, h^\ell$, the joint distribution is defined as [Bengio2009, Hinton et al.2006]
$$
P(v, h^1, \ldots, h^\ell) = \left( \prod_{k=0}^{\ell-2} P(h^k \mid h^{k+1}) \right) P(h^{\ell-1}, h^\ell), \qquad h^0 = v.
$$
Since each layer of a DBN is constructed as an RBM, training each layer of a DBN is the same as training an RBM.
Classification is conducted by initializing a network through DBN training [Hinton et al.2006, Bengio and Lamblin2007]. Two-phase training is done sequentially by: 1) pre-training, unsupervised learning of the stacked RBMs in a layer-wise manner, and 2) fine-tuning, supervised learning with a classifier. Each phase requires solving an optimization problem. Given a training dataset $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^N$, with input $x^{(i)}$ and label $y^{(i)}$, the pre-training phase solves the following optimization problem at each layer $k$:
$$
\min_{\theta_k} \; \sum_{i=1}^{N} -\log P\big(v_k^{(i)}; \theta_k\big),
$$
where $\theta_k = (W_k, b_k, c_k)$ is the RBM model parameter denoting the weights, visible bias, and hidden bias in the energy function, and $v_k^{(i)}$ is the visible input to layer $k$ corresponding to input $x^{(i)}$. Note that in the layer-wise updating manner we need to solve $\ell$ of these problems, from the bottom to the top hidden layer. For the fine-tuning phase we solve the following optimization problem:
$$
\min_{\phi} \; \sum_{i=1}^{N} L\big(h_\ell^{(i)}, y^{(i)}; \phi\big),
$$
where $L$ is a loss function, $h_\ell^{(i)}$ denotes the final hidden features at layer $\ell$, and $\phi$ denotes the parameters of the classifier. Here for simplicity we write $h_\ell^{(i)} = h_\ell(x^{(i)})$. When combining a DBN and a feed-forward neural network (FFN) with sigmoid activations, all the weight and hidden-bias parameters between the input and hidden layers are shared by both training phases. Therefore, in this case we initialize the FFN by training the DBN.
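The two-phase procedure above can be sketched compactly on toy data. This is a hedged sketch: the layer sizes, learning rates, epoch counts, and random data are assumptions, mean activations (rather than samples) are fed between layers, and for brevity the fine-tuning step updates only the classifier weights on top of the pre-trained features.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def pretrain_rbm(v, m, epochs=50, lr=0.1):
    """Greedy pre-training of one RBM layer with CD-1 on inputs v (n, d)."""
    d = v.shape[1]
    W, b, c = 0.01 * rng.standard_normal((d, m)), np.zeros(d), np.zeros(m)
    for _ in range(epochs):
        ph0 = sigmoid(v @ W + c)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(h0 @ W.T + b)          # one Gibbs step (CD-1)
        ph1 = sigmoid(pv1 @ W + c)
        n = v.shape[0]
        W += lr * (v.T @ ph0 - pv1.T @ ph1) / n
        b += lr * (v - pv1).mean(axis=0)
        c += lr * (ph0 - ph1).mean(axis=0)
    return W, c

# Phase 1: pre-train two stacked layers, feeding mean activations upward.
x = (rng.random((64, 10)) < 0.5).astype(float)   # toy binary inputs
y = rng.integers(0, 3, size=64)                  # 3 toy classes
layers, v = [], x
for m in (8, 6):
    W, c = pretrain_rbm(v, m)
    layers.append((W, c))
    v = sigmoid(v @ W + c)

# Phase 2: fine-tune a softmax classifier on the top-layer features.
U = 0.01 * rng.standard_normal((v.shape[1], 3))
for _ in range(200):
    logits = v @ U
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(y)), y] -= 1.0               # gradient of cross-entropy
    U -= 0.5 * (v.T @ p) / len(y)
```

In the full method the fine-tuning phase would backpropagate through the shared DBN weights as well; the structure of the two phases is the same.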
4 Proposed Models
We model an expected loss function for classification. Since classification in the two-phase method is conducted on the hidden space, the probability distribution of the hidden variables obtained by the DBN is used in the proposed models. The two-phase method provides information about the model parameters after each phase is trained. We suggest constraints based on this information to prevent the model parameters from deviating far from a good representation of the input. The set of optimal solutions to the unsupervised objective of the two-phase method contains good candidate solutions for the second phase. The bilevel model restricts its search to this set when finding optimal solutions for the phase-2 objective, so that it conducts the two-phase training in one shot.
DBN Fitting Plus Loss Model.
We start with a naive model that sums the pre-training and fine-tuning objectives. This model conducts the two-phase training strategy simultaneously; however, we need one more hyperparameter $\gamma$ to balance the impact of the two objectives. The model (DBN+loss) is defined as
$$
\min_{\theta, \phi} \; -\gamma \, \mathbb{E}\big[\log P(x; \theta)\big] + \mathbb{E}\big[L\big(h_\ell(x; \theta), y; \phi\big)\big],
$$
and empirically, based on training samples $(x^{(i)}, y^{(i)})$, $i = 1, \ldots, N$,
$$
\min_{\theta, \phi} \; \sum_{i=1}^{N} \Big[ -\gamma \log P\big(x^{(i)}; \theta\big) + L\big(h_\ell(x^{(i)}; \theta), y^{(i)}; \phi\big) \Big],
$$
where $\theta$ and $\phi$ are the underlying parameters. Note that $P(x; \theta)$ follows from (1) by marginalizing out the hidden units. This model has already been proposed for the case when the classification loss function is based on the RBM conditional distribution [Larochelle and Bengio2008, Larochelle et al.2012].
Expected Loss Model with DBN Boxing.
We first design an expected loss model based on the conditional distribution obtained by the DBN. This model conducts classification on the hidden space. Since it minimizes the expected loss, it should be more robust and thus yield better accuracy on unobserved data. The mathematical model that minimizes the expected loss function is defined as
$$
\min_{\theta, \phi} \; \mathbb{E}_{x, y}\, \mathbb{E}_{P(h \mid x; \theta)}\big[L(h, y; \phi)\big],
$$
and empirically, based on training samples $(x^{(i)}, y^{(i)})$, $i = 1, \ldots, N$,
$$
\min_{\theta, \phi} \; \sum_{i=1}^{N} \sum_{h} P\big(h \mid x^{(i)}; \theta\big)\, L\big(h, y^{(i)}; \phi\big).
$$
With the notation $P(h \mid x; \theta)$ we explicitly show the dependency of the conditional distribution on $\theta$. We modify the expected loss model by introducing a constraint that bounds the DBN-related parameters with respect to their optimal values. This model has two benefits. First, it keeps a good representation of the input by constraining the parameters fitted in the unsupervised manner. Second, the constraint regularizes the model parameters by preventing them from blowing up while being updated. Given training samples, the mathematical form of the model (EL-DBN) reads
$$
\min_{\theta, \phi} \; \sum_{i=1}^{N} \sum_{h} P\big(h \mid x^{(i)}; \theta\big)\, L\big(h, y^{(i)}; \phi\big) \quad \text{s.t. } \; \|\theta - \theta^{D}\| \le \epsilon,
$$
where $\theta^{D}$ are the optimal DBN parameters and $\epsilon$ is a hyperparameter. This model needs a pre-training phase to obtain the DBN-fitted parameters.
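If the bound is interpreted elementwise, the boxing constraint can be enforced during training by a simple projection after each gradient update. The following is an illustrative sketch; the parameter values and the elementwise interpretation of the bound are assumptions made for the example.

```python
import numpy as np

def project_to_box(theta, theta_dbn, eps):
    """Project parameters into the box of radius eps around the DBN optimum."""
    return np.clip(theta, theta_dbn - eps, theta_dbn + eps)

theta_dbn = np.array([0.5, -1.2, 0.8])   # pre-trained DBN parameters (toy)
theta = np.array([0.9, -1.25, -0.3])     # parameters after a gradient step
theta = project_to_box(theta, theta_dbn, eps=0.1)
# Each component now lies within 0.1 of its pre-trained value.
```

The same projection applies to the EL-DBNOPT variant below, with the two-phase optimum in place of the DBN optimum.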
Expected Loss Model with DBN Classification Boxing.
Similar to the DBN boxing model, this expected loss model has a constraint that bounds the DBN parameters by their optimal values at the end of both phases. This model regularizes the parameters with values fitted in both the unsupervised and the supervised manner. Therefore, it can achieve better accuracy, although one extra training is needed on top of the two-phase training. Given training samples, the model (EL-DBNOPT) reads
$$
\min_{\theta, \phi} \; \sum_{i=1}^{N} \sum_{h} P\big(h \mid x^{(i)}; \theta\big)\, L\big(h, y^{(i)}; \phi\big) \quad \text{s.t. } \; \|\theta - \theta^{DC}\| \le \epsilon,
$$
where $\theta^{DC}$ are the optimal values of the DBN parameters after two-phase training and $\epsilon$ is a hyperparameter.
Feed-forward Network with DBN Boxing.
We also propose a model based on boxing constraints in which the FFN is constrained by the DBN solution. The mathematical model (FFN-DBN), based on training samples $(x^{(i)}, y^{(i)})$, $i = 1, \ldots, N$, is
$$
\min_{\theta, \phi} \; \sum_{i=1}^{N} L\big(h_\ell(x^{(i)}; \theta), y^{(i)}; \phi\big) \quad \text{s.t. } \; \|\theta - \theta^{D}\| \le \epsilon.
$$
Feed-forward Network with DBN Classification Boxing.
Similarly, the model (FFN-DBNOPT) replaces the bound $\theta^{D}$ in FFN-DBN with $\theta^{DC}$, the optimal DBN parameters after the two-phase training.
Bilevel Model.
We also apply bilevel programming to the two-phase training method. This model searches for solutions that minimize the loss function of the classifier only where the solutions to the DBN objective are optimal; possible candidates for optimal solutions of the first-level objective are the optimal solutions of the second-level objective. This model (BL) reads
$$
\min_{\theta, \phi} \; \mathbb{E}\big[L\big(h_\ell(x; \theta), y; \phi\big)\big] \quad \text{s.t. } \; \theta \in \arg\min_{\bar\theta} \; -\mathbb{E}\big[\log P(x; \bar\theta)\big],
$$
and empirically, based on training samples,
$$
\min_{\theta, \phi} \; \sum_{i=1}^{N} L\big(h_\ell(x^{(i)}; \theta), y^{(i)}; \phi\big) \quad \text{s.t. } \; \theta \in \arg\min_{\bar\theta} \; \sum_{i=1}^{N} -\log P\big(x^{(i)}; \bar\theta\big).
$$
One of the solution approaches to bilevel programming is to apply the Karush–Kuhn–Tucker (KKT) conditions to the lower-level problem. After applying the KKT conditions to the lower level, we obtain
$$
\min_{\theta, \phi} \; \sum_{i=1}^{N} L\big(h_\ell(x^{(i)}; \theta), y^{(i)}; \phi\big) \quad \text{s.t. } \; \nabla_{\theta} \sum_{i=1}^{N} \log P\big(x^{(i)}; \theta\big) = 0.
$$
Furthermore, we transform this constrained problem into an unconstrained problem with a quadratic penalty function:
$$
\min_{\theta, \phi} \; \sum_{i=1}^{N} L\big(h_\ell(x^{(i)}; \theta), y^{(i)}; \phi\big) + \mu \, \Big\| \nabla_{\theta} \sum_{i=1}^{N} \log P\big(x^{(i)}; \theta\big) \Big\|^2,
$$
where $\mu$ is a hyperparameter. The gradient of the objective function is derived in the appendix.
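The quadratic-penalty idea can be illustrated on a scalar toy bilevel problem: minimize an upper-level loss subject to stationarity of the lower level. The quadratic objectives, penalty weight, and step sizes below are assumptions made purely for illustration.

```python
import numpy as np

# Toy bilevel problem (assumed for the example):
#   upper level: f(theta, phi) = (theta - phi)**2 + phi**2
#   lower level: g(theta) = (theta - 2)**2, stationary iff g'(theta) = 0
# Penalized objective: f(theta, phi) + mu * g'(theta)**2.
def grad(theta, phi, mu):
    g_grad = 2.0 * (theta - 2.0)              # lower-level gradient g'(theta)
    d_theta = 2.0 * (theta - phi) + mu * 2.0 * g_grad * 2.0
    d_phi = -2.0 * (theta - phi) + 2.0 * phi
    return d_theta, d_phi

theta, phi, mu = 0.0, 0.0, 10.0
for _ in range(2000):
    dt, dp = grad(theta, phi, mu)
    theta -= 0.01 * dt
    phi -= 0.01 * dp
# With a large penalty weight, theta is driven toward the lower-level
# optimum (theta = 2) while phi minimizes the upper-level loss given theta.
```

In the actual models, `g'` is the gradient of the DBN negative log-likelihood, so the penalty term introduces the Hessian of the underlying RBM into the gradient, as derived in the appendix.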
5 Computational Study
To evaluate the proposed models, classification tasks on three datasets were conducted: the MNIST hand-written images (http://yann.lecun.com/exdb/mnist/), the KDD'99 network intrusion dataset (NI, http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html), and the isolated letter speech recognition dataset (ISOLET, https://archive.ics.uci.edu/ml/datasets/ISOLET). The experimental results of the proposed models on these datasets were compared to those of the two-phase method.
In the FFN, sigmoid functions in the hidden layers and the softmax function in the output layer were chosen, with negative log-likelihood as the loss function of the classifier. The size and the number of hidden layers were selected differently depending on the dataset (optimally for each case). We first implemented the two-phase method to obtain the best configuration of hidden units and layers, and then applied this configuration to the proposed models.
Implementations were done in Theano. The mini-batch gradient descent algorithm was used to solve the optimization problem of each model. To calculate the gradients of each model's objective function, Theano's built-in function `theano.tensor.grad` was used. We denote the two-phase approach by DBN-FFN.
5.1 MNIST
The task on MNIST is to classify the ten digits from 0 to 9 given 28×28 pixel hand-written images. The dataset is divided into 60,000 samples for training and validation, and 10,000 samples for testing. The hyperparameters are set as follows: 1) the number of hidden units at each layer is 500 or 1000, 2) the training epochs for pre-training and fine-tuning range from 100 to 900, 3) the learning rates for pre-training are 0.01 or 0.05, and those for fine-tuning range from 0.1 to 2, 4) the batch size is 50, and 5) $\gamma$ of the DBN+loss model and $\mu$ of the BL model are diminishing during iterations.
The best DBN-FFN had four hidden layers, of size 784-1000-1000-1000-1000-10, and subsequently we compared it to the proposed models with the same network size. In Table 1, the best test error rate, 1.09%, was achieved by FFN-DBNOPT. Furthermore, the models with the DBN classification constraints, EL-DBNOPT and FFN-DBNOPT, perform better than the two-phase method. This shows that the DBN classification boxing constraints regularize the model parameters by keeping a good representation of the input.
Test error (%)

| Model | Shallow network | Deep network |
| --- | --- | --- |
| DBN-FFN | 1.17 | 1.14 |
| DBN+loss | 1.61 | 1.64 |
| EL-DBN | 1.35 | 1.30 |
| EL-DBNOPT | 1.17 | 1.13 |
| FFN-DBN | 1.17 | 1.29 |
| FFN-DBNOPT | 1.16 | 1.09 |
| BL | 1.61 | 1.72 |
5.2 Network Intrusion
The classification task on NI is to distinguish between normal and bad connections given the related network connection information. The preprocessed dataset consists of 41 input features and 5 classes, with 4,898,431 examples for training and 311,029 examples for testing. The experiments were conducted on 20%, 30%, and 40% subsets of the whole training set, obtained by stratified random sampling. The hyperparameters are set as follows: 1) the number of hidden units at each layer is 13, 15, or 20, 2) the training epochs for pre-training and fine-tuning range from 100 to 900, 3) the learning rates for pre-training are 0.01 or 0.05, and those for fine-tuning range from 0.1 to 2, 4) the batch size is 1000, and 5) $\gamma$ of the DBN+loss model and $\mu$ of the BL model are diminishing during iterations.
On NI the best structure of DBN-FFN was 41-15-15-5 for the 20% and 30% training sets, and 41-15-15-15-5 for the 40% training set. Table 2 shows the experimental results of the proposed models with the same network as the best DBN-FFN. BL produces the best test error, 7.08%. This shows that a model trained concurrently for the unsupervised and supervised objectives can achieve better accuracy than the two-phase method. Furthermore, both EL-DBNOPT and FFN-DBNOPT yield error rates similar to, or lower than, those of DBN-FFN on all three subsets.
Test error rate (%)

| Model | 20% dataset | 30% dataset | 40% dataset |
| --- | --- | --- | --- |
| DBN-FFN | 7.41 | 7.19 | 7.31 |
| DBN+loss | 7.29 | 7.30 | 7.35 |
| EL-DBN | 8.35 | 7.69 | 7.69 |
| EL-DBNOPT | 7.34 | 7.18 | 7.31 |
| FFN-DBN | 7.53 | 7.45 | 7.56 |
| FFN-DBNOPT | 7.32 | 7.14 | 7.31 |
| BL | 7.19 | 7.21 | 7.08 |
5.3 ISOLET
The classification task on ISOLET is to predict which letter name was spoken, among the 26 letters of the English alphabet, given 617 input features of the related signal-processing information. The dataset consists of 5,600 examples for training, 638 for validation, and 1,559 for testing. The hyperparameters are set as follows: 1) the number of hidden units at each layer is 400, 500, or 800, 2) the training epochs for pre-training and fine-tuning range from 100 to 900, 3) the learning rates for pre-training range from 0.001 to 0.05, and those for fine-tuning from 0.05 to 1, 4) the batch size is 20, and 5) $\gamma$ of the DBN+loss model and $\mu$ of the BL model are diminishing during iterations.
In this experiment the deep network performed worse than the shallow network; one possible reason is the small number of training samples. A single hidden layer with 500 units was the best configuration for DBN-FFN. Table 3 shows the experimental results of the proposed models with the same hidden-layer setting. DBN-FFN and the DBN classification boxing models achieve the same accuracy.
| Model | Test error rate (%) |
| --- | --- |
First, DBN+loss showed worse accuracy than two-phase training in all of the experiments; aggregating the unsupervised and supervised objectives without specific treatment is not effective. Second, the models with DBN optimal boxing, EL-DBN and FFN-DBN, performed worse than DBN-FFN; regularizing the model parameters with unsupervised learning alone is not very effective for solving a supervised learning problem. Third, the models with DBN classification boxing, EL-DBNOPT and FFN-DBNOPT, performed no worse than DBN-FFN in all of the experiments. This shows that classification accuracy can be improved by regularizing the model parameters with values trained for both unsupervised and supervised purposes. One drawback of this approach is that it requires one more training phase in addition to the two-phase approach. Last, BL showed that one-step training can achieve better performance than two-phase training. Even though it worked in only one instance, improvements to the current BL can be made, such as applying different solution-search algorithms, supervised learning regularization techniques, or different initialization strategies.
- [Bengio and Lamblin2007] Yoshua Bengio and Pascal Lamblin. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems (NIPS) 19, volume 20, pages 153–160. MIT Press, 2007.
- [Bengio2009] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
- [Cho et al.2011] KyungHyun Cho, Alexander Ilin, and Tapani Raiko. Improved learning algorithms for restricted Boltzmann machines. In Artificial Neural Networks and Machine Learning (ICANN), volume 6791. Springer, Berlin, Heidelberg, 2011.
- [Dahl et al.2012] George E Dahl, Ryan P Adams, and Hugo Larochelle. Training restricted Boltzmann machines on word observations. In International Conference on Machine Learning (ICML) 29, volume 29, pages 679–686, Edinburgh, Scotland, UK, 2012.
- [Elfwing et al.2015] S. Elfwing, E. Uchibe, and K. Doya. Expected energy-based restricted Boltzmann machine for classification. Neural Networks, 64:29–38, 2015.
- [Fischer and Igel2012] Asja Fischer and Christian Igel. An introduction to restricted Boltzmann machines. Lecture Notes in Computer Science, 7441:14–36, 2012.
- [Gehler et al.2006] Peter V. Gehler, Alex D. Holub, and Max Welling. The rate adapting Poisson (RAP) model for information retrieval and object recognition. In International Conference on Machine Learning (ICML) 23, volume 23, pages 337–344, Pittsburgh, PA, USA, 2006.
- [Hinton and Salakhutdinov2006] G E Hinton and R R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
- [Hinton et al.2006] Geoffrey E Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–54, 2006.
- [Hinton2002] Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural computation, 14(8):1771–1800, 2002.
- [Larochelle and Bengio2008] Hugo Larochelle and Yoshua Bengio. Classification using discriminative restricted Boltzmann machines. In International Conference on Machine Learning (ICML) 25, pages 536–543, Helsinki, Finland, 2008.
- [Larochelle et al.2012] Hugo Larochelle, Michael Mandel, Razvan Pascanu, and Yoshua Bengio. Learning algorithms for the classification restricted Boltzmann machine. Journal of Machine Learning Research, 13:643–669, 2012.
- [Lu et al.2017] Na Lu, Tengfei Li, Xiaodong Ren, and Hongyu Miao. A deep learning scheme for motor imagery classification based on restricted Boltzmann machines. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 25:566–576, 2017.
- [Mccallum et al.2006] Andrew McCallum, Chris Pal, Greg Druck, and Xuerui Wang. Multi-conditional learning: generative/discriminative training for clustering and classification. In National Conference on Artificial Intelligence (AAAI), volume 21, pages 433–439, 2006.
- [Norouzi et al.2009] Mohammad Norouzi, Mani Ranjbar, and Greg Mori. Stacks of convolutional restricted Boltzmann machines for shift-invariant feature learning. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR), pages 2735–2742, Miami, FL, USA, 2009.
- [Salama et al.2011] Ma Salama, Hf Eid, and Ra Ramadan. Hybrid intelligent intrusion detection scheme. Advances in Intelligent and Soft Computing, pages 293–303, 2011.
- [Sarikaya et al.2014] Ruhi Sarikaya, Geoffrey E. Hinton, and Anoop Deoras. Application of deep belief networks for natural language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(4):778–784, 2014.
- [Schmah et al.2009] Tanya Schmah, Geoffery E. Hinton, Richard S. Zemel, Steven L. Small, and Stephen Strother. Generative versus discriminative training of RBMs for classification of fMRI images. In Advances in Neural Information Processing Systems (NIPS) 21, volume 21, pages 1409–1416. Curran Associates, Inc., 2009.
- [Xing et al.2005] Eric P Xing, Rong Yan, and Alexander G Hauptmann. Mining associated text and images with dual-wing Harmoniums. In Conference on Uncertainty in Artificial Intelligence (UAI), volume 21, pages 633–641, Edinburgh, Scotland, 2005.
- [Yu et al.2017] Long Yu, Xinyu Shi, and Tian Shengwei. Classification of Cytochrome P450 1A2 of inhibitors and noninhibitors based on deep belief network. International Journal of Computational Intelligence and Applications, 16:1750002, 2017.
Approximation of DBN Probability in the Proposed Models
A DBN defines the joint distribution of the visible unit $v$ and the hidden layers $h^1, \ldots, h^\ell$ as
$$
P(v, h^1, \ldots, h^\ell) = \left( \prod_{k=0}^{\ell-2} P(h^k \mid h^{k+1}) \right) P(h^{\ell-1}, h^\ell), \qquad h^0 = v.
$$
DBN Fitting Plus Loss Model.
From Eq. (2), in the second term of the objective function is approximated as
Expected Loss Models.
in the objective function is approximated as
From Eq. (5), in the objective function is approximated for as
where . The gradient of this approximated quantity is then the Hessian matrix of the underlying RBM.
Derivation of the Gradient of the Bilevel Model
We write the approximated at the layer as
where and denote dimensions of and and denotes the and component of the . The gradient of the approximated at the layer is
This shows that the gradient of the approximated quantity in (5) is the Hessian matrix times the gradient of the underlying RBM. The stochastic gradient of the log-likelihood of an RBM with binary input $v$ and hidden unit $h$ with respect to $w_{ij}$ is
$$
\frac{\partial \log p(v)}{\partial w_{ij}} = \sigma\big(c_j + v^\top W_{\cdot j}\big)\, v_i - \sum_{\bar v} p(\bar v)\, \sigma\big(c_j + \bar v^\top W_{\cdot j}\big)\, \bar v_i,
$$
where $\sigma$ denotes the sigmoid function [Fischer and Igel2012]. We derive the Hessian matrix with respect to $w_{ij}$ as
where $\sigma$ is the sigmoid function and $c$ is the hidden bias. Based on what we derived above, we can calculate the gradient of the approximated DBN probability.