1 Introduction
Restricted Boltzmann machine (RBM), an energy-based model that defines an input distribution, is widely used to extract latent features before classification. Such an approach combines unsupervised learning for feature modeling with supervised learning for classification. Two training steps are needed. The first step, called pre-training, models the features used for classification; this can be done by training an RBM that captures the distribution of the input. The second step, called fine-tuning, trains a separate classifier based on the features from the first step [Larochelle et al., 2012]. This two-phase training approach is also used for classification with deep networks. Deep belief networks (DBN) are built from stacked RBMs and trained in a layer-wise manner [Hinton and Salakhutdinov, 2006]. Two-phase training based on a deep network consists of a DBN and a classifier on top of it. The two-phase training strategy has three possible problems: 1) it requires two training processes, one for the RBMs and one for the classifier; 2) it is not guaranteed that the features modeled in the first step are useful for classification, since they are obtained independently of the classification task; and 3) deciding which classifier is best for each problem takes additional effort. Therefore, there is a need for a method that can conduct feature modeling and classification concurrently [Larochelle et al., 2012].
To resolve these problems, recent papers suggest transforming the RBM into a model that can handle both unsupervised and supervised learning. Since an RBM defines the joint and conditional probabilities, the proposed models combine a generative and a discriminative RBM; the resulting hybrid discriminative RBM is trained for both objectives concurrently by summing the two contributions [Larochelle and Bengio, 2008; Larochelle et al., 2012]. In a similar way, a self-contained RBM for classification was developed by applying to the RBM a free-energy-based approximation that was originally used for a supervised learning method, reinforcement learning [Elfwing et al., 2015]. However, these approaches are limited to transforming the RBM, which is a shallow network.
In this study, we develop alternative models to solve a classification problem based on DBN. Viewing two-phase training as two separate optimization problems, we apply optimization modeling techniques in developing our models. First, we design new objective functions: an expected loss function built from the conditional distribution modeled by the DBN and the loss function of the classifier. Second, we introduce constraints that bound the DBN weights in the feedforward phase; the constraints preserve a good representation of the input and regularize the weights during updates. Third, we apply bilevel programming to the two-phase training method: the bilevel model has the loss function of the classifier as its objective but constrains the DBN parameters to be optimal for phase 1, so that it searches for solutions to the classification objective only where the DBN objective is optimal.
Our main contributions are several classification models that combine a DBN and a loss function in a coherent way. In the computational study we verify that the suggested models perform better than the two-phase method.
2 Literature Review
The two-phase training strategy has been applied to many classification tasks on different types of data. Two-phase training with an RBM and a support vector machine (SVM) has been explored for classification of images, documents, and network intrusion data [Xing et al., 2005; Norouzi et al., 2009; Salama et al., 2011; Dahl et al., 2012]. Logistic regression replacing the SVM has also been explored [Mccallum et al., 2006; Cho et al., 2011]. Gehler et al. [2006] used the 1-nearest-neighbor classifier with an RBM to solve a document classification task. Hinton and Salakhutdinov [2006] suggested DBN, consisting of stacked RBMs trained in a layer-wise manner. The two-phase method using a DBN and a deep neural network has been studied for various classification problems such as image and text recognition [Hinton and Salakhutdinov, 2006; Bengio and Lamblin, 2007; Sarikaya et al., 2014]. Recently, this approach has been applied to motor imagery classification in the area of brain–computer interfaces [Lu et al., 2017] and, in biomedical research, to classification of Cytochrome P450 1A2 inhibitors and non-inhibitors [Yu et al., 2017]. All these papers rely on two distinct phases, while our models take a holistic view of both aspects.
Many studies have been conducted to remedy the problems of two-phase training. Most of this research has focused on transforming the RBM so that the modified model can achieve the generative and discriminative objectives at the same time. Schmah et al. [2009] proposed a discriminative RBM method in which classification is done in the manner of a Bayes classifier; however, this method cannot capture the relationships between the classes, since the RBM of each class is trained separately. Larochelle et al. [2008; 2012] proposed a self-contained discriminative RBM framework whose objective function consists of the generative learning objective, $\log P(x, y)$, and the discriminative learning objective, $\log P(y \mid x)$, both derived from the RBM. Similarly, a self-contained discriminative RBM method for classification was proposed by Elfwing et al. [2015]; a free-energy-based approximation, initially suggested for reinforcement learning, is applied in its development. This prior work relies on the RBM conditional probability, while we handle general loss functions; our models also hinge on completely different principles.
3 Background
Restricted Boltzmann Machines.
RBM is an energy-based probabilistic model, a restricted version of the Boltzmann machine (BM), which is a log-linear Markov random field. It has visible nodes corresponding to the input and hidden nodes corresponding to the latent features. The joint distribution of the visible nodes $v$ and hidden nodes $h$ is defined as

$$P(v, h) = \frac{1}{Z} e^{-E(v,h)}, \qquad E(v, h) = -b^\top v - c^\top h - v^\top W h,$$

where $W$, $b$, and $c$ are the model parameters and $Z = \sum_{v,h} e^{-E(v,h)}$ is the partition function. Since the units within a layer are conditionally independent in an RBM, the conditional distributions take the factorized form

$$P(h \mid v) = \prod_j P(h_j \mid v), \qquad P(v \mid h) = \prod_i P(v_i \mid h).$$
For binary units, where $v \in \{0,1\}^n$ and $h \in \{0,1\}^m$, we can write $P(h_j = 1 \mid v) = \sigma(c_j + v^\top W_{\cdot j})$ and $P(v_i = 1 \mid h) = \sigma(b_i + W_{i \cdot} h)$, where $\sigma(s) = 1/(1 + e^{-s})$ is the sigmoid function. In this sense an RBM with binary units is an unsupervised neural network with a sigmoid activation function. The model calibration of an RBM can be done by minimizing the negative log-likelihood through gradient descent. The RBM benefits from the conditional probabilities above, which make it easy to obtain model samples through Gibbs sampling. Contrastive divergence (CD) makes Gibbs sampling even simpler: 1) start a Markov chain from the training samples, and 2) stop after $k$ steps to obtain the samples. It has been shown that CD with only a few steps performs effectively [Bengio, 2009; Hinton, 2002].
Deep Belief Networks.
DBN is a generative graphical model consisting of stacked RBMs. Thanks to its deep structure, a DBN can capture a hierarchical representation of the input data. Hinton et al. [2006] introduced DBN together with a training algorithm that greedily trains one layer at a time. Given the visible unit $x$ and hidden layers $h^1, \dots, h^\ell$, the joint distribution is defined as [Bengio, 2009; Hinton et al., 2006]

$$P(x, h^1, \dots, h^\ell) = P(x \mid h^1) \left( \prod_{k=1}^{\ell-2} P(h^k \mid h^{k+1}) \right) P(h^{\ell-1}, h^\ell).$$

Since each layer of a DBN is constructed as an RBM, training each layer of a DBN is the same as training an RBM.
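As an illustration, the CD-k update described above (with k = 1) can be sketched in a few lines of NumPy. This is a minimal, self-contained sketch rather than the paper's Theano implementation; the toy data, layer sizes, and learning rate are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(v0, W, b, c, lr=0.1):
    """One CD-1 update for a binary RBM.

    v0   : (batch, n_visible) binary inputs
    W    : (n_visible, n_hidden) weights
    b, c : visible and hidden biases
    """
    # Positive phase: h ~ P(h | v0)
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step v1 ~ P(v | h0), then P(h | v1)
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    # Approximate gradient of the negative log-likelihood
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / n
    b += lr * (v0 - v1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c

# Toy data: 4-dimensional binary patterns
X = rng.integers(0, 2, size=(32, 4)).astype(float)
W = 0.01 * rng.standard_normal((4, 3))
b = np.zeros(4)
c = np.zeros(3)
for _ in range(100):
    W, b, c = cd1_step(X, W, b, c)
```

Running the chain for only one step per update is exactly the CD-1 shortcut: the negative-phase statistics come from a single Gibbs transition started at the data rather than from an equilibrium sample.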
Classification is conducted by initializing a network through DBN training [Hinton et al., 2006; Bengio and Lamblin, 2007]. Two-phase training is done sequentially by: 1) pre-training, unsupervised learning of the stacked RBMs in a layer-wise manner, and 2) fine-tuning, supervised learning with a classifier. Each phase requires solving an optimization problem. Given the training dataset $\{(x_i, y_i)\}_{i=1}^{N}$, with input $x_i$ and label $y_i$, the pre-training phase solves the following optimization problem at each layer $k$:

$$\min_{\theta^k} \; -\sum_{i=1}^{N} \log P(v_i^k; \theta^k),$$

where $\theta^k = (W^k, b^k, c^k)$ is the RBM model parameter collecting the weights, visible bias, and hidden bias in the energy function, and $v_i^k$ is the visible input to layer $k$ corresponding to input $x_i$. Note that in the layer-wise updating scheme we solve $\ell$ such problems, from the bottom to the top hidden layer. For the fine-tuning phase we solve the following optimization problem
$$\min_{\Theta} \; \sum_{i=1}^{N} L(f(h^\ell_i; \Theta), y_i) \qquad (1)$$

where $L$ is a loss function, $h^\ell_i$ denotes the final hidden features at layer $\ell$, $f$ is the classifier, and $\Theta$ denotes the parameters of the classifier. Here for simplicity we write $h^\ell_i$ for $h^\ell(x_i)$. When combining a DBN and a feedforward neural network (FFN) with sigmoid activations, all the weight and hidden-bias parameters between the input and the hidden layers are shared by both training phases. Therefore, in this case we initialize the FFN by training the DBN.
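The two phases can be sketched end to end on toy data. The following is a hypothetical NumPy sketch, not the paper's setup: the data, layer sizes, epochs, and learning rates are illustrative assumptions, and for brevity the fine-tuning phase trains only the softmax classifier on the top-layer features rather than back-propagating through the whole shared network.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def pretrain_rbm(V, n_hidden, epochs=50, lr=0.1):
    """Phase 1: fit one binary RBM layer with CD-1; return weights and hidden bias."""
    n_vis = V.shape[1]
    W = 0.01 * rng.standard_normal((n_vis, n_hidden))
    b, c = np.zeros(n_vis), np.zeros(n_hidden)
    for _ in range(epochs):
        ph0 = sigmoid(V @ W + c)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(h0 @ W.T + b)
        ph1 = sigmoid(pv1 @ W + c)
        W += lr * (V.T @ ph0 - pv1.T @ ph1) / len(V)
        b += lr * (V - pv1).mean(axis=0)
        c += lr * (ph0 - ph1).mean(axis=0)
    return W, c

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy binary data with two classes
X = rng.integers(0, 2, size=(64, 6)).astype(float)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
Y = np.eye(2)[y]

# Phase 1: greedy layer-wise pretraining of two stacked RBM layers
layers, H = [], X
for size in (8, 4):
    W, c = pretrain_rbm(H, size)
    layers.append((W, c))
    H = sigmoid(H @ W + c)        # deterministic activations feed the next layer

# Phase 2: fine-tuning, here only the softmax classifier on the top features
U = np.zeros((H.shape[1], 2))
for _ in range(300):
    P = softmax(H @ U)
    U -= 0.5 * H.T @ (P - Y) / len(H)   # gradient step on the negative log-likelihood

acc = (softmax(H @ U).argmax(axis=1) == y).mean()
```

The key structural point is the hand-off: the parameters fitted in phase 1 define the features `H` on which the phase-2 problem (1) is solved, and nothing in phase 1 looks at the labels.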
4 Proposed Models
We model an expected loss function for classification. Since classification in the two-phase method is conducted on the hidden space, the proposed models use the probability distribution of the hidden variables obtained by the DBN. The two-phase method provides information about the model parameters after each phase is trained; we suggest constraints based on this information to keep the model parameters from deviating far from a good representation of the input. The optimal solution set of the unsupervised objective of the two-phase method contains good candidate solutions for the second phase; the bilevel model restricts the search to this set when optimizing the phase-2 objective, so that it conducts the two-phase training in one shot.
DBN Fitting Plus Loss Model.
We start with a naive model that sums the pre-training and fine-tuning objectives. This model carries out the two-phase training strategy simultaneously; however, we need to add one more hyperparameter $\alpha$ to balance the impact of the two objectives. The model (DBN+loss) is defined as

$$\min_{\theta, \Theta} \; \mathbb{E}\left[-\log P(x; \theta)\right] + \alpha \, \mathbb{E}\left[L(f(x; \theta, \Theta), y)\right]$$

and empirically, based on training samples $\{(x_i, y_i)\}_{i=1}^{N}$,

$$\min_{\theta, \Theta} \; -\sum_{i=1}^{N} \log P(x_i; \theta) + \alpha \sum_{i=1}^{N} L(f(x_i; \theta, \Theta), y_i) \qquad (2)$$

where $\theta$ and $\Theta$ are the underlying parameters. Note that $\theta = (\theta^1, \dots, \theta^\ell)$ collects the DBN parameters and $\Theta$ the classifier parameters from (1). This model has already been proposed for the case in which the classification loss function is based on the RBM conditional distribution [Larochelle and Bengio, 2008; Larochelle et al., 2012].
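A single parameter update on the summed objective can be sketched as follows. This is a hedged NumPy illustration, not the paper's implementation: the intractable DBN likelihood gradient is approximated by a one-layer CD-1 gradient, the visible bias is omitted for brevity, and the data, sizes, learning rate, and balance weight are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy binary data with binary labels
X = rng.integers(0, 2, size=(32, 5)).astype(float)
y = (X.sum(axis=1) > 2).astype(float)

W = 0.01 * rng.standard_normal((5, 3))  # weights shared by DBN and classifier path
c = np.zeros(3)                          # shared hidden bias
u = np.zeros(3)                          # classifier weights on the hidden units
alpha, lr = 0.1, 0.5                     # alpha balances the two objectives

for _ in range(200):
    H = sigmoid(X @ W + c)
    # Supervised contribution: logistic loss on the hidden representation
    g = sigmoid(H @ u) - y
    dW_sup = X.T @ ((H * (1 - H)) * np.outer(g, u)) / len(X)
    # Unsupervised contribution: CD-1 approximation of the likelihood gradient
    h0 = (rng.random(H.shape) < H).astype(float)
    v1 = sigmoid(h0 @ W.T)               # visible bias omitted in this sketch
    h1 = sigmoid(v1 @ W + c)
    dW_unsup = (X.T @ H - v1.T @ h1) / len(X)
    # One step on the summed objective: ascend the likelihood, descend alpha * loss
    W += lr * (dW_unsup - alpha * dW_sup)
    u -= lr * alpha * H.T @ g / len(X)
```

The point of the sketch is the shared parameter: both gradient contributions act on the same `W`, which is exactly what distinguishes the summed objective from running the two phases separately.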
Expected Loss Model with DBN Boxing.
We first design an expected loss model based on the conditional distribution $P(h \mid x)$ obtained by the DBN. This model conducts classification on the hidden space. Since it minimizes the expected loss, it should be more robust and thus yield better accuracy on unobserved data. The mathematical model minimizing the expected loss function is

$$\min_{\Theta} \; \mathbb{E}\left[ \sum_{h} P(h \mid x; \theta) \, L(f(h; \Theta), y) \right]$$

and empirically, based on training samples $\{(x_i, y_i)\}_{i=1}^{N}$,

$$\min_{\Theta} \; \sum_{i=1}^{N} \sum_{h} P(h \mid x_i; \theta) \, L(f(h; \Theta), y_i).$$

With the notation $P(h \mid x_i; \theta)$ we explicitly show the dependency of the conditional distribution on $\theta$. We modify the expected loss model by introducing a constraint that sets bounds on the DBN-related parameters with respect to their optimal values. This model has two benefits. First, it keeps a good representation of the input by constraining the parameters fitted in the unsupervised manner. Second, the constraint regularizes the model parameters by preventing them from blowing up while being updated. Given the training samples, the mathematical form of the model (ELDBN) reads

$$\min_{\theta, \Theta} \; \sum_{i=1}^{N} \sum_{h} P(h \mid x_i; \theta) \, L(f(h; \Theta), y_i) \quad \text{s.t.} \quad \theta^* - \delta \le \theta \le \theta^* + \delta,$$

where $\theta^*$ denotes the optimal DBN parameters and $\delta$ is a hyperparameter. This model needs a pre-training phase to obtain the fitted DBN parameters $\theta^*$.
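The two ingredients of ELDBN, a loss taken over the DBN's conditional distribution of the hidden units and a box constraint around the pretrained parameters, can be sketched as follows. This is a hypothetical one-layer NumPy illustration: the gradient step uses the mean hidden activations, Monte Carlo sampling is used only to estimate the expected loss at the end, and the toy data, box half-width, and learning rate are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy data and a stand-in for the pretrained DBN weights
X = rng.integers(0, 2, size=(32, 5)).astype(float)
y = (X[:, 0] > 0).astype(float)
W_star = 0.01 * rng.standard_normal((5, 3))   # pretrained optimum
delta = 0.05                                   # box half-width (hyperparameter)

W = W_star.copy()
u = np.zeros(3)
for _ in range(200):
    ph = sigmoid(X @ W)                        # P(h_j = 1 | x)
    g = sigmoid(ph @ u) - y                    # classifier on mean hidden units
    dW = X.T @ ((ph * (1 - ph)) * np.outer(g, u)) / len(X)
    u -= 0.1 * ph.T @ g / len(X)
    W -= 0.1 * dW
    # DBN boxing: project W back into [W* - delta, W* + delta]
    W = np.clip(W, W_star - delta, W_star + delta)

# Monte Carlo estimate of the expected loss under P(h | x)
ph = sigmoid(X @ W)
expected_loss = 0.0
for _ in range(20):
    h = (rng.random(ph.shape) < ph).astype(float)
    q = np.clip(sigmoid(h @ u), 1e-7, 1 - 1e-7)
    expected_loss += -np.mean(y * np.log(q) + (1 - y) * np.log(1 - q)) / 20
```

The `np.clip` projection is the entire boxing mechanism: after every update, each component of `W` is guaranteed to stay within `delta` of its pretrained value.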
Expected Loss Model with DBN Classification Boxing.
Similar to the DBN boxing model, this expected loss model has a constraint that bounds the DBN parameters by their optimal values at the end of both phases. This model regularizes the parameters with values fitted in both the unsupervised and the supervised manner; therefore, it can achieve better accuracy, even though one more training run is needed in addition to the two phases. Given the training samples, the model (ELDBNOPT) reads

$$\min_{\theta, \Theta} \; \sum_{i=1}^{N} \sum_{h} P(h \mid x_i; \theta) \, L(f(h; \Theta), y_i) \quad \text{s.t.} \quad \theta^{**} - \delta \le \theta \le \theta^{**} + \delta, \qquad (3)$$

where $\theta^{**}$ denotes the optimal DBN parameters after two-phase training and $\delta$ is a hyperparameter.
Feedforward Network with DBN Boxing.
We also propose a model based on boxing constraints in which the FFN is constrained around the DBN solution. The mathematical model (FFNDBN), based on training samples, is

$$\min_{\theta, \Theta} \; \sum_{i=1}^{N} L(f(x_i; \theta, \Theta), y_i) \quad \text{s.t.} \quad \theta^* - \delta \le \theta \le \theta^* + \delta. \qquad (4)$$
Feedforward Network with DBN Classification Boxing.
Analogously, the model (FFNDBNOPT) replaces the pre-trained DBN parameters in the boxing constraint of FFNDBN with the optimal DBN parameters obtained after two-phase training.
Bilevel Model.
We also apply bilevel programming to the two-phase training method. This model searches for solutions that minimize the loss function of the classifier only where the DBN objective solutions are optimal: the candidate solutions of the first-level objective function are the optimal solutions of the second-level objective function. The model (BL) reads

$$\min_{\theta, \Theta} \; \mathbb{E}\left[L(f(x; \theta, \Theta), y)\right] \quad \text{s.t.} \quad \theta \in \arg\min_{\bar\theta} \; \mathbb{E}\left[-\log P(x; \bar\theta)\right]$$

and empirically, based on training samples,

$$\min_{\theta, \Theta} \; \sum_{i=1}^{N} L(f(x_i; \theta, \Theta), y_i) \quad \text{s.t.} \quad \theta \in \arg\min_{\bar\theta} \; -\sum_{i=1}^{N} \log P(x_i; \bar\theta).$$

One of the solution approaches to bilevel programming is to apply the Karush–Kuhn–Tucker (KKT) conditions to the lower-level problem. After applying the KKT conditions to the lower level, we obtain

$$\min_{\theta, \Theta} \; \sum_{i=1}^{N} L(f(x_i; \theta, \Theta), y_i) \quad \text{s.t.} \quad \nabla_\theta \left(-\sum_{i=1}^{N} \log P(x_i; \theta)\right) = 0.$$

Furthermore, we transform this constrained problem into an unconstrained problem with a quadratic penalty function:

$$\min_{\theta, \Theta} \; \sum_{i=1}^{N} L(f(x_i; \theta, \Theta), y_i) + \mu \left\| \nabla_\theta \left(-\sum_{i=1}^{N} \log P(x_i; \theta)\right) \right\|^2 \qquad (5)$$

where $\mu$ is a hyperparameter. The gradient of the objective function is derived in the appendix.
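The quadratic-penalty reformulation can be checked end to end on a toy problem with quadratic levels. This sketch is purely illustrative: the quadratic "classifier loss" and "DBN objective" below stand in for the real, intractable ones, and the penalty weight and step size are arbitrary assumptions.

```python
import numpy as np

a = np.array([1.0, -2.0])   # optimum of the lower-level (stand-in DBN) objective
t = np.array([0.0, 0.0])    # target of the upper-level (stand-in classifier) loss
mu = 10.0                   # penalty weight (hyperparameter)

def penalty_objective(theta):
    upper = 0.5 * np.sum((theta - t) ** 2)   # classifier loss
    kkt_residual = theta - a                 # gradient of the lower-level objective
    return upper + mu * np.sum(kkt_residual ** 2)

# Plain gradient descent on the penalized objective
theta = np.zeros(2)
for _ in range(2000):
    grad = (theta - t) + 2 * mu * (theta - a)
    theta -= 0.01 * grad

# For quadratics the penalized minimizer has a closed form to compare against
theta_closed = (t + 2 * mu * a) / (1 + 2 * mu)
```

As `mu` grows, the minimizer is pushed toward the lower-level optimum `a`, mirroring how the penalty in (5) drives the DBN parameters toward stationarity of the unsupervised objective while the loss term pulls toward the classification objective.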
5 Computational Study
To evaluate the proposed models, classification tasks on three datasets were conducted: the MNIST handwritten images (http://yann.lecun.com/exdb/mnist/), the KDD'99 network intrusion dataset (NI, http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html), and the isolated letter speech recognition dataset (ISOLET, https://archive.ics.uci.edu/ml/datasets/ISOLET). The experimental results of the proposed models on these datasets were compared to those of the two-phase method.
In the FFN, sigmoid functions in the hidden layers and a softmax function in the output layer were chosen, with the negative log-likelihood as the loss function of the classifier. The size and the number of hidden layers were selected per dataset (optimally for each case). We first implemented the two-phase method to obtain the best configuration of hidden units and layers, and then applied this configuration to the proposed models.
Implementations were done in Theano. The mini-batch gradient descent algorithm was used to solve the optimization problems of each model. Theano's built-in function 'theano.tensor.grad' was used to calculate the gradients of each model's objective function. We denote the two-phase approach by DBNFFN.
5.1 MNIST
The task on MNIST is to classify the ten digits from 0 to 9, given 28x28-pixel handwritten images. The dataset is divided into 60,000 samples for training and validation and 10,000 samples for testing. The hyperparameters are set as follows: 1) hidden units at each layer are 500 or 1000; 2) training epochs for pre-training and fine-tuning range from 100 to 900; 3) learning rates for pre-training are 0.01 or 0.05, and those for fine-tuning range from 0.1 to 2; 4) the batch size is 50; and 5) the balancing hyperparameter of DBN+loss and the penalty hyperparameter of the BL model diminish during the iterations.
DBNFFN with four hidden layers of size 784-1000-1000-1000-1000-10 was the best, and we therefore compared the proposed models on the same network size. In Table 1, the best test error rate, 1.09%, was achieved by FFNDBNOPT. Furthermore, the models with the DBN classification constraints, ELDBNOPT and FFNDBNOPT, perform better than the two-phase method. This shows that DBN classification boxing constraints regularize the model parameters by keeping a good representation of the input.
Table 1: Test error (%)

Model        Shallow network   Deep network
DBNFFN            1.17             1.14
DBN+loss          1.61             1.64
ELDBN             1.35             1.30
ELDBNOPT          1.17             1.13
FFNDBN            1.17             1.29
FFNDBNOPT         1.16             1.09
BL                1.61             1.72
5.2 Network Intrusion
The classification task on NI is to distinguish normal from bad connections, given the related network-connection information. The preprocessed dataset consists of 41 input features and 5 classes, with 4,898,431 examples for training and 311,029 examples for testing. The experiments were conducted on 20%, 30%, and 40% subsets of the whole training set, obtained by stratified random sampling. The hyperparameters are set as follows: 1) hidden units at each layer are 13, 15, or 20; 2) training epochs for pre-training and fine-tuning range from 100 to 900; 3) learning rates for pre-training are 0.01 or 0.05, and those for fine-tuning range from 0.1 to 2; 4) the batch size is 1000; and 5) the balancing hyperparameter of DBN+loss and the penalty hyperparameter of BL diminish during the iterations.
On NI the best structure of DBNFFN was 41-15-15-5 for the 20% and 30% training sets, and 41-15-15-15-5 for the 40% training set. Table 2 shows the experimental results of the proposed models with the same network as the best DBNFFN. BL produced the best test error, 7.08%. This shows that a model trained concurrently for unsupervised and supervised purposes can achieve better accuracy than the two-phase method. Furthermore, both ELDBNOPT and FFNDBNOPT yield error rates similar to or lower than DBNFFN on all three subsets.
Table 2: Test error (%)

Model        20% dataset   30% dataset   40% dataset
DBNFFN           7.41          7.19          7.31
DBN+loss         7.29          7.30          7.35
ELDBN            8.35          7.69          7.69
ELDBNOPT         7.34          7.18          7.31
FFNDBN           7.53          7.45          7.56
FFNDBNOPT        7.32          7.14          7.31
BL               7.19          7.21          7.08
5.3 ISOLET
The classification task on ISOLET is to predict which letter name was spoken, among the 26 English letters, given 617 input features of the related signal-processing information. The dataset consists of 5,600 examples for training, 638 for validation, and 1,559 for testing. The hyperparameters are set as follows: 1) hidden units at each layer are 400, 500, or 800; 2) training epochs for pre-training and fine-tuning range from 100 to 900; 3) learning rates for pre-training range from 0.001 to 0.05, and those for fine-tuning from 0.05 to 1; 4) the batch size is 20; and 5) the balancing hyperparameter of DBN+loss and the penalty hyperparameter of the BL model diminish during the iterations.
In this experiment the deep network performed worse than the shallow network; one possible reason is the small number of training samples. One hidden layer with 500 units was the best for DBNFFN. Table 3 shows the experimental results of the proposed models with the same hidden-layer setting. DBNFFN and the DBN classification boxing models achieve the same accuracy.
Table 3: Test error (%)

Model        Test error (%)
DBNFFN            3.12
DBN+loss          4.09
ELDBN             3.38
ELDBNOPT          3.44
FFNDBN            3.12
FFNDBNOPT         3.12
BL                3.96
6 Conclusions
First, DBN+loss showed worse accuracy than two-phase training in all of the experiments: aggregating the unsupervised and supervised objectives without a specific treatment is not effective. Second, the models with DBN optimal boxing, ELDBN and FFNDBN, performed worse than DBNFFN; regularizing the model parameters with unsupervised learning alone is not very effective for a supervised learning problem. Third, the models with DBN classification boxing, ELDBNOPT and FFNDBNOPT, performed no worse than DBNFFN in all of the experiments. This shows that classification accuracy can be improved by regularizing the model parameters with values trained for both unsupervised and supervised purposes; one drawback of this approach is that one more training phase is needed on top of the two-phase approach. Last, BL showed that one-step training can achieve better performance than two-phase training. Even though it worked in only one instance, the current BL could be improved, for example by applying different solution-search algorithms, supervised-learning regularization techniques, or different initialization strategies.
References
[Bengio and Lamblin, 2007] Yoshua Bengio and Pascal Lamblin. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems (NIPS) 19, volume 20, pages 153-160. MIT Press, 2007.
[Bengio, 2009] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127, 2009.
[Cho et al., 2011] KyungHyun Cho, Alexander Ilin, and Tapani Raiko. Improved learning algorithms for restricted Boltzmann machines. In Artificial Neural Networks and Machine Learning (ICANN), volume 6791. Springer, Berlin, Heidelberg, 2011.
[Dahl et al., 2012] George E. Dahl, Ryan P. Adams, and Hugo Larochelle. Training restricted Boltzmann machines on word observations. In International Conference on Machine Learning (ICML) 29, volume 29, pages 679-686, Edinburgh, Scotland, UK, 2012.
[Elfwing et al., 2015] S. Elfwing, E. Uchibe, and K. Doya. Expected energy-based restricted Boltzmann machine for classification. Neural Networks, 64:29-38, 2015.
[Fischer and Igel, 2012] Asja Fischer and Christian Igel. An introduction to restricted Boltzmann machines. Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, 7441:14-36, 2012.
[Gehler et al., 2006] Peter V. Gehler, Alex D. Holub, and Max Welling. The rate adapting Poisson (RAP) model for information retrieval and object recognition. In International Conference on Machine Learning (ICML) 23, volume 23, pages 337-344, Pittsburgh, PA, USA, 2006.
[Hinton and Salakhutdinov, 2006] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.
[Hinton et al., 2006] Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.
[Hinton, 2002] Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, 2002.
[Larochelle and Bengio, 2008] Hugo Larochelle and Yoshua Bengio. Classification using discriminative restricted Boltzmann machines. In International Conference on Machine Learning (ICML) 25, pages 536-543, Helsinki, Finland, 2008.
[Larochelle et al., 2012] Hugo Larochelle, Michael Mandel, Razvan Pascanu, and Yoshua Bengio. Learning algorithms for the classification restricted Boltzmann machine. Journal of Machine Learning Research, 13:643-669, 2012.
[Lu et al., 2017] Na Lu, Tengfei Li, Xiaodong Ren, and Hongyu Miao. A deep learning scheme for motor imagery classification based on restricted Boltzmann machines. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 25:566-576, 2017.
[Mccallum et al., 2006] Andrew Mccallum, Chris Pal, Greg Druck, and Xuerui Wang. Multi-conditional learning: generative/discriminative training for clustering and classification. In National Conference on Artificial Intelligence (AAAI), volume 21, pages 433-439, 2006.
[Norouzi et al., 2009] Mohammad Norouzi, Mani Ranjbar, and Greg Mori. Stacks of convolutional restricted Boltzmann machines for shift-invariant feature learning. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR), pages 2735-2742, Miami, FL, USA, 2009.
[Salama et al., 2011] M. A. Salama, H. F. Eid, and R. A. Ramadan. Hybrid intelligent intrusion detection scheme. Advances in Intelligent and Soft Computing, pages 293-303, 2011.
[Sarikaya et al., 2014] Ruhi Sarikaya, Geoffrey E. Hinton, and Anoop Deoras. Application of deep belief networks for natural language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(4):778-784, 2014.
[Schmah et al., 2009] Tanya Schmah, Geoffrey E. Hinton, Richard S. Zemel, Steven L. Small, and Stephen Strother. Generative versus discriminative training of RBMs for classification of fMRI images. In Advances in Neural Information Processing Systems (NIPS) 21, volume 21, pages 1409-1416. Curran Associates, Inc., 2009.
[Xing et al., 2005] Eric P. Xing, Rong Yan, and Alexander G. Hauptmann. Mining associated text and images with dual-wing Harmoniums. In Conference on Uncertainty in Artificial Intelligence (UAI), volume 21, pages 633-641, Edinburgh, Scotland, 2005.
[Yu et al., 2017] Long Yu, Xinyu Shi, and Tian Shengwei. Classification of Cytochrome P450 1A2 inhibitors and non-inhibitors based on deep belief network. International Journal of Computational Intelligence and Applications, 16:1750002, 2017.
Appendix
Approximation of DBN Probability in the Proposed Models
A DBN defines the joint distribution of the visible unit $x$ and the hidden layers $h^1, \dots, h^\ell$ as

$$P(x, h^1, \dots, h^\ell) = P(x \mid h^1) \left( \prod_{k=1}^{\ell-2} P(h^k \mid h^{k+1}) \right) P(h^{\ell-1}, h^\ell),$$

with $h^0 = x$.
DBN Fitting Plus Loss Model.
From Eq. (2), the DBN likelihood term $\log P(x_i)$ in the objective function is approximated by the sum of the layer-wise RBM log-likelihoods, $\sum_{k=1}^{\ell} \log P(v_i^k; \theta^k)$.
Expected Loss Models.
The conditional distribution $P(h \mid x_i; \theta)$ in the objective function is approximated by the product of the layer-wise RBM conditionals, $\prod_{k=1}^{\ell} P(h^k \mid v_i^k; \theta^k)$.
Bilevel Model.
From Eq. (5), the penalty term in the objective function is approximated, for each layer $k$, as

$$\left\| \nabla_{\theta^k} \left( -\sum_{i=1}^{N} \log P(v_i^k; \theta^k) \right) \right\|^2 \qquad (6)$$

where $v_i^k$ is the visible input to layer $k$. The gradient of this approximated quantity then involves the Hessian matrix of the underlying RBM.
Derivation of the Gradient of the Bilevel Model
We write the approximated penalty term at layer $k$ as

$$\left\| \nabla_{\theta^k} F^k(\theta^k) \right\|^2 = \sum_{j} \left( \frac{\partial F^k}{\partial \theta^k_j} \right)^2,$$

where $F^k(\theta^k) = -\sum_{i=1}^{N} \log P(v_i^k; \theta^k)$, and $\theta^k_j$ denotes the $j$-th component of $\theta^k$, whose dimension is determined by the dimensions of $v^k$ and $h^k$. The gradient of the approximated term at layer $k$ is

$$\nabla_{\theta^k} \left\| \nabla_{\theta^k} F^k \right\|^2 = 2 \, H^k \, \nabla_{\theta^k} F^k$$

for the Hessian $H^k$ of $F^k$. This shows that the gradient of the approximated penalty in (5) is the Hessian matrix times the gradient of the underlying RBM objective. The stochastic gradient of the negative log-likelihood of an RBM with binary visible and hidden units with respect to a weight $w_{ij}$ is

$$\frac{\partial \left(-\log P(v)\right)}{\partial w_{ij}} = \langle v_i h_j \rangle_{\text{model}} - \langle v_i h_j \rangle_{\text{data}},$$

where $\langle \cdot \rangle$ denotes the expectation [Fischer and Igel, 2012]. Differentiating once more yields the Hessian with respect to the weights, which involves the sigmoid terms $\sigma(c_j + v^\top W_{\cdot j})$, where $\sigma$ is the sigmoid function and $c$ is the hidden bias. Based on the above, we can calculate the gradient of the approximated penalty term.
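The identity used above, that the gradient of $\tfrac{1}{2}\|\nabla F\|^2$ equals the Hessian times $\nabla F$, can be checked numerically on a small quadratic stand-in for the RBM objective (the matrix and evaluation point below are arbitrary choices):

```python
import numpy as np

# Quadratic stand-in F(x) = 0.5 * x^T A x, so grad F = A x and the Hessian is A
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
def grad_F(x):
    return A @ x

x = np.array([0.7, -1.3])

# Analytic gradient of the penalty g(x) = 0.5 * ||grad F(x)||^2: Hessian @ gradient
analytic = A @ grad_F(x)

# Central finite-difference check of the same gradient
eps = 1e-6
fd = np.zeros_like(x)
for i in range(len(x)):
    e = np.zeros_like(x)
    e[i] = eps
    g_plus = 0.5 * np.sum(grad_F(x + e) ** 2)
    g_minus = 0.5 * np.sum(grad_F(x - e) ** 2)
    fd[i] = (g_plus - g_minus) / (2 * eps)
```

The two computations agree to finite-difference accuracy, which is a convenient sanity check to run on any hand-derived Hessian-times-gradient expression before using it in training.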