1 Introduction
Deep learning algorithms have superlative capabilities for jointly performing feature mapping and classification. Thus, they outperform other machine learning approaches in applications ranging from image classification [8] and medical diagnostics [19] to credit fraud analytics [18]. However, it is challenging to adaptively (re)train deep neural networks to track changes in data distribution, especially in streaming data applications. Moreover, training multi-layer neural networks requires a priori specification of a suitable network architecture, making it difficult to inform the choice of architecture with the statistics of the data.
Online learning approaches for deep neural networks have the potential to address both these challenges. Several studies have put forth online learning algorithms for training single-layer perceptron networks [15, 12, 4]. Single-layer feedforward neural networks can be trained in an online fashion using Stochastic Gradient Descent [2] or Extended Kalman Filters [16, 17] for the parameter update. However, it remains challenging to extend these successes to the task of training deep neural networks in a fully online manner. For example, online algorithms for denoising autoencoders (DAE) [20] have been used for incremental feature learning with streaming data, but need a priori training with a DAE architecture as the building block to learn a base set of features first. Further, incremental learning has been applied within a boosting convolutional neural network framework for feature augmentation, loss function updates, and fine-tuned backpropagation with information accumulating in successive mini-batches [5]. Finally, it has also been shown that updating greedily pretrained layer-wise restricted Boltzmann machines (RBMs) in an online fashion automatically learns discriminative features for classification [3]. However, all the above approaches require pretraining and/or a fixed base network architecture as a precursor for incremental online updates with streaming data. Thus, methods that evolve a network architecture from scratch in an online manner, as the data streams in, would offer novel capabilities for online learning with deep neural networks.

In this paper, we present an unsupervised online learning algorithm named the Online Generative Discriminative Restricted Boltzmann Machine (OGDRBM) for generative training of an RBM and hence, for layer-wise training of a Deep Belief Network (DBN). At the beginning, there are no neurons in the hidden layer of the RBM. As samples stream in, the ability of the network to represent each sample is assessed using the reconstruction error for that sample. Based on this reconstruction error, the algorithm either ignores samples that are already well represented, adds a neuron to the hidden layer to represent the sample, or updates the weights of the existing neurons in the network. As the network updates are tailored to represent the distributions of the distinctive input features, the network is compact and inherently discriminative [13]. Finally, the features learned in the generative phase are mapped to the specific classes via discriminative learning.
We first demonstrate the unique abilities of the OGDRBM to represent the distinctive class distributions and to learn in a manner that is invariant to the training data sequence, through a study on the well-explored MNIST data set. This sequential invariance is much like the invariance to permutations of the training set seen with batch learning algorithms [14]. We then evaluate the performance of OGDRBM on binary classification tasks with a variety of highly imbalanced streaming credit fraud analytics data sets. It is critical to learn the distribution of a minority class from a highly imbalanced data set, and online learning provides a means to efficiently learn the underrepresented minority class, owing to its ability to detect novelty in data. Our results show that the OGDRBM performs better than a batch-trained DBN, with more compact networks and fewer training samples. The main contributions of the paper are:

An online generative learning algorithm for unsupervised feature representation at the hidden layer of an RBM.

A statistical demonstration that the neurons trained through the unsupervised online generative training are inherently discriminative.

A statistical demonstration that the classification accuracy and the neuron-to-class label associations of the OGDRBM are independent of the sequence in which the training samples are presented.

A demonstration that the OGDRBM achieves better accuracies with more compact network architectures than batch learning algorithms.
The paper is organized as follows. First, we present the OGDRBM architecture and algorithm. Next, we demonstrate the learning algorithm of the OGDRBM on the MNIST data set. Then, we evaluate the OGDRBM in relation to other algorithms applied to the credit fraud detection problem. Finally, we summarize the study and outline future directions.
2 Online Generative Discriminative Restricted Boltzmann Machine
We describe the Online Generative Discriminative Restricted Boltzmann Machine (OGDRBM) learning algorithm. Fig. 1 shows the two phases of training the OGDRBM, namely: (1) an online generative learning phase for unsupervised feature representation at the hidden layer, and (2) a discriminative phase for supervised modeling of the class conditional probabilities.
We denote the training data set as $\{(\mathbf{x}^t, c^t),\; t = 1, \dots, N\}$, wherein $\mathbf{x}^t \in \mathbb{R}^m$ is the $m$-dimensional input of the $t$-th sample, $c^t \in \{1, \dots, C\}$ denotes the class label (target) among the $C$ classes, and $N$ is the total number of samples. The objective of the OGDRBM is to best approximate the functional relationship between the inputs and their targets.
2.1 Preliminaries
A Restricted Boltzmann Machine (RBM) [7] has visible and hidden layers, connected through symmetric weights. The inputs ($\mathbf{x}$) correspond to the neurons in the visible layer ($\mathbf{v}$). The responses of the neurons in the hidden layer ($\mathbf{h}$) model the probability distribution of the inputs. The probability distribution is derived by learning the symmetric weights ($\mathbf{W}$) connecting the visible and the hidden layers. The neurons in the same layer of the RBM are not connected. The conditional probability of a configuration of the hidden neurons ($\mathbf{h}$), given a configuration of the visible neurons associated with the inputs $\mathbf{x}$, is:

$$P(\mathbf{h} \mid \mathbf{v}) = \frac{e^{-E(\mathbf{v},\mathbf{h})}}{\sum_{\tilde{\mathbf{h}}} e^{-E(\mathbf{v},\tilde{\mathbf{h}})}} \qquad (1)$$

where $E(\mathbf{v},\mathbf{h}) = -\mathbf{b}^{\top}\mathbf{v} - \mathbf{c}^{\top}\mathbf{h} - \mathbf{v}^{\top}\mathbf{W}\mathbf{h}$ is the energy of the joint configuration, with $\mathbf{b}$ and $\mathbf{c}$ the visible and hidden biases. The objective of the training phase is to learn the unknown parameters ($\mathbf{W}$, $\mathbf{b}$, $\mathbf{c}$) iteratively using the inputs ($\mathbf{x}$), as described below. Because neurons within a layer are not connected, the conditional probability of a configuration of the visible neurons ($\mathbf{v}$), given a configuration of the hidden neurons ($\mathbf{h}$), factorizes as:

$$P(\mathbf{v} \mid \mathbf{h}) = \prod_{i=1}^{m} P(v_i \mid \mathbf{h}) \qquad (2)$$

Similarly, the conditional probability of ($\mathbf{h}$) given ($\mathbf{v}$) is:

$$P(\mathbf{h} \mid \mathbf{v}) = \prod_{j=1}^{K} P(h_j \mid \mathbf{v}) \qquad (3)$$

The individual activation probabilities are given by:

$$P(h_j = 1 \mid \mathbf{v}) = \sigma\Big(c_j + \sum_{i=1}^{m} W_{ij}\, v_i\Big) \qquad (4)$$

$$P(v_i = 1 \mid \mathbf{h}) = \sigma\Big(b_i + \sum_{j=1}^{K} W_{ij}\, h_j\Big) \qquad (5)$$

$$\sigma(x) = \frac{1}{1 + e^{-x}} \qquad (6)$$

where the function $\sigma(\cdot)$ refers to the sigmoidal activation function. The generative training phase iterates until the model distribution $P(\mathbf{v})$ most closely approximates the distribution of the training data. Training is performed using the maximum likelihood criterion, implemented by minimizing the negative log probability of the training data:

$$\mathcal{L} = -\sum_{t=1}^{N} \log P(\mathbf{v} = \mathbf{x}^t) \qquad (7)$$

The weights between the input and hidden layers of the RBM are updated through contrastive divergence [7] according to:

$$\Delta W_{ij} = \eta \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}} \right) \qquad (8)$$

wherein $\eta$ denotes a pre-specified learning rate, and $\langle \cdot \rangle_{\text{data}}$ and $\langle \cdot \rangle_{\text{recon}}$ denote expectations under the data and under the one-step Gibbs reconstruction, respectively.
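For concreteness, the following sketch shows one way Eqs. (4)-(8) translate to code, for binary visible units trained with a single step of contrastive divergence (CD-1). It is a minimal illustration under our own assumptions, not the authors' implementation; the class interface, initialization scale, and default learning rate are illustrative choices.

```python
import numpy as np

def sigmoid(x):
    """Sigmoidal activation function, Eq. (6)."""
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Minimal binary RBM trained with one step of contrastive divergence (CD-1)."""

    def __init__(self, n_visible, n_hidden, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b = np.zeros(n_visible)   # visible biases
        self.c = np.zeros(n_hidden)    # hidden biases
        self.lr = lr
        self.rng = rng

    def hidden_probs(self, v):
        """P(h_j = 1 | v), Eq. (4)."""
        return sigmoid(self.c + v @ self.W)

    def visible_probs(self, h):
        """P(v_i = 1 | h), Eq. (5)."""
        return sigmoid(self.b + h @ self.W.T)

    def cd1_update(self, v0):
        """One CD-1 weight update, Eq. (8), for a single sample v0."""
        ph0 = self.hidden_probs(v0)
        h0 = (self.rng.random(ph0.shape) < ph0).astype(float)  # sample hidden states
        v1 = self.visible_probs(h0)                            # one-step reconstruction
        ph1 = self.hidden_probs(v1)
        self.W += self.lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
        self.b += self.lr * (v0 - v1)
        self.c += self.lr * (ph0 - ph1)
        return np.linalg.norm(v0 - v1)                         # reconstruction error
```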
2.2 Online Generative Learning
We now describe the online generative learning process for feature representation at the hidden layer. Initially, the hidden layer has no neurons. As the data streams in, the online generative learning algorithm of the RBM adds neurons (with activations defined by Eqs. (4) and (5)) and/or updates the representations of the existing neurons depending on the novelty of the sample. The first neuron is added based on the first sample in the data set.
At a given point during the training process, the network comprises $K$ neurons in the hidden layer, corresponding to $K$ novel samples out of the history of samples presented to the RBM thus far. For the next sample $\mathbf{x}^t$, the reconstruction error of the network is:

$$\epsilon^t = \left\lVert \mathbf{x}^t - \hat{\mathbf{x}}^t \right\rVert \qquad (9)$$

where $\hat{\mathbf{x}}^t$ is the reconstruction of $\mathbf{x}^t$ obtained through one Gibbs step using Eqs. (4) and (5). The reconstruction error ($\epsilon^t$) is compared to two predefined thresholds, namely the novelty threshold $\epsilon_{novel}$ and the marginal representation threshold $\epsilon_{marg}$. Based on this comparison, the algorithm chooses one of the following steps for the sample:

Add a Representative Neuron: If $\epsilon^t > \epsilon_{novel}$, the sample is deemed novel and a neuron ($K+1$) is added to the hidden layer of the network. The input weights $\mathbf{w}_{K+1}$ connecting the new hidden neuron and the neurons in the input layer are obtained as a function of the inputs, $\mathbf{w}_{K+1} = f(\mathbf{x}^t)$, where $f$ can be any function of the current input. In this work, we assign $\mathbf{w}_{K+1} = \mathbf{x}^t$. The network weights of all the neurons, including the new neuron, are then updated according to Eq. (8).

Adapt Existing Network: If $\epsilon_{marg} \le \epsilon^t \le \epsilon_{novel}$, the network weights ($\mathbf{W}$) are adapted according to Eq. (8), such that the probability distribution approximated by the hidden neurons includes the representation of this sample.

Ignore Sample: If $\epsilon^t < \epsilon_{marg}$, the sample is sufficiently represented by the existing network and does not warrant a network update.
Overall, the neurons in the hidden layer of the network are adaptively added and updated to obtain a compact network structure that is sufficiently representative of the data.
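A minimal sketch of this growing procedure is given below, reusing the RBM class from the earlier sketch. The threshold names eps_novel and eps_marg, the assumption of inputs scaled to [0, 1], and the loop structure are our own notational choices, not specifics from the paper.

```python
import numpy as np

def reconstruction_error(rbm, x):
    """Reconstruction error of one sample after a single Gibbs step, Eq. (9)."""
    return np.linalg.norm(x - rbm.visible_probs(rbm.hidden_probs(x)))

def online_generative_phase(stream, eps_novel, eps_marg, lr=0.05):
    """Grow / adapt / ignore decisions of the online generative phase."""
    rbm = None
    for x in stream:                            # x: m-dimensional sample in [0, 1]
        if rbm is None:                         # first sample seeds the first neuron
            rbm = RBM(n_visible=x.size, n_hidden=0, lr=lr)
            err = np.inf
        else:
            err = reconstruction_error(rbm, x)  # Eq. (9)
        if err > eps_novel:                     # novel: add a neuron with weights w = x
            rbm.W = np.column_stack([rbm.W, x])
            rbm.c = np.append(rbm.c, 0.0)
            rbm.cd1_update(x)                   # update all weights, Eq. (8)
        elif err >= eps_marg:                   # marginally represented: adapt weights
            rbm.cd1_update(x)
        # else: well represented -> ignore the sample
    return rbm
```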
2.3 Discriminative Training
We now describe the discriminative training, where the feature representation learned during the online generative phase is mapped to the conditional class distributions in a supervised fashion.
The responses of the neurons in the hidden layer are given by:

$$\hat{h}_j = \sigma\Big(c_j + \sum_{i=1}^{m} W_{ij}\, x_i^t\Big), \quad j = 1, \dots, K \qquad (10)$$

This feature representation is then used in a supervised discriminative training phase to learn the conditional probability distribution $P(c \mid \mathbf{x})$. The class labels are encoded in a one-hot target vector $\mathbf{y}^t \in \{0, 1\}^C$, as below:

$$y_c^t = \begin{cases} 1 & \text{if } c = c^t \\ 0 & \text{otherwise} \end{cases} \qquad (11)$$

The objective of discriminative training is to minimize the negative log probability of the targets:

$$J(\mathbf{U}) = \sum_{t=1}^{N} e\big(\mathbf{y}^t, \hat{\mathbf{y}}^t\big) \qquad (12)$$

where $e(\cdot, \cdot)$ is a measure of error between the target $\mathbf{y}^t$ and the prediction $\hat{\mathbf{y}}^t$, and $\mathbf{U} = [u_{cj}]$ are the weights connecting the $c$-th output neuron and the $j$-th hidden neuron. Here, we perform discriminative training through 10 epochs of supervised training using a Multi-Layer Perceptron (MLP) with sigmoidal activation functions.
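A sketch of this supervised mapping follows, under the assumption that the error measure $e(\cdot,\cdot)$ in Eq. (12) is the cross-entropy; a single sigmoidal output layer stands in for the MLP, and all names besides the 10-epoch budget are ours. It reuses the RBM fields from the earlier sketches.

```python
import numpy as np

def discriminative_phase(rbm, X, Y, epochs=10, lr=0.1):
    """Map hidden responses (Eq. 10) to one-hot targets Y by gradient descent."""
    H = 1.0 / (1.0 + np.exp(-(rbm.c + X @ rbm.W)))   # hidden features, Eq. (10)
    rng = np.random.default_rng(0)
    U = 0.01 * rng.standard_normal((H.shape[1], Y.shape[1]))
    for _ in range(epochs):
        Y_hat = 1.0 / (1.0 + np.exp(-(H @ U)))       # sigmoidal output layer
        U -= lr * H.T @ (Y_hat - Y) / len(X)         # cross-entropy gradient, Eq. (12)
    return U
```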
3 Demonstration of OGDRBM: MNIST
We now demonstrate the progression of learning within the proposed OGDRBM approach, and make some observations about the algorithm. We characterize the algorithm on the MNIST data set [10], as it is a large, well-explored multi-category data set (60,000 training samples, 10 categories). The network is trained in an online fashion, using the training data set. The validity of the trained network is established independently on the test set (10,000 samples, 10 categories) in an offline fashion.
Figs. 2 and 3 show the evolution of the reconstruction error and of the network architecture as samples stream in for training. Fig. 2 shows that the reconstruction error is high for the initial samples. This is because the model is in its infancy and is just beginning to learn. Hence, most samples are novel to the network, resulting in neurons being added (see Fig. 3). However, as training progresses, the network learns a sufficient representation of the data and the reconstruction error reduces progressively, resulting in fewer neurons being added to the network. It is evident from Figs. 2 and 3 that the online generative phase converges to a stable, concise network architecture, and the generative training completes in 26 min 15 s. It is also evident from Fig. 3 that a large majority of the neurons in the stable network are added for the earliest training samples, while the later training samples contribute only a small proportion of the neurons in the stable network.
We next tested the effect of the training data sequence on the performance of the algorithm. We trained the OGDRBM independently on multiple randomly constructed sequences of the MNIST training samples, presenting a different ordering of the training data set in each trial. Across the training trials, the classification accuracy on the testing data set was 97% ± 2%, and the final number of neurons was consistent across trials. Thus, changing the sequence of presentation of training samples does not change the accuracy or the network architecture significantly, showing that the network is able to generalize well with a concise network architecture.
To study the discriminative potential of the feature representation learned during the online generative training phase, we related the number of 'novel' samples (those with $\epsilon^t > \epsilon_{novel}$) to their corresponding class labels for each of the trials. Fig. 4 shows the average number of hidden layer neurons associated with each class of the MNIST data set, with the standard deviation across the trials. These results show that the individual neurons in the trained network are inherently discriminative of the class labels despite the unsupervised nature of the training. Further, we observe that the variability across trials is a small proportion of the average number of neurons in each class, suggesting that the neuron-to-class associations are largely independent of the sequence of training data samples.

4 Performance Study: Credit Fraud Analytics
Online learning algorithms are particularly suitable for streaming data applications where the data distribution evolves with time, and are therefore relevant to the problem of credit scoring. Credit scoring is the problem of estimating the probability that a borrower might default and/or exhibit undesirable behavior in the future. This problem is usually characterized by an imbalanced data set with large interpersonal variability across borrowers. Such a problem calls for online learning algorithms that are capable of learning the distribution of the data set despite the class imbalance and the variability among samples within the same class.
Several studies have employed batch machine learning techniques for credit scoring [1, 11, 6, 18]. We perform analogous evaluations to benchmark our online learning algorithm in relation to these batch learning techniques. Specifically, we perform credit fraud prediction using three publicly available data sets, namely, the UCI German credit data set (UCI German), the UCI Australian credit data set (UCI AUS), and the KAGGLE 'Give me some credit' data set (KAGGLE GMSC). We evaluate the OGDRBM classifier in comparison with the Support Vector Machine (SVM) classifier, the Multi-Layer Perceptron Neural Network (NN) classifier, the Classification Restricted Boltzmann Machine (ClassRBM) classifier [9], and the Scoring Table (ST) method; the results on the three credit data sets (listed in Table 1) are reported in Table 2. Table 1 details the public credit scoring data sets, along with their input dimensions, data set sizes, and imbalance factors ($I_F$):
$$I_F = 1 - \frac{C}{N} \min_{c \in \{1, \dots, C\}} N_c \qquad (13)$$

where $C$ is the total number of classes and $N_c$ is the number of samples in class $c$. It is evident that the three public data sets have varying degrees of class imbalance. While the UCI AUS is mildly imbalanced, the UCI German is partially imbalanced and the KAGGLE GMSC has very high imbalance across classes. This varying degree of class imbalance provides a unique opportunity to characterize the neuron distribution across classes in the online learning framework.
Table 1: Description of the public credit scoring data sets.

Data set      Input features   Size of data set   I_F
UCI AUS       14               690                0.1101
UCI German    24               1000               0.4
KAGGLE GMSC   10               150000             0.86632
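The imbalance factor of Eq. (13) can be computed directly from a label vector, as in the short sketch below (the function name is ours):

```python
import numpy as np

def imbalance_factor(labels):
    """I_F = 1 - (C / N) * min_c N_c, per Eq. (13)."""
    _, counts = np.unique(labels, return_counts=True)
    return 1.0 - len(counts) * counts.min() / counts.sum()

# Example: UCI German has 700 good and 300 bad borrowers -> I_F = 0.4
print(imbalance_factor([0] * 700 + [1] * 300))
```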
We filled in the missing values in the KAGGLE 'Give me some credit' data set by averaging across similar participants in the population, grouped according to age intervals.
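A possible pandas implementation of this imputation is sketched below, assuming a numeric age column; the column names and the bin width are hypothetical, since the text does not specify them.

```python
import pandas as pd

def impute_by_age_group(df, col, age_col="age", bin_width=10):
    """Fill missing values of `col` with the mean over rows in the same age group."""
    groups = df[age_col] // bin_width               # hypothetical age binning
    group_means = df.groupby(groups)[col].transform("mean")
    return df[col].fillna(group_means)
```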
We compare the classifiers on the three problems based on the network size and on performance measures, namely the overall efficiency ($\eta_O$), the average efficiency ($\eta_A$), the True Positive Rate (TPR), the True Negative Rate (TNR), and the Geometric mean accuracy (G-mean), defined as:

$$\eta_O = \frac{\sum_{c=1}^{C} N_c^{corr}}{N} \times 100\% \qquad (14)$$

$$\eta_A = \frac{1}{C} \sum_{c=1}^{C} \frac{N_c^{corr}}{N_c} \times 100\% \qquad (15)$$

$$\text{TPR} = \frac{TP}{TP + FN} \qquad (16)$$

$$\text{TNR} = \frac{TN}{TN + FP} \qquad (17)$$

$$\text{G-mean} = \sqrt{\text{TPR} \times \text{TNR}} \qquad (18)$$

Here, $N_c^{corr}$ is the number of correctly classified samples in class $c$ and $N_c$ is the total number of samples in class $c$; $TP$, $TN$, $FP$, and $FN$ denote the true positives, true negatives, false positives, and false negatives, respectively.
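For reference, a short sketch computing Eqs. (14)-(18) for a binary problem, treating the minority (fraud) class as positive; the function and variable names are ours:

```python
import numpy as np

def performance_measures(y_true, y_pred):
    """Overall/average efficiency, TPR, TNR, and G-mean, Eqs. (14)-(18)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    eta_o = 100.0 * (tp + tn) / len(y_true)   # overall efficiency, Eq. (14)
    tpr = tp / (tp + fn)                      # Eq. (16)
    tnr = tn / (tn + fp)                      # Eq. (17)
    eta_a = 100.0 * (tpr + tnr) / 2           # average efficiency for C = 2, Eq. (15)
    g_mean = np.sqrt(tpr * tnr)               # Eq. (18)
    return eta_o, eta_a, tpr, tnr, g_mean
```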
We now present the results of the OGDRBM in relation to the batch learning techniques. We reproduce previously reported batch learning results for the SVM, NN, ClassRBM, and ST classifiers from [18]. Although the ClassRBM results in [18] are reported for a fixed architecture of 100 hidden neurons with a specific batch size and learning rate, the architectures of the other classifiers are not specified. Further, the training accuracies of these classifiers have also not been reported. Hence, we perform independent evaluations using the SVM, NN, and ClassRBM classifiers, to report an additional performance validation beyond the previously reported results.
The performance comparisons provide the following observations:

Network Size: Overall, the OGDRBM network uses fewer neurons than the classifiers used in comparison. This is because the OGDRBM adds neurons only for the most novel samples, so the resulting neurons are well representative of the data set.

Performance Measures: Despite having a compact architecture, the proposed OGDRBM performs better than all the classifiers used in comparison. This could be attributed to the fact that the learnt distributions represent the data very well. Moreover, while the other algorithms learn the data and update gradients in batches, the OGDRBM updates gradients based on every sample in the data set.

Neuron Distribution Per Class: Unlike the batch learning algorithms that need an a priori assumption of the architecture, the OGDRBM builds the network as learning progresses. This allows us to infer the number of neurons per class, which may help characterize the distribution of the samples in each class.

Effect of Class Imbalance: Classes with fewer samples require more neurons for sufficient feature representation. As the class imbalance increases, a greater proportion of the hidden layer neurons is associated with the less prevalent classes. This adaptation is a natural consequence of the online learning process, and differentiates our approach from the batch learning algorithms.
Table 2: Performance comparison of the classifiers on the three credit scoring data sets. Rows marked * are reproduced from [18]; their training efficiencies were not reported. For the OGDRBM, the per-class split of the hidden neurons is given in parentheses.

Data set      Classifier   Neurons      Training          Testing           TPR     TNR     G-mean
                                        η_O      η_A      η_O      η_A
UCI German    SVM          534          76.429   66.679   74.667   61.378   0.3255  0.8878  0.5376
              SVM*         -            -        -        -        -        0.484   0.867   0.648
              NN           60           98.571   97.573   72.333   65.105   0.4574  0.8446  0.6216
              NN*          -            -        -        -        -        0.517   0.814   0.648
              ClassRBM     80           77.428   63.346   74.000   56.738   0.4418  0.8271  0.6045
              ClassRBM*    100          -        -        -        -        0.479   0.872   0.646
              ST*          -            -        -        -        -        0.67    0.68    0.68
              OGDRBM       48 (32:16)   79       74.2     76.5     71.69    0.60    0.83    0.71
UCI AUS       SVM          192          85.507   86.263   85.507   86.048   0.7946  0.9263  0.8579
              SVM*         -            -        -        -        -        0.913   0.71    0.850
              NN           60           94.824   94.767   84.058   83.727   0.7917  0.8828  0.836
              NN*          -            -        -        -        -        0.850   0.857   0.854
              ClassRBM     50           86.128   86.391   85.507   86.021   0.8953  0.8264  0.8602
              ClassRBM*    100          -        -        -        -        0.880   0.847   0.863
              ST*          -            -        -        -        -        0.828   0.805   0.816
              OGDRBM       38 (20:18)   86.68    86.8     88.49    89       0.92    0.86    0.89
KAGGLE GMSC   SVM          6340         69.970   59.430   72.240   60.018   0.5771  0.8982  0.72
              SVM*         -            -        -        -        -        0.114   0.994   0.336
              NN           60           63.896   62.287   74.200   63.017   0.6165  0.8792  0.7363
              NN*          -            -        -        -        -        0.229   0.986   0.475
              ClassRBM     100          75.687   74.048   86.160   74.789   0.6     0.8975  0.73384
              ClassRBM*    100          -        -        -        -        0.182   0.991   0.424
              ST*          -            -        -        -        -        0.515   0.622   0.566
              OGDRBM       13 (3:10)    76.08    74.49    86.25    75.22    0.63    0.88    0.74
5 Conclusion
We introduced a novel Online Generative Discriminative Restricted Boltzmann Machine (OGDRBM) algorithm that evolves a network architecture in a fully bottom-up online manner as data streams in. We demonstrated that the algorithm converges to a stable, compact network architecture wherein (a) hidden layer neurons are implicitly associated with class labels (despite unsupervised training), and (b) classification performance is invariant to the sequence in which the training data samples are presented. Further, the OGDRBM performed better than batch techniques in credit score classification with streaming data: specifically, online learning achieved better accuracy with fewer neurons and showed the unique ability to adapt to class imbalance. Areas of future work include extensions to unsupervised discriminative training, deeper architectures, and interpretable models.
References
 [1] B. Baesens, T. Van Gestel, S. Viaene, M. Stepanova, J. Suykens, and J. Vanthienen. Benchmarking stateoftheart classification algorithms for credit scoring. The Journal of the Operational Research Society, 54(6):627–635, 2003.
 [2] G. Bouchard, T. Trouillon, J. Perez, and A. Galdon. Online learning to sample. arXiv preprint arXiv:1506.09016v2, 2016.
 [3] G. Chen, R. Xu, and S. Srihari. Sequential labeling with online deep learning. arXiv preprint arXiv:1412.3397v3, 2015.
 [4] Thomas G. Dietterich. Machine learning for sequential data: A review. In Structural, Syntactic, and Statistical Pattern Recognition, pages 15–30. Springer-Verlag, 2002.
 [5] Shizhong Han, Zibo Meng, Ahmed Shehab Khan, and Yan Tong. Incremental boosting convolutional neural network for facial action unit recognition. In Advances in Neural Information Processing Systems, 2016.
 [6] D. J. Hand and W. E. Henley. Statistical classification models in consumer credit scoring: A review. Journal of the Royal Statistical Society: Series A (General), 160:523–541, 1997.
 [7] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
 [8] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 25, 2012.
 [9] H. Larochelle, M. Mandel, R. Pascanu, and Y. Bengio. Learning Algorithms for the Classification Restricted Boltzmann Machine. Journal of Machine Learning Research, 13:643–669, 2012.
 [10] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [11] Stefan Lessmann, Bart Baesens, HsinVonn Seow, and Lyn C Thomas. Benchmarking stateoftheart classification algorithms for credit scoring: An update of research. European Journal of Operational Research, 247(1):124–136, 2015.
 [12] Zachary C. Lipton, John Berkowitz, and Charles Elkan. A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019, 2015.
 [13] P. A. Szerlip, G. Morse, J. K. Pugh, and K. O. Stanley. Unsupervised feature learning through divergent discriminative feature accumulation. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-2015). AAAI Press, 2015.
 [14] Tomaso Poggio, Stephen Voinea, and Lorenzo Rosasco. Online learning, stability, and stochastic gradient descent. arXiv preprint arXiv:1105.4701v3, 2011.
 [15] Anthony Robins. Sequential learning in neural networks: A review and a discussion of pseudorehearsal based methods. Journal of Intelligent Data Analysis, 8(3):301–322, 2004.
 [16] K. Subramanian, S. Suresh, and R. Savitha. A metacognitive complex-valued interval type-2 fuzzy inference system. IEEE Transactions on Neural Networks and Learning Systems, 25(9):1659–1672, 2014.
 [17] S. Suresh, R. Savitha, and N. Sundararajan. A sequential learning algorithm for complex-valued resource allocation network-CSRAN. IEEE Transactions on Neural Networks, 22(7):1061–1072, 2011.
 [18] J. M. Tomczak and M. Zięba. Classification Restricted Boltzmann Machine for comprehensible credit scoring model. Expert Systems with Applications, 42(4):1789–1796, 2015.
 [19] J. T. Turner, Adam Page, Tinoosh Mohsenin, and Tim Oates. Deep belief networks used on high resolution multichannel Electroencephalography data for seizure detection. AAAI Spring Symposium, 2014.
 [20] Guanyu Zhou, Kihyuk Sohn, and Honglak Lee. Online incremental feature learning with denoising autoencoders. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS12), volume 22, pages 1453–1461, 2012.