RMDL
RMDL: Random Multimodel Deep Learning for Classification
The continually increasing number of complex datasets each year necessitates ever-improving machine learning methods for robust and accurate categorization of these data. This paper introduces Random Multimodel Deep Learning (RMDL): a new ensemble deep learning approach for classification. Deep learning models have achieved state-of-the-art results across many domains. RMDL addresses the problem of finding the best deep learning structure and architecture while simultaneously improving robustness and accuracy through ensembles of deep learning architectures. RMDL can accept a variety of data as input, including text, video, images, and symbolic data. This paper describes RMDL and shows test results for image and text data including MNIST, CIFAR-10, WOS, Reuters, IMDB, and 20newsgroup. These test results show that RMDL produces consistently better performance than standard methods over a broad range of data types and classification problems.
Categorization and classification with complex data such as images, documents, and video are central challenges in the data science community. Recently, there has been an increasing body of work using deep learning structures and architectures for such problems. However, the majority of these deep architectures are designed for a specific type of data or domain. There is a need to develop more general information processing methods for classification and categorization across a broad range of data types.
While many researchers have successfully used deep learning for classification problems (e.g., see (Kowsari et al., 2017; LeCun et al., 2015; Lee et al., 2009; Chung et al., 2014; Turan et al., 2017)), the central problem remains as to which deep learning architecture (DNN, CNN, or RNN) and structure (how many nodes (units) and hidden layers) is more efficient for different types of data and applications. The favored approach to this problem is trial and error for the specific application and dataset.
This paper describes an approach to this challenge using ensembles of deep learning architectures. This approach, called Random Multimodel Deep Learning (RMDL), uses three different deep learning architectures: Deep Neural Networks (DNN), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN). Test results with a variety of data types demonstrate that this new approach is highly accurate, robust, and efficient.
The three basic deep learning architectures use different feature space methods as input layers. For instance, for feature extraction from text, DNN uses term frequency-inverse document frequency (TF-IDF) (Robertson, 2004). RMDL searches across randomly generated hyperparameters for the number of hidden layers and nodes (density) in each hidden layer in the DNN. CNN has been well designed for image classification, but it can be used for more than image data. RMDL finds choices for hyperparameters in CNN using random feature maps and random numbers of hidden layers. The structures for CNN used by RMDL are 1D convolutional layers for text, 2D for images, and 3D for video processing. RNN architectures are used primarily for text classification. RMDL uses two specific RNN structures: Gated Recurrent Units (GRUs) and Long Short-Term Memory (LSTM). The number of GRU or LSTM units and hidden layers used by RMDL are also the results of a search over randomly generated hyperparameters.
The main contributions of this work are as follows: I) Description of an ensemble approach to deep learning which makes the final model more robust and accurate. II) Use of different optimization techniques in training the models to stabilize the classification task. III) Different feature extraction approaches for each Random Deep Learning (RDL) model in order to better understand the feature space (especially for text and video data). IV) Use of dropout in each individual RDL to address over-fitting. V) Use of majority voting among the RDL models. This majority vote from the ensemble of RDL models improves the accuracy and robustness of results. Specifically, if some of the RDL models produce inaccurate or overfit classifications but are outnumbered by the remaining models, the overall system is still robust and accurate. VI) Finally, RMDL has the ability to process a variety of data types such as text, images, and videos.
The rest of this paper is organized as follows: Section 2 gives related work for feature extraction, other classification techniques, and deep learning for classification tasks; Section 3 describes current techniques for classification tasks which are used as our baselines; Section 4 describes Random Multimodel Deep Learning methods and the architecture for RMDL, where Section 4.1 shows feature extraction in RMDL, Section 4.2 gives an overall view of RMDL, Section 4.3 addresses the deep learning structures used in this model, and Section 4.4 discusses the optimization problem; Section 5 shows the experimental results, including the evaluation of these techniques (Section 5.1) and the accuracy and performance of RMDL; and finally, Section 6 presents discussion and conclusions of our work.
Researchers from a variety of disciplines have produced work relevant to the approach described in this paper. We have organized this work into three areas: I) Feature extraction; II) Classification methods and techniques (baseline and other related methods); and III) Deep learning for classification.
Feature Extraction: Feature extraction is a significant part of machine learning, especially for text, image, and video data. Text and many biomedical datasets are mostly unstructured data from which we need to generate meaningful structures for use by machine learning algorithms. As an early example, L. Krueger et al. in 1979 (Krueger and Shapiro, 1979) introduced an effective method for feature extraction for text categorization. This feature extraction method is based on word counting to create a structure for statistical learning. Even earlier work by H. Luhn (Luhn, 1957) introduced weighted values for each word, and then G. Salton et al. in 1988 (Salton and Buckley, 1988) modified the weights of words by frequency counts, called term frequency-inverse document frequency (TF-IDF). The TF-IDF vectors measure the number of times a word appears in the document, weighted by the inverse frequency of the commonality of the word across documents. Although TF-IDF and word counting are simple and intuitive feature extraction methods, they do not capture relationships between words as sequences. Recently, T. Mikolov et al. (Mikolov et al., 2013) introduced an improved technique for feature extraction from text using the concept of embedding, or placing the word into a vector space based on context. This approach to word embedding, called Word2Vec, solves the problem of representing contextual word relationships in a computable feature space. Building on these ideas, J. Pennington et al. in 2014 (Pennington et al., 2014) developed a learned vector space representation of words called GloVe and deployed it in the Stanford NLP lab. The RMDL approach described in this paper uses GloVe for feature extraction from textual data.
Classification Methods and Techniques: Over the last 50 years, many supervised learning classification techniques have been developed and implemented in software to accurately label data. For example, the researchers K. Murphy in 2006 (Murphy, 2006) and I. Rish in 2001 (Rish, 2001) introduced the Naïve Bayes Classifier (NBC) as a simple approach to the more general representation of the supervised learning classification problem. This approach has provided a useful technique for text classification and information retrieval applications. As with most supervised learning classification techniques, NBC takes an input vector of numeric or categorical data values and produces the probability for each possible output label. This approach is fast and efficient for text classification, but NBC has important limitations. Namely, the order of the sequences in text is not reflected in the output probability because, for text analysis, naïve Bayes uses a bag-of-words approach for feature extraction. Because of its popularity, this paper uses NBC as one of the baseline methods for comparison with RMDL. Another popular classification technique is Support Vector Machines (SVM), which has proven quite accurate over a wide variety of data. This technique constructs a set of hyper-planes in a transformed feature space. This transformation is not performed explicitly but rather through the kernel trick, which allows the SVM classifier to perform well with highly nonlinear relationships between the predictor and response variables in the data. A variety of approaches have been developed to further extend the basic methodology and obtain greater accuracy. C. Yu et al. in 2009 (Yu and Joachims, 2009) introduced latent variables into the discriminative model as a new structure for SVM, and S. Tong et al. in 2001 (Tong and Koller, 2001) added active learning using SVM for text classification. For a large volume of data and datasets with a huge number of features (such as text), SVM implementations are computationally complex. Another technique that helps mitigate the computational complexity of the SVM for classification tasks is the stochastic gradient descent classifier (SGDClassifier) (Kabir et al., 2015), which has been widely used in both text and image classification. SGDClassifier is an iterative model for large datasets. The model is trained iteratively based on the SGD optimizer.
Deep Learning:
Neural networks derive their architecture as a relatively simple representation of the neurons in the human brain. They are essentially weighted combinations of inputs that pass through multiple non-linear functions. Neural networks use an iterative learning method known as back-propagation and an optimizer (such as stochastic gradient descent (SGD)).
Deep Neural Networks (DNN) are based on simple neural network architectures, but they contain multiple hidden layers. These networks have been widely used for classification. For example, D. Cireşan et al. in 2012 (Cireşan et al., 2012) used multi-column deep neural networks for classification tasks, where multi-column deep neural networks use DNN architectures. Convolutional Neural Networks (CNN) provide a different architectural approach to learning with neural networks. The main idea of CNN is to use feed-forward networks with convolutional layers that include local and global pooling layers. A. Krizhevsky et al. in 2012 (Krizhevsky et al., 2012) used CNN, with convolutional layers combined with the feature space of the image. Another example of CNN in (LeCun et al., 2015) showed excellent accuracy for image classification. This architecture can also be used for text classification, as shown in the work of (Kim, 2014). For text and sequences, convolutional layers are used with word embeddings as the input feature space. The final type of deep learning architecture is Recurrent Neural Networks (RNN), where outputs from the neurons are fed back into the network as inputs for the next step. Some recent extensions to this architecture use Gated Recurrent Units (GRUs) (Chung et al., 2014) or Long Short-Term Memory (LSTM) units (Hochreiter and Schmidhuber, 1997). These new units help control for instability problems in the original network architecture. RNNs have been successfully used for natural language processing (Mikolov et al., 2010). Recently, Z. Yang et al. in 2016 (Yang et al., 2016) developed hierarchical attention networks for document classification. These networks have two important characteristics: a hierarchical structure and an attention mechanism at the word and sentence level.
New work has combined these three basic models of the deep learning structure and developed novel techniques for enhancing accuracy and robustness. The work of M. Turan et al. in 2017 (Turan et al., 2017) and M. Liang et al. in 2015 (Liang and Hu, 2015) implemented innovative combinations of CNN and RNN called Recurrent Convolutional Neural Networks (RCNN). K. Kowsari et al. in 2017 (Kowsari et al., 2017) introduced hierarchical deep learning for text classification (HDLTex), which is a combination of all deep learning techniques in a hierarchical structure for document classification; it has improved accuracy over traditional methods. The work in this paper builds on these ideas, specifically the work of (Kowsari et al., 2017), to provide a more general approach to supervised learning for classification.
In this paper, we use both contemporary and traditional techniques of document and image classification as our baselines. The baselines of image and text classification are different due to feature extraction and structure of model; thus, text and image classification’s baselines are described separately in the following section.
Text classification techniques which are used as our baselines to evaluate our model are as follows: regular deep models such as Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), and Deep Neural Networks (DNN). Also, we have used two different techniques of Support Vector Machine (SVM), naïve bayes classification (NBC), and finally Hierarchical Deep Learning for Text Classification (HDLTex) (Kowsari et al., 2017).
The baseline used in this paper is deep learning without hierarchical levels. An example of a structure with hierarchical levels is (Yang et al., 2016), which has been used as one of our baselines for text classification. In our methods section (Section 4), we will explain the basic models of deep learning such as DNN, CNN, and RNN which are used as part of the RMDL model.
The original version of SVM is used for binary classification, so for multi-class problems we need to generate a multi-class SVM (MSVM). One-vs-One is a technique for multi-class SVM that needs to build N(N-1)/2 classifiers, one for each pair of the N classes.
The natural way to solve k-class problem is to construct a decision function of all classes at once (Chen et al., 2016; Weston and Watkins, 1998). Another technique of multi-class classification using SVM is All-against-One. In SVM, many different methods are available for feature extraction such as word sequences feature extracting (Zhang et al., 2008), and Term frequency-inverse document frequency (TF-IDF).
The basic idea of the String Kernel (SK) is to map a string into the feature space; therefore, the only difference between the three techniques is the way they map the string into the feature space. For many applications such as text, DNA, and protein classification, the Spectrum Kernel (SP) is used (Leslie et al., 2002; Eskin et al., 2002). The basic idea of SP is to count the number of times a word appears in a string and to use these counts as the feature map.
The Mismatch Kernel is the other stable way to map the string into the feature space. The key idea is to use k, which stands for k-mer or the size of the word, and to allow mismatches in the feature space (Leslie et al., 2004). The main problem of SVM for string sequences is the time complexity of these models. R. Singh et al. in 2017 (Singh et al., 2017) addressed the time problem with a gapped k-mer kernel called GaKCo, which is used only for protein and DNA sequences.
Stacking SVMs is used as another baseline method for comparison with RMDL, but this technique is used only for hierarchical labeled datasets. The stacking SVM provides an ensemble of individual SVM classifiers and generally produces more accurate results than single-SVM models (Sun and Lim, 2001; Sebastiani, 2002).
Naïve Bayes classification (NBC) has been used in industry and academia for a long time, and it is the most traditional method of text categorization, widely used in Information Retrieval (Manning et al., 2008). Given a set of documents and a set of categories, the output is the predicted class for each document. Naïve Bayes is a simple algorithm using the Bayes rule, described as follows:
$$P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)} \tag{1}$$
where $d$ is the document and $c$ indicates the classes.
(2) |
The baseline of this paper is word level of NBC (Kim et al., 2006) as follows:
(3) |
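The word-level naïve Bayes baseline can be illustrated with a small sketch. This is not the implementation used in the paper: the function names `train_nb` and `predict_nb`, the toy documents, and the add-one (Laplace) smoothing are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Collect per-class document counts and per-class word counts."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def predict_nb(model, doc):
    """Return the class maximizing log P(c) + sum over words of log P(w | c)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in doc.lower().split():
            if token not in vocab:
                continue  # ignore words never seen during training
            # add-one smoothing (an assumption of this sketch)
            score += math.log((word_counts[label][token] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy corpus, for illustration only
docs = ["deep neural network training", "stock market prices fall",
        "neural network layers", "market shares rise"]
labels = ["cs", "finance", "cs", "finance"]
model = train_nb(docs, labels)
predicted = predict_nb(model, "neural network market")
```

Because word order is discarded when counting, this sketch also shows the bag-of-words limitation of NBC mentioned earlier.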
HDLTex is used as one of our baselines for hierarchically labeled datasets. When documents are organized hierarchically, multi-class approaches are difficult to apply using traditional supervised learning methods. HDLTex (Kowsari et al., 2017) introduced a new approach to hierarchical document classification that combines multiple deep learning approaches to produce hierarchical classification. The primary contribution of the HDLTex research is the hierarchical classification of documents. A traditional multi-class classification technique can work well for a limited number of classes, but performance drops with an increasing number of classes, as is the case in hierarchically organized documents. HDLTex solved this problem by creating architectures that specialize deep learning approaches for their level of the document hierarchy.
For image classification, we have five baselines as follows: Deep L2-SVM (Tang, 2013), Maxout Network (Goodfellow et al., 2013), BinaryConnect (Courbariaux et al., 2015), PCANet-1 (Chan et al., 2015), and gcForest (Zhou and Feng, 2017).
Deep L2-SVM: This technique is known as deep learning using linear support vector machines, in which softmax is simply replaced with linear SVMs (Tang, 2013).
Maxout Network: I. Goodfellow et al. in 2013 (Goodfellow et al., 2013) defined a simple novel model called maxout (so named because its output is the max of a set of inputs, and because it is a natural companion to dropout). Their design both facilitates optimization with dropout and improves the accuracy of the dropout model.
BinaryConnect: M. Courbariaux et al. in 2015 (Courbariaux et al., 2015) worked on training Deep Neural Networks (DNN) with binary weights during propagations. They introduced a binarization scheme for binary weights during forward and backward propagations (BinaryConnect), which is mainly used for image classification. BinaryConnect is used as one of our baselines for RMDL on image classification.
PCANet-1: PCANet (Chan et al., 2015) is a simple deep learning approach for image classification which uses a CNN structure. Their technique is one of the basic and efficient methods of deep learning. The CNN structure they have used is part of RMDL, with the significant differences that they use: I) cascaded principal component analysis (PCA); II) binary hashing; and III) blockwise histograms, whereas the number of hidden layers and nodes in RMDL is selected automatically.
gcForest: Zhou and Feng (Zhou and Feng, 2017) introduced a decision tree ensemble approach with high performance as an alternative to deep neural networks. Deep forest creates multiple levels of forests of decision trees.
The novelty of this work is in using multiple random deep learning models, including DNN, RNN, and CNN techniques, for text and image classification. The method section of this paper is organized as follows: first we describe RMDL and discuss the three deep learning architectures (DNN, RNN, and CNN), which are trained in parallel. Next, we discuss the multiple optimizer techniques that are used in the different random models.
The feature extraction is divided into two main parts for RMDL (Text and image). Text and sequential datasets are unstructured data, while the feature space is structured for image datasets.
Image features are given as $h \times w \times c$, where $h$ denotes the height of the image, $w$ represents the width of the image, and $c$ is the color, which has 3 dimensions (RGB). For gray-scale datasets such as the MNIST dataset, the feature space is $h \times w$. A 3D object in space contains cloud points, and each cloud point has six features: (x, y, z, R, G, and B). The 3D object is unstructured because the number of cloud points can differ from one object to another. However, we could use simple instance down/up sampling to generate structured datasets.
In this paper we use several techniques of text feature extraction, namely word embedding (GloVe and Word2vec) and TF-IDF. We use word vectorization techniques (Hotta et al., 2010) for extracting features; in addition, N-gram representations can be used as features for neural deep learning (Kešelj et al., 2003; Dave et al., 2003). For example, feature extraction in this model for the string "In this paper we introduced this technique" would be composed of the following:
Feature count(1) { (In 1), (this 2), (paper 1), (we 1), (introduced 1), (technique 1) }
Feature count(2) { (In 1), (this 2), (paper 1), (we 1), (introduced 1), (technique 1), (In this 1), (this paper 1), (paper we 1), (we introduced 1), (introduced this 1), (this technique 1) }
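The two feature counts above can be reproduced with a few lines of Python. This is a minimal sketch, assuming whitespace tokenization and case-sensitive matching; the helper name `ngram_counts` is hypothetical and not part of RMDL.

```python
from collections import Counter

def ngram_counts(text, max_n):
    """Count all word n-grams of length 1..max_n in a whitespace-tokenized string."""
    tokens = text.split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

sentence = "In this paper we introduced this technique"
unigrams = ngram_counts(sentence, 1)         # corresponds to Feature count(1)
uni_and_bigrams = ngram_counts(sentence, 2)  # corresponds to Feature count(2)
```

Here `unigrams["this"]` is 2 because the word occurs twice, while every bigram in this sentence occurs exactly once.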
Documents enter our models via features extracted from the text. We employed different feature extraction approaches for the deep learning architectures we built. For CNN and RNN, we used the text vector-space models using dimensions as described in GloVe (Pennington et al., 2014). A vector-space model is a mathematical mapping of the word space, defined as follows:
(4) |
where is the length of the document , and is the GloVe word embedding vectorization of word in document .
Random Multimodel Deep Learning is a novel technique that can be used on any kind of dataset for classification. An overview of this technique is shown in Figure 2, which contains multiple Deep Neural Networks (DNN), Deep Convolutional Neural Networks (CNN), and Deep Recurrent Neural Networks (RNN). The number of layers and nodes for all of these deep learning models are generated randomly (e.g., 9 random models in RMDL constructed of CNNs, RNNs, and DNNs, all of which are unique due to random creation).
(5) |
Where is the number of random models, and is the output prediction of model for data point in model (Equation 5 is used for binary classification, ). Output space uses majority vote for final . Therefore, is given as follows:
(6) |
Where is number of random model, and shows the prediction of label of document or data point of for model and is defined as follows:
(7) |
After all RDL models (RMDL) are trained, the final prediction is calculated using majority vote of these models.
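The voting step can be sketched as follows. This is only an illustration under stated assumptions: the function name `majority_vote` and the three hypothetical model outputs are invented for the example, and ties are resolved in favor of the label seen first, which the paper does not specify.

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one list of predicted labels per model, all of equal length.
    Returns one label per data point, chosen by the most frequent vote."""
    final = []
    for votes in zip(*predictions):
        # most_common(1) yields the label with the highest count
        final.append(Counter(votes).most_common(1)[0][0])
    return final

model_outputs = [
    [0, 1, 2, 1],  # hypothetical RDL model 1
    [0, 1, 1, 1],  # hypothetical RDL model 2
    [2, 1, 2, 0],  # hypothetical RDL model 3
]
final_labels = majority_vote(model_outputs)
```

In this toy case each model makes one mistake on a different data point, and the vote still recovers a single consistent label per point.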
The RMDL model structure (section 4.2) includes three basic architectures of deep learning in parallel. We describe each individual model separately. The final model contains random DNNs (Section 4.3.1), RNNs (Section 4.3.2), and CNNs models (Section 4.3.3).
Deep Neural Networks' structure is designed to learn through multiple connected layers, where each layer receives connections only from the previous layer and provides connections only to the next layer in the hidden part. The input is a connection of the feature space with the first hidden layer for all random models. The output layer has as many nodes as the number of classes for multi-class classification and only one node for binary classification. The main contribution here is that we train many DNNs for different purposes. In our technique, we have multi-class DNNs where each learning model is generated randomly (the number of nodes in each layer and the number of layers are assigned completely at random). Our implementation of Deep Neural Networks (DNN) is a discriminatively trained model that uses the standard back-propagation algorithm with sigmoid (Equation 8) and ReLU (Nair and Hinton, 2010) (Equation 9) as activation functions. The output layer for multi-class classification should use softmax (Equation 10).
$$f(x) = \frac{1}{1 + e^{-x}} \in (0, 1) \tag{8}$$
$$f(x) = \max(0, x) \tag{9}$$
$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \quad \forall\, j \in \{1, \ldots, K\} \tag{10}$$
Given a set of example pairs , the goal is to learn from these input and target space using hidden layers. In text classification, the input is string which is generated by vectorization of text. In Figure 2 the left model shows how DNN contribute in RMDL.
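As a rough illustration of how random DNN structures might be drawn, the sketch below samples the number of hidden layers and the number of nodes per layer. The function name `sample_dnn_structure`, the ranges, and the fixed seed are assumptions for this example and are not values taken from the paper.

```python
import random

def sample_dnn_structure(rng, min_layers=1, max_layers=8, min_nodes=64, max_nodes=512):
    """Draw one random DNN description: depth and width of each hidden layer."""
    n_layers = rng.randint(min_layers, max_layers)  # inclusive bounds
    return {
        "hidden_layers": n_layers,
        "nodes_per_layer": [rng.randint(min_nodes, max_nodes) for _ in range(n_layers)],
    }

rng = random.Random(0)  # seeded only so the sketch is repeatable
dnn_structures = [sample_dnn_structure(rng) for _ in range(9)]
```

Each sampled dictionary could then be handed to whatever model-building code is in use; that step is deliberately left out here.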
Another neural network architecture that contributes to RMDL is Recurrent Neural Networks (RNN). RNN assigns more weight to the previous data points of a sequence. Therefore, this technique is a powerful method for text, string, and sequential data classification, but it can also be used for image classification, as we did in this work. In RNN, the neural net considers the information of previous nodes in a sophisticated way, which allows for better semantic analysis of the structure of the dataset. The general formulation of this concept is given in Equation 11, where $x_t$ is the state at time $t$ and $u_t$ refers to the input at step $t$.
$$x_t = F(x_{t-1}, u_t, \theta) \tag{11}$$
More specifically, we can use weights to formulate Equation 11 with specified parameters in Equation 12:
$$x_t = W_{rec}\,\sigma(x_{t-1}) + W_{in}\,u_t + b \tag{12}$$
where $W_{rec}$ refers to the recurrent weight matrix, $W_{in}$ refers to the input weights, $b$ is the bias, and $\sigma$ denotes an element-wise function.
Again, we have modified the basic architecture for use in RMDL. Figure 2 left side shows this extended RNN architecture. Several problems arise from RNN when the error of the gradient descent algorithm is back-propagated through the network: vanishing gradient and exploding gradient (Bengio et al., 1994).
Long Short-Term Memory (LSTM): To deal with these problems, Long Short-Term Memory (LSTM) is a special type of RNN that preserves long-term dependencies more effectively than the basic RNN. This is particularly useful for overcoming the vanishing gradient problem (Pascanu et al., 2013). Although LSTM has a chain-like structure similar to RNN, LSTM uses multiple gates to carefully regulate the amount of information that is allowed into each node state. Figure 3 shows the basic cell of an LSTM model. A step-by-step explanation of an LSTM cell is as follows:
$$i_t = \sigma(W_i [x_t, h_{t-1}] + b_i) \tag{13}$$
$$\tilde{C}_t = \tanh(W_c [x_t, h_{t-1}] + b_c) \tag{14}$$
$$f_t = \sigma(W_f [x_t, h_{t-1}] + b_f) \tag{15}$$
$$C_t = i_t * \tilde{C}_t + f_t * C_{t-1} \tag{16}$$
$$o_t = \sigma(W_o [x_t, h_{t-1}] + b_o) \tag{17}$$
$$h_t = o_t \tanh(C_t) \tag{18}$$
where Equation 13 is the input gate, Equation 14 shows the candidate memory cell value, Equation 15 is the forget gate activation, Equation 16 is the new memory cell value, and Equations 17 and 18 show the output gate value. In the above description, all $b$ represent bias vectors, all $W$ represent weight matrices, and $x_t$ is used as input to the memory cell at time $t$. Also, the indices $i$, $c$, $f$, and $o$ refer to the input, cell memory, forget, and output gates respectively. Figure 3 shows the structure of these gates with a graphical representation.
were introduced which deploy a max-pooling layer to determine discriminative phrases in a text (Lai et al., 2015).
Gated Recurrent Unit (GRU): The Gated Recurrent Unit (GRU) is a gating mechanism for RNN which was introduced by (Chung et al., 2014) and (Cho et al., 2014). GRU is a simplified variant of the LSTM architecture, with the following differences: GRU contains two gates; a GRU does not possess internal memory (the $C_{t-1}$ in Figure 3); and finally, a second non-linearity is not applied (tanh in Figure 3). A step-by-step explanation of a GRU cell is as follows:
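$$z_t = \sigma_g(W_z x_t + U_z h_{t-1} + b_z) \tag{19}$$
where $z_t$ refers to the update gate vector at time $t$, $x_t$ stands for the input vector, $W$, $U$, and $b$ are parameter matrices and vectors, and $\sigma_g$ is an activation function that could be sigmoid or ReLU.
$$r_t = \sigma_g(W_r x_t + U_r h_{t-1} + b_r) \tag{20}$$
$$h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \sigma_h(W_h x_t + U_h (r_t \circ h_{t-1}) + b_h) \tag{21}$$
where $h_t$ is the output vector at time $t$, $r_t$ stands for the reset gate vector at time $t$, $z_t$ is the update gate vector at time $t$, and $\sigma_h$ indicates the hyperbolic tangent function.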
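A single GRU update can be traced numerically with the sketch below. It assumes the commonly used GRU formulation with an update gate, a reset gate, and a tanh candidate state, and it uses scalar inputs and states for readability. The function name `gru_step` and the parameter dictionary keys are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU update for a scalar input x_t and scalar previous state h_prev.
    p is a dict of scalar parameters (hypothetical names)."""
    z = sigmoid(p["W_z"] * x_t + p["U_z"] * h_prev + p["b_z"])  # update gate
    r = sigmoid(p["W_r"] * x_t + p["U_r"] * h_prev + p["b_r"])  # reset gate
    # candidate state: the reset gate scales how much of h_prev is used
    h_cand = math.tanh(p["W_h"] * x_t + p["U_h"] * (r * h_prev) + p["b_h"])
    # the update gate blends the previous state with the candidate
    return z * h_prev + (1.0 - z) * h_cand

names = ("W_z", "U_z", "b_z", "W_r", "U_r", "b_r", "W_h", "U_h", "b_h")
zero_params = {name: 0.0 for name in names}
h_next = gru_step(x_t=1.0, h_prev=0.8, p=zero_params)
```

With all parameters set to zero, both gates equal 0.5 and the candidate state is 0, so the new state is simply half of the previous one.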
The final deep learning approach which contributes to RMDL is Convolutional Neural Networks (CNN), which are employed for document or image classification. Although originally built for image processing with an architecture similar to the visual cortex, CNNs have also been effectively used for text classification (LeCun et al., 1998); thus, in RMDL, this technique is used for all datasets.
In the basic CNN for image processing, an image tensor is convolved with a set of kernels. These convolution layers are called feature maps and can be stacked to provide multiple filters on the input. To reduce the computational complexity, CNNs use pooling, which reduces the size of the output from one layer to the next in the network. Different pooling techniques are used to reduce outputs while preserving important features (Scherer et al., 2010). The most common pooling method is max pooling, where the maximum element is selected in the pooling window.
In this paper we use two types of stochastic gradient optimizers in our neural network implementation, namely RMSProp and the Adam optimizer:
SGD has been used as one of our optimizers; it is shown in Equation 22. It uses a momentum on the re-scaled gradient, which is shown in Equation 23, for updating parameters. The other optimizer technique that is used is RMSProp, which does not do bias correction. This will be a significant problem when dealing with sparse gradients.
(22) | ||||
(23) |
Adam is another stochastic gradient optimizer which uses only the first two moments of the gradient (shown in Equations 24, 25, 26, and 27) and averages over them. It can handle non-stationarity of the objective function as in RMSProp while overcoming the sparse gradient issue that was a drawback of RMSProp (Kingma and Ba, 2014).
(24) |
(25) |
(26) |
(27) |
Here the first moment and the second moment of the gradient are both estimated.
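The role of the two moment estimates can be seen in a small numeric sketch of one Adam step. It follows the update rule described by Kingma and Ba (2014), including bias correction, on a single scalar parameter. The function name `adam_step` and the default values shown are assumptions for this example, not settings reported in this paper.

```python
import math

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad         # first moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad  # second moment estimate
    m_hat = m / (1.0 - beta1 ** t)               # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)               # bias-corrected second moment
    theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adam_step(theta, grad=2.0, m=m, v=v, t=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `alpha` regardless of the gradient's magnitude.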
The main idea of using multiple models with different optimizers is that if one optimizer does not provide a good fit for a specific dataset, the RMDL model with random models (some of which might use different optimizers) can ignore the models which are not efficient, as long as those models remain a minority in the vote. Figure 4 provides a visual insight into how three optimizers work better in the concept of majority voting. Using multiple optimizer techniques such as SGD, Adam, RMSProp, Adagrad, Adamax, and so on helps the RMDL model to be more stable for any type of dataset. In this research, we only used two optimizers (Adam and RMSProp) for evaluating our model, but the RMDL model has the capability to use any kind of optimizer.
In this section, experimental results are discussed, including the evaluation method, experimental setup, and datasets. We also discuss the hardware and frameworks used in RMDL; finally, a comparison between our empirical results and the baselines is presented. Moreover, the losses and accuracies of this model for each individual RDL (in each epoch) are shown in Figure 5.
In this work, we report accuracy and Micro F1-Score, which are given as follows:
(28) | ||||
(29) | ||||
(30) |
However, the performance of our model is evaluated only in terms of F1-score for evaluation as in Tables 1 and 3. Formally, given a set of indices, we define the class as . If we denote and for -true positive of , -false positive, -false negative, and -true negative counts respectively then the above definitions apply for our multi-class classification problem.
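Micro-averaged scores pool the true positive, false positive, and false negative counts over all classes before computing the ratios. The sketch below shows that computation under the usual definitions; the function name `micro_scores` and the toy labels are hypothetical.

```python
def micro_scores(y_true, y_pred):
    """Return micro-averaged (precision, recall, F1) for single-label multi-class data."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1  # predicted this class and it was correct
            elif p == label and t != label:
                fp += 1  # predicted this class but the truth differs
            elif p != label and t == label:
                fn += 1  # missed an instance of this class
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical labels, for illustration only
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 2, 2, 2, 1]
precision, recall, f1 = micro_scores(y_true, y_pred)
```

When every data point has exactly one label, each error counts once as a false positive and once as a false negative, so the micro F1 coincides with plain accuracy (4 of 5 correct in this toy case).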
Model | Method | W.1 | W.2 | W.3 | R
---|---|---|---|---|---
Baseline | DNN | 86.15 | 80.02 | 66.95 | 85.3
Baseline | CNN (Yang et al., 2016) | 88.68 | 83.29 | 70.46 | 86.3
Baseline | RNN (Yang et al., 2016) | 89.46 | 83.96 | 72.12 | 88.4
Baseline | NBC | 78.14 | 68.8 | 46.2 | 83.6
Baseline | SVM (Zhang et al., 2008) | 85.54 | 80.65 | 67.56 | 86.9
Baseline | SVM (TF-IDF) (Chen et al., 2016) | 88.24 | 83.16 | 70.22 | 88.93
Baseline | Stacking SVM (Sun and Lim, 2001) | 85.68 | 79.45 | 71.81 | NA
Baseline | HDLTex (Kowsari et al., 2017) | 90.42 | 86.07 | 76.58 | NA
RMDL | 3 RDLs | 90.86 | 87.39 | 78.39 | 89.10
RMDL | 9 RDLs | 92.60 | 90.65 | 81.92 | 90.36
RMDL | 15 RDLs | 92.66 | 91.01 | 81.86 | 89.91
RMDL | 30 RDLs | 93.57 | 91.59 | 82.42 | 90.69
Two types of datasets (text and image) have been used to test and evaluate the performance of our approach. However, in theory the model has the capability to solve classification problems with a variety of data including video, text, and images.
For text classification, we used four different datasets, namely WOS, Reuters, IMDB, and 20NewsGroup.
Web Of Science (WOS) dataset (Kowsari et al., 2018) is a collection of academic articles’ abstracts which contains three corpora (5736, 11967, and 46985 documents) for (11, 34, and 134 topics).
The Reuters-21578 news dataset contains 10,788 documents, which are divided into 7,769 documents for training and 3,019 for testing, with a total of 90 classes.
The IMDB dataset contains 50,000 reviews, split into a set of 25,000 highly popular movie reviews for training and 25,000 for testing.
The 20NewsGroup dataset consists of documents with a bounded maximum length in words. It is split into a training set and a set of samples used for validation.
For image classification, two traditional ground-truth datasets are used, namely the MNIST handwriting dataset and CIFAR.
MNIST: this dataset contains handwritten digits, and the input feature space is in 28×28×1 format. The training and test sets contain 60,000 and 10,000 examples respectively.
CIFAR: this dataset consists of 60,000 images in 32×32×3 format assigned to 10 classes, with 6,000 images per class, split into 50,000 training and 10,000 test images. The classes are airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.
All of the results shown in this paper were obtained on Central Processing Units (CPU) and Graphics Processing Units (GPU). RMDL can be run using only GPU, only CPU, or both. The CPU used throughout these experiments was an Intel Xeon E5-2640 (2.6 GHz) with 12 cores and 64 GB of DDR3 memory. We also used three graphics cards on our machine: two Nvidia GeForce GTX 1080 Ti cards and one Nvidia Tesla K20c.
This work is implemented in Python using the Compute Unified Device Architecture (CUDA), a parallel computing platform and Application Programming Interface (API) model created by Nvidia. We used the TensorFlow and Keras libraries to create the neural networks (Abadi et al., 2016; Chollet et al., 2015).
Table 2: Error rate (%) of RMDL and the baselines for image classification.

| | Methods | MNIST | CIFAR-10 |
|---|---|---|---|
| Baseline | Deep L2-SVM (Tang, 2013) | 0.87 | 11.9 |
| | Maxout Network (Goodfellow et al., 2013) | 0.94 | 11.68 |
| | BinaryConnect (Courbariaux et al., 2015) | 1.29 | 9.90 |
| | PCANet-1 (Chan et al., 2015) | 0.62 | 21.33 |
| | gcForest (Zhou and Feng, 2017) | 0.74 | 31.00 |
| RMDL | 3 RDLs | 0.51 | 9.89 |
| | 9 RDLs | 0.41 | 9.1 |
| | 15 RDLs | 0.21 | 8.74 |
| | 30 RDLs | 0.18 | 8.79 |
Table 2 shows the error rate of RMDL for image classification. The comparison between RMDL and the baselines (as described in Section 3.2) shows that the error rate of RMDL on the MNIST dataset improves to 0.51, 0.41, and 0.21 for 3, 9, and 15 random models respectively. For the CIFAR-10 dataset, the error rate of RMDL decreases to 9.89, 9.1, 8.74, and 8.79 using 3, 9, 15, and 30 RDLs respectively.
Table 1 shows that, for four ground-truth datasets, RMDL improves the accuracy in comparison to the baselines. In Table 1, we evaluate our empirical results with four different RMDL models (using 3, 9, 15, and 30 RDLs). For Web of Science (WOS-5,736), the accuracy improves to 90.86, 92.60, 92.66, and 93.57 respectively. For Web of Science (WOS-11,967), the accuracy increases to 87.39, 90.65, 91.01, and 91.59 respectively, and for Web of Science (WOS-46,985) the accuracy increases to 78.39, 81.92, 81.86, and 82.42 respectively. The accuracy on Reuters-21578 is 89.10, 90.36, 89.91, and 90.69 respectively.
We also report results for two other ground-truth datasets, the Large Movie Review Dataset (IMDB) and 20NewsGroups. As shown in Table 3, RMDL improves the accuracy on both. The accuracy on the IMDB dataset is 89.91, 90.13, and 90.79 for 3, 9, and 15 RDLs respectively, whereas the accuracy of DNN is 88.55, CNN (Yang et al., 2016) is 87.44, RNN (Yang et al., 2016) is 88.59, the Naïve Bayes Classifier is 83.19, SVM (Zhang et al., 2008) is 87.97, and SVM using TF-IDF (Chen et al., 2016) is 88.45. The accuracy on the 20NewsGroup dataset is 86.73, 87.62, and 87.91 for 3, 9, and 15 random models respectively, whereas the accuracy of DNN is 86.50, CNN (Yang et al., 2016) is 82.91, RNN (Yang et al., 2016) is 83.75, the Naïve Bayes Classifier is 81.67, SVM (Zhang et al., 2008) is 84.57, and SVM using TF-IDF (Chen et al., 2016) is 86.00.
Table 3: Accuracy (%) of RMDL and the baselines on the IMDB and 20NewsGroup datasets.

| | Model | IMDB | 20NewsGroup |
|---|---|---|---|
| Baseline | DNN | 88.55 | 86.50 |
| | CNN (Yang et al., 2016) | 87.44 | 82.91 |
| | RNN (Yang et al., 2016) | 88.59 | 83.75 |
| | Naïve Bayes Classifier | 83.19 | 81.67 |
| | SVM (Zhang et al., 2008) | 87.97 | 84.57 |
| | SVM (TF-IDF) (Chen et al., 2016) | 88.45 | 86.00 |
| RMDL | 3 RDLs | 89.91 | 86.73 |
| | 9 RDLs | 90.13 | 87.62 |
| | 15 RDLs | 90.79 | 87.91 |
Figure 5 shows the accuracies and losses of the RDLs in RMDL for text classification and for image classification. As shown in Figure 4(a), the losses of several RDLs on the MNIST dataset increase over the later epochs; however, because the RMDL model combines multiple RDL models, the accuracy of the majority vote over these models, as presented in Table 2, remains competitive with our baselines.
In Figure 4(a), the models trained on the CIFAR dataset do not show an overfitting problem, but for the MNIST dataset the losses of at least 4 models increase over the epochs after a number of iterations. Although the accuracy and F1-measure of these models drop in later epochs, the accuracy of the majority vote remains robust, which means that RMDL effectively ignores them through majority voting across models. Figure 4(a) shows the loss over each epoch on two ground-truth datasets, CIFAR and IMDB, for the random deep learning (RDL) models. Figure 4(b) presents the accuracy of 15 random models for Reuters-21578. In Figure 4(b), the accuracy of the RDL models is also shown over each epoch for WOS-5736 (the Web of Science dataset with 11 categories and 5,736 documents); the majority vote of these models, as shown in Table 1, is competitive with our baselines.
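The robustness argument can be sketched with a small majority-vote example. This is an illustration under stated assumptions, not the RMDL implementation: the function name `majority_vote`, the tie-breaking rule (the label seen first among the tied ones wins), and the toy predictions are all hypothetical.

```python
from collections import Counter


def majority_vote(model_predictions):
    """Combine per-model label predictions into one label per sample.

    model_predictions: list of lists, one inner list of predicted labels
    per model, all of the same length (one entry per sample).
    Ties are broken in favor of the label encountered first.
    """
    n_samples = len(model_predictions[0])
    final = []
    for j in range(n_samples):
        votes = Counter(preds[j] for preds in model_predictions)
        final.append(votes.most_common(1)[0][0])
    return final


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up predictions from five models on four samples.
y_true = [0, 1, 2, 1]
predictions = [
    [0, 1, 2, 1],
    [0, 1, 2, 0],
    [0, 2, 2, 1],
    [1, 1, 0, 2],  # stands in for a model whose quality has degraded
    [0, 1, 2, 1],
]

ensemble = majority_vote(predictions)
print("weak model accuracy:", accuracy(y_true, predictions[3]))
print("ensemble accuracy:", accuracy(y_true, ensemble))
```

In this toy case the degraded model is outvoted on every sample, so the ensemble prediction is unaffected by it. This only holds while the degraded models remain a minority of the voters for each sample.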
Classification is an important problem in machine learning, given the growing number and size of datasets that need sophisticated classification. We propose a novel technique to solve the problem of choosing the best technique and method out of the many possible structures and architectures in deep learning. This paper introduces a new approach for classification called RMDL (Random Multimodel Deep Learning), which combines multiple deep learning approaches to produce random classification models. Our evaluation on datasets obtained from the Web of Science (WOS), Reuters, MNIST, CIFAR, IMDB, and 20NewsGroups shows that combinations of DNNs, RNNs, and CNNs with the parallel learning architecture achieve consistently higher accuracy than conventional approaches using naïve Bayes, SVM, or a single deep learning model. These results show that deep learning methods can provide improvements for classification and that they provide flexibility to classify datasets by using a majority vote. The proposed approach has the ability to improve the accuracy and efficiency of models and can be used across a wide range of data types and applications.