kaggle_redefining_cancer_treatment
Personalized Medicine: Redefining Cancer Treatment with deep learning
view repo
The continually increasing number of documents produced each year necessitates ever improving information processing methods for searching, retrieving, and organizing text. Central to these information processing methods is document classification, which has become an important application for supervised learning. Recently the performance of these traditional classifiers has degraded as the number of documents has increased. This is because along with this growth in the number of documents has come an increase in the number of categories. This paper approaches this problem differently from current document classification methods that view the problem as multiclass classification. Instead we perform hierarchical classification using an approach we call Hierarchical Deep Learning for Text classification (HDLTex). HDLTex employs stacks of deep learning architectures to provide specialized understanding at each level of the document hierarchy.
READ FULL TEXT VIEW PDFPersonalized Medicine: Redefining Cancer Treatment with deep learning
Each year scientific researchers produce a massive number of documents. In 2014 the 28,100 active, scholarly, peerreviewed, Englishlanguage journals published about 2.5 million articles, and there is evidence that the rate of growth in both new journals and publications is accelerating [1]
. The volume of these documents has made automatic organization and classification an essential element for the advancement of basic and applied research. Much of the recent work on automatic document classification has involved supervised learning techniques such as classification trees, naïve Bayes, support vector machines (SVM), neural nets, and ensemble methods. Classification trees and naïve Bayes approaches provide good interpretability but tend to be less accurate than the other methods.
However, automatic classification has become increasingly challenging over the last several years due to growth in corpus sizes and the number of fields and subfields. Areas of research that were little known only five years ago have now become areas of high growth and interest. This growth in subfields has occurred across a range of disciplines including biology (e.g., CRISPRCA9), material science (e.g., chemical programming), and health sciences (e.g., precision medicine). This growth in subfields means that it is important to not just label a document by specialized area but to also organize it within its overall field and the accompanying subfield. This is hierarchical classification.
Although many existing approaches to document classification can quickly identify the overall area of a document, few of them can rapidly organize documents into the correct subfields or areas of specialization. Further, the combination of toplevel fields and all subfields presents current document classification approaches with a combinatorially increasing number of class labels that they cannot handle. This paper presents a new approach to hierarchical document classification that we call Hierarchical Deep Learning for Text classification (HDLTex).^{1}^{1}1HDLTex is shared as an open source tool at https://github.com/kk7nc/HDLTex HDLTex combines deep learning architectures to allow both overall and specialized learning by level of the document hierarchy. This paper reports our experiments with HDLTex, which exhibits improved accuracy over traditional document classification methods.
Document classification is necessary to organize documents for retrieval, analysis, curation, and annotation. Researchers have studied and developed a variety of methods for document classification. Work in the information retrieval community has focused on search engine fundamentals such as indexing and dictionaries that are considered core technologies in this field [2]. Considerable work has built on these foundational methods to provide improvements through feedback and query reformulation [3, 4].
More recent work has employed methods from data mining and machine learning. Among the most accurate of these techniques is the support vector machine (SVM)
[5, 6, 7]. SVMs use kernel functions to find separating hyperplanes in highdimensional spaces. Other kernel methods used for information retrieval include string kernels such as the spectrum kernel
[8] and the mismatch kernel [9], which are widely used with DNA and RNA sequence data.SVM and related methods are difficult to interpret. For this reason many information retrieval systems use decision trees
[3] and naïve Bayes [10, 11] methods. These methods are easier to understand and, as such, can support query reformulation, but they lack accuracy. Some recent work has investigated topic modeling to provide similar interpretations as naïve Bayes methods but with improved accuracy [12].This paper uses newer methods of machine learning for document classification taken from deep learning. Deep learning is an efficient version of neural networks [13]
that can perform unsupervised, supervised, and semisupervised learning
[14]. Deep learning has been extensively used for image processing, but many recent studies have applied deep learning in other domains such as text and data mining. The basic architecture in a neural network is a fully connected network of nonlinear processing nodes organized as layers. The first layer is the input layer, the final layer is the output layer, and all other layers are hidden. In this paper, we will refer to these fully connected networks as Deep Neural Networks (DNN). Convolutional Neural Networks (CNNs) are modeled after the architecture of the visual cortex where neurons are not fully connected but are spatially distinct
[15]. CNNs provide excellent results in generalizing the classification of objects in images [16]. More recent work has used CNNs for text mining [17]. In research closely related to the work in this paper, Zhang et al. [18]used CNNs for text classification with characterlevel features provided by a fully connected DNN. Regardless of the application, CNNs require large training sets. Another fundamental deep learning architecture used in this paper is the Recurrent Neural Network (RNN). RNNs connect the output of a layer back to its input. This architecture is particularly important for learning timedependent structures to include words or characters in text
[19]. Deep learning for hierarchical classification is not new with this paper, although the specific architectures, the comparative analyses, and the application to document classification are new. Salakhutdinov [20, 21] used deep learning to hierarchically categorize images. At the top level the images are labeled as animals or vehicles. The next level then classifies the kind of animal or vehicle. This paper describes the use of deep learning approaches to create a hierarchical document classification approach. These deep learning methods have the promise of providing greater accuracy than SVM and related methods. Deep learning methods also provide flexible architectures that we have used to produce hierarchical classifications. The hierarchical classification our methods produce is not only highly accurate but also enables greater understanding of the resulting classification by showing where the document sits within a field or area of study.This paper compares fifteen methods for performing document classification. Six of these methods are baselines since they are used for traditional, nonhierarchical document classification. Of the six baseline methods three are widely used for document classification: termweighted support vector machines [22], multiword support vector machines[23], and naïve Bayes classification (NBC). The other three are newer deep learning methods that form the basis for our implementation of a new approach for hierarchical document classification. These deep learning methods are described in Section V.
Vapnik and Chervonenkis introduced the SVM in 1963 [24, 25], and in 1992 Boser et al. introduced a nonlinear version to address more complex classification problems [26]. The key idea of the nonlinear SVM is the generating kernel shown in Equation 1, followed by Equations 2 and 3:
(1)  
(2)  
(3) 
Text classification using string kernels within SVMs has been successful in many research projects [27]. The original SVM solves a binary classification problem; however, since document classification often involves several classes, the binary SVM requires an extension. In general, the multiclass SVM (MSVM) solves the following optimization:
(4) 
(5) 
where indicates number of classes, are slack variables, and is the learning parameter. To solve the MSVM we construct a decision function of all classes at once [22, 28]. One approach to MSVM is to use binary SVM to compare each of the pairwise classification labels, where is the number of labels or classes. Yet another technique for MSVM is oneversusall, where the two classes are one of the labels versus all of the other labels.
Naïve Bayes is a simple supervised learning technique often used for information retrieval due to its speed and interpretability [31, 2]. Suppose the number of documents is and each document has the label , where is the number of labels. Naïve Bayes calculates
(6) 
where is the document, resulting in
(7) 
The naïve Bayes document classifier used for this study uses wordlevel classification [11]. Let be the parameter for word , then
(8) 
Documents enter our hierarchical models via features extracted from the text. We employed different feature extraction approaches for the deep learning architectures we built. For CNN and RNN, we used the text vectorspace models using
dimensions as described in Glove [32]. A vectorspace model is a mathematical mapping of the word space, defined as(9) 
where is the length of the document , and is the Glove word embedding vectorization of word in document .
For DNN, we used countbased and term frequency–inverse document frequency (tfidf) for feature extraction. This approach uses counts for grams, which are sequences of words [33, 34]
. For example, the text “In this paper we introduced this technique” is composed of the following Ngrams:
Feature count (1): { (In, 1) , (this, 2), (paper, 1), (we, 1), (introduced, 1), (technique, 1) }
Feature count (2): { (In, 1) , (this, 2), (paper, 1), (we, 1), (introduced, 1), (technique, 1), (In this, 1), (This paper, 1), (paper we, 1),…}
Where the counts are indexed by the maximum grams. So Feature count (2) includes both 1grams and 2grams. The resulting DNN feature space is
(10) 
where is the feature space of document for grams of size , and is determined by word or gram counts. Our algorithm is able to use Ngrams for features within deep learning models [35].
The methods used in this paper extend the concepts of deep learning neural networks to the hierarchical document classification problem. Deep learning neural networks provide efficient computational models using combinations of nonlinear processing elements organized in layers. This organization of simple elements allows the total network to generalize (i.e., predict correctly on new data) [36]. In the research described here, we used several different deep learning techniques and combinations of these techniques to create hierarchical document classifiers. The following subsections provide an overview of the three deep learning architectures we used: Deep Neural Networks (DNN), Recurrent Neural Networks(RNN), and Convolutional Neural Networks (CNN).
In the DNN architecture each layer only receives input from the previous layer and outputs to the next layer. The layers are fully connected. The input layer consists of the text features (see IV) and the output layer has a node for each classification label or only one node if it is a binary classification. This architecture is the baseline DNN. Additional details on this architecture can be found in [37].
This paper extends this baseline architecture to allow hierarchical classification. Figure 1 shows this new architecture. The DNN for the first level of classification (on the left side in Figure 1) is the same as the baseline DNN. The second level classification in the hierarchy consists of a DNN trained for the domain output in the first hierarchical level. Each second level in the DNN is connected to the output of the first level. For example, if the output of the first model is labeled computer science then the DNN in the next hierarchical level (e.g., in Figure 1) is trained only with all computer science documents. So while the first hierarchical level DNN is trained with all documents, each DNN in the next level of the document hierarchy is trained only with the documents for the specified domain.
The DNNs in this study are trained with the standard backpropagation algorithm using both sigmoid (Equation 11
) and ReLU (Equation
12) as activation functions. The output layer uses softmax (Equation
13).(11)  
(12)  
(13)  
Given a set of example pairs the goal is to learn from the input and target spaces using hidden layers. In text classification, the input is generated by vectorization of text (see Section IV).
The second deep learning neural network architecture we use is RNN. In RNN the output from a layer of nodes can reenter as input to that layer. This approach has advantages for text processing [38]. The general RNN formulation is given in Equation 14 where is the state at time and refers to the input at step .
(14) 
We use weights to reformulate Equation 14 as shown in Equation 15 below:
(15) 
In Equation 15, is the recurrent matrix weight, are the input weights, is the bias, and is an elementwise function. Again we have modified the basic architecture for use in hierarchical classification. Figure 2 shows this extended RNN architecture.
Several problems (e.g., vanishing and exploding gradients) arise in RNNs when the error of the gradient descent algorithm is backpropagated through the network [39]
. To deal with these problems, long shortterm memory (LSTM) is a special type of RNN that preserves longterm dependencies in a more effective way compared with the basic RNN. This is particularly effective at mitigating the vanishing gradient problem
[40].Figure 3 shows the basic cell of an LSTM model. Although LSTM has a chainlike structure similar to RNN, LSTM uses multiple gates to regulate the amount of information allowed into each node state. A stepbystep explanation the LSTM cell and its gates is provided below:
Input Gate:
(16) 
Candid Memory Cell Value:
(17) 
Forget Gate Activation:
(18) 
New Memory Cell Value:
(19) 
Output Gate Values:
(20) 
In the above description,
is a bias vector,
is a weight matrix, and is the input to the memory cell at time . The and indices refer to input, cell memory, forget and output gates, respectively. Figure 3 shows the structure of these gates with a graphical representation.An RNN can be biased when later words are more influential than the earlier ones. To overcome this bias convolutional neural network (CNN) models (discussed in Section VC
) include a maxpooling layer to determine discriminative phrases in text
[41]. A gated recurrent unit (GRU) is a gating mechanism for RNNs that was introduced in 2014
[42]. GRU is a simplified variant of the LSTM architecture, but there are differences as follows: GRUs contain two gates, they do not possess internal memory (the in Figure 3), and a second nonlinearity is not applied ( in Figure 3).The final deep learning approach we developed for hierarchical document classification is the convolutional neural network (CNN). Although originally built for image processing, as discussed in Section II, CNNs have also been effectively used for text classification [15]. The basic convolutional layer in a CNN connects to a small subset of the inputs usually of size . Similarly the next convolutional layer connects to only a subset of its preceding layer. In this way these convolution layers, called feature maps, can be stacked to provide multiple filters on the input. To reduce computational complexity, CNNs use pooling to reduce the size of the output from one stack of layers to the next in the network. Different pooling techniques are used to reduce outputs while preserving important features [43]. The most common pooling method is maxpooling where the maximum element is selected in the pooling window. In order to feed the pooled output from stacked featured maps to the next layer, the maps are flattened into one column. The final layers in a CNN are typically fully connected. In general during the backpropagation step of a CNN not only the weights are adjusted but also the feature detector filters. A potential problem of CNNs used for text is the number of channels or size of the feature space. This might be very large (e.g., 50K words) for text, but for images this is less of a problem (e.g., only 3 channels of RGB) [14].
The primary contribution of this research is hierarchical classification of documents. A traditional multiclass classification technique can work well for a limited number classes, but performance drops with increasing number of classes, as is present in hierarchically organized documents. In our hierarchical deep learning model we solve this problem by creating architectures that specialize deep learning approaches for their level of the document hierarchy (e.g., see Figure 1). The structure of our Hierarchical Deep Learning for Text (HDLTex) architecture for each deep learning model is as follows:
hidden layers with cells in each hidden layer.
GRU and LSTM are used in this implementation, cells with GRU with two hidden layers.
Filter sizes of and maxpool of , layer sizes of with max pooling of , the CNN contains hidden layers.
All models used the following parameters: Batch Size = 128, learning parameters = , =0.9, =0.999, , , Dropout=0.5 (DNN) and Dropout=0.25 (CNN and RNN).
We used the following cost function for the deep learning models:
(21) 
where is the number of levels, indicates number of classes for each level, and refers to the number of classes in the child’s level of the hierarchical model.
We used two types of stochastic gradient optimization for the deep learning models in this paper: RMSProp and Adam. These are described below.
The basic stochastic gradient descent (SGD) is shown below:
(22)  
(23) 
For these equations, is the learning parameter, is the learning rate, and is the objective or cost function. The history of updates is defined by . To update parameters, SGD uses a momentum term on a rescaled gradient, which is shown in Equation (23). This approach to the optimization does not perform bias correction, which is a problem for a sparse gradient.
Adam is another stochastic gradient optimizer, which averages over only the first two moments of the gradient,
and , as shown below:(24)  
where  
(25)  
(26)  
(27) 
In these equations, and
are the first and second moments, respectively. Both are estimated as
and . This approach can handle the nonstationarity of the objective function as can RMSProp, but Adam can also overcome the sparse gradient issue that is a drawback in RMSProp [44].Our document collection had labels as shown in Table I.^{2}^{2}2WOS dataset is shared at http://archive.ics.uci.edu/index.php The target value has two levels, which are Computer Science, Electrical Engineering, Psychology, Mechanical Engineering, Civil Engineering, Medical Science, biochemistry and children levels of the labels, , which contain specific topics belonging to , respectively. To train and test the baseline methods described in Section III and the new hierarchical document classification methods described in Section V, we collected data and metadata on published papers available from the Web Of Science [45, 46]. To automate collection we used Selenium [47] with ChoromeDriver[48] for the Chrome web browser. To extract the data from the site we used Beautiful Soup [49]. We specifically extracted the abstract, domain, and keywords of this set of published papers. The text in the abstract is the input for classification while the domain name provides the label for the top level of the hierarchy. The keywords provide the descriptors for the next level in the classification hierarchy. Table I shows statistics for this collection. For example, Medical Sciences is one of the toplevel domain classifications and there are 53 subclassifications within this domain. There are also over 14k articles or documents within the domain of health sciences in this data set.
Domain 




Biochemistry  5,687  9  
Civil Engineering  4,237  11  
Computer Science  6,514  17  
Electrical Engineering  5,483  16  
Medical Sciences  14,625  53  
Mechanical Engineering  3,297  9  
Psychology  7,142  19  
Total  46,985  134 
We divided the data set into three parts as shown in Table II. Data set is the full data set with 46,985 documents, and data sets and are subsets of this full data set with the number of training and testing documents shown as well as the number of labels or classes in each of the two levels. For dataset , each of the seven level1 classes has five subclasses. For data set , two of the three higherlevel classes have four subclasses and the last highlevel class has three subclasses. We removed all special characters from all three data sets before training and testing.
Data Set  Training  Testing  Level 1  Level 2 

WOS11967  8018  3949  7  35 
WOS46985  31479  15506  7  134 
WOS5736  4588  1148  3  11 
WOS11967  WOS46985  WOS5736  
Methods  Accuracy  Methods  Accuracy  Methods  Accuracy  
Baseline  DNN  80.02  DNN  66.95  DNN  86.15  
CNN (Yang el. et. 2016)  83.29  CNN (Yang el. et. 2016)  70.46  CNN (Yang el. et. 20016)  88.68  
RNN (Yang el. et. 2016)  83.96  RNN (Yang el. et. 2016)  72.12  RNN (Yang el. et. 2016)  89.46  
NBC  68.8  NBC  46.2  NBC  78.14  
SVM (Zhang el. et. 2008)  80.65  SVM (Zhang el. et. 2008)  67.56  SVM (Zhang el. et. 2008)  85.54  
SVM (Chen el et. 2016)  83.16  SVM (Chen el et. 2016)  70.22  SVM (Chen el et. 2016)  88.24  
Stacking SVM  79.45  Stacking SVM  71.81  Stacking SVM  85.68  
HDLTex  DNN  DNN  83.73  DNN  DNN  70.10  DNN  DNN  88.37 
91.43  91.58  87.31  80.29  97.97  90.21  
DNN  CNN  83.32  DNN  CNN  71.90  DNN  CNN  90.47  
91.43  91.12  87.31  82.35  97.97  92.34  
DNN  RNN  81.58  DNN  RNN  73.92  DNN  RNN  88.42  
91.43  89.23  87.31  84.66  97.97  90.25  
CNN  DNN  85.65  CNN  DNN  71.20  CNN  DNN  88.83  
93.52  91.58  88.67  80.29  98.47  90.21  
CNN  CNN  85.23  CNN  CNN  73.02  CNN  CNN  90.93  
93.52  91.12  88.67  82.35  98.47  92.34  
CNN  RNN  83.45  CNN  RNN  75.07  CNN  RNN  88.87  
93.52  89.23  88.67  84.66  98.47  90.25  
RNN  DNN  86.07  RNN  DNN  72.62  RNN  DNN  88.25  
93.98  91.58  90.45  80.29  97.82  90.21  
RNN  CNN  85.63  RNN  CNN  74.46  RNN  CNN  90.33  
93.98  91.12  90.45  82.35  97.82  92.34  
RNN  RNN  83.85  RNN  RNN  76.58  RNN  RNN  88.28  
93.98  89.23  90.45  84.66  97.82  90.25 
The following results were obtained using a combination of central processing units (CPUs) and graphical processing units (GPUs). The processing was done on a with cores and memory, and the GPU cards were and . We implemented our approaches in Python using the Compute Unified Device Architecture (CUDA), which is a parallel computing platform and Application Programming Interface (API) model created by
. We also used Keras and TensorFlow libraries for creating the neural networks
[50, 51].Table III shows the results from our experiments. The baseline tests compare three conventional document classification approaches (naïve Bayes and two versions of SVM) and stacking SVM with three deep learning approaches (DNN, RNN, and CNN). In this set of tests the RNN outperforms the others for all three data sets. CNN performs secondbest for three data sets. SVM with term weighting [22] is third for the first two sets while the multiword approach of [23] is in third place for the third data set. The third data set is the smallest of the three and has the fewest labels so the differences among the three best performers are not large. These results show that overall performance improvement for general document classification is obtainable with deep learning approaches compared to traditional methods. Overall, naïve Bayes does much worse than the other methods throughout these tests. As for the tests of classifying these documents within a hierarchy, the HDLTex approaches with stacked, deep learning architectures clearly provide superior performance. For data set , the best accuracy is obtained by the combination RNN for the first level of classification and DNN for the second level. This gives accuracies of 94% for the first level, 92% for the second level and 86% overall. This is significantly better than all of the others except for the combination of CNN and DNN. For data set the best scores are again achieved by RNN for level one but this time with RNN for level 2. The closest scores to this are obtained by CNN and RNN in levels 1 and 2, respectively. Finally the simpler data set has a winner in CNN at level 1 and CNN at level 2, but there is little difference between these scores and those obtained by two other HDLTex architectures: DNN with CNN and RNN with CNN.
Document classification is an important problem to address, given the growing size of scientific literature and other document sets. When documents are organized hierarchically, multiclass approaches are difficult to apply using traditional supervised learning methods. This paper introduces a new approach to hierarchical document classification, HDLTex, that combines multiple deep learning approaches to produce hierarchical classifications. Testing on a data set of documents obtained from the Web of Science shows that combinations of RNN at the higher level and DNN or CNN at the lower level produced accuracies consistently higher than those obtainable by conventional approaches using naïve Bayes or SVM. These results show that deep learning methods can provide improvements for document classification and that they provide flexibility to classify documents within a hierarchy. Hence, they provide extensions over current methods for document classification that only consider the multiclass problem.
The methods described here can improved in multiple ways. Additional training and testing with other hierarchically structured document data sets will continue to identify architectures that work best for these problems. Also, it is possible to extend the hierarchy to more than two levels to capture more of the complexity in the hierarchical classification. For example, if keywords are treated as ordered then the hierarchy continues down multiple levels. HDLTex can also be applied to unlabeled documents, such as those found in news or other media outlets. Scoring here could be performed on small sets using human judges.
S. Tong and D. Koller, “Support vector machine active learning with applications to text classification,”
Journal of machine learning research, vol. 2, no. Nov, pp. 45–66, 2001., “A comparison of event models for naive bayes text classification,” in
AAAI98 workshop on learning for text categorization, vol. 752. Citeseer, 1998, pp. 41–48.Proceedings of the IEEE conference on computer vision and pattern recognition
, 2014, pp. 1717–1724.G. B. Huang, H. Lee, and E. LearnedMiller, “Learning hierarchical representations for face verification with convolutional deep belief networks,” in
Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2518–2525.Proceedings of the fifth annual workshop on Computational learning theory
. ACM, 1992, pp. 144–152., “Keras: Deep learning library for theano and tensorflow,”
https://keras.io/, 2015.