Recent advances in data acquisition and computing technologies have spurred the growing application of artificial intelligence in the medical domain. These advances have had a significant impact on low-level detection and recognition tasks in texts and images. However, applying these technologies to mission-critical applications such as health care carries a set of risks and challenges. For reliable deployment of AI in health care, we must probe the model’s understanding of the task and ensure that it is not merely exploiting systemic artifacts embedded in the data. Model distillation is essential to achieving this goal. For this reason, there has been increased research interest in either model distillation or techniques for explaining the decision process of deep learning models. This work falls under the first category.
In this paper, we conduct a direct analysis of CNN representations by mapping them to the n-gram features of text reports. We do so by modeling the CNN representation of a text report as a linear function of its n-gram features. A common hypothesis is that CNN representations collapse irrelevant information in the texts and thus should not be uniquely invertible. Moreover, for the information extraction task, we hypothesize that most of the information in the reports is irrelevant, so the model should act on only a small proportion of the text segments. Hence, we pose the problem of model reduction as finding a sparse model, and by doing so we obtain insights into which text segments may be important for the prediction.
Our contributions are as follows. First, we propose a general method to approximate CNN representations for text. We discuss and evaluate a non-negativity constraint as a model prior to obtain a sparse interpretable model. Second, we show that despite using a linear map of the n-gram representation for approximation, our method achieves the same accuracy as the CNN model. Third, we apply the pseudo-inverse of the linear map to approximately reconstruct sample text reports from a given representation for explaining an individual prediction. The rest of the manuscript is organized as follows. Section 2 provides an overview of related work. Section 3 describes the document representations and our method. Section 4 describes the de-identified datasets used for the experiments. Sections 5 and 6 present the experimental results and discussion.
2 Related work
Several techniques have been proposed to help improve our understanding of neural network representations (Erhan et al., 2009; Mahendran and Vedaldi, 2015; Yosinski et al., 2015; Nguyen et al., 2016; Shwartz-Ziv and Tishby, 2017; Doshi-Velez and Kim, 2017). Bach et al. (2015) proposed the layer-wise relevance propagation (LRP) technique to explain DNN classification decisions. LRP redistributes predictions backward through the layers of the model using local redistribution rules until it assigns a relevance score to each input component. A second technique, sensitivity analysis (Baehrens et al., 2010; Simonyan et al., 2013), explains model predictions based on the model’s gradients at input locations. Ribeiro et al. (2016) proposed a generalized technique, LIME, that provides locally interpretable explanations of model predictions by utilizing samples in the proximity of the input sample. They perturb the input, observe how the model behaves, and then weight the perturbations by their proximity to the original input to learn a locally interpretable model. Zhou et al. (2016) proposed the Class Activation Mapping (CAM) algorithm, which uses global average pooling before the final output layer to identify all the discriminative regions used by CNN models for a particular class. Selvaraju et al. (2017) proposed a generalized CAM (Grad-CAM), which fuses the class-conditional CAM with existing pixel-space gradient visualization techniques to find fine-grained discriminative regions. In this work, instead of developing methods to explain individual predictions, we approximate CNN representations as a linear transformation of n-gram presence features with a non-negativity and sparsity prior on the model weights to build a reduced explainable model.
This section introduces our model reduction framework. We illustrate the complete workflow in Figure 1. First, we extract the CNN representations of the text reports using the shallow CNN network (Kim, 2014; Agrawal et al., 2019), described in Section 3.1. Second, we generate the n-gram presence representations of the text reports, introduced in Section 3.2. Third, we construct the linear map between the n-gram presence representations and the CNN representations of the text reports and discard the spurious connections to obtain a sparse interpretable linear model, as described in Section 3.3. Next, we review the CNN and n-gram presence representations of the text reports and describe the model reduction method in detail.
3.1 CNN representation
The shallow CNN architecture was first developed for text classification by Kim (2014) and later successfully applied to cancer pathology reports by Qiu et al. (2017). The CNN takes a text document as its input, which is represented as a sequence of word tokens $x = (x_1, \dots, x_L)$ from a provided vocabulary of size $V$. The input sequence is first passed through a word embedding layer that maps each token to a word vector of dimension $d$, where $d \ll V$. The embedding layer maps the document to a matrix $E \in \mathbb{R}^{L \times d}$ whose elements are given by

$$E_{ij} = (w_{x_i})_j, \tag{1}$$

where $w_{x_i} \in \mathbb{R}^{d}$ is the word embedding vector of word token $x_i$. The embedding output is then passed through $M \times F$ convolutional filters, where $M$ is the number of convolutional modules and $F$ is the number of convolutional filters per module. A convolutional module with $F$ convolutional filters, all of width $k$, is defined as

$$c_{i,f} = \sigma\Big( \sum_{s=1}^{k} \sum_{j=1}^{d} W_{f,s,j}\, E_{i+s-1,\,j} + b_f \Big), \tag{2}$$

where $W_{f,s,j}$ and $b_f$ are elements of the convolutional weight matrix and the $F$-dimensional bias vector, respectively, and $\sigma(a) = \max(0, a)$ is the rectified linear activation function. A max-pooling layer is used at the end of the convolution operation to render a translation-invariant feature. The output vectors of all convolutional modules are then concatenated, producing a CNN representation of dimension $MF$.
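The forward pass described in this subsection (embedding lookup, convolution with rectified linear activation, max-pooling over positions, and concatenation across modules) can be sketched in NumPy as follows. This is only an illustrative sketch: the function name, the array layout, and the toy sizes are assumptions made for the example, and the weights are random rather than trained.

```python
import numpy as np


def shallow_cnn_representation(token_ids, embedding, conv_weights, conv_biases):
    """Feature network of a Kim (2014)-style shallow CNN (forward pass only).

    token_ids:    (L,) integer token indices
    embedding:    (V, d) word embedding matrix
    conv_weights: one array of shape (F, k, d) per convolutional module
    conv_biases:  one array of shape (F,) per convolutional module
    """
    E = embedding[token_ids]  # (L, d) embedded document
    pooled = []
    for W, b in zip(conv_weights, conv_biases):
        k = W.shape[1]
        # Stack every window of k consecutive word vectors: (L - k + 1, k, d)
        windows = np.stack([E[t:t + k] for t in range(len(E) - k + 1)])
        # Apply all F filters to every window, add the bias, then ReLU
        activations = np.maximum(np.einsum("tkd,fkd->tf", windows, W) + b, 0.0)
        # Max-pool over positions so that each filter yields one feature
        pooled.append(activations.max(axis=0))
    return np.concatenate(pooled)


# Toy sizes chosen for illustration only (not the settings used in the paper)
rng = np.random.default_rng(0)
V, d, F, widths = 50, 8, 4, (3, 4, 5)
embedding = rng.normal(size=(V, d))
conv_weights = [rng.normal(size=(F, k, d)) for k in widths]
conv_biases = [np.zeros(F) for _ in widths]
token_ids = rng.integers(0, V, size=20)
representation = shallow_cnn_representation(token_ids, embedding, conv_weights, conv_biases)
print(representation.shape)  # (12,) = 3 modules x 4 filters
```

Because the maximum is taken over all window positions, each entry of the representation records only how strongly a filter fired somewhere in the document, not where it fired.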
3.2 N-gram presence representation
The n-gram representation is a popular bag-of-words representation of text documents that captures short-sequence information and ignores longer sequences. Term frequency and inverse document frequency (tf-idf) are the two most popular choices for document representation in this category (Ramos et al., 2003). In this work, we instead use the n-gram presence representation to represent the cancer reports for the information extraction task. The motivation behind this choice is to select a representation that is least likely to produce an overfitted model. In the n-gram presence representation, we use the presence or absence of short word sequences in the reports as features. We denote the n-gram presence features of a document by

$$z = (z_1, \dots, z_p) \in \{0, 1\}^{p}, \tag{3}$$

where $z_j = 1$ when the $j$th n-gram is present in the document and $z_j = 0$ when it is absent.
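As a minimal sketch of how such presence features can be computed, the example below uses scikit-learn's `CountVectorizer` with `binary=True`, which sets every non-zero count to 1. The three toy sentences and the `min_df=1` setting are assumptions made for illustration; they are not taken from the pathology reports or from the experimental settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents standing in for report text
reports = [
    "invasive ductal carcinoma of the left breast",
    "adenocarcinoma of the left lung upper lobe",
    "invasive ductal carcinoma of the right breast",
]

# Word n-grams of length 3 to 5; binary=True records presence instead of counts
vectorizer = CountVectorizer(ngram_range=(3, 5), binary=True, min_df=1)
Z = vectorizer.fit_transform(reports)  # sparse matrix, documents x n-grams

# Presence pattern of one n-gram across the three documents
column = vectorizer.vocabulary_["invasive ductal carcinoma"]
print(Z[:, column].toarray().ravel())
```

Raising `min_df` drops n-grams that occur in fewer documents than the threshold, which is one way to discard rare n-grams before model reduction.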
3.3 Model reduction
In this section, we introduce our method to distill the CNN representations by finding the active set of n-gram presence features in the pathology reports via a non-negative linear map approximation of the CNN representations. Consider the functional dependence of the label variable $y$ on the explanatory document variable $x$ through a non-linear mapping $f$. Given a set of $N$ observations $\{(x_i, y_i)\}_{i=1}^{N}$ of $x$ and $y$, we estimate the function $f$ by training a shallow CNN model as in Kim (2014) and obtain a set of CNN representations $\{r_i\}_{i=1}^{N}$ with $r_i \in \mathbb{R}^{D}$. We also extract a set of n-gram presence features $\{z_i\}_{i=1}^{N}$ with $z_i \in \{0, 1\}^{p}$ from the pathology reports. We approximate the CNN representations by a linear map $A \in \mathbb{R}^{D \times p}$ of the n-gram presence features with non-negative weights by solving a non-negative least squares problem,

$$\min_{A \ge 0} \; \sum_{i=1}^{N} \lVert r_i - A z_i \rVert_2^2. \tag{4}$$

By minimizing (4), we obtain a dense linear map $A$ with non-negative weights. To discard the spurious associations, we set to zero the weights in $A$ of the n-gram features that contribute little. The overall contribution of the $j$th n-gram feature to the CNN representation is calculated by aggregating the weights in the $j$th column of the linear map $A$. We keep only the top $K$ contributing n-gram presence features for building a reduced interpretable model. We note that we place a non-negativity prior on the weights of the linear map in order to discard complex solutions that may overfit when $p \gg N$, where $p$ is the number of n-gram features and $N$ is the number of available text reports.
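The reduction step can be sketched on synthetic data as follows. The sketch fits a non-negative linear map from n-gram presence features to the CNN representations, sums the weights attached to each n-gram feature, and keeps the highest-ranked features. It assumes SciPy's `scipy.optimize.nnls` as a stand-in solver, which is not the large-scale solver used in the experiments, and the function names, matrix shapes, and toy data are hypothetical.

```python
import numpy as np
from scipy.optimize import nnls


def fit_nonnegative_map(Z, R):
    """Fit a map A >= 0 from presence features to CNN features by least squares.

    Z: (N, p) n-gram presence features, R: (N, D) CNN representations.
    The objective separates over the D rows of A, so each row is one NNLS fit.
    """
    return np.vstack([nnls(Z, R[:, i])[0] for i in range(R.shape[1])])


def top_contributing_features(A, K):
    """Indices of the K n-gram features with the largest column sums of A."""
    contribution = A.sum(axis=0)
    return np.argsort(contribution)[::-1][:K]


# Synthetic check: only n-grams 1, 4 and 7 drive the toy representations
rng = np.random.default_rng(0)
N, p, D = 40, 10, 5
Z = rng.integers(0, 2, size=(N, p)).astype(float)
A_true = np.zeros((D, p))
A_true[:, [1, 4, 7]] = rng.uniform(0.5, 1.5, size=(D, 3))
R = Z @ A_true.T

A = fit_nonnegative_map(Z, R)
active = top_contributing_features(A, K=3)
Z_reduced = Z[:, active]  # presence features retained for the reduced model
print(sorted(active.tolist()))
```

Solving one small dense NNLS problem per CNN feature is convenient for a toy example; with many thousands of n-gram features, a solver designed for large sparse problems would be the more practical choice.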
To validate our model, we used the primary-site extraction task on de-identified pathology reports collected from SEER cancer registries (CT, HI, KY, NM, Seattle). The reports were labeled by state cancer registries as per the coding guidelines issued by SEER. The primary-site extraction task is to label each report with the standard ICD-O-3 topography code that encodes the tumor location. A summary of the labels, including the ICD-O-3 codes, and an example of the unstructured text section of a pathology report are provided by Dubey et al. (2019b). The cancer pathology reports were provided in XML format. We discarded the metadata associated with the pathology reports and used only the unstructured text sections of the XML files for the task. We adopted the same pre-processing steps as Dubey et al. (2019b).
5 Experimental results
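Table 1: Classification accuracy of the reduced model compared with other methods for the primary-site classification task.

| Features | Reduction | Classifier | Model type | Active features | Accuracy |
|---|---|---|---|---|---|
| N-gram term frequency | LSIR (Dubey et al., 2019a) | kNN | dense | 6855 | 0.80 |
| N-gram term frequency | mRMR+LSIR (Dubey et al., 2019b) | kNN | sparse | 2000 | 0.82 |
| CNN (our impl.) | - | Softmax linear | dense | - | 0.87 |
| N-gram presence | Model reduction (Section 3.3) | SIGP | sparse | 3000 | 0.87 |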
The goal of our experiments is to compare the classification accuracy and complexity of the reduced model with the CNN baseline and with two other inverse-regression-based methods (Dubey et al., 2019a, b) explored previously for information extraction from cancer pathology reports. We report the average test accuracy over the ten test folds obtained by ten-fold stratified cross-validation of the de-identified reports. For each training-test split, we further split the training fold into training and validation sets, stratified to maintain the relative class frequencies.
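A minimal sketch of this splitting protocol with scikit-learn is given below. The validation fraction, the seed, and the toy labels are assumptions made for the example, since the text does not state them.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split


def stratified_splits(labels, n_folds=10, val_fraction=0.1, seed=0):
    """Yield (train, validation, test) index arrays for each stratified fold."""
    labels = np.asarray(labels)
    outer = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train_val_idx, test_idx in outer.split(np.zeros(len(labels)), labels):
        # Split the training fold again, preserving relative class frequencies
        train_idx, val_idx = train_test_split(
            train_val_idx,
            test_size=val_fraction,  # assumed value, not given in the text
            stratify=labels[train_val_idx],
            random_state=seed,
        )
        yield train_idx, val_idx, test_idx


labels = np.repeat([0, 1, 2], 40)  # toy labels: 3 classes, 40 reports each
splits = list(stratified_splits(labels))
train_idx, val_idx, test_idx = splits[0]
print(len(splits), len(train_idx), len(val_idx), len(test_idx))
```

Fixing the seed makes the folds reproducible, so the same partition can be reused when features are extracted separately for different models.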
For the CNN baseline, we used the CNN model of Kim (2014) with word embedding dimensions, convolutional modules of filter width 3, 4, and 5, convolution filters per module, and an l2-regularization parameter of 0.001 in the loss function. The feature network weights and classifier weights of the shallow CNN model are initialized randomly. The CNN model is trained in each fold for up to epochs using the Adam optimizer with minibatch size and a learning rate of . We used early stopping with patience . For each fold, we selected the CNN model with the highest validation accuracy as the final model. We used the final models for prediction on the test sets to estimate the average classification accuracy. We also passed all the de-identified reports through the final model of each fold to obtain the CNN representations for model reduction.
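The model selection rule described above (early stopping with a patience window, keeping the model with the highest validation accuracy) can be sketched independently of any deep learning framework. The function below and its callback arguments are hypothetical; the epoch limit and patience are left as parameters, and the scripted validation curve only stands in for a real training loop.

```python
def train_with_early_stopping(train_one_epoch, validation_accuracy, max_epochs, patience):
    """Keep the model state with the best validation accuracy.

    train_one_epoch():          runs one training epoch and returns the model state
    validation_accuracy(state): returns the validation accuracy of that state
    Training stops once `patience` consecutive epochs bring no improvement.
    """
    best_state, best_accuracy, best_epoch = None, float("-inf"), -1
    for epoch in range(max_epochs):
        state = train_one_epoch()
        accuracy = validation_accuracy(state)
        if accuracy > best_accuracy:
            best_state, best_accuracy, best_epoch = state, accuracy, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_state, best_accuracy, best_epoch


# Scripted validation curve standing in for a real training loop
curve = [0.50, 0.60, 0.70, 0.65, 0.66, 0.64, 0.90]
epochs = iter(range(len(curve)))
state, accuracy, epoch = train_with_early_stopping(
    train_one_epoch=lambda: next(epochs),
    validation_accuracy=lambda s: curve[s],
    max_epochs=len(curve),
    patience=3,
)
print(state, accuracy, epoch)  # 2 0.7 2
```

In a real training loop the returned state would need to be a snapshot (for example, a copy of the weights), since later epochs would otherwise overwrite the best model.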
For the model reduction experiments, we extracted n-gram features of length 3, 4, and 5 from the reports using the CountVectorizer class from the sklearn library. We obtain the corresponding n-gram presence features of the de-identified reports for all ten folds by using the same seed as the CNN experiment. We discard any n-gram features that have a document frequency strictly lower than . We extracted a total of features. We binarize the features to obtain the n-gram presence features (see Section 3.2). We used the fast non-negative least squares solver of Kim et al. (2013) to obtain the linear map. We used the default parameter settings recommended in the released package (Kim et al., 2012). We computed the overall contribution of each n-gram feature to the CNN features by summing its weights in the linear map. We keep only the top features for building a reduced model. We build a classifier using subspace-induced Gaussian processes (SIGP) (Tan and Mukherjee, 2018) to get the final prediction for each fold. We used the radial basis function (RBF) kernel in the SIGP and learned its scale parameter through Bayesian optimization.
Table 1 shows a comparison of the classification accuracy of the reduced model with other methods for the primary-site classification task. The results show that the reduced model achieves the same classification accuracy as the CNN model despite using less than of the total n-gram features. The results also show a gap in accuracy between the traditional inverse-regression-based methods and the CNN-based models for the task.
We have developed a non-negative least squares based model reduction technique for discarding the spurious associations between the CNN and n-gram representations to achieve an explainable model for information extraction. We have experimentally demonstrated that the reduced model on the n-gram presence features, despite using only a fraction of the total text segments, produces the same accuracy as the CNN model. Lasso regression (Tibshirani, 1996) is often used in the literature to render a sparse model. However, our experiments with lasso regression failed to achieve an accurate sparse model. We hypothesize that lasso regression renders a relatively complex linear map that overfits the CNN representation by allowing numerical cancellations in the modeling. By using non-negativity constraints on the model, we were able to render a simpler sparse interpretable model.
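The notion of numerical cancellation can be illustrated with a toy example. The sketch below does not reproduce the lasso experiments; it only contrasts an unconstrained least squares fit with a non-negative one on two hypothetical, overlapping presence features. With signed weights, the fit can match the target exactly by subtracting one feature from the other, whereas the non-negative fit cannot cancel and keeps a single active feature.

```python
import numpy as np
from scipy.optimize import nnls

# Two overlapping toy n-gram features: b is present only where a is present
a = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 1.0, 1.0, 0.0, 0.0])
Z = np.column_stack([a, b])
r = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])  # one toy CNN feature, equal to a - b

w_signed, *_ = np.linalg.lstsq(Z, r, rcond=None)  # unconstrained least squares
w_nonneg, _ = nnls(Z, r)                          # non-negative least squares

print(np.round(w_signed, 3))  # weights +1 and -1: an exact fit through cancellation
print(np.round(w_nonneg, 3))  # weights 0.25 and 0: only the first feature stays active
```

In the signed solution both features receive large weights even though the second one never co-occurs with a positive target value, which is the kind of attribution that is hard to interpret.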
Moving forward, we would like to examine the effectiveness of the model reduction on larger datasets. Our future work will address any scalability hurdles that we may encounter in solving large-scale NNLS problems. In this work, we employed a finite-sample variant of the integral Gaussian process (SIGP) primarily for rendering the class-label prediction. However, SIGP can also be useful for estimating the full predictive posterior distribution, which may yield a better uncertainty estimate than the softmax scores of a traditional CNN. SIGP admits a strictly larger set of functions than the Gaussian process, including solutions to the Tikhonov regularization problem and Bayesian kernel models, and thus can prove useful for uncertainty quantification (Tan and Mukherjee, 2018), which we will examine in our future work.
References
- Agrawal et al. (2019). Deep kernel learning for information extraction from cancer pathology reports.
- Bach et al. (2015). On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE 10 (7), pp. e0130140.
- Baehrens et al. (2010). How to explain individual classification decisions. Journal of Machine Learning Research 11 (Jun), pp. 1803–1831.
- Doshi-Velez and Kim (2017). Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.
- Dubey et al. (2019a). Inverse regression for extraction of tumor site from cancer pathology reports. In 2019 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI), pp. 1–4.
- Dubey et al. (2019b). Extraction of tumor site from cancer pathology reports using deep filters. In Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB ’19, New York, NY, USA, pp. 320–327.
- Erhan et al. (2009). Visualizing higher-layer features of a deep network. University of Montreal 1341 (3), pp. 1.
- Kim et al. (2012). A non-monotonic method for large-scale non-negative least squares. Optimization Methods and Software (OMS).
- Kim et al. (2013). A non-monotonic method for large-scale non-negative least squares. Optimization Methods and Software 28 (5), pp. 1012–1039.
- Kim (2014). Convolutional neural networks for sentence classification. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1746–175.
- Mahendran and Vedaldi (2015). Understanding deep image representations by inverting them. pp. 5188–5196.
- Nguyen et al. (2016). Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In Advances in Neural Information Processing Systems, pp. 3387–3395.
- Qiu et al. (2017). Deep learning for automated extraction of primary sites from cancer pathology reports. IEEE Journal of Biomedical and Health Informatics 22 (1), pp. 244–251.
- Ramos et al. (2003). Using tf-idf to determine word relevance in document queries. In Proceedings of the First Instructional Conference on Machine Learning, Vol. 242, pp. 133–142.
- Ribeiro et al. (2016). Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144.
- Selvaraju et al. (2017). Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626.
- Shwartz-Ziv and Tishby (2017). Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810.
- Simonyan et al. (2013). Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.
- Tan and Mukherjee (2018). Subspace-induced Gaussian processes. arXiv preprint arXiv:1802.07528.
- Tibshirani (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58 (1), pp. 267–288.
- Yosinski et al. (2015). Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579.
- Zhou et al. (2016). Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929.