Cancer is a major concern worldwide, as it decreases the quality of life and leads to premature mortality. In addition, it is one of the most complex and difficult-to-treat diseases, with significant social implications, both in terms of mortality rate and in terms of costs associated with treatment and disability [1, 2, 3, 4]. Measuring the burden of disease is one of the main concerns of public healthcare operators. Suitable measures are necessary to describe the general state of the population’s health, to establish public health goals, and to compare the national health status and performance of health systems across countries. Furthermore, such studies are needed to assess the allocation of health care and health research resources across disease categories and to evaluate the potential costs and benefits of public health interventions.
Cancer registries emerged during the last few decades as a strategic tool to quantify the impact of the disease and to provide analytical data to healthcare operators and decision makers. Cancer registries use administrative and clinical data sources in order to identify all the new cancer diagnoses in a specific area and time period and collect incidence records that provide details on the diagnosis and the outcome of treatments. Mining cancer registry datasets can help towards the development of global surveillance programs  and can provide important insights such as survivability . Although data analysis software would best operate on structured representations of the reports, pathologists normally enter data items as free text in the local country language. This requires intelligent algorithms for medical document information extraction, retrieval, and classification, an area that has received significant attention in the last few years (see, e.g.,  for a recent account and  for the specific case of cancer).
The study of intelligent algorithms is also motivated by the inherent slowness of the cancer registration process, which is partially based on manual revision and also requires the interpretation of pathological reports written as free text [10, 11, 12]. In practice, significant delays in data production and publication may occur. This weakens data relevance for the purpose of assessing compliance with updated regional recommended integrated case pathways, as well as for public health purposes. Improving automated methods to generate a list of putative incident cases and to automatically estimate process indicators is thus an opportunity to perform an up-to-date evaluation of cancer-care quality. In particular, machine learning techniques like the ones presented in this paper could overcome the delay in cancer case definition by the cancer registry and pave the way towards powerful tools for obtaining indicators automatically and in a timely fashion.
In our specific context, pathology reports can be classified according to codes defined in the International Classification of Diseases for Oncology, third edition (ICD-O3) system , a specialization of the ICD for the cancer domain which is internationally adopted as the standard classification for topography and morphology . The development of text analysis tools specifically devoted to the automatic classification of incidence records according to ICD-O3 codes has been addressed in a number of previous papers (see Section 2 below). Some works focused on reasonably large datasets but used simple linear classifiers based on bag-of-words representations of text [14, 15]. Most other works applied recent state-of-the-art deep learning techniques [16, 17] but used smaller datasets and were restricted to a partial set of tumors. A remarkable exception is , which applies convolutional networks to a large dataset. Additionally, the use of deep learning techniques usually requires accurate domain-specific word vectors (embeddings of words in a vector space) that can be derived from word co-occurrences in large corpora of unlabeled text [19, 20, 21]. Large medical corpora are easily available for English (e.g. PubMed) but not necessarily for other languages.
To the best of our knowledge, the present work is the first to report results on a large dataset ( labeled reports for supervised learning and unlabeled reports for pretraining word vectors), with a large number of both topography and morphology classes, comparing several alternative state-of-the-art deep learning techniques, namely Gated Recurrent Unit (GRU) Recurrent Neural Networks (RNN) , with and without attention , Bidirectional Encoder Representations from Transformers (BERT) , and Convolutional Neural Networks (CNN). In particular, we are interested in evaluating on real data the effectiveness of attention models, comparing them with a simpler form based on max aggregation. We also report an extensive study on the interpretability of the trained classifiers. Our results confirm that recent deep learning techniques are very effective on this task, with attentive GRUs reaching a multiclass accuracy of 90.3% on topography (61 classes) and 84.8% on morphology (134 classes), but (1) hierarchical models do not achieve better accuracy than flat models, (2) the improvement over a simple support vector machine classifier on bag-of-words is modest, (3) a simpler aggregator taking the element-wise maximum of hidden representations over time improves slightly over (flat and hierarchical) attention models for topography prediction, while a flat attention model is better for the morphology task, and (4) the improvement of flat models over hierarchical ones is stronger for difficult-to-learn rare classes. We additionally show that the element-wise maximum aggregator offers a new alternative strategy for interpreting prediction results.
2 Related Works
Early works for ICD-O3 code assignment were structured as rule-based systems, where the code was assigned by creating a set of handcrafted text search queries and combining the results with standard Boolean operators. In order to prevent spurious matches, rules need to be very specific, making it very difficult to achieve a sufficiently high recall on future (unseen) cases.
More recent works also employ rule-based approaches. Coden et al.  implemented a knowledge representation model that they populated by processing cancer pathology reports with Natural Language Processing (NLP) techniques. They performed categorization of classes using rules based on syntactic structure. They also experimented with machine learning methods, without satisfactory results. They validated the model using a small corpus of 302 pathology reports related to colon cancer, obtaining an F1 score of for primary tumor classification and for metastatic tumors. Nguyen et al.  developed a rule-based system evaluated on a set of pathology reports with full site classes (site plus sub-site) and type classes. They obtained F1 scores of and for site and type, respectively.
A number of studies reporting on the application of machine learning to this problem have been published during the last decade. Direct comparisons among these works are impossible due to the (not surprising) lack of standard publicly available datasets and the presence of heterogeneous details in the settings. Still, we highlight the main differences among them in order to provide some background. In , the authors employed support vector machine (SVM) and Naive Bayes classifiers on a small dataset of French pathology reports and a reduced number of target classes (26 topographic classes and 18 morphological classes), reporting an accuracy of 72.6% on topography and 86.4% on morphology with SVM. A much larger dataset of English reports from the Kentucky Cancer Registry was later employed in , where linear classifiers (SVM, Naive Bayes, and logistic regression) were also compared, but only on the topography task and using 57, 42, and 14 classes (determined by considering classes with at least 50, 100, and 1000 examples, respectively). The authors reported a micro-averaged F1 measure of 90% on 57 classes using SVM with both unigrams and bigrams. Still, the bag-of-words representations used by these linear classifiers do not consider word order and are unable to capture similarities and relations among words (which are all represented by orthogonal vectors). Deep learning techniques are known to overcome these limitations but were not applied to this problem until very recently. In , a CNN architecture fed by word vectors pretrained on PubMed was applied to a small corpus of 942 breast and lung cancer reports in English with 12 topography classes; the authors demonstrated the superiority of this approach compared to linear classifiers, with significant increases in both micro and macro F1 measures. In , the same research group experimented on the same dataset using RNNs with hierarchical attention , obtaining further improvements over the CNN architecture. The same research group also implemented in  two CNN-based multitask learning techniques and trained them on a large dataset of pathology reports ( unique tumors) from the Louisiana Tumor Registry. The models were trained on five tasks: topography main site (65 classes), laterality (4 classes), behavior (3 classes), morphology type (63 classes), and morphology grade (5 classes). They reached micro and macro F1 scores of and for site prediction and of and for type prediction, respectively.
Recent works investigated the interpretability of supervised machine learning models. In , a novel technique called LIME explains the prediction of any classifier or regressor by locally approximating it.
3 Materials and Methods
We collected a set of anonymized anatomopathological exam results from the Tuscany region cancer registry in the period 1990-2014, for which we obtained the approval of the institutional ethics committee (CEAV 14081_oss 27/11/2018). About of these records refer to a positive tumor diagnosis and have topographical and morphological ICD-O3 labels, determined by tumor registry experts. The remaining reports are associated with non-cancerous tissues or are unlabeled. When multiple pathological records for the same patient existed for the same tumor, cancer registry experts selected the most informative report in order to assign the ICD-O3 code to that tumor case, leaving a set of labeled reports. In our dataset each labeled report corresponds to the primary report for a single tumor case, thus the classification was performed at report level.
The histological exam records consist of three free-text fields (not all of them always filled in) reporting tissue macroscopy, diagnosis, and, in some cases, the patient’s anamnesis. We found that field semantics was not always used consistently and that the amount of provided detail varied significantly, from extremely synthetic to very detailed assessments of the morphology and the diagnosis. Field length ranged from to words, with lower, middle, and upper quartiles of 34, 62, and 134, respectively. For these reasons, we merged the three text fields into a single text document. We normalized case by converting all letters to uppercase and kept punctuation. We finally removed duplicates (records having exactly the same text) and reports labeled with extremely rare ICD-O3 codes (1048 samples that do not appear in the training, validation, or test sets). In the end we obtained a dataset suitable for supervised learning consisting of labeled records ( of them in the period 2004-2012). We further split the records into sentences when using hierarchical models. For this purpose we employed the spaCy sentence segmentation tool (https://spacy.io/).
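The preprocessing steps above (field merging, case normalization, deduplication) can be sketched as follows; the field names and sample records are illustrative, not the registry's actual schema.

```python
# Sketch of the preprocessing pipeline described above. Field names
# ("macroscopy", "diagnosis", "anamnesis") are illustrative placeholders.
def preprocess(records):
    """records: list of dicts whose free-text fields may be missing."""
    seen, cleaned = set(), []
    for rec in records:
        # Merge the three free-text fields into a single document.
        text = " ".join(
            rec.get(field, "") for field in ("macroscopy", "diagnosis", "anamnesis")
        )
        # Case normalization: uppercase, punctuation kept.
        text = text.strip().upper()
        # Remove exact duplicates (records with the same text).
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned

docs = preprocess([
    {"macroscopy": "frammento di mucosa", "diagnosis": "adenoma tubulare"},
    {"diagnosis": "Adenoma tubulare", "anamnesis": ""},
    {"macroscopy": "FRAMMENTO DI MUCOSA", "diagnosis": "ADENOMA TUBULARE"},
])
```

The third record collapses onto the first after case normalization, illustrating why deduplication is applied after merging and uppercasing.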
After preprocessing, our documents had an average length of 105 words and contained on average 13 sentences (detailed distributions are reported in Figure 3 of Appendix 6). These statistics indicate that reports tend to be much shorter than in other studies. For example, in  the average numbers of words and sentences per document were 1290 and 117, respectively (note, however, that in that study about 76% of the documents consisted of a single report, and the rest of concatenated reports: two reports in 17.7% of cases, three in 4.2%, and four or more in 2.1%). The language is also often synthetic, rich in keywords, and poor in verbs (three sample reports are shown in Figure 1).
ICD-O3 codes describe both topography (tumor site) and morphology. A topographical ICD-O3 code is structured as Cmm.s, where mm represents the main site and s the sub-site. For example, C50.2 is the code for the upper-inner quadrant (2) of the breast (50). A morphological ICD-O3 code is structured as tttt/b, where tttt represents the cell type and b the tumor behavior (benign, uncertain, in-situ, malignant primary site, malignant metastatic site). For example, 8140/3 is the code for a malignant adenocarcinoma (adenocarcinoma 8140; malignant behavior 3). We defined two associated multi-class classification tasks: (1) main tumor site prediction (topography) and (2) type prediction (morphology). The topography task only considers the first part of the topographical ICD-O3 code, before the dot, without the sub-site. The morphology task only considers the first part of the morphological ICD-O3 code, before the slash, without the behavior. As shown in Figure 4 (Appendix 6), our dataset is highly unbalanced, with many of the 71 topographical and 435 morphological classes found in the data being very rare. In an attempt to reduce bias in the estimated performance (particularly for the macro F1 measure, see below), we removed classes with fewer than five records in the test set, resulting in 61 topographical and 134 morphological classes. Even after these removals, our tasks have no fewer classes than previous works (the most comprehensive previous study  has 65 topographical classes and 63 histological classes).
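Deriving the two task labels from full ICD-O3 codes amounts to simple string operations, as in this minimal sketch:

```python
# Extracting the two task labels from full ICD-O3 codes.
def topography_label(code):
    # Topography task: keep the main site, before the dot (sub-site dropped).
    # e.g. "C50.2" -> "C50"
    return code.split(".")[0]

def morphology_label(code):
    # Morphology task: keep the cell type, before the slash (behavior dropped).
    # e.g. "8140/3" -> "8140"
    return code.split("/")[0]
```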
In order to provide an evaluation that does not neglect possible dataset shift over time (for example due to style changes or to evolving oncology knowledge), we split train, validation, and test data using a temporal criterion (based on record insertion date): we used the most recent of the data as the test set ( records for site and for type, from March 2012 to March 2014), a similar amount of the remaining most recent records as the validation set ( for site and for type, from December 2010 to March 2012), and the rest as the training set ( for site and for type, before December 2010). Note that many previous studies have used a k-fold cross-validation strategy, which is perhaps unavoidable when dealing with small datasets.
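A temporal split along these lines can be sketched as follows; the fractions are illustrative placeholders (the split above is delimited by calendar dates rather than fixed fractions).

```python
# Sketch of a temporal train/validation/test split: the most recent slice
# becomes the test set, the next most recent the validation set, the rest
# the training set. Fractions are illustrative, not the paper's.
def temporal_split(records, test_frac=0.15, val_frac=0.15):
    """records: list of (insertion_date, document, label) tuples."""
    ordered = sorted(records, key=lambda r: r[0])  # oldest first
    n = len(ordered)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    train = ordered[: n - n_test - n_val]
    val = ordered[n - n_test - n_val : n - n_test]
    test = ordered[n - n_test :]
    return train, val, test
```

Sorting by insertion date before slicing guarantees that every training record predates every validation record, which in turn predates every test record, so the evaluation reflects possible dataset shift.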
3.2 Plain models
In our setting, a dataset consists of variable-length sequences of words $\mathbf{x}^{(i)} = (x^{(i)}_1, \dots, x^{(i)}_{T_i})$. For $t = 1, \dots, T_i$, $x^{(i)}_t$ is the $t$-th word in the $i$-th document and $y^{(i)}$ is the associated target class. To simplify the notation in the subsequent text, the superscripts are not used unless necessary. Sequences are denoted in boldface. The GRU-based sequence classifiers (in a set of preliminary experiments we found that Long Short-Term Memory (LSTM) did not improve over GRU) used in this work compute their predictions as follows:
$e$ is an embedding function mapping words into $d$-dimensional real vectors, where the embedding parameters can be either pretrained and adjustable or fixed (see Section 3.5 below). Functions $f$ and $f'$ correspond to (forward and reverse) dynamics that can be described in terms of several (possibly layered) recurrent cells. Each vector $h_t$ (the concatenation of the forward state $\overrightarrow{h}_t$ and the reverse state $\overleftarrow{h}_t$) can be interpreted as a latent representation of the information contained at position $t$ in the document. $\phi$ is an additional Multilayer Perceptron (MLP) (with sigmoidal output units) mapping each latent vector $h_t$ into a vector $u_t$ that can be seen as a contextualized representation of the word at position $t$. $g$ is an aggregation function that creates a single $m$-dimensional representation vector for the entire sequence, and $c$ is a classification layer with softmax. The parameters of $e$, $f$, $f'$, $\phi$, and $c$ (if present) are determined by minimizing a loss function (categorical cross-entropy in our case) on training data. Three possible choices for the aggregator function $g$ are described below.
In this model, called GRU in the following, $\phi$ is the identity function and we simply take the extreme latent representations (the final states of the forward and reverse dynamics); in principle, these may be sufficient since they depend on the whole sequence due to the bidirectional dynamics. However, note that this approach may require long-term dependencies to be effectively learned;
3.2.2 Attention mechanism
In this model, called ATT in the following, (scalar) attention weights $a_t$ are computed as
$$a_t = \frac{\exp(k_t^\top q)}{\sum_{s=1}^{T} \exp(k_s^\top q)}, \qquad g(\mathbf{u}) = \sum_{t=1}^{T} a_t u_t,$$
where $k_t = \tanh(W u_t + b)$ is a single layer that maps the representation $u_t$ of the word at position $t$ to a hidden representation. The importance of the word is then measured as its similarity with a context vector $q$ that is learned with the model and can be seen as an embedded representation of a high-level query, as in memory networks ;
3.2.3 Max pooling over time
In this model [30, 31] (called MAX in the following) the aggregator computes the element-wise maximum over time, $g_j(\mathbf{u}) = \max_t u_{t,j}$, so the sequence of representation vectors is treated as a bag and we apply a form of multi-instance learning: each aggregated “feature” $g_j(\mathbf{u})$ will be positive if at least one of the $u_{t,j}$ is positive (see also ). The resulting classifier will find it easy to create decision rules predicting a document as belonging to a certain class if a given set of contextualized word representations is present and another given set is absent in the sequence. Note that this aggregator can also be interpreted as a kind of hard attention mechanism where attention concentrates completely on a single time step, but the attended time step can be different for each feature $j$. As detailed in Section 3.3, a new model interpretation strategy can be derived when using this aggregator.
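The three aggregators can be illustrated with a minimal numpy sketch operating on the matrix of contextualized representations (one row per position); the parameters `W`, `b`, and `q` are placeholders standing in for the learned attention layer and context vector.

```python
import numpy as np

# Minimal sketch of the three aggregators over the contextualized word
# representations U (shape T x m, one row per position).
def agg_gru(U):
    # extreme latent representations: last and first rows concatenated
    return np.concatenate([U[-1], U[0]])

def agg_att(U, W, b, q):
    # attention weights: softmax of the similarity between the hidden
    # representations tanh(U W + b) and the context vector q
    scores = np.tanh(U @ W + b) @ q            # shape (T,)
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()
    return weights @ U                         # weighted average of the rows

def agg_max(U):
    # element-wise maximum over time (multi-instance flavor)
    return U.max(axis=0)
```

With a zero context vector the attention weights degenerate to a uniform average, which makes the contrast with the max aggregator easy to see: MAX lets each output feature pick its own time step.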
3.3 Interpretable model
An interpretable model can be used to assist the manual classification process routinely performed in tumor registries and to explain the proposed automatic classification for further human inspection. To this end, the plain model (Eqs. 1–6) can be modified as follows:
where $e$, $f$, $f'$, and $g$ are defined as in Section 3.2 and the size of $u_t$ is forced to equal the number of classes, so that each component $u_{t,k}$ will be associated with the importance of the words around position $t$ for class $k$. This information can be used to interpret the model decision. Preliminary experiments showed that the interpretation using the attention aggregator was not satisfactory. Therefore, in the experiments we only report the interpretable model with the max aggregator, which we call MAXi. More details on the preliminary experiments are reported in . Besides accuracy, we are also interested in the average agreement between MAXi and MAX (i.e., the fidelity of the interpretable classifier, see Appendix 8 for a definition).
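A minimal sketch of how the interpretable variant can be read out, assuming `U` holds one importance column per class (shape T x K); names are illustrative.

```python
import numpy as np

# Sketch of the interpretable read-out: U has one column per class, so
# U[t, k] measures the importance of the words around position t for
# class k. Max over time gives the per-class scores fed to the softmax;
# the argmax positions indicate which words drove each class score.
def interpret(U, class_names):
    scores = U.max(axis=0)        # (K,) per-class scores
    best_pos = U.argmax(axis=0)   # position most responsible for each class
    pred = class_names[int(scores.argmax())]
    positions = {class_names[k]: int(best_pos[k]) for k in range(U.shape[1])}
    return pred, positions
```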
3.4 Hierarchical models
The last two models in Section 3.2 can be extended in a hierarchical fashion, as suggested in . In this case, data consist of variable-length sequences of sentences where, for the $i$-th document, $x^{(i)}_{j,t}$ is the $t$-th word of the $j$-th sentence and $y^{(i)}$ is the associated target class. The prediction is calculated as in the plain model, but in two stages: word-level dynamics produce a representation for each sentence, and sentence-level dynamics aggregate the sentence representations into a document representation that is fed to the classification layer.
As in the plain model, $e$ is an embedding function, $f$ and $f'$ correspond to forward and reverse dynamics that process word representations, $h_{j,t}$ is the latent representation of the information contained at position $t$ of the $j$-th sentence, $u_{j,t}$ is the contextualized representation of the word at position $t$ of the $j$-th sentence, and $g$ is an aggregation function that creates a single representation $s_j$ for the sentence. Furthermore, $f_s$ and $f'_s$ correspond to forward and reverse dynamics that process sentence representations, and $g_s$ is the aggregation function that creates a single representation for the entire document. The vector produced by $f_s$ and $f'_s$ at position $j$ can be interpreted as the latent representation of the information contained in the $j$-th sentence for the document. We call MAXh and ATTh the hierarchical versions of MAX and ATT, respectively.
3.5 Word Vectors
Most algorithms for obtaining word vectors are based on co-occurrences in large text corpora. Co-occurrence can be measured either at the word-document level (e.g. using latent semantic analysis) or at the word-word level (e.g. using word2vec  or Global Vectors (GloVe) ). It is a common practice to use pre-compiled libraries of word vectors trained on several billion tokens extracted from various sources such as Wikipedia, the English Gigaword 5, Common Crawl, or Twitter. These libraries are generally conceived for general purpose applications and are only available for the English language. Reports in cancer registries, however, are normally written in the local language and make extensive usage of a very specific domain terminology. In fact they can be considered sublanguages with a specific vocabulary usage and with peculiar sentence construction rules that differ from the normal construction rules .
Another approach is to employ a Language Model (LM) that models language as a sequence of characters instead of words. In particular, in the Flair framework , the internal states of a trained character level LM are used to produce contextual string word embeddings.
3.6.1 Linear classifiers
The classic approach is to employ bag-of-words representations of textual documents. Vector representations of documents are easily derived from bags of words either by using indicator vectors or by taking into account the number of occurrences of each word using Term-Frequency Inverse-Document-Frequency (TF-IDF) weighting. In the latter representation, frequent and non-specific terms receive a lower weight.
Bag-of-words representations (including those employing bigrams or trigrams) enable the application of linear text classifiers, such as Naive Bayes (NB), Support Vector Machine (SVM) , or boosted tree classifiers . Those representations suffer from two fundamental problems: first, the relative order of terms in the documents is lost, making it impossible to take advantage of the syntactic structure of the sentences; second, distinct words have an orthogonal representation even when they are semantically close. Word vectors can be used to address the second limitation and also allow us to take advantage of unlabeled data, which can typically be obtained in large amounts and at little cost.
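A minimal pure-Python sketch of TF-IDF weighting; this is illustrative (standard log-IDF without smoothing), not necessarily the exact weighting used in our experiments.

```python
import math
from collections import Counter

# Illustrative TF-IDF over bags of words: a term appearing in every
# document gets IDF log(1) = 0, so frequent, non-specific terms receive
# a lower weight, as described in the text.
def tfidf(docs):
    """docs: list of token lists -> list of {term: weight} dicts."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)                                    # term frequency
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors
```

Note how a ubiquitous term ("adenoma" below, present in both toy documents) is weighted down to zero, while document-specific terms keep a positive weight.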
Convolutional Neural Networks (CNN) can be successfully employed in the context of sentence classification . The CNN model that we trained in our work is a slight variant of the architecture in . The original architecture produces feature maps by applying convolutional filters to the sequence of word vectors, followed by max pooling and classification. We used three convolutional layers with filter sizes of 3, 4, and 5. Moreover, we added a linear layer between the word vectors and the convolutional layers. We fine-tuned hyperparameters for the output size of the linear layer and the number of convolutional filters. The input size of the linear layer is the same as the word vector size.
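The convolution-plus-max-pooling feature extraction can be sketched in numpy as follows; the weights are placeholders, and the initial linear layer and softmax classifier of the full model are omitted for brevity.

```python
import numpy as np

# Sketch of the CNN feature extractor: filters of widths 3, 4, and 5
# slide over the sequence of word vectors; each feature map is then
# max-pooled over time and the pooled features are concatenated.
def cnn_features(X, filters):
    """X: (T, d) word-vector sequence; filters: list of (width, d, n_filters)."""
    pooled = []
    for W in filters:
        width = W.shape[0]
        n_windows = X.shape[0] - width + 1
        # feature map: one activation per window position and filter
        fmap = np.stack([
            np.tensordot(X[t:t + width], W, axes=2) for t in range(n_windows)
        ])
        pooled.append(fmap.max(axis=0))  # max pooling over time
    return np.concatenate(pooled)
```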
BERT  is a recent model that represents the state of the art in many NLP-related tasks [39, 40, 41, 42]. It is a bidirectional pre-training model built on the Transformer encoder . It is an attention-based technique that learns context-dependent word representations on large unlabeled corpora; the model is then fine-tuned end-to-end on specific labeled tasks. During pre-training, the model is trained on unlabeled data over two different tasks. In Masked Language Model (MLM) training, some tokens are masked and the model is trained to predict those tokens based on the context. In Next Sentence Prediction (NSP), the model is trained to understand the relationship between sentences by predicting whether two sentences are actually consecutive or whether one was randomly replaced (with 50% probability).
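The MLM objective can be illustrated with a minimal masking sketch; the 15% masking rate follows the original BERT recipe, and the special-token details of the real tokenizer (e.g. random-token replacement) are omitted.

```python
import random

# Illustrative sketch of MLM input corruption: a fraction of the tokens
# is replaced by a [MASK] symbol, and the model must recover the
# originals (kept here as targets; None marks unmasked positions).
def mask_tokens(tokens, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets.append(tok)   # the model is trained to predict these
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets
```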
In our work we pre-trained BERT using the same set of 1.5 million unlabeled records that we used to train word embeddings (see Section 4 for details). We then fine-tuned BERT on the specific topography and morphology prediction tasks.
All deep models (GRU, MAX, ATT, MAXi, MAXh, and ATTh) were trained by minimizing the categorical cross-entropy loss with Adam , with an initial learning rate of and minibatches of samples. The remaining hyperparameters (including the regularization parameter for SVM) were obtained by grid search using the validation accuracy as the objective (see Appendix 7 for optimal values and details on the hyperparameter space). In particular, we tuned the hyperparameters in (1)–(6) and (12)–(20), which control the structure of the model.
The first group of hyperparameters is associated with the embedding layer and in our case refers to GloVe hyperparameters . With an intrinsic evaluation, we found that the best configuration was 60 for the vector size, 15 for the window size, and iterations. We constructed sets of couples of related words, i.e., 11, 12, 11, 7, and 92 couples for the benign-malignant, benign-tissue, malignant-tissue, morphology-site, and singular-plural relations, respectively. For example, fibroma, fibrosarcoma and lipoma, liposarcoma for the benign-malignant relation, and fibroma, connective and lipoma, adipose for the cancer-tissue relation. We then used those sets to evaluate whether the semantic relations are captured by linear substructures in the space of the embeddings, e.g., we measure whether $v(\text{fibrosarcoma}) - v(\text{fibroma}) \approx v(\text{liposarcoma}) - v(\text{lipoma})$ for the benign-malignant relation and $v(\text{fibroma}) - v(\text{connective}) \approx v(\text{lipoma}) - v(\text{adipose})$ for the cancer-tissue relation. We confirmed the parameters with an extrinsic evaluation on the best model, by grid search in the space of for the window size and for the vector dimension.
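The intrinsic evaluation can be sketched as a cosine similarity between the difference vectors of related word pairs; the toy two-dimensional vectors in the example are illustrative only.

```python
import numpy as np

# Sketch of the intrinsic evaluation: a relation is captured by a linear
# substructure if the difference vectors of related pairs point in
# similar directions, e.g.
#   v(FIBROSARCOMA) - v(FIBROMA)  ~  v(LIPOSARCOMA) - v(LIPOMA)
def relation_similarity(vecs, pair_a, pair_b):
    d1 = vecs[pair_a[1]] - vecs[pair_a[0]]
    d2 = vecs[pair_b[1]] - vecs[pair_b[0]]
    return float(d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2)))
```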
A further group of hyperparameters defines the number of GRU layers and the number of units per layer for the word-level and sentence-level forward and reverse dynamics. For the MLP, additional hyperparameters control the number of layers and their size; we decided to give all stacked layers the same size in order to limit the hyperparameter space. Other hyperparameters control the kind of aggregating function at the word and sentence levels and, in the case of attention, the size of the attention layer. Finally, the last hyperparameter controls the data-dependent output size of the classification layer.
In the experiments reported below word vectors were computed by GloVe  trained on our set of 1.5 millions unlabeled records. In a set of preliminary experiments, we also compared the best model that we obtained using GloVe embeddings against the same model trained using Flair embeddings obtained using a LM trained on the same unlabeled records. Although Flair has the potential advantage of robustness with respect to typos and spelling variants, extrinsic results on the topography and the morphology tasks did not show any advantages over GloVe. For example test-set accuracy attained on topography by the best model, MAX, were slightly worse with Flair embeddings (89.9%) than with GloVe embeddings (90.3%) (the latter is reported in Table 1).
[Tables 1 and 2 (topography and morphology, respectively) report the columns: Accuracy, Top 3 Acc., Top 5 Acc., Macro F1.]
[Table 3 reports F1 by class-frequency subsets: 4, 18, and 39 classes for topography; 5, 18, and 111 classes for morphology.]
In Table 1 and Table 2 we summarize the results of the different models on test data in terms of multiclass accuracy (or, equivalently, micro-averaged F1 measure), top-k accuracy (whether the correct class appears within the top k predictions) for k = 3 and k = 5, and macro-averaged F1 measure (see Appendix 8 for definitions). Significance (each method against MAX) is reported with asterisks in the tables and was assessed with a one-sided McNemar test  for accuracy and with a one-sided macro T-test for the F1 score.
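The exact one-sided McNemar test on paired classification outcomes can be sketched as follows; this is an illustrative implementation (not the statistical package used for the tables), where `b` and `c` count the test documents classified correctly by one model but not the other.

```python
from math import comb

# Exact one-sided McNemar test: under H0 the b discordant successes
# among the b + c discordant pairs follow Binomial(b + c, 0.5);
# the p-value is P(X >= b).
def mcnemar_one_sided(b, c):
    n = b + c
    return sum(comb(n, k) for k in range(b, n + 1)) / 2 ** n
```

Only the discordant pairs matter: documents on which both models agree (both right or both wrong) carry no information about which model is better.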
Collecting results for all the models (for a single hyperparameter configuration and excluding the training of word vectors and BERT) required approximately 11 hours on a GeForce RTX 2080 Ti GPU (the source code for the experiments is available at https://github.com/trianam/cancerReportsClassification). In Table 3 we report the F1 score averaged over different subsets of classes. We consider a class easy if it has more than examples in the test set, average if it has between and examples, and hard if it has fewer than examples.
| Class | Relevant classes | Document text with highlighted words | English translation (by the authors) |
| 61 | 61 (PROSTATE GLAND) | DISOMOGENICITA ’ DIFFUSE . PSA NON PERVENUTO . ADENOCARCINOMA PROSTATICO A GRADO DI DIFFERENZIAZIONE MEDIO - BASSO ( GLEASON 3 + 4 ) NEI PRELIEVI DI CUI AI NN . 2 E 3 . AGOBIOPSIA DELLA PROSTATA : 1 ) 1 PRELIEVO LL DX . 2 ) 2 PRELIEVI ML DX . 3 ) 2 PRELIEVI M DX . 4 ) 1 PRELIEVO M SX . 5 ) 2 PRELIEVI ML SX . 6 ) 1 PRELIEVO LL SX . 7 ) 1 PRELIEVO TRANSIZIONALE SX . 8 ) 1 PRELIEVO TRANSIZIONALE DX . | DIFFUSE DISHOMOGENEITY . PSA NOT RECEIVED . PROSTATIC ADENOCARCINOMA OF INTERMEDIATE - LOW GRADE OF DIFFERENTIATION ( GLEASON 3 + 4 ) IN SAMPLES AT N . 2 AND 3 . NEEDLE BIOPSY OF THE PROSTATE : 1 ) 1 RIGHT LL SAMPLE . 2 ) 2 RIGHT ML SAMPLES . 3 ) 2 RIGHT M SAMPLES . 4 ) 1 LEFT M SAMPLE . 5 ) 2 LEFT ML SAMPLES . 6 ) 1 LEFT LL SAMPLE . 7 ) 1 LEFT TRANSITIONAL SAMPLE . 8 ) 1 RIGHT TRANSITIONAL SAMPLE . |
| 20 | 18 (COLON) 20 (RECTUM) 21 (ANUS AND ANAL CANAL) | ISOLATI FRAMMENTI RIFERIBILI AD ADENOMA TUBULARE INTESTINALE DI ALTO GRADO . FRAMMENTI ( NR . 2 ) DI POLIPO PEDUNCOLATO A 20 CM DALL ’ ORIFIZIO ANALE . ( ESEGUITA COLORAZIONE EMATOSSILINA - EOSINA ) . | ISOLATED FRAGMENTS ATTRIBUTABLE TO HIGH DEGREE INTESTINAL TUBULAR ADENOMA . FRAGMENTS ( NR . 2 ) OF PEDUNCULATED POLYPUS AT 20 CM FROM THE ANAL ORIFICE . ( PERFORMED HEMATOXYLIN - EOSIN COLORING ) . |
| 34 | 34 (BRONCHUS AND LUNG) 56 (OVARY) 67 (BLADDER) 80 (UNKNOWN PRIMARY SITE) | VERSAMENTO PLEURICO SX DI N . D . D . E ADDENSAMENTI POLMONARI DI N . D . D . , NODULI PARETE ADDOMINALE . INFILTRAZIONE CANCERIGNA DEGLI STROMI CONNETTIVO - ADIPOSI . IMMUNOISTOCHIMICA : CK7 + , CK20 - , TTF - 1 - , PROTEINA S - 100 - . LESIONE DI CM 2 , 0 X 1 , 3 X 0 , 7 . 1 - 2 ) SEZIONI SERIATE . | LEFT PLEURAL EFFUSION OF UNKNOWN ORIGIN AND LUNG THICKENING OF UNKNOWN ORIGIN , ABDOMINAL WALL NODULES . CANCEROUS INFILTRATION OF THE CONNECTIVE - ADIPOSE STROMA . IMMUNOHISTOCHEMICAL : CK7 + , CK20 - , TTF - 1 - , PROTEIN S - 100 - . LESION OF CM 2 , 0 X 1 , 3 X 0 , 7 . 1 - 2 ) SERIAL SECTIONS . |
In the case of topography, when focusing on the performance on classes with many examples, all models tend to perform similarly, with even the interpretable model attaining high F1 scores. The advantage of recurrent networks over bag-of-words representations is more pronounced when focusing on rare classes. One possible explanation is that the representation learned by recurrent networks is shared across all classes, leveraging the advantage of multi-task learning  in this case. We also note that in no case do hierarchical attention models outperform flat attention models, and max pooling performs best on rare classes. In the case of morphology, differences among the models are more pronounced, with BERT being very effective for densely populated classes (but not for rare classes). Again, hierarchical attention does not outperform flat attention. This result differs from the ones reported in , but the datasets are very different in terms of number of examples and number of classes. Differences in the writing style of pathologists trained and practicing in different countries could also impact the relative performance of different models. In this respect, our documents contain on average fewer sentences (see Figure 3 in Appendix 6), offering less structure to be exploited by the richer hierarchical models.
The interpretable classifier MAXi can be used to explain predictions by highlighting which portions of the text contribute to which classes. Its average agreement with MAX was on topography and on morphology. In Figure 1, we show three examples (topography task) where terms are underlined with class-specific colors and with intensities proportional to the importance $u_{t,k}$ of the word at position $t$ for class $k$ (see (10)): high, medium, low, or not highlighted, according to threshold values. We consider class $k$ to be relevant to the document if at least one word exceeds the lowest threshold.
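The bucketing of word importances into highlight intensities can be sketched as follows; the threshold values here are illustrative placeholders, not the ones used in the paper.

```python
# Sketch of the highlighting scheme: the importance u[t, k] of a word
# for a class is bucketed into an intensity level. Thresholds are
# illustrative placeholders.
def intensity(u, hi=0.75, mid=0.5, lo=0.25):
    if u > hi:
        return "high"
    if u > mid:
        return "medium"
    if u > lo:
        return "low"
    return None  # not highlighted
```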
The first report was correctly classified, and the two most relevant words are prostatico (prostatic) and prostata (prostate), followed by PSA (Prostate-Specific Antigen) and the Gleason score, two common exams in prostate cancer cases . For the second report, the model proposes three codes: 18, 20, and 21, suggesting that intestinal tubular adenoma and pedunculated polypus are terms associated with the class colon, polypus with colon and rectum, and anal orifice with rectum and anus. Note that the ground truth for this record was rectum, while the text explicitly mentions that the fragments were extracted at 20 cm from the anal orifice (the human rectum is approximately 12 cm long and the anal canal 3-5 cm ). The third report is an even more complex case where the model proposes code 34, attached to pleural effusion and lung thickening, but interestingly also underlines the immunohistochemical results, as the pattern CK7+ CK20- commonly indicates a diagnosis of lung origin for a metastatic adenocarcinoma . Moreover, immunohistochemistry is a common approach in the diagnosis of tumors of uncertain origin , which may explain why the immunohistochemical part is underlined with code 80. It is also interesting to note that the pleural terms are suggested to be related to ovarian cancer; indeed, the pleural cavity constitutes the most frequent site of extra-abdominal metastasis in ovarian carcinoma .
To quantify the effectiveness of the interpretable model, we designed an experiment in which a set of reduced datasets is created by keeping only the most relevant words, according to the value of in (10) computed by MAXi for the topography task. In Figure 2, we plot the accuracy obtained by training a plain GRU model on these reduced datasets, for increasing values of . Accuracy is high even when only a few words are selected, suggesting that the interpretable model is effective at distilling the most relevant terms, and that the information contained in the texts tends to be concentrated in a small number of terms.
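The distillation step of this experiment — keeping only the k most relevant words of each document while preserving their original order — can be sketched as follows (tokens and importance values are hypothetical):

```python
import numpy as np

def distill_top_k(tokens, importance, k):
    """Keep only the k words of a document with the highest interpretable
    importance (the per-word value from Eq. (10)), preserving their
    original order in the text."""
    idx = np.argsort(importance)[::-1][:k]   # indices of the k top scores
    keep = np.sort(idx)                      # restore original word order
    return [tokens[i] for i in keep]

tokens = ["frammenti", "di", "adenoma", "tubulare", "del", "colon"]
imp = np.array([0.1, 0.0, 0.9, 0.6, 0.0, 0.8])
print(distill_top_k(tokens, imp, 3))  # → ['adenoma', 'tubulare', 'colon']
```

A downstream classifier trained on such reduced documents then measures how much predictive information the selected words retain.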
We compared different algorithms on a large-scale dataset of more than labeled records from the Tuscany region tumor registry, collected between 1990 and 2014. Results confirm the viability of automated assignment of ICD-O3 codes, with an accuracy of 90.3% on topography (61 classes) and 84.8% on morphology (134 classes). Top-5 accuracies (the fraction of test documents whose correct label is among the model's top five predictions) were 98.1% and 96.9% for topography and morphology, respectively. These rates decreased only to 96.2% (topography) and 93.6% (morphology) when using an interpretable model that highlights the most important terms in the text.
In this specific context we did not obtain significant improvements from hierarchical attention methods compared to a simple max-pooling aggregation. The difference between deep learning models and more traditional approaches based on bag-of-words with SVM is significant, but not as pronounced as in the results reported in other studies. We also found that a large window size (15 words) and a relatively small dimensionality (60) work better for constructing word vectors, whereas other works in the biomedical field found better results with smaller window sizes and larger word vector dimensionality. These differences can be explained, at least in part, by the specificity of the corpus used in this study, where reports tend to be short, synthetic, rich in discriminant keywords, and often lacking verb phrases. As shown in Figure 2, a few words are sufficient to achieve good accuracy.
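The effect of the window size can be illustrated with a toy enumeration of (center, context) pairs as performed in word2vec-style training: with short, keyword-rich reports, a wide window lets distant discriminant keywords share contexts. This is an illustrative sketch, not the actual embedding pipeline:

```python
def context_pairs(tokens, window):
    """Enumerate (center, context) pairs as a word2vec-style model would,
    for a given symmetric window size."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# A hypothetical short report: with window=15 every keyword is in every
# other keyword's context (5 * 4 = 20 pairs); with window=1 only adjacent
# words share contexts (8 pairs).
doc = ["adenocarcinoma", "prostatico", "gleason", "score", "7"]
print(len(context_pairs(doc, window=15)))  # → 20
print(len(context_pairs(doc, window=1)))   # → 8
```

On documents of only a handful of tokens, a large window effectively turns each report into a bag of co-occurring keywords, which matches the corpus characteristics described above.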
SVMs perform well on topography classes that are sufficiently well represented in the dataset. We also found that hierarchical models are not better than flat models, and that a simple max aggregation achieves the best results in most cases. Interestingly, hierarchical models are outperformed by flat attention or flat max pooling on the more difficult classes (those with fewer than 100 training examples). Rare classes remain challenging for all current methods and, as discussed in Section 3.1, our study, like all previous similar studies in this area, does not even consider extremely rare classes. In this respect, future work may consider the use of metalearning techniques capable of operating in the few-shot learning setting [54, 55, 56] in order to include more classes and to improve prediction accuracy on the underrepresented ones. Results in this study are limited to a specific (but large) Italian dataset and might be compared in the future against results obtained on cancer reports written in other languages.
-  R. Sullivan, J. Peppercorn, et al., “Delivering affordable cancer care in high-income countries,” The Lancet Oncology, vol. 12, pp. 933–980, Sept. 2011.
-  B. Stewart and C.P. Wild, eds., World Cancer Report 2014. International Agency for Research on Cancer, WHO, Feb. 2014.
-  C. E. DeSantis, C. C. Lin, et al., “Cancer treatment and survivorship statistics, 2014,” CA: A Cancer Journal for Clinicians, vol. 64, pp. 252–271, July 2014.
-  R. L. Siegel, K. D. Miller, and A. Jemal, “Cancer statistics, 2016: Cancer Statistics, 2016,” CA: A Cancer Journal for Clinicians, vol. 66, pp. 7–30, Jan. 2016.
-  M. Brown, J. Lipscomb, and C. Snyder, “The burden of illness of cancer: economic cost and quality of life,” Annual Review of Public Health, vol. 22, pp. 91–113, 2001.
-  G. Tourassi, “Deep learning enabled national cancer surveillance,” in 2017 IEEE International Conference on Big Data (Big Data), pp. 3982–3983, Dec. 2017.
-  D. Delen, G. Walker, and A. Kadam, “Predicting breast cancer survivability: a comparison of three data mining methods,” Artificial Intelligence in Medicine, vol. 34, pp. 113–127, June 2005.
-  G. Mujtaba, L. Shuib, et al., “Clinical text classification research trends: Systematic literature review and open issues,” Expert Systems with Applications, vol. 116, pp. 494–520, Feb. 2019.
-  W.-w. Yim, M. Yetisgen, W. P. Harris, and S. W. Kwan, “Natural Language Processing in Oncology: A Review,” JAMA Oncology, vol. 2, p. 797, June 2016.
-  O. M. Jensen, Cancer registration: principles and methods, vol. 95, ch. 5 Data sources and reporting, pp. 35–48. IARC, 1991.
-  M. Colombet, S. Antoni, and J. Ferlay, Cancer incidence in five continents, vol. 11, ch. 6 Data Processing. Lyon: International Agency for Research on Cancer, 2017.
-  S. Ferretti, A. Giacomin, et al., Cancer Registration Handbook. AIRTUM, January 2008.
-  A. Fritz, C. Percy, A. Jack, K. Shanmugaratnam, L. Sobin, D. M. Parkin, and S. Whelan, eds., International classification of diseases for oncology. Geneva: World Health Organization, 3 ed., 2000.
-  V. Jouhet, G. Defossez, A. Burgun, P. Le Beux, P. Levillain, P. Ingrand, and V. Claveau, “Automated Classification of Free-text Pathology Reports for Registration of Incident Cases of Cancer,” Methods of Information in Medicine, vol. 51, pp. 242–251, July 2011.
-  R. Kavuluru, I. Hands, E. B. Durbin, and L. Witt, “Automatic extraction of ICD-O-3 primary sites from cancer pathology reports,” in Clinical Research Informatics AMIA symposium, 2013.
-  S. Gao, M. T. Young, J. X. Qiu, H.-J. Yoon, J. B. Christian, P. A. Fearn, G. D. Tourassi, and A. Ramanthan, “Hierarchical attention networks for information extraction from cancer pathology reports,” Journal of the American Medical Informatics Association, vol. 25, pp. 321–330, Mar. 2018.
-  J. X. Qiu, H.-J. Yoon, P. A. Fearn, and G. D. Tourassi, “Deep Learning for Automated Extraction of Primary Sites From Cancer Pathology Reports,” IEEE Journal of Biomedical and Health Informatics, vol. 22, pp. 244–251, Jan. 2018.
-  M. Alawad, S. Gao, J. X. Qiu, H. J. Yoon, J. Blair Christian, L. Penberthy, B. Mumphrey, X.-C. Wu, L. Coyle, and G. Tourassi, “Automatic extraction of cancer registry reportable information from free-text pathology reports using multitask convolutional neural networks,” Journal of the American Medical Informatics Association, vol. 27, no. 1, pp. 89–98, 2020.
-  T. Mikolov, W.-t. Yih, and G. Zweig, “Linguistic Regularities in Continuous Space Word Representations.,” in Hlt-naacl, vol. 13, pp. 746–751, 2013.
-  J. Pennington, R. Socher, and C. D. Manning, “Glove: Global Vectors for Word Representation.,” in EMNLP, vol. 14, pp. 1532–1543, 2014.
-  J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
-  K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” in Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-8), 2014. arXiv:1409.1259.
-  D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
-  E. Crocetti, C. Sacchettini, A. Caldarella, and E. Paci, “Automatic coding of pathologic cancer variables by the search of strings of text in the pathology reports. The experience of the Tuscany Cancer Registry,” Epidemiologia e prevenzione, vol. 29, no. 1, pp. 57–60, 2004.
-  A. Coden, G. Savova, I. Sominsky, M. Tanenblatt, J. Masanz, K. Schuler, J. Cooper, W. Guan, and P. C. De Groen, “Automatically extracting cancer disease characteristics from pathology reports into a disease knowledge representation model,” Journal of biomedical informatics, vol. 42, no. 5, pp. 937–949, 2009.
-  A. N. Nguyen, J. Moore, J. O’Dwyer, and S. Colquist, “Assessing the utility of automatic cancer registry notifications data extraction from free-text pathology reports,” in American Medical Informatics Association Annual Symposium, 2015.
-  Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, “Hierarchical Attention Networks for Document Classification,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (San Diego, California), pp. 1480–1489, June 2016.
-  M. T. Ribeiro, S. Singh, and C. Guestrin, “Why should I trust you?: Explaining the predictions of any classifier,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144, 2016.
-  S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus, “End-to-end memory networks,” in Advances in Neural Information Processing Systems 28, pp. 2440–2448, 2015.
-  R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. P. Kuksa, “Natural language processing (almost) from scratch,” J. Mach. Learn. Res., vol. 12, pp. 2493–2537, 2011.
-  Y. Kim, “Convolutional neural networks for sentence classification,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP, pp. 1746–1751, 2014.
-  A. Tibo, P. Frasconi, and M. Jaeger, “A network architecture for multi-multi-instance learning,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 737–752, Springer, 2017.
-  S. Martina, Classification of cancer pathology reports with Deep Learning methods. PhD thesis, University of Florence, 2020.
-  P. Spyns, “Natural language processing in medicine: an overview,” Methods of information in medicine, vol. 35, no. 04/05, pp. 285–301, 1996.
-  A. Akbik, D. Blythe, and R. Vollgraf, “Contextual string embeddings for sequence labeling,” in Proceedings of the 27th International Conference on Computational Linguistics, pp. 1638–1649, 2018.
-  C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
-  C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, pp. 273–297, Sep 1995.
-  T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794, ACM, 2016.
-  A. Chatterjee, K. N. Narahari, M. Joshi, and P. Agrawal, “Semeval-2019 task 3: Emocontext contextual emotion detection in text,” in Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 39–48, 2019.
-  D. Hu, “An introductory survey on attention mechanisms in nlp problems,” in Proceedings of SAI Intelligent Systems Conference, pp. 432–448, Springer, 2019.
-  J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang, “Biobert: pre-trained biomedical language representation model for biomedical text mining,” arXiv preprint arXiv:1901.08746, 2019.
-  V. Tshitoyan, J. Dagdelen, L. Weston, A. Dunn, Z. Rong, O. Kononova, K. A. Persson, G. Ceder, and A. Jain, “Unsupervised word embeddings capture latent knowledge from materials science literature,” Nature, vol. 571, no. 7763, p. 95, 2019.
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, pp. 5998–6008, 2017.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  T. G. Dietterich, “Approximate statistical tests for comparing supervised classification learning algorithms,” Neural Computation, vol. 10, pp. 1895–1923, Oct 1998.
-  Y. Yang and X. Liu, “A re-examination of text categorization methods,” in Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 42–49, 1999.
-  R. Caruana, “Multitask learning,” Machine learning, vol. 28, no. 1, pp. 41–75, 1997.
-  F. Brimo, R. Montironi, L. Egevad, A. Erbersdobler, D. W. Lin, J. B. Nelson, M. A. Rubin, T. van der Kwast, M. Amin, and J. I. Epstein, “Contemporary grading for prostate cancer: Implications for patient care,” European Urology, vol. 63, no. 5, pp. 892 – 901, 2013.
-  F. Greene, C. Compton, A. J. C. on Cancer, A. Fritz, J. Shah, and D. Winchester, Ajcc Cancer Staging Atlas. Springer New York, 2006.
-  S. Kummar, M. Fogarasi, A. Canova, A. Mota, and T. Ciesielski, “Cytokeratin 7 and 20 staining for the diagnosis of lung and colorectal adenocarcinoma,” British journal of cancer, vol. 86, no. 12, p. 1884, 2002.
-  J. Duraiyan, R. Govindarajan, K. Kaliyappan, and M. Palanisamy, “Applications of immunohistochemistry,” Journal of pharmacy & bioallied sciences, vol. 4, no. Suppl 2, p. S307, 2012.
-  J. M. Porcel, J. P. Diaz, and D. S. Chi, “Clinical implications of pleural effusions in ovarian cancer,” Respirology, vol. 17, no. 7, pp. 1060–1067, 2012.
-  B. Chiu, G. K. O. Crichton, A. Korhonen, and S. Pyysalo, “How to train good word embeddings for biomedical NLP,” in Proceedings of the 15th Workshop on Biomedical Natural Language Processing, pp. 166–174, 2016.
-  J. Snell, K. Swersky, and R. S. Zemel, “Prototypical networks for few-shot learning,” in Advances in Neural Information Processing Systems 30, pp. 4077–4087, 2017.
-  S. Ravi and H. Larochelle, “Optimization as a model for few-shot learning,” in 5th International Conference on Learning Representations, ICLR 2017, 2017.
-  O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra, “Matching networks for one shot learning,” in Advances in Neural Information Processing Systems 29, pp. 3630–3638, 2016.
6 Dataset statistics
7 Hyperparameter optimization
We report here domains and optimal values (underlined) for the hyperparameters of the models used in our experiments.
In MAX we used the max aggregation function in the plain model of Section 3.4. The hyperparameters space was:
for the topography site task, and:
for the morphology type task.
In ATT we used the attention aggregation function in the plain model. The hyperparameters space was:
for the site, and:
for the morphology.
In MAXh we used the max aggregation in the hierarchical model of Section 3.4. The hyperparameters space was:
for the topography, and:
for the morphology.
In ATTh we used the attention aggregation in the hierarchical model. The hyperparameters space was:
for the topography, and:
for the morphology.
In MAXi we used the max aggregation in the plain model, additionally constraining the model to be interpretable. The hyperparameters space was:
for the topography, and:
for the morphology. Note that, in this setting, the size of the last layer of the network must equal the output size of the model (the softmax is applied directly after the aggregation, without any additional layer). Thus, the hyperparameter refers only to the layers before the last one, if they exist.
Regarding GRU, we searched over the number of layers and the layer dimension in . We found that the best configuration used layers of dimension .
8 Performance measures
We report in the following precise definitions of our performance measures.
The multiclass accuracy is defined as
$$\mathrm{Acc} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\left[ \arg\max_{c} p_i(c) = y_i \right]$$
where $\mathbb{1}$ denotes the indicator function, $n$ is the number of test points, and $y_i$ is the true class of test point $i$ (recall that $p_i$ denotes the vector of conditional probabilities assigned to each of the classes). It is equivalent to the micro-averaged F1 measure for mutually exclusive classes.
The top-$k$ accuracy is defined as
$$\mathrm{Acc}_k = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\left[ y_i \in \mathrm{top}_k(p_i) \right]$$
where $\mathrm{top}_k$ denotes the operator that, given an array $p$ as input, returns the set $\{\pi(1), \dots, \pi(k)\}$, $\pi$ being the permutation sequence that sorts $p$ in descending order.
The macro-averaged F1 measure is defined as
$$F_1 = \frac{1}{C} \sum_{c=1}^{C} \frac{2 \, P_c R_c}{P_c + R_c}$$
where
$$P_c = \frac{TP_c}{TP_c + FP_c}$$
is the precision for class $c$ and
$$R_c = \frac{TP_c}{TP_c + FN_c}$$
is the recall for class $c$.
The fidelity is defined as
$$\mathrm{Fid} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\left[ \arg\max_{c} p_i(c) = \arg\max_{c} \tilde{p}_i(c) \right]$$
where $p_i$ and $\tilde{p}_i$ denote the vectors of conditional probabilities assigned to each of the classes by the two models (MAX and MAXi in the paper).
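For reference, the four measures above can be implemented directly. A minimal NumPy sketch, assuming P and Q are (n, C) arrays of predicted class probabilities and y the integer ground-truth labels:

```python
import numpy as np

def accuracy(P, y):
    """Multiclass accuracy: fraction of test points whose argmax
    predicted class equals the true label."""
    return float(np.mean(P.argmax(axis=1) == y))

def top_k_accuracy(P, y, k):
    """Fraction of test points whose true label is among the k classes
    with the highest predicted probability."""
    topk = np.argsort(P, axis=1)[:, ::-1][:, :k]
    return float(np.mean([y[i] in topk[i] for i in range(len(y))]))

def macro_f1(P, y, n_classes):
    """Macro-averaged F1: unweighted mean of per-class F1 scores."""
    pred = P.argmax(axis=1)
    f1s = []
    for c in range(n_classes):
        tp = np.sum((pred == c) & (y == c))
        fp = np.sum((pred == c) & (y != c))
        fn = np.sum((pred != c) & (y == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(f1s))

def fidelity(P, Q):
    """Fraction of test points on which two models (e.g. MAX and MAXi)
    predict the same class."""
    return float(np.mean(P.argmax(axis=1) == Q.argmax(axis=1)))
```

Note that with mutually exclusive classes, `accuracy` coincides with micro-averaged F1, while `macro_f1` weights every class equally regardless of its support.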