In healthcare facilities, clinical records are classified into a set of International Classification of Diseases (ICD) codes that categorize diagnoses. ICD codes are used for a wide range of purposes including billing, reimbursement, and retrieval of diagnostic information. Automatic ICD coding is in great demand, as manual coding is labor-intensive and error-prone. ICD coding is a multi-label text classification task that faces two severe challenges. First, the frequency distribution of ICD codes is highly long-tailed: while some codes occur frequently, many others have only a few or even no labeled examples due to the rareness of the corresponding diseases. For example, in the medical dataset MIMIC-III, a large fraction of the 17,000 unique ICD-9 codes never occur in the training data. It is extremely challenging to perform fine-grained multi-label classification on codes with labeled data (seen codes) and zero-shot (unseen) codes at the same time. Second, clinical documents can be long and noisy, and extracting the information relevant to every code that needs to be assigned is difficult. Automatic ICD coding for both seen and unseen codes fits into the generalized zero-shot learning (GZSL) paradigm, where test examples come from both seen and unseen classes and are classified into the joint label space of both. Nevertheless, existing GZSL works focus on visual tasks [20, 12]; GZSL for multi-label text classification remains largely under-explored. In this work, we aim to bridge this gap.
To tackle the problem of generalized zero-shot ICD coding, we propose AGMC-HTS, an Adversarial Generative Model Conditioned on ICD code descriptions that generates pseudo examples in the latent feature space by exploiting the ICD code Hierarchical Tree Structure. Specifically, as illustrated in Figure 1, AGMC-HTS consists of a generator that synthesizes code-specific latent features from the ICD code descriptions, and a discriminator that decides how realistic the generated features are. To guarantee semantic consistency between the generated features and real features, AGMC-HTS reconstructs the keywords in the input documents that are related to the conditioned ICD codes. Such a pseudo cycle generation architecture especially benefits feature generation for zero-shot codes. Different from a pure cycle architecture, we only generate the keywords instead of the whole text, which significantly eases training and adds more semantics to the synthetic features. To further facilitate feature synthesis for zero-shot codes, we take advantage of the hierarchical structure of the ICD codes and encourage each zero-shot code to generate features similar to those of its nearest sibling code. Beyond ICD coding, the proposed AGMC-HTS can be applied to various text classification problems, such as indexing biomedical articles and patent classification.
The contributions of this paper are summarized as follows: 1) To the best of our knowledge, this work is the first to propose an adversarial generative model for generalized zero-shot learning on multi-label text classification. AGMC-HTS generates pseudo document features conditioned on the zero-shot codes and fine-tunes the ICD code assignment classifiers. 2) AGMC-HTS incorporates the hierarchical structure and domain knowledge of the codes to ensure semantic relevance between the latent features and the codes. 3) We propose a pseudo cycle generation architecture that guarantees semantic consistency between the generated features and real features by reconstructing the keywords extracted from real input texts. It also benefits feature generation for zero-shot codes, which have no document samples in the training data. 4) Extensive experiments demonstrate the effectiveness of our approach. On the public MIMIC-III dataset, our method lifts the F1 score on zero-shot codes from nearly 0, and increases the AUC score by 3% (absolute improvement) over the previous state of the art. We also show that the framework improves performance on few-shot codes with a handful of labeled data.
2 Related Work
Automated ICD coding. Several approaches have explored automatically assigning ICD codes on clinical text data. Prior work proposed extracting per-code textual features with an attention mechanism for ICD code assignment; others explored character-based long short-term memory (LSTM) networks with attention, or applied tree LSTMs with ICD hierarchy information for ICD coding. Most existing work either focused on predicting the most common ICD codes or did not utilize the ICD hierarchy structure for prediction.
A neural network model incorporating ICD hierarchy information was proposed that improved performance on rare and zero-shot codes. However, its performance was evaluated only in terms of relative ranks against other infrequent codes; the model hardly ever assigns rare codes in its final predictions, as we show in Section 4.2, making it impractical to deploy in real applications.
Feature generation for GZSL. The idea of using generative models for GZSL is to generate latent features for unseen classes and train a classifier on both generated and real features covering seen and unseen classes. One line of work proposed conditional GANs to generate visual features given the semantic features of zero-shot classes.
Follow-up work added a cycle-consistency loss on the generator to ensure the generated features capture class semantics, using linear regression to map visual features back to class semantic features; a dual-GAN formulation further improves semantics preservation over the linear model. These previous works focus on the vision domain, where features are extracted by deep models well-trained on large-scale image datasets. We introduce the first feature generation framework tailored for zero-shot ICD coding, exploiting existing medical knowledge from the limited available data.
Zero-shot text classification. Zero-shot text classification has been explored by learning the relationship between text and weakly labeled tags on a large corpus, an idea similar to learning the relationship between inputs and code descriptions. Another work introduced a two-phase framework in which an input is first determined to belong to a seen or an unseen class before the final classification. This approach does not directly apply to ICD coding: each input is labeled with a set of codes that can include both seen and unseen codes, so it is not possible to determine whether the input belongs to a seen or an unseen class.
The task of automatic ICD coding is to assign ICD codes to a patient's clinical notes. We formulate it as a multi-label text classification problem. Let $\mathcal{L}$ be the set of all ICD codes; given an input text, the goal is to predict $y_\ell \in \{0, 1\}$ for all $\ell \in \mathcal{L}$. Each ICD code has a short text description. For example, the description for ICD-9 code 403.11 is "Hypertensive chronic kidney disease, benign, with chronic kidney disease stage V or end stage renal disease." There is also a known hierarchical tree structure over all ICD codes: the children of a node representing an ICD code are the subtypes of that code.
We focus on the generalized zero-shot ICD coding problem: accurately assigning a code $\ell$ that is never assigned to any training text (i.e., $n_\ell = 0$, where $n_\ell$ is the number of training examples labeled with $\ell$), without sacrificing performance on codes with training data. We assume a pretrained model as a feature extractor that performs ICD coding by extracting a label-wise feature $f_\ell$ and predicting $\hat{y}_\ell = \sigma(w_\ell^\top f_\ell)$, where $\sigma$ is the sigmoid function and $w_\ell$ is the binary classifier for code $\ell$. For a zero-shot code, $w_\ell$ is never trained on examples with $y_\ell = 1$, and thus at inference time the pretrained feature extractor hardly ever assigns zero-shot codes.
Figure 1 shows an overview of our method. We propose to use generative adversarial networks (GAN) to generate features $\tilde{f}_\ell$ by conditioning on code $\ell$. The generator $G$ tries to generate a fake feature $\tilde{f}_\ell$ given an ICD code description. The discriminator $D$ tries to distinguish $\tilde{f}_\ell$ from the real latent feature $f_\ell$ produced by the feature extractor model. After the GAN is trained, we use $G$ to synthesize features and fine-tune the binary classifier $w_\ell$ for a given zero-shot code $\ell$. Since the binary code classifiers are independently fine-tuned for zero-shot codes, the performance on seen codes is not affected, achieving the goal of GZSL.
3.1 Feature extractor
We first describe the feature extractor model that will be used for training the GAN. The model is the zero-shot attentive graph recurrent neural network (ZAGRNN), modified from the only previous work (to the best of our knowledge) tailored to solving zero-shot ICD coding. Figure 2 shows the architecture of the ZAGRNN. At a high level, given an input $x$, ZAGRNN extracts a label-wise feature $f_\ell$ and performs a binary prediction $\hat{y}_\ell$ for each ICD code $\ell$.
Label-wise feature extraction. Given an input clinical document containing $n$ words, we represent it with a matrix $X = [x_1, \ldots, x_n]$, where $x_i$ is the word embedding vector of the $i$-th word. Each ICD code $\ell$ has a textual description. To represent $\ell$, we construct an embedding vector $v_\ell$ by averaging the embeddings of the words in its description.
The word embeddings are shared between inputs and label descriptions so that learned knowledge is shared. Adjacent word embeddings are combined using a one-dimensional convolutional neural network (CNN) to obtain n-gram text features $H$. The label-wise attention feature for label $\ell$ is then computed by:

$$\alpha_\ell = \mathrm{softmax}(H v_\ell), \qquad f_\ell = H^\top \alpha_\ell,$$

where $\alpha_\ell$ contains the attention scores over all rows of $H$ and $f_\ell$ is the attended output of $H$ for label $\ell$. Intuitively, $f_\ell$ extracts the information in $x$ most relevant to code $\ell$ via attention. Each input thus has $|\mathcal{L}|$ attention feature vectors in total, one for each ICD code.
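As a concrete illustration, the label-wise attention step can be sketched in a few lines of NumPy. Shapes and names here are toy choices for exposition; in the actual model, $H$ comes from a CNN over learned word embeddings.

```python
import numpy as np

def label_wise_attention(H, V):
    """One attended feature per ICD code.

    H: (n, d) n-gram text features for a document.
    V: (L, d) label embeddings, one row per ICD code.
    Returns F: (L, d), where F[l] summarizes H w.r.t. code l.
    """
    scores = V @ H.T                             # (L, n) word-code relevance
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)    # softmax over the n words
    return alpha @ H                             # (L, d) attended features

H = np.random.randn(30, 8)   # 30 words, 8-dim n-gram features
V = np.random.randn(5, 8)    # 5 codes
F = label_wise_attention(H, V)
```

Each row of `F` is a convex combination of the rows of `H`, weighted by how relevant each word position is to that code.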
Multi-label classification. For each code $\ell$, the binary prediction is generated by:

$$\hat{y}_\ell = \sigma(w_\ell^\top f_\ell + b_\ell).$$

We use a graph gated recurrent neural network (GRNN) to encode the classifier $w_\ell$. Let $\mathcal{N}(\ell)$ denote the set of codes adjacent to $\ell$ in the ICD tree hierarchy and $T$ be the number of graph propagation steps; the classifier is computed by:

$$w_\ell = \mathrm{GRNN}_T\big(v_\ell, \{v_{\ell'} : \ell' \in \mathcal{N}(\ell)\}\big),$$

whose construction is detailed in Appendix A. The weights of the binary code classifier are tied with the graph-encoded label embedding so that the learned knowledge can also benefit zero-shot codes, since the label embedding is computed from the shared word embedding space.
The loss function for training is the multi-label binary cross-entropy:

$$\mathcal{L}_{\mathrm{BCE}} = -\sum_{\ell \in \mathcal{L}} \big[\, y_\ell \log \hat{y}_\ell + (1 - y_\ell) \log (1 - \hat{y}_\ell) \,\big].$$
As mentioned above, the distribution of ICD codes is extremely long-tailed. To counter the label imbalance, we adopt the label-distribution-aware margin (LDAM) loss, where we subtract a label-dependent margin from the logit before the sigmoid function:

$$\hat{y}_\ell^{\mathrm{LDAM}} = \sigma\big(w_\ell^\top f_\ell + b_\ell - \mathbb{1}[y_\ell = 1] \cdot \Delta_\ell\big), \qquad \Delta_\ell = \frac{C}{n_\ell^{1/4}},$$

where $\mathbb{1}[\cdot]$ outputs 1 if its argument is true and 0 otherwise, $n_\ell$ is the number of training examples labeled with $\ell$, and $C$ is a constant. The LDAM loss is then the binary cross-entropy computed on these margin-adjusted predictions: $\mathcal{L}_{\mathrm{LDAM}} = \mathrm{BCE}(\hat{y}^{\mathrm{LDAM}}, y)$.
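A small numeric sketch of the margin adjustment. The margin form $C/n_\ell^{1/4}$ follows the LDAM formulation; the constant and the code counts below are made-up values for illustration only.

```python
import numpy as np

def ldam_logits(logits, labels, counts, C=0.5):
    """Subtract a label-dependent margin C / n_l**0.25 from positive logits.

    logits: (L,) pre-sigmoid scores; labels: (L,) in {0, 1};
    counts: (L,) training frequency n_l of each code; C: tunable constant.
    """
    margins = C / np.maximum(counts, 1) ** 0.25  # rarer codes get larger margins
    return logits - labels * margins             # only positive labels are shifted

logits = np.array([2.0, 2.0, 2.0])
labels = np.array([1, 1, 0])
counts = np.array([10000, 16, 16])   # one frequent code, two rare codes
adj = ldam_logits(logits, labels, counts)
# the rare positive code is pushed down more than the frequent one,
# and the negative code is left unchanged
```

Forcing a larger margin on rare positive codes makes the classifier keep a wider decision boundary exactly where it has the least data.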
3.2 Zero-shot Latent Feature Generation with WGAN-GP
For a zero-shot code $\ell$, the label of every training example is $y_\ell = 0$, and the binary classifier for code assignment is never trained on positive examples due to the dearth of such data. Previous works have successfully applied GANs to GZSL in the vision domain [19, 5]. We propose to use GANs to improve zero-shot ICD coding by generating pseudo data examples in the latent feature space for zero-shot codes and fine-tuning the code-assignment binary classifiers on the generated latent features.
More specifically, we use the Wasserstein GAN with gradient penalty (WGAN-GP) to generate code-specific latent features conditioned on the textual description of each code. Details of WGAN-GP are described in Appendix B. To condition on the code description, we use a label encoder function $E$ that maps the code description to a low-dimensional vector; we denote $e_\ell = E(\ell)$. The generator $G$ takes a random Gaussian noise vector $z$ and an encoding $e_\ell$ of a code description, and generates a latent feature $\tilde{f}_\ell = G(z, e_\ell)$ for this code. The discriminator, or critic, $D$ takes a latent feature vector $f$ (either generated by WGAN-GP or extracted from real data) and the encoded label vector $e_\ell$, and produces a real-valued score $D(f, e_\ell)$ representing how realistic $f$ is. The WGAN-GP loss is:

$$\mathcal{L}_{\mathrm{WGAN}} = \mathbb{E}_{(f, e) \sim p_s}\big[D(f, e)\big] - \mathbb{E}_{z, e}\big[D(G(z, e), e)\big] - \lambda\, \mathbb{E}_{\hat{f}, e}\big[ (\|\nabla_{\hat{f}} D(\hat{f}, e)\|_2 - 1)^2 \big],$$

where $p_s$ is the joint distribution of latent features and encoded label vectors over the set of seen code labels, $\hat{f} = \alpha f + (1 - \alpha)\, G(z, e)$ with $\alpha \sim U(0, 1)$, and $\lambda$ is the gradient penalty coefficient. WGAN-GP is learned by solving the minimax problem $\min_G \max_D \mathcal{L}_{\mathrm{WGAN}}$.
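To make the loss terms concrete, here is a toy evaluation of one critic loss with a *linear* critic, chosen so the gradient penalty can be computed analytically; a real implementation uses a neural critic and automatic differentiation, and all shapes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def critic(f, e, W):
    # toy linear critic D(f, e) = [f; e] . W  (a neural network in practice)
    return np.concatenate([f, e]) @ W

def wgan_gp_critic_loss(f_real, f_fake, e, W, lam=10.0):
    # Wasserstein term: the critic should score real features above fakes
    wass = critic(f_real, e, W) - critic(f_fake, e, W)
    # gradient penalty at a random interpolation of real and fake features;
    # for a linear critic, dD/df_hat is simply the first len(f) entries of W
    a = rng.uniform()
    f_hat = a * f_real + (1 - a) * f_fake
    grad_f = W[:len(f_hat)]
    gp = (np.linalg.norm(grad_f) - 1.0) ** 2
    # the critic maximizes (wass - lam * gp); return the loss it minimizes
    return -(wass - lam * gp)

f_real = rng.normal(size=4)   # real latent feature from the extractor
f_fake = rng.normal(size=4)   # generated feature G(z, e)
e = rng.normal(size=3)        # encoded code description
W = rng.normal(size=7)
loss = wgan_gp_critic_loss(f_real, f_fake, e, W)
```

The generator's own update would minimize `-critic(f_fake, e, W)`, i.e., push its fakes toward higher critic scores.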
Label encoder. The function $E$ is an ICD-code encoder that maps a code description to an embedding vector. For a code $\ell$, we first use an LSTM to encode the sequence of words in the description into a sequence of hidden states $h_1, \ldots, h_m$. We then perform dimension-wise max-pooling over the hidden state sequence to obtain a fixed-size encoding vector $\bar{h}_\ell$. Finally, we obtain the eventual embedding of code $\ell$ by concatenating $\bar{h}_\ell$ with $w_\ell$, the embedding of $\ell$ produced by the graph encoding network: $e_\ell = [\bar{h}_\ell; w_\ell]$. Thus $e_\ell$ contains both the latent semantics of the description (in $\bar{h}_\ell$) and the ICD hierarchy information (in $w_\ell$).
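The pooling-and-concatenation step of the label encoder is mechanically simple; a sketch with toy values follows (in the actual model, the hidden states come from an LSTM and the second vector from the graph encoder):

```python
import numpy as np

def encode_code(desc_states, graph_emb):
    """e_l = [max-pool over description hidden states ; graph embedding].

    desc_states: (m, h) hidden states for the m description words.
    graph_emb: (g,) graph-encoded label embedding for the code.
    Returns a (h + g,) code encoding.
    """
    pooled = desc_states.max(axis=0)            # dimension-wise max-pooling
    return np.concatenate([pooled, graph_emb])  # description + hierarchy info

states = np.array([[0.1, -2.0],
                   [0.5,  1.0],
                   [0.2,  0.3]])   # 3 description words, 2-dim states
e = encode_code(states, np.array([7.0]))
```

Max-pooling keeps the strongest activation per dimension across the description words, so one salient word can dominate a dimension regardless of description length.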
Keywords reconstruction loss. To ensure the generated feature vector $\tilde{f}_\ell$ captures the semantic meaning of code $\ell$, we encourage $\tilde{f}_\ell$ to reconstruct the keywords extracted from the clinical notes associated with code $\ell$. For each input text $x$ labeled with code $\ell$, we extract the label-specific keyword set $K_{x,\ell}$ as the set of words in $x$ most similar to $\ell$, where similarity is measured by the cosine similarity between a word embedding in $x$ and the label embedding $v_\ell$. Let $P$ be a projection matrix, $K_\ell$ be the set of all keywords collected from all inputs labeled with $\ell$, and $\cos(\cdot, \cdot)$ denote the cosine similarity function; the loss for reconstructing keywords given the generated feature is:

$$\mathcal{L}_{\mathrm{key}} = -\,\mathbb{E}_{z, e_\ell}\Big[ \frac{1}{|K_\ell|} \sum_{u \in K_\ell} \cos\big(P\, G(z, e_\ell),\, u\big) \Big].$$
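The keyword extraction step itself can be sketched as a top-k cosine-similarity lookup. The 2-d embeddings below are toy values; the paper uses pretrained biomedical word vectors.

```python
import numpy as np

def extract_keywords(words, word_vecs, label_vec, k=3):
    """Return the k words of a note most similar to a label embedding.

    words: list of n tokens; word_vecs: (n, d) their embeddings;
    label_vec: (d,) label embedding; similarity is cosine.
    """
    W = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
    v = label_vec / np.linalg.norm(label_vec)
    sims = W @ v                       # cosine similarity per word
    top = np.argsort(-sims)[:k]        # indices of the k best matches
    return [words[i] for i in top]

words = ["leukemia", "the", "injury"]
vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
kws = extract_keywords(words, vecs, np.array([1.0, 0.0]), k=2)
```

Because the keyword sets are precomputed from real notes, the reconstruction loss gives the generator a semantic target even though it never sees the full documents.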
Discriminating zero-shot codes using ICD hierarchy. In the current WGAN-GP framework, the discriminator cannot be trained on zero-shot codes due to the lack of real positive features. To include zero-shot codes during training, we utilize the ICD hierarchy and use $f_{s(\ell)}$, the latent feature extracted from real data of the nearest sibling $s(\ell)$ of a zero-shot code $\ell$, for training the discriminator. This formulation encourages the generated feature $\tilde{f}_\ell$ to be close to the real latent features of the siblings of $\ell$, and thus better preserves the ICD hierarchy. More formally, letting $s(\ell)$ be the nearest sibling of $\ell$, we propose the following modification of $\mathcal{L}_{\mathrm{WGAN}}$ for training zero-shot codes:

$$\mathcal{L}_{\mathrm{WGAN}}^{z} = \mathbb{E}_{e_\ell \sim p_z}\Big[ \cos(e_\ell, e_{s(\ell)}) \big( D(f_{s(\ell)}, e_\ell) - \mathbb{E}_{z}\big[ D(G(z, e_\ell), e_\ell) \big] \big) \Big] - \lambda\, \mathbb{E}_{\hat{f}}\big[ (\|\nabla_{\hat{f}} D(\hat{f}, e_\ell)\|_2 - 1)^2 \big],$$

where $p_z$ is the distribution of encoded label vectors over the set of zero-shot codes and $\hat{f}$ is defined as in $\mathcal{L}_{\mathrm{WGAN}}$. The loss term is weighted by the cosine similarity $\cos(e_\ell, e_{s(\ell)})$ to prevent generating exactly the nearest sibling's feature for the zero-shot code $\ell$. After adding zero-shot codes to training, our full learning objective becomes:

$$\min_G \max_D\; \mathcal{L}_{\mathrm{WGAN}} + \mathcal{L}_{\mathrm{WGAN}}^{z} + \beta\, \mathcal{L}_{\mathrm{key}},$$

where $\beta$ is the balancing coefficient for the keyword reconstruction loss.
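Selecting the nearest labeled sibling from the ICD tree might look like the sketch below. The dict-based tree, the count table, and the similarity callback are illustrative placeholders, not the paper's actual data structures.

```python
def nearest_sibling(code, parent, n_train, emb_sim):
    """Pick a sibling of a zero-shot `code` whose real features can stand in.

    parent: dict mapping each code to its parent in the ICD tree.
    n_train: dict mapping codes to their number of labeled training examples.
    emb_sim: callable scoring similarity of two codes (e.g., cosine of
             label embeddings). Returns None if no labeled sibling exists.
    """
    siblings = [c for c, p in parent.items()
                if p == parent[code] and c != code and n_train.get(c, 0) > 0]
    if not siblings:
        return None
    return max(siblings, key=lambda c: emb_sim(code, c))

# hypothetical fragment of the ICD-9 tree
parent = {"403.10": "403.1", "403.11": "403.1", "404.00": "404.0"}
n_train = {"403.10": 57, "404.00": 12}   # 403.11 is zero-shot
sib = nearest_sibling("403.11", parent, n_train, emb_sim=lambda a, b: 1.0)
```

Only siblings that actually have training data qualify, since their role is to supply real latent features to the discriminator.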
Fine-tuning on generated features. After WGAN-GP is trained, we fine-tune the pretrained classifier $w_\ell$ from the baseline model on generated features for a given zero-shot code $\ell$. We use the generator to synthesize a set of features $\tilde{f}_\ell$ labeled with $y_\ell = 1$, and collect a set of real features with $y_\ell = 0$ from the training data using the baseline model as the feature extractor. We then fine-tune $w_\ell$ on this set of labeled feature vectors to obtain the final binary classifier for the zero-shot code $\ell$.
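This fine-tuning stage reduces to binary logistic regression on a mix of generated positives and real negatives. A self-contained sketch with synthetic Gaussian stand-ins for the latent features (learning rate, epochs, and cluster parameters are arbitrary):

```python
import numpy as np

def finetune_classifier(w, b, feats, labels, lr=0.5, epochs=100):
    """Fine-tune one code's binary classifier with logistic-regression steps.

    feats: (N, d) latent features; labels: (N,) in {0, 1}.
    """
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))  # sigmoid predictions
        g = p - labels                               # dBCE/dlogit
        w -= lr * feats.T @ g / len(labels)
        b -= lr * g.mean()
    return w, b

rng = np.random.default_rng(1)
pos = rng.normal(1.0, 0.3, size=(256, 8))    # generator-synthesized, label 1
neg = rng.normal(-1.0, 0.3, size=(256, 8))   # real features with y_l = 0
X = np.vstack([pos, neg])
y = np.concatenate([np.ones(256), np.zeros(256)])
w, b = finetune_classifier(np.zeros(8), 0.0, X, y)
acc = (((X @ w + b) > 0) == y).mean()
```

Because each code's classifier is updated in isolation on its own feature set, the seen-code classifiers are untouched, which is what keeps GZSL performance on seen codes intact.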
Dataset description. We use the publicly available medical dataset MIMIC-III for evaluation, which contains approximately 58,000 hospital admissions of 47,000 patients who stayed in the ICU of the Beth Israel Deaconess Medical Center between 2001 and 2012. Each admission record has a discharge summary that includes medical history, diagnosis outcomes, surgical procedures, discharge instructions, etc. Each admission record is assigned a set of the most relevant ICD-9 codes by medical coders. The dataset is preprocessed following prior work. Our goal is to accurately predict the ICD codes given the discharge summary.
We split the dataset for training, validation, and testing by patient ID. In total we have 46,157 discharge summaries for training, 3,280 for validation, and 3,285 for testing. There are 6,916 unique ICD-9 diagnosis codes in MIMIC-III, 6,090 of which appear in the training set. We use all codes for training, while evaluating only on codes that have more than 5 data examples. 96 out of 1,646 unique codes in the validation set and 85 out of 1,630 in the test set are zero-shot codes.
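A patient-level split, so that no patient's admissions leak across splits, can be sketched as below. The record field names and split fraction are illustrative only, not the released MIMIC-III split.

```python
from collections import defaultdict

def split_by_patient(admissions, train_frac=0.8):
    """Split admission records by patient ID.

    admissions: list of dicts, each with a "patient_id" key (hypothetical
    schema). All admissions of one patient land in the same split.
    """
    by_patient = defaultdict(list)
    for adm in admissions:
        by_patient[adm["patient_id"]].append(adm)
    patients = sorted(by_patient)
    cut = int(len(patients) * train_frac)
    train = [a for p in patients[:cut] for a in by_patient[p]]
    rest = [a for p in patients[cut:] for a in by_patient[p]]
    return train, rest

# 5 toy patients with 2 admissions each
adms = [{"patient_id": i // 2, "note": f"note{i}"} for i in range(10)]
train, rest = split_by_patient(adms)
```

Splitting on patients rather than admissions prevents near-duplicate summaries of the same patient from appearing in both training and test sets.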
Baseline methods. We compare our method with the previous state-of-the-art approach for zero-shot ICD coding described in Section 3.1, meta-embedding for long-tailed classification, and WGAN-GP with a classification loss or a cycle-consistency loss, both originally applied to ZSL in the computer vision domain. Detailed descriptions and hyper-parameters of the baseline methods are in Appendix C.
Training details. For WGAN-GP based methods, the real latent features are extracted from the final layer of the ZAGRNN model. Only features of codes positively labeled in the corresponding document are collected for training. We use a single-layer fully-connected network with hidden size 800 for both generator and discriminator. For the code-description encoder LSTM, we set the hidden size to 200. We train the discriminator for 5 iterations per generator training iteration. We optimize WGAN-GP with ADAM with mini-batch size 128 and learning rate 0.0001. We train all variants of WGAN-GP for 60 epochs. We set the balancing coefficients of the two auxiliary losses to 0.01 and 0.1. For keyword reconstruction, we predict the top 30 most relevant keywords given the generated features.
After the generators are trained, we synthesize 256 features for each zero-shot code and fine-tune the classifiers using ADAM with learning rate 0.00001 and batch size 128. We fine-tune on all zero-shot codes, select the best-performing model on the validation set, and report the final result on the test set.
| Model | Micro-P | Micro-R | Micro-F1 | Micro-AUC | Macro-P | Macro-R | Macro-F1 | Macro-AUC |
|---|---|---|---|---|---|---|---|---|
| ZAGRNN + | 56.06 | 47.14 | 51.22 | 96.70 | 31.72 | 28.06 | 29.78 | 94.08 |
| ZAGRNN + | 0.00 | 0.00 | 0.00 | 90.78 | 0.00 | 0.00 | 0.00 | 91.91 |
| ZAGRNN + Meta | 46.70 | 0.89 | 1.74 | 90.08 | 3.88 | 0.95 | 1.52 | 91.88 |
We report both the micro and macro precision, recall, F1, and AUC scores on the zero-shot codes for all methods. Micro metrics aggregate the contributions of all codes to compute an average score, while macro metrics compute each metric independently per code and then average. All scores are averaged over 10 runs with different random seeds.
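The micro/macro distinction can be made precise with a small helper over binary indicator matrices. This mirrors the standard definitions, not the paper's evaluation code.

```python
import numpy as np

def micro_macro_f1(y_true, y_pred):
    """y_true, y_pred: (N, L) binary matrices over N notes and L codes."""
    tp = (y_true * y_pred).sum(axis=0).astype(float)
    fp = ((1 - y_true) * y_pred).sum(axis=0)
    fn = (y_true * (1 - y_pred)).sum(axis=0)
    # micro: pool counts over all codes first, then compute one F1
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    # macro: per-code F1, then an unweighted mean (rare codes count equally)
    per_code = 2 * tp / np.maximum(2 * tp + fp + fn, 1)
    return micro, per_code.mean()

y_true = np.array([[1, 0], [1, 1]])
y_pred = np.array([[1, 0], [0, 1]])
micro, macro = micro_macro_f1(y_true, y_pred)
```

Because macro averaging weights every code equally, it is the more sensitive metric for the rare and zero-shot codes this paper targets.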
Table 1 shows the results of ZAGRNN models on all the codes. Note that fine-tuning the zero-shot code classifiers using meta-embedding or WGAN-GP does not affect classification of the seen codes, since the code-assignment classifiers are fine-tuned independently.
Table 2 summarizes the results for zero-shot codes. For the baseline ZAGRNN and meta-embedding models, the AUC on zero-shot codes is much better than random guessing. Training with the LDAM loss improves the AUC scores, and meta-embedding achieves slightly better F1 scores. However, since these methods never train the binary classifiers for zero-shot codes on positive examples, both micro and macro recall and F1 scores are close to zero; in other words, these models almost never assign zero-shot codes at inference time. For the WGAN-GP based methods, all metrics improve over ZAGRNN and meta-embedding except micro precision, because the binary zero-shot classifiers are fine-tuned on positive generated features, which drastically increases the chance of the models assigning zero-shot codes.
Ablation studies on WGAN-GP methods. We next examine the detailed performance of WGAN-GP under different losses. Adding the classification loss hurts the micro metrics, which may seem counter-intuitive at first; however, since that loss is computed with the pretrained classifiers, which generalize poorly on infrequent codes, it can provide a bad gradient signal to the generator. Adding the keyword reconstruction loss or the sibling-based zero-shot loss improves performance, and the two achieve comparable micro and macro metrics. On closer inspection, the sibling-based loss improves recall the most, matching the intuition that learning with sibling codes enables the model to generate more diverse latent features. Performance drops when combining the sibling-based loss with the cycle-consistency loss. We suspect this is due to an optimization conflict: the generator tries to synthesize features close to the sibling code while simultaneously mapping back to the exact semantic space of the code itself. Using the keyword reconstruction loss resolves this conflict, as it reconstructs more generic semantics from words instead of from the exact code descriptions. Our final model, combining the sibling-based and keyword reconstruction losses, achieves the best micro and macro F1 and AUC scores.
t-SNE visualization of generated features. We plot the t-SNE projection of the features generated for zero-shot codes by WGAN-GP with and without the sibling-based zero-shot loss in Figure 3. Dots in lighter colors are projections of generated features; darker dots are real features from the nearest sibling codes. Features generated with the sibling-based loss lie closer to the real features of the nearest sibling codes, showing that this loss yields features that better preserve the ICD hierarchy.
| Code | Description | Keywords (w/o sibling loss) | Keywords (w/ sibling loss) |
|---|---|---|---|
| V10.62 | Personal history of myeloid leukemia | AICD, inferoposterior, cardiogenic, leukemia, silent | leukemia, Zinc, myelogenous, CML, metastases |
| E860.3 | Accidental poisoning by isopropyl alcohol | apneic, pulses, choking, substance, fractures | intoxicated, alcoholic, AST, EEG, alcoholism |
| 956.3 | Injury to peroneal nerve | vault, injury, pedestrian, orthopedics, TSICU | injuries, neurosurgery, injury, TSICU, coma |
| 851.05 | Cortex contus-deep coma | contusion, injury, trauma, neurosurgery, head | brain, head, contusion, neurosurgery, intracranial |
| 772.2 | Subarachnoid hemorrhage of fetus or newborn | subarachnoid, SAH, neurosurgical, screening | subarachnoid, hemorrhages, SAH, newborn, pregnancy |
Keyword reconstruction from generated features. We next qualitatively evaluate the generated features by examining their reconstructed keywords. We first train a keyword predictor with the keyword reconstruction objective on real latent features and the keywords extracted from the training data. We then feed the features generated for zero-shot codes into this predictor to obtain reconstructed keywords.
Table 3 shows examples of the top predicted keywords for zero-shot codes. Even though the keyword predictor is never trained on zero-shot code features, the generated features retrieve words that are semantically close to the code descriptions. In addition, features generated with the sibling-based loss yield more relevant keywords. For instance, for zero-shot code V10.62, the top predicted keywords include leukemia, myelogenous, and CML (chronic myelogenous leukemia), all related to myeloid leukemia, a type of cancer of the blood and bone marrow.
| Model | Micro-P | Micro-R | Micro-F1 | Micro-AUC | Macro-P | Macro-R | Macro-F1 | Macro-AUC |
|---|---|---|---|---|---|---|---|---|
| ZAGRNN + | 60.53 | 1.82 | 3.53 | 92.10 | 6.29 | 1.80 | 2.80 | 90.74 |
| ZAGRNN + Meta | 48.88 | 6.75 | 11.84 | 92.15 | 16.65 | 6.77 | 9.62 | 90.92 |
Few-shot codes results. Given the promising results on zero-shot codes, we also evaluate our feature generation framework on few-shot ICD codes, i.e., codes with at most 5 training examples. We apply exactly the same setup as for zero-shot codes to synthesize features and fine-tune classifiers. There are 220 and 223 few-shot codes in the validation and test set, respectively.
Table 4 summarizes the results. The ZAGRNN models perform slightly better on few-shot codes than on zero-shot codes, yet recall is still very low. Meta-embedding boosts recall and F1 over the baseline models. WGAN-GP methods further boost recall, F1, and AUC, and the performance of the different loss combinations generally follows the pattern observed on zero-shot codes. In particular, the sibling-based and keyword reconstruction losses perform slightly better than the other WGAN-GP variants in terms of F1 and AUC.
We introduced the first feature generation framework, AGMC-HTS, for generalized zero-shot multi-label classification in the clinical text domain. We incorporated the ICD tree hierarchy into the design of GAN models that significantly improve zero-shot ICD coding without compromising performance on seen ICD codes. We also qualitatively demonstrated that the features generated by our framework preserve class semantics as well as the ICD hierarchy better than existing feature generation methods. In addition to zero-shot codes, we showed that our method improves performance on few-shot codes with limited amounts of labeled data.
- (2017) Wasserstein GAN. arXiv preprint arXiv:1701.07875.
- (2019) Learning imbalanced datasets with label-distribution-aware margin loss. In NeurIPS.
- (2016) An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In ECCV.
- (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
- (2018) Multi-modal cycle-consistent generalized zero-shot learning. In ECCV.
- (2014) Generative adversarial nets. In NeurIPS.
- (2017) Improved training of Wasserstein GANs. In NeurIPS.
- (1997) Long short-term memory. Neural Computation.
- (2016) MIMIC-III, a freely accessible critical care database. Scientific Data.
- (2015) Adam: a method for stochastic optimization. In ICLR.
- (2015) Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493.
- (2019) Large-scale long-tailed recognition in an open world. In CVPR.
- (2018) Explainable prediction of medical codes from clinical text. In NAACL.
- (2019) Dual adversarial semantics-consistent network for generalized zero-shot learning. In NeurIPS.
- (2017) Train once, test anywhere: zero-shot learning for text classification. arXiv preprint arXiv:1712.05972.
- (2018) Few-shot and zero-shot multi-label learning for structured label spaces. In EMNLP.
- (2017) Towards automated ICD coding using deep learning. arXiv preprint arXiv:1711.04075.
- (2010) A systematic literature review of automated clinical coding and classification systems. JAMIA.
- (2018) Feature generating networks for zero-shot learning. In CVPR.
- (2017) Zero-shot learning: the good, the bad and the ugly. In CVPR.
- (2018) A neural architecture for automated ICD coding. In ACL.
- (2019) Integrating semantic knowledge to tackle zero-shot text classification. In ACL.
- (2019) BioWordVec, improving biomedical word embeddings with subword information and MeSH. Scientific Data.
Appendix A Appendix: Gated Recurrent Units
Appendix B Appendix: Generative adversarial networks
GANs have been extensively studied for generating highly plausible data. The idea of a GAN is to train a generator and a discriminator through a minimax game: the generator takes random noise and generates fake data to fool the discriminator, while the discriminator tries to distinguish generated data from real data. Because the training procedure of GANs can be unstable, Wasserstein GAN (WGAN) was proposed to counter the instability problem by optimizing the Wasserstein distance instead of the original Jensen-Shannon divergence. WGAN-GP further improves WGAN by using a gradient penalty instead of weight clipping to enforce the 1-Lipschitz constraint required of the WGAN discriminator.
Appendix C Appendix: More training details
ICD-9 code information. We extract the ninth version of the ICD code descriptions and hierarchy from the CDC website111https://www.cdc.gov/nchs/icd/icd9cm.htm. In addition to the official description, we extend the descriptions with medical knowledge, including synonyms and clinical information, crawled from online resources222http://www.icd9data.com/.
ZAGRNN. For the ZAGRNN model, we use 100 convolution filters with a filter size of 5. We use 200-dimensional word vectors pretrained on the PubMed corpus333https://github.com/ncbi-nlp/BioWordVec. We apply dropout to the word embedding layer with rate 0.5. We use ADAM for optimization with a minibatch size of 8 and a learning rate of 0.001. The final feature size and the GRNN hidden layer size are both set to 400. We train the ZAGRNN model for 40 epochs.
Meta-embedding. Meta-embedding was proposed for large-scale long-tailed recognition by transferring knowledge from head classes to tail classes. The method naturally fits ICD coding due to the long-tailed code distribution. To apply meta-embedding to ICD coding, we first construct a set of centroids $c_\ell$ as the mean of the label-wise features $f_\ell$ for each code over the training data. Let $\odot$ denote dimension-wise multiplication; the meta-embedding for a feature $f_\ell$ is calculated as:

$$f_\ell^{\mathrm{meta}} = f_\ell + \gamma \odot \sum_{\ell' \in \mathcal{L}} a_{\ell'}\, c_{\ell'},$$

where $a$ holds the attention scores for selecting centroids and $\gamma$ is a dimension-wise coefficient for selecting the attended features. Both $a$ and $\gamma$ are parameterized as neural networks and learned during fine-tuning. The final classification is performed by $\hat{y}_\ell = \sigma(w_\ell^\top f_\ell^{\mathrm{meta}})$.
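A toy version of this meta-embedding computation, with fixed vectors standing in for the attention and selector networks (in the actual method both are small learned modules):

```python
import numpy as np

def meta_embedding(f, centroids, att_scores, gamma):
    """f_meta = f + gamma * (attended mix of class centroids).

    f: (d,) direct feature; centroids: (L, d) per-code centroids;
    att_scores: (L,) attention over centroids; gamma: (d,) selector.
    """
    memory = att_scores @ centroids   # knowledge transferred from head codes
    return f + gamma * memory         # dimension-wise gated residual

f = np.array([1.0, 0.0])
centroids = np.array([[2.0, 2.0],
                      [0.0, 4.0]])
a = np.array([0.5, 0.5])       # equal attention over the two centroids
gamma = np.array([0.1, 0.0])   # pass dimension 0, gate out dimension 1
out = meta_embedding(f, centroids, a, gamma)
```

The gating vector lets the model borrow centroid information only along dimensions where the direct feature is unreliable.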
For meta-embedding, we fine-tune the neural network modules for $a$ and $\gamma$ using ADAM with learning rate 0.0001 and batch size 32.
WGAN-GP with classification loss. A cross-entropy loss can be added during WGAN-GP training so that generated features are correctly classified as their conditioned labels. In ICD coding, this translates to enforcing $G(z, e_\ell)$ to be classified as positive for code $\ell$:

$$\mathcal{L}_{\mathrm{cls}} = -\,\mathbb{E}_{z, e_\ell}\big[ \log \sigma\big(w_\ell^\top G(z, e_\ell)\big) \big].$$
WGAN-GP with cycle-consistency loss. Similar to adding a classification loss to prevent the generated features from being arbitrary, a cycle-consistency loss constrains the synthetic representations to regenerate their original semantic features. Let $R$ be a linear regression that estimates the label embedding $e_\ell$ from the generated feature $G(z, e_\ell)$; the cycle-consistency loss is defined as:

$$\mathcal{L}_{\mathrm{cyc}} = \mathbb{E}_{z, e_\ell}\big[ \| R(G(z, e_\ell)) - e_\ell \|_2^2 \big].$$