Exemplar Auditing for Multi-Label Biomedical Text Classification

04/07/2020 ∙ by Allen Schmaltz, et al. ∙ Harvard University

Many practical applications of AI in medicine consist of semi-supervised discovery: The investigator aims to identify features of interest at a resolution more fine-grained than that of the available human labels. This is often the scenario faced in healthcare applications as coarse, high-level labels (e.g., billing codes) are often the only sources that are readily available. These challenges are compounded for modalities such as text, where the feature space is very high-dimensional, and often contains considerable amounts of noise. In this work, we generalize a recently proposed zero-shot sequence labeling method, "binary labeling via a convolutional decomposition", to the case where the available document-level human labels are themselves relatively high-dimensional. The approach yields classification with "introspection", relating the fine-grained features of an inference-time prediction to their nearest neighbors from the training set, under the model. The approach is effective, yet parsimonious, as demonstrated on a well-studied MIMIC-III multi-label classification task of electronic health record data, and is useful as a tool for organizing the analysis of neural model predictions and high-dimensional datasets. Our proposed approach yields both a competitively effective classification model and an interrogation mechanism to aid healthcare workers in understanding the salient features that drive the model's predictions.




1 Introduction

A considerable amount of medical and scientific knowledge is encoded in unstructured, high-dimensional data. Much of this data is the result of human processes that produce human-mediated labels at a particular level of granularity, but then the investigator seeks analyses at a lower granularity of the data, for which there are no explicit ground-truth labels. With data modalities such as text, it is particularly challenging to uncover correlations across label granularities, given the complex dependencies among words and phrases.

Attention mechanisms, and related methods, provide a stochastic approach to relating document- and token-level scores in a test instance, but they are an incomplete solution to the broader challenge of interpreting black box neural models and data, since they do not provide a clear means of assessing the extent to which an example at test time is reflective of the data used to train the model.

Toward this end, the recent work of Schmaltz (2019) proposed an approach for decomposing a document-level CNN binary classifier to produce token-level scores. This approach, “binary labeling via a convolutional decomposition” (BLADE), was shown to yield sharp token-level feature detections for a challenging zero-shot binary sequence labeling task. Importantly, this approach has the benefit that the token-level scores have a natural local, token-specific vector summarization derivable from the CNN filters. This enables an analysis method, exemplar auditing, for leveraging the CNN filters as representative keys of the strong class conditional feature detection of the binary BLADE model.

We extend the BLADE model and exemplar auditing to the multi-label classification setting and demonstrate its usage for medical text. We further examine the relative distances among exemplars in this context, proposing a novel, intuitive approach for analyzing these distances at an instance and label-specific level. We suggest that this analysis machinery is particularly applicable to medical settings, where the data is high-dimensional and noisy, and yet there is a need to have some level of verification of discovered patterns based on the available labeled training data.

Technical Significance

We extend the previous work of Schmaltz (2019) to the multi-label classification setting. Furthermore, we demonstrate an approach for applying exemplar auditing to the case when the input to the CNN is not a contextualized model, and where the relative ranking of the predicted labels is important, combining losses for signal from both the local and global levels in a straightforward but effective manner. In this context, we demonstrate competitive classification results on a multi-label classification task of electronic health record data. Finally, we provide what is, to the best of our knowledge, a novel approach for analyzing and normalizing the distances from exemplar auditing, demonstrating that distances to the nearest true positive, false negative, false positive, and true negative representative vectors from the training set provide useful signal in assessing a prediction. This is an intuitive and effective means of normalizing the distances for a given instance for a given label.

Clinical Relevance

In order to put our methods in the context of previous work, we focus on annotating free text clinical notes with ICD-9 labels in patient discharge summaries. It is estimated that the United States spends many billions annually on unnecessary administrative costs, such as those arising from a complex billing infrastructure (Yong et al., 2010), so improving and streamlining this process is of high value. This setting is emblematic of the more general setting where we have a large amount of high-dimensional data (such as text) about patients and labels at a coarse level of granularity, but we want to analyze the data at a finer level of granularity. This could be for introspecting a prediction about a patient: We predict an outcome (or possible applicable diagnosis) for a patient with a model, but we want to examine the training set (for which we have known ground-truth labels) for similar patients to help inform our own clinical decision-making. It could also be used more generally for uncovering previously unknown patterns in large, high-dimensional datasets of patients or drugs, or for discovering label discrepancies in training datasets that may be used in high-stakes decision making.

We are proposing a rather different way of making sense of models and data than, say, interpreting coefficients on models and/or examining p-values. We instead distill the relevant data into a small number of intuitive and information-dense values: localized features and associated predictions (and ground-truth, where available) from test and training, and the relative, normalized distances to the nearest true positive, false negative, false positive, and true negative representative examples from the training set. As a result, we can leverage the ability of the neural networks to find signals in large amounts of data, while retaining a straightforward means of ingesting and assessing those insights at a human level.

2 Methods

We extend the convolutional decomposition proposed in Schmaltz (2019), which was originally evaluated on binary labeling settings, to the case where the labels are themselves high-dimensional. The approach utilizes a one-dimensional convolutional network (Kim, 2014), which is then decomposed to produce scores at the token-level (i.e., the “local” level), even though token-level labels are not available during training. We start by describing the CNN, as used for classification at the document-level (i.e., the “global” level), before detailing the decomposition, associated loss functions, and exemplar auditing, the means of introspecting the predictions, for token-level analyses.

Footnote 1: In practice, the document can also be a single sentence. The key distinction is that the level of analysis of the base classifier is that at which human labels are available, and then the CNN is decomposed to produce scores at a lower level of granularity.

Multi-Label Classification

Each token $t_1, \ldots, t_N$ in the document is represented by a $D$-dimensional vector, where $N$ is the length of the document, including padding symbols, as necessary. This $D \times N$ matrix is the input to the CNN. This mapping of a token to $D$-dimensional vectors can be, for example, via standard word embeddings (Pennington et al., 2014; Mikolov et al., 2013), a concatenation of standard word embeddings and contextualized embeddings (Devlin et al., 2018), or in principle, a neural network over other input modalities pre-trained with a masked-language-model-style, or related, loss (over images, time-series data, etc.). In this work, we consider the first input modality.

The convolutional layer is applied to this matrix, using a filter of width $Z$, sliding across the $Z$-sized ngrams of the input. The convolution results in a feature map $h_m \in \mathbb{R}^{N-Z+1}$ for each of $M$ total filters. Note that each of the $M$ filters has its own bias and weights.

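We then compute

$$g_m = \max \mathrm{ReLU}(h_m),$$

a ReLU non-linearity followed by a max-pool over the ngram dimension, resulting in $g \in \mathbb{R}^{M}$. A final linear fully-connected layer, $W \in \mathbb{R}^{2C \times M}$, with a bias, $b \in \mathbb{R}^{2C}$, produces a vector of scores, $o \in \mathbb{R}^{C}$, for each of the $C$ class labels:

$$o_c = \left(W_{c^+} g + b_{c^+}\right) - \left(W_{c^-} g + b_{c^-}\right),$$

where $c^+$ and $c^-$ index the rows of $W$ (and entries of $b$) holding the “on” and “off” parameters of label $c$.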

Typically such classifiers are trained for document classification (with similar effect) with a fully connected layer of dimension $C \times M$. Here, we have replaced $C$ with $2C$ (with a concomitant subtraction of the output) as a minor change to maintain the semantics of the binary decomposition (i.e., each label has a positive, or “on”, state and a negative, or “off”, state). Using this convention, the “on” weights of label $c$ are in row $c^+$ of $W$, and the “off” weights of label $c$ are in row $c^-$ of $W$ (and analogously for the bias).
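As a rough illustration of the forward pass described above, the following pure-Python sketch computes the feature maps, the ReLU and max-pool, and each label score as the difference between an “on” and an “off” linear output. The function names, the toy inputs, and the choice to store the “on” and “off” rows as separate lists are assumptions made for readability; this is not the authors' implementation.

```python
def feature_maps(x, filters, biases):
    """Slide each filter over the token vectors.

    x:       list of N token vectors, each of length D
    filters: list of M filters, each a list of Z weight vectors of length D
    biases:  list of M floats
    Returns M feature maps, each of length N - Z + 1.
    """
    maps = []
    for filt, bias in zip(filters, biases):
        width = len(filt)
        fmap = []
        for start in range(len(x) - width + 1):
            total = bias
            for k in range(width):
                total += sum(w * v for w, v in zip(filt[k], x[start + k]))
            fmap.append(total)
        maps.append(fmap)
    return maps


def relu_maxpool(maps):
    """Apply ReLU, then keep the max of each feature map and its index."""
    pooled, argmax = [], []
    for fmap in maps:
        activated = [max(0.0, h) for h in fmap]
        best = max(range(len(activated)), key=activated.__getitem__)
        pooled.append(activated[best])
        argmax.append(best)
    return pooled, argmax


def label_scores(pooled, w_on, b_on, w_off, b_off):
    """Score each label as its "on" output minus its "off" output."""
    scores = []
    for c in range(len(w_on)):
        on = b_on[c] + sum(w * g for w, g in zip(w_on[c], pooled))
        off = b_off[c] + sum(w * g for w, g in zip(w_off[c], pooled))
        scores.append(on - off)
    return scores


# Toy example (hypothetical values): N = 3 tokens, D = 2,
# M = 2 filters of width Z = 2, and C = 1 label.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
filters = [[[1.0, 0.0], [0.0, 1.0]], [[-1.0, 0.0], [0.0, -1.0]]]
maps = feature_maps(x, filters, biases=[0.0, 0.0])
pooled, argmax = relu_maxpool(maps)
scores = label_scores(pooled, w_on=[[1.0, 1.0]], b_on=[0.5],
                      w_off=[[0.5, 0.0]], b_off=[0.0])
```

The indices returned alongside the pooled values are kept because the later decomposition needs to know which ngram produced each surviving filter output.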

Here, we have a multi-label setting (i.e., each document can be assigned multiple labels) rather than a multi-class setting (i.e., with exclusive assignment of one label from a set of 2 or more labels); as such, we train with a sigmoid transform and a binary cross-entropy loss, as opposed to a softmax cross-entropy as used in multi-class settings:

$$L_c = -y_c \log \sigma(o_c) - (1 - y_c) \log\left(1 - \sigma(o_c)\right),$$

where $y_c \in \{0, 1\}$ is the corresponding true class assignment for label $c$ (at the document level). This loss is averaged over all classes, over the documents in the mini-batch.
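A minimal sketch of the sigmoid binary cross-entropy over labels, for one document, is given below. The function names and the clipping constant `eps` are illustrative assumptions; a real training setup would typically use a numerically fused loss from a deep learning library.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def multilabel_bce(scores, targets, eps=1e-12):
    """Mean binary cross-entropy over the labels of one document.

    scores:  list of C real-valued label scores
    targets: list of C values in {0, 1}
    """
    losses = []
    for o, y in zip(scores, targets):
        p = min(max(sigmoid(o), eps), 1.0 - eps)  # clip to avoid log(0)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1.0 - p)))
    return sum(losses) / len(losses)
```

Because each label is scored independently, a document can receive any number of positive labels, which is the distinction from a softmax over classes.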

CNN Decomposition: multi-BLADE

We seek token-level scores for every label, which we obtain by decomposing the final layer of the CNN. We use the notation $n_m = \arg\max \mathrm{ReLU}(h_m)$ to identify the index into the feature map $h_m$ that survived the max-pooling operation, which corresponds to the application of filter $m$ starting at index $n_m$ of the input (i.e., the set $\{n_m, \ldots, n_m + Z - 1\}$ contains all of the indices of the input covered by this particular application of the filter of width $Z$). Note that the filter output is constant across labels, but each label is associated with a unique set of weights (and a bias) from the final fully-connected layer. We obtain a positive (“on” state) contribution score for each input token $t_n$, for each label $c$, as follows:

$$s^{+}_{n,c} = b_{c^+} + \sum_{m=1}^{M} W_{c^+,m} \cdot g_m \cdot \left[ n \in \{n_m, \ldots, n_m + Z - 1\} \right],$$

where we have used an Iverson bracket for the indicator function. The corresponding negative (“off” state) contribution score for token $t_n$ and label $c$ is analogous:

$$s^{-}_{n,c} = b_{c^-} + \sum_{m=1}^{M} W_{c^-,m} \cdot g_m \cdot \left[ n \in \{n_m, \ldots, n_m + Z - 1\} \right].$$
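The decomposition can be sketched as follows for one state (“on” or “off”) of one label. This is an illustrative reading of the description above, assuming a single filter width and assuming that the bias is added to every token's score; the function and argument names are hypothetical.

```python
def token_contributions(pooled, argmax, width, n_tokens, weights, bias):
    """Token-level contribution scores for one state of one label.

    pooled:  list of M max-pooled filter outputs
    argmax:  list of M start indices of the ngrams that survived the max-pool
    width:   filter width (assumed identical for all filters here)
    weights: row of the final layer for this label state (length M)
    bias:    bias of the final layer for this label state
    """
    scores = [bias] * n_tokens
    for m, (g, start) in enumerate(zip(pooled, argmax)):
        # Credit the filter's weighted output to every token it covered.
        for n in range(start, min(start + width, n_tokens)):
            scores[n] += weights[m] * g
    return scores


def combined_contributions(on_scores, off_scores):
    """Positive minus negative contribution, per token."""
    return [p - q for p, q in zip(on_scores, off_scores)]
```

For example, with `pooled=[2.0, 0.5]`, `argmax=[0, 1]`, and width 2 over three tokens, the middle token is covered by both surviving ngrams and so receives a contribution from both filters.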

Fine-Tuning: Multi-Label Min-Max + Global Normalization

The token-level scores can then be used directly (perhaps with some lightweight tuning of the bias, i.e., the decision threshold, by an end-user). However, the decomposition affords flexibility in defining additional loss constraints, with which we can fine-tune the model, that can be useful in practice.

Footnote 2: End-user tuning can also be useful for the fine-tuned models and is simpler and more perfunctory than it may sound: In practice, a clinician, or data annotator, is given access to the output in an interface with a single “slider”, or other mechanism, to adjust a single real value to offset the learned bias, which has the effect of modulating precision and recall. Since the preferred balance between precision and recall varies across settings and end-users, such a mechanism is likely necessary in practice. We leave this for future HCI studies to investigate further.

In documents associated with a label, we assume that at least one token in the document is associated with the label (i.e., the positive contribution score is greater than the negative contribution score for at least one token), but at least some tokens (and in fact, perhaps most, in practice) are not primarily associated with the label (in the sense that an end-user would not label the tokens with that class, but of course, there could be indirect dependence). Similarly, for documents not associated with a label, we assume all of the token-level positive contribution scores are less than or equal to the negative contribution scores for that label. We can encode this in the following min-max loss over labels:

$$L_{\min} = -\log\left(1 - \sigma(s^{+-}_{c,\min})\right),$$

where $s^{+-}_{n,c} = s^{+}_{n,c} - s^{-}_{n,c}$ is a combined token contribution and $s^{+-}_{c,\min}$ is the smallest combined token contribution in the document for label $c$; and

$$L_{\max} = -y_c \log \sigma(s^{+-}_{c,\max}) - (1 - y_c) \log\left(1 - \sigma(s^{+-}_{c,\max})\right),$$

where $s^{+-}_{c,\max}$ is the largest combined token contribution in the document for label $c$ and $y_c$ is the corresponding true class assignment for label $c$.

Footnote 3: This is a generalized form of the two-class min-max binary cross-entropy loss of Schmaltz (2019), which was adapted from the two-class min-max squared loss of Rei and Søgaard (2018) over attention for grammatical error detection.
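One way to sketch the min-max constraints for a single label of a single document is shown below. It assumes the binary cross-entropy form suggested by the description above (the smallest combined score is always pushed toward the “off” state, and the largest is pushed toward the document-level label); the function names are hypothetical.

```python
import math


def softplus(z):
    """Stable log(1 + exp(z)); equals -log(1 - sigmoid(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def min_max_loss(combined_scores, y):
    """Min-max loss terms for one label of one document.

    combined_scores: per-token positive-minus-negative contribution scores
    y:               document-level gold assignment for the label (0 or 1)
    Returns (loss_min, loss_max).
    """
    s_min = min(combined_scores)
    s_max = max(combined_scores)
    # At least one token should look "off", regardless of the label.
    loss_min = softplus(s_min)
    # The strongest token should agree with the document-level label:
    # -log(sigmoid(s_max)) if y == 1, else -log(1 - sigmoid(s_max)).
    loss_max = softplus(-s_max) if y == 1 else softplus(s_max)
    return loss_min, loss_max
```

With scores such as `[-3.0, 4.0]` and `y = 1`, both terms are small, since one token is clearly “off” and one is clearly “on”; with `y = 0`, the same scores incur a large `loss_max`.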

Just fine-tuning using the aforementioned min-max loss can yield strong F-scores at the label level (derived from the max token contribution scores), but the contribution scores across labels (at the document level) may lack the normalization of the original fully-connected layer. In other words, the contribution scores for label $c$ may be reasonable at the token level, but comparing $s^{+-}_{c,\max}$ with $s^{+-}_{c',\max}$ may be a less reliable measure of the relative propensity for label $c$ vs. $c'$ at the document level than just comparing $o_c$ with $o_{c'}$ resulting from training with the standard binary cross-entropy loss. This can be an issue if we aim to evaluate labels at the document level with retrieval-style ranking metrics, or in practice, aim to only present the end user with the subset of the top $k$ most likely labels for the document. One resolution to this issue is to simply ensemble the originally trained model and the min-max fine-tuned weights (for example, using the former for label ranking and the latter for visualizing token-level scores), but we can alternatively modify the loss to incorporate the intuition of this ensemble approach, which has the benefit of generating a single, shared set of CNN filter weights for use with our analysis methods.

To enforce this global constraint, after training the base model and prior to fine-tuning, we instantiate a second linear fully-connected layer, $W'$, with a bias, $b'$, with un-tied weights copied from $W$ and $b$, respectively. These two linear layers share the same convolutional filters (and input to the CNN), but the weights and biases are free to change separately. We then consider the following additional loss:

$$L_{g} = -y_c \log \sigma(o'_c + s^{+-}_{c,\max}) - (1 - y_c) \log\left(1 - \sigma(o'_c + s^{+-}_{c,\max})\right),$$

where $o'_c$ is calculated in the same manner as $o_c$ (in the base model), but with $W'$ and $b'$ instead of $W$ and $b$.

$L_{\min}$, $L_{\max}$, and $L_{g}$ are then averaged over all classes over all documents in the mini-batch. In this way, local sparsity is enforced on the token-level scores (via $L_{\min}$ and $L_{\max}$), and global normalization across labels is maintained with the document-level maxpooling inherent in the calculation of $o'_c$. Both the local and global constraints interact in $L_{g}$.

At inference, we assign label $c$ to the document if $o'_c + s^{+-}_{c,\max} > 0$.

Visualization of Token-Level Scores

When visualizing the token-level score for token $t_n$ for label $c$ (i.e., the token-level label assignment) of a model fine-tuned just with the $L_{\min}$ and $L_{\max}$ losses, we find that $s^{+-}_{n,c} > 0$ is a natural baseline decision threshold. When using $L_{\min}$, $L_{\max}$, and $L_{g}$, we instead use the following:

$$s^{+-}_{n,c} + o'_c > 0,$$

which takes into account the addition of the global score.

Exemplar Auditing

The previous work of Schmaltz (2019) proposed exemplar auditing, an approach for leveraging the CNN filters as representative keys of the strong class conditional feature detection of the binary BLADE model, affording a means to introspect the training set (hereafter, “database”) for a nearest neighbor to a relevant local feature in a test (hereafter, “query”) prediction. This can be useful to audit the prediction, either for labeling additional data, or more generally for analyzing the data and model behavior. We further explore this idea in the context of multi-label classification.

Each token is associated with a vector that corresponds to the relevant filter applications from the convolution. In order to consider filters of arbitrary width, we associate a token with the average of all filter applications that covered the token (prior to the global maxpool operation). More specifically, with $M$ filters of width $Z$, for each token $t_n$ we have a vector $e_n \in \mathbb{R}^{M}$:

$$e_n = \left[ \frac{1}{|I_n|} \sum_{i \in I_n} h_{1,i}, \; \ldots, \; \frac{1}{|I_n|} \sum_{i \in I_n} h_{M,i} \right], \quad I_n = \{ i : i \le n \le i + Z - 1 \},$$

where we have averaged all components of each of the feature maps that resulted from an application over the token at index $n$. In the case of multiple filter widths, we concatenate all of the resulting vectors.
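The averaging over covering filter applications can be sketched as below, assuming a single filter width and feature maps stored as plain lists of pre-ReLU values. The function name and the list-based layout are illustrative assumptions.

```python
def token_exemplar_vector(feature_maps, n, width):
    """Average, per filter, the feature-map entries whose ngram covers token n.

    feature_maps: list of M lists, each of length N - width + 1 (pre-ReLU)
    n:            0-based token index
    width:        filter width Z
    Entry i of a feature map covers tokens i .. i + width - 1, so token n is
    covered by entries max(0, n - width + 1) .. min(n, N - width).
    """
    vec = []
    for fmap in feature_maps:
        first = max(0, n - width + 1)
        last = min(n, len(fmap) - 1)
        covering = fmap[first:last + 1]
        vec.append(sum(covering) / len(covering))
    return vec
```

Tokens at the document boundaries are covered by fewer filter applications than interior tokens, so the average is taken over however many applications exist. With several filter widths, the same function would be called once per width and the results concatenated.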

Since in our experiments the input to the CNN is not necessarily a contextualized embedding that has access to the full document, and since our inference scoring takes into account the maxpool vector (via $o'$), we also consider the document-level maxpool vector (with ReLU):

$$g = [g_1, \ldots, g_M],$$

which is constant for all tokens in the document. The full vector for the token at index $n$ is then the concatenation of the applicable token-specific components of the feature maps and the document-level maxpool components:

$$v_n = [e_n; g].$$

Footnote 4: The motivation for not applying ReLU and not censoring the negative components in $e_n$ is that there is potentially informative signal in the negative values for distinguishing exemplars at the n-gram level. In the case of $g$, we use the ReLU for consistency with the maxpooling of training (and in any case, padding tokens and n-grams impose a de-facto ReLU, unless masked, since they are zero by default).

We subsequently use $d$ to refer to exemplar vectors from the database, and we use $q$ to refer to such vectors from a query. As an important distinction, $d$ has access to the ground truth labels from training, whereas $q$ does not. Since operating over all exemplar vectors from every token in the database can be computationally expensive in both time and space for large numbers of long documents, we make the restriction that we only store one exemplar vector (with the max token-level contribution score) for each predicted or gold label for each document in the database. In other words, for a given class $c$ for a given document, we only store the exemplar vector corresponding to $\arg\max_n s^{+-}_{n,c}$ when the model predicts label $c$ and/or $y_c = 1$. The number of predicted and gold labels per document is typically considerably less than both the total number of classes and the total number of tokens (and the same exemplar vector can be associated with multiple labels, but not vice-versa), so this restriction dramatically decreases the size of the database.

When classifying new documents at test time, for any class $c$ with a positive prediction, we can associate the query token $t_n$ with the exemplar vector at index $j$ from the database, $d_j$ (and corresponding document), that minimizes the Euclidean distance with that of the query token’s vector $q_n$:

$$j = \arg\min_{j'} \lVert q_n - d_{j'} \rVert_2.$$

Previous work demonstrated that the exemplars could be effectively utilized at inference by combining the query and database predictions via a conjunctive decision rule to increase the precision of the predictions. For reference (and as a means of organizing one’s analysis), we also consider a soft combination between the query prediction and the prediction of the matched database exemplar modulated by relative distances. In practice when classifying new documents at test time, for the predicted class $c$ for each query token’s vector, $q$, we retrieve up to 4 distinct database vectors, $d^{TP}$, $d^{FN}$, $d^{FP}$, and $d^{TN}$, each of which corresponds to a unique document in the database:

  1. The vector $d^{TP}$ minimizes the Euclidean distance with that of the query token’s vector $q$ with the restriction that $d^{TP}$ is associated with both a positive model prediction for class $c$ and a corresponding positive ground truth label for class $c$ (i.e., this is a true positive in the training set).

  2. The vector $d^{FN}$ minimizes the Euclidean distance with that of the query token’s vector $q$ with the restriction that $d^{FN}$ is associated with a positive ground truth label for class $c$ but a negative model prediction for class $c$ (i.e., this is a false negative in the training set).

  3. The vector $d^{FP}$ minimizes the Euclidean distance with that of the query token’s vector $q$ with the restriction that $d^{FP}$ is associated with a positive model prediction for class $c$, but a negative ground truth label for class $c$ (i.e., this is a false positive in the training set).

  4. The vector $d^{TN}$ minimizes the Euclidean distance with that of the query token’s vector $q$ with the restriction that $d^{TN}$ is associated with both a negative model prediction for class $c$ and a negative ground truth label for class $c$ (i.e., this is a true negative in the training set).

Footnote 5: Given our database restrictions to reduce computational costs, in practice the fourth case retrieves a token for which the document is not associated with class $c$ (either as a prediction or ground-truth label), but the token is associated with at least one other class label (either as a prediction or ground-truth label) as the max token-level contribution score.

Of these four vectors, we use the notation $d^{*}$ to identify that which is the overall minimal distance to the query:

$$d^{*} = \arg\min_{d \in \{d^{TP},\, d^{FN},\, d^{FP},\, d^{TN}\}} \lVert q - d \rVert_2.$$

Footnote 6: It is possible for one or more of these vectors to not exist in the database (e.g., $d^{TP}$ will not exist if the model never correctly predicted that label in training), in which case we simply assign a very large default distance.
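The four-way retrieval can be sketched as a brute-force search over the stored exemplars of one class. The dictionary layout of the database entries, the function names, and the value of `LARGE_DISTANCE` are assumptions for illustration; a real system with a large database would likely use an approximate nearest-neighbor index instead of a linear scan.

```python
import math

# (model predicted positive, gold label positive) -> outcome in training
CATEGORIES = {
    (True, True): "TP",
    (False, True): "FN",
    (True, False): "FP",
    (False, False): "TN",
}
LARGE_DISTANCE = 1e9  # default when a category has no stored exemplar


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve_exemplars(query_vec, database):
    """Nearest exemplar per outcome category for one class.

    database: list of dicts with keys 'vector', 'predicted', 'gold', 'doc_id',
              where 'predicted' and 'gold' are booleans for the class at hand.
    Returns (nearest, overall): a dict mapping each category to a
    (distance, doc_id) pair, and the name of the closest category.
    """
    nearest = {name: (LARGE_DISTANCE, None) for name in CATEGORIES.values()}
    for entry in database:
        name = CATEGORIES[(entry["predicted"], entry["gold"])]
        dist = euclidean(query_vec, entry["vector"])
        if dist < nearest[name][0]:
            nearest[name] = (dist, entry["doc_id"])
    overall = min(nearest, key=lambda k: nearest[k][0])
    return nearest, overall
```

Presenting the four distances side by side gives an end-user a sense of whether the query resembles cases the model got right or cases it got wrong for that label.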

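As one means of analyzing whether these pieces of information provide signals in the expected directions, we also provide results where, at inference, if the query prediction for label $c$ is positive, we assign label $c$ to the document based on a soft combination of the query score and the score of $d^{*}$, with the latter weighted by a softmax over the negative distances from the query to the retrieved exemplars. Here (with a slight overloading of notation between scores from the query and database), the score of $d^{*}$ is the model score associated with $d^{*}$ from training (i.e., from running the model on that training document).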

The resulting score could be used as a blind, automatic substitute for the original model score (and we provide results to this effect below for context), but that is not the intended use case. Rather, this machinery is a way of organizing a human end-user’s evaluation of a model prediction (and the data), as analyzed further below.

Note that the softmax over the negative distances has the effect of down-weighting the impact of the score from the database when the other exemplars have relatively similar distances. In a high-stakes scenario, an end-user could instead impose a hard rejection of a label if, for example, the exemplar $d^{*}$ was not $d^{TP}$, or use that as reference context for re-labeling the data.
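The down-weighting effect of a softmax over negative distances can be illustrated with the sketch below. The exact decision rule is not reproduced here; the additive combination in `soft_combined_score` is only one possible form, assumed for illustration, and all names are hypothetical.

```python
import math


def softmax_neg_distances(distances):
    """Softmax over negated distances: closer exemplars get larger weights."""
    m = min(distances)
    exps = [math.exp(-(d - m)) for d in distances]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]


def soft_combined_score(query_score, db_score, distances, nearest_index):
    """ASSUMED form: query score plus the nearest exemplar's training score,
    scaled by that exemplar's softmax weight among the retrieved exemplars."""
    weight = softmax_neg_distances(distances)[nearest_index]
    return query_score + weight * db_score


# The nearest exemplar dominates when the others are far away ...
far = soft_combined_score(2.0, -4.0, [0.0, 1e9, 1e9, 1e9], 0)
# ... and is down-weighted when all four distances are similar.
close = soft_combined_score(2.0, -4.0, [1.0, 1.0, 1.0, 1.0], 0)
```

In the first case the database score fully offsets the query score; in the second it contributes only a quarter of its value, since no single exemplar is clearly the closest match.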

3 Experiments

Our approach can be generally applied to any multi-label text classification task. In the interest of comparing to previous work in the medical domain with data that is available for replication, we focus on the clinical text from the previous work of Mullenbach et al. (2018).

Data and Task

The Medical Information Mart for Intensive Care (MIMIC-III) dataset version 1.4 (Johnson et al., 2016; Pollard, 2016; Goldberger et al., 2000) is a large-scale dataset of de-identified patient data derived from admissions to a Boston-area hospital. The dataset is available to researchers under a data use agreement. We focus on the text of the patient discharge notes, which are labeled with International Classification of Diseases (ICD-9) codes. These codes are primarily for billing and administrative purposes, but serve as a useful testing ground for high-dimensional multi-label classification in the medical setting given the availability of data and previous works for comparison. We hypothesize that many of the challenges involved in this reasonably well-defined, replicable setting will be present in other medical classification scenarios, and that it is thus a reasonable testing grounds on which to focus.

More specifically, the task is to assign one or more ICD-9 codes to each discharge summary. We follow the publicly available MIMIC-III preprocessing and setup of Mullenbach et al. (2018), which lowercases and truncates the documents to a maximum length of 2500 tokens, removing any tokens that lack at least one alphabetic character. Low-frequency tokens (those occurring in fewer than three training documents) are replaced with a placeholder symbol. We use a comparable vocabulary size of the 50k most common tokens.

Footnote 7: We use the preprocessing code available at https://github.com/jamesmullenbach/caml-mimic.

We follow past work and provide results on two subsets of the data. In the first, we restrict the data to the top 50 most common labels, and only consider the 8066, 1573, and 1729 discharge summaries (hereafter, documents) associated with those labels in each of the train, development, and test sets, respectively. We also consider the full set of 8921 labels seen in the documents, which includes 47723, 1631, and 3372 documents in each of the train, development, and test sets, respectively. In this setting, there are 73 labels in the development set and 172 labels in the test set that are never seen in training (reflective of the larger universe of available ICD-9 codes). We follow past work in assigning these labels as missed predictions for the model at inference time.

The task is challenging for at least three reasons:

  1. The label space is high-dimensional, with documents assigned a variable number of labels. For reference, in the test set in the top 50 labels subset, there are on average 6 labels assigned to every document, ranging from a minimum of 1 label to a maximum of 20 labels. In the test set for the full set, there are on average 18 labels assigned to every document, ranging from 1 to 65 labels.

  2. The data is noisy, consisting of many incomplete, grammatically incorrect sentences with various abbreviations and medical-domain-specific language. Headings and other structures from the EHR are flattened into the text. As a result of the aforementioned, the text differs considerably from standard text used to pre-train typical NLP models. Additionally, while the number of documents may seem modest, in fact there is a considerable amount of text owing to the long length of the documents (as opposed to the “documents” consisting of single sentences).

  3. Owing to points (1) and (2) above, the task is also non-trivial for humans, introducing potential ambiguity and noise into the ground-truth labels.

We assess our approaches using the metrics of the previous works analyzing this dataset, where for consistency we have used the same evaluation scripts of Mullenbach et al. (2018). This includes the micro-averaged and macro-averaged $F_1$ and the area under the ROC curve (AUC). Following the previous work, we focus on the retrieval metric, precision @ $k$ (P@$k$), as our primary metric. In this context, this metric is the fraction of the $k$ highest-scoring labels that are true labels in the ground truth data, averaged over documents. This metric is chosen under the assumption that the real-world use case for such models is as an annotation support tool, emphasizing precision over recall, with the additional consideration that the relative ranking of predicted labels is important. In other words, we aim for a system that predicts labels that are relevant and true, with a ranking to allow an end-user to review the top few predicted labels. For the top 50 subset, we choose model parameters and perform tuning on the held-out development set using P@5, and similarly, P@8 for the full set.

Footnote 8: In the implementation of this metric in Mullenbach et al. (2018), the denominator was calculated as a constant $k$ across documents, which means that the gold labels will not yield a value of 1 against the ground truth, since some documents have fewer than $k$ true labels. In practice, we found that adjusting the denominator to the real number of true labels (when less than $k$) did not change the direction of any of the results in relative terms, so we stick to the previous formulation for consistency purposes.
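For concreteness, precision @ k with a constant denominator can be sketched as follows. This is an independent illustration rather than the evaluation script of Mullenbach et al. (2018), and the function names are hypothetical.

```python
def precision_at_k(scores, gold, k):
    """Fraction of the k highest-scoring labels that are gold labels.

    scores: list of C label scores for one document
    gold:   set of gold label indices for that document
    The denominator is always k, even if the document has fewer gold labels.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(1 for i in ranked[:k] if i in gold)
    return hits / k


def mean_precision_at_k(all_scores, all_gold, k):
    """Average precision @ k over documents."""
    values = [precision_at_k(s, g, k) for s, g in zip(all_scores, all_gold)]
    return sum(values) / len(values)
```

With a constant denominator, a document with a single gold label can score at most 1/k even under a perfect ranking, which is why the gold labels themselves do not reach a value of 1 under this formulation.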

In analyzing our proposed approaches, we aim for an input modality to the multi-BLADE layer that yields levels of effectiveness that are at least competitive with previous works. As we show below, the input word embeddings of previous work serve this purpose. We then use that as the substrate upon which to consider exemplar auditing. Note that the input (i.e., the underlying model of the bottom layers of the network) to the multi-BLADE layer is orthogonal to exemplar auditing in so far that we would assume that if there existed a significantly stronger model, it could be co-opted by incorporating the frozen version as input.

We turn now to the details of the models used in the experiments.

Footnote 9: In the published version, we will include a link to the replication code.

CNN Model

As our base model, we use a CNN with 100 filters of width 1 and 1000 filters each for widths of 3, 4, and 5. We train with Adadelta (Zeiler, 2012), with dropout of 0.5 on the input to the final fully-connected layer, choosing the epoch with the highest P@5 score on the held-out development set. We use the label CNN to refer to this model. With the full set of labels, we use a similar model, but increase the model capacity to 200 filters of width 1 and 2000 filters each for widths of 3, 4, and 5. We found that training on the full set was very sensitive to model parameters. Based on results on the development set, we train with Adam (Kingma and Ba, 2014) using a small learning rate and dropout of 0.6. Additionally, we train with a schedule such that the model only considers the top 1000 most frequent labels for the first 30 epochs before transitioning to training with the full label set. As in previous work, we choose the epoch with the highest P@8 score on the development set. We use the label CNN+full to refer to this model. With all models, we use pre-trained, 100-dimensional Word2Vec embeddings (Mikolov et al., 2013) over the documents as in the work of Mullenbach et al. (2018).

CNN Fine-Tuning

We fine-tune the base models using the $L_{\min}$, $L_{\max}$, and $L_{g}$ losses, for which we use the labels CNN+mmc and CNN+full+mmc. In the case of CNN+full+mmc, based on results on the development set, we only calculate token-level scores (and assign non-zero loss scores) for the top 1000 labels predicted by the model for each training instance in the mini-batch.

Exemplar Auditing

The exemplar auditing machinery is primarily intended as a per-document level analysis tool for a human end-user. To assess the quality of the signals presented to the end-users, we provide empirical results using the same aggregated metrics as the core models. We label experiments using the aforementioned soft combination of query and database scores (and distances) with the suffix +ExA. In further analyses, we also consider a decision rule in which we only admit a prediction for a label if the retrieved exemplar vector $d^{*}$ is $d^{TP}$, for which we use the label +ExADR. Finally, to provide a further evaluation of the similarity between the query and database vectors, we show results in which for a given model prediction, we substitute the score from the model on the query (i.e., the test set) with the score associated with $d^{*}$ (i.e., the score from the training set). We label these experiments with +onlyDB. Note that the exemplars for the CNN+mmc model are 6200-dimensional vectors, and for CNN+full+mmc, 12400-dimensional vectors.

Previous Models

The previous work of Mullenbach et al. (2018) considers replacing the standard maxpooling of the base CNN with a per-label attention mechanism, which in effect is a learned weighted average over the filters, specific to each label. This model is referred to as Convolutional Attention for Multi-Label classification (CAML). A variant (DR-CAML) is also considered which regularizes the predictions using embeddings of the labels. Both CAML and the decomposition examined here can be used to generate token-level scores; however, the manner of doing so is rather different. Whereas CAML utilizes a softmax attention mechanism, multi-BLADE is a method of leveraging the maxpooling behavior of the base classifier, and can also be used to derive token-level scores without additional parameters to the base model (including if fine-tuned with only the min-max loss). Finally, we also consider the LEAM model of Wang et al. (2018), which learns a joint embedding attention between both the document text and the label text. To our knowledge, the results in these works constitute the current baselines on this particular MIMIC-III task.

Footnote 10: As suggested in passing above, we could also use frozen versions of these alternative models as input to the multi-BLADE layer. However, as we show below, using standard word embeddings as input already yields at least competitive results on the primary metrics of interest, so we do not pursue this avenue further in this work. In preliminary experiments, we found that using the frozen contextualized embeddings of Devlin et al. (2018) led to a significant degradation in effectiveness, almost certainly owing to the large divergence between the domains on which these models were trained and the non-standard language of the discharge summaries. We leave retraining the contextualized embeddings on this domain of text to future work.

4 Results

In the analysis of the experimental results, we demonstrate the following two high-level points:

  1. We show that the proposed model and losses are at least competitive with previously proposed approaches on the main metrics on these datasets.

  2. We show empirical evidence that the signals provided by exemplar auditing (as would be presented to an end-user at a per-document level) behave as expected.

Model          AUC Macro  AUC Micro  F1 Macro  F1 Micro  P@5
LEAM           0.881      0.912      0.540     0.619     0.612
CAML           0.875      0.909      0.532     0.614     0.609
DR-CAML        0.884      0.916      0.576     0.633     0.618
CNN            0.910      0.935      0.586     0.655     0.652
CNN+mmc        0.913      0.937      0.598     0.663     0.654
CNN+mmc+ExA    0.913      0.937      0.591     0.658     0.652
Table 1: MIMIC-III test set results on the top 50 labels. The CAML and DR-CAML model results are as reported in Mullenbach et al. (2018). (bolded column) is the metric used to tune the models on the development set.
Model               AUC Macro  AUC Micro  F1 Macro  F1 Micro  P@8    P@15
CAML                0.895      0.986      0.088     0.539     0.709  0.561
DR-CAML             0.897      0.985      0.086     0.529     0.690  0.548
CNN+full            0.806      0.972      0.035     0.447     0.691  0.531
CNN+full+mmc        0.790      0.969      0.040     0.467     0.697  0.538
CNN+full+mmc+ExA    0.790      0.969      0.034     0.454     0.696  0.537
Table 2: MIMIC-III test set results on all 8921 labels. The CAML and DR-CAML model results are as reported in Mullenbach et al. (2018). (bolded column) is the metric used to tune the models on the development set.

With regard to (1), Table 1 displays the main results for the top 50 labels subset of the data. Of note is that on this subset, the benefits of the previously proposed attention mechanisms (CAML and DR-CAML) and label embedding approaches (LEAM) are within parameter variation of the base CNN model.

In the top 50 labels set, the addition of the min-max loss (+mmc) does not lead to a real difference in the primary metric of interest (). However, the key advantage of using this loss is that it does not degrade these document-level scores, but it does encourage sparsity in the scores at the token-level, which can be helpful when visualizing the output. This can be useful in practice when the approach is used as an annotation support tool.

The analogous results on the full set are shown in Table 2. In this case, the training of CAML appears to have found a particularly effective setting in the parameter space. In general, we found training on this full set to be very sensitive to minor changes in learning parameters (optimizer, learning rates, dropout probability, etc.), perhaps owing to the very long tail of infrequently occurring labels. Nonetheless, we find the effectiveness of CNN+full+mmc in terms of (the metric we tuned against on the development set) to be between that of DR-CAML and CAML, and to be competitive for all practical purposes. (Along these lines, note, too, that the relative effectiveness of DR-CAML and CAML flips across the top 50 subset and the full label set.)

                  Macro                        Micro
Model             Precision  Recall  F1        Precision  Recall  F1
CNN+mmc           0.704      0.520   0.598     0.765      0.586   0.663
CNN+mmc+ExA       0.705      0.508   0.591     0.769      0.575   0.658
CNN+mmc+onlyDB    0.715      0.480   0.574     0.777      0.548   0.643
CNN+mmc+ExADR     0.712      0.432   0.538     0.784      0.501   0.611
Table 3: MIMIC-III test set results on the top 50 labels with a breakdown of precision and recall with and without the various exemplar auditing decision rules. The precision columns are highlighted for discussion in the main text.
                       Macro                        Micro
Model                  Precision  Recall  F1        Precision  Recall  F1
CNN+full+mmc           0.062      0.029   0.040     0.727      0.343   0.467
CNN+full+mmc+ExA       0.057      0.025   0.034     0.760      0.324   0.454
CNN+full+mmc+onlyDB    0.055      0.022   0.031     0.777      0.294   0.426
CNN+full+mmc+ExADR     0.055      0.020   0.029     0.785      0.263   0.395
Table 4: MIMIC-III test set results on all 8921 labels with a breakdown of precision and recall with and without the various exemplar auditing decision rules. The precision columns are highlighted for discussion in the main text.
                        Softmax Threshold for
Model                   0.0          0.2          0.4          0.6
CNN+mmc+ExADR+t         0.784/0.501  0.784/0.501  0.871/0.208  0.984/0.018
CNN+full+mmc+ExADR+t    0.785/0.263  0.785/0.263  0.815/0.203  0.885/0.071
Table 5: Micro Precision/Recall on the MIMIC-III test set on the top 50 labels subsets and all 8921 labels, only admitting a label prediction based on ExADR and if the corresponding softmax distance probability is greater than the specified threshold.

With regard to analysis point (2) above, we see in Tables 1 and 2 that the combination of the query scores and the database scores (with +ExA) does not significantly change the and scores. This provides evidence that the query scores and the exemplar scores (weighted by relative distances) tend to be in the same direction. We examine this further in Table 3 for the top 50 labels subset and in Table 4 for the full label set where we break down the precision and recall values used to calculate the scores. We see that the precision of the predictions is generally retained when combining the query and database scores, and in fact, the Micro precision rises by around 3 points for CNN+full+mmc+ExA relative to CNN+full+mmc.

Interestingly, if we throw away the model prediction of the query and replace it with the database prediction associated with the exemplar, the scores suffer only a modest decline. This decline results from a drop in recall and is in fact accompanied by a rise in Micro precision, as we see in Tables 3 and 4 for +onlyDB. Note that this is without constraining or censoring the choice of , and so the relative stability of this change is reflective of most exemplar vectors being associated with predictions in the same direction as the query. We also see that the hard decision rule of +ExADR tends to push up Micro precision. Most selected database exemplars are vectors, which is why we see only a modest (and not catastrophic) decline in recall with the +ExADR decision rule.

Exceptions to some of the above patterns are with the Macro metrics, which have the effect of heavily weighting (in relative terms) low-frequency labels. As with previous work, the values are sufficiently low in the full set (resulting from relatively rare correct predictions on the long tail of labels that occur infrequently in training) that the observed differences may not be meaningfully different in practice, and are thus difficult to draw conclusions from, beyond concluding that none of these models are particularly effective on rare labels, at least in the aggregate.

Footnote 11: As Mullenbach et al. (2018) note, “A hypothetical system that performs perfectly on the 500 most common labels, and ignores all others, would achieve a Macro of 0.052 and a Micro of 0.84.”
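The effect of rare labels on the Macro metrics can be illustrated with a small sketch. It computes micro-averaged and macro-averaged precision, recall, and F1 from per-label counts. Two assumptions are made here: Macro F1 is taken as the harmonic mean of the macro-averaged precision and recall (which is consistent with the values in Table 3, where 0.704 and 0.520 give approximately 0.598), and a ratio with a zero denominator is treated as 0. The counts in the usage note are invented for illustration.

```python
def safe_div(numerator, denominator):
    # Assumption: an undefined ratio (zero denominator) counts as 0.
    return numerator / denominator if denominator else 0.0


def harmonic_mean(p, r):
    return safe_div(2 * p * r, p + r)


def micro_prf(counts):
    """counts: list of (tp, fp, fn) tuples, one per label; pool counts first."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    p, r = safe_div(tp, tp + fp), safe_div(tp, tp + fn)
    return p, r, harmonic_mean(p, r)


def macro_prf(counts):
    """Average per-label precision and recall, so every label weighs the same."""
    n = len(counts)
    p = sum(safe_div(tp, tp + fp) for tp, fp, fn in counts) / n
    r = sum(safe_div(tp, tp + fn) for tp, fp, fn in counts) / n
    return p, r, harmonic_mean(p, r)
```

For example, with one frequent label at `(90, 10, 10)` and nine rare labels that are each missed once, `(0, 0, 1)`, the Micro F1 is 180/209 (about 0.86) while the Macro F1 is only 0.09, since the nine rare labels each contribute a zero to the macro averages.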

It is also useful to have an empirical sense of whether the relative distances behave as expected. In particular, the relative distance associated with the vectors should contain information regarding the reliability of the prediction. If the vector is close to the query vector , and at the same time, is comparatively far from each of , , and , we would expect the query prediction to be more likely to be right than if the distance to the vector is farther in relative terms to the distances to , , and . A clean way of evaluating this is to simply vary a threshold on the normalized softmax distance for the vectors. We show results in Table 5 in which we only admit a label prediction if is and the normalized softmax probability is greater than a given threshold. We label these results with +ExADR+t. (Recall that the softmax probability is derived from a negative distance, so a higher probability implies closer similarity.) Here we show a relatively coarse grid search, but the pattern is clear: As the query and increase in similarity in relative terms to the distances to the false negative, false positive, and true negative vectors, the Micro precision of the predictions rises.
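The thresholded rule can be sketched as follows, under stated assumptions: the normalized softmax distance is taken to be a plain softmax over the negated distances from the query vector to its nearest true positive, false negative, false positive, and true negative exemplars, with no temperature or other scaling (any such scaling would be a detail not shown here). The function names are hypothetical, and the exemplar kind that must be nearest is passed in as `required_kind`.

```python
import math


def normalized_softmax_distance(distances):
    """Softmax over negated distances; a closer exemplar gets a higher probability.

    distances: dict mapping an exemplar kind ("TP", "FN", "FP", "TN") to the
    distance between the query vector and the nearest exemplar of that kind.
    """
    negated = {kind: -d for kind, d in distances.items()}
    top = max(negated.values())  # subtract the maximum for numerical stability
    exps = {kind: math.exp(v - top) for kind, v in negated.items()}
    total = sum(exps.values())
    return {kind: e / total for kind, e in exps.items()}


def admit_prediction(distances, required_kind, threshold):
    """Admit only if the nearest exemplar is of the required kind and its
    normalized softmax probability exceeds the threshold."""
    probs = normalized_softmax_distance(distances)
    nearest = max(probs, key=probs.get)
    return nearest == required_kind and probs[nearest] > threshold
```

If the four distances are equal, each probability is 0.25, and the nearest of four candidates always has a probability of at least 0.25. Under the assumptions of this sketch, a threshold of 0.2 can therefore never reject a prediction, which would be consistent with the identical entries at thresholds 0.0 and 0.2 in Table 5.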

Appendix A contains output from three random sentences from the test set for the CNN+full+mmc, along with the exemplar vectors. Given the noisy nature of the data, the long documents, and high-dimensional label set, it is at times striking how sharp the feature detections are. Note that although we are not using contextualized embeddings as input to the CNN, each vector and has a relatively expansive view of the document. This results from two aspects of how the exemplar vectors are constructed: the averaging over the filters, which means that a token with an application of a filter width of 5 sees filter applications over a total of 9 tokens, and the concatenation of the global maxpool vector.
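The figure of 9 tokens follows from counting the filter applications that include a given token. A small sketch, assuming unpadded one-dimensional convolutions with stride 1 (the function name and the handling of document boundaries are illustrative assumptions):

```python
def tokens_seen(token_index, filter_width, num_tokens):
    """Return the indices of all tokens covered by any filter application
    that includes the token at token_index (stride 1, no padding)."""
    first_start = max(0, token_index - filter_width + 1)
    last_start = min(token_index, num_tokens - filter_width)
    covered = set()
    for start in range(first_start, last_start + 1):
        covered.update(range(start, start + filter_width))
    return sorted(covered)
```

Away from the document boundaries, a filter of width w gives a span of 2(w - 1) + 1 tokens, which is 9 for w = 5.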

The above results encapsulate the crux of the approach: We can focus on representative local features (which are human interpretable, or at least human manageable as a means of pivots for organizing one’s analysis) and exploit relative distances between summary vectors of true positive, false negative, false positive, and true negative elements of the training set in order to aid in the analysis of unidentifiable neural models and associated high-dimensional data. It is a surprisingly parsimonious, yet powerful idea that we expect will have a number of real-world applications.

5 Limitations

The primary limitation of the approach is that, relative to performing a standard forward pass with a CNN classifier, it can be relatively computationally expensive to search for the exemplars over a large database for many queries. However, we found in practice that the calculation of the distances remains practical provided the Euclidean distances are calculated on a GPU, noting that the exemplar search itself is embarrassingly parallel, which allows for straightforward splitting across multiple GPUs.
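The parallel structure of the search can be sketched in a few lines. The version below is a CPU-only illustration in pure Python, not the GPU implementation referred to above: the database is split into contiguous shards, each shard is searched independently for the nearest exemplar by Euclidean distance, and the per-shard results are merged by taking the minimum. The function names and the shard layout are assumptions, and a non-empty database is assumed.

```python
import math


def shard_bounds(num_items, num_shards):
    """Split range(num_items) into contiguous (start, end) shards."""
    size = max(1, math.ceil(num_items / num_shards))
    return [(s, min(s + size, num_items)) for s in range(0, num_items, size)]


def nearest_in_shard(query, database, start, end):
    """Exhaustive search of one shard; returns (distance, index).

    Each shard is independent, so in practice each call could run on a
    separate device or worker.
    """
    best = (math.inf, -1)
    for i in range(start, end):
        d = math.dist(query, database[i])  # Euclidean distance
        if d < best[0]:
            best = (d, i)
    return best


def nearest_exemplar(query, database, num_shards=2):
    """Merge the per-shard winners by taking the smallest distance."""
    partial = [nearest_in_shard(query, database, s, e)
               for s, e in shard_bounds(len(database), num_shards)]
    return min(partial)
```

Since only one (distance, index) pair per shard needs to be communicated, the merge step is cheap relative to the distance calculations themselves.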

It is also important to reiterate (which should be clear from above) that utilizing exemplar auditing does not automagically make a decent classifier a significantly better classifier. As we show above, the various pieces of information can be used (if so desired) as a means of constraining predictions to boost precision (along the precision-recall curve), but we would not typically expect huge improvement swings in overall model prediction effectiveness in doing so over data similar to that seen in training (which in that way, would not be faithful to the underlying model, in any case), with the notable possible exception of the case of database exemplars over data not seen in explicit training. Rather, the approach is a means of providing a human with the key pieces of (likely applicable) information, among large amounts of possible information, to assess a model decision and its associated data. With that information, a human user can more effectively use the model as an assisting tool in decision-making.

6 Related Work

In NLP, many surface-level interpretation methods have been proposed, often based on attention mechanisms. Belinkov and Glass (2019) provide a recent overview. Here, we premise our approach for relating document-level scores to token-level scores on the previous work of Schmaltz (2019), which demonstrated that for binary zero-shot grammatical error detection (a sequence labeling task for which ground-truth token-level labels are available), a decomposition of a CNN was at least competitive with previously proposed attention-based approaches. We extend the approach to the multi-label setting. High-dimensional multi-label classification opens a number of possibilities for adjacent tasks; in future work, we plan to explore the utility of this approach in regression settings via discretizing real-valued output.

The exemplar auditing concept of relating a fine-grained feature of a test instance back to a feature in training and utilizing relative distances to analyze a model and its data is a rather different notion of model interpretation than is typically considered in the attention-mechanism literature in NLP, and we think it is an important avenue for further work. This notion of relative distances bears some resemblance to (but is largely orthogonal to) the large literature of Bayesian and frequentist approaches for calculating decision bounds. Card et al. (2019) propose a conformal-based method to describe a model prediction in terms of a weighted sum of training instances, where the measure of non-conformity is a distance between the final hidden state of a neural classifier (prior to the softmax). As an important distinction, their proposed approach relates predictions at the instance level (e.g., at the document level), whereas the machinery presented here provides a means of dissecting model predictions at the fine-grained feature level, which is critical for domains such as text, particularly when the documents are very long. Either the distance to the exemplars, or in fact, the softmax distribution over exemplar distances for an instance, itself, could be used as part of a non-conformity score in a conformal framework, which we leave to future work.

The prototypical networks of Snell et al. (2017) can be used for zero-shot and one-shot classification by assigning an instance to a cluster based on a softmax over distances to vectors representing the classes, which are means over the instances of the classes. With exemplar auditing combined with [multi-]BLADE, we instead retain granularity over the fine-grained features of the label classes (and document classifier) as represented by the exemplar vectors, which is a key difference. Additionally, our approach produces a softmax distribution over distances to the nearest true positive, false negative, false positive, and true negative representatives of a feature for a particular document for a particular label, which, to our knowledge, has not been previously explored. As we show above, these relative distances provide informative signal as to the reliability of the prediction.

7 Conclusion

We have examined an approach for organizing the analysis of a multi-label classifier and its associated data, using a CNN as the final layer of a network. Via a sparsity-encouraging loss, we relate document-level scores to token-level scores and then we unwind the CNN to produce representative vectors for the tokens. We demonstrate that distances between these vectors can be exploited to establish a mapping between training and test features. We find that distances to nearest true positive, false negative, false positive, and true negative representative vectors from the training set provide a useful and intuitive means of analyzing the data and model. We demonstrate the viability of the approach on a multi-label classification task of electronic health record data, and hypothesize that it will lend itself to a number of additional practical applications in medicine and science.


Appendix A: MIMIC-III Output Samples

In Tables 6 to 8 we illustrate the visualization of the token-level scores and the associated exemplars from the training set with 3 random documents from the test set for the top 50 labels subset. The exemplar is often, but not always, a lexical match, and sometimes the surrounding tokens can shed light on the connection between a particular token and its associated exemplar, as with Label 272.0 in Table 6, where “high” (the token of focus) precedes “cholesterol” and the exemplar is associated with “hypercholesteremia”.

Often the nearest exemplars are vectors. In Table 7 with Label 39.95 we see a relatively rare example in which the nearest exemplar vector was a vector. On inspection, we see that is associated with continuous veno-venous hemofiltration (CVVH), is associated with continuous veno-venous hemodialysis (CVVHD), and with hemodialysis. The next token after the token associated with the vector is in fact “cvvhd”, which helps suggest why it was selected. In practice, this would be a case that would be singled out for further review by a human annotator, who would then see that the softmax distribution was relatively diffuse, and make a final decision based on these examples (and the context of the original query).

In Table 8 with Label V58.61, we see an example where the query prediction is a false positive and the nearest associated database vector () is also a false positive.

To a non-specialist, in cases where the model diverges from the ground truth label but the nearest exemplar is a vector, it is not always clear whether the source of the discrepancy is an idiosyncrasy of ICD-9 coding or noise in the labeling. In some cases, as with Label 96.04 in Table 7, the difference is apparently due to specificity in the choice of a disease or procedure (here, with regard to “intubation”, using Label 96.71 instead).

It does seem that closer relative distances (here, a higher value of the “Normalized Softmax Distance” included in the tables implies closer similarity) are associated with more reliable coupling between and , which is consistent with the empirical results in Table 5. Note that these normalized distances are values between 0 and 1, and if the distances between and each of , , , and were the same, then these distance values would be uniformly 0.25. In practice, if a given prediction was not associated with , or associated with but with a low relative distance, it could be (tagged in particular to be) shown to a human for further review.

Given that the documents are long, the text is noisy, and the labels are relatively high-dimensional, the output does seem to suggest that such an approach is a useful additional tool for the analysis toolbox.

Test Document 225
Label 401.9 unspecified essential hypertension; Label Frequency in training: 3233
CNN+mmc ...and he went to a osh er where a head ct showed a subdural hematoma patient reports taking two aspirin on the day of admission past medical history htn[401.9] high cholesterol social history lawyer lives with...
Exemplar [401.9] Normalized Softmax Distance: 0.604
Exemplar [401.9], Train Doc. 2201 ...but this was not covered by insurance and hence he does not take it past medical history htn[401.9] chol bph right renal cyst social history he is...
Label 272.0 pure hypercholesterolemia; Label Frequency in training: 926
CNN+mmc ...past medical history htn high[272.0] cholesterol social history lawyer lives with...
Exemplar [272.0] Normalized Softmax Distance: 0.384
Exemplar [272.0], Train Doc. 6565 ...and the hematuria has since resolved he has been experiencing insomnia for the past month past medical history htn hypercholesteremia[272.0] etoh daily use gout...
Table 6: Exemplar auditing output for the first of three random documents from the test set for the top 50 labels subset for the CNN+mmc model. In this case, both ground truth labels are correctly predicted and all of the exemplar vectors from training correspond to (and the remaining vectors are not displayed). These are short snippets of longer documents. We further truncate subsequent instances of the same document (token scores are label specific per document). We color highlights associated with correct predictions at the document-level in blue, and those associated with incorrect predictions in red, but note that ground-truth token-level labels are not available here. Labels, associated descriptions, and label frequencies in training are provided. The tokens associated with and are marked with brackets (with the ICD-9 code), and the identity (TP, FN, FP, etc.) of is specified along with the normalized softmax distance (where greater values imply closer similarity).
Test Document 314
Label 96.04 insertion of endotracheal tube; Label Frequency in training: 1581
CNN+mmc ...chief complaint depakote overdose major surgical or invasive procedure intubation[96.04] hemodialysis femoral and jugular central line placements history of present illness the patient is a year old female with a reported history of alcohol abuse and bipolar disorder who...
Exemplar [96.04] Normalized Softmax Distance: 0.328
Exemplar [96.04], Train Doc. 2981 ...chief complaint attempted suicide major surgical or invasive procedure intubation[96.04] wrist laceration repair history of present illness year old man presented to the hospital1 ed in the setting of a reported suicide attempt via laceration to his right wrist...
Label 96.71 continuous invasive mechanical ventilation for less than 96 consecutive hours; Label Frequency in training: 1395
CNN+mmc ...she developed progressive confusion to the point of somlanence she was reported to vomit she was subsequently intubated[96.71] for airway protection and transfered to the hospital1 ed for further manegment...
Exemplar [96.71] Normalized Softmax Distance: 0.391
Exemplar [96.71], Train Doc. 1338 ...this is a yo f s p suicide attempt with cymbalta klonopin alcohol and cyproheptadine now s p extubation and medically stable she was initially intubated known firstname ed for somnolence she was extubated on without further events...
Label 39.95 hemodialysis; Label Frequency in training: 549
CNN+mmc ...chief complaint depakote overdose major surgical or invasive procedure intubation hemodialysis[39.95] femoral and jugular central line placements history of present illness the patient is a year old female with a reported history of alcohol abuse and bipolar disorder who...
Exemplar [39.95] Normalized Softmax Distance: 0.263
Exemplar [39.95], Train Doc. 1379 ...invasive procedure leukophareisis cvvh[39.95] history of present illness...
Exemplar [39.95] Normalized Softmax Distance: 0.150
Exemplar [39.95], Train Doc. 5439 ...was not resumed gu uop augmented with **cvvhd**[39.95] perioperatively from date range creatinine stablilized...
Exemplar [39.95] Normalized Softmax Distance: 0.206
Exemplar [39.95], Train Doc. 846 ...the patient was noticed to have decreased mental status after **hemodialysis**[39.95] yesterday which worsened on the day of presentation...
Exemplar [39.95] Normalized Softmax Distance: 0.381
Exemplar [39.95], Train Doc. 517 ...major surgical or invasive procedure intubation r ij[39.95] cvvhd history of present illness...
Label 311 depressive disorder, not elsewhere classified; Label Frequency in training: 493
CNN+mmc ...and folic acid for vitamin supplementation depression[311] she readily admitted to her overdose being an attempt at suicide and expressed considerable remorse in this action...
Exemplar [311] Normalized Softmax Distance: 0.374
Exemplar [311], Train Doc. 2981 ...year old man with polysubstance dependence on suboxone and depression[311] admitted after suicide attempt...
Table 7: Exemplar auditing output for the second of three random documents from the test set for the top 50 labels subset for the CNN+mmc model, with similar formatting as Table 6. In this case, of the 5 ground-truth labels (96.71,285.9,276.2,39.95,305.1), two were correctly predicted, and the remainder were false negatives (not shown). Additionally, two predictions were false positives. For Label 39.95, we display all 4 exemplar vectors, as this is a relatively rare case in which was selected. Symbols ** are used for exemplars associated with or .
Test Document 316
Label 401.9 unspecified essential hypertension; Label Frequency in training: 3233
CNN+mmc ...has never been diagnosed with dementia past medical history pmh atrial fibrillation on coumadin htn[401.9] acoustic neuroma resected years ago on the right hyperthyroidism now hypothyroid after iodine therapy...
Exemplar [401.9] Normalized Softmax Distance: 0.420
Exemplar [401.9], Train Doc. 951 ...collar was removed at outside facility past medical history iddm a fib on coumadin htn[401.9] mild aortic stenosis cva in past history of old small reportedly lacunar infarcts etoh abuse...
Label 427.31 atrial fibrillation; Label Frequency in training: 1992
CNN+mmc ...dementia past medical history pmh atrial fibrillation[427.31] on coumadin htn acoustic...
Exemplar [427.31] Normalized Softmax Distance: 0.540
Exemplar [427.31], Train Doc. 2064 ...pt is a age over yo female with atrial fibrillation[427.31] on coumadin htn and csf who fell at her nursing home...
Label 599.0 urinary tract infection, site not specified; Label Frequency in training: 1067
CNN+mmc ...she was treated with cipro for days for an e coli uti[599.0] on hd she removed her foley catheter and there was no evidence of trauma...
Exemplar [599.0] Normalized Softmax Distance: 0.304
Exemplar [599.0], Train Doc. 5784 ...the patient s primary oncologist was notified of her admission uti[599.0] pt had postive urine culture after leukocytosis...
Exemplar [599.0] Normalized Softmax Distance: 0.367
Exemplar [599.0], Train Doc. 4599 ...wenckebach rhythm delirium and an e coli **uti**[599.0] treated with levofloxacin and then bactrim with the concern...
Label 244.9 unspecified acquired hypothyroidism; Label Frequency in training: 761
CNN+mmc ...resected years ago on the right hyperthyroidism now hypothyroid[244.9] after iodine therapy years ago macular degeneration left sided hearing loss...
Exemplar [244.9] Normalized Softmax Distance: 0.308
Exemplar [244.9], Train Doc. 5784 ...ulcer disease colonic adenoma goiter with hypothyroidism[244.9] osteoporosis osteoarthritis...
Label V58.61 long-term (current) use of anticoagulants; Label Frequency in training: 604
CNN+mmc ...atrial fibrillation on[V58.61] coumadin htn...
Exemplar [V58.61] Normalized Softmax Distance: 0.327
Exemplar [V58.61], Train Doc. 5702 ...atrial fibrillation on[V58.61] coumadin osteoarthritis s p hemithyroidectomy...
Exemplar [V58.61] Normalized Softmax Distance: 0.394
Exemplar [V58.61], Train Doc. 951 ...iddm a fib **on**[V58.61] coumadin htn mild aortic stenosis cva...
Table 8: Exemplar auditing output for the third of three random documents from the test set for the top 50 labels subset for the CNN+mmc model, with similar formatting as Tables 6 and 7. In this case, of the 3 ground-truth labels (401.9,427.31,599.0), all three were correctly predicted, but there were also two false positives. For Labels 599.0 and V58.61, the nearest exemplars were vectors.