Annotating Social Determinants of Health Using Active Learning, and Characterizing Determinants Using Neural Event Extraction

by Kevin Lybarger, et al.
University of Washington

Social determinants of health (SDOH) affect health outcomes, and knowledge of SDOH can inform clinical decision-making. Automatically extracting SDOH information from clinical text requires data-driven information extraction models trained on annotated corpora that are heterogeneous and frequently include critical SDOH. This work presents a new corpus with SDOH annotations, a novel active learning framework, and the first extraction results on the new corpus. The Social History Annotation Corpus (SHAC) includes 4,480 social history sections with detailed annotation for 12 SDOH characterizing the status, extent, and temporal information of 18K distinct events. We introduce a novel active learning framework that selects samples for annotation using a surrogate text classification task as a proxy for a more complex event extraction task. The active learning framework successfully increases the frequency of health risk factors and improves automatic detection of these events over undirected annotation. An event extraction model trained on SHAC achieves high extraction performance for substance use status (0.82-0.93 F1), employment status (0.81-0.86 F1), and living status type (0.81-0.93 F1) on data from three institutions.




1 Introduction

US life expectancy is decreasing [murphy2018mortality], even as medical care advances. Decreasing life expectancy may be partly attributable to deteriorating social determinants of health (SDOH) [daniel2018addressing; himmelstein2018determined]. For example, substance abuse (including alcohol, drug, and tobacco use) is increasingly recognized as a key factor for morbidity and mortality [centers2005annual; world2019global; degenhardt2012extent]. More Americans are living alone, leading to increased social isolation and negative health outcomes [cacioppo2003social]. Employment and occupation impact income, societal status, hazards encountered, and health [clougherty2010work]. Understanding SDOH, including behaviors influenced by these social factors, can inform clinical decision-making [blizinsky2018leveraging].

SDOH are characterized in the Electronic Health Record through structured data and unstructured clinical text; however, clinical text captures detailed descriptions of these determinants, beyond the representation in structured data. This text-encoded information must be automatically extracted for secondary use applications, like large-scale retrospective studies and clinical decision support systems. The automatically extracted data can augment the available structured data to create a more comprehensive patient representation in these downstream applications [demner2009can; jensen2012mining].

Leveraging the social history information in clinical text requires high-quality annotated data to create machine learning-based information extraction models. This work presents a new annotated clinical corpus, referred to as the Social History Annotation Corpus (SHAC). SHAC comprises 4,480 social history sections with detailed annotations for 12 critical SDOH. SHAC utilizes clinical notes from MIMIC-III [johnson2016mimic] and an existing data set from the University of Washington (UW) and Harborview Medical Centers. It includes event-based annotations for more than 55K annotated spans and 18K distinct events across four note types. Samples were selected for annotation using a novel active learning framework that increases the diversity and richness of the annotations and improves extraction performance. The active learning framework uses a simplified surrogate task for assessing sample informativeness for the more complex event extraction task associated with the SHAC annotation scheme. Active selection using the surrogate task improves extraction performance for a variety of event types and attributes.

The first reported extraction results on SHAC are presented for the most frequently annotated SDOH: substance use, employment, and living status. The event extraction model combines a clinical version of BERT with a state-of-the-art neural multi-task framework. The event extraction model identifies substance use, employment, and living status events at 0.89-0.98 F1 and characterizes the status of these determinants with 0.81-0.96 F1. The annotation guidelines and source code are available online (a link will be provided if the paper is accepted).

2 Related work

2.1 SDOH Corpora

Multiple corpora with note-level SDOH annotations have been developed. For example, the i2b2 NLP Smoking Challenge introduced a publicly available corpus where tobacco use status is labeled at the note-level [uzuner2008identifying]. Gehrmann, et al. annotated MIMIC-III discharge summaries with note-level phenotype labels, including substance abuse and obesity [gehrmann2018comparing]. Feller annotated 38 different SDOH at the note-level [feller2018towards].

Annotated corpora with more detailed SDOH annotations describing status, extent, temporal information, and other characteristics also exist. For example, Wang, et al. introduced a corpus with detailed substance use annotations for 691 clinical notes [wang2015automated], and Yetisgen and Vanderwende created detailed annotations for 13 SDOH in a publicly available corpus of 364 notes [Yetisgen_2017_substance]. Both Wang [wang2015automated] and Yetisgen [Yetisgen_2017_substance] utilized artificial notes from the MTSamples website, which were created by human transcriptionists, as opposed to real notes created by clinicians.

To achieve high SDOH extraction performance that generalizes across clinicians, institutions, and specialties, annotated corpora must be large and diverse. Unfortunately, currently available corpora with SDOH annotations are lacking in annotation detail, public availability, size, and/or heterogeneity. SHAC addresses limitations of existing corpora by providing a relatively large, heterogeneous corpus with high-quality, detailed SDOH annotations.

2.2 Active Learning

In annotation projects, the available unlabeled data is often significantly larger than the annotation budget. Randomly selecting samples for annotation is suboptimal from a model learning perspective, as samples vary in their usefulness, particularly when the phenomena of interest may be infrequent. Active learning identifies samples for annotation that maximize model learning [cohn1994improving; cohn1996active]. Samples are selected using a query function that scores sample informativeness, representativeness, and/or diversity [shen2004multi; yang2015multi; du2017exploring]. Informativeness describes the potential for a sample to reduce classification uncertainty. The literature varies in the usage of the terms “representativeness” and “diversity.” Here, representativeness describes the degree to which a sample describes the structure of the data, and diversity characterizes the variation in the samples selected.

Active learning is well-established for classification tasks, where a single label is predicted for each sample. Multiple studies have applied active learning to text classification tasks, where a sample is a sentence or a document. Sample informativeness is derived from classification uncertainty scores, such as maximizing entropy [Wu2013graph] or minimizing a support vector machine margin [TongKoller2002; PARK2015efficient]. Du, et al. assess diversity based on classifier posterior distributions [du2017exploring], and Wu, et al. assess diversity and representativeness based on sample similarity within the observation space [Wu2013graph].

Approaches for applying active learning to sequence tagging problems are also well-established [chen2015a_study; chen2017an_active; Kholghi2017clnical; li2019efficient; Gao2019recognizing; Shelmanov2019active]. Although predictions are made at the token-level, sample selection is typically performed at the sentence or document-level. Representativeness and/or diversity are often assessed by calculating sentence similarity metrics in the observation space [chen2015a_study; chen2017an_active; Kholghi2017clnical; Gao2019recognizing]. Sequence-level uncertainty scores are calculated by various measures, like normalized prediction sequence likelihood and minimum token-level confidence. In the clinical and biomedical domain, uncertainty scores are generated with conditional random field (CRF) models [chen2015a_study; chen2017an_active; Kholghi2017clnical; li2019efficient; Gao2019recognizing] or a neural tagger based on contextualized embeddings from ELMo and BERT [Shelmanov2019active].

Active learning is less explored in relation and event extraction tasks, where triggers (heads), arguments, and/or relations are annotated. The predictions are more complex, involving labeling and linking spans of text. Maldonado, et al. apply active learning to a clinical relation extraction task, selecting samples using the average entropy of all predicted phenomena as an uncertainty score [maldonado2017active]. More recently, Maldonado, et al. explore active learning in a medical concept and relation extraction task [MALDONADO2019103265active]. In lieu of a heuristic query function, an optimal selection strategy is learned from data with strong and weakly supervised labels, including 1,000 electroencephalogram (EEG) reports with automatic annotations generated by existing extraction models.

SHAC is annotated using an event-based structure, where SDOH are characterized through multiple argument types. These argument types are not equally important for secondary use applications, and the entropy of different determinant-argument combinations may differ significantly. Without sufficient annotated data to learn an optimal selection strategy, we use a simplified text classification task as a surrogate for assessing sample uncertainty, to prevent undersampling the critical phenomena. We hypothesized that the surrogate task would improve extraction performance in the more complex event extraction task and validated the hypothesis with experiments on SHAC data.

3 Materials

3.1 Data

This work utilized two clinical data sets without SDOH annotations: MIMIC-III and the UW Dataset. MIMIC-III (referred to here as MIMIC) is a publicly available, deidentified health database for over 40K critical care patients at Beth Israel Deaconess Medical Center from 2001 to 2012 [johnson2016mimic]. MIMIC contains clinical notes, diagnosis codes, and other data. This work utilized 60K MIMIC discharge summaries. The UW Dataset is an existing clinical data set from the UW and Harborview Medical Centers generated between 2008 and 2019. This work utilized 83K emergency department, 22K admit, 8K progress, and 5K discharge summary notes from the UW Dataset. An existing corpus with SDOH annotations, YVnotes, was used for model training during active learning [Yetisgen_2017_substance].

3.2 Annotation Scheme

We created detailed annotation guidelines for 12 SDOH (referred to here as event types), including substance use (alcohol, drug, and tobacco), physical activity, employment, insurance, living status, sexual orientation, gender identity, country of origin, race, and environmental exposure. Each event type is annotated across multiple dimensions. Table 1 summarizes the annotation of the most frequent SHAC event types: substance use, employment, and living status. Table A1 in the Appendix contains a summary of all annotated event types.

| Event type | Argument | Label set | Span examples |
|---|---|---|---|
| Substance use (Alcohol, Drug, & Tobacco) | Status* | none, current, past | “denies,” “smokes” |
| | Duration | – | “for the past 8 years” |
| | History | – | “seven years ago” |
| | Type | – | “beer,” “cocaine” |
| | Amount | – | “2 packs,” “3 drinks” |
| | Frequency | – | “daily,” “monthly” |
| Employment | Status* | employed, unemployed, retired, on disability, student, homemaker | “works,” “unemployed” |
| | Duration | – | “for five years” |
| | History | – | “15 years ago” |
| | Type | – | “nurse,” “office work” |
| Living status | Status* | current, past, future | “lives,” “lived” |
| | Type* | alone, with family, with others, homeless | “with husband,” “alone” |
| | Duration | – | “for the past 6 months” |
| | History | – | “until a month ago” |

Table 1: Annotation guideline summary for the most frequent event types. * indicates the argument is required; – indicates a span-only argument with no label set.

Figure 1: BRAT annotation example

SDOH are annotated as events using the BRAT rapid annotation tool [stenetorp2012brat], where each event consists of a trigger and assigned arguments. Figure 1 is a BRAT annotation example, describing a patient’s employment and substance use. The trigger indicates the event type (e.g. Employment or Tobacco), and arguments describe the event. Labeled arguments, like Status, include both an annotated span and a multi-class label. Span-only arguments, like Duration or History, include an annotated span without an additional label.

3.3 Annotation Cycle

Social history sections, referred to here as samples, were extracted from MIMIC and the UW Dataset, using pattern matching to identify section headings (alphanumeric, forward slash, backslash, ampersand, or white space characters followed by a colon). SHAC includes train, development, and test sets. Samples for the train set were randomly and actively selected. Training samples were first randomly selected for initial model training in active learning; the initial model was then used to actively select samples, biasing the training set towards diverse samples that frequently contain the phenomena of interest. All development and test samples were randomly selected to approximate the true distribution. Samples were annotated by four medical students through 12 rounds of annotation (8 randomly selected and 4 actively selected). Table A2 in the Appendix describes each round of annotation. The first two rounds were randomly sampled and double-annotated, to assess inter-annotator agreement. After the initial annotation round, the annotation guidelines were revised, and the initial annotations were updated.
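The heading-based section extraction described above can be sketched as follows. This is a minimal illustration, assuming headings begin a line; the regular expression, the function name `extract_section`, and the example note are hypothetical and are not taken from the SHAC source code.

```python
import re
from typing import Optional

# Assumed heading pattern: a run of alphanumeric, "/", "\", "&", or
# white space characters at the start of a line, followed by a colon.
HEADING = re.compile(r"^([A-Za-z0-9/\\& \t]+):", re.MULTILINE)


def extract_section(note: str, name: str = "social history") -> Optional[str]:
    """Return the text between the named heading and the next heading."""
    matches = list(HEADING.finditer(note))
    for i, match in enumerate(matches):
        if match.group(1).strip().lower() == name:
            # The section ends where the next heading starts (or at end of note).
            end = matches[i + 1].start() if i + 1 < len(matches) else len(note)
            return note[match.end():end].strip()
    return None


note = (
    "HPI: chest pain\n"
    "Social History: Lives alone. Denies tobacco.\n"
    "Family History: none"
)
section = extract_section(note)
```

A pattern this permissive also matches ordinary lines that happen to contain a leading phrase and a colon, so a real pipeline would likely need a list of known heading names.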

3.4 Evaluation and Annotation Scoring

We treat event annotation and extraction as a slot filling task, as this is most relevant to secondary use applications. As such, there can be multiple equivalent span annotations. Figure 2 presents the same sentence annotated by two annotators (labeled A and B), along with the populated slots.

Figure 2: Annotation examples describing event extraction as a slot filling task

Both annotators labeled two Drug events: Event 1 and Event 2. Event 1 describes past intravenous drug use (IVDU), and Event 2 describes current cocaine use. Event 1 is annotated identically by both annotators. However, there are differences in the annotation spans of Event 2, specifically for the Trigger (“cocaine” versus “cocaine use”) and Status (“use” versus “Recent”). From a slot perspective, the annotations for Event 2 are equivalent. Thus, scoring of automatic detection and annotator agreement is based on relaxed span match criteria, as described below. Trigger, argument, and argument role performance is evaluated using precision, recall, and F1, micro-averaged over the event types, argument types, and/or argument labels.

Trigger: Triggers, $T_i$, are represented by a pair (event type, $e_i$; token indices, $x_i$). For Event 2 in Figure 2, both annotators assign the event type Drug, with token indices covering “cocaine” in one annotation and “cocaine use” in the other. Triggers of the same event type, $e$, are aligned based on minimizing the distance between span centers computed from the token indices. Trigger equivalence is defined as

$$T_i \equiv T_j \quad \text{if} \quad e_i = e_j$$

Although there are two Drug events in the Figure 2 example, the two annotators' Event 2 triggers align with each other because of the overlapping spans.

Argument: Labeled arguments, $L_i$, are represented as a triple (argument type, $a_i$; token indices, $x_i$; argument label, $l_i$). For Event 2 in Figure 2, both annotators assign the argument type Status and the label current, with token indices covering “use” in one annotation and “Recent” in the other. Labeled arguments of the same argument type are aligned similarly to the triggers. For labeled arguments, the multi-class label, $l_i$, captures the salient information associated with the argument. Labeled argument equivalence is defined as

$$L_i \equiv L_j \quad \text{if} \quad (a_i = a_j) \wedge (l_i = l_j)$$

Span-only arguments, $S_i$, are represented as a pair (argument type, $a_i$; token indices, $x_i$). For Event 2 in Figure 2, the span-only argument corresponds to “cocaine.” Span-only arguments are not easily mapped to a fixed set of classes, and the identified span, $x_i$, contains the most salient argument information. Span-only arguments of the same argument type, $a$, are evaluated at the token-level (rather than the span level) to allow partial matches.

Argument role: The SHAC annotation scheme has a constrained event structure, where the relationship between a trigger and argument is a binary indicator of whether or not the argument is part of the event. Events are aligned based on the triggers, and the arguments of aligned events are compared. Argument role equivalence requires the trigger-argument pairs to be equivalent.

We evaluate annotator agreement using Cohen’s Kappa coefficient, $\kappa$ [cohen1960coefficient]. Calculating $\kappa$ for the full event structure is not informative, because the probability of random agreement is close to zero. Instead, we calculate $\kappa$ for trigger annotation in the subset of sentences with zero or one trigger for a given event type in either set of annotations, which covers most of the data. We focus on this subset of sentences because triggers for a given event type are equivalent if the annotated sentences both include one trigger of that type. We assess annotator agreement on the full event structure using F1 scores.
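The relaxed matching above can be illustrated with a short sketch. It assumes a greedy one-to-one alignment (the text states only that span-center distance is minimized) and computes token-level F1 for a single pair of spans rather than micro-averaging over a corpus. The function names and token indices are hypothetical.

```python
def span_center(indices):
    """Center of a span given its sorted token indices."""
    return (indices[0] + indices[-1]) / 2.0


def align_triggers(gold, pred):
    """Greedily pair same-type triggers by nearest span center.

    gold, pred: lists of (event_type, token_indices) pairs.
    """
    pairs, used = [], set()
    for g_type, g_idx in gold:
        best, best_dist = None, float("inf")
        for j, (p_type, p_idx) in enumerate(pred):
            if j in used or p_type != g_type:
                continue
            dist = abs(span_center(g_idx) - span_center(p_idx))
            if dist < best_dist:
                best, best_dist = j, dist
        if best is not None:
            used.add(best)
            pairs.append(((g_type, g_idx), pred[best]))
    return pairs


def token_f1(gold_idx, pred_idx):
    """Token-level F1 between two spans, allowing partial matches."""
    gold, pred = set(gold_idx), set(pred_idx)
    overlap = len(gold & pred)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical token indices for two Drug events per annotator.
annotator_a = [("Drug", [1]), ("Drug", [5])]
annotator_b = [("Drug", [1]), ("Drug", [5, 6])]
aligned = align_triggers(annotator_a, annotator_b)
```

In this example the second trigger of each annotator is paired despite the differing spans, and the pair counts as equivalent because the event types match.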

3.5 Annotation Statistics

| Source | Train | Dev | Test |
|---|---|---|---|
| MIMIC | 1316 | 188 | 376 |
| UW Dataset | 1820 | 260 | 520 |

Table 2: Corpus composition by source

SHAC consists of 4,480 annotated social history sections, including 3,136 train, 448 development, and 896 test samples (70%/10%/20% split). Table 2 presents the corpus composition by source. The SHAC training samples are 29% randomly selected and 71% actively selected. All development and test data are randomly sampled. Figure 3 presents the event type distribution. The most frequent event types are Drug, Tobacco, Alcohol, Living status, and Employment, with the remaining event types occurring infrequently.

Figure 3: Event type distribution
Figure 4: Annotator agreement for 300 doubly annotated MIMIC samples

Figure 4 presents the annotator agreement for all event types in terms of F1 score for 300 doubly annotated notes from the first two rounds of annotation. For Alcohol, Drug, Tobacco, Employment, and Living status, trigger $\kappa$ is […]. For the remaining event types, trigger $\kappa$ is […]. $\kappa$ is calculated for sentences with 0-1 events for each type ([…] sentences). The event agreement is very high, in terms of F1 and $\kappa$, indicating the annotators are consistently identifying and distinguishing between events. The argument and argument role agreement is also high for labeled arguments. The somewhat lower agreement for span-only arguments is primarily due to small differences in the start and end token spans (e.g. “construction worker” vs. “construction”).

4 Active Learning

4.1 Methods

A portion of the SHAC training samples were selected using active learning, where a sample is a social history section. Specifically, batch-mode active learning was used to facilitate coordination with human annotators through the cyclical process shown in Figure 5.

Figure 5: Active learning annotation cycle

A batch of samples, $\mathcal{B}$, was annotated and added to the labeled pool, $\mathcal{L}$. The surrogate classifier was trained on $\mathcal{L}$ and then generated uncertainty scores for the unlabeled data, $\mathcal{U}$. Using the uncertainty scores, the query function identified the next batch of samples, $\mathcal{B}$. This process was repeated until the annotation objective was met.

The query function builds on Wu and Ostendorf’s maximum batch network gain approach [Wu2013graph], maximizing the informativeness and diversity of a batch of samples, $\mathcal{B}$. Here, the query score of a sample, $x_i$, combines two terms: the uncertainty (entropy) of $x_i$ and a diversity score derived from the similarity of $x_i$ relative to $\mathcal{B}$. A weight balances the relative importance of the two scores. The objective is to maximize the batch score. We explored different forms for the uncertainty and similarity scores for this multi-label scenario. We implemented a greedy approach to selecting examples, as shown in Algorithm 1.

Input: unlabeled samples $\mathcal{U}$, batch size $k$
Output: batch of samples $\mathcal{B}$
$\mathcal{B} \leftarrow \emptyset$
while $|\mathcal{B}| < k$ do
    $x^* \leftarrow$ the sample in $\mathcal{U} \setminus \mathcal{B}$ with the highest query score relative to $\mathcal{B}$
    $\mathcal{B} \leftarrow \mathcal{B} \cup \{x^*\}$
end while
Algorithm 1: Greedy query function
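A minimal sketch of greedy batch selection follows. The exact query score did not survive in this copy of the text, so the form used here (entropy minus a weighted similarity penalty) is an assumption chosen only to illustrate the trade-off between uncertainty and diversity; `greedy_batch` and its parameters are hypothetical names.

```python
import numpy as np


def greedy_batch(entropy, sim, batch_size, weight=0.1, mode="maximum"):
    """Greedily build a batch that is both uncertain and diverse.

    entropy: (n,) uncertainty score per unlabeled sample.
    sim:     (n, n) pairwise cosine similarities between samples.
    Assumed score: entropy - weight * similarity to the current batch.
    """
    entropy = np.asarray(entropy, dtype=float)
    sim = np.asarray(sim, dtype=float)
    n = len(entropy)
    batch = []
    while len(batch) < min(batch_size, n):
        score = entropy.copy()
        if batch:
            to_batch = sim[:, batch]  # similarity to each selected sample
            if mode == "maximum":
                penalty = to_batch.max(axis=1)
            else:  # "average"
                penalty = to_batch.mean(axis=1)
            score -= weight * penalty
            score[batch] = -np.inf  # never select the same sample twice
        batch.append(int(np.argmax(score)))
    return batch


# Samples 0 and 1 are near-duplicates; sample 2 is dissimilar to both.
entropy = [1.0, 0.99, 0.5]
sim = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
diverse = greedy_batch(entropy, sim, batch_size=2, weight=1.0)
uncertain_only = greedy_batch(entropy, sim, batch_size=2, weight=0.0)
```

With a weight of zero the batch contains the two most uncertain samples even though they are duplicates; a larger weight swaps the duplicate for the dissimilar sample.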

Diversity: Sample diversity is assessed in the observation space using two different similarity metrics: average similarity and maximum similarity. The average similarity of a sample, $x_i$, relative to the current batch, $\mathcal{B}$, is defined as

$$S_{avg}(x_i, \mathcal{B}) = \frac{1}{|\mathcal{B}|} \sum_{x_j \in \mathcal{B}} \cos(x_i, x_j)$$

where $\cos(x_i, x_j)$ is the cosine similarity of samples $x_i$ and $x_j$. The maximum similarity is defined as

$$S_{max}(x_i, \mathcal{B}) = \max_{x_j \in \mathcal{B}} \cos(x_i, x_j)$$

The maximum similarity approach is a stricter condition that pushes the batch of samples farther apart in the observation space, especially with larger batch sizes. Similar to Lilleberg et al. [lilleberg2015support], unsupervised vector representations of samples were learned as the TF-IDF weighted averages of pre-trained word embeddings. Word embeddings were created using the word2vec skip-gram model [Mikolov_2013_word2vec] and trained on the entirety of the MIMIC discharge summaries (not just the social history sections). Separate TF-IDF weights were calculated for MIMIC and UW Dataset samples.

Uncertainty: Active learning query functions typically assess sample informativeness (uncertainty) using the target classification task. In this work, sample uncertainty was assessed using a simplified surrogate classification task, as a proxy for the more complex event-based annotation scheme. The SHAC annotation scheme includes some arguments (e.g. Status for Alcohol) that are more predictive of negative health outcomes than others (e.g. Type for Alcohol), and the prediction uncertainty varies across event types and arguments. To ensure the query function biases selection towards the most salient arguments, each of the five most frequent event types in SHAC was represented using the single labeled argument that is most predictive of negative health outcomes: Alcohol-Status, Drug-Status, Tobacco-Status, Employment-Status, and Living status-Type.
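The sample representations used for the diversity scores can be sketched as below, assuming a smoothed IDF and a toy embedding table. The exact TF-IDF variant is not given in the text, and the function names and two-dimensional embeddings are hypothetical.

```python
import math
from collections import Counter

import numpy as np


def tfidf_weighted_vectors(docs, embeddings):
    """Represent each document as a TF-IDF weighted average of word vectors.

    docs: list of token lists; embeddings: dict mapping token -> 1-D array.
    Uses a smoothed IDF, which is an assumption of this sketch.
    """
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    dim = len(next(iter(embeddings.values())))
    vectors = []
    for doc in docs:
        term_freq = Counter(tok for tok in doc if tok in embeddings)
        vec, total = np.zeros(dim), 0.0
        for tok, count in term_freq.items():
            idf = math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0
            weight = count * idf
            vec += weight * np.asarray(embeddings[tok], dtype=float)
            total += weight
        vectors.append(vec / total if total > 0 else vec)
    return np.vstack(vectors)


def cosine_matrix(x):
    """Pairwise cosine similarity between the rows of x."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)
    return unit @ unit.T


# Toy 2-D embeddings standing in for word2vec vectors.
emb = {
    "lives": np.array([1.0, 0.0]),
    "alone": np.array([1.0, 0.0]),
    "smokes": np.array([0.0, 1.0]),
    "daily": np.array([0.0, 1.0]),
}
docs = [["lives", "alone"], ["lives", "alone"], ["smokes", "daily"]]
similarity = cosine_matrix(tfidf_weighted_vectors(docs, emb))
```

The resulting matrix is the kind of input from which average or maximum similarity to a batch can be read off row by row.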

Figure 6: Surrogate Classifier used to assess sample uncertainty in active learning

The text classification model, Surrogate Classifier in Figure 6, was used to assess sample uncertainty. The Surrogate Classifier operates on a sample as a single sequence of tokens without line breaks. The input social history section is mapped to contextualized word embeddings using Bio+Discharge Summary BERT [LeeBioBERT2019], a version of BERT [devlin2019bert] trained on clinical text from MIMIC. The BERT output (a matrix of contextualized embeddings) feeds into a bidirectional long short-term memory (bi-LSTM) layer. The forward and backward output states of the bi-LSTM are concatenated, resulting in a matrix, $V$, with dimensions set by the sequence length and the bi-LSTM hidden size. $V$ feeds into event type and argument-specific output layers. Separate self-attention (Attn) output layers for each event type make sample-level predictions. Surrogate Classifier attention weights are calculated from a learned vector for each event type, $e$, and the concatenated forward and backward bi-LSTM hidden states, $V$.

A social history section may have multiple events for the same event type. For example, a sample may describe both previous and current tobacco use, resulting in Tobacco Status labels of past and current. An additional class, multiple, is included with the classes in Table 1 for these multi-event cases. The Surrogate Classifier generates a set of five multi-class predictions for each sample (one for each event type).

We explored two approaches to characterizing sample uncertainty. In one, sample uncertainty is the sum of the argument entropy values, similar to previous work [Yang2009effective; wu2014multi; maldonado2017active; Reyes2018evolutionary], as

$$H(x) = \sum_{e=1}^{E} H_e(x)$$

where $H_e(x)$ is the entropy of the prediction for event type $e$ and $E$ is the number of event types. The second method was motivated by the concern that summing the entropy values (referred to as “sum”) could overly bias the selection process in favor of high-entropy event types, reducing the diversity of event types. Experimentation also explored using the entropy for a single argument to represent the sample, $H_e(x)$, iterating through the five event types, $e$, as samples are drawn (referred to as “loop”). For example, Alcohol-Status entropy is used for sample 1, Drug-Status entropy is used for sample 2, and so forth, starting over with Alcohol-Status entropy for sample 6.
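The “sum” and “loop” uncertainty scores can be sketched as follows, assuming the classifier returns one probability distribution per event type. The function names and the two-event example are hypothetical; the text uses five event types.

```python
import numpy as np


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())


def sum_uncertainty(probs_by_event):
    """'sum': add the entropies of all event-type predictions for a sample."""
    return sum(entropy(p) for p in probs_by_event)


def loop_uncertainty(probs_by_event, draw_index):
    """'loop': use one event type's entropy, cycling as samples are drawn."""
    event = draw_index % len(probs_by_event)
    return entropy(probs_by_event[event])


# One sample, two event types: the first prediction is maximally uncertain
# and the second is confident.
sample = [[0.5, 0.5], [1.0, 0.0]]
total = sum_uncertainty(sample)
first_draw = loop_uncertainty(sample, draw_index=0)
second_draw = loop_uncertainty(sample, draw_index=1)
```

Under “sum” this sample always scores high because of its first event type, whereas under “loop” its score depends on which event type is active for the current draw.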

4.2 Experiments & Results

To determine the best query process early in the annotation effort, we used the first 700 annotated samples, which consist of random MIMIC samples. These samples were partitioned into two subsets. For random sampling and each active sampling configuration, 10 runs were performed:

  1. Select an initial set of training samples. Train an initial model on this set.

  2. Select additional training samples (random or active). Train a second model on the resulting training set.

  3. Evaluate the performance of the second model on held-out samples.

Active sampling experimentation included different uncertainty types (“loop” vs. “sum”), similarity types (“average” vs. “maximum”), and weight values. The hyperparameters of the Surrogate Classifier were tuned (parameter values in Table A3 of the Appendix).

| Uncertainty | Similarity | Weight | F1 |
|---|---|---|---|
| loop | average | 1.0 | 0.788* |
| loop | maximum | 0.1 | 0.776* |
| sum | average | 2.0 | 0.788* |
| sum | maximum | 0.1 | 0.794* |

Table 3: Query function tuning performance. * indicates statistical significance relative to a random baseline of 0.752 F1.

Table 3 presents the results for the best weight value for each uncertainty-similarity type combination. Performance is assessed using precision, recall, and F1-score, micro-averaged across classes and event types. All of the presented active learning configurations outperform the random baseline with statistical significance. The best configuration (uncertainty type “sum”, similarity type “maximum”, and a weight of 0.1) was used in active selection.

Active learning performance was evaluated by adding random and active samples to an initial training set. Model training included two sets: the randomly selected MIMIC samples and YVnotes. The MIMIC samples were partitioned into training and test subsets. For the first round of active selection, an initial model was trained and used to select 400 MIMIC samples. A portion of the data was withheld during training to validate the active learning approach. Hyperparameters were tuned separately (parameter values in Table A3 of the Appendix). Figure 7 presents the performance of four cases on the MIMIC test samples:

  • MIMIC-only initial: Models trained only on the randomly selected MIMIC samples.

  • initial: The initial model from the first active round, trained on the MIMIC samples and YVnotes.

  • +random: Models trained on the initial set and additional random samples.

  • +active: Models trained on the initial set and additional active samples.

For MIMIC-only initial, +random, and +active, 10 runs were performed to account for variance in model initialization. For MIMIC-only initial and +random, the training sets are fixed, as all of the available samples are used each run. For +active, the training set varies because only a subset of the actively selected samples is randomly selected each run, so sampling variance is introduced. The error bars in Figure 7 indicate the standard deviation of the F1 scores across runs.

Figure 7: Surrogate Classifier performance with random and active samples, evaluated on the MIMIC test samples.

Comparing MIMIC-only initial to initial demonstrates that including YVnotes improves performance. Adding active samples to the initial training set yields a statistically significant improvement over adding random samples, demonstrating the effectiveness of the active learning framework on the surrogate task. The effectiveness of the active learning framework on the target event extraction task is presented in the subsequent section.

Annotation included 4 rounds of active selection (see Table A2 of the Appendix for details), and the Surrogate Classifier model was retrained prior to each active round. We hypothesized the Surrogate Classifier uncertainty would bias the selection process to include more health risk factors (e.g. positive substance abuse, unemployment, being on disability, and homelessness), which tend to be more challenging to automatically extract than less risky behavior (e.g. no substance use, being employed, and living with family). Active learning successfully identified samples with richer, more detailed SDOH descriptions. Figure 8 presents the label frequency per sample (note section) for random and active samples for the entirety of SHAC.

Figure 8: Label frequency per social history section, comparing random and active sampling

The frequency of positive substance use is 83% higher in active samples than random samples, with the frequency of positive drug use 151% higher with active selection. Active sampling produced higher rates for all Employment Status labels, except retired. Descriptions of retirement tend to have low entropy, because of the reliable presence of keywords like “retired” or “retirement.” Regarding Living Status, the rate of homeless is 109% higher in active samples than random samples, and the rate of with others is 81% higher. The rate of alone is slightly lower in active samples, likely due to lower entropy associated with the limited vocabulary used to describe living alone (e.g. “alone” or “by herself”).

5 Event Detection

5.1 Methods

Figure 9: Event Extractor model

This section presents the event extraction model, Event Extractor, which predicts all the phenomena in Table 1. The Event Extractor generates sentence-level and token-level predictions that are assembled into events, similar to the SHAC annotation scheme. It builds on and generalizes our previous state-of-the-art neural multi-task extractor for substance abuse information [lybarger2018using] and is shown in Figure 9. The Event Extractor was trained to simultaneously extract substance abuse, living situation, and employment information, although the framework can be expanded to any number of event types or arguments. Similar to previous multi-task work [collobert2008unified; luo2017segment; jaques2016multi; liu2016attention; luan2018multi; harutyunyan2019multitask], the Event Extractor shares information across tasks (event types and arguments in this application).

Shared layers: Individual sentences are encoded using Bio+Discharge Summary BERT [LeeBioBERT2019], creating a matrix whose dimensions are the sentence length in tokens by the BERT embedding size. Similar to other work [kitaev2018multilingual], only the last word piece embedding for each token is used, to simplify the downstream sequence tagging. The BERT encoding feeds into a bi-LSTM. The forward and backward output states of the bi-LSTM are concatenated, resulting in a matrix with one row per token whose width is determined by the LSTM hidden size. This matrix feeds into the event type and argument-specific output layers.

Trigger: The presence of each event type is predicted using separate self-attentive binary classifiers (not present/present). Positive predictions serve as the trigger for assembling events, and the token position with the maximum attention weight serves as the trigger span. During training, an event type is considered present if the sentence contains one or more events of that type. The trigger probability for each event type is calculated by the corresponding self-attentive classifier, which is parameterized by a weight matrix, a bias vector, and a vector of weights. The trigger probabilities for all event types are concatenated to form a matrix that is used in labeled argument prediction. An event is detected if its trigger probability exceeds the detection threshold.

Labeled arguments: Labeled argument prediction is also treated as a text classification task and uses a separate self-attentive output layer for each labeled argument. The token position with the maximum attention weight serves as the argument span. The probability of each labeled argument for each event type is calculated by the corresponding output layer, which is parameterized by a weight matrix, a vector of attention weights, and a bias vector. The labeled argument probabilities are concatenated to form a matrix that is used in span-only argument detection. Experimentation included six labeled arguments: Status for Alcohol, Drug, and Tobacco; Status for Employment; and Status and Type for Living status.

Span-only arguments: Span-only arguments are predicted using linear-chain Conditional Random Field (CRF) [Lafferty_2001_conditional] output layers at the output of the bi-LSTM, which is a popular sequence tagging approach [lample2016neural; Luan_2017_scientific_info]. The bi-LSTM network learns sequential word dependencies, and the CRF learns conditional dependencies between labels. A separate CRF extracts the span-only arguments for each event type (i.e. five CRF output layers), with the bi-LSTM output and the labeled argument probabilities as input features. Sequence labels are represented using the begin-inside-outside (BIO) approach. Experimentation included 20 span-only arguments: Duration, History, Type, Amount, and Frequency for Alcohol, Drug, and Tobacco; Duration, History, and Type for Employment; and Duration and History for Living status.
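For readers who want a concrete picture of a self-attentive output layer, the following is one common formulation of attention pooling followed by a binary classifier. It is a sketch under stated assumptions, not the authors' exact equations: all symbols ($h_i$, $W_k$, $b_k$, $v_k$, $u_k$, $c_k$, $\tau$) are our own notation, and the output projection ($u_k$, $c_k$) and the sigmoid are assumptions that the text does not confirm.

```latex
% Assumed notation: h_i is the bi-LSTM output for token i of an n-token sentence,
% and k indexes the event type (or labeled argument).
\begin{align}
  e_{k,i}      &= v_k^{\top} \tanh\left(W_k h_i + b_k\right)                  && \text{attention score for token } i \\
  \alpha_{k,i} &= \frac{\exp(e_{k,i})}{\sum_{j=1}^{n} \exp(e_{k,j})}          && \text{attention weight} \\
  s_k          &= \sum_{i=1}^{n} \alpha_{k,i}\, h_i                           && \text{attention-pooled sentence vector} \\
  p_k          &= \sigma\left(u_k^{\top} s_k + c_k\right)                     && \text{probability that type } k \text{ is present} \\
  i_k^{*}      &= \operatorname*{arg\,max}_{i} \alpha_{k,i}                   && \text{token used as the trigger or argument span}
\end{align}
% An event of type k would be emitted when p_k exceeds a threshold \tau (value assumed, not given here).
```

Under this formulation, the same attention weights serve two purposes: they pool the token representations for classification, and their maximum identifies the span token, which matches the description of trigger and labeled argument spans above.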

5.2 Results

In Section 4, we showed that active learning improved the Surrogate Classifier performance, as expected. Here, we show that it also improves performance on the more complex event extraction tasks. The active learning performance with the Event Extractor was assessed in the same way as for the Surrogate Classifier, except that samples from YVnotes were not used. YVnotes does not include all of the labeled phenomena of SHAC, so it was excluded from the Event Extractor active learning assessment. Figure 10 presents the performance of the Event Extractor on the test set for three cases:

  • initial: Models trained on the initial set only.

  • +random: Models trained on the initial set plus additional random samples.

  • +active: Models trained on the initial set plus additional active samples.

For each case, 10 runs were performed. Performance is assessed using the F1 score criteria described in Section 3.4. Adding active samples yields higher performance than adding random samples for labeled argument and span-only argument extraction, and the difference is statistically significant. The difference in trigger performance is not statistically significant. This result validates the use of the simplified surrogate text classification task as a proxy for the more complex event extraction task. The Event Extractor hyperparameters were tuned on the development set (parameter values in Table A4 of the Appendix).

Figure 10: Event Extractor trigger and argument role performance with random and active samples, evaluated on the MIMIC test samples.
Figure 11: Event Extractor trigger and argument role performance trained on the entire SHAC train set, evaluated on the MIMIC and UW Dataset test sets.

Figure 11 presents the trigger and argument role performance of the Event Extractor trained on the entire SHAC train set and evaluated on the MIMIC and UW Dataset test sets. Overall, performance is higher on MIMIC, even though there are more UW Dataset training samples, including more active samples. The UW Dataset portion of SHAC includes four different note types, whereas the MIMIC portion includes only one note type, which likely contributes to the lower performance on the UW Dataset.

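| Field | Event type | Argument | MIMIC # | MIMIC P | MIMIC R | MIMIC F1 | UW # | UW P | UW R | UW F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| Trigger | Alcohol | | 314 | 0.99 | 0.96 | 0.97 | 404 | 0.97 | 0.99 | 0.98 |
| Trigger | Drug | | 194 | 0.96 | 0.95 | 0.96 | 481 | 0.97 | 0.92 | 0.94 |
| Trigger | Tobacco | | 324 | 0.98 | 0.95 | 0.97 | 432 | 0.97 | 0.97 | 0.97 |
| Trigger | Employment | | 169 | 0.93 | 0.96 | 0.94 | 148 | 0.86 | 0.91 | 0.89 |
| Trigger | Living status | | 244 | 0.96 | 0.97 | 0.97 | 343 | 0.93 | 0.88 | 0.90 |
| Labeled argument | Alcohol | Status | 314 | 0.92 | 0.89 | 0.90 | 404 | 0.92 | 0.94 | 0.93 |
| Labeled argument | Drug | Status | 194 | 0.91 | 0.89 | 0.90 | 481 | 0.85 | 0.80 | 0.82 |
| Labeled argument | Tobacco | Status | 324 | 0.91 | 0.89 | 0.90 | 432 | 0.91 | 0.90 | 0.90 |
| Labeled argument | Employment | Status | 169 | 0.84 | 0.88 | 0.86 | 148 | 0.79 | 0.83 | 0.81 |
| Labeled argument | Living status | Status | 244 | 0.96 | 0.95 | 0.96 | 343 | 0.92 | 0.86 | 0.89 |
| Labeled argument | Living status | Type | 244 | 0.93 | 0.93 | 0.93 | 343 | 0.85 | 0.78 | 0.81 |
| Span-only argument | Alcohol | Amount, Duration, Frequency, History, Type | 396 | 0.70 | 0.74 | 0.72 | 420 | 0.67 | 0.80 | 0.73 |
| Span-only argument | Drug | Amount, Duration, Frequency, History, Type | 219 | 0.67 | 0.75 | 0.71 | 583 | 0.62 | 0.63 | 0.62 |
| Span-only argument | Tobacco | Amount, Duration, Frequency, History, Type | 799 | 0.81 | 0.83 | 0.82 | 880 | 0.78 | 0.81 | 0.79 |
| Span-only argument | Employment | Duration, History, Type | 441 | 0.80 | 0.74 | 0.77 | 261 | 0.77 | 0.77 | 0.77 |
| Span-only argument | Living status | Duration, History | 21 | 0.21 | 0.57 | 0.31 | 57 | 0.19 | 0.26 | 0.22 |

Table 4: Event Extractor trigger and argument role performance trained on the entire SHAC train set, evaluated on the MIMIC and UW Dataset test sets. # is the number of gold instances; P, R, and F1 are precision, recall, and F1 score.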

Table 4 presents detailed results for the same Event Extractor model and data configuration as Figure 11. Trigger performance is at least 0.89 F1 for all event types in both data sets.

Labeled argument performance is similar in both data sets for Alcohol and Tobacco Status; however, there are performance differences for the Drug, Employment, and Living status labeled arguments. In substance use Status prediction, the none label is typically less confusable and easier to predict than past and current. In the test set, the relative frequency of none Status labels for Drug events is higher in MIMIC samples (80%) than in UW Dataset samples (57%), which contributes to the higher performance on MIMIC. Living status Status performance is lower in the UW Dataset, even though the distribution of Status labels is similar in both data sets. Living status Type performance is 0.12 F1 higher in MIMIC than in the UW Dataset. In the test set, the distribution of Living status Type labels differs greatly between the data sets: the UW Dataset is 37% with family, 22% with others, 26% homeless, and 15% alone, whereas MIMIC is 57% with family, 16% with others, 2% homeless, and 25% alone.

For the span-only arguments, performance is calculated at the token level and micro-averaged across the arguments for each event type. Span-only argument performance is comparable across the two data sets for Alcohol, Tobacco, and Employment. However, it is higher for Drug span-only arguments in MIMIC than in the UW Dataset. Living status span-only argument performance is very low for both data sets, primarily due to sparsity in the training set (only 167 Duration and History arguments among 3,267 Living status events).
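The token-level, micro-averaged scoring of span-only arguments can be sketched as follows. This is a minimal illustration, assuming that each token carries a single argument label or the outside label `O`; the function name `micro_prf` and the example labels are hypothetical and do not come from the paper's evaluation code (the exact criteria are those of Section 3.4).

```python
def micro_prf(gold_sentences, pred_sentences, outside="O"):
    """Token-level micro-averaged precision, recall, and F1.

    Each argument is a list of sentences, and each sentence is a list of
    per-token labels. Counts are pooled over all argument labels before
    the ratios are computed, which is what makes this a micro-average.
    """
    tp = fp = fn = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        for g, p in zip(gold, pred):
            if p != outside and p == g:
                tp += 1          # predicted label matches the gold label
            else:
                if p != outside:
                    fp += 1      # predicted an argument label that is wrong
                if g != outside:
                    fn += 1      # missed (or mislabeled) a gold argument token
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical five-token sentence: two Amount tokens are correct, one token
# is wrongly predicted as Amount, and one gold Type token is missed.
gold = [["O", "Amount", "Amount", "O", "Type"]]
pred = [["O", "Amount", "Amount", "Amount", "O"]]
precision, recall, f1 = micro_prf(gold, pred)
```

Because counts are pooled before dividing, frequent arguments dominate the score, so a sparse argument such as Living status Duration contributes little unless it is reported separately.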

5.3 Limitations

Although the Event Extractor achieved high performance for most target phenomena, the extraction framework has several limitations. The Event Extractor treats trigger and labeled argument prediction as a text classification task and can only represent a single event of a given type per sentence. Figure 12 presents predicted labels for a sentence with multiple gold Drug events describing current marijuana use and previous cocaine use.

Figure 12: Example with multiple gold drug events in one sentence

While the Type predictions in this example are correct, the Status prediction of past is incorrectly associated with both marijuana and cocaine. Of the sentences with at least one event in SHAC, 6% contain multiple events of the same type. Span-only arguments for each event type are extracted using a single CRF, which cannot accommodate overlapping spans. Figure 13 presents predictions for a sentence where the gold span-only argument spans overlap. The Amount is correctly labeled as “about 1 pint of vodka,” but there should also be a Type argument of “vodka.” Approximately 6% of span-only arguments in events of the same type overlap in SHAC.

Figure 13: Example where gold span-only arguments overlap
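The overlapping-span limitation follows directly from the BIO representation, which assigns exactly one tag to each token. The sketch below illustrates this with a hypothetical sentence built around the “about 1 pint of vodka” example; the helper `to_bio`, the token offsets, and the Frequency span are our own illustrative assumptions, not part of the Event Extractor code.

```python
def to_bio(tokens, spans):
    """Encode labeled spans as one BIO tag per token.

    spans is a list of (start, end, label) with an exclusive end index.
    A span that touches an already tagged token cannot be encoded and is
    returned in the dropped list instead.
    """
    tags = ["O"] * len(tokens)
    dropped = []
    for start, end, label in spans:
        if any(tag != "O" for tag in tags[start:end]):
            dropped.append((start, end, label))  # overlap: no free tag slot
            continue
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags, dropped


# Hypothetical tokenization; the Type span ("vodka") lies inside the Amount span.
tokens = ["drinks", "about", "1", "pint", "of", "vodka", "daily"]
spans = [(1, 6, "Amount"), (5, 6, "Type"), (6, 7, "Frequency")]
tags, dropped = to_bio(tokens, spans)
```

Here the token “vodka” already carries an Amount tag, so the Type span ends up in `dropped`. A single tag sequence per event type has no way to hold both labels, whereas one tag sequence per argument would, at the cost of more output layers.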

The Event Extractor treats sentences independently. It does not incorporate context from the preceding sentences and cannot generate events that span multiple sentences. Figure 14 presents an example where past tobacco use is described in consecutive sentences. The first sentence includes a strong cue for past Status, “quit”; however, the Status in the second sentence is less clear without the previous context. Fewer than 2% of SHAC events span multiple sentences.

Figure 14: Example where inter-sentence information would likely benefit the classifier

6 Conclusions

We present a new clinical corpus, SHAC, with detailed event-based annotations for 12 SDOH. SHAC includes approximately 4.5K social history sections from multiple institutions and note types and contains frequent descriptions of alcohol, drug, and tobacco use, employment, and living status. Approximately 71% of the SHAC training set was selected using a novel active learning framework that utilizes a surrogate task for assessing sample uncertainty. The proposed active learning framework increased the prevalence of critical risk factors in the annotated training data, including positive substance use, unemployment, disability, and homelessness, and increased event extraction performance relative to using only randomly selected samples. The actively selected samples improve performance in both the surrogate task and the target event extraction task, validating the surrogate task approach. A neural multi-task model is presented for characterizing substance use, employment, and living status across multiple dimensions, including status, extent, and temporal fields. The Event Extractor achieves high performance on the MIMIC and UW Dataset test sets: 0.89-0.98 F1 in identifying distinct SDOH events, 0.82-0.93 F1 for substance use status, 0.81-0.86 F1 for employment status, and 0.81-0.93 F1 for living status type. The annotation guidelines and source code are available online (footnote 3: a link will be provided if the paper is accepted).


This study was funded by the Seattle Flu Study through the Brotman Baty Institute and by the National Center For Advancing Translational Sciences of the National Institutes of Health under Award Number UL1 TR002319.



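| Event type | Argument | Label set | Span examples |
|---|---|---|---|
| Substance use (Alcohol, Drug, & Tobacco) | Status* | none, current, past | “denies,” “smokes” |
| Substance use (Alcohol, Drug, & Tobacco) | Duration | | “for the past 8 years” |
| Substance use (Alcohol, Drug, & Tobacco) | History | | “seven years ago” |
| Substance use (Alcohol, Drug, & Tobacco) | Type | | “beer,” “cocaine” |
| Substance use (Alcohol, Drug, & Tobacco) | Amount | | “2 packs,” “3 drinks” |
| Substance use (Alcohol, Drug, & Tobacco) | Frequency | | “daily,” “monthly” |
| Employment | Status* | employed, unemployed, retired, on disability, student, homemaker | “works,” “unemployed” |
| Employment | Duration | | “for five years” |
| Employment | History | | “15 years ago” |
| Employment | Type | | “nurse,” “office work” |
| Living status | Status* | current, past, future | “lives,” “lived” |
| Living status | Type* | alone, with family, with others, homeless | “with husband” |
| Living status | Duration | | “for the past 6 months” |
| Living status | History | | “until a month ago” |
| Insurance | Status | yes, no | “has been off” |
| Sexual orientation | Status | current, past | “participated in” |
| Sexual orientation | Type | heterosexual, homosexual, bisexual | “homosexual” |
| Gender identity | Status | current, past | “identifies as” |
| Gender identity | Type | cisgender, transgender | “transgender” |
| Country of origin | Type | | “England” |
| Race | Type | | “African American” |
| Physical activity | Status | none, current, past | “currently jogs” |
| Physical activity | Duration | | “for several years” |
| Physical activity | History | | “10 years ago” |
| Physical activity | Type | | “walks” |
| Physical activity | Amount | | “4 miles” |
| Physical activity | Frequency | | “every evening” |
| Environmental exposure | Status | none, current, past | “no history” |
| Environmental exposure | Duration | | “since 2001” |
| Environmental exposure | History | | “until a month ago” |
| Environmental exposure | Type | | “asbestos” |
| Environmental exposure | Amount | | “significant” |
| Environmental exposure | Frequency | | “daily” |

Table A1: Annotation guideline summary for all event types. *indicates the argument is required.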
| Round | Source | Selection | Active learning training set | Train | Dev | Test | Total |
|---|---|---|---|---|---|---|---|
| 1 | MIMIC | Random | | 100 | | | 100 |
| 2 | MIMIC | Random | | 144 | 56 | | 200 |
| 3 | MIMIC | Random | | 288 | 112 | | 400 |
| 4 | UW Dataset | Random | | 84 | 140 | 280 | 504 |
| 5 | MIMIC | Active | 572 samples (Round 3 train + 284 YVnotes) | 400 | | | 400 |
| 6 | UW Dataset | Random | | 168 | 120 | 240 | 528 |
| 7 | MIMIC | Random | | | 20 | 280 | 300 |
| 8 | UW Dataset | Random | | 112 | | | 112 |
| 9 | UW Dataset | Active | 1336 samples (Rounds 3-8 train + 284 YVnotes) | 728 | | | 728 |
| 10 | UW Dataset | Active | 2064 samples (Rounds 3-9 train + 284 YVnotes) | 728 | | | 728 |
| 11 | MIMIC | Active | 3036 samples (Rounds 1-10 train + 284 YVnotes) | 384 | | | 384 |
| 12 | MIMIC | Random | | | | 96 | 96 |
| TOTAL | | | | 3136 | 448 | 896 | 4480 |

Table A2: Annotation round summary, including selection type (random versus active) and training data used in active selection.
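The round counts in Table A2 can be cross-checked against the reported totals and against the statement that approximately 71% of the training set was actively selected. The sketch below is a consistency check only. The assignment of each count to the Train, Dev, or Test column is our inference from the column totals and the active learning training set sizes, blank cells are treated as zero, and the variable names are hypothetical.

```python
# (round, source, selection, train, dev, test); blank table cells are entered as 0.
ROUNDS = [
    (1, "MIMIC", "Random", 100, 0, 0),
    (2, "MIMIC", "Random", 144, 56, 0),
    (3, "MIMIC", "Random", 288, 112, 0),
    (4, "UW Dataset", "Random", 84, 140, 280),
    (5, "MIMIC", "Active", 400, 0, 0),
    (6, "UW Dataset", "Random", 168, 120, 240),
    (7, "MIMIC", "Random", 0, 20, 280),
    (8, "UW Dataset", "Random", 112, 0, 0),
    (9, "UW Dataset", "Active", 728, 0, 0),
    (10, "UW Dataset", "Active", 728, 0, 0),
    (11, "MIMIC", "Active", 384, 0, 0),
    (12, "MIMIC", "Random", 0, 0, 96),
]

# Column totals, to compare with the TOTAL row of the table.
train_total = sum(r[3] for r in ROUNDS)
dev_total = sum(r[4] for r in ROUNDS)
test_total = sum(r[5] for r in ROUNDS)
corpus_total = train_total + dev_total + test_total

# Share of the training set that came from actively selected rounds.
active_train = sum(r[3] for r in ROUNDS if r[2] == "Active")
active_fraction = active_train / train_total

# Size of the set used to select Round 9: Rounds 3-8 train plus 284 YVnotes samples.
round9_selection_set = sum(r[3] for r in ROUNDS if 3 <= r[0] <= 8) + 284
```

With this column assignment, the totals reproduce the TOTAL row (3136 train, 448 dev, 896 test, 4480 overall), the active share of the training set is 2240 of 3136 samples, and the Round 9 selection set has 1336 samples, matching the table.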
| Parameter | Query function selection (Table 3) | Active learning evaluation (Figure 7) |
|---|---|---|
| batch size | 20 | 100 |
| learning rate | 0.001 | 0.005 |
| maximum gradient L2 norm | 1.0 | 1.0 |
| maximum length | 200 | 200 |
| number of epochs | 500 | 500 |
| LSTM hidden size | 100 | 100 |
| dropout, input to LSTM | 0.7 | 0.4 |
| dropout, output of LSTM | 0.0 | 0.4 |
| dropout, self-attention | 0.7 | 0.4 |

Table A3: Surrogate Classifier hyperparameters
| Parameter | Figure 10, Figure 11, and Table 4 |
|---|---|
| batch size | 50 |
| learning rate | 0.005 |
| maximum gradient L2 norm | 0.5 |
| maximum length | 30 |
| number of epochs | 250 |
| LSTM hidden size | 100 |
| dropout, input to LSTM | 0.6 |
| dropout, output of LSTM | 0.4 |
| dropout, self-attention | 0.4 |

Table A4: Event Extractor hyperparameters