Learning to Prescribe Interventions for Tuberculosis Patients using Digital Adherence Data

Digital Adherence Technologies (DATs) are an increasingly popular method for verifying patient adherence to many medications. We analyze data from one city served by 99DOTS, a phone-call-based DAT deployed for Tuberculosis (TB) treatment in India where nearly 3 million people are afflicted with the disease each year. The data contains nearly 17,000 patients and 2.1M phone calls. We lay the groundwork for learning from this real-world data, including a method for avoiding the effects of unobserved interventions in training data used for machine learning. We then construct a deep learning model, demonstrate its interpretability, and show how it can be adapted and trained in three different clinical scenarios to better target and improve patient care. In the real-time risk prediction setting, our model could be used to proactively intervene with 21% more patients and catch 76% more missed doses than the heuristic baselines currently used in practice. For outcome prediction, our model performs 40% better than baseline methods, allowing cities to target more resources to clinics with a heavier burden of patients at risk of failure. Finally, we present a case study demonstrating how our model can be trained in an end-to-end decision focused learning setting to achieve a 15% better solution quality on an example intervention planning problem faced by health workers.



1. Introduction

The World Health Organization (WHO) reports that the lung disease tuberculosis (TB) is one of the top ten causes of death worldwide (Organization, 2018), yet in most cases it is a curable and preventable disease. The prevalence of TB is caused in part by non-adherence to medication, which results in greater risk of death, reinfection and contraction of multidrug-resistant TB (Thomas et al., 2005). To combat non-adherence, the WHO standard protocol is Directly Observed Treatment, short-course (DOTS), in which a health worker directly observes and confirms that a patient is consuming the required medication multiple times in a week. However, requiring patients to travel to the DOTS clinic causes financial burden, and potentially social stigma due to public fear of the disease. Such barriers cause patients to default from treatment, making TB eradication difficult. Thus, digital adherence technologies (DATs), which give patients flexible means to prove adherence, have gained popularity globally (Subbaraman et al., 2018).

Figure 1. 99DOTS electronic adherence dashboard seen by health workers. Missed doses are marked in red while consumed doses are marked in green.

DATs allow patients to be "observed" consuming their medication electronically, e.g. via two-way text messaging, video capture, electronic pillboxes, or toll-free phone calls. Health workers can then view real-time patient adherence on a dashboard such as Figure 1. In addition to improving patient flexibility and privacy, the dashboard enables health workers to triage patients and focus their limited resources on the highest risk patients. Preliminary studies suggest that DATs can improve adherence in multiple disease settings (Haberer et al., 2017; Corden et al., 2016; Sabin et al., 2015), prompting its use and evaluation for managing TB adherence (Garfein et al., 2015; Liu et al., 2015; 99DOTS, [n. d.]). The WHO has even published a guide for the proper implementation of the technology in TB care (Organization et al., 2017).

In this paper, we study how the wealth of longitudinal data produced by DATs can be used to help health workers better triage TB patients and deliver interventions to boost overall adherence of their patient cohort. The data we analyze comes from a partnership with the nonprofit 99DOTS (99DOTS, [n. d.]) and the healthcare technology company Everwell (Everwell, [n. d.]), who have implemented a DAT by which patients prove adherence through daily toll-free calls. 99DOTS operates in India, where there were an estimated 2.7 million cases of TB in 2017 (Organization, 2018); they shared data from one major city in Maharashtra (referred to as "The City"). Patients enrolled in 99DOTS in The City currently receive interventions according to the following general guidelines. If they have not taken their medication by the afternoon, they (and their health worker) receive a text message reminder. If the patient still has not taken their medication some time later, the worker will call the patient directly. Finally, if a patient simply does not respond to these previous interventions after some number of days, they may be personally visited by a health worker. Note that many of these patients live in low-resource communities where each health worker manages tens to hundreds of patients, far more than they can possibly visit in a day. Thus, models that can identify patients at risk of missing doses and prioritize interventions by health workers are of paramount importance.

At first glance, the problem of predicting whom to target for an intervention appears to be a simple supervised machine learning problem. Given data about a patient’s medication adherence through their calls to the 99DOTS system, one can train a machine learning model to predict whether they will miss medication doses in the future. However, such a model ignores the concurrent interventions from health workers as the data was collected, and can lead to incorrect prioritization decisions even when it is highly accurate. For instance, we might observe that missed doses are followed by a period of medication adherence: this does not mean that people with missed doses are more likely to take medication, but most likely that there was an intervention by a health worker after which the patient restarted their medication.

Thus, for prescribing interventions, we need to disentangle the effect of manual interventions from other underlying factors that result in missing a dose. However, since this data was collected via an extensive rollout to real patients, the data contains the effects of interventions carried out by health workers. As an additional challenge, the 99DOTS system does not record interventions, making it difficult to estimate their effects. While there is a well-developed literature on estimating heterogeneous treatment effects, standard techniques uniformly require knowledge of which patients received an intervention (Morgan and Winship, 2014; Dehejia and Wahba, 2002; Athey and Imbens, 2016; Sutton et al., 2009). We note that such gaps will be common as countries eagerly adopt DAT systems in the hopes of benefiting low-income regions; to support the delivery of improved care, we must be able to draw lessons from this messy but plentiful data.

In this work, therefore, we introduce a general approach for learning from adherence data with unobserved interventions, based on domain knowledge of the intervention heuristics applied by health workers. Through our partnership with Everwell and 99DOTS, we construct a proxy for interventions present in the historical data and develop a model that can help prioritize targets of interventions for health workers in three different clinical scenarios:

Modeling Daily Non-Adherence Risk. We first propose the following prediction task: given adherence data up to a certain time period for patients not currently considered for intervention, predict risk of non-adherence in the next week. We then introduce machine learning models for this task, which enable health workers to intervene with 21% more patients and catch nearly 76% more missed doses compared to the heuristics currently used in practice.

Predicting success of treatment. Next, we apply our framework to predict the final outcome at the end of the six-month treatment for a patient based on their initial adherence data. Like the previous model, this can be useful for health workers to prioritize patients who are at risk of an unsuccessful treatment, even though their adherence might be high. Additionally, since this prediction applies over the course of several months (rather than just one week as in the previous task), this model can be useful for public health officials to better plan for TB treatment in their area, e.g. by assigning or hiring additional health workers. We show that our model can be used to achieve city-wide treatment outcome goals at nearly 40% lower cost than baseline methods.

Decision-focused learning. Finally, building on recent work in end-to-end decision-focused learning (Wilder et al., 2018), we build a machine learning model which is tailored for a specific intervention planning problem. In the planning problem, workers must balance travel costs while predicting which patients will benefit most from interventions. This example demonstrates how the modeling flexibility enabled by our approach allows us to fine-tune and extract additional gains for particular decision support tasks (in this case, a 15% improvement over our earlier model).

With our proposed models, 99DOTS can now leverage several years of collected adherence data to better inform patient care and prioritize limited intervention resources. In addition, the challenges we address are not unique to 99DOTS or TB adherence. DATs have been implemented for disease treatment regimens such as HIV and diabetes in regions across the globe, and for each of those cases health workers face the same challenge of prioritizing patient interventions. By enabling health workers to intervene before more missed doses, our model will directly contribute to saving the lives of those afflicted with TB and other diseases. That is why, though our model has not yet been deployed, we are excited that Everwell will soon adopt the technology and test it in the field.

2. Related Work

Outcomes and adherence research are well studied in the medical literature for a variety of diseases (Kardas et al., 2013). Traditionally, studies have attempted to identify demographic or behavioral factors correlated with non-adherence so that health workers can focus interventions on patients who are likely to fail. Tuberculosis in particular, given its lethality and prevalence in developing countries, has been studied throughout the world, including in Ethiopia (Shargie and Lindtjørn, 2007), Estonia (Kliiman and Altraja, 2010), and India (Roy et al., 2015). Typically these studies gather demographic and medical statistics on a cohort of patients, observe the cohort's adherence and outcomes throughout the trial, then retrospectively apply survival (Shargie and Lindtjørn, 2007; Kliiman and Altraja, 2010) or logistic regression (Roy et al., 2015) analysis to determine covariates predictive of failure. Newer work has improved classification accuracy via machine learning techniques such as decision trees, neural networks, support vector machines, and more (Kalhori and Zeng, 2013; Hussain and Junejo, 2018; Sauer et al., 2018; Mburu et al., 2018). However, the conclusions connecting predictors to risk are largely the same as in previous medical literature. While such studies have improved patient screening at the time of diagnosis, they offer little knowledge about how risk changes during treatment. In this work, we show how a patient's real-time adherence data can be used to track and predict risk changes throughout the course of their treatment. Previous studies likely did not address this question because accurately measuring patient adherence has historically been difficult.

However, in recent years, new technologies have made measuring daily adherence feasible in the context of many diseases such as HIV or stroke. One such common device is an electronic pill bottle cap that records the date/time when the cap is removed. While some previous work has used electronic cap data to determine predictors of non-adherence (Platt et al., 2010; Pellowski et al., 2016; Cook et al., 2017), almost no research has used the daily measurements made possible by the electronic cap to study changes in adherence over time. One study used data from a smart pillbox to retrospectively categorize patient adherence (Kim et al., 2018), but our focus is on prospective identification of patients at risk of missing doses before failures occur. As such devices enter mainstream use, machine learning techniques like the ones that we propose will play an important role in the treatment of a wide spectrum of diseases.

Methodologically, our work is related to the large body of research that deals with estimating the causal impact of interventions from observational data (Morgan and Winship, 2014; Dehejia and Wahba, 2002; Athey and Imbens, 2016; Sutton et al., 2009). Given appropriate assumptions, such techniques allow for valid inferences about counterfactual outcomes under a different policy for determining interventions. However, they crucially require exact knowledge of when interventions were carried out. This information is entirely absent in our setting, requiring us to develop new methods for handling unobserved interventions in the training data.

3. Data Description

99DOTS provides each patient with a cover for their sleeve of pills that associates a hidden unique phone number with each pill. As patients expose each pill, they expose the associated phone number. Each patient is instructed to place a toll-free call to the indicated number each day. 99DOTS counts a dose only if the patient calls the correct number for a given day. Due to the sensitivity of the health domain, all data provided by our partners was fully anonymized before we received it. The dataset contains over 2.1 million calls by about 17,000 patients, served by 252 health centers across The City. Table 1 summarizes the data. We now describe the available information in more detail.

Metric                               Count
Total calls                          2,169,976
—By Patient                          1,459,908
—Manual (entered by health worker)   710,068
Registered Phones                    38,000
Patients                             16,975
Health Centers                       252
Calls per patient
—Min/Mean/Max                        1/136/1409
Patients per center
—Quartiles                           21/51/92
—Min/Mean/Max                        1/63/421

Table 1. Data Summary

Patient Details. This is the primary record for patients who have enrolled with 99DOTS. The table includes demographic features such as weight-band, age-band, gender and treatment center ID. Also included are treatment start and end dates, whether treatment is completed or ongoing, and an "adherence string" which summarizes a patient's daily adherence. For patients who completed treatment, a treatment outcome is also assigned according to the standard WHO definitions (Organization, 2013, p. 5). We label "Cured" and "Treatment Complete" to be successful outcomes and "Died", "Treatment failed", and "Lost to follow-up" to be unsuccessful outcomes.
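This outcome grouping can be encoded as a small lookup; the helper function below is illustrative (the name and error handling are ours, not part of the 99DOTS schema), with the outcome strings taken from the WHO definitions above.

```python
# Binary treatment-outcome labels from the WHO outcome groupings above.
SUCCESSFUL = {"Cured", "Treatment Complete"}
UNSUCCESSFUL = {"Died", "Treatment failed", "Lost to follow-up"}

def outcome_label(outcome):
    """Return 1 for a successful treatment outcome, 0 for unsuccessful."""
    if outcome in SUCCESSFUL:
        return 1
    if outcome in UNSUCCESSFUL:
        return 0
    raise ValueError("unmapped outcome: %s" % outcome)
```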

Mapping phone numbers to patients. Patients must call from a registered phone number for a dose to be counted by the 99DOTS system. Patients can register multiple phones, each of which will be noted in the Phone Map table. We filtered out phones that were registered to multiple patients since they could not be uniquely mapped to patients. Also, patients who had any calls from shared phones were filtered out to avoid analyzing incomplete call records. This removed <1% of the patients from the data set.

Call Log. The Call Log records every call received by 99DOTS, including from patients outside of The City. It also includes "manual calls" marked by health workers. Manual calls allow workers to retroactively update a patient's adherence on the dashboard. For instance, if a patient missed a week of calls due to a cellular outage, the worker could update the record to account for those missed doses. We filtered the Call Log to only contain entries with patients and phones registered in The City, then attached a Patient ID to each call by joining the filtered Call Log and Phone Map.

Patient Log. Each time a health worker interacts with a patient's data in the 99DOTS dashboard, an automatic note is generated describing the interaction. The Patient Log records each such event, noting the type of action taken, the Patient ID, the health worker ID, the health worker's medical unit, and a timestamp. We did not calculate features from this table as they tended to be sparse. However, this table was used for calculating our training labels as described in Section 4.

4. Unobserved Interventions

The TB treatment system operates under tight resource limitations, e.g. one health worker may be responsible for more than 100 patients. Thus while recommendations for additional interventions can be valuable, recommendations that reprioritize existing resources are of even greater use. The former can be accomplished by applying traditional machine learning to the data as-is. However, the latter requires taking special care to understand how intervention resources were allocated in the existing data.

Thus, a key challenge is that the 99DOTS platform does not record interventions: workers may make texts, calls, or personal visits to patients to try to improve adherence, but these interventions are not logged in the data. While far from ideal, such gaps are inevitable as countries with differing standards of reporting adopt DATs for TB treatment. Given the abundance of data created by DATs and their potential to impact human lives, we emphasize the importance of learning lessons in this challenging setting where unobserved interventions occur. We next resolve this challenge by formulating a screening procedure which identifies patients who were likely candidates for particular interventions. However, we first illustrate the difference between a model that can be used to recommend additional interventions versus one that can recommend reprioritizing interventions.

Consider a naive model trained on the data as-is. Some of the data will be influenced by the historical interventions carried out by health workers. Thus, such a model will learn how to predict patient adherence given existing worker behaviors. Such predictions will be useful to find patients who will fail despite existing efforts, so the naive model is suited to recommend additional interventions.

Now consider using the same naive model to reprioritize interventions. That is, some patients who would have received interventions under the historical policy will be judged not to require intervention by the new model. While such prioritization is desirable under resource constraints, naive models which ignore the impact of interventions in the dataset can actually worsen patient outcomes. For instance, assume we use the naive model to make a prediction about the patient from Section 1 who had a week of missed doses, an intervention, then a week of good adherence. By correctly predicting this patient’s good adherence the naive model would recommend no intervention – but this patient’s good adherence is contingent on the hidden intervention in the data. Hence, the naive model will take resources away from exactly the patients who would benefit most. To avoid such pitfalls arising from unobserved interventions, we must train and evaluate on data that is not influenced by such intervention effects. We now describe our general method for reshaping data around intervention effects to build valid models.

Intervention Proxy. Our goal is to use the available data to formulate a proxy for when an intervention is likely to have occurred, so that we can train our models on data points which are unaffected by interventions. The key is to identify a conservative estimate for where interventions occur to ensure that data with intervention signals are not included. First, we draw a distinction between different types of health worker interventions. Specifically, we consider a house visit to be a "resource-limited" intervention since workers cannot visit all of their patients in a timely manner. Generally, this is a last resort for health workers when patients will not respond to other methods. Alternatively, we consider calls and texts to be "non-resource-limited" interventions since they could feasibly be made on every patient in one day. We develop a proxy only for resource-limited interventions since there is no reason to reprioritize non-resource-limited interventions which are effectively "free".

To formulate our proxy, we first searched for health worker guidelines for carrying out house visits. The 2005 guide by India’s Revised National Tuberculosis Control Program (RNTCP) (RNTCP, [n. d.]b) required that workers deliver a house visit after a single missed dose, but updated guides are far more vague on the subject. Both the most recent guide by the WHO (Organization et al., 2017) and by the RNTCP (RNTCP, [n. d.]a) leave house visits up to the discretion of the health worker. However, our partners at Everwell observed that health workers prioritize non-adherent patients for resource-limited interventions such as house visits. Thus, we formulated our proxy based on the adherence dashboard seen by health workers.

The 99DOTS dashboard gives a daily "Attention Required" value for each patient. First, if a patient has an entry in the Patient Log (i.e., a provider made a note about the patient) in the last 7 days, they are automatically changed to "MEDIUM" attention, but this rule affects <1% of the labels. The remaining 99% of labels are as follows: if a patient misses 0 or 1 calls in the last 7 days, they are changed to "MEDIUM" attention, whereas if they miss 4 or more they are changed to "HIGH" attention. Patients with 2-3 missed doses retain their attention level from the previous day. As our conservative proxy, we assumed that only "HIGH" attention patients were candidates for resource-limited interventions, since the attention level is a health worker's primary summary of recent patient adherence. This "attention required" system for screening resource-limited interventions is generalizable to any daily adherence setting; one need only identify the threshold for a change to HIGH attention.
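The attention-update rule described above can be sketched as follows. The function and data representation are ours, and the rare Patient Log override is deliberately ignored, matching the 99% case:

```python
def update_attention(calls_last_7, prev_attention):
    """Daily "Attention Required" update from the last 7 days of calls
    (1 = call made, 0 = missed). Sketch of the dashboard rule above,
    ignoring the <1% Patient Log override.
    """
    missed = 7 - sum(calls_last_7)
    if missed <= 1:
        return "MEDIUM"
    if missed >= 4:
        return "HIGH"
    return prev_attention  # 2-3 misses: retain yesterday's level
```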

With this screening system, we can identify sequences of days during which a patient was a candidate for a resource-limited intervention, and subsequently avoid using signal from those days in our training task. We accomplish this with our formulation of the real-time risk prediction task as follows.

Consider a given set of patients on the dashboard of a health worker at day t. Each patient will have an "attention required" value in {MEDIUM, HIGH} representing their risk for that day. Over the course of the next week, up to day t+7, we will observe call behavior for each patient, and so the attention for each patient may also change each day. Between t and t+7, any patient that is at HIGH on a given day may receive a resource-limited intervention, while those at MEDIUM may not. Note that a change from MEDIUM to HIGH on day t+k, where k ∈ {1, ..., 7}, means that a patient missed 4 doses over days t+k-7 to t+k. Patients at HIGH attention are already known to the health worker, so the goal for our ML system is to help prevent transitions from MEDIUM to HIGH by predicting which patients are at greatest risk before the transition occurs, allowing a health worker to intervene early.

We formalize our prediction task as follows. For each patient who is MEDIUM at time t, use data from days t-7 to t to predict whether or not they change to HIGH at any time t+k, where k ∈ {1, ..., 7}. We now demonstrate that, with our intervention proxy, resource-limited intervention effects cannot affect labels in this formulation. First, if a patient stays at MEDIUM for all t+k, k ∈ {1, ..., 7}, then the label is 0. Since the patient was at MEDIUM on all of those days, our proxy states that no resource-limited intervention took place between our prediction time t and the time that produced the label, t+7. Second, if a patient changes from MEDIUM to HIGH on day t+k, then on day t+k we establish that the label is 1. By our proxy, any resource-limited intervention effect must happen on day t+k+1 or later, since attention is established at the end of a day. So again, no resource-limited intervention took place between our prediction time t and the time that produced the label, t+k.
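Under these definitions, the label for a MEDIUM patient at day t depends only on whether they reach HIGH in the following week. A minimal sketch, assuming a per-day list of attention levels (the indexing scheme is ours):

```python
def risk_label(attention, t, horizon=7):
    """Label for the prediction task above: for a patient at MEDIUM on
    day t, return 1 if they change to HIGH on any day t+1..t+horizon,
    else 0. `attention` is a list of daily attention levels.
    """
    assert attention[t] == "MEDIUM", "task is defined for MEDIUM patients"
    return int(any(attention[t + k] == "HIGH" for k in range(1, horizon + 1)))
```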

Since we ensure that no resource-limited interventions happen between our prediction time and the time the label is generated, we ensure that intervention effects cannot influence our labels. Now, if we predict that a patient will have good adherence, we can safely recommend no intervention, since our combined screening and training method guarantees that their good adherence is not contingent on an intervention. Thus our classifier is suited to make predictions that reprioritize resource-limited interventions.

Despite messy data affected by unobserved interventions, this conservative, general proxy generates clean data without interventions. In the next section, we show how this approach leads to significant improvements in prediction performance and creates valid recommendations to enable interventions among high-risk patients.

5. Real-Time Risk Prediction

We now build a model for the prediction task formalized in Section 4 which leverages our intervention screening proxy. Our goal was to develop a model corresponding to the health worker’s daily task of using their patients’ recent call history to evaluate adherence risk with the goal of scheduling different types of interventions. Better predictions allow workers to proactively intervene with more patients before they miss critical doses.

Sample Generation. We started with the full population of 16,975 patients and generated training samples from each patient as follows. We considered all consecutive sequences of 14 days of call data where the first 7 days of each sequence were non-overlapping. We excluded each patient’s first 7 days and the last day of treatment to avoid bias resulting from contact with health workers when starting or finishing treatment. We then took two filtering steps. First, we removed samples where the patient had more than 2 calls manually marked by a provider during the input sequence since these patients likely had contact with their provider outside of the 99DOTS system. Second, we removed samples in which the patient did not miss any calls in the input sequence. These samples made up the majority of data but included almost no positive (high risk) labels, which distorted training. Further, positive predictions on patients who missed 0 calls are unlikely to be useful; no resource-limited intervention can be deployed so widely that patients with perfect recent adherence are targeted. The above procedure generated 16,015 samples (2,437 positive).
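The windowing scheme above can be sketched as follows; the two filtering steps (manual-call and zero-miss exclusions) are left to the caller, and the step size of 7 makes the 7-day input halves non-overlapping. The function name and slicing conventions are ours:

```python
def candidate_windows(calls):
    """Yield (input_week, label_week) pairs of 7-day slices from a
    patient's full binary call history, skipping the first 7 days and
    the last day of treatment, per the sampling scheme above.
    """
    last_valid = len(calls) - 1  # exclude the final treatment day
    for start in range(7, last_valid - 14 + 1, 7):
        yield calls[start:start + 7], calls[start + 7:start + 14]
```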

Features. Each sample contains both a time series of call data and static features. The time-series data included two sequences of length 7 for each sample. The first was a binary sequence of call data (1 for a call and 0 for a miss). The second was a cumulative total of all calls missed up to that day, considering the patient's entire history in the program. The static features included four basic demographic features from the Patient Table: weight-band, age-band, gender, and treatment center ID. Additional features were engineered from the patient Call Logs and captured a patient's behavior rather than just their adherence. For example, does the patient call at the same time every morning, or sporadically each day? This was captured through the mean and variance of the call minute and hour. Other features included the number of patient calls, the number of manual calls, and the mean/max/variance of calls per day as well as days per call. We also included analogous features which used only unique calls per day, or which ignored manual calls. This process resulted in 29 descriptive features.
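As an illustration, the call-timing behavior features could be computed as below. This covers only a small subset of the 29 features, and the feature names are ours:

```python
import statistics

def timing_features(call_times):
    """Mean/variance of call hour and minute from (hour, minute) pairs,
    capturing whether a patient calls at a consistent time each day or
    sporadically. Illustrative subset of the engineered features above.
    """
    hours = [h for h, _ in call_times]
    minutes = [m for _, m in call_times]
    return {
        "hour_mean": statistics.mean(hours),
        "hour_var": statistics.pvariance(hours),
        "minute_mean": statistics.mean(minutes),
        "minute_var": statistics.pvariance(minutes),
    }
```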

Figure 2. ROC curves for the weekly risk prediction task, comparing the missed-call baseline (blue), random forest (yellow), and DeepNet (green). Numbers under the blue curve give the thresholds used to calculate the baseline's ROC curve.


We first tested standard models which use only the static features: logistic regression, a random forest (Pedregosa et al., 2011) (with 100 trees and a max depth of 5), and a support vector machine. The random forest performed best, so we exclude the others for clarity. To leverage the time-series data, we also built a deep network (DeepNet), implemented with Keras (Chollet et al., 2015), which takes both the time series and the static features as input. DeepNet has two input layers: (1) an LSTM with 64 hidden units for the time-series input, and (2) a dense layer with 100 units for the static-feature input. We concatenated the outputs of these two layers and fed them forward into another dense layer with 16 units, followed by a single sigmoid activation unit. We used a batch size of 128 and trained for 20 epochs.
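The architecture described above can be sketched in Keras roughly as follows. The input shapes (the two length-7 sequences stacked as channels, and an assumed static-feature width) and the ReLU activations on the hidden dense layers are our assumptions, not stated in the text:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Sketch of DeepNet: an LSTM branch for the time series and a dense
# branch for static features, concatenated into a small prediction head.
seq_in = keras.Input(shape=(7, 2), name="time_series")  # binary calls + cumulative misses
static_in = keras.Input(shape=(33,), name="static")     # assumed static-feature width

x1 = layers.LSTM(64)(seq_in)                            # 64 hidden units
x2 = layers.Dense(100, activation="relu")(static_in)    # 100-unit dense layer
x = layers.Concatenate()([x1, x2])
x = layers.Dense(16, activation="relu")(x)
out = layers.Dense(1, activation="sigmoid")(x)          # risk of MEDIUM -> HIGH

model = keras.Model(inputs=[seq_in, static_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit([X_seq, X_static], y, batch_size=128, epochs=20)
```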

Model Evaluation. To evaluate models, we randomized all data, then separated 25% as the test set. We used grid search with 4-fold cross-validation to determine the best model parameters. To deal with class imbalance, we used SMOTE to over-sample the training set (Chawla et al., 2002), implemented with the Python library imblearn (Lemaître et al., 2017). We also normalized features as percentiles using scikit-learn (Pedregosa et al., 2011), which we found empirically to work well. The baseline we compared against was the method used by health workers in the field to assess risk, namely the number of calls missed by the patient in the last week (lw-Misses).

Figure 2 shows the ROC curves of our models vs. the baseline. The random forest narrowly outperforms the baseline, and DeepNet clearly outperforms both. However, to evaluate the usefulness of our methods over the baseline, we consider how the baseline would be used in practice. First, we consider the scenario where workers are planning a house-visit intervention. Since this is a very limited resource, we set the strictest baseline threshold to consider patients for this intervention, that is, 3 missed calls. Fixing the FPR of this baseline method, Table 2 shows how many more patients in the test set would be reached each week by our method (as a result of its higher TPR), as well as the improvement in the number of missed doses caught. To calculate missed doses caught, we count only missed doses that occur before the patient moves to HIGH risk. Our model catches 21.6% more patients and 76.5% more missed calls, demonstrating substantially more precise targeting than the baseline.

Method        True Positives   Doses Caught
Baseline      204              204
DeepNet       248              360
Improvement   21.6%            76.5%

DeepNet vs. baseline for catching missed doses at a fixed false positive rate. Our method learns behaviors indicative of non-adherence far earlier than the baseline, allowing more missed doses to be prevented.
Table 2. DeepNet vs. Baseline - Missed Doses Caught
Table 3. DeepNet vs. Baseline: Additional Interventions
TPR | Baseline FPR | DeepNet FPR | Improvement
75% | 50%          | 35%         | 30%
80% | 63%          | 41%         | 35%
90% | 82%          | 61%         | 26%
DeepNet vs. baseline for implementing new interventions. At any TPR, DeepNet improves over the baseline FPR, allowing for more precisely targeted interventions.
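Numbers of this kind can be read off the ROC curves: for a target TPR, find the smallest FPR each scorer achieves. A minimal sketch with synthetic scores (the score distributions here are hypothetical, chosen only so the "model" is more informative than the "baseline"):

```python
# For a target TPR, read the corresponding FPR off each method's ROC curve.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=1000)
# hypothetical scores: "model" separates classes well, "baseline" is noisier
model_scores = y_true + rng.normal(scale=0.5, size=1000)
baseline_scores = y_true + rng.normal(scale=2.0, size=1000)

def fpr_at_tpr(y, scores, target_tpr):
    fpr, tpr, _ = roc_curve(y, scores)
    # tpr is non-decreasing: take the first operating point with tpr >= target
    return float(fpr[np.searchsorted(tpr, target_tpr)])

model_fpr = fpr_at_tpr(y_true, model_scores, 0.80)
base_fpr = fpr_at_tpr(y_true, baseline_scores, 0.80)
```

A lower FPR at the same TPR is exactly the "more precisely targeted interventions" column above.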

Table 3 shows that our model also outperforms the baseline as both the true positive rate (TPR) and FPR increase. This suggests that our model has greater discriminatory power for difficult-to-detect non-adherence, which is useful for non-resource-limited interventions such as calls or texts. Recall that our screening procedure does not apply to this type of intervention, so our predictions may only recommend additional interventions. It is important that additional interventions be carefully targeted, since repeated contact with a given patient reduces the efficacy of each over time (Demonceau et al., 2013). This highlights the value of the greater precision offered by our model: simply blanketing the entire population with calls and texts is likely counterproductive.

Interpretability. Our model has the potential to catch more missed doses than current methods. However, these gains cannot become reality without health workers on the ground delivering interventions based on the predictions. Interpretability is thus a key factor in our model’s usefulness because health workers need to understand why our model makes its predictions to trust the model and integrate its reasoning with their own professional intuition.

However, the best predictive performance was achieved via the black-box deep network rather than a natively interpretable model such as linear regression. Accordingly, we show how a visualization tool can help users draw insights about our model's reasoning. We used the SHapley Additive exPlanations (SHAP) Python library, which generates visualizations for explaining machine learning models (Lundberg, [n. d.]). Figure 3a shows how static features influence our model's prediction, where red features push predictions toward 1 (HIGH) and blue toward 0 (MEDIUM). Recall that features are scaled as percentiles. In the blue, we see that this patient makes an above-average number of calls each week, pushing the prediction toward 0. However, in the red we see that this patient has a very low average but high variability in the time between calls. These features capture that this patient missed two days of calls, then made three calls on one day in an attempt to "back-log" the previously missed calls. Our model learned that this is a high-risk behavior.
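The quantity SHAP visualizes is the Shapley value of each feature. A self-contained toy illustration of that quantity (this is not our model or the shap library; the two-feature value function below is invented for the example, though its signs mirror Figure 3a):

```python
# Exact Shapley values for a toy two-feature "model", illustrating the
# attribution that the SHAP library approximates for a real network.
from itertools import combinations
from math import factorial

def shapley(value, features):
    """Exact Shapley value of each feature under coalition value function `value`."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for r in range(n):
            for S in combinations(others, r):
                # Shapley weight for a coalition of size |S|
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += w * (value(set(S) | {f}) - value(set(S)))
        phi[f] = total
    return phi

# hypothetical model: prediction rises with missed calls, falls with calls made
def value(coalition):
    v = 0.0
    if "missed_calls" in coalition:
        v += 0.4   # pushes prediction toward HIGH (red in Figure 3a)
    if "calls_per_week" in coalition:
        v -= 0.2   # pushes prediction toward MEDIUM (blue)
    return v

phi = shapley(value, ["missed_calls", "calls_per_week"])
# Shapley values are additive: they sum to value(all) - value(empty)
```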

(a) SHAP values for the model’s dense layer features for a high-risk sample (0.5).
(b) SHAP values for the model’s LSTM layer input for 4 samples.
Figure 3. Visualization of the (a) dense layer and (b) LSTM layer of our weekly risk prediction model. Red values correspond to inputs that push predictions toward output of 1; blue values push toward output of 0.

Figure 3b shows four different samples as input to the LSTM layer of our model. The left shows the binary input sequence as colored pixels where black is a call and yellow is a missed call. On the right are SHAP values corresponding to each day of adherence data, and grey denotes the start of the call sequence. We see that the model learned that calls later in the week carry more weight than calls earlier in the week. In Sample 1, the bottom two pixels (the most recent calls) have blue SHAP values while the other pixels have SHAP value close to 0. In Sample 3, a single missed call at the beginning of the week combined with a call made at the end of the week result in essentially cancelling SHAP values. Sample 4 also has one missed call but on the last day of the week, resulting in a net positive SHAP value.

This visualization technique provides intuitive insights about the rules learned by our model. When deployed, workers will be able to generate these visualizations for any sample on the fly in order to aid their decision-making process.

6. Outcome Prediction

Next, we investigate how adherence data can be used to predict final treatment outcome. Traditional TB treatment studies model outcomes only as they relate to patient covariates such as demographic features. Exploiting the daily real-time adherence data provided by DATs, we investigate how using the first k days of a patient's adherence enables more accurate, personalized outcome predictions. Note that intervention effects are still present in this formulation. However, our intervention screening procedure does not apply, since we predict over a period of several months, during which virtually all patients would have had repeated in-person contact with health workers.

Sample Generation and Features. We formalize the prediction task as follows: given the first k days of adherence data, predict the final binary treatment outcome. We considered "Cured" and "Treatment Complete" to be successful outcomes and "Died", "Lost to follow-up", and "Treatment Failure" to be unsuccessful outcomes. We only include patients who had completed treatment and were assigned an outcome from the above categories. Further, since patients with an outcome of "Died" or "Lost to follow-up" exit the program before the full 6 months of treatment, we removed those who were present for fewer than k days. Our final dataset contained 4167 samples with 433 unsuccessful cases.
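The label construction above reduces to a small mapping; a minimal sketch (function and set names are illustrative):

```python
# Map WHO outcome categories to a binary treatment outcome
# (1 = successful, 0 = unsuccessful, None = excluded from the dataset).
SUCCESSFUL = {"Cured", "Treatment Complete"}
UNSUCCESSFUL = {"Died", "Lost to follow-up", "Treatment Failure"}

def outcome_label(outcome):
    """Return 1/0 for assigned outcomes; None for any other status."""
    if outcome in SUCCESSFUL:
        return 1
    if outcome in UNSUCCESSFUL:
        return 0
    return None

labels = [outcome_label(o) for o in
          ["Cured", "Died", "Treatment Complete", "Still on treatment"]]
```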

Our partners at Everwell observed that health workers tend to track the first month of a patient’s behavior then place them informally into risk categories indicative of their chance of treatment failure. To model this process, we set k=35 for our prediction task, capturing the first month of each patient’s adherence after enrollment in 99DOTS. Both the static features and the sequence inputs were the same as calculated for the weekly prediction task, but now taken over the initial 35 days. We included two versions of the health worker baseline: missed calls in the last week (lw-Misses) and total missed calls in 35 days (t-Misses).
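Static features of the kind discussed in Figure 3a can be computed directly from the call log over the first k days. A sketch, assuming a list of 0-based day indices on which calls were made (the feature names here are illustrative, not our exact feature set):

```python
# Illustrative static features over the first k days of adherence:
# call counts and the mean/variability of time between consecutive calls.
from statistics import mean, pstdev

def adherence_features(call_days, k=35):
    """call_days: sorted day indices of calls (duplicates = same-day calls)."""
    window = [d for d in call_days if d < k]
    gaps = [b - a for a, b in zip(window, window[1:])]
    return {
        "calls_made": len(window),                 # counts same-day repeats
        "days_missed": k - len(set(window)),       # distinct days without a call
        "mean_gap": mean(gaps) if gaps else float(k),
        "gap_std": pstdev(gaps) if gaps else 0.0,
    }

# patient who misses two days, then "back-logs" three calls on one day:
# low mean gap but high gap variability, the pattern flagged in Figure 3a
feats = adherence_features([0, 1, 2, 5, 5, 5, 6, 7], k=8)
```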

Model Evaluation. We used the same models, cross-validation design, and training process as before. For the Random Forest we used 150 trees and no max depth. For the DeepNet, we used 64 hidden units for the LSTM input layer, 48 units for the dense layer input, and 4 units in the penultimate dense layer.

Figure 4 shows ROC curves for each model. Even the very simple baseline of counting calls missed in the last 7 days before the 35-day cutoff is fairly predictive of outcome, suggesting that the daily data made available by DATs is valuable for identifying which patients will fail TB treatment. Our ML models display even greater predictive power, with DeepNet performing best, followed closely by the random forest. These predictions could help officials minimize the costs necessary to reach medical outcome goals for their city. For instance, say The City's goal is to catch 80% of failures (true positives in Figure 4). Over the ~17,000 patients in The City, where 10% have unsuccessful outcomes as in our test set, an 80% catch rate requires saving 1360 patients. Using either baseline, achieving the 80% TPR requires a FPR of 70%, i.e., following 10,710 patients. However, using our method incurs a FPR of only 42%, translating to 6426 patients followed. Recall that in The City, the median health worker cares for about 50 patients. At a yearly starting salary of ₹216,864 (Channel, [n. d.]) (about $3026), our model yields roughly ₹18.6M (about $260,000) in saved costs per year.
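The cost calculation above, carried out step by step (all numbers are taken from the text; the salary figure is in Indian rupees):

```python
# Reproduce the back-of-the-envelope savings estimate from the text.
patients = 17000
failure_rate = 0.10
failures = patients * failure_rate            # 1700 unsuccessful outcomes
negatives = patients - failures               # 15300 successful outcomes

target_tpr = 0.80
caught = target_tpr * failures                # 1360 patients saved

baseline_followed = 0.70 * negatives          # FPR 70% -> 10,710 false positives
model_followed = 0.42 * negatives             # FPR 42% ->  6,426 false positives

patients_per_worker = 50                      # median caseload in The City
salary_inr = 216_864                          # yearly starting salary (INR)
workers_saved = (baseline_followed - model_followed) / patients_per_worker
savings_inr = workers_saved * salary_inr      # ~18.6M INR per year
```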

Figure 4. ROC curves for outcome prediction models.

7. Decision Focused Learning

We now explore a case study of how our DeepNet model can be specialized to provide decision support for a particular intervention. We exploit the end-to-end differentiability of the model to replace our earlier loss function (binary cross-entropy) with a performance metric tailored to the objective and constraints of a specific decision problem. To accomplish this end-to-end training, we leverage recent advances in decision-focused learning, which embeds an optimization model in the loop of machine learning training (Wilder et al., 2018; Donti et al., 2017).

We focus on a specific optimization problem that models the allocation of health workers to intervene with patients who are at risk in the near future. This prospective intervention is enabled by our real-time risk predictions and serves as an example of how our system can enable proactive, targeted action by providers. However, we emphasize that our system can be easily modified to capture other intervention problems. Such flexibility is one benefit of our technical approach, which allows the ML model to automatically adapt to the problem specified by a domain expert.

Our optimization problem models a health worker who plans a series of interventions over the course of a week. The health worker is responsible for a population of patients across different locations, and may visit one location each day. We use location identifiers at the level of the TB Unit since this is the most granular identifier which is shared by the majority of patients in our dataset. Visiting a location allows the health worker to intervene with any of the patients at that location. The optimization problem is to select a set of locations to visit which maximizes the number of patients who receive an intervention on or before the first day they would have missed a dose. We refer to this quantity as the number of successful interventions, which we choose as our objective for two reasons. First, it measures the extent to which the health worker can proactively engage with patients before adherence suffers. Second, this objective only counts patients who start the week at MEDIUM attention and receive an intervention before they could have transitioned to HIGH, dovetailing with our earlier discussion on avoiding unobserved interventions in the data. This extends our earlier intervention proxy to handle day-by-day rewards.

We now show how this optimization problem can be formalized as a linear program. We have a set of locations $\ell \in \{1, \dots, L\}$ and patients $i \in \{1, \dots, n\}$, where patient $i$ has location $\ell(i)$. Over the days $t \in \{1, \dots, T\}$ of the week, the objective coefficient $c_{it}$ is 1 if an intervention on day $t$ with patient $i$ is successful and 0 otherwise. Our decision variable is $x_{\ell t}$, which takes the value 1 if the health worker visits location $\ell$ on day $t$ and 0 otherwise. With this notation, the final LP is as follows:

$$\max \sum_i y_i \quad \text{s.t.} \quad y_i \le \sum_t c_{it}\, x_{\ell(i) t} \;\; \forall i, \qquad y_i \le 1 \;\; \forall i, \qquad \sum_\ell x_{\ell t} \le 1 \;\; \forall t, \qquad x, y \ge 0$$

where the second constraint ($y_i \le 1$) prevents the objective from double-counting multiple visits to a location. We remark that the feasible region of the LP can be shown to be equivalent to a bipartite matching polytope, implying that the optimal solution is always integral.
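Because the optimum is integral, the objective can be sanity-checked on tiny instances by enumerating every plan of one location per day. A brute-force sketch (not the LP solver used in training; the coefficients below are hypothetical):

```python
# Brute-force the weekly planning problem for a tiny instance: try every
# assignment of one location per day and count patients reached on or
# before the first day they would have missed a dose.
from itertools import product

def best_plan(c, n_locations, loc_of, n_days):
    """c[i][t] = 1 if intervening with patient i on day t succeeds;
    loc_of[i] = patient i's location. Returns (best objective, best plan)."""
    n_patients = len(c)
    best = (-1, None)
    for plan in product(range(n_locations), repeat=n_days):
        # plan[t] = location visited on day t; each patient counts at most
        # once, mirroring the LP's y_i <= 1 constraint
        obj = sum(
            min(1, sum(c[i][t] for t in range(n_days) if plan[t] == loc_of[i]))
            for i in range(n_patients)
        )
        if obj > best[0]:
            best = (obj, plan)
    return best

# 3 patients across 2 locations, 2 days (hypothetical coefficients)
c = [[1, 0],   # patient 0 (location 0): only a day-0 visit succeeds
     [0, 1],   # patient 1 (location 0): only a day-1 visit succeeds
     [1, 1]]   # patient 2 (location 1): either day works
loc_of = [0, 0, 1]
obj, plan = best_plan(c, n_locations=2, loc_of=loc_of, n_days=2)
```

Here no plan reaches all three patients, so the optimum is 2 successful interventions.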

The machine learning task is to predict the values of the coefficients $c_{it}$, which are unknown at the start of the week. We compare three models. First, we extend the lw-Misses baseline to this setting by thresholding the number of doses patient $i$ missed in the last week, setting $\hat{c}_{it} = 0$ for all $t$ if this value falls below the threshold and $\hat{c}_{it} = 1$ otherwise. We used the threshold that performed best. Second, we trained our DeepNet system (DN) directly on the true $c_{it}$ as a binary prediction task using cross-entropy loss. Third, we trained DeepNet to predict $c_{it}$ using performance on the above optimization problem as the loss function (training via the differentiable surrogate given by (Wilder et al., 2018)). We refer to this model as DN-Decision.

We created instances of the decision problem by randomly partitioning patients into groups of 100, modeling a health worker under severe resource constraints (as they would benefit most from such a system). We included all patients, including those with no missed doses in the last week, since the overall resource allocation problem over locations must still account for them.

Figure 5 shows results for this task. In the top row, we see that DN and DN-Decision both outperform lw-Misses, as expected. DN-Decision improves the number of successful interventions by approximately 15% compared to DN, demonstrating the value of tailoring the learned model to a given planning problem. DN-Decision actually has worse AUC than either DN or lw-Misses, indicating that typical measures of machine learning accuracy are not a perfect proxy for utility in decision making. To investigate what specifically distinguishes the predictions made by DN-Decision, the bottom row of Figure 5 shows scatter plots of the predicted utility at each location according to DN and DN-Decision versus the true values. Visually, DN-Decision appears better able to distinguish the high-utility outliers which are most important to making good decisions. Quantitatively, DN-Decision's predictions have worse correlation with the ground truth overall (0.463, versus 0.519 for DN), but better correlation on locations where the true utility is strictly greater than 1 (0.504 versus 0.409). Hence, decision-focused training incentivizes the model to focus on making accurate predictions specifically for locations that are likely to be good candidates for an intervention. This demonstrates the benefit of our flexible machine learning modeling approach, which can use custom-defined loss functions to automatically adapt to particular decision problems.

Figure 5. Results for decision focused learning problem. Top row: successful interventions and AUC for each method. Bottom row: visualizations of model predictions.

8. Discussion

We present a framework for learning to make intervention recommendations from data generated by DAT systems applied to TB care. We develop a general approach for learning from medical adherence data that contains unobserved interventions and leverage this approach to build a model for predicting risk in multiple settings. In the real-time adherence setting, we show that our model would allow health workers to more accurately target interventions to high-risk patients sooner, catching 21% more patients and 76% more missed doses than the current heuristic baseline. Next, we train our model for outcome prediction, showing how adherence data can more accurately detect patients at risk of treatment failure. We finally show that tailoring our model to a specific intervention via decision-focused learning can improve performance by a further 15%. The learning approaches we present here are general and could be leveraged to study data generated by DATs as applied to any medication regimen. With the growing popularity of DAT systems for TB, HIV, diabetes, heart disease, and other conditions, we hope to lay the groundwork for improved patient outcomes in healthcare settings around the world.

We thank the Everwell team, especially Brandon Liu, Priyanka Ivatury, Amy Chen, and Bill Thies for sharing the data and thoughtfully advising us each step of the way.


  • 99DOTS ([n. d.]) 99DOTS. [n. d.]. 99DOTS. ([n. d.]). Retrieved Jan 16, 2019 from https://www.99dots.org/
  • Athey and Imbens (2016) Susan Athey and Guido Imbens. 2016. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences 113, 27 (2016), 7353–7360.
  • Channel ([n. d.]) India Study Channel. [n. d.]. MBMC Recruitment 2018 for 04 District PPM Co-ordinator, TBHV Posts. ([n. d.]). Retrieved Feb 3, 2019 from https://www.indiastudychannel.com/jobs/428552-MBMC-Recruitment-2018-for-04-District-PPM-Co-ordinator-TBHV-Posts.aspx
  • Chawla et al. (2002) Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. 2002. SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research 16 (2002), 321–357.
  • Chollet et al. (2015) François Chollet et al. 2015. Keras. https://keras.io. (2015).
  • Cook et al. (2017) Paul F Cook, Sarah J Schmiege, Whitney Starr, Jane M Carrington, and Lucy Bradley-Springer. 2017. Prospective state and trait predictors of daily medication adherence behavior in HIV. Nursing research 66, 4 (2017), 275–285.
  • Corden et al. (2016) Marya E Corden, Ellen M Koucky, Christopher Brenner, Hannah L Palac, Adisa Soren, Mark Begale, Bernice Ruo, Susan M Kaiser, Jenna Duffecy, and David C Mohr. 2016. MedLink: A mobile intervention to improve medication adherence and processes of care for treatment of depression in general medicine. Digital Health 2 (2016), 2055207616663069.
  • Dehejia and Wahba (2002) Rajeev H Dehejia and Sadek Wahba. 2002. Propensity score-matching methods for nonexperimental causal studies. Review of Economics and statistics 84, 1 (2002), 151–161.
  • Demonceau et al. (2013) Jenny Demonceau, Todd Ruppar, Paulus Kristanto, Dyfrig A Hughes, Emily Fargher, Przemyslaw Kardas, Sabina De Geest, Fabienne Dobbels, Pawel Lewek, John Urquhart, et al. 2013. Identification and assessment of adherence-enhancing interventions in studies assessing medication adherence through electronically compiled drug dosing histories: a systematic literature review and meta-analysis. Drugs 73, 6 (2013), 545–562.
  • Donti et al. (2017) Priya Donti, Brandon Amos, and J Zico Kolter. 2017. Task-based end-to-end model learning in stochastic optimization. In Advances in Neural Information Processing Systems. 5484–5494.
  • Everwell ([n. d.]) Everwell. [n. d.]. Everwell. ([n. d.]). Retrieved Jan 29, 2019 from http://www.everwell.org/
  • Garfein et al. (2015) Richard S Garfein, Kelly Collins, Fátima Muñoz, Kathleen Moser, Paris Cerecer-Callu, Fredrick Raab, Phillip Rios, Allison Flick, María Luisa Zúñiga, Jazmine Cuevas-Mota, et al. 2015. Feasibility of tuberculosis treatment monitoring by video directly observed therapy: a binational pilot study. The International Journal of Tuberculosis and Lung Disease 19, 9 (2015), 1057–1064.
  • Haberer et al. (2017) Jessica E Haberer, Nicholas Musinguzi, Alexander C Tsai, BM Bwana, C Muzoora, PW Hunt, JN Martin, DR Bangsberg, et al. 2017. Real-time electronic adherence monitoring plus follow-up improves adherence compared with standard electronic adherence monitoring. AIDS (London, England) 31, 1 (2017), 169–171.
  • Hussain and Junejo (2018) Owais A Hussain and Khurum N Junejo. 2018. Predicting treatment outcome of drug-susceptible tuberculosis patients using machine-learning models. Informatics for Health and Social Care (2018), 1–17.
  • Kalhori and Zeng (2013) Sharareh R Niakan Kalhori and Xiao-Jun Zeng. 2013. Evaluation and comparison of different machine learning methods to predict outcome of tuberculosis treatment course. Journal of Intelligent Learning Systems and Applications 5, 03 (2013), 184.
  • Kardas et al. (2013) Przemyslaw Kardas, Pawel Lewek, and Michal Matyjaszczyk. 2013. Determinants of patient adherence: a review of systematic reviews. Frontiers in pharmacology 4 (2013), 91.
  • Kim et al. (2018) Kyuhyung Kim, Bumhwi Kim, Albert Jin Chung, Keekoo Kwon, Eunchang Choi, and Jae-wook Nah. 2018. Algorithm and System for improving the medication adherence of tuberculosis patients. In 2018 International Conference on Information and Communication Technology Convergence (ICTC). IEEE, 914–916.
  • Kliiman and Altraja (2010) K Kliiman and A Altraja. 2010. Predictors and mortality associated with treatment default in pulmonary tuberculosis. The international journal of tuberculosis and lung disease 14, 4 (2010), 454–463.
  • Lemaître et al. (2017) Guillaume Lemaître, Fernando Nogueira, and Christos K. Aridas. 2017. Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning. Journal of Machine Learning Research 18, 17 (2017), 1–5. http://jmlr.org/papers/v18/16-365.html
  • Liu et al. (2015) Xiaoqiu Liu, James J Lewis, Hui Zhang, Wei Lu, Shun Zhang, Guilan Zheng, Liqiong Bai, Jun Li, Xue Li, Hongguang Chen, et al. 2015. Effectiveness of electronic reminders to improve medication adherence in tuberculosis patients: a cluster-randomised trial. PLoS medicine 12, 9 (2015), e1001876.
  • Lundberg ([n. d.]) Scott Lundberg. [n. d.]. A unified approach to explain the output of any machine learning model. ([n. d.]). Retrieved Jan 18, 2019 from https://github.com/slundberg/shap
  • Mburu et al. (2018) Josephine W Mburu, Leonard Kingwara, Magiri Ester, and Nyerere Andrew. 2018. Use of classification and regression tree (CART), to identify hemoglobin A1C (HbA1C) cut-off thresholds predictive of poor tuberculosis treatment outcomes and associated risk factors. Journal of Clinical Tuberculosis and Other Mycobacterial Diseases 11 (2018), 10–16.
  • Morgan and Winship (2014) Stephen L Morgan and Christopher Winship. 2014. Counterfactuals and causal inference. Cambridge University Press.
  • Organization (2013) World Health Organization. 2013. Definitions and reporting framework for tuberculosis – 2013 revision. World Health Organization, Geneva.
  • Organization (2018) World Health Organization. 2018. Global tuberculosis report 2018. World Health Organization. Licence: CC BY-NC-SA 3.0 IGO.
  • Organization et al. (2017) World Health Organization et al. 2017. Handbook for the use of digital technologies to support tuberculosis medication adherence. (2017).
  • Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
  • Pellowski et al. (2016) Jennifer A Pellowski, Seth C Kalichman, Moira O Kalichman, and Chauncey Cherry. 2016. Alcohol-antiretroviral therapy interactive toxicity beliefs and daily medication adherence and alcohol use among people living with HIV. AIDS care 28, 8 (2016), 963–970.
  • Platt et al. (2010) Alec B Platt, A Russell Localio, Colleen M Brensinger, Dean G Cruess, Jason D Christie, Robert Gross, Catherine S Parker, Maureen Price, Joshua P Metlay, Abigail Cohen, et al. 2010. Can we predict daily adherence to warfarin?: Results from the International Normalized Ratio Adherence and Genetics (IN-RANGE) Study. Chest 137, 4 (2010), 883–889.
  • RNTCP ([n. d.]a) RNTCP. [n. d.]a. Revised National TB Control Programme: Technical and Operational Guidelines for Tuberculosis Control in India 2016. ([n. d.]). Retrieved Feb 2, 2019 from https://tbcindia.gov.in/index1.php?lang=1&level=2&sublinkid=4573&lid=3177
  • RNTCP ([n. d.]b) RNTCP. [n. d.]b. Technical and Operational Guidelines for Tuberculosis Control. ([n. d.]). Retrieved Feb 2, 2019 from http://www.tbonline.info/media/uploads/documents/technical_and_operational_guidelines_for_tuberculosis_control_%282005%29.pdf
  • Roy et al. (2015) Nirmalya Roy, Mausumi Basu, Sibasis Das, Amitava Mandal, Debashis Dutt, and Samir Dasgupta. 2015. Risk factors associated with default among tuberculosis patients in Darjeeling district of West Bengal, India. Journal of family medicine and primary care 4, 3 (2015), 388.
  • Sabin et al. (2015) Lora L Sabin, Mary Bachman DeSilva, Christopher J Gill, Zhong Li, Taryn Vian, Xie Wubin, Cheng Feng, Xu Keyi, Lan Guanghua, Jessica E Haberer, et al. 2015. Improving adherence to antiretroviral therapy with triggered real time text message reminders: the China through technology study (CATS). Journal of acquired immune deficiency syndromes (1999) 69, 5 (2015), 551.
  • Sauer et al. (2018) Christopher Martin Sauer, David Sasson, Kenneth E Paik, Ned McCague, Leo Anthony Celi, Ivan Sanchez Fernandez, and Ben MW Illigens. 2018. Feature selection and prediction of treatment failure in tuberculosis. PloS one 13, 11 (2018), e0207491.
  • Shargie and Lindtjørn (2007) Estifanos Biru Shargie and Bernt Lindtjørn. 2007. Determinants of treatment adherence among smear-positive pulmonary tuberculosis patients in Southern Ethiopia. PLoS medicine 4, 2 (2007), e37.
  • Subbaraman et al. (2018) Ramnath Subbaraman, Laura de Mondesert, Angella Musiimenta, Madhukar Pai, Kenneth H Mayer, Beena E Thomas, and Jessica Haberer. 2018. Digital adherence technologies for the management of tuberculosis therapy: mapping the landscape and research priorities. BMJ global health 3, 5 (2018), e001018.
  • Sutton et al. (2009) Richard S Sutton, Hamid R Maei, and Csaba Szepesvári. 2009. A Convergent Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation. In Advances in neural information processing systems. 1609–1616.
  • Thomas et al. (2005) Aleyamma Thomas, PG Gopi, T Santha, V Chandrasekaran, R Subramani, N Selvakumar, SI Eusuff, K Sadacharam, and PR Narayanan. 2005. Predictors of relapse among pulmonary tuberculosis patients treated in a DOTS programme in South India. The International Journal of Tuberculosis and Lung Disease 9, 5 (2005), 556–561.
  • Wilder et al. (2018) Bryan Wilder, Bistra Dilkina, and Milind Tambe. 2018. Melding the Data-Decisions Pipeline: Decision-Focused Learning for Combinatorial Optimization. In AAAI Conference on Artificial Intelligence.