1 Introduction
Clinical decision support systems aim to relay clinically relevant information while the patient is being treated. The relevant information can vary and can include recommendations of standardized pathways of care (Panella et al., 2003), evidence-based guidelines, and warnings about allergies and other contraindicated medications. The most effective systems are those that can understand the patient’s electronic medical record as it is being populated. By harnessing information entered as part of the clinician’s regular workflow, these systems do not add additional work or cognitive burden and integrate seamlessly into clinical care.
While the field of machine learning has shown tremendous success learning to recognize patterns from large collections of labeled examples, one issue that arises repeatedly when applying machine learning to medical applications is the difficulty and cost of obtaining accurate labels for training. In this work, we focus on the so-called “anchored” setting, where gold-standard labels are difficult to obtain, but noisy versions of these labels can be easily extracted from clinical text using simple rules
(Halpern et al., 2014; Agarwal et al., 2016). These rules (called anchors) are then used as surrogate labels in standard machine learning pipelines, with appropriate adjustments to account for noise (Natarajan et al., 2013; Elkan and Noto, 2008). Anchors need to be specified manually by experts, but specifying them is much easier than labeling large numbers of patients with manual chart abstraction. Previous work with anchors (Halpern et al., 2016)
showed that they can be used to build a large number of individual classifiers to identify a range of clinical conditions, but did not address the joint modeling of these conditions. Without a joint model, it is difficult to calibrate the individual classifiers against each other to properly respond to queries such as: “what are the most likely clinical conditions for this patient?” or “what
else might the patient have?” In this work, we present a method of training joint probabilistic models with anchors, and show an improvement of the joint model over individual classifiers in a clinical condition tagging task, specifically in answering the question “what else might the patient have?”
2 Modeling clinical conditions
We model 23 clinical conditions relevant to the emergency department. These are a subset of the conditions modeled in Halpern et al. (2016), chosen to have anchors which include both ICD-9 billing codes and another form of observation (either free text or medication). The anchors are available at https://github.com/clinicalml/clinicalanchors. To simulate the anchor setting while still having labels for evaluation, we use the ICD-9 codes to represent “ground truth” for the purposes of this study, and other observations such as medications or free text for anchors and features. While ICD-9 codes are generally unreliable for establishing gold-standard clinical conditions (e.g., Cipparone et al., 2015; Tieder et al., 2011; Birman-Deych et al., 2005; Aronsky et al., 2005), we consider them reliable enough to assess relative performance of different methods which are trained using anchors. Table 1 gives the full list of clinical conditions that were modeled.
Clinical Conditions

abdominal pain acute | alcohol acute | allergic reaction acute
asthma/copd acute | back pain acute | cellulitis acute
cva acute | epistaxis acute | fall acute
gi bleed acute | headache acute | hematuria acute
intracranial hemorrhage acute | kidney stone acute | vehicle collision acute
pneumonia acute | severe sepsis acute | sexual assault acute
suicidal ideation acute | syncope acute | uti acute
liver history | hiv history
3 Cohort
The study was performed in a 55,000-visit/year Level 1 trauma center and tertiary academic teaching hospital. All consecutive emergency department (ED) patients between 2008 and 2013 were included. Records were de-identified (Neamatullah et al., 2008) and personal health information was removed before beginning the analysis. Each record represents a single patient visit, leading to a total of 273,174 records of emergency department patient visits. The study was approved by the hospital’s institutional review board.
3.1 Cohort Selection
We focus on patients with at least two of the modeled clinical conditions so that it is possible to identify one condition and ask “what else might the patient have?” After filtering for patients with at least two of the modeled conditions, we were left with 16,268 patients. Of these patients, 11,000 were designated for training and 5,000 for testing. The final 268 patients were not used.
3.2 Data extraction and feature selection
For each visit, we extracted data from the fields listed in Table 2 to build a binary bag-of-words representation for every patient. Full details of the free-text processing pipeline, including negation and bigram detection, can be found in Appendix A. For each condition, we create an anchor token which appears if any of the condition’s anchors appears in the record. Terms that appear in more than 50% of the patient records are removed as stopwords, and the most common 1000 terms are kept. Any anchors that were filtered out in this step are added back in, yielding a final feature vector with 1003 binary indicators.
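As a concrete illustration, the vocabulary construction just described (drop terms appearing in more than half the records, keep the most frequent terms, then restore any anchors that were filtered out) might be sketched as follows. The function names and toy data are our own, not from the original pipeline:

```python
from collections import Counter

def build_vocab(records, anchors, max_terms=1000, stop_frac=0.5):
    """records: list of sets of binary tokens; anchors: set of anchor tokens."""
    n = len(records)
    df = Counter()
    for r in records:
        df.update(r)
    # drop stopwords: terms appearing in more than stop_frac of records
    kept = [(t, c) for t, c in df.items() if c <= stop_frac * n]
    kept.sort(key=lambda tc: -tc[1])  # most common terms first
    vocab = [t for t, _ in kept[:max_terms]]
    # add back any anchors that were filtered out above
    vocab += [a for a in anchors if a not in vocab]
    return vocab

def featurize(record, vocab):
    """Binary bag-of-words vector over the fixed vocabulary."""
    return [1 if t in record else 0 for t in vocab]
```

With the paper's settings (1000 terms plus the restored anchor tokens) this yields the final binary feature vector.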
Field | Representation
Age | Binned to nearest decade
Sex | M / F
Chief Complaint | Free text
Triage Assessment | Free text
MD comments | Free text
Medication history | GSN codes
Dispensed medications | GSN codes
Billing codes | ICD-9 codes
4 Methods
We model conditions and observations as a bipartite Bayesian network. In the following sections, we will describe the structure of the model and methods for learning its parameters. Throughout, we follow the convention that random variables are denoted by uppercase letters (e.g., X) and their values by lowercase letters (e.g., x).

4.1 Anchor assumption
We assume the anchors are corrupted versions of the true labels and that the corruption process obeys a conditional independence constraint: the state of the anchor depends only on the true label. Specifically, conditioned on the true label, it is independent of all other observations. Halpern et al. (2014) additionally required a positive-only assumption (the corruption process does not produce false positives), which we do not require here. Instead, we require that the class-conditional noise rates of the corruption process are known.
4.2 Model structure
We use a graphical model patterned after the historical QMR-DT network for diagnosis (Shwe et al., 1991) (see Figure 1). The Bayesian network consists entirely of binary random variables, which are partitioned into conditions Y and observations X. The model is bipartite, with directed edges from conditions to observations.
The conditions are assumed to be marginally independent, with individual prior probabilities denoted p_i = P(Y_i = 1):

P(Y = y) = \prod_i P(Y_i = y_i) \quad (1)
The conditional probabilities of the observations (given the state of the conditions) are parametrized with a “noisy-or” distribution (Equation 2) (Shwe et al., 1991; Pearl, 1988):

P(X_j = 1 \mid Y = y) = 1 - (1 - l_j) \prod_{i : y_i = 1} f_{ij} \quad (2)

where the parameters f_{ij} are referred to as failure probabilities and l_j is the leak probability.
The network can be viewed as a generative model. For each new patient, each condition Y_i independently turns on with probability p_i. Each condition that is on tries to turn on each of its children X_j, but fails with probability f_{ij}. An additional “noise” parent is always on and fails to turn on each of its children X_j with probability 1 - l_j.
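The generative process just described can be sketched in a few lines of Python. This is a toy illustration with our own function names, not the authors' code; `priors`, `fail`, and `leak` correspond to the prior, failure, and leak probabilities:

```python
import random

def sample_patient(priors, fail, leak, rng=random):
    """Sample (conditions y, observations x) from the bipartite noisy-or model.
    priors[i]: p_i; fail[i][j]: failure probability f_ij; leak[j]: leak probability l_j."""
    # each condition independently turns on with its prior probability
    y = [1 if rng.random() < p else 0 for p in priors]
    x = []
    for j, l in enumerate(leak):
        # probability the observation stays off: the always-on noise parent
        # fails (probability 1 - l_j) and every active condition parent fails
        p_off = 1.0 - l
        for i, yi in enumerate(y):
            if yi:
                p_off *= fail[i][j]
        x.append(1 if rng.random() >= p_off else 0)
    return y, x
```

Repeatedly calling `sample_patient` simulates a cohort from the model, which is useful for sanity-checking parameter estimators on synthetic data.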
Rather than naively treating the anchors (which are unreliable labels) as telling us which conditions are present, we treat the conditions as latent variables and treat the anchors as observations. The anchor assumption places a structural constraint on the graphical model: each anchor A_i may have only one parent, its associated condition Y_i.
Treating the conditions as latent variables makes learning the parameters of the model computationally difficult. When all the variables are observed, maximum likelihood estimation of the parameters is a concave optimization problem that can be solved efficiently. However, when the conditions are unobserved, the problem is no longer concave and optimization procedures get stuck in local optima, even when the anchor assumption holds true. In the following two sections, we describe a likelihood-based objective and an effective initialization that allow us to learn models with good predictive capabilities.
4.3 Semi-supervised objective
Exact inference in the QMR-DT model is known to be NP-hard, even if there is an anchor for every condition. The typical approach to maximizing likelihood in the presence of latent variables, Expectation Maximization (EM) (Dempster et al., 1977), is computationally challenging because at each step one has to use an approximate inference algorithm such as Markov chain Monte Carlo to approximate the necessary expectations.
Instead, we follow Mnih and Gregor (2014) in formulating a variational lower bound on the likelihood function using a recognition model. We start with the standard Evidence Lower BOund (ELBO), which holds for any distribution Q(y | x):

\log P(x) \ge E_{Q(y \mid x)} [\log P(x, y) - \log Q(y \mid x)] \quad (3)
In mean-field variational inference, Q is chosen to be a fully factorized distribution. In this work, we restrict Q to be the output of a parametrized model with parameters \phi. The parametrized distribution is referred to as a recognition model, and its function is to perform approximate inference in the network. The bound is tightest as Q(y | x; \phi) approaches the true posterior P(y | x), that is, as it approximates inference in the generative model. As learning proceeds, the recognition model learns to compile inference. In our work we use a simple recognition model that performs logistic regression to approximate the posterior of each condition independently:

Q(Y_i = 1 \mid x; \phi_i) = \sigma(\phi_i^T x) \quad (4)
where \sigma is the sigmoid function, \sigma(z) = 1/(1 + e^{-z}), and x is padded with a constant 1 to allow for a bias term. In contrast with mean-field inference, where Q is optimized for every data point separately, here we train a single model, allowing us to amortize the cost of inference over many data points. Methods to take gradients with respect to \theta and \phi and optimize with stochastic gradient ascent are described in detail in Mnih and Gregor (2014). Our hyperparameter settings are described in Appendix B. Using this algorithm enabled us to learn orders of magnitude faster than EM, with comparable results in terms of the likelihood objective obtained.

However, we found that without adding an additional term to the objective, the anchor constraints (i.e., each anchor only has a single, specified parent) were not sufficient to ensure that the latent variables took on their intended meanings. Specifically, we observed that as the held-out likelihood of the observations improved, the predictive quality of the models got worse (measured using the held-out tag prediction task of Section 5.1). Upon inspecting the model, we found that drift occurred: the latent variables took on new meanings and lost their original grounding.
Inspired by recent work on semi-supervised learning with deep generative models (Kingma et al., 2014), our solution is to add a supervised term to the objective that encourages the recognition model to additionally predict the presence or absence of the anchors, ensuring that the meaning of each latent variable stays tightly tied to its anchor. The prediction cannot use the anchors themselves, so we form a new censored vector, \bar{x}, which is a copy of x but has the values of the anchors set to a constant 0. We also introduce an additional bias term b_i to allow the prediction of the anchors and the labels to differ from each other. The supervised term has the form:

S(x; \phi, b) = \sum_i L(\sigma(\phi_i^T \bar{x} + b_i), a_i) \quad (5)

where L is log loss and a is the vector of anchors. The final objective is thus:
ELBO(x; \theta, \phi) - \lambda S(x; \phi, b) \quad (6)

where \lambda is a hyperparameter specifying the tradeoff between the two terms in the objective. Although we wrote Eq. 6 for a single data point, the actual objective we optimize is the sum of this quantity over all data points. A more detailed version of the objective is found in Appendix C.
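For intuition, the variational lower bound of Equation 3 can be estimated by Monte Carlo: sample y from the factorized recognition distribution and average \log P(x, y) - \log Q(y|x). The sketch below is our own toy code (not the authors' Torch implementation), with the joint likelihood passed in as a callable:

```python
import math
import random

def elbo_estimate(x, log_joint, q_probs, n_samples=1000, rng=random):
    """Monte Carlo estimate of the ELBO: E_Q[log P(x, y) - log Q(y | x)],
    with Q fully factorized over binary conditions (q_probs[i] = Q(y_i = 1 | x))."""
    total = 0.0
    for _ in range(n_samples):
        y, log_q = [], 0.0
        for q in q_probs:
            yi = 1 if rng.random() < q else 0
            y.append(yi)
            log_q += math.log(q if yi else 1.0 - q)
        total += log_joint(x, y) - log_q
    return total / n_samples
```

In the actual training procedure, gradients of this expectation with respect to the recognition parameters are estimated with the variance-reduction techniques of Mnih and Gregor (2014) rather than computed from this naive estimator.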
At test time, we discard the recognition model Q, which was used to train the parameters of the generative bipartite Bayesian network but cannot support queries that condition on some of the clinical variables, and use the joint probabilistic model P for inference. Inference in the joint model does not have an efficient closed-form solution, but can be approximated with Gibbs sampling.
4.4 Model initialization
In order to initialize the model, we use the anchors to get a rough estimate of the failure probabilities for each of the observations. If we observed the latent clinical conditions (the Y variables), we could use a simple moments-based estimator using empirical counts to estimate the failure probabilities (Equation 7):

\hat{f}_{ij} = \hat{P}(X_j = 0 \mid Y_i = 1) / \hat{P}(X_j = 0 \mid Y_i = 0) \quad (7)
The estimator is then clipped to lie in [0, 1]. The consistency of the method is not affected by this clipping: with sufficient data drawn from the model, the estimator would naturally lie in that range and clipping would not be necessary. Once all the failure probabilities are estimated, the leak probabilities can be estimated to account for the difference between the true observed counts and those predicted by the model (Appendix D).
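Because the conditions are marginally independent, the contributions of the other parents cancel in this ratio of "off" probabilities, which is what makes the moments-based estimator work. A toy sketch of the estimator with clipping (our own code, assuming fully observed conditions for illustration):

```python
def failure_from_counts(records, j, i):
    """Moments-based failure-probability estimate in the style of Eq. 7:
    f_ij ~= P(X_j = 0 | Y_i = 1) / P(X_j = 0 | Y_i = 0), clipped to [0, 1].
    records: list of (x, y) tuples of binary vectors."""
    off_on = on = off_off = noton = 0
    for x, y in records:
        if y[i]:
            on += 1
            off_on += 1 - x[j]
        else:
            noton += 1
            off_off += 1 - x[j]
    p_off_given_on = off_on / on
    p_off_given_off = off_off / noton
    f = p_off_given_on / p_off_given_off
    return min(1.0, max(0.0, f))  # clip to a valid probability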
Since the clinical conditions are generally unobserved, we instead estimate these conditional probabilities using empirical counts computed as though the anchors (which are noisy versions of the labels) were the true labels Y, and then perform a denoising step to recover the conditional probabilities as though the true labels had been observed. Specifically, in Section 4.1 we assumed that the label corruption process is independent of all other observed variables. This leads to the following equation:
P(X_j = x_j, A_i = a_i) = \sum_{y_i} P(A_i = a_i \mid Y_i = y_i) P(X_j = x_j, Y_i = y_i) \quad (8)
The left-hand side of this equation involves only observed variables and can be estimated from empirical counts. The right-hand side uses the noise rates of the corruption process, P(A_i | Y_i), and the joint probabilities that we care about, P(X_j, Y_i). If we assume that the noise rates of the corruption process are known, then we can form four independent linear equations with four unknowns and solve the following matrix equation:
\hat{p} = N q \quad (9)

where \hat{p} is a column vector with four entries \hat{P}(X_j = x_j, A_i = a_i), one for each setting of (x_j, a_i) in \{0, 1\}^2, q is the corresponding vector of the four unknowns P(X_j = x_j, Y_i = y_i), and N is a matrix encoding the noise rates of the corruption process. Explicit constructions of these terms are given in Appendix E.
We could simply invert the noise matrix to solve for q; however, the solution would not be guaranteed to be a valid probability distribution (i.e., non-negative and summing to one). Instead, we explicitly solve an optimization problem with simplex constraints, minimizing a KL-divergence measure between a proposed distribution Q and the denoised version of the empirical counts, N^{-1} \hat{p}:
Q^* = \arg\min_{Q \in \Delta} \mathrm{KL}(Q \,\|\, N^{-1} \hat{p}) \quad (10)
The optimization is convex and we solve it with exponentiated gradient descent (Kivinen and Warmuth, 1995). The cleaned distribution Q^* obtained from solving Equation 10 is then substituted into the failure probability estimator in Equation 7 to obtain estimates of the failure and leak probabilities (regarding the leak probabilities, see Appendix D). This whole procedure can be shown to be a consistent estimator: if the model assumptions hold (i.e., the conditional independence assumptions), it converges to the true probabilities as the amount of data goes to infinity.
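Exponentiated gradient stays on the probability simplex by making multiplicative updates and renormalizing. A minimal sketch of the update rule (our own function names; shown here minimizing a KL objective toward a fixed target, which is the shape of problem that arises above):

```python
import math

def exponentiated_gradient(grad, q0, steps=500, lr=0.1):
    """Minimize a convex objective over the probability simplex.
    Update: q_k <- q_k * exp(-lr * g_k), then renormalize to sum to 1."""
    q = list(q0)
    for _ in range(steps):
        g = grad(q)
        q = [qk * math.exp(-lr * gk) for qk, gk in zip(q, g)]
        z = sum(q)
        q = [qk / z for qk in q]
    return q

def kl_grad_towards(p):
    # gradient of KL(q || p) with respect to q, up to a constant on the simplex
    return lambda q: [math.log(qk / pk) for qk, pk in zip(q, p)]
```

Because every update is multiplicative and followed by normalization, the iterate is a valid distribution at every step, which is exactly the guarantee that plain matrix inversion lacks.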
4.5 Complete algorithm
The full parameter estimation algorithm is summarized in Algorithm 1.
4.6 Model selection
We do not assume that we have any ground-truth labels for model selection, so we use a stopping criterion based on held-out anchor prediction. We hold out 1,000 patients from the training set as a validation set to determine the stopping point. For each patient in the validation set, we censor one positive anchor that appears in the record and all of the negative anchors, and perform inference with the joint model to determine which anchor is missing. Inference for the final anchor is performed with Gibbs sampling to obtain marginals for the tags, and the likelihood of each anchor is then computed as a function of the marginal likelihood of its parent tag (details in Appendix F). This model selection criterion mimics the held-out tag prediction task described in Section 5.1 and used in the evaluation, but uses anchors instead of the true labels. Other stopping criteria using anchors could be developed depending on the intended use.
5 Results
5.1 Held-out tag prediction task
We test the ability of our model to perform inference by presenting it with a held-out tag prediction task. Each clinical condition is assigned a tag. The model is presented with a patient record and all but one of the tags that apply to that record. The task is to predict the final tag that applies to the patient. We record the accuracy of each model (i.e., the proportion of times it chooses the correct tag to fill in), top-5 performance (i.e., the proportion of times the correct tag appears in the top 5 predictions), and the mean reciprocal rank (MRR) of the correct prediction. Since this task is choosing the best last tag, we do not need to perform approximate inference. Instead, we evaluate the likelihood of each possible final tag and normalize, which corresponds to exact inference (Appendix G).
In addition to highlighting the joint modeling aspect of the new model, namely that it is able to respond to arbitrary inference queries with conditioning, this task is also clinically relevant: it can be used to combat a recognized cognitive bias known as “search satisfaction error” (Groopman and Prichard, 2007). This cognitive bias, a failing of the Occam’s razor heuristic, is the tendency to overlook additional conditions once a single unifying diagnosis is found. This is particularly dangerous when a more serious diagnosis is overlooked because a unifying diagnosis was discovered first. For example, a patient with signs of a severe infection may be diagnosed with urinary tract infection and treated with antibiotics, while missing a second diagnosis of pneumonia. The delay in treatment of the pneumonia could be potentially dangerous to the patient. Another example is a patient with a kidney stone, whose coexisting urinary tract infection was missed. Although kidney stones generally resolve on their own, patients who also have a concurrent urinary tract infection require immediate intervention.
Clinical decision support systems can mitigate this cognitive bias by suggesting additional diagnoses that may explain a set of symptoms. We simulate this problem by randomly removing a tag and trying to recover it using the model. This could correspond to a situation where the physician has “confirmed” one diagnosis and the model attempts to suggest a likely second diagnosis that would go along with the first. Since the model performs inference, this could potentially be very different from the next most likely tag had the model received no confirmation of the first tag.
This tagging system is currently implemented within the electronic medical record system of the emergency department at Beth Israel Deaconess Medical Center in Boston, MA. Physicians are presented with possible tags for each patient (see Figure 2). Confirming or rejecting a tag updates the recommendations through inference.
5.2 Baselines
We compare against three different baselines. Since our proposed method assumes knowledge of the correct failure and leak parameters for the anchors, we provide that information to all of the baselines for fairness.

Naive labels – treats the anchors as true labels and learns a noisy-or model with maximum likelihood estimation. In the final model, we “edit” the failure and leak parameters of the anchors to set them to the correct values.

Noise-tolerant classifiers – we use the method presented in Natarajan et al. (2013) to learn noise-tolerant classifiers (explicitly providing the algorithm with the noise rates). In experiments we found this method more effective than that of Elkan and Noto (2008), so we present this baseline to compare to the individual classifiers learned in Halpern et al. (2014). The method of Natarajan et al. (2013) does not explicitly describe how to predict when anchors are observed in the record. In this case, we simply predict the noise rate of the anchor, which we found to be more effective than ignoring the special status of the anchor.

Oracle MLE – an upper bound on performance which uses the true labels to learn a noisy-or model with maximum likelihood estimation. This is not possible in practice, but gives us a sense of how close to optimal we are performing using our noisy labels.
5.3 Results on held-out tag prediction
Table 3 presents our method compared with the baselines presented in Section 5.2. ‘Noisy-or init’ refers to the moments-based estimator described in Section 4.4, whereas ‘Noisy-or final’ refers to the results after the semi-supervised learning algorithm described in Section 4.3.
Model | Accuracy | Top 5 | MRR
Noise-tolerant classifiers | 0.54 | 0.86 | 0.67
Naive labels | 0.61 | 0.85 | 0.71
Noisy-or init | 0.64 | 0.91 | 0.76
Noisy-or final | 0.68 | 0.92 | 0.79
Noisy-or oracle MLE | 0.71 | 0.93 | 0.81
The noisy-or model significantly outperforms the noise-tolerant classifiers and the naive labeling baselines. Our performance comes close to the optimal maximum likelihood performance, suggesting that even though we don’t use the true labels in training, we are still able to recover a model similar to the one we would learn with access to the true labels. The method-of-moments initialization is helpful: with random initialization, we do not beat the naive labels baseline. Appendix H shows that the likelihood and tagging objectives are aligned after introducing the semi-supervised objective. Table 4 shows the highly weighted words learned by our model. All of the noisy-or models learn similar sets of highly weighted words; the main differences between the models are in the exact settings of the parameters, which affect the inference procedure used to choose the correct last tag.
Tag | Top weighted terms
abdominal pain | pain, Ondansetron, nausea, days
alcohol | male, sober, ed, admits, found, denies
asthma | albuterol sulfate, sob, Methylprednisolone, cough
fall | s/p fall, fall, fell, neg:loc, neg:head
hematuria | infection, male, urology, urine, flank pain
HIV+ | male, Truvada, cd4, age:40-50, Ritonavir
collision | car, neg:loc, age:20-30, hit, neck, driver
6 Discussion
There are a number of limitations to this study. First, we used the ED ICD-9-CM discharge diagnoses, which may have misclassified patients. Patients may have been suspected of having one diagnosis in the ED and ultimately may have had an alternative diagnosis. As such, we can only assess relative performance of the various models, but cannot draw conclusions about absolute accuracy.
This study occurred at a single institution with a custom built information system. These results might not generalize to other systems that may not be modified to support complete electronic capture of clinical data and customized decision support. While we internally validated the results, external validation is warranted. It will be interesting to discover whether the same algorithm may be applied to another institution, or whether reliable machine learning requires first training on local clinical data.
Our heldout tag prediction task is a synthetic task intended to simulate real clinical scenarios where some, but not all conditions are known about a patient. The evaluation is performed using retrospective data. Further study would be needed to confirm these results in real clinical practice.
We depend on estimates of the parameters of the corruption process to perform the method-of-moments initialization in Section 4.4. This is a potential weakness of the algorithm, though we are careful to use baselines that can make use of the same information. In this work we do not address how those parameters could be obtained and rely on oracle estimates. However, we expect that even by labeling a small number of examples we could obtain estimates of these parameters good enough to serve for the initialization.
Diseases are not truly independent of one another, despite our modeling them as such in this paper. More elaborate method-of-moments estimation techniques can be used together with anchors to learn the joint distribution of the conditions, even when they are never observed in the data (Halpern et al., 2015).

Acknowledgments
This work is partially supported by a Google Faculty Research Award and NSF CAREER award #1350965. Y.H. was supported by a postgraduate scholarship from the Natural Sciences and Engineering Research Council of Canada (NSERC).
References
 Agarwal et al. (2016) Vibhu Agarwal, Tanya Podchiyska, Juan M Banda, Veena Goel, Tiffany I Leung, Evan P Minty, Timothy E Sweeney, Elsie Gyang, and Nigam H Shah. Learning statistical models of phenotypes using noisy labeled training data. Journal of the American Medical Informatics Association, 2016.
 Aronsky et al. (2005) Dominik Aronsky, Peter J Haug, Charles Lagor, and Nathan C Dean. Accuracy of administrative data for identifying patients with pneumonia. American Journal of Medical Quality, 20(6):319–328, 2005.
 Birman-Deych et al. (2005) Elena Birman-Deych, Amy D Waterman, Yan Yan, David S Nilasena, Martha J Radford, and Brian F Gage. Accuracy of ICD-9-CM codes for identifying cardiovascular and stroke risk factors. Medical Care, pages 480–485, 2005.
 Chapman et al. (2001) Wendy W. Chapman, Will Bridewell, Paul Hanbury, Gregory F. Cooper, and Bruce G. Buchanan. A simple algorithm for identifying negated findings and diseases in discharge summaries. Journal of Biomedical Informatics, 34(5):301 – 310, 2001.
 Cipparone et al. (2015) Charlotte W Cipparone, Matthew Withiam-Leitch, Kim S Kimminau, Chet H Fox, Ranjit Singh, and Linda Kahn. Inaccuracy of ICD-9 codes for chronic kidney disease: A study from two practice-based research networks (PBRNs). The Journal of the American Board of Family Medicine, 28(5):678–682, 2015.
 Dempster et al. (1977) A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B, pages 1–38, 1977.
 Elkan and Noto (2008) Charles Elkan and Keith Noto. Learning classifiers from only positive and unlabeled data. In KDD, 2008.
 Groopman and Prichard (2007) Jerome E Groopman and Michael Prichard. How doctors think. Springer, 2007.
 Halpern et al. (2014) Yoni Halpern, Youngduck Choi, Steven Horng, and David Sontag. Using anchors to estimate clinical state without labeled data. In Proceedings of the American Medical Informatics Association Annual Symposium, pages 606–615, 2014.
 Halpern et al. (2015) Yoni Halpern, Steven Horng, and David Sontag. Anchored discrete factor analysis. In arXiv:1511.03299, 2015.
 Halpern et al. (2016) Yoni Halpern, Steven Horng, Youngduck Choi, and David Sontag. Electronic medical record phenotyping using the anchor and learn framework. Journal of the American Medical Informatics Association, 2016.
 Heckerman (1990) David E Heckerman. A tractable inference algorithm for diagnosing multiple diseases. Knowledge Systems Laboratory, Stanford University, 1990.
 Jernite et al. (2013) Yacine Jernite, Yoni Halpern, Steven Horng, and David Sontag. Predicting chief complaints at triage time in the emergency department. NIPS Workshop on Machine Learning for Clinical Data Analysis and Healthcare, 2013.
 Kingma et al. (2014) Diederik P. Kingma, Shakir Mohamed, Danilo J. Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3581–3589. 2014.
 Kivinen and Warmuth (1995) Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Inform. and Comput., 132, 1995.
 Mnih and Gregor (2014) Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1791–1799, 2014.
 Natarajan et al. (2013) Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Advances in neural information processing systems, pages 1196–1204, 2013.
 Neamatullah et al. (2008) Ishna Neamatullah, Margaret M Douglass, H Lehman Liwei, Andrew Reisner, Mauricio Villarroel, William J Long, Peter Szolovits, George B Moody, Roger G Mark, and Gari D Clifford. Automated de-identification of free-text medical records. BMC Medical Informatics and Decision Making, 8(1):1, 2008.
 Panella et al. (2003) M Panella, S Marchisio, and F Di Stanislao. Reducing clinical variations with clinical pathways: do pathways work? International Journal for Quality in Health Care, 15(6):509–521, 2003.
 Pearl (1988) Judea Pearl. Probabilistic reasoning in intelligent systems: Networks of plausible inference. 1988.
 Shwe et al. (1991) Michael A Shwe, Blackford Middleton, David Heckerman, Max Henrion, Eric Horvitz, Harold Lehmann, and Gregory Cooper. Probabilistic diagnosis using a reformulation of the INTERNIST-1/QMR knowledge base. Meth. Inform. Med, 30:241–255, 1991.
 Tieder et al. (2011) Joel S Tieder, Matthew Hall, Katherine A Auger, Paul D Hain, Karen E Jerardi, Angela L Myers, Suraiya S Rahman, Derek J Williams, and Samir S Shah. Accuracy of administrative billing codes to detect urinary tract infection hospitalizations. Pediatrics, 128(2):323–330, 2011.
Appendix A Text processing
We apply negation detection to the free-text sections using “negex” rules (Chapman et al., 2001) with some manual adaptations appropriate for emergency department notes (Jernite et al., 2013), and replace common bigrams with a single token (e.g., “chest pain” is replaced by “chest_pain”). Negated terms are then added as additional vocabulary words.
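A minimal sketch of this scope-based negation marking, using a subset of the trigger and stop tokens listed below (our own simplification of the negex-style rules, not the actual pipeline):

```python
NEG_TRIGGERS = {"no", "not", "denies", "without", "non", "unable"}
NEG_STOPS = {".", ";", "but", "and", "except"}  # illustrative subset of the stop tokens

def mark_negations(tokens):
    """Prefix tokens inside a negation scope with 'neg:'.
    A trigger token opens the scope; a stop token closes it."""
    out, negated = [], False
    for tok in tokens:
        if tok in NEG_TRIGGERS:
            negated = True
            out.append(tok)
        elif tok in NEG_STOPS:
            negated = False
            out.append(tok)
        else:
            out.append("neg:" + tok if negated else tok)
    return out
```

Tokens such as `neg:loc` and `neg:head` in Table 4 are the vocabulary words produced by exactly this kind of marking.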
The following words are indicators of the beginning of a negation: no, not, denies, without, non, unable.
The token  is treated as a special negation word whose scope only includes the word that follows it.
The following tokens stop a negation: . ; [  newline + but and pt except reports alert complains has states secondary per did aox3.
Appendix B Parameter settings for learning with Neural Variational Inference and Learning (Mnih and Gregor, 2014)
We use the signal centering and normalization described in the paper, as well as input-dependent baselines. The input-dependent baselines use a two-layer neural network with 100 hidden units and tanh activations. The learning rate is set to 0.0001, and 10 samples are used to estimate each gradient. When \theta is initialized with the method of moments, we have 50 epochs of “burn-in” where \theta is held fixed and only \phi is optimized. The learning rate of \theta is set to 1/5th of that of \phi. The b parameters and the failure probabilities of the anchors are initialized using the estimated noise rates and never optimized. The code is implemented in Torch, and RMSprop is used for optimization. Failure probabilities in \theta are mapped through a sigmoid function to allow for continuous optimization over an unbounded space. We experimented with different values of L2 regularization (weight decay) in the recognition model, in the range {0, 0.1, 0.01, 0.001}, and chose the value that gave the best held-out anchor performance. Parameters of \phi are initialized uniformly at random in [-0.1, 0.1].

Appendix C Semi-supervised objective details
In this appendix, we expand the objective which was written in compressed notation in Section 4.3.
The parameters to be optimized are:

\theta: parameters of the generative model, consisting of failure probabilities (f), prior probabilities (p), and leak probabilities (l).

\phi: parameters of the independent logistic regression models, one for each tag.

b: additional bias terms introduced to account for the difference between the predictions of the recognition model, which predicts the tags, and the desired semi-supervised objective, which predicts the anchors.

For a single data point, x is the binary feature vector. In practice, we center the inputs to the recognition model (the generative model cannot support centering since it is defined over binary variables) and pad with a single 1 to allow for a bias term. \tilde{x} is a centered and padded copy of x. \bar{x} is a copy of \tilde{x}, with the values of the anchors set to 0. a is a vector containing the binary values of the anchors from x. The likelihood of the generative model is then:
P(x, y; \theta) = \prod_i p_i^{y_i} (1 - p_i)^{1 - y_i} \prod_j P(x_j \mid y), \qquad P(X_j = 0 \mid y) = (1 - l_j) \prod_{i : y_i = 1} f_{ij} \quad (11)
The recognition model consists of independent logistic regression models:

Q(y \mid \tilde{x}; \phi) = \prod_i \mu_i^{y_i} (1 - \mu_i)^{1 - y_i}, \qquad \mu_i = \sigma(\phi_i^T \tilde{x}) \quad (12)

where \sigma is the sigmoid function, \sigma(z) = 1/(1 + e^{-z}).
Let \lambda be a hyperparameter specifying the tradeoff between the lower bound on the likelihood and the semi-supervised objective term. The final objective, for a collection of patient records indexed by k, is to maximize:

\sum_k \Big( E_{Q(y \mid \tilde{x}^{(k)}; \phi)}\big[\log P(x^{(k)}, y; \theta) - \log Q(y \mid \tilde{x}^{(k)}; \phi)\big] - \lambda \sum_i L\big(\sigma(\phi_i^T \bar{x}^{(k)} + b_i), a_i^{(k)}\big) \Big) \quad (13)
Appendix D Leak probabilities
Leak probabilities are calculated to account for the difference between the actual observed counts and those predicted by the model. If all of the failure probabilities are known, then the marginal probability of an observation predicted by the model can be calculated with the Quickscore equation (Heckerman, 1990).
P(X_j = 0) = (1 - l_j) \prod_i \big(1 - p_i (1 - f_{ij})\big) \quad (14)
Thus, we can solve:
1 - l_j = \hat{P}(X_j = 0) \Big/ \prod_i \big(1 - p_i (1 - f_{ij})\big) \quad (15)
This assumes that we have estimates of the priors P(Y_i), but that is no different from assuming that we have estimates of the noise rates P(A_i | Y_i), since P(A_i) = \sum_{y_i} P(A_i \mid y_i) P(y_i), where P(A_i) can be estimated from counts.
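Given the Quickscore form of the marginal, solving for the leak probability is a one-liner. A toy sketch under the assumption that the priors and failure probabilities have already been estimated (the function name is our own):

```python
def leak_from_marginal(p_xj_off, priors, fails_j):
    """Solve for the leak probability l_j given the observed marginal
    P(X_j = 0), the condition priors p_i, and the failure probabilities f_ij."""
    denom = 1.0
    for p_i, f_ij in zip(priors, fails_j):
        denom *= 1.0 - p_i * (1.0 - f_ij)  # Quickscore factor per parent
    one_minus_leak = p_xj_off / denom
    # clip so the recovered leak is a valid probability
    return 1.0 - min(1.0, max(0.0, one_minus_leak))
```

As a sanity check, if data were generated with leak 0.2, a single parent with prior 0.5 and failure probability 0.5, the marginal P(X_j = 0) is 0.8 × 0.75 = 0.6, and the function recovers 0.2.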
Appendix E Explicit matrix version of Equation 9
\hat{p} = \big[\hat{P}(X_j = 0, A_i = 0),\ \hat{P}(X_j = 0, A_i = 1),\ \hat{P}(X_j = 1, A_i = 0),\ \hat{P}(X_j = 1, A_i = 1)\big]^T, \qquad q = \big[P(X_j = 0, Y_i = 0),\ P(X_j = 0, Y_i = 1),\ P(X_j = 1, Y_i = 0),\ P(X_j = 1, Y_i = 1)\big]^T \quad (16)

N = \begin{bmatrix} P(A_i = 0 \mid Y_i = 0) & P(A_i = 0 \mid Y_i = 1) & 0 & 0 \\ P(A_i = 1 \mid Y_i = 0) & P(A_i = 1 \mid Y_i = 1) & 0 & 0 \\ 0 & 0 & P(A_i = 0 \mid Y_i = 0) & P(A_i = 0 \mid Y_i = 1) \\ 0 & 0 & P(A_i = 1 \mid Y_i = 0) & P(A_i = 1 \mid Y_i = 1) \end{bmatrix} \quad (17)
Appendix F Held out anchor inference
To perform the held-out anchor inference (described in Section 4.6), we use the fact that, conditioned on some feature vector \bar{x} (in this case, some of the anchors are held out, so it is not the full vector x), we have:

P(A_i = a_i \mid \bar{x}) = \sum_{y_i} P(A_i = a_i \mid Y_i = y_i) P(Y_i = y_i \mid \bar{x})

We assume that the corruption rate, P(A_i | Y_i), is known, and estimate P(Y_i | \bar{x}) with Gibbs sampling.
Appendix G Held out tag inference
Exact inference is possible in the held-out tag prediction task. For a single data point, we have a feature vector x. Assume without loss of generality that the first k tags are known to be positive, and the final m tags are unknown. Let y^u refer to the unknown tags.
Of the unknown tags, we know by the design of the task that exactly one is on and the rest are off. We can therefore condition on the sum of the y^u variables being 1, so that y^u has to be an indicator vector (i.e., one index on and the rest off).
We would like to calculate the likelihood that y^u_i = 1 (i.e., the i-th unknown tag is on). Let y^u_{-i} be all of the unknown tags other than y^u_i. The likelihood is calculated as follows:

P\big(y^u_i = 1 \mid x, \textstyle\sum y^u = 1\big) = P\big(y^u_i = 1, y^u_{-i} = 0 \mid x, \textstyle\sum y^u = 1\big) = \frac{P\big(y^u_i = 1, y^u_{-i} = 0, x\big)}{P\big(\sum y^u = 1, x\big)} = \frac{P\big(x, y^u_i = 1, y^u_{-i} = 0\big)}{\sum_{i'} P\big(x, y^u_{i'} = 1, y^u_{-i'} = 0\big)}

The first equality uses the fact that y^u is an indicator vector. The second equality follows from the definition of conditioning. The third line marginalizes over the unknown values of y^u in the denominator. Note that the final line uses only complete likelihoods, so each term can be calculated efficiently using Equation 11. The sum in the denominator is over the set of indicator vectors of size m, so it can be computed efficiently as well.
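The normalization over indicator vectors described above can be implemented directly. A toy sketch, where `complete_likelihood` stands in for the complete-data likelihood of Equation 11 and the function name is our own:

```python
def last_tag_posterior(complete_likelihood, x, known_on, unknown):
    """Exact inference for the held-out-tag task: exactly one tag in
    `unknown` is on. Returns the posterior over which one it is."""
    scores = []
    for i in unknown:
        # build the full assignment: known positives on, one unknown tag on
        y = {t: 1 for t in known_on}
        y.update({t: 0 for t in unknown})
        y[i] = 1
        scores.append(complete_likelihood(x, y))
    z = sum(scores)  # normalizing over all indicator vectors
    return [s / z for s in scores]
```

Because there are only as many indicator vectors as unknown tags, the loop is linear in the number of conditions and no sampling is required.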
Appendix H Optimization course
In Section 4.3 we discussed how the likelihood objective is not aligned with the held-out tag prediction task. Figure 3 shows that the semi-supervised objective remedies this situation, displaying how both the likelihood objective and held-out tag prediction improve over the initialization baseline as optimization of the semi-supervised objective proceeds.