Learning across multiple environments has been studied by [pearl2011transportability] for settings in which prior knowledge about the data generating process can be represented via a causal diagram. The causal diagram helps to identify which parts of the data generating process are stable across the environments. We represent the data generating process of our problem via the selection diagram shown in Figure 1b. Knowing where the instabilities lie helps to formulate the relation to be transported from the source to the target environment. It has already been established that the relations to be transported have to be learned from both the source and the target environments. We also show why this is the case in our setting, where we try to predict the presence of the influenza virus from the symptoms reported by an individual. Along with the system (endogenous) variables, namely virus (Y), symptoms (X) and demographic attributes such as age and gender, we also have the context/selection (exogenous) variables (S), which denote the differences in the data generating process across the domains as well as the selection bias [pearl2011transportability].
Thus, the selection diagram indicates places of variance in the graphical model. In other words, the S-variables locate the mechanisms where differences in conditional dependencies exist between domains; the absence of a selection node pointing to a variable represents the assumption that the mechanism responsible for assigning value to that variable does not vary across domains/environments. The selection diagram thereby also identifies the parts of the data generating process that are invariant across the environments. It is crucial to be aware of these invariances to understand the scope of observational transport. We specifically deal with the situation where the data generating process is not stable across the environments and a selection bias is involved as well, which is often the case when data is publicly generated.
Standardization in clinical case definitions is a significant challenge. It is becoming more pertinent as the places, modes of data collection and populations generating data expand. Collection modes now range from clinical data, to healthworker-facilitated data wherein healthworkers visit individuals’ houses, record symptoms and take specimens, to citizen-science studies in which participants report symptoms from home and mail in or submit specimens [goff2015surveillance, fragaszy2016cohort]. This expansion makes influenza prediction based on a specific syndromic case definition (set of symptoms) challenging. Moreover, it is extremely rare for health data from different studies to be collected in exactly the same mode and context and from the same type of population. Therefore symptoms (features) can mean different things; “fever” reported to a doctor may mean something different from “fever” reported at home through a smartphone app [ray2017predicting, rehman2018domain]. Furthermore, how young people report symptoms may differ from how older people report them. At the same time, it is known that certain population subgroups share common characteristics in spite of the manner in which the data is collected. This motivates the need to understand how the different characteristics across the subgroups can be modelled as multiple invariant components of information. These differences in data collection, together with variance in the demographic distributions of the different datasets, make the important problem of predicting influenza based on syndromic case definitions challenging.
Modeling the data generative process from human-sourced information, like symptom reports, requires an understanding of the data generating mechanisms, which are not yet well modelled. Thus we leverage public health knowledge and other work in the health domain that shows that 1) self-reports of symptoms in relation to infection status vary by the data collection mode, and 2) there are shared characteristics within population groups [ray2017predicting, saria2010learning]. Incorporation of population structure has not been explored extensively, though in health practice and research, attributes of the people contributing the data (here we consider population demographics like age and gender) are commonly available, and it is understood that there are shared characteristics within these groups [saria2010learning].
Therefore, though in some tasks only the stable relationships are desired to be transported [subbaswamy2018learning, magliacane2018domain], here we explore whether it is possible to harness information specific to population attributes to improve prediction, given that our model is concerned with human-generated data in both the source and target. This is particularly relevant in observational settings where it may not have been possible to sample a representative population in each case, and where we may not have enough features for each demographic group in a particular observational study.
While selection diagrams are important for illustrating the data generating process, solving an observational transportability problem requires relying on conditional independencies encoded in the data [pearl2002causality]. Early work has shown that data collection methods for health data can be conceptualized as domains, and domain adaptation can be useful for prediction from symptom data sets obtained via these different collection modes [rehman2018domain]. Beyond this, to the best of our knowledge no work has addressed the issue of domain differences in health data while also explicitly accounting for differences in population attributes.
Accordingly, here we address situations in which it is desirable to transport the relevant information between settings: not just the information that is stable across all possible situations but also, for example, population-invariant information where possible. To accomplish this, we formulate an undirected hierarchical modeling approach to capture the invariant relationships when appropriate. This enables the model to fall back on the population-specific information if feature-specific information is lacking. This is particularly relevant for observational data, wherein the underlying population distributions (e.g. proportion of men, women, and different ages) will differ. Our work also allows multiple sources from different domains to be used together to improve prediction on a largely unlabelled target. Specific contributions of this work are:
Formalizing the data generating process between observational symptom reports and infection status, capturing sources of stability and invariance
Proposing an observational transport model that learns multi-component invariant information shared across population subgroups from multiple environments as well as the unstable environment specific information that is used as needed in the presence of the selection bias
Demonstrating the model on real-world public health syndromic data, improving prediction performance of infection prediction on largely unlabelled target datasets across population subgroups (by demographic attributes)
Observational to observational transport. The idea of transporting causal relationships across environments has been studied by [pearl2011transportability, subbaswamy2018learning, mooij2016joint]. Some of the methods rely on the assumption that the causal graph is known [pearl2011transportability, subbaswamy2018learning] while others do not [mooij2016joint]. [pearl2011transportability] state that the causal relation to be transported is to be learned from both the source and target environments. [subbaswamy2018learning, mooij2016joint] learn the invariant relations in the source domain that can be transferred to the target domain by finding the set of features which can be conditioned on to remove the instabilities in the data generating process. We build on the idea of causal transport with a known data generating process: we learn the invariant relations while also learning the environment-specific characteristics, which are critical in cases where specific population subgroups are underrepresented and the invariant features do not capture the properties of these subgroups.
Hierarchical approaches have primarily been developed in natural language processing, and use Bayesian priors to tie parameters across multiple tasks [evgeniou2005learning]. In such methods, each domain has its own domain-specific parameter for each feature, which the model links via a hierarchical Bayesian global prior instead of a constant prior. This prior encourages features to have similar weights across domains, unless there is good contrary evidence. Hierarchical Bayesian frameworks are a more principled approach for transfer learning, compared to approaches which learn parameters of each task/distribution independently and smooth parameters of tasks with more information towards coarser-grained ones [carlin2010bayes, mccallum1998improving]. An undirected Bayesian transfer hierarchy has been used to jointly model the shapes of different mammals [elidan2012convex]. Incorporation of population structure has not been explored extensively but has been considered in other work using health-related data. While increasing representation granularity by increasing the number of classes can help, ad hoc discretization into fixed sets can limit the ability to model instance-specific variability. Therefore hierarchical approaches have been used (but not yet for domain adaptation); for example, Dirichlet processes have been used to allow sharing of mixture components in time-series data, generating global and individual topic parameters [saria2010learning].
With regard to domain adaptation on syndromic data as used in this work, some early work has shown that public health collection methods can be conceptualized as domains, and that domain adaptation can be useful for prediction from symptom data sets obtained via these different modes [rehman2018domain]. While we build on this idea of hierarchical modeling for domain adaptation, here we go further and explicitly model population attributes, allowing empirical information about the included population to contribute to learning the model posterior and to improve transfer of information to a new population and domain with limited infection labels.
Multi-source domain adaptation. In formulating the empirical model, this work builds upon the domain adaptation literature, and specifically hierarchical Bayesian frameworks. Domain adaptation is focused on improving performance for a target data set in situations where the domain of the target is different from that of the source(s) from which information is transferred. Broadly, with regard to types of data, transfer learning has been widely applied in image recognition (image data) [oquab2014learning], natural language processing (text) [daume2009frustratingly], and hospital health care datasets (features about hospitals, e.g. admissions, size, etc.) [wiens2014study]. The “Frustratingly Easy Domain Adaptation” method is notable for simplicity and good performance on text data [daume2009frustratingly] and is equivalent to hierarchical domain adaptation [finkel2009hierarchical] (although it ties hyperparameters across the entire model, while hierarchical models explicitly allow these to be separated, which is especially important when considering data features and population attributes in the hierarchy as in our problem here).
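As a concrete illustration of the feature augmentation behind “Frustratingly Easy Domain Adaptation” [daume2009frustratingly], the sketch below maps each feature vector to a shared copy plus one domain-specific copy, with zeros in the slots of all other domains. The function name `feda_augment`, the domain labels and the toy symptom vectors are illustrative assumptions and not part of this work's implementation.

```python
import numpy as np

def feda_augment(X, domains, domain_names):
    """Return [shared copy | one block per domain] for every row of X."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    out = np.zeros((n, d * (1 + len(domain_names))))
    out[:, :d] = X  # shared block, active for every domain
    for i, dom in enumerate(domains):
        k = domain_names.index(dom) + 1  # index of the block reserved for this domain
        out[i, k * d:(k + 1) * d] = X[i]
    return out

# Toy example (hypothetical): the same symptom vector seen in two collection modes.
X = [[1, 0, 0, 1], [1, 0, 0, 1]]
names = ["citizen", "healthworker"]
aug = feda_augment(X, ["citizen", "healthworker"], names)
print(aug)
```

A linear classifier trained on the augmented features learns shared weights in the first block and domain-specific corrections in the remaining blocks, which is the sense in which a single set of hyperparameters is tied across the whole model.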
Given that health-related data sets can be collected in many different ways and from varied population samples, here we explicitly consider a multi-source situation to harness information from multiple datasets. One approach to learning from multiple sources, by pooling and analyzing multisite datasets, transforms the source and target feature spaces to correct the distributional shift in the data [zhou2018statistical]. Instead of considering just one source and one target dataset, some prior work has also studied domain adaptation with multiple source datasets, owing to the additional information that can be learned from multiple sources [guo2018multi]. This task has also been formulated from a causal view [mooij2016joint, tsamardinos2009multi, zhang2015multi], where the posterior of the target is a weighted average of the source datasets.
Given the related work, here we propose an undirected hierarchical multi-source domain adaptation model that harnesses population-invariant information when needed. To motivate this model, we first present the task from a causal perspective, focusing on the invariant and variant information, and then present a model that allows transfer of invariant information while explicitly learning the variant information. While the problem of feature transformation has long been studied, we do not resort to feature transformation, since real-world data is generated by various processes and these subtle differences are important while transforming covariate relationships across domains. Therefore our model leverages some labels in the target domain, instead of performing unsupervised domain adaptation. In sum, the idea of harnessing population-specific information is specific to applications wherein data is generated by humans, of which there are many, particularly in human health and well-being.
As motivated, we need to learn from both the source and the target datasets for observational transport. We therefore need a limited amount of labelled data in the target environment. This matches the common situation in which only a limited sample of labelled data is available in the target environment (domain), owing to the expense and difficulty of obtaining labels for the entire dataset. We consider source datasets from multiple domains where ( comprises all the source environments) and a single target dataset where . For the target dataset we have a limited number of labeled samples whereas for the source datasets all the samples are labeled. denotes all the datasets: source as well as the target (). Sets of variables are denoted by bold capital letters whereas their individual assignments are denoted by lowercase letters. Y denotes the presence () or absence () of the influenza virus. represents the age of the individual, and is categorized into common epidemiological groupings: age 0-4, age 5-15, age 16-44, age 45-64, age 65+. Similarly, represents the gender of the individual (male or female). The demographic attributes (in this study, and , but can be expanded to other demographic attributes where available) are together represented as ; .
X is the feature vector representing the presence of the symptoms: fever, cough, muscle pain and sore throat. Here X is a 4-dimensional binary vector representing the symptoms that an individual has (if an individual has fever and sore throat but no cough and muscle pain, the feature vector looks like [1, 0, 0, 1]). We consider subgroups in the data to be the specific demographic populations of interest belonging to a specific gender and age group. The task is to predict the value of Y for each of the subgroups from the symptom information X. This can be formalized as:
We aim to learn the classifier for the target dataset parameterized by for each of the demographic population subgroups () that minimizes the empirical risk while minimizing the total risk across the source domains as well. It should be noted that the distribution across the population subgroups () might not be uniform and hence the resulting might not be the same across all the subgroups. We now present the assumptions that make this task a well-posed problem.
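The encoding of symptom reports and demographic subgroups described above can be sketched as follows. The symptom order and the age groupings come from the text; the identifiers (`SYMPTOMS`, `encode_symptoms`, `age_group`, `subgroup`) and the assumption that ages are given in whole years are illustrative choices, not the authors' code.

```python
SYMPTOMS = ("fever", "cough", "muscle_pain", "sore_throat")
# Epidemiological age groupings from the text; None marks the open-ended top bin.
AGE_BINS = ((0, 4), (5, 15), (16, 44), (45, 64), (65, None))

def encode_symptoms(reported):
    """Binary vector with one entry per symptom, in the fixed SYMPTOMS order."""
    return [1 if s in reported else 0 for s in SYMPTOMS]

def age_group(age):
    """Map an age in whole years to its grouping label."""
    for lo, hi in AGE_BINS:
        if age >= lo and (hi is None or age <= hi):
            return f"age {lo}+" if hi is None else f"age {lo}-{hi}"
    raise ValueError("age must be a non-negative whole number of years")

def subgroup(age, gender):
    """A population subgroup is an (age group, gender) pair."""
    return (age_group(age), gender)

x = encode_symptoms({"fever", "sore_throat"})
print(x, subgroup(30, "female"))
```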
Our main assumption is that the data generating process is known and can be represented via a graphical causal diagram, which helps to identify the information that can be transported [pearl2011transportability, subbaswamy2018learning].
We adapt below the definition of a selection diagram previously given in [pearl2002causality, subbaswamy2018learning, pearl2011transportability].
Definition 1 (Selection diagram). A selection diagram is a probabilistic causal model (as defined in [pearl2002causality]) augmented with auxiliary selection variables S (denoted by square nodes) comprising two types; . An edge from to any observed variable ; denotes that the mechanism of assigning value to changes across the domains. This denotes the place of instability in the data-generating process. represents the selection bias. An edge from to ; denotes that there is some selection bias with respect to .
We illustrate the structural causal model for our setting. As motivated above, we formalize the causal diagram (Figure 1a) for our setting (symptom reports and influenza virus) based on prior knowledge and research about the data generative process of such data. The symptoms that result are generally shaped by infection status [CDC_Flu_Symp], thus we have . The population demographic attributes also directly affect the symptoms, and susceptibility to infection by the virus: ().
Now, we consider the uncertain parts of the data-generating process that vary across different domains/environments. The data collection method (e.g. citizen science or health-worker facilitated) affects (it is known that citizen science is less specific than in a hospital, for example) [ray2017predicting]. Thus the collection mode introduces differences in the manner in which is observed across the domains and thus contributes towards the selection variable pointing towards (). The absence of a selection variable pointing at and indicates that the mechanism of assigning values to these variables is the same across the different domains (which makes sense intuitively, as demographic variables, e.g. man or woman, do not change or have different meanings in the different domains, nor does the process for obtaining the influenza virus status, which is performed via laboratory confirmation in all cases). There is a selection bias associated with population demographic attributes: the number of individuals in each of the subgroups varies across the domains. Thus there is an edge from to . We now state the assumptions that help to formulate the observational transport for this causal structure.
Assumption 1. Let be a causal graph with variables V consisting of the system variables and the context variables .
No system variable directly causes any context variable () except while representing the selection bias.
No system variable is confounded with a context variable.
Assumption 2. Let be a causal graph with variables V consisting of the system variables and the context variables and be the corresponding distribution on . Let be the indicator denoting the differences in the source and target domains.
The distribution is Markov and faithful with respect to .
has no direct effect on ().
Multi Component Invariant transfer
Having knowledge of the data-generating process via the graphical causal model , we seek to understand what invariant conditional distributions can be transferred from the source domains () to the target domain (). Recent work in domain adaptation [magliacane2018domain, 2019arXiv190702893A] aims at finding the invariant set of features , the conditional distribution of which can be transferred across the domains. However, according to the causal diagram in Figure 1b, , the features do not d-separate and , . Thus, there does not exist a separating set of features that d-separates from . However, we do notice that . The invariant information across the demographic attributes can be transferred across the domains. As has been studied in public health and epidemiology, the different demographic subgroups of the population share characteristics; for example, babies are known to be more susceptible to certain symptoms than to others, which supports the claim that the conditional distribution can be transferred across domains. We thus aim to learn multiple components of the invariant representation, one for each of the age and gender groups (population subgroups).
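The d-separation argument above can be checked mechanically on a small graph. The sketch below uses the standard moralized ancestral graph criterion. The edge list is an assumption: it is one reading of the textual description of Figure 1b (demographics and virus both cause symptoms, a domain selection node points at symptoms, and demographics point at a selection-bias node), and the node names `D`, `S_domain` and `S_bias` are hypothetical.

```python
from itertools import combinations

def d_separated(edges, xs, ys, zs):
    """True if zs d-separates xs from ys in the DAG given by (parent, child) edges."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
    # 1. Keep only the query nodes and their ancestors.
    keep, stack = set(), list(xs | ys | zs)
    while stack:
        n = stack.pop()
        if n not in keep:
            keep.add(n)
            stack.extend(parents.get(n, ()))
    # 2. Moralize: link each node to its parents and marry co-parents.
    nbrs = {n: set() for n in keep}
    for child in keep:
        ps = parents.get(child, set())
        for p in ps:
            nbrs[p].add(child)
            nbrs[child].add(p)
        for a, b in combinations(ps, 2):
            nbrs[a].add(b)
            nbrs[b].add(a)
    # 3. Delete the conditioning set and test reachability from xs to ys.
    seen, stack = set(), [x for x in xs if x not in zs]
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        stack.extend(m for m in nbrs[n] if m not in zs)
    return not (seen & ys)

# Assumed reading of Figure 1b (hypothetical node names).
EDGES = [("D", "Y"), ("D", "X"), ("Y", "X"), ("S_domain", "X"), ("D", "S_bias")]

print(d_separated(EDGES, {"Y"}, {"S_domain"}, {"X"}))  # conditioning on symptoms
print(d_separated(EDGES, {"Y"}, {"S_bias"}, {"D"}))    # conditioning on demographics
```

Under this assumed graph, conditioning on the symptoms opens the collider between the virus and the domain selection node, whereas conditioning on the demographic attributes blocks the path from the virus to the selection-bias node.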
Observational transport across domains
Motivated by the approach stated in [pearl2011transportability], we aim to take a statistical relation learned in one study and transfer it to another situation, particularly when gaining complete information about that relationship is costly (e.g. the relationship between symptoms and infection status, when confirming infection status requires performing expensive laboratory tests).
The definition of observational transportability in [pearl2011transportability] (Definition 5) asserts that the relation to be transported has to be constructed from the source data as well as observations from the target data. As there is no control over the data-generating process (no intervention on any of the system variables), as is generally possible for experimental data, we cannot use do-calculus to formalize the causal relation. Instead we must use the conditional independencies to understand the relationship between the target, Y, and the features, X, by obtaining the joint probability distribution . The graph pruning methods proposed by [magliacane2018domain], which find the optimal set of features that separate the target from the selection variables, confirm, as mentioned before, that no such separating set exists here. We therefore provide an empirical hierarchical model that learns this for the target dataset while transferring information about the invariant links .
Formal framework of the undirected hierarchical multi-source Bayesian approach
In this framework, the lowest level of the hierarchy represents the datasets (within each domain, in our case, collection mode), , for each of which we have the labeled data of the dataset as shown in Figure 2(ii). As in all Bayesian problems, the dataset parameters should represent the data well. Here, are influenced by the domain-specific parameters (); are generated according to , where is the collection mode and where represents the parameters for the citizen-science collection mode and represents the parameters for the health-worker supported collection mode. In the undirected population-aware hierarchical model we allow the domain specific parameters to have multiple parents and learn all parameters simultaneously. Accordingly, the domain parameters are generated according to the distribution . Here, we explicitly include to represent the population parameters; here for the different age group categories, and similarly for genders where , and . The population parameters and have the root parameter as the parent, which represents invariant information across all of the datasets, classes and population attributes,
. Then, the joint distribution accounting for all of these data and parameters is:. The hierarchical model presents a way to learn and .
Invariant component representation
The hierarchical model learns the posterior distribution for each specific domain (). We do not seek a single invariant representation, as has been proposed in [2019arXiv190702893A]; instead we extract invariant component representations. The hierarchical approach allows us to have different components of the invariant representation. Since we have the demographic information along with the symptoms (features), it is easier to model invariant representations of the features as specific components, where the components represent the age groups and the genders. We thus learn the invariant component representations for the different demographic subgroups (age 0-4, age 5-15, age 16-44, age 45-64, age 65+, male and female). We find that the invariant components capture the intricate characteristics shared between the subgroups. We also provide the conditions under which this invariant component representation does not fully represent the information for a subgroup, in which case the domain-specific representation helps. We thus explicate the conditions under which the invariant information is useful as well as when the domain-specific information is to be utilized.
For all parameters we use independent priors, computed based on symptom predictivity for each age group and gender. The inclusion of data dependent priors in Bayesian learning has been explored to incorporate domain knowledge into the posterior distribution of parameters [darnieder2011bayesian]. For population-aware modeling, data-informed prior distributions are important because the distributions from each dataset are particular to the study; capturing this information adds more to the analysis than improper or vague priors would (e.g. for a sample wherein one demographic group is under-represented), and it also motivates the multiple parents in the hierarchy. In contrast, using just the root prior for estimating the posterior ignores the demographic information available. Therefore, we use an empirical Bayes approach to specify weakly informative priors, centered around the estimates of the model parameters [van2017prior]. Root parameters are centered on the cumulative data since the root parameter captures domain invariant information.
First, we use a probabilistic framework to jointly learn each parameter based on all levels of the hierarchy. We use a maximum a-posteriori parameter estimate instead of the full posterior for the joint distribution, which would be computationally intractable. We use a formulation, proposed in [elidan2012convex] that is amenable to standard optimization techniques, resulting in the objective:
For dataset , denotes the parameter for symptom . From a specific dataset’s parameter space, represents individual symptoms. is a statistical measure of the symptom in the dataset, in this case the proportion of the particular symptom resulting in a positive influenza virus (i.e. the positive predictive value). is the set of all nodes in the hierarchy (here, ). Regularizing parameter was chosen as 1 to allow Laplacian smoothing [mccallum1998improving]. The function is a divergence (L2 norm used) over the child and the parent parameters that encourages child parameters to be influenced by parent parameters, and allows a child parameter to be closely linked to more than one parent. The weight represents the influence balance between node parameters and node parent parameters. Based on hyperparameter tuning, a value of 0.2 for was used in all experiments. For optimization of the objective function, we use Powell’s method [fletcher1963rapidly].
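To make the structure of such an objective concrete, the following is a minimal sketch and not the authors' exact formulation, whose equation is not reproduced here. It assumes a squared-error data-fit term that pulls a node's parameters towards its smoothed positive predictive value, adds an L2 divergence between every child and each of its parents, and minimizes the sum with Powell's method via `scipy.optimize.minimize`. The hierarchy, node names, toy data and the names `BETA` and `LAMBDA` are hypothetical; only the values 0.2 and 1 are taken from the text.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical hierarchy: child -> parents (a node may have several parents).
PARENTS = {
    "root": [],
    "age_group": ["root"],
    "gender": ["root"],
    "domain": ["age_group", "gender"],
    "dataset": ["domain"],
}
NODES = list(PARENTS)
BETA = 0.2    # child-parent tie weight (value reported in the text)
LAMBDA = 1.0  # smoothing constant (value reported in the text)

def smoothed_ppv(X, y, lam=LAMBDA):
    """Per-symptom share of positive cases among reporters, with add-lambda smoothing."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    return (X.T @ y + lam) / (X.sum(axis=0) + 2 * lam)

def objective(flat, stats):
    theta = dict(zip(NODES, flat.reshape(len(NODES), -1)))
    # Assumed data-fit term: squared distance to each node's empirical statistic.
    fit = sum(np.sum((theta[n] - s) ** 2) for n, s in stats.items())
    # L2 divergence between every child and each of its parents.
    tie = sum(np.sum((theta[c] - theta[p]) ** 2)
              for c, ps in PARENTS.items() for p in ps)
    return fit + BETA * tie

def fit_hierarchy(stats, n_features=4):
    x0 = np.full(len(NODES) * n_features, 0.5)
    res = minimize(objective, x0, args=(stats,), method="Powell")
    return dict(zip(NODES, res.x.reshape(len(NODES), -1))), res.fun

# Toy data (hypothetical): columns are fever, cough, muscle pain, sore throat.
X_d, y_d = [[1, 0, 0, 1], [1, 1, 0, 0], [0, 1, 1, 0], [1, 0, 1, 1]], [1, 0, 1, 1]
X_o, y_o = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 0, 0], [0, 1, 0, 1]], [0, 0, 1, 0]
stats = {"dataset": smoothed_ppv(X_d, y_d),
         "root": smoothed_ppv(X_d + X_o, y_d + y_o)}  # root uses the pooled data
theta, value = fit_hierarchy(stats)
print(theta["dataset"], theta["root"])
```

Because the ties are undirected penalties rather than directed priors, a node with two parents (here `domain`) is pulled towards both at once, which is what allows all parameters to be learned simultaneously.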
Second, we learn the influence () of each parent on a particular dataset (child node). This is necessary since we need to learn for the target dataset, as observed from the causal structure. We provide a mechanism to learn that as follows: . The weights are learned by performing a non-linear least-squares regression; the information from the different parents and the dataset can only be positive and hence we restrict the weights to be positive. This enables the model to give more weight to one level of the hierarchy when needed. In other words, how much demographic-invariant or domain-invariant information is needed depends upon how much information is in a given dataset. The weights for the different levels are learned independently for each dataset because each dataset requires different amounts of information from the demographic-specific and the domain-specific parameters, depending upon the demographic distribution of the sample in that dataset as well as the collection mode.
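One way to fit such positive weights is bounded least squares. The sketch below is an illustration under stated assumptions rather than the authors' procedure: it assumes the target statistic is approximated by a weighted sum of the parameter vectors from each level of the hierarchy, which makes the residual linear in the weights, and it uses `scipy.optimize.least_squares` with a lower bound of zero. The function name `learn_level_weights` and all numbers are hypothetical.

```python
import numpy as np
from scipy.optimize import least_squares

def learn_level_weights(level_params, target_stat):
    """Non-negative weights w such that sum_k w[k] * level_params[k] approximates target_stat."""
    P = np.asarray(level_params, float)  # shape (n_levels, n_features)
    t = np.asarray(target_stat, float)   # shape (n_features,)

    def residuals(w):
        return w @ P - t

    w0 = np.full(P.shape[0], 1.0 / P.shape[0])  # feasible starting point
    fit = least_squares(residuals, w0, bounds=(0.0, np.inf))
    return fit.x

# Toy parameters for three levels (hypothetical): dataset, domain, population.
levels = np.array([[0.9, 0.2, 0.5, 0.4],
                   [0.3, 0.8, 0.6, 0.1],
                   [0.5, 0.5, 0.1, 0.9]])
true_w = np.array([0.6, 0.3, 0.1])
weights = learn_level_weights(levels, true_w @ levels)
print(weights)
```

Fitting the weights separately per dataset, as the text describes, amounts to calling this routine once for each dataset with that dataset's own level parameters.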
Licensing conditions for the use of the invariant representations
We are in the setting where varies across the domains owing to the selection bias. This motivates learning the influence of the invariant parameters on each of the subgroups (). To understand the cases in which the invariant representations captured by fail to capture the information for the specific age group, we analyze the information at the demographic subgroup level. The model structure consists of hierarchical levels, each of which learns invariant information. The underlying assumption is that the higher levels learn more invariant information than the leaf nodes, whereas the leaf nodes learn more data-specific information. We begin by describing the conditions on which the information is evaluated.
Definition 3. Let be the difference between the conditional probabilities of X (symptoms) given the value for Y.
Definition 4. Let be the expectation of over the symptoms for the subgroup of the dataset . Similarly we define to be the expectation of over the symptoms for the population subgroup comprising of the subgroups from all the environments/domains ().
Theorem 1. The parameters for a subgroup () of a dataset () depends on the and the conditional probability for the entire population comprising of the subgroups from the individual environments/domains and the conditional probability for the subgroup of the specific dataset.
a) We make use of the information function
which represents the information present about event . If then (this condition is explained in the proof in the appendix). Since is a monotonically decreasing function, . Since the specific dataset has more information, the dataset-specific parameters are used instead of the invariant parameters learned over the global population.
b) if and . This means that the specific subgroup () is overrepresented in the specific dataset, but we do not have much information about the subgroup from the invariant global representation since it is underrepresented in the global population. ∎
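The monotonicity step in part a) can be illustrated numerically. The sketch assumes that the information function is the standard self-information, the negative logarithm of a probability; the exact formula is not shown in the text, so this is an assumption, and the two probabilities are placeholders rather than values from the datasets.

```python
import math

def information(p):
    """Self-information -log(p) of an event with probability p in (0, 1]."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return -math.log(p)

# Placeholder probabilities, not taken from the data.
p_dataset, p_global = 0.2, 0.6
# A smaller probability always yields a larger information value.
print(information(p_dataset) > information(p_global))
```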
The conditions determine the cases in which spurious relations could be picked up by the invariant component representations, so that the data-specific parameters better represent the relations present in the specific dataset. As noted above, the underlying assumption of the hierarchical model is that the higher levels capture the invariant information, as opposed to the leaf nodes in the hierarchy, which represent the data-dependent information. The theorem thus states the conditions under which the invariant component representations should be used and when we need to rely on the data-specific parameters to capture the relations within the specific subgroup of the dataset.
Each dataset includes symptoms from individuals, laboratory confirmation of type of influenza virus they had (if any), and age and gender of each person as example population attributes. GoViral data is from volunteers who self-reported symptoms online and mailed in bio-specimens for laboratory confirmation of illness in New York City. It consists of 520 observations out of which 291 had positive laboratory results [goff2015surveillance]. FluWatch consists of 915 observations (567 positive cases of flu) of volunteers in the United Kingdom. These two datasets belong to the “citizen science” domain [fragaszy2016cohort, rehman2018domain]. Hong Kong consists of 4954 observations (1471 positive cases of flu) collected by health workers in Hong Kong [cowling2010comparative]. The Hutterite data is composed of 1281 observations (787 positive cases of flu) from colonies in Alberta, Canada sampled by nurses [loeb2010effect].
It should be emphasized that each of the datasets has a different composition in terms of total number of observations and population demographics (Appendix Figure 3). We choose to use them all without any pre-processing, as they demonstrate real data set differences and will indicate model performance in such real-world situations.
As motivated, we consider the case of transferring information from multiple source data sets from different domains to a largely unlabelled target dataset.
We conduct multiple experiments to compare the proposed framework with relevant baselines, specifically examining the value of i) the hierarchical structure, ii) the incorporation of population attributes, and iii) the amount of labelled data available from the target. The area under the ROC curve (AUC) is used to assess performance. This is reported for the entire dataset, comprising all the subgroups.
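For reference, the AUC can be computed directly from its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below is a generic implementation with made-up labels and scores, not the evaluation code used for the reported results.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: 1 = influenza confirmed, 0 = not confirmed.
print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]))
```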
We evaluate across all the population subgroups of the dataset (). We compare the following methods: Target only (Target); Logistic Regression (LR); Frustratingly Easy Domain Adaptation (FEDA), which is noted for its extreme simplicity and was used previously on symptom data [daume2009frustratingly, rehman2018domain]; and Undirected Hierarchical Bayesian Domain Adaptation (Hier).
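To make the FEDA baseline concrete, the sketch below shows the feature augmentation described in [daume2009frustratingly]: each feature vector is copied into a shared block and into a block reserved for its own domain, with zeros in the blocks of all other domains. The function name, the block ordering, and the example values are assumptions made for illustration; how the baseline was configured in these experiments is not specified in the text.

```python
def feda_augment(x, domain_index, num_domains):
    """Map a feature vector to a shared copy plus one domain-specific copy.

    The output has (num_domains + 1) * len(x) entries: the shared block first,
    then one block per domain, all zero except the block at `domain_index`.
    """
    if not 0 <= domain_index < num_domains:
        raise ValueError("domain_index out of range")
    d = len(x)
    out = list(x) + [0.0] * (num_domains * d)
    start = (domain_index + 1) * d
    out[start:start + d] = list(x)
    return out


# Two domains: index 0 = source, index 1 = target (illustrative values).
source_row = feda_augment([1.0, 0.0], domain_index=0, num_domains=2)
target_row = feda_augment([1.0, 0.0], domain_index=1, num_domains=2)
```

Any standard classifier trained on the augmented vectors can then place weight on the shared block for relations common to all domains and on a domain block for relations specific to that domain.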
Of the methods compared, Target and LR have the poorest performance (Table 1). This is expected: a target-only model does not incorporate any information from other domains or populations, and LR does not account for any population attributes. Both also perform worse than the domain adaptation methods (FEDA and Hier), which indicates that there is domain-specific structure in the data. Finally, the methods that do account for population attributes perform the best.
Hier generally performs best, so we examined it further as a function of the amount of labelled training data available. Hier performs consistently better than the baselines even with very little labelled target data, and Figures 2(ii) and 2(iii) show that Hier improves performance across increasing proportions of labelled target data. GoViral has a limited sample size, which leads to low performance of the baseline methods, but Hier captures the invariant information across the source environments and improves drastically over the baselines. Compared to GoViral, Hutterite has a better representation of the population subgroups, so the baselines do not perform poorly, but Hier still performs substantially better. This highlights the finding that multi-component invariant learning helps to capture the information shared among subgroups even when they are underrepresented.
We also examined the learned parameters for the subgroups () and found that they comply with the conditions discussed in the section Licensing conditions for the use of the invariant representations. For further details, Appendix Table 2 highlights the subgroups for which the dataset-specific parameters are used instead of the invariant subgroup parameters.
We thus present a framework for observational transport, particularly for scenarios with instability due to selection variables as well as selection bias. We also provide the conditions under which demographic attributes, when available, yield better prediction across different environments by capturing the multiple invariant components that represent the population subgroups. Future work can explore how this method can be extended to cases where no labelled target data is available.
Appendix A
Proof of Theorem 1
We have two conditions:
Let us consider ; we encounter four conditions :
, similarly .
and ; which results in and .
and ; which results in and .
; which results in and .
For all of the conditions we obtain that which means that there is more information available from the entire population rather than just the specific dataset.
Demographic distribution in datasets
Figure 3 shows the demographic distribution across the four datasets: GoViral, FluWatch, Hong Kong, and Hutterite. The darker shade denotes the number of females in a given age group, while the lighter shade denotes the number of males. The population subgroups are not equally represented within a dataset. GoViral and Hong Kong have the highest proportion of observations in the 16-44 age group, FluWatch in the 45-64 age group, and Hutterite in the 5-15 age group.
Performance across population subgroups
Performance across subgroups for GoViral, FluWatch, Hong Kong, and Hutterite is reported in Table 2. We also report where the dataset-specific parameters () are used instead of the invariant parameters (). This complies with the conditions provided in Theorem 1.
|Dataset||Method||Age 0-5||Age 5-15||Age 16-44||Age 45-64||Age 65+|