A Data Analytics Framework for Aggregate Data Analysis

09/16/2018 ∙ by Sanket Tavarageri, et al. ∙ San Jose State University

In many contexts, we have access to aggregate data, but individual level data is unavailable. For example, medical studies sometimes report only aggregate statistics about disease prevalence because of privacy concerns. Even so, it is often desirable, and sometimes necessary, to infer individual level characteristics from aggregate data. For instance, other researchers who want to perform a more detailed analysis of disease characteristics would require individual level data. Similar challenges arise in other fields, including politics and marketing. In this paper, we present an end-to-end pipeline for processing aggregate data to derive individual level statistics, and then using the inferred data to train machine learning models to answer questions of interest. We describe a novel algorithm for reconstructing fine-grained data from summary statistics. This step creates multiple candidate datasets, which form the input to the machine learning models. The advantage of the highly parallel architecture we propose is that uncertainty in the generated fine-grained data is compensated for by the use of multiple candidate fine-grained datasets. Consequently, the answers derived from the machine learning models will be more valid and usable. We validate our approach using data from a challenging medical problem called Acute Traumatic Coagulopathy.




I Introduction

Reconstructing individual behavior from aggregate data is termed ecological inference [1]. The necessity for ecological inference occurs because 1) the underlying data that gave rise to the aggregate statistics is unavailable and, 2) the analysis that we intend to carry out requires individual level data. Examples for the need for this kind of analysis abound in various fields. Below, we list a few scenarios.

  • Medical studies sometimes report only aggregate statistics about prevalence of a disease because of privacy concerns. Even if data is de-identified before it is put in the public domain, it is susceptible to re-identification attacks [2]. Therefore, medical professionals often choose to publish only aggregate statistics out of an abundance of caution. Yet other researchers working to understand the disease better may wish to regenerate the original data for detailed analysis, e.g., for pattern recognition and for building machine learning models to predict outcomes. In this paper, we use aggregate data about a condition called Acute Traumatic Coagulopathy (ATC) as the use case to demonstrate the algorithms and the system developed in this work. Specifically, we use the Odds Ratios (ORs) of the known factors causing ATC published in a medical journal and reconstruct patient data. We use the regenerated patient data to train machine learning models to predict mortality.

  • The voting data of elections is generally available at the precinct level. However as ballots are cast in secret, data on how each individual voted cannot be known. Politicians and political scientists are often interested in knowing how different demographic groups voted. To come up with a reasonably valid answer to this question, one could couple the voting data with the Census data concerning the precinct and reconstruct the individual voting behavior. Enforcement of certain laws may require having individual level voting data at hand. For instance, the Voting Rights Act prohibits voting discrimination based on race, color, or language. The plaintiffs challenging any alleged discrimination have to first demonstrate that minority groups vote differently than majority groups, which can be done via ecological inference.

  • Sales in a supermarket: data on what product is sold in what quantities in a supermarket is available, but tracing the sales to individuals may not always be possible. Running effective advertising campaigns for grocery items mandates that the buying patterns of different segments of the customer base be known so that those segments can be reached via relevant advertisements.

In this paper, we present the design and implementation of a highly scalable system for analyzing aggregate data. We develop a novel algorithm to reconstruct individual level attributes from summary statistics. Because this is a probabilistic method, to increase the confidence in the analysis, the reconstruction algorithm outputs multiple candidate datasets. Each of the candidate datasets is then used to train a machine learning model to predict a quantity of interest. Thus, at the end of the training step, we have an ensemble method to predict the outcome.

We run the outlined pipeline in the context of ATC data. Additionally, to validate the approach we synthesize various patient datasets that have a range of ORs. The synthetic datasets serve as the ground truth for validation: we run the entire pipeline – compute summary statistics, reconstruct several candidate datasets, train the machine learning models, and obtain prediction results. The predicted outcomes are compared with the ground truth. We show that the error rates are low indicating that the reconstruction algorithms, and the machine learning models are effective in understanding the underlying processes.

The contributions of the paper are as follows.

  • We present a novel data reconstruction algorithm to regenerate individual level data from aggregate data given odds ratios.

  • A scalable data pipeline architecture is developed to train machine learning models with the multiple reconstructed datasets.

  • We present the results of an extensive experimental evaluation validating the proposed approach.

The rest of the paper is organized as follows. We introduce the Acute Traumatic Coagulopathy condition in Section II. The aggregate data available for ATC will be used as a running example to describe the algorithms, and the techniques developed in the paper. Section III develops the data reconstruction algorithm. In that section, we delineate the system architecture that uses the regenerated data to train machine learning (ML) models in parallel. Application of the developed system in the context of ATC as well as the experiments performed to assess the efficacy of the system are described in Section IV. The related work is discussed in Section V. Section VI concludes the paper with the key findings of the work.

II Acute Traumatic Coagulopathy

Acute Traumatic Coagulopathy (ATC) [3, 4] is a condition characterized by prolonged and/or excessive bleeding immediately following a traumatic injury. Despite the many recent advances in trauma care, failure to stop bleeding (hemostasis) following hemorrhage and shock remains the leading cause of death among children and adolescents [5]. This condition may present as early as 30 minutes after trauma prior to intervention, and this period is particularly critical in determining the mortality rates. It is associated with higher injury severity, coagulation abnormalities, and increased blood transfusions. The unfortunate consequences are poor clinical outcomes, and high mortality rates in trauma patients.

The underlying biochemical mechanisms that lead to ATC are not definitively known, and it results from a wide range of symptoms and phenotypes seen in the patients [6]. Much of the complexity stems from the fact that there are two distinct pathways, intrinsic and extrinsic, that cause the blood to clot. If any of the biochemical reactions in the cascade breaks down or is impaired, coagulation is insufficient, which may lead to ATC. Since ATC is a failure of the coagulation system, the laboratory test for identifying ATC has historically been a prolonged prothrombin time (PT) or a hypocoagulable condition. However, this criterion is known not to hold in general: there are patients who arrive with a shortened PT following trauma, particularly after burn injury, and yet are hypocoagulable. Newer measures such as injury severity score, partial thromboplastin time (PTT), degree of fibrinolysis, depletion of coagulation factors and inhibitors, and general failure of the blood system have all been identified as primary indicators of ATC [7].

However, there are inherent discrepancies in the diagnostic tests due to the timing of sample collection, quality and availability of assays, lack of baseline pre-injury measurements, inter-individual variability, and the multivariate nature of coagulopathy itself. These issues have made the conventional, reductionist approach to understanding and treating ATC a failure, warranting a holistic approach instead. Our long-term goal is to develop machine learning approaches in conjunction with clinical assays to understand the physiological mechanisms of ATC as well as to predict the phenotype and treatment outcomes dependably. The objective in this paper is to develop computational models that can classify ATC phenotypes as a function of various known ATC indicators. Our central hypothesis is that we can use machine learning technology to model the complex interplay between various hematological and physiological parameters and predict the chances of ATC accurately. The rationale for the current research is that the discerned mathematical relationship between the various indicators and clinical outcomes will naturally lead to hypotheses that can subsequently be tested experimentally.

Figure 1 shows the impact of abnormal coagulation parameters in terms of odds ratios reported in a medical journal.

Fig. 1: Example of aggregate data on ATC. Source: MacLeod et al. [7]

The figure indicates an odds ratio (OR) of 3.58 for death with an abnormal PT. Table I enumerates the number of patients that had an abnormal PT value, and the number of people who died. The odds of dying given an abnormal PT value are $579/2415 \approx 0.240$, while the odds of dying given a normal PT value are $489/7307 \approx 0.067$. The odds ratio (OR) is the ratio of the two quantities: $\mathrm{OR} = \frac{579/2415}{489/7307} \approx 3.58$. When the OR is greater than 1, having an abnormal PT is considered to be associated with dying. The higher the OR, the greater the association between abnormal PT and dying.

              Dead    Survived
Abnormal PT   579     2,415
Normal PT     489     7,307
TABLE I: PT odds ratio
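
As a quick check, the odds ratio can be recomputed directly from the counts in Table I. The sketch below is illustrative only; the function name `odds_ratio` and its parameter names are our own choices and do not come from the cited study.

```python
def odds_ratio(exposed_dead, exposed_survived, unexposed_dead, unexposed_survived):
    """Odds ratio of death for the exposed group relative to the unexposed group."""
    odds_exposed = exposed_dead / exposed_survived        # odds of dying given abnormal PT
    odds_unexposed = unexposed_dead / unexposed_survived  # odds of dying given normal PT
    return odds_exposed / odds_unexposed


# Counts from Table I: abnormal PT row first, then normal PT row.
pt_or = odds_ratio(579, 2415, 489, 7307)
print(round(pt_or, 2))  # 3.58
```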

The OR for the other two parameters, namely PTT and Platelets, has a similar meaning. The ecological inference problem is to generate individual patient level records such that the ORs would be as given in Figure 1. Let us denote a normal PT/PTT/Platelets value with 1, and an abnormal value with 0. The “Dead?” column will be populated with a 1 if the patient survived, and a 0 if the patient succumbed. Then, the problem becomes one of filling Table II with 0s and 1s in such a way that when we compute the ORs on this table, they match the ones in Figure 1.

Id PTT PT Platelet Dead?
1 ? ? ? ?
2 ? ? ? ?
3 ? ? ? ?
? ? ? ?
TABLE II: Plausible patient data

The challenges to address include:

  • How to efficiently explore the huge space of possible tables? We notice that for the four unknown columns case in Table II, where each cell assumes either a 0 or a 1, we have a total of $2^{4n}$ unique probable tables, where $n$ is the number of rows. In general, when the number of columns is $m$, and each column corresponds to a $v$-valued attribute, the number of combinations is $v^{mn}$.

  • How do we exploit the given summary statistics such as odds ratios to narrow down the search space, and speed up data reconstruction?

  • How to quantify the feasibility of the generated table(s)? Given the fact that a large number of feasible tables could have given rise to a given set of aggregate statistics, is there a way to empirically assess the validity of the reconstructed table(s)?
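
To give a feel for the size of the search space in the first challenge, the count can be evaluated on a logarithmic scale. This is a minimal sketch: the function name is hypothetical, and the row count of 10,790 is simply the sum of the four cells of Table I.

```python
import math


def log10_table_count(n_rows, n_cols, n_values=2):
    """log10 of the number of ways to fill an n_rows x n_cols table
    when every cell independently takes one of n_values values."""
    return n_rows * n_cols * math.log10(n_values)


# Four binary columns (PTT, PT, Platelet, Dead?) and the 10,790 patients of Table I.
print(log10_table_count(10790, 4))  # roughly 12992.45, i.e. a count with about 13,000 digits
```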

We address these challenges next.

III Parallel Data Pipeline Architecture

III-A Ecological Inference Algorithm

We illustrate the solution approach on the ATC data first, and then generalize the solution. To generate one candidate individual level dataset, we fix the outcome column, “Dead?”, first, and then populate the other feature columns: PT, PTT, and Platelet. From Table I it is seen that the total number of patients in this study is 10,790. Of the 10,790 patients, 1,068 died. Therefore, 1,068 Ids among the 10,790 Ids are randomly selected, and are marked as 0 to indicate that patients with those Ids died. The rest are marked 1.

We subsequently demarcate patients who have abnormal PT values. A total of 2,994 individuals had an abnormal PT. We also know that 579 of them died. Since we have already identified those that have died, 1,068 of them in particular, we consider this set and randomly select 579 Ids from it, and assign them an abnormal PT (0). The remaining 489 Ids in the dead set are given a normal PT score (1). Similarly, we have previously determined who survived. Among the survived set, 2,415 randomly selected persons will receive an abnormal PT, and the rest will get a normal PT value. We repeat this exercise for the PTT and Platelet columns.

In this problem, we are attempting to model mortality as a function of PTT, PT, and Platelet. Therefore, “Dead?” is the outcome variable, and the others are predictor variables/features. The key to being able to construct the table efficiently is to first fix the outcome variable and then fill in the predictor variables.
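
The two steps described above for the PT column can be sketched as follows. This is an illustration under our own naming (`reconstruct_pt` is hypothetical), using the counts of Table I and the 0/1 encoding introduced earlier; it is not the Spark implementation used in the experiments.

```python
import random


def reconstruct_pt(seed, n=10790, dead=1068, abn_dead=579, abn_survived=2415):
    """Return {id: (pt, outcome)} with 0 = abnormal PT / died, 1 = normal PT / survived."""
    rng = random.Random(seed)
    ids = list(range(1, n + 1))
    # Step 1: fix the outcome column by randomly choosing who died.
    dead_ids = set(rng.sample(ids, dead))
    survived_ids = [i for i in ids if i not in dead_ids]
    # Step 2: within each outcome group, randomly choose who gets an abnormal PT.
    abnormal = set(rng.sample(sorted(dead_ids), abn_dead))
    abnormal |= set(rng.sample(survived_ids, abn_survived))
    return {i: (0 if i in abnormal else 1, 0 if i in dead_ids else 1) for i in ids}


table = reconstruct_pt(seed=42)
a = sum(1 for pt, out in table.values() if pt == 0 and out == 0)  # abnormal PT, dead
b = sum(1 for pt, out in table.values() if pt == 0 and out == 1)  # abnormal PT, survived
c = sum(1 for pt, out in table.values() if pt == 1 and out == 0)  # normal PT, dead
d = sum(1 for pt, out in table.values() if pt == 1 and out == 1)  # normal PT, survived
print(a, b, c, d, round((a / b) / (c / d), 2))  # 579 2415 489 7307 3.58
```

Whatever the seed, the cell counts and hence the OR are reproduced exactly; only the assignment of Ids to cells changes, which is what makes each seed a different candidate dataset.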

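Data: Features, Outcome variable
Odds ratios (one per feature), Occurrence ratios (one per feature)
Outcome classes, Class ratios (one per outcome class)
Individual observations: number of records to generate
Seed for random number generator
Result: individual observations with features and outcome populated
Initialize random number generator with seed
for each outcome class do
       Randomly select (class ratio × number of observations) unlabeled observations
       Assign them the class label
end for
for each feature do
       for each outcome class do
             Solve for four unknowns using four equations involving the odds ratio, the occurrence ratio, and the class ratios
             Randomly select the solved number of observations from those with this outcome class
       end for
end for
Algorithm 1 Individual data reconstruction from aggregate statistics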

Algorithm 1 presents the steps to compute individual level data using aggregate statistics. The algorithm takes as input the features, the outcome variable, the odds ratios for the different features, the fraction of observations that carry the positive class within each feature, the outcome classes, and the fractions of the positive and negative classes within the outcome variable. Additionally, the seed for the random number generator is supplied. By varying the seed, we can generate multiple candidate individual datasets.

The algorithm first populates the outcome variable column. Let $n$ be the number of observations, and let $p_1$ be the proportion of observations that have class label $c_1$. Therefore, $p_1 n$ observations among the $n$ observations are randomly selected and are assigned label $c_1$. Next, the remaining records are given label $c_2$.

TABLE III: Odds ratio for feature

The different feature values for the features are subsequently populated. Table III and equations 2 through 4 show the relationship between various parameters. The goal is to 1) find the values of the four cell counts of Table III such that the various constraints are met, and 2) select individual records to assign one of the two class labels for each feature.

Equations 2 through 4 give us four equations in four unknowns, and we solve for the four cell counts. For the records that already carry the first outcome class, the solved number of records is randomly selected and assigned the first feature label. The remaining records in that outcome class are given the second feature label. Among the records that carry the second outcome class, we choose the corresponding solved number of records and assign them the first feature label. Finally, the unassigned observations receive the second feature label. If certain inputs to Algorithm 1 are not known in a given application, say a few of the occurrence ratios, and only the numerical ranges those inputs can assume are provided, then the algorithm selects values within the ranges for those inputs.
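
One way to solve for the four cell counts is sketched below. It assumes the four constraints are the abnormal-feature total ($a + b$), the dead total ($a + c$), the grand total ($a + b + c + d = n$), and the odds ratio ($\frac{a/b}{c/d}$); the cell names $a, b, c, d$ and the function `solve_cells` are our own labels for illustration. Substituting the three linear constraints into the odds ratio leaves a quadratic in $a$.

```python
import math


def solve_cells(n, n_abnormal, n_dead, odds_ratio):
    """Return (a, b, c, d) = (abnormal&dead, abnormal&survived, normal&dead, normal&survived).

    Assumed constraints: a + b = n_abnormal, a + c = n_dead, a + b + c + d = n,
    and (a / b) / (c / d) = odds_ratio.
    """
    s = n_abnormal + n_dead
    # Quadratic qa*a^2 + qb*a + qc = 0 obtained from a*d = odds_ratio * b*c.
    qa = 1.0 - odds_ratio
    qb = n - s + odds_ratio * s
    qc = -odds_ratio * n_abnormal * n_dead
    low, high = max(0, s - n), min(n_abnormal, n_dead)  # feasible range for a
    if abs(qa) < 1e-12:  # odds ratio of 1: the equation is linear
        roots = [-qc / qb]
    else:
        disc = math.sqrt(qb * qb - 4 * qa * qc)
        roots = [(-qb + disc) / (2 * qa), (-qb - disc) / (2 * qa)]
    a = round(next(r for r in roots if low - 1e-9 <= r <= high + 1e-9))
    return a, n_abnormal - a, n_dead - a, n - s + a


# Totals and odds ratio taken from Table I.
exact_or = (579 / 2415) / (489 / 7307)
print(solve_cells(10790, 2994, 1068, exact_or))  # (579, 2415, 489, 7307)
```

Because the counts are rounded to integers, the odds ratio of the reconstructed table matches the target only approximately when the published ratio is itself rounded.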

III-B Multiple Candidate Datasets

The original dataset $D$, which is hidden/unavailable, is summarized using the aggregate statistics. In §III-A, we presented an algorithm that generates one plausible candidate dataset whose characteristics in terms of aggregate statistics are identical to those of the original dataset $D$. In this section, we define metrics to quantify the similarity between $D$ and a candidate dataset. We develop a methodology using Algorithm 1 to generate several plausible candidate datasets $C_1, \ldots, C_k$ so that we can provide strong guarantees on the similarity between $D$ and the $C_i$s.

The central idea is that the plausible datasets are generated in such a way that any two datasets $C_i$ and $C_j$ are sufficiently distinct from each other. Consequently, as we increase the number of plausible datasets generated, $k$, it will be increasingly likely that a row that appeared in the original dataset $D$ will appear in at least one of the $C_i$s.

III-B1 Similarity score

To compute the similarity score between two datasets $D_1$ and $D_2$, we proceed as follows.

1) Normalization: The two datasets are normalized using Min-Max scaling. The minimum and maximum values of each attribute, $x_{min}$ and $x_{max}$, are calculated. $x_{min}$ is then subtracted from the attribute value $x$, and the result is divided by the difference between the maximum and minimum values:

$$x' = \frac{x - x_{min}}{x_{max} - x_{min}}$$

As a result, the attribute values after scaling will be between 0 and 1. The min-max scaling ensures that when we calculate the distance between two rows (explained below), all attributes contribute equally.

2) Manhattan distance: The distance between a row of $D_1$ and a row of $D_2$ is defined as the Manhattan distance between them scaled by the inverse of the number of attributes. If $A$ is the set of attributes, then the distance between two rows $r_1$ and $r_2$ is defined as follows.

$$d(r_1, r_2) = \frac{1}{|A|} \sum_{a \in A} \left| r_1[a] - r_2[a] \right|$$

3) Average distance: The average distance among pairs of rows of the two datasets is computed. The similarity score is one minus the average distance. The similarity score ranges between 0 and 1, with 1 indicating that the two datasets exactly match, while 0 denotes that the two datasets are very dissimilar.

4) Matching in a Bipartite graph: We derive a mapping between rows of $D_1$ and $D_2$ that minimizes the sum of distances among all pairs of rows. Minimizing the sum of distances between paired rows also minimizes the average distance between the two datasets. We show that this problem is equivalent to the matching problem in complete bipartite graphs.
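
Steps 1 through 3 can be sketched in a few lines. The function names below are hypothetical, rows are paired simply by position (the matching of step 4 is left out), and the minimum and maximum of each attribute are taken over both datasets together, which is an assumption on our part.

```python
def min_max_scale(rows_a, rows_b):
    """Scale every attribute to [0, 1] using the min and max over both datasets."""
    columns = list(zip(*(rows_a + rows_b)))
    lows = [min(col) for col in columns]
    spans = [(max(col) - min(col)) or 1 for col in columns]  # guard against constant columns

    def scale(rows):
        return [[(x - lo) / sp for x, lo, sp in zip(row, lows, spans)] for row in rows]

    return scale(rows_a), scale(rows_b)


def row_distance(r1, r2):
    """Manhattan distance scaled by the inverse of the number of attributes."""
    return sum(abs(x - y) for x, y in zip(r1, r2)) / len(r1)


def similarity(rows_a, rows_b):
    """One minus the average distance, pairing row i of one dataset with row i of the other."""
    scaled_a, scaled_b = min_max_scale(rows_a, rows_b)
    avg = sum(row_distance(r1, r2) for r1, r2 in zip(scaled_a, scaled_b)) / len(scaled_a)
    return 1 - avg


# Two small binary datasets with columns PTT, PT, Platelet, Dead?.
first = [[1, 0, 1, 1], [0, 1, 0, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
second = [[0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0]]
print(similarity(first, second))  # 0.625 for the row-by-row pairing
```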

While calculating the distance between two datasets $D_1$ and $D_2$, we pair rows of the two datasets and calculate pairwise row distances. The exact ordering of the rows does not matter from the point of view of reconstruction of individual level data from the summary statistics. Therefore, we need to find a one-to-one mapping of rows of one dataset to the other that yields maximum similarity between the two datasets. We illustrate the scenario with the following example. Consider the datasets $D_1$ and $D_2$ shown in Table IV and Table V respectively.

Id PTT PT Plt. Dead?
1 1 0 1 1
2 0 1 0 0
3 0 0 1 1
4 0 1 1 1
TABLE IV: Candidate dataset $D_1$
Id PTT PT Plt. Dead?
1 0 0 1 1
2 0 1 1 1
3 0 0 1 1
4 1 1 0 0
TABLE V: Candidate dataset $D_2$
Fig. 2: The bipartite graph mapping rows of $D_1$ to rows of $D_2$

A naïve mapping of rows of $D_1$ to $D_2$ (row 1 to row 1, row 2 to row 2, etc.) results in an average distance of $0.375$ when all four attribute columns are counted. However, the alternate mapping of row 1 to row 3, row 2 to row 4, row 3 to row 1, and row 4 to row 2 yields an average distance of $0.125$, because row 3 of $D_1$ is identical to row 1 of $D_2$, and row 4 of $D_1$ is the same as row 2 of $D_2$, while the other two row pairs differ in the value of only one attribute each.

We model this problem of figuring out the optimal mapping from rows of $D_1$ to rows of $D_2$ that gives rise to the minimal average distance as a matching problem in a bipartite graph, where one set of vertices is the rows of $D_1$ and the rows of $D_2$ form the second set of vertices. The edges connect rows of $D_1$ to rows of $D_2$, and the distance between the two respective rows is the edge weight. The objective is to find a matching that minimizes the sum of weights of edges, and thus minimizes the average distance between the two datasets. Figure 2 shows the bipartite graph formed for the two datasets presented in Tables IV and V. The optimal matching between vertices is shown in the figure using red lines.

The Hungarian method [8] may be utilized to solve the matching problem in time polynomial in the number of vertices. The time complexity of the solution is of the order $O(|V|^3)$, where $V$ is the vertex set. In the present work, what we implement is a heuristic that runs in linear time, which we find satisfactory and more efficient: we rank order the rows of $D_1$ and $D_2$ based on the sum of attribute values of the respective rows. The $i$-th ranked row of $D_1$ is mapped to the $i$-th ranked row of $D_2$. The rationale is that two rows that are similar to each other would have sums of attribute values that are close to each other, and would therefore be ranked similarly in the respective datasets $D_1$ and $D_2$.
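
The difference between an optimal matching and the rank-based heuristic can be seen on the two four-row candidate datasets of the example. The sketch below is illustrative: the exhaustive search stands in for the Hungarian method and is only practical for tiny tables, all function names are our own, and all four attribute columns are counted in the distance.

```python
from itertools import permutations


def row_distance(r1, r2):
    """Manhattan distance scaled by the inverse of the number of attributes."""
    return sum(abs(x - y) for x, y in zip(r1, r2)) / len(r1)


def average_distance(rows_a, rows_b, mapping):
    """Average distance when row i of rows_a is paired with row mapping[i] of rows_b."""
    return sum(row_distance(rows_a[i], rows_b[j]) for i, j in enumerate(mapping)) / len(rows_a)


def optimal_mapping(rows_a, rows_b):
    """Exhaustive search over all one-to-one mappings (factorial cost)."""
    return min(permutations(range(len(rows_b))),
               key=lambda m: average_distance(rows_a, rows_b, m))


def rank_sum_mapping(rows_a, rows_b):
    """Heuristic: pair rows holding the same rank when sorted by the sum of attribute values."""
    order_a = sorted(range(len(rows_a)), key=lambda i: sum(rows_a[i]))
    order_b = sorted(range(len(rows_b)), key=lambda j: sum(rows_b[j]))
    mapping = [0] * len(rows_a)
    for i, j in zip(order_a, order_b):
        mapping[i] = j
    return mapping


# Columns: PTT, PT, Platelet, Dead?.
first = [[1, 0, 1, 1], [0, 1, 0, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
second = [[0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0]]

print(average_distance(first, second, [0, 1, 2, 3]))                        # 0.375 (naive)
print(average_distance(first, second, optimal_mapping(first, second)))     # 0.125 (optimal)
print(average_distance(first, second, rank_sum_mapping(first, second)))    # 0.375 (heuristic)
```

On this small example the rank-sum pairing does not reach the optimum, which is consistent with treating heuristic-based similarity scores as conservative estimates.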

III-B2 Heterogeneous candidate datasets

While generating candidate datasets, we ensure that any two candidate datasets have an average distance between them greater than a threshold value $\delta$ that we set. An important consequence of this stipulation is that we can guarantee that the rows of the original dataset are reconstructed in the candidate datasets. Specifically, we provide the following guarantee.

Theorem 1

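As we increase the number of candidate datasets $k$, the probability of each row of the original dataset $D$ appearing in at least one of the candidate datasets $C_1, \ldots, C_k$ increases. Further, for every row $r$ of $D$, $\lim_{k \to \infty} P(r \in C_1 \cup \ldots \cup C_k) = 1$.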

The above assertion follows from two observations: 1) The number of ways of reconstructing the original dataset is finite. 2) Any two candidate datasets we generate are separated by a distance of at least the threshold $\delta$. Therefore, as we generate more and more candidate datasets, the likelihood of a row in the original dataset $D$ appearing in a candidate dataset increases.

If the original dataset $D$ has $n$ rows, and $A$ is the attribute set, then we have a total of $n \times |A|$ cells to populate. If an attribute can assume $v$ different possible values, then the total number of combinations we can create is bounded by:

$$v^{\,n \times |A|}$$

The number of combinations in practice will be much smaller because of the constraints derived from the odds ratios and occurrence ratios in Algorithm 1. We note that the reconstructed dataset is invariant with respect to row permutations, as we compute the distance between two datasets using the bipartite matching formulation. Hence, in the above upper bound, we have not considered the combinations that can result from permutations of rows.

Any two datasets that are separated by even the smallest threshold $\delta$ must correspond to distinct plausible combinations. Since the number of combinations is finite and upper bounded by the above expression, it must be the case that one of the candidate datasets $C_i$ is identical to the original dataset $D$. As a consequence, the theorem statement that each row of $D$ will be present in one of the $C_i$s is trivially true. Additionally, for this weaker requirement (compared to $C_i = D$ for some $i$) to hold, we would need a much smaller number of candidate datasets.
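
The separation requirement can be enforced with a simple accept/reject loop over random seeds, sketched below. Everything here is an assumption made for illustration: the toy generator `make_candidate` with two binary features and made-up cell counts, the rank-sum pairing used inside `distance`, and the parameter names `k`, `delta`, and `max_seeds`.

```python
import random


def make_candidate(seed, n=200, dead=40, abn_dead=(25, 15), abn_alive=(30, 50)):
    """Toy generator: rows are [feature_1, feature_2, outcome], 0 = abnormal / dead."""
    rng = random.Random(seed)
    ids = list(range(n))
    dead_ids = rng.sample(ids, dead)
    dead_set = set(dead_ids)
    alive_ids = [i for i in ids if i not in dead_set]
    rows = [[1, 1, 0 if i in dead_set else 1] for i in ids]
    for j in range(2):
        # Fixed cell counts per feature keep the aggregate statistics identical across seeds.
        for i in rng.sample(dead_ids, abn_dead[j]) + rng.sample(alive_ids, abn_alive[j]):
            rows[i][j] = 0
    return rows


def distance(rows_a, rows_b):
    """Average scaled Manhattan distance after pairing rows by rank of their attribute sums."""
    sorted_a, sorted_b = sorted(rows_a, key=sum), sorted(rows_b, key=sum)
    total = sum(sum(abs(x - y) for x, y in zip(r1, r2)) / len(r1)
                for r1, r2 in zip(sorted_a, sorted_b))
    return total / len(sorted_a)


def heterogeneous_candidates(k, delta, max_seeds=1000):
    """Collect up to k candidates whose pairwise distance exceeds delta."""
    accepted = []
    for seed in range(max_seeds):
        candidate = make_candidate(seed)
        if all(distance(candidate, other) > delta for other in accepted):
            accepted.append(candidate)
        if len(accepted) == k:
            break
    return accepted


candidates = heterogeneous_candidates(k=3, delta=0.01)
print(len(candidates))
```

The values are already binary here, so no min-max scaling is needed before computing distances.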

Fig. 3: Architecture of the data processing pipeline

III-C Machine Learning

Often ecological inference is carried out so that the individual level data generated may be used for further analysis. In other words, the end-goal many a time is to perform data analytics on the ecologically inferred data rather than use the inferred data directly. To address this need, we develop a parallel data pipeline architecture to process aggregate data, perform ecological inference, and use the generated data to train machine learning (ML) models.

Figure 3 presents the architecture of the analytics system. The aggregate statistics are fed into the ecological inference engine, which computes plausible candidate datasets in parallel. The candidate datasets are then used to train the machine learning models simultaneously to answer questions of interest. The outputs of the multiple machine learning (ML) models are combined to produce one aggregate output. If we are developing the system for a classification problem, then the majority vote (mode) of the ML models will be the combined result of the system. If it is a regression problem that the system is addressing, then the mean of the outputs of the different models will be the aggregate outcome of the models.
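
The combination step can be illustrated with a short sketch. The helper names are hypothetical, and the model outputs are stand-in values rather than results from the system.

```python
from collections import Counter
from statistics import mean


def combine_classification(predictions):
    """Majority vote: the most common label among the per-model predictions."""
    return Counter(predictions).most_common(1)[0][0]


def combine_regression(predictions):
    """Mean of the per-model numeric outputs."""
    return mean(predictions)


# One prediction per model, each model trained on a different candidate dataset.
print(combine_classification(["dead", "alive", "dead"]))  # dead
print(combine_regression([0.2, 0.4, 0.6]))
```

Using an odd number of models avoids ties in the majority vote for a two-class problem.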

Config. Gender OR Gender ratio PT OR PT ratio PTT OR PTT ratio Plate OR Plate ratio DOA ratio
1 2 0.6 4 0.3 6 0.2 8 0.1 0.1
2 6 0.7 2 0.2 4 0.1 6 0.3 0.1
3 8 0.8 8 0.1 2 0.3 4 0.4 0.2
4 10 0.9 8 0.2 4 0.4 2 0.5 0.2
5 10 0.6 4 0.1 2 0.2 6 0.5 0.3
6 4 0.5 2 0.3 6 0.1 8 0.2 0.3
7 2 0.5 6 0.3 8 0.2 10 0.1 0.4
8 6 0.5 4 0.2 8 0.1 10 0.5 0.4
9 8 0.5 2 0.4 10 0.3 10 0.4 0.45
10 10 0.5 6 0.4 10 0.3 10 0.3 0.49
TABLE VI: Parameters for ground truth generation. Legend: Plate - Platelet, DOA ratio - Dead or Alive ratio

IV Case Study: Predicting Acute Traumatic Coagulopathy Outcomes

We use the developed parallel data pipeline on the Acute Traumatic Coagulopathy (ATC) data to predict mortality. The system is implemented in the Apache Spark [9, 10] version 2.2.0 cluster computing framework. Spark is an in-memory Big data computing framework which helps us scale our implementation to large volumes of data seamlessly. The predictor variables and the possible values they can assume are the following:

  • Gender: Male or Female

  • Platelet count: Abnormal, or Normal

  • Prothrombin Time (PT): Abnormal or Normal

  • Partial Thromboplastin Time (PTT): Abnormal or Normal

  • Age: age of the patient

IV-A Experimental Set Up

We perform two sets of experiments:

1) We first assess the efficacy of the ecological inference algorithm in reconstructing datasets that are similar to the ground truth data. We feed the aggregate data to the ecological inference algorithm, and generate candidate datasets. Then we compute the similarity scores between the ground truth data and the candidate datasets.

2) We run the entire data analytics pipeline to understand its effectiveness. For this, we use the generated data to train Random Forest [11] machine learning models. The performance of the Random Forests is assessed against the ground truth.

Ground truth data for evaluation

The ground truth data are synthetically created. We generate a number of datasets that differ in their characteristics. For each predictor, we vary odds ratios from 2 to 10, and occurrence ratios from 0.1 to 0.5. For the outcome variable – mortality, the class ratios range from 0.1 to 0.5 as well.

We note that in the ATC context, a predictor variable can, for example, have either a normal value or an abnormal value. Similarly, the outcome variable, mortality, can be set to “dead” or “alive”. Thus, in a binary valued variables scenario such as this, the experimental results when we set the occurrence ratio of a variable to, say, 0.6 will be identical to the case when the occurrence ratio is 0.4 ($1 - 0.6 = 0.4$). The results when the occurrence ratio is 0.7 will be indistinguishable from the case when the ratio is 0.3 ($1 - 0.7 = 0.3$). Therefore, we vary the ratios only between 0.1 and 0.5.

A total of 10 datasets are synthesized in this manner using 10 different sets of parameters. Table VI enumerates the 10 sets of parameters used. Each set of parameters consists of a unique combination of odds ratios and occurrence ratios for predictors, and outcome class ratios. For example, the first configuration in the Table has Gender ratio of 0.6 indicating that the male fraction of the population is 0.6 while the OR of dying is 2 if one is a male patient. The ORs of dying because of abnormal PT, PTT, and Platelet values are 4, 6, and 8 respectively. The fractions of patients who have abnormal PT, PTT, and Platelet counts are 0.3, 0.2, and 0.1 in that order. The DOA ratio is 0.1 signifying that 0.1 fraction of the patients died. Similar interpretations are ascribed to the other parameter configurations. For all configurations, the age values are generated such that the mean age of patients is 36, and standard deviation is 19. Each ground truth dataset consists of patient records.

Performance of aggregate data analytics pipeline

Once we have the ground truth data created in this manner, for each parameter configuration shown in Table VI, we run the pipeline shown in Figure 3. For each ground truth dataset, we output candidate datasets and compute the similarity scores between the ground truth data and the candidate datasets. The Random Forest machine learning models are trained using the candidate datasets. A Random Forest comprises 50 decision trees, each of depth 8.

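We record the performance of the Random Forests in terms of accuracy, precision, and recall. The ground truth data sans the label is fed to the Random Forest model to predict mortality. Each patient can have the label either dead or alive, and dead is the positive class for our evaluation. Consequently, accuracy, precision, and recall are defined as follows, where $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives respectively.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$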
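
For concreteness, the three measures can be computed from a list of actual and predicted labels as in the sketch below, with dead as the positive class. The function name and the five-patient example are made up for illustration.

```python
def classification_metrics(actual, predicted, positive="dead"):
    """Return (accuracy, precision, recall) with `positive` as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # no positive predictions at all
    recall = tp / (tp + fn) if tp + fn else 0.0     # no actual positives at all
    return accuracy, precision, recall


actual = ["dead", "dead", "alive", "alive", "alive"]
predicted = ["dead", "alive", "alive", "dead", "alive"]
print(classification_metrics(actual, predicted))  # (0.6, 0.5, 0.5)
```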


IV-B Experimental Results

Similarity between the ground truth and reconstructed datasets

Figure 4 shows the similarity score between the ground truth dataset and the reconstructed datasets for various parameter configurations shown in Table VI. For each configuration we generate a ground truth dataset meeting the parameter specifications. We calculate aggregate statistics on the ground truth dataset and use that as the input to the ecological inference algorithm. The algorithm outputs 9 different candidate datasets. The threshold distance between any pair of candidate datasets is set to . The figure shows the average similarity score between the ground truth and all the candidate datasets (the highest possible score being 1). We observe that the similarity score is consistently high and hovers around for all configurations. The variance in similarity scores between a ground truth data and the candidate datasets generated for it is low. For Config.1, for example, the minimum similarity score is , while the maximum is . The average score is .

Fig. 4: The average similarity score between the ground truth and reconstructed datasets
Fig. 5: The average percentage of rows that match exactly between the ground truth and a reconstructed dataset

We additionally record the percentage of rows of the ground truth dataset and candidate datasets that match exactly. Figure 5 plots the average percentage over different candidate datasets. Between 10% and 30% of the rows match exactly.

We note that the similarity scores and exact match rows are computed using the optimized greedy algorithm (§III-B1), and therefore the computed scores and row counts represent lower bound values. The actual similarity scores and row counts could be higher if the Hungarian method to derive matching in the bipartite graph is used. It is expensive computationally, however.

Overall prediction performance

Figure 6 shows the performance of the parallel aggregate data analytics pipeline in terms of accuracy, precision, recall achieved for the different parameter configurations listed in Table VI.

Fig. 6: Performance of the pipeline for various configuration parameters

We observe that high accuracy is obtained for all configurations. It fluctuates between 0.77 and 0.91. Config. 7 has the lowest accuracy – 0.77 while Config. 1 and 2 both attain accuracy of 0.91. The precision metric which is the ratio of true positives (the number of patients that the model correctly classified as dead) to the sum of true positives and false positives (the number of patients the model correctly or incorrectly classified as dead), improves as we move from Config. 1 to 10. The reason being that the DOA ratio (the fraction of patients who died) increases with the configuration number. The result is that the model has more positive (dead patients) samples to learn from, and its precision increases. A similar influence is at play with respect to the recall metric as well – as the DOA ratio increases so does the recall. As the fraction of dead patients rises, the model is able to correctly identify a larger fraction of dead patients. The recall rate is 0.22 when only 10% of patients are dead (Config. 1), and it climbs to 0.50 when the percentage of dead patients rises moderately to 20% (Config. 3). The recall is highest at 0.83 when 49% of patients have died (Config. 10).

In sum, we note consistently high accuracy and precision measures while recall is high when the DOA ratio is moderate to high, but is on the lower side when the DOA ratio is small.

Fig. 7: Config. 1: Effect of downsampling majority class in improving recall
Boosting recall via undersampling of majority class

For a context like Acute Traumatic Coagulopathy, which is the subject of the case study in this work, having a high recall rate at all mortality levels is crucial: a false negative may have a more adverse impact than a false positive. Here, the main reason why the recall rate is low for Configs. 1 and 2 is that the DOA ratio is small, namely 0.1. As a result, the Random Forest model is being trained with severely imbalanced candidate datasets: 90% of the patients have the class label ‘Alive’, and only 10% have the class label ‘Dead’. To mitigate the imbalanced nature of the dataset, we undersample the ‘Alive’ patients and investigate whether that improves the recall rate. Indeed, as the dataset becomes more balanced, the recall value rises. Simultaneously, the accuracy degrades very slightly, and there is no discernible impact on precision.

Fig. 8: Config. 2: Effect of downsampling majority class in improving recall

Figures 7 and 8 show the outcomes of these experiments. Figure 7 outlines the trends seen for accuracy, precision, and recall as the sampling rate for the ‘Alive’ class is varied. At a given ‘Alive’ class sample fraction, the Random Forest model is trained with the records of all dead patients, which make up 10% of the entire candidate dataset, and with that fraction of the alive patients’ records, i.e., that fraction of the remaining 90% of the records. In every case, the model sees all of the positive samples (patients who are dead) and only the sampled share of the negative samples (patients who are alive). The highest recall is achieved at a reduced majority class sample fraction; when the majority sample fraction increases further, the recall rate comes down while accuracy improves, reaching its highest value when the entire dataset is used for training. Nearly identical trends are seen in Figure 8, which shows the effect of undersampling the majority class for the Config. 2 parameters.
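The undersampling step can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the record layout, the function name, the fixed seed, and the sample fraction of 0.5 are all assumptions chosen for the example. Only the 10%/90% class split comes from the Config. 1 setting described above.

```python
import random


def undersample_majority(records, sample_fraction, majority_label="Alive", seed=0):
    """Keep every minority record and a random fraction of the majority records."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    majority = [r for r in records if r["label"] == majority_label]
    minority = [r for r in records if r["label"] != majority_label]
    keep = round(sample_fraction * len(majority))
    training = minority + rng.sample(majority, keep)
    rng.shuffle(training)
    return training


# Hypothetical candidate dataset: 1000 patients with a DOA ratio of 0.1.
records = [{"id": i, "label": "Dead" if i < 100 else "Alive"} for i in range(1000)]

# 0.5 is an illustrative sample fraction, not a value reported in the paper.
training = undersample_majority(records, sample_fraction=0.5)
n_dead = sum(1 for r in training if r["label"] == "Dead")
n_alive = len(training) - n_dead
print(n_dead, n_alive)
```

With these assumed numbers the training set holds all 100 ‘Dead’ records (10% of the candidate dataset) and 450 ‘Alive’ records (half of the 90%), so the classes are far less imbalanced than in the full dataset.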

Controlled experiments
Fig. 9: Controlled experiments: effect of varying PT OR alone

Figure 6 depicts the performance for the parameter combinations shown in Table VI. Although those results already reveal, to a great extent, how each of the parameters affects the accuracy, precision, and recall achieved by the Random Forest models, we perform “controlled” experiments to tease apart the effect of each parameter further. In this set of experiments, we individually vary 1) the odds ratio, 2) the DOA ratio, and 3) the occurrence ratio of a predictor variable, while keeping the rest of the parameters constant.

The effect of the PT OR on predictive performance is shown in Figure 9. The PT OR is varied from 2 to 10. We notice that accuracy and recall are nearly invariant with respect to the OR. The precision measure, however, sees an uptick as the OR is increased. This is because a higher OR indicates a stronger connection between an abnormal PT value and mortality. Therefore, the model is able to bring down false positives, thereby increasing precision.
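As a reminder of what the OR measures, the sketch below computes it from a 2×2 table of abnormal/normal PT against dead/alive outcomes, using the standard definition of an odds ratio. The function name and all counts are hypothetical and serve only to show that a stronger association between abnormal PT and death yields a larger OR.

```python
def odds_ratio(abnormal_dead, abnormal_alive, normal_dead, normal_alive):
    """Odds of death with abnormal PT divided by odds of death with normal PT."""
    odds_abnormal = abnormal_dead / abnormal_alive
    odds_normal = normal_dead / normal_alive
    return odds_abnormal / odds_normal


# Hypothetical counts: 100 patients with abnormal PT and 100 with normal PT.
weak_link = odds_ratio(abnormal_dead=20, abnormal_alive=80,
                       normal_dead=10, normal_alive=90)
strong_link = odds_ratio(abnormal_dead=50, abnormal_alive=50,
                         normal_dead=10, normal_alive=90)
print(weak_link, strong_link)
```

When the OR is large, an abnormal PT value separates the two classes more cleanly, which is consistent with the rise in precision seen in Figure 9.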

Fig. 10: Controlled experiments: effect of varying DOA ratio alone

Figure 10 illustrates the impact the DOA ratio has on accuracy, precision, and recall. The DOA ratio is varied from 0.1 to 0.5 (since this is a binary classification problem, varying the DOA ratio from 0.6 to 0.9 will produce mirror-image performance: performance at a DOA ratio of r will be the same as at 1 − r). We notice that as the DOA ratio is increased, the recall rate improves because the dataset becomes more balanced: the number of records with the positive class label approaches the number of records with the negative class label. The precision measure also shows an upward trend as the DOA ratio is increased. Accuracy dips slightly as the DOA ratio becomes larger. This is explained by the fact that when the DOA ratio is small, the accuracy will be high even if the model predicts everyone as ‘Alive’. But as the DOA ratio becomes greater, the model has to learn to be more discriminating, and consequently accuracy takes a small hit.
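The accuracy argument can be made concrete with a trivial baseline that labels every patient ‘Alive’. The sketch below is illustrative only: the function name and the cohort size of 1000 are assumptions. It shows that such a baseline scores an accuracy of one minus the DOA ratio while identifying no dead patient at all.

```python
def all_alive_baseline(doa_ratio, n_patients=1000):
    """Accuracy and recall of a model that predicts 'Alive' for everyone."""
    n_dead = round(doa_ratio * n_patients)
    n_alive = n_patients - n_dead
    accuracy = n_alive / n_patients  # every alive patient is labeled correctly
    recall = 0.0                     # no dead patient is ever identified
    return accuracy, recall


baseline_accuracy = {
    doa: all_alive_baseline(doa)[0] for doa in (0.1, 0.2, 0.3, 0.4, 0.5)
}
for doa, acc in baseline_accuracy.items():
    print(f"DOA ratio {doa:.1f}: baseline accuracy {acc:.2f}")
```

At a DOA ratio of 0.1 the baseline already reaches 0.90 accuracy, whereas at 0.5 it is no better than a coin flip, so a model trained on more balanced data has less "free" accuracy to fall back on.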

Fig. 11: Controlled experiments: effect of varying PT occurrence ratio alone

In Figure 11, we see how increasing the fraction of patients that have an abnormal PT value affects performance. The abnormal PT occurrence ratio is varied from 0.1 to 0.5, and accuracy and recall are little affected by this variation. The precision, on the other hand, sees a small gain as the occurrence ratio is raised.

V Related Work

Colbaugh and Glass [12] present a method to build machine learning models to predict individual-level labels using aggregate data. The methodology consists of three steps: 1) feature extraction, 2) aggregate-based prediction, 3) prediction refinement. This approach is similar to hierarchical classification/regression: first the coarse label is predicted, and then a fine-grained label is assigned. Their method assumes that individual-level labels are present to train the model in the refinement step. Though the high-level goals of their work and our work are similar, the problem we address is distinct in that our solution approach is applicable even when labels for individual-level data are not available. Additionally, we have presented an algorithm that reconstructs individual-level data from various aggregate statistics including odds ratios.

Gary King’s book [1] illustrates various known techniques for reconstructing individual behavior from aggregate data, also termed ecological inference. The main approach advocated is the method of bounds, wherein domain-specific knowledge is used to place bounds on variables. The second principal technique is to perform ecological inference for smaller units, such as precincts, and then combine the results for a larger context, say a state. This is best illustrated with voting data. For example, Imai and King [13] apply ecological inference to the 2000 U.S. presidential election to answer the question: “Did Illegal Overseas Absentee Ballots Decide” the election? Here, the inference problem is figuring out how many valid and invalid absentee ballots were cast for each of the two presidential candidates, given only the total numbers of invalid and valid ballots and the overall number of votes each candidate received.
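One simple form of bounding for such a 2×2 problem uses only the known totals: the unknown count in a cell can be no larger than either of its marginal totals and no smaller than what is left after the other cells are filled to capacity. The sketch below illustrates this under stated assumptions. The function name is ours and the ballot counts are hypothetical, not the figures from the 2000 election or from [13].

```python
def cell_bounds(row_total, col_total, grand_total):
    """Deterministic bounds on one cell of a 2x2 table with known marginals."""
    lower = max(0, row_total + col_total - grand_total)
    upper = min(row_total, col_total)
    return lower, upper


# Hypothetical totals: 1000 absentee ballots, 300 of them invalid,
# and 800 of the 1000 cast for candidate A.
invalid_for_a = cell_bounds(row_total=300, col_total=800, grand_total=1000)
print(invalid_for_a)
```

With these assumed totals, between 100 and 300 of the invalid ballots must have gone to candidate A. Bounds of this kind narrow the set of individual-level tables that are consistent with the aggregates, without pinning down a single answer.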

Musicant and others [14] address an interesting problem of performing supervised learning by training on aggregate outputs. In their framework, the training set contains observations for which all attribute values are known, but the output variable’s value is known only in the aggregate. The need for such an analysis arose when studying mass spectrometry data. They examine how k-nearest neighbor, neural networks, and support vector machines can be adapted for this problem.

MacLeod et al. [7] analyze data collected in a prospective study on patients admitted to a Level I trauma center. They apply logistic regression using the various known acute traumatic coagulopathy predictors, and determine that coagulopathy is strongly linked to mortality in trauma patients.

VI Conclusion

In this paper, we described the architecture of a system for performing analysis of aggregate statistics. We developed a novel algorithm to reconstruct individual-level data from aggregate data. To increase the confidence in the data analysis performed, we set up the data reconstruction algorithm to compute a parametric number of individual-level datasets. Furthermore, this step is completely parallel, which makes the generation of datasets efficient. The datasets are then used to train several machine learning models concurrently to discover knowledge. We performed extensive experiments to evaluate the predictive performance of the system in the context of a medical condition called Acute Traumatic Coagulopathy. The experimental results indicate that the system achieves good performance, thereby showing that the end-to-end aggregate data analytics system can be reliably used to extract knowledge from aggregate data.


  • [1] G. King, A solution to the ecological inference problem: Reconstructing individual behavior from aggregate data. Princeton University Press, 2013.
  • [2] K. El Emam, E. Jonker, L. Arbuckle, and B. Malin, “A systematic review of re-identification attacks on health data,” PloS one, vol. 6, no. 12, p. e28071, 2011.
  • [3] D. Dirkmann, A. A. Hanke, K. Görlinger, and J. Peters, “Hypothermia and acidosis synergistically impair coagulation in human whole blood,” Anesthesia & Analgesia, vol. 106, no. 6, pp. 1627–1632, 2008.
  • [4] J. R. Hess, K. Brohi, R. P. Dutton, C. J. Hauser, J. B. Holcomb, Y. Kluger, K. Mackway-Jones, M. J. Parr, S. B. Rizoli, T. Yukioka et al., “The coagulopathy of trauma: a review of mechanisms,” Journal of Trauma and Acute Care Surgery, vol. 65, no. 4, pp. 748–754, 2008.
  • [5] Centers for Disease Control and Prevention (CDC), “Vital signs: Unintentional injury deaths among persons aged 0-19 years - United States, 2000-2009,” MMWR. Morbidity and mortality weekly report, vol. 61, p. 270, 2012.
  • [6] M. A. Meledeo, M. C. Herzig, J. A. Bynum, X. Wu, A. K. Ramasubramanian, D. N. Darlington, K. M. Reddoch, and A. P. Cap, “Acute traumatic coagulopathy: The elephant in a room of blind scientists,” Journal of Trauma and Acute Care Surgery, vol. 82, no. 6S, pp. S33–S40, 2017.
  • [7] J. B. MacLeod, M. Lynn, M. G. McKenney, S. M. Cohn, and M. Murtha, “Early coagulopathy predicts mortality in trauma,” Journal of Trauma and Acute Care Surgery, vol. 55, no. 1, pp. 39–44, 2003.
  • [8] H. W. Kuhn, “The Hungarian method for the assignment problem,” Naval Research Logistics (NRL), vol. 2, no. 1-2, pp. 83–97, 1955.
  • [9] “Apache Spark: Lightning-fast cluster computing,” https://spark.apache.org/.
  • [10] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, A. Ghodsi, J. Gonzalez, S. Shenker, and I. Stoica, “Apache Spark: A unified engine for big data processing,” Commun. ACM, vol. 59, no. 11, pp. 56–65, Oct. 2016.
  • [11] A. Liaw, M. Wiener et al., “Classification and regression by randomForest,” R news, vol. 2, no. 3, pp. 18–22, 2002.
  • [12] R. Colbaugh and K. Glass, “Learning about individuals’ health from aggregate data,” in Engineering in Medicine and Biology Society (EMBC), 2017 39th Annual International Conference of the IEEE. IEEE, 2017, pp. 3106–3109.
  • [13] K. Imai and G. King, “Did illegal overseas absentee ballots decide the 2000 US presidential election?” Perspectives on Politics, vol. 2, no. 3, pp. 537–549, 2004.
  • [14] D. R. Musicant, J. M. Christensen, and J. F. Olson, “Supervised learning by training on aggregate outputs,” in Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on. IEEE, 2007, pp. 252–261.