Enabling Personalized Decision Support with Patient-Generated Data and Attributable Components

11/22/2019 ∙ by Elliot G Mitchell, et al. ∙ University of Colorado Denver ∙ NYU college ∙ Columbia University ∙ California Institute of Technology

Decision-making related to health is complex. Machine learning (ML) and patient-generated data can identify patterns and insights at the individual level, where human cognition falls short, but not all ML-generated information is of equal utility for making health-related decisions. We develop and apply attributable components analysis (ACA), a method inspired by optimal transport theory, to type 2 diabetes self-monitoring data to identify patterns of association between nutrition and blood glucose control. In comparison with linear regression, we found that ACA offers a number of characteristics that make it promising for use in decision support applications. For example, ACA was able to identify non-linear relationships, was more robust to outliers, and offered broader and more expressive uncertainty estimates. In addition, our results highlight a tradeoff between model accuracy and interpretability, and we discuss implications for ML-driven decision support systems.




1 Introduction

In complex domains like health, it can be difficult to anticipate the consequences of daily choices on short- and long-term health status. Collecting and analyzing data about behaviors and indicators of health can elucidate patterns of association between a behavior and a range of outcomes. Thanks to wearable sensors and mobile health applications, patient-generated health data can be collected more easily than ever, but questions remain about how to incorporate these data into health decisions [Genes et al., 2018, Rehg et al., 2017, Fiordelli et al., 2013].

One area where patient-generated data holds promise to inform decision-making is type 2 diabetes self-management. In type 2 diabetes, a key goal of self-management is keeping blood glucose (BG) within target ranges. Daily behaviors like diet have a direct relationship with BG levels. Importantly, different individuals have different glycemic responses to different foods [Zeevi et al., 2015], emphasizing a need for personalization [American Diabetes Association, 2018]. Estimating the impact of a meal on BG is difficult, even for experts [Mamykina et al., 2016]. Machine learning (ML) may be better suited to identify consistent patterns than human reasoning [Albers et al., 2017].

Using patient-generated data for personalized analysis in the context of nutrition and BG, however, poses challenges. BG measurements and meals need to be actively tracked by users, which requires effort. Fingers need to be pricked to record BG, and meal details need to be entered. Because of the burden of entry, these data points are incomplete and non-randomly missing [Cordeiro et al., 2015a]. With nutrition logging in particular, there is a tradeoff between the time and effort of logging and the detail and accuracy of the nutrition information logged [Cordeiro et al., 2015b, Andrew et al., 2013]. Gold standard nutrition evaluations require analysis in a specialized lab, which is unavailable for patient-generated meal logs. In addition, glucometers can be miscalibrated, and users can mistype entries, leading to both systematic bias and random errors. Glucose dynamics themselves are non-linear, oscillatory, noisy, and depend on individual characteristics [Ismail-Beigi, 2012, Albers et al., 2017]. Similar to the data quality concerns of electronic health records, the incompleteness, inaccuracy, complexity, and bias of patient-generated data create challenges for accurately representing a patient’s state [Hripcsak and Albers, 2013, Weiskopf and Weng, 2013]. Still, prior work has demonstrated that accurate inference can be possible with similar data sets [Albers et al., 2017, 2018].

In addition to the challenges of the data, though, designing analysis for decision support tools brings its own substantial challenges. Algorithms need to be able to run as a part of an automated system, identifying complex relationships while being robust to outliers. In addition, it’s important for the output to be interpretable, so that it can be translated into useful and actionable support. Even the most accurate ML machinery is not helpful if it cannot affect decision-making or be transformed into an understandable action. Quantifying uncertainty is an important part of interpretability, so that the model output can be weighed appropriately in the decision-making process [Cabitza et al., 2019, 2017].

There is a need for methods that address these challenges. Optimal transport is a theory that offers tools to estimate and compare probability distributions [Peyré and Cuturi, 2018, Villani, 2009]. In its original formulation, optimal transport sought to optimize the transportation of goods and resources, but it has since been applied to many problems in areas like computer vision and machine learning [Peyré and Cuturi, 2018]. Optimal transport is particularly useful for data where values are highly individualized, as in medicine [Albers et al., 2014]. Blood pressure, for instance, may be related to many factors like age, exercise, diet, sex, prescribed drugs, and the device used to take the measurement. Here we adapt an optimal transport-based method invented by Tabak and Trigila [2018] termed attributable components analysis (ACA). This method was created to explain the variability in a quantity of interest based on a set of related or potentially confounding covariates, or “attributable components”. Each component represents a contribution to the observed variability while simultaneously filtering out irrelevant effects to focus on a particular relationship.

Here, we apply an adapted version of the ACA method to type 2 diabetes self-monitoring data, using ACA to estimate the mean glycemic impact of a meal—the difference between pre-meal and post-meal measurements—based on the meal’s macronutrient composition. By estimating how each attributable component, in this case each macronutrient, contributes to the variability in BG after a meal, ACA can identify patterns of association between each macronutrient and expected mean BG impact. To better understand and convey how ACA performs for this task, we compare its output to linear regression. We then discuss how these estimates can be used as input to decision support systems, for example finding personalized ranges of macronutrient values where BG impact is expected to be higher or lower, to inform clinical care or create personalized meal plans.

2 Materials and Methods

2.1 Data Set

The data used in this research originates from prior user studies of a smartphone application for diabetes self-monitoring. In the application, participants logged meals and BG readings. To log a meal, users captured a photograph of the meal, assigned a category of the meal (breakfast, lunch, dinner, or a snack) and entered a free-text description of the meal contents. Users entered pre-meal BG readings when logging the meal. Two hours after each meal, users received a prompt to record and enter their post-meal BG reading. Later, each meal was evaluated by a registered dietitian (RD) who performed a nutrient assessment of the meal using a standard protocol and the USDA food composition database [USDA, , Ahuja et al., 2013]. The RD recorded the carbohydrates, fat, protein, and fiber, in grams, as well as the total calories of the meal.

Data came from 40 users who used the smartphone application for 4 to 12 weeks in a separate IRB-approved study. Each participant consented for their data to be re-used in future research. In this analysis, we included all participants with 30 or more total meals logged, and considered only the meals with both pre- and post-meal BG readings, for a total of 16 users.

2.2 Descriptive Statistics

The 16 users with type 2 diabetes collected a median of 67 meals over 4 to 12 weeks. As seen in Figure 1, most users logged close to the median number of meals, with a few users logging considerably more. As shown in Figure 2, users varied substantially in their BG levels before and after meals.

Figure 1: Kernel density estimate of the number of users with n-many meals in the data set. The mass of the distribution sits near the median of 67 meals logged, with a long tail of users logging considerably more.
Figure 2: Violin plots showing the distribution of blood glucose readings across all users. Users varied considerably in their blood glucose levels before and after meals.

Two users, 56 and 1821, were chosen for a detailed inspection of model performance because they were representative of the overall data set, but differed from each other in BG control and macronutrient consumption patterns. Users 56 and 1821 logged a total of 58 and 88 meals over 4 and 12 weeks, respectively. See Table 1 for a detailed breakdown by meal type. As seen in Figure 3, user 56 had less variability in BG impacts compared to user 1821. Figure 4 shows kernel density estimates of the macronutrient features for both users. Shown side by side, these densities show variability between and within each user. For example, user 56 eats 25 grams of carbohydrates at lunch most of the time, while user 1821 has much more variability in their lunchtime carbohydrate intake. An important artifact and limitation is that nutrition evaluations only allowed up to 100 grams of each macronutrient to be entered. User 1821 regularly ate 100 grams or more of carbohydrates at dinner.

User ID    Meal Type    Count
56         Breakfast    13
           Lunch        10
           Dinner       23
           Other        12
           Overall      58
1821       Breakfast    16
           Lunch        19
           Dinner       44
           Other        9
           Overall      88
Table 1: Count of meals of each meal type for users 56 and 1821.
Figure 3: A histogram of BG impacts for users 56 and 1821. User 56 had less variability in BG impacts compared to user 1821.
Figure 4: Kernel density estimate plots of macronutrient consumption for users 56 and 1821. There is variability in macro consumption between and within each user. Note that nutrition evaluations only allowed up to 100 grams of each macronutrient, and user 1821 regularly ate 100 grams or more of carbohydrates at dinner.

2.3 Feature Selection

We experimented with different representations of features to predict BG impact. We began with the three main macronutrients—carbohydrates, fat, and protein—represented as their weight in grams, or their proportion of each meal’s calories. ACA performed slightly better when representing macronutrients as proportions than as grams, but we opted to use grams because we thought this would be more useful for decision support. In an effort to make decisions more straightforward, nutrition education in diabetes emphasizes the importance of macronutrients, and usually focuses on amounts of foods with units like grams, not their contribution to calories [Wheeler et al., 2014]. While some materials like the USDA’s MyFoodPlate are based on the proportion of the plate filled with different foods, the proportion of calories is very different than the volume a food takes up on a plate. (Consider 1 stick of butter vs. 4 cups of raw spinach.) And finally, representing macronutrients as proportions means that the values sum to one, which introduces strong multicollinearity that creates challenges for inference with linear regression.
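The multicollinearity point can be checked numerically. The sketch below uses made-up meals, not study data, and assumes the commonly used energy densities of 4, 9, and 4 kcal per gram for carbohydrate, fat, and protein; the helper name `design_rank` is hypothetical. It shows that a regression design matrix with an intercept keeps full column rank when macronutrients are expressed in grams, but loses a rank when they are expressed as calorie proportions, because the proportions sum to one and therefore reproduce the intercept column.

```python
import numpy as np

# Hypothetical meals: grams of carbohydrate, fat, protein (illustrative values only).
grams = np.array([
    [45.0, 10.0, 20.0],
    [30.0, 25.0, 15.0],
    [60.0,  5.0, 10.0],
    [20.0, 30.0, 35.0],
    [75.0, 15.0, 25.0],
])

# Assumed energy densities in kcal per gram (carbohydrate, fat, protein).
KCAL_PER_GRAM = np.array([4.0, 9.0, 4.0])

calories = grams * KCAL_PER_GRAM
proportions = calories / calories.sum(axis=1, keepdims=True)


def design_rank(features):
    """Rank of the design matrix [1, features] used by a regression with intercept."""
    X = np.column_stack([np.ones(len(features)), features])
    return int(np.linalg.matrix_rank(X))


rank_grams = design_rank(grams)              # four columns, all independent
rank_proportions = design_rank(proportions)  # four columns, one is redundant
print(rank_grams, rank_proportions)
```

With a rank-deficient design, the regression coefficients are not uniquely determined, which is the inference problem the paragraph above refers to.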

In addition to the three macronutrients, we also included fiber and pre-prandial BG as features. We included fiber because increasing fiber is a common recommendation for individuals with diabetes [Anderson et al., 2004]. We included pre-prandial BG because of its relationship with post-prandial BG. Glucose dynamics at their simplest consist of a glycemic response to nutrition. Because of this, to infer glycemic response to nutrition—to solve the equations uniquely—we need the initial state (pre-prandial glucose), the kick (nutrition consumption), and the response (post-prandial glucose).

A particular challenge of type 2 diabetes self-monitoring data is representing impact of a particular meal on BG, or the glycemic impact. An optimal sampling rate for BG is on the order of minutes, not hours [Breton et al., 2008, Gough et al., 2003]. A single reading two hours after the meal is the clinical standard for postprandial measurement [Aschner, 2017] but is not well suited to capture the fluctuations in BG after a meal. Even with appropriately sampled continuous glucose monitor (CGM) data, it’s not clear which features are most important to diabetes-related complications; the highest peak in blood glucose, the integral of the glycemic curve from the mean to some time after the meal, the average value over time, or the speed of oscillations following a meal are different ways of representing BG impact, with different potential physiologic implications. While more frequent or continuous measurement would be preferred from a data standpoint, checking BG 6-10 times per day is recommended for those on insulin therapy, and there is no recommendation for those not on insulin [American Diabetes Association, 2018]. Here, we follow the standard practice for postprandial measurement, and take the difference of post-meal BG minus pre-meal BG to represent the glycemic impact of a meal.

2.4 Attributable Components

Attributable component analysis [ACA; Tabak and Trigila, 2018] is a methodology for explaining the potentially nonlinear variability in a quantity of interest, $x$, in terms of covariates $z = (z_1, \dots, z_L)$. The method is highly motivated by theory and ideas from optimal transport [Villani, 2009, Santambrogio, 2010]. In our application, $x$ represents the glycemic impact, and $z$ the macronutrient content of a meal. The covariates $z_l$ can be categorical (such as “meal”, with values in $\{\text{breakfast}, \text{lunch}, \text{dinner}, \text{snack}\}$), real (such as “total amount of carbohydrates”) or, in fact, of nearly any type. The output of attributable component analysis is $\bar{x}(z)$, the conditional expectation of $x$ with respect to covariates $z$; this conditional mean is provided as a sum of components, which can be thought of as modes of variability. Each component is represented by the product of one-dimensional functions of each covariate $z_l$.

A more detailed explanation of ACA is provided in Appendix A, but a summary is provided here.

Given a set of $n$ observations of the variable of interest and covariates, $\{x^i, z_1^i, \dots, z_L^i\}_{i=1}^{n}$, the ACA algorithm seeks to estimate the conditional mean $\bar{x}(z)$ with the following equation:

$$\bar{x}(z) \approx \sum_{k=1}^{d} \prod_{l=1}^{L} V_l^k(z_l) \qquad (1)$$

Each $k$ indexes a component of the variability in $x$, the $V_l^k$’s are essentially basis functions that represent the variability, and can be represented by many classes of functions, e.g., as the sum of the product of sinusoidal functions in the case of Fourier decomposition (cf. Appendix [ACA; Tabak and Trigila, 2018]), and when and otherwise.

The complete estimate of $\bar{x}(z)$ based on all features is useful but, being a probability distribution, is difficult to translate into useful recommendations because of its complexity and dimensionality. To address this problem, we instead use the marginal dependence, which translates an $L$-dimensional function into a one-dimensional function.

2.4.1 Interpretability through marginalization

We make the ACA output more interpretable for decision-making by “marginalizing” the ACA output function. To understand what this means, why this is necessary, and how this works, begin with the ACA estimated conditional mean $\bar{x}(z)$ that adopts the form in Equation 1, where the $V_l^k$ are found by the algorithm and their values at arbitrary $z_l$ are known via interpolation on grids or prototypal analysis. Even though this estimation allows us to make predictions for new values of $z$, its complexity makes it difficult to interpret. For example, if we limit the covariates to only binary forms, e.g., increases or decreases, then there are $2^L$ combinations of actions a person must interpret and choose among; this is too complex. Because the point of this intervention is to help people understand glycemic impacts of nutrition to make balanced choices that are sustainable behaviorally, we must translate ACA output into a simpler form, one where the impact of a single covariate is considered at a time, leading to only $2L$ different options. We can do this by asking simpler questions, such as: averaging over all other covariates, how does $x$ depend on a specific $z_l$ or small set thereof? Such questions ask us to marginalize the full estimated conditional mean, and the separated form of the estimation makes it straightforward to perform this task. In order to find the marginal dependence of $x$ on a group of covariates denoted by $\tilde{z}$, with $\tilde{z} = \{z_l : l \in \tilde{L}\}$ and $\tilde{L} \subset \{1, \dots, L\}$, one has

$$\bar{x}(\tilde{z}) = \sum_{k=1}^{d} \left[ \prod_{l \in \tilde{L}} V_l^k(z_l) \right] \left[ \frac{1}{n} \sum_{i=1}^{n} \prod_{l \notin \tilde{L}} V_l^k(z_l^i) \right] \qquad (2)$$

where the average runs over the $n$ observations.

In this case, $\bar{x}(\tilde{z})$ represents a function that captures the impact of a particular subset of features on $x$. For a single covariate of interest $z_l$, $\bar{x}(z_l)$ is a one-dimensional function that captures the impact that one covariate, for example fat, has on glycemic impact. In Figs. 5, 6, and 7, where we compare the ACA to linear regression, the one-dimensional ACA output shown is $\bar{x}(z_l)$ as opposed to the full ACA model $\bar{x}(z)$.
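The marginalization of a separated model (a sum over components of products of one-dimensional functions) can be sketched in a few lines. This is a minimal illustration, not the ACA fitting algorithm: the component functions below are hand-written stand-ins for what ACA would estimate, the data are synthetic, and the names `full_model` and `marginal` are hypothetical. The point it shows is that, because each component is a product, averaging over the other covariates only requires averaging the other factors once per component.

```python
import numpy as np


def full_model(components, Z):
    """Evaluate sum_k prod_l V[k][l](Z[:, l]) for every row of Z."""
    Z = np.asarray(Z, dtype=float)
    total = np.zeros(Z.shape[0])
    for component in components:
        term = np.ones(Z.shape[0])
        for l, V in enumerate(component):
            term *= V(Z[:, l])
        total += term
    return total


def marginal(components, Z, j, grid):
    """Dependence on covariate j, averaging the other factors over the sample Z."""
    Z = np.asarray(Z, dtype=float)
    grid = np.asarray(grid, dtype=float)
    out = np.zeros(grid.shape[0])
    for component in components:
        others = np.ones(Z.shape[0])
        for l, V in enumerate(component):
            if l != j:
                others *= V(Z[:, l])
        out += component[j](grid) * others.mean()
    return out


# Two hypothetical components over two covariates (carbohydrates, fat), in grams.
components = [
    (lambda c: 0.3 * c, lambda f: np.exp(-f / 50.0)),
    (lambda c: np.sin(c / 20.0), lambda f: 0.1 * f),
]

rng = np.random.default_rng(0)
Z = rng.uniform(0, 100, size=(40, 2))  # synthetic meals, not study data
grid = np.linspace(0, 100, 6)
carb_marginal = marginal(components, Z, j=0, grid=grid)
print(carb_marginal)
```

The result matches the brute-force alternative of fixing the covariate of interest at each grid value, evaluating the full model on every observation, and averaging.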

2.5 Linear Regression

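As a comparison method, we fit the data with multiple linear regression

$$x = \beta_0 + \beta_1 z_1 + \beta_2 z_2 + \dots + \beta_L z_L \qquad (3)$$

where $x$ is the quantity of interest, $z_1, \dots, z_L$ are covariates, and $\beta_0$ is the intercept term. More compactly

$$x = \beta_0 + \sum_{l=1}^{L} \beta_l z_l \qquad (4)$$

We then find the best fit using the ordinary least squares method [Mendenhall and Sincich, 1997].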
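An ordinary least squares fit of this kind can be sketched with NumPy. The data here are synthetic and noise-free, and the function names `fit_ols` and `marginal_line` are hypothetical; the original analysis was run in MATLAB, so this is an illustration of the technique rather than the authors' code. The sketch also evaluates the fitted line along one covariate with the remaining covariates held at their sample means, which for a linear model is the same as averaging the predictions over the sample.

```python
import numpy as np


def fit_ols(Z, x):
    """Ordinary least squares with an intercept; returns [beta_0, beta_1, ..., beta_L]."""
    X = np.column_stack([np.ones(len(Z)), Z])
    beta, *_ = np.linalg.lstsq(X, x, rcond=None)
    return beta


def marginal_line(beta, Z, j, grid):
    """Fitted line along covariate j with the other covariates at their sample means."""
    means = Z.mean(axis=0)
    others = beta[1:] @ means - beta[1 + j] * means[j]
    return beta[0] + others + beta[1 + j] * np.asarray(grid, dtype=float)


# Synthetic, noise-free data (illustrative only): carbohydrates and fat in grams.
rng = np.random.default_rng(0)
Z = rng.uniform(0, 100, size=(50, 2))
x = 5.0 + 0.4 * Z[:, 0] - 0.1 * Z[:, 1]

beta = fit_ols(Z, x)
grid = np.linspace(0, 100, 5)
carb_line = marginal_line(beta, Z, j=0, grid=grid)
print(beta, carb_line)
```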

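As with ACA, to improve the interpretability of the output, we fit the model with all covariates, $z$, but marginalize to consider a specific $z_l$ (or small subset) by averaging over the other covariates. To compute the marginal dependence of $x$ on a group of covariates denoted by $\tilde{z}$, with $\tilde{z} = \{z_l : l \in \tilde{L}\}$ and $\tilde{L} \subset \{1, \dots, L\}$, one has

$$\bar{x}(\tilde{z}) = \beta_0 + \sum_{l \in \tilde{L}} \beta_l z_l + \sum_{l \notin \tilde{L}} \beta_l \bar{z}_l \qquad (5)$$

where $\bar{z}_l$ is the sample mean of covariate $z_l$.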

2.6 Translating Inference-Based Analysis – ACA and Linear Regression – to Decision Support

The outcome of the marginalization calculation in Eq. 2 and the linear regression in Eq. 5 is a one-dimensional graph, e.g., Figure 5, where the macronutrient is given on the x-axis as the independent variable or covariate and the y-axis is the glycemic impact. This plot is not, alone, useful for making decisions for most patients, clinicians, or machines. Instead, the plot needs translation and additional information. The missing information is the clinically derived understanding of what is a good or bad glycemic impact, or what the gradations of good and bad glycemic impact are and at what resolution (e.g., hundreds of categories or two) they are most useful for making decisions. For example, one approach would be to determine a clinically significant threshold for BG impact to keep individuals below. Then one draws a horizontal line that identifies ranges of each macronutrient where mean BG impact is expected to be above or below the threshold. It is then these ranges that could be useful for patients, educators, or providers in setting a personalized nutritional plan [American Diabetes Association, 2018], or as input to another system that recommends recipes or meal plans with nutritional constraints. Using a simple threshold highlights differences between ACA and linear regression. With a linear relationship, regression can only identify at most one higher-impact and one lower-impact range. Because ACA is non-linear, it could identify multiple higher- and lower-impact ranges, which could potentially be more beneficial or meaningful for decision making. A similar issue arises with uncertainty. When a method is wrong too often (e.g., no better than chance), people stop trusting it and begin to ignore it. The quantification of better than chance is expressed with uncertainty, and if a model cannot accurately estimate uncertainty, it will produce recommendations that are of little use.

While we do not go all the way to translating the output for practical use in this paper, we mention it to provide context for the evaluation metrics and questions discussed below.
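The threshold approach described in this section can be sketched as a small routine that scans a one-dimensional marginal curve and reports the macronutrient ranges where the expected BG impact exceeds a threshold. The threshold of 30, the grid, and both curves are invented for illustration, and `ranges_above` is a hypothetical helper; choosing a clinically meaningful threshold is outside the scope of this sketch.

```python
import numpy as np


def ranges_above(grid, curve, threshold):
    """Return (start, end) grid intervals where the curve exceeds the threshold."""
    ranges = []
    start = None
    previous = None
    for value, level in zip(grid, curve):
        if level > threshold and start is None:
            start = value  # entering a higher-impact range
        elif level <= threshold and start is not None:
            ranges.append((float(start), float(previous)))  # leaving the range
            start = None
        previous = value
    if start is not None:  # curve is still above the threshold at the last grid point
        ranges.append((float(start), float(previous)))
    return ranges


grid = np.arange(0, 101, 10)  # grams of a macronutrient
threshold = 30.0              # assumed BG impact threshold, for illustration only

linear_curve = 0.5 * grid     # a straight line crosses the threshold once
nonlinear_curve = np.array([10, 35, 40, 20, 15, 10, 25, 45, 50, 20, 10], dtype=float)

linear_ranges = ranges_above(grid, linear_curve, threshold)
nonlinear_ranges = ranges_above(grid, nonlinear_curve, threshold)
print(linear_ranges)     # [(70.0, 100.0)]
print(nonlinear_ranges)  # [(10.0, 20.0), (70.0, 80.0)]
```

The straight line yields a single higher-impact range, while the made-up non-linear curve yields two, which is the contrast drawn between linear regression and ACA above.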

2.7 Uncertainty Estimates

We used several bootstrapping algorithms to estimate uncertainty of the regressions. Specifically, we used the bootstrap to estimate distributions of regression coefficients, allowing us to estimate the variability of the estimate. Given this distribution we can calculate quantities that characterize the uncertainty; here we focus on confidence intervals over the range of input values. Often, bootstrapping is accomplished by drawing multiple samples with replacement from the data set and computing the estimate for each resampled data set [Davison and Hinkley, 1997]. Empirical confidence intervals can be calculated from the distribution of estimates. In addition, ACA is stochastic, with a random initial state, so we can estimate the variability through repeated calculations with the same subset but different starting states, carving out the error surface that defines the uncertainty. We experimented with both methods for bootstrapping ACA, and the results were nearly identical. We opted for the typical approach of bootstrapping via multiple subsamples so that we could apply the same bootstrapping procedure for both methods, because linear regression is not stochastic.

A second question is the size of the bootstrap samples. A common approach is for each bootstrap sample to have the same number of data points as the original data set. Because data sets for some of the users were quite small, there were advantages to using larger bootstrap samples. For example, bootstrap samples may have very few unique data points. This negatively impacts the performance of the model, and poses challenges for aggregating variance estimates across the complete range of feature values. Larger bootstrap samples can improve model performance, and help ensure that estimates cover the full range of independent variable values; of course bootstrap ensembles cannot represent the tails of distributions that are not observed in the data, and can underestimate variance. We experimented with the original size of the dataset, 100, and 500 data points, and found that a bootstrap sample size of 500 performed well for both ACA and regression.

A third question is how many bootstrap iterations to run. 100 iterations has been suggested as a minimum for variance estimations, but it depends on the situation [Davison and Hinkley, 1997]. We inspected the change in variance across all iterations after each subsequent bootstrap iteration to look for convergence. We experimented with up to 200 iterations and found that 100 iterations were sufficient for variance to converge.
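The bootstrap choices discussed in this section (resampling with replacement, 500 points per sample, 100 iterations, and a convergence check on the variance) can be sketched as follows. The data are synthetic, ordinary least squares stands in for whichever model is being bootstrapped, and all function names are hypothetical; the original analysis was run in MATLAB, so this is only an illustration under those assumptions.

```python
import numpy as np


def ols_fit_predict(Z, x, Z_new):
    """Fit OLS with an intercept on (Z, x) and predict at Z_new."""
    X = np.column_stack([np.ones(len(Z)), Z])
    beta, *_ = np.linalg.lstsq(X, x, rcond=None)
    return np.column_stack([np.ones(len(Z_new)), Z_new]) @ beta


def bootstrap_curves(Z, x, fit_predict, Z_new, n_iter=100, sample_size=500, seed=0):
    """Refit on samples drawn with replacement; one predicted curve per iteration."""
    rng = np.random.default_rng(seed)
    curves = np.empty((n_iter, len(Z_new)))
    for b in range(n_iter):
        idx = rng.integers(0, len(x), size=sample_size)  # indices with replacement
        curves[b] = fit_predict(Z[idx], x[idx], Z_new)
    return curves


def percentile_interval(curves, level=0.95):
    """Empirical confidence band from the distribution of bootstrap curves."""
    alpha = (1.0 - level) / 2.0
    return np.quantile(curves, alpha, axis=0), np.quantile(curves, 1.0 - alpha, axis=0)


def running_variance(curves):
    """Mean variance across grid points after 2, 3, ..., n_iter iterations."""
    return np.array([
        curves[:b].var(axis=0, ddof=1).mean() for b in range(2, len(curves) + 1)
    ])


# Synthetic data (illustrative only): 60 meals, two covariates, noisy linear response.
rng = np.random.default_rng(1)
Z = rng.uniform(0, 100, size=(60, 2))
x = 2.0 + 0.3 * Z[:, 0] - 0.1 * Z[:, 1] + rng.normal(0, 5, size=60)

# Evaluate along the first covariate, holding the second at its sample mean.
Z_new = np.column_stack([np.linspace(0, 100, 11), np.full(11, Z[:, 1].mean())])

curves = bootstrap_curves(Z, x, ols_fit_predict, Z_new)
lower, upper = percentile_interval(curves)
variance_trace = running_variance(curves)
```

Plotting `variance_trace` against the iteration count is one way to look for the flattening that indicates convergence.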

All analysis was performed in MATLAB 2016b (9.1). Additional plots and descriptive statistics were produced in R v3.3.2 with tidyverse v1.1.1.

2.8 Experimental Design

We estimate ACA and linear regression on all of the data sets, as well as data subsets by meal type (breakfast, lunch, and dinner). To estimate confidence intervals, we performed a bootstrap with 100 iterations, based on the procedure described in section 2.7. Each bootstrap sample had 500 data points, and the same samples were used to fit ACA and linear regression. 95% confidence intervals were determined empirically from the aggregated bootstrap output.

We then produced a series of plots for each user and closely inspected them for the two users described in section 2.2. Each plot included an individual feature on the horizontal axis, with BG impact on the vertical axis, the actual data points, and average fit of ACA and linear regression with confidence intervals. With each of the 5 features for the overall data sets and the 3 meal-type subsets across two users, there were a total of 40 plots. See Figure 5 in the Results for an example.

2.9 Evaluation

To compare the performance of the two models we calculated the root mean squared error (RMSE) of the data fit for both ACA and linear regression.

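RMSE for the overall model:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( x^i - \bar{x}(z^i) \right)^2}$$

RMSE for the marginals:

$$\mathrm{RMSE}_l = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( x^i - \bar{x}(z_l^i) \right)^2}$$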

In addition, we qualitatively inspected the plots for evidence of non-linear relationships, and examined the situations where the two models agreed and disagreed. To quantify non-linear relationships, we heuristically evaluated the plots to tally the number of data sets where the average fit line of ACA had more than a 10-degree bend.

To quantify differences in the uncertainty calculations between the two methods, and to assess the coherence and usefulness of the confidence intervals, we calculated the percentage of data points falling within the confidence interval across all data sets.
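Both quantitative measures used in the evaluation are short computations. The sketch below uses four made-up observations and a made-up band of plus or minus 3 around the predictions; the function names `rmse` and `coverage_percent` are hypothetical.

```python
import numpy as np


def rmse(observed, predicted):
    """Root mean squared error between observations and model predictions."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))


def coverage_percent(observed, lower, upper):
    """Percentage of observations that fall inside the band [lower, upper]."""
    observed = np.asarray(observed, dtype=float)
    inside = (observed >= np.asarray(lower)) & (observed <= np.asarray(upper))
    return float(100.0 * inside.mean())


# Made-up BG impacts and predictions (illustrative only).
observed = np.array([10.0, 20.0, 30.0, 40.0])
predicted = np.array([12.0, 18.0, 33.0, 36.0])

error = rmse(observed, predicted)  # sqrt((4 + 4 + 9 + 16) / 4) = sqrt(8.25)
coverage = coverage_percent(observed, predicted - 3.0, predicted + 3.0)
print(round(error, 3), coverage)   # 2.872 75.0
```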

3 Results

3.1 Evaluation

As shown in Table 2, the RMSE for the full ACA model was significantly lower than for linear regression, by a factor of roughly 7, with a standard deviation similarly lower by a factor of roughly 3.

        ACA            Linear regression
RMSE    4.36 ± 3.40    29.15 ± 10.02
Table 2: Root mean squared error (RMSE) for ACA and linear regression, for the full model with all covariates

However, as shown in Table 3 examining the marginal output that considers one feature at a time, linear regression outperforms ACA in RMSE by 2 to 7 mg/dl for breakfast, lunch, and dinner meals, while ACA slightly outperforms linear regression for analysis when all meals are pooled together. The explanation: ACA, being a complex nonlinear regression, is more data-hungry than linear regression, and because it underperforms linear regression for a single meal but outperforms for three meals, it needs at most three times the data to have a lower RMSE than linear regression.

Meal type    ACA             Linear regression
breakfast    28.81 ± 16.2    26.27 ± 14.3
lunch        35.06 ± 18.0    32.62 ± 16.0
dinner       40.21 ± 26.1    33.60 ± 20.3
overall      37.21 ± 21.3    37.44 ± 21.4
Table 3: Root mean squared error (RMSE) for ACA and linear regression, for the marginal model considering one covariate at a time

The difference between ACA and the marginalized ACA – that ACA itself produces very accurate representations of the data while the marginalization is substantially less accurate – has important implications. First, this difference shows that there is substantial correlation between the covariates; this is not surprising because individual meals are combinations of food items, which in turn have combinations of macronutrients, suggesting that the macronutrients in a meal are not independent of each other. Second, it is clear that because of the systematic relationships between covariates, there is predictive information that we are not using to help people make decisions. The problem of course, is that the full portrait of how these covariates influence glycemic impact is a complex mathematical object. And to be useful in practice there is an imposed tradeoff that is not about algorithmic accuracy, but about human factors: we need the algorithm to be accurate but we must balance accuracy against the ability to use the output of the algorithm to make decisions. And this leads us to the third implication of the difference between the ACA and its marginalized form: we must find a way to exploit this yet-unused predictive information in a way that also allows for useful decision making.

3.1.1 Non-linear relationships

In some situations, ACA did identify non-linear relationships between macronutrients and BG impact, as shown in Figure 5. Because of the regularization built into ACA, most of the identified trends were linear, but some were non-linear. Non-linear relationships may be expected in some situations because of complexity of BG dynamics. Linear regression, of course, would by definition never be able to find a non-linear relationship.

Figure 5: Comparison of ACA and linear regression for user 56 and the relationship between carbohydrates and BG impact, across all meals. In this case, ACA identifies a non-linear relationship, while linear regression does not.

3.1.2 Outliers and Errors

When inspecting the plots, we found that some data sets had outliers that were clearly errors. For example, User 56’s data had two meals recorded with 50 grams of fiber. These points are clearly errors not only because they are visibly separated from the rest of the data, but also because 50 grams is the default value for nutrient assessments by RDs, and 50 grams of fiber is an infeasible amount to eat in one sitting. The recommended amount of fiber is 38 grams per day for men, and 95% of adults don’t manage to eat the recommended amount of fiber; 50 grams of fiber would be over 3 cups of lentils. As shown in Figure 6, linear regression is unable to ignore the outliers, and continues the downward trend beyond what is reasonable. ACA, on the other hand, also finds a slight downward trend in the non-outlier data, but evens out to be flat (showing no relationship) over the sparsely populated region before the outliers. ACA is a more robust estimator [Huber, 2011] than linear regression.

Figure 6: Comparison of ACA and linear regression for user 56, and the relationship between fiber and BG impact, across all meals. ACA shows no trend leading out to the outlier data points with 50 grams of fiber, while linear regression continues a downward trend beyond what is reasonable.

3.1.3 Uncertainty

One of the most drastic differences between ACA and linear regression was in the size and variability of the confidence intervals. Confidence intervals for ACA were broad, and varied in their width across data sets. In some instances, ACA would have a relatively narrow confidence interval, suggesting a higher degree of certainty in the identified trend. In other situations, though, ACA has broad confidence intervals, encapsulating most of the data sets, suggesting a low degree of confidence in the identified trend. On the other hand, the less flexible linear regression typically had narrow confidence intervals, regardless of the plausibility of the trend identified. See Figure 7 for a comparison of uncertainty between two data subsets for the same user.

Figure 7: Comparison of ACA and linear regression for user 1821. On the left is the relationship between carbohydrates and BG impact for lunch meals. On the right is the relationship between fat and BG impact at dinner for the same user. On the left, ACA has wide confidence intervals, indicating uncertainty about the true relationship, while confidence intervals are narrower on the right. In contrast, linear regression has narrow confidence intervals in both figures.

In general, the confidence intervals were much wider and more expressive with ACA. As shown in Table 4, more of the actual data points—by a factors ranging from to with an average of —fell within the confidence intervals for ACA than with linear regression.

                         N          ACA          Linear Regression
User 56
  Breakfast              13         84.62%       10.77%
  Lunch                  10         28.00%       2.00%
  Dinner                 23         58.26%       7.83%
  All meals              58         15.17%       7.59%
User 1821
  Breakfast              16         96.25%       6.25%
  Lunch                  19         52.63%       8.42%
  Dinner                 44         32.27%       11.36%
  All meals              88         22.05%       12.05%
All Users (Mean ± SD)
  Breakfast              23 ± 16    62% ± 21%    11% ± 8%
  Lunch                  21 ± 14    47% ± 21%    8% ± 6%
  Dinner                 24 ± 15    47% ± 22%    10% ± 7%
  All meals              82 ± 63    25% ± 12%    12% ± 7%
Table 4: Percent of data points within the 95% confidence interval for attributable components analysis (ACA) and linear regression.

4 Discussion

In this study, we explored the use of a method based on optimal transport theory to analyze patient-generated data. As compared to linear regression, we found that attributable components analysis (ACA) was able to identify non-linear relationships, was more robust to outliers, and offered more representative and accurate uncertainty estimates. These characteristics make ACA a good candidate to be used in the wild for decision support systems. For example, model output could be used in a tool to help clinicians deliver personalized coaching to patients with T2D, to automatically generate meal plans, or in a smartphone application that delivers personalized nutritional recommendations directly to patients.

Unlike post hoc data analysis, when datasets can be cleaned, curated, and processed, algorithms used in decision support systems need to run automatically without direct oversight using data with all their imperfections. Given the constraints of real self-monitoring data, the marginalized ACA performed well. But it is important to understand the modeling workflow we develop here, and its advantages and evaluation. We compared a simple regression, linear regression, to a complex nonlinear regression that was then simplified after the fact. It seems that, given enough data, it is more productive to begin with a model that is capable of representing the structures in the data and has the features necessary for useful decision-making, and then to simplify the model output as required for practical decision support. Non-linear regressions are not always required or useful, and often a linear or logistic regression, as a sophisticated use of a simple tool, will be a better choice due to the needs of the application, e.g., Levine et al. [2018]. Here we had substantial gains from basing the analysis in a more flexible tool, but also saw some drawbacks, all of which are noted below.

Nonlinear relationships in data and decision support The ACA was able to identify non-linear relationships, which is important because of the complexity of BG dynamics and other systems in health. Importantly, ACA is also regularized to prevent overfitting, and the majority of relationships identified were linear. As discussed in Section 2.6, one approach to making regression output useful for decision support is to apply a clinically meaningful threshold for BG impact and identify the ranges of covariate values over which higher or lower BG impacts are expected. Because ACA is non-linear, it can identify multiple such ranges, whereas with linear regression this approach would identify only one high-impact and one low-impact range. Distinct ranges may be more clinically meaningful.
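
The thresholding idea can be illustrated with a small sketch. It assumes a fitted curve evaluated on a grid of covariate values and reports the contiguous ranges where the predicted BG impact exceeds a cutoff. The curves, the cutoff of 30, and the name `ranges_above` are hypothetical stand-ins, not model output or clinical thresholds from the study.

```python
import numpy as np

def ranges_above(z, predicted, threshold):
    """Return (start, end) pairs of contiguous z-values where predicted > threshold."""
    ranges, start, last = [], None, None
    for zi, pi in zip(z, predicted):
        if pi > threshold:
            if start is None:
                start = zi          # a new range opens here
            last = zi               # extend the current range
        elif start is not None:
            ranges.append((float(start), float(last)))
            start = None            # the current range has closed
    if start is not None:
        ranges.append((float(start), float(last)))
    return ranges

carbs = np.linspace(0.0, 100.0, 101)                       # hypothetical covariate grid
linear_fit = 0.5 * carbs                                   # stand-in for a linear fit
nonlinear_fit = 40.0 * np.sin(3 * np.pi * carbs / 100.0)   # stand-in for a non-linear fit
THRESHOLD = 30.0                                           # hypothetical BG-impact cutoff

linear_ranges = ranges_above(carbs, linear_fit, THRESHOLD)
nonlinear_ranges = ranges_above(carbs, nonlinear_fit, THRESHOLD)
```

In this toy setup the straight line crosses the cutoff once and yields a single high-impact range, while the oscillating curve yields two separate ranges.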

Robust estimation ACA was more robust to outliers and erroneous data points than linear regression [Huber, 2011]. Data accuracy is a central concern in assessing the quality of electronic health data [Hripcsak and Albers, 2013, Weiskopf and Weng, 2013], especially for patient-generated health data, when patients are directly entering data points [Codella et al., 2018]. While rule-based or statistical methods can be used to detect and remove outliers, analytic approaches that are robust to outliers, like ACA, are still advantageous.

Uncertainty quantification ACA offered broader, more representative, and more accurate uncertainty estimates than linear regression [Smith, 2013]. It is important to represent and consider the confidence of the model for a given patient’s data set. Uncertainty is intrinsic to the practice of medicine. If a model is going to be used for clinical decision support, representing the uncertainty can help clinicians appropriately weigh the information against everything else they know about the patient [Cabitza et al., 2019, 2017]. For patient-facing applications, the certainty can help prioritize what is and isn’t shared with users.

Reducing model flexibility to gain interpretability Linear regression is rather interpretable, especially in one dimension. Nonlinear regressions, including ACA, which here models a distribution function describing glycemic response, are far less interpretable in their raw form, often require mathematical sophistication to interpret, and often cannot be visualized due to the high-dimensional nature of the model. While the full ACA model with all covariates outperformed linear regression, the quality of the fit dropped substantially when considering one covariate at a time in the marginal model given the data constraints. We focused on the marginal relationship between each covariate and glycemic impact because interpretability for decision support was a key objective. Simultaneously making changes to multiple macronutrients is challenging for individuals to implement because of the cognitive burden and because behavior change is often grounded in incremental, achievable adjustments. The poorer performance of the marginal model points to a tradeoff between accuracy and interpretability in machine learning tasks [Johansson et al., 2011]. In this context, there is substantial information shared between covariates that is lost through marginalization. There is a need for richer and more detailed model outputs in clinical characterization [Hripcsak and Albers, 2018], and future work could explore ways to improve the interpretability of the full model with all covariates for use in decision support while still aligning with what clinicians and patients need from a human factors standpoint.

Data limitations and machine learning

The ACA, like all nonlinear regressions such as deep learning or Gaussian process models, is more data-hungry than linear regression; that is, the ACA requires more data to become as accurate as linear regression. Given enough data, however, nonlinear regressions are generally more accurate, or better able to represent the data, than linear or other more rigid and simple regressions. The flexibility of nonlinear regressions may not always be beneficial, depending on the application, but here the real question is the limiting effect of data availability. For example, while nonlinear regressions require more data than their more rigid counterparts, not all methods have an equal hunger for data. Here, because of the nature of our experiment, we have a window into how hungry ACA is compared to linear regression: ACA underperforms linear regression for a single meal but outperforms it for three meals for most patients, meaning it needs at most three times the data to have a lower RMSE than linear regression. This is important because the whole point of personalized forecasts of the glycemic impact of nutrition is to use an individual’s data to estimate a model and provide decision support. Moreover, because health states change, models must be re-estimated periodically, potentially every 3-6 weeks, and so to be impactful, the model must perform with self-monitoring data collected on the order of weeks. Given its lower RMSE than ACA in the marginal case, linear regression could still be useful for decision support, as its results are similarly interpretable. However, for the reasons discussed above, augmentations would be necessary: for example, improving robustness to outliers or applying statistical approaches such as anomaly detection to remove possibly erroneous data points. In addition, linear regression would benefit from improved uncertainty estimates; without them it would be difficult to determine when signals are clinically meaningful or actionable. For these reasons, devising methods for boosting the impact of finite yet personal self-management data will be crucial.

Human-centered data collection limitations

The data available for analysis in realistic settings represents a limitation. T2D self-monitoring data is effortful for individuals to collect, and data sets are often small. As discussed in the feature selection section, BG readings before and two hours after each meal don’t fully capture fluctuations in BG. Continuous glucose monitors (CGM) could provide more granular and accurate data for machine learning, but are not standard care for T2D, making them prohibitively expensive for most patients. Still, prior research has demonstrated the feasibility of using similar data sets to make accurate predictions of BG values [Albers et al., 2017]. In the future, self-monitoring and other data sources from more individuals could be combined to find patterns of individuals with similar characteristics who share similar BG dynamics, for example by utilizing microbiome or electronic health record data [Zeevi et al., 2015]. Notably, while researchers have been successful in predicting blood glucose and making nutrition recommendations to improve BG control, their models relied on extensive and complete data about each individual [Zeevi et al., 2015]. Personalized nutrition recommendations from self-monitoring data would be considerably more scalable for a large population.

Other regression methods and ACA

There are, of course, many methods that can be used for similar tasks. ACA is a non-parametric density estimation method, and its task of explaining variability based on a set of covariates is similar to regression with clustering or principal components analysis (PCA). Importantly, ACA’s output is more interpretable than these alternatives. If the goal is to identify patterns between an individual’s nutrition and their glycemic control or to make recommendations to change diet, then it’s important that the output can be translated for human understanding. With ACA, each attributable component is a covariate, meaning the relationships identified are in the same dimensions as the input data. PCA finds the uncorrelated components that explain the most variability in the dependent variable [Jolliffe and Cadima, 2016], but what exactly each component means could be difficult to explain in a clinical situation. Similarly, clusters can be difficult to convey to clinicians without extensive training, and require interpretation [Feller et al., 2018]. It’s important that the model output aligns with cognitive models [Pazzani et al., 2001]; a complex, black box method with strong performance metrics is only useful if it can be translated into something clinically meaningful.

In conclusion, this work presents initial progress in applying machinery from optimal transport theory to address important problems in machine learning with patient-generated health data.

5 References


  • Genes et al. [2018] Nicholas Genes, Samantha Violante, Christine Cetrangol, Linda Rogers, Eric E. Schadt, and Yu-Feng Yvonne Chan. From smartphone to EHR: a case report on integrating patient-generated health data. npj Digital Medicine, 1(1):23, 12 2018. ISSN 2398-6352. doi: 10.1038/s41746-018-0030-8. URL http://www.nature.com/articles/s41746-018-0030-8.
  • Rehg et al. [2017] James M. Rehg, Susan A. Murphy, and Santosh Kumar, editors. Mobile Health. Springer International Publishing, Cham, 2017. ISBN 978-3-319-51393-5. doi: 10.1007/978-3-319-51394-2. URL http://link.springer.com/10.1007/978-3-319-51394-2.
  • Fiordelli et al. [2013] Maddalena Fiordelli, Nicola Diviani, and Peter J Schulz. Mapping mHealth Research: A Decade of Evolution. Journal of Medical Internet Research, 15(5):e95, 5 2013. ISSN 14388871. doi: 10.2196/jmir.2430. URL http://www.jmir.org/2013/5/e95/.
  • Zeevi et al. [2015] David Zeevi, Tal Korem, Niv Zmora, David Israeli, Daphna Rothschild, Adina Weinberger, Orly Ben-Yacov, Dar Lador, Tali Avnit-Sagi, Maya Lotan-Pompan, Jotham Suez, Jemal Ali Mahdi, Elad Matot, Gal Malka, Noa Kosower, Michal Rein, Gili Zilberman-Schapira, Lenka Dohnalová, Meirav Pevsner-Fischer, Rony Bikovsky, Zamir Halpern, Eran Elinav, and Eran Segal. Personalized Nutrition by Prediction of Glycemic Responses. Cell, 163(5):1079–1095, 11 2015. ISSN 10974172. doi: 10.1016/j.cell.2015.11.001. URL http://www.ncbi.nlm.nih.gov/pubmed/26590418.
  • American Diabetes Association [2018] American Diabetes Association. 4. Lifestyle Management: Standards of Medical Care in Diabetes-2018. Diabetes care, 41(Suppl 1):S38–S50, 1 2018. doi: 10.2337/dc18-S004. URL http://www.ncbi.nlm.nih.gov/pubmed/29222375.
  • Mamykina et al. [2016] Lena Mamykina, Matthew E Levine, Patricia G Davidson, Arlene M Smaldone, Noemie Elhadad, and David J Albers. Data-driven health management: reasoning about personally generated data in diabetes with information technologies. Journal of the American Medical Informatics Association, 23(3):526–531, 5 2016. ISSN 1067-5027. doi: 10.1093/jamia/ocv187. URL https://academic.oup.com/jamia/article-lookup/doi/10.1093/jamia/ocv187.
  • Albers et al. [2017] David J. Albers, Matthew Levine, Bruce Gluckman, Henry Ginsberg, George Hripcsak, and Lena Mamykina. Personalized glucose forecasting for type 2 diabetes using data assimilation. PLoS Computational Biology, 13(4):e1005232, 4 2017. ISSN 15537358. doi: 10.1371/journal.pcbi.1005232. URL http://dx.plos.org/10.1371/journal.pcbi.1005232.
  • Cordeiro et al. [2015a] Felicia Cordeiro, Daniel A. Epstein, Edison Thomaz, Elizabeth Bales, Arvind K. Jagannathan, Gregory D. Abowd, and James Fogarty. Barriers and Negative Nudges. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems - CHI ’15, pages 1159–1162, New York, New York, USA, 2015a. ACM Press. ISBN 9781450331456. doi: 10.1145/2702123.2702155. URL http://dl.acm.org/citation.cfm?doid=2702123.2702155.
  • Cordeiro et al. [2015b] Felicia Cordeiro, Elizabeth Bales, Erin Cherry, and James Fogarty. Rethinking the Mobile Food Journal: Exploring Opportunities for Lightweight Photo-Based Capture. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems - CHI ’15, pages 3207–3216, New York, New York, USA, 2015b. ACM Press. ISBN 9781450331456. doi: 10.1145/2702123.2702154. URL http://dl.acm.org/citation.cfm?doid=2702123.2702154.
  • Andrew et al. [2013] Adrienne Andrew, Gaetano Borriello, and James Fogarty. Simplifying Mobile Phone Food Diaries. In Proceedings of the ICTs for improving Patients Rehabilitation Research Techniques. IEEE, 2013. ISBN 978-1-936968-80-0. doi: 10.4108/icst.pervasivehealth.2013.252101. URL http://eudl.eu/doi/10.4108/icst.pervasivehealth.2013.252101.
  • Ismail-Beigi [2012] Faramarz Ismail-Beigi. Glycemic Management of Type 2 Diabetes Mellitus. New England Journal of Medicine, 366(14):1319–1327, 4 2012. ISSN 0028-4793. doi: 10.1056/NEJMcp1013127. URL http://www.nejm.org/doi/10.1056/NEJMcp1013127.
  • Hripcsak and Albers [2013] George Hripcsak and David J Albers. Next-generation phenotyping of electronic health records. Journal of the American Medical Informatics Association : JAMIA, 20(1):117–21, 2013. ISSN 1527-974X. doi: 10.1136/amiajnl-2012-001145. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3555337&tool=pmcentrez&rendertype=abstract.
  • Weiskopf and Weng [2013] Nicole Gray Weiskopf and Chunhua Weng. Methods and dimensions of electronic health record data quality assessment: Enabling reuse for clinical research. Journal of the American Medical Informatics Association, 20(1):144–151, jan 2013. ISSN 10675027. doi: 10.1136/amiajnl-2011-000681. URL https://academic.oup.com/jamia/article-lookup/doi/10.1136/amiajnl-2011-000681.
  • Albers et al. [2018] David J Albers, Matthew E Levine, Andrew Stuart, Lena Mamykina, Bruce Gluckman, and George Hripcsak. Mechanistic machine learning: how data assimilation leverages physiologic knowledge using Bayesian inference to forecast the future, infer the present, and phenotype. Journal of the American Medical Informatics Association, 25(10):1392–1401, 10 2018. ISSN 1067-5027. doi: 10.1093/jamia/ocy106. URL https://academic.oup.com/jamia/article/25/10/1392/5128461.
  • Cabitza et al. [2019] Federico Cabitza, Davide Ciucci, and Raffaele Rasoini. A Giant with Feet of Clay: On the Validity of the Data that Feed Machine Learning in Medicine. pages 121–136. Springer, Cham, 2019. doi: 10.1007/978-3-319-90503-7_10. URL http://link.springer.com/10.1007/978-3-319-90503-7_10.
  • Cabitza et al. [2017] Federico Cabitza, Raffaele Rasoini, and Gian Franco Gensini. Unintended Consequences of Machine Learning in Medicine. JAMA, 318(6):517, 8 2017. ISSN 0098-7484. doi: 10.1001/jama.2017.7797. URL http://jama.jamanetwork.com/article.aspx?doi=10.1001/jama.2017.7797.
  • Peyré and Cuturi [2018] Gabriel Peyré and Marco Cuturi. Computational Optimal Transport. 3 2018. URL http://arxiv.org/abs/1803.00567.
  • Villani [2009] Cédric Villani. Optimal transport : old and new. Springer, 2009. ISBN 9783540710509.
  • Albers et al. [2014] David J. Albers, Noémie Elhadad, Esteban G. Tabak, Adler J. Perotte, and George Hripcsak. Dynamical phenotyping: Using temporal analysis of clinically collected physiologic data to stratify populations. PLoS ONE, 9(6):e96443, jun 2014. ISSN 19326203. doi: 10.1371/journal.pone.0096443. URL http://dx.plos.org/10.1371/journal.pone.0096443.
  • Tabak and Trigila [2018] Esteban G. Tabak and Giulio Trigila. Conditional expectation estimation through attributable components. Information and Inference: A Journal of the IMA, 7(4):727–754, December 2018. ISSN 2049-8764. doi: 10.1093/imaiai/iax023. URL https://academic.oup.com/imaiai/article/7/4/727/4931217.
  • [21] USDA. USDA Food Composition Database. URL https://ndb.nal.usda.gov/ndb/.
  • Ahuja et al. [2013] Jaspreet K.C. Ahuja, Alanna J. Moshfegh, Joanne M. Holden, and Ellen Harris. USDA Food and Nutrient Databases Provide the Infrastructure for Food and Nutrition Research, Policy, and Practice. The Journal of Nutrition, 143(2):241S–249S, 2 2013. ISSN 0022-3166. doi: 10.3945/jn.112.170043. URL https://academic.oup.com/jn/article/143/2/241S/4569846.
  • Wheeler et al. [2014] M L Wheeler, A Daly, A Evert, and others. Choose Your Foods, Food Lists for Diabetes. Chicago, IL: Academy of Nutrition and Dietetics/American Diabetes Association, 2014.
  • Anderson et al. [2004] James W. Anderson, Kim M. Randles, Cyril W. C. Kendall, and David J. A. Jenkins. Carbohydrate and Fiber Recommendations for Individuals with Diabetes: A Quantitative Assessment and Meta-Analysis of the Evidence. Journal of the American College of Nutrition, 23(1):5–17, 2 2004. ISSN 0731-5724. doi: 10.1080/07315724.2004.10719338. URL http://www.tandfonline.com/doi/abs/10.1080/07315724.2004.10719338.
  • Breton et al. [2008] Marc D Breton, Devin P Shields, and Boris P Kovatchev. Optimum subcutaneous glucose sampling and fourier analysis of continuous glucose monitors. Journal of diabetes science and technology, 2(3):495–500, 5 2008. ISSN 1932-2968. doi: 10.1177/193229680800200322. URL http://www.ncbi.nlm.nih.gov/pubmed/19885217 and http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC2769727.
  • Gough et al. [2003] David A. Gough, Kenneth Kreutz-Delgado, and Troy M. Bremer. Frequency Characterization of Blood Glucose Dynamics. Annals of Biomedical Engineering, 31(1):91–97, 1 2003. ISSN 0090-6964. doi: 10.1114/1.1535411. URL http://link.springer.com/10.1114/1.1535411.
  • Aschner [2017] Pablo Aschner. New IDF clinical practice recommendations for managing type 2 diabetes in primary care. Diabetes Research and Clinical Practice, 132:169–170, 10 2017. ISSN 0168-8227. doi: 10.1016/J.DIABRES.2017.09.002. URL https://www.sciencedirect.com/science/article/pii/S016882271731464X.
  • Santambrogio [2010] Filippo Santambrogio. Introduction to Optimal Transport Theory. 9 2010. URL http://arxiv.org/abs/1009.3856.
  • Mendenhall and Sincich [1997] William Mendenhall and Terry Sincich. A Second Course in Statistics: Regression Analysis. Journal of the American Statistical Association, 92(438):797, 6 1997. doi: 10.2307/2965740. URL https://www.jstor.org/stable/2965740?origin=crossref.
  • Davison and Hinkley [1997] Anthony Christopher Davison and D. V. Hinkley. Bootstrap methods and their application. Cambridge University Press, Cambridge, 1997. ISBN 9780521574716.
  • Huber [2011] Peter J Huber. Robust statistics. Springer, 2011.
  • Levine et al. [2018] Matthew E Levine, David J Albers, and George Hripcsak. Methodological variations in lagged regression for detecting physiologic drug effects in ehr data. Journal of biomedical informatics, 86:149–159, 2018.
  • Codella et al. [2018] J. Codella, C. Partovian, H.-Y. Chang, and C.-H. Chen. Data quality challenges for person-generated health and wellness data. IBM Journal of Research and Development, 62(1):1–3, 1 2018. ISSN 0018-8646. doi: 10.1147/JRD.2017.2762218. URL http://ieeexplore.ieee.org/document/8269766/.
  • Smith [2013] Ralph C Smith. Uncertainty quantification: theory, implementation, and applications, volume 12. Siam, 2013.
  • Johansson et al. [2011] Ulf Johansson, Cecilia Sönströd, Ulf Norinder, and Henrik Boström. Trade-off between accuracy and interpretability for predictive in silico modeling. Future Medicinal Chemistry, 3(6):647–663, 4 2011. ISSN 17568927. doi: 10.4155/fmc.11.23. URL http://www.future-science.com/doi/10.4155/fmc.11.23.
  • Hripcsak and Albers [2018] George Hripcsak and David J Albers. High-fidelity phenotyping: richness and freedom from bias. Journal of the American Medical Informatics Association, 25(3):289–294, mar 2018. ISSN 1067-5027. doi: 10.1093/jamia/ocx110. URL https://academic.oup.com/jamia/article/25/3/289/4484121.
  • Jolliffe and Cadima [2016] Ian T Jolliffe and Jorge Cadima. Principal component analysis: a review and recent developments. Philosophical transactions. Series A, Mathematical, physical, and engineering sciences, 374(2065):20150202, 4 2016. ISSN 1471-2962. doi: 10.1098/rsta.2015.0202. URL http://www.ncbi.nlm.nih.gov/pubmed/26953178 and http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC4792409.
  • Feller et al. [2018] Daniel J Feller, Marissa Burgermaster, Matthew E Levine, Arlene Smaldone, Patricia G Davidson, David J Albers, and Lena Mamykina. A visual analytics approach for pattern-recognition in patient-generated data. Journal of the American Medical Informatics Association, 6 2018. ISSN 1067-5027. doi: 10.1093/jamia/ocy054. URL https://academic.oup.com/jamia/advance-article/doi/10.1093/jamia/ocy054/5037318.
  • Pazzani et al. [2001] M. J. Pazzani, S. Mani, and W. R. Shankle. Acceptance of rules generated by machine learning among medical experts. Methods of Information in Medicine, 40(5):380–385, 2001. ISSN 00261270. doi: 01050380[pii]. URL https://pdfs.semanticscholar.org/8cea/aabb4866dfc6a72e25d502b30f3e6a98634c.pdf.
  • Wu and Tabak [2019] Chenyue Wu and Esteban G. Tabak. Prototypal analysis and prototypal regression. 2019.
  • Cutler and Breiman [1994] A Cutler and L Breiman. Archetypal analysis. Technometrics, 36:338–347, 1994.

Appendix A Attributable Components Analysis

A.1 Modes of variability

Given a set of observations of the variable of interest and the covariates , we seek to estimate the conditional mean . We will assume throughout that assumes real values; this is a reasonable assumption given that represents glycemic impact that we define here as the difference between two real numbers, pre- and postprandial blood glucose measurements. If we instead specified as a vector, the ’th component of their mean is the mean of the ’th component, so there is no loss of generality in considering one dimension of at a time. In this application, only has one dimension. We will leave the specification of the allowable variable types for each covariate temporarily open.

The conditional mean can be characterized as the minimizer of the variance:


over a proposed family of functions . We would like our specification of this family of functions to satisfy some properties:

  1. The family should be big enough to accurately represent complex dependencies of on , while at the same time constrained so as not to overfit the data.

  2. The procedure should be applicable to covariates of quite arbitrary type.

  3. It should be interpretable, in the sense that one should be able to compute with ease the marginal dependence on on some subset of the , averaging over the others.

  4. Performing the minimization in (6) should be computationally effective.

The choice made in Tabak and Trigila [2018] is to approximate the multivariable function by the superposition of products of functions of the individual covariates :


This can be thought of as an extension of the low-rank factorization of matrices from matrix entries considered as functions of the row and column, to tensors of arbitrary order and variables of arbitrary type. Two explicit examples with real covariates but functions pre-assigned except for a global multiplicative factor are the power series and the Fourier series.

In the context of low rank factorization, particularly when it is re-arranged so that both the and are orthogonal sets of vectors (i.e. principal component analysis), each represents a component of variability, typically sorted by the fraction of total variability that each component explains. In Fourier analysis, a linear case, one speaks of Fourier modes. Because we are not anchored to a particular functional form for the ’s we will refer to each product as a mode of variability of , and think of it as a pattern of dependence on to be extracted from the data that explains a significant fraction of the variability of .

Under the proposal in (7), the conditional expectation problem in (6) reduces to


over the degrees of freedom available in the specification of the functions . We discuss next how to specify these functions.
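
For orientation, the quantities described in this subsection can be written out as follows. This is a sketch in assumed notation, chosen here for illustration in the style of Tabak and Trigila [2018]: $x$ for the variable of interest, $z = (z_1, \dots, z_L)$ for the covariates, $i$ for the observation index, $d$ for the number of components, and $V_l^k$ for the one-covariate functions. The symbols and the exact form of the numbered equations in the original may differ.

```latex
% Conditional mean as the minimizer of the variance (the role of (6) in the text)
\bar{x}(z) = \arg\min_{f} \sum_{i} \bigl( x^{i} - f(z^{i}) \bigr)^{2}

% Sum of products of functions of the individual covariates (the role of (7))
f(z_1, \dots, z_L) \approx \sum_{k=1}^{d} \prod_{l=1}^{L} V_l^{k}(z_l)

% Special cases named in the text
A_{ij} \approx \sum_{k} u_i^{k} v_j^{k}            % low-rank matrix factorization
f(z) \approx \sum_{k} a_k z^{k}                    % power series
f(z) \approx \sum_{k} a_k e^{\mathrm{i} k z}       % Fourier series

% Reduced problem: minimize over the functions V_l^k
\min_{\{V_l^{k}\}} \sum_{i} \Bigl( x^{i} - \sum_{k=1}^{d} \prod_{l=1}^{L} V_l^{k}(z_l^{i}) \Bigr)^{2}
```

In the power and Fourier series the basis functions are fixed in advance and only the coefficients $a_k$ are free, which is the sense in which the functions are pre-assigned except for a global multiplicative factor.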

A.2 Hard and soft assignments (coping with missing values), grids and prototypes

If the are categorical variables, such as the rows and columns in low-rank matrix factorization, we can assign an integer to each of the values they can adopt. Then each is fully described by a matrix with components , and we can write


where when and otherwise.

Since is quadratic in each , one can perform the minimization of (6) through an alternating direction methodology, minimizing alternately over each , which yields the updating rule
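
The alternating scheme can be illustrated in the simplest setting of two categorical covariates with one observation per combination, where it reduces to alternating least squares for a low-rank matrix. The sketch below is an illustration under those assumptions, not the paper's updating rule: the function name `als_two_covariates`, the names `U` and `V`, and the synthetic table are all hypothetical.

```python
import numpy as np

def als_two_covariates(table, rank, n_iter=20, seed=0):
    """Alternating least squares for table[a, b] ~ sum_k U[a, k] * V[b, k]."""
    rng = np.random.default_rng(seed)
    n_a, n_b = table.shape
    U = rng.standard_normal((n_a, rank))
    V = rng.standard_normal((n_b, rank))
    losses = []
    for _ in range(n_iter):
        # With V fixed the objective is quadratic in U: solve V @ U.T ~ table.T.
        U = np.linalg.lstsq(V, table.T, rcond=None)[0].T
        # With U fixed the objective is quadratic in V: solve U @ V.T ~ table.
        V = np.linalg.lstsq(U, table, rcond=None)[0].T
        losses.append(float(np.sum((table - U @ V.T) ** 2)))
    return U, V, losses

# Synthetic table that is exactly rank 2 (illustration only, not study data).
rng = np.random.default_rng(1)
table = rng.standard_normal((6, 2)) @ rng.standard_normal((5, 2)).T
U, V, losses = als_two_covariates(table, rank=2)
```

Each half-step solves an ordinary least-squares problem exactly, so the squared error cannot increase from one sweep to the next. The same property is what makes the alternating approach attractive when more covariates are involved.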



We can extend the applicability of (9) to situations where the values of in some or all observations are not known with certainty. Then is no longer a binary variable with values zero or one, but represents instead the probability that adopts the value . This soft assignment satisfies


This gives us a natural means of accommodating measurement uncertainty, as well as one pathway for coping with missing data within a covariate.

In the event that the covariates, , are real, we can extend (9, 11) by adopting a grid; here we adopt a grid and define . Then, performing a piecewise linear interpolation, one can assign to each observation values of such that

and write again

Here the satisfy, in addition to (11), the condition that at most two differ from zero for each value of , the ones corresponding to the two grid points surrounding .
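
The piecewise linear assignment to a grid can be sketched as follows. For each observed covariate value the code produces non-negative weights over the grid points that sum to one, reproduce the value as a weighted average of grid points, and are non-zero only at the two surrounding grid points. The function name `interpolation_weights`, the grid, and the sample values are illustrative assumptions.

```python
import numpy as np

def interpolation_weights(z, grid):
    """Soft assignment of each z to grid points via piecewise linear interpolation."""
    z = np.asarray(z, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Index of the grid point at or to the left of each z (clipped to stay in range).
    left = np.clip(np.searchsorted(grid, z, side="right") - 1, 0, len(grid) - 2)
    t = (z - grid[left]) / (grid[left + 1] - grid[left])
    t = np.clip(t, 0.0, 1.0)  # values outside the grid are assigned to the end points
    weights = np.zeros((len(z), len(grid)))
    rows = np.arange(len(z))
    weights[rows, left] = 1.0 - t
    weights[rows, left + 1] = t
    return weights

grid = np.linspace(0.0, 100.0, 6)            # hypothetical grid: 0, 20, ..., 100
z_obs = np.array([0.0, 35.0, 60.0, 99.0])    # hypothetical covariate values
alpha = interpolation_weights(z_obs, grid)
```

For example, the value 35 lies between the grid points 20 and 40 and receives weights 0.25 and 0.75 on them, with zeros elsewhere.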

Finally, the formulation in (9, 11) can be further extended to any type of covariate that admits a norm, via prototypal analysis [Wu and Tabak, 2019]. In this case the grid is replaced by the prototypes , which are optimal convex combinations of the ,

where and solve the following minimization problem:


The interpretation is the following: in archetypal analysis [Cutler and Breiman, 1994], we seek points within the convex hull of the such that the latter can be well approximated by convex combinations of the former. What the prototypes add is the penalization term with strength , which favors expressions for the that are local, i.e. involve only nearby , as with the piecewise linear expansions adapted to a grid.

A.3 Smoothness and bounded variability

As the grids become finer or the number of prototypes grows to permit a more accurate representation of the , the risk of overfitting the data also increases. To avoid this, one can enforce smoothness on , for instance by penalizing the squared norm of a finite difference approximation to its gradient. A general form for such a penalization term is

where the matrix encodes the specific penalization used, such as the squared norms of first or second derivatives. A similar term can be used for categorical variables, encoding into their variance, to bound the amount of variability that they can explain. Then the full problem adopts the form


The inclusion of the products of squares of the norms of the as pre-factors to the penalty terms follows from the need to make the objective function invariant under re-scalings of the that preserve their product. Without these, the penalty terms could be made arbitrarily small by rescaling each while preserving their product, assigning large amplitudes to those that can explain little or no variability, and can therefore be taken as constants so that the corresponding vanishes.

Notice that the objective function in (13) is still quadratic in each , and so can be solved through an alternating direction methodology that finds the optimal matrix explicitly given the current values of the for .
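
The effect of a quadratic smoothness penalty can be sketched for a single real covariate and a single component, where the penalized problem has a closed-form solution. This is a simplified illustration that omits the norm pre-factors and the alternation over several functions described above; the names `fit_smooth_component`, `lam`, `A`, and `C`, and the synthetic data, are assumptions.

```python
import numpy as np

def fit_smooth_component(A, x, lam):
    """Minimize ||x - A v||^2 + lam * v^T C v, with C built from second differences."""
    m = A.shape[1]
    D = np.diff(np.eye(m), n=2, axis=0)   # (m-2) x m second-difference operator
    C = D.T @ D                           # quadratic penalty matrix
    v = np.linalg.solve(A.T @ A + lam * C, A.T @ x)
    return v, float(v @ C @ v)            # grid values and their roughness

# Synthetic data for illustration only (not study data).
rng = np.random.default_rng(0)
z = rng.uniform(0.0, 1.0, 200)
x = np.sin(2 * np.pi * z) + 0.3 * rng.standard_normal(z.size)

# Piecewise linear interpolation weights of each observation on an 11-point grid.
grid = np.linspace(0.0, 1.0, 11)
basis = np.eye(grid.size)
A = np.column_stack([np.interp(z, grid, basis[j]) for j in range(grid.size)])

v_weak, rough_weak = fit_smooth_component(A, x, lam=1e-3)
v_strong, rough_strong = fit_smooth_component(A, x, lam=10.0)
```

Increasing `lam` trades fidelity to the data for a smoother fitted function, so `rough_strong` does not exceed `rough_weak`.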