Introduction
Social data is often highly heterogeneous, coming from a population composed of diverse classes of individuals, each with their own characteristics and behaviors. As a result of heterogeneity, a model learned on population data may not make accurate predictions on held-out test data, nor offer analytic insights into the underlying behaviors that motivate interventions. To illustrate, consider Figure 1
, which shows data collected for a hypothetical nutrition study measuring how the outcome, body mass index (BMI), changes as a function of daily pasta calorie intake. Multivariate linear regression (MLR) analysis finds a negative relationship in the population (red dotted line) between these variables. The negative trend suggests that—paradoxically—increased pasta consumption is associated with lower BMI. However, unbeknownst to researchers, the hypothetical population is heterogeneous, composed of classes that varied in their fitness level. These classes (clusters in Fig.
1) represent, respectively, people who do not exercise, people with a normal activity level, and athletes. When data is disaggregated by fitness level, the trends within each subgroup are positive (dashed lines), leading to the conclusion that increased pasta consumption is in fact associated with a higher BMI. Recommendations for pasta consumption arising from the naive analysis are opposite to those arising from a more careful analysis that accounts for the confounding effect of different classes of people. The trend reversal is an example of Simpson's paradox, which has been widely observed in many domains, including biology, psychology, astronomy, and computational social science
[6, 15, 23, 4].

Social scientists analyze heterogeneous data with underlying structure (classes) using mixed effects models [32]
. This variant of linear regression allows for intercepts to be random variables in order to account for shifts between classes, and slopes to be random variables in order to model differences in regression coefficients within classes. Mixed effects models are used to describe nonindependent observations of data from the same class, and they can even handle trend reversal associated with Simpson’s paradox. Mixed effects models assume that classes are specified by a categorical variable, which can be constructed by disaggregating or binning data on an existing variable
[1]. In practice, however, these classes may not be known a priori, or may be related to multiple variables. Instead, they must be discovered from data, along with the trends they represent.

Many methods already exist for finding latent classes within data; however, they have various shortcomings. Unsupervised clustering methods disaggregate data regardless of the outcome variable being explained, although distinct outcomes may be best described by different clusterings of the same data. Recent methods have tackled Simpson's paradox by performing supervised disaggregation of data [2, 12], but these methods disaggregate data into subgroups using existing features, and thus are not able to capture the effects of latent classes (i.e., unobserved confounders). Additionally, they perform "hard" clustering, assigning each data point to a unique group. Instead, a "soft" clustering is more realistic, as it captures the degree of uncertainty about which group the data belong to. To address these challenges, we describe Disaggregation via Gaussian Regression (DoGR
), a method that jointly partitions data into overlapping clusters and estimates linear trends within them. This allows for learning accurate and generalizable models while keeping the interpretability advantage of linear models. The proposed method assumes that the data can be described as a superposition of clusters, or components, with every data point having some probability of belonging to any of the components. Each component represents a latent class, and the soft membership represents uncertainty about which class a data point belongs to. We use an expectation maximization (EM) algorithm to estimate component parameters and component membership parameters for the individual data points.
DoGR jointly updates component parameters and regression coefficients, weighting the contribution of each data point to a component's regression coefficients by its membership parameter. The joint learning of clusters and their regressions allows for discovering the latent classes that explain the structure of data.

Our framework learns predictive and interpretable models of data. DoGR is validated on synthetic and real-world data; it is able to discover hidden subgroups in analytically challenging scenarios, where subgroups overlap along some of the dimensions, and to identify interesting trends that may be obscured by Simpson's paradox. This includes scenarios where the trends in data show Simpson's reversal. Our method achieves performance on prediction tasks comparable to or better than state-of-the-art algorithms, but in a fraction of the time, making it well suited for big data analysis.
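As a concrete illustration of the trend reversal that motivates this work, the following sketch, using hypothetical cluster parameters rather than the study's actual data, fits ordinary least squares to pooled and disaggregated synthetic clusters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three hypothetical clusters: within each, the outcome rises with x,
# but the cluster centers are arranged so the pooled trend is negative.
xs, ys = [], []
for x0, y0 in [(0.0, 6.0), (2.0, 3.0), (4.0, 0.0)]:
    x = x0 + rng.normal(0.0, 0.5, 200)
    y = y0 + 1.0 * (x - x0) + rng.normal(0.0, 0.2, 200)  # within-cluster slope +1
    xs.append(x)
    ys.append(y)

x_all, y_all = np.concatenate(xs), np.concatenate(ys)

pooled_slope = np.polyfit(x_all, y_all, 1)[0]                  # population-level fit
within_slopes = [np.polyfit(x, y, 1)[0] for x, y in zip(xs, ys)]
```

With these parameters the pooled slope is negative while every within-cluster slope is positive: the aggregate fit reverses the sign of the true within-group relationship.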
Background
Regression is one of the most widely used methods in data analysis due to its flexibility, interpretability, and computational scalability [13]
. Linear models are perhaps the most popular regression methods for learning the relationship between predictor variables (also known as fixed effects or features) and the outcome variable. However, linear models make assumptions about the data that are often violated by real-world data. One of these assumptions is independence of observations. In reality, however, observations may come from groups of similar individuals, violating the independence assumption. To account for individual differences within such groups, mixed-effects or multilevel models were developed
[22]. These models assume that similar individuals, or data samples, come from the same class, and that classes are specified by some categorical variable [32]. In some cases, this variable can be created by binning a continuous variable and disaggregating data by these bins [1]. However, existing methods do not work when differences between groups or individuals cannot be directly described by observed variables or features. In this chapter, we describe a method that captures differences between groups as latent confounders (i.e., latent clusters) that are learned from data. Our work improves the utility of linear models in the analysis of social data by controlling for these clusters.

A number of previous works have attempted to tackle the issue of latent confounders. Clusterwise linear regression (CLR) [28] starts with initial clusters and updates them by reassigning one data point to another partition that minimizes a loss function. The method is slow, since it moves one data point at each iteration. Two other methods, WCLR
[9] and FWCLR [10], improve on CLR by using k-means as their clustering method. These methods were shown to outperform CLR and other methods, such as K-plane
[20]. In Conditional Linear Regression [5], the goal is to find a linear rule capable of achieving more accurate predictions for a segment of the population by ignoring outliers. One of the larger differences between their method and ours is that Conditional Linear Regression focuses on a small subset of data, while we model data as a whole, like CLR, WCLR, FWCLR, and K-plane. The shared parameter
across all clusters in WCLR and FWCLR makes these methods perform poorly if clusters have different variance, as we will show in our results section. Other methods have combined Gaussian Mixture Models and regression to create algorithms similar to our own
[29, 14]. We call these methods Gaussian Mixture Regression (GMR). In contrast to these methods, however, we can capture relationships between independent and outcome variables through regression coefficients, which previous methods were unable to do. We also use Weighted Least Squares to fit our model, which makes it less sensitive to outliers [7].

Causal inference has been used to address the problem of confounders, including latent confounders [19], and to infer the causal relationship between features and outcomes [3, 26, 30]. One difficulty with causal inference, however, is that the focus is traditionally on one intervention [26]. Taking into account synergistic effects of multiple features is not well understood, but has been attempted recently [26, 30]. With adequate datasets, these methods can help infer causal relationships between multiple causes, but certain causal assumptions are needed, which might not correspond to reality. In contrast, regression offers us the opportunity to understand relationships between each feature and an outcome, regardless of the dataset, even if we cannot make causal claims.
DoGR Method
Let $D = \{(\mathbf{x}_j, y_j)\}_{j=1}^{N}$ be a set of observations, or records, each containing a real-valued outcome $y_j$ of an observation $j$ and a vector $\mathbf{x}_j$ of $p$ independent variables, or features. Regression analysis is often used to capture the relationship between each independent variable and the outcome. Specifically, MLR estimates regression coefficients $\boldsymbol{\beta}$ by minimizing the residuals of $y = \beta_0 + \boldsymbol{\beta}^{\top}\mathbf{x}$ over all observations. However, parameters learned by the MLR model may not generalize to out-of-sample populations, as they can be confounded by Simpson's paradox and sampling bias. We call this the "robustness" problem of regression.

As discussed in the introduction, Figure 1(a) illustrates this problem with synthetic data. The MLR model trained on the aggregate data suggests a negative relationship (solid red line) between the independent variable Daily pasta intake and the outcome BMI. However, there are three components, each with a positive trend. Indeed, applied separately to each component $k$, MLR learns the proper positive relationship (dashed green line) between Daily pasta intake and BMI: $y = \beta_{0k} + \beta_k x$, where $\beta_{0k}$ is the cluster's intercept and $\beta_k$ its coefficient. Associations learned by the MLR model trained on the population-level data are not robust or generalizable. This could help explain the reproducibility problem seen in many fields [24, 27], where trends seen in some experiments cannot be reproduced in other experiments with different populations. This can be clearly observed in Figure 1(b), which represents a different sampling of the population. The underlying clusters are the same, but the middle cluster is oversampled during data collection. As a result, MLR trained at the population level finds only a weak association between the independent variable and the outcome (solid red line). In contrast, regressions learned separately for each cluster (green dashed lines) remain the same even in the new data: thus, the cluster regressions represent robust and generalizable relationships in data.

Model Specification
The goal of this work is to learn robust and reproducible trends through a regression model that accounts for the latent structure of data, for example, the presence of three clusters in data shown in Fig. 1. Our model jointly disaggregates the data into overlapping subgroups, or components, and performs weighted linear regression within each component. We allow components to overlap in order to represent the uncertainty about which component or subgroup an observation belongs to.
In what follows, we use capital letters to denote random variables and lowercase letters their values. We model the independent variables of each component $k$ as a multivariate normal distribution with mean $\boldsymbol{\mu}_k$ and covariance matrix $\Sigma_k$:

$p(\mathbf{x} \mid k) = \mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}_k, \Sigma_k)$  (1)

In addition, each component $k$ is characterized by a set of regression coefficients $\beta_{0k}, \boldsymbol{\beta}_k$. The regression values of the component are

$\hat{y}_k(\mathbf{x}) = \beta_{0k} + \boldsymbol{\beta}_k^{\top}\mathbf{x}$  (2)

with $\epsilon_k = y - \hat{y}_k(\mathbf{x})$ giving the residuals for component $k$. Under the assumption of normality of residuals [11], $Y$ has a normal distribution with mean $\hat{y}_k(\mathbf{x})$ and variance $\sigma_k^2$:

$p(y \mid \mathbf{x}, k) = \mathcal{N}(y;\, \hat{y}_k(\mathbf{x}), \sigma_k^2)$  (3)

where $\hat{y}_k(\mathbf{x})$ is defined by Eq. 2. Under the assumption of homoscedasticity, in which the error $\sigma_k$ is the same for all $\mathbf{x}$, the joint density is the product of the conditional (Eq. 3) and marginal (Eq. 1) densities:

$p(\mathbf{x}, y \mid k) = p(y \mid \mathbf{x}, k)\, p(\mathbf{x} \mid k)$  (4)

The product of these two normal densities can itself be written as a single multivariate normal distribution: $(\mathbf{X}, Y)$ is distributed as $\mathcal{N}(\tilde{\boldsymbol{\mu}}_k, \tilde{\Sigma}_k)$, where $\tilde{\boldsymbol{\mu}}_k$ stacks $\boldsymbol{\mu}_k$ with the corresponding regression value, and $\tilde{\Sigma}_k$ is a block matrix built from $\Sigma_k$, $\boldsymbol{\beta}_k$, and $\sigma_k^2$.
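Equations 1 to 4 translate directly into code. The sketch below, with illustrative function and variable names mirroring the symbols above, evaluates the joint density of a single component using SciPy:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def component_joint_density(x, y, mu, Sigma, beta, beta0, sigma):
    """Joint density p(x, y | k) = p(y | x, k) * p(x | k)  (Eq. 4)."""
    p_x = multivariate_normal.pdf(x, mean=mu, cov=Sigma)   # Eq. 1
    y_hat = beta @ x + beta0                               # Eq. 2
    p_y_given_x = norm.pdf(y, loc=y_hat, scale=sigma)      # Eq. 3
    return p_y_given_x * p_x
```

For a standard-normal component with zero regression coefficients, the density at the origin reduces to $(2\pi)^{-3/2}$, a handy sanity check.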
Model Learning
The final goal of the model is to predict the outcome, given the independent variables. This prediction combines the predicted values of the outcome from all components by taking the average of the predicted values, weighted by the size of the component. We define $\omega_k$ as the weight of component $k$, where $\sum_{k} \omega_k = 1$. We can define the joint distribution over all components as $p(\mathbf{x}, y) = \sum_{k} \omega_k\, p(\mathbf{x}, y \mid k)$. Then the log-likelihood of the model over all data points is:

$\mathcal{L} = \sum_{j=1}^{N} \log \sum_{k} \omega_k\, p(\mathbf{x}_j, y_j \mid k)$  (5)

The formula here is the same as for the Gaussian Mixture Model, except that the target $y$ is a function of $\mathbf{x}$. To find the best values for the parameters, we can leverage the Expectation Maximization (EM) algorithm. The algorithm iteratively refines parameters based on the expectation (E) and maximization (M) steps.
E-step
Let us define the membership parameter $\gamma_{jk}$ as the probability that data point $j$ belongs to component $k$. Given the parameters from the last iteration, the membership parameter is:

$\gamma_{jk} = \dfrac{\omega_k\, p(\mathbf{x}_j, y_j \mid k)}{\sum_{k'} \omega_{k'}\, p(\mathbf{x}_j, y_j \mid k')}$  (6)
Thus, the E-step disaggregates the data into clusters, but it does so in a "soft" way, with each data point having some probability of belonging to each cluster.
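In vectorized form, the E-step is just a normalization of weighted joint densities (Eq. 6); a minimal NumPy sketch, with illustrative array shapes:

```python
import numpy as np

def e_step(joint_densities, weights):
    """Membership parameters gamma[j, k] (Eq. 6).

    joint_densities: (N, K) array of p(x_j, y_j | k) values (Eq. 4).
    weights: (K,) array of component weights omega_k.
    """
    weighted = joint_densities * weights                  # omega_k * p(x_j, y_j | k)
    return weighted / weighted.sum(axis=1, keepdims=True)
```

Each row of the result sums to one, so every data point distributes a unit of membership across the components.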
M-step
Given the updated membership parameters, the M-step updates the parameters of the model for the next iteration:

$\omega_k = \dfrac{1}{N}\sum_{j}\gamma_{jk}, \qquad \boldsymbol{\mu}_k = \dfrac{\sum_{j}\gamma_{jk}\,\mathbf{x}_j}{\sum_{j}\gamma_{jk}}, \qquad \Sigma_k = \dfrac{\sum_{j}\gamma_{jk}\,(\mathbf{x}_j - \boldsymbol{\mu}_k)(\mathbf{x}_j - \boldsymbol{\mu}_k)^{\top}}{\sum_{j}\gamma_{jk}}$

In addition, this step updates the regression parameters based on the estimated membership parameters. Our method uses Weighted Least Squares for updating the regression coefficients for each component, with $\gamma_{jk}$ as weights. In other words, we find the $\beta_{0k}, \boldsymbol{\beta}_k$ that minimize the Weighted Sum of Squares (WSS) of the residuals:

$\mathrm{WSS}_k = \sum_{j}\gamma_{jk}\,\bigl(y_j - \beta_{0k} - \boldsymbol{\beta}_k^{\top}\mathbf{x}_j\bigr)^2$

Using the value of $\boldsymbol{\beta}_k$, the updated residual variance would be:

$\sigma_k^2 = \dfrac{\sum_{j}\gamma_{jk}\,\bigl(y_j - \hat{y}_k(\mathbf{x}_j)\bigr)^2}{\sum_{j}\gamma_{jk}}$

Intuitively, $\boldsymbol{\mu}_k$ shows us where the center of each subgroup resides. The further a data point is from the center, the lower its probability of belonging to the subgroup. The covariance matrix $\Sigma_k$ captures the spread of the subgroup in the feature space relative to the center $\boldsymbol{\mu}_k$. The regression coefficient $\boldsymbol{\beta}_k$ gives low weight to outliers (i.e., is a weighted regression) and captures the relationship between $\mathbf{X}$ and $Y$ near the center of the subgroup. Parameter $\sigma_k$ tells us about the variance of the residuals around the fitted regression line, and $\omega_k$ tells us about the importance of each subgroup.
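The M-step updates for a single component can be sketched as follows. The Weighted Least Squares step is implemented here by rescaling rows with the square roots of the membership weights, a standard reduction of WLS to ordinary least squares; the function name is illustrative:

```python
import numpy as np

def m_step_component(X, y, gamma_k):
    """Update one component's parameters from memberships gamma_k (shape (N,))."""
    w = gamma_k / gamma_k.sum()
    omega = gamma_k.mean()                         # component weight
    mu = w @ X                                     # weighted mean
    Xc = X - mu
    Sigma = (w[:, None] * Xc).T @ Xc               # weighted covariance
    # WLS: minimize sum_j gamma_jk (y_j - beta0 - beta^T x_j)^2
    A = np.hstack([X, np.ones((len(y), 1))])
    sw = np.sqrt(gamma_k)
    coef, *_ = np.linalg.lstsq(sw[:, None] * A, sw * y, rcond=None)
    beta, beta0 = coef[:-1], coef[-1]
    resid = y - (X @ beta + beta0)
    sigma2 = w @ resid**2                          # weighted residual variance
    return omega, mu, Sigma, beta, beta0, sigma2
```

On noiseless linear data with uniform memberships, the update recovers the generating slope and intercept exactly (up to floating-point error).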
Prediction
For test data $\mathbf{x}$, the predicted outcome is the weighted average of the predicted outcomes for all components. The weights capture the uncertainty about which component the test data belongs to; since the outcome is unknown at prediction time, these memberships are computed from the feature density $p(\mathbf{x} \mid k)$ alone. Using Eq. 3, the best prediction of the outcome for component $k$ is $\hat{y}_k(\mathbf{x})$, which is the mean of the conditional outcome distribution of the component:

$\hat{y}(\mathbf{x}) = \sum_{k}\gamma_k(\mathbf{x})\,\hat{y}_k(\mathbf{x}), \qquad \gamma_k(\mathbf{x}) = \dfrac{\omega_k\, p(\mathbf{x} \mid k)}{\sum_{k'}\omega_{k'}\, p(\mathbf{x} \mid k')}$  (7)

The solid green line in Figure 1 represents the predicted outcome as a function of $x$; it is the weighted average over the dashed green lines.
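A sketch of the prediction step (Eq. 7), with illustrative names: memberships for a test point are computed from the feature density alone, and the per-component regression predictions are then averaged:

```python
import numpy as np
from scipy.stats import multivariate_normal

def predict(x, weights, mus, Sigmas, betas, beta0s):
    """Weighted average of per-component regression predictions (Eq. 7)."""
    # Membership of a test point uses p(x | k) only, since y is unknown.
    p_x = np.array([multivariate_normal.pdf(x, mean=m, cov=S)
                    for m, S in zip(mus, Sigmas)])
    gamma = weights * p_x
    gamma = gamma / gamma.sum()
    y_hat = np.array([b @ x + b0 for b, b0 in zip(betas, beta0s)])
    return gamma @ y_hat
```

If all components share the same regression line, the weighted average reproduces that line regardless of the memberships.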
Data
We apply our method to synthetic and real-world data sets, including the large social data sets described below.
The Synthetic data consists of two subgroups with the same mean on $X$ but different variances, as shown in Figure 2. The two components differ in the variance of $X$, in the number of data points they contain, and in the slope of the relationship between $X$ and the outcome.
The Metropolitan data comes from a study of emotions expressed through Twitter messages posted from locations around Los Angeles County [18], a large metropolitan area. The data contains more than 6 million geo-referenced tweets from 33,000 people linked to US census tracts through their locations. The demographic and socioeconomic characteristics of tracts came from the 2012 American Fact Finder (http://factfinder.census.gov/). A tract is a small region that contains about 4,000 residents on average. The tracts are designed to be relatively homogeneous with respect to socioeconomic characteristics of the population, such as income, residents with a bachelor's degree or above, and other demographics. Of the more than 2,000 tracts within this metropolitan area, 1,688 were linked to tweets.
Emotional valence
was estimated from the text of the tweets using Warriner, Kuperman, and Brysbaert (WKB) lexicon
[31]. The lexicon gives valence scores, between 1 and 9, which quantify the level of pleasure or happiness expressed by a word, for 14,000 English words. Since valence is a proxy for expressed happiness, the data allows us to investigate the social and economic correlates of happiness. We use valence as the outcome in data analysis.

The Wine Quality data combines two benchmark data sets from the UCI repository (https://archive.ics.uci.edu/ml/datasets/wine+quality) related to red and white wines [8]. The data contains records of white and red wine samples, each with a numeric quality score and the concentrations of various chemicals. Our goal is to model wine quality based on physicochemical tests. Note that the type of wine (red or white) is only used to evaluate the learned components.
The New York City Property Sale data (https://www.kaggle.com/newyorkcity/nycpropertysales) contains records of every building or building unit (such as an apartment) sold in the New York City property market over a 12-month period, from September 2016 to September 2017. We removed all categorical variables, like neighborhood and tax class. We use each property's borough to study the relevance of the components to different neighborhoods in NYC; however, we do not use it in the analysis. The outcome variable is sale price, measured in millions of dollars.
The Stack Overflow dataset contains data from a question-answering forum on the topic of computer programming. Any user can ask a question, which others may answer. We used anonymized data representing all answers to questions with two or more answers posted on Stack Overflow from August 2008 until September 2014 (https://archive.org/details/stackexchange). We created a user-focused data set by selecting at random one answer written by each user and discarding the rest. To understand factors affecting the length of the answer (the outcome, which we measure as the number of words in the answer), we use features of answers and features of users. Answer features include the number of hyperlinks and lines of code the answer contains, and its Flesch readability score [16]. Features describing answerers are their reputation, tenure on Stack Overflow (in terms of percentile rank), and experience, i.e., the total number of answers written during their tenure. We also use activity-related features, including the time since the previous answer written by the user, session length (the number of answers the user writes during the session), and the answer's position within that session. We define a session as a period of activity without a break of 100 minutes or longer. The acceptance probability, reputation, and number of answers features are only used to evaluate the learned components.
Preprocessing
When variables are linearly correlated, regression coefficients are not unique, which presents a challenge for analysis. To address this challenge, we used the Variance Inflation Factor (VIF) to remove multicollinear variables: we iteratively remove variables whose VIF is larger than five. For example, in the Metropolitan data, this approach reduced the number of features from more than 40 to six, representing the ethnic composition of residents within a tract, as well as the percent of adult residents with a college degree or with incomes below the poverty line. Table 1 summarizes all data sets.
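The iterative VIF filter can be sketched with plain NumPy. The VIF of a column is $1/(1-R^2)$ from regressing it on the remaining columns; the function names are illustrative:

```python
import numpy as np

def vif(X, j):
    """Variance Inflation Factor of column j: 1 / (1 - R^2) when
    regressing X[:, j] on the remaining columns (plus an intercept)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.hstack([others, np.ones((len(y), 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

def drop_multicollinear(X, threshold=5.0):
    """Iteratively drop the column with the largest VIF above threshold.
    Returns the indices (into the original X) of the retained columns."""
    cols = list(range(X.shape[1]))
    while len(cols) > 1:
        vifs = [vif(X[:, cols], i) for i in range(len(cols))]
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        cols.pop(worst)
    return cols
```

Given two nearly identical columns and one independent one, the filter drops exactly one of the duplicated pair.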
Dataset  records  features  after VIF  outcome
Synthetic  Y
Metropolitan  Valence
Wine Quality  Quality
NYC  Sale Price
Stack Overflow  Answer Length
Finding Components
As a first step in applying DoGR to data for qualitative analysis, we need to decide on the number of components. In general, finding the appropriate number of components in clustering algorithms is difficult and generally requires ad hoc parameters. For Gaussian Mixture Models, the Bayesian information criterion (BIC) is typically used. We also use BIC, with the number of free parameters determined by the number of components $K$ and the number of independent variables $p$. Based on the BIC scores, we choose five components for the Metropolitan data. In our explorations, using three or six components gave qualitatively similar results, but with some of the components merging or splitting. With the same procedure, the optimal number of components for the Wine Quality, NYC, and Stack Overflow data is four.
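A generic BIC computation for this kind of model selection is sketched below. The free-parameter count (per-component mean, covariance, regression, and noise terms, plus $K-1$ mixture weights) is our assumption, since the exact count used in the text is not reproduced here:

```python
import numpy as np

def dogr_bic(log_likelihood, n, K, d):
    """BIC = p * ln(n) - 2 * ln(L), lower is better.

    Assumed parameter count per component: mean (d), covariance
    (d(d+1)/2), regression coefficients and intercept (d + 1), and
    residual variance (1); plus K - 1 free component weights.
    """
    per_component = d + d * (d + 1) // 2 + (d + 1) + 1
    p = K * per_component + (K - 1)
    return p * np.log(n) - 2.0 * log_likelihood
```

One would fit models for a range of K (e.g., 1 through 6) and pick the K with the smallest BIC.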
Results
Due to the unique interpretability of our method, we first use it to describe meaningful relationships between variables in the realworld data (“Qualitative Results”). We then show how it compares favorably to competing methods (“Quantitative Results”).
Qualitative Results
We use a radar chart to visualize the components discovered in the data. Each colored polygon represents the mean of a component, $\boldsymbol{\mu}_k$, in the feature space. Each vertex of the polygon represents a feature, or dimension, with the length giving the coordinate of $\boldsymbol{\mu}_k$ along that dimension. For the purpose of visualization, each coordinate value of $\boldsymbol{\mu}_k$ was divided by the largest coordinate value of $\boldsymbol{\mu}_k$ across all components. The maximum possible value of each coordinate after normalization is one. The mean value of the outcome variable within each component is shown in the legend, along with 95% confidence intervals.
Our method also estimates the regression coefficients, which we can compare to MLR. We computed the p-value under the null hypothesis that the coefficients in DoGR are equal to those of MLR. If $\beta_1$ and $\beta_2$ are two coefficients to compare, and $e_1$ and $e_2$ are the standard errors of each coefficient, then the z-score is

$z = \dfrac{\beta_1 - \beta_2}{\sqrt{e_1^2 + e_2^2}}$

from which we can easily infer the p-value, assuming a normal distribution of errors [25].
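This comparison is a few lines of SciPy; the function name is illustrative:

```python
import numpy as np
from scipy.stats import norm

def coef_z_test(b1, se1, b2, se2):
    """Two-sided p-value for the null hypothesis that two regression
    coefficients are equal, assuming independent normal errors."""
    z = (b1 - b2) / np.sqrt(se1**2 + se2**2)
    return z, 2.0 * norm.sf(abs(z))
```

Identical coefficients give z = 0 and p = 1; coefficients many standard errors apart give a vanishingly small p-value.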
% Below Poverty  % Graduate
mean  std
All
Orange
Purple
Red
Green
Blue
* p-value; *** p-value
Metropolitan
Figure 3 visualizes the components of the Metropolitan data and reports regression coefficients for two variables. Interestingly, the data is disaggregated along ethnic lines, perhaps reflecting segregation within the metropolitan area. Figure 4 shows the regions in each component on a map of Los Angeles County. The Orange component consists of census tracts with many highly educated white residents. It also has the highest valence, meaning that people post the happiest tweets from those tracts. The Purple component is ethnically diverse, with large numbers of white, Asian, and Hispanic residents. Its valence is only slightly lower than that of the Orange component. The Red component has largely Asian and some Hispanic neighborhoods that are less well-off, with slightly lower valence. The Blue and Green components represent tracts with the least educated and poorest residents. They are also the places with the lowest valence. Looking at the regression coefficients, education is positively related to happiness across the entire population (All), and individually in all components, with the coefficients significantly different from MLR in four of the five components, suggesting this trend was a Simpson's paradox. However, the effect is weakest in the most and least educated components. Poverty has no significant effect across the entire population, but has a negative effect on happiness in the poorest neighborhoods (Blue). Counterintuitively, regression coefficients are positive for two components. It appears that within these demographic classes, the poorer the neighborhood, the happier the tweets that originate from them.
Wine Quality
Wine Quality  Citric Acid  SO2  Sugar
mean  std err
All
Blue  6.02
Orange  5.91  0.016
Green  5.6  0.020
Red  5.36  0.037
* p-value; *** p-value
Figure 5 visualizes the disaggregation of the Wine Quality data into four components. Although we did not use the type of wine (red or white) as a feature, wines were naturally disaggregated by type. The Blue component is almost entirely composed of high-quality (6.02) white wines. The data in the Orange component are all white wines, while the Green component is composed mostly of red wines with average quality 5.6. The lowest-quality (Red) component, with quality equal to 5.36, contains a mixture of red and white wines. In other words, we discover that for higher-quality wines, wine color can be determined with high accuracy simply based on chemical composition, which, to the best of our knowledge, was not known before. Low-quality wines appear less distinguishable based on their chemicals. We find that Chlorides have a negative impact on the quality of wine in all components (not shown); Sugar has a positive impact on high-quality white wines and red wines, and a negative impact on low-quality wines. Surprisingly, Free Sulfur Dioxide has a positive impact on high-quality white wines, but a negative impact in the other components. These are findings that may be important to wine growers, and they capture subtleties in the data that the commonly used MLR does not.
NYC Sale Price
Sale Price  Resid. Units  Commer. Units  Gross Sq.Feet  Land Sq.Feet
mean  std err
All
Purple
Green
Red
Blue
all significant: p-value
The mean sale price of all properties in NYC property sales data is $1.31 Million. The mean price, however, hides a large heterogeneity in the properties on the market. Our method tamed some of this heterogeneity by identifying four components within the sales data. Figure 6 shows that these components represent large commercial properties (Purple), large residential properties (Green), mixed commercial/residential sales (Red), and singleunit residential properties (Blue).
Component  Manhattan  Bronx  Brooklyn  Queens  Staten Island
Purple  17%  20%  41%  17%  5% 
Green  41%  22%  26%  10%  1% 
Red  8%  9%  65%  15%  3% 
Blue  0%  13%  37%  33%  16% 
All  3%  13%  40%  30%  14% 
Table 2 shows what percentage of each component is made up of New York City's five boroughs. Large commercial properties (Purple component), such as office buildings, are located in Brooklyn and Manhattan, for example. These are the most expensive among all properties, with an average price of more than $12 Million. The next most expensive type of property is large residential buildings (Green component), i.e., multi-unit apartment buildings. These are also most likely to be located in Manhattan and Brooklyn. Small residential properties (Blue component), most likely single-family homes, are the least expensive, on average half a million dollars, and most likely to be located in Brooklyn and Queens, with some on Staten Island.
Regressions in this data set show several instances of Simpson's paradox. Property price in population-level data increases as a function of the number of residential units. In disaggregated data, however, this is true only for the Green component, representing apartment buildings. Having more residential units in smaller residential properties (Red and Blue components) lowers their price. This could be explained by smaller multi-unit buildings, such as duplexes and row houses, being located in poorer residential areas, which lowers their price compared to single-family homes. Another notable trend reversal occurs when regressing on lot size (land square feet). As expected, there is a positive relationship in the Blue component with respect to lot size, as single-family homes built on bigger lots are expected to fetch higher prices in the New York City area. However, the trends in the other components are negative. This could be explained as follows. As land becomes more expensive, builders are incentivized to build up, creating multi-story apartment and office buildings. The more expensive the land, the taller the building they build. This is confirmed by the positive relationship with gross square feet, which is strongest for the Purple and Green components. In plain words, these components represent the tall buildings with small footprints that one often sees in New York City.
Stack Overflow
As our last illustration, we apply DoGR to Stack Overflow data to answer the question of how well we can predict the length of the answer a user writes, given the features of the user and the answer.
DoGR splits the data into four clusters, as shown in Figure 7. The Green and Red components contain most of the data, with 47% and 39% of records, respectively. The radar plot shows the relative strength of the features in each component, while the table above the plot shows features characterizing each discovered group. Except for Percentile Tenure, these features were not used by DoGR and are shown to validate the discovered groups.
Component  Size  Acceptance Probability  Answerer Reputation  Num. of Answers  Percentile Tenure
Red  39%  0.26  338.17  16.02  0.43 
Green  47%  0.24  314.35  14.56  0.42 
Blue  8%  0.33  492.19  20.65  0.47 
Orange  5%  0.32  784.62  40.21  0.40 
Component  Words  Codes  Readab.  Percnt.  URLs
mean  std
All  
Red  
Green  
Blue  0.22  
Orange  
* Italicized effects are not significant 
The Orange component (5% of data) contains very active users (longer Session Length), who meticulously document their answers with many lines of code and URLs, so we can label them "power users". Theirs are among the longest (high Words) and more complex (low Readability) answers in the data, and they also tend to be of high quality (high Acceptance Probability). Surprisingly, this group has newer users (lower Percentile Tenure), but they have high reputation (Answerer Reputation) and wrote more answers previously (higher Number of Answers). These users, while a minority, give life to Stack Overflow and make it useful for others. Orange component users pack more code lines into shorter answers. This is in contrast to the other groups, which tend to include more lines of code within longer answers. The brevity of Orange users when documenting code is an example of a trend reversal.
Another interesting subgroup is the Blue group (8% of data), which is composed of "veterans" (high Percentile Tenure), who write easy-to-read answers (high Readability) that are documented with many URLs. These users have a relatively high reputation, but they are selective in the questions they answer (lower Number of Answers than the Orange users). Interestingly, tenure (Percentile) does not have an effect on the length of the answer for these Blue users, while it has a positive effect in the other groups (i.e., more veteran users write longer answers). The negative effect of URLs on the length of the answer suggests that these users use URLs to refer to existing answers.
The Red (39%) and Green (47%) components contain the vast majority of the data. They are similar in terms of user reputation, tenure (Percentile), and experience (Number of Answers), as well as the quality of the answers they write (Acceptance Probability), with the Red component users scoring slightly higher on all the measures. The main difference is in the answers they write: Red users do not include URLs in their answers, while Green users do not include code. Another difference between these groups is that Green users write longer answers than Red users, but as their answers become longer, they also become more difficult to read (lower Readability). In contrast, longer answers by Red users are easier to read.
Overall, we find intuitive and surprising results from data that are largely due to DoGR’s interpretability.
Quantitative Results
We compare the performance of DoGR to existing state-of-the-art methods for disaggregated regression: CLR and its variants WCLR [9] and FWCLR [10], as well as GMR [29]. We also use CART, a method that uses decision trees for regression, and MLR, which does not disaggregate data, as baselines.

Prediction Performance
For the prediction task, we use five-fold nested cross-validation to train the model on four folds and make predictions on the out-of-sample data in the fifth fold. As hyperparameters, we used 1–6 as the potential number of components for all methods. For WCLR and FWCLR, we set , and for FWCLR we set ; for CART we set the depth of the tree as . We use grid search to find the best hyperparameters. Table 3 presents results on our datasets and the synthetic data. To evaluate prediction quality, we use Root Mean Square Error (RMSE) and Mean Absolute Error (MAE). When comparing the quality of cross-validated predictions across methods, we use the Kruskal-Wallis test, a non-parametric method for testing whether multiple distributions differ [17]. When the distributions were significantly different, pairwise comparisons using the Tukey-Kramer test were performed as a post-hoc test, to determine whether the mean cross-validated quality metrics were statistically different [21]. Prediction results are shown in Table 3. We use * to indicate statistically significant () differences between our method and other methods. Bolded values differ significantly from the others or have a lower standard deviation; following [10], we use lower deviation to denote more consistent predictions.

CART, WCLR, and FWCLR do not perform well on the synthetic data because the variances of the two components differ, which these methods do not handle. This problem exists in an arbitrary number of dimensions; we observe that in higher dimensions the performance gap between FWCLR and our method increases.
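The evaluation metrics and significance test above can be sketched in a few lines. This is an illustrative reimplementation, not the authors' code: it computes RMSE, MAE, and the Kruskal-Wallis H statistic (with tie correction) from per-fold error values, which is how cross-validated methods would be compared.

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Square Error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic for k samples, e.g. the per-fold RMSE
    values of competing methods. Ties receive midranks and the standard
    tie correction is applied; compare H against a chi-squared threshold."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    # Assign midranks: tied values share the average of their 1-based ranks.
    ranks, ties, i = {}, 0, 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2.0
        t = j - i
        ties += t ** 3 - t
        i = j
    h = 12.0 / (n * (n + 1)) * sum(
        sum(ranks[v] for v in g) ** 2 / len(g) for g in groups
    ) - 3 * (n + 1)
    correction = 1 - ties / float(n ** 3 - n)
    return h / correction if correction > 0 else float("nan")
```

A large H (relative to the chi-squared distribution with k−1 degrees of freedom) indicates that at least one method's cross-validated errors are distributed differently, which is when a post-hoc pairwise test is warranted.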
For the Metropolitan data, there is no statistically significant difference between methods: all cluster-based regression methods outperform MLR and CART on the prediction task. This shows that any method that accounts for the latent structure of data helps improve predictions. For the Wine Quality data, while the null hypothesis of equal performance of GMR and DoGR cannot be rejected, our method has a smaller standard deviation in both datasets. We were not able to successfully run GMR and WCLR on the NYC data, due to exceptions (returning null as the predicted outcome) and extremely long run time, respectively. It took three days for FWCLR, and four hours for our method, to finish running on the NYC data. Our method has a significantly lower MAE than FWCLR, while CART has a better MAE but a worse RMSE, which shows that CART performs worse on outliers. Figure 8 illustrates why hard clustering methods like CART perform poorly on outliers.

We were also unable to successfully run GMR and WCLR on the Stack Overflow data for the same reasons. It took six days for FWCLR to run one round of cross-validation, after which we stopped it. Therefore, the mean reported in the table for FWCLR is the average of five runs, while for MLR and DoGR it is the average of runs. The best-performing method on this dataset is CART, mainly because Stack Overflow has discrete variables, which CART handles better than FWCLR and DoGR.
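The difference between hard and soft disaggregation can be illustrated with a toy one-dimensional mixture (all numbers below are hypothetical, not from the paper): a hard method commits a point entirely to its nearest component's regression line, while a soft method blends the components' predictions by membership weight, which degrades gracefully for points that fall between components.

```python
import math

# Two hypothetical 1-D components, each with its own regression line.
components = [
    {"mu": 0.0, "sigma": 1.0, "pi": 0.5, "beta0": 0.0, "beta1": 2.0},
    {"mu": 10.0, "sigma": 1.0, "pi": 0.5, "beta0": 5.0, "beta1": -1.0},
]

def _density(x, c):
    """Weighted Gaussian density of component c at x."""
    z = (x - c["mu"]) / c["sigma"]
    return c["pi"] * math.exp(-0.5 * z * z) / (c["sigma"] * math.sqrt(2 * math.pi))

def predict_soft(x):
    """Soft prediction: blend component regressions by membership weight."""
    dens = [_density(x, c) for c in components]
    total = sum(dens)
    return sum((d / total) * (c["beta0"] + c["beta1"] * x)
               for d, c in zip(dens, components))

def predict_hard(x):
    """Hard-clustering analogue: commit fully to the most likely component."""
    c = max(components, key=lambda c: _density(x, c))
    return c["beta0"] + c["beta1"] * x
```

Near a component mean the two predictions agree, but for an outlier halfway between the components the soft prediction averages the two lines, whereas the hard prediction jumps to whichever line happens to win the assignment.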
Method  RMSE (± s.d.)  MAE (± s.d.)

Synthetic
MLR  294.88 (± 1.236)*  288.35 (± 0.903)*
CART  264.70 (± 7.635)*  224.57 (± 6.138)*
WCLR  261.14 (± 3.370)*  232.76 (± 2.682)*
FWCLR  261.27 (± 4.729)*  233.05 (± 3.772)*
GMR  257.36 (± 4.334)  219.15 (± 3.567)
DoGR  257.32 (± 3.871)  219.11 (± 3.106)
Metropolitan
MLR  0.083 (± 0.0061)  0.062 (± 0.0033)
CART  0.086 (± 0.0056)  0.064 (± 0.0036)*
WCLR  0.083 (± 0.0029)  0.062 (± 0.0024)
FWCLR  0.082 (± 0.0044)  0.061 (± 0.0021)
GMR  0.083 (± 0.0043)  0.061 (± 0.0023)
DoGR  0.083 (± 0.0052)  0.061 (± 0.0031)
Wine Quality
MLR  0.83 (± 0.018)*  0.64 (± 0.015)*
CART  0.79 (± 0.015)  0.62 (± 0.013)
WCLR  0.83 (± 0.013)*  0.64 (± 0.011)*
FWCLR  0.80 (± 0.013)*  0.63 (± 0.009)*
GMR  0.79 (± 0.017)  0.62 (± 0.014)
DoGR  0.79 (± 0.014)  0.62 (± 0.011)
NYC
MLR  13.36 (± 7.850)  2.20 (± 0.064)*
CART  15.33 (± 9.128)  1.34 (± 0.190)
FWCLR  13.14 (± 7.643)  1.76 (± 0.321)*
DoGR  11.88 (± 9.109)  1.40 (± 0.222)
Stack Overflow
MLR  60.69 (± 1.118)  37.74 (± 0.152)
CART  58.19 (± 0.781)*  34.05 (± 0.208)*
FWCLR  60.47 (± 0.960)  37.25 (± 0.794)
DoGR  60.68 (± 1.298)  37.62 (± 0.314)
Run Time
To compare the run time of all algorithms, we performed one round of cross-validation (not nested) for each method. The same machine (GB RAM, GHz Intel CPU, Windows OS) was used for all timings. The available code for the WCLR and FWCLR methods is in R, while the other methods are written in Python. Table 4 presents the run times in minutes. The slowest method is WCLR, while the fastest is MLR. WCLR and FWCLR are sensitive to the size of the data, perhaps due to the many hyperparameters they need to tune. To find the best hyperparameters for GMR and DoGR, we ran the methods times, while WCLR requires runs and FWCLR runs. Aside from the NYC and Stack Overflow datasets, for which exceptions occur in GMR, the run time of DoGR is twice that of GMR. The reason GMR throws exceptions could be a singular covariance matrix.
While our method is similar in spirit to GMR, our method is more stable, as shown on the NYC and Stack Overflow data. In addition, our method is interpretable, as it directly computes regression coefficients, while GMR represents relationships between variables via the covariance matrix. Covariance values are not guaranteed to have the same sign, let alone magnitude, as regression coefficients; regression is therefore necessary to understand the true relationships between variables. Mathematically, GMR and unweighted regression can be converted to one another using linear algebra. It is not clear, however, whether this equivalence also holds for weighted regression.
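The point about covariances versus regression coefficients can be made concrete with a toy calculation (the numbers are our own illustration, not from the paper). For jointly Gaussian variables the regression coefficients are b = Cov(x, x)^{-1} Cov(x, y), so with correlated predictors a predictor's covariance with the outcome can have the opposite sign of its coefficient:

```python
# Population covariances for the hypothetical model y = 1.0*x1 - 0.2*x2,
# where x1 and x2 each have unit variance and cov(x1, x2) = 0.9.
sxx = [[1.0, 0.9],
       [0.9, 1.0]]          # Cov(x, x)
sxy = [1.0 - 0.2 * 0.9,     # Cov(x1, y) = 0.82
       0.9 - 0.2]           # Cov(x2, y) = 0.70

# Regression coefficients: b = Cov(x, x)^-1 Cov(x, y), solved by hand for 2x2.
det = sxx[0][0] * sxx[1][1] - sxx[0][1] * sxx[1][0]
b1 = (sxx[1][1] * sxy[0] - sxx[0][1] * sxy[1]) / det
b2 = (sxx[0][0] * sxy[1] - sxx[1][0] * sxy[0]) / det

# Cov(x2, y) = 0.70 is positive, yet the recovered coefficient b2 = -0.2
# is negative: the covariance matrix alone misstates the direction of the
# relationship that regression reports.
print(sxy[1], b1, b2)
```

This is why reading trends off GMR's covariance matrix can mislead, and why explicit regression coefficients are needed to interpret the direction of each effect.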
Dataset  WCLR  FWCLR  GMR  DoGR

Synthetic  6.46  0.68  0.76  1.4
Metropolitan  37  9  2  4
Wine Quality  180+  36  9  16
NYC  600+  232  failed  10
Stack Overflow  —  7200+  failed  170
Discussion
In this paper, we introduce DoGR, which softly disaggregates data by latent confounders. Our method retains the advantages of linear models, namely their interpretability, by reporting regression coefficients that give meaning to trends. Our method also discovers the multidimensional latent structure of data by partitioning it into subgroups with similar characteristics and behaviors. While alternative methods exist for disaggregating data, our approach produces interpretable regressions that are computationally efficient.
We demonstrated the utility of our approach by applying it to real-world data, from wine attributes to answers on a question-answering website. We showed that our method identifies meaningful subgroups and trends, while also yielding new insights into the data. For example, in the wine dataset, DoGR correctly separates high-quality red wines from white, and also discovers two distinct classes of high-quality white wines. In the Stack Overflow data, DoGR identifies important users like “veterans” and “power users.”
There are a few ways to further improve our method. Currently, it applies to continuous variables, but it needs to be extended to handle the categorical variables often seen in social science data. Moreover, real data is often nonlinear; our method therefore needs to be extended beyond linear regression to nonlinear models. In addition, to improve prediction accuracy, our method could incorporate regularization, such as ridge regression or LASSO. Even in its current form, however, DoGR can yield new insights into data.

References
[1] (2018) Can you trust the trend? Discovering Simpson’s paradoxes in social data. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 19–27.
[2] (2018) Using Simpson’s paradox to discover interesting patterns in behavioral data. In Twelfth International AAAI Conference on Web and Social Media.
[3] (2016) Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences 113, pp. 7345–7352.
[4] (1972) On Simpson’s paradox and the sure-thing principle. Journal of the American Statistical Association 67 (338), pp. 364–366.
[5] (2018) Conditional linear regression. In Thirty-Second AAAI Conference on Artificial Intelligence.
[6] (2009) Simpson’s paradox in a synthetic microbial system. Science 323 (5911), pp. 272–275.
[7] (1988) Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American Statistical Association 83 (403), pp. 596–610.
[8] (2009) Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems 47 (4), pp. 547–553.
[9] (2017) On combining clusterwise linear regression and k-means with automatic weighting of the explanatory variables. In International Conference on Artificial Neural Networks, pp. 402–410.
[10] (2018) On combining fuzzy c-regression models and fuzzy c-means with automated weighting of the explanatory variables. In 2018 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 1–8.
[11] (1988) A maximum likelihood methodology for clusterwise linear regression. Journal of Classification 5 (2), pp. 249–282.
[12] (2000) Discovering surprising patterns by detecting occurrences of Simpson’s paradox. In Research and Development in Intelligent Systems XVI, pp. 148–160.
[13] (1997) Applied regression analysis, linear models, and related methods. Sage Publications, Thousand Oaks, CA.
[14] (1994) Supervised learning from incomplete data via an EM approach. In Advances in Neural Information Processing Systems, pp. 120–127.
[15] (2013) Simpson’s paradox in psychological science: a practical guide. Frontiers in Psychology 4, pp. 513.
[16] (1975) Derivation of new readability formulas (Automated Readability Index, Fog Count and Flesch Reading Ease formula) for Navy enlisted personnel. Technical report, U.S. Navy.
[17] (1952) Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association 47 (260), pp. 583–621.
[18] (2016) Emotions, demographics and sociability in Twitter interactions. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 10.
[19] (2017) Causal effect inference with deep latent-variable models. In Advances in Neural Information Processing Systems 30, pp. 6446–6456.
[20] (2015) K-plane regression. Information Sciences 292, pp. 39–56.
[21] (2014) Handbook of Biological Statistics. 3rd edition, Sparky House Publishing, Baltimore.
[22] (1991) A unified approach to mixed linear models. The American Statistician 45 (1), pp. 54–64.
[23] (2019) Yule-Simpson’s paradox in galactic archaeology. arXiv preprint arXiv:1902.01421.
[24] (2012) Editors’ introduction to the special section on replicability in psychological science: a crisis of confidence? Perspectives on Psychological Science 7 (6), pp. 528–530.
[25] (1998) Using the correct statistical test for the equality of regression coefficients. Criminology 36 (4), pp. 859–866.
[26] (2018) Multiple causal inference with latent confounding. arXiv preprint arXiv:1805.08273.
[27] (2016) The natural selection of bad science. Royal Society Open Science 3 (9), pp. 160384.
[28] (1979) Algorithm 39: clusterwise linear regression. Computing 22 (4), pp. 367–373.
[29] (2004) Gaussian mixture regression and classification. Ph.D. thesis, Rice University.
[30] (2018) The blessings of multiple causes. arXiv preprint arXiv:1805.06826.
[31] (2013) Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods 45 (4), pp. 1191–1207.
[32] (2013) A very basic tutorial for performing linear mixed effects analyses. arXiv preprint arXiv:1308.5499.