DoGR: Disaggregated Gaussian Regression for Reproducible Analysis of Heterogeneous Data

08/31/2021
by   Nazanin Alipourfard, et al.
0

Quantitative analysis of large-scale data is often complicated by the presence of diverse subgroups, which reduce the accuracy of inferences they make on held-out data. To address the challenge of heterogeneous data analysis, we introduce DoGR, a method that discovers latent confounders by simultaneously partitioning the data into overlapping clusters (disaggregation) and modeling the behavior within them (regression). When applied to real-world data, our method discovers meaningful clusters and their characteristic behaviors, thus giving insight into group differences and their impact on the outcome of interest. By accounting for latent confounders, our framework facilitates exploratory analysis of noisy, heterogeneous data and can be used to learn predictive models that better generalize to new data. We provide the code to enable others to use DoGR within their data analytic workflows.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 6

06/12/2017

General Latent Feature Models for Heterogeneous Datasets

Latent feature modeling allows capturing the latent structure responsibl...
03/27/2019

The Landscape of R Packages for Automated Exploratory Data Analysis

The increasing availability of large but noisy data sets with a large nu...
01/26/2022

Multi-objective Semi-supervised Clustering for Finding Predictive Clusters

This study concentrates on clustering problems and aims to find compact ...
09/29/2019

I Know You'll Be Back: Interpretable New User Clustering and Churn Prediction on a Mobile Social Application

As online platforms are striving to get more users, a critical challenge...
10/28/2019

Same-Cluster Querying for Overlapping Clusters

Overlapping clusters are common in models of many practical data-segment...
05/03/2022

Scalable Regularised Joint Mixture Models

In many applications, data can be heterogeneous in the sense of spanning...
08/18/2017

Statistical Latent Space Approach for Mixed Data Modelling and Applications

The analysis of mixed data has been raising challenges in statistics and...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

Introduction

Social data is often highly heterogeneous, coming from a population composed of diverse classes of individuals, each with their own characteristics and behaviors. As a result of heterogeneity, a model learned on population data may not make accurate predictions on held-out test data, nor offer analytic insights into the underlying behaviors that motivate interventions. To illustrate, consider Figure 1

, which shows data collected for a hypothetical nutrition study measuring how the outcome, body mass index (BMI), changes as a function of daily pasta calorie intake. Multivariate linear regression (MLR) analysis finds a negative relationship in the population (red dotted line) between these variables. The negative trend suggests that—paradoxically—increased pasta consumption is associated with lower BMI. However, unbeknownst to researchers, the hypothetical population is heterogeneous, composed of classes that varied in their fitness level. These classes (clusters in Fig. 

1

) represent, respectively, people who do not exercise, people with normal activity level, and athletes. When data is disaggregated by fitness level, the trends within each subgroup are positive (dashed lines), leading to the conclusion that increased pasta consumption is in fact associated with a higher BMI. Recommendations for pasta consumption arising from the naive analysis are opposite to those arising from a more careful analysis that accounts for the confounding effect of different classes of people. The trend reversal is an example Simpson’s paradox, which has been widely observed in many domains, including biology, psychology, astronomy, and computational social science  

[6, 15, 23, 4].

Figure 1: Heterogeneous data with three latent classes. The figure illustrates Simpson’s paradox, where a negative relationship between the outcome and the independent variable exists for population as a whole (dotted line) but reverses when the data is disaggregated by classes (dashed lines).

Social scientists analyze heterogeneous data with underlying structure (classes) using mixed effects models [32]

. This variant of linear regression allows for intercepts to be random variables in order to account for shifts between classes, and slopes to be random variables in order to model differences in regression coefficients within classes. Mixed effects models are used to describe non-independent observations of data from the same class, and they can even handle trend reversal associated with Simpson’s paradox. Mixed effects models assume that classes are specified by a categorical variable, which can be constructed by disaggregating or binning data on an existing variable 

[1]. In practice, however, these classes may not be known a priori, or may be related to multiple variables. Instead, they must be discovered from data, along with the trends they represent.

Many methods already exist for finding latent classes within data; however, they have various shortcomings. Unsupervised clustering methods disaggregate data regardless of the outcome variable being explained although distinct outcomes may be best described by different clusterings of the same data. Recent methods have tackled Simpson’s paradox by performing supervised disaggregation of data  [2, 12], but these methods disaggregate data into subgroups using existing features, and thus are not able to capture effects of latent classes (i.e., unobserved confounders). Additionally, they perform “hard” clustering, but assigning each data point to a unique group. Instead, a “soft” clustering is more realistic as it captures the degree of uncertainty about which group the data belong to. To address these challenges, we describe Disaggregation via Gaussian Regression (DoGR

), a method that jointly partitions data into overlapping clusters and estimates linear trends within them. This allows for learning accurate and generalizable models while keeping the interpretability advantage of linear models. Proposed method assumes that the data can be described as a superposition of clusters, or components, with every data point having some probability to belong to any of the components. Each component represents a latent class, and the soft membership represents uncertainty about which class a data point belongs to. We use an expectation maximization (EM) algorithm to estimate component parameters and component membership parameters for the individual data points.

DoGR jointly updates component parameters and its regression coefficients, weighing the contribution of each data point to the component regression coefficients by its membership parameter. The joint learning of clusters and their regressions allows for discovering the latent classes that explain the structure of data.

Our framework learns predictive and interpretable models of data. DoGR is validated on synthetic and real-world data and is able to discover hidden subgroups in analytically challenging scenarios, where subgroups overlap along some of the dimensions, and identify interesting trends that may be obscured by Simpson’s paradox. This includes scenarios where the trends in data show Simpson’s reversal. Our method achieves performance on prediction tasks comparable to or better than state-of-the-art algorithms, but in a fraction of the time, making it well suited for big data analysis.

Background

Regression is one of the most widely used methods in data analysis due to its flexibility, interpretability, and computational scalability [13]

. Linear models are perhaps one of the most popular regression methods to learn the relationship between predictor variables (also known as fixed effects or features) and the outcome variable. However, linear models make assumptions about the data that are often violated by real-world data. One of these assumptions is independence of observations. In reality, however, observations may come from groups of similar individuals, violating the independence assumption. To account for individual differences within such groups, mixed-effect or multi-level models were developed

[22]. These models assume that similar individuals, or data samples, come from the same class, and that classes are specified by some categorical variable [32]. In some cases, this variable can be created by binning a continuous variable and disaggregating data by these bins [1]. However, existing methods do not work when differences between groups or individuals cannot directly be described by observed variables or features. In this chapter, we describe a method that captured differences between groups as latent confounders (i.e., latent clusters) that are learned from data. Our work improves the utility of linear models in the analysis of social data by controlling for these clusters. A number of previous works have attempted to tackle the issue of latent confounders. Clusterwise linear regression (CLR) [28]

starts with initial clusters and updates them by reassigning one data point to another partition that minimizes a loss function. The method is slow, since it moves one data point at each iteration. Two other methods, WCLR 

[9] and FWCLR [10]

, improve on CLR by using k-means as their clustering method. These methods were shown to outperform CLR and other methods, such as K-plane

[20]. In Conditional Linear Regression [5], the goal is to find a linear rule capable of achieving more accurate predictions for a segment

of the population by ignoring outliers. One of the larger differences between their method and ours is that Conditional Linear Regression focuses on a small subset of data, while we model data as a whole, alike to CLR, WCLR, FWCLR and K-plane. The shared parameter

across all clusters in WCLR and FWCLR makes these methods perform poorly if clusters have different variance, as we will show in our results section. Other methods have combined Gaussian Mixture Models and regression to create algorithms similar to our own

[29, 14]. We call these methods Guassian Mixture Regression (GMR). In contrast to these methods, however, we can capture relationships between independent and outcome variables through regression coefficients, which previous methods were unable to do. We also use Weighted Least Squares to fit our model, which makes it less sensitive to outliers [7].

Causal inference has been used to address the problem of confounders, including latent confounders [19], and to infer the causal relationship between features and outcomes [3, 26, 30]. One difficulty with causal inference however, is that the focus is traditionally on one intervention [26]. Taking into account synergistic effects of multiple features is not well understood, but has been attempted recently [26, 30]. With adequate datasets, these can help infer causal relationships between multiple causes, but certain causal assumptions are needed, which might not correspond to reality. In contrast, regression offers us the opportunity to understand relationships between each feature and an outcome, regardless of the dataset, even if we cannot make causal claims.

DoGR Method

Let be a set of observations, or records, each containing a real-valued outcome of an observation

and a vector of

independent variables, or features,

. Regression analysis is often used to capture the relationship between each independent variable and the outcome. Specifically, MLR estimates regression coefficients

by minimizing the residuals of over all observations. However, parameters learned by the MLR model may not generalize to out-of-sample populations, as they can be confounded by Simpson’s paradox and sampling bias. We can call this the “robustness” problem of regression.

As discussed in the introduction, Figure 1(a) illustrates this problem with synthetic data. The MLR model trained on the aggregate data gives . This suggests a negative relationship (solid red line) between the independent variable Daily pasta intake and the outcome . However, there are three components each with a positive trend. Indeed, applied separately to each component, MLR learns the proper positive relationship (dashed green line) between Daily pasta intake and : , where is cluster’s intercept, and is coefficient,

. Associations learned by the

MLR model trained on the population-level data are not robust or generalizable. This could help explain the reproducibility problem seen in many fields [24, 27], where trends seen in some experiments cannot be reproduced in other experiments with different populations. This can be clearly observed in Figure 1(b), which represents a different sampling of the population. The underlying clusters are the same, but the middle cluster is over-sampled during data collection. As a result, MLR trained at the population-level finds only a weak association between independent variable and the outcome (solid red line). In contrast, regressions learned separately for each cluster (green dashed line) remain the same even in the new data: thus, the cluster regressions represent robust and generalizable relationships in data.

Model Specification

The goal of this work is to learn robust and reproducible trends through a regression model that accounts for the latent structure of data, for example, the presence of three clusters in data shown in Fig. 1. Our model jointly disaggregates the data into overlapping subgroups, or components, and performs weighted linear regression within each component. We allow components to overlap in order to represent the uncertainty about which component or subgroup an observation belongs to.

In what follows, we used capital letters to denote random variables and lowercase letters their values. We model the independent variable of each component

as a multivariate normal distribution with mean

and covariance matrix :

(1)

In addition, each component is characterized by a set of regression coefficients . The regression values of the component are

(2)

with giving the residuals for component . Under the assumption of normality of residuals [11], has a normal distribution with mean

and standard deviation

.

(3)

where is defined by Eq. 2. Under the assumption of homoscedasticity, in which the error is the same for all , the joint density is the product of the conditional (Eq. 3) and marginal (Eq. 1) densities:

(4)

Multiplication of can be converted to a normal distribution: comes from , where and is a block matrix.

Model Learning

The final goal of the model is to predict the outcome, given independent variables. This prediction combines the predicted values of outcome from all components by taking the average of the predicted values, weighed by the size of the component. We define as the weight of the component , where

. We can define the joint distribution over all components as

.

Then the log-likelihood of the model over all data points is:

(5)

The formula here is same as the Gaussian Mixture Model, except that the target is a function of . To find the best values for parameters we can leverage the Expectation Maximization (EM) algorithm. The algorithm iteratively refines parameters based on the expectation (E) and maximization (M) steps.

E-step

Let’s define (membership parameter) as probability that data point belongs to component . Given the parameters of last iteration, membership parameter is:

(6)

Thus, the E-step disaggregates the data into clusters, but it does so in a “soft” way, with each data point having some probability to belong to each cluster.

M-step

Given the updated membership parameters, the M-step updates the parameters of the model for the next iteration as :

In addition, this step updates regression parameters based on the estimated parameters in . Our method uses Weighted Least Squares for updating the regression coefficients for each component, with as weights. In the other words, we find that minimizes the Weighted Sum of Squares (WSS) of the residuals:

Using the value of , the updated would be:

Intuitively, shows us where in the center of each subgroup resides. The further a data point is from the center, the lower its probability to belong to the subgroup. The covariance matrix captures the spread of the subgroup in the space relative to the center . Regression coefficient gives low weight to outliers (i.e., is a weighted regression) and captures the relationship between and near the center of the subgroup. Parameter tells us about the variance of the residuals over fitted regression line, and tells us about the importance of each subgroup.

Prediction

For test data , the predicted outcome is the weighted average of the predicted outcomes for all components. The weights capture the uncertainty about which component the test data belongs to. Using equation 3, the best prediction of outcome for component would be , which is the mean of conditional outcome value of the component :

(7)

The green solid line in figure 1 represents the predicted outcome as function of . The solid green line, is weighted average over dashed-green lines.

Data

We apply our method to a synthetic and real-world data sets, including large social data sets described below.

The Synthetic data consists of two subgroups, with the same mean on , but different variances, as shown in Figure 2. The variance of for the top component is and for the bottom component is . The number of data points in bottom component is and in top component . The value for the bottom component is and the top component it is , with as variance of residual.

Figure 2: Synthetic data with two components centered on , but with different variances. Gray points are data, and red points are predicted outcomes made by our method.

The Metropolitan data comes from a study of emotions expressed through Twitter messages posted from locations around Los Angeles County [18]. a large metropolitan area. The data contains more than 6 million geo-referenced tweets from 33,000 people linked to US census tracts through their locations. The demographic and socioeconomic characteristics of tracts came from the 2012 American Fact Finder.222http://factfinder.census.gov/ A tract is a small region that contains about 4,000 residents on average. The tracts are designed to be relatively homogeneous with respect to socioeconomic characteristics of the population, such as income, residents with bachelors degree or above, and other demographics. Of more than 2000 tracts within this metropolitan area, 1688 were linked to tweets.

Emotional valence

was estimated from the text of the tweets using Warriner, Kuperman, and Brysbaert (WKB) lexicon 

[31]. The lexicon gives valence scores—between 1 and 9—which quantify the level of pleasure or happiness expressed by a word, for 14,000 English words. Since valence is a proxy for the expressed happiness, the data allows us to investigate the social and economic correlates of happiness. We use valence as the outcome in data analysis.

The Wine Quality data combines two benchmark data sets from UCI repository333https://archive.ics.uci.edu/ml/datasets/wine+quality related to red and white wines [8]. The data contains records about white wine and red wine samples of wine quality and concentrations of various chemicals. Quality is a score between and , and is typically around . Our goal is to model wine quality based on physicochemical tests. Note that the type of wine (red or white) is only used to evaluate the learned components.

The New York City Property Sale data444https://www.kaggle.com/new-york-city/nyc-property-sales contains records of every building or building unit (such as an apartment) sold in the New York City property market over a 12-month period, from September 2016 to September 2017. We removed all categorical variables like neighborhood and tax class. We use each property’s borough to study relevance of the components to different neighborhoods in NYC; however, we do not use it in analysis. The outcome variable is sale price, which is in million dollars, and it ranges from to .

The Stack Overflow dataset contains data from a question-answering forum on the topic of computer programming. Any user can ask a question, which others may answer. We used anonymized data representing all answers to questions with two or more answers posted on Stack Overflow from August 2008 until September 2014.555https://archive.org/details/stackexchange We created a user-focused data set by selecting at random one answer written by each user and discarding the rest. To understand factors affecting the length of the answer (the outcome, which we measure as the number of words in the answer), we use features of answers and features of users. Answer features include the number of hyperlinks, and lines of code the answer contains, and its Flesch readability score [16]. Features describing answerers are their reputation, tenure on Stack Overflow (in terms of percentile rank) and experience, i.e., the total number of answers written during their tenure. We also use activity-related features, including time since previous answer written by the user, session length, giving the number of answers user writes during the session, and answer position within that session. We define a session as a period of activity without a break of 100 minutes or longer. Features acceptance probability, reputation and number of answers are only used to evaluate the learned components.

Preprocessing

When variables are linearly correlated, regression coefficients are not unique, which presents a challenge for analysis. To address this challenge, we used Variance Inflation Factor (VIF) to remove multicollinear variables. We iteratively remove variables when their VIF is larger than five. For example, in the Metropolitan data, this approach reduced the number of features from more than 40 to six, representing the number of residents within a tract who are , , and , as well as percent of adult residents with degree or with incomes line. Table 1, represents information about all data sets.

Dataset records features after VIF outcome
Synthetic Y
Metropolitan Valence
Wine Quality Quality
NYC Sale Price
Stack Overflow Answer Length
Table 1: Data sets and their characteristics.

Finding Components

As first step in applying DoGR to data for qualitative analysis, we need to decide on the number of components. In general, finding the appropriate number of components in clustering algorithms is difficult and generally requires ad hoc parameters. For Gaussian Mixture Models, the Bayesian information criterion (BIC) is typically used. We also use BIC with parameters, where is number of components and is number of independent variables. Based on the BIC scores, we choose five components for this data. In our explorations, using three or six components gave qualitatively similar results, but with some of the components merging or splitting. With the same procedure, the optimal number of components for Wine Quality and NYC and Stack Overflow data is four.

Results

Due to the unique interpretability of our method, we first use it to describe meaningful relationships between variables in the real-world data (“Qualitative Results”). We then show how it compares favorably to competing methods (“Quantitative Results”).

Qualitative Results

We use a radar chart to visualize the components discovered in the data. Each colored polygon represents mean of the component, , in the feature space. Each vertex of the polygon represents a feature, or dimension, with the length giving the coordinate of along that dimension. For the purpose of visualization, each coordinate value of was divided by the largest coordinate value of

across all components. The maximum possible value of each coordinate after normalization is one. The mean value of the outcome variable within each component is shown in the legend, along with 95% confidence intervals.

Our method also estimates the regression coefficients, which we can compare to MLR. We computed the p-value assuming coefficients in DoGRare equal to MLR. If and are two coefficients to compare, and and are the standard errors

of each coefficient, then the z-score is

, from which we can easily infer the p-value assuming a normal distribution of errors [25].

% Below %
Poverty Graduate
mean std
All
Orange
Purple
Red
Green
Blue
* p-value; *** p-value
Figure 3: Disaggregation of the Metropolitan data into five subgroups. The radar plot shows the relative importance of a feature within the subgroup. The table report regression coefficients for two independent variables and for Multivariate Regression (MLR) of aggregate data and separately for each subgroup found by our method.
Figure 4: Map of Los Angeles county. The color represent the component for the Tract. Colors match the colors of Figure 3. Pasadena Neighbourhood is red, while downtown is blue, Beverly Hills is orange, Santa Monica is purple and orange, while we can see more orange for Manhattan beach. East Los Angeles is green.

Metropolitan

Figure 3 visualizes the components of the Metropolitan data and reports regression coefficients for two variables. Interestingly, the data is disaggregated along ethnic lines, perhaps reflecting segregation within the metropolitan area. Figure 4 represents the regions in each component in Los Angeles County map. The component consists of census tracts with many highly educated white residents. It also has highest valence (), meaning that people post the happiest tweets from those tracts. The Purple component, is ethnically diverse, with large numbers of white, Asian and Hispanic residents. It’s valence () is only slightly lower than of the Orange component. The Red component has largely Asian and some Hispanic neighborhoods that are less well-off, but with slightly lower valence (). The Blue and Green components represents tracts with the least educated and poorest residents. They are also places with the lowest valence ( and , respectively). Looking at the regression coefficients, education is positively related to happiness across the entire population (All), and individually in all components, with the coefficients significantly different from MLR in four of the five components, suggesting this trend was a Simpson’s paradox. However, the effect is weakest in the most and least educated components. Poverty has no significant effect across the entire population, but has a negative effect on happiness in poorest neighborhoods (Blue). Counter-intuitively, regression coefficients are positive for two components. It appears that within these demographic classes, the poorer the neighborhood, the happier the tweets that originate from them.

Wine Quality

Wine Quality Citric Acid SO2 Sugar
mean std err
All
Blue 6.02
Orange 5.91 0.016
Green 5.6 0.020
Red 5.36 0.037
* p-value; *** p-value
Figure 5: The value of the center of each component () for four components of the Wine Quality data. The Blue and Orange components are almost entirely white wines, and the Green component is composed mostly () of red wines. The lowest quality and smallest—Red—component is a mixture of red () and white () wines.

Figure 5 visualizes the disaggreagation of the Wine Quality data into four components. Although we did not use the type of wine (red or white) as a feature, wines were naturally disaggregated by type. The Blue component is almost entirely () composed of high quality () white wines. All data in the Orange component () are white wine, while the Green component is composed mostly () of red wines with average quality . The lowest quality (Red) component, with quality equal to , contains a mixture of red () and white () wine. In other words, we discover that in higher-quality wines, wine color can be determined with high accuracy simply based on its chemical components, which was not known before, to the best of our knowledge. Low quality wines appear less distinguishable based on their chemicals. We find Chlorides have a negative impact on quality of wine in all components (not shown), Sugar has positive impact on high quality white wines and red wines, and negative impact on low quality wines. Surprisingly, Free Sulfur Dioxide has a positive impact on high quality white wines, but a negative impact in other components. These are findings that may be important to wine growers, and capture subtleties in data that commonly-used MLR does not.

NYC Sale Price

Sale Resid. Commer. Gross Land
Price Units Units Sq.Feet Sq.Feet
mean std err
All
Purple
Green
Red
Blue
all significant: p-value
Figure 6: The value of center () for four components of NYC data. The numbers in legends are average price of the component’s real estate and is in millions of dollars.

The mean sale price of all properties in NYC property sales data is $1.31 Million. The mean price, however, hides a large heterogeneity in the properties on the market. Our method tamed some of this heterogeneity by identifying four components within the sales data. Figure 6 shows that these components represent large commercial properties (Purple), large residential properties (Green), mixed commercial/residential sales (Red), and single-unit residential properties (Blue).

Component Manhattan Bronx Brooklyn Queens Staten
Purple 17% 20% 41% 17% 5%
Green 41% 22% 26% 10% 1%
Red 8% 9% 65% 15% 3%
Blue 0% 13% 37% 33% 16%
All 3% 13% 40% 30% 14%
Table 2: DoGR components, and the boroughs that make up each component (rows might not add up to 100% due to rounding).

Table 2 show what percentage of each component is made up of New York City’s five boroughs. Large commercial properties (Purple component), such as office buildings, are located in Brooklyn and Manhattan, for example. These are the most expensive among all properties, with average price of more than $12 Million. The next most expensive type of property are large residential buildings (Green component)—multi-unit apartment buildings. These are also most likely to be located in Manhattan and Brooklyn. Small residential properties (Blue component)—most likely to be single family homes—are the least expensive, on average half a million dollars, and most likely to be located in Brooklyn and Queens, with some in Staten Island.

Regressions in this data set show several instances of Simpson’s paradoxes. Property price in population-level data increases as a function of the number of residential units. In disaggregated data, however, this is true only for the Green component, representing apartment buildings. Having more residential units in smaller residential properties (Red and Blue components) lowers their price. This could be explained by smaller multi-unit buildings, such as duplexes and row houses, being located in poorer residential areas, which lowers their price compared to single family homes. Another notable trend reversal occurs when regressing on lot size (land square feet). As expected, there is a positive relationship in the Blue component with respect to lot size, as single family homes build on bigger lots are expected to fetch higher prices in New York City area. However, the trends in the other components are negative. This could be explained the following way. As land becomes more expensive, builders are incentivized to build up, creating multi-story apartment and office buildings. The more expensive the land, the taller the building they build. This is confirmed by the positive relationship with gross square feet, which are strongest for the Purple and Green components. In plain words, these components represent the tall buildings with small footprint that one often sees in New York City.

Stack Overflow

As our last illustration, we apply DoGR to Stack Overflow data to answer the question how well we can predict the length of the answer a user writes, given the features of the user and the answer.

DoGR splits the data into four clusters, as shown in 7. Green and Red components contain most of the data, with 47% and 39% of records respectively. The radar plot shows the relative strength of the features in each component, while the table above the plot shows features characterizing each discovered group. Except for Percentile tenure, these features were not used by DoGR and are shown to validate the discovered groups.

Size Acceptance Answerer Num. of Percentile
Probability Reputation Answers Tenure
Red 39% 0.26 338.17 16.02 0.43
Green 47% 0.24 314.35 14.56 0.42
Blue 8% 0.33 492.19 20.65 0.47
Orange 5% 0.32 784.62 40.21 0.40
Words Codes Readab. Percnt. URLs
mean std
All
Red
Green
Blue -0.22
Orange
* Italicized effects are not significant
Figure 7: Disaggregation of Stack Overflow data into subgroups. The outcome variable is length of the answer (number of words). The radar plot shows the importance of each features used in the disaggregation. The top table shows average values of validation features, while the bottom table shows regression coefficients for the groups.

The Orange component ( of data) contains very active (longer Session Length) users, who meticulously document their answers with many lines of code and URLs, so we can label them “power users”. These are among the longest (high Words) and more complex (low Readability) answers in the data, and they also tend to be high quality (high Acceptance Probability). Surprisingly this group has newer users (lower Percentile Tenure), but they have high reputation (Answerer Reputation) and wrote more answers previously (higher Number of Answers). These users, while a minority, give life to Stack Overflow and make it useful for others. Orange component users have more code lines within shorter answers. This is in contrast to other groups, which tend to include more lines of code within longer answers. The brevity of Orange users when documenting code is an example of a trend reversal.

Another interesting subgroup is the Blue group ( of data), which is composed of “veterans” (high Percentile Tenure), who write easy-to-read answers (high Readability) that are documented with many URLs. These users have a relatively high reputation, but they are selective in the questions they answer (lower Number of Answers than for the Orange users). Interestingly, tenure (Percentile) does not have an effect on length of the answer for these Blue users, while it has positive effect in other groups (i.e., more veteran users write longer answers). Negative effect of URLs on the length of the answer suggests that these users use URLs to refer to existing answers.

The Red (39%) and Green (47%) components contain the vast majority of data. They are similar in terms of user reputation, tenure (Percentile) and experience (Number of Answers), as well as the quality of the answers they write (Acceptance Probability), with the Red component users scoring slightly higher on all the measures. The main difference is in the answers they write: Red users do not include URLs in their answers, while Green users do not include code. Another difference between these groups is that Green users have longer answers than Red users, but as their answers become longer, they also become more difficult to read (lower Readability). In contrast, longer answers by Red users are easier to read.

Overall, we find intuitive and surprising results from data that are largely due to DoGR’s interpretability.

Quantitative Results

We compare the performance of DoGR to existing state-of-the-art methods for disaggregated regression: the three variants of CLR, WCLR [9], FWCLR [10], and GMR [29]. We use CART

as a method which uses decision tree for regression. We also use

MLR, which does not disaggregate data, as a baseline.

Prediction Performance

For the prediction task, we use

-fold nested cross validation to train the model on four folds and make predictions on the out-of-sample data in the fifth fold. As hyperparameters, we used

1–6 as potential number of components for all methods. For WCLR and FCWCLR, we set , and for FWCLR we set for CART we set the depth of the tree as . We use grid search to find best hyperparameters. Table 3 presents results on our datasets and synthetic data. To evaluate prediction quality, we use Root Mean Square Error (RMSE) and Mean Absolute Error (MAE)

. When comparing the quality of cross-validated predictions across methods, we use the Kruskal-Wallis test, a non-parametric method to compare whether multiple distributions are different

[17]. When distributions were significantly different, a pairwise comparisons using the Tukey-Kramer test was also done as post-hoc test, to determine if mean cross-validated quality metrics were statistically different [21]. Prediction results are shown in table 3. We use to indicate statistically significant () differences between our method and other methods. Bolded values differ significantly from others or have lower standard deviation. Following [10], we use lower deviation to denote more consistent predictions.

CART, WCLR and FWCLR do not perform well in Synthetic data, because variances of the two components are different, which these methods do not handle. This is a problem that exists in an arbitrary number dimensions; we observe in higher dimensions the gap between performance of FWCLR and our method increases.

For Metropolitan data, there is no statistically significant difference between methods: all cluster-based regression methods outperform MLR and CART on the prediction task. This shows that any method that accounts for the latent structure of data helps improve predictions. For Wine Quality

data, while the null hypothesis of equal performance of

GMR and DoGR cannot be rejected, our method has a smaller standard deviation in both datasets. We were not able to successfully run GMR and WCLR on NYC data, due to exceptions (returning null as predicted outcome) and extremely long run time, respectively. It took three days for FWCLR, and four hours for our method to finish running on NYC data. Our method has significantly lower MAE compared to FWCLR, while CART has a better MAE and worse RMSE. This shows that CART has a worse performance for outliers. Figure 8 represents the reason behind the poor performance for outliers for hard clustering methods like CART.

We were also not able to successfully run GMR and WCLR on Stack Overflow data for the same reasons. It took six days for FWCLR to run one round of cross-validation, after which we stopped it. Therefore, the mean reported in the table for FWCLR is the average of five runs, while for MLR and DoGR it is the average of runs. The best performing method is CART. The main reason is that Stack Overflow has discrete variables and CART is a better method is compare with FWCLR and DoGR for handling that types of variables.

Method RMSE () MAE ()
Synthetic
MLR 294.88 ( 1.236)* 288.35 ( 0.903)*
CART 264.70 ( 7.635)* 224.57 ( 6.138)*
WCLR 261.14 ( 3.370)* 232.76 ( 2.682)*
FWCLR 261.27 ( 4.729)* 233.05 ( 3.772)*
GMR 257.36 ( 4.334) 219.15 ( 3.567)
DoGR 257.32 ( 3.871) 219.11 ( 3.106)
Metropolitan
MLR 0.083 ( 0.0061) 0.062 ( 0.0033)
CART 0.086 ( 0.0056) 0.064 ( 0.0036)*
WCLR 0.083 ( 0.0029) 0.062 ( 0.0024)
FWCLR 0.082 ( 0.0044) 0.061 ( 0.0021)
GMR 0.083 ( 0.0043) 0.061 ( 0.0023)
DoGR 0.083 ( 0.0052) 0.061 ( 0.0031)
Wine Quality
MLR 0.83 ( 0.018)* 0.64 ( 0.015)*
CART 0.79 ( 0.015) 0.62 ( 0.013)
WCLR 0.83 ( 0.013)* 0.64 ( 0.011)*
FWCLR 0.80 ( 0.013)* 0.63 ( 0.009)*
GMR 0.79 ( 0.017) 0.62 ( 0.014)
DoGR 0.79 ( 0.014) 0.62 ( 0.011)
NYC
MLR 13.36 ( 7.850) 2.20 ( 0.064)*
CART 15.33 ( 9.128) 1.34 ( 0.190)
FWCLR 13.14 ( 7.643) 1.76 ( 0.321)*
DoGR 11.88 ( 9.109) 1.40 ( 0.222)
Stack Overflow
MLR 60.69 ( 1.118) 37.74 ( 0.152)
CART 58.19 ( 0.781)* 34.05 ( 0.208)*
FWCLR 60.47 ( 0.960) 37.25 ( 0.794)
DoGR 60.68 ( 1.298) 37.62 ( 0.314)
Table 3: Results of prediction on five data-sets. Asterisk indicates results that are significantly different from our method (p-value ). The bolded results have smaller standard deviation among the best methods with same mean of error.
(a) Soft Clustering
(b) Hard Clustering
Figure 8: The difference between hard and soft clustering for data analysis. Assuming we have two clusters (yellow and red); in hard clustering, the uncertainty of cluster assignment is not tractable in data analysis phase (e.g. studying the coefficients of the independent variables), since all the data-points have one of the main cluster colors (b). However, having the soft-clustering, we can get the approximated coefficient for each individual data-point (the whole range from yellow to red in (a)), separately.

Run Time

To compare the run time of all algorithms, we performed one round of cross validation (not nested) for each method. The same machine (-GB RAM, -GHz Intel CPU, Windows OS) was used for time calculation. The available code for WCLR and FWCLR methods are in R while the other methods are written in Python. Table 4 presents the run time in minutes. The slowest method is WCLR, while the fastest one is MLR. WCLR and FWCLR are sensitive to size of data, perhaps due to the many hyperparameters they need to tune. To find the best hyperparameters for GMR and DoGR, we ran the methods times, while WCLR requires runs and FWCLR runs. Beside NYC and Stack Overflow datasets for which exceptions occur in GMR, the run time of DoGR is twice that of GMR. The reason GMR throws exception could be because of a singular covariance matrix.

While our method is similar in spirit to GMR, out method is more stable, as shown on the NYC and Stack Overflow data. In addition, our method is interpretable, as it directly computes regression coefficients, while GMR represents relationships between variables via the covariance matrix. Covariance values are not guaranteed to have the same sign, let alone magnitude, as regression coefficients. Regression is therefore necessary to understand the true relationships between variables. Mathematically, GMR and unweighted regression can be converted to one another using linear algebra. It is not clear, however, whether the equivalence also holds for weighted regression.

Dataset WCLR FWCLR GMR DoGR
Synthetic 6.46 0.68 0.76 1.4
Metropolitan 37 9 2 4
Wine Quality 180+ 36 9 16
NYC 600+ 232 failed 10
Stack Overflow - 7200+ failed 170
Table 4: Run time (in minutes) of one round of cross-validation (non-nested). MLR took less than a minute for all datasets. ’failed’ indicates that the method could not run on the data due to exceptions. We time out WCLR exceeding sufficient amount of time to show the order of performance. Bold numbers indicate the fastest algorithm.

Discussion

In this paper, we introduce DoGR, which softly disaggregates data by latent confounders. Our method retains the advantages of linear models, namely their interpretability, by reporting regression coefficients that give meaning to trends. Our method also discovers the multidimensional latent structure of data by partitioning it into subgroups with similar characteristics and behaviors. While alternative methods exist for disaggregating data, our approach produces interpretable regressions that are computationally efficient.

We demonstrated the utility of our approach by applying it to real-world data, from wine attributes to answers in a question-answering website. We show that our method identifies meaningful subgroups and trends, while also yielding new insights into the data. For example, in the wine data set DoGR correctly separates high quality red wines from white, and also discovers two distinct classes of high quality white wines. In Stack Overflow data, DoGR identifies important users like “veterans” and “power users.”

There are a few ways to further improve our method. Currently, it is applied to continuous variables, but it needs to be extended to categorical variables, often seen in social science data. Moreover real data is often non-linear; therefore, our method needs to be extended to non-linear models beyond linear regression. In addition, to improve prediction accuracy, our method could include regularization parameters, such as ridge regression or LASSO. Even in its current form, however,

DoGR can yield new insights into data.

References

  • [1] N. Alipourfard, P. G. Fennell, and K. Lerman (2018) Can you trust the trend?: discovering simpson’s paradoxes in social data. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 19–27. Cited by: Introduction, Background.
  • [2] N. Alipourfard, P. G. Fennell, and K. Lerman (2018) Using simpson’s paradox to discover interesting patterns in behavioral data. In Twelfth International AAAI Conference on Web and Social Media, Cited by: Introduction.
  • [3] E. Bareinboim and J. Pearl (2016-07) Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences 113, pp. 7345–7352. External Links: Document Cited by: Background.
  • [4] C. R. Blyth (1972) On simpson’s paradox and the sure-thing principle. Journal of the American Statistical Association 67 (338), pp. 364–366. Cited by: Introduction.
  • [5] D. Calderon, B. Juba, Z. Li, and L. Ruan (2018) Conditional linear regression. In

    Thirty-Second AAAI Conference on Artificial Intelligence

    ,
    Cited by: Background.
  • [6] J. S. Chuang, O. Rivoire, and S. Leibler (2009) Simpson’s paradox in a synthetic microbial system. Science 323 (5911), pp. 272–275. External Links: Document, ISSN 0036-8075, Link, https://science.sciencemag.org/content/323/5911/272.full.pdf Cited by: Introduction.
  • [7] W. S. Cleveland and S. J. Devlin (1988) Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American Statistical Association 83 (403), pp. 596–610. External Links: Document Cited by: Background.
  • [8] P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis (2009) Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems 47 (4), pp. 547–553. Cited by: Data.
  • [9] R. A. da Silva and F. d. A. de Carvalho (2017) On combining clusterwise linear regression and k-means with automatic weighting of the explanatory variables. In

    International Conference on Artificial Neural Networks

    ,
    pp. 402–410. Cited by: Background, Quantitative Results.
  • [10] R. A. da Silva and F. d. A. de Carvalho (2018) On combining fuzzy c-regression models and fuzzy c-means with automated weighting of the explanatory variables. In 2018 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 1–8. Cited by: Background, Prediction Performance, Quantitative Results.
  • [11] W. S. DeSarbo and W. L. Cron (1988) A maximum likelihood methodology for clusterwise linear regression. Journal of classification 5 (2), pp. 249–282. Cited by: Model Specification.
  • [12] C. C. Fabris and A. A. Freitas (2000) Discovering surprising patterns by detecting occurrences of simpson’s paradox. In Research and Development in Intelligent Systems XVI, pp. 148–160. Cited by: Introduction.
  • [13] J. Fox (1997) Applied regression analysis, linear models, and related methods. Sage Publications, Inc, Thousand Oaks, CA. Cited by: Background.
  • [14] Z. Ghahramani and M. I. Jordan (1994) Supervised learning from incomplete data via an em approach. In Advances in neural information processing systems, pp. 120–127. Cited by: Background.
  • [15] R. Kievit, W. E. Frankenhuis, L. Waldorp, and D. Borsboom (2013) Simpson’s paradox in psychological science: a practical guide. Frontiers in psychology 4, pp. 513. Cited by: Introduction.
  • [16] J. P. Kincaid, J. R. P. Fishburnea, R. L. Rogers, and B. S. Chissom (1975) Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel. Technical report U.S. Navy. Cited by: Data.
  • [17] W. H. Kruskal and W. A. Wallis (1952) Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association 47 (260), pp. 583–621. External Links: Document Cited by: Prediction Performance.
  • [18] K. Lerman, M. Arora, L. Gallegos, P. Kumaraguru, and D. Garcia (2016) Emotions, demographics and sociability in twitter interactions. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 10. Cited by: Data.
  • [19] C. Louizos, U. Shalit, J. M. Mooij, D. Sontag, R. Zemel, and M. Welling (2017) Causal effect inference with deep latent-variable models. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 6446–6456. External Links: Link Cited by: Background.
  • [20] N. Manwani and P. Sastry (2015) K-plane regression. Information Sciences 292, pp. 39–56. Cited by: Background.
  • [21] J.H. McDonald (2014) Handbook of biological statistics. 3rd edition, Sparky House Publishing, Baltimore. Cited by: Prediction Performance.
  • [22] R. A. McLean, W. L. Sanders, and W. W. Stroup (1991) A unified approach to mixed linear models. The American Statistician 45 (1), pp. 54–64. External Links: Document Cited by: Background.
  • [23] I. Minchev, G. Matijevic, D. Hogg, G. Guiglion, M. Steinmetz, F. Anders, C. Chiappini, M. Martig, A. Queiroz, and C. Scannapieco (2019) Yule-simpson’s paradox in galactic archaeology. arXiv preprint arXiv:1902.01421. Cited by: Introduction.
  • [24] H. Pashler and E. Wagenmakers (2012) Editors’ introduction to the special section on replicability in psychological science: a crisis of confidence?. Perspectives on Psychological Science 7 (6), pp. 528–530. External Links: Document Cited by: DoGR Method.
  • [25] R. Paternoster, R. Brame, P. Mazerolle, and A. Piquero (1998) Using the correct statistical test for the equality of regression coefficients. Criminology 36 (4), pp. 859–866. Cited by: Qualitative Results.
  • [26] R. Ranganath and A. J. Perotte (2018) Multiple causal inference with latent confounding. ArXiv preprint: 1805.08273. Cited by: Background.
  • [27] P. E. Smaldino and R. McElreath (2016) The natural selection of bad science. Royal Society Open Science 3 (9), pp. 160384. External Links: Document, Link, https://royalsocietypublishing.org/doi/pdf/10.1098/rsos.160384 Cited by: DoGR Method.
  • [28] H. Späth (1979) Algorithm 39 clusterwise linear regression. Computing 22 (4), pp. 367–373. Cited by: Background.
  • [29] H. G. Sung (2004) Gaussian mixture regression and classification. Ph.D. Thesis, Rice University. Cited by: Background, Quantitative Results.
  • [30] Y. Wang and D. M. Blei (2018) The blessings of multiple causes. ArXiv preprint: 1805.06826. Cited by: Background.
  • [31] A. B. Warriner, V. Kuperman, and M. Brysbaert (2013) Norms of valence, arousal, and dominance for 13,915 english lemmas. Behavior research methods 45 (4), pp. 1191–1207. Cited by: Data.
  • [32] B. Winter (2013) A very basic tutorial for performing linear mixed effects analyses. arXiv preprint arXiv:1308.5499. Cited by: Introduction, Background.