## 1 Introduction

Consider a collection of citizens in a voting precinct. Each voter cast their votes for candidates and issues, and the votes are aggregated together into the precinct-level vote. These data may be analyzed for many purposes: forecasting, campaign targeting, developing community programs, and understanding the composition of the electorate. For all of these applications, it is useful to identify voting cohorts—groups of people that vote similarly and often share population demographics such as gender, socioeconomic status, or race. Then one may study these voter cohorts in terms of voter cohort prevalence within each precinct and how the cohort voted across precincts.

Several model families exist to decompose observations such as these into patterns that can be interpreted as cohorts. However, no existing model family captures the notion that the distribution of a voting cohort may vary locally within precincts. For example, middle-class voters in a precinct with a recent incidence of gun violence might systematically vote more in favor of gun control than the middle-class cohort counterparts in other precincts. In other words, it is important to know how voters of a given cohort voted *within a specific precinct*, and, in turn, how the precinct-specific cohort votes differ from the global cohort votes.

In this paper, we introduce *deconvolution models*, a new class of mixed membership models with observation-specific mixture distributions. This model family is designed for data with *convolved* observations such as ballot outcomes—each observation (e.g., outcomes for a voting precinct) is composed of many particles (e.g., voters casting their votes); these particles are observed in aggregate, or convovled together into an observation with some number of features (e.g., measures on a ballot). This same structure exists in data from many disciplines, including politics, sports, finance, and biology, as shown in Table 1. The goal of deconvolution models is to explicitly model observation-specific deviations from the global factor distributions.

The term *deconvolution*

is used in many settings to denote similar notions of decomposing information, and often carries specific technical connotations in different contexts. In signal processing, it is used to refer to two conceptually distinct processes; its use in density deconvolution and convolutional neural networks further adds to the confusion. In all of these settings, the core meaning behind the term is the same: some value is a convolution, or blending, of component parts; deconvolution involves estimating these unknown components. Moving forward, we use the term

*deconvolution*to refer to estimates of both the local (observation-specific) and global factor distributions and the proportions of those factors represented in each observation; we build models to allow us to estimate those three components essential to deconvolution.

This paper is organized as follows. We first place this work in the context of related models in Section 2.
In Section 3, we formally define the nonparametric deconvolution model (NDM) family and describe several instances of this family in Section 3.4.
Then, we describe an inference algorithm^{1}^{1}1Open source software for our inference algorithm is available at https://github.com/ajbc/ndms. for estimating the posterior of these models in Section 4. We explore the resulting approximations and compare results with related methods on simulated data and California voting data in Section 5. We conclude with a discussion of the advantages, limitations, and potential extensions of this model family in Section 6.

General | Voting | Bulk RNA-seq | fMRI | Baseball |
---|---|---|---|---|

observation | precinct votes | sample expression levels | image | pitcher |

feature | issue or candidate | gene | voxel | pitch type and outcome |

particle | individual voter | one cell | one neuron |
one pitch |

factor | voting cohort | cell type | response pattern | pitching strategy |

## 2 Related Work on Statistical Deconvolution

Many disciplines rely on the analysis of high-dimensional heterogeneous data; latent variable models are well-suited to expose hidden patterns in these data. The simplest form of the latent variable model is a *mixture model* (Figure 1, top), which assign each observation to one of clusters. For each cluster

, there is an associated probability distribution on each feature—when an observation is assigned to cluster

, it is assumed to be drawn from the corresponding cluster distribution. Whether the observations have hard (i.e., single cluster) assignments or soft (i.e., probabilistic) assignments, the generative model assumes that each observation comes from only a single cluster distribution. Global membership then captures the proportion of observations assigned to each cluster.More complex latent factor models build on this structure, relying on similar notions of global membership and factor-specific feature distributions. For example, *admixture models* are the simplest version of a mixed-membership model. In admixture models, each observation is generated from a convex combination (i.e., a weighted sum where the weights are non-negative and sum to one) of the latent factor distributions (Pritchard et al., 2000); this is in contrast to the observations being generated from a single factor’s distribution, as in the mixture model framework. These latent factor distributions, as in the mixture model, are shared across all observations. Latent Dirichlet allocation (LDA; Blei et al., 2003) is a well-known instance of this model family where the global membership variables have Dirichlet priors. The hierarchical Dirichlet process (HDP; Teh et al., 2006) extends LDA with a Bayesian nonparametric prior that enables support over an infinite number of latent factors and the ability to share feature distributions across multiple, nested collections of data.

Admixture models are a subset of the broader class of *decomposition models* (Figure 1, middle), which represent local factor membership as a mixture of global factors.
Matrix factorization is a popular model family and comes in many varieties, including non-negative matrix factorization (NMF; Lee and Seung, 2001);^{2}^{2}2

Note that the original NMF construction is not a probabilistic generative model and no likelihood or associated posterior distribution is available; this posterior distribution is important when interested in variance.

Gamma-Poisson matrix factorization (GaP; Canny, 2004), and Gaussian probabilistic matrix factorization (PMF; Salakhutdinov and Mnih, 2007). Other examples of decomposition models include principal component analysis

(PCA; Hotelling, 1933; Jolliffe, 1986; Tipping and Bishop, 1999), factor analysis (FA; Harman, 1960; Jolliffe, 1986), and their sparse variants (Zou et al., 2006; Engelhardt and Stephens, 2010, respectively).We use the term *deconvolution models* to distinguish a new class of mixed membership models with observation-specific factor distributions across features. While *deconvolution* is used to refer to a variety of concepts, we use it here to refer to a family of mixed membership models that include local (i.e., observation-specific) factor feature distributions (Figure 1, bottom). Specifically, deconvolution models draw on decomposition models for the notion of group-specific distributions of membership and global factor features shared among all observations. But unlike these models, deconvolution introduces observation-specific (or local) factor feature distributions to capture real-world variation in the factor feature distributions.
This model structure is most advantageous in the context of *convolved admixed data*, where observations are collections of heterogeneous particles that have been averaged or otherwise convolved together to be observed as a single unit; real-world examples of convolved data are shown in Table 1. The objective of a deconvolution model is to reverse this convolution process in order to both estimate the factor proportions of the underlying particles in each observation, as well as to estimate the feature values of all particles assigned to a specific factor within an observation; Section 3 will describe this model family in greater detail.

## 3 Nonparametric Deconvolution Models

In this section, we formally specify the family of nonparametric deconvolution models (NDMs). We begin by describing the parametric variant of this model family, which includes a fixed number of latent factors (Section 3.1). The nonparametric version of this model family (which estimates ) is based on the hierarchical Dirichlet process (HDP, Teh et al., 2006), a Bayesian nonparametric admixture model. We will review the normalized gamma process construction of the HDP (Paisley et al., 2012) in Section 3.2. Then, we will introduce the NDM family (Section 3.3) and describe several instances of this family (Section 3.4).

### 3.1 Parametric Deconvolution Models

The parametric variant of the proposed deconvolution model family requires a fixed number of latent factors . Each factor is present in the global population with proportion ; a randomly chosen particle (e.g., an individual voter) has probability of being associated with factor (e.g., one voting cohort). We assume that these global factor proportions are drawn from a Dirichlet distribution parameterized by ,

(1) |

Similarly, we assume each of the convolved observations has observation-specific (or local) proportions , where represents the probability that a random particle from observation (e.g., a voter from Alameda County) will be associated with factor . As with the global proportions, we assume these local proportions are drawn from a Dirichlet distribution; the distribution is parameterized using the global proportions

scaled by hyperparameter

,(2) |

To describe the feature distribution of each latent factor , we use a combination of two parameters at the global level: mean and covariance matrix . Each global mean represents the average value of each of features over all particles from factor ; the covariance matrix represents the covariance of these features among particles from factor . The latent factor feature parameters are the set of these two parameters, . We assume that the global mean for factor and feature

is drawn from a normal distribution,

(3) |

and that covariance matrix is drawn from an inverse Wishart distribution,

(4) |

In an ideal world, we would know the number of particles associated with observation . If we were given this information, we could model the assignments of each particle to the factors,

(5) |

and then draw the local particle-specific features from the global features associated with its assigned factor ,

(6) |

In practice, however, we do not need to infer the values for the particle assignments and particle-specific features . Instead of modeling the features of each particle with , we model the average of these particles for a given factor, or

(7) |

Thus, just as we model latent factor proportions at the local, or observation-specific, level with , we use variables to describe the latent feature values for all the particles in observation associated with factor . Using the fact that the sum of normally-distributed variables is also normally-distributed, we assume these averaged local features are drawn from -dimensional multivariate normal distributions,

(8) |

This construction allows us to study local features without needing to infer particle assignments or particle-specific feature values . These variables capture observation-specific deviations from the global factor distributions; they allow us to answer questions such as “how do voters from a specific cohort vote in *this particular precinct*?”

Practically, we still need the number of particles for observation , which is often not available, but an approximation. Thus, we explicitly model the number of particles when needed, or

(9) |

Parameter can be set according to a rough estimate of .

To complete the parametric model specification, we provide a framework for generating observations

for observation (e.g., a voting precinct) and feature (e.g., an issue or candidate). We combine the local features of observation over all latent factors, then we pass this weighted average through a link function and use this result to parameterize some distribution . Broken down for a single feature , we generate an observation(10) |

Though the generative processes for a nonparametric deconvolution model (Section 3.3) is different from this parametric generative process, they both correspond to the same graphical model (Figure 3), except the parametric version uses latent factors instead of an infinite number.

While we will discuss instances of the full model family in detail within Section 3.4, we will briefly preview some example model instances here. As with generalized linear models (GLMs), if we choose the distribution to be Gaussian, the link function is most naturally the identity function. Similarly, if we think the observations

are best modeled using a Poisson distribution for

, the canonical link function would be an exponential in order to transform the combination of local Gaussian features to be positive real values to parameterize the Poisson distribution.Some model families may require additional feature-specific hyperparameters—for example, when is Gaussian, we may want feature-specific variances. We refer to feature-specific hyperparameters as for feature ; in practice we set them to be relatively small values so that the majority of variance is captured by the estimated parameters. If additional hyperparameters are needed, we may adjust Equation 10 to include them:

(11) |

### 3.2 Background: construction of the HDP

We will now extend this preliminary parametric decomposition model family to its full nonparametric form. We first review the hierarchical Dirichlet process (HDP), upon which we base the nonparametric version of the deconvolution model family.

The HDP (Teh et al., 2006) is constructed using two layers of Dirichlet processes (DPs). This hierarchical process defines a global random probability measure and a set of random probability measures , one for each group . In our case, these measures specify both factor proportions and feature distributions. The global measure is distributed as a Dirichlet process with concentration parameter and base probability measure :

(12) |

The random measures are also distributed as Dirichlet processes with base measure :

(13) |

A hierarchical Dirichlet process can be used as the prior distribution over the factors for grouped data, allowing us to define hierarchical Dirichlet process mixture models. A single Dirichlet process can be constructed in several equivalent ways, the choice of construction has implications for inference.

Following Paisley et al. (2012), we use two different representations of the DP—one for each layer. The top-level DP uses the Sethuraman (1994) stick-breaking representation of the DP. As its name suggests, we imagine that some population may be partitioned into its component parts much the way one would break a stick into pieces. Formally, we represent global proportions as ; this could describe, for example, how many middle class people there are as a percentage of all voters. The Sethuraman generative process draws the unnormalized variant of these proportions

from a beta distribution:

(14) |

The hyperparameter is called the *concentration parameter* and controls the distribution over the proportions, with higher values leading to smaller numbers of larger clusters and values closer to leading to larger numbers of smaller clusters in expectation.
These proportions are normalized relative to all previous proportions,

(15) |

which gives us the stick breaking analogy: starting with a stick of length , after partitioning some portion of the population into factors (breaking the stick times), for the remainder of the population (or stick), what proportion should I assign to factor (or how much of the stick should I break off)?

With as our global proportions for factor (e.g., what proportion of voters belong to voting cohort ?), we represent the distribution of features associated with factor as (e.g., how do people in cohort vote across all districts?). We generate these parameters from the base distribution :

(16) |

When the HDP is used for modeling multinomial features, as topic models capture bag-of-words representations of text, the base distribution is usually a symmetric Dirichlet distribution over the feature simplex. For NDMs, will take an alternative form to align with the parametric deconvolution model generative process (Equations 4 and 3).

The second layer of the HDP captures the relationship between the local level and the global level. The local proportions are the analog of the global proportions , with one set of proportions for each observation ; these are the observation-specific factor proportions. In the context of voting precincts, tells us what proportion of precinct is made up of voting cohort . To generate these local proportions, we use a normalized gamma process prepresentation of the DP (Ferguson, 1973; Paisley et al., 2012)

, which begins by generating unnormalized proportions from a gamma distribution,

(17) |

and then normalizes them:

(18) |

These normalized proportions are Dirichlet-distributed because of the relationship between the gamma and Dirichlet distributions Ferguson (1973). Like , hyperparameter is a concentration parameter encoding the distance of the local proportions from the global proportions.

The NDM family uses this same construction but, at this point, the models diverge. HDP mixture models (e.g., Figure 2) continue by using the local factor proportions and the global factor features to construct discrete probability distributions , one for each group of observations :

(19) |

Each is a distribution over the global factors —a draw from produces with probability .^{3}^{3}3Other constructions simply draw an index to factor with probability . Either way, depends on both and —this construction just requires less bookkeeping.
Thus we can use to draw local factor assignments for particle in group ,

(20) |

In the voting example, is the voting cohort assigned to voter of precinct . Notably, the votes of voter are assumed to be observed in the HDP mixture model setting; the observed data (e.g., the the votes cast by voter in precinct ) are then drawn from a distribution parameterized by :

(21) |

as the prior distribution must be conjugate to for the inference algorithm updates to be closed form, is usually multinomial.

The individual factor assignments and observations are marginalized out in the NDM family, as convolved admixed data does not involve observations of individual particles—for instance, we only record votes aggregated at the precinct level for privacy reasons.

### 3.3 NDM Generative Process

As with the HDP construction, the NDM family draws global factor proportions (Equation 14) and normalizes them to (Equation 15, “how much of the population is in voting cohort ?”). We likewise represent global factor feature distributions—“how do people in cohort usually vote?”—but instead of using the general form of (Equation 16), we describe the features for each global factor in terms of its mean (Equation 3) and covariance matrix (Equation 4). This still follows the general HDP framework (Equation 16), and can be viewed as .

At the local level, we also have factor proportions (Equation 17) that are similarly normalized to (Equation 18, “How much of this precinct is in the middle class cohort?”). Deviating from the HDP construction, we draw *local factor features* (Equation 8), which enable us to identify how local feature distributions of each factor deviate from global ones. With the voting example, instead of assuming that the middle class votes the same in every precinct, we can characterize how the middle class votes in each precinct separately—one precinct may vote more socially liberal relative to global patterns.

At the local level, we also draw the number of particles for each observation (Equation 9). Given the local parameters, we then generate our observations just as in the parametric model variant (Equation 10); this completes the NDM generative process (Figure 3).

### 3.4 NDM instances

To apply an NDM, we need to specify the distribution that is used to generate (Equation 10), and the link function to map the combination of the local parameters to the appropriate support of the parameters for . Table 2 outlines example link functions and the corresponding distributions which they support.

Link function | Distributions | |
---|---|---|

identity | x | Normal, log-Normal |

soft-plus | Poisson, Gamma | |

exponential | Poisson, Gamma | |

sigmoid | Beta | |

inverse exponential | Exponential |

The appropriate choice of depends on the nature of the observations. For example, it makes sense to use a Poisson to model voting counts (Section 5.2) or discrete sports data. If one turns those data into percentages, however, it might make more sense to specify

as a beta distribution. The gamma, log-normal, and exponential distributions are natural choices for positive real-valued data. In all cases, the exact choice should be made based on domain knowledge of the underlying processes.

## 4 Inference

In this section, we parallel the structure of Section 3 by beginning with an inference algorithm for the parametric variant of the deconvolution model family (Section 4.1). We then design split and merge operations to construct the inference algorithm for the full nonparametric deconvolution model family (Section 4.2). We have released open source software for our model and inference methods at https://github.com/ajbc/ndm.

### 4.1 Inference for Parametric Deconvolution Models

Our central computational problem is inference: given the observed data , how do we determine the best values for the latent parameters in our model? In particular, inference involves estimating the latent variables and parameters—the global proportions , local proportions , global feature means and covariances , local features , and local counts . As the true posterior for our model is intractable to compute, we approach this problem with variational inference (Blei et al., 2017; Wainwright and Jordan, 2008).

Variational inference finds a candidate approximation from a family of densities that is close the true posterior distribution by finding the that minimizes the KL divergence from the approximation to the posterior . Using standard convexity arguments, this is equivalent to maximizing the evidence lower bound (ELBO), which is also called the variational objective:

(22) |

Here we define the family of approximating distributions using the mean field assumption:

(23) |

where each variable-specific approximation is parameterized by free variational parameters . For the factor proportions, the global proportion approximation is a Dirichlet distribution with a -dimensional free variational parameter vector ; local proportion approximations are also Dirichlet distributions, each with a -dimensional variational parameter vector . For the factor descriptions, the global factor feature mean approximations

are Gaussian distributions with variational parameters

; global factor feature covariance approximations are inverse Wishart distributions with variational parameters . Local factor feature approximations are Gaussian distributions with variational parameters , and local count approximations are Poisson distributions, each with variational parameter .To maximize the ELBO (Equation 22), we need to be able to compute the expectations of the hidden parameters under . The expectations for global factor feature means and covariances have analytic forms, but the remaining parameters () do not. In lieu of analytic estimates for these second set of parameters, we use “black box” variational inference techniques (Ranganath et al., 2015). We construct our inference algorithm (Algorithm 1) by iterating over each of the parameters and latent variables () and updating the corresponding variational parameters according to either analytic or black box estimates; we continue iterating over each parameter until convergence, giving us a coordinate ascent variational inference approach. The remainder of this section will detail the estimation techniques for each parameter and outline the resulting algorithm.

#### Updates for global factor feature means and covariances .

To compute the expectations of the global factor feature means for each factor , we will derive the complete conditional distribution of these parameters, or . These updates are straightforward given the conjugate relationships between the relevant distributions, and we obtain the complete conditional distribution

(24) |

Similarly, we obtain the complete conditional distributions for global covariances :

(25) |

To update the estimates of and , we set the variational parameters and to be the values of their corresponding terms in these complete condition distributions, using the current expectations of all of the parameters in the conditioning set.^{4}^{4}4While we can infer the scale parameter for global factor features , we opt to fix this at a low value and only infer the means; thus our estimates for are nearly point estimates. This speeds up convergence and gives us better estimates of local factor feature variances, which are more important in using the fitted result to answer questions about real-world data.

For example, the degrees of freedom parameter

for covariance matrix is estimated to be(26) |

#### Black box variational inference overview.

To estimate the remaining parameters, we turn to black box variational inference techniques (Ranganath et al., 2015). We will describe this approach using generic latent variable with variational parameter .

Recall that our objective is to maximize the ELBO (Equation 22); black box variational inferences relies on stochastic optimization to do this. We want to approximate the true gradient of the ELBO, which we express as an expectation with respect to the variational distribution. The true gradient with respect to a generic variational parameter is written as

(27) |

This gradient expressions contain a term for the log probability of all terms containing the hidden parameter of interest, or . For example, the log probability for local features is defined as follows:

(28) |

The objective is now to approximate the gradient using samples from the variational distribution:

. Using these samples, we construct the following noisy unbiased estimate of the gradient with respect to variational parameter

:(29) |

Once we have a noisy estimate of the gradient, we can update the corresponding variational parameter in the standard stochastic gradient ascent manner; at iteration this update is

(30) |

where the learning rate meets the Robbins-Monro conditions,

(31) |

As noted by Ranganath et al. (2015), the variance of the black box estimator of the gradient can be large; this poses a challenge to achieving convergence in a reasonable time frame. To control the variance of the gradient, we use approaches suggested by Ranganath et al. (2015)

, including control variates and RMSProp

(Tieleman and Hinton, 2012) (in lieu of AdaGrad (Duchi et al., 2011)).#### Black box updates for the remaining parameters (, , , ).

Following the black box variational inference framework, updates for the remaining parameters are straightforward. Section A.1 lists the log probabilities containing only the parameters of interest (e.g., Equation 28), Section A.2 contains the gradients of all log distributions used in inference, and details on how we set learning rates can be found in Section A.3. Readers who are interested in seeing additional details, beyond what is supplied in the appendix, are invited to explore our open-source implementation of the algorithm (https://github.com/ajbc/ndm).

To generalize inference for a wide range of variants in the deconvolution model family, we only need to update the log probability terms (Section A.1; e.g., Equation 28) that contain . or and . Here, we simply update the likelihood term with the distribution and link function for the given model instance, and no other changes are needed for inference.

Both these black box updates and the analytic updates for global factor means and covariances are combined to give us the full parametric inference algorithm (Algorithm 1).

### 4.2 Inference for NDMs

Now that we have established an efficient inference algorithm for the parametric version of deconvolution models, we turn to split and merge procedures to assist us with estimating the latent variables in the nonparametric context. While many split-merge procedures exist (Ueda et al., 1999; Jain and Neal, 2004; Dahl, 2005; Wang and Blei, 2012), Bryant and Sudderth (2012) introduced split and merge procedures for inference in the Hierarchical Dirichlet Process with an online variational inference algorithm; we adapt these procedures for our model.

The core idea of this approach is to treat the parametric algorithm (Algorithm 1) as a batch with a fixed number of factors , with one major exceptions: instead of being a -dimensional vector, the global proportions are instead a -dimensional vector; the last element accounts for the mass of all remaining factors . Once inference has converged at the batch level, factors can be split (creating new ones) or merged (merging redundant pairs). Then, further batches can be run until the number of factors and the associated parameters for those factors converge.

#### Split operation (creating new factors).

The split operation allows splitting a factor into two factors, and . To summarize: given the current variational approximation and corresponding variational parameters , we first initialize the variational parameters for the candidate approximation , taking care to introduce small amounts of random noise so that the new factors can distinguish themselves from each other. Then, we run a single iteration of the batch algorithm (Algorithm 1) to update the new candidate variational parameters . Last, we accept the split candidate approximation if it increases the ELBO (Equation 22), and reject it otherwise.

We initialize the candidate variational parameters for new factors and as follows. Both global and local proportions ( and , respectively) are split between the two new factors, using a rate to determine how the proportions are divided between the two new factors. We set for iteration . For the global proportions , the variational parameters are initialized to

(32) |

For local proportions , for all observations , the variational parameters are initialized to

(33) |

To break the symmetry between the two new factors, we must introduce a small amount of noise for the global factor features ; thus we can initialize the variational parameters for the factor features to

(34) |

where is an -dimensional vector where each element is drawn from a Gaussian distribution, or with small scale

. Alternatively, we can run a simple clustering algorithm (e.g., K-means) with two clusters and treat the expectations of the local factor feature means for factor

, or , as input “data;” this approach performs well in practice and does not require defining a scale hyper-parameter, to which the split operation would be sensitive; thus we use a K-means approach.Given that the symmetry between the factors is broken with global factor features , we can simply carry over the variational parameters for the global factor covariances , giving us straightforward initializations,

(35) |

Local factor features can similarly be copied; for all observations , the variational parameters are initialized as

(36) |

No other variational parameters are impacted by the split operation during initialization; all remaining for the candidate are initialized by copying over their values from the current variational parameters .
Once all the variational parameters have been initialized, we run a single iteration, or “trial iteration,” of the batch algorithm (Algorithm 1, lines 4–23^{5}^{5}5 Global proportions need to be modified to be -dimensional, which impact lines 13, and 21–23. Line 13 need only use the first elements of in computing and is otherwise the same. Lines 21–23 are impacted by updating to be -dimensional; then, sampling on line 22 and using the samples on line 23 are both straightforward. , returning ), which updates each variational parameter exactly once. Now, we can compute the ELBO (, Equation 22) of the candidate approximation and compare it to ; if the ELBO of the split candidate is larger than that of the current approximation , or , then we accept the split candidate approximation and continue the inference algorithm with an additional factor.

When the splitting stage is triggered, all factors that exist at the start of the stage are considered for splitting (ordered randomly). Each of the factors is considered individually during this stage: each factor goes through the split operation as just described. When the splitting stage has completed, the current approximation can have at most factors.

#### Merge operation (removing redundant factors).

The merge operation considers two factors and to combine into a single factor . This procedure is similar to the split operation: first we initialize the variational parameters for the candidate approximation , then we update the variational parameters with a single iteration of the batch algorithm, and accept or reject the merge candidate approximation based on the ELBO.

We initialize the candidate variational parameters for the new factor as follows. Both global and local proportions ( and , respectively) are summed,

(37) |

for all observations . The other global variational parameters are initialized based on weighted averages of the two original factors and (the proportions or being the weights). For global factor feature distribution parameters , we have

(38) |

For global factor feature distribution covariances , the variational parameters are initialized as

(39) |

and

(40) |

Local factor feature values are initialized based on the weighted average of the two original factors; for all observations , we set the variational parameters to

(41) |

and

(42) |

No other variational parameters are impacted by the merge operation during initialization; all remaining for the candidate are initialized by copying over their values from the current variational parameters .
After initializing the variational parameters, we run a single iteration of the batch algorithm (Algorithm 1, lines 4–23^{5}, returning ) and compute the ELBO (Equation 22) of the candidate approximation , or , for comparison with the ELBO of the current , or . If , we accept the merge candidate approximation, setting , and continue inference with factors ( for ).

When the merging operation is triggered, only a subset of factor pairs are considered as merge candidates. For every possible pair of factors and , we compute the covariance of the local proportions over all observations. When this covariance is greater than zero, the pair is added to the merge candidate list. Candidate pairs are ordered based on covariance, with the highest covariance pairs being considered for merging first. We considered identifying candidate pairs based on Euclidean distance of the factor means, but proportion covariance worked better in practice; it additionally has the advantages of being faster to compute and having a natural threshold (greater than zero).

#### Full inference algorithm.

We combine the split and merge operations with the variational inference updates to get the full inference algorithm for nonparametric deconvolution models, shown in Algorithm 2.

## 5 Empirical Results

We evaluated the performance of NDMs trained on simulated data (Section 5.1) and on voting data (Section 5.2). We show that modeling local features leads to improved estimates of parameters and latent variables, and that a fitted NDM captures between-group variability better than existing models. We begin by showing improved latent variable estimates with simulated data (Section 5.1). Then, we turn to addressing variation in demographic voting patterns across voting precincts with data from the 2016 election in California (Section 5.2).

### 5.1 Simulations

A main purposes of applying a deconvolution model to data is to recover information that has been lost during the convolutional (or aggregation) process. We often do not have ground truth observations for each of the components in the convolutional process for applications of interest. Thus, we rely on simulations to provide data where the particles that we wish to recover from the aggregated data are known in order to validate our model. We also compare results from our deconvolution model with results from related models on these simulated data.

#### Simulated Data Description.

We simulated data in four different ways in order to quantify performance for a variety of underlying data-generating processes; Appendix B provides more details on these simulation procedures. Briefly, we generated data similar to the NDM generative process (Section 3), where individual particles are generated from observation-specific means (simulation procedure 1, Section B.1), or individual particles are generated directly from global means (simulation procedure 2, Section B.2). We also modified the NDM generative process to add additional hierarchical complexity to the data by including some number of “modes” from which the particles are drawn; these modes are either associated with local factors (simulation procedure 3, Section B.3) or associated with each global factor (simulation procedure 4, Section B.4).

We simulated data from multiple distributions (and link functions ), including Gaussian (identity link, ), Poisson (soft-plus link, ), beta (sigmoid link, ; see Appendix B), and gamma (soft-plus link). Except where stated otherwise, we simulated data with ten random seeds for each setting and report average performance across the ten seeds; we set the number of factors , the number of observations , and the number of features .

#### Comparison methods.

We focus our comparisons on standard decomposition methods, as these are the most similar family of models to deconvolution models. While we have introduced a nonparametric model family, we restrict our simulated evaluations to the parametric variant because parametric decomposition models are more readily available as comparison methods. In particular, we compare parametric deconvolution models (DM) to
factor analysis (FA; Harman, 1960),
principal component analysis (PCA; Hotelling, 1933),
non-negative matrix factorization (NMF; Lee and Seung, 2001),
Gamma-Poisson matrix factorization (GaP; Canny, 2004), and
Gaussian probabilistic matrix factorization (PMF; Salakhutdinov and Mnih, 2007).
When available, we used the scikit-learn Python library decomposition module Pedregosa et al. (2011) with default parameter settings; PMF and GaP required additional implementations.^{6}^{6}6We relied on the ProbabilisticMatrixFactorization library for PMF (https://github.com/fuhailin/Probabilistic-Matrix-Factorization) and our own implementation of GaP. A framework to run the comparison methods is included in the released software.
Some of the simulated data sets are incompatible with certain comparison methods; for instance, GaP can only be applied to integer data. In these cases, irrelevant comparison methods are omitted.

#### Estimating global factor feature distributions and proportions.

We compared estimates of global factor feature distributions across methods on our simulated data. To do this, we fit DMs and our comparative models to the simulated data and compared point estimates of the global factor distributions to the generated values using normalized root mean square error (NRMSE; we normalize to allow for averaging across data sets) of the estimated global factor feature distributions from the true simulated features , or

(43) |

We find that our approach both recovers good estimates of the local factors and also improves upon the global factor estimates over related methods (Figure 5). This suggests that augmenting the distributions of a deconvolutional model to include local distributions improves the estimates of the global distributions.

We also validated the model estimates of both global and local proportions ( and ) to their known simulated values ( and

). To perform this comparison, we used the same fits of DM and comparison methods as described above and computed cosine similarity, or

(44) |

For global proportions , we averaged cosine similarity across all factors ; for local proportions , we averaged cosine similarity across both observations and factors . In estimating global proportions , DMs outperform all other methods with data simulated using a Gaussian distribution for (real domain); similarly, DMs perform close to the best comparison methods with all other simulated data (Figure 6). For estimating local proportions

Comments

There are no comments yet.