DeepAI

Optimal Representative Sample Weighting

We consider the problem of assigning weights to a set of samples or data records, with the goal of achieving a representative weighting, which happens when certain sample averages of the data are close to prescribed values. We frame the problem of finding representative sample weights as an optimization problem, which in many cases is convex and can be efficiently solved. Our formulation includes as a special case the selection of a fixed number of the samples, with equal weights, i.e., the problem of selecting a smaller representative subset of the samples. While this problem is combinatorial and not convex, heuristic methods based on convex optimization seem to perform very well. We describe rsw, an open-source implementation of the ideas described in this paper, and apply it to a skewed sample of the CDC BRFSS dataset.

• 17 publications
• 7 publications
• 36 publications
03/11/2020

On Degree Sequence Optimization

We consider the problem of finding a subgraph of a given graph which max...
07/25/2014

Dissimilarity-based Sparse Subset Selection

Finding an informative subset of a large collection of data points or mo...
05/22/2021

A note on efficient audit sample selection

Auditing is a widely used method for quality improvement, and many guide...
01/13/2022

Using Survey Data to Obtain More Representative Site Samples for Impact Studies

To improve the generalizability of impact evaluations, recent research h...
05/24/2016

Interaction Screening: Efficient and Sample-Optimal Learning of Ising Models

We consider the problem of learning the underlying graph of an unknown I...
09/29/2015

Tractable Fully Bayesian Inference via Convex Optimization and Optimal Transport Theory

We consider the problem of transforming samples from one continuous sour...
02/26/2015

Representative Selection in Non Metric Datasets

This paper considers the problem of representative selection: choosing a...

1 Introduction

We consider a setting where we have a set of data samples that were not uniformly sampled from a population, or where they were sampled from a different population than the one from which we wish to draw some conclusions. A common approach is to assign weights to the samples, so the resulting weighted distribution is representative

of the population we wish to study. Here representative means that with the weights, certain expected values or probabilities match or are close to known values for the population we wish to study.

A a very simple example, consider a data set where each sample is associated with a person. Our data set is 70% female, whereas we’d like to draw conclusions about a population that is 50% female. A simple solution is to down-weight the female samples, and up-weight the male samples in our data set, so the weighted fraction of females is 50%. As a more sophisticated example, suppose we have multiple groups, for example various combinations of sex, age group, income level, and education, and our goal is to find weights for our samples so the fractions of all these groups matches or approximates known fractions in the population we wish to study. In this case, there will be many possible assignments of weights that match the given fractions, and we need to choose a reasonable one. One approach is to maximize the entropy of the weights, subject to matching the given fractions. This problem generally does not have an analytical solution but it can be solved by various methods, for example, raking or iterative proportional fitting [1, §7], or the methods described in this paper.

We formulate a general method for choosing weights that balances two objectives: making the weighted sample representative, i.e., matching some known fractions or other statistics, and having weights with desirable properties, such as being close to uniform. We pose this as an optimization problem. In many practical cases, the problem is convex, and so can be solved efficiently [2]; in others, the problem can be expressed as a mixed-integer convex optimization problem, which in general is hard to solve exactly, but for which simple and effective heuristics can be developed.

Our method allows for a variety of measures of how representative a weighted data set is, and a variety of other desirable properties for the weights themselves. Simple examples of measures of representativeness include allowing the user to specify ranges for fractions or expected values, instead of insisting that the values are matched precisely. For example, we might specify that the weighted sample should have a fraction of females between 49% and 51%, instead of insisting that it should be exactly 50%. As a simple example of desirable properties for the weights, we can specify minimum and maximum allowable values for the weights, e.g., we can require that the weights range between one half and twice the uniform weighting, so no sample is down- or up-weighted by more than a factor of two.

One particularly interesting constraint we can impose on the weights is that they are all zero, except for a specified number of them, and those must have the same weight, . The weighted data set is then nothing more than a subset of of our original samples, with the usual uniform weights. This means that we are selecting a set of samples from our original data set that is representative. This can be useful when we need to carry out some expensive further testing of some of the samples, and we wish to choose a subset of them that is representative.

In this paper we present a general framework for choosing weights, and a general algorithm for solving the associated optimization problems. Our method, which is based on an operator splitting method called ADMM (alternating directions method of multipliers), solves the problem exactly when it is convex, and is a good heuristic when the problem is not convex. We have implemented the method as an open source software package in Python, called rsw (for representative sample weights). It handles a wide variety of weight selection problems, and scales to very large problems with hundreds of thousands or millions of samples, making it suitable for almost all practical surveys.

Related work.

Survey weighting is a well-studied problem in statistics (see, e.g.[3] for an overview). The simplest technique for survey weighting is post-stratification [4]. When the samples can be divided into a finite number of groups, and we want the fraction of samples in each group to be equal to some given fraction, post-stratification adjusts the weights of each group using a simple formula so that the fraction of samples in the weighted groups match the given fractions [5]. For example, if we had 4 female samples and 6 male samples, and we wanted half to be female, we would give each female sample a weight of 0.125 and each male sample a weight of 0.083. Post-stratification, when applied to multiple attributes, requires the groups to be constructed as each combination of the attributes, which might lead to many groups and be undesirable [1, §7.3].

An alternative to post-stratification, when there are multiple attributes, is to match the marginal distribution for each attribute instead of the full joint distribution, while maximizing the entropy of the sample weights. This problem is exactly the maximum entropy problem described in §

3.1. The most common iterative algorithm used in practice is raking (also known as iterative proportional fitting or rim weighting) [6, 7, 8], which works by cycling through each attribute and performing post-stratification in order to gradually update the sample weights. Raking is the standard weighting method used by many public pollsters, according to the Pew Research Center [9]. It was observed that a version of raking can be seen as coordinate ascent on the dual of the maximum entropy problem, which is guaranteed to converge [10, 11] (see appendix A). Other optimization-based formulations of survey weighting can be found in [12, 13].

There are a number of generalizations to the raking procedure, that are each for different regularizers [10, §3]. Some of these are linear weighting [14] (which uses the Pearson

-divergence between the weights and the prior on the weights as the regularizer), logit weighting, and truncated linear weighting. Each of these has a corresponding generalized raking procedure, yet all of these generalizations are instances of our formulation, and can be easily solved using our ADMM algorithm even when the loss or regularizer is nonsmooth. Another technique is logistic regression weighting, which regresses the probability of sampling each sample on the sample attributes, and uses the reciprocal of the predicted sampling probabilities as weights

[15, 16]. We note that various heuristic techniques have been proposed for constraining the weights in the raking algorithm [17, §3], but, in our formulation, these can be easily expressed as constraints included in the regularizer. To our best knowledge, the representative selection problem, described in §3.2, has not appeared in the literature. The closest problem is representative subset selection [18], which tries to select a set of samples that are as different as possible from one another.

Once sample weights have been found, they can be used in a Horvitz–Thompson estimator to estimate various quantities

[19]. They can also be used in down-stream statistical analysis, e.g.[20]. We discuss these and some other applications for weights in §4. For a discussion of whether weighting is appropriate or not for a particular dataset or goal, see, e.g.[21]. (Our focus is on finding weights, and not whether or not using them is appropriate.)

Another important aspect of survey weighting is non-response weighting (also known as propensity weighting or direct standardization). Non-response weighting is the first part of the weighting process, which adjusts the weight of each sample based on the probability that a sample with those attributes would not respond to the survey [1, §9]. The most common technique is to construct a logistic regression model, which regresses the response probability on the sample attributes [22]. We assume that non-response weighting has already been performed, or that there is no non-response bias in the dataset.

Outline.

In §2 we describe the setting of sample weighting and the optimization problem of finding representative sample weights. In §3 we provide a number of examples of the representative sample weighting problem. In §5 we give a solution method based on ADMM. In §4 we describe a number of applications once sample weights have been computed. In §6 we demonstrate our method on annual survey data from the U.S. CDC. We conclude in §7 with extensions and variations.

Appendix A gives a proof that raking is equivalent to performing block coordinate ascent on the dual variables for a particular instance of the representative sample weighting problem, and appendix B gives a table of desired expectations for the provided examples.

2 Representative sample weighting

In this section we describe the general problem of representative sample weighting.

Samples and weights.

We are given samples or data records . For our purposes the feature values and the feature set do not matter; they will enter our problem in a very specific way, described below. We will assign a weight to each sample , , where and

is the all-ones vector. The sample weights induce a distribution on

, with , . Our goal is to choose these sample weights.

Expected values of functions.

We are given real-valued functions , . We let denote the expected values of these functions under the induced distribution,

 fj=n∑i=1wiFj(xi),j=1,…,m.

We express this as , where is the matrix with entries . In the sequel we will only need this matrix ; we will not otherwise need the particular values of or the functions .

Probabilities.

We note that when the function is the indicator function of a set , is the probability of under the induced distribution. As an elementary example of this, could be the set of examples that are female (which is encoded in ), in which case is the probability, under the weights , that the sample is female.

As an extension of this, suppose are the indicator functions of the sets , and each example is in exactly one of these sets (i.e., they are a partition). In this case is a distribution on , i.e., (elementwise) and . As an elementary example, these could be a partition of samples into female and male, and a number of age groups; is then the probability under the weighted distribution that an example is in group , which is some combination of sex and age group.

Desired expected values and loss function.

Our task is to choose the weights so that this induced distribution is representative, which means that the expected values of the functions of are near some target or desired values, denoted

. To get a numerical measure of how representative the induced distribution is, we introduce a loss function

, where measures our dissatisfaction with a particular induced expected value and desired expected value . We use infinite values of to denote constraints on and . For example, to constrain , we let for , and for . Other common losses include the sum squares loss , or the loss .

When is a distribution on (i.e., for all with , , we have and ), a reasonable loss is the Kullback-Liebler (KL) divergence from , a desired target distribution,

 ℓ(f,fdes)=m∑i=1filog(fi/fdesi).

Here we assume all entries of together form a distribution; if there are multiple groups of entries in that form distributions we can use the sum of the KL divergences from their associated target distributions. As an example and special case suppose is the probability that , where are sets (not necessarily a partition), for which we have target values . Then is the probability that . A reasonable loss is the sum of the KL divergence from these two probabilities and their associated desired probabilities, i.e.,

 ℓ(f,fdes)=m∑i=1(filog(fi/fdesi)+(1−fi)log((1−fi)/(1−fdesi))).

Regularizer.

We also have preferences and requirements on the choice of weights, beyond and , unrelated to the data samples and functions. We express these using a regularization function , where we prefer smaller values of . Again, we use infinite values of to denote constraints on . For example, to constrain for , we let whenever .

An important example of a regularizer is the negative entropy, with

 r(w)=n∑i=1wilogwi,

which is the same as the KL divergence from the distribution given by

and the uniform distribution

on . This regularizer expresses our desire that the weights have high entropy, or small KL deviation from uniform. A simple extension of this regularizer is the KL deviation from a (non-uniform) target distribution .

Regularizers can be combined, as in

 r(w)={∑ni=1wilogwi(1/(κn))1≤w≤(κ/n)1∞otherwise, (1)

where is a given hyper-parameter. This regularizer combines negative entropy with the constraint that no sample is up- or down-weighted by a factor more than .

Representative sample weighting problem.

We propose to choose weights as a solution of the representative sample weighting problem

 minimizeℓ(f,fdes)+λr(w)subject tof=Fw,w≥0,1Tw=1, (2)

with variables and , and positive hyper-parameter . The objective is a weighted sum of the loss, which measures how unrepresentative the induced distribution is, and the regularization, which measures our displeasure with the set of weights. The hyper-parameter is used to control the trade off between these two objectives.

The representative sample weighting problem (2) is specified by the loss function and desired expected values , the regularizer , and the matrix which gives the function values .

Convexity.

When is convex in its first argument and is convex, problem (2) is a convex optimization problem, and so can be efficiently solved [2]. Many interesting and useful instances of the representative sample weighting problem are convex; a few are not. The cases when the problem is not convex can be solved using global optimization methods; we will also recommend some simple but effective heuristics for solving such problems approximately.

3 Examples

In this section we give two examples of the representative sample weighting problem.

3.1 Maximum entropy weighting

The first case we consider is where we want to find the maximum entropy sample weights that exactly match the desired expected values. This is a representative sample weighting problem, with the loss and regularizer

 ℓ(f,fdes)={0f=fdes,+∞otherwise,r(w)=n∑i=1wilogwi.

Since both the loss and regularizer are convex, problem (2) is a convex optimization problem. We call the (unique) solution to problem (2) with this loss and regularizer the maximum entropy weighting. The maximum entropy weighting is the most random or evenly spread out weighting that exactly matches the desired expected values.

The maximum entropy weighting problem is convex and readily solved by many methods. The typical complexity of such methods is order , i.e., linear in the number of samples and quadratic in the number of functions . (See [2, §11.5].) This complexity assumes that the matrix is treated as dense; by exploiting sparsity in , the complexity of solving the problem can be lower.

Simple example.

As a very simple example, suppose and is the indicator function of the sample being female, so is the probability of being female under the weighted distribution, and we take , which means we seek weights for which the probability of female is 0.5. (This problem has an elementary and obvious solution, where we weight all female samples the same way, and all non-female samples the same way. But our interest is in more complex problems that do not have simple analytical solutions.)

Variations.

We have already mentioned several variations, such as adding bounds on , or replacing the negative entropy with the KL divergence from a target weighting. As another extension, we can modify the loss to allow the expected function values to be close to, but not exactly, their desired values. As a simple version of this, we can require that , where are given lower and upper limits (presumably with ). This corresponds to the loss function

 ℓ(f,fdes)={0fmin≤f≤fmax,+∞otherwise,

the indicator function of the constraint .

In our simple example above, for example, we might take and . This means that we will accept any weights for which the probability of female is between and . (This problem also has an elementary analytical solution for this example.)

3.2 Representative selection

In the second example, we consider the problem of selecting samples, with equal weights, that are as representative as possible. Thus, our regularizer has the form

 r(w)={0w∈{0,1/k}n,+∞otherwise. (3)

This regularizer is infinite unless we have for , and for , where and . In this case the induced distribution is uniform on for .

With this constraint, it is unlikely that we can achieve exactly, so we use a loss function such as , , or the KL divergence to if is a distribution.

We will refer to this problem as representative selection. Roughly speaking, the goal is to choose a subset of size from the original data points, such that this subset is as representative as possible. The problem is interesting even when we take , in which case the problem is to choose a subset of the original samples, of size , that approximates the means of the functions on the original data set.

Representative selection is a combinatorial optimization problem of selecting

out of original data points, of which there are choices. This problem can be difficult to solve exactly, since you can solve the 0-1 integer problem by solving instances of the representative selection problem [23]. Representative selection can be reformulated as a mixed-integer convex program, for which there exist modern solvers (e.g., ECOS [24], GUROBI [25], and MOSEK [26, §9]) that can often find global solutions of small problems in a reasonable amount of time. For the representative selection problem, we describe an ADMM algorithm in §5 that can quickly find an approximate solution, which appears to have good practical performance and can be used when is very large. We also note that the representative selection problem is similar in nature to the sparse approximation problem, which can be approximately solved by heuristics such as basis or matching pursuit [27].

4 Applications

In this section we describe what to do once sample weights have been computed.

Direct use of weights.

The most common use of sample weights is to use them as weights in each subsequent step of a traditional machine learning or statistical analysis pipeline. Most machine learning and statistical methods are able to incorporate sample weights, by in some way modulating the importance of each sample by its weight in the fitting process. As a result, samples with higher weights affect the ultimate model parameters more than those with lower weights. As a simple example, suppose we have another function

for which we want to compute its mean. Our naïve estimate of the mean would be the sample average, ; using the sample weights, our estimate of the mean is . (This is the aforementioned Horwitz-Thompson estimator.)

As another example, suppose we have additional features of each sample denoted and outcomes and we wish to fit a linear regression model, i.e., for some . With sample weights, the fitting problem becomes

 minimize∑ni=1wi(θTai−bi)2.

Re-sampling.

Another option is to use the sample weights to re-sample the data, and then use the new set of samples for further analysis. That is, we create a new set of samples , using the distribution

 Prob(~xi=xi)=wi,i=1,…,N.

For many machine learning and statistical analysis methods, when becomes very large, re-sampling the data converges to the approach above of using the weights directly. However, re-sampling can be useful when the machine learning or statistical analysis method cannot incorporate weights.

Selection.

The representative selection problem, described in §3.2, can be used to select a subset of the samples that are themselves representative. Representative selection can be useful when we want to select a few of the samples to gather more data on, but it is infeasible (or difficult, or expensive) to gather more data on all of the samples. This can be useful in a two-stage sampling or study, where one first gathers basic (low-cost) data on a large number of samples, and then selects a representative subset of those samples to include in the second (high-cost) study [1, §8].

For future reference we mention that a simple heuristic for representative selection comes directly from weighted sampling, described above. We start by solving the maximum entropy weight selection problem, and then draw samples from the associated weighted distribution. This simple scheme can be used as an initialization for a heuristic method for the problem.

5 Solution method

The problem (2) can be solved by many methods, depending on the properties of the loss and regularizer functions. For example, in the maximum entropy problem, with equality constraints on or with sum square loss, Newton’s method can be effectively used [2, §11]. More complex problems, with nondifferentiable loss and regularizer, can be solved using, for example, interior-point methods, after appropriate transformations [2, §11.6]. It is very easy to express general convex representative sample weighting problems using domain specific languages (DSLs) for convex optimization, such as CVX [28, 29], YALMIP [30], CVXPY [31, 32], Convex.jl [33], and CVXR [12]. Many of these packages are also able to handle mixed-integer convex programs, e.g., the representative selection problem can be expressed in them. These DSLs canonicalize the (human-readable) problem into a convex or mixed-integer convex cone program and call a numerical solver, such as OSQP [34], SCS [35], ECOS [24], MOSEK [24], or GUROBI [25]. However, as we will see in §6, these generic solvers can often be quite slow, taking two orders of magnitude more time to solve the problem compared to the algorithm we describe below. (The advantage of these DSLs is that representative sampling problems can be specified in just a few lines of code.)

In this section we describe a simple and efficient algorithm for solving problems of the form (2), including losses and regularizers that take on infinite values and are nondifferentiable or nonconvex. While it can at times be slower to achieve high accuracy than an interior-point method, high accuracy is not needed in practice in the representative sample weighting problem. Our algorithm uses the alternating direction method of multipliers (ADMM; see [36]). When the loss and regularizer are convex, and the problem is feasible, this algorithm is guaranteed to converge to a globally optimal solution. In the case when the loss or regularizer is not convex, our method is a heuristic that will find an approximate solution; it has been observed that these approximate solutions are often good enough for practical purposes, and that the approximate solutions can be refined or polished using greedy methods [37].

Problem rewriting.

In order to apply ADMM to problem (2), we introduce two new redundant variables, resulting in the problem

 minimizeℓ(f,fdes)+λr(~w)subject tof=Fw~w=w=¯w¯w≥0,1T¯w=1, (4)

with variables and . The problem data is still the data matrix and the desired values . Problem (4) is said to be in graph form, and many algorithms exist that can solve such problems, e.g., ADMM [38] and POGS [39].

In (4) we replicate the variable into three versions, which must all be equal. Each of these versions of handles one aspect of the problem:

is constrained to be a probability distribution;

appears in the regularizer, and appears in the loss. The technique of replicating a variable and insisting that the different versions be consistent is sometimes called consensus optimization [36, §7.1]. By itself, this consensus form is useless. It becomes useful when combined with an operator splitting method that handles the different versions of separately. We now describe one such operator splitting method, ADMM.

Using the splitting and , the corresponding ADMM steps (see, e.g.[40, §4.4]) are

 fk+1 =prox(1/ρ)ℓ(⋅,fdes)(Fwk−yk), (5) ~wk+1 =prox(λ/ρ)r(wk+zk), ¯wk+1 =Π(wk+uk), wk+1 yk+1 =yk+fk+1−Fwk+1, zk+1 =zk+wk+1−~wk+1, uk+1 =uk+wk+1−¯wk+1,

where and are the (scaled) dual variables for the constraints, and is a given penalty parameter. Here, is the proximal operator of the function (see [40, §1.1]) and often has closed-form solutions for many convex (and some nonconvex) functions that appear in practice. The function is the projection onto the probability simplex .

Convergence.

When the problem is convex and feasible, the ADMM steps are guaranteed to converge to a global solution for any penalty parameter . When the problem is not convex, the ADMM steps are not guaranteed to converge to a global solution, or even converge. However, it has been observed that ADMM is often effective at finding approximate solutions to nonconvex problems [37, 41].

Efficient computation of steps.

We note that the first three steps in (5), which update , , and , are independent and can be parallelized, i.e., carried out simultaneously. Moreover, when or is separable or block separable, the proximal operators can be evaluated on each block independently, leading to additional parallelization.

The update for can be expressed as the solution to the linear system

 [2IFTF−I][wk+1η]=[FT(fk+1+yk)+~wk+1+zk+¯wk+1+uk0].

The coefficient matrix in this system of linear equations is quasi-definite [42], so it always has an factorization with diagonal . We solve this linear system by performing a sparse factorization and caching the factorization between iterations. As a result, after the initial factorization, the cost of each subsequent iteration is that of a back-solve step, which can be much smaller than the initial factorization [2, App. C].

There is a large literature on, and many existing libraries for, efficiently computing proximal operators; see, e.g.[38, 43, 44]. We describe here two proximal operators that arise in solving the representative sample weighting problem, and are less well known.

Proximal operator for KL divergence.

The proximal operator for KL divergence can arise both in the loss and regularizer functions. The proximal operator of is given by

 proxλr(v)i=λW(uievi/λ−1/λ),

where is the Lambert- or product log function [45].

Proximal operator for representative selection.

In the case where the regularizer is nonconvex, its proximal operator is generally difficult to find in practice. However, one special case where it is in fact easy to evaluate is in the representative selection problem (3). In this case, the proximal operator of is just the projection of onto the (nonconvex) set

 {w∈{0,1/k}n∣1Tw=1}.

This is easily computed by setting the largest entries of to (with ties broken arbitrarily) and the rest to zero [37, §4.2].

5.1 Implementation

We have implemented the ADMM algorithm described above in an easy-to-use Python package rsw, which is freely available online at

 www.github.com/cvxgrp/rsw.

The current implementation exploits very little of the parallelization possible; a future version will exploit much more.

The package exposes a single method, rsw, which has the signature

 defrsw(df,funs,losses,reg):

It performs representative sample weighting on the pandas DataFrame df, using the functions in the list fun. (To use df directly as , one would pass in fun=None.) We provide a number of common losses and regularizers, and their associated proximal operators. The argument losses specifies the losses, applied in block-separable form; e.g., to express the loss , one would pass in

 losses=[rsw.LeastSquaresLoss(1),rsw.EqualityLoss(2)].

The argument 1 in rsw.LeastSquaresLoss(1) specifies the desired value for . One can also specify the scale of each loss; e.g., for , one would pass in the loss rsw.LeastSquaresLoss(1, scale=.5). The argument reg specifies the regularizer; e.g., to use entropy regularizer, one would pass in reg=rsw.EntropyRegularizer(); for a Boolean regularizer with 5 nonzeros, one would pass in reg=rsw.BooleanRegularizer(5). You can also specify limits on the entries of as in (1), e.g.,

 reg=rsw.EntropyRegularizer(limit=2)

constrains to not be up- or down-weighted from uniform by a factor of , or . At this point, the software package does not support making the loss or regularizer a sum of multiple losses or regularizers. We note however that this can easily be accomodated by adding additional consensus variables, and we plan to implement this in a future version of the software.

Packages.

We use the Python packages numpy [46] for dense linear algebra, scipy [47] for sparse linear algebra, pandas [48] for data manipulation, and qdldl [34, 49] for the sparse quasi-definite factorization.

6 Numerical experiments

We present here an application of the methods described above to a large dataset from the US Center for Disease Control (CDC). We chose examples that are a bit more complex that might be needed, to demonstrate the method’s flexibility and scalability. We note that, in practice, the functions and losses used would likely be much simpler, although they need not be. All experiments were conducted on a single core of an Intel i7-8700K CPU with 64GB of memory, which, at the time of writing, was a standard consumer desktop CPU.

CDC BRFSS dataset.

Each year, the US Center for Disease Control (CDC) conducts health-related telephone surveys of hundreds of thousands of US citizens. (This is the world’s largest telephone survey.) The data is collected as part of the Behavioral Risk Factor Surveillance System (BRFSS), and is freely available online [50]. We consider the 2018 version of this dataset, which contains 437436 responses to 275 survey questions. The data includes a variety of demographic and health-related attributes [51]. We consider the following columns in the dataset:

• State or territory. There are 53 possible values; e.g., NM (New Mexico) or PR (Puerto Rico).

• Age group. The 6 possible values are 18-24, 25-34, 35-44, 45-54, 55-65, and 65+.

• Sex. The 2 possible values are male and female.

• Education level. The 4 possible values are elementary, high school, some college, and college.

• Income. The 8 possible values are 0-10k, 10-15k, 15-20k, 20-25k, 25-35k, 35-50k, 50-75k, and 75k+.

• Reported general health. The 5 possible values are excellent, very good, good, fair, and poor.

Functions.

As our first group of functions, we consider the indicator functions of each possibility of state and age group, of which there are 318. (The expected values of these functions are the probabilities of each state/age group). Our next two functions are the indicator function of the sex being male or female. Our next group of functions are the indicator functions of each possibility of education level and income, of which there are 32. The final group of functions are the indicator functions of each possibility of reported general health, of which there are 5. In total, we have functions. The resulting matrix has over 150 million entries and around 2% of its values are missing. It is sparse, with a density of 3.5%.

Our choice of these particular 357 functions is relatively arbitrary (although they are not unreasonable).

Desired expected values.

In all experiments, we take the averages across the entire dataset as the desired expected values, or

 fdes=(1/n)F1.

We ignore missing entries in the computation. The desired expected values are given in tables that can be found in appendix B.

Constructing a skewed subsample.

We generate a smaller dataset composed of 10000 samples by generating a skewed sample from the dataset. Denote the th column of by . As described above,

contains a one-hot encoding of various things (state and age, sex, education and income, and health). We generate a skewed sampling distribution from the dataset by first sampling

, and then letting

 πi=exp(cTxi)/n∑j=1exp(cTxj),i=1,…,n.

(By construction, and .) Suppose entry of the samples is the indicator function for female samples. If, e.g., , then we are less likely to sample females, since whenever , this subtracts 0.1 from the log probability of sample , giving the female samples a smaller chance of being sampled. On the other hand, if , then samples that are female get 0.1 added to their log probabilities. Next we take 10000 samples without replacement from the columns of using the distribution . This sample is highly skewed; for example, 41.2% of samples are female, whereas in the original dataset 54.8% samples are female. Therefore, we need to do sample weighting to make the sample representative.

6.1 Maximum entropy weighting

We first find the maximum entropy weighting of these 10000 samples such that the induced expected values match the desired expected values, i.e., we solve the problem described in §3.1. We solved the problem using our ADMM implementation, which took around 21 seconds. (Using a generic cone program solver like SCS, the default solver for CVXPY, took 19 minutes.) The resulting entropy of was 8.59, whereas the entropy of the uniform distribution on is 9.21. The histogram of is shown in figure 1, and appears to have a long right tail, meaning that some samples are considerably up-weighted.

To illustrate a use for these weights, we compare the distributions of other columns not included in the procedure to the true distribution across the entire dataset. We did this for the height and weight columns, and in figures 2 and 3

we plot the cumulative distribution function (CDF) of height and weight in the true distribution, and the unweighted and weighted distribution over the subsample. The weighted CDF appears to match the true CDF much better than the unweighted CDF. We computed the Kolmogorov–Smirnov (K-S) test statistic

[52], i.e., the maximum absolute difference between the CDFs, between the unweighted and true CDF and between the weighted and true CDF. For the height attribute, the K-S statistic was 0.078 for the unweighted distribution, and 0.008 for the weighted distribution, which is almost a factor of 10 better. For the weight attribute, the K-S statistic was 0.064 for the unweighted distribution, and 0.009 for the weighted distribution, a factor of more than 7 better.

We remind the reader that our focus here is not on using the weights, but only on how to compute them. We give these results only to show that weighting has indeed improved our estimate of the CDF of unrelated functions of the samples. We also note that a similar improvement in estimating the distribution of height and weight can be obtained with a far simpler representative sample weighting problem, for example one that takes into account only sex and age group.

6.2 Representative selection

Next we consider the representative selection problem, described in §3.2, selecting samples with uniform weights from the samples. Since we can no longer match exactly, as the loss we use the sum of the KL divergence for each of state-age, sex, education-income, and general health. We approximately solved the problem using our ADMM implementation, initialized with a uniform distribution, which took under 12 seconds. The final loss of our weights was 0.21. We compared the loss to the loss of 200 random subsamples, which were each generated by sampling without replacement with the maximum entropy weights; the histogram of these objective values are displayed in figure 4. Our solution has substantially lower loss than even the best random subsample. We also computed the K-S statistic for the CDF of the height and weight columns between the 200 random samples and the true distribution, and between our selection and the true distribution. The histogram of the test statistic, as well as our test statistic, for the height and weight CDFs are displayed in figure 5 and 6; we had a lower test statistic 36.5% of the time for height and 68.5% of the time for weight. So our selection is less representative in terms of height than if we had sampled using the maximum entropy weights, but more representative in terms of weight. (Both of these methods are far superior to selecting samples uniformly at random from the 10000.)

7 Extensions and variations

In this section we discuss extensions and variations.

Missing entries in some data.

Here we describe what to do when some entries of are missing, i.e., , where denotes a missing value. The simplest and often most effective thing to do when there are missing values is to replace them with the desired expected value for that function. That is, suppose one of our functions is the indicator function of a sample being female. If, for one sample, we did not know their sex, we would replace that missing value in with the desired proportion of females. Assuming the loss is separable and homogeneous, which all aforementioned losses are, filling in missing entries this way has the same effect as ignoring the missing entries in the computation of , and scaling the loss by some multiple of one minus the sum of the weights for samples with missing entries of that feature. In our implementation, we fill in missing entries this way.

Conditional probabilities.

Suppose is the indicator function of some set , where , so is the probability that , and is the indicator function of , so is the probability that . Then the quantity is equal to the (conditional) probability that given that . For example, if contains the samples that are over the age of 65, and contains the samples that are female, then is the probability that the sample is over the age of 65 given it is female. If we want the induced probability to be close to some value , we can use the loss

 l(f,fdes)=l0(f1−f2fdes),

where is some convex function. The function is convex because is affine in .

CDFs.

Suppose . We can make the CDF evaluated at by letting be the {0,1} indicator function of the set , . Then would be the desired CDF. A common loss function is the K-S test statistic, or

 l(f,fdes)=∥f−fdes∥∞,

which is convex.

Acknowledgements

The authors would like to thank Trevor Hastie, Timothy Preston, Jeffrey Barratt, and Giana Teresi for discussions about the ideas described in this paper. Shane Barratt is supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1656518.

References

• [1] Thomas Lumley. Complex surveys: a guide to analysis using R, volume 565. John Wiley & Sons, 2011.
• [2] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
• [3] Richard Valliant, Jill Dever, and Frauke Kreuter. Practical tools for designing and weighting survey samples. Springer, 2013.
• [4] Jerzy Neyman. On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, 96(4):558–625, 1934.
• [5] David Holt and Fred Smith. Post stratification. Journal of the Royal Statistical Society: Series A (General), 142(1):33–46, 1979.
• [6] Udny Yule. On the methods of measuring association between two attributes. Journal of the Royal Statistical Society, 75(6):579–652, 1912.
• [7] J Kruithof. Telefoonverkeersrekening. De Ingenieur, 52:15–25, 1937.
• [8] William Deming and Frederick Stephan. On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. The Annals of Mathematical Statistics, 11(4):427–444, 1940.
• [9] Andrew Mercer, Arnold Lau, and Courtney Kennedy. How different weighting methods work.
• [10] Jean-Claude Deville, Carl-Erik Särndal, and Olivier Sautory. Generalized raking procedures in survey sampling. Journal of the American Statistical Association, 88(423):1013–1020, 1993.
• [11] Yee Teh and Max Welling. On improving the efficiency of the iterative proportional fitting procedure. In AIStats, 2003.
• [12] Anqi Fu, Balasubramanian Narasimhan, and Stephen Boyd. CVXR: An R package for disciplined convex optimization. In Journal of Statistical Software, 2019.
• [13] Yiyuan She and Shao Tang. Iterative proportional scaling revisited: a modern optimization perspective. Journal of Computational and Graphical Statistics, 28(1):48–60, 2019.
• [14] Jelke Bethlehem and Wouter Keller. Linear weighting of sample survey data. Journal of Official Statistics, 3(2):141–153, 1987.
• [15] James Lepkowski, Graham Kalton, and Daniel Kasprzyk. Weighting adjustments for partial nonresponse in the 1984 SIPP panel. In Proceedings of the Section on Survey Research Methods, pages 296–301. American Statistical Association Washington, DC, 1989.
• [16] Vincent Iannacchione, Jennifer Milne, and Ralph Folsom. Response probability weight adjustments using logistic regression. In Proceedings of the Survey Research Methods Section, American Statistical Association, volume 20, pages 637–642, 1991.
• [17] Graham Kalton and Ismael Flores-Cervantes. Weighting methods. Journal of Official Statistics, 19(2):81, 2003.
• [18] Michal Daszykowski, Beata Walczak, and Desire Massart. Representative subset selection. Analytica Chimica Acta, 468(1):91–103, 2002.
• [19] Daniel Horvitz and Donovan Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.
• [20] Pierre Lavallée and Jean-François Beaumont. Why we should put some weight on weights. Survey Methods: Insights from the Field (SMIF), 2015.
• [21] Leslie Kish. Weighting for unequal pi. Journal of Official Statistics, 8(2):183–200, 1992.
• [22] James Heckman. The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. In Annals of Economic and Social Measurement, volume 5, pages 475–492. NBER, 1976.
• [23] Richard Karp. Reducibility among combinatorial problems. In Complexity of Computer Computations, pages 85–103. Springer US, Boston, MA, 1972.
• [24] Alexander Domahidi, Eric Chu, and Stephen Boyd. ECOS: An SOCP solver for embedded systems. In 2013 European Control Conference (ECC), pages 3071–3076, Zurich, July 2013. IEEE.
• [25] Gurobi Optimization. GUROBI optimizer reference manual.
• [26] MOSEK ApS. MOSEK modeling cookbook.
• [27] Scott Chen, David Donoho, and Michael Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 2001.
• [28] Michael Grant and Stephen Boyd. Graph implementations for nonsmooth convex programs. In Recent Advances in Learning and Control, Lecture Notes in Control and Information Sciences, pages 95–110. Springer-Verlag Limited, 2008.
• [29] Michael Grant and Stephen Boyd. CVX: Matlab software for disciplined convex programming, version 2.1, 2014.
• [30] Johan Lofberg. YALMIP: A toolbox for modeling and optimization in MATLAB. In IEEE International Conference on Robotics and Automation, pages 284–289. IEEE, 2004.
• [31] Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.
• [32] Akshay Agrawal, Robin Verschueren, Steven Diamond, and Stephen Boyd. A rewriting system for convex optimization problems. Journal of Control and Decision, 5(1):42–60, 2018.
• [33] Madeleine Udell, Karanveer Mohan, David Zeng, Jenny Hong, Steven Diamond, and Stephen Boyd. Convex optimization in Julia. Workshop on High Performance Technical Computing in Dynamic Languages, 2014.
• [34] Bartolomeo Stellato, Goran Banjac, Paul Goulart, Alberto Bemporad, and Stephen Boyd. OSQP: An operator splitting solver for quadratic programs. Mathematical Programming Computation, pages 1–36, 2020.
• [35] Brendan O’Donoghue, Eric Chu, Neal Parikh, and Stephen Boyd. Conic optimization via operator splitting and homogeneous self-dual embedding. Journal of Optimization Theory and Applications, 169(3):1042–1068, 2016.
• [36] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2010.
• [37] Steven Diamond, Reza Takapoui, and Stephen Boyd. A general system for heuristic minimization of convex functions over non-convex sets. Optimization Methods and Software, 33(1):165–193, 2018.
• [38] Neal Parikh and Stephen Boyd. Block splitting for distributed optimization. Mathematical Programming Computation, 6(1):77–102, 2014.
• [39] Christopher Fougner and Stephen Boyd. Parameter selection and preconditioning for a graph form solver. In Emerging Applications of Control and Systems Theory, pages 41–61. Springer, 2018.
• [40] Neal Parikh and Stephen Boyd. Proximal Algorithms. Foundations and Trends® in Optimization, 1(3):127–239, 2014.
• [41] Guillermo Angeris, Jelena Vučković, and Stephen Boyd. Computational bounds for photonic design. ACS Photonics, 6(5):1232–1239, May 2019.
• [42] Robert Vanderbei. Symmetric quasidefinite matrices. SIAM Journal on Optimization, 5(1):100–113, 1995.
• [43] Neal Parikh and Stephen Boyd. proximal Github repository.
• [44] Lorenzo Stella, Niccolo Antonello, and Mattias Falt. ProximalOperators.jl.
• [45] Johann Lambert. Observations variae in mathesin puram. Acta Helvitica, physico-mathematico-anatomico-botanico-medica, 3:128–168, 1758.
• [46] Stéfan van der Walt, Chris Colbert, and Gael Varoquaux. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22, 2011.
• [47] Eric Jones, Travis Oliphant, and Pearu Peterson. SciPy: Open source scientific tools for Python, 2001.
• [48] Wes McKinney. Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference, pages 56 – 61, 2010.
• [49] Bartolomeo Stellato, Goran Banjac, Paul Goulart, Alberto Bemporad, and Stephen Boyd. qdldl: A free LDL factorization routine.
• [50] Center for Disease Control and Prevention (CDC). Behavioral Risk Factor Surveillance System Survey Data, 2018.
• [51] Center for Disease Control and Prevention (CDC). LLCP 2018 codebook report.
• [52] Andrey Kolmogorov. Sulla determinazione empírica di uma legge di distribuzione. 1933.
• [53] Martin Wittenberg. An introduction to maximum entropy and minimum cross-entropy estimation using stata. The Stata Journal: Promoting Communications on Statistics and Stata, 10(3):315–330, September 2010.
• [54] Yvonn Bishop, Stephen Fienberg, and Paul Holland.

Discrete Multivariate Analysis

.
Springer, New York, 2007.
• [55] Paul Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475–494, 2001.

Appendix A Iterative proportional fitting

The connection between iterative proportional fitting, initially proposed by [8] and the maximum entropy weighting problem has long been known and has been explored by many authors [11, 12, 13, 53, 54]. We provide a similar presentation to [13, §2.1], though we show that the iterative proportional fitting algorithm that is commonly implemented is actually a block coordinate descent algorithm on the dual variables, rather than a direct coordinate descent algorithm. Writing this update in terms of the primal variables gives exactly the usual iterative proportional fitting update over the marginal distribution of each property.

Maximum entropy problem.

In particular, we will analyze the application of block coordinate descent on the dual of the problem

 minimize∑ni=1wilogwisubject toFw=fdes1Tw=1,  w≥0, (6)

with variable , where the problem data matrix is Boolean, i.e., . This is just the maximum entropy weighting problem given in §3.1, but in the specific case where is a matrix with Boolean entries.

Selector matrices.

We will assume that we have several possible categories which the user has stratified over, and we will define selector matrices which ‘pick out’ the rows of containing the properties for property . For example, if the first three rows of specify the data entries corresponding to the first property, then would be a matrix such that

 S1F=F1:3,1:n,

and each column of is a unit vector; i.e., a vector whose entries are all zeros except at a single entry, where it is one. This is the same as saying that, for some property , each data point is allowed to be in exactly one of the possible classes. Additionally, since this should be a proper probability distribution, we will also require that , i.e., the desired marginal distribution for property should itself sum to 1.

Dual problem.

To show that iterative proportional fitting is equivalent to block coordinate ascent, we first formulate the dual problem [2, Ch. 5]. The Lagrangian of (6) is

where is the dual variable for the first constraint and is the dual variable for the normalization constraint. Note that we do not need to include the nonnegativity constraint on , since the domain of is .

The dual function [2, §5.1.2] is given by

 g(ν,λ)=infwL(w,ν,λ), (7)

which is easily computed using the Fenchel conjugate of the negative entropy [2, §3.3.1]:

 g(ν,λ)=−1Texp(−(1+λ)1−FTν)−νTfdes−λ, (8)

where of a vector is interpreted componentwise. Note that the optimal weights are exactly those given by

 w⋆=exp(−(1+λ)1−FTν). (9)

Strong duality.

Because of strong duality, the maximum of the dual function (7) has the same value as the optimal value of the original problem (6[2, §5.2.3]. Because of this, it suffices to find an optimal pair of dual variables, and , which can then be used to find an optimal , via (9).

To do this, first partially maximize with respect to , i.e.,

 gp(ν)=supλg(ν,λ).

We can find the minimum by differentiating (8) with respect to and setting the result to zero. This gives

 1+λ⋆=log(1Texp(−FTν)),

while

 gp(ν)=−log(1Texp(−FTν))−νTfdes.

This also implies that, after using the optimal in (9),

 w⋆=exp(−FTν)1Texp(−FTν). (10)

Block coordinate ascent.

In order to maximize the dual function , we will use the simple method of block coordinate ascent with respect to the dual variables corresponding to the constraints of each of the possible categories. Equivalently, we will consider updates of the form

 νt+1=νt+STtξt,t=1,…,T,

where is the dual variable at iteration , while is the optimization variable we consider at iteration . To simplify notation, we have used to refer to the selector matrix at iteration , if , and otherwise set ; i.e., we choose the selector matrices in a round robin fashion. The updates result in an ascent algorithm, which is guaranteed to converge to the global optimum since is a smooth concave function [55].

Block coordinate update.

In order to apply the update rule to , we first work out the optimal steps defined as

 ξt=argmaxξ gp(νt+STtξ).

To do this, set the gradient of to zero,

 ∇ξ gp(νt+STtξ)=0,

which implies that

 ∑ni=1(Stfi)exp(−fTiνt−fTiSTtξ)n∑i=1exp(−fTiνt−fTiSTtξ)=Sfdes, (11)

where is the th column of and the division is understood to be elementwise.

To simplify this expression, note that, for any unit basis vector (i.e., for some and 0, otherwise), we have the simple equality,

 xexp(xTξ)=x∘exp(ξ),

where indicates the elementwise product of two vectors. Using this result with on each term of the numerator from the left hand side of (11) gives

 n∑i=1(Stfi)exp(−fTiνt−fTiSTtξ)=exp(−ξ)∘(n∑i=1exp(−fTiνt)Stfi)=exp(−ξ)∘y,

where we have defined for notational convenience. We can then rewrite (11) in terms of by multiplying the denominator on both sides of the expression:

 exp(−ξ)∘y=(exp(−ξ)Ty)Stfdes,

which implies that

 y∘exp(−ξ)1T(y∘exp(−ξ))=Stfdes.

Since , then

 y∘exp(−ξ)=Stfdes,

or, after solving for ,

 ξ=−log(diag(y)−1Stfdes),

where the logarithm is taken elementwise. The resulting block coordinate ascent update can be written as

 νt+1=νt−STtlog(Stfdes∑ni=1exp(−fTiνt)Stfi), (12)

where the logarithm and division are performed elementwise. This update can be interpreted as changing in the entries corresponding to the constraints given by property by adding the log difference between the desired distribution and the (unnormalized) marginal distribution for this property suggested by the previous update. This follows from (10), which implies for each , where is the distribution suggested by at iteration .

Resulting update over w.

We can rewrite the update for the dual variables as a multiplicative update for the primal variable , which is exactly the update given by the iterative proportional fitting algorithm. More specifically, from (10),

 wt+1i=exp(−fTiνt+1)∑ni=1exp(−fTiνt+1).

For notational convenience, we will write , which is a unit vector denoting the category to which data point belongs to, for property . Plugging update (12) gives, after some algebra,

 exp(−fTiνt+1)=exp(−fTiνt)exp(xTtilog(Stfdes))exp(xTtilog(∑nj=1exp(−fTjνt)xtj)).

Since is a unit vector, then for any vector , so

 exp(−fTiνt+1)=exp(−fTiνt)xTtiStfdes∑nj=1exp(−fTjνt)xTtixtj.

Finally, using (10) with gives

 wt+1i=wtixTtiStfdes∑nj=1wtj(xTtixtj),

which is exactly the multiplicative update of the iterative proportional fitting algorithm, performed for property .