Network Flow Methods for the Minimum Covariates Imbalance Problem

07/14/2020 ∙ by Dorit S. Hochbaum, et al. ∙ berkeley college 0

The problem of balancing covariates arises in observational studies where one is given a group of control samples and another group, disjoint from the control group, of treatment samples. Each sample, in either group, has several observed nominal covariates. The values, or categories, of each covariate partition the treatment and control samples to a number of subsets referred to as levels where the samples at every level share the same covariate value. We address here a problem of selecting a subset of the control group so as to balance, to the best extent possible, the sizes of the levels between the treatment group and the selected subset of control group, the min-imbalance problem. It is proved here that the min-imbalance problem, on two covariates, is solved efficiently with network flow techniques. We present an integer programming formulation of the problem where the constraint matrix is totally unimodular, implying that the linear programming relaxation to the problem has all basic solutions, and in particular the optimal solution, integral. This integer programming formulation is linked to a minimum cost network flow problem which is solvable in O(n· (n' + nlog n)) steps, for n the size of the treatment group and n' the size of the control group. A more efficient algorithm is further devised based on an alternative, maximum flow, formulation of the two-covariate min-imbalance problem, that runs in O(n'^3/2log^2n) steps.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The problem of balancing covariates arises in observational studies where one is given a group of control samples and another group, disjoint from the control group, of treatment samples. Each sample, in either group, has several observed nominal covariates. The values, or categories, of each covariate partition the treatment and control samples to a number of subsets referred to as levels where the samples at every level share the same covariate value.

When estimating causal effects using observational data, it is desirable to replicate a randomized experiment as closely as possible by obtaining treatment and control groups with similar covariate distributions. This goal can often be achieved by choosing well-matched samples of the original treatment and control groups, thereby reducing bias due to the covariates. The matching is typically one-to-one.

The bias due to the covariates can be completely removed if for each matched pair, the two samples belong to the same levels over all covariates. Satisfying this requirement, however, typically results in a very small selection from the treatment and control group, which is not desirable. To address this Rosenbaum et al. [1] introduced a relaxation requiring instead to match all treatment samples to a subset of the control samples so that the number of control and treatment samples in each level of each covariate are the same. This requirement is known in the literature as fine balance.

It is not always feasible to find a selection of the control samples that satisfies the fine balance requirement. To that end several papers considered the goal of minimizing the violation of this requirement, which we refer to as imbalance. Yang et al. [2] considered finding an optimal matching between the treatment samples and a selection of the control samples where that selection minimizes an imbalance objective, to be explained below. Bennett et al. [3] considered directly the minimum imbalance problem that finds a selection which minimizes the violation of the fine balance requirement.

Our focus here is the minimum imbalance problem, also referred to as the min-imbalance problem. The min-imbalance problem is trivial for a single covariate and NP-hard for three or more covariates, as discussed next. For the case of two covariates, we introduce here efficient network flow algorithms that solve the problem in polynomial time.

Let be the number of covariates to be balanced. To introduce the min-imbalance problem, consider first the case of a single covariate, , that partitions the control and treatment groups into, say, levels each. Let the sizes of the levels in the treatment group be and the sizes of the levels in the control group be . The min-imbalance problem is to select a subset of the control group, called selection, such that the sizes of the levels in the selection, approximate as closely as possible the sizes . The solution to the single covariate min-imbalance problem is trivial: the optimal selection is any samples out of the level control samples if ; otherwise, select all level control samples. The value of the objective function corresponding to this selection is then , which is the optimal value for the single covariate min-imbalance problem.

For the case of multiple covariates, covariate partitions both treatment and control groups into levels each. Denote the sizes of the levels in the treatment group under covariate by , and let the partition of the control group under covariate be of sizes . The min-imbalance problem is to find a selection , subset of the control group, that minimizes the objective function: .

Let the number of treatment samples be and the number of control samples be . In applications of the min-imbalance problem it is often the case that the size of selection is required to be , the size of the treatment group. This is assuming that . Here we focus the presentation on the case where the size of is required to be . Yet, all our methods generalize to the min-imbalance problem for any specified size , , of the selection, as shown in Appendix B.

The min-imbalance problem was recently studied by Bennett et al. [3] where a mixed integer programming (MIP) formulation of the problem was presented, with the size of the selection equal to , the size of the treatment group. They showed that the corresponding linear programming (LP) relaxation yields an integer solution when the number of covariates . Consequently, Bennett et al. [3] established that the min-imbalance problem on two covariates is solved in polynomial time via linear programming. For they presented an example where the LP relaxation’s solution is not integral. It is further noted in [3], that the min-imbalance problem for is NP-hard. However, no proof or reference is provided. Among other contributions, we fill this gap with a formal proof that the min-imbalance problem for three or more covariates is NP-hard.

A problem related to the min-imbalance that selects a subset of the control samples so as to achieve a certain type of balancing of the levels is called Optimal Matching with Minimal Deviation from Fine Balance, see Yang et al. [2]. We refer to this problem here for short as -Matching-Balance (MB). The MB problem is to optimize the assignment of each treatment sample to a number, , of control samples, based on a distance measure between each treatment and each control sample. This is subject to a constraint that the selection of the control samples is an optimal solution to a problem related to min-imbalance. In other words, MB is defined in two stages: In the first stage the goal is to find a selection of control samples of size , where is an integer between and , and is the size of the treatment group, so as to minimize the -imbalance, defined as . In the second stage, among all selections of control samples that attain the minimum -imbalance, one chooses the selection that minimizes the distance matching of each treatment sample to exactly selected control samples. Yang et al. [2] studied this MB problem for a single covariate and proposed two network flow algorithms. We observe here that, for any number of covariates, if the first stage problem of minimizing the -imbalance has a unique solution, in terms of the number of selected samples from each level-intersection (to be discussed in the next section, Section 2) of the control group, then an optimal solution to the second stage, and therefore to the MB problem, is easily attained by solving a network flow problem. As shown in Appendix C the minimum -imbalance is equivalent to the min-imbalance problem, and therefore an optimal solution to a corresponding min-imbalance problem provides an optimal solution to the minimum -imbalance problem. It follows in particular that for a single covariate, the first stage problem of minimizing the -imbalance is trivial and has a unique solution in terms of the number of selected samples from each level of the control group. For two or more covariates, no polynomial running time algorithm is known for this problem. However, for two covariates, any of the algorithms shown here, also solve optimally the minimum -imbalance problem. For any number of covariates, when the minimum -imbalance solution is unique (in terms of level-intersection sizes) and known, MB problem is solved in polynomial time in the second stage (for details see Appendix C). In particular for two covariates, if the solution to the minimum -imbalance problem is unique, then the problem is solved efficiently with network flow techniques.

Our main results here are efficient algorithms for the min-imbalance problem with two covariates. We present an integer programming formulation of the problem, related to that of Bennett et al [3], and show that the constraint matrix is totally unimodular. That implies the linear programming relaxation to the problem has all basic solutions, and in particular the optimal solution, integral. We then show that the two-covariate min-imbalance problem can be solved much more efficiently than as a linear program with network flow techniques. We show how to solve the problem as a minimum cost network flow problem with complexity of . We then devise a more efficient algorithm based on a maximum flow formulation of the two-covariate min-imbalance problem that runs in steps. Another contribution here is a proof that for three or more covariates, the min-imbalance problem is NP-hard.

In summary our contributions here are:

  • An integer programming formulation of the min-imbalance problem with two covariates that has a totally unimodular constraint matrix.

  • A formulation of the two-covariate min-imbalance problem as a minimum cost network flow problem, solved in steps.

  • A maximum flow formulation of the two-covariate min-imbalance problem that is solved in steps.

  • An NP-hardness proof of the min-imbalance problem with three or more covariates.

Paper overview: Section 2 provides the notation used here. Section 3 describes the integer programming formulation with a totally unimodular constraint matrix for the two-covariate min-imbalance problem. We describe the minimum cost network flow formulation of the two-covariate min-imbalance problem, and the algorithm used to solved it in Section 4. In Section 5 we provide a maximum flow problem, the output of which is converted, in linear time, to an optimal solution to the two-covariate min-imbalance problem. We provide several concluding remarks in Section 6. The NP-hardness proof of the min-imbalance problem with three of more covariates is provided in Appendix A. Appendix B explains our method also applies to the min-imbalance problem for other specified size of the selection.

2 Preliminaries and notations

The goal of the min-imbalance problem is to identify a selection of control samples that will match, at each level of each covariate partition, as closely as possible, the number of treatment samples in that level and the number of selection samples in the same level. We denote by the number of covariates and the number of levels in the partition induced by covariate by , for . Let the levels of the treatment group under covariate be of sizes , and let the levels of the control group under covariate be of sizes . For a selection of control samples we define the discrepancy at level under covariate as,

The discrepancy of a level can be positive or negative. If the discrepancy is positive we refer to it as excess which is defined as , and if negative, we refer to it as deficit . With this notation the imbalance of a selection , is,

Notice that this quantity is identical to the imbalance form presented in the introduction: .

We now present the integer programming formulation that was given by Bennett et al. in [3] for the min-imbalance problem. That integer program involves two sets of decision variables:

: a binary variable equal to

if control sample is in the selection , and otherwise, for ;
: the absolute value of discrepancy at level under covariate , for , and . We note that this variable can only assume integer values, even though Bennett et al. refer to this as a mixed integer programming formulation.
With these variables the formulation is:

(1a)
s.t. (1b)
(1c)
(1d)
(1e)

For each pair , and , constraint (1b) and (1c) ensure that assumes the absolute value of the difference between the number of selected level control samples and at an optimal solution. These constraints also ensure that any feasible is non-negative and therefore a non-negativity constraint is not required for variable . The constraint (1d) specifies that the size of selected subset equals the size of the treatment group.

Bennett et al. [3] proved that any basic solution of the linear programming relaxation of (1) is integral for . We provide in Section 3 a stronger result showing that a slightly modified form of this integer programming formulation has a constraint matrix which is totally unimodular for . That implies that result that every basic solution is integer, and furthermore the constraint matrix is that of a minimum cost network flow problem. This implies that the problem can be solved significantly more efficiently than as linear programming problem.

An optimal solution to formulation (1) specifies for each control sample whether or not it is in the selection. We observe however that the output to the min-imbalance problem, for any number of covariates, can be presented more compactly in terms of level-intersections. For covariates the intersection of the level sets , , , form a partition of the control group. The number of non-empty sets in this partition is at most . Instead of specifying which sample belongs to the selection, it is sufficient to determine the number of selected control samples in each level intersection, since the identity of the specific selected samples has no effect on the imbalance. With this discussion, we have a theorem on the presentation of the solution to the min-imbalance problem in terms of the level-intersection sizes.

Theorem 1.

The level-intersection sizes are an optimal solution to the min-imbalance problem if there exists an optimal selection such that .

We will say that the solution to the problem is unique if for any optimal selection , the numbers are unique. In order to derive an optimal selection given the optimal level-intersection sizes, one selects any control samples from the intersection for , . Obviously, when for at least one combination of , the optimal selection is never unique.

Standard formulations of the minimum cost network flow (MCNF) and the maximum flow problems as well as a discussion of why MCNF constraint matrix is totally unimodular, are given in Appendix D.

3 A modified formulation with a totally unimodular constraint matrix for

In our formulation, instead of using variables , we use variables for excess and deficit. As discussed in Section 2, for each and . We let the variable for excess for and be and the variable for deficit be . Note that where both and are non-negative variables.

For each and , .

In the modified formulation shown below, the constraints (2b) and (2c) for the two covariates are separated to facilitate the identification of the total unimodularity property. Since is a partition of the control group, . Also, because are the sizes of the levels partition of the treatment group for the first covariate, it follows that . Therefore, . So specifying is equivalent to constraint (2d) in formulation (2) given below:

(2a)
s.t. (2b)
(2c)
(2d)
(2e)
(2f)
Lemma 1.

The constraint matrix of LP relaxation of formulation (2) is totally unimodular.

Proof.

First, it is noted that in the constraint matrix of (2) each entry is , or . Multiplying some rows of a totally unimodular matrix by preserves total unimodularity. Consider the matrix resulting by multiplying the rows of constraint (2c) by . As shown next, each column in this new matrix has at most one and at most one . Hence, by a well-known theorem (see Appendix D Theorem 6) the matrix is totally unimodular.

  • Both and are partitions of the control group, so are mutually disjoint, and are mutually disjoint. The column of each has exactly one in rows corresponding to (2b), and one (after multiplication) in rows corresponding to (2c).

  • For each , the column of has exactly one in rows corresponding to (2b) and exactly one in rows corresponding to (2d); the column of has exactly one in rows corresponding to (2b)) and exactly one in rows corresponding to (2d).

  • For each , the column of has exactly one non-zero, or , entry in rows corresponding to (2c); the column of has exactly one non-zero, or , entry in rows corresponding to (2c).

Since the new matrix has at most one and at most one in each column, it is totally unimodular. ∎

Formulation (2) is in fact also a minimum cost network flow (MCNF) formulation. See Appendix D for a generic formulation of MCNF. A generic MCNF formulation has exactly one and one in each column of the constraint matrix. To make formulation (2) have this structure, we multiple all coefficients in constraint (2c) by , and add a redundant constraint , which, like constraint (2d), is also equivalent to . In the next section, we streamline this formulation to derive a more efficient network flow formulation for this problem.

4 Network flow formulation for

Here we use the level-intersection sizes as variables, for . These variables can also be written as . With these decision variables we get the following network flow formulation:

(3a)
s.t. (3b)
(3c)
(3d)
(3e)
(3f)
(3g)
.

The formulation (3) is a minimum cost network flow problem. The corresponding network is shown in Figure 1, where all capacity lower bounds are , and each arc has a cost per unit flow and upper bound associated with it. Nodes of type each has supply of and nodes of type each has demand of In Figure 1, for each and , the flow on the arc between node and node represents variable ; arc from node to node represents the excess ; arc to node from any node represents the deficit ; arc from node to any node represents the deficit ; arc to node from any node represents the excess . So the per unit arc cost should be 1 for arcs between node 1 or 2 and any node in . All other arcs have cost . It is easy to verify that constraints (3b) are corresponding to the flow balance at nodes for all , constraints (3c) are corresponding to the flow balance at nodes for all . Constraint (3d) is corresponding to the flow balance at node 1, and constraint (3e) is corresponding to the flow balance at node 2.

1,1

1,2

1,

2,1

2,2

2,

arc legend: (cost, upperbound)

Figure 1: MCNF graph corresponding to formulation (3).
Theorem 2.

The 2-covariate min-imbalance problem is solved as a minimum cost network flow problem in time.

Proof.

We choose the algorithm of successive shortest paths that is particularly efficient for a MCNF with “small” total supply to solve the network flow problem of the 2-covariate min-imbalance problem.

The successive shortest path algorithm iteratively selects a node with excess supply (supply not yet sent to some demand node) and a node with unfulfilled demand and sends flow from to along a shortest path in the residual network [4], [5], [6]. The algorithm terminates when the flow satisfies all the flow balance constraints. Since at each iteration, the number of remaining units of supply to be sent is reduced by at least one unit, the number of iterations is bounded by the total amount of supply. For our problem the total supply is .

At each iteration, the shortest path can be solved with Dijkstra’s algorithm of complexity , where is number nodes and is number of arcs [7], [8]. In our formulation, is , which is at most . Since the number of nonempty sets is at most , the number of arcs is .

Hence, the total running time of applying the successive shortest path algorithm with node potentials on our formulation is . ∎

5 Maximum flow formulation for

Here we show a maximum flow (max-flow) formulation (see Appendix D for a generic formulation of max-flow problem) for the min-imbalance problem with 2 covariates. Unlike the previous formulations, the maximum flow solution requires further manipulation in order to derive an optimal solution to the min-imbalance problem with 2 covariate. That max-flow graph is illustrated in Figure (2). The source node can send at most units of flow to node for each , the sink node can get at most units of flow from node for each , and there can be a flow from node to node with amount bounded by , for .

1,1

1,2

1,

2,1

2,2

2,

arc legend: upperbound

Figure 2: Maximum flow graph

Let the maximum flow value for the max-flow problem presented in Figure (2) be denoted by , and let

be the optimal flow vector, with

denoting the flow amount between node and node . It is obvious that . That means an initial selection generated by selecting the prescribed number of control samples as in the optimal max-flow solution, i.e., selecting control samples from is of size . In order to get a feasible solution for the min-imbalance problem it is required to select additional control samples. The selection has no positive excess, only non-negative deficits with respect to the levels of both covariates. This is because due to the upper bound of arc from to for each , and due to the upper bound of arc from to for each . To recover an optimal solution for the min-imbalance problem from the initial set , we add up to unselected control samples, one at a time, each corresponding to a level with positive deficit under either covariate or . This process is repeated until either such control samples are found, or until no such control sample exists. In the latter case, to complete the size of the selection, any randomly selected control samples are added. Algorithm 1 is a formal statement of this process of recovering an optimal solution of the min-imbalance problem from the initial selection .

Initialization step: Select number of control samples from in set .
while  do
     if there exists an unselected control sample whose covariate 1 level is and covariate 2 level is , such that or  then,
          .
     else
          Let and let be any control samples . Set .
     end if
end while
Output .
Algorithm 1

To show that Algorithm 1 provides an optimal solution to the min-imbalance problem, we distinguish two cases of Algorithm 1: (1) and (2) . In the first case, there is, at each iteration, at least one control sample that belongs to some level with positive deficit. In Theorem 3 we prove that the output of Algorithm 1 is an optimal solution in this case.

Theorem 3.

If then the output selection of Algorithm 1 is optimal for the min-imbalance problem, with an optimal objective value of .

Proof.

First, we show that the total imbalance of the selection is . At the initialization step the selection has only deficits for all levels, with total deficit for covariate 1, , and total deficit for covariate 2, . At each iteration, there is an added control sample, say in , such that either or has a positive deficit with respect to . It is however impossible for both and to have a positive deficit with respect to since otherwise, there is an -augmenting path, from to node , to node , to , along which the flow can be increased by at least one unit. This is in contradiction to the optimality of the max-flow solution . As a result, at each iteration where a control sample is added, the total deficit is reduced by one unit, and the total excess is increased by one unit. Thus, at each iteration of the if step, the sum of total deficit and excess remains the same, namely .

Suppose, by contradiction, that there exists a selection for which the total imbalance is lower, . We repeat the following iterative procedure of removing samples from until there is no positive excess remaining: while there is a level of either covariate with positive excess with respect to , we remove one sample of that belongs to this level. Each such iteration results in the total excess reducing by at least 1 unit and the total deficit increasing by at most 1 unit, and therefore the sum of total deficit and excess does not increase. So when this iterative procedure ends, the total excess is zero and the total deficit is at most . Let be the number of samples remaining in after this excess removing procedure. Since there is no positive excess, is a feasible solution for the max-flow problem with the flow between node and node equal to . The sum of deficits associated with this remaining set is for covariate 1 and for covariate 2, for a total of , which is at most . Therefore, the total flow value, , satisfies that it is at least . Since , it follows that the value of the feasible flow induced by the set is greater than the maximum flow value , which contradicts the optimality of . ∎

We now address the second case where and . In this case, the total imbalance of is, from the arguments in the proof of Theorem 3, . Each one of the samples selected adds 1 unit of excess to each covariate, resulting in the addition of 2 units of excess to the imbalance. Therefore, the total imbalance of the output solution is . We next show what the value of is, and then demonstrate that any feasible selection to the min-imbalance problem has total imbalance of at least . This will prove that the output of Algorithm 1, , is an optimal solution to the min-imbalance problem.

It will be useful to consider an equivalent form of Algorithm 1. For each level of covariate that has , we add the largest number possible of available control samples in so long as the total does not exceed . This number is . Let then for each that has we add previously unselected control samples to . The outcome of this equivalent procedure is exactly the same as that of Algorithm 1. In the case that there is an insufficient number of control samples to add to after for all levels the largest number possible has been added. Therefore, at the end of this process, the if step returns that another unselected control sample does not exist, and the total number of control samples of , for each level of covariate , is .

Lemma 2.

If (and ) then where and .

Proof.

At the initialization step of Algorithm 1 and the total deficit is . Each time a control sample is added to in the if step, the total deficit is decreased exactly by 1 unit. So we can derive the value of when the algorithm terminates if we know the total deficit when the algorithm terminates. Note that the total excess may change, but we only consider here the deficit part of the imbalance.

From the discussion above, the total number of control samples of , for each level of covariate , is . We denote , and . Since the sum and , the sum of deficits of set under covariate 1 is , and the sum of deficits under covariate 2 equals . It follows that the sum of deficits of is .

Since the initial set that has total deficit of has its deficit reduced through Algorithm 1 to in the set , the additional number of control sample in that were added to the initial control samples is . Therefore, the size of is