We study the problem of multi-person pose estimation (MPPE), which we model as the problem of selecting a subset of non-overlapping proposals of people supported by image evidence and a prior model. This formulation of MPPE corresponds to a minimum weight set packing (MWSP)
problem where elements correspond to detections of body parts and sets (referred to as poses) correspond to subsets of those body part detections. Poses are associated with real-valued costs based on occurrence and co-occurrence probabilities of detections in a pose, as defined by a deep neural network [20, 17, 25, 2] and an (augmented) tree-structured part/spring model [11, 10, 26] respectively. This fully specifies MPPE as an optimization problem.
Since the set of poses grows exponentially in the number of detections, we employ an implicit column generation (ICG) strategy [4, 13, 14] for inference in MWSP. We exploit the augmented tree structure of the cost of a pose to frame pricing as a dynamic program where variables correspond to body parts and the state of a given variable corresponds to a subset of the detections for that part. Since the state space of each variable is enormous (the power set of the detections of a given part), we introduce a tool from operations research called the nested Benders decomposition (NBD) [7, 8]
which avoids considering the product of the state spaces of adjacent variables in the tree. NBD has been used for a variety of applications including agricultural planning, factory production planning, and stock portfolio optimization. Our NBD formulation is guaranteed to achieve exact inference in the pricing problem and in practice is orders of magnitude faster than naive dynamic programming. NBD exploits the fact that pricing problems are similar across iterations of ICG by hot-starting optimization in a given iteration with Benders rows produced in previous iterations. The combination of ICG with NBD promises efficient and provably optimal solutions without having to enumerate the product of state spaces.
1.1 Related Work
Our work is closely related to the sub-graph multi-cut integer linear programming (ILP) formulation of [19, 15, 18], which we refer to as MC for shorthand. MC models the problem of MPPE as partitioning detections into fourteen body part classes (plus a false-positive class) and clustering those detections into poses. Clustering is done according to the correlation clustering criterion with costs parameterized by the part associated with the detection. This formulation is notable as it models non-max suppression by allowing poses to be associated with multiple detections of a given body part. However, the optimization problem of MC is often too hard to solve exactly and is thus attacked with heuristic methods.
Motivated by the difficulty of inference in MC, later work introduces an alternative model called the two-tier formulation (TTF). TTF truncates the model of MC in such a way as to achieve fast inference using ICG. The TTF model with ICG inference outperforms MC with regard to finding difficult-to-localize parts such as the ankle and wrist, and also provides a marginal improvement in overall accuracy. Furthermore TTF provides some additional capabilities beyond those of MC, such as a prior on the number of people in an image. In that work, inference in the TTF model is also attacked using a non-nested Benders decomposition [12, 7], though inference is demonstrated to not be as fast as with the ICG strategy. While the TTF model works well in practice, it does not optimize the full cost but privileges a single exemplar detection for each body part to model the inter-part co-association costs of a pose, ignoring all inter-part co-association costs between non-exemplar detections. Furthermore TTF requires detections to be associated with parts in advance of optimization.
Our paper is structured as follows. In Section 2 we introduce a novel MWSP formulation for MPPE and address inference with ICG. In Section 3 we introduce our NBD approach to solving the pricing problem of ICG. In Section 4 we present experiments on the MPII-Multiperson validation set. Finally we conclude in Section 5. We provide additional derivations in the appendix.
2 Our MWSP Formulation of MPPE
We now formulate MPPE as MWSP. We use , to denote the sets of human body parts and body part detections, which we index with , respectively. We describe a surjection of detections to parts using where indicates the part associated with detection . Each detection is associated with a single part prior to MWSP optimization. For shorthand we use to denote the set of detections associated with part . We use to denote the set of all possible poses, which we index by . We associate the members of with detections using , which we index by , where indicates that detection is in pose . This allows us to formulate MPPE as a search for low-cost, non-overlapping sets.
We model human poses with an augmented tree structure over the set of parts described using matrix . We index with where if and only if part is a child of part in the augmented tree. We use fourteen body parts (head, neck, and the left/right of shoulder, elbow, wrist, hip, knee, ankle), as standardized by the MPII dataset. We form an augmented tree over these parts defined by a typical tree structure over all parts excluding the neck, followed by connecting the neck to each of the other thirteen parts; this model design is based on the observation that in real images people’s necks are rarely occluded, thus having connections from the neck to all other body parts can handle cases where other body parts are occluded while keeping the model relatively simple. For shorthand we use to denote the root of the tree, which is a part other than the neck selected arbitrarily.
The cost of a pose is defined using terms , and which we index with and respectively. We refer to the terms as unary and pairwise respectively. We use to indicate the cost of including detection in a pose. Similarly we use to indicate the cost of including detections in a common pose. The augmented tree structure is respected with regards to these costs; thus can only be non-zero if or if is a child of . We model a prior on the number of poses in the image using which is a constant offset added to the cost of each pose. We define the mapping of poses to costs using which we index with where is the cost associated with pose which is defined formally below.
We have thus fully defined the cost of a single pose. In the next section we formulate the MWSP problem using the costs of poses.
2.1 ILP/LP Relaxation Formulation
We frame the search for the lowest cost set of non-overlapping poses as an integer linear program (ILP). We use to define a selection of poses where if and only if pose is selected. We write the constraint that selected poses do not overlap as . We express the ILP along with the corresponding primal/dual linear program (LP) relaxations as follows using Lagrange multipliers which we index by .
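To make the MWSP objective concrete, the following minimal sketch solves a toy instance by brute force (all detections, poses, and costs here are hypothetical; real instances cannot be enumerated this way, which is what motivates column generation below in our formulation). Each pose is a set of detection indices, selected poses must be disjoint, and each selected pose additionally pays the constant offset that models the prior on the number of people.

```python
from itertools import combinations

def min_weight_set_packing(poses, costs, offset):
    """Brute-force MWSP: choose a set of pairwise-disjoint poses
    minimizing total cost; each selected pose also pays `offset`,
    the prior on the number of people. Exponential; toy sizes only."""
    best_cost, best_sel = 0.0, ()            # the empty selection is feasible
    for r in range(1, len(poses) + 1):
        for sel in combinations(range(len(poses)), r):
            dets = [d for i in sel for d in poses[i]]
            if len(dets) != len(set(dets)):  # poses share a detection: infeasible
                continue
            c = sum(costs[i] + offset for i in sel)
            if c < best_cost:
                best_cost, best_sel = c, sel
    return best_cost, best_sel

# Hypothetical instance: 4 detections, 3 candidate poses.
poses = [frozenset({0, 1}), frozenset({1, 2}), frozenset({2, 3})]
costs = [-3.0, -4.0, -2.5]
print(min_weight_set_packing(poses, costs, offset=1.0))   # (-3.5, (0, 2))
```

Pose 1 has the lowest cost but overlaps both other poses, so the optimum packs poses 0 and 2 instead.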
2.2 Inference via Implicit Column Generation
Given that grows exponentially in the number of detections, we cannot explicitly consider it during MWSP optimization. We thus construct a sufficient subset using ICG so as to solve the dual problem in Eq 2 exactly. ICG consists of two alternating optimizations referred to as the restricted master problem (RMP) and the pricing problem respectively.
RMP: We solve the dual optimization over the set providing dual variables . We write dual optimization of the RMP below.
Pricing: Using we identify a subset of the most violated constraints corresponding to members of and add it to . This subset includes the most violated constraint over all . When no violated constraints exist ICG terminates. The slack in the dual constraint corresponding to a given primal variable is referred to as its reduced cost, thus pricing identifies the lowest reduced cost terms in the primal.
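The pricing test can be sketched as follows, assuming the sign convention that the reduced cost of a pose is its cost minus the sum of the dual variables over its detections; here an explicit candidate list (hypothetical) stands in for the dynamic-programming oracle that searches all poses implicitly.

```python
def reduced_cost(pose, cost, duals):
    # Assumed convention: the slack of the dual constraint for a pose is
    # its cost minus the sum of duals over the detections it contains.
    return cost - sum(duals[d] for d in pose)

def pricing(poses, costs, duals):
    """Return (reduced cost, index) for poses with negative reduced
    cost, most violated first; empty list means ICG terminates."""
    rc = [(reduced_cost(p, c, duals), i)
          for i, (p, c) in enumerate(zip(poses, costs))]
    return sorted((r, i) for r, i in rc if r < -1e-9)

# Hypothetical candidate poses and dual variables.
poses = [frozenset({0, 1}), frozenset({1, 2})]
costs = [-1.0, -2.5]
duals = {0: -0.5, 1: -0.5, 2: -1.0}
print(pricing(poses, costs, duals))   # [(-1.0, 1)]
```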
During the pricing step we iterate through the power set of neck detections and compute the lowest reduced cost pose containing exactly those neck detections. We index the power set of neck detections with and use to indicate that the neck detections in are exactly those in . We write pricing below for an arbitrary subset of the neck detections .
Note that when conditioned on a specific set of neck detections, the pairwise costs from these neck detections to all other detections can be added to unary costs of the other detections. Thus the augmented-tree structure becomes a typical tree structure, and exact inference can be done via dynamic programming.
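A minimal sketch of this conditioning step (detection names and costs are hypothetical): pairwise terms from the fixed neck detections are folded into the unary terms of the remaining detections, after which only tree-structured pairwise terms are left.

```python
def condition_on_neck(neck_set, unary, pairwise):
    """Fold pairwise costs between the fixed neck detections and every
    other detection into that detection's unary cost. Afterwards only
    parent/child (tree-structured) pairwise terms remain."""
    folded = dict(unary)
    for d in folded:
        for n in neck_set:
            folded[d] += pairwise.get((n, d), 0.0)
    return folded

unary = {"d1": -1.0, "d2": 0.5}                 # non-neck detections (hypothetical)
pairwise = {("n1", "d1"): -0.25, ("n1", "d2"): 0.25}
print(condition_on_neck({"n1"}, unary, pairwise))   # {'d1': -1.25, 'd2': 0.75}
```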
We write ICG in Alg 1 and consider the pricing problem, which is a dynamic program, in Section 2.4. At termination of Alg 1 we solve MWSP over using an ILP solver. In practice we find that the LP relaxation provides an integral solution at termination in over 99% of cases. If needed we could employ branch-and-price or tighten the bound with odd-set inequalities while preserving the structure of the pricing problem.
2.3 Anytime Lower Bounds
We compute an anytime lower bound on our objective by adding a term based on the columns produced to the objective of the RMP. Recall that each detection can be assigned to at most one body part. Thus we can rely on the known result that the LP value for MWSP is bounded from below by the RMP objective plus the lowest reduced cost times the cardinality of the set of elements [9, 23] (if a negative reduced cost term exists). We compute this lower bound given any non-negative provided by the RMP as follows.
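Under the stated result, the bound is a one-line computation; the values below are hypothetical.

```python
def anytime_lower_bound(rmp_objective, min_reduced_cost, num_elements):
    """Lower bound on the MWSP LP: the RMP objective plus the most
    negative reduced cost times the number of elements. If no column
    has negative reduced cost, the RMP objective is already optimal."""
    return rmp_objective + num_elements * min(0.0, min_reduced_cost)

print(anytime_lower_bound(-10.0, -0.5, 30))   # -25.0
print(anytime_lower_bound(-10.0, 0.25, 30))   # -10.0: optimality certificate
```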
2.4 Pricing Using a Naive Dynamic Program
Observe that Eq 4 corresponds to computing the maximum a posteriori probability (MAP) of a Markov random field (MRF) where there is a bijection between parts (except the neck) and variables in the MRF. Similarly, for a variable in the MRF there is a bijection between the state space of that variable and the power set of detections for the associated part. This MRF is tree structured and hence amenable to exact inference via dynamic programming, which we now consider.
We use to denote the state space of variable which we describe using . We index with where and respectively. We use to indicate that detection is included in configuration . We use to refer to the value of the lowest cost solution to the sub-tree rooted at conditioned on its parent taking on state . The cost includes the pairwise interaction terms between members of and . We define formally using helper terms below.
Computing for all is expensive since we need to search over all possible combinations of state spaces of two adjacent variables, where the number of possible states of each variable can be up to 50k in our experiments.
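The naive dynamic program can be sketched as follows on a hypothetical two-part instance; note that it pays for every pair of parent/child subset states, which is exactly the cost NBD avoids.

```python
from itertools import combinations

def powerset(items):
    s = list(items)
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

def unary_cost(state, theta):
    return sum(theta[d] for d in state)

def pair_cost(sa, sb, phi):
    return sum(phi.get((a, b), 0.0) for a in sa for b in sb)

def dp_subtree(part, parent_state, children, dets, theta, phi):
    """Min cost of the subtree rooted at `part`, conditioned on the
    parent's state. Searches the full product of subset state spaces:
    exact but exponential in the number of detections per part."""
    best = float("inf")
    for state in powerset(dets[part]):
        c = unary_cost(state, theta) + pair_cost(parent_state, state, phi)
        c += sum(dp_subtree(ch, state, children, dets, theta, phi)
                 for ch in children.get(part, []))
        best = min(best, c)
    return best

# Hypothetical chain: root r with detections {a}, child k with {b, c}.
dets = {"r": ["a"], "k": ["b", "c"]}
children = {"r": ["k"]}
theta = {"a": -1.0, "b": -0.5, "c": 0.5}
phi = {("a", "b"): -0.25, ("a", "c"): -0.125}
print(dp_subtree("r", frozenset(), children, dets, theta, phi))   # -1.75
```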
3 A Nested Benders Decomposition Alternative to Naive Dynamic Programming
We now consider NBD as an alternative to naive dynamic programming that avoids the expensive computation of for all . We express using the sum of convex functions, each constructed from the maximum of a unique set of affine functions called Benders rows. Specifically for any part (other than the root ) we denote the corresponding set of Benders rows as , which we index by . We describe using , where is the parent of , and index with or . For a given we use to indicate the value associated with and to be the offset. Using we provide an alternative description of using helper term defined as follows.
The existence of such a decomposition is a known result in the stochastic programming literature. Notice that the minimization in Eq 7 requires neither the configuration of the children of nor “messages” from those children. Hence the state of can be determined independently of the other variables. Similarly each variable can be determined independently of its children given the state of its parent. If the sets are of small cardinality then solving Eq 7 is easy; thus we construct a sufficient subset of , denoted , for each (other than the root) using row generation (cutting planes). We refer to the collection of the nascent sets as . Given the nascent sets we construct a lower bound on , denoted , defined below using helper function .
The root variable is not associated with terms since it has no parent but it is associated with terms.
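A Benders row can be represented as a per-detection value plus an offset, so evaluating it on a parent state is a sum over the included detections; a child's cost-to-go bound is then the maximum over its nascent row set. A minimal sketch (rows are hypothetical):

```python
def row_value(row, parent_state):
    """A Benders row is affine in the 0/1 indicator of the parent's
    state: a value per detection plus a constant offset."""
    a, b = row
    return sum(a.get(d, 0.0) for d in parent_state) + b

def child_bound(rows, parent_state):
    """Lower bound on a child subtree's cost-to-go: max over its
    nascent set of Benders rows (-inf when no rows exist yet)."""
    return max((row_value(r, parent_state) for r in rows),
               default=float("-inf"))

# Hypothetical rows for one child, evaluated at parent state {a}.
rows = [({"a": -0.25}, -0.5), ({}, -0.875)]
print(child_bound(rows, frozenset({"a"})))   # -0.75
```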
3.1 Overview of Constructing
We now consider the construction of small sufficient sets such that . We outline the remainder of this section as follows. In Section 3.2 we produce upper/lower bounds on the MAP of the MRF, which are identical at termination of NBD. The upper bound is accompanied by a configuration with cost equal to the upper bound. In Section 3.3 we compute the gap between the upper and lower bounds introduced at each variable in the tree and select the variable associated with the largest increase in the gap. Then in Section 3.4 we add a Benders row to . Next in Section 3.5 we increase terms hence tightening the relaxation without generating new rows. Then in Section 3.6 we combine the steps above to produce a complete NBD inference approach. Finally we provide implementation details in Section 3.7.
3.2 Producing a Configuration and Corresponding Bounds using Nested Benders Decomposition
In this section we produce a configuration for the MRF by proceeding from root to leaves, selecting the state of a given variable given the state of its parent only. We describe the configuration produced using where . We use to refer to the state of the parent of . The process of producing is defined below.
The cost of the configuration is an upper bound on the MAP and is associated with a lower bound on the MAP .
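The root-to-leaves pass can be sketched as follows (hypothetical two-variable instance): each variable's state is chosen from its parent's selected state and the Benders rows of its children alone, with an empty row set contributing zero, i.e., children are ignored until rows for them exist.

```python
from itertools import combinations

def powerset(items):
    s = list(items)
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

def affine(row, state):
    a, b = row
    return sum(a.get(d, 0.0) for d in state) + b

def select_states(order, parents, children, dets, theta, phi, rows):
    """Root-to-leaves pass: pick each variable's state using only its
    parent's chosen state and the Benders rows of its children; no
    messages from the children themselves are needed."""
    config = {}
    for part in order:                       # topological order: root first
        pstate = config.get(parents.get(part), frozenset())
        def cost(s, part=part, pstate=pstate):
            c = sum(theta[d] for d in s)
            c += sum(phi.get((p, d), 0.0) for p in pstate for d in s)
            c += sum(max((affine(r, s) for r in rows.get(ch, [])), default=0.0)
                     for ch in children.get(part, []))
            return c
        config[part] = min(powerset(dets[part]), key=cost)
    return config

# Hypothetical instance: root r over {a}, child k over {b}, one row for k.
dets = {"r": ["a"], "k": ["b"]}
parents, children = {"k": "r"}, {"r": ["k"]}
theta = {"a": -1.0, "b": -0.5}
phi = {("a", "b"): -0.25}
rows = {"k": [({"a": -0.75}, 0.0)]}
config = select_states(["r", "k"], parents, children, dets, theta, phi, rows)
print(config)
```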
3.3 Computing the Gap Introduced at each Variable in the Tree
In this section we identify the variable in the tree associated with the largest increase in the gap between the upper and lower bounds given a configuration produced in Section 3.2. The gap between upper and lower bounds introduced at is the difference between the upper and lower bounds at variable minus the corresponding gaps at its children. We use , which are defined below, to denote the cost of the sub-tree rooted at ; the corresponding lower bound; and the gap introduced at respectively.
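A sketch of the gap attribution (per-subtree bounds are hypothetical): the gap introduced at a variable is its subtree's upper-minus-lower gap minus the gaps of its children's subtrees, and the variable with the largest introduced gap is then selected.

```python
def gaps(root, children, ub, lb):
    """Gap introduced at each variable: the (upper - lower) gap of its
    subtree minus the gaps already attributed to its children."""
    out = {}
    def sub(p):
        child_total = sum(sub(c) for c in children.get(p, []))
        total = ub[p] - lb[p]
        out[p] = total - child_total
        return total
    sub(root)
    return out

# Hypothetical per-subtree upper/lower bounds for a two-variable tree.
children = {"r": ["k"], "k": []}
ub = {"r": -1.0, "k": -0.5}
lb = {"r": -2.25, "k": -1.0}
g = gaps("r", children, ub, lb)
print(g)                       # {'k': 0.5, 'r': 0.75}
print(max(g, key=g.get))       # 'r' introduces the largest gap
```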
3.4 Generating Benders Rows
In this section we identify the most violated Benders row, denoted , for a given , given that its parent takes on and the sets. We formulate this as a small-scale linear program described below.
We use decision variable to indicate that is the state associated with variable . We use decision variable which we index by where . We introduce dual variables and each of which lie in .
After computing we produce a new Benders row denoted that is defined as follows.
When solving the optimization in the dual of Eq 11 we add a tiny negative bias to the objective terms corresponding to . This ensures that the corresponding terms do not increase beyond what is needed to produce an optimal dual solution, which stabilizes optimization. The additional small biases may be understood intuitively as ensuring that corresponds to the marginal cost of using in the solution. In Appendix B we solve the dual LP in Eq 11 efficiently by reducing it to an equivalent LP with variables and far fewer constraints.
3.5 Re-using Rows: Rapidly Updating while leaving fixed
In this sub-section we provide a complementary mechanism to generating new Benders rows. This mechanism is motivated by the observation that may increase (but never decrease) when Benders rows are added to the descendants of a given variable . This mechanism takes in existing Benders rows and sets the corresponding term to the minimum feasible value thus tightening the corresponding constraint while leaving fixed. This task is faster than generating Benders rows via the method of Section 3.4.
Observe that given satisfying there always exists a feasible setting of . Given fixed we select the smallest feasible value for as follows.
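A sketch of this update, under our reading that rows lower-bound the child's cost-to-go: holding a row's per-detection values fixed, the tightest feasible offset is the minimum over parent states of the current cost-to-go bound minus the affine part. The states and cost-to-go values below are hypothetical.

```python
def tighten_offset(slope, states, cost_to_go):
    """Keep the row's per-detection values (`slope`) fixed and recompute
    its offset as the largest value keeping the row a valid lower bound:
    the minimum over parent states of cost-to-go minus the affine part."""
    return min(cost_to_go[s] - sum(slope.get(d, 0.0) for d in s)
               for s in states)

# Hypothetical parent states and current cost-to-go bounds.
states = [frozenset(), frozenset({"a"})]
cost_to_go = {frozenset(): -0.5, frozenset({"a"}): -1.75}
print(tighten_offset({"a": -0.75}, states, cost_to_go))   # -1.0
```

This pass touches only offsets, which is why it is cheaper than solving the LP of the previous section.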
3.6 Our Complete Nested Benders Decomposition Algorithm
Our NBD approach iterates through the following steps.
Step One: Proceeding from the leaves to the children of the root, set terms to the minimum feasible value given fixed . This is done on the first iteration of NBD only if is not empty for each .
Step Two: Proceeding from the root to the leaves, select the state for each variable conditioned on its parent (if it has one) and the Benders rows associated with its children. This produces upper and lower bounds associated with each variable in the tree.
Step Three: Select the variable corresponding to the largest increase in the gap between the upper and lower bounds in the tree.
Step Four: Add a new Benders row to .
We repeat this procedure until no additional Benders rows need be added, at which point the configuration produced in Step Two is guaranteed to be the global optimum. We formalize this procedure in Alg 2.
3.7 Implementation Details
In this section we provide implementation details with regard to Alg 2.
Accelerating Step One of Alg 2: Observe that terms associated with can only be decreased if a new Benders row is added to one of the descendants of . Thus when executing Alg 2 we only update the terms associated with the ancestors of the variable which had its Benders row set augmented. Observe that we need only update the terms associated with the leaves in the first iteration of NBD in a given call from ICG.
Accelerating Step Two of Alg 2. Recall that the choice of the state of a given variable is a function of the Benders rows associated with its children and the configuration of the parent (if it has a parent). At any iteration of NBD other than the first we consider the previous configuration of the MRF produced in step two of Alg 2 and only update the state of a variable if the state of its parent was changed or if it is an ancestor of the variable which had its Benders row set augmented.
Limiting the Number of Neck Detections in a Pose: We found that our best results with regards to timing and modeling occur when we require that each pose include exactly one neck detection.
Limiting State Space of a Variable: We limit the number of states of a given variable to a user-defined parameter value (=50,000). We construct this set as follows. We begin with the state corresponding to zero detections included, then add the group of states corresponding to one detection included, then the group of states corresponding to two detections included, etc. If adding a group would have the state space exceed states for the variable, we do not add the group and terminate.
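This capped enumeration can be sketched as follows (the cap of 60 is illustrative; our experiments use 50,000):

```python
from itertools import combinations

def capped_states(detections, cap):
    """Enumerate subset states grouped by detection count: the empty
    set, then all singletons, then all pairs, ...; stop before any
    group would push the state count past `cap`."""
    states = [frozenset()]
    for r in range(1, len(detections) + 1):
        group = [frozenset(c) for c in combinations(detections, r)]
        if len(states) + len(group) > cap:
            break
        states.extend(group)
    return states

# 10 detections, cap 60: empty + 10 singletons + 45 pairs = 56 states;
# the 120 triples would exceed the cap, so enumeration stops.
print(len(capped_states(range(10), cap=60)))   # 56
```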
Caching Integrals: We accelerate Alg 2 by storing the value of repeatedly used integrals that do not change in value over the course of optimization.
Thus each time a new is produced we store for each . Similarly we store for each .
Initializing : We do not initialize with any Benders rows for the first round of pricing in ICG. Thus the initial state for a variable in NBD ignores the existence of its children and the corresponding initial lower bound is .
Timing Observation: Experimentally we observe that the total time consumed by steps in NBD is ordered from greatest to least as [1,2,4,3]. Note that the step solving the LP is the second least time consuming step of NBD.
Selecting the Root: Observe that Alg 2 requires solving LPs in Step Four for all variables except the root. The number of constraints in the LP for part is exponential in the size of . We avoid solving the largest LP by selecting as the root the part associated with the most detections. Alternatively we could select the root so as to have a more balanced tree with the goal of leveraging parallel processing.
4 Experiments
We evaluate our approach against a naive dynamic programming (DP) based formulation on the MPII-Multiperson validation set, which consists of 418 images. The unary and pairwise costs are trained using the code of , with a constant bias (set by hand) to regularize the number of people in the solution.
We compare solutions found by NBD and DP at each step of ICG; for all problem instances and all optimization steps, NBD obtains exactly the same solutions as DP (up to ties in cost). Comparing the total time spent in NBD vs DP across problem instances, we found that NBD is 44x faster than DP, and can be up to 500x faster on extreme problem instances. The accumulated running time used by NBD and DP over all 418 instances is shown in Fig. 2. We observe that the speed-up factor provided by NBD increases as a function of the computation time of DP.
With regard to cost, we observe that the integer solution produced over is identical to the LP value in over 99% of problem instances, thus certifying that the optimal integer solution is produced. For those instances on which the LP relaxation fails to produce an integral result, the gaps between the LP objectives and the integer solutions are all within 1.5% of the LP objectives.
For the sake of completeness, we also report MPPE accuracy in terms of average precision (AP) and compare it against a state-of-the-art solver which uses primal heuristics. Note that this cost formulation differs from ours in that it allows a pose to be associated with multiple neck detections or none, while our model requires that each pose have exactly one neck detection and maps detections to parts prior to ICG. Also our model includes a prior on the number of poses as modeled by . As shown in Table 1, we achieve equivalent results. We note that although our algorithm does not run as fast, our code is implemented in pure MATLAB and could benefit further from commercial LP solvers and parallelized pricing routines. More importantly, our formulation provides upper/lower bounds and, in over 99% of cases, certificates of optimality.
5 Conclusion
We have described multi-person pose estimation as a minimum weight set packing (MWSP) problem which we address using implicit column generation. We solve the corresponding pricing problem using a novel nested Benders decomposition (NBD) approach, which reuses Benders rows between calls to NBD. For over 99% of cases we find provably optimal solutions, which is practically important in domains where knowledge of certainty matters, such as interventions in rehabilitation. Our procedure for solving the pricing problem vastly outperforms a baseline dynamic programming approach. We expect that NBD will find many applications in machine learning and computer vision, especially for solving dynamic programs with large state spaces for individual variables. For example we could formulate sub-graph multi-cut tracking as MWSP with pricing solved via NBD.
References
-  M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In Proc. of CVPR, 2014.
-  P. Baldi, P. Sadowski, and D. Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5, 2014.
-  N. Bansal, A. Blum, and S. Chawla. Correlation clustering. In Proc. of FOCS, pages 238–247, 2002.
-  C. Barnhart, E. L. Johnson, G. L. Nemhauser, M. W. P. Savelsbergh, and P. H. Vance. Branch-and-price: Column generation for solving huge integer programs. Operations Research, 46:316–329, 1996.
-  R. E. Bellman. Dynamic Programming. Dover Publications, Incorporated, 2003.
-  H. Ben Amor, J. Desrosiers, and J. M. Valério de Carvalho. Dual-optimal inequalities for stabilized column generation. Operations Research, 54(3):454–463, 2006.
-  J. F. Benders. Partitioning procedures for solving mixed-variables programming problems. Numerische mathematik, 4(1):238–252, 1962.
-  J. R. Birge. Decomposition and partitioning methods for multistage stochastic linear programs. Operations research, 33(5):989–1007, 1985.
-  J. Desrosiers and M. E. Lübbecke. A primer in column generation. In Column generation, pages 1–32. Springer, 2005.
-  P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In Proc. of CVPR, 2008.
-  P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence, 32(9):1627–1645, 2010.
-  A. M. Geoffrion and G. W. Graves. Multicommodity distribution system design by benders decomposition. Management science, 20(5):822–844, 1974.
-  P. Gilmore and R. Gomory. A linear programming approach to the cutting-stock problem. Operations Research, 9, 1961.
-  P. Gilmore and R. E. Gomory. Multistage cutting stock problems of two and more dimensions. Operations research, 13(1):94–120, 1965.
-  E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. Deepercut: A deeper, stronger, and faster multi-person pose estimation model. CoRR, abs/1605.03170, 2016.
-  R. M. Karp. Reducibility among combinatorial problems. The IBM Research Symposia Series, pages 85–103. Plenum Press, New York, 1972.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Proc. of NIPS, 2012.
-  E. Levinkov, J. Uhrig, S. Tang, M. Omran, E. Insafutdinov, A. Kirillov, C. Rother, T. Brox, B. Schiele, and B. Andres. Joint graph decomposition and node labeling: Problem, algorithms, applications. In Proc. of CVPR, 2017.
-  L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, and B. Schiele. DeepCut: Joint subset partition and labeling for multi person pose estimation. In Proc. of CVPR, 2016.
-  D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Parallel distributed processing: Explorations in the microstructure of cognition, vol. 1. chapter Learning Internal Representations by Error Propagation, pages 318–362. MIT Press, Cambridge, MA, USA, 1986.
-  S. Tang, B. Andres, M. Andriluka, and B. Schiele. Subgraph decomposition for multi-target tracking. In CVPR, 2015.
-  S. Wang, K. Kording, and J. Yarkony. Exploiting skeletal structure in computer vision annotation with benders decomposition. arXiv preprint arXiv:1709.04411, 2017.
-  S. Wang, S. Wolf, C. Fowlkes, and J. Yarkony. Tracking objects with higher order interactions via delayed column generation. In Artificial Intelligence and Statistics, pages 1132–1140, 2017.
-  S. Wang, C. Zhang, M. A. Gonzalez-Ballester, A. Ihler, and J. Yarkony. Multi-person pose estimation via column generation. arXiv preprint arXiv:1709.05982, 2017.
-  R. Wu, S. Yan, Y. Shan, Q. Dang, and G. Sun. Deep image: Scaling up image recognition. arXiv preprint arXiv:1501.02876, 7(8), 2015.
-  Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In Proc. of CVPR, 2011.
-  J. Yarkony and C. Fowlkes. Planar ultrametrics for image segmentation. In Proc. of NIPS, 2015.
Appendix A: Dual Optimal Inequalities on
In this section we provide upper bounds on the Lagrange multipliers called dual optimal inequalities (DOI), which are computed prior to ICG. The use of DOI decreases the search space that ICG needs to explore and thus decreases the number of iterations of pricing required.
Observe that at any given iteration of ICG the optimal solution to the primal LP relaxation need not lie in the span of . If limited to producing a primal solution over , it is useful to allow some values of to exceed one. We introduce a slack vector, indexed by , that tracks the presence of any detections included more than once and prevents them from contributing to the objective when the corresponding contribution is negative. To do this we offset the cost of “over-including” a detection with a cost that at least compensates and likely overcompensates.
Observe that the removal of a detection from a pose removes from the cost the associated terms , for any in the pose, and, if is the only detection, the term. We define such that it is an upper bound on the increase in the cost of a pose given that is removed. To express compactly we introduce the following terms , , . We define as follows.
The expanded MWSP objective and its dual LP relaxation are given below:
Observe that the dual relaxation bounds by from above. These bounds are called DOI.
For cases where a pose is required to include a neck detection we cannot use this bound for neck detections, as the removal of the neck makes the pose invalid. Therefore we ignore the term when computing , and for the neck we set .
To ensure that the DOI are not active at termination of ICG we offset with a small positive constant.
Appendix B: Deriving a Compressed LP for Benders Row Generation
In this section we compress the dual form of the LP in Eq 11 so as to accelerate inference. After compression, we observe that optimization of this LP no longer dominates NBD computation. To achieve this we make the following observations about the primal form of the LP in Eq 11.
Given or : The constraint is inactive so .
Given or : The constraint is inactive so .
Given : The constraint is inactive so .
Observe that the following is true for any pair such that .
Since is non-negative and associated with a non-negative term in the objective we observe the following.
Using these observations we rewrite the optimization.
Now observe that is the only term that is associated with a non-zero objective and that is not bound to zero; thus it has the same value as in the primal LP and therefore .
Observe that all terms not bound to zero for a given co-occur in the objective with value zero and in each constraint over with a common value. Using this we merge terms across as follows, using , which we index by , and helper term .