Scalable Parallel Numerical Constraint Solver Using Global Load Balancing

05/18/2015 ∙ by Daisuke Ishii, et al. ∙ Association for Computing Machinery

We present a scalable parallel solver for numerical constraint satisfaction problems (NCSPs). Our parallelization scheme consists of homogeneous worker solvers, each of which runs on an available core and communicates with the others via the global load balancing (GLB) method. The parallel solver is implemented with X10, which provides an implementation of GLB as a library. In experiments, several NCSPs from the literature were solved and attained up to a 516-fold speedup using 600 cores of the TSUBAME2.5 supercomputer.

1 Introduction

Numerical constraint satisfaction problems (NCSPs, Section 3) and their dedicated solvers have been successfully applied to problems in the domain of real numbers [22, 6, 5]. Given a NCSP with a search space represented by a box (i.e., an interval vector), the branch and prune algorithm repeatedly bisects the box or filters out an inconsistent portion of it, until finally obtaining a paving (i.e., a set of boxes that precisely encloses the solution set). However, the exponential computational complexity of NCSPs limits the number of tractable instances; therefore, parallelization of NCSP solvers that can scale on a large number of cores is a promising approach for the further development of numerical constraint programming [7].

Search-space splitting is a simple and efficient approach for parallelization of CSP solvers, yet state-of-the-art implementations are still limited to scaling up to a few hundred cores (Section 2). Recently, Saraswat et al. [20] have proposed a global load balancing framework: a scalable scheme for the global workload distribution and termination detection of irregular parallel computation, which typically applies to the CSP solving process (Section 4.1). That framework is implemented with X10 and is available in the official distribution of X10 as the GLB library [24].

In this work, we propose a parallel NCSP solver that uses the GLB library (Section 4.2). The solver is simply implemented with X10 by adopting the (sequential) constraint propagator of the Realpaver solver as a unit process of GLB. Section 5 reports the experimental results obtained when our method was deployed on 600 cores of the TSUBAME2.5 supercomputer. Optimal configurations of the GLB parameters are analyzed on the basis of those experimental results.

2 Related Work

A survey by Gent et al. [8] describes existing parallel CSP solvers by classifying them into three categories: search-space splitting methods, cooperative methods for heterogeneous workers (e.g., portfolios and parallel local search) [4], and parallelization of the constraint propagation process [9]. Our work focuses on the first approach.

The main difficulty of the search-space splitting approach lies in the balanced distribution of the sub-trees, which keeps worker solvers active. When the search tree becomes highly unbalanced, it becomes difficult to predict an appropriate splitting of the search tree, so a dynamic load balancing scheme becomes necessary. A work stealing scheme is typically used for this purpose: when a worker is starving, it sends a request to other workers, and workloads are communicated in response. Most existing works experiment with a limited number of processors (e.g., 40 [21], 64 [14, 4], or 256 [7, 2]), and the load balancing tends to assume a central master process, which may limit scalability [23, 2].

A substantial amount of work exists regarding the parallelization of the branch and bound algorithm with both search-space splitting and work stealing [11, 15, 18]. Although the branch and bound algorithm resembles the solving process of NCSPs, existing works consider the discrete domain; our work explores an efficient parallel method that handles the continuous domain with interval computation.

There exist parallel CSP/SAT solvers implemented with X10 [3, 2, 7, 17]. None has yet utilized the GLB library in its implementation.

3 Numerical Constraints
Numerical constraint programming is an extension of discrete constraint programming [19] and uses techniques that are inherited from interval analysis [16]. Variable domains are continuous subsets of ℝ: (machine-representable) intervals [l, u] = {x ∈ ℝ | l ≤ x ≤ u}, where l and u are floating-point numbers, and boxes (or vectors of intervals). In the following, boldface characters (e.g., 𝐱) denote intervals and boxes, 𝕀 denotes the set of intervals, and 𝕀ⁿ denotes the set of n-dimensional boxes. Numerical constraint solving resorts to validated interval computation and the branch and prune scheme: the solvers evaluate an interval extension of a real function in a reliable manner, e.g., an interval extension of addition is computed as [l₁, u₁] + [l₂, u₂] = [↓(l₁ + l₂), ↑(u₁ + u₂)], where ↓ and ↑ denote computation under the downward and upward rounding modes, and the branch and prune algorithm enumerates interval assignments based on the dichotomy principle.
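The outward-rounded addition above can be imitated in a short sketch. Since portable rounding-mode control is unavailable in plain Python, this illustrative version (not Realpaver's implementation) widens each bound by one ulp with `math.nextafter`, which still preserves the enclosure property:

```python
import math

def iadd(x, y):
    """Interval addition with outward rounding: the true sum of any
    reals drawn from x and y is contained in the result.
    One-ulp widening via nextafter stands in for directed rounding."""
    (xl, xu), (yl, yu) = x, y
    return (math.nextafter(xl + yl, -math.inf),   # lower bound rounded down
            math.nextafter(xu + yu, math.inf))    # upper bound rounded up

a, b = (0.1, 0.2), (0.3, 0.4)
c = iadd(a, b)
# enclosure check at the endpoints
assert c[0] <= 0.1 + 0.3 and 0.2 + 0.4 <= c[1]
```

True interval libraries switch the FPU rounding mode instead of widening by an ulp, which gives tighter bounds at the same cost.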

A numerical constraint satisfaction problem (NCSP) is defined as a triple ⟨v, v₀, c⟩ that consists of a vector of variables v = (v₁, …, vₙ), an initial domain in the form of a box v₀ ∈ 𝕀ⁿ, and a constraint c(v), i.e., a conjunction of equations and inequalities f(v) = 0 ∧ g(v) ≥ 0, where f : ℝⁿ → ℝᵉ and g : ℝⁿ → ℝⁱ. A solution of a NCSP is an assignment of its variables that satisfies the constraint c. The solution set Σ is the region {v ∈ v₀ | c(v)}. A NCSP is well-constrained when e = n, under-constrained when e < n, and over-constrained when e > n. In general, a well-constrained NCSP has a discrete solution set and an under-constrained NCSP has a continuous solution set.

We can model the intersection of two disks in the plane as an under-constrained NCSP, where v = (x, y), v₀ is a box containing both disks, and c is the conjunction of the two disk inequalities (each of the form (x − aᵢ)² + (y − bᵢ)² ≤ rᵢ²). The solution set projected onto the x–y plane is depicted in Figure 2.

3.1 Branch and Prune Algorithm

The branch and prune algorithm [22] is the standard solving method for NCSPs. It takes a NCSP and a precision ε as input and outputs a set of boxes (or paving) that approximates the solution set with precision ε.

Figure 1 presents the algorithm. In the main loop at Lines 2–11, the algorithm first takes the first element of the queue L of boxes and applies the Prune procedure, which shaves boundary portions of the considered box (Line 3). In this work, we use a basic consistency-based implementation [1] for well-constrained problems; for under-constrained problems, we use an implementation proposed in [6] that provides a verification process based on an interval Newton method combined with Prune. As a result of Prune, a box becomes either empty, precise enough (its width is smaller than ε), verified as an inner box of the solution set, or undecided. Precise and inner boxes are appended to the output S (Line 6) and undecided boxes are passed to Branch. Next, the Branch procedure bisects the box at the midpoint along a component corresponding to a selected variable, and the sub-boxes are put back into the queue (Line 8). In this work, we assume that Branch selects variables in an order that makes the search behave in a breadth-first manner.

0:  NCSP ⟨v, v₀, c⟩, precision ε
0:  set S of boxes
1:  L ← {v₀}; S ← ∅
2:  while L ≠ ∅ do
3:     b ← Prune(dequeue(L))
4:     if b ≠ ∅ then
5:        if width(b) ≤ ε or b is an inner box then
6:           S ← S ∪ {b}
7:        else
8:           L ← L ∪ Branch(b)
9:        end if
10:     end if
11:  end while
12:  return S
Figure 1: Branch and prune algorithm
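A minimal executable sketch of Figure 1 (in Python rather than the X10/C++ of the actual solver) may help fix the ideas. The Prune step here is a plain interval test for the toy constraint x² = 2 on a non-negative box, with none of the consistency techniques of [1]:

```python
import collections
import math

def branch_and_prune(prune, x0, eps):
    """Figure 1: enumerate an eps-paving of the solutions inside box x0.
    Boxes are tuples of (lo, hi) pairs; prune returns a contracted box
    or None when the box provably contains no solution."""
    queue = collections.deque([x0])   # L holds the unexplored boxes
    paving = []                       # S, the output set of boxes
    while queue:
        box = prune(queue.popleft())  # Line 3: contract or discard
        if box is None:
            continue
        k = max(range(len(box)), key=lambda i: box[i][1] - box[i][0])
        if box[k][1] - box[k][0] <= eps:
            paving.append(box)        # Line 6: precise enough
        else:                         # Line 8: bisect the widest component
            lo, hi = box[k]
            mid = (lo + hi) / 2
            queue.append(box[:k] + ((lo, mid),) + box[k + 1:])
            queue.append(box[:k] + ((mid, hi),) + box[k + 1:])
    return paving

# Toy well-constrained NCSP: x^2 = 2 on the box [0, 2].  A box is kept
# only if the interval evaluation of x^2 - 2 contains zero
# (valid since all bounds are non-negative here).
def prune(box):
    (lo, hi), = box
    return box if lo * lo - 2 <= 0 <= hi * hi - 2 else None

paving = branch_and_prune(prune, ((0.0, 2.0),), 1e-6)
# every surviving box encloses sqrt(2)
assert all(lo <= math.sqrt(2) <= hi for ((lo, hi),) in paving)
```

The breadth-first queue mirrors the assumption on Branch above; a real solver would replace `prune` with a contracting operator such as hull consistency.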

Realpaver [12] has been developed as a (sequential) implementation of a NCSP solver.

A set of boxes enclosing the solution set of the NCSP from Example 3, which is computed with a small precision ε, is shown in Figure 2.

Figure 2: Paving of a solution set

There are some characteristics that make the parallel solving of NCSPs different from that of other CSPs. First, computing Prune is expensive and causes a bottleneck during the solving process; in our experiments, a Prune call takes around 1 ms on average. Second, Prune applications result in unbalanced search trees. Under certain conditions, Prune contracts a large portion of a considered box (cf. the quadratic convergence of interval Newton methods) and may even filter out the entire box if the box in question is verified as an inner or totally inconsistent region. Thus, it is crucial for efficient NCSP solving to execute Prune at each step while traversing the search tree, which, on the other hand, makes it more difficult to distribute a search path among processors. Third, the number of solutions is large when a problem is under-constrained and a small ε is given; this causes the search tree to spread toward the bottom. Last, the search processes for different branches do not require communication; thus, it is safe to run the search processes on different cores in parallel, without any modification.

4 Parallelization Using Global Load Balancing

We parallelize the branch and prune algorithm by splitting and distributing the search space (represented as queue content) among workers that run the branch and prune processes homogeneously. A balanced distribution is not straightforward; a naive method is to create a frontier with a sufficient number of nodes in the search tree and distribute them evenly among the workers. However, a breadth-first computation of such a frontier is not efficient because of the time-consuming Prune process.

The proposed method is implemented simply with X10 and global load balancing (GLB), an efficient scheme for the load balancing of irregular tasks.

4.1 X10 GLB Library

GLB is a global load balancing library [24] in the X10 standard library that implements the lifeline graph work-stealing algorithm [20]. GLB is suitable for parallelizing irregular tasks, where the workload for each subtask is not predictable, such as search algorithms for AI applications.

GLB computation is performed by multiple cooperative workers. Each worker runs on an X10 place and homogeneously processes a divided workload. The load balancing between workers is done in two phases: first, work stealing via requests sent from one worker to randomly selected other workers; then, work stealing and termination detection via a hypercube-like network of workers called the lifeline graph. GLB is simply implemented with X10 with configurable parameters and scales up to 16K places when applied to several benchmarks. For each GLB application, a sequential computation that processes a workload is implemented as an X10 class, and an instance is given to a worker as input. There are four parameters: t specifies the lower bound on the time (in seconds) taken by a unit of sequential process;¹ w specifies the number of attempted workload steals in the first phase; and z and l specify the diameter of the lifeline graph and the number of branches of each node, respectively. Setting w = 0 turns off the random stealing process. We assume l^z is greater than or equal to the number of workers. A tight lifeline graph is built by setting z = 2; broadcasting is then done in two hops.

¹We modified GLB to use the parameter t instead of n, which specified the number of processed unit tasks.
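As an illustration of how such a lifeline graph can be built, the sketch below treats worker ids as z-digit base-l numbers and links each worker to the ids obtained by incrementing one digit modulo l. This is modeled on the cyclic-hypercube graphs of Saraswat et al. [20]; the exact correspondence with GLB's z and l parameters is an assumption, not taken from the GLB source:

```python
def lifelines(p, num_workers, l, z):
    """Lifeline buddies of worker p: view p as a z-digit base-l number
    and increment each digit (mod l) in turn, keeping only valid ids."""
    digits, q = [], p
    for _ in range(z):
        digits.append(q % l)
        q //= l
    buddies = []
    for i in range(z):
        d = digits[:]
        d[i] = (d[i] + 1) % l
        buddy = sum(dd * l ** k for k, dd in enumerate(d))
        if buddy != p and buddy < num_workers:
            buddies.append(buddy)
    return buddies

# With 100 workers, l = 10, z = 2: worker 0 is linked to workers 1 and 10.
assert lifelines(0, 100, 10, 2) == [1, 10]
```

Each worker thus has at most z lifeline buddies, and the graph stays connected even when the worker count is not an exact power of l.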

0:  environment e, instance g
0:  task result
1:  initialize g with the assigned workload (empty for most workers)
2:  repeat
3:     while ¬g.empty do   {active phase}
4:        g.process(t); answer steal requests with portions of g.split()
5:     end while   {idle phase 1}
6:     for (i ← 1; i ≤ w; i++) do
7:        request a steal from a randomly selected worker
8:     end for   {idle phase 2}
9:     for (each lifeline buddy b) do
10:        if a steal request to b succeeds then
11:           g.merge(loot)
12:           break
13:        end if
14:     end for
15:  until g.empty   {no workload was obtained}
16:  return g.result()
Figure 3: Worker process

A pseudo algorithm in Figure 3 mimics the worker process. When the instance g contains a workload, a worker becomes active and iterates the loop at Lines 3–5. A call to the process method of g invokes a unit sequential computation that should take at least t seconds. The split method sends portions of the available workload (i.e., loot) to other idling workers; otherwise, it signals that the worker has no workload available. When all the workload is processed and the empty property becomes true, the worker enters the two-stage idle phase: at Lines 6–8, the worker randomly selects another worker, sends a request, and waits for the victim to respond; at Lines 9–14, the worker sends requests following the lifeline graph. When a steal succeeds, the loot is merged by executing the merge method. When g is still empty after both phases, the worker process terminates and outputs the task result.
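The control flow of Figure 3 can be mimicked in a toy single-process form, where a shared dictionary stands in for message passing between places and the two steal phases are kept. This is an illustrative reduction, not the GLB implementation:

```python
import random

def run_worker(tasks, peers, w, lifeline_buddies, unit):
    """Simplified worker of Figure 3: drain local tasks (active phase),
    then try up to w random steals, then lifeline steals; terminate
    when both phases yield nothing."""
    results = []
    while True:
        while tasks:                           # Lines 3-5: active phase
            results.append(unit(tasks.pop()))
        for _ in range(w):                     # Lines 6-8: random steals
            victim = random.choice(list(peers))
            if peers[victim]:
                tasks.append(peers[victim].pop())
                break
        if tasks:
            continue
        for b in lifeline_buddies:             # Lines 9-14: lifeline steals
            if peers[b]:
                tasks.append(peers[b].pop())
                break
        if not tasks:                          # Line 15: nothing stolen
            return results

peers = {1: [3, 4]}                            # one busy neighbor
out = run_worker([1, 2], peers, w=2, lifeline_buddies=[1], unit=lambda x: x * x)
assert sorted(out) == [1, 4, 9, 16]
```

In the real library the steal requests are asynchronous messages and termination is detected cooperatively over the lifeline graph; here both are collapsed into direct list accesses.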

4.2 Implementation of NCSP Solver with GLB

We implement an X10 task class that encapsulates the computation of the branch and prune algorithm. It holds the queue L of undecided boxes and the solution set S. Initially, one worker possesses the initial domain in its queue L, and the queues of the other workers are left empty. The methods of the class are implemented as follows:

  • process computes the main loop of the branch and prune algorithm until the time t elapses. For Prune, the C++ implementation of Realpaver is used.

  • split divides the queue L equally into two and returns one portion.

  • merge is given a set of boxes and appends the boxes to L.

  • result returns S. Our implementation does not gather S in one place, thus avoiding unnecessary overhead. Indeed, S might be better left distributed for the post-processing of many applications.
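Under the stated interface, a Python mock of the task class might look as follows. A step-count budget stands in for the time bound t, and a toy step function replaces Realpaver's Prune; names and details are illustrative, not the actual X10 code:

```python
from collections import deque

class BPTask:
    """GLB-style task around the branch-and-prune queue."""

    def __init__(self, boxes=()):
        self.queue = deque(boxes)   # L: undecided boxes
        self.solutions = []         # this worker's share of S

    def process(self, step, budget):
        # run up to `budget` branch-and-prune steps;
        # step(box) -> (solved_boxes, undecided_boxes)
        for _ in range(budget):
            if not self.queue:
                break
            solved, undecided = step(self.queue.popleft())
            self.solutions.extend(solved)
            self.queue.extend(undecided)
        return bool(self.queue)     # True while workload remains

    def split(self):
        # hand every other box to the thief: roughly half the workload
        loot = [b for i, b in enumerate(self.queue) if i % 2 == 0]
        self.queue = deque(b for i, b in enumerate(self.queue) if i % 2)
        return loot

    def merge(self, loot):
        self.queue.extend(loot)

    def result(self):
        return self.solutions       # kept local; no global gather

# bisect integer ranges until unit width, as a stand-in for Prune/Branch
def step(box):
    lo, hi = box
    if hi - lo <= 1:
        return [box], []
    mid = (lo + hi) // 2
    return [], [(lo, mid), (mid, hi)]

task = BPTask([(0, 4)])
while task.process(step, 1):
    pass
assert sorted(task.result()) == [(0, 1), (1, 2), (2, 3), (3, 4)]
```

Interleaving every-other elements in split keeps both halves spread across tree depths, which tends to balance the remaining work better than cutting the queue at its midpoint.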

The implementation consists of about 1,000 lines of X10 and 2,400 lines of C++ code. In NCSP applications, the solving process must be tweaked by trying several combinations of Prune implementations and search strategies. In this respect, our simple X10 framework that is interfaced with C++ solver implementations will serve as a practical tool.

5 Performance Evaluation

We evaluated our parallel NCSP solver with several problems from the existing literature. The experiments were run on the TSUBAME2.5 supercomputer at the Tokyo Institute of Technology.² We used 50 nodes; thus, each experiment was run with up to 600 X10 places on 600 cores. We used Native X10 and its MPI backend (based on Open MPI 1.6.5).

²Each node of TSUBAME2.5 has two Intel Westmere EP 2.93 GHz processors (12 cores in total) and 54 GB of local memory.

5.1 Experimental Results

We solved four instances of two well-constrained (WC) problems taken from [13, 10] and six instances of two under-constrained (UC) problems taken from [6, 5]. For each of the first three problems, we prepared two instances by varying the number of variables and constraints. The UC problems were solved with two different precisions ε. Every instance was solved with the following seven GLB parameter configurations, each fixing the values of t, w, z, and l:

  1. t = … s, w = …, z = …, and l = ….

  2. t = … s, w = …, z = …, and l = ….

  3. t = … s, w = …, z = …, and l = ….

  4. t = … s, w = …, z = …, and l = ….

  5. t = … s, w = …, z = …, and l = ….

  6. t = … s, w = …, z = …, and l = ….

  7. t = … s, w = …, z = …, and l = ….

The parameter l was set as the smallest integer such that l^z was greater than or equal to the number of workers.

problem                        size  # sol       # br        cfg  T_1      T_300   T_600   ar    # sb
Economics (eco8)               8     8           63 478      (1)  58s      0.40s   0.41s   64%   47 000
                                                             (2)           0.78s   0.77s   25%   27 500
Economics (eco10)              10    16          3 614 945   (1)  5 970s   22.0s   11.8s   88%   2 550 000
                                                             (2)           21.4s   11.5s   93%   1 150 000
Periodic orbits (henon24)      48    2 939       28 742      (1)  1 330s   8.5s    5.0s    58%   34 800
                                                             (2)           6.8s    4.4s    63%   19 000
Periodic orbits (henon28)      56    16 105      174 446     (1)  12 530s  60.2s   31.2s   65%   201 000
                                                             (2)           45.7s   25.1s   87%   81 000
2D sphere and plane (sp2-2)    2+2   312 064     364 961     (1)  122s     0.6s    0.5s    75%   295 000
                                                             (2)           0.8s    0.9s    39%   153 000
   (sp2-2, smaller ε)          2+2   2 490 988   2 936 705   (1)  780s     3.8s    2.2s    87%   2 300 000
                                                             (2)           3.9s    2.6s    78%   955 000
4D sphere and plane (sp2-4)    2+4   1 459 225   2 488 689   (1)  1 202s   7.0s    3.8s    85%   1 790 000
                                                             (2)           6.6s    4.1s    85%   662 000
   (sp2-4, smaller ε)          2+4   11 759 158  20 082 197  (1)  12 800s  52s     27s     92%   12 850 000
                                                             (2)           50s     26s     97%   4 600 000
3-RPR robot (3rpr)             3+3   1 488 388   1 936 939   (1)  598s     2.8s    1.7s    80%   1 550 000
                                                             (2)           2.9s    2.1s    71%   675 000
   (3rpr, smaller ε)           3+3   5 649 780   7 186 845   (1)  2 135s   9.0s    5.0s    86%   2 300 000
                                                             (2)           8.7s    5.2s    88%   2 300 000
Table 1: Considered problems and experimental results

Specifications of the instances and experimental results using configurations (1) and (2), which performed most efficiently, are shown in Table 1. The columns "problem", "size", "# sol", and "# br" give the name of the problem, its size (i.e., the number of variables; for UC problems, the numbers of variables and equality constraints, separated by '+'), the number of solutions, and the number of branches, respectively. The rest of the columns provide experimental results: T_p is the running time using p X10 places (best timings were underlined in the original table); ar is the mean ratio of active time to total solving time over the places when computed with 600 places; and # sb is the total number of boxes sent between the 600 places for load balancing. Figure 4 shows the number of paths per depth in each search tree of four instances. Figure 5 illustrates the speedups of the parallel solving processes for seven instances. Figure 6 illustrates the fraction of the three worker states within the CPU time for two instances: each layer, from bottom to top, corresponds to the time taken by Prune, by split, and by the idle phase (Lines 6–14 in Figure 3), respectively.³

³The timings for Prune differed with the number of places. As a reason, we predicted and confirmed that the CPU cache hit ratio differed between the parallel processes.

(a) eco8 (left); eco10 (right).
(b) sp2-2 (left); sp2-4 (right).
Figure 4: Number of search paths (vertical axis) per depth (horizontal axis)

5.2 Discussion

(a) eco8 (left), eco10 (right)
(b) sp2-2 (left); sp2-4 (middle); sp2-4 (right)
(c) 3rpr (left and right, at two precisions)
Figure 5: Speedup of the solving process
(a) eco8 (left), eco10 (right)
(b) 3rpr (left and right, at two precisions)
Figure 6: Breakdown of CPU time with timings for pruning, load sending, and idling

In the experiments, our parallel solver scaled up to 600 places/cores and achieved up to 516-fold speedup (an efficiency of 0.84).

As can be seen from Figure 4, each of the search trees for the considered instances has a number of paths whose lengths are close to the height of the tree; thus, a certain level of parallelism exists. Furthermore, comparing the graphs of different instances of the same problem, we see that their shapes are similar but the tree size of the larger instance is exponentially greater, indicating that parallelization should be easier for larger instances.

For most instances (excluding the large ones), the best speedups were accomplished with configuration (1): the parallel process with the most frequent load balancing, which used the lifeline graph with the broadest distribution and did not perform random stealing. The efficiency of the load balancing is evident in the active ratio of the workers (see ar in Table 1 and the left-hand graphs of Figure 6). Despite its large communication overhead (see the last column of Table 1), load balancing over the lifelines resulted in quick workload distribution and termination.

For large instances, such as eco10 (Figure 5(a), right), henon28, and sp2-4 (Figure 5(b), right), configurations (2), (3), and (5), which performed frequent random stealing, outperformed configuration (1). Their performance also appears in the active ratio (Figure 6(a), right). Among these configurations, configuration (2) performed the best, probably because of its quick termination process.

Regarding the time interval of the load balancing, the most frequent setting (the smallest t) performed well. With this setting, the workload was distributed between almost every pair of calls to Prune, which takes around 1 ms on average and, in total, occupies around 90% of the running time.

Configuration (4) always performed poorly. Its load balancing, which used a lifeline graph formed as a one-dimensional hypercube (i.e., a ring), was slow, which led to its poor performance. Its performance improved dramatically when the random stealing process was enabled (configuration (5)).

In some experiments, such as sp2-4 (Figure 5(b), middle), 3rpr (Figure 5(c), right), and 3rrr, configuration (1) scaled well and outperformed the other configurations when using 600 cores. It would appear that the random stealing process, when using many cores, suffered from a large communication overhead; such overhead can be confirmed for configuration (2) in the right-hand side of Figure 6(b).

Finally, the running times of some experiments were quite short. For example, eco8 and sp2-2 took 58 s and 122 s, respectively, with a single core, and yet certain speedups were achieved: 141- and 252-fold with 600 cores.
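The speedup figures quoted here follow directly from the timings in Table 1; as a quick check (using Python only for the arithmetic):

```python
def speedup(t1, tp):
    """Ratio of sequential time to parallel time."""
    return t1 / tp

def efficiency(t1, tp, p):
    """Speedup normalized by the number of cores p."""
    return speedup(t1, tp) / p

# eco8: 58 s on one core, 0.41 s on 600 cores
assert round(speedup(58, 0.41)) == 141
# eco10: 5 970 s on one core, 11.5 s on 600 cores,
# consistent with the best-case ~516-fold speedup reported
assert round(speedup(5970, 11.5)) == 519
```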

6 Conclusions

In this work, we showed that the parallelization of the branch and prune search is a good application of the X10 GLB framework. In the experiments, we achieved nearly linear speedups with up to 600 X10 places/cores, and the solver is expected to scale further. In future work, we plan further experiments on realistic problems, including optimization problems, using a greater number of cores.


This work was partially funded by JSPS (KAKENHI
25880008, 15K15968, 25700038, 26280024, and 23240005) and JST ERATO Project.


  • [1] F. Benhamou, L. Granvilliers, F. Goualard, and J.-F. Puget. Revising Hull and Box Consistency. In ICLP, pages 230–244, 1999.
  • [2] D. Bergman, A. A. Cire, A. Sabharwal, H. Samulowitz, V. Saraswat, and W.-J. van Hoeve. Parallel Combinatorial Optimization with Decision Diagrams. In CPAIOR, 8451, pages 351–367, 2014.
  • [3] B. Bloom, D. Grove, B. Herta, A. Sabharwal, H. Samulowitz, and V. Saraswat. SatX10: A Scalable Plug & Play Parallel SAT Framework. In SAT, 7317, pages 463–468, 2012.
  • [4] L. Bordeaux, Y. Hamadi, and H. Samulowitz. Experiments with Massively Parallel Constraint Solving. In IJCAI, pages 443–448, 2009.
  • [5] S. Caro, D. Chablat, A. Goldsztejn, D. Ishii, and C. Jermann. A branch and prune algorithm for the computation of generalized aspects of parallel robots. Artificial Intelligence, 211:34–50, 2014.
  • [6] D. Ishii, A. Goldsztejn, and C. Jermann. Interval-based projection method for under-constrained numerical systems. Constraints Journal, 17(4):432–460, 2012.
  • [7] D. Ishii, K. Yoshizoe, and T. Suzumura. Scalable Parallel Numerical CSP Solver. In CP, 8656, pages 398–406, 2014.
  • [8] I. P. Gent, C. Jefferson, I. Miguel, N. C. A. Moore, P. Nightingale, P. Prosser, and C. Unsworth. A Preliminary Review of Literature on Parallel Constraint Solving. In Workshop on Parallel Methods for Constraint Solving, 2011.
  • [9] A. Goldsztejn and F. Goualard. Box consistency through adaptive shaving. SAC, pages 2049–2054, 2010.
  • [10] A. Goldsztejn, L. Granvilliers, and C. Jermann. Constraint Based Computation of Periodic Orbits of Chaotic Dynamical Systems. In CP, pages 774–789, 2013.
  • [11] A. Grama, A. Gupta, G. Karypis, and V. Kumar. Introduction to Parallel Computing. Addison Wesley, 2003.
  • [12] L. Granvilliers and F. Benhamou. Algorithm 852: RealPaver: An Interval Solver using Constraint Satisfaction Techniques. ACM Transactions on Mathematical Software, 32(1):138–156, 2006.
  • [13] P. V. Hentenryck, L. Michel, and F. Benhamou. Newton: Constraint Programming over Nonlinear Constraints. Science of Computer Programming, 30(1-2):83–118, 1998.
  • [14] J. Jaffar, A. Santosa, R. Yap, and K. Zhu. Scalable distributed depth-first search with greedy work stealing. In ICTAI, pages 98–103. IEEE, 2004.
  • [15] R. Lüling, B. Monien, A. Reinefeld, and S. Tschöke. Mapping Tree-Structured Combinatorial Optimization Problems onto Parallel Computers. In Solving Combinatorial Optimization Problems in Parallel, volume 7141, pages 115–144, 1996.
  • [16] R. E. Moore. Interval Analysis. Prentice-Hall, 1966.
  • [17] D. Munera, D. Diaz, S. Abreu, and P. Codognet. A Parametric Framework for Cooperative Parallel Local Search. In 14th European Conference on Evolutionary Computation in Combinatorial Optimisation (EvoCOP), 8600, pages 13–24, 2014.
  • [18] L. Otten and R. Dechter. Towards Parallel Search for Optimization in Graphical Models. In ISAIM, 2010.
  • [19] F. Rossi, P. V. Beek, and T. Walsh. Handbook of Constraint Programming, volume 2 of Foundations of Artificial Intelligence. Elsevier, 2006.
  • [20] V. Saraswat, P. Kambadur, S. Kodali, D. Grove, and S. Krishnamoorthy. Lifeline-based global load balancing. In PPoPP, pages 201–212, 2011.
  • [21] C. Schulte. Parallel search made simple. In TRICS (Techniques foR Implementing Constraint programming Systems), pages 41–57, 2000.
  • [22] P. Van Hentenryck, D. McAllester, and D. Kapur. Solving Polynomial Systems Using a Branch and Prune Approach. SIAM Journal on Numerical Analysis, 34(2):797–827, 1997.
  • [23] F. Xie and A. Davenport. Massively Parallel Constraint Programming for Supercomputers : Challenges and Initial Results. In CPAIOR, 6140, pages 334–338, 2010.
  • [24] W. Zhang, O. Tardieu, D. Grove, B. Herta, T. Kamada, and V. Saraswat. GLB : Lifeline-based Global Load Balancing Library in X10. In PPAA, pages 31–40, 2014.