A parallel interval-based numerical constraint solver implemented with X10
We present a parallel solver for numerical constraint satisfaction problems (NCSPs) that can scale on a number of cores. Our proposed method runs worker solvers on the available cores and simultaneously the workers cooperate for the search space distribution and balancing. In the experiments, we attained up to 119-fold speedup using 256 cores of a parallel computer.READ FULL TEXT VIEW PDF
We present a scalable parallel solver for numerical constraint satisfact...
Upcoming many core processors are expected to employ a distributed memor...
The focus of my PhD thesis is on exploring parallel approaches to effici...
We propose a probabilistic model for the parallel execution of Las Vegas...
In competitive parallel computing, the identical copies of a code in a p...
In this paper, we propose an efficient parallelization strategy for boun...
We present an efficient implementation of the highly robust and scalable...
A parallel interval-based numerical constraint solver implemented with X10
(i.e., interval vector), thebranch and prune algorithm efficiently computes a paving, a set of boxes that encloses the solution set, yet its exponential computational complexity limits the tractable instances. Although the solving process exhibits a parallelism, no parallel NCSP solver has been made available to date because of the difficulty in partitioning the search space equally.
In this research, we parallelize a NCSP solver to scale its solving process on both shared memory and distributed memory parallel computers (Section 3). Our parallel method consists of parallel worker solvers that solve a portion of search space on CPU cores and interact with neighbor workers via message passing for dynamic load balancing. We also propose a preprocess that accelerates the initial search space distribution by sending sets of boxes via static routing between the workers. We have implemented the method by extending the Realpaver solver using the X10 language to realize a process-level parallelization over a number of cores. Section 4 reports experimental results when our method was deployed on two hardware environments.
There have been several work regarding parallel solving of CSPs with either discrete or continuous domains. Parallel solving of generic CSPs on massive computer clusters and supercomputers has been explored in [12, 2, 18]. This work focuses on a massive parallel solver for NCSPs that has a different characteristics compared to generic CSPs. In the survey 
, existing work is classified into (i) search-space splitting methods[16, 12, 2, 5, 15, 18], (ii) cooperative methods for heterogeneous solver agents (cf. portfolios), and (iii) parallelization of constraint propagation. Our method belongs to the first category. A few works have used approaches (ii) and (iii) for parallelization of NCSP solving. However, to the best of our knowledge, a massive parallelization method that uses the typical approach (i) has not yet been proposed.
Substantial work regarding the parallelization of the branch and bound algorithm with search-space splitting exists [9, 13, 14]. This approach has also been applied to CSP solvers [16, 5] and SAT solvers . This work explores an efficient parallel method for solving NCSPs with similar approach to [9, 13].
A numerical constraint satisfaction problem (NCSP) is defined as a triple that consists of a vector of variables , an initial domain in the form of a box ( denotes the set of closed real intervals), and a constraint , where and , i.e., a conjunction of equations and inequalities. A solution of a NCSP is an assignment of its variables that satisfies its constraints. The solution set of a NCSP is the region within its initial domain that satisfies its constraints, i.e., . The target of this paper is under-constrained NCSPs such that . In general, an under-constrained problem may have a continuous set of infinitely many solutions.
Branch and Prune Algorithm.
The branch and prune algorithm  is the standard solving method for NCSPs. It takes a NCSP and a precision as an input and outputs a set of boxes (or paving) that approximates the solution set with precision . Examples of are illustrated in Figure 3.
An intermediate state of the algorithm is represented as a pair of sets of boxes . The solver receives an initial state and iteratively applies the step computation (illustrated in Figure 3) until it reaches a final state . In the step computation, first, it takes the first element of the queue of boxes and applies the procedure, which is a filtering procedure that shaves boundary portions of the considered box. In this work, we use an implementation proposed in  which provides a verification process based on an interval Newton method combined with a relatively simple filtering process based on the Hull consistency. As a result, a box becomes either empty, precise enough (its width is smaller than ), verified as an inner box of the solution set , or undecided. Precise and inner boxes are appended to and undecided boxes are passed to . Second, the procedure bisects the box at the midpoint along a component corresponding to one of the variables and the sub-boxes will be put back in the queue. In this work, we assume selects variables in an order which makes the search to behave in a breadth-first manner and thus the solving process gradually improves the precision of the overall pavings (Figure 3).
The computation of is expensive and is the bottleneck of the solving process. Under certain conditions, application of contracts a large portion of the search space into a tight box (cf. quadratic convergence of the interval Newton methods). can also filter out the whole box if the considered box is verified as an inner or totally inconsistent region. These characteristics of result in the unbalanced nature of search trees. Therefore, a straightforward parallel method does not work efficiently. It is crucial for efficient NCSP solving to execute on each step of traversing the search tree which makes it more difficult to distribute a search path among processors. These properties will be discussed in Section 4.1.
We propose a parallel method that runs several workers on the available CPU cores (we assume a single worker runs on a core and identify both a worker and a core with ). Workers homogeneously interleave the following three procedures and cooperate in a decentralized fashion: (i) breadth-first branch and prune search, (ii) distribution and workload balancing of search space in a sender-initiated manner, and (iii) termination detection. Sub-trees in the search space of the branch and prune becomes unbalanced but can be searched independently: there are no confluence of multiple branches, and we have no shared information between branches. Distribution of the search space among workers is done in preprocess and postprocess as described in the following subsections. The preprocess distributes the portions of the search space to the other workers, and the postprocess balances the load of each worker during search. Termination is detected by circulating a termination token via idling workers based on Dijkstra’s method (see Section 11.4.4 of ).
Preprocess: Search-Space Splitting and Distribution.
A solving process is started by a worker that possesses the initial domain (i.e., a box which contains the whole search space) in the queue . To distribute the subsequent search space (i.e., a queue of boxes) equally to each core, the preprocess invokes a partition of the queue in two (or more) and then sends a portion to another worker. Figure 3 illustrates in a downward direction some initial transitions of the box queue distributed among three workers. In our implementation, the distribution routing is formed as a binary tree whose height is . In each node of the tree, the branch and prune process runs until the number of boxes in the queue reaches . Next, the queue is sorted by the volume of the boxes and the half of the content (i.e., boxes) is sent to the other core via the right branch.
Postprocess: Dynamic Load Balancing.
During search, each worker normalizes the loads within a predefined neighborhood which consists of a small number of neighbor workers. Because there are sufficiently large number of boxes, we simply regard the number of boxes in the queue as the amount of load. Assume workers are running and each worker possesses boxes in its queue. We also assume for each worker that is a set of neighbor workers, is a set of workers where , is a set of loads of the neighbor workers, and is a predefined load margin. The load balancing procedure of a worker performs the following steps once every branch and prune steps.
For each worker in , inform the load and put in the list .
Calculate the mean of the loads in .
If , for each worker in , send at most boxes to (to be efficient, a certain number of boxes should be kept locally).
Neighbor workers can be identified e.g. as adjacent nodes in the -dimensional mesh of workers. The routing between neighbor workers is fixed during a solving process and thus it may happen that a worker possesses an excessive load than others. However, this load imbalance will be resolved by the subsequent load balancing processes.
We have implemented the proposed method and measured the speedup of the solving process of under-constrained NCSPs. The experiments were performed with an exhaustive set of parameter combinations to explore the optimal settings.
We have implemented the proposed method with C++ (gcc ver. 4.4.7 and 4.3.4) and X10 (ver. 2.3.1), a high productivity language for parallel computing. In the following, we use the term place, which is a notion of X10 that in our setting represents a CPU core. Libraries Realpaver (ver. 1.1) for sequential NCSP solving and Gaol (ver. 4.0.1)111http://sourceforge.net/projects/gaol/ for basic interval computation were used to facilitate the implementation. The procedure was realized by calling the sequential implementation in Realpaver. Each run of took around 0.2–1ms and the overall execution became the bottleneck of the branch and prune algorithm (occupied greater than of running time in sequential solving). The procedures for search space distribution and load balancing were implemented with X10. Communication of boxes and loads between places were implemented as async tasks and performed in parallel to the search process so that the overhead will be hidden. In the experiments, timings for sequential runs on single core were measured using the C++ implementation described in , which worked identically and faster than our X10 version.
Two sets of experiments were operated using (1) a shared-memory machine equipped with 40 cores (four of 10-core Intel Xeon E7-4860 2.26GHz) and 256GB of local memory and (2) up to 256 cores of SGI UV1000, a pseudo-shared memory machine equipped with 2,048 cores (8-core Xeon E7-8837 2.67GHz) and 16TB of memory. UV1000 works as a single shared memory machine by emulating memory accesses using communication based on a high speed NUMAlink5 network which has a bandwidth of 120Gbps. We used the MPI backend of X10 with options and .
Experiments on a Shared Memory Machine.
We solved the problems shown in [6, 3] using 40 cores of the machine (1). We report the results for two representative problems. Parameters in the load balancing method were set as either combination of the following values: , and . We also computed with and without the preprocess (when the preprocess is not used, the postprocess is executed from the beginning). For each problem, we solved two instances with two multiplicative precisions. The specification of each instance and the computational results are presented in Table 1. In the table, the columns “problem”, “size”, “”, “pp”, and “” represent the name of the problem, size (i.e., the number of projection/parameter variables), the precision, usage of the preprocess, and the number of neighbor workers, respectively. The rest of the columns represents the results. and represent the running time and the number of branches on single core. and represent the running time and the largest number of branches performed by a worker when computed with the interval and X10 places (best timings are underlined). Figure LABEL:f:su:40 illustrates the speedup of the solving process.
|4D sphere||2+4||0.02||yes||2||254||9.52||7.94||10.1||238 319||37.0|
|3-RPR robot||3+3||0.2||yes||2||1 100||199||73.9||34.1||1 936 939||33.7|
|0.1||yes||2||4 080||1 010||714||282||7 186 845||30.2|
|4||2 820||1 070||257||36.0|
Experiments on a Cluster with High-speed Interconnection.
We solved the problem “” using up to 256 cores of the machine (2), UV1000. Parameters in the load balancing method were set as either combination of the following values: , , and . The results are presented in Table 2. Each column of the table represents the same information as presented in Table 1 except that the column “” represents the number of loads sent by the load balancer in the solving process with and 256 X10 places. Figure 4 illustrates the speedup of the solving process.
|3-RPR robot||3+3||0.2||yes||2||850||37.4||14.0||131||18 648|
|0.1||yes||2||3 040||341||25.6||192||39 608|
In the experiments, our method scaled up to 256 cores with the optimal configurations. We achieved speedups up to 32.3 fold using 40 cores of the shared memory machine and up to 119 fold using 256 cores of the cluster machine.
The best speedup of 119 fold was obtained with the preprocess. The preprocess facilitates and accelerates the workload distribution in the early stage of the search process. In some of the experiments without using the preprocess, the speedup ratio became saturated when using many cores (e.g., sp2-4 with and the experiments on the cluster). This was because the load balancing process was too infrequent for the given number of workers and the work load diffusion became too slow. When comparing the right-hand graphs for the instance , in Figure LABEL:f:su:40, we can notice that the point of saturation shifts according to the search space size. On the other hand, in some other experiments, the results got worse with the preprocess (e.g., results with on the machine (1)). It occasionally happens that the preprocess mostly solves the problem. However, the preprocess can result in highly unbalanced search trees because of the process, and in such cases the postprocess will not have enough time for load balancing.
Regarding the neighborhood sizes , there was a trade off between the workloads balance and the amount of communications required. For the shared memory machine, it was unclear which size had the advantage. However, for UV1000, the solver was notably slower for than . It is understandable because larger number of neighbors significantly increased the number of communications (see “” in Table 2) and communications between places were much more costly compared to normal shared memory machines despite the high speed network of UV1000.
Three intervals were used for load balancing which determined the speed of workload distribution. When the distribution was too slow, the speedup ratio did not scale well (e.g., 3rpr, , with on the machine (1)). Conversely, small intervals required greater amount of communications and therefore we used to draw better performance on the cluster where communications were more costly.
There was a large overhead caused by the workers sending a large number of boxes for load balancing when the number of workers was not sufficient against the problem size. Speedups for 3rpr, , using 40 workers or less shows an example of such overheads (Figure LABEL:f:su:40:3rpr). Resolving this overhead by suppressing redundant box sends is a part of the future work.
In this paper, we proposed a parallel branch and prune algorithm, based on, non-portfolio, search-space splitting approach. In the experiments, using 256 X10 places (i.e., cores), we achieved speedup factors of as much as 119. We expect that our parallelized solver will be applied to large practical problems, e.g., the robotics problems in .
This work was partially funded by JSPS (KAKENHI 25880008 and 25700038). Computing resources for the experiments in this paper were provided by Prof. Kazunori Ueda (Waseda University, Tokyo) and a Compute Canada RAC award, for environments (1) and (2), respectively.
Caro, S., Chablat, D., , , : A branch and prune algorithm for the computation of generalized aspects of parallel robots. Artificial Intelligence 211, 34–50 (2014)
Lüling, R., Monien, B., Reinefeld, A., Tschöke, S.: Mapping Tree-Structured Combinatorial Optimization Problems onto Parallel Computers. In: Proc. of Solving Combinatorial Optimization Problems in Parallel. pp. 115–144. 7141 (1996)