Towards Faster Reasoners By Using Transparent Huge Pages

04/29/2020, by Johannes K. Fichte et al.

Various state-of-the-art automated reasoning (AR) tools are widely used as backend tools in research on knowledge representation and reasoning as well as in industrial applications. In testing and verification, these tools often run continuously or nightly. In this work, we present an approach to reduce the runtime of AR tools by about 10%. The improvement addresses the high memory usage that comes with the data structures used in AR tools, which are based on conflict driven no-good learning. We establish a general way to enable faster memory access by using the memory cache line of modern hardware more effectively. To this end, we extend the standard C library (glibc) so that a memory management feature called huge pages can be enabled dynamically. Huge pages reduce the overhead required to translate memory addresses between the virtual memory of the operating system and the physical memory of the hardware. In that way, we can reduce the runtime, cost, and energy consumption of AR tools and of applications with similar memory access patterns, simply by linking the tool against this new glibc library when compiling it. In everyday industrial applications this easily allows for more eco-friendly computation. To back up the claimed speed-up, we present experimental results for tools that are commonly used in the AR community, including the domains ASP, BMC, MaxSAT, SAT, and SMT.

1 Introduction

Very recently, Vardi [41] directed attention to the fact that traveling to numerous conferences unsurprisingly results in a bad carbon-dioxide footprint. While in this case the impact on the environment is immediate and obvious, there are many hidden factors that affect the environment. We often neglect the factors relating to computation for combinatorial problems, as improvements there seem to yield only small gains. This applies in particular to state-of-the-art solvers in combinatorial problem solving and automated reasoning (AR), such as ASP, #SAT, MaxSAT, MUS, SAT, and SMT solvers. The underlying algorithms of these reasoners are often based on a technique called conflict driven no-good learning (CDNL), for which it is well known that efficiency highly depends on memory consumption [12]. A common explanation is that advanced data structures benefit from improved memory access. Examples are data structures for learnt clauses, two watched literals, and linear lookup tables. The impact is particularly large when the implementations respect hardware memory caches.

On modern systems, accessible memory is virtual and handled by a hardware memory management unit (MMU). The mapping between virtual and physical memory is stored in page tables, which the MMU walks. To reduce access time, recently used mappings are stored in a translation lookaside buffer (TLB). Since combinatorial reasoners often consume a significant amount of memory, avoiding cache misses and page translation failures can considerably improve the performance of a SAT solver [19]. Because CDNL-based solvers often form the core of other reasoning tools, improving the memory behavior of modern solvers can considerably speed up their execution. Even if the improvement involves only small factors, the practical impact can be huge. Consider an industrial test suite that runs for 5 hours each night. If we are able to reduce that runtime by just 10%, we save 30 minutes of computation time per night. Over a month of such nightly runs this sums up to roughly 10 hours of saved computation and energy consumption. In practice, this efficiency improvement saves money and allows more eco-friendly computation.

New Contribution

In this paper, we introduce a simple and transparent approach to effectively reduce the number of TLB misses and thereby speed up modern memory-dependent solvers, in particular their unit propagation (sometimes also called Boolean constraint propagation). We employ a Linux memory management feature called transparent huge pages (THP), which reduces the overhead of virtual memory translation by using larger virtual memory page sizes [40]. Our approach is based on modifying the standard C library (glibc), which is the default standard library on Linux systems [3]. Whenever a solver allocates memory, we make sure that we additionally give the operating system kernel advice about the use of that memory (madvise). A solver can then use this feature simply by being recompiled and statically linked against our modified glibc. In that way, we obtain a significant speed-up of up to 15% on model checking benchmarks and of up to 10% for most other solvers. The approach relies only on a hardware feature and thus generalizes to other operating systems and CPU architectures that support large page sizes.

Our advances summarize as follows:

  1. We propose an easily accessible way to reduce the number of TLB misses in combinatorial memory-dependent solvers by patching the glibc such that our modifications can be activated or deactivated at runtime.

  2. We provide a build system to easily patch glibc and statically link a solver against the patched version. Our system is based on a setup that uses OS-level virtualization (docker) [21] and is available to all modern Linux systems. We already provide various pre-compiled state-of-the-art reasoning tools.

  3. We carry out extensive benchmarks and present detailed results for various reasoning tools.

Related Work

Chu, Harwood, and Stuckey [14] as well as Hölldobler, Manthey, and Saptawijaya [19] considered cache utilization in SAT solvers and illustrated how a resource-unaware SAT solver can be improved by utilizing the cache sensibly, resulting in reasonable speed-ups. The effect of huge pages has already been widely investigated in the field of operating systems, e.g., [37], mostly with a strong focus on database systems. However, to the best of our knowledge there is no overall study on combinatorial problem solving and automated reasoning. Recent research considered benchmarking system tools [28], selecting benchmarks to tune solvers [20], and preparing input benchmarks for benchmarking [11]. These topics are orthogonal to our work. In contrast, we consider computational resources and memory management of solvers, in particular, their effect on the runtime. Bornebusch, Wille, and Drechsler [13] analyzed the memory footprint of SAT solvers and tried to reduce it. However, they did not consider propagation and its data structures, which is reasonable from a complexity point of view due to large formula sizes.

2 Modern CDNL-based Solvers and Memory

Before we present our advances, we give a brief explanation on how modern SAT solvers are implemented and introduce components and mechanisms that are relevant for memory access. As many reasoners are based on SAT technology, e.g., [12, 25, 42], the core concepts are very similar for various reasoners. First, we define (propositional) formulas and their evaluation in the usual way and assume familiarity with standard notations, including satisfiability. For basic literature, we refer to introductory work [26]. We consider a universe U of propositional variables. A literal is a variable or its negation, and a clause is a finite set of literals. A (CNF) formula is a finite set of clauses. A (partial) truth assignment is a mapping τ: V → {0,1} defined for a set V ⊆ U of variables. For a negated variable ¬x, we put τ(¬x) = 1 − τ(x). For a formula F, we abbreviate by vars(F) the variables that occur in F. We say that a truth assignment τ satisfies a clause C if for at least one literal l ∈ C we have τ(l) = 1. We say that τ falsifies a clause C if it assigns all its literals to 0. We call a clause C unit if τ assigns all but one of its literals to 0. A truth assignment τ is satisfying if it satisfies each clause C ∈ F.

2.1 Why Are SAT Solvers Fast?

So far, there are two main contributing factors to advances in the efficiency of modern SAT solvers: (i) theoretical improvements in terms of more advanced algorithms and heuristics and (ii) algorithm engineering in terms of data structures. The core algorithm that drives search in modern SAT solvers is conflict driven no-good learning (CDNL), also known as conflict driven clause learning (CDCL) [32, 17], which has been widely extended by search heuristics [12, 24] and simplification techniques during search [23]. A key technique is unit propagation, which aims at finding clauses in which all literals but one are already assigned and then setting the remaining literal to a value that satisfies the clause. Unit propagation is responsible for the vast majority of the overall runtime even in modern solvers [24]. Hence, algorithm engineering and efficient data structures are essential for practical solving, e.g., the two-watched-literal scheme for unit propagation [35] and fast lookup tables, which are also important for the heuristics and learning techniques used. The watched-literal scheme reduces the number of algorithm steps and memory accesses, but decreases the efficiency of each memory access [19]. Still, this results in a considerable overall runtime improvement [24]. While lookup tables provide fast access to relevant clauses, they result in a much higher memory footprint and may yield unpredictable memory accesses [19].

2.2 CPUs, Virtual Memory, and Paging

Modern operating systems (OSes) provide the concept of virtual memory to applications. Thereby, the OS releases software developers from worrying about the actual physical memory layout and also allows for overcommitting resources. Virtual memory is managed at the granularity of pages. A page is a contiguous block of memory in the virtual address space. The OS can map a page to a page frame, which is a corresponding location in physical memory. On the Intel Architecture [22], page tables describe the mapping from pages to page frames and thus from virtual to physical addresses. On 64-bit Intel systems, these page tables are trees with a depth of commonly up to four levels for 2^48-byte address spaces.
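
For concreteness, the standard decoding of a 48-bit virtual address on this architecture (a textbook breakdown, not taken from this work) uses one 9-bit index per table level plus a page offset:

  48-bit virtual address, 4 KiB pages:  L4[47:39] | L3[38:30] | L2[29:21] | L1[20:12] | offset[11:0]
  48-bit virtual address, 2 MiB pages:  L4[47:39] | L3[38:30] | L2[29:21] | offset[20:0]

With 2 MiB pages the last level of the walk disappears and the offset grows from 12 to 21 bits, so a single translation covers 512 times as much memory.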

Walking these data structures to provide a translation for each memory access is infeasible, because it would add one page table read per level for each intended memory access. Instead, processors take advantage of spatial and temporal locality of memory accesses and cache translations in Translation Lookaside Buffers (TLBs). An Intel Skylake system has two levels of TLBs, and the unified L2 TLB can hold 1536 entries [43]. Other recent CPUs have similar specifications. With 4 KiB pages, this translates to holding translations for 6 MiB of virtual memory in the TLB. A straightforward way to increase the capacity of the TLB is for the processor architecture to allow for larger page sizes. On Intel 64-bit systems, in addition to 4 KiB pages, the system also supports 2 MiB and 1 GiB pages. While 1 GiB pages have (few) dedicated TLB entries, 2 MiB pages share the same TLB entries as 4 KiB pages on Skylake.
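
As a back-of-the-envelope calculation (optimistically assuming that all 1536 shared L2 TLB entries hold translations of the same page size), the memory covered by the TLB, its so-called reach, is the number of entries times the page size:

  1536 entries × 4 KiB = 6 MiB        (base pages)
  1536 entries × 2 MiB = 3 GiB        (transparent huge pages)

Hence, a solver whose clause database occupies a few hundred MiB far exceeds the reach with base pages but stays well within it with huge pages.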

Figure 1: This figure illustrates how pages cover a sequence of memory accesses for two different sizes of pages for a given amount of memory.

To benefit from large pages, the OS needs to make them accessible to applications [2]. The main problem is constantly defragmenting memory to have contiguous free memory from which large pages can be allocated [36]. Figure 1 illustrates the usage of memory with pages of different sizes. When using larger pages, fewer pages are required to cover the same area of memory, and hence, fewer TLB entries are occupied. In more detail, the black boxes in Figure 1 illustrate a sequence of accesses, which start at the top and flow to the bottom. While for larger pages (right) it is sufficient to memorize the translations for three pages, smaller pages (left) require seven. In case the TLB can only hold four entries, the entry of Page 0 would be evicted before it can be re-used to access the same clause again. When using larger pages, fewer initial translations have to be done, and only three pages are required to perform all accesses.

2.3 Large Pages in Linux

Large page sizes are supported in Linux by a feature called transparent huge pages (THP), which offers both implicit and explicit use of large page sizes and was introduced in Linux 2.6.38 [40]. If THP is enabled, memory does not have to be statically provisioned for applications to use large pages, which is a clear advantage over previous attempts involving large pages [38]. Instead, the system is continuously compacting memory to free up contiguous space to allocate large pages. The Linux kernel can then, depending on the system configuration, transparently allocate large pages for applications. If intended, a system administrator can still additionally provision large pages manually.

THP can be globally enabled or be configured as an opt-in feature. Both mechanisms degrade gracefully when no large pages are available and will instead back memory using the standard page size. When THP is configured for opt-in, an application can use the system call “madvise” with the “MADV_HUGEPAGE” flag to mark memory regions as eligible. If this is done for virtual memory regions that have not been backed by physical memory yet, e.g., directly after an “mmap” call, the kernel will try to allocate a large page on first access to this memory. Otherwise, the kernel will occasionally scan virtual memory that is eligible for THP to create large pages. One downside of THP is that the kernel has to run these scan and compact operations. Linux allows configuring this behavior to mitigate the impact, e.g., by paying the cost for scanning and compacting at allocation time instead of running it as a background job.
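
The following minimal C sketch illustrates the opt-in path described above; the buffer size and error handling are only illustrative.

  #define _GNU_SOURCE               /* exposes MADV_HUGEPAGE on glibc */
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>

  int main(void) {
      size_t len = 64UL << 20;      /* 64 MiB, a multiple of 2 MiB */

      /* Reserve virtual memory that is not yet backed by physical pages. */
      void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (p == MAP_FAILED) { perror("mmap"); return 1; }

      /* Mark the region as eligible for THP (opt-in / madvise mode). */
      if (madvise(p, len, MADV_HUGEPAGE) != 0)
          perror("madvise");        /* e.g., kernel built without THP */

      /* First touch: the kernel can now back the region with 2 MiB pages. */
      memset(p, 0, len);

      munmap(p, len);
      return 0;
  }

Whether large pages are actually used can be checked via the AnonHugePages counter in /proc/meminfo or in the process's smaps file while the program runs.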

2.4 The Effect of THP

System workloads are known to speed up with huge pages. However, enabling them may also reduce reproducibility, as huge pages have to be enabled in the kernel and globally for all applications on the system [38]. Hence, it is often recommended to use small pages for benchmarking. Unfortunately, the StarExec cluster [39] has THP enabled by default for all executed programs, which comes with the mentioned downsides. In case of virtualized machines, using small pages can result in almost 50% of the runtime being spent in address translation [29]. Using 2 MiB pages in both the guest and the host reduces this value to about 4%. We expect similar savings for tools that are run in a virtualized environment, as virtualization typically uses huge pages internally. In the remainder of the paper, we investigate the actual effect on a bare-metal setup in our experimental work.

3 TLB Misses in SAT Solvers

Typically, SAT solvers do not exhibit the memory access locality that caches or TLBs are optimized for. While previous work considered caches [14, 19], memory translation and the TLB have not been taken into account. Hence, we focus on memory accesses in the most time-consuming part of SAT solvers: Boolean constraint propagation, also called unit propagation.

Assume that a formula F, a partial truth assignment τ, and a two-watched-literal data structure [35] are given. Briefly, unit propagation works as outlined in Listing 1. Initially, for each clause C ∈ F, one selects two literals from C that are not falsified by τ. Then, the truth assignment τ is extended by setting additional variables. Since assigning a literal l such that τ(l) = 1 satisfies every clause containing l, which in turn allows removing these clauses from the considered clauses right away, the only interesting case is when τ sets a literal l such that τ(l) = 0. In that case, a clause C containing l might be falsified and be involved in a conflict, or might have unassigned literals, which can be used to imply the truth value of other literals. Then, UnitPropagate checks every clause C that contains a literal which might be falsified during propagation. Therefore, the list L of literals to propagate is traversed (Line B1), and for each literal l ∈ L the watch list W[l] is processed (Line B3). Each clause C in W[l] contains l, so the new state of C has to be evaluated by processing the other literals in C. Hence there are two cases: either (i) C is satisfied by another literal, or (ii) C contains another literal l′ that is not yet falsified by τ (Line B5). Then, we watch literal l′ for being set to false instead of l, and consequently have to update list W[l] (Line B6) and list W[l′] (Line B7). Otherwise, C might be a unit clause (Line B8) or might be falsified by τ (Line B9). In both cases, C can remain in W[l].

UnitPropagate (formula F, truth assignment τ, literals L, watch lists W)
B1     while the list L of literals to propagate is not empty       //compute closure
B2         pick l ∈ L and remove l from L       //typically DFS
B3         access watch list W[l] of clauses       //propagate
B4         for all clauses C in W[l]:
B5             if l′ ∈ C, l′ ≠ l, and l′ is not falsified in assignment τ       //watchable literal
B6                 remove C from W[l]       //maintain lists
B7                 add C to watch list W[l′] for l′       //maintain lists
B8             else if C is unit, extend τ and L with the remaining literal       //unit rule
B9             else if C is falsified, trigger conflict analysis(τ, C)       //conflict
Listing 1: Pseudo code for an implementation of unit propagation with the watched literal scheme. The state of the solver holds the formula F as watch lists W, one list W[l] for each literal l, as well as a truth assignment τ and the list L of literals to propagate. The result of the algorithm will either be an extension of the truth assignment τ, or a tuple of the truth assignment τ and a conflict clause that is falsified by τ.

When considering the memory access pattern, the unit propagation algorithm has the following properties: the literals in list L are not easily determined in advance. Hence, the accesses in Line B3 to load the watch list are hard to predict. One could reduce the memory accesses in Lines B3 and B4 by pre-fetching data from memory in advance, which has been proposed in previous work [19]. Accessing the clauses in Line B4 is hard to predict, as the order of the clauses in list W[l] changes: in Line B6 some clauses are removed, while in Lines B8 and B9 others are kept. To improve the access behavior in Line B5, Eén and Sörensson [16] proposed an optimization for MiniSat 2.1 that avoids accessing clause C in the satisfied case: another literal of C, the so-called blocking literal, is stored directly in list W[l]. Since their introduction, blocking literals are commonly used in most modern solvers. Accessing literal l′ in Line B5 is also unpredictable, as the literal order in clauses changes as well. Typically, the two watched literals are the first two stored literals, and they change whenever the clause is moved to another list.
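
To make the pointer chasing concrete, the following hypothetical C snippet sketches a simplified version of the data layout behind Lines B3–B5, including a blocking literal; all names are illustrative and do not come from any particular solver.

  #include <stdbool.h>
  #include <stddef.h>

  typedef int Lit;                      /* literal encoding, e.g., 2*var + sign   */

  typedef struct {
      unsigned size;
      Lit      lits[];                  /* lits[0], lits[1] are the watched ones  */
  } Clause;

  typedef struct {
      Clause *clause;                   /* points to an arbitrary heap address    */
      Lit     blocker;                  /* blocking literal: if satisfied, the    */
                                        /* clause itself need not be touched      */
  } Watcher;

  typedef struct {
      Watcher *data;                    /* contiguous array, one list per literal */
      size_t   size;
  } WatchList;

  static bool value_is_true(Lit l) {    /* stub standing in for a lookup in tau   */
      (void)l; return false;
  }

  /* Accesses of Lines B3-B5: indexing watches[l] and scanning its array is
   * sequential, but following w->clause jumps to an unpredictable location in
   * the clause storage, which is where most TLB misses show up. */
  size_t visit_watches(WatchList *watches, Lit l) {
      size_t visited = 0;
      WatchList *wl = &watches[l];                  /* Line B3 */
      for (size_t i = 0; i < wl->size; i++) {       /* Line B4 */
          Watcher *w = &wl->data[i];
          if (value_is_true(w->blocker))
              continue;                             /* clause satisfied: skip it  */
          Clause *c = w->clause;                    /* random access, likely miss */
          (void)c->lits[0];                         /* Line B5: first watched lit */
          visited++;
      }
      return visited;
  }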

We suspect that unit propagation is the major source of memory accesses, as most runtime is spent in unit propagation, and many different memory locations, i.e., clauses, are accessed non-linearly during unit propagation. This leads us to the following hypothesis:

Hypothesis 1

Accessing clauses during unit propagation as well as updating and accessing watch lists has the highest impact on TLB misses.

We support our hypothesis by the following observation. The two-watched-literal data structure keeps the number of overall accesses low, but trades this for a higher memory footprint with additional data structures and lists. Further, the memory accesses for (i) the clause to access next, (ii) the literals of a clause to watch next, or (iii) the list to place it in are difficult to predict. Hence, Lines B3, B4, and B7 are prime candidates to access memory locations that have not been accessed recently and are therefore neither cached nor served by current TLB entries.

3.1 Analyzing Unit Propagation

To back up Hypothesis 1 with data, we analyze the distribution of TLB misses in the SAT solvers MiniSat and Glucose (we use a sampling approach based on CPU performance counters for TLB misses with the system tool perf). When running MiniSat [16], we observe that 90% of TLB misses occur in unit propagation; thereof, about 10% when moving a clause to another (unpredictable) watch list and about 80% when accessing the first literal of the next watched clause. This data matches the assumption that Lines B3 and B4 are responsible for most of the TLB misses, with the moves to new watch lists (Line B7) accounting for the remaining share. When running Glucose version 4.2.1 [4], we see similar results: 90% of the TLB misses happen in unit propagation. In this more modern solver, unit propagation is split into (i) propagating binary clauses, which contributes 5% of all TLB misses, (ii) propagating during learned clause minimization [30], which contributes about 20%, and (iii) propagating larger clauses and pushing them to watch lists, which consumes the majority of the TLB misses. The empirical observations for these two solvers confirm our hypothesis: unit propagation is the major source of TLB misses. The random memory accesses to check the next clause in the list for being unit, which can reside at an arbitrary memory location, as well as moving clauses into other watch lists, are the major contributors. Consequently, a large fraction of the runtime of SAT solvers, which is dominated by unit propagation, is actually spent in address translation. Biere [7] places watch lists and their clauses closer to each other in his solver to avoid TLB misses related to Line B4. Here, we present an orthogonal approach to avoid TLB misses, which allows improving that implementation [7] further and can be applied to many other solvers.

3.2 Countering TLB Misses in the Unit Propagation Implementation

We believe that additional data structure improvements along the lines of [14] are hardly feasible. Clauses would have to become even more compact, and such changes require a huge effort for a single solver [6]. Changes to the underlying algorithms likely result in reduced performance and require tuning parameters again. On that account, we propose in the following a general approach based on THP, which can also be easily used by other tools.

4 Improving Unit Propagation with THP via Madvise

Modern Linux distributions provide native support for transparent huge pages. Usually, the system allows the superuser to define the behavior via the configuration file “/sys/kernel/mm/transparent_hugepage/enabled”, whose values “always” or “never” apply to all running processes. Because there might be applications running on the host that would suffer from larger pages, THP is usually disabled on physical systems and it is not advised to set the value to “always”. Fortunately, as described above, Linux also allows requesting THP per memory region via the madvise system call. While this sounds fairly trivial, doing so directly (i) requires lots of manual adaptations of the source code to mark memory regions as eligible and (ii) in turn makes the implementations of solvers fairly incomparable on an algorithmic level. In the following, we suggest an easily accessible way to reduce the number of TLB misses in combinatorial memory-dependent solvers.

4.1 Using More Huge Pages

In the previous section, we explained that using more huge pages seems a reasonable approach to speed up the memory access of modern solvers. This can be achieved by issuing a madvise system call to instruct the kernel to use transparent huge pages of 2 MiB whenever the solver allocates memory. In addition, we align all requested memory to 2 MiB addresses and increase the size of the reservation accordingly, so that huge pages can actually be used. If we did not do so, a memory request of the application could start in the middle of a 2 MiB page, which would prevent a huge page from being used. Compared to the system setting, this change results in using one more huge page per misaligned memory request.
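
The following C sketch shows one way to implement the alignment and advice just described; it is a minimal illustration under the assumptions above (2 MiB huge pages, anonymous mmap-backed allocations), not the actual glibc patch.

  #define _GNU_SOURCE                      /* exposes MADV_HUGEPAGE */
  #include <stddef.h>
  #include <stdint.h>
  #include <sys/mman.h>

  #define HUGE_PAGE_SIZE (2UL * 1024 * 1024)

  /* Hypothetical helper: round the reservation up to full 2 MiB pages,
   * place it at a 2 MiB-aligned address, and advise the kernel to back
   * it with transparent huge pages. */
  void *alloc_thp(size_t bytes)
  {
      size_t len = (bytes + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);

      /* Over-reserve by one huge page so the start can be aligned. */
      void *raw = mmap(NULL, len + HUGE_PAGE_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (raw == MAP_FAILED)
          return NULL;

      uintptr_t start = ((uintptr_t)raw + HUGE_PAGE_SIZE - 1)
                        & ~(uintptr_t)(HUGE_PAGE_SIZE - 1);

      /* Return the unaligned head and the unused tail to the kernel. */
      if (start > (uintptr_t)raw)
          munmap(raw, start - (uintptr_t)raw);
      munmap((void *)(start + len), (uintptr_t)raw + HUGE_PAGE_SIZE - start);

      /* Ask for transparent huge pages; harmless if the kernel refuses. */
      madvise((void *)start, len, MADV_HUGEPAGE);
      return (void *)start;
  }

  int main(void) {
      return alloc_thp(100UL << 20) == NULL;    /* e.g., a 100 MiB arena */
  }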

4.2 Patching the Standard C Library (glibc)

In order to provide a transparent way for various solver developers, and to offer algorithm engineering a way to study the effect of transparent huge pages on many AR tools, we want to avoid manual source code adaptations as much as possible. To this end, we focus on the standard C system library glibc, which already provides the standard functions to access system memory and which is used to compile most solvers on Linux. Instead of modifying the source code of various solvers, we implement the above-mentioned ideas in glibc (our latest implementation is publicly available at https://github.com/conp-solutions/thp; we will submit the glibc patches upstream along with this publication, to be merged if accepted by the glibc maintainers). Whenever a certain runtime flag is activated, our modified glibc takes care of the above-mentioned changes. The current approach uses a system environment variable (GLIBC_THP_ALWAYS=1) that can be specified before calling a program. This way we allow setting the flag for a specific program instead of all running programs. Globally enabling transparent huge pages for all running applications on a system is usually forbidden by administrators, both in industry and academia, due to a variety of potential side effects, which might slow down other programs. If the flag is not set, we disable the use of huge pages. The additional cost, compared to stock glibc, is a single if-statement. We implemented our patches on top of glibc 2.23 [3] and enable the feature without any source code modification of the solvers themselves. In that way, it is entirely sufficient for the user of a solver to recompile the solver and link it against our modified glibc.
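
A hypothetical sketch of the runtime gate looks as follows; the real patch hooks glibc's internal allocation paths rather than exposing a public helper, so names and placement are illustrative only.

  #define _GNU_SOURCE                      /* exposes MADV_HUGEPAGE */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/mman.h>

  static int thp_enabled(void)
  {
      static int cached = -1;              /* evaluate the environment once */
      if (cached < 0) {
          const char *v = getenv("GLIBC_THP_ALWAYS");
          cached = (v != NULL && strcmp(v, "1") == 0);
      }
      return cached;
  }

  void maybe_advise(void *addr, size_t len)
  {
      if (thp_enabled())                   /* the single extra if-statement */
          madvise(addr, len, MADV_HUGEPAGE);
  }

  int main(void)
  {
      printf("THP advice %s\n", thp_enabled() ? "enabled" : "disabled");
      return 0;
  }

With such a gate, enabling the feature amounts to invoking the statically linked solver with the environment variable set, e.g., roughly as GLIBC_THP_ALWAYS=1 ./solver instance.cnf.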

4.3 Huge Pages in a Solver

In order to use the feature, there are two ways to proceed: (i) link the solver statically against our modified glibc or (ii) patch the system glibc and then dynamically link the solver against the new glibc. We provide an easy and accessible way for the former, since patching the system glibc is usually considered problematic due to side effects and requires superuser permissions, which makes it very unlikely that actual users of the reasoning tools would use this feature. To avoid problematic setups with a secondary glibc, we introduce a virtual environment that allows for easily compiling the solver.

Our system is based on the OS-level virtualization docker [21], which isolates running programs from each other. Docker itself is available on all modern OSes and allows delivering software in packages called containers. A running container is entirely isolated from other containers and can bundle its own software. We use this to avoid interfering with the system glibc. We do not only publish a docker container; instead, we also provide the scripts to build the containers in which the compilation then runs. The user just needs to install docker, and we provide the tooling to link a solver with THP support. Along with it, we give many example scripts that highlight how to run the tools, as well as various pre-compiled state-of-the-art solvers.

5 Experimental Evaluation

We conducted a series of experiments using standard benchmark sets for various reasoning tools. All benchmark sets and our results are publicly available. To represent many fields, we selected various benchmarks and tools.

5.1 Benchmarked Solvers and Instances

In our experimental work, we present results for recent versions of publicly available SAT solvers: Glucose 4.2.1 [4], lingeling [8], MapleLCMDistChronoBTDL (winner2019) [27], MergeSat [31], MiniSat [16], and plingeling [8]. For SAT, we selected the benchmark recommended for tool tuning, and we later compare MiniSat and MergeSat from the above set of SAT tools again. For answer set programming (ASP), we used clasp [25] and a benchmark that has been shown to be suitable for benchmark selection [20]. From software model checking (SWMC), we use CBMC [15], which uses a single call to a SAT solver. As SWMC benchmark, we use the benchmark provided when the LLBMC tool was introduced [34]. As another group, we collected tools that use incremental SAT solvers as a backend. For hardware model checking (HWMC), we use the bounded model checker aigbmc [10] with an unrolling limit of 100 and the benchmark of the deep bound track of the HWMC Competition 2017 [9]. For optimization, we use the MaxSAT solver Open-WBO [33], which uses the SAT solver Glucose as a backend. As MaxSAT benchmark, we picked the weighted partial MaxSAT formulas from 2014 to make sure the incremental interface is actually used. Finally, we consider muser-2 [5], which computes a minimal unsatisfiable subformula (MUS) of a CNF formula, and use the group MUS benchmark from the MUS competition 2011 (MUS benchmarks are available at cril.univ-artois.fr/SAT11).

Figure 2: Runtime for the SAT solvers on all considered instances. The x-axis refers to the number of instances and the y-axis depicts the runtime sorted in ascending order for each solver individually.
solver     solved  t (h)  t_THP (h)  saved (%)  TLB        TLB_THP    TLB_THP/TLB (%)
glucose    189     4.58   3.72       18.70      2.60E+11   6.71E+09   2.59
lingeling  177     6.18   5.91        4.76      4.93E+10   5.13E+08   1.04
winner19   194     7.46   6.24       16.67      3.29E+11   1.35E+10   4.10
mergesat   176     7.31   6.22       14.53      3.02E+11   1.36E+10   4.50
minisat    170     7.26   6.09       15.97      2.75E+11   2.97E+09   1.08
Table 2: Overview of the speed-up when using THP for SAT solvers. "solved" counts the number of solved instances, t and t_THP contain the runtime in hours without and with THP, "saved" the saved runtime in %, and TLB and TLB_THP the TLB load misses without and with THP. The last column summarizes the remaining TLB misses in % of the original TLB misses, i.e., TLB_THP/TLB.

5.2 Boolean Satisfiability (SAT)

For SAT solvers we carried out an extensive study on a cluster.

5.2.1 Measure, Setup, and Resource Enforcements.

Our results were gathered on a cluster of RHEL 7.4 Linux machines with kernel 3.10.0-693 and GCC 4.8.5-16. We evaluated the solvers on machines with two Intel Xeon E5-2680v3 CPUs of 12 physical cores each at 2.50 GHz base frequency. We forced the performance governors to 2.5 GHz [18] and disabled multi-threading. The machines are equipped with 64 GB main memory, of which 60.5 GB are freely available to programs. We compare wall clock time and the number of timeouts. We avoid I/O access whenever possible, i.e., we load instances into RAM before we start solving. We run at most 5 solvers on one node, set a timeout of 900 seconds, and limit the available RAM to 8 GB per instance and solver. We follow standard guidelines for benchmarking [28].

5.2.2 SAT Results.

Figure 2 illustrates the runtime results of the solvers in the different configurations in a cactus-like plot. Table 2 gives an overview of the number of solved instances for each solver with and without THP. Note that we report in this table only on instances that have been solved both with and without THP. The results show that a solver with activated THP solves overall more instances than without THP. When considering runtime, the configurations that employ THP solve the considered instances faster. Runtime improvements range from a small improvement for lingeling, which was about 0.27h faster, up to more than one hour for mergesat, minisat, and winner2019. In terms of saved runtime, the solvers that employ THP are up to almost 19% faster (see Table 2). The number of TLB misses we observed drops by up to two orders of magnitude, namely, down to about 1% of the original count for lingeling and similarly for glucose and minisat. When we consider the number of solved instances, glucose solves 5 instances fewer than winner2019. However, glucose solves its instances in 4.58 hours, while winner2019 needs 7.46 hours.

5.2.3 Discussion and Summary.

The observed results confirm our hypotheses. Throughout our experiments, the number of TLB misses goes down significantly for all considered solvers. For lingeling and minisat it even drops to about 1% of the original number of misses. Since unit propagation is a major source of memory accesses, a major cause of the high number of TLB misses, and responsible for a large part of the solving time, the reduced TLB misses also yield a speed-up in the overall runtime. Our approach makes all solvers faster and even allows glucose to come close to winner19 without spending a significant amount of time on optimizing the solver itself. Only very few benchmarks can be solved with less than 1 MB or 10 MB of memory. Hence, using larger pages results in fewer TLB misses and faster execution.

5.3 Other Reasoners.

We believe that the THP approach does not only boost SAT solvers but other reasoners as well. On that account, we run additional experiments on tools that are either based on SAT solvers or implemented closely along the CDNL algorithm. To broaden the applicability, we also consider tools that use SAT solvers via their incremental interface [16]. As the SAT calls in these tools are shorter, we expect the benefit of using THP to be smaller.

Hypothesis 2

When using incremental SAT solvers inside a reasoner, the benefit of THP is smaller.

Measure, Setup, and Resource Enforcements.

To support Hypothesis 2, we ran a second analysis. To make sure the above results are not CPU and OS dependent, we used a second environment of the same architecture and repeated the measurements for MiniSat and MergeSat. The computer has an Intel Core i5-2520M CPU running at 2.50 GHz with Ubuntu 16.04 and Linux 4.15, using a 5 GB memory limit and a 900-second timeout per instance.

      Category  Tool      t (h)  t_THP (h)  saved (%)
      SAT       MiniSat   8.17   7.03       13.99
      SAT       MergeSat  7.94   6.90       13.13
      ASP       clasp     3.66   3.29       10.18
      MaxSAT    open-wbo  1.19   1.09        8.49
      MUS       muser2    4.18   3.97        5.16
      HWMC      aigbmc    0.89   0.86        4.11
      SWMC      cbmc      0.23   0.22        2.76
Table 3: Overview of the runtime of various reasoners with and without THP, evaluated on their respective competition benchmarks. t and t_THP represent the runtime of the reasoner in hours without and with THP, and "saved" represents the saved runtime in %.

5.3.1 Results.

Table 3 states the results for the considered tools and benchmarks. To measure the speed-up, we show only instances that could also be solved by the variant without THP. When using THP, we can solve the same instances within the timeout, and usually a few more. First, we can see a similar improvement as in the previous setting, where we used the same architecture but different hardware. The improvement when using THP for tools with a single SAT call is similarly high as presented above, i.e., SAT as well as ASP show improvements above 10%. Only cbmc from SWMC is an outlier, which might be related to its memory usage. For tools that use an incremental SAT solver as a backend, the improvement ranges from 4% for HWMC to 8% for MaxSAT. The low speed-up can be explained by memory usage: over its benchmark, cbmc's memory usage is rather low, i.e., the median memory footprint is 8.8MB. For all other tools and categories, the memory footprint is higher, e.g., the median for ASP is 28.8MB and for SAT 123.3MB. The tools with incremental SAT backends also consume more memory than cbmc: MaxSAT 21.3MB, HWMC 164MB, and MUS 298.1MB. As expected, tools with a higher memory footprint see a higher speed-up due to transparent huge pages.

6 Conclusion and Future Work

Although reasoners solve NP-hard problems, they are used across the research community to solve many tasks in artificial intelligence. Reasoners are also employed in industry to verify properties, generate tests, or run similar tasks. In this paper, we introduced a simple and transparent approach to effectively speed up the memory access of reasoners by reducing the number of TLB misses, which in turn significantly improves their runtime. Our approach is based on a modification of glibc, which is the C standard library of the GNU Project and widely used in Linux for C and C++ programs. A user of a reasoner can benefit from our improvement simply by recompiling their favorite reasoner and enabling the feature via an environment flag when invoking it. Our experiments confirmed that an application can save 25% of the runtime on certain instances and on average more than 10%. Since the tools are often also used for long-running jobs, for example in systems biology or verification, we can save a significant amount of runtime and hence save energy and money. In that light, the number of solved instances might not always be the right measure to evaluate a reasoner.

We believe that our approach can also be very beneficial for other tools in automated reasoning, simply because there are many memory-intensive applications that have not been tuned to reduce the number of TLB misses by using THP. One such domain might be graph algorithms, which can have random memory access patterns if the underlying data structure is updated often. In the future, we are interested in the influence of THP on various other domains and in increasing the number of tested benchmarks and reasoning tools. We hope that this opens up both theoretical and practical research on more general algorithm engineering techniques for solvers.

References

  • [1] Proceedings of SAT Race 2019 : Solver and Benchmark Descriptions, Department of Computer Science Report Series, vol. B-2019-1. University of Helsinki, Helsinki, Finland (2019)
  • [2] Arcangeli, A.: Transparent hugepage support. In: KVM forum. vol. 9 (2010)
  • [3] Arnold, R.S., Brown, M., Eggert, P., Jelinek, J., Kuvyrkov, M., Myers, J., O’Donell, C., Oliva, A., Schwab, A.: The GNU C Library (glibc). https://www.gnu.org/software/libc/ (2019)
  • [4] Audemard, G., Simon, L.: Glucose in the SAT Race 2019. [1], pp. 19–20
  • [5] Belov, A., Marques-Silva, J.: Muser2: An efficient MUS extractor. J. on Satisfiability, Boolean Modeling and Computation 8(3/4), 123–128 (2012)
  • [6] Biere, A.: Lingeling essentials, a tutorial on design and implementation aspects of the SAT solver Lingeling. In: POS@SAT (2014)
  • [7] Biere, A.: Splatz, Lingeling, Plingeling, Treengeling, YalSAT Entering the SAT Competition 2016. In: Balyo, T., Heule, M., Järvisalo, M. (eds.) Proc. of SAT Competition 2016 – Solver and Benchmark Descriptions. Department of Computer Science Series of Publications B, vol. B-2016-1, pp. 44–45. University of Helsinki (2016)
  • [8] Biere, A.: CaDiCaL, Lingeling, Plingeling, Treengeling, YalSAT Entering the SAT Competition 2017. In: Balyo, T., Heule, M., Järvisalo, M. (eds.) Proc. of SAT Competition 2017 – Solver and Benchmark Descriptions. Department of Computer Science Series of Publications B, vol. B-2017-1, pp. 14–15. University of Helsinki (2017)
  • [9] Biere, A., van Dijk, T., Heljanko, K.: Hardware model checking competition 2017. In: Stewart, D., Weissenbacher, G. (eds.) Formal Methods in Computer-Aided Design, FMCAD 2017, Vienna, Austria, October 02-06, 2017. p. 9. IEEE (2017)
  • [10] Biere, A., Heljanko, K., Wieringa, S.: AIGER 1.9 and beyond. Tech. Rep. 11/2, Institute for Formal Models and Verification, Johannes Kepler University, Altenbergerstr. 69, 4040 Linz, Austria (2011)
  • [11] Biere, A., Heule, M.: The effect of scrambling CNFs. In: Berre, D.L., Järvisalo, M. (eds.) Proceedings of Pragmatics of SAT 2015 and 2018. EPiC Series in Computing, vol. 59, pp. 111–126. EasyChair (2019)
  • [12] Biere, A., Heule, M., van Maaren, H., Walsh, T. (eds.): Handbook of Satisfiability, Frontiers in Artificial Intelligence and Applications, vol. 185. IOS Press, Amsterdam, Netherlands (Feb 2009)
  • [13] Bornebusch, F., Wille, R., Drechsler, R.: Towards lightweight satisfiability solvers for self-verification. In: Proceedings of the 7th International Symposium on Embedded Computing and System Design (ISED’17). pp. 1–5 (Dec 2017), doi:10.1109/ISED.2017.8303924
  • [14] Chu, G., Harwood, A., Stuckey, P.: Cache conscious data structures for Boolean satisfiability solvers. J. on Satisfiability, Boolean Modeling and Computation 6, 99–120 (Feb 2009), doi:10.3233/SAT190064
  • [15] Clarke, E., Kroening, D., Lerda, F.: A tool for checking ANSI-C programs. In: Jensen, K., Podelski, A. (eds.) Tools and Algorithms for the Construction and Analysis of Systems (TACAS 2004). Lecture Notes in Computer Science, vol. 2988, pp. 168–176. Springer (2004)
  • [16] Eén, N., Sörensson, N.: An extensible SAT-solver. In: Giunchiglia, E., Tacchella, A. (eds.) Proceedings of the 6th International Conference on Theory and Applications of Satisfiability Testing (SAT’03). pp. 502–518. Springer Verlag (2003)
  • [17] Gomes, C., Selman, B., Crato, N.: Heavy-tailed distributions in combinatorial search. In: Smolka, G. (ed.) Proceedings of the 3rd International Conference on Principles and Practice of Constraint Programming (CP’97). Lecture Notes in Computer Science, vol. 1330, pp. 121–135. Springer Verlag, Linz, Austria (1997), doi:10.1007/BFb0017434
  • [18] Hackenberg, D., Schöne, R., Ilsche, T., Molka, D., Schuchart, J., Geyer, R.: An energy efficiency feature survey of the intel haswell processor. In: Lalande, J.F., Moh, T. (eds.) Proceedings of the 17th International Conference on High Performance Computing & Simulation (HPCS’19) (2019)
  • [19] Hölldobler, S., Manthey, N., Saptawijaya, A.: Improving resource-unaware SAT solvers. In: Fermüller, C.G., Voronkov, A. (eds.) Proceedings of the 16th International Conference on Logic for Programming, Artificial Intelligence, and Reasoning (LPAR’16). Lecture Notes in Computer Science, vol. 6397, pp. 519–534. Springer Verlag, Dakar, Senegal (2010), doi:10.1007/978-3-642-16242-8_26
  • [20] Hoos, H.H., Kaufmann, B., Schaub, T., Schneider, M.: Robust benchmark set selection for Boolean constraint solvers. In: Proceedings of the 7th International Conference on Learning and Intelligent Optimization (LION’13). Lecture Notes in Computer Science, vol. 7997, pp. 138–152. Springer Verlag, Catania, Italy (Jan 2013), Revised Selected Papers
  • [21] Hykes, S., et al.: Docker ce. https://github.com/docker/docker-ce (2019)
  • [22] Intel: Intel® 64 and IA-32 Architectures Software Developer’s Manual (2019), order Number: 325462-069US
  • [23] Järvisalo, M., Heule, M., Biere, A.: Inprocessing rules. In: Gramlich, B., Miller, D., Sattler, U. (eds.) Automated Reasoning, Lecture Notes in Computer Science, vol. 7364, pp. 355–370. Springer Verlag (2012), doi:10.1007/978-3-642-31365-3_28
  • [24] Katebi, H., Sakallah, K.A., Marques-Silva, J.P.: Empirical study of the anatomy of modern SAT solvers. In: Sakallah, K.A., Simon, L. (eds.) Proceedings of the 14th International Conference on Theory and Applications of Satisfiability Testing (SAT’11), Lecture Notes in Computer Science, vol. 6695, pp. 343–356. Springer Verlag, Ann Arbor, MI, USA (June 2011), doi:10.1007/978-3-642-21581-0_27
  • [25] Kaufmann, B., Gebser, M., Kaminski, R., Schaub, T.: clasp – a conflict-driven nogood learning answer set solver. http://www.cs.uni-potsdam.de/clasp/ (2015)
  • [26] Kleine Büning, H., Lettman, T.: Propositional logic: deduction and algorithms. Cambridge University Press, Cambridge, New York, NY, USA (1999)
  • [27] Kochemazov, S., Zaikin, O., Kondratiev, V., Semenov, A.: MapleLCMDistChronoBT-DL, duplicate learnts heuristic-aided solvers at the SAT Race 2019. [1], p. 24
  • [28] van der Kouwe, E., Andriesse, D., Bos, H., Giuffrida, C., Heiser, G.: Benchmarking crimes: An emerging threat in systems security. CoRR abs/1801.02381 (2018), http://arxiv.org/abs/1801.02381
  • [29] Kwon, Y., Yu, H., Peter, S., Rossbach, C.J., Witchel, E.: Coordinated and efficient huge page management with Ingens. In: Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI’16). pp. 705–721. USENIX Association, Savannah, GA, USA (2016), doi:10.5555/3026877.3026931
  • [30] Luo, M., Li, C.M., Xiao, F., Manyà, F., Lü, Z.: An effective learnt clause minimization approach for CDCL SAT solvers. In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17. pp. 703–711 (2017), doi:10.24963/ijcai.2017/98
  • [31] Manthey, N.: MergeSat. [1], pp. 29–30
  • [32] Marques-Silva, J., Sakallah, K.: GRASP: a search algorithm for propositional satisfiability. IEEE Transactions on Computers 48(5), 506–521 (May 1999), doi:10.1109/12.769433
  • [33] Martins, R., Manquinho, V., Lynce, I.: Open-WBO: A modular MaxSAT solver. In: Sinz, C., Egly, U. (eds.) Theory and Applications of Satisfiability Testing – SAT 2014. pp. 438–445. Springer International Publishing, Cham (2014)
  • [34] Merz, F., Falke, S., Sinz, C.: LLBMC: Bounded model checking of C and C++ programs using a compiler IR. In: Joshi, R., Müller, P., Podelski, A. (eds.) Verified Software: Theories, Tools, Experiments. pp. 146–161. Springer Berlin Heidelberg, Berlin, Heidelberg (2012)
  • [35] Moskewicz, M.W., Madigan, C.F., Zhao, Y., Zhang, L., Malik, S.: Chaff: Engineering an efficient SAT solver. In: Rabaey, J. (ed.) Proceedings of the 38th Annual Design Automation Conference (DAC’01). pp. 530–535. Assoc. Comput. Mach., New York, Las Vegas, Nevada, USA (2001), doi:10.1145/378239.379017
  • [36] Navarro, J., Iyer, S., Druschel, P., Cox, A.: Practical, transparent operating system support for superpages. SIGOPS Oper. Syst. Rev. 36(SI), 89–104 (Dec 2003), doi:10.1145/844128.844138
  • [37] Panwar, A., Prasad, A., Gopinath, K.: Making huge pages actually useful. In: Bianchini, R., Sarkar, V. (eds.) Proceedings of the 23rd ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’18). pp. 679–692. Assoc. Comput. Mach., New York, Williamsburg, VA, USA (Mar 2018), doi:10.1145/3173162.3173203
  • [38] Park, S., Kim, M., Yeom, H.Y.: GCMA: Guaranteed contiguous memory allocator. IEEE Transactions on Computers 68(3), 390–401 (Mar 2019), doi:10.1109/TC.2018.2869169
  • [39] Stump, A., Sutcliffe, G., Tinelli, C.: StarExec: A cross-community infrastructure for logic solving. In: Demri, S., Kapur, D., Weidenbach, C. (eds.) Proceedings of the 7th International Joint Conference on Automated Reasoning (IJCAR’14). Lecture Notes in Computer Science, vol. 8562, pp. 367–373. Springer Verlag, Vienna, Austria (Jul 2014), doi:10.1007/978-3-319-08587-6_28, held as part of the Vienna Summer of Logic, VSL 2014
  • [40] Torvalds, L.: kernel.org: Transparent hugepage support. https://www.kernel.org/doc/Documentation/vm/transhuge.txt (May 2017)
  • [41] Vardi, M.Y.: Publish and perish. Communications of the ACM 63(1), 7 (2020), doi:10.1145/3373386
  • [42] Voronkov, A.: Avatar: The architecture for first-order theorem provers. In: Biere, A., Bloem, R. (eds.) Proceedings of the 26th International Conference on Computer Aided Verification CAV’14. Lecture Notes in Computer Science, vol. 8559, pp. 696–710. Springer Verlag (2014), held as Part of the Vienna Summer of Logic (VSL).
  • [43] WikiChip: Skylake (client) – Microarchitectures – Intel. https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client) (2020)