Fast concurrent transactional processing is one of the major design goals of basically every modern database management system. To fully utilize the large amount of hardware parallelism that is nowadays available even in commodity servers, the right concurrency control mechanism must be chosen.
Interestingly, a large number of database systems, including major players like PostgreSQL, Microsoft Hekaton, SAP HANA, and HyPer, currently implement a form of multi-version concurrency control (MVCC) [10, 2] to manage their transactions. It allows a high degree of parallelism as reads do not block writes. The core principle is straightforward: if a value is updated, its old version is not simply replaced by the new version. Instead, the new version is stored alongside the old one in a version chain, such that the old version is still available for reads that require it. Timestamps ensure that transactions access only the versions they are allowed to see.
1.1 Limitations of Classical MVCC
In classical MVCC implementations, all transactions, no matter whether they are short-running OLTP transactions or scan-heavy OLAP transactions, are treated equally and are executed on the same (versioned) database. While this form of homogeneous processing unifies transaction management, it also has a few unpleasant downsides under mixed workloads:
First and foremost, scan-heavy OLAP transactions suffer heavily when they have to deal with a large number of lengthy version chains. During a scan, each of these version chains must be traversed to locate the most recent version of each item that is visible to the transaction. This involves expensive timestamp comparisons as well as random accesses when going through the version chains, which are typically organized as linked lists. As scans typically take time, a large number of OLTP transactions can perform updates in parallel, and lengthy version chains build up during the execution.
Apart from this, these version chains must be garbage collected from time to time to remove versions that cannot be seen by any transaction in the system. Under classical MVCC, this is typically done by a separate cleanup thread, which frequently traverses all present chains to locate and delete outdated versions. This thread has to be managed and synchronized with the transaction processing, consuming precious resources.
Obviously, the mentioned problems are directly connected to the processing of scan-heavy OLAP transactions in the presence of short-running modifying OLTP transactions. Such a heterogeneous workload, consisting of transactions of inherently different nature, simply does not fit homogeneous processing, which treats all incoming transactions in the same way. Unfortunately, such a homogeneous processing model is used in state-of-the-art MVCC systems.
1.2 Heterogeneous Processing
But why exactly do state-of-the-art systems rely on a homogeneous processing model, although it does not fit the workload they face? Why don't they implement heterogeneous processing, which classifies transactions based on their type and executes them separately?
To answer these questions, let us look at the development of the prominent HyPer [4, 5] system. Early versions of HyPer actually implemented heterogeneous processing: the transactions were classified into the categories OLTP and OLAP and consequently executed on separate representations of the database. The short-running modifying OLTP transactions were executed on the most recent version of the data, while long-running OLAP transactions were outsourced to run on snapshots. These snapshots were created from time to time from the up-to-date version of the database.
While this concept mapped the mixed workload to the processing system in a very natural way, the engineers faced a severe problem: the creation of snapshots turned out to be very expensive. To snapshot, HyPer utilized the fork system call. This system call creates a child process that shares its virtual memory with that of the parent process. Both processes perform copy-on-write to keep changes local, thus implementing a form of (virtual) snapshotting. While this principle is cheaper than physical snapshotting, forking processes is still costly. Thus, the engineers were forced to move away from heterogeneous processing to a homogeneous model, fully relying on MVCC in their current version.
Despite the challenges one has to face when implementing a heterogeneous model, we believe it is the right choice after all. Matching the processing system to the workload is crucial for performance. This is exactly the goal of our main-memory transaction processing system coined AnKerDB, which we will propose in the following. Still, to do so, we have to discuss two problems first:
Obviously, MVCC is the state-of-the-art concurrency control mechanism in main-memory systems. In AnKerDB, we intend to apply it as well. But how can state-of-the-art MVCC be combined with a heterogeneous processing model?
Apparently, state-of-the-art snapshotting mechanisms are not capable of powering a heterogeneous processing model. How can we realize a fast snapshotting mechanism that allows the creation of snapshots at a high frequency and at fine granularity?
Let us discuss these questions one by one in the following.
1.3.1 MVCC in Heterogeneous Processing
Classical systems implement MVCC in a homogeneous processing model, where all transactions are treated equally and executed on the same versioned database. In contrast to that, in AnKerDB we want to extend the capabilities of MVCC by reintroducing the concept of heterogeneous processing, where incoming transactions are classified by their type and treated independently. By this, we are able to utilize the advantages of MVCC while avoiding its downsides.
The concept works as follows: based on the classification, we separate the short-running OLTP transactions from the long-running (read-only) OLAP transactions. Conceptually, the modifying OLTP transactions run concurrently on the most recent version of the database and build up version chains as in classical MVCC. In parallel, we outsource the read-only OLAP transactions to run on separate (read-only) snapshots of the versioned database.
These snapshots are created at a very high frequency to ensure freshness. Thus, instead of dealing with a single representation of the database that suffers from a large number of lengthy version chains, as is the case in systems that rely on homogeneous processing, we maintain a most recent representation inside an OLTP component alongside a set of snapshots, which are present in the OLAP component. Naturally, each of the representations contains fewer and shorter version chains, which largely reduces the main problem described in Section 1.1.
Apart from that, using snapshots has the pleasant side effect that the garbage collection of version chains becomes extremely simple: we remove the chains automatically with the deletion of the corresponding snapshot, once it cannot be seen by any transaction anymore. Other systems like PostgreSQL have to rely on a fine-granular garbage collection mechanism for shortening the version chains, requiring precious resources. In contrast to that, by using snapshotting, we solve the problem of complex garbage collection implicitly.
1.3.2 High-Frequency Snapshotting
With the high-level design of the heterogeneous processing model at hand, the question remains how to realize efficient snapshotting. The approach stands and falls with the ability to generate snapshots at a very high frequency, which ensures that transactions running on the snapshots have to deal only with few and short version chains. In this regard, previous approaches that relied on snapshotting suffered from an expensive snapshot creation phase and consequently moved away from snapshotting. As mentioned, early versions of HyPer, which also used a heterogeneous processing model, created virtual snapshots using the system call fork. This call spawns child processes that share their entire virtual memory with the parent process. The copy-on-write mechanism, which the operating system carries out at the level of memory pages, ensures that changes remain local to the respective processes. While this mechanism obviously implements a form of snapshotting, process forking is very expensive. Thus, it is not an option for our case, as we require a more lightweight snapshotting mechanism.
In our recent publication on the rewiring of virtual memory, we already looked into the case of snapshot creation. With rewiring, we are able to manipulate the mapping from virtual to physical memory pages at runtime in user space. In that work, we used this technique to snapshot an existing virtual memory area, which maps to a physical memory area, by manually establishing a mapping from a new virtual memory area to the same physical memory area. While this approach is already significantly faster than using fork, as we stay inside a single process, it is still not optimal: in the worst case, the mapping must be reconstructed page-wise, which is a costly process for large mappings because individual system calls must be carried out.
Unfortunately, none of the existing solutions meets our requirements on snapshot creation speed. Therefore, in AnKerDB, we implement a more sophisticated form of virtual snapshotting. We do not limit ourselves to the given general-purpose system calls. Instead, we introduce our own custom system call coined vm_snapshot and integrate the concept of rewiring directly into the kernel. Using our call, we can essentially snapshot arbitrary virtual memory areas within a single process at any point in time. The virtual snapshots share their physical memory until a write to a virtual page happens, which creates a local physical page. This allows us to create snapshots with a small memory footprint at a very low cost, and therefore at a high frequency. Consequently, the individual snapshots contain few and short version chains and enable efficient scans.
1.4 Structure & Contributions
Before we start with the detailed presentation of the system design and the individual components, let us outline the contributions we make in the following work:
(I) We present AnKerDB, a prototypical main-memory (column-oriented) transaction processing system, which supports the efficient concurrent execution of transactions under a heterogeneous processing model under full serializability guarantees. Short-running (modifying) transactions concurrently run on the most recent version of the data using MVCC. Meanwhile long-running read-only transactions run on (versioned) snapshots in parallel.
(II) We realize the snapshots in form of virtual snapshots and heavily accelerate the snapshotting process by introducing a custom system call coined vm_snapshot to the Linux kernel. This call directly manipulates the virtual memory subsystem of the OS and allows a significantly higher snapshotting frequency than state-of-the-art techniques. We demonstrate the capabilities of vm_snapshot in a set of micro-benchmarks and compare it against the existing physical and virtual snapshotting methods.
(III) We create snapshots at the granularity of a column, instead of snapshotting the entire table or database as a whole. This is possible due to the flexibility of our custom system call vm_snapshot. Therefore, we are able to limit the snapshotting effort to those columns that are actually accessed by the incoming transactions.
(IV) We create snapshots of versioned columns to keep the snapshot creation phase as cheap as possible. To create a snapshot, the current column is virtually snapshotted using our custom system call vm_snapshot and the current version chains are handed over. Running transactions can still access all required versions from the fresh snapshot. As the snapshot is read-only, all further updates happen to the up-to-date column, creating new version chains. As a side-effect, we avoid any expensive garbage collection mechanism as dropping an old snapshot drops all old version chains with it.
(V) We perform an extensive experimental evaluation of AnKerDB. First, we compare our heterogeneous transaction processing model with classical homogeneous MVCC under both snapshot isolation and full serializability guarantees, executing mixed OLTP/OLAP workloads based on TPC-H queries and hand-tailored OLTP transactions. To enable this form of evaluation, AnKerDB can be configured to support both heterogeneous and homogeneous processing (by disabling snapshotting) as well as the required isolation levels. We show that our approach offers a drastically higher transaction throughput under mixed workloads.
The paper is structured in the following way: In Section 2, we describe the heterogeneous design of AnKerDB and motivate it with the problems of state-of-the-art MVCC approaches. As the heterogeneous design requires a fast snapshotting mechanism, we discuss the currently available snapshotting techniques to understand their strengths and weaknesses in Section 3. As a consequence, in Section 4, we propose our own snapshotting method based on our custom system call vm_snapshot. Finally, in Section 5, we evaluate AnKerDB in different configurations and show the superiority of heterogeneous processing.
2 AnKerDB
As already outlined, the central component of AnKerDB is a heterogeneous processing model, which separates OLTP from OLAP processing using virtual snapshotting. Both in the up-to-date representation of the data as well as in the snapshots, we want to use MVCC as the concurrency control mechanism. To understand our hybrid design, let us first see how MVCC is working within a single component.
2.1 Classical MVCC
To understand the mechanisms of classical MVCC, let us go through the individual components. Initially, the data is unversioned and present in the column. Thus, there exist no version chains. If a transaction updates an entry, we first store the new value inside the local memory space of the transaction. When the commit is carried out, the update has to be materialized in the column. To do so, the old value is stored in the (freshly created) version chain of that row and is then overwritten in-place with the new value in the column. Thus, we store the versions in a newest-to-oldest order. Other systems, e.g. HyPer, rely on this order as well, as it favors younger transactions: they will find their version early on during the chain traversal. Obviously, the version chains can become arbitrarily long if consecutive updates to the same entry happen. Along with the version, we store a unique timestamp of the update that created that version. This is necessary to ensure that transactions that started before the (committed) update happened do not see the new version of the entry but still the old one. Unfortunately, reading a versioned column can also become arbitrarily expensive: for every entry that a transaction intends to read and that has a version chain, the chain must be traversed while comparing timestamps to locate the proper version. In summary, if a large number of lengthy version chains is present and a transaction intends to read many entries, the version chain traversal cost becomes significant.
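The in-place update with a newest-to-oldest version chain can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not AnKerDB's implementation: the class name `VersionedColumn`, the integer timestamps, and the per-row `written_at` array are hypothetical choices made for this example.

```python
class VersionedColumn:
    """Column holding the newest values in place, plus per-row version chains."""

    def __init__(self, values):
        self.values = list(values)            # newest committed value per row
        self.written_at = [0] * len(values)   # commit timestamp of that value (assumed layout)
        self.chains = {}                      # row -> [(old_value, its_timestamp)], newest first

    def commit_write(self, row, new_value, commit_ts):
        # Push the old value to the head of the chain, then overwrite in place.
        chain = self.chains.setdefault(row, [])
        chain.insert(0, (self.values[row], self.written_at[row]))
        self.values[row] = new_value
        self.written_at[row] = commit_ts

    def read(self, row, start_ts):
        # The in-place value is visible if it was committed before the reader started.
        if self.written_at[row] <= start_ts:
            return self.values[row]
        # Otherwise traverse the chain from newest to oldest, comparing timestamps.
        for value, ts in self.chains.get(row, []):
            if ts <= start_ts:
                return value
        raise LookupError("no version visible to this transaction")
```

A reader that started before an update keeps seeing the old value, and the cost of `read` grows with the chain length, which is exactly the overhead that scans suffer from.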
Besides the way of versioning the data, the guaranteed isolation level is an important aspect in MVCC. As a consequence of its design, MVCC implements snapshot isolation guarantees by default. During its lifetime, a transaction sees the committed state of the database that was present at its start time. The updates of newer transactions that committed during its lifetime are not visible to it. Write-write conflicts are detected at commit time: if the transaction wants to write to an entry to which a newer committed transaction already wrote, it aborts. Still, under snapshot isolation, so-called write-skew anomalies are possible.
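To make the commit-time check and the write-skew anomaly concrete, consider the following Python sketch. The function `can_commit` and the dictionary of last commit timestamps per row are assumptions introduced for illustration only.

```python
def can_commit(start_ts, write_set, last_commit_ts):
    """Snapshot-isolation check: abort if any row we write was committed after we started.

    last_commit_ts maps a row to the timestamp of its latest committed write
    (rows that were never written are treated as timestamp 0).
    """
    return all(last_commit_ts.get(row, 0) <= start_ts for row in write_set)


# Write-write conflict: a transaction that started at time 3 wants to write row 0,
# but a newer transaction already committed a write to row 0 at time 5.
conflict = can_commit(start_ts=3, write_set={0}, last_commit_ts={0: 5})

# Write skew: two transactions start at time 1, both read rows 0 and 1,
# and write disjoint rows. The check only looks at write sets, so both pass.
commits = {}
first_ok = can_commit(start_ts=1, write_set={0}, last_commit_ts=commits)
commits[0] = 2                      # first transaction commits at time 2
second_ok = can_commit(start_ts=1, write_set={1}, last_commit_ts=commits)
```

Here `conflict` is `False` (the transaction aborts), whereas `first_ok` and `second_ok` are both `True`, although each transaction based its write on a value the other one changed.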
Fortunately, MVCC can be extended to support full serializability. To do so, we extend the commit phase of a transaction with additional checks. If a transaction wants to commit, it validates its read-set by inspecting whether any other transaction that committed during its lifetime changed an entry in a way that would have influenced its result. If this is the case, the transaction has to abort, as its execution was based on stale reads. To perform the validation, we adopt the efficient approach applied in HyPer, which is in turn based on the technique of precision locking. Essentially, we track the predicate ranges on which the transaction filtered the query result. During validation, it is checked whether any write of any recently committed transaction intersects with the predicate ranges. If an intersection is identified, the transaction aborts.
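The validation step can be illustrated with a simplified Python sketch. It assumes a single attribute, inclusive predicate ranges, and that both the old and the new value of each recent write are tested against the predicates; the function name `validate` and these simplifications are assumptions, not details given in the text.

```python
def validate(read_predicates, recent_writes):
    """Return True if the transaction may commit, False if it must abort.

    read_predicates: list of (low, high) inclusive ranges the transaction filtered on.
    recent_writes:   list of (old_value, new_value) pairs written by transactions
                     that committed during this transaction's lifetime.
    """
    for old_value, new_value in recent_writes:
        for low, high in read_predicates:
            # A write intersects a predicate if the entry fell into the range
            # before the update or falls into it afterwards.
            if low <= old_value <= high or low <= new_value <= high:
                return False
    return True
```

For example, a transaction that filtered on the range 10 to 20 passes validation against a write that changed a value from 5 to 7, but aborts if a value was changed from 5 to 15, because that entry would now qualify for its result.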
2.2 Heterogeneous MVCC
To overcome the aforementioned limitations of classical MVCC implementations, we realize a heterogeneous transaction processing model in AnKerDB. Two components are present side by side: one component is responsible for the concurrent processing of short-running transactions (coined OLTP component in the following), while the other one can perform long-running read-only transactions in parallel (coined OLAP component from here on). Incoming transactions are classified as either OLTP or OLAP transactions and sent to the respective component for processing. The challenge is to combine the concept of heterogeneous processing with MVCC. Let us look at the components in detail using the example shown in Figure 1.
Step : For the following discussion, we assume that our table consists of a single column of rows (identified by row to ), which all contain the value in the beginning. This column is located in the OLTP component and represents the up-to-date representation of the column. Since there are no snapshots present yet, the OLAP component essentially does not exist.
Step : Two OLTP transactions and arrive and intend to perform a set of writes. The first write of () intends to update at row the value with the new value . However, instead of replacing the old value in the column with the new value in-place, we store the new value locally inside the transaction and keep the column untouched as long as the transaction does not commit. In the same fashion, the remaining write of () as well as the write of () are performed only locally inside their respective transactions. Note that all three written values are uncommitted so far and can only be seen by the transactions that performed the respective writes.
Step : Let us now assume that commits while intentionally aborts. The commit of now actually replaces in column at row the old value with the new value . Of course, the old value is not discarded, but stored in a newly created version chain for that row. The same procedure is performed at row where the old value is replaced with the new value , moving the old value into the version chain. Note that we implement a timestamp mechanism (logging both the start and end time of a transaction's commit phase) to ensure that both writes of become visible atomically to other transactions. As no other transactions modified row and during the lifetime of , the commit succeeds and satisfies full serializability, which we guarantee for all transactions. In contrast to that, the abort of simply discards the local change of row . This strategy makes aborts very cheap, as no rollback must be performed.
Step : An OLAP transaction arrives, which intends to scan and sum up the values of all rows of the column, denoted by sum( to ). As no snapshot is present yet to run on, the first snapshot is taken. Using our custom system call (which will be described in Section 4 in detail), we snapshot the column , resulting in a (virtual) duplicate of the column in form of . It is important to understand that this duplicate will become the most recent version of the column in the OLTP component. The “old” column along with its built-up version chains is logically moved to the OLAP component and becomes read-only.
Step : Another OLTP transaction arrives that intends to perform a read followed by two writes (, ). The read is simply performed by accessing the current value of row of the representation in the OLTP view, resulting in . The two successive writes are stored locally inside and are not visible to other transactions. In parallel to the depicted operations of , our OLAP transaction , which sums up the column values, starts executing in the OLAP component on . As the snapshot is older than , it can simply scan the column without inspecting the version chains.
Step : While the scan of is running, decides to commit. This commit does not conflict with the execution of in any way, as the transactions run in different components. The local writes and are materialized in and the old versions are again stored in version chains.
Step : Another snapshot is taken to have a more up-to-date representation ready for incoming OLAP transactions. Again we use our custom system call and snapshot the column that is located in the OLTP component, resulting in a (virtual) duplicate of the column in form of . As before, the roles are changed: The new duplicate becomes the most recent representation of the column in the OLTP component, while with its version chains is moved to the OLAP component. Note that both as well as are now present in the OLAP component side by side, with still running on .
Step : The OLAP transaction finishes its scan and commits. This makes obsolete, as a newer representation already exists. As no transaction is running, we can safely delete the oldest snapshot , as no incoming transaction can access any of its versions anymore.
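The role change between the two components in the steps above can be mimicked with a small Python simulation. It is only a sketch under stated assumptions: the class `HeterogeneousColumn` is hypothetical, a plain list copy stands in for the virtual snapshot created by the custom system call, timestamps are omitted, and the concrete values are illustrative rather than those of Figure 1.

```python
class HeterogeneousColumn:
    def __init__(self, values):
        # Most recent representation, owned by the OLTP component.
        self.oltp = {"values": list(values), "chains": {}}
        # Read-only snapshots owned by the OLAP component, oldest first.
        self.olap = []

    def commit_write(self, row, new_value):
        # Materialize a committed write: old value goes to the head of the chain.
        self.oltp["chains"].setdefault(row, []).insert(0, self.oltp["values"][row])
        self.oltp["values"][row] = new_value

    def take_snapshot(self):
        # The old column and its chains move to the OLAP component; a fresh
        # duplicate without chains becomes the new OLTP representation.
        # (list(...) is a physical copy standing in for a virtual snapshot.)
        old = self.oltp
        self.oltp = {"values": list(old["values"]), "chains": {}}
        self.olap.append(old)
        return old

    def drop_oldest_snapshot(self):
        # Dropping a snapshot discards its version chains with it.
        return self.olap.pop(0)


col = HeterogeneousColumn([1, 1, 1])
col.commit_write(0, 5)               # OLTP commit before the first snapshot
snap = col.take_snapshot()           # an OLAP scan would run on this snapshot
col.commit_write(1, 7)               # later OLTP commit, invisible to the scan
olap_sum = sum(snap["values"])
oltp_sum = sum(col.oltp["values"])
```

The scan on the snapshot is unaffected by the later commit, and the version chain built before the snapshot disappears as soon as `drop_oldest_snapshot` removes that snapshot.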
2.2.2 Snapshot Synchronization
For simplicity, in the previous example all transactions worked solely on a single column. However, a database usually consists of several tables, each equipped with a large number of attributes, and therefore some form of snapshot synchronization is necessary. In this context, snapshot synchronization means that a transaction that accesses multiple columns has to see all columns in a state that is consistent with respect to a single point in time. A trivial way of achieving this is to simply snapshot all columns of all tables when a snapshot is requested. However, this causes unnecessary overhead, as we might access only a small subset of the attributes. Therefore, in AnKerDB, we implement a lazy approach: when a snapshot creation is triggered, only a timestamp for that snapshot is logged and no actual snapshotting is performed yet. If a transaction arrives that accesses a set of columns, it is checked whether there are snapshots present for these columns. If not, they are materialized. This ensures that columns that are never touched are never materialized as snapshots.
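The lazy approach can be sketched as follows in Python. The class `LazySnapshotManager` and its methods are hypothetical names, and a list copy again stands in for the actual snapshot call. The sketch also omits how updates that happen between triggering and materialization are reconciled with the logged timestamp.

```python
class LazySnapshotManager:
    def __init__(self, columns):
        self.columns = columns      # column name -> current values
        self.snapshot_ts = None     # timestamp of the most recently triggered snapshot
        self.materialized = {}      # column name -> snapshot of that column

    def trigger_snapshot(self, ts):
        # Only log the timestamp; no column is snapshotted yet.
        self.snapshot_ts = ts
        self.materialized = {}

    def access(self, names):
        # Materialize snapshots only for the columns a transaction actually touches.
        for name in names:
            if name not in self.materialized:
                self.materialized[name] = list(self.columns[name])
        return {name: self.materialized[name] for name in names}


manager = LazySnapshotManager({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
manager.trigger_snapshot(ts=10)
before = dict(manager.materialized)      # nothing materialized after triggering
view = manager.access(["a", "b"])        # materializes "a" and "b" only
```

Column `c` is never accessed in this example and therefore never materialized.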
2.2.3 Snapshot Consistency
In the previous example, we simply created a snapshot when the individual OLAP transactions required it. In our actual implementation, we trigger a snapshot creation after commits happened to the database. When this happens and the previously described access triggers the actual materialization of the snapshot using our system call, we have to ensure that no other transactions modify the column while the snapshot is under creation. We ensure this using a shared lock on the column, which must be acquired by any transaction to update. When materializing a snapshot, an exclusive lock must be acquired which invalidates all shared locks and blocks further updates until the exclusive lock is released.
3 State-of-the-art Snapshotting
As stated before, our heterogeneous processing model stands and falls with an efficient snapshot creation mechanism. Only if we are able to create snapshots at a high frequency without penalizing the system do we get up-to-date snapshots with short version chains. There exist different techniques to implement such a snapshotting mechanism, including physical and virtual techniques. While the former create costly physical copies of the entire memory, the latter lazily separate snapshots only for modified memory pages. Let us now look at the state-of-the-art techniques in detail to understand why they do not meet our needs and why we have to introduce a completely new snapshotting mechanism in AnKerDB.
3.1 Physical Snapshotting
The most straightforward approach of snapshotting is physical snapshotting, where a deep physical copy of the database is created when a snapshot is taken. On this physical copy, the reading queries can then run in isolation, while the modifying transactions update the original version. The granularity at which the snapshot is taken is up to the implementation. It is possible to snapshot the entire database, a table, or a set of columns. This way of snapshotting obviously represents the eager way of doing it — at the time of snapshot creation, the snapshot and the source are fully separated from each other. As a consequence, any modification to the source is not carried through to the snapshot without further handling.
Obviously, physical snapshotting is a very straightforward approach, that is easy to use. However, its effectiveness is directly bound to the amount of data that is updated on the source. If only a portion of the data is updated, the full physical separation of the snapshot and the source is unnecessary and only adds overhead to the snapshotting cost.
3.2 Virtual Snapshotting
Virtual snapshotting overcomes this problem by following the lazy approach. The idea of virtual snapshotting is that the snapshot and the source are not separated physically when the snapshot is taken. Instead, the separation happens lazily for those memory pages that have actually been modified. As we will see, there are multiple ways to perform this separation using virtual memory. To understand them, let us first go through some of the high-level concepts of the virtual memory subsystem of Linux (kernel 4.8).
3.2.1 Virtual vs Physical Memory
By default, the user perspective on memory is very simple: the user sees only virtual memory.
To allocate a consecutive virtual memory area of size the system call mmap is used.
For instance, the well-known general-purpose memory allocator malloc internally uses mmap to claim large chunks of virtual memory from the operating system. The layer of physical memory is completely hidden and transparently managed by the operating system. Figure 2 visualizes the relationship of the memory types. After allocating the virtual memory area, the user can start accessing the memory area, e.g. via . Obviously, the user perspective is fairly simple. The user basically does not have to distinguish between memory types at all. In comparison, the kernel perspective is significantly more complex.
First of all, the previously described call to mmap, which allocates a consecutive virtual memory area, does not trigger the allocation of physical memory right away. Instead, the call only creates a so-called vm_area_struct (called VMA in the following), which contains all relevant information to describe this virtual memory area. For instance, it stores that the size of the area is and that the start address is . Thus, the set of all VMAs of a process defines which areas of the virtual address space are currently reserved. Note that a single VMA can describe a memory area spanning multiple pages. As an example, in Figure 3 we visualize two VMAs. They describe the virtual memory areas starting at address (spanning four pages) and starting at (spanning three pages), respectively. Between the two memory areas lies an unallocated memory area of two pages.
Besides the VMAs, there exists the concept of the page table within each process. An entry in the page table (called PTE in the following) contains the actual mapping from a single virtual page to a single physical page and is only inserted after the first access to a virtual page, based on the information stored in the corresponding VMA. The example in Figure 3 shows the state of the page table after four accesses to four different pages happened. As we can see, we have one PTE per page in the page table.
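The split between VMAs and PTEs can be illustrated with a toy model in Python. This is a simulation, not kernel code: the class `AddressSpace`, its start address, and the 4096-byte page size are assumptions chosen for the example.

```python
PAGE_SIZE = 4096  # assumed small-page size for this sketch


class AddressSpace:
    def __init__(self):
        self.vmas = []             # reserved areas as (start_address, length_in_bytes)
        self.page_table = {}       # virtual page number -> physical page number (the PTEs)
        self._next_addr = 0x10000  # arbitrary, page-aligned start of the simulated space
        self._next_phys = 0

    def mmap(self, length):
        # Reserving virtual memory only records a VMA; no physical page is assigned.
        pages = -(-length // PAGE_SIZE)  # round up to whole pages
        start = self._next_addr
        self._next_addr += pages * PAGE_SIZE
        self.vmas.append((start, pages * PAGE_SIZE))
        return start

    def touch(self, addr):
        # An access outside every VMA is invalid.
        if not any(start <= addr < start + length for start, length in self.vmas):
            raise MemoryError("address not covered by any VMA")
        # The first access to a page inserts its PTE; later accesses reuse it.
        vpn = addr // PAGE_SIZE
        if vpn not in self.page_table:
            self.page_table[vpn] = self._next_phys
            self._next_phys += 1
        return self.page_table[vpn]
```

After `mmap` of four pages there is one VMA and an empty page table; each first touch of a page adds exactly one PTE, regardless of how many bytes within that page are accessed.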
3.2.2 Fork-based Snapshotting
With the distinction between the different memory types and the separation of VMAs and PTEs in mind, we are now able to understand the most fundamental form of virtual snapshotting: fork-based snapshotting. It exploits the system call fork, which creates a child process of the calling parent process. This child process gets a copy of all VMAs and PTEs of the parent. In particular, this means that after a fork, the allocated virtual memory of the child and the parent share the same physical memory. Only a write to a page of the child or the parent triggers the actual physical separation of that page in the two processes (called copy-on-write or COW), assuming the virtual memory area written to is private (MAP_PRIVATE).
Obviously, this concept can be exploited to implement a form of snapshotting. If the source resides in one process, we simply fork it to create a snapshot. Any modification to the source in the parent process is not visible to queries running on the snapshot in the child process. As mentioned in Section 1.2, early versions of HyPer that implemented heterogeneous processing utilized that mechanism.
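The effect of fork-based snapshotting can be observed with a short Python experiment on a Unix system. It is a minimal sketch rather than the mechanism of any particular database system: the function name `fork_snapshot_demo` and the pipe-based handshake are assumptions made for this example.

```python
import mmap
import os


def fork_snapshot_demo():
    # A private anonymous mapping: writes after fork() are separated by copy-on-write.
    buf = mmap.mmap(-1, mmap.PAGESIZE, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    buf[0:1] = b"A"
    go_r, go_w = os.pipe()      # parent tells the child when to read
    res_r, res_w = os.pipe()    # child reports what it sees
    pid = os.fork()
    if pid == 0:
        # Child: acts as the snapshot. Wait until the parent has modified its copy.
        try:
            os.read(go_r, 1)
            os.write(res_w, buf[0:1])
        finally:
            os._exit(0)
    # Parent: modify the source after the snapshot (the fork) was taken.
    buf[0:1] = b"B"
    os.write(go_w, b"x")
    seen_by_child = os.read(res_r, 1)
    os.waitpid(pid, 0)
    for fd in (go_r, go_w, res_r, res_w):
        os.close(fd)
    current = buf[0:1]
    buf.close()
    return seen_by_child, current
```

The child still reads the value written before the fork, while the parent sees its own update. The price is a complete second process covering the entire address space.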
3.2.3 Rewired Snapshotting
While fork-based snapshotting has the convenient advantage that the snapshotting mechanism is handled by the operating system in a transparent fashion, it has two major disadvantages as well. First, it requires the spawning and management of several processes at a time. Second, it always snapshots all allocated memory of the process, i.e. it cannot be limited to a subset of the data. Both problems can be addressed using our technique of rewiring, which we already applied to the snapshotting problem in our earlier work.
To understand rewiring, let us again look at the mapping from virtual to physical memory as described in Section 3.2.1. By default, this mapping is both hidden from the user and static, as the user sees only virtual memory. This is why we dedicated our recent work on rewiring memory to the reintroduction of physical memory to user space. We bring back this type of memory in the form of so-called main-memory files. As it is possible to freely map virtual memory to main-memory files using the system call mmap, and main-memory files are internally backed by physical memory, we have established a transitive mapping from virtual to physical memory. This mapping can be updated at any time. Figure 4 shows the concept. This means that by using rewiring, we establish a mapping that is both visible and modifiable in user space.
In rewired snapshotting, we utilize this modifiable mapping. Let us assume we have a virtual memory area as shown in Figure 4, on which we want to create a snapshot. To snapshot, we simply allocate a new virtual memory area and rewire it to the file, which represents our physical memory, in the same way as . Consequently, and share the same physical pages. If now a write to a page of is happening, the separation of snapshot and original version must be performed manually on that page, before the write can be carried out. In the first place, we have to detect the write. After detection, we claim an unused page from the file (which serves as our pool for free pages), copy the content of the page over, perform the write, and rewire to map to the new page.
By this, we are able to mimic the behavior of fork while staying within a single process. Further, we can offer the flexibility of snapshotting only a fraction of the data. However, rebuilding the mapping can also be quite expensive as we will see in the following.
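The manual copy-on-write of rewired snapshotting can be modelled in plain Python. This is a simulation only: the classes `MainMemoryFile` and `RewiredArea` are hypothetical, list indices stand in for page offsets in a main-memory file, and the `protected` set stands in for whatever write-detection mechanism is used.

```python
class MainMemoryFile:
    """Stand-in for a main-memory file: a growing pool of physical pages."""

    def __init__(self):
        self.pages = []

    def claim_page(self, content=b""):
        self.pages.append(bytearray(content))  # bytearray(...) copies the content
        return len(self.pages) - 1


class RewiredArea:
    """Virtual memory area whose pages are explicitly mapped to file pages."""

    def __init__(self, file, mapping):
        self.file = file
        self.mapping = list(mapping)  # virtual page index -> file page index
        self.protected = set()        # pages still shared with a snapshot

    def snapshot(self):
        # Rebuild the mapping page by page for a new area (the costly part),
        # and remember that every page of the source is now shared.
        snap = RewiredArea(self.file, self.mapping)
        self.protected = set(range(len(self.mapping)))
        return snap

    def read(self, vpage):
        return bytes(self.file.pages[self.mapping[vpage]])

    def write(self, vpage, data):
        if vpage in self.protected:
            # Manual COW: claim an unused page, copy the content, rewire.
            old_page = self.file.pages[self.mapping[vpage]]
            self.mapping[vpage] = self.file.claim_page(old_page)
            self.protected.discard(vpage)
        self.file.pages[self.mapping[vpage]][:] = data
```

Only the written page gets a new physical page; all other pages remain shared between the source and the snapshot. The loop hidden in `snapshot` corresponds to the page-wise reconstruction of the mapping that makes this approach expensive for large areas.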
3.3 Reevaluating the State-of-the-Art
Having discussed the different state-of-the-art methods of physical and virtual snapshotting, let us now try to understand their individual strengths and limitations. This analysis will point us directly to the requirements for our custom system call, which we will use in AnKerDB to power snapshotting.
In the experiment we are going to conduct in Section 3.3.2, we evaluate the time to create a snapshot in the sense of establishing a separate view on the data. While for physical snapshotting, this means creating a deep physical copy of the data, for virtual snapshotting, it does not trigger any physical copy of the data. Still, virtual snapshotting has to perform a certain amount of work, as we will see. We will perform the experiment as a stand-alone micro-benchmark to focus entirely on the snapshotting costs and to avoid interference with other components that are present in a complex transactional processing system like AnKerDB. We use a table with columns that is stored in a columnar fashion, where each column has a size of MB.
The question remains which page size to use. To make snapshotting as efficient as possible, we want to back our memory with pages as small as available. This ensures that the overhead of copy-on-write on the level of page granularity is minimal. Consider the case where our MB column is either backed by huge pages or small pages. In the former case, uniformly distributed writes would cause a COW of the entire column (MB) in the worst case, resulting in a full physical separation of the snapshotted column and the base column. In the latter case, writes would trigger COW of only small pages (KB), physically separating only of the snapshotted column from the base column.
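The effect of the page size on the worst-case COW volume can be made concrete with a short calculation. The numbers below are illustrative assumptions only (a 200 MiB column, 100 writes, 2 MiB huge pages, and 4 KiB small pages); they are not taken from the experiment.

```python
MIB = 1024 * 1024
KIB = 1024


def worst_case_cow_bytes(column_bytes, page_bytes, num_writes):
    """Upper bound on copied bytes if every write hits a different page."""
    num_pages = -(-column_bytes // page_bytes)  # ceiling division
    return min(num_writes, num_pages) * page_bytes


column = 200 * MIB   # hypothetical column size
writes = 100         # hypothetical number of uniformly distributed writes

huge = worst_case_cow_bytes(column, 2 * MIB, writes)
small = worst_case_cow_bytes(column, 4 * KIB, writes)

# Fraction of the column that is physically separated from the base column.
print(huge / column, small / column)
```

Under these assumptions, huge pages separate the entire column, whereas small pages separate only about 0.2% of it.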
3.3.1 System Setup
We perform all of the following experimental evaluations on a server consisting of two quad-core Intel Xeon E5-2407 CPUs running at a clock frequency of GHz. The CPU supports neither hyper-threading nor turbo mode. The sizes of the L1 and L2 caches are KB and KB, respectively, whereas the shared L3 cache has a size of MB. The processor can cache 64 entries in the fast first-level data-TLB for virtual to physical 4KB page address translations. In a slower second-level TLB, 512 translations can be stored. For MB huge pages, the TLB can cache 32 translations in the L1 dTLB. In total, the system is equipped with GB of main memory, divided into two NUMA regions. For all experiments, we deactivate one CPU and the attached NUMA region to stay local on one socket. The operating system is a 64-bit version of Debian 8.16 with a custom Linux kernel, version 4.8.17. The codebase is written in C++ and compiled using g++ 6.3.0 with optimization level O3.
3.3.2 Creating a Snapshot
To simulate snapshotting on a subset of the data, we create a snapshot on the first columns of the table . Let us precisely define how the individual snapshotting techniques behave in this situation:
(a) Physical: to create a snapshot of columns of table , we allocate a fresh virtual memory area of size pages. Then, we copy the content of columns of into using memcpy. represents the snapshot.
(b) Fork-based: to create a snapshot of columns of table , we create a copy of the process containing table using the system call fork. Independent of , this snapshots the entire table. The first columns of table contained in the forked process represent the snapshot. The virtual memory areas representing and are declared as private, such that writes to one area are not propagated to the other area.
(c) Rewiring: to create a snapshot of columns of table , we first have to inspect by how many VMAs each column is actually described. As a VMA describes the common properties of a consecutive virtual memory region, it is possible that a column is described by only a single VMA (best case), by one VMA per page (worst case), or anything in between. The more writes happened to a column and the more copy-on-writes were performed, the more VMAs a column is backed by. Eventually, every page is described by its individual VMA.
To create the snapshot, we first allocate a fresh virtual memory area of size pages. For each VMA that is backing a portion of the columns in , we now rewire the corresponding portion of to the same file offset. Additionally, we use the system call mprotect to set the protection of to read-only. This is necessary to detect the first write to a page to perform a manual copy-on-write. represents the snapshot.
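To see why the number of VMAs matters, the following Python sketch counts how many separate mappings (and thus mmap calls) a rewired snapshot needs for a given column. It is a simplified model, assuming that consecutive virtual pages wired to consecutive file pages are described by a single VMA; the function name vma_runs and the page numbers are hypothetical.

```python
def vma_runs(file_pages):
    """Group a column's pages into maximal runs of consecutive file pages.

    file_pages[i] is the file page backing virtual page i. Each run is
    returned as [first_virtual_page, first_file_page, length] and
    corresponds to one mmap call during snapshot creation.
    """
    runs = []
    for vpage, fpage in enumerate(file_pages):
        if runs and fpage == runs[-1][1] + runs[-1][2]:
            runs[-1][2] += 1                 # extends the current run
        else:
            runs.append([vpage, fpage, 1])   # starts a new run
    return runs


fresh = list(range(8))            # untouched column: one contiguous mapping
one_write = list(fresh)
one_write[3] = 100                # page 3 was copied to pool page 100
scattered = list(range(15, 7, -1))  # every page wired to a non-adjacent page

print(len(vma_runs(fresh)), len(vma_runs(one_write)), len(vma_runs(scattered)))
```

A single copy-on-write in the middle of a column already splits one mapping into three, and in the worst case every page requires its own mmap call.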
Table 1 shows the results. We vary , the number of columns to snapshot, from 1 column ( of the table) over 25 columns ( of the table) to 50 columns ( of the table) and show the runtime in ms to create the snapshot. For rewiring, we vary the number of pages that have been modified (by writing the first B of the page) before the snapshot is taken, as it influences the runtime. We test the case where no write has happened and each column is backed by a single VMA. Further, we measure the snapshotting cost after pages, pages, and pages have been modified. These numbers of writes lead to , , and VMAs backing a column, respectively.
First of all, we can see that physical snapshotting is quite expensive, as it creates a deep copy of the columns already at snapshot creation time. As expected, we can observe a linearly increasing cost with the number of columns to snapshot. In contrast to that, fork-based snapshotting is independent of the number of requested columns, as it snapshots the entire process with the entire table in any case. When snapshotting of the table, fork-based snapshotting is over an order of magnitude faster than physical snapshotting, as it duplicates solely the virtual memory, consisting of the VMAs and the page table. The runtime of rewiring is highly influenced by the number of written pages and thus by the number of VMAs per column. The more VMAs we have to touch to create the snapshot, the higher the runtime. If we have as many VMAs as pages (which is essentially the case after writes), the performance of rewiring is roughly equal to that of physical snapshotting. However, we can also see that rewiring is significantly faster than the remaining methods if fewer VMAs need to be copied. For instance, after writes, rewiring is around two orders of magnitude faster for a single column and still almost a factor of two faster for snapshotting the entire table.
| Method | Pages Modified per Col | 1 Col [ms] | 25 Col [ms] | 50 Col [ms] |
3.3.3 Summary of Limitations
Obviously, the performance of rewiring for snapshot creation is highly influenced by the number of VMAs per column. For every VMA, a separate mmap call must be carried out, which is a significant cost if the number of VMAs is large. Unfortunately, when using rewiring, an increase in the number of VMAs over time is unavoidable.
Still, we believe in rewiring for efficient snapshotting. It simply cannot show its full potential. If we carefully inspect the description of rewired snapshotting in Section 3.3.2 again, we can observe that we actually implemented a workaround for the limitations of the OS. We manually rewire the virtual memory areas described by the individual VMAs to create a snapshot, because there is no way to simply copy a virtual memory area. We perform another pass over the VMAs to set the protection to read-only using the system call mprotect, instead of setting it directly when copying the virtual memory area.
Obviously, we hit the limits of the vanilla kernel. Therefore, in the following section, we will propose a custom system call that tackles these limitations, leading to a much more straightforward and efficient implementation of virtual snapshotting, which we will finally use in AnKerDB.
4 System Call vm_snapshot
As we have seen the limitations of the state-of-the-art kernel in the previous section, let us discuss how we can overcome them.
4.1 Snapshotting Virtual Memory
In our implementation of rewired snapshotting, we have experienced the need to directly snapshot virtual memory areas. By default, this is not supported by the kernel. As a workaround, we had to rewire a fresh virtual memory area in the same way as the source area. This is a very costly process as it involves repetitive calls to mmap.
To solve this problem, we have to introduce a new system call that will be the core component of our snapshotting mechanism. Before doing this, let us precisely define what snapshotting a virtual memory area means in this context. Assume we have a mapping from virtual to physical pages as shown in Figure 2 of Section 3.2.1, starting at virtual address . As we can see, the first virtual page covering the virtual address space () is mapped to the physical page . The second virtual page covering virtual address space () is mapped to another physical page and so on. Now, we want to create a new virtual memory area starting at a new virtual address (let us call it ) that maps to the same physical pages. Thus, the virtual page covering should map to , the virtual page covering should map to and so on. We define the following system call to encapsulate the described semantics:
This system call takes the src_addr of the virtual memory area to snapshot and the length of the area to copy in bytes. Both src_addr and length must be page aligned. It returns the address of a new virtual memory area of size length, that is a snapshot of the virtual memory area starting at src_addr. The new memory area uses the same update semantics as the source memory area, i.e. if the virtual memory area at src_addr has been declared using MAP_PRIVATE | MAP_ANONYMOUS, the new memory area is declared in the same way.
Implementing a system call that modifies the virtual memory subsystem of Linux is a delicate challenge. In the following, we will provide a high-level description of the system call behavior. For the interested reader, we provide a more detailed discussion in Appendix A as well as the actual source code, which will be released along with this paper. On a high level, vm_snapshot internally performs the following steps: (1) Identify all VMAs that describe the virtual memory area . (2) Reserve a new virtual memory area of size length starting at virtual address dst_addr. (3) Copy all of the previously identified VMAs and update them to describe the corresponding portions of virtual memory in . (4) For each VMA which describes a private mapping (which is the standard case in AnKerDB), additionally copy all existing PTEs and update them to map the corresponding virtual pages in .
This system call vm_snapshot will form the core component of creating snapshots on columns in AnKerDB. It is the call that we use in Figure 1 in Step and Step .
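The semantics of vm_snapshot for private memory can be mimicked with a toy page table in Python. This is only a conceptual model under several assumptions: the class AddressSpace and its methods are hypothetical, a dictionary stands in for the page table, copy-on-write is triggered by checking whether a physical page is still shared, and writes that cross a page boundary are not handled.

```python
PAGE = 4096


class AddressSpace:
    """Toy model: virtual page number -> physical page number."""

    def __init__(self):
        self.page_table = {}
        self.phys = {}          # physical page number -> page content
        self.next_vpage = 16    # arbitrary start of the toy address space
        self.next_ppage = 0

    def _new_phys(self, content=None):
        p = self.next_ppage
        self.next_ppage += 1
        self.phys[p] = bytearray(content) if content is not None else bytearray(PAGE)
        return p

    def allocate(self, length):
        first, n = self.next_vpage, length // PAGE
        self.next_vpage += n
        for v in range(first, first + n):
            self.page_table[v] = self._new_phys()
        return first * PAGE

    def vm_snapshot(self, src_addr, length):
        """Return a new area that maps to the same physical pages as the source."""
        assert src_addr % PAGE == 0 and length % PAGE == 0
        src, n = src_addr // PAGE, length // PAGE
        dst = self.next_vpage
        self.next_vpage += n
        for i in range(n):
            self.page_table[dst + i] = self.page_table[src + i]
        return dst * PAGE

    def read(self, addr, size):
        page = self.phys[self.page_table[addr // PAGE]]
        return bytes(page[addr % PAGE:addr % PAGE + size])

    def write(self, addr, data):
        v = addr // PAGE
        p = self.page_table[v]
        if sum(1 for q in self.page_table.values() if q == p) > 1:
            p = self._new_phys(self.phys[p])   # copy-on-write of a shared page
            self.page_table[v] = p
        off = addr % PAGE
        self.phys[p][off:off + len(data)] = data


mem = AddressSpace()
col = mem.allocate(2 * PAGE)
mem.write(col, b"v1")
snap = mem.vm_snapshot(col, 2 * PAGE)
mem.write(col, b"v2")                 # separates only the written page
print(mem.read(col, 2), mem.read(snap, 2))
```

The snapshot is created without copying any page content; only the page that is written afterwards is duplicated, while the second page stays shared.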
4.1.3 Snapshotting to Existing Virtual Memory Area
So far, our system call vm_snapshot returns the snapshot in the form of the start address of a new virtual memory area. However, there might be situations in which we would like to realize the snapshot in an existing virtual memory area. Therefore, we extend our system call by adding a third argument dst_addr:
If dst_addr is NULL, vm_snapshot provides the semantics described in Section 4.1.1, returning the address of a new virtual memory. If dst_addr is a valid address, the snapshot of is created in . If is not (entirely) allocated, the call fails.
This extension to vm_snapshot allows us to reuse previously allocated virtual memory areas. For instance, when replacing an outdated snapshot of a column with a fresh one, we can simply “recycle” its allocated virtual memory area.
Let us now see how our custom system call vm_snapshot performs in comparison with its direct competitor rewiring. We excluded the baselines of physical snapshotting and fork-based snapshotting, as they are already out of consideration for AnKerDB due to high cost and low flexibility. We first look again at the snapshot creation time for a single column of MB. The previous experiment presented in Table 1 showed that rewiring is highly influenced by the number of VMAs that are backing the column to snapshot. To analyze this behavior in comparison with vm_snapshot, we run the following experiment: for each of the pages of the column, we perform exactly one write to the first B of the page. In the case of rewiring, this write triggers the COW of the touched page and thus creates a separate VMA describing it. After each and every write, we create a new snapshot of the column and report the time of snapshot creation.
Let us look at the results in Figure 5(a). As predicted, the snapshot creation cost of rewiring is highly influenced by the number of VMAs, which increases with every modified page. To visualize this correlation, we plot the number of VMAs per column for rewiring alongside the snapshot creation time. In contrast to rewiring, our system call vm_snapshot shows both a very stable and low runtime over the entire sequence of writes. After only around writes have happened (see zoom-in of Figure 5(a)), the snapshotting cost of vm_snapshot already becomes lower than that of rewiring. After all writes have been carried out, vm_snapshot is x faster than rewiring. This shows the tremendous effect of avoiding repetitive system calls to mmap.
However, snapshot creation time is not the sole cost to optimize for. We should also look at the actual cost of writing the virtual memory. In the case of rewiring, the triggered COW is handled manually by copying the page content to an unused page and rewiring that page into the column. In the case of vm_snapshot, which works on anonymous memory and relies on the COW mechanism of the operating system, no manual handling is necessary. This becomes visible in the runtime shown in Figure 5(b). Writing a page of the column snapshotted by vm_snapshot is up to x faster than writing to one created by rewiring. The reason for this is that the entire COW is handled by the operating system. No protection must be set manually, and no signal handler is necessary to detect the write to a page.
5 Experimental Evaluation
After the description of AnKerDB’s system design and the introduction of our custom system call vm_snapshot to efficiently snapshot virtual memory areas, let us now start with the experimental evaluation of the actual system. As AnKerDB relies on a heterogeneous processing model, we want to test it against the homogeneous counterparts. AnKerDB is designed in a way to also support homogeneous processing via configuration by disabling snapshotting.
5.1 AnKerDB Configurations
Let us look at the different configurations we are going to evaluate:
1. Homogeneous processing, full serializability. We configure AnKerDB such that no snapshots are taken at all. Thus, there is only the OLTP component with the most recent representation of the database. Both OLTP and OLAP transactions run in the OLTP component under full serializability guarantees. As in this setup version chains build up that are not discarded automatically with snapshots, a garbage collection mechanism is necessary. We use a thread that makes a pass over the version chains every second and deletes all versions that are older than the oldest transaction in the system.
2. Homogeneous processing, snapshot isolation. As in (1.), no snapshots are taken. There is only the OLTP component with the most recent representation of the database. Both OLTP and OLAP transactions run in the OLTP component under snapshot isolation guarantees and thus, the read set validation is not performed. The same garbage collection mechanism as in (1.) is applied.
3. Heterogeneous processing, full serializability. The OLTP transactions run in the OLTP component and the OLAP transactions run in the OLAP component. The creation of snapshots works as described in Section 2.2.2: after a certain number of commits to the database has been registered ( in the upcoming experiments), the system sets a snapshot timestamp that will mark the time of the snapshot to create. The very next access that a column receives will now trigger the actual snapshot creation using our system call. By this, we are able to generate snapshots that are consistent with respect to a single point in time but that are also created in a lazy fashion based on the actual access pattern.
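The lazy snapshot creation of the heterogeneous configuration can be sketched as follows. This is a strongly simplified Python model under stated assumptions: the class LazySnapshotter is hypothetical, an epoch counter stands in for the snapshot timestamp, a list copy stands in for the call to vm_snapshot, and in-flight (uncommitted) writes as well as version chains are ignored.

```python
class LazySnapshotter:
    """Sets a snapshot epoch every N commits; snapshots columns on next access."""

    def __init__(self, columns, commits_per_snapshot):
        self.columns = columns                  # name -> list of values (live data)
        self.commits_per_snapshot = commits_per_snapshot
        self.commits = 0
        self.epoch = 1                          # stands in for the snapshot timestamp
        self.snapshots = {}                     # name -> (epoch, frozen column)

    def commit(self):
        self.commits += 1
        if self.commits % self.commits_per_snapshot == 0:
            self.epoch += 1                     # only marks the time, copies nothing

    def _touch(self, name):
        taken_at = self.snapshots.get(name, (0, None))[0]
        if taken_at < self.epoch:               # first access since the epoch was set
            self.snapshots[name] = (self.epoch, list(self.columns[name]))

    def write(self, name, row, value):
        self._touch(name)                       # snapshot before the column changes
        self.columns[name][row] = value

    def read_snapshot(self, name):
        self._touch(name)
        return self.snapshots[name][1]


db = LazySnapshotter({"a": [1, 2], "b": [3, 4]}, commits_per_snapshot=2)
db.write("a", 0, 10)
db.commit()
db.write("a", 1, 20)
db.commit()                                     # second commit sets a new epoch
db.write("a", 0, 99)                            # triggers the snapshot of column a
print(db.read_snapshot("a"), db.columns["a"], "b" in db.snapshots)
```

Since a column is snapshotted before the first access after the epoch was set can modify it, all snapshots of one epoch reflect the same point in time, while columns that are never accessed are never snapshotted.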
5.2 Experimental Setup
To evaluate the system under complex transactions, we define the following mixture of OLTP and OLAP transactions:
On the side of OLAP, we form transactions based on queries of the TPC-H benchmark. Precisely, we pick the single-table queries Q1 and Q6 (LINEITEM) and Q4 (ORDERS) as well as the two-table query Q17 (joining LINEITEM and PART) as good representatives. For each fired OLAP transaction, we pick the configuration parameters of the query randomly within the bounds given in the TPC-H specification. Additionally, for each table (LINEITEM, ORDERS, and PART) we add a simple scan transaction that runs over the respective table. Thus, in total, we have 7 OLAP transactions.
On the OLTP side, we introduce artificial transactions. Instead of relying on queries given by a transactional benchmark (like TPC-C), we decided to introduce hand-tailored transactions. The reason for this is that the transactions specified in benchmarks are typically quite large and thus very hard to control and to configure. As a consequence, results that are based on these transactions are even harder to interpret. However, since our system design is focused on improving the OLAP throughput, we need controllable OLTP transactions to precisely adjust the OLTP load on the system. This allows us to carefully inspect the impact on the OLAP side. In this regard, we introduce the set of OLTP transactions as depicted in Figure 6. The question marks denote the transaction parameters that we set when firing the transactions. For the VARCHAR attributes l_returnflag, l_linestatus, o_orderpriority, and p_brand, we pick an existing value from the column in a uniform and random fashion. For the DOUBLE attributes l_discount, l_extendedprice, o_totalprice, and p_retailprice, which we update in the transactions, we take the current value at the selected row and increment it by with . In the same fashion, the DATE attribute l_shipdate is updated by incrementing the current value by days, with .
5.3 OLAP Transaction Latency
Let us start the evaluation by looking at the transaction latency in Figure 7. Precisely, we want to identify the response time for an individual OLAP transaction if the system is under load.
To measure the latency of an OLAP transaction, we put the system under load by executing OLTP transactions picked randomly from the set of transactions described in Figure 6. These OLTP transactions are processed by threads while the th thread answers the OLAP transaction for which we want to measure the latency. As described before, a snapshot creation is triggered every commits. To get stable results, we fire the OLAP transaction five times in total, measure the latency for each, and use the average. We perform this experiment for the two homogeneous baseline configurations as well as for our heterogeneous setup and report the latency of the baselines normalized with respect to our heterogeneous approach.
In Figure 7, we can see that for all OLAP transactions, heterogeneous processing achieves a significantly lower latency than the homogeneous baseline configurations. Our approach is around x to x faster depending on the tested OLAP transaction. The reason for this is that under heterogeneous processing, the OLAP transactions run entirely on snapshots in the OLAP component. While the OLAP transaction is running, the OLTP transactions perform the updates in complete isolation inside the OLTP component. In contrast to that, in the case of homogeneous processing, the updates push new versions, which are possibly relevant for the OLAP transactions, into the version chain. This results in expensive repetitive checks of timestamps at access time, heavily slowing down the scans performed by the OLAP transactions. In contrast to that, the OLAP transaction running on the snapshot can scan the column entirely in-place in a tight loop, without considering the version chains at all.
5.4 Transaction Throughput
Let us now look at a traditional property in estimating the quality of a transaction processing system: the throughput at which a batch of transactions can be answered from end to end. To find out, we perform the two experiments that are depicted in Figure 8.
In the first experiment, presented by the violet bars, we fire OLTP transactions and process them with all threads of our system. As before in the latency experiment, every commits we create a fresh snapshot. As expected, the throughput under snapshot isolation is the highest of all configurations, as no commit phase validation must be performed. More interesting for us is the fact that the OLTP throughput under the heterogeneous processing model equals the one under homogeneous processing. This essentially means that our heterogeneous design including snapshotting, which aims at improving OLAP processing, does not negatively influence OLTP throughput.
In the second experiment, depicted in the orange bars, we want to evaluate the throughput under a mixed workload. In addition to firing OLTP transactions, we also fire OLAP transactions picked from the set of TPC-H transactions and the full table scans that we specified before in Section 5.2. As we can see, the mixed workload is where the heterogeneous design shines. Heterogeneous processing achieves a throughput that is almost by a factor of higher than the baselines. This shows the importance of separating OLAP from OLTP transactions in different processing components.
5.5 MVCC Scan Performance
In the previous section, we have seen that for mixed workloads, the throughput is significantly higher under our heterogeneous design. This is largely caused by the fact that OLAP transactions can simply scan the snapshotted column(s) in a tight loop instead of inspecting timestamps and traversing version chains. To investigate the problems connected with running OLAP transactions over versioned columns, we perform the experiment shown in Figure 9, which resembles executing mixed workloads under homogeneous processing.
In this experiment, we vary the number of rows in the table that are versioned for LINEITEM, ORDERS, and PART and measure the time it takes to perform a full scan of the table. The versioned rows are uniformly distributed across the table. To improve scan performance in the presence of versioned rows, we apply an optimization technique introduced by HyPer: for every rows, we keep the position of the first and of the last versioned row. With this information, it is possible to scan in tight loops between versioned records without performing any checks.
Nevertheless, in Figure 9 we can see that this optimization cannot defuse the problem entirely. With an increase in the number of versioned rows, we see a drastic increase in the runtime of the scan as well. Scanning a table that is completely versioned takes around times longer than scanning an unversioned table. This unversioned table essentially resembles the situation when scanning in a snapshot under heterogeneous processing.
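The scan optimization described above can be sketched in Python as follows. The sketch rests on several assumptions: the block size of 4 rows is chosen for illustration only, the function names are hypothetical, and the version chain lookup is reduced to a dictionary that holds, for each versioned row, the value visible to the scanning transaction.

```python
def build_summaries(versioned_rows, num_rows, block):
    """Per block of `block` rows: (first, last) versioned row, or None."""
    summaries = []
    for start in range(0, num_rows, block):
        end = min(start + block, num_rows)
        in_block = [r for r in versioned_rows if start <= r < end]
        summaries.append((min(in_block), max(in_block)) if in_block else None)
    return summaries


def scan_sum(column, visible, summaries, block):
    """Sum a column; per-row checks happen only between first and last."""
    total = 0
    for b, summary in enumerate(summaries):
        start = b * block
        end = min(start + block, len(column))
        if summary is None:
            total += sum(column[start:end])      # tight loop, no checks
            continue
        first, last = summary
        total += sum(column[start:first])        # tight loop before first
        for row in range(first, last + 1):       # checked range
            total += visible.get(row, column[row])
        total += sum(column[last + 1:end])       # tight loop after last
    return total


column = list(range(10))
visible = {2: 100, 3: 200, 9: 0}   # visible value per versioned row (stand-in)
summaries = build_summaries(visible.keys(), len(column), block=4)
print(summaries, scan_sum(column, visible, summaries, block=4))
```

The more rows are versioned, the larger the checked range of each block becomes, which is consistent with the degradation observed for heavily versioned tables.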
5.6 Snapshotting Creation Cost
After inspecting the performance of transaction processing in the form of latency in Section 5.3, throughput in Section 5.4, and MVCC scan performance in Section 5.5, let us inspect the cost of snapshot creation in AnKerDB. Due to our flexible system call vm_snapshot, we are able to snapshot at the granularity of individual columns.
To demonstrate the benefit of this flexible approach, we present in Figure 10 the cost of snapshotting the individual columns of the LINEITEM, ORDERS, and PART tables of the TPC-H benchmark inside of AnKerDB in the form of stacked bars. Each layer in a bar represents the cost of snapshotting a single column of the respective table. The bar All presents the cost of snapshotting all three tables. In comparison, we show the cost of forking the process in which AnKerDB is running using the system call fork. We make sure that when performing the fork, the process is in the same state as when performing the snapshotting using vm_snapshot. At this point in time, the AnKerDB process has a size of GB in terms of virtual memory.
As we can see in Figure 10, the cost of snapshotting individual columns of the TPC-H tables is negligible. Thus, if a transaction accesses only a portion of the attributes, the cost of preparing the snapshot stays as low as possible as well. Nevertheless, even when snapshotting all columns of all tables, our approach is considerably cheaper than using the fork system call. The problem of fork is that the virtual memory of the entire process containing GB of virtual memory is replicated. Besides the tables, which consume only around GB of memory, this includes the used indexes, the version chains, the timestamp arrays, and various meta-data structures.
Our system essentially implements parallelism on two layers: On the first layer, we parallelize OLTP and OLAP execution by maintaining the two processing components. On the second layer, we apply MVCC inside each component to ensure a high concurrency among transactions of a single type. In this regard, let us now investigate how well the design scales with the number of available threads.
In Figure 11, we repeat the experiment measuring the throughput from Section 5.4 for heterogeneous processing and vary the number of available threads from thread to threads. As shown in Figure 8, we evaluate a pure OLTP workload consisting of transactions as well as a mixed workload that additionally runs OLAP transactions.
As we can see, the system scales sub-linearly with the number of available threads. In comparison to single-threaded execution, using threads results in a higher throughput of around x for the OLTP workload and around x for the mixed workload. The reason for this is that the commit phase validation, which is required for OLTP transactions to ensure full serializability, has to be partially sequential. For instance, a list of recently committed transactions, which must be mutex protected, is maintained to organize validation. Therefore, irrespective of our heterogeneous design, concurrent OLTP transaction processing under full serializability is limited by the validation phase.
In this work, we introduced AnKerDB, a transactional processing system implementing heterogeneous processing in combination with MVCC, which works hand in hand with a customized Linux kernel to enable snapshotting at a very high frequency. We have shown that a heterogeneous design powered by a lightweight snapshotting mechanism fits naturally to mixed OLTP/OLAP workloads and enhances the throughput of analytical transactions by factors x to x, as it enables very fast scans in tight loops. Besides, due to the flexibility of our custom system call vm_snapshot, we are able to limit the snapshotting effort to those columns that are actually accessed by transactions, heavily reducing the snapshotting overhead in comparison to classical approaches.
- TPC-H Benchmark: http://www.tpc.org/tpch/.
- P. A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987.
- F. Färber, S. K. Cha, J. Primsch, C. Bornhövd, S. Sigg, and W. Lehner. SAP HANA database: data management for modern business applications. SIGMOD Record, 40(4):45–51, 2011.
- A. Kemper and T. Neumann. HyPer: A hybrid OLTP & OLAP main memory database system based on virtual memory snapshots. In ICDE 2011, pages 195–206.
- T. Neumann, T. Mühlbauer, and A. Kemper. Fast serializable multi-version concurrency control for main-memory database systems. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015, pages 677–689. ACM, 2015.
- D. R. K. Ports and K. Grittner. Serializable snapshot isolation in PostgreSQL. PVLDB, 5(12):1850–1861, 2012.
- K. A. Ross, D. Srivastava, and D. Papadias, editors. Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, June 22-27, 2013. ACM, 2013.
- F. M. Schuhknecht, J. Dittrich, and A. Sharma. RUMA has it: Rewired user-space memory access is possible! PVLDB, 9(10):768–779, 2016.
- G. Weikum and G. Vossen. Transactional Information Systems: Theory, Algorithms, and the Practice of Concurrency Control and Recovery. Morgan Kaufmann, 2002.
- Y. Wu, J. Arulraj, J. Lin, R. Xian, and A. Pavlo. An empirical evaluation of in-memory multi-version concurrency control. PVLDB, 10(7):781–792, 2017.
Appendix A VM_SNAPSHOT Implementation Details
For the interested reader, the following section provides a detailed description of the implementation of vm_snapshot.
Check if the virtual memory area to snapshot in the range is actually allocated. If not, the call fails with return value MAP_FAILED and sets errno accordingly.
Identify all VMAs that describe the virtual memory area . This might be one VMA or multiple ones. Let us call them in the following VMA to VMA, if VMAs describe the area.
It is possible that VMA and VMA, the VMAs describing the borders of the virtual memory area, span larger than the area to replicate. This can be the case if virtual memory before src_addr or after is currently allocated as well. In this case, we split VMA and VMA at src_addr respectively . If a split happens, we update VMA and VMA to the VMAs that now exactly match the borders of the region to replicate.
If dst_addr is NULL, reserve a new virtual memory area of size length in the kernel, starting at address dst_addr. If dst_addr is not NULL, check whether is already reserved and fail if not.
Iterate over VMA to VMA. Let us refer to the current item as VMA. Further, let us define size(VMA) as the size of the described virtual memory area and offset(VMA) as the address of the described virtual memory area relative to src_addr. Now, we create an exact copy of VMA and update the virtual memory area described by it to [dst_addr + offset(VMA), dst_addr + offset(VMA) + size(VMA)).
Further, we check whether VMA describes a shared or a private virtual memory area. If VMA is shared, nothing more has to be done for this VMA. If VMA is private, we additionally have to modify the page table, if there exist PTEs for the virtual memory area that VMA is describing. In this case, we identify all PTEs, which relate to VMA, as PTE to PTE.
Iterate over PTE to PTE. Let us refer to the current item as PTE. If pageoffset(PTE) returns the address of the mapped virtual page relative to src_addr, we create a copy of PTE and update the start address of the mapped virtual page in the copy to dst_addr + pageoffset(PTE). This step is necessary for private VMAs, as any write that is happening to the described virtual memory area results in a copy-on-write, that is handled with an anonymous physical page. As the information about the physical page is not present in the VMA but only in the corresponding PTE, we have to modify the page table in this case.
After these steps, the virtual memory area contains the snapshot and can be accessed.
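The steps above can be summarized in a user-space Python model. It is a sketch under stated assumptions and not the kernel implementation: VMAs are plain records, the page table is a dictionary from virtual page address to physical page, border VMAs are clipped in the copy instead of being split in place, and the destination address is passed in by the caller and assumed to be reserved already. All names are hypothetical.

```python
from dataclasses import dataclass, replace

PAGE = 4096


@dataclass(frozen=True)
class VMA:
    start: int      # inclusive
    end: int        # exclusive
    private: bool


def vm_snapshot(vmas, ptes, src_addr, length, dst_addr):
    """Return the VMAs and PTEs that describe the snapshot at dst_addr.

    vmas must be sorted by start address; ptes maps the address of a
    virtual page to its physical page.
    """
    src_end = src_addr + length
    covering = [v for v in vmas if v.start < src_end and v.end > src_addr]

    # Fail if the source range is not entirely allocated.
    cursor = src_addr
    for v in covering:
        if v.start > cursor:
            raise ValueError("source range contains a hole")
        cursor = max(cursor, v.end)
    if cursor < src_end:
        raise ValueError("source range is not entirely allocated")

    shift = dst_addr - src_addr
    new_vmas, new_ptes = [], {}
    for v in covering:
        # Clip border VMAs to the range, then move the copy to the destination.
        start, end = max(v.start, src_addr), min(v.end, src_end)
        new_vmas.append(replace(v, start=start + shift, end=end + shift))
        if v.private:
            # Private areas: existing PTEs must be copied as well.
            for page in range(start, end, PAGE):
                if page in ptes:
                    new_ptes[page + shift] = ptes[page]
    return new_vmas, new_ptes


vmas = [VMA(0, 3 * PAGE, True), VMA(3 * PAGE, 5 * PAGE, False)]
ptes = {1 * PAGE: 7, 2 * PAGE: 8, 3 * PAGE: 9}
snap_vmas, snap_ptes = vm_snapshot(vmas, ptes, 1 * PAGE, 3 * PAGE, 10 * PAGE)
print(snap_vmas, snap_ptes)
```

In this example, the private VMA is clipped at the start of the source range and its two PTEs are copied, while the PTE of the shared VMA is left out, since nothing beyond the VMA copy is required for shared areas.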