Transparent Concurrency Control: Decoupling Concurrency Control from DBMS

02/02/2019
by Ningnan Zhou, et al.

For performance reasons, conventional DBMSes adopt monolithic architectures. A monolithic design cripples the adaptability of a DBMS, making it difficult to customize to meet the particular requirements of different applications. In this paper, we propose to completely separate the code of concurrency control (CC) from a monolithic DBMS. This allows us to add functionalities or data structures to a DBMS, or remove them, without worrying about data consistency. As the separation deprives the concurrency controller of knowledge about data organization and processing, it may incur severe performance issues. To minimize the performance loss, we devised a two-level CC mechanism. At the operational level, we propose a robust scheduler that guarantees to complete any data operation at a manageable cost. At the transactional level, the scheduler can utilize data semantics to achieve enhanced performance. Extensive experiments were conducted to demonstrate the feasibility and effectiveness of our approach.


1 Introduction

Existing implementations of DBMSes are mostly monolithic. This goes against the common practice of software engineering, where separation of concerns is an important principle. Such a monolithic design can be attributed to both tradition and performance considerations [6, 18], which we believe are no longer valid in today's computing environment. On the one hand, applications are diversifying. They impose increasingly diverse requirements on DBMSes, in terms of both functionality and performance. To meet these requirements, application developers are increasingly incentivized to customize DBMSes, for instance, by adding new data types or indexing schemes. On the other hand, hardware and platforms are evolving rapidly. We are constantly forced to modify a DBMS to make the best of new hardware. A monolithic design unavoidably makes a DBMS difficult to modify or customize. We believe it is time to consider a loosely coupled DBMS architecture, which is adaptable to diverse applications and platforms.

Attempts at DBMS decomposition date back two decades [1, 5], with limited progress and success. It has been commonly accepted that a DBMS should be broken into several standard components, such as an interpreter, a query processor, a transaction manager, a storage manager, etc. However, existing DBMSes largely regard this decomposition as an explanatory breakdown instead of a guideline for modularization. Only in recent years have limited but concrete efforts to decompose a DBMS become visible. The Deuteronomy project of Microsoft [18, 19, 16, 17] is a typical example; it attempted to decouple the transaction manager from the storage manager of a distributed database. Another example is today's "big data" platforms, such as Hadoop, which separate the data processor from the storage manager to achieve extensibility. Despite these efforts and their inspiring results, the answer to the problem of DBMS decomposition remains inconclusive.

Among all the coupling points in a DBMS, the one between the transaction manager and the data manager appears the most challenging to break [9]. In practice, it also causes the most pain to engineers who attempt to modify a DBMS. When adding a new data format or a new index to a DBMS, it is inevitable to also implement the transactional methods for the data format or index and ensure their compatibility with the entire system. When upgrading a transactional mechanism, such as adding a new concurrency control method, heavy modification has to be introduced to the code of data organization and processing. To decompose a DBMS, it is crucial to separate the logic of transaction management from that of the data organization and processing component so that modifications on either component do not interfere with the other.

In this paper, we focus on Concurrency Control (CC), a major function of transaction management. We propose to completely separate CC from a DBMS, such that it becomes transparent to the rest of the system. We call our approach Transparent Concurrency Control (TCC). While this separation is in theory possible, it does not come for free. Once separated from the data layer, the CC layer is deprived of the knowledge about data semantics. This may introduce severe performance penalty.

A traditional DBMS performs CC at two levels – the operational level and the transactional level. At the operational level, the CC mechanism ensures isolation among data operations, such as index lookup, index insertion, table scan, etc. To achieve efficiency, the CC methods are normally highly specialized for the particular data models and data processing programs [14]. After the separation, such specialization is no longer possible, as the CC layer loses the knowledge about the data models and data processing methods. If we adopt a generic but blind CC mechanism, it is unlikely to perform well in all possible circumstances. We conducted an experimental study to evaluate three generic CC mechanisms, 2PL, SSI and OCC, at the operational level, and found that all three perform poorly on certain workloads, e.g., intensive index insertions.

The CC mechanism at the transactional level ensures isolation among transactions. At this level, data semantics plays an important role. For instance, locking is widely used for isolation. However, after the separation, we cannot even determine the objects of locking, be it a tuple, a table or a predicate, as such semantic objects are no longer visible to the CC layer. Meanwhile, the semantic relationships between data operations are also missing. Traditional DBMSes often utilize these relationships to achieve improved performance. For instance, as two insertions into the same table are semantically commutative, we can reorder the table insertions of different transactions to achieve a more efficient schedule.

This paper aims to tackle the TCC problem at the operational and transactional levels separately. At the operational level, we employ a trial-and-error mechanism that can provide a certain guarantee about the efficiency of CC. At the transactional level, we provide interfaces for developers to declare data semantics to TCC, so that they can be utilized by the CC mechanism. We evaluated the two-level mechanism of TCC on the indexes of a real DBMS. The results demonstrate the potential of TCC in real-world implementations, and make us optimistic about the feasibility of decomposing a DBMS.

To summarize, we made the following main contributions in this paper:

  1. We introduced the concept and the architecture of TCC and proved its soundness (Sections 3 and 4).

  2. We showed that separating CC from the DBMS incurs performance degradation. We identified two types of knowledge gaps, known as the predictability gap and the semantic gap, which are the main reasons for such degradation (Section 5).

  3. We devised a mechanism of TCC, which aims to bridge the two knowledge gaps at the operational and transactional levels respectively (Section 6). We conducted experiments to verify its effectiveness (Section 7).

2 Related Work

There have been several attempts to decompose a DBMS into loosely coupled modules, with various purposes in mind.

In [5], Chaudhuri and Weikum envisioned a RISC-style system architecture, aiming to make a DBMS easier to tune and optimize. They proposed to decompose a system coarsely into a storage manager and a query processor; the query processor can then be further decomposed into an index manager, a SPJ query processor, an aggregator, etc. Such a decomposition is expected to enhance our ability to configure and tune a database, so as to improve its adaptability to changing workloads and environments. However, there has been little concrete follow-up research, and the RISC-style DBMS remains a vision rather than a practical solution.

StagedDB [7, 8] provides another approach to decompose a DBMS. It separates the workflow of query processing into a number of self-contained and connected stages, such as a parser, a query optimizer, a query executor, etc. Users are allowed to customize the stages, so that they can support user-defined data types, access methods or cost models [23, 1, 2]. StagedDB aims at good performance of query processing. It does not address the modularity issue directly.

To the best of our knowledge, the Deuteronomy project of Microsoft [18, 19, 16, 17] is the most direct and recent effort to realize a decomposition of a DBMS. The architecture of Deuteronomy decomposes a database kernel into a Transaction Component (TC), responsible for concurrency control and recovery, and several Data Components (DCs), responsible for data organization and manipulation. Such an architecture allows system engineers to develop DCs independently, without concerning themselves with the work of the TC. As shown on the left of Figure 1, this in effect places the transaction tier above the data organization tier, which provides operational interfaces for data manipulation, such as retrieval, update, deletion and insertion of data items. The downside of this architecture is two-fold. First, a DC is responsible for ensuring the atomicity of data operations. This requires a built-in CC mechanism in the data organization tier, which means that CC has not been completely decoupled from the Data Component. Second, a DC must provide sufficient information for the TC to detect conflicts among data operations. The current implementation of Deuteronomy assumes that conflicts can be inferred through identifiers of data objects. However, in principle, conflicts are not necessarily inferable from data identifiers. As shown in Figure 2, two seemingly separate items may refer to the same piece of physical data. If such an implicit connection is unknown to the TC, isolation is hardly achievable. This assumption limits the flexibility of DCs, as data sharing or co-referencing cannot be used freely.

Figure 1: Possible Placements of the Transaction Tier
Figure 2: When logical items share physical data, serializability cannot be ensured at the logical level alone. (As the transaction manager does not know that A1 and A2, or B1 and B2, refer to the same piece of data, it regards the above schedule as serializable.)

By contrast, TCC expects to separate CC completely from the rest of the system. As shown on the right of Figure 1, TCC places an extra transaction tier between the data organization tier and the physical storage. This allows it to delegate the work of CC completely to the transaction tier.

It is not new to perform transaction management directly on the physical storage. Transactional Memory (TM) is based on the same idea. TM provides transactional support on shared memory, in order to ease programmers’ work on data synchronization. In recent years, TM has been a focus of intensive research [10, 3], resulting in a number of hardware based and software based implementations (a.k.a. HTM and STM). Some recent work [15, 4] has explored how to utilize HTM in database systems. According to their study, due to the constraints imposed by hardware, HTM cannot be directly applied to database transactions. This limits its usage in a generic database system. STM is believed to incur high overheads [3], as it requires extra computation to perform concurrency control. In [22], a “transactional storage” was proposed to transactionalize block-addressable storage. However, the work is focused on the functionality of persistence and recovery.

The major issue faced by both HTM and STM is their lack of adaptability. TMs normally employ generic CC mechanisms, mostly OCC, which are not universally applicable to all programs of data manipulation. There are always corner cases [20] in which they fail to perform. This is unacceptable to TCC. As TCC is supposed to be transparent, developers of the rest of the system should be allowed to implement any data manipulation method, without being concerned with performance corner cases. TCC deals with the adaptability issue through two approaches. On the one hand, its operational scheduler is able to learn from errors. This makes it eventually adaptable to any program of data manipulation. On the other hand, it provides interfaces for developers to input knowledge about data semantics, which can be utilized by its transactional scheduler to improve performance.

3 The Architecture

It is a common practice to decompose a database system into three tiers – a query processing tier, a data organization tier and a physical storage tier [23]. The query processing tier transforms a SQL query into a query plan and evaluates the plan by invoking relational operators, such as table scan, hash join, etc. The data organization tier is responsible for storing and maintaining structured data. It exposes interfaces of high-level data access to upper tiers, such as index lookup, tuple insertion, tuple update, etc. We call them data operations or operations. The physical storage tier exposes interfaces of low-level data access, such as read and write of data blocks. We call them r/w actions or actions.

In a traditional DBMS, the module of concurrency control is tightly integrated within the data organization tier. Intuitively, the module functions at two levels. At the finer level, it schedules the actions enclosed in each data operation, to ensure atomicity of data operations. At the coarser level, it schedules the data operations, to enforce a certain level of isolation among transactions. For example, in MySQL, the implementation of the B-tree involves both latches and locks [21]. Latches enforce isolation among B-tree operations, such as lookup, insertion and deletion. Locks enforce isolation among transactions, each of which may involve multiple B-tree operations.

To separate the module of transaction management from the rest of the system, we are faced with three options. As Figure 1 illustrates, the first choice is to place the transaction tier above the data organization tier. This is the architecture adopted by Deuteronomy [18, 19]. As mentioned earlier, in this architecture, the data organization tier itself will be responsible for performing CC among data operations.

The second choice is to place the transaction tier below the data organization tier. The transaction manager regards each transaction as a sequence of r/w actions on data blocks. If a DBMS relies on transactional memory / storage [15, 24] alone to implement its CC mechanism, it basically adopts this architecture. As this architecture enables a complete separation of the CC mechanism, we treat it as a baseline approach of TCC. However, in this architecture, as the transaction tier lacks the knowledge about data organization, it is faced with severe performance issues. (Details about these issues will be elaborated in Section 5.2.)

TCC adopts the third architecture (on the right of Figure 1). It splits the transaction module into two tiers, and places one above and one below the data organization tier. We call the upper one the transactional CC tier and the lower one the operational CC tier. They enforce isolation among transactions and among data operations respectively.

As a result, the architecture of TCC consists of five tiers:

Query Processing Tier: This tier interprets and executes SQL queries. During the execution, it will invoke data operations offered by the data organization tier.

Transactional CC Tier: This tier regards each transaction as a sequence of data operations, such as index lookup, tuple insertion, etc. With the full knowledge about conflicts among data operations, it is able to schedule transactions to meet a desired isolation level, such as serializability.

Data Organization Tier: This tier keeps the data organized in predefined structures, such as relational tables, B-tree indexes, etc. It implements basic data operations, such as index lookup, tuple insertion, tuple update, table scan, etc. In this tier, a data operation is further translated into a sequence of r/w actions on the physical storage.

Operational CC Tier: This tier regards each data operation as a sequence of r/w actions, and employs a CC mechanism to ensure the serializability of data operations.

Physical Storage Tier: This tier executes r/w actions on the physical storage. In this paper, we assume that the database system uses block-addressable storage. Therefore, the granularity of each r/w action is at the level of data blocks. We also assume that each r/w action is atomic. Should a DBMS employ a buffer manager to speed up data access, the buffer must be located at this tier.

The interfaces exposed by the CC tiers are as follows:

  1. beginTx(int tx_id) This interface is invoked to start a transaction. The transaction has a unique identifier tx_id. The interface is provided by the transactional CC tier. It is supposed to be invoked by applications.

  2. endTx(int tx_id) This interface is invoked to finish the transaction identified by tx_id. It is also provided by the transactional CC tier and invoked by applications. When a transaction ends, it either commits or aborts, depending on whether it violates the predefined isolation level.

  3. abortTx(int tx_id) This interface is invoked by applications to abort the transaction identified by tx_id. It is provided by the transactional CC tier too.

  4. beginOp(int tx_id, int op_id) This interface is provided by the operational CC tier. It is invoked by the transactional CC tier before a data operation is invoked, to indicate the beginning of the data operation. We use tx_id to denote the identifier of the host transaction, and op_id to denote the identifier of the data operation.

  5. endOp(int tx_id, int op_id) This interface is also provided by the operational CC tier. It is invoked after a data operation finishes, to end the data operation identified by op_id. An operation may succeed or fail, depending on the correctness of its schedule.

  6. read(int tx_id, int op_id, long block_id, char *buf) The data organization tier invokes this interface to read the data block identified by block_id. Upon the invocation, the physical storage tier will copy the data in the block into the buffer that buf refers to.

  7. write(int tx_id, int op_id, long block_id, char *data) This interface is invoked to copy data into the block identified by block_id in the physical storage. As calls of read and write all go through the operational CC tier, they are subject to the scheduling of that tier.

Figure 3 illustrates the usage of the above interfaces. Suppose that the application submits a transaction to insert an entry into a table. Suppose that there is a B-tree index on the table. The application uses beginTx and endTx to specify the beginning and end of the transaction. The query processing tier transforms the SQL statement into two data operations in the data organization tier – one inserts an entry into the B-tree and the other inserts a tuple into the table. The transactional CC tier encloses each data operation within a pair of beginOp and endOp calls. Between the two calls, the data organization tier invokes read and write interfaces to manipulate the data in the physical storage.
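
To make the call sequence concrete, the following C sketch traces the interfaces through the transaction just described, assuming the prototypes listed above (with pointer-typed buffers). The block ids, the op_id values and the way the leaf and heap pages are modified are illustrative stand-ins for work done by the data organization tier.

    /* Hypothetical trace of the TCC interfaces for a transaction that inserts
     * one tuple into a table with a B-tree index (cf. Figure 3).
     * Only beginTx/endTx, beginOp/endOp, read/write are TCC interfaces;
     * the block ids and page manipulation are illustrative. */
    #include <string.h>

    #define BLOCK_SIZE 8192

    /* Assumed prototypes of the TCC interfaces described above. */
    void beginTx(int tx_id);
    void endTx(int tx_id);
    void beginOp(int tx_id, int op_id);
    void endOp(int tx_id, int op_id);
    void read(int tx_id, int op_id, long block_id, char *buf);
    void write(int tx_id, int op_id, long block_id, char *data);

    void insert_tuple_transaction(int tx_id, const char *tuple)
    {
        char buf[BLOCK_SIZE];

        beginTx(tx_id);

        /* Operation 1: insert an index entry into the B-tree. */
        beginOp(tx_id, 1);
        read(tx_id, 1, 100L, buf);             /* fetch the B-tree root      */
        read(tx_id, 1, 105L, buf);             /* descend to a leaf node     */
        /* ... add the new index entry to the leaf image in buf ...          */
        write(tx_id, 1, 105L, buf);            /* write the leaf back        */
        endOp(tx_id, 1);

        /* Operation 2: insert the tuple into the table's heap file. */
        beginOp(tx_id, 2);
        read(tx_id, 2, 200L, buf);             /* fetch a heap page          */
        memcpy(buf, tuple, strlen(tuple) + 1); /* place the tuple (sketch)   */
        write(tx_id, 2, 200L, buf);            /* write the heap page back   */
        endOp(tx_id, 2);

        endTx(tx_id);                          /* commit or abort            */
    }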

Such a design decouples CC from the data organization tier completely. On the one hand, the CC tiers need not care about how data is organized and processed. On the other hand, the data organization tier only needs to encapsulate data manipulation into data operations and invoke the read and write interfaces to access data in the physical storage. It does not need to know the logic of the CC mechanisms.

Figure 3: How the TCC Architecture Processes a Transaction

A transaction module needs to deal with both concurrency control and recovery. In this paper, we focus on concurrency control. The function of recovery can be realized through a conventional page-level WAL mechanism. Due to space limitation, we do not further elaborate on it.

4 Correctness of TCC

In this paper, we consider only the isolation level of serializability. We show that TCC is able to enforce serializability.

4.1 Enforcement of Conflict Serializability

Conventional DBMSes treat serializability narrowly as conflict serializability. Enforcement of conflict serializability requires knowledge about conflicts among transactions. As transactions are composed of data operations, it actually requires that the CC layer should observe all conflicts among data operations.

Most textbooks on transaction management discuss only the conflicts among simple read and write operations. (By read and write operations, we refer to read and write of data objects rather than r/w actions on physical storage.) They create an illusion that conflict serializability can be enforced by simply locking data objects. In fact, data operations in real-world systems are of much higher complexity. Consider operations such as insertion/deletion of a data object, scan of an entire table, etc. To capture conflicts among complex data operations, traditional DBMSes employ a variety of advanced locking mechanisms, such as key range locks, intention locks, predicate locks, etc.

Due to the separation, TCC is deprived of the option of using advanced locking mechanisms, such as predicate locks. It has to infer conflicts among data operations from their low-level actions on the physical storage. That is, it regards two data operations as conflicting if and only if their r/w actions on the physical storage conflict. This approach greatly simplifies the CC mechanism. Meanwhile, it mandates the following prerequisite.

Prerequisite 1

The information in the physical storage is complete and exclusive, such that the results of any sequence of data operations are exclusively determined by the state of the physical storage.

Prerequisite 1 insists that all data and metadata be stored in the physical storage. If any data or metadata is stored elsewhere, TCC may fail to capture the conflicts on that part of the data. While this prerequisite appears trivial, system engineers must bear it in mind to prevent TCC from malfunctioning. For example, buffers must be placed within the physical storage tier, so that data accesses to the buffers are observable to TCC; data or metadata cannot be transmitted among data operations through shared variables, of which TCC is unaware.

Theorem 1

Under Prerequisite 1, two data operations conflict only if their r/w actions conflict.

The proof is by contradiction. Assume that two data operations o1 and o2 conflict while their r/w actions do not conflict. Let s1 and s2 be the sequences of r/w actions of o1 and o2 respectively. As o1 and o2 conflict, there must be a sequence of data operations S, such that the concatenated sequences S ∘ o1 ∘ o2 and S ∘ o2 ∘ o1 yield different results. As s1 and s2 do not conflict, s1 ∘ s2 and s2 ∘ s1 must transfer the physical storage to the same state. Thus, we can conclude that the results of o1 and o2 are not exclusively determined by the state of the physical storage. This contradicts Prerequisite 1.

Theorem 1 states that TCC can capture all conflicts among data operations by observing the r/w actions. This is sufficient for TCC to enforce conflict serializability. In TCC, the operational CC tier is responsible for ensuring serializability among data operations, and the transactional CC tier is responsible for ensuring serializability among transactions. Generic CC mechanisms, such as 2PL, SSI and OCC, can be employed for the enforcement.
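
Theorem 1 implies that the CC tiers can detect conflicts purely from the (block id, access mode) pairs of the r/w actions they observe. The following self-contained C sketch shows one way such a check could look; the action-set representation is an assumption made for illustration, not TCC's actual data structure.

    /* Sketch: infer whether two data operations conflict by comparing their
     * r/w actions on the physical storage (cf. Theorem 1). */
    #include <stdbool.h>
    #include <stddef.h>

    typedef enum { ACCESS_READ, ACCESS_WRITE } access_mode;

    typedef struct {
        long        block_id;
        access_mode mode;
    } rw_action;

    typedef struct {
        rw_action *actions;
        size_t     count;
    } action_set;

    /* Two r/w actions conflict iff they touch the same block and at least
     * one of them is a write. */
    static bool actions_conflict(const rw_action *a, const rw_action *b)
    {
        return a->block_id == b->block_id &&
               (a->mode == ACCESS_WRITE || b->mode == ACCESS_WRITE);
    }

    /* Two data operations conflict iff some pair of their r/w actions conflicts. */
    bool operations_conflict(const action_set *o1, const action_set *o2)
    {
        for (size_t i = 0; i < o1->count; i++)
            for (size_t j = 0; j < o2->count; j++)
                if (actions_conflict(&o1->actions[i], &o2->actions[j]))
                    return true;
        return false;
    }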

4.2 Beyond Conflict Serializability

Inferring operational conflicts at the physical level can be overkill. In fact, when two r/w actions conflict on the physical storage, it is not necessarily the case that their host data operations conflict semantically. For instance, we can increment a counter twice, through two data operations. Physically the two operations conflict, as they modify the same piece of physical data. In effect, they do not, as they can be reordered without affecting the results. As elaborated subsequently, conflict serializability at the level of the physical storage limits TCC's concurrency. This issue is less serious for traditional DBMSes, as they detect conflicts at the semantic level (the level of data objects), which helps them circumvent the worst cases. To achieve good performance, TCC needs to go beyond conflict serializability.

In this paper, we consider View Serializability (VS), a less restrictive definition of serializability. As the traditional definition of VS considers only read and write data operations, we redefine it as follows, to make it applicable to general data operations.

Definition 1 (View Equivalence)

Two schedules S1 and S2 of the same set of data operations are View Equivalent, if for all possible sequences of data operations H and F, the return values of the data operations in the concatenated sequence H ∘ S1 ∘ F are identical to those in the sequence H ∘ S2 ∘ F.

View equivalence requires not only that the two schedules return the same results, but also that their subsequent operations (those of F) return the same results. That is, the two schedules should transform the database into the same state. Two states of a database are semantically identical if they always return the same result to the same operation; they are not necessarily byte-to-byte identical in their physical forms. For instance, in classical relational theory, two relational tables are equivalent if they contain the same set of tuples, even though their tuples may be stored in different orders.

Definition 2 (View Serializability)

Given a set of transactions T, a schedule S of T is View Serializable, iff there exists a serial schedule S' of T, such that S and S' are View Equivalent.

It is not difficult to prove that a conflict-serializable schedule is also view-serializable. To harness the benefits of view serializability, TCC allows system developers to specify the conditions under which view serializability can be preserved, especially when conflict serializability is violated. For instance, the developer of B-tree can declare that two B-tree insertions are commutative, which means that the order of insertion has no impact on serializability. As a result, TCC no longer needs to consider the conflicts among B-tree insertions, even though they have modified the same data blocks.

5 Where does Performance Drop

Our goal is to optimize the performance of TCC, so that it can be an alternative to traditional CC mechanisms.

A performance issue one can easily think of is the granularity of CC. As TCC operates at the block level, when data accesses are concentrated on a small number of blocks, the throughput may drop quickly. In fact, this issue is not as serious as one might expect. In our experimental evaluation, we found that the granularity issue occurs only in a limited number of cases. We leave the granularity issue to the engineers of the data organization tier, who are supposed to keep hotspot data decentralized and to treat this as a design principle. (This does not necessarily mean sacrificing data locality. Hotspot data is a small amount of highly contended data. Even if we scatter it over multiple blocks, it can still be accommodated by caches.)

A more serious challenge faced by TCC is information loss. Once the CC layer is separated from the rest of the system, the structures of the data and the system's behavioral patterns are no longer explicit to the CC mechanism. This may lead to serious performance degradation, as specialized designs can no longer be adopted. We classify the issues of information loss into two categories – the predictability gap and the semantic gap – and elaborate on them separately.

5.1 Predictability Gap

There are only a limited number of types of data operations in a DBMS, and they are repeatedly invoked to complete complex data processing. As a result, there is a strong regularity in the data accesses on the low-level storage. Such regularity has been utilized by traditional CC mechanisms to enhance performance. For example, when performing a B-tree insertion, if a leaf node is retrieved, it is guaranteed to be updated subsequently. In MySQL, when a normal operation attempts to read a leaf node of a B-tree, it places a shared latch on the node to allow more concurrency. However, if the operation is a B-tree insertion, MySQL places an exclusive latch on the leaf node upfront. This helps it avoid latch upgrades, which can easily lead to deadlocks (Figure 4).

It is difficult for TCC to utilize such regular patterns in data accesses. When a B-tree insertion is reading a leaf node, TCC knows neither that it is a B-tree insertion nor that the block being accessed is a leaf node. It is then impossible for TCC to predict that there will be a follow-up modification. If TCC adopts a conventional CC mechanism, such as 2PL, a B-tree insertion has to perform latch upgrade. As Figure 4 illustrates, if multiple B-tree insertions attempt to access the same leaf node concurrently, deadlock becomes highly likely. To make matters worse, if we retry the B-tree insertions whenever we encounter a deadlock, it will incur more deadlocks or even starvation. The entire system may stall as a result.

Figure 4: Data access sequences on B-tree that cause deadlock or abort.

Without the knowledge about how each data operation works, TCC loses the ability to predict data operations’ behaviors. Thus, it misses the opportunity to apply specialized mechanisms to improve the performance of CC. We call this type of information loss “predictability gap”.

To the best of our knowledge, all existing generic CC mechanisms suffer from the predictability gap. Figure 5 illustrates a corner case that no generic CC mechanism can deal with, be it 2PL or OCC. In this case, two concurrent data operations o1 and o2 update the same sequence of data blocks, say b1, ..., bn, in reverse orders. All generic CC mechanisms will allow o1 and o2 to update b1 and bn concurrently. This will surely lead to deadlock or abort. If o1 and o2 are invoked frequently, there will be constant performance degradation. It is unacceptable for TCC to be handicapped by such corner cases. However, we cannot resort to specialization, as we still need to hide the implementation details of data operations from TCC. The only option left to us is to design a generic CC mechanism that is immune to the predictability gap.

Figure 5: Data access sequences that embarrass all general-purpose CC mechanisms

In Section 6.1, we will introduce a new CC mechanism, which learns access patterns in a trial-and-error manner. When performing or retrying a data operation, it acquires knowledge about the operation's data access patterns. It can then utilize this knowledge in subsequent retries. This proves to be robust against such corner cases.

5.2 Semantic Gap

We have mentioned that conflict serializability at the level of the physical storage is too restrictive for TCC to achieve good performance. As a remedy, we introduced view serializability, which is based on the definition of view equivalence. View equivalence, in turn, is a semantic measure. Its measurement requires the semantics of data operations, which we intend to hide from TCC.

For example, commutative operations and inverse operations [25] are common semantics we can use to measure view equivalence. Suppose that transaction t1 performs two B-tree insertions i1 and i2, and transaction t2 performs one B-tree insertion i3, all on the same leaf node n. If we enforce conflict serializability strictly, we can accept only two schedules of the operations, namely ⟨i1, i2, i3⟩ and ⟨i3, i1, i2⟩. In fact, most real-world DBMSes accept the schedule ⟨i1, i3, i2⟩ too, simply because B-tree insertions are commutative. While it is possible that the two versions of n resulting from ⟨i1, i2, i3⟩ and ⟨i1, i3, i2⟩ are not physically identical, they are view equivalent – they are semantically identical to future data operations.

View serializability allows us to exploit more concurrency. However, TCC lacks the knowledge to judge view serializability. We refer to this as the "semantic gap".

To deal with semantic gaps, we place a transactional CC tier atop the data organization tier. It allows system engineers to explicitly declare semantic relationship between data operations (e.g., commutative operations, inverse operations). Section 6.2 describes how TCC leverages data semantics to generate view serializable schedules.

6 The TCC Mechanism

The two-level architecture of TCC allows us to deal with the two information gaps separately. The operational tier deals with the predictability gap by adopting a trial-and-error CC mechanism. To bridge the semantic gap, the transactional tier allows developers to declare semantic relationships among data operations.

6.1 Operational Scheduler

Our scheduler at the operational level employs latching to enforce serializability of data operations. The basic approach is two-phase latching – an operation places latches when it is about to read or write a data block for the first time; it releases all the acquired latches after the operation completes. The scheduler fails an operation if it suspects that the operation may violate serializability. When an operation fails, the scheduler retries it immediately. During a retry, it performs early latching to prevent the operation from failing again for the same reason. As an operation fails more times, more early latches are placed, so that the chance of a successful retry gradually increases.

This trial-and-error approach allows the scheduler to learn the behavioral pattern of a data operation on the fly. As more retries are performed, the behavior of an operation becomes increasingly predictable. At a certain point, we can guarantee that the scheduler is able to complete the operation without further retries. To make this intuition work, we introduce the concept of progressiveness.

Definition 3 (Progressiveness)

Let a data operation be a sequence of r/w actions. A scheduler is progressive if it can guarantee: whenever a data operation fails on an r/w action (i.e., the data operation is aborted because of a conflict on the action), the subsequent retries of the operation will not fail on the same r/w action again.

Progressiveness ensures that each r/w action of a data operation will fail at most once. If a data operation comprises n r/w actions, it will fail at most n times, and will therefore complete within n + 1 attempts. Hence, a progressive scheduler guarantees to complete any data operation in a limited number of retries, no matter how complicated the situation is. Progressiveness means robustness.

To ensure progressiveness, the operational scheduler needs to think twice before deciding to fail an operation, as it cannot fail the operation on the same r/w action more than once. Suppose that two data operations o1 and o2 conflict. Then, there must be two r/w actions, a1 of o1 and a2 of o2, which attempt to access the same data block. Suppose that a1 is ahead of a2. We can distinguish among three types of situations:

  • Situation I: both o1 and o2 have already failed on a1 and a2 respectively, in previous attempts.

  • Situation II: o1 has never failed on a1, while o2 has failed on a2.

  • Situation III: o2 has never failed on a2.

To ensure progressiveness, in Situation I, we cannot abort either o1 or o2. In Situation II, we cannot abort o2. In Situation III, it is always safer to abort o2 rather than o1. Based on this observation, we come up with the following rules for our operational scheduler:

  • Basic Latching. Whenever an operation o conducts an r/w action (b, m) (where b denotes the data block being accessed, and m denotes the access mode, i.e., read or write), it is supposed to place a latch of mode m on b. The latches are held until o succeeds or fails. This is basically two-phase latching, which ensures serializability among data operations.

  • Early Latching. To deal with Situations I and II, we perform early latching. Whenever a data operation o fails on an r/w action (b, m) for the first time, o records (b, m) in an immunity set I(o). When o retries, it latches the blocks in its immunity set in advance. That is, for each (b, m) in I(o), o first places a latch of mode m on b before the execution starts. To avoid deadlocks in the early-latching phase: (1) we place latches in the order of block ids; (2) if a data operation will both read and write a block, we only place the write latch. When early latching is in use, in Situations I and II, o1 and o2 will actually be executed in a serial order, as o2 will be blocked by o1 in the early-latching phase. Then, we can avoid aborting o1 and o2 on a1 and a2.

  • Early Abortion. To deal with Situation III, we ensure that o2, instead of o1, is the one to abort. When a data operation o performs an r/w action (b, m), if o did not fail on this r/w action before, it tries to latch b right before the action. In this case, if another operation has already obtained the latch on b, instead of blocking o, we abort o directly.

A scheduler following the above three rules will be deadlock free. Due to the use of early abortion, blocking can only occur in the early latching phase. As early latching is performed in a universal order, the aforementioned three rules alone cannot cause deadlock. The following theorems confirm that our scheduler achieves serializability and progressiveness simultaneously.
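
In effect, the three rules reduce to a single decision taken whenever an operation requests a latch: proceed, wait, or abort the requester. The C sketch below captures that decision; the helper predicate and the enum types are illustrative assumptions rather than TCC's actual internals.

    /* Sketch of the latch-acquisition decision implied by Rules 1-3. */
    #include <stdbool.h>

    typedef enum { ACCESS_READ, ACCESS_WRITE } access_mode;
    typedef enum { PROCEED, BLOCK, ABORT_SELF } latch_decision;

    /* Assumed helper: does another operation hold an incompatible latch? */
    bool latch_held_by_other(int op_id, long block_id, access_mode mode);

    latch_decision on_latch_request(int op_id, long block_id, access_mode mode,
                                    bool in_early_latching_phase)
    {
        if (!latch_held_by_other(op_id, block_id, mode))
            return PROCEED;        /* Rule 1: acquire the latch and go on      */

        if (in_early_latching_phase)
            return BLOCK;          /* Rule 2: early latches may wait; they are
                                      requested in block-id order, so waiting
                                      cannot form a deadlock                   */

        /* Execution phase: blocks in the immunity set were already latched
         * early, so a contended block here is one the operation has never
         * failed on.  Rule 3: abort the operation (recording the action in
         * its immunity set) instead of letting it block.                      */
        return ABORT_SELF;
    }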

Theorem 2

If we perform scheduling by following Rules 1, 2 and 3, all data operations will be serializable.

The proof is by contradiction. If we assume that serializability does not hold, there must be a dependency cycle o1 → o2 → … → ok → o1, where all of the operations complete successfully. For each dependency oi → oj in the cycle, we can conclude that it is not in Situation III; otherwise, oj would abort. Then, oi → oj can only be in Situation I or Situation II. In either case, oj will not access any data until oi completes. Then, there will be a deadlock among the operations o1, …, ok, as each operation is waiting for the preceding one to complete. Then, no operation can complete, which contradicts the assumption.

Theorem 3

If a scheduler follows Rules 1, 2 and 3 exactly (except the actions specified in Rules 1, 2 and 3, no other blocking or abortion is performed), it is a progressive scheduler.

First, if we apply Rules 1, 2 and 3, there will not be deadlock. To prove it, we assume that there is a deadlock of the form o1 → o2 → … → ok → o1. We know that there is a universal order for early latching. Then, not all operations involved in the deadlock are in the early latching phase. Suppose that oi is not in the early latching phase and the r/w action blocking oi is ai. Then, we can conclude that oi must not have failed on ai before. (Otherwise, oi would be blocked in the early latching phase.) According to Rule 3, if oi has not failed on ai, oi should be aborted instead of being blocked. Then, the deadlock is impossible. We are in contradiction.

If deadlock is impossible, abort can only occur when we apply Rule 3. That is to say, a data operation can only fail on an r/w action where it has never failed. This is exactly what progressiveness needs.

Algorithm 1 describes our scheduler. The duration of a data operation is divided into three phases. In the early latching phase, the operation latches all the blocks in its immunity set. During the execution phase, the operation performs updates only in its private workspace. This facilitates abortion – to abort an operation, we simply discard its workspace. After the execution phase, the operation enters a clearing phase, in which it makes its modifications visible to other operations.
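
The control flow of Algorithm 1 can be sketched as a retry loop with the three phases made explicit. The immunity-set and workspace types and the helper functions below are assumed placeholders, not the actual pseudocode.

    /* Sketch of the progressive execution loop for one data operation. */
    #include <stdbool.h>

    typedef struct immunity_set immunity_set;   /* failed (block, mode) pairs */
    typedef struct workspace    workspace;      /* private copy of updates    */

    /* Assumed helpers of the operational CC tier. */
    void latch_immunity_set(immunity_set *imm);        /* in block-id order   */
    bool run_operation_body(workspace *ws, immunity_set *imm); /* false: failed */
    void publish_workspace(workspace *ws);              /* clearing phase     */
    void discard_workspace(workspace *ws);
    void release_all_latches(void);

    void execute_operation(immunity_set *imm, workspace *ws)
    {
        for (;;) {
            /* 1. Early latching phase: latch every block the operation has
             *    previously failed on, so it cannot fail there again.        */
            latch_immunity_set(imm);

            /* 2. Execution phase: updates go into the private workspace; a
             *    conflict on a block outside the immunity set aborts this
             *    attempt and adds the (block, mode) pair to the set.          */
            bool ok = run_operation_body(ws, imm);

            if (ok) {
                /* 3. Clearing phase: make the updates visible, then unlatch.  */
                publish_workspace(ws);
                release_all_latches();
                return;
            }

            discard_workspace(ws);
            release_all_latches();
            /* Progressiveness: each retry can only fail on actions it has
             * never failed on before, so this loop terminates.                */
        }
    }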

In the scenario of intensive B-tree insertion (illustrated in Figure 4), our progressive scheduler is superior to strict two-phase latching. Two concurrent B-tree insertions may conflict when they attempt to upgrade their latches on the same leaf node n. In this case, our scheduler aborts both insertions and adds the r/w action on n to their immunity sets. When it retries the two B-tree insertions, it will place a latch on n at the very beginning. This guarantees the success of the retries. Under strict two-phase latching, the two B-tree insertions may fail repeatedly.

Compared to traditional optimistic CC mechanisms, such as OCC and SSI, early latching may seem too pessimistic. In fact, our basic assumption is that data operations are all short. In real-world systems, this assumption is valid, since long and sophisticated data manipulations are always composed of short and generic operations. Under this assumption, it is unlikely that early latching will hurt performance severely. It is more important to ensure the progressiveness of operation execution, as it frees system developers from concerns about performance corner cases. In contrast to operations, the lengths of transactions are less controllable, as they are determined by applications. This is why we decided not to apply the same progressive scheduler at the transactional level.

Algorithm 1: The Operational Scheduler.

6.2 Transactional Scheduler

The operational scheduler ensures a serial order of data operations. The transactional scheduler is supposed to schedule the operations to enforce serializability among transactions. In theory, it can employ any CC mechanism to enforce serializability, including 2PL, SSI, OCC, etc. However, there is a distinction between TCC and traditional DBMSes in transactional scheduling. In traditional DBMSes, the scheduler can predict conflicts between data operations prior to their execution, by comparing object ids or query predicates. In TCC, the scheduler can only observe conflicts during or after the execution of operations, as conflicts can only be inferred from r/w actions on the physical storage (Section 4). This makes the design of the transactional scheduler less straightforward.

We devised two transactional schedulers for TCC – a basic scheduler which applies two-phase locking to enforce conflict serializability, and an extended scheduler which can relax the schedules to view serializability.

6.2.1 The Basic Scheduler

To perform 2PL, we need to determine the objects of locking. The locking objects of traditional DBMSes, such as tuple, table and predicate, do not apply, as they are unknown to TCC. Therefore, TCC has to place locks directly on data blocks. As mentioned earlier, r/w actions on data blocks enable TCC to capture all conflicts among data operations. Locking blocks suffices to achieve 2PL.

Our design of the 2PL mechanism has to consider the particular situation of TCC. First, we decide to perform locking only after a data operation completes. If we performed locking during the execution of a data operation, it would interfere with the work of the operational scheduler, making progressiveness difficult to achieve. As shown in the endOp function of Algorithm 1, we perform locking after the Clearing Phase of each data operation. More precisely, locks are added after all the latches are released. Separating the latching and locking phases enables us to avoid unresolvable deadlocks. If we performed locking before latches are released, latches and locks could together constitute a deadlock. Such deadlocks are expensive to detect and resolve. For example, in Figure 6, two transactions t1 and t2 are executed concurrently. At the beginning, t1 executes an operation that updates the block b1. It thus holds a lock on b1. Then, t2 executes an operation that updates the blocks b1 and b2. When t2 attempts to lock b1, it is blocked by t1, while holding latches on both b1 and b2. If t1 then executes an operation that updates b2, it has to wait for t2's latch on b2. As a result, a deadlock is formed.

Figure 6: An example where locks and latches form a deadlock.

Second, since the locking phase is separated from the latching phase, we must guarantee that transactions place locks in the same order as their data operations place latches. That is to say, if two data operations conflict, resulting in a dependency o1 → o2, then the transaction of o1 must place the lock before the transaction of o2 does. To ensure the consistency between the latching and locking orders, whenever a transaction obtains a lock, we check whether its locking order complies with the latching order. If it does not, we abort the transaction (Line 10 of Algorithm 1). We maintain a latch counter and a lock counter for each data block, which are incremented during the latching and locking phases respectively. If a transaction performs locking in the right order, it is supposed to observe identical latch and lock counters. If there is a gap between the two counters (Line 9 of Algorithm 1), it means that the locking order and the latching order are inconsistent.
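
One possible realization of this counter check is sketched below; the per-block structure, the exact bookkeeping and the abort hook are assumptions for illustration, and the authoritative check is the one in Algorithm 1.

    /* Sketch: keeping the locking order consistent with the latching order
     * via per-block latch and lock counters (Section 6.2.1). */
    #include <stdbool.h>

    typedef struct {
        long latch_counter;   /* bumped when an operation latches the block  */
        long lock_counter;    /* bumped when a transaction locks the block   */
    } block_meta;

    void abort_transaction(int tx_id);   /* assumed hook */

    /* Called while the operation holds the latch: remember this block's
     * position in the global latching order. */
    long record_latch_seq(block_meta *b)
    {
        return ++b->latch_counter;
    }

    /* Called from endOp, after the latches have been released.  If a
     * transaction whose operation latched the block before ours has not
     * locked it yet, the counters show a gap and we abort to keep the
     * locking order consistent with the latching order. */
    bool lock_block_in_order(int tx_id, block_meta *b, long my_latch_seq)
    {
        if (b->lock_counter + 1 != my_latch_seq) {
            abort_transaction(tx_id);
            return false;
        }
        b->lock_counter = my_latch_seq;   /* lock granted in the right order */
        return true;
    }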

A possible concern is that the separation between the latching and locking phases may lead to a high abort rate. According to our experimental study (Section 7.4), this is unlikely, as the interval between the two phases is sufficiently small.

Recoverability refers to the ability to abort transactions correctly. When a transaction aborts, it needs to perform extra writes on the data blocks it has modified, to recover them to their original versions. It has been proven that recoverability is achievable if we disallow access to uncommitted data [25]. In principle, 2PL guarantees that no uncommitted data is accessed by any transaction. As for TCC, since it performs locking after latches are released, it is possible that a data operation accesses uncommitted data. To ensure recoverability, we simply abort transactions that have accessed uncommitted data (Lines 20-21 of Algorithm 1).

Algorithm 2: The Basic Transactional Scheduler.

TCC provides two ways to roll back a transaction. First, it maintains undo logs and uses them to recover data blocks to older versions. As an aborted transaction has already locked the data it has modified, no other transaction can access the data before the rollback is finished. Second, system engineers may have created inverse operations for some data operations. Then, we can cancel a data operation by executing its inverse operation. The details will be discussed in Section 6.2.2.

Algorithm 2 depicts how the basic transactional scheduler works. It is worth noting that our transactional scheduler is not deadlock free. It thus requires a deadlock detector. Moreover, our transactional scheduler does not ensure progressiveness. Since our progressive scheduler can be overly pessimistic, applying it at the transaction level may hurt the concurrency of long-duration transactions. We consider it the application developers' responsibility to ensure the performance of transactions. This is how state-of-the-art software development works.

6.2.2 The Extended Scheduler

The basic scheduler enforces conflict serializability. As discussed previously, conflict serializability can be overkill. To improve the concurrency of transaction processing, we have introduced the concept of view serializability, which allows us to take data semantics into consideration.

An important type of data semantics is commutativity.

Definition 4 (Commutative Operation)

Two operations o1 and o2 are commutative, iff for any two sequences of data operations, say H and F, the two schedules H ∘ o1 ∘ o2 ∘ F and H ∘ o2 ∘ o1 ∘ F are view equivalent.

TCC provides an interface add_commutativity(int op1, void *args1, int op2, void *args2) for system developers to declare that data operations of type op1 and op2 are commutative. args1 and args2 are the argument lists of op1 and op2 respectively. They are used to specify the conditions under which commutativity holds. For example, suppose that the type of B-tree insertions is identified by 1. A developer can invoke add_commutativity(1, null, 1, null) to notify TCC that B-tree insertions are always mutually commutative.
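
As an illustration, the declaration could be wrapped as follows; the operation-type constant BTREE_INSERT and the registration function are hypothetical, and the prototype simply mirrors the interface described above.

    /* Sketch: declaring that B-tree insertions are mutually commutative. */
    #include <stddef.h>

    #define BTREE_INSERT 1   /* hypothetical operation-type identifier */

    /* Assumed prototype of the TCC interface described above. */
    void add_commutativity(int op1, void *args1, int op2, void *args2);

    void register_btree_semantics(void)
    {
        /* NULL argument lists: the insertions commute unconditionally. */
        add_commutativity(BTREE_INSERT, NULL, BTREE_INSERT, NULL);
    }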

Conflicts among consecutive commutative operations can be ignored when we enforce view serializability. This can be confirmed by the following theorem.

Theorem 4

A schedule preserves view serializability if the following conditions are satisfied:

  • Suppose D is the complete set of dependencies among the transactions.

  • Suppose C is the complete set of dependencies caused by consecutive commutative operations.

  • The dependency graph consisting of D - C is acyclic.

For each dependency (o1 → o2) in C, we can rearrange the order of o1 and o2, i.e., turn it into (o2 → o1), without violating view serializability.

If the schedule did not satisfy view serializability, there would be a dependency cycle. The cycle cannot contain any dependency in C; otherwise, we could rearrange that dependency to break the cycle. Hence the cycle would lie entirely in D - C, which contradicts the acyclicity condition.

To take advantage of commutativity in TCC, we extend the basic scheduler. We regard locks held by commutative data operations as compatible. For example, if transaction t1 executed a B-tree insertion and modified the leaf node n, t1 will hold an exclusive lock on n. Then, when another transaction t2 executes a B-tree insertion and modifies the same leaf node n, t2 can be granted an exclusive lock on n too. (To the basic scheduler, t2 is supposed to be blocked.) This preserves view serializability, as the execution order of commutative operations can be arbitrary.
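
In effect, the extension adds one clause to the lock-compatibility test: conflicting lock modes are tolerated when the holding and requesting operations have been declared commutative. A C sketch under assumed data structures:

    /* Sketch of the extended lock-compatibility test (Section 6.2.2). */
    #include <stdbool.h>

    typedef enum { LOCK_SHARED, LOCK_EXCLUSIVE } lock_mode;

    typedef struct {
        int       tx_id;
        int       op_type;   /* type of the operation that acquired the lock */
        lock_mode mode;
    } held_lock;

    /* Assumed lookup over the declarations made via add_commutativity(). */
    bool declared_commutative(int op_type_a, int op_type_b);

    bool lock_compatible(const held_lock *held, int req_tx_id,
                         int req_op_type, lock_mode req_mode)
    {
        if (held->tx_id == req_tx_id)
            return true;                      /* same transaction            */
        if (held->mode == LOCK_SHARED && req_mode == LOCK_SHARED)
            return true;                      /* ordinary shared/shared case */
        /* Extension: otherwise-conflicting locks are compatible if the two
         * operations were declared commutative (view serializability).       */
        return declared_commutative(held->op_type, req_op_type);
    }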

However, when commutativity is considered, extra measures are required to ensure recoverability. As an exclusive lock is no longer exclusive to commutative operations, a transaction may read uncommitted data. If the uncommitted data is aborted, we have to perform cascading aborts, which are expensive. While we could forbid access to uncommitted data, that would make commutativity useless. To undo commutative data operations, the best strategy is to use inverse operations.

Definition 5 (Inverse Operation)

An operation ō is an inverse operation of the operation o, iff for any two sequences of data operations, say H and F, the two schedules H ∘ o ∘ ō ∘ F and H ∘ F are view equivalent.

TCC provides an interface addInverse(int op1, void *args1, int op2, void *args2) for system developers to declare inverse operations. This interface specifies that op2 is an inverse operation of op1. args1 and args2 are the argument lists of op1 and op2 respectively. For example, suppose that B-tree deletion is an inverse operation of B-tree insertion. We can declare the inverse operations by invoking addInverse(btreeInsert, [key, value], btreeDelete, [key]). It indicates that a B-tree deletion on key will undo the B-tree insertion with the same key.

If a data operation’s uncommitted data has been accessed by its commutative operations, we can abort it by simply invoking its inverse operation, without also aborting its commutative operations. The following theorem justifies this.

Theorem 5

Suppose the operations o1 and o2 are commutative, and ō1 is an inverse operation of o1. Given any two sequences of data operations, say H and F, the schedules H ∘ o1 ∘ o2 ∘ ō1 ∘ F and H ∘ o2 ∘ F are view equivalent.

The proof is straightforward. By Definition 4, H ∘ o1 ∘ o2 ∘ ō1 ∘ F and H ∘ o2 ∘ o1 ∘ ō1 ∘ F are view equivalent. By Definition 5, H ∘ o2 ∘ o1 ∘ ō1 ∘ F and H ∘ o2 ∘ F are view equivalent. Thus, H ∘ o1 ∘ o2 ∘ ō1 ∘ F and H ∘ o2 ∘ F are view equivalent.

When we abort a transaction, we undo its operations serially in reverse order. For an operation that is not commutative with any other operation, we undo it through the undo log. For an operation that has commutative peers, we invoke its inverse operation to undo it. Unlike applying an undo log, an inverse operation may be blocked by other transactions. In this case, instead of letting it block, we fail the inverse operation and retry it, repeating until it succeeds.
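
The rollback procedure just described could be sketched as follows; the operation records and helper functions are assumptions made for illustration.

    /* Sketch of transaction rollback in the extended scheduler: undo the
     * operations in reverse order, via the undo log where possible and via
     * inverse operations for operations that have commutative peers. */
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct op_record op_record;           /* one executed operation  */

    /* Assumed helpers. */
    op_record *last_operation(int tx_id);         /* most recent operation   */
    op_record *previous_operation(op_record *op); /* NULL when none is left  */
    bool       has_commutative_peers(const op_record *op);
    void       apply_undo_log(const op_record *op);
    /* Runs the declared inverse operation; returns false if the operational
     * scheduler failed it, in which case the caller retries. */
    bool       run_inverse_operation(const op_record *op);

    void rollback_transaction(int tx_id)
    {
        for (op_record *op = last_operation(tx_id); op != NULL;
             op = previous_operation(op)) {
            if (!has_commutative_peers(op)) {
                apply_undo_log(op);            /* blocks are still locked    */
            } else {
                /* The inverse operation is never left blocked; it is failed
                 * and retried until it succeeds.                             */
                while (!run_inverse_operation(op))
                    /* retry */ ;
            }
        }
    }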

In this paper, we consider only commutative and inverse operations. It is possible to define and exploit other types of data semantics in TCC. However, this is not within the scope of our current work.

7 Experimental Study

To evaluate the practicality of TCC, the best way is to apply TCC to an existing database system, whose design is completely oblivious to how TCC works. The purpose of TCC is to make concurrency control transparent to database engineers. If we create a new database system based on TCC, we will be inclined to tailor its design to the particular mechanisms of TCC. This will make the evaluation less objective. However, a complete substitution of the existing CC mechanism in a DBMS is extremely costly, if not impossible. The code of CC is usually intertwined with a large number of components of a DBMS, including the metadata manager, the storage space manager, the table manager, the indexer, etc. A complete deployment of TCC requires us to re-engineer all the components. It is beyond the capability of our research team. As a compromise, we chose to apply TCC to only the indexes of a DBMS. Indexes are typical data structures in data management. Their concurrency controllers are usually highly specialized. In the TCC architecture, they are likely to be affected by the predictability and semantic gaps. Therefore, evaluation on indexes can show how well TCC deals with the two gaps in a generic DBMS.

7.1 The Implementation

Our codebase is Shore-MT [12], a widely used research prototype of an RDBMS. It adopts 2PL for transaction-level CC and applies specialized CC mechanisms to indexes and metadata.

B-tree is the only type of index used by Shore-MT. We disabled the original concurrency controller on the B-trees of Shore-MT and supplemented it with the TCC mechanism. Shore-MT's B-trees are disk-resident. Any access to a B-tree node needs to first fix the underlying block in the buffer to avoid invalid access. Therefore, we regarded the "fix" routines as the read/write interface of the physical storage, and deployed the TCC module around them. This allows TCC to capture every r/w action on the B-trees.

We implemented four mechanisms of TCC. The first three adopt the architecture of transactional memory (i.e., the middle one in Figure 1) and apply standard 2PL, SSI and OCC respectively to enforce serializability. These three mechanisms ignore the existence of data operations, and simply treat each transaction as a sequence of r/w actions. They may thus suffer from the predictability and semantic gaps introduced in Section 5. We denote them by TM-2PL, TM-SSI and TM-OCC respectively. The fourth one is the TCC mechanism proposed in this paper (adopting the architecture on the right of Figure 1). We denote it by TCC. Since TCC uses two transactional schedulers, a basic one and an extended one (Section 6.2), we denote the variant of TCC that uses only the basic transactional scheduler by TCC-B.

To preserve the ACID properties of transactions, we need to integrate the CC of B-trees with that of the rest of Shore-MT. For TCC, we let its transactional scheduler and the rest of Shore-MT share the same lock manager. We did the same for TM-2PL. For TM-SSI and TM-OCC, we implemented two variants of Shore-MT, Shore-SSI and Shore-OCC, which use SSI and OCC for concurrency control. Then, we integrated the schedulers of TM-SSI and TM-OCC into those of Shore-SSI and Shore-OCC respectively.

Shore-MT does not support MVCC. To implement TM-SSI and Shore-SSI, we carved out an additional storage space to store old versions of data. All versions of a data block are linked together, so that a transaction can easily retrieve the proper version to read. Regarding the implementation of TM-OCC and Shore-OCC, we maintain a write set and a read set for each transaction. During the validation stage, a transaction locks the write set and validates the read set.

7.2 Experiment Setup

We compared TCC against the original CC mechanisms of Shore-MT. We had three versions of Shore-MT: Shore-2PL, Shore-SSI and Shore-OCC. Shore-2PL is the original Shore-MT, which uses 2PL for concurrency control. To achieve its best performance, we applied two of its optimization patches, i.e., Speculative Lock Inheritance (SLI) [11] and Early Lock Release (ELR) [13]. Shore-SSI and Shore-OCC are the variants of Shore-MT that use SSI and OCC for concurrency control. They were implemented to cooperate with TM-SSI and TM-OCC.

The experiments were carried out on an HP workstation equipped with Intel Xeon E7-4830 CPUs (with 32 cores and 64 physical threads in total) and a SATA-2T disk. The operating system was 64-bit Ubuntu 12.04. In most of the experiments, we set the buffer size to MB. For the experiments on TPC-C, we set the buffer size to GB (default setting of ShoreKit). For the experiments on TATP, we set the buffer size to GB (default setting of ShoreKit). We intentionally kept the buffer size large, to minimize I/O wait time. This helps to maximize concurrency control’s influence on performance. For the same reason, we turned off the logging of Shore-MT.

7.3 Experiments on Operational Scheduler

Our operational scheduler was designed to bridge the predictability gap: it is supposed to handle any data operation efficiently, regardless of its data access pattern. To evaluate its robustness, we experimented with a variety of scenarios, including different cases of B-tree insertion and an artificial corner case (such as the one depicted in Figure 5).

In the experiments on B-tree insertion, we created a pre-populated B-tree index and ran two types of workload on it. In the first type of workload, each transaction contains a single tuple insertion, which inserts a tuple into a randomly selected leaf node of the B-tree; this represents the case of low contention. In the second type of workload, each transaction performs a sequential tuple insertion, so that all transactions contend for the last leaf node; this represents the case of high contention.
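The two workloads differ only in how insertion keys are generated. The sketch below illustrates the idea; the identifiers and the key-space size are chosen for illustration and are not part of the benchmark code.

```cpp
// Random keys spread insertions across the leaves of the B-tree (low
// contention); monotonically increasing keys funnel every insertion into the
// rightmost leaf (high contention).
#include <atomic>
#include <cstdint>
#include <random>

uint64_t random_key(std::mt19937_64& rng, uint64_t key_space) {
    return std::uniform_int_distribution<uint64_t>(0, key_space - 1)(rng);
}

uint64_t sequential_key() {
    static std::atomic<uint64_t> next{0};
    return next.fetch_add(1);   // every worker appends at the tail of the index
}

int main() {
    std::mt19937_64 rng(42);
    uint64_t k1 = random_key(rng, 1'000'000);   // lands in a random leaf
    uint64_t k2 = sequential_key();             // lands in the last leaf
    (void)k1; (void)k2;
}
```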

Figure 7: Performance on B-tree Insertion. (a) Throughput and (b) abort rate under random insertion; (c) throughput and (d) abort rate under sequential insertion.

Figure 7 shows the results on B-tree insertion. We can see that all the CC mechanisms perform similarly well when the degree of contention is low. As contention increases, the performance of TM-2PL, TM-SSI and TM-OCC gradually becomes unbearable. In the case of high contention, TM-2PL suffers from deadlocks: a large number of deadlocks is incurred when it upgrades its latches on the last leaf node of the B-tree from shared mode to exclusive mode, leading to high deadlock-resolution cost and a high abort rate (Figure 7(d)). While TM-SSI and TM-OCC do not need to deal with deadlocks, they suffer from high abort rates: when transactions contend for the last leaf node, TM-OCC's validation phases are highly likely to fail, and TM-SSI encounters a large number of write-write conflicts, which easily force transactions to abort.

In contrast, TCC's performance is significantly better in the high-contention case; it performed as well as Shore-MT's built-in B-tree scheduler. The operational scheduler of TCC is progressive: when a B-tree insertion fails, it automatically retries the insertion without aborting the host transaction. More importantly, it learns from its failures, so the number of retries stays small. As Table 1 shows, even when the degree of contention is maximized, TCC completes a B-tree insertion with only 1.7 retries on average.

# of Workers    1     2     4     8     16    32
B-tree Insert   0     0.04  0.93  1.27  1.55  1.70
Corner Case     0     1.32  1.39  1.41  1.43  1.47
Table 1: Retry Frequency per Operation.

In our experiments on the corner case, we created an artificial operation in Shore-MT with two execution routes; when invoked, the operation randomly chooses one of the routes to execute. In each route, the operation first reads a designated block, then performs a large number of random reads, and finally updates a designated block; the two routes differ in which blocks they read and update. The corner case is intentionally designed to handicap the generic CC mechanisms, including 2PL, SSI and OCC.

Figure 8 shows the results. When the degree of concurrency reaches a certain level, TM-2PL, TM-SSI and TM-OCC all appear to be subject to starvation. Under TM-2PL, a transaction can easily become involved in deadlocks. Under TM-SSI, write-write conflicts and anti-dependencies are common, making it difficult for transactions to succeed. Under TM-OCC, validation is difficult to pass. In contrast, TCC performs much better, as its operational scheduler is progressive: if an operation failed on a data block in a previous execution, it latches the block upfront to avoid failing on it again. After one or two retries, the operation is guaranteed to succeed. According to Table 1, TCC needs at most 1.47 retries on average to complete an operation.
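The behavior described above can be pictured with the following simplified retry loop. It is a sketch under stated assumptions (the OpResult type and the execute_progressively helper are hypothetical), not the actual TCC operational scheduler, but it captures the pattern of failing, learning which block caused the conflict, latching it upfront, and retrying.

```cpp
// A simplified "progressive" retry loop: each failed attempt reports the
// block it conflicted on, and that block is latched upfront next time.
#include <cstdint>
#include <set>

struct OpResult {
    bool ok;
    uint64_t conflicting_block;   // meaningful only when ok == false
};

// run_once executes one attempt of the data operation with the given blocks
// latched upfront and reports the block it failed on, if any.
template <typename RunOnce>
bool execute_progressively(RunOnce run_once, int max_retries = 10) {
    std::set<uint64_t> latch_upfront;                 // learned from failures
    for (int attempt = 0; attempt <= max_retries; ++attempt) {
        OpResult r = run_once(latch_upfront);
        if (r.ok) return true;                        // operation completed
        latch_upfront.insert(r.conflicting_block);    // learn, then retry
    }
    return false;   // only reached if the operation keeps failing
}

int main() {
    int calls = 0;
    bool done = execute_progressively([&](const std::set<uint64_t>& upfront) {
        ++calls;
        if (upfront.count(7) == 0)         // simulated conflict on block 7
            return OpResult{false, 7};
        return OpResult{true, 0};          // succeeds once block 7 is latched
    });
    (void)done;   // done == true after calls == 2
}
```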

Figure 8: Performance on a Corner Case. (a) Throughput; (b) abort rate.

These experiments justify our initiative to create a progressive operational scheduler. Generic CC mechanisms such as 2PL, SSI and OCC can perform fairly well in some cases, but there are always cases where they stop performing. If we are blind to data access patterns, we are unlikely to eliminate all such corner cases; this is exactly the predictability gap. In contrast, a progressive scheduler is far more robust: it learns by doing, and exploits the learned access patterns to improve efficiency. This is especially meaningful to TCC, which is supposed to make CC transparent to the rest of the system.

7.4 Experiments on Transactional Scheduler

Our second set of experiments was conducted on the transactional scheduler. It mainly aimed to understand whether data semantics (i.e., commutative and inverse operations) can be exploited to improve performance. We used two types of workload: a revised New-Order workload of TPC-C and an artificial workload. We made two modifications to the New-Order transactions. First, we rebuilt the index of the order-line table; the new index key is composed of four fields: OrderId, WarehouseId, DistrictId and OrderNumber. With this arrangement, insertions into the order-line table contend for the same B-tree leaf node. Second, we made sure that each transaction performed a fixed number of insertions into the order-line table. This modification enlarges the performance gaps among the variants.
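For illustration, the rebuilt order-line key can be modeled as a lexicographically ordered composite key. The encoding below is a sketch (the type and function names are hypothetical), but it shows why keys of newly created orders cluster at the tail of the index, so that concurrent New-Order transactions insert into the same rightmost leaf node.

```cpp
// Composite key (OrderId, WarehouseId, DistrictId, OrderNumber); std::tuple's
// operator< gives the lexicographic order in which the index sorts entries.
#include <cstdint>
#include <tuple>

using OrderLineKey =
    std::tuple<uint32_t /*OrderId*/, uint32_t /*WarehouseId*/,
               uint32_t /*DistrictId*/, uint32_t /*OrderNumber*/>;

OrderLineKey make_key(uint32_t order, uint32_t warehouse,
                      uint32_t district, uint32_t line) {
    return {order, warehouse, district, line};
}

int main() {
    // Two order lines of the same (recent) order sort after all earlier orders.
    OrderLineKey k1 = make_key(3001, 1, 5, 1);
    OrderLineKey k2 = make_key(3001, 1, 5, 2);
    bool ordered = k1 < k2;   // true
    (void)ordered;
}
```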

We made the following data semantics explicit to TCC: first, tuple insertions are mutually commutative; second, for the same tuple id, tuple deletion is the inverse operation of tuple insertion.
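One way to make such semantics explicit is a small registration interface like the hypothetical sketch below. TCC's actual declaration mechanism may look different, but the two facts registered are exactly the ones listed above.

```cpp
// Hypothetical semantics registry: the DBMS declares which operation types
// commute with each other and which operation undoes which, so the CC layer
// can relax ordering and preserve recoverability.
#include <map>
#include <set>
#include <string>

struct SemanticsRegistry {
    std::map<std::string, std::set<std::string>> commutes_with;
    std::map<std::string, std::string> inverse_of;

    void declare_commutative(const std::string& a, const std::string& b) {
        commutes_with[a].insert(b);
        commutes_with[b].insert(a);
    }
    void declare_inverse(const std::string& op, const std::string& inverse) {
        inverse_of[op] = inverse;
    }
    bool commutative(const std::string& a, const std::string& b) const {
        auto it = commutes_with.find(a);
        return it != commutes_with.end() && it->second.count(b) > 0;
    }
};

int main() {
    SemanticsRegistry reg;
    // The two facts made explicit to TCC in these experiments:
    reg.declare_commutative("btree_insert", "btree_insert");
    reg.declare_inverse("btree_insert", "btree_delete");   // same tuple id
    bool ok = reg.commutative("btree_insert", "btree_insert");  // true
    (void)ok;
}
```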

Figure 9 shows the results on the revised New-Order workload. We can see that TCC and the original Shore-MT beat the other approaches. TM-2PL, TM-SSI and TM-OCC suffer from high abort rates, for the same reason as in the sequential B-tree insertion experiments. Although TCC-basic does not consider data semantics, it still outperforms TM-2PL, TM-SSI and TM-OCC, thanks to its progressive operational scheduler. However, it is inferior to TCC. If an uncommitted transaction has inserted into a leaf node of a B-tree, TCC-basic must abort other transactions attempting to insert into the same leaf node, as they would be accessing uncommitted data; otherwise, it cannot ensure the recoverability of transactions. TCC avoids such aborts: it allows its B-tree insertions to access uncommitted data while still preserving recoverability, because B-tree insertions are reversible by invoking their inverse operation, B-tree deletion. We can also see that Shore-SSI and Shore-OCC cannot achieve the same performance as TCC; both suffer from high abort rates, which are incurred by conflicts on the district table.

Figure 9: Performance on Revised New-Order Transactions. (a) Throughput; (b) abort rate.

Our artificial workload was designed to demonstrate the difference between TCC and TCC-basic. It contains two types of transactions: a short transaction performs a small number of B-tree insertions, while a long transaction performs substantially more. All insertions attempt to insert into the last leaf node of a B-tree. We ran the two types of transactions separately. Figure 10 shows the results. TCC achieved performance comparable to the original Shore-MT on both types of transactions. TCC-basic performed significantly worse than TCC, especially in the case of long transactions. Because TCC considers the commutativity of B-tree insertions, it allows multiple transactions to insert into the same B-tree leaf node concurrently. TCC-basic does not allow such concurrency: while one transaction is performing an insertion, the other concurrent transactions touching the same leaf have to be aborted. The longer the transactions, the higher the abort rate.

We can therefore conclude that data semantics can be a powerful lever for enhancing the performance of TCC. For data operations that are prone to conflict, it is especially worthwhile to make them commutative and reversible (through inverse operations).

Figure 10: Performance on Short/Long Transactions. (a) Throughput and (b) abort rate for the short-transaction case; (c) throughput and (d) abort rate for the long-transaction case.
Figure 11: Chance of Extra Aborts. (a) Throughput; (b) abort rate.

As TCC performs transactional locking only after the operational latches are released, it may lead to extra aborts. We used three types of workload to measure the abort rate caused by this separation between the latching and locking phases. The workloads mix short and long data operations: a short operation updates a single record, while a long operation updates a set of records. In the first type of workload, each transaction consists of a short operation; in the second, a long operation; in the third, a randomly selected short or long operation.
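The following toy model (an illustration under simplified assumptions, not TCC's actual algorithm) shows where the extra aborts come from: between releasing its operational latches and acquiring its transactional locks, a transaction can be overtaken by another transaction that locks the same block first.

```cpp
// Toy model of the latch-then-lock race: the late-arriving transaction finds
// the block already locked by a transaction that slipped in, and must abort.
#include <cstdint>
#include <map>

struct LockTable {
    std::map<uint64_t, int> holder;   // block id -> id of the lock-holding txn

    // Returns false if a different transaction grabbed the block in between,
    // i.e., the requesting transaction lost the race and must abort.
    bool lock_after_latch_release(uint64_t block, int txn) {
        auto it = holder.find(block);
        if (it != holder.end() && it->second != txn) return false;
        holder[block] = txn;
        return true;
    }
    void release(int txn) {
        for (auto it = holder.begin(); it != holder.end(); ) {
            if (it->second == txn) it = holder.erase(it);
            else ++it;
        }
    }
};

int main() {
    LockTable lt;
    // T1's long operation latched block 9, ran, and released its latch.
    // Before T1 locks block 9, T2's short operation latches, runs, and locks it.
    bool t2_ok = lt.lock_after_latch_release(9, /*txn=*/2);   // true
    bool t1_ok = lt.lock_after_latch_release(9, /*txn=*/1);   // false -> T1 aborts
    lt.release(2);
    (void)t2_ok; (void)t1_ok;
}
```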

Figure 11 shows the results for the three types of workload. The mixed workload is the most likely to incur aborts: a long operation leaves a relatively large window between the latching phase and the locking phase, which gives short operations more chances to jump the order and trigger aborts. Nevertheless, such aborts are not a serious concern for TCC. As Figure 11 shows, they do not occur frequently even in the worst case.

7.5 Experiments on OLTP Benchmarks

Our final set of experiments was conducted on the TATP and TPC-C benchmarks.

For the experiments on TATP, we populated the database at a fixed scale factor. In each test, we ran the TATP workload for more than 10 minutes, increasing the number of worker threads to see how the system scales. Figure 12 shows the results. As the degree of contention in TATP is low, all CC mechanisms scale quite well, and we observed no significant difference among the approaches.

Figure 12: Performance on TATP. (a) Throughput; (b) abort rate.

For the experiments on TPC-C, we likewise populated the database at a fixed scale factor. In each test, we ran the standard TPC-C workload (without wait time) for 10 minutes, again increasing the number of worker threads to evaluate scalability. Figure 13 shows the results.

We can see that most of the CC mechanisms achieved relatively good performance on TPC-C, except TM-2PL. TM-2PL scales well when there are fewer than 8 workers; when the degree of concurrency exceeds 8, its throughput drops quickly. This is mainly because TM-2PL cannot deal with "select-for-update" requests. TM-2PL has no concept of operation: when encountering "select-for-update", it cannot predict that the data blocks accessed by the "select" will subsequently be "updated". It therefore has to perform frequent lock upgrades, which leads to a large number of deadlocks. In contrast, TCC is able to deal with the "select-for-update" semantics: the data organization tier can explicitly tell the CC layer that the corresponding operation should place exclusive locks on the data blocks it has accessed, so TCC can avoid lock upgrades altogether. As TM-SSI and TM-OCC do not perform locking, they do not suffer from the lock-upgrade problem.
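A hypothetical sketch of such a hint is shown below: the data organization tier marks the operation as "for update", so the CC layer acquires exclusive locks directly instead of upgrading later. The OperationDescriptor type and the lock_mode_for function are illustrative and not TCC's actual interface.

```cpp
// Select-for-update hint: a tagged operation goes straight to exclusive
// locks on the blocks it accesses, so no upgrade (and no upgrade deadlock)
// is ever needed.
#include <cstdint>
#include <vector>

enum class LockMode { Shared, Exclusive };

struct OperationDescriptor {
    bool for_update = false;                 // set by the data organization tier
    std::vector<uint64_t> accessed_blocks;   // filled while the operation runs
};

LockMode lock_mode_for(const OperationDescriptor& op) {
    return op.for_update ? LockMode::Exclusive : LockMode::Shared;
}

int main() {
    OperationDescriptor op{ /*for_update=*/true, {101, 102} };
    LockMode m = lock_mode_for(op);   // Exclusive: lock upgrades are avoided
    (void)m;
}
```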

Comparing the three versions of Shore-MT, we find that Shore-2PL performs slightly worse than Shore-SSI and Shore-OCC. Shore-2PL mainly suffers from the implementation of its predicate locks. When a transaction accesses indexed records in the warehouse and district tables, it places predicate locks, which are initially shared locks; when updates are performed, they are upgraded to exclusive locks. Such lock upgrades can cause deadlocks, which hurt Shore-2PL's performance.

In fact, we found that 2PL in general does not perform as well as SSI on TPC-C. The Payment transactions of TPC-C always update the warehouse table, while the New-Order transactions always read it. Under 2PL, a large number of transactions are blocked by the resulting read-write conflicts; the SSI-based approaches do not face this problem. TCC adopts 2PL as its transactional scheduler, so on TPC-C it cannot perform as well as the SSI-based approaches. Nevertheless, TCC is superior to TM-SSI in robustness: as shown in our previous experiments, TM-SSI can exhibit very poor performance in a variety of cases. From this perspective, none of TM-2PL, TM-SSI and TM-OCC can compare to TCC.

Figure 13: Performance on TPC-C. (a) Throughput; (b) abort rate.

These experiments show that when TCC takes care of the concurrency control of index structures, a DBMS can still process transactions efficiently. The good performance of TCC is attributable to both its robust operational scheduler and its ability to utilize data semantics.

8 Conclusion

In this paper, we attempted to separate the layer of concurrency control from a DBMS. Our results show that the separation is feasible, at least for the indexes of a DBMS: on the one hand, transactional safety can be guaranteed; on the other hand, the performance issues caused by the separation are manageable. We believe the separation will be enormously beneficial, as it can substantially improve the flexibility of a DBMS. With such flexibility, a DBMS becomes easier to implement, modify and extend.

To make the separation work, it is important to have a progressive scheduler that is robust against unpredictable data accesses. It is equally important to allow the DBMS to declare data semantics to the CC layer, especially for data operations that are prone to conflict. To achieve these goals, we created TCC, which deals with the predictability and semantic gaps effectively.

However, further research is required to make TCC practical. First, TCC needs to be tested in a broader range of scenarios. In this paper, we evaluated it on the indexes of a real-world DBMS; its applicability to an entire DBMS, especially the components for metadata management and space management, requires further investigation. Second, a transparent recovery mechanism should be integrated with TCC to support full-scale ACID. Third, principles need to be identified to help system developers make good use of TCC, including guidelines on how to determine the granularity of data operations and how to create commutative and inverse operations.

References

  • [1] D. Batoory, J. Barnett, J. Garza, K. Smith, K. Tsukuda, B. Twichell, and T. Wise. Genesis: An extensible database management system. IEEE TSE, pages 1711–1730, 1988.
  • [2] M. J. Carey, D. J. DeWitt, D. Frank, G. Graefe, M. Muralikrishna, J. E. Richardson, and E. J. Shekita. The architecture of the EXODUS extensible DBMS. International Workshop on Object-Oriented Database Systems, pages 52–65, 1986.
  • [3] C. Cascaval, C. Blundell, M. Michael, H. W. Cain, P. Wu, S. Chiras, and S. Chatterjee. Software transactional memory: Why is it only a research toy? Queue, 2008.
  • [4] D. Cervini, D. Porobic, P. Tözün, and A. Ailamaki. Applying htm to an oltp system: No free lunch. International Workshop on Data Management on New Hardware, 2015.
  • [5] S. Chaudhuri and G. Weikum. Rethinking database system architecture: Towards a self-tuning risc-style database system. VLDB, pages 1–10, 2000.
  • [6] J. Gray, P. McJones, M. Blasgen, B. Lindsay, R. Lorie, T. Price, F. Putzolu, and I. Traiger. The recovery manager of the system r database manager. ACM Computing Surveys, pages 223–242, 1981.
  • [7] S. Harizopoulos and A. Ailamaki. A case for staged database systems. CIDR, 2003.
  • [8] S. Harizopoulos, A. Ailamaki, et al. Stageddb: Designing database servers for modern hardware. IEEE Data Eng. Bull., pages 11–16, 2005.
  • [9] J. M. Hellerstein, M. Stonebraker, and J. Hamilton. Architecture of a database system. Now Publishers Inc, 2007.
  • [10] M. Herlihy and J. E. B. Moss. Transactional memory: Architectural support for lock-free data structures. SIGARCH Comput. Archit. News, pages 289–300, 1993.
  • [11] R. Johnson, I. Pandis, and A. Ailamaki. Improving oltp scalability using speculative lock inheritance. VLDB, pages 479–489, 2009.
  • [12] R. Johnson, I. Pandis, N. Hardavellas, A. Ailamaki, and B. Falsafi. Shore-mt: a scalable storage manager for the multicore era. EDBT, pages 24–35, 2009.
  • [13] R. Johnson, I. Pandis, R. Stoica, M. Athanassoulis, and A. Ailamaki. Aether: a scalable approach to logging. VLDB, pages 681–692, 2010.
  • [14] M. Kornacker, C. Mohan, and J. M. Hellerstein. Concurrency and recovery in generalized search trees. In SIGMOD Record, pages 62–72, 1997.
  • [15] V. Leis, A. Kemper, and T. Neumann. Exploiting hardware transactional memory in main-memory databases. ICDE, pages 580–591, 2014.
  • [16] J. J. Levandoski, D. Lomet, M. F. Mokbel, and K. K. Zhao. Deuteronomy: Transaction support for cloud data. CIDR, 2011.
  • [17] J. J. Levandoski, D. B. Lomet, S. Sengupta, R. Stutsman, and R. Wang. High performance transactions in deuteronomy. CIDR, 2015.
  • [18] D. Lomet, A. Fekete, G. Weikum, and M. Zwilling. Unbundling transaction services in the cloud. CIDR, 2009.
  • [19] D. Lomet and M. F. Mokbel. Locking key ranges with unbundled transaction services. Proceedings of the VLDB Endowment, 2(1):265–276, 2009.
  • [20] D. Makreshanski, J. Levandoski, and R. Stutsman. To lock, swap, or elide: On the interplay of hardware transactional memory and lock-free indexing. Proceedings of the VLDB Endowment, 8(11):1298–1309, 2015.
  • [21] C. Mohan. Aries/kvl: A key-value locking method for concurrency control of multiaction transactions operating on b-tree indexes. VLDB ’90, pages 392–405.
  • [22] R. Sears and E. Brewer. Stasis: Flexible transactional storage. OSDI, pages 29–44, 2006.
  • [23] M. Stonebraker and L. A. Rowe. The design of postgres. SIGMOD, pages 340–355, 1986.
  • [24] Z. Wang, H. Qian, J. Li, and H. Chen. Using restricted transactional memory to build a scalable in-memory database. EuroSys ’14, pages 26:1–26:15.
  • [25] G. Weikum and G. Vossen. Transactional information systems: theory, algorithms, and the practice of concurrency control and recovery. Elsevier, 2001.
  • [26] N. Zhou, X. Zhou, K.-l. Tan, and S. Wang. Transparent concurrency control: Decoupling concurrency control from dbms. arXiv.