For decades, database management systems (DBMSs) have separated on-line transaction processing (OLTP) from on-line analytics processing (OLAP) in distinct systems. This has been more an engineering compromise than a desirable feature: OLTP systems typically modified small portions of the database and demanded low latency, whereas OLAP systems typically queried large swaths of the database in read-only mode and demanded high throughput. But the promise of classical DBMS transaction theory and its serializability concepts is that all transactions – even with arbitrarily complex queries embedded within them – can be handled correctly (i.e., serializably) by a single DBMS, so that applications can program against a single-system image, even while running in a distributed system with high concurrency.
Increasingly over the last few years, the Internet-of-Things (IoT) applications are gaining their popularity. They have begun to demand this ideal of doing analytics on the same system as transaction processing, and even doing analytics within the transactions themselves CODE(0x6faa9927688)
We model a database as a logical chain of state modifications, called deltas. The chain is in time order, that is, smaller prefixes of the chain correspond to earlier states of the database. These state modifications are made by programs called transactions that logically read the current database state (the chain prefix up to that modification), and accordingly apply transaction logic to do the modification*** Application programs further need that transactions must be serialized in an order that respects the user time at which they were executed. We defer this problem until LABEL:subsub:strictser.. Each transaction runs local to a node, but may query the database state across nodes. We call the deltas produced at each node as a subchain and the database chain is an interleaving of these subchains. WiSer performs background replication and hardening of each subchain, as described later.
Figure 1 shows an example with three nodes, , , , and transactions’ deltas produced at these nodes: . The database state is the composition of deltas that come in time order: .
Content of a Delta. WiSer allows only one kind of state modification: an upsert of a row to a table. An update-by-key is treated as a query to get the prior row content, followed by an upsert of the modified contents. Deletes are updates that set an isDeleted field. A transaction can run many queries, and do many (or zero) upserts. The delta encapsulates not just these upserts, but also the read-set of – the read queries the transaction ran. Logically, the queries in should read the database chain up to ; that is, shall see any updates that happen before it. But in practice, the queries in read a prefix of this chain, called ’s snapshot: as shown in Figure 1, can observe the delta made by , but does not see that made by , which happens after the snapshot. If these query results can change due to intervening deltas (between ’s snapshot and ), must roll back. Therefore we add the read-set and the snapshot, as a serialization sequence number, to the delta.
WiSer represents transaction deltas in log format, as a Parquet CODE(0x6faa9927688)
3 Conflict Resolution
Conflicts in WiSer are resolved after we have determined a serial order. Thus the resolution problem is simple: the reads in a transaction saw a snapshot that is earlier than where its delta fits in the serial order. ’s delta includes ’s read-set and ’s snapshot (seen SSN). So the ConflictResolver needs to traverse the database chain of deltas, starting from the prior resolved point, and verify for each delta whether its read-set could have changed in value due to intervening deltas (between ’s snapshot and ). If it could have changed, then is added to Rollbacks table. is stored as a transaction ID in Rollbacks table: SSNs are assigned later, at Publish, only for the successful transactions.
Note that this process can give false positives – excess rollbacks, for two reasons. First, read-sets and write-sets are tracked approximately as we describe below. Second, in our data structure it is expensive to remove the writes of a rolled-back transaction, so a transaction can roll back due to conflicts with another one, which rolled-back for other reasons. This behavior is similar to the way deadlocks are typically handled, and applications are used to retrying transactions upon such roll-backs.
3.1 Representation of Read Sets
WiSer represents read-sets as entries in a ReadWriteSet table. This “all state is in tables” principle means that a transaction’s delta naturally incorporates its read-set as well. Queries are allowed to selectively read data by specifying conjunctive equality predicates on one or more columns, or range predicates (for table scan queries, the read set is escalated to the full table). For the common case of equality predicates, we compute a single 64-bit hash value over the key, thus easily handling different data types and multi-column predicates. For range predicates we store the boundaries of the range.
Our current implementation of WiSer unifies the role of ConflictResolver and Serializer. Thus each node running transactions replicates ReadWriteSet to the Serializer node, along with its regular heartbeats (the one containing LSN ranges of recent additions to its log).
3.2 Representation of Writes
To resolve conflicts, the read-set of a transaction needs to be compared against all writes in the serialization order between ’s snapshot and . Recall that Publish timestamps every row with its SSN. Thus this comparison can be done by an index lookup when available, and if not then via a table scan over recent data (WiSer maintains a min-max synopsis on each block for each column, so this is not terribly inefficient).
Still, to get greater efficiency in conflict resolution, the Serializer caches recent writes in memory directly. When a transaction is during execution, the read-set of is placed in ReadWriteSet as discussed earlier. In addition, the primary keys of rows written by are also placed in ReadWriteSet. Notice that the reads are represented logically (e.g., as key ranges), whereas writes are enumerated row-by-row. As the Serializer receives writes, it maintains a list of hash tables. These hash tables map hash(key) onto the latest SSN when that key was modified (using the same function to map arbitrary composite key values onto a 64-bit hash as we did for the reads).
The Serializer streams all read-sets it receives, looking for conflicts in the recent writes. A read conflicts with a write only if (a) the key-range of the read spans the key of the write, and (b) the SSN of the write is greater than the snapshot of .
These writes are maintained as a linked list of hash tables to facilitate eviction. Each hash table holds writes within a range of SSNs. So periodically we drop hash tables for SSN ranges that are older than most transactions.
3.3 Catching Phantoms from Range Predicates
Queries with range predicates are challenging to run serializably because of phantoms. The read-set for such a query is a range of key values, but the DBMS must guard against concurrent inserts of any value in that range. Typically this is done by placing a “next-key” lock on the index tree structure – i.e., a guard on the values just beyond the two boundary values of that range CODE(0x6faa9927688)
4 Constraint Resolution
We have now built up enough machinery to discuss how aggregation constraints are checked. These constraints are checked after conflict resolution and after publish, who has stamped each row with the SSN (its position in the serial order).
Logically, constraint resolution is a walk through the published data in SSN order, from the prior resolved point all the way up to the latest published point. The ConstraintResolver maintains a running aggregate, and we only support incrementally computable, algebraic aggregation constraints. The constraint is specified as a predicate on the result of the aggregation. In the online shopping application, the aggregation is to compute the sum of quantities grouped by product IDs, and the predicate is to check whether the aggregated result is larger than or equal to 0.
The ConstraintResolver can run queries repeatedly to update the running aggregate, each time selecting data in the range from the prior resolved SSN to a value , where is gradually increased until it hits a constraint violation.
But every transaction creates a new SSN, so this approach is far too slow. We want constraints to be evaluated on batches of transactions, corresponding to the batches that were serialized from each node. The challenge is that aggregates are generally not monotonic: just because aggregation constraints are not violated after incorporating an entire batch of transactions, we cannot assert that the constraints were not violated at some intermediate point (for example, a bunch of customer orders might have depleted inventory to be negative, and then a restocking transaction might have been serialized).
A second design goal is to do parallel constraint resolution, in two different forms:
Source parallelism: We want the bulk of aggregation to be performed at the nodes running the OTLP transactions, that reduces data movement and also gives significant parallelism.
Resolver parallelism: The running aggregation state for constraint enforcement can be large: for example, it is the running inventory of each product in a retail application. So we want individual resolvers to focus on subsets of products.
4.1 Partial Constraint Evaluation
WiSer supports partial constraint evaluation to reduce computation complexity.
Let us consider a setting with three nodes, each with multiple batches of transactions. The ConstraintResolver repeatedly picks up one batch from each node to resolve (recall that SerializeFrontiers lists the batches). It first pushes down to each node a partial aggregation query, which is derived from the constraint (e.g., compute the sum of quantities grouped by product IDs, in the online shopping application).
Next, the ConstraintResolver uses the partial aggregation results from each node to compute prefix sums (per the serial order) for the value of the aggregate at the start of each batch. If the prefix sum after any batch violates the constraint, the ConstraintResolver reduces that batch’s contribution to be one that makes the constraint just satisfied. For example, if the prefix sum before and after a batch is 12 and -5, respectively, then the ConstraintResolver artificially inflates the prefix sum after that batch to 0. This is not to abort the whole batch; instead, it is only to avoid polluting the prefix sums for other batches. In effect, the ConstraintResolver is delegating to the nodes the responsibility of keeping the constraint satisfied by rolling back any transactions that cause violations.
After that, the ConstraintResolver communicates these prefix sums to each node, and asks it to re-evaluate the aggregation query, except this time with a modified aggregation function. This new query runs the same group-by, but initializes its aggregation hash table with the prefix sum from the ConstraintResolver. The aggregation scans the input in SSN order (this does not involve an extra sort, because the rows were published in that order). As the query maintains the running aggregate it continually applies the constraint (e.g., inventories should stay positive), and surgically identifies and excludes individual transactions that cause constraint violation.
As a side-effect of this query, each node now adds rows to the ConstraintFailures table – these rows contain just the SSN of the constraint-violating transactions.
Notice that excluding these transactions will cause the prefix sums to change, but this is not communicated back to the ConstraintResolver. After all, removing transactions that caused constraints to be violated only moves the running aggregate in a beneficial direction, so it is safe for other nodes (checking other batches) to ignore this effect.
Constraint resolution is the last stage of transaction processing. Queries do antijoins with ConstraintFailures to avoid seeing changes from transactions that fail constraint resolution. Transactions that reached Promise but failed conflict resolution or constraint resolution are aborted transactions.
5 Performance Evaluation
WiSer is the next version of the Wildfire research prototype CODE(0x6faa9927688)
6 Related Work
WiSer builds on and borrows hugely from the large volume of literature on transaction processing, where both shared-nothing and shared-storage models have been heavily studied CODE(0x6faa9927688)
7 Conclusions and Future Work
The distributed systems community has developed consensus protocols mostly independently from the database community’s struggles with distributed transactions. We have described one approach to exploiting consensus as the underlying mechanism for transaction serialization, and shown it provides two key benefits: higher availability during the application’s interaction with the running transaction (the promise step), and greater scaling and easier enforcement of aggregation constraints after serialization.
WiSer comes at a time when demand for HTAP and real-time analytics is exploding. Yet, OLTP systems are complex and well-entrenched “mission-critical” pieces of applications. So we have an uphill task to penetrate this market, requiring that we massage our ’promise-serialize’ step to fit with existing applications, equal or surpass the plethora of must-have OLTP features (meeting existing isolation semantics, handling referential integrity, etc.), and demonstrate value from pulling compensation logic out of applications.
WiSer’s design of putting all state into tables, and having a tiny consensus piece, may also help in Byzantine settings. Blockchain systems are to some extent bypassing and re-inventing many aspects of classical OLTP. We want to explore whether casting OLTP as a consensus problem will make it easier to provide higher security, timestamping, and non-perturbation guarantees within existing DBMSs.