Many applications require a scalable and highly available backend that provides durable and consistent storage. Developing and operating such backends presents many challenges. Data loss is unacceptable and availability must be provided even in the face of network unreliability, disk errors, and a myriad of other failures. The high volume of data collected by modern applications, coupled with the large number of users and high access rates, dictates smart partitioning and placement solutions for storing the data at scale. This problem must be addressed by any company offering stateful services. Yet, despite decades of academic and industrial research, it is notoriously difficult to solve correctly and requires a high level of expertise and experience. Big companies use in-house solutions, developed and evolved over many years by large teams of experts. Smaller companies often have little choice but to pay larger cloud providers or sacrifice durability.
FoundationDB (FDB, ) democratizes industrial-grade highly-available and consistent storage, making it freely available to anyone as an open source solution (FDBGithub, ) and is currently used in production at companies such as Apple, Snowflake, and Wavefront. While the semantics, performance, and reliability of FoundationDB make it extremely useful, FoundationDB’s data model, a simple mapping from binary keys to binary values, is often insufficient for applications. Many of them need structured data storage, indexing capabilities, a query language, and more. Without these, application developers are forced to reimplement common functionality, slowing development and introducing bugs.
Furthermore, as stateful services grow, they must support the needs of many users and applications. This multi-tenancy brings many challenges, including isolation, resource sharing, and elasticity in the face of growing load. Many database systems intermingle data from different tenants at both the compute and storage levels. Retrofitting resource isolation to such systems is challenging. Stateful services are especially difficult to scale elastically because state cannot be partitioned arbitrarily. For example, data and indexes cannot be stored on entirely separate storage clusters without sacrificing transactional updates, performance, or both.
To address these challenges, we present the FoundationDB Record Layer: an open source record-oriented data store built on top of FoundationDB with semantics similar to a relational database (recordLayerAnnouncement, ; recordLayerGithub, ). The Record Layer provides schema management, a rich set of query and indexing facilities, and a variety of features that leverage FoundationDB’s advanced capabilities. It inherits FoundationDB’s strong ACID semantics, reliability, and performance in a distributed setting. The Record Layer is stateless and lightweight, with minimal overhead on top of the underlying key-value store. These lightweight abstractions allow for multi-tenancy at extremely large scale: the Record Layer allows creating isolated logical databases for each tenant—at Apple, it is used to manage billions of such databases—all while providing familiar features such as structured storage and transactional index maintenance.
The Record Layer (Figure 1) represents structured records as Protocol Buffer (protobufs, ) messages that include typed fields and even nested records. Since an application’s schema inevitably changes over time, the Record Layer includes tools for schema management and evolution. It also includes facilities for planning and efficiently executing declarative queries using a variety of index types. The Record Layer leverages advanced features of FoundationDB; for example, many aggregate indexes are maintained using FoundationDB’s atomic mutations, allowing concurrent, conflict-free updates. Beyond its rich feature set, the Record Layer provides a large set of extension points, allowing its clients to extend its functionality even further. For example, client-defined index types can be seamlessly “plugged in” to the index maintainer and query planner. Similarly, record serialization supports client-defined encryption and compression algorithms.
The Record Layer supports multi-tenancy at scale through two key architectural choices. First, the layer is completely stateless, so scaling the compute service is as easy as launching more stateless instances. A stateless design means that load-balancers and routers need only consider where the data are located, rather than which compute servers can serve them. Furthermore, a stateless server has fewer resources that need to be apportioned among isolated clients. Second, the layer achieves resource sharing and elasticity with its record store abstraction, which encapsulates the state of an entire logical database, including serialized records, indexes, and even operational state. Each record store is assigned a contiguous range of keys, ensuring that data belonging to different tenants is logically isolated. If needed, moving a tenant is as simple as copying the appropriate range of data to another cluster, since everything needed to interpret and operate each record store is found in its key range.
The Record Layer is used by multiple systems at Apple. We demonstrate the power of the Record Layer at scale by describing how CloudKit, Apple’s cloud backend service, uses it to provide strongly-consistent data storage for a large and diverse set of applications (CloudKit, ). Using the Record Layer’s abstractions, CloudKit offers multi-tenancy at the extreme by maintaining independent record stores for each user of each application. As a result, we use the Record Layer on FoundationDB to host billions of independent databases sharing thousands of schemas. In the future, we envision that the Record Layer will be combined with other storage models, such as queues and graphs, leveraging FoundationDB as a general purpose storage engine and remaining transactionally consistent across all these models. In summary, this work makes the following contributions:
An open source layer on top of FoundationDB with semantics akin to those of a relational database.
The record store abstraction and a suite of techniques to manipulate it, enabling billions of logical tenants to operate independent databases in a FoundationDB cluster.
A highly extensible architecture in which clients can customize core features, including schema management and indexing.
A lightweight design that provides rich features on top of the underlying key-value store.
2. Background on FoundationDB
FoundationDB is a distributed, ordered key-value store that runs on clusters of commodity servers and provides ACID transactions over arbitrary sets of keys, using optimistic concurrency control. Its architecture draws from the virtual synchrony paradigm (gcs, ; Birman10virtuallysynchronous, ), whereby FoundationDB is composed of two logical clusters: one that stores data and processes transactions and another coordination cluster (running Active Disk Paxos (ActiveDiiskPaxos, )) that is responsible for membership and configuration of the first cluster. This allows FoundationDB to achieve high availability while requiring only f+1 storage replicas to tolerate f failures (Birman10virtuallysynchronous, ). One of the distinguishing features of FoundationDB is its deterministic simulation testing framework, which can simulate entire clusters under a variety of failure conditions in a single thread with complete determinism in a short period of time. For example, in the past year, we have run more than 250 million simulations and have simulated more than 1870 years and 3.5 million CPU hours. This rigorous testing in simulation makes FoundationDB extremely stable and allows its developers to introduce new features and releases at a rapid cadence, unusual among similar strongly-consistent distributed, or even centralized, databases.
Layers. Unlike most databases, which bundle together a storage engine, a data model, and a query language, forcing users to choose all three or none, FoundationDB takes a modular approach: it provides a highly scalable, transactional storage engine with a minimal yet carefully chosen set of features. Layers can be constructed on top to provide various data models and other capabilities. Currently, the Record Layer is the most substantial layer built on FoundationDB.
Transactions and Semantics. FoundationDB provides ACID multi-key transactions with strictly-serializable isolation, implemented using multi-version concurrency control (MVCC) for reads and optimistic concurrency for writes. As a result, neither reads nor writes are blocked by other readers or writers. Instead, conflicting transactions fail at commit time and are usually retried by the client. Specifically, a client performing a transaction obtains a read version, chosen as the latest database commit version, by performing a getReadVersion (GRV) call and performs reads at that version, effectively observing an instantaneous snapshot of the database. Transactions that contain writes can be committed only if none of the values they read have been modified by another transaction since the transaction’s read version. Committed transactions are written to disk on multiple cluster nodes and then acknowledged to the client. FoundationDB executes operations within a transaction in parallel, while preserving the program order of accesses to each key and guaranteeing that a read following a write to the same key within a transaction returns the written value. FoundationDB imposes a 5 second transaction time limit, which the Record Layer compensates for using techniques described in Section 4.
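The commit rule described above can be illustrated with a minimal in-memory sketch. It is not the FoundationDB API: ToyStore, ToyTransaction, and ConflictError are hypothetical names, the store is single-process, and durability and parallel execution are ignored. It only shows reads at a fixed read version and the check that aborts a writing transaction whose reads were overwritten.

```python
class ConflictError(Exception):
    """A key read by the transaction was overwritten before it committed."""


class ToyStore:
    """In-memory, single-process stand-in for a versioned key-value store."""

    def __init__(self):
        self.commit_version = 0
        self.history = {}  # key -> [(commit_version, value), ...] in commit order

    def begin(self):
        # The read version is the latest commit version (the GRV step).
        return ToyTransaction(self, self.commit_version)

    def read_at(self, key, version):
        value = None
        for committed_at, stored in self.history.get(key, []):
            if committed_at <= version:
                value = stored
        return value

    def commit(self, txn):
        if not txn.writes:
            return  # read-only transactions have nothing to validate
        for key in txn.read_keys:
            versions = self.history.get(key, [])
            if versions and versions[-1][0] > txn.read_version:
                raise ConflictError(key)
        self.commit_version += 1
        for key, value in txn.writes.items():
            self.history.setdefault(key, []).append((self.commit_version, value))


class ToyTransaction:
    def __init__(self, store, read_version):
        self.store, self.read_version = store, read_version
        self.read_keys, self.writes = set(), {}

    def get(self, key):
        if key in self.writes:  # a read after a write sees the written value
            return self.writes[key]
        self.read_keys.add(key)
        return self.store.read_at(key, self.read_version)

    def set(self, key, value):
        self.writes[key] = value

    def commit(self):
        self.store.commit(self)


store = ToyStore()
setup = store.begin()
setup.set("balance", 100)
setup.commit()

t1, t2 = store.begin(), store.begin()  # both read at the same version
t1.set("balance", t1.get("balance") + 10)
t2.set("balance", t2.get("balance") - 30)
t1.commit()

try:
    t2.commit()
    t2_conflicted = False
except ConflictError:
    t2_conflicted = True
    retry = store.begin()  # the client retries at a newer read version
    retry.set("balance", retry.get("balance") - 30)
    retry.commit()

final_balance = store.read_at("balance", store.commit_version)
```

In this sketch neither transaction blocks the other; the loser learns of the conflict only at commit time and retries against a fresh snapshot.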
Besides create, read, update and delete (CRUD) operations, FoundationDB provides atomic read-modify-write operations on single keys (e.g., addition, min/max, etc.). Atomic operations occur within a transaction like other operations, but do not create read conflicts and so a concurrent change to that value would not cause the transaction to abort (they can, however, cause other transactions to abort). Instead, the operation is applied atomically at commit time. For example, a counter incremented concurrently by many clients would be best implemented using atomic addition.
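The difference between a read-modify-write counter and an atomic addition can be sketched as follows. CounterStore and its commit signature are hypothetical and greatly simplified; the point is only that an atomic add carries no read conflict, so concurrent increments all commit, while a transaction that did read the key still aborts.

```python
class CounterStore:
    """Toy store tracking, per key, the version of its latest write."""

    def __init__(self):
        self.version = 0
        self.values = {}
        self.written_at = {}

    def commit(self, read_version, read_keys, sets, atomic_adds):
        """Apply one transaction; return True on success, False on conflict."""
        if any(self.written_at.get(k, 0) > read_version for k in read_keys):
            return False
        self.version += 1
        for key, value in sets.items():
            self.values[key] = value
            self.written_at[key] = self.version
        for key, delta in atomic_adds.items():
            # Applied at commit time against whatever value is current.
            self.values[key] = self.values.get(key, 0) + delta
            self.written_at[key] = self.version
        return True


# Two clients increment by reading then writing: the second one conflicts.
rmw = CounterStore()
rv = rmw.version
rmw_first = rmw.commit(rv, {"hits"}, {"hits": 1}, {})
rmw_second = rmw.commit(rv, {"hits"}, {"hits": 1}, {})

# Two clients increment with atomic adds: no reads, so no conflicts.
atomic = CounterStore()
rv = atomic.version
add_first = atomic.commit(rv, set(), {}, {"hits": 1})
add_second = atomic.commit(rv, set(), {}, {"hits": 1})

# A third transaction that read "hits" at the old version is still aborted.
reader_ok = atomic.commit(rv, {"hits"}, {"seen": True}, {})
```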
FoundationDB allows clients to customize the default concurrency control behavior, reducing conflicts by trading off isolation semantics. Snapshot reads do not cause an abort even if the read key was overwritten by a later transaction. One can also explicitly add or remove read or write conflict ranges. For example, a transaction may wish to read a value that is monotonically increasing over time, just to determine if the value has reached some threshold. In this case, reading the value at a snapshot isolation level (or clearing the read conflict range for the value’s key) would allow the transaction to see the state of the value as it was when it started, but would not result in conflicts with other transactions that may be modifying the value concurrently.
Keys, values, and order. Keys and values in FoundationDB are opaque binary values. FoundationDB imposes limits on key and value sizes (10 kB for keys and 100 kB for values) with much smaller recommended sizes (32 B for keys and up to 10 kB for values). Transactions are limited to 10 MB in size, including the key and value size of all keys written and the sizes of all keys in the read and write conflict ranges of the commit. Keys are part of a single global namespace, and it is up to applications to divide and manage that namespace with the help of several convenience APIs such as the Tuple and Directory layers described next. FoundationDB supports range reads based on the binary ordering of keys. Finally, range clear operations are supported and can be used to clear all keys in a certain range or starting with a certain prefix.
Tuple and directory layers. Key ordering in FoundationDB makes tuples a convenient and simple way to model data. The tuple layer, included with FoundationDB, encodes tuples into keys such that the binary ordering of those keys preserves the ordering of tuples and the natural ordering of typed tuple elements. In particular, a common prefix of the tuple is serialized as a common byte prefix and defines a key subspace. For example, a client may store the tuple (state,city) and later read using a prefix like (state,*).
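To make the order-preserving property concrete, here is a deliberately simplified encoder for tuples of strings only. It is a sketch, not the real tuple layer: pack, unpack, and prefix_scan are illustrative names, and the byte layout (a 0x02 marker, escaped zero bytes, a 0x00 terminator) is an assumption chosen for this example. A sorted Python list stands in for the ordered key space.

```python
import bisect


def pack(items):
    """Encode a tuple of str so that byte order matches tuple order."""
    out = bytearray()
    for item in items:
        body = item.encode("utf-8").replace(b"\x00", b"\x00\xff")
        out += b"\x02" + body + b"\x00"
    return bytes(out)


def unpack(key):
    """Inverse of pack for keys produced by this sketch."""
    items, i = [], 0
    while i < len(key):
        if key[i] != 0x02:
            raise ValueError("unexpected type marker")
        i += 1
        buf = bytearray()
        while True:
            if key[i] == 0x00:
                if i + 1 < len(key) and key[i + 1] == 0xFF:
                    buf.append(0)  # escaped zero byte inside the string
                    i += 2
                    continue
                i += 1  # terminator
                break
            buf.append(key[i])
            i += 1
        items.append(buf.decode("utf-8"))
    return tuple(items)


def prefix_scan(sorted_keys, prefix_items):
    """Range read over all keys whose tuple starts with prefix_items."""
    prefix = pack(prefix_items)
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, prefix + b"\xff")
    return sorted_keys[lo:hi]


rows = [("CA", "San Diego"), ("NY", "Albany"), ("CA", "Cupertino"), ("WA", "Seattle")]
keys = sorted(pack(row) for row in rows)

in_key_order = [unpack(k) for k in keys]
california = [unpack(k) for k in prefix_scan(keys, ("CA",))]
```

Because the shared tuple prefix ("CA",) becomes a shared byte prefix, the (state,*) read is a single contiguous range rather than a filter over all keys.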
The directory layer provides an API for defining a logical directory structure which maps potentially long-but-meaningful binary strings to short binary strings, reducing the amount of space used by keys. For example, if all of an application’s keys are prefixed with its name, the prefix might be added to the directory layer. Internally, the directory layer assigns values to its entries using a sliding window allocation algorithm that concurrently allocates unique mappings while keeping the allocated integers small.
3. Design Principles
The Record Layer is designed to provide lightweight, scalable operations on structured data, including transaction processing, schema management, and query execution. Achieving these goals required a variety of careful design decisions, some of which differ from traditional relational databases. This section highlights several of these principles.
Statelessness. In many distributed databases, individual database servers maintain ephemeral, server-specific state, such as memory and disk buffers, in addition to persistent information such as data and indexes. In contrast, the Record Layer stores all of its state in FoundationDB so that the layer itself is completely stateless. For example, the Record Layer does not maintain any state about the position of a cursor in memory; instead, the context needed to advance a cursor in a future request is serialized and returned to the client as a continuation.
The stateless nature of the layer has three primary benefits, which we illustrate with the same example of streaming data from a cursor. First, it simplifies request routing: a request can be routed to any of the stateless servers even if it requests more results from an existing cursor, since there is no buffer on a particular server that needs to be accessed. Second, it substantially simplifies the operation of the system at scale: if the server encounters a problem, it can be safely restarted without needing to transfer cursor state to another server. Lastly, storing state in FoundationDB ensures that all state has the same ACID semantics as the underlying key-value store: we do not need separate facilities for verifying the integrity of our cursor-specific metadata.
Streaming model for queries. The Record Layer controls its resource consumption by limiting its semantics to those that can be implemented on streams of records. For example, it supports ordered queries (as in SQL’s ORDER BY clause) only when there is an available index supporting the requested sort order. This approach enables supporting concurrent workloads without requiring stateful memory pools in the server. This also reflects the layer’s general design philosophy of preferring fast and predictable transaction processing over OLAP-style analytical queries.
Flexible schema. The atomic unit of data in the Record Layer is a Protocol Buffer (protobufs, ) message which is serialized and stored in the underlying key-value space. This provides fast and efficient transaction processing on individual records akin to a row-oriented relational database. Unlike record tuples in the traditional relational model, these messages can be highly structured; in addition to a variety of complex data types, Protocol Buffer messages support nesting of record types within a field and repeated instances of the same field; these advanced features allow implementations of list and map data structures within a single record, for example. Because the Record Layer is designed to support millions of independent databases with a common schema, it stores metadata separately from the underlying data. The common metadata can be updated atomically for all stores that use it.
Efficiency. Several parts of the Record Layer’s design enable it to run efficiently at scale. For example, the Record Layer is implemented as a library that can be embedded in its client rather than an independent client/server system; while the Record Layer can be (and in many cases is) embedded in a server environment, it makes no presumptions and imposes few requirements on how the server is implemented. Since FoundationDB is most performant at high levels of concurrency, nearly all of the Record Layer’s operations are implemented asynchronously and pipelined where possible. We also make extensive use of FoundationDB-specific features, such as controllable isolation semantics and versioning, both within the layer’s implementation and exposed to clients via its API.
Extensibility. In the spirit of FoundationDB’s layered architecture, the Record Layer exposes a large number of extension points through its API so that clients can extend its functionality. For example, clients can easily define new index types, methods for maintaining those indexes, and rules that extend its query planner to use those indexes in planning. This extensibility makes it easy for clients to add features that are left out of the Record Layer’s core, such as memory pool management and arbitrary sorting, in a way that makes sense for their use case. Our implementation of CloudKit on top of the Record Layer (discussed in Section 8) makes substantial use of this extensibility with custom index types, planner behavior, and schema management.
4. Record Layer Overview
The Record Layer is primarily used as a library by stateless backend servers that need to store structured data in FoundationDB. It is used to store billions of logical databases, called record stores, with thousands of schemas. Records in a record store are Protocol Buffer messages and a record’s type is defined with a Protocol Buffer definition. The schema, also called the metadata, of a record store is a set of record types and index definitions on these types (see Section 6). Metadata is versioned and may be stored in FoundationDB or elsewhere (metadata management and evolution is discussed in Section 5). The record store is responsible for storing raw records, indexes defined on the record fields, and the highest version of the metadata it was accessed with.
Providing isolation between record stores is key for multi-tenancy. To facilitate resource isolation, the Record Layer tracks and enforces limits on resource consumption for each transaction, provides continuations to resume work, and can be coupled with external throttling. On the data level, the keys of each record store start with a unique binary prefix, defining a FoundationDB subspace. All the record store’s data is logically co-located within the subspace and the subspaces of different record stores do not overlap.
Unlike traditional relational databases, all record types within a record store are interleaved within the same extent, and hence both queries and index definitions may span all types of records within the store. From a relational database perspective, this is akin to having the ability to create a single index across all tables that include the same column.
Primary keys and indexes are defined within the Record Layer using key expressions, covered in detail in Section 6.1. A key expression defines a logical path through records; applying it to a record extracts record field values and produces a tuple that becomes the primary key for the record or key of the index for which the expression is defined. Key expressions may produce multiple tuples, allowing indexes to “fan out” and generate index entries for individual elements of nested and repeated fields.
To avoid exposing FoundationDB’s limits on key and value sizes to clients, the Record Layer splits large records across a set of contiguous keys and splices them back together when deserializing split records. A special type of split, immediately preceding each record, holds the commit version of the record’s last modification; it is returned with the record on every read. The Record Layer supports pluggable serialization libraries, including optional compression and encryption of stored records.
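The splitting scheme can be sketched as below. The key layout is an assumption made for illustration: each key is a (primary key, suffix) pair, suffix -1 holds the commit version so that it sorts immediately before the data, and suffixes 0, 1, 2, ... hold consecutive chunks. The names split_record and join_record and the 100,000-byte chunk size are likewise hypothetical.

```python
CHUNK_SIZE = 100_000  # assumed chunk size, chosen to respect the value size limit


def split_record(primary_key, payload, version, chunk_size=CHUNK_SIZE):
    """Return the ordered (key, value) pairs that store one record."""
    pairs = [((primary_key, -1), version)]  # version slot precedes the data
    for offset in range(0, max(len(payload), 1), chunk_size):
        suffix = offset // chunk_size
        pairs.append(((primary_key, suffix), payload[offset:offset + chunk_size]))
    return pairs


def join_record(pairs):
    """Splice a record back together from the result of a range read."""
    version, chunks = None, []
    for (_, suffix), value in sorted(pairs, key=lambda kv: kv[0]):
        if suffix == -1:
            version = value
        else:
            chunks.append(value)
    return version, b"".join(chunks)


payload = b"x" * 250_000  # larger than a single value may be
stored = split_record("user-42", payload, version=7)
restored_version, restored_payload = join_record(stored)
```

Since all pairs for a record share the primary key as a prefix, one range read retrieves them and one range clear deletes them.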
The Record Layer provides APIs for storing, reading and deleting records, creating and deleting record stores and indexes in stores, scanning and querying records using the secondary indexes, updating record store metadata, managing a client application’s directory structure, and iteratively rebuilding indexes (when they cannot be rebuilt as part of a single transaction).
All Record Layer operations that provide a cursor over a stream of data, such as record scans, index scans, and queries, support continuations. A continuation is an opaque binary value that represents the starting position of the next available value in a cursor stream. Results are parceled to clients along with the continuation, allowing them to resume the operation by supplying the returned continuation when invoking the operation again. This gives clients a way to control the iteration without requiring the server to maintain state and allows scan or query operations that exceed the transaction time limit to be split across multiple transactions.
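A continuation-based scan can be sketched in a few lines. This is an illustrative model only: scan is a hypothetical function, the continuation is simply the last returned key serialized to bytes, and a sorted list stands in for the record store. The server-side function keeps no state between calls.

```python
import base64
import bisect
import json


def scan(sorted_keys, limit, continuation=None):
    """Return (page, continuation); continuation is None once exhausted."""
    start = 0
    if continuation is not None:
        last_key = json.loads(base64.b64decode(continuation))
        start = bisect.bisect_right(sorted_keys, last_key)
    page = sorted_keys[start:start + limit]
    if start + len(page) >= len(sorted_keys):
        return page, None
    # Everything needed to resume is inside the token handed to the client.
    token = base64.b64encode(json.dumps(page[-1]).encode("utf-8"))
    return page, token


keys = ["a", "b", "c", "d", "e"]
collected, pages, token = [], 0, None
while True:
    page, token = scan(keys, limit=2, continuation=token)  # any server could serve this
    collected.extend(page)
    pages += 1
    if token is None:
        break
```

Each call could run in its own transaction on a different stateless server, which is how a scan longer than the transaction time limit can be completed.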
The Record Layer exposes many of FoundationDB’s transaction settings to allow faster data access. For example, record prefetching asynchronously preloads records into the FoundationDB client’s read-your-writes cache, but does not return them to the client application. When reading batches of many records, this can potentially save a context switch and record deserialization. The Record Layer also includes mechanisms to trade off consistency for performance, such as snapshot reads. Similarly, the layer exposes FoundationDB’s “causal-read-risky” flag, which causes getReadVersion to be faster at the risk of returning a slightly stale read version in the rare case that the FoundationDB master is disconnected from the rest of the system but is not yet aware of this fact. This is usually an acceptable risk; for example, ZooKeeper’s “sync” operation behaves similarly (zookeeper, ). Furthermore, transactions that modify state never return stale data since they perform validations at commit stage. Read version caching optimizes getReadVersion further by completely avoiding communication with FoundationDB if a read version was “recently” fetched from FoundationDB. Often, the client application provides an acceptable staleness and the last seen commit version as transaction parameters; the Record Layer uses a cached version as long as it is sufficiently recent and no smaller than the version previously observed by the client. This may result in reading stale data and may increase the failure rate of transactions that modify state. Version caching is most useful for read-only transactions that do not need to return the latest data and for low-concurrency workloads where the abort rate is low.
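The read version caching rule (reuse a cached version only if it is recent enough and not older than what the client has already seen) can be sketched as follows. ReadVersionCache, its parameters, and the fake getReadVersion function are all hypothetical; an injectable clock is used so the behavior is deterministic.

```python
import time


class ReadVersionCache:
    """Caches the last fetched read version; not the Record Layer's class."""

    def __init__(self, fetch_version, clock=time.monotonic):
        self._fetch = fetch_version
        self._clock = clock
        self._version = None
        self._fetched_at = None

    def get(self, max_staleness_s, min_version=0):
        now = self._clock()
        usable = (
            self._version is not None
            and now - self._fetched_at <= max_staleness_s
            and self._version >= min_version  # never go behind the client's view
        )
        if not usable:
            self._version = self._fetch()  # the expensive round trip
            self._fetched_at = now
        return self._version


calls = []
now = [0.0]


def fake_get_read_version():
    calls.append(1)
    return 1000 + 10 * len(calls)


cache = ReadVersionCache(fake_get_read_version, clock=lambda: now[0])
v1 = cache.get(max_staleness_s=5.0)                    # cold cache: fetch
now[0] = 2.0
v2 = cache.get(max_staleness_s=5.0)                    # fresh enough: reuse
v3 = cache.get(max_staleness_s=5.0, min_version=1015)  # client saw newer: fetch
now[0] = 10.0
v4 = cache.get(max_staleness_s=5.0)                    # too old: fetch
```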
To help clients organize their record stores in FoundationDB’s key space, the Record Layer provides a KeySpace API which exposes the key space in a fashion similar to a filesystem directory structure. When writing data to FoundationDB, or when defining the location of a record store, a path through this logical directory tree may be traced and compiled into a tuple value that becomes a row key. The KeySpace API ensures that all directories within the tree are logically isolated and non-overlapping. Where appropriate, it uses the directory layer (described in Section 2) to automatically convert directory names to small integers.
5. Metadata management
The Record Layer provides facilities for managing changes to a record store’s metadata. Since one of its goals is to support many databases that share a common schema, the Record Layer allows metadata to be stored in a separate keyspace from the data, or even a separate storage system entirely. In most deployments, this metadata is aggressively cached by clients so that records can be interpreted without additional reads from the key-value store. This architecture allows low-overhead, per-request connections to a particular database.
Schema evolution. Since records are serialized into the underlying key-value store as Protocol Buffer messages (possibly after pre-processing steps, such as compression and encryption), some basic data evolution properties are inherited from Protocol Buffers: new fields can be added to a record type and show up as uninitialized in old records and new record types can be added without interfering with old records. As a best practice, field numbers are never reused and should be deprecated rather than removed altogether.
The metadata is versioned in a single-stream, non-branching, monotonically increasing fashion. Every record store keeps track of the highest version it has been accessed with by storing it in a small header within a single key-value pair. When a record store is opened, this header is read and the version is compared with the current metadata version.
Typically, the metadata will not have changed since the store was last opened, so these versions are the same. When the version in the database is newer, a client has usually used an out-of-date cache to obtain the current metadata. If the version in the database is older, changes need to be applied. New record types and fields can be added by updating the Protocol Buffer definition.
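The three outcomes of the version comparison can be sketched as below. The dictionaries standing in for the store header and the metadata, the StaleMetadataError exception, and the rule that each index records the metadata version that introduced it are assumptions made for this example.

```python
class StaleMetadataError(Exception):
    """The caller's cached metadata is older than what the store has seen."""


def open_store(header, metadata):
    """Compare versions on open; return the indexes that still need building."""
    stored = header.get("metadata_version", 0)
    current = metadata["version"]
    if stored > current:
        # The store was already accessed with newer metadata: refresh the cache.
        raise StaleMetadataError(f"store at {stored}, client at {current}")
    if stored == current:
        return []  # the common case: nothing changed since the last open
    # The store is behind: find indexes added since it was last opened.
    pending = sorted(
        name for name, added_in in metadata["indexes"].items() if added_in > stored
    )
    header["metadata_version"] = current
    return pending


metadata = {"version": 3, "indexes": {"by_city": 1, "by_age": 3}}
header = {"metadata_version": 2}

first_open = open_store(header, metadata)   # applies the upgrade
second_open = open_store(header, metadata)  # versions now match

try:
    open_store({"metadata_version": 4}, metadata)
    stale_detected = False
except StaleMetadataError:
    stale_detected = True
```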
Adding indexes. An index on a new record type can be enabled immediately, since there are no records of that type yet. Adding an index to an existing record type, which might already have records in the record store, is more expensive, since it might require reindexing. Since records of different types may exist in the same key space, all the records need to be scanned when building a new index. If there are very few or no records, the index can be built right away within a single transaction. If there are many existing records, the index cannot be built immediately because that might exceed the 5 second transaction time limit. Instead, the index is disabled and the reindexing proceeds as a background job, as described in Section 6.
Metadata versioning. Occasionally, changes need to be made to the way that the Record Layer itself encodes data. For this, the same database header that records the application metadata version also records a storage format version, which is updated at the same time. Updating may entail reformatting small amounts of data, or, in some cases, enabling a compatibility mode for old formats. We also maintain an “application version” for use by the client that can be used to track data evolution that is not captured by the metadata alone. For example, a nested record type might be promoted to a top-level record type as part of data renormalization. The application version allows checking for these changes as part of opening the record store instead of implementing checks in the application code. If a series of such changes occur, the version can also be used as a counter tracking how far along we are in applying the changes to the record store.
6. Index Definition and Maintenance
Record Layer indexes are durable data structures that support efficient access to data, or possibly some function of the data, and can be maintained in a streaming fashion, i.e., updated incrementally when a record is inserted, updated or deleted using only the contents of that record. Index maintenance occurs in the same transaction as the record change itself, ensuring that indexes are always consistent with the data; our ability to do this efficiently relies heavily on FoundationDB’s fast multi-key transactions. Efficient index scans use FoundationDB’s range reads and rely on the lexicographic ordering of stored keys. Each index is stored in a dedicated subspace within the record store so that indexes can be removed cheaply using FoundationDB’s range clear operation.
Indexes may be configured with one or more index filters, which allow records to be conditionally excluded from index maintenance, effectively creating a “sparse” index and potentially reducing storage space and maintenance costs.
Unlike indexes in classic relational databases, indexes are first-class citizens in the Record Layer. That is, they can be scanned directly (with or without retrieving the record to which the index points) rather than only being used internally to optimize queries. As a result, the Record Layer supports the ability to define and utilize indexes for which a query syntax may be difficult to express.
Index maintenance. Defining an index type requires implementing an index maintainer tasked with updating the index when records change, according to the type of the index. The Record Layer provides built-in index maintainers for a variety of index types (Section 7). Furthermore, the index maintainer abstraction is directly exposed to clients, allowing them to define custom index types and have them maintained transactionally by the layer.
When a record is saved, we first check if a record already exists with the new record’s primary key. If so, registered index maintainers remove or update associated index entries for the old record and delete the old record from the record store. A range clear to delete the old record is necessary as records can be split across multiple keys. Next, we insert the new record into the record store. Finally, registered index maintainers insert or update any associated index entries for the new record. We use a variety of optimizations during index maintenance; for example, if an existing record and a new record are of the same type and some of the indexed fields are the same, the unchanged indexes are not updated.
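The save path described above can be sketched with a toy store that keeps one value index per field. ToyRecordStore is hypothetical; records are plain dictionaries, every step runs in memory rather than in one FoundationDB transaction, and index_writes exists only to make the skipped work visible.

```python
class ToyRecordStore:
    """In-memory records plus simple (value, primary key) indexes."""

    def __init__(self, indexed_fields):
        self.records = {}  # primary key -> record dict
        self.indexes = {name: set() for name in indexed_fields}
        self.index_writes = 0  # counts index mutations, for illustration

    def save(self, pk, record):
        old = self.records.get(pk)
        unchanged = set()
        if old is not None:
            # Remove index entries of the old record, skipping unchanged fields.
            for name, entries in self.indexes.items():
                if old.get(name) == record.get(name):
                    unchanged.add(name)
                    continue
                if name in old:
                    entries.discard((old[name], pk))
                    self.index_writes += 1
            del self.records[pk]  # stands in for the range clear
        self.records[pk] = dict(record)
        # Insert index entries for the new record.
        for name, entries in self.indexes.items():
            if name in unchanged or name not in record:
                continue
            entries.add((record[name], pk))
            self.index_writes += 1


store = ToyRecordStore(["city", "age"])
store.save(1, {"city": "Paris", "age": 30})
writes_after_insert = store.index_writes
store.save(1, {"city": "Paris", "age": 31})  # only the age index is touched
writes_after_update = store.index_writes
```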
Online index building. The Record Layer includes an online index builder used to build or rebuild indexes in the background. To ensure that the index is not used before it is fully built, indexes begin in a write-only state, where writes maintain the index but queries do not use it yet. The index builder then scans the record store and instructs the index maintainer for that index to update the index for each encountered record. When the index build completes, the index is marked as readable, the normal state where the index is maintained by writes and usable by queries. Online index building is broken into multiple transactions to reduce conflicts with concurrent mutations.
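The write-only and readable states can be illustrated with a small generator-based builder, where each yielded batch stands in for one transaction. The names build_batches and save, the dictionary layout of the index, and the "disabled" starting label are assumptions for this sketch.

```python
def build_batches(records, index, batch_size):
    """Scan records in key order, one batch per simulated transaction."""
    index["state"] = "write_only"  # writes maintain it, queries ignore it
    last = None
    while True:
        batch = sorted(pk for pk in records if last is None or pk > last)[:batch_size]
        if not batch:
            break
        for pk in batch:
            index["entries"].add((records[pk], pk))
        last = batch[-1]
        yield last  # progress marker; a later transaction resumes after it
    index["state"] = "readable"


def save(records, index, pk, value):
    """A concurrent write keeps the index current while it is being built."""
    if index["state"] in ("write_only", "readable"):
        if pk in records:
            index["entries"].discard((records[pk], pk))
        index["entries"].add((value, pk))
    records[pk] = value


records = {1: "a", 2: "b", 3: "c", 4: "d"}
index = {"state": "disabled", "entries": set()}

builder = build_batches(records, index, batch_size=2)
next(builder)                  # first batch indexes records 1 and 2
state_mid_build = index["state"]
save(records, index, 1, "z")   # write to an already scanned record
for _ in builder:              # remaining batches
    pass
```

Because adding an entry is idempotent, a record that is both written concurrently and later visited by the scan ends up with exactly one entry.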
6.1. Key Expressions
Indexes are defined by an index type and a key expression, which defines a function from a record to one or more tuples consumed by the index maintainer and used to form the index key. The Record Layer includes a variety of key expressions and also allows clients to define custom ones.
Field key expressions. The simplest key expression is field. When evaluated against the sample record in Figure 2, field("id") yields the tuple (1066). Unlike standard SQL columns, Protocol Buffer fields permit repeated fields and nested messages. We support nested messages through the nest key expression. For example, field("parent").nest("a") yields (1415). To support repeated fields, field expressions define an optional FanType parameter. For a repeated field, FanType of Concatenate produces a tuple with one entry containing a list of all values within the field, while Fanout produces a separate tuple for each value. For example, field("elem", Concatenate) yields (["first", "second", "third"]), and field("elem", Fanout) yields three tuples: ("first"), ("second"), and ("third").
FanType determines the type of queries supported by an index. By default, each tuple (along with the record’s primary key) becomes an index entry. With Concatenate, the index can efficiently find records that have values matching a certain sequence or beginning with a certain sequence, since Protocol Buffers preserve the order of repeated values (protodocs, ) (e.g., all records where elem begins with "first"). Such an index can also be used to lexicographically sort repeated values. Fanout can be used to find records containing a particular value anywhere in the list of repeated values.
To create compound indexes, multiple key expressions can be concatenated. For example, concat(field("id"), field("parent").nest("b")) evaluates to the single tuple (1066, "child"). If any of the sub-expressions produce multiple values, the compound expression will produce the tuples in the Cartesian product of the sub-expressions’ values.
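The evaluation rules for field, nest, FanType, and concat can be modelled in a few lines. The sketch below is a simplified stand-in rather than the Record Layer API: the `field`, `nest`, and `concat` functions are hypothetical, `nest` takes the child expression as an argument instead of being chained, and the sample record is reconstructed only from the values quoted in the text.

```python
from itertools import product

# Sample record reconstructed from the values quoted in the text
# (the record in Figure 2 may contain more fields).
RECORD = {
    "id": 1066,
    "parent": {"a": 1415, "b": "child"},
    "elem": ["first", "second", "third"],
}

CONCATENATE = "concatenate"
FANOUT = "fanout"


def field(name, fan_type=None):
    """Return a function mapping a record to a list of tuples."""
    def evaluate(record):
        value = record[name]
        if fan_type == FANOUT:
            return [(v,) for v in value]      # one tuple per repeated value
        if fan_type == CONCATENATE:
            return [(list(value),)]           # one tuple holding the whole list
        return [(value,)]
    return evaluate


def nest(parent_name, child):
    """Evaluate the child expression against a nested message."""
    def evaluate(record):
        return child(record[parent_name])
    return evaluate


def concat(*expressions):
    """Cartesian product of the sub-expressions' tuples, each combination flattened."""
    def evaluate(record):
        results = [expr(record) for expr in expressions]
        return [sum(combo, ()) for combo in product(*results)]
    return evaluate


compound = concat(field("id"), nest("parent", field("b")))(RECORD)
fanned = concat(field("id"), field("elem", FANOUT))(RECORD)
```

Here `compound` is the single tuple `(1066, "child")`, while `fanned` holds three tuples, one per value of `elem`, which is the Cartesian-product behaviour described above.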
Advanced key expressions. In addition to its core key expressions, the Record Layer includes a variety of key expressions to support particular features. For example, the record type key expression produces a value that is guaranteed to be unique for each record type. This key expression is useful for defining primary keys that divide the primary record subspace by record type, allowing users to treat the record types more like tables in a traditional relational database. It can also be used to satisfy queries about record types, such as a query for the number of records of each type that are stored within the database. Another special key expression, version, is described in Section 7.
In addition to allowing clients to define their own key expressions, the Record Layer has function key expressions which allow for the execution of arbitrary, user-defined functions against records and their constituent fields. For example, one might define a “negation” function that is defined on some integer field of a record and returns a single tuple with that field negated. Function key expressions are very powerful and can allow, for example, users to define custom sort orders based on arbitrary functions of the records.
Another special key expression is groupBy, which defines “split points” used to divide an index into multiple sub-indexes. For example, a SUM index using the key expression field("parent").nest("a").groupBy(field("parent").nest("b")) enables efficiently finding the sum of the parent field’s a field over all records where the parent field’s b field is equal to "child".
The KeyWithValue key expression has two sub-expressions, where the first is included in the index’s key while the second is included in the value. This defines a covering index satisfying queries that need a subset of the fields included in the index value without an additional record lookup.
7. Index Types
The type of an index determines which predicates it can help evaluate. The Record Layer supports a variety of index types, many of which make use of specific FoundationDB features. Clients can define their own index types by implementing and registering custom key expressions and index maintainers (see Section 6). In this section, we outline the VALUE, atomic mutation, and VERSION index types. Appendix A describes the RANK and TEXT index types, used for dynamic order statistics and full-text indexing, respectively.
Unlike in traditional relational systems, indexes can span multiple record types, in which case any fields referenced by the key expression must exist in all of the index’s record types. Such indexes allow for efficient searches across different record types with a common set of search criteria.
VALUE Indexes. The default VALUE index type provides a standard mapping from index entry (a single field or a combination of field values) to record primary key. Scanning the index can be used to satisfy many common predicates, e.g., to find all primary keys where an indexed field’s value is less than or equal to a given value.
Atomic mutation indexes. Atomic mutation indexes are implemented using FoundationDB’s atomic mutations, described in Section 2. These indexes are generally used to support aggregate statistics and queries. For example, the SUM index type stores the sum of a field’s value over all records in a record store, where the field is defined in the index’s key expression. In this case, the index contains a single entry, mapping the index subspace path to the sum value. The key expression could also include one or more grouping fields, in which case the index contains a sum for each value of the grouping field. While the maintenance of such an index could be implemented by reading the current index value, updating it with a new value, and writing it back to the index, such an implementation would not scale, as any two concurrent record updates would necessarily conflict. Instead, the index is updated using FoundationDB’s atomic mutations (e.g., the ADD mutation for the SUM index), which do not conflict with other mutations.
The Record Layer currently supports the following atomic mutation index types, tracking different aggregate metrics:
COUNT - number of records
COUNT UPDATES - number of times a field has been updated
COUNT NON NULL - number of records where a field isn’t null
SUM - summation of a field’s value across all records
MAX (MIN) EVER - max (min) value ever assigned to a field, over all records, since the index has been created
Note that these indexes have a relatively low footprint compared to VALUE indexes as they only write a single key for each grouping key, or, in its absence, a single key for each record store. However, a small number of index keys that need to be updated on each write can lead to high write traffic on those keys, causing high CPU and I/O usage for the FoundationDB storage servers that hold them. This can also result in increased read latency for clients attempting to read from these servers.
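To make the conflict argument concrete, the following sketch compares a read-modify-write update of a SUM entry with an atomic add under optimistic concurrency control. It is a toy model, not FoundationDB: the `Store` and `Transaction` classes, the conflict check, and the `("sum", "child")` key are all assumptions made for illustration.

```python
class ConflictError(Exception):
    pass


class Store:
    """Toy key-value store with optimistic concurrency control (hypothetical)."""

    def __init__(self):
        self.data = {}
        self.last_write = {}   # key -> commit version of the last write
        self.version = 0

    def begin(self):
        return Transaction(self)


class Transaction:
    def __init__(self, store):
        self.store = store
        self.read_version = store.version
        self.read_keys = set()
        self.writes = {}
        self.adds = {}

    def get(self, key):
        self.read_keys.add(key)            # a read adds a read conflict on the key
        return self.store.data.get(key, 0)

    def set(self, key, value):
        self.writes[key] = value

    def atomic_add(self, key, delta):
        # Blind write: nothing is read, so no read conflict is recorded.
        self.adds[key] = self.adds.get(key, 0) + delta

    def commit(self):
        for key in self.read_keys:
            if self.store.last_write.get(key, 0) > self.read_version:
                raise ConflictError(key)
        self.store.version += 1
        for key, value in self.writes.items():
            self.store.data[key] = value
            self.store.last_write[key] = self.store.version
        for key, delta in self.adds.items():
            self.store.data[key] = self.store.data.get(key, 0) + delta
            self.store.last_write[key] = self.store.version


SUM_KEY = ("sum", "child")   # one index entry per grouping value


def run_read_modify_write():
    store = Store()
    t1, t2 = store.begin(), store.begin()
    for txn, amount in ((t1, 5), (t2, 7)):
        txn.set(SUM_KEY, txn.get(SUM_KEY) + amount)
    t1.commit()
    try:
        t2.commit()
    except ConflictError:
        return store.data, True    # the second update must be retried
    return store.data, False


def run_atomic_add():
    store = Store()
    t1, t2 = store.begin(), store.begin()
    t1.atomic_add(SUM_KEY, 5)
    t2.atomic_add(SUM_KEY, 7)
    t1.commit()
    t2.commit()
    return store.data
```

In the read-modify-write variant the second transaction fails and would have to retry; with atomic adds both commit and the entry ends at 12. The hot-key concern raised above still applies, since both variants write the same key.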
VERSION indexes. VERSION indexes are very similar to VALUE indexes in that they define an index entry and a mapping from each entry to the associated primary key. The main difference between them is that a VERSION index allows the index’s key expression to include a special “version” field: a 12-byte monotonically increasing value representing the commit version of the last update to the indexed record. Specifically, FoundationDB commits transactions in batches, where each transaction can include multiple operations on individual records. The first 8 bytes of the version are the FoundationDB commit version and identify the transaction batch. The following 2 bytes identify a particular transaction within the batch, and finally the last 2 bytes identify an operation within the transaction. The version is guaranteed to be unique and monotonically increasing with time, within a FoundationDB cluster. The first 10 bytes are assigned by the FoundationDB servers upon commit, and only the last 2 bytes are assigned by the Record Layer, using a counter maintained by the Record Layer per transaction. Since versions are assigned in this way for each record insert and update, each record stored in the cluster has a unique version.
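The 8 + 2 + 2 byte layout can be illustrated with Python's `struct` module. The sketch assumes big-endian encoding so that byte-wise comparison of versions matches numeric comparison of their components; `pack_version` and `unpack_version` are hypothetical helper names, not Record Layer functions.

```python
import struct

# ">QHH": big-endian, 8-byte unsigned, 2-byte unsigned, 2-byte unsigned.
VERSION_FORMAT = ">QHH"


def pack_version(commit_version, batch_order, local_counter):
    """Build a 12-byte version.

    commit_version: identifies the transaction batch (assigned at commit)
    batch_order:    identifies the transaction within the batch (assigned at commit)
    local_counter:  identifies the operation within the transaction (assigned by the layer)
    """
    return struct.pack(VERSION_FORMAT, commit_version, batch_order, local_counter)


def unpack_version(raw):
    return struct.unpack(VERSION_FORMAT, raw)


first = pack_version(1000, 0, 1)    # second operation of the first transaction in batch 1000
second = pack_version(1000, 1, 0)   # first operation of the next transaction in the same batch
third = pack_version(1001, 0, 0)    # first operation in a later batch
```

Because the most significant component comes first, sorting the packed values as byte strings orders them by batch, then by transaction, then by operation.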
Since the version is only known upon commit, it is not included within the record’s Protocol Buffer representation. However, the Record Layer must be able to determine the version of a record in order to perform index maintenance, e.g., to find the relevant index entries when deleting a record. To this end, the Record Layer writes a mapping from the primary key of each record to its associated version. This mapping is stored next to the key-value pair representing the Protocol Buffer value for this key, so that both can be retrieved efficiently with a single range-read.
Version indexes expose the total ordering of operations within a FoundationDB cluster. For example, a client can scan a prefix of a version index and be sure that it can continue scanning from the same point and observe all the newly written data. The following section describes how CloudKit uses this index type to implement change-tracking (sync).
8. Use Case: CloudKit
CloudKit (CloudKit, ) is Apple’s cloud backend service and application development framework, providing much of the backbone for storage, management, and synchronization of data across devices as well as sharing and collaboration between users. We describe how CloudKit uses FoundationDB and the Record Layer, allowing it to support applications requiring advanced features, such as the transactional indexing and query capabilities described in this paper.
Within CloudKit a given application is represented by a logical container, defined by a schema that specifies the record types, typed fields, and indexes that are needed to facilitate efficient record access and queries. The application clients store records within named zones. Zones organize records into logical groups which can be selectively synced across client devices.
CloudKit assigns a unique FoundationDB subspace for each user, and defines a record store within that subspace for each application accessed by the user. This means that CloudKit is effectively maintaining logical databases, each with its own records, indexes, and other metadata; CloudKit maintains billions of such databases. When requests are received from client devices, they are routed and load balanced across a pool of available CloudKit Service processes, at which point the appropriate Record Layer record store is accessed and the request is serviced.
CloudKit translates the application schema into a Record Layer metadata definition and stores it in a metadata store (depicted in Figure 3). The metadata also includes attributes added by CloudKit such as system fields tracking record creation and modification time and the zone in which the record was written. Zone name is added as a prefix to primary keys, allowing efficient per-zone access to records. In addition to user-defined indexes, CloudKit maintains a number of “system” indexes, such as an index tracking the total record size by record type, used for quota management.
8.1. New CloudKit Capabilities
CloudKit was initially implemented using Cassandra (cassandra, ) as the underlying storage engine. To support atomic multi-record operation batches within a zone, CloudKit uses Cassandra’s light-weight transactions (cas, ): all updates to the zone are serialized using Cassandra’s compare-and-set (CAS) operations on a dedicated per-zone update-counter. This implementation suffices for many applications using CloudKit, but has two scalability limitations. First, there is no concurrency within a zone, even for operations making changes to different records. Second, multi-record atomic operations are scoped to a single Cassandra partition, and partitions are limited in size; furthermore, Cassandra’s performance deteriorates as the size of a partition grows. The first limitation is a particular concern for collaborative applications, where data is shared among many users or client devices. These limitations require application designers to carefully model their data and workload such that records updated together reside in the same zone while making sure that zones do not grow too large and that the rate of concurrent updates is minimized.
The implementation of CloudKit on FoundationDB and the Record Layer addresses both issues. Transactions are scoped to the entire database, allowing CloudKit zones to grow significantly larger than before and supporting concurrent updates to different records within a zone. Leveraging these new transactional capabilities, CloudKit now exposes interactive transactions to its clients, specifically to other backend services that access CloudKit through gRPC (gRPC, ). This simplifies the implementation of client applications and has enabled many new applications on top of CloudKit.
Previously, only very few “system” indexes were maintained transactionally by CloudKit in Cassandra, whereas all user-defined secondary indexes were maintained in Solr. Due to high access latencies, these indexes are updated asynchronously and queries that use them obtain an eventually consistent view of the data, requiring application designers to work around perceived inconsistencies. With the Record Layer, user-defined secondary indexes are maintained transactionally with updates so all queries return the latest data.
Personalized full-text search. Users expect instant access to data they create such as emails, text messages, and notes. Often, indexed text and other data are interleaved, so transactional semantics are important. We implemented a personalized text indexing system using the TEXT index primitives described in Appendix A that now serves millions of users. Unlike traditional search indexing systems, all updates are done transactionally and no background jobs are needed to perform index updates and deletes. In addition to providing a consistent view of the data, this approach also reduces operational costs by storing all data in one system. Our system uses FoundationDB’s key order to support prefix matching with no additional overhead, and supports n-gram searches by creating only n key entries instead of the usual O(n^2) keys needed to create all possible sub-strings for supporting n-gram searches. The system also supports proximity and phrase search.
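One way to see how key order keeps the number of index entries small is to store one key per suffix of a token and answer substring queries with a prefix range scan. The sketch below illustrates that idea only; it is not the layout of the TEXT index, `SuffixIndex` and `suffix_keys` are hypothetical names, and a sorted Python list stands in for the ordered keyspace.

```python
from bisect import bisect_left


def suffix_keys(token, record_id):
    """One key per suffix: a token of length n yields n keys."""
    return [(token[i:], record_id) for i in range(len(token))]


class SuffixIndex:
    def __init__(self):
        self.keys = []   # kept sorted, standing in for an ordered key-value store

    def add(self, token, record_id):
        for key in suffix_keys(token, record_id):
            pos = bisect_left(self.keys, key)
            if pos == len(self.keys) or self.keys[pos] != key:
                self.keys.insert(pos, key)

    def containing(self, fragment):
        """Ids of records with a token containing `fragment`.

        Every occurrence of the fragment is a prefix of some suffix, and all
        suffixes sharing a prefix are adjacent in key order, so one range scan
        starting at the fragment finds them.
        """
        pos = bisect_left(self.keys, (fragment,))
        found = set()
        while pos < len(self.keys) and self.keys[pos][0].startswith(fragment):
            found.add(self.keys[pos][1])
            pos += 1
        return found


index = SuffixIndex()
index.add("database", 1)
index.add("metadata", 2)
```

Indexing "database" writes 8 keys rather than the 36 needed to store every substring, and `index.containing("data")` finds both records with a single scan.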
High-concurrency zones. With Cassandra, CloudKit maintains a secondary “sync” index from the values of the per-zone update-counter to changed records (CloudKit, ). Scanning this index allows CloudKit to perform a sync operation that brings a mobile device up-to-date with the latest changes to a zone. The implementation of CloudKit using the Record Layer relies on FoundationDB’s concurrency control and no longer maintains an update-counter that creates conflicts between otherwise non-conflicting transactions. To implement a sync index, CloudKit leverages the total order on FoundationDB’s commit versions by using a VERSION index, mapping versions to record identifiers. To perform a sync, CloudKit simply scans the VERSION index.
However, commit versions assigned by different FoundationDB clusters are uncorrelated. This introduces a challenge when migrating data from one cluster to another; CloudKit periodically moves users to improve load balance and locality. The sync index must represent the order of updates across all clusters, so updates committed after the move must be sorted after updates committed before the move. CloudKit addresses this with an application-level per-user count of the number of moves, called the incarnation. Initially, the incarnation is 1, and CloudKit increments it each time the user’s data is moved to a different cluster. On every record update, we write the user’s current incarnation to the record’s header; these values are not modified during a move. The VERSION sync index maps (incarnation, version) pairs to changed records, sorting the changes first by incarnation, then by version.
When deploying this implementation, we needed to handle previously stored data with an associated update-counter value but no version. Instead of adding business logic to combine the old and new sync indexes, we used the function key expression (see Section 6.1) to make this migration operationally straightforward, transparent to the application, and free of legacy code. Specifically, the VERSION index maps a function of the incarnation, version, and update counter value to a changed record, where the function is (incarnation, version) if the record was last updated with the new method and (0, update counter value) otherwise. This maintains the order of records written using update counters, and sorts all of them before records written with the new method.
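The ordering produced by this function key expression can be checked with a small sketch. The record layout, field names, and the `sync_key` helper are hypothetical, and plain integers stand in for 12-byte versions.

```python
def sync_key(record):
    """Sort key for the sync index.

    Records written with the new method sort by (incarnation, version);
    legacy records sort by (0, update counter), which places all of them
    before any new-method record because incarnations start at 1.
    """
    if record.get("version") is not None:
        return (record["incarnation"], record["version"])
    return (0, record["update_counter"])


changes = [
    {"name": "A", "update_counter": 7},                 # legacy write
    {"name": "B", "update_counter": 3},                 # legacy write
    {"name": "C", "incarnation": 1, "version": 500},
    {"name": "D", "incarnation": 2, "version": 120},    # written after a move to another cluster
    {"name": "E", "incarnation": 1, "version": 900},
]

sync_order = [r["name"] for r in sorted(changes, key=sync_key)]
```

Record D has a smaller commit version than C and E because it was committed on a different cluster, yet it sorts last because its incarnation is higher.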
8.2. Client Resource Isolation
Today, the Record Layer does not provide the ability to perform in-memory query operations, such as hash joins, grouping, aggregation, or sorts. Operations such as sorting and joining must be assisted by appropriate index definitions. For example, efficient joins between records can be facilitated by defining an index across multiple record types on common field names. While this does impose some additional burden on the application developer, it ensures that the memory requirements to service a given request are strictly fixed to little more than a single record (“row”) accessed by the query. However, this approach may require a potentially unbounded amount of I/O to implement a given query. For this, we leverage the Record Layer’s ability to enforce limits, such as total records or bytes read while servicing a request. When one of these limits has been reached, the current state of the operation is captured and returned to the client in the form of a Record Layer continuation. The client may then re-submit the operation with the continuation to resume the operation. If even these operations become too frequent and burdensome on the system, other CloudKit throttling mechanisms kick in, slowing the rate at which the clients make requests to the server. With these limits, continuations, and throttling, we ensure that all clients make some progress even when the system comes under stress. CloudKit uses the same throttling mechanism when reacting to stress indicators coming from FoundationDB.
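The interplay of read limits and continuations can be sketched as follows. This is a simplified model under stated assumptions: `scan_with_limit` and `run_to_completion` are hypothetical names, the store is a dictionary, and the continuation is simply the last primary key read, whereas a real continuation is an opaque value.

```python
from bisect import bisect_right


def scan_with_limit(store, predicate, max_records_read, continuation=None):
    """Filter records in primary-key order, reading at most max_records_read.

    Returns (results, continuation); the continuation is None once the scan
    has reached the end of the store.
    """
    if max_records_read < 1:
        raise ValueError("max_records_read must be at least 1")
    # A real store would range-read starting at the continuation instead of
    # materializing and sorting all keys.
    keys = sorted(store)
    start = 0 if continuation is None else bisect_right(keys, continuation)
    window = keys[start:start + max_records_read]
    results = [store[k] for k in window if predicate(store[k])]
    exhausted = start + len(window) >= len(keys)
    return results, (None if exhausted else window[-1])


def run_to_completion(store, predicate, max_records_read):
    """Client loop: resubmit with the continuation until the scan completes."""
    results, continuation, requests = [], None, 0
    while True:
        page, continuation = scan_with_limit(
            store, predicate, max_records_read, continuation)
        results.extend(page)
        requests += 1
        if continuation is None:
            return results, requests


store = {i: {"id": i, "even": i % 2 == 0} for i in range(10)}
matches, request_count = run_to_completion(store, lambda r: r["even"], 4)
```

Each request reads at most four records regardless of how many match, so the work per request stays bounded while the client still makes progress across requests.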
9. Related Work
Traditional relational databases offer many features including structured storage, schema management, ACID transactions, user-defined indexes, and SQL queries that make use of these indexes with the help of a query planner and execution engine. These systems typically scale for read workloads but were not designed to efficiently handle transactional workloads on distributed data (Gray1996, ). For example, in shared-nothing database architectures cross-shard transactions and indexes are prohibitively expensive and careful data partitioning, a difficult task for a complex application, is required. This led to research on automatic data partitioning, e.g., (Schism, ; autoPartitioning, ). Shared-disk architectures are much more difficult to scale, primarily due to expensive cache coherence and database page contention protocols (pdsFuture, ).
With the advent of Big Data, as well as to minimize costs, NoSQL datastores (dynamo, ; bigtable, ; pnuts, ; RiakKV, ; mongoDB, ; dynamoDB, ) offer the other end of the spectrum: excellent scalability but minimal semantics, typically providing a key-value API with no schema, indexing, transactions, or queries. As a result, applications needed to re-implement many of the features provided by a relational database. To fill the void, middle-ground “NewSQL” datastores appeared, offering scalability as well as a richer feature-set and semantics (Spanner, ; cockroach, ; voltdb, ; cosmosDB, ; MemSQL, ; FDB, ). FoundationDB (FDB, ) takes a unique approach in the NewSQL space: it is highly scalable and provides ACID transactions, but offers a simple key-value API with no built-in data model, indexing, or queries. This choice allowed FoundationDB to build a powerful, stable and performant storage engine, without attempting to implement a one-size-fits-all solution. It was designed to be the foundation while layers built on top, such as the Record Layer, provide higher-level abstractions.
Multiple systems implement transactions on top of underlying NoSQL stores (Percolator, ; Tephra, ; omid, ; omidReloaded, ; megastore, ; warp, ; CockroachDB, ). The Record Layer makes use of transactions exposed by FoundationDB to implement structured storage, complete with secondary indexes, queries, and other functionality.
The Record Layer has unique first-class support for multi-tenancy. Without such support, most systems have to retrofit it, which is extremely challenging. Salesforce’s architecture (salesforce, ) is similarly motivated by the need to support multi-tenancy within the database. For example, all data and metadata is sharded by application, and query optimization considers statistics collected per application and user. The Record Layer takes multi-tenancy support further through built-in resource tracking and isolation, a completely stateless design facilitating scalability, and its key record store abstraction. For example, CloudKit faces a dual multi-tenancy challenge as it needs to service many applications, each with a very large user-base. Each record store encapsulates all of a user’s data for one application, including indexes and metadata. This choice makes it easy to scale the system to billions of users, by simply adding more database nodes and moving record stores to balance the load and improve locality.
While many storage systems include support for full-text indexing and search, most provide this support using a separate system (CloudKit, ; RiakSolr, ; salesforce, ), such as Solr (solr, ), with eventual-consistency guarantees. In our experience with CloudKit, maintaining a separate system for search is challenging; it has to be separately provisioned, maintained, and made highly-available in concert with the database (e.g., with regard to fail-over decisions). MongoDB includes built-in support for full-text search, but queries (of any kind of index) are not guaranteed to be consistent, i.e., they are not guaranteed to return all matching documents (mongoDBSucks, ).
There is a broad literature on query optimization, starting with the seminal work of Selinger et al. (Selinger1979, ) on System R. Since then, much of the focus has been on efficient search-space exploration. Most notably, Cascades (Graefe95thecascades, ) introduced a clean separation of logical and physical query plans, and proposed operators and transformation rules that are encapsulated as self-contained components. Cascades allows logically equivalent expressions to be grouped in the so-called Memo structure to eliminate redundant work. Recently, Greenplum’s Orca query optimizer (Orca, ) was developed as a modern incarnation of Cascades’ principles. We are currently in the process of developing an optimizer that uses the proven principles of Cascades, paving the way for the development of a full cost-based optimizer (Appendix B).
10. Lessons Learned
The Record Layer’s success at Apple validates the usefulness of FoundationDB’s “layer” concept, where the core distributed storage system provides a scalable, robust, but semantically simple datastore upon which complex abstractions are built. This allows systems architects to pick and choose the parts of the database that they need without working around abstractions that they do not. Building layers, however, remains a complex engineering challenge. To our knowledge, the Record Layer is currently deployed at a larger scale than any other FoundationDB layer. We summarize some lessons learned building and operating the Record Layer, in the hope that they can be useful for both developers of new FoundationDB layers and Record Layer adopters.
10.1. Building FoundationDB layers
Asynchronous processing to hide latency. FoundationDB is optimized for throughput and not individual operation latencies, meaning that effective use requires keeping as much work outstanding as possible. Therefore, the Record Layer does much of its work asynchronously, pipelining it where possible. However, the FoundationDB client is single-threaded, with only a single network thread that talks to the cluster. Earlier versions of the FoundationDB Java binding completed Java Futures in this network thread and the Record Layer used these for its asynchronous work, creating a bottleneck in that thread. By minimizing the amount of work done in the network thread, we were able to get substantially better performance and minimize apparent latency on complex operations by interacting with the key-value store in parallel.
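The benefit of keeping several requests outstanding can be illustrated with Python's `asyncio`. This is an analogy rather than the Java binding: `FakeClient` is a hypothetical stand-in whose `get` simulates network latency with a sleep, and the pipeline size of 4 is arbitrary.

```python
import asyncio


class FakeClient:
    """Hypothetical client that records how many reads are in flight at once."""

    def __init__(self, data, latency=0.01):
        self.data = data
        self.latency = latency
        self.in_flight = 0
        self.max_in_flight = 0

    async def get(self, key):
        self.in_flight += 1
        self.max_in_flight = max(self.max_in_flight, self.in_flight)
        await asyncio.sleep(self.latency)   # simulated round trip
        self.in_flight -= 1
        return self.data.get(key)


async def fetch_sequential(client, keys):
    # Each read waits for the previous one: one request outstanding at a time.
    return [await client.get(k) for k in keys]


async def fetch_pipelined(client, keys, pipeline_size=4):
    # Issue reads concurrently, bounded by a semaphore, and gather the
    # results in the original key order.
    semaphore = asyncio.Semaphore(pipeline_size)

    async def bounded_get(key):
        async with semaphore:
            return await client.get(key)

    return await asyncio.gather(*(bounded_get(k) for k in keys))


data = {i: i * i for i in range(8)}
sequential_client = FakeClient(data)
pipelined_client = FakeClient(data)
sequential_result = asyncio.run(fetch_sequential(sequential_client, range(8)))
pipelined_result = asyncio.run(fetch_pipelined(pipelined_client, range(8)))
```

Both variants return the same values, but the pipelined one overlaps up to four simulated round trips, so its total latency is roughly a quarter of the sequential one in this model.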
Conflict ranges. In FoundationDB, a transaction conflict occurs when some keys read by one transaction were concurrently modified by another. The FoundationDB API gives full control over these potentially overlapping read- and write-conflict sets. In particular, it allows for manually adding read conflicts. One pattern is then to do a non-conflicting (snapshot) read of a range that potentially contains distinguished keys and to add individual conflicts for only these keys, not for the unrelated keys found in the same range. This way, the transaction depends only on what would invalidate its results. This is done, for instance, in navigating the skip list used for rank / select indexing, described in Appendix A. Bugs due to incorrect manual conflict ranges are naturally hard to find and made even harder to find when mixed in with business logic. For that reason, it is important to define Record Layer abstractions, such as indexes, for such patterns, rather than relying on individual client applications to relax isolation requirements.
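The snapshot-read pattern can be demonstrated with a toy model of optimistic concurrency. The `Store` and `Transaction` classes below are hypothetical and greatly simplified; the method name `add_read_conflict_key` is modelled on the FoundationDB Python binding, but nothing here calls the real client.

```python
class ConflictError(Exception):
    pass


class Store:
    def __init__(self, data):
        self.data = dict(data)
        self.last_write = {}   # key -> commit version of the last write
        self.version = 0

    def write(self, key, value):
        """Commit a single-key write on behalf of some other transaction."""
        self.version += 1
        self.data[key] = value
        self.last_write[key] = self.version


class Transaction:
    def __init__(self, store):
        self.store = store
        self.read_version = store.version
        self.read_conflicts = set()

    def get_range(self, keys, snapshot=False):
        if not snapshot:
            # An ordinary read conflicts on every key it touches.
            self.read_conflicts.update(keys)
        return {k: self.store.data[k] for k in keys}

    def add_read_conflict_key(self, key):
        self.read_conflicts.add(key)

    def commit(self):
        for key in self.read_conflicts:
            if self.store.last_write.get(key, 0) > self.read_version:
                raise ConflictError(key)


def attempt(snapshot, concurrently_written_key):
    """Return True if the transaction commits despite the concurrent write."""
    store = Store({"a": 1, "b": 2, "c": 3})
    txn = Transaction(store)
    txn.get_range(["a", "b", "c"], snapshot=snapshot)
    if snapshot:
        txn.add_read_conflict_key("b")   # the only key the result depends on
    store.write(concurrently_written_key, 99)
    try:
        txn.commit()
        return True
    except ConflictError:
        return False
```

With an ordinary range read, a concurrent write to the unrelated key "c" aborts the transaction; with a snapshot read plus a manual conflict on "b", only a write to "b" does.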
10.2. Using the Record Layer in practice
Metadata change safety. The Protocol Buffer compiler generates methods for manipulating, parsing and writing messages, as well as static descriptor objects containing information about the message type, such as its fields and their types. These descriptors could potentially be used to build Record Layer metadata in code. We do not recommend this approach over explicitly persisting the metadata in a metadata store, except for simple tests. One reason is that it is hard to atomically update the metadata code used by multiple Record Layer instances. For example, if one Record Layer instance runs a newer version of the code (with a newer descriptor), writes records to a record store, then an instance running the old version of the code attempts to read it, an authoritative metadata store (or communication between instances) is needed to interpret the data. This method also makes it harder to check that the schema evolution constraints (Section 5) are preserved. We currently use descriptor objects to generate new metadata to be stored in the metadata store.
Relational similarities. The Record Layer resembles a relational database but has slightly different semantics, which can surprise clients. For example, there is a single extent for all record types because CloudKit has untyped foreign-key references without a “table” association. By default, selecting all records of a particular type requires a full scan that skips over records of other types or maintaining secondary indexes. For clients who do not need this shared extent, we now support emulating separate extents for each record type by adding a type-specific prefix to the primary key.
10.3. Designing for multi-tenancy
Multi-tenancy is remarkably difficult to add to an existing system. Hence, the Record Layer was built from the ground up to support massively multi-tenant use cases. We have gained substantial advantages from a natively multi-tenant design, including easier shard rebalancing between clusters and the ability to scale elastically. Our experience has led us to conclude that multi-tenancy is more pervasive than one would initially think. Put another way, many applications that do not explicitly host many different applications—as CloudKit does—can reap the benefits of a multi-tenant architecture by partitioning data according to logical “tenants”, such as users, groups of users, different application functions, or some other entity.
11. Future Directions
The current state of the Record Layer stems greatly from the immediate needs of CloudKit: that is, the ability to support billions of small databases, each database having few users, and all within a carefully controlled set of resources. As databases grow, both in terms of data volume and in terms of number of concurrent users, the Record Layer may need to adapt and layers will be developed on top expanding its functionality to support these new workloads and more complex query capabilities. It is our goal, however, that the layer will always retain its ability to support lightweight and efficient deployments. We highlight several future directions:
Avoiding hotspots. As the number of clients simultaneously accessing a given record store increases, checking the store header to confirm that the metadata has not changed may create a hot key if all these requests go to the same storage node. A general way to address hotspots is to replicate data at different points of the keyspace, making it likely that the copies are located on different storage nodes. For the particular case of metadata, where changes are relatively infrequent, we could also alleviate hotspots with caching. However, when the metadata changes, such caches need to be invalidated, or, alternatively, an out-of-date cache needs to be detected or tolerated.
Query operations. Some query operations are possible with less-than-perfect indexing but within the layer’s streaming model, such as a priority queue-based sort-with-small-limit or a limited-size hash join. For certain workloads it may be necessary to fully support intensive in-memory operations with spill-over to persistent storage. Such functionality can be challenging at scale as it requires new forms of resource tracking and management, and must be stateful for the duration of the query.
Materialized views. Normal indexes are a projection of record fields in a different order. COUNT, SUM, MIN, and MAX indexes maintain aggregates compactly and efficiently, avoiding conflicts by using atomic mutations. Adding materialized views, which can synthesize data from multiple records at once, is a natural evolution that would benefit join queries, among others. Adding support for materialized views to the key expressions API might also help the query planner reason about whether an index can be used to satisfy a query.
Higher layers. The Record Layer is close enough to a relational database that it could support a subset or variant of SQL, particularly once the query planner supports joins. A higher level “SQL layer” could be implemented as a separate layer on top of the Record Layer without needing to work around choices made by lower-level layers. Similarly, a higher layer could support OLAP-style analytics workloads.
12. Conclusions
The FoundationDB Record Layer is a record-oriented data store with rich features similar to those of a relational database, including structured schema, indexing, and declarative queries. Because it is built on FoundationDB, it inherits its ACID transactions, reliability, and performance. The core record store abstraction encapsulates a database and makes it easy to operate the Record Layer in a massively multi-tenant environment. The Record Layer offers deep extensibility, allowing clients to seamlessly add features outside of the core library, including custom index types, schema managers, and record serializers. At Apple, we leverage these capabilities to implement CloudKit, which hosts billions of databases. CloudKit uses the Record Layer to offer new features (e.g., transactional full-text indexing), speed up key operations (e.g., with high-concurrency zones), and simplify application development (e.g., with interactive transactions).
In building and operating the Record Layer at scale, we have made three key observations with broader applicability. First, the Record Layer’s success validates FoundationDB’s layered architecture in a large scale system. Second, the Record Layer’s extensible design provides common functionality while easily accommodating the customization needs of a complex system like CloudKit. It is easy to envision other layers that extend the Record Layer to provide richer and higher-level functionality, as CloudKit does. Lastly, we find that organizing applications into logical “tenants”, which might be users, features, or some other entity, is a powerful and practically useful way to structure a system and scale it to meet demand.
-  Apache Solr. http://lucene.apache.org/solr/.
-  gRPC: A high performance, open-source universal RPC framework. https://grpc.io/.
-  Lightweight transactions in Cassandra 2.0. https://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0.
-  Apache Cassandra, 2018.
-  Azure Cosmos DB. https://azure.microsoft.com/en-us/services/cosmos-db/, 2018.
-  Blog: MongoDB queries don’t always return all matching documents! https://blog.meteor.com/mongodb-queries-dont-always-return-all-matching-documents-654b6594a827, 2018.
-  Cockroach Labs. https://www.cockroachlabs.com/, 2018.
-  CockroachDB. https://www.cockroachlabs.com/, 2018.
-  FoundationDB. https://www.foundationdb.org, 2018.
-  FoundationDB on GitHub. https://github.com/apple/foundationdb, 2018.
-  MemSQL. https://www.memsql.com/, 2018.
-  MongoDB. https://www.mongodb.com, 2018.
-  Protocol Buffers. https://developers.google.com/protocol-buffers/, 2018.
-  Protocol Buffers: Specifying Field Rules. https://developers.google.com/protocol-buffers/docs/proto#specifying-field-rules, 2018.
-  Riak: Complex Query Support. http://basho.com/products/riak-kv/complex-query-support, 2018.
-  Riak KV. http://basho.com/products/riak-kv, 2018.
-  Tephra: Transactions for Apache HBase. https://tephra.io, 2018.
-  The Force.com Multitenant Architecture. http://www.developerforce.com/media/ForcedotcomBookLibrary/Force.com_Multitenancy_WP_101508.pdf, 2018.
-  VoltDB. https://www.voltdb.com/, 2018.
-  Announcing the FoundationDB Record Layer. https://www.foundationdb.org/blog/announcing-record-layer/, 2019.
-  FoundationDB Record Layer on GitHub. https://github.com/foundationdb/fdb-record-layer, 2019.
-  J. Baker, C. Bond, J. C. Corbett, J. Furman, A. Khorlin, J. Larson, J.-M. Leon, Y. Li, A. Lloyd, and V. Yushprakh. Megastore: Providing Scalable, Highly Available Storage for Interactive Services. In Proceedings of the Conference on Innovative Data Systems Research (CIDR), pages 223–234, 2011.
-  K. Birman, D. Malkhi, and R. van Renesse. Virtually Synchronous Methodology for Dynamic Service Replication. Technical report, Microsoft Research, 2010.
-  E. Bortnikov, E. Hillel, I. Keidar, I. Kelly, M. Morel, S. Paranjpye, F. Perez-Sorrosal, and O. Shacham. Omid, Reloaded: Scalable and Highly-Available Transaction Processing. In 15th USENIX Conference on File and Storage Technologies, FAST 2017, Santa Clara, CA, USA, February 27 - March 2, 2017, pages 167–180, 2017.
-  F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A Distributed Storage System for Structured Data. In 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 205–218, 2006.
-  G. V. Chockler, I. Keidar, and R. Vitenberg. Group communication specifications: a comprehensive study. ACM Comput. Surv., 33(4):427–469, 2001.
-  G. V. Chockler and D. Malkhi. Active Disk Paxos with infinitely many processes. Distributed Computing, 18(1):73–84, 2005.
-  B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H. Jacobsen, N. Puz, D. Weaver, and R. Yerneni. PNUTS: Yahoo!’s hosted data serving platform. PVLDB, 1(2):1277–1288, 2008.
-  J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, W. Hsieh, S. Kanthak, E. Kogan, H. Li, A. Lloyd, S. Melnik, D. Mwaura, D. Nagle, S. Quinlan, R. Rao, L. Rolig, Y. Saito, M. Szymaniak, C. Taylor, R. Wang, and D. Woodford. Spanner: Google’s Globally Distributed Database. ACM Trans. Comput. Syst., 31(3):8:1–8:22, Aug. 2013.
-  T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Third Edition, chapter 14.1. The MIT Press, 2009.
-  C. Curino, Y. Zhang, E. P. C. Jones, and S. Madden. Schism: a Workload-Driven Approach to Database Replication and Partitioning. PVLDB, 3(1):48–57, 2010.
-  G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon’s highly available key-value store. In Proceedings of the 21st ACM Symposium on Operating Systems Principles 2007, SOSP 2007, Stevenson, Washington, USA, October 14-17, 2007, pages 205–220, 2007.
-  D. DeWitt and J. Gray. Parallel Database Systems: The Future of High Performance Database Systems. Commun. ACM, 35(6), June 1992.
-  R. Escriva, B. Wong, and E. G. Sirer. Warp: Lightweight Multi-Key Transactions for Key-Value Stores. CoRR, abs/1509.07815, 2015.
-  D. G. Ferro, F. Junqueira, I. Kelly, B. Reed, and M. Yabandeh. Omid: Lock-free transactional support for distributed data stores. In IEEE 30th International Conference on Data Engineering, Chicago, ICDE 2014, IL, USA, March 31 - April 4, 2014, pages 676–687, 2014.
-  G. Graefe. The Cascades Framework for Query Optimization. Data Engineering Bulletin, 18, 1995.
-  J. Gray, P. Helland, P. O’Neil, and D. Shasha. The dangers of replication and a solution. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, 1996.
-  P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. ZooKeeper: Wait-free Coordination for Internet-scale Systems. In 2010 USENIX Annual Technical Conference, Boston, MA, USA, June 23-25, 2010, 2010.
-  H. Melville. Moby Dick; or The Whale. http://www.gutenberg.org/files/2701/2701-0.txt.
-  A. Pavlo, C. Curino, and S. Zdonik. Skew-aware Automatic Database Partitioning in Shared-nothing, Parallel OLTP Systems. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD ’12, 2012.
-  D. Peng and F. Dabek. Large-scale Incremental Processing Using Distributed Transactions and Notifications. In 9th USENIX Symposium on Operating Systems Design and Implementation, 2010.
-  P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access Path Selection in a Relational Database Management System. In Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data, SIGMOD ’79, 1979.
-  A. Shraer, A. Aybes, B. Davis, C. Chrysafis, D. Browning, E. Krugler, E. Stone, H. Chandler, J. Farkas, J. Quinn, J. Ruben, M. Ford, M. McMahon, N. Williams, N. Favre-Felix, N. Sharma, O. Herrnstadt, P. Seligman, R. Pisolkar, S. Dugas, S. Gray, S. Lu, S. Harkema, V. Kravtsov, V. Hong, Y. Tian, and W. L. Yih. CloudKit: Structured Storage for Mobile Applications. Proc. VLDB Endow., 11(5):540–552, Jan. 2018.
-  S. Sivasubramanian. Amazon DynamoDB: a seamlessly scalable non-relational database service. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, May 20-24, 2012, pages 729–730, 2012.
-  M. A. Soliman, L. Antova, V. Raghavan, A. El-Helw, Z. Gu, E. Shen, G. C. Caragea, C. Garcia-Alvarado, F. Rahman, M. Petropoulos, F. Waas, S. Narayanan, K. Krikellas, and R. Baldwin. Orca: A Modular Query Optimizer Architecture for Big Data. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD ’14, pages 337–348, New York, NY, USA, 2014. ACM.
Appendix A Rank and Text Index Types
RANK indexes. The RANK index type allows clients to efficiently find records by their ordinal rank (according to some key expression) and conversely to determine the rank of a field’s value. For example, in an application implementing a leaderboard, finding a player’s position in the leaderboard could be implemented by looking up their score’s rank using a RANK index. Another example is an implementation of a scrollbar, where data (e.g., query results) is sorted according to some field and the user can request to skip to the middle of a long page of results, e.g., to the k-th result. One way to implement this could be to linearly scan a VALUE index until we get to the k-th result, and use Record Layer’s cursors to efficiently restart the scan if it is interrupted. An implementation using a RANK index can be much more efficient: it queries for the record that has rank k, then begins scanning from that position in the index.
Our implementation of the RANK index stores each index entry in a durable data structure that we call a ranked set: a probabilistic augmented skip-list (Cormen et al. describe a tree-based variant) persisted in FoundationDB such that each level has a distinct subspace prefix. Duplicate keys are avoided by attempting to read each key before inserting it, in the same transaction. The lowest level of the skip-list includes every index entry, and each higher level contains a sample of entries in the level below it. For each level, each index entry contains the number of entries in the set that are greater than or equal to it and less than the next entry in that level (all entries in the lowest level have the value 1). This is the number of entries skipped by following the skip-list “finger” between one entry and the next. In practice, an explicit finger is not needed: the sort order maintained by FoundationDB achieves the same purpose much more efficiently. Figure 4(a) includes a sample index representing a skip-list with six elements and three levels.
To determine the ordinal rank of an entry, a standard skip-list search is performed, starting from the highest level. Whenever the search uses a finger connecting two nodes on the same level, we accumulate the value of the first node, i.e., the number of nodes being skipped. An example of this computation is shown in Figure 4(b). The final sum represents the rank. Given a rank, a similar algorithm is used to determine the corresponding index entry. A cumulative sum is maintained and a range scan is performed at each level until following a finger would cause the sum to exceed the target rank, at which point the next level is scanned.
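The rank and select searches described above can be illustrated with a small in-memory model. The following Python sketch is not the Record Layer's implementation: the class `RankedSet`, the `HEAD` sentinel, the hash-based level assignment, and the use of sorted lists in place of FoundationDB subspaces are all assumptions made for illustration. Ranks are zero-based here, and the sentinel at the start of each level counts only the real entries that precede the next entry on that level.

```python
import bisect
import zlib

HEAD = ()  # hypothetical sentinel: the empty tuple sorts before every non-empty tuple key


class RankedSet:
    """Toy ranked set; each level stands in for one FoundationDB subspace."""

    def __init__(self, num_levels=3):
        self.keys = [[HEAD] for _ in range(num_levels)]       # sorted keys per level
        self.counts = [{HEAD: 0} for _ in range(num_levels)]  # key -> entries in [key, next)

    def _height(self, key):
        # Assumed sampling rule: hash bits decide how many levels hold the key.
        h = zlib.crc32(repr(key).encode())
        height = 1
        while height < len(self.keys) and (h & 3) == 0:
            h >>= 2
            height += 1
        return height

    def _span(self, level, lo, hi):
        """Sum the counts stored at `level` for keys in [lo, hi); hi=None is unbounded."""
        ks = self.keys[level]
        start = bisect.bisect_left(ks, lo)
        end = len(ks) if hi is None else bisect.bisect_left(ks, hi)
        return sum(self.counts[level][k] for k in ks[start:end])

    def add(self, key):
        if key in self.counts[0]:          # read before insert to avoid duplicates
            return False
        height = self._height(key)
        for level in range(len(self.keys)):  # bottom-up, so lower counts are current
            ks, cs = self.keys[level], self.counts[level]
            i = bisect.bisect_right(ks, key)
            pred = ks[i - 1]
            if level >= height:
                cs[pred] += 1              # key is skipped over at this level
            elif level == 0:
                ks.insert(i, key)
                cs[key] = 1
            else:
                nxt = ks[i] if i < len(ks) else None
                cs[pred] = self._span(level - 1, pred, key)
                cs[key] = self._span(level - 1, key, nxt)
                ks.insert(i, key)
        return True

    def rank(self, key):
        """Number of entries smaller than `key`, or None if the key is absent."""
        if key not in self.counts[0]:
            return None
        total, current = 0, HEAD
        for level in reversed(range(len(self.keys))):
            ks, cs = self.keys[level], self.counts[level]
            i = bisect.bisect_left(ks, current)
            while i + 1 < len(ks) and ks[i + 1] <= key:
                total += cs[ks[i]]         # accumulate the entries being skipped
                i += 1
            current = ks[i]
        return total

    def select(self, rank):
        """Entry with the given zero-based rank."""
        if not 0 <= rank < len(self.keys[0]) - 1:
            raise IndexError(rank)
        total, current = 0, HEAD
        for level in reversed(range(len(self.keys))):
            ks, cs = self.keys[level], self.counts[level]
            i = bisect.bisect_left(ks, current)
            while i + 1 < len(ks) and total + cs[ks[i]] <= rank:
                total += cs[ks[i]]
                i += 1
            current = ks[i]
        return current


leaderboard = RankedSet()
for score in [50, 10, 40, 20, 30, 60]:
    leaderboard.add((score,))
print(leaderboard.rank((40,)), leaderboard.select(2))
```

In this model the "finger" is simply the next position in the sorted list, mirroring how the text relies on FoundationDB's key order rather than explicit pointers.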
TEXT indexes. The TEXT index enables full-text queries on the contents of string fields. This includes simple token matching, token prefix matching, proximity search, and phrase search. The index stores tokens produced by a pluggable tokenizer using a text field as its input. Our inverted index implementation is logically equivalent to an ordered list of maps. Each map represents a postings list: it is associated with a token (t) and the keys in the map are the primary keys (pk) of records containing that token in the indexed text. Each value is a list of offsets in the text field containing that token, expressed as the number of tokens from the beginning of the field. To determine which records contain a given token, a range scan can be performed on the index prefixed by that token to produce a list of primary keys, one for each record that contains that token. One can similarly find all records containing a token prefix. To filter by token proximity or by phrase, the scan can examine the relevant offset lists and filter out any record where the tokens are not within a given distance from each other or do not appear in the correct order.
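As an illustration of the logical structure just described, the following Python sketch models the inverted index as a sorted list of (token, primary key) pairs with offset lists as values. The class `TextIndex`, its method names, and the lowercase whitespace tokenizer are assumptions for this example; the sorted list merely stands in for FoundationDB's key ordering.

```python
import bisect


class TextIndex:
    """Toy inverted index: (token, pk) -> list of token offsets."""

    def __init__(self):
        self.keys = []     # sorted (token, pk) pairs
        self.offsets = {}  # (token, pk) -> offsets of the token in the field

    def add(self, pk, text):
        # Assumed tokenizer: lowercase and split on whitespace.
        for pos, token in enumerate(text.lower().split()):
            key = (token, pk)
            if key not in self.offsets:
                bisect.insort(self.keys, key)
                self.offsets[key] = []
            self.offsets[key].append(pos)

    def scan_token(self, token):
        """Range scan over the keys prefixed by `token`: pk -> offsets."""
        i = bisect.bisect_left(self.keys, (token,))
        postings = {}
        while i < len(self.keys) and self.keys[i][0] == token:
            postings[self.keys[i][1]] = self.offsets[self.keys[i]]
            i += 1
        return postings

    def scan_prefix(self, prefix):
        """Primary keys of records containing any token starting with `prefix`."""
        i = bisect.bisect_left(self.keys, (prefix,))
        pks = set()
        while i < len(self.keys) and self.keys[i][0].startswith(prefix):
            pks.add(self.keys[i][1])
            i += 1
        return pks

    def match_phrase(self, tokens):
        """Records in which the tokens appear consecutively and in order."""
        postings = [self.scan_token(t) for t in tokens]
        hits = []
        for pk, starts in postings[0].items():
            if any(all(s + j in postings[j].get(pk, ())
                       for j in range(1, len(tokens)))
                   for s in starts):
                hits.append(pk)
        return sorted(hits)

    def match_near(self, first, second, max_distance):
        """Records in which the two tokens occur within `max_distance` tokens."""
        a, b = self.scan_token(first), self.scan_token(second)
        return sorted(pk for pk in a.keys() & b.keys()
                      if any(abs(x - y) <= max_distance
                             for x in a[pk] for y in b[pk]))


index = TextIndex()
index.add(1, "Call me Ishmael")
index.add(2, "Some years ago never mind how long")
index.add(3, "call it what you will Ishmael told me")
print(index.match_phrase(["call", "me"]), index.scan_prefix("ish"))
```

Phrase and proximity filtering only inspect offset lists already returned by the token scans, which is the property the paragraph above relies on.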
To store the TEXT index, we use one key for each token/primary key pair, with the offset list in the value, i.e., each entry maps (prefix, token, pk) to a list of offsets.
Note that the prefix is repeated in each key. While this is true for all indexes, the overhead is especially large for TEXT indexes, due to the large number of entries. To address this, we reduce the number of index keys by “bunching” neighboring keys together, so that for a given token, there might be multiple primary keys included in one index entry. With a bunch size of 2, for example, each index entry represents up to two primary keys: the index key carries the token and the first primary key of the bunch, and the value carries the offset lists together with the remaining primary key.
To insert a token t and primary key pk, the index maintainer performs a range scan and finds the biggest key L that is less than or equal to (prefix,t,pk) and the smallest key R that is bigger than (prefix,t,pk). It then places a new entry in the bunch corresponding to L, but only if the insertion does not cause this entry to exceed the maximum bunch size. If it does, the biggest primary key in the bunch is removed (this might be pk and its list of offsets) and the index maintainer makes it a new index key. If the size of R’s bunch is smaller than the maximum bunch size, the bunch is merged with the newly created one. In order to delete the index entry for token t and primary key pk, the index maintainer performs a range scan in descending order from (prefix,t,pk). The first key returned is guaranteed to contain the data for t and pk. If pk is the only key in the bunch, the index entry is simply deleted. Otherwise, the entry is updated such that pk and its list of offsets are removed from the bunch and, if pk appears in the index key, the key is updated to contain the next primary key in the bunch.
Inserting an entry requires reading two FoundationDB key-value pairs and writing at most two, though usually only one, key-value pair. Deleting an entry always requires reading and writing a single FoundationDB key-value pair. This access locality makes index updates use predictable resources and have predictable latencies. However, it is easy to see that there are certain write patterns that can result in many index entries, where bunches are only partially filled. Currently, deletes do not attempt to merge smaller bunches together, although the client can execute bunch compactions.
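The insert and delete procedures for bunched entries can be sketched as follows. This is a simplified in-memory model, not the Record Layer's index maintainer: `BunchedTextIndex` and its methods are hypothetical names, the key prefix is omitted, and the case where no key at or below (t, pk) exists for the token (which the text does not spell out) is handled by assuming a new key is created and then merged with the following bunch if that bunch has room.

```python
import bisect


class BunchedTextIndex:
    """Toy bunched index: (token, first_pk) -> sorted list of (pk, offsets)."""

    def __init__(self, max_bunch=2):
        self.max_bunch = max_bunch
        self.keys = []   # sorted (token, first_pk) index keys
        self.bunch = {}  # index key -> entries of that bunch, sorted by pk

    def _put(self, key, entries):
        if key not in self.bunch:
            bisect.insort(self.keys, key)
        self.bunch[key] = entries

    def _drop(self, key):
        self.keys.remove(key)
        del self.bunch[key]

    def _neighbors(self, token, pk):
        """Biggest key <= (token, pk) and smallest key > (token, pk), same token only."""
        i = bisect.bisect_right(self.keys, (token, pk))
        left = self.keys[i - 1] if i > 0 and self.keys[i - 1][0] == token else None
        right = self.keys[i] if i < len(self.keys) and self.keys[i][0] == token else None
        return left, right

    def insert(self, token, pk, offsets):
        left, right = self._neighbors(token, pk)
        if left is not None:
            entries = [e for e in self.bunch[left] if e[0] != pk] + [(pk, offsets)]
            entries.sort(key=lambda e: e[0])
            if len(entries) <= self.max_bunch:
                self._put(left, entries)
                return
            spilled = entries.pop()        # biggest pk leaves the full bunch
            self._put(left, entries)
            new_entries = [spilled]
        else:
            new_entries = [(pk, offsets)]  # assumption: no left neighbor, start a new key
        if right is not None and len(self.bunch[right]) < self.max_bunch:
            new_entries += self.bunch[right]  # merge with the following bunch
            self._drop(right)
        self._put((token, new_entries[0][0]), new_entries)

    def delete(self, token, pk):
        left, _ = self._neighbors(token, pk)
        if left is None:
            return False
        entries = [e for e in self.bunch[left] if e[0] != pk]
        if len(entries) == len(self.bunch[left]):
            return False                   # pk was not indexed for this token
        self._drop(left)
        if entries:                        # re-key by the first remaining pk
            self._put((token, entries[0][0]), entries)
        return True

    def lookup(self, token):
        """Primary keys of all records containing `token`, in order."""
        i = bisect.bisect_left(self.keys, (token,))
        pks = []
        while i < len(self.keys) and self.keys[i][0] == token:
            pks.extend(pk for pk, _ in self.bunch[self.keys[i]])
            i += 1
        return pks


index = BunchedTextIndex(max_bunch=2)
for pk in [1, 2, 3, 5, 4]:
    index.insert("whale", pk, [0])
index.delete("whale", 3)
print(index.keys, index.lookup("whale"))
```

Running the usage lines leaves three index keys for four primary keys, a small instance of the partially filled bunches mentioned above.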
Numerical Example. To demonstrate the benefit of this optimization, we used Herman Melville’s Moby Dick, broken up by line into 233 roughly equal 5-kilobyte documents of raw text. With whitespace tokenization, each document contains 431.8 unique tokens on average (so the primary key can be represented using 2 bytes) with an average length of 7.8 characters (encoded with 1 byte per character), each appearing an average of 2.1 times within the document. There are approximately 2 bytes of overhead within the key to encode each of the token and primary key. For the calculation, we use 10 bytes as the prefix size (much smaller than the typical size we use in production). The value is encoded with 1 byte per offset plus 1 byte for each list. In total, the index requires 21.8 key bytes and 3 value bytes per token, or 10.7 kilobytes per document. For each byte in the original text, we write 2.14 bytes in the index.
With a bunch size of 20, each key still includes 19.8 bytes for the prefix (10 bytes), token (7.8 bytes) and encoding overhead (2 bytes), as well as up to 20 primary keys and values (representing individual documents containing the token), each of size 2 and 3 bytes, respectively (100 bytes total), i.e., 6 amortized bytes per token per document (1 for the key and 5 for the value). Multiplied by the number of tokens, the index size is 2.6 kilobytes per document, roughly half of the original document size. When actually measured, the index required 4.9 kilobytes per document, almost as much as the document text itself. The reason for the discrepancy is that not every bunch is actually filled. The average bunch size was 4.7, significantly lower than the maximum possible; indeed, some words appear only once in the entirety of Moby Dick and will therefore necessarily be given their own bunch. To optimize further, we are considering bunching across tokens, and implementing prefix compression in FoundationDB. Note that even then, there is per-key overhead in the index as well as in FoundationDB’s internal B-tree data structure, so reducing the number of keys on the Record Layer level is still beneficial.
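The estimates in this example can be reproduced with a few lines of arithmetic. The Python snippet below only restates the figures given in the text (the variable names are ours, and a kilobyte is taken as 1000 bytes); it does not model the measured 4.9 kilobytes, which depends on how full the bunches actually are.

```python
# Figures taken from the text above.
tokens_per_doc = 431.8   # unique tokens per document
doc_bytes = 5000         # roughly 5 kilobytes of raw text per document
prefix_bytes = 10
token_bytes = 7.8
overhead_bytes = 2       # encoding overhead within the key
pk_bytes = 2
value_bytes = 3          # about 2.1 offsets at 1 byte each, plus 1 byte per list

# Without bunching: one key and one value per token per document.
key_bytes = prefix_bytes + token_bytes + overhead_bytes + pk_bytes
unbunched_per_doc = (key_bytes + value_bytes) * tokens_per_doc
print(round(key_bytes, 1))                      # key bytes per token
print(round(unbunched_per_doc / 1000, 1))       # kilobytes of index per document
print(round(unbunched_per_doc / doc_bytes, 2))  # index bytes per byte of text

# With full bunches of 20: prefix, token, and overhead are shared by 20 entries.
bunch_size = 20
shared_bytes = prefix_bytes + token_bytes + overhead_bytes
amortized = shared_bytes / bunch_size + pk_bytes + value_bytes
bunched_per_doc = amortized * tokens_per_doc
print(round(amortized, 2))                      # amortized bytes per token per document
print(round(bunched_per_doc / 1000, 1))         # kilobytes of index per document
```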
Appendix B Query planning and API
Often, clients of the Record Layer want to search, sort, and filter the records in a record store in order to retrieve specific information. Instead of forcing clients to manually inspect indexes, the Record Layer has extensive facilities for executing declarative queries on a record store. While the planning and execution of declarative queries has been studied for decades, the Record Layer makes certain unusual design decisions in order to meet its goals. In this section, we present the query interface exposed by the Record Layer, the architecture of its extensible query planner, and the reasoning behind key design decisions.
Extensible query API. The Record Layer has a fluent, declarative Java API for querying the database. This query API can be used to specify the types of records that should be retrieved, Boolean predicates that the retrieved records must match, and a sort order specified by a key expression (see Section 6.1). Both the filter and sort expressions can include “special functions” on the record set, including aggregates, cardinal rank, and several full-text search operations such as n-gram and phrase search. This query language is akin to an abstract syntax tree for a SQL-like text-based query language exposed as a first-class citizen, allowing consumers to directly interact with it in Java. Another layer on top of the Record Layer could provide translation from SQL or a related query language.
Query plans. While declarative queries are convenient for clients, they need to be transformed into concrete operations on the record store, such as index scans, union operations, and filters in order to execute the queries efficiently. For example, the query plan in Figure 5 implements the query as a union of two index scans that produce streams of records in the appropriate order. The Record Layer’s query planner is responsible for converting a declarative query—which specifies what records are to be returned but not how they are to be retrieved—into an efficient combination of operations that map directly to manipulations of the stream of records.
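To make the idea of a union of ordered index scans concrete, here is a minimal Python sketch. It is not the plan from Figure 5: the sample records, the two indexes, and the functions `scan_equals` and `ordered_union` are assumptions for illustration. Each equality scan over a (value, pk) index yields primary keys in ascending order, so the union can merge the streams and drop duplicates without buffering or re-sorting.

```python
import bisect
import heapq

# Hypothetical records: primary key -> fields.
records = {
    1: {"color": "red", "size": "S"},
    2: {"color": "blue", "size": "L"},
    3: {"color": "green", "size": "M"},
    4: {"color": "red", "size": "L"},
    5: {"color": "blue", "size": "S"},
}


def build_index(field):
    """Sorted (value, pk) entries, standing in for a VALUE index on `field`."""
    return sorted((rec[field], pk) for pk, rec in records.items())


def scan_equals(index, value):
    """Index scan over the entries with the given value; yields pks in order."""
    i = bisect.bisect_left(index, (value,))
    while i < len(index) and index[i][0] == value:
        yield index[i][1]
        i += 1


def ordered_union(*streams):
    """Merge pk-ordered streams, emitting each primary key once."""
    last = object()
    for pk in heapq.merge(*streams):
        if pk != last:
            yield pk
            last = pk


by_color = build_index("color")
by_size = build_index("size")

# Plan for: color == "red" OR size == "L", ordered by primary key.
plan = ordered_union(scan_equals(by_color, "red"), scan_equals(by_size, "L"))
result = list(plan)
print(result)
```

Because every operator is a generator, the plan processes records as a stream, matching the description of plans as manipulations of a stream of records.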
The Record Layer exposes these query plans through the planner’s API, allowing its clients to cache or otherwise manipulate query plans directly. This provides functionality similar to that of a SQL PREPARE statement, but with the additional benefit of allowing the client to modify the plan if necessary. In the same fashion as SQL PREPARE statements, Record Layer queries (and thus, query plans) may have bound static arguments (SARGS). In CloudKit we have leveraged this functionality to implement certain CloudKit-specific planning behavior by combining multiple plans produced by the Record Layer and binding the output of an “outer” plan as input values to an “inner” plan.
We are currently evolving the Record Layer’s planner from an ad-hoc architecture to a Cascades-style rule-based planner. Our new planner design supports deep extensibility, including custom planning logic defined completely outside the Record Layer by clients, and has an architecture that provides a path to cost-based optimization.
Cascades-style planner. The new planner architecture uses the top-down optimization framework presented in Cascades (Graefe, 1995). Internally, we maintain a tree-structured intermediate representation called an expression of partially-planned queries that includes both logical operations (such as a sort order needed to perform an intersection of two indexes) and physical operations (such as scans of indexes, unions of streams, and filters). We implement the planner’s functionality through a rich set of planner rules, which match to particular structures in the expression tree, optionally inspect their properties, and then produce equivalent expressions. An example of a simple rule that converts a logical filter into a scan over the appropriate index range is shown in Figure 5.
Rules are automatically selected by the planner, but can be organized into “phases” based on their utility; for example, we prefer to scan a relevant index rather than scan all records and filter them after deserialization. They are also meant to be modular; several planner behaviors are implemented by multiple rules acting in concert. While this modular architecture makes the code base easier to understand, its primary benefit is allowing more complicated planning behavior by mixing and matching the available rules.
The rule-based architecture of the planner allows clients, who may have defined custom query functions and indexes, to plug in rules for implementing that custom functionality. For example, a client could implement an index that supports geospatial queries and extend the query API with custom functions for issuing bounding box queries. By writing rules that can turn geospatial queries into operations on a geospatial index, the client could have the existing planner plan those queries while making use of all of the existing rules.
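The following Python sketch illustrates the general shape of rule-based rewriting with a client-supplied rule. It is a deliberately simplified stand-in, not the Record Layer planner: the expression classes, the rule classes, and the `rewrite` driver are hypothetical, and the driver applies rules bottom-up to a single expression rather than performing Cascades-style top-down search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FullScan:
    """Scan every record in the store."""


@dataclass(frozen=True)
class Filter:
    field: str
    op: str        # "=" or a client-defined function such as "within_box"
    value: object
    child: object


@dataclass(frozen=True)
class IndexScan:
    index: str
    value: object


@dataclass(frozen=True)
class GeoScan:     # physical operator contributed by a client
    index: str
    box: tuple


class FilterToIndexScan:
    """Core rule: an equality filter over a full scan becomes an index scan."""

    def __init__(self, indexes):
        self.indexes = indexes  # field name -> index name

    def apply(self, expr):
        if (isinstance(expr, Filter) and expr.op == "="
                and isinstance(expr.child, FullScan)
                and expr.field in self.indexes):
            return IndexScan(self.indexes[expr.field], expr.value)
        return None  # rule does not match


class WithinBoxToGeoScan:
    """Client rule: a bounding-box filter becomes a scan of a geospatial index."""

    def __init__(self, geo_indexes):
        self.geo_indexes = geo_indexes

    def apply(self, expr):
        if (isinstance(expr, Filter) and expr.op == "within_box"
                and isinstance(expr.child, FullScan)
                and expr.field in self.geo_indexes):
            return GeoScan(self.geo_indexes[expr.field], expr.value)
        return None


def rewrite(expr, rules):
    """Rewrite children first, then apply rules until none of them matches."""
    if isinstance(expr, Filter):
        expr = Filter(expr.field, expr.op, expr.value, rewrite(expr.child, rules))
    changed = True
    while changed:
        changed = False
        for rule in rules:
            replacement = rule.apply(expr)
            if replacement is not None:
                expr, changed = replacement, True
    return expr


rules = [FilterToIndexScan({"color": "by_color"}),
         WithinBoxToGeoScan({"location": "by_location"})]
query = Filter("size", "=", "L", Filter("color", "=", "red", FullScan()))
print(rewrite(query, rules))
```

The client rule is added to the same rule list as the core rule, so the driver needs no changes to plan the custom function; the unindexed `size` predicate remains as a residual filter above the index scan.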
Future directions. In designing the intermediate representation and rule execution system in the Record Layer’s experimental planner, we have tried to anticipate future needs. For example, the data structure used by the “expression” intermediate representation currently stores only a single expression at any time: in effect, this planner currently operates by repeatedly rewriting the intermediate representation each time a rule is applied. However, it is designed to be seamlessly replaced by the compact Memo data structure, which allows the planner to succinctly represent a huge space of possible expressions. At its core, the Memo structure replaces each node in the expression tree with a group of logically equivalent expressions. Each group can then be treated as an optimization target, with each expression in the group representing possible implementations of that group’s logical operation. With this data structure, optimization work for a small part of the query can be shared (or memoized, as the name suggests) across any possible expressions. Adding the Memo structure paves the way to a cost-based optimizer, which uses estimates of the cost of different possibilities to choose from several possible plans.
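A minimal Python sketch of the Memo idea follows. The `Expr` and `Memo` classes, the operator names, and the cost numbers are all hypothetical; the sketch only shows how groups of equivalent expressions compactly encode many plans and how the best choice for a group is computed once and reused.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Expr:
    op: str               # operator name, for display only
    cost: float           # assumed local cost of this operator
    children: tuple = ()  # ids of the child groups


class Memo:
    def __init__(self):
        self.groups = []  # group id -> list of logically equivalent expressions
        self._best = {}   # group id -> (total cost, chosen expression)

    def add_group(self, *exprs):
        self.groups.append(list(exprs))
        return len(self.groups) - 1

    def count_plans(self, gid):
        """Number of distinct plans represented by the group."""
        return sum(prod(self.count_plans(c) for c in e.children)
                   for e in self.groups[gid])

    def best(self, gid):
        """Cheapest implementation of the group, memoized per group."""
        if gid not in self._best:
            self._best[gid] = min(
                ((e.cost + sum(self.best(c)[0] for c in e.children), e)
                 for e in self.groups[gid]),
                key=lambda pair: pair[0])
        return self._best[gid]


memo = Memo()
left = memo.add_group(Expr("IndexScan(by_color)", 10),
                      Expr("Filter(FullScan)", 100))
right = memo.add_group(Expr("IndexScan(by_size)", 12),
                       Expr("Filter(FullScan)", 100))
root = memo.add_group(Expr("Union", 1, (left, right)),
                      Expr("Filter(FullScan)", 150))

total_cost, chosen = memo.best(root)
print(memo.count_plans(root), total_cost, chosen.op)
```

Here three small groups stand for five complete plans, and the best choice for `left` or `right` would be reused by any other expression that refers to the same group.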