Megaphone: Latency-conscious state migration
We design and implement Megaphone, a data migration mechanism for stateful distributed dataflow engines with latency objectives. When compared to existing migration mechanisms, Megaphone has the following differentiating characteristics: (i) migrations can be subdivided to a configurable granularity to avoid latency spikes, and (ii) migrations can be prepared ahead of time to avoid runtime coordination. Megaphone is implemented as a library on an unmodified timely dataflow implementation, and provides an operator interface compatible with its existing APIs. We evaluate Megaphone on established benchmarks with varying amounts of state and observe that compared to naïve approaches Megaphone reduces service latencies during reconfiguration by orders of magnitude without significantly increasing steady-state overhead.
To satisfy latency and availability requirements, modern stream processors support consistent online reconfiguration, in which they update parts of a dataflow computation without disrupting its execution or affecting its correctness. Such reconfiguration is required during rescaling to handle increased input rates or reduce operational costs [13, 14], to provide performance isolation across different dataflows by dynamically scheduling queries to available workers, to allow code updates to fix bugs or improve business logic [5, 7], and to enable runtime optimizations like execution plan switching and straggler and skew mitigation.
Streaming dataflow operators are often stateful and partitioned across workers by key, and their reconfiguration requires state migration: intermediate results and data structures must be moved from one set of workers to another, often across a network. Existing state migration mechanisms for stream processors either pause and resume parts of the dataflow (as in Flink, Dhalion, and SEEP) or launch new dataflows alongside the old configuration (as, for example, in ChronoStream and Gloss). In both cases state moves “all-at-once”, with either high latency or high resource usage during the migration.
State migration has been extensively studied for distributed databases [6, 9, 10, 11]. Notably, Squall uses transactions to multiplex fine-grained state migration with data processing. These techniques are appealing in spirit, but use mechanisms (transactions, locking) not available in high-throughput stream processors and are not directly applicable without significant performance degradation.
In this paper we present Megaphone, a technique for fine-grained migration in a stream processor which delivers maximum latencies orders of magnitude lower than existing techniques, based on the observation that a stream processor’s structured computation and logical timestamps allow the system to plan fine-grained migrations. Megaphone allows migrations to be specified on a key-by-key basis, and optimizes them by batching at varying granularities; as Figure 1 shows, the improvement over all-at-once migration can be dramatic.
Our main contribution is fluid migration for stateful streaming dataflows: a state migration technique that enables consistent online reconfiguration of streaming dataflows and smoothens latency spikes without using additional resources (Section 3) by employing fine-grained planning and coordination through logical timestamps. Additionally, we design and implement an API for reconfigurable stateful timely dataflow operators that enables fluid migration to be controlled simply through additional dataflow streams rather than through changes to the dataflow runtime itself (Section 4). Finally, we show that Megaphone has negligible steady-state overhead and enables fast direct state movement using the NEXMark benchmarks suite and microbenchmarks (Section 5).
Megaphone is built on timely dataflow (https://github.com/frankmcsherry/timely-dataflow) and is implemented purely in library code, requiring no modifications to the underlying system. We first review existing state migration techniques in streaming systems, which either cause performance degradation or require resource overprovisioning. We also review live migration in DBMSs and identify the technical challenges to implement similar methods in distributed stream processors (Section 2).
This paper is an extended version of a preliminary workshop publication. In the present paper we describe a more general mechanism, further detail its implementation, and evaluate it more thoroughly on realistic workloads.
A distributed dataflow computation runs as a physical execution plan which maps operators to provisioned compute resources (or workers). The execution plan is a directed graph whose vertices are operator instances (each on a specific worker) and edges are data channels (within and across workers). Operators can be stateless (e.g., filter, map) or stateful (e.g., windows, rolling aggregates). State is commonly partitioned by key across operator instances so that computations can be executed in a data-parallel manner. At each point in time of a computation, each worker (with its associated operator instances) is responsible for a set of keys and their associated state.
State migration is the process of changing the assignment of keys to workers and redistributing respective state accordingly. A good state migration technique should be non-disruptive (minimal increase in response latency during migration), short-lived (migration completes within a short period of time), and resource-efficient (minimal additional resources required during the migration).
We present an overview of existing state migration strategies in distributed streaming systems and identify their limitations. We then review live state migration methods adopted by database systems and provide an intuition into Megaphone’s approach to bring such migration techniques to streaming dataflows.
Distributed stream processors, including research prototypes and production-ready systems, use one of the following three state migration strategies.
A straightforward way to realize state migration is to temporarily stop program execution, safely transfer state while no computation is being performed, and restart the job once state redistribution is complete. This approach is most commonly enabled by leveraging existing fault-tolerance mechanisms in the system, such as global state checkpoints. It is adopted by Spark Streaming, Structured Streaming, and Apache Flink.
In many reconfiguration scenarios only one or a small number of operators need to migrate state, and halting the entire dataflow is usually unnecessary. An optimization first introduced in Flux and later adopted in variations by Seep, IBM Streams, StreamCloud, Chi, and FUGU pauses the computation only for the affected dataflow subgraph. Operators not participating in the migration continue without interruption. This approach can use fault-tolerance checkpoints for state migration as in [13, 19], or state can be directly migrated between operators as in [15, 16].
To minimize performance penalties, some systems replicate the whole dataflow or subgraphs of it and execute the old and new configurations in parallel until migration is complete. ChronoStream concurrently executes two or more computation slices and can migrate an arbitrary set of keys between instances of a single dataflow operator. Gloss follows a similar approach and gathers operator state during a migration in a centralized controller using an asynchronous protocol.
Current systems fall short of implementing state migration in a non-disruptive and cost-efficient manner. Existing stream processors migrate state all-at-once, but differ in whether they pause the existing computation or start a concurrent computation. As Figure 1 shows, strategies that pause the computation can cause high latency spikes, especially when the state to be moved is large. On the other hand, dataflow replication techniques reduce the interruption, but at the cost of high resource requirements and required support for input duplication and output de-duplication. For example, for ChronoStream to move from a configuration with n operator instances to a new one with n′ instances, n + n′ instances are required during the migration.
Database systems have implemented optimizations that explicitly target limitations we have identified in the previous section, namely unavailability and resource requirements. Even though streaming dataflow systems differ significantly from databases in terms of data organization, workload characteristics, latency requirements, and runtime execution, the fundamental challenges of state migration are common in both setups.
Albatross adopts VM live migration techniques, later refined with a dynamic throttling mechanism that adapts the data transfer rate during migration so that tenants on the source node can always meet their SLOs. Prorea combines push-based migration of hot pages with pull-based migration of cold pages. Zephyr proposes a technique for live migration in shared-nothing transactional databases which introduces no system downtime and does not disrupt service for non-migrating tenants.
The most sophisticated approach is Squall, which interleaves state migration with transaction processing by (in part) using transaction mechanisms to effect the migration. Squall introduces a number of interesting optimizations, such as pre-fetching and splitting reconfigurations to avoid contention on a single partition. In the course of a migration, if migrating records are needed for processing but not yet available, Squall introduces a transaction to acquire the records (completing their migration). This introduces latency along the critical path, and the transaction locking mechanisms can impede throughput, but the system is neither paused nor replicated. To the best of our knowledge, no stream processor implements such a fine-grained migration technique.
Applying existing fine-grained live migration techniques to a streaming engine is non-trivial. While systems like Squall target OLTP workloads with short-lived transactions, streaming jobs are long-running. In such a setting, Squall’s approach to acquire a global lock during initialization is not a viable solution. Further, many of Squall’s remedies are reactive rather than proactive (because it must support general transactions whose data needs are hard to anticipate), which can introduce significant latency on the critical path.
The core idea behind Megaphone’s migration mechanism is to multiplex fine-grained state migration with actual data processing, coordinated using logical timestamps common in stream processors. This is a proactive approach to migration that relies on the prescribed structure of streaming computations, and the ability of stream processors to coordinate with high frequency using logical timestamps. Such systems, including Megaphone, avoid the need for system-wide locks by pre-planning the rendezvous of data at specific workers.
Megaphone’s features rely on some interesting dataflow concepts first introduced in Naiad. We briefly review these concepts here for completeness, as they are necessary to understand the following sections.
A streaming computation in Naiad is expressed as a timely dataflow: a directed (possibly cyclic) graph where nodes represent stateful operators and edges represent data streams between operators. Each data record in timely dataflow bears a logical timestamp, and operators maintain or possibly advance the timestamps of the records they process. Example timestamps include integers representing milliseconds or transaction identifiers, but in general timestamps can be any set of opaque values for which a partial order is defined. The timely dataflow system tracks the existence of timestamps, and reports when timestamps no longer exist in the dataflow, which indicates the forward progress of a streaming computation.
A timely dataflow is executed by multiple workers (threads) belonging to one or more OS processes, which may reside in one or more machines of a networked cluster. Workers communicate with each other by exchanging messages over data channels (shared-nothing paradigm) as shown in Figure 2. Each worker has a local copy of the entire timely dataflow graph and executes all operators in this graph on (disjoint) partitions of the dataflow’s input data. Each worker repeatedly executes dataflow operators concurrently with the other workers, sending and receiving data across data exchange channels. Due to this asynchronous execution model, the presence of concurrent “in-flight” timestamps is the rule rather than the exception.
As timely workers execute, they communicate the numbers of logical timestamps they produce and consume to all other workers. This information allows each worker to determine the possibility that any dataflow operator may yet see any given timestamp in its input. The timely workers present this information to operators in the form of a frontier:
A frontier F is a set of logical timestamps such that
no element of F is strictly greater than another element of F, and
all timestamps on messages that may still be received are greater than or equal to some element of F.
In many simple settings a frontier is analogous to a low watermark in streaming systems like Flink, which indicates the single smallest timestamp that may still be received. In timely dataflow a frontier must be set-valued rather than a single timestamp because timestamps may be only partially ordered.
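As a concrete illustration of set-valued frontiers, the following sketch (our own, not timely dataflow's actual API) models timestamps as pairs under the product partial order and tests whether a timestamp may still appear given a frontier:

```rust
// A minimal sketch of set-valued frontiers over partially ordered timestamps.
// Timestamps here are hypothetical (epoch, iteration) pairs; the names are
// ours and do not reflect timely dataflow's internal representation.
type Ts = (u64, u64);

// t1 <= t2 in the product partial order: both coordinates must not decrease.
fn less_equal(t1: Ts, t2: Ts) -> bool {
    t1.0 <= t2.0 && t1.1 <= t2.1
}

// A frontier is a set of mutually incomparable timestamps. A timestamp may
// still appear on a message iff it is greater than or equal to some element.
fn beyond_frontier(t: Ts, frontier: &[Ts]) -> bool {
    frontier.iter().any(|&f| less_equal(f, t))
}

fn main() {
    // (0, 3) and (1, 0) are incomparable, so both sit in the frontier;
    // a single "low watermark" timestamp could not represent this state.
    let frontier = [(0, 3), (1, 0)];
    assert!(beyond_frontier((0, 5), &frontier)); // may still arrive
    assert!(beyond_frontier((2, 0), &frontier)); // may still arrive
    assert!(!beyond_frontier((0, 2), &frontier)); // complete: safe to act on
}
```

A totally ordered timestamp domain reduces the frontier to a single element, recovering the familiar low-watermark behavior.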
Operators in timely dataflow may retain capabilities that allow the operator to produce output records with a given timestamp. All received messages come bearing such a capability for their timestamp. Each operator can choose to drop capabilities, or downgrade them to later timestamps. The timely dataflow system tracks capabilities held by operators, and only advances downstream frontiers as these capabilities advance.
Timely dataflow frontiers are the main mechanism for coordination between otherwise asynchronous workers. The frontiers indicate when we can be certain that all messages of a certain timestamp have been received, and it is now safe to take any action that needed to await their arrival. Importantly, frontier information is entirely passive and does not interrupt the system execution; it is up to operators to observe the frontier and determine if there is some work that cannot yet be performed. This enables very fine-grained coordination, without system-level intervention. Further technical details of progress tracking in timely dataflows can be found in [20, 4].
We will use timely dataflow frontiers to separate migrations into independent arbitrarily fine-grained timestamps and logically coordinate data movement without using coarse-grained pause-and-resume for parts of the dataflow.
To frame the mechanism we introduce for live migration in streaming dataflows, we first lay out some formal properties that define correct and live migration. In the interest of clarity we keep the descriptions informal, but each can be formalized.
We consider stateful dataflow operators that are data-parallel and functional. Specifically, an operator acts on input data structured as (key, val) pairs, each bearing a logical timestamp. The input is partitioned by its key and the operator acts independently on each input partition by sequentially applying each val to its state in timestamp order. For each key, for each val in timestamp order, the operator may change its per-key state arbitrarily, produce arbitrary outputs as a result, and it may schedule further per-key changes at future timestamps (in effect sending itself a new, post-dated val for this key).
The output triple consists of the new state, the outputs to produce, and the future changes that should be presented back to the operator.
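The per-key operator model above can be sketched as a function type; the names and the running-sum example are ours, chosen only to illustrate the (state, outputs, future changes) triple:

```rust
// A sketch of the abstract operator model: applying one value at one time
// to per-key state yields new state, outputs, and post-dated future values.
// These names are illustrative, not Megaphone's API.
struct Update<S, V> {
    state: S,              // the new per-key state
    outputs: Vec<V>,       // outputs produced at this time
    future: Vec<(u64, V)>, // post-dated (time, val) records for this key
}

// Example instance: a running sum that also schedules an "echo" of each
// value one time unit later, exercising the post-dated record mechanism.
fn running_sum(state: u64, time: u64, val: u64) -> Update<u64, u64> {
    Update {
        state: state + val,
        outputs: vec![state + val],
        future: vec![(time + 1, val)],
    }
}

fn main() {
    let u = running_sum(10, 5, 7);
    assert_eq!(u.state, 17);
    assert_eq!(u.outputs, vec![17]);
    assert_eq!(u.future, vec![(6, 7)]);
}
```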
For a specific operator, we can describe the correctness of an implementation. We introduce the relation in advance of as follows.
A timestamp t is in advance of
a timestamp t′ if t is greater than or equal to t′;
a frontier F if t is greater than or equal to an element of F.
The correct outputs through time t are the timestamped outputs that result for each key from the timestamp-ordered application of input and post-dated records bearing timestamps not in advance of t.
For each migrateable operator, we also consider a configuration function, which for each timestamp assigns each key to a specific worker.
With a specific configuration, we can describe the correctness of a migrating implementation.
A computation is migrated according to configuration if all updates to key with timestamp time are performed at worker configuration(time, key).
A configuration function can be represented in many ways, which we will discuss further. In our context we will communicate any changes using a timely dataflow stream, in which configuration changes bear the logical timestamp of their migration. This choice allows us to use timely dataflow’s frontier mechanisms to coordinate migrations, and to characterize liveness.
A migrating computation is completing if, once the frontiers of both the data input stream and configuration update stream reach F, then (with no further requirements of the input) the output frontier of the computation will eventually reach F.
Our goal is to produce a mechanism that satisfies each of these three properties: Correctness, Migration, and Completion.
State migration is driven by updates to the configuration function introduced in Section 3.2. In Megaphone these updates are supplied as data along a timely dataflow stream, each bearing the logical timestamp at which it should take effect. Informally, configuration updates are triples of the form (time, key, worker),
indicating that as of time the state and values associated with key will be located at worker, and that this will hold until a new update to key is observed with a greater timestamp.
As configuration updates are simply data, the user has the ability to drive a migration process by introducing updates as they see fit. In particular, they have the flexibility to break down a large migration into a sequence of smaller migrations, each of which has lower duration and between which the system can process data records. For example, to migrate from one configuration A to another B, a user can use different migration strategies to reveal the changes from A to B:
To simultaneously migrate all keys from A to B, a user could supply all changed (time, key, worker) triples with one common time.
To smoothly migrate keys from A to B, a user could repeatedly choose one key that changed from A to B, introduce the new (time, key, worker) triple with the current time, and await the migration’s completion before choosing the next key.
To trade off low latency against high throughput, a user can produce batches of changed (time, key, worker) triples with a common time, awaiting the completion of each batch before introducing the next batch of changes.
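The batched strategy can be sketched as follows; the helper and its signature are our own invention, not Megaphone's API. It diffs an old and a new configuration and assigns each batch of changed keys a distinct, later logical time:

```rust
// Sketch: produce (time, key, worker) triples for a batched migration.
// `old[key]` and `new[key]` give each key's worker in the two configurations;
// each batch of `batch` changed keys is stamped with its own logical time.
fn migration_batches(
    old: &[usize],
    new: &[usize],
    start: u64,   // timestamp of the first batch
    batch: usize, // keys migrated per batch
) -> Vec<(u64, usize, usize)> {
    let changed: Vec<usize> = (0..old.len()).filter(|&k| old[k] != new[k]).collect();
    changed
        .chunks(batch)
        .enumerate()
        .flat_map(|(i, keys)| {
            let time = start + i as u64; // one logical time per batch
            keys.iter().map(move |&k| (time, k, new[k])).collect::<Vec<_>>()
        })
        .collect()
}

fn main() {
    // Keys 0..4 move from worker 0 to worker 1, two keys per batch.
    let triples = migration_batches(&[0, 0, 0, 0], &[1, 1, 1, 1], 100, 2);
    assert_eq!(triples, vec![(100, 0, 1), (100, 1, 1), (101, 2, 1), (101, 3, 1)]);
}
```

A driver following this strategy would introduce each batch only after observing (via the output frontier) that the previous batch's migration completed.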
We believe that this approach to reconfiguration, as user-supplied data, opens a substantial design space. Not only can users perform fine-grained migration, they can prepare future migrations at specific times, and drive migrations based on timely dataflow computations applied to system measurements. Most users will certainly need assistance in performing effective migration, and we will evaluate several specific instances of the above strategies.
We now describe how to create a migrateable version of a deterministic, data-parallel operator O as described in Section 3.2. A non-migrateable implementation would have a single dataflow operator with a single input dataflow stream of (key, val) pairs, exchanged by key before they arrive at the operator.
Instead, we create two operators F and S. F takes the data stream as one input and the stream of configuration updates as an additional input, and produces data pairs and migrating state as outputs. S takes as inputs exchanged data pairs and exchanged migrating state, and applies them to a hosted instance of O, which implements the operator logic and maintains both state and pending records for each key. Figure 3b presents a schematic overview of the construction.
This construction can be repeated for all the operators in the dataflow that need support for migration. Separate operators can be migrated independently (via separate configuration update streams), or in a coordinated manner by re-using the same configuration update stream. Operators with multiple data inputs can be treated like single-input operators where the migration mechanism acts on both data inputs at the same time.
Operator F routes (key, val) pairs according to the configuration at their associated time, buffering pairs whose time is in advance of the frontier of the configuration input. For times in advance of this frontier, the configuration is not yet certain, as further configuration updates could still arrive. The configuration at times not in advance of this frontier can no longer be updated. As the data frontier advances, configurations can be retired.
Operator F is also responsible for initiating state migrations. For a configuration update (time, key, worker), F must not initiate a migration for key until the key’s state has absorbed all updates at times strictly less than time. F initiates a migration once time is present in the output frontier of S, as this implies that there exist no records at timestamps less than time, as otherwise they would be present in the frontier in place of time.
Operator F initiates a migration by uninstalling the current state for key from its current location in operator S, and transmitting it, bearing timestamp time, to the instance of operator S on worker. The migrated state includes both the state of operator O, as well as the list of pending (val, time) records produced by operator O for future times.
Operator S receives exchanged (key, val) pairs and exchanged state as the result of migrations initiated by F. S immediately installs any received state. S applies received and pending (key, val) pairs in timestamp order using operator O once their timestamp is not in advance of the frontiers of either the data or state inputs.
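The routing rule can be illustrated with a much-simplified, single-timestamp sketch (ours, not Megaphone's implementation): a record is routed by the most recent configuration entry at or before its time, and is buffered while its time is still in advance of the configuration frontier:

```rust
// A simplified model of the routing operator's decision. Names and the
// totally ordered u64 timestamps are our own simplifications.
struct Router {
    updates: Vec<(u64, u64, usize)>, // (time, key, worker) configuration updates
    config_frontier: u64,            // single-timestamp frontier for simplicity
    buffered: Vec<(u64, u64)>,       // (time, key) records awaiting certainty
}

impl Router {
    // Route a record if its configuration is final; otherwise buffer it.
    fn route(&mut self, time: u64, key: u64) -> Option<usize> {
        if time >= self.config_frontier {
            // The configuration at `time` may still change: buffer.
            self.buffered.push((time, key));
            return None;
        }
        // Latest update for this key at or before `time` decides the worker.
        self.updates
            .iter()
            .filter(|&&(t, k, _)| k == key && t <= time)
            .max_by_key(|&&(t, _, _)| t)
            .map(|&(_, _, w)| w)
    }
}

fn main() {
    let mut r = Router {
        updates: vec![(0, 7, 0), (45, 7, 1)], // key 7 moves to worker 1 at t=45
        config_frontier: 50,
        buffered: Vec::new(),
    };
    assert_eq!(r.route(44, 7), Some(0)); // before the migration time
    assert_eq!(r.route(46, 7), Some(1)); // after it
    assert_eq!(r.route(50, 7), None);    // configuration not yet final
}
```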
We provide details of Megaphone’s implementation of this mechanism in Section 4.
For each key k, operator O defines a timeline corresponding to a single-threaded execution, which assigns to each time t the pair of state and pending records just before the application of input records at time t. Let T_k denote the function from times to these pairs for key k.
For each key k, the configuration function partitions this timeline into disjoint intervals [t_i, t_{i+1}), each of which is assigned to one operator instance S_i.
Claim: S_i migrates exactly T_k(t_{i+1}) to S_{i+1}.
First, F always routes input records at times in [t_i, t_{i+1}) to S_i, and so routes all input records of the interval to S_i. If S_i is also presented with T_k(t_i), it has sufficient input to produce T_k(t_{i+1}). More precisely,
Because S_i maintains its output frontier at t_i until it receives T_k(t_i), in anticipation of the need to migrate state, S_i will apply no input records in advance of t_{i+1}. And so, it applies exactly the records in [t_i, t_{i+1}).
Until S_i advances its output frontier to t_{i+1}, it will be strictly less than t_{i+1}, and so nothing other than T_k(t_{i+1}) will be migrated.
Because S_i is presented with T_k(t_i) and all records in [t_i, t_{i+1}), it is able to advance its output frontier to t_{i+1}, and the time t_{i+1} will eventually be in the output frontier of S.
Figure 4 presents three snapshots of a migrating streaming word-count dataflow. The figure depicts instances F1 and F2 of the upstream routing operator, and instances C1 and C2 of the operator hosting the word-count state and update logic. The operators maintain input queues of received but not yet routable input data, and an input stream of logically timestamped configuration updates. Although each routing instance maintains its own routing table, which may temporarily differ from the others, we present a single table for clarity. Input frontiers are represented by boxed numbers, and indicate timestamps that may still arrive on that input.
In Figure 4(a), F1 has enqueued the record (44, a, 3) and F2 has enqueued the record (43, c, 5), both because their control input frontier has only reached 42 and so the destination workers at their associated timestamps have not yet been determined. Generally, routing instances will only enqueue records with timestamps in advance of the control input frontier, and the output frontiers of the routing instances can reach the minimum of the data and control input frontiers.
In Figure 4(b), both control inputs have progressed to 45. The buffered records (44, a, 3) and (43, c, 5) have been forwarded to C1 and C2, and the count operator instances apply the state updates accordingly, shown in bold. Additionally, both routing operators have received a configuration update for the key b at time 45. Should the configuration input frontier advance beyond 45, both F1 and F2 can integrate the configuration change, and then react. The routing instance co-located with b’s state would observe that the output frontier of the count operators reaches 45, and initiate a state migration. The other routing instance would route its buffered input at time 45 to b’s new host rather than its old one.
In Figure 4(c) the migration has completed. Although the configuration frontier has advanced to 55, the output frontiers are held back by the data input frontier at 53. If the configuration frontier advances past 55 then the routing operator could route its queued record, but neither count operator could apply it until they are certain that there are no other data records that could arrive before the record at 55.
Megaphone is an implementation of the migration mechanism described in Section 3. In this section, we detail specific choices made in Megaphone’s implementation, including the interfaces used by the application programmer, Megaphone’s specific choices for the grouping and organization of per-key state, and how we implemented Megaphone’s operators in timely dataflow. We conclude with some discussion of how one might implement Megaphone in other stream processing systems, as well as alternate implementation choices one could consider.
Megaphone presents users with an operator interface that closely resembles the operator interfaces timely dataflow presents. In several cases, users can use the same operator interface extended only with an additional input stream for configuration updates. More generally, we introduce a new structure to help users isolate and surface all information that must be migrated (state, but also pending future records). These additions are implemented strictly above timely dataflow, but their structure is helpful and they may have value in timely dataflow proper.
The simplest stateful operator interface Megaphone and timely provide is the state_machine operator, which takes one input structured as pairs (key, val) and a state update function which can produce arbitrary output as it changes per-key state in response to keys and values. In Megaphone, there is an additional input for configuration updates, but the operator signature is otherwise identical.
More generally, timely dataflow supports operators of arbitrary numbers and types of inputs, containing arbitrary user logic, and maintaining arbitrary state. In each case a user must specify a function from input records to integer keys, and the only guarantee timely dataflow provides is that records with the same key are routed to the same worker. Operator execution and state are partitioned by worker, but not necessarily by key.
For Megaphone to isolate and migrate state and pending work we must encourage users to yield some of the generality timely dataflow provides. However, timely dataflow already requires the user to program partitioned operators, each capable of hosting multiple keys, and we can lean on these idioms to instantiate more fine-grained operators, partitioned not only by worker but further into finer-grained bins of keys. Routing functions for each input are already required by timely dataflow, and Megaphone interposes to allow the function to change according to reconfiguration. Timely dataflow per-worker state is defined implicitly by the state captured by the operator closure, and Megaphone only makes it more explicit. The use of a helper to enqueue pending work is borrowed from an existing timely dataflow idiom (the Notificator). While Megaphone’s general API is not identical to that of timely dataflow, it is just a crisper framing of the same idioms.
Listing LABEL:listing:interface shows how Megaphone’s operator interface is structured. The interface declares unary and binary stateful operators for single input or dual input operators as well as a state-machine operator. The logic has to be encoded in the fold-function. Megaphone presents data in timestamp order with a corresponding state and notificator object. Here, migration is transparent and performed without special handling by the operator implementation.
Listing LABEL:listing:code_word_count shows an example of a stateful word-count dataflow with a single data input and an additional control input. The stateful_unary operator receives the control input, the state type, and a key extraction function as parameters. The control input carries information about where data is to be routed as discussed in the previous section. The state object has to be serializable and also able to expose its elements for migration. Moreover, it has to be reconstructable from a collection of elements. State is managed in groups of keys, i.e. many keys of input data will be mapped to the same state object. The key extraction function defines how this key can be extracted from the input records.
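As an illustration of this fold style, the following sketch (with names of our own choosing; the actual stateful_unary machinery and its listing are elided here) applies one word-count record to a bin's state and emits the updated count:

```rust
use std::collections::HashMap;

// State for one bin: counts for the words whose keys hash into it. The type
// alias and function names are illustrative, not Megaphone's actual API.
type BinState = HashMap<String, u64>;

// The fold logic: apply one (word, delta) record to the bin's state and
// emit the word's updated count as output.
fn fold(state: &mut BinState, word: String, delta: u64) -> (String, u64) {
    let count = state.entry(word.clone()).or_insert(0);
    *count += delta;
    (word, *count)
}

fn main() {
    let mut state = BinState::new();
    assert_eq!(fold(&mut state, "stream".to_string(), 1), ("stream".to_string(), 1));
    assert_eq!(fold(&mut state, "stream".to_string(), 2), ("stream".to_string(), 3));
}
```

Because the per-bin state is an ordinary serializable map, the surrounding machinery can extract it wholesale for migration without the fold logic being aware of it.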
State migration as introduced in Section 3.2 operates at per-key granularity. In a typical streaming dataflow, the number of keys can be large, on the order of millions or billions. Managing each key individually can be costly, and thus we chose to group keys into bins and adapt the configuration function to assign bins, rather than individual keys, to workers.
Additionally, each key is statically assigned to one equivalence class that identifies the bin it belongs to.
In Megaphone, the number of bins is configurable in powers of two at startup but cannot be changed at run-time. A stateful operator is presented with the bin that holds the data for the equivalence class of keys of the current input. Bins are simply identified by a number, which corresponds to the most significant bits of the exchange function specified on the operator. (We use the most significant bits because keys with similar least-significant bits would otherwise be mapped to the same bin; Rust’s HashMap implementation suffers from collisions for keys with similar least-significant bits.)
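The bin-selection rule can be sketched as follows, assuming 2^B bins and a 64-bit hash; the helper name is ours:

```rust
// Sketch of the binning scheme: with 2^B bins, the bin identifier is the
// top B bits of the 64-bit hash of the key. Assumes 1 <= bin_bits <= 63
// (a shift by the full width would panic in Rust).
fn bin_of(hash: u64, bin_bits: u32) -> u64 {
    hash >> (64 - bin_bits)
}

fn main() {
    // With 16 bins (B = 4), the top four bits select the bin.
    assert_eq!(bin_of(0xF000_0000_0000_0000, 4), 15);
    assert_eq!(bin_of(0x0123_4567_89AB_CDEF, 4), 0);
    // Hashes that differ only in their low bits land in the same bin, which
    // is why the top bits are used for bin selection.
    assert_eq!(bin_of(0x8000_0000_0000_0001, 4), bin_of(0x8000_0000_0000_0002, 4));
}
```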
Megaphone’s mechanism requires two distinct operators, F and S. The operator S maintains the bins local to a worker and passes references to the user logic O. Nevertheless, the S-operator does not have a direct channel to its peers. For this reason, F can obtain a reference to the bins by means of a shared pointer. During a migration, F serializes the state obtained via the shared pointer and sends it to the new owning S-operator via a regular timely dataflow channel. Note that sharing a pointer between two operators requires the operators to be executed by the same process (or by the same thread, to avoid synchronization), which is the case for timely dataflow.
In timely dataflow, data is exchanged according to an exchange function, which takes a record and computes an integer representative value, typically a hash of the record’s key.
Timely dataflow uses this value to decide where to send tuples. In Megaphone, instead of assigning data to a worker directly based on the exchange function, we introduce an indirection layer in which bins are assigned to workers. That way, the exchange function for the channels from F to S evaluates to the identifier of the specific worker that currently hosts the record’s bin.
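The indirection layer can be sketched as a two-step lookup, with hypothetical names of our own: the exchange function's hash selects a bin, and a bin-to-worker table (updated by configuration changes) selects the worker:

```rust
// Sketch of the indirection layer: hash -> bin -> worker. The table
// `bin_to_worker` is what configuration updates would rewrite; names ours.
// Assumes 1 <= bin_bits <= 63 and a table with 2^bin_bits entries.
fn worker_for(hash: u64, bin_bits: u32, bin_to_worker: &[usize]) -> usize {
    let bin = (hash >> (64 - bin_bits)) as usize; // top bits select the bin
    bin_to_worker[bin]
}

fn main() {
    // Four bins (bin_bits = 2) spread over two workers.
    let table = [0, 1, 0, 1];
    assert_eq!(worker_for(0x0000_0000_0000_0000, 2, &table), 0); // bin 0
    assert_eq!(worker_for(0xC000_0000_0000_0000, 2, &table), 1); // bin 3
}
```

Migrating a bin then amounts to rewriting one table entry and shipping that bin's state, with no change to the exchange function itself.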
Megaphone ties migrations to logical time and a computation’s progress. A reconfiguration at a specific time is only to be applied to the system once all data up to that time has been processed. The F operators access this information by monitoring the output frontier of the S operators. Specifically, timely dataflow supports probes as a mechanism to observe progress on arbitrary dataflow edges. Each worker attaches a probe to the output stream of the S operators, and provides the probe to its F operator instance.
For Megaphone to migrate state, it requires clear isolation of per-key state and pending records. Although timely dataflow requires users to write operators that can be partitioned across workers, it does not require the state and pending records to be explicitly identified. To simplify programming migratable operators, we encapsulate several timely dataflow idioms in a helper structure that both manages state and pending records for the user, and surfaces them for migration.
Timely dataflow has a Notificator type that allows an operator to indicate future times at which it may produce output, but without encapsulating the keys, states, or records it might use. We implemented an extended notificator that buffers future (time, key, val) triples and can replay subsets for times not in advance of an input frontier. Internally, the triples are managed in a priority queue (unlike timely dataflow’s notificator), which allows Megaphone to efficiently maintain large numbers of future triples. By associating data (keys, values) with the times, we relieve the user from maintaining this information on the side. As we will see, Megaphone’s notificator can result in a net reduction in implementation complexity, despite eliciting more information from the user.
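A simplified sketch of such a notificator, using a priority queue over (time, key, value) triples; the type and method names are illustrative, and the timestamp is simplified to an integer:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Buffers (time, key, value) triples and replays those whose time
/// has passed the input frontier. A sketch, not Megaphone's real API.
struct Notificator<K: Ord, V: Ord> {
    pending: BinaryHeap<Reverse<(u64, K, V)>>, // min-heap by time
}

impl<K: Ord, V: Ord> Notificator<K, V> {
    fn new() -> Self {
        Notificator { pending: BinaryHeap::new() }
    }

    /// Request a future notification, carrying the key and value
    /// so the user need not track them on the side.
    fn notify_at(&mut self, time: u64, key: K, val: V) {
        self.pending.push(Reverse((time, key, val)));
    }

    /// Drain all triples whose time is strictly less than the
    /// frontier, i.e. times that can no longer receive input.
    fn drain_ready(&mut self, frontier: u64) -> Vec<(u64, K, V)> {
        let mut ready = Vec::new();
        while let Some(Reverse((t, _, _))) = self.pending.peek() {
            if *t < frontier {
                let Reverse(triple) = self.pending.pop().unwrap();
                ready.push(triple);
            } else {
                break;
            }
        }
        ready
    }
}
```

The heap makes each insertion and removal logarithmic in the number of buffered triples, which is what lets large numbers of future notifications be maintained cheaply.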
Up to now, we have explained how to map Megaphone’s abstract model to an implementation. The model leaves many details to the implementation, several of which have a large effect on run-time performance. Here, we point out how these details interact with other features of the underlying system, what the possible alternatives are, and how to integrate Megaphone into a larger, controller-based system.
We implemented Megaphone in timely dataflow, but its mechanisms could be implemented in any sufficiently expressive stream processor. Specifically, we require that operators be able to (1) observe timestamp progress at other locations in the dataflow, and (2) extract state from downstream operators for migration. Both requirements are met in timely dataflow, in which workers manage multiple operators, but even in systems where each thread of control manages a single operator, external coordination and communication mechanisms could be used to effect the same behavior.
Megaphone is a library built on timely dataflow abstractions, and inherits fault-tolerance guarantees from the system. For example, the Naiad implementation of timely dataflow provides system-wide consistent snapshots, and a Megaphone implementation on Naiad would inherit fault tolerance. At the same time, Megaphone’s migration mechanisms effectively provide programmable snapshots on finer granularities, which could feed back into finer-grained fault-tolerance mechanisms.
Megaphone’s implementation uses binning to reduce the complexity of the configuration function. An alternative to a static mapping of keys to bins could be achieved by means of a prefix tree (e.g., a longest-prefix match as in Internet routing tables). Extending bins with the ability to split into smaller sets, or merge into larger ones, would allow run-time reconfiguration of the binning strategy itself, rather than fixing it at initialization with no option to change it later.
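This alternative could be sketched with bit-prefix bins that support longest-prefix lookup and splitting; the following is a hypothetical design sketch, not part of Megaphone’s implementation:

```rust
/// A bin covering all keys whose hash starts with a given bit
/// prefix, so a bin can later be split into two finer bins (one
/// more prefix bit) or two siblings merged back together.
struct PrefixBin {
    prefix: u64, // prefix bits, left-aligned in a u64
    len: u32,    // number of meaningful bits
}

impl PrefixBin {
    fn matches(&self, hash: u64) -> bool {
        self.len == 0 || (hash >> (64 - self.len)) == (self.prefix >> (64 - self.len))
    }

    /// Split one bin into two children with one more prefix bit.
    fn split(&self) -> (PrefixBin, PrefixBin) {
        let bit = 1u64 << (63 - self.len);
        (
            PrefixBin { prefix: self.prefix, len: self.len + 1 },
            PrefixBin { prefix: self.prefix | bit, len: self.len + 1 },
        )
    }
}

/// Longest-prefix match over a set of bins, as in routing tables.
fn lookup(bins: &[PrefixBin], hash: u64) -> Option<usize> {
    bins.iter()
        .enumerate()
        .filter(|(_, b)| b.matches(hash))
        .max_by_key(|(_, b)| b.len)
        .map(|(i, _)| i)
}
```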
We implemented Megaphone as a system that provides an input for configuration updates to be supplied by an external controller. A controller could observe the performance characteristics of a computation on a per-key level and correlate this with the input workload. For example, the recent DS2  system automatically measures and re-scales streaming systems to meet throughput targets.
Independently, we have observed and implemented several details for effective migration. Specifically, we can use bipartite matching to group migrations that do not interfere with each other, reducing the number of migration steps without significantly increasing the maximum latency. We can also insert a gap between migrations to let the system drain enqueued records immediately, rather than during the next migration, which reduces the maximum latency from two migration durations to one.
Our evaluation of Megaphone is in three parts. We are particularly interested in the latencies of streaming queries, and how they are affected by Megaphone both in the steady state (where no migration is occurring) and during a migration.
First, in Section 5.1 we use the NEXMark benchmarking suite [21, 25] to evaluate Megaphone under a realistic workload. NEXMark consists of queries covering a variety of operators and windowing behaviors. Next, in Section 5.2 we look at the overhead of Megaphone when no migration occurs: this is the cost of providing migration functionality in stateful dataflow operators, versus using optimized operators which cannot migrate state. Finally, in Section 5.3 we use a microbenchmark to investigate how parameters like the number of bins and size of the state affect migration performance.
We run all experiments on a cluster of four machines, each with four Intel Xeon E5-4650 v2 CPUs (each 10 cores with hyperthreading) and of RAM, running Ubuntu 18.04. For each experiment, we pin a timely process with four workers to a single CPU socket. Our open-loop testing harness supplies input at a specified rate, even if the system itself becomes less responsive (e.g., during a migration). We record the observed latency, in units of nanoseconds, in a histogram of logarithmically-sized bins.
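A histogram with logarithmically-sized bins, as the harness uses, can be sketched as follows (a minimal version; the exact bucket layout is an assumption):

```rust
/// Latency measurements recorded into logarithmically-sized
/// buckets: bucket i holds latencies in [2^i, 2^(i+1)) nanoseconds,
/// so a fixed 64-entry array covers the full u64 range.
struct LogHistogram {
    buckets: [u64; 64],
}

impl LogHistogram {
    fn new() -> Self {
        LogHistogram { buckets: [0; 64] }
    }

    fn record(&mut self, latency_ns: u64) {
        // 63 - leading_zeros computes floor(log2(latency));
        // saturating_sub sends a latency of 0 to bucket 0.
        let idx = 63u32.saturating_sub(latency_ns.leading_zeros());
        self.buckets[idx as usize] += 1;
    }
}
```

Logarithmic buckets keep memory bounded while preserving relative precision, which suffices to distinguish millisecond steady-state latencies from second-scale migration spikes.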
Unless otherwise specified, we migrate the state of the main operator of each dataflow. We initially migrate half of the keys on half of the workers to the other half of the workers (25% of the total state), which results in an imbalanced assignment. We then perform and report the details of a second migration back to the balanced configuration.
The NEXMark suite models an auction site in which a high-volume stream of users, auctions, and bids arrive, and eight standing queries are maintained reflecting a variety of relational queries including stateless streaming transformations (e.g., map and filter in Q1 and Q2 respectively), a stateful record-at-a-time two-input operator (incremental join in Q3), and various window operators (e.g., sliding window in Q5, tumbling window join in Q8), and complex multi-operator dataflows with shared components (Q4 and Q6).
We have implemented all eight of the NEXMark queries in both native timely dataflow and using Megaphone. Table 1 lists the lines of code for queries 1–8. Native is a hand-tuned implementation; Megaphone is implemented using the stateful operator interface. Note that the implementation complexity for the native version is higher in most cases, as it includes optimizations from Section 4 that timely dataflow does not offer and that must therefore be implemented by hand for each operator.
To test our hypothesis that Megaphone supports efficient migration on realistic workloads, we run each NEXMark query under high load and migrate the state of each query without interrupting query processing. Our test harness uses a reference input data generator and increases its rate. The generator can be played at a higher rate, but this does not change certain intrinsic properties: for example, the number of active auctions is static, so increasing the event rate decreases auction duration. For this reason, we present time-dilated variants of queries Q5 and Q8, which contain large time-based windows (up to 12 hours). We run all queries with updates per second. For stateful queries, we perform a first migration at and perform and report a second re-balancing migration at . We compare the all-at-once strategy to Megaphone’s batched strategy, which strikes a balance between migration latency and duration. We use bins for Megaphone’s migration; in Section 5.2 we study Megaphone’s sensitivity to the bin count.
Figures 7 through 12 show timelines for the second migration of stateful queries Q3 through Q8. Generally, the all-at-once migrations experience maximum latencies proportional to the amount of state maintained, whereas the latencies of Megaphone’s batched migration are substantially lower when the amount of state is large.
Queries Q1 and Q2 maintain no state. Q1 transforms the stream of bids to use a different currency, while Q2 filters bids by their auction identifiers. Although neither query accumulates state to migrate, we demonstrate their behavior to establish a baseline for Megaphone and our test harness. Figures 5 and 6 show query latency during two migrations in which, consequently, no state is transferred; any impact is dominated by system noise.
Query Q3 joins auctions and people to recommend local auctions to individuals. The join operator maintains the auctions and people relations, keyed by seller and person, respectively. This state grows without bound as the computation runs. Figure 7 shows the query latency for both Megaphone and the native timely implementation. We note that while the native implementation shows some latency spikes, they are more pronounced in Megaphone, whose tail latency we investigate further in Section 5.2.
Query Q4 reports the average closing prices of auctions in a category. It relies on a stream of closed auctions, derived from the streams of bids and auctions, which we compute and maintain, and it contains one operator, keyed by auction id, that accumulates relevant bids until the auction closes, at which point the auction is reported and removed. The NEXMark generator is designed to have a fixed number of active auctions at a time, so the state remains bounded. Figure 8 shows the latency timeline during the second migration. The all-at-once migration strategy causes a latency spike of more than two seconds, whereas the batched migration strategy only shows an increase in latency of up to .
Query Q5 reports, each minute, the auctions with the highest number of bids over the previous sixty minutes. It maintains up to sixty counts for each auction, so that it can both report and retract counts as time advances. To elicit more regular behavior, our implementation reports every second over the previous minute, effectively dilating time by a factor of 60. Figure 9 shows the latency timeline for the second migration; the all-at-once migration causes latencies an order of magnitude larger than the per-second events, whereas Megaphone’s batched migration is indistinguishable from normal operation.
Query Q6 reports the average closing price for the last ten auctions of each seller. The operator is keyed by auction seller and maintains a list of up to ten prices. As the computation proceeds, the set of sellers, and thus the associated state, grows without bound. Figure 10 shows the timeline at the second migration. The result is similar to query Q4, because the two queries share a large fraction of their query plan.
Query Q7 reports the highest bid each minute; the results are shown in Figure 11. This query has minimal state (a single value) but does require a data exchange to collect worker-local aggregations into a computation-wide aggregate. Because the state is so small, there is no distinction between all-at-once and Megaphone’s batched migration.
Query Q8 reports a twelve-hour windowed join between new people and new auction sellers. This query has the potential to maintain a massive amount of state, as twelve hours of auction and people data is substantial. Once reached, the peak state size is maintained. To show the effect of twelve-hour windows, we dilate the internal time by a factor of 79. The reconfiguration time of corresponds to approximately of event time.
These results show that for NEXMark queries maintaining large amounts of state, all-at-once migration can introduce significant disruption, which Megaphone’s batched migration can mitigate. In principle, the latency could be reduced still further with the fluid migration strategy, which we evaluate in Section 5.3.
We now use a counting microbenchmark to measure the overhead of Megaphone, from which one can determine an appropriate trade-off between migration granularity and this overhead. We compare Megaphone to native timely dataflow implementations, as we vary the number of bins that Megaphone uses for state. We anticipate that this overhead will increase with the number of bins, as Megaphone must consult a larger routing table.
The workload uses a stream of randomly selected 64-bit integer identifiers, drawn uniformly from a domain defined per experiment. The query reports the cumulative counts of the number of times each identifier has occurred. In these workloads, the state is the per-identifier count, intentionally small and simple so that we can see the effect of migration rather than associated computation. We consider two variants, an implementation that uses hashmaps for bins (“hash count”), and an optimized implementation that uses dense arrays to remove hashmap computation (“key count”).
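A minimal sketch of the two counting variants (the helper names are hypothetical; the real benchmark runs these as per-bin state inside timely dataflow operators):

```rust
use std::collections::HashMap;

/// "hash count": per-bin state is a HashMap from key to count.
fn hash_count(state: &mut HashMap<u64, u64>, key: u64) -> u64 {
    let c = state.entry(key).or_insert(0);
    *c += 1;
    *c
}

/// "key count": the optimized variant backs each bin with a dense
/// array indexed directly by the key, removing hashing entirely.
/// (Assumes keys are contiguous within a known, bounded domain.)
fn key_count(state: &mut Vec<u64>, key: u64) -> u64 {
    let slot = &mut state[key as usize];
    *slot += 1;
    *slot
}
```

The two variants separate the cost of the migration machinery itself from the cost of hashing in the state backend.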
Each experiment is parameterized by a domain size (the number of distinct keys) and an input rate (in records per second), for which we then vary the number of bins used by Megaphone. Each experiment pre-loads one instance of each key to avoid measuring latency due to state re-allocation at runtime.
Figure 13 shows the CCDF of per-record latency for the hash-count experiment, and Figures 14 and 15 show the CCDF of per-record latency for the key-count experiment at two configurations of distinct keys and update rates. Each figure reports measurements for a native timely dataflow implementation, and for Megaphone with geometrically increasing numbers of bins.
For small bin counts, the latencies remain a small constant factor larger than the native implementation, but this increases noticeably once we reach bins. We conclude that while a performance penalty exists, it can be an acceptable trade-off for flexible stateful dataflow reconfiguration. A bin-count parameter of up to leads to largely indistinguishable results, and we will use this number when we need to hold the bin count constant in the rest of the evaluation.
We now use the counting benchmark from the previous section to analyze how various parameters influence Megaphone’s maximum latency and migration duration. Specifically,
In Section 5.3.1 we evaluate the maximum latency and duration of migration strategies as the number of bins increases. We expect Megaphone’s maximum latencies to decrease with more bins, without affecting duration.
In Section 5.3.2 we evaluate the maximum latency and duration of migration strategies as the number of distinct keys increases. We expect all maximum latencies and durations to increase linearly with the amount of maintained state.
In Section 5.3.3 we evaluate the maximum latency and duration of migration strategies as the number of distinct keys and bins increase proportionally. We expect that with a constant per-bin state size Megaphone will maintain a fixed maximum latency while the duration increases.
In Section 5.3.4 we evaluate the memory consumption during migration. We expect a smaller memory footprint for Megaphone migrations.
Each of our migration experiments largely resembles the shapes seen in Figure 1, where each migration strategy has a well defined duration and maximum latency. For example, the all-at-once migration strategy has a relatively short duration with a large maximum latency, whereas the bin-at-a-time (fluid) migration strategy has a longer duration and lower maximum latency, and the batched migration strategy lies between the two. In these experiments we summarize each migration by the duration of the migration, and the maximum latency observed during the migration.
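The summary described above can be sketched as follows, under assumed definitions (names and thresholds are illustrative): the migration’s maximum latency is the largest latency observed after the migration starts, and its duration ends at the last observation above a steady-state threshold:

```rust
/// Summarize a migration from a latency timeline of (time, latency)
/// samples: return (maximum latency, duration). The duration ends
/// at the last sample exceeding the steady-state threshold.
fn summarize(timeline: &[(u64, u64)], start: u64, threshold: u64) -> (u64, u64) {
    // Restrict attention to samples at or after the migration start.
    let during: Vec<&(u64, u64)> = timeline.iter().filter(|(t, _)| *t >= start).collect();
    let max_latency = during.iter().map(|(_, l)| *l).max().unwrap_or(0);
    let end = during
        .iter()
        .filter(|(_, l)| *l > threshold)
        .map(|(t, _)| *t)
        .max()
        .unwrap_or(start);
    (max_latency, end - start)
}
```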
We now evaluate the behavior of different migration strategies for varying numbers of bins. As we increase the number of bins we expect to see fluid and batched migration achieve lower maximum latencies, though ideally with relatively unchanged durations. We do not expect to see all-at-once migration behave differently as a function of the number of bins, as it conducts all of its migrations simultaneously.
Holding the rates and the number of distinct keys fixed, we vary the number of bins from up to by factors of four. For each configuration, we run for one minute to establish a steady state, then initiate a migration and continue for another minute. During this whole time, the rate of input records continues uninterrupted.
Figure 16 reports the latency-vs-duration trade-off of the three migration strategies as we vary the number of bins used. Each connected line describes one strategy, and common marker shapes indicate a common number of bins. We see that all all-at-once migration experiments fall in a low-duration, high-latency cluster. Both fluid and batched migration achieve lower maximum latency as we increase the number of bins, without negatively impacting the duration.
We now evaluate the behavior of different migration strategies for varying domain sizes. Holding the rates and bin counts fixed, we vary the number of keys from up to by factors of two. For each configuration, we run for one minute to establish a steady state, then initiate a migration and continue for another minute. During this whole time, the rate of input records continues uninterrupted.
Figure 17 reports the latency-vs-duration trade-off of the three migration strategies as we vary the number of distinct keys. Each connected line describes one strategy, and common marker shapes indicate a common number of distinct keys. We see that in every experiment, all-at-once migration has the highest latency and lowest duration, fluid migration has a lower latency and higher duration, and batched migration often has the best qualities of both.
In the previous experiments, we either fixed the number of bins or the number of keys while varying the other parameter. In this experiment, we vary both bins and keys together such that the total amount of data per bin stays constant. This maintains a fixed migration granularity, which should have a fixed maximum latency even as the number of keys (and total state) increases. We run the key count experiment and fix the number of keys per bin to . We then increase the domain in steps of powers of two starting at and increase the number of bins such that the keys per bin stays constant. The maximum domain is keys.
Figure 18 reports the latency-versus-duration trade-off for the three migration strategies as we increase the domain and the number of bins while keeping the state per bin constant. Each line describes one migration strategy, and each point a different configuration. We observe that for fluid and batched migration the latency stays constant while only the duration increases as we increase the domain. For all-at-once migration, both latency and duration increase.
We conclude that fluid and batched migration offer a way to bound the latency impact on a computation during a migration while increasing the migration duration, whereas all-at-once migration does not.
In Section 5.3.3 we analyzed the behavior of different migration strategies when increasing the total amount of state in the system while leaving the state per bin constant. Our expectation was that the all-at-once migration strategy would always offer the lowest duration compared to batched and fluid migrations. Nevertheless, we observe that for large amounts of migrated data, the duration of an all-at-once migration is longer than that of a batched migration.
To analyze the cause of this behavior, we compared the memory consumption of the three migration strategies over time. We run the key count dataflow with keys and 4096 bins. We record the resident set size (RSS) as reported by Linux over time, per process.
Figure 19 shows the RSS reported by the first timely process for each migration strategy. Batched and fluid migration show similar memory consumption in steady state and do not exhibit large variance during the migrations. In contrast, all-at-once migration shows significant additional allocations during the migrations.
The experiment gives us evidence that an all-at-once migration causes significant memory spikes in addition to latency spikes. The reason is that during an all-at-once migration, each worker extracts and serializes all the data to be migrated and enqueues it for the network threads to send. The network threads’ send capacity is limited by the network throughput, which bounds the rate at which the enqueued data can drain to the remote hosts. The batched and fluid migration patterns only begin the next migration once the previous one completes, and thus provide a simple form of flow control that effectively limits the amount of temporary state.
We presented the design and implementation of Megaphone, which provides efficient, minimally disruptive migration for stream processing systems. Megaphone plans fine-grained migrations using the logical timestamps of the stream processor, and interleaves the migrations with regular streaming dataflow processing. Our evaluation on realistic workloads shows that migration disruption is significantly lower than with prior all-at-once migration strategies.
We implemented Megaphone in timely dataflow, without any changes to the host dataflow system. Megaphone demonstrates that dataflow coordination mechanisms (timestamp frontiers) and dataflow channels themselves are sufficient to implement minimally disruptive migration. Megaphone’s source code is available at https://github.com/strymon-system/megaphone.