The public cloud offers access to an easily-managed, pay-on-use model of renting compute and storage resources. Increasingly, many companies are moving their business workloads to the cloud [10, 11]. This requires designing software services that execute on the cloud, making effective use of the available resources. However, developing such services is challenging as the cloud programming environment is that of a traditional distributed system: service components are spread across multiple virtual machines and data centers, and communication must happen over the network. To build a reliable cloud service, developers must defend against all common pitfalls of distributed systems: the concurrency from multiple executing processes, unreliable networks (e.g., out-of-order delivery, or message loss/duplication), as well as hardware/software failures. In this paper, we refer to these combined challenges as sources of non-determinism. It is no surprise that the presence of such non-determinism leads to bugs in production, causing tangible loss of business and customer trust [2, 38, 35].
The research community has made several attempts at finding distributed-systems bugs, commonly through the use of systematic testing tools. Examples include Chess [32, 31], MoDist, dBug and SAMC. These tools take over the non-determinism in a test environment and control it to explore many different program executions. Both exhaustive (up to a bound) and random explorations have proven to be effective. In fact, folklore suggests that any distributed (or concurrent) system, when “shaken” carefully by a systematic testing tool, would surely produce bugs. However, despite this success, there has been no visible change in the software development practices followed in industry. Chances are that the next time a new system is built, it will be built in the same manner as before, leading to the same kinds of bugs seen in previous systems. Without a change in the software development process, the likely impact of any bug-finding technique will be limited.
This paper presents evidence that the situation is not as grim as outlined above. Tools and techniques suggested by the research community can indeed have considerable impact in the industry for developing distributed systems. We share experience from the adoption of an open-source programming framework [6, 36] in Microsoft Azure for building production cloud services. The framework imposes a principled design pattern, inspired by the actor style of programming, allowing the implementation to closely resemble its high-level design. It also provides mechanisms for programmatically expressing non-determinism and writing detailed safety and liveness specifications. Finally, it comes with automated testing capabilities that encapsulate the state of the art in systematic testing. This enables high-coverage testing of the production code against its specifications and provides deterministic reproducibility of bugs. Putting these pieces together, the framework allows developers to iterate through the design-implement-test cycle faster than otherwise, leading to accelerated development.
To illustrate the benefits of using the framework, we provide a detailed description of PoolManager, one of three core components of the Azure Batch service (ABS), which was written from scratch using the framework. ABS is a popular job scheduling and compute management service, offered by Azure, managing hundreds of thousands of VMs on the cloud. ABS allows a user to create a collection of compute nodes and schedule a parallel job across these nodes. PoolManager is responsible for creating and managing the collection of compute nodes (also called a pool). A previous version of PoolManager, developed over several years using standard engineering practices, had an outdated design that was unable to manage the increasing demands of Azure Batch. It was hard to maintain and test, making feature addition unacceptably slow. This prompted the ABS team to rewrite the PoolManager, this time adopting the framework.
The ABS engineers (both junior and senior) were able to move faster and be more confident in their code changes because they could achieve high-coverage testing with the framework. Writing detailed specifications alongside production code became an integral part of their daily development process. The team reported that the coverage obtained with systematic testing was much higher than even with days of stress testing. The tester found hundreds of bugs that were fixed fast, often without ever getting checked in. For a few bugs that we were able to snapshot, it was unlikely that they would have been found using stress testing, or other conventional testing methods, because they required several failures and timeout events to interleave.
The ABS team gained considerable confidence in testing as the PoolManager development proceeded: once a feature was tested with the framework, it would just work when put into production. The current state of practice in the team is that each code check-in to the master branch must clear all available tests. It was a unique experience for the team to get exhaustive (in reality, high-coverage) testing of their code changes readily available on their desktop as they were writing and integrating the code. The debugging process was also significantly improved: each time the tester found a bug, engineers could deterministically replay the buggy trace, attach a debugger, set breakpoints and step into the code. The majority of the new PoolManager development took only six months, considerably faster than the previous version. For some time, both versions of PoolManager existed simultaneously. During this time, the team had to add a new feature (for supporting low-priority preemptible VMs in the pools) to both versions. The addition in the old PoolManager took six person-months, whereas the addition in the new version took just one person-month. The new PoolManager has now been operating for over a year with no reported bugs in production for tested features. There were occasional bugs, but they all pointed to features outside the scope of the tests.
The Azure Batch PoolManager is the first production-scale system, to the best of our knowledge, to have been developed along with continuous validation of safety and liveness specifications. The experience of the ABS team was not isolated. Given their success, several other teams in Azure adopted the framework in their engineering process. Currently, it has been used in Azure for building nine production services, with several more services in the planning stage (§6). Furthermore, the framework has retained its users so far: once a team started using it, they have continued to use it for writing new cloud services.
The main contributions of this paper are as follows:
We present PoolManager, the first production-scale system to have been developed simultaneously with continuous validation of the actual code against its safety and liveness specifications; both design and implementation were done by engineers, not researchers.
PoolManager is a stateful microservice that requires storing its state reliably so that it can be restored after a failure (known as a failover). Getting the failover logic correct is often hard. We give a novel methodology for failover testing (§4).
We describe several improvements to the framework that were necessary to support industry-scale usage (§5).
We discuss the experience of several Azure engineering teams with using the framework for building highly reliable cloud services (§6).
The rest of this paper is organized as follows: §2 provides background; §3 outlines the design and implementation of the PoolManager using the framework; §4 focuses on testing of the PoolManager; §5 lists the improvements made to the framework to support the development of production systems; §6 summarizes the experience of several Azure engineering teams with using the framework; and finally §7 presents related work.
2.1 The Azure Batch Service (ABS)
ABS is a popular generic job scheduling service offered by Microsoft Azure. ABS allows a user to execute a parallel job in the cloud. The job can consist of multiple tasks with a given set of dependencies. ABS will execute the tasks in dependency order while attempting to exploit as much parallelism between independent tasks as possible. Unlike distributed schedulers such as Apache Yarn and Mesos, which typically need to be installed on a pre-created set of VMs, ABS integrates scheduling with VM management. ABS can auto-scale (i.e., spin up and down) the number of created VMs based on the needs of each job as well as a variety of parameters such as CPU, memory and I/O metrics on VMs, and preemption rate.
The high-level architecture of ABS is shown on the left side of Figure 1. Each region (i.e., a geographical location that hosts one or more data centers) has a resource provider and several schedulers. A breakdown of the resource provider is shown on the right side of Figure 1. The resource provider has a front-end or gateway service that routes requests to back-end managers that support the CRUD operations for specific entities: user accounts, pools and jobs, with relevant data stored in Azure Storage for persistence.
ABS is a multi-tenant service and a user account is the multi-tenant isolation boundary. Each account is associated with a quota that limits the amount of compute resources that can be allocated for a scheduled job. All resources used by ABS for executing a job are billed to the corresponding account. After creating and registering an account, a user can create a pool that refers to a collection of compute nodes (VMs). The total number of cores of all VMs across all pools associated with an account must be less than the core quota limit of the account. The pool can be of fixed size, or set to auto-scale. Once a pool is created, the user can submit a job to schedule on the pool. Each of the account, pool and job managers is a multi-instance partitioned service (partitioned by account). The partitions themselves are managed by the partition manager. The design of the partition manager is out of scope for this paper, but the various services have to honor partition manager requests to start/stop/split/merge partitions.
At the heart of the ABS functionality is the microservice component called PoolManager. The PoolManager interacts with many other components in the system, such as the job manager, Azure Storage for storage, Virtual Machine Scale Sets (VMSS) for VM management and Azure Subscriptions for billing accounts. ABS needs to respond to auto-scale requirements very rapidly. To achieve this, it must support functionality to cancel outstanding operations so that further changes to resources can be made. The functionality must be provided with low latency, high throughput, high availability and scale. Note that all services that ABS interacts with are publicly available and ABS uses the same APIs that are published externally. This implies that ABS must obey all the rules and limitations enforced by these services. This point was an important design consideration, especially for ABS quota and billing management.
The PoolManager component had to be redesigned for a variety of reasons. The old design split the work of creating pools between the PoolManager and the scheduler. The PoolManager managed pool entities and quotas while the scheduler did the actual allocation of the VMs for the pools. This design itself was an evolution of a previous design where the scheduler cached a pool of VMs. In that scenario, allocating a pool was not about creating new VMs but rather picking from a set of free VMs. Caching of VMs was no longer feasible for ABS: as its usage increased, customers wanted VMs of different sizes and OS images, etc., so it became too costly to hold the VMs in the scheduler. The old design made the scheduler very complicated. It was also harder to dynamically scale up and down the number of schedulers because they were involved in VM management. The goal of the redesign was to move the quota management and the actual VM allocation to a single component (PoolManager) where it can be partitioned and scaled independently of the scheduler component. This helped the scheduler become a very lightweight component that can be spun up and down quickly, as it now only focuses on job scheduling. The redesign also allowed the ABS team to easily incorporate different types of scheduling policies.
The old PoolManager did have some unit and integration tests but the ABS team felt that the tests did not provide much confidence in the overall reliability of the PoolManager. It was important to remedy this situation as well. The PoolManager is a stateful component that operates in a distributed-systems environment. Testing of such components is challenging. For instance, the VM hosting a PoolManager instance may fail or reboot without warning. Operations on pools are long-running asynchronous activities, thus, the developer must anticipate and account for failures that can happen in the middle of such an operation. We refer to the part of a program’s design that deals with recovery from failures as failover logic and testing for its correctness as failover testing. Failovers are not the only challenge: one must also correctly deal with issues such as message re-orderings, timeouts and error handling (when interacting with other services). Due to inadequate testing technology, most often, errors arising from these types of issues are discovered late in the development cycle, or even after deployment when they are very costly to debug and fix. Ideally, we should be able to discover and fix these kinds of issues well before the software is deployed in production.
A further requirement of the PoolManager design was that the entire code base should be asynchronous and non-blocking. This is to ensure that PoolManager remains responsive to cancellation requests: the user is allowed to cancel an outstanding operation at any time. In the old design, there were dedicated threads that blocked synchronously and when the process ran out of threads, the system could not process more requests.
2.2 An Introduction to the Framework
The framework [6, 36] is open source, actor-based, and built on .NET. An actor, which is the unit of concurrency, is called a machine in the framework. A program can dynamically create any number of machines that execute concurrently with each other and communicate via messages called events. Each machine is equipped with an inbox where incoming events get enqueued, and executes an event-handling loop that waits for events to arrive and processes them sequentially one after the other. A machine can internally define a state machine structure for programmatic convenience (which is where the term machine comes from). However, this feature is merely syntactic sugar and not central to the value provided by the framework.
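The machine model described above can be sketched in a few lines. This is a minimal, language-neutral illustration in Python, not the framework's .NET API: the names `Machine`, `Ponger`, `send`, `step` and `run` are hypothetical, and the round-robin loop stands in for whatever scheduler the real runtime uses.

```python
from collections import deque


class Machine:
    """Hypothetical actor: private state plus an inbox of events."""

    def __init__(self, name):
        self.name = name
        self.inbox = deque()   # incoming events are enqueued here
        self.received = []     # private state, never shared with other machines

    def send(self, target, event):
        # Sending an event is the only way machines communicate.
        target.inbox.append((self, event))

    def step(self):
        """Dequeue and handle one event; return False if the inbox is empty."""
        if not self.inbox:
            return False
        sender, event = self.inbox.popleft()
        self.received.append(event)
        self.on_event(sender, event)
        return True

    def on_event(self, sender, event):
        pass  # default handler: just record the event


class Ponger(Machine):
    def on_event(self, sender, event):
        if event == "ping":
            self.send(sender, "pong")


def run(machines):
    # Assumed scheduler: step every machine in turn until all inboxes are empty.
    while any([m.step() for m in machines]):
        pass


pinger = Machine("pinger")
ponger = Ponger("ponger")
pinger.send(ponger, "ping")
run([pinger, ponger])
```

Because each machine touches only its own `received` list and its own inbox, every interaction between `pinger` and `ponger` is visible as a `send` call.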
The framework has a higher-level concurrency model compared to using threads and locks. A machine encapsulates its own state that is not shared with other machines, and synchronization is limited to sending events. This means that all communication points between machines are clearly marked in code (as opposed to using shared memory where communication happens implicitly each time shared data is accessed).
The framework is designed in a manner that allows robust testing of non-deterministic systems. To this effect, it requires developers to explicitly declare all non-determinism present in their code, after which they can use the tester to exercise (in the limit) all possible behaviors of a given test case. The tester understands the non-determinism that arises from concurrency between machines. It uses hooks into the runtime to control the scheduling of machines. There can be other forms of non-determinism in the code. For these, the framework exposes an API to generate unconstrained Boolean and integer values. We refer to this API as NonDet in the rest of this paper. It is the responsibility of the programmer to model the non-determinism in their code using this API. We illustrate this point using an example.
Consider building an application that requires running multiple machines distributed over a network. The developer is interested in testing the implementation against a lossy network, as the code must work correctly even if the network arbitrarily drops messages. The developer first writes a test that initializes all machines under a mocked distributed environment so that the code can execute in a single-process setting. This mocked environment can model the network to express its lossy behavior. Figure 2 shows an illustration for such a mock. The application (not shown) is designed against an interface of the network (INetworkingService), which is then mocked (MockNetworkingService) for testing purposes. The mocked method SendMessage calls NonDet to decide if it is going to deliver the event or not. When it must deliver, it directly addresses the target machine and delivers the event via a Send. Once external dependencies are substituted with mocks to make the test self-contained (as in standard unit-testing), the tester repeatedly executes the test, each time exploring a different interleaving of concurrent actions as well as resolving NonDet calls with different values.
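The lossy-network mock described above can be approximated as follows. This is a hedged Python sketch rather than the code of Figure 2: `NonDet` is modeled here as a seeded pseudo-random generator (an assumption; the real tester resolves choices according to its search strategy), and `MockNetworkingService`, `send_message` and `run_test` are hypothetical names.

```python
import random


class NonDet:
    """Stand-in for the tester-controlled choice API (assumed: a seeded PRNG)."""

    def __init__(self, seed):
        self._rng = random.Random(seed)

    def boolean(self):
        return self._rng.random() < 0.5


class MockNetworkingService:
    """Mock of a network interface that may drop any message."""

    def __init__(self, nondet):
        self.nondet = nondet

    def send_message(self, target_inbox, event):
        if self.nondet.boolean():
            target_inbox.append(event)  # deliver directly to the target
            return True
        return False                    # message silently dropped


def run_test(seed, message_count=20):
    """One test iteration: the seed fixes every non-deterministic choice."""
    net = MockNetworkingService(NonDet(seed))
    inbox = []
    for i in range(message_count):
        net.send_message(inbox, i)
    return inbox


# Repeated iterations explore different drop patterns; re-running a seed
# replays exactly the same pattern, which is what makes a failure reproducible.
explored = {tuple(run_test(seed)) for seed in range(50)}
```

Routing every choice through one controllable object is the design point: the application code never calls a random-number generator itself, so the tester fully determines the execution.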
The tester supports many state-of-the-art search strategies inspired by the systematic testing literature [4, 30, 9], and makes it easy to add new strategies as the research community comes up with new algorithms. By default, the tool is configured to execute a portfolio of search strategies in parallel to provide the best coverage to the user.
In addition, the framework provides support for writing functional specifications of the code. These specifications are written in the form of monitors. A monitor can only observe the execution of a program but cannot influence it. Syntactically, this means that it can receive messages from any machine, but cannot send messages. A monitor makes it easy to assert conditions that span multiple machines.
A monitor can also encode a liveness property to check if a system is making progress. A monitor can indicate its temperature as either hot or cold. Given such a monitor, the tester searches for an execution where the monitor remains in a hot state for an “infinite” amount of time (in reality using a heuristic) without transitioning to a cold state. For instance, consider a replication protocol that is required to maintain, say, three replicas of some data. The developer can write a monitor that turns hot when one replica fails (and the count falls below three) and turns cold when three replicas have come up. This monitor encodes the specification that the protocol eventually creates the required number of replicas, even in the presence of failures. Such a specification was previously used for finding bugs in a storage system. It is common for distributed systems to have liveness requirements.
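The replica example can be made concrete with a small sketch. It is written in Python for illustration and is not the framework's monitor API: `ReplicaMonitor`, the event names, and the step-count `threshold` used to approximate "infinitely long" are all assumptions (the text only says that the real tester uses a heuristic).

```python
class ReplicaMonitor:
    """Hypothetical liveness monitor: hot while fewer than three replicas exist."""

    REQUIRED = 3

    def __init__(self):
        self.replicas = self.REQUIRED
        self.hot_steps = 0  # consecutive observed steps spent in the hot state

    @property
    def hot(self):
        return self.replicas < self.REQUIRED

    def observe(self, event):
        # The monitor only observes events; it never sends anything back.
        if event == "replica_failed":
            self.replicas -= 1
        elif event == "replica_up":
            self.replicas += 1
        self.hot_steps = self.hot_steps + 1 if self.hot else 0


def violates_liveness(trace, threshold):
    """Assumed heuristic: flag a trace that stays hot for more than `threshold` steps."""
    monitor = ReplicaMonitor()
    for event in trace:
        monitor.observe(event)
        if monitor.hot_steps > threshold:
            return True
    return False


recovering = ["replica_failed", "tick", "replica_up", "tick"]
stuck = ["replica_failed"] + ["tick"] * 10
recovering_flagged = violates_liveness(recovering, threshold=5)
stuck_flagged = violates_liveness(stuck, threshold=5)
```

The `recovering` trace returns to a cold state after the replacement replica comes up, whereas `stuck` never does and is reported once the hot streak exceeds the threshold.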
To summarize, developing a system using the framework typically involves three activities. First, the system itself must be written using the framework's concurrency model. The runtime provides APIs to create machines and send events. Second, external dependencies must be mocked and all sources of non-determinism must be expressed via NonDet calls. Third, the user writes tests (exercising the system under a workload) and specifications for asserting correctness.
3 Implementing the PoolManager
This section outlines the design of the ABS PoolManager and how it is implemented using the framework. The goal is only to provide enough detail to convey the complexity of the service, justifying the need for the framework, and not to give an exhaustive account of the system. We believe the core reasons behind the complexity are common to many cloud systems.
PoolManager exposes APIs for creating a pool, resizing or deleting an existing pool, as well as canceling a previous resizing operation. We begin by explaining key external services that PoolManager relies on for implementing its functionality before getting into the PoolManager design.
3.1 External Services
PoolManager operations are naturally long-running because the creation or deletion of VMs takes time (on the order of seconds to a few minutes). If PoolManager fails while creating, say, a pool of size n after it has already allocated m VMs, then after restarting, it must resume the pending operation and allocate only the remaining n − m VMs. It is important to not lose track of previously allocated VMs (i.e., they must be part of some pool), else ABS risks allocating resources that will never be subsequently used.
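The resume-after-failure requirement can be illustrated with a toy sketch. It is written in Python under strong simplifying assumptions: the list `persisted_vms` stands in for durable storage, allocation and persistence of a VM are treated as a single atomic step (which a real service cannot assume), and `create_pool`, `Crash` and `crash_after` are hypothetical names.

```python
class Crash(Exception):
    """Simulated process failure in the middle of a long-running operation."""


def create_pool(target_size, persisted_vms, allocate_vm, crash_after=None):
    """Allocate VMs until the persisted record reaches the target size.

    Progress is derived from `persisted_vms`, so a restarted call picks up
    where the failed one stopped instead of starting from zero.
    """
    allocated_this_run = 0
    while len(persisted_vms) < target_size:
        if crash_after is not None and allocated_this_run == crash_after:
            raise Crash()
        persisted_vms.append(allocate_vm())
        allocated_this_run += 1
    return persisted_vms


vm_ids = iter(range(1000))
allocate = lambda: f"vm-{next(vm_ids)}"

storage = []  # survives the simulated crash
try:
    create_pool(5, storage, allocate, crash_after=2)
except Crash:
    pass
after_crash = list(storage)
create_pool(5, storage, allocate)  # failover: resume the pending operation
```

After the restart only the three missing VMs are requested, and the two VMs allocated before the crash remain part of the pool.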
Anticipating failures in the middle of an operation, the PoolManager records its progress using Azure Storage, a highly available and reliable cloud-scale storage system. Azure Storage offers a key-value storage interface. PoolManager uses REST APIs to read and write information about pools, VMs, jobs, tasks, quota management, etc., as entities (rows) in storage. Azure Storage also provides optimistic concurrency control using entity tags or ETags. These are metadata attached to each row. A client can do a conditional write to a row: the row is updated only if the user-provided ETag matches the current value in the row.
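The conditional-write rule can be sketched with an in-memory store. This is an illustrative Python model, not the Azure Storage API: `EntityStore` and its methods are hypothetical, and ETags are modeled as integers that increase on each update, whereas real ETags are opaque strings.

```python
class EntityStore:
    """Hypothetical key-value store with ETag-guarded writes."""

    def __init__(self):
        self._rows = {}  # key -> (etag, value)

    def insert(self, key, value):
        self._rows[key] = (1, value)
        return 1

    def read(self, key):
        return self._rows[key]  # returns (etag, value)

    def conditional_write(self, key, value, etag):
        """Update the row only if `etag` matches the row's current ETag.

        Returns the new ETag on success, or None if the caller's view is stale.
        """
        current_etag, _ = self._rows[key]
        if etag != current_etag:
            return None
        new_etag = current_etag + 1
        self._rows[key] = (new_etag, value)
        return new_etag


store = EntityStore()
etag = store.insert("pool-1", {"size": 3})

# Two writers both read the row at the same ETag; only the first write lands.
first = store.conditional_write("pool-1", {"size": 5}, etag)
second = store.conditional_write("pool-1", {"size": 9}, etag)
```

The second writer must re-read the row and decide how to proceed, which is how concurrent updates to the same entity are detected rather than silently overwritten.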
All Azure resources that a customer allocates belong to a subscription, which is a billing entity. Subscriptions, which are managed by the Azure Subscriptions service, can contain accounts for services such as Azure Batch and Azure Storage. Azure imposes limits on how many operations a subscription can perform on resources. Similarly, there are limits on the number of cores one can allocate via VMs in a single subscription. The provided limits are too restrictive for running ABS workloads. As ABS is built on public Azure services and needs to allocate resources, it has to own a set of subscriptions (with fairly large limits) and use them to manage resources. By spreading resources across many internal subscriptions, ABS scales beyond what can be achieved with a single subscription.
ABS uses VMSS to allocate VMs for creating a pool. VMSS offers a service for allocating a collection of VMs, which ABS further wraps into the concept of a pool, for the following reasons:
ABS ties scheduling with resource provisioning. If ABS has a pending request to shrink a pool, and a VM finishes running a task, then ABS proceeds to collect and free the VM. This tight coupling between scheduling and resource provisioning is an important value proposition of ABS.
VMSS imposes VM creation limits per subscription. ABS pools can be much larger than these limits. VMSS also limits the number of operations per subscription. ABS spreads out pool creations across many subscriptions to speed up deployment.
ABS supports pool operations such as stop-resize. This is not supported by VMSS. When a user issues a stop-resize operation, ABS moves the corresponding VMSS operations to the background (the customer is not charged) and deletes the extra VMs allocated. The stop operation offered by ABS allows customers to respond to compute demand more quickly.
The DeploymentManager (which is also written using the framework) is a microservice component of ABS that interfaces with VMSS. PoolManager uses the DeploymentManager service to create, grow, shrink and delete individual VMSS collections, also called deployments in the rest of the paper.
3.2 PoolManager Design
Central to the PoolManager is the need to manage quotas. Each user has a quota on the maximum number of VMs they can allocate for running their jobs. Further, ABS internally manages multiple Azure subscriptions, each one tied to one region, which limits the number of VMs that can be allocated in that region. Thus, a VM can be allocated for a user only when the user-quota has not been exceeded, and there is some subscription (in some region) whose quota has not been exceeded.
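The two-level quota check can be sketched as a small planning function. This is a Python illustration under assumed names and an assumed greedy policy; the text does not say how the real QuotaManager chooses among subscriptions.

```python
def plan_allocation(requested, user_remaining, subscription_remaining):
    """Return {subscription: vm_count} covering `requested` VMs, or None.

    `user_remaining` is the user's unused quota; `subscription_remaining`
    maps each internal subscription to its unused VM quota. The greedy
    fill order is an assumption made for this sketch.
    """
    if requested > user_remaining:
        return None  # the user quota would be exceeded
    plan = {}
    left = requested
    for name, free in subscription_remaining.items():
        if left == 0:
            break
        take = min(left, free)
        if take > 0:
            plan[name] = take
            left -= take
    return plan if left == 0 else None  # None: subscriptions lack capacity


spread = plan_allocation(5, 10, {"sub-a": 3, "sub-b": 4})
over_user_quota = plan_allocation(5, 4, {"sub-a": 3, "sub-b": 4})
over_subscription_quota = plan_allocation(10, 10, {"sub-a": 3, "sub-b": 4})
```

A request succeeds only when both conditions hold: the user has quota left, and the internal subscriptions together can host the VMs.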
In addition to pool operations such as create and resize, PoolManager can recover VMs that are determined to be unhealthy. The unhealthy signal comes from the ABS scheduler and the PoolManager, in response, deletes the unhealthy VM (by signalling to VMSS) and allocates a new one.
The PoolManager implementation has over types of machines, totalling over K lines of code. Each machine is designed to manage the lifetime of a particular resource or to execute a workflow that implements a sub-operation. For example, the Pool machine manages a single pool, the PoolServer machine manages a collection of Pool machines, the PoolFlow machine allocates resources for a pool, the Deployment machine manages a single deployment, the Account machine manages a single account, the QuotaManager machine manages quota requirements and decides how to allocate across subscriptions.
3.3 PoolManager Operations
To create a pool, PoolManager goes through the following steps: (1) persists the pool properties and puts the pool in resizing state; (2) checks if enough quota is available and tentatively reserves it; (3) allocates resources and creates VMs to match the required pool size; (4) persists deployment and VM information; (5) informs the scheduler about the created VMs so that it can start scheduling job tasks on the VMs; (6) commits the revised leftover quota; and (7) updates the pool properties to the final count of resources and puts the pool in steady state. An operation is deemed completed once the pool reaches steady state.
Figure 3 shows the workflow that implements this operation to highlight its complexity. Each vertical line corresponds to a machine. Arrows represent exchange of events between machines. The AzureStorage machine wraps calls to the Azure Storage service. Arrows to AzureStorage represent a read or write of persistent data. The DeploymentManager machine wraps VMSS, and arrows to this machine represent VMSS operations.
It is important to note that the workflow is a simplified linear view of the PoolManager execution. The reality is even more complex for two reasons. First, the workflow executes in parallel (all machines are running concurrently), so their responses can arrive in different orders. Second, error-handling code is pervasive. Each operation, especially when interacting with external services, can return an error code (or time out), which must be handled appropriately. For these reasons, the asynchronous programming model of the framework fits naturally with the PoolManager requirements: the machines send out requests and field responses as they arrive asynchronously, instead of blocking each time for a response.
A resize-pool operation is similar to creating a pool, except that it may have to grow or shrink existing deployments, in addition to creating new ones. The resize-pool workflow goes through the following steps: (1) persists the new target values and puts the pool in resizing state; (2) checks quotas, if the resize involves growing the pool; (3) for grow operations, allocates resources and creates or grows deployments, then persists the updated information; (4) for shrink operations, works with scheduler instances to identify VMs that can be deleted; (5) commits the revised quota; and (6) updates pool properties to reflect final counts and puts the pool in steady state.
The PoolManager allows only one resize or delete operation per pool at a time. To stop this operation, the user can issue a cancellation that goes through the following steps: (1) stops operations that were creating or resizing deployments; (2) updates the pool size to the previous size plus (minus) any deployments whose creation (deletion) has already committed; and (3) puts the pool in steady state.
After a cancellation is carried out, there may be deployments with extra VMs that have not been freed. These are termed rogue VMs. After the pool goes to steady state, a stabilize deployment operation starts in the background that asynchronously removes any rogue VMs. This operation first identifies such deployments and puts them in a stabilizing state (subsequent resize operations skip deployments that are in this state). It then waits for pending operations on the deployment to complete before issuing fresh operations to remove the rogue VMs. The stabilize-deployment operation also needs to persist progress to storage; on failover, the stabilization is resumed.
An additional requirement is to limit the total number of rogue VMs across all deployments. The QuotaManager machine implements this logic: it aggregates the rogue-VM count across all its deployments. If this number exceeds a threshold, the PoolManager stops new cancellation requests until the rogue-VM count comes down.
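The throttling rule can be sketched as follows. This is a Python illustration with hypothetical names (`RogueVmTracker`, `report`, `accepts_cancellation`) and an arbitrary threshold of 10; the actual threshold and bookkeeping of the QuotaManager machine are not given in the text.

```python
class RogueVmTracker:
    """Hypothetical aggregate of rogue-VM counts across deployments."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.rogue_by_deployment = {}

    def report(self, deployment, rogue_count):
        # Latest known number of rogue VMs in one deployment.
        self.rogue_by_deployment[deployment] = rogue_count

    def total(self):
        return sum(self.rogue_by_deployment.values())

    def accepts_cancellation(self):
        # New cancellations are refused while the total exceeds the threshold.
        return self.total() <= self.threshold


tracker = RogueVmTracker(threshold=10)
tracker.report("dep-a", 6)
accepted_before = tracker.accepts_cancellation()  # total is 6
tracker.report("dep-b", 7)
accepted_during = tracker.accepts_cancellation()  # total is 13
tracker.report("dep-a", 0)                        # stabilization freed dep-a
accepted_after = tracker.accepts_cancellation()   # total is 7
```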
4 Testing the PoolManager
As illustrated in §3, PoolManager involves multiple different operations that can be running concurrently at any point in time. Furthermore, PoolManager has to deal with failures that can happen unexpectedly, and has to correctly resume all pending operations after a failover. Testing such a complex system is where using the framework truly makes a difference.
4.1 PoolManager Specifications
Any testing effort with the framework must start by writing a specification that defines all valid system behaviors. The PoolManager specification is around LoC and is written as a monitor (§2.2) that captures the following properties:
For a given pool, if the last client operation was a resize to size n, then the pool eventually reaches steady state with size n.
For a given pool, if the last client operation was a delete, then the pool is eventually removed and all its allocated VMs are returned back to VMSS.
For a given pool, all pending stabilization operations must eventually complete.
For a given pool, all pending delete operations must eventually complete.
For a given pool, all pending recovery operations must eventually complete.
In steady state, the state of PoolManager is in sync with VMSS. In particular, if the PoolManager believes that a pool has certain VMs, then these VMs have indeed been allocated by VMSS to the PoolManager.
For every successful create-pool request, a pool entry is created in Azure Storage.
For every successful resize-pool request, the pool target matches what is requested in the corresponding Azure Storage entry.
For every successful delete-pool request, the pool entry is deleted from Azure Storage.
For every pool, every deployment enumerated in the Azure Storage pool entry is present in the Azure Storage deployment entry.
Every deployment in the Azure Storage deployment entry belongs to a pool.
Every VM in the Azure Storage VM entry belongs to a pool.
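The last three storage-consistency properties in the list amount to referential-integrity checks over the stored entries. The following Python sketch shows one way such a check could look; the dictionary layout and the function name `check_storage_invariants` are assumptions made for illustration, not the schema used by ABS.

```python
def check_storage_invariants(pools, deployments, vms):
    """Return a list of violated invariants (empty when storage is consistent).

    Assumed layout:
      pools:       pool id -> list of deployment ids named in the pool entry
      deployments: deployment id -> owning pool id
      vms:         VM id -> owning pool id
    """
    violations = []
    for pool_id, deployment_ids in pools.items():
        for dep_id in deployment_ids:
            if dep_id not in deployments:
                violations.append(f"pool {pool_id} lists missing deployment {dep_id}")
    for dep_id, pool_id in deployments.items():
        if pool_id not in pools:
            violations.append(f"deployment {dep_id} belongs to no pool")
    for vm_id, pool_id in vms.items():
        if pool_id not in pools:
            violations.append(f"VM {vm_id} belongs to no pool")
    return violations


consistent = check_storage_invariants(
    pools={"p1": ["d1"]},
    deployments={"d1": "p1"},
    vms={"v1": "p1"},
)
broken = check_storage_invariants(
    pools={"p1": ["d1", "d2"]},
    deployments={"d1": "p1", "d3": "p9"},
    vms={"v1": "p1", "v2": "p7"},
)
```

In a test run, a check of this kind would be evaluated whenever the system reaches a quiescent point, including after every simulated failover.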
Importantly, the above properties must hold even after a failover. Checking the PoolManager against this specification, especially to get coverage of corner-case behaviors, is a challenging task for several reasons. For instance:
The PoolManager is a concurrent program; its various operations may interleave in many different ways.
The PoolManager interacts with multiple external services and it must be able to handle any valid response from those services. Responses that return error codes (e.g., failure to write to storage) happen rarely, thus are hard to cover during testing. Interactions with external services may time out. The dependence on time further creates testing coverage issues.
Failures are non-deterministic events that may happen at any time. Failure-injection tools are often hard to set up and control.
The specification requires consistency between the PoolManager state (including its in-memory state as well as state stored in Azure Storage) and VMSS. Writing such an assertion can be very cumbersome with traditional means because it spans multiple services.
The specification is a liveness property that requires the pool to eventually reach steady state. Executions that get stuck in a loop without making progress are violations of this property and hard to capture using plain assertions.
The testing methodology helped to overcome these challenges. To the best of our knowledge, this is the first time a formal specification was continuously tested against production code during its development.
4.2 Mocking External Dependencies
As mentioned in §2, using the framework requires writing mocks of external services as well as all external sources of non-determinism. The ABS engineering team wrote mocks for Azure Storage, VMSS, the ABS scheduler, and a basic system timer (used for encoding timeouts).
Mock Azure Storage
Writing a mock for Azure Storage was easy. It consists of roughly lines of code. The mock has no internal concurrency (i.e., it executes sequentially) but makes non-deterministic choices to expose various error modes. Figure 4 shows a simple illustration of the mock with a Write operation. The entire store is modeled as an in-memory dictionary (store). The Write operation can, for instance, either succeed and write to storage, or it may return one of several error codes. It can return an error code even after writing to the store successfully: a possibility that can indeed happen with the real Azure Storage service. The mock also implements the ETag matching logic (§3.1).
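The behavior described above can be sketched in a few lines of Python. This is an illustration only and not the code of Figure 4: the class name `MockStorage`, the outcome labels, the integer ETags and the `choose` hook are all assumptions. Under a systematic tester, `choose` would be driven by the tester's controlled non-deterministic choice rather than by `random.choice`.

```python
import random


class MockStorage:
    """Sequential in-memory stand-in for a key-value storage service."""

    # Possible results of a write; which one occurs is a non-deterministic choice.
    OUTCOMES = ("ok", "error_before_write", "error_after_write")

    def __init__(self, choose=random.choice):
        self.choose = choose      # hook that picks one outcome from a sequence
        self.store = {}           # key -> (etag, value)
        self.next_etag = 1

    def write(self, key, value, etag=None):
        # ETag matching: a conditional write is rejected when the caller's
        # ETag does not match the stored one.
        current = self.store.get(key)
        if etag is not None and (current is None or current[0] != etag):
            return "ETAG_MISMATCH"

        outcome = self.choose(self.OUTCOMES)
        if outcome == "error_before_write":
            return "ERROR"        # nothing was written

        self.store[key] = (self.next_etag, value)
        self.next_etag += 1
        # The write may have taken effect even though an error is reported.
        return "OK" if outcome == "ok" else "ERROR"
```

The `error_after_write` outcome models the case where the caller sees an error although the data was stored, which forces the client code to re-read state instead of assuming the write was lost.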
The mock for DeploymentManager, which wraps VMSS, is around LoC. This service has to handle requests for creating, growing and deleting deployments, as well as for deleting specific VMs. The mock uses an in-memory dictionary to track the deployments and the VM instance names. The mock non-deterministically returns HTTP errors including timeouts. Like the real VMSS service, the mock supports multiple levels of failure (e.g., at the operation or HTTP level). One key requirement was to ensure that the mock respects idempotency when the real service guaranteed it. For example, once the mock returns success for an operation, then it has to return success if the same operation request is issued again.
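The idempotency requirement can be illustrated with a small Python sketch. The names `MockDeploymentManager`, `request_id` and `choose_fail` are hypothetical, and the single `HTTP_TIMEOUT` error stands in for the several failure levels mentioned above; the real mock is not shown in this section.

```python
class MockDeploymentManager:
    """In-memory mock that may fail a request, but never 'un-succeeds' one."""

    def __init__(self, choose_fail):
        self.choose_fail = choose_fail   # hook returning True to inject a failure
        self.deployments = {}            # deployment name -> set of VM names
        self.succeeded = set()           # request ids already answered with success

    def create_deployment(self, request_id, name, vm_count):
        # Idempotency: a request that already succeeded succeeds again,
        # without consulting the failure hook.
        if request_id in self.succeeded:
            return "OK"
        if self.choose_fail():
            return "HTTP_TIMEOUT"
        self.deployments[name] = {f"{name}-vm{i}" for i in range(vm_count)}
        self.succeeded.add(request_id)
        return "OK"
```

A caller that retries after a timeout therefore either gets the operation applied once or keeps receiving errors; it never observes a success followed by a failure for the same request.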
The mock scheduler is roughly LoC and mimics the ABS scheduler. The job of the scheduler is to schedule tasks onto VMs. From the perspective of the PoolManager, the scheduler has to handle requests for adding or removing VMs and getting VMs that need recovery. The mock scheduler has internal data structures that track pools and VMs (essentially, a struct with multiple fields). The mock scheduler can non-deterministically return failures or remove a subset of VMs for a remove-VM request. The mock is required to update the corresponding Azure Storage entries as it accepts or removes VMs. This helps expose possible race conditions in the PoolManager, when a VM that is picked for recovery is also removed during a shrink operation in the PoolManager. (The ETag logic of Azure Storage helps discover such races.)
All timers used in the PoolManager were also mocked. ABS engineers wrote an abstract Timer class. During testing, Timer is implemented using a non-deterministic choice that can fire the timeout at any point. This helps abstract away time, expressing the fact that correctness of the service does not rely on the particular timeout values chosen. During production runs, an actual system timer is used for implementing Timer. Therefore, timeout values can be freely manipulated in the production service in order to optimize performance.
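A minimal Python sketch of this timer abstraction follows. The class and method names (`Timer`, `SystemTimer`, `NondeterministicTimer`, `step`) and the `choose_bool` hook are assumptions for illustration; the ABS team's Timer class is not reproduced here.

```python
import threading
from abc import ABC, abstractmethod


class Timer(ABC):
    """Abstract timer: the service only depends on this interface."""

    @abstractmethod
    def start(self, timeout_seconds, on_timeout):
        ...


class SystemTimer(Timer):
    """Production implementation backed by a real wall-clock timer."""

    def start(self, timeout_seconds, on_timeout):
        timer = threading.Timer(timeout_seconds, on_timeout)
        timer.daemon = True
        timer.start()
        return timer


class NondeterministicTimer(Timer):
    """Test implementation: a pending timeout may fire at any step."""

    def __init__(self, choose_bool):
        self.choose_bool = choose_bool   # hook controlled by the test harness
        self.pending = []

    def start(self, timeout_seconds, on_timeout):
        # The timeout value is deliberately ignored: correctness must not
        # depend on the particular value chosen.
        self.pending.append(on_timeout)

    def step(self):
        """Called by the harness at each step; fires any subset of timeouts."""
        pending, self.pending = self.pending, []
        for callback in pending:
            if self.choose_bool():
                callback()
            else:
                self.pending.append(callback)
```

Since the test implementation never sleeps, a timeout of one hour costs the same as a timeout of one millisecond during testing.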
As opposed to testing the PoolManager against production services (e.g., Azure Storage and VMSS), creating mocks is an additional engineering cost that one must pay for using P#. However, it also has several advantages. First, it clearly lays down the assumptions that the PoolManager makes about these external services. Any deviation from reality can be captured and fixed in the mocks in order to avoid further regressions. Second, all failure modes are made explicit in the mocks using non-deterministic choices. This provides the P# tester with the hooks necessary to explore rare or exceptional behaviors. Third, the state stored in the external services is captured in the (in-memory) mocks. Thus, asserting consistency between multiple distributed services is much easier. Furthermore, the use of mocks allows testing to be fast: there is no need to wait on wall-clock time in order to fire timers, no need to go over the network to talk to external services, no need to write to disk to survive (hard) failures, etc.
ABS has various tests that each exercise different APIs of the PoolManager. All these tests work against the same monitor that captures the specification of the system (§4.1). It is then up to the tester to find a violation of the safety and liveness properties specified by the monitor. The tester also checks for any assertions embedded inside the code, as well as uncaught exceptions.
4.3 Failover Testing Technique
The PoolManager failover logic is checked programmatically by mocking failures themselves. The tests and specifications do not change. The ABS team modeled failures by creating an event called Terminate (from the perspective of P#, this is just a user-defined event with no special meaning). Figure 5 provides an illustration of the PoolManager execution when a Terminate event is injected by the test harness during a CreatePool operation. When a machine in the PoolManager receives a Terminate event (at an arbitrary point), it forwards the event to its child machines (forwarding of Terminate is not shown in the figure for simplicity), waits for them to send a response and, once all responses are received, halts itself. This way, sending a Terminate event to the top-level machine (PoolServer) ends up terminating the entire PoolManager. The Terminate event is only forwarded to the PoolManager machines (in red), not to the machines that mock external services (because the test is for the PoolManager failover logic). The failure injection, i.e., the action of sending a Terminate event, is non-deterministic, thus the tester will provide coverage by exploring many different possibilities.
After all of the PoolManager machines halt, the test harness restarts the service by re-creating the PoolServer machine. When the PoolServer starts, it will read its state from storage, where it will find the state from before the failure (because the mock Azure Storage machine survived the “failure”). Thus, the failover logic—the same logic written to handle real failures in production—kicks in and the PoolManager resumes the CreatePool operation.
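The inject-then-restart loop can be sketched as follows. This Python example is a simplification under stated assumptions: a single `PoolServer` object replaces the machine hierarchy (so Terminate forwarding is omitted), CreatePool is reduced to a hypothetical list of `STEPS`, a plain dictionary plays the role of the mock storage that survives the failure, and at most one failure is injected per test.

```python
STEPS = ["reserve_quota", "create_deployment", "register_vms", "mark_steady"]


class PoolServer:
    def __init__(self, storage):
        self.storage = storage    # outlives the server, like the mock storage
        self.halted = False

    def terminate(self):
        self.halted = True

    def create_pool(self, maybe_fail):
        """Run (or resume) CreatePool; return False if terminated midway."""
        done = self.storage.setdefault("done", [])
        # Failover logic: skip the steps already recorded in storage.
        for step in STEPS[len(done):]:
            if maybe_fail():
                self.terminate()
                return False
            done.append(step)
        return True


def run_failover_test(choose_bool):
    """Drive CreatePool to completion, injecting at most one failure."""
    storage = {}
    injected = False

    def maybe_fail():
        nonlocal injected
        if not injected and choose_bool():
            injected = True
            return True
        return False

    server = PoolServer(storage)
    while not server.create_pool(maybe_fail):
        server = PoolServer(storage)   # restart: state is re-read from storage

    assert storage["done"] == STEPS    # the operation completes despite the failure
    return injected
```

Here `choose_bool` stands in for the tester's non-deterministic choice, so each test iteration may place the failure at a different step.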
For the most part, the relationship of a machine to its child machines is obvious and follows the creation hierarchy: if one machine created another, then the created machine is the creator's child. In some cases, this is more involved, especially when machines (legally) halt. In this case, if a machine wishes to halt, then it must first delegate the responsibility of terminating its child machines to some other machine. This was done using custom logic that the ABS team designed for the PoolManager. The same termination code is also used for legal teardown of a PoolManager instance, so the team did not see this effort as a test-only overhead.
A key advantage of this technique is that failovers are simply tested at the level of program semantics. It does not require an actual setup with hard failure injection that must crash and restart the process. The complete engineering of the tests simply becomes a programming activity. It is then much easier for a developer to control and observe failover coverage. For instance, a few lines of code are enough to limit failover testing to one particular region of code. There is no need to resort to fuzzing, failure-injection tools or a stress-test environment. Debugging is much easier as well because the programmer is given a fully-replayable trace by the tester, consisting of the actions taken by the system both before and after the failure injection.
5 Improvements in using P#
Supporting the development of production services required several engineering enhancements to improve the programming, testing and debugging experience of using P#.
The C# language (which P# extends) includes the async and await keywords that make it easy to write asynchronous code. Awaiting on an async call packages the current task as a continuation and releases it from the executing thread, so that other tasks can be scheduled. We enhanced P# to allow machine handlers to be async: this, in turn, allows handlers to call async APIs and await on their result without blocking the underlying thread, enabling other machines to be scheduled.
The P# programming model discourages sharing objects between machines. Message transfers are the only way for machines to synchronize, which can be cumbersome for some tasks, compared to other forms of concurrency. For example, consider the PoolManager task of maintaining the total rogue VM count (§3.3). To maintain this count, the QuotaManager machine must communicate with all Deployment machines in the system and aggregate their individual counts. Getting a count from each machine requires sending an event to it, waiting for a response, and defining a handler for the response. This not only increases the programming burden but is also inefficient for a simple task such as aggregating counts.
To remedy this situation, we developed a library (https://github.com/p-org/PSharp/tree/master/Source/SharedObjects) that allows a machine to create a shared object and freely pass its reference to other machines. These shared objects expose a linearizable interface so that multiple machines can issue operations on the (same) shared object without concurrency issues. In production, shared objects are implemented using an efficient lock-free data structure. When running under the tester, they automatically resort to using machines with message transfers so that the tester does not have to understand any additional form of synchronization. We implemented shared objects for common types such as counters and dictionaries. The design of shared objects showcases the power of mocking: the P# programming model is not in conflict with low-level or efficient concurrent programming of any form; it simply requires that any concurrency outside of P# be mocked for testing.
Even the smallest of programs can have an astronomically large state space. There are several search strategies developed in the research community that target finding common bug patterns fast. Many of these strategies have complementary strengths [9, 37]. The P# tester includes multiple search strategies and makes it easy to include new ones. We enhanced the tester to run a portfolio of search strategies in parallel so that engineers, who are likely unaware of these search algorithms, do not have to worry about making the choice.
We further enhanced the tester to parallelize it on the cloud. We used ABS itself: one can create a pool of VMs and run the tester in parallel on each VM. P# testing parallelizes easily: each instance of the tester simply runs a different search strategy (with different parameters and seeds). The typical requirement for many teams was to run the tester for iterations, which could be easily achieved on a developer laptop in a matter of minutes. However, occasionally teams chose to run millions of iterations, for which cloud-scale testing was important.
When the tester finds a bug, it generates a trace file consisting of all scheduling decisions as well as non-deterministic choices that it made. This trace can be fed back to the tester, in which case it reproduces the same sequence of choices. To improve the debugging experience, we enhanced the tester by allowing it to attach a debugger when replaying a trace. Deterministically reproducing reported bugs in a concurrent and non-deterministic system, while being able to set breakpoints and step into the code, was a key value addition that has been appreciated by all developers who have used P# so far.
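The record-and-replay idea can be illustrated with a small Python sketch. It is not the P# tester: `ChoiceSource`, `racy_program` and `find_bug` are hypothetical names, the search is plain random scheduling with one seed per iteration, and the program under test is a toy with two workers that each perform a non-atomic increment (a read followed by a write).

```python
import random


class ChoiceSource:
    """Supplies scheduling choices, recording them or replaying a trace."""

    def __init__(self, seed=None, replay=None):
        self.rng = random.Random(seed)
        self.replay = list(replay) if replay is not None else None
        self.trace = []

    def next_int(self, bound):
        if self.replay is not None:
            value = self.replay[len(self.trace)]   # follow the recorded trace
        else:
            value = self.rng.randrange(bound)      # explore a fresh choice
        self.trace.append(value)
        return value


def racy_program(choices):
    """Two workers increment a shared counter non-atomically."""
    shared = 0
    pending = {0: ["read", "write"], 1: ["read", "write"]}
    local = {}
    while any(pending.values()):
        runnable = [w for w in pending if pending[w]]
        worker = runnable[choices.next_int(len(runnable))]   # scheduling decision
        op = pending[worker].pop(0)
        if op == "read":
            local[worker] = shared
        else:
            shared = local[worker] + 1
    return shared   # 2 if no update was lost


def find_bug(max_iterations):
    """Run iterations with different seeds; return the trace of a failing run."""
    for seed in range(max_iterations):
        choices = ChoiceSource(seed=seed)
        if racy_program(choices) != 2:
            return choices.trace
    return None
```

Feeding the returned trace back through `ChoiceSource(replay=trace)` reproduces the same interleaving on every run, which is what makes it practical to attach a debugger to a failing execution.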
6 Experience with using P# in production
As discussed in §1, the positive experience of the ABS team using P# invited attention from other teams in Azure. Besides PoolManager, there are eight other services built with P# that are currently live in production, with several more in the planning stage. Many of these services share common characteristics with PoolManager: asynchronously arriving requests that must be processed concurrently in a non-blocking fashion, multiple distributed data sources that must be kept consistent with each other, and interaction with several other services.
Each team echoed two key advantages of using P#: the actor programming model (that P# machines are based on) allowed them to implement a service at a higher level of abstraction, resulting in a cleaner design and code that is easier to maintain, extend and explain to new team members; and the high-coverage testing allowed them to exercise many corner cases and find several high-severity bugs before deployment. The rest of this section summarizes the experience of all these teams in using P# to design, implement and test their cloud services.
Benefits in design and implementation
Several teams reported that using the actor programming model helped them implement services that closely match their initial high-level (whiteboard) design. A senior Azure engineer who used P# gave us the following feedback: “the design maps very closely to the actual code, usually I see a much bigger delta between design and finished product”. Closing the gap between design and implementation allowed teams to easily create diagrams such as the one in Figure 3, which provide not only a detailed understanding of the workflow implementing each operation, but also the expected communication between machines. These diagrams were useful in explaining the design to other team members. Other benefits of using the actor-based approach of P# include:
The events sent between machines have to be clearly defined in the implementation. Further, no data can be shared between machines unless explicitly sent via an event. Both of these helped improve code abstraction and readability.
P# code is naturally non-blocking (asynchronous), so there is no need for explicitly locking resources. Further, there is no need to manage a thread pool, as this is done automatically by the runtime. These benefits together were a welcome relief over typical multi-threaded code with threads and locks.
Machines are lightweight and event-based, which can lead to significant performance gains. One of the teams reported that their previous design relied on polling, which consumed too many CPU cycles. After rewriting their service in P#, the team was able to write fully reactive code that allowed them to scale to much larger workloads. For instance, they were able to hold up to machines before seeing the CPU reach 80% utilization.
It is worth noting that P# was not perceived to be an “arcane” technology only used and understood by the most senior engineers. A junior engineer who recently joined one of the Azure teams using P# said: “being a new developer to the team, one of the first few things I worked on was P#, it was really quick to onboard and writing actual code is simple and straightforward”.
The importance of mocking
Cloud services typically operate by communicating with their environment, which can consist of other services, as well as resources such as network and storage. To simplify the development process, teams would initially create interfaces for all external dependencies of their service, and then provide simple mock implementations of these interfaces (e.g., the ABS team created a mock for Azure Storage as seen in Figure 4). Importantly, mocks allow developers to express nondeterministic behavior that can be controlled during testing. A large part of the development of each service was done against these mocks in a test environment.
The efficacy of testing relies on how closely mocks model the real behavior. Any deviation can lead to missed bugs (when mocks do not exercise some possible behavior) or even false alarms (when mocks exhibit some behavior that is not possible in reality). Interestingly, each time there was an issue in production that was not found by the tester, it turned out to be either due to a missing test (some workload was not exercised by the tests) or an incomplete mock. It was never the case that the tester could have found the bug, but missed it because of lack of coverage. In these cases, developers would add more tests or patch the mocks. The teams knew about these tradeoffs before deciding to use P#. Maintaining the mocks was an iterative process and deviations were fixed over time as they were noticed. The initial mocks simply followed the available online documentation and gradually became more detailed, and thus more effective. This pay-as-you-go model of writing mocks was important to avoid front-loading the implementation with mocking effort.
No team reported mocking to be a burden, not only because they saw the value that these mocks unlocked, but also because mocking is a common engineering exercise, even without P#. Some of the services that were mocked included: Azure Service Bus, Azure Cosmos DB, Azure Storage, various networking services and resource providers. There was sharing of mocks between teams, but each team ended up owning its own mocks so they could customize them in ways most relevant for their service.
To illustrate one example, the Azure blockchain team built a service using P# that is designed to hide the complexity of blockchains from users. It deals with issues of submitting transactions exactly once, hiding forks and rollbacks from users, etc. In order to test this service, the blockchain team wrote a mock of a blockchain network itself that nondeterministically created forks and rollbacks. The ability to author tests that exercise such scenarios was very valuable to the team.
The value of systematic testing
Teams typically focused on only writing end-to-end tests for their services, as opposed to writing unit-tests for individual machines. This was enabled by the fact that systematic testing can deal with concurrency and all declared sources of non-determinism. The testing process involves the tester executing a test repeatedly from start to completion (what we call an iteration), each time exploring execution paths using some specified strategy. Without this support, engineers would need to write many more small unit-tests that exercise individual components of their code along with custom assertions for each test.
A common specification among services was to validate liveness properties: each service asserted that it would eventually accomplish the client request, even in the presence of failures. P# enabled writing such end-to-end specifications just once and reusing them for all relevant tests. The tests themselves only vary in the client workload they execute. Teams reported that writing an end-to-end specification was much more concise than having a multitude of small tests. Further, tests would typically run much faster than stress testing or complex simulations. For example, exercising failover in PoolManager (§4.3) takes approximately two minutes for iterations on an Intel Core i7 laptop with 4 cores and 16GB RAM.
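To make the notion of a reusable liveness specification more concrete, the following Python sketch shows a monitor that is "hot" while a client request is outstanding. The names (`LivenessMonitor`, `on_request`, `on_completed`, `on_step`) are hypothetical, and flagging a violation once the monitor has stayed hot for more than a fixed number of steps is an assumption made for this example, a bounded approximation of "eventually" rather than a description of how the P# tester decides liveness.

```python
class LivenessMonitor:
    """Tracks whether an outstanding request is eventually completed."""

    def __init__(self):
        self.hot = False        # True while a request is still outstanding
        self.steps_hot = 0      # consecutive steps spent in the hot state

    def on_request(self):
        self.hot = True
        self.steps_hot = 0

    def on_completed(self):
        self.hot = False
        self.steps_hot = 0

    def on_step(self):
        """Called by the test harness after every step of the system."""
        if self.hot:
            self.steps_hot += 1

    def violated(self, threshold):
        # Bounded approximation: staying hot for too long is reported as a
        # potential liveness violation (e.g., a loop that makes no progress).
        return self.hot and self.steps_hot > threshold
```

The same monitor instance can be attached to every test; only the client workload that triggers `on_request` and `on_completed` differs from test to test.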
Developers frequently executed tests to validate safety and liveness specifications as they made code changes, ensuring that the implementation never regressed. Some teams reported that the tester helped them find several high-severity bugs before deployment that would have been hard to find using conventional means, and would have resulted in loss of business if they occurred in production. In the words of a service architect, using P# they “found several issues early in the dev process, this sort of issues that would usually bleed through into production and become very expensive to fix later”.
Once the service code was reasonably functional in a test environment, a team would re-implement the mocked interfaces (as discussed above) to communicate with the actual external components. Then the code (the same code that was systematically tested) was deployed using this implementation of the interfaces. As discussed in §1, teams reported that once a feature was tested with P#, it would just work when put into production.
7 Related Work
Systematic testing tools
The research community has long been interested in finding bugs in distributed systems. Previous work showcased how systematic testing (ST) tools and search techniques can successfully find deep concurrency bugs [32, 31, 41, 34, 19]. Many of these search techniques have been adapted almost directly by P#. However, it is critical to consider the manner in which these techniques are exposed to a user.
In order to reduce user effort, ST tools have mostly targeted existing systems without modification. This approach requires ST tools to take over all sources of non-determinism in these systems. Obtaining such a level of control is difficult because the API surface of such systems can be very broad. For example, Chess targeted the testing of concurrent multi-threaded programs on Windows. To control thread interleaving, Chess had to interpose at the Win32 API level (via stubs) and reliably identify all sources of thread synchronization. It was necessary to get these stubs right, without which the tool would be flaky or even deadlock, leading to user frustration. The effort required to maintain these stubs was too large, and Chess went out of support without seeing user adoption, even though it found numerous tough bugs. Instead of targeting unmodified systems, P# spells out how a new system must be built from the outset. P# concurrency is simple to control: it is only about machine creation and message passing. Any use of external nondeterminism must be mocked: an exercise that is much easier for a programmer who controls the design of their code.
Systems-level imposition can also be slow. For example, SAMC takes roughly hours for doing test iterations of Cassandra because it must bring up the actual database in each iteration and inject failures via actual system crashes. The use of mocks to model real-world interactions and failures offers speed: roughly two minutes for iterations of PoolManager with P# (§6).
Researchers have argued for principled design of distributed systems through the use of modeling languages such as TLA+ or Promela. TLA+ has been extensively used to model and specify distributed protocols and algorithms. One can apply inductive reasoning to these models (which is harder) or use push-button model checkers (which is easy). Modeling languages are useful for validating the high-level design of a system. However, they do not help with the actual implementation. As new features are added, the implementation often diverges from the initial design that was modeled. P# bridges the gap between design and implementation, and what you test is what you execute (§6).
Formally verified systems
Recent research efforts have focused on developing formally-verified systems [40, 13]. Building such systems involves using a high-level language that can generate executable code, as well as contain logical assertions that mark inductive system invariants. The inductiveness checks, as well as the check that the invariants imply the system specification, are all discharged by a theorem prover to establish the proof of correctness. Examples of such systems include: crash-tolerant file systems, simple operating systems, distributed key-value stores and protocols [40, 3]. Although this line of research is exciting, all of these systems have been developed in academic settings. Inductive reasoning requires a deep understanding of formal logic, which is outside the scope of the education that most software developers receive. This constitutes the biggest bottleneck for adoption of these practices in the industry.
P# removes the need for theorem proving. Developers must write specifications of their program to find bugs, but there is no need for inductive reasoning. At most, one must learn the concept of liveness properties, using the notion of hot monitor states (§2.2). The emphasis of P# on programmability allows engineering teams to use it effectively without having a researcher in the loop. This does imply that guarantees are not as strong as full verification: testing is still an argument of coverage. However, as this paper shows, P# testing has (so far) not missed a bug due to lack of coverage (§6).
Previous work on P and P#
Previous work on P# focused on defining the framework and showcasing its bug-finding abilities [7, 30], but only targeted existing systems. In our experience, convincing teams to use P# with just bug-finding capabilities alone was not enough. It was much easier to convince them once we had a success story (with Azure Batch) that demonstrated overall faster development time, along with increased service quality. It proved that the framework is mature and easy to learn and use.
The P language is a different design point compared to P#. It consists of its own language and compiler, which increases its barrier for adoption. Although P has been successful for device driver development, integrating it with an underlying infrastructure was very challenging. With a lack of libraries, IDE support, etc., it was hard to convince engineering teams to take a dependence on it. Our focus on cloud services (as opposed to all asynchronous software) also helped amplify our message within Azure.
[1] (1986) Actors: a model of concurrent computation in distributed systems. MIT Press, Cambridge, MA, USA.
[2] (2012) Summary of the AWS service event in the US East Region. http://aws.amazon.com/message/67457/
[3] (2017) Everest: towards a verified, drop-in replacement of HTTPS. In 2nd Summit on Advances in Programming Languages, SNAPL 2017, May 7-10, 2017, Asilomar, CA, USA, pp. 1:1–1:12.
[4] (2010) A randomized scheduler with probabilistic guarantees of finding bugs. In Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2010, Pittsburgh, Pennsylvania, USA, March 13-17, 2010, pp. 167–178.
[5] (2017) Verifying a high-performance crash-safe file system using a tree specification. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP ’17, New York, NY, USA, pp. 270–286.
[6] (2015) Asynchronous programming, analysis and testing with state machines. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, Portland, OR, USA, June 15-17, 2015, pp. 154–164.
[7] (2016) Uncovering bugs in distributed storage systems during testing (not in production!). In 14th USENIX Conference on File and Storage Technologies, FAST 2016, Santa Clara, CA, USA, February 22-25, 2016, pp. 249–262.
[8] (2013) P: safe asynchronous event-driven programming. In ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’13, Seattle, WA, USA, June 16-19, 2013, pp. 321–332.
[9] (2015) Systematic testing of asynchronous reactive systems. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, Bergamo, Italy, August 30 - September 4, 2015, pp. 73–83.
[10] (2018) 83% of enterprise workloads will be in the Cloud by 2020. https://www.forbes.com/sites/louiscolumbus/2018/01/07/83-of-enterprise-workloads-will-be-in-the-cloud-by-2020
[11] (2019) Public Cloud soaring to $331B by 2022 according to Gartner. https://www.forbes.com/sites/louiscolumbus/2019/04/07/public-cloud-soaring-to-331b-by-2022-according-to-gartner
[12] (2019) Cassandra. http://cassandra.apache.org/
[13] (2015) IronFleet: proving practical distributed systems correct. In Proceedings of the 25th Symposium on Operating Systems Principles.
[14] (2014) Ironclad apps: end-to-end security via automated full-system verification. In USENIX Symposium on Operating Systems Design and Implementation (OSDI).
[15] (1990) Linearizability: a correctness condition for concurrent objects. ACM Trans. Program. Lang. Syst. 12 (3), pp. 463–492.
[16] (2011) Mesos: a platform for fine-grained resource sharing in the data center. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, NSDI ’11, pp. 295–308.
[17] (2011) The spin model checker: primer and reference manual. 1st edition, Addison-Wesley Professional.
[18] (1994) The temporal logic of actions. ACM Transactions on Programming Languages and Systems 16 (3), pp. 872–923.
[19] (2014) SAMC: semantic-aware model checking for fast discovery of deep bugs in cloud systems. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, pp. 399–414.
[20] (2019) Asynchronous programming with async and await in C#. https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/async/
[21] (2019) Azure Batch: Cloud-scale job scheduling and compute management. https://azure.microsoft.com/en-in/services/batch/
[22] (2019) Azure Blockchain Service. https://docs.microsoft.com/en-us/azure/blockchain/service/overview
[23] (2019) Azure Cosmos DB. https://azure.microsoft.com/en-in/services/cosmos-db/
[24] (2019) Azure Resource Manager. https://docs.microsoft.com/en-us/azure/azure-resource-manager/
[25] (2019) Azure Service Bus. https://docs.microsoft.com/en-us/azure/service-bus-messaging/
[26] (2019) Azure Storage. https://azure.microsoft.com/en-us/services/storage/
[27] (2019) Azure Subscriptions. https://docs.microsoft.com/en-us/azure/azure-subscription-service-limits
[28] (2019) Azure Virtual Machine Scale Sets. https://azure.microsoft.com/en-in/services/virtual-machine-scale-sets/
[29] (2019) Azure Virtual Network. https://docs.microsoft.com/en-us/azure/virtual-network/
[30] (2017) Lasso detection using partial-state caching. In 2017 Formal Methods in Computer Aided Design, FMCAD 2017, Vienna, Austria, October 2-6, 2017, pp. 84–91.
[31] (2008) Finding and reproducing Heisenbugs in concurrent programs. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, pp. 267–280.
[32] (2008) Fair stateless model checking. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 362–371.
[33] (2015) How Amazon Web Services uses formal methods. Commun. ACM 58 (4), pp. 66–73.
[34] (2011) dBug: systematic testing of unmodified distributed and multi-threaded systems. In Proceedings of the 18th International SPIN Conference on Model Checking Software, pp. 188–193.
[35] (2002) The economic impacts of inadequate infrastructure for software testing. National Institute of Standards and Technology, Planning Report 02-3.
[36] (2019) P#: A framework for rapid development of reliable asynchronous software. https://github.com/p-org/PSharp
[37] (2016) Concurrency testing using controlled schedulers: an empirical study. TOPC 2 (4), pp. 23:1–23:37.
[38] (2014) GoogleBlog – Today’s outage for several Google services. http://googleblog.blogspot.com/2014/01/todays-outage-for-several-google.html
[39] (2013) Apache Hadoop YARN: yet another resource negotiator. In ACM Symposium on Cloud Computing, SOCC ’13, Santa Clara, CA, USA, October 1-3, 2013, pp. 5:1–5:16.
[40] (2015) Verdi: a framework for implementing and formally verifying distributed systems. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 357–368.
[41] (2009) MODIST: transparent model checking of unmodified distributed systems. In Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation, pp. 213–228.