Faa$T: A Transparent Auto-Scaling Cache for Serverless Applications

04/28/2021, by Francisco Romero, et al.

Function-as-a-Service (FaaS) has become an increasingly popular way for users to deploy their applications without the burden of managing the underlying infrastructure. However, existing FaaS platforms rely on remote storage to maintain state, limiting the set of applications that can be run efficiently. Recent caching work for FaaS platforms has tried to address this problem, but has fallen short: it disregards the widely different characteristics of FaaS applications, does not scale the cache based on data access patterns, or requires changes to applications. To address these limitations, we present Faa$T, a transparent auto-scaling distributed cache for serverless applications. Each application gets its own Faa$T cache. After a function executes and the application becomes inactive, the cache is unloaded from memory with the application. Upon reloading for the next invocation, Faa$T pre-warms the cache with objects likely to be accessed. In addition to traditional compute-based scaling, Faa$T scales based on working set and object sizes to manage cache space and I/O bandwidth. We motivate our design with a comprehensive study of data access patterns in a large-scale commercial FaaS provider. We implement Faa$T for the provider's production FaaS platform. Our experiments show that Faa$T can improve performance by up to 92% (57% on average) for challenging applications, and reduce cost for most users compared to state-of-the-art caching systems, i.e., the cost of having to stand up additional serverful resources.


1 Introduction

Motivation. Function-as-a-Service (FaaS) is an increasingly popular way of deploying applications to the cloud. With FaaS, users deploy their code as stateless functions and need not worry about creating, configuring, or managing resources explicitly. FaaS shifts these responsibilities to the FaaS provider (e.g., AWS Lambda, Azure Functions, Google Cloud Functions), which charges users per resource usage during their function invocations. FaaS providers build their platforms by renting and managing resources (VMs or bare-metal containers) in public clouds (e.g., AWS, Azure, Google Cloud Platform). To control their costs, these providers proactively unload a function from memory if it has not been invoked for a while (e.g., after 7 minutes of inactivity in AWS Lambda [50]).

Due to FaaS’s stateless nature, a function invocation is not guaranteed to have access to state created by previous invocations. Thus, any state that might be needed later must be persisted in remote storage. This also applies to applications with multiple stages (often expressed as a pipeline or a directed acyclic graph of functions), with intermediate results passed between invocations. Since existing FaaS platforms typically do not allow functions to communicate directly, functions must also write these results to remote storage.

The remote storage can be object-based (e.g., Amazon S3 [3], Azure Blob Storage [35]), queues [6], or in-memory storage clusters (e.g., Redis [45], InfiniCache [55], Pocket [32]). Regardless of type, remote storage incurs higher latency and lower bandwidth than accessing local memory [32, 24]. When users have to provision in-memory storage clusters, it introduces management overhead and costs.

Given these limitations, local in-memory caching emerges as a natural solution for both speeding up access to remote data and enabling faster data sharing between functions. Prior works have considered both local and remote in-memory caching for FaaS [32, 43, 53, 55] but, we argue, have come up short in multiple ways.

First, they implement a single cache for multiple or all applications. This disregards the widely different characteristics of FaaS applications. For example, Shahrad et al. [48] have shown that 45% of applications are invoked less frequently than once per hour on average. Caching data for rarely-invoked applications at all times is wasteful. However, not caching data for these applications will likely produce poor performance. Furthermore, a shared cache requires complex communication and synchronization primitives for the data of thousands of applications.

Second, in prior approaches, the cache is either fixed in size (e.g., [43, 55]) or scales only according to the computational load (e.g., [53]). These approaches work well when data access patterns are stable and working sets are smaller than the available cache space. When this is not the case, scaling the cache based on data access patterns would be beneficial. Moreover, prior works have not considered scaling as a way to mitigate the impact of accessing large data objects; these objects can take a long time to access, as remote storage I/O bandwidth is often limited by the underlying VM/container or by contention across co-located applications [24]. Emerging FaaS applications, such as ML inference pipelines and notebooks, would benefit from scaling out for increased cache space and increased I/O bandwidth to remote storage.

Third, prior caches are not entirely transparent to users, either because users need to provision them explicitly (e.g., [43, 32]) or because they provide a separate API for users to access the cache (e.g., [51, 53]). FaaS users do not want to think about data locality or manage caches (or any other resources) explicitly. In fact, a key reason for the popularity of FaaS is exactly that users are relieved from such tasks.

Our work. Fundamentally, the problem with prior approaches is that caching layers for serverless systems have never been truly serverless, i.e. tied to applications, auto-scaling, and transparent. Thus, we propose Faa$T, an in-memory caching layer with these characteristics.

Each application is loaded into memory with its own local Faa$T cache. This cache manages the data accessed by the application transparently as it runs. When the application is unloaded from memory, its Faa$T cache is also unloaded. This approach obviates the need for a remote in-memory cache and may reduce the overall traffic to remote storage, both of which reduce costs for users. It also means that the cache space required by rarely-invoked applications is proactively removed from memory when not needed, just like the application itself, which reduces costs for the FaaS provider. Moreover, it enables different cache replacement and persistence policies per application, and pre-fetching of the most popular data when (re-)loading each application. The latter feature can be very effective when combined with automatic pre-warming of applications, which can be done accurately and completely off the critical invocation path [48].

As in other systems, an application is auto-scaled based on the number of invocations it is receiving. Scaling out loads a new application “instance” (i.e., a copy of its code) into memory, whereas scaling in unloads an instance. We refer to this as compute-based scaling. However, to match each application’s data access and reuse patterns, Faa$T automatically scales out the number of instances to (a) increase the fetch bandwidth from remote storage for large objects, and (b) increase the overall cache size for frequently-accessed remote data. Scale-in occurs as the need for space and/or bandwidth subsides. With multiple active instances, Faa$T forms a cooperative distributed cache, i.e. a data access can hit locally, miss locally but hit a remote cache, or miss both locally and remotely. By default, Faa$T offers strong data consistency, but each application can optionally select its own consistency and policies for scaling and eviction.

A key aspect of FaaS is that users are only charged for resources they use. As we tie Faa$T to applications, we expect that FaaS providers would charge only for cache accesses and space consumed by objects that were actually accessed.

Implementation and results. We motivate Faa$T with the first comprehensive study of real FaaS applications from the perspective of data access and reuse. We use data collected for 2 weeks from the production workload of a large public FaaS provider. We show, for example, that many infrequently invoked applications exhibit good temporal locality (the same data is accessed across relatively rare invocations), whereas spatial locality in large objects is high (if any byte from such an object is accessed, the rest of it should be pre-fetched).

We implement Faa$T in a public production FaaS offering. To show that it enables new applications that are not efficiently supported by current FaaS platforms, we implement an ML inference pipeline and a Jupyter notebooks [30] server stack that runs unmodified notebooks in a serverless environment.

Our experiments with these applications evaluate Faa$T’s caching and scaling policies. Our results show that Faa$T can improve their performance by up to 92% (57% on average), and reduce costs for most users compared to state-of-the-art FaaS caching systems, i.e., the cost incurred by the provisioning of additional serverful resources.

Contributions. In summary, our main contributions are:

  • We characterize the data access patterns of the production workload of a large FaaS provider.

  • We design and implement Faa$T, a transparent auto-scaling cache for FaaS applications.

  • We propose scaling policies for Faa$T to increase instance bandwidth and overall cache size based on data access patterns and object sizes.

  • We show that Faa$T broadens the scope of applications that can run on FaaS with near-native performance, including ML inference pipelines and Jupyter notebooks.

Figure 1: Distribution of the size of accessed blobs. 80% of accessed blobs are smaller than 12KB.
Figure 2: CDF of the number of function invocations per unique blob by each application.
Figure 3: CoV of the IaT and number of invocations for each blob accessed more than three times. CoV of 1 is Poisson arrival.

2 Analysis of FaaS Applications and Caching

This section characterizes the invocation and data access patterns of applications running on a production FaaS offering. An application is a collection of logically-related functions. Each instance of an application gets a set of resources (e.g., memory in a VM or container) that is shared by its functions. We focus on data accesses, as several prior works have focused on code accesses to optimize cold-start latencies [1, 38, 16, 11] or reducing the number of cold-starts [48]. We also discuss the limitations of current FaaS platforms for existing and emerging applications.

2.1 Characterizing Current Applications

We use 14 days of logs (November 23 to December 7, 2020) from a large-scale FaaS provider, across 28 geographical regions. We analyze a sample of the applications that access remote blob storage over HTTPS. This log includes 855 applications from 509 users and 33.1 million invocations with 44.3 million data accesses. 77.3% of accesses are reads and the rest are writes. The applications are written in multiple programming languages, including C#, Node.js, and Python.

Data size. The log includes accesses to 20.3 million different objects with a total size of 1.9 TB. Figure 1 shows the distribution of the size of the blobs accessed. 80% of blobs are smaller than 12KB and more than 25% are smaller than 600 bytes. However, there are also many large blobs; a few are as large as 1.8GB. The objects read tend to be larger than the ones written. While the aggregate bandwidth to backend storage is usually high [49], the prevalence of small objects exacerbates the impact of storage latency.

Data accesses and reuse. Figure 2 shows the CDF of the ratio of the number of invocations an application made to the number of unique blobs it accessed. Most applications access a single, different blob per invocation (invocations/blob = 1). Roughly 11.0% of applications access more than one blob per invocation (invocations/blob < 1). More than 32.0% of the applications access the same blob in more than one invocation and 7.7% in more than 100 invocations. One application accesses the same blob in more than 10,000 invocations.

Around 11.8% of the applications access the same blob across all invocations, 66.1% access less than 100 different blobs, 93.0% access less than 10,000 different blobs, and one accesses more than 8 million different blobs. Even though there are 44.3 million accesses, only 20.3 million are first accesses. Overall, the applications accessed 2.6 TB of data while the corpus of unique data is 1.9 TB. If we were able to cache the already accessed data, we could save up to 27.0% of traffic and 54.3% of the accesses to remote storage.
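For concreteness, these savings follow directly from the reported totals (the small differences are due to rounding of the inputs):

    (2.6 TB - 1.9 TB) / 2.6 TB ≈ 27.0% of the traffic
    (44.3 M - 20.3 M) / 44.3 M ≈ 54.2% of the accesses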

Data sharing across applications and users (not shown) is extremely uncommon. 99.7% of the blobs are not shared across applications, only 0.02% of blobs are shared across regions, and only 16 blobs across users.

Temporal access pattern. Figure 3 shows the temporal access patterns for each blob accessed by an application. The X-axis is the number of function invocations that read/wrote the blob and the Y-axis shows the coefficient of variation (CoV) for the inter-arrival time (IaT) of those invocations. Each point represents a blob with more than 3 accesses (there is no IaT CoV otherwise). A CoV of 1 suggests Poisson arrivals, values close to 0 indicate a periodic arrival, and values larger than 1 indicate greater burstiness than Poisson. Clearly, accesses to a large percentage of blobs are very bursty.
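As an illustration of this metric, the sketch below computes a blob's IaT CoV from its access timestamps; the function name and the sample timestamps are ours, not the paper's.

```python
import numpy as np

def iat_cov(access_times):
    """Coefficient of variation of inter-arrival times for one blob.

    access_times: timestamps (seconds) of the invocations that read/wrote
    the blob. As in the paper's filter, the blob should have more than
    three accesses so that several inter-arrival times exist.
    """
    iats = np.diff(np.sort(np.asarray(access_times, dtype=float)))
    return iats.std() / iats.mean()

# CoV ~1 suggests Poisson arrivals, ~0 periodic, >1 burstier than Poisson.
print(iat_cov([0, 60, 120, 180, 240]))    # periodic accesses -> ~0
print(iat_cov([0, 1, 2, 300, 301, 900]))  # bursty accesses -> > 1
```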

Example application          | Data size | App. invoc. freq. | Data reuse
Video streaming              | Large     | Frequent          | Low-med
Jupyter notebook             | Large     | Rare              | Med-high
ML inference (pipeline)      | Large     | Frequent          | Med-high
Distributed compilation      | Large     | Rare              | Med-high
Log aggregator               | Small     | Frequent          | Low
Software unit tests          | Small     | Rare              | Med
ML inference (single model)  | Small     | Frequent          | Low-med
IoT sensors                  | Small     | Rare              | Low
Table 1: FaaS application spectrum. Applications vary in data size, invocation frequency, and data reuse.
Features/Properties     | Pocket [32]        | InfiniCache [55]   | Locus [43]         | Faasm [51]         | Cloudburst [53]    | OFC [37]        | Faa$T
Cache location          | Storage cluster    | Separate function  | Redis cluster      | Host cache         | VM cache           | RAMCloud server | Invoked function
Cached data management  | Independent of app | Independent of app | Independent of app | Independent of app | Independent of app | Unloaded w/ app | Unloaded w/ app
App transparency        | Get/Put            | Get/Put            | No changes*        | Custom API         | Custom API         | No changes      | No changes
User configuration      | Hints              | Num instances      | Redis              | None               | None               | None            | None
Data consistency        | None               | None               | None               | Supported          | Supported          | Supported       | Supported
Object pre-warming      | None               | None               | None               | None               | None               | None            | Automatic
Compute cache scaling   | N/A                | N/A                | N/A                | Dynamic            | Dynamic            | Dynamic         | Dynamic
Cache size scaling      | N/A                | None               | None               | None               | None               | Dynamic         | Dynamic
Cache bandwidth scaling | N/A                | None               | None               | None               | None               | None            | Dynamic
Table 2: Existing systems and their properties. Green indicates supported/enabled, tan indicates limited support or limitations in what can be enabled, and red indicates not supported/enabled. For Locus (*), transparency is limited to applications with a shuffle operation.

Access performance. We observe that writes are usually faster than reads, since writes are buffered and do not require persisting data across all replicas synchronously. Reads are slower as we have to wait for the storage layer to deliver all data. We also observe that smaller blobs have a lower throughput (in MB/s) as they cannot amortize the overhead of the initial handshake.

Diverse invocation patterns. Prior work has observed that invocation frequency varies greatly: 81% of applications are invoked at most once a minute on average, whereas 45% are invoked less than once per hour on average [48]. Furthermore, less than 20% of applications are responsible for over 99% of all invocations. These findings are consistent with our log where we observe an even more concentrated behavior: 80% of functions have less than one invocation per minute on average and less than 15% of applications account for 99% of the invocations. This heterogeneous behavior is a challenge, as caching data for rarely-invoked applications can be wasteful but is necessary, as it affects a large number of users.

Takeaways and requirements. Our characterization indicates that many FaaS applications exhibit data reuse: more than 30% of them access the same data across invocations. This suggests that caching can be effective for them. Moreover, the characterization shows a wide spectrum of accessed data sizes and invocation frequencies. Accessed data sizes span almost 9 orders of magnitude (from several bytes to GBs), i.e. large objects cannot be overlooked. Function invocation frequencies also span almost 9 orders of magnitude, i.e. rarely-invoked applications cannot be overlooked.

Table 1 illustrates the spectrum of data sizes, invocation frequencies, and reuse along with some example applications. For instance, distributed compilation of the Chromium browser requires accesses to hundreds of MBs, but happens only a few times per day using a framework like gg [17]. Data reuse across compilations is high since the codebase does not change fast. In contrast, serving an IoT sensor involves a small dataset, rare invocations, and low reuse.

We draw three key requirements for a serverless caching layer. First, it should ensure that data with good temporal locality is cached and reused across invocations (R1). Second, the caching layer should target both frequently- and rarely-invoked applications (R2). It should optimize for data reuse for frequently-invoked applications, while it should have the ability to pre-warm frequently-accessed objects for rarely-invoked applications. Finally, the caching layer should accommodate large objects and exploit spatial locality for them (R3). These requirements must be satisfied for applications written across various programming languages.

2.2 Existing Caching Systems Limitations

Table 2 lists the characteristics of several caching systems.

Caching is managed independently of each application. Except for OFC, all the systems listed in the table include a separate caching or storage infrastructure that is shared by multiple applications. (Cloudburst and Faasm also have a shared cache on the same hosts/VMs that run the functions.) Because of this, either users must manage the extra servers or cache state is left behind when applications are unloaded from memory. In the former case, user costs and management overheads are greater, whereas in the latter the FaaS provider costs are higher. Thus, a serverless caching layer should be tied to each application, so that its code and data can be loaded/unloaded based on the application’s invocation pattern (R4).

Need system configuration or application changes. Pocket, InfiniCache, and Locus rely on user configuration to maximize performance and minimize cost. For example, InfiniCache users must statically set the number of data shards and functions to store them, whereas Locus users must configure a Redis cluster. Faasm and Cloudburst rely on custom APIs to give applications control over the consistency of their data, and to pass messages between functions. To retain the simplicity of the FaaS paradigm, the caching layer should be transparent and not require changes to the application (R5).

Cloudburst additionally requires its key-value store (Anna [56]) to track all objects residing in all caches. Managing this much metadata can lead to scaling limitations and extra costs. The caching layer can mitigate these concerns by minimizing metadata management for each cache instance (R6).

Compute-based scaling only. Finally, while some existing systems do not dynamically scale their caches (Pocket, InfiniCache, Locus), others do so based solely on the amount of offered computational load, i.e. number of function invocations (Faasm, Cloudburst). OFC scales based on the computational load and predicted memory usage of cached objects. However, it limits its caching to objects smaller than 10MB. As applications become more complex and data-heavy, data access characteristics like large working sets or large objects will gain in importance. Thus, the caching layer should scale compute (as the application’s offered load varies), cache size (based on the data reuse pattern), and bandwidth to remote storage (based on the object sizes being accessed) (R7).

2.3 Enabling New FaaS Applications

Current FaaS platforms limit the set of applications that can be efficiently run. Next, we describe two challenging ones.

ML inference pipeline. Many applications across several domains (e.g., health care [27], advertisement recommendation [21], and retail [2]) depend on ML inference for classification and other prediction tasks. ML inference load patterns can vary unpredictably [57, 44], which makes FaaS on-demand compute and scaling an ideal match for serving inference queries. However, ML inference applications require low-latency prediction serving (<1 s) [20, 44]. For example, AMBER Alert [34] responders may use an application to perform car and facial recognition. The application can be deployed on a FaaS platform as a pipeline of ML models (Figure 4). For each input image, an HTTP request is first sent to a bounding box model function to identify and label all present objects (step 1). The labeled image is uploaded to a common data store to trigger the car and people recognition models (step 2). Both recognition functions upload their outputs to the common data store (step 3). Inference pipelines can exhibit different levels of parallelism at each stage, which also makes them good fits for FaaS deployment [46]. The AMBER Alert pipeline fans out in the second stage, depending on the identified objects.

Figure 4: An example of an ML pipeline executed through FaaS. The bounding box model function is invoked using HTTP (step 1). The labeled bounding-boxed image is uploaded to a common data store, and it subsequently triggers the execution of the car and people recognition functions (step 2). The car and people recognition functions also upload their outputs to the store (step 3).

To assess whether FaaS can meet low-latency requirements, we ran the AMBER Alert pipeline natively on a local VM, and in a production FaaS environment with remote storage. Figure 5(a) shows that it is up to 3.8× slower running on FaaS versus natively, while Figure 5(b) shows that the main reason is the time to load the models. The inefficiency of the storage layer makes it impossible for the FaaS platform to run this application with sub-second latency.

Jupyter notebooks. Jupyter notebooks [30] are often used for data science tasks such as data processing, ML modeling, and visualizing results [41]. They are typically run by defining code in cells and interactively executing each cell. Jupyter notebooks are typically backed by statically-allocated VMs. Depending on how often a notebook is run, the VMs may sit idle for long periods. This is expensive for users and wasteful for service providers. Furthermore, the amount of parallelism and compute needed for each cell's task can vary. Akin to ML inference, this variability makes FaaS a strong fit.

To test its performance, we ported Jupyter to run on a production FaaS platform, an application we term JupyterLess. Each cell is executed as a function invocation and the state between functions is shared through an intermediate storage layer. We compare the execution time of summing a single 350MB DataFrame column partitioned into 10 chunks with JupyterLess to running on a native VM. JupyterLess is 63× slower than native VM execution as downloading the intermediate state and DataFrame column from remote storage dominates the execution time. Thus, JupyterLess cannot be run interactively on existing FaaS frameworks.

3 Faa$T design

We design Faa$T as a transparent auto-scaling cache that meets the requirements we identify in Section 2. Faa$T caches objects accessed during a function execution so they can be reused across invocations (R1). It is built into the FaaS runtime with no external servers or storage layers, so it can be transparently tied to an application (R4, R5) written in any of the various supported languages. When an application is unloaded from memory, Faa$T collects metadata about the cache objects, and uses it to pre-warm the cache with frequently accessed objects when the application is re-loaded into memory. This is especially important for applications that are rarely-invoked (R2).

(a) Native VM versus prod FaaS.
(b) FaaS execution breakdown.
Figure 5: (a) Latency of the AMBER Alert pipeline on a native VM versus a production FaaS. Native VM does not include the time to load the PyTorch library (700MB). It is up to 3.8× slower to run the pipeline on FaaS. (b) Model run times (BBox is a bounding box model). Data movement from/to remote storage dominates.

Faa$T scales along three dimensions (R7): (a) based on an application's invocations per second (compute scaling), which benefits applications that are frequently-invoked; (b) based on the data reuse pattern (cache size scaling), which is beneficial for applications with large working sets whose objects are continuously evicted; (c) based on the object size (bandwidth scaling), which is beneficial for applications that access large objects (e.g., 10MB or more) and are limited by the I/O bandwidth between the application instance and remote storage (R3). While an application is loaded, Faa$T efficiently locates objects across instances using consistent hashing without the need for large location metadata (R6).

3.1 System architecture

Figure 6 shows the architecture of a FaaS platform with Faa$T. Each application instance runs in a VM or container that contains the FaaS runtime and the code for the application functions. Faa$T instances, which we refer to as cachelets, are a part of the runtime, caching data in memory. Each application instance has one corresponding Faa$T cachelet. In addition, Faa$T forms a cooperative distributed cache; an application’s Faa$T cachelets communicate directly to access data when necessary (Section 3.2). Similar to Faasm, Faa$T maintains a single copy of cached objects, which improves memory efficiency compared to existing systems [37, 53, 55].

We designed Faa$T to be per-application due to the following drawbacks of a shared cache. First, a shared cache requires complex communication and synchronization primitives for the data of thousands of applications (compared to a maximum of tens of instances for a single application with its own cache). This makes it difficult to implement per-application management policies (e.g. scaling) and provide transparency without custom APIs [51, 53], especially given the diversity of application characteristics and requirements (Section 2.1). Second, a shared cache with traditional eviction policies (e.g. LRU) can lead to severe unfairness among applications [42].

To find the location of an object, a Faa$T cachelet interacts with the Membership Daemon, which determines the object’s “owner” based on the current number of cachelets. The owner is responsible for downloading/uploading the object from/to remote storage. The Load Daemon collects cached object metadata, and uses it to decide what data objects to pre-warm when an application is loaded (Section 3.3). To prevent interference with an application’s heap memory usage, the Memory Daemon monitors both function and cachelet memory consumption. Finally, the Frontend load-balances requests across the running application instances, and the Scale Controller adds and removes instances based on metrics provided by the FaaS runtime (Section 4).

Figure 6: Faa$T’s architecture diagram.
Figure 7: Reads in Faa$T: local hit (LH), remote hit (RH), local miss (LM), remote miss (RM). Solid lines indicate communication between the application, Faa$T instances, and remote storage. Dashed lines indicate data movement.

3.2 Accessing and caching data

Reads. Figure 7 shows the four ways to read data. A local hit finds the data cached in the local Faa$T cachelet. A local miss occurs when the local Faa$T cachelet is the owner for the object and does not currently cache the data. The cachelet directly fetches the data from remote storage. A remote hit occurs when the data misses the local cachelet but is found in the owner’s cache. Finally, a remote miss occurs when the access misses both the local cache and the owner’s cache. The owner fetches the data from remote storage and caches it locally. In all cases, Faa$T cachelets cache objects locally, even if they are not the owners, for performance and locality. Thus, a popular object will incur at most one remote hit per cachelet, and local hits thereafter (besides the optional consistency version check, described below).

Faa$T uses consistent hashing to determine object ownership. We choose consistent hashing because (a) it avoids having to track object metadata (e.g., a list of objects in each instance), and (b) it reduces rebalancing as instances are added/removed: on average, only K/N of the K cached objects need to be remapped when the N-th instance is added or removed [5, 31]. To maintain transparency, the object namespace is the same as that used by the remote storage service. This design choice also enables efficient communication between cachelets, which as observed by prior work [17, 12, 29] is beneficial for applications such as ML training.
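To make the ownership lookup concrete, below is a minimal consistent-hashing sketch in the spirit of Sections 3.2 and 5 (SHA256 hashing, multiple virtual replicas per cachelet). The class and variable names are illustrative, not Faa$T's actual implementation.

```python
import bisect
import hashlib

class MembershipRing:
    """Minimal consistent-hashing sketch: each cachelet is placed on the
    ring many times (virtual replicas) to balance ownership; the object
    namespace is the remote storage namespace (e.g., blob names)."""

    def __init__(self, cachelets, replicas=100):
        self.ring = sorted(
            (self._hash(f"{c}#{r}"), c)
            for c in cachelets for r in range(replicas)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        # Truncate SHA256 to 64 bits for compact ring positions.
        return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

    def owner(self, blob_name):
        # Owner = first virtual replica clockwise from the object's hash.
        i = bisect.bisect(self.keys, self._hash(blob_name)) % len(self.ring)
        return self.ring[i][1]

ring = MembershipRing(["instance-0", "instance-1", "instance-2"])
print(ring.owner("container/model.onnx"))
```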

Writes. When the application needs to output data, Faa$T writes through to the owner cache. The instance executing the function sends the data to the owner cache, which then writes it to remote storage. This guarantees that the owner always has the latest version that the application has written. By default, the write happens synchronously to the owner and synchronously to remote storage. This offers high fault tolerance while trading off performance, since applications must wait until the write completes before proceeding with their execution. Applications can optionally configure Faa$T to write asynchronously or not write to remote storage at all. Because Faa$T is tied to each application, different applications can use different policies at the same time.

Consistency. Table 3 summarizes the possible read/write settings for Faa$T, and the performance, consistency, and fault tolerance (FT) they achieve. By default, when reading an object, Faa$T first verifies the version in the cache matches the one in remote storage. No data gets transferred during this check. If the version matches (the common case), no object is retrieved. This verification, combined with synchronous writes to remote storage, offers strong consistency (first row of Table 3). We set this as the default because it provides the same fault tolerance with better performance than the production FaaS platform.

Some applications may be willing to trade off consistency for performance (e.g., ML inference). For those applications, Faa$T can read any cached version and write asynchronously to remote storage. This weakens its fault tolerance, and provides only eventual consistency. Applications can also completely skip writing to remote storage and rely on the distributed cache. In Section 6.7, we quantify the performance and consistency impact of these settings.
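The following sketch summarizes the read path of Figure 7 together with the default version check. The cache layout and the `storage.version`/`storage.download` helpers are assumptions for illustration, not Faa$T's API.

```python
def read(name, local_cache, owner_cache, storage, strong_consistency=True):
    """Sketch of the Faa$T read path: local hit, remote hit, local miss,
    or remote miss. Caches are plain dicts of name -> {"version", "data"};
    `storage` stands in for the remote object store."""
    entry = local_cache.get(name)
    if entry is not None:
        if not strong_consistency:
            return entry["data"]                    # may be stale (eventual)
        # Version check only: no object data is transferred from storage.
        if storage.version(name) == entry["version"]:
            return entry["data"]                    # local hit
    if owner_cache is local_cache:                  # this cachelet owns the object
        entry = storage.download(name)              # local miss
    elif name in owner_cache:
        entry = owner_cache[name]                   # remote hit
    else:
        entry = storage.download(name)              # remote miss: owner fetches
        owner_cache[name] = entry
    local_cache[name] = entry                       # cache locally even if not owner
    return entry["data"]
```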

Write target | Write mode | Read target | Performance | Consistency | FT
Storage      | Sync       | Storage     | Low         | Strong      | High
Owner        | Sync       | Owner       | Medium      | Strong      | Medium
Owner        | Async      | Owner       | Medium      | Eventual    | Medium
Owner        | Sync       | Local       | High        | Weak        | Medium
Local        | Sync       | Local       | High        | None        | Low
Local        | Async      | Local       | High        | None        | Low
Table 3: Performance, consistency, and fault tolerance (FT) for different write/read settings. By default, Faa$T writes to storage synchronously, and reads the version from storage (first row).

3.3 Pre-warming application data into Faa$T

To pre-warm future cachelets, Faa$T records metadata about the cache, off the critical path, when the Frontend unloads the application. This metadata includes the size of each cached object, its version, its number of accesses of each type (e.g., local hit, remote miss, produced as an output), and its average inter-arrival access time. We timestamp each collected metadata record with the unload time to capture the state history of the cachelet. As we describe next, this history is necessary for applications that are rarely-invoked (e.g., once per hour), since their data access pattern cannot be determined from a single invocation.

Faa$T needs to decide when to load an application into memory. For this, the Frontend leverages a previously proposed hybrid histogram policy [48]. The policy tracks the idle times between invocations of an application in a histogram. When the application is unloaded, the Frontend uses the histogram to predict when the next invocation is likely to arrive, and schedules the reload of the application for just before that time (e.g., 1 minute earlier). Our approach would work with any other cold-start prevention policy as well.

At this point, Faa$T needs to decide what data objects should be loaded into the new cachelet. To do so, it collects and merges the metadata across all cachelets over a pre-set period of time. The period of time is based on the application’s invocation frequency, which can be determined using the hybrid histogram policy. Next, Faa$T determines the objects to be loaded using the following two conditions. First, if an object’s local or remote cache hit rate is greater than a threshold, the object should be loaded. This indicates that the object has temporal locality. Second, if an object is accessed more than once across the merged metadata, the object should be loaded. This benefits rarely-invoked applications by loading objects accessed across unload/load periods (e.g., an ML inference application’s model and labels). Once the objects to be loaded are determined, the Faa$T cachelet loads the objects that it owns based on consistent hashing (Section 3.2).
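A minimal sketch of these two selection rules, assuming the merged metadata is available as a per-object dictionary; the field names and the hit-rate threshold are illustrative, not from the paper.

```python
def objects_to_prewarm(merged_metadata, hit_rate_threshold=0.5):
    """Select objects to load into a new cachelet (Section 3.3).

    merged_metadata: per-object stats merged across cachelets over the
    policy's time window, e.g.
    {"model.onnx": {"local_hits": 8, "remote_hits": 1, "total_accesses": 10}}.
    """
    selected = []
    for name, m in merged_metadata.items():
        cache_hits = m["local_hits"] + m["remote_hits"]
        hit_rate = cache_hits / max(m["total_accesses"], 1)
        # Rule 1: good temporal locality (cache hit rate above a threshold).
        # Rule 2: accessed more than once across unload/load periods.
        if hit_rate > hit_rate_threshold or m["total_accesses"] > 1:
            selected.append(name)
    return selected
```

Each cachelet then loads only the selected objects that it owns according to consistent hashing.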

To avoid competing with on-demand accesses, Faa$T pre-warms the cache only when the application is not executing, i.e. before an invocation arrives or right after a function execution ends. If we cannot avoid a cold-start invocation, the only data that is loaded into the cache is its inputs.

3.4 Evicting application data from Faa$T

The memory capacity of each application instance (and thus of its Faa$T cachelet) is set by the provider (typically a few GBs). Cachelets do not consume any memory beyond that allocated to the application.

Each Memory Daemon monitors the memory usage of the function and the cachelet. When the memory consumed by the function (i.e., heap memory) and the cachelet (i.e., cached objects) is within a small percentage (i.e., 10%) of the application’s total memory capacity, it evicts objects.

Eviction policies are often designed to cover the broad set of applications that can run on the platform [8, 7, 9]. In contrast, as Faa$T is tied to an application, it can use per-application eviction policies. Hence, the eviction policy can be kept simple and tailored to an application’s data access pattern as needed.

We implement two policies that we expect will work well for many applications. The first is a simple Least-Recently-Used (LRU) policy that prioritizes the eviction of objects that are not owned by the evicting cachelet. Only after these objects are evicted, does the policy consider owned objects in LRU order. This is the default policy. The second policy targets objects that are larger than a threshold (e.g., 12KB) and not owned by the evicting cachelet. If there are not enough of these objects, the policy evicts non-owned objects smaller than the threshold. If we still need more capacity, we evict owned objects that are larger than the threshold, before resorting to LRU for the remaining ones. In both eviction policies, targeting non-owned objects first increases the number of remote hits, but also minimizes the number of remote misses which are most expensive. For the applications we consider, we find that targeting non-owned objects first improves application performance by 20% on average when multiple cachelets are running.
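A minimal sketch of the default policy, assuming the cache is a dict ordered from least- to most-recently used; the data layout is ours, not the paper's.

```python
def evict_lru_prefer_non_owned(cache, bytes_needed):
    """Default Faa$T eviction sketch (Section 3.4): evict non-owned objects
    in LRU order first, then owned objects in LRU order.

    cache: dict of object name -> {"size": int, "owned": bool}, ordered
    from least- to most-recently used (illustrative layout).
    """
    freed = 0
    for owned_pass in (False, True):        # non-owned objects first, then owned
        for name in list(cache):            # iterates in LRU order
            if freed >= bytes_needed:
                return freed
            if cache[name]["owned"] != owned_pass:
                continue
            freed += cache[name]["size"]
            del cache[name]
    return freed
```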

Each of these eviction policies fits our emerging applications nicely: ML inference matches the first policy and JupyterLess the second. ML inference applications that exhibit high invocation rates (e.g., frequently used recommendation models [22]) can quickly fill up a Faa$T cachelet’s capacity with invocation inputs (e.g., images) and outputs (e.g., labeled objects). Across invocations, only the model and labels are typically reused; inputs and outputs change each time. Thus, for ML inference and similar applications, the first policy (LRU) is sufficient, since the inputs and outputs will be evicted when the cachelet’s capacity reaches its limit.

JupyterLess data objects can be classified into two types: (a) small objects that maintain the notebook's state (e.g., a dictionary object) and (b) larger objects that are used for data analysis (e.g., a DataFrame). A notebook's state is typically reused across invocations, and should thus be cached as much as possible. Larger objects are reused less frequently and can be replaced more aggressively. Thus, the second policy (size-based) is appropriate.

Faa$T allows for future eviction policies beyond the ones described above. For example, objects can be given a time-to-live (TTL) and get evicted when the TTL expires.

3.5 Charging for Faa$T

When using Faa$T, we expect FaaS providers to charge users only for the memory of the accessed data and not all the cached objects. FaaS providers should also not charge for pre-warming metadata in the same manner that they do not charge for function metadata (e.g. function registration).

4 Scaling Faa$T

FaaS platforms typically include a Scale Controller responsible for scaling applications in/out (Figure 6). As it is part of the front-end component, the Scale Controller monitors the end-to-end performance and the load offered to each application. It also periodically queries the FaaS runtime running each application instance for a vote on how many more instances to add: a positive number means a vote to scale-out and a negative number means a vote to scale-in. Based on the information for an application, it makes a scaling decision and effects it. Faa$T extends this mechanism by including cache-specific metrics when deciding on how to vote. We also extend the Scale Controller to accept unrequested votes, when scaling is needed immediately. When the controller adds or removes an application instance, Faa$T reassigns the objects’ ownership using the Membership Daemon’s consistent hashing.

Faa$T has three types of scaling:

Compute scaling. FaaS platforms scale the number of application instances based on its rate of incoming requests, its number of in-flight requests (queue sizes), and/or its average response time. Degrading performance, high request rates, or long queues cause scale-out; the opposite causes scale-in. Since every application instance includes both compute and caching resources, this traditional way of scaling is sufficient.

Cache size scaling. Faa$T also scales to match the application’s working set size. For example, a JupyterLess notebook performing data-intensive operations may not fit the entire working set in the cache, leading to a high eviction rate. To address this, each cachelet tracks the number of evictions of each locally-cached object and votes to scale out by one instance, if any object has been evicted more than once since the last controller query. If no object has been evicted more than once but there is still substantial cache access traffic, Faa$T votes to do nothing (add 0 instances). It votes to scale in by one instance when the frequency of accesses is low.
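A sketch of how a cachelet could compute this vote; the threshold values are illustrative.

```python
def cache_size_vote(eviction_counts, access_rate, low_access_threshold=1.0):
    """Cache-size scaling vote (Section 4): +1 to scale out, 0 to do
    nothing, -1 to scale in.

    eviction_counts: per-object eviction counts since the last controller
    query; access_rate: cache accesses per second.
    """
    if any(count > 1 for count in eviction_counts.values()):
        return +1                  # working set does not fit: scale out
    if access_rate < low_access_threshold:
        return -1                  # little cache traffic: scale in
    return 0                       # substantial traffic, no repeated evictions
```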

Many existing caching systems statically allocate resources and either cannot auto-scale their capacity as the amount of data accessed varies or require application hints to do so. OFC uses per-application machine learning models to achieve the same dynamic cache size scaling, which requires frequent retraining and mechanisms to prevent application “out-of-memory” failure.

Bandwidth scaling. Faa$T also supports applications with large input objects. For such objects, Faa$T equally partitions the download from remote storage across multiple cachelets to (a) create a higher cumulative I/O bandwidth to remote storage, and (b) exploit the higher communication bandwidth between instances (denoted B_I here) compared to the bandwidth between each instance and remote storage (B_S).

When a Faa$T cachelet receives an access to an object of size S, it iteratively computes the data transfer latency T_n for a number of instances n (starting at the current number) using the following formula:

    T_n = L_inst + S / (n * B_S) + S / B_I

where L_inst is the instance loading latency (zero when no new instances need to be loaded). Faa$T periodically profiles L_inst, B_S, and B_I to account for variations in the network and the remote storage bandwidths. The iterative process stops at the n where T_n changes by less than 10% or when T_n increases between iterations. If the resulting n is greater than the current number of instances, the Faa$T cachelet immediately contacts the controller to scale out to n instances. Faa$T then waits for the new instances to be created (by checking the Membership Daemon) and sends each of them a request to download a different byte range of size S/n. As scale-in is not as time-critical, Faa$T does it through periodic voting (when queried by the controller) as the number of object accesses becomes small.
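The sketch below walks through this iteration using the latency model reconstructed above; the parameter names and the cap on n are our assumptions.

```python
def bandwidth_scale_target(obj_size, n_current, L_inst, B_S, B_I,
                           improvement=0.10, n_max=64):
    """Iterative bandwidth-scaling decision (Section 4).

    obj_size in bytes, bandwidths in bytes/s, L_inst in seconds.
    Returns the number of instances to use for the parallel download;
    a result greater than n_current triggers an immediate scale-out vote.
    """
    def T(n):
        # Loading latency is charged only if new instances must be started.
        load = L_inst if n > n_current else 0.0
        return load + obj_size / (n * B_S) + obj_size / B_I

    best_n, best_T = n_current, T(n_current)
    for n in range(n_current + 1, n_max + 1):
        t = T(n)
        if t > best_T or (best_T - t) / best_T < improvement:
            break                  # stop: latency worsens or gain below 10%
        best_n, best_T = n, t
    return best_n
```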

We find that bandwidth-based scale-out is worthwhile for objects on the order of hundreds of MB (Section 6.6); this will become smaller as cold-start optimizations continue to appear [1, 38, 16, 11]. Existing systems do not support bandwidth scaling, and instead rely on the user to determine the right number of chunks and instances.

Handling conflicting scaling requests. The scaling policies work concurrently, so there may be scenarios where they make conflicting scaling requests to the Scale Controller. For example, compute scaling may want to scale out, while cache size scaling may want to scale in. When there are conflicting votes, the controller scales out if any policy determines that scale-out is needed. It scales in if all policies suggest scale-in will not hurt. This is similar to the approach taken by existing systems for right-sizing storage clusters [32].
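A compact sketch of this resolution rule; the policy names in the example are illustrative.

```python
def resolve_votes(votes):
    """Resolve conflicting per-policy votes (Section 4): scale out if any
    policy asks for it; scale in only if all policies agree scale-in will
    not hurt; otherwise do nothing.

    votes: instance deltas per policy, e.g.
    {"compute": +1, "cache_size": -1, "bandwidth": 0}.
    """
    if any(v > 0 for v in votes.values()):
        return max(v for v in votes.values() if v > 0)   # scale out
    if all(v < 0 for v in votes.values()):
        return max(votes.values())                       # conservative scale-in
    return 0
```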

Idle function computation resources. When instances are added due to cache size or bandwidth scaling, their computation resources can be wasted. Faa$T minimizes resource waste by scaling in when the frequency of accesses is low. Providers can also leverage resource harvesting [60] to run low-priority tasks (e.g., analytics jobs) on these resources when they are not in use.

5 Implementation

We implement Faa$T for a large-scale FaaS platform used in production, and have open-sourced the bulk of it [13, 14].

Production FaaS platform. In our platform, a user application comprises one or more functions. Each function defines its data bindings, which Faa$T uses to transparently load and manage objects: trigger (e.g., HTTP request), inputs (e.g., blob), and outputs (e.g., message queue). Users optionally set Faa$T policies (scaling, eviction, and consistency) using simple application-specific configurations at deployment time.

As we show in Figure 8, an application instance executes in either a VM or a Docker container, and includes the FaaS runtime and function-execution workers. Upon receiving incoming requests (e.g., as a result of an incoming HTTP request, a new blob being created), the runtime collects the requested input bindings and invokes the function in a worker while passing the appropriate arguments to it. When the worker finishes executing the function, it replies to the runtime with the produced output(s) and the runtime processes them (e.g., writes a blob to remote storage or writes to a message queue). If there are multiple concurrent invocations, more worker processes can be spawned on the same instance to execute them in parallel. The platform leverages an existing remote storage service that is not tailored to FaaS.

As Figure 6 shows, a Frontend component handles HTTP requests and does compute scaling. We extend this component to implement bandwidth-based scaling (Section 4).

Figure 8: Faa$T integrated with the FaaS platform. Runtime and workers point to the same shared memory objects.

Caching data. We implement the core of Faa$T in the runtime (C# code) with minimal changes to the workers (Python and Node.js). In the original design, the runtime and workers exchanged control and data messages over a persistent RPC channel. Faa$T replaces the data messages with a shared memory area, while keeping control messages over RPC. The shared area is also where Faa$T caches data. Data communication between the Faa$T cachelets and the workers happens via passing shared memory addresses, reducing end-to-end latency. In addition, unlike existing systems that need to maintain duplicate object copies [37], using shared memory reduces the memory footprint by only keeping a single object copy [14]. The workers managed by the same runtime share the cached objects.

When the runtime prepares input data bindings before invoking a function, Faa$T intercepts them and checks the cache first (Section 3.2). When a function produces an output, Faa$T caches it for future invocations. This cache write triggers any functions that have the newly added object as their trigger binding. This improves latency for applications that rely on writing intermediate outputs to external sources (e.g., blob storage) to trigger subsequent functions.

We support applications written in C#, Python, and Node.js. Supporting other languages would require minimal changes. We use the shared memory APIs already available in most languages for both Linux and Windows. When we run applications in containers (instead of VMs), we share the cache space across containers by setting up the appropriate permissions.

Distributed cache. Each runtime instance saves some metadata about the running application in a blob in remote storage. We store the Faa$T membership information in this blob and the Faa$T cachelets periodically heartbeat their state there. The consistent hashing algorithm uses SHA256 for hashing and 100 virtual replicas per cachelet for load balancing. More replicas did not improve load balancing and increased the ownership lookup time. Fewer replicas created ownership hot spots.

As our platform already uses HTTP for communication between its components, we use this interface to exchange data between Faa$T cachelets. We evaluated other approaches like RPC (with Protocol Buffers [40]) but the improvements were negligible and the complexity of maintaining a new channel would offset them. We could also leverage RDMA-based communication but have not experimented with it.

Other platforms. The design and implementation of Faa$T is extensible to other FaaS platforms. Most platforms have a similar architecture and Faa$T directly applies to the equivalent components (e.g., runtime, worker processes, Scale Controller). All platforms use the concept of triggers and interact with external data services. However, not all of them use bindings to map the data but rely on libraries to explicitly access it inside the function body. We would need to extend these libraries (e.g., Boto3 [10] in AWS Lambda) to interact with Faa$T and look-up the cache before accessing the remote storage. These extensions would be equivalent to modifying the binding process in our platform.

6 Evaluation

(a) AMBER Alert pipeline
(b) AlexNet (239MB)
(c) SqueezeNet (5MB)
Figure 9: Faa$T versus existing systems for ML inference. Faa$T improves application performance by accessing data in local or remote cache instances. (LH = Local Hit, RH = Remote Hit, LM = Local Miss, RM = Remote Miss, CB = Cloudburst, IC = InfiniCache.)

6.1 Methodology

Comparison points. We perform two types of comparisons. The first is an analysis of running the application traces from Section 2.1 on top of Faa$T. This allows us to show the improvements these applications would get with Faa$T.

The second evaluates the four access scenarios that functions may encounter: objects are accessed through local hits (LH), local misses (LM), remote hits (RH), or remote misses (RM).

We compare Faa$T against six baselines for performance and cost: (a) a large, local VM where all accesses are local and there are no function invocation overheads (Native); (b) our commercial FaaS offering (Vanilla) without Faa$T, which accesses all objects from remote object storage. Its performance is equal to that of Faa$T LM; (c) InfiniCache (IC) [55], which we approximate by statically configuring Faa$T to use only remote instances. Its best-case performance is equal to that of Faa$T RH; (d) Cloudburst's caching layer (CB) [53]. Its best-case performance is equal to that of Faa$T LH; (e) Pocket [32], approximated with a manually managed Redis VM with all data available at memory speed (no Flash accesses); and (f) a commercial Redis service (Redis service). The Redis service is the offering that matches our VM size in memory and network bandwidth. It is akin to what is used by Locus [43]. Data is stored and accessed from Redis as opposed to remote object storage.

Applications. We use the two applications from Section 2.3: ML inference and JupyterLess notebooks. We use application latency and cost as primary metrics. For each experiment, we report the mean and standard deviation of 3 runs.

For ML inference, we evaluate both single model inference and inference pipelines. For single model inference, we use two separate models that differ in latency and resource footprints: SqueezeNet [26] (5MB), and AlexNet [33] (239MB). For the inference pipeline, we evaluate the AMBER Alert pipeline of Figure 4; the output of the bounding box model (MobileNet Single-Shot Detector [25], 35MB) is fed into people (ResNet50 [23], 97MB) and car recognition (SqueezeNet, 5MB) models. In all inference cases, functions access an input image, the model, and the class labels (a text file).

For JupyterLess, we use five notebooks: (a) single message logging (No-Op); (b) summing a 350MB DataFrame column; (c) capacity planning with data collection and plotting; (d) FaaS characterization of Section 2.1; and (e) counting up to 1K. The function data objects consist of the notebook state after each cell’s execution, which is stored in JSON format.

Experimental setup. Each application instance is a single VM; the default instance size in our experiments includes 8vCPUs with 28GiB of DRAM and up to 500MB/s network bandwidth. I/O bandwidth to remote storage is lower at 90MB/s for large objects. They run Ubuntu 18.04 with 5.4.0 kernel on Intel Xeon E5-2673 CPUs operating at 2.40GHz. In our production setting, the VMs are pre-provisioned: an application instance cold-start involves loading and deploying the serverless runtime together with the application code.

Cost model. We derive user costs following the common pricing by FaaS and cloud providers. Function invocations are charged for the time and the resources they take, while VMs are charged for their lifetime. We assume Native and systems with additional resources are statically provisioned the whole time. Specifically, we charge for extra resources whenever the caching or storage system is external to the FaaS platform (i.e., Pocket, Redis, Redis service) or specialized for FaaS in some way (Cloudburst). Except for Redis service, we charge these systems for one extra VM of the same instance size as the default application instance. The VMs are charged their on-demand prices. For Redis service, we provision per service class and charge the class's price. The additional resource costs can be amortized by multiple applications sharing the same resources. Vanilla, InfiniCache, and Faa$T use existing commodity storage (e.g., AWS S3, Azure Storage), so we do not charge them for extra resources. We also add the cost of remote storage data transfers to Faa$T LM and RM, and to Vanilla.

Figure 10: Performance of summing a 350MB DataFrame column in a JupyterLess notebook. There are two Native setups: In IM the DataFrame is already pre-loaded before summing, and RS fetches it from remote storage before summing. Faa$T improves application performance by accessing data in local or remote cache instances.

6.2 Faa$T with applications run in production

We simulate the end-to-end performance of the Faa$T applications from Section 2.1. Our simulator uses the default policies for consistency (synchronous writes to owner, synchronous writes to remote storage, read version from remote storage) and eviction (LRU), and implements the scaling policies described in Section 4. To model Faa$T's access latencies, we measured read and write latencies for object sizes from 1B to 2GB using our Faa$T implementation described in Section 5. We vary the size of Faa$T from 1KB to 128MB; larger cache sizes showed no further improvement. We also vary the unload period and show how it affects performance when Faa$T cannot pre-warm frequently-accessed objects.

Figure 11 shows the CDF of percent improvement over blob storage for a 128MB cache (left) and the average percent improvement as the size of Faa$T varies (right). First, with just 128MB, Faa$T with pre-warm improves performance by 50% or more for about 35% of applications. Faa$T also has an average improvement of over 40% for a 128MB cache. Second, as the unload period gets smaller, Faa$T's pre-warm becomes more important to ensure frequently-accessed objects are available during the next application invocation. Third, improvement is correlated with reuse: we found that smaller objects tend to be reused more often, which resulted in greater performance improvements. Finally, we note that Faa$T is designed to support applications that run in production today (with object sizes of tens to hundreds of KB), but also future applications that will access much larger objects (hundreds to thousands of MB), as shown in Section 6.3.

Figure 11: CDF of percent improvement over blob storage for a 128MB cache (left) and average percent improvement as the size of Faa$T varies (right) with the application traces from Section 2.1. 30 sec, 10 min, and 20 min represent different unload periods with pre-warm disabled. Faa$T with pre-warm improves performance by 50% or more for about 35% of applications, and has an average improvement of over 40% for a 128MB cache.

6.3 Comparing Faa$T to existing systems

ML inference. Figure 9 shows the latency for the AMBER Alert pipeline and single inference with AlexNet and SqueezeNet. First, the figure shows that Faa$T LH is 50%, 87%, and 60% faster than Vanilla for the AMBER Alert pipeline, AlexNet, and SqueezeNet, respectively. This demonstrates that avoiding remote storage accesses and using cache triggers can significantly improve FaaS performance. Second, for the AMBER Alert pipeline, Faa$T's LH and RH are faster than using a Redis service, while Faa$T RH is equivalent to using a manually managed Redis VM (Pocket in Figure 9(a)). This is significant given the complexity of manually managing a Redis VM and the significant cost of using a Redis service (discussed below). Faa$T offers lower latency, while remaining transparent and relieving users of any configuration burden. Third, Faa$T's LM and RM exhibit similar latencies, with the variability coming from the accesses to remote storage. This suggests that using a multi-instance Faa$T cache does not further penalize cache miss performance. Finally, Faa$T LH and RH perform well for both small (input images and class labels) and large objects (the models).

JupyterLess. Figure 10 shows the performance of summing a 350MB DataFrame column in a JupyterLess notebook. There are two Native setups: for In-Memory (IM) the DataFrame is pre-loaded in memory before summing, while for Remote Storage (RS) the latency of remote storage access is counted as part of the summation. Similar to the ML inference applications, Faa$T LH and RH improve performance compared to Vanilla by 62% and 29%, respectively. Compared to Native RS, Faa$T’s LH and RH improve performance by 92% and 86%, respectively.

Table 4 shows the latency of running a capacity planning notebook (includes data collection and plotting), the FaaS characterization of Section 2.1, and a No-Op notebook that logs a single message. Each run has a mix of data access scenarios for Faa$T, since a local copy of the Jupyter state is saved per cell, and is read from a cachelet when executing the following cell. For the former two notebooks, the performance gap with Native IM comes from serializing plots and sending them to the notebook user interface. These results show that Faa$T can run JupyterLess notebooks interactively, and with near-native performance when the notebook is not trivial.

(a) AMBER Alert pipeline
(b) 350MB DataFrame notebook
Figure 12: Cost normalized to Native VM for ML pipeline and JupyterLess notebook. Each bar represents an inter-invocation period. The y-axis is in log-scale (lower is better). Faa$T is 50% to 99.999% cheaper than the baselines.
Notebook               | Native IM       | Faa$T
Capacity planning      | 6.0s ± 0.2s     | 8.0s ± 0.3s
FaaS characterization  | 40.4s ± 8.9s    | 68.0s ± 0.8s
No-Op                  | 0.2ms ± 0.0ms   | 34.4ms ± 8.7ms
Table 4: End-to-end latency running notebooks. Faa$T can run JupyterLess notebooks interactively.

Cost. Figures 12(a) and 12(b) show the cost-per-hour comparison of running the applications end-to-end (lower is better). Cost is normalized to that of Native, and the y-axis is in log scale. We show cost for different application invocation IaTs between 10ms and 1 hour. For context, the application with the median IaT in our characterization (Section 2.1) has an average IaT of 20 minutes. For Faa$T, we show the case where the application ran end-to-end with all local hits. We do not show the Redis service due to its large cost (Faa$T is six orders of magnitude cheaper).

The figure shows that Faa$T can provide large cost savings. For applications with infrequent invocations (e.g., once per hour), Faa$T is 99.999% cheaper than a Native VM. Per hour, Faa$T is cheaper than all baselines for all invocation intervals, except for the AMBER Alert pipeline at 10ms, where it is 33% more expensive than Native. For all other cases, Faa$T is 50% to 99.999% cheaper than caching and storage layers that require separate servers, such as Cloudburst and Pocket. From Section 2.1, 99.88% of applications have an average IaT of at least 10ms; in these cases, the cost of standing servers would almost always dominate the overall cost.
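To illustrate why standing servers dominate the cost at these inter-arrival times, the sketch below compares an always-on VM (billed every hour regardless of activity) against pay-per-use compute and caching. All prices, memory sizes, and execution times are placeholder assumptions for the arithmetic, not the paper’s measured costs or any provider’s actual pricing.

```python
# Back-of-the-envelope sketch of the cost reasoning above.
# All constants are illustrative assumptions, not measured values.

VM_PRICE_PER_HOUR = 0.10        # always-on VM (assumed)
FAAS_PRICE_PER_GB_S = 0.000016  # pay-per-use compute + cache (assumed)
MEM_GB = 0.5                    # memory billed per invocation (assumed)
EXEC_SECONDS = 1.0              # per-invocation run time (assumed)

def hourly_cost_vm() -> float:
    # A serverful cache/VM is billed whether or not it is invoked.
    return VM_PRICE_PER_HOUR

def hourly_cost_faas(iat_seconds: float) -> float:
    # Pay only while the function (and its Faa$T cache) is resident.
    invocations_per_hour = 3600.0 / iat_seconds
    return invocations_per_hour * EXEC_SECONDS * MEM_GB * FAAS_PRICE_PER_GB_S

for iat in (0.01, 1, 60, 1200, 3600):   # 10ms .. 1 hour
    vm, faas = hourly_cost_vm(), hourly_cost_faas(iat)
    print(f"IaT={iat:>7}s  VM=${vm:.4f}/h  FaaS=${faas:.6f}/h  "
          f"savings={100 * (1 - faas / vm):.3f}%")
```

With these placeholder numbers, the pay-per-use cost falls below the VM cost once the IaT exceeds a few seconds and becomes negligible at hour-scale IaTs, mirroring the trend in Figure 12.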

Discussion on comparison to Native. Even when all accesses are served with local (LH) or remote hits (RH), Faa$T is slower than a Native VM that stores all data locally and incurs no function invocation overheads. However, as Figure 12 shows, such a Native setup can be orders of magnitude more expensive than Faa$T, since all VMs must be kept running even when applications are idle, and the user is responsible for resource and data management. With Faa$T being part of the FaaS runtime, users only pay for the time resources are consumed, for both compute and caching. Moreover, local hits are on the order of hundreds of ms, which is close to interactive for many use cases.

Figure 13: AMBER Alert pipeline performance: (a) instance loaded and Faa$T pre-warms based on past history (Hybrid hist + pre-warm), (b) instance loaded but not pre-warmed (Hybrid hist), and (c) instance not loaded (Cold-start). Faa$T automatically loads objects with spatial and temporal locality to improve latency.
(a) 400KB
(b) 40MB
(c) 400MB
(d) 800MB
Figure 14: Latency of fetching an object from remote storage as the number of instances vary for increasingly large object sizes. If the instances are not loaded, they incur a cold-start; we only consider the running case for one instance. Faa$T determines whether to scale data loading across multiple instances to increase bandwidth.

6.4 Is Faa$T pre-warm effective?

In Section 6.2, we showed pre-warming is an important feature for improving existing application performance over remote storage. We now use the AMBER Alert pipeline to evaluate the effectiveness of Faa$T data pre-warming under three scenarios: the application instance was loaded before the function invocation using the hybrid histogram policy [48] and Faa$T automatically pre-warmed the three models and three labels (135MB total) based on history (Hybrid hist + pre-warm); the instance was loaded before the invocation using the hybrid histogram policy, but not pre-warmed with any objects (Hybrid hist); and the instance was not loaded before the invocation (i.e., the runtime is not deployed; Cold-start).

Figure 13 shows the performance of the pipeline for these versions. Faa$T’s data pre-warming improves latency by 58% and 74% over no pre-warming and cold-start, respectively. This is especially important if the AMBER Alert pipeline is infrequently invoked.
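The sketch below illustrates the pre-warming step evaluated here: on application load, the cachelet consults its recorded access history and fetches the historically hottest objects up to a byte budget. The helper names (blob_client.get_size, cachelet.put, and so on) are hypothetical stand-ins, not Faa$T’s internal interfaces.

```python
# Pre-warm sketch (hypothetical helper names, not Faa$T's internal API):
# on application load, fetch the historically hottest objects up to a budget.
from collections import Counter

def prewarm(cachelet, history: Counter, blob_client, budget_bytes: int) -> None:
    used = 0
    # history maps object name -> access count recorded by the cachelet
    for blob_name, _count in history.most_common():
        size = blob_client.get_size(blob_name)      # assumed metadata lookup
        if used + size > budget_bytes:
            continue                                # skip objects that do not fit
        cachelet.put(blob_name, blob_client.download(blob_name))
        used += size
```

For the AMBER Alert pipeline, this is the step that would bring the three models and labels (135MB total) back into the cache before the first inference arrives.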

Scenario                | Heap growth succeeded? | Latency
No Mem Daemon           | No                     | 235.0ms ± 3.2ms
Mem Daemon, no scaling  | Yes                    | 678.6ms ± 64.8ms
Mem Daemon, scaling     | Yes                    | 502.9ms ± 28.5ms
Table 5: Latency of running a 350MB DataFrame summation in a JupyterLess notebook after growing heap memory, and whether the heap growth succeeded. We evaluate three scenarios: (a) without the Memory Daemon, (b) with the Memory Daemon, but no cache size scaling, and (c) with Memory Daemon, and Faa$T scales to two instances. Faa$T ensures application functionality is not compromised, and improves performance by scaling.

6.5 Can Faa$T manage memory effectively?

We consider the JupyterLess notebook application that sums a 350MB DataFrame column. After loading the DataFrame and performing the summation, the application allocates an array that consumes 96% of the application’s total memory. Then, the application computes the summation of the 350MB column again, which requires reloading the DataFrame that was evicted. We show three scenarios: (a) without the Memory Daemon to evict objects when the heap memory grows, (b) the Memory Daemon evicts objects, but there is no cache size scaling, and (c) the Memory Daemon evicts objects, and Faa$T scales to two instances.

Table 5 shows the performance of the second summation of the 350MB column, and whether the array allocation completed successfully. Without the Memory Daemon to trigger object eviction, the array allocation fails, but the summation matches the performance of a local hit. When the Memory Daemon triggers object eviction but cannot scale, the array allocation succeeds, and the summation performance is a mix of local hits and misses. Finally, when the Memory Daemon triggers object eviction and scales the cache size to increase the number of remote hits, the array allocation succeeds, and the summation performance is a mix of all four data access types. Since Faa$T opportunistically uses application memory, functionality is not compromised for applications that use large amounts of heap memory.
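A minimal sketch of the behavior exercised in this experiment, under assumed hook names (the real Memory Daemon is internal to the Faa$T runtime): a daemon periodically checks heap usage, evicts cached objects when the application needs the memory, and, when possible, re-homes the evicted objects on another instance so that later accesses become remote hits rather than remote-storage misses.

```python
# Eviction/scaling sketch (hypothetical hooks; the real Memory Daemon is part
# of the Faa$T runtime and is not exposed to applications).
import time

def memory_daemon(cachelet, runtime, limit_bytes: int, period_s: float = 0.5):
    while True:
        # Application heap plus cache must fit within the instance's memory.
        over = runtime.heap_bytes() + cachelet.size_bytes() - limit_bytes
        if over > 0:
            evicted = cachelet.evict(over)          # give the heap room to grow
            if runtime.can_scale_out():
                # Re-home evicted objects on another cachelet so later accesses
                # become remote hits rather than remote-storage misses.
                runtime.scale_out(objects=evicted)
        time.sleep(period_s)
```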

6.6 Scaling as the object size varies

Faa$T can scale the number of instances based on object sizes. We consider four object sizes: 400KB, 40MB, 400MB, and 800MB. The object to be fetched is split evenly across the available Faa$T cachelets. For example, with two instances and a 400KB object, each instance downloads 200KB. The data is then processed at a single instance. For each object size, we show two cases: (a) all application instances are already running, and (b) additional instances (beyond the first) must be loaded in order to fetch the object (thus incurring a cold-start latency).
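Below is a minimal sketch of this range-partitioned fetch, with threads standing in for instances and a hypothetical blob client (download_range is an assumed helper, not a specific provider API); the real implementation uses the provider’s storage API and Faa$T’s own instance coordination.

```python
# Sketch of splitting one object download across N cachelets (assumed helper
# names; threads stand in for instances here).
from concurrent.futures import ThreadPoolExecutor

def ranges(size: int, n: int):
    """Split [0, size) into n contiguous byte ranges."""
    step = -(-size // n)  # ceiling division
    return [(i, min(i + step, size) - 1) for i in range(0, size, step)]

def parallel_fetch(blob_client, blob_name: str, size: int, n_instances: int) -> bytes:
    # Each instance would fetch one range of the object in parallel.
    with ThreadPoolExecutor(max_workers=n_instances) as pool:
        parts = pool.map(
            lambda r: blob_client.download_range(blob_name, start=r[0], end=r[1]),
            ranges(size, n_instances))
    return b"".join(parts)   # the processing instance reassembles the object
```

For a 400KB object and two instances, ranges yields two 200KB byte ranges, matching the example above.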

Figure 14 shows the time to download the increasingly large object sizes from remote storage as the number of instances varies. For small objects (400KB), the download latency is low enough that using more than one instance degrades performance, especially if the instances need to be loaded. For the 40MB object, partitioning the download across multiple instances is beneficial if the instances are already loaded. It is 47% faster to download the object with four cachelets than with one. For the 400MB object, it is 9% faster to download the object with four instances than one if the instances are not already loaded, and 47% faster to download the objects with four instances than one if the instances are already loaded. For the 800MB object, it is beneficial to use four cachelets to load the object in parallel, even if the instances need to be loaded. If the instances are already loaded, it is 60% faster to download with four cachelets compared to one. If the instances need to be loaded, it is 44% faster.

Write target | Write mode | Read target | E2E lat (s)  | Per-req. lat (ms) | # inconsist.
Storage      | Sync       | Storage     | 74.4 ± 0.2   | 80.0 ± 29.5       | 0 ± 0
Owner        | Sync       | Owner       | 42.6 ± 0.2   | 41.6 ± 12.8       | 0 ± 0
Owner        | Async      | Owner       | 39.7 ± 0.6   | 38.5 ± 11.4       | 1.3 ± 1.9
Owner        | Sync       | Local       | 32.7 ± 0.3   | 32.7 ± 9.0        | 800 ± 0
Local        | Sync       | Local       | 31.1 ± 0.3   | 30.3 ± 13.1       | 800 ± 0
Local        | Async      | Local       | 31.0 ± 0.3   | 30.3 ± 13.2       | 800 ± 0
Table 6: Latency (end-to-end and per-request) and number of inconsistencies for different write/read settings for a JupyterLess notebook counting to 1K with five instances sharing state. Inconsistencies are the absolute difference between the final counter value and 1K. Performance increases as consistency and fault tolerance decrease.

6.7 Trading off consistency and performance

Faa$T allows users to trade off consistency for performance. We evaluate the write/read settings from Table 3 using a JupyterLess notebook in which five cachelets share the state for counting up to 1K. Application instances add to the counter in round-robin fashion, and we expect the counter to reach a final value of 1K. We measure inconsistencies as the absolute difference between the final counter value and 1K. Shared counters of this kind are a critical primitive in multiplayer games [15].

Table 6 shows the end-to-end latency, per-request latency, and number of inconsistencies for each setting. As expected, latency drops as we relax consistency requirements. For example, writing to and reading from the local cache is fastest, but provides no consistency (800 inconsistencies recorded) and the lowest fault tolerance. Writing to the local cache asynchronously and reading from the local cache is equivalent to Cloudburst’s performance; Cloudburst would exhibit better consistency due to its lattice datatypes, but requires support from the datastore. Latency varies the most when writing and reading from remote storage, and the least when writing and reading from the local cache.
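The sketch below illustrates how the write-side settings in Table 6 could be realized: the local copy is always updated, and the write is then propagated to the owner cachelet or to remote storage, either synchronously or in the background. The interfaces are hypothetical; Faa$T configures these policies per application inside the runtime.

```python
# Write-path sketch for the settings in Table 6 (hypothetical interfaces).
import threading

def write(key, value, cachelet, owner, storage, target="owner", mode="sync"):
    cachelet.put(key, value)                 # the local copy is always updated
    if target == "local":
        return                               # fastest, but weakest consistency

    def propagate():
        if target == "owner":
            owner.put(key, value)            # single consistent point for readers
        elif target == "storage":
            storage.upload(key, value)       # strongest fault tolerance

    if mode == "sync":
        propagate()                          # block until the write has landed
    else:                                    # async: hides latency, may lose writes
        threading.Thread(target=propagate, daemon=True).start()
```

Reads follow the same pattern in reverse: consulting the local cache is fastest, while reading from the owner or from remote storage trades latency for consistency.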

(a) AMBER Alert pipeline
(b) 350MB DataFrame Notebook
Figure 15: Latency as the instance size varies for (a) the AMBER Alert pipeline, and (b) summing a 350MB DataFrame column in a JupyterLess notebook. Instance memory and network bandwidth scale linearly with the number of cores. Faa$T benefits from instances with higher bandwidth.

6.8 Sensitivity to instance size

Finally, we evaluate the sensitivity of applications running with Faa$T as the instance size varies. We run the AMBER Alert pipeline and sum a 350MB DataFrame column in a JupyterLess notebook. Instances scale linearly in memory and network bandwidth as the number of vCPUs increases. For example, the 2vCPU instance has 8GiB of memory and 1Gbps of network bandwidth, and the 4vCPU instance has 16GiB of memory and 2Gbps of network bandwidth.

Figures 15(a) and 15(b) show that as instances increase in size, Faa$T’s latency decreases for both applications. Larger instances have higher network bandwidth, which benefits data accesses to remote storage and between instances. The data accesses of these two applications saturate the bandwidth with the 8vCPU instance; thus, although the 16vCPU instance has the highest bandwidth, performance remains the same as with the 8vCPU instance. Some cloud providers offer instances as small as 2vCPU with up to 10Gbps bandwidth, allowing Faa$T to achieve high performance even on small instances suitable for FaaS.

7 Related Work

Ephemeral serverless storage. In Section 2.2 we describe the limitations of several existing storage and data caching solutions for FaaS [32, 55, 43, 51, 53]. Unlike these systems, Faa$T does not require external resources beyond what is provided to the invoked function, is transparent to applications, and can scale as the data size and access patterns vary.

OFC [37] is the closest work to Faa$T. (This work was done concurrently and independently of OFC.) OFC transparently caches objects using RAMCloud [39] and leverages machine learning to dynamically size the cache. Unlike OFC, Faa$T pre-warms objects when an application is loaded, supports caching of large (over 10MB) objects and optimizes their transfer latency from remote storage with bandwidth scaling, and only needs to keep one copy of data in shared memory (whereas OFC keeps a copy in the worker and another in RAMCloud). Faa$T also incurs lower decision overheads and is easier to manage because it does not rely on machine learning for its decision-making.

Serverless frameworks. Several frameworks have recently emerged enabling users to run, for example, linear algebra [28], video encoding [18], video analytics pipelines [4], ML training [12], and general burst-parallel applications [17] on up to thousands of serverless functions. These, and their applications, would benefit from managing and transferring intermediate data between serverless functions using Faa$T. Since Faa$T is transparent to applications, little to no changes would be needed to interact with Faa$T.

Improving serverless performance. There have been many approaches to reducing the execution time of serverless functions, such as making containers more lightweight [1, 38], using snapshotting techniques [11, 16], or reducing the number of cold-starts [48, 19]. Shredder [59] focuses on providing isolation for multi-tenancy. Lambada [36] focuses on improving performance for serverless applications with exchange operators. Lambdata [54] allows users to expose their data read and write intents to enable optimizations such as co-locating functions that work on the same data. These optimizations are orthogonal to Faa$T, which focuses on improving state management for serverless functions and on scaling instances to improve application performance.

Consistency and fault tolerance. Consistency and fault tolerance protocols have been heavily studied. Recent work has explored how to enable both for serverless applications. Faasm [51] and Cloudburst [53] provide local caches backed by a distributed key-value store. Faasm allows for strong consistency by using global locks at the KVS; Cloudburst provides repeatable-read and causal-consistency guarantees by using lattice data types supported by its local caches and by its Anna KVS backend [56]. AFT [52] adds a fault-tolerance shim layer for FaaS and implements protocols for read atomic isolation. Beldi [58] provides a framework for writing transactional and fault-tolerant stateful serverless functions by extending Olive [47] with a novel data structure to support fast logging and exactly-once semantics. Faa$T transparently supports different consistency and fault tolerance settings directly in the FaaS runtime, and can be extended to support future protocols.

8 Conclusion

We presented Faa$T, a transparent caching layer for serverless applications. We motivated its design with a characterization of production applications. We tie Faa$T to the application, scale based on compute demands and data access patterns, and provide data consistency that can be set per application. We implemented it in a production serverless platform. Compared to existing systems, Faa$T is on average 57% faster and 99.99% cheaper when running two challenging applications.

References

  • [1] A. Agache, M. Brooker, A. Iordache, A. Liguori, R. Neugebauer, P. Piwonka, and D. Popa (2020) Firecracker: Lightweight Virtualization for Serverless Applications. In NSDI, Cited by: §2, §4, §7.
  • [2] (2020) Amazon Go. Note: https://www.amazon.com/b?ie=UTF8&node=16008589011 Cited by: §2.3.
  • [3] (2020) Amazon S3. Note: https://aws.amazon.com/s3/ Cited by: §1.
  • [4] L. Ao, L. Izhikevich, G. M. Voelker, and G. Porter (2018) Sprocket: A Serverless Video Processing Framework. In SoCC, Cited by: §7.
  • [5] G. Aumala, E. Boza, L. Ortiz-Avilés, G. Totoy, and C. Abad (2019) Beyond load balancing: package-aware scheduling for serverless platforms. In CCGRID, Cited by: §3.2.
  • [6] (2020) Azure Function Queues. Note: https://docs.microsoft.com/en-us/azure/azure-functions/functions-bindings-storage-queue Cited by: §1.
  • [7] N. Beckmann and D. Sanchez (2017) Maximizing Cache Performance Under Uncertainty. In HPCA, Cited by: §3.4.
  • [8] N. Beckmann, H. Chen, and A. Cidon (2018) LHD: Improving Cache Hit Rate by Maximizing Hit Density. In NSDI, Cited by: §3.4.
  • [9] D. S. Berger, R. K. Sitaraman, and M. Harchol-Balter (2017) AdaptSize: orchestrating the hot object memory cache in a content delivery network. In NSDI, Cited by: §3.4.
  • [10] Amazon Web Services (2020) AWS SDK for Python (Boto3). Note: https://aws.amazon.com/sdk-for-python/ Cited by: §5.
  • [11] J. Cadden, T. Unger, Y. Awad, H. Dong, O. Krieger, and J. Appavoo (2020) SEUSS: Skip Redundant Paths to Make Serverless Fast. In EuroSys, Cited by: §2, §4, §7.
  • [12] J. Carreira, P. Fonseca, A. Tumanov, A. Zhang, and R. Katz (2019) Cirrus: A serverless framework for end-to-end ML workflows. In SoCC, Cited by: §3.2, §7.
  • [13] G. I. Chaudhry (2021) Caching of function bindings (input/output) for faster I/O. Microsoft Research. Note: https://github.com/Azure/azure-functions-host/issues/7310 Cited by: §5.
  • [14] G. I. Chaudhry (2021) Shared memory data transfer between Functions Host and out-of-proc workers. Microsoft Research. Note: https://github.com/Azure/azure-functions-host/pull/6836 Cited by: §5, §5.
  • [15] J. Donkervliet, A. Trivedi, and A. Iosup (2020) Towards Supporting Millions of Users in Modifiable Virtual Environments by Redesigning Minecraft-Like Games as Serverless Systems. In HotCloud, Cited by: §6.7.
  • [16] D. Du, T. Yu, Y. Xia, B. Zang, G. Yan, C. Qin, Q. Wu, and H. Chen (2020) Catalyzer: sub-millisecond startup for serverless computing with initialization-less booting. In ASPLOS, Cited by: §2, §4, §7.
  • [17] S. Fouladi, F. Romero, D. Iter, Q. Li, S. Chatterjee, C. Kozyrakis, M. Zaharia, and K. Winstein (2019) From Laptop to Lambda: Outsourcing Everyday Jobs to Thousands of Transient Functional Containers. In USENIX ATC, Cited by: §2.1, §3.2, §7.
  • [18] S. Fouladi, R. S. Wahby, B. Shacklett, K. V. Balasubramaniam, W. Zeng, R. Bhalerao, A. Sivaraman, G. Porter, and K. Winstein (2017) Encoding, Fast and Slow: Low-Latency Video Processing Using Thousands of Tiny Threads. In NSDI, Cited by: §7.
  • [19] A. Fuerst and P. Sharma (2021) FaasCache: Keeping Serverless Computing Alive With Greedy-Dual Caching. In ASPLOS, Cited by: §7.
  • [20] U. Gupta, S. Hsia, V. Saraph, X. Wang, B. Reagen, G. Wei, H. S. Lee, D. Brooks, and C. Wu (2020) DeepRecSys: A System for Optimizing End-To-End At-Scale Neural Recommendation Inference. In ISCA, Cited by: §2.3.
  • [21] U. Gupta, C. Wu, X. Wang, M. Naumov, B. Reagen, D. Brooks, B. Cottel, K. Hazelwood, M. Hempstead, B. Jia, et al. (2020) The Architectural Implications of Facebook’s DNN-Based Personalized Recommendation. In HPCA, Cited by: §2.3.
  • [22] K. Hazelwood, S. Bird, D. Brooks, S. Chintala, U. Diril, D. Dzhulgakov, M. Fawzy, B. Jia, Y. Jia, A. Kalro, J. Law, K. Lee, J. Lu, P. Noordhuis, M. Smelyanskiy, L. Xiong, and X. Wang (2018) Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective. In HPCA, Cited by: §3.4.
  • [23] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep Residual Learning for Image Recognition. In CVPR, Cited by: §6.1.
  • [24] J. M. Hellerstein, J. Faleiro, J. E. Gonzalez, J. Schleier-Smith, V. Sreekanti, A. Tumanov, and C. Wu (2018) Serverless Computing: One Step Forward, Two Steps Back. arXiv preprint arXiv:1812.03651. Cited by: §1, §1.
  • [25] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv. Cited by: §6.1.
  • [26] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. CoRR abs/1602.07360. External Links: Link, 1602.07360 Cited by: §6.1.
  • [27] F. Jiang, Y. Jiang, H. Zhi, Y. Dong, H. Li, S. Ma, Y. Wang, Q. a. Dong, H. Shen, and Y. Wang (2017) Artificial intelligence in healthcare: past, present and future. Stroke and Vascular Neurology 2 (4), pp. 230–243. External Links: Document, ISSN 2059-8688, Link, https://svn.bmj.com/content/2/4/230.full.pdf Cited by: §2.3.
  • [28] E. Jonas, Q. Pu, S. Venkataraman, I. Stoica, and B. Recht (2017) Occupy the Cloud: Distributed Computing for the 99%. In SoCC, Cited by: §7.
  • [29] E. Jonas, J. Schleier-Smith, V. Sreekanti, C. Tsai, A. Khandelwal, Q. Pu, V. Shankar, J. Carreira, K. Krauth, N. Yadwadkar, et al. (2019) Cloud programming simplified: A berkeley view on serverless computing. arXiv. Cited by: §3.2.
  • [30] (2020) Jupyter. Note: https://jupyter.org/ Cited by: §1, §2.3.
  • [31] D. Karger, E. Lehman, T. Leighton, R. Panigrahy, M. Levine, and D. Lewin (1997) Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the world wide web. In STOC, Cited by: §3.2.
  • [32] A. Klimovic, Y. Wang, P. Stuedi, A. Trivedi, J. Pfefferle, and C. Kozyrakis (2018) Pocket: Elastic ephemeral storage for serverless analytics. In OSDI, Cited by: §1, §1, §1, Table 2, §4, §6.1, §7.
  • [33] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, Cited by: §6.1.
  • [34] M. McFarland (2017) Futuristic cop cars may identify suspects. CNN business. Note: https://money.cnn.com/2017/10/19/technology/future/police-ai-dashcam/index.html Cited by: §2.3.
  • [35] (2020) Microsoft Azure Blob Storage. Note: https://azure.microsoft.com/en-us/services/storage/blobs/ Cited by: §1.
  • [36] I. Müller, R. Marroquín, and G. Alonso (2020) Lambada: interactive data analytics on cold data using serverless cloud infrastructure. In SIGMOD, Cited by: §7.
  • [37] D. Mvondo, M. Bacou, K. Nguetchouang, S. Pouget, J. Kouam, R. Lachaize, J. Hwang, T. Wood, D. Hagimont, N. D. Palma, B. Batchakui, and A. Tchana (2021) OFC: An Opportunistic Caching System for FaaS Platforms. In EuroSys, Cited by: Table 2, §3.1, §5.
  • [38] E. Oakes, L. Yang, D. Zhou, K. Houck, T. Harter, A. Arpaci-Dusseau, and R. Arpaci-Dusseau (2018) SOCK: Rapid Task Provisioning with Serverless-Optimized Containers. In USENIX ATC, Cited by: §2, §4, §7.
  • [39] J. Ousterhout, A. Gopalan, A. Gupta, A. Kejriwal, C. Lee, B. Montazeri, D. Ongaro, S. J. Park, H. Qin, M. Rosenblum, S. Rumble, R. Stutsman, and S. Yang (2015) The ramcloud storage system. ACM Trans. Comput. Syst.. Cited by: §7.
  • [40] (2020) Protocol Buffers. Note: https://developers.google.com/protocol-buffers/ Cited by: §5.
  • [41] F. Psallidas, Y. Zhu, B. Karlas, M. Interlandi, A. Floratou, K. Karanasos, W. Wu, C. Zhang, S. Krishnan, C. Curino, and M. Weimer (2019) Data science through the looking glass and what we found there. External Links: 1912.09536 Cited by: §2.3.
  • [42] Q. Pu, H. Li, M. Zaharia, A. Ghodsi, and I. Stoica (2016) FairRide: Near-Optimal, Fair Cache Sharing. In NSDI, Cited by: §3.1.
  • [43] Q. Pu, S. Venkataraman, and I. Stoica (2019) Shuffling, Fast and Slow: Scalable Analytics on Serverless Infrastructure. In NSDI, Cited by: §1, §1, §1, Table 2, §6.1, §7.
  • [44] V. J. Reddi, C. Cheng, D. Kanter, P. Mattson, G. Schmuelling, C. Wu, B. Anderson, M. Breughe, M. Charlebois, W. Chou, R. Chukka, C. Coleman, S. Davis, P. Deng, G. Diamos, J. Duke, D. Fick, J. S. Gardner, I. Hubara, S. Idgunji, T. B. Jablin, J. Jiao, T. S. John, P. Kanwar, D. Lee, J. Liao, A. Lokhmotov, F. Massa, P. Meng, P. Micikevicius, C. Osborne, G. Pekhimenko, A. T. R. Rajan, D. Sequeira, A. Sirasao, F. Sun, H. Tang, M. Thomson, F. Wei, E. Wu, L. Xu, K. Yamada, B. Yu, G. Yuan, A. Zhong, P. Zhang, and Y. Zhou (2020) MLPerf Inference Benchmark. In ISCA, Cited by: §2.3.
  • [45] (2020) Redis. Note: https://redis.io/ Cited by: §1.
  • [46] F. Romero, M. Zhao, N. J. Yadwadkar, and C. Kozyrakis (2021) Llama: a heterogeneous & serverless framework for auto-tuning video analytics pipelines. External Links: 2102.01887 Cited by: §2.3.
  • [47] S. Setty, C. Su, J. R. Lorch, L. Zhou, H. Chen, P. Patel, and J. Ren (2016) Realizing the Fault-Tolerance Promise of Cloud Storage Using Locks with Intent. In OSDI, Cited by: §7.
  • [48] M. Shahrad, R. Fonseca, I. Goiri, G. Irfan, P. Batum, J. Cooke, E. Laureano, C. Tresness, M. Russinovich, and R. Bianchini (2020) Serverless in the Wild: Characterizing and Optimizing the Serverless Workload at a Large Cloud Provider. In USENIX ATC, Cited by: §1, §1, §2.1, §2, §3.3, §6.4, §7.
  • [49] V. Shankar, K. Krauth, K. Vodrahalli, Q. Pu, B. Recht, I. Stoica, J. Ragan-Kelley, E. Jonas, and S. Venkataraman (2020) Serverless Linear Algebra. In SoCC, Cited by: §2.1.
  • [50] M. Shilkov (2021) Comparison of Cold Starts in Serverless Functions across AWS, Azure, and GCP. Note: https://mikhail.io/serverless/coldstarts/big3/ Cited by: §1.
  • [51] S. Shillaker and P. Pietzuch (2020) Faasm: Lightweight Isolation for Efficient Stateful Serverless Computing. In USENIX ATC, Cited by: §1, Table 2, §3.1, §7.
  • [52] V. Sreekanti, C. Wu, S. Chhatrapati, J. E. Gonzalez, J. M. Hellerstein, and J. M. Faleiro (2020) A fault-tolerance shim for serverless computing. In EuroSys, Cited by: §7.
  • [53] V. Sreekanti, C. Wu, X. C. Lin, J. Schleier-Smith, J. E. Gonzalez, J. M. Hellerstein, and A. Tumanov (2020) Cloudburst: Stateful Functions-as-a-Service. Cited by: §1, §1, §1, Table 2, §3.1, §3.1, §6.1, §7, §7.
  • [54] Y. Tang and J. Yang (2020) Lambdata: Optimizing Serverless Computing by Making Data Intents Explicit. In CLOUD, Cited by: §7.
  • [55] A. Wang, J. Zhang, X. Ma, A. Anwar, L. Rupprecht, D. Skourtis, V. Tarasov, F. Yan, and Y. Cheng (2020) InfiniCache: Exploiting Ephemeral Serverless Functions to Build a Cost-Effective Memory Cache. In FAST, Cited by: §1, §1, §1, Table 2, §3.1, §6.1, §7.
  • [56] C. Wu, J. Faleiro, Y. Lin, and J. Hellerstein (2018) Anna: A KVS for Any Scale. In ICDE, Cited by: §2.2, §7.
  • [57] N. J. Yadwadkar, F. Romero, Q. Li, and C. Kozyrakis (2019) A Case for Managed and Model-Less Inference Serving. In HotOS, Cited by: §2.3.
  • [58] H. Zhang, A. Cardoza, P. B. Chen, S. Angel, and V. Liu (2020) Fault-tolerant and transactional stateful serverless workflows. In OSDI, Cited by: §7.
  • [59] T. Zhang, D. Xie, F. Li, and R. Stutsman (2019) Narrowing the Gap Between Serverless and Its State with Storage Functions. In SoCC, Cited by: §7.
  • [60] Y. Zhang, G. Prekas, G. M. Fumarola, M. Fontoura, I. Goiri, and R. Bianchini (2016) History-Based Harvesting of Spare Cycles and Storage in Large-Scale Datacenters. In OSDI, Cited by: §4.