Towards Demystifying Intra-Function Parallelism in Serverless Computing

Serverless computing offers a pay-per-use model with high elasticity and automatic scaling for a wide range of applications. Since cloud providers abstract most of the underlying infrastructure, these services effectively behave as black boxes. As a result, users can influence the resources allocated to their functions, but might not be aware that they have to parallelize them to profit from the additionally allocated virtual CPUs (vCPUs). In this paper, we analyze the impact of parallelization within a single function and container instance for AWS Lambda, Google Cloud Functions (GCF), and Google Cloud Run (GCR). We focus on compute-intensive workloads since they benefit greatly from parallelization. Furthermore, we investigate the correlation between the number of allocated CPU cores and vCPUs in serverless environments. Our results show that the number of cores available to a function/container instance does not always equal the number of allocated vCPUs. By parallelizing serverless workloads, we observed cost savings of up to 81% for AWS Lambda, 49% for GCF, and 69.8% for GCR.





1. Introduction

With the advent of Amazon Web Services (AWS) Lambda in 2014, serverless computing has gained popularity and wider adoption in different application domains such as machine learning (Chadha2020; carreira2019cirrus; Jiang2021), scientific computing (nanopore; chard2020funcx; jindal2021function; postericdcs), and linear algebra (serverlesslin). In serverless computing, developers do not have to manage infrastructure themselves but completely hand over this responsibility to a Function-as-a-Service (FaaS) platform. Several open-source and commercial FaaS platforms, such as OpenWhisk, Google Cloud Functions (GCF), and Lambda, are currently available. Applications are developed as small units of code, called functions, that are independently packaged, uploaded to a FaaS platform, and executed on event triggers such as HTTP requests. On function invocation, the FaaS platform creates an execution environment (instance) which provides a secure and isolated runtime environment for the function. Functions can be written in various languages such as Java, Go, or Node.js, and a language-specific environment, called a runtime, is created in the function's execution environment. However, due to the limitations on the available runtimes in commercial FaaS platforms such as GCF, serverless Container-as-a-Service (CaaS) offerings such as Google Cloud Run (GCR) (GCR) have been introduced. GCR is a fully-managed service based on Knative (knative). CaaS provides developers greater flexibility and allows them to build custom container images for their functions.

While most details about backend infrastructure management are abstracted away from the user by commercial FaaS and serverless CaaS platforms, they still allow developers to configure the amount of memory and, for GCR, the number of vCPUs allocated to a function/container instance (LambdaConfig; GCRConfig). As a result, each function/container instance has a fixed number of CPU cores and a fixed amount of memory associated with it. For commercial FaaS platforms such as Lambda and GCF, the performance of a function is directly related to the amount of memory configured for it. This is because these platforms increase the compute capacity available to a function, i.e., the number of vCPUs or the fraction of CPU time, when more memory is configured (behind). Serverless is advertised as a pay-per-use model, where users are billed based on the execution time of their functions, measured in 100ms (GCR/GCF) or 1ms (Lambda) intervals. However, due to the billing policies followed by the cloud providers, increasing the amount of memory often increases costs, since fees are charged per GB-second (and per GHz-second with GCF/GCR (GCFPricing; GCRPricing)). Figure 1 (left) compares the average execution time and cost (LambdaPricing) for the modified MVT benchmark (npbench) when deployed on Lambda. Although the average execution time decreases when more memory is configured, the cost increases significantly. Furthermore, beyond a certain memory configuration (2048MB), allocating more memory does not considerably affect the function execution time. Serverless FaaS/CaaS platforms launch function instances on the platform's traditional Infrastructure-as-a-Service (IaaS) virtual machine (VM) offerings (behind; chadha2021architecture). However, the provisioning of such VMs is abstracted away from the user. As a result, the user is not aware of details of the provisioned VMs such as the number of CPU cores and virtual CPUs (vCPUs).
Figure 1 (right) shows the number of CPU cores available to the function for the different memory profiles on AWS Lambda. We obtain the number of available cores using the Linux proc filesystem. Since the native implementation of the MVT benchmark is single-threaded, it cannot utilize the underlying cores, leading to resource under-utilization. To this end, parallelizing functions can lead to a significant reduction in execution time and thus reduced costs.

Figure 1. Left: average execution time and cost for the modified MVT benchmark (npbench) in C++ using AWS Lambda (us-east1). Right: Number of CPU cores allocated for the different memory profiles for AWS Lambda.

Our key contributions are:

  • Identification of #CPU cores and vCPU allocations: We investigate the number of allocated CPU cores in FaaS/CaaS platforms and how they are mapped to the allocated vCPUs.

  • Intra-function parallel workloads: We modify and parallelize three different compute-intensive serverless workloads (we use the terms serverless workload and function interchangeably), using C++, Java, and Go. We execute these workloads on AWS Lambda, GCF, and GCR, and analyze their execution times.

  • Cost analysis: We demonstrate the benefits of parallelizing functions and discuss conditions when parallelization can be beneficial.

The rest of the paper is structured as follows. §2 describes previous work on inter-function parallelism and current strategies for performance optimization of FaaS functions. §3 describes our methodology. In §4, our experimental results are presented. Finally, §5 concludes the paper and presents an outlook.

2. Related Work

The majority of previous work on parallelizing applications via FaaS has focused on splitting the workload across separate function instances, i.e., inter-function parallelism (serverlesslin; parallelworkloads; 10.1145/3361525.3361535). Shankar et al. (serverlesslin) showed that the elasticity provided by serverless computing can be used to efficiently execute linear algebra algorithms, which are inherently parallel. They implemented a system that splits a linear algebra algorithm into tasks which are then executed by AWS Lambda functions. Data between function instances is shared via a persistent object store. With their system, they achieved performance within a factor of two of a server-centric Message Passing Interface (MPI) (mpi_standard) implementation. Pons et al. (parallelworkloads) analyzed the performance of fork/join workflows using existing services such as AWS Step Functions (awsstep), Azure Durable Functions (azuredurable), and OpenWhisk Composer (owcomposer). They found that, while OpenWhisk Composer offers the best performance, all of the function orchestration solutions incur significant overhead for executing parallel workloads. In (10.1145/3361525.3361535), the authors present a system called Crucial that allows developers to program highly-concurrent stateful applications on serverless architectures. Crucial ports multi-threaded applications to a serverless environment by leveraging a distributed shared memory layer and coordinates functions via shared objects. With Crucial, the authors achieved performance similar to that of an equivalent Spark cluster. While distributing workloads across function invocations can be useful due to the high elasticity, results show that it can still lead to significant communication and synchronization overhead. In contrast, we focus on intra-function parallelism, where we parallelize functions and execute them within a single function instance.

Evaluating the general performance of FaaS platforms and improving the performance of FaaS functions have also been actively researched (chadha2021architecture; 10.1145/3447545.3451173). In (10.1145/3447545.3451173), the authors present the Serverless Application Analytics Framework (SAAF) to improve observability of the performance of FaaS functions on commercial FaaS platforms. SAAF currently supports multiple FaaS platforms and several different programming languages. In our previous work (chadha2021architecture), we examined the underlying processor architectures on GCF and demonstrated the usage of Numba (numba), a Just-in-Time (JIT) compiler based on LLVM, for optimizing and improving the performance of compute-intensive Python-based FaaS functions. We showed that the optimization of FaaS functions can improve performance by x and save costs by %. However, all of the previous approaches evaluate the performance of single-threaded FaaS functions. In contrast, we evaluate and analyze the performance of parallelized FaaS functions. Furthermore, we investigate different function configurations and evaluate their respective parallel efficiency. Moreover, we demonstrate significant cost savings for parallelized functions as compared to their sequential implementations.

3. Methodology

In this section, we first describe the different compute-intensive serverless workloads used in this work. Following this, we describe the different language runtimes we used for adapting and modifying the different workloads. Finally, we describe our benchmarking workflow.

3.1. Serverless Workloads

For our experiments, we chose two microbenchmarks, i.e., Atax and MVT from NPBench (npbench; ziogas2021npbench), and one application, i.e., Monte Carlo from PyPerf (pyperf). Both Atax and MVT take a JSON file as input describing the input matrix and vector sizes. Atax performs a matrix-vector product, followed by multiplying the result again with the matrix. MVT, on the other hand, performs two matrix-vector products, followed by adding the results to different vectors. The Monte Carlo simulation estimates the digits of π. It generates random points in a square and counts all points whose distance to the center is less than 1. The ratio of such points to the total number of points approximates π/4, which is used to estimate π. It takes a JSON file as input specifying the number of iterations for the simulation. We port the different workloads, initially written in Python, to the different language runtimes used in this work (§3.2).

3.2. Language Runtimes

To evaluate the different services, i.e., AWS Lambda, GCF, and GCR wrt the performance of parallelized functions, we chose multiple programming languages, i.e., C++, Go, and Java. We chose C++ since it is widely used for scientific computing in high performance computing applications (nanopore). However, none of the major commercial FaaS platforms support C++ by default. For executing C++ functions on AWS Lambda, we use the custom C++ Lambda runtime (AWSLambdaRuntimeC++) based on the Lambda Runtime API (AWSLambdaRuntimeAPI). For GCF, it was not possible to run C++ functions since it does not support custom runtimes. On the other hand, for GCR we use a custom docker image and install the required dependencies to build the C++ function.

We chose Go since it is widely used and supported by default on Lambda and GCF. Furthermore, it was designed with concurrency in mind, which simplifies the parallelization of functions. As our final language, we chose Java, not only due to its popularity but also due to differences in its design as compared to the other two languages. In contrast to C++ and Go, Java applications are compiled to bytecode which is executed by the Java Virtual Machine (JVM). Similar to Go, Java is available by default on both Lambda and GCF. For running Go, C++, and Java based functions on GCR, we used custom docker images with the required dependencies. The details about the different language runtimes, i.e., their versions, compilers, compiler flags, and the different GCR images, are shown in Table 1.

For parallelizing the different serverless workloads (§3.1) using the different languages, we first identified suitable regions using profilers. Following this, we used additional libraries or language-specific features to parallelize those regions. For C++, we used OpenMP which is commonly used for shared memory programming. OpenMP allows developers to annotate code using pragmas which are automatically used by the compiler to generate multi-threaded code. In the case of Go, we made use of goroutines, which are lightweight threads having their own stack. For Java, we utilized the ExecutorService class which allows developers to create a thread pool and submit tasks to be executed by the threads. While OpenMP supports automatic division of work between threads, for Go and Java, we had to manually split the workload between the threads.
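For Go, the manual work splitting described above can be sketched as follows: the rows of a matrix-vector product (the core operation of Atax and MVT) are divided into contiguous chunks, with one goroutine per chunk and a sync.WaitGroup as the join point. This is a minimal illustration under our own variable names, not the paper's implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// parMatVec computes y = A*x, splitting the rows evenly across nw
// goroutines. Each goroutine writes to a disjoint slice of y, so no
// locking is needed; the WaitGroup provides the final join.
func parMatVec(a [][]float64, x []float64, nw int) []float64 {
	n := len(a)
	y := make([]float64, n)
	var wg sync.WaitGroup
	chunk := (n + nw - 1) / nw // ceiling division
	for w := 0; w < nw; w++ {
		lo, hi := w*chunk, (w+1)*chunk
		if hi > n {
			hi = n
		}
		if lo >= hi {
			continue // more workers than rows
		}
		wg.Add(1)
		go func(lo, hi int) {
			defer wg.Done()
			for i := lo; i < hi; i++ {
				s := 0.0
				for j, v := range a[i] {
					s += v * x[j]
				}
				y[i] = s
			}
		}(lo, hi)
	}
	wg.Wait()
	return y
}

func main() {
	a := [][]float64{{1, 2}, {3, 4}, {5, 6}}
	x := []float64{1, 1}
	fmt.Println(parMatVec(a, x, 2)) // [3 7 11]
}
```

The disjoint-write pattern avoids the synchronization overhead that, as §4.3 shows, limits the speedup of workloads with many parallel regions.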

Language | Parallelization  | Version                   | Compiler & Flags | GCR Image
C++      | OpenMP           | AWS: C++11, GCR: C++17    | g++, -O3         | debian:buster-slim
Go       | goroutines       | AWS, GCR: 1.16; GCF: 1.13 | gc, GOOS=linux   | debian:buster-slim
Java     | ExecutorService  | Java 11                   | OpenJDK 11       | maven:3.8-jdk-11

Table 1. Runtime configurations. Includes parallelization technique, version, compiler, and flags. For GCR, the container deployment image is also mentioned.
Figure 2. Different steps involved in our benchmarking workflow.

3.3. Benchmarking Workflow

Initially, we deploy all the serverless workloads (§3.1) using the respective command line interfaces (CLIs) provided by AWS and Google. Figure 2 shows the different steps involved in our benchmarking process. To facilitate automatic function invocation and collection of function logs, we implement multiple Python scripts. At the beginning, the user provides the service type, serverless workload, and language runtime (§3.2) as input to the Python script. Following this, the function configuration according to the input parameters is retrieved from a JSON file. The function configuration contains the parameters relevant for invoking the function, such as the function URL. In the main loop of the Python script, we repeatedly invoke the function synchronously, i.e., we await the function result before invoking it again. For Lambda, we use the aws CLI for invoking the functions, while for GCF/GCR the functions are triggered using HTTP requests. Each function takes a JSON file as input that specifies its input size and the number of threads to utilize. We execute each serverless workload 20 times: 10 times sequentially and 10 times with multiple threads. Following this, we update the memory configuration of the function using the respective CLIs. After all function executions have finished, we store the results in a JSON file. For Lambda, all the required data can be retrieved from the function response, while for GCF/GCR we use the gcloud CLI.



4. Experimental Results

In this section, we describe our experimental setup and present results wrt performance and costs for the parallelized serverless workloads for the different services. Finally, we discuss the impact of cold starts in our experiments.

Figure 3. Number of available CPU cores for the different services at the different memory configurations.
Figure 4. Obtained average speedups for the different parallelized workloads on AWS Lambda, GCF, and GCR (panels (a)-(i): Atax, MVT, and Monte Carlo for AWS Lambda (a-c), GCF (d-f), and GCR (g-i)). For a particular memory configuration, the red line shows the ideal speedup wrt the number of available CPU cores.

4.1. Experimental Setup

For GCF and Lambda, we deployed the different workloads (§3.1) using the memory profiles MB, MB, MB, and MB. For Lambda, we also utilized the highest available memory configuration for a function, i.e., MB. In contrast to GCF and Lambda, GCR allows developers to configure the number of vCPUs allocated to a container along with the memory. For GCR, we used the same memory configurations as for GCF, except for MB. In this case, we configured the workloads with MB of memory, since that is the minimum memory required to allocate 4vCPUs. We allocate 4vCPUs for every memory configuration in GCR except for MB, where we allocate 2vCPUs. To reduce variance in performance measurements for the serverless workloads due to cold starts (10.1145/3447545.3451173), we set the maximum number of concurrent instances for all services to one. Furthermore, we set the maximum number of concurrent requests that can be handled by a container in GCR to one. This is done to prevent the sharing of vCPUs while handling multiple simultaneous requests. We deploy all functions on GCR and GCF in the us-central1 region. For Lambda, all functions are deployed in the us-east1 region.

4.2. #CPU cores to vCPU mapping

For the different services, language runtimes (§3.2), and configurations (§4.1), we identified the number of available CPU cores. We obtained the number of available cores for the function/container instance using the Linux proc filesystem. The number of available CPU cores for the different services at the different memory profiles is shown in Figure 3. For Lambda, we observed at least two CPU cores for every memory configuration. Lambda allocates one full vCPU per MB of allocated function memory (LambdaConfig). This implies that the number of allocated vCPUs is not always equal to the number of available CPU cores. For instance, for a memory configuration of MB on Lambda, we observed two CPU cores despite not getting a full allocated vCPU. Moreover, Lambda always rounds up the number of available CPU cores, as shown in Figure 3. For example, MB translates to 2.3vCPUs, but we observed three CPU cores for that specific configuration. For GCF, we always observed two CPU cores irrespective of the configured function memory. Similarly, for GCR, we observed at least two CPU cores for the different configurations (§4.1). However, for the container with the Java language runtime (§3.2), we observed only one CPU core when configured with 1vCPU. For allocations greater than 1vCPU in GCR, the number of available CPU cores is always equal to the number of configured vCPUs. Note that, although not shown in Figure 3, AWS Lambda provides four CPU cores for function instances configured with GB of memory.

A possible explanation for observing two CPU cores at lower memory configurations on AWS Lambda and GCF is Hyper-Threading, i.e., Simultaneous Multithreading (SMT) (eggers1997simultaneous), present in modern Intel server processor families such as Haswell-EP, Broadwell-EP, Skylake-SP, and Cascade Lake-SP. As shown in previous works (chadha2021architecture; behind), these are the processor families found in the virtual machines of the commercial FaaS providers on which the function/container instances are launched.

Figure 5. Comparison of cost per million function invocations (in USD) for the different workloads on Lambda (panels (a)-(i): Atax (a-c), MVT (d-f), and Monte Carlo (g-i), each in C++, Go, and Java). The cost values highlighted in red represent the maximum cost savings obtained across the different memory configurations.

4.3. Comparing Performance

Figure 4 shows the average speedups obtained for the different serverless workloads (§3.1) across the different memory configurations (§4.1) for the different services. For a particular memory configuration, we compute the speedup by dividing the mean execution time of the sequential serverless workload by the mean execution time of the parallelized version. Note that, when executing a parallelized workload, we use the maximum number of available CPU cores for a particular memory configuration (§4.2). For Lambda and GCF, we do not observe any significant speedup for lower memory configurations, i.e., less than MB. This is because, in Lambda and GCF, each function instance has a fixed memory and fraction of allocated CPU cycles. Since both FaaS platforms do not allocate two complete vCPUs for lower memory configurations, utilizing the underlying two CPU cores does not improve performance. For Lambda, two vCPUs are allocated for memory configurations greater than MB, while for GCF they are allocated at a memory configuration of MB (GCFPricing). This is also apparent from the increase in speedup observed for GCF when switching from MB to MB, as shown in Figures 4(d), 4(e), and 4(f). For memory configurations greater than MB, we observed speedups close to the number of available vCPUs for both GCF and Lambda. This shows that irrespective of the number of CPU cores available to a function instance, the performance of a parallelized function is limited by the allocated vCPUs.
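The speedup computation above, together with the parallel efficiency it implies, amounts to two small formulas; a sketch in Go (the timing values in main are illustrative, not the paper's data):

```go
package main

import "fmt"

// speedup divides the mean sequential execution time by the mean
// parallel execution time, as described in §4.3.
func speedup(seqMeanMs, parMeanMs float64) float64 {
	return seqMeanMs / parMeanMs
}

// efficiency relates the achieved speedup to the number of cores
// used; 1.0 would be the ideal (red) line in Figure 4.
func efficiency(s float64, cores int) float64 {
	return s / float64(cores)
}

func main() {
	s := speedup(812.4, 231.3) // illustrative mean timings in ms
	fmt.Printf("speedup %.2f, efficiency %.2f on 4 cores\n", s, efficiency(s, 4))
}
```

An efficiency well below 1.0 at a given configuration signals that the workload is bounded by the allocated vCPU fraction rather than by the visible core count.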

Service | Benchmark   | Max. cost savings | Runtime | Memory
GCF     | Atax        | 49.1%             | Java    | 4096MB
GCF     | MVT         | 39.7%             | Java    | 4096MB
GCF     | Monte Carlo | 44.1%             | Java    | 8192MB
GCR     | Atax        | 59.5%             | Go      | 2148MB
GCR     | MVT         | 63.4%             | Go      | 2148MB
GCR     | Monte Carlo | 69.8%             | Go      | 2148MB

Table 2. Maximum obtained cost savings, language runtime, and memory configuration for the different serverless workloads on GCF/GCR.

For Lambda, the parallelized versions of the Atax and MVT microbenchmarks perform consistently worse than Monte Carlo for higher memory configurations. This is because both microbenchmarks have a greater number of parallel regions than Monte Carlo and require synchronization and communication between the application threads. Moreover, the Go and C++ implementations of the serverless workloads perform better than the Java-based implementations, as shown in Figures 4(a), 4(b), 4(c), 4(g), 4(h), and 4(i). This can be attributed to the performance degradation of parallel implementations using Java threads with an increasing number of application threads and increasing communication (10.1145/1596655.1596661). For GCF, the Go implementations perform worse as compared to Lambda and GCR, since GCF uses an older Go runtime version. As a sanity check, we could reproduce these results by using golang:1.13-buster as the container image for Go on GCR. For GCR, the obtained speedup values are similar to those for Lambda, i.e., close to the number of allocated vCPUs and therefore capped at four. Note that for GCR, we obtained speedups even for the lowest memory configurations, since we were able to allocate two complete vCPUs. The observed performance and speedup depend on the parallelization efficiency of the different serverless workloads (§3.1).

4.4. Comparing Costs

To calculate costs for the different serverless workloads (§3.1), we use the obtained mean execution time across the different memory configurations. For Lambda, we round up the execution time to the nearest 1ms increment, while for GCF and GCR it is rounded up to the nearest 100ms increment. Following this, we use the rounded-up mean execution time to calculate function compute time in terms of the different units defined by the providers (LambdaPricing; GCFPricing; GCRPricing). For Lambda, the compute time depends on the amount of allocated memory, i.e., GB-seconds; for GCF, it depends on the configured memory and the allocated CPU clock cycles, i.e., GB-seconds and GHz-seconds; and for GCR, it depends on the configured memory and the allocated vCPUs, i.e., GB-seconds and vCPU-seconds. The different providers define a fixed price for one second of compute time depending on the deployment region. We use the pre-defined price values specified by the cloud providers for calculating the compute costs for the serverless workloads. In our calculations, we exclude free tiers and networking costs. Moreover, we calculate costs per million function invocations. As a result, a fixed request price of $ and $ is added for Lambda and GCF/GCR, respectively.

Figure 5 shows the cost comparison for the sequential and parallelized serverless workloads (§3.1) across the different memory configurations on Lambda. For Lambda, the difference in costs between the sequential and parallelized serverless workloads is not significant for lower memory configurations, i.e., less than MB. However, for the sequential serverless workloads the costs increase significantly for higher memory configurations, while for the parallelized versions the increase in cost is considerably smaller. For instance, the increase in average cost from the lowest to the highest memory configuration for the workloads parallelized using C++ is %, while for the sequential versions it is %. Therefore, by efficiently parallelizing serverless workloads we can obtain improved performance (§4.3) at approximately the same cost. Overall, we observed that parallelizing functions with C++ leads to the lowest costs, as shown in Figure 5. For the sequential workloads, Java is the cheapest, except for lower memory configurations. For Lambda, we obtained maximum cost savings of % for the Go implementation of the Monte Carlo workload, as shown in Figure 5(h).

Due to space limitations, we do not present detailed cost analysis results for GCF/GCR but summarize our findings in Table 2. We obtain the maximum cost savings for GCF with a memory allocation of at least 4096MB, which also corresponds to the highest obtained speedups (§4.3). For GCR, the maximum cost savings are obtained for the memory configuration of 2148MB with four vCPUs.

4.5. Impact of Cold Starts

From our experiments, we observed that cold start latency is equally long for the sequential and parallel implementations of the serverless workloads (§3.1). Since cold starts constitute a fraction of the function execution time, they can, if large enough, impact the speedup and cost savings obtained from parallelizing serverless workloads. The average billable cold start latency for the different services and language runtimes (§3.2) is shown in Figure 6. While AWS Lambda directly provides the billed cold start time in its function logs, for GCF and GCR we compute the billed cold start latency by subtracting the billed function execution time on a warm start from that on a cold start. In our experiments, we did not observe a significant difference in cold start latencies across the different serverless workloads and function memory configurations; the values shown in Figure 6 are therefore averaged across all benchmarks. However, we observed a significantly higher cold start latency for the Java runtime on GCR as compared to the C++ and Go runtimes. A possible explanation for this is the substantial time required for the Java Virtual Machine (JVM) to warm up.

5. Conclusion & Future Work

In this paper, we analyzed the effect of parallelizing compute-intensive serverless workloads within a function/container instance in terms of performance and costs for AWS Lambda, Google Cloud Functions, and Google Cloud Run. We identified that, for the different services, the number of CPU cores available to the function/container instance does not always equal the number of allocated vCPUs. Furthermore, we demonstrated that parallelizing serverless workloads can significantly improve performance and lead to cost savings. For Lambda, we observed cost savings up to 81%, for GCF up to 49%, and for GCR up to 69.8%.

Figure 6. Average billable cold start latency for the different runtimes across the different services.

In the future, we plan to evaluate other serverless offerings and language runtimes. Furthermore, investigating a hybrid approach between inter and intra-function parallelism to reduce synchronization overhead is another future direction.

6. Acknowledgement and Reproducibility

This work was supported by the funding of the German Federal Ministry of Education and Research (BMBF) in the scope of the Software Campus program.

All code artifacts related to this work are available online. We also evaluated other parallelized workloads which could not be included in this paper due to page limitations; please refer to (michaelkienerthesis) for our comprehensive evaluation results.