Modern biomedical analytical workflows typically consist of multiple applications and libraries, each with its own set of software dependencies. As a result, software containers that encapsulate executables with their dependencies have become popular to facilitate the deployment of complicated workflows and to increase their reproducibility [o2017dockstore, da2017biocontainers]. Many of these biomedical workflows are also computationally intensive, as they operate on large datasets and require significant CPU, network, and disk resources. Cloud computing has emerged as a possible solution that can provide the resources needed for computationally intensive bioinformatics analyses [dai2012bioinformatics, schadt2010computational, schadt2011cloud, lau2017cancer, reynolds2017isb, afgan2011harnessing, birger2017firecloud]. However, deployment of workflows using Infrastructure-as-a-Service (IaaS) cloud platforms requires selecting the appropriate type and quantity of virtual machines (VMs) to address performance goals while balancing hosting costs. Cloud resource type selection is presently complicated by the rapidly growing number of available VM instance types and pricing models offered by public cloud providers. For example, the Amazon, Microsoft, and Google public clouds presently offer more than 265, 204, and 35 VM types respectively, under approximately five different pricing models. Further, Google allows users to create custom VM types with unique combinations of CPUs, memory, and disk capacity. These cloud VMs are available directly, or through various container platforms. Determining the best cloud deployment requires understanding the resource requirements of the workflow. In this paper we present a tool called the Container Profiler that runs inside a container to profile workflow resource utilization. We demonstrate its utility by recording and visualizing the resource usage of a multi-stage containerized bioinformatics application.
1.1 Our Contributions
This paper presents the Container Profiler, a tool that supports profiling the computational resources utilized by software within a Docker container. Our tool is simple, easy to use, and can record the resource utilization of any Dockerized computational job. As containerized bioinformatics software becomes ubiquitous, it is essential to understand the fine-grained resource utilization of computational tasks to identify resource bottlenecks and to inform the choice of optimal cloud deployment. The Container Profiler collects metrics to characterize CPU, memory, disk, and network resource utilization at the VM, container, and process level. In addition, the Container Profiler provides tools and time-series graphing to visualize and facilitate monitoring of the resource utilization of workflows. We present a case study using a multi-stage containerized bioinformatics workflow that analyzes the unique molecular identifiers (UMI) of RNA sequencing data to illustrate the utility of our tools.
1.2 Related Work
Weingartner et al. highlight the importance of profiling the resource requirements of applications deployed in the cloud to improve resource allocation and forecast performance [weingartner2015cloud]. Brendan Gregg described the USE method (Utilization, Saturation, and Errors) as a tool to diagnose performance bottlenecks [gregg2013thinking]. Gregg's method involves checking the utilization of every resource in the system, including CPUs, disks, and memory, to identify saturation and errors. Lloyd et al. provided a virtual machine manager known as VM-scaler that integrated resource utilization profiling of software deployments to Infrastructure-as-a-Service (IaaS) cloud virtual machines [lloyd2014virtual]. The VM-scaler tool focused on the management and profiling of cloud infrastructure used to host environmental modeling web services. Lloyd et al. later extended this work by building resource utilization models that identified the most cost-effective cloud-based VM type to host environmental modeling workloads without sacrificing web service runtime or throughput [lloyd2015demystifying]. Their approach demonstrated a possible cost variance of 25% when hosting workloads across different VM types, identifying potential cost savings of up to $25,000 per 10,000 hours of compute time for hosting web service workloads on the cloud.
Cloud computing has been used to process massive RNA sequencing (RNA-seq) datasets [tatlow2016cloud, lachmann2018massive]. These workflows typically consist of multiple computational tasks, where not all tasks necessarily have the same resource requirements. Tatlow et al. studied the performance and cost profiles for processing large-scale RNA-seq data using pre-emptible virtual machines (VMs) on Google Cloud Platform [tatlow2016cloud]. The authors collected computer resource utilization metrics, including user and system vCPU utilization, memory usage, disk activity, and network activity to characterize the different computational phases of the RNA-seq workflow. Tatlow et al. observed how resource utilization can vary dramatically across different processing tasks in the workflow, while demonstrating that resource profiling can help to identify resource requirements of unique workflow phases. Juve et al. developed a pair of tools called wfprof (workflow profiling) that collect and summarize performance metrics for diverse scientific workflows from multiple domains including bioinformatics [juve2013characterizing]. Wfprof consists of two tools, ioprof to measure process I/O, and pprof that characterizes process runtime, memory usage, and CPU utilization. These tools accomplish profiling at the machine level primarily by analyzing process level resource utilization, and they do not focus on profiling containerized workflows, nor do they collect any container specific metrics.
Recently, Tyryshkina, Coraor, and Nekrutenko leveraged coarse-grained resource utilization data from historical job runs collected over 5 years on the Galaxy platform to estimate the required CPU time and memory to improve task scheduling [tyryshkina2019predicting]. This work identified the challenge of determining the appropriate amount of memory and processing resources when scheduling bioinformatics analyses at scale. However, the majority of the metrics consisted of metadata regarding job configurations rather than fine-grained operating system metrics, and resource prediction was not the focus. In addition, older jobs run on Galaxy did not typically use containers and therefore lack any container-based metrics.
2 Container Profiler: Overview
The Container Profiler tool supports profiling resource utilization, including CPU, memory, disk, and network metrics, for containerized tasks. Resource utilization metrics are obtained at three levels: virtual machine (VM)/host, container, and process. Our implementation leverages facilities of the Linux operating system that underpins Docker containers. Development and testing of the Container Profiler described in this paper were completed using Debian-based Ubuntu Linux.
The Container Profiler collects information from the Linux /proc and /sys/fs/cgroup file systems while a workload is running inside a container on the host machine. The host machine could be a physical computer such as a laptop or a virtual machine (VM) in the public cloud. The workload being profiled can be any job capable of running inside a Docker container. Figure 1 provides an overview of the various metrics collected by the Container Profiler.
Table 1. Subset of CPU, disk, and network utilization metrics profiled at the VM/host level.

| Metric | Description | Source |
| --- | --- | --- |
| vCpuTimeUserMode | CPU time for processes executing in user mode | /proc/stat |
| vCpuTimeKernelMode | CPU time for processes executing in kernel mode | /proc/stat |
| vCpuIdleTime | CPU idle time | /proc/stat |
| vCpuTimeIOWait | CPU time waiting for I/O to complete | /proc/stat |
| vCpuContextSwitches | Total number of context switches across all CPUs | /proc/stat |
| vDiskSectorReads | Number of sectors read | /proc/diskstats |
| vDiskSectorWrites | Number of sectors written | /proc/diskstats |
| vDiskReadTime | Time spent reading | /proc/diskstats |
| vDiskWriteTime | Time spent writing | /proc/diskstats |
| vNetworkBytesRecvd | Network bytes received | /proc/net/dev |
| vNetworkBytesSent | Network bytes sent | /proc/net/dev |
Host-Level Metrics: Host/VM level resource utilization metrics are obtained from the Linux /proc virtual-filesystem. The /proc filesystem is a virtual filesystem that consists of dynamically generated files produced on demand by the Linux operating system kernel that provides an immense amount of data regarding the state of the system [cp12]. Files in the /proc filesystem are generated at access time from metadata maintained by Linux to describe current resource utilization, devices, and hardware configuration managed by the Linux kernel. The Container Profiler queries the /proc filesystem programmatically at regular time intervals to obtain resource utilization statistics. Documentation regarding the Linux /proc filesystem is found on the /proc Linux manual pages [cp12] though other references provide more detailed descriptions of available metadata: [cp1, cp2, cp3, cp4, cp5, cp6, cp7, cp8, cp9, cp10, cp11].
VM-level resource utilization data is obtained from the /proc filesystem. For example, user-mode and kernel-mode CPU utilization data is obtained from the /proc/stat file. Table 1 provides a subset of CPU, disk, and network utilization metrics profiled at the VM/host level.
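As an illustration of how such metrics are read, the aggregate "cpu" line of /proc/stat can be parsed with a few lines of Python. This is a minimal sketch of the idea, not the Container Profiler's actual implementation; the helper name and returned metric names mirror Table 1 for clarity. Field order follows the proc(5) manual page (user, nice, system, idle, iowait, ...).

```python
def parse_proc_stat_cpu(text):
    """Parse the aggregate 'cpu' line of /proc/stat.

    Values are cumulative jiffies since boot, in the order documented
    in proc(5): user, nice, system, idle, iowait, irq, softirq, ...
    """
    for line in text.splitlines():
        if line.startswith("cpu "):
            fields = [int(v) for v in line.split()[1:]]
            return {
                "vCpuTimeUserMode": fields[0],    # user
                "vCpuTimeKernelMode": fields[2],  # system
                "vCpuIdleTime": fields[3],        # idle
                "vCpuTimeIOWait": fields[4],      # iowait
            }
    raise ValueError("no aggregate 'cpu' line found")

# On a live Linux host one would read the real file:
#   with open("/proc/stat") as f:
#       stats = parse_proc_stat_cpu(f.read())
```

Because these counters are cumulative, sampling them at regular intervals and differencing consecutive snapshots yields utilization rates over time.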
Container-Level Metrics: Docker relies on the Linux cgroup and namespace features to aggregate a set of Linux processes together to form a container. Cgroups were originally added to the Linux operating system to provide system administrators with the ability to dynamically control hardware resources for a set of related Linux processes [cgroups]. Linux control groups (cgroups) provide a kernel feature to both limit and monitor the total resource utilization of containers. Docker leverages cgroups for resource management to restrict hardware access to the underlying host machine and to facilitate sharing when multiple containers share the host. Linux subsystems such as CPU and memory are attached to a cgroup, enabling control over the cgroup's resources. Resource utilization of cgroup processes is aggregated for reporting purposes under the /sys/fs/cgroup virtual filesystem, and we leverage its availability to obtain container-level metrics. Cgroup files provide aggregated resource utilization statistics describing all of the processes inside a container. For example, a container's CPU utilization statistics can be obtained from /sys/fs/cgroup/cpuacct/cpuacct.stat within the container. Table 2 describes a subset of the CPU, disk, and network utilization metrics profiled at the container level by the Container Profiler.
Table 2. Subset of CPU, disk, and network utilization metrics profiled at the container level.

| Metric | Description | Source |
| --- | --- | --- |
| cCpuTimeUserMode | CPU time consumed by tasks in user mode | /sys/fs/cgroup/cpuacct/cpuacct.stat |
| cCpuTimeKernelMode | CPU time consumed by tasks in kernel mode | /sys/fs/cgroup/cpuacct/cpuacct.stat |
| cDiskSectorIO | Number of sectors transferred to or from specific devices | /sys/fs/cgroup/blkio/blkio.sectors |
| cDiskReadBytes | Number of bytes transferred from specific devices | /sys/fs/cgroup/blkio/blkio.throttle.io_service_bytes |
| cDiskWriteBytes | Number of bytes transferred to specific devices | /sys/fs/cgroup/blkio/blkio.throttle.io_service_bytes |
| cNetworkBytesRecvd | Number of bytes each interface has received | /proc/net/dev |
| cNetworkBytesSent | Number of bytes each interface has sent | /proc/net/dev |
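The cpuacct.stat file listed above has a very simple two-line format (`user <ticks>` and `system <ticks>`, in USER_HZ units). As a hedged illustration of how container-level CPU metrics can be extracted, the following sketch parses that format; the helper is our own, not part of the Container Profiler, and the output keys mirror Table 2.

```python
def parse_cpuacct_stat(text):
    """Parse a cgroup (v1) cpuacct.stat file, which reports the
    container's cumulative CPU time, split into user and system
    mode, measured in USER_HZ ticks."""
    stats = {}
    for line in text.splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return {
        "cCpuTimeUserMode": stats["user"],
        "cCpuTimeKernelMode": stats["system"],
    }

# Inside a container on a cgroup-v1 host one would read:
#   with open("/sys/fs/cgroup/cpuacct/cpuacct.stat") as f:
#       cpu = parse_cpuacct_stat(f.read())
```

Note that these paths assume cgroup v1, the hierarchy used by the Docker versions targeted in this paper; hosts using the unified cgroup v2 hierarchy expose equivalent data under different filenames.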
Process-Level Metrics: The Container Profiler also supports profiling the resource utilization of each process running inside a container by referring to per-process files under the Linux /proc filesystem. For example, the CPU utilization of a process with process ID pid can be retrieved from the file /proc/[pid]/stat. The Container Profiler captures the resource utilization data for each process running in a container. Table 3 describes a subset of the process-level metrics collected by the Container Profiler to profile the resource utilization of container processes.
Table 3. Subset of process-level metrics collected by the Container Profiler.

| Metric | Description | Source |
| --- | --- | --- |
| pCpuTimeUserMode | Amount of time this process has been scheduled in user mode | /proc/[pid]/stat |
| pCpuTimeKernelMode | Amount of time this process has been scheduled in kernel mode | /proc/[pid]/stat |
| pVoluntaryContextSwitches | Number of voluntary context switches | /proc/[pid]/status |
| pNonvoluntaryContextSwitches | Number of involuntary context switches | /proc/[pid]/status |
| pBlockIODelays | Aggregated block I/O delays | /proc/[pid]/stat |
| pResidentSetSize | Number of pages the process has in real memory | /proc/[pid]/stat |
Collection of process-level data follows a similar approach as for VM or host-level data. Within the /proc filesystem, the Linux kernel dynamically generates files that describe the resource utilization of each running process. The /proc/PID path provides access to information about the process with process id PID. As an example, information regarding CPU utilization for a process with pid 10 would be located in /proc/10/stat.
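One subtlety when reading /proc/[pid]/stat is that the second field (the command name) is parenthesized and may itself contain spaces, so a naive whitespace split can misalign all later fields. The sketch below, an illustrative helper of our own rather than the tool's code, handles this by splitting on the last ')' first; field numbers refer to proc(5).

```python
def parse_pid_stat(text):
    """Parse one /proc/[pid]/stat line. Splitting on the last ')'
    protects against command names containing spaces or parentheses."""
    head, _, tail = text.rpartition(")")
    pid = int(head.split("(", 1)[0])
    fields = tail.split()  # fields[0] is field 3 (state) in proc(5) numbering
    return {
        "pid": pid,
        "pCpuTimeUserMode": int(fields[11]),    # utime, field 14
        "pCpuTimeKernelMode": int(fields[12]),  # stime, field 15
        "pResidentSetSize": int(fields[21]),    # rss, field 24 (pages)
    }

# On a live system, for pid 10:
#   with open("/proc/10/stat") as f:
#       proc_stats = parse_pid_stat(f.read())
```

Iterating over the numeric directories under /proc and applying this parse to each yields the per-process snapshot that the Container Profiler records at every sampling interval.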
Resource utilization data collected at the VM/host, container, and process level allows characterization of resource use with increasingly greater isolation. Host-level resource metrics, for example, do not isolate background processes. This can lead to variance in measurements, as background processes may be randomly present. Profiling at the container level allows fine-grained profiling of only the resources used by the computational task. Finally, profiling at the process level allows very fine-grained profiling so that resource bottlenecks can be attributed to specific activities or tasks. The ability of the Container Profiler to characterize resource utilization at multiple levels enables high observability of the resource requirements of computational tasks. This observability can be crucial for improving job deployments to cloud platforms to alleviate performance bottlenecks and optimize performance and analysis costs.
We demonstrate the Container Profiler using unique molecular identifier (UMI) RNA sequencing data generated by the LINCS Drug Toxicity Signature (DToxS) Generation Center at the Icahn School of Medicine at Mount Sinai in New York [umi-xiong]. The scripts and supporting files for the analytical workflow used to analyze these data originated from the Broad Institute [Soumillon003236]. In addition to downloading the datasets, there are three other stages. The first stage is a demultiplexing or "split" step that sorts the reads using a sequence barcode to identify the originating sample. The second stage aligns the reads to a human reference sequence to identify the gene that produced the transcript. The final stage is the "merge" step, which counts all the aligned reads to identify the number of transcripts produced by each gene. The unique molecular identifier (UMI) sequence is used to filter out reads that arise from duplication during the sample preparation process. In the original workflow, only the most CPU-intensive part of the workflow, the alignment step, was optimized and executed in parallel. We further optimized the split and align steps of the original workflow [Soumillon003236] to decrease the running time from 29 hours to 3.5 hours [hung2019holistic]. We also encapsulated each step of the workflow in a separate Docker container to facilitate deployment and ensure reproducibility.
We adopt this UMI RNA-sequencing workflow as our case study for the Container Profiler as each stage of the workflow should have different resource utilization characteristics. Specifically, the dataset download should be limited by the network capacity. The split step writes many files and should be limited by the speed of the disk writes. The alignment step is performed by multiple CPU-intensive processes which would be largely limited by the CPU. However, it is possible that available memory capacity will limit performance in some circumstances. The final merge step involves reading many files in parallel, consuming both memory and CPU resources depending on the number of threads used.
3.1 Container Profiler can inform workflow optimization
Figure 2 shows the CPU, memory, network, and disk utilization metrics at both the container and VM/host levels over time for the RNA sequencing analytical workflow. Note that the x-axes depicting time in this figure encompass the entire workflow, incorporating the download, split, align, and merge stages. At a high level, the profile results follow the utilization patterns that we would expect. The download phase consumes network resources. The split step is the most disk-intensive step. The alignment and merge steps consume the most CPU resources. The profile data also points to areas where resource consumption may be a problem. For example, memory usage is high for all the stages. This may be due to greedy allocation by the executables, or it may indicate that more memory could benefit the workflow. Most interesting is the CPU utilization during the alignment phase. There are two steep drops in CPU usage, at the 4-hour mark and again just before 5 hours. The alignment phase uses separate threads to align different files of reads simultaneously. Near the end of the phase, most of the files will have been processed and there will be more threads than files. As a result, the CPU utilization drops as individual threads lie idle waiting for the final files to be processed. However, this under-utilization of resources lasts for almost two hours, indicating that the final files are rather large. This suggests an opportunity to improve the workflow by splitting into smaller files (which is an option in our software), or by processing the largest files first. We could not have known whether these additional steps would be worth the additional complexity without the fine-grained results from the Container Profiler.
3.2 Container-level metrics can provide useful additional information
One of our contributions with the Container Profiler is the ability to capture container-level metrics. We would expect these metrics to be similar to the host/VM-level metrics, but they could differ in that the host/VM-level metrics also encompass resources used by processes not directly involved in executing the workflow. Since we only ran one instance of our workflow on our test VM, the container metrics should be very similar to the VM/host metrics, which is the case. However, one can see differences between the disk utilization metrics during the split and alignment phases, where there are a large number of disk writes to the host file system. Docker manages these disk writes by providing the container with an internal mount point which is eventually written to a host file. The caching and management of this data is external to the container and is not captured by the container metrics, but is captured by the host metrics. In addition, during the alignment phase, intermediate results from the aligner are continuously piped to another process which reformats the intermediate output and writes the final output to a file on the host system. Multiple threads are used, more than the available number of cores, resulting in frequent context switches. The pipe management and context switching are also handled by the operating system and are captured by the host metrics rather than the container metrics. The separation of container and OS-based consumption can be useful, for example, when trying to assess effects of resource contention that may occur when multiple jobs run on the same physical host, which often happens on public clouds where the assignment of instances to hosts is controlled by the vendor.
3.3 Container Profiler can sample container and host metrics at 1-2 second resolution
For the Container Profiler to be useful, the collection of profiling metrics must have sufficiently low overhead to enable rapid sampling of resource utilization, so that many samples can be collected for time series analysis. The time required to collect the metrics limits the granularity of the profile: achieving 1-second resolution requires recording a complete profile in under 1 second (1,000 ms). However, the measurement time is not constant, but depends on the resources being utilized by the workflow and host. This is shown in the histogram in Figure 4. We note that the highest variation is for the process-level data. This makes sense, as metrics are collected for each process and the number of processes being executed varies during the execution of a complex parallel workflow. The time required to gather host- and container-level metrics is less variable because the number of metrics collected is fixed. As shown in Figure 4, 90% of the time the container- and host-level metrics are collected in less than a second, and always in under 1.5 seconds. The process metrics take longer to collect, but still less than 10 seconds in the absolute worst case. This points out an advantage of container metrics: they isolate the utilization of the application without the need to collect all of the process-level data.
3.4 Container Profiler has much lower overhead than the variation in execution time on public clouds
The Container Profiler must also not significantly impact the performance of the workflow that is being profiled. Otherwise the process of resource profiling might impact the collected metrics. While some overhead is unavoidable, ideally it should be lower than the intrinsic variation in workflow execution times.
To measure the performance impact on the RNA-seq workflow we initially attempted to assess the overhead using Amazon Elastic Compute Cloud (EC2) VMs. However, we discovered that the runtime of the RNA-seq workflow varied by more than 5% on Amazon EC2, which was more than 5x greater than the overhead of the Container Profiler. This made it impossible to accurately quantify the performance overhead, since we could not distinguish between cloud performance variance and the overhead of the Container Profiler. To effectively measure the performance overhead we profiled the workflow on a local Dell server equipped with a 10-core Intel Xeon E5-2640 v4 @ 2.4 GHz with 72GB of memory. Figure 4 depicts the performance overhead resulting from one-second sampling of resource utilization by the Container Profiler on the RNA-seq workflow on the local Dell server. Running on an isolated server greatly reduced the performance variance of running RNA-seq. We measured the worst-case overhead of the Container Profiler to be less than 1%, which equates to about 4.4 minutes for an 8-hour workflow with full-verbosity metrics collection (VM + container + process). Overhead is reduced to as little as 0.07%, or about 19 seconds for an 8-hour workflow, when only collecting VM-level metrics. Adding container-level, and especially process-level, metrics increases the amount of time it takes to collect resource utilization data. We believe that workload profiling overhead is within an acceptable level and note that even at maximum verbosity, it is substantially less than the observed performance variance for running a workflow on the public cloud. Users can reflect on our reported overhead times to make informed decisions when planning to profile their own workflows.
4.1 Implementation Details
The Container Profiler is implemented as a collection of Bash and Python scripts. Figure 5 provides an overview. When the Container Profiler is executed inside a Docker container, it snapshots the resource utilization for the host (i.e. VM), container, and all processes running inside the container producing output statistics to a .json file. A sampling interval (e.g. once per second) is specified to configure how often resource utilization data is collected to support time series analysis for containerized applications and workflows. Time series data can be used to train mathematical models to predict the runtime or resource requirements of applications and workflows. Time series data can also be visualized using plotly Python graphing scripts that are included with the Container Profiler.
To improve the periodicity of time series sampling, we subtract the observed run time of the Container Profiler for each sample collection from the configured sampling interval (e.g. 1 second) in rudataall.sh. This approach notably improved the periodicity of sampling when the container was under load, improving our ability to obtain the expected number of one-second samples for long-running workflows. As an added feature, we also include timestamps in the output JSON for when each resource utilization metric is sampled. These timestamps enable precise calculation of the time that transpires between resource utilization samples for each metric. This allows the rate of consumption of system resources (e.g. CPU, memory, disk/network I/O) to be precisely determined throughout the workflow. The Container Profiler consists of four scripts depicted in Figure 5: processpack.sh, runDockerProfile.sh, ru_profiler.sh, and rudataall.sh.
The processpack.sh script is intended to be modified by the user and is used to initiate profiling. Specifically in processpack.sh, the user is responsible for providing a Docker image that includes the application to be profiled, and the command to launch the containerized application.
The runDockerProfile.sh script takes as arguments the name of the file containing the command from the processpack.sh script and the amount of time in seconds between samples. The runDockerProfile.sh script then builds a Docker run command that runs the container and also mounts a user-specified directory from the host into the container's /data directory. Mounting the host directory facilitates providing the Container Profiler's Bash scripts to the container. Mounting the data directory also enables the Container Profiler to export JSON files describing resource utilization out of the container to the host. The user also modifies runDockerProfile.sh to provide the name of the container to be run.
The ru_profiler.sh script takes two parameters: the run command built in the previous script and the time interval between snapshots. The run command is executed by ru_profiler.sh, which also records the current time before and after invoking rudataall.sh. The ru_profiler.sh script then calculates the profiling time and sleeps for the remainder of the sampling interval before repeating the loop. The rudataall.sh script collects the resource utilization data. Specifically, this script takes a snapshot of the resource utilization metrics and records output to a JSON file using the time of the sample as a unique filename. The script also accepts the parameters -v, -c, and -p to inform the tool what type of data to collect: VM-, container-, and/or process-level metrics respectively. The default behavior when running this script without any parameters is to collect all metrics. A user can adjust the verbosity of resource utilization profiling by modifying the ru_profiler.sh script.
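The drift-compensating sampling loop described above can be sketched in a few lines of Python. This is an illustrative model of the logic in ru_profiler.sh and rudataall.sh, not the scripts themselves; the function names are our own.

```python
import time

def remaining_sleep(interval, elapsed):
    """Time left in the sampling interval after a snapshot that took
    `elapsed` seconds; clamped so an overrun never yields a negative sleep."""
    return max(0.0, interval - elapsed)

def sampling_loop(take_snapshot, interval=1.0, n_samples=10):
    """Drift-compensating loop: subtract each snapshot's own runtime
    from the configured interval so samples stay periodic under load."""
    for _ in range(n_samples):
        start = time.monotonic()
        take_snapshot()  # collect metrics and write one JSON sample
        elapsed = time.monotonic() - start
        time.sleep(remaining_sleep(interval, elapsed))
```

Without the subtraction, each iteration would take (snapshot time + interval), and a loaded container would silently produce fewer samples per minute than expected; the clamp to zero covers the case where a snapshot overruns the interval entirely.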
4.2 Technical details using our scripts
To use the Container Profiler scripts with any container, a Linux-based Docker container that encapsulates a script or job to run inside is required. To configure the Container Profiler to profile the container, two files are modified inside the Container Profiler: processpack.sh and runDockerProfile.sh. In processpack.sh, the user launches the container's job or task to be profiled. This can be done by calling a script, command, or executable available inside the container to initiate the work. Inside runDockerProfile.sh, two variables named "ContainerName" and "HostPath" need to be set. ContainerName is the name of the container to profile, and HostPath is the path where the tool runs. Once these two scripts are modified, setup is finished. To start profiling a container, call runDockerProfile.sh; this script will create the specified container and run the job, while ru_profiler.sh outputs JSON data from the container to the specified path.
The Container Profiler includes graphing scripts that support the creation of time-series graphs to help visualize Linux resource utilization metrics. Graphs are saved locally and can be rendered dynamically in a browser. Resource utilization samples collected by the Container Profiler are stored as JSON files in a specified data directory during profiling. After profiling, and once the graphing scripts and their dependencies are installed, time-series graphs can be generated by specifying the data directory and the sampling interval for plotting. By default, graphs are generated for every metric; alternatively, a specific set of metrics can be specified for graphing. Our graphing library uses the source profiling data, typically collected at a one-second sampling interval, to automatically generate metric deltas based on the desired sampling interval being graphed. The delta_configuration.ini file captures configuration details for how these deltas should be derived for each metric. Additionally, graph_generation_config.ini captures default graphing behavior for how specific metrics should be graphed. Figure 2 shows sample output graphs depicting CPU, memory, disk, and network utilization for the DToxS RNA sequencing workflow.
5 Availability of supporting data and materials
Project name: Container Profiler
Project webpage: https://github.com/wlloyduw/ContainerProfiler
Contents available for download: Docker Images, Dockerfiles, installation scripts, and execution scripts.
Operating system(s): Linux, Mac OS X, Microsoft Windows.
Programming language(s): Bash, Python
License: MIT License
6 List of abbreviations
AWS: Amazon Web Services; EC2: Elastic Compute Cloud; VM: virtual machine; CPU: central processing unit; IaaS: Infrastructure-as-a-Service; RNA-seq: RNA sequencing; LINCS: Library of Integrated Network-Based Cellular Signatures; DToxS: Drug Toxicity Signature; RNA: ribonucleic acid; cgroup: control group;
7 Authors' Contributions
LHH, HD, RS, and DP contributed to the development of the Container Profiler. LHH implemented Docker containers for RNA-seq workflows. RS, NA, and DP conducted performance testing and empirical experiments. KYY, RS, WL, and LHH drafted the manuscript. WL, KYY, and LHH designed the case study. WL provided cloud computing expertise. WL and KYY coordinated the empirical study. All authors edited the manuscript.
This work is licensed under the Creative Commons Attribution - NonCommercial - NoDerivatives 4.0 International License (CC BY-NC-ND 4.0). To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
LHH, HD, RS, WL, and KYY are supported by the National Institutes of Health (NIH) grant R01GM126019. DP is supported by the NIH Diversity Supplement R01GM126019-02S2. WL is also supported by NSF grant OAC-1849970. We acknowledge support from the AWS Cloud Credits for Research (awarded to LHH, WL, and KYY).