Profiling Resource Utilization of Bioinformatics Workflows

05/23/2020 ∙ by Huazeng Deng, et al. ∙ 0

We present a software tool, the Container Profiler, that measures and records the resource usage of any containerized task. Our tool profiles the CPU, memory, disk, and network utilization of a containerized job by collecting Linux operating system metrics at the virtual machine, container, and process levels. The Container Profiler can produce utilization snapshots at multiple time points, allowing for continuous monitoring of the resources consumed by a container workflow. To investigate the utility of the Container Profiler, we profiled the resource utilization of a multi-stage bioinformatics analytical workflow (RNA sequencing using unique molecular identifiers). We examined the collected profile metrics and confirmed that they were consistent with the expected CPU, disk, and network resource utilization patterns for the different stages of the workflow. We also quantified the profiling overhead and found it to be negligible. The Container Profiler can be used to continuously monitor the resource consumption of long and complex containerized workflows that run locally or on the cloud, and to identify bottlenecks where more resources are needed to improve performance.

1 Background

Modern biomedical analytical workflows typically consist of multiple applications and libraries, each with its own set of software dependencies. As a result, software containers that encapsulate executables with their dependencies have become popular to facilitate the deployment of complicated workflows and to increase their reproducibility [o2017dockstore, da2017biocontainers]. Many of these biomedical workflows are also computationally intensive because they operate on large datasets and require significant CPU, network, and disk resources. Cloud computing has emerged as a possible solution that can provide the resources needed for computationally intensive bioinformatics analyses [dai2012bioinformatics, schadt2010computational, schadt2011cloud, lau2017cancer, reynolds2017isb, afgan2011harnessing, birger2017firecloud]. However, deployment of workflows using Infrastructure-as-a-Service (IaaS) cloud platforms requires selecting the appropriate type and quantity of virtual machines (VMs) to address performance goals while balancing hosting costs. Cloud resource type selection is presently complicated by the rapidly growing number of available VM instance types and pricing models offered by public cloud providers. For example, the Amazon, Microsoft, and Google public clouds presently offer more than 265, 204, and 35 VM types, respectively, under approximately five different pricing models. Further, Google allows users to create custom VM types with unique combinations of CPUs, memory, and disk capacity. These cloud VMs are available directly, or through various container platforms. Determining the best cloud deployment requires understanding the resource requirements of the workflow. In this paper we present a tool called the Container Profiler that runs inside of the container to profile workflow resource utilization. We demonstrate its utility by recording and visualizing the resource usage of a multi-stage containerized bioinformatics application.

1.1 Our Contributions

This paper presents the Container Profiler, a tool that supports profiling the computational resources utilized by software within a Docker container. Our tool is simple, easy to use, and can record the resource utilization for any Dockerized computational job. As containerized bioinformatics software becomes ubiquitous, it is essential to understand the fine-grained resource utilization of computational tasks to identify resource bottlenecks and to inform the choice of optimal cloud deployment. The Container Profiler collects metrics to characterize the CPU, memory, disk, and network resource utilization at the VM, container, and process level. In addition, the Container Profiler provides tools and time-series graphing to visualize and facilitate monitoring of resource utilization of workflows. We present a case study using a multi-stage containerized bioinformatics workflow that analyzes the unique molecular identifiers (UMI) of RNA sequencing data to illustrate the utility of our tools.

1.2 Related Work

Weingartner et al. highlight the importance of profiling resource requirements of applications for deployment in the cloud to improve resource allocation and forecast performance [weingartner2015cloud]. Brendan Gregg described the USE method (Utilization, Saturation, and Errors) as a tool to diagnose performance bottlenecks [gregg2013thinking]. Gregg’s method involves checking utilization of every resource involved in the system including CPUs, disks, memory, and more to identify saturation and errors. Lloyd et al. provided a virtual machine manager known as VM-scaler that integrated resource utilization profiling of software deployments to Infrastructure-as-a-Service (IaaS) cloud virtual machines [lloyd2014virtual]. The VM-scaler tool focused on the management and profiling of cloud infrastructure used to host environmental modeling web services. Lloyd et al. later extended this work by building resource utilization models that enabled identifying the most cost effective cloud-based VM type to host environmental modeling workloads without sacrificing web service runtime or throughput [lloyd2015demystifying]. Their approach demonstrated a possible cost variance of 25% when hosting workloads across different VM types while identifying potential for cost savings up to $25,000 for 10,000 hours of compute time for hosting web service workloads on the cloud.

Cloud computing has been used to process massive RNA sequencing (RNA-seq) datasets [tatlow2016cloud, lachmann2018massive]. These workflows typically consist of multiple computational tasks, where not all tasks necessarily have the same resource requirements. Tatlow et al. studied the performance and cost profiles for processing large-scale RNA-seq data using pre-emptible virtual machines (VMs) on Google Cloud Platform [tatlow2016cloud]. The authors collected computer resource utilization metrics, including user and system vCPU utilization, memory usage, disk activity, and network activity to characterize the different computational phases of the RNA-seq workflow. Tatlow et al. observed how resource utilization can vary dramatically across different processing tasks in the workflow, while demonstrating that resource profiling can help to identify resource requirements of unique workflow phases. Juve et al. developed a pair of tools called wfprof (workflow profiling) that collect and summarize performance metrics for diverse scientific workflows from multiple domains including bioinformatics [juve2013characterizing]. Wfprof consists of two tools, ioprof to measure process I/O, and pprof that characterizes process runtime, memory usage, and CPU utilization. These tools accomplish profiling at the machine level primarily by analyzing process level resource utilization, and they do not focus on profiling containerized workflows, nor do they collect any container specific metrics.

Recently, Tyryshkina, Coraor, and Nekrutenko leveraged coarse-grained resource utilization data from historical job runs collected over 5 years on the Galaxy platform to estimate the required CPU time and memory to improve task scheduling [tyryshkina2019predicting]. This paper identified the challenge of determining the appropriate amount of memory and processing resources for scheduling bioinformatics analyses at scale. The majority of metrics consisted of metadata regarding job configurations; assessing the utility of fine-grained operating system metrics for profiling and resource prediction was not the focus. In addition, older jobs run on Galaxy do not typically use containers and lack any container-based metrics.

2 Container Profiler: Overview

The Container Profiler supports profiling the resource utilization of containerized tasks, including CPU, memory, disk, and network metrics. Resource utilization metrics are obtained across three levels: virtual machine (VM)/host, container, and process. Our implementation leverages facilities provided by the Linux operating system that are integral to Docker containers. Development and testing of the Container Profiler described in this paper was completed using Debian-based Ubuntu Linux.

The Container Profiler collects information from the Linux /proc and /sys/fs/cgroup file systems while a workload is running inside a container on the host machine. The host machine could be a physical computer such as a laptop or a virtual machine (VM) in the public cloud. The workload being profiled can be any job capable of running inside a Docker container. Figure 1 provides an overview of the various metrics collected by the Container Profiler.

Figure 1: Overview summarizing resource utilization metrics (61 total) collected by the Container Profiler across three levels (host/VM, container, and process level) and four categories (CPU, memory, network, and disk). Process level metrics are depicted by red and prefaced with lower case "p", container level metrics by yellow prefaced with lower case "c", and host/VM level metrics by blue prefaced with lower case "v".

| Metric | Description | Source |
| --- | --- | --- |
| vCpuTimeUserMode | CPU time for processes executing in user mode | /proc/stat |
| vCpuTimeKernelMode | CPU time for processes executing in kernel mode | /proc/stat |
| vCpuIdleTime | CPU idle time | /proc/stat |
| vCpuTimeIOWait | CPU time waiting for I/O to complete | /proc/stat |
| vCpuContextSwitches | Total number of context switches across all CPUs | /proc/stat |
| vDiskSectorReads | Number of sectors read | /proc/diskstats |
| vDiskSectorWrites | Number of sectors written | /proc/diskstats |
| vDiskReadTime | Time spent reading | /proc/diskstats |
| vDiskWriteTime | Time spent writing | /proc/diskstats |
| vNetworkBytesRecvd | Network bytes received | /proc/net/dev |
| vNetworkBytesSent | Network bytes sent | /proc/net/dev |

Table 1: Selected CPU, disk, and network utilization metrics profiled at the VM/host level.

Host-Level Metrics: Host/VM level resource utilization metrics are obtained from the Linux /proc virtual filesystem. The files in /proc are generated on demand by the Linux kernel at access time, from metadata the kernel maintains to describe current resource utilization, devices, and hardware configuration, and they expose an immense amount of data regarding the state of the system [cp12]. The Container Profiler queries the /proc filesystem programmatically at regular time intervals to obtain resource utilization statistics. Documentation regarding the Linux /proc filesystem is found on the /proc Linux manual pages [cp12], though other references provide more detailed descriptions of available metadata: [cp1, cp2, cp3, cp4, cp5, cp6, cp7, cp8, cp9, cp10, cp11].

VM-level resource utilization data is obtained from the /proc filesystem. For example, user-mode and kernel-mode CPU utilization data is obtained from the /proc/stat file. Table 1 provides a subset of CPU, disk, and network utilization metrics profiled at the VM/host level.
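
To make the host-level collection concrete, the following is a minimal Python sketch of how the CPU rows of Table 1 could be read from the text of /proc/stat. It is an illustration only, not the Container Profiler's own code: the function name `parse_proc_stat` and the sample text are hypothetical, and the values are returned as raw kernel clock ticks without any unit conversion the tool may apply.

```python
def parse_proc_stat(text):
    """Extract selected host-level CPU counters from the text of /proc/stat.

    Hypothetical helper for illustration; metric names follow Table 1.
    Values are cumulative counters in kernel clock ticks (USER_HZ).
    """
    metrics = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "cpu":
            # Aggregate line; columns are: user nice system idle iowait ...
            metrics["vCpuTimeUserMode"] = int(fields[1])
            metrics["vCpuTimeKernelMode"] = int(fields[3])
            metrics["vCpuIdleTime"] = int(fields[4])
            metrics["vCpuTimeIOWait"] = int(fields[5])
        elif fields[0] == "ctxt":
            # Total context switches across all CPUs since boot.
            metrics["vCpuContextSwitches"] = int(fields[1])
    return metrics


# Made-up sample in the layout of /proc/stat (not measured data).
SAMPLE_PROC_STAT = """cpu  4705 150 1120 16250 520 0 30 0 0 0
cpu0 2300 80 560 8100 260 0 15 0 0 0
ctxt 115315
"""
sample_host_metrics = parse_proc_stat(SAMPLE_PROC_STAT)
```

On a Linux host the same function can be applied to the live file with `parse_proc_stat(open("/proc/stat").read())`. Because the counters are cumulative since boot, utilization over an interval is the difference between two successive snapshots.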

Container-Level Metrics: Docker relies on the Linux cgroup and namespace features to aggregate a set of Linux processes together to form a container. Linux control groups (cgroups) were originally added to the Linux operating system to give system administrators the ability to dynamically control hardware resources for a set of related Linux processes [cgroups]. Cgroups provide a kernel feature to both limit and monitor the total resource utilization of containers. Docker leverages cgroups for resource management, restricting access to the hardware of the underlying host machine so that multiple containers can share the host. Linux subsystems such as CPU and memory are attached to a cgroup, which makes it possible to control the resources of the cgroup. Resource utilization of cgroup processes is aggregated for reporting purposes under the /sys/fs/cgroup virtual filesystem, and we leverage its availability to obtain container-level metrics. Cgroup files provide aggregated resource utilization statistics describing all of the processes inside a container. For example, a container’s CPU utilization statistics can be obtained from /sys/fs/cgroup/cpuacct/cpuacct.stat within a container. Table 2 describes a subset of the CPU, disk, and network utilization metrics profiled at the container level by the Container Profiler.

| Metric | Description | Source |
| --- | --- | --- |
| cCpuTimeUserMode | CPU time consumed by tasks in user mode | /sys/fs/cgroup/cpuacct/cpuacct.stat |
| cCpuTimeKernelMode | CPU time consumed by tasks in kernel mode | /sys/fs/cgroup/cpuacct/cpuacct.stat |
| cDiskSectorIO | Number of sectors transferred to or from specific devices | /sys/fs/cgroup/blkio/blkio.sectors |
| cDiskReadBytes | Number of bytes transferred from specific devices | /sys/fs/cgroup/blkio/blkio.throttle.io_service_bytes |
| cDiskWriteBytes | Number of bytes transferred to specific devices | /sys/fs/cgroup/blkio/blkio.throttle.io_service_bytes |
| cNetworkBytesRecvd | Number of bytes each interface has received | /proc/net/dev |
| cNetworkBytesSent | Number of bytes each interface has sent | /proc/net/dev |

Table 2: Selected CPU, disk, and network utilization metrics profiled at the container level.
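
As an illustration of the container-level collection, the sketch below parses the two-line text of cpuacct.stat into the cCpuTimeUserMode and cCpuTimeKernelMode metrics. The helper name `parse_cpuacct_stat` and the sample values are hypothetical. The sketch assumes the cgroup v1 layout referenced in this paper; on hosts that use cgroup v2 the file layout differs and this parser would not apply unchanged.

```python
def parse_cpuacct_stat(text):
    """Parse cgroup v1 cpuacct.stat text ("user N" and "system N" lines).

    Hypothetical helper for illustration; values are cumulative counters
    in kernel clock ticks (USER_HZ) for all tasks in the cgroup.
    """
    values = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            values[parts[0]] = int(parts[1])
    return {
        "cCpuTimeUserMode": values["user"],
        "cCpuTimeKernelMode": values["system"],
    }


# Made-up sample in the layout of cpuacct.stat (not measured data).
SAMPLE_CPUACCT = "user 8241\nsystem 1932\n"
sample_container_metrics = parse_cpuacct_stat(SAMPLE_CPUACCT)
```

Inside a container on a cgroup v1 host, the input would come from reading /sys/fs/cgroup/cpuacct/cpuacct.stat.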

Process-Level Metrics: The Container Profiler also supports profiling the resource utilization of each process running inside a container by reading the per-process files under the Linux /proc filesystem. For example, the CPU utilization of a process with process ID pid can be retrieved from the file /proc/[pid]/stat. The Container Profiler captures the resource utilization data for each process running in a container. Table 3 describes a subset of the process-level metrics collected by the Container Profiler to profile resource utilization of container processes.

| Metric | Description | Source |
| --- | --- | --- |
| pCpuTimeUserMode | Amount of time that this process has been scheduled in user mode | /proc/[pid]/stat |
| pCpuTimeKernelMode | Amount of time that this process has been scheduled in kernel mode | /proc/[pid]/stat |
| pVoluntaryContextSwitches | Number of voluntary context switches | /proc/[pid]/status |
| pNonvoluntaryContextSwitches | Number of involuntary context switches | /proc/[pid]/status |
| pBlockIODelays | Aggregated block I/O delays | /proc/[pid]/stat |
| pResidentSetSize | Number of pages the process has in real memory | /proc/[pid]/stat |

Table 3: List of important metrics for profiling process resource utilization.

Collection of process-level data follows a similar approach to that used for VM/host-level data. Within the /proc filesystem, the Linux kernel dynamically generates files that describe the resource utilization of each running process. The /proc/PID path provides access to information about the process with process ID PID. As an example, information regarding CPU utilization for a process with PID 10 would be located in /proc/10/stat.
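
The per-process file is a single line of space-separated fields, so a small parser is enough to recover the /proc/[pid]/stat rows of Table 3. The sketch below is hypothetical (the function name `parse_pid_stat` and the sample line are not from the Container Profiler) and uses the field positions documented in the proc(5) manual page: utime (14), stime (15), rss (24), and delayacct_blkio_ticks (42). Mapping pBlockIODelays to field 42 is an assumption based on the metric description.

```python
def parse_pid_stat(text):
    """Extract selected fields from the text of /proc/[pid]/stat.

    Hypothetical helper for illustration; field numbers follow proc(5).
    """
    # Field 2 (comm) is wrapped in parentheses and may itself contain
    # spaces or ')', so split only the text after the LAST ')'.
    rest = text[text.rfind(")") + 1:].split()

    def field(n):
        # rest[0] holds field 3 (state), so field n is at index n - 3.
        return int(rest[n - 3])

    return {
        "pCpuTimeUserMode": field(14),    # utime, clock ticks
        "pCpuTimeKernelMode": field(15),  # stime, clock ticks
        "pResidentSetSize": field(24),    # rss, pages
        "pBlockIODelays": field(42),      # delayacct_blkio_ticks (assumed mapping)
    }


# Made-up line in the layout of /proc/[pid]/stat (not measured data).
SAMPLE_PID_STAT = (
    "10 (my proc) S 1 10 10 0 -1 4194560 500 0 2 0 120 45 0 0 20 0 1 0 300 "
    "10000000 256 18446744073709551615 " + "0 " * 12 + "17 0 0 0 7 0 0"
)
sample_process_metrics = parse_pid_stat(SAMPLE_PID_STAT)
```

Splitting after the last closing parenthesis matters in practice, since a naive `text.split()` shifts every field for processes whose names contain spaces.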

Resource utilization data collected at the VM/host, container, and process level allows characterization of resource use with increasingly greater isolation. Host-level resource metrics, for example, do not isolate background processes. This could lead to variance in measurements, as background processes may be randomly present. Profiling at the container level allows fine-grained profiling of only the resources used by the computational task. Finally, profiling at the process level allows very fine-grained profiling so that resource bottlenecks can be attributed to specific activities or tasks. The ability of the Container Profiler to characterize resource utilization at multiple levels enables high observability of the resource requirements of computational tasks. This observability can be crucial to improving job deployments to cloud platforms, alleviating performance bottlenecks, and optimizing performance and analysis costs.

3 Results

We demonstrate the Container Profiler using unique molecular identifier (UMI) RNA sequencing data generated by the LINCS Drug Toxicity Signature (DToxS) Generation Center at Icahn School of Medicine at Mount Sinai in New York [umi-xiong]. The scripts and supporting files for the analytical workflow used to analyze these data originated from the Broad Institute [Soumillon003236]. In addition to the dataset download, the workflow has three stages. The first stage is a demultiplexing or split step that sorts the reads using a sequence barcode to identify the originating sample. The second stage aligns the reads to a human reference sequence to identify the gene that produced the transcript. The final stage is the "merge" step, which counts all the aligned reads to identify the number of transcripts produced by each gene. The unique molecular identifier (UMI) sequence is used to filter out reads that arise from duplication during the sample preparation process. In the original workflow, only the most CPU-intensive part of the workflow, the alignment step, was optimized and executed in parallel. We further optimized the split and align steps in the original workflow [Soumillon003236] to decrease the running time from 29 hours to 3.5 hours [hung2019holistic]. We also encapsulated each step in the workflow in separate Docker containers to facilitate deployment and ensure reproducibility.

We adopt this UMI RNA-sequencing workflow as our case study for the Container Profiler as each stage of the workflow should have different resource utilization characteristics. Specifically, the dataset download should be limited by the network capacity. The split step writes many files and should be limited by the speed of the disk writes. The alignment step is performed by multiple CPU-intensive processes which would be largely limited by the CPU. However, it is possible that available memory capacity will limit performance in some circumstances. The final merge step involves reading many files in parallel, consuming both memory and CPU resources depending on the number of threads used.

Figure 2: Output graphs comparing container and VM (host) level metrics over time for a multi-stage RNA sequencing data workflow. Four output graphs are shown: disk writes (top left), CPU usage (top right), network usage (bottom left) and memory usage (bottom right). In each graph, the container level metrics are shown in blue and the VM (host) level metrics are shown in red. For disk usage and memory usage, the native host metric was transformed to have the same units as the container metric. For disk usage this involved multiplying the vSectorWrites value by the sector size to estimate vBytesWritten. For memory usage, we subtracted vMemoryFree from the total memory available to get vMemoryUsed. The four phases of the workflow are downloading the data (download), splitting and demultiplexing the reads (split), aligning the reads to the reference (align), and assembling the counts while removing duplicate reads (merge). We observe that the container-level and VM-level metrics mostly overlap in the phases. However, there are differences when there are background processes, most notably when there is considerable disk usage. The alignment phase is also notable in that the CPU usage declines near the end, probably indicating that the workflow is waiting on some slower threads to finish before it can proceed. This suggests that the phase might be improved with better load balancing, or with smaller workloads for the threads. This is an example of how the Container Profiler can be used to flag portions of the workflow that can be optimized.
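
The two unit transformations described in the Figure 2 caption amount to simple arithmetic, sketched below in Python. The function names are hypothetical, and the 512-byte sector size is an assumption: the paper does not state the value used, but the sector counters in /proc/diskstats are conventionally reported in 512-byte units.

```python
SECTOR_BYTES = 512  # assumed sector unit for /proc/diskstats counters


def host_bytes_written(v_sector_writes, sector_bytes=SECTOR_BYTES):
    """Estimate vBytesWritten from the host sector-write counter."""
    return v_sector_writes * sector_bytes


def host_memory_used(v_memory_total, v_memory_free):
    """Derive vMemoryUsed as total memory minus free memory."""
    return v_memory_total - v_memory_free


# Made-up values for illustration (not measured data).
example_bytes_written = host_bytes_written(2048)           # 2048 sectors
example_memory_used = host_memory_used(64_000_000, 16_000_000)
```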

3.1 Container Profiler can inform workflow optimization

Figure 2 shows the CPU, memory, network, and disk utilization metrics at both the container and VM/host levels over time for the RNA sequencing analytical workflow. Note that the x-axes depicting time in this figure encompass the entire workflow, incorporating the download, split, align, and merge stages. At a high level, the profile results follow the expected utilization patterns. The download phase consumes network resources. The split step is the most disk-intensive step. The alignment and merge steps consume the most CPU resources. The profile data also points to areas where resource consumption may be a problem. For example, memory usage is high for all the stages. This may be due to greedy allocation by the executables, or it may indicate that more memory could benefit the workflow. Most interesting is the CPU utilization during the alignment phase. There are two steep drops in CPU usage at the 4 hour mark, and again just before 5 hours. The alignment phase uses separate threads to align different files of reads simultaneously. Near the end of the phase, most of the files will have been processed and there will be more threads than files. As a result, the CPU utilization drops as individual threads lie idle waiting for the final files to be processed. However, this under-utilization of resources lasts for almost two hours, indicating that the final files are rather large. This suggests an opportunity to improve the workflow by splitting into smaller files (which is an option in our software), or by processing the largest files first. We could not have known whether these additional steps would be worth the additional complexity without the fine-grained results from the Container Profiler.
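
The effect of processing the largest files first can be illustrated with a small scheduling simulation. The sketch below is not part of the workflow or the Container Profiler: it assumes, hypothetically, that alignment time is proportional to file size and that each file is handed to whichever worker thread becomes free first. The function name `makespan` and the file sizes are made up for illustration.

```python
import heapq


def makespan(file_sizes, n_workers):
    """Finish time when files are handed out, in list order, to the
    worker that becomes free first (greedy list scheduling).

    Assumes processing time is proportional to file size.
    """
    finish_times = [0] * n_workers
    heapq.heapify(finish_times)
    for size in file_sizes:
        # The earliest-finishing worker picks up the next file.
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + size)
    return max(finish_times)


# Six small files followed by one large file, two workers.
sizes = [1, 1, 1, 1, 1, 1, 8]
arrival_order = makespan(sizes, n_workers=2)
largest_first = makespan(sorted(sizes, reverse=True), n_workers=2)
```

In this toy case the large file scheduled last leaves one worker idle for most of its duration, while sorting by decreasing size lets the small files fill in alongside the large one. Whether the real workflow benefits to a similar degree depends on its actual file size distribution.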

3.2 Container-level metrics can provide useful additional information

One of our contributions with the Container Profiler is the ability to capture container-level metrics. We would expect the container-level and host/VM-level metrics to be similar, but they could differ in that the host/VM level metrics would also encompass resources being used by processes not directly involved in executing the workflow. Since we only ran one instance of our workflow on our test VM, the container metrics should be very similar to the VM/host metrics, which is the case. However, one can see differences between the disk utilization metrics during the split and alignment phases, where there are a large number of disk writes to the host file system. Docker manages these disk writes by providing the container with an internal mount point which is eventually written to a host file. The caching and management of this data is external to the container and is not captured by the container metrics, but is captured by the host metric. In addition, during the alignment phase, intermediate results from the aligner are continuously piped to another process which then reformats the intermediate output and writes the final output to a file on the host system. More threads are used than there are available cores, resulting in frequent context switches. The pipe management and context switching are also handled by the operating system and are captured by the host metric and not the container metrics. The separation of container and OS based consumption can be useful, for example, when trying to assess effects due to resource contention that may occur when multiple jobs are run on the same physical host, which often happens on public clouds where the assignment of instances to hosts is controlled by the vendor.

3.3 Container Profiler can sample container and host metrics at 1-2 second resolution

For the Container Profiler to be useful, the collection of profiling metrics must have sufficiently low overhead to enable rapid sampling of resource utilization, so that many samples can be collected for time series analysis. The time required to collect the metrics limits the granularity of the profile. Achieving 1-second resolution requires the ability to record the profile within 1 second (1000 ms). However, the measurement time is not constant but depends on the resources being utilized by the workflow and host. This is shown in the histogram in Figure 3. We note that the highest variation is for the process-level data. This makes sense, as metrics are collected for each process and the number of processes being executed varies during the execution of a complex parallel workflow. The time required to gather host and container level metrics is less variable as the number of metrics collected is fixed. As shown in Figure 3, 90% of the time, the container and host level metrics are collected in less than a second and always under 1.5 seconds. The process metrics do take longer to collect but still less than 10 seconds in the absolute worst case. This points out an advantage of container metrics in that they isolate the utilization of the application without the need to collect all the process level data.

Figure 3: Distribution plot (log-scale) of time required to collect profile data. We collected utilization data using the Amazon EC2 m4.4xlarge VM type (Intel Xeon E5-2676v3 CPU at 2.4 GHz, 16 virtual CPU cores, 64GB of memory, and elastic block store data volumes). We profiled the complete RNA-seq workflow collecting VM/host metrics, VM/host and container metrics, and all metrics, and also in the absence of the profiler. We repeated each of the profiling runs 3 times for a total of 12 executions of the workflow. Plots depict time to collect resource utilization samples at one-second intervals with the Container Profiler while running the entire RNA-seq workflow. Time to collect 11,994 samples of each type (Process-level, Container-level, and VM-level) shown. 93.1% of all samples were collected in under a second. Process-level sampling shows the distribution of sample collection over 9.5 seconds. Container-level and VM-level sampling shows the distribution of sample collection over 1.5 seconds. The 90th percentiles for sample collection are shown.

3.4 Container Profiler has much lower overhead than the variation in execution time on public clouds

The Container Profiler must also not significantly impact the performance of the workflow that is being profiled. Otherwise the process of resource profiling might impact the collected metrics. While some overhead is unavoidable, ideally it should be lower than the intrinsic variation in workflow execution times.

Figure 4: This figure depicts the profiling overhead of the Container Profiler and the resulting percentage increase in the total runtime of the entire RNA-seq workflow. The increases in running time are very modest: Host/VM only (0.07%), Host/VM + Container (0.42%), and Host/VM + Container + Process (0.95%). Error bars depict one standard deviation from the average. Standard deviation of workflow performance on Amazon EC2 with no profiling was more than 5x greater (+/-5.32%) than worst case overhead of the Container Profiler.

To measure the performance impact on the RNA-seq workflow we initially attempted to assess the overhead using Amazon Elastic Compute Cloud (EC2) cloud VMs. However, we discovered that the runtime of the RNA-seq workflow varied by more than 5% on Amazon EC2, which was more than 5x greater than the overhead of the Container Profiler. This made it impossible to accurately quantify the performance overhead, since we could not distinguish between cloud performance variance and the overhead of the Container Profiler. To effectively measure the performance overhead we profiled the workflow on a local Dell server equipped with a 10-core Intel Xeon E5-2640 v4 @ 2.4 GHz with 72GB of memory. Figure 4 depicts the performance overhead resulting from one-second sampling of resource utilization by the Container Profiler on the RNA-seq workflow on the local Dell server. Running on an isolated server greatly reduced the performance variance of running RNA-seq. We measured worst case overhead for the Container Profiler to be less than 1%, which equates to about 4.4 minutes for an 8-hour workflow with full verbosity metrics collection (VM + container + process). Overhead is reduced to as little as 0.07%, or about 19 seconds for an 8-hour workflow, when only collecting VM level metrics. Adding container-level, and especially process-level, metrics increases the amount of time it takes to collect resource utilization data. We believe that workload profiling overhead is within an acceptable level and note that even at maximum verbosity, it is substantially less than the observed performance variance for running a workflow on the public cloud. Users can refer to our reported overhead times to make informed decisions when planning to profile their own workflows.

4 Methods

4.1 Implementation Details

Figure 5: Summary of Bash scripts used in the implementation of Container Profiler.

The Container Profiler is implemented as a collection of Bash and Python scripts. Figure 5 provides an overview. When the Container Profiler is executed inside a Docker container, it snapshots the resource utilization for the host (i.e. VM), container, and all processes running inside the container producing output statistics to a .json file. A sampling interval (e.g. once per second) is specified to configure how often resource utilization data is collected to support time series analysis for containerized applications and workflows. Time series data can be used to train mathematical models to predict the runtime or resource requirements of applications and workflows. Time series data can also be visualized using plotly Python graphing scripts that are included with the Container Profiler.

To improve the periodicity of time series sampling, we subtract the observed run time of the Container Profiler for each sample collection from the configured sampling interval (e.g. 1 second) in rudataall.sh. This approach notably improved the periodicity of sampling when the container was under load, improving our ability to obtain the expected number of one-second samples for long running workflows. As an added feature, we also include timestamps for when each resource utilization metric is sampled in the output JSON. These timer ticks enable precise calculation of the time that transpires between resource utilization samples for each metric. This allows the rate of consumption of system resources (e.g. CPU, memory, disk/network I/O) to be precisely determined throughout the workflow. The Container Profiler consists of four scripts depicted in Figure 5: process_pack.sh, runDockerProfile.sh, ru_profiler.sh, and rudataall.sh.
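
The interval compensation described above can be sketched in Python as follows. This is an illustration of the idea rather than a port of the Bash scripts: the function `run_sampler` and its `clock` and `sleep` parameters are hypothetical, introduced so the loop can be exercised without real waiting.

```python
import time


def run_sampler(collect, interval, n_samples,
                clock=time.monotonic, sleep=time.sleep):
    """Call collect() n_samples times, aiming for one call per interval.

    The time spent inside collect() is subtracted from the sleep so that
    sample start times stay close to multiples of the interval. If a
    collection overruns the interval, the next one starts immediately.
    Returns the start time of every sample.
    """
    start_times = []
    for _ in range(n_samples):
        started = clock()
        start_times.append(started)
        collect()
        elapsed = clock() - started
        sleep(max(0.0, interval - elapsed))
    return start_times
```

For example, `run_sampler(take_snapshot, interval=1.0, n_samples=60)` would take roughly one minute regardless of how long each snapshot takes, as long as each finishes within a second; `take_snapshot` here stands for any user-supplied collection function.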

The process_pack.sh script is intended to be modified by the user and is used to initiate profiling. Specifically, in process_pack.sh the user is responsible for providing a Docker image that includes the application to be profiled, and the command to launch the containerized application.

The runDockerProfile.sh script takes two arguments: the name of the file containing the command from the process_pack.sh script, and the amount of time in seconds between samples. The runDockerProfile.sh script then builds a Docker run command that runs the container and also mounts a directory from the host, specified by the user, into the container’s /data directory. Mounting the host directory makes the Container Profiler’s Bash scripts available to the container. It also enables the Container Profiler to export JSON files describing resource utilization outside of the container to the host. The user also modifies runDockerProfile.sh to provide the name of the container to be run.

The ru_profiler.sh script takes two parameters: the run command that was built in the previous script, and the time interval between snapshots. First, the run command is executed synchronously by ru_profiler.sh, while the script also records the current time before and after invoking rudataall.sh. The ru_profiler.sh script calculates the profiling time and sleeps for the remainder of the sampling interval before repeating the loop again. The rudataall.sh script collects the resource utilization data. Specifically, this script takes a snapshot of the resource utilization metrics and records output to a JSON file using the time of the sample as a unique filename. The script also accepts parameters -v, -c, and -p to inform the tool what type of data to collect: VM, container, and/or process-level metrics respectively. The default behavior when running this script without any parameters is to collect all metrics. A user can adjust the verbosity of resource utilization profiling for the Container Profiler by modifying the ru_profiler.sh script.

4.2 Technical details using our scripts

The Container Profiler scripts can be used with any Linux-based Docker container that encapsulates a script or job to run. To configure the Container Profiler to profile the container, two files inside the Container Profiler are modified: process_pack.sh and runDockerProfile.sh. In process_pack.sh, the user launches the container’s job or task to be profiled. This can be done by calling a script, command, or executable available inside the container to initiate the work. Inside runDockerProfile.sh, two variables named “ContainerName” and “HostPath” need to be set. ContainerName is the name of the container to profile, and HostPath is the path where the tool runs. Once these two scripts are modified, setup is complete. To start profiling, call runDockerProfile.sh: this script creates the specified container and runs the job, while ru_profiler.sh outputs JSON data from the container to the specified path.
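Once a run has finished, the JSON snapshots written to the host path can be read back for inspection. The following Python sketch assumes that each snapshot is a single .json file directly under that directory and that the time-based filenames sort in sampling order; load_snapshots is a hypothetical helper and is not part of the Container Profiler.

```python
import json
from pathlib import Path


def load_snapshots(data_dir):
    """Read every *.json file in data_dir.

    Returns (filename, parsed_content) pairs sorted by filename.
    """
    snapshots = []
    for path in sorted(Path(data_dir).glob("*.json")):
        with path.open() as handle:
            snapshots.append((path.name, json.load(handle)))
    return snapshots


if __name__ == "__main__":
    # "." is a placeholder for the directory set as HostPath.
    for name, content in load_snapshots("."):
        print(name, type(content).__name__)
```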

4.3 Visualization

The Container Profiler includes graphing scripts that support the creation of time-series graphs to help visualize Linux resource utilization metrics. Graphs are saved locally and can be created in a browser dynamically. Resource utilization data samples collected by the Container Profiler are stored as JSON files in a specified data directory during profiling. After profiling, and once the graphing scripts and dependencies are installed, time-series graphs can be made by specifying the data directory and the sampling interval for plotting. By default, graphs are generated for every metric; alternatively, a specific set of metrics can be specified for graphing. Our graphing library uses the source profiling data, typically collected at a one-second sampling interval, to automatically generate metric deltas based on the desired sampling interval being graphed. The delta_configuration.ini file captures configuration details for how these deltas should automatically be derived for each metric. Additionally, the graph_generation_config.ini file captures default graphing behavior for how specific metrics should be graphed. Figure 2 shows sample output graphs depicting CPU, memory, disk, and network utilization for the DToxS RNA sequencing data workflow.
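The idea of deriving deltas at a coarser interval than the one used for collection can be illustrated with a short Python sketch. It is not the graphing library's implementation, and it does not read delta_configuration.ini, whose format is not described here. The function resample_deltas and the metric names cpu_ticks and mem_free are hypothetical, and the sketch assumes the caller knows which metrics are cumulative counters.

```python
def resample_deltas(samples, step, cumulative_keys):
    """Keep every `step`-th sample and turn cumulative counters into per-interval differences.

    samples: list of dicts collected at a fixed base interval.
    step: how many base intervals make up one plotted interval.
    cumulative_keys: metrics that only ever grow (e.g. CPU time counters).
    """
    if step < 1:
        raise ValueError("step must be a positive integer")
    kept = samples[::step]
    rows = []
    for previous, current in zip(kept, kept[1:]):
        row = dict(current)  # instantaneous metrics are copied unchanged
        for key in cumulative_keys:
            row[key] = current[key] - previous[key]
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Seven one-second samples resampled to a three-second plotting interval.
    data = [{"t": i, "cpu_ticks": 10 * i, "mem_free": 100 - i} for i in range(7)]
    for row in resample_deltas(data, 3, ["cpu_ticks"]):
        print(row)
```

Only cumulative counters are differenced. A gauge such as free memory is already meaningful at each sample, so it is passed through as is.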

5 Availability of supporting data and materials

  • Project name: Container Profiler

  • Contents available for download: Docker Images, Dockerfiles, installation scripts, and execution scripts.

  • Operating system(s): Linux, Mac OS X, Microsoft Windows.

  • Programming language(s): Bash, Python

  • License: MIT License

6 List of abbreviations

AWS: Amazon Web Services; EC2: Elastic Compute Cloud; VM: virtual machine; CPU: central processing unit; IaaS: Infrastructure-as-a-Service; RNAseq: RNA sequencing; LINCS: Library of Integrated Network-Based Cellular Signatures; DToxS: Drug Toxicity Signature; RNA: ribonucleic acid; cgroup: control group.

7 Authors’ Contributions

LHH, HD, RS, and DP contributed to the development of the Container Profiler. LHH implemented Docker containers for RNA-seq workflows. RS, NA, and DP conducted performance testing and empirical experiments. KYY, RS, WL, and LHH drafted the manuscript. WL, KYY, and LHH designed the case study. WL provided cloud computing expertise. WL and KYY coordinated the empirical study. All authors edited the manuscript.

8 License

This work is licensed under the Creative Commons Attribution - NonCommercial - NoDerivatives 4.0 International License (CC BY-NC-ND 4.0). To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

9 Acknowledgements

LHH, HD, RS, WL, and KYY are supported by the National Institutes of Health (NIH) grant R01GM126019. DP is supported by the NIH Diversity Supplement R01GM126019-02S2. WL is also supported by NSF grant OAC-1849970. We acknowledge support from the AWS Cloud Credits for Research (awarded to LHH, WL, and KYY).

References