The NSF Innovative HPC program is an integrated collection of state-of-the-art digital resources (coordinated currently through the XSEDE program [towns2014xsede] and previously through the TeraGrid and PACI programs [reed2003grids, catlett2008teragrid]) that enables researchers, engineers, and scholars to conduct computational and data intensive research in a diverse range of disciplines. The resources provided by the program are intended to be technically diverse, reflecting the changing and growing use of computation in both research and education. With over 7,000 users from more than 800 institutions directly using these systems, this program fulfills an important need in the U.S. cyberinfrastructure ecosystem by providing researchers with access to computational and data intensive resources that are beyond the capabilities of most campus-based systems. The following plots underscore the important and unique role that these resources play in supporting computationally and data intensive research in the U.S.
Figure (a) compares the distribution of CPU hours consumed by job size (number of cores) for the resources provided through the NSF Innovative HPC program with that of an academic HPC center, namely the Center for Computational Research (CCR) at the University at Buffalo. Blue Waters, NSF’s capability-class HPC system, is also shown in Figure (b) for comparison. Assuming that CCR’s job mix is representative of a typical campus-based HPC system, Figure (a) clearly shows the difference in the scale of the jobs run on the two systems, with NSF Innovative HPC resources providing researchers with the ability to consistently run much larger jobs. Figure (b) gives a more complete picture of the unique roles that these three components of the NSF cyberinfrastructure ecosystem play in supporting scientific research in the U.S. It should be noted that the NSF Innovative HPC Program also supports a wide range of more specialized services, such as visualization, storage, large memory, and data intensive facilities, that are not emphasized in the computation-based Figure (b).
Figure 4 is a plot of the number of institutions and principal investigators using NSF Innovative HPC Program resources broken out by field of science for 2016. It serves to underscore the extent of the NSF Innovative HPC resources’ user base and the diversity of the areas of science served by this program.
Figure 5 is a plot of CPU hours consumed and number of jobs run by resource type and provides an indication of the diversity of digital resources provided by the NSF Innovative HPC program to enable research. While there is a long history of HPC and HTC computing by U.S. researchers, other areas such as cloud and data intensive computing are growing, and NSF’s Innovative HPC program is expanding its offerings in these areas to help keep pace with the growing demand. Note that the job type classification scheme employed to create Figure 5 is based on the resource type classification taken from the XSEDE central database, and not on an analysis of individual research awards. For example, in the past three years all jobs run on SDSC GORDON, PSC BRIDGES LARGE, and PSC BLACKLIGHT are classified as Data Intensive. Likewise, all Open Science Grid jobs are classified as high throughput computing jobs, and all IU JETSTREAM jobs are classified as cloud jobs.
Further evidence of the impact and increasing importance that computation and data play in science, including non-traditional areas, can be found in Figure 6, which shows the remarkable growth in XD SUs consumed (approximately 2 orders of magnitude) on the NSF Innovative HPC resources by the NSF directorates over the past 10 years (2007-2017).
Given the pivotal role that the NSF Innovative HPC program plays in the advancement of science and engineering in the United States, it is important to characterize its workload. An understanding of workload properties sheds light on resource utilization and can be used to guide performance optimization at both the software and system configuration levels, leading to greater overall throughput for end-users. Here, we report the results of a detailed workload analysis of the portfolio of supercomputers comprising the Innovative HPC program in order to characterize its past and current workload, and we look for trends to understand how the broad portfolio of computational science research is being supported and how it is changing over time.
2 Workload Analysis Goals
This analysis, which was modeled after a similar analysis carried out on Blue Waters [xdmod-bw2016], and a 2014 NERSC Workload Analysis [nersc2014], builds on prior workload characterizations of TeraGrid and XSEDE carried out by Hart [Hart2011DeepAndWideMetrics, Hart2011b, Hart2012b]. The workload analysis targeted the following high-level questions:
What is the proportional mix of disciplines (field of science, parent science, NSF division, NSF directorate) and how is it changing over the lifetime of the NSF Innovative HPC program, including job sizes/concurrency and key resource utilization (wall time, memory, GPUs, etc.)? How does this mix differ among the resources? Section 4.
What if any trends are there in allocations versus awards by resource and discipline? Section 4.
What fraction of the portfolio resources are used for data analytics/data intensive computing (hadoop, spark, etc.)? Is the trend going up or down? How does it vary among the resources? Section 5.
Are jobs using a larger number of cores over time? Are there differences of job size by discipline or application or Innovative HPC Program resource? Section 5.
Is job memory usage increasing/decreasing over time? Section 5.
Are there specific discipline differences in memory usage?
Are there specific memory usage differences in the most used applications?
Are there memory usage differences among the resources and does this impact throughput (i.e., result in a bottleneck)?
Are there important differences among job types, e.g., interactive jobs, gateway jobs, etc.? Section 7.
What are the characteristics of gateway jobs? Section 7.
What are the characteristics and trends for CPU core hours per job, node counts and types, memory and interconnect usage?
Are the parallel jobs simply ensembles (many independent jobs)?
What is the gateway job distribution by resource?
How many new (unique) users are using gateways?
Are the usage patterns of gateway users significantly different than traditional HPC users?
How do gateway utilization and growth differ by discipline?
Are jobs constrained by resource policy limits such as queue length, user limits or node sharing? Section 8.
How does this vary by resource?
Do these limits affect the analysis?
Are there differences in the job mixes among the resources and if so, how does this impact job throughput? Section 8.
What is the relative proportion of these jobs between systems?
Is this the result of allocation decisions, or something else we can determine?
How do wait times, throughput and queue length vary among the resources? Has this changed over time? Section 8.
What is the run-time over-subscription? What is the breakdown by resource and resource type? Section 8.
3 Workload Analysis Tools
This analysis leveraged the XD Metrics Service (XMS) for High Performance Computing (HPC) systems, which supports the comprehensive resource management of XSEDE and the associated computational resources of the NSF Innovative HPC program. The analysis was carried out primarily through the XD Metrics on Demand (XDMoD) tool, which was developed under the XMS program.
The XDMoD tool provides stakeholders with access to utilization, performance, and quality of service data for high performance computing (HPC) resources [furlani2013using]. Originally developed to provide independent audit capability for the XSEDE program, XDMoD was later open-sourced and is widely used by university, government, and industrial HPC centers [Palmer:2015]. XDMoD enables users, managers, and operations staff to monitor, assess and maintain quality of service for their computational resources. To do this, XDMoD harvests data from the various resources and displays the resulting job, usage, and accounting metrics, using the XDMoD portal (https://xdmod.ccr.buffalo.edu) and its rich array of visual analysis and charting tools.
Metrics provided by XDMoD include: number of jobs, CPU cycles consumed, wait time, and wall time, in addition to many others. These metrics can be broken down in many different ways such as: field of science, institution, job size, job wall time, NSF directorate, NSF user status, parent science, person, principal investigator, and by resource. A context-sensitive drill-down capability is available for many charts allowing users to access additional related information simply by clicking inside a plot and then selecting the desired metric to explore. Another key feature is the ability to make a custom plot of any metric or combination of metrics filtered or aggregated as desired.
The XDMoD tool is also designed to preemptively identify under-performing hardware and software by deploying customized, computationally lightweight “application kernels” that continuously monitor HPC system performance and reliability from the end-users’ point of view. Accordingly, through XDMoD, system managers have the ability to pro-actively monitor system performance as opposed to having to rely on users to report failures or under-performing hardware and software.
In addition to the application kernels, which provide quality of service level metrics, XDMoD allows system support personnel and end-users to obtain detailed performance metrics aggregated by node, user, and application for every job run on an HPC cluster. These metrics cover the usage of CPU, memory and cache, and network and input-output devices. Performance metrics are obtained from hardware performance counters and general UNIX/Linux monitoring tools with no need to recompile codes, which is highly desirable for performance and practical reasons. In addition to characterizing the cluster’s workload, XDMoD can also be used to identify poorly running applications that can be subsequently tuned to improve performance and overall machine throughput.
The data sources for XDMoD include job accounting data from the Teragrid/XSEDE Central Database (XDCDB), allocation information from the XSEDE Resource Allocation System (XRAS) and job performance data from tacc_stats [evans2014comprehensive]. A full list of data sources with detailed descriptions is provided in Appendix A.1.
In addition to job accounting data for most of the production systems during the period covered by this report, job node level performance data was available for TACC RANGER, TACC LONESTAR4, TACC STAMPEDE, TACC STAMPEDE2, CCT LSU SUPERMIC, NICS DARTER, SDSC COMET, and SDSC GORDON, as described in Appendix A.2. The collection of job and node level performance data has not yet been implemented for the current production systems PSC BRIDGES, IU JETSTREAM, and TACC WRANGLER. Accounting data is also not available for TACC WRANGLER.
4 Trends in Utilization
Goals Addressed in Section
What is the proportional mix of disciplines (field of science, parent science, NSF division, NSF directorate) and how is it changing over the lifetime of the NSF Innovative HPC program, including job sizes/concurrency, and key resource utilization (wall time, memory, GPUs, etc.)? How does this mix differ among the resources?
What if any trends are there in allocations versus awards by resource and discipline?
4.1 Allocations and Usage
In this section we present historical allocation and usage data for the NSF Innovative HPC Program resources. Unless stated otherwise, all plots are over the date range 2011-07-01 to 2017-09-30. The end date was chosen to include the transition from TACC STAMPEDE to TACC STAMPEDE2. The start date was selected to include 6 years of data (multiple generations of HPC resources) and coincides with the start of the XSEDE program.
In many of the figures included in this analysis, utilization is measured in terms of XSEDE Service Units (XD SUs) instead of CPU hours. XD SUs are proportional to CPU hours consumed but contain a scaling factor (for each resource) that attempts to account for changes in CPU processing power over time and between resources, and therefore in theory allows for more meaningful comparisons of utilization between different resources and over different time periods. See Appendix A.3 for greater detail. Wherever possible, we report percent utilization for XD SUs, which to first order removes the influence of the XD SU scaling factors on the shape of the curves when compared to CPU hours; that is, plots of percent utilization using either metric are very similar.
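As a rough sketch of how this normalization works, the conversion can be expressed as a per-resource multiplier on CPU hours; the resource names and factors below are illustrative placeholders, not actual XSEDE values:

```python
# Sketch of the XD SU normalization described above. The per-resource
# conversion factors are illustrative placeholders, NOT actual XSEDE values.
XD_SU_FACTORS = {
    "RESOURCE_A": 1.0,  # hypothetical baseline resource
    "RESOURCE_B": 1.6,  # cores ~1.6x faster than the baseline
}

def cpu_hours_to_xd_su(resource: str, cpu_hours: float) -> float:
    """Scale raw CPU hours by the resource's performance factor."""
    return cpu_hours * XD_SU_FACTORS[resource]

def percent_utilization(used: float, available: float) -> float:
    """Percent utilization. A resource's scaling factor cancels out of this
    ratio, which is why curves of percent utilization look similar whether
    the underlying metric is CPU hours or XD SUs."""
    return 100.0 * used / available
```

Because the factor multiplies both numerator and denominator, percent utilization is identical under either metric for a single resource, which is the first-order cancellation noted above.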
We begin by examining summary statistics for allocations awarded through the end of 2016 in Table 1. For the purpose of this summary we define an allocation as the combination of a project and a resource. For example, the project with charge number TG-MCA06N060, which has access to three resources, is considered three distinct allocations. We include only allocations awarded from 2011-07 to 2016-12 so as not to penalize those that were recently awarded and have used only a small portion of the SUs available. Over this period there were 13,226 allocations awarded totaling 7.9B SUs, including new allocations, renewals, and supplements to existing allocations. In addition to the total number of allocations made and the mean allocation size for the top 1%, 5%, 25% and bottom 25% by allocation size, we also examine the number of allocations that did not use any of the SUs that were awarded. Allocation utilization is defined as the percentage of awarded SUs that were used, weighted by allocation size in SUs. Note that allocations are awarded in SUs local to a particular resource, so the values presented in these tables are also local SUs and not XSEDE SUs (XD SUs). A local SU is defined as 1 SU = 1 core hour for the resources discussed in this report, with the exception of TACC STAMPEDE2 and TACC WRANGLER, where 1 SU = 1 node hour.
|Allocation Size||# Alloc||# Unused||Allocation Utilization¹||Allocation Statistics (Millions of SU)|
Allocation utilization is calculated by weighting the percentage used by the number of local SUs allocated.
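As a concrete reading of this weighting, SU-weighted utilization reduces to total SUs used divided by total SUs awarded; a minimal sketch with invented allocation data:

```python
# SU-weighted allocation utilization: each allocation's percent used is
# weighted by its awarded local SUs. The sample allocations are invented
# for illustration only.
allocations = [
    {"awarded_su": 1_000_000, "used_su": 900_000},
    {"awarded_su": 10_000,    "used_su": 0},       # completely unused
    {"awarded_su": 50_000,    "used_su": 25_000},
]

def weighted_utilization(allocs):
    total_awarded = sum(a["awarded_su"] for a in allocs)
    weighted = sum(
        (a["used_su"] / a["awarded_su"]) * a["awarded_su"] for a in allocs
    )
    # The awarded-SU weights cancel, so this equals total used / total awarded.
    return 100.0 * weighted / total_awarded

unused_count = sum(1 for a in allocations if a["used_su"] == 0)
```

The weighting keeps a handful of large, well-used allocations from being swamped by many small unused ones, which is why overall utilization can be high even when the count of unused allocations is large.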
Table 2 examines the allocations broken down by type. The large TRAC/XRAC/Research allocations have a higher utilization than smaller allocations. Among the smaller allocations, the Campus Champion allocations have a very low overall utilization of 9% with 77% going completely unused. This is partly a reflection of the open-ended nature of these allocations.
|Allocation Size||# Alloc||# Unused||Allocation Utilization¹||Total (M SU)||Mean (SU)||Median (SU)||Variance (M SU)|
|XSEDE2 Staff Alloc||96||59||13.78||3.76||39,142||10,000||7,593|
Allocation utilization is calculated by weighting the percentage used by the number of local SUs allocated.
Table 3 shows allocations broken down by discipline, using the NSF directorate when available. The top 4 disciplines comprise 92.5% of the total SUs awarded (7.3B). Of note is Social, Behavioral, and Economic Sciences (SBE), with a 248% allocation utilization, which is largely attributed to a single 3.0M SU allocation (TG-IBN130001 on OSG) overcharging its account by more than 2000%. This is likely due to policies at the SP level allowing users to overcharge the allocation. Also of note is that over 60% of allocations in all other disciplines go completely unused, although the average size of the unused allocations is small (73,734 SU).
|Discipline||# Alloc||# Unused||Allocation Utilization¹||Total (M SU)||Mean (SU)||Median (SU)||Variance (M SU)|
|Center Sys Staff||166||114||49.12||76||460,467||50,000||4,550,722|
|Sci and Eng Edu||87||45||24.68||5||62,776||44,800||4,611|
Allocation utilization is calculated by weighting the percentage used by the number of local SUs allocated.
Allocation statistics by resource are shown in Table 4 where we can see that all resources providing at least 100M total local SUs (from 2011-07 to 2016-12) have at least an 80% allocation utilization rate with TACC STAMPEDE, NICS KRAKEN, and TACC RANGER, having over 90%.
|Resource||# Alloc||# Unused||Allocation Utilization¹||Total (M SU)||Mean (SU)||Median (SU)||Variance (M SU)|
We can study the characteristics of the users with the largest allocations versus that of the average user to see if there are significant differences in their utilization patterns. For example, are there differences in the average job size, job length or the applications they use? To answer these questions, we compared the usage of all XSEDE users to those running under one of the top 1% of allocations by local SU. We note that the average job size is 88 cores for general XSEDE users and 136 cores for users in the top 1%. Job durations do not change significantly between the two groups with general XSEDE users averaging 3.4 hours per job compared to 3.3 hours for the top 1%. However, in terms of applications, users in the top 1% are more likely to run known community codes such as WRF, ARPS, NAMD, CHARM++, and MILC and less likely to run uncategorized or other custom-developed codes as shown in Figure 7 (blue columns are for all of XSEDE and red columns are for the top 1%).
Figure 8, which is a historical plot of allocated and used XD SUs, shows an upward trend in allocation and consumed XD SUs reflecting the NSF Innovative HPC program’s continued effort to provide sufficient resources to meet the ever increasing computational and data needs of U.S. researchers. As we demonstrate below, downward trends typically coincide with end-of-life resources going off-line prior to the full integration of new resources.
Figure 9, which shows the time history of XD SUs charged by resource, provides a time-line for the NSF Innovative HPC Program resources. The transition from TACC RANGER to TACC STAMPEDE and most recently to TACC STAMPEDE2 is shown, as are the recent additions of PSC BRIDGES and SDSC COMET. It is now evident that the 2014 downward trend in allocated and used XD SUs in Figure 8 is due to NICS KRAKEN going off-line. Similarly, the increase in XD SUs beginning in 2015 is attributable to SDSC COMET coming on-line.
Figure 10 shows the time history of the types of allocations that are possible on NSF Innovative HPC program resources, namely New, Transfer, Renewal, Supplemental, and Advanced. Not surprisingly, renewal and new accounts show an upward trend. Transfer accounts, which are migrations from one resource to another, show a non-linear trend and are a reflection of the integration of new systems and the subsequent transfer of accounts to them. For example, the large increase in transfers in Q1 2017 is a reflection of the transition to TACC STAMPEDE2.
Figure 11 shows the time history of unique accounts as a percentage of all accounts created. Noteworthy is the increase in accounts for disk storage resources that begins in 2013 and shows an increasing trend. The addition of cloud resources to the NSF HPC portfolio is evident by the appearance of cloud accounts starting in 2016.
Figure 12 shows the time history of unique accounts by resource type as a percentage of all accounts. Resource types include high performance computing (HPC), high throughput computing (HTC), data intensive computing, visualization systems, and cloud computing. Note that, as described in the introduction, the job type classification scheme employed to create Figure 12 is based on the resource type classification for each resource taken from the XSEDE central database. The classification therefore does not reflect an analysis of the research projects to determine the type of computing carried out by the research group, but rather reflects the resource requested by the PI or assigned by the allocation committee. During the time period shown, HPC resource accounts make up 70-90% of all accounts, with cloud computing resource accounts showing a growing presence.
Figure 13 shows the time history of the end-users running jobs on NSF Innovative HPC Program resources by the type of end-user (graduate student, post doc, faculty, etc.). This plot excludes Open Science Grid (OSG) jobs. It is interesting to note that in 2011 faculty accounted for about 40% of all jobs run, but by 2017 they accounted for only about 10% as a group. Over this time period, the utilization of resources by graduate students, post docs, and university research staff increased as a percentage of all users, which speaks well for NSF’s emphasis on fostering the development of the U.S.’s next generation of computational and data scientists.
Figure 14 shows the time history of active institutions, PIs, and users running on NSF Innovative HPC Program resources per year. Over the 6 years shown, the number of PIs utilizing the resources increases by more than 540 (a 40% increase), the number of users by over 3,300 (a 93% increase), and the number of institutions by about 380 (an 80% increase). Note that this plot does not include gateway utilization.
4.2 NSF Directorate, Parent Science and Field of Science Trends
Figures 15-23 focus on trends in utilization by NSF directorate, parent science, and field of science over time. Figure 15 is a pie chart showing the aggregate XD SUs charged by parent science for the period 2011-07-01 to 2017-09-30. While Molecular Biosciences, Physics, and Materials Research account for half of all XD SUs consumed, there is substantial utilization by many of the other parent sciences. Note that since OSG jobs consume few XD SUs compared to HPC jobs, Figure 15 is unchanged if OSG jobs are excluded from the analysis. Utilization can equally well be measured by the number of jobs run, and using this metric results in a remarkably different picture of parent science utilization, as shown in Figure 16. In this case, the dominant parent sciences are Behavioral and Neural Sciences and Integrative Biology and Neuroscience, which together account for over 50% of all jobs run (75 million jobs). Figure 16 includes HTC jobs run through Open Science Grid (OSG). While OSG jobs do not consume a large fraction of CPU hours compared to HPC jobs, they clearly play an important role for many of the parent sciences. Excluding OSG jobs from this plot results in Figure 17, whose trend is more in line with that of Figure 15. However, there are substantial differences: utilization by Physics and Astronomical Sciences has diminished, while Atmospheric Sciences, Biological Sciences, and Chemistry have increased.
Note that the OSG utilization reported here is restricted to community access of OSG via the XRAC allocation process (XSEDE), and the majority of community use of OSG comes from other means.
The time history of the percentage of XD SUs charged by parent science for the same period covered in Figure 15 is shown in Figure 18. Not surprisingly, the top three parent sciences are the same as in Figure 15. However, during this time period, the share of Molecular Biosciences increases by 10% (from 20% to 30%) at the expense of Physics (20% to 10%). In addition, the share of the thirty-three “Other” parent sciences not individually labeled in this figure increases from 2% to more than 10%, reflecting the increasing reliance of many of the parent sciences on access to advanced computing resources.
Figure 19 shows the time history of the utilization of NSF Innovative HPC Program resources by the NSF directorates in terms of percent of total utilization. Mathematical and Physical Sciences (MPS) and Biological Sciences account for about 70% of the utilization as measured by XD SUs consumed. During this time period MPS utilization decreases by 10% while there was a corresponding 10% increase in utilization by researchers in the Biological Sciences directorate. This trend is also evident in the time history of parent sciences (Figure 18). Also noteworthy is the growth in utilization by the Computer and Information Sciences Directorate (CISE).
As was the case for the analysis of parent science utilization, the time history of utilization by NSF directorate is considerably different if we consider the number of jobs as the utilization metric as opposed to XD SUs, as is shown in Figure 20. Compared to Figure 19, the order of the third, fourth, and fifth ranked directorates in terms of utilization changes from Eng, Geo, and CISE to Geo, CISE, Eng, respectively.
We now consider historical trends in utilization as measured by XD SUs charged for various fields of science (FoS). We begin with a breakdown of the fields of science for the MPS directorate, which, as was shown in Figure 19, accounts for the greatest number of XD SUs consumed among the directorates. Utilization by FoS in Figure 21 is fairly constant for many of the FoS’s. However, Materials Research shows a threefold increase in the percentage of XD SUs consumed relative to the other FoS’s (10% to 30%). Similarly, Astronomical Sciences increases in percentage, albeit not as substantially as Materials Research.
As was noted in Figure 19, utilization in the CISE directorate increases during the time period studied. Figure 22 shows a breakdown of utilization by FoS for the CISE directorate. Unlike the FoS breakdown for the MPS directorate, there is great variability in utilization over time among the fields of science making up the CISE directorate. For example, in terms of percentage of total XD SUs charged in the CISE directorate, Advanced Scientific Computing varies from a high of 80% to a low of 20%. Likewise, Computer and Information Science and Engineering and Computer and Computation Research show fluctuations of 70% and 50% respectively.
The FoS breakdown for the Social, Behavioral, and Economic Science directorate shows even greater variability as is evident in Figure 23. The variability of the fields of science in terms of the percentage of XD SUs consumed within the directorate is remarkable. Given the relatively low XD SUs consumed by this directorate compared to other directorates (see Figure 19), this variability is likely a reflection of the time history of individual awards within the fields of science within this directorate.
4.3 Deep and Wide Metrics for HPC Resource Capability and Project Usage
The TeraGrid project team coined the terms “deep” and “wide” computing to describe the needs of users and their science problems [Catlett2005TeraGrid]. Deep problems are generally defined as problems that require capability computing, while “wide” refers to the community of researchers whose individual computing needs may be modest but, when taken together, represent a large capacity of computational work. In order to better understand the ability of the NSF Innovative HPC Program resources to simultaneously meet these needs, we carried out an analysis similar to that conducted by Hart [Hart2011DeepAndWideMetrics]. Project depth was defined as the maximal number of cores used by any job within a project. The reasoning behind defining project depth this way is that the jobs with the highest core count are the most computationally demanding part of the project and are essential for completing project goals. Smaller core count jobs within the same project often correspond to pre- and post-processing, low resolution preliminary calculations, and other less computationally intense work. In this way, the maximal core count for each project corresponds to the required computational depth. To quantify the width of a resource or group of resources, we follow the definition proposed by Hart, namely the fraction of projects at or below a given depth. In order to quantify the capability utilization of a resource, we adopt the NERSC definition, namely the fraction of use by projects above a given depth.
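The depth and width definitions above can be sketched as follows; the job records are invented for illustration (the actual analysis draws on XDCDB accounting data):

```python
from collections import defaultdict

# Invented (project_id, cores) job records for illustration.
jobs = [
    ("TG-AAA", 16), ("TG-AAA", 2048), ("TG-AAA", 1),  # pre/post + big run
    ("TG-BBB", 128), ("TG-BBB", 64),
]

def project_depths(jobs):
    """Depth of a project: the maximum core count of any of its jobs."""
    depth = defaultdict(int)
    for project, cores in jobs:
        depth[project] = max(depth[project], cores)
    return dict(depth)

def width_at(depths, d):
    """Width: fraction of projects whose depth is at or below d."""
    vals = list(depths.values())
    return sum(1 for v in vals if v <= d) / len(vals)
```

Note how the small pre- and post-processing jobs in TG-AAA do not affect its depth; only the 2048-core capability run does.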
The distributions of projects are shown in Figure 24-C. In addition to the unweighted distribution, distributions weighted by jobs and by utilization (as measured by CPU hours) were calculated to take into account the project size as measured by job count and total CPU hours. The integration of these three distributions produces the cumulative plot shown in Figure 24-A.
The plots are over all HPC resources for the duration of the XSEDE program. Here we find it useful to compute (as was done in Reference [Hart2011DeepAndWideMetrics]) the “joint ratio”, namely the point on the x-axis in Figure 24-A where the percentage utilization and the percentage of projects total 100%. It is the point at which x% of the projects use y% of the resources and the remaining y% of the projects consume the other x% of the resources. In terms of cores, the “joint ratio” for all NSF Innovative HPC Program resources during the XSEDE program is 73:27 at 1023 cores (green line in Figure 24-A), meaning that 27% of XSEDE projects account for 73% of the utilization, while the remaining 73% of the projects account for only 27% of the utilization. In addition to cores, Figures 24-B and 24-D include the analogous plots in terms of nodes (joint ratio at 64 nodes).
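The joint ratio can be located numerically by scanning projects in order of increasing depth until the cumulative share of projects meets the share of usage consumed above that depth; a sketch with invented (depth, core-hours) pairs:

```python
# Locate the "joint ratio" point: the depth at which the fraction of
# projects at or below that depth matches (approximately, for discrete
# data) the fraction of usage consumed by the deeper projects.
# The (depth_in_cores, core_hours) pairs are invented for illustration.
projects = [(16, 1e4), (64, 5e4), (256, 2e5), (1024, 1e6), (8192, 4e6)]

def joint_ratio(projects):
    projects = sorted(projects)                 # ascending depth
    total_usage = sum(u for _, u in projects)
    n = len(projects)
    cum_usage = 0.0
    for i, (depth, usage) in enumerate(projects, start=1):
        cum_usage += usage
        pct_projects_below = 100.0 * i / n      # "width" at this depth
        pct_usage_above = 100.0 * (1.0 - cum_usage / total_usage)
        if pct_projects_below >= pct_usage_above:   # crossing point
            return depth, pct_projects_below, pct_usage_above
    return projects[-1][0], 100.0, 0.0
```

With coarse sample data the two percentages only approximately sum to 100 at the crossing; on the full XSEDE workload this crossing yields the 73:27 ratio quoted above.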
It is informative to study the changes in the project depth, width, and joint ratio over time, as well as differences in these metrics among the HPC resources. The changes in these metrics over time are shown in Table 5 and Figure 25. It is important to note that the metrics computed in this table are obtained by summation over each project only within each calendar year, and not over the entire lifetime of the project as was the case for the joint ratio, depth, and width shown in Figure 24. Accordingly, these metrics should be considered “per year” metrics.
While the joint ratio changes little over time, the “per year” project depth decreases by almost a factor of 2. That is, in 2017, 73% of projects had a maximum core count that was about 50% of the corresponding 2011 value. In terms of capability class computing, the most interesting observation is a significant reduction in high core-count jobs after NICS KRAKEN was removed from service in the second quarter of 2014. Although many newer systems offer core counts comparable to NICS KRAKEN, recent projects rarely utilize more than 30,000 cores. While GPU and Xeon Phi accelerator use was not accounted for here, our analysis of TACC STAMPEDE showed that accelerator use was not significant in large core count jobs. The substantial decrease in the size of capability jobs can be partially explained by (1) the improved per-core performance of most systems (for example, TACC STAMPEDE single core performance is 2 times greater than NICS KRAKEN single core performance based on theoretical FLOP/S, and 50% faster as measured by XDMoD’s NAMD application kernel) and (2) the introduction of Blue Waters [Bode:2013], a resource specialized for capability computing. The latter is reflected in resource policies that limit the maximum job size allowed for users (see Table 13; for example, for TACC STAMPEDE the limit is 16,384 cores, which is 16% of the system). Another factor is the steady increase in projects that only run single node jobs. These trends are evident in Figure 25, which shows the project depth distributions at the core level and node level broken out by year. The light orange area in this figure highlights single node jobs, which show an increase over time, while the light green area highlights the decrease in large core (and node) jobs from 2011 to 2017. This trend is also supported by Figure 29, which shows the average job size by core count over the same period.
In addition to considering the changes in these metrics over time, we can also study the differences in project depth, width, and joint ratio by individual resource, as shown in Table 6. Most resources have a joint ratio close to 70:30. However, the depth at the joint ratio varies considerably. For example, NICS KRAKEN has a depth of 6136 cores, whereas TACC STAMPEDE has a depth of only 1135 cores. For some resources, typically those with some specialization such as GPU accelerators or large memory, the joint ratio is close to 60:40, indicating more homogeneous core counts across projects. There is only a weak dependence of depth on system total core count, which is most likely due to resource maximum job size policies (see Table 13 for default maximal job sizes across resources).
[Table, by year: Joint Ratio; Depth, Projects, Usage (Core-Hours), and Jobs at the Joint Ratio]
[Table 6, by resource: Joint Ratio; Depth, Projects, Usage (Core-Hours), and Jobs at the Joint Ratio; Cores; Nodes]
4.4 Summary: Trends in Utilization
Overall, the allocation utilization of NSF Innovative HPC resources is high: only 10% of allocations go unused by researchers, and the percentage unused by the top 5% of users is under 4%. Since 2011, there has been remarkable growth, of about two orders of magnitude, in the utilization of NSF Innovative HPC resources by most NSF directorates, with the Social, Behavioral, and Economic Sciences directorate now consuming as many CPU hours as the Mathematical and Physical Sciences (MPS) directorate did only 10 years earlier. Within this time frame, the MPS and Biological Sciences directorates account for about 70% of the XD SUs consumed, and this percentage has remained relatively constant. However, Biological Sciences utilization has increased by about 10% at the expense of MPS. In terms of parent science, Molecular Biosciences, Physics, and Materials Research account for half of all XD SUs consumed. Behavioral and Neural Sciences and Integrative Biology and Neuroscience account for over 50% of all jobs run, with the bulk of those on the Open Science Grid.
Our “deep and wide” analysis shows that 27% of XSEDE projects account for 73% of the utilization, while the remaining 73% of projects account for only 27% of the utilization. Average job size has decreased over time. Several factors have contributed to this decrease, including the retirement of NICS KRAKEN, the availability of Blue Waters for capability class computing, improved core performance, and resource policies limiting the maximum core count. However, the introduction of TACC STAMPEDE2, with its multicore architecture, appears to be reversing this trend.
5 Job Characteristics
Goals Addressed in Section
What fraction of the portfolio resources is used for data analytics/data intensive computing (Hadoop, Spark, etc.)? Is the trend going up or down? How does it vary among the resources?
How much of the resources usage is consumed by high throughput applications (large numbers of loosely-coupled serial, single and small node count jobs) and gateway applications, and is this changing over time?
Are jobs using a larger number of cores over time? Are there differences in job size by discipline, application, or Innovative HPC Program resource?
Is job memory usage increasing/decreasing over time?
Are there specific discipline differences in memory usage?
Are there specific memory usage differences in the most used applications?
Are there memory usage differences among the resources and does this impact throughput (i.e., result in a bottleneck)?
5.1 Types of Computing
The infrastructure necessary to support research computing, once dominated primarily by high performance computing systems, has diversified over the years and now includes infrastructure to support data intensive applications as well as high-end visualizations. The trend toward diverse infrastructures in support of research is reflected in Figure 26, which shows the number of jobs and XD SUs consumed by resource type (HTC, HPC, Cloud, data intensive, or visualization) from January 2015 to October 2017. A log scale was chosen in order to better accommodate the wide range of values among the resource types. January 2015 was selected as the start date, as opposed to July 2011, because prior to 2015 OSG (HTC) jobs were not accurately reported in the XSEDE central database. While the number of HTC jobs is almost 10 times larger than the number of HPC jobs, the HPC jobs consume about 100 times more XD SUs. Data intensive computing is substantial, at almost one tenth the size of HPC in both number of jobs and XD SUs consumed. As indicated previously, the OSG utilization reported here is restricted to community access of OSG via the XRAC allocation process (XSEDE). As discussed in the Introduction, the job type classification scheme employed to generate this figure is based on the resource type classification taken from the TeraGrid/XSEDE central database, and not on an analysis of individual research awards. For example, even though all jobs on SDSC GORDON are classified as data intensive, we know from the job performance data that community HPC applications (such as NAMD, LAMMPS, and GROMACS) ran on SDSC GORDON. These community HPC applications are likely not being used for data intensive computing. It is also possible that data intensive jobs were run on the HPC-type resources in the NSF portfolio.
Unfortunately, we do not have classification information for each job so we are not able to address the question about how much data intensive application usage is on HPC resources and how this varies over time.
It is also important to note that the cloud resource utilization in terms of number of jobs is underreported due to known issues with the accounting data sent to the XSEDE central database; the XD SUs consumed are correct. As noted in Appendix A.2, there are two methods by which jobs are submitted to IU JETSTREAM: via the OpenStack API and via the Atmosphere portal developed by CyVerse at the University of Arizona (http://www.cyverse.org/), with 31% and 69% of jobs submitted by each respective method. While we have been able to extract accurate data from the XDCDB for jobs submitted via the OpenStack API, accounting data submitted to the XDCDB by Atmosphere is aggregated by user and allocation on a roughly daily basis and groups together all virtual machines in the given reporting period. Due to this summarization, we are only able to determine the total number of XD SUs charged to a particular allocation and unable to determine information such as the number of virtual machines (VMs), the number of cores per VM, or the times that a given VM was running. We are currently working with the relevant personnel to resolve these issues and improve the data reporting.
5.2 Job Sizes
Figure 27 shows the job size distribution for the period 2011-07 to 2017-09 by percent of total XD SUs consumed within a given bin size (range of cores). While there is substantial fluctuation over the years with respect to the relative contribution that each bin size makes to the total XD SU consumption, the most obvious trend is the decrease in job sizes larger than 1024 cores. This is consistent with our “deep and wide” analysis in Section 4.3 and with Figures 28 and 29. Figure 28 shows the distribution of total core and node hours broken out into 3 bins (1-28 cores, 29-2048 cores, and > 2048 cores). The equivalent distributions in node count are also included. The precipitous drop in large core jobs in 2014 is clearly evident. Figure 29 shows the average job size by core count over the same period, both weighted by XD SUs (solid black line) and unweighted (solid blue line). The unweighted average is fairly steady over the study period (with a dip in 2016), while the weighted average shows a steady decrease from a high of almost 10,000 cores to a plateau of about 1000 cores. As discussed in Section 4.3, this decrease coincides with the retirement of NICS KRAKEN in 2014 and with the introduction of Blue Waters, a capability class HPC resource (as well as with resource job submission policies that limit the maximum number of cores). Core counts tick up strongly at the end of the study period with the addition of TACC STAMPEDE2 and its multicore architecture.
Focusing solely on the average core count for weighted and unweighted jobs, without taking into account increases in core performance over time, can be misleading. For this reason, we introduce the concept of an “effective” core. The effective core count is calculated by taking the actual core count for a given resource, multiplying it by the SU conversion factor for that resource, and then dividing the result by the SU conversion factor for NICS KRAKEN. This is shown in Figure 29, which includes time histories of the “effective” average weighted (solid gold line) and unweighted (solid red line) core counts. Not surprisingly, using “effective” cores makes the decrease in average core count less pronounced. Also shown in this figure is the effect that OSG jobs have on the average weighted (dashed gold and black lines) and unweighted (dashed red and blue lines) core counts, for both the “effective” and actual averages. Since the OSG jobs consume a small amount of XD SUs compared to the HPC jobs, the weighted curves show no difference when OSG jobs are removed. In the case of the unweighted average core counts, inclusion of OSG jobs, which were only properly accounted for starting in 2014, results in a substantial drop.
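The “effective” core normalization can be written out directly. A minimal sketch follows; the conversion factors shown are illustrative placeholders, not the actual XSEDE values.

```python
# Illustrative SU conversion factors (placeholders; the real values come
# from the XSEDE service-unit conversion tables).
SU_FACTOR = {"NICS KRAKEN": 1.0, "TACC STAMPEDE": 2.0}

def effective_cores(cores, resource, baseline="NICS KRAKEN"):
    """Scale a job's core count by the resource's SU conversion factor,
    normalized to NICS KRAKEN, so that faster cores count for more."""
    return cores * SU_FACTOR[resource] / SU_FACTOR[baseline]

def weighted_avg_cores(jobs):
    """XD SU-weighted average job size; jobs is a list of (cores, xd_sus)."""
    total_su = sum(su for _, su in jobs)
    return sum(cores * su for cores, su in jobs) / total_su
```

With a factor of 2 for TACC STAMPEDE (matching the theoretical FLOP/S comparison quoted in Section 4.3), a 1000-core STAMPEDE job counts as 2000 KRAKEN-equivalent cores.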
Another trend in job sizes is a significant increase in single core and single node jobs, as evident in Figure 28. Figures (a)a and (b)b look in greater detail at serial and single-node jobs, showing single-core and single-node jobs grouped by resource. Included in both plots are the total number of jobs run (solid black line) and the total number of jobs run excluding OSG jobs (dashed blue line). The lack of single-core jobs until 2015 is due to exclusive node policies, under which an entire node is allocated to single-core requests, and to missing OSG data. As discussed previously, prior to 2015 OSG jobs were not properly accounted for in the XSEDE central database (they were severely under-counted), and accordingly the large increase in jobs that occurs in the 1st quarter of 2015 is a reflection of these jobs now being properly accounted for. All OSG jobs are serial, and although there are a large number of them, they consume a relatively small fraction of the total CPU hours (see also Figure 26). Even excluding OSG jobs, serial and single-node jobs have grown over time (dashed blue lines). Much of this can likely be attributed to the introduction of node-sharing policies on SDSC COMET and PSC BRIDGES as well as increases in single core performance and core count per node.
In terms of XD SUs consumed, the percentage of single node jobs (excluding OSG jobs) increased from 5% of all XD SUs consumed in the second half of 2011 to 21% in the first three quarters of 2017, and at least 14% of the latter was consumed by serial jobs (that is, 3% of all XD SUs consumed). This trend is reflected in Figure 35, which shows the percentage of single node jobs in terms of total number of jobs run (Figure (a)a) and total number of XD SUs consumed (Figure (b)b). In the first three quarters of 2017, single node jobs accounted for more than 60% of the total jobs (excluding OSG jobs). Subsection 5.5 (Concurrency and parallelism) contains additional information on single node and serial job utilization.
5.3 Memory Usage
This subsection describes the memory usage of the jobs run on the various facilities. We look at the memory distributions and trends over time and discuss the large memory jobs in particular. The memory usage of an HPC job varies over time but we only sample the memory usage periodically. Therefore, we use two different metrics to characterize the job memory usage — the mean value of these samples and the maximum value. To facilitate easy comparison of memory usage between jobs of different sizes we generally present the memory per core (which is equivalent to the memory used per process for most parallel HPC jobs). The memory data presented here includes the usage by the operating system (O/S) as well as the HPC user software. Therefore, any observed changes could be due to HPC user software or the O/S. The O/S buffer cache and O/S kernel slab cache are not included in the memory usage data, see Appendix A.5 for further information.
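As a concrete sketch of the two metrics, assume a job's memory usage is sampled periodically in bytes; the helper below is hypothetical (the actual collection pipeline is described in Appendix A.5).

```python
def memory_per_core(samples_bytes, cores):
    """Mean and maximum of a job's periodically sampled memory usage,
    per allocated core, in GB. Per the text, samples include O/S usage
    but exclude the buffer cache and kernel slab cache."""
    gib = [s / 2**30 for s in samples_bytes]
    return sum(gib) / len(gib) / cores, max(gib) / cores
```

Dividing by the allocated core count makes jobs of different sizes directly comparable, and for most parallel HPC jobs equals memory per process.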
The average memory used per core across all resources (for which node level metrics were available; see note below), weighted by core-hours, is shown in Figure 36. There is a modest trend towards increased memory usage per job. The memory usage by jobs in the large memory queues is shown in Figure 37. As expected, the memory usage per core is substantially higher in the large memory queues, but there is no time dependent trend to higher values. There is also no trend to larger usage (that is, no increase in XD SUs) for the large memory queues relative to overall normal queue usage (not shown). The average memory usage per job analyzed by individual resource is shown in Figure 38. Note that all resources have reasonably flat memory usage over time. Aside from a few outlying spikes, probably caused by a few large memory jobs run during low system usage periods, the core-weighted average memory usage on all resources is less than 1 GB/core. The memory usage does vary between resources, with jobs on newer resources having a higher average memory usage than those on older resources. One reason for this difference is O/S memory usage, which is generally smaller on older resources. For example, the O/S usage on TACC LONESTAR4 was approximately 450 MB per compute node (37 MB per core), whereas newer resources such as TACC STAMPEDE and SDSC COMET have approximately 4 GB per compute node (250 MB and 170 MB per core, respectively). A more in-depth picture of the memory usage was obtained by looking at the entire distribution as a function of time rather than the core-weighted average value (not shown). It confirms that, over the time range of the XSEDE program, while individual resources have flat memory usage, the overall memory usage of the complete ensemble of machines increases. Figures 36 through 38 only include resources for which memory metrics are available; notably, the large memory resource PSC BRIDGES LARGE is not included in the analysis due to the absence of data.
A simple histogram can provide the distribution of memory usage by all jobs for all resources. Figure (a)a shows a version with the jobs weighted by core-hour; the non-weighted version (not shown) is similar. Most resources have on the order of 2 GB/core or more of memory; however, since the memory per core varies among resources, it is more appropriate to look at memory utilization for individual resources. Figure (b)b shows the corresponding distribution for TACC STAMPEDE; the mean value of memory used per core is only about 19% of the available memory. Other resources have a similarly small fraction of memory used. The standard compute nodes in TACC STAMPEDE have 2 GB/core; the tiny tail beyond 2 GB is therefore due to large-memory nodes, which have 32 GB/core. Note that the offset from zero memory usage is due to O/S memory usage. With the exception of the offset, the individual resource plots for SDSC COMET, SDSC GORDON, and TACC LONESTAR4 (not shown) are similar. Figure (a)a shows a 2-D histogram with fraction of memory used on the x-axis and fraction of cores on the y-axis. Fraction of memory is defined as the fraction of the total available memory that is used by the job; fraction of cores is defined as the ratio of cores used by the job to the total available cores on the particular HPC resource on which the job ran. As discussed in Section 4.3, there are resource policies that limit the maximum job size allowed for users. Therefore, there are very few jobs that use a sizable fraction of the nodes on any of the resources. The largest jobs tend to be very diverse in their memory usage: some use up to half of the available memory, while others use only a rather small fraction of it. Most of the jobs are relatively small, using much less than 1/10 of the nodes of the resource on which they run. Here too the memory use is very diverse. Only a relatively small tail requires more than 1/2 of the available memory.
A more insightful analysis can be achieved by examining not only the memory usage but simultaneously considering the job CPU usage, as measured by the ratio of time spent in user mode to total time (cpu-user). Figure 47 shows 2-D histograms with cpu-user on the x-axis and fraction of memory on the y-axis. Fraction of memory is defined as the fraction of the total available memory that is used by the job. Figure (a)a shows this analysis for TACC STAMPEDE. A similar analysis, not shown here, in which the jobs are weighted by cpu-hour has essentially identical features. The first feature to note is that the jobs with high CPU usage, those with cpu-user near one, also tend to use large fractions of the memory, on the order of 0.5 or more. Conversely, jobs with low cpu-user values tend to use relatively less memory. Another feature of the plot is the vertical streaks of higher memory usage at cpu-user fractions of 0.50, 0.25, 0.125, and 0.0625. TACC STAMPEDE had 16 cores per node, and these streaks correspond to 8-way, 4-way, 2-way, and 1-way jobs, respectively. In order to determine whether these streaks are primarily due to single node jobs that do not utilize all cores on the node, we carried out a similar analysis (not shown) in which single node jobs were excluded. This analysis produced results similar to Figure (a)a, so we conclude that the observed striping of the memory usage is due predominantly to multi-node jobs. Many of the jobs that do not use the full 16 cores per compute node may be using fewer cores per node to obtain more memory per process. Interestingly, there is also a faint streak at 0.75 corresponding to 12-way jobs. This may correspond to jobs ported over from TACC LONESTAR4 in which the users fail to take advantage of TACC STAMPEDE’s 16 cores as opposed to TACC LONESTAR4’s 12 cores. Figure (b)b shows an analogous analysis for SDSC COMET. Here there is an almost linear relationship between cpu-user and memory fraction.
Also note that there are no obvious vertical stripes as there are for TACC STAMPEDE. There are some faint horizontal stripes at 0.56, corresponding to 3 GB per core of memory usage; these are too faint to pick up in the simple 1-D histogram of SDSC COMET memory usage. Like TACC STAMPEDE, TACC LONESTAR4 shows vertical streaks at particular values of cpu-user, except that now the strongest are at 2/3 (8-way jobs), 1/3 (4-way jobs), 1/6 (2-way jobs), and 1/12 (1-way jobs), matching TACC LONESTAR4’s 12 cores per node. There is also a streak at a cpu-user value of 0.5 corresponding to 6-way jobs which, although not a power of 2, use exactly half of the cores on each node. Figure (b)b is an analogous composite plot of all resources for which memory information is available. Not surprisingly, it most closely resembles the TACC STAMPEDE plot, since many jobs come from this resource, but there are some features from the other resources for which memory data is available. For example, the strong vertical features from TACC STAMPEDE indicating jobs using only a fraction of the cores are diluted, but there are other faint features corresponding to the same partial core usage on the other resources, which have different numbers of cores per node.
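The two coordinates used in these 2-D histograms can be computed per job roughly as follows. This is a sketch with assumed inputs (per-core user and total CPU time, plus memory used and available); the production pipeline derives them from tacc_stats samples.

```python
def histogram_coords(user_time, total_time, mem_used, mem_available):
    """Return (cpu-user, fraction of memory) for one job: the share of
    allocated CPU time spent in user mode, and the share of the
    available memory the job used."""
    return sum(user_time) / sum(total_time), mem_used / mem_available
```

An 8-way job on a 16-core node, with its 8 idle cores accumulating no user time, lands near cpu-user = 0.5, producing the kind of vertical streak described above.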
Given that, as shown in Figure (a)a, the majority of jobs on all resources use less than 2 GB per core, one may be inclined to conclude that there is little need to design HPC architectures with more than 2 GB per core, as long as resources designed specifically for large memory jobs, such as PSC BRIDGES LARGE, are available. However, figures such as (a)a tell only part of the story with respect to memory utilization, and drawing such conclusions from these data alone is not advisable. First, for some of the resources, such as TACC STAMPEDE, the majority of nodes have only 2 GB per core, and users whose applications require more than this obviously will not run on that resource. Second, TACC STAMPEDE accounts for the majority of XD SUs consumed since 2014, and therefore averaging the memory use over all resources, as was done in Figure (a)a, has the effect of masking the (larger) memory use of resources with smaller utilization. Third, many of the scientific applications running on present day HPC systems were designed to run with modest per-core memory requirements; however, many of today’s fastest growing research areas are data intensive, requiring modest core counts but large memory per core.
As noted above, the usage by jobs that have large per-core memory requirements is rather small, as shown in Figure 50, which gives the maximum memory used per core weighted by core-hour for SDSC COMET and TACC STAMPEDE. TACC STAMPEDE shows few jobs in the large memory queue (those beyond 2 GB/core in Figure (b)b), and SDSC COMET shows a long small tail of jobs in the range of 2–5.3 GB/core and very few jobs in the large memory queue (Figure (a)a). This is not surprising, since plots of core-hour weighted data cannot show the details of the jobs running on large memory nodes: their usage is very small compared to the overall “normal” queue usage (there are only 16 large memory nodes out of 6400 on TACC STAMPEDE and 4 out of 1944 on SDSC COMET).
Figure 53 shows a 2D binned scatter plot of the total peak memory usage for jobs on TACC STAMPEDE and SDSC COMET. The peak memory usage is defined as the maximum value of the sampled memory usage for a job. These plots show the total usage rather than the per-core value. Note the logarithmic color coding of the job bins, which is intended to emphasize the distribution of the outlying large memory jobs. For TACC STAMPEDE, Figure (a)a, the O/S memory usage is clearly visible as the locus of the lowest memory bins for each node count. While most of the memory usage points are concentrated in the lower left near the origin, there are some patterns, such as the vertical stripe at 1024 nodes, reflecting the default maximum permitted job size (without a special request). Other relatively popular job size choices include 512, 256, and even 2048 nodes. The actual memory usage varies widely over these large jobs, although there are no very large (> 2000 node) jobs that use all of the available memory. The small vertical line near the origin corresponds to jobs running on the large memory compute nodes. With the exception of one outlier, all of the jobs use less than 50 TB of memory in total. For SDSC COMET, Figure (b)b, as for TACC STAMPEDE, most of the jobs are concentrated near the origin. The O/S memory usage is approximately the same on SDSC COMET as on TACC STAMPEDE but is not as obvious due to the scale of the plot. The job sizes are dramatically smaller, which is expected since the resource scheduling policy is specifically designed to promote small jobs. The vertical line at 72 nodes represents the maximum allowed job size. The jobs running in the large memory queue are clearly visible as the points at 1 node above the black guide line. As for TACC STAMPEDE, the larger jobs use a wide range of memory. Note that in Figure (b)b a dashed red line has been added to the chart; this line corresponds to the solid black line in Figure (a)a for TACC STAMPEDE. Comparing the two charts, one can see that the large memory jobs on SDSC COMET above the dashed red line utilize more memory per node than is available on the standard nodes of TACC STAMPEDE. Hence, although these jobs are small in terms of node count, they utilize significantly more memory per node.
Although we have shown that relatively few jobs utilize more than 2 GB of memory per core, the question remains what type of jobs have large memory usage that would be difficult or impossible to run if available resource memory were substantially reduced. The Parent Sciences supported by high memory usage jobs are shown in Figure 54. The figure shows both the large memory queue usage (red) and the high memory tail of the normal compute queue (blue). The majority of this usage is on SDSC COMET. Astronomical Sciences, Physics, and Chemical Thermal Systems lead the list of memory intensive disciplines. Biosciences are relatively low on this list considering their overall large usage. A bit more insight comes from examining the applications that require a large amount of memory, as shown in Figure 55. Although there are a few specific community applications, such as CACTUS and PSDNS, that are large memory users, by far the largest application on this list is “uncategorized.” The classification “uncategorized” only occurs when we have the job level data necessary to classify the application but no application is matched. A machine learning based study was previously done on the “uncategorized” jobs [Gallo2015]. A model was developed that was able to classify jobs with better than 90% accuracy. It was found that 80-90% of the “uncategorized” jobs were in fact custom user code and only 10-20% were community applications that had been missed by our regular expression classification scheme. Hence, the great majority of the jobs in this category are custom user code that requires more than 2 GB of memory per core to run. Obviously, this is a critical category of innovative usage that needs to be supported.
5.4 Parallel Filesystem Usage
All of the resources in the present analysis for which data is available employ the Lustre parallel file system. These resources include SDSC COMET, SDSC GORDON, CCT LSU SUPERMIC, TACC LONESTAR4, TACC RANGER, TACC STAMPEDE, and TACC STAMPEDE2 (see Table 7 for metrics availability dates). Based on the resources listed in Table 7, the distributions of several parallel file system characteristics on a per job basis are shown in Figure 56, namely file openings, bytes written and read, and write and read rates. Note that the rates are averaged over the duration of the job and do not reflect instantaneous network rates. Figure 57 shows the same distributions using a log scale for both axes in order to better visualize less frequently occurring but more extreme usage. The weighted distributions (in red) for files opened, bytes read, and bytes written (the three plots on the top left) are shifted to the right of the non-weighted distributions (in blue). This implies that larger jobs, which are weighted more heavily in the node-hour weighted distribution, do substantially more parallel file system operations. We can determine why this is so by considering the average read and write rates shown at the bottom of Figure 56. The average read and write rates, and the analogous per node adjusted average read and write rates, show that the weighted and unweighted rate distributions (red and blue) are very similar. Hence, the greater absolute number of file opens, reads, and writes in the large XD SU jobs is primarily due to larger job size (more nodes) rather than such jobs inherently having higher file system usage rates. In fact, for per node file opens, the (red) weighted distribution is shifted to the left of the (blue) unweighted one, indicating the opposite conclusion, namely that the smaller SU jobs do proportionately more file opens.
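The duration-averaged rates in Figure 56 are simply per-job totals divided by walltime, with the per-node adjustment dividing again by node count. A minimal sketch (the function and field names are our own):

```python
def lustre_rates(bytes_read, bytes_written, walltime_s, nodes):
    """Job-duration-averaged read/write rates in MB/s, total and per
    node. These are averages over the whole job, not instantaneous
    network rates."""
    read = bytes_read / walltime_s / 1e6
    write = bytes_written / walltime_s / 1e6
    return {"read": read, "write": write,
            "read_per_node": read / nodes, "write_per_node": write / nodes}
```

A job that moves 1 GB of reads and 2 GB of writes over 1000 s on 2 nodes averages 1 MB/s read and 2 MB/s write overall, or 0.5 and 1 MB/s per node.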
Figure 58 shows daily reads vs writes to the parallel file system for 6 HPC resources. Note that only read and write activity from the compute nodes, and not the head nodes, is included; thus data uploaded to and downloaded from the resource are not counted. The diagonal line indicates an equal number of reads and writes, and it can be seen that all of the resources show an approximately even balance between the two. For TACC LONESTAR4 and TACC STAMPEDE this balance between read and write activity to the parallel file systems is very even overall, with a fair amount of fluctuation, which is similar to the Blue Waters system. LSU SUPERMIC, SDSC GORDON, and TACC STAMPEDE2 show more writes, which can be due to less data input and more post-analysis on the resource, for example if not all of the generated data is used (e.g., checkpoint-file writes or excessively verbose logs) or if the system cache is effective in reducing the actual reads from the file system. SDSC COMET, on the other hand, shows more reads, which could be a characteristic of the smaller jobs generally run on this resource.
5.5 Concurrency and Parallelism
In this subsection we look at the degree of concurrency and parallelism of the HPC jobs running on Innovative HPC Program resources. In particular, for multi-process jobs, how are the processes distributed across the cores on the compute nodes and how has this changed over time? Does this vary across HPC resources or by queue or application? We also look at the breakdown of different job launch types. Please note that, at the time of this report, job and node level performance data has not yet been implemented for PSC BRIDGES or IU JETSTREAM and therefore these resources are only included in analyses that depend on accounting level data (see Appendix A.2 for details).
Figure 59 shows the XD SUs broken down by CPU usage for TACC RANGER, TACC LONESTAR4, TACC STAMPEDE, SDSC GORDON, and SDSC COMET. The CPU usage data were collected by the tacc_stats [Evans:2014] software. The CPU usage shown here is the percentage of time that the CPU cores allocated to each HPC job spent running user-space processes. Detailed information about the collection and processing of the job performance data is given in Appendix A.5. The overwhelming majority of the jobs have high CPU usage, meaning that most jobs use all or almost all of their allocated CPU cores. The smaller peaks at lower CPU usage correspond to single core jobs and to jobs that use half of the cores on a compute node, respectively. (The CPU usage information alone is not sufficient to determine how many cores were active, since a CPU usage of one half could mean half of the cores at full usage or all of the cores at half usage; this analysis is also based on the process usage data discussed below.)
Most traditional HPC resources provide a “serial” queue intended for jobs that do not require the interconnect and a “large memory” queue that provides access to compute nodes with a large amount of RAM compared to the default compute node. Figure 62 shows the breakdown of CPU usage for jobs that were run in the different queues. Large memory compute nodes typically have more CPU sockets (and hence CPU cores) than the “normal” queue compute nodes due to the prevalent NUMA hardware architecture. For example, the large memory nodes on TACC STAMPEDE have 4 sockets (32 cores) rather than the 2 sockets (16 cores) of the general compute nodes. The CPU usage of the large memory queues for TACC RANGER, TACC LONESTAR4, TACC STAMPEDE, SDSC COMET, and NICS DARTER is shown in Figure (a)a. Note that the usage of the large memory queues is much less than the overall usage. The majority of HPC jobs, by XD SU, in the large memory queues use all of the CPU cores. However, there is a relatively larger peak at lower CPU usage compared to the overall job mix. The serial queue usage is shown in Figure (b)b. Many jobs use all of the available cores, but nearly one third of the jobs have CPU usage of less than ten percent, which corresponds to single or two core HPC jobs.
Another place to run serial and small (by core count) jobs is the shared queues that are available on several resources. Jobs running in the shared queue share compute nodes. This improves job throughput and leads to higher overall resource utilization with only minor impact on individual job performance [White:2014:ANS:2616498.2616533]. Figure 63 shows the job count and XD SUs for shared queue jobs on the resources that allowed shared-node jobs. Most of the usage is on SDSC COMET. Of the 6.5 M jobs on SDSC COMET, the majority were single core jobs (5.6 M jobs, 143 M XD SUs).
The HPC resources discussed so far all have the same order of magnitude of CPU cores per compute node. The current trend is to increase the number of cores per compute node. The Intel Knights Landing compute nodes on TACC STAMPEDE2 have 68 cores with 4 hardware threads per core (for a total of 272, an order of magnitude more than TACC STAMPEDE). The recommended usage is to run 64–68 MPI tasks or independent processes per node with 1–2 threads per process [Stampede2UserGuide2017]. Table 8 shows the breakdown of usage in XD SUs by process/thread count per compute node for TACC STAMPEDE2. For the jobs for which we have data, the majority use between 32 and 68 processes per compute node. There is significant usage (198 M XD SUs) by jobs that use 32 or fewer cores; for jobs in this category, the most popular core counts are 16 and 32. Note that the Intel Skylake compute nodes on TACC STAMPEDE2 came into production after the end of the workload analysis period and so are not considered here.
Table 8: Number of processes/threads | XD SUs
Figure 66 shows the breakdown of job launch type by XD SUs for the three TACC resources that ran the Lariat [Lariat2013] software. The job launcher ibrun on these resources is instrumented to collect information about the job, including the number of requested parallel processes and the number of threads per process. We label a job “multi-process + multi-threaded” if there are multiple parallel processes with multiple threads per process, “multi-process” for multiple parallel processes with one thread per process, “serial” for single-process, single-thread jobs and “multi-threaded” for single-process, multiple-thread jobs. The main causes for the job launch type to be “unknown” are that the job was not launched using the instrumented ibrun or that there was an error in the data collection infrastructure. TACC RANGER data before October 2012 had a version of Lariat that did not collect the thread information but did collect application usage. There is very little recorded usage of “serial” and “multi-threaded” jobs in this data. This is likely an under-estimate of the actual usage because ibrun may not have been used to launch these classes of job (the TACC STAMPEDE user guide did not recommend using ibrun to run serial jobs). There are approximately 372 M XD SUs of single-node jobs in the “unknown” category, which gives an upper bound estimate of the “serial” or “multi-threaded” job usage. Note that this analysis does not distinguish between different parallel processing implementations (MPI, CAF, charm++, etc.) nor threading models (pthreads, OpenMP); the concurrency determination is inferred solely from the job configuration.
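The four launch-type labels above amount to a simple decision rule on the per-job process and thread counts. A minimal sketch follows; the function name and arguments are illustrative and not part of the Lariat or ibrun tooling.

```python
def classify_launch_type(num_processes, threads_per_process):
    """Label a job using the four-way taxonomy described in the text.

    Both arguments come from the instrumented job launcher; None means
    the information was not collected (e.g. ibrun was not used, or the
    data collection infrastructure failed).
    """
    if num_processes is None or threads_per_process is None:
        return "unknown"
    if num_processes > 1 and threads_per_process > 1:
        return "multi-process + multi-threaded"
    if num_processes > 1:
        return "multi-process"
    if threads_per_process > 1:
        return "multi-threaded"
    return "serial"
```

For example, a 16-task MPI job with one thread per task is labeled "multi-process", while a 4-task job with 8 OpenMP threads per task is "multi-process + multi-threaded".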
Figure 69 shows the amount of node time on TACC LONESTAR4 and TACC STAMPEDE by the number of active kernel scheduling entities (O/S processes) per compute node. This value is an estimate generated by recording the number of runnable O/S processes periodically during each HPC job and then taking the median value; see Appendix A.5 for more details. An O/S process could be an MPI task for an MPI job or a thread for an OpenMP job. The general compute nodes on TACC STAMPEDE have 16 cores (hyperthreading was disabled). The general compute nodes on TACC LONESTAR4 had 12 cores. For TACC STAMPEDE, the vast majority of the jobs by node hour run at least one O/S process per hardware core. The next most popular configurations are a single process per compute node and under-subscription by one half and one quarter (8 and 4 processes respectively). TACC LONESTAR4 has a very similar pattern with peaks at 1/12, 1/2, 1/4 and 2/3. The plots for the other HPC resources are qualitatively very similar to the two shown here. Recall that in Figures 47 and (b)b a similar pattern was noted, where a minority of jobs use a fraction of the cores per node; since these had relatively high memory usage they were ascribed to memory-intensive jobs. Another typical reason to undersubscribe the cores is for I/O intensive jobs.
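The per-job concurrency estimate described above (periodic samples of the kernel's runnable-process count, reduced to a single median value) can be sketched as follows; the sample values are made up for illustration.

```python
from statistics import median

def estimate_processes_per_node(runnable_samples):
    """Estimate the number of active O/S processes per compute node
    from periodic samples of the kernel's runnable-process count.

    As described in Appendix A.5, the median over the job's lifetime
    is used, which suppresses transient spikes (e.g. short-lived
    system daemons waking up during a sample).
    """
    if not runnable_samples:
        raise ValueError("no samples collected for this job")
    return median(runnable_samples)

# A 16-core TACC STAMPEDE node running one MPI task per core,
# with one sample perturbed by a transient system process:
samples = [16, 16, 17, 16, 16, 16]
assert estimate_processes_per_node(samples) == 16
```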
The large memory nodes on TACC STAMPEDE have 32 cores per compute node and the large memory jobs that use all cores are in the 32-process bin. Note that TACC LONESTAR4 used Linux kernel version 2.6.18, which used the O(1) process scheduler, while TACC STAMPEDE ran the 2.6.32 kernel with the CFS scheduler. This difference in scheduler software explains the difference in the reported runnable-thread counts between TACC LONESTAR4 and TACC STAMPEDE. This data does not provide any information about oversubscription of the cores on the compute nodes since the number of runnable processes reported is constrained by the number of cores.
The plots in Figure 72 illustrate the number of O/S processes run on the compute nodes for jobs that undersubscribed the compute nodes (i.e. used fewer processes than available CPU cores). Figure (a)a on the left shows the XD SUs consumed by jobs that requested fewer than 12 processes per compute node on TACC LONESTAR4. The number of processes per compute node is obtained from the resource manager log files (both TACC LONESTAR4 and TACC RANGER used the Sun Grid Engine (SGE) resource manager). The jobs are split into two categories: jobs that ran 12 O/S processes per compute node and those that ran fewer than 12. The estimate of number of O/S processes per compute node is based on the median number of runnable processes reported by the Linux kernel as per the data in Figure 69. Similar information is presented for the 16 cores per compute node TACC RANGER in Figure (b)b.
To analyze how concurrency varies with application, we plot the application usage for the general compute queue on SDSC COMET in Figure 75. Figure (a)a shows the application usage for jobs that used all 24 cores per compute node and Figure (b)b shows the application usage for the jobs that only used 16 of the 24 cores on the assigned nodes. The determination of the number of active cores per node was done using the runnable threads metric. The application mix for the two different cases is quite similar. Note the difference in scale of the two plots: the y-axis is an order of magnitude larger for the 24 cores per node plot. One application that appears only in the 16 cores per node list is MITGCM; this code has only been run by one project team and all instances use 16 cores per node. If we look at NAMD, the majority of the XD SUs are run at 24 cores per compute node, but there are peaks at 16 cores per node and at 12 cores per node. The NAMD usage at 16 cores per node is almost entirely due to a single project team. Similarly, the usage at 12 cores per node is predominantly due to a single, different project team.
The time evolution of the job types is illustrated in Figure 76, which shows the job types for TACC STAMPEDE. The plots for TACC RANGER and TACC LONESTAR4 are qualitatively similar. There is very little change in the mix of multiple processes vs multiple processes with multiple threads over time. The occasional increases in the multiple processes with threads category are caused by a project or projects using their allocation in a relatively short space of time. The serial and multi-threaded job usage is so small that it is barely visible on the plot; the usage of these types is a small fraction of the overall usage.
For the majority of the study period the different HPC resources have had the same order of magnitude of cores available per compute node (12–24), with successive generations of hardware having faster cores with increased vector operations. With the increase in the number of cores per node in recent hardware architectures, it will be interesting to see how concurrency changes in the future. The initial data from TACC STAMPEDE2 suggests that a significant proportion of jobs are not yet using all of the cores available and are still using the same number of cores per node as previous resources. We do not have sufficient data to determine the reason for this, but it could be that users are simply reusing existing launch scripts and need only increase the number of processes per node. Alternatively, it could be that the existing software is not well suited to running at large concurrencies and changes to the software may be needed.
Interactive jobs are expected to have much lower CPU usage than batch jobs because the compute nodes are likely to be idling while waiting for user input. Unfortunately, whether a job is interactive or not is not specifically tracked in XDMoD for the NSF resources as of the date of this report. We can, however, indirectly illustrate the difference in CPU usage between interactive and non-interactive (batch) jobs; see Figure 79. The distribution of XD SU by CPU user is shown in Figure (a)a for all jobs for which we have Lariat data in the time frame of 2016-01-01 to 2017-09-01. Lariat data is only available for jobs that were launched using the ibrun command. The user guide for TACC STAMPEDE recommends that users do not use the ibrun command to run an interactive job (although the ibrun command could be used to run software within an interactive job). Note that the great majority of these jobs have high CPU usage (> 90%). Figure (b)b shows the CPU user distribution for jobs for which we do not have Lariat data. Although there may be other causes for a job to be missing Lariat data, all jobs that are interactive will be in this category. Note the difference in the distribution of CPU user. For this job category, the great majority of jobs have very low CPU usage (< 10%).
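The indirect split described above can be sketched as a simple partition of jobs by the presence of Lariat data, summarizing the fraction of each group above 90% and below 10% CPU user. The record layout and field names are illustrative, not the XDMoD schema.

```python
def split_cpu_usage(jobs):
    """Partition jobs by whether Lariat data exists and report the
    fraction of each group with high (>90%) and low (<10%) CPU user.

    `jobs` is a list of dicts with keys 'cpu_user' (a 0-1 fraction)
    and 'has_lariat' (bool); both field names are illustrative.
    """
    groups = {True: [], False: []}
    for job in jobs:
        groups[job["has_lariat"]].append(job["cpu_user"])

    def summarize(vals):
        n = len(vals)
        return {
            "high": sum(v > 0.9 for v in vals) / n,
            "low": sum(v < 0.1 for v in vals) / n,
        }

    return {k: summarize(v) for k, v in groups.items() if v}
```

On real data, the expectation from the text is that the `has_lariat` group is dominated by the "high" bucket while the no-Lariat group (which contains all interactive jobs) is dominated by the "low" bucket.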
5.6 Job Failures
The exit status of each HPC job is recorded in the job scheduler accounting logs. The interpretation of the exit status depends on the job scheduler software used by each resource; most schedulers report the exit status of the job batch script and also report any failures in the scheduler software or compute nodes. In this section we investigate the job exit status for TACC STAMPEDE and SDSC COMET. Both of these resources use the Slurm job scheduler.
Table 9 shows a summary of the job exit codes and the number of jobs that ended with each particular exit code for TACC STAMPEDE and SDSC COMET. An exit code of “completed” indicates that the job finished without reporting an error, “canceled” indicates that the job was canceled after submission, “timeout” indicates that the job was killed by the scheduler because it reached the requested wall time limit, and “node-fail” indicates that the job ended prematurely due to a problem with the scheduler software or with one of the compute nodes to which the job was assigned. An exit code of “failed” indicates that the job batch script returned a non-zero exit code.
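The categories above can be illustrated with a small mapping from Slurm's job state strings (as reported by, e.g., sacct) onto the labels used in Table 9; the "other" fallback and the helper function are illustrative additions, not part of the report's tooling.

```python
# Map Slurm's job state strings onto the exit-code categories of
# Table 9. The state names on the left are Slurm's own; the grouping
# on the right mirrors the descriptions in the text.
STATE_CATEGORIES = {
    "COMPLETED": "completed",
    "CANCELLED": "canceled",
    "TIMEOUT": "timeout",
    "NODE_FAIL": "node-fail",
    "FAILED": "failed",
}

def categorize(slurm_state):
    """Return the report's exit-code category for a Slurm job state.

    States not listed above (e.g. PREEMPTED) fall into 'other'.
    Cancelled states may carry a suffix, e.g. "CANCELLED by 1234",
    so only the first token is matched.
    """
    base = slurm_state.split()[0].rstrip("+")
    return STATE_CATEGORIES.get(base, "other")
```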
The job batch scripts are controlled by the end-users, so the reported exit status is unlikely to be consistent between jobs for different users or even between different jobs for a single user. Despite this known ambiguity around the batch script exit codes, we carried out an analysis of job exit codes for TACC STAMPEDE and SDSC COMET. The first analysis was carried out on the node-fail jobs on TACC STAMPEDE. Since this exit status is a job scheduler status and is not directly controlled by the user-supplied batch script, it should be an unambiguous indicator of a failed job and provides a lower bound on the job failure rate. Since a single node failure on a multi-node job will cause the entire job to fail, we would expect that the probability that a job will fail is strongly dependent on the number of nodes in the job. Since a job failure is a binary outcome (a job can either fail or complete), a logistic regression was used to quantitatively determine the probability of node-failure as a function of the number of nodes in the job. For a simple initial analysis we take advantage of the fact that the overall distribution of job wall time is not greatly dependent on job size. The TACC STAMPEDE node-fail data was fit using logistic regression; the fit was well behaved and the slope and intercept were statistically significant as measured by their p-values. Figure (a)a plots the result of this logistic regression; TACC STAMPEDE has 6400 compute nodes. For large capability jobs using one third to one half of the nodes, the probability of a node-failure is still small. Very few jobs exceed this number of nodes on TACC STAMPEDE, as shown in Figure 27 (the job scheduling policy on TACC STAMPEDE has a default maximum job size of 1024 nodes). Hence the data in the knee of the curve in Figure (a)a should be the most accurate. The extrapolation to full utilization of all 6400 nodes of TACC STAMPEDE is somewhat more uncertain due to the scarcity of the data. We can perform a similar analysis on the TACC STAMPEDE failed jobs if we include both the failed jobs and the node-failed jobs; however, this assumes that all of the jobs with a failed exit status are truly failed jobs.
This leads to the conclusion that the failure probability of even small jobs is well above zero (on the order of 0.1) and it increases nearly linearly with the number of nodes, approaching 1 for jobs that are the full size of TACC STAMPEDE. A failure rate of this magnitude is unrealistic and leads to the obvious conclusion that the job batch script exit status is not a reliable indicator of job failure. A similar analysis carried out on SDSC COMET data showed similar results. A more sophisticated logistic regression model of the TACC STAMPEDE node failure data can be used where the probability of a job failure is a function of wall time in years raised to the power of the number of nodes. This model is a bit more complex to interpret. Figure (b)b plots the result of this more complex logistic regression model. This model also produces a well behaved fit with statistically significant parameters. Once again the data in the knee of the curve where the failure probability just starts to rise above zero should be the most accurate.
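A minimal, pure-Python sketch of the logistic fit described above is given below, using synthetic jobs generated under the assumption that each node fails independently with a small per-node probability q. Both q and the fitting hyper-parameters are illustrative; the report's actual fits were produced with a statistical package that also reports p-values.

```python
import math
import random

def fit_logistic(nodes, failed, lr=0.5, epochs=1500):
    """Fit P(fail | n nodes) = 1/(1 + exp(-(a + b*n))) by batch
    gradient ascent on the log-likelihood; returns (a, b).
    A pure-Python stand-in for a statistics package: it produces
    coefficients only, not the p-values quoted in the text."""
    scale = float(max(nodes))          # rescale so the step size behaves
    xs = [n / scale for n in nodes]
    a = b = 0.0
    m = len(xs)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, failed):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / m
        b += lr * grad_b / m
    return a, b / scale                # undo the rescaling on the slope

def p_fail(n, a, b):
    """Predicted probability that an n-node job suffers a node failure."""
    return 1.0 / (1.0 + math.exp(-(a + b * n)))

# Synthetic jobs: assume each node fails independently with a small
# per-node probability q, so an n-node job fails with 1 - (1-q)^n.
random.seed(0)
q = 0.0005
sizes = [random.randint(1, 1024) for _ in range(500)]
fails = [1 if random.random() < 1.0 - (1.0 - q) ** n else 0 for n in sizes]
a, b = fit_logistic(sizes, fails)
```

With these synthetic inputs the fitted slope is positive, so the predicted failure probability rises with job size, reproducing the qualitative shape of the fitted curves in the figures.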
Table 9: Exit Code | TACC STAMPEDE | SDSC COMET
5.7 Summary: Job Characteristics
The average job size, weighted by XD SUs, shows a steady decline from 2011 - 2017 (9,000 cores to 1,000 cores). This is attributable to the retirement of NICS KRAKEN, the introduction of Blue Waters, improved core performance (reflected in higher XD SUs/core-hour), and individual job size limits at the resource level. However, a very recent trend, with the addition of the TACC STAMPEDE2 Xeon Phi resource, shows increasing percentages for the largest job sizes (8k and larger in core count).
The percentage of single node jobs (excluding OSG jobs) increased significantly from 5% of all consumed XD SUs in 2011 to 21% in 2017. In addition, the number of single node jobs accounted for more than 60% of all jobs in 2017, and while this percentage fluctuates from year to year, it is typically greater than 50% of all jobs run (excluding OSG jobs). In 2017, serial jobs comprised at least 14% of XD SUs consumed by all single node jobs.
Most parallel jobs efficiently use all or almost all of the allocated CPU cores. However, nearly one-third of single node jobs utilize only one or two cores on TACC STAMPEDE and accordingly throughput may be improved through node sharing for single node jobs.
Many of the most heavily utilized HPC community applications have modest per core memory requirements of less than 2 GB per core (primarily because they were designed that way). However, many custom (user built) codes, which can arguably be associated with innovative usage, require substantially more than 2 GB per core to run.
The method of parallelism employed by most HPC jobs is primarily based on multiple single-threaded processes as opposed to multi-threading. Early data from TACC STAMPEDE2 suggests that a significant proportion of jobs are running 32 or fewer processes/threads per compute node and may not yet be making optimal use of its multicore architecture.
Lustre file system usage as measured by reads, writes and file opens rates is independent of job size. Total daily aggregate Lustre reads and writes are approximately equal for most resources.
Job batch script exit codes are not a reliable indication of true job failure since there is no enforced standard for error reporting. The “node fail” exit code (indicating a scheduler or compute node failure) is a reliable indicator of job failure as shown by regression models.
6.1 Application Usage
The application classification uses pattern matching of the job’s executable against a reference set of known executables (the methodology is described in detail in Appendix A.5.6). Only the most common applications are currently recognized (which are generally open source packages in widespread use) [xdmod-bw2016]. Note that proprietary applications are intentionally masked (as many have licensing agreements that discourage comparative performance evaluation). Approximately 39% of consumed XD SUs were able to be characterized by application name in this study period (an additional 15.6% were captured but not matched to known applications, and the balance were not collected, mainly due to resources not yet supporting collection).
It should be noted that some of the applications shown in Figures (a)a-(b)b are not “applications,” but fall into other categories (see Appendix A.5.6). In particular, charm++ is a parallel programming system that is used by applications (NAMD is one), and therefore much of its usage should actually be ascribed to other applications that are not yet recognized as such by our pattern recognition system. Similarly, python refers to the Python language interpreter, and the underlying application being run by the interpreter is not yet being captured by our system.
6.2 Application Memory Usage
The average memory used per core weighted by core-hour for the top 20 applications based on usage as given in Figure (a)a is shown in Figure (a)a. Only the top three applications substantially exceed 1 GB; the remainder range from 0.4 GB to 1.05 GB per core. Figure (b)b shows the average memory used per core weighted by core-hour for the top 20 applications with the largest average memory usage. Here the average per core memory lies in the range of 1.3 GB to 3.1 GB per core. Note that only two applications, CACTUS and ENZO, are present on both of these lists, that is, are both a top 20 application by core-hour usage and a top 20 application by average memory usage.
The average memory usage by application on a given resource over time is remarkably constant throughout the resource lifetime. Figure 89 shows the average memory usage weighted by core-hour of the top 10 applications by core-hour usage (see Figure (a)a) for TACC STAMPEDE throughout its lifetime, for which the longest and best memory usage data is available. Of the 10 applications in the figure, nine show no systematic memory trend for TACC STAMPEDE throughout its lifetime. Only charm++ shows any trend. In the first year, 2013, the memory usage for charm++ was higher than for the final three years, 2014-2017, during which it was very flat. Hence, the individual applications follow the trend already noted for overall memory usage by each resource, see Figure 38.
6.3 Application Interconnect and I/O Use
Figure 90 shows the average high-performance interconnect use (receives only) by application, weighted by the node-hours consumed by that application. The average bandwidth used shown in Figure 90 is per job and then averaged over all jobs (weighting by node-hours consumed), therefore burst rates are undoubtedly higher. Transmits are not shown (mainly due to historical collection issues), therefore one should assume roughly double the bandwidth usage overall, in the case of symmetric bidirectional use (which we examine in more detail below).
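The node-hour weighting used for the per-application averages in Figure 90 is a straightforward weighted mean over jobs. A sketch follows, with an illustrative record layout.

```python
def weighted_avg_bandwidth(jobs):
    """Average per-job interconnect receive rate, weighted by the
    node-hours each job consumed, as described for Figure 90.

    `jobs` holds (avg_rate_MB_s, node_hours) pairs for one
    application; the tuple layout is illustrative.
    """
    total_weight = sum(nh for _, nh in jobs)
    if total_weight == 0:
        raise ValueError("no node-hours recorded")
    return sum(rate * nh for rate, nh in jobs) / total_weight

# A long, communication-heavy job dominates a short, light one:
jobs = [(1000.0, 90.0), (100.0, 10.0)]
assert weighted_avg_bandwidth(jobs) == 910.0
```

The weighting means that large, long-running jobs dominate the reported averages, which is why per-sample burst rates can sit well above the weighted mean.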
The averages shown in Figure 90 need to be taken in context. First, the average per job shows only receives, sampled over relatively coarse time intervals (ten minutes), therefore smoothing over burst rates. Second, the distribution of averages is not a normal distribution. In particular, if we examine the distribution of average interconnect use for some popular applications that show relatively high interconnect use, we see highly variable rates. Figure 107 shows the distribution of average interconnect use per job for the popular molecular dynamics code AMBER, the quantum chemistry package NWCHEM, the electronic structure package Q-ESPRESSO and the weather research forecasting package WRF on TACC STAMPEDE, for several specific usage categories.
Given that the interconnect sampling interval was ten minutes, the bandwidth burst rates are likely to be considerably higher.
Figures (a)a, (e)e, (i)i, and (m)m show the average interconnect receive rate per job and per node, for the applications AMBER, NWCHEM, Q-ESPRESSO, and WRF, respectively. We can also examine the maximum sampled rate per job to see if the maximum value differs significantly from the average. Figures (b)b, (f)f, (j)j, and (n)n show both the maximum sampled rate (in green, now showing the sum of receive and transmit rather than just receives) and all of the available samples (black), for each job, per node, again for the same applications on TACC STAMPEDE.
Note in Figures (b)b, (f)f, (j)j and (n)n that the same trends shown for the averages per job (Figures (a)a, (e)e, (i)i, and (m)m) are present, and that indeed the combined receive and transmit values are roughly twice the receives (indicating that the non-Lustre bandwidth use is roughly symmetric). The maximum bandwidth per job (shown in green in Figures (b)b, (f)f, (j)j and (n)n) is only a bit higher than the largest average per job (roughly 20% for WRF). One should bear in mind the sample rate is ten minutes, however, so burst rates may indeed be higher still within the sample window. Also shown in Figures (a)a, (e)e, (i)i, and (m)m are the average Lustre transmits and receives, again showing high variability. This variability is also present in file opens, which are shown in Figures (c)c, (g)g, (k)k, and (o)o.
We compare the parallel filesystem usage to the non-Lustre inter-compute node communication data for selected applications in Figures (d)d, (h)h, (l)l and (p)p for TACC STAMPEDE. TACC STAMPEDE and all of the other HPC resources considered here use an InfiniBand high speed interconnect and a Lustre parallel filesystem. The parallel filesystem data is transferred between the compute nodes and filesystem nodes over the InfiniBand network. We estimate the inter-compute node communications by subtracting the Lustre data from the InfiniBand data. This is labeled as “non-Lustre IB” in the figures. The correlation between the non-Lustre InfiniBand network activity and I/O shows only a very modest interaction. The correlation is quite dependent on the individual applications as shown by Figures (d)d, (h)h, (l)l and (p)p. AMBER has virtually no correlation, NWCHEM and WRF have some interaction while Q-ESPRESSO has substantial interaction.
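The non-Lustre estimate described above is a simple per-sample subtraction, and the application-dependent interaction can be gauged with an ordinary Pearson correlation. A sketch with made-up sample values; the clamping at zero is an illustrative choice since the two series are sampled independently and the difference can go slightly negative.

```python
import math

def non_lustre_ib(ib_samples, lustre_samples):
    """Estimate inter-compute-node traffic by subtracting Lustre
    filesystem traffic from total InfiniBand traffic, sample by
    sample, clamping at zero (the series are sampled independently)."""
    return [max(ib - fs, 0.0) for ib, fs in zip(ib_samples, lustre_samples)]

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, used here to gauge how
    strongly an application's communication tracks its I/O."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)
```

An application like Q-ESPRESSO would show a correlation well above zero between its non-Lustre IB and Lustre series, while AMBER's would sit near zero.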
6.4 Summary: Applications
Of identified, non-proprietary applications, the top five in terms of consumed XD SUs are NAMD, GROMACS, CACTUS, LAMMPS, and WRF. Most of the identified applications with the greatest utilization use less than 1.5 GB per core, with the majority falling in the range of 0.4–1.1 GB per core. The average per core memory for the top twenty applications in terms of memory use lies in the range of 1.3 GB–3.1 GB per core. The average memory usage by application is relatively constant throughout the resource lifetime. Applications show a wide range of interconnect usage, with the more intensive applications averaging more than 1 GB/s of unidirectional bandwidth and bursting considerably higher. The data indicates, at least within the sampling intervals, that applications are not bandwidth limited. Application I/O usage also shows a wide range of usage patterns.
7 Science Gateways
Goals Addressed in Section
Are there important differences among job types, e.g., interactive jobs, gateway jobs, etc.?
What are the characteristics of gateway jobs?
Are the parallel jobs simply ensembles (many independent jobs)?
What is the gateway job distribution by resource?
How many new (unique) users are using gateways?
Are the usage patterns of gateway users significantly different than traditional HPC users?
How does gateway utilization and growth differ by discipline?
Science Gateways are defined on the XSEDE User Portal Science Gateway page (https://portal.xsede.org/science-gateways) as “Customized portals granting members access to HPC applications, workflows, shared data and other services. XSEDE Gateways unite communities of like-minded members, whether united by discipline or other criteria.” In this report, we will examine the characteristics of Science Gateways jobs, usage and users showing how they are unique and in what ways they are similar to their traditional HPC analogs.
The methods used for collecting Science Gateway usage data have evolved over the lifetime of XSEDE and there have been multiple methods used to report on Gateway usage as explained in Appendix A.4. Table 19 shows the deployment period (through 2017-09-30) for each Gateway along with other information such as the number of jobs submitted and CPU hours consumed. Gateways shown in boldface are active at the time of writing. For the purposes of this report, we will present Gateway usage via jobs submitted by Community User accounts associated with a particular Gateway where possible as this closely aligns with XSEDE’s definition of a Science Gateway. Figure 108 shows that, with a few exceptions, the number of Gateway jobs tracked using Community User submissions aligns closely with other methods described in Appendix A.4.
7.1 Gateways Usage
In addition to applying for an allocation on NSF Innovative HPC Program resources through XSEDE, researchers can utilize the growing number of Science Gateways to carry out their research. Indeed, XSEDE usage by Science Gateways has shown a steady increase over the lifetime of XSEDE in the number of jobs run, XD SUs charged, and the number of Gateways in use, as shown in Figure 109. Since the start of XSEDE, over 1.983M Gateway jobs have been run, while the number of jobs run quarter over quarter has increased more than 7-fold, from 20,000 in Q3 of 2011 to 150,000 in Q3 of 2017, and the number of XD SUs consumed has increased at a similar rate, from less than 10M in Q3 of 2011 to 55M in Q3 of 2017. It is interesting to note that the average job size run by Gateway users has been steadily decreasing since 2014, with a marked drop in 2016-Q3 mainly due to the large number of single-core jobs submitted by the I-TASSER (ZhangLab) Gateway. If we discount the effect of this Gateway, the average job size has shown only a slight decrease between 2014 and 2017.
If we look at the allocations known to be associated with Science Gateways (Figure 110), CIPRES, which supports phylogenetic research, and GridChem, which provides access to a wide variety of computational chemistry programs, were the only active gateways from 2011-07-01 through October 2013. At the end of 2013, several other Science Gateways came online bringing the total number of active gateways to seven in February 2016 although the CIPRES Gateway accounted for most of the jobs run up to this point. Starting in March 2016 the I-TASSER and Neuroscience gateways came online and we see a marked increase in the number of jobs submitted. As of 2017-Q3 there were 14 active Science Gateways submitting a total of 975,000 jobs up to that point.
Figure 111 is similar to Figure 110 but shows the historical trend as a function of Science Gateway rather than individual allocations. Prior to 2016, CIPRES was the dominant gateway based on number of jobs. However, since late 2016, the I-TASSER (ZhangLab) Gateway, which conducts protein folding predictions, has been the leading provider of Science Gateway jobs. Third in this figure is the University at Buffalo’s TAS gateway, which runs application kernels to provide quality of service metrics for XSEDE [simakov2015application]. Since the application kernels are designed to be computationally lightweight, it is reassuring to note that the TAS Gateway does not appear in Figure 112, which shows the top ten Gateways in terms of XD SUs consumed. UC San Diego’s CIPRES Gateway is the largest consumer of CPU cycles, followed by GridChem.
As shown in Figure 113 Science Gateways have submitted jobs to all of the major XSEDE compute resources over the time period of this report with the exception of Keeneland. However, SDSC’s HPC systems have been the primary resources utilized by Science Gateways. Indeed, as the trend lines in Figure 114 show, Gateway usage on SDSC resources is growing steadily while usage is actually decreasing on TACC resources with the retirement of TACC STAMPEDE and increasing only slightly at PSC and LSU. Out of these, SDSC Gateway usage has averaged 7.6% of total available XD SUs while usage at each of the other service providers averages less than 0.5% as shown in Figure 115.
Individual Science Gateways are typically designed to support a particular discipline such as Population Biology (CIPRES Gateway), Chemistry (GridChem Gateway), or Neuroscience (Neuroscience Gateway), with many XSEDE Science Gateways servicing users in the Biological Sciences. From 2011 to 2014, XSEDE Science Gateways supported research in five NSF Directorates (Biological, Mathematical & Physical, Social, Behavioral & Economic, Geosciences, and Engineering), with Biological Sciences accounting for over 60% of the total XD SUs consumed. However, as shown in Figure 116, between 2015 and 2017 the number of directorates serviced has declined, with the Biological and Mathematical & Physical Sciences now accounting for almost 98% of all Gateway XD SUs consumed.
7.2 Gateways Users
A census comparing XSEDE HPC and Science Gateway users, compiled from the data listed in Table 10, is shown in Figure 117. The data shows that, while the overall XD SUs consumed by Gateway users amount to only approximately 3% of the total XSEDE XD SUs, the number of active Gateway users has been increasing at a faster rate than active HPC users. In 2015, the number of active Gateway users surpassed the number of active XSEDE HPC users and has been steadily rising.
The number of HPC users actively running jobs each quarter has steadily increased from 2,075 at the start of XSEDE to 3,432 in 2017-Q3. However, the number of new HPC users added to allocations each quarter dropped between 2012 and 2016, until finally regaining original 2011 levels in 2017. During this period it is significant to note that the number of active Science Gateway users has grown steadily - increasing from 1,983 to 3,590 users - and has shown significant growth between 2016-Q3 and 2017-Q3, topping out at 12,757 users in 2017-Q2. This increase is mainly a result of the I-TASSER Gateway, with over 6,000 active users in 2016-Q4 and over 7,000 active users in 2017-Q1, and is also visible in the increased number of jobs ended shown in Figure 109.
Gateway users are identified by a unique user name, typically an email address, that is unique to each Gateway and is specified by the gateway upon job submission. The same user name on two different Gateways could potentially refer to two different people. It is important to note that data for newly created Gateway users was not reliably collected prior to 2015-Q2 as explained in Appendix A.4, so the new Gateway user values presented here typically represent a lower bound.
|Open HPC Accounts||XDCDB||The number of HPC accounts (i.e., people) that have access to XSEDE resources. They may or may not have run any jobs.|
|Active HPC Accounts||XDCDB||The number of HPC accounts that have run at least one job during the specified period.|
|New HPC Accounts||XDCDB||Newly created HPC accounts.|
|Active Gateway Users||XDCDB/PI polling||Science gateway users who have run at least one XSEDE job via a Gateway in the specified period.|
|Active HPC + Gateway Users||XDCDB/PI polling||Sum of Active HPC Users and Active Gateway Users.|
|New Gateway Users||XDCDB||Unique new Gateway users created in the specified period. Note that this is a lower bound as gateway reporting is incomplete prior to 2015-Q2, as discussed in Appendix A.4.|
To determine whether Gateway users eventually converted to XSEDE users with their own allocations, we cross-referenced XSEDE users and Gateway users by email address, limiting our analysis to users who ran more than 10 XSEDE jobs. As shown in Table 11, a very limited number of users started off using a Science Gateway and then later ran jobs on XSEDE. Indeed, only six users started as Gateway users and went on to run more than 100 XSEDE jobs.
Table 11: Gateway | Gateway User | # GW Jobs | First GW Job | # XSEDE Jobs | First XSEDE Job
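The cross-referencing procedure described above can be sketched as follows. The input layout ((email, timestamp) pairs) and the function name are illustrative and do not reflect the XDCDB schema.

```python
def find_converted_users(gateway_jobs, xsede_jobs, min_jobs=10):
    """Identify Gateway users who later ran their own XSEDE jobs,
    matched by email address, keeping only users with more than
    `min_jobs` XSEDE jobs whose first Gateway job predates their
    first XSEDE job. Inputs are lists of (email, timestamp) pairs;
    the layout is illustrative, not the XDCDB schema.
    Returns {email: (first_gateway_job, first_xsede_job)}.
    """
    first_gw = {}
    for email, ts in gateway_jobs:
        if email not in first_gw or ts < first_gw[email]:
            first_gw[email] = ts

    xsede_counts, first_xsede = {}, {}
    for email, ts in xsede_jobs:
        xsede_counts[email] = xsede_counts.get(email, 0) + 1
        if email not in first_xsede or ts < first_xsede[email]:
            first_xsede[email] = ts

    return {
        email: (first_gw[email], first_xsede[email])
        for email in first_gw
        if xsede_counts.get(email, 0) > min_jobs
        and first_gw[email] < first_xsede[email]
    }
```

Note the ordering test: a user whose XSEDE jobs predate their Gateway use did not "convert" and is excluded, which matters for the small counts reported in Table 11.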
7.3 Gateways Job Mix and Characteristics
In order to determine whether Gateway users utilize HPC resources differently than traditional XSEDE HPC users, either in the types of applications used or the properties of jobs run, we examined the mix of applications submitted by these two categories of users. The top 12 applications by XD SUs consumed by traditional HPC and Science Gateway users were selected and are shown in Figure 118; XD SU usage is on a log scale. Note that this includes only jobs for which we were able to identify the application that was run, which is 55% of the total XD SUs for XSEDE and 19% of the total XD SUs for Science Gateways. The top applications for Gateway users clearly overlap those of traditional HPC users; however, looking at the total XD SUs consumed for XSEDE and Gateways, 99% of the XSEDE HPC XD SUs are consumed by 64 categorized applications while 99% of the Gateway XD SUs are consumed by 27 categorized applications. Figure (a)a shows a cumulative plot of the usage of applications for Gateway and non-Gateway users. Note that the Gateway usage is much more limited in the number of applications employed. Since gateways typically target a particular field of science, a narrower application scope is expected. Note that the XSEDE values include Gateway usage. Similarly, Figure (b)b shows that the Gateway usage encompasses many fewer fields of science.
In addition to the applications utilized by Science Gateway users, we examined the average core count per job and the average run time per job. Note that we have not included jobs run on OSG in this analysis as they are exclusively single-core jobs submitted at a rate of 30M jobs per year, which causes a steep drop in the average job size starting in 2015. We have also excluded jobs run on IU JETSTREAM as many of these are very long-running, low-core-count jobs. Figure 122 (log scale) shows that the average size of non-Gateway jobs (yellow line) is roughly twice that of Gateway jobs (black line), but the Gateway jobs tend to run for longer periods of time (blue line vs red line), roughly a factor of two longer through most of the time span. Starting in Q3 of 2016, when the Zanglab gateway came online, both the Gateway per-job core count and run time dropped sharply.
7.4 Summary: Science Gateways
Science Gateway usage has increased steadily over the lifetime of the XSEDE program, with a 5-fold increase in the number of jobs run in 2017 relative to 2011. Today, the Biological (82%) and Mathematical and Physical Sciences (16%) directorates account for almost 98% of Gateway XD SUs consumed, up from 70% in 2011. While Science Gateways consume only about 3% of the total XD SUs, the number of active Gateway users is growing more rapidly than the number of active HPC users, surpassing it in 2015. The Gateway job mix in terms of applications run and fields of science is narrower than that of HPC users, as expected given the targeted nature of Science Gateways. The average size of non-Gateway jobs is roughly twice that of Gateway jobs; however, Gateway jobs tend to run longer than non-Gateway jobs.
8 Job Submission Patterns & Over-Subscription
Goals Addressed in Section
Are jobs constrained by resource policy limits such as queue length, user limits or node sharing?
How does this vary by resource?
Do these limits affect the analysis?
Are there differences in the job mixes among the resources and if so, how does this impact job throughput?
What is the relative proportion of these jobs between systems?
Is this the result of allocation decisions, or something else we can determine?
How do wait times, throughput and queue length vary among the resources? Has this changed over time?
What is the run-time over-subscription? What is the breakdown by resource and resource type?
8.1 Over-Subscription
In this subsection we address the issue of over-subscription, in which the total demand for resources exceeds the capacity of the existing resources and user jobs are therefore queued waiting for other jobs to end. Figure 123 characterizes the backlog of jobs by looking at the time history of queued jobs in terms of core years (Figure 123.A) as well as analyzing the wait times (mean, median, etc.) of the queued jobs broken out by resource (Figure 123.B). While the backlog of jobs fluctuates substantially over time, the demand for resources clearly outstrips capacity, as is readily evident in Figure 123.A. In terms of wait times, the horizontal box plots in Figure 123.B show that there is wide variation in the wait times among the resources. Particularly noteworthy is the large difference on most resources between the mean wait time (blue dots) and the mean wait time weighted by job core hours (red dots), indicating that large jobs have much longer wait times than small jobs.
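The two wait-time statistics contrasted in Figure 123.B can be sketched as follows. The job record fields here are hypothetical placeholders; the report's actual data model is the XDMoD warehouse.

```python
# Sketch of the two statistics: an unweighted mean wait time, and a mean
# wait time weighted by each job's core hours (cores * run hours). Large
# jobs dominate the weighted mean, which is why the red dots in
# Figure 123.B sit well above the blue dots on most resources.
def mean_wait(jobs):
    return sum(j["wait_hours"] for j in jobs) / len(jobs)

def core_hour_weighted_mean_wait(jobs):
    total_core_hours = sum(j["cores"] * j["run_hours"] for j in jobs)
    weighted = sum(j["wait_hours"] * j["cores"] * j["run_hours"] for j in jobs)
    return weighted / total_core_hours
```

A single large job with a long wait pulls the weighted mean far above the unweighted one, matching the pattern described in the text.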
To address the question of how much larger NSF Innovative HPC Program resources would have to be in order to dramatically reduce wait times, we analyzed the utilization time history and queued job time history of these resources. Figure 124.A shows the node-based utilization time history of each HPC resource along with the maximum nodes available (solid black line). The utilization shown (80-90%) is typical of HPC systems that support large parallel jobs in which the job scheduling software holds nodes idle while waiting for a sufficient number of nodes to become available to allow a job to run. Most job schedulers backfill jobs whenever possible in order to maximize resource utilization while at the same time facilitating the throughput of large parallel jobs. Figure 124.B shows the node utilization for 3 specific resources, namely NICS KRAKEN, TACC STAMPEDE, and SDSC COMET. Both NICS KRAKEN and SDSC COMET are entirely dedicated to XSEDE jobs (black dots), while TACC STAMPEDE is 90% allocated to XSEDE (orange dots).
Since we know the time history of the node utilization and the attributes of the queued jobs (nodes requested, etc.), we can project the number of nodes that would be required to run all queued jobs immediately. This result is shown in Figure 125.A, where the solid black line represents the cumulative number of nodes available and anything over this line indicates additional nodes needed. Figure 125.B shows the same analysis for specific resources, namely NICS KRAKEN, TACC STAMPEDE, and SDSC COMET.
Providing a system with no wait time is not realistic. However, we can also frame our analysis in terms of how large each HPC system would need to be to immediately run a given percentage of the queued jobs. This is presented in Table 12, which shows the number of additional nodes and cores needed for 95% or 99% of the queued jobs on each resource to run immediately. At the 95% level, many of the systems would require about a 10-20% increase in size. Of course, this analysis assumes that the allocations are not increased to take advantage of the increased resource size.
|resource||Nodes, Actual||Nodes to run immediately 95% of jobs1||Nodes to run immediately 99% of jobs1||Cores, Actual||Cores to run immediately 95% of jobs1||Cores to run immediately 99% of jobs1|
|CCT-LSU-SUPERMIC||360||267 (0.74)||433 (1.20)||7200||5300 (0.74)||8640 (1.20)|
|GATECH-KEENELAND||264||424 (1.61)||769 (2.91)||4224||6784 (1.61)||12304 (2.91)|
|LONI-QUEENBEE||668||519 (0.78)||909 (1.36)||5344||4152 (0.78)||7272 (1.36)|
|NICS-DARTER||724||1073 (1.48)||1884 (2.60)||11968||17168 (1.43)||30144 (2.52)|
|NICS-KRAKEN||9408||15042 (1.60)||22665 (2.41)||112896||179388 (1.59)||270924 (2.4)|
|PSC-BLACKLIGHT||2||2.2 (1.08)||3 (1.63)||4096||4432 (1.08)||6672 (1.63)|
|PSC-BRIDGES||752||1034 (1.38)||1579 (2.10)||21056||28953 (1.38)||44200 (2.10)|
|PSC-BRIDGES-GPU||48||54 (1.13)||82 (1.70)||1472||1662 (1.13)||2509 (1.70)|
|PSC-BRIDGES-LARGE||46||34 (0.73)||39 (0.85)||4512||3294 (0.73)||3851 (0.85)|
|PSC-GREENFIELD||3||2.5 (0.83)||3 (1.08)||360||300 (0.83)||390 (1.08)|
|PURDUE-STEELE||893||654 (0.73)||1062 (1.19)||7144||5232 (0.73)||8496 (1.19)|
|SDSC-COMET||1944||2088 (1.07)||2677 (1.38)||46656||50113 (1.07)||64256 (1.38)|
|SDSC-COMET-GPU||72||116 (1.61)||177 (2.46)||1872||3014 (1.61)||4600 (2.46)|
|SDSC-GORDON||1024||1192 (1.16)||1818 (1.78)||16384||18336 (1.12)||28160 (1.72)|
|SDSC-TRESTLES||324||352 (1.09)||534 (1.65)||10368||11255 (1.09)||17094 (1.65)|
|Stanford-XSTREAM||65||20 (0.30)||35 (0.54)||1300||394 (0.30)||696 (0.54)|
|TACC-LONESTAR4||1888||1080 (0.57)||1578 (0.84)||22656||12960 (0.57)||18936 (0.84)|
|TACC-RANGER||3936||4674 (1.19)||6096 (1.55)||62976||74784 (1.19)||97536 (1.55)|
|TACC-STAMPEDE||6400||7344 (1.15)||9435 (1.47)||102400||117504 (1.15)||150960 (1.47)|
|TACC-STAMPEDE2||4200||3803 (0.91)||4917 (1.17)||285600||258604 (0.91)||334352 (1.17)|
1 The value in parentheses is the fraction of the actual number of nodes or cores.
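A rough sketch of the sizing estimate behind Table 12: at each sampling instant, add the nodes in use to the nodes requested by every queued job, then take a percentile of that demand series over time. The sampling scheme and percentile mechanics below are assumptions for illustration; the report's exact method may differ.

```python
# Hedged sketch: estimate how many nodes a resource would need so that a
# given percentage of sampled instants could run all queued jobs at once.
def nodes_needed(busy_nodes, queued_node_requests, pct):
    """busy_nodes[t]: nodes in use at sample t.
    queued_node_requests[t]: list of node counts requested by queued jobs
    at sample t. Returns the pct-th percentile of instantaneous demand."""
    demand = sorted(b + sum(q) for b, q in zip(busy_nodes, queued_node_requests))
    idx = min(len(demand) - 1, int(pct / 100.0 * len(demand)))
    return demand[idx]
```

Dividing the result by the actual node count gives ratios of the kind shown in parentheses in Table 12.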
8.2 Allocations Impact
The previous subsection addressed the question of how much larger a given HPC resource would need to be in order for 95% or 99% of queued jobs to run immediately. A related question is how increasing the awarded allocations would influence hypothetical system sizes; the answer could help inform future system requirements. This is especially pertinent given the decrease in the percentage of requested SUs awarded, from the 40% range in 2011 to the 20% range in 2017, as indicated in Figure 126, which shows the requested and awarded allocations over time. Prior to the introduction of TACC STAMPEDE2, the requested allocations were steadily increasing in terms of both absolute core hours and performance-scaled XD SUs (red lines), while the total core count for all available XSEDE systems remained about the same and the available XD SUs increased only moderately (blue lines).
Given the pent-up demand for CPU cycles shown in this figure, it is natural to ask how large HPC resources would need to be to accommodate a given percentage increase in allocation awards. The trivial answer is that system size should be increased in proportion to the increase in allocation. However, several factors can affect the scaling. First, not all projects use their entire allocation; as shown in Table 1, in aggregate, projects utilized about 89% of their allocation. Second, larger HPC systems have more opportunities to schedule jobs and can therefore improve job throughput. Accordingly, the increase in system size needed to support a targeted increase in allocations might be somewhat smaller.
Whatever new systems of increased size are commissioned, it is always worthwhile to plan to operate them optimally to serve the computational community. There are potential ways to increase allocations by optimizing the utilization of existing systems, with trade-offs to be considered between throughput and wait time. One possibility is to institute over-allocation, a practice common in other industries such as internet service providers. If properly balanced, over-allocation can improve utilization without a significant negative effect on users. This can lead to a compromise in which increased system utilization is accompanied by a modest increase in job wait times. Tools are available to study the impact of over-subscription on system utilization and wait times [Simakov2018_SlurmSim]. Another approach that can allow more effective use of allocations is node sharing for small-core-count jobs, which has been shown in most cases not to impact the performance of the jobs sharing a node [White:2014:ANS:2616498.2616533, Simakov2018_SlurmSim].
8.3 Temporal and Spatial Resource Usage Patterns
We can gain insight into temporal resource usage patterns by analyzing the variation in the job submission queues over time. Here we include only the results for TACC STAMPEDE as the other resources showed similar patterns. Figure 127 is a plot of the time history of the number of jobs submitted to TACC STAMPEDE over the 2013-2017 time range.
In order to detect periodic trends in the data we carried out a Lomb-Scargle analysis of the queue data. A Lomb-Scargle analysis is similar to an FFT analysis except that the exact time of each point can be specified and there is no requirement for uniformly sampled points. Figure (a) shows the Lomb-Scargle periodogram for the TACC STAMPEDE job submission data. There are three main periodic trends detected, namely 1 day, 7 days (1 week), and ~365 days (1 year). Note that there are some other small peaks which are overtones (artifacts), as established by comparison to the autocorrelation function (not shown). The daily variation of TACC STAMPEDE job submission over the 2013-2017 time range is shown in Figure (b). Submission is at its minimum in the early hours of the morning. As users go to work they submit more jobs, and submission grows to a maximum at the end of normal working hours. The submission rate then decreases until the users return to work. The effect is such that the minimum submission rate early in the morning is only a small fraction of the peak value.
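The periodicity detection can be illustrated with a compact pure-Python version of the classic Lomb-Scargle periodogram; this is a minimal sketch on synthetic submission counts with a known 1-day cycle, not the implementation used in the report.

```python
import math

# Classic Lomb-Scargle periodogram (Press & Rybicki form) for irregularly
# sampled data. t: sample times, y: values, omegas: angular frequencies.
def lomb_scargle(t, y, omegas):
    ybar = sum(y) / len(y)
    yc = [v - ybar for v in y]          # mean-centered values
    power = []
    for w in omegas:
        # phase offset tau makes the sine/cosine terms orthogonal
        tau = math.atan2(sum(math.sin(2 * w * ti) for ti in t),
                         sum(math.cos(2 * w * ti) for ti in t)) / (2 * w)
        c = [math.cos(w * (ti - tau)) for ti in t]
        s = [math.sin(w * (ti - tau)) for ti in t]
        cterm = sum(v * ci for v, ci in zip(yc, c)) ** 2 / sum(ci * ci for ci in c)
        sterm = sum(v * si for v, si in zip(yc, s)) ** 2 / sum(si * si for si in s)
        power.append(0.5 * (cterm + sterm))
    return power

# Synthetic, irregularly sampled job-submission series with a daily cycle.
t = [i * 0.13 + 0.01 * math.sin(7 * i) for i in range(400)]   # times in days
y = [100 + 30 * math.cos(2 * math.pi * ti) for ti in t]       # 1-day period
periods = [0.5 + 0.01 * k for k in range(300)]                # candidate periods
power = lomb_scargle(t, y, [2 * math.pi / p for p in periods])
best_period = periods[max(range(len(power)), key=lambda i: power[i])]
```

On real submission data the same scan over candidate periods of roughly 1, 7, and 365 days would reproduce the peaks described above.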
The weekly variation of TACC STAMPEDE job submission over the 2013-2017 time range is shown in Figure (a). Not surprisingly, job submission is smallest during the weekend (both Saturday and Sunday) relative to the weekdays; the weekend value is about half of the peak weekday value. The monthly variation is shown in Figure (b). The most noticeable trend is that submission during the holiday months of December and January is substantially less than during peak months. We have repeated a similar analysis for the other resources; the same general trends are evident, but with somewhat more noise in the plots. Note that a closely related study of periodicity in job submissions [Hart2011b] detected similar trends for the TeraGrid resources.
The resources on which this analysis is based are intended to serve the country as a whole. Figure 134 shows a map of facility usage in millions of XD SUs by state of origin. Not surprisingly, the usage tends to be concentrated in a few of the most populous states. Figure 135 shows a map of per capita usage, that is, usage normalized by state population. The per capita usage is substantially more uniform, although there are still some high- and low-usage states. We can make a second adjustment based on the individual state economies and their reliance on high technology. We would expect that states with economies more dependent on high technology would be better able to take advantage of the facilities that NSF provides. When adjusted for the state technology index, the usage map appears as in Figure 136. The adjusted usage is rather evenly distributed over the various regions of the country. Interestingly, the one feature common to all three maps is that Illinois stands out as relatively bright (high usage).
8.4 User Limits & Throughput
In this subsection we study the effect that queue limits have on job throughput. Table 13 shows the maximum number of queued jobs allowed per user on different resources; only the TACC resources and STANFORD XSTREAM impose limits. To determine what effect, if any, a limit on the number of queued jobs per user has, we analyzed the distribution of the maximum number of queued jobs per user and the number of jobs per user on TACC STAMPEDE and SDSC COMET, one of which has a queue limit (TACC STAMPEDE) and one of which does not (SDSC COMET). The results are shown in Figure 137; the time period covers the production lifetime of each resource. Figure 137.A shows the distribution of maximum queued jobs per user for both resources. While TACC STAMPEDE has a limit of 50 jobs, the vast majority of users do not reach this limit. However, a moderate peak at 50 queued jobs is apparent, confirming the effect of TACC STAMPEDE's queue limit. Interestingly, there is also a faint tail beyond the limit of 50 jobs, which must be attributable to a relaxed constraint for some users. SDSC COMET, which has no limit, has a much more pronounced tail beyond 50 jobs, but otherwise the two distributions are similar. Figure 137.B, which shows the distribution of the number of jobs run by users over the same time period, addresses the potential impact of the queue limit. Based on this figure, we see little if any impact, since most users on both resources run fewer than 50 jobs; SDSC COMET exhibits a more pronounced tail, however.
|Resource||Max Duration, Hours||Max Jobs per User||Max Nodes||Max Cores|
|SDSC Comet GPU||48||8||96|
|PSC Bridges Large||96||250|
|PSC Bridges GPU||48||16|
8.5 Summary: Job Submission Patterns & Over-Subscription
Given current XSEDE allocations, in order for 95% of the backlog of queued jobs to run immediately, current HPC resources would need to be 10-20% larger. Queue limits on the total number of jobs a user is allowed to have queued are imposed only by the TACC resources and STANFORD XSTREAM and were found to have no impact on job throughput; the majority of users do not reach the maximum queue limit in their workflow.
We gratefully acknowledge the support of NSF awards OCI 1025159, ACI 1445806, ACI 1566393, ACI 1763033, and OCI 1203560. This analysis would not have been possible without the support and expertise of the XDMoD development team, including Cynthia Cornelius, Martins Innus, Ryan Rathsam, Jeanette Sperhac, and former members Thomas Yearke, Amin Ghadersohi and Ryan Gentner.
Appendix A Supplementary Data
a.1 Data Sources
XDMoD draws information from a number of data repositories including the XSEDE Central Database (XDCDB), the XSEDE Resource Allocation System (XRAS), the XSEDE Resource Description Repository (RDR), the tacc_stats software, resource manager accounting logs, and the Cray RUR software [Barry:2013]. In many cases, additional data is required that is not available in one of these existing repositories and is provided to XDMoD directly from the source. For example, the XDCDB does not track the hosts allocated to an individual job, so this information comes directly from the resource manager accounting data at the service providers. Open Science Grid (OSG) processes over 30M jobs each year, a higher volume than the XSEDE accounting workflow can currently handle, so OSG provides this data directly to XDMoD. Metrics for cloud and innovative resources such as IU JETSTREAM require detailed event information, so this data is also provided directly to XDMoD by the service providers. Data from these various repositories is brought into XDMoD through an Extract/Transform/Load (ETL) process in which it is cleaned, normalized, and cross-referenced in the XDMoD data warehouse.
|XDCDB||Primary source of job accounting data as well as XSEDE users, organizations, system usernames, allocation charge numbers, gateway usernames, etc.|
|XRAS||Source of XD SUs requested and allocated.|
|RDR||Resource configuration information (nodes, cores, production dates, etc.)|
|tacc_stats||Job performance data for TACC RANGER, TACC LONESTAR4, TACC STAMPEDE, TACC STAMPEDE2, LSU SUPERMIC, SDSC GORDON and SDSC COMET|
|Cray RUR||Job performance data for NICS DARTER|
|Resource Mgr Log Files||Source for the hosts allocated to individual jobs|
|OSG Job Accounting||Detailed accounting data for jobs submitted to OSG|
a.2 Resource Characteristics
This appendix contains architectural and operational characteristics of the resources studied in this report, including configuration, dates of operation, and the data available for analysis. Table 15 lists resources and their composition, and Table 16 includes resource service dates and data collected.
It is important to note that there are two methods by which jobs are submitted to IU JETSTREAM: via the Openstack API and using the Atmosphere portal developed by CyVerse at the University of Arizona (http://www.cyverse.org/), with 31% and 69% of XD SUs submitted via each method, respectively. While we have been able to extract accurate data from the XDCDB for jobs submitted via the Openstack API, accounting data submitted to the XDCDB by Atmosphere is aggregated by user and allocation on a roughly daily basis and groups together all virtual machines (VMs) in the given reporting period. Due to this summarization, we are only able to determine the total number of XD SUs charged to a particular allocation and are unable to determine information such as the number of VMs, the number of cores per VM, or the times that a given VM was running. We have been in contact with the Atmosphere team to obtain more detailed accounting records going forward, but this data was not available at the time of this report.
Figure 138 is a detailed plot showing the dates each resource was in production, along with the time period for which accounting and job level performance data is available. The size of each resource in cores is also given. In addition to job accounting data for almost all of the production systems during the period covered by this report, job level performance data was available for TACC RANGER, TACC LONESTAR4, SDSC GORDON, TACC STAMPEDE, CCT LSU SUPERMIC, NICS DARTER, SDSC COMET, and TACC STAMPEDE2, as indicated in this figure. At the time of this report, the collection of job level performance data has not yet been implemented for the current production systems PSC BRIDGES and IU JETSTREAM. Neither accounting nor job level performance data is available for TACC WRANGLER.
|Resource||Type1||Cores||Nodes||Cores per Node||CPU Clock Rate, GHz||RAM, GiB||GPU/MIC per Node|
|(resource name missing)||HTC||14000||1750||8||Intel Misc||>3.0||0.5-32|
|GATECH-KEENELAND||HPC||4224||264||16||32||3|
|TACC-STAMPEDE||HPC||102400||6400||16||Intel, E5-2680, Sandy Bridge||2.7||32||1, 2||Intel MIC|
|TACC-MAVERICK2||Vis||2640||132||20||2.8||256||1|
|CCT-LSU-SUPERMIC||HPC||7200||360||20||Intel, Ivybridge||2.8||64||2||Intel Xeon Phi 7120P|
|SDSC-COMET-GPU||HPC||1872||72||24, 18||2.5||128||4|
|PSC-GREENFIELD||HPC||360||3||240, 60, 60|
|PSC-BRIDGES-GPU||HPC||1472||48||1||128||4, 2|
|Stanford-XSTREAM||HPC||1300||65||20||Intel Ivy-Bridge||256||8 (16-logical)||K80|
Resource types: HPC - High-performance computing, HTC - High-throughput computing, DIC - Data-intensive computing, Cloud - Cloud resource and Vis - Visualization system.
a.3 XD SU Conversions
Table 17 shows the conversion factors from local service units (local SUs) to XSEDE SUs (XD SUs). Although local SUs are intended to be costing units, they are generally proportional to CPU core-hours (or node-hours in the case of TACC STAMPEDE2 and TACC WRANGLER) but can follow more complex formulas depending on the type of resource, as described by XSEDE allocation policy. For example, IU JETSTREAM defines a local SU as 1 virtual CPU core-hour plus 2 GB of memory.
|Resource||Type||Factor Start||Factor End||Conversion Factor|
|SDSC COMET GPU||HPC||2017-04-12||Current|
|PSC Bridges Large||HPC||2016-01-01||2016-04-28|
|PSC Bridges GPU||HPC||2017-03-14||2017-04-24|
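As a toy illustration of the conversion described above, a JETSTREAM-style local SU (1 virtual CPU core-hour, with 2 GB of memory bundled per vCPU) can be charged and converted as follows. The XD SU conversion factor below is a made-up placeholder, not a value from Table 17.

```python
# Assumption: placeholder conversion factor for illustration only; the
# actual per-resource, per-date factors are listed in Table 17.
HYPOTHETICAL_XDSU_PER_LOCAL_SU = 0.5

def jetstream_local_su(vcpus, hours):
    """1 local SU = 1 vCPU core-hour (2 GB of RAM bundled per vCPU)."""
    return vcpus * hours

def to_xd_su(local_su, factor=HYPOTHETICAL_XDSU_PER_LOCAL_SU):
    """Scale local SUs into XD SUs using a resource-specific factor."""
    return local_su * factor
```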
a.4 Science Gateways
We define Science Gateways as described on the XSEDE User Portal Science Gateway page (https://portal.xsede.org/science-gateways) which states that Science Gateways are “Customized portals granting members access to HPC applications, workflows, shared data and other services. XSEDE Gateways unite communities of like-minded members, whether united by discipline or other criteria.”
Historically, Science Gateway usage has been tracked in multiple ways, with varying degrees of completeness, including
Verbal communication with Science Gateway PIs,
Jobs run under a shared Community User account (one for each Gateway),
Jobs charged to an allocation that a Community User has utilized at some point, and
Jobs run under an allocation that is tagged as a "Science Gateway" grant
The number of active and unique Gateway users has been tracked using two methods
Verbal communication with Science Gateway PIs and
Job attributes that link a Gateway user to a particular job that they submitted (supplied by the Gateway to the XDCDB starting in 2015-Q2)
When examining Gateway jobs, we primarily look at jobs that were run under a Gateway's shared Community User account, as these most closely match the definition of a Science Gateway and best capture Gateway usage by end users. This is supplemented by data collected through verbal communication with Gateway PIs, which is especially useful for examining active and unique Gateway users prior to 2015-Q2.
Jobs run under an allocation that a Community User has utilized at some point cast a larger net and include jobs not submitted by Community User accounts, such as development and preliminary jobs. Comparing jobs submitted via Community User accounts with jobs submitted against an associated allocation that the Community User utilized at some point, we see that, with a few exceptions such as SciGaP and Tera3D, usage through the Community User accounts aligns fairly closely with that through associated allocations, as shown in Table 18.
Allocations that are tagged as a “Science Gateway” in the XDCDB appear to comprise a much broader definition of gateways and are not restricted to portals. In addition, jobs submitted via a Community User account are not always charged to an allocation tagged as a “Science Gateway,” making this method unreliable. For example, the ChemCompute portal (https://chemcompute.org/) charged usage in 2017 to charge numbers TG-CDA160009 and TG-CDA170003 but only TG-CDA160009 is tagged as a "Science Gateway" allocation. Also of note is that many “Science Gateway” allocations are not listed as a Science Gateway on the XSEDE User Portal such as TG-PHY140033 “Cloud Computing on Jetstream for the ATLAS Experiment at the Large Hadron Collider” which appears to have grown out of the ATLAS Connect (https://connect.usatlas.org/) project.
|Gateway||via Community User||via Associated Allocation|
|# Jobs||SUs||# Jobs||SU|
|UCI Social Science Gateway||24,906||25,191||25,076||58,427|
In addition to submitting jobs on behalf of its users using a shared Community User account, a gateway may also provide job attributes to the XDCDB to associate individual Gateway users (typically via an email address) with the jobs run on their behalf. These users are unique within a gateway but cannot be assumed to be unique across multiple gateways. Also note that not all Science Gateways provide job attributes, as shown in Table 19, and no Gateway has complete coverage in this respect. To improve reporting by Science Gateways and facilitate the storing of job attributes in the XDCDB, the XSEDE Cyberinfrastructure Integration (XCI) and XDCDB teams developed the gateway-submit-attributes tool, which allows a gateway to easily associate the local gateway user who submitted a job with that job's record in the XDCDB. The first record of job attributes being recorded using this method is 2015-03-05. This means that any information presented on the number of unique or new Gateway users is a lower bound, and that prior to 2015-Q2 the tracking of unique Gateway users was far less reliable.
It should be noted that in the early stages of a Science Gateway's implementation, job attribute data may not be submitted to the XDCDB. For example, it may take some time for a Gateway to set up the infrastructure for submitting job attributes to track individual users, and a Community User account may not yet have been set up for the gateway. In these cases, the data has been supplemented by verbal communication with gateway PIs by the XSEDE Science Gateways Team, which has provided this data to XMS.
|Gateway||Gateway Deployment||# Jobs||CPU Hrs||w/Attrib(%)|
|First Job||Last Job|
|UCI Social Science Gateway||2013-11-04||2017-09-03||24,906||843||24,421(98%)|
a.5 Job Performance Data
Job performance data was obtained from tacc_stats [evans2014comprehensive] running on the compute nodes of TACC RANGER, TACC LONESTAR4, TACC STAMPEDE, TACC STAMPEDE2, LSU SUPERMIC, SDSC GORDON and SDSC COMET. The performance data from NICS DARTER was obtained from the Cray RUR software [Barry:2013].
The tacc_stats software runs on every compute node and records a large number of metrics. The data are written to an archive file and there is one file per compute node per day. The software is executed at 10 minute intervals using cron and is also called from the job scheduler prolog and epilog scripts to record the metrics immediately before each HPC job begins and just after each job ends. These data are used to generate summary information for each HPC job using the data collected on the compute nodes on which each job ran.
The summarization software uses the accounting data from the resource manager and the compute-node-level archives to generate various metrics for each job. There are two semantic metric types: instantaneous metrics and counter metrics. Examples of instantaneous metrics are memory usage and the number of processes in the O/S run queue. Examples of counter metrics are the number of floating point operations and the number of clock ticks in CPU user mode. The summarization software treats the two metric types differently. For counter metrics, the total change in the counter value is typically used, calculated as the value recorded after the end of the job minus the value recorded immediately before the start. The instantaneous metrics are sub-sampled, so any data derived from them are estimates. The values of any instantaneous metric before job begin and after job end are not used when computing derived data.
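The two summarization rules can be sketched as follows; the record layout (a list of timestamped samples including the prolog and epilog measurements) is an assumption for illustration.

```python
# Counter metrics: total change over the job, i.e. (value recorded just
# after job end) - (value recorded just before job start).
def summarize_counter(samples):
    """samples: list of (timestamp, value), first and last taken by the
    scheduler prolog/epilog hooks, the rest at 10-minute intervals."""
    return samples[-1][1] - samples[0][1]

# Instantaneous metrics: the pre-job and post-job samples are discarded
# and only the in-job 10-minute samples are averaged.
def summarize_instantaneous(samples):
    inner = [v for _, v in samples[1:-1]]
    return sum(inner) / len(inner) if inner else None
```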
In the following sections we describe the job level metrics in more detail.
a.5.1 CPU Usage
The CPU usage data is obtained from the CPU statistics reported in /proc/stat. The various CPU metrics are computed as the ratio of time spent in the different modes (user, system, etc.) to the overall time as reported in the proc file. The CPU metrics are counter metrics, so the value for an HPC job is the value of the counter at the end of the job minus the value at the beginning.
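The ratio computation can be sketched as below: each mode's fraction is the change in that mode's tick counter divided by the change in total ticks over the job. The field order follows the leading columns of /proc/stat (user, nice, system, idle); the sample values are hypothetical.

```python
# Sketch of the CPU metric computation from /proc/stat tick counters
# sampled before the job starts and after it ends.
def cpu_fractions(ticks_start, ticks_end):
    """Tick lists are in /proc/stat order: user, nice, system, idle."""
    delta = [e - s for s, e in zip(ticks_start, ticks_end)]
    total = sum(delta)
    modes = ["user", "nice", "system", "idle"]
    return {m: d / total for m, d in zip(modes, delta)}
```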
a.5.2 Runnable threads
The number of runnable threads is obtained from the procs_running field in /proc/stat. This value is an instantaneous metric so the measurements made at 10 minute intervals during the job are used.
a.5.3 Memory Usage
The memory data is obtained from the /sys/devices/system/node/node*/meminfo files for each NUMA node on the compute node. The value of the memory usage per core at time $t$ is given by

$\mathrm{memory\_used}(t) = \dfrac{\mathrm{MemTotal} - \mathrm{MemFree}(t) - \mathrm{FilePages}(t) - \mathrm{Slab}(t)}{N}$

where $N$ is the total number of CPU cores per compute node, MemTotal is the total amount of memory, MemFree is the amount of unused memory, FilePages is the amount of memory used by the kernel page cache, and Slab is the amount of memory used by the Linux kernel data structures cache.
The memory metrics are instantaneous metrics so the measurements at the beginning and end of the job are not used to compute the job summary values. The average memory usage is the mean value of all of the measurements. The maximum memory usage is the value of the largest measurement. All of the memory metrics are estimates of the actual usage since the memory statistics are only sampled at 10 minute intervals during a job.
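The per-core estimate and its mean/max summarization described in the text can be sketched as follows; the meminfo values are hypothetical and given in the same units throughout.

```python
# Sketch: subtract the page cache and kernel slab from used memory and
# divide by the cores per node, then summarize the 10-minute samples.
def mem_per_core(meminfo, cores_per_node):
    used = (meminfo["MemTotal"] - meminfo["MemFree"]
            - meminfo["FilePages"] - meminfo["Slab"])
    return used / cores_per_node

def summarize_memory(samples):
    """Average and maximum over the in-job instantaneous samples."""
    return {"avg": sum(samples) / len(samples), "max": max(samples)}
```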
a.5.4 Lustre Filesystem
All of the resources that run tacc_stats use the Lustre parallel filesystem. The metrics for Lustre are obtained from the /proc/sys/lnet/stats file; the tx_bytes and rx_bytes values are used for the data transmit and receive metrics, respectively.
a.5.5 InfiniBand
All of the resources use InfiniBand. The ibmad library is used to query the port statistics on the InfiniBand switch.
a.5.6 Application Classification
The application associated with a job is derived from the executable path information. If Lariat [Lariat2013] or XALT [Agrawal:2014:UET:2691136.2691140] information is present then this is used as the source of the executable information. (Lariat is available on TACC RANGER, TACC LONESTAR4 and TACC STAMPEDE; XALT data is available on TACC STAMPEDE and TACC STAMPEDE2, however Lariat is the source for the data presented in this report.) If Lariat or XALT data is absent then the process information from tacc_stats is used. If neither Lariat, XALT, nor tacc_stats data is available then the application is marked as “NA”.
Lariat and XALT provide the path name of the main parallel process in the job. tacc_stats does not have that information and instead records the list of processes running on the compute nodes (ignoring common system daemons). When this process data is used, there are potentially multiple different processes to choose from, some of which may be software run for setup or tear-down steps rather than the main job step. To identify the main step of the job, the classification algorithm records the number of unique process ids (PIDs) for each process name and checks them from most to fewest until it finds a process name that is not in the ignore list. The ignore list contains common unix processes such as cp and bash. This algorithm therefore typically picks out the main parallel processing step in an HPC job rather than, say, the data copy-in or copy-out steps.
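The selection heuristic just described can be sketched as follows; the ignore list and process records here are illustrative, not the actual tacc_stats configuration.

```python
# Sketch of the main-step heuristic: count unique PIDs per process name,
# then walk names from most PIDs to fewest, skipping common utilities.
IGNORE = {"cp", "bash", "sh", "tar", "gzip"}  # illustrative ignore list

def main_process(processes):
    """processes: list of (name, pid) tuples observed on the job's nodes."""
    pids = {}
    for name, pid in processes:
        pids.setdefault(name, set()).add(pid)
    for name in sorted(pids, key=lambda n: len(pids[n]), reverse=True):
        if name not in IGNORE:
            return name
    return None
```

A parallel application typically spawns one process per rank, so the non-ignored name with the most unique PIDs is a reasonable proxy for the main step.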
Once the main executable is identified, this string is checked against a reference database of known community applications. The reference database contains 188 known applications and several application categories, such as debugging tools (e.g., ddd, gdb) and interactive session commands (e.g., xterm). Applications are identified by regular expressions. If no application is matched then the job is placed in the “uncategorized” category. A machine learning based study was previously carried out on the “uncategorized” jobs [Gallo2015]; a model was developed that was able to classify jobs with better than 90% accuracy. It was found that 80-90% of the “uncategorized” jobs were in fact custom user code and only 10-20% were community applications that had been missed by our regular expression classification scheme.
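The regular-expression matching step can be sketched as below. The patterns are illustrative stand-ins for the 188-entry reference database, not its actual contents.

```python
import re

# Illustrative pattern table: (application label, compiled regex) pairs
# checked in order against the executable path.
PATTERNS = [
    ("namd", re.compile(r"namd", re.IGNORECASE)),
    ("gromacs", re.compile(r"gmx|mdrun", re.IGNORECASE)),
    ("debug-tool", re.compile(r"\b(gdb|ddd)$")),
]

def classify(executable_path):
    """Return the first matching application label, else 'uncategorized'."""
    for app, pattern in PATTERNS:
        if pattern.search(executable_path):
            return app
    return "uncategorized"
```

Jobs that fall through to "uncategorized" are, per the study cited above, mostly custom user code rather than missed community applications.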