Adaptive parallelism with RMI: Idle high-performance computing resources can be completely avoided

01/22/2018
by   Florian Spenke, et al.


Abstract

In practice, standard scheduling of parallel computing jobs almost always leaves significant portions of the available hardware unused, even with many jobs still waiting in the queue. The simple reason is that the resource requests of these waiting jobs are fixed and do not match the available, unused resources. However, with alternative but existing and well-established techniques it is possible to achieve a fully automated, adaptive parallelism that does not need pre-set, fixed resources. Here, we demonstrate that such an adaptively parallel program can indeed fill in all such scheduling gaps, even in real-life situations on large supercomputers.

keywords

adaptive parallelism, malleable parallelism, scheduling, genetic algorithms, non-deterministic global optimization

1 Introduction

Traditional job scheduling for high-performance computing (HPC) machines requires fixed, pre-set resources (amount of memory, number of CPU cores or compute nodes, maximum required time) for each job. These settings are provided by the user, and most of them cannot be changed after job submission, neither by the scheduler nor by the user. With a realistic job mix in real-life situations, this leads to a total machine load substantially smaller than 100%, in at least two generic situations:

  1. during standard operation, compute nodes remain unused because even the smallest jobs in the queue request more nodes than are currently free; hence, typically, an HPC installation is already considered “full” at average loads of 90%;

  2. before huge jobs requesting a big portion (50%-100%) of the whole machine can run, a correspondingly big portion of the machine has to be drained completely; during this time, no new jobs can start that request more time than is left until the scheduled start of the big job; typically, this can cause the load to drop towards zero for extended periods of several hours.

Idle computing resources are not productive in terms of producing results, but they still cost real money. Electricity consumption of the actual hardware and of the periphery (cooling) may be smaller in idle mode, but is not zero – and should not be zero: In scenario (2) mentioned above, if the huge job finally starts, heat dissipation is instantly required, and then the cooling should already be up and running.

As the size of installations grows, the amount of computing performance left over under these conditions is non-negligible and may in itself be sufficient to run computationally intensive large-scale simulations. Hence, parallel jobs that can be adapted in their resource consumption during runtime are highly desirable to fit into this reality, from the perspectives of both the HPC users and the HPC centers.

How such adaptively parallel jobs change the (non-trivial) task of optimal job scheduling has been investigated theoretically [1, 2, 3, 4], but so far surprisingly few instances of real-life adaptively parallel applications have been demonstrated: indeed, we found only one example [5]. This previous realization of a malleable job setting achieves several objectives: for four different, small but real-life-like applications (including a simplified molecular-dynamics code), both shrinking and expanding the needed resources on the fly was successfully demonstrated. However, to achieve malleability, several additional layers of software (including self-written code) had to be inserted between the application and the (modified/extended) scheduler, in addition to (small-scale) instrumentation of the application code itself.

In contrast, in our implementation, the external dependencies are minimal. In fact, most of the implementation solely relies upon standard components of the Java virtual machine; only the interaction with the scheduler has to be adapted to the actual scheduler present. This clearly maximizes portability. With several application examples on different HPC installations, we show that such transfers between different scheduler/queueing systems are easily possible.

The paper is organized as follows. Our new algorithm is described in section 2, the main experimental results are in section 3, and the conclusions follow in section 4.

2 Algorithm

For the original, detailed discussion of our RMI-based resilient parallelization, we refer to Ref. [6]. In the following, we hence restrict ourselves to a short discussion of the algorithmic features and details relevant to this work.

A server process maintains an internal queue of tasks to be accomplished, is responsible for receiving intermediate results, and combines them into the final result. This process requires very few resources, so it is often started on infrastructure nodes. Any number of client processes can attach to the server process at any point to obtain a list of tasks to work on and, upon completing them, return the intermediate results to the server. Server and clients maintain heartbeat connections to ensure that client malfunctions caused by, e.g., hardware failures are handled gracefully by the server, and that a server shutdown causes all clients to shut down reliably on their own. Excellent parallel scalability up to 6144 cores was demonstrated for evolutionary algorithms [6].
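
To make this structure concrete, here is a minimal sketch of such a server in plain Java RMI. It is not the OGOLEM implementation: the TaskServer interface name, the representation of tasks as plain IDs, and the single-number intermediate result are simplifications introduced here purely for illustration.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical remote interface: clients fetch chunks of task IDs and return results.
interface TaskServer extends Remote {
    List<Long> fetchTasks(int chunkSize) throws RemoteException;
    void submitResults(List<Long> taskIds, double bestValue) throws RemoteException;
    void heartbeat(String clientId) throws RemoteException;
}

// Minimal server: keeps an internal task queue and combines intermediate results.
class TaskServerImpl extends UnicastRemoteObject implements TaskServer {
    private final ConcurrentLinkedQueue<Long> queue = new ConcurrentLinkedQueue<>();
    private double globalBest = Double.MAX_VALUE;

    TaskServerImpl(long nTasks) throws RemoteException {
        for (long i = 0; i < nTasks; i++) queue.add(i);
    }

    @Override
    public List<Long> fetchTasks(int chunkSize) {
        List<Long> chunk = new ArrayList<>();
        Long id;
        while (chunk.size() < chunkSize && (id = queue.poll()) != null) chunk.add(id);
        return chunk;  // an empty list signals: nothing left to do
    }

    @Override
    public synchronized void submitResults(List<Long> taskIds, double bestValue) {
        if (bestValue < globalBest) globalBest = bestValue;  // combine into the final result
    }

    @Override
    public void heartbeat(String clientId) {
        // a real implementation would timestamp the client here and
        // requeue its outstanding chunk if the heartbeat stops arriving
    }

    public static void main(String[] args) throws Exception {
        Registry registry = LocateRegistry.createRegistry(1099);
        registry.rebind("taskServer", new TaskServerImpl(1_000_000L));
        System.out.println("Server up; clients may attach and detach at any time.");
    }
}
```

Because all bookkeeping lives on the server, a client can disappear at any moment without corrupting the global state.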

Obviously, the ability to attach and detach client processes trivially at job runtime lends itself naturally to adapting a job to the available computing resources, i.e., to using it as a fill-in job. We can hence use the scheduling leftovers of ordinary, statically shaped jobs in an efficient manner. As resources become available, clients are started and attach to the server. As resources are needed for static jobs, clients can be killed by the queuing system.

To achieve this goal of completely filling the machine load, the fill-in jobs need to be treated differently by the queuing system than ordinary jobs. For an undisturbed workflow of ordinary jobs, it is essential that running fill-in jobs can be cancelled to free computing resources. Luckily, this is a standard feature of modern scheduling systems. Without excessive, costly, and usually inflexible checkpointing, an ordinary job suffers a substantial loss of computing time upon sudden cancellation, so this feature often remains unused. In contrast, in our setup, the server-client dialogue is built on task/result chunks that are communicated frequently and whose size and frequency can be controlled by the user. Additionally, computing time lost to client cancellation can be rescheduled or, if possible (as in the present evolutionary algorithm (EA) applications), simply dropped. Hence, losses upon cancellation can be kept minimal, and the balance between chunk size, communication frequency, and computations that may have to be redone can be adapted transparently by the user to the expected interruption frequency.
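
As an illustration of this chunked dialogue, a matching client loop (again a simplified sketch against the hypothetical TaskServer interface above, not the OGOLEM code) could look as follows; since results are reported only after a complete chunk, a pre-empted client loses at most one chunk of work.

```java
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.util.List;
import java.util.UUID;

// Hypothetical client loop matching the TaskServer sketch above.
public class FillInClient {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";
        Registry registry = LocateRegistry.getRegistry(host, 1099);
        TaskServer server = (TaskServer) registry.lookup("taskServer");
        String clientId = UUID.randomUUID().toString();
        int chunkSize = 64;  // user-tunable: trades communication frequency against work lost on cancellation

        while (true) {
            List<Long> chunk = server.fetchTasks(chunkSize);
            if (chunk.isEmpty()) break;                 // queue exhausted: shut down on our own
            double best = Double.MAX_VALUE;
            for (long task : chunk) {
                best = Math.min(best, evaluate(task));  // the actual computation
            }
            server.submitResults(chunk, best);          // only completed chunks are reported; a chunk
                                                        // lost to pre-emption is simply redone or dropped
            server.heartbeat(clientId);
        }
    }

    // placeholder for the real objective function (e.g., a cluster energy evaluation)
    private static double evaluate(long task) {
        return Math.random();
    }
}
```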

In order to employ realistic fill-in jobs, we performed global minimum-energy structure optimizations of water clusters with them, using the TIP4P water model [7], in most cases at the largest cluster size treated for this system and this task in studies published so far [8].

3 Experimental results

3.1 Local hardware

As a first real-life test, we checked the queue-fill-in capabilities of our RMI setup on the heterogeneous local hardware of the Hartke group. The following machines were involved in the test shown in Fig. 1: 4 nodes with 8 cores of type AMD Opteron 2358 SE Quad-Core each, 2 nodes with 28 cores of type Intel Xeon E5-2680 v4 each, 3 nodes with 12 cores of type Intel Xeon X5675 each, 2 nodes with 48 cores of type AMD Opteron 6172 each, 1 node with 8 cores of type Intel Xeon X5355 Quad-Core, 1 node with 12 cores of type AMD Opteron 2427 Six-Core, 1 node with 36 cores of type Intel Xeon Gold 6154, and a login node with 8 cores of type Intel Xeon E5-2620. All nodes run openSUSE 42.x. For all calculations, a Java Runtime Environment (JRE) version 1.8.0_73 was used. The nodes are connected via 2× 1 Gbit/s Ethernet links aggregated with LACP.

The scheduling system used for batch processing is SLURM [9]. It is configured to distribute the existing resources on a first-come, first-served basis. To accommodate the fill-in jobs, a new partition was introduced. Jobs started in this partition have low priority, i.e., they are only eligible to start if no job in another partition can use the existing free resources. These low-priority jobs are also pre-emptible, i.e., they can be aborted by the scheduler before they reach their currently set walltime. The scheduler regards the resources occupied by jobs in this low-priority, pre-emptible partition as free for jobs in other partitions. If these resources are needed, low-priority jobs are cancelled and subsequently requeued by the scheduler.
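
The scheduler-side changes are standard SLURM configuration; the exact settings used on this cluster are not reproduced in the text, but a partition-priority preemption setup of this kind could, for example, look like the following slurm.conf fragment (node names and priority values are placeholders, not the actual configuration).

```
# global preemption policy: jobs in lower-priority partitions may be preempted and requeued
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

# ordinary partition: higher priority tier, never preempted
PartitionName=normal  Nodes=node[01-14]  Default=YES  PriorityTier=10  PreemptMode=OFF

# fill-in partition: lowest priority tier, jobs are requeued when preempted
PartitionName=fillin  Nodes=node[01-14]  Default=NO   PriorityTier=1   PreemptMode=REQUEUE
```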

To test the adaptability of the backfill jobs against a strongly varying background of normal (fixed-resource) jobs, numerous short-lived jobs were added to the latter category. The normal jobs running during the test consisted of photodynamics simulations with MOPAC and global optimizations with OGOLEM (in shared-memory mode). The RMI server was started on the login node. As backfill jobs, 40, 20, and 30 RMI clients requesting 1, 4, and 8 CPU cores, respectively, were used to fill the 276 available cores. None of the normal jobs was hindered or affected in any way by the backfill jobs, and no manual guidance of the fill-in took place. Fig. 1 shows a typical resulting breakdown of the total CPU load into these two categories (normal jobs, fill-in jobs).

Figure 1: Queue fill-in at the Hartke workgroup computing cluster; normal jobs with fixed resources in blue, RMI-backfill-jobs in green. Note that the normal job set (blue) contains a random admixture of many short-lived jobs, leading to strong, irregular, high-frequency load oscillations. Nevertheless, the RMI-backfill is able to keep the overall load close to 100% at all times.

Obviously, despite a strongly varying load of standard jobs, the fill-in automatically tops off the available hardware to maximum load, at all times.

3.2 University computing center

A similar queue fill-in was demonstrated on a homogeneous subcluster of the university computing center. This cluster consisted of 7 nodes with 40 cores of type Intel Xeon E7-4820 (10-core) each, i.e., 280 CPU cores overall, plus a login node with 8 cores of type Intel Xeon E5-2640. All nodes run CentOS Linux 7.x. For all calculations, a JRE version 1.8.0_101 was used. The nodes are connected via InfiniBand.

The scheduling system used is SLURM. The changes to the scheduler are nearly identical to those described in section 3.1: again, low-priority jobs are started in a separate partition and can be cancelled by the scheduler. In contrast to the scheduler setup described above, cancelled jobs are not requeued. A fair-share policy is used on this cluster to regulate the start of jobs competing for the same resources. This does not affect the fill-in jobs, since they are low-priority and hence do not compete for resources.

The test of the backfill jobs on this hardware was again accompanied by an artificial background of short-lived jobs in the main partition of the scheduler. The RMI server was started on the login node. Scripts were used to keep at least 5 backfill jobs of size 1, 4, and 8 CPU cores each queued and eligible to start. Fig. 2 shows the CPU load during the test.

Figure 2: Queue fill-in at the CAUcluster; normal jobs with fixed resources in blue, RMI-backfill-jobs in green.

3.3 Regional HPC hardware

A core application case of this work is a demonstration that our flexible RMI parallelization allows us to completely fill even the biggest load gaps on TOP500-class HPC installations, during normal operation and without any impact on other jobs. For this demonstration, we have chosen the “Konrad” cluster in Berlin, which is part of the North-German Supercomputing Alliance (HLRN) [10] installations in Berlin and Hannover, serving hundreds of HPC users in the North-German states Berlin, Brandenburg, Bremen, Hamburg, Mecklenburg-Vorpommern, Niedersachsen, and Schleswig-Holstein. In November 2017, it ranked 182nd on the TOP500 list [11].

For the demonstration case in this subsection, we used the MPP1 section of “Konrad”, a Cray XC30 consisting of 744 Intel Xeon IvyBridge compute nodes. Each node contains 24 CPU cores (in two Intel Xeon IvyBridge E5-2695v2 2.4 GHz CPUs). Hence, this MPP1 section comprises a maximum of 17,856 CPU cores and has a theoretical peak performance of 342.8 TFlop/s.

As in the tests described above, to allow our fill-in jobs at the HLRN, a new class for pre-emptible, low-priority jobs was set up in the Moab scheduler running there. According to HLRN policies, the number of jobs is restricted via class settings, in our case to 24 simultaneously running jobs.

Additionally, we employed dynamically assigned resources: during a fill-in run, we submitted jobs at one-minute intervals, tailored to fit the resources available at that moment until the next reservation of a regular job.

Independent RMI-client processes were then started on each node. The RMI-server process was started on one of the scheduling nodes of the HLRN.

The dynamically sized fill-in jobs can easily fill any free nodes. However, some restrictions arise from HLRN user policies and scheduler limitations. The optimal case would be one fill-in job per node: in the event of a newly started ordinary job, exactly the needed number of nodes could then be freed, and the new job could be started on them. If the cancelled fill-in jobs are bigger than the new regular job, a portion of the freed nodes remains unused and needs to be refilled with fill-in jobs.

To adapt to the limited job number imposed by HLRN user policies, we implemented an additional, automated job monitoring: if this limit was about to be exceeded, e.g., because another regular job finished execution, the smallest running fill-in job was cancelled and a new, larger one was started on the combined free nodes. This once more exploits the high adaptability built into our concept.
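
Schematically, the per-minute driver amounts to the following logic. This is a sketch against hypothetical scheduler wrappers; the actual interaction with Moab happened through small site-specific helper scripts that are not reproduced here.

```java
import java.util.Comparator;
import java.util.List;

// Schematic of the per-minute fill-in driver logic; everything here is a sketch
// against hypothetical scheduler wrappers, not the actual HLRN/Moab interface.
public class FillInDriver {

    record FillInJob(String id, int nodes) {}            // a currently running fill-in job

    interface Scheduler {                                 // hypothetical wrapper around scheduler queries
        int freeNodesUntilNextReservation();              // nodes idle until the next regular-job reservation
        List<FillInJob> runningFillIns();
        void submitFillIn(int nodes);                     // submit one pre-emptible fill-in job of this size
        void cancel(FillInJob job);
    }

    static final int JOB_LIMIT = 24;                      // class limit on simultaneously running fill-in jobs

    // called roughly once per minute
    static void step(Scheduler scheduler) {
        int freeNodes = scheduler.freeNodesUntilNextReservation();
        if (freeNodes <= 0) return;                       // nothing to fill right now

        List<FillInJob> running = scheduler.runningFillIns();
        if (running.size() < JOB_LIMIT) {
            // below the job limit: submit one job tailored to the currently free nodes
            scheduler.submitFillIn(freeNodes);
        } else {
            // job limit reached: cancel the smallest fill-in job and resubmit
            // one covering its nodes plus the newly freed ones
            FillInJob smallest = running.stream()
                    .min(Comparator.comparingInt(FillInJob::nodes))
                    .orElseThrow();
            scheduler.cancel(smallest);
            scheduler.submitFillIn(freeNodes + smallest.nodes());
        }
    }
}
```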

Figure 3 illustrates this fill-in on the HLRN hardware, both during normal operation and during the machine-draining period before a huge, machine-wide job starts, i.e., this test covered both scenarios mentioned at the beginning of the introduction. Additionally, from t = 0 to t ≈ 200 minutes, a portion of the nodes was reserved for test jobs by the HLRN, so the maximally fillable node count is slightly lower than during the remaining time. However, our setup is sufficiently flexible to also cope with such changes, without manual intervention.

Figure 3: Queue fill-in at the HLRN computing center. A huge job (filling all of the machine) starts at t = 1010 minutes (visible only as a narrow red line, because it finished again very quickly). In preparation for this event, the normal job load (blue) decreases from about 100% to 0% between t = 500 minutes and t = 990 minutes, in a non-linear and unpredictable manner, since this decrease is not guided externally but arises spontaneously: the remaining time until the pre-set start of the huge job cannot be used by the standard fixed-shape jobs available in the queue. However, our fill-in (green) covers this big gap completely, as it already did during normal operation (before t = 500 minutes), where the overall load also frequently failed to reach 100%. The blue line indicates the maximally available resources at all times. They were not completely constant: until t ≈ 200 minutes, a part of the system was reserved for administrative purposes. Note that 1 node contains 24 CPU cores, hence 744 nodes correspond to 17,856 CPU cores.

In contrast to the two test cases described in the previous subsections 3.1 and 3.2, at the HLRN we had no influence whatsoever on the normal jobs (blue load in Fig. 3). Instead, these jobs had been submitted by other regular HLRN users, as in any other period of normal HLRN operation. Hence, this third test case at HLRN not only demonstrates our adaptive fill-in at a very large HPC center but also under perfectly normal, real-life conditions.

The reactions of our adaptive fill-in to current load levels and their changes are essentially instantaneous, but reaction times of the HLRN queueing system (including simple load-level queries, as needed to produce this load figure and to drive the deployment of our fill-in jobs) are not negligible, since high-frequency scheduling/queueing queries were not part of the design specifications when this queueing system was set up several years ago. These delays lead to the jagged appearance of the upper edge of the fill-in load (green area in Fig. 3), occasionally escalating into highly oscillatory load dips (between t=800 and t=900 minutes) or white “stripes” across the whole (green) fill-in load (between t=1000 and t=1100 minutes, after the big job finished). These irregularities were much smaller in the other two demonstrations (Figs. 1 and 2). Hence, they are artifacts from a limited scheduler time resolution at HLRN, and do not reflect true deficits of our own fill-in setup. Conversely, to fully exploit these fill-in capabilities, the design specifications of an HPC scheduler installation should include sufficient responsiveness to higher-frequency queries than needed in more traditional HPC scheduling.

With the fill-in jobs displayed in Fig. 3, over 75 M global optimization steps for a TIP4P water cluster of the size mentioned above could be performed, and 80 k CPU-hours were used.

Most of these 80 k CPU-hours accrued within a window of only 3 hours and 20 minutes. Nevertheless, this is a substantial HPC resource usage, equivalent to what one full, typical HLRN project uses within a whole week. With conventional parallel (or serial) jobs, these resources would have remained idle, i.e., wasted, as evidenced by the mere existence of all these load gaps and by the obvious fact that normal HLRN jobs were not able to fill them. The HLRN queues were never empty during this time, but typically contained 50–100 jobs in a submitted/waiting state, and none of these normal, waiting jobs was in any way hindered by our fill-in jobs.

4 Conclusions

In summary, we have shown that our highly flexible RMI parallelization of OGOLEM [6] can indeed be employed to adaptively fill in each and every bit of computing resources that standard parallel jobs under standard scheduling have to leave unused – no matter whether these leftovers are small or huge. We have demonstrated this with one and the same program package on three very different computer systems, with significant differences in the queueing/scheduling software and with huge differences in hardware size, ranging from a small, heterogeneous, local computing cluster to a national TOP500 supercomputer. To transfer our setup from one of these machines to another, no changes at all were necessary in our OGOLEM package, and only minor adaptations had to be made in small helper scripts (interacting with the scheduler) and in the scheduler setup (in all cases exclusively exploiting already existing scheduler features). Additionally, on the largest system (HLRN), our demonstration was also a fully real-life case, against a backdrop of standard jobs from many other users, completely beyond our control.

Therefore, with present-day adaptively parallel technologies, even on large-scale HPC installations and in everyday situations, it is now demonstrably possible to avoid all scheduling losses and to achieve a total load level of 100%, at all times, maximally exploiting the available computing resources.

Acknowledgments

We would like to acknowledge Holger Marten of the Kiel University Computing Center and Thomas Steinke, Christian Schimmel and the “Fachberater” team of the Zuse Institute Berlin (ZIB) / North-German Supercomputing Alliance (HLRN) for allowing us to perform the real-life tests reported here on their machines during normal operation, despite their non-standard queueing/scheduling. Additionally, FS and BH are grateful for a computer time grant which made the big fill-up calculations at HLRN possible, and to Peter Hauschildt (Hamburg Observatory) for submitting huge astrophysics jobs at HLRN, triggering machine drainings that we could then fill up.
JMD wishes to extend his gratitude to Scientific Computing & Modelling (SCM) who allowed him to pursue these questions in his free time. He also wishes to thank Dean Emily Carter for her current and ongoing support of his other scientific endeavours.

References

  • [1] K. Jansen. Scheduling malleable parallel tasks: An asymptotic fully polynomial time approximation scheme. Algorithmica, 39:59–81, 2004.
  • [2] J. Blazewicz, M. Machowiak, J. Weglarz, M. Y. Kovalyov, and D. Trystram. Scheduling malleable tasks on parallel processors to minimize the makespan. Annals Oper. Res., 129:65–80, 2004.
  • [3] L. Y. Fan, F. Zhang, G. M. Wang, and Z. Y. Liu. An effective approximation algorithm for the malleable parallel task scheduling problem. J. Paral. Distrib. Comput., 72:693–704, 2012.
  • [4] Y. J. Cao, H. Y. Sun, D. P. Qian, and W. G. Wu. Scalable hierarchical scheduling for malleable parallel jobs on multiprocessor-based systems. Comput. Syst. Sci. Eng., 29:169–181, 2014.
  • [5] A. Gupta, B. Acun, O. Sarood, and L. V. Kalé. Towards realizing the potential of malleable jobs. In 21st International Conference on High-Performance Computing (HiPC), 2014.
  • [6] J. M. Dieterich and B. Hartke. An error-safe, portable, and efficient evolutionary algorithms implementation with high scalability. J. Chem. Theory Comput., 12:5226, 2016.
  • [7] W. L. Jorgensen, J. Chandrasekhar, J. D. Madura, R. W. Impey, and M. L. Klein. Comparison of simple potential functions for simulating liquid water. J. Chem. Phys., 79:926, 1983.
  • [8] S. Kazachenko and A. J. Thakkar. Water nanodroplets: Predictions of five model potentials. J. Chem. Phys., 138:194302, 2013.
  • [9] A. B. Yoo, M. A. Jette, and M. Grondona. SLURM: Simple Linux Utility for Resource Management. In D. Feitelson, L. Rudolph, and U. Schwiegelshohn, editors, Job Scheduling Strategies for Parallel Processing, JSSPP 2003, volume 2862 of Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, 2003.
  • [10] North-German Supercomputing Alliance (HLRN). https://www.hlrn.de/, accessed 2017/11/24.
  • [11] Top500 list. https://top500.org/list/2017/?page=2, accessed 2017/11/24.