Delay Propagation and Overlapping Mechanisms on Clusters: A Case Study of Idle Periods based on Workload, Communication, and Delay Granularity

05/25/2019, by Ayesha Afzal et al.

Analytic, first-principles performance modeling of distributed-memory applications is difficult due to a wide spectrum of random disturbances caused by the application and the system. These disturbances (commonly called "noise") destroy the assumptions of regularity that one usually employs when constructing simple analytic models. Despite numerous efforts to quantify, categorize, and reduce such effects, a comprehensive quantitative understanding of their performance impact is not available, especially for long delays that have global consequences for the parallel application. In this work, we investigate various traces collected from synthetic benchmarks that mimic real applications on simulated and real message-passing systems in order to pinpoint the mechanisms behind delay propagation. We analyze the dependence of the propagation speed of idle waves emanating from injected delays with respect to the execution and communication properties of the application, study how such delays decay under increased noise levels, and how they interact with each other. We also show how fine-grained noise can make a system immune against the adverse effects of propagating idle waves. Our results contribute to a better understanding of the collective phenomena that manifest themselves in distributed-memory parallel applications.


I Introduction

I-A Simple analytic models and noise

White-box analytic performance models of parallel applications are built on simplifying assumptions about the interactions of code with the hardware. For example, the Roofline model [19] predicts the performance of parallel loops on multicore processors by assuming that data transfers to and from main memory overlap perfectly with the execution of work in the cores, and whichever takes longer determines the runtime. Distributed-memory parallel applications require some communication model in addition, such as the Hockney model [5], LogP [2], or one of their variants and extensions. Further assuming a bulk-synchronous model of computation without overlap of communication and computation, the overall runtime of a program may then be predicted by adding the execution time to the communication time: $T = T_\mathrm{exec} + T_\mathrm{comm}$. Many refinements are possible, and even though the construction of analytic models is often tedious, they lead to invaluable insights about bottlenecks and governing mechanisms.
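For concreteness, if communication is described by the Hockney model named above, this combined prediction takes the form (the symbols below are ours, introduced only for illustration)

$$T = T_\mathrm{exec} + T_\mathrm{comm} = T_\mathrm{exec} + \alpha + \frac{V}{\beta},$$

with message latency $\alpha$, asymptotic bandwidth $\beta$, and communicated data volume $V$ per compute-communicate cycle.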

Many of the assumptions underlying those models are fulfilled only approximately in practice, which limits their accuracy and, if deviations become too large, their usefulness. In this paper, we want to shed light on a phenomenon that violates the bulk-synchronous assumption about a parallel program on typical cluster systems comprising many nodes with multicore processors: Even if the workload is perfectly balanced across all workers, slight variations in execution performance or communication time can lead to prominent collective phenomena that break the regular “lockstep” pattern. Such variations exist in all real computer systems, and they have a wide spectrum of origins. A very coarse categorization of statistical causes for variation can be made: Fine-grained noise originates from OS interference, clock speed fluctuations, interrupts, data management in driver software, and more, and usually leads to delays of the order of microseconds. Execution delays are longer and more coarse-grained; examples are regular administrative jobs (e.g., cron job scripts), page faults, garbage collection, or infrequent application imbalances that take a significant amount of a core’s resources for a long time, maybe longer than the typical periodicity of a bulk-synchronous program. We will keep the distinction between noise and delays in this work.

I-B Motivating examples

Fig. 1: Comparison of actual measurements with a performance model for strong scaling of an MPI-parallel STREAM triad benchmark on a cluster system with 20 cores per node and InfiniBand interconnect. See main text for details of the setup and model. (a) Total measured (blue squares) and model (red squares) performance on up to 9 full sockets, execution-only performance model (red diamonds), and execution-only median and min/max performance across all cores (blue diamonds and whiskers). (b) Closeup of the node level. (c) Comparison of model and measurement when running only one core per node.

One prominent result of noise- or delay-induced desynchronization is that simple execution/communication performance models as described above tend to deliver very inaccurate predictions, and those can be optimistic as well as pessimistic. We performed a simple experiment that shows this effect (see Figure 1): On an InfiniBand cluster with dual-socket Intel “Ivy Bridge” nodes (details on the hardware can be found in Section III-A) we ran a pure-MPI version of the McCalpin STREAM triad [12] loop (A(:)=B(:)+s*C(:)) in a strong scaling scenario. A fixed overall working set is split evenly across the MPI processes. After each full loop traversal, each MPI rank sends and receives a fixed amount of data to and from each of its direct neighbors (the process topology is a closed ring). All nodes are connected to a single, fully nonblocking IB leaf switch to eliminate interference from other cluster jobs and enable full non-blocking bandwidth. Given the memory bandwidth per socket, $b_S$, and the asymptotic node-to-node bandwidth of the InfiniBand network, $b_\mathrm{IB}$, one can construct a simple optimistic model for the runtime of one compute-communicate cycle. If $N_s$ is the number of sockets (each running ten processes), the predicted time per time step is

$$T(N_s) = \frac{V_\mathrm{mem}}{N_s\, b_S} + \frac{V_c}{b_\mathrm{IB}}, \qquad (1)$$

where $V_\mathrm{mem}$ is the overall memory traffic per loop traversal and $V_c$ is the internode communication volume per process.

Here we ignore the communication between processes within a node (which can be significant in practice, but neglecting it only makes the model more optimistic). The performance in flop/s is then just the floating-point work per time step divided by this time. Figure 1(a) shows the total predicted (red squares) and measured (blue squares) performance versus the number of sockets. The strong deviation of almost a factor of two at 9 nodes might be expected because of our ignorance of intra-node communication; it is surprising, however, that the pure execution performance (calculated from the mean execution time of individual processes) is so much higher than the prediction, which of course assumes linear scalability (blue vs. red diamonds). Obviously, the assumption that computation and communication do not overlap cannot be true, but since the load is perfectly balanced across processes, this overlap must be something that emerges automatically during the program’s runtime due to the intrinsic system noise. Indeed, a trace analysis reveals that the MPI processes are massively out of sync, which leads to automatic overlap and mitigation of the memory bandwidth bottleneck but also causes massive delays due to processes waiting for messages. Due to the statistical nature of this effect, variations across runs and processes are inevitable, as shown by the min/max whiskers. Figure 1(b) is a zoom-in on the node level, where the simple bandwidth model works fine on up to one socket but communication overhead becomes visible beyond. In Figure 1(c) we show a similar experiment with only one process per node. The relative communication overhead is now much smaller since the node-level performance is only about 1/6 of the saturated case. With less opportunity for overlap, and the memory bandwidth bottleneck removed, the model actually delivers a good prediction of the average performance, although some outliers at the larger node counts indicate the onset of overlap.
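To make the benchmark structure concrete, the following minimal sketch shows a pure-MPI STREAM triad with a closed-ring exchange after each loop traversal. Array sizes, message sizes, and the iteration count are placeholders; this is an illustration, not the exact code used for Figure 1.

#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const long n_local = 1L << 22;                 // per-process share of the working set (placeholder)
  std::vector<double> A(n_local), B(n_local, 1.0), C(n_local, 2.0);
  const double s = 3.0;

  const int left  = (rank - 1 + size) % size;    // closed ring of processes
  const int right = (rank + 1) % size;
  std::vector<double> sbuf(1024, 0.0), rbuf_l(1024), rbuf_r(1024);  // exchanged payload (placeholder size)

  for (int step = 0; step < 1000; ++step) {
    for (long i = 0; i < n_local; ++i)           // one full STREAM triad traversal
      A[i] = B[i] + s * C[i];

    MPI_Request req[4];                          // exchange with both direct neighbors
    MPI_Irecv(rbuf_l.data(), (int)rbuf_l.size(), MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(rbuf_r.data(), (int)rbuf_r.size(), MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(sbuf.data(),   (int)sbuf.size(),   MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(sbuf.data(),   (int)sbuf.size(),   MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
  }
  MPI_Finalize();
  return 0;
}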

Fig. 2: Measured LBM timeline structure in comparison with the regularity expected from the model. Time steps with more irregularity (iterations in the middle and at the end of the run) show the largest percentage deviation towards better performance.

Beyond simple toy benchmarks, real-world applications show similar patterns of nonsynchronicity. On the same system as above we ran an MPI-parallel double-precision Lattice-Boltzmann (LBM) fluid solver with D3Q19 discretization and a single relaxation time (SRT) model on five nodes (100 cores). The node-level performance properties of such codes were thoroughly investigated in [20]. We use a large overall problem size (including a boundary layer in each direction) with a memory-resident working set. Domain decomposition is done only along the outer dimension with periodic boundary conditions, leading to a rather large communication overhead of at least 30% of the runtime. In Figure 2 we show timeline snapshots at different LBM time steps, where the location of the time step along the wall-clock time axis is marked on each process (red markers). For reference, we also show the expected positions according to the simple nonoverlapping execution-communication model (1). While the deviation from the model and the variation across processes is still small after 20 time steps, a global structure has emerged at later time steps, with a fundamental “wavelength” equal to the size of the system (100 processes) and an amplitude of 0.3 wall-clock seconds. This pattern is not static but shifts and changes shape, as can be seen in the later snapshots. Moreover, the actual runtime at the end of the run is about 28 s smaller than expected. While this is only a deviation of about 2.5%, the pattern is interesting and may show up more prominently with applications that have different communication overheads and patterns.

The examples above have demonstrated that some scenarios allow noise to act as an application accelerator as well as a slowdown factor. There is, however, a very complex interplay between application code execution, the message passing library, and the network, which leads to a rich spectrum of local and collective phenomena in parallel code, especially when noise is present. The accelerating effect is certainly not guaranteed. In this work we want to study a particular aspect of this theme: the wave-like propagation of execution delays (“idle waves”) through the network under the influence of system noise and variable injected noise.

I-C Contributions

The major contribution of this paper is the investigation of idle waves, which emanate from strong delays occurring on individual processes of an MPI-parallel program.

  • We investigate and categorize the mechanisms of the propagation of “idle waves” emanating from execution delays across communicating processes under some simplifying assumptions, notably a bulk-synchronous application structure.

  • We show how such idle waves interact and (partially) cancel each other, proving that a linear wave equation is inappropriate to describe the phenomenon.

  • We give an analytic expression for the speed of an idle wave in a noise-free homogeneous system under core-bound computational load, taking into account execution time, communication time, communication topology, and communication mode (eager vs. rendezvous).

  • We investigate the impact of injected, fine-grained exponential noise on the propagation speed and lifetime of idle waves and show how the decay of the wave depends on the strength of the noise.

  • We demonstrate that the application slowdown usually caused by strong idle waves may be unobservable due to the presence of noise.

This paper is organized as follows: In Section II we introduce some important terms and categorize the execution and communication scenarios under investigation. Section III gives details about our hardware and software setup and the inherent node-level noise structure of the cluster systems we use for the benchmarks. The mechanisms of delay propagation under various conditions are covered in Section IV, while Section V deals with the analysis of idle waves decaying under noise. Related work is covered in Section VI, and Section VII concludes the paper and gives an outlook to future work.

II Categorization of parameters

A multitude of system and application parameters and properties influence the phenomenology of delay propagation and desynchronization. This section tries to categorize the most relevant factors.

II-A Node level: Scalability vs. saturation

In HPC, most parallel codes comprise sequences of back-to-back loops or bags of tasks that manipulate data. This data is either already present in the local memory hierarchy or must be communicated from elsewhere via message passing. Leaving aside communication for the moment, hardware-software interaction on the compute node level can be categorized into data-bound and compute-bound phases. Lacking any load imbalance, the performance of truly compute-bound code will scale across the cores of the node since no shared resources are on the critical path. In case of data boundedness, the actual data transfer bottleneck may be a per-core (i.e., private) cache level, which will not impede scalability; if a shared resource such as a shared cache or the memory interface is involved, performance does not scale but saturates as the number of cores increases. The motivating STREAM triad and LBM examples in the introduction are clearly in the data-bound category. With such a code, using fewer than the maximum number of cores on the contention domain will usually not change the performance, and some load imbalance in the form of a few “speeders” may be tolerated. The Roofline [19] and Execution-Cache-Memory (ECM) [16, 7] performance models allow a rather complete analytic prediction of steady-state loop performance on multicore CPUs.
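As a rough illustration of the saturation effect (our notation, not a formula taken from [19] or [16, 7]): for a data-bound loop with computational intensity $I$ (flop/byte), a Roofline-style estimate on $n$ cores of a contention domain with memory bandwidth $b_S$ is $P(n) = \min(n\, P_\mathrm{core},\; I\, b_S)$, so performance saturates at roughly $n_s \approx \lceil I\, b_S / P_\mathrm{core} \rceil$ cores, where $P_\mathrm{core}$ is the single-core performance. A truly compute-bound loop never hits the $I\, b_S$ ceiling and keeps scaling with $n$.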

Note that manifest load imbalance within a single phase is considered an application-induced delay here, just as lock contention, false sharing, and similar effects (see below).

II-B System topology

Clusters of dual-socket multicore nodes (with or without accelerators) are the dominating high-performance parallel computer architecture today. These systems show a complex topology, in the sense that basic, identical components are assembled on multiple levels to build a hierarchical structure: cores, chips, ccNUMA domains, sockets, nodes, network islands. Apart from data bottlenecks on the chip level (see above), which manifest themselves only when an on-chip shared resource is exhausted, communication characteristics like latency and asymptotic bandwidth for point-to-point and collective primitives can be very different across intra-chip, inter-chip, and inter-node connections.

II-C Communication features

II-C1 Message-passing modes

Beyond the well-defined distinction between blocking and nonblocking communication, most of the details about how communication takes place in MPI are left to the implementation. As a general rule, short messages sent via the standard MPI_Send are transferred using an eager protocol, i.e., due to internal buffering there is no handshake between communicating processes, which may entail “automatic” asynchronous message transfer. Larger messages usually require a handshake (rendezvous protocol), causing synchronization and, most likely, explicit, nonoverlapping message transfer.

The MPI Standard is purposefully vague about how the eager protocol should be implemented. MPI implementations often allow the user to choose the protocol by setting an “eager limit” for different (intra-node and inter-node) devices. This is an upper bound on the size of messages sent or received using the eager protocol. Implementations also provide tuning knobs to control the number and the size of shared memory buffers or other internal parameters. As a consequence, the performance gain of the eager protocol over rendezvous because of reduced synchronizing delays is also implementation dependent.
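The following sketch illustrates one way to probe the effective protocol empirically: rank 1 deliberately delays posting its receive, so an eager MPI_Send on rank 0 returns almost immediately, while a rendezvous send waits for the handshake. The message size, the sleep duration, and the interpretation thresholds are illustrative assumptions; the outcome depends on the MPI implementation and its eager-limit settings.

#include <mpi.h>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  const int n = 8192;                          // message size in doubles (placeholder)
  std::vector<double> buf(n, 1.0);

  MPI_Barrier(MPI_COMM_WORLD);
  if (rank == 0) {
    double t0 = MPI_Wtime();
    MPI_Send(buf.data(), n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    // ~0 s: message went out eagerly; ~1 s: send waited for the rendezvous handshake
    printf("MPI_Send took %.3f s\n", MPI_Wtime() - t0);
  } else if (rank == 1) {
    std::this_thread::sleep_for(std::chrono::seconds(1)); // delay the matching receive
    MPI_Recv(buf.data(), n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }
  MPI_Finalize();
  return 0;
}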

II-C2 Communication patterns

Communication patterns are highly application dependent, and we do not attempt a full categorization here. Instead, we point out those patterns that we will need for our experiments in later sections. We also restrict ourselves to point-to-point communication patterns in this work.

Unidirectional vs. bidirectional next-neighbor

Although rarely seen in practice, unidirectional next-neighbor communication along an ordered set of processes is a good starting point for studying propagation phenomena. Each process $i$ receives data from one neighbor process ($i-1$ or $i+1$) and sends it to the other. In bidirectional communication, each process exchanges data with, i.e., both sends to and receives from, its two neighbors $i \pm 1$.

Next-neighbor vs. multiple-neighbor

Generalizing on the previous pattern, each process can have multiple neighbors in each direction. This occurs in many linear algebra and domain decomposition scenarios and entails more rigid dependencies across the processor grid.

Periodic vs. open boundaries

If the process grid is nonperiodic, propagating disturbances die out at the boundaries. Periodic boundary conditions (in one or more dimensions) enable the propagation of disturbances over larger distances and allow for more interactions between processes.
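The distinction only affects how neighbor ranks are determined. A minimal sketch (our own helper, not part of any library) for a 1-D process chain:

#include <mpi.h>
// Compute left/right communication partners in a 1-D chain of 'size' ranks.
// With open boundaries, out-of-range partners become MPI_PROC_NULL, so the
// corresponding sends/receives are no-ops and disturbances die out at the ends;
// with periodic boundaries, disturbances can wrap around the chain.
void neighbors(int rank, int size, bool periodic, int &left, int &right) {
  if (periodic) {
    left  = (rank - 1 + size) % size;
    right = (rank + 1) % size;
  } else {
    left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;
  }
}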

III Hardware testbed characterization

III-A Cluster systems and software

(a) System noise with SMT enabled
(b) System noise with SMT disabled
Fig. 3: Histograms of the statistical baseline (natural/realistic) system-induced execution delays on the InfiniBand (Emmy) and Omni-Path (Meggie) systems for execution periods of 3 ms, with and without SMT. The bin size was 640 ns (top) and 7.2 µs (bottom), respectively.

Fig. 4: The delay propagation mechanism in the most simple setting: A long execution delay (spanning several execution phases) is injected at MPI rank 5 in the first time step. Communication is in eager mode and unidirectional, from process $i$ to $i+1$ after each execution phase. The injected delay causes a waiting phase (red bar) at rank 6 and, after another execution phase, at rank 7, etc. The idle wave propagates through the system at a fixed speed due to the regularity of execution phases. Note that the width of the communication phases has been exaggerated for clarity; communication accounts for only about 0.2% of the runtime.

The “Emmy” system is installed at Erlangen Regional Computing Center (RRZE). It comprises dual-socket compute nodes with ten-core Intel Xeon “Ivy Bridge” E5-2660v2 CPUs with a base clock frequency of 2.2 GHz. The fat-tree QDR InfiniBand interconnect fabric (40 Gbit/s per link and direction) is built on a hierarchy of 36-port switches. “Meggie,” also installed at RRZE, features 724 dual-socket nodes with ten-core Intel Xeon “Broadwell” E5-2630v4 CPUs and a fat-tree Omni-Path network (100 Gbit/s per link and direction). For all measurements presented here, the clock frequency of all nodes was fixed to the base value of 2.2 GHz. Multi-node experiments were run on a homogeneous set of nodes connected to a single leaf switch. Process-core affinity was enforced using the available facilities in the MPI implementation, ignoring the SMT feature (i.e., using only physical cores) unless specified otherwise.

We used the Intel C++ compiler and the Intel MPI library for all experiments. MPI traces were collected with the Intel Trace Analyzer and Collector. We used the C++ high-resolution std::chrono clock for timing measurements. Communication delays for non-blocking calls were measured as the time spent in MPI_Wait.
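A minimal sketch of this timing approach (our own illustration, not the exact instrumentation code):

#include <mpi.h>
#include <chrono>
// Measure the communication delay of a non-blocking operation as the time
// spent in MPI_Wait, using the C++ high-resolution clock.
double wait_seconds(MPI_Request *req) {
  auto t0 = std::chrono::high_resolution_clock::now();
  MPI_Wait(req, MPI_STATUS_IGNORE);
  auto t1 = std::chrono::high_resolution_clock::now();
  return std::chrono::duration<double>(t1 - t0).count();
}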

III-B System noise characteristics

In order to know which levels and characteristics of noise are realistic, we analyzed the natural system noise on a standard compute node of each system. We ran an MPI-parallel code with a compute-bound workload whose execution time (excluding communication) is known exactly. The code comprises a large number of back-to-back double-precision divide instructions (vdivpd), the throughput of which is exactly one instruction per 28 clock cycles on Ivy Bridge and one per 16 clock cycles on Broadwell [8]. For the experiment we ran this workload on all physical cores of a node, with execution phases of 3 ms alternating with latency-bound next-neighbor MPI communication, and measured on each core and for each execution phase the deviation of the pure execution time from the ideal (ignoring the communication). Overall, a large number of data points was collected.
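A sketch of such a measurement kernel is shown below. The real probe issues packed vdivpd instructions directly; this illustration uses a dependent chain of scalar divides, which merely shares the relevant property of a precisely predictable, compute-bound runtime.

#include <chrono>
// Compute-bound phase with a predictable duration: a chain of dependent
// floating-point divides. The deviation of the measured time from the
// calibrated ideal time is attributed to noise.
double divide_phase(long ndiv, double &sink) {
  auto t0 = std::chrono::high_resolution_clock::now();
  double x = 1.00000001;
  for (long i = 0; i < ndiv; ++i)
    x = sink / x;               // dependent divides, one per iteration
  sink = x;                     // keep the result so the loop is not optimized away
  auto t1 = std::chrono::high_resolution_clock::now();
  return std::chrono::duration<double>(t1 - t0).count();
}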

Figure 3(a) shows histograms of execution delays with SMT switched on. Both systems are quite similar, with small average and maximum delays. With SMT deactivated, however, the Omni-Path system exhibits a bimodal noise distribution with a distinctive second peak at larger delays. We speculate that this is the influence of the Omni-Path driver, which is much more CPU-intensive than the InfiniBand driver on the other system. The damping effect of SMT on system noise is well known [10]. In our experiments we will use the clusters in their official configuration, i.e., the InfiniBand cluster with SMT enabled and the Omni-Path cluster with SMT disabled.

Note that system noise is an extrinsic source of variation from the application’s point of view. There may also be intrinsic sources of noise, such as random load imbalance or variations in data access times. In later experiments we will inject intrinsic fine-grained noise as part of the application execution, but there is really no difference to extrinsic noise as far as the observable outcome is concerned.

IV Mechanisms of delay propagation

Small-message (eager protocol) panels: (a) unidirectional, open boundary; (b) unidirectional, periodic; (c) bidirectional, open boundary; (d) bidirectional, periodic. Large-message (rendezvous protocol) panels: (e) unidirectional, open boundary; (f) unidirectional, periodic; (g) bidirectional, open boundary; (h) bidirectional, periodic.
Fig. 5: Qualitative timeline analysis of delay propagation under controlled conditions on the InfiniBand cluster with one process per node, next-neighbor non-blocking communication, and only native system noise. Execution and communication phases are shown in white, blue is a deliberately injected delay at rank 5, and idle periods and communication delays are red. The message size for small messages (top row) was below the eager limit, while large messages (bottom row) exceeded it. For each message size, all four combinations of uni/bidirectional and periodic/open boundary conditions are shown.

We are especially interested in how long execution delays propagate in regular, bulk-synchronous applications, how they affect the execution time, and how intrinsic or extrinsic fine-grained noise influences this propagation. To this end, we performed a series of experiments to fathom a part of the vast parameter space. It turned out that the Ivy Bridge InfiniBand cluster (“Emmy”) showed behavior that is almost perfectly in line with the LogGOPSim [6] simulator, so we use the “real” systems throughout unless indicated otherwise. One execution phase is purely compute bound and 3 ms long, and the message size is 8192 bytes unless specified otherwise. Figure 4 shows the most simple case: eager-mode, unidirectional communication, one process per node, no significant system or application noise. Each rank $i$ sends a message to the next rank, $i+1$, and cannot continue before it has received a message from rank $i-1$. A delay with a length of 4.5 execution phases (blue bar) is injected at rank 5 at the first time step. Due to the delayed message from rank 5 to rank 6, the latter gets delayed by the same amount, and so is the message it sends to rank 7 at the end of the next execution phase, etc. Due to the eager protocol, ranks smaller than 5 are unaffected by the delay because they can get rid of their messages. (There is of course a limit to the internal buffers that store such messages, but this can be handled like a transition to a rendezvous protocol.) In effect, the injected delay causes an “idle wave” [11] to ripple through the system at a constant speed of one rank per execution plus communication phase length. Note that this is only strictly true for homogeneous systems and core-bound execution. The presence of a memory bottleneck and/or different domains with different communication characteristics (e.g., intranode vs. internode) will change the picture, but this is outside the scope of this work.

IV-A Basic flavors of delay propagation

It can be expected that the different communication parameters described in Section II-C cause different idle wave propagation patterns. Figure 5 shows a scan of all eight combinations of eager/rendezvous protocol, periodic/open boundary conditions, and unidirectional/bidirectional non-blocking next-neighbor communication, again running only one process per node. In all cases the communication was implemented by first initiating nonblocking MPI_Isend/MPI_Irecv calls to all neighbors of a process and then calling MPI_Waitall.
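A condensed sketch of the experiment structure is shown below. The delayed rank, phase length, delay length, and message size mirror the description above; the compute phase is emulated by a sleep purely for brevity, whereas the actual experiments use the compute-bound divide workload of Section III-B.

#include <mpi.h>
#include <chrono>
#include <thread>
#include <vector>
// Bulk-synchronous compute-communicate loop with a one-off delay injected on
// a single rank, using the Isend/Irecv/Waitall next-neighbor pattern.
void run(int rank, int left, int right, int nsteps) {
  const auto T_exec  = std::chrono::milliseconds(3);      // execution phase length
  const auto T_delay = std::chrono::microseconds(13500);  // ~4.5 execution phases
  std::vector<char> sbuf(8192), rbuf_l(8192), rbuf_r(8192);

  for (int step = 0; step < nsteps; ++step) {
    std::this_thread::sleep_for(T_exec);                  // stands in for compute work
    if (rank == 5 && step == 0)                           // inject the delay once
      std::this_thread::sleep_for(T_delay);

    MPI_Request req[4];
    MPI_Irecv(rbuf_l.data(), (int)rbuf_l.size(), MPI_CHAR, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(rbuf_r.data(), (int)rbuf_r.size(), MPI_CHAR, right, 0, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(sbuf.data(),   (int)sbuf.size(),   MPI_CHAR, left,  0, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(sbuf.data(),   (int)sbuf.size(),   MPI_CHAR, right, 0, MPI_COMM_WORLD, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
  }
}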

Figure 5(a) depicts the same situation as in Figure 4 (eager, unidirectional, nonperiodic) but shown for the full set of 18 ranks. Due to the open boundary conditions, the idle wave runs out at the last process. This changes with periodic boundaries (Figure 5(b)): The idle wave wraps around until, after 17 steps, it hits the process on which the delay was injected. There it dies out because process 5 is still busy receiving the outstanding eager messages from above, and as soon as the idle period on rank 4 ends, everything is in sync again.

Figures 5(c) and (d) show the situation for eager but bidirectional communication. Idle waves must now propagate in both directions from the injection but die out at the boundary for a nonperiodic process chain. In the periodic case, they wrap around and meet at rank 14 where they cancel. This is the first indication that idle waves must be a nonlinear phenomenon that cannot be adequately described by a linear wave equation.

With larger messages, the rendezvous protocol kicks in (lower row in Figure 5). Now even with unidirectional communication (Figures 5(e), (f)) the idle wave must propagate in both directions because rank 4 cannot get rid of its messages to rank 5 as long as the injected delay lasts. The general pattern is thus the same as for bidirectional eager-mode communication (Figures 5(c), (d)).

Finally, with bidirectional rendezvous-mode communication (Figures 5(g), (h)) the idle wave propagates twice as fast, because two neighbors of the delayed process are blocked in either direction.

These observations are entirely expected when looking at the basic mechanisms of point-to-point communication, but several questions come to mind: Does the interaction of propagating idle periods have a more intricate phenomenology than shown with these simple and controlled experiments? What is the speed by which an idle wave ripples through the system? Does system or application noise change the overall picture? And what is the role of system topology, specifically the intranode multicore structure of the cluster? These questions will be addressed in the following sections.

IV-B Interaction of propagating delays

Fig. 6: Qualitative analysis of interacting idle waves in a periodic process chain. Eager-mode bidirectional communication was used on the InfiniBand cluster, with ten processes per socket on 10 sockets (5 nodes). (a) Identical delay injected at the sixth process on each socket, (b) delay injected at the sixth process on each socket, but with half the duration on odd sockets, (c) random delay injected at the sixth process of each socket. Dotted lines mark socket boundaries.

As demonstrated in the previous section, idle waves can run out at process chain boundaries or cancel completely when hitting each other. Since delays of different duration might be injected in random ways across the whole communicator, the question arises as to what happens in more complex scenarios. Figure 6 shows the result of three experiments on 100 MPI processes with bidirectional eager-mode communication and periodic boundary conditions running on ten sockets (five nodes) of the InfiniBand cluster. Of course, the intra-node communication characteristics differ from the InfiniBand parameters, but this is of no significance here. Delays were injected on local rank 5 of every socket. For equal delays (Figure 6(a)) we observe the expected cancellation after five hops. If delays on odd sockets are just half as long (Figure 6(b)), partial cancellation occurs and the (originally) longer idle periods continue to propagate until they cancel with their symmetric counterparts from the next even socket. With random injections (Figure 6(c)), the longest initial delays cause idle waves that survive until they run out by other mechanisms (in our case, by the termination of the program after 20 time steps).

These experiments point to an important hypothesis, namely that idle waves can survive for a long time on a non-noisy system, but might be damped away by deliberate injection of noise, which is just a collection of statistical, short-term delays. We will investigate this in Section V below.

IV-C Wave propagation speed

Our experiments in Section IV-A have shown that the speed of idle wave propagation doubles with bidirectional rendezvous-mode communication because the initial delay “reaches out” twice as far into neighboring processes (Figures 5(g), (h)). Basic analysis shows that, on a noise-free system, this speed depends in a simple manner on the execution period $T_\mathrm{exec}$, the communication time $T_\mathrm{comm}$, the use of bidirectional rendezvous-mode communication, and the distance of neighbor communication $d$, which is the largest distance to any communication partner of a process (up to now we have only considered $d = 1$):

$$v_\mathrm{idle} = \frac{\sigma\, d}{T_\mathrm{exec} + T_\mathrm{comm}} \quad \text{[ranks per unit time],} \qquad (2)$$

where $\sigma = 2$ for bidirectional rendezvous-mode communication and $\sigma = 1$ otherwise.

Note that it does not matter here what $T_\mathrm{comm}$ is composed of, be it latency, overhead, transfer time, etc. In fact, communication overhead and execution time appear on an equal footing here. Figure 7 shows an example with rendezvous-mode unidirectional and bidirectional communication, respectively: the presence of bidirectional communication doubles the propagation speed. No such effect can be observed for eager mode.
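As a plausibility check with the parameters from Section IV (execution phases of $T_\mathrm{exec} = 3\,\mathrm{ms}$; communication accounting for only about 0.2% of the runtime, i.e., $T_\mathrm{comm} \approx 6\,\mathrm{\mu s}$, an estimate derived from the Figure 4 caption), Eq. (2) with $d = 1$ and $\sigma = 1$ yields $v_\mathrm{idle} \approx 1/(3\,\mathrm{ms}) \approx 333$ ranks per second, i.e., one rank per execution-plus-communication phase, exactly the speed observed in Figure 4; bidirectional rendezvous communication doubles this to about 665 ranks per second.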

It turns out that even in a noisy system the propagation speed along the “forward,” i.e., leading slope of an idle wave is hardly changed from its noise-free value (2), while the trailing slope is strongly influenced by the noise. The reason for this is that system noise and past delays with all their accumulated effects mainly interact with the trailing edge of the idle wave. On any particular process, the delay (i.e., the current manifestation of the idle wave) acts as a “buffer” and swallows much of the variation accumulated up to this point. On the other hand, the idle wave can at most survive for one full traversal of the process chain, so the interaction time of the leading edge with noise is strictly limited.

The effect of noise on the trailing edge of the wave is investigated in the next section.

Fig. 7: Delay propagation speed for a noise-free program in a homogeneous environment with open boundaries and unchanged communication protocol: (a) unidirectional and (b) bidirectional rendezvous-mode communication. Results are shown for direct neighbor communication and for next-to-next indirect neighbor communication. Red marks idle periods and communication delays. Delays propagate three times faster for all communication patterns except bidirectional communication with the rendezvous protocol, where they propagate six times faster.

V Idle period decay

Fig. 8: The average decay rate of an idle period per rank on the InfiniBand and Omni-Path clusters, and with the LogGOPSim simulator for comparison, plotted versus the mean injected delay per execution period (in percent). Results show median, minimum, and maximum values over 15 runs.

V-A Noise and decay rate

It has been known for some time [11] that idle waves tend to decay under the influence of system noise, but a quantitative analysis was lacking so far. Here we analyze the average decay rate of a single idle wave while deliberately injecting fine-grained application noise of a given average duration into every execution phase. This extra noise is exponentially distributed in order to mimic the natural noise distribution on our systems (see Figure 3). It has the probability density function

$$f(t) = \lambda\, e^{-\lambda t}, \quad t \ge 0, \qquad (3)$$

where the mean $1/\lambda$ is the average injected delay per execution phase. The parameter we use to characterize the noise is the ratio of this average delay to the length of an undisturbed execution phase; it quantifies the mean relative delay per execution period.
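For concreteness, the injection can be realized as in the following sketch. The function name, the random-number generator, and the use of a sleep to model the delay are our own illustrative choices; in the actual experiments the delay is added to the compute-bound workload.

#include <chrono>
#include <random>
#include <thread>
// Inject an exponentially distributed fine-grained delay into one execution
// phase. 'mean_delay' is the average injected delay in seconds; relative to
// the undisturbed execution phase length it gives the noise parameter above.
void inject_noise(std::mt19937 &rng, double mean_delay) {
  std::exponential_distribution<double> dist(1.0 / mean_delay); // rate = 1/mean
  std::this_thread::sleep_for(std::chrono::duration<double>(dist(rng)));
}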

Figure 8 shows the measured decay rate with statistics over 15 runs on our two cluster systems and, for reference, a modified version of the LogGOPSim simulator (implementing a simple Hockney model), versus the average delay ratio. There is no qualitative difference among the three data sets, so the decay rate is independent of the existing system noise. There is also a clear positive correlation between the noise level and the decay rate, although more measurements are required in order to discern a definite functional dependence. Note that we chose our standard execution and communication parameters as defined in Section IV. Unless the idle wave is very narrow (incurring massive statistical variation), the decay rate does not depend on the length of the injected delay. For the above experiment we injected long delays of 90 ms.

V-B Idle period elimination


Fig. 9: Damping of an idle wave by exponential noise of different average duration (zero, 20%, and 25% of the execution phase) on the InfiniBand cluster, running six processes per socket on six sockets (three nodes). An idle wave with a length of four execution periods (6 ms) is injected on a single rank early in the run. Red marks the sum of communication time and communication delays, and dotted lines denote socket boundaries.

Finally, we investigate a core-bound parallel code under the influence of an idle wave and variable noise. Figure 9(a) depicts the noise-free situation (natural system noise is present but insignificant). We show application time steps (up to 30) and, in addition, the extra wall-clock time (orange bar) caused by the idle wave. After 30 time steps, the excess runtime is roughly equal to the injected delay (6 ms). With exponential noise injection at an average of 20% (Figure 9(b)) we can observe the strong decay of the idle wave, but the excess runtime is only marginally smaller. Of course, the overall runtime increases due to the presence of noise. However, the processes causing the excess time are now concentrated near the middle of the set. In Figure 9(c), at 25%, we observe no excess runtime: the idle period was damped away by the noise.

VI Related work

Much interesting research has been conducted about the characterization of system noise, its impact on code performance, and how to mitigate it [15, 9, 17, 4, 18, 1, 3, 13]. However, little insight is available about how perturbations of regular communication and communication structure travel through and interact with a cluster system and the parallel applications running on it.

The initial motivation for our work was provided by Markidis et al. [11] and Peng et al. [14] who, based on results from the LogGOPSim simulator [6], used Fourier analysis to learn that isolated idle periods propagate among MPI processes as nondispersive linear waves. Their expression for the propagation speed was purely phenomenological, though, and missed the pivotal ingredients of communication distance and communication mode, which are part of our model (2). This makes our model a starting point for the investigation of collective communication primitives. Their speculation that idle waves may be described by a linear wave equation with damping cannot be upheld, as our analysis of idle wave interaction shows. Although they also observed the idle wave damping phenomenon, no quantitative investigation of the connection between damping and noise was provided.

VII Conclusion and future work

We have explored the phenomenology of idle waves in point-to-point message-passing parallel programs with regular, bulk-synchronous structure and core-bound performance characteristics, communicating via a nonblocking flat network infrastructure. When an idle wave, which is typically initiated by a strong delay on one of the processes, travels through the system, it does so with a speed that depends on the range of point-to-point communication between individual processes, the communication mode (eager vs. rendezvous), and the direction of communication (unidirectional vs. bidirectional). Idle waves interact by partial cancellation, which indicates that a linear wave propagation model cannot be applied. This led to the hypothesis that fine-grained noise may be capable of interacting with idle waves. Indeed we have shown that in the presence of noise, the leading edge of the idle wave is rather insensitive to the noise amplitude, but its trailing edge is changed, leading to a damping effect. Running experiments with exponentially distributed execution noise injections, we have observed a clear positive correlation between the average noise amplitude (relative to the undisturbed execution phase) and the decay rate of the idle wave. We have further shown that the impact of idle waves on the runtime of programs is limited on a noisy system, to the point where the wave is completely absorbed by the noise.

We have only started to explore the huge parameter space of idle wave phenomena, and our findings open possibilities for future research in many directions. Our idle wave propagation model shows that the speed of idle waves depends on the communication time per process, which can be different due to hierarchical system structure (multicore CPUs, multiple sockets per node, network topology). Hence, the propagation speed changes whenever a domain boundary is crossed. This effect will be analyzed further. We will also look into codes with memory-bound characteristics (such as the motivating triad and LBM examples in the introduction) because they bear a strong potential for desynchronization and, thus, better utilization of the memory bandwidth and potential automatic execution-communication overlap. In this context it will be useful to compare pure MPI and hybrid MPI/OpenMP code since the latter tends to enforce frequent thread synchronization, lessening the potential for inter-process skew. Most of the examples in this paper were run in non-blocking communication mode with a simple Isend/Irecv/Waitall pattern. We will explore how more advanced point-to-point and also collective communication patterns influence the idle wave phenomenon. Finally, our long-term goal is to establish a nonlinear continuum model of message-passing programs that describes collective phenomena like long-distance correlations and structure formation.

Acknowledgments

This work was supported by KONWIHR, the Competence Network for Scientific High Performance Computing in Bavaria, under the project “OMI4papps.” We are indebted to Thomas Zeiser and Michael Meier (RRZE) for excellent technical support.

References

  • [1] Pete Beckman, Kamil Iskra, Kazutomo Yoshii, and Susan Coghlan. The influence of operating systems on the performance of collective operations at extreme scale. In Cluster Computing, 2006 IEEE International Conference on, pages 1–12. IEEE, 2006.
  • [2] David Culler, Richard Karp, David Patterson, Abhijit Sahay, Klaus Erik Schauser, Eunice Santos, Ramesh Subramonian, and Thorsten von Eicken. LogP: Towards a realistic model of parallel computation. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP ’93, pages 1–12, New York, NY, USA, 1993. ACM.
  • [3] Kurt B Ferreira, Patrick Bridges, and Ron Brightwell. Characterizing application sensitivity to OS interference using kernel-level noise injection. In Proceedings of the 2008 ACM/IEEE conference on Supercomputing, page 19. IEEE Press, 2008.
  • [4] Roberto Gioiosa, Fabrizio Petrini, Kei Davis, and Fabien Lebaillif-Delamare. Analysis of system overhead on parallel computers. In Proceedings of the Fourth IEEE International Symposium on Signal Processing and Information Technology, pages 387–390. IEEE, 2004.
  • [5] Roger W. Hockney. The communication challenge for MPP: Intel Paragon and Meiko CS-2. Parallel Computing, 20(3):389 – 398, 1994.
  • [6] T. Hoefler, T. Schneider, and A. Lumsdaine. LogGOPSim - Simulating Large-Scale Applications in the LogGOPS Model. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, pages 597–604. ACM, Jun. 2010.
  • [7] Johannes Hofmann, Georg Hager, and Dietmar Fey. On the accuracy and usefulness of analytic energy models for contemporary multicore processors. In Rio Yokota, Michèle Weiland, David Keyes, and Carsten Trinitis, editors, High Performance Computing, pages 22–43, Cham, 2018. Springer International Publishing.
  • [8] Johannes Hofmann, Georg Hager, Gerhard Wellein, and Dietmar Fey. An analysis of core- and chip-level architectural features in four generations of Intel server processors. In Julian M. Kunkel, Rio Yokota, Pavan Balaji, and David Keyes, editors, High Performance Computing: 32nd International Conference, ISC High Performance 2017, Frankfurt, Germany, June 18–22, 2017, Proceedings, pages 294–314, Cham, 2017. Springer International Publishing.
  • [9] Terry Jones, Shawn Dawson, Rob Neely, William Tuel, Larry Brenner, Jeffrey Fier, Robert Blackmore, Patrick Caffrey, Brian Maskell, Paul Tomlinson, et al. Improving the scalability of parallel jobs by adding parallel awareness to the operating system. In Supercomputing, 2003 ACM/IEEE Conference, pages 10–10. IEEE, 2003.
  • [10] E. A. León, I. Karlin, and A. T. Moody. System noise revisited: Enabling application scalability and reproducibility with SMT. In 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 596–607, May 2016.
  • [11] Stefano Markidis, Juris Vencels, Ivy Bo Peng, Dana Akhmetova, Erwin Laure, and Pierre Henri. Idle waves in high-performance computing. Physical Review E, 91(1):013306, 2015.
  • [12] John D. McCalpin. Memory bandwidth and machine balance in current high performance computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, pages 19–25, 1995.
  • [13] Alessandro Morari, Roberto Gioiosa, Robert W Wisniewski, Francisco J Cazorla, and Mateo Valero. A quantitative analysis of OS noise. In 2011 IEEE International Parallel & Distributed Processing Symposium, pages 852–863. IEEE, 2011.
  • [14] Ivy Bo Peng, Stefano Markidis, Erwin Laure, Gokcen Kestor, and Roberto Gioiosa. Idle period propagation in message-passing applications. In High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS), 2016 IEEE 18th International Conference on, pages 937–944. IEEE, 2016.
  • [15] Fabrizio Petrini, Darren J Kerbyson, and Scott Pakin. The case of the missing supercomputer performance: Achieving optimal performance on the 8,192 processors of ASCI Q. In Supercomputing, 2003 ACM/IEEE Conference, pages 55–55. IEEE, 2003.
  • [16] Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein. Quantifying performance bottlenecks of stencil computations using the Execution-Cache-Memory model. In Proceedings of the 29th ACM International Conference on Supercomputing, ICS ’15, New York, NY, USA, 2015. ACM.
  • [17] Paul Terry, Amar Shan, and Pentti Huttunen. Improving application performance on HPC systems with process synchronization. Linux Journal, (127):68–71, 2004.
  • [18] Dan Tsafrir, Yoav Etsion, Dror G Feitelson, and Scott Kirkpatrick. System noise, OS clock ticks, and fine-grained parallel applications. In Proceedings of the 19th annual international conference on Supercomputing, pages 303–312. ACM, 2005.
  • [19] Samuel Williams, Andrew Waterman, and David Patterson. Roofline: An insightful visual performance model for multicore architectures. Commun. ACM, 52(4):65–76, 2009.
  • [20] Markus Wittmann, Georg Hager, Thomas Zeiser, Jan Treibig, and Gerhard Wellein. Chip-level and multi-node analysis of energy-optimized lattice Boltzmann CFD simulations. Concurrency and Computation: Practice and Experience, 28(7):2295–2315, 2016.