OS-level Failure Injection with SystemTap

by Camille Coti, et al.

Failure injection is an important tool for experimenting with robust, resilient distributed systems. In order to reproduce real-life conditions, parts of the application must be killed without letting the operating system close the existing network connections in a "clean" way. When a process is simply killed, the OS closes them. SystemTap is an infrastructure that probes the Linux kernel's internal calls. If processes are killed at kernel level, they can be destroyed without letting the OS do anything else. In this paper, we present a kernel-level failure injection system based on SystemTap. We present how it can be used to implement deterministic and probabilistic failure scenarios.




I Introduction

Failure injection is an important tool for experimenting with robust, resilient distributed systems. People who develop fault-tolerant applications must be able to test their software in faulty conditions, i.e., with realistic failures. For instance, MPICH-V [2], a fault-tolerant implementation of the MPI standard, aims at providing an execution environment that can survive failures. It was tested using a failure-injection system, and a specific fault scenario exhibited a rare bug in a precise situation of two consecutive failures [10]. This tool helped the developers of MPICH-V find this bug and, more importantly, fix it.

In a similar way, failure detectors need to be stressed and tested in real-life situations [1]. Traditional failure injection tools kill processes. However, when a process is killed, the operating system of the machine closes all the network and system sockets in a “clean” way. For instance, all the TCP connections are closed according to the TCP protocol. This is not a realistic situation. In real life, most failures follow the fail-stop semantics: when a component fails, it simply stops doing anything [14, 15]. The crashed component can be a process or a communication channel. The cause of this failure can be an electrical failure, a cable cut, a scheduling bug, etc. In any case, the failed component behaves normally and then halts: no warning is issued before the crash happens. As a consequence, the operating system cannot close the sockets or perform any other cleanup.
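The "clean" closing performed by the operating system can be observed from user space. The sketch below is only an illustration and is not part of FIST: it assumes a POSIX system, uses a local AF_UNIX socket pair as a stand-in for a TCP connection, and the function name is hypothetical. A child process holding one end of the socket is killed with SIGKILL, and the surviving end still reads a clean end-of-stream, because the kernel closes the victim's descriptors on its behalf.

```c
#include <signal.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the surviving peer observes a clean end-of-stream after its
 * partner is killed with SIGKILL, 0 if it does not, -1 on setup error. */
int peer_sees_eof_after_sigkill(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    pid_t child = fork();
    if (child < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (child == 0) {
        /* Child: keep sv[1] open and wait until it is killed. */
        close(sv[0]);
        for (;;)
            pause();
    }

    /* Parent: keep only its own end, then kill the child abruptly. */
    close(sv[1]);
    kill(child, SIGKILL);
    waitpid(child, NULL, 0);

    /* The kernel has closed the child's descriptors: read() returns 0. */
    char byte;
    ssize_t n = read(sv[0], &byte, 1);
    close(sv[0]);
    return n == 0 ? 1 : 0;
}
```

After a real fail-stop failure (power loss, cable cut), no such notification arrives and the surviving end typically has to rely on a timeout, which is the situation a failure detector must handle.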

In this paper, we present a method to inject failures in applications at kernel-level; therefore, the operating system cannot interfere with the application and the sockets are not closed. Our approach is based on two Linux tools: SystemTap and control groups (cgroups). We also explain how this approach can be extended to other crash semantics, such as replication and omission of network packets.

The remainder of this paper is organized as follows. In section II, we present and analyze existing failure injection systems, and we compare them with our approach. In section III, we present our approach for Failure Injection with SystemTap (FIST), and how it can be used to inject failures in distributed applications. In section IV, we explain how failure scenarios are implemented in practice with FIST. Last, we conclude the presentation of this approach and we mention perspectives for other kinds of failures in section V.

II Related works

Fault injection in parallel computing is trickier than in common distributed systems. Developers can use a virtual HPC infrastructure to check the fault resilience of parallel code running on top of parallelization libraries (Open MPI, OpenMP, etc.) [9]. However, the libraries used for parallel computing are closely tied to the hardware. For example, when Open MPI’s run-time daemon orted is started, it selects a network medium (Ethernet, InfiniBand, etc.) to perform communication between MPI processes. Network media like InfiniBand cannot be virtualized.

In [11], a daemon process is used on each node to apply a fault injection policy to the instrumented processes. This approach introduces additional processes on the host. Those processes interfere with the kernel scheduler in two ways:

  1. they consume resources, such as CPU cycles and scheduling quanta;

  2. more processes need to be handled by the scheduling policy.

The scheduler's behavior should not be altered by the fault injection infrastructure, because parallel computing libraries depend on it. For example, libraries like OpenMP [7] interact with the scheduler to optimize process affinity with CPU caches, etc. [3]. Moreover, if a fault injection daemon fails, the policy can no longer be applied.

III Failure injection

In common Linux-based systems, a running process is represented by a kernel structure, task_struct. This structure contains, among other fields, an integer variable pid and a pointer to another task_struct. The pid is used by the kernel to identify the process unambiguously. The pointer refers to the parent task_struct, i.e., to the parent process that performed the fork/exec sequence to spawn the child process. As a consequence, Linux process organization can be seen as a tree with the init process as its root.
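The tree view can be made concrete with a small user-space model. This is a simplified sketch, not kernel code: struct task_model is a hypothetical stand-in that keeps only the two fields discussed above, and it assumes that the root of the tree is marked by a NULL parent pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified user-space stand-in for the kernel's task_struct: only the two
 * fields discussed above. The real structure has many more members. */
struct task_model {
    int pid;
    const struct task_model *parent;   /* NULL for the root (assumption) */
};

/* Number of parent links between a task and the root of the tree. */
int depth_from_root(const struct task_model *t)
{
    assert(t != NULL);
    int depth = 0;
    while (t->parent != NULL) {
        t = t->parent;
        depth++;
    }
    return depth;
}

/* Returns 1 if `ancestor` lies on the path from `t` up to the root
 * (a task counts as its own ancestor here), 0 otherwise. */
int is_descendant_of(const struct task_model *t,
                     const struct task_model *ancestor)
{
    for (; t != NULL; t = t->parent)
        if (t == ancestor)
            return 1;
    return 0;
}
```

In this model, a policy attached to one process can only reach the tasks for which is_descendant_of() returns 1, i.e., a single subtree of the process tree.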

Control groups (cgroups) are a major evolution of this model. Without cgroups, a resource usage policy can only be defined for a targeted process and its potential children. Cgroups make it possible to add a process dynamically to one or more groups whenever a task_struct is created. Such a group is called a cgroup. A resource usage policy can then target a cgroup instead of a single process. Cgroups thus make it possible to apply the same resource usage policy to a group of processes regardless of the process tree structure.

Cgroups are commonly used together with another recent feature of the Linux kernel: namespaces. Namespaces make it possible to have different instances of some kernel objects. This provides powerful isolation capabilities to Linux kernels. For instance, processes that belong to a given cgroup can use their own instance of the IP stack. Together, these two components make it possible to run a lightweight sandbox under a strict resource limitation policy [13].

III-A SystemTap

SystemTap is a production-ready kernel auditing tool. An audit policy is written in a high-level language. The main part of the policy consists of probe definitions. A probe is an instrumentation point in the kernel (for example, a kernel function). SystemTap relies on two kinds of probes: Kprobes and Jprobes. A Kprobe can trap at almost any kernel code address and defines a handler that is executed when the address is reached. Jprobes are inserted at a kernel function’s entry, providing convenient access to its arguments.

SystemTap provides a compiler that transforms an auditing policy into a loadable kernel module. When loaded, this kernel module registers all the defined probes. Every time a probed function is reached, the defined handler code is executed. Handler code can use SystemTap’s native collection of functions or the guru mode. SystemTap’s native functions are high-level primitives implemented in tapsets (quite similar to libraries). The guru mode makes it possible to embed raw C kernel code in handler code.

III-B Process supervision and control

Process supervision consists in collecting states and metrics about a targeted process. Control consists in performing actions on a supervised process. Prior to the introduction of cgroups in the Linux kernel, “master” processes were commonly used to supervise and control child processes. This model has two main drawbacks:

  1. if the master process fails, every supervised child process also fails. To mitigate this issue, the master’s source code should be very short;

  2. a process can belong to only one supervision domain, since it has only one father.

III-C Running distributed applications on FIST

The idea behind FIST can be summarized as follows:

  • a specific control group is defined for processes that belong to the experiment; these are the processes that can be affected by intentional failures.

  • a SystemTap script inserts a hook in the kernel’s scheduler. When the scheduler is invoked to run a process, it checks whether this process belongs to the experimental cgroup. If it does, the failure injection scenario is followed to decide whether the process must be run normally or killed.
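The two points above can be summarized as a small decision function. The following is a user-space sketch of the logic only, not the SystemTap module itself; the type and function names (sched_candidate, scenario_fn, fist_on_schedule, always_kill) are hypothetical and introduced for illustration.

```c
#include <assert.h>
#include <string.h>

enum fist_action { FIST_RUN, FIST_KILL };

/* Minimal description of the task about to be scheduled (hypothetical). */
struct sched_candidate {
    const char *cgroup_path;   /* e.g. "/xp" */
    unsigned long cpu_time;    /* time consumed so far, arbitrary unit */
};

/* A failure scenario decides the fate of a task inside the experiment. */
typedef enum fist_action (*scenario_fn)(const struct sched_candidate *next);

/* Hook body: tasks outside the experimental cgroup always run untouched;
 * only tasks inside it are submitted to the failure scenario. */
enum fist_action fist_on_schedule(const struct sched_candidate *next,
                                  const char *experiment_cgroup,
                                  scenario_fn scenario)
{
    assert(next != NULL && experiment_cgroup != NULL && scenario != NULL);
    if (strcmp(next->cgroup_path, experiment_cgroup) != 0)
        return FIST_RUN;
    return scenario(next);
}

/* Example scenario: kill every task it is asked about. */
enum fist_action always_kill(const struct sched_candidate *next)
{
    (void)next;
    return FIST_KILL;
}
```

Keeping the scenario behind a function pointer mirrors the separation described above: the cgroup test selects which processes take part in the experiment, and the scenario alone decides when they fail.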

In practice, a process can be executed in a given cgroup by using the command cgexec. For instance, the command sleep can be executed in the cgroup xp on all the mounted controllers using the following command:

cgexec -g *:xp sleep 1

Distributed applications are often made of two parts: the run-time environment and the application processes themselves. In the case of MPI programs, the application is supported by a distributed run-time environment made of a set of processes (one on each machine used by the computation) that spawn, support and monitor the application processes on their machine [6, 4]. The failure detector is usually implemented in this run-time environment. As a consequence, the processes of the run-time environment are the ones that need to be executed in the experimental cgroup. For instance, Open MPI [8] lets the user specify on the command line which command must be used to start the daemons of its run-time environment (called orted). In order to execute these daemons in the cgroup xp on all the mounted controllers, the mpiexec command can receive the following option:

--mca orte_launch_agent 'cgexec -g *:xp orted'

Otherwise, if we want to run the application processes in the aforementioned control group, the parallel program can be executed by the cgexec command:

mpiexec -n 5 cgexec -g *:xp mpiprogram
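Whether a process actually landed in the intended group can be checked from user space: on Linux, each line of /proc/<pid>/cgroup has the form hierarchy-ID:controller-list:path. The following is a minimal sketch of a parser for one such line; the helper names are hypothetical, and reading the file itself (for instance with fgets) is left out.

```c
#include <assert.h>
#include <string.h>

/* Returns a pointer to the cgroup path inside one line of /proc/<pid>/cgroup,
 * or NULL if the line does not have three colon-separated fields. */
const char *cgroup_line_path(const char *line)
{
    assert(line != NULL);
    const char *first = strchr(line, ':');
    if (first == NULL)
        return NULL;
    const char *second = strchr(first + 1, ':');
    if (second == NULL)
        return NULL;
    return second + 1;
}

/* Returns 1 if the line places the process in `wanted` (e.g. "/xp").
 * A trailing newline, as left by fgets(), is ignored. */
int cgroup_line_matches(const char *line, const char *wanted)
{
    assert(wanted != NULL);
    const char *path = cgroup_line_path(line);
    if (path == NULL)
        return 0;
    size_t len = strcspn(path, "\n");
    return len == strlen(wanted) && strncmp(path, wanted, len) == 0;
}
```

For a process started with the cgexec commands above, a line such as 4:cpu,cpuacct:/xp would match "/xp"; the hierarchy number and controller list shown here are only an example and vary between machines.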

IV Implementing failure scenarios with SystemTap

In this section, we present how failure scenarios can be implemented using SystemTap. We give two examples: a deterministic scenario (section IV-A) and a probabilistic one (section IV-B).

A process can be killed at kernel level by canceling it (and freeing it) from the scheduler. When the kernel is about to schedule a process, it calls the context_switch function. As explained in section III-A, a Jprobe can be inserted at the entry of this function. Then, we can see which control group the process belongs to (see section III) and, based on the failure scenario, decide whether or not to kill the process.

In figure 1 we give an example of a C function that can be compiled by SystemTap to find out which cgroup a process belongs to. Based on the unique pid of the process, the task_cgroup function gets the cgroup of this given task.

function find_cgroup:string(task:long) %{
  struct cgroup *cgrp;
  struct task_struct *tsk = (struct task_struct *)((long)THIS->task);

  /* Initialize with an empty value */
  strcpy(THIS->__retvalue, "NULL");

  cgrp = task_cgroup(tsk, cpu_cgroup_subsys_id);
  if (cgrp)
    cgroup_path(cgrp, THIS->__retvalue, MAXSTRINGLEN);
%}
Fig. 1: Finding out which cgroup a process belongs to.

SystemTap modules can call functions defined in the kernel. Hence, the free_task function can be called to cancel and free a process at the moment when it is about to be scheduled. We give an example of a piece of code that kills a process by calling this function from a SystemTap module in figure 2. Therefore, the process is canceled by the scheduler, but the operating system cannot do anything else. The I/O structures (e.g., network sockets) remain open, like with any real-life failure.

function skip_task(task:long) %{
  struct task_struct *tsk = (struct task_struct *)((long)THIS->task);
  free_task( tsk );
%}
Fig. 2: Killing a process by freeing the corresponding task in the kernel’s scheduler.

IV-A Deterministic failure scenarios

We can imagine the following deterministic failure scenario: after a certain time TIMEOUT, the process is killed. Hence, whenever the process is about to be scheduled, the SystemTap module needs to determine for how long it has been running. This information can be obtained from a field of the task_struct data structure used by the kernel (see section III for more information). On recent versions of the Linux kernel, a tapset function provides this information.

Figure 3 gives an excerpt of what can be used by SystemTap. The two functions task_stime_ and task_utime_ return respectively the system time and the user time spent by the process, obtained from the internal task_struct data structure. When the context_switch function is called, the module finds out which cgroup the process to be scheduled belongs to. If it belongs to the xp cgroup, the process is concerned by failure injection. The module finds out for how long the process has been running. If it has been running for longer than TIMEOUT, the process is killed.

function task_stime_:long(task:long){
  if (task != 0)
    return @cast(task, "task_struct", "kernel<linux/sched.h>")->stime;
  return 0;
}

function task_utime_:long(task:long){
  if (task != 0)
    return @cast(task, "task_struct", "kernel<linux/sched.h>")->utime;
  return 0;
}

probe kernel.function("context_switch") {
  cgroup = find_cgroup($next)
  if ( cgroup == "/xp" ){
    /* CPU time consumed so far: system time plus user time */
    t = task_stime_( $next ) + task_utime_( $next )
    if( t > TIMEOUT ) {
      skip_task( $next )
    }
  }
}
Fig. 3: Deterministic failure injection scenario: after a certain time TIMEOUT, the process is killed.

IV-B Probabilistic failure scenarios

We can imagine the following probabilistic failure scenario: every time a process is scheduled, it has one chance out of two (likelihood of 50%) of being killed. Hence, whenever the process is about to be scheduled, the SystemTap module generates a random integer in {0, 1}, and if this number is greater than or equal to 1, then the process is killed.

Figure 4 gives an excerpt of what can be used by SystemTap. When the context_switch function is called, the module finds out which cgroup the process to be scheduled belongs to. If it belongs to the xp cgroup, SystemTap generates a random integer in {0, 1} using the function randint. If this integer is greater than or equal to 1, the process is killed.

probe kernel.function("context_switch") {
    cgroup = find_cgroup($next)
    if ( cgroup == "/xp" ){
        /* randint(2) yields 0 or 1 */
        r = randint( 2 )
        if( r >= 1 ) {
            skip_task( $next )
        }
    }
}
Fig. 4: Probabilistic failure injection scenario: every time it is about to be scheduled, the process has a certain probability to be killed.
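The same pattern extends to other probabilities by changing the range and the threshold. The sketch below only works out the arithmetic; it assumes that randint(n) draws uniformly from the n integers 0 to n-1, and the helper names are hypothetical.

```c
#include <assert.h>

/* Counts how many of the n equally likely outcomes 0..n-1 satisfy the
 * kill condition r >= threshold used in the probe. */
int kill_outcomes(int n, int threshold)
{
    assert(n > 0);
    int kills = 0;
    for (int r = 0; r < n; r++)
        if (r >= threshold)
            kills++;
    return kills;
}

/* Kill probability as a percentage (integer division, rounded down). */
int kill_percent(int n, int threshold)
{
    return 100 * kill_outcomes(n, threshold) / n;
}
```

With n = 2 and a threshold of 1, one outcome out of two kills the process, which gives the 50% of the scenario above; with n = 10 and a threshold of 9, the process would be killed in one scheduling decision out of ten. Note that the draw is repeated at every scheduling decision, so the probability that a process survives k decisions decreases geometrically with k.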

V Conclusion and perspective

In this paper, we have presented FIST, a failure-injection technique that takes advantage of recent Linux kernel features, namely, cgroups and SystemTap. This technique injects highly realistic failures, taking advantage of the fact that it kills processes directly in the kernel’s task scheduler, making them die without notice. We have presented a method to use it to implement deterministic and probabilistic failure scenarios.

V-A Limitations and how they can be overcome

The major limitation of this approach is that the SystemTap module must, of course, be loaded with super-user privileges. This can be overcome by two approaches. The most generic one is to use sudo. The administrator of the machine can allow it for the modprobe command only. As a consequence, the only thing that users would be able to do is to insert their modules into the kernel.

The other way is to work on an experimental testbed that provides deployment software such as Kadeploy [12], for instance the Grid’5000 platform [5]. Kadeploy allows users to deploy their own system image on the nodes they have reserved and therefore to have (temporarily) their own root access on these nodes.

V-B Using this method to implement other failure models

In this work, we have focused on fail-stop failures. However, the approach can easily be extended to other semantics. For instance, probes can be added in the network stack. Messages can be dropped to inject omission failures, or sent twice to inject duplication.
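The effect of these two failure models on a stream of messages can be sketched in user space as follows. This is only a model of the semantics, not a network-stack probe; the names net_fault, copies_delivered and apply_faults are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum net_fault { NET_NONE, NET_OMISSION, NET_DUPLICATION };

/* Number of copies of a message that reach the receiver under a fault. */
int copies_delivered(enum net_fault fault)
{
    switch (fault) {
    case NET_OMISSION:    return 0;   /* message dropped */
    case NET_DUPLICATION: return 2;   /* message sent twice */
    default:              return 1;   /* delivered normally */
    }
}

/* Applies plan[i] to msgs[i] and writes what the receiver sees to out.
 * Returns the number of messages written; stops early if out is full. */
size_t apply_faults(const int *msgs, const enum net_fault *plan, size_t n,
                    int *out, size_t out_cap)
{
    assert(msgs != NULL && plan != NULL && out != NULL);
    size_t written = 0;
    for (size_t i = 0; i < n; i++) {
        int copies = copies_delivered(plan[i]);
        for (int c = 0; c < copies; c++) {
            if (written == out_cap)
                return written;
            out[written++] = msgs[i];
        }
    }
    return written;
}
```

For example, sending messages 1, 2, 3 with the plan (none, omission, duplication) makes the receiver see 1, 3, 3.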


  • [1] Marcos Kawazoe Aguilera, Wei Chen, and Sam Toueg. Heartbeat: A timeout-free failure detector for quiescent reliable communication. In Marios Mavronicolas and Philippas Tsigas, editors, Proceedings of the 11th Workshop on Distributed Algorithms (WDAG’97), volume 1320 of Lecture Notes in Computer Science, pages 126–140. Springer, 1997.
  • [2] George Bosilca, Aurélien Bouteiller, Franck Cappello, Samir Djilali, Gilles Fédak, Cécile Germain, Thomas Hérault, Pierre Lemarinier, Oleg Lodygensky, Frédéric Magniette, Vincent Néri, and Anton Selikhov. MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes. In High Performance Networking and Computing (SC2002), Baltimore USA, November 2002. IEEE/ACM.
  • [3] François Broquedis, Jérôme Clet-Ortega, Stéphanie Moreaud, Nathalie Furmento, Brice Goglin, Guillaume Mercier, Samuel Thibault, and Raymond Namyst. hwloc: a Generic Framework for Managing Hardware Affinities in HPC Applications. In Proceedings of the 18th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP2010), pages 180–186, Pisa, Italia, February 2010. IEEE Computer Society Press.
  • [4] Ralph M. Butler, William D. Gropp, and Ewing L. Lusk. A scalable process-management environment for parallel programs. In Jack Dongarra, Péter Kacsuk, and Norbert Podhorszki, editors, Recent Advances in Parallel Virtual Machine and Message Passing Interface, 7th European PVM/MPI Users’ Group Meeting (EuroPVM/MPI’02), volume 1908, pages 168–175. Springer, 2000.
  • [5] Franck Cappello, Eddy Caron, Michel Dayde, Frederic Desprez, Yvon Jegou, Pascale Vicat-Blanc Primet, Emmanuel Jeannot, Stephane Lanteri, Julien Leduc, Nouredine Melab, Guillaume Mornet, Benjamin Quetier, and Olivier Richard. Grid’5000: A large scale and highly reconfigurable grid experimental testbed. In Proceedings of the 6th IEEE/ACM International Workshop on Grid Computing CD (SC—05), pages 99–106, Seattle, Washington, USA, November 2005. IEEE/ACM.
  • [6] Ralph H. Castain, Timothy S. Woodall, David J. Daniel, Jeffrey M. Squyres, Brian Barrett, and Graham E. Fagg. The open run-time environment (openRTE): A transparent multi-cluster environment for high-performance computing. In Beniamino Di Martino, Dieter Kranzlmüller, and Jack Dongarra, editors, Recent Advances in Parallel Virtual Machine and Message Passing Interface, 12th European PVM/MPI Users’ Group Meeting, volume 3666 of Lecture Notes in Computer Science, pages 225–232, Sorrento, Italy, September 2005. Springer.
  • [7] Barbara Chapman, Gabriele Jost, and Ruud Van Der Pas. Using OpenMP: portable shared memory parallel programming, volume 10. MIT press, 2008.
  • [8] Edgar Gabriel, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Dongarra, Jeffrey M. Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett, Andrew Lumsdaine, Ralph H. Castain, David J. Daniel, Richard L. Graham, and Timothy S. Woodall. Open MPI: Goals, concept, and design of a next generation MPI implementation. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, 11th European PVM/MPI Users’ Group Meeting (EuroPVM/MPI’04), pages 97–104, Budapest, Hungary, September 2004.
  • [9] Thomas Hérault, Thomas Largillier, Sylvain Peyronnet, Benjamin Quétier, Franck Cappello, and Mathieu Jan. High accuracy failure injection in parallel and distributed systems using virtualization. In Conf. Computing Frontiers, pages 193–196, 2009.
  • [10] William Hoarau, Pierre Lemarinier, Thomas Hérault, Eric Rodriguez, Sébastien Tixeuil, and Franck Cappello. FAIL-MPI: How Fault-Tolerant Is Fault-Tolerant MPI? In CLUSTER, 2006.
  • [11] William Hoarau and Sébastien Tixeuil. A language-driven tool for fault injection in distributed systems. In GRID, pages 194–201, 2005.
  • [12] Emmanuel Jeanvoine, Luc Sarzyniec, and Lucas Nussbaum. Kadeploy3: Efficient and Scalable Operating System Provisioning for Clusters. USENIX ;login:, 38(1):38–44, February 2013.
  • [13] Dirk Merkel. Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239):2, 2014.
  • [14] David Powell. Failure mode assumptions and assumption coverage. In FTCS, pages 386–395, 1992.
  • [15] Richard D. Schlichting and Fred B. Schneider. Fail stop processors: An approach to designing fault-tolerant computing systems. ACM Transactions on Computer Systems, 1:222–238, 1983.