Failures and Fixes: A Study of Software System Incident Response

08/25/2020
by   Jonathan Sillito, et al.

This paper presents the results of a research study of software system failures, with the goal of understanding how we might better evolve, maintain, and support software systems in production. We qualitatively analyzed thirty incidents: fifteen collected through in-depth interviews with engineers, and fifteen sampled from publicly published incident reports (generally produced as part of postmortem reviews). Our analysis focused on understanding and categorizing how failures occurred and how they were detected, investigated, and mitigated. We also captured analytic insights about the current state of the practice and its associated challenges in the form of 11 key observations. For example, we observed that failures can cascade through a system, leading to major outages, and that engineers often do not understand the scaling limits of the systems they support until those limits are exceeded. We argue that addressing the challenges we have identified can lead to improvements in how systems are engineered and supported.

research
08/30/2017

An Exploratory Study of Field Failures

Field failures, that is, failures caused by faults that escape the testi...
research
03/10/2021

Practitioners Testimonials about Software Testing

As software systems are becoming more pervasive, they are also becoming ...
research
06/27/2022

Reflecting on Recurring Failures in IoT Development

As IoT systems are given more responsibility and autonomy, they offer gr...
research
08/25/2022

PREVENT: An Unsupervised Approach to Predict Software Failures in Production

This paper presents PREVENT, an approach for predicting and localizing f...
research
03/14/2019

What Makes Research Software Sustainable? An Interview Study With Research Software Engineers

Software is now a vital scientific instrument, providing the tools for d...
research
07/15/2022

Towards Understanding Confusion and Affective States Under Communication Failures in Voice-Based Human-Machine Interaction

We present a series of two studies conducted to understand user's affect...
research
04/26/2023

Systems Modeling for novice engineers to comprehend software products better

One of the key challenges for a novice engineer in a product company is ...
