On Representing and Eliciting Resilience Requirements of Microservice Architecture Systems

by Kanglin Yin, et al.

Together with the spread of DevOps practices and container technologies, Microservice Architecture has become a mainstream architecture style in recent years. Resilience is a key characteristic of Microservice Architecture Systems (MSA Systems), and it denotes the ability to cope with various kinds of system disturbances which cause degradations of services. However, because there is no consensus definition of resilience in the software field, and although much work has been done on resilience for MSA Systems, developers still do not have a clear idea of how resilient an MSA System should be, and what resilience mechanisms are needed. In this paper, by referring to existing systematic studies on resilience in other scientific areas, the definition of microservice resilience is provided and a Microservice Resilience Measurement Model is proposed to measure service resilience. We also give a requirement model to represent resilience requirements of MSA Systems, and propose a process framework to elicit MSA System resilience requirements. As a proof of concept, a case study is conducted on an MSA System to illustrate how the resilience requirements are elicited and represented.




I Introduction

Microservice Architecture (aka Microservices)[1] is a new architectural style which modularizes software components as services, called microservices, and makes services and service development teams as independent of each other as possible. In recent years, Microservice Architecture has become a mainstream architecture style adopted by many leading internet companies[2][3]. Shifting to Microservice Architecture promises fast time-to-market for individual services, and enables modern development processes like Continuous Delivery[4] and DevOps[5]. Besides delivery speed, the Microservice Architecture also greatly improves scalability, flexibility in resource allocation, code reuse and other features.

Although the Microservice Architecture has many benefits, Microservice Architecture Systems (MSA Systems) are more fragile than traditional monolithic systems[6], because Microservices are usually deployed in a more sophisticated environment using virtual infrastructures like virtual machines and containers, and there are lots of components for service decoupling and management (e.g. API Gateway, Message Queue, Service Registry, etc.). Threats come from anywhere in an MSA System: small-density components with faults [7], unstable message passing among microservices[8], the underlying cloud environment with unreliable containers, virtual machines, or bare-metal servers[9]. Even normal actions taken in cloud environments like software/hardware upgrades and dynamic changes in configuration files may lead to severe service outages, which are lessons learned from hundreds of service outage reports of cloud companies in the literature[10].

Reliability, availability and fault tolerance are traditional metrics for evaluating a software system's ability to cope with failures [11]. These metrics assume that a software system has two states, such as "Available/Unavailable" or "Reliable/Unreliable", and calculate the probability of these two states. However, recent studies on cloud system failures [12][10][13] found that cloud systems are more likely to be in a "limped" state than to be totally unavailable. The "limped" state means that although a cloud system still provides services with normal functionality, the services perform below the level that satisfies users, which is known as service degradation. In such a situation, metrics like reliability or availability cannot evaluate the software system well. For example, in situation A, the average response time of an MSA System's service is degraded from 3 seconds to 5 seconds due to a failure, while in situation B, the average response time of the service is degraded from 3 seconds to 12 seconds due to the same failure. The failure recovery time is the same in both situations. It is obvious that the service in situation A performs better than the service in situation B when the failure happens. But if we take "response time higher than 3 seconds" as the unreliable state, the service reliability metrics in these two situations are the same.
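The comparison of situations A and B can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the one-sample-per-second sampling, the 60-second failure duration, and both function names are hypothetical and do not come from the paper. It shows that a two-state metric reports the same value for both situations, while a measure of how far the response time exceeds 3 seconds tells them apart.

```python
def unreliable_seconds(samples, threshold):
    """Seconds spent in the 'unreliable' state (one sample per second)."""
    return sum(1 for value in samples if value > threshold)


def excess_response_time(samples, threshold):
    """Cumulative response time above the threshold over all samples."""
    return sum(max(0, value - threshold) for value in samples)


THRESHOLD = 3  # seconds, the normal average response time in the example

# Hypothetical data: the same failure lasts 60 seconds in both situations.
situation_a = [5] * 60   # response time degraded from 3 s to 5 s
situation_b = [12] * 60  # response time degraded from 3 s to 12 s

unreliable_a = unreliable_seconds(situation_a, THRESHOLD)
unreliable_b = unreliable_seconds(situation_b, THRESHOLD)
excess_a = excess_response_time(situation_a, THRESHOLD)
excess_b = excess_response_time(situation_b, THRESHOLD)

print(unreliable_a, unreliable_b)  # identical under the two-state view
print(excess_a, excess_b)          # different once the magnitude is measured
```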

As a result, many practitioners of the Microservice Architecture [14][15][16] proposed Resilience as a characteristic describing how an MSA System copes with the occurrence of failures and recovers the degraded service back to its normal performance. Existing reliability and fault tolerance mechanisms used in Service-Oriented Architectures and cloud platforms, like Circuit Breakers and Bulkheads, are used as resilience mechanisms in MSA Systems[8][15].

Although engineering resilience in MSA Systems has gained popularity among designers and engineers, no consensus has yet been reached on how engineering resilience can be designed and improved. Until now, there has been no common definition of microservice resilience. And although several works have been done on software resilience benchmarking [17] [18], available engineering quantification metrics still exhibit very little standardization.

Because there are no standard definitions and quantification metrics for microservice resilience, it is hard to state definite resilience requirements for MSA Systems. As a result, microservice developers seldom have clear answers to the following questions about resilience, which may lead to development failures according to the theory of requirement engineering[19].

  1. What is microservice resilience?

  2. How to evaluate microservice resilience?

  3. How resilient should an MSA System be?

  4. How to set goals for resilience mechanisms?

Furthermore, how to represent resilience requirements for MSA Systems is another problem to face even if standard definitions and quantification metrics of microservice resilience are given. Although Domain Driven Design (DDD) [20] is the suggested way to build MSA System requirements [14], integrating resilience into DDD seems to be difficult. DDD focuses on how to decompose a system into microservices by business context boundaries, so introducing resilience notions like service degradations and failures may complicate the partition of context boundaries. Another type of requirement model is needed for microservice resilience requirements.

In order to solve the problems above, this paper works on the representation and elicitation of MSA System resilience requirements. The contributions of this paper are:

  • By referring to systematic studies on definitions and measurements of resilience in other scientific areas, we provide the definition of microservice resilience, and propose a Microservice Resilience Measurement Model (MRMM) to measure service resilience of MSA Systems.

  • Based on MRMM, a requirement model is designed to represent resilience requirements of MSA Systems. The requirement model contains a Resilience Goal Decomposition View refining service resilience goals to system behaviors with a customized goal model, and a Resilience Mechanism Implementation View to show how resilience mechanisms work in MSA Systems.

  • A process framework to elicit resilience requirements of MSA Systems is proposed. The process framework outlines steps to elicit our resilience requirement model for MSA Systems, which follows the methodology of Goal-Oriented Requirement Engineering.

The remainder of this paper is organized as follows: Section II summarises related works; Section III provides the definition of microservice resilience and proposes the Microservice Resilience Measurement Model; Section IV proposes the service resilience requirement model based on the definition and measurement model in Section III; Section V describes the process framework to elicit MSA System resilience requirements; in Section VI a case study is conducted using an MSA System to illustrate the whole resilience requirement elicitation process; and Section VII concludes this paper and outlines some future works.

II Related Works

This section discusses existing related studies in these three research areas: Resilience in other scientific areas, Resilience in microservices, and Goal-Oriented Requirement Engineering.

II-A Resilience in Other Scientific Areas

The word "resilience", which originates from the Latin verb "resilire", means an object's ability to bounce back to its normal size after being pulled or pushed. Holling [21] first used this word in the field of ecology, to represent an ecosystem's ability to absorb disturbances. In recent decades, the notion of resilience has been used in many scientific areas, like psychology, city planning, management science, etc. As is described in Righi's review of resilience engineering[22], the ability of anticipating/being aware of hazards, the capacity of adapting to variability, and the ability of responding/restoring are major concerns of resilience.

In Hosseini's literature review on system resilience[23], resilience assessment is classified into two categories: qualitative approaches and quantitative approaches. Qualitative approaches give conceptual frameworks of practices to achieve resilience [24], and are usually used in Society-Ecology, Organization Management, and Healthcare. Quantitative approaches use resilience curves to illustrate the resilient behavior of an engineered system undergoing a disruptive event. Many researchers have used the properties of the resilience curve to measure the resilience of the system.

Figure 1 shows Bruneau's resilience triangle model [25] on a resilience curve, which is the most used resilience model in quantitative assessment. In Figure 1, the x-axis stands for time and the y-axis stands for the quality of a system. Bruneau's model proposes three dimensions of Resilience: Robustness, Rapidity and Resilience. Robustness and Rapidity are measured along the y-axis and the x-axis of Figure 1 respectively, while Resilience measures the shaded area in Figure 1. A great number of studies on resilience quantification have also proposed metrics on these three dimensions [26][27] [23].

Fig. 1: The Resilience Triangle Model Proposed in [25]

II-B Resilience in Microservices

The importance of resilience has been pointed out in many practitioner books on microservices [14][15][16], and in studies discussing key features of MSA Systems [6][28], which has made it widely accepted that resilience is a key characteristic of MSA Systems. In these books and studies, some typical resilience mechanisms like Circuit Breakers and Bulkheads [8] are mentioned.

In recent years there have been several resilience-related works on MSA Systems. Richter et al. showed that the Microservice Architecture itself can have positive impacts on dependability and fault-tolerance[29]. Nane analyzed the impact of containers on microservice performance [30]. Giovanni et al. proposed a self-managing mechanism for MSA Systems[31], where auto-scaling and health management of services are implemented by monitoring. Soenen et al. designed a scalable mechanism for microservice-based NFV Systems for fault recovery and high availability [32]. Zwietasch used a Time-Series Prediction method to predict failures in MSA Systems[33]. Stefan et al. designed a decision guidance model for service discovery and fault tolerance of microservices, where some faults in an MSA System are related to certain designs in the system [34]. Heorhiadi et al. designed a resilience testing framework for the Microservice Architecture [35], but how to inject faults into microservices was not discussed in detail. Brogi proposed a reference dataset generation framework for microservices which includes failure data[36]. Thomas and Andre built a meta-model for MSA Systems, which is used for performance and resilience benchmarking[37], but how this model is used for benchmarking and how resilience is evaluated were not discussed.

II-C Goal-Oriented Requirement Engineering

Goal-oriented Requirement Engineering (GORE) is a branch of requirement engineering. GORE is used in earlier stages of requirements analysis to elicit the requirements from high-level system goals[38], while object-oriented analysis like UML[39] fits well to the later stages. KAOS [40][41], i* [42], GBRAM [43] and the NFR Framework [44] are the main goal modeling frameworks used in GORE[45]. The main concerns of recent papers on GORE are implementations, integrations, extensions and evaluations of the main goal models, which are summarized by the literature review of Jennifer et al. [46].

Few works have been done on GORE for microservice-related systems. Wang et al. introduced service discovery in GORE, which helps the requirement decomposition of SOA systems[47]. Zardari and Bahsoon discussed how GORE is adapted to clouds[48].

On GORE with failures or performance, Rashid et al. extended the goal-oriented model to an aspect-oriented model, which separates the functional and non-functional properties of a requirement [49]. Robert and Axel presented an approach to use existing refinements as refinement patterns for goal refinement [50], which proved the validity of refinement in the studies above. Van and Letier integrated the notion of obstacle into their KAOS model in [51]. Axel then presented formal techniques for reasoning about obstacles to the satisfaction of goals [52], mainly focusing on finding and resolving obstacles to functional goals that are too ideal. Van Lamsweerde explicitly modeled the goals of an attacker, an agent that creates security risks for the system goals in KAOS[53]. John, Lawrence and Brian proposed a goal-oriented framework for Non-Functional Requirements [44], in which a series of refinement methods were designed for accuracy and performance requirements. Fatama et al. proposed a three-layered goal model and conducted multi-objective risk analysis on the goal model[54].

Similar to the studies above, we integrate notions of microservice resilience, such as disruptions and system resources, into a customized goal model to represent resilience requirements of MSA Systems.

III Microservice Resilience

The definition of microservice resilience and a Microservice Resilience Measurement Model are proposed in this section, as the basis of resilience requirement representation for MSA Systems.

III-A Definition of Microservice Resilience

By summing up existing viewpoints on resilience in other scientific areas, and associating these viewpoints with what is required to be achieved in MSA Systems, we have come to the following conclusions on MSA System resilience:

  • In the running environments of MSA Systems, there are many unpredictable events that make services of MSA Systems perform worse than expected. These events are termed "disruptions" in the field of resilience[23] [22].

  • It is hard to obtain the probabilities of disruptions because the architecture and deployment environment of MSA Systems always change with the quick iteration of DevOps. Considering the high fault density in MSA Systems[6], and assumptions on disruptions in other scientific areas [23] [27], it can be assumed that disruptions are inevitable and always happen in MSA Systems.

  • Service performance is the main concern of MSA System Resilience. Disruptions in MSA Systems cause losses of service performance, which are called service degradations. The curve representing how a service's performance varies over time under a service degradation is a typical "resilience curve" in resilience research[27].

  • Resilient MSA Systems should keep the performance of degraded services from falling to a level that is unacceptable to users, and bring degraded services' performance back to normal as fast as possible.

Based on the above conclusions, the definition of resilience for MSA Systems is provided as follows:

Resilience of a Microservice Architecture System is the ability to maintain the performance of services at an acceptable level and recover the service back to normal, when a disruption causes the service degradation.

III-B Microservice Resilience Measurement Model

To quantify service resilience in MSA Systems, we propose the Microservice Resilience Measurement Model (MRMM). MRMM quantifies resilience by measuring service degradations caused by disruptions. Resilience metrics in MRMM are later used for service resilience goal setting in resilience requirements.

Figure 2 shows the meta-model of MRMM. Below we provide the definitions for elements (e.g. Service Resilience, Disruption Tolerance) in MRMM. Mathematical representations of these elements are also given, in order to encode resilience requirements as formal propositions and verify the satisfiability of resilience requirements in our future work.

Fig. 2: Meta-Model of MRMM

Definition 1 (MSA System and Service)

An MSA System is a software system that provides services. Every service in an MSA System is an interface exposed to users or other systems. Users and other systems can fulfil certain functionalities by accessing services. Factors like how services are modularized and deployed are not included in our definition of MSA Systems, because they are out of the scope of our definition of microservice resilience.

In mathematics, an MSA System is represented by a set $MS = \{S_1, S_2, \dots, S_n\}$, where $S_1, S_2, \dots, S_n$ are the services provided by the MSA System. Each service is represented by a tuple $S = \langle l_S, A \rangle$, where:

  • $l_S$ is the label of the service, which is used for verification in Goal Models[55].

  • $A$ is the set of performance attributes of the service. Performance attributes are defined in the next paragraph.
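Definition 1 can be mirrored by a small data structure. The sketch below is only one possible encoding: the class names, field names and the sample services are hypothetical, chosen for illustration, and deployment details are deliberately left out, as in the definition.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Service:
    """A service: a label plus its set of performance attributes."""
    label: str
    attributes: FrozenSet[str]


@dataclass(frozen=True)
class MSASystem:
    """An MSA System, modeled as a set of services."""
    services: FrozenSet[Service]

    def find(self, label: str) -> Service:
        """Return the service with the given label, or raise KeyError."""
        for service in self.services:
            if service.label == label:
                return service
        raise KeyError(label)


# Hypothetical example system with two services.
order = Service("order", frozenset({"response_time", "success_rate"}))
stream = Service("video_stream", frozenset({"throughput"}))
system = MSASystem(frozenset({order, stream}))

print(sorted(system.find("order").attributes))
```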

Definition 2 (Service Performance)

Each service in an MSA System has one or several performance attributes (e.g. response time, throughput) to evaluate service performance. The performance attributes of a service are determined by the type of the service. For example, availability and success rate are common performance attributes of transactional services, while video stream services are usually benchmarked by throughput. Selection of performance attributes is discussed in detail in our proposed resilience requirement process framework in Section V.

Service performance is the measured value of a service's performance attribute in a certain time period. Service performance can be represented by the function $P(S, a, t)$, where $S$ is the service, $a$ is the performance attribute, and $t$ is the timestamp.

In well-developed MSA Systems, real-time service performance data can be collected by monitoring tools like cAdvisor and Zabbix. These data are stored in time-series databases so that the performance value of a service at a timestamp can be queried in the form of $P(S, a, t)$.

Definition 3 (Service Performance Benchmark)

A service performance benchmark is the baseline value of service performance. Service performance benchmarks are used to judge whether services are degraded. If the service performance value is lower than its service performance benchmark at some time, the corresponding service is regarded as degraded.

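As with the mathematical representation of service performance, the service performance benchmark is represented by the function $P_B(S, a, t)$, meaning the baseline value of service $S$'s performance attribute $a$ at time $t$. In mathematics, if the predicate $Degraded(S, a, t)$ means that service $S$ is degraded on attribute $a$ at time $t$, and $P(S, a, t)$ is the measured service performance, then $Degraded(S, a, t)$ can be defined by the following propositional formula, where the symbol $\leftrightarrow$ means the equivalence relationship in propositional logic.

$$Degraded(S, a, t) \leftrightarrow P(S, a, t) < P_B(S, a, t) \tag{1}$$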

Depending on the types of services and performance attributes, a service performance benchmark may be either a constant value (e.g. the benchmark of service response time is usually fixed), or a dynamic value varying over time. How to set service performance benchmarks is discussed in Section V.
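The degradation check of Definition 3 can be sketched as follows, covering both a constant and a time-varying benchmark. All names and sample values are hypothetical. The sketch follows the text's convention that a value lower than the benchmark means degraded; attributes where smaller is better would first have to be oriented accordingly (for example by negation), which is an assumption of this sketch and not something the definition spells out.

```python
from typing import Callable, Dict, List

Series = Dict[int, float]           # timestamp -> measured performance value
Benchmark = Callable[[int], float]  # timestamp -> baseline value


def constant_benchmark(value: float) -> Benchmark:
    """A benchmark that does not change over time."""
    return lambda t: value


def is_degraded(performance: Series, benchmark: Benchmark, t: int) -> bool:
    """True if the measured value at time t is below the baseline at time t."""
    return performance[t] < benchmark(t)


def degraded_timestamps(performance: Series, benchmark: Benchmark) -> List[int]:
    """All timestamps at which the service counts as degraded."""
    return [t for t in sorted(performance) if is_degraded(performance, benchmark, t)]


# Hypothetical TPS samples of one service.
tps = {0: 52.0, 1: 40.0, 2: 35.0, 3: 50.0}

fixed = constant_benchmark(50.0)


def dynamic(t: int) -> float:
    """Hypothetical dynamic benchmark: a lower baseline from t = 2 onwards."""
    return 50.0 if t < 2 else 30.0


print(degraded_timestamps(tps, fixed))
print(degraded_timestamps(tps, dynamic))
```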

Definition 4 (Disruption)

A disruption in an MSA System is an event that happens to the MSA System which makes a service degraded. A disruption should contain the following information:

  • The related objects when a disruption happens. An object may be any abstract or realistic entity that can be identified in an MSA system (e.g. servers, containers, CPUs, processes, network connections, services).

  • The event type of a disruption. For an object where a disruption happens, there are several event types. For example, a Virtual Machine’s disruption event type may include VM halt down, OS upgrade, kernel break, etc.

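A disruption is represented by the tuple $D = \langle l_O, l_E \rangle$ in mathematics, where $l_O$ and $l_E$ are the labels of the object and the event type. If the predicate $Occurs(D, t)$ means that disruption $D$ occurs at timestamp $t$, the fact that $D$ causes a service degradation on service $S$ at timestamp $t$ can be expressed by Formula 2, where $Degraded(S, a, t)$ states that $S$ is degraded on performance attribute $a$ at time $t$.

$$Occurs(D, t) \rightarrow Degraded(S, a, t) \tag{2}$$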

Definition 5 (Service Degradation and Service Resilience)

Service degradation is a phenomenon in MSA Systems in which a service is kept degraded because of a disruption. The degraded state of a service is confirmed by judging whether the service performance is lower than the service performance benchmark. In mathematics, a service degradation is represented by the tuple $SD = \langle S, a, D, t_s, t_e \rangle$, where:

  • $S$ is the degraded service;

  • $a$ is the performance attribute on which the service performance benchmark is violated;

  • $D$ is the disruption causing the service degradation;

  • $t_s$ and $t_e$ are the start time and the end time of the service degradation.

Service degradation shows the impact of a disruption on service performance. Service resilience is a model that measures the impact of a service degradation. Since the service performance values under a service degradation form a typical resilience curve, by referring to the Bruneau Model[25], three metrics are included in service resilience to measure a service degradation: Disruption Tolerance, Recovery Rapidity and Performance Loss.

  • Disruption Tolerance:

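    Disruption Tolerance measures how much service performance is degraded compared with the service performance benchmark. The Disruption Tolerance of a service degradation is the maximum deviation of service performance from the service performance benchmark in the period of the service degradation. In mathematics, for a service degradation $SD$ of service $S$ on attribute $a$ lasting from $t_s$ to $t_e$, with service performance $P$ and benchmark $P_B$, the Disruption Tolerance is represented by Formula 3.

    $$DT(SD) = \max_{t \in [t_s, t_e]} \bigl( P_B(S, a, t) - P(S, a, t) \bigr) \tag{3}$$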

    When a service is suffering degradation, the MSA System should keep the service from severe degradation which is unacceptable to users (For example, the frame rate of a video stream service can be lowered a bit but not too low to make a video look like a slide).

  • Recovery Rapidity:

    Recovery Rapidity measures how fast a degraded service can be recovered and reach the service performance benchmark again. Similar to Mean Time to Repair (MTTR) used in reliability assessment, Recovery Rapidity is measured by calculating the time interval of the service degradation $SD$ between its start time $t_s$ and its end time $t_e$, as is shown in Formula 4.

    $$RR(SD) = t_e - t_s \tag{4}$$

  • Performance Loss:

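    Performance Loss is a quantification of the magnitude of service degradation in service performance. For a service degradation $SD$ of service $S$ on attribute $a$ lasting from $t_s$ to $t_e$, with service performance $P$ and benchmark $P_B$, the Performance Loss is mathematically expressed by Equation 5.

    $$PL(SD) = \int_{t_s}^{t_e} \bigl( P_B(S, a, t) - P(S, a, t) \bigr) \, dt \tag{5}$$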

    Performance Loss measures the cumulative degraded performance during the service degradation, which is shown as the shaded area in Figure 1. Performance Loss can reveal the business loss in a service degradation. For example, if a data transmission service benchmarked by throughput suffers a service degradation, Performance Loss can measure how much less data is transmitted than expected.

    In time series databases used for monitoring, there already exist data types used for recording cumulative values of performance (such as the Counter data type in Prometheus). So it is possible to collect Performance Loss data of service degradations in MSA Systems.
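When a cumulative counter is available, Performance Loss can be approximated without integrating the whole curve. The sketch below assumes a constant benchmark rate and a counter that is not reset during the degradation; the function name and all numbers are hypothetical, and no monitoring API is called.

```python
def performance_loss_from_counter(counter_start: float,
                                  counter_end: float,
                                  benchmark_rate: float,
                                  t_start: float,
                                  t_end: float) -> float:
    """Expected cumulative amount at the benchmark rate minus the amount
    actually recorded by the counter during the degradation."""
    if t_end < t_start:
        raise ValueError("t_end must not precede t_start")
    if counter_end < counter_start:
        raise ValueError("counter decreased; a reset is not handled here")
    expected = benchmark_rate * (t_end - t_start)
    actual = counter_end - counter_start
    return expected - actual


# Hypothetical data transmission service benchmarked at 100 MB/s.
# The degradation lasts 30 s, and the cumulative counter of transmitted
# data grows from 5000 MB to 7400 MB in that period.
loss_mb = performance_loss_from_counter(5000, 7400, 100, 0, 30)
print(loss_mb)
```

Reading the counter only at the start and end of the degradation keeps the computation cheap, at the cost of losing the shape of the curve, so Disruption Tolerance still has to be derived from the sampled performance values.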

Service resilience can measure a service degradation with these three metrics. Mathematically, service resilience is represented by the tuple $SR = \langle DT, RR, PL \rangle$, where $DT$, $RR$ and $PL$ are Disruption Tolerance, Recovery Rapidity, and Performance Loss.

Take a service with the performance attribute TPS (Transactions Per Second) as an example. The performance benchmark on TPS of this service is 50 requests/second. The service suffered from a disruption and was recovered later, and the TPS values collected during the service degradation are shown in Figure 3. Read from this curve, Disruption Tolerance is the largest drop of TPS below the benchmark, Recovery Rapidity is the duration of the degradation in seconds, and Performance Loss is 200 requests, which means that 200 fewer user requests were processed than expected due to the disruption.

Fig. 3: Performance on TPS of a Sample Service Degradation
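The three metrics can be computed from sampled performance data along the following lines. This is a minimal sketch under stated assumptions: the benchmark is constant, the samples cover the degradation from its start to its end, and the area is approximated with the trapezoidal rule. The sample values are hypothetical and are not the data of Figure 3.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ServiceResilience:
    disruption_tolerance: float  # largest gap below the benchmark
    recovery_rapidity: float     # duration of the degradation
    performance_loss: float      # area between benchmark and curve


def measure(samples: List[Tuple[float, float]], benchmark: float) -> ServiceResilience:
    """samples: (timestamp, performance) pairs sorted by time, from the
    start to the end of one service degradation."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    gaps = [(t, benchmark - p) for t, p in samples]
    tolerance = max(gap for _, gap in gaps)
    rapidity = samples[-1][0] - samples[0][0]
    # Trapezoidal approximation of the integral of (benchmark - performance).
    loss = sum((t1 - t0) * (g0 + g1) / 2
               for (t0, g0), (t1, g1) in zip(gaps, gaps[1:]))
    return ServiceResilience(tolerance, rapidity, loss)


# Hypothetical TPS samples (seconds, requests/second), benchmark 50.
tps_samples = [(0, 50), (2, 30), (4, 20), (6, 30), (8, 50)]
result = measure(tps_samples, benchmark=50)
print(result)
```

With these samples the service drops at most 30 requests/second below the benchmark, recovers within 8 seconds, and processes about 140 fewer requests than the benchmark would have delivered.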

IV Resilience Requirement Representation

In Section III, we proposed MRMM to measure service resilience when a service degradation happens. With measurable service resilience metrics, microservice practitioners can set service resilience goals to indicate how resilient an MSA System is supposed to be. Service resilience goals specify service resilience with thresholds of the service resilience metrics in MRMM. For a service with a service resilience goal, the service resilience metrics of any service degradation that happens to this service are expected to stay within the thresholds of the service resilience goal.

When a service degradation violates the service resilience goal, the service degradation should be further analyzed to diagnose the disruption causing it. Developers then establish a corresponding resilience mechanism to mitigate the impact of the identified disruption so that the service resilience goal is satisfied again. In MSA Systems, a resilience mechanism is a process consisting of system behaviors (like monitoring and failure detection) executed by one or several components to react to disruptions. Microservice practitioners usually use architecture-level diagrams to show how a resilience mechanism works in MSA Systems[14][56].

Thus the resilience requirement of an MSA System consists of the following information based on our definition on microservice resilience and MRMM:

  • Service resilience goals;

  • Disruptions that cause service resilience goal violations;

  • Resilience mechanisms established to mitigate impacts of disruptions.

We propose the Service Resilience Requirement Model to represent the microservice resilience requirement above. The proposed requirement model consists of two views: the Resilience Goal Decomposition View and the Resilience Mechanism Implementation View. In the Resilience Goal Decomposition View, we integrate notions of MRMM into a customized goal model to represent service resilience goals. Disruptions causing service resilience goal violations are regarded as obstacles to service resilience goals, and resilience mechanisms are resolutions to these disruptions. The Resilience Mechanism Implementation View uses microservice practitioners' existing documentation styles for resilience mechanisms (like the architecture-level diagrams in [14][56]) to show how resilience mechanisms work in MSA Systems in a more expressive way. Figure 4 shows how notions in MRMM are represented in the Resilience Goal Decomposition View, and how a resilience mechanism established in the Resilience Goal Decomposition View is implemented in the Resilience Mechanism Implementation View.

Fig. 4: Relations between MRMM, Resilience Goal Decomposition View and Resilience Mechanism Implementation View

IV-A Resilience Goal Decomposition View

In Resilience Goal Decomposition View, service resilience goals are set and decomposed with the methodology of Goal-Oriented Requirement Engineering (GORE) [38]: service resilience goals are decomposed into resource resilience goals, disruptions obstructing service/resource resilience goals are identified, resilience mechanisms resolving obstacles are established, and resilience mechanisms are further refined to detailed system behaviors which can be implemented by components of MSA Systems.

The Resilience Goal Decomposition View uses a mainstream goal model, KAOS[41], which is more expressive than other goal models[57], and notions in MRMM are integrated into this goal model. The customized goal model uses basic elements of KAOS: goal, obstacle, agent and domain property. Besides these elements, we introduce a new element, asset, to our goal model to represent the necessary services and system resources in MSA Systems. The symbolic representations of these elements in diagrams are shown in the example in Figure 4. The following paragraphs explain in detail how the resilience requirements of an MSA System are represented in the Resilience Goal Decomposition View, and some mathematical representations are given for our future work on verifying resilience goal satisfaction.

IV-A1 Service Resilience Goal

Goals are objectives that the target system should achieve. In KAOS, goals cover different types of concerns: from high-level, strategic concerns, to low-level, technical concerns; from functional concerns, to non-functional concerns[38]. In diagrams of our proposed goal model, goals are represented by blue parallelograms.

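Service resilience goals are the final goals to achieve in an MSA System, which specify how resilient the services in the MSA System are supposed to be. A service resilience goal of a service contains thresholds of the service resilience metrics in MRMM. The performance attribute is also specified in a service resilience goal, because service resilience metrics are calculated from service performance variations.

Mathematically, let $G(S, a)$ denote the service resilience goal of service $S$ on performance attribute $a$, with thresholds $DT_G$, $RR_G$ and $PL_G$ for Disruption Tolerance ($DT$), Recovery Rapidity ($RR$) and Performance Loss ($PL$), and let the predicate $Sat(G(S, a))$ mean that $G(S, a)$ is satisfied. Based on the definitions of service degradation and service resilience in MRMM, $Sat(G(S, a))$ satisfies the propositional formula in Formula 6, where $\mathcal{D}(S, a)$ means all service degradations of service $S$ on performance attribute $a$.

$$Sat(G(S, a)) \leftrightarrow \forall SD \in \mathcal{D}(S, a) : DT(SD) \le DT_G \wedge RR(SD) \le RR_G \wedge PL(SD) \le PL_G \tag{6}$$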

A service resilience goal is linked to an asset representing the service and a domain property representing the service performance benchmark, to show which service the service resilience goal specifies and which service performance benchmark is used to calculate service resilience when service degradation happens, as is shown in Figure 5.

Fig. 5: The Service Resilience Goal of a Service benchmarked by a Service Performance Attribute

In KAOS models, textual specification of elements is required[58]. We use the following information to specify a service resilience goal:

  • Goal Name: The identifier of the service resilience goal;

  • Service: The service which the service resilience goal specifies;

  • Performance Attribute: The performance attribute of the service resilience goal;

  • Service Resilience Thresholds: Thresholds of service resilience metrics.
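The textual specification of a service resilience goal, together with the check that every recorded service degradation stays within its thresholds, can be sketched as follows. The field names, the inclusive comparison against the thresholds, and the sample goal are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

# (disruption tolerance, recovery rapidity, performance loss) of one degradation
Metrics = Tuple[float, float, float]


@dataclass(frozen=True)
class ServiceResilienceGoal:
    name: str                        # Goal Name
    service: str                     # Service
    attribute: str                   # Performance Attribute
    max_disruption_tolerance: float  # Service Resilience Thresholds
    max_recovery_rapidity: float
    max_performance_loss: float


def is_satisfied(goal: ServiceResilienceGoal, degradations: Iterable[Metrics]) -> bool:
    """True if every degradation stays within all three thresholds.
    With no recorded degradation the goal holds vacuously."""
    return all(
        dt <= goal.max_disruption_tolerance
        and rr <= goal.max_recovery_rapidity
        and pl <= goal.max_performance_loss
        for dt, rr, pl in degradations
    )


# Hypothetical goal for an order service benchmarked by TPS.
goal = ServiceResilienceGoal("G-order-tps", "order", "TPS", 30, 10, 200)

within = [(30, 8, 140.0), (10, 5, 40.0)]
violating = within + [(35, 8, 150.0)]  # tolerance threshold exceeded

print(is_satisfied(goal, within), is_satisfied(goal, violating))
```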

IV-A2 Resilience Goal Refinement

In the KAOS model, a high-level goal can be refined to low-level goals, which are called sub goals of the high-level goal. There are two types of refinements: AND-refinement and OR-refinement.

AND-refinement means a goal can be achieved by satisfying all of its sub goals. Mathematically, for a goal and the set of its sub goals, AND-refinement satisfies the propositional Formula 7.


OR-refinement means a goal can be achieved when at least one of its sub goals is satisfied. Mathematically, for a goal and the set of its sub goals, OR-refinement satisfies the propositional Formula 8.
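As a sketch of the two refinement semantics, write $G$ for the parent goal, $G_1, \dots, G_n$ for its sub goals, and $Sat(\cdot)$ for the satisfaction predicate. These symbols are chosen here for illustration and may differ from the notation used in Formulas 7 and 8. Following the prose definitions of the two refinements:

```latex
\begin{align}
\text{AND-refinement:}\quad & \bigwedge_{i=1}^{n} Sat(G_i) \;\Rightarrow\; Sat(G) \\
\text{OR-refinement:}\quad  & \bigvee_{i=1}^{n} Sat(G_i) \;\Rightarrow\; Sat(G)
\end{align}
```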


Figure 6 shows the difference in representation between AND-refinement and OR-refinement in Resilience Goal Decomposition View.

(a) AND-refinement
(b) OR-refinement
Fig. 6: AND-refinement and OR-refinement

In our goal model, each service resilience goal is refined into resilience goals of the service’s dependency system resources, which ensure the running of the service (such as containers and VMs), if the performance attribute of the service resilience goal is directly influenced by performance attributes of these system resources. Resource resilience goals can be further refined into resilience goals of the resources’ own dependency system resources. Such a refinement is an AND-refinement because a single resource resilience goal violation is enough to obstruct the service resilience goal.

When disruptions that obstruct service/resource resilience goals are found, resilience mechanisms are established to resolve them. These resilience mechanisms are sub goals of service/resource resilience goals, since they ensure the satisfaction of those goals. The refinement from a resilience goal to resilience mechanisms is an OR-refinement, because when a disruption happens, only the corresponding resilience mechanism is required.

Resilience mechanisms are decomposed one or several times by AND/OR-refinements, until detailed system behaviors which can be executed by individual components in MSA Systems are figured out. System behaviors of resilience mechanisms are represented by blue parallelograms with bold borders to show that they are terminal goals in the goal model. Each system behavior is linked with an agent representing a system component, to show which component executes the system behavior.

Figure 7 shows the whole goal refinement from a service resilience goal to system behaviors in Resilience Goal Decomposition View.

Fig. 7: Goal Refinement from a Service Resilience Goal to System Behaviors
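The refinement chain from a service resilience goal down to system behaviors can be read as a tree that is evaluated bottom-up. The sketch below is only an illustration under assumed names: the `Goal` class, the example goals, and the agents are hypothetical, and a resource goal without an identified disruption is modelled as an already achieved leaf.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Goal:
    name: str
    refinement: str = "LEAF"                 # "AND", "OR" or "LEAF"
    sub_goals: List["Goal"] = field(default_factory=list)
    agent: Optional[str] = None              # component executing a leaf system behavior
    achieved: bool = False                   # only read for leaves

    def satisfied(self) -> bool:
        """Evaluate the goal from its sub goals according to the refinement type."""
        if self.refinement == "LEAF":
            return self.achieved
        results = [sub_goal.satisfied() for sub_goal in self.sub_goals]
        if self.refinement == "AND":
            return all(results)
        return any(results)                  # "OR": one alternative is enough


# System behaviors (terminal goals), each linked to an agent.
detect = Goal("Detect request timeouts", agent="monitoring tool", achieved=True)
isolate = Goal("Stop routing to failing instance", agent="service proxy", achieved=False)

# Resilience mechanism refined into behaviors by AND-refinement.
mechanism = Goal("Circuit Breaker", "AND", [detect, isolate])

# Resource goal refined into mechanisms by OR-refinement.
container_goal = Goal("Order container resilience goal", "OR", [mechanism])
pod_goal = Goal("Order pod resilience goal", achieved=True)

# Service goal refined into resource goals by AND-refinement.
service_goal = Goal("Order service resilience goal", "AND", [pod_goal, container_goal])
```

With `isolate.achieved` still false, the AND-refinements propagate the failure up to `service_goal`; once every behavior of the mechanism is achieved, the service goal evaluates to satisfied.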

IV-A3 Resilience Obstacle

Obstacles are a dual notion to goals in goal models, and they are represented by red parallelograms. When obstacles become true, some goals may not be achieved[52]. In our goal model, service degradations are obstacles to service resilience goals. After a service degradation is diagnosed, it is transformed into a disruption obstructing a resource resilience goal, as is shown in Figure 8. The disruption is linked to the affected system resource, and the corresponding resilience mechanism then takes action on that resource.

The textual specification of an obstacle contains the following information:

  • Obstacle Name: The identifier of the obstacle;

  • Event: A description of how service/resource resilience goals are obstructed.

(a) Service Degradation Obstructing Service Resilience Goal
(b) Disruption Obstructing Resource Resilience Goal
Fig. 8: A Service Degradation (a) and its Root Cause Disruption (b)

IV-A4 Asset

Assets are services or system resources (like containers and physical servers) in an MSA System. Assets are represented by purple hexagons in diagrams of our proposed goal model. The textual specification of an asset includes the asset name and the asset type.

A service asset is linked with assets representing the service’s dependency system resources, and these resource assets are linked with other resource assets. The identified dependency relationships among assets are used as references for service resilience goal refinement and service degradation diagnosis.
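The asset links form a dependency graph that can be traversed to list every resource a service relies on, directly or indirectly. The sketch below is a minimal illustration: the asset names and the `DEPENDS_ON` mapping are hypothetical, loosely following a containerized deployment.

```python
from collections import deque
from typing import Dict, List

# Hypothetical asset links: each asset maps to the resources it directly depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "Order (service)": ["Order (pod)", "Order DB (pod)"],
    "Order (pod)": ["Order (container)"],
    "Order DB (pod)": ["Order DB (container)"],
    "Order (container)": ["Kubernetes cluster"],
    "Order DB (container)": ["Kubernetes cluster"],
    "Kubernetes cluster": ["Worker nodes", "Master node"],
}


def dependency_resources(asset: str, links: Dict[str, List[str]]) -> List[str]:
    """Return all direct and indirect dependency resources of an asset (breadth-first)."""
    seen = set()
    ordered: List[str] = []
    queue = deque(links.get(asset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # a resource shared by several assets is listed once
        seen.add(current)
        ordered.append(current)
        queue.extend(links.get(current, []))
    return ordered


resources = dependency_resources("Order (service)", DEPENDS_ON)
```

The breadth-first order lists resources closest to the service first, which matches the order in which a diagnosis would usually inspect them.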

IV-A5 Agent

Agents are individual objects that perform behaviors to achieve goals[38]. In our goal model, agents are represented as yellow hexagons, and they stand for the components in an MSA System that execute the system behaviors decomposed from resilience mechanisms. Individual components for MSA System operation, like monitoring tools or anomaly detectors, are typical agents in Resilience Goal Decomposition View. As with assets, the textual information of an agent includes the agent name and the agent type.

IV-A6 Domain Property

Domain properties are indicative statements of domain knowledge which are used as references of elements in KAOS[38][59]. Service performance benchmarks are domain properties of service resilience goals, because service performance benchmarks are key variables for calculating service resilience metrics in MRMM. Domain knowledge of MSA Systems (like architectural patterns and operation principles) forms the domain properties of resilience mechanisms, because it is what the resilience mechanisms refer to. A domain property is represented by an orange pentagon, and the textual specification of a domain property includes the following information:

  • Domain Property Name: The identifier of the domain property;

  • Description: Detailed description of the domain property;

  • Reference Resources: Reference links to related documentations of the domain property.

IV-B Resilience Mechanism Implementation View

Resilience Mechanism Implementation View shows how a resilience mechanism, which is established in Resilience Goal Decomposition View, is implemented in an MSA System. Existing documentation styles for resilience mechanisms (like the architecture-level diagrams used for Circuit Breakers and Bulkheads in [14][60]) are used directly for Resilience Mechanism Implementation View, because they are expressive enough that no new model needs to be designed.

Resilience Mechanism Implementation View is drawn according to the elements in Resilience Goal Decomposition View. For example, if the component-and-connector architecture style[61] is used for Resilience Mechanism Implementation View, each agent in Resilience Goal Decomposition View is drawn as a box representing a component, and system behaviors of resilience mechanisms are drawn as connections launched by these components.

V Resilience Requirement Elicitation

A process framework is proposed in this section to elicit resilience requirements of MSA Systems represented by the Service Resilience Requirement Model in Section IV. The proposed process framework describes how to set service performance benchmarks and service resilience goals, find disruptions that cause service resilience goal violations, and establish resilience mechanisms to mitigate the impact of disruptions.

V-A Assumptions

The proposed process framework follows these assumptions:

  • The target MSA System is iteratively developed.

  • The first version of the target MSA System has already been developed and deployed before resilience requirements are elicited.

  • A monitoring component exists in the target MSA System, and it is designed well enough to collect the necessary performance data of all services and resources, as well as the service resilience metrics of service degradations.

  • The target MSA System meets its performance requirements when no disruptions happen.

  • The resilience requirements focus on service performance degradations caused by disruptions, and software failures caused by code errors are not considered.

V-B Stakeholders of Resilience Requirements

Stakeholders are the participants involved in software requirement elicitation. The following stakeholders in the development team of an MSA System may participate in the resilience requirement elicitation process:

  • Quality Assurer: Quality assurers are present in all stages of an MSA System’s development lifecycle. When eliciting resilience requirements, quality assurers should provide key performance attributes and corresponding benchmarks of each service, and then set service resilience goals according to performance benchmarks.

  • Operation Engineer: Operation engineers deploy the system and maintain the system after deployment. They detect service degradations in running MSA Systems, and find possible root cause disruptions of service degradations.

  • Software Architect: Software architects are responsible for the architecture of the entire MSA System, or for the architecture of a single microservice. When disruptions are identified, architects should establish resilience mechanisms to cope with these disruptions.

  • Developer: Developers responsible for developing operation components in MSA Systems (which sometimes are called operations developers in DevOps) will implement resilience mechanisms, so they should know why resilience mechanisms are required.

  • Tester: Testers need to know what is achieved by resilience mechanisms in order to design test cases for these resilience mechanisms.

V-C Process Framework for Resilience Requirement Elicitation

Microservice Architecture is an architecture adapted to the DevOps process [62], thus the resilience requirement elicitation of MSA Systems is also an iterative process. Figure 9 outlines the steps (bold line boxes) of resilience requirement elicitation in MSA System development. The proposed process framework has three major stages: System Identification, Resilience Goal Setting and Disruption Analysis. DevOps processes after requirement elicitation, like system implementation, deployment and monitoring data collection (the dashed line box), are not discussed in this paper.

Fig. 9: Process Framework for Resilience Requirement Elicitation

V-C1 Stage I: System Identification

The architecture of an MSA System, including services and system resources, is identified first as the basis of resilience requirement elicitation. The identified architecture of an MSA System is represented in Resilience Goal Decomposition View by a number of assets linked with each other, showing what services are provided by the MSA System and what system resources ensure the running of each service, as is shown in Figure 10.

Fig. 10: A Sample Identified Architecture of MSA System in Resilience Goal Decomposition View
  • Identify Services:

    Services in an MSA System are identified in order to set service resilience goals and find the dependency system resources of these services. Since the Microservice Architecture is an architectural style which modularizes components as services, microservices in an MSA System can be directly identified as the services on which service resilience goals are set.

    Identified services are represented as assets in Resilience Goal Decomposition View, and these service assets are linked to an asset representing the MSA System.

  • Identify Dependency Resources:

    Service degradations are caused by disruptions which affect the system resources ensuring the running of services, and resilience mechanisms are established to take action on these affected system resources. Thus the dependency system resources of a service need to be identified in resilience requirement elicitation.

    In MSA Systems, services are deployed as independently of each other as possible, so each service in an MSA System usually has its own supporting system resources. The resource dependency of a service can generally be structured in layers that follow the SaaS-PaaS-IaaS architectural pattern of cloud systems.

    The granularity of dependency resource identification should not be too fine, because too many identified system resources may reduce the readability of resilience requirement diagrams. Only system resources that can be manipulated by system operation components need to be identified, since system behaviors are terminal goals in Resilience Goal Decomposition View.

In Stage I, the architecture of the MSA System is identified and represented in Resilience Goal Decomposition View, in order to set resilience goals and analyze disruptions in later stages. As MSA Systems always face changes, the architecture in Resilience Goal Decomposition View should also be updated when microservices are added/removed or the deployment architecture is changed (e.g. from virtual machine deployment to container deployment).

V-C2 Stage II: Resilience Goal Setting

In Stage II, performance benchmarks of services and resources are determined, and then service resilience goals are set based on these performance benchmarks in Resilience Goal Decomposition View.

  • Set Performance Benchmarks:

    As is defined in MRMM, the service resilience of a service degradation is measured by performance changes of the service compared to service performance benchmarks. So performance attributes for services should be determined after the services in an MSA System are identified. Table I lists common service performance attributes in standards of IT Systems such as SPEC[63], TPC [64] [65], ETSI[66], datasets for service performance evaluation [67][68], and other research on web service metrics selection [69]. The selection of service performance attributes depends on the type of services, as is discussed in Section III. Sometimes specific performance attributes need to be set to reflect the business needs of an MSA System. For example, Netflix uses the number of streams started in a given second as a performance attribute because it impacts the success of the business[70].

    Performance Attribute | Description | Unit
    Response Time | Time taken to send a request and receive a response | ms
    Availability | Number of successful invocations / total invocations | %
    Throughput | Total number of invocations for a given period of time | invokes/s
    Successability | Number of responses / number of request messages | %
    TABLE I: Common Service Performance Attributes
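The attributes in Table I can be derived from raw invocation records. The sketch below assumes a hypothetical `Invocation` record and a fixed observation window; it applies the definitions of Table I literally and is not tied to any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class Invocation:
    sent_at: float                  # seconds since the start of the window
    response_at: Optional[float]    # None when no response message was received
    succeeded: bool                 # True when the invocation completed successfully


def service_performance(
    invocations: Sequence[Invocation], window_seconds: float
) -> Dict[str, Optional[float]]:
    """Compute the Table I attributes over one observation window."""
    total = len(invocations)
    if total == 0 or window_seconds <= 0:
        raise ValueError("need at least one invocation and a positive window")
    answered = [i for i in invocations if i.response_at is not None]
    response_times_ms = [(i.response_at - i.sent_at) * 1000 for i in answered]
    successful = sum(1 for i in invocations if i.succeeded)
    return {
        # Mean response time over answered requests, in ms.
        "response_time_ms": (
            sum(response_times_ms) / len(response_times_ms) if answered else None
        ),
        # Successful invocations / total invocations, in %.
        "availability_pct": 100.0 * successful / total,
        # Total invocations per second.
        "throughput_per_s": total / window_seconds,
        # Responses / request messages, in %.
        "successability_pct": 100.0 * len(answered) / total,
    }


sample = [
    Invocation(0.0, 0.1, True),
    Invocation(0.0, 0.2, True),
    Invocation(0.0, 0.3, False),   # answered, but with an error
    Invocation(0.0, None, False),  # never answered
]
stats = service_performance(sample, window_seconds=2.0)
```

Note that Availability and Successability differ only for requests that are answered with a failure, such as the third record in the sample.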

    Besides service performance attributes, performance attributes for the services’ dependency resources also need to be determined, because a service resilience goal may be refined to resilience goals of its dependency resources if service performance is directly influenced by resource performance, and anomalies relative to resource performance benchmarks can also be regarded as disruptions in MSA Systems.

    Table II shows metrics of virtual infrastructures that can be collected by monitoring tools like Zabbix, cAdvisor and Heapster (as far as we know, no performance standard for virtual infrastructures is available), and Table III lists some performance attributes of physical infrastructures in the standards of SPEC[63], TPC[64] and EEMBC[71].

    Performance Attribute | Description | Unit
    system.cpu.util | CPU utilization | %
    system.cpu.intr | CPU interrupts per second | intr/s
    proc.num | Number of processes | processes
    vm.memory.size.available | Available memory | %
    system.swap.size.pfree | Free swap space (percentage) | %
    vfs.fs.inode.pfree | Free inodes on / (percentage) | %
    vfs.fs.size.pfree | Free disk space on / (percentage) | %
    net.if.in | Incoming network traffic | bps
    net.if.out | Outgoing network traffic | bps
    TABLE II: Performance Attributes of Virtual Infrastructures

    Performance Attribute | Description | Unit
    Instances | Created Application Virtual Instances | instances
    Elasticity | Whether the work performed by application instances scales linearly in a cloud when compared to the performance of application instances during baseline phase | %
    Mean Instance Provisioning Time | The time interval between the instance provisioning request and connectivity to port 22 on the instance | s
    CPU Speed | Average time for a computer to complete single tasks | s
    Network Throughput | Throughput of network in unit of time | bps
    TABLE III: Performance Attributes for Physical Infrastructure

    After performance attributes are determined, performance benchmarks are set on these performance attributes. As is defined in Section III, a performance benchmark may be either a constant value or a dynamic value varying over time. Constant performance benchmark values can be set by referring to historical mean values, expert experience, or suggested values in standards. Dynamic performance benchmark values can be generated by running time series prediction algorithms (like EWMA, ARIMA, LSTM, etc.) on historical performance data.

    Performance benchmarks are represented as domain properties in Resilience Goal Decomposition View. Each performance benchmark is linked to the asset of a service or a resource, to illustrate that the performance benchmark is set for that service/resource. The specification of a performance benchmark in Resilience Goal Decomposition View contains the performance attribute of the benchmark, the benchmark value (for dynamic benchmark values it may be a reference to the data training result), and references to related standards/discussion minutes/algorithms if necessary.
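The two kinds of benchmark values described above can be illustrated with a short sketch. It assumes evenly spaced historical samples; the function names and the smoothing factor `alpha` are illustrative choices, and EWMA is used only because it is the simplest of the algorithms named in the text.

```python
from statistics import mean
from typing import List, Sequence


def constant_benchmark(history: Sequence[float]) -> float:
    """Constant benchmark taken as the historical mean value."""
    return mean(history)


def ewma_benchmark(history: Sequence[float], alpha: float = 0.3) -> List[float]:
    """Dynamic benchmark: one-step-ahead EWMA forecast for every sample.

    The benchmark at position t only uses samples observed before t, so a
    degradation at t does not pull its own benchmark down.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if not history:
        return []
    forecasts = [history[0]]  # no earlier data exists for the first sample
    level = history[0]
    for value in history[1:]:
        forecasts.append(level)
        level = alpha * value + (1 - alpha) * level
    return forecasts


history = [10.0, 20.0, 30.0]
static_value = constant_benchmark(history)        # 20.0
dynamic_values = ewma_benchmark(history, alpha=0.5)  # [10.0, 10.0, 15.0]
```

A larger `alpha` makes the dynamic benchmark follow recent data more closely, at the cost of also following short disturbances.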

  • Set Resilience Goals:

    As is defined in Section III, the service resilience metrics of a service degradation are calculated from the service performance benchmark. In Resilience Goal Decomposition View, a service resilience goal is linked to the domain property of the service performance benchmark.

    For each service resilience goal, thresholds for Disruption Tolerance, Recovery Time and Performance Loss are set. Not all three metrics need to be set in a service resilience goal, since some of them are unnecessary or meaningless for certain performance attributes (for example, Performance Loss is meaningless for response time).

    Thresholds of Disruption Tolerance and Recovery Time can be set by referring to existing standards or expert experience on reliability or fault tolerance (e.g. ETSI[72] has suggested values for minimum acceptable response time and outage recovery time on different types of NFV services). The threshold of Performance Loss is set depending on historical disruption data or the business needs of the target MSA System.

    If the performance of a service is directly influenced by the performance of its dependency resources (e.g. TPS of a service is influenced by the CPU processing speed of the service container), the service resilience goal of this service is refined to resilience goals of the dependency resources, so that system resources affected by disruptions can be found by detecting resource resilience goal violations. Thresholds on the metrics of resource resilience goals are set in the same way as for service resilience goals. Figure 11 shows an example in which a service resilience goal is refined to resource resilience goals in Resilience Goal Decomposition View.

    Fig. 11: A Service Resilience Goal is Refined to Resource Resilience Goals

As with Stage I, Stage II is an iterative process in which performance benchmarks and service resilience goals are updated as the MSA System changes.

V-C3 Stage III: Disruption Analysis

In Stage III, service degradations that violate service resilience goals are detected, and these service degradations are diagnosed to find possible root cause disruptions. Then resilience mechanisms are established to mitigate the disruptions’ impact.

  • Service Degradation Analysis:

    To achieve service resilience goals in MSA Systems, service degradations that violate service resilience goals are detected and analyzed. Service resilience goal violations in an MSA System can be easily detected by setting alarms of service resilience metrics on the monitoring component. The corresponding service degradations are represented as obstacles to service resilience goals in Resilience Goal Decomposition View, as is shown in Figure 12.

    Fig. 12: Detected Service Degradations Violating Service Resilience Goals
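Detecting degradations and measuring their resilience metrics can be sketched as a scan over monitoring samples. The code below is an illustration under explicit assumptions, not the MRMM definitions themselves: the attribute is "higher is better" (such as Success Orders), samples are evenly spaced, a degradation is a maximal run of samples below the benchmark, Disruption Tolerance is the largest shortfall, Recovery Time is the duration of the run, and Performance Loss is the accumulated shortfall. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Degradation:
    start_index: int
    disruption_tolerance: float  # largest shortfall against the benchmark
    recovery_time: float         # seconds spent below the benchmark
    performance_loss: float      # shortfall accumulated over the run


def _summarize(run: List[Tuple[int, float]], interval_s: float) -> Degradation:
    shortfalls = [shortfall for _, shortfall in run]
    return Degradation(
        start_index=run[0][0],
        disruption_tolerance=max(shortfalls),
        recovery_time=len(run) * interval_s,
        performance_loss=sum(shortfalls) * interval_s,
    )


def find_degradations(
    values: Sequence[float], benchmark: Sequence[float], interval_s: float = 1.0
) -> List[Degradation]:
    """Split a monitored series into degradations relative to its benchmark."""
    if len(values) != len(benchmark):
        raise ValueError("values and benchmark must have the same length")
    found: List[Degradation] = []
    run: List[Tuple[int, float]] = []  # (index, shortfall) of the current degradation
    for index, (value, expected) in enumerate(zip(values, benchmark)):
        shortfall = expected - value
        if shortfall > 0:
            run.append((index, shortfall))
        elif run:
            found.append(_summarize(run, interval_s))
            run = []
    if run:
        # Still degraded at the end of the data: recovery time is a lower bound.
        found.append(_summarize(run, interval_s))
    return found


observed = [10, 10, 6, 4, 10, 9, 10]
expected = [10] * len(observed)
degradations = find_degradations(observed, expected, interval_s=2.0)
```

Each resulting record can then be compared with the thresholds of the corresponding service resilience goal to decide whether an alarm should be raised.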

    Detected service degradations are further analyzed to find possible root cause disruptions which affect the services’ dependency resources. Detailed ways of inferring disruptions from service degradations are not discussed in this paper, since a great deal of work has already been done on performance analysis and fault diagnosis in software systems.

    In Resilience Goal Decomposition View, an analyzed service degradation is substituted by an obstacle representing the possible root cause disruption. The disruption is linked to the dependency resource of the service which it affects, and its original link to the service resilience goal may be redirected to resource resilience goals if resource resilience goal violations are detected. Moreover, a reference to event log files may also be attached to a disruption in the form of a domain property. Figure 13 is a sample of found disruptions in Resilience Goal Decomposition View.

    Fig. 13: Found Disruptions in Resilience Goal Decomposition View
  • Establish Resilience Mechanisms:

    When disruptions in MSA Systems are identified, corresponding resilience mechanisms are established to prevent service degradations from violating service resilience goals. As is described in many practitioner books on microservices [14] [16], MSA Systems use typical fault-tolerance mechanisms of large-scale internet applications[60] (e.g. Circuit Breaker, Bulkhead, Active-Active Redundancy, etc.) as resilience mechanisms. These mechanisms are usually implemented in standalone components or middleware (like Hystrix and Istio) in MSA Systems. The patterns of resilience mechanisms differ depending on the types of disruptions and the affected system resources, which has already been summarized in the literature[73].

    A resilience mechanism is presented as a goal refining service/resource resilience goals and resolving disruptions. The resilience mechanism is further refined, until detailed system behaviors to manipulate affected system resources by individual system components are figured out, and these system behaviors can be represented in Resilience Mechanism Implementation View. The reference to Resilience Mechanism Implementation View and other references like resilience mechanism patterns are attached to the resilience mechanism as domain properties. Figure 14 shows a sample resilience mechanism in Resilience Goal Decomposition View.

    Fig. 14: Using Resilience Mechanisms to Resolve Disruptions
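To make one of the mechanisms named above concrete, the following is a minimal Circuit Breaker sketch. It is not the API of Hystrix or Istio; the class, its parameters (`failure_threshold`, `reset_timeout_s`) and the injectable `clock` are assumptions made for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, operation: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit is open; call rejected")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            trial_failed = self._opened_at is not None
            if trial_failed or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

In terms of the goal model, "count failures", "reject calls while open" and "probe after the timeout" would be the system behaviors into which the mechanism is refined, each executed by an agent such as a service proxy.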

After Stage III, the resilience requirement elicitation process for an MSA System at a certain iteration is finished.

VI Case Study

In order to verify the feasibility of our proposed resilience requirement model and resilience requirement elicitation process framework, we conducted a case study on an MSA System. One of the benchmark MSA Systems proposed in the literature[74] was used as our target system.

In this study, the documentation and deployment configuration files of the target system were read to identify the system’s services and resources to be used in Stage I. Meanwhile, we deployed the target system on a cluster of servers, and we used tools to generate workloads of user operations and to collect performance data of services and resources, in order to simulate a real-world running MSA System. By referring to Netflix’s proposed methodology of finding faults in MSA Systems[70], we randomly injected faults into our 24/7 running system, to find disruptions causing service resilience goal violations. The detected service degradations and monitoring data for service performance were collected as inputs of Stage II and Stage III.

In order to draw diagrams of Resilience Goal Decomposition View, we developed an open source web application called KAOSer (https://github.com/XLab-Tongji/KAOSer). Users can create and edit elements in Resilience Goal Decomposition View, and export the diagram and textual documentation to their local storage. Figure 15 shows the user interface of KAOSer. Users can drag elements to the diagram area from the toolbox on the left, and edit the textual specifications of these elements by clicking on them.

Fig. 15: User Interface of KAOSer

VI-A System Description

Sock Shop (https://github.com/microservices-demo/microservices-demo) is an open source MSA System for demonstration and testing of microservice and cloud native technologies. It is built using Spring Boot, Go kit and Node.js, and is packaged in Docker containers. Sock Shop provides the basic services of an online shopping system. Figure 16 shows the architecture of these services.

Fig. 16: Sock Shop Architecture

We deployed Sock Shop on a Kubernetes cluster with one master node and three worker nodes. A Controller Node (which contains a bunch of tools including deployment, workload simulation, performance monitoring, fault injection, etc.) was used to generate and collect necessary data for resilience requirement elicitation. Figure 17 shows the deployment scenario of Sock Shop in our case study.

Fig. 17: Sock Shop Deployment Scenario

VI-B Resilience Requirement Elicitation

The resilience requirement elicitation for Sock Shop follows the process framework proposed in Section V. Because of the length limit of the paper, we cannot list all service resilience goals, disruptions and resilience mechanisms of our case study, so at each step we only illustrate a typical example.

VI-B1 Stage I: System Identification

  • Identify Services

    According to the system architecture shown in Figure 16, we decomposed Sock Shop into microservices it provides, as is shown in Figure 18.

    Fig. 18: Services in Sock Shop
  • Identify Dependency Resources

    Sock Shop was deployed on a Kubernetes Cluster in this study, so the dependency system resources of services were identified according to the Kubernetes deployment configuration file of Sock Shop.

    In Kubernetes, a service is deployed on a pod. A pod is a group of containers having the same functionalities. Containers in a pod are deployed on worker nodes of a Kubernetes cluster, and managed by Kubernetes management services deployed on the master node. In the Kubernetes configuration file of Sock Shop, there are pods providing database services to the Order, User, Cart and Catalogue microservices. These database services are identified as dependency resources of the microservices. Kubernetes deploys containers to worker nodes dynamically, so we did not know on which worker node a container was deployed; we therefore took the whole Kubernetes cluster as a dependency resource for all containers. The Kubernetes cluster is supported by its worker nodes and the Kubernetes management applications running on the master node. Figure 19 shows the identified services and resources in Resilience Goal Decomposition View of Sock Shop.

    Fig. 19: Services and Resources in Sock Shop

VI-B2 Stage II: Resilience Goal Setting

  • Set Performance Benchmarks

    Considering the metrics collectable from the monitoring tool Prometheus, and existing research and standards on web service quality (which are mentioned in Section V), we chose the following metrics as the general service performance attributes, since all services in Sock Shop are transactional services:

    • Response Time: Time taken to send a request and receive a response.

    • Success Rate: Number of response / Number of request messages.

    As is instructed in performance benchmark standards like TPC[65], performance attributes that price services (like revenue per second per service interaction) are required for benchmarking transactional services. Since the final goal of resilience is to prevent business loss caused by service degradations, we also set some performance attributes on business for certain services besides general service performance attributes.

    Limited by the capability of our monitoring tool, we could not directly collect data on the total price of an order. As an alternative, we used the number of successful orders, add-to-cart operations and online users per second as our business performance attributes, since they are collectable and directly impact the "revenue" of Sock Shop. Table IV shows all service performance attributes selected for Sock Shop.

    Service | Performance Attribute | Description | Unit
    All Services | Response Time | Average time taken to send a request and receive a response | ms
    All Services | Success Rate | Number of responses / number of request messages | %
    Order | Success Orders | Number of finished orders per second | orders/s
    Cart | Add-Cart Count | Number of add-to-cart requests per second | requests/s
    User | Online Users | Number of online users | users
    TABLE IV: Service Performance Attributes of Sock Shop

    Then the performance attributes of the services’ dependency resources were determined. We only show the resource performance attributes of the Order service here, since performance attributes are set in a similar way for the dependency resources of the other services.

    The Order service has three performance attributes: Response Time, Success Rate and Success Orders. In the Kubernetes cluster where Sock Shop was deployed, the Order service was deployed on a pod which consists of several container instances. Metrics of the Order service’s performance attributes can be calculated by performance attributes of containers by the following formulas:


    Formulas 9, 10 and 11 are defined over the number of available container instances of a pod and over each individual container instance. The number of available instances of a pod, and the TPS, Response Time and Success Rate of containers, were selected as performance attributes to be benchmarked because they directly influence the performance attributes of the Order service.
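One plausible way to derive the service-level attributes from container-level attributes is sketched below. This is an assumption for illustration and not a reproduction of Formulas 9 to 11: it averages Response Time and Success Rate over the available container instances and sums their TPS to obtain Success Orders. The `ContainerSample` type and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class ContainerSample:
    response_time_ms: float
    success_rate_pct: float
    tps: float  # transactions per second handled by this instance


def order_service_attributes(containers: Sequence[ContainerSample]) -> Dict[str, float]:
    """Aggregate container attributes of a pod into service attributes (assumed form)."""
    n = len(containers)  # number of available container instances of the pod
    if n == 0:
        raise ValueError("no available container instance")
    return {
        "response_time_ms": sum(c.response_time_ms for c in containers) / n,
        "success_rate_pct": sum(c.success_rate_pct for c in containers) / n,
        "success_orders_per_s": sum(c.tps for c in containers),
    }


pod = [ContainerSample(100.0, 90.0, 5.0), ContainerSample(300.0, 100.0, 7.0)]
service = order_service_attributes(pod)
```

Whatever the exact aggregation, the point carried by the text is the same: a drop in the number of available instances or in any container attribute shows up directly in the service attributes.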

    In this case study, we used a workload generator to generate the workload of a number of users, in order to simulate a real-world system and collect performance data. The workload generator simulated the whole shopping procedure on Sock Shop (including user login, browsing items, adding items to the cart, and submitting orders) so that workload on all services was ensured. We used the continuous integration tool Jenkins to trigger workloads in a certain period every day, so that the daily behavior of a running system was simulated. Figure 20 shows the performance monitoring data under the workload we simulated.

    Fig. 20: Success Orders of Sock Shop under the Simulated Workload(the Blue Line), and the Performance Benchmark(the Red Line)

    By referring to the mean values of the running data we simulated, we set 3 seconds and 90% as the baseline values of Response Time and Success Rate of services and containers. The baseline value of Available Instances in a pod was set to 3 instances, as is configured in our deployment configuration file. For performance attributes varying over time, like Success Orders and TPS, we used time series prediction algorithms to build performance benchmarks. Figure 20 shows the predicted benchmark of the service performance attribute Success Orders (marked as the red line), calculated by Triple Exponential Smoothing from historical monitoring data (marked as the blue line) on Success Orders. Figure 21 shows the representation of the performance benchmarks of services and resources in Resilience Goal Decomposition View.

    Fig. 21: Performance Benchmarks for the Order Service and its Dependency Resources
  • Set Resilience Goals

    We set the thresholds of Disruption Tolerance and Recovery Time by referring to suggested values in ETSI standards. For Performance Loss, assuming that losing 5% of the orders in a day is unacceptable for Sock Shop, the Performance Loss thresholds for Success Orders of the Order service and TPS of the Order container instances were roughly calculated. Table V shows the thresholds in the resilience goals of the Order service and its dependency resources.

    Service/Resource | Performance Attribute | Disruption Tolerance | Recovery Time | Performance Loss
    Order (Service) | Response Time | 10s | 5s | -
    Order (Service) | Successability | 20% | - | -
    Order (Service) | Success Orders | - | - | 500 orders
    Order (POD) | Available Instances | 1 instance | 2s | -
    Order (Container) | Response Time | 10s | 5s | -
    Order (Container) | Successability | 20% | - | -
    Order (Container) | TPS | - | - | 150 transactions
    TABLE V: Resilience Goals of the Order Service and its Dependency Resources
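The rough calculation of a Performance Loss threshold from a tolerated share of daily orders can be sketched as follows. The function name, the sampling layout and the example numbers are assumptions; in particular, the daily order total behind Table V is not stated in the text.

```python
from typing import Sequence


def performance_loss_threshold(
    benchmark_per_s: Sequence[float],
    interval_s: float,
    tolerated_loss_ratio: float = 0.05,
) -> float:
    """Threshold = tolerated share of the total expected by the benchmark.

    benchmark_per_s holds the benchmark rate (e.g. orders per second) for each
    sampling interval of one day; interval_s is the length of one interval.
    """
    if not 0 <= tolerated_loss_ratio <= 1:
        raise ValueError("tolerated_loss_ratio must be between 0 and 1")
    expected_total = sum(rate * interval_s for rate in benchmark_per_s)
    return tolerated_loss_ratio * expected_total


# Illustrative numbers only: three intervals of 1000 s at 2, 3 and 5 orders/s
# add up to 10,000 expected orders, so a 5% tolerance gives about 500 orders.
threshold = performance_loss_threshold([2.0, 3.0, 5.0], interval_s=1000.0)
```

Under this reading, a 500-order threshold would correspond to a benchmark total of 10,000 orders per day; this figure is used here purely as an example.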

    As shown in Formulas 9, 10, and 11, performance attributes of the Order service can be calculated from performance attributes of its pods and containers. The service resilience goals of the Order service were therefore refined to resource resilience goals of the Order service’s pod and container in the Resilience Goal Decomposition View, as shown in Figure 22.

    Fig. 22: Service Resilience Goal Refinement for the Order service

VI-B3 Stage III: Disruption Analysis

  • Service Degradation Analysis

    In this case study, we proactively injected random faults into our Sock Shop deployment, which ran 24/7, and detected service degradations that violated the service resilience goals we had set. The following types of faults were injected, following tutorials on Chaos Engineering [70]:

    • Generate high load on one or more CPU cores of physical machines.

    • Allocate a specific amount of a physical machine’s RAM.

    • Put read/write pressure on disks of physical machines and containers.

    • Inject latency into network traffic between containers.

    • Induce packet loss in a container network.

    • Reboot the physical machines.

    • Kill a POD in Kubernetes.

    • Kill management processes of Kubernetes.

    We used the alert manager in Prometheus to raise alarms when resilience goal violations were found. In total, we detected 37 service degradations that violated service resilience goals during the 5-day run with fault injections. This number is acceptable because both Sock Shop and Kubernetes are open source projects, and we used only 5 physical machines to build our system.

    Considering the length of the paper, here we show how just one of the detected service degradations is analyzed and resolved.

    Figure 23 shows a service degradation detected on the Order service that violated the service resilience goal on Success Orders. By searching fault injection logs and related monitoring data, we found that the service degradation was caused by the network delay we injected into containers of the Order service, and that it also violated the resilience goal on TPS of the Order service’s container, as shown in Figure 24.

    Fig. 23: A Service Degradation Violating Service Resilience Goal of the Order Service
    Fig. 24: Root Cause Disruption of the Service Degradation in Figure 23
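
    The log search used to locate the root cause can be sketched as a time-window match between a detected degradation and the recorded fault injections. The record layout, the function name, and the 30-second lead time are assumptions made for illustration; the paper does not describe the format of its fault injection logs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Sequence


@dataclass(frozen=True)
class FaultRecord:
    """One entry of a (hypothetical) fault injection log."""
    fault_type: str
    target: str
    start: datetime
    end: datetime


def candidate_root_causes(
    degradation_start: datetime,
    degradation_end: datetime,
    fault_log: Sequence[FaultRecord],
    lead_time: timedelta = timedelta(seconds=30),  # assumed propagation delay
) -> List[FaultRecord]:
    """Return injected faults that were active shortly before or during the degradation."""
    window_start = degradation_start - lead_time
    return [
        fault
        for fault in fault_log
        # Interval overlap test between the fault and the widened degradation window.
        if fault.start <= degradation_end and fault.end >= window_start
    ]
```

    The returned candidates still need to be confirmed against monitoring data of the affected resource, as was done for TPS of the Order container in Figure 24.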
  • Establish Resilience Mechanisms

    Network delays on a container impair its ability to process transactions, so we planned to route more transactions to other healthy container instances when network delay on a container is detected. This can be realized by using a Service Mesh to control network traffic among containers. To implement container network traffic control, the sidecar component Envoy is injected into all containers to proxy their network connections, and the Service Mesh middleware Istio is used to monitor and manage traffic among sidecars. Kubernetes is responsible for injecting the sidecars into containers. Figure 25 shows how this resilience mechanism is represented in the Resilience Goal Decomposition View, and Figure 26 shows its representation in the Resilience Mechanism Implementation View.

    Fig. 25: Resilience Mechanism Established for Container Network Delay
    Fig. 26: Resilience Mechanism in Resilience Mechanism Implementation View

    After integrating the Service Mesh into Kubernetes, we tested the resilience mechanism by injecting the same fault again. The collected monitoring data are shown in Figure 27. The service degraded slightly during testing, but no alarm on resilience goal violation was reported, which indicates that the established resilience mechanism is effective against this disruption.

    Fig. 27: Monitoring Data of Success Orders after Integrating Service Mesh into Kubernetes
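
    The traffic-shifting idea behind this resilience mechanism can be illustrated with a small sketch that redistributes traffic away from container instances whose observed network delay exceeds a threshold. This is not Istio's or Envoy's actual algorithm or configuration, which the paper does not show; the function name, the instance names, and the threshold are hypothetical.

```python
from typing import Dict


def routing_weights(latency_ms: Dict[str, float], delay_threshold_ms: float) -> Dict[str, float]:
    """Split traffic evenly among instances whose latency is within the threshold.

    If every instance is delayed, fall back to an even split over all of them
    so that the service keeps receiving traffic.
    """
    if not latency_ms:
        raise ValueError("at least one instance is required")
    healthy = [name for name, latency in latency_ms.items() if latency <= delay_threshold_ms]
    targets = healthy or list(latency_ms)
    share = 1.0 / len(targets)
    return {name: (share if name in targets else 0.0) for name in latency_ms}
```

    In a real deployment this decision would be delegated to the service mesh rather than implemented by hand; the sketch only shows the intended effect on the three Order container instances.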

VII Conclusion

In recent years, the microservice architecture has become a mainstream architecture adopted by internet companies. In cases where achieving higher system reliability is no longer affordable and failure is inevitable [75], microservice practitioners have started to use the word “resilience” to describe the ability to cope with failures. However, because there is no consensus on the definition and measurement of resilience, microservice practitioners seldom have a clear idea of how to evaluate the resilience of an MSA System or how resilient an MSA System is supposed to be.

In this work, we make the following contributions:

  • A definition of microservice resilience is provided by referring to systematic studies on resilience in other scientific areas, and a Microservice Resilience Measurement Model (MRMM) is proposed to measure service resilience under service degradations.

  • A Service Resilience Requirement Model is proposed to represent service resilience goals, disruptions, and resilience mechanisms. The requirement model uses a goal model to refine service resilience goals, expressed as thresholds on the resilience metrics in MRMM, into system behaviors to be implemented in MSA Systems.

  • We propose a process framework for microservice resilience requirement elicitation. The process framework outlines steps to set service resilience goals, analyze disruptions, and establish resilience mechanisms against disruptions.

Possible limitations of our proposed work, which can be addressed in future work, include the following:

  • The MRMM gives measurable resilience metrics to evaluate a service’s resilience; it may be worthwhile to find a modeling technique that models system resilience in terms of service resilience.

  • Verification of goal satisfaction is an important research topic in Goal Modeling. We have already given mathematical representations of the notions in our measurement model and requirement model. How to encode these mathematical representations in formal languages and verify them with model checkers is left as future work.

  • In our case study, an open source microservice project was used as the target system. The target system contains only a small number of microservices, so we could easily build resilience documents manually. For large-scale MSA Systems, generating resilience requirements entirely by hand is difficult because there are many services and resources to identify. Automatic generation techniques (e.g., generating the architecture of the system by reading deployment configuration files) are therefore needed for our resilience requirement model.


References

  • [1] Martin Fowler and James Lewis. Microservices. ThoughtWorks. http://martinfowler.com/articles/microservices.html [last accessed on February 17, 2015], 2014.
  • [2] Todd Hoff. Lessons learned from scaling uber to 2000 engineers, 1000 services, and 8000 git repositories, 2017.
  • [3] Tony Mauro. Adopting microservices at netflix: Lessons for architectural design, 2016.
  • [4] Jez Humble and David Farley. Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation (Adobe Reader). Pearson Education, 2010.
  • [5] Len Bass, Ingo Weber, and Liming Zhu. DevOps: A Software Architect’s Perspective. Addison-Wesley Professional, 2015.
  • [6] Nicola Dragoni, Saverio Giallorenzo, Alberto Lluch Lafuente, Manuel Mazzara, Fabrizio Montesi, Ruslan Mustafin, and Larisa Safina. Microservices: yesterday, today, and tomorrow. In Present and Ulterior Software Engineering, pages 195–216. Springer, 2017.
  • [7] Les Hatton. Reexamining the fault density component size connection. IEEE software, 14(2):89–97, 1997.
  • [8] Fabrizio Montesi and Janine Weber. Circuit breakers, discovery, and api gateways in microservices. arXiv preprint arXiv:1609.05830, 2016.
  • [9] Christian Esposito, Aniello Castiglione, and Kim-Kwang Raymond Choo. Challenges in delivering software in the cloud as microservices. IEEE Cloud Computing, 3(5):10–14, 2016.
  • [10] Haryadi S. Gunawi, Mingzhe Hao, Riza O. Suminto, Agung Laksono, and Kurnia J. Eliazar. Why does the cloud stop computing?: Lessons from hundreds of service outages. In ACM Symposium on Cloud Computing, 2016.
  • [11] International Organization for Standardization. ISO/IEC 25010:2011 Systems and Software Engineering-Systems and Software Quality Requirements and Evaluation (SQuaRE)-System and Software Quality Models. ISO, 2011.
  • [12] Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-Anake, Thanh Do, Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, and Vincentius Martin. What bugs live in the cloud? a study of 3000+ issues in cloud systems. 2014.
  • [13] Haryadi S Gunawi, Riza O Suminto, Russell Sears, Casey Golliher, Swaminathan Sundararaman, Xing Lin, Tim Emami, Weiguang Sheng, Nematollah Bidokhti, Caitie McCaffrey, et al. Fail-slow at scale: Evidence of hardware performance faults in large production systems. ACM Transactions on Storage (TOS), 14(3):23, 2018.
  • [14] Sam Newman. Building Microservices. O’Reilly Media, Inc., 2015.
  • [15] Eberhard Wolff. Microservices: Flexible Software Architecture. Addison-Wesley Professional, 2016.
  • [16] Irakli Nadareishvili, Ronnie Mitra, Matt McLarty, and Mike Amundsen. Microservice Architecture: Aligning Principles, Practices, and Culture. O’Reilly Media, Inc., 2016.
  • [17] A Bondavalli et al. Research roadmap deliverable D3.2. AMBER–Assessing Measuring and Benchmarking Resilience, Funded by European Union, 2009.
  • [18] Raquel Almeida and Marco Vieira. Changeloads for resilience benchmarking of self-adaptive systems: a risk-based approach. In Dependable Computing Conference (EDCC), 2012 Ninth European, pages 173–184. IEEE, 2012.
  • [19] Jeremy Dick, Elizabeth Hull, and Ken Jackson. Requirements engineering. Springer, 2017.
  • [20] Eric Evans. Domain-driven design: tackling complexity in the heart of software. Addison-Wesley Professional, 2004.
  • [21] Crawford S Holling. Resilience and stability of ecological systems. Annual review of ecology and systematics, 4(1):1–23, 1973.
  • [22] Angela Weber Righi, Tarcisio Abreu Saurin, and Priscila Wachs. A systematic literature review of resilience engineering: Research areas and a research agenda proposal. Reliability Engineering & System Safety, 141:142–152, 2015.
  • [23] Seyedmohsen Hosseini, Kash Barker, and Jose E Ramirez-Marquez. A review of definitions and measures of system resilience. Reliability Engineering & System Safety, 145:47–61, 2016.
  • [24] Michael Ungar. Qualitative contributions to resilience research. Qualitative social work, 2(1):85–102, 2003.
  • [25] Michel Bruneau, Stephanie E Chang, Ronald T Eguchi, George C Lee, Thomas D O’Rourke, Andrei M Reinhorn, Masanobu Shinozuka, Kathleen Tierney, William A Wallace, and Detlof Von Winterfeldt. A framework to quantitatively assess and enhance the seismic resilience of communities. Earthquake spectra, 19(4):733–752, 2003.
  • [26] Devanandham Henry and Jose Emmanuel Ramirez-Marquez. Generic metrics and quantitative approaches for system resilience as a function of time. Reliability Engineering & System Safety, 99:114–122, 2012.
  • [27] Nita Yodo and Pingfeng Wang. Engineering resilience quantification and system design implications: A literature survey. Journal of Mechanical Design, 138(11):111408, 2016.
  • [28] Nicola Dragoni, Ivan Lanese, Stephan Thordal Larsen, Manuel Mazzara, Ruslan Mustafin, and Larisa Safina. Microservices: How to make your application scale. arXiv preprint arXiv:1702.07149, 2017.
  • [29] Daniel Richter, Marcus Konrad, Katharina Utecht, and Andreas Polze. Highly-available applications on unreliable infrastructure: Microservice architectures in practice. In Software Quality, Reliability and Security Companion (QRS-C), 2017 IEEE International Conference on, pages 130–137. IEEE, 2017.
  • [30] Nane Kratzke. About microservices, containers and their underestimated impact on network performance. arXiv preprint arXiv:1710.04049, 2017.
  • [31] Giovanni Toffetti, Sandro Brunner, Martin Blöchlinger, Florian Dudouet, and Andrew Edmonds. An architecture for self-managing microservices. In Proceedings of the 1st International Workshop on Automated Incident Management in Cloud, pages 19–24. ACM, 2015.
  • [32] Thomas Soenen, Wouter Tavernier, Didier Colle, and Mario Pickavet. Optimising microservice-based reliable nfv management & orchestration architectures. In Resilient Networks Design and Modeling (RNDM), 2017 9th International Workshop on, pages 1–7. IEEE, 2017.
  • [33] Tim Zwietasch. Online failure prediction for microservice architectures. Master’s thesis, 2017.
  • [34] Stefan Haselböck, Rainer Weinreich, and Georg Buchgeher. Decision guidance models for microservices: service discovery and fault tolerance. In Proceedings of the Fifth European Conference on the Engineering of Computer-Based Systems, page 4. ACM, 2017.
  • [35] Victor Heorhiadi, Shriram Rajagopalan, Hani Jamjoom, Michael K Reiter, and Vyas Sekar. Gremlin: systematic resilience testing of microservices. In Distributed Computing Systems (ICDCS), 2016 IEEE 36th International Conference on, pages 57–66. IEEE, 2016.
  • [36] Antonio Brogi, Andrea Canciani, Davide Neri, Luca Rinaldi, and Jacopo Soldani. Towards a reference dataset of microservice-based applications. In International Conference on Software Engineering and Formal Methods, pages 219–229. Springer, 2017.
  • [37] Thomas F. Düllmann and André Van Hoorn. Model-driven generation of microservice architectures for benchmarking performance and resilience engineering approaches. In The Acm/spec, pages 171–172, 2017.
  • [38] Axel Van Lamsweerde. Goal-oriented requirements engineering: A guided tour. In Requirements Engineering, 2001. Proceedings. Fifth IEEE International Symposium on, pages 249–262. IEEE, 2001.
  • [39] Kenneth S Rubin and Adele Goldberg. Object behavior analysis. Communications of the ACM, 35(9):48–62, 1992.
  • [40] Axel Van Lamsweerde and Emmanuel Letier. From object orientation to goal orientation: A paradigm shift for requirements engineering. In Radical Innovations of Software and Systems Engineering in the Future, pages 325–340. Springer, 2004.
  • [41] Anne Dardenne, Axel Van Lamsweerde, and Stephen Fickas. Goal-directed requirements acquisition. Science of computer programming, 20(1-2):3–50, 1993.
  • [42] Eric SK Yu. Towards modelling and reasoning support for early-phase requirements engineering. In Requirements Engineering, 1997., Proceedings of the Third IEEE International Symposium on, pages 226–235. IEEE, 1997.
  • [43] Annie I Anton. Goal-based requirements analysis. In Requirements Engineering, 1996., Proceedings of the Second International Conference on, pages 136–144. IEEE, 1996.
  • [44] John Mylopoulos, Lawrence Chung, and Brian Nixon. Representing and using nonfunctional requirements: A process-oriented approach. IEEE Transactions on software engineering, 18(6):483–497, 1992.
  • [45] Alexei Lapouchnian. Goal-oriented requirements engineering: An overview of the current research. University of Toronto, page 32, 2005.
  • [46] Jennifer Horkoff, Fatma Başak Aydemir, Evellin Cardoso, Tong Li, Alejandro Maté, Elda Paja, Mattia Salnitri, John Mylopoulos, and Paolo Giorgini. Goal-oriented requirements engineering: a systematic literature map. In Requirements Engineering Conference (RE), 2016 IEEE 24th International, pages 106–115. IEEE, 2016.
  • [47] Hongbing Wang, Suxiang Zhou, and Qi Yu. Discovering web services to improve requirements decomposition. In Web Services (ICWS), 2015 IEEE International Conference on, pages 743–746. IEEE, 2015.
  • [48] Shehnila Zardari and Rami Bahsoon. Cloud adoption: a goal-oriented requirements engineering approach. In International Workshop on Software Engineering for Cloud Computing, pages 29–35, 2011.
  • [49] Awais Rashid, Peter Sawyer, Ana Moreira, and João Araújo. Early aspects: A model for aspect-oriented requirements engineering. In Requirements Engineering, 2002. Proceedings. IEEE Joint International Conference on, pages 199–202. IEEE, 2002.
  • [50] Robert Darimont and Axel Van Lamsweerde. Formal refinement patterns for goal-driven requirements elaboration. In ACM SIGSOFT Software Engineering Notes, volume 21, pages 179–190. ACM, 1996.
  • [51] Axel Van Lamsweerde and Emmanuel Letier. Integrating obstacles in goal-driven requirements engineering. In Proceedings of the 20th international conference on Software engineering, pages 53–62. IEEE Computer Society, 1998.
  • [52] Axel Van Lamsweerde and Emmanuel Letier. Handling obstacles in goal-oriented requirements engineering. IEEE Transactions on Software Engineering, 26(10):978–1005, 2000.
  • [53] Axel Van Lamsweerde. Elaborating security requirements by construction of intentional anti-models. In Proceedings of the 26th International Conference on Software Engineering, pages 148–157. IEEE Computer Society, 2004.
  • [54] Fatma Başak Aydemir, Paolo Giorgini, and John Mylopoulos. Multi-objective risk analysis with goal models. In 2016 IEEE Tenth International Conference on Research Challenges in Information Science (RCIS), pages 1–10. IEEE, 2016.
  • [55] Paolo Giorgini, John Mylopoulos, Eleonora Nicchiarelli, and Roberto Sebastiani. Reasoning with Goal Models. 2002.
  • [56] Kjell Jørgen Hole. Anti-fragile ict systems. 2016.
  • [57] Chi Mai Nguyen, Roberto Sebastiani, Paolo Giorgini, and John Mylopoulos. Multi-objective reasoning with constrained goal models. Requirements Engineering, 23(2):189–225, 2018.
  • [58] Robert Darimont, Emmanuelle Delor, Philippe Massonet, and Axel van Lamsweerde. Grail/kaos: an environment for goal-driven requirements engineering. In Proceedings of the (19th) international conference on software engineering, pages 612–613. IEEE, 1997.
  • [59] A Lamsweerde. Kaos tutorial. Cediti, September, 5, 2003.
  • [60] Michael T Nygard. Release it!: design and deploy production-ready software. Pragmatic Bookshelf, 2018.
  • [61] Paul Clements, David Garlan, Len Bass, Judith Stafford, Robert Nord, James Ivers, and Reed Little. Documenting software architectures: views and beyond. Pearson Education, 2002.
  • [62] Armin Balalaie, Abbas Heydarnoori, and Pooyan Jamshidi. Microservices architecture enables devops: migration to a cloud-native architecture. IEEE Software, 33(3):42–52, 2016.
  • [63] SPEC Benchmarks. Standard performance evaluation corporation, 2000.
  • [64] TPC Benchmark A Standard Specification. Transaction processing performance council. San Jose, CA, 5, 1989.
  • [65] Daniel A Menascé. Tpc-w: A benchmark for e-commerce. IEEE Internet Computing, 6(3):83–87, 2002.
  • [66] ISGNFV ETSI. ETSI GS NFV-TST 001 V1.1.1: Network functions virtualisation (NFV); pre-deployment testing; report on validation of NFV environments and services, 2016.
  • [67] E. Al-Masri and Q. H. Mahmoud. Qos-based discovery and ranking of web services. In 2007 16th International Conference on Computer Communications and Networks, pages 529–534, Aug 2007.
  • [68] Yilei Zhang, Zibin Zheng, and Michael R. Lyu. Wsexpress: A qos-aware search engine for web services. In IEEE International Conference on Web Services, 2010.
  • [69] Sravanthi Kalepu, Shonali Krishnaswamy, and Seng Wai Loke. Verity: a qos metric for selecting web services and providers. In Web Information Systems Engineering Workshops, 2003. Proceedings. Fourth International Conference on, pages 131–139. IEEE, 2003.
  • [70] Ali Basiri, Niosha Behnam, Ruud De Rooij, Lorin Hochstein, Luke Kosewski, Justin Reynolds, and Casey Rosenthal. Chaos engineering. IEEE Software, 33(3):35–41, 2016.
  • [71] Embedded Microprocessor Benchmark Consortium et al. Eembc benchmark suite, 2008.
  • [72] ISGNFV ETSI. ETSI GS NFV-REL 001 V1.1.1: Network functions virtualisation (NFV); resiliency requirements, 2015.
  • [73] Saurabh Hukerikar and Christian Engelmann. A pattern language for high-performance computing resilience. In Proceedings of the 22nd European Conference on Pattern Languages of Programs, page 12. ACM, 2017.
  • [74] Carlos M. Aderaldo, Claus Pahl, and Pooyan Jamshidi. Benchmark requirements for microservices architecture research. In International Workshop on Establishing the Community-wide Infrastructure for Architecture-based Software Engineering, 2017.
  • [75] Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. Site Reliability Engineering: How Google Runs Production Systems. O’Reilly Media, Inc., 2016.