Doping Tests for Cyber-Physical Systems

04/18/2019 ∙ by Sebastian Biewer, et al.

The software running in embedded or cyber-physical systems (CPS) is typically of proprietary nature, so users do not know precisely what the systems they own are (in)capable of doing. Most malfunctions of such systems are not intended by the manufacturer, but some are, which means they cannot be classified as bugs or security loopholes. The most prominent examples have become public in the diesel emissions scandal, where millions of cars were found to be equipped with software violating the law, altogether polluting the environment and putting human health at risk. The behaviour of the software embedded in these cars was intended by the manufacturer, but it was not in the interest of society, a phenomenon that has been called software doping. Doped software is significantly different from buggy or insecure software, and hence it is not possible to use classical verification and testing techniques to discover and mitigate software doping. The work presented in this paper builds on existing definitions of software doping and lays the theoretical foundations for conducting software doping tests, so as to enable attacking evil manufacturers. The complex nature of software doping makes it very hard to effectuate doping tests in practice. We explain the biggest challenges and provide efficient solutions to realise doping tests despite this complexity.







1 Introduction

Embedded and cyber-physical systems are becoming more and more widespread as part of our daily life. Printers, mobile phones, smart watches, smart home equipment, virtual assistants, drones and batteries are just a few examples. Modern cars are even composed of a multitude of such systems. These systems can have a huge impact on our lives, especially if they do not work as expected. As a result, numerous approaches exist to assure quality of a system. The classical and most common type of malfunctioning is what is widely called “bug”. Usually, a bug is a very small mistake in the software or hardware that causes a behaviour that is not intended or expected. Other types of malfunctioning are caused by incorrect or wrongly interpreted sensor data, physical deficiencies of a component, or are simply radiation-induced.

Another interesting kind of malfunction (also from an ethical perspective [4]) arises if the expectation of how the system should behave is different for two (or more) parties. Examples for such scenarios are widespread in the context of personal data privacy, where product manufacturers and data protection agencies have notoriously different opinions about how a software is supposed to handle personal data. Another example is the usage of third-party cartridges in printers. Manufacturers and users do not agree on whether their printer should work with third-party cartridges (the user’s opinion) or only with those sold by the manufacturer (the manufacturer’s opinion). Lastly, an example that received very high media attention is that of emission cleaning systems in diesel cars. There are regulations for dangerous particles and gases like CO and NOx defining how much of these substances is allowed to be emitted during car operation. Part of these regulations are emissions tests, precisely defined test cycles that a car has to undergo on a chassis dynamometer [28]. Car manufacturers have to obey these regulations in order to get admission to sell a new car model. The central weakness of these regulations is that the relevant behaviour of the car is only a trickle of the possible behaviour on the road. Indeed, several manufacturers equipped their cars with defeat devices that recognise if the car is undergoing an official emissions test. During the test, the car obeys the regulation, but outside test conditions, the emissions extruded are often significantly higher than allowed. Generally speaking, the phenomena described above are considered as incorrect software behaviour by one party, but as intended software behaviour by the other party (usually the manufacturer). In the literature, such phenomena are called software doping [3, 10].

The difference between software doping and bugs is threefold: (1) There is a disagreement of intentions about what the software should do. (2) While a bug is most often a small coding error, software doping can be present in a considerable portion of the implementation. (3) Bugs can potentially be detected during production by the manufacturer, whereas software doping needs to be uncovered after production, by the other party facing the final product. Embedded software is typically proprietary, so (unless one finds a way to breach into the intellectual property [9]) it is only possible to detect software doping by observation of the behaviour of the product, i.e., by black-box testing.

This paper develops the foundations for black-box testing approaches geared towards uncovering doped software in concrete cases. We will start off from an established formal notion of robust cleanness (which is the negation of software doping) [10]. Essentially, the idea of robust cleanness is based on a succinct specification (called a “contract”) capturing the intended behaviour of a system with respect to all inputs to the system. Inputs are considered to be user inputs or environmental inputs given by sensors. The contract is defined by input and output distances on standard system trajectories, supplemented by input and output thresholds. Simply put, the input distance and threshold induce a tube around the standard inputs, and similarly for outputs. For any input in the tube around some standard input the system must be able to react with an output that is in the tube around the output possible according to the standard.

Example 1

For a diesel car the standard trajectory is the behaviour exhibited during the official emissions test cycle. The input distance measures the deviation in car speed from the standard. The input threshold is a small number larger than the acceptable error tolerance of the cycle, limiting the inputs considered of interest. The output distance then is the difference between (the total amount of) NOx extruded by the car facing inputs of interest and that extruded on the standard test cycle. For cars with an active defeat device we expect to see a violation of the contract even for relatively large output thresholds.
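As a back-of-the-envelope illustration of such an output distance, consider the following sketch. All numbers and function names are made up for illustration and are not taken from the paper's test setup.

```python
def cumulative_total(readings):
    """Total of a per-second NOx emission series (illustrative units)."""
    return sum(readings)

def output_distance(std_nox, test_nox):
    """Difference between the NOx accumulated on the standard cycle
    and the NOx accumulated on the deviating test drive."""
    return abs(cumulative_total(std_nox) - cumulative_total(test_nox))

# Illustrative per-second readings; an active defeat device would make
# the observed series grow much faster than the standard one.
standard = [0.1, 0.1, 0.2]
observed = [0.1, 0.3, 0.4]
print(round(output_distance(standard, observed), 6))
```

A contract violation would then amount to this distance exceeding the output threshold while the speed profile stays within the input threshold.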

A cyber-physical system (CPS) is influenced by physical or chemical dynamics. Some of this can be observed by the sensors the CPS is equipped with, but some portion might remain unknown, making proper analysis difficult. Nondeterminism is a powerful way of representing such uncertainty faithfully, and indeed the notion of robust cleanness supports non-deterministic reactive systems [10]. Furthermore, the analysis needs to consider (at least) two trajectories simultaneously, namely the standard trajectory and another that stays within the input tube. In the presence of nondeterminism it might even become necessary to consider infinitely many trajectories at the same time. Properties over multiple traces are called hyperproperties [8]. In this respect, expressing robust cleanness as a hyperproperty needs both universal and existential trajectory quantifiers. Formulas containing only one type of quantifier can be analysed efficiently, e.g., using model-checking techniques, but checking properties with alternating quantifiers is known to be very complex [7, 16]. Even more, testing of such properties is in general not possible. Assume, for example, a property requiring for a (non-deterministic) system that for every input i, there exists the output i, i.e., one of the system's possible behaviours computes the identity function. For black-box systems with infinite input and output domains the property can neither be verified nor falsified through testing. In order to verify the property, it is necessary to iterate over the infinite input set. For falsification one must show that for some input i the system cannot produce i as output. However, not observing an output in finitely many steps does not rule out that this output can be generated. As a result, there is no prior work (we are aware of) that targets the automatic generation of test cases for hyperproperties, let alone robust cleanness.

The contribution of this paper is three-fold. (1) We observe that standard behaviour, in particular when derived by common standardisation procedures, can be represented by finite models, and we identify under which conditions the resulting contracts are (un)satisfiable. (2) For a given satisfiable contract we construct the largest non-deterministic model that is robustly clean w.r.t. this contract. We integrate this model into a model-based testing theory, which can provide a non-deterministic algorithm to derive sound test suites. (3) We develop a testing algorithm for bounded-length tests and discretised input/output values. We present test cases for the diesel emissions scandal and execute these tests with a real car on a chassis dynamometer.

2 Software Doping on Reactive Programs

Embedded software is reactive: it reacts to inputs received from sensors by producing outputs that are meant to control the device functionality. We consider a reactive program as a function on infinite sequences of inputs, so that the program reacts to the k-th input in the input sequence by producing non-deterministically the k-th output in each respective output sequence. Thus, the program can be seen, for instance, as a (non-deterministic) Mealy or Moore machine. Moreover, we consider an equivalence relation ≈ that equates sequences of inputs. To illustrate this, think of the program embedded in a printer. Here ≈ would for instance equate input sequences that agree with respect to submitting the same documents, regardless of the cartridge brand, the level of the toner (as long as there is sufficient), etc. We furthermore consider the set StdIn of inputs of interest or standard inputs. In the previous example, StdIn contains all the input sequences with compatible cartridges and printable documents. The definitions given below are simple adaptations of those given in [10] (but where parameters are instead treated as parts of the inputs).

Definition 1

A reactive program P is clean if for all inputs i, i′ ∈ StdIn such that i ≈ i′, P(i) = P(i′). Otherwise it is doped.

This definition states that a program is clean if its execution exhibits the same visible sequence of outputs when supplied with two equivalent inputs, provided such inputs comply with the given standard StdIn. Notice that the behaviour outside StdIn is deemed immediately clean since it is of no interest.

In the context of the printer example, a program that would fail to print a document when provided with an ink cartridge from a third-party manufacturer, but would otherwise succeed to print would be considered doped, since this difference in output behaviour is captured by the above definition. For this, the inputs (being pairs of document and printer cartridge) must be considered equivalent (not identical), which comes down to ink cartridges being compatible.

However, the above definition is not very helpful for cases that need to preserve certain intended behaviour outside of the standard inputs StdIn. This is clearly the case in the diesel emissions scandal, where the standard inputs are given precisely by the emissions test, but the behaviour observed there is assumed to generalise beyond the singularity of this test setup. It is meant to ensure that the amount of NO and NO2 (abbreviated as NOx) in the car exhaust gas does not deviate considerably in general, and comes with a legal prohibition of defeat mechanisms that simply turn off the cleaning mechanism. This legal framework is obviously a bit short-sighted, since it can be circumvented by mechanisms that alter the behaviour gradually in a continuous manner, but in effect drastically. In a nutshell, one expects that if the input values observed by the electronic control unit (ECU) of a diesel vehicle deviate within “reasonable distance” from the standard input values provided during the lab emission test, the amount of NOx found in the exhaust gas is still within the regulated threshold, or at least it does not exceed it by more than a “reasonable amount”.

This motivates the need to introduce the notion of distances on inputs and outputs. More precisely, we consider distances on finite traces: d_In on input traces and d_Out on output traces. Such distances are required to be pseudometrics. (d is a pseudometric if d(x, x) = 0, d(x, y) = d(y, x), and d(x, z) ≤ d(x, y) + d(y, z) for all x, y and z.) With this, D’Argenio et al. [10] provide a definition of robust cleanness that considers two parameters: parameter κ_i refers to the acceptable distance an input may deviate from the norm to still be considered, and parameter κ_o tells how far apart outputs are allowed to be in case their respective inputs are within distance κ_i (Def. 2 spells out the Hausdorff distance used in [10]).
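To make the pseudometric requirements concrete, the following Python sketch implements a supremum distance on equally long finite traces and checks the three properties on sample values. All names and numbers are illustrative, not part of the paper's formalisation.

```python
def d_sup(t1, t2):
    """Supremum distance between two equally long finite traces of
    numeric values: the largest pointwise deviation."""
    assert len(t1) == len(t2)
    return max((abs(a - b) for a, b in zip(t1, t2)), default=0.0)

# Pseudometric sanity checks on illustrative speed traces.
x, y, z = [0, 10, 20], [0, 12, 19], [1, 10, 25]
assert d_sup(x, x) == 0                          # d(x, x) = 0
assert d_sup(x, y) == d_sup(y, x)                # symmetry
assert d_sup(x, z) <= d_sup(x, y) + d_sup(y, z)  # triangle inequality
```

Note that a pseudometric, unlike a metric, may assign distance 0 to distinct traces, which is what allows equating behaviourally equivalent inputs.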

Definition 2

Let σ[..k] denote the k-th prefix of the sequence σ. A reactive program P is robustly clean if for all input sequences i ∈ StdIn and i′ such that d_In(i[..k], i′[..k]) ≤ κ_i for all k ∈ ℕ, the following holds:

  1. for all o ∈ P(i) there exists o′ ∈ P(i′) such that d_Out(o[..k], o′[..k]) ≤ κ_o for all k, and

  2. for all o′ ∈ P(i′) there exists o ∈ P(i) such that d_Out(o[..k], o′[..k]) ≤ κ_o for all k,

where P(i) denotes the set of output sequences P may produce on input i, and similarly for P(i′).

Notice that this is what we actually need for the non-deterministic case: each output of one of the program instances should be matched within “reasonable distance” by some output of the other program instance. Also notice that i′ does not need to satisfy i′ ∈ StdIn, but it will be considered as long as it is within κ_i distance of some input satisfying StdIn. In such a case, outputs generated by i′ will be requested to be within κ_o distance of some output generated by the respective execution induced by a standard input.

We remark that Def. 2 entails the existence of a contract, which defines the set of standard inputs StdIn, the tolerance parameters κ_i and κ_o, as well as the distances d_In and d_Out. In the context of diesel engines, one might imagine that the values to be considered, especially the tolerance parameters κ_i and κ_o for a particular car model, are made publicly available (or are even advertised by the car manufacturer), so as to enable potential customers to discriminate between different car models according to the robustness they reach in being clean. It is also imaginable that the tolerances and distances are fixed by the legal authorities as part of environmental regulations.

3 Robustly Clean Labelled Transition Systems

This section develops the framework needed for an effective theory of black-box doping tests based on the above concepts. In this, the standard behaviour (e.g. as defined by the emission tests) and the robust cleanness definitions together will induce a set of reference behaviours that then serve as a model in a model-based conformance testing approach. To set the stage for this, we recall the definitions of labelled transition systems (LTS) and input-output transition systems (IOTS), together with Tretmans’ notion of model-based conformance testing [25]. We then recast the characterisation of robust cleanness (Def. 2) in terms of LTS.

Definition 3

A labelled transition system (LTS) with inputs and outputs is a tuple ⟨Q, In, Out, →, q₀⟩ where (i) Q is a (possibly uncountable) non-empty set of states; (ii) In ∪ Out is a (possibly uncountable) set of labels, partitioned into inputs and outputs; (iii) → ⊆ Q × (In ∪ Out) × Q is the transition relation; (iv) q₀ ∈ Q is the initial state. We say that an LTS is an input-output transition system (IOTS) if it is input-enabled in any state, i.e., for all q ∈ Q and i ∈ In there is some q′ ∈ Q such that (q, i, q′) ∈ →.

For ease of presentation, we do not consider internal transitions. The following definitions will be used throughout the paper. A finite path π in an LTS 𝒮 is a sequence q₀ a₁ q₁ … aₙ qₙ with (qₖ, aₖ₊₁, qₖ₊₁) ∈ → for all k < n. Similarly, an infinite path π in 𝒮 is a sequence q₀ a₁ q₁ a₂ … with (qₖ, aₖ₊₁, qₖ₊₁) ∈ → for all k ∈ ℕ. Let paths*(𝒮) and pathsω(𝒮) be the sets of all finite and infinite paths of 𝒮 beginning in the initial state, respectively. The sequence a₁ a₂ … aₙ is a finite trace of 𝒮 if there is a finite path q₀ a₁ q₁ … aₙ qₙ ∈ paths*(𝒮), and a₁ a₂ … is an infinite trace if there is an infinite path q₀ a₁ q₁ a₂ … ∈ pathsω(𝒮). If π is a path, we let trace(π) denote the trace defined by π. Let traces*(𝒮) and tracesω(𝒮) be the sets of all finite and infinite traces of 𝒮, respectively.
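For intuition, a minimal LTS with finite-trace enumeration can be sketched as follows. This is an illustrative Python model, not the paper's formalisation; by convention here, labels ending in “?” denote inputs and labels ending in “!” denote outputs.

```python
from collections import defaultdict

class LTS:
    """Minimal labelled transition system: labelled transitions plus
    an initial state."""
    def __init__(self, init):
        self.init = init
        self.trans = defaultdict(list)   # state -> list of (label, state)

    def add(self, s, label, t):
        self.trans[s].append((label, t))

    def traces(self, depth):
        """All finite traces from the initial state, up to given length."""
        result = set()
        def walk(s, prefix):
            result.add(prefix)
            if len(prefix) == depth:
                return
            for label, t in self.trans[s]:
                walk(t, prefix + (label,))
        walk(self.init, ())
        return result

# A tiny example: s0 --a?--> s1 --x!--> s2
m = LTS("s0")
m.add("s0", "a?", "s1")
m.add("s1", "x!", "s2")
print(sorted(m.traces(2)))  # [(), ('a?',), ('a?', 'x!')]
```

The bounded enumeration used here is also what makes the bounded testing algorithm of Section 5 implementable for finite standard models.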

Model-Based Conformance Tests.

In the following we recall the basic notions of ioco conformance testing [25, 26, 27], and refer to the mentioned literature for more details. In this setting, it is assumed that the implemented system under test (IUT) can be modelled as an IOTS ℐ, while the specification of the required behaviour is given in terms of an LTS 𝒮. The idea of whether the IUT conforms to the specification is formalised by means of the ioco relation, which we define in the following.

We first need to identify the quiescent (or suspended) states. A state is quiescent whenever it cannot proceed autonomously, i.e., it cannot produce an output. We will make each such state identifiable by adding a quiescence transition to it, in the form of a loop with the distinct label δ.

Definition 4

Let 𝒮 be an LTS. The quiescence closure (or δ-closure) of 𝒮 is the LTS 𝒮_δ obtained from 𝒮 by adding a transition (q, δ, q) for every quiescent state q. Using this we define the suspension traces of 𝒮 by Straces(𝒮) = traces*(𝒮_δ).

Let 𝒮 be an LTS with initial state q₀ and σ ∈ traces*(𝒮). We define 𝒮 after σ as the set of states reachable from q₀ via the trace σ. For a state q, let out(q) denote the set of outputs (including δ) enabled in q, and for a set of states Q′, let out(Q′) = ⋃ { out(q) | q ∈ Q′ }.

The idea behind the ioco relation is that any output produced by the IUT must have been foreseen by its specification, and moreover, any input in the IUT not foreseen in the specification may introduce new functionality. ioco captures this by harvesting concepts from refusal testing. As a result, ℐ ioco 𝒮 is defined to hold whenever out(ℐ_δ after σ) ⊆ out(𝒮_δ after σ) for all σ ∈ Straces(𝒮).
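The ioco check on finite behaviours can be sketched directly from the after/out definitions. The transition tables and label names below are invented for illustration; “?” marks inputs, “!” marks outputs, and "delta" stands for quiescence δ.

```python
def after(trans, init, sigma):
    """States reachable from init via the trace sigma."""
    states = {init}
    for label in sigma:
        states = {t for s in states for (l, t) in trans.get(s, []) if l == label}
    return states

def out(trans, states):
    """Outputs ('!'-suffixed labels) and quiescence enabled in the states."""
    return {l for s in states for (l, _) in trans.get(s, [])
            if l.endswith("!") or l == "delta"}

# Specification: after input a?, either output x! or quiescence.
spec = {"s0": [("a?", "s1")], "s1": [("x!", "s2"), ("delta", "s1")]}
# Candidate implementation that may produce the unforeseen output y!.
impl = {"q0": [("a?", "q1")], "q1": [("y!", "q2")]}

sigma = ("a?",)
conforms = out(impl, after(impl, "q0", sigma)) <= out(spec, after(spec, "s0", sigma))
print(conforms)  # False: y! is not foreseen by the specification
```

A full ioco check would iterate this inclusion over all suspension traces of the specification; the single-trace check shown here already suffices to reject this implementation.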

The base principle of conformance testing now is to assess by means of testing whether the IUT conforms to its specification w.r.t. ioco. An algorithm to derive a corresponding test suite T is available [26, 27], so that for any IUT ℐ, ℐ ioco 𝒮 iff ℐ passes all tests in T.

It is important to remark that the specification 𝒮 in the setting considered here is missing. Instead, we need to construct the specification from the standard inputs and the respective observed outputs, together with the distances and the thresholds given by the contract. Furthermore, this needs to respect the interaction required by the cleanness property (Def. 2).

Software Doping on LTS.

To capture the notion of software doping in the context of LTS, we provide two projections of a trace, projecting to a sequence of the appearing inputs, respectively outputs. To do this, we extend the set of labels by adding a fresh input label that indicates that in the respective step some output (or quiescence) was produced (but masking the precise output), and a fresh output label that indicates that in this step some (masked) input was given.

The projection on inputs, σ↾In, and the projection on outputs, σ↾Out, are defined for all traces σ as follows: σ↾In keeps every input of σ and replaces every output (and δ) by the masking input label; σ↾Out keeps every output (and δ) of σ and replaces every input by the masking output label. They are lifted to sets of traces in the usual elementwise way.
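A possible reading of the two projections in code (the masking label names IOTA and OMEGA are our own choice for this sketch; the paper's symbols differ):

```python
IOTA = "iota?"    # fresh *input* label: some output/quiescence occurred here
OMEGA = "omega!"  # fresh *output* label: some input occurred here

def is_input(label):
    return label.endswith("?")

def proj_in(trace):
    """Input projection: keep inputs, mask every output (and delta)."""
    return tuple(l if is_input(l) else IOTA for l in trace)

def proj_out(trace):
    """Output projection: keep outputs and delta, mask every input."""
    return tuple(OMEGA if is_input(l) else l for l in trace)

t = ("a?", "x!", "b?", "delta")
print(proj_in(t))   # ('a?', 'iota?', 'b?', 'iota?')
print(proj_out(t))  # ('omega!', 'x!', 'omega!', 'delta')
```

Both projections preserve the length of the trace, so positions still line up when input and output distances are computed over prefixes.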

Definition 5

An LTS 𝒮 is standard for an LTS ℐ if traces(𝒮) ⊆ traces(ℐ) and, for all σ ∈ traces(𝒮) and σ′ ∈ traces(ℐ), σ′↾In = σ↾In implies σ′ ∈ traces(𝒮).

The above definition provides our LTS-specific interpretation of the notion of StdIn for a given program modelled in terms of an LTS ℐ. StdIn is implicitly determined as the input sequences occurring in 𝒮, which contains both the standard inputs and the outputs that are produced in ℐ, and altogether covers exactly the traces of ℐ whose input sequences are in StdIn. If instead ℐ and StdIn are given, an LTS 𝒮 can be defined such that σ ∈ traces(𝒮) iff σ ∈ traces(ℐ) and σ↾In ∈ StdIn. This 𝒮 is indeed standard for ℐ and contains exactly all traces of ℐ whose input sequences are in StdIn.

In this new setting, we assume that the distance functions d_In and d_Out run on traces containing the masking labels introduced above, i.e., they are pseudometrics on the projected input and output traces, respectively.

Now the definition of robustly clean can be restated in terms of LTS as follows.

Definition 6

Let ℐ be an IOTS and 𝒮 a standard LTS for ℐ. ℐ is robustly clean if for all σ ∈ traces(𝒮) and σ′ ∈ traces(ℐ) such that d_In(σ↾In[..k], σ′↾In[..k]) ≤ κ_i for all k ∈ ℕ, the following holds:

  1. there exists σ″ ∈ traces(ℐ) s.t. σ″↾In = σ′↾In and d_Out(σ↾Out[..k], σ″↾Out[..k]) ≤ κ_o for all k.

  2. there exists σ″ ∈ traces(𝒮) s.t. σ″↾In = σ↾In and d_Out(σ″↾Out[..k], σ′↾Out[..k]) ≤ κ_o for all k.

Following the principles of model-based testing, Def. 6 takes specific care of quiescence in a system. In order to properly consider quiescence in the context of robust cleanness it must be treated as a distinguished output. As a consequence, in the presence of a contract, we use – instead of 𝒮 and ℐ – their quiescence closures 𝒮_δ and ℐ_δ, and an extended output distance d̂_Out defined as d̂_Out(σ, σ′) = d_Out(σ∖δ, σ′∖δ) if σ and σ′ have δ at exactly the same positions, and d̂_Out(σ, σ′) = ∞ otherwise, where σ∖δ is the same as σ with all δ removed.
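The extended output distance can be sketched as follows (illustrative Python; "delta" stands for the quiescence label δ, and the underlying distance is passed in as a function):

```python
import math

def ext_out_dist(d_out, s1, s2, DELTA="delta"):
    """Extended output distance: infinite if the two traces disagree on
    where quiescence occurs, otherwise the plain distance applied after
    removing all quiescence steps (a sketch; names are illustrative)."""
    if len(s1) != len(s2) or any((a == DELTA) != (b == DELTA)
                                 for a, b in zip(s1, s2)):
        return math.inf
    strip = lambda s: [x for x in s if x != DELTA]
    return d_out(strip(s1), strip(s2))

# Pointwise max-deviation distance on numeric output traces.
d = lambda a, b: max((abs(x - y) for x, y in zip(a, b)), default=0.0)
print(ext_out_dist(d, [1.0, "delta", 2.0], [1.5, "delta", 2.0]))  # 0.5
print(ext_out_dist(d, [1.0, "delta"], [1.0, 3.0]))                # inf
```

Making mismatched quiescence infinitely distant ensures that an implementation cannot hide doped behaviour behind silence where the standard produces output, or vice versa.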

Def. 6 echoes the semantics of the HyperLTL interpretation appearing in Proposition 19 of [10] restricted to programs with no parameters. Thus, the proof showing that Def. 6 is the correct interpretation of Def. 2 in terms of LTS, can be obtained in a way similar to that of Prop. 19 in [10].

4 Reference Implementation for Contracts

As mentioned before, doping tests need to be based on a contract 𝒞, which we assume given. 𝒞 specifies the input and output domains, a standard LTS 𝒮, the distances d_In and d_Out, and the bounds κ_i and κ_o. We intuitively expect the contract to be satisfiable in the sense that it never forces a single input sequence of the implementation to keep outputs close enough to two different executions of the specification while their outputs stretch too far apart. We show such a problematic case in the following example.

Example 2.

On the right a quiescence-closed standard LTS for an implementation (shown below) is depicted. For simplicity some input transitions are omitted. Consider the output transition of the implementation in question: it must carry one of two output labels, but we will see that either choice leads to a contradiction w.r.t. the output distances induced. The input projection of the middle path of the standard LTS is within input distance exactly κ_i of both the left and the right branch, so both branches must be considered to determine the admissible outputs. Picking the output of the right branch keeps the output distance to that branch below κ_o, but the output distance to the left branch becomes too high. Picking the output of the left branch does not work either, for symmetric reasons: the problem switches sides. Thus, neither choice satisfies robust cleanness here. Indeed, no implementation satisfying robust cleanness exists for the given contract.

We would expect that a correct implementation fully entails the standard behaviour. So, to satisfy a contract, the standard behaviour itself must be robustly clean. This and the need for satisfiability of particular inputs lead to Def. 7.

Definition 7 (Satisfiable Contract)

Let 𝒮, d_In, d_Out, κ_i and κ_o define some contract 𝒞. Let input i be the input projection of some trace. i is satisfiable for 𝒞 if and only if, for every standard trace σ ∈ traces(𝒮_δ) with d_In(σ↾In[..k], i[..k]) ≤ κ_i for all k, there is some implementation ℐ that satisfies Def. 6.2 w.r.t. 𝒞 and has some trace σ′ with σ′↾In = i and d̂_Out(σ↾Out[..k], σ′↾Out[..k]) ≤ κ_o for all k.

𝒞 is satisfiable if and only if all inputs are satisfiable for 𝒞 and 𝒮 is robustly clean w.r.t. the contract 𝒞. A contract that is not satisfiable is called unsatisfiable.

Given a satisfiable contract it is always possible to construct an implementation that is robustly clean w.r.t. this contract. Furthermore, for every satisfiable contract there is exactly one implementation (modulo trace equivalence) that contains all possible outputs that satisfy robust cleanness. Such an implementation is called the largest implementation.

Definition 8 (Largest Implementation)

Let 𝒞 be a contract and ℐ an implementation that is robustly clean w.r.t. 𝒞. ℐ is the largest implementation within 𝒞 if and only if for every implementation ℐ′ that is robustly clean w.r.t. 𝒞 it holds that traces(ℐ′) ⊆ traces(ℐ).

In the following, we will focus on the fragment of satisfiable contracts with standard behaviour defined by finite LTS. For unsatisfiable contracts, testing is not necessary, because every implementation is not robustly clean w.r.t. such a contract. Finiteness of 𝒮 will be necessary to make testing feasible in practice. For simplicity we will further assume past-forgetful output distance functions, i.e., d_Out(σ·o, σ′·o′) = d_Out(o, o′) for output traces of equal length ending in outputs o and o′. Thus, we simply assume that the output distances are determined by the last output only. We remark that d_Out(o, o) = 0 for all o.

We will now show how to construct the largest implementation for any contract (of the fragment we consider), which we name the reference implementation ℛ. It is derived from 𝒮 by adding inputs and outputs in such a way that, whenever the input sequence leading to a particular state is within κ_i distance of an input sequence of 𝒮, the outputs possible in that state are at most κ_o distant from those outputs possible in the corresponding states of 𝒮. This ensures that ℛ satisfies condition 2 of Def. 6.

Reference implementation.

To construct the reference implementation we decide to model the quiescence transitions explicitly instead of using the quiescence closure. We preserve the property that in each state of the LTS it is possible to take an output or a quiescence transition. The construction of ℛ proceeds by adding all transitions that satisfy the second condition of Def. 6.

Definition 9

Given a standard LTS 𝒮, bounds κ_i and κ_o, and distances d_In and d_Out, the reference implementation is the LTS ℛ whose states are finite traces, with the empty trace ε as initial state, and whose transition relation contains (σ, a, σ·a) for every input a, as well as for every output (or δ) a such that, for each standard trace of 𝒮_δ whose input projection is within κ_i distance of that of σ·a, some output (or δ) enabled there is within (extended) output distance κ_o of a.

Notably, ℛ is deterministic, since only transitions of the form (σ, a, σ·a) are added. As a consequence of this determinism, outputs and quiescence may coexist as options in a state, i.e., they are not mutually exclusive.
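The core of the construction, deciding which outputs a state of ℛ may offer, can be sketched for a discretised output domain and past-forgetful distances. All names and numeric values here are illustrative assumptions, not the paper's concrete parameters.

```python
def allowed_outputs(std_outputs, step, kappa_o, domain):
    """Outputs the reference implementation permits at a given step:
    every discretised output value within kappa_o of some output the
    standard behaviour can produce at that step (past-forgetful
    distance, i.e. only the last output matters)."""
    return {o for o in domain
            if any(abs(o - so) <= kappa_o for so in std_outputs[step])}

# Standard behaviour with one possible output per step (illustrative).
std = [{10.0}, {20.0}]
domain = [float(v) for v in range(0, 41, 5)]   # discretised output values
print(allowed_outputs(std, 0, 5.0, domain))    # {5.0, 10.0, 15.0}
```

In the full construction this check is performed against every standard trace whose input projection is within κ_i of the current one, which is what makes unsatisfiable contracts (Example 2) yield an empty set of admissible outputs.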

Figure 1: The reference implementation ℛ of the standard LTS in Example 3.

Example 3.

Fig. 1 gives a schematic representation of the reference implementation ℛ for the standard LTS on the right. Input (output) actions are denoted with the letter i (o, respectively); quiescence transitions are omitted. We use Euclidean distances on the values of the last input and the last output. For this example, the quiescence closure looks like the standard LTS but with δ-loops added to its states. Interval labels should be interpreted as any value in the respective interval, appropriately considering closed and open boundaries; one label represents any other input not explicitly considered leaving the same state; and two further labels represent any possible input and output (including δ), respectively. In any case, values outside the given domains are not considered since they are not part of the alphabet of the LTS. Also, we note that any possible sequence of inputs becomes enabled in the last states (omitted in the picture).

Robust cleanness of reference implementation.

In the following, the aim is to show that ℛ is robustly clean. By construction, each state of ℛ equals the trace that leads to that state; this can be shown by induction. As a consequence, a path in ℛ can be completely identified by the trace it defines. The following lemma states that ℛ preserves all traces of the standard 𝒮 it is constructed from. This can be proven by using that 𝒮 is robustly clean w.r.t. the (satisfiable) contract (see Def. 7).

Lemma 1 ()

Let ℛ be constructed from contract 𝒞 with standard 𝒮. Then traces(𝒮_δ) ⊆ traces(ℛ).

The following theorem states that the reference implementation is robustly clean w.r.t. the contract it was constructed from.

Theorem 4.1 ()

Let ℛ be constructed from 𝒞. Then ℛ is robustly clean w.r.t. 𝒞.

Furthermore, it is not difficult to show that ℛ is indeed the largest implementation within the contract it was constructed from.

Theorem 4.2 ()

Let ℛ be constructed from contract 𝒞. Then ℛ is the largest implementation within 𝒞.

5 Model-Based Doping Tests

Following the conceptual ideas behind ioco, we need to construct a specification that is compatible with our notion of robust cleanness in such a way that a test suite can be derived. Intuitively, such a specification must be able to foresee every behaviour of the system that is allowed by the contract. We will take the reference implementation ℛ from the previous section as this specification. Indeed, we claim that ℛ is constructed in such a way that whenever an IUT ℐ is robustly clean, ℐ ioco ℛ holds. The latter translates to

Theorem 5.1 ()

Let 𝒞 be a contract with standard 𝒮, and let the IOTS ℐ be robustly clean w.r.t. 𝒞 with traces(𝒮_δ) ⊆ traces(ℐ_δ). If ℛ is constructed from 𝒞, then ℐ ioco ℛ.

The key observations to prove this theorem are: (i) the reference implementation ℛ is the largest implementation within the contract, i.e., if the IUT is robustly clean, then all its traces are covered by ℛ, and (ii) by construction of ℛ and satisfiability of 𝒞, the suspension traces of ℛ are exactly its finite traces.

0  Input: the history σ of the test run so far
0  Output: pass or fail
1  c := choose({1, 2, 3})  /* Pick from one of three cases */
2  if c = 1 then
3     return pass /* Finish test generation */
4  else if c = 2 and no output from the SUT is available then
5     i := nextInput(σ)  /* Pick next input */
6     send i to the SUT  /* Forward input to SUT */
7     return DT(σ · i)   /* Continue with next step */
8  else if c = 3 or an output from the SUT is available then
9     o := nextOutput()  /* Receive output (or quiescence δ) from SUT */
10     if o ∈ O(σ) then
11        return DT(σ · o)   /* If o is foreseen by oracle continue with next step */
12     else
13        return fail /* Otherwise, report test failure */
14     end if
15  end if
Algorithm 1 Doping Test DT(σ)

Test Algorithm.

An important element of the model-based testing theory is a non-deterministic algorithm to generate test cases. A set of test cases is called a test suite. It is shown elsewhere [27] that there is an algorithm that can produce a (possibly infinitely large) test suite T, for which a system ℐ passes T if ℐ is correct w.r.t. the specification and, conversely, ℐ is correct w.r.t. the specification if ℐ passes T. The former property is called soundness and the latter is called exhaustiveness. Algorithm 1 shows a tail-recursive algorithm DT to test for robust cleanness. This algorithm takes as an argument the history σ of the test currently running. Every doping test is initialized by DT(ε). Several runs of the algorithm constitute a test suite. Each test can either pass or fail, which is reported by the output of the algorithm. In each call DT picks one of three choices: (i) it either terminates the test by returning pass (line 3), (ii) if there is no pending output that has to be read from the system under test, the algorithm may pick a new input and pass it to the system (lines 5-6), or (iii) it reads and checks the next output (or quiescence) that the system produces (lines 9-10). Quiescence can be recognized by using a timeout mechanism that reports δ if no output has been received in a given amount of time. In the original algorithm, the case c and the next input are determined non-deterministically. Our algorithm is parameterized by the choices of c and of the next input, which can be instantiated by either non-determinism or some optimized test-case selection. Until further notice we assume non-deterministic selection. An output or quiescence that has been produced by the IUT is checked by means of an oracle O (line 10). The oracle reflects the reference implementation ℛ that is used as the specification for the ioco relation and is defined in equation (2).


Given a finite execution σ, O(σ) returns the set of acceptable outputs after σ, which corresponds exactly to the set of outputs enabled in ℛ after σ. Thus O(σ) is precisely the set of outputs that satisfy the premise in the definition of ℛ after the trace σ, as stipulated in Def. 9.

We refer to O as an oracle, because it cannot be computed in general due to the infinitely many traces of ℛ involved in the definition. However, we get the following theorem stating that the algorithm is sound and exhaustive with respect to ioco (and we present a computable algorithm in the next section). The theorem follows from the soundness and exhaustiveness of the original test generation algorithm for model-based testing and Def. 9.
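A possible rendering of the shape of Algorithm 1 in executable form, against a mock system under test: the SUT interface (doping_test, EchoSUT), the oracle, and all labels are invented for illustration, while the real framework interacts with a physical system such as a car.

```python
import random

def doping_test(sut, oracle, inputs, max_steps, seed=0):
    """On-the-fly doping test: repeatedly either stop, send an input,
    or check the SUT's next output against the oracle (the three-case
    structure of Algorithm 1, with bounded recursion depth)."""
    rng = random.Random(seed)
    history = []
    for _ in range(max_steps):
        c = rng.choice([1, 2, 3])
        if c == 1:
            return "pass", history          # finish test generation
        elif c == 2:
            i = rng.choice(inputs)          # pick next input
            sut.send(i)                     # forward input to SUT
            history.append(i)
        else:
            o = sut.receive()               # next output (or "delta")
            if o in oracle(tuple(history)):
                history.append(o)           # foreseen by oracle: continue
            else:
                return "fail", history      # report test failure
    return "pass", history

class EchoSUT:
    """Mock SUT: outputs the last input it received ('delta' if none)."""
    def __init__(self):
        self.last = "delta"
    def send(self, i):
        self.last = i
    def receive(self):
        return self.last

# A permissive oracle that foresees every echoed label: the test passes.
verdict, _ = doping_test(EchoSUT(),
                         lambda h: {"a?", "b?", "delta"},
                         ["a?", "b?"], max_steps=10)
print(verdict)  # pass
```

Replacing the permissive lambda by an oracle derived from ℛ, as in equation (2), turns this skeleton into the actual doping test.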

Theorem 5.2 ()

Let 𝒞 be a contract with standard 𝒮. Let ℐ be an implementation with traces(𝒮_δ) ⊆ traces(ℐ_δ), and let ℛ be the largest implementation within 𝒞. Then ℐ ioco ℛ if and only if every test execution of DT on ℐ returns pass.

Together with Theorem 5.1 and satisfiability of 𝒞, we derive the following corollary.

Corollary 1

Let 𝒞 be a contract with standard 𝒮. Let ℐ be an implementation with traces(𝒮_δ) ⊆ traces(ℐ_δ). If ℐ is robustly clean, then every test execution of DT on ℐ returns pass.

It is worth noting that in Corollary 1 we do not get that ℐ is robustly clean if ℐ always passes DT. This is due to the intricacies of genuine hyperproperties. By testing, we will never be able to verify the first condition of Def. 6, because this needs a simultaneous view on all possible execution traces of ℐ. During testing, however, we can only ever observe one trace.

Finite Doping Tests.

As mentioned before, the execution of DT is not possible, because the oracle O is not computable. There is, however, a computable version O_b of O for executions up to some test length b and for bounded and discretised input and output domains. Even for infinite executions, b can be seen as a limit of interest, and testing is still sound. O_b is shown in eq. (3). The only variation w.r.t. O lies in the use of the set of traces of ℛ whose length is exactly b, instead of all traces of ℛ. Since this set is finite, function O_b can be implemented.


Now we get a new algorithm by replacing by in  and by forcing case 1 when and only when . We get a similar soundness theorem for as in Corollary 1.

Theorem 5.3 ()

Let be a contract with standard . Let be an implementation with . If is robustly clean, then for every boundary and every test execution it holds that .

Since implies for any , we have in summary arrived at an on-the-fly algorithm that for sufficiently large (corresponding to the length of the test) will be able to conduct a “convicting” doping test for any IUT that is not robustly clean w.r.t. a given contract . The bounded-depth algorithm effectively circumvents the fact that, except for and , all other objects we need to deal with are countably or uncountably infinite and that the property we check is a hyperproperty.
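As a minimal illustrative sketch (not the paper's algorithm itself), the bounded-depth check can be phrased as follows for the special case of a single deterministic standard trace, with user-supplied distance functions and thresholds; all names are hypothetical:

```python
def doping_test(std, rec, d_in, d_out, kappa_i, kappa_o):
    """Bounded-depth doping test against a single deterministic
    standard trace. Both traces are lists of (input, output) pairs
    sampled at the same equidistant times; the test depth is the
    length of the shorter trace."""
    for t in range(min(len(std), len(rec))):
        # Premise of robust cleanness: all inputs up to and including
        # time t stayed within distance kappa_i of the standard inputs.
        inputs_close = all(
            d_in(std[k][0], rec[k][0]) <= kappa_i for k in range(t + 1)
        )
        # If the premise holds, the observed output at time t must be
        # within kappa_o of the standard output; otherwise fail.
        if inputs_close and d_out(std[t][1], rec[t][1]) > kappa_o:
            return "fail"
    return "pass"
```

Note that a trace whose inputs leave the kappa_i neighbourhood trivially passes: the contract then imposes no obligation on the outputs.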

We implemented a prototype of a testing framework using the bounded-depth algorithm. The specification of distances, value domains and test case selection are parameters of the algorithm that can be instantiated for a concrete test scenario. This flexibility enables us to use the framework in a two-step approach for cyber-physical systems that are not equipped with a digital interface for forwarding inputs to the system: first, the tool generates test inputs, which are executed by a human or a robot on the CPS under test. The actual inputs (possibly deviating from the generated ones) and the outputs of the system are recorded, so that in the second step our tool can determine whether the (actual) test is passed or failed.

6 Evaluation

The normed emission test NEDC (New European Driving Cycle, see Fig. 2) was the legally binding framework in Europe [28] at the time the scandal surfaced. It is carried out on a chassis dynamometer, and all relevant parameters are fixed by the norm, including, for instance, the outside temperature at which it is run.

Figure 2: NEDC speed profile (speed in km/h over time in s).

For a given car model, the normed test induces a standard LTS as follows. The input dimensions are spanned by the sensors the car model is equipped with (including e.g. temperature of the exhaust, outside temperature, vertical and lateral acceleration, throttle position, time after engine start, engine rpm, possibly height above ground level etc.), which are accessible via the standardized OBD-2 interface [24]. The output is the amount of NOx per kilometre that has been emitted since engine start. Inputs are sampled at equidistant times (once per second). The standard LTS is obtained from the trace representing the observations of running the NEDC on the chassis dynamometer, with the inputs given by the NEDC over its 20-minute (1180-second) duration and the output being the amount of NOx gases accumulated during the test procedure. This is the only standard trace of our experiments. The trace ends with an infinite suffix of quiescence steps.

The input space is a vector space spanned by all possible input parameter dimensions. Among these, we distinguish the speed dimension (measured in km/h). We can use past-forgetful distances, which compare only the values observed at the current time and forget all earlier differences. The speed is the decisive quantity defined to vary along the NEDC (cf. Fig. 2); hence the input distance is determined by the difference in speed, regardless of the values of the other parameters. As output we take the average amount of NOx gases per kilometre since engine start (in mg/km), with the output distance defined analogously on these values.
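As an illustrative sketch, such past-forgetful distances can be written as plain functions that compare only the current samples; the input field names used here are hypothetical:

```python
# Past-forgetful input distance: only the current samples are compared,
# and of the input vector only its speed component (km/h) matters,
# mirroring the NEDC setup described above.
def d_in(i1, i2):
    # All other sensor dimensions (temperature, rpm, ...) are ignored.
    return abs(i1["speed"] - i2["speed"])

# Output distance on the average emissions per kilometre since
# engine start (mg/km).
def d_out(o1, o2):
    return abs(o1 - o2)
```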

Doping tests in practice.

For the purpose of practically exercising doping tests, we picked a Renault 1.5 dci (110hp) Diesel engine. This engine runs, among others, inside a Nissan NV200 Evalia, which is classified as a Euro 6 car. The test cycle used in the original type approval of the car was the NEDC (which corresponds to Euro 6b). Emissions are cleaned using exhaust gas recirculation (EGR). The technical core of EGR is a valve between the exhaust and intake pipe, controlled by software. EGR is known to possibly cause performance losses, especially at higher speed. Car manufacturers might thus be tempted to optimise EGR usage for engine performance unless facing a known test cycle such as the NEDC.

We fixed a contract with km/h, mg/km. We report here on two of the tests we executed apart from the NEDC reference: (i) PowerNEDC is a variation of the NEDC, where acceleration is increased from to in phase 6 of the NEDC elementary urban cycle (i.e. after and ) and (ii) SineNEDC defines the speed at time to be the speed of the NEDC at time plus (but capped at ). Both can be generated by for specific deterministic and . For instance, SineNEDC is given below. Fig. 3 shows the initial 200s of SineNEDC (red, dotted).
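A hypothetical sketch of such a sine-modulated input generator is shown below; the concrete amplitude, frequency and speed cap used for SineNEDC in the experiments are not reproduced here, so the constants are purely illustrative:

```python
import math

# Illustrative stand-in constants, not the paper's actual parameters.
AMPLITUDE = 5.0    # km/h
OMEGA = 0.5        # rad/s
SPEED_CAP = 130.0  # km/h

def sine_nedc(nedc_speed, t):
    """Speed at time t: the NEDC speed at t plus a sine perturbation,
    clamped to the interval [0, SPEED_CAP]."""
    v = nedc_speed(t) + AMPLITUDE * math.sin(OMEGA * t)
    return min(SPEED_CAP, max(0.0, v))
```

Any deterministic perturbation of this shape stays within the contract's input neighbourhood as long as its amplitude is below the input threshold.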

Figure 3: Initial 200 s of SineNEDC (red, dotted), its test drive (green) and the NEDC driven (blue, dashed).

The car was fixed on a Maha LPS 2000 dynamometer and attached to an AVL M.O.V.E iS portable emissions measurement system (PEMS, see Fig. 4), with speed data sampled at a rate of 20 Hz and averaged to match the 1 Hz rate of the NEDC. The human driver effectuated the NEDC with a deviation of at most 9 km/h relative to the reference (notably, the results obtained for the NEDC are not consistent with the car data sheet, likely caused by lacking calibration and the absence of any further manufacturer-side optimisations).

Figure 4: Nissan NV200 Evalia on a dynamometer
                    NEDC    Power    Sine
Distance [m]       11,029  11,081  11,171
Avg. Speed [km/h]      33      29      34
CO2 [g/km]            189     186     182
NOx [mg/km]           180     204     584

Table 1: Dynamometer measurements (sample rate: 1 Hz)

The PowerNEDC test drive differed by less than 15 km/h and the SineNEDC by less than 14 km/h from the NEDC test drive, so both inputs deviate by less than the contract's input threshold. The green line in Fig. 3 shows SineNEDC as driven. The test outcomes are summarised in Table 1. They show that the amount of CO2 for the two tests is lower than for the NEDC driven. The NOx emissions of PowerNEDC deviate by around 24 mg/km, which is clearly below the contract's output threshold. But the SineNEDC produces about 3.24 times the amount of NOx, that is 404 mg/km more than what we measured for the NEDC, which is a violation of the contract. This result can be verified with our algorithm a posteriori, namely by replaying the actually executed test inputs (which differ from the test inputs generated upfront due to human driving imprecision) and by feeding the outputs recorded by the PEMS into the algorithm. As expected, this makes the recording of PowerNEDC return pass and the recording of SineNEDC return fail.
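Since both recorded drives stayed within the input neighbourhood of the standard, the verdicts reduce to a comparison of the aggregate mg/km values against the output threshold. A one-line sketch with an illustrative threshold (the contract's concrete value is not repeated here; only that 24 mg/km lies below it and 404 mg/km above it):

```python
# Illustrative output threshold in mg/km, a stand-in for the
# contract's actual kappa_o.
KAPPA_O = 180.0

def verdict(std_mg_per_km, rec_mg_per_km):
    # Assumes the inputs already stayed within the input threshold,
    # as established for both recorded test drives above.
    return "pass" if abs(rec_mg_per_km - std_mg_per_km) <= KAPPA_O else "fail"

print(verdict(180, 204))  # PowerNEDC: deviation 24 mg/km -> pass
print(verdict(180, 584))  # SineNEDC: deviation 404 mg/km -> fail
```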

Our algorithm is powerful enough to detect other kinds of defeat devices, like those uncovered in the investigations of the Volkswagen and Audi cases. Due to lack of space, we cannot present the concrete standards and contracts for these examples.

7 Discussion

Related Work.

The present work complements white-box approaches to software doping, like model checking [10] or static code analysis [9], by a black-box testing approach, for which the specification is given implicitly by a contract and which is usable for on-the-fly testing. Existing test frameworks like TGV [18] or TorX [29] provide support for the last step; however, they fall short in scenarios where, among others, (i) the specification is not at hand and (ii) the test input is distorted in the testing process, e.g., by a human driving a car under test.

Our work is based on the definition of robust cleanness [10] which has conceptual similarities to continuity properties [6, 17] of programs. However, continuity itself does not provide a reasonably good guarantee of cleanness. This is because physical outputs (e.g. the amount of  gas in the exhaust) usually do change continuously. For instance, a doped car may alter its emission cleaning in a discrete way, but that induces a (rapid but) continuous change of  gas concentrations. Established notions of stability and robustness [23, 13, 19, 21] differ from robust cleanness in that the former assure the outputs (of a white-box system model) to stabilize despite transient input disturbances. Robust cleanness does not consider perturbations but (intentionally) different inputs, and needs a hyperproperty formulation.

Concluding Remarks.

This work lays the theoretical foundations for black-box testing approaches geared towards uncovering doped software. As in the diesel emissions scandal – where manufacturers were forced to pay excessive fines [22] and where executive managers are facing lawsuits or indeed went to prison [14, 5] – doped behaviour is typically strongly related to illegal behaviour.

As we have discussed, software doping analysis comes with several challenges: (i) it can be performed only after production time, on the final embedded or cyber-physical product; (ii) it notoriously happens without support by the manufacturer; (iii) the property to be checked belongs to the class of hyperproperties with alternating quantifiers; and (iv) non-determinism and imprecision caused by a human in the loop complicate doping analysis of CPS even further.

Conceptually central to the approach is a contract that is assumed to be explicitly offered by the manufacturer. The contract itself is defined by very few parameters, making it easy to form an opinion about a concrete contract. And even if a manufacturer is not willing to provide such contractual guarantees, a contract with very generous parameters can still provide convincing evidence of doping if a test uncovers a violation of it. We showed this in a real automotive example, demonstrating how a legally binding reference behaviour and a contract together induce a finite-state LTS, enabling us to harness input-output conformance testing for doping tests. We developed an algorithm that can be attached directly to a system under test, or used in a three-step process: first generating a valid test case, which afterwards guides a human interacting with the system, possibly adding distortions, followed by an a-posteriori validation of the recorded trajectory. For more effective test case selection [15, 11], we are exploring different guiding techniques [2, 12, 1] for cyber-physical systems.


  • [1] Adimoolam, A.S., Dang, T., Donzé, A., Kapinski, J., Jin, X.: Classification and coverage-based falsification for embedded control systems. In: Majumdar, R., Kuncak, V. (eds.) Computer Aided Verification - 29th International Conference, CAV 2017, Heidelberg, Germany, July 24-28, 2017, Proceedings, Part I. Lecture Notes in Computer Science, vol. 10426, pp. 483–503. Springer (2017).
  • [2] Annpureddy, Y., Liu, C., Fainekos, G.E., Sankaranarayanan, S.: S-taliro: A tool for temporal logic falsification for hybrid systems. In: Abdulla, P.A., Leino, K.R.M. (eds.) Tools and Algorithms for the Construction and Analysis of Systems - 17th International Conference, TACAS 2011, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2011, Saarbrücken, Germany, March 26-April 3, 2011. Proceedings. Lecture Notes in Compute