Contextualisation of Data Flow Diagrams for security analysis

06/07/2020 ∙ by Shamal Faily, et al. ∙ Bournemouth University ∙ Göteborgs universitet

Data flow diagrams (DFDs) are popular for sketching systems for subsequent threat modelling. Their limited semantics make reasoning about them difficult, but enriching them endangers their simplicity and subsequent ease of take up. We present an approach for reasoning about tainted data flows in design-level DFDs by putting them in context with other complementary usability and requirements models. We illustrate our approach using a pilot study, where tainted data flows were identified without any augmentations to either the DFD or its complementary models.




1 Introduction

Data Flow Diagrams (DFDs) are useful as a sketch that explores how a system and its elements might be exploited; their simplicity makes it possible for different people with different levels of expertise to contribute to the security analysis of a system as it evolves.

As DFDs become more critical to security design practices, the need to reason about their properties using software tools grows too. Limitations around cognitive ability, expertise and time constrain the effectiveness of modellers when scaling up or making decisions around DFDs [sim79]. However, their limited semantics make reasoning with DFDs alone difficult; this leads to an inherent trade-off between using easy-to-adopt notations and those that afford automated reasoning but are more elaborate.


Data flows are analogous to information flows. Information flow analysis (like taint analysis) is a long-established technique for reasoning about the interactions of data within entities, and their impact on security as the data flows through the system [denn79, yiso07]. Unfortunately, visual inspection alone is insufficient for spotting potential issues with data inside data flows. Formal policy specifications and binary instructions provide the context necessary to reason about tainted information flows, but DFDs lack this level of precision. The options are either (i) adding additional information to the diagram itself, or (ii) providing context via other models aligned with DFDs. In the related work, the first route has been extensively explored [tcs18, tusb19], so this paper takes the less followed second path. Usability models could play a particularly important role in providing such context. For example, in [fail18], usability models describe the main tasks performed by a software system, and the roles associated with those tasks. The models relate to the overall goals and requirements of the system. Just as DFDs provide early insights into how systems might be exploited, usability models indicate where interaction problems might subsequently facilitate exploitation. These different models might be produced independently and, with inter-operable tools, we can reason about the security impact these models have on DFDs, and vice-versa.

Contribution. In this short paper, we present an approach for identifying potential taint in design-level DFDs. Our guiding principle is that, to encourage adoption, DFDs should be no more graphically complex than they currently are. Instead, we should leverage the alignment between DFDs and other usability and requirements models. We present the related work upon which our approach is based in Section 2 before presenting the key concepts and algorithms in our approach in Section 3. We illustrate our approach in Section 4 by using it to identify pre-process and post-process taint in a critical infrastructure pilot study, before discussing the implications of this work in Section 5.

2 Related Work and Background

2.1 Reasoning about Data Flow Diagrams in Threat Modelling

Data Flow Diagrams (DFDs) graphically model flows of information (data flows) between human or system actors external to a system (entities), activities that manipulate data (processes), and persistent data storage (data stores) [yoco79]. This notation is often extended with trust boundaries: dotted boxes encompassing DFD elements operating at the same level of privilege. Trust boundaries help identify data flows that cross privilege levels [shos14].

DFDs have overlapping functions. Diane (a diagram creator) creates a DFD that diagrammatically represents her mental model. On viewing the DFD, Elaine (an engineer) internalises this mental model and requests changes. Dialogue around their differences subsequently brings both mental models closer together. Francis (a formal modeller) crafts a structured representation of a system, from which subsequent reasoning can be performed. This relationship between a mental model, a diagram, and a formal model has not been well explored.

Tuma et al. [tsws18] first examined the potential of using information flow analysis to reason about DFDs. They extended the DFD notation by labelling data flows with assets and their security properties, indicating the source and target of assets, including domain properties and assumptions from the KAOS modelling language [lams09]. In later work, Tuma et al. [tusb19] further illustrate the potential for using DFDs for design-level information flow analysis. In their approach, a domain specific language is used to model DFDs annotated with security labels. The model is subsequently rendered as a graph and statically analysed.

Antigac et al. [anss18] examined how certain properties of a DFD can be hotspots for further investigation. For example, a usage hotspot corresponds with 3 DFD elements: the data flow into a process p, the process p itself, and the data flow from p. Antigac et al. showed how such hotspots bridge the gap between different models, and provide a basis for subsequent model transformation without fundamentally changing the visual semantics of DFDs.

2.2 Security and Software Design Meta-models

Meta-models specify how model concepts are associated. In doing so, they guide analysts in collecting and analysing model data, and guide tool builders in constructing tools to support them. The software engineering community has examined the relationship between software and requirement modelling approaches and security, as summarised by [matu17]. These approaches do not, however, account for the role played by usability data and models. The IRIS (Integrating Requirements and Information Security) meta-model was devised to provide guidance on how early-stage design concepts from usability as well as security and requirements engineering might be aligned [fail18]. A sub-set of the IRIS concepts relevant to this paper is provided in Figure 1.

Figure 1: A UML class diagram showing the IRIS concepts related to threat modelling (red), usability (blue) and requirements modelling (grey)

Coles et al. [cofk18] demonstrated how use cases and assets provide the concepts necessary to threat model with data flow diagrams, and how – in addition to modelling system goals – the KAOS modelling language [lams09] is also suitable for modelling attack trees as obstacles. To make attacker assumptions more explicit, IRIS supports the specification of attackers. Attackers need not be intrinsically malicious, but they will have some motivations as drivers for carrying out an attack, and capabilities that provide the knowledge and resources necessary to mount and sustain any threat. IRIS draws its taxonomy of motivations from [van07], and capabilities from [joas05]. An additional motivation of productivity was also added to better reflect non-malicious attackers who intentionally or unintentionally commit harm to get their job done.

To leverage the outputs of user research in security design, two popular usability modelling concepts are supported by IRIS. Personas are specifications of archetypical user behaviour [core14]; they not only capture user goals and expectations, but their construction and usage help elicit security requirements [fafl106]. Tasks are narrative scenarios that describe both the personas and the broader system, including use cases, in context.

3 Approach

Our approach focuses on how tainted data flows cast doubt on the safety of the data they carry. Unlike traditional taint analysis on program source code, the origins of data flow taint in our approach could be human error resulting from human entities and processes, or issues resulting from the DFDs and associated specifications. These problems could have an indeterminate impact on affected endpoints, thereby warranting further investigation. Aligning DFDs with usability and requirements models provides context to assist such an investigation.

Assuming the pre-requisite models exist, our approach validates them using the analysis checks described in Section 3.2. Because of its alignment with the DFD concepts as shown in Figure 1, our approach relies on the IRIS meta-model. DFD processes are analogous with use cases, and actors in use cases could be human or system entities. DFDs directly link to usability models because use cases, as processes, put tasks in context. DFDs are also indirectly linked because roles constituting use case actors are also fulfilled by personas – who interact in tasks – putting these roles in context.

3.1 Dataflow specification

DFDs are graphs, but can be specified as a set of data flow types. In our approach, a data flow consists of a label, the names of the DFD elements the data flows from and to, and the types of these elements, where a NODE is either an entity, a process, or a datastore. Data flows also specify the information assets (as a set of DATA) they carry. Using Z [woda96], we can express a data flow formally, where the predicate part of the schema contains the well-formedness constraints:

DataFlow
    label, from, to : STRING
    fromType, toType : NODE
    assets : ℙ DATA
  ----------------------------------------------
    assets ≠ ∅
    ((fromType = entity) ∧ (toType = process)) ∨
    ((fromType = process) ∧ (toType = entity)) ∨
    ((fromType = datastore) ∧ (toType = process)) ∨
    ((fromType = process) ∧ (toType = datastore)) ∨
    ((fromType = process) ∧ (toType = process))
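The well-formedness constraints in the schema can also be checked programmatically. The following is a minimal Python sketch, not the CAIRIS implementation: the class and field names (`Node`, `DataFlow`, `from_name`, `is_well_formed`) are illustrative assumptions, and only the two predicates from the schema (non-empty assets, permitted source and target types) are enforced.

```python
from dataclasses import dataclass
from enum import Enum


class Node(Enum):
    """The three DFD element types a data flow can connect."""
    ENTITY = "entity"
    PROCESS = "process"
    DATASTORE = "datastore"


# Permitted (fromType, toType) pairs, one per disjunct in the schema predicate.
ALLOWED_PAIRS = {
    (Node.ENTITY, Node.PROCESS),
    (Node.PROCESS, Node.ENTITY),
    (Node.DATASTORE, Node.PROCESS),
    (Node.PROCESS, Node.DATASTORE),
    (Node.PROCESS, Node.PROCESS),
}


@dataclass(frozen=True)
class DataFlow:
    """Hypothetical Python counterpart of the DataFlow schema."""
    label: str
    from_name: str
    to_name: str
    from_type: Node
    to_type: Node
    assets: frozenset

    def __post_init__(self):
        # assets ≠ ∅
        if not self.assets:
            raise ValueError(f"data flow '{self.label}' carries no assets")
        # The source and target types must match one of the permitted pairs.
        if (self.from_type, self.to_type) not in ALLOWED_PAIRS:
            raise ValueError(
                f"data flow '{self.label}' cannot go from "
                f"{self.from_type.value} to {self.to_type.value}"
            )


def is_well_formed(**fields):
    """Return True if the given fields describe a well-formed data flow."""
    try:
        DataFlow(**fields)
        return True
    except ValueError:
        return False
```

Under this sketch, a flow directly between an entity and a datastore is rejected, because every permitted pair has a process at one end or the other.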

3.2 Pre-Process and Post-Process analysis

For each entity in the DFD, our approach first visits the entity’s data flows using the recursive graph traversal function described in Algorithm 1. The function populates a persistent array of unique data flow sequences (allSeqs), and a persistent set of previously visited DFD elements (visited).

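Input : currentNode - the DFD element being visited, prefix - the sequence of data flows followed so far
Data: allSeqs - array of unique data flow sequences, visited - set of previously visited DFD elements, nodeFlows - data flows leaving currentNode
Function dataFlows(currentNode, prefix) is
         visited.add(currentNode);
         nodeFlows ← data flows leaving currentNode;
         if nodeFlows = ∅ then
                  if prefix.length > 0 then
                           allSeqs.append(prefix);
                  end if
         else
                  while df ∈ nodeFlows do
                           seq ← prefix;
                           seq.append(df);
                           if df.to ∈ visited then
                                    allSeqs.append(seq);
                           else
                                    dataFlows(df.to, seq);
                           end if
                  end while
         end if
         return;
end
Algorithm 1 Identification of data flows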
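The traversal can be sketched in Python as follows. This is an illustrative reading of Algorithm 1 rather than the CAIRIS implementation: the function name `enumerate_sequences`, the tuple representation of data flows, and the choice to start from a single entity are assumptions made for the example.

```python
from collections import defaultdict


def enumerate_sequences(flows, start):
    """Enumerate data flow sequences reachable from one DFD element.

    flows: iterable of (label, from_name, to_name) tuples (assumed shape).
    start: name of the element (typically an entity) to start from.
    """
    # Index outgoing flows by source element.
    outgoing = defaultdict(list)
    for label, src, dst in flows:
        outgoing[src].append((label, dst))

    all_seqs = []     # persistent list of unique sequences
    visited = set()   # persistent set of visited DFD elements

    def record(seq):
        if seq not in all_seqs:
            all_seqs.append(seq)

    def data_flows(current, prefix):
        visited.add(current)
        node_flows = outgoing.get(current, [])
        if not node_flows:
            # Dead end: the prefix is a complete sequence.
            if prefix:
                record(prefix)
            return
        for label, dst in node_flows:
            seq = prefix + [label]  # copy, so sibling branches stay independent
            if dst in visited:
                # Stop at already-visited elements to avoid looping forever.
                record(seq)
            else:
                data_flows(dst, seq)

    data_flows(start, [])
    return all_seqs
```

Because `visited` persists across the whole traversal, a sequence ends as soon as it reaches an element that any earlier branch has already visited.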
Input : dfSeq - a data flow sequence
Data: contextualisedTask, taskAssets, personaRoles, taskPersonas, roleAttackers, allAttackerRoles, attackerMotivation, attackerCapability, taskDemand, goalConflict, processExceptions, obstructedGoals, obstacleAssets, nameToProcess, logPreProcessTaint - logs taint to process resulting from named task, logPostProcessTaint - logs taint to process resulting from named obstructed goal
1 Function analyseDataFlows(dfSeq) is
2          while df ∈ dfSeq do
                   /* Check for pre-process taint */
3                   if df.fromType = entity ∧ df.toType = process ∧ df.fromName is a human entity then
4                            while t ∈ contextualisedTask (nameToProcess df.toName) do
5                                     if df.assets ∩ taskAssets t ≠ ∅ then
6                                              while r ∈ (personaRoles (taskPersonas t) ∩ allAttackerRoles) do
7                                                       while a ∈ roleAttackers r do
8                                                                if (Productivity ∈ attackerMotivation a) ∧ (Low Time ∈ attackerCapability a) ∧ ((taskDemand t ∈ {Medium,High}) ∨ (goalConflict t ∈ {Medium,High})) then
9                                                                         logPreProcessTaint (nameToProcess df.toName) t;
10                                                                end if
11                                                       end while
12                                              end while
13                                     end if
14                            end while
15                   end if
                   /* Check for post-process taint */
16                   if df.fromType = process then
17                            while e ∈ processExceptions df.fromName do
18                                     if (obstacleAssets e ∩ df.assets) ≠ ∅ then
19                                              while g ∈ obstructedGoals e do
20                                                       if isObstacleObstructed e = true then
21                                                                logPostProcessTaint (nameToProcess df.fromName) g;
22                                                       end if
23                                              end while
24                                     end if
25                            end while
26                   end if
27          end while
28         return;
29 end
Algorithm 2 Taint analysis

Each sequence in allSeqs is then enumerated to identify and log potential data pre-process and post-process taint as described in Algorithm 2. The types mentioned in the algorithm can be found in Figure 1, with the exception of , where .

Pre-process taint checks (lines 3–15) identify instances where means, motives, and opportunity are present for human errors and violations. The checks are performed on data flows going from human entities to processes contextualised as tasks; these processes are use cases linked to tasks as indicated in Figure 1. Tasks become a possible source of human error when three conditions hold. First, roles fulfilled by personas in a task are shared with roles fulfilled by attackers. Second, attackers have a non-malicious motive and are constrained in the means available; we define such attackers as motivated by productivity and, as a capability, a limited amount of time. Finally, affected tasks are demanding to the affected personas, or in tension with their personal goals.
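The three conditions can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not the CAIRIS stored procedure: the dictionary-based `model` layout and all of its keys are hypothetical, task demand and goal conflict are keyed by task alone for simplicity, and the capability is matched against the literal string "Low Time" used in Algorithm 2.

```python
HARD = {"Medium", "High"}


def pre_process_taint(flow, model):
    """Return (process, task) pairs where pre-process taint is suspected.

    flow:  dict with 'from', 'from_type', 'to', 'to_type', 'assets' (a set).
    model: dict of lookup tables; every key used below is a hypothetical
           stand-in for the corresponding IRIS relationship.
    """
    findings = []
    # Only flows from human entities into processes are of interest.
    if flow["from_type"] != "entity" or flow["to_type"] != "process":
        return findings
    if flow["from"] not in model["human_entities"]:
        return findings

    for task in model["process_tasks"].get(flow["to"], []):
        # The flow must carry an asset used in the task.
        if not (flow["assets"] & model["task_assets"][task]):
            continue
        # Condition 3: the task is demanding or conflicts with personal goals.
        demanding = (model["task_demand"][task] in HARD
                     or model["goal_conflict"][task] in HARD)
        # Condition 1: persona roles shared with attacker roles.
        roles = set()
        for persona in model["task_personas"][task]:
            roles |= model["persona_roles"][persona]
        for role in sorted(roles & set(model["role_attackers"])):
            for attacker in model["role_attackers"][role]:
                profile = model["attackers"][attacker]
                # Condition 2: non-malicious motive, constrained means.
                if ("Productivity" in profile["motivations"]
                        and "Low Time" in profile["capabilities"]
                        and demanding):
                    findings.append((flow["to"], task))
    return findings
```

Changing any one of the three conditions, for example lowering the task demand or removing the shared role, is enough to clear the finding in this sketch.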

Post-process taint checks (lines 16–26) identify instances where exceptions resulting from processes are unresolved, and these exceptions impact information flowing from processes. Exceptions are modelled as obstacles obstructing one or more system goals operationalised as the affected processes. An obstacle impacts an out-going data flow if assets associated with the obstacle intersect with information assets in the data flow. An exception is unresolved if these obstacles are not resolved by another goal, as determined by the function defined in Algorithm 3. It begins by determining whether the input obstacle has been resolved by another goal. After evaluating whether the obstacle has been resolved, the check enumerates both obstacles that are or-refined and and-refined. In the case of or-refined obstacles, an obstruction on any of the refined obstacles is enough to consider the obstacle obstructed. Conversely, in the case of and-refined obstacles, an obstruction is present only if all refined obstacles are obstructed.

Data: resolvedObstacles - goals resolving an obstacle, orRefinedObstacles - or-refinements of an obstacle, andRefinedObstacles - and-refinements of an obstacle
Input : o - the obstacle name
Output : isObstructed - indicates if obstacle is obstructed
Function isObstacleObstructed(o) is
         resolved ← resolvedObstacles o;
         if resolved ≠ ∅ then
                  isObstructed ← false;
         else
                  ors ← orRefinedObstacles o;
                  while r ∈ ors do
                           isObstructed ← isObstacleObstructed r;
                           if isObstructed = true then
                                    break;
                           end if
                  end while
                  ands ← andRefinedObstacles o;
                  while r ∈ ands do
                           isObstructed ← isObstacleObstructed r;
                           if isObstructed = false then
                                    break;
                           end if
                  end while
         end if
         return isObstructed;
end
Algorithm 3 isObstacleObstructed check
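A Python sketch of the obstruction check, combined with the post-process taint check that uses it, is given below. It is an interpretation rather than the CAIRIS implementation. The parameter names and dictionary shapes are hypothetical, `resolved` is simplified to a set of obstacle names already resolved by some goal, the refinement graph is assumed to be acyclic, and an unresolved obstacle with no refinements is assumed to be obstructed.

```python
def is_obstacle_obstructed(obstacle, resolved, or_refined, and_refined):
    """Return True if the obstacle, or enough of its refinements, is unresolved."""
    if obstacle in resolved:
        return False

    def check(child):
        return is_obstacle_obstructed(child, resolved, or_refined, and_refined)

    or_children = or_refined.get(obstacle, [])
    and_children = and_refined.get(obstacle, [])
    if not or_children and not and_children:
        # Assumption: an unresolved leaf obstacle counts as obstructed.
        return True
    # Or-refinement: one obstructed child is enough.
    if any(check(child) for child in or_children):
        return True
    # And-refinement: every child must be obstructed.
    return bool(and_children) and all(check(child) for child in and_children)


def post_process_taint(flow, process_exceptions, obstacle_assets,
                       goals_obstructed_by, resolved, or_refined, and_refined):
    """Return (process, goal) pairs where post-process taint is suspected."""
    findings = []
    # Only flows leaving a process can carry post-process taint.
    if flow["from_type"] != "process":
        return findings
    for obstacle in process_exceptions.get(flow["from"], []):
        # The obstacle must concern an asset carried by this flow.
        if not (obstacle_assets.get(obstacle, set()) & flow["assets"]):
            continue
        if is_obstacle_obstructed(obstacle, resolved, or_refined, and_refined):
            for goal in goals_obstructed_by.get(obstacle, []):
                findings.append((flow["from"], goal))
    return findings
```

In this reading, resolving the obstacle, or enough of its refinements, removes the finding without touching the DFD itself.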

3.3 Implementation

We have demonstrated the feasibility of our approach by implementing it in CAIRIS release 2.3.3. CAIRIS (Computer-Aided Integration of Requirements and Information Security) is an open-source software platform for eliciting, specifying and validating secure and usable system specifications [cairis], developed as an exemplar for IRIS tool-support.

CAIRIS models, once imported into the platform, are implemented as relational databases. Graphical models in CAIRIS are automatically generated using a pipeline process, where a declarative model of graph edges is generated by CAIRIS; this is processed and annotated by graphviz [graphviz] before being subsequently rendered as SVG. SQL stored procedures implement a suite of security and privacy model validation checks. Algorithms 1–3 were implemented as SQL stored procedures; these are executed during a normal model-validation check. No changes were made to pre-existing visual models or the IRIS meta-model.
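To illustrate the idea of a declarative edge model, the Python sketch below turns a list of data flows into Graphviz DOT text. It does not reproduce the CAIRIS pipeline: the function names, the dictionary layout, and the node shapes chosen for each DFD element type are all assumptions made for this example.

```python
# Assumed mapping from DFD element type to a Graphviz node shape.
SHAPES = {"entity": "box", "process": "ellipse", "datastore": "cylinder"}


def quote(text):
    """Wrap text in double quotes, escaping characters DOT treats specially."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def to_dot(flows):
    """Build DOT source for a DFD.

    flows: list of dicts with 'label', 'from', 'from_type', 'to', 'to_type'.
    """
    nodes = {}   # element name -> element type, in first-seen order
    edges = []
    for flow in flows:
        nodes.setdefault(flow["from"], flow["from_type"])
        nodes.setdefault(flow["to"], flow["to_type"])
        edges.append(
            f"  {quote(flow['from'])} -> {quote(flow['to'])} "
            f"[label={quote(flow['label'])}];"
        )
    lines = ["digraph DFD {"]
    for name, kind in nodes.items():
        lines.append(f"  {quote(name)} [shape={SHAPES[kind]}];")
    lines.extend(edges)
    lines.append("}")
    return "\n".join(lines)
```

If the returned text is saved to a file such as `dfd.dot`, running `dot -Tsvg dfd.dot -o dfd.svg` with Graphviz installed would render it as SVG.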

4 Pilot Study: Modifying telemetry outstation software

We used our approach to identify process taint in a partial specification of a software repository for industrial control software. While based on a hypothetical water treatment company, this anonymised specification is drawn from a more complete specification model created for a UK water treatment company. The CAIRIS model of this partial specification consists of 1 attacker, 1 role, 1 persona, 1 task, 1 use case, 28 goals, 17 obstacles, 58 goal and obstacle associations, 11 assets, 11 asset associations, and 7 data flows. Creation of the model is not the subject of this paper, but further details of how the broader model was created are provided in [fafl103].

The specification captures the system goals and complementary model elements associated with modifying software running on telemetry outstations. Such outstations provide the means for remotely monitoring and controlling physical infrastructure such as water pumps. Malicious tampering with such outstations contributed to the well-publicised Maroochy Water Breach [slmi07].

Dataflow                   Assets
-------------------------  -----------------------
job                        Job
software (to Sandbox)      Telemetry Software File
software (from Sandbox)    Telemetry Software File
updated software           Telemetry Software File
current software           Telemetry Software File
alarm                      Alarm
update                     Software Change
Table 1: Dataflows and assets
Id  Sequence                                 Pre-Proc.  Post-Proc.
--  ---------------------------------------  ---------  ----------
1   job, alarm                               ✓          ✓
2   job, update                              ✓          -
3   job, updated software, current software  ✓          -
4   job, software, software                  ✓          -
5   current software                         -          -
Table 2: Dataflow sequences and results of pre-process and post-process taint checks

Our pilot study considers the impact of human error by an overworked technician focusing on the intricate task of updating software on telemetry outstations (Outstation update). This task puts in context the use case Modify Telemetry Software as shown in Figure 2 (top), which is carried out by an instrument technician persona (Barry). How the persona and tasks were constructed is described in more detail in [fafl106]. The task model provided the context necessary to model the DFD generated by CAIRIS in Figure 2 (bottom). Table 1 specifies the assets carried in each data flow.

Figure 2: Usability model (top) and DFD (bottom) of Modify Telemetry Software generated by CAIRIS

Not shown in the visible models is an attacker (Unintentional Barry). This attacker’s motivation and resources are specified as ‘Productivity’ and ‘Low Resources/Personnel and Time’ to reflect non-malicious intent and a busy schedule. The task model also indicates the assets that Barry directly or indirectly interacts with in completing this task. The relationship between these and other assets associated with the specification are shown in Figure 3 (right).

Figure 3: Complementary KAOS goal model (left) and UML class diagram-based asset (right) model generated by CAIRIS

On performing a model validation check, five unique sequences of data flows were generated as shown in Table 2. The check indicates pre-process taint associated with sequences 1, 2, 3, and 4 resulting from the flow between the technician and the process. This was due to the flow carrying alarm information associated with the task and the potential for error. The task narrative describes how Barry needs to raise an alarm to validate the setup is correct; the alert draws attention to the implications of not safeguarding this information asset.

The model validation check also indicates post-process taint associated with Sequence 1; this outgoing process flow carries alarm information. An exception is associated with the second step of the process, where the system sends a change alarm. As a cut of the goal model in Figure 3 (left) shows, the associated obstacle remains unresolved and, although not visible, the obstacle is concerned with the alarm asset carried in the alarm data flow.

5 Discussion and Conclusion

This short paper showed how, by putting DFDs in context, we can identify process taint without changing any DFD semantics. CAIRIS demonstrates the feasibility of our approach, but it could be adapted to any inter-operable combination of tools. Solutions for resolving the problems are not prescribed besides changing the attacker model and tasks, or resolving exceptions. However, by indicating otherwise invisible problems, our approach sheds light on why problems exist, and how a system or its context of use might need to change to address them. This approach is contingent on specifications containing the concepts in Figure 1 that might be created before, during, or after DFD creation. Small or poorly resourced teams may lack the resources to maintain such models given the user research investment required. However, this approach does allow human factor experts to become more engaged with threat modelling. We are currently working with system engineering teams with such expertise to evaluate the impact this approach has on increasing such engagement.

A threat to validity is the small size of the pilot study specification. However, we have also evaluated our approach using a more complex military medical evaluation system model described in [kfdw18] consisting of 10 attackers, 14 roles, 9 personas, 12 tasks, 29 use cases, 46 goals, 25 obstacles, 167 goal and obstacle associations, 82 assets, 388 asset associations, and 134 data flows. No differences in model validation performance were noted for this larger model, but a detailed evaluation of this and other larger models will be the subject of future work.

Our approach only considers non-malicious attackers engaging in difficult tasks. However, Algorithm 2 can be extended to consider alternative attacker and task attributes corresponding with different means, motives, and opportunities. For example, an inside attacker might be motivated by improved esteem or thrill seeking, and participate in tasks with differing levels of goal conflict.


This paper resulted from discussions at Dagstuhl Seminar 19231: Empirical Evaluation of Secure Development Processes.