Automated specification mining, or model inference, is the process of automatically reverse engineering a model of an existing software system. Inferred models can be useful in situations where a system is not accompanied by an up-to-date hand-crafted specification, and can be especially useful for common tasks such as debugging [1, 53, 37], testing [60, 51, 46, 18], anomalous behavior detection, and requirements engineering.
Behavioral model inference techniques are commonly based upon dynamic analysis methods. These involve the collection of execution traces - sequences of execution events or data from which the model can be inferred. Collecting these traces either requires some form of source code instrumentation (i.e. the systematic insertion of logging statements), or relies upon features of the execution environment (e.g. Reflection in Java). These methods are especially helpful for unit-level analysis, particularly when source code is available and the performance overheads incurred by tracing are negligible.
Such tracing approaches can, however, be impossible (or at best highly impractical) to apply in larger systems, especially if they incorporate black-box components. Black-box components can be impossible to instrument and inspect, and might even be physically sealed (such as embedded components). Aside from the challenges of observability, large-scale systems can be prohibitively expensive to trace, and tracing can incur performance overheads that lead to observed behaviour that deviates from the non-traced equivalent, an issue that is especially problematic for real-time systems.
Instead, the task of tracing such large systems is often limited to ‘lightweight’ alternatives: passively recording the observable state variables of the system without accessing any implementation details. This form of analysis is particularly useful for control systems, which tend to comprise large numbers of black-box components. In a car, for example, we might not have access to the internals of its cruise control system, but we can readily monitor state variables such as the extent to which the throttle or brake are applied, and the car’s speed and acceleration. Traces that are collected in this way tend to take the form of multivariate time-series, where each state variable corresponds to a signal.
One key difference from ‘traditional’ software traces is that these more continuous traces are not associated with internal discrete events, which means that there is no clear indicator of (1) what the main states of the system are, and (2) when state changes occur. In our cruise control example, the trace of the system does not include internal calls to indicate when the cruise control is accelerating the car or decelerating it, or when it deactivates itself. This has to be discerned from the multivariate signals in the trace.
In this paper we present a technique that addresses these questions. Our solution involves training a hybrid deep learning model (including convolutional and recurrent layers) on the time-series to predict the state of the system at each point in time. The deep learning model automatically performs feature extraction, which makes it much more effective and flexible than traditional methods. In addition, we make no assumptions about the statistical properties of the data, which makes the approach applicable to a wide range of subjects.
We applied and evaluated this method on an autopilot used in an Unmanned Aerial Vehicle (UAV) system developed by our industrial partner, Winnipeg-based MicroPilot Inc. We then replicated the results on another highly capable and widely used autopilot, Paparazzi. We evaluated the method from two perspectives: (1) how well the model can detect the point in time at which a state change happens, and (2) how accurately it can predict which state the system is in. We also experimented with non-hybrid architectures to see how much the hybrid architecture contributes to the overall performance, and explored the role of hyper-parameter tuning. Finally, we explored the application of transfer learning to lower the cost of data labeling, which is the most expensive step in this approach.
Our results indicate that the approach outperforms state-of-the-art approaches in both state-change (or ‘change-point’) detection and state detection. In the MicroPilot case study we observed improvements ranging between 88.00% and 102.20% in the F1 score compared to traditional change-point detection techniques. For state classification we saw improvements ranging from 7.35% to 16.83% in the F1 score compared to traditional sliding-window classification algorithms. In the Paparazzi case study we observed smaller (albeit still substantive) improvements for change-point detection, in the range between 13% and 43%, and a much larger improvement for state-detection accuracy, in the range from 77.20% to 87.97%.
We also observed a significant reduction in manual labeling cost when using transfer learning, which achieves up to 90% of the potential F1 score with only 2% of the data set labeled (only 5 test cases).
The contributions of this paper can be summarised as follows:
A deep learning architecture to infer behavioural models from black-box software systems.
An empirical evaluation that demonstrates the accuracy of our approach with respect to two real-world and large-scale case studies, involving a UAV autopilot system developed by our industry partner as well as an open source UAV autopilot.
An automated fuzz testing tool capable of generating and executing test cases for Paparazzi autopilot.
A hyper-parameter tuning pipeline to optimize the performance of the deep learning model.
A transfer learning approach to reuse the pretrained models, as much as possible, and reduce the manual labeling cost.
We have made all the source code, models, execution scripts, and some of the data available online (https://github.com/sea-lab/hybrid-net). Due to confidentiality, the MicroPilot dataset cannot be shared publicly.
The rest of this paper is organized as follows: In section 2 we cover background material and the related work this paper builds on. In section 3, we present our proposed method and model. The evaluation methodology and results are presented in section 4. We provide a summary of the paper and briefly discuss some directions for future work in section 5.
2 Background and Related Work
We start this section with a description of the analysis scenario that motivates our work. We then cover the related work in the dynamic analysis of software behaviour and in time-series analysis. This is followed by an overview of the Deep Learning and transfer learning techniques that we will be drawing upon in our own solution.
2.1 Motivating scenario
In this paper we consider the task of carrying out a system-level analysis of a Cyber-Physical System (CPS). By this we refer to a system whose high-level behaviour is controlled by a network of hardware and software components. The behaviour of the system is also time-sensitive: a small delay in the execution of one component can have a significant impact.
To analyse these systems, we cannot employ conventional tracing approaches that involve instrumentation and logging, because this may be physically impossible and could incur an unacceptable performance overhead [36, 42]. Instead, we rely merely on the ability to record data that is available anyway: the external input and output data-streams, perhaps coupled with hardware registers that passively provide information about the system (e.g. integrated circuits commonly include pins that automatically provide profiling information such as cache misses or energy usage).
Our setting is motivated by a partnership with an autopilot manufacturer for UAVs to study their autopilot software. The goal was to determine its internal state, over time, from its input/output signals. In this scenario, the inputs are the sensor readings going into the autopilot and the outputs are the command signals sent to the control motors of the aircraft, reflecting the reaction of the autopilot to each input at each state. A ‘state’ in this example is the high-level stage of a flight (e.g. “take-off”, “landing”, “climbing”, etc.), and a state change happens when the input values trigger a constraint in the (hidden) implementation that changes the way the output signals are generated.
During a flight, the autopilot monitors changes in the input signals and makes adjustments to its outputs in order to maintain certain invariants (predefined rules). For example, in the “hold altitude” mode it monitors the altimeter’s readings, and when the reading goes out of the acceptable range, proportionate adjustments to the throttle or the nose pitch are made to return the aircraft to the desired altitude. In this respect it amounts to a typical feedback-loop controller, such as a Proportional Integral Derivative (PID) controller. When the state changes (e.g. via a pre-loaded flight plan) from “hold altitude” to “descend to X ft”, the set of invariants that guide the autopilot changes, and so its reactions to variations in the inputs change too. In this example, a decreasing altimeter reading will no longer trigger an increase in the throttle.
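As an illustration of such a feedback loop, the sketch below implements a bare-bones PID update step in Python. The gains, setpoint, and altimeter reading are hypothetical values chosen for illustration, not MicroPilot's actual controller.

```python
# Minimal PID controller sketch (illustrative; gains are hypothetical).
class PID:
    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = None

    def update(self, measurement, dt):
        # Proportional, integral, and derivative terms on the tracking error.
        error = self.setpoint - measurement
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# "Hold altitude at 300 ft": a low altimeter reading yields a positive
# correction (e.g. more throttle). A state change to "descend to X ft" would
# simply hand the loop a new setpoint, changing its reactions to the inputs.
pid = PID(kp=0.5, ki=0.1, kd=0.05, setpoint=300.0)
correction = pid.update(measurement=290.0, dt=0.2)  # a 5 Hz loop gives dt = 200 ms
```

A hierarchy of such loops, each one setting the setpoints of those below it, is exactly the structure described for the autopilot in Section 4.2.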
When inspecting the time-series, it is often possible for a domain expert to visually determine what the state of autopilot is in at each point in time. However, for applications that involve potentially large amounts of data (e.g. as a result of testing on a simulator), relying on a domain expert can become prohibitively expensive. Our goal is to automate this task. We assume that we have access to a labelled dataset (a manageable amount of data labelled by an expert, as was the case with our industrial partners), where the time-stamps signifying a state-change have been labelled in the trace. The challenge is to use this data to predict the corresponding states and state-changes in unlabelled data, collected from new additional executions of the system.
2.2 Inference of State-Based Models
The task of reverse-engineering state-based models from software systems has been extensively explored over the past 20 years. These approaches tend to start from a sample of execution traces, where a trace comprises a sequence of discrete events and (depending on the approach) an accompanying snapshot of the data-state (or input parameters). The challenge is to infer from this partial sample a model that accurately captures the general behaviour of the system. There have been efforts to infer state-based models in a variety of representations, including LTL models , Petri-nets , neural nets  and sequence charts .
However, the majority of efforts have focussed on inferring state machine representations. These can, in broad terms, be split into the following categories:
Passive or active: Passive approaches infer the model from the trace data without being able to obtain further input. These tend to be based on the ‘state-merging’ approach, popularised by Biermann and Feldman’s k-tails algorithm, but models have also been successfully inferred with genetic algorithms. Active approaches presume an ability to pose queries (perhaps in the form of test inputs) to gain additional missing information about the system. These include the QSM state-merging variant and the LearnLib inference approach, based on Angluin’s active L* algorithm.
Data or no data: Simple state machines are insufficient for representing state-based behaviour that depends upon data. To accommodate this, there have been several efforts to enhance traditional inference approaches to produce state machines that incorporate data constraints or transformations. These include efforts to augment transitions with guards on a data state or on input parameters [34, 59, 13], as well as functions to describe the data transformations that underpin a state transition .
In these terms, our motivating scenario would amount to a ‘passive’ inference approach. We do not assume that we are able to submit queries or tests to the system under analysis. With respect to the question of data, our scenario is novel, in the sense that it assumes that traces contain only data values, without the accompanying discrete events that are customary for traditional state machine inference approaches. The research that we present in this paper is complementary to the aforementioned state machine inference techniques. Given a multivariate time-series without discrete events, our task is to derive sequences of labels that could, if desired, be used as a basis for state machine inference.
2.3 Time-Series Analysis
When an execution trace is encoded as a time-series, it can be analysed with a variety of signal-processing algorithms. This was first established (in a software engineering context) by Kuhn and Greevy . In their case, a trace was encoded as a series of call-stack depths, producing a single series per execution. They then showed how signal-processing techniques could be used to cluster and compare different traces.
2.3.1 Change Point Detection
In our context we are especially interested in inferring which portions of a trace correspond to states, and which points in a trace correspond to a state-change. This latter task is referred to as ‘Change Point Detection’ (CPD) . It is a well-studied subject due to its wide range of applications .
In general, CPD algorithms consist of two major components: a) the search method and b) the cost function. Search methods are either exact or approximate. For instance, PELT is the most efficient exact search method in the CPD literature, which uses pruning. Approximate methods include window-based, bottom-up, and binary segmentation approaches.
The cost function measures the ‘goodness of fit’ of a specific area of the signal to a given model. These can vary from simply subtracting each point from the mean to more complex metrics, such as auto-regressive  or kernel-based cost functions. Amongst kernel-based functions, linear and Gaussian kernels are the most popular .
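To make the two components concrete, the following sketch (hypothetical pure-Python code, not any library's implementation) pairs the simplest cost function, an l2 (mean-deviation) cost, with an exhaustive exact search for a single change point:

```python
def l2_cost(segment):
    # "Goodness of fit" of a segment to a constant model: the sum of
    # squared deviations from the segment mean.
    mean = sum(segment) / len(segment)
    return sum((x - mean) ** 2 for x in segment)

def single_changepoint(signal):
    # Exact search for one change point: try every split and keep the
    # one that minimises the total cost of the two resulting segments.
    best_t, best_cost = None, float("inf")
    for t in range(1, len(signal)):
        cost = l2_cost(signal[:t]) + l2_cost(signal[t:])
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t

# A univariate signal whose mean shifts at index 50.
signal = [0.0] * 50 + [5.0] * 50
```

Exact methods like PELT generalise this idea to an unknown number of change points while pruning splits that cannot improve the total cost.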
A multitude of techniques have been developed to tackle specific variations of the CPD problem [43, 49]. Some assume that the time series has only one input variable (is univariate), that there is only a single change-point, that the number of change points is known beforehand, or that the data obeys some specific statistical properties [15, 56].
Unfortunately these assumptions are not (generally) applicable in our scenario. For our purposes, a CPD method should work on multivariate data, and be able to capture non-linear relations between signals without making restrictive presumptions about the underlying data. It also needs to be resilient to time lags between an input signal change and its effect on the output signal (and the system-state).
2.4 Deep Learning and its Application to Time Series Data
Our research goal is to develop a technique that overcomes the current limitations of CPD techniques. It should accurately detect an arbitrary number of changes in multivariate time-series, without imposing limiting assumptions about the statistical properties of the underlying signals. There have already been several successful examples of applying Deep Learning techniques to time-series data, albeit not to CPD; this application is the subject of the technique that we present in Section 3.
2.4.1 Hybrid Deep Neural Networks
One fundamental Machine Learning task is to identify salient features in data (e.g. to identify the properties within a multivariate time-series that indicate a change point). For non-trivial data-sets, these feature encodings may not be readily available. They can also be difficult for a human to determine, because they may arise from complex non-linear interactions within the data. The task of identifying these features automatically is referred to as feature extraction.
Convolutional neural networks (CNNs) present a particularly effective solution to this problem for sequential signals. These are neural networks where the hidden units are associated with ‘receptive fields’, matrices that capture the local contextual data for a given area of the input. These fields are convolved with filters (also referred to as kernels) that amplify particular aspects of the data [40, 64, 62]. The weights for hidden nodes are tied across the whole time-series, which means that useful features discovered in one zone can be re-used elsewhere without having to be independently learned [41, 29]. CNNs also exhibit ‘translation invariance’, which means that they are able to classify patterns no matter where they occur in the input data (in our case, the time-series).
Recurrent neural networks (RNNs) have shown great performance in analysing sequential data for tasks such as machine translation, time-series prediction, and time-series classification [16, 61, 44]. RNNs can capture long-term temporal dependencies, a property that is especially useful in our case. For example, they might learn that the “climb” state in a UAV autopilot usually follows “take off”. Therefore, while the model is outputting “take off” it anticipates what the next state will probably be, and as soon as its input features start shifting, it detects the onset of a state change. This has the potential to improve the predictive capabilities of the model and make it more effective at detection in a way that could be difficult to match with classic methods. The combination of RNNs with CNNs is generally referred to as a ‘hybrid deep neural network’.
2.4.2 Application of Deep Learning to Time Series Data
One major driver of time-series analysis has been the growth in devices such as fitness trackers and smart watches, which use sensors to collect data about an individual’s movements or health indicators such as heart rate. One challenge is to detect high-level episodes of distinctive human behaviour, e.g. to determine whether someone is running or walking from the accelerometer data in their phone. This particular challenge is referred to as Human Activity Recognition (HAR). Several attempts to apply deep learning to HAR tasks have shown that hybrid neural networks (the combination of CNN and RNN) are particularly effective [63, 40, 44, 65].
In the context of computer vision, there have been several approaches to automatically segment images into distinctive areas (e.g. to detect cells in a petri-dish, or to detect distinctive phases of memory usage in a bitmap representation of sequential memory-bus accesses ). In this context, U-net has emerged as one of the promising auto-encoder architectures .
Although U-net is specific to bitmap data, its principles have also been adapted for time-series analysis by Perslev et al. in a model called U-Time. Their model is fully convolutional; although it is good at recognising repeated localised patterns, the lack of recurrent cells means that it lacks the benefits of RNNs, particularly the ability to capture long-term dependencies that may arise in a time-series.
2.4.3 Transfer Learning
It is often the case that, as part of a software analysis process, the analyst has access to a separate piece of software with a similar functionality. It might be an older (more robust) version of the software, a similar product in the company’s product line, an open source equivalent, or a similar product by a competitor. These alternative versions can be used for example in testing (for regression testing, or to address the test-oracle problem ).
In these settings, Machine Learners can also benefit. ‘Transfer learning’ refers to a set of techniques that use the information a machine learning model has learned on one task to improve its performance on a different (but similar) task. Using transfer learning can result in faster training (reaching asymptotic performance in fewer training epochs), a better overall result (a higher asymptote), fewer training examples needed to reach acceptable performance, or a combination of these.
To understand transfer learning in more detail, let us look at a typical example from the machine learning literature; in the next section, we show how this applies to our software engineering problem. Assume we have trained a DNN model with very high accuracy for an image classification task, on a large dataset of cat and dog images. Let’s call this the “source” problem. Now assume we have a “target” problem: classifying images of boats and planes, for which we have only a small dataset. Clearly, with such a small target dataset, training a DNN from scratch will not be effective. Nor can we reuse the trained source model directly on the target problem, since the objective is different. To address this, we can use transfer learning to reuse the high-accuracy cat/dog classifier, as much as possible, on the new but similar problem of classifying boats and planes.
This process, also called fine-tuning, is to first train the source model on the source dataset (the cat/dog classification model in the above example), and then freeze all layers of the trained model except the last few. By “freezing” a layer, we mean keeping its trained coefficients constant throughout the subsequent training (fine-tuning) steps. The last few layers are chosen for fine-tuning (being trained on the new dataset) because they are the ones that must adapt to the new task (e.g. classifying new objects requires a different output layer from the original classification model’s). In this setting the trained model is reused as much as possible, hopefully retaining the properties that led to its high accuracy on the source dataset and leading to improved performance on the new classification task.
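A minimal sketch of this freezing step, assuming TensorFlow/Keras is available; the model shapes, layer names, and class counts below are invented for illustration, not the paper's actual models:

```python
from tensorflow import keras

# Hypothetical "source" model: a 2-class classifier over 64 input features.
inputs = keras.Input(shape=(64,))
x = keras.layers.Dense(32, activation="relu", name="feat1")(inputs)
x = keras.layers.Dense(32, activation="relu", name="feat2")(x)
outputs = keras.layers.Dense(2, activation="softmax")(x)
source = keras.Model(inputs, outputs)
# ... source.fit(...) on the large source dataset would happen here ...

# Freeze the feature-extraction layers: their trained coefficients stay
# constant throughout the subsequent fine-tuning steps.
for name in ("feat1", "feat2"):
    source.get_layer(name).trainable = False

# Replace the task-specific head for the 4-class target problem; only the
# new output layer is trained on the (small) target dataset.
new_head = keras.layers.Dense(4, activation="softmax")(source.get_layer("feat2").output)
target = keras.Model(source.input, new_head)
target.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

After compiling, `target.fit(...)` updates only the new head's weights, which is what makes fine-tuning cheap on small target datasets.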
3 Hybrid Neural Network for State Inference
Although the analysis of time-series is well-established, Section 2 shows that current approaches to the sequential analysis of time-series have significant drawbacks. In this section we present a technique that seeks to overcome these limitations by applying a hybrid neural network: we show how it can be used to infer the states of a software system from run-time data, recorded as a time-series comprising the values of the system’s inputs and outputs. In the rest of this section we present the architecture of our approach, show how to encode data for it, and discuss its implementation.
3.1 The Model Architecture
As can be seen in Figure 1, we capture inputs and outputs as a multivariate time series. To achieve its goal of accurately mapping phases in the time-series to high-level states, our solution needs to (1) automatically identify features within the data that indicate a state-change, and (2) accurately use these features to classify change-points and to identify which intervals between change points correspond to equivalent states in the data.
The architecture of our proposed model is illustrated in Figure 1. The convolutional layers are intended to discover local features, such as sudden changes in phasal behaviour. Recurrent layers are used to process the sequential aspect of the data, i.e. to learn which specific states tend to occur in sequence. Finally, we use dense (fully-connected) layers to reduce the dimensions of the preceding layers to match the output dimensions. If there are only two states, the last layer can have a sigmoid activation function and be of shape T × 1 (T being the length of the input); otherwise, to match the one-hot encoding of the labels, an output of shape T × |S| with softmax activation along the second axis is required (|S| being the number of possible states). In terms of loss function, we apply the dice overlap loss, which is typically used in image semantic segmentation tasks [39, 55]. An important property of this loss function for our case is that it accommodates class imbalances.
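For the binary case, the dice overlap loss can be sketched as follows (a simplified NumPy illustration of the standard soft dice formulation, not necessarily the authors' exact implementation):

```python
import numpy as np

def dice_loss(y_true, y_pred, eps=1e-7):
    # Soft dice loss: 1 - 2|A ∩ B| / (|A| + |B|), computed on probabilities.
    # Because the overlap is normalised by the segment sizes, rare classes
    # are not drowned out by frequent ones (class imbalance).
    intersection = np.sum(y_true * y_pred)
    return 1.0 - (2.0 * intersection + eps) / (np.sum(y_true) + np.sum(y_pred) + eps)

y_true = np.array([0, 0, 1, 1, 1], dtype=float)
perfect = dice_loss(y_true, y_true)         # full overlap, loss near 0
disjoint = dice_loss(y_true, 1.0 - y_true)  # no overlap, loss near 1
```

The multi-class version averages this quantity over the |S| one-hot channels.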
The numbers of the different types of layers, filters, and kernel sizes are hyper-parameters that should be selected based on the size of the data and the complexity of the system under analysis. Using a sequence of convolutions with (a) increasing numbers of filters and the same kernel size, (b) the same number of filters and increasing kernel sizes, or (c) decreasing numbers of filters and increasing kernel sizes are all strategies that have been used by well-known architectures such as VGG and U-net [54, 50]. Our model is configured as follows.
The first few layers of the model are convolutional layers. We used 5 convolutional layers with 64 filters each and a growing kernel size. The intuition behind this design is that starting with a small kernel guides the training so that the first layers learn simpler, more local features that fit within their window (kernel size). Kernel sizes start at 3 (a typical kernel size in the literature) and increase in multiples of 5 (equal to the sampling frequency) up to a size of 20.
Choosing a kernel size involves a compromise between a model’s propensity to over-fit and to over-generalise. We capped the kernel size at 20 because we judged that any larger size would invariably lead to models that over-fit.
A similar judgement was made for the second part of the model (the recurrent layers). We ascertained that the ‘sweet spot’ for the hyper-parameters here was two GRU layers with 128 cells each. Their output was fed into a fully connected layer of 128 neurons with a leaky ReLU activation function, and finally into a dense layer with |S| units and softmax activation. We used the Adam optimizer, which converged in 60-80 epochs (i.e. validation accuracy plateaued). The full architecture is shown in Figure 2. Note that all the initial configurations discussed here were identified based on reported best practices in the literature and after some experimentation in our first case study. However, we also apply a standard hyper-parameter tuning approach (grid search), which is discussed in section 4.3 (methodology) and reported in our research questions.
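The configuration described above can be sketched in Keras as follows (assuming TensorFlow is available; the trace length, signal count, and state count are illustrative placeholders, and standard cross-entropy stands in for the dice loss discussed earlier):

```python
from tensorflow import keras

T, N_SIGNALS, N_STATES = 1000, 10, 8  # illustrative dimensions, not the paper's

inputs = keras.Input(shape=(T, N_SIGNALS))
x = inputs
# Five 1-D convolutions, 64 filters each, with kernel sizes growing from 3
# in multiples of 5 up to 20; "same" padding keeps one output per timestep.
for k in (3, 5, 10, 15, 20):
    x = keras.layers.Conv1D(64, kernel_size=k, padding="same", activation="relu")(x)
# Two GRU layers of 128 cells, emitting a value for every timestep.
x = keras.layers.GRU(128, return_sequences=True)(x)
x = keras.layers.GRU(128, return_sequences=True)(x)
# Dense layer with leaky ReLU, then a per-timestep softmax over the states.
x = keras.layers.Dense(128)(x)
x = keras.layers.LeakyReLU()(x)
outputs = keras.layers.Dense(N_STATES, activation="softmax")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```

The model maps a T × N_SIGNALS trace to a T × |S| matrix of per-timestep state probabilities, which is the output shape described above.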
3.2 Data Encoding
The input/output values of the black-box system form a multivariate time-series X, which can be defined as a set of n univariate time series x_1, …, x_n, all of the same length T. Each x_i corresponds to the recorded values for one of the inputs or outputs of the system:

X = {x_1, x_2, …, x_n}, where x_i = ⟨x_i(1), x_i(2), …, x_i(T)⟩.
We take both inputs and outputs as part of the time-series data to be fed as input into our deep learning models (as shown in Figure 1). This is important because the externally observable outputs can be important indicators of the current state of the system.
As an example from our UAV case study, if the ’Elevator’ outputs were not taken into account, a mid-flight “descend” state and the “approach” state right before landing would be indistinguishable if using the input sensor readings alone.
The next task is to prepare the training data by labelling phases in the time-series with the corresponding state label. In principle, this requires a domain expert to manually label each individual timestamp with a state ID. In practice, however, the experts do not need to label every single timestamp. They only need to identify the timestamps at which they believe the state changed during a flight, which are usually no more than a handful (on average, 7 state changes occurred during a typical test scenario). All timestamps between two consecutive state changes are then assigned labels automatically. This greatly reduces the manual work and makes the process feasible. Note that we also address the labeling cost in our transfer learning experiment.
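Expanding the expert's sparse change-point annotations into per-timestamp labels can be sketched as follows (a hypothetical helper, not the authors' code):

```python
def expand_labels(change_points, length):
    """change_points: (timestamp, state) pairs sorted by timestamp, with the
    first entry at timestamp 0. Returns one state label per timestamp."""
    labels = []
    # Each annotated state holds until the next change point (or end of trace).
    for (t, state), nxt in zip(change_points, change_points[1:] + [(length, None)]):
        labels.extend([state] * (nxt[0] - t))
    return labels

# e.g. take-off at t=0, climb at t=4, cruise at t=9, for a 12-sample trace:
labels = expand_labels([(0, "takeoff"), (4, "climb"), (9, "cruise")], 12)
```

A handful of annotations per flight thus yields a complete label sequence, which is what makes the expert labelling effort manageable.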
We encode the state information as a set of tuples of the form (t, s), where t denotes the timestamp at which the system entered state s. We denote the set of all possible states as S.
In summary, the dataset consists of pairs of the I/O values as features and their state information as labels. Note that different flight scenarios are not necessarily of the same length. Therefore, to feed the data into TensorFlow, we first pre-process it to ensure that the signals are all of the same length, by zero-padding signals that are shorter than the longest signal. More details about this preprocessing can be found in the GitHub repository of this paper.
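The zero-padding step can be sketched as follows (a hypothetical helper illustrating the idea on plain nested lists):

```python
def zero_pad(traces):
    # Pad every multivariate trace (a list of per-timestep rows, one value
    # per signal) with all-zero rows so all traces match the longest one.
    max_len = max(len(t) for t in traces)
    n_signals = len(traces[0][0])
    return [t + [[0.0] * n_signals] * (max_len - len(t)) for t in traces]

traces = [[[1.0, 2.0]] * 3, [[4.0, 5.0]] * 5]  # lengths 3 and 5
padded = zero_pad(traces)
```

In practice the padded traces can then be stacked into a single tensor of shape (num_traces, max_len, n_signals) for training.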
4 Empirical Evaluation
The objective of this section is to evaluate our approach with the help of our industrial UAV system, provided by MicroPilot, as well as a well-known open source AutoPilot (Paparazzi). Our research questions are as follows:
4.1 Research Questions
How effective is our proposed DNN-based technique at inferring states and state changes?
In order to answer this, we pose three sub-questions:
How does the proposed technique perform in detecting the state change points?
Does our approach outperform existing baseline approaches in terms of Precision, Recall and F1 scores?
How well does the proposed technique predict the internal state of the system?
How correct are the state-labels predicted by our approach?
How much does the proposed model owe its performance to being a hybrid model?
Is the hybrid approach more accurate than using a model without the convolutional part, or without the GRU layers?
All of the above questions will be addressed in the context of our case study with our industrial partners at MicroPilot. However, this leads on to the second research question:
Do the results generalise to other systems?
To answer this question we apply the technique to another case study from the FOSS domain: the Paparazzi autopilot. To assess the generalisability of our approach, we pose the following sub-questions:
How does hyper-parameter tuning affect the generalizability of the results?
Since DL performance is so sensitive to hyperparameter tuning, can the use of standard tuning approaches lead to a comparable performance on Paparazzi?
How well do the results on change-point detection generalise? How good is our approach at detecting state-changes in Paparazzi?
How well do the results on state-labelling generalise? How good is our approach at labelling the states in Paparazzi?
Given that we use two similar systems (MicroPilot and Paparazzi) to answer RQ2, this raises the prospect of using Transfer Learning. We therefore also explore this as part of RQ2:
Is it possible to use Transfer Learning to leverage the model learned on one system to improve performance on a similar system? Can we use the model from the MicroPilot study to improve the efficiency and accuracy of the learning process on Paparazzi?
4.2 Subject Systems
This section presents more details on how data was collected from each system.
4.2.1 MicroPilot Autopilot
MicroPilot’s autopilot is a commercial autopilot with a codebase of 500k lines of C code. MicroPilot is a world leader in professional UAV autopilots, and has developed both hardware and software for 1000+ clients (including NASA, Raytheon, and Northrop Grumman) in 85+ countries over the past 20+ years.
The primary control mechanism in the autopilot is a hierarchy of PID loops, as explained in sections 2.1 and 3. High-level commands are loaded on the autopilot as a flight plan, which looks like “takeoff, climb to 300 feet, go to waypoint A, go to waypoint B, land”. These commands determine which PID loops must be activated and what their ‘desired values’ should be. For example, a “go to waypoint A” command activates a PID loop that tries to minimize the distance between the current location of the aircraft and point A. This is a high-level loop that activates other lower-level PID loops to achieve its goal. Those lower-level loops may include one that maintains the altitude and another that keeps the aircraft heading on the straight line from the current location to point A. This hierarchical chain of higher-level goals controlling lower-level ones continues down to the level that directly controls the aerodynamic surfaces of the aircraft.
| Variable | Description |
| --- | --- |
| Pitch | The angle the aircraft’s nose makes with the horizon, around the lateral axis |
| Roll | The angle the aircraft’s wings make with the horizon, around the longitudinal axis |
| Yaw | The rotation angle of the aircraft around the vertical axis |
| Altitude | Altitude above ground level (AGL) |
| Air speed | Speed of the aircraft relative to the air |
| Elevator | Control surfaces that control the pitch |
| Aileron | Control surfaces that control the roll |
| Rudder | Control surface that controls the yaw |
| Throttle | Controller of the engine’s power, ranges from 0 to 1 |
| Flaps | Surfaces at the back of the wings that provide extra lift at low speeds, usually used during landing |
Control decisions in this software are made in a 5 Hz loop: every 200 ms all the sensor inputs are read and, based on the current state of the aircraft and the system’s goal at that moment (e.g. maintaining a constant speed), decisions are made and output is generated. Accordingly, the best place to capture these data is at the end of each iteration of this loop. We inserted instrumentation code there, to log input and output values (listed in Table I) at the exact spot where they are updated. Note that although it is more convenient to capture the values in this way, it does not give us any special advantage or insight that breaks the black-box condition. In other words, exactly the same data could be collected from the compiled binaries without any access to the internals, just with extra steps. Inputs and outputs, after all, are the minimum available in both black-box and white-box settings.
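To make the instrumentation point concrete, the following sketch logs one row of observable state variables at the end of each 200 ms control-loop iteration. The variable names follow Table I, but the logging function and its call site are illustrative assumptions, not the actual MicroPilot code.

```python
# Sketch: append one row of inputs/outputs at the end of each control-loop
# iteration (5 Hz => 200 ms sample period). Names follow Table I; the
# surrounding control-step code is hypothetical.

SAMPLE_PERIOD_S = 0.2  # 5 Hz control loop

SIGNALS = ("pitch", "roll", "yaw", "altitude", "air_speed",
           "elevator", "aileron", "rudder", "throttle", "flaps")

def log_iteration(log, step, state):
    """Record observable state variables for one loop iteration."""
    row = {"t": step * SAMPLE_PERIOD_S}
    for name in SIGNALS:
        row[name] = state.get(name, 0.0)  # unset signals default to 0.0
    log.append(row)
    return row

log = []
row = log_iteration(log, 3, {"pitch": 1.5, "throttle": 0.8})
```

Collected over a full run, the rows form exactly the multivariate time-series described earlier, one signal per column.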
MicroPilot has a repository of 948 system tests. We ran them in a software simulator (developed by MicroPilot, it provides an accurate simulation of the aerodynamic forces on the aircraft, physical-environment irregularities such as unexpected wind gusts, and noise in sensor readings) and collected the logged flight data over time. The test cases are system-level tests; each includes a flight scenario for various supported aircraft. A flight scenario goes through different phases of a flight, such as “take off”, “climb”, “cruise”, “hitting waypoints”, and “landing”. Out of the 948 flight logs, we omitted 60 that were either too short (fewer than 200 samples) or too long (more than 20k samples). Figure 3 shows the distribution of the remaining log lengths; after omitting those scenarios, the maximum observed length was 18,000 samples.
The data items (test scenarios) were randomly split into three chunks of 90%, 5%, and 5% for training, validation, and testing.
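A minimal sketch of the 90/5/5 split described above, using a fixed seed for reproducibility; the seed and shuffling scheme are assumptions, not the authors' exact procedure.

```python
import random

def split_dataset(items, fractions=(0.90, 0.05, 0.05), seed=42):
    """Shuffle items, then cut into train/validation/test chunks."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(n * fractions[0])
    n_val = int(n * fractions[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

# 888 scenarios remain after filtering (948 - 60 omitted)
train, val, test = split_dataset(range(888))
```

Because the cut points are computed with `int()`, the test chunk absorbs any rounding remainder, so the three chunks always cover the whole data set without overlap.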
Note that separate test and validation sets are needed to facilitate proper hyper-parameter tuning without leaking information. The autopilot can be used in SWIL (software-in-the-loop) and HWIL (hardware-in-the-loop) modes. We used SWIL mode as it provides what was needed without the costs and hassles of HWIL mode.
4.2.2 Paparazzi
Paparazzi provides a rich and flexible API that can be configured to record several different parameters in flight. The aircraft periodically sends data back to the ground station over a wireless link using a protocol called Paparazzi link, which is built over Ivy, a message-bus protocol that uses UDP. In Paparazzi’s architecture, a process called ‘link’ interfaces the wireless link to the aircraft with the computer’s network: on one side are the Paparazzi link messages that come and go as UDP datagrams, and on the other side is the (usually wireless; a wired connection is used in HWIL test mode, as well as in some scenarios where the autopilot equipment is used in an autonomous submarine rather than an aircraft) connection to the aircraft. In simulations, the modem and wireless communications are no longer needed; instead the autopilot runs as a separate process and mimics a wireless channel over the local network (see Figure 4).
Paparazzi comes with a multitude of small tools that can do most of what we needed in terms of instrumentation. Its remote logger and log player come close to the instrumentation tool we need; however, upon trying them, we found that they cannot record some of the information we require. We therefore developed a custom flight-data-recorder tool.
While MicroPilot had a large number of system tests (in addition to other types of tests, such as unit tests, which we did not use), Paparazzi comes only with unit tests. This is possibly because it is not subject to the stringent certifications and approvals that commercial systems require.
To create tests for our study we built a fuzz-testing tool that can automatically generate valid, diverse, and meaningful system tests for Paparazzi. Our tool takes an example flight plan, automatically generates system tests, runs them in a simulator (or potentially on hardware: although we have not confirmed this by running tests in HWIL mode, having implemented the protocol, the tool should be capable of it), and collects the required telemetry data from the aircraft. The targeted randomizations in test inputs are augmented with the stochastic wind model in the simulation to further diversify the observed behaviours. The tool, which we refer to as pprz_tester, is available online (https://github.com/MJafarMashhadi/pprz_tester). To make the logging and testing more similar to MicroPilot we applied some patches to Paparazzi, for example increasing the telemetry reporting rate from 2 Hz to 5 Hz. A list of these patches, including why each change was necessary or beneficial and the exact lines of code that need to be changed, is available in the pprz_tester wiki (https://github.com/MJafarMashhadi/pprz_tester/wiki/Paparazzi-Patches).
Generating and running tests produced 378 runs (the Paparazzi dataset size) covering different flight scenarios. After collecting the data we performed some pre-processing steps to make them more similar to what the previous model was trained on; these steps include normalizing some values as well as metric-to-imperial unit conversions.
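The pre-processing above is not specified field by field, so the following is only an illustrative sketch of a metric-to-imperial conversion step (metres to feet); the field names and the exact set of converted quantities are assumptions.

```python
# Illustrative metric-to-imperial pre-processing step. Field names
# ("altitude_m", "air_speed_mps") are hypothetical.

M_TO_FT = 3.28084  # metres per foot conversion factor

def to_imperial(sample):
    """Return a copy of a telemetry sample with lengths converted to feet."""
    out = dict(sample)
    if "altitude_m" in out:
        out["altitude_ft"] = out.pop("altitude_m") * M_TO_FT
    if "air_speed_mps" in out:
        out["air_speed_ftps"] = out.pop("air_speed_mps") * M_TO_FT
    return out

converted = to_imperial({"altitude_m": 100.0, "air_speed_mps": 20.0})
```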
Figure 5 shows the distribution of the test lengths. The test lengths range from 140 time steps (70 seconds) to 2,580 time steps, with a median of 2,170 and a mean of 1,896.
The dataset was shuffled and split into three chunks: 70% of the data was used for training, 20% as the validation set for tuning the hyper-parameters, and the remaining 10% was set aside as test data to measure the trained model’s performance. Note that the proportions differ from those used for MicroPilot: the data set here is much smaller, so 5% of the data (fewer than 20 flights) would not be sufficient for experimental analysis.
For the sake of brevity, we skip further details of the data-collection process for the Paparazzi simulator; for more on this, please see the GitHub repository of this paper.
4.3.1 RQ1: Measuring CPD Accuracy
We start from a labelled data set, where each multivariate time-series has a corresponding series of state labels. For this we use the source code to collect the exact time each state-change happens, along with the actual state labels (the ground truth).
Given that this task has an inherent class imbalance (there are far more points where no change has happened than points labelled with a state-change), we avoid the proportional measure of accuracy and report both Precision and Recall. The original Precision/Recall metrics require some modification to accommodate a degree of approximation around the exact time stamp at which a state-change happens. To handle this, similar to related work [57, 30], we use a tolerance margin $\tau$: if a detected state-change ($\hat{t}$) is within $\tau$ of a true change ($t$), we count the prediction as a True Positive; otherwise it is a False Positive. An analogous adjustment is applied to the definitions of True Negatives and False Negatives. From the resulting confusion-matrix counts we calculate Precision, Recall, and F1 in the usual way:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
We calculate these measurements with three values for the tolerance margin $\tau$: 1, 3, and 5 seconds. The smaller the tolerance, the stricter the definitions become (leading to reduced accuracy scores).
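The tolerance-margin matching described above can be sketched as follows. The greedy one-to-one matching between predicted and true change points is our assumption of a reasonable scheme; the paper follows the adjustments of [57, 30].

```python
# Sketch of tolerance-margin CPD evaluation: a predicted change point counts
# as a true positive if it lies within tau of some (unmatched) true change.

def cpd_precision_recall(predicted, true, tau):
    """Greedy one-to-one matching of predictions to true change points."""
    matched = set()
    tp = 0
    for p in sorted(predicted):
        # find an unmatched true change point within the tolerance margin
        hit = next((t for t in sorted(true)
                    if t not in matched and abs(p - t) <= tau), None)
        if hit is not None:
            matched.add(hit)
            tp += 1
    fp = len(predicted) - tp
    fn = len(true) - tp
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if true else 0.0
    return precision, recall

# 10 matches 11 and 52 matches 50 within tau=3; 90 has no true change nearby
p, r = cpd_precision_recall(predicted=[10, 52, 90], true=[11, 50, 70], tau=3)
```

As the snippet shows, shrinking `tau` can only turn matches into misses, which is why the scores become stricter at smaller tolerances.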
As baselines for RQ1.1 we used the ‘ruptures’ library, developed by the authors of a recent CPD survey. The library facilitates experimentation with CPD algorithms, providing a variety of search and cost functions that can be combined to form different state-of-the-art algorithms.
Three of the implemented search methods were suitable as baselines in our experiments, i.e. they do not require assumptions about the number of change points, their distribution, etc. Pelt is the most efficient exact search method; the other two are approximate: bottom-up segmentation and window-based search. After trying Pelt on MicroPilot’s data, we found it prohibitively time-consuming compared to the approximate methods, without providing much better results. We therefore restricted the baselines to the bottom-up and window-based segmentation methods.
For the cost function, we tried “Least Absolute Deviation”, “Least Squared Deviation”, “Gaussian Process Change”, “Kernelized Mean Change”, “Linear Model Change”, “Rank-based Cost Function”, and “Auto-regressive Model Change” as defined in the library, leaving their parameters at the defaults. To optimize the number of change points, a penalty value (linearly proportional to the number of detected change points) is added to the cost function; the higher the penalty, the fewer change points are reported. We tried three different penalty coefficients: 100, 500, and 1000.
In RQ1.2 (checking that the label of the predicted state matches the actual state) we have a multi-class classification problem. For this we calculate one set of Precision / Recall values per class (state label). We then report the mean value across all classes.
As baselines for RQ1.2 we used Scikit-learn’s implementations of two classification algorithms: a ridge classifier (logistic regression with L2 regularization) and three decision trees. The ridge classifier was configured to use built-in cross-validation to automatically choose the best regularization hyper-parameter. Each decision tree was regularized by setting the “maximum number of features” and “maximum depth”.
To prepare the data for the classification algorithms, we slid a window of width $w$ over the 10 time-series values and flattened it into a feature vector of size $10w$. For the labels, which are categorical values for the different states, we used a one-hot encoded representation (i.e., rather than one integer feature with state IDs between 1 and 25, we use 25 binary features, each encoding a particular state).
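The feature construction above can be sketched as follows; this is a pure-Python illustration of the windowing and one-hot encoding, not the study's actual pipeline code.

```python
# Sketch: flatten a width-w sliding window over 10 signals into a feature
# vector of length 10*w, and one-hot encode the state label.

def window_features(series, w):
    """series: list of time steps, each a list of 10 signal values."""
    return [sum((series[t + i] for i in range(w)), [])  # concat w steps
            for t in range(len(series) - w + 1)]

def one_hot(label, n_states=25):
    """State IDs run from 1 to n_states."""
    vec = [0] * n_states
    vec[label - 1] = 1
    return vec

series = [[float(t)] * 10 for t in range(6)]  # 6 time steps x 10 signals
X = window_features(series, w=3)              # 4 windows of length 30
y = one_hot(4)
```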
The window sizes were chosen to match the convolutional layers’ kernel sizes (3, 5, 10, 15, and 20), to make the baselines more directly comparable with our method.
To regularize the decision trees we tried several settings for the “maximum number of features” parameter, including no limit. To find the best “maximum depth” we first placed no upper bound and observed how deep the tree grew; we then tried successively smaller values until a drop in performance was observed.
For RQ1.3 (establishing the extent to which performance is affected by the hybrid architecture), we compared different versions of the model with each other: two base models (one fully convolutional and one fully recurrent) against our full hybrid model.
4.3.2 RQ2: Establishing the generalisation to other systems
For RQ2 we use the same basic performance measurements of Precision, Recall and F1 (and their multi-class variants) as covered in Section 4.3.1.
For RQ2.1, we designed a model creation and evaluation pipeline that takes hyper-parameters as input and outputs the model’s performance scores on the test (tuning) data. The hyper-parameters that we searched over are:
- Number of GRU layers: 1 or 2
- Number of GRU cells in the recurrent section: 5 values between 64 and 512
- Number of convolutional filters in each layer: 5 values between 16 and 72
- The size of convolution kernels and the number of convolutional layers: 3-6 layers with increasing kernel size, starting from kernel size 3 or 5
- The learning rate of the Adam optimizer: 3 values
The parameters are shown in Figure 2. We performed a grid search over these parameters, using TensorBoard (https://www.tensorflow.org/tensorboard), a monitoring tool for TensorFlow that provides insight into model training, to track the metrics and determine the right balance. In total, 520 configurations were trained on the training data and evaluated on the test (tuning) data.
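A grid search of this kind can be enumerated with `itertools.product`. The value lists below are placeholders with the cardinalities described in the text (the kernel-size/layer-count combinations are omitted, so this sketch does not reproduce the full 520 configurations).

```python
# Sketch: enumerating a hyper-parameter grid. Values are illustrative
# placeholders, not the exact values used in the study.
from itertools import product

grid = {
    "gru_layers": [1, 2],                    # 2 options
    "gru_cells": [64, 128, 256, 384, 512],   # 5 values between 64 and 512
    "conv_filters": [16, 32, 48, 64, 72],    # 5 values between 16 and 72
    "learning_rate": [1e-4, 1e-3, 1e-2],     # 3 values (range elided in text)
}

configs = [dict(zip(grid, values)) for values in product(*grid.values())]
```

Each entry of `configs` is one complete hyper-parameter assignment that the training/evaluation pipeline can consume.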
For RQ2.2 we re-applied the methodology described in RQ1.1 (evaluating CPD accuracy on MicroPilot) to the Paparazzi system. Since the Paparazzi dataset was smaller, it became feasible (though still very time-consuming) to try Pelt as well. When using the window-based search method, we left the window-size parameter at its default of 100.
In RQ2.3 we again used the same configurations and procedures as in the MicroPilot case (RQ1.2). The only difference was the removal of the depth limit from the decision trees, since limiting depth had turned out to have a negligible effect on accuracy (as shown in the RQ1.2 results later on).
For RQ2.4, the hybrid model developed for MicroPilot is used as the source model. We compare the predictions of a model with the same architecture trained from scratch on the Paparazzi dataset against the predictions made by a transfer-learning technique based on the source model.
To see the effect of transfer learning, we apply it to the Paparazzi case study. Recall that Paparazzi initially had a very small set of tests; this therefore presents a good example of a scenario that suits transfer learning. The objective of this sub-RQ is to study the extent to which transfer learning is effective in this setting.
To apply transfer learning to the source model from MicroPilot, we fine-tune the last two fully connected layers and freeze the remaining layers. We allocate a very small subset of Paparazzi’s data set for training (only 5 test cases, less than 2% of the whole data set); the rest is used for testing. As a baseline, we use the same data points to train and test a model whose layers are all trained from scratch (initialized with random weights) on the small training set, without freezing any layers, i.e., no transfer learning. We then calculate the mean of each metric over all results in a K-fold cross-validation (with only 5 items in the training set of each fold).
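The freezing scheme above can be sketched in a framework-agnostic way: every layer except the last two fully connected ones is marked non-trainable. The layer names are illustrative; the actual model is the hybrid CNN-RNN, and in Keras this would correspond to setting `layer.trainable = False`.

```python
# Sketch: freeze all but the last n_trainable layers for fine-tuning.
# Layer names are hypothetical placeholders for the hybrid CNN-RNN stack.

def freeze_for_transfer(layer_names, n_trainable=2):
    """Return {layer_name: is_trainable}, leaving only the tail open."""
    cutoff = len(layer_names) - n_trainable
    return {name: i >= cutoff for i, name in enumerate(layer_names)}

layers = ["conv1", "conv2", "conv3", "gru1", "dense1", "dense2"]
trainable = freeze_for_transfer(layers)
```

Only the weights of `dense1` and `dense2` would then be updated on the 5 Paparazzi training flights, while the frozen layers keep the representations learned on MicroPilot.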
4.3.3 Experiment Execution Environment
Training and evaluation of the deep learning model was done on a single node running Ubuntu 18.04 LTS (Linux 5.3.0) equipped with Intel Core i7-9700 CPU, 32 gigabytes of main memory, and 8 gigabytes of GPU memory on a NVIDIA GeForce RTX 2080 graphics card. The code was implemented using Keras on TensorFlow 2.0999https://www.tensorflow.org/.
The baseline models could not fit on that machine, so two nodes on Compute Canada’s Beluga cluster, one with 6 CPUs and 75GiB of memory and one with 16 CPUs and 64GiB of memory, were used to train and evaluate them.
In this environment we collected the execution costs of each technique in terms of wall-clock time, and discuss them informally in the results for each RQ as well.
In this section, we present the results of the experiments and answer the two research questions.
4.4.1 RQ1.1 - CPD Performance
| Cost Function | Search Method | Penalty | Prec. (τ=1s) | Recall (τ=1s) | F1 (τ=1s) | Prec. (τ=3s) | Recall (τ=3s) | F1 (τ=3s) | Prec. (τ=5s) | Recall (τ=5s) | F1 (τ=5s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Autoregressive Model | Bottom Up | 1000 | 10.43% | 75.44% | 18.33% | 21.21% | 80.32% | 33.55% | 28.94% | 81.22% | 42.68% |
| Least Absolute Deviation | Bottom Up | 500 | 7.32% | 52.54% | 12.85% | 17.52% | 87.73% | 29.20% | 25.02% | 88.95% | 39.05% |
| Least Squared Deviation | Bottom Up | 1000 | 7.44% | 85.09% | 13.68% | 16.40% | 89.81% | 27.74% | 24.16% | 90.47% | 38.14% |
| Linear Model Change | Bottom Up | 100 | 37.59% | 28.98% | 32.73% | 45.20% | 38.39% | 41.52% | 48.07% | 41.36% | 44.46% |
| Gaussian Process Change | Bottom Up | 100 | 3.77% | 92.23% | 7.25% | 8.99% | 92.23% | 16.39% | 13.53% | 92.23% | 23.60% |
| Rank-based Cost Function | Bottom Up | 100 | 13.45% | 60.19% | 21.98% | 19.49% | 80.10% | 31.35% | 22.98% | 87.23% | 36.38% |
| Kernelized Mean Change | Bottom Up | 100 | 4.13% | 3.24% | 3.63% | 12.22% | 8.14% | 9.77% | 15.38% | 10.58% | 12.54% |
Table II shows the results of running the CPD algorithms for various configurations. For each search-method and cost-function pair, only the penalty value that resulted in the highest F1 scores across all τ values is reported.
As is to be expected, larger values of τ lead to improved scores. Another observation is that bottom-up segmentation consistently outperforms the window-based segmentation method. We can also see that the linear cost function beats all the others in terms of precision. The Gaussian cost function achieves much higher recall, but at the expense of precision: it detects numerous change points spread across the time axis, so there is a good chance of having at least one predicted change point close to each true change point (hence the high recall), but there are also many false positives, leading to low precision. Our approach (see the last line of the same table) shows improved scores throughout; its results almost double the F1 score of the best-performing baseline.
In terms of execution cost, running all 42 different settings of CPD algorithms on the whole dataset took a bit over 12 hours in the cloud using 16 CPUs and 64GB of main memory. The deep learning model on the other hand takes about an hour to train (which only needs to be done once), on a smaller machine (see section 4.3.3). It made predictions on the whole dataset in less than a minute.
So to answer RQ1, our method improves the F1 score at all three tolerance values (τ = 1, 3, and 5 seconds), almost doubling the score compared to the baselines.
The proposed model, which requires less memory compared to traditional CPD algorithms, improved their best performance by up to 102%, measured by F1 score, in less execution time.
4.4.2 RQ1.2 - Multi-class Classification Performance
To answer RQ1.2, we first compare different configurations of the baseline methods using the F1 score (harmonic mean of precision and recall) on the test data. The results are presented in Table III.
Comparing the baseline methods with our approach (the last row of Table III) shows that our model outperforms all baselines. Compared with the baseline with the best F1-score, it improves both precision and recall, and hence the overall F1-score.
To get a visual impression of how good our predictions are in practice, Figure 6 shows the output of our model alongside the ground truth. The horizontal axis shows sample ID (time) and the states are color coded; each bar is split horizontally between ground-truth and predicted state values. This indicates that the algorithm performs better when the state changes are farther apart. Some state changes that happen in quick succession are not detected; this might be because very quick, frequent changes are treated as noise rather than a pattern to infer.
Whereas the baseline models only see one window of the data at a time, convolutional layers are more general and flexible, since each filter in each layer is comparable to a sliding window. As shown in Table III, a larger window size leads to improved performance; the downside is that it becomes significantly more difficult to train a model with large window sizes. In addition, convolutions can automatically learn beneficial pre-processing steps such as a moving average. Each convolutional filter can learn a linear combination of its inputs, so when convolutional layers are stacked with non-linear activation functions in between, the hypothesis space they can learn grows, probably much larger than that of most of the baseline algorithms here. The fact that performance keeps improving as the window size increases is probably due to the ability of the recurrent cells (such as GRUs) to capture long-term dependencies that do not necessarily fall within one window.
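The moving-average point can be illustrated directly: a 1-D convolution whose kernel weights are all 1/k computes a k-point moving average, so a convolutional filter can learn this smoothing from data. A minimal pure-Python sketch:

```python
# A convolution with uniform weights 1/k is exactly a k-point moving average,
# illustrating one pre-processing step a convolutional filter can learn.

def conv1d(signal, kernel):
    """Valid-mode 1-D convolution (no padding)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

signal = [1.0, 2.0, 3.0, 4.0, 5.0]
smoothed = conv1d(signal, kernel=[1/3, 1/3, 1/3])  # 3-point moving average
```

With non-uniform learned weights, the same operation can instead implement differencing, edge detection, or other filters over the signals.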
In terms of training complexity (time and memory), our approach is much less resource-consuming, which can largely be attributed to the use of deep learning. For the baseline models, as the window size grows the training and evaluation complexity also grows, up to the point where they ran out of memory; this forced us to train them in the cloud. Meanwhile, as mentioned earlier, the deep learning model could be trained on an 8 GB GPU in roughly an hour.
The proposed model requires less than half as many CPUs and 70% as much memory as the best-performing classical ML model, yet improved upon the best baseline performance by up to 17% (in terms of F1 score) and required significantly less execution time.
4.4.3 RQ1.3 - Hybrid vs. Homogeneous model
As the results in Table IV suggest (the results in the last column differ by around 1% in absolute value from the corresponding results in Tables III and II, due to randomization in splitting the data into training, testing, and validation sets), the hybrid architecture delivers more than the sum of its parts, outperforming the fully convolutional and fully recurrent baselines. We can also see that the RNN baseline’s results were closer to those of the full model, suggesting the important role the recurrent section plays in capturing long-term relations in the data and inferring the system’s internal state.
(Table IV compares the RNN-only and CNN-only models with the full hybrid model.)
The hybrid architecture significantly outperforms a comparable RNN model or fully convolutional model. The recurrent section plays a more important role in the model’s performance.
4.4.4 RQ2.1 - Hyper-parameter Tuning
We used TensorBoard to compare the effects of different hyper-parameters on the model’s performance across the 520 configurations in our grid search. We looked at six metrics to tune: precision and recall for CPD at two tolerance values, and precision and recall for state classification.
The tuning results showed that (a) the number of GRU cells correlates strongly with both CPD precision and classification precision, (b) the learning rate has the largest (absolute) correlation with the performance metrics, and (c) the convolutional layer counts and kernel sizes are the next most important factor. In addition, we observed that the best configurations all share the lowest learning rate, followed in importance by the convolutional layer counts and sizes.
Based on previous studies, especially in the field of computer vision, each filter tends to learn one feature, so it is more effective to have small filters that are easier to train and can collectively learn multiple features of the data.
Figure 7 summarizes the tuning results, after filtering out some of the poor configurations (those with low CPD recall at t = 1 s). The figure shows 3 + 8 parallel axes (3 hyper-parameters and 8 metrics). The green highlighted line is the best configuration, with the highest area under the curve: it uses 384 GRU cells in one layer, 16 filters per convolutional layer with kernel sizes from 5 to 15, and a low learning rate.
To conclude the tuning study, Table V reports the results of evaluating the model with and without tuning (i.e., using the MicroPilot case study’s parameters). Tuning increased all evaluation scores (in the range of 11.5% to 42.5%). For instance, looking at the four F1 scores, we see improvements between 18.3% (from 79.11% to 93.59%, for the CPD F1 at τ = 5 s) and 34.8% (from 64.87% to 87.45%, for the CPD F1 at τ = 3 s). This has also made the Paparazzi results more on par with the MicroPilot case study.
| Evaluation Metric | Default Hyper-params | Tuned Hyper-params |
| --- | --- | --- |
| State Detection Prec. | 57.35% | 77.20% |
| State Detection Recall | 78.90% | 87.97% |
| State Detection F1 | 66.41% | 82.23% |
Hyper-parameter tuning is crucial for the generalisability of our approach. The tuned model improves the evaluation scores by between 9% and 25%.
4.4.5 RQ2.2 - CPD Performance on Paparazzi
The results of applying the baseline algorithms to the Paparazzi data set are shown in Table VI. CPD performance varies widely across the techniques. The best result was achieved by the Pelt algorithm using an L1 cost function and a high penalty coefficient of 1,000. However, Pelt is a slow algorithm that would rapidly become infeasible on a larger data set, as was the case for the MicroPilot case study: here, running Pelt took more than 14 hours on the same machine that performed all the other CPD algorithms in under an hour. Table VI summarizes the results of more than 23,800 experiments.
The baseline results can be contrasted with the neural-network results in the last three rows of the table, which show improvements of 47.88%, 34.81%, and 18.30% in F1 score over the best baseline results (for τ = 1, 3, and 5 seconds, respectively). Although the model again performed better than the baselines, the improvement margin was lower than in RQ1.1. This is not due to any decrease in our approach’s performance, but to a relative increase in the baselines’ performance on Paparazzi’s data. We posit that this is because the Paparazzi data set is very small compared to MicroPilot’s: roughly 300 usable tests, each containing up to 2,500 samples, vs. around 900 tests with series of up to 18,000 samples for MicroPilot.
The proposed approach showed a near 48% improvement over the baselines with a 94% F1 score, confirming that it is a feasible approach even with smaller data.
| Cost Function | Search Method | Penalty | Prec. (τ=1s) | Recall (τ=1s) | F1 (τ=1s) | Prec. (τ=3s) | Recall (τ=3s) | F1 (τ=3s) | Prec. (τ=5s) | Recall (τ=5s) | F1 (τ=5s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Autoregressive Model | Bottom Up | 500 | 13.65% | 14.13% | 18.89% | 39.73% | 36.40% | 36.14% | 57.11% | 53.20% | 50.04% |
| Least Absolute Deviation | Window Based | 100 | 24.50% | 15.92% | 19.85% | 56.39% | 37.24% | 44.53% | 68.94% | 48.20% | 56.23% |
| Least Squared Deviation | Window Based | 1000 | 25.01% | 16.14% | 20.26% | 54.58% | 35.77% | 42.96% | 68.73% | 48.11% | 56.18% |
| Linear Model Change | Bottom Up | 100 | 7.39% | 0.39% | 9.98% | 27.44% | 1.49% | 10.27% | 52.51% | 2.75% | 9.90% |
| Gaussian Process Change | Window Based | 100 | 7.39% | 0.39% | 9.98% | 27.44% | 1.49% | 10.27% | 52.51% | 2.75% | 9.90% |
| Rank-based Cost Function | Pelt | 100 | 21.11% | 31.85% | 25.51% | 52.29% | 74.63% | 61.13% | 65.88% | 89.30% | 75.46% |
| Kernelized Mean Change | Bottom Up | 100 | 15.66% | 1.90% | 9.45% | 44.23% | 5.36% | 12.95% | 62.37% | 7.48% | 15.66% |
Change point detection methods’ performance on the Paparazzi data set. Since the 100% recall values are outliers, the next-largest recall values are shown in bold face as well.
4.4.6 RQ2.3 Results: Multi-class Classification Performance on Paparazzi (replication)
Table VII presents the comparative results of the baseline algorithms and our approach. In comparison with the MicroPilot case, we can see that in all but two settings, limiting the maximum number of features did not improve the model. A further observation is that the scores do not vary much with the window size, and decision trees almost universally outperform the linear classifiers. The scores themselves are lower: the ridge classifier’s F1 score is in the 21-29% range here, versus 32-39% in RQ1.2; for decision trees it is in the 67-68% range here, versus 73-77% in RQ1.2.
The deep learning model’s performance measures (on test data) can be found in the last row of the table; it improves on the baselines’ F1 scores. The corresponding improvement in RQ1.2 was 16.83%, which is not very different. This again provides some evidence of the generalizability of the results in the first two RQs.
Figure 8 summarizes the prediction results of the deep learning model compared to the ground truth.
Compared to the baselines, the proposed approach detected the internal state of the system with 77% precision and an 88% recall rate, a 19% improvement over the baselines. Similar results were seen in RQ1.2 (which is closely related to this question), confirming the generalizability of the findings.
4.4.7 RQ2.4 - Transfer Learning
| Evaluation Metric | Transfer Learning | Training a full model |
| --- | --- | --- |
| State Detection Prec. | 73.12% | 72.65% |
| State Detection Recall | 75.82% | 70.20% |
| State Detection F1 | 74.40% | 71.23% |
As discussed, to evaluate the generalizability of our approach in cases where little labeled data is available, we study transfer learning (TL). Table VIII summarizes the TL results by comparing the 12 evaluation metrics used in the previous RQs when TL is employed versus training a model from scratch (“Training a full model”). The results show that transfer learning improves on the baseline for all reported metrics, with improvements in the range of 1% to 30%. For instance, looking at the F1 scores, using TL (with only 5 labeled test cases) we achieve 74.35% for the CPD task and 74.40% for the state-classification task.
To put these numbers into perspective, the maximum F1 scores of our model with a fully labeled training set are 93.59% and 82.23% for the CPD and classification tasks, respectively (see Tables VI and VII). That means TL achieves 79% and 90% of the maximum potential for the same tasks with much less manual effort (a 98% reduction in labeling cost).
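These percentages follow directly from the reported F1 scores and data-set sizes (5 labelled tests out of 378 runs); the small check below recomputes them from the numbers given above.

```python
# Recomputing the "fraction of maximum potential" figures from the reported
# F1 scores (TL vs. fully labeled training) and the labeling fraction.

tl_cpd_f1, full_cpd_f1 = 74.35, 93.59   # CPD task
tl_cls_f1, full_cls_f1 = 74.40, 82.23   # state-classification task

cpd_fraction = tl_cpd_f1 / full_cpd_f1        # ~0.79 of the maximum
cls_fraction = tl_cls_f1 / full_cls_f1        # ~0.90 of the maximum
labelling_fraction = 5 / 378                  # ~1.3% of runs labelled (< 2%)
```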
Using transfer learning, up to 90% of the maximum potential of our model (in terms of F1 scores) can be achieved with only 2% of the manual labeling cost.
4.5 Limitations and Threats to Validity
4.5.1 Limitations of the Approach
One limitation of this approach is that it might miss an input-output correlation when the input remains constant or changes too little to reveal its relation with certain outputs.
We also assume that sampling happens at regular intervals during data collection; the approach would likely struggle to achieve high performance on unevenly spaced time-series data.
Another limitation is that the approach requires the entire data set at once to infer the models. In other words, the model cannot handle a stream of data as it arrives, limiting its application for use cases such as anomaly detection at run-time. We intend to address this in future work by adapting the model architecture to an online learning model.
4.5.2 Threats to Validity
In terms of construct validity, we use standard metrics to evaluate the results. However, the use of a tolerance margin should be treated with caution, since it is a domain-dependent variable and can change the final results. To alleviate this threat, we used multiple margins and reported all results.
Another potential construct validity threat is the fact that we used the source code to provide an authoritative labelling of our data, whereas in practice this would be carried out by a domain expert in a black-box manner. This is a realistic expectation, since monitoring the logs and identifying the current system state is part of developers’ and testers’ regular practice during inspection and debugging. All that is provided here is a tool that, given a partial labeling (only on the training set), automatically predicts the state labels and the state-change times for future flights. Even though we use the source code to label the training set, we still treat the data as though it were obtained from a black box and do not take advantage of the additional information that one could obtain from source code. Also note that the transfer learning approach significantly reduces the need for manual labeling when the approach is reused on similar projects (e.g., different products in a product line).
In terms of threats to internal validity, we reduced the potential for bias against the baseline approaches by reusing existing libraries rather than implementing the CPD baselines ourselves.
In terms of conclusion validity, one threat is drawing incorrect conclusions from limited observations. To address this risk, we based our results on a large suite of 888 real test cases from MicroPilot's test repository and used a proper train-validation-test split for training, tuning, and evaluation.
Finally, in terms of external validity, this study is limited to two case studies. However, (a) both are large-scale real-world systems with many test cases, and (b) one is an industrial system and the other is open source, which makes them more representative. Our future work will extend this study with case studies from other domains to increase its generalizability.
5 Conclusion and Future Work
In this paper, we developed a novel method for inferring black-box models of autopilot software systems. At its core, our method is a deep neural network that combines convolutions and recurrent cells: a hybrid CNN-RNN model. The design is inspired by deep neural network architectures that have performed well in other fields, such as speech recognition, sleep phase detection, and human activity recognition, and it can be applied to both CPD and state classification problems in multivariate time series. The method can serve as a black-box state model inference technique for a variety of use cases, such as testing, debugging, and anomaly detection in control software systems, where several input signals control output states. We trained and evaluated this neural network on two case studies of UAV autopilot software, one open source and one from our industry partner. It showed promising results in inferring a behavioural model from autopilot execution data, with significant improvements over several baselines in both change point detection and state classification across 10 comparison metrics. Our proposed transfer learning approach also significantly reduces the manual labeling cost. Potential extensions of this work include: (a) adapting the method to other domains, such as self-driving cars; (b) using the inferred models to perform downstream tasks such as anomaly detection; and (c) improving hyper-parameter optimization with more advanced tuning techniques.
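To illustrate the hybrid CNN-RNN idea at toy scale (this pure-Python sketch is ours and is not the actual model, which is a deep network built in a machine-learning framework): a 1-D convolution extracts local features from the signal, and a recurrent update carries state across time steps before a thresholded per-step decision stands in for the classification head.

```python
import math

# Toy sketch of the CNN-RNN combination; not the actual model.
def conv1d(signal, kernel):
    """'Valid' 1-D cross-correlation over a univariate signal."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def rnn_step(h, x, w_h, w_x):
    """Elman-style recurrent update with a tanh nonlinearity."""
    return math.tanh(w_h * h + w_x * x)

# An edge-detector kernel responds where the signal changes level,
# mimicking how convolutional filters pick up local change patterns.
features = conv1d([0, 0, 0, 1, 1, 1], [-1.0, 1.0])
print(features)  # [0.0, 0.0, 1.0, 0.0, 0.0]

# The recurrent pass accumulates evidence over time; thresholding the
# hidden state stands in for a per-step state classifier.
h, states = 0.0, []
for x in features:
    h = rnn_step(h, x, w_h=0.5, w_x=1.0)
    states.append(h > 0.5)
print(states)    # [False, False, True, False, False]
```

In the real model, many learned filters replace the hand-picked kernel and trained recurrent cells replace the fixed weights, but the division of labour is the same: convolutions detect local patterns, recurrence propagates context across time.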
We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC) [funding reference number CRDPJ/515254-2017].
-  (2009) An extensible debugging architecture based on a hybrid debugging framework. Ph.D. Thesis, University of Idaho. Cited by: §1.
-  (2017) A survey of methods for time series change point detection. Knowledge and information systems 51 (2), pp. 339–367. Cited by: §2.3.1.
-  (2002) Mining specifications. ACM Sigplan Notices 37 (1), pp. 4–16. Cited by: §1.
-  (2012) Group lassoing change-points in piecewise-constant ar processes. EURASIP Journal on Advances in Signal Processing 2012 (1), pp. 70. Cited by: §2.3.1.
-  (1987) Learning regular sets from queries and counterexamples. Information and computation 75 (2), pp. 87–106. Cited by: 1st item.
-  (2008) Feedback systems: an introduction for scientists and engineers. Princeton University Press, Princeton. Cited by: §2.1.
-  (1998) Testing for and dating common breaks in multivariate time series. The Review of Economic Studies 65 (3), pp. 395–432. Cited by: §2.3.1.
-  (2014) The oracle problem in software testing: a survey. IEEE transactions on software engineering 41 (5), pp. 507–525. Cited by: §2.4.3.
-  (1993) Detection of abrupt changes: theory and application. Vol. 104, Prentice Hall, Englewood Cliffs. Cited by: §2.3.1.
-  (1972) On the synthesis of finite-state machines from samples of their behavior. IEEE transactions on Computers 100 (6), pp. 592–597. Cited by: 1st item.
-  Nonlinear system identification using coevolution of models and tests. IEEE Transactions on Evolutionary Computation 9 (4), pp. 361–384. Cited by: 1st item.
-  (2008) A region-based algorithm for discovering petri nets from event logs. In International Conference on Business Process Management, pp. 358–373. Cited by: §2.2.
-  (2016) Active learning for extended finite state machines. Formal Aspects of Computing 28 (2), pp. 233–263. Cited by: 2nd item.
-  (2018) Recurrent Neural Networks for Multivariate Time Series with Missing Values. Scientific Reports 8 (1), pp. 1–12. Cited by: §2.4.1.
-  (2011) Parametric statistical change point analysis: with applications to genetics, medicine, and finance. Springer Science & Business Media. Cited by: §2.3.1.
-  (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. Cited by: §2.4.1.
-  (1998) Discovering models of software processes from event-based data. ACM Transactions on Software Engineering and Methodology (TOSEM) 7 (3), pp. 215–249. Cited by: §2.2.
-  (2011) Automatically generating test cases for specification mining. IEEE Transactions on Software Engineering 38 (2), pp. 243–257. Cited by: §1.
-  (2005) Generating annotated behavior models from end-user scenarios. IEEE Transactions on Software Engineering 31 (12), pp. 1056–1073. Cited by: §1, 1st item.
-  (2018) Comparison of search-based algorithms for stress-testing integrated circuits. In International Symposium on Search Based Software Engineering, pp. 198–212. Cited by: §2.1.
-  (2014) Wild binary segmentation for multiple change-point detection. The Annals of Statistics 42 (6), pp. 2243–2281. Cited by: §2.3.1.
-  (2014) Using the Paparazzi UAV system for scientific research. Cited by: §1, Fig. 4.
-  (2006) Wavelet-based phase classification. In Proceedings of the 15th international conference on Parallel architectures and compilation techniques, pp. 95–104. Cited by: §2.4.2.
-  (2001) An online algorithm for segmenting time series. In Proceedings 2001 IEEE international conference on data mining, pp. 289–296. Cited by: §2.3.1, §4.3.2.
-  (2012) Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association 107 (500), pp. 1590–1598. Cited by: §2.3.1, §4.3.1.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.1.1.
-  (2006) Exploiting the analogy between traces and signal processing. In 2006 22nd IEEE International Conference on Software Maintenance, pp. 320–329. Cited by: §2.3.
-  (2005) Using penalized contrasts for the change-point problem. Signal processing 85 (8), pp. 1501–1510. Cited by: §2.3.1.
-  (2015) Deep learning. Nature 521 (7553), pp. 436–444. Cited by: §2.4.1.
-  (2018) Time Series Segmentation through Automatic Feature Learning. Technical report, Vol. 13. Cited by: §4.3.1.
-  (2015) General ltl specification mining (t). In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 81–92. Cited by: §2.2.
-  (2011) Mining software specifications: methodologies and applications. CRC Press. Cited by: §1.
-  (2007) Mining modal scenario-based specifications from execution traces of reactive systems. In Proceedings of the Twenty-Second IEEE/ACM International Conference on Automated Software Engineering, ASE ’07, New York, NY, USA, pp. 465–468. Cited by: §2.2.
-  (2008) Automatic generation of software behavioral models. In Proceedings of the 30th international conference on Software engineering, pp. 501–510. Cited by: 2nd item.
-  (2013) Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, Vol. 30, pp. 3. Cited by: §3.1.1.
-  (2019) An empirical study on practicality of specification mining algorithms on a real-world application. In 2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC), pp. 65–69. Cited by: §1, §2.1.
-  (2019) Interactive semi-automated specification mining for debugging: an experience report. arXiv preprint arXiv:1905.02245. Cited by: §1.
-  (2019) True hardware in the loop SPI emulation. Google Patents, US Patent 10,417,360. Cited by: §4.2.1.
-  (2016) V-net: fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 565–571. Cited by: §3.1.
-  (2016) Deep convolutional feature transfer across mobile activity recognition domains, sensor modalities and locations. In Proceedings of the 2016 ACM International Symposium on Wearable Computers, pp. 92–99. Cited by: §2.4.1, §2.4.2.
-  (2012) Machine learning: a probabilistic perspective. MIT press. Cited by: §2.4.1, §2.4.1.
-  (2008) Finding and reproducing heisenbugs in concurrent programs. In OSDI, Vol. 8, pp. 267–280. Cited by: §1, §2.1.
-  (2002) Analyzing stock market tick data using piecewise nonlinear model. Expert Systems with Applications 22 (3), pp. 249–255. Cited by: §2.3.1.
-  (2016) Deep Convolutional and LSTM Recurrent Neural Networks for Multimodal Wearable Activity Recognition. Sensors 16 (1), pp. 115. Cited by: §2.4.1, §2.4.2.
-  (2009) A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22 (10), pp. 1345–1359. Cited by: §2.4.3.
-  (2015) Black-box test generation from inferred models. In Proceedings of the 4th International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering (RAISE 2015), pp. 19–24. Cited by: §1.
-  (2019) U-time: a fully convolutional network for time series segmentation applied to sleep staging. In Advances in Neural Information Processing Systems, pp. 4417–4428. Cited by: §2.4.2.
-  (2009) LearnLib: a framework for extrapolating behavioral models. International journal on software tools for technology transfer 11 (5), pp. 393. Cited by: 1st item.
-  (2007) A review and comparison of changepoint detection techniques for climate data. Journal of applied meteorology and climatology 46 (6), pp. 900–915. Cited by: §2.3.1.
-  (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §2.4.2, §3.1.1.
-  (2012) Model-based testing. IEEE Software 29 (1), pp. 14–18. Cited by: §1.
-  (1974) . Biometrics, pp. 507–512. Cited by: §2.3.1.
-  (2013) Assisting developers of big data analytics applications when deploying on hadoop clouds. In Proceedings of the 2013 International Conference on Software Engineering, pp. 402–411. Cited by: §1.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.1.1.
-  (2017) Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. In Deep learning in medical image analysis and multimodal learning for clinical decision support, pp. 240–248. Cited by: §3.1.
-  (2006) A unifying framework for detecting outliers and change points from time series. IEEE transactions on Knowledge and Data Engineering 18 (4), pp. 482–492. Cited by: §2.3.1.
-  (2018) Selective review of offline change point detection methods. Cited by: §2.3.1, §4.3.1.
-  (2000) Adaptive, model-based monitoring for cyber attack detection. In International Workshop on Recent Advances in Intrusion Detection, pp. 80–93. Cited by: §1.
-  (2016) Inferring computational state machine models from program executions. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 122–132. Cited by: 2nd item.
-  (2018) Testing functional black-box programs without a specification. In Machine Learning for Dynamic Software Analysis: Potentials and Limits: International Dagstuhl Seminar 16172, Dagstuhl Castle, Germany, April 24-27, 2016, Revised Papers, A. Bennaceur, R. Hähnle, and K. Meinke (Eds.), pp. 101–120. Cited by: §1.
-  (2017) Time series classification from scratch with deep neural networks: a strong baseline. In 2017 International joint conference on neural networks (IJCNN), pp. 1578–1585. Cited by: §2.4.1.
-  (2015) Deep convolutional neural networks on multichannel time series for human activity recognition. In Twenty-Fourth International Joint Conference on Artificial Intelligence, Cited by: §2.4.1.
-  (2017) DeepSense: a unified deep learning framework for time-series mobile sensing data processing. In Proceedings of the 26th International Conference on World Wide Web, WWW ’17, Republic and Canton of Geneva, CHE, pp. 351–360. Cited by: §2.4.2.
-  (2014) Convolutional neural networks for human activity recognition using mobile sensors. In 6th International Conference on Mobile Computing, Applications and Services, pp. 197–205. Cited by: §2.4.1.
-  (2016) Exploiting multi-channels deep convolutional neural networks for multivariate time series classification. Frontiers of Computer Science 10 (1), pp. 96–112. Cited by: §2.4.2.