1 Introduction
Learning temporal logic requirements from data is an emergent research field gaining momentum in the rigorous engineering of cyberphysical systems. Classical machine learning methods typically generate very powerful blackbox (statistical) models. However, these models often do not help in the comprehension of the phenomenon they capture. Temporal logic provides a precise formal specification language that can easily be interpreted by humans. The possibility to describe datasets in a concise way using temporal logic formulas can thus help to better clarify and comprehend which are the emergent patterns for the system at hand. A clearcut example is the problem of anomaly detection, where the input is a set of trajectories describing regular or anomalous behaviors, and the goal is to learn a classifier that not only can be used to detect anomalous behaviors at runtime, but also gives insights on what characterizes an anomalous behavior. Learning temporal properties is also relevant in combination with state of the art techniques for searchbased falsification of complex closedloop systems
[0001SDKJ15, Sankaranarayanan17, BartocciBNS15, SilvettiPB17], as it can provide an automatic way to describe desired (or unwanted behaviors) that the system needs to satisfy.In this paper, we consider the problem of learning a temporal logic specification from a set of trajectories which are labeled by human experts (or by any other method which is not usable for realtime monitoring) as “good” for the normal expected behaviors and “bad” for the anomalous ones. The goal is to automatically synthesize both the structure of the formula and its parameters providing a temporal logic classifier that can discriminate as much as possible the bad and the good behaviors. This specification can be turned into a monitor that output a positive verdict for good behaviors and a negative verdict for bad ones.
Related Work
Mining temporal logic requirements is an emerging field of research in the analysis of cyberphysical systems (CPS) [Xu2018, AckermannCHRSL10, AsarinDMN11, HoxhaDF18, BBS14, JinDDS15, nguyen_abnormal_2017, ZhouRWT17, kong_temporal_2017, bombara_decision_2016, BufoBSBLB14]. This approach is orthogonal to active automata learning (AAL) such as Angluin’s algorithm [Angluin87] and its recent variants [IsbernerHS14, SteffenHI12]. AAL is suitable to capture the behaviours of blackbox reactive systems and it has been successfully employed in the field of CPS to learn how to interact with the surrounding environments [Chen2013, FuTHC14]. Mining temporal logic requirements has the following advantages with respect to AAL. The first is that it does not require to interact with a reactive system. AAL needs to query the system in order to learn a Mealy machine representing the relation between the input provided and the output observed. Mining temporal logic requirements can be applied directly to a set of observed signals without the necessity to provide an input. The second is the possibility to use temporal logic requirements within popular tools (such as Breach [Donze10] and STaLiRo [AnnpureddyLFS11]) for monitoring and falsification analysis of CPS models.
Most of the literature related to temporal logic inference from data focuses in particular on the problem of learning the optimal parameters given a specific template formula [Xu2018, AsarinDMN11, HoxhaDF18, BBS14, JinDDS15, nguyen_abnormal_2017, ZhouRWT17]. In [AsarinDMN11], Asarin et al. extend the Signal Temporal Logic (STL) [Maler2004] with the possibility to express the time bounds of the temporal operators and the constants of the inequalities as parameters. They also provide a geometric approach to identify the subset of the parameter space that makes a particular signal to satisfy an STL specification. Xu et al. have recently proposed in [Xu2018] a temporal logic framework called CensusSTL for multiagent systems that consists of an outer logic STL formula with a variable in the predicate representing the number of agents satisfying an inner logic STL formula. In the same paper the authors propose also a new inference algorithm similar to [AsarinDMN11] that given the templates for both the inner and outer formulas, searches for the optimal parameter values that make the two formulas capturing the trajectory data of a group of agents. In [HoxhaDF18] the authors use the same parametric STL extension in combination with the quantitative semantics of STL to perform a counterexample guided inductive parameter synthesis. This approach consists in iteratively generating a counterexample by executing a falsification tool for a template formula. The counterexample found at each step is then used to further refine the parameter set and the procedure terminates when no other counterexamples are found. In general, all these methods, when working directly with raw data, are potentially vulnerable to the noise of the measurements and they are limited by the amount of data available.
Learning both the structure and the parameters of a formula from a dataset poses even more challenges [kong_temporal_2017, bombara_decision_2016, BBS14, BufoBSBLB14]. This problem is usually addressed in two steps, learning the structure of the formula and synthesizing its parameters. In particular, in [kong_temporal_2017] the structure of the formula is learned by exploring a directed acyclic graph and the method exploits Support Vector Machine (SVM) for the parameter optimization. In [bombara_decision_2016] the authors use instead a decision tree
based approach for learning the formula, while the optimality is evaluated using heuristic impurity measures.
In our previous works [BBS14, BufoBSBLB14] we have also addressed the problem of learning both the structure and the parameters of a temporal logic specification from data. In [BBS14] the structure of the formula is learned using a heuristics algorithm, while in [BufoBSBLB14] using a genetic algorithm. The synthesis of the parameters is instead performed in both cases exploiting the Gaussian Process Upper Confidence Bound (GPUCB) [gpucb]
algorithm, statistically emulating the satisfaction probability of a formula for a given set of parameters. In both these methodologies, it is required to learn first a statistical model from the training set of trajectories. However, the statistical learning of this model can be a very difficult task and this is one of the main reason for proposing our new approach.
Our contribution
In this work, we consider the problem of mining the formula directly from a data set without requiring to learn a generative model from data. To achieve this goal, we introduce a number of techniques to improve the potentials of the genetic algorithm and to deal with the noise in the data in the absence of an underlying model.
First, instead of using the probability satisfaction as evaluator for the best formula, we design a discrimination function based on the quantitative semantics (or robustness) of STL and in particular the average robustness introduced in [BartocciBNS15]. The average robustness enables us to differentiate among STL classifiers that have similar discriminating performance with respect to the data by choosing the most robust one. This gives us more information than just having the probability of satisfaction: for each trajectory, we can evaluate how much is it closed to the “boundary” of the classifier, instead of only checking whether it satisfies or not a formula. We then modify the discrimination function and the GPUCB algorithm used in [BBS14] and [BufoBSBLB14] to better deal with noisy data and to use the quantitative semantics to emulate the average robustness distribution with respect to the parameter space of the formula.
Second, we reduce the computational cost of the Evolutionary Algorithm
(EA) by using a lightweight configuration (i.e., a low threshold of max number of iterations) of the GPUCB optimization algorithm to estimate the parameters of the formulas at each generation. The EA algorithm generates, as a final result, an STL formula tailored for classification purpose.
We compare our framework with our previous methodology [BufoBSBLB14] and with the decisiontree based approach presented in [bombara_decision_2016] on an anomalous trajectory detection problem of a naval surveillance and on an Assisted Ventilation in Intensive Care Patients. Our experiments indicate that the proposed approach outperforms our previous work with respect to accuracy and show that it produces in general more compact, and easy to read, temporal logic specifications with respect to the decisiontree based approach with a comparable speed and accuracy.
Paper structure
The rest of the paper is organized as follows: in the next section we present the Signal Temporal Logic and its robust semantics. We then introduce the problem considered in Section 3. In section 4, we describe our approach. The results are presented in Section 5. Finally, we conclude the paper in Section 6, by discussing the implications of our contribution, both from the practical and the methodological aspect and some directions of improvement.
2 Signal Temporal Logic
Stl
Signal Temporal Logic (STL) [Maler2004] is a formal specification language to express temporal properties over realvalues trajectories with densetime interval. For the rest of the paper, let be a trace/trajectory, where is the time domain, is the value at time of the projection on the coordinate, and , as an abuse on the notation, is used also to indicate the set of variables of the trace considered in the formulae.
Definition 1 (STL syntax)
The syntax of STL is given by
where is the Boolean true constant, is an atomic proposition, inequality of the form (), negation and conjunction are the standard Boolean connectives, and is the Until temporal modality, where is a real positive interval. As customary, we can derive the disjunction operator and the future eventually and always operators from the until temporal modality
STL can be interpreted over a trajectory using a qualitative (yes/no) or a quantitative (realvalue) semantics [Maler2004, Donze2013]. We report here only the quantitative semantics and we refer the reader to [MalerN13, Maler2004, Donze2013] for more details.
Definition 2 (STL Quantitative Semantics)
The quantitative satisfaction function returns a value ,^{1}^{1}1 quantifying the robustness degree (or satisfaction degree) of the property by the trajectory at time . It is defined recursively as follows:
Moreover, we let .
The sign of provides the link with the standard Boolean semantics of [Maler2004]: only if , while only if ^{2}^{2}2The case , instead, is a borderline case, and the truth of cannot be assessed from the robustness degree alone.. The absolute value of , instead, can be interpreted as a measure of the robustness of the satisfaction with respect to noise in signal , measured in terms of the maximal perturbation in the secondary signal , preserving truth value. This means that if then for every signal for which every secondary signal satisfies , we have that if and only if (correctness property).
Pstl
Parametric Signal Temporal Logic [AsarinDMN11] is an extension of STL that parametrizes the formulas. There are two types of formula parameters: temporal parameters, corresponding to the time bounds in the time intervals associated with temporal operators (e.g. , with , s.t. ), and the threshold parameters, corresponding to the constants in the inequality predicates (e.g., , where is a variable of the trajectory). In this paper, we allow only atomic propositions of the form with . Given an STL formula , let be the parameter space, where is the temporal parameter space ( being the number of temporal parameters), and is the threshold parameter space ( being the number of threshold parameters). A is a parameter configuration that induces a corresponding STL formula; e.g., then .
Stochastic Robustness
Let us consider an unknown stochastic process , where each vector corresponds to the state of the system at time . For simplicity, we indicate the stochastic model with .
is a measurable also as a random variable
on the space valued cadlag functions , here denoted by , assuming the domain to be fixed. It means that the set of trajectories of the stochastic process is the set . Let us consider now an STL formula , with predicates interpreted over state variables of . Given a trajectory of a stochastic system, its robustness is a measurable functional [BartocciBNS15] from the trajectories in to which defines the realvalued random variablewith probability distribution:
Such distribution of robustness degrees can be summarized by the average robustness degree. Fixing the stochastic process , , it gives a measure of how strongly a formula is satisfied. The satisfaction is more robust when this value is higher. In this paper, we will approximate this expectation by Monte Carlo sampling.
3 Problem formulation
In this paper, we focus our attention on learning the best property (or set of properties) that discriminates trajectories belonging to two different classes, say good and bad, starting from a labeled dataset of observed trajectories. Essentially, we want to tackle a supervised twoclass classification problem over trajectories, by learning a temporal logic discriminant, describing the temporal patterns that better separate two sets of observed trajectories.
The idea behind this approach is that there exists an unknown procedure that, given a temporal trajectory, is able to decide if the signal is good or bad. This procedure can correspond to many different things, e.g., to the reason of the failure of a sensor that breaks when it receives certain inputs. In general, as there may not be an STL specification that perfectly explains/mimics the unknown procedure, our task is to approximate it with the most effective one.
Our approach works directly with observed data, and avoids the reconstruction of an intermediate generative model of trajectories conditioned on their class , as in [BufoBSBLB14, BBS14]. The reason is that such models can be hard to construct, even if they provide a powerful regularization, as they enable the generation of an arbitrary number of samples to train the logic classifier.
In a purely datadriven setting, to build an effective classifier, we need to consider that training data is available in limited amounts and it is typically noisy. This reflects in the necessity of finding methods that guarantee good generalization performance and avoid overfitting. In our context, overfitting can result in overly complex formulae, de facto encoding the training set itself rather than guessing the underlying patterns that separate the trajectories. This can be faced by using a score function based on robustness of temporal properties, combined with suitably designed regularizing terms.
We want to stress that the approach we present here, due to the use of the average robustness of STL properties, can be easily tailored to different problems, like finding the property that best characterise a single set of observations.
4 Methodology
Learning an STL formula can be separated in two optimization problems: the learning of the formula structure and the synthesis of the formula parameters. The structural learning is treated as a discrete optimization problem using an Genetic Algorithm
(GA); the parameter synthesis, instead, considers a continuous parameter space and exploits an active learning algorithm, called
Gaussian Process Upper Confidence Bound (GPUCB). The two techniques are not used separately but combined together in a bilevel optimization. The GA acts externally by defining a set of template formulae. Then, the GPUCB algorithm, which acts at the inner level, finds the best parameter configuration such that each template formula better discriminates between the two datasets. For both optimizations, we need to define a score function to optimize, encoding the criterion to discriminate between the two datasets.Our implementation, called RObustness GEnetic (ROGE) algorithm is described in Algorithm 1. First, we give an overview of it and then we described each specific function in the next subsections. The algorithm requires as input the dataset (good) and (bad), the parameter space , with the bound of each considered variable, the size of the initial set of formulae , the number of maximum iterations , the mutation probability and the maximum initial formula size . The algorithm starts generating (line 1) an initial set of PSTL formulae, called . For each of these formulae (line 2), the algorithm learns the best parameters accordingly to a discrimination function . We call the generation for which we know the best parameters of each formula. During each iteration, the algorithm (line 4), guided by a fitness function F , extracts a subset composed by the best formulae, from the initial set . From this subset (line 5), a new set composed of formulae is created by employing the Evolve routine, which implements two genetic operators. Then (line 6), as in line 2, the algorithm identifies the best parameters of each formula belonging to . The new generation and the old generation are then merged together (line 7). From this set of formulae the algorithm extracts, with respect to the fitness function , the new generation of formulae. At the end of the iterations or when the stop criterion is true (lines 812), the algorithm returns the last generated formulae. The best formula is the one with the highest value of the discrimination function, i.e., . We describe below in order: the discrimination function algorithm, the learning of the parameters of a formula template and the learning of the formula structure.
4.1 Discrimination Function
We have two datasets and and we search for the formula that better separates these two classes. We define a function able to discriminate between this two datasets, such that maximising this discrimination function corresponds to finding the best formula classifier. In this paper, we decide to use, as evaluation of satisfaction of each formula, the quantitative semantics of STL. As described in Section 2, this semantics computes a realvalue (robustness degree) instead of only a yes/no answer.
Given a dataset , we assume that the data comes from an unknown stochastic process . The process in this case is like a blackbox for which we observe only a subset of trajectories, the dataset . We can then estimate the averages robustness
and its variance
, averaging over .To discriminate between the two dataset and , we search the formula that maximizes the difference between the average robustness of , , and the average robustness of ,
divided by the sum of the respective standard deviation:
(1) 
This formula is proportional to the probability that a new point sampled from the distribution generating or , will belong to one set and not to the other. In fact, an higher value of implies that the two average robustness will be sufficiently distant relative to their lengthscale, given by the standard deviation .
As said above, we can evaluate only a statistical approximation of because we have a limited number of samples belonging to and . We will see in the next paragraph how we tackle this problem.
4.2 GBUCP: learning the parameters of the formula
Given a formula and a parameter space , we want to find the parameter configuration that maximises the score function , considering that the evaluations of this function are noisy. So, the question is how to best estimate (and optimize) an unknown function from observations of its value at a finite set of input points. This is a classic nonlinear nonconvex optimization problem that we tackle by means of the GPUCB algorithm [gpucb]
. This algorithm interpolates the noisy observations using a stochastic process (a procedure called emulation in statistics) and optimize the emulated function using the uncertainty of the fit to determine regions where the true maximum can lie. More specifically, the GPUCB bases its emulation phase on Gaussian Processes, a Bayesian nonparametric regression approach
[Rasmussen2007], adding candidate maximum points to the training set of the GP in an iterative fashion, and terminating when no improvement is possible (see [gpucb] for more details).After this optimization, we have found a formula that separates the two datasets, not from the point of view of the satisfaction (yes/no) of the formula but from the point of view of the robustness value. In other words, there is a threshold value such that and . Now, we consider the new STL formula obtained translating the atomic predicates of by (e.g., becomes ). Taking into account that the quantitative robustness is achieved by combination of , , and , which are linear algebraic operators with respect to translations (e.g, ), we easily obtain that and . Therefore, will divide the two datasets also from the point of view of the satisfaction.
4.3 Genetic Algorithm: learning the structure of the formula
To learn the formula structure, we exploit a modified version of the Genetic Algorithm (GA) presented in [BufoBSBLB14]. GAs belongs to the larger class of evolutionary algorithms (EA). They are used for search and optimization problems. The strategy takes inspiration from the genetic area, in particular in the selection strategy of species. Let us see now in detail the genetic routines of the ROGE algorithm.
.
This routine generates the initial set of formulae. A fraction of this initial set is constructed by a subset of the temporal properties: , , , where the atomic predicates are of the form or or simple boolean combinations of them (e.g. ). The size of this initial set is exponential accordingly to the number of considered variables . If we have few variables we can keep all the generated formulae, whereas if the number of variables is large we consider only a random subset. The remain formulae are chosen randomly from the set of formulae with a maximum size defined by the input parameter . This size can be adapted to have a wider range of initial formulae.
.
This procedure extracts from a subset of formulae, according to a fitness function . The first factor is the dicrimination function previously defined and is a size penalty, i.e. a function penalizes formulae with large sizes. In this paper, we consider , where is heuristically set such that , i.e. formulae of size 5 get a 50% penalty, and is adaptively computed as the average discrimination in the current generation. An alternative choice of can be done by crossvalidation.
.
This routine defines a new set of formulae () starting from , exploiting two genetic operators: the recombination and the mutation operator. The recombination operator takes as input two formulae , (the parents), it randomly chooses a subtree from each formula and it swaps them, i.e. it assigns the subtree of to and viceversa. As a result, the two generated formulae (the children) share different subtrees of the parents’ formulae. The mutation operator takes as input one formula and induce a random change (such as inequality flip, temporal operator substitution, etc.) on a randomly selected node of its treestructure. The purpose of the genetic operators is to introduce innovation in the offspring population which leads to the optimization of a target function ( in this case). In particular, recombination is related to exploitation, whereas mutation to exploration. The evolve routine implements an iterative loop that at each iteration selects which genetic operators to apply. A number is randomly sampled. If its value is lower than the mutation probability, i.e., , then the mutation is selected, otherwise a recombination is performed. At this point the input formulae of the selected genetic operator are chosen randomly between the formulas composing and the genetic operators are applied. In our implementation, we give more importance to the exploitation; therefore the mutation acts less frequently than the recombination (i.e., ). The iteration loops will stop when the number of generated formula is equal to , i.e. twice the cardinality of .
5 Case studies and Experimental Results
5.1 Maritime Surveillance
As first case study, We consider the maritime surveillance problem presented in [bombara_decision_2016] to compare our framework with their Decision Tree (DTL4STL) approach. The experiments with the DTL4STL approach were implemented in Matlab, the code is available at [DTL4STL]. We also test our previous implementation presented in [BufoBSBLB14] with the same benchmark. Both the new an the previous learning procedure were implemented in Java (JDK 1.8_0) and run on a Dell XPS, Windows 10 Pro, Intel Core i77700HQ 2.8 GHz, 8GB 1600 MHz memory.
The synthetic dataset of naval surveillance reported in [bombara_decision_2016] consists of 2dimensional coordinates traces of vessels behaviours. It considers two kind of anomalous trajectories and regular trajectories, as illustrated in Fig. 1. The dataset contains 2000 total trajectories (1000 normal and 1000 anomalous) with 61 sample points per trace.
We run the experiments using a 10fold crossvalidation in order to collect the mean and variance of the misclassified trajectories of the validation set. Results are summarized in Table 1, where we report also the average execution time. We test also our previous implementation [BufoBSBLB14] without a generative model from data. It performs so poorly on the chosen benchmark that is not meaningful to report it: the misclassification rate on the validation set is around 50%. In Table 1, we also report DTL4STL the DTL4STL performance declared in [bombara_decision_2016], but we were not be able to reproduce them in our setting.
ROGE  DTL4STL  DTL4STL  
Mis. Rate  
Comp. Time    

In terms of accuracy our approach is comparable with respect to the performance of the DTL4STL. In terms of computational cost, the decision tree approach is slightly slower but it is implemented in Matlab rather than Java.
An example of formula that we learn with ROGE is
(2) 
The formula (2) identifies all the regular trajectories, remarkably resulting in a misclassification error equal to zero, as reported in Table 1. The red anomalous trajectories falsify the predicate before verifying , on the contrary the blue anomalous trajectories globally satisfy but never verify (consider that all the vessels start from the top right part of the graphic in Figure 1). Both these conditions result in the falsification of Formula( 2), which on the contrary is satisfied by all the regular trajectories. In Figure 1, we have reported the threshold and .
An example instead of formula found by the DTL4STL tool using the same dataset is the following:
The specific formula generated using ROGE is simpler than the formula generated using DTL4STL and indeed it is easier to understand it. Furthermore DTL4STL does not consider the until operator in the set of primitives (see [bombara_decision_2016] for more details).
5.2 Ineffective Inspiratory Effort (IIE)
The Ineffective Inspiratory Effort (IIE) is one of the major problems concerning the mechanical ventilation of patients in intensive care suffering from respiratory failure. The mechanical ventilation uses a pump to help the patient in the process of ispiration and expiration, controlling the of air into the lugs and the airway pressure (). Inspiration is characterised by growth of pressure up to the selected value and positive , while expiration has a negative and a drop . The exact dynamics of these respiratory curves stricly depends on patient and on ventilatory modes. We can see an example in Fig. 2 of two regular (blue regions) and one ineffective (red region) breaths. An IIE occurs when the patients tries to inspire, but its effort is not sufficient to trigger the appropriate reaction of the ventilator. This results in a longer expiration phase. Persistence of IIE may cause permanent damages of the respiratory system.
An early detection of anomalies using lowcost methods is a key challenge to prevent IIE conditions and still an open problem.
In [BufoBSBLB14], starting with a dataset of discrete time series and sampled values (labeled good and bad) from a single patient, the authors statically designed generative models of and of its derivative , for regular and ineffective signals. Then they used the simulations of such models to identify the best formula/formulae discriminating between them. In that contribution trajectories with different lengths are considered, treating as false trajectories that are too short to verify a given formula, and use this to detect the length of trajectories and separate anomalous breaths which last longer than regular ones. However, using the information about the duration of a breath is not advisable if the goal is to monitor the signals at runtime and detect the IEE at their onset. For this reason, in our approach we consider a new dataset, cutting the original trajectories used to generate the stochastic model in Bufo et al. [BufoBSBLB14] to a maximum common length of the order of 2 seconds. We also apply a moving average filter to reduce the noise in the computation of .
Specifically, the new dataset consists of 2dimensional traces of and its derivative, , containing a total of 288 trajectories (144 normal and 144 anomalous) with 134 sample points per trace. The time scale is in hundredths of a second. We run the experiments using a 6fold crossvalidation and report our results and comparison on the new dataset with DTL4STL [bombara_decision_2016] in Table 2.
ROGE  DTL4STL  

Mis. Rate  
False. Pos  
False. Neg  
Comp. Time 
An example of formula that we learn with ROGE is:
(3) 
Formula identifies the anomalous trajectories, stating that at a time between sec and seconds either is higher than or is below . This is in accordance with the informal description of an IIE, principally caused by an unexpected increment of the during the expiration phase followed by a rapid decrease. Indeed, one of the main characteristic of an IIE is the presence of a small hump in the curve during this phase. In Fig. 3 we report some of the trajectories of the dataset, showing the areas where the property is satisfied.
On average, formulae found by the DTL4STL tool on the new dataset are disjunctions or conjunctions of eventually or always subformulae, which are not readily interpretable.
Our approach on the new dataset is better in term of accuracy and computation time with an improvement of 22% and 67%, respectively.
Similarly to the previous test case, we compare our approach with [BufoBSBLB14], performed directly on the dataset, and not on the generative model. That approach performs so poorly also in this benchmark, obtaining a misclassification rate of , which is comparable with a random classifier. The problem here is that the methods proposed in [BufoBSBLB14], differently from the one presented here, relies on a large number of model simulations, and it is not suited to work directly with observed data.
If we give up to online monitoring and consider only full breaths, we can improve classification by rescaling their duration into [0,1], so that each breath lasts the same amount of time. In this case, we learn a formula with misclassification rate of , while DTL4STL reaches a misclassification rate of . This suggests that the high variability of durations of ineffective breaths has to be treated more carefully.
6 Conclusion
We present a framework to learn from a labeled dataset of normal and anomalous trajectories a Signal Temporal Logic (STL) specification that better discriminates among them. In particular, we design a Robust Genetic algorithm (ROGE) that combines an Evolutionary Algorithm (EA) for learning the structure of the formula and the Gaussian Process Upper Confidence Bound algorithm (GPUCB) for synthesizing the formula parameters. We compare ROGE with our previous work [BufoBSBLB14] and with the Decision Tree approach presented in [bombara_decision_2016] on an anomalous trajectory detection problem of a maritime surveillance system and on an Ineffective Inspiratory Effort example.
A significant difference concerning our previous approach [BBS14, BufoBSBLB14] is that we avoid reconstructing a generative statistical model from the dataset. Furthermore, we modified both the structure and parameter optimization procedure of the genetic algorithm, which now relays on the robustness semantics of STL. This structural improvement was necessary considering that a naive application of our previous approach [BufoBSBLB14] directly on datasets performs very poorly.
We compare our method also with the Decision Tree (DTL4STL) approach of [bombara_decision_2016] obtaining a low misclassification rate and producing smaller and more interpretable STL specifications. Furthermore, we do not restrict the class of the temporal formula to only eventually and globally and we allow arbitrary temporal nesting.
Our genetic algorithm can get wrong results if the formulae of the initial generation are chosen entirely randomly. We avoid this behavior by considering simpler formulae from the beginning as a result of the generateInitialFormulae routine. This approach is a form of regularization and resembles the choice of the set of primitive of DTL4STL.
As future work, we plan to develop an iterative method which uses the proposed genetic algorithm and the robustness of STL to reduce the misclassification rate of the generated formula. The idea is to use the robustness value of a learned formula to identify the region of the trajectory space where the generated formula has a high accuracy, i.e. trajectories whose robustness is greater than a positive threshold or smaller than a negative threshold . These thresholds can be identified by reducing the number of false positives or false negatives. We can then train a new STL classifier on the remaining trajectories, having small robustness for . This will create a hierarchical classifier, that first tests on , if robustness is too low it tests on , and so on. The depth of such classification is limited only by the remaining data at each step. The method can be further strengthen by relying on bagging to generate an ensemble of classifiers at each level, averaging their predictions or choosing the best answer, i.e. the one with larger robustness.
Acknowledgment
E.B. and L.N. acknowledge the partial support of the Austrian National Research Network S 11405N23 (RiSE/SHiNE) of the Austrian Science Fund (FWF). E.B., L.N. and S.S. acknowledge the partial support of the ICT COST Action IC1402 (ARVI).
References
2 Signal Temporal Logic
Stl
Signal Temporal Logic (STL) [Maler2004] is a formal specification language to express temporal properties over realvalues trajectories with densetime interval. For the rest of the paper, let be a trace/trajectory, where is the time domain, is the value at time of the projection on the coordinate, and , as an abuse on the notation, is used also to indicate the set of variables of the trace considered in the formulae.
Definition 1 (STL syntax)
The syntax of STL is given by
where is the Boolean true constant, is an atomic proposition, inequality of the form (), negation and conjunction are the standard Boolean connectives, and is the Until temporal modality, where is a real positive interval. As customary, we can derive the disjunction operator and the future eventually and always operators from the until temporal modality
STL can be interpreted over a trajectory using a qualitative (yes/no) or a quantitative (realvalue) semantics [Maler2004, Donze2013]. We report here only the quantitative semantics and we refer the reader to [MalerN13, Maler2004, Donze2013] for more details.
Definition 2 (STL Quantitative Semantics)
The quantitative satisfaction function returns a value ,^{1}^{1}1 quantifying the robustness degree (or satisfaction degree) of the property by the trajectory at time . It is defined recursively as follows:
Moreover, we let .
The sign of provides the link with the standard Boolean semantics of [Maler2004]: only if , while only if ^{2}^{2}2The case , instead, is a borderline case, and the truth of cannot be assessed from the robustness degree alone.. The absolute value of , instead, can be interpreted as a measure of the robustness of the satisfaction with respect to noise in signal , measured in terms of the maximal perturbation in the secondary signal , preserving truth value. This means that if then for every signal for which every secondary signal satisfies , we have that if and only if (correctness property).
Pstl
Parametric Signal Temporal Logic [AsarinDMN11] is an extension of STL that parametrizes the formulas. There are two types of formula parameters: temporal parameters, corresponding to the time bounds in the time intervals associated with temporal operators (e.g. , with , s.t. ), and the threshold parameters, corresponding to the constants in the inequality predicates (e.g., , where is a variable of the trajectory). In this paper, we allow only atomic propositions of the form with . Given an STL formula , let be the parameter space, where is the temporal parameter space ( being the number of temporal parameters), and is the threshold parameter space ( being the number of threshold parameters). A is a parameter configuration that induces a corresponding STL formula; e.g., then .
Stochastic Robustness
Let us consider an unknown stochastic process , where each vector corresponds to the state of the system at time . For simplicity, we indicate the stochastic model with .
is a measurable also as a random variable
on the space valued cadlag functions , here denoted by , assuming the domain to be fixed. It means that the set of trajectories of the stochastic process is the set . Let us consider now an STL formula , with predicates interpreted over state variables of . Given a trajectory of a stochastic system, its robustness is a measurable functional [BartocciBNS15] from the trajectories in to which defines the realvalued random variablewith probability distribution:
Such distribution of robustness degrees can be summarized by the average robustness degree. Fixing the stochastic process , , it gives a measure of how strongly a formula is satisfied. The satisfaction is more robust when this value is higher. In this paper, we will approximate this expectation by Monte Carlo sampling.
3 Problem formulation
In this paper, we focus our attention on learning the best property (or set of properties) that discriminates trajectories belonging to two different classes, say good and bad, starting from a labeled dataset of observed trajectories. Essentially, we want to tackle a supervised twoclass classification problem over trajectories, by learning a temporal logic discriminant, describing the temporal patterns that better separate two sets of observed trajectories.
The idea behind this approach is that there exists an unknown procedure that, given a temporal trajectory, is able to decide if the signal is good or bad. This procedure can correspond to many different things, e.g., to the reason of the failure of a sensor that breaks when it receives certain inputs. In general, as there may not be an STL specification that perfectly explains/mimics the unknown procedure, our task is to approximate it with the most effective one.
Our approach works directly with observed data, and avoids the reconstruction of an intermediate generative model of trajectories conditioned on their class , as in [BufoBSBLB14, BBS14]. The reason is that such models can be hard to construct, even if they provide a powerful regularization, as they enable the generation of an arbitrary number of samples to train the logic classifier.
In a purely datadriven setting, to build an effective classifier, we need to consider that training data is available in limited amounts and it is typically noisy. This reflects in the necessity of finding methods that guarantee good generalization performance and avoid overfitting. In our context, overfitting can result in overly complex formulae, de facto encoding the training set itself rather than guessing the underlying patterns that separate the trajectories. This can be faced by using a score function based on robustness of temporal properties, combined with suitably designed regularizing terms.
We want to stress that the approach we present here, due to the use of the average robustness of STL properties, can be easily tailored to different problems, like finding the property that best characterise a single set of observations.
4 Methodology
Learning an STL formula can be separated in two optimization problems: the learning of the formula structure and the synthesis of the formula parameters. The structural learning is treated as a discrete optimization problem using an Genetic Algorithm
(GA); the parameter synthesis, instead, considers a continuous parameter space and exploits an active learning algorithm, called
Gaussian Process Upper Confidence Bound (GPUCB). The two techniques are not used separately but combined together in a bilevel optimization. The GA acts externally by defining a set of template formulae. Then, the GPUCB algorithm, which acts at the inner level, finds the best parameter configuration such that each template formula better discriminates between the two datasets. For both optimizations, we need to define a score function to optimize, encoding the criterion to discriminate between the two datasets.Our implementation, called RObustness GEnetic (ROGE) algorithm is described in Algorithm 1. First, we give an overview of it and then we described each specific function in the next subsections. The algorithm requires as input the dataset (good) and (bad), the parameter space , with the bound of each considered variable, the size of the initial set of formulae , the number of maximum iterations , the mutation probability and the maximum initial formula size . The algorithm starts generating (line 1) an initial set of PSTL formulae, called . For each of these formulae (line 2), the algorithm learns the best parameters accordingly to a discrimination function . We call the generation for which we know the best parameters of each formula. During each iteration, the algorithm (line 4), guided by a fitness function F , extracts a subset composed by the best formulae, from the initial set . From this subset (line 5), a new set composed of formulae is created by employing the Evolve routine, which implements two genetic operators. Then (line 6), as in line 2, the algorithm identifies the best parameters of each formula belonging to . The new generation and the old generation are then merged together (line 7). From this set of formulae the algorithm extracts, with respect to the fitness function , the new generation of formulae. At the end of the iterations or when the stop criterion is true (lines 812), the algorithm returns the last generated formulae. The best formula is the one with the highest value of the discrimination function, i.e., . We describe below in order: the discrimination function algorithm, the learning of the parameters of a formula template and the learning of the formula structure.
4.1 Discrimination Function
We have two datasets and and we search for the formula that better separates these two classes. We define a function able to discriminate between this two datasets, such that maximising this discrimination function corresponds to finding the best formula classifier. In this paper, we decide to use, as evaluation of satisfaction of each formula, the quantitative semantics of STL. As described in Section 2, this semantics computes a realvalue (robustness degree) instead of only a yes/no answer.
Given a dataset , we assume that the data comes from an unknown stochastic process . The process in this case is like a blackbox for which we observe only a subset of trajectories, the dataset . We can then estimate the averages robustness
and its variance
, averaging over .To discriminate between the two dataset and , we search the formula that maximizes the difference between the average robustness of , , and the average robustness of ,
divided by the sum of the respective standard deviation:
(1) 
This formula is proportional to the probability that a new point sampled from the distribution generating or , will belong to one set and not to the other. In fact, an higher value of implies that the two average robustness will be sufficiently distant relative to their lengthscale, given by the standard deviation .
As said above, we can evaluate only a statistical approximation of because we have a limited number of samples belonging to and . We will see in the next paragraph how we tackle this problem.
4.2 GBUCP: learning the parameters of the formula
Given a formula and a parameter space , we want to find the parameter configuration that maximises the score function , considering that the evaluations of this function are noisy. So, the question is how to best estimate (and optimize) an unknown function from observations of its value at a finite set of input points. This is a classic nonlinear nonconvex optimization problem that we tackle by means of the GPUCB algorithm [gpucb]
. This algorithm interpolates the noisy observations using a stochastic process (a procedure called emulation in statistics) and optimize the emulated function using the uncertainty of the fit to determine regions where the true maximum can lie. More specifically, the GPUCB bases its emulation phase on Gaussian Processes, a Bayesian nonparametric regression approach
[Rasmussen2007], adding candidate maximum points to the training set of the GP in an iterative fashion, and terminating when no improvement is possible (see [gpucb] for more details).After this optimization, we have found a formula that separates the two datasets, not from the point of view of the satisfaction (yes/no) of the formula but from the point of view of the robustness value. In other words, there is a threshold value such that and . Now, we consider the new STL formula obtained translating the atomic predicates of by (e.g., becomes ). Taking into account that the quantitative robustness is achieved by combination of , , and , which are linear algebraic operators with respect to translations (e.g, ), we easily obtain that and . Therefore, will divide the two datasets also from the point of view of the satisfaction.
4.3 Genetic Algorithm: learning the structure of the formula
To learn the formula structure, we exploit a modified version of the Genetic Algorithm (GA) presented in [BufoBSBLB14]. GAs belongs to the larger class of evolutionary algorithms (EA). They are used for search and optimization problems. The strategy takes inspiration from the genetic area, in particular in the selection strategy of species. Let us see now in detail the genetic routines of the ROGE algorithm.
.
This routine generates the initial set of formulae. A fraction of this initial set is constructed by a subset of the temporal properties: , , , where the atomic predicates are of the form or or simple boolean combinations of them (e.g. ). The size of this initial set is exponential accordingly to the number of considered variables . If we have few variables we can keep all the generated formulae, whereas if the number of variables is large we consider only a random subset. The remain formulae are chosen randomly from the set of formulae with a maximum size defined by the input parameter . This size can be adapted to have a wider range of initial formulae.
.
This procedure extracts from a subset of formulae, according to a fitness function . The first factor is the dicrimination function previously defined and is a size penalty, i.e. a function penalizes formulae with large sizes. In this paper, we consider , where is heuristically set such that , i.e. formulae of size 5 get a 50% penalty, and is adaptively computed as the average discrimination in the current generation. An alternative choice of can be done by crossvalidation.
.
This routine defines a new set of formulae () starting from , exploiting two genetic operators: the recombination and the mutation operator. The recombination operator takes as input two formulae , (the parents), it randomly chooses a subtree from each formula and it swaps them, i.e. it assigns the subtree of to and viceversa. As a result, the two generated formulae (the children) share different subtrees of the parents’ formulae. The mutation operator takes as input one formula and induce a random change (such as inequality flip, temporal operator substitution, etc.) on a randomly selected node of its treestructure. The purpose of the genetic operators is to introduce innovation in the offspring population which leads to the optimization of a target function ( in this case). In particular, recombination is related to exploitation, whereas mutation to exploration. The evolve routine implements an iterative loop that at each iteration selects which genetic operators to apply. A number is randomly sampled. If its value is lower than the mutation probability, i.e., , then the mutation is selected, otherwise a recombination is performed. At this point the input formulae of the selected genetic operator are chosen randomly between the formulas composing and the genetic operators are applied. In our implementation, we give more importance to the exploitation; therefore the mutation acts less frequently than the recombination (i.e., ). The iteration loops will stop when the number of generated formula is equal to , i.e. twice the cardinality of .
5 Case studies and Experimental Results
5.1 Maritime Surveillance
As first case study, We consider the maritime surveillance problem presented in [bombara_decision_2016] to compare our framework with their Decision Tree (DTL4STL) approach. The experiments with the DTL4STL approach were implemented in Matlab, the code is available at [DTL4STL]. We also test our previous implementation presented in [BufoBSBLB14] with the same benchmark. Both the new an the previous learning procedure were implemented in Java (JDK 1.8_0) and run on a Dell XPS, Windows 10 Pro, Intel Core i77700HQ 2.8 GHz, 8GB 1600 MHz memory.
The synthetic dataset of naval surveillance reported in [bombara_decision_2016] consists of 2dimensional coordinates traces of vessels behaviours. It considers two kind of anomalous trajectories and regular trajectories, as illustrated in Fig. 1. The dataset contains 2000 total trajectories (1000 normal and 1000 anomalous) with 61 sample points per trace.
We run the experiments using a 10fold crossvalidation in order to collect the mean and variance of the misclassified trajectories of the validation set. Results are summarized in Table 1, where we report also the average execution time. We test also our previous implementation [BufoBSBLB14] without a generative model from data. It performs so poorly on the chosen benchmark that is not meaningful to report it: the misclassification rate on the validation set is around 50%. In Table 1, we also report DTL4STL the DTL4STL performance declared in [bombara_decision_2016], but we were not be able to reproduce them in our setting.
ROGE  DTL4STL  DTL4STL  
Mis. Rate  
Comp. Time    

In terms of accuracy our approach is comparable with respect to the performance of the DTL4STL. In terms of computational cost, the decision tree approach is slightly slower but it is implemented in Matlab rather than Java.
An example of formula that we learn with ROGE is
(2) 
The formula (2) identifies all the regular trajectories, remarkably resulting in a misclassification error equal to zero, as reported in Table 1. The red anomalous trajectories falsify the predicate before verifying , on the contrary the blue anomalous trajectories globally satisfy but never verify (consider that all the vessels start from the top right part of the graphic in Figure 1). Both these conditions result in the falsification of Formula( 2), which on the contrary is satisfied by all the regular trajectories. In Figure 1, we have reported the threshold and .
An example instead of formula found by the DTL4STL tool using the same dataset is the following:
The specific formula generated using ROGE is simpler than the formula generated using DTL4STL and indeed it is easier to understand it. Furthermore DTL4STL does not consider the until operator in the set of primitives (see [bombara_decision_2016] for more details).
5.2 Ineffective Inspiratory Effort (IIE)
The Ineffective Inspiratory Effort (IIE) is one of the major problems concerning the mechanical ventilation of patients in intensive care suffering from respiratory failure. The mechanical ventilation uses a pump to help the patient in the process of ispiration and expiration, controlling the of air into the lugs and the airway pressure (). Inspiration is characterised by growth of pressure up to the selected value and positive , while expiration has a negative and a drop . The exact dynamics of these respiratory curves stricly depends on patient and on ventilatory modes. We can see an example in Fig. 2 of two regular (blue regions) and one ineffective (red region) breaths. An IIE occurs when the patients tries to inspire, but its effort is not sufficient to trigger the appropriate reaction of the ventilator. This results in a longer expiration phase. Persistence of IIE may cause permanent damages of the respiratory system.
An early detection of anomalies using lowcost methods is a key challenge to prevent IIE conditions and still an open problem.
In [BufoBSBLB14], starting with a dataset of discrete time series and sampled values (labeled good and bad) from a single patient, the authors statically designed generative models of and of its derivative , for regular and ineffective signals. Then they used the simulations of such models to identify the best formula/formulae discriminating between them. In that contribution trajectories with different lengths are considered, treating as false trajectories that are too short to verify a given formula, and use this to detect the length of trajectories and separate anomalous breaths which last longer than regular ones. However, using the information about the duration of a breath is not advisable if the goal is to monitor the signals at runtime and detect the IEE at their onset. For this reason, in our approach we consider a new dataset, cutting the original trajectories used to generate the stochastic model in Bufo et al. [BufoBSBLB14] to a maximum common length of the order of 2 seconds. We also apply a moving average filter to reduce the noise in the computation of .
Specifically, the new dataset consists of 2dimensional traces of and its derivative, , containing a total of 288 trajectories (144 normal and 144 anomalous) with 134 sample points per trace. The time scale is in hundredths of a second. We run the experiments using a 6fold crossvalidation and report our results and comparison on the new dataset with DTL4STL [bombara_decision_2016] in Table 2.
ROGE  DTL4STL  

Mis. Rate  
False. Pos  
False. Neg  
Comp. Time 
An example of formula that we learn with ROGE is:
(3) 
Formula identifies the anomalous trajectories, stating that at a time between sec and seconds either is higher than or is below . This is in accordance with the informal description of an IIE, principally caused by an unexpected increment of the during the expiration phase followed by a rapid decrease. Indeed, one of the main characteristic of an IIE is the presence of a small hump in the curve during this phase. In Fig. 3 we report some of the trajectories of the dataset, showing the areas where the property is satisfied.
On average, formulae found by the DTL4STL tool on the new dataset are disjunctions or conjunctions of eventually or always subformulae, which are not readily interpretable.
Our approach on the new dataset is better in term of accuracy and computation time with an improvement of 22% and 67%, respectively.
Similarly to the previous test case, we compare our approach with [BufoBSBLB14], performed directly on the dataset, and not on the generative model. That approach performs so poorly also in this benchmark, obtaining a misclassification rate of , which is comparable with a random classifier. The problem here is that the methods proposed in [BufoBSBLB14], differently from the one presented here, relies on a large number of model simulations, and it is not suited to work directly with observed data.
If we give up to online monitoring and consider only full breaths, we can improve classification by rescaling their duration into [0,1], so that each breath lasts the same amount of time. In this case, we learn a formula with misclassification rate of , while DTL4STL reaches a misclassification rate of . This suggests that the high variability of durations of ineffective breaths has to be treated more carefully.
6 Conclusion
We present a framework to learn from a labeled dataset of normal and anomalous trajectories a Signal Temporal Logic (STL) specification that better discriminates among them. In particular, we design a Robust Genetic algorithm (ROGE) that combines an Evolutionary Algorithm (EA) for learning the structure of the formula and the Gaussian Process Upper Confidence Bound algorithm (GPUCB) for synthesizing the formula parameters. We compare ROGE with our previous work [BufoBSBLB14] and with the Decision Tree approach presented in [bombara_decision_2016] on an anomalous trajectory detection problem of a maritime surveillance system and on an Ineffective Inspiratory Effort example.
A significant difference concerning our previous approach [BBS14, BufoBSBLB14] is that we avoid reconstructing a generative statistical model from the dataset. Furthermore, we modified both the structure and parameter optimization procedure of the genetic algorithm, which now relays on the robustness semantics of STL. This structural improvement was necessary considering that a naive application of our previous approach [BufoBSBLB14] directly on datasets performs very poorly.
We compare our method also with the Decision Tree (DTL4STL) approach of [bombara_decision_2016] obtaining a low misclassification rate and producing smaller and more interpretable STL specifications. Furthermore, we do not restrict the class of the temporal formula to only eventually and globally and we allow arbitrary temporal nesting.
Our genetic algorithm can get wrong results if the formulae of the initial generation are chosen entirely randomly. We avoid this behavior by considering simpler formulae from the beginning as a result of the generateInitialFormulae routine. This approach is a form of regularization and resembles the choice of the set of primitive of DTL4STL.
As future work, we plan to develop an iterative method which uses the proposed genetic algorithm and the robustness of STL to reduce the misclassification rate of the generated formula. The idea is to use the robustness value of a learned formula to identify the region of the trajectory space where the generated formula has a high accuracy, i.e. trajectories whose robustness is greater than a positive threshold or smaller than a negative threshold . These thresholds can be identified by reducing the number of false positives or false negatives. We can then train a new STL classifier on the remaining trajectories, having small robustness for . This will create a hierarchical classifier, that first tests on , if robustness is too low it tests on , and so on. The depth of such classification is limited only by the remaining data at each step. The method can be further strengthen by relying on bagging to generate an ensemble of classifiers at each level, averaging their predictions or choosing the best answer, i.e. the one with larger robustness.
Acknowledgment
E.B. and L.N. acknowledge the partial support of the Austrian National Research Network S 11405N23 (RiSE/SHiNE) of the Austrian Science Fund (FWF). E.B., L.N. and S.S. acknowledge the partial support of the ICT COST Action IC1402 (ARVI).
References
3 Problem formulation
In this paper, we focus our attention on learning the best property (or set of properties) that discriminates trajectories belonging to two different classes, say good and bad, starting from a labeled dataset of observed trajectories. Essentially, we want to tackle a supervised twoclass classification problem over trajectories, by learning a temporal logic discriminant, describing the temporal patterns that better separate two sets of observed trajectories.
The idea behind this approach is that there exists an unknown procedure that, given a temporal trajectory, is able to decide if the signal is good or bad. This procedure can correspond to many different things, e.g., to the reason of the failure of a sensor that breaks when it receives certain inputs. In general, as there may not be an STL specification that perfectly explains/mimics the unknown procedure, our task is to approximate it with the most effective one.
Our approach works directly with observed data, and avoids the reconstruction of an intermediate generative model of trajectories conditioned on their class , as in [BufoBSBLB14, BBS14]. The reason is that such models can be hard to construct, even if they provide a powerful regularization, as they enable the generation of an arbitrary number of samples to train the logic classifier.
In a purely datadriven setting, to build an effective classifier, we need to consider that training data is available in limited amounts and it is typically noisy. This reflects in the necessity of finding methods that guarantee good generalization performance and avoid overfitting. In our context, overfitting can result in overly complex formulae, de facto encoding the training set itself rather than guessing the underlying patterns that separate the trajectories. This can be faced by using a score function based on robustness of temporal properties, combined with suitably designed regularizing terms.
We want to stress that the approach we present here, due to the use of the average robustness of STL properties, can be easily tailored to different problems, like finding the property that best characterise a single set of observations.
4 Methodology
Learning an STL formula can be separated in two optimization problems: the learning of the formula structure and the synthesis of the formula parameters. The structural learning is treated as a discrete optimization problem using an Genetic Algorithm
(GA); the parameter synthesis, instead, considers a continuous parameter space and exploits an active learning algorithm, called
Gaussian Process Upper Confidence Bound (GPUCB). The two techniques are not used separately but combined together in a bilevel optimization. The GA acts externally by defining a set of template formulae. Then, the GPUCB algorithm, which acts at the inner level, finds the best parameter configuration such that each template formula better discriminates between the two datasets. For both optimizations, we need to define a score function to optimize, encoding the criterion to discriminate between the two datasets.Our implementation, called RObustness GEnetic (ROGE) algorithm is described in Algorithm 1. First, we give an overview of it and then we described each specific function in the next subsections. The algorithm requires as input the dataset (good) and (bad), the parameter space , with the bound of each considered variable, the size of the initial set of formulae , the number of maximum iterations , the mutation probability and the maximum initial formula size . The algorithm starts generating (line 1) an initial set of PSTL formulae, called . For each of these formulae (line 2), the algorithm learns the best parameters accordingly to a discrimination function . We call the generation for which we know the best parameters of each formula. During each iteration, the algorithm (line 4), guided by a fitness function F , extracts a subset composed by the best formulae, from the initial set . From this subset (line 5), a new set composed of formulae is created by employing the Evolve routine, which implements two genetic operators. Then (line 6), as in line 2, the algorithm identifies the best parameters of each formula belonging to . The new generation and the old generation are then merged together (line 7). From this set of formulae the algorithm extracts, with respect to the fitness function , the new generation of formulae. At the end of the iterations or when the stop criterion is true (lines 812), the algorithm returns the last generated formulae. The best formula is the one with the highest value of the discrimination function, i.e., . We describe below in order: the discrimination function algorithm, the learning of the parameters of a formula template and the learning of the formula structure.
4.1 Discrimination Function
We have two datasets and and we search for the formula that better separates these two classes. We define a function able to discriminate between this two datasets, such that maximising this discrimination function corresponds to finding the best formula classifier. In this paper, we decide to use, as evaluation of satisfaction of each formula, the quantitative semantics of STL. As described in Section 2, this semantics computes a realvalue (robustness degree) instead of only a yes/no answer.
Given a dataset , we assume that the data comes from an unknown stochastic process . The process in this case is like a blackbox for which we observe only a subset of trajectories, the dataset . We can then estimate the averages robustness
and its variance
, averaging over .To discriminate between the two dataset and , we search the formula that maximizes the difference between the average robustness of , , and the average robustness of ,
divided by the sum of the respective standard deviation:
(1) 
This formula is proportional to the probability that a new point sampled from the distribution generating or , will belong to one set and not to the other. In fact, an higher value of implies that the two average robustness will be sufficiently distant relative to their lengthscale, given by the standard deviation .
As said above, we can evaluate only a statistical approximation of because we have a limited number of samples belonging to and . We will see in the next paragraph how we tackle this problem.
4.2 GBUCP: learning the parameters of the formula
Given a formula and a parameter space , we want to find the parameter configuration that maximises the score function , considering that the evaluations of this function are noisy. So, the question is how to best estimate (and optimize) an unknown function from observations of its value at a finite set of input points. This is a classic nonlinear nonconvex optimization problem that we tackle by means of the GPUCB algorithm [gpucb]
. This algorithm interpolates the noisy observations using a stochastic process (a procedure called emulation in statistics) and optimize the emulated function using the uncertainty of the fit to determine regions where the true maximum can lie. More specifically, the GPUCB bases its emulation phase on Gaussian Processes, a Bayesian nonparametric regression approach
[Rasmussen2007], adding candidate maximum points to the training set of the GP in an iterative fashion, and terminating when no improvement is possible (see [gpucb] for more details).After this optimization, we have found a formula that separates the two datasets, not from the point of view of the satisfaction (yes/no) of the formula but from the point of view of the robustness value. In other words, there is a threshold value such that and . Now, we consider the new STL formula obtained translating the atomic predic