I Introduction
To cope with the complexity of robotic tasks, machine learning (ML) techniques have been employed to capture their temporal and logical structure from time-series data. One of the main problems in ML is the two-class classification problem, where the goal is to build a classifier that distinguishes desired system behaviors from undesired ones. Traditional ML algorithms focus on building such classifiers; however, the resulting classifiers are often hard to understand and offer little insight about the system. Motivated by the readability and interpretability of temporal logic formulas
[6], there has been great interest in applying formal methods to ML in recent years [1, 11, 25, 21, 26, 14, 27]. Signal Temporal Logic (STL) [18] is a specification language used to express temporal properties of real-valued signals. In this paper, we use STL to generate specifications of time-series system behaviors. Early methods for mining temporal properties from data mostly focus on parameter synthesis, given template formulas [1, 12, 10, 2]. These works require the designer to have a good understanding of the system properties. In addition, learning algorithms based on formula templates may not derive new knowledge from the data. In [15]
, a general supervised learning framework that can infer both the structure and the parameters of a formula from data is presented. The approach is based on lattice search and parameter synthesis, which makes it general, but inefficient. An efficient decision tree-based framework to learn STL formulas is explored in
[4, 3], where the nodes of the tree contain simple formulae that are tuned optimally from a predefined set of primitives. In [20], the authors propose a systematic enumeration-based method to learn short, interpretable STL formulas. Other works in the area of learning temporal logic formulae consider learning from positive examples only [11], clustering [25] (i.e., an unsupervised setting), active learning
[17], and using automata-based methods for untimed specifications [21, 26]. Most existing algorithms for learning STL formulas either do not achieve good classification performance for real-world applications, such as autonomous driving, or do not provide any interpretability of the output formulas: they generate long and complicated specifications. In this paper, to address these concerns, we introduce Boosted Concise Decision Trees (BCDTs) to learn STL formulas from labeled time-series data. To improve on the classification accuracy of existing works, we use a boosting method to combine multiple models with weak classification power. The weak learning models are bounded-depth decision trees, called Concise Decision Trees (CDTs). Each CDT is a Decision Tree (DT) [5]
, empowered by a set of techniques called conciseness techniques, to generate simpler formulae and improve the interpretability of the final output. We also use a heuristic method in the BCDT algorithm to prune the ensemble of trees, which further helps with the interpretability of the formulae. To relate STL and BCDTs, we establish a connection between boosted trees and weighted STL (wSTL) formulas
[19], which have weights associated with Boolean and temporal operators. We show performance gains and improved interpretability of our method compared to existing works, in naval surveillance and urban driving scenarios. The main contributions of the paper are: (a) a novel inference algorithm based on boosted decision trees, which has better classification performance than related approaches, (b) a set of heuristic techniques to generate simple STL formulae from decision trees that improve interpretability, and (c) two case studies in naval surveillance and urban driving that highlight the classification performance and interpretability of our proposed learning algorithm.
II Preliminaries
Let ℝ, ℤ, ℤ≥0 denote the sets of real, integer, and nonnegative integer numbers, respectively. With a slight abuse of notation, given a, b ∈ ℤ, a ≤ b, we use [a, b] to denote the set {a, a+1, …, b}. The cardinality of a set S is denoted by |S|. A (discrete-time) signal s is a function s : ℤ≥0 → ℝⁿ that maps each (discrete) time point t to an n-dimensional vector s(t) of real values. Each component of s is denoted by s_j, j ∈ [1, n].
Signal Temporal Logic (STL) was introduced in [18]. Informally, the STL formulas used in this paper are made of predicates defined over components of real-valued signals in the form s_j ~ c, where c ∈ ℝ is a threshold and ~ ∈ {≤, ≥}, which are connected using Boolean operators, such as ¬, ∧, ∨, and temporal operators, such as G (always) and F (eventually). The semantics are defined over signals. For example, formula G_[3,6](s_j ≤ 1) means that, for all times 3, 4, 5, 6, component s_j of a signal is less than or equal to 1. STL has both qualitative and quantitative semantics. We use s ⊨ φ to denote Boolean satisfaction. The quantitative semantics is given by a robustness degree (function) [7] ρ(s, φ), which captures the degree of satisfaction of a formula φ by a signal s. Positive robustness (ρ(s, φ) > 0) implies Boolean satisfaction s ⊨ φ, while negative robustness (ρ(s, φ) < 0) implies violation s ⊭ φ.
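To make the quantitative semantics concrete, the robustness of the always and eventually primitives above reduces to min/max sweeps over the time interval. A minimal Python sketch (function and variable names are ours, not from the paper):

```python
def rho_leq(s, t, c):
    """Robustness of the predicate s_j <= c at time t for one signal component."""
    return c - s[t]

def rho_always(s, a, b, c):
    """Robustness of G_[a,b](s_j <= c): the worst case over the interval."""
    return min(rho_leq(s, t, c) for t in range(a, b + 1))

def rho_eventually(s, a, b, c):
    """Robustness of F_[a,b](s_j <= c): the best case over the interval."""
    return max(rho_leq(s, t, c) for t in range(a, b + 1))

# G_[3,6](s_j <= 1) from the text: positive robustness means satisfaction.
x = [2.0, 1.5, 1.2, 0.4, 0.6, 0.3, 0.8, 1.4]
print(rho_always(x, 3, 6, 1.0) > 0)   # satisfied on this signal
```

The same sweep pattern extends to nested operators by recursing on the subformula's robustness instead of the predicate.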
Weighted STL (wSTL) [19] is an extension of STL that has the same qualitative semantics as STL, but has weights associated with the Boolean and temporal operators, which modulate its robustness degree. In this paper, we restrict our attention to a fragment of wSTL with weights on conjunctions only. For example, the wSTL formula φ1 ∧^w φ2, with w = (w1, w2), denotes that φ1 and φ2 must hold with priorities w1 and w2. The priorities capture the satisfaction importance of their corresponding formulas.
Parametric STL (PSTL) [1] is an extension of STL, where the endpoints of the time intervals in the temporal operators and the thresholds in the predicates are parameters. The set of all possible valuations of all parameters in a PSTL formula φ is called the parameter space and is denoted by Θ. A particular valuation is denoted by θ ∈ Θ and the corresponding formula by φ_θ.
III Problem Formulation
III-A Motivating Example
Consider the maritime surveillance scenario from [15, 3] (see Fig. 1). The goal is to detect anomalous vessel behaviors by looking at their trajectories. A vessel behaving normally approaches from the open sea and heads directly towards the harbor, while a vessel behaving anomalously either veers towards the island and then heads to the harbor, or approaches other vessels in the passage between the peninsula and the island and then returns to the open sea.
In the scenario's dataset [3], the signals are represented as 2-dimensional trajectories with planar coordinates (x, y). The labels indicate the type of a vessel's behavior (normal or anomalous). In Fig. 1, we show the x and y components of some signals, respectively, over time. For better visualization, we show the signals over a part of their time horizon. In Fig. 1, one of the areas that distinguishes between positive and negative signals is the area between lines and , over the time interval . By using the restricted STL from [3], formula can be used to describe this area and distinguish between positive and negative signals. This formula can be simplified to .
Similarly, in Fig. 1, we can describe the separation area between lines and by the STL formula using the restricted STL from [3], which can be simplified to . Considering the common time interval between the separation areas in Fig. 1 and Fig. 1, we can combine and into a shorter, easier to read formula , where and
As shown next, we use such formulas, which are simpler than the ones in [3], to classify signals without losing classification accuracy.
III-B Problem Statement
Let C = {+1, −1} be the set of possible (positive and negative) classes. We consider a labeled data set with N data samples given as {(s^i, ℓ^i)}, i ∈ [1, N], where s^i is the i-th signal and ℓ^i ∈ C is its label.
Problem 1
Given a labeled data set {(s^i, ℓ^i)}, i ∈ [1, N], find an STL formula φ that minimizes the Misclassification Rate (MCR) defined below:

MCR(φ) = |{s^i : (s^i ⊨ φ ∧ ℓ^i = −1) ∨ (s^i ⊭ φ ∧ ℓ^i = +1)}| / N    (1)
IV Solution
We propose a solution to Pb. 1 based on BCDTs (Alg. 1). Our algorithm grows multiple binary CDTs based on AdaBoost [9], which combines weak classifiers with simple formulae, trained on weighted data samples. The weights of the data samples represent the difficulty of correct classification. After training a weak classifier, the weights of correctly classified samples are decreased and the weights of misclassified samples are increased. In Sec. IV-A and IV-B, we introduce the construction methods for BCDTs and a single CDT, respectively. We describe the methods' meta-parameters in Sec. IV-C, while in Sec. IV-D we explain the conciseness techniques and their connection with interpretability. In Sec. IV-E, we describe the translation of BCDTs to STL formulas.
IV-A Boosted Concise Decision Trees Algorithm
The BCDT algorithm in Alg. 1 is based on the AdaBoost method [24]. The algorithm takes as input the labeled data set, the number of learners (trees) K, and the weak learning model, which is the algorithm used to construct CDTs (explained in Alg. 3). The CDTs are binary decision trees, where the formulas of the nodes are primitives (see Sec. IV-C) with general rectangular predicates of the form , with , , as the identity matrix, and .
In Alg. 1, initially all data samples are weighted equally (line 3). The algorithm iterates over the number of trees (line 4). At each iteration, the weak learning algorithm constructs a single CDT based on the data set and the current samples' weights (line 5). Next, the misclassification error of the constructed tree is computed (line 6). If the current tree has weak classification performance but is better than random guessing, its weight is computed based on the original AdaBoost method; if it has perfect classification performance, i.e., it classifies all signals correctly, a large value is assigned to its weight (line 7). At the end of each iteration, the samples' weights are updated and normalized based on the performance of the current tree (line 8). To compute the final output of the algorithm, we use a heuristic method to prune the ensemble of trees, to generate simpler formulae and improve interpretability. Inspired by heuristic methods for pruning ensembles of decision trees in [5, 16], we compute the final output as follows (line 9): if no tree classifies all signals correctly, the final output is computed as the weighted majority vote over all the CDTs (as in the AdaBoost method); otherwise, if there are one or more perfect trees, the final output is computed by the perfect tree that has the simplest STL formula. As a metric to compare the simplicity of formulas, the number of Boolean and temporal operators is used. This pruning method helps with reducing the generalization error in the test phase and with generating simpler formulas. We show its advantages with empirical results in Sec. V.
The final output assigns a label to each data sample. For simplicity, we abuse notation and consider and , such that for all . Note that one of the main assumptions in boosting methods is that each weak learner performs slightly better than random guessing (i.e., coin tossing). Therefore, in Alg. 1, if any newly generated tree performs worse than random guessing, we discard it and generate another tree. An illustration of Alg. 1 is shown in Fig. 3.
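The boosting loop of Alg. 1 follows the standard AdaBoost pattern. A sketch under the assumption that the CDT learner is abstracted as a `learn` callback returning a ±1 classifier; the perfect-tree weight and the discard-if-worse-than-random rule are handled as described above, and all names are ours, not from the paper:

```python
import math

def boost(X, y, learn, K):
    """AdaBoost-style loop of Alg. 1 (sketch). X: samples, y: labels in
    {+1, -1}; learn(X, y, w) returns a classifier h(x) -> {+1, -1}."""
    n = len(X)
    w = [1.0 / n] * n                       # uniform initial sample weights
    ensemble = []
    for _ in range(K):
        h = learn(X, y, w)
        err = sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi)
        if err >= 0.5:
            continue                        # worse than random guessing: discard
        # a perfect tree gets a large fixed weight instead of an infinite one
        alpha = 10.0 if err == 0 else 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, h))
        # increase the weights of misclassified samples, then normalize
        w = [wi * math.exp(-alpha * yi * h(xi)) for wi, xi, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]

    def predict(x):
        vote = sum(a * h(x) for a, h in ensemble)
        return 1 if vote >= 0 else -1

    return predict
```

The `predict` closure implements the weighted majority vote; the simplest-perfect-tree pruning rule of line 9 would replace it whenever some `alpha` hits the large fixed value.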
IV-B Construction of Concise Decision Tree
Decision Trees (DTs) [5, 22] are sequential decision models with hierarchical structures. In our algorithm, DTs operate on signals with the goal of predicting their labels. Inspired by [3], we present the Concise Decision Tree (CDT) method in Alg. 3, which extends the DT construction algorithm by applying conciseness techniques (see Sec. IV-D). To limit the complexity of CDTs, we consider three meta-parameters in Alg. 3: (1) PSTL primitives, capturing the possible ways to split the data at each node, (2) impurity measures, to select the best primitive at each node, and (3) stop conditions, to limit the CDTs' growth. The meta-parameters are explained in Sec. IV-C.
To explain Alg. 3, we first introduce the parameterized primitive optimization method presented in Alg. 2. This method has the same meta-parameters as Alg. 3 and takes as input (1) the set of labeled signals at the current node, (2) the path formula from the root to the current node, (3) a set of input primitives, and (4) the depth from the root to the node. In line 4, if the stop conditions are satisfied, a label is computed according to the best classification quality using the impurity measure defined in Sec. IV-C2 (the positive and negative labels are identified with +1 and −1, respectively). Otherwise, the best primitive from the input primitive set is computed based on the impurity measure in Sec. IV-C2. We use Alg. 2 within Alg. 3 to find the best primitive, with an optimal valuation, at each node from the input primitive set.
Alg. 3 is recursive, and takes as input (1) the set of labeled signals at the current node, referred to as the parent node, (2) the path formula from the root to the parent node, (3) the depth from the root to the node, and (4) the candidate formula for the node. The construction of each CDT starts with .
At the start of Alg. 3, the stop conditions are checked (line 4). If they are satisfied, a single leaf is returned, marked with a label according to Alg. 2 (lines 5-6). Otherwise, a non-terminal node is created that is associated with the candidate formula (line 7). The path formula from the root is updated with the candidate primitive of the parent node (line 8). Next, the data set is partitioned according to the new formula (line 9) into the sets of signals that satisfy and violate it, respectively.
Following the structure of the tree, first for the left child of the node and then for the right child, we perform the following steps (line 10): first, the candidate primitive for the child is computed based on Alg. 2 (line 11). Then, by applying the conciseness method (explained in Sec. IV-D) to the combination of the parent's candidate formula and the child's candidate primitive, we find a new formula (line 12) as a new candidate for the parent node. If the impurity measure of the new candidate formula is better than that of the previous candidate (line 13), the algorithm is repeated for the current node with the new candidate (line 14). The decision tree method in [3] is based on incremental impurity reduction at each node of the tree. Following the same idea, we argue that applying the conciseness techniques at each node, and accepting a new candidate only when it improves the impurity reduction, leads to a stronger classifier with a simpler specification. Finally, we continue the construction of the tree for the left and right children (lines 15-16) and the subtree for the parent is returned (line 17).
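The recursion of Alg. 3 can be sketched compactly, leaving out the conciseness step of line 12 (described in Sec. IV-D) and abstracting Alg. 2 as a `best_primitive` routine that returns a boolean test on a signal together with its impurity reduction. All names are ours, not from the paper:

```python
# Sketch of the recursive CDT construction (conciseness step omitted);
# data is a list of (sample, label) pairs with labels in {+1, -1}.

def build_tree(data, depth, max_depth, best_primitive):
    labels = [y for _, y in data]
    majority = 1 if sum(labels) >= 0 else -1
    # stop conditions: depth bound reached, or the node is pure
    if depth >= max_depth or abs(sum(labels)) == len(labels):
        return ("leaf", majority)
    test, gain = best_primitive(data)
    if gain <= 0:
        return ("leaf", majority)
    left = [(x, y) for x, y in data if test(x)]       # satisfying signals
    right = [(x, y) for x, y in data if not test(x)]  # violating signals
    if not left or not right:
        return ("leaf", majority)
    return ("node", test,
            build_tree(left, depth + 1, max_depth, best_primitive),
            build_tree(right, depth + 1, max_depth, best_primitive))

def classify(tree, x):
    """Follow the tests from the root to a leaf and return its label."""
    while tree[0] == "node":
        tree = tree[2] if tree[1](x) else tree[3]
    return tree[1]
```

In the full algorithm, `best_primitive` would also receive the path formula and the boosting weights, and the parent's candidate would be revised whenever a combined primitive improves the impurity reduction.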
IV-C Meta-Parameters
IV-C1 PSTL Primitives
The splitting rules at each node are simple PSTL formulas, called primitives [3]. Here we use first-order primitives of the form G_[a,b](s_j ~ c) or F_[a,b](s_j ~ c), with ~ ∈ {≤, ≥}, where the decision parameters are the time bounds a, b and the threshold c.
IV-C2 Impurity Measure
We use the Misclassification Gain (MG) impurity measure [5] as a criterion to select the best primitive at each node. Given a finite set of signals S, an STL formula φ, and the subsets S_⊤, S_⊥ of S obtained by partitioning S according to satisfaction of φ, the gain is the impurity of S minus the weighted impurities of S_⊤ and S_⊥, where the partition weights are computed based on the signals' labels and their satisfaction of φ. Here, we extend the robustness-based impurity measures of [3] to account for the sample weights from the BCDT in Alg. 1. The boosted impurity measures are defined by the partition weights
(2)  
This formulation also works for other types of impurity measures, such as information and Gini gains [23].
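A sketch of the weighted misclassification-gain criterion, under the simplifying assumption (ours, standing in for the partition weights of Eq. (2)) that each side of a split is weighted by the total boosting weight of the samples it contains:

```python
# Weighted misclassification gain: higher gain means a better split.

def mcr_impurity(pairs):
    """Misclassification impurity of a set of (weight, label) pairs."""
    wp = sum(w for w, y in pairs if y == 1)
    wn = sum(w for w, y in pairs if y == -1)
    total = wp + wn
    return min(wp, wn) / total if total > 0 else 0.0

def mg(pairs, test, samples):
    """Gain of splitting the weighted set `pairs` by `test` on `samples`."""
    left = [(w, y) for (w, y), x in zip(pairs, samples) if test(x)]
    right = [(w, y) for (w, y), x in zip(pairs, samples) if not test(x)]
    total = sum(w for w, _ in pairs)
    return (mcr_impurity(pairs)
            - sum(w for w, _ in left) / total * mcr_impurity(left)
            - sum(w for w, _ in right) / total * mcr_impurity(right))
```

Swapping `mcr_impurity` for an entropy- or Gini-based impurity gives the information and Gini gains mentioned above without changing `mg`.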
IV-C3 Stop Conditions
There are multiple stopping conditions that can be considered for terminating Alg. 3. We stop the growth of trees either when they reach a given depth, or when the majority of the signals belong to the same class.
IV-D Conciseness
We propose the conciseness method, presented in Alg. 4, to improve the simplicity and interpretability of STL formulas. This algorithm takes as input the candidate primitive for the parent node, the candidate primitive for its child (either the left or the right child), the set of signals, the path formula, and the depth at the parent node. The output of the algorithm is a new candidate primitive for the parent node.
First, the method constructs a new PSTL primitive for the parent node by combining the candidate primitives of the parent and the child nodes (line 3). This is done by considering the possible ways to combine two candidate primitives, which are explained below. Then, the optimal valuation of the new PSTL primitive is computed by applying the optimization method in Alg. 2 (line 4).
Next, we present heuristic techniques to combine two primitives and generate shorter PSTL formulae:
IV-D1 Combination of Always Operators
If the candidate primitive of the parent node is an always primitive G_[a1,b1](s_j ~ c1) and the candidate primitive of its child is an always primitive G_[a2,b2](s_j ~ c2) over the same signal component, we construct a new parametric always primitive G_[a,b](s_j ~ c) for their combination, whose parameters are then optimized by Alg. 2.
IV-D2 Combination of Eventually Operators
Similar to the combination of always operators, if the candidate primitive of the parent node is an eventually primitive F_[a1,b1](s_j ~ c1) and the candidate primitive of its child is F_[a2,b2](s_j ~ c2) over the same signal component, we construct a new parametric eventually primitive F_[a,b](s_j ~ c).
Remark: In this paper we consider the techniques mentioned above to generate shorter formulas. However, there are other ways to combine the primitives and improve interpretability and expressivity of formulas. For example, given the candidate primitive of the parent node and the candidate primitive of its child , we can construct a new PSTL primitive . We will investigate other ways of combining primitives in future work.
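As one concrete and sound special case of the always-combination technique: two always primitives with the same predicate whose discrete intervals overlap or are adjacent can be merged outright, since G_[a1,b1]φ ∧ G_[a2,b2]φ is equivalent to a single always over the union when that union is one interval. The general method instead re-optimizes a fresh parametric primitive via Alg. 2; this sketch, with primitives encoded as tuples (a, b, j, c) for G_[a,b](s_j ≤ c), is our own illustration:

```python
def combine_always(p, q):
    """Merge G_[a1,b1](s_j <= c) and G_[a2,b2](s_j <= c) into one always
    primitive when the discrete intervals overlap or are adjacent; return
    None (keep the conjunction) when this special case does not apply."""
    (a1, b1, j1, c1), (a2, b2, j2, c2) = p, q
    same_predicate = (j1 == j2 and c1 == c2)
    contiguous = max(a1, a2) <= min(b1, b2) + 1   # union is a single interval
    if same_predicate and contiguous:
        return (min(a1, a2), max(b1, b2), j1, c1)
    return None
```

Note that the analogous merge for eventually primitives is sound for disjunctions rather than conjunctions, which is why the general method re-optimizes the combined primitive's parameters instead of relying on such identities.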
IV-E Decision Trees to Formulas
We use the method from [3] to convert a CDT to an STL formula. The algorithm is invoked from the root, and builds the formula that captures all the branches that end in leaves marked with the positive label. The BCDT method returns a set of formulas and associated weights. The STL formula obtained from the ensemble is the overall output formula; using wSTL [19], we express it together with the trees' weights (see Fig. 3).
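The translation can be sketched as a traversal that collects every root-to-leaf path ending in a positive leaf, conjoining the node formulas (negated on the violating branch) and disjoining the paths. A minimal sketch, assuming trees encoded as nested tuples ("leaf", label) and ("node", phi, left, right) with formulas as plain strings (an encoding of ours, not from [3]):

```python
def positive_paths(tree, path=()):
    """Collect the conjunctions along root-to-leaf paths ending in label +1."""
    if tree[0] == "leaf":
        return [" & ".join(path)] if tree[1] == 1 and path else []
    _, phi, left, right = tree
    return (positive_paths(left, path + (phi,))                   # satisfying branch
            + positive_paths(right, path + ("!(" + phi + ")",)))  # violating branch

def tree_to_stl(tree):
    """Disjunction over all positive branches of the tree."""
    return " | ".join("(" + p + ")" for p in positive_paths(tree))
```

A tree whose root is itself a positive leaf accepts everything; a full implementation would return a trivially true formula in that corner case.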
V Case Studies
We demonstrate the effectiveness and computational advantages of our method with two case studies. The first is the naval surveillance scenario from Sec. III-A. The second is an urban-driving scenario, implemented in the CARLA simulator [8]. We use the Particle Swarm Optimization (PSO) method [13] for solving the optimization problems in Alg. 2. The parameters of the PSO method are tuned empirically. We use in our implementations. We run the case studies on a GHz processor with GB RAM.
V-A Naval Surveillance
We compare our inference algorithm with the methods from [3] (the DTL4STL tool) and [20]. The dataset is composed of 2000 signals, with 1000 normal and 1000 anomalous trajectories. Each signal has 61 time points. See Fig. 4 for some example trajectories. We test our algorithm with 5-fold cross-validation and a maximum tree depth of 3 (as in [3]). The results are provided in Table I for different values of K in Alg. 1; TRM and TRS are the mean and standard deviation of the MCR in the training phase, respectively; TEM and TES are the mean and standard deviation of the MCR in the test phase; R is the runtime; and m is the number of times a simpler formula is found by applying the conciseness method during the construction of the CDTs. With K = 3, we find a set of concise trees that classify all signals correctly in the test phase. As an example from one of the folds, the learned wSTL formula is . By applying the heuristic method explained in Alg. 1, the final output of the BCDT algorithm is computed as , where and .
K | TRM (%) | TRS (%) | TEM (%) | TES (%) | R | m
1 | 0.36 | 0.35 | 0.95 | 0.97 | 11m 8s | 4
2 | 0.34 | 0.21 | 0.55 | 0.33 | 30m 47s | 14
3 | 0.01 | 0.02 | 0.0 | 0.0 | 33m 16s | 10
4 | 0.05 | 0.1 | 0.1 | 0.12 | 61m 33s | 29
In [3], using first-order primitives and a maximum tree depth of 3, the authors report an MCR with mean 1.3 and standard deviation 0.28 for this data set. To provide a fair comparison, we ran the algorithm from [3] on the same computer and the same data set used for our algorithm. We obtained an MCR with mean and standard deviation in the test phase, with a total runtime of 33 seconds. An example formula learned in one of the folds using the method from [3] is:
Compared to the method from [3], our algorithm obtains better classification performance, in addition to simpler and more interpretable formulas. In [20], the authors obtain an MCR with mean in the test phase and a total runtime of 45 minutes, and the formula learned in their work is . From the interpretability viewpoint, both the formulas learned by our algorithm and by [20] are simple and easy to interpret, but our algorithm has better classification performance.
V-B Urban Driving
Consider an autonomous vehicle (referred to as ego) driving in an urban environment, shown in Fig. 5. The scenario also contains a pedestrian and another car, which is assumed to be driven by a "reasonable" human who obeys traffic laws. Ego and the other car are in different, adjacent lanes, moving in the same direction. The cars move uphill in the plane of the coordinate frame, towards positive and directions, with no lateral movement in the direction. The accelerations of both cars are constant, and smaller for ego.
The positions and accelerations of the cars are initialized such that the other car is always ahead of ego. The vehicles are headed towards an intersection without any traffic lights. There is an unmarked crosswalk at the end of the road, before the intersection. When the pedestrian crosses the street, the other car brakes to stop before the intersection. If the pedestrian does not cross, the other car keeps moving without decreasing its velocity.
Ego does not have a clear line-of-sight to the pedestrian crossing at the intersection, because of the other car and the uphill shape of the road. The goal is to develop a method allowing ego to infer whether a pedestrian is crossing the street by observing the behavior (e.g., relative position and velocity over time) of the other car.
The simulation of this scenario ends whenever ego gets closer than 8 to the intersection. We assume that labeled behaviors (relative distances and velocities) are available, where the labels indicate whether a pedestrian is crossing or not. We collected 300 signals with 500 uniform time samples per trace, where 150 were with and 150 without pedestrians crossing the street (see Fig. 6).
We evaluate our algorithm with 5-fold cross-validation and a maximum tree depth of 2. The results are shown in Table II for different values of K.
K | TRM (%) | TRS (%) | TEM (%) | TES (%) | R | m
1 | 0.0 | 0.0 | 1 | 1.33 | 7m 10s | 2
2 | 0.0 | 0.0 | 0.67 | 0.82 | 9m 57s | 2
3 | 0.0 | 0.0 | 0.33 | 0.66 | 14m 52s | 1
4 | 0.0 | 0.0 | 0.0 | 0.0 | 24m 40s | 3
As an example formula in one of the folds, our algorithm learns the wSTL formula . By applying the heuristic method from Alg. 1, the final output is computed as . The thresholds of the formula are shown in Fig. 6.
To provide a fair comparison, we evaluate the performance of the algorithm from [3] on the same data set and the same computer used for our algorithm. With first-order primitives, 5-fold cross-validation, and a maximum tree depth of 2, we obtained a mean MCR of 1 with standard deviation 1.5 in the test phase, with a total runtime of 7.72 seconds. An example formula learned in one of the folds using the method from [3] is . The results show that, with our algorithm, we get simpler formulas and better classification performance than with the algorithm from [3].
VI Conclusion
In this paper, we propose a method for two-class classification of time-series data. The algorithm, called Boosted Concise Decision Trees (BCDT), grows an ensemble of Concise Decision Trees (CDTs), which are decision trees empowered with conciseness techniques that improve the interpretability of the formulas. We show that boosting helps improve the misclassification performance. The classification and interpretability advantages of our algorithm are evaluated on naval surveillance and urban-driving case studies. We also compare our method with two recent algorithms from the literature.
References
[1] (2011) Parametric identification of temporal properties. In International Conference on Runtime Verification, pp. 147–160.
[2] (2018) Efficient parametric identification for STL. In Proceedings of the 21st International Conference on Hybrid Systems: Computation and Control (part of CPS Week), pp. 177–186.
[3] (2021) Offline and online learning of signal temporal logic formulae using decision trees. ACM Transactions on Cyber-Physical Systems 5(3), pp. 1–23.
[4] (2016) A decision tree approach to data classification using signal temporal logic. In Hybrid Systems: Computation and Control, pp. 1–10.
[5] (1984) Classification and Regression Trees. CRC Press.
[6] (1986) Automatic verification of finite-state concurrent systems using temporal logic specifications. ACM Transactions on Programming Languages and Systems (TOPLAS) 8(2), pp. 244–263.
[7] (2010) Robust satisfaction of temporal logic over real-valued signals. In International Conference on Formal Modeling and Analysis of Timed Systems, pp. 92–106.
[8] (2017) CARLA: an open urban driving simulator. arXiv preprint arXiv:1711.03938.
[9] (1997) A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1), pp. 119–139.
[10] (2018) Mining parametric temporal logic properties in model-based design for cyber-physical systems. International Journal on Software Tools for Technology Transfer 20(1), pp. 79–93.
[11] (2019) TeLEx: learning signal temporal logic from positive examples using tightness metric. Formal Methods in System Design 54(3), pp. 364–387.
[12] (2015) Mining requirements from closed-loop control models. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 34(11), pp. 1704–1717.
[13] (1995) Particle swarm optimization. In International Conference on Neural Networks, Vol. 4, pp. 1942–1948.
[14] (2019) Synthesis of monitoring rules via data mining. In American Control Conference, pp. 1684–1689.
[15] (2016) Temporal logics for learning and detection of anomalous behavior. IEEE Transactions on Automatic Control 62(3), pp. 1210–1222.
[16] (2012) Pruning of random forest classifiers: a survey and future directions. In 2012 International Conference on Data Science & Engineering (ICDSE), pp. 64–68.
[17] (2020) Active learning of signal temporal logic specifications. In 2020 IEEE 16th International Conference on Automation Science and Engineering (CASE), pp. 779–785.
[18] (2004) Monitoring temporal properties of continuous signals. In Formal Techniques, Modelling and Analysis of Timed and Fault-Tolerant Systems, pp. 152–166.
[19] (2020) Specifying user preferences using weighted signal temporal logic. IEEE Control Systems Letters.
[20] (2020) Interpretable classification of time-series data using efficient enumerative techniques. In Proceedings of the 23rd International Conference on Hybrid Systems: Computation and Control, pp. 1–10.
[21] (2018) Learning linear temporal properties. In Formal Methods in Computer Aided Design, pp. 1–10.
[22] (2007) Pattern Recognition and Neural Networks. Cambridge University Press.
[23] (2005) Top-down induction of decision trees classifiers: a survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C 35(4), pp. 476–487.
[24] (2014) Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
[25] (2017) Logical clustering and learning for time-series data. In International Conference on Computer Aided Verification, pp. 305–325.
[26] (2019) Information-guided temporal logic inference with prior knowledge. In American Control Conference, pp. 1891–1897.
[27] (2021) Neural network for weighted signal temporal logic. arXiv preprint arXiv:2104.05435.