1 Introduction
Automatic anomaly detection is a major issue in numerous areas and has generated a vast scientific literature [chandola2009anomaly]. We focus in this paper on a very important application case, aircraft engine health monitoring which aims at detecting early signs of failures, among other applications [asheraryan2013surveyhealth, tumer1999survey]. Aircraft engines are generally made extremely reliable by their conception process and thus have low rate of operational events. For example, in 2013, the CFM567B engine, produced jointly by Snecma and GE aviation, has a rate of in flight shut down (IFSD) of (per 1000 Engine Flight Hour) and a rate of aborted takeoff (ATO) of (per 1000 departures). This dispatch availability of nearly 100 % (99.962 % in 2013) is obtained via regular maintenance operations but also via engine health monitoring (see also e.g. [vasov2007reliability] for an external evaluation).
This monitoring is based, among other sources, on data transmitted by satellites^{2}^{2}2using the commercial standard Aircraft Communications Addressing and Reporting System (ACARS, see http://en.wikipedia.org/wiki/ACARS), for instance. between aircraft and ground stations. Typical transmitted messages include engine status overview as well as useful measurements collected as specific instants (e.g., during engine start). Flight after flight, measurements sent are analyzed in order to detect anomalies that are early signs of degradation. Potential anomalies can be automatically detected by algorithms designed by experts. If an anomaly is confirmed by a human operator, a maintenance recommendation is sent to the company operating the engine.
As a consequence, unscheduled inspections of the engine are sometimes required. These inspections are due to the abnormal measurements. Missing a detection of early signs of degradation can result in an IFSD, an ATO or a delay and cancellation (D&C). Despite the rarity of such events, companies need to avoid them to minimize unexpected expenses and customers’ disturbance. Even in cases where an unscheduled inspection does not prevent the availability of the aircraft, it has an attached cost: it is therefore important to avoid as much as possible useless inspections.
We describe in this paper a general methodology to built complex automated decision support algorithms in a way that is comprehensible by human operators who take final decisions. The main idea of our approach is to leverage expert knowledge in order to build hundreds of simple binary indicators that are all signs of the possible existence of an early sign of anomaly in engine health monitoring data. The most discriminative indicators are selected by a standard forward feature selection algorithm. Then an automatic classifier is built on those features. While the classifier decision is taken using a complex decision rule, the interpretability of the features, their expert based nature and their limited number allows the human operator to at least partially understand how the decision is made. It is a requirement to have a trustworthy decision for the operator. It should be noted that while this paper focuses on aircraft engines, the methodology can be applied to various other contexts. For instance a related but simpler methodology was proposed in [hegedus2011methodology] in the context of malware detection.
2 Application context
2.1 Flight data
Engine health monitoring is based in part on flight data acquisition. Engines are equipped with multiple sensors which measure different physical quantities such as the high pressure core speed (N2), the Fuel Metering Valve (FMV), the Exhausted Gas Temperature (EGT), etc. (See Figure 1
.) Those measures are monitored in real time during the flight. For instance the quantities mentioned before (N2, FMV, etc.) are analyzed, among others, during the engine starting sequence. This allows one to check the good health of the engine. If a potential anomaly is detected, a diagnostic is sent to an operator of the owner of the engine. The airline may then have to postpone the flight or cancel it, depending on the criticality of the fault and the estimated repair time.
The monitoring can also be done flight after flight to detect any change that can be flagged as early signs of degraded behavior. This is done by compressing the in flight measurements into engine status overviews. The methodology introduced in this article is mostly designed for this latter kind of monitoring.
2.2 Detecting faults and abnormal behaviors
Traditional engine health monitoring is strongly based on expert knowledge and field experience (see e.g. [asheraryan2013surveyhealth, tumer1999survey] for surveys and [flandrois2009expertise] for a concrete example). Faults and early signs of faults are identified from suitable measurements associated to adapted computational transformation of the data. For instance, the different measurements (temperatures, vibration, etc.) are influenced by the flight parameters (e.g. throttle position) and conditions (outside temperature, etc.). Variations in the measured values can therefore result from variations in the parameters and conditions rather than being due to abnormal behavior. Thus a typical computational transformation consists in preprocessing the measurements in order to remove dependency to the flight context [lacaille2009standardized].
In practice, the choice of measurements and computational transformations is generally done based on expert knowledge. For instance in [rabenoroinstants]
, a software is designed to record expert decision about a time interval on which to monitor the evolution of such a measurement (or a time instant when such a measurement should be recorded). Based on the recorded examples, the software calibrates a pattern recognition model that can automatically reproduce the time segmentation done by the expert. Once the indicators have been computed, the normal behavior of the indicators can be learned. The residuals between predictions and actual indicators can be statistically modeled as a Gaussian vector, for instance. A score measurement is obtained from the likelihood of this distribution. The normalized vector is a failure score signature that may be described easily by experts to identify the fault origin, in particular because the original indicators have some meaning for them. See
[come2010aircraft], [flandrois2009expertise] and [lacaille2009maturation] for other examples.However experts are generally specialized on particular subsystems, thus each algorithm focuses mainly on a specific subsystem despite the need of a diagnostic of the whole system.
2.3 Data and detection fusion
The global diagnostic is currently done by the operator who collects all available results of diagnostic applications. The task of taking a decision based on all incoming information originating from different subsystems is difficult. A first difficulty comes from dependencies between subsystems which means that for instance in some situations, a global early sign of failure could be detected by discovering discrepancies between seemingly perfectly normal subsystems. In addition, subsystem algorithms can provide conflicting results or decisions with very low confidence levels. Furthermore, the extreme reliabilities of engines lead to an exacerbated trade off between false alarm levels and detection levels, leading in general to a rather high level of false alarms, at least at the operator level. Finally, the role of the operator is not only to identify a possible early sign of failure, but also to issue recommendations on the type of preventive maintenance needed. In other words, the operator needs to identify the possible cause of the potential failure.
2.4 Objectives
The long term goal of engine health monitoring is to reach automated accurate, trustworthy and precise maintenance decisions during optimally scheduled shop visits, but also to drastically reduce operational events such as IFSD and ATO. However, partly because of the current industrial standard, pure black box modeling is unacceptable. Indeed, operators are currently trained to understand expertly designed indicators and to take complex integrated decisions on their own. In order for a new methodology to be accepted by operators, it has at least to be of a gray box nature, that is to be (partially) explainable via logical and/or probabilistic reasoning. Then, our objective is to design a monitoring methodology that helps the human operator by proposing integrated decisions based on expertly designed indicators with a “proof of decision”.
3 Architecture of the Decision Process
3.1 Engine health monitoring data
In order to present the proposed methodology, we first describe the data obtained via engine health monitoring and the associated decision problem.
We focus here on ground based long term engine health monitoring. Each flight produces dozens of timestamped flight events and data. Concatenating those data produces a multivariate temporal description of an engine whose dimensions are heterogeneous. In addition, sampling rates of individual dimensions might be different, depending on the sensors, the number of critical time points recorded in a flight for said sensor, etc.
Based on expert knowledge, this complex set of time series is turned into a very high dimensional indicator vector. The main idea, outlined in the previous section, is that experts generally know what is the expected behavior of a subsystem of the engine during each phase of the flight. Then the dissimilarity between the expected behavior and the observed one can be quantified leading to one (or several) anomaly scores. Such scores are in turn transformed into binary indicators where means an anomaly is detected and means no anomaly detected. This is somewhat related to the way malware are characterized in [hegedus2011methodology], but with a more direct interpretation of each binary feature^{3}^{3}3In [hegedus2011methodology], a value for a feature simply means that the software under study has a particular quality associated to the feature, without knowing whether this quality is an indication of its malignity..
This transformation has two major advantages: it homogenizes the data and it introduces simple but informative features (each indicator is associated to a precise interpretation related to expert knowledge). It leads also to a loss of information as the raw data are in general not recoverable from the indicators. This is considered here a minor inconvenience as long as the indicators capture all possible failure modes. This will be partially guaranteed by including numerous variants of each indicator (as explained below). On a longer term, our approach has to be coupled with field experience feedback and expert validation of its coverage.
After the expert guided transformation, the monitoring problem becomes a rather standard classification problem: based on the binary indicators, the decision algorithm has to decide whether there is an anomaly in the engine and if, this is the case, to identify the type of the anomaly (for instance by identifying the subsystem responsible for the potential problem).
We describe now in more details the construction of the binary indicators.
3.2 Some types of anomalies
Some typical univariate early signs of anomalies are shown on Figures 4, 4 and 4
which display the evolution through time of a numerical value extracted from real world data. One can identify, with some practice, a variance shift on Figure
4, a mean shift on Figure 4 and a trend modification (change of slope) on Figure 4. In the three cases, the change instant is roughly at the center of the time window.The main assumption used by experts in typical situations is that, when external sources of change have been accounted for, the residual signal should be stationary in a statistical sense. Assuming that a signal of length
is made of
values generated independently from a fixed parametric probability model with a parameter
. In this framework, the signal is stationary if the parameters are fixed, that is, if all the are identical. Then, detecting an anomaly amounts to detecting a change in the time series (as illustrated by the three Figures above). This can be done via numerous techniques [chandola2009anomaly] in particular with well known statistical tests [basseville1993detection]. In the multivariate cases, similar shifts in the signal can be associated to anomalies. More complex scenarios, involving for instance time delays, can also be designed by experts.3.3 Exploring the parameter space
While experts can generally describe explicitly what type of change they are expecting for some specific early signs of anomaly, they can seldom provide detailed parameter settings for statistical tests (or even for the aggregation technique that could lead to a statistical test after complex calculations). To maximize coverage it seems natural to include numerous indicators based on variations of the anomaly detectors compatible with expert knowledge.
Let us consider for illustration purpose that the expert recommends to look for shifts in mean of a certain quantity as early signs of a specific anomaly. If the expert believes the quantity to be normally distributed with a fixed variance, then a natural test would be Student’s ttest. If the expert has no strong priors on the distribution, a natural test would be the Mann–Whitney U test. Both can be included to maximize coverage.
Then, in both cases, one has to assess the time scale of the shift. Indeed those tests work by comparing summary statistics of two populations, before and after a possible change point. To define the populations, the expert has to specify the length of the time windows to consider before and after the possible change point: this is the expected time scale at which the shift will appear. In most cases, the experts can only give a rough idea of the scale. Again, maximizing the coverage leads to the inclusion of several scales compatible with the experts’ recommendations.
Given the choice of the test, of its scale and of a change point, one can construct a statistic. A possible choice for the indicator could be the value of the statistic or the associated value. However, we choose to use simpler indicators to ease their interpretation. Indeed, the raw value of a statistic is generally difficult to interpret. A
value is easier to understand because of the uniform scale, but can still lead to misinterpretation by operators with insufficient statistical training. We therefore choose to use binary indicators for which the value 1 corresponds to a rejection of the null hypothesis of the underlying test to a given level (the null hypothesis is here the case with no mean shift).
3.4 Confirmatory indicators
Finally, as pointed out before, aircraft engines are extremely reliable, a fact that increases the difficulty in balancing sensibility and specificity of anomaly detectors. In order to alleviate this difficulty, we build high level indicators from low level tests. For instance, if we monitor the evolution of a quantity on a long period compared to the expected time scale of anomalies, we can compare the number of times the null hypothesis of a test has been rejected on the long period with the number of times it was not rejected, and turn this into a binary indicator with a majority rule.
3.5 Decision
To summarize, we construct parametric anomaly scores from expert knowledge, together with acceptable parameter ranges. By exploring those ranges, we generate numerous (possible hundreds of) binary indicators. Each indicator can be linked to an expertly designed score with a specific set of parameters and thus is supposedly easy to interpret by operators. Notice that while we are focused in this presentation on temporal data, this framework can be applied to any data source.
The final decision step consists in classifying these high dimensional binary vectors into at least two classes, i.e., the presence or absence of an anomaly. A classification into more classes is highly desirable if possible, for instance to further discriminate between seriousness of anomalies and/or sources (in terms of subsystems of the engine).
As explained before, we aim in the long term at gray box modeling, so while numerous classification algorithms are available see e.g. [kotsiantis2007supervised]
, we shall focus on interpretable ones. In this paper, we choose to use Random Forests
[breiman2001random]as they are very adapted to binary indicators and to high dimensional data. They are also known to be robust and to provide stateoftheart classification performances at a very small computational cost. While they are not as interpretable as their ancestors CART
[breiman1984classification], they provide at least variable importance measures that can be used to identify the most important indicators.Another classification algorithms used in this paper is naive Bayes classifier [koller2009probabilistic] which is also appropriate for high dimensional data. They are known to provide good results despite the strong assumption of the independence of features given the class. In addition, decisions taken by a naive Bayes classifier are very easy to understand thanks to the estimation of the conditional probabilities of the feature in each class. Those quantities can be shown to the human operator as references.
Finally, while including hundreds of indicators is important to give a broad coverage of the parameter spaces of the expert scores and thus to maximize the probability of detecting anomalies, it seems obvious that some redundancy will appear. Unlike [hegedus2011methodology] who choose features by random projection, the proposed methodology favors interpretable solutions, even at the expense of the classification accuracy: the goal is to help the human operator, not to replace her/him. Therefore, we have chosen to apply a feature selection technique [guyon2003introduction] to this problem. The reduction of number of features will ease the interpretation by limiting the quantity of information transmitted to the operators in case of a detection by the classifier. Among the possible solutions, we choose to use the Mutual information based technique Minimum Redundancy Maximum Relevance (mRMR, [peng2005feature]) which was reported to give excellent results on high dimensional data (see also [fleuret2004] for another possible choice).
4 A simulation study
4.1 Introduction
It is difficult to find real data with early signs of degradations, because their are scarce and moreover the scheduled maintenance operations tend to remove these early signs. Experts could study in detail recorded data to find early signs of anomalies whose origins were fixed during maintenance but it is close to looking for a needle in a haystack, especially considering the huge amount of data to analyze. We will therefore rely in this paper on simulated data. Our goal is to validate the interest of the proposed methodology in order to justify investing in the production of carefully labelled real world data.
In this section we begin by the description of the simulated data used for the evaluation of the methodology, and then we will present the performance obtained on this data.
4.2 Simulated data
The simulated data are generated according to the univariate shift models described in Section 3.2: each observation is a short time series which is recorded as at specific time points, e.g., . As pointed out in Section 3.1, signals can have different time resolutions. This difficulty is integrated in the simulated data by using different lengths/dimensions for each observation (hence the numbers of time points). Notice that the time points are introduced here only for generative purposes and are not used in the actual decision process. For multivariate data sets, they could become useful (e.g. to correlate potential shift detection), but this is out of the scope of the present paper.
In the rest of the paper, the notation
says that the random variable
follows a Gaussian distribution with mean
and variance and the notation says that the random variablefollows the uniform distribution on the set
(which can be finite such as or infinite such as ).We generate two data sets: a simple one and a slightly more complex one . In both cases, it is assumed that expert based normalisation has been performed. Therefore when no shift in the data distribution occurs, we observe a stationary random noise modeled by the standard Gaussian distribution. This assumption made about the noise may seem very strong but the actual goal of this paper is to evaluate the methodology and we choose to use simple distribution and simple statistical tests. In the future, we plan to use more realistic noise associated with more complex tests. An observation with no shift is then generated as follows:

, the length of the signal, is chosen as ;

the values are sampled independently from the standard Gaussian distribution, that is .
Anomalous signals use the same distribution of the signal length as normal signals. More precisely, an anomalous observation is generated as follows:

, the length of the signal, is chosen as ;

the change point is chosen as
where is the integer part of . For instance, if , then the change point is chosen among the time points to ;

the values are generated according to one anomaly model.
Anomalies are in turn modelled after the three examples given in Figures 4, 4 and 4. The three types of shift are:

a variance shift: in this case, the parameter of the shift is the variance after the change point, , with . Given , the are sampled independently as when (before the change point) and when (after the change point). See Figure 7 for an example;

a mean shift: in this case, the parameter of the shift is the mean after the change point, . In set , while in set , . Given , the are sampled independently as when (before the change point) and when (after the change point). See Figure 7 for an example;

a slope shift: in this final case, the parameter of the shift is a slope with . Given , the are sampled independently as when (before the change point) and when (after the change point). See Figure 7 for an example.
We generate according to this procedure two data sets with 6000 observations corresponding to 3000 observations with no anomaly, and 1000 observations for each of the three types of anomalies. The only difference between data set and data set is the amplitude of the mean shift which is smaller in , making the classification harder.
4.3 Indicators
As explained in Section 3.3, binary indicators are constructed from expert knowledge by varying parameters, including scale and position parameters. In the present context, we use sliding and jumping windows: for each possible position of the window, a classical statistical test is conducted to decide whether a shift in the signal occurs at the center of the window (see Figure 8). Individual tests obtained from different positions are combined to create binary indicators.
More precisely, a window of length is a series of consecutive time points in a signal (in other words, this is a subsignal). For a fixed , there are windows in , from to (when sorted in order of their first time points).
Given a window of length (assumed even) starting at position , a two sample test is conducted on the two subsets of values corresponding to the first half of the window and to the second half. More precisely, we extract from the series of values a first sample and a second sample . Then a test of inequality between and is conducted (inequality is defined with respect to some statistical aspect). The “expert” designed tests are here (notice that those tests do not include a slope shift test):

the MannWhitneyWilcoxon U test (non parametric test for shift in mean);

the two sample KolmogorovSmirnov test (non parametric test for differences in distributions);

the Ftest for equality of variance (parametric test based on a Gaussian hypothesis).
From this general principle, we derive a large set of indicators by varying the length of the window, the level of significance of the test and the way to combine results from all the windows that can be extracted from a signal.
In practice, for an observation , we use three different window lengths, , and . We use also three different levels for the tests, namely 0.005, 0.1 and 0.5. For a fixed window length and a fixed test, an observation is transformed into values produced by applying the test to the windows extracted from the observation. For each significance level, the
values are binarized giving 1 or 0 whether the null hypothesis of identical distribution between
and is rejected or not, leading to binary values.The next step consists in turning the raw binary values into a set of indicators producing the same number of indicators for all observations. The simplest binary indicator equals to 1 if and only if at least one binary value among the is equal to 1 (that is if at least one window defines two subsamples that differ according to the chosen test). Notice that as we use 3 window lengths, 3 tests, and 3 levels, we obtained this way 27 simple binary indicators.
Then, more complex binary indicators are generated, as explained in Section 3.4. In a way, this corresponds to build very simple binary classifiers on the binary values obtained from the tests. All those indicators are based on the notion of consecutive windows. In order to vary the time resolution of this process, we first introduce a jump parameter . Two windows are consecutive according to if the first time point of the second window is the th time point of the first window. For instance, if , and are consecutive windows. In practice, we use three values for , namely 1, 5 and 10.
For each value of and for each series of binary values, we compute the following derived indicator:

the global ratio indicator is equal to 1 if and only if on a fraction of at least of all possible windows, the test detects a change. This indicator does not use the fact that windows are consecutive, but it is nevertheless affected by . Indeed, values strictly larger than 1 for reduce the total number of windows under consideration;

the consecutive ratio indicator is equal to 1 if and only if there is a series of consecutive windows on which the test detects a change ;

the local ratio indicators is equal to 1 if and only if there is a series of consecutive windows among which the tests detects at least times a change.
Those derived indicators are parametric. In the present experiments, we use three different values for (used by the first two derived indicators), namely 0.1, 0.3 and 0.5. For the pair used by the last derived indicator, we used three different pairs , and .
A simple series of binary values allows us to construct 27 derived indicators (3 values for times values for the parameters of each of the 3 types of derived indicators). As we have 27 of such series (because of the 3 window lengths, 3 tests and 3 levels), we end up with 729 additional indicators for a total of 756 binary indicators.
In addition, based on expert recommendation, we apply the entire processing both to the original signal and to a smoothed signal (using a simple moving average over 5 measurements). The total number of indicators is therefore 1512. However many of them are identical over the 6000 observations and the final number of distinct binary indicators is 810. The redundancy is explained by several aspects. For instance, while the smoothing changes the signal, it has a limited effect on the test results. Also, when the level of the test is high, the base binary values tend to be all equal to one. When the ratio is low, there are not many differences between the derived indicators with respect to , etc.
It should be noted that the parameters used both for the simulated data and the indicators have been chosen so as to illustrate the possibilities of the proposed architecture of the decision process. The values have been considered reasonable (see Table 1 for a summary of these values) and representative of what would be useful in practice by experts of our application domain. It is however expected than more statistical tests and more indicators should be considered in practice to cover the range of the possible anomalies.
Parameters  Values 
statistical test  MannWhitneyWilcoxon U test 
Two sample KolmogorovSmirnov test  
Ftest  
length of window ()  30 
50  
100  
levels  0.005 
0.1  
0.5  
jump ()  1 
5  
10  
fraction for global ratio indicator and consecutive ratio indicator ()  0.1 
0.3  
0.5  
k among l for local ratio indicators ()  (3,2) 
(5,3)  
(5,4)  
moving average  1 
5 
4.4 Performance analysis
Each data set is split in a balanced way into a learning set with 1000 signals and a test set with 5000 signals (the class proportions from the full data set are kept in the subsets). We report the global classification accuracy (the classification accuracy is the percentage of correct predictions, regardless of the class) on the learning set to monitor possible over fitting. The performances of the methodology are evaluated on 10 balanced subsets of size 500 from the 5000 signals’ test set. This allows to evaluate both the average performances and their variability. For the Random Forest, we also report the outofbag (oob) estimate of the classification accuracy: this quantity is obtained during the bootstrap procedure used to construct the forest. Indeed each observation appears as a training observation in only roughly two third of the trees that constitute the forest. Then the prediction of the remaining trees for this observation can be aggregated to give a decision. Comparing this decision to the true value gives the outofbag estimate of the classification error for this observation (see [breiman2001random] for details). Finally, for the Naive Bayes classifier, we use confusion matrices and class specific accuracies to gain more insights on the results when needed.
4.5 Performances with all indicators
As indicators are expertly designed and should cover the useful parameter range of the tests, it is assumed that the best classification performances should be obtained when using all of them, up to the effects of the curse of dimensionality.
Data set  Training set acc.  OOB acc.  Test set average acc. 

0.9770  0.9228  0.9352 (0.0100)  
0.9709  0.9118  0.9226 (0.0108) 
Classification accuracy of the Random Forest using the 810 binary indicators. For the test set, we report the average classification accuracy and its standard deviation between parenthesis.
Table 2 reports the global classification accuracy of the Random Forest, using all the indicators. As expected, Random Forests suffer neither from the curse of dimensionality nor from strong over fitting (the test set performances are close to the learning set ones). Table 3
reports the same performance indicator for the Naive Bayes classifier. Those performances are significantly lower than the one obtained by the Random Forest. As shown by the confusion matrix on Table
4, the classification errors are not concentrated on one class (even if the errors are not perfectly balanced). This tends to confirm that the indicators are adequate to the task (this was already obvious from the Random Forest).Data set  Training set accuracy  Test set average accuracy 

0.7856  0.7718 (0.0173)  
0.7545  0.7381 (0.0178) 
Predicted class  

No anomaly  Variance  Mean  Slope  total  
True class  No anomaly  1759  667  45  29  2500 
Variance  64  712  50  3  829  
Mean  7  2  783  37  829  
Slope  32  7  195  595  829 
4.6 Feature selection
While the Random Forest gives very satisfactory results, it would be unacceptable for human operators as it operates in a black box way. While the indicators have simple interpretation, it would be unrealistic to ask to an operator to review 810 binary values to understand why the classifier favors one class over the others. In addition, the performances of the Naive Bayes classifier are significantly lower than those of the Random Forest one. Both drawbacks favor the use of a feature selection procedure.
As explained in Section 3.5, the feature selection relies on the mRMR ranking procedure. A forward approach is used to evaluate how many indicators are needed to achieve acceptable predictive performances. Notice that in the forward approach, indicators are added in the order given by mRMR and then never removed. As mRMR takes into account redundancy between the indicators, this should not be a major issue. Then for each number of indicators, a Random Forest and a Naive Bayes classifier are constructed and evaluated.
Figures 9, 10, 11 and 12 summarize the results for the 100 first indicators. The classification accuracy of the Random Forest increases almost monotonously with the number of indicators, but after roughly 25 to 30 indicators (depending on the data set), performances on the test set tend to stagnate (this is also the case of the outofbag estimate of the performances, which shows, as expected, that the number of indicators could be selected using this measure). In practice, this means that the proposed procedure can be used to select the relevant indicators implementing this way an automatic tuning procedure for the parameters of the expertly designed scores.
Results for the Naive Bayes classifier are slightly more complex in the case of the second data set, but they confirm that indicator selection is possible. Moreover, reducing the number of indicators has here a very positive effect on the classification accuracy of the Naive Bayes classifier which reaches almost as good performances as the Random Forest. Notice that the learning set performances of the Naive Bayes classifier are almost identical to its test set performances (which exhibit almost no variability over the slices of the full test set). This is natural because the classifier is based on the estimation of the probability of observing a 1 value independently for each indicator, conditionally on the class. The learning set contains at least 250 observations for each class, leading to a very accurate estimation of those probabilities and thus to very stable decisions. In practice one can therefore select the optimal number of indicators using the learning set performances, without the need of a crossvalidation procedure (optimality is with respect to the classification accuracy).
It should be noted that significant jumps in performances can be observed in all cases. This might be an indication that the ordering provided by the mRMR procedure is not optimal. A possible solution to reach better indicator subsets would be to use a wrapper approach, leveraging the computational efficiency of both Random Forest and Naive Bayes construction. Meanwhile Figure 13 shows in more detail this phenomenon by displaying the classification error class by class, as a function of the number of indicators, in the case of data set . The figure shows the difficulty of discerning between mean shift and trend shift (for the latter, no specific test have been included, on purpose). But as the strong decrease in classification error when the 23th indicator is added concerns both classes (mean shift and trend shift), the ordering provided by mRMR could be questioned.
4.7 Indicator selection
Based on results shown on Figures 11 and 12, one can select an optimal number of binary indicators, that is the number of indicators that maximizes the classification accuracy on the learning set. However, this leads in general to a too large number of indicators. Thus we restrict the search below a maximal number of indicators in order to avoid flooding the human operator with to many results.
For instance Table 5 gives the classification accuracy of the Naive Bayes classifier using the optimal number of binary indicators between 1 and 30.
Data set  Training set acc.  Test set average acc.  # of indicators 

0.8958  0.8911 (0.0125)  23  
0.8828  0.8809 (0.0130)  11 
Classification accuracy of the Naive Bayesian network using the optimal number binary indicators between 1 and 30. For the test set, we report the average classification accuracy and its standard deviation between parenthesis.
While the performances are not as good as the ones of the Random Forest, they are much improved compared to the ones reported in Table 3. In addition, the selected indicators can be shown to the human operator together with the estimated probabilities of getting a positive result from each indicator, conditionally on each class, shown on Table 6. For instance here the first selected indicator, , is a confirmation indicator for the U test. It is positive when there are 2 windows out of 3 consecutive ones on which a U test was positive. The Naive Bayes classifier uses the estimated probabilities to reach a decision: here the indicator is very unlikely to be positive if there is no change or if the change is a variance shift. On the contrary, it is very likely to be positive when there is a mean or a trend shift. While the table does not “explain” the decisions made by the Naive Bayes classifier, it gives easily interpretable hints to the human operator.
type of indicator  no change  variance  mean  trend 
confu(2,3)  0.010333  0.011  0.971  0.939 
F test  0.020667  0.83  0.742  0.779 
U test  0.027333  0.03  0.977  0.952 
ratef(0.1)  0.0016667  0.69  0.518  0.221 
confu(4,5)  0.034333  0.03  0.986  0.959 
confu(3,5)  0.0013333  0.001  0.923  0.899 
U test  0.02  0.022  0.968  0.941 
F test  0.042  0.853  0.793  0.813 
rateu(0.1)  0.00033333  0.001  0.906  0.896 
confu(4,5)  0.019  0.02  0.946  0.927 
conff(3,5)  0.052333  0.721  0.54  0.121 
U test  0.037667  0.038  0.983  0.951 
KS test  0.016  0.294  0.972  0.936 
confu(3,5)  0.049  0.043  0.988  0.963 
F test  0.030667  0.841  0.77  0.801 
U test  0.043  0.043  0.981  0.963 
lseqf(0.3)  0.0093333  0.749  0.59  0.36 
rateu(0.1)  0.001  0.002  0.896  0.895 
lsequ(0.1)  0.062667  0.06  0.992  0.949 
confu(3,5)  0.025667  0.021  0.963  0.936 
lseqf(0.3)  0.008  0.732  0.656  0.695 
KS test  0.016333  0.088  0.955  0.93 
confu(3,5)  0  0  0.003  0.673 
5 Conclusion and perspectives
In this paper, we have introduced a diagnostic methodology for engine health monitoring that leverages expert knowledge and automatic classification. The main idea is to build from expert knowledge parametric anomaly scores associated to range of plausible parameters. From those scores, hundreds of binary indicators are generated in a way that covers the parameter space as well as introduces simple aggregation based classifiers. This turns the diagnostic problem into a classification problem with a very high number of binary features. Using a feature selection technique, one can reduce the number of useful indicators to a humanly manageable number. This allows a human operator to understand at least partially how a decision is reached by an automatic classifier. This is favored by the choice of the indicators which are based on expert knowledge and on very simple decision rules. A very interesting byproduct of the methodology is that it can work on very different original data as long as expert decision can be modelled by a set of parametric anomaly scores. This was illustrated by working on signals of different lengths.
The methodology has been shown sound using simulated data. Using a reference high performance classifier, Random Forests, the indicator generation technique covers sufficiently the parameter space to obtain a high classification rate. Then, the feature selection mechanism (here a simple forward technique based on mRMR) leads to a reduced number of indicators (23 for one of the data set) with good predictive performances when paired with a simpler classifier, the Naive Bayes classifier. As shown in the experiments, the class conditional probabilities of obtaining a positive value for those indicators provide interesting insights on the way the Naive Bayes classifier takes a decision.
In order to justify the costs of collecting a sufficiently large real world labelled data set in our context (engine health monitoring), additional experiments are needed. In particular, multivariate data must be studied in order to simulate the case of a complex system made of numerous subsystems. This will naturally lead to more complex anomaly models. We also observed possible limitations of the feature selection strategy used here as the performances displayed abrupt changes during the forward procedure. More computationally demanding solutions, namely wrapper ones, will be studied to confirm this point.
It is also important to notice that the classification accuracy is not the best way of evaluating the performances of a classifier in the engine health monitoring context. Firstly, engine health monitoring involves intrinsically a strong class imbalance [japkowicz2002class]. Secondly, engine health monitoring is a cost sensitive area because of the strong impact on airline profit of an unscheduled maintenance. It is therefore important to take into account specific asymmetric misclassification cost to get a proper performance evaluation.
The Gaussian assumption made on the noise of the simulated data is a strong one. In our future work, we plan to use more realistic noise and to test the robustness of the methodology. We will evaluate if we can compensate this new complexity with the use of more complex indicators.