Appliance Event Detection – A Multivariate, Supervised Classification Approach

04/24/2019 ∙ by Matthias Kahl, et al. ∙ Technische Universität München

Non-intrusive load monitoring (NILM) is a modern and still expanding technique that helps to understand fundamental energy consumption patterns and appliance characteristics. Appliance event detection is an elementary step in the NILM pipeline. Unfortunately, several types of appliances (e.g., switching mode power supply (SMPS) or multi-state appliances) are known to challenge state-of-the-art event detection systems due to their noisy consumption profiles. Classical rule-based event detection systems become complex and infeasible for these appliances. By stepping away from distinct event definitions, we can learn from a consumer-configured event model to differentiate between relevant and irrelevant event transients. We introduce a boosting-oriented adaptive training that uses false positives from the initial training area to substantially reduce the number of false positives on the test area. The results show a false positive decrease by more than a factor of eight on a dataset that has a strong focus on SMPS-driven appliances. To obtain a stable event detection system, we ran several experiments on different parameters to measure its performance. These experiments include the evaluation of six event features from the spectral and time domain, different types of feature space normalization to eliminate undesired feature weighting, conventional and adaptive training, and two common classifiers with their optimal parameter settings. The evaluations are performed on two publicly available energy datasets with high sampling rates: BLUED and BLOND-50.




1. Introduction

Natural energy resources are depleting at an alarming rate and, at the same time, the demand for energy is steadily increasing. Recently, many approaches have been proposed to decrease our reliance on these resources by increasing energy efficiency. Non-intrusive load monitoring (NILM) provides detailed information on energy consumption for consumers in residential or industrial areas. Surveys indicate that appliance-level consumption feedback can increase consumers' awareness of energy waste and thereby reduce their energy consumption (Kelly and Knottenbelt, 2016). NILM is an intelligent energy monitoring technique that uses a single energy monitor to retrieve appliance information, such as power consumption and appliance type, from aggregated loads in a non-intrusive way. Apart from energy savings, NILM is a helpful tool for predictive maintenance (Abdelgayed et al., 2018) or the determination of motor speed (Orji et al., 2015).

Figure 1. The first plot shows an actual OFF, followed by an ON event of a monitor. The second plot shows sudden laptop transients, that most likely stem from processor load changes. The goal is to differentiate between actual ON / OFF events and transients that are irrelevant to the user.

Appliance events play an essential role in NILM since these events are the time points where the energy consumption changes significantly. It is therefore important to identify appliance events correctly for all following steps. In NILM, an appliance event is often defined as the transition between two steady states of a time series (Wild et al., 2015). These time series are mostly in the form of current, real or reactive power, and voltage distortion (Meziane et al., 2017). Other metrics, such as admittance or the derivative of the current, allow identifying appliance events as well, but do not seem to have been considered so far. Several event-based approaches have been proposed to retrieve detailed consumption information from aggregated load signals (Hart, 1992; Armel et al., 2013; Batra et al., 2014). Event-based NILM approaches differ in performance, based on the number and types of appliances, the sampling frequency of the acquired data, the quality of the event metrics, and the complexity of the disaggregation algorithms used. These factors determine the detection accuracy of the appliances from the aggregated load.

A considerable share of the inaccuracies in NILM disaggregation stems from the event detection, whose performance depends mostly on the observed appliance type. A high false positive rate is a common problem in event detection. Events from resistive appliances are usually steep, undistorted, and easily detectable due to very low noise in their steady-state consumption. SMPS-driven continuous load appliances (desktop PCs, laptops, LED-TVs, etc.), on the other hand, can draw strong event-like transients mid-usage that satisfy typical event rules but do not match any physical state change or user interaction (see Figure 1).

Event detection algorithms often use a rule-based system with hand-crafted and empirically selected sets of rules (Girmay and Camarda, 2016; Trung et al., 2014; Jin et al., 2011b). Beyond a certain point of rule complexity, and given the presence of manually labeled data, a supervised learning approach is worth considering (Ian Goodfellow, 2016). The high noise and variance in the current waveform of SMPS-driven appliances are hard to process with rule sets.

Therefore, in our approach, we replace hand-crafted rules with a multivariate, binary classification to distinguish between unrelated event-like transients and actual user-relevant appliance events. Our classification system learns from customer-labeled events to separate appliance ON / OFF events (events) from event-like transients that are irrelevant to the customer (non-events). As a result of our critical discussion of hard-coded appliance event definitions, our event detection is designed to learn a flexible event definition, supervised by representative examples. Our two-step adaptive learning approach, which is oriented on the boosting algorithm (Géron, 2017), ensures a relevant selection of training samples for the event and non-event class by learning from false positives. Our experiments show that the algorithm can significantly reduce the number of unrelated event-like transients (false positives).

The rest of the paper is organized as follows: Section 2 gives an overview of related work and discusses several event definitions. A detailed description of the multivariate event detection approach can be found in Section 3, while our experiments are given in Section 4. The results in Section 5 are consolidated and put into the NILM context in our conclusions in Section 6.

2. Background and Related Work

According to Anderson et al. (2012a) the elementary steps for energy consumption feedback with NILM are: (1) Signal measurement, (2) Appliance Event detection, (3) Appliance Event classification, and (4) Energy Disaggregation (see Figure 2).

Figure 2. The focus in this work is on event detection of the general NILM pipeline.


Multi-state and SMPS-driven appliances often show unrelated event-like transients due to appliance state changes. These transients can be caused, among other things, by computers that switch spontaneously from idle to full processor load (see Figure 1). Organic-LED-driven monitors have an image-dependent energy consumption that can switch from minimum to maximum load within milliseconds just because the displayed image changes from black to white. These undesired or unrelated transients affect the appliance classification and power disaggregation performance and make event detection a challenging task. Rule-based event detection algorithms would need a complex rule set that is hardly feasible and, due to its inflexibility, sensitive to changes in the environment or the appliance set.

Event Detection

NILM is commonly divided into event-based and state-based approaches. Event-based approaches rely on detection algorithms to find electrical events such as the switch-ON or switch-OFF of an individual appliance. State-based methods, on the other hand, take every sample of the signal into account to perform the inference step. Event-based methods are generally more efficient in the inference step than state-based approaches. This efficiency stems from pre-processing the voltage and current signals by labeling and extracting the regions of interest of the signal after the events have occurred (Liang et al., 2010). Most event-based methods rely on the switch continuity principle (Makonin, 2016), which was initially introduced by Hart (Hart, 1992). It essentially states that there is at most one event at any given point in time. Furthermore, it assumes that events are relatively rare in the overall signal, which allows treating event detection as anomaly detection. Sampling data at higher rates increases the validity of this principle. Employing this principle allows event detection methods and other algorithms to treat electric events as being isolated from one another (Makonin, 2016).

Three categories of event detection approaches are introduced by Anderson et al. (2012a). Expert heuristics describe mostly rule-based approaches that use prior knowledge to define sets of parameters and thresholds (Hart, 1992; Baranski and Voss, 2004). Probabilistic models consider statistical metrics, including variance and standard deviation, to estimate the probability of a change in a time series (Berges et al., 2011; Jin et al., 2013). Approaches of the matched-filter category try to find a universal event pattern in the signal by exceeding a likelihood threshold (Leeb et al., 1995; Shaw et al., 2008). Anderson et al. (2012a) use a modified generalized likelihood ratio detector to compare four different evaluation metrics.

Baets et al. (2016) apply a cepstrum smoothing high-pass filter to the signal. This way, only very low frequencies and step changes remain in the signal. The assumption is that in the case of an event, all remaining low frequencies lie above a certain threshold. The optimal parameter values were evaluated empirically. De Baets et al. compare their results on the BLUED dataset with the chi-squared goodness-of-fit (GOF) approach by Jin et al. (2011a) and reach comparable results.

Barsim et al. (2014) introduced an unsupervised event detection algorithm that uses the logarithm of the P, Q plane (Hart, 1992) to find steady states as clusters, while transients are represented as single scatters or outliers. The extraction of actual events was performed in three stages: a coarse search, followed by a fine search, and a final verification stage. The unsupervised approach has the advantage that no learning from existing ground truth is necessary. The results show a performance very similar to that of Baets et al. (2016).

Wild et al. (2015) introduce a new event definition that gives events a dimension in time, so that they are no longer treated as instantaneous. In combination with some constraints, this definition allows a Fisher discriminant analysis to perform a robust unsupervised appliance event detection in the spectral domain.

Houidi et al. (2018) investigate three commonly used techniques for the abrupt event detection that are typically used in other research fields: the Effective Residual algorithm (Berriri et al., 2012), the Cumulative Sum (CUSUM) algorithm (Trung et al., 2014), and the Bayesian Information Criterion algorithm (Ajmera et al., 2004). These algorithms are probabilistic event detection techniques. By comparing the algorithms in a real-world environment, Houidi et al. conclude that the CUSUM algorithm outperforms the other two and achieves good results on their internal dataset.

Azzini et al. (2014) introduce the “window with margin” method. This threshold-based algorithm uses a sliding window and a subset of the samples within the window, i.e., samples from the beginning and the end of the window, to calculate two averages of the active power consumption. Azzini et al. then use heuristically defined thresholds to check whether the difference between the averages exceeds a certain limit in order to detect events in the signal.
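The window-with-margin idea can be sketched in a few lines. The following is only an illustration and not the implementation of Azzini et al.: the function name `window_with_margin`, the window length, the margin size, and the 30 W threshold are assumed values chosen for the example.

```python
from statistics import mean


def window_with_margin(power, window=10, margin=3, threshold=30.0):
    """Return start indices of windows whose head and tail averages differ
    by more than `threshold` (all parameters are illustrative assumptions)."""
    hits = []
    for start in range(len(power) - window + 1):
        segment = power[start:start + window]
        head = mean(segment[:margin])    # average over the first samples
        tail = mean(segment[-margin:])   # average over the last samples
        if abs(tail - head) > threshold:
            hits.append(start)
    return hits


# Synthetic active power with a single 60 W step at index 20.
signal = [100.0] * 20 + [160.0] * 20
print(window_with_margin(signal))
```

Several neighboring windows fire for the same step, so a practical detector would still need to merge adjacent hits into one event.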

The event detection methods above were developed for residential settings, whereas Leeb and Kirtley (1993) propose a multi-scale transient event detector for industrial settings. To tolerate overlapping events, the authors' algorithm searches for time patterns of segments in the signal that exhibit significant variation instead of searching for complete transient shapes. The algorithm detects such segments by using a change-of-mean detector. The transient changes in the signal are then detected by using sets of the previously computed segments as features for particular events together with a pattern matching algorithm.

In contrast to the majority of the event detection approaches, Cox et al. (2006) do not use current signals and analyze only aggregated voltage measurements. By using a spectral decomposition of the voltage signal to compute the harmonic voltage distortion, they are able to detect residential appliance events reliably. They further show that the voltage signal exhibits sufficient information to identify events.

All mentioned approaches have in common that all significant transients are interpreted as events. Each approach uses a different event definition, which makes it hard to compare their results. None of them allows distinguishing between different kinds of events or ignoring undesired events.

Work                   F-Score
Baets et al. (2016)    80.04
Jin et al. (2013)      81.01
Wild et al. (2015)     89.15

Table 1. Event detection results on BLUED. The works use different event definitions, which makes the results hard to compare.


We use two common energy datasets to evaluate the introduced event detection algorithm. The Building-Level fUlly-labeled dataset for Electricity Disaggregation (BLUED) was introduced by Anderson et al. (2012b). The dataset contains continuous voltage and current measurements of around one week from a single-family household. The aggregated consumption signal is measured with a high amplitude resolution (16-bit) and a high temporal resolution (12 kHz). To enable event detection research, Anderson et al. labeled significant appliance state transients with timestamps and appliance information. The transient event ground truth stems from additional sensing, such as light sensors, and from visual observation by humans. The resulting 1 577 events of phase B are used for our experiments. An overview of event detection results on the BLUED dataset is given in Table 1.

The Building-Level Office eNvironment dataset (BLOND) (Kriechbaumer and Jacobsen, 2018) contains long-term continuous measurements of a 3-phase energy supply to an office building. In this work, we use the appliance-level BLOND-50 sub-dataset, which contains mains voltage and current measurements of 90 observed sockets. The per-appliance electrical signals, which are intended for ground truth retrieval, were collected at 6.4 kHz with 12-bit resolution. The data amounts to 213 days of recording, from which we selected November 2016 for all experiments on BLOND-50. The ground truth assigns an appliance type and the associated nameplate information to each monitored power socket.

Event Definition

Multiple different interpretations of events can be found in the literature. Wild et al. (2015) present a classical and an extended event definition. A classical event is a “transient from one steady state to another steady state which definitely differs from the previous one” (Wild et al., 2015), while an extended event describes a “so-called active section where the signal is somehow deviating from the previous steady state” (Wild et al., 2015), which provides a higher resilience against peaks and short pulses. Anderson et al. (2012a) define an event in a concrete, value-based way as a state change of 30 W for a certain amount of time, while Jin et al. (2011b) see event detection as a way to find ON and OFF transients of appliances. Girmay and Camarda (2016) see an event as an active region from any appliance activation in which the power consumption is “well above” the background power.

The list of definitions above shows that there is no common agreement on what an appliance event is. The event detection performance depends strongly on the event definition itself. A simple definition that includes a significant change of power for a certain amount of time, regardless of the cause, can simply be put into a rule-based system that may allow for a perfect detection performance. From the consumer perspective, appliance ON / OFF events that have a causal origin (i.e., user interaction or physical appliance state changes) are more relevant than transients that simply satisfy the rule set. In practice, the consumer might be interested in the fridge or washing machine spin cycles. The temporarily increased energy consumption of a laptop during an irregular five-minute operating system update, or the sudden content-dependent change in the energy consumption of an organic-LED-driven TV, is only of minor interest to the consumer.
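To make the contrast concrete, such a simple value-based definition could be written as a single rule. The sketch below is hypothetical: the function `is_rule_event`, the 1 Hz power series, and the 5 s hold time are assumptions for illustration, and only the 30 W step follows the value-based definition cited above.

```python
def is_rule_event(power, idx, rate_hz=1, min_step_w=30.0, hold_s=5):
    """True if the power level changes by at least `min_step_w` at `idx`
    and stays changed for `hold_s` seconds (illustrative rule only)."""
    n = hold_s * rate_hz
    before = power[max(0, idx - n):idx]
    after = power[idx:idx + n]
    if len(before) < n or len(after) < n:
        return False  # not enough context around the candidate position
    base = sum(before) / n
    # Every sample of the hold period must stay on the same side of the
    # pre-event level, at least `min_step_w` away from it.
    rising = all(p - base >= min_step_w for p in after)
    falling = all(base - p >= min_step_w for p in after)
    return rising or falling


step = [50.0] * 10 + [90.0] * 10               # sustained 40 W rise
spike = [50.0] * 10 + [90.0] * 2 + [50.0] * 8  # short pulse
print(is_rule_event(step, 10), is_rule_event(spike, 10))  # True False
```

Such a rule fires on every transient that satisfies it, regardless of whether a user action or a processor load change caused it, which is the limitation discussed above.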

Our approach avoids a distinct, hard-coded appliance event definition by learning from individual consumer-configured appliance event segments to build a tailored event model. This way, we step back from a distinct event definition in favor of a user-definable event model. Since events from different appliances show individual characteristics, a rule-based approach with thresholds may not be sufficient to find ON / OFF switches. Our system learns from different event metrics in the time and spectral domain, which are fed as features into a supervised binary classification system. To improve the classification performance, we introduce an adaptive training technique that learns from previously misdetected transients that lie on the border between events and non-events.

3. Multivariate Event Detection

A reasonable appliance classification and disaggregation performance will only be achieved when the NILM system adapts to the deployed environment. The customization may include parameter settings such as the base load, the minimum and maximum appliance load, or the maximum number of concurrently running appliances. Besides those parameters, a consumer-supervised appliance labeling for system training purposes over a certain amount of time (e.g., a few days or weeks) will result in considerably improved classification and disaggregation performance (Kahl_2017b).

Since the temporal appliance event positions are implicitly known from the consumer-labeled time range, these event segments can be used to train a supervised event model for the event classification. The a priori known event segments can be used to identify significant event characteristics, which is a major advantage compared to hand-crafted rules. In a supervised classification task, the classifier needs training samples for each individual class. Event detection is related to anomaly detection, which faces the problem of not having sufficient training samples for one of the target classes. In practice, we explicitly know from examples what an event looks like, but we do not explicitly know what a non-event looks like.

Figure 3. The explicitly-known events are retrieved from the event ground truth. Therefore, all other regions are implicitly-known as non-events.


To overcome that issue, we make use of the fact that, statistically, no event occurs in the signal for the majority of the time. We cut short, randomly positioned regions of the temporal signal from the training area and use them as non-event samples (see Figure 3). The probability of hitting an event at a randomly selected position in the training area of the temporal signal is low for common residential and office environments. Based on the utilized datasets BLUED and BLOND-50, around 1 250 events occur per phase in one week in the residential environment, while around 257 occur in the office environment. Assuming we are interested in the same number of non-events as events, the chance of hitting an event via random selection is 0.83 % for the residential environment and around 0.17 % for the office environment. To eliminate even that small uncertainty, a randomly selected region must keep a temporal distance of at least 10 s to every explicitly known event. The resulting non-events are called implicitly-known non-events throughout this paper. All samples together can be used to train a classifier with a training set that consists of explicitly-known events and implicitly-known non-events.
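The random selection of implicitly-known non-events can be sketched as rejection sampling. This is a minimal illustration under assumptions: the function `sample_non_events`, its parameters, and the fixed seed are hypothetical, and only the 10 s minimum distance comes from the text.

```python
import random


def sample_non_events(event_times, t_start, t_end, count,
                      min_gap_s=10.0, seed=0):
    """Draw `count` random timestamps in [t_start, t_end] that keep at
    least `min_gap_s` seconds distance to every known event."""
    rng = random.Random(seed)
    picks = []
    tries = 0
    max_tries = 1000 * count  # guard against an over-constrained range
    while len(picks) < count and tries < max_tries:
        tries += 1
        t = rng.uniform(t_start, t_end)
        if all(abs(t - e) >= min_gap_s for e in event_times):
            picks.append(t)
    return picks


events = [100.0, 200.0, 300.0]  # ground truth event timestamps in seconds
non_events = sample_non_events(events, 0.0, 400.0, count=len(events))
print(non_events)
```

Each returned timestamp would then serve as the center of a non-event window of the same length as the event windows.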

An observed issue with this approach is a high number of false positive events. The randomly selected non-event samples stem mostly from areas of steady consumption. The non-event class is therefore a good, homogeneous representation of steady non-event areas. A more heterogeneous set of non-event training samples that includes unsteady, event-like transients would be necessary so that transients from SMPS-driven appliances are more often classified as non-events.

3.1. Adaptive Training

Extracting even more randomly selected samples would be one way to get a higher variance, albeit an infeasible one. In its extreme form, every extractable time window in the dataset that is not a ground truth labeled event would be used. Obviously, this would create an infeasible number of training samples for the non-event class. Moreover, the vast majority of these training samples would be redundant because they are very similar to each other.

Figure 4. The event detection runs on the training area and generates false positives that are being stored for the actual event detection.


Our approach is a so-called boosting variant that runs the event detection algorithm on the whole training area to find all ground truth labeled events but also a certain number of non-labeled transients. Based on the provided ground truth, these transients are clearly false positives (see Figure 4). They are marginal, uncertain segments of non-events that share similarities with events. These similarities cause the misclassification in favor of the event class. Since these false positives are found inside the training set, we can use them freely to improve our classification model.

Figure 5. The collected false positives from the event detection of the training area form together with the negative samples the class non-events. The positive examples are the representatives of the event class.


The idea is to add these borderline transients to the non-event class of the training set to sharpen the border between events and non-events (see Figure 5). The actual training set now consists of ground truth labeled event samples, implicitly-known non-event samples, and false positives that were found in the event classification run on the training area itself. This way, it is possible to overcome the issue of finding proper non-event samples for the event detection algorithm. To reduce the number of false positives even further, the adaptive training can be applied multiple times.
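The adaptive training loop described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the tiny nearest-neighbor classifier `knn_predict`, the function `adaptive_training`, and the one-dimensional toy features are assumptions, and feature vectors are represented as tuples.

```python
def knn_predict(train_x, train_y, x, k=1):
    """Majority vote among the k nearest training vectors
    (1 = event, 0 = non-event)."""
    ranked = sorted(
        (sum((a - b) ** 2 for a, b in zip(t, x)), y)
        for t, y in zip(train_x, train_y)
    )
    votes = [y for _, y in ranked[:k]]
    return 1 if 2 * sum(votes) > len(votes) else 0


def adaptive_training(events, non_events, candidates, labeled, rounds=1, k=1):
    """Run the detector on the training area and add its false positives
    to the non-event class. `candidates` are the feature vectors of all
    windows in the training area, `labeled` marks ground truth events."""
    train_x = list(events) + list(non_events)
    train_y = [1] * len(events) + [0] * len(non_events)
    for _ in range(rounds):
        known = set(train_x)
        false_pos = [
            x for x, is_event in zip(candidates, labeled)
            if not is_event and x not in known
            and knn_predict(train_x, train_y, x, k) == 1
        ]
        if not false_pos:
            break  # the training run produced no new false positives
        train_x += false_pos
        train_y += [0] * len(false_pos)
    return train_x, train_y


events = [(10.0,), (11.0,)]      # explicitly-known events
non_events = [(0.0,), (0.5,)]    # implicitly-known non-events
candidates = [(10.5,), (7.0,), (0.2,)]
labeled = [True, False, False]   # only the first window is a real event
train_x, train_y = adaptive_training(events, non_events, candidates, labeled)
print(knn_predict(train_x, train_y, (8.0,)))  # 0: now closer to a non-event
```

In this toy example, the unlabeled transient at 7.0 is first detected as an event, stored as a false positive, and afterwards pulls nearby transients such as 8.0 to the non-event side.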

3.2. Event Features

The event ground truth information for BLUED is based on a power consumption change of at least 30 W over a time period of at least 5 s (Anderson et al., 2012b). Based on this definition, the appliance events can be identified in a moving time window in the continuous electricity signals. We implemented one spectral and five time-domain metrics as appliance event features for the classification between events and non-events. In our design, the actual event transient is aligned in the middle of the extracted time window, with 5 s of data before and after the transient. The temporal position of the event transient is taken from the ground truth information or, in the case of BLOND-50, from manual annotation.

The ground truth information provided with BLUED and the BLOND-50 annotations from this work comprise the appliance ON and OFF switch events, including circuit number, temporal position (timestamp), and appliance type. The provided switch-OFF and switch-ON events of these appliances always cause significant changes in the following consumption-related metrics:


Current

The current is the first intuitive metric that reflects consumption changes (see Figure 6-1). The RMS current for each period $p$ is calculated as follows, with $N$ as the number of samples per period, calculated as the ratio of the sampling frequency $f_s$ and the mains frequency $f_0$:

$$I_{\mathrm{RMS}}(p) = \sqrt{\frac{1}{N} \sum_{n=1}^{N} i_{p,n}^{2}}, \qquad N = \frac{f_s}{f_0}$$
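A minimal sketch of the per-period RMS computation described above. The function name `rms_per_period` is hypothetical, and the 60 Hz mains frequency in the example is an assumption, since only the 12 kHz sampling rate of BLUED is stated in the text.

```python
import math


def rms_per_period(current, fs_hz, mains_hz):
    """RMS value of each complete mains period of a sampled current signal."""
    n = round(fs_hz / mains_hz)      # samples per mains period
    periods = len(current) // n      # incomplete trailing period is dropped
    return [
        math.sqrt(sum(s * s for s in current[p * n:(p + 1) * n]) / n)
        for p in range(periods)
    ]


# Three periods of a 2 A peak sine wave, sampled at 12 kHz (60 Hz assumed).
fs, f0 = 12_000, 60
wave = [2.0 * math.sin(2 * math.pi * f0 * i / fs) for i in range(600)]
print(rms_per_period(wave, fs, f0))  # each value is close to 2 / sqrt(2)
```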


Δ(Current)

Since multiple appliances can run at the same time, the actual pre-event current can be the sum of multiple appliances and therefore has a high variance (see Figure 6). The actual information of interest is the current step change at the event time (see Figure 6-2). This metric can be retrieved as the numerical difference of neighboring elements of the per-period RMS current, $\Delta I_{\mathrm{RMS}}(p) = I_{\mathrm{RMS}}(p+1) - I_{\mathrm{RMS}}(p)$. The operation $\Delta$ is the discrete-time equivalent of the derivative.


Admittance

The grid's voltage can fluctuate strongly (by up to 10 %), which influences the current signal as well. The admittance removes the voltage influence from the current signal and therefore reflects the appliance consumption itself more precisely (see Figure 6-3). The admittance ADM can be calculated by the element-wise division of the period-wise current $I_{\mathrm{RMS}}(p)$ by the period-wise voltage $V_{\mathrm{RMS}}(p)$:

$$\mathrm{ADM}(p) = \frac{I_{\mathrm{RMS}}(p)}{V_{\mathrm{RMS}}(p)}$$
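The current step change and the admittance are both simple element-wise operations on the per-period series. The sketch below uses hypothetical helper names (`diff`, `admittance`) and made-up per-period values to show why the admittance is insensitive to a voltage fluctuation.

```python
def diff(values):
    """Numerical difference of neighboring elements (discrete derivative)."""
    return [b - a for a, b in zip(values, values[1:])]


def admittance(i_rms, v_rms):
    """Element-wise division of per-period current by per-period voltage."""
    return [i / v for i, v in zip(i_rms, v_rms)]


# A constant load while the grid voltage rises by 5 % (illustrative values).
v_rms = [230.0, 230.0, 241.5, 241.5]
i_rms = [2.3, 2.3, 2.415, 2.415]       # the current follows the voltage

print(diff(i_rms))                      # shows a step caused by the grid
print(diff(admittance(i_rms, v_rms)))   # stays close to zero
```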

Spectral Flatness

Our motivation for the only spectral feature we considered is the assumption that all appliances have individual fingerprints in their harmonic energy distribution. A suitable one-dimensional spectral metric is the spectral flatness. A flat spectral curve results in a value close to one, while a single strong spike leads to a value close to zero (see Figure 6-4). The switch-OFF and switch-ON of an appliance generally influence the spectral flatness. The spectral flatness for each period is calculated as the ratio of the geometric mean to the arithmetic mean of the energy spectrum of the current signal (Peeters, 2004).
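The ratio of geometric to arithmetic mean can be sketched as follows. This is an illustration under assumptions: the helper names are hypothetical, a naive DFT stands in for an FFT to keep the example dependency-free, and the choice of bins (DC excluded, up to half the period length) as well as the small `eps` offset are not taken from the paper.

```python
import cmath
import math


def energy_spectrum(samples):
    """Energy of the DFT bins 1 .. n/2 - 1 of one period (naive DFT)."""
    n = len(samples)
    spectrum = []
    for k in range(1, n // 2):  # the DC bin is skipped (assumption)
        c = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                for i, s in enumerate(samples))
        spectrum.append(abs(c) ** 2)
    return spectrum


def spectral_flatness(spectrum, eps=1e-12):
    """Geometric mean divided by arithmetic mean of an energy spectrum."""
    vals = [v + eps for v in spectrum]  # eps avoids log(0) for empty bins
    geometric = math.exp(sum(math.log(v) for v in vals) / len(vals))
    arithmetic = sum(vals) / len(vals)
    return geometric / arithmetic


sine = [math.sin(2 * math.pi * i / 64) for i in range(64)]
print(spectral_flatness(energy_spectrum(sine)))  # close to zero: one spike
print(spectral_flatness([1.0] * 8))              # close to one: flat
```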

Cumulative Sum

The cumulative sum is a sequence analysis technique that allows identifying small, slow, continuous changes as well as strong and fast changes in a time series (see Figure 6-5). It is therefore a common technique for change and event detection. The cumulative sum is the running sum of the differences between the signal and its mean within a defined time window.
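A minimal sketch of this definition, with the hypothetical function name `cusum` and a made-up window of per-period values:

```python
def cusum(window):
    """Running sum of the deviations from the window mean."""
    mu = sum(window) / len(window)
    out, total = [], 0.0
    for x in window:
        total += x - mu
        out.append(total)
    return out


# A step from 1 to 3 in the middle of the window.
print(cusum([1.0, 1.0, 3.0, 3.0]))  # [-1.0, -2.0, -1.0, 0.0]
```

The extremum of the sequence marks the position of the change, and the sequence always returns to zero at the end of the window because the deviations from the mean cancel out.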

Δ(Cumulative Sum)

The values of the cumulative sum can grow to extreme magnitudes and thereby cause an undesired weighting of dimensions in the feature space. Taking the derivative of the cumulative sum prevents this issue and keeps the values at a lower magnitude. The resulting signal is visually comparable to the current itself, but with enlarged transients (see Figure 6-1 and 6-6).

Figure 6. Events (left) and non-events (right) for the six event feature metrics, with periods on the x-axis and amplitude on the y-axis: 1. Current, 2. Δ(Current), 3. Admittance, 4. Spectral Flatness, 5. CUSUM, and 6. Δ(CUSUM). The color saturation correlates with the average distance to the mean event (red line). The closer an event lies to the mean event, the higher the saturation.

In addition to the mentioned features and training methods, we evaluated the event detection performance with different methods in the feature space normalization and classification steps. Feature space normalization is a common technique to avoid undesired weighting across the dimensions of the feature space. It is often an essential step, and we evaluate three types of it. The classification step is evaluated with two different classifiers (KNN and SVM), including their hyper-parameter search.

4. Experiments

To compare our event detection performance with the state of the art, we applied our algorithm to the BLUED dataset, which is commonly used for event detection evaluation. The experimental setup is oriented on the setup in the work of Baets et al. (2016). While De Baets et al. use a fixed test area, we use cross-validation for our performance evaluation. Anderson et al. (2012a), Barsim et al. (2014), and Wild et al. (2015), among others, evaluate their event detection algorithms on the BLUED dataset as well. For BLUED, we use the provided ground truth information, which stems from hand-crafted annotations.

Unfortunately, neither BLUED nor BLOND-50 provides event information that is detailed enough to distinguish between ON / OFF switching and user-unrelated transients. In our experiments on BLOND-50, we try to distinguish ON and OFF events from all remaining state transients, identical to the work of Baets et al. (2016). The appliance ON and OFF events for the BLOND-50 dataset were collected through visual observation by an instructed person with the help of a self-implemented annotation tool. There are no studies regarding event detection on BLOND-50 yet.

Since benchmarking several parameters with cross-validation takes considerable computational time, we use a cluster of 60 virtual machines, each with four cores and 10 GiB RAM and hosted on dual Intel Xeon E5-2630v3 systems, to execute the appliance event detection algorithm in parallel. The cumulative CPU time for all experiments, preprocessing, and testing is on the order of 128 000 CPU-core-hours.

Figure 7. The architecture of the experimental setup covers the main steps of the common machine learning pipeline and includes the evaluation of six event features, three types of feature space normalization, two training approaches, and two different classifiers with their optimal parameters. The whole architecture is wrapped in a cross-validation and structured to run on a distributed computation system.

4.1. Multivariate Event Detection

Instead of monitoring whether one or a few parameters pass thresholds, our multivariate approach enables supervised learning of multiple event characteristics. The explicitly-known event sections and implicitly-known non-event sections were used to train a classifier that decides, based on the given feature vector, between event and non-event.

Architecture for BLUED

In addition to the 1 577 events, we extracted 6 428 segments of implicitly-known non-events (one for each file) of the same length. The event segments are aligned with the ground truth event timestamp in the center. All segments are then passed to the feature extraction and normalization. The normalization parameters (e.g., means or standard deviations) are saved in order to apply the corresponding transformation to the samples of the test area. The following steps include a parameter search for the classifier (e.g., C and Gamma for SVM), classifier training, and classification of the samples of the test area (see Figure 7). All experiments are implemented within a stratified k-fold cross-validation to ensure reliable results.
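The handling of normalization parameters can be illustrated with a z-score normalization. This is a sketch under assumptions: the three normalization types evaluated in the paper are not named in this section, so z-score is only one plausible candidate, and the helper names and feature values are hypothetical.

```python
from statistics import mean, pstdev


def fit_zscore(train):
    """Compute mean and standard deviation per feature on the training set."""
    columns = list(zip(*train))
    # A constant feature would give a deviation of 0, so fall back to 1.
    return [(mean(c), pstdev(c) or 1.0) for c in columns]


def apply_zscore(samples, params):
    """Apply previously saved parameters, e.g. to samples of the test area."""
    return [[(x - m) / s for x, (m, s) in zip(row, params)]
            for row in samples]


train = [[1.0, 100.0], [3.0, 300.0]]  # two features on very different scales
params = fit_zscore(train)            # saved after training
print(apply_zscore([[2.0, 400.0]], params))
```

Fitting the parameters on the training samples only and reusing them for the test area keeps information about the test data out of the training step.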

Architecture for BLOND-50

The manually annotated time span comprises one month of measured data. We extract all manually annotated events and the implicitly-known non-events in much the same way as for BLUED. This step yields 3 310 event and 3 264 non-event samples. The events originate from 41 different monitored appliances in the time range from 2016-11-01 to 2016-11-30.

4.2. Adaptive Training

The adaptive training shares the experimental architecture of the multivariate event detection, with one additional event detection run on the training area itself, whose false positives are added to the training set. This training run finds events in the training area that can be divided, based on the ground truth information, into true positives, false positives, and false negatives. All false positive segments that originate from the training run are added to the non-event class of the actual training set.
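The three outcome counts of such a detection run are also the inputs of the F-score. As a small reference, the following sketch computes the standard F1 score from them. It assumes that the reported F-score is the balanced F1 variant, and the counts in the example are made up.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from detection counts."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Illustrative counts: halving the false positives raises the score.
print(f1_score(tp=80, fp=20, fn=20))
print(f1_score(tp=80, fp=10, fn=20))
```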

4.3. Manual BLOND-50 Event Annotation

Every performance benchmark needs reference information to enable comparisons. For the event detection evaluation, an event ground truth including the exact temporal position of each appliance event is necessary. For the BLUED dataset, the appliance events are already provided. For BLOND-50, the appliance events and the corresponding measurement system, circuit, and socket number need to be acquired.

Figure 8. Annotation tool for BLOND-50 event ground truth annotation. The annotating person specifies the date and measurement system to view the corresponding consumption of the day for each of the six sockets. Zooming into the time series plot allows for a precise event annotation.

To label the data with a ground truth, we used a self-designed annotation tool that allows manual annotation in the per-appliance subset of BLOND-50 (see Figure 8). The annotating person observes the data of one measurement system instance and all six sockets for one day per screen. The two appliance event constraints (power rise/fall of 30 W for a minimum time span of 5 s) are communicated to the annotating person to ensure consistent events. In addition, the annotating person is instructed to consider only obvious appliance ON and OFF events. Transients that fulfill the event constraints but are not obvious switch ON or OFF events are ignored.

The event ground truth for both BLUED and BLOND-50 originates from visual time series observation by humans. Therefore, the experimental evaluations in this paper are not performed on the (non-existent) absolute truth but on subjectively chosen time series segments that always contain an individual degree of uncertainty. Since neither a fixed event ground truth nor an appliance event definition is prescribed, and the goal is to retrieve an appliance event model declaratively from user-chosen examples, this uncertainty in the human observation does not play a role. The manually annotated events, as well as the corresponding annotation tool for MATLAB, can be downloaded at the following link. (Footnote 1: Due to the double-blind review, the link will be available in the camera-ready version or can be requested via the program chair.)

5. Results

To ensure a consistent evaluation pipeline, we use the best parameters or settings from the previous steps. For example, the evaluation of the normalization method is done with the best-performing feature from the feature evaluation. For all experiments, a search window step size of 30 periods was used. Due to the nature of the algorithms, multiple events occurring within a 5 s window (SCP violation) may be recognized as one event (see Figure 9). That circumstance causes a small number of false positives. The goal of all experiments is to find all ground truth labeled events (true positives) while keeping the misclassifications (false positives) to a minimum.

Figure 9. A BLUED scatter plot of chronologically ordered events, with the distance to the next event on the y-axis. The three star-shaped clusters below the 5 s horizon are caused by the printer appliance. The detection rate drops below the 5 s horizon, causing more false negatives, because two events within 5 s are recognized as one event.

Precision, Recall, and F-Score are the most relevant performance metrics for event detection algorithms. These normalized metrics allow a general performance conclusion based on the number of correctly detected (true positives), incorrectly detected (false positives), and undetected (false negatives) events. The F-Score is the harmonic mean of Precision and Recall: it rises toward 1 as the number of true positives increases and the numbers of false positives and false negatives decrease. Because it combines both metrics, it is the preferred performance metric in the following evaluations.
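
As a worked example of these metrics, the following sketch recomputes the scores from the BLUED event counts reported in Section 5.3 (1 577 ground truth events; 1 170 true positives with 490 false positives for classical training; 1 175 true positives with 260 false positives for adaptive training). The function name `detection_scores` is an assumption for illustration, and the false negatives are taken to be the ground truth events that were not found.

```python
def detection_scores(tp, fp, fn):
    """Return (precision, recall, F-Score) from raw detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


total_events = 1577  # ground truth events in BLUED

# Counts as reported in Section 5.3; fn = events that were not detected.
classical = detection_scores(tp=1170, fp=490, fn=total_events - 1170)
adaptive = detection_scores(tp=1175, fp=260, fn=total_events - 1175)

print([round(v, 2) for v in classical])
print([round(v, 2) for v in adaptive])
```

Rounded to two decimals, the F-Scores come out at 0.72 and 0.78, matching the values stated in the text.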

5.1. Features

For our first experiment, we implemented the event detection, using adaptive-training and 87 nearest neighbors for the K-NN classifier. These values seemed promising in pre-executed experiments. The highest performance could be achieved with features that are based on the CUSUM (see Table 2). The CUSUM has already been used for event detection with promising results by Trung et al. (2014). Since the current and CUSUM segments are similar (see Figure 6), we expected comparable results. A closer look at the segments reveals that the mean event step of the CUSUM segments is broader and more obvious due to the power neutrality of the CUSUM. We assume that this power neutrality leads to a more distinct event model and an improved detection performance. The performance on BLOND-50 supports these assumptions with a similar trend in the results.

Events that have a previous current of near zero are always ON-events, which are easily detectable in the per-appliance measurements (BLOND-50) but not in the case of concurrent running appliances of aggregated measurements (BLUED). The features ADM, SPF, and Current could therefore not be applied to the BLOND-50 dataset due to their strong dependence on the appliance power in combination with the single appliance measurements which would influence the results in an invalid way.

Feature    BLUED                  BLOND-50
           Prec.  Rec.   F-Sc.    Prec.  Rec.   F-Sc.
CUSUM      0.81   0.75   0.78     0.22   0.98   0.36
CUSUM      0.80   0.75   0.78     0.23   0.98   0.38
Current    0.88   0.38   0.53     -      -      -
ADM        0.88   0.38   0.53     -      -      -
SPF        0.87   0.28   0.43     -      -      -
Current    0.20   0.33   0.25     0.18   0.83   0.29
Table 2. Feature Results for BLUED and BLOND-50

5.2. Normalization

To prevent undesired feature weighting, feature normalization needs to be applied, especially when the value ranges of the feature dimensions differ strongly. There are two common ways to normalize the feature space. The first is min-max scaling, which ensures that all dimensions lie in the range [-1, 1]; the second is standardization, which ensures that the standard deviation of each dimension is exactly 1.
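
A minimal sketch of both normalizations follows. The parameters are fitted on the training samples only and then reused for the test samples, as described in Section 4. The function names and toy values are assumptions for illustration; the standardization shown here also subtracts the mean, which is the common convention but is not spelled out in the text.

```python
from statistics import mean, pstdev


def fit_min_max(rows):
    """Per-dimension (min, max) of the training samples."""
    return [(min(col), max(col)) for col in zip(*rows)]


def apply_min_max(rows, params):
    """Scale each dimension to [-1, 1] using the stored training (min, max)."""
    return [
        [2 * (v - lo) / (hi - lo) - 1 if hi > lo else 0.0
         for v, (lo, hi) in zip(row, params)]
        for row in rows
    ]


def fit_standard(rows):
    """Per-dimension (mean, standard deviation) of the training samples."""
    return [(mean(col), pstdev(col)) for col in zip(*rows)]


def apply_standard(rows, params):
    """Shift to zero mean and scale to unit standard deviation."""
    return [
        [(v - mu) / sd if sd > 0 else 0.0 for v, (mu, sd) in zip(row, params)]
        for row in rows
    ]


# Toy feature vectors with two dimensions of very different value ranges.
train = [[0.0, 10.0], [2.0, 30.0], [4.0, 20.0]]
test = [[1.0, 25.0]]

mm_params = fit_min_max(train)
train_mm = apply_min_max(train, mm_params)
test_mm = apply_min_max(test, mm_params)

std_params = fit_standard(train)
train_std = apply_standard(train, std_params)
```

Test samples that fall outside the training range end up outside [-1, 1] after min-max scaling, because the stored training parameters are applied unchanged.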

Norm       BLUED                  BLOND-50
           Prec.  Rec.   F-Sc.    Prec.  Rec.   F-Sc.
None       0.82   0.74   0.78     0.22   0.98   0.36
MinMax     0.82   0.75   0.78     0.23   0.97   0.37
Variance   0.82   0.72   0.77     0.24   0.96   0.38
Table 3. Normalization Results

The min-max normalization performs best in our experiments on BLUED but also shows that the normalization itself does not influence the performance significantly (see Table 3). For BLOND-50, the best result could be achieved with a variance normalization. However, also here, the performance results remain quite stable. This means that the different value ranges of the feature space dimensions do not add any significant weighting. This is most likely caused by a similar order of magnitude in the value range across the individual feature space dimensions. The fact that the features are based on time series segments, and therefore share the same value range, affirms the low variations in the performance results.

5.3. Training Method

The two previously introduced training methods (classical and adaptive training) are being evaluated. The best result for the multivariate event detection (without adaptive training) allows detection of 1 170 out of 1 577 appliance events from BLUED with 490 false positives and a corresponding F-Score of 0.72. This result was obtained with 30 periods of step-size and the K-NN classifier with K=301.

Training      K-NN                   SVM
              Prec.  Rec.   F-Sc.    Prec.  Rec.   F-Sc.
classical     0.13   0.99   0.24     0.12   0.99   0.21
adaptive      0.22   0.98   0.36     0.28   0.94   0.43
adaptive 3x   0.45   0.87   0.59     0.55   0.85   0.67
adaptive 5x   0.53   0.85   0.65     0.56   0.77   0.65
Table 4. Adaptive Training Improvement on BLOND-50

All experiments for the adaptive training show a significant absolute improvement of the event detection performance, +0.14 on average for the F-Score on the BLUED dataset (see Figure 11). The individual improvements vary slightly. The primary benefit of the adaptive training is a reduced number of false positives due to improvements in the non-event class. The best result for BLUED was obtained with 1 175 true positives and an F-Score of 0.78 by using K=137 for the K-NN classifier, a min-max normalization, and one adaptive training round. The number of false positives was reduced to 260. A significant rise of true positives was not expected and did not occur in most experiments with adaptive training.

Figure 10. The first plot shows the non-event class represented only by the implicitly-known non-event segments. The second plot shows the non-event class including the false positives from the adaptive training. The increased diversity due to the false positives is clearly visible. The images are retrieved from non-events of the first 2 weeks in 2016-11 of the BLOND-50 dataset without (first plot) and with (second plot) one adaptive training run.

The largest improvement was observed when applying three rounds of adaptive training to the event detection on the BLOND-50 dataset. Since event detection on this dataset produces many false positives due to the high number of SMPS-driven appliances, the adaptive training reduced the number of false positives from 19 463 to 2 297, a reduction by a factor of more than eight. An expected side effect of this large improvement is a noticeable, but still small, decrease in true positives and recall (see Table 4).

Figure 11. The individual performance improvements by using the adaptive training. The bars show the achieved event detection F-Score for different K of the K-NN classifier on the BLUED dataset.

Dataset    Feature  Norm      Train    Class  Param         F-Sc.
BLUED      CUSUM    MinMax    adap 1x  KNN    K=137         0.78
BLOND-50   CUSUM    Variance  adap 3x  SVM    C/G 128/512   0.67
Table 5. Overall best results on BLUED and BLOND-50

Using the adaptive training to augment the training set with false positive samples, we were able to reduce the final number of false positives during testing. We conclude that the classifier learns the not explicitly definable heterogeneous model of a non-event by adding the false positives of the training run (see Figure 10).

Figure 12. The three plots show the most prominent reasons for false positive events in BLOND-50: a laptop that produces event-like patterns (first plot), a faulty monitor that immediately goes OFF after switching ON (second plot), and a desktop computer that produces event-like patterns due to CPU load changes (third plot). The colored event markers show the false positives that stem from the classical (red) and adaptive (5x) training method (green).

5.4. Classification


Since the event detection performance varies unexpectedly strongly with the number of neighbors for the K-NN classifier, we decided to evaluate the performance of eight different values of K. The best general K in our experiments was 301 with classical training, while it was 137 when applying the adaptive training (see Figure 11). For BLOND-50, the best result with K-NN was achieved by using five rounds of adaptive training (see Figure 12).
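
To illustrate why the choice of K can change the outcome, the following sketch applies a plain majority-vote K-NN to toy data. It is an assumption-laden illustration rather than the classifier used in the experiments: the function name `knn_predict`, the one-dimensional features, and the class sizes are invented for this example.

```python
from collections import Counter


def knn_predict(sample, train_x, train_y, k):
    """Majority vote among the k nearest training samples (squared Euclidean distance)."""
    order = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(sample, train_x[i])),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy training set: two event samples (1) and five non-event samples (0).
train_x = [[1.0], [1.1], [0.0], [0.1], [0.2], [0.3], [0.4]]
train_y = [1, 1, 0, 0, 0, 0, 0]

# The same window is classified differently depending on K.
predictions = {k: knn_predict([0.9], train_x, train_y, k) for k in (1, 3, 7)}
print(predictions)
```

With small K, the window close to the event samples is labeled as an event; once K exceeds the number of nearby event samples, the larger non-event class outvotes them.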


The best result we could achieve with the SVM classifier on the BLUED dataset was an F-Score of 0.72, considerably lower than the 0.78 of the K-NN classifier. The reason is almost twice the number of false positives, even after adaptive training. The number of true positives, 1 112, lies only slightly below the best result for K-NN. For BLOND-50, the best SVM result is an F-Score of about 0.67, obtained with three adaptive training rounds. The optimal SVM hyper-parameters were retrieved with the grid search algorithm provided in the LIBSVM package of Chang and Lin (2011) and were found at C=128 and Gamma=512 for BLUED and C=1 and Gamma=0.0078 for BLOND-50.
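
The reported values are powers of two (128 = 2^7, 512 = 2^9, 0.0078 ≈ 2^-7), which is consistent with a search over an exponentially spaced grid. The sketch below shows such a search in generic form. It does not call LIBSVM: `grid_search`, the exponent ranges, and `toy_score` are assumptions for illustration, and the toy score is constructed to peak at C=128 and Gamma=512 purely to mirror the values above, not derived from data.

```python
import math
from itertools import product


def grid_search(score_fn, log2_c_range, log2_gamma_range):
    """Return (best_score, C, gamma) over an exponentially spaced grid."""
    best = None
    for log2_c, log2_g in product(log2_c_range, log2_gamma_range):
        c, gamma = 2.0 ** log2_c, 2.0 ** log2_g
        score = score_fn(c, gamma)
        if best is None or score > best[0]:
            best = (score, c, gamma)
    return best


def toy_score(c, gamma):
    """Hypothetical stand-in for a cross-validated F-Score.

    A real run would train and validate an SVM for every (C, gamma) pair.
    """
    return -abs(math.log2(c) - 7) - abs(math.log2(gamma) - 9)


# Assumed exponent ranges: C in 2^-5 .. 2^15, gamma in 2^-15 .. 2^9, step 2.
best_score, best_c, best_gamma = grid_search(
    toy_score, range(-5, 16, 2), range(-15, 10, 2)
)
print(best_c, best_gamma)
```

Searching exponents instead of raw values covers several orders of magnitude with few evaluations, which matters because every grid point requires a full cross-validated training run.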

6. Conclusions

We propose a multivariate event detection that learns from a user-designed event model. The event model stems from event and non-event segments of the training area and allows a user-relevant event detection. The challenge of distinguishing between relevant and irrelevant events is tackled by multiple runs of the introduced adaptive training process. For events of the BLUED dataset, an F-Score of 0.78 could be achieved, which lies in the range of the state of the art. For BLOND-50, the adaptive training reduces the number of false positives by more than a factor of eight. We could achieve an F-Score of 0.67, which means that a found event is more likely relevant than irrelevant for the user.

The multivariate event detection in combination with the introduced way of adaptive training is an appropriate algorithm for the increasing number of SMPS-driven appliances in residential and office environments.

We would like to thank Hardani Maulana for the thorough event annotation on the BLOND-50 dataset. This research was partially funded by the Alexander von Humboldt Foundation established by the government of the Federal Republic of Germany and was supported by the Federal Ministry for Economic Affairs and Energy on the basis of a decision by the German Bundestag.


  • Abdelgayed et al. (2018) T. S. Abdelgayed, W. G. Morsi, and T. S. Sidhu. 2018. Fault Detection and Classification Based on Co-training of Semisupervised Machine Learning. IEEE Transactions on Industrial Electronics 65, 2 (Feb. 2018), 1595–1605.
  • Ajmera et al. (2004) J. Ajmera, I. McCowan, and H. Bourlard. 2004. Robust speaker change detection. IEEE Signal Processing Letters 11, 8 (Aug. 2004), 649–651.
  • Anderson et al. (2012b) Kyle Anderson, Adrian Ocneanu, Diego Benitez, Derrick Carlson, Anthony Rowe, and Mario Berges. 2012b. BLUED: A Fully Labeled Public Dataset for Event-Based Non-Intrusive Load Monitoring Research. In Proceedings of the 2nd KDD Workshop on Data Mining Applications in Sustainability (SustKDD) (2012-08). ACM.
  • Anderson et al. (2012a) K. D. Anderson, M. E. Bergés, A. Ocneanu, D. Benitez, and J. M. F. Moura. 2012a. Event detection for Non Intrusive load monitoring. In IECON 2012 - 38th Annual Conference on IEEE Industrial Electronics Society. 3312–3317.
  • Armel et al. (2013) Kathleen Carrie Armel, Abhay Gupta, Gireesh Shrimali, and Adrian Albert. 2013. Is disaggregation the holy grail of energy efficiency? The case of electricity. Energy Policy 52 (1 2013), 213–234.
  • Azzini et al. (2014) H. A. D. Azzini, R. Torquato, and L. C. P. da Silva. 2014. Event detection methods for nonintrusive load monitoring. In 2014 IEEE PES General Meeting. 1–5.
  • Baets et al. (2016) Leen De Baets, Joeri Ruyssinck, Dirk Deschrijver, and Tom Dhaene. 2016. Event detection in NILM using cepstrum smoothing. In 3rd International Workshop on Non-Intrusive Load Monitoring. 1–4.
  • Baranski and Voss (2004) M. Baranski and J. Voss. 2004. Detecting patterns of appliances from total load data using a dynamic programming approach. In Data Mining, 2004. ICDM ’04. Fourth IEEE International Conference on. 327–330.
  • Barsim et al. (2014) Karim Said Barsim, Roman Streubel, and Bin Yang. 2014. Unsupervised adaptive event detection for building-level energy disaggregation. Proceedings of Power and Energy Student Summit (PESS), Stuttgart, Germany (2014).
  • Batra et al. (2014) Nipun Batra, Jack Kelly, Oliver Parson, Haimonti Dutta, William Knottenbelt, Alex Rogers, Amarjeet Singh, and Mani Srivastava. 2014. NILMTK: An Open Source Toolkit for Non-intrusive Load Monitoring. In Proceedings of the 5th International Conference on Future Energy Systems. ACM, New York, NY, USA, 265–276.
  • Berges et al. (2011) Mario Berges, Ethan Goldman, H. Scott Matthews, Lucio Soibelman, and Kyle Anderson. 2011. User-Centered Nonintrusive Electricity Load Monitoring for Residential Buildings. Journal of Computing in Civil Engineering 25, 6 (2011), 471–480.
  • Berriri et al. (2012) H. Berriri, M. W. Naouar, and I. Slama-Belkhodja. 2012. Easy and Fast Sensor Fault Detection and Isolation Algorithm for Electrical Drives. IEEE Transactions on Power Electronics 27, 2 (Feb. 2012), 490–499.
  • Chang and Lin (2011) Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology 2, 3 (April 2011), 27:1–27:27. Software available at
  • Cox et al. (2006) R. Cox, S. B. Leeb, S. R. Shaw, and L. K. Norford. 2006. Transient event detection for nonintrusive load monitoring and demand side management using voltage distortion. In Twenty-First Annual IEEE Applied Power Electronics Conference and Exposition, 2006. APEC ’06. 7.
  • Girmay and Camarda (2016) Awet Abraha Girmay and Christian Camarda. 2016. Simple event detection and disaggregation approach for residential energy estimation. In Proceedings of the 3rd International Workshop on Non-Intrusive Load Monitoring (NILM).
  • Géron (2017) Aurélien Géron. 2017. Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (1st ed.). O’Reilly Media.
  • Hart (1992) George William Hart. 1992. Nonintrusive Appliance Load Monitoring. Proc. IEEE 80, 12 (1992), 1870–1891.
  • Houidi et al. (2018) Sarra Houidi, F. Auger, Houda Ben Attia Sethom, Laurence Miègeville, Dominique Fourer, and Xiao Jiang. 2018. Statistical assessment of abrupt change detectors for non-intrusive load monitoring. 2018 IEEE International Conference on Industrial Technology (ICIT) (2018), 1314–1319.
  • Ian Goodfellow (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. (2016). Book in preparation for MIT Press.
  • Jin et al. (2011a) Y. Jin, E. Tebekaemi, M. Berges, and L. Soibelman. 2011a. A Time-Frequency Approach for Event Detection in Non-Intrusive Load Monitoring.
  • Jin et al. (2011b) Y. Jin, E. Tebekaemi, M. Berges, and L. Soibelman. 2011b. Robust adaptive event detection in non-intrusive load monitoring for energy aware smart facilities. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 4340–4343.
  • Jin et al. (2013) Yuanwei Jin, Eniye Telebakemi, and Mario Berges. 2013. A Time-Frequency Approach for Event Detection in Non-Intrusive Load Monitoring. In Proc. of SPIE Vol, Vol. 8050. 80501U–1.
  • Kahl et al. (2017) Matthias Kahl, Thomas Kriechbaumer, Anwar Ul Haq, and Hans-Arno Jacobsen. 2017. Appliance Classification Across Multiple High Frequency Energy Datasets. In 2017 IEEE International Conference on Smart Grid Communications (SmartGridComm).
  • Kelly and Knottenbelt (2016) Jack Kelly and William Knottenbelt. 2016. Does disaggregated electricity feedback reduce domestic electricity consumption? A systematic review of the literature. CoRR (2016).
  • Kriechbaumer and Jacobsen (2018) Thomas Kriechbaumer and Hans-Arno Jacobsen. 2018. BLOND, a building-level office environment dataset of typical electrical appliances. Scientific Data, an open-access NatureResearch journal 5, 180048 (2018).
  • Leeb and Kirtley (1993) Steven B. Leeb and James L. Kirtley. 1993. A multiscale transient event detector for nonintrusive load monitoring. In Industrial Electronics, Control, and Instrumentation, 1993. Proceedings of the IECON ’93., International Conference on. 354–359 vol.1.
  • Leeb et al. (1995) Steven B. Leeb, Steven R. Shaw, and James L. Kirtley. 1995. Transient Event Detection in Spectral Envelope Estimates for Nonintrusive Load Monitoring. IEEE Transactions on Power Delivery 10, 3 (7 1995), 1200–1210.
  • Liang et al. (2010) Jian Liang, Simon KK Ng, Gail Kendall, and John WM Cheng. 2010. Load Signature Study–Part I: Basic Concept, Structure, and Methodology. Power Delivery, IEEE Transactions on 25, 2 (2010), 551–560.
  • Makonin (2016) Stephen Makonin. 2016. Investigating the Switch Continuity Principle Assumed in Non-Intrusive Load Monitoring (NILM). In 2016 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE) (2016). Institute of Electrical and Electronics Engineers (IEEE).
  • Meziane et al. (2017) M. Nait Meziane, P. Ravier, G. Lamarque, J. C. Le Bunetel, and Y. Raingeaud. 2017. High accuracy event detection for Non-Intrusive Load Monitoring. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2452–2456.
  • Orji et al. (2015) U. A. Orji, Z. Remscrim, C. Schantz, J. Donnal, J. Paris, M. Gillman, K. Surakitbovorn, S. B. Leeb, and J. L. Kirtley. 2015. Non-intrusive induction motor speed detection. IET Electric Power Applications 9, 5 (2015), 388–396.
  • Peeters (2004) G. Peeters. 2004. A large set of audio features for sound description (similarity and classification) in the CUIDADO project. Tech. Rep. IRCAM.
  • Shaw et al. (2008) S. R. Shaw, S. B. Leeb, L. K. Norford, and R. W. Cox. 2008. Nonintrusive Load Monitoring and Diagnostics in Power Systems. IEEE Transactions on Instrumentation and Measurement 57, 7 (July 2008), 1445–1454.
  • Trung et al. (2014) Kien Nguyen Trung, Eric Dekneuvel, Benjamin Nicolle, Olivier Zammit, Cuong Nguyen Van, and Gilles Jacquemod. 2014. Event detection and disaggregation algorithms for nialm system. In the 2nd International Non-Intrusive Load Monitoring (NILM) Workshop.
  • Wild et al. (2015) B. Wild, K. S. Barsim, and B. Yang. 2015. A new unsupervised event detector for non-intrusive load monitoring. In 2015 IEEE Global Conference on Signal and Information Processing (GlobalSIP). 73–77.