Electricity Theft Detection with self-attention

02/14/2020 ∙ by Paulo Finardi, et al. ∙ 0

In this work we propose a novel self-attention mechanism model to address electricity theft detection on an imbalanced realistic dataset that presents daily electricity consumption provided by the State Grid Corporation of China. Our key contribution is the introduction of a multi-head self-attention mechanism concatenated with dilated convolutions and unified by a convolution of kernel size 1. Moreover, we introduce a binary input channel (Binary Mask) to identify the positions of the missing values, allowing the network to learn how to deal with them. Our model achieves an AUC of 0.926, an improvement of more than 17% over previous baseline work. The code is available on GitHub at https://github.com/neuralmind-ai/electricity-theft-detection-with-self-attention.




1 Introduction

According to the World Bank, most of the world's population has access to electrical energy, which is made available to people via a complex transmission and distribution system that interconnects power plants to consumers. In the operation of this system two types of losses are expected: technical and non-technical. Technical Losses (TL) occur due to power dissipation in the materials that compose the electrical power system itself, such as cables, connectors, and power transformers. Non-Technical Losses (NTL) represent energy losses due to energy theft and errors of billing or measurement [1].

According to the Electricity Distribution Loss Report published by ANEEL (the Brazilian National Electricity Agency) [2], NTLs comprised a substantial share of all energy injected into the Brazilian electrical power system. These losses impact consumers with more expensive energy bills, electricity distribution companies with reduced revenues, and the reliability of the electrical power system [3]. Part of the problem of tackling NTLs is dealing with the metering infrastructure itself, which is pointed out as being the most faulty subsystem [3]. Recent advances in the Internet of Things (IoT) have made it possible to address these problems through the adoption of Advanced Metering Infrastructures (AMIs), which can provide consumption data with high temporal resolution, thus reducing losses related to billing and metering issues. Together with AMIs, artificial intelligence algorithms can play an important role in detecting NTLs due to electricity theft in power distribution systems [4, 5].

In this work, we develop a predictive method using supervised learning with deep learning methodologies to identify fraudulent consumer units. We train and evaluate our models on a dataset of daily electricity consumption spanning from January 2014 to October 2016. The work brings several improvements over the previous state-of-the-art method [5], such as the use of Quantile normalization on the original data, the use of an additional binary input channel to deal with missing values, and the use of an attention mechanism.

Our results show that a model with attention mechanism layers delivers a substantial increase in the Area Under the Curve (AUC) score when compared to the baseline. The combination of this model with the binary input channel (Binary Mask) and Quantile normalization improved both the AUC and the F1 score.

The article is organized as follows: in section 2 we present an overview of related works; in section 3 we present the problem and the methodology adopted; in section 4 we detail the proposed solution and the metrics used to evaluate the performance of the algorithms; in section 5 we describe the data processing steps; section 6 presents the results obtained; and finally, in section 7 we describe our conclusions and future perspectives.

2 Related Work

The application of deep learning to NTL detection has increased in recent years. Several approaches to the problem have been proposed, and the results obtained are significantly superior to those from rule-based traditional methods [1, 6, 5]. However, one of the main difficulties in developing data-driven models for NTL detection in the electricity industry is the lack of publicly available data. Energy consumption is sensitive data, and due to privacy and security issues the vast majority of electricity distribution companies do not share their data. One way to circumvent this problem is to generate synthetic data. For instance, Liu et al. [4] inject artificial electricity thefts into a database of regular consumers. Although useful, the generation of synthetic data may lead to the unintentional introduction of bias or the misrepresentation of real situations.

Zheng et al. [5] present a study using a dataset with real electricity theft data provided by the State Grid Corporation of China (SGCC). This study, which has become a baseline for subsequent works, introduces a neural network architecture based on a wide (dense) and a deep (convolutional) component trained together. Moreover, their proposed reshaping of the 1D electricity consumption sequences into a 2D format has provided a straightforward way to explore neighborhood correlations with 2D convolutional neural networks (CNNs). Hasan et al. also use real electricity theft data; they propose a combination of CNN and LSTM (Long Short-Term Memory) architectures in order to exploit the time-series nature of the electricity consumption data. Nonetheless, satisfactory results were achieved only after applying the synthetic minority over-sampling technique (SMOTE) [8] to counteract the imbalance of the dataset.

In Li et al. [9], a combination of CNN and the Random Forest (RF) algorithm is applied to a dataset of over 5000 residential and business consumers provided by Electric Ireland and the Sustainable Energy Authority of Ireland (SEAI), with thieves synthetically injected. Also motivated by the data reshaping introduced by Zheng et al. [5], the authors reshaped the electricity consumption data into a 2D format, allowing more generalized feature extraction by the CNN.

3 Problem Analysis

Our task is to detect fraud in electricity consumption. The dataset is a collection of real electricity consumption samples released by the State Grid Corporation of China (SGCC). The data is a sequence of daily electricity consumption readings, which we characterize as a time series. The basic assumption guiding time-series analysis is that there is a more or less constant causal system, related to time, which influenced the data in the past and may continue to do so in the future. The purpose of the analysis is to identify nonrandom patterns in daily electricity consumption behavior that allow more accurate predictions. See section 5 for a time-series analysis and autocorrelations for the problem at hand.

3.1 Data Methodology

An important contribution of Zheng et al. [5] is the transformation of the one-dimensional data into a two-dimensional format (Figure 1). The 2D format allows the exploration of periodicity and neighborhood characteristics with computer vision models, such as 2D convolutional neural networks.

Figure 1: Data processing methodology.

3.2 Missing data

Missing data is a ubiquitous problem. In the literature we find two common practices to deal with it. One approach is to delete the incomplete readings from the dataset; however, this may discard valuable information. An alternative is to estimate the missing value by interpolation, or with the median or mean of the feature. Although these techniques have proven effective, they impose strong assumptions about the nature of the missing data and hence might bias the predictive models negatively. In addition to these methods, attempts to approximate the missing data using genetic algorithms, simulated annealing and particle swarm optimization have also been proposed [11]. However, when dealing with large datasets such techniques can be prohibitively slow.

To deal with the missing values, we create a binary mask as an additional input channel as follows: first, we identify the indices of all missing data and create a binary mask in which the positions of missing data receive the value 1 and all remaining positions receive 0. We call this mask the Binary Mask. The missing data in the values channel receives a value of 0. These two channels are the input to a 2D CNN. See Figure 2 for an illustration of our method.

Figure 2: Top left: raw data in 2D format; Top right: missing entries filled with 0's; Bottom left: binary mask; Bottom right: final data with 2 channels.
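The construction of the two input channels can be sketched as follows (a minimal NumPy illustration, not the released implementation; the array shapes are only for the example):

```python
import numpy as np

def add_binary_mask(readings: np.ndarray) -> np.ndarray:
    """Stack a binary-mask channel onto a 2D consumption matrix.

    `readings` is a (weeks x days) array where missing values are NaN.
    Returns a (2, weeks, days) array: channel 0 holds the readings with
    missing entries filled with 0, channel 1 marks missing positions with 1.
    """
    mask = np.isnan(readings).astype(readings.dtype)  # 1 where data is missing
    values = np.nan_to_num(readings, nan=0.0)         # fill missing with 0
    return np.stack([values, mask], axis=0)

# toy example: 2 weeks x 7 days with two missing readings
week = np.array([[1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0],
                 [8.0, np.nan, 10.0, 11.0, 12.0, 13.0, 14.0]])
x = add_binary_mask(week)
```

This keeps the imputed zeros distinguishable from genuine zero consumption, so the network can learn to treat them differently.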

4 Architecture overview

Image recognition is a classic classification problem on which CNNs have a history of high efficacy [12, 13]. As our data input resembles an image, we developed two models to address the problem, both using 2D convolutions: a CNN and a multi-head attention model. Attention models are used in many Natural Language Processing (NLP) tasks and have recently been adapted to vision problems.
4.1 CNN Architecture

Our CNN model has three layers of 2D convolutional operators: the first layer takes the two input channels (consumption values and Binary Mask); the second layer is followed by a non-linear PReLU activation [15]; the third and final convolutional layer uses a dilated kernel with a stride factor and is also followed by a PReLU activation. All convolutional layers have kernel size 3. The convolutional output is flattened and connected to a fully connected layer; Figure 3 summarizes the model.

Dilation is a practice to increase the receptive field using sparse filters [16]. The convolution itself is modified to use the filter parameters in a sparse way, skipping a fixed number of features along both dimensions at regular intervals; despite the sparsity, dilated convolutions do not lose resolution. The stride, or sub-sampling factor, as mentioned in [17], is the step of the convolution, used to reduce the overlap of receptive fields and the spatial dimensions; it can be seen as an alternative to pooling layers.
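The effect of dilation and stride on the receptive field follows standard convolution arithmetic, which a small helper makes concrete (an illustration, not code from the paper):

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    # A k-tap filter with dilation d spans k + (k - 1) * (d - 1) input positions.
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def conv_output_length(n: int, kernel_size: int,
                       stride: int = 1, dilation: int = 1) -> int:
    # Output length along one axis of a valid (no padding) convolution.
    span = effective_kernel_size(kernel_size, dilation)
    return (n - span) // stride + 1

# A 3x3 kernel with dilation 2 sees a 5x5 neighborhood at no extra parameter cost.
wide_view = effective_kernel_size(3, 2)
```

For example, along an axis of 7 days, a kernel of size 3 with dilation 2 produces 3 outputs, while the same kernel with stride 2 (no dilation) also produces 3 outputs but with a smaller receptive field.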

Figure 3: CNN model.

4.2 Multi-heads Attention Architecture

Attention mechanisms have shown great ability to solve many kinds of problems, ranging from NLP tasks [18] to computer vision [19] and tabular data [20]. Inspired by these recent advances, we propose a novel neural network that takes advantage of both attention mechanisms and convolutional layers, whose outputs are concatenated and unified through a convolution of kernel size 1. We start by describing the inner workings of the convolutional part.

Convolutional Layer: Our convolutional layer is composed of two parts: one performs standard convolutions over the inputs, while the other applies a dilated convolution. Both parts use the same kernel size, and their results are concatenated to form a single output.

Attention Mechanism: Our attention mechanism differs from standard approaches by treating the channels of the input as the heads and mapping them to another set of attention heads. Given an input of shape (h, n, d), we first transpose the first two dimensions and flatten the result into a matrix X of shape (n, h·d). Let W_Q, W_K, W_V be learnable linear transformations of shape (h·d) × (h'·d), where h is the number of channels or heads coming in, n is the size of the sequence, d is the dimension of every element in the sequence, and h' is the number of output heads or channels. We start by computing Q = X W_Q, K = X W_K and V = X W_V. Second, we map Q, K and V back to a tri-dimensional shape (h', n, d) by unflattening and transposing; finally, we compute the output of the attention layer per head as follows:

Attention(Q, K, V) = softmax(Q Kᵀ / √d) V.

Summarizing, given an input we perform the mapping (h, n, d) → (h', n, d).

This allows for consistency of the output shape between the attention and convolutional layers.
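The channel-to-head mapping described above can be sketched in NumPy; this is our reading of the mechanism, assuming standard scaled dot-product attention within each output head:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def channel_head_attention(x, w_q, w_k, w_v, h_out):
    """Map h input channels to h_out attention heads.

    x: (h, n, d) input; w_*: (h*d, h_out*d) learnable projections.
    Returns an (h_out, n, d) tensor, matching the convolutional branch.
    """
    h, n, d = x.shape
    flat = x.transpose(1, 0, 2).reshape(n, h * d)            # (n, h*d)
    def heads(w):                                            # project, split into heads
        return (flat @ w).reshape(n, h_out, d).transpose(1, 0, 2)
    q, k, v = heads(w_q), heads(w_k), heads(w_v)             # each (h_out, n, d)
    scores = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))  # (h_out, n, n)
    return scores @ v                                        # (h_out, n, d)

rng = np.random.default_rng(0)
h, n, d, h_out = 2, 7, 4, 8
x = rng.normal(size=(h, n, d))
w_q, w_k, w_v = (rng.normal(size=(h * d, h_out * d)) for _ in range(3))
out = channel_head_attention(x, w_q, w_k, w_v, h_out)
```

Note how the 2-channel input (values and Binary Mask) can be expanded to an arbitrary number of output heads while the sequence length and element dimension are preserved.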

Unification: After the input is processed by both the attention and convolutional layers, we concatenate the results into a single matrix and unify it through a convolution of kernel size 1, followed by Layer Norm and a PReLU activation function. We call this a Hybrid Multi-Head Attention/Dilated Convolution layer.

Classifier: Finally, the output of a sequence of these hybrid layers is flattened and fed to a linear feedforward neural network that predicts the input class.

Our final architecture is composed of two hybrid layers. In the first, the attention part maps the input heads to a larger set of output heads, while the convolutional part receives the 2-channel 2D input and outputs a multi-channel matrix of the same spatial size; the unification is fed to a second hybrid layer with the same dimensions. Lastly, a one-hidden-layer dense neural network with PReLU activation classifies the input. Figure 4 shows the model.

Figure 4: Hybrid Multi-Head Attention/Dilated Convolution.
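The unification step reduces to a per-position linear map over the concatenated channels, since a convolution of kernel size 1 mixes channels without touching spatial neighborhoods. A minimal sketch (illustrative shapes, not the released implementation):

```python
import numpy as np

def unify_1x1(branches, w, b):
    """Concatenate branch outputs along the channel axis and merge them
    with a 1x1 convolution (a per-position linear map over channels).

    branches: list of (c_i, H, W) arrays; w: (c_out, sum(c_i)); b: (c_out,)
    """
    x = np.concatenate(branches, axis=0)    # (c_in, H, W)
    c_in, h_, w_ = x.shape
    flat = x.reshape(c_in, -1)              # each column is one spatial position
    out = w @ flat + b[:, None]             # 1x1 conv == matrix product over channels
    return out.reshape(-1, h_, w_)

rng = np.random.default_rng(1)
attn_out = rng.normal(size=(8, 21, 7))   # attention branch: 8 heads (example sizes)
conv_out = rng.normal(size=(8, 21, 7))   # convolutional branch: 8 channels
w = rng.normal(size=(8, 16))
b = np.zeros(8)
y = unify_1x1([attn_out, conv_out], w, b)
```

Because the 1x1 convolution is linear in the channels, it lets the network learn how much weight to give each branch at every spatial position.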

4.3 Metrics

In this work we evaluate our models with the AUC, which represents the degree of separability of the data, and the ROC curve, which depicts the probability curve created by plotting the true positive rate versus the false positive rate. The AUC is the area under this curve and summarizes the ROC curve in a single value.

We also evaluate performance with the F1 score, which combines precision and recall into a single number indicating the general quality of the model. Besides these metrics, we use the Mean Average Precision (MAP) [21] to measure the effectiveness of information retrieval. To evaluate MAP@N we first order the true labels by the predicted probabilities and consider the subset of top-N probabilities:

MAP@N = ( Σ_{k=1}^{N} P(k) · y_k ) / ( Σ_{k=1}^{N} y_k ),

where P(k) is the precision among the top-k predictions and y_k is the true label of the k-th ranked consumer: y_k = 1 if the consumer is a thief and y_k = 0 otherwise. For the loss function we use cross entropy, a classic choice for classification problems.
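The ranking-based metric can be sketched as follows (our reading of the computation: average precision accumulated at each thief found in the top-N ranking):

```python
def average_precision_at_n(y_true_ranked, n):
    """Average precision over the top-n ranked predictions.

    y_true_ranked: labels (1 = thief, 0 = normal) sorted by predicted
    probability, highest first. Precision@k is accumulated at every
    position k where a true thief appears, then averaged over the hits.
    """
    hits, ap_sum = 0, 0.0
    for k, y in enumerate(y_true_ranked[:n], start=1):
        if y == 1:
            hits += 1
            ap_sum += hits / k      # precision among the top-k predictions
    return ap_sum / hits if hits else 0.0

# toy ranking: thieves at positions 1, 2 and 4 of the sorted predictions
ap = average_precision_at_n([1, 1, 0, 1, 0], 5)
```

A perfect model ranks all thieves first and scores 1.0; every normal customer ranked above a thief lowers the score.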

5 Data

The SGCC data presents the daily consumption of 42 372 consumer units over a time window ranging from January 2014 to October 2016, roughly three years. The data is divided into thieves and normal electricity consumers, where the former comprise approximately 8.55% of the total. The data does not show the date when the fraud occurs. We tested reshaping the data into 2D on both monthly and weekly bases and chose the weekly period, as it revealed stronger correlations between thieves and normal electricity customers.

Due to the granularity of the data it is common to have missing values, and approximately 25% of the entries are missing. Our proposal to handle the missing data was presented in section 3.2. The dataset description is shown in Table 1. The same dataset was analyzed in [5], where the authors used a Wide and Deep architecture [22]; more details about that study are described in section 6.1.

Description Value
Time window 2014/01/01 – 2016/10/31
Normal electricity customers 38 757 approx. 91.5%
Electricity thieves 3 615 approx. 8.55%
Total customers 42 372
Missing data cases approx. 25%
Table 1: Dataset Description

5.1 Data Preprocessing

Data processing is a key element that determines success or failure in many deep learning models. In our analysis, the realistic SGCC data has some particular features, including a significant amount of missing data and a long-tail distribution that produces strong skewness and kurtosis. The missing data is discussed in section 3.2. Regarding atypical data, or outliers, we noticed that most cases occur among normal electricity customers, and we did not remove them to avoid losing useful information. Prior to normalization, we studied the dataset as a time series, since there is a single variable measured at uniform intervals. To evaluate possible correlations and periodicity, two experiments were conducted: (I) we accumulated the electricity consumption over the days of the week (Monday to Sunday) and constructed a correlation matrix between the days of the week for thieves and for normal electricity customers, as illustrated in Figure 5.

Figure 5: Correlation Matrix. Top: Normal Electricity Customers, Bottom: Thieves.


(II) In order to find periodicity and patterns that distinguish the classes, we used the autocorrelation function, which provides the correlation of a time series with its own lagged values (Figure 6). The x axis indicates the time lag being considered, and the y axis is the autocorrelation score, where 1 is the highest possible score.

Figure 6: Autocorrelation of Electricity Consumption. Top: Normal Electricity Customers, Bottom: Thieves.
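The autocorrelation scores can be reproduced with a simple estimator (a standard sample autocorrelation, not code from the paper); a weekly-periodic toy series shows the expected peak at lag 7:

```python
import numpy as np

def autocorrelation(series, max_lag):
    """Sample autocorrelation of a 1D series for lags 0..max_lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()                       # center the series
    var = np.dot(x, x)                     # lag-0 normalizer
    return np.array([np.dot(x[:len(x) - lag], x[lag:]) / var
                     for lag in range(max_lag + 1)])

# a perfectly weekly-periodic toy series (20 weeks) peaks again at lag 7
weekly = np.tile([1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0], 20)
acf = autocorrelation(weekly, 7)
```

For a perfectly periodic series the score at one full period approaches 1, dropping only because the overlapping window shortens with the lag.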

The analysis of Figures 5 and 6 shows differences between thieves and normal electricity customers. In particular, the greater correlation observed between days of the week for the thieves suggests that this feature could be exploited to improve model performance; in other words, thieves exhibit similar behaviour among themselves.

The SGCC data has a phenomenon called heteroscedasticity (non-constant variability) [23], which causes the resulting distribution to be positively asymmetric, or leptokurtic [24]; i.e., there is great variability on the right side of the distribution, which creates a long tail, as shown in Figure 7 (top). This asymmetry can lead to spurious interactions in the deep learning model due to non-constant variations. To deal with this asymmetric distribution we perform a Quantile uniform normalization, provided by [25]. The Quantile uniform transformation is a non-linear function applied to each feature independently. It spreads out the most frequent values over the interval [0, 1]: first, the original values are mapped to an estimate of their cumulative distribution, then these values are spread out over a number of quantiles. A distribution of the processed data is shown in Figure 7 (bottom). One drawback of the Quantile transform is the amount of data required to perform the transformation: as a rule of thumb, creating n quantiles requires a minimum of n samples.

In addition to the Quantile processing, we also tested a Yeo-Johnson power transform [26], but the Quantile transformation produced values that better matched the target interval. We also verified that the Kullback-Leibler divergence (D_KL) [27] to a uniform distribution is minimized. D_KL measures the matching between two distributions and is given by the formula:

D_KL(P ‖ Q) = Σ_x P(x) log( P(x) / Q(x) ),

where P is the distribution of the data transformed by Quantile and Q is the ground truth, in our case a uniform distribution; we are interested in matching P to Q. A lower value means that P and Q are better matched. Table 2 shows the D_KL values before and after the Quantile transformation.
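The divergence values in Table 2 can be computed over histograms as follows (a direct implementation of the formula; the histogram bins are illustrative):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)) for discrete histograms."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                    # terms with P(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# histogram of transformed data vs. the uniform target over 4 bins
uniform = np.full(4, 0.25)
peaked = np.array([0.7, 0.1, 0.1, 0.1])
```

The divergence is zero only when the transformed histogram matches the uniform target exactly, and grows as mass concentrates in a few bins.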

The processed dataset has less kurtosis and skewness, which brings stationarity to the data according to the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test [28]. Namely, the data variance, mean and covariance show more stationary behavior, and the statistical properties do not change over time in the columns where the KPSS test is True (Table 2).

Figure 7: Electrical Consumption Data from 100 samples (in blue). Top: Raw data; Bottom: Data processed by Quantile transformation.
Property Raw data Processed data
Min 0.00 0.00
Max 800003.31 1.00
Mean 6.87 0.40
Std 236.14 0.35
Skewness 2551.62 -0.01
Kurtosis 7170709.11 -1.67
D_KL 15121.81 57.15
KPSS test False: 1016 / True: 19 False: 581 / True: 454
Table 2: Processing data

6 Experiments

In this section we describe the experiments performed in this work. In addition to the two models developed, we also compare our attention model with the Attention Augmented Convolutional Network [19]. To evaluate the proposed treatment of missing data described in section 3.2, we also performed experiments with and without the Binary Mask. All training sessions were performed with different train percentage splits and with stratified k-fold.

6.0.1 Binary Mask Experiment

Using stratified k-fold with the Hybrid Multi-Head Attention/Dilated Convolution model and an 80% training split, we evaluated the difference in performance between the data with and without the Binary Mask. In the non-Binary-Mask configuration, all missing data was filled with the value 0. Table 3 presents the results of this experiment, where the column named Only Quantile refers to the non-Binary-Mask configuration.

6.0.2 Attention Augmented Convolution Network

We implemented the Attention Augmented Convolutional Network algorithm proposed in [19], a self-attention algorithm developed for two-dimensional tasks as an alternative to CNNs. The authors combine features extracted from the convolutional layers with self-attention through concatenation. The experiment was performed with stratified k-fold at different train split sizes. Table 4 shows the results.

6.1 Baselines

Detection of electricity fraud with granular data using deep learning techniques is still rarely found in the literature. The dataset on which this work was developed contains real data, which makes it even rarer. To compare our model with other approaches, we use [5], the work that made the dataset available. Its authors developed a study with the Wide and Deep technique [29]: the Wide component tries to memorize global knowledge, while the CNN layers capture features of the electricity consumption data. These two components combined resulted in good performance as measured by the AUC metric.

6.1.1 Dataset preprocessed with Missing Values Interpolated

Our aim in this experiment is to show that:

  • The Quantile transformation contributes positively to our data preprocessing proposal;

  • The Hybrid Multi-Head Attention/Dilated Convolution outperforms the Wide and Deep model [22] on the same data.

For this, we preprocessed the SGCC dataset with the same interpolation equations as Zheng et al. [5] and trained our model on an 80% split with stratified k-fold. Results are presented in Table 3, in the column named Interpolated Missing Values. With the same dataset configuration as our baseline, we improve all metric scores; the results presented are the average values over all folds at the same epoch. To verify that the Quantile transformation is effective, compare the columns Only Quantile and Interpolated Missing Values in Table 3.

Metric      Interpolated       Only        Quantile +
            Missing Values     Quantile    Binary Mask

AUC         0.840              0.889       0.925
F1 score    0.365              0.504       0.606
MAP@100     0.960              0.972       0.992
MAP@200     0.941              0.961       0.972
Table 3: Binary Mask experiment: all columns were trained with the Hybrid Multi-Head Attention/Dilated Convolution model with train split = 80%.

6.2 Results and Discussion

Table 4 presents the main results of the models developed in this work. Three train splits (50%, 75% and 80%) were tested with stratified k-fold. The Hybrid Multi-Head Attention/Dilated Convolution significantly outperformed the baseline. Moreover, the scores obtained with our two models and with the Attention Augmented Convolutional Network show that the Quantile transformation brings a significant improvement to the data preprocessing. The attention mechanism produces a notable increase in the F1 score. Another distinguishing behaviour is the much faster convergence of the attention model compared with the CNN model: in our tests the CNN needed considerably more epochs to converge than the Hybrid Attention model. Figure 8 presents the evolution of the scores as a function of the epoch for the Hybrid Multi-Head Attention/Dilated Convolution.

Model           Metric      train = 50%   train = 75%   train = 80%

Convolutional   AUC         0.898         0.920         0.922
Neural          F1 score    0.477         0.508         0.530
Network         MAP@100     0.977         0.978         0.979
                MAP@200     0.969         0.970         0.976

Hybrid          AUC         0.903         0.926         0.925
Multi-Head      F1 score    0.553         0.583         0.606
Attention       MAP@100     0.996         0.988         0.992
Dil. Conv.      MAP@200     0.981         0.971         0.972

Attention       AUC         0.881         0.902         0.911
Augmented       F1 score    0.503         0.543         0.551
Conv.           MAP@100     0.956         0.969         0.969
Network         MAP@200     0.948         0.956         0.952
Table 4: Main Results

With respect to the time spent during training and inference, Table 5 shows the average time per epoch over the folds during training, and the total time needed to run inference on the validation data, which is 20% of the dataset. The results achieved enable the establishment of inspection protocols for suspected cases with high assertiveness. However, it is necessary to note that the choice of threshold is an important point for decision making. Our model has an optimal threshold, shown in Figure 9, that maximizes the F1 score. Note that when a threshold is chosen there is a trade-off between precision and recall; in other words, if precision is to be prioritized, we must choose a threshold greater than the optimal one. Table 4 and the confusion matrix in Figure 10 correspond to this threshold.

Model             Training time   Inf. time   # Params
CNN               2min 27s        32s         3M
Hybrid Attn.      3min 16s        37s         51M
Attn. Augmented   2min 40s        20s         17M
Table 5: Training time, on Tesla V100 GPU hardware, with train split = 80%.
Figure 8: Metrics by epochs.
Figure 9: Threshold Analysis
Figure 10: Confusion Matrix from one fold in train = 80%

7 Conclusion

In this paper, we introduced a Hybrid multi-head self-attention dilated convolution method for electricity theft detection with realistic imbalanced data. We apply three innovations to improve upon the previous baseline work:

  1. A Quantile normalization of the dataset;

  2. The introduction of a second channel to the input called Binary Mask;

  3. A novel model of multi-head self-attention.

Another key element is the reshaping of the time-series data into a 2D format, introduced by [5, 9], allowing us to treat each consumer sample as an image and to use CNNs. Our attention model outperformed the CNN model we developed by several points of F1 score and converged in far fewer epochs and a fraction of the training time.

The model presented in [19] was the inspiration for our attention model. The unification step, which combines the outputs of the attention, standard convolution and dilated convolution branches, allowing information from different spatial sizes and sources to be merged, is the core of our model's architecture. The characteristics of our model do not depend on the data used; that said, it could also be applied to problems in computer vision, for instance.

Due to the high number of missing values in the data (approx. 25%), classic attempts to reconstruct these values can introduce significant bias, resulting in poor solutions. With the addition of the Binary Mask we improved the F1 score from 0.504 to 0.606 (Table 3). To the best of our knowledge, this is the first time that a Binary Mask has been introduced as an input channel of a CNN for dealing with missing data. Deep learning solutions for electricity theft detection are rare in the literature. To incentivize research in this field we provide the code in a GitHub repository at https://github.com/neuralmind-ai/electricity-theft-detection-with-self-attention, and the dataset can be found at https://github.com/henryRDlab/ElectricityTheftDetection/. The results obtained in this study demonstrate that there is still room for advances in the results obtained by deep learning techniques applied to electricity theft detection on real smart-meter data.

7.1 Future Work

The insights produced and the experience gained from this work will be used in future experiments involving energy, such as energy consumption forecasting and fraud detection in the context of another AMI framework, where data will be available in near real time and at a higher sampling rate.

7.2 Acknowledgments

This work is funded by ENEL in ANEEL R&D Program PD_06072_06 61/2018. Roberto Lotufo thanks CNPq for its support through research project PQ2018.