I Introduction
Nonintrusive load monitoring (NILM) is the task of estimating the power demand of a specific appliance from the aggregate consumption of a household measured by a single meter
[hart1992nonintrusive]. As the task requires breaking down the total energy consumed by multiple appliances into appliance-level energy consumption records, NILM is synonymous with the phrase "energy disaggregation" [shin2018subtask]. A direct benefit of NILM is that energy end-users can acquire appliance-level consumption feedback and optimize their energy consumption behaviours accordingly. It is estimated that up to 12% of residential energy savings can be achieved by providing appliance-level feedback [ehrhardt2010advanced]. NILM benefits consumers, the research community, and utilities in domains including residential and commercial energy use, appliance innovation, energy-efficient marketing, and program evaluation [armel2013disaggregation].

The approaches for NILM can be generally divided into supervised methods and unsupervised methods [bonfigli2015unsupervised]. In the supervised setting, the power consumptions of individual appliances are collected and can be used to train the models. For unsupervised methods, however, only the aggregate power consumption data can be used. Approaches for unsupervised NILM include hidden Markov models (HMM) [parson2012non, parson2014unsupervised], factorial hidden Markov models (FHMM) [kolter2012approximate, NIPS2016_6534], and methods based on event detection and clustering [gonccalves2011unsupervised, zhao2016training]. Comprehensive reviews of unsupervised NILM approaches can be found in [bonfigli2015unsupervised, zhuang2018overview].

With the development of deep neural networks, various neural network-based supervised NILM approaches have been proposed [mauch2015new, kelly2015neural]. Substantial progress has been made recently thanks to convolutional neural networks (CNN) [zhang2018sequence, shin2018subtask]. For the task of NILM, the power consumption patterns of different appliances generally have varied scales. The aggregate consumption of multiple appliances is prone to have more complicated shapes, hence requiring the ability to deal with scale variation. In addition to local information within a small time range, it is also important to consider the context dependencies of consumption patterns, as energy consumption behaviours contain higher-level semantics (e.g., the dryer works after the washer, and one may turn on the microwave multiple times until cooking is finished). However, existing CNN-based models fail to exploit those aspects, which leads to high rates of false positive/negative errors in the disaggregation results. In light of this, we propose a scale- and context-aware network (SCANet) structure to incorporate the above-mentioned ideas. In this paper, we compare the performance of SCANet with state-of-the-art models and conduct empirical analyses on the advantages of the proposed structure. The contributions of this work are twofold:

A scale- and context-aware CNN structure is designed for the task of NILM, which greatly improves the disaggregation results for multiple appliances.

We show that adding adversarial loss or on-state augmentation can help the model produce more accurate results and increase generalizability.
The organization of the rest of the paper is as follows: we introduce the related work of this study in Section II. The modules for scale and context awareness, the adversarial loss, and the on-state augmentation are described in detail in Section III. The effectiveness of the proposed SCANet model is validated in Section IV with extensive comparisons and visualizations. An additional experiment setting that uses partial ground truth is also introduced and implemented. Finally, Section V concludes the paper and points out some future work.
II Related Work
II-A Neural Nonintrusive Load Monitoring
The application of neural networks in NILM started with recurrent neural networks (RNN), CNN, and denoising autoencoders (DAE) with relatively simple structures [mauch2015new, kelly2015neural]. Various CNN models have been proposed thanks to the flexibility of CNN structures, such as sequence-to-point, sequence-to-sequence, and fully convolutional models [zhang2018sequence, chen2018convolutional, brewitt2018non].

The integration of domain knowledge further enriches the design of CNN architectures. An on/off state classification subnetwork can be added in parallel to the regression subnetwork so that the model can learn from on/off state information directly [shin2018subtask, murray2018transferability]. The work in this paper adopts the structure of the subtask gated network (SGN) [shin2018subtask] as a starting point.
II-B Multi-scale Features in CNNs
CNNs are widely used in computer vision tasks including object detection and semantic segmentation, for which capturing multi-scale information is of crucial significance. When features of various scales exist in a CNN structure, these features can be combined by upsampling higher layers [hariharan2015hypercolumns] or by adopting different sampling strategies (e.g., either max pooling or deconvolution) for different layers [kong2016hypernet]. The association of multi-scale features can also be achieved by building pyramid-like network structures such as U-Net [ronneberger2015u], FPN [lin2017feature], and PANet [liu2018path]. While U-Net concatenates low-level features and upsampled high-level features using skip-connections in a sequential manner and uses the last layer for prediction, features of multiple layers are used by FPN to produce predictions at various scales.

Another way to create features of multiple scales is to use dilated convolutions [yu2015multi], which is adopted by TridentNet [li2019scale] to generate multi-scale features in several parallel branches with different dilation rates. Scale awareness is obtained by training the branches separately with objects within certain scale ranges. In this work, we use a multi-branch structure similar to that of TridentNet. Unlike TridentNet, however, we use gating signals generated by the branches in the on/off state classification subnetwork to selectively keep feature maps in the regression subnetwork, which facilitates scale awareness.
II-C Self-attention Mechanism
The attention mechanism is useful when additional information can be provided by global context [bahdanau2014neural]. For self-attention, the output value at a position in a sequence is calculated by attending to all positions in the sequence [zhang2018self]. Applications including machine translation and video classification greatly benefit from the usage of self-attention [vaswani2017attention, wang2018non]. In this work, we adopt the self-attention module proposed by Zhang et al. [zhang2018self].
II-D Generative Adversarial Networks
Generative adversarial networks (GAN) are a family of generative models that are capable of generating realistic data [goodfellow2014generative], and different extensions of GANs have been applied to tasks including image-to-image translation [zhu2017unpaired, yu2018generative], text-to-image synthesis [reed2016generative], music generation [yang2017midinet], etc. Other works focus on stabilizing the training of GANs and improving the quality of generated samples [arjovsky2017wasserstein, gulrajani2017improved, zhao2017energy, takeru2018spectral]. Applying GANs to NILM is a relatively new idea [bao2018enhancing], where a disaggregator is used to produce latent representations for a specific appliance, followed by a generator that produces the load sequence of the appliance. Different from [bao2018enhancing], we directly formulate the generator as a mapping from the aggregate consumption to the appliance-level consumption without producing the latent representations.

III Proposed Model and Training Techniques
In this section, we first formally define the NILM task considered in the paper. We then briefly introduce existing CNNbased models and elaborate on the building blocks of SCANet. Techniques that can facilitate the training of the model are also introduced.
III-A Problem Formulation
Consider a household with a given aggregate power consumption signal $x_t$. Let $y_t^{(i)}$ and $u_t$ denote the power consumption of the $i$th appliance being considered and the total consumption of all remaining appliances, respectively. Then, we have $x_t = \sum_{i=1}^{I} y_t^{(i)} + u_t + \epsilon_t$, where $I$ is the number of appliances being considered and $\epsilon_t$ is the additive noise. Given the aggregate signal $x$, the task of NILM is to recover the power consumption sequences $y^{(i)}$ of the appliances under consideration [shin2018subtask]. An illustration of the task is provided in Fig. 1.
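The additive formulation above can be sketched with synthetic signals; all names and values below are illustrative and not drawn from the datasets used in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100                                  # number of time steps
I = 3                                    # appliances being considered
y = rng.uniform(0, 200, size=(I, T))     # y[i]: consumption of the i-th appliance
u = rng.uniform(0, 50, size=T)           # total consumption of remaining appliances
eps = rng.normal(0, 1.0, size=T)         # additive measurement noise

x = y.sum(axis=0) + u + eps              # aggregate signal seen by the single meter
# The NILM task: given only x, recover each row of y.
```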
III-B Model Design
Estimating the power consumption sequence of an appliance with length $T_o$ using only the aggregate consumption signal corresponding to the same time window is difficult, as contextual information outside the window is not considered. Thus, it is suggested to add windows of length $w$ to both sides of the input aggregate sequence [shin2018subtask, chen2018convolutional]. Specifically, with input sequence $x_{t-w:t+T_o+w}$, $\hat{y}^{(i)}_{t:t+T_o}$ is the output sequence for the $i$th appliance.

It is straightforward to formulate sequence-to-sequence neural network models with stacked convolutional layers and fully-connected (FC) layers for the task $\hat{y}^{(i)} = f(x)$, where $\hat{y}^{(i)}$ is the predicted sequence [zhang2018sequence]. In order to exploit the on/off state information, two subnetworks, namely, the regression subnetwork $f_r$ and the classification subnetwork $f_c$, are formulated [shin2018subtask]. An auxiliary sequence $s^{(i)}$ representing the on/off state of the $i$th appliance is added, and the predicted sequence of on-state probability is given as $\hat{p}^{(i)} = \sigma(f_c(x))$, where $\sigma$ denotes the sigmoid function. Hence, the final output of the model is

$\hat{y}^{(i)} = f_r(x) \odot \hat{p}^{(i)}$ (1)

where $\odot$ is the element-wise multiplication. For simplicity, we omit the superscript and subscript hereafter.
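The subtask gating of Eq. (1) can be illustrated numerically as follows; the subnetwork outputs here are made-up numbers rather than real model outputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical outputs of the two subnetworks for a 4-step output window
# (illustrative numbers only):
reg_out = np.array([40.0, 150.0, 160.0, 5.0])   # f_r(x): regression output (watts)
cls_logits = np.array([-4.0, 6.0, 5.0, -6.0])   # f_c(x): on/off logits

p_on = sigmoid(cls_logits)     # predicted on-state probability
y_hat = reg_out * p_on         # Eq. (1): element-wise gating of the regression output
```

Note how the gate drives the output toward zero at time steps the classifier deems off, even when the regression output there is nonzero.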
The structure of the two subnetworks proposed in [zhang2018sequence, shin2018subtask] is illustrated in Fig. 2. We build our model featuring scale and context awareness based on this structure (see Fig. 3). The additional components are added based on two observations of existing works: first, the convolutional layers are unable to explicitly extract features with different time scales, and second, the features of the convolutional layers for a given time step are produced based on neighbouring input values without referring to the context. The details of the scale and context awareness modules are elaborated as follows:
III-B1 Scale-aware Feature Extraction
The scale awareness of SCANet is obtained by adding parallel branches with different dilation rates to the original network and connecting the branches in the two subnetworks by a simple gating mechanism, which allows the regression subnetwork to keep only the most important feature maps at different scales. An illustration of dilated convolution with different dilation rates $d$ is shown in Fig. 4. With the same number of layers and parameters, a larger $d$ allows the output nodes to respond to wider time ranges at the input. Thus, the outputs of the branches with different $d$ will reflect elements (e.g., shapes or edges) of different time scales at the input. At the same time, an element at the input will affect more output nodes when a larger $d$ is used.
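The growth of the receptive field with the dilation rate can be sketched as follows (a minimal NumPy implementation of valid dilated convolution, not the Keras layer used in the actual model):

```python
import numpy as np

def dilated_conv1d(x, kernel, d):
    """Valid 1-D convolution with dilation rate d: each output node sees a
    span of (len(kernel) - 1) * d + 1 input steps."""
    k = len(kernel)
    span = (k - 1) * d + 1
    out = np.empty(len(x) - span + 1)
    for t in range(len(out)):
        out[t] = sum(kernel[j] * x[t + j * d] for j in range(k))
    return out

x = np.arange(20, dtype=float)
k = np.array([1.0, 1.0, 1.0])
y1 = dilated_conv1d(x, k, d=1)   # each output node sees 3 consecutive steps
y3 = dilated_conv1d(x, k, d=3)   # same kernel size, but a 7-step receptive field
```

With the same 3-tap kernel, the branch with $d=3$ covers a time span more than twice as wide per output node, without any extra parameters.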
Let $F_r^{(d)}$ be the outputs of the branches in $f_r$ and let $G_c^{(d)}$ be the outputs of the branches in $f_c$ with the sigmoid activation function. Then, the gating mechanism associating the two subnetworks is given by

$\tilde{F}_r^{(d)} = F_r^{(d)} \odot G_c^{(d)}$ (2)

As the gating operation is separately performed for each dilation rate, a rich combination of features at different time scales can be achieved. We then concatenate the branch outputs in the two subnetworks and obtain $F_r$ and $F_c$ (note that $F_c$ contains features with the rectified linear unit (ReLU) activation function instead of the sigmoid function). Both $F_r$ and $F_c$ are processed by a convolutional layer with a kernel size of 1, yielding $H_r$ and $H_c$, the inputs to the self-attention modules.

III-B2 Context-aware Feature Integration
The integration of contextual information is achieved by the self-attention module, which takes an input $H$ with $T$ time steps and $C$ channels. The module learns an additional feature map $O$ whose values at each time step are obtained by attending to all the time steps in $H$. We first map the input with weight matrices $W_f$ and $W_g$, and an entry in the attention matrix $\beta$ is calculated as

$\beta_{j,i} = \frac{\exp(s_{ij})}{\sum_{i=1}^{T} \exp(s_{ij})}, \quad s_{ij} = (W_f h_i)^{\top} (W_g h_j)$ (3)

The additional feature map is then calculated by

$o_j = \sum_{i=1}^{T} \beta_{j,i} (W_h h_i)$ (4)

Note that $\beta_{j,i}$ is the attention assigned to the $i$th time step when the response of the $j$th time step is being calculated. The output of the self-attention module is defined as $A = \gamma O + H$, where $\gamma$ is initialized as 0 and updated when the model is trained, so that the model can rely on the local context at first and gradually learn to pick up the dependencies in the global context [zhang2018self]. Specifically, the weight matrices $W_f$, $W_g$, and $W_h$ are implemented as convolutional layers with a kernel size of 1. For the two subnetworks, the outputs of the self-attention modules can be represented as $A_r = \gamma_r O_r + H_r$ and $A_c = \gamma_c O_c + H_c$, where $O_r$ and $O_c$ are the additional feature maps and $\gamma_r$ and $\gamma_c$ are the corresponding coefficients.
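The attention computation of Eqs. (3) and (4) can be sketched in NumPy; the shapes and weight names follow the notation above, and the random inputs are purely illustrative:

```python
import numpy as np

def self_attention(H, Wf, Wg, Wh, gamma):
    """Sketch of the self-attention module of [zhang2018self] over time.
    H: (T, C) feature map; Wf, Wg: (C, Cbar) mappings; Wh: (C, C)."""
    f, g, h = H @ Wf, H @ Wg, H @ Wh
    S = f @ g.T                               # S[i, j] = s_ij
    B = np.exp(S - S.max(axis=0, keepdims=True))
    B = B / B.sum(axis=0, keepdims=True)      # beta_{j,i}: softmax over i, Eq. (3)
    O = B.T @ h                               # o_j = sum_i beta_{j,i} W_h h_i, Eq. (4)
    return gamma * O + H                      # A = gamma * O + H

rng = np.random.default_rng(1)
T, C, Cbar = 8, 4, 2
H = rng.normal(size=(T, C))
Wf = rng.normal(size=(C, Cbar))
Wg = rng.normal(size=(C, Cbar))
Wh = rng.normal(size=(C, C))
A0 = self_attention(H, Wf, Wg, Wh, gamma=0.0)   # gamma = 0 at initialization
A1 = self_attention(H, Wf, Wg, Wh, gamma=1.0)
```

With $\gamma = 0$ the module is an identity map, consistent with the initialization strategy described above.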
The loss function of the model with two subnetworks is given by $\mathcal{L} = \mathcal{L}_{\text{out}} + \mathcal{L}_{\text{on}}$, where $\mathcal{L}_{\text{out}}$ is the mean squared error (MSE) measuring the overall disaggregation error of the model, and $\mathcal{L}_{\text{on}}$ is the binary cross-entropy (BCE) measuring the classification error of the on/off state classification subnetwork.

III-C Training Techniques that Improve Accuracy
III-C1 Training with adversarial loss
The performance of SCANet can be further improved by adding an adversarial loss to the model. As illustrated in Fig. 5, a critic network is added to the model so that we can train the model partially as a Wasserstein GAN with gradient penalty (WGAN-GP) [gulrajani2017improved]. The original GAN is formulated by the minimax game between the generator $G$ and the discriminator $D$ [goodfellow2014generative]:

$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{\tilde{x} \sim p_g}[\log(1 - D(\tilde{x}))]$ (5)

where $p_{\text{data}}$ is the distribution of the real data and $p_g$ is the distribution of data generated by $G$, whose input is sampled from some noise distribution. Briefly speaking, the goal of the discriminator is to gain the ability to distinguish between real and generated samples, while the generator tries to confuse the discriminator by learning to generate realistic data samples. Training the generator and the discriminator in turn allows the generator to gradually obtain the ability to generate realistic samples.
The WGAN-GP model used in this work is a modification of the WGAN model proposed in [arjovsky2017wasserstein], which adopts the Wasserstein distance to stabilize the training of GANs. The gradient penalty in WGAN-GP further stabilizes the training process by penalizing the norm of the gradient of $D$ instead of clipping the weights in $D$ (here, $D$ is named the critic instead of the discriminator, as the task of $D$ is not the classification of real or generated data). The loss of WGAN-GP (also referred to as the adversarial loss in this paper) is formulated as

$\mathcal{L}_{\text{adv}} = \mathbb{E}_{\tilde{x} \sim p_g}[D(\tilde{x})] - \mathbb{E}_{x \sim p_{\text{data}}}[D(x)] + \lambda \, \mathbb{E}_{\hat{x} \sim p_{\hat{x}}}\big[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2\big]$ (6)

where the first two terms measure the Wasserstein distance between $p_g$ and $p_{\text{data}}$. The last term is the gradient penalty, and $p_{\hat{x}}$ refers to uniformly sampling from the line segments connecting point pairs sampled from $p_{\text{data}}$ and $p_g$ (see [gulrajani2017improved]). In this paper, instead of generating samples from a noise distribution, we directly use the network producing $\hat{y}$ as the generator. Specifically, we add the adversarial loss so that the overall loss function becomes
$\mathcal{L} = \mathcal{L}_{\text{out}} + \mathcal{L}_{\text{on}} + \lambda_{\text{adv}} \mathcal{L}_{\text{adv}}$ (7)

where $\lambda_{\text{adv}}$ is the weight for the adversarial loss. It is expected that the adversarial loss can help the model produce more realistic output sequences, especially when the size of the training dataset is relatively small.
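The gradient penalty term of Eq. (6) can be illustrated as follows. To keep the sketch self-contained, a deliberately simple linear critic is used (an illustrative stand-in for the convolutional critic described later, whose input gradient would require automatic differentiation):

```python
import numpy as np

rng = np.random.default_rng(2)

# A linear critic D(x) = w . x: its gradient with respect to x is exactly w,
# so the penalty can be computed in closed form for this illustration.
w = rng.normal(size=64)
def critic(batch):
    return batch @ w

real = rng.uniform(0.0, 1.0, size=(16, 64))   # sequences drawn from the data
fake = rng.uniform(0.0, 1.0, size=(16, 64))   # sequences produced by the generator

# Sample uniformly on line segments between paired real and generated samples.
eps = rng.uniform(size=(16, 1))
x_hat = eps * real + (1.0 - eps) * fake

grad_norm = np.linalg.norm(w)                 # ||grad_x D(x_hat)||_2, same for all x_hat
lam = 10.0
gp = lam * np.mean((grad_norm - 1.0) ** 2)    # gradient penalty term of Eq. (6)

# Critic loss: E[D(fake)] - E[D(real)] + gradient penalty
loss_D = critic(fake).mean() - critic(real).mean() + gp
```

In practice the gradient norm is computed per interpolated sample by backpropagation; the closed-form gradient here is only possible because the critic is linear.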
TABLE I: Experiment Results on the Evaluation Metrics for the REDD Dataset

Metric | Model | Fridge | Microwave | Dishwasher | Average
MAE | Seq2point [zhang2018sequence] | 26.01 | 27.13 | 24.44 | 25.86
MAE | SGN [shin2018subtask] | 26.11 | 16.95 | 15.77 | 19.61
MAE | Proposed SCANet | 21.77 | 13.75 | 10.14 | 15.22
SAE | Seq2point | 16.24 | 18.89 | 22.87 | 19.33
SAE | SGN | 17.28 | 12.49 | 15.22 | 15.00
SAE | Proposed SCANet | 14.05 | 9.97 | 8.12 | 10.71
TABLE II: Experiment Results on the Evaluation Metrics for the UK-DALE Dataset

Metric | Model | Kettle | Fridge | Microwave | Dishwasher | Washing Machine | Average
MAE | Seq2point | 10.87 | 10.81 | 17.48 | 12.47 | 15.96 | 13.52
MAE | SGN | 9.74 | 8.09 | 16.27 | 5.62 | 10.91 | 10.13
MAE | Proposed SCANet | 8.48 | 6.14 | 15.16 | 4.82 | 8.71 | 8.67
SAE | Seq2point | 8.69 | 5.30 | 8.01 | 10.33 | 10.65 | 8.60
SAE | SGN | 7.14 | 5.03 | 6.61 | 4.32 | 7.86 | 6.20
SAE | Proposed SCANet | 5.77 | 4.03 | 6.54 | 3.81 | 4.86 | 5.00
III-C2 On-state augmentation
We propose on-state augmentation to deal with the variance of the on-state power consumption of appliances (e.g., the peak power of two fridge models may be different even if they have the same operation pattern). Given an appliance, the maximum offset values $\delta_{\min}$ and $\delta_{\max}$ are decided, and each output sequence $y$ is replaced by $y + \delta s$, where $\delta \sim U(\delta_{\min}, \delta_{\max})$ and $s$ is the on/off state sequence. The same amount of on-state offset is also added to the corresponding input sequence $x$. The augmentation is applied during the training of the model.

In this work, we apply on-state augmentation to the fridge, for which biased estimation of the on-state power is a major source of disaggregation error. As expected, the model is able to estimate the on-state power of the fridge more accurately after on-state augmentation is implemented.
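The augmentation can be sketched as follows; the uniform sampling of a single offset per sequence is an assumption about the exact scheme:

```python
import numpy as np

def on_state_augment(x, y, s, delta_min, delta_max, rng):
    """On-state augmentation sketch: add a random offset wherever the
    appliance is on (the sampling scheme here is an assumption).
    x: aggregate, y: appliance sequence, s: on/off states (0/1)."""
    delta = rng.uniform(delta_min, delta_max)
    y_aug = y + delta * s          # shift only the on-state power
    x_aug = x + delta * s          # add the same offset to the aggregate
    return x_aug, y_aug

rng = np.random.default_rng(3)
s = np.array([0.0, 1.0, 1.0, 0.0])
y = np.array([0.0, 90.0, 95.0, 0.0])     # a fridge-like on-state event
x = y + 30.0                             # other appliances draw a constant 30 W
x_aug, y_aug = on_state_augment(x, y, s, -10.0, 10.0, rng)
```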
IV Results and Analysis
In this section, we introduce the datasets used in this paper and the implementation details of the models. Experiment results are presented together with empirical analyses of the advantages of SCANet.
IV-A Experiment Settings
IV-A1 Datasets
Two real-world datasets, REDD [kolter2011redd] and UK-DALE (http://jackkelly.com/data/) [kelly2015uk], are used to evaluate the performance of SCANet in this paper. The REDD dataset contains measurement data from six households in the US, and the time span of the dataset ranges from 23 to 48 days for different houses. The mains consumption was recorded every 1 second, while the appliance-wise consumption was recorded every 3 seconds. The UK-DALE dataset includes data from five UK households, and measurements for the aggregate consumption as well as the consumptions of individual appliances were recorded every 6 seconds. The monitoring of house 1 lasted for over 600 days, while the time spans for the other houses range from 39 to 234 days. Detailed descriptions of the households and the monitored individual appliances in the datasets can be found in [kolter2011redd, kelly2015uk].
Following previous studies [zhang2018sequence, shin2018subtask], we use the data of houses 2-6 to create the training set and leave the data of house 1 as the test set for the REDD dataset. Disaggregation is implemented for fridge, dishwasher, and microwave. The preprocessed data for the REDD dataset used in this work is provided by the authors of [shin2018subtask]. For the UK-DALE dataset, we use houses 1 and 5 for training and house 2 for testing. Disaggregation results for fridge, dishwasher, microwave, washing machine, and kettle are reported. In order to normalize the data, we follow the practice in [shin2018subtask] and divide the power consumption values of both datasets by 612, which is the standard deviation of the aggregate consumption for houses 2-5 in the REDD dataset.
IV-A2 Implementation Details
The SGN backbone [shin2018subtask] has 6 convolutional layers followed by 2 FC layers in each subnetwork. Specifically, the numbers of filters in the convolutional layers are 30, 30, 40, 50, 50, and 50, and the kernel sizes are 10, 8, 6, 5, 5, and 5, respectively. All convolutional layers are implemented with a stride of 1 and "same" padding, and the weights are initialized with the "He normal" initializer [he2015delving]. The first FC layer has 1024 hidden nodes, and the second FC layer has the same number of nodes as the output of the model. All but the last layer use the ReLU activation function. For the REDD dataset, the output sequence size $T_o$ is 64 and the additional window size $w$ is 400, while $T_o$ and $w$ are 32 and 200 for the UK-DALE dataset. As the sizes of the input and the output are reduced by half for the UK-DALE dataset, we also change the kernel sizes to 5, 4, 3, 3, 3, and 3 while the other hyperparameters remain the same. The Adam optimizer with an initial learning rate of 0.0001 is adopted, and we train the models for 5 epochs with a batch size of 16. For SCANet, we add two parallel branches with different dilation rates to each subnetwork starting from the 4th convolutional layer. The layer producing $H_r$ and $H_c$ has 64 filters, thus $C = 64$ for the self-attention modules, and we set the channel dimension of the $W_f$ and $W_g$ mappings to 32. We adopt most of the hyperparameters from [shin2018subtask] so that we can focus on ensuring the effectiveness of the proposed model components rather than tuning the hyperparameters. All models are implemented in Python 3.6 with Keras 2.1.6.
The training samples are produced by a sliding window running over the input sequences with specific step sizes. The step size is set to 2 for the REDD dataset; for the UK-DALE dataset, the step sizes for microwave, dishwasher, fridge, washing machine, and kettle are 4, 8, 32, 32, and 32, respectively. We choose the step sizes by ensuring that the SGN model performs no worse than its reported results [shin2018subtask]. For testing, a sliding window with a step size of 2 generates the input samples, and multiple overlapping output sequences are averaged to produce the final output. Further, as on-state events are relatively rare for some appliances, the imbalance of on and off states may bring difficulties to the training of the models. Thus, we randomly remove off-state samples from the training dataset for some appliances. For the REDD dataset, the probability of keeping an off-state sample (i.e., a sample whose entire output window is in the off state) is 0.2 for dishwasher. For the UK-DALE dataset, the probabilities for dishwasher, microwave, and kettle are 0.02, 0.05, and 0.1, respectively. The same settings are shared by the Seq2point model [zhang2018sequence] and SGN when applicable.
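The sample-generation procedure can be sketched as follows (function and variable names are illustrative, not taken from the authors' implementation):

```python
import numpy as np

def make_samples(x, y, T_out, w, step, p_keep_off, rng):
    """Sliding-window sample generation (names are illustrative).
    Each input window of length T_out + 2w is paired with an output window
    of length T_out; all-off output windows are kept with prob. p_keep_off."""
    xs, ys = [], []
    for t in range(w, len(x) - T_out - w + 1, step):
        y_win = y[t : t + T_out]
        if y_win.max() == 0.0 and rng.random() > p_keep_off:
            continue                     # randomly drop an off-state sample
        xs.append(x[t - w : t + T_out + w])
        ys.append(y_win)
    return np.array(xs), np.array(ys)

rng = np.random.default_rng(4)
x = rng.uniform(0.0, 100.0, size=1000)
y = np.zeros(1000)
y[300:360] = 80.0                        # a single on-state event
xs, ys = make_samples(x, y, T_out=64, w=200, step=2, p_keep_off=0.2, rng=rng)
```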
For the implementation of WGAN-GP, a simple critic with 4 convolutional layers and 32 filters at each layer is formulated. The kernel sizes are set to 3, an FC layer with 256 hidden nodes bridges the convolutional layers and the output node, and the weight $\lambda_{\text{adv}}$ for the adversarial loss is 0.5. We train the model with a batch size of 32. Specifically, we implement the model with the adversarial loss for appliances other than the fridge. On-state augmentation is implemented in the training of the models for the fridge, with separate maximum offset values chosen for the REDD and UK-DALE datasets.
Mean absolute error (MAE) and signal aggregate error (SAE) are used as the evaluation metrics for each appliance [shin2018subtask]. Specifically, given a predicted output sequence $\hat{y}$ with $T$ time steps, $\text{MAE} = \frac{1}{T} \sum_{t=1}^{T} |\hat{y}_t - y_t|$ and $\text{SAE} = \frac{1}{K} \sum_{k=1}^{K} \frac{|\hat{r}_k - r_k|}{N}$, where $K$ is the number of disjoint time periods with length $N$, $T = KN$, $\hat{r}_k$ is the total predicted power consumption in the $k$th time period, and $r_k$ is the corresponding ground truth. In this work, we set $N = 1200$; thus, each time period corresponds to approximately one hour for the REDD dataset and two hours for the UK-DALE dataset.
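A minimal version of the two metrics is sketched below; the per-time-step normalization of SAE is an assumption made so that SAE is comparable to MAE in watts:

```python
import numpy as np

def mae(y_hat, y):
    """Mean absolute error over all time steps."""
    return np.abs(y_hat - y).mean()

def sae(y_hat, y, N):
    """Signal aggregate error over disjoint periods of length N, normalized
    per time step (this normalization is an assumption)."""
    K = len(y) // N
    r_hat = y_hat[: K * N].reshape(K, N).sum(axis=1)
    r = y[: K * N].reshape(K, N).sum(axis=1)
    return np.abs(r_hat - r).mean() / N

y = np.zeros(2400)
y[:1200] = 100.0                 # the appliance is on for the first period only
y_hat = np.full(2400, 50.0)      # a flat, biased prediction
```

For this synthetic case the flat prediction is off by 50 W at every time step, and both metrics agree; SAE forgives errors that cancel within a period, while MAE does not.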
IV-B Experiment Results
We report the results of the evaluation metrics for the REDD and UK-DALE datasets in Table I and Table II, respectively. Each value is obtained by averaging the results of 3 trials. It is seen in the two tables that SCANet achieves lower MAEs and SAEs than SGN for all of the appliances, especially for the REDD dataset, for which the improvements in the averaged MAE and SAE are 22.39% and 28.60%, respectively. The average improvements for the UK-DALE dataset are also greater than 10%.
The comparison of SCANet with SGN is further presented in Fig. 6 and Fig. 7, and we can see that SCANet produces more accurate disaggregation results than SGN. In Fig. 6 (a), it is observed that on-state augmentation helps the model capture the on-state power consumption of the fridge. We further showcase the advantage of SCANet in Fig. 8, in which two cases are illustrated for microwave in the REDD dataset. A false positive case of SGN is shown in Fig. 8 (a). Fig. 8 (b) shows that SCANet is able to identify that the microwave consumes power for multiple short durations successively, while SGN fails to tell that the microwave is on. In Fig. 9, the disaggregation results of the SGN and SCANet models for microwave are illustrated, and the two cases in Fig. 8 are also marked. In general, the SCANet model is more accurate in terms of power consumption level and has fewer false positive cases.
The activations at the end of the branches in the two subnetworks for the case in Fig. 6 (b) are visualized in Fig. 10. For the sample used for the visualization, only the 256 time steps in the middle are plotted. Specifically, each feature map contains values of 256 time steps (the horizontal axis) and 50 channels (the vertical axis). It is clear that the branches are all responding to the rising and falling edges in the microwave consumption signal (the signal is overlaid on the feature maps for comparison), and that a large proportion of the gating signals $G_c^{(d)}$ are actually suppressing the high activations $F_r^{(d)}$ in the regression subnetwork. As a result, the gated feature map is much less activated in general.
We also use the cases in Fig. 6 (b) and Fig. 8 (b) to illustrate the mechanism of the self-attention module. We first visualize the attention matrix $\beta$ in the classification subnetwork for the case of Fig. 6 (b) in Fig. 11 (a). Clearly, the model mainly focuses on the edges in this time range (note that the assignment of attention is row-wise), and the highest attention values are observed for the three rising edges, i.e., the model refers to all the rising edges when looking at one of the rising edges. For instance, we highlight the row corresponding to the first rising edge, and the high values in the row mainly belong to the three rising edges in the signal, including the first rising edge itself. Further, the attention matrix for Fig. 8 (b) is shown in Fig. 11 (b). Similarly, the main feature of the matrix is that high attention values are found for the time steps corresponding to rising edges, and the rising edges are attending to all rising edges within the time range.
In Fig. 12, we plot the input and output of the self-attention module in the classification subnetwork for the case of Fig. 6 (b), as well as the additional feature map $O_c$, which is highly activated at all three rising edges. As the learned coefficient $\gamma_c$ is nonzero for the sample, the activations at the edges are reflected in the module output $A_c$. Thus, it is of interest to investigate the effect of $O_c$ in $A_c$. To this end, we bypass the self-attention module and directly feed $H_c$ to the FC layers to obtain the output of the classification subnetwork, and compare it with the original output in Fig. 13, which shows that $O_c$ helps suppress the on-state probability where the microwave is not consuming electricity. Note that this can be a hard task as the microwave is turned on shortly before and after. Thus, in this case, the regression subnetwork only needs to produce an output sequence with approximately the same value, and the classification subnetwork alone is able to produce desirable results.
We use the case in Fig. 8 (b) to show that the regression subnetwork is not obsolete. Specifically, we plot the feature maps $O_r$ and $A_r$, the outputs of the subnetworks, as well as the output of the model in Fig. 14, which shows that $A_r$ is actively responding to both rising and falling edges of the input, forming a repetitive pattern. As a result, the output of the regression subnetwork predicts the right trends of the power consumption in the time period.
IV-C Ablation Study and Discussion
TABLE III: MAEs in the Ablation Study for the REDD Dataset (MS: multi-scale branches, SA: self-attention, AL: adversarial loss, OA: on-state augmentation)

MS | SA | AL | OA | Dishwasher | Microwave | Fridge
 |  |  |  | 15.77 | 16.95 | 26.11
✓ |  |  |  | 13.22 | 15.93 | 25.52
 | ✓ |  |  | 13.53 | 15.98 | 25.02
✓ | ✓ |  |  | 12.73 | 15.00 | 24.68
 |  | ✓ |  | 13.54 | 15.97 | 
 |  |  | ✓ |  |  | 22.80
✓ | ✓ | ✓ |  | 10.14 | 13.75 | 
✓ | ✓ |  | ✓ |  |  | 21.77
We carry out an ablation study for the appliances in the REDD dataset and report the MAEs in Table III (each MAE value is averaged over 3 trials). The first row corresponds to SGN. It is observed in the table that each component we add can reduce the MAE and that the components are mutually compatible, as lower MAEs can be achieved when they are combined. Specifically, the combination of the additional modules and the adversarial loss greatly improves the performance of the model. The improvement for fridge is mainly contributed by onstate augmentation, which helps the model adapt to a different power consumption level in the test data.
It is then of great interest to analyze the mechanism behind the accuracy boost when the additional modules are combined with the adversarial loss. After some investigation, we find that the adversarial loss helps the model capture the power consumption modes as expected, and that the branch-wise gates allow the model to avoid the mode collapse phenomenon (see [srivastava2017veegan] for an introduction to mode collapse in GANs). The effect of the adversarial loss is demonstrated in Fig. 15 with the dishwashers in the REDD dataset as an example. Specifically, we plot the first two principal components of the 64-time-step sequences of $y$ (real samples) and $\hat{y}$ (generated samples) after principal component analysis (PCA). Only complete on-state sequences are considered for simplicity and clarity. Four modes of $y$ in the training set are identified, and it is expected that $\hat{y}$ in the training set would have the same distribution. Although $y$ in the test set may not have exactly the same distribution as $y$ in the training set (different dishwashers may have varied consumption levels), it would be problematic if the distributions differed too much. In Fig. 15 (a), the SGN model produces sequences close to modes 1 and 2, but fails to cover modes 3 and 4 (only a small fraction gets close to mode 3). By contrast, the complete SCANet model covers all the modes (Fig. 15 (c)). However, it is shown in Fig. 15 (b) that SCANet fails to cover modes 3 and 4 when the branch-wise gates are removed. Note that the branch-wise gates are not specifically designed for avoiding the mode collapse phenomenon. Nevertheless, the empirical observation for dishwasher in the REDD dataset shows that the branch-wise gates facilitate the incorporation of the adversarial loss.

IV-D Practicability Verification With Limited Data and Partial Ground Truth
The aforementioned experiments are carried out with at least several weeks of data from multiple houses, and measurements of the power consumptions of individual appliances are available. The individual-appliance-level measurements, however, may be impractical to obtain. In order to verify the performance of the proposed SCANet model when there is no access to fully labelled datasets (i.e., datasets containing the consumptions of individual appliances) with large time spans, a different setting that uses less data with partial ground truth is adopted. More specifically, the new setting has the following features:

We use the training data of the REDD dataset and test with the test data of the UK-DALE dataset, which puts higher demands on the generalization ability of the models.

It is assumed that appliance-level power consumption signals are inaccessible, but the on/off states of the appliances being considered are labelled. Thus, only the ground truth of the on/off states is available.

Only a small proportion of training data is used to train the models.
As the ground truths for the appliance-level consumptions are unavailable, we modify the structures of SGN and SCANet and keep only the on/off state classification subnetworks in the models. As a result, the outputs of the models only contain on/off state predictions. The data in the REDD dataset is downsampled by a factor of 2 to match the sampling frequency of the UK-DALE dataset (i.e., $T_o$ is 32 and $w$ is 200). The step sizes for the REDD and UK-DALE datasets are 4 and 2, respectively. The other hyperparameters of the models remain unchanged. The adversarial losses are not added, as the consumption values are unknown. Specifically, the experiments for the three appliances in the REDD dataset are designed as follows:

Fridge: Two proportions, namely, 5% and 10% of the training data, are used (the first 5% or 10% of each section in the training data, as there are multiple sections). The total time span of the training data is roughly 3 days for the proportion of 10%. On-state augmentation is used.

Dishwasher: 20% of the training data is used, which contains only 2 usage events. On-state augmentation is used.

Microwave: 20% of the training data is used, and 12 microwave usage events are included. On-state augmentation is used. In addition to this setting, we also experiment with adding part of the test data into the training data to mimic the process of gradually improving the model with the help of additional partially labelled data from the household being tested (e.g., from user feedback). Specifically, the additional data is taken from the beginning of the test data and is not used for evaluation. On-state augmentation is also applied to the additional data.
For performance evaluation, we use the F1 score, which is defined as

F1 = 2 · P · R / (P + R),    (8)

where P = TP / (TP + FP) is the precision and R = TP / (TP + FN) is the recall of the predictions over all time steps. TP, FP, and FN stand for true positive, false positive, and false negative, respectively. For the predicted on/off state probabilities, values lower than 0.5 are considered off and values greater than or equal to 0.5 are considered on.
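The metric above can be computed directly from the thresholded predictions. A minimal sketch (the function name is ours, not the paper's):

```python
import numpy as np

def f1_score(probs, truth, threshold=0.5):
    """Time-step-level precision, recall, and F1 for on/off predictions.

    probs: predicted on-state probabilities; values >= threshold count as on.
    truth: ground-truth 0/1 on/off states.
    """
    pred = np.asarray(probs) >= threshold
    truth = np.asarray(truth).astype(bool)
    tp = np.sum(pred & truth)   # predicted on, actually on
    fp = np.sum(pred & ~truth)  # predicted on, actually off
    fn = np.sum(~pred & truth)  # predicted off, actually on
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For example, probabilities [0.9, 0.6, 0.4, 0.2] against ground truth [1, 0, 1, 0] give one TP, one FP, and one FN, hence precision = recall = F1 = 0.5.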
Table IV: Results for the fridge (REDD).

Model    Proportion   Precision   Recall   F1 score
SGN      5%           0.731       0.412    0.525
SGN      10%          0.724       0.357    0.477
SCANet   5%           0.812       0.831    0.821
SCANet   10%          0.823       0.840    0.831
Table V: Results for the dishwasher (REDD).

Model    Precision   Recall   F1 score
SGN      0.588       0.550    0.567
SCANet   0.692       0.731    0.710
Table VI: Results for the microwave (REDD) with additional days of test data added to the training data.

Model    Added Days   Precision   Recall   F1 score
SGN      0            0.151       0.295    0.199
SGN      7            0.171       0.452    0.249
SCANet   0            0.173       0.441    0.248
SCANet   2            0.637       0.531    0.579
SCANet   4            0.761       0.559    0.644
SCANet   7            0.852       0.622    0.689
The performance of the models for the fridge is shown in Table IV, and the on/off states are visualized in Fig. 16. The results clearly show that SCANet achieves much higher recall than SGN, and hence a much higher F1 score; in particular, the SGN model predicts many off states while the fridge is working. The results for the dishwasher are shown in Table V and Fig. 17. With only two usage events in the training data, it is quite impressive that the SCANet model achieves high precision and high recall at the same time. For the time range of Fig. 17, the ground truth contains 29 usage events, and the SCANet model responds to 27 of them while producing only 2 false positives.

The results for the microwave are shown in Table VI and Fig. 18. When no data from the test dataset is added, the performance of both models is unsatisfactory, indicating that transferring a model learned on the REDD dataset to the UK-DALE dataset is problematic for the microwave. Transferring the fridge model is easier because fridges generally have a unique cyclic power consumption pattern with a relatively low consumption level. The consumption pattern of dishwashers is also quite distinctive, and the time span of a single usage is relatively long. The usage pattern of microwaves, however, may be confused with other appliances, as it mainly consists of sparse, short windows of high power consumption. As a result, adding partially labelled data from the test set greatly improves the performance of the SCANet model. When one week of data containing 19 microwave usage events is added, the F1 score of the model increases to 0.689. Further, if we consider the number of events recorded in Fig. 18, the precision and recall are and for the model trained with 7 additional days of test data.
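Event counts like those read off Figs. 17 and 18 can be derived from the predicted and ground-truth on/off sequences. The paper does not state its event-matching rule, so the overlap-based scheme and function names below are assumptions, shown only as a sketch:

```python
import numpy as np

def on_events(states):
    """Return (start, end) index pairs of contiguous on-state runs."""
    s = np.asarray(states).astype(int)
    edges = np.diff(np.concatenate([[0], s, [0]]))
    starts = np.where(edges == 1)[0]
    ends = np.where(edges == -1)[0]
    return list(zip(starts, ends))

def event_counts(pred_states, true_states):
    """Count ground-truth events the prediction responds to, plus
    predicted events overlapping no ground-truth event (false positives)."""
    pred_ev = on_events(pred_states)
    true_ev = on_events(true_states)

    def overlaps(a, b):  # half-open intervals [start, end)
        return a[0] < b[1] and b[0] < a[1]

    matched = sum(any(overlaps(t, p) for p in pred_ev) for t in true_ev)
    false_pos = sum(not any(overlaps(p, t) for t in true_ev) for p in pred_ev)
    return matched, false_pos, len(true_ev)
```

Under this scheme, event-level precision and recall follow directly from the three returned counts.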
In short, we have shown that the proposed SCANet model outperforms SGN in this new experimental setting. In addition, a model trained in this manner may also be used to facilitate unsupervised NILM approaches (e.g., by helping assign disaggregation results to specific appliances).
V Conclusion
We have developed a scale- and context-aware CNN model, namely SCANet, for the task of NILM. Experimental results show that the proposed SCANet significantly reduces the estimation error of the disaggregated appliance-level power consumption. Adding the adversarial loss and on-state augmentation is shown to be useful for certain appliances. In addition to comparisons with the state of the art, we also provide observations on the working mechanisms of the modules by examining the intermediate network layers. We show that the scale- and context-aware modules function as expected, contributing to the improvement in disaggregation accuracy.
For NILM techniques to function properly in real-world applications, an important direction for future work is to combine the merits of supervised and unsupervised learning. One possibility is to combine the results of supervised and unsupervised models to produce better predictions. Another direction is to design a practical semi-supervised setting that incorporates unlabelled or partially labelled data into the training process.
Acknowledgments
The authors thank Changho Shin for providing the preprocessed REDD dataset used in [shin2018subtask]. We are also grateful for the support of NVIDIA Corporation with the donation of a Titan Xp GPU. This work was partially supported by the 2019 Seed Fund Award from CITRIS and the Banatao Institute at the University of California, and the Hellman Fellowship.