The development of smart grid technologies has given distribution system operators a deeper understanding of customer behaviors in low-voltage networks. According to statistical data, the number of smart meters (SMs) in the U.S. exceeded 70 million by the end of 2016 [1]. The widespread deployment of SMs has provided system operators with an immense amount of time-series consumption data, which has already been widely used in distribution system monitoring and load forecasting [2]. Meanwhile, SM data provides a good opportunity to implement active energy management and control mechanisms such as demand response (DR). DR is defined as the changes in electricity usage by end-use customers in response to control/incentive signals, which can enhance the resource efficiency of electricity production by shaving the system peak demand [3]. In the past, due to the limited number of real-time measurement devices, customers with high demand levels, such as industrial and large commercial customers, were the primary targets of DR programs [4]. More recently, given the development of advanced metering infrastructure (AMI), utilities have become interested in residential customers' potential for DR participation [5]. A critical task for large-scale residential DR implementation is identifying candidate customers, who can then be targeted for effective participation in DR. This will help utilities under strict financial constraints optimize their investment portfolios.
In recent years, several papers have focused on obtaining strategies that effectively segment and classify customers to support DR program targeting and design using SM data. One of the most common approaches is using clustering techniques for extracting customer consumption features [6, 7, 8, 9]. In [6], an effective support vector clustering method is presented for load profile analysis. In [7], a variety of commonly-used clustering methods, including k-means, weighted fuzzy average, self-organizing maps, modified leader-following, and hierarchical algorithms, are evaluated and compared using real distribution network data. In [8], principal component analysis (PCA) is performed to extract the dominant features within customer consumption data, and then the k-means algorithm is employed to classify consumers. In [9], four key time periods of the day are defined to analyze customer behaviors. For each time period, a finite mixture model-based clustering is presented to obtain distinct behavioral groups. It is shown that, using clustering outcomes, different types of consumption behavioral groups can be distinctly inferred. However, typical load profile extraction alone is insufficient to understand customers' impacts on system operation, which limits utilities' ability to target suitable customers for DR investment.
Apart from clustering techniques, other papers have proposed using customer load profile features to determine their potential for DR participation [10, 11, 12, 13]. In [10], residential customers are ranked using their appliance energy efficiency to reduce building energy consumption. In [11], the entropy of household power demand is used to evaluate the variability of consumption behavior, which is considered to be a key component in DR program targeting and customer engagement. In [12], a multi-stage method is proposed to target customers based on three new metrics: load profile efficiency, intensity, and randomness. In [13], a customer's marginal contribution to system cost is obtained using daily demand profiles. Based on the proposed method, high-impact customers can be targeted to improve the economic performance of DR in distribution systems. Compared to the clustering approaches, metric-based methods directly extract customer-level features from SM data and use them to find appropriate candidates for DR investment. Nevertheless, the relationship between the previously-proposed metrics and customer impact on system peak demand is unknown. For example, it cannot be claimed that a low-entropy household has a higher contribution to the system peak demand than a customer with a more variable consumption profile. Moreover, previous works have only considered DR candidate classification in distribution systems that are fully observable, i.e., all customers have SMs. In practice, however, more than half of all U.S. electricity customer accounts do not have SMs to record their detailed consumption behavior, due to financial limitations and cyber-security issues. Identifying suitable DR candidates in such partially observable systems therefore remains a major challenge.
In order to address these shortcomings, this paper proposes a new metric to quantify the impacts of individual customers on the system peak demand. This index is denoted as coincident monthly peak contribution (CMPC) and is defined as the ratio of an individual customer's demand at the time of the daily system peak to the system peak demand, evaluated over the course of a month. In addition, we propose a multi-stage machine learning model for estimating the CMPCs of unobserved customers, using only their monthly consumption data from utility bills. This makes the proposed metric applicable to distribution systems with limited observability. The developed machine learning model consists of three modules: 1) Using a graph theoretic clustering, a seasonal typical load pattern bank is constructed to classify various customer consumption behaviors. 2) To connect unobserved customers to the seasonal databank, a multinomial classification model is presented to identify typical load profiles of customers without SMs. 3) According to the outcome of the classification module, a weighted clusterwise regression (WCR) model is trained to map the unobserved customers' monthly energy consumption data to CMPC values. This machine learning framework takes advantage of the considerable correlation between CMPCs and monthly consumption levels for different customer profiles. The proposed method is tested and verified using real utility data. Based on CMPC, within a certain range of consumption, customers with heavy demand but a small contribution to the system peak could be excluded from DR investment plans, whereas those with a similar demand level but a larger peak contribution are deemed to be more suitable for DR programs. Thus, the proposed metric can help utilities identify high-impact customers for peak-shaving investment.
The rest of the paper is organized as follows: Section II describes the real data and the proposed CMPC metric. In Section III, a clustering algorithm is presented to build the seasonal consumption pattern bank. In Section IV, CMPC inference for unobserved customers is proposed using a classification model and WCR. The numerical results are analyzed in Section V. Section VI presents research conclusions.
II Data Description and CMPC Definition
II-A Data Description
The available data used in this paper is provided by several mid-west U.S. utilities. The data includes the daily load profiles of over 4000 residential customers ranging from January 2015 to May 2018, and the corresponding supervisory control and data acquisition (SCADA) measurements. The SM data was initially processed to eliminate grossly erroneous and missing samples. Accordingly, data points with a z-score magnitude larger than 5 are marked as "erroneous" and replaced using local interpolation. After data pre-processing, the empirical distribution and cumulative distribution function (CDF) of customer monthly energy consumption are obtained and presented in Fig. 1. As shown in the figure, the majority of residential customer monthly consumption samples are concentrated around 1000 kWh, and almost 80% of customers have monthly consumption levels below 2000 kWh. Compared to industrial and large commercial customers, the demand level of residential households is distributed within a smaller range.
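The pre-processing rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `clean_series` is hypothetical, the z-score is computed over the whole series, and "local interpolation" is assumed here to mean linear interpolation between the nearest valid neighbors.

```python
import numpy as np

def clean_series(x, z_max=5.0):
    """Flag missing samples and samples with |z-score| > z_max, then
    fill the flagged positions by linear interpolation (an assumption)."""
    x = np.asarray(x, dtype=float)
    mu, sigma = np.nanmean(x), np.nanstd(x)
    # Guard against a constant series, where the z-score is undefined.
    z = np.zeros_like(x) if sigma == 0 else (x - mu) / sigma
    bad = np.isnan(x) | (np.abs(z) > z_max)
    good_idx = np.flatnonzero(~bad)
    cleaned = x.copy()
    # Interpolate flagged positions from the surrounding valid samples.
    cleaned[bad] = np.interp(np.flatnonzero(bad), good_idx, x[good_idx])
    return cleaned, bad
```

The function returns both the cleaned series and the boolean mask of replaced samples, so that the share of corrected data can be reported per customer.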
II-B CMPC Definition
One objective of DR programs is to reduce the critical system peak demand and flatten system load curves by changing customer behaviors. Hence, it is necessary to measure or estimate the proportion of each customer's contribution to the system peak demand. In previous works, it was demonstrated that a linear relationship exists between customer monthly energy consumption and customer peak demand. However, an individual customer's peak demand cannot be employed as a measure of DR potential, because it does not necessarily coincide with the system peak demand. In order to illustrate this, a basic statistical analysis is performed on the available SM dataset. Fig. 2 shows the percentage of customers whose peak demand coincides with the system peak load. On average, only 6% of customers have the same peak time as the system, with a standard deviation of 12. This means that a customer's peak demand cannot be relied upon to estimate its contribution to the overall system peak load. Thus, in this paper, we propose a new metric, denoted as CMPC, to find suitable residential candidates for peak shaving programs by accurately quantifying the contribution of an individual customer to the system peak demand:
$$F_{i,j} = \frac{1}{N_j}\sum_{d=1}^{N_j}\frac{p_{i,j}(t_{d,j}, d)}{P_{d,j}}$$

where CMPC is estimated for the $i$'th customer at the $j$'th month, and is denoted by $F_{i,j}$. Here, $p_{i,j}(t, d)$ is the customer's demand at time $t$ on the $d$'th day of the month, with $N_j$ denoting the total number of days in the month. Note that $P_{d,j}$ and $t_{d,j}$ are the value and the time of system peak demand on the $d$'th day of the $j$'th month. Hence, CMPC is basically the average customer contribution to the daily system peak demand during a month. A few related but different indices can be found in the literature, such as the contribution factor, which is defined as the gap between the aggregate peak demand of a group of customers and their actual consumption at the system peak time. However, the contribution factor cannot be used as a customer-level metric for peak-shaving program targeting due to its inability to quantify individual customers' contributions to the system peak load.
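As an illustration of how this index could be computed from SM data for one month, the sketch below reads CMPC as the monthly average of the daily ratio between a customer's demand at the system peak time and the system peak value. That reading, the function name `cmpc`, and the array layout are assumptions made for this example.

```python
import numpy as np

def cmpc(customer_load, system_load):
    """Coincident monthly peak contribution for one month.

    customer_load: array of shape (n_customers, n_days, n_steps)
    system_load:   array of shape (n_days, n_steps)
    Returns one value per customer.
    """
    customer_load = np.asarray(customer_load, dtype=float)
    system_load = np.asarray(system_load, dtype=float)
    days = np.arange(system_load.shape[0])
    peak_step = system_load.argmax(axis=1)        # time of the daily system peak
    peak_value = system_load[days, peak_step]     # value of the daily system peak
    at_peak = customer_load[:, days, peak_step]   # customer demand at that time
    return (at_peak / peak_value).mean(axis=1)    # average daily contribution
```

If the system load is the sum of the metered customers, the returned values add up to one, which gives a quick consistency check on the input data.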
CMPC can be directly measured for customers with SMs. Considering that not all customers have SMs, we propose a multi-stage data-driven method for estimating the CMPC of customers without SMs. The flowchart of the proposed approach is presented in Fig. 3. Based on the AMI data, the demand profiles of customers are utilized to build a seasonal consumption pattern bank, in which each element is the set of typical daily load profiles for a specific season (detailed in Section III). Seasonal data clustering shows a better load behavior identification performance due to its ability to capture the critical seasonal behaviors of residential customers. According to the consumption pattern bank and customer context information, a classification model is developed to infer the likelihood of the identified seasonal daily consumption profiles for customers without SM data. Then, a series of WCR models are trained using customers' monthly billing data to estimate the CMPC of unobserved customers. In short, the proposed data-driven framework is able to infer the CMPC of customers without SMs using their monthly billing information and limited prior knowledge of customer behaviors.
III Graph Theoretical Clustering Algorithm
In this paper, a graph theory-based clustering technique, known as spectral clustering (SC), is adopted. SC uses seasonal average customer load profiles to identify typical daily load patterns corresponding to different seasons. According to the statistical analysis, both customer behaviors and system peak timing are affected by seasonal changes, as shown in Fig. 4 and Fig. 7. In Fig. 4, different seasonal average load profiles for one household are presented. It can be seen that the peak demand periods and energy levels of this customer vary across seasons. In Fig. 7, the peak time distribution in summer is concentrated around the evening interval (17:00-18:00). Meanwhile, the peak time probability rises during daytime and falls sharply at night. One possible reason is the increase of air conditioning usage during summer daytime. The peak time distribution of winter is also presented in Fig. 7. In contrast to summer, the distribution of peak demand time in winter has two concentration points: one in the morning hours (8:00-12:00) and the other in the evening (18:00-20:00). Also, the peak time probability shows relatively low values during the afternoon interval (13:00-17:00). Hence, in this work, instead of assigning a single pattern to each customer, various patterns are obtained for different seasons to capture the seasonality of customer behaviors. Compared to conventional clustering techniques, the main difference of SC is the transformation of the data clustering problem into a graph partitioning problem. In this paper, to avoid errors caused by manual parameter selection, we employ an automatic neighbor detection method in the SC algorithm. The details of the SC algorithm are shown in Algorithm 1.
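Since Algorithm 1 is not reproduced in this text, the following is only a generic sketch of spectral clustering with locally scaled affinities, in the spirit of the normalized spectral clustering and self-tuning approaches cited in the references. The function names, the choice of the 7th nearest neighbor as local scale, and the simple k-means step with farthest-point initialization are all assumptions of this example, not the paper's exact procedure.

```python
import numpy as np

def self_tuning_affinity(X, n_neighbors=7):
    """Gaussian affinity with a per-point scale set by the distance
    to the n_neighbors-th nearest neighbor (no global bandwidth)."""
    X = np.asarray(X, dtype=float)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    d = np.sqrt(d2)
    # Column 0 of each sorted row is the point itself (distance 0).
    sigma = np.sort(d, axis=1)[:, min(n_neighbors, len(X) - 1)]
    sigma = np.maximum(sigma, 1e-12)
    A = np.exp(-d2 / np.outer(sigma, sigma))
    np.fill_diagonal(A, 0.0)
    return A

def spectral_embedding(A, k):
    """Row-normalized top-k eigenvectors of D^{-1/2} A D^{-1/2}."""
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    L = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    Y = vecs[:, -k:]
    return Y / np.maximum(np.linalg.norm(Y, axis=1, keepdims=True), 1e-12)

def kmeans(Y, k, n_iter=100):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [Y[0]]
    for _ in range(1, k):
        d = np.min([((Y - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Y[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new = np.array([Y[labels == c].mean(axis=0) if np.any(labels == c)
                        else centers[c] for c in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

def spectral_clustering(X, k, n_neighbors=7):
    """Cluster the rows of X (e.g. seasonal average load profiles)."""
    A = self_tuning_affinity(X, n_neighbors)
    return kmeans(spectral_embedding(A, k), k)
```

Here each row of `X` would be one customer's seasonal average daily load profile, and the typical load pattern of a cluster can be taken as the mean of its member profiles.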
The SC algorithm has two main advantages. First, it relies mainly on the similarity matrix of the dataset rather than using the high-dimensional demand profile data directly; computing the eigenvalues of the similarity matrix for data reconstruction is equivalent to achieving dimension reduction by employing a linear PCA in a high-dimensional kernel space. Second, as a basic idea of SC, the graph partitioning problem can be solved without making any assumptions on the data distribution. This improves the robustness of SC, and leads to better clustering performance for complex and unknown data structures. The main challenge of SC is that the number of clusters still needs to be determined a priori. To obtain the optimal number of clusters, we employ the Davies-Bouldin validation index (DBI), which aims to maximize the internal consistency of each cluster and minimize the overlap of different clusters. The optimal number of clusters is obtained when the DBI is minimized. This is shown in Fig. 8 for the summer data subset.
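The cluster-number selection described above can be sketched with a direct implementation of the standard Davies-Bouldin definition (mean over clusters of the worst ratio of summed within-cluster scatter to centroid distance). The helper `select_k` and its `cluster_fn` argument are hypothetical names introduced for illustration; any clustering routine returning integer labels could be passed in.

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index; lower values mean compact, well-separated clusters.
    Requires at least two clusters."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])
    # Average distance of the members of each cluster to its centroid.
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(ids)
    ])
    worst = []
    for i in range(len(ids)):
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(len(ids)) if j != i
        ]
        worst.append(max(ratios))
    return float(np.mean(worst))

def select_k(X, candidates, cluster_fn):
    """Return the candidate cluster number with the smallest DBI, plus all scores."""
    scores = {k: davies_bouldin(X, cluster_fn(X, k)) for k in candidates}
    return min(scores, key=scores.get), scores
```

Running `select_k` once per seasonal subset would yield a possibly different number of typical load patterns for each season.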
IV Estimation of CMPC for Unobserved Customers
In order to assess the potential of unobserved residential customers, a WCR approach is proposed to estimate the CMPC of unobserved customers using only their monthly consumption information, as shown in Fig. 9. This framework includes two stages: the first stage is unobserved customer classification based on the seasonal typical consumption pattern bank, and the second stage is cluster-based CMPC inference.
IV-A Unobserved Customer Classification
Since the detailed time-series SM data of unobserved customers is not available, their daily consumption patterns cannot be determined directly. To link the typical load patterns obtained from the SC technique to unobserved customers, a pattern classification model is developed. The goal of this model is to design a classifier that is able to distinguish different behavioral classes based on an input vector $\mathbf{x}_i$. The proposed model maps our prior information on individual customer peak demand times to the typical daily pattern databank. The basic idea is that the typical daily load profiles of unobserved customers can be discovered using prior knowledge of their peak consumption timing. In this paper, our assumption is that this prior knowledge can be obtained using simple load surveys. Such a survey provides knowledge of customer behavior over a few distinctive intervals of the day, namely the morning interval (from 7:00 to 9:00), the afternoon interval (from 12:00 to 14:00), and the evening interval (from 18:00 to 21:00). This prior information is then used to obtain an approximate probability distribution function of customer peak timing, defined as $\mathbf{x}_i = [\pi_i(1), \dots, \pi_i(T)]$, where $\pi_i(t)$ is the probability of the $i$'th customer's peak demand occurring at time instant $t$, with $T$ denoting the maximum number of time points. Thus, the peak timing likelihood distribution, $\mathbf{x}_i$, is utilized as the input of the classification model. This classification model for unobserved customers is developed using the multinomial logistic regression (MLR) algorithm. Compared to other classification methods such as random forests, MLR is able to obtain the likelihood of different typical profiles for customers rather than picking a single consumption pattern from the databank. The probability that the $i$'th customer follows the $k$'th typical load profile can be written as

$$P(y_i = k \mid \mathbf{x}_i) = \frac{\exp(\mathbf{w}_k^{\top}\mathbf{x}_i)}{\sum_{m=1}^{K}\exp(\mathbf{w}_m^{\top}\mathbf{x}_i)}$$

where $y_i$ represents the class of the $i$'th unobserved customer, $K$ is the total number of consumption patterns, $\top$ is the transposition operator, and $\mathbf{w}_k$ is the weight vector corresponding to pattern $k$. The learning parameters are obtained by solving $\hat{\mathbf{W}} = \arg\max_{\mathbf{W}} \ell(\mathbf{W})$ over the training set, where $\ell(\mathbf{W})$ is the classification risk function, defined as follows:

$$\ell(\mathbf{W}) = \sum_{i}\left[\sum_{k=1}^{K} y_i^{(k)}\,\mathbf{w}_k^{\top}\mathbf{x}_i - \log\sum_{k=1}^{K}\exp(\mathbf{w}_k^{\top}\mathbf{x}_i)\right]$$

where $y_i^{(k)}$ is the $k$'th element of $\mathbf{y}_i$, which is a binary string representing customer class membership. To maximize the log-likelihood function, $\ell(\mathbf{W})$, with respect to $\mathbf{W}$, an iteratively reweighted least squares (IRLS) training mechanism was implemented.
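A compact sketch of such a multinomial logistic regression classifier is given below. It is illustrative only: plain gradient ascent on a lightly penalized log-likelihood is used in place of the IRLS training mentioned above, no bias term is included because the peak-timing inputs already sum to one, and the names `fit_mlr` and `predict_proba` as well as the learning rate and penalty are assumptions of this example.

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def fit_mlr(X, y, n_classes, lr=0.5, n_iter=2000, l2=1e-3):
    """Fit class weight vectors by gradient ascent (not IRLS).

    X: (n_customers, n_time_points) peak-timing probability vectors
    y: integer cluster labels of the observed customers
    """
    X = np.asarray(X, dtype=float)
    Y = np.eye(n_classes)[np.asarray(y)]       # one-hot class membership
    W = np.zeros((X.shape[1], n_classes))      # one weight vector per pattern
    for _ in range(n_iter):
        P = softmax(X @ W)
        # Gradient of the mean log-likelihood minus a small L2 penalty.
        grad = X.T @ (Y - P) / len(X) - l2 * W
        W += lr * grad
    return W

def predict_proba(X, W):
    """Probability of each typical load pattern for each customer."""
    return softmax(np.asarray(X, dtype=float) @ W)
```

The rows of `predict_proba` play the role of the cluster membership probabilities that weight the per-cluster regressions in the next stage.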
IV-B Estimation of CMPC for Unobserved Customers
In this paper, a linear clusterwise regression (CWR) model is developed for each cluster in the seasonal pattern bank to estimate the CMPC of customers belonging to that cluster. The proposed metric, CMPC, is a load feature that depends strongly on two variables: daily load profile and demand level. We have observed that a positive correlation exists between CMPC and monthly energy consumption. Meanwhile, the nature of this linear relationship changes with the typical daily load pattern of customers. Hence, for accurate CMPC estimation, we need to assign a separate regression model to each cluster in the pattern databank. This is demonstrated in Fig. 12, where the CMPC and monthly energy consumption of customers belonging to different clusters are shown. As depicted in Fig. 12, the correlation between monthly energy consumption and CMPC differs considerably for customers with two distinct behavioral patterns in the same season.
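Using the cluster probability values obtained from the classification model, $P(y_i = k \mid \mathbf{x}_i)$, i.e., the probability that the $i$'th customer with peak timing input $\mathbf{x}_i$ belongs to cluster $k$, we have utilized a weighted averaging process to combine the outcomes of the CWR units corresponding to different clusters. Employing this weighted averaging process, the estimated CMPC for the $i$'th customer at the $j$'th month, $\hat{F}_{i,j}$, is determined as follows:

$$\hat{F}_{i,j} = \sum_{k=1}^{K} P(y_i = k \mid \mathbf{x}_i)\,\left(a_k E_{i,j} + b_k\right)$$

where $K$ is the number of clusters, $a_k$ and $b_k$ are the regression coefficients for the $k$'th cluster, and $E_{i,j}$ is the customer's monthly consumption level. Hence, the proposed WCR is able to estimate the CMPC of unobserved customers using only their measured monthly consumption within a probabilistic classification setting.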
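The weighted combination of per-cluster linear models can be sketched as below. The function names are hypothetical, ordinary least squares via `np.polyfit` is assumed for each cluster's line, and every cluster is assumed to contain at least two observed customers.

```python
import numpy as np

def fit_cluster_regressions(energy, cmpc_values, labels, n_clusters):
    """Fit one line (slope, intercept) per cluster on observed customers."""
    energy = np.asarray(energy, dtype=float)
    cmpc_values = np.asarray(cmpc_values, dtype=float)
    labels = np.asarray(labels)
    coefs = np.zeros((n_clusters, 2))
    for k in range(n_clusters):
        members = labels == k
        # np.polyfit with deg=1 returns [slope, intercept].
        coefs[k] = np.polyfit(energy[members], cmpc_values[members], deg=1)
    return coefs

def wcr_estimate(energy, probs, coefs):
    """Probability-weighted average of the per-cluster linear predictions.

    energy: (n_customers,) monthly consumption of unobserved customers
    probs:  (n_customers, n_clusters) class probabilities from the classifier
    """
    energy = np.asarray(energy, dtype=float)
    probs = np.asarray(probs, dtype=float)
    per_cluster = energy[:, None] * coefs[None, :, 0] + coefs[None, :, 1]
    return (probs * per_cluster).sum(axis=1)
```

When the classifier is certain about a customer's cluster, the estimate reduces to that cluster's regression line; otherwise it interpolates between the lines in proportion to the class probabilities.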
V Numerical Results
The real distribution system provided by our utility collaborator is fully equipped with SMs and is thus fully observable. This enables us to calculate the exact CMPC of each customer. To test the proposed data-driven framework for partially observable systems, we assume that 40% of customers are unobserved and then compare the estimation results with the actual CMPCs. The data of observed customers (the remaining 60% of the total data) is divided into 4 subsets corresponding to the different seasons of the year for model training.
V-A SC Algorithm Performance
For every subset, the optimal cluster number is determined using DBI and typical load patterns are obtained employing the SC algorithm (detailed in Section III). Fig. 13 and Fig. 14 present the typical load shapes, namely , , …, , and the distribution of the population of customers belonging to each cluster during all the seasons. As shown in the figures, the number of typical load profiles is not the same in different seasons, and the SC approach is able to capture the critical seasonal consumption patterns. In spring, around of customers show typically higher consumption levels during the morning (around 7:00). In contrast, more than of customers have higher energy consumption during the evening (around 20:00). Meanwhile, more than half of customers present low energy consumption values during the afternoon period. The typical load profiles in summer are different from those in spring. Except for , the typical load patterns of of all customers show similar behavioral tendencies. This could be due to air-conditioning load consumption during time intervals with higher temperature. Based on the typical load patterns, the majority of peak demand occurs during the evening interval. For around of customers in summer, the peak time ranges from 17:00 to 19:00. In fall, the number of typical load patterns is larger than in other seasons due to the variability of customer behavior. Compared to summer, when peak demand barely happens in the morning, more than of customers have high consumption at around 7:00 in fall, such as and . Also, around of customers show almost zero consumption from 10:00 to 15:00, and nearly one-third of customers show two peaks, in the morning and evening periods. The winter typical daily patterns are similar to the results of spring since these two seasons have similar weather in the mid-west U.S.
V-B WCR Performance
When the seasonal consumption pattern bank is developed using the SM data of observed customers, the WCR models are utilized to infer the CMPC of unobserved customers.
V-B1 Classification Performance Analysis
For the classification part, the Area under the Curve (AUC) index is employed to assess the performance of the MLR model. AUC is determined as follows:

$$\mathrm{AUC} = \int_{0}^{1}\frac{TP}{TP+FN}\,d\!\left(\frac{FP}{FP+TN}\right) = \int_{0}^{1}\frac{TP}{TP+FN}\,d\!\left(\frac{FP}{N}\right)$$

where TP is the number of True Positives, TN is the number of True Negatives, FP is the number of False Positives, FN is the number of False Negatives, and N is the total number of Negatives.
Compared to the commonly-used accuracy metric, the AUC does not depend on the cut-off value that is applied to the posterior probabilities to evaluate the performance of a classification model.
The meaningful range of AUC is from 0.5 to 1. In order to avoid the overfitting problem, the $k$-fold cross-validation method is applied to the MLR to ensure the randomness of the training set. Based on the prior information on customer peak timing distribution, the MLR achieves an AUC value of 0.7 when assigning daily load patterns to unobserved customers.
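For a single typical load pattern treated one-vs-rest, the AUC can be approximated numerically by sweeping the cut-off over the predicted probabilities and integrating the true positive rate over the false positive rate with the trapezoidal rule. The sketch below assumes binary labels for one pattern; how the per-pattern values are aggregated in the multi-class case is not specified in the text, so that step is left out.

```python
import numpy as np

def roc_auc(y_true, scores):
    """Area under the ROC curve via threshold sweep and trapezoidal rule."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    n_pos, n_neg = y_true.sum(), (~y_true).sum()
    tpr, fpr = [0.0], [0.0]
    # Lower the cut-off step by step, from the highest score to the lowest.
    for t in np.unique(scores)[::-1]:
        predicted = scores >= t
        tpr.append((predicted & y_true).sum() / n_pos)    # TP / (TP + FN)
        fpr.append((predicted & ~y_true).sum() / n_neg)   # FP / N
    tpr, fpr = np.array(tpr), np.array(fpr)
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))
```

Because the whole range of cut-offs is swept, the result does not depend on any single threshold, which is the property that motivates AUC over accuracy.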
V-B2 Regression Performance Analysis
Based on the WCR approach, the CMPC of unobserved customers can be estimated using the monthly billing data. Fig. 15 shows the performance of WCR. As can be seen, the estimated values are able to accurately track the unobserved customers' real contribution to system peak demand. To assess the performance of the model, the goodness-of-fit measure, $R^2$, and the mean absolute percentage error (MAPE) are utilized in this paper. As demonstrated in Fig. 18, for the spring subset, the mean values of $R^2$ and MAPE are and , respectively. These two indices are presented in Table I for all seasons. Hence, the regression model performs well in estimating the CMPC of unobserved customers in this case.
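The two evaluation indices follow their standard definitions and can be computed as in this short sketch, where `y` holds the actual CMPC values and `y_hat` the estimates (both names are chosen for illustration). MAPE is returned in percent and assumes that no actual value is zero.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def mape(y, y_hat):
    """Mean absolute percentage error, in percent."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(100.0 * np.mean(np.abs((y - y_hat) / y)))
```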
V-C Relationship between CMPC and Other DR Potential Assessment Metrics
In this section, we demonstrate that CMPC contains unique information on customers' potential for DR participation. The information captured by CMPC cannot be directly captured by existing metrics, namely customer peak demand and load profile entropy, as discussed below:
1) Customer peak demand and CMPC: The diversity of load behaviors causes individual customer peaks and the system peak to be non-coincident, which leads to a considerable difference between CMPC and customer peak demand values. The ratio of customer peak demand over CMPC is shown in Fig. 19 for our SM dataset. It can be seen that a customer's peak demand can reach five times the customer's actual contribution to the system peak, which shows that these two metrics can differ considerably. This difference needs to be taken into account by utilities when quantifying residential customer potential for peak-shaving programs.
2) Load profile entropy and CMPC: Entropy is a measure of the variability and uncertainty of customer consumption, which is widely used for identifying customers with stable behaviors for DR program targeting. Compared to entropy, CMPC can provide information on contributions of customers to the system peak. Fig. 20 presents the relationship between CMPC and entropy obtained from our SM dataset. It is observed that CMPC and entropy are almost uncorrelated, which means that these two concepts do not contain mutual information and describe different aspects of load behavior. This implies that customers with a high CMPC (i.e., suitable targets for peak-shaving programs) do not necessarily have higher entropy values.
VI Conclusion
In this paper, we have presented a new metric, CMPC, to target customers for DR programs. The CMPC can guide utilities in quantifying contributions of individual customers to the system peak demand. Moreover, using the proposed data-driven framework, the CMPC can be accurately estimated for customers without SMs by using their monthly billing data. It is demonstrated that the CMPC provides utilities with additional actionable information for active energy management and demand-side control programs compared to the previously-proposed metrics. The proposed method is successfully validated on real SM data.
-  Energy Information Administration. (2017) Advanced metering count by technology type. [Online]. Available: https://www.eia.gov/electricity/annual/html/epa_10_10.html
-  K. Dehghanpour, Z. Wang, J. Wang, Y. Yuan, and F. Bu, “A survey on state estimation techniques and challenges in smart distribution systems,” IEEE Trans. Smart Grid, pp. 1–1, 2018.
-  U.S. Department of Energy. (2006, Feb.) Benefits of demand response in electricity markets and recommendations for achieving them. [Online]. Available: https://emp.lbl.gov/sites/default/files/report-lbnl-1252d.pdf
-  M. Albadi and E. El-Saadany, “A summary of demand response in electricity markets,” Electric Power Systems Research, vol. 78, no. 11, pp. 1989–1996, 2008.
-  Q. Cui, X. Wang, X. Wang, and Y. Zhang, “Residential appliances direct load control in real-time using cooperative game,” IEEE Trans. Power Syst., vol. 31, no. 1, pp. 226–233, Jan 2016.
-  G. Chicco and I. Ilie, “Support vector clustering of electrical load pattern data,” IEEE Trans. Power Syst., vol. 24, no. 3, pp. 1619–1628, Aug. 2009.
-  S. M. Bidoki, N. Mahmoudi-Kohan, M. H. Sadreddini, M. Z. Jahromi, and M. P. Moghaddam, “Evaluating different clustering techniques for electricity customer classification,” 2010 IEEE PES Transmission and Distribution Conference and Exposition, pp. 1–5, 2010.
-  M. Koivisto, P. Heine, I. Mellin, and M. Lehtonen, “Clustering of connection points and load modeling in distribution systems,” IEEE Trans. Power Syst., vol. 28, no. 2, pp. 1255–1265, May 2013.
-  S. Haben, C. Singleton, and P. Grindrod, “Analysis and clustering of residential customers energy behavioral demand using smart meter data,” IEEE Trans. Smart Grid, vol. 7, no. 1, pp. 136–144, Jan. 2016.
-  A. Kavousian, R. Rajagopal, and M. Fischer, “Ranking appliance energy efficiency in households: Utilizing smart meter data and energy efficiency frontiers to estimate and identify the determinants of appliance energy efficiency in residential buildings,” Energy and Buildings, vol. 99, pp. 220–230, Apr. 2015.
-  J. Kwac, J. Flora, and R. Rajagopal, “Household energy consumption segmentation using hourly data,” IEEE Trans. Smart Grid, vol. 5, no. 1, pp. 420–430, Jan. 2014.
-  R. Gulbinas, A. Khosrowpour, and J. Taylor, “Segmentation and classification of commercial building occupants by energy-use efficiency and predictability,” IEEE Trans. Smart Grid, vol. 6, no. 3, pp. 1414–1424, May 2015.
-  Y. Yu, G. Liu, W. Zhu, F. Wang, B. Shu, K. Zhang, N. Astier, and R. Rajagopal, “Good consumer or bad consumer: Economic information revealed from demand profiles,” IEEE Trans. Smart Grid, vol. 9, no. 3, pp. 2347–2358, May 2018.
-  D. Cousineau and S. Chartier, “Outlier detection and treatment: a review,” International Journal of Psychological Research, vol. 3, no. 1, pp. 58–67, Jan. 2010.
-  W. Kersting and L. Grigsby, Distribution System Modeling and Analysis. Boca Raton: CRC Press, 2016.
-  R. Li, C. Gu, F. Li, G. Shaddick, and M. Dale, “Development of low voltage network templates—part ii: Peak load estimation by clusterwise regression,” IEEE Trans. Power Syst., vol. 30, no. 6, pp. 3045–3052, Nov. 2015.
-  K. Chen, J. Hu, and Z. He, “Data-driven residential customer aggregation based on seasonal behavioral patterns,” 2017 IEEE Power Energy Society General Meeting, pp. 1–5, Jul. 2017.
-  A. Ng, M. Jordan, and Y. Weiss, “On spectral clustering: analysis and an algorithm,” Advances in Neural Information Processing Systems, pp. 849–856, 2002.
-  L. Zelnik-Manor and P. Perona, “Self-tuning spectral clustering,” Proceedings of the 17th International Conference on Neural Information Processing Systems, pp. 1601–1608, 2004.
-  D. Vercamer, B. Steurtewagen, D. V. den Poel, and F. Vermeulen, “Predicting consumer load profiles using commercial and open data,” IEEE Trans. Power Syst., vol. 31, no. 5, pp. 3693–3701, Sep. 2016.
-  U. Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, Mar. 2007.
-  B. Krishnapuram, L. Carin, M. A. T. Figueiredo, and A. J. Hartemink, “Sparse multinomial logistic regression: Fast algorithms and generalization bounds,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 6, pp. 957–968, Jun. 2005.
-  F. McLoughlin, A. Duffy, and M. Conlon, “A clustering approach to domestic electricity load profile characterisation using smart metering data,” Appl. Energy, vol. 141, pp. 190–199, Mar. 2015.
-  Z. Xu, Z. Hong, Y. Zhang, J. Wu, A. C. Tsoi, and D. Tao, “Multinomial latent logistic regression for image understanding,” IEEE Trans. on Image Process., vol. 25, no. 2, pp. 973–987, Feb. 2016.
-  J. A. Hanley and B. J. McNeil, “The meaning and use of the area under a receiver operating characteristic (roc) curve,” Radiology, vol. 143, no. 1, pp. 29–36, Apr. 1982.
-  D. Thorleuchter and D. V. den Poel, “Predicting e-commerce company success by mining the text of its publicly-accessible website,” Expert Syst. Applicat., vol. 39, no. 17, pp. 13 026–13 034, Dec. 2012.
-  T. G. Dietterich, “Approximate statistical tests for comparing supervised classification learning algorithms,” Neural Comput., vol. 10, no. 7, pp. 1895–1923, Oct. 1998.