1 Introduction
Electricity consumption is growing rapidly. The National Bureau of Statistics of the People’s Republic of China reported that electricity consumption was 5503.21 billion kWh in 2015 [1], which makes improving energy efficiency even more pressing. Understanding electricity consumption behaviors is the key to improving energy efficiency. For example, in Demand Response programs, the pricing strategy must be based on the electricity consumption behaviors of users. Research on electricity consumption behaviors has therefore attracted a lot of attention [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15].
All research on electricity consumption behaviors depends on real electricity load profiles. However, obtaining real electricity load profiles is nontrivial for most researchers. User privacy and the commercial value of the data are the main obstacles to acquiring them. Few electricity load profiles are publicly available. Moreover, those that are available are small and collected only from the residential sector [16, 17, 18].
Data synthesis is one of the best approaches to tackling the lack of data, and its key is a model that preserves the real electricity consumption behaviors. Because the electricity consumption behavior is represented by the electricity load profile, the key to data synthesis is preserving the real features of the electricity load profile. There are two difficulties in modeling electricity load profiles. First, the electricity consumption model should preserve the features of real electricity load profiles, especially the features on different time scales, since the trend of change in electricity load profiles varies between periods. Second, because user activities differ hugely, users have various electricity consumption habits. Thus, the electricity consumption model should cover as many types of users as possible, e.g., residential, industrial, and commercial users. To the best of our knowledge, no existing electricity consumption model meets both requirements.
In this paper, we propose a new approach to modeling electricity load profiles on the basis of the electricity consumption patterns shown in Figure 1. Our approach is as follows. First, user electricity consumption patterns, each of which represents a class of similar electricity load profiles, are extracted from the electricity load profiles on three time scales: per day, per week, and per year. Second, a new hierarchical multimatrices Markov Chain (HMMC) model is proposed to characterize the electricity load profiles in terms of electricity consumption patterns. The model is separated into three corresponding levels: per year, per week, and per day. Every pattern on every time scale is represented by a multimatrices Markov Chain (MMC) model. Meanwhile, the user is modeled by a statistical method. Third, the user model and the HMMC model are used to synthesize scalable electricity load profiles for research. Finally, the proposed method is evaluated on a smart grid community data set provided by Pecan Street Inc [16]. The experiments show that, in comparison with the classic Markov Chain model, our method effectively preserves the real features of electricity load profiles on different time scales.
To promote research on electricity consumption behaviors, we use the HMMC model to characterize two real sets of electricity load profiles. One data set is provided by Pecan Street Inc. It contains the data of more than 800 users and is collected from the residential sector. The other is a confidential data set that we have obtained permission to use. It is collected from the nonresident sectors, and its name cannot be disclosed under the nondisclosure agreements we have signed. It contains the data of more than 80000 users from different industries such as education, finance, and manufacturing. We publish the two trained models online (the models and the generator will be published on http://prof.ict.ac.cn/ soon). Researchers can directly use these trained models to synthesize scalable electricity load profiles.
The paper is structured as follows. Section 2 summarizes the publicly available data sets and the previous work on modeling data. Section 3 proposes our model. Section 4 presents and discusses the results. Section 5 describes the trained models of the electricity consumption data. Section 6 concludes the paper.
2 Related Work
Performing research on electricity consumption behaviors is very difficult without publicly available data sets. As in other domains, such as natural language processing and image processing, the public availability of data sets is fundamental to improving related techniques. There are few publicly available data sets in the energy field. Pecan Street Inc Dataport2017 is a publicly available electricity consumption data set that has involved more than 800 users since 2011 [16]. SustData, collected from 50 families [17], is a free data set containing power usage and related user information. The ECO data set was collected from 6 households in Switzerland over a period of 8 months (from June 2012 to January 2013) [18]. Most of the publicly available electricity data sets are collected for nonintrusive load monitoring (NILM) research. They provide different internal load compositions and related user information. Overall, these data sets contain very few users. Moreover, the public electricity data sets are collected only from the residential sector and are hence insufficient for research on electricity consumption behaviors. For example, power grid planning needs to take as many users of different types as possible into consideration.
Data synthesis is an approach to solving the lack of data. Generative models have attracted a lot of attention, and a number of models have been proposed in different areas. In the field of marketing, the mixture of normal distributions provides a useful extension of the normal distribution for modeling daily changes in market variables [19]. In the field of solar radiation, the Markov model is used to model and generate global solar radiation data [20]. In the field of wind power, chronological or sequential Monte Carlo simulation is applied to model and synthesize wind power time series [21]. However, models of electricity consumption behaviors are rare in the field of energy. Duffy et al. [22] develop a generative model, the first-order Markov Chain model, to model the electricity load profiles of five individual dwelling types in Ireland. However, the first-order Markov Chain model cannot effectively preserve the real features of user electricity load profiles, as shown in Section 4.
This paper fills the research gap mentioned above. We propose an innovative model to synthesize scalable electricity load profiles that contain different types of electricity users. Moreover, the scalable data set synthesized using our approach preserves the real user electricity consumption behaviors.
3 Methodology
Our methodology is as follows. First, we separate the original data into the electricity load profiles and the user data, which represents the user information. Second, given the electricity load profiles, we extract the electricity consumption patterns on different time scales using a clustering algorithm. Third, we propose the HMMC model to model the electricity load profiles. The modeling of the electricity load profiles is separated into three levels: per year, per week, and per day. Every pattern on every time scale is represented by an MMC model. The user data is modeled using a statistical method.
3.1 The electricity consumption model
3.1.1 Extracting the consumption patterns
In order to extract the electricity consumption patterns on different time scales, the electricity load profiles are divided into segments. As shown in Figure 2, each electricity load profile is divided on three time scales: per year, per week, and per day. Dividing a user's electricity load profile per year yields the yearly segments. Likewise, dividing it per week and per day yields the weekly and daily segments, respectively. After the electricity load profiles of all users are handled in this way, we obtain the yearly, weekly, and daily electricity load profile sets of all users. The yearly, weekly, and daily patterns are then extracted from the corresponding electricity load profile sets using the following clustering algorithm.
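The segmentation step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `split_profile` is hypothetical, readings are assumed to arrive every 15 minutes as a flat list, and a year is assumed to be truncated to 52 whole weeks so that the weekly and daily segments line up.

```python
SAMPLES_PER_DAY = 96   # one reading every 15 minutes
DAYS_PER_WEEK = 7
WEEKS_PER_YEAR = 52    # assumption: keep only 52 whole weeks of each year


def split_profile(profile):
    """Split one yearly load profile into yearly, weekly and daily segments.

    `profile` is a flat list of readings. Trailing readings that do not
    fill a whole week are dropped (an assumption of this sketch).
    """
    per_week = SAMPLES_PER_DAY * DAYS_PER_WEEK
    n_weeks = min(len(profile) // per_week, WEEKS_PER_YEAR)
    year = profile[: n_weeks * per_week]
    weeks = [year[i * per_week:(i + 1) * per_week] for i in range(n_weeks)]
    days = [year[i:i + SAMPLES_PER_DAY]
            for i in range(0, len(year), SAMPLES_PER_DAY)]
    return year, weeks, days


# Usage with a made-up profile covering 365 days.
profile = [float(i % SAMPLES_PER_DAY) for i in range(365 * SAMPLES_PER_DAY)]
year, weeks, days = split_profile(profile)
```

Collecting the `year`, `weeks`, and `days` outputs over all users gives the three profile sets that are clustered separately.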
Due to the simplicity and low time complexity of KMeans, a modified KMeans algorithm, called adaptive KMeans, is used to extract the electricity consumption patterns. In order to ensure the similarity of the profiles within the same cluster, two metrics are used in the adaptive KMeans: the standard deviation of the total consumption and the mean of the total consumption. As shown in Algorithm 1, for each cluster in the KMeans result, if the measure derived from these two metrics is bigger than a threshold, the KMeans algorithm is applied again to that cluster, until the measure in every cluster is smaller than the threshold. Our approach can thus automatically determine the number of clusters.
3.1.2 Model the consumption patterns
A Markov Chain is a stochastic process with the Markov property: each state depends only on a fixed number of immediately preceding states. In this paper, when each state depends on only one preceding state, we call the model the classic or first-order Markov Chain. When each state depends on more than one preceding state, we call it a higher-order Markov Chain. The whole of the immediately preceding states is called a preceding subsequence.
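As a point of reference for the discussion that follows, a first-order Markov Chain can be estimated by counting transitions between consecutive states. The sketch below is illustrative only: it assumes the load values have already been discretized into integer states (the discretization is not part of this sketch), and `fit_first_order` is a hypothetical helper name.

```python
from collections import defaultdict


def fit_first_order(sequences):
    """Estimate P(next state | current state) from discretized sequences.

    A single table is shared by every position in the sequence, which is
    exactly what makes the classic chain position-independent.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    probs = {}
    for cur, row in counts.items():
        total = sum(row.values())
        probs[cur] = {nxt: c / total for nxt, c in row.items()}
    return probs


# Two made-up sequences of discretized load states.
sequences = [[0, 1, 2, 1, 0], [0, 1, 1, 2, 0]]
P = fit_first_order(sequences)
```

Here the transitions out of state 1 are pooled over all positions, so the model cannot express that state 1 behaves differently early and late in the sequence.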
For the classic Markov Chain model, there are two defects in modeling an electricity consumption pattern. First, each subsequent state depends only on the immediately preceding one, regardless of the position of the state in the sequence. Thus, a given preceding state changes into a given current state with a single probability, whatever the position in the sequence. However, the position in the sequence (the time of day in daily life) has a huge impact on electricity consumption behavior. For example, although some states in the morning and in the afternoon have the same preceding subsequences, user habits mean that these preceding subsequences usually change into the same subsequent state with different probabilities. For instance, Figure 3 shows an electricity load profile that belongs to the classic Dual Peak Morning & Afternoon electricity consumption pattern. Using the first-order Markov Chain, several points of this profile are considered to have the same preceding subsequence. However, they not only tend to change into different subsequent states, but also change into the same subsequent state with different probabilities. When the preceding subsequences of such points are the same, a higher-order Markov Chain can model both cases: the same preceding subsequence changing into different subsequent states, or into the same subsequent state with different probabilities.
The other defect is that the differences between the electricity load profiles within an electricity consumption pattern can accumulate when a new electricity load profile is synthesized by the classic Markov Chain. Figure 4 shows an electricity consumption pattern with 3 electricity load profiles. When the pattern is modeled by the first-order Markov Chain, we may synthesize a new electricity load profile consisting of 3 parts, each taken from one of the profiles in the pattern: the part of one profile before a first point, the part of another profile between the first point and a second point, and the part of a profile after the second point. Obviously, the new electricity load profile differs significantly from the raw data shown in Figure 4. As a result, when we synthesize the consumption data, we may obtain unreasonable consumption behavior.
To solve the problems mentioned above, we propose two methods. The first is to use a higher-order Markov Chain to model the electricity consumption pattern. When the order is large enough, the same preceding subsequence changes into the same subsequent state with a single probability at different positions of the sequence, so we can ignore the position of the state in the sequence. However, the size of the transition probability matrix grows exponentially with the order. The second is to use the multimatrices Markov Chain (MMC) to model the electricity consumption pattern. MMC consists of many individual transition probability matrices, one created for each pair of adjacent points of the sequence. Thus, for every position of the sequence, a particular preceding subsequence changes into a particular subsequent state with a specific probability. Meanwhile, we found that MMC with a small order can control the accumulation of differences between the electricity load profiles within the pattern when we synthesize a new electricity load profile. The multiple transition probability matrices of an MMC, for a Markov Chain of a given order and a given number of states, are expressed in Equation 1:
(1) 
Here, each matrix is identified by the electricity consumption pattern it belongs to and by its position in the time series.
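A position-dependent chain of this kind can be sketched in a few lines. This is a hedged illustration, not the authors' code: the names `fit_mmc` and `synthesize` are hypothetical, states are assumed to be discretized load levels, all sequences of a pattern are assumed to have equal length, and each table is stored as a dictionary keyed by the preceding subsequence rather than as a dense matrix.

```python
import random
from collections import defaultdict


def fit_mmc(sequences, order=1):
    """Fit one transition table per position t of equal-length sequences.

    tables[t] maps a preceding subsequence (a tuple of `order` states
    ending at position t - 1) to a dict {next_state: probability}.
    """
    length = len(sequences[0])
    tables = {}
    for t in range(order, length):
        counts = defaultdict(lambda: defaultdict(int))
        for seq in sequences:
            counts[tuple(seq[t - order:t])][seq[t]] += 1
        tables[t] = {
            prev: {s: c / sum(row.values()) for s, c in row.items()}
            for prev, row in counts.items()
        }
    return tables


def synthesize(tables, start, length, rng):
    """Sample `length` states; `start` must be an observed prefix of `order` states."""
    seq = list(start)
    order = len(start)
    for t in range(order, length):
        row = tables[t][tuple(seq[t - order:t])]
        states = list(row)
        seq.append(rng.choices(states, weights=[row[s] for s in states])[0])
    return seq


# Three made-up sequences belonging to one pattern.
sequences = [[0, 1, 2, 1], [0, 1, 1, 1], [0, 2, 2, 1]]
tables = fit_mmc(sequences, order=1)
sample = synthesize(tables, [0], 4, random.Random(0))
```

In this toy data, state 1 at the second position is followed by 1 or 2 with equal probability, while state 1 at the third position is always followed by 1. A single shared matrix would average these two behaviors together.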
3.1.3 Model the electricity load profiles
In the data set we investigate, the electricity consumption data is collected every 15 minutes. This makes the yearly electricity load profile a high-dimensional sequence. To model the corresponding pattern of the high-dimensional yearly electricity load profiles with the MMC, we face two challenges.
First, when we use MMC to model an electricity consumption pattern, the number of transition probability matrices increases with the dimension of the electricity load profile. For a yearly electricity load profile, the dimension is usually very large. For instance, the dimension of a yearly electricity load profile is 35040 when the smart meter collects the load profile every 15 minutes, and hence an MMC with 35039 transition probability matrices is needed to model a yearly consumption pattern. When the sampling frequency of smart meters becomes higher, the yearly electricity load profiles will have a much higher dimension, and the number of transition probability matrices in the MMC will be much larger accordingly.
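The figures above follow from simple arithmetic, sketched here for a 365-day year. The helper name `flat_mmc_matrix_count` is hypothetical, and the 1-minute interval in the usage line is a made-up example of a higher sampling frequency.

```python
def flat_mmc_matrix_count(interval_minutes, days=365):
    """Return (profile dimension, number of MMC matrices) for one year.

    One transition matrix is needed per pair of adjacent points, so a
    profile with d points needs d - 1 matrices.
    """
    samples_per_day = 24 * 60 // interval_minutes
    dimension = samples_per_day * days
    return dimension, dimension - 1


dim_15, matrices_15 = flat_mmc_matrix_count(15)  # 15-minute readings
dim_1, matrices_1 = flat_mmc_matrix_count(1)     # hypothetical 1-minute readings
```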
The second challenge is that an unreasonable part of an electricity load profile may be synthesized when two significantly different profiles coexist within a yearly pattern. For example, Figure 5 shows a red curve and a blue curve that are two significantly different daily electricity load profiles on the same date. When we synthesize the electricity load profile using the MMC model, it may generate an unreasonable part of the electricity load profile. Figure 6 shows a negative example: there are two peaks in the synthesized electricity load profile, whereas each load profile in the original data has only one peak, as shown in Figure 5.










To solve the problems mentioned above, we propose an HMMC model, which is separated into three time scales: yearly, weekly, and daily. First, as Figure 7 shows, a bottom-up approach is used to model yearly electricity consumption patterns. Second, the HMMC model, which has a hierarchical structure, is used to model the yearly pattern. Because a yearly pattern consists of only 52 elements, each of which represents a weekly pattern, a yearly pattern can be modeled by an MMC with 51 transition probability matrices on the scale of a year. Likewise, every weekly pattern consists of only 7 elements, each of which represents a daily pattern, so it can be modeled by an MMC with 6 transition probability matrices on the scale of a week. On the scale of a day, every daily pattern consists of only 96 elements, each of which represents the electricity consumption data, and it can be modeled by an MMC with 95 transition probability matrices. Modeling the yearly pattern using the HMMC model can effectively reduce the size of the transition probability matrices.
Instead of directly synthesizing a yearly electricity load profile, the HMMC model first generates the pattern sequences on different time scales. It therefore ensures that each part of the synthesized electricity load profile is reasonable on the scales of day, week, and year. Figure 8 shows how a yearly electricity load profile is synthesized using the HMMC model. First, we select a yearly pattern and synthesize the corresponding weekly pattern sequence. Second, for every weekly pattern in the sequence, we generate the corresponding daily pattern sequence. Then, for every daily pattern, we synthesize the electricity load profile. Finally, a yearly electricity load profile is generated by concatenating the daily electricity load profiles in chronological order.
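The top-down synthesis just described can be sketched as three nested sampling loops. The sketch below is a simplified illustration under stated assumptions, not the published generator: it uses a first-order MMC at every level, stores each table as a list of observed successors (so uniform sampling from the list reproduces the empirical probabilities), refits the tables on every call for brevity, and uses toy sizes instead of 52 weeks, 7 days, and 96 readings. All function and variable names are hypothetical.

```python
import random


def fit_mmc(sequences):
    """First-order MMC: tables[t][prev] lists the states observed at
    position t + 1 after `prev` at position t."""
    tables = []
    for t in range(1, len(sequences[0])):
        table = {}
        for seq in sequences:
            table.setdefault(seq[t - 1], []).append(seq[t])
        tables.append(table)
    return tables


def sample(sequences, rng):
    """Draw a start state from the data, then walk the per-position tables."""
    seq = [rng.choice(sequences)[0]]
    for table in fit_mmc(sequences):
        seq.append(rng.choice(table[seq[-1]]))
    return seq


def synthesize_year(yearly, weekly, daily, rng):
    """yearly: sequences of weekly-pattern ids for the chosen yearly pattern;
    weekly[w]: sequences of daily-pattern ids for weekly pattern w;
    daily[d]: sequences of discretized load states for daily pattern d."""
    profile = []
    for w in sample(yearly, rng):                  # weekly pattern sequence
        for d in sample(weekly[w], rng):           # daily pattern sequence
            profile.extend(sample(daily[d], rng))  # readings of one day
    return profile


# Toy data: 2 weeks x 3 days x 4 readings instead of 52 x 7 x 96.
yearly = [["wA", "wB"], ["wA", "wA"]]
weekly = {"wA": [["d1", "d1", "d2"]], "wB": [["d2", "d2", "d1"]]}
daily = {"d1": [[0, 1, 2, 1], [0, 2, 2, 1]], "d2": [[1, 1, 3, 0]]}
profile = synthesize_year(yearly, weekly, daily, random.Random(0))
```

Because each day is generated from the tables of a single daily pattern, readings from two very different daily profiles are never spliced together within one day.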
3.2 The user model
The user model is a stochastic one, derived from the statistics of the user information. Anonymous user information is one of the most important parts of a public data set. Multinomial logistic regression is used to calculate the likelihood of a user matching a particular yearly pattern [8]. The multinomial logistic regression is expressed in Equation 2:
(2)
Here, a user is represented by a multidimensional variable whose components are the attributes of the user. The model has a constant term and a set of regression coefficients, and its output is the likelihood of the user matching a particular yearly pattern.
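For illustration, the likelihood computation can be sketched with the usual softmax form of multinomial logistic regression. This form is an assumption of the sketch and is not necessarily identical to Equation 2. The function name `pattern_likelihoods` and the parameter layout (one constant and one coefficient list per yearly pattern) are hypothetical.

```python
import math


def pattern_likelihoods(x, intercepts, coefficients):
    """Softmax form of multinomial logistic regression.

    x: list of user attributes. intercepts[c] and coefficients[c] are the
    constant and the regression coefficients for yearly pattern c.
    Returns one probability per pattern; the probabilities sum to 1.
    """
    scores = [b0 + sum(b * xi for b, xi in zip(bs, x))
              for b0, bs in zip(intercepts, coefficients)]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up user with one attribute and two candidate yearly patterns.
likelihoods = pattern_likelihoods([2.0], [0.0, 0.0], [[1.0], [0.0]])
```

A synthesized user can then be assigned a yearly pattern by sampling from the returned probabilities.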
4 Results and discussion
In this section, the electricity consumption patterns are extracted by the clustering algorithm described in Section 3.1.1. For comparison, the electricity load profiles are modeled by our HMMC model and by the first-order Markov Chain.
4.1 Extracting electricity consumption patterns
For electricity users, the electricity consumption behavior is highly variable. In general, clustering electricity load profiles results in either a massive number of groups or huge variances within a group. We set the threshold after making a tradeoff between the cluster size and the variances.
The adaptive KMeans algorithm is applied to the data set on each time scale. Due to the variability of the electricity consumption behavior, we get 22550 clusters in terms of daily patterns. Among them, 12155 clusters contain only 1 daily electricity load profile. Likewise, we get 4013 clusters in terms of weekly patterns and 142 clusters in terms of yearly patterns. Figure 9 shows examples of typical patterns on different time scales.
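The adaptive KMeans step applied here can be sketched as follows. Several details are assumptions of this sketch rather than the authors' Algorithm 1: the measure compared with the threshold is taken to be the standard deviation of the total consumption divided by its mean, an over-threshold cluster is split into two parts at a time, and a halving fallback is used when KMeans fails to split a cluster. All names are hypothetical.

```python
import random
import statistics


def kmeans(profiles, k, rng, iterations=20):
    """Plain KMeans on equal-length lists; returns the non-empty clusters."""
    centers = rng.sample(profiles, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in profiles:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        centers = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else c
            for cl, c in zip(clusters, centers)
        ]
    return [cl for cl in clusters if cl]


def spread(cluster):
    """Standard deviation of the total consumption divided by its mean."""
    totals = [sum(p) for p in cluster]
    mean = statistics.fmean(totals)
    return statistics.pstdev(totals) / mean if mean else 0.0


def adaptive_kmeans(profiles, threshold, rng):
    """Keep re-clustering any cluster whose spread exceeds `threshold`."""
    pending, done = [list(profiles)], []
    while pending:
        cluster = pending.pop()
        if len(cluster) == 1 or spread(cluster) <= threshold:
            done.append(cluster)
            continue
        parts = kmeans(cluster, 2, rng)
        if len(parts) < 2:  # degenerate split: halve by total consumption
            ordered = sorted(cluster, key=sum)
            parts = [ordered[: len(ordered) // 2], ordered[len(ordered) // 2:]]
        pending.extend(parts)
    return done


# Made-up two-point profiles with three clearly different consumption levels.
profiles = [[1, 1], [1.1, 0.9], [5, 5], [5.2, 4.9], [10, 10]]
clusters = adaptive_kmeans(profiles, threshold=0.05, rng=random.Random(0))
```

The number of clusters is not fixed in advance. It emerges from the threshold, which is why a tight threshold on highly variable data produces many small clusters, including single-profile ones.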
4.2 Modeling the electricity consumption data
In this paper, for every yearly pattern, we generate the electricity consumption data using the third-order HMMC model and the classic Markov model. As shown in Table 1, there are 9 data sets belonging to three yearly patterns. For each yearly pattern, one data set is the raw data, one is synthesized using the HMMC model, and one is synthesized using the classic Markov Chain model.
As Table 1 shows, the classic Markov model can preserve some features of the original electricity load profiles. For example, several of its consumption metrics are very similar to those of the raw data. In terms of the metrics on the total consumption, such as its standard deviation and its biggest and smallest values, the HMMC model performs much better than the classic Markov Chain model in the experiments: these metrics are closer to the raw data with the HMMC model than with the classic Markov Chain model. As Figure 10 shows, there are significant differences between the raw data and the data synthesized using the classic Markov Chain model. We notice that there are unreasonable parts in the electricity load profiles synthesized with the classic Markov Chain, while the data synthesized using the HMMC model is very similar to the raw data. The reason is that the differences between the electricity load profiles within the pattern accumulate when electricity load profiles are synthesized using the classic Markov Chain.
Table 1: Metrics of the raw and synthesized data for three yearly patterns.
Data set  Mean  Var  Max  Min  PatVar  MeanTotal  StdTotal  Std/Mean  MaxTotal  MinTotal
Yearly pattern 1, raw  1.7081  283.4584  18.1914  0  17.3386  59687.8133  5622.8101  0.0942  72038.6322  50071.5674
Yearly pattern 1, HMMC  1.7066  289.7538  24.2124  0  18.1202  59636.9444  5523.9014  0.0926  74252.8144  49215.2737
Yearly pattern 1, classic Markov Chain  1.7084  283.7152  18.1909  0  20.7554  59697.8092  934.0054  0.0156  62602.77346  56469.2666
Yearly pattern 2, raw  1.8641  283.5548  25.2317  0  21.2238  65140.3193  5628.9001  0.0864  74914.6572  56859.6346
Yearly pattern 2, HMMC  1.8432  280.1117  27.0431  0  20.0419  64407.6467  5561.7444  0.0864  75686.8546  54852.7722
Yearly pattern 2, classic Markov Chain  1.8629  284.0796  25.2308  0  19.8110  65098.3931  730.0663  0.0112  67690.3262  62424.4531
Yearly pattern 3, raw  1.8343  273.9289  23.1533  0  23.1950  64098.1363  5755.0841  0.0898  73673.1285  58454.8488
Yearly pattern 3, HMMC  1.8203  273.7401  23.1533  0  21.7358  63608.9406  5639.3368  0.0887  74023.4893  57622.6536
Yearly pattern 3, classic Markov Chain  1.8339  274.4035  23.1526  0  20.1366  64083.5409  827.8379  0.0129  66674.0676  61872.8854
Here, the columns are as follows. Mean is the mean consumption per sample, that is, the mean of the total consumption divided by the dimension of the electricity load profile. Var is a metric that measures the variance of electricity consumption. Max is the highest consumption of the electricity load profile and Min is the smallest consumption of the electricity load profile. PatVar is a metric that measures the variance of the electricity load profiles within a pattern, based on the variance of each dimension within the pattern. MeanTotal is the mean of the total consumption. StdTotal is the standard deviation of the total consumption, and Std/Mean is the ratio of StdTotal to MeanTotal. MaxTotal is the biggest total consumption. MinTotal is the smallest total consumption.
Table 2: Metrics of the raw and synthesized data for three weekly patterns and three daily patterns.
Data set  Mean  Var  Max  Min  PatVar  MeanTotal  StdTotal  Std/Mean  MaxTotal  MinTotal
Weekly pattern 1, raw  1.3376  22.7966  12.8959  0  4.3966  898.8339  68.0509  0.0757  1130.3786  754.0151
Weekly pattern 1, HMMC  1.3289  24.3616  12.8956  0  4.6772  892.9220  75.5433  0.0846  1232.0509  715.8526
Weekly pattern 2, raw  1.6015  30.1395  10.7404  0  4.9419  1076.2008  83.2836  0.0774  1261.0420  875.3925
Weekly pattern 2, HMMC  1.6088  30.7867  17.8931  0  5.2270  1081.0966  93.0269  0.0860  1343.5733  818.8636
Weekly pattern 3, raw  0.8965  19.2843  17.9762  0  4.0192  602.4527  52.6302  0.0874  717.0628  485.4517
Weekly pattern 3, HMMC  0.9017  20.1068  17.9762  0  4.0539  605.9462  60.4593  0.0997  804.7128  459.7600
Daily pattern 1, raw  0.6742  2.0997  2.5089  0.0070  0.6003  64.7274  3.4618  0.0535  71.4993  57.5900
Daily pattern 1, HMMC  0.6744  2.0978  2.5088  0.0071  0.5986  64.7439  3.4515  0.0533  72.6561  56.2700
Daily pattern 2, raw  0.5034  2.4070  2.3880  0.0690  0.6355  48.3255  3.3699  0.0697  56.2327  41.1594
Daily pattern 2, HMMC  0.5041  2.4114  2.3879  0.0690  0.6363  48.3937  3.4515  0.0713  56.3205  41.1317
Daily pattern 3, raw  2.1439  8.4492  7.4637  0.1314  2.1745  205.8188  12.1681  0.0591  239.0681  181.9523
Daily pattern 3, HMMC  2.1449  8.4412  7.4634  0.1315  2.1709  205.9077  12.0770  0.0587  239.4820  181.6977
Besides, the HMMC model preserves the real features of electricity load profiles on the scales of week and day. Table 2 shows 12 data sets belonging to 6 different patterns: for each of three weekly patterns and three daily patterns, one data set is the raw data and the other is synthesized using the HMMC model.
As shown in Table 2, the synthesized data is similar to the raw data on the scales of week and day. Figure 11 shows that the weekly patterns synthesized using the HMMC model are similar to the raw electricity load profiles. Likewise, Figure 12 shows that the synthesized daily patterns are similar to the raw electricity load profiles. Overall, the HMMC model preserves the real electricity consumption behavior on different time scales.
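The total-consumption metrics used in this comparison can be computed along the following lines. This is a sketch under assumptions: the function and key names are hypothetical, and the population standard deviation is used because the exact estimator is not stated here.

```python
import statistics


def total_consumption_metrics(profiles):
    """Summarize the total consumption of the profiles within one pattern."""
    totals = [sum(p) for p in profiles]
    mean = statistics.fmean(totals)
    std = statistics.pstdev(totals)  # assumption: population standard deviation
    return {
        "mean_total": mean,
        "std_total": std,
        "std_over_mean": std / mean,
        "max_total": max(totals),
        "min_total": min(totals),
    }


# Made-up pattern with three short profiles.
metrics = total_consumption_metrics([[1, 2, 3], [2, 2, 2], [3, 3, 3]])
```

Computing the same dictionary for the raw profiles and for the synthesized profiles of a pattern gives a direct, per-metric comparison of the two.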
4.3 User information
The user information contains several sensitive items. We filter out most of the sensitive information and keep only nonsensitive information. With the probabilistic model derived from the statistics of the user information, we synthesize the base user information. We then assign a yearly consumption pattern to each user according to the likelihood of the user matching a particular yearly pattern.
5 Public models
Real data sets are the key to data synthesis. We provide two models, each trained on a real data set. The first data set is public, Pecan Street Inc Dataport2017 [16], and contains 711 users. Each user record contains two types of data: user information, such as the type of building, the construction year of the house, and the total square footage; and electricity consumption data, collected every 15 minutes in 2015. The other is a confidential data set, which we have obtained permission to use, and it contains data from 80000 users from the nonresident sectors. Each user record consists of two types of data: user information, including the installation year of the electricity meter, the address code, and the industry code; and electricity consumption data, collected every 15 minutes in 2015.
6 Conclusion
The shortage of electricity load profiles is a huge obstacle to research on electricity consumption behaviors. Data synthesis is one of the best approaches to tackling this obstacle. We propose a hierarchical multimatrices Markov Chain (HMMC) model to synthesize scalable electricity load profiles that preserve the real consumption behaviors on three time scales: per day, per week, and per year. To promote research on electricity consumption behaviors, we use the HMMC model to characterize two distinctive raw sets of electricity load profiles. One is collected from the residential sector, and the other is collected from the nonresident sectors, including different industries such as education, finance, and manufacturing. We publish the two trained models online, and researchers can directly use these trained models to synthesize scalable electricity load profiles for further research.
Acknowledgements
We are very grateful to anonymous reviewers. This work is supported by the Major Program of National Natural Science Foundation of China (Grant No. 61432006), National Key Research and Development Program of China (2016YFB1000600, 2016YFB1000601).
References
 [1] N. B. of Statistics of China, “Annual data.” http://data.stats.gov.cn/easyquery.htm?cn=C01. Accessed February 4, 2017.
 [2] J. L. Viegas, S. M. Vieira, R. Melício, V. Mendes, and J. M. Sousa, “Classification of new electricity customers based on surveys and smart metering data,” Energy, vol. 107, pp. 804–817, 2016.
 [3] J. Kwac and R. Rajagopal, “Targeting customers for demand response based on big data,” Eprint Arxiv, 2014.
 [4] J. Kwac, C. W. Tan, N. Sintov, J. Flora, and R. Rajagopal, “Utility customer segmentation based on smart meter data: Empirical study,” in IEEE International Conference on Smart Grid Communications, pp. 999–1004, 2013.
 [5] Y. Bai, H. Zhong, and Q. Xia, “Realtime demand response potential evaluation: A smart meter driven method,” in IEEE Power and Energy Society General Meeting, pp. 1–5, 2016.
 [6] N. Costa and I. Matos, “Inferring daily routines from electricity meter data,” Energy and Buildings, vol. 110, pp. 294–301, 2016.
 [7] G. W. Hart, “Nonintrusive appliance load monitoring,” Proceedings of the IEEE, vol. 80, no. 12, pp. 1870–1891, 1992.
 [8] F. Mcloughlin, A. Duffy, and M. Conlon, “A clustering approach to domestic electricity load profile characterisation using smart metering data,” Applied Energy, vol. 141, pp. 190–199, 2015.
 [9] I. Benítez, J.L. Díez, A. Quijano, and I. Delgado, “Dynamic clustering of residential electricity consumption time series data based on hausdorff distance,” Electric Power Systems Research, vol. 140, pp. 517–526, 2016.
 [10] J. Kwac, C. W. Tan, N. Sintov, J. Flora, and R. Rajagopal, “Utility customer segmentation based on smart meter data: Empirical study,” in IEEE International Conference on Smart Grid Communications, pp. 999–1004, 2013.

 [11] J. Buitrago, A. Abdulaal, and S. Asfour, “Electric load pattern classification using parameter estimation, clustering and artificial neural networks,” International Journal of Power and Energy Systems, vol. 35, no. 4, pp. 167–174, 2016.
 [12] J. L. Viegas, S. M. Vieira, and J. M. C. Sousa, “Mining consumer characteristics from smart metering data through fuzzy modelling,” in International Conference on Information Processing and Management of Uncertainty in KnowledgeBased Systems, pp. 562–573, 2016.
 [13] S. Shenoy and D. Gorinevsky, “Stochastic optimization of power market forecast using nonparametric regression models,” in IEEE Power and Energy Society General Meeting, pp. 1–5, 2015.
 [14] K. Li, N. Tai, and S. Zhang, “Research and application of climatic sensitive short  term load forecasting,” in IEEE Power and Energy Society General Meeting, pp. 1–5, 2015.
 [15] W. Yang and R. Rajagopal, “Probabilistic baseline estimation via gaussian process,” in IEEE Power and Energy Society General Meeting, pp. 1–5, 2015.
 [16] H. C, “Pecan street inc.: A testbed for nilm,” in International Workshop on NonIntrusive Load Monitoring, 2012.
 [17] L. Pereira, F. Quintal, R. Gonçalves, and N. J. Nunes, “Sustdata: A public dataset for ict4s electric energy research.,” in ICT4S, 2014.
 [18] C. Beckel, W. Kleiminger, R. Cicchetti, T. Staake, and S. Santini, “The eco data set and the performance of nonintrusive load monitoring algorithms,” in ACM Conference on Embedded Systems for EnergyEfficient Buildings, pp. 80–89, 2014.
 [19] J. Wang, “Generating daily changes in market variables using a multivariate mixture of normal distributions,” in Proceedings of the 33nd conference on Winter simulation, pp. 283–289, IEEE Computer Society, 2001.
 [20] B. Ngoko, H. Sugihara, and T. Funaki, “Synthetic generation of high temporal resolution solar radiation data using markov models,” Solar Energy, vol. 103, pp. 160–170, 2014.
 [21] M. Mosayebian, S. Soleymani, S. Mozafari, and H. Shayanfar, “Synthetic generation of wind power time series for wind/storage systems integration studies,” Journal of Renewable and Sustainable Energy, vol. 8, no. 1, p. 013105, 2016.
 [22] A. Duffy, F. Mcloughlin, and M. Conlon, “The generation of domestic electricity load profiles through markov chain modelling,” EuroAsian Journal of Sustainable Energy Development Policy, vol. 3, 2010.