Privacy-Preserving Distributed Clustering for Electrical Load Profiling

02/26/2020 ∙ by Mengshuo Jia, et al. ∙ 0

Electrical load profiling supports retailers and distribution network operators in having a better understanding of the consumption behavior of consumers. However, traditional clustering methods for load profiling are centralized and require access to all the smart meter data, thus causing privacy issues for consumers and retailers. To tackle this issue, we propose a privacy-preserving distributed clustering framework for load profiling by developing a privacy-preserving accelerated average consensus (PP-AAC) algorithm with proven convergence. Using the proposed framework, we modify several commonly used clustering methods, including k-means, fuzzy C-means, and Gaussian mixture model, to provide privacy-preserving distributed clustering methods. In this way, load profiling can be performed only by local calculations and information sharing between neighboring data owners without sacrificing privacy. Meanwhile, compared to traditional centralized clustering methods, the computational time consumed by each data owner is significantly reduced. The privacy and complexity of the proposed privacy-preserving distributed clustering framework are analyzed. The correctness, efficiency, effectiveness, and privacy-preserving feature of the proposed framework and the proposed PP-AAC algorithm are verified using a real-world Irish residential dataset.


I Introduction

A massive amount of fine-grained electricity consumption data is being collected by smart meters. Identifying the load patterns from these smart meter data, i.e., residential load profiling, helps retailers and distribution network operators (DSOs) better understand the consumption behavior of consumers. For example, retailers can provide personalized tariffs for different types of consumers, and the DSO can perform detailed voltage simulation [4] or micro-grid operation [9] of the distribution network based on the identified load patterns.

Ideally, residential load profiling is carried out on a very large and diverse dataset to capture all different types of customers and behaviors. Such a diverse data set is particularly important for retailers and third-party providers, as they wish to design diversified electricity products to attract new consumers. However, residential load data are only monitored or collected by the corresponding retailers, i.e., each retailer only has the data of the consumers it serves. No center has access to all the smart meter data. Besides, since smart meter data contain highly private information about the consumers [17], data sharing between retailers is not allowed. Thus, a privacy-preserving distributed clustering scheme is required, where retailers can cooperate with each other to jointly obtain the clustering results on their union consumption dataset via local calculation and communication. During the cooperation, the information of each retailer, e.g., the raw data or the number of consumers, cannot be deduced by others.

So far, various clustering algorithms have been applied for load profiling, such as hierarchical clustering using different linkages [28], CFSFDP [30], k-means [3], fuzzy C-means algorithm (FCA) [16], Gaussian mixture model (GMM) [25], self-organizing map [22], etc. However, to the best of our knowledge, there is no relevant research on privacy-preserving distributed clustering for load profiling.

To bridge this gap, this paper proposes a privacy-preserving distributed clustering framework for load profiling. This framework can be used to transform three commonly used clustering methods, i.e., k-means, FCA, and GMM, into distributed clustering algorithms for the purpose of privacy-preserving load profiling. Among these three methods, k-means is a ‘hard’ clustering method that delivers deterministic clustering results [12], while FCA and GMM are ‘soft’ methods that provide, respectively, a degree or a probability of membership of each observation in each cluster, which can be leveraged to observe overlapping clusters or uncertain cluster memberships [14].

In fact, many studies on privacy-preserving clustering have been conducted in different fields such as marketing and medicine [23]. Among them, cryptography-based methods are the most commonly used. These methods use secure multiparty computation [15, 5], homomorphic encryption [33, 11], or a combination of both [26] to turn the clustering methods into privacy-preserving k-means [33, 26], privacy-preserving FCA [15], or privacy-preserving GMM [5, 11]. However, the methods using secure multiparty computation are extremely computationally expensive [21]. Besides, the encryption overhead of homomorphic encryption also limits the scope of the corresponding clustering methods [6] and results in time-consuming computations [18]. To reduce overheads, secret sharing can be adopted to design privacy-preserving k-means clustering [29, 21]. However, these secret-sharing-based methods, like the aforementioned cryptography-based methods, are not fully distributed algorithms, because each party (a data owner, like the retailers in this paper) either has to interact with a data center [33, 29, 15], or has to communicate with all the other parties [21, 26], or has to share its information along a pre-selected information transmission path [5, 11, 6]. These algorithms have the following drawbacks: (1) the existence of a data center or a preset information sharing path greatly increases the risk of a single point or single line of failure; (2) the full communication between any two parties results in low scalability.

The proposed privacy-preserving distributed clustering framework aims to solve the above issues. We first perform commonality analysis of the traditional k-means, FCA, and GMM, and point out that the key to the clustering framework lies in how to calculate the summation of retailers’ private information in a fully distributed and privacy-preserving way. The average consensus (AC) algorithm, as an important fully distributed computing method in the automatic control area, provides the means to achieve the summation. However, the slow rate of its convergence towards the average is the major deficiency of this algorithm [2]. Besides, the AC algorithm will reveal the private information available to the retailers during the interaction between neighbors. Therefore, we first introduce an accelerated AC (AAC) algorithm to significantly improve the rate of convergence without sacrificing the simplicity of the original AC algorithm [2]. Then, we adapt the AAC algorithm to provide a privacy-preserving version by leveraging the exponentially decaying disturbance with zero-sum property proposed in [7]. The convergence of the proposed privacy-preserving AAC (PP-AAC) algorithm is also proved. After that, we develop the privacy-preserving distributed clustering framework based on the proposed algorithm. This framework can convert the traditional k-means, FCA, and GMM into fully distributed privacy-preserving clustering methods, where each retailer only needs to communicate with its surrounding neighbors to obtain the exact load pattern identification results of all the consumers. Finally, we provide the privacy and complexity analyses of the proposed framework.

This paper makes the following contributions:

  • Propose a privacy-preserving distributed clustering framework for load profiling. This framework is based on an original PP-AAC algorithm, which is theoretically proven to be convergent.

  • Provide the privacy and complexity analyses of the proposed framework theoretically and practically. Results show that this framework not only protects the data privacy of retailers but also greatly reduces the computational overhead.

  • Develop the privacy-preserving distributed k-means, FCA, and GMM clustering methods using the proposed framework. These methods are applied to identify electrical load patterns, and their results are the same as those of the centralized clustering methods.

To the best of our knowledge, this is the first time that the electrical load data has been analyzed using privacy-preserving distributed clustering methods.

The rest of this paper is organized as follows. Section II analyzes the commonality of k-means, FCA, and GMM. The PP-AAC algorithm is proposed in Section III. Section IV develops the privacy-preserving distributed clustering framework for the three clustering methods. Case studies are provided in Section V, and Section VI concludes this paper.

II Problem Formulation

This section first briefly reviews the standard clustering methods, namely k-means [14], FCA [31], and GMM [24], and then analyzes what they have in common. Before that, we assume that the union data set consists of $N$ observations. These observations are distributed among $M$ retailers, where retailer $i$ has $N_i$ consumers, i.e., $N_i$ observations. Besides, the centroid of cluster $k$, described by $\boldsymbol{\mu}_k$, is considered as the $k$-th load pattern of the union data set.

II-A K-means

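K-means partitions the $N$ observations into $K$ clusters by minimizing the within-cluster variances as follows:

$$\min_{\{\mathcal{C}_k\}} \sum_{k=1}^{K} \sum_{(i,j) \in \mathcal{C}_k} \left\| \boldsymbol{x}_{i,j} - \boldsymbol{\mu}_k \right\|^2$$

where $\boldsymbol{x}_{i,j}$ is the $j$-th observation of retailer $i$, $\mathcal{C}_k$ represents the index set of the observations belonging to cluster $k$, and $\boldsymbol{\mu}_k$ is the centroid of cluster $k$.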

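Although finding the optimal solution is NP-hard, Lloyd’s algorithm is guaranteed to converge to a local minimum, typically within a few iterations [14]. First, the $K$ initial cluster centroids are assigned arbitrarily. Then, in each iteration, the cluster index $c_{i,j}$ of observation $\boldsymbol{x}_{i,j}$ is computed by

$$c_{i,j} = \arg\min_{k \in \{1,\dots,K\}} \left\| \boldsymbol{x}_{i,j} - \boldsymbol{\mu}_k \right\|^2 \tag{1}$$

and the centroid of cluster $k$ is updated by

$$\boldsymbol{\mu}_k = \frac{\sum_{i=1}^{M} \boldsymbol{a}_{i,k}}{\sum_{i=1}^{M} b_{i,k}} \tag{2}$$

where the local sums of retailer $i$ are

$$\boldsymbol{a}_{i,k} = \sum_{j=1}^{N_i} \delta_{i,j,k}\, \boldsymbol{x}_{i,j} \tag{3}$$

$$b_{i,k} = \sum_{j=1}^{N_i} \delta_{i,j,k} \tag{4}$$

These two steps are repeated until convergence is achieved. Note that $\delta_{i,j,k}$ equals 1 if $c_{i,j} = k$ and 0 otherwise.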
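The split of one Lloyd iteration into per-retailer sums and a global division can be illustrated with a minimal sketch. The function names, the toy data, and the handling of empty clusters are assumptions made for illustration only, not part of the original method description.

```python
import math

def assign(x, centroids):
    """Index of the nearest centroid (Euclidean distance, same argmin as squared distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(x, centroids[k]))

def local_results(data, centroids):
    """Per-retailer sums: a[k] = sum of member observations, b[k] = member count."""
    dim = len(centroids[0])
    a = [[0.0] * dim for _ in centroids]
    b = [0] * len(centroids)
    for x in data:
        k = assign(x, centroids)
        b[k] += 1
        a[k] = [s + v for s, v in zip(a[k], x)]
    return a, b

def update_centroids(results, centroids):
    """Global step: summed a[k] divided by summed b[k]; empty clusters keep their centroid."""
    new = []
    for k, old in enumerate(centroids):
        total_b = sum(b[k] for _, b in results)
        total_a = [sum(a[k][d] for a, _ in results) for d in range(len(old))]
        new.append([v / total_b for v in total_a] if total_b else list(old))
    return new

# Hypothetical toy data: two retailers holding 2-D observations.
retailers = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 10.0], [10.0, 12.0], [2.0, 0.0]]]
centroids = [[1.0, 1.0], [9.0, 9.0]]
results = [local_results(data, centroids) for data in retailers]
new_centroids = update_centroids(results, centroids)
```

Running the same two functions on the pooled data gives identical centroids, which is the property the distributed variant relies on.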

II-B FCA

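FCA is the best-known method for fuzzy clustering, with the objective function given as follows:

$$\min \sum_{k=1}^{K} \sum_{i=1}^{M} \sum_{j=1}^{N_i} u_{i,j,k}^{m} \left\| \boldsymbol{x}_{i,j} - \boldsymbol{\mu}_k \right\|^2$$

where $m > 1$ is the fuzziness index and $u_{i,j,k}$ is the degree to which observation $\boldsymbol{x}_{i,j}$ (the $j$-th observation of retailer $i$) belongs to cluster $k$ with centroid $\boldsymbol{\mu}_k$. The following iterative procedure solves this problem: the degree to which the observation $\boldsymbol{x}_{i,j}$ belongs to cluster $k$ is first calculated by

$$u_{i,j,k} = \left[ \sum_{l=1}^{K} \left( \frac{\left\| \boldsymbol{x}_{i,j} - \boldsymbol{\mu}_k \right\|}{\left\| \boldsymbol{x}_{i,j} - \boldsymbol{\mu}_l \right\|} \right)^{\frac{2}{m-1}} \right]^{-1} \tag{5}$$

Then, the centroid of cluster $k$ is updated by

$$\boldsymbol{\mu}_k = \frac{\sum_{i=1}^{M} \boldsymbol{a}_{i,k}}{\sum_{i=1}^{M} b_{i,k}} \tag{6}$$

where the local sums of retailer $i$ are

$$\boldsymbol{a}_{i,k} = \sum_{j=1}^{N_i} u_{i,j,k}^{m}\, \boldsymbol{x}_{i,j} \tag{7}$$

$$b_{i,k} = \sum_{j=1}^{N_i} u_{i,j,k}^{m} \tag{8}$$

Unlike k-means, where each observation either belongs to a cluster or not, FCA assigns each observation a degree of membership in every cluster, i.e., FCA is a type of soft clustering.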
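As an illustration of the soft assignment, the following sketch implements the standard fuzzy C-means membership rule and the centroid update from per-retailer weighted sums. The fuzziness index m = 2, the toy data, and all function names are assumptions chosen for the example.

```python
import math

def memberships(x, centroids, m=2.0):
    """Membership degrees of one observation in every cluster (standard FCM rule)."""
    d = [math.dist(x, c) for c in centroids]
    if 0.0 in d:  # observation sits exactly on a centroid: full membership there
        hit = d.index(0.0)
        return [1.0 if k == hit else 0.0 for k in range(len(d))]
    p = 2.0 / (m - 1.0)
    return [1.0 / sum((dk / dl) ** p for dl in d) for dk in d]

def local_results(data, centroids, m=2.0):
    """Per-retailer sums: a[k] = sum of u^m * x, b[k] = sum of u^m."""
    dim = len(centroids[0])
    a = [[0.0] * dim for _ in centroids]
    b = [0.0] * len(centroids)
    for x in data:
        for k, u in enumerate(memberships(x, centroids, m)):
            w = u ** m
            b[k] += w
            a[k] = [s + w * v for s, v in zip(a[k], x)]
    return a, b

def update_centroids(results):
    """Global step: summed a[k] divided by summed b[k] over all retailers."""
    n_clusters, dim = len(results[0][1]), len(results[0][0][0])
    return [[sum(a[k][d] for a, _ in results) / sum(b[k] for _, b in results)
             for d in range(dim)] for k in range(n_clusters)]

# Hypothetical toy data: two retailers holding 2-D observations.
retailers = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 10.0], [10.0, 12.0], [2.0, 0.0]]]
centroids = [[1.0, 1.0], [9.0, 9.0]]
new_centroids = update_centroids([local_results(d, centroids) for d in retailers])
```

The memberships of each observation sum to one, and only the two weighted sums per cluster need to leave a retailer.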

II-C GMM

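As a convex combination of $K$ Gaussian components with weights $\pi_k$, means $\boldsymbol{\mu}_k$, and covariances $\boldsymbol{\Sigma}_k$, GMM is given by

$$p(\boldsymbol{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}\!\left(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right) \tag{9}$$

where each Gaussian component represents a cluster.

To divide the union data set into $K$ clusters by GMM, one should train the GMM by maximum likelihood estimation, which is given as follows:

$$\max_{\{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}} \sum_{i=1}^{M} \sum_{j=1}^{N_i} \ln \sum_{k=1}^{K} \pi_k\, \mathcal{N}\!\left(\boldsymbol{x}_{i,j} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right)$$

$$\text{s.t.} \quad \sum_{k=1}^{K} \pi_k = 1, \quad \pi_k \ge 0$$

where $\boldsymbol{x}_{i,j}$ is the $j$-th observation of retailer $i$. The most commonly used maximum likelihood estimation method is the expectation-maximization (EM) algorithm [24], which can be summarized as two iterative steps: the E-step and the M-step. The E-step, as given in

$$\gamma_{i,j,k} = \frac{\pi_k\, \mathcal{N}\!\left(\boldsymbol{x}_{i,j} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right)}{\sum_{l=1}^{K} \pi_l\, \mathcal{N}\!\left(\boldsymbol{x}_{i,j} \mid \boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l\right)} \tag{10}$$

computes the probability $\gamma_{i,j,k}$ that observation $\boldsymbol{x}_{i,j}$ belongs to cluster $k$. The M-step updates the parameters in (9) according to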

(11)
(12)
(13)
(14)
(15)
(16)

After convergence, the final parameters of (9) are reached. The final probability that an observation belongs to each cluster can be obtained by substituting the final parameters into (10). Like FCA, GMM is a soft clustering method.
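A minimal one-dimensional sketch of one EM iteration is given below to show how the M-step can be driven by per-retailer sums. It is an assumption-laden illustration: scalar observations, variances instead of covariance matrices, and the choice of the three sums (responsibilities, responsibility-weighted values, and responsibility-weighted squares) are made here for brevity and are not claimed to match the exact form of (11)-(16).

```python
import math

def normal_pdf(x, mean, var):
    """Density of a scalar Gaussian."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def e_step(x, weights, means, variances):
    """Responsibilities of one scalar observation for each component."""
    num = [w * normal_pdf(x, mu, v) for w, mu, v in zip(weights, means, variances)]
    total = sum(num)
    return [n / total for n in num]

def local_stats(data, weights, means, variances):
    """Per-retailer sums of gamma, gamma * x and gamma * x^2 for every component."""
    n_comp = len(weights)
    s0, s1, s2 = [0.0] * n_comp, [0.0] * n_comp, [0.0] * n_comp
    for x in data:
        for k, g in enumerate(e_step(x, weights, means, variances)):
            s0[k] += g
            s1[k] += g * x
            s2[k] += g * x * x
    return s0, s1, s2

def m_step(stats):
    """Combine all retailers' sums into new weights, means and variances."""
    n_comp = len(stats[0][0])
    t0 = [sum(s[0][k] for s in stats) for k in range(n_comp)]
    t1 = [sum(s[1][k] for s in stats) for k in range(n_comp)]
    t2 = [sum(s[2][k] for s in stats) for k in range(n_comp)]
    n_total = sum(t0)
    new_means = [t1[k] / t0[k] for k in range(n_comp)]
    new_vars = [t2[k] / t0[k] - new_means[k] ** 2 for k in range(n_comp)]
    return [t / n_total for t in t0], new_means, new_vars

# Hypothetical toy data: two retailers, two components.
retailers = [[0.0, 0.2, 5.0], [4.8, 5.2, -0.1]]
params = ([0.5, 0.5], [0.0, 5.0], [1.0, 1.0])
weights, means, variances = m_step([local_stats(d, *params) for d in retailers])
```

Only the sums returned by `local_stats` are combined across retailers; the raw observations never appear in `m_step`.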

II-D Commonality Analysis

The clustering procedures of k-means, FCA, and GMM have two points in common, which are stated in Remarks 2.1 and 2.2.

Remark 2.1: The clustering processes of k-means, FCA, and GMM can all be summarized in two parts: the local calculation part and the global calculation part, where the local one can be performed by each retailer, and the global one is essentially the summation of each retailer’s local calculation results.

In fact, each retailer can directly perform the first steps of the three algorithms via its own data, i.e., the calculation in (1), (5) or (10). Then, retailer is able to compute the following local results :

(17)

depending on the algorithm used. Once each retailer obtains the local results, the global summation of those local results from all retailers is required to continue the clustering method. For example, k-means algorithm needs to sum the local results and of all retailers respectively to update the centroid of cluster in (2). Let be the global summation result, then we have:

(18)

Therefore, the relationship between the local and the global calculation parts can be generalized to:

(19)

where can be calculated by each retailer locally using (17), while the computation of requires cooperation among all retailers. Once in (18) is obtained, the second steps of the three algorithms can be carried out and the iterative procedure continues.

Remark 2.2: Each retailer’s local calculation results from k-means, FCA, and GMM contain private information, so that retailer will refuse to share its with others.

In fact, if retailer shares its () with retailer , the latter can derive the following private information of retailer :

II-D1 The number of retailer ’s consumers

Once retailer has received () in (4), (8) or (15), it can compute the number via:

II-D2 The proportion or number of retailer ’s consumers belonging to cluster

Once retailer receives , it will also obtain the proportion of retailer ’s consumers belonging to cluster by:

Particularly, retailer can directly know the specific number of retailer ’s consumers belonging to cluster by receiving in (4).

II-D3 Retailer ’s local load pattern of cluster

Once retailer has received in (3), (7) or (14), along with in hand, retailer can compute the local centroid of retailer in cluster by:

which will reveal the approximate load pattern of retailer . For example, we choose in (3) and in (4), then is essentially the mean of retailer ’s observations belonging to cluster , which can be considered as its approximate load pattern in cluster . The approximation lies in the fact that and are calculated using the global centroid in (2) in the last iteration, not the local centroid of retailer in the last iteration; otherwise it will be the exact load pattern based on retailer ’s data set.

Definition 2.3: We define the “privacy” of retailer () as the information set .

Clearly, retailer will not let retailer obtain . As a result, directly sharing will be refused by retailer , impeding the implementation of the key summation in (18) for the three algorithms. Therefore, a privacy-preserving distributed summation algorithm to compute (18) is required.
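The three leaks described above can be made concrete for the k-means case with a short sketch. It assumes that a neighbor has received, for every cluster, a retailer's per-cluster sum of member observations and per-cluster member count; the function name and the numbers are hypothetical.

```python
def leaked_information(a, b):
    """What a receiver can derive from unmasked k-means local results.

    a[k]: sum of the sender's observations assigned to cluster k
    b[k]: number of the sender's observations assigned to cluster k
    """
    n_consumers = sum(b)                                # total number of consumers
    proportions = [bk / n_consumers for bk in b]        # share of consumers per cluster
    local_centroids = [[v / bk for v in ak] if bk else None
                       for ak, bk in zip(a, b)]         # approximate local load patterns
    return n_consumers, proportions, local_centroids

# Hypothetical local results of one retailer for two clusters.
a = [[2.0, 0.0], [20.0, 22.0]]
b = [1, 2]
n_consumers, proportions, local_centroids = leaked_information(a, b)
```

This is why the raw local results cannot simply be exchanged and a masked summation is needed instead.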

III PP-AAC Algorithm

To achieve a distributed summation algorithm, this section first introduces an AAC algorithm with a fast convergence rate [2]. After that, we further improve the AAC algorithm by leveraging an exponentially decaying disturbance with zero-sum property to propose a PP-AAC algorithm. Finally, the convergence of the proposed algorithm is proved.

III-A AAC Algorithm

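The AAC algorithm is graph-theory-based. Therefore, we consider a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ consisting of nodes and edges. Each node represents a retailer, and an edge between a pair of nodes means that there is bidirectional noise-free communication between the two retailers. This graph is publicly known by all retailers. Denote the node set by $\mathcal{V}$ and the edge set by $\mathcal{E}$. The neighborhood of retailer $i$ is represented by $\mathcal{N}_i$, and the degree of retailer $i$ is denoted by $d_i = |\mathcal{N}_i|$. Let $\boldsymbol{W} = [w_{ij}]$ be the Metropolis weight matrix with elements as follows [32]:

$$w_{ij} = \begin{cases} \dfrac{1}{1 + \max\{d_i, d_j\}}, & (i,j) \in \mathcal{E}, \\[2mm] 1 - \sum_{l \in \mathcal{N}_i} w_{il}, & i = j, \\[1mm] 0, & \text{otherwise.} \end{cases} \tag{20}$$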
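As a small illustration, the Metropolis weights can be built from an adjacency description of the communication graph. The dictionary-based input format and the three-retailer path graph are assumptions for the example.

```python
def metropolis_weights(neighbors):
    """Metropolis weight matrix of an undirected graph.

    neighbors: dict mapping each node to the set of its adjacent nodes.
    Off-diagonal entries are 1 / (1 + max degree of the two endpoints);
    each diagonal entry makes its row sum to one.
    """
    nodes = sorted(neighbors)
    idx = {v: r for r, v in enumerate(nodes)}
    n = len(nodes)
    W = [[0.0] * n for _ in range(n)]
    for i in nodes:
        for j in neighbors[i]:
            W[idx[i]][idx[j]] = 1.0 / (1 + max(len(neighbors[i]), len(neighbors[j])))
        W[idx[i]][idx[i]] = 1.0 - sum(W[idx[i]])  # diagonal is still zero at this point
    return W

# Hypothetical path graph: retailer 0 - retailer 1 - retailer 2.
W = metropolis_weights({0: {1}, 1: {0, 2}, 2: {1}})
```

The resulting matrix is symmetric with unit row sums, which is what makes repeated averaging preserve the network-wide mean.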

In the AAC algorithm, each retailer has a state value that will be updated through iterations. Let be the state of retailer in the AAC algorithm, then the state update equation of the AAC algorithm in the -th iteration is given by

(21)

which is a convex combination of the value from the original AC algorithm and the predictor given respectively by

(22)
(23)

The matrix form of the update is given as follows:

(24)
(25)

where , and is the identity matrix. We call the accelerated Metropolis weight matrix.

In this way, will converge to the mean of all retailers’ initial standardized state values

(26)

with the fastest asymptotic worst-case convergence rate if the weighted coefficient equals the optimal value [2]:

(27)

where is the smallest eigenvalue of , and is the second largest eigenvalue of . Since the graph is publicly known by all retailers, each retailer can easily compute using (20). Then, can be obtained by all retailers using (24).

Note that the AAC algorithm is fully distributed, i.e., each retailer only needs to communicate with its neighbors. Besides, after convergence, retailers can obtain the summation of their initial standardized state values by multiplying the mean in (26) by . Thus, let be equal to , then each retailer can obtain in (18) in a fully distributed manner using the AAC algorithm. However, in the first iteration, retailer will send to its neighbors, which directly reveals the private information of retailer .
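The way a consensus iteration yields a network-wide sum can be sketched as follows. For brevity the sketch uses the plain (non-accelerated) average consensus update with Metropolis weights, so the predictor step of the AAC algorithm is deliberately omitted; the weight matrix corresponds to a hypothetical three-retailer path graph and the initial values are made up.

```python
def consensus_step(W, x):
    """One plain average-consensus update: each node takes a weighted
    average of its own value and its neighbors' values."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

# Metropolis weights of a hypothetical path graph 0 - 1 - 2.
W = [[2 / 3, 1 / 3, 0.0],
     [1 / 3, 1 / 3, 1 / 3],
     [0.0, 1 / 3, 2 / 3]]

x = [3.0, 6.0, 12.0]          # initial state of each retailer
for _ in range(200):
    x = consensus_step(W, x)

# Every retailer recovers the sum by multiplying the mean by the number of retailers.
estimated_sum = [len(x) * xi for xi in x]
```

Note that the very first exchanged values are the raw initial states, which is exactly the disclosure problem discussed above.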

III-B PP-AAC Algorithm

To equip the AAC algorithm with privacy-preserving characteristics, we utilize the exponentially decaying disturbance with zero-sum property from [7] to mask the state values exchanged among neighbors during the AAC iterations, so that no retailer can derive the private information of the others.

The proposed PP-AAC algorithm is defined by

(28)

where is the state value masked by the disturbance as follows:

(29)

The noise is randomly selected from by retailer , where , , and . This design leads to the two features of , which will be used for the following proof of Theorem 3.1:

  • The noise is exponentially decaying as and grows with the number of iterations. So is also exponentially decaying.

  • The disturbance has zero-sum property, which means that if we sum up from to infinity (or to a relatively large number), the result will be close to 0, i.e.,

(30)

Theorem 3.1: The proposed PP-AAC algorithm in (28) will make each retailer’s state value converge to the average of all retailers’ initial state values, i.e., (26) still holds.

Proof: See Appendix.
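The effect of an exponentially decaying, zero-sum disturbance can be illustrated with a simplified sketch. Several assumptions are made here: the masking is applied to the plain consensus update rather than to the accelerated one, the disturbance at iteration t is taken as phi^t * v(t) - phi^(t-1) * v(t-1) with v drawn uniformly from a bounded interval, and the decay factor, bound, and graph are hypothetical choices.

```python
import random

def private_consensus(W, x0, iterations=300, phi=0.9, bound=1.0, seed=0):
    """Consensus on masked values; returns final states and the first shared values."""
    rng = random.Random(seed)
    n = len(x0)
    x = list(x0)
    prev = [0.0] * n            # phi^(t-1) * v(t-1); zero before the first step
    first_shared = None
    for t in range(iterations):
        cur = [phi ** t * rng.uniform(-bound, bound) for _ in range(n)]
        # Disturbance w(t) = cur - prev telescopes, so its running sum decays to zero.
        masked = [xi + c - p for xi, c, p in zip(x, cur, prev)]
        if t == 0:
            first_shared = list(masked)
        x = [sum(wij * mj for wij, mj in zip(row, masked)) for row in W]
        prev = cur
    return x, first_shared

# Metropolis weights of a hypothetical path graph 0 - 1 - 2.
W = [[2 / 3, 1 / 3, 0.0],
     [1 / 3, 1 / 3, 1 / 3],
     [0.0, 1 / 3, 2 / 3]]
x_final, first_shared = private_consensus(W, [3.0, 6.0, 12.0])
```

Because the accumulated disturbance vanishes, the states still settle on the true mean of the initial values, while the values exchanged in the first iteration differ from the private initial states.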

IV Privacy-Preserving Distributed Clustering Framework

This section describes the privacy-preserving distributed clustering framework for k-means, FCA, and GMM incorporating the proposed PP-AAC algorithm. In addition, we provide the privacy and complexity analyses of the proposed framework.

IV-A Clustering Framework

The idea of the clustering framework is that independent of the employed clustering method, in every iteration, each retailer first performs its local calculation according to (17); then each retailer sets its local result as the initial state of the proposed PP-AAC algorithm; after convergence, each retailer obtains the global summation of all the local results in (18); finally, using the global summations, each retailer can perform the rest of the clustering method to update the global information, e.g., the centroids of all clusters. The detailed clustering framework is demonstrated in Algorithm 1.

Input: Standardized () of retailer ().
Input: Arbitrarily and publicly assign K centroids .
Output: The load pattern () of the union data set
while convergence criterion of clustering is not met do
      Retailer () calculates () in (17);
      Retailer () sets ;
      while average consensus is not achieved do
            Retailer () randomly selects by rule;
            Retailer () masks its by (29);
            Retailer () computes its by (28);
      end while
      Retailer () obtains by ;
      Retailer () updates global cluster information;
      - K-means: updates () by (2);
      - FCA: updates () by (6);
      - GMM: updates () by (11)-(13);
end while
Retailer () gets the load patterns of the union data set;
- K-means: gets the final () in (2);
- FCA: gets the final () in (6);
- GMM: gets the final () in (12);
Algorithm 1 The clustering framework
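The overall loop can be sketched end to end for k-means. This is a simplified illustration under stated assumptions, not the authors' implementation: the inner loop uses masked plain consensus instead of the accelerated update, runs a fixed number of iterations rather than testing for convergence, and the graph, data, decay factor, and function names are all hypothetical.

```python
import math
import random

# Metropolis weights of a hypothetical 3-retailer path graph.
W = [[2 / 3, 1 / 3, 0.0], [1 / 3, 1 / 3, 1 / 3], [0.0, 1 / 3, 2 / 3]]

def masked_consensus_sum(states, iterations=300, phi=0.9, seed=0):
    """Each retailer's estimate of the element-wise sum of all state vectors."""
    rng = random.Random(seed)
    n, dim = len(states), len(states[0])
    x = [list(s) for s in states]
    prev = [[0.0] * dim for _ in range(n)]
    for t in range(iterations):
        cur = [[phi ** t * rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]
        # Exchanged values: state plus a decaying disturbance whose sum telescopes to zero.
        sent = [[x[i][d] + cur[i][d] - prev[i][d] for d in range(dim)] for i in range(n)]
        x = [[sum(W[i][j] * sent[j][d] for j in range(n)) for d in range(dim)]
             for i in range(n)]
        prev = cur
    return [[n * v for v in xi] for xi in x]   # mean times number of retailers

def local_state(data, centroids):
    """Flattened local result: per cluster, the member sum followed by the member count."""
    dim = len(centroids[0])
    state = [0.0] * (len(centroids) * (dim + 1))
    for obs in data:
        k = min(range(len(centroids)), key=lambda c: math.dist(obs, centroids[c]))
        for d in range(dim):
            state[k * (dim + 1) + d] += obs[d]
        state[k * (dim + 1) + dim] += 1.0
    return state

def distributed_kmeans(retailers, centroids, rounds=5):
    dim = len(centroids[0])
    for _ in range(rounds):
        states = [local_state(data, centroids) for data in retailers]
        total = masked_consensus_sum(states)[0]   # all retailers agree; use the first copy
        centroids = [
            [total[k * (dim + 1) + d] / total[k * (dim + 1) + dim] for d in range(dim)]
            if total[k * (dim + 1) + dim] > 0.5 else list(centroids[k])
            for k in range(len(centroids))
        ]
    return centroids

retailers = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 10.0], [10.0, 12.0]], [[2.0, 0.0], [12.0, 10.0]]]
final_centroids = distributed_kmeans(retailers, [[1.0, 1.0], [9.0, 9.0]])
```

Each retailer's state vector holds one sum and one count per cluster, so its length is the number of clusters times the observation dimension plus one.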

IV-B Privacy Analysis

As aforementioned, the AAC algorithm will directly reveal the initial value in the first iteration. On the contrary, in the first iteration of the proposed PP-AAC algorithm, retailer () receives () instead of . Since is masked using independent disturbance by retailer , retailer cannot derive the original value of from , thus retailer will not know the private of its neighbors, protecting the private information of retailer . In the remaining iterations, the process of adding disturbance continues; meanwhile, begins to converge to the mean value in (26) and moves away from its initial value, which further masks the true initial value. Quantitative illustrations will be shown in the next section.

In addition, we should note that if and , i.e., retailer can receive all the information that retailer has received, including retailer ’s information, then retailer can deduce retailer ’s initial value even if the disturbance is introduced [7]. Therefore, the authors of [7] and [19] both consider it necessary to assume that retailer cannot receive all the information that retailer has. The assumption is also adopted in this paper. Since is publicly known by all retailers, retailer can tell whether is a subset of its neighbor’s . If such a situation occurs, retailer can refuse to communicate with retailer . Therefore, the assumption will hold in practice.

IV-C Complexity Analysis

For the distributed framework, we investigate each retailer’s computation and communication overhead.

The proposed clustering framework not only keeps all the multiplication calculations in the original clustering methods, but also introduces new multiplication calculations by integrating the proposed PP-AAC algorithm. The multiplication calculations in the original clustering methods are divided among retailers according to their number of observations, i.e., if the computation overhead of the original clustering method is , then the overhead of retailer is . Moreover, in each iteration of the PP-AAC algorithm, although the disturbance can be queried from the preset lookup table, retailer () still needs to compute and (), which requires multiplications. Let denote the iteration number of the selected clustering method, and represent the iteration number of the proposed PP-AAC algorithm, then the computation overhead of retailer is . Taking k-means as an example, where is , retailer ’s overhead is . Please note that , because the number of retailers in a distribution network (DN) is small, and the proposed PP-AAC algorithm’s convergence is accelerated, thus is generally also small. However, is in the thousands and . Moreover, we know that . Therefore, the computation overhead of retailer is significantly smaller than that of the centralized k-means. Detailed illustrations are shown in the next section.

Besides, in each iteration of the proposed PP-AAC algorithm, the number of communications of retailer is [20]. Therefore, the communication overhead of retailer is .

V Case Study

V-A Data Description and Experiment Setup

We utilize the smart meter data from Ireland for verification, which contains 509660 half-hourly daily electrical consumption observations of 1000 consumers [8]. The representative load profile (RLP) of each consumer is obtained via the method presented in [27]. Thus we get the union data set consisting of 1000 48-dimensional RLPs. For the verification of the proposed PP-AAC algorithm and the clustering framework, e.g., the correctness, the efficiency, the privacy-preserving feature, and the effectiveness, we assume that there are 10 retailers in a DN, and each of them has access to 100 consumers. Their initial communication topology is shown in Fig. 1, where each retailer only communicates with its one-hop neighbors, and retailer () cannot receive all the information that any of its neighbors has. We also use different topologies to investigate the trend of the computational cost of the proposed clustering framework with respect to different topologies. Besides, we set and for randomly selecting the disturbance. Meanwhile, the initial centroids for all clustering methods are randomly chosen.

Fig. 1: Communication topology of the retailers.

V-B Verification of the PP-AAC Algorithm

To verify the correctness and efficiency of the proposed PP-AAC algorithm, we compare it with three algorithms: the original AC algorithm in [32], the AAC algorithm proposed in [2], and the PP-AC algorithm proposed in [7]. We use the four algorithms to compute the summation of the observations from each retailer’s first consumer. We then illustrate the average error of all retailers relative to the accurate summation result. The errors of the four algorithms for each iteration are shown in Fig. 2. It can be observed that the average error of the proposed PP-AAC algorithm converges to 0, indicating the correctness of this method. In addition, the proposed algorithm has the same convergence rate as the AAC algorithm, and the PP-AC algorithm has the same convergence rate as the AC algorithm. Please note that the proposed algorithm converges faster than both the AC and the PP-AC algorithms, indicating the efficiency of the proposed algorithm. Therefore, the correctness and efficiency of the proposed PP-AAC algorithm are verified.

Fig. 2: Comparison of convergence of the average consensus approaches and their respective privacy-preserving distributed versions.

Compared to the AC algorithm and the AAC algorithm, the proposed algorithm also has the privacy-preserving feature. To illustrate this feature, we provide the value that retailer 1 shares with its neighbors during the above summation calculation at each iteration. The shared values of the four algorithms are shown in Fig. 3.

Fig. 3: Interactive information shared during the iterative process.

These shared values all converge to the real average value, but we should note that retailer 1 shares its real initial value with its neighbors in the first iteration when performing the AC and the AAC algorithms, which directly reveals the private information of retailer 1. However, after introducing the disturbance for masking, the proposed algorithm enables retailer 1 to share its masked initial value with its neighbors, which is far away from the real one as indicated by the black arrow. Thus, the proposed algorithm protects the privacy of retailer 1. Moreover, the proposed algorithm still converges faster than the PP-AC algorithm, even though they both start from the same masked initial point.

V-C Verification of the Proposed Clustering Framework

We can employ the proposed clustering framework to obtain privacy-preserving distributed k-means, FCA, and GMM clustering methods. Then, we use them for load pattern identification on the distributed data sets. As benchmarks, we also use the centralized k-means, FCA, and GMM for load pattern identification on the corresponding union data set.

To verify the correctness of the clustering framework, in Fig. 4, we use the Silhouette coefficient index (SCI) [13] to evaluate the above distributed and centralized algorithms for different numbers of clusters. Note that the abbreviation ‘PPD’ in Fig. 4 represents ‘privacy-preserving distributed’. This figure clearly shows that the SCI results of the proposed privacy-preserving distributed algorithms are identical to those of the centralized algorithms. This means that the clustering results on the distributed data sets using the proposed clustering framework are exactly the same as those on the union data set computed via the centralized methods, indicating the correctness of the proposed clustering framework.

Fig. 4: The SCI results of different clustering methods and their respective privacy-preserving distributed versions.

To verify the effectiveness of the clustering framework, we choose k-means for demonstration, as it is a hard clustering method and therefore convenient for illustration. We use the most common approach, i.e., the sum of squared errors (SSE), to find the optimal number of clusters [1]. From this, we find that the optimal cluster number of the union data set (1000 RLPs, i.e., 1000 consumers) is 6, while that of the data set of retailer 1 (100 RLPs) is 2. After that, we perform the centralized k-means on retailer 1’s data set, and the results are shown in Fig. 5(a). Besides, we also perform the proposed privacy-preserving distributed k-means and the centralized k-means on the union data set. The results are demonstrated in Fig. 5(b). The number of RLPs in each cluster is listed in the subfigure’s title. Meanwhile, the RLPs and the load patterns of retailer 1 are highlighted in Fig. 5(b) as well.

First, from Fig. 5(b), we can observe that the centroids of the proposed algorithm coincide with the centroids of the centralized k-means. Second, the two load patterns of retailer 1’s data set in Fig. 5(a) approximately match the 2nd and the 3rd load patterns of the union data set in Fig. 5(b). However, retailer 1 misses the remaining four categories of consumers. If retailer 1 only uses its own two load patterns for tariff design, its products will struggle to attract the 608 consumers in the remaining clusters. In contrast, with the proposed clustering framework, each retailer can use the six load patterns of all consumers for tariff design to attract all of them. Therefore, the effectiveness of the proposed clustering framework is demonstrated.

Fig. 5: The clustering results on (a) the data set of retailer 1 (100 RLPs) and (b) the union data set (1000 RLPs). The RLPs and clustering centers of retailer 1 are also illustrated in (b).

To verify the efficiency of the clustering framework, we provide the computational time and iteration numbers (I-Ns) of the centralized and the privacy-preserving distributed clustering methods. Note that for the distributed methods, the retailers’ computational times are different. Thus the maximum computational time of all retailers is chosen to represent the time of the distributed methods. Details are given in Table I. From this table, it is obvious that the iteration numbers of the corresponding centralized and distributed clustering methods are the same, but the computational times of the corresponding methods differ by an order of magnitude: the time consumed by each retailer in distributed clustering is significantly less than that of the centralized clustering, indicating the high efficiency of the proposed clustering framework.

TABLE I: Computational Time and Iteration Number Comparison
Methods   K-means   PPD K-means   FCA     PPD FCA   GMM      PPD GMM
Time      0.321     0.046         1.169   0.163     18.698   2.081
I-N       7         7             24      24        6        6

Please note that the above computational time does not contain communication time. However, this time is probably negligible. In k-means for example, each retailer shares its masked value to its neighbors, which consists of the masked and for . Thus each retailer actually shares 294 floating-point numbers with its neighbors, i.e., 1.15 kbytes. We know that , , and . Meanwhile, as shown in Fig. 2, and the degree of the retailer that consumes the most time is (retailer 1), which is also the maximum degree among the retailers. According to the communication overhead analysis in Section IV-C, the maximum total amount of upstream data of all the retailers will be kbytes Mbytes. Since the global average broadband internet speed is 11.03Mbps, the actual maximum communication time for retailers will not exceed 0.1 seconds. This cost will be greatly reduced in Europe as it has the world’s highest concentration of countries with the fastest internet, e.g., Sweden’s average speed is 55.18Mbps [10].
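The message size quoted above can be checked with a short calculation. It uses the six clusters and the 48-dimensional RLPs stated in the case study; the 4 bytes per number is an assumption (single-precision floats) that happens to reproduce the quoted 1.15 kbytes.

```python
n_clusters = 6          # optimal number of clusters on the union data set
rlp_dimension = 48      # half-hourly representative load profile

# Per cluster: one masked 48-dimensional sum plus one masked scalar.
floats_per_message = n_clusters * (rlp_dimension + 1)

bytes_per_float = 4     # assumption: single-precision floating point
message_kbytes = floats_per_message * bytes_per_float / 1024
```

With these values, `floats_per_message` is 294 and `message_kbytes` rounds to 1.15.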

V-D Verification of Different Topologies

Although the computational time of retailers’ local calculation is not affected by the change of communication topology, different topologies directly affect the degree of retailers as well as the iteration numbers of the proposed PP-AAC algorithm, resulting in a change in the computational time of the AAC algorithm, which in turn changes the time of the clustering framework. To investigate this trend, we randomly change the communication topology to obtain 9 topologies as shown in Fig. 6.

Fig. 6: The 9 different topologies used in the sensitivity analysis with respect to topology.

We then measure, for each retailer, the total execution time of the AAC algorithm part of the clustering framework, and show the average over all retailers in Fig. 7. The average of the retailers and the average iteration numbers of the AAC algorithm under the different topologies are provided in Table II. Note that the assumption that a retailer cannot receive all the information its neighbors have received causes the number of possible communication lines to saturate quickly, leaving only minor differences between topologies. We therefore temporarily drop this assumption so that the variation in cost across topologies is easier to see.

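| Topology | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| Average | 1.2 | 1.3 | 1.7 | 1.9 | 2.7 | 2.9 | 3.5 | 3.7 | 4.1 |
| Average iteration number | 121 | 81 | 77 | 25 | 17 | 21 | 14 | 12 | 9 |

TABLE II: The Factors that Affect the Computation Time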
Fig. 7: The variation in the average computational times and the spectral radius for different topologies.

Theoretically, the average computation overhead of the proposed AAC algorithm part is , where denotes the average of all retailers. Table II shows that although the average of the retailers increases with the topology index in Fig. 6 (from 1.2 to 4.1, a factor of about 3.4), this increase is much smaller than the decrease in the iteration numbers (from 121 to 9, a factor of about 13), so the computational time in Fig. 7 is dominated by the iteration numbers. In fact, the worst-case measure of the proposed AAC algorithm’s asymptotic convergence rate is proportional to the spectral radius of matrix , where is the averaging matrix [2]. Since the convergence rate determines the iteration numbers, and the iteration numbers dominate the computational time, the trend of the computational time coincides with that of the spectral radius. For verification, Fig. 7 also shows the spectral radius under the different topologies. The computational times and the spectral radius follow the same decreasing trend.
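The link between topology and convergence speed can be illustrated numerically. The sketch below is for plain (non-accelerated) average consensus and is not the AAC weight matrix analyzed above. It assumes Metropolis weights, writes the averaging matrix as J = (1/n) 1 1^T, and compares the spectral radius of W - J on two hypothetical 5-node topologies, a ring and a complete graph.

```python
import numpy as np


def metropolis_weights(adj):
    """Symmetric doubly stochastic weights from an undirected adjacency matrix (assumed rule)."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    n = len(adj)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()  # diagonal absorbs the remainder so each row sums to 1
    return W


def consensus_spectral_radius(W):
    """Spectral radius of W - J, where J is the averaging matrix."""
    n = len(W)
    J = np.ones((n, n)) / n
    return float(np.max(np.abs(np.linalg.eigvalsh(W - J))))  # W - J is symmetric here


def ring(n):
    adj = np.zeros((n, n), dtype=int)
    for i in range(n):
        adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1
    return adj


def complete(n):
    return np.ones((n, n), dtype=int) - np.eye(n, dtype=int)


rho_ring = consensus_spectral_radius(metropolis_weights(ring(5)))
rho_complete = consensus_spectral_radius(metropolis_weights(complete(5)))
print(rho_ring, rho_complete)
```

In this toy setting the denser topology has the smaller spectral radius, so it needs fewer iterations to reach a given accuracy, which matches the trend described for Table II and Fig. 7.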

VI Conclusions

In this paper, we propose a privacy-preserving distributed clustering framework that can directly modify the traditional k-means, FCA, and GMM clustering methods to provide privacy-preserving distributed variants. To achieve this, we first perform a commonality analysis of the three clustering methods and show that the key to the clustering framework lies in calculating the summation of the retailers’ private information in a fully distributed and privacy-preserving way. We then develop a PP-AAC algorithm with proven convergence to compute this summation. Finally, we present the privacy-preserving distributed clustering framework based on the proposed algorithm, together with theoretical privacy and complexity analyses.

The proposed PP-AAC algorithm converges faster than the privacy-preserving AC algorithm and the original AC algorithm. In addition, unlike the original AC and AAC algorithms, it preserves privacy by introducing an exponentially decaying disturbance with a zero-sum property into the shared information. The proposed clustering framework enables each retailer to obtain the exact residential load pattern identification for all consumers instead of only its own consumers. The framework can thus help retailers design better tariff products to attract new users. Meanwhile, the clustering framework not only protects every retailer’s privacy but also greatly reduces the computation overhead of each retailer compared to the centralized method. Moreover, under different communication topologies, the computational time of the PP-AAC part and the spectral radius follow the same decreasing trend.

Appendix

First, we need to prove that is doubly stochastic, i.e., that

(31)

holds. Define as a vector of all ones; then we have:

Since is a doubly stochastic matrix, as proved in [32], the following holds:

Substituting into , we obtain

Similarly, we can obtain with the property that .

Second, define , and . Then we have the matrix form of the proposed PP-AAC algorithm:

(32)
(33)

In the linear dynamic system in (32), as long as is doubly stochastic, with the two aforementioned features of , the authors in [7] proved that (34) and (35) hold:

(34)
(35)

Combining (34) and (35) yields (26).
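As an informal numerical illustration of why double stochasticity matters for reaching the exact average, the sketch below runs a plain (non-accelerated) noise-masked average consensus. It is not the PP-AAC algorithm. The ring topology, the weight matrix `W`, the decay factor `phi`, and the Gaussian noise are all assumptions chosen for illustration. "Zero-sum" is taken here to mean that the noise each node injects sums to zero over the iterations, which is also an assumption.

```python
import numpy as np


def ring_weights(n):
    """Symmetric doubly stochastic weights for a ring: 1/3 to self and to each neighbor."""
    shift = np.roll(np.eye(n), 1, axis=1)
    return (np.eye(n) + shift + shift.T) / 3.0


def masked_average_consensus(x0, W, phi=0.5, steps=200, seed=0):
    """Average consensus in which every shared state is masked by decaying noise.

    The noise at step k is phi**k * v(k) - phi**(k-1) * v(k-1), so its partial
    sums telescope to phi**k * v(k), which decays to zero.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    prev = np.zeros_like(x)  # no noise term before the first step
    for k in range(steps):
        cur = (phi ** k) * rng.standard_normal(x.shape)
        x = W @ (x + cur - prev)  # neighbors only ever see the masked state
        prev = cur
    return x


x0 = np.array([3.0, 7.0, 1.0, 9.0, 5.0])  # hypothetical private values
W = ring_weights(len(x0))
x_final = masked_average_consensus(x0, W)
print(x_final, x0.mean())
```

Because the columns of `W` sum to one, each update preserves the sum of the states apart from the injected noise, and the injected noise cancels over time. Every node therefore still approaches the true average of the initial values even though no unmasked value is ever shared.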

References

  • [1] (2011) A clustering method combining differential evolution with the k-means algorithm. Pattern Recognit. Lett. 32 (12), pp. 1613–1621. ISSN 0167-8655.
  • [2] T. C. Aysal, B. N. Oreshkin, and M. J. Coates (2009-04) Accelerated distributed average consensus via localized node state prediction. IEEE Trans. Signal Process. 57 (4), pp. 1563–1576.
  • [3] G. Chicco, R. Napoli, and F. Piglione (2006-05) Comparisons among clustering techniques for electricity customer classification. IEEE Trans. Power Syst. 21 (2), pp. 933–940.
  • [4] K. Clement-Nyns, E. Haesen, and J. Driesen (2011) The impact of vehicle-to-grid on the distribution grid. Electr Pow Syst Res. 81 (1), pp. 185–192. ISSN 0378-7796.
  • [5] C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, and M. Y. Zhu (2002) Tools for privacy preserving distributed data mining. ACM Sigkdd Explorations Newsletter 4 (2), pp. 28–34.
  • [6] Z. Gheid and Y. Challal (2016-08) Efficient and privacy-preserving k-means clustering for big data mining. In 2016 IEEE Trustcom/BigDataSE/ISPA, pp. 791–798.
  • [7] J. He, L. Cai, P. Cheng, J. Pan, and L. Shi (2019) Consensus-based data-privacy preserving data aggregation. IEEE Trans. Autom. Control., pp. 1–1.
  • [8] Irish Social Science Data Archive (2012) Commission for energy regulation (CER) smart metering project. http://www.ucd.ie/issda/data/commissionforenergyregulationcer/
  • [9] H. Kanchev, D. Lu, F. Colas, V. Lazarov, and B. Francois (2011-10) Energy management and operational planning of a microgrid with a PV-based active generator for smart grid applications. IEEE Trans. Ind. Electron. 58 (10), pp. 4583–4592.
  • [10] S. Lai (2019-02) Countries with the fastest internet in the world 2019. ATLAS and BOOTS.
  • [11] K. L. Leemaqz, S. X. Lee, and G. J. McLachlan (2017) Corruption-resistant privacy preserving distributed EM algorithm for model-based clustering. In 2017 IEEE Trustcom/BigDataSE/ICESS, pp. 1082–1089.
  • [12] R. Li, Z. Wang, C. Gu, F. Li, and H. Wu (2016) A novel time-of-use tariff design based on Gaussian mixture model. Appl. Energy. 162, pp. 1530–1536. ISSN 0306-2619.
  • [13] R. Lletí, M.C. Ortiz, L.A. Sarabia, and M.S. Sánchez (2004) Selecting variables for k-means cluster analysis by using a genetic algorithm that optimises the silhouettes. Anal. Chim. Acta. 515 (1), pp. 87–100. ISSN 0003-2670.
  • [14] S. Lloyd (1982-03) Least squares quantization in PCM. IEEE Trans. Inf. Theory. 28 (2), pp. 129–137.
  • [15] V. Manikandan, V. Porkodi, A. S. Mohammed, and M. Sivaram (2018) Privacy preserving data mining using threshold based fuzzy cmeans clustering. ICTACT Journal on Soft Computing 9 (1).
  • [16] D. Z. Marques, K. A. de Almeida, A. M. de Deus, A. R. G. da Silva Paulo, and W. da Silva Lima (2004-11) A comparative analysis of neural and fuzzy cluster techniques applied to the characterization of electric load in substations. In 2004 IEEE/PES Transmision and Distribution Conference and Exposition, pp. 908–913.
  • [17] P. McDaniel and S. McLaughlin (2009-05) Security and privacy challenges in the smart grid. IEEE Secur Priv. 7 (3), pp. 75–77.
  • [18] F. Meskine and S. N. Bahloul (2012) Privacy preserving k-means clustering: a survey research. Int. Arab J. Inf. Technol. 9 (2), pp. 194–200.
  • [19] Y. Mo and R. M. Murray (2017-02) Privacy preserving average consensus. IEEE Trans. Autom. Control. 62 (2), pp. 753–765.
  • [20] Y. Mo and B. Sinopoli (2010) Communication complexity and energy efficient consensus algorithm. IFAC Proceedings Volumes. 43 (19), pp. 209–214. ISSN 1474-6670.
  • [21] S. Patel, S. Garasia, and D. Jinwala (2012) An efficient approach for privacy preserving distributed k-means clustering based on Shamir’s secret sharing scheme. In IFIP International Conference on Trust Management, pp. 129–141.
  • [22] T. Räsänen, D. Voukantsis, H. Niska, K. Karatzas, and M. Kolehmainen (2010) Data-based method for creating electricity use load profiles using large amount of customer-specific hourly measured electricity use data. Appl. Energy. 87 (11), pp. 3538–3545. ISSN 0306-2619.
  • [23] S. Samet, A. Miri, and L. Orozco-Barbosa (2007) Privacy preserving k-means clustering in multi-party environment. In SECRYPT, pp. 381–385.
  • [24] R. Singh, B. C. Pal, and R. A. Jabr (2010-02) Statistical representation of distribution system loads using Gaussian mixture model. IEEE Trans. Power Syst. 25 (1), pp. 29–37.
  • [25] B. Stephen, A. J. Mutanen, S. Galloway, G. Burt, and P. Järventausta (2014-02) Enhanced load profiling for residential network customers. IEEE Trans. Power Del. 29 (1), pp. 88–96.
  • [26] C. Su, F. Bao, J. Zhou, T. Takagi, and K. Sakurai (2007-05) Privacy-preserving two-party k-means clustering via secure approximation. In 21st International Conference on Advanced Information Networking and Applications Workshops (AINAW’07), Vol. 1, pp. 385–391.
  • [27] M. Sun, I. Konstantelos, and G. Strbac (2017-05) C-vine copula mixture model for clustering of residential electrical load pattern data. IEEE Trans. Power Syst. 32 (3), pp. 2382–2393.
  • [28] G. J. Tsekouras, N. D. Hatziargyriou, and E. N. Dialynas (2007-08) Two-stage pattern recognition of load curves for classification of electricity customers. IEEE Trans. Power Syst. 22 (3), pp. 1120–1128.
  • [29] M. Upmanyu, A. M. Namboodiri, K. Srinathan, and C. Jawahar (2010) Efficient privacy preserving k-means clustering. In Pacific-Asia Workshop on Intelligence and Security Informatics, pp. 154–166.
  • [30] Y. Wang, Q. Chen, C. Kang, and Q. Xia (2016-09) Clustering of electricity consumption behavior dynamics toward big data applications. IEEE Trans. Smart Grid. 7 (5), pp. 2437–2447.
  • [31] K. Wu (2012) Analysis of parameter selections for fuzzy c-means. Pattern Recognit. 45 (1), pp. 407–415. ISSN 0031-3203.
  • [32] L. Xiao, S. Boyd, and S. Lall (2005-04) A scheme for robust distributed sensor fusion based on average consensus. In IPSN 2005. Fourth International Symposium on Information Processing in Sensor Networks, 2005, pp. 63–70.
  • [33] K. Xing, C. Hu, J. Yu, X. Cheng, and F. Zhang (2017-08) Mutual privacy preserving k-means clustering in social participatory sensing. IEEE Trans. Ind. Informat. 13 (4), pp. 2066–2076.