Privacy-Preserving Distributed Joint Probability Modeling for Spatial-Correlated Wind Farms

12/17/2018 ∙ by Mengshuo Jia, et al. ∙ Tsinghua University

Building the joint probability distribution (JPD) of multiple spatially correlated wind farms (WFs) is critical for chance-constrained optimal decision-making. The vertically partitioned historical wind power data of the WFs is the premise of training the JPD. However, to protect data privacy, WFs with different stakeholders will refuse to share raw data directly or to send raw data to a third party, since no one knows whether the third party can be trusted. Moreover, centralized training also suffers from costly high-bandwidth communication, single-point failure, and limited scalability. To solve these problems, a distributed algorithm is an alternative. But to the best of our knowledge, little literature has proposed a privacy-preserving distributed (PPD) algorithm to build the JPD of spatially correlated WFs. Therefore, based on additive homomorphic encryption and the average consensus algorithm, we first propose a PPD summation algorithm. Meanwhile, based on the binary hash function and the average consensus algorithm, we present a PPD inner product algorithm. Thereafter, combining the PPD summation and inner product algorithms, a PPD expectation-maximization algorithm for training the Gaussian-mixture-model-based JPD of WFs is eventually developed. The correctness and the robustness to communication failure of the proposed algorithm are empirically verified using historical data.


I Introduction

Wind power (WP) is gaining momentum globally, and the outlines of a future sustainable energy system are beginning to become clear. By the end of 2017, the total installed capacity of WP was 89.1 GW in the U.S., 168.7 GW in the European Union, and 188.4 GW in China, and it is expected to reach 840 GW worldwide by the end of 2022 [1]. The penetration of wind farms (WFs) integrated into the distribution network (DN) is also growing rapidly [2], which raises great challenges for the operation of the independent system operator (ISO) in the DN.

As a common approach to cope with uncertainty, chance-constrained optimal scheduling with WF integration has been investigated by many researchers, e.g., chance-constrained unit commitment [3, 4, 5, 6], chance-constrained economic dispatch [7], and chance-constrained ramping reserve scheduling [8]. Considering the spatial correlation between the wind power (WP) of adjacent WFs in a DN, the joint probability distribution (JPD) of multiple adjacent WP must be established first. Then, the Latin hypercube sampling technique can be utilized to sample the established probability distribution, generate numerous scenarios, and finally transform the chance constraints [3, 4, 5]; alternatively, the chance constraints can be directly converted into deterministic ones via quantiles calculated from the established probability distribution [6, 7, 8]. Therefore, building the JPD of multiple adjacent WP is critical.

From the perspective of responsibility, the task of building the JPD needs to be undertaken by the related WFs. The established JPD is then sent to the ISO for optimal decisions. Inaccurate modeling of the JPD may lead to two situations: (1) an insufficient reserve schedule, in which the ISO has to temporarily dispatch power to maintain balance and penalize the related WFs afterwards; (2) severe curtailment of WP, in which the related WFs waste their own power resources. No matter which situation occurs, the related WFs bear the loss caused by the inaccurate modeling of their JPD. Thus, the WFs have a strong incentive to construct an accurate one.

An appropriate probability model is a prerequisite for constructing an accurate probability distribution. Many studies pre-assume that the uncertainty of WP follows a certain distribution model, e.g., the Gaussian distribution [3, 4, 9], Weibull distribution [5], Cauchy distribution [10], Beta distribution [11], and Versatile distribution [12]. However, the Gaussian assumption may result in inaccurate modeling, since WP is a non-Gaussian random variable [7]. Moreover, although the Cauchy distribution fits better than the Weibull distribution on certain timescales [10], this distribution, as well as the Beta and Versatile distributions, is unimodal and cannot capture the multi-peak characteristic of WP [13].

Lately, the Gaussian mixture model (GMM), originally used in the field of pattern recognition, has gradually been applied to power systems [14, 15], since it can characterize random variables of arbitrary dimension subject to arbitrary distributions, and it is beginning to be utilized to characterize the uncertainty of WP with remarkable performance [2, 7, 8]. To estimate the parameters of the GMM, the expectation-maximization (EM) algorithm is most commonly used [7, 8, 14, 15, 16, 17].

To train the GMM-based JPD via the EM algorithm, the training set, consisting of the historical WP data of all adjacent WFs in a DN, is required as well. The traditional way is to collect all historical WP data at a third-party data center to train the GMM-based JPD, which is known as the centralized EM algorithm [7, 8, 14, 15, 16, 17]. However, to protect data privacy, WFs with different stakeholders will not share raw data or send raw data to a third-party data center, since no one knows whether the third party can be trusted. In addition, the centralized EM algorithm requires costly high-bandwidth communication between the data center and every WF [18], and its centralized communication structure is susceptible to single-point failure. Meanwhile, due to limited scalability [19], variations of the network topology highlight the inflexibility of centralized algorithms. To overcome the above problems, a privacy-preserving distributed (PPD) strategy is a feasible alternative. Therefore, this paper focuses on developing a PPD EM algorithm for training the GMM-based JPD of multiple adjacent WP.

In the data mining field, many efforts have been made on PPD EM algorithms [20, 21, 22, 23]. In [20, 21], a PPD EM algorithm is proposed based on the secure sum technique, which can accurately calculate the sum of data without revealing the data privacy of any party; a cyclic communication topology is adopted to complete the algorithm. Using the additive homomorphic encryption technique to encode raw data into cryptographic messages, Kaleb et al. present their PPD EM algorithm [22]. To prevent the leakage of data privacy when an adversary controls multiple parties, Kaleb et al. enforce one-way communication across a ring topology to guarantee the corruption resistance of the proposed algorithm. Similar to [22], Yang et al. also utilize the additive homomorphic encryption technique to keep raw data safe [23]. The difference is that Yang et al. design a local and global secure summation protocol, and cryptographic messages are sent through a spanning-tree communication topology. Undoubtedly, the above research can guarantee that the data privacy of all parties will not be leaked when the EM algorithm is implemented. However, both the pre-selected cyclic communication topology in [20, 21, 22] and the pre-selected spanning-tree communication topology in [23] need a global perspective to choose, which is contrary to the original intention of the distributed manner. In particular, selecting a cyclic topology is a Hamiltonian circuit problem, which is proved to be NP-complete [24]. Meanwhile, the long path from the first node to the last node of the pre-selected route not only may cause serious reliability problems when any node along the path fails [24], but is also time-consuming [25].

To design a better distributed communication manner, a diffusion-based or consensus-based strategy is more appropriate, in which only local communication is needed and no pre-selected path is required. In the field of parameter estimation in wireless sensor networks, diffusion-based distributed EM algorithms are proposed in [24, 25], where a new step, named the D-step, is added to the traditional EM algorithm by Weng et al. The D-step undertakes data diffusion, designed so that each sensor communicates its local statistics with its neighbors only once. Eventually, every sensor updates its GMM parameters via its own local statistics and the ones collected from its neighbors. When an infinite number of data are available to each sensor, the proposed distributed algorithm can be considered a Robbins-Monro stochastic approximation of the centralized one. In [26], the average consensus algorithm is adopted to develop a distributed EM algorithm. The consensus vector is initialized as the local statistics of every sensor, and knowledge of the global statistics is obtained after the convergence of the proposed algorithm. Similar to [26], Gu proposes a distributed EM algorithm based on the average consensus filter [27]. Since the consensus filter only uses the states of neighbors' consensus filters as inputs, instead of the inputs of neighbors' consensus filters, the algorithm proposed in [27] needs less information exchange. Although the above studies provide a better distributed communication manner, the algorithms presented in [24, 25, 26, 27] are unable to protect data privacy.

Besides, whether they are multiple sensors or multiple WFs, both can be regarded as multi-agent systems. For multi-agent problems, there are two typical data partitioning modes among agents in practice: data horizontal partitioning and data vertical partitioning [21, 28, 29]. Let us explain the two modes with customer records in banks. Each customer record consists of three attributes (name, annual income, and annual expenses), and a total of 100 customer records are partitioned among bank A, bank B, and bank C. If bank A gets 30 customer records, bank B gets 20, and bank C gets 50, the mode is called horizontal partitioning; if bank A only gets the names of all 100 customers, bank B only the annual incomes, and bank C only the annual expenses, the mode is called vertical partitioning. For further explanation, consider the customer record with three attributes as a three-dimensional random variable; the 100 customer records are then 100 samples of this variable, and vertical partitioning means that each bank only has one dimension of the three-dimensional samples (see the sketch below). To build the JPD of M adjacent WP, the samples of the M-dimensional random variable are partitioned vertically among the WFs, since each WF only has its own historical WP data, which is only one dimension of the M-dimensional samples. However, the algorithms proposed in [20, 21, 22, 23, 24, 25, 26, 27] are all designed for the horizontal partitioning mode. Therefore, neither the PPD EM algorithms proposed in [20, 21, 22, 23] nor the non-PPD EM algorithms developed in [24, 25, 26, 27] can be applied to build the JPD of multiple adjacent WP. Meanwhile, to the best of our knowledge, little literature has proposed a PPD EM algorithm for the vertical partitioning mode.
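To make the two partitioning modes concrete, the following is a toy Python sketch mirroring the bank example above (all variable names and numbers are ours, for illustration only):

import numpy as np

rng = np.random.default_rng(1)
records = rng.uniform(0, 100, size=(100, 3))   # 100 samples of a 3-dim variable

# Horizontal partitioning: each bank holds complete rows (whole samples).
bank_A, bank_B, bank_C = records[:30], records[30:50], records[50:]

# Vertical partitioning: each bank holds one column of every sample.
col_A, col_B, col_C = records[:, 0], records[:, 1], records[:, 2]

# Wind-farm analogue: M farms, each owning one column of the N x M WP matrix.
assert np.allclose(records, np.column_stack([col_A, col_B, col_C]))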

Our work points out that there are two keys to the realization of a PPD EM algorithm under data vertical partitioning: (1) how to calculate the summation of WP data from every WF in a distributed manner under the premise of data privacy preservation; (2) how to calculate the inner product between the WP data of every two WFs in a distributed manner under the same premise. The original contributions of this paper are threefold:

  1. A PPD EM algorithm for data vertical partitioning is proposed to build the JPD of multiple adjacent WP. The proposed algorithm only needs neighboring communication with no pre-selected path, and it is robust to communication failure. For the initialization of the proposed algorithm, this paper also designs a PPD k-means clustering algorithm.

  2. To implement the above algorithm, for the summation of WP data from every WF, this paper proposes a PPD summation algorithm based on the average consensus algorithm and the additive homomorphic encryption technique, where only neighboring communication is needed with no pre-selected path.

  3. To implement the above algorithm, for the inner product between the WP data of every two WFs, this paper proposes a PPD inner product algorithm based on the average consensus algorithm and the binary hash function, where only neighboring communication is needed with no pre-selected path.

The rest of the paper is organized as follows. In Section II, the keys to the realization of the PPD EM algorithm under data vertical partitioning are demonstrated. The PPD summation algorithm is proposed in Section III, and the PPD inner product algorithm is developed in Section IV. In Section V, the PPD EM algorithm for data vertical partitioning is proposed, together with the PPD k-means clustering algorithm. Case studies are described in Section VI. Finally, Section VII concludes this paper.

II Essence of the EM Algorithm for Data Vertical Partitioning

In this Section, the centralized EM algorithm for the GMM is introduced first. Then, the essence of the EM algorithm for the data vertical partitioning mode is demonstrated.

II-A The Centralized EM Algorithm for GMM

We consider M spatially correlated WFs whose WP constitutes an M-dimensional random variable $\mathbf{w}$, defined as $\mathbf{w} = [w_1, w_2, \ldots, w_M]^T$. Since the m-th element $w_m$ of $\mathbf{w}$ represents the WP of the m-th WF, the joint probability distribution function (PDF) of the multiple spatially correlated WP is essentially the GMM-based PDF of $\mathbf{w}$. As a convex combination of J multivariate Gaussian distributions with weighting coefficients $\alpha_j$, mean vectors $\boldsymbol{\mu}_j$, and covariances $\boldsymbol{\Sigma}_j$, the GMM-based PDF of $\mathbf{w}$ is given in (1):

$f(\mathbf{w} \mid \Theta) = \sum_{j=1}^{J} \alpha_j N(\mathbf{w} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j), \quad \sum_{j=1}^{J} \alpha_j = 1$  (1)

where $N(\mathbf{w} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$ is the M-dimensional Gaussian distribution, and $\Theta = \{\alpha_j, \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j\}_{j=1}^{J}$ is the parameter set of the GMM-based joint PDF.

To estimate $\Theta$ by the centralized EM algorithm, we also require the training set $\mathbf{W} \in \mathbb{R}^{N \times M}$, defined as $\mathbf{W} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_M]$, where the m-th column $\mathbf{x}_m$ collects the N historical WP observations of the m-th WF, also known as the raw data of the m-th WF. The n-th row of $\mathbf{W}$, described by $\mathbf{w}_n^T = [w_{n,1}, \ldots, w_{n,M}]$, is the n-th sample of $\mathbf{w}$. Obviously, $\mathbf{W}$ consists of N samples of $\mathbf{w}$.

Once $\mathbf{W}$ is obtained by collecting the historical WP data of all WFs at a data center, a simple closed-form expression of the centralized EM algorithm can be given by two iterative steps, i.e., the expectation step (E-step) and the maximization step (M-step). For detailed proofs and derivations, refer to [17]. For the k-th iteration, the centralized E-step is given in (2) and the centralized M-step in (3):

$\gamma_{n,j}^{(k)} = \dfrac{\alpha_j^{(k)} N(\mathbf{w}_n \mid \boldsymbol{\mu}_j^{(k)}, \boldsymbol{\Sigma}_j^{(k)})}{\sum_{l=1}^{J} \alpha_l^{(k)} N(\mathbf{w}_n \mid \boldsymbol{\mu}_l^{(k)}, \boldsymbol{\Sigma}_l^{(k)})}$  (2)

$\alpha_j^{(k+1)} = \dfrac{1}{N} \sum_{n=1}^{N} \gamma_{n,j}^{(k)}$  (3a)

$\boldsymbol{\mu}_j^{(k+1)} = \dfrac{\sum_{n=1}^{N} \gamma_{n,j}^{(k)} \mathbf{w}_n}{\sum_{n=1}^{N} \gamma_{n,j}^{(k)}}$  (3b)

$\boldsymbol{\Sigma}_j^{(k+1)} = \dfrac{\sum_{n=1}^{N} \gamma_{n,j}^{(k)} (\mathbf{w}_n - \boldsymbol{\mu}_j^{(k+1)})(\mathbf{w}_n - \boldsymbol{\mu}_j^{(k+1)})^T}{\sum_{n=1}^{N} \gamma_{n,j}^{(k)}}$  (3c)

where $(\cdot)^T$ represents the transpose of a vector or matrix. After convergence, the estimate of $\Theta$ is obtained and the GMM-based joint PDF of multiple spatially correlated WP in (1) is established. Since the iteration process is the same for every Gaussian component, we omit the subscripts k and j in later derivations when doing so causes no ambiguity.
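To make the centralized benchmark concrete, the following is a minimal sketch of EM for a GMM per (2)-(3c). The function and variable names are ours, and the sketch assumes the pooled N x M matrix W is available in one place, which is precisely what the rest of the paper avoids:

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(W, J, iters=100, seed=0):
    N, M = W.shape
    rng = np.random.default_rng(seed)
    alpha = np.full(J, 1.0 / J)                        # weights alpha_j
    mu = W[rng.choice(N, size=J, replace=False)]       # means mu_j (random init)
    Sigma = np.stack([np.cov(W.T) + 1e-6 * np.eye(M)] * J)
    for _ in range(iters):
        # E-step, eq. (2): responsibilities gamma_{n,j}
        dens = np.column_stack([alpha[j] * multivariate_normal.pdf(W, mu[j], Sigma[j])
                                for j in range(J)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step, eqs. (3a)-(3c)
        Nj = gamma.sum(axis=0)
        alpha = Nj / N                                 # (3a)
        mu = (gamma.T @ W) / Nj[:, None]               # (3b)
        for j in range(J):                             # (3c)
            D = W - mu[j]
            Sigma[j] = (gamma[:, j, None] * D).T @ D / Nj[j] + 1e-6 * np.eye(M)
    return alpha, mu, Sigma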

It is worth noting that, whether in the centralized E-step or M-step, the complete sample $\mathbf{w}_n$ is needed, each element of which is the n-th WP observation of one WF. In fact, none of the WFs will agree to share its WP data with others when data privacy is considered. In this case, the data center no longer exists, and every WF has no idea about the data of the others. Thus, there is not even one complete sample at any WF, and no WF can perform a single calculation of the E-step or M-step. This is the difficulty in designing a distributed EM algorithm in the case of data vertical partitioning, especially considering privacy protection.

II-B The Essence of the E-step for Data Vertical Partitioning

The E-step aims to calculate the statistics in (2) using the parameters updated in the last iteration, i.e., $\alpha_j^{(k)}$, $\boldsymbol{\mu}_j^{(k)}$, and $\boldsymbol{\Sigma}_j^{(k)}$. The main calculation of the E-step is to compute the Gaussian distribution $N(\mathbf{w}_n \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$, as given in (4), and the part that actually involves $\mathbf{w}_n$ exists only in the exponential term of the Gaussian distribution, described in (5), where $\lambda_{m,i}$ denotes the element of $\boldsymbol{\Sigma}^{-1}$ in the m-th row and i-th column, and $\mu_m$ (or $\mu_i$) denotes the m-th (or i-th) element of $\boldsymbol{\mu}$.

$N(\mathbf{w}_n \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = (2\pi)^{-M/2} |\boldsymbol{\Sigma}|^{-1/2} \exp\!\big(-\tfrac{1}{2} (\mathbf{w}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{w}_n - \boldsymbol{\mu})\big)$  (4)

$(\mathbf{w}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{w}_n - \boldsymbol{\mu}) = \sum_{m=1}^{M} (w_{n,m} - \mu_m) \sum_{i=1}^{M} \lambda_{m,i} (w_{n,i} - \mu_i)$  (5)

Equation (5) shows that, for data vertical partitioning, the calculation of the E-step can be divided into two summations among the WFs, one given in (6) and the other in (7). Since $\Theta^{(k)}$ has already been updated in the last iteration, $\alpha^{(k)}$, $\boldsymbol{\mu}^{(k)}$, and $\boldsymbol{\Sigma}^{(k)}$ are all public knowledge. For the m-th WF, however, sharing $(w_{n,m} - \mu_m)$ with others is essentially sharing its n-th raw WP observation $w_{n,m}$, which will be refused for privacy reasons.

$b_{n,m} = \sum_{i=1}^{M} \lambda_{m,i} (w_{n,i} - \mu_i)$  (6)

$d_n = \sum_{m=1}^{M} (w_{n,m} - \mu_m) \, b_{n,m}$  (7)

Note that the two summations have the same formulation, whose unified form is given in (8). Obviously, the summations in (6) and (7) essentially sum M private values $c_1, c_2, \ldots, c_M$, one locally computable by each WF. Therefore, how to calculate (8) in a distributed manner under the premise of data privacy preservation is the key to realizing the E-step in the data vertical partitioning mode.

$y = \sum_{m=1}^{M} c_m$  (8)

II-C The Essence of the M-step for Data Vertical Partitioning

The M-step aims to update $\Theta$ in (3) with the statistics calculated in the E-step. Since $\gamma_n$ is already obtained by all WFs as public knowledge, every WF can directly compute (3a) and update the weighting coefficient $\alpha$. The calculations that actually involve $\mathbf{w}_n$ lie in updating $\boldsymbol{\mu}$ in (3b) and $\boldsymbol{\Sigma}$ in (3c).

We first provide the details of (3b) in (9), where the m-th element of $\boldsymbol{\mu}$ is given in (10). It can be observed that, although the statistic $\gamma_n$ is already obtained by every WF, due to the data vertical partitioning, the m-th WF can only compute the m-th element of $\boldsymbol{\mu}$, i.e., $\mu_m$ in (10), based on its N historical WP observations. None of the M WFs can directly obtain the complete $\boldsymbol{\mu}$. However, as can be seen in (10), since $\gamma_n$ is public knowledge, $\mu_m$ is essentially the weighted summation of the N historical WP observations of the m-th WF with public weighting coefficients $\beta_n$, as given in (11). No WF can deduce the original N historical WP observations from the value of $\mu_m$. Thus, although each WF can only compute its own $\mu_m$, it can share $\mu_m$ with others, as no data privacy is revealed. Therefore, for data vertical partitioning, the calculation of (3b) can be completed by each WF sharing its $\mu_m$ with the others; no data privacy issue is involved.

$\boldsymbol{\mu} = \dfrac{\sum_{n=1}^{N} \gamma_n \mathbf{w}_n}{\sum_{n=1}^{N} \gamma_n} = [\mu_1, \mu_2, \ldots, \mu_M]^T$  (9)

$\mu_m = \dfrac{\sum_{n=1}^{N} \gamma_n w_{n,m}}{\sum_{n=1}^{N} \gamma_n}$  (10)

$\mu_m = \sum_{n=1}^{N} \beta_n w_{n,m}, \quad \beta_n = \dfrac{\gamma_n}{\sum_{n'=1}^{N} \gamma_{n'}}$  (11)

$\sigma_{m,m} = \sum_{n=1}^{N} \beta_n (w_{n,m} - \mu_m)^2$  (12)

$\sigma_{m,i} = \sum_{n=1}^{N} \beta_n (w_{n,m} - \mu_m)(w_{n,i} - \mu_i)$  (13)

For further investigation of (3c), without loss of generality, the diagonal and off-diagonal elements of $\boldsymbol{\Sigma}$ are provided in (12) and (13), respectively, where $\sigma_{m,m}$ is the m-th diagonal element and $\sigma_{m,i}$ is the off-diagonal element in the m-th row and i-th column.

One can note that, due to the data vertical partitioning, the m-th WF can only compute the m-th diagonal element of $\boldsymbol{\Sigma}$, i.e., $\sigma_{m,m}$ in (12), based on its N historical WP observations. None of the M WFs can directly obtain all the diagonal elements of $\boldsymbol{\Sigma}$. However, as with (10), $\sigma_{m,m}$ in (12) does not reveal data privacy, since no WF can deduce the original N historical WP observations from its value. Thus, for data vertical partitioning, the calculation of (12) can be completed by each WF sharing its $\sigma_{m,m}$ with the others.

But the situation is completely different when calculating an off-diagonal element, i.e., $\sigma_{m,i}$ in (13), as the WP observations of both the m-th WF and the i-th WF are needed. Because $\gamma_n$, $\mu_m$, and $\mu_i$ are all public knowledge after the calculation of (2) and (3b), sharing $(w_{n,m} - \mu_m)$ or $(w_{n,i} - \mu_i)$ with others is essentially sharing the n-th raw WP observation $w_{n,m}$ or $w_{n,i}$, which will be refused for privacy reasons. Therefore, computing (13) is essentially calculating an inner product between the vectors $\mathbf{u}_m$ and $\mathbf{u}_i$, as given in (14), where $\mathbf{u}_m$ consists of N private observations of the m-th WF and $\mathbf{u}_i$ of the i-th WF:

$\sigma_{m,i} = \mathbf{u}_m^T \mathbf{u}_i, \quad \mathbf{u}_m = \big[\sqrt{\beta_1}(w_{1,m} - \mu_m), \ldots, \sqrt{\beta_N}(w_{N,m} - \mu_m)\big]^T$  (14)

To calculate the inner product of two private vectors under the protection of data privacy, secure scalar product (SSP) techniques have been proposed by many studies based on the additive homomorphic encryption technique [30] or the random perturbation disguising technique [20, 31]. However, these methods all require peer-to-peer communication between the two owners of the two private vectors, i.e., the m-th WF and the i-th WF. Thus, to obtain all the off-diagonal elements of $\boldsymbol{\Sigma}$, every two WFs would have to perform peer-to-peer communication, which not only reduces the scalability of the algorithm but also has low efficiency. For higher scalability and efficiency, a local distributed algorithm is needed. Therefore, how to calculate (14) for every two WFs in a local distributed manner under the premise of data privacy preservation is the key to realizing the M-step in the data vertical partitioning mode.

Based on the above analysis, the two essences of the E-step and M-step are also the two difficulties that must be solved to develop a PPD EM algorithm. Thus, in Sections III and IV, a PPD summation algorithm and a PPD inner product algorithm are proposed to solve the two difficulties, respectively.

III The Privacy-Preserving Distributed Summation Algorithm

In this Section, the additive homomorphic encryption technique is briefly introduced first. Then, a PPD summation algorithm is proposed based on the additive homomorphic encryption technique and the average consensus algorithm.

III-A Additive Homomorphic Encryption Technique

The concept of homomorphic encryption was first proposed in 1978 by Rivest, Adleman, and Dertouzos [32]. An intuitive definition of homomorphic encryption is: a way to delegate processing of your data without giving away access to it [33, 34]. Thus, the aim of the additive homomorphic encryption technique is to sum up private data without knowing them. Favored by many researchers, the Paillier cryptosystem, an additive homomorphic cryptosystem proposed by Paillier [35], has been adopted for the analysis of social networks [23] and the Internet [30]. In this paper, the Paillier cryptosystem is also adopted.

Let $x$ denote a plaintext and $c$ the corresponding ciphertext. Meanwhile, let $\eta = pq$ denote the product of two pre-specified large prime integers and $r$ denote a positive random integer. The relationships between $x$ and $c$ are given as follows, where (15) represents Paillier encryption with the public key $(\eta, g)$ and (16) represents Paillier decryption with the secret key $\lambda$. Note that decryption is independent of $r$. We omit $\eta$ and $g$ when doing so causes no ambiguity.

$c = E(x) = g^x r^{\eta} \bmod \eta^2$  (15)

$x = D(c) = \dfrac{L(c^{\lambda} \bmod \eta^2)}{L(g^{\lambda} \bmod \eta^2)} \bmod \eta, \quad L(u) = \dfrac{u-1}{\eta}$  (16)

$D\!\Big(\prod_{m=1}^{M} E(x_m) \bmod \eta^2\Big) = \sum_{m=1}^{M} x_m \bmod \eta$  (17)

Considering M plaintexts $x_1, \ldots, x_M$ and their corresponding ciphertexts $E(x_1), \ldots, E(x_M)$, the additive homomorphism of the Paillier cryptosystem is given in (17). Based on this feature, to calculate the summation of $x_1, \ldots, x_M$, only the corresponding ciphertexts are needed, which realizes data privacy protection. For more details, refer to [35]. Note that, although the Paillier cryptosystem is only applicable to the case where $x$ is a non-negative integer, a decimal can be split into positive/negative integer and decimal parts for separate calculation.
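The following toy sketch illustrates (15)-(17) in Python with deliberately tiny primes; it is for illustration only, is not secure, and all names are ours:

import math, random

p, q = 293, 433                      # toy primes; real deployments use ~1024-bit primes
n, n2 = p * q, (p * q) ** 2
g = n + 1                            # standard generator choice
lam = math.lcm(p - 1, q - 1)         # secret key
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)   # mu = L(g^lam mod n^2)^(-1) mod n

def encrypt(x):
    """Paillier encryption, eq. (15): c = g^x * r^n mod n^2, for 0 <= x < n."""
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return pow(g, x, n2) * pow(r, n, n2) % n2

def decrypt(c):
    """Paillier decryption, eq. (16): x = L(c^lam mod n^2) * mu mod n."""
    return (pow(c, lam, n2) - 1) // n * mu % n

# Additive homomorphism, eq. (17): the product of ciphertexts decrypts to the sum.
x1, x2, x3 = 41, 7, 100
c = encrypt(x1) * encrypt(x2) * encrypt(x3) % n2
assert decrypt(c) == x1 + x2 + x3    # 148, recovered without decrypting any term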

III-B The PPD Summation Algorithm

Before demonstrating the proposed PPD summation algorithm, some definitions are given first. The communication topology of the multiple spatially correlated WFs is represented by a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ denotes the set of nodes and $\mathcal{E}$ denotes the set of edges. Once the distance between two nodes is less than a preset distance threshold $\delta$, the two nodes are connected. The set of neighbors of node m, also known as the m-th WF, is denoted by $\mathcal{N}_m$. The weighted adjacency matrix of $\mathcal{G}$ is represented by $\mathbf{A}$ with adjacency coefficients $a_{m,i}$, defined in (18):

$a_{m,i} = \dfrac{1}{1 + \max\{d_m, d_i\}}$ if $i \in \mathcal{N}_m$; $\; a_{m,m} = 1 - \sum_{i \in \mathcal{N}_m} a_{m,i}$; $\; a_{m,i} = 0$ otherwise  (18)

where $d_m$ and $d_i$ denote the degrees of nodes m and i, respectively. Note that $\mathbf{A}$ is a symmetric matrix and $\sum_{i=1}^{M} a_{m,i} = 1$ for every node $m \in \mathcal{V}$.

To calculate (8) in a distributed manner, the average consensus algorithm is an effective method [36]. Its discrete form is given in (19). Each node only needs to communicate with its neighbors and exchange its state $z_m^{(t)}$. Once convergence is achieved, every node in $\mathcal{G}$ obtains the average value of (8); multiplying the average value by M yields the final summation result. However, the main problem of the average consensus algorithm is that the initial value of the exchange, $z_m^{(0)} = c_m$, reveals the raw data.

$z_m^{(t+1)} = a_{m,m} z_m^{(t)} + \sum_{i \in \mathcal{N}_m} a_{m,i} z_i^{(t)}$  (19)
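A minimal sketch of plain (non-private) average consensus with the degree-based weights of (18) follows; the topology and values are made up for illustration:

import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 3)]   # a small connected topology
M = 4
c = np.array([3.0, 7.0, 1.0, 5.0])                 # private value c_m held by node m

A = np.zeros((M, M))
deg = np.zeros(M, dtype=int)
for m, i in edges:
    deg[m] += 1; deg[i] += 1
for m, i in edges:                                 # eq. (18): neighbor weights
    A[m, i] = A[i, m] = 1.0 / (1 + max(deg[m], deg[i]))
for m in range(M):
    A[m, m] = 1.0 - A[m].sum()                     # rows (and columns) sum to 1

z = c.copy()                                       # z^(0) = c_m: reveals raw data!
for _ in range(200):                               # eq. (19): z^(t+1) = A z^(t)
    z = A @ z

print(z * M)                                       # every node recovers the sum, 16.0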

In fact, the aim of exchanging $z_m^{(t)}$ is to enable each node to compute the weighted summation over its neighborhood, given in (20). Since directly exchanging $z_m^{(0)}$ in the average consensus algorithm reveals the raw data, a privacy-preserving way to calculate (20) is needed. Inspired by the local secure summation protocol in [23], we also utilize the Paillier cryptosystem to compute (20) under the protection of data privacy, and we then develop a PPD summation algorithm based on the average consensus algorithm. Details are given in Algorithm 1.

$z_m^{(1)} = a_{m,m} z_m^{(0)} + \sum_{i \in \mathcal{N}_m} a_{m,i} z_i^{(0)}$  (20)
Input: with its .
Output: obtains .
while convergence criterion is not met do
       for  to  do
01 Nodes are numbered from 1 to M;
             02 ;
             if t=0 then
                   03 generates and ;
                   04 is published to others for encryption;
                   for  to  do
                         05 encrypts ;
                         06 sends to ;
                        
                   end for
                  07 generates a random number ;
                   08 encrypts ;
                   09 computes ;
                   10 sends to ;
                   11 decrypts into ;
12 sends its decryption to ;
                   13 subtracts and get ;
                   14 completes the calculation of (19);
                   15 ;
                  
            else
                   for  to  do
                         16 sends to ;
                        
                   end for
                  17 completes the calculation of (19);
                   18 ;
                  
             end if
            
       end for
      
end while
for  to  do
       19 computes
end for
Algorithm 1 The PPD summation algorithm

It can be observed that, based on the additive homomorphism of the Paillier cryptosystem, each node can obtain the summation in (20) with only local communication between neighbors and full protection of data privacy. Meanwhile, since the initial value, which reveals the raw data, is only needed in the first iteration of the PPD summation algorithm, the later iterations are the same as in the traditional average consensus algorithm. The number of interactions per node is only two more than in the traditional average consensus algorithm: one is sending information in line 10 of Algorithm 1, and the other is receiving the decryption in line 12.
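For intuition, here is a hedged sketch of the masked first iteration; this is our own reading of lines 03-13 of Algorithm 1 (the published listing omits the symbols), and it assumes a designated neighbor holds the Paillier key pair, so node m never decrypts individual values and the key holder only ever sees a sum masked by m's random number. It reuses encrypt(), decrypt(), and n2 from the Paillier sketch above:

import random

def neighbor_weighted_sum(m, z0, A, neighbors):
    """Sum of a_mi * z_i(0) over i in N_m, hiding every individual z_i(0)."""
    SCALE = 1000                         # fixed-point scaling: Paillier takes integers;
                                         # keep plaintexts below the toy modulus n
    # lines 03-06: each neighbor encrypts its weighted (non-negative) value under
    # the key holder's public key and sends the ciphertext to node m.
    ciphers = [encrypt(int(A[m][i] * z0[i] * SCALE)) for i in neighbors[m]]
    # lines 07-09: m encrypts a random mask and folds it into the aggregate.
    mask = random.randrange(1, 10_000)
    agg = encrypt(mask)
    for cph in ciphers:
        agg = agg * cph % n2             # homomorphic addition, eq. (17)
    # lines 10-12: the key holder decrypts the masked aggregate and returns it to m.
    masked_sum = decrypt(agg)
    # line 13: m removes its mask; adding a_mm * z_m(0) locally completes (20).
    return (masked_sum - mask) / SCALE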

Compared with the existing algorithms [20, 21, 22, 23], the proposed PPD summation algorithm does not need a pre-selected communication path; local communication with neighbors is enough. Thus, no global perspective is required, and the Hamiltonian circuit problem is avoided. Meanwhile, the existing algorithms fail when any communication line along the pre-selected path fails, even if the whole graph remains connected. The proposed PPD summation algorithm, however, is robust to any communication failure as long as the connectivity of the graph holds.

IV The Privacy-Preserving Distributed Inner Product Algorithm

In this Section, we first briefly describe the binary hash function, laying a foundation for developing the PPD inner product algorithm. Then the proposed PPD inner product algorithm is demonstrated.

IV-A Binary Hash Function

As mentioned above, although the SSP techniques can accurately calculate the inner product of two private vectors, these methods are based on the assumption that any two nodes are connected. To avoid this problem, we instead seek to approximate the inner product with high precision. In the field of pattern recognition and data mining, inner product approximation is an effective method for measuring the similarity between high-dimensional vectors. The Cauchy-Schwarz inequality was once used for the approximation based on (21), but the accuracy is not satisfactory [20]. For higher accuracy, a magnitude approximation technique for the inner product is proposed in [37, 38, 39] and developed further in [40]. However, the symmetry assumption made there for every element of the vector is too strict and does not suit WP data. Actually, once the angle $\theta_{m,i}$ in (21) is obtained, the inner product can be calculated directly by sharing the norms $\|\mathbf{u}_m\|$ and $\|\mathbf{u}_i\|$, which reveals no raw data. Therefore, the problem becomes how to approximate the angle between two vectors under the premise of data privacy preservation.

$\mathbf{u}_m^T \mathbf{u}_i = \|\mathbf{u}_m\| \, \|\mathbf{u}_i\| \cos\theta_{m,i}$  (21)

In N-dimensional space, the probability of a random hyperplane separating two vectors $\mathbf{u}_m$ and $\mathbf{u}_i$ is proportional to the angle $\theta_{m,i}$ between them [41]. To estimate this probability, we first define a publicly known random vector set $\mathbf{V} = [\mathbf{v}_1, \ldots, \mathbf{v}_L]$, where each column is a random vector $\mathbf{v}_l \in \mathbb{R}^N$. The probability is then given as follows:

$P = \dfrac{\theta_{m,i}}{\pi}$  (22)

$h(\mathbf{u}_m) = \mathrm{sgn}(\mathbf{V}^T \mathbf{u}_m)$  (23)

In fact, the probability calculation in (22) is essentially the Hamming distance calculation between two binary hash codes [42]. For further demonstration, the binary hash function $h(\cdot)$, as given in (23), is defined, where the function $\mathrm{sgn}(\cdot)$ encodes an L-dimensional real vector into an L-dimensional binary vector according to the sign of each element. Thus, $h(\mathbf{u}_m)$ actually represents the sign information of the multiplication results between $\mathbf{V}^T$ and $\mathbf{u}_m$. Once $h(\mathbf{u}_m)$ and $h(\mathbf{u}_i)$ are obtained, (22) can easily be estimated by counting the number of differing sign pairs. For example, if $h(\mathbf{u}_m) = [1, 0, 1]^T$ and $h(\mathbf{u}_i) = [0, 1, 1]^T$, then the first and second elements differ, so the number of differing sign pairs is 2 and the probability in (22) is estimated as 2/3. Note that the counting process is essentially calculating the Hamming distance between $h(\mathbf{u}_m)$ and $h(\mathbf{u}_i)$ [42], represented by $d_H(\cdot,\cdot)$. Therefore, based on the binary hash function and the Hamming distance, the angle can be calculated as in (24). For the error analysis, refer to [42]. It is worth noting that both the norm values and the binary hash values protect data privacy, as no WF can deduce the raw data from them. Thus, with $\theta_{m,i}$, $\|\mathbf{u}_m\|$, and $\|\mathbf{u}_i\|$, the inner product in (21) can be obtained under the premise of privacy protection.

$\theta_{m,i} = \dfrac{\pi}{L} \, d_H\big(h(\mathbf{u}_m), h(\mathbf{u}_i)\big)$  (24)
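The following sketch illustrates the random-hyperplane approximation of (21)-(24): only sign bits (hash codes) and norms would ever be exchanged, never the raw vectors. The data and sizes are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
N, L = 480, 10_000               # vector length and hash length (toy choices)
u_m = rng.uniform(0, 1, N)       # stand-ins for two farms' private vectors
u_i = 0.6 * u_m + 0.4 * rng.uniform(0, 1, N)

V = rng.standard_normal((N, L))  # public random vector set, known to all nodes
h_m = V.T @ u_m > 0              # eq. (23): binary hash codes (safe to share)
h_i = V.T @ u_i > 0

hamming = np.count_nonzero(h_m != h_i)
theta = np.pi * hamming / L      # eq. (24): angle from the Hamming distance
approx = np.linalg.norm(u_m) * np.linalg.norm(u_i) * np.cos(theta)  # eq. (21)
print(approx, u_m @ u_i)         # close; the error shrinks as L grows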

IV-B The PPD Inner Product Algorithm

Our goal is not only to calculate the inner product of two vectors under the premise of privacy protection, but also to let every node obtain all the inner product values between the vectors of any two nodes in the graph in a locally distributed manner. To compute all the inner product values, every node needs the binary hash codes and norms of all the other nodes. Thus, based on the binary hash function and the average consensus algorithm, the PPD inner product algorithm is proposed in Algorithm 2.

The advantages of the proposed algorithm are threefold: (1) both the norm values and the binary hash values protect data privacy, as no WF can deduce the raw data from them; (2) there is no need for strict assumptions about the original data in the vector, such as the symmetry assumption in [37, 38, 39, 40], which ensures the universal applicability of the algorithm; (3) based on the average consensus algorithm, each node only needs to communicate with its neighbors to obtain all the inner product values, which avoids the assumption made in [30, 20, 31] that any two nodes are connected. Thus, the scalability is improved.

Input: with its .
Output: obtains for
for  to  do
       01 computes via (23);
       02 converts into decimal ;
       03 Initialization: , and for others;
       04 Each node computes (19) until convergence ;
       05 Each node multiplies the result by M to get ;
       06 Each node converts into binary ;
       07 computes ;
       08 Initialization: , and for others;
       09 Each node computes (19) until convergence;
       10 Each node multiplies the result by M to get ;
      
end for
11 computes for by (24);
12 obtains for by (21);
Algorithm 2 The PPD inner product algorithm

V The Privacy-Preserving Distributed EM Algorithm

Since the two difficulties identified in Section II are solved by the proposed PPD summation and inner product algorithms, the PPD EM algorithm is eventually developed in this Section based on these two algorithms. Details are given first, and then the advantages of the proposed algorithm are discussed.

V-A Details of the Proposed PPD EM Algorithm

Details of the proposed PPD EM algorithm are provided in Algorithm 3. Although the main parts of the PPD EM algorithm, i.e., the PPD E-step and the PPD M-step, are thereby developed, the initialization still requires further investigation. Thus, based on the proposed PPD summation and inner product algorithms, we also develop a PPD k-means clustering algorithm for initializing the PPD EM algorithm.

$C_j^{(t)} = \big\{ \mathbf{w}_n : \|\mathbf{w}_n - \boldsymbol{\mu}_j^{(t)}\|_2 \le \|\mathbf{w}_n - \boldsymbol{\mu}_l^{(t)}\|_2, \ \forall l = 1, \ldots, J \big\}$  (25)

$\boldsymbol{\mu}_j^{(t+1)} = \dfrac{1}{|C_j^{(t)}|} \sum_{\mathbf{w}_n \in C_j^{(t)}} \mathbf{w}_n$  (26)

$\|\mathbf{w}_n - \boldsymbol{\mu}_j\|_2^2 = \sum_{m=1}^{M} (w_{n,m} - \mu_{j,m})^2$  (27)

$\mu_{j,m}^{(t+1)} = \dfrac{1}{|C_j^{(t)}|} \sum_{\mathbf{w}_n \in C_j^{(t)}} w_{n,m}$  (28)

First, with J clusters, we define random initial mean vectors $\boldsymbol{\mu}_j^{(0)}$ for $j = 1, \ldots, J$. Then, let $C_j$ represent the j-th cluster and $|C_j|$ the number of its members. Thereafter, the t-th assignment step of centralized k-means clustering is provided in (25), and the t-th update step is given in (26). The essential calculation of the assignment step lies in the 2-norm calculation in (27), while the essential calculation of the update step lies in the mean calculation in (28). Obviously, the summation over the WFs in (27) can be solved by the proposed PPD summation algorithm. Meanwhile, since (28) and (9) are similar, (28) can be obtained in the same way as (9). Furthermore, after the convergence of the k-means clustering, each cluster can calculate its covariance via (29) and (30) as the initial covariance for the EM algorithm, given in (31). Obviously, the calculation of (29) can be solved by the proposed PPD inner product algorithm, and since (30) and (12) are similar, (30) can be obtained in the same way as (12). Therefore, we present a PPD k-means clustering algorithm based on the proposed PPD summation and inner product algorithms, detailed in Algorithm 4, where the A-step represents the assignment step and the U-step the update step. A plain (non-private) sketch of this decomposition is given after equations (29)-(31) below.

$\sigma_{j,m,i} = \dfrac{1}{|C_j|} \sum_{\mathbf{w}_n \in C_j} (w_{n,m} - \mu_{j,m})(w_{n,i} - \mu_{j,i}), \quad m \ne i$  (29)

$\sigma_{j,m,m} = \dfrac{1}{|C_j|} \sum_{\mathbf{w}_n \in C_j} (w_{n,m} - \mu_{j,m})^2$  (30)

$\boldsymbol{\Sigma}_j^{(0)} = [\sigma_{j,m,i}]_{M \times M}$  (31)
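As promised above, here is a plain (non-private) sketch of one k-means iteration, eqs. (25)-(28), written to expose the decomposition the PPD version exploits; the explicit loop over farms m stands in for the PPD summation algorithm, and the names are ours:

import numpy as np

def kmeans_step(columns, mu):
    """columns: list of M length-N arrays (one per farm); mu: J x M means."""
    N, (J, M) = len(columns[0]), mu.shape
    # A-step, eq. (27): farm m computes (w_nm - mu_jm)^2 locally; only the sum
    # over farms is needed, which the real algorithm obtains privately.
    d2 = np.zeros((N, J))
    for m in range(M):
        d2 += (columns[m][:, None] - mu[None, :, m]) ** 2
    labels = d2.argmin(axis=1)                          # eq. (25)
    # U-step, eq. (28): each farm updates its own dimension of every mean.
    new_mu = np.array([[columns[m][labels == j].mean() if np.any(labels == j)
                        else mu[j, m] for m in range(M)] for j in range(J)])
    return labels, new_mu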
Initialization:
01 Set , and for ;
02 Set k=1;  
The PPD E-step:
for  to and to  do
       03 Define , , ;
       04 Input: with its ;
       05 Performing Algorithm 1;
       06 Output: obtains ;
       07 Input: with its ;
       08 Performing Algorithm 1;
       09 Output: obtains ;
       10 computes (4) via ;
       11 updates in (4) via the result of (4);
      
end for
 
The PPD M-step:
for  to  do
       12 updates in (10) by ;
       13 updates in (12) by ;
       14 updates in (14) by ;
       for  to  do
             15 Set , and for others;
             16 computes (19) until convergence;
             17 multiplies the result by M to get
             18 Set , and for others;
             19 computes (19) until convergence;
             20 multiplies the result by M to get ;
            
       end for
      21 Input: with its ;
       22 Performing Algorithm 2;
       23 Output: obtains for ;
       24 obtains by (3) directly;
       25 updates ;
       26 obtains ;
      
end for
27 Set ;
28 Loop the PPD E-step and M-step until convergence;
Algorithm 3 The PPD EM algorithm
Initialization:
01 Set random for ;
02 Set ;
 
The PPD A-step:
for  to and to  do
       03 Input: with its ;
       04 Performing Algorithm 1;
       05 Output: obtains ;
      
end for
06 updates the assignments by (25);
 
The PPD U-step:
for  to and to  do
       07 Set in (28), and for others;
       08 computes (19) until convergence;
       09 multiplies the result by M to get ;
      
end for
for  to  do
       10 updates ;
      
end for
11 Set ;
12 Loop the PPD A-step and U-step until convergence;
 
Initialization for The PPD EM Algorithm:
for  to  do
       13 computes in (30);
       for  to  do
             14 Set , and for others;
             15 computes (19) until convergence;
             16 multiplies the result by M to get ;
            
       end for
      17 performs Algorithm 2 to get in (29) for ;
       19 obtains by (31);
       20 calculates ;
       21 obtains ;
      
end for
Algorithm 4 The PPD k-means clustering

V-B Discussion

As given in Algorithm 3, the PPD E-step is developed from the proposed PPD summation algorithm, while the PPD M-step is developed from the proposed PPD inner product algorithm. Based on these two algorithms, the proposed PPD EM algorithm can deal with the data vertical partitioning mode and build the GMM-based JPD of multiple spatially correlated WFs. The advantages of the proposed PPD EM algorithm are fourfold:

  1. Strict privacy preservation. For the summation calculation in the PPD E-step, the Paillier cryptosystem is used to protect the raw data; for the inner product calculation in the PPD M-step, the binary hash function is used to prevent data privacy disclosure. The two techniques strictly protect privacy, as no WF can deduce any raw data from the calculation process.

  2. Completely distributed. As we introduce the average consensus algorithm into the PPD E-step and M-step, each WF only needs to communicate with its neighbors. Thus, we avoid the assumption made in [30, 20, 31] that any two nodes are connected, and we improve the scalability of the proposed algorithm. Meanwhile, no pre-selected communication path is required as in [20, 21, 22, 23]. Thus, we avoid the weak robustness of pre-selected paths to communication line failures, which improves the robustness of the proposed algorithm.

  3. Fewer iterations. The EM algorithm is sensitive to its initial values. Since the PPD k-means clustering is developed to provide initial values for the PPD EM algorithm, the latter needs fewer iterations to converge than with a random or zero-valued start, which significantly reduces the amount of communication.

  4. Robustness. As the communication between neighbors may fail, robustness to communication failure is necessary. Since the proposed PPD EM algorithm is developed based on the average consensus algorithm, as long as the communication topology remains connected, communication failures basically will not affect the final modeling results, owing to the consensus feature. For more analysis, refer to Section VI.

VI Case Study

The historical WP data are from the "Eastern Wind Integration Data Set" published by the National Renewable Energy Laboratory (NREL) [43], from which we choose 9 WFs in Maryland, numbered 4401, 5405, 6211, 6359, 6526, 6812, 6931, 7187, and 7460. Each WF owns 20 days of private hourly WP data from 2004/1/1-01:00 to 2004/1/21-00:00. Thus, M = 9 and N = 480. We aim to build the JPD of the WP of the 9 spatially correlated WFs, which is a typical data vertical partitioning problem. Note that the privacy-preserving features of the proposed algorithms have already been discussed in the preceding sections; this Section mainly aims to verify the correctness and robustness of the proposed algorithms.

Meanwhile, some parameters need to be set, i.e., the communication distance threshold $\delta$, the length L of the binary hash code in (23), and the number of Gaussian components J. In the following analysis, the ideas for setting these parameters are demonstrated one by one.

VI-A Correctness of the PPD Summation Algorithm

To verify the correctness of the proposed PPD summation algorithm, the communication topology must first be given, which means that the communication distance threshold $\delta$ should be set first. We use the WP data at 2004/1/1-01:00, consisting of 9 data points, one owned by each WF, as input, and use the proposed algorithm to estimate the summation of the 9 data points. The output of the proposed algorithm is that every WF obtains an estimate of the summation result. Since we do not know in advance which $\delta$ is appropriate, we gradually increase $\delta$ from 1 km to 4 km. Some communication topologies under certain $\delta$ are provided in Fig. 1 for better understanding, and the estimation results of all WFs under different $\delta$ are shown in Fig. 2.

Fig. 1: Communication topology examples under certain distance thresholds
Fig. 2: Estimated result for the summation under different distance thresholds

With an increasing threshold, the communication topology becomes connected, as can be observed from Fig. 1. Meanwhile, before the distance threshold reaches 2.5 km, the estimate of each WF is not only far from the true value but also has not reached consensus, as shown in Fig. 2. In fact, once the threshold reaches 2.5 km, the connectivity of the topology is achieved. Although a higher threshold yields more links between WFs, the consensus effect saturates after the threshold reaches 2.5 km. To save the construction cost of communication lines, we set $\delta$ to 2.5 km.

In fact, Fig. 2 already shows the correctness of the proposed PPD summation algorithm, as each WF reaches consensus and obtains the correct summation result. For further verification, we use the WP data at 2004/1/1-02:00, 03:00, 04:00, and 05:00 as 4 test sets and use the proposed algorithm to estimate the summation of each set. The errors relative to the true values are given in Table I. All the errors are very small, indicating the correctness of the proposed algorithm. Meanwhile, different WFs have different calculation errors due to their different neighbors and positions in the topology.

WP data time 02:00 03:00 04:00 05:00
Error of WF1 () 1.57 1.34 1.29 1.28
Error of WF2 () 3.16 2.69 2.59 2.58
Error of WF3 () 1.76 1.50 1.45 1.44
Error of WF4 () 0.25 0.21 0.21 0.21
Error of WF5 () 2.44 2.08 2.01 2.01
Error of WF6 () 2.84 2.42 2.44 2.32
Error of WF7 () 6.08 5.18 4.98 4.98
Error of WF8 () 1.01 0.86 0.83 0.83
Error of WF9 () 2.57 2.19 2.11 2.11
TABLE I: The Relative Error of the Proposed Algorithm

VI-B Correctness of the PPD Inner Product Algorithm

Since the 20-day historical WP data of one WF can be seen as its private vector, to verify the correctness of the proposed PPD inner product algorithm, we calculate the inner products between every two WFs' private vectors with the algorithm. As our algorithm enables every WF to obtain all the inner products, we define the inner product between the m-th and the i-th vectors as calculated by the g-th WF as $\hat{s}_{m,i}^{(g)}$, and the true inner product as $s_{m,i}$. The indicator applied for verification is the average relative error defined in (32):

$e = \dfrac{1}{M^3} \sum_{g=1}^{M} \sum_{m=1}^{M} \sum_{i=1}^{M} \dfrac{|\hat{s}_{m,i}^{(g)} - s_{m,i}|}{|s_{m,i}|}$  (32)

Meanwhile, to choose an appropriate length of the binary hash code, we gradually increase L. The average relative error of the proposed algorithm under different values of L is illustrated in Fig. 3, together with the corresponding length of the binary hash code. It can be observed that the error decreases significantly as L increases but nearly stabilizes beyond a certain point; since a larger L requires more communication traffic, we eventually choose L at that saturation point. The resulting average relative error is small enough to prove the correctness of the proposed algorithm.

Fig. 3: Average relative error and its corresponding binary code length

VI-C Verification of the PPD EM Algorithm

Before performing the PPD EM algorithm, the number of Gaussian components J must be set. The Bayesian information criterion (BIC) is the most commonly used criterion for selecting J, where the J with the lowest BIC is preferred; for more details, refer to [44]. Thus, we compute the BIC value for several choices of J by the centralized EM algorithm with the 20-day historical WP data. The results are provided in Table II. Since the lowest BIC is achieved at J = 5, we eventually choose a GMM with 5 Gaussian components, and the PPD k-means clustering is utilized to generate the initial values.

Selection J    1       2       3       4       5
BIC           -8.25   -9.01   -9.41   -9.39   -9.82
Selection J    6       7       8       9       10
BIC           -9.73   -9.52   -9.04   -9.24   -9.04
TABLE II: BIC for Different Numbers of Components
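For reference, a BIC-based selection of J can be sketched as follows; we use scikit-learn's GaussianMixture for brevity (an assumption: the paper computes BIC with its own centralized EM), and "wp_history.npy" is a hypothetical file holding the N x M WP matrix:

import numpy as np
from sklearn.mixture import GaussianMixture

W = np.load("wp_history.npy")                    # hypothetical N x M data
bics = {J: GaussianMixture(n_components=J, random_state=0).fit(W).bic(W)
        for J in range(1, 11)}
best_J = min(bics, key=bics.get)                 # lowest BIC wins; J = 5 here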

Thereafter, we build the JPD of the WP of the 9 spatially correlated WFs with the proposed PPD EM algorithm. The JPD constructed by the centralized EM algorithm is also given as the benchmark. However, the 9-dimensional JPD cannot be drawn for illustration. To overcome this difficulty, we derive several 1-dimensional and 2-dimensional JPDs from the 9-dimensional JPD. The parameter set of the 1-dimensional probability distribution (PD) is defined as $\Theta_m$ and that of the 2-dimensional JPD as $\Theta_{m,i}$. If the m-th dimension is chosen for the 1-dimensional PD, while the m-th and i-th dimensions are chosen for the 2-dimensional JPD, then $\Theta_m$ and $\Theta_{m,i}$ can be derived directly from the parameter set $\Theta$ of the complete 9-dimensional JPD based on the linear invariance property of the GMM [45], which is given as follows:

$\Theta_m = \{\alpha_j, \mu_{j,m}, \sigma_{j,m,m}\}_{j=1}^{J}$  (33)

$\boldsymbol{\mu}_{j,\{m,i\}} = [\mu_{j,m}, \mu_{j,i}]^T, \quad \boldsymbol{\Sigma}_{j,\{m,i\}} = \begin{bmatrix} \sigma_{j,m,m} & \sigma_{j,m,i} \\ \sigma_{j,i,m} & \sigma_{j,i,i} \end{bmatrix}$  (34)

$\Theta_{m,i} = \{\alpha_j, \boldsymbol{\mu}_{j,\{m,i\}}, \boldsymbol{\Sigma}_{j,\{m,i\}}\}_{j=1}^{J}$  (35)

Note that, since both $\Theta_m$ and $\Theta_{m,i}$ are made up of parts of $\Theta$, it may seem that even if $\Theta_m$ and $\Theta_{m,i}$ are correct, there is no guarantee of the correctness of the complete $\Theta$. But there actually is. In (5), the inverse of the complete covariance and the complete mean vector are both required to eventually obtain the statistics in (2), and the calculation result is used to update the parameters in (3). Thus, from the perspective of iterative calculation, the statistics in the k-th iteration condense the information of all the dimensions of $\Theta$ in the (k-1)-th iteration, and in turn constitute all the dimensions of $\Theta$ in the k-th iteration. Therefore, the correctness of $\Theta_m$ and $\Theta_{m,i}$ is necessary and sufficient for the correctness of $\Theta$.
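The derivation in (33)-(35) amounts to slicing each Gaussian component, which the following sketch makes explicit (the function name and shapes are ours):

import numpy as np

def gmm_marginal(alpha, mu, Sigma, dims):
    """alpha: (J,), mu: (J, M), Sigma: (J, M, M); dims: dimensions to keep."""
    d = np.asarray(dims)
    # Weights are unchanged; means and covariances are sub-sliced.
    return alpha, mu[:, d], Sigma[:, d[:, None], d[None, :]]

# e.g., the 2-dimensional JPD of the 1st and 3rd WFs (zero-based indices 0, 2):
# a2, mu2, S2 = gmm_marginal(alpha, mu, Sigma, [0, 2])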

The 1-dimensional probability distribution functions (PDFs) and the 1-dimensional cumulative distribution functions (CDFs) are shown in Fig. 4 and Fig. 5, respectively. Due to limited space, only the first three dimensions of the 9-dimensional JPD are provided one by one, where the empirical PDs are obtained from the corresponding original historical data, the benchmark is built by the centralized EM algorithm, and the PDs of each WF are constructed by the proposed PPD EM algorithm. It can be observed from Fig. 4 and Fig. 5 that: (1) both the benchmark and the PDs of each WF match the empirical PDs very well; (2) the PD curves of the WFs coincide, indicating that consensus of the probability modeling is achieved by the proposed algorithm; (3) the PD curves of each WF essentially coincide with the benchmark, indicating the correctness of the proposed PPD EM algorithm.

Further comparisons between the benchmark and the PDs of each WF are made via the relative standard error (RSE), where one curve represents the PDF or CDF built by each WF, the other represents the benchmark PDF or CDF, and the mean value of the benchmark is used for normalization. The RSE results are provided in Fig. 6. First, all the results are less than