Complex systems are ubiquitous; biological, social, transportation networks are examples that we face them on a daily basis. Managing and controlling such systems require learning from data observed and measured at possibly distinct points of the systems. Extracted Data have several intrinsic features: they contain various forms of attributes which are noisy and incomplete; they are massive and usually generated distributively across the networks. Effective and efficient utilization of available resources plays an important role in learning, managing, and controlling of such complex systems. Through a distributed and joint communication-computation design, this paper aims at presenting important machine learning instances where improvement in performance compared to conventional methods can be analytically and experimentally justified.
Many frameworks are developed to handle complex and distributed datasets efficiently. Hadoop Distributed File Systems (HDFS) , Google File System (GFS)  are examples of distributed storage systems. MapReduce  and Spark  are well known frameworks for distributed data processing systems. These platforms can be used to implement machine learning algorithms naturally .
Distributed processing and storage systems are designed to be universal by separating computation and communication tasks. The aim of communication is to exchange information between distributed machines efficiently, whereas the aim of computation is to run a distributed algorithm with the given communication platform. However, communication can be adjusted and modified based on the problem at hand achieving superior results. A new trend in research is started to design distributed and joint communication-computation optimal systems .
In this paper, we present some important machine learning instances in distributed settings and determine analytically and experimentally how one can obtain the optimal trade-off between performance and communication cost. The problem that we focus on is learning of Gaussian Processes
(GPs) which are fundamental models applicable to many machine learning tasks. We propose efficient methods for GP learning and analyze the performance and communication cost of the proposed methods. Although GPs can be used for both classification and regression, we only consider regression in presented applications and experiments. However, the proposed methods are also applicable to other learning models such as distributed manifold learning, KNN-based algorithms, etc.
There exist abundant studies related to distributed learning. Here, we address a few of them which we have found closest to the setting considered in this paper. From information theoretic point of view, distributed statistical inference is addressed by Ahlswede , Amari and Han [8, 9, 10]. In their settings, data are assumed to be distributed between two machines and through minimal communication, it is desired to test a hypothesis or estimate a parameter optimally. They used the method of types  to obtain some useful error and communication bounds on several problems. A survey by Han et. al.  summarizes all early studies in that direction where hypothesis testing, parameter estimation, and classification problems in distributed settings are addressed.
Recently, several researchers worked on distributed statistical learning problems and used tools from information theory to obtain bounds on the performance of the algorithms: [12, 13, 14, 15]. For instance, Duchi et. al.  investigated the minimum amount of communication required for achieving centralized minimax risk in certain estimators. Rahman et. al.  analyzed the problem of distributed hypothesis testing from information theoretic perspective and prove the optimality of binning method for conditionally independent sources.
There are papers addressing the problem of distributed statistical inference from machine learning point of view [16, 17, 18, 14, 19]. For instance, Liu in  showed that how the maximum likelihood estimation of parameters of exponential families can be obtained distributively. Meng et. al.  proposed a distributed algorithm for learning of inverse covariance matrix in Gaussian graphical models. Broderick et. al.  proposed a distributed algorithm for calculating the posterior distribution using variational inference over distributed datasets. Authors of  have addressed the kernel PCA in distributed systems on a large amount of data.
Distributed machine learning problems are also related to distributed optimization problems as optimization of an objective function during their training or test procedure is part of learning tasks. Therefore, several studies on distributed optimization algorithms  can directly or indirectly be applied to many machine learning models. Alternating Direction Method of Multipliers (ADMM)  is one of the most successful distributed algorithms for solving optimization problems over distributed datasets. There are many papers that are based on the ADMM algorithm [21, 22, 23]. For example, Forero et. al.  proposed a re-parameterization for SVM learning in distributed manner based on ADMM.
As mentioned before, in this paper we have considered GP learning for distributed systems. There are many works that propose methods for learning GPs distributively . Deisenroth et. al.  presented a distributed model for GP learning which is based on models called product of experts (PoE). PoE-based models assume that the local dataset in each machine is independent of the other local datasets. Thus, the marginal likelihood factorizes into product of individual local marginal likelihoods. In these models, the overall gram matrix will be block diagonal where the blocks are the local gram matrices of the machines. In PoE models, after training the model hyper-parameters, predictions of GP experts are combined. There are several combining methods such as PoE , Bayesian Committee Machine (BCM)  and the recently proposed method robust BCM (rBCM) .
PoE models need no data transmission during training which makes them favorable when communication is very restricted. However, assuming independence of local experts is unfavorable in many applications. For example, in RBF kernels that are non-increasing function of Euclidean distances of inputs, such assumption may decrease the performance of PoE models. This is due to the fact that data is distributed randomly between machines and nearby inputs may be located at distinct machines. In this case, transmission of inputs is necessary to achieve the optimal performance of the learning task.
There are other notable studies on localized and distributed learning of sparse GPs such as [27, 28, 29, 30, 31]. The main idea behind these works is to select or extract inducing points from the training datasets so that the trained sparse GP is close to the full GP. Authors of  proposed a re-parameterization on  and 
for applying sparse GPs on a network of multiple machines. This method for obtaining inducing points, in each iteration of the optimization procedure transmits all inducing and hyperparameters of the GPs to all machines. Due to the size of the inducing set and the number of iterations till the convergence, the communication cost of this method is usually very high. Moreover, the exact communication cost of this method is not calculable before running the training procedure.
This paper is organized as follows. Section 2 briefly describes background on GPs for the regression problem. In Section 3, we formally define the main problem of the paper. In Section 4, we propose three methods for solving the inner-product estimation problem in a distributed system. In section 5, we show how the proposed methods of Section 4 can be used for distributed learning of GPs. Finally, in Section 6, we evaluate our methods over synthetic and real datasets.
2 Gaussian Processes
Statistical machine learning models can be divided into parametric and non-parametric31]. Basically, a GP is a stochastic process that any finite collection of its points is jointly Gaussian. To learn a GP model, it is sufficient to estimate its mean and covariance functions. The covariance matrix of a GP is sometimes called gram matrix which is denoted by through out this paper.
In regression problems, we wish to estimate a real-valued function where . We can use a GP as a prior distribution over these real-valued functions. Suppose our training dataset is , which we call input samples and target observations. The training examples are noisy realization of the target function where each is obtained by adding a Gaussian noise to at input as follows:
The posterior process is a GP with the following mean and covariance functions:
where is an -dimensional row vector of kernel values between and the training inputs, is the vector of all target observations , and . Then, the predictive distribution for a new input is . Note that is an matrix where is the correlation of and .
In the training procedure, the hyperparameters of the model (e.g. , and parameters of the kernel function) need to be trained. Usually, the hyperparameters are obtained by maximizing the marginal likelihood . Although there exist many kernel functions such as linear, polynomial, squared exponential, Matern, and etc , we select the following linear kernel function:
are the hyperparameters of the kernel. Although our proposed methods for GP learning is based on the linear kernel, they can be readily applied on any Radial Basis Function (RBF) kernels as experimentally will showed in Section6. It is worth mentioning that kernels that are only a function of are called RBFs.
3 Problem Definition
Suppose we have only two machines in our distributed system so that each machine stores some portion of the whole dataset. Then the gram matrix can be partitioned as follows:
where (resp. ) is the gram matrix of data stored in the first (resp. second) machine and is the cross gram matrix between data of the first and second machines. Moreover, . Note that and can be calculated locally without any data transfer between machines. However, and its transpose must be calculated cooperatively by the two machines which requires intensive communication.
Let the first machine be responsible for learning the GP model. This machine calculates locally, whereas and need to be calculated using messages received from the second machine. In Section 5, we will show that if is available, then can be efficiently estimated. Thus, our main goal is to find an efficient method to obtain in distributed manner with minimum distortion.
In fact, we seek for an algorithm that has low communication cost and at the same time has low distortion in reconstruction of the target matrix. Without any limitation on communication cost, we gather the whole dataset at the first machine which leads to the best performance with zero distortion. In Section 4, we propose novel methods respecting imposed communication constraints and achieving optimal/suboptimal performance in the learning procedure.
In our basic setting, the gram matrix of GP is first estimated by communicating information bits between machines. The estimated gram matrix is then used to learn the latent function. Therefore, minimum distortion in estimating of the gram matrix is desired. By considering a linear kernel function for the GP model, our goal is to calculate the inner product of two datasets that stored in different machines with minimum distortion.
To put it formally, consider two datasets and stored in machines and , respectively. Samples and are -dimensional random vectors with zero means. We would like to calculate the inner product for any and in machine . Exact calculation of inner products requires transmission of the whole dataset to machine which is not efficient in many practical situations due to its communication cost. Lowering the communication cost, however, results in reduction in the quality of computed inner products. Therefore, finding the tradeoff between the communication cost and the quality of calculated inner products plays an important role in the design of practical systems.
To characterize the tradeoff, we limit the number of transmission bits per observation to bits; leading to the total of transmission bits. These bits are used to design an encoding map from X to an index set . The encoded index is transmitted to machine where a decoding map is used to map the index to another set which is called the set of reconstructed points. In other words, each is quantized to .
We would like to design and such that the distortion in computing the inner products is minimized. The distortion is measured by
where is the sample covariance matrix of .
A rate-distortion pair is achievable if there exist encoding and decoding maps and such that the transmission rate is limited by and the distortion in reconstructing the inner products is limited by .
In this paper, we are interested in characterizing the set of all achievable rate-distortion pairs. The region of all achievable rate-distortion pairs is a convex region. To show this, one needs to prove that if and are two achievable pairs, then is an achievable pair for any . A coding scheme which uses both schemes for different portions of data can achieve this. Let portion of data be encoded by the first scheme and the rest with the second scheme. It is easy to show that the overall rate-distortion pair is indeed achievable.
Extreme points of the achievable region can be obtained by solving the following optimization problem
4 Transmission Schemes
In this section, we present three different schemes for transmitting the dataset to machine
with low communication cost and distortion. The first scheme is based on an optimal compression method which characterizes rate-distortion region theoretically for normally distributed samples. Although the scheme is optimal, it is not practical due to computation complexity required for block coding. Thus, we propose another scheme that is an approximate version of the optimal one. This method uses per symbol coding instead of block coding. Experiments show its performance is near optimal and it is applicable for many distributed settings. The third scheme is even more simpler and does not require any nonlinear map. It is based on ideas from dimension reduction schemes which are tailored to our distributed setting. This method has more communication cost than the previous two schemes, but it is very simple and fairly efficient in many applications.
Let us assume the datasets and are drawn from two independent distributions as
Taking the expectation of distortion in (7) with respect to , we obtain
where is the covariance matrix of defined by It is worth mentioning that under the preceding conditions, the covariance matrix of is the only parameter influencing the achievable rate-distortion region and therefore the knowledge of the exact distribution of is not required.
We would like to characterize rate-distortion region for the above setting. This problem is an instance of the rate-distortion problem in information theory where the distortion is defined by (7). The optimal coding scheme achieving points on the boundary of the achievable region is obtained by solving the following optimization problem :
Suppose is the solution of (11). The coding scheme with the rate and distortion is obtained as follows:
Generating Codebook: Calculate . Generate a codebook consisting of sets of size that are drawn i.i.d. from . We denote the generated random sets by . The codebook is available at both transmitter () and receiver () machines. Note that the codebook can be generated at both machines by an equal initial seed which leads to identical codebooks. Thus, there is no need to send the codebook from a machine to the other.
Encoding: We are given a set . Index it by if there exists a set in the codebook such that the pair is strongly jointly typical (for definition of strong typicality refer to ). If there is more than one such , send the first in lexicographic order. If there is no such , send 1 to the receiver. It is clear that transmitting the index requires bits. Since by each transmission we send data samples, the average rate will be bits per each sample.
Decoding: At the receiver (machine in our case), select as the reproduced set for .
For the proof that this coding scheme asymptotically (when ) has rate and distortion see . Thus, by solving the optimization problem (11) and using the above coding scheme, we obtain the optimal method for transmitting the set with rate and distortion . Unfortunately, in many cases, there is no closed form solution for (11), but we can obtain a lower bound for it (Theorem 1).
The following theorem presents a lower bound for rate-distortion function (11).
Let be -dimensional zero mean random vectors with covariance matrix , and let the distortion measure be , where is a symmetric positive definite matrix. Then the rate distortion function (11) is lower bounded by
is entropy of variable , is the diagonal matrix of eigenvalues of
is the diagonal matrix of eigenvalues of, and is chosen so that .
The lower bound is obtained as follows. Let , we have
where (a) follows from the fact that conditioning does not increase the entropy; (b) follows from the fact that Gaussian distribution maximizes the entropy conditioned on a covariance matrix.
One can find the best lower bound in (15) by minimizing over all appropriate covariance matrices . This amounts to solving the following optimization problem:
where the second constraint comes from the fact that independence of and results in an equality in (a). More precisely, where and are independent; this yields and thus, .
It can be shown that the above problem is convex and therefore KKT conditions provide necessary and sufficient conditions for the optimal solution. To obtain the KKT conditions, we first obtain the Lagrangian as
where and are the Lagrange multipliers. Using , the KKT conditions amount to
Since is positive semi-definite, (19) yields
Since is positive definite, it can be decomposed as where is the symmetric squared root of . Multiplying (22) by from left, from right and from middle, yields
We can decompose as follows
is a unitary matrix. Since the eigenvalues ofare equal to eigenvalues of , is a diagonal matrix which includes eigenvalues of . By substituting (25) in (24) we can rewrite (24) as follows
The above equation forces to be diagonal. Each element on diagonal of is whether or . Clearly, since , any solution for can be mapped uniquely to and vice versa. It is easy to show that which is defined as follows satisfies all the KKT conditions and therefore, after mapping it to , it is an optimal solution for (16)
where is chosen so that:
Theorem 1 gives a lower bound on the rate distortion function (11). In some cases such as the Gaussian distribution the lower bound is achievable (see Theorem 2), but it is not the case in general. Although this theorem does not propose a coding scheme for data compression, it gives some useful clues for designing low communication rate coding schemes. According to the lower bound in (13), eigenvalues of the product matrix plays an important role in bit rate allocation. In other words, dimensions with larger eigenvalues need more bits compared to dimensions with small eigenvalues. This intuition lead us to propose an approximate coding scheme in the next section for Gaussian datasets.
The following theorem shows that the lower bound (13) is achievable when the dataset is drawn from a normal distribution. In this case, the achievable rate distortion region can be fully characterized as the following theorem presents.
If are drawn from a normal distribution , then the lower bound (13) is achievable if we choose as follow:
where and are independent, and is chosen so that and .
To find the conditional distribution that achieves the lower bound (13), it is usually more convenient to use conditional distribution
. If we choose the joint distribution ofand using (30), then
which is equal to the lower bound (13). ∎
The theorem 2 shows that the optimal compression for Gaussian distribution must be performed using the product of covariance matrices and . We will see that this result is similar to Theorem 3 where the optimal linear dimension reduction is performed by product of the sample covariance matrices and . Theorem 2 is in fact the well known reverse-water filling problem on eigenvalues of matrix . In other words, if we decompose as where is a diagonal matrix of eigenvalues of , then we should consume more bits for dimensions with larger corresponding eigenvalue . According to the theorem, for dimensions with eigenvalues less than a threshold no bits are allocated (i.e. they are not encoded and transmitted), and for dimensions greater than the threshold we consume bits to reduce distortion for them. In fact, in Theorem 2 is the distortion value for -th dimension after compression.
According to Theorem 2, the machine needs for compressing dataset , thus, if is not known for the machine , transmitting it has () bits communication cost. After compression, transmitting the compressed dataset requires () bits.
Although the compression method proposed by Theorem 2 is optimal and simulation results in section 6 (see Fig. 2) shows that it significantly outperforms the dimension reduction method, it is unfortunately not practical in real world due to exponential computation complexity for encoding. To overcome this issue, in the next section we have proposed an applicable approximate method for compression which has near optimal performance.
4.2 Per-Symbol Compression Scheme
Although the proposed method for compression in Theorem 2 is optimal for normal data samples, it is not practically favorable to implement in many cases as it requires vector quantizers which are designed to work on blocks of data samples. Thus, a per-symbol quantizer which is easier to implement is favorable.
In this section, we propose a simple yet effective quantizer for normal data samples which operates near optimal in many practical situations. In the proposed method, we first transform samples of the dataset by , where is the squared root of , and
is obtained by solving the following singular value decomposition problem:
It is easy to see that the above transformation makes dimensions of samples uncorrelated. Hence, for Gaussian samples the dimensions are independent. We denote the transformed dataset by . Thus, we can consider each as
independent normal random variables. Finally, we quantize each dimension separately. Our coding scheme is as follows:
Encoder: Transform dataset to using . Quantize each dimension of samples using a scalar quantizer. Transmit quantized samples denoting by to the receiver.
Decoder: Transform received data samples using
to reconstruct .
In fact, we quantize the transformed version of samples in the encoder and apply the inverse transform on received samples at the decoder. The expected distortion of this method is calculated by taking expectation from (7) as follows:
where is the covariance matrix of . Since the dimensions of are independent for normal distribution and the quantization is done separately on each dimension, is diagonal, hence by denoting the expected distortion is . In fact, is the distortion of the -th dimension. If we want to quantize each sample by bits, we should allot more bits to the dimensions with larger distortions. More precisely, for finding the optimal bit allocation to each dimension, the following integer optimization problem should be solved:
where is the number of bits allocated to -th dimension. For solving the above minimization problem, be calculated first.
To relate and , we use a simple scalar quantizer for each dimension. Assume we have a scalar random variable which we want to quantize it by bits. We create
equally probable bins over the real axis and use bins centroids as the reproduction points at the decoder (receiver). To compress, we send index of the bin that belongs to it by bits.
Assume we denote the -th bin interval by and its centroid by , then
where are determined so that . If , then
If is the set of bins boundaries for standard normal distribution, then the boundaries for is obtained by . Note that the is only function of . Hence, we can calculate using
The above equation shows that the centroids of any normal distribution are simply obtained by multiplying the centroids of standard normal distribution by standard deviation.
The expected reconstruction error for this scalar quantizer is calculated as follows:
is variance of the centroids.
In our problem setting, from (35) and (40), the distortion of -th dimension is , where is the number of bits allocated to the -th dimension and is the variance of -th dimension of the transformed dataset . Now, we can solve the integer program in (36). Defining as the amount of decrease in distortion of -th dimension by adding a new bit, then the greedy algorithm presented in Algorithm 1, obtains the optimal bit allocation. In this algorithm, we allocate one bit in each iteration to the dimension with largest and update the distortion of that dimension. We iterate this procedure until all available bits are allocated.
To show that the above greedy bit allocation algorithm is optimal, we rely on the fact that is a decreasing function of . The plot of against the bit rate shows that it is indeed decreasing exponentially (Fig. 1). Note that the figure is plotted for standard normal distribution and behaves similarly for any normal distribution.
We denote the optimal bit allocation by and the greedy solution by . We define the distance between the greedy and optimal solutions as follows:
If the problem has multiple optimal solutions, we select a solution with the minimum distance value . If , then the greedy solution is also optimal. Hence, we assume that and show that it leads to a contradiction.
If , then there exist two bit rates and so that and . Assume that in iteration of the greedy algorithm, the bit rate of dimension exceeds , and the bit rate of dimension is . Thus,
On the other hand, it is clear that . Hence,
Thus, by removing a bit from and adding it to we obtain a new solution which its error value is less than or equal to error of . Since is optimal, could not have smaller error value. But its distance with the greedy solution is which is a contradiction.
The proposed per-symbol coding scheme has () time complexity for encoding (using binary search algorithm over the codebook) and () for decoding. However, its space complexity is () that is required for storing the bins intervals and centroids (codebook). This method requires to transmit matrices and between machine and machine . Thus, the overall communication cost is ( bits.
Although in the proposed per-symbol method the underlying distribution of samples of is considered to be normal, we can use it for other distributions. However, the normal distribution is used to make the dimensions independent and to prove the optimality of the greedy bit loading. Experimental results in Section 6 reveal that the performance of this method for other distributions is desirable.
4.3 Dimension Reduction
The third approach proposed in this paper for reducing communication cost for transmitting the dataset is to represent them in a lower dimensional space. In this section we propose a linear dimension reduction method that efficiently minimizes the inner product distortion. Interestingly, PCA, the conventional dimension reduction method, falls from optimality as it is shown in Section 6.
Our goal is to construct a lower dimensional vector for so that the error function (7) be minimized. In order to represent by which lies in a lower dimensional space, is defined as a weighted sum of some basis vectors as follows
where are orthogonal -dimensional basis vectors so that , is the weight of vector for , and is dimension of the new space. The equation (45) can be rewritten in vector-matrix form as
where , and is a matrix that forms the -th column, thus, . In fact, is the corresponding representation of in the -dimensional space. We would like to minimize the error function (7), with respect to the matrix and vectors , . More precisely, we want to find the solution of the following optimization problem:
Following theorem gives the optimal values of this problem.
We first optimize with respect to considering as a constant matrix. To this end, by taking the derivative with respect to and setting it to zero, we obtain
The first term in (52) is constant and can be ignored. Next, we obtain by introducing the Lagrange multiplier as follows:
where is an symmetric matrix that comes from the constraint . Taking derivative from (53) with respect to , yields
By setting it to zero and multiplying by from the left, yields . Hence, we have
Multiplying by and from left and right, respectively, yields
It is easy to show that replacing with where is a diagonal matrix which includes eigenvalues of , satisfies (56). In other words, if columns of are the right eigenvectors of , then it satisfies (56). Thus, we would like to solve the following eigenvector problem:
Clearly for minimizing , the trace of should be maximized, thus, the right eigenvectors of that correspond to the largest eigenvalues must be selected as the columns of . ∎
The above theorem shows that the optimal dimension reduction can be obtained from factorization of . Therefore, is first transmitted to machine . Transmitting from machine requires communicating () bits. After dimension reduction, transmitting and to the machine requires () bits, thus, the overall communication cost is () bits. Theorem 3 shows that if the dimensions in dataset are highly correlated, then the error value is significantly lower than PCA and conversely for loosely correlated dimensions the method is close to the PCA so that if then it will be completely equivalent to PCA.
It is worth mentioning that if (e.g. and have identical distributions) there is no need to transmitting , thus the communication cost reduces by () bits. In this case, the machine can perform the proposed dimension reduction by estimating . But in this case, our proposed method is equivalent to the PCA. Because, eigenvectors of are identical to that of and the eigenvalues are squared. Since is positive definite, order of eigenvalues for and are identical. This shows that our proposed method for dimension reduction outperforms the PCA when the local datasets on the machines differ from each other. This will be shown experimentally in Section 6.