## I Introduction

Consider the vector Gaussian Chief Executive Officer (CEO) problem shown in Figure 1. In this model, there is an arbitrary number of agents each having a noisy observation of a vector Gaussian source . The goal of the agents is to describe the source to a central unit, which wants to reconstruct this source to within a prescribed distortion level. The incurred distortion is measured according to some loss measure , where designates the reconstruction alphabet. For quadratic distortion measure, i.e.,

the rate-distortion region of the vector Gaussian CEO problem is still unknown in general, except in a few special cases. The most important of these is perhaps the case of scalar sources, i.e., the scalar Gaussian CEO problem, for which a complete solution, in terms of a characterization of the optimal rate-distortion region, was found independently by Oohama in [1] and by Prabhakaran et al. in [2]. Key to establishing this result is a judicious application of the entropy power inequality. The extension of this argument to the case of vector Gaussian sources, however, is not straightforward as the entropy power inequality is known to be non-tight in this setting. The reader may also refer to [3, 4], where non-tight outer bounds on the rate-distortion region of the vector Gaussian CEO problem under quadratic distortion measure are obtained by establishing some extremal inequalities that are similar to that of Liu and Viswanath [5], and to [6], where a strengthened extremal inequality yields a complete characterization of the region of the vector Gaussian CEO problem in the special case of a trace distortion constraint.

In this paper, we study the CEO problem of Figure 1 in the case in which the remote source, the agents’ observations and the decoder’s side information are jointly Gaussian and the distortion is measured using the logarithmic loss criterion, i.e.,

$$d^{(n)}(x^n, \hat{x}^n) = \frac{1}{n} \sum_{i=1}^{n} d(x_i, \hat{x}_i), \tag{1}$$

with the letter-wise distortion given by

$$d(x, \hat{x}) = \log\left(\frac{1}{\hat{x}(x)}\right), \tag{2}$$

where $\hat{x}$ designates a probability distribution on the source alphabet and $\hat{x}(x)$ is the value of this distribution evaluated for the outcome $x$.

The logarithmic loss distortion measure, often referred to as self-information loss in the literature about prediction, plays a central role in settings in which reconstructions are allowed to be ‘soft’, rather than ‘hard’ or deterministic. That is, rather than just assigning a deterministic value to each sample of the source, the decoder also gives an assessment of the degree of confidence or reliability on each estimate, in the form of weights or probabilities. This measure, which was introduced in the context of rate-distortion theory by Courtade et al. [7, 8], has appreciable mathematical properties [9, 10], such as a deep connection to lossless coding for which fundamental limits are well developed (e.g., see [11] for recent results on universal lossy compression under logarithmic loss that are built on this connection). Also, it is widely used as a penalty criterion in various contexts, including clustering and classification [12], pattern recognition, learning and prediction [13], image processing [14], secrecy [15] and others.

### I-A Main Contributions

The main contribution of this paper is a complete characterization of the rate-distortion region of the vector Gaussian CEO problem of Figure 1 under logarithmic loss distortion measure. In the special case in which there is no side information at the decoder, the result can be seen as the counterpart, to the vector Gaussian case, of that by Courtade and Weissman [8, Theorem 10] who established the rate-distortion region of the CEO problem under logarithmic loss in the discrete memoryless (DM) case. For the proof of this result, we find it useful to first extend Courtade-Weissman’s result [8, Theorem 10] on the rate-distortion region of the DM -encoder CEO problem to the case in which the CEO has access to a correlated side information stream which is such that the agents’ observations are independent conditionally given the side information and remote source. On this aspect, we hasten to mention that, for the DM model, the side information is not assumed to be conditionally independent of the agents’ observations given the remote source. For this reason, the result cannot be obtained by a direct application of [8, Theorem 10] viewing the side information as another agent’s observation that is encoded at large (infinite) rate, and a new converse proof is needed. Also, our converse proof involves a redefinition of the involved auxiliary random variables. Next, we derive an outer bound on the rate-distortion region of the vector Gaussian CEO problem by evaluating the outer bound from the DM model using the de Bruijn identity, a connection between differential entropy and Fisher information, along with the properties of minimum mean square error (MMSE) and Fisher information. In contrast to the case of quadratic distortion measure, for which the application of this technique was shown in [16] to result in an outer bound that is generally non-tight, we show that this approach is successful in the case of logarithmic distortion measure and yields a complete characterization of the region. On this aspect, it is noteworthy that, in the specific case of scalar Gaussian sources, an alternate converse proof may be obtained by extending that of the scalar Gaussian many-help-one source coding problem by Oohama [1] and Prabhakaran et al. [2], by accounting for side information and replacing the original mean square error distortion constraint with conditional entropy. Such an approach, however, does not seem to lead to a conclusive result in the vector case as the entropy power inequality is known to be generally non-tight in this setting [17, 18]. The proof of the achievability part simply corresponds to the evaluation of the result for the DM model using Gaussian test channels and no time-sharing. Because this does not necessarily imply that Gaussian test channels also exhaust the Berger-Tung inner bound, we investigate the question and we show that they do if time-sharing is allowed.

Furthermore, we show that application of our results allows us to find complete solutions to three related problems. The first is the -encoder hypothesis testing against conditional independence problem that was introduced and studied by Rahman and Wagner in [19]. In this problem, sources are compressed distributively and sent to a detector that observes the pair and seeks to make a decision on whether is independent of conditionally given or not. The aim is to characterize all achievable encoding rates and exponents of the Type II error probability when the Type I error probability is to be kept below a prescribed (small) value. For both DM and vector Gaussian models, we find a full characterization of the optimal rates-exponent region when induces conditional independence between the variables under the null hypothesis. In both settings, our converse proofs show that the Quantize-Bin-Test scheme of [19, Theorem 1], which is similar to the Berger-Tung distributed source coding, is optimal. In the special case of one encoder, the assumed Markov chain under the null hypothesis is non-restrictive; and, so, we find a complete solution of the vector Gaussian hypothesis testing against conditional independence problem, a problem that was previously solved in [19, Theorem 7] in the case of scalar-valued source and testing against independence (note that [19, Theorem 7] also provides the solution of the scalar Gaussian many-help-one hypothesis testing against independence problem). The second is a quadratic vector Gaussian CEO problem with a reconstruction constraint on the determinant of the error covariance matrix that we introduce here, and for which we also characterize the optimal rate-distortion region. Key to establishing this result is showing that the rate-distortion region of the vector Gaussian CEO problem under logarithmic loss which is found in this paper translates into an outer bound on the rate region of the quadratic vector Gaussian CEO problem with determinant constraint. The reader may refer to, e.g., [20] and [21] for examples of usage of such a determinant constraint in the context of equalization and others. The third is an extension of Tishby’s single-encoder Information Bottleneck (IB) method [12] to the case of multiple encoders. Information theoretically, this problem is known to be essentially a remote source coding problem with logarithmic loss distortion measure [22]; and, so, we use our result for the vector Gaussian CEO problem under logarithmic loss to infer a full characterization of the optimal trade-off between complexity (or rate) and accuracy (or information) for the distributed vector Gaussian IB problem.

Finally, for both DM and memoryless Gaussian settings we develop Blahut-Arimoto (BA) [23, 24] type iterative algorithms that allow one to compute (approximations of) the rate regions that are established in this paper, and we prove their convergence to stationary points. We do so through a variational formulation that allows one to determine the set of self-consistent equations that are satisfied by the stationary solutions. In the Gaussian case, we show that the algorithm reduces to an appropriate updating rule of the parameters of noisy linear projections. We note that the computation of the rate-distortion regions of multiterminal and CEO source coding problems is important per se as it involves non-trivial optimization problems over distributions of auxiliary random variables. Also, since the logarithmic loss function is instrumental in connecting problems of multiterminal rate-distortion theory with those of distributed learning and estimation, the algorithms that are developed in this paper also find usefulness in emerging applications in those areas. For example, our algorithm for the DM CEO problem under logarithmic loss measure can be seen as a generalization of Tishby’s IB method [12] to the distributed learning setting. Similarly, our algorithm for the vector Gaussian CEO problem under logarithmic loss measure can be seen as a generalization of that of [25, 26] to the distributed learning setting. For other extensions of the BA algorithm in the context of multiterminal data transmission and compression, the reader may refer to related works on point-to-point [27, 28] and broadcast and multiple access multiterminal settings [29, 30].

### I-B Related Works

As we already mentioned, this paper mostly relates to [8] in which the authors establish the rate-distortion region of the DM CEO problem under logarithmic loss in the case of an arbitrary number of encoders and no side information at the decoder, as well as that of the DM multiterminal source coding problem under logarithmic loss in the case of two encoders and no side information at the decoder. Motivated by the increasing interest in problems of learning and prediction, a growing body of work studies point-to-point and multiterminal source coding models under logarithmic loss. In [9], Jiao et al. provide a fundamental justification for inference using logarithmic loss, by showing that under some mild conditions (the loss function satisfying some data processing property and alphabet size larger than two) the reduction in optimal risk in the presence of side information is uniquely characterized by mutual information, and the corresponding loss function coincides with the logarithmic loss. Somewhat related, in [31] Painsky and Wornell show that for binary classification problems the logarithmic loss dominates “universally” any other convenient (i.e., smooth, proper and convex) loss function, in the sense that by minimizing the logarithmic loss one minimizes the regret that is associated with any such measure. More specifically, the divergence associated with any smooth, proper and convex loss function is shown to be bounded from above by the Kullback-Leibler divergence, up to a multiplicative normalization constant. In [11], the authors study the problem of universal lossy compression under logarithmic loss, and derive bounds on the non-asymptotic fundamental limit of fixed-length universal coding with respect to a family of distributions that generalize the well-known minimax bounds for universal lossless source coding. In [32], the minimax approach is studied for a problem of remote prediction and is shown to correspond to a one-shot minimax noisy source coding problem. The setting of remote prediction of [32] provides an approximate one-shot operational interpretation of the Information Bottleneck method of [12], which is also sometimes interpreted as a remote source coding problem under logarithmic loss [22].

Logarithmic loss is also instrumental in problems of data compression under a mutual information constraint [33], and in problems of relaying with relay nodes that are constrained not to know the users’ codebooks (sometimes termed “oblivious” or nomadic processing), which were studied in the single user case first by Sanderovich et al. in [34] and then by Simeone et al. in [35], and in the multiple user multiple relay case by Aguerri et al. in [36] and [37]. Other applications in which the logarithmic loss function can be used include secrecy and privacy [15, 38], hypothesis testing against independence [39, 40, 19, 41, 42] and others.

### I-C Outline and Notation

The rest of this paper is organized as follows. Section II provides a formal description of the CEO model that is studied in this paper, as well as some definitions that are related to it. In Section III, we provide some results for the DM model that will be shown to be instrumental for the main goal of this paper, which is the study of the vector Gaussian CEO problem with side information under logarithmic loss. In particular, this section contains a single-letter characterization of the rate-distortion region of the DM CEO problem with side information in the case in which the agents’ observations are conditionally independent given the remote source and the decoder’s side information. Section IV contains the main results of this paper. First, we establish an explicit characterization of the rate-distortion region of the memoryless vector Gaussian CEO problem with side information under logarithmic loss. We then show that Gaussian test channels with time-sharing exhaust the Berger-Tung rate region, which is optimal. In this section we also use our results on the CEO problem under logarithmic loss to infer complete solutions of three related problems: the vector Gaussian distributed hypothesis testing against conditional independence problem, a quadratic vector Gaussian CEO problem with a determinant constraint on the error covariance matrix, and the vector Gaussian distributed Information Bottleneck problem. Section V provides BA-type algorithms for the computation of the rate-distortion regions that are established in this paper in both DM and Gaussian cases, as well as proofs of their convergence and some numerical examples.

Throughout this paper, we use the following notation. Upper case letters are used to denote random variables, e.g., ; lower case letters are used to denote realizations of random variables, e.g., ; and calligraphic letters denote sets, e.g., . The cardinality of a set is denoted by . The closure of a set is denoted by . The length- sequence is denoted as ; and, for integers and such that , the sub-sequence is denoted as . Probability mass functions (pmfs) are denoted by ; and, sometimes, for short, as . We use to denote the set of discrete probability distributions on . Boldface upper case letters denote vectors or matrices, e.g., , where context should make the distinction clear. For an integer , we denote the set of integers smaller or equal as . For a set of integers , the complementary set of is denoted by , i.e., . Sometimes, for convenience we will need to define as . For a set of integers ; the notation designates the set of random variables with indices in the set , i.e., . We denote the covariance of a zero mean, complex-valued, vector by , where indicates conjugate transpose. Similarly, we denote the cross-correlation of two zero-mean vectors and as , and the conditional correlation matrix of given as i.e., . For matrices and , the notation denotes the block diagonal matrix whose diagonal elements are the matrices and and its off-diagonal elements are the all zero matrices. Also, for a set of integers and a family of matrices of the same size, the notation is used to denote the (super) matrix obtained by concatenating vertically the matrices , where the indices are sorted in the ascending order, e.g, .

## II Problem Formulation

Consider a -dimensional memoryless source with finite alphabet and joint probability mass function (pmf) . It is assumed that for all ,

(3) |

forms a Markov chain in that order. Also, let be a sequence of independent copies of , i.e., . Consider now the -encoder CEO problem with side information shown in Figure 1. In this model, Encoder (or agent) , , observes the memoryless source and uses bits per sample to describe it to the decoder. The decoder observes a statistically dependent memoryless side information stream, in the form of the sequence , and wants to reconstruct the remote source to within a prescribed fidelity level. Similarly to [8], in this paper we take the reproduction alphabet to be equal to the set of probability distributions over the source alphabet . Thus, for a vector , the notation means the -coordinate of , , which is a probability distribution on , evaluated for the outcome . In other words, the decoder generates ‘soft’ estimates of the remote source’s sequences. We consider the logarithmic loss distortion measure defined as in (1), where the letter-wise distortion measure is given by (2).
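As an illustration of what a ‘soft’ reconstruction means in practice, the following minimal sketch computes the logarithmic loss of a short sequence, assuming the standard form in which the letter-wise loss is the logarithm of the reciprocal of the probability that the reconstruction assigns to the true symbol, and the block distortion is the average of the letter-wise losses. The function names, the two-letter alphabet and the example distributions are hypothetical and chosen only for this example.

```python
import math


def log_loss(x, x_hat):
    """Letter-wise logarithmic loss log(1 / x_hat(x)), in nats.

    x_hat is a soft reconstruction: a dict mapping each symbol of the
    (hypothetical) source alphabet to a probability.
    """
    return -math.log(x_hat[x])


def average_log_loss(xs, x_hats):
    """Block distortion: average of the letter-wise losses."""
    return sum(log_loss(x, q) for x, q in zip(xs, x_hats)) / len(xs)


# Hypothetical source sequence over the alphabet {"a", "b"}.
source = ["a", "b", "a"]

# A reconstruction that puts most of its weight on the true symbols.
confident = [
    {"a": 0.9, "b": 0.1},
    {"a": 0.2, "b": 0.8},
    {"a": 0.7, "b": 0.3},
]

# A reconstruction that expresses no knowledge about the source.
uniform = [{"a": 0.5, "b": 0.5}] * 3

d_confident = average_log_loss(source, confident)
d_uniform = average_log_loss(source, uniform)
print(d_confident, d_uniform)
```

The uninformative reconstruction incurs a loss of one bit (log 2 nats) per symbol, while the reconstruction that concentrates its weight on the true symbols incurs less.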

###### Definition 1.

A rate-distortion code (of blocklength ) for the model of Figure 1 consists of encoding functions

and a decoding function

###### Definition 2.

A rate-distortion tuple is achievable for the DM CEO source coding problem with side information if there exist a blocklength , encoding functions and a decoding function such that

The rate-distortion region of the model of Figure 1 is defined as the closure of all non-negative rate-distortion tuples that are achievable.

## III Some Results in the DM Case

### III-A Rate-Distortion Region

The following theorem gives a single-letter characterization of the rate-distortion region of the DM CEO problem with side information under logarithmic loss measure.

###### Definition 3.

For given tuple of auxiliary random variables with distribution such that factorizes as

(4) |

define as the set of all non-negative rate-distortion tuples that satisfy, for all subsets ,

###### Theorem 1.

The rate-distortion region for the DM CEO problem under logarithmic loss is given by

where the union is taken over all tuples with distributions that satisfy (4).

###### Remark 1.

###### Remark 2.

Theorem 1 extends the result of [8, Theorem 10] to the case in which the decoder has, or observes, its own side information stream and the agents’ observations are conditionally independent given the remote source and , i.e., holds for all subsets . In particular, the side information does not need to be conditionally independent of the agents’ observations given , i.e., the Markov chain is not required to hold; and, for this reason, the result of Theorem 1 cannot be obtained by a direct application of [8, Theorem 10] viewing as another agent’s observation that is encoded at large (infinite) rate. Specifically, our converse proof of Appendix A generalizes that of [8, Theorem 10] to the model with additional correlated side information at the decoder such that the agents’ observations are independent conditionally on the remote source and . Also, the proof involves a redefinition of the auxiliary random variables.

### III-B An Example: Distributed Pattern Classification

Consider the problem of distributed pattern classification shown in Figure 2. In this example, the decoder is a predictor whose role is to guess the unknown class of a measurable pair on the basis of inputs from two learners as well as its own observation about the target class, in the form of some correlated . It is assumed that . The first learner produces its input based only on ; and the second learner produces its input based only on . For the sake of a smaller generalization gap, the inputs of the learners are restricted to have description lengths that are no more than and bits per sample, respectively. (The generalization gap, defined as the difference between the empirical risk (average risk over a finite training sample) and the population risk (average risk over the true joint distribution), can be upper bounded using the mutual information between the learner’s inputs and outputs; see, e.g., [43, 44] and the recent [45]. This provides a fundamental justification of the use of the minimum description length (MDL) constraint on the learners’ mappings as a regularizer term.) Let and be two (stochastic) such learners. Also, let be a soft-decoder or predictor that maps the pair of representations and to a probability distribution on the label space. The pair of learners and predictor induce a classifier

(5) |

whose probability of classification error is defined as

(6) |

Let be the rate-distortion region of the associated two-encoder DM CEO problem with side information as given by Theorem 1. The following proposition shows that there exists a classifier for which the probability of misclassification can be upper bounded in terms of the minimal average logarithmic loss distortion that is achievable for the rate pair in .

###### Proposition 1.

###### Proof.

Let a triple of mappings be given. It is easy to see that the probability of classification error of the classifier as defined by (6) satisfies

(7) |

Applying Jensen’s inequality on the right hand side (RHS) of (7), using the concavity of the logarithm function, and combining with the fact that the exponential function increases monotonically, the probability of classification error can be further bounded as

(8) |

Using (5) and continuing from (8), we get

(9) |

where the last inequality follows by applying Jensen’s inequality and using the concavity of the logarithm function.

Noticing that the term in the exponential function in the RHS of (9),

(10) |

is the average logarithmic loss, or cross-entropy risk, of the triple ; the inequality (9) implies that minimizing the average logarithmic loss distortion leads to a classifier with a smaller (bound on its) classification error. Using Theorem 1, the minimum average logarithmic loss, minimized over all mappings and that have description lengths no more than and bits per sample, respectively, as well as all choices of , is

(11) |

Thus, the direct part of Theorem 1 guarantees the existence of a classifier whose probability of error satisfies the bound given in Proposition 1. ∎
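The Jensen step behind this bound can be checked numerically. The sketch below is an illustration only, under assumptions that are not taken from the text: a randomly generated joint pmf of the label and the predictor's input, a randomly generated soft predictor, and a classifier that draws its label at random from the predictor's soft output, so that its error probability is one minus the expected probability assigned to the true label. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_labels, n_inputs = 4, 6  # hypothetical alphabet sizes

# Hypothetical joint pmf p(y, u) of the label y and the predictor's input u.
p = rng.random((n_labels, n_inputs))
p /= p.sum()

# Hypothetical soft predictor q(y | u): each column is a pmf over labels.
q = rng.random((n_labels, n_inputs))
q /= q.sum(axis=0, keepdims=True)

# Classifier that samples its label from q(. | u):
# error probability = 1 - E[q(Y | U)].
p_error = 1.0 - np.sum(p * q)

# Average logarithmic loss (cross-entropy risk) of the predictor, in nats.
avg_log_loss = -np.sum(p * np.log(q))

# Jensen's inequality with the concave logarithm gives
# E[q(Y | U)] >= exp(E[log q(Y | U)]), hence the bound below.
bound = 1.0 - np.exp(-avg_log_loss)

print(f"error probability: {p_error:.4f}, bound: {bound:.4f}")
```

Under these assumptions the error probability never exceeds the bound, and a predictor with a smaller average logarithmic loss has a smaller bound on its error.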

To make the above example more concrete, consider the following scenario where plays the role of information about the sub-class of the label class . More specifically, let be a random variable that is uniformly distributed over . Also, let and be two random variables that are independent of each other and of , distributed uniformly over and respectively. The state acts as a random switch that connects or to , i.e.,

(12) |

That is, if then , and if then . Thus, the value of indicates whether is odd- or even-valued (i.e., the sub-class of ). Also, let

(13a) |

(13b) |

(13c) |

where and are Bernoulli- random variables, , that are independent of each other and of , and the addition is modulo . For simplification, we let . We numerically approximate the set of pairs such that is in the rate-distortion region corresponding to the CEO network of this example. The algorithm that we use for the computation will be described in detail in Section V-A. The lower convex envelope of these pairs is plotted in Figure (a) for . Continuing our example, we also compute the upper bound on the probability of classification error according to Proposition 1. The result is given in Figure (b). Observe that if and are high-quality estimates of (e.g., ), then a small increase in the complexity results in a large relative improvement of the (bound on the) probability of classification error. On the other hand, if and are low-quality estimates of (e.g., ) then we require a large increase of in order to obtain an appreciable reduction in the error probability. Recalling that larger implies lesser generalization capability [43, 44, 45], these numerical results are consistent with the fact that classifiers should strike a good balance between accuracy and their ability to generalize well to unseen data. Figure (c) quantifies the value of side information given to both learners and predictor, none of them, or only the predictor, for .

### III-C Estimation of Encoder Observations

In this section, we focus on the two-encoder case, i.e., . Suppose the decoder wants to estimate the encoder observations , i.e., . Note that in this case the side information can be chosen arbitrarily correlated to and is not restricted to satisfy any Markov structure, since the Markov chain is satisfied for all choices of that are arbitrarily correlated with .

If a distortion of bits is tolerated on the joint estimation of the pair , then the achievable rate-distortion region can be obtained easily from Theorem 1, as a slight variation of the Slepian-Wolf region, namely the set of non-negative rate-distortion triples such that

(14a) | ||||

(14b) | ||||

(14c) |

The following theorem gives a characterization of the set of rate-distortion quadruples that are achievable in the more general case in which a distortion is tolerated on the estimation of the source component and a distortion is tolerated on the estimation of the source component , i.e., the rate-distortion region of the two-encoder DM multiterminal source coding problem with arbitrarily correlated side information at the decoder.

###### Theorem 2.

If , the component is to be reconstructed to within average logarithmic loss distortion and the component is to be reconstructed to within average logarithmic loss distortion , the rate-distortion region of the associated two-encoder DM multiterminal source coding problem with correlated side information at the decoder under logarithmic loss is given by the set of all non-negative rate-distortion quadruples that satisfy

for some joint measure of the form .

###### Remark 3.

The auxiliary random variables of Theorem 2 are such that and form Markov chains.

###### Remark 4.

The result of Theorem 2 extends that of [8, Theorem 6] for the two-encoder source coding problem with average logarithmic loss distortion constraints on and and no side information at the decoder to the setting in which the decoder has its own side information that is arbitrarily correlated with . It is noteworthy that while the Berger-Tung inner bound is known to be non-tight for more than two encoders, as it is not optimal for the lossless modulo-sum problem of Körner and Marton [46], Theorem 2 shows that it is tight for the case of three encoders if the observation of the third encoder is encoded at large (infinite) rate.

In the case in which the sources and are conditionally independent given , i.e., forms a Markov chain, it can be shown easily that the result of Theorem 2 reduces to the set of rates and distortions that satisfy

(15) | ||||

(16) | ||||

(17) | ||||

(18) |

for some measure of the form .

This result can also be obtained by applying [47, Theorem 6] with the reproduction functions therein chosen as

(19) |

Then, note that with this choice we have

(20) |

## IV Vector Gaussian CEO Problem with Side Information

Consider the -encoder CEO problem shown in Figure 1. In this section, the remote vector source is complex-valued, has -dimensions, and is assumed to be Gaussian with zero mean and covariance matrix . denotes a collection of independent copies of . The agents’ observations are Gaussian noisy versions of the remote vector source, with the observation at agent given by

(21) |

where represents the channel matrix connecting the remote vector source to the -th agent; and is the noise vector at this agent, assumed to be i.i.d. Gaussian with zero-mean and covariance matrix and independent from . The decoder has its own noisy observation of the remote vector source, in the form of a correlated jointly Gaussian side information stream , with

(22) |

where, similar to the above, is the channel matrix connecting the remote vector source to the CEO; and is the noise vector at the CEO, assumed to be Gaussian with zero-mean and covariance matrix and independent from . In this section, it is assumed that the agents’ observations are independent conditionally given the remote vector source and the side information , i.e., for all ,

(23) |

Using (21) and (22), it is easy to see that the assumption (23) is equivalent to the assumption that the noises at the agents are independent conditionally given . Recalling that for a set , the notation designates the collection of noise vectors with indices in the set , in what follows we denote the covariance matrix of as .
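To make the linear Gaussian observation model more tangible, the following sketch considers a single observation of the form "channel matrix times source plus independent noise" and computes the MMSE error covariance of the source given that observation, together with the corresponding conditional differential entropy, which is the smallest average logarithmic loss that an estimator seeing this observation could attain. It is a simplified illustration, not the paper's K-agent setting: the dimensions, the random matrices, the assumption of circularly symmetric complex Gaussian vectors and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
nx, ny = 3, 2  # hypothetical source and observation dimensions


def random_covariance(n):
    """Random Hermitian positive-definite matrix (illustrative only)."""
    a = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return a @ a.conj().T + np.eye(n)


def complex_gaussian_entropy(cov):
    """Differential entropy (nats) of a circularly symmetric complex
    Gaussian vector with covariance cov: log det(pi * e * cov)."""
    _, logdet = np.linalg.slogdet(cov)
    return cov.shape[0] * np.log(np.pi * np.e) + float(np.real(logdet))


sigma_x = random_covariance(nx)   # source covariance
h = rng.standard_normal((ny, nx)) + 1j * rng.standard_normal((ny, nx))
sigma_n = random_covariance(ny)   # noise covariance

# Covariance of the observation y = h x + n, with n independent of x.
sigma_y = h @ sigma_x @ h.conj().T + sigma_n

# MMSE error covariance of x given y.
sigma_x_given_y = sigma_x - sigma_x @ h.conj().T @ np.linalg.solve(
    sigma_y, h @ sigma_x
)

h_x = complex_gaussian_entropy(sigma_x)
h_x_given_y = complex_gaussian_entropy(sigma_x_given_y)
mutual_information = h_x - h_x_given_y

print(f"h(X) = {h_x:.4f}, h(X|Y) = {h_x_given_y:.4f}, "
      f"I(X;Y) = {mutual_information:.4f} nats")
```

As a sanity check, the mutual information computed this way should match the difference between the log-determinants of the observation covariance and the noise covariance.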

### IV-A Rate-Distortion Region

We first state the following proposition which essentially extends the result of Theorem 1 to the case of sources with continuous alphabets.

###### Definition 4.

For given tuple of auxiliary random variables with distribution such that factorizes as

(24) |

define as the set of all non-negative rate-distortion tuples that satisfy, for all subsets ,

(25) |

Also, let where the union is taken over all tuples with distributions that satisfy (24).

###### Definition 5.

For given tuple of auxiliary random variables with distribution such that factorizes as

(26) |

define as the set of all non-negative rate-distortion tuples that satisfy, for all subsets ,

Also, let where the union is taken over all tuples with distributions that satisfy (26).

###### Proposition 2.

The rate-distortion region for the vector Gaussian CEO problem under logarithmic loss is given by

For convenience, we now introduce the following notation which will be instrumental in what follows. Let, for every set , the set . Also, for and given matrices
