## I Introduction

The classical problem of supervised learning is to infer an accurate predictor of a target variable from a measured variable on the basis of labeled training samples independently drawn from an unknown joint distribution. The standard approach for solving this problem in statistical learning theory is empirical risk minimization (ERM). For a given set of allowable predictors and a loss function that quantifies the risk of each predictor, ERM chooses the predictor with minimal risk under the empirical distribution of samples. To avoid overfitting, the set of allowable predictors is restricted to a class with limited complexity.

Recently, an alternative viewpoint has emerged which seeks distributionally robust predictors. Given the labeled training samples, this approach learns a predictor by minimizing its worst-case risk over an ambiguity distribution set centered at the empirical distribution of samples. In other words, instead of restricting the set of allowable predictors, it aims to avoid overfitting by requiring that the learned predictor performs well under any distribution in a chosen neighborhood of the empirical distribution. This minimax approach has been investigated under different assumptions on how the ambiguity set is constructed, e.g., by restricting the moments [1], or by forming $f$-divergence balls [2] or Wasserstein balls [3] (see also references therein).

In these previous works, the learning algorithm finds a predictor that acts directly on a fresh (unlabeled) sample to predict the corresponding target variable. Often, however, the fresh sample may be only remotely available, and when designing the predictor it is desirable to also take into account the cost of communicating the sample. This is motivated by the fact that bandwidth and energy limitations on communication in networks and within multiprocessor systems often impose significant bottlenecks on the performance of algorithms. There is also an increasing number of applications in which data is generated in a distributed manner and the data (or features of it) is communicated over bandwidth-limited links to a central processor to perform inference. For instance, applications such as Google Goggles and Siri process locally collected data in the cloud. It is thus important to study prediction in distributed and rate-constrained settings.

In this paper, we study an extension of the classical learning problem in which, given a finite set of training samples, the learning algorithm needs to infer a descriptor-estimator pair with a desired communication rate between them. This is especially relevant when both the measured variable and the target variable come from a large alphabet or are continuous random variables, as in regression problems, so that neither the sample nor its predicted value can simply be communicated in a lossless fashion. We adopt the minimax framework for learning the descriptor-estimator pair. Given a set of labeled training samples, our goal is to find a descriptor-estimator pair by minimizing its resultant worst-case risk over an ambiguity distribution set, where the risk now incorporates both the statistical risk and the communication cost.

One of the important conclusions that emerge from the minimax approach to supervised learning in [1] is that the problem of finding the predictor with minimal worst-case risk over an ambiguity set can be broken into two smaller steps: (1) find the worst-case distribution in the ambiguity set that maximizes the (generalized) conditional entropy of the target variable given the measured variable, and (2) find the optimal predictor under this worst-case distribution. In this paper, we show that an analogous principle approximately holds for rate-constrained prediction. The descriptor-estimator pair with minimal worst-case risk can be found in two steps: (1) find the worst-case distribution in the ambiguity set that maximizes the risk-information Lagrangian cost, and (2) find the optimal descriptor-estimator pair under this worst-case distribution.

We then apply our results to characterize the optimal descriptor-estimator pairs for two applications: rate-constrained linear regression and rate-constrained classification. While a simple scheme whereby we first find the optimal predictor ignoring the rate constraint, then compress and communicate the predictor output, is optimal for the linear regression application, we show via the classification application that such an estimate-compress approach is not optimal in general. We show that when prediction is rate-constrained, the optimal descriptor aims to send sufficiently (but not necessarily maximally) informative features of the observed variable, which are at the same time easy to communicate.

When applied to the case in which the ambiguity distribution set contains only a single distribution (for example, the true or empirical distribution of the measured and target variables) and the loss function for the prediction is logarithmic loss, our results provide a new one-shot operational interpretation of the information bottleneck problem. A key technical ingredient in our results is the strong functional representation lemma (SFRL) developed in [4], which we use to design the optimal descriptor-estimator pair for the worst-case distribution.

### Notation

We assume that $\log$ is base 2 and the entropy $H$ is in bits. The length of a variable-length description $M \in \{0,1\}^*$ is denoted as $|M|$. For random variables $U, V$, denote the joint distribution by $P_{U,V}$ and the conditional distribution of $U$ given $V$ by $P_{U|V}$. For brevity we denote the distribution of $(X, Y)$ as $P$. We write $I_P(X; \hat{Y})$ for $I(X; \hat{Y})$ when $(X, Y) \sim P$, and $P_{\hat{Y}|X}$ is clear from the context.

## II Problem Formulation

We begin by reviewing the minimax approach to the classical learning problem [1].

### II-A Minimax Approach to Supervised Learning

Let $X \in \mathcal{X}$ and $Y \in \mathcal{Y}$ be jointly distributed random variables. The problem of statistical learning is to design an accurate predictor of a target variable $Y$ from a measured variable $X$ on the basis of a number of independent training samples $\{(X_i, Y_i)\}_{i=1}^n$ drawn from an unknown joint distribution. The standard approach for solving this problem is to use empirical risk minimization (ERM), in which one defines an admissible class of predictors $\mathcal{F}$ that consists of functions $f: \mathcal{X} \to \hat{\mathcal{Y}}$ (where the reconstruction alphabet $\hat{\mathcal{Y}}$ can in general be different from $\mathcal{Y}$) and a loss function $\ell: \hat{\mathcal{Y}} \times \mathcal{Y} \to \mathbb{R}$. The risk associated with a predictor $f$ when the underlying joint distribution of $X$ and $Y$ is $P$ is

$$L(f, P) = \mathrm{E}_P[\ell(f(X), Y)].$$

ERM simply chooses the predictor $f \in \mathcal{F}$ with minimal risk under the empirical distribution of the training samples.

Recently, an alternative approach has emerged which seeks distributionally robust predictors. This approach learns a predictor by minimizing its worst-case risk over an ambiguity distribution set $\Gamma$, i.e.,

$$\min_{f} \max_{P \in \Gamma} L(f, P), \tag{1}$$

where $f$ can be any function and $\Gamma$ can be constructed in various ways, e.g., by restricting the moments, or by forming $f$-divergence balls or Wasserstein balls. While in ERM it is important to restrict the set of admissible predictors to a low-complexity class to prevent overfitting, in the minimax approach overfitting is prevented by explicitly requiring that the chosen predictor be distributionally robust. The learned function $f$ can then be used for predicting $Y$ when presented with fresh samples of $X$. The learning and inference phases are illustrated in Figure 1.

### II-B Minimax Learning for Remote Prediction

In this paper, we extend the minimax learning approach to the setting in which the prediction needs to be performed based on a rate-constrained description of $X$. In particular, given a finite set of training samples $\{(X_i, Y_i)\}_{i=1}^n$ independently drawn from an unknown joint distribution $P$, our goal is to learn a pair of functions $(e, f)$, where $e$ is a descriptor used to compress $X$ into $M = e(X) \in \{0,1\}^*$ (a prefix-free code), and $f$ is an estimator that takes the compression $M$ and generates an estimate $\hat{Y} = f(M)$ of $Y$. See Figure 2.

Let $R(e, P) = \mathrm{E}_P[|e(X)|]$ be the rate of the descriptor $e$ and $L(e, f, P) = \mathrm{E}_P[\ell(f(e(X)), Y)]$ be the risk associated with the descriptor-estimator pair $(e, f)$ when the underlying distribution of $(X, Y)$ is $P$, and define the risk-rate Lagrangian cost (parametrized by $\lambda > 0$) as

$$L_\lambda(e, f, P) = L(e, f, P) + \lambda R(e, P). \tag{2}$$

Note that this cost function takes into account both the resultant statistical prediction risk of $(e, f)$ and the communication rate they require. The task of a minimax learner is to find an $(e, f)$ pair that minimizes the worst-case $L_\lambda$ over the ambiguity distribution set $\Gamma$, i.e.,

$$\min_{(e, f)} \max_{P \in \Gamma} L_\lambda(e, f, P), \tag{3}$$

for an appropriately chosen $\Gamma$ centered at the empirical distribution of the samples. Note that we allow here all possible $(e, f)$ pairs. We also assume that the descriptor and the estimator can use unlimited common randomness $W$ which is independent of the data, i.e., $e$ and $f$ can be expressed as functions of $(X, W)$ and $(M, W)$, respectively, and the prefix-free codebook for $M$ can depend on $W$. The availability of such common randomness can be justified by the fact that in practice, although the inference scheme is one-shot, it is used many times (by the same user and by different users), hence the descriptor and the estimator can share a common randomness seed before communication commences without impacting the communication rate.

## III Main Results

We first consider the case where $\Gamma$ consists of a single distribution $P$, which may be the empirical distribution as in ERM. Define the minimax risk-rate cost as

$$L^*_\lambda(\Gamma) = \inf_{(e, f)} \sup_{P \in \Gamma} L_\lambda(e, f, P). \tag{4}$$

While it is difficult to minimize the risk-rate cost (2) directly, the minimax risk-rate cost can be bounded in terms of the mutual information between $X$ and $\hat{Y}$.

###### Theorem 1.

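Let $\Gamma = \{P\}$. Then

$$\inf_{P_{\hat{Y}|X}} \Big( \mathrm{E}[\ell(\hat{Y}, Y)] + \lambda I(X; \hat{Y}) \Big) \le L^*_\lambda(\Gamma) \le \inf_{P_{\hat{Y}|X}} \Big( \mathrm{E}[\ell(\hat{Y}, Y)] + \lambda \big( I(X; \hat{Y}) + \log(I(X; \hat{Y}) + 1) + 5 \big) \Big).$$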

As in other one-shot compression results (e.g., zero-error compression), there is a gap between the upper and lower bounds. While the logarithmic gap in Theorem 1 is not as small as the 1-bit gap in zero-error compression, it is dominated by the linear mutual-information term when that term is large.

To prove Theorem 1, we use the strong functional representation lemma given in [4] (also see [5, 6]): for any random variables $X, Y$, there exists a random variable $Z$ independent of $X$, such that $Y$ is a function of $(X, Z)$, and

$$H(Y \mid Z) \le I(X; Y) + \log(I(X; Y) + 1) + 4. \tag{5}$$

Here, $Z$ can be intuitively viewed as the part of $Y$ which is not contained in $X$. Note that for any $Z$ such that $Y$ is a function of $(X, Z)$ and $Z$ is independent of $X$, we have $H(Y \mid Z) \ge I(X; Y)$. The statement (5) ensures the existence of a $Z$, independent of $X$, which comes close to this lower bound, and in this sense it is most informative about $Y$. This is critical for the proof of Theorem 1, as we will see next. Identifying the part of $Y$ which is not contained in $X$ allows us to generate and share this part between the descriptor and the estimator ahead of time, eliminating the need to communicate it during the course of inference. To find $Z$, we use the Poisson functional representation construction detailed in [4].
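The Poisson functional representation can be sketched for finite alphabets as follows. This is a minimal illustration under stated assumptions, not the construction as implemented in [4]: the function names `pfr_encode` and `pfr_decode`, the use of a seeded NumPy generator as the shared randomness, and the early-stopping rule are choices made here. The encoder draws candidate symbols from a reference distribution `q` along a rate-1 Poisson process and selects the index that minimizes the arrival time divided by the likelihood ratio `p/q`. The selected symbol then follows the target distribution `p`, and the decoder regenerates it from the index and the shared seed alone.

```python
import numpy as np

def pfr_encode(p, q, rng):
    """Select an index K such that the K-th candidate drawn from q follows p.

    p, q: probability vectors on the same finite alphabet, with p(y) > 0
    only where q(y) > 0. rng plays the role of the shared randomness.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    ratio = np.divide(p, q, out=np.zeros_like(p), where=q > 0)  # dP/dQ
    r_max = ratio.max()
    t = 0.0
    best_val, best_k, best_y = np.inf, None, None
    k = 0
    while True:
        k += 1
        t += rng.exponential(1.0)          # next arrival time of the Poisson process
        if t / r_max >= best_val:          # no later point can beat the current best
            break
        y = int(rng.choice(len(q), p=q))   # candidate symbol drawn from q
        if ratio[y] > 0 and t / ratio[y] < best_val:
            best_val, best_k, best_y = t / ratio[y], k, y
    return best_k, best_y

def pfr_decode(k, q, rng):
    """Replay the shared randomness and return the k-th candidate symbol."""
    q = np.asarray(q, dtype=float)
    y = None
    for _ in range(k):
        rng.exponential(1.0)               # consume the same arrival-time draw
        y = int(rng.choice(len(q), p=q))
    return y

# Usage: both sides seed their generator identically; only k is communicated.
p = [0.7, 0.2, 0.1]          # hypothetical target, playing the role of P_{Y|X=x}
q = [1 / 3, 1 / 3, 1 / 3]    # hypothetical reference, playing the role of P_Y
k, y = pfr_encode(p, q, np.random.default_rng(7))
y_decoded = pfr_decode(k, q, np.random.default_rng(7))
```

In this sketch the shared seed stands in for $Z$, and the index `k` is the only quantity that depends on the observed sample.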

###### Proof:

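Recall that $\Gamma = \{P\}$. The lower bound follows from the fact that $I(X; \hat{Y}) \le H(M \mid W) \le \mathrm{E}[|M|]$, where $W$ is the common randomness. To establish the upper bound, fix any $P_{\hat{Y}|X}$. Let $Z$ be obtained from (5) applied to $(X, \hat{Y})$. Note that $Z$ is independent of $X$ and can be generated from a random seed shared between the descriptor and the estimator ahead of time. For a given $Z = z$, take $M$ to be the Huffman codeword of $\hat{Y}$ according to the distribution $P_{\hat{Y}|Z}(\cdot \mid z)$ (recall that $\hat{Y}$ is a function of $(X, Z)$), and take $f$ to be the decoding function of the Huffman code. The expected codeword length satisfies

$$\mathrm{E}[|M|] \le H(\hat{Y} \mid Z) + 1 \le I(X; \hat{Y}) + \log(I(X; \hat{Y}) + 1) + 5.$$

Taking an infimum over all $P_{\hat{Y}|X}$ completes the proof. ∎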
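The Huffman step in the proof relies on the standard fact that a Huffman code for a distribution with entropy $H$ (in bits) has expected length at least $H$ and less than $H + 1$. The sketch below checks this numerically; the helper names `huffman_lengths` and `entropy_bits` and the example distribution are illustrative assumptions, and at least two symbols are assumed.

```python
import heapq
import math

def huffman_lengths(probs):
    """Return the Huffman codeword length of each symbol (binary code)."""
    # Heap entries: (probability, tie-breaker, symbols in this subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:          # every symbol in a merged subtree gains one bit
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

def entropy_bits(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

probs = [0.4, 0.3, 0.2, 0.1]   # hypothetical distribution of the estimate given z
lengths = huffman_lengths(probs)
avg_len = sum(p * l for p, l in zip(probs, lengths))
print(lengths, avg_len, entropy_bits(probs))
```

Applying this bound conditionally on each value of the shared randomness and averaging gives the "+1" that turns the constant 4 in (5) into the constant in the upper bound.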

###### Remark 1.

If we consider the logarithmic loss $\ell(\hat{y}, y) = \log(1/\hat{y}(y))$, where $\hat{y}$ is a distribution over $\mathcal{Y}$, then the lower bound in Theorem 1 reduces to

$$\inf_{P_{U|X}} \Big( H(Y \mid U) + \lambda I(X; U) \Big),$$

which is the information bottleneck function [7]. Therefore the setting of remote prediction provides an approximate one-shot operational interpretation of the information bottleneck (up to a logarithmic gap). In [8, 9] it was shown that the asymptotic noisy source coding problem also provides an operational interpretation of the information bottleneck. Our operational interpretation, however, is more satisfying since the feature extraction problem originally considered in [7] is by nature one-shot.

We now extend Theorem 1 to the minimax setting.

###### Theorem 2.

Suppose $\Gamma$ is convex. Then

This result is related to minimax noisy source coding [10]. The main difference is that we consider the one-shot expected length instead of the asymptotic rate.

To prove this theorem, we first invoke a minimax result for relative entropy in [11] (which generalizes the redundancy-capacity theorem [12]). Then we apply the following refined version of the strong functional representation lemma that is proved in the proof of Theorem 1 in [4] (also see [5]).

###### Lemma 1.

For any and , there exists random variable , and functions and such that , and

(6) |

We are now ready to prove Theorem 2.

###### Proof:

The lower bound follows from . To prove the upper bound, we fix any , and show that the following risk-rate cost is achievable:

Let

Note that is concave in for fixed since and are linear in . Also is quasiconvex in for fixed since is convex in , and is lower semicontinuous in since is lower semicontinuous with respect to the topology of weak convergence [13], and hence is lower semicontinuous by Fatou’s lemma.

Write for the distribution of when and . Let and be the closure of in the topology of weak convergence. It can be shown using the same arguments as in [11] (on instead of relative entropy, and using Sion’s minimax theorem [14] instead of Lemma 2 in [11]) that if is uniformly tight, then there exists such that

If is not uniformly tight, then by Lemma 4 in [11], , and hence .

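Theorem 2 suggests that we can simplify the analysis of the risk-rate cost (2) by replacing the rate with the mutual information $I(X; \hat{Y})$. Define the *risk-information cost* as

$$\tilde{L}_\lambda(P_{\hat{Y}|X}, P) = \mathrm{E}_P[\ell(\hat{Y}, Y)] + \lambda I(X; \hat{Y}). \tag{7}$$

Theorem 2 implies that the minimax risk-rate cost can be approximated by the *minimax risk-information cost*

$$\tilde{L}^*_\lambda(\Gamma) = \inf_{P_{\hat{Y}|X}} \sup_{P \in \Gamma} \tilde{L}_\lambda(P_{\hat{Y}|X}, P) \tag{8}$$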

within a logarithmic gap. Theorem 2 can also be stated in the following slightly weaker form

The risk-information cost has more desirable properties than the risk-rate cost. For example, it is convex in $P_{\hat{Y}|X}$ for fixed $P$, and concave in $P$ for fixed $P_{\hat{Y}|X}$. This allows us to exchange the infimum and supremum in Theorem 2 by Sion's minimax theorem [14], which gives the following proposition.

###### Proposition 1.

Suppose , and are finite, is convex and closed, and , then

Moreover, there exists attaining the infimum in the left hand side, which also attains the infimum on the right hand side when is fixed to , the distribution that attains the supremum on the right hand side.

Proposition 1 means that in order to design a robust descriptor-estimator pair that works for any $P \in \Gamma$, we only need to design it according to the worst-case distribution, as follows.

Principle of maximum risk-information cost: Given a convex and closed , we design the descriptor-estimator pair based on the worst-case distribution

We then find that minimizes and design the descriptor-estimator pair accordingly, e.g. using Lemma 1 on and the induced distribution from and .
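For finite alphabets, the risk-information cost in (7) can be evaluated directly from a joint distribution, a test channel, and a loss matrix. The sketch below is illustrative only: the function name `risk_information_cost`, the array layout, and the toy binary example are assumptions introduced here, and it evaluates the cost at a single distribution rather than solving the minimax problem (8).

```python
import numpy as np

def risk_information_cost(p_xy, p_yhat_given_x, loss, lam):
    """Return E[loss(Yhat, Y)] + lam * I(X; Yhat), with mutual information in bits.

    Assumes the Markov chain Yhat - X - Y.
    p_xy:            array of shape (|X|, |Y|), joint distribution of (X, Y)
    p_yhat_given_x:  array of shape (|X|, |Yhat|), rows sum to 1
    loss:            array of shape (|Yhat|, |Y|)
    """
    p_xy = np.asarray(p_xy, dtype=float)
    channel = np.asarray(p_yhat_given_x, dtype=float)
    loss = np.asarray(loss, dtype=float)

    # Risk: sum over x, y, yhat of p(x, y) p(yhat | x) loss(yhat, y).
    risk = np.einsum("xy,xh,hy->", p_xy, channel, loss)

    # Mutual information I(X; Yhat) from the joint distribution of (X, Yhat).
    p_x = p_xy.sum(axis=1)
    p_xh = p_x[:, None] * channel
    p_h = p_xh.sum(axis=0)
    product = p_x[:, None] * p_h[None, :]
    mask = p_xh > 0
    mutual_info = np.sum(p_xh[mask] * np.log2(p_xh[mask] / product[mask]))

    return float(risk + lam * mutual_info)

# Toy example: Y = X uniform on {0, 1}, 0-1 loss.
p_xy = np.array([[0.5, 0.0], [0.0, 0.5]])
zero_one = np.array([[0.0, 1.0], [1.0, 0.0]])
reveal_x = np.eye(2)                            # Yhat = X: zero risk, one bit
ignore_x = np.array([[1.0, 0.0], [1.0, 0.0]])   # Yhat = 0: risk 1/2, zero bits
cost_reveal = risk_information_cost(p_xy, reveal_x, zero_one, lam=0.25)
cost_ignore = risk_information_cost(p_xy, ignore_x, zero_one, lam=0.25)
```

In this toy example the two channels sit at opposite ends of the tradeoff, and which one has the lower cost depends on whether $\lambda$ is below or above $1/2$.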

## IV Applications

### IV-A Rate-constrained Minimax Linear Regression

Suppose , , is the mean-squared loss, and we observe the data . Take to be the set of distributions with the same first and second moments as given by the empirical distribution, i.e.,

(9) |

where are the corresponding statistics of the empirical distribution. The following proposition shows that is Gaussian.

###### Proposition 2 (Linear regression with rate constraint).

Note that this setting does not satisfy the conditions in Proposition 1. We directly analyze (8) to obtain the optimal . Given the optimal , Theorem 2 and Lemma 1 can be used to construct the scheme. Operationally, is a random quantizer of such that the quantization noise follows . With this natural choice of the ambiguity set, our formulation recovers a compressed version of the familiar MMSE estimator.

Figure 3 plots the tradeoff between the rate and the risk when , , , for the scheme constructed using the Poisson functional representation in [4], with the lower bound given by the minimax risk-information cost , and the upper bound given in Theorem 2.

###### Proof:

The optimal scheme in the above example corresponds to compressing and communicating the output of the minimax optimal rate-unconstrained predictor, since the optimal $\hat{Y}$ can be obtained from the predictor output by shifting, scaling and adding noise. This estimate-compress approach can be thought of as a *separation* scheme, since we first optimally estimate $Y$, then optimally communicate the estimate while satisfying the rate constraint. In the next application, we show that such separation is not optimal in general.
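The estimation step of such a separation scheme depends on the data only through first and second moments. The sketch below computes the familiar linear MMSE estimator from empirical moments, under the assumption (standard for moment-constrained ambiguity sets, but not restated in this section) that this is the rate-unconstrained predictor in question. The function name `linear_mmse_from_moments` and the synthetic data are hypothetical, and the random quantization and coding steps are not covered.

```python
import numpy as np

def linear_mmse_from_moments(X, y):
    """Fit the linear MMSE estimator using only empirical first and second moments.

    X: array of shape (n, d); y: array of shape (n,).
    Returns the weight vector and a prediction function.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    mu_x = X.mean(axis=0)
    mu_y = y.mean()
    Xc = X - mu_x
    yc = y - mu_y
    c_xx = Xc.T @ Xc / len(X)          # empirical covariance of X
    c_xy = Xc.T @ yc / len(X)          # empirical cross-covariance of X and Y
    w = np.linalg.solve(c_xx, c_xy)    # assumes c_xx is invertible

    def predict(x):
        return mu_y + (np.asarray(x, dtype=float) - mu_x) @ w

    return w, predict

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + 3.0 + 0.1 * rng.normal(size=200)
w, predict = linear_mmse_from_moments(X, y)
y_hat = predict([1.0, 1.0])
```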

### IV-B Rate-constrained Minimax Classification

We assume and are finite, , and is closed and convex. The following proposition gives the minimax risk-information cost and the optimal estimator.

###### Proposition 3.

Consider the setting described above. The minimax risk-information cost is given by

the worst-case distribution is the one attaining the supremum, and the optimal estimator is given by , where attains the infimum (when ), and is obtained from .

In particular, if is symmetric for different values of (i.e., for any , there exists permutation of , of such that and ),

We can see that when $\lambda \to 0$, the optimal estimator tends to the maximum a posteriori estimator (under the worst-case distribution when $\lambda = 0$).

###### Proof:

Assume is closed and convex. By Proposition 1, the minimax rate-information cost is , where

where (a) is due to that relative entropy is nonnegative, and equality is attained when .

Next we consider the case in which $\Gamma$ is symmetric. Consider the minimax risk-information cost

For any , let be the permutation over such that and let be the corresponding permutation over in the symmetry assumption. Since the function

is convex and symmetric about and (i.e., ), to find its infimum, we only need to consider ’s satisfying for all (if not, we can instead consider the average of for from 1 up to the product of the periods of and , which gives a value of the function not larger than that of ). For brevity we say is symmetric if it satisfies this condition.

Fix any symmetric . Since the function

is concave and symmetric about and (i.e., ), to find its supremum, we only need to consider symmetric ’s. Hence,