Functional Epsilon Entropy

10/18/2019 ∙ by Sourya Basu, et al.

We consider the problem of coding for computing with maximal distortion, where the sender communicates with a receiver that has its own private data and wants to compute a function of their combined data with some fidelity constraint known to both agents. We show that the minimum rate for this problem is equal to the conditional entropy of a hypergraph, and we design practical codes for the problem. Further, the minimum rate of this problem may be a discontinuous function of the fidelity constraint. We also consider the case when the exact function is not known to the sender, but some approximate function, or a class to which the function belongs, is known, and provide efficient achievable schemes.


1 Introduction

Consider the problem illustrated in Fig. 1, where the encoder observes $X$ and the decoder observes $Y$, and both the encoder and decoder want the decoder to compute $f(X, Y)$ with a fidelity criterion $\epsilon$ for a given function $f: \mathcal{X} \times \mathcal{Y} \to \mathcal{Z}$ known to both encoder and decoder, where $\mathcal{X}, \mathcal{Y}, \mathcal{Z}$ are all finite sets and $\mathcal{Z} \subset \mathbb{R}^d$ for some finite natural number $d$. The objective is to find the minimum number of bits the encoder must send such that the decoder can compute $f(X, Y)$ with fidelity $\epsilon$, i.e., if $\hat{Z}$ is the estimate of $f(X, Y)$ obtained by the decoder, then the following should hold:

$$\left\| \hat{Z} - f(X, Y) \right\| \le \epsilon, \qquad (1)$$

where $\|\cdot\|$ is the Euclidean distance between $\hat{Z}$ and $f(X, Y)$. We assume the function is to be computed for $n$ independent instances of $(X, Y)$ for large $n$.

(This work was funded in part by the IBM-Illinois Center for Cognitive Computing Systems Research (C3SR), a research collaboration as part of the IBM AI Horizons Network; and in part by grant number 2018-182794 from the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation.)

Figure 1: Coding for computing with side information.

Orlitsky and Roche gave a single-letter characterization of the problem for $\epsilon = 0$ [1]. Function computation with a distortion constraint has been considered in [2, 3], which provide an efficient graph-based encoding scheme using a generalized version of the characteristic graph in [1] called the $D$-characteristic graph. The coding scheme provided there is suboptimal, since the construction of the graph is based only on pairwise comparison of function values. Drawing on better geometric methods of comparison, we introduce a novel generalization of the characteristic graph from [1] that yields a hypergraph-based encoding scheme that is indeed optimal. Thus, we provide an alternate (but fully equivalent) description of the rate-distortion function, which further inspires practically implementable codes. We refer to the optimal rate for this problem as functional $\epsilon$-entropy, since this rate reduces to the $\epsilon$-entropy defined in [4] when there is no side information and the hyperedges of the constructed characteristic hypergraph partition the support set of $X$. Unlike traditional rate-distortion problems, where the rate-distortion function is a continuous function of the constraint on the expected distortion, in this case we have a rate function that is a discontinuous function of $\epsilon$, and the points of discontinuity can be determined from the characteristic hypergraph for different values of $\epsilon$.

We also show that some of the assumptions in [1] that lead to simple coding schemes for $\epsilon = 0$ might not imply the same when $\epsilon > 0$. Although [2, 3] show that modular schemes, i.e., graph-based quantization followed by source coding, are optimal for the problem under some assumptions, the solution provided there is NP-hard, and an approximate solution is used for coding. Further, the assumptions in [2, 3, 1] that lead to optimal modular schemes only involve the source distributions; one can weaken the assumptions that imply optimal modular schemes by considering both the distribution of the source and the function. Although there does not seem to be a simple and elegant dichotomy of functions and sources analogous to Han and Kobayashi's dichotomy of functions [5] under which modular schemes are optimal, we provide a simple and general class of function-source pairs for which modular schemes are optimal.

After providing the rate-distortion function and an optimal hypergraph-based encoding scheme, we provide algorithms and conditions under which practical coding schemes using randomized quantization and polar codes [6] are optimal.

Sec. 2 describes preliminary results and the problem model. Sec. 3 gives the main result of this paper, the coding theorem and its equivalence to the conditional entropy of a hypergraph. Sec. 4 develops practical coding schemes for the problem. Sec. 5 shows that the functional $\epsilon$-entropy may be discontinuous in $\epsilon$, and Sec. 6 concludes the paper.

2 Preliminaries and problem setting

In this section, first we provide some definitions and discuss some preliminary results; then we formally define the problem.

2.1 Preliminaries and notations

For a set of points $S \subset \mathbb{R}^d$, for some finite $d$, the smallest circle (sphere) enclosing these points is called the smallest enclosing circle of $S$ [7]. Note that the computational complexity of finding the smallest enclosing circle for a set of points is linear in the number of points [8].
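For small instances the smallest enclosing circle is easy to compute directly; below is a minimal sketch of the Welzl-style incremental computation (the function names are ours, and the restriction to $\mathbb{R}^2$ is an assumption made for brevity; the references above treat the general case):

```python
import random
from math import dist

def _circle_two(a, b):
    # circle with the segment ab as diameter
    return ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2), dist(a, b) / 2

def _circle_three(a, b, c):
    # circumcircle via the perpendicular-bisector formulas
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:  # (near-)collinear points: the widest pair suffices
        return max((_circle_two(p, q) for p, q in [(a, b), (a, c), (b, c)]),
                   key=lambda circ: circ[1])
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    return (ux, uy), dist((ux, uy), a)

def _inside(p, circle, tol=1e-9):
    center, r = circle
    return dist(p, center) <= r + tol

def smallest_enclosing_circle(points):
    """Incremental (Welzl-style) smallest enclosing circle of a non-empty
    point set; expected linear time after the shuffle. Returns (center, r)."""
    pts = [tuple(p) for p in points]
    random.shuffle(pts)
    circle = (pts[0], 0.0)
    for i, p in enumerate(pts):
        if not _inside(p, circle):
            circle = (p, 0.0)
            for j, q in enumerate(pts[:i]):
                if not _inside(q, circle):
                    circle = _circle_two(p, q)
                    for s in pts[:j]:
                        if not _inside(s, circle):
                            circle = _circle_three(p, q, s)
    return circle
```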

Definition 1.

A function $f$ is $\gamma$-Lipschitz continuous if $\|f(x_1) - f(x_2)\| \le \gamma \|x_1 - x_2\|$ for all $x_1, x_2$ in its domain, for some $\gamma > 0$, where $\|\cdot\|$ is the Euclidean norm.

Definition 2.

A function $\tilde{f}$ is a $\delta$-approximation to a function $f$ if $\|\tilde{f}(x) - f(x)\| \le \delta$ for every $x$, where $\|\cdot\|$ is the Euclidean norm.

A hypergraph $G = (V, E)$ is a pair where $V$ is the set of vertices of $G$ and $E \subseteq \mathcal{P}(V) \setminus \{\emptyset\}$ is the set of hyperedges of $G$, where $\mathcal{P}(V)$ is the powerset of $V$ [9].

Definition 3.

A hyperedge $w \in E$ is called a maximal hyperedge if $w$ is not a proper subset of any other hyperedge in the hypergraph $G$.

Let $(X_i, Y_i)$, $i = 1, \dots, n$, be i.i.d. random variables with joint probability mass function $p(x, y)$, and suppose $\mathcal{X} = \{0, 1\}$ for simplicity. Then, the source Bhattacharyya parameter [10] for the source $X$ with side information $Y$ is defined as $Z(X|Y) = 2 \sum_{y \in \mathcal{Y}} \sqrt{p(0, y)\, p(1, y)}$.

Theorem 1 ([11]).

For any $\beta < 1/2$, i.i.d. random variables $(X_i, Y_i)$, $i = 1, \dots, n$, and $U^n = X^n G_n$,

$$\lim_{n \to \infty} \frac{1}{n} \left| \left\{ i : Z\left( U_i \mid U^{i-1}, Y^n \right) \ge 1 - 2^{-n^\beta} \right\} \right| = H(X|Y),$$

$$\lim_{n \to \infty} \frac{1}{n} \left| \left\{ i : Z\left( U_i \mid U^{i-1}, Y^n \right) \le 2^{-n^\beta} \right\} \right| = 1 - H(X|Y),$$

where $G_n$ is the generator matrix for polar codes.

We will use polar coding to build practical coding techniques for our problem. Thm. 1 can be extended to any finite alphabet $\mathcal{X}$ using results from [12].
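As a quick illustration, the source Bhattacharyya parameter can be computed directly from a joint pmf; the dictionary-based representation below is our own convention for this sketch, not notation from [10]:

```python
import math

def source_bhattacharyya(pxy):
    """Z(X|Y) = 2 * sum_y sqrt(p(0, y) * p(1, y)) for binary X.
    pxy maps pairs (x, y), with x in {0, 1}, to joint probabilities."""
    ys = {y for (_, y) in pxy}
    return 2 * sum(math.sqrt(pxy.get((0, y), 0.0) * pxy.get((1, y), 0.0))
                   for y in ys)

# X = Y (perfect side information) gives Z = 0; X uniform and independent
# of Y gives Z = 1, matching the usual extremes.
print(source_bhattacharyya({(0, 0): 0.5, (1, 1): 0.5}))        # -> 0.0
print(source_bhattacharyya({(0, 0): 0.25, (1, 0): 0.25,
                            (0, 1): 0.25, (1, 1): 0.25}))      # -> 1.0
```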

2.2 Problem setting

Let $(X_i, Y_i)$, $i = 1, \dots, n$, be i.i.d. random variables, where $X_i \in \mathcal{X}$, $Y_i \in \mathcal{Y}$. The encoder in Fig. 1 observes $X^n$, the decoder observes $Y^n$, and both the encoder and decoder want the decoder to reconstruct $f(X_i, Y_i)$ as $\hat{Z}_i$, $i = 1, \dots, n$, satisfying (1) with probability approaching one as $n \to \infty$, for some fixed fidelity constraint $\epsilon \ge 0$. For any $n$, we define a $(2^{nR}, n)$ code for any fixed function $f$ as an encoding function $e: \mathcal{X}^n \to \{1, \dots, 2^{nR}\}$ and a decoding function $g: \{1, \dots, 2^{nR}\} \times \mathcal{Y}^n \to \hat{\mathcal{Z}}^n$, where $\hat{\mathcal{Z}}$ is the reconstruction set. The probability of error is

$$P_e^{(n)} = \frac{1}{n} \sum_{i=1}^{n} \Pr\left( \left\| \hat{Z}_i - f(X_i, Y_i) \right\| > \epsilon \right),$$

that is, $P_e^{(n)}$ is the average symbol-error probability. A rate $R$ is achievable if there exists a sequence of $(2^{nR}, n)$ codes such that $P_e^{(n)} \to 0$ as $n \to \infty$. The goal is to find the minimum achievable value of $R$ and to design practical codes that attain it.

3 Coding theorem and hypergraph-based coding scheme

In this section we first provide the rate-distortion function for the problem described in Sec. 2, and then provide a hypergraph-based coding scheme that achieves the rate-distortion function.

Theorem 2.

Let $(X_i, Y_i)$, $i = 1, \dots, n$, be i.i.d. with joint pmf $p(x, y)$. The encoder and decoder observe $X^n$ and $Y^n$ respectively. The decoder estimates the function $f(X_i, Y_i)$ as $\hat{Z}_i$ such that, for some fixed $\epsilon \ge 0$, $P_e^{(n)} \to 0$ as $n \to \infty$. Then the minimum rate required by the encoder is

$$R(\epsilon) = \min_{p(w|x):\ W - X - Y} I(W; X \mid Y) \qquad (2)$$

such that there exists a function $g$ with $\mathbb{E}\left[ d_\epsilon\big( g(W, Y), f(X, Y) \big) \right] = 0$, where the distortion function $d_\epsilon$ is defined as

$$d_\epsilon(\hat{z}, z) = \mathbb{1}\left\{ \|\hat{z} - z\| > \epsilon \right\}.$$

The proof of this theorem is direct and follows from [1, Eq. (4)] when the distortion is set to zero under the distortion function $d_\epsilon$.

We define the $\epsilon$-characteristic hypergraph, $G_\epsilon$, of a random variable $X$ with respect to a possibly correlated random variable $Y$, a function $f(X, Y)$, and a fidelity constraint $\epsilon$.

Definition 4.

The vertex set of the $\epsilon$-characteristic hypergraph $G_\epsilon$ is $\mathcal{X}$. For any non-empty subset $w \subseteq \mathcal{X}$ and $y \in \mathcal{Y}$, let $w_y = \{x \in w : p(x, y) > 0\}$. Then $w$ is a hyperedge in $G_\epsilon$ if and only if the radius of the smallest enclosing circle containing the set of points $\{f(x, y) : x \in w_y\}$ is less than or equal to $\epsilon$ for all $y \in \mathcal{Y}$.
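A brute-force reading of Def. 4 (exponential in $|\mathcal{X}|$, so only for small alphabets; `smallest_enclosing_circle` is the sketch from Sec. 2.1, and planar function values are again an assumption of the sketch):

```python
from itertools import combinations

def epsilon_hyperedges(X, Y, p, f, eps):
    """Maximal hyperedges of the epsilon-characteristic hypergraph.
    X, Y: alphabets; p[(x, y)]: joint pmf; f(x, y): a point in R^2."""
    edges = []
    for r in range(1, len(X) + 1):
        for w in combinations(X, r):
            ok = True
            for y in Y:
                pts = [f(x, y) for x in w if p.get((x, y), 0) > 0]  # w_y
                if len(pts) > 1:
                    _, radius = smallest_enclosing_circle(pts)
                    if radius > eps:
                        ok = False
                        break
            if ok:
                edges.append(frozenset(w))
    # keep only the maximal hyperedges (Def. 3)
    return [w for w in edges if not any(w < v for v in edges)]
```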

Note that for $\epsilon = 0$ the hypergraph in Def. 4 reduces to the characteristic graph defined in [1, 13], with hyperedges in place of independent sets. Now we define the hypergraph entropy of a characteristic hypergraph $G_\epsilon$. Let $E(G_\epsilon)$ be the set of hyperedges of $G_\epsilon$; when it is clear from context, we denote $G_\epsilon$ by $G$ and write $E(G_\epsilon)$ simply as $E$. We define the functional $\epsilon$-entropy, which is a generalization of the $\epsilon$-entropy proposed in [4].

Definition 5.

The functional $\epsilon$-entropy, $H_{G_\epsilon}(X|Y)$, is defined as

$$H_{G_\epsilon}(X|Y) = \min_{\substack{W - X - Y \\ X \in W \in E(G_\epsilon)}} I(W; X \mid Y), \qquad (3)$$

where $X$ induces a probability distribution over the vertices of the hypergraph $G_\epsilon$. The random variable $W$ is obtained by defining transition probabilities $p(w|x)$ over all hyperedges $w$ that contain $x$, i.e., $p(w|x) \ge 0$ for all $w \in E(G_\epsilon)$ with $x \in w$, and $\sum_{w : x \in w} p(w|x) = 1$.

Note that the minimization over $W$ in (3) can be restricted to $E^*(G_\epsilon)$, the set of maximal hyperedges, by the data processing inequality. Now we show that the optimal rate satisfies $R(\epsilon) = H_{G_\epsilon}(X|Y)$.
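The minimization in (3) is convex in $p(w|x)$. In the special case with no side information it reduces to minimizing $I(W; X)$, which admits a simple alternating-minimization (Blahut–Arimoto-style) iteration; the sketch below is our own illustration of that special case, not an algorithm from the paper:

```python
import math

def functional_eps_entropy(px, edges, iters=2000):
    """min I(W; X) over p(w|x) supported on the hyperedges w containing x
    (no side information). px: pmf of X; edges: list of frozensets."""
    X = list(px)
    allowed = {x: [w for w in edges if x in w] for x in X}
    pwx = {x: {w: 1 / len(allowed[x]) for w in allowed[x]} for x in X}
    for _ in range(iters):
        # optimal marginal q(w) for the current conditional p(w|x)
        q = {w: sum(px[x] * pwx[x].get(w, 0.0) for x in X) for w in edges}
        # optimal conditional for the current marginal: p(w|x) proportional
        # to q(w) on the hyperedges containing x
        for x in X:
            z = sum(q[w] for w in allowed[x])
            pwx[x] = {w: q[w] / z for w in allowed[x]}
    q = {w: sum(px[x] * pwx[x].get(w, 0.0) for x in X) for w in edges}
    return sum(px[x] * pwx[x][w] * math.log2(pwx[x][w] / q[w])
               for x in X for w in allowed[x] if pwx[x][w] > 0)
```

Each half-step minimizes the same jointly convex objective, so the iterate converges to the minimum of $I(W; X)$ over the constrained conditionals.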

Theorem 3.

Let $R(\epsilon)$ and $H_{G_\epsilon}(X|Y)$ be as defined in (2) and (3) respectively for some $\epsilon \ge 0$. Then $R(\epsilon) = H_{G_\epsilon}(X|Y)$.

Proof.

From the definitions of $R(\epsilon)$ and $H_{G_\epsilon}(X|Y)$, we need to prove that the minimizations in (2) and (3) attain the same value.

First we show the left side is less than or equal to the right side. If $X \in W \in E(G_\epsilon)$ with $W - X - Y$, then we can find a (partial) function $g$ over $E(G_\epsilon) \times \mathcal{Y}$ such that $d_\epsilon(g(w, y), f(x, y)) = 0$ whenever $p(x, w, y) > 0$, and thus $\mathbb{E}[d_\epsilon(g(W, Y), f(X, Y))] = 0$, implying $R(\epsilon) \le H_{G_\epsilon}(X|Y)$.

Let $w \in E(G_\epsilon)$ and $y \in \mathcal{Y}$. If $p(x, y) = 0$ for all $x \in w$, then we can leave $g(w, y)$ undefined, since it will not affect our expected distortion. Otherwise, form the set $w_y$, which consists of all $x \in w$ such that $p(x, y) > 0$, and define $g(w, y)$ as the center of the smallest enclosing circle of the set $\{f(x, y) : x \in w_y\}$. Then, by Def. 4, $\|g(w, y) - f(x, y)\| \le \epsilon$ for all $x \in w_y$. Hence, whenever $p(x, w, y) > 0$, we have $d_\epsilon(g(w, y), f(x, y)) = 0$. This shows that the left side is less than or equal to the right side.

Next we show that the right-hand side is less than or equal to the left-hand side, completing the proof. Suppose $W' - X - Y$ and there exists a $g$ such that $\mathbb{E}[d_\epsilon(g(W', Y), f(X, Y))] = 0$. We define $W$ such that $X \in W \in E(G_\epsilon)$ and show that $I(W; X|Y) \le I(W'; X|Y)$ for this definition. Let $p(w', x, y)$ be the probability distribution underlying $(W', X, Y)$. Set

$$w(w') = \{x \in \mathcal{X} : p(w', x) > 0\}, \qquad (4)$$

and define the Markov chain $W - W' - X - Y$ by $W = w(W')$.

We first show that $X \in W$. If $p(w, x) > 0$, this implies there is a $w'$ such that $w = w(w')$ and $p(w', x) > 0$. Then, by (4) we have $x \in w$. Thus, we have $X \in W$ whenever $p(W, X) > 0$. Next we show that whenever $p(w') > 0$, the radius of the smallest enclosing circle of the set $\{f(x, y) : x \in w(w'),\ p(x, y) > 0\}$ is less than or equal to $\epsilon$ for every $y$, which further implies $w(w') \in E(G_\epsilon)$. From (4), $x \in w(w')$ implies $p(w', x) > 0$. Further, if $p(x, y) > 0$ then, since $W' - X - Y$ forms a Markov chain, it follows that $p(w', x, y) > 0$. Note that $\mathbb{E}[d_\epsilon(g(W', Y), f(X, Y))] = 0$; hence we must have $\|g(w', y) - f(x, y)\| \le \epsilon$ whenever $p(w', x, y) > 0$. Thus, it follows that the circle centered at $g(w', y)$ of radius $\epsilon$ encloses all the points in the set $\{f(x, y) : x \in w(w'),\ p(x, y) > 0\}$. Hence, the smallest enclosing circle of this set has radius less than or equal to $\epsilon$, which implies $w(w') \in E(G_\epsilon)$.

It remains to show that $W - X - Y$ forms a Markov chain and that $I(W; X|Y) \le I(W'; X|Y)$. The proof of this follows directly from the proof of [1, Thm. 2]. ∎

Note that setting $\epsilon = 0$ gives us the rate-distortion function in [1].

4 Towards practical coding schemes

In [1] it was shown that if $p(x, y) > 0$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$, then every vertex in the characteristic hypergraph $G_0$ belongs to exactly one maximal hyperedge of $G_0$. This property led to the design of optimal modular schemes in [2, 3] under some assumptions introduced therein. However, those assumptions depend solely on the sources and not on the function considered. In this section, we introduce a function/source condition and show that whenever it holds we obtain non-overlapping clustering of vertices, leading to a modular scheme that can be implemented in $O(n \log n)$ time, where $n$ is the blocklength. Then, by giving a counterexample, we show that for $\epsilon > 0$, $p(x, y) > 0$ for all $(x, y)$ does not imply non-overlapping clustering of vertices in $G_\epsilon$, in contrast to the $\epsilon = 0$ case, where this condition on the source does imply non-overlapping clustering.

Condition 1.

For any and , if , then either or .

Note that Cond. 1 encompasses the cases $p(x, y) > 0$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$ and $p(x, y) > 0$ for all $(x, y) \in \mathcal{S}_X \times \mathcal{S}_Y$, where $\mathcal{S}_X$ and $\mathcal{S}_Y$ are the support sets of $X$ and $Y$ respectively. The main idea is that whenever Cond. 1 holds, each $x \in \mathcal{X}$ belongs to a unique maximal hyperedge in $G_\epsilon$; hence quantization followed by entropy coding attains the optimal rate. The following example illustrates Cond. 1, where each vertex belongs to exactly one maximal hyperedge even though $p(x, y) = 0$ for some $(x, y)$.

Example 1.

Consider the random variables $(X, Y)$ distributed as in Fig. 2(a), and let $f$ in Fig. 2(b) be the corresponding function, where $\mathcal{X} = \mathcal{Y} = \{1, 2, 3\}$. The hypergraph formed in this case is shown in Fig. 2(c).

Figure 2: (a) Probability distribution $p(x, y)$, which has $p(x, y) = 0$ for some pairs $(x, y)$. (b) Function $f(x, y)$, with rows indexed by $x$ and columns by $y$:

         y=1  y=2  y=3
   x=1    1    1    1
   x=2    1    0    1
   x=3    1    0    1

(c) Corresponding hypergraph $G_\epsilon$.

4.1 Modular schemes

We show that whenever Cond. 1 holds, any $x \in \mathcal{X}$ belongs to exactly one maximal hyperedge in $G_\epsilon$.

Theorem 4.

If Cond. 1 holds, then for any $x \in \mathcal{X}$, if $x \in w_1$ and $x \in w_2$ for maximal hyperedges $w_1, w_2 \in E(G_\epsilon)$, then $w_1 = w_2$.

Proof.

Without loss of generality, assume $x \in w_1 \cap w_2$. We know from Def. 4 that for any $w \subseteq \mathcal{X}$, $w \in E(G_\epsilon)$ if and only if, for all $y \in \mathcal{Y}$, the radius of the smallest enclosing circle of $\{f(x', y) : x' \in w_y\}$ is at most $\epsilon$. If $w_1$ is a singleton set then we are done, since this implies $w_1 \subseteq w_2$, and since $w_1, w_2$ are maximal sets, we have $w_1 = w_2$. Now take the case when neither is a singleton set, and assume that $w_1 \ne w_2$; then there is an $x_2 \in w_2$ with $x_2 \notin w_1$. For any $x_1 \in w_1$ and any $y \in \mathcal{Y}$, the case analysis provided by Cond. 1 guarantees that the points $\{f(x', y) : x' \in (w_1 \cup w_2)_y\}$ lie within a circle of radius $\epsilon$, which implies that $x_1$ and $x_2$ belong to the same hyperedge in $G_\epsilon$ by Def. 4. Thus, $w_1 \cup w_2$ is contained in a single maximal hyperedge; but since $w_1$ and $w_2$ are themselves maximal sets, this implies $w_1 = w_2$. ∎

Thm. 4 implies that whenever Cond. 1 holds, each $x \in \mathcal{X}$ belongs to exactly one maximal hyperedge in $G_\epsilon$. Thus, hypergraph-based coding implies that the following quantization–entropy-coding scheme attains the optimal rate: given any $x$, encode it using the unique maximal hyperedge it belongs to, and then use Slepian–Wolf coding to achieve the rate-distortion function. This scheme can be implemented in $O(n \log n)$ time, since quantization can be performed in constant time per symbol and Slepian–Wolf coding can be implemented in $O(n \log n)$ time using polar codes [10], where $n$ is the blocklength.

Next consider the case when $\epsilon > 0$. Unlike the case $\epsilon = 0$, when $\epsilon > 0$, even when $p(x, y) > 0$ for all $(x, y)$ we might not have non-overlapping clustering, i.e., we can have vertices belonging to more than one maximal hyperedge.

Figure 3: The hypergraph consisting of vertex set $\{x_1, x_2, x_3\}$ and maximal hyperedges $w_1 = \{x_1, x_2\}$ and $w_2 = \{x_2, x_3\}$.
Example 2.

Let $X$ and $Y$ be independent uniform random variables defined on the support sets $\mathcal{X} = \{x_1, x_2, x_3\}$ and $\mathcal{Y}$ respectively, and let the function $f$ and the fidelity constraint $\epsilon > 0$ be such that the characteristic hypergraph $G_\epsilon$ is as shown in Fig. 3. The hypergraph consists of three vertices and two maximal hyperedges, $w_1 = \{x_1, x_2\}$ and $w_2 = \{x_2, x_3\}$. The smallest enclosing circle of the set of points $\{f(x_1, y), f(x_2, y)\}$ has radius at most $\epsilon$; hence $w_1$ forms a hyperedge of $G_\epsilon$. Similarly, $w_2$ forms a hyperedge, but the smallest enclosing circle of $\{f(x_1, y), f(x_2, y), f(x_3, y)\}$ has radius greater than $\epsilon$, and hence $\{x_1, x_2, x_3\}$ does not form a hyperedge. Thus, we see that even though $p(x, y) > 0$ for all $(x, y) \in \mathcal{S}_X \times \mathcal{S}_Y$, $x_2$ belongs to two different maximal hyperedges, $w_1$ and $w_2$.
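The overlap is easy to reproduce numerically with the sketches from the earlier sections; the concrete values below ($\epsilon = 1$ and collinear function values) are illustrative assumptions consistent with the description, not the original numbers:

```python
# Hypothetical instance: X = {1, 2, 3}, trivial side information,
# f(1) = (0, 0), f(2) = (2, 0), f(3) = (4, 0), eps = 1.
X, Y = [1, 2, 3], ['y0']
p = {(x, 'y0'): 1 / 3 for x in X}               # X uniform, Y degenerate
f = lambda x, y: {1: (0.0, 0.0), 2: (2.0, 0.0), 3: (4.0, 0.0)}[x]
print(epsilon_hyperedges(X, Y, p, f, eps=1.0))
# -> [frozenset({1, 2}), frozenset({2, 3})]: vertex 2 lies in both
#    maximal hyperedges, so the clustering overlaps.
```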

4.2 Partially known functions

Suppose there is no side information available at the decoder and the function $f$ is unknown to the encoder, but it is known that $f$ is a $\gamma$-Lipschitz continuous function. Then, as a corollary of Thm. 3, we have the following result, which may be of interest in several applications where the actual function is unknown or requires more computational resources than are available at the encoder. For instance, if a good linear approximation to a computationally heavy function is available to the encoder, the encoder might use the simpler function rather than the actual one.

Corollary 1.

Let $f$ be a $\gamma$-Lipschitz continuous function. Then $R(\epsilon)$ can be upper-bounded as $R(\epsilon) \le H_{G_{\epsilon/\gamma}}(X)$, where $G_{\epsilon/\gamma}$ is constructed with respect to the random variable $X$ and the identity function, and hence the upper bound is achievable by the encoder even when $f$ is unknown.

Proof.

The proof follows from Thm. 3 and the properties of $\gamma$-Lipschitz continuous functions. The main idea is that if a set of points has a smallest enclosing circle of radius $\epsilon/\gamma$, then the set of their images under $f$ has a smallest enclosing circle of radius at most $\epsilon$. ∎
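A one-line check of that claim, under the assumption that $f$ is extended $\gamma$-Lipschitz continuously to the circle's center $c$ (always possible for Euclidean targets, by Kirszbraun's theorem): for every point $x$ in the circle of radius $\epsilon/\gamma$ centered at $c$,

$$\|f(x) - f(c)\| \;\le\; \gamma \, \|x - c\| \;\le\; \gamma \cdot \frac{\epsilon}{\gamma} \;=\; \epsilon,$$

so the circle of radius $\epsilon$ centered at $f(c)$ encloses every image point.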

Now, consider the case where the encoder cannot compute the exact function $f$ but computes $\tilde{f}$, which is a $\delta$-approximation to $f$ as defined in Def. 2.

Corollary 2.

Let $\tilde{f}$ be a $\delta$-approximation to $f$. If the encoder only has access to $\tilde{f}$, then for $\epsilon > \delta$, $R(\epsilon)$ can be upper-bounded as $R(\epsilon) \le H_{\tilde{G}_{\epsilon - \delta}}(X)$, where $\tilde{G}_{\epsilon - \delta}$ is the $(\epsilon - \delta)$-characteristic hypergraph constructed with respect to the random variable $X$ and the function $\tilde{f}$. Moreover, this upper bound is achievable.

Proof.

Since $\tilde{f}$ is a $\delta$-approximation to $f$, if a set of points has a smallest enclosing circle of radius $\epsilon - \delta$ with respect to $\tilde{f}$, then the same set of points must have a smallest enclosing circle of radius less than or equal to $\epsilon$ with respect to $f$. Hence, constructing a hypergraph with fidelity constraint $\epsilon - \delta$ and the function $\tilde{f}$ ensures that the maximal distortion with respect to $f$ is less than or equal to $\epsilon$. ∎
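The triangle inequality makes this explicit: if $c$ is the center of a circle of radius $\epsilon - \delta$ enclosing the points $\tilde{f}(x)$, then for every such $x$,

$$\|f(x) - c\| \;\le\; \|f(x) - \tilde{f}(x)\| + \|\tilde{f}(x) - c\| \;\le\; \delta + (\epsilon - \delta) \;=\; \epsilon.$$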

Although Ex. 2 shows that for $\epsilon > 0$ there can be overlapping clustering even when $X$ and $Y$ are independent random variables, in Sec. 4.4 we will show that in cases where there is overlapping clustering, we can still use a randomized form of quantization followed by polar coding to attain the optimal rate.

4.3 Quantization and universal source coding for lossless coding for computing

When $X$ is independent of $Y$, we have $p(x, y) > 0$ for all $(x, y) \in \mathcal{S}_X \times \mathcal{S}_Y$, where $\mathcal{S}_X$ and $\mathcal{S}_Y$ are the support sets of $X$ and $Y$ respectively. Hence it follows from (3) that $H_{G_0}(X|Y) = H(q(X))$, where $q(x)$ is the quantized value of $x$ corresponding to the unique maximal hyperedge that $x$ belongs to. Moreover, note that the function $q$ depends only on the function $f$ and not on the probability mass function of $X$, which implies that, for a fixed function $f$ and any $(X, Y)$ such that $X$ is independent of $Y$, there is a universal source coding scheme that attains the functional $\epsilon$-entropy in (3), which is the minimum number of bits the encoder needs to send when $\epsilon = 0$. We illustrate this case using an example.

Example 3.

Let $f$ be as illustrated in Fig. 4(a), taking the minimum value in case of equality, and take $\epsilon = 0$. Let $X$ and $Y$ be independent random variables, let $p_X$ be the probability mass function of $X$, and let $p_Y$ be the probability mass function of $Y$. Since $X$ and $Y$ are independent random variables, Thm. 4 implies the optimal rate can be obtained by using quantization followed by universal source coding. The quantization scheme for the function $f$ forms two clusters in the characteristic hypergraph: any $x \in \{1, 2\}$ maps to one cluster, while $x \in \{3, 4\}$ maps to a different cluster. Each row of Fig. 4(b) corresponds to a probability mass function on $\mathcal{X}$ and shows the corresponding entropy $H(X)$, the optimal rate of functional compression for the function $f$, namely $H(q(X))$, where $q$ is the quantization function mapping $\{1, 2\}$ to one value and $\{3, 4\}$ to another, and the rate observed by using the LZW algorithm [14] for a fixed blocklength.

Figure 4: (a) Function $f(x, y)$, with rows indexed by $x \in \{1, 2, 3, 4\}$ and columns by $y \in \{1, 2\}$:

         y=1  y=2
   x=1    2    2
   x=2    2    2
   x=3    1    2
   x=4    1    2

(b) Entropy $H(X)$, the functional-compression rate $H(q(X))$, and the observed LZW rate for different pmfs $p_X$:

   H(X)   H(q(X))   LZW rate
   1.64    0.92      1.06
   1.65    0.67      0.80
   1.95    0.99      1.14
   1.88    0.92      1.06
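A self-contained sketch of the quantize-then-LZW pipeline; the pmf weights below are illustrative assumptions (the pmfs used for Fig. 4(b) are not reproduced here), and the fixed-length index packing is a simplification of [14]:

```python
import math, random

def lzw_codes(seq):
    """Plain LZW over a non-empty sequence of hashable symbols; returns
    the emitted dictionary indices and the final dictionary size."""
    table = {(s,): i for i, s in enumerate(sorted(set(seq)))}
    out, buf = [], ()
    for s in seq:
        if buf + (s,) in table:
            buf += (s,)
        else:
            out.append(table[buf])
            table[buf + (s,)] = len(table)
            buf = (s,)
    out.append(table[buf])
    return out, len(table)

def lzw_rate(seq):
    codes, size = lzw_codes(seq)
    return len(codes) * math.ceil(math.log2(size)) / len(seq)  # bits/symbol

# quantizer from the two clusters of Fig. 4(a): {1, 2} and {3, 4}
q = {1: 'a', 2: 'a', 3: 'b', 4: 'b'}
xs = random.choices([1, 2, 3, 4], weights=[0.4, 0.3, 0.2, 0.1], k=100_000)
print(lzw_rate([q[x] for x in xs]))   # approaches H(q(X)) for long blocks
```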

4.4 Randomized quantization and polar coding for computing

In this subsection, we consider the case when there is no side information and $\epsilon > 0$; this extends easily to the case when $Y$ is independent of $X$, and the main idea of the coding remains the same. Note that even in the absence of side information we might not have non-overlapping clustering in the hypergraph, i.e., a vertex might belong to more than one maximal hyperedge; hence, quantization followed by universal source coding might not be optimal. This can be observed from Ex. 2 with slight modification as well. We show that even when we do not have unique clustering in the hypergraph, we can have practical coding schemes using randomized quantization and polar coding. To that end, we provide the following two-step algorithm for attaining the optimal rate asymptotically for any $\epsilon \ge 0$.


  • Randomized quantization: We refer to the process of finding an auxiliary random variable $W$ as randomized quantization since, unlike general rate-distortion problems, the support set of $W$ is finite and can be determined directly from the corresponding characteristic hypergraph. Hence, every vertex of the hypergraph quantizes to the hyperedges associated with it in a randomized manner. Note that this process is different from random binning in the sense that random binning involves assigning bins to $n$-length sequences for a coding scheme with blocklength $n$, whereas in randomized quantization we assign probabilities to single elements of $\mathcal{X}$. Once we have formed the hypergraph, we need to optimize over all conditional probabilities $p(w|x)$ such that $x$ lies in the hyperedge $w$. This is a convex optimization problem over finitely many variables and can be solved easily (cf. the alternating-minimization sketch in Sec. 3). Once we find a suitable $p(w|x)$, we know from the proof of Thm. 3 that we can find a function $g$ such that $\hat{Z} = g(W)$, where $g(w)$ is the center of the smallest enclosing circle of the set of points $\{f(x) : x \in w\}$; a short sketch of this step follows the list. We will assume $\mathcal{X} = \{0, 1\}$ for simplicity, which can be generalized to arbitrary finite-sized $\mathcal{X}$ using ideas from [12].

  • Polar coding: Once we have found the $p(w|x)$ corresponding to the optimal rate, the next step is to use polar codes to achieve a rate of $I(W; X)$. Define a distortion function $d(x, w) = \mathbb{1}\{x \notin w\}$. Then for the chosen $p(w|x)$ we have $\mathbb{E}[d(X, W)] = 0$. From Thm. 1 there exist a transmitted set $\mathcal{I}$ and a frozen set $\mathcal{F}$ such that $|\mathcal{I}|/n \to I(W; X)$ and $Z(U_i \mid U^{i-1}) \ge 1 - 2^{-n^\beta}$ for all $i \in \mathcal{F}$,

    for $\beta < 1/2$ and sufficiently large $n$. The coding scheme follows the polar coding scheme for lossy compression [11] and is described next.

    Codebook generation: Let $\Lambda = \{\lambda_i\}_{i \in \mathcal{F}}$ be a family of functions determining the frozen symbols from past symbols, and let $\Lambda$ be shared between the encoder and the decoder. Later we will show that such a set of functions exists that gives us the desired rate and distortion.
    Encoder: For $i \in \mathcal{F}$, the encoder determines $u_i$ as $u_i = \lambda_i(u^{i-1})$, and for $i \in \mathcal{I}$, the encoder determines $u_i$ by successive cancellation using the test channel induced by $p(w|x)$. The encoder sends $u_{\mathcal{I}}$ to the decoder. Hence the rate of coding is $|\mathcal{I}|/n$.
    Decoder: The decoder, upon receiving $u_{\mathcal{I}}$, determines $u_i$ as $u_i = \lambda_i(u^{i-1})$ for $i \in \mathcal{F}$ and outputs $W^n = U^n G_n$.
    Analysis: We want $\mathbb{E}[d(X^n, W^n)] \to 0$ as $n \to \infty$. For a fixed $\Lambda$, the average distortion is given by $D_\Lambda = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n} d(X_i, W_i)\right]$, where the expectation is over the source and the encoder's randomness. From [11, Thm. 4], it directly follows that there exists a set of functions $\Lambda$ such that $D_\Lambda \le 2^{-n^{\beta}}$ for some $\beta < 1/2$. Thus, we have $\mathbb{E}[d(X^n, W^n)] \to 0$ as $n \to \infty$.
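A minimal sketch of the two ingredients the scheme needs besides the polar code itself: sampling $W$ from the optimized $p(w|x)$, and the decoder map $g$ from the proof of Thm. 3. The function names and the no-side-information, planar setting are assumptions carried over from the earlier sketches:

```python
import random

def sample_hyperedge(x, pwx, rng=random):
    """Randomized quantization: draw a hyperedge W containing x
    according to the optimized conditional p(w|x)."""
    edges, probs = zip(*pwx[x].items())
    return rng.choices(edges, weights=probs, k=1)[0]

def decoder_map(max_edges, f):
    """g(w): center of the smallest enclosing circle of the function
    values on hyperedge w (f maps x to a point in the plane), so that
    ||g(w) - f(x)|| <= eps for every x in w."""
    return {w: smallest_enclosing_circle([f(x) for x in w])[0]
            for w in max_edges}
```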

5 Properties of $H_{G_\epsilon}(X|Y)$

Figure 5: Hypergraphs $G_\epsilon$ for different values of $\epsilon$.

Clearly, $H_{G_\epsilon}(X|Y)$ is a non-increasing function of $\epsilon$. In this section, we show that $H_{G_\epsilon}(X|Y)$ may be a discontinuous function of $\epsilon$; hence, from an operational point of view, one must design codes with $\epsilon$ close to zero or just to the right of a point of discontinuity for efficient compression. We cannot use time-sharing to remove the discontinuity in $H_{G_\epsilon}(X|Y)$ because we have considered maximal distortion. Further, the discontinuity is not obvious from Thm. 2; it is only from the equivalent definition of the rate as the hypergraph entropy in (3) that the discontinuity and the points of discontinuity can be observed. We illustrate this property using the following example.

Example 4.

Consider a function $f$ of $X$ whose resulting hypergraphs $G_\epsilon$ for different values of $\epsilon$ are illustrated in Fig. 5. $G_\epsilon$ depends on $\epsilon$; hence, as $\epsilon$ increases, the values of $\epsilon$ at which $G_\epsilon$ changes are the points of discontinuity of $H_{G_\epsilon}(X)$, as illustrated in Fig. 5.
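The jumps can be seen numerically with the earlier sketches; the scalar values $f(1) = 0$, $f(2) = 2$, $f(3) = 4$ (embedded in the plane) and the uniform pmf are illustrative assumptions, not the values used in Fig. 5:

```python
X, Y = [1, 2, 3], ['y0']
p = {(x, 'y0'): 1 / 3 for x in X}
px = {x: 1 / 3 for x in X}
f = lambda x, y: {1: (0.0, 0.0), 2: (2.0, 0.0), 3: (4.0, 0.0)}[x]
for eps in [0.5, 1.0, 1.5, 2.0]:
    edges = epsilon_hyperedges(X, Y, p, f, eps)
    print(eps, sorted(tuple(sorted(w)) for w in edges),
          round(functional_eps_entropy(px, edges), 3))
# eps < 1      : singletons only     -> H = log2(3) ~ 1.585
# 1 <= eps < 2 : edges {1,2}, {2,3}  -> H ~ 0.667
# eps >= 2     : single edge {1,2,3} -> H = 0
```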

6 Conclusion

This paper considers the problem of coding for computing with a fidelity constraint. The main insight regarding the solution of the problem is obtained by characterizing the rate as the conditional entropy of a hypergraph, which we call functional $\epsilon$-entropy. It is shown that the rate-distortion function for the problem may be discontinuous with respect to the fidelity constraint $\epsilon$. We also develop practical coding schemes for the problem and provide achievable bounds when the exact function is unknown to the encoder but an approximate function, or a class to which the function belongs, is known. The rate provided in this paper for a maximal distortion $\epsilon$ can be seen as an upper bound on the rate for the rate-distortion problem with expected distortion $\epsilon$. For future work, we want to provide stronger practically achievable bounds for the problem of coding for computing with expected distortion, since practical codes for this problem are still unknown.

Acknowledgement

We appreciate valuable discussions with Souktik Roy, Harshit Yadav, Aditya Deshmukh, Akshayaa Magesh, and Ishita Jain.

References

  • [1] A. Orlitsky and J. R. Roche, “Coding for computing,” IEEE Trans. Inf. Theory, vol. 47, no. 3, pp. 903–917, Mar. 2001.
  • [2] V. Doshi, D. Shah, M. Médard, and M. Effros, “Functional compression through graph coloring,” IEEE Trans. Inf. Theory, vol. 56, no. 8, pp. 3901–3917, Aug. 2010.
  • [3] S. Feizi and M. Médard, “On network functional compression,” IEEE Trans. Inf. Theory, vol. 60, no. 9, pp. 5387–5401, Sep. 2014.
  • [4] E. C. Posner and E. R. Rodemich, “Epsilon entropy and data compression,” Ann. Math. Stat., vol. 42, no. 6, pp. 2079–2125, Dec. 1971.
  • [5] T. S. Han and K. Kobayashi, “A dichotomy of functions of correlated sources from the viewpoint of the achievable rate region,” IEEE Trans. Inf. Theory, vol. 33, no. 1, pp. 69–76, Jan. 1987.
  • [6] E. Arikan, “Channel polarization: A method for constructing capacity-achieving codes for symmetric binary-input memoryless channels,” IEEE Trans. Inf. Theory, vol. 55, no. 7, pp. 3051–3073, Jul. 2009.
  • [7] G. Chrystal, “On the problem to construct the minimum circle enclosing given points in the plane,” Proc. Edinburgh Math. Soc., p. 30, Jan. 1885.
  • [8] N. Megiddo, “Linear-time algorithms for linear programming in $\mathbb{R}^3$ and related problems,” SIAM J. Comput., vol. 12, no. 4, pp. 759–776, Nov. 1983.
  • [9] A. Bretto, Hypergraph Theory: An Introduction.   Springer, 2013.
  • [10] E. Arikan, “Source polarization,” in Proc. 2010 IEEE Int. Symp. Inf. Theory, Jun. 2010, pp. 899–903.
  • [11] J. Honda and H. Yamamoto, “Polar coding without alphabet extension for asymmetric models,” IEEE Trans. Inf. Theory, vol. 59, no. 12, pp. 7829–7838, Dec. 2013.
  • [12] E. Şaşoğlu, E. Telatar, and E. Arikan, “Polarization for arbitrary discrete memoryless channels.” in Proc. IEEE Inf. Theory Workshop (ITW’09), Aug. 2009, pp. 144–148.
  • [13] H. S. Witsenhausen, “On sequences of pairs of dependent random variables,” SIAM J. Appl. Math., vol. 28, no. 1, pp. 100–113, 1975.
  • [14] J. Ziv and A. Lempel, “Compression of individual sequences via variable-rate coding,” IEEE Trans. Inf. Theory, vol. IT-24, no. 5, pp. 530–536, Sep. 1978.