# A mathematical theory of cooperative communication

Cooperative communication plays a central role in theories of human cognition, language, development, and culture, and is increasingly relevant in human-algorithm and robot interaction. Existing models are algorithmic in nature and do not shed light on the statistical problem solved in cooperation or on the constraints imposed by violations of common ground. We present a mathematical theory of cooperative communication that unifies three broad classes of algorithmic models as approximations of Optimal Transport (OT). We derive a statistical interpretation for the problem approximated by existing models in terms of entropy minimizing, or likelihood maximizing, plans. We show that some models are provably robust to violations of common ground, even supporting online, approximate recovery from discovered violations, and derive conditions under which other models are provably not robust. We do so using gradient-based methods which introduce novel algorithmic-level perspectives on cooperative communication. Our mathematical approach complements and extends empirical research, providing strong theoretical tools for deriving a priori constraints on models and implications for cooperative communication in theory and practice.



## Existing models approximate Optimal Transport

Existing models of cooperative communication are approximations of OT. We demonstrate this point by expressing representatives of three broad classes of models as OT. The first class of models [Shafto2008, Shafto2014, shafto2012learning] is based on the classic Theory of Mind recursion, computing exact answers in the limit of full recursion. The second class includes models that compute only the first step of the recursion [goodman2013knowledge, Eaves2016b, Eaves2016c] and approximate the OT solution with this probability distribution. The third class includes models that compute the first step by selecting the data that maximize the probability of the hypothesis [hadfield2016cooperative, ho2016showing, ho2018effectively, fisac2017pragmatic], approximating the complete transport plan with a single data point for each hypothesis. These models have characteristic strengths and limitations, which the literature has yet to explore fully. After unifying these approaches as OT, we derive and contrast these consequences, and expose new, as yet unexplored algorithms and computational tools through which we may understand communication.

### Full recursive reasoning is optimal transport

Cooperative models that build on the classic Theory of Mind recursion include cooperative inference [YangYGWVS18, wang2018generalizing] and pedagogical reasoning [Shafto2008, Shafto2014, shafto2012learning]. In this section, we briefly review the work on cooperative inference and illustrate how Bayesian inference models fit into our unifying OT framework. The core of cooperative inference between two agents is that the teacher's selection of data depends on what the learner is likely to infer, and vice versa. Let $P_{L_0}(h_j)$ be the learner's prior for hypothesis $h_j$, $P_{T_0}(d_i)$ be the teacher's prior for selecting data $d_i$, $P_T(d_i \mid h_j)$ be the teacher's posterior for selecting $d_i$ to convey $h_j$, and $P_L(h_j \mid d_i)$ be the learner's posterior for $h_j$ given $d_i$. Cooperative inference emphasizes that the agents' optimal communication plans $P_L$ and $P_T$ should satisfy the following system of interrelated equations for any $d_i$ and $h_j$, where $P_L(d_i)$ and $P_T(h_j)$ are the normalizing constants:

$$P_L(h_j \mid d_i) = \frac{P_T(d_i \mid h_j)\, P_{L_0}(h_j)}{P_L(d_i)}, \qquad P_T(d_i \mid h_j) = \frac{P_L(h_j \mid d_i)\, P_{T_0}(d_i)}{P_T(h_j)}$$

Extending [YangYGWVS18]'s results on uniform priors, we show that optimal communication plans $P_T$ and $P_L$ of a cooperative inference problem with arbitrary priors $P_{T_0}$ and $P_{L_0}$ can be obtained through Sinkhorn scaling (all proofs are included in the Appendix). As a direct consequence, cooperative inference is a special case of the unifying OT framework. Let $M$ be the joint distribution, $P_{T_0}$ the teacher's prior, and $P_{L_0}$ the learner's prior. According to the proof of Proposition Document, after cooperative inference the teacher's posterior selection matrix $P_T$ is the limit of $(P_{T_0}, P_{L_0})$-SK scaling of $M$. On the other hand, under the unifying OT framework, the optimal teaching plan is the limit of SK scaling based on Eq eq:ot_teaching. When the teacher's expense of selecting $d_i$ to convey $h_j$ is proportional to $-\ln M_{ij}$, the two limits coincide. Symmetrically, one may check that the same holds for $P_L$.
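To make the connection concrete, here is a minimal numerical sketch of $(r,c)$-Sinkhorn scaling of a shared joint distribution under uniform priors. The matrix `M`, the marginal targets `r` and `c`, and the helper `sinkhorn` are illustrative assumptions, not code from the paper.

```python
import numpy as np

def sinkhorn(M, r, c, n_iter=500):
    """(r, c)-Sinkhorn scaling: alternately rescale the rows of M to the
    marginal r and the columns to the marginal c."""
    P = M.astype(float).copy()
    for _ in range(n_iter):
        P *= (r / P.sum(axis=1))[:, None]   # match row marginals (data)
        P *= (c / P.sum(axis=0))[None, :]   # match column marginals (hypotheses)
    return P

# Shared joint distribution (common ground): rows = data, columns = hypotheses.
M = np.array([[0.20, 0.10, 0.05],
              [0.05, 0.25, 0.10],
              [0.05, 0.05, 0.15]])
r = np.full(3, 1/3)   # teacher's (uniform) prior over data
c = np.full(3, 1/3)   # learner's (uniform) prior over hypotheses

P = sinkhorn(M, r, c)                     # limit of (r, c)-SK scaling of M
P_T = P / P.sum(axis=0, keepdims=True)    # teacher's plan P_T(d_i | h_j)
P_L = P / P.sum(axis=1, keepdims=True)    # learner's plan P_L(h_j | d_i)
```

Because each SK step only rescales rows or columns, the limit `P` is cross-product ratio equivalent to `M` while matching both priors, which is the fixed point behavior the interrelated equations describe.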

### One-step approximate inference

Direct implementation of the recursive Theory of Mind above requires repeated computation of the normalizing constant for Bayesian inference. This is computationally challenging for large-scale problems and has been argued to be algorithmically implausible as a model of human cognition. For these reasons, models including Rational Speech Act (RSA) theory [goodman2013knowledge] and Bayesian Teaching [Eaves2016b, Eaves2016c] model cooperation as a single step of the recursion. To simplify exposition, we focus on the RSA model. RSA models the communication between a speaker and a listener, formalizing the cooperation that underpins pragmatic language [goodman2013knowledge, grice1975logic, levinson2000presumptive, clark1996using]. A pragmatic speaker selects an utterance optimally to inform a naive listener about a world state, whereas a pragmatic listener interprets an utterance rationally and infers the state using one step of Bayesian inference. This represents a communicative process in which a speaker-listener pair can be viewed as a teacher-learner pair, with world states and utterances corresponding to hypotheses and data points, respectively. RSA distinguishes among three levels of inference: a naive listener, a pragmatic speaker, and a pragmatic listener. A naive listener interprets an utterance according to its literal meaning: given a joint distribution $M$, the naive listener's interpretation $P_{L_0}(h_j \mid d_i)$ is the $(i,j)$-th element of the matrix obtained by row normalization of $M$. A pragmatic speaker selects the utterance $d_i$ to convey the state $h_j$ that maximizes utility. In particular, the speaker picks $d_i$ to convey $h_j$ by soft-max optimization of expected utility,

$$P_T(d_i \mid h_j) \propto e^{\alpha U(d_i;\, h_j)},$$

where the utility $U(d_i; h_j) = \ln P_{L_0}(h_j \mid d_i) - \mathrm{cost}(d_i)$ rewards minimizing the surprisal of a naive listener inferring $h_j$ given $d_i$, subject to an utterance cost $\mathrm{cost}(d_i)$. This formulation is the same as one step of SK iteration in the OT framework (see Eq eq:sk_distance and Eq eq:teacher_cost), where $\alpha$ controls the degree of optimization, as in [goodman2013knowledge]. Next, a pragmatic listener reasons about the pragmatic speaker and infers the hypothesis using Bayes' rule, $P_L(h_j \mid d_i) \propto P_T(d_i \mid h_j)\, P_L(h_j)$. Here $P_T(d_i \mid h_j)$ represents the listener's reasoning about the speaker's data selection, and $P_L(h_j)$ is the learner's prior. This is again one step of the recursion in the OT framework. As described above, teaching and learning plans in RSA are one-step approximations of the OT plans. Although limited recursion and optimization are realistic assumptions in psychology [goodman2013knowledge], in many cases such approximations are far from optimal. For example, world states can often be organized from most abstract to least abstract, which yields an upper-triangular joint distribution matrix. A fully recursive model such as cooperative inference would output a diagonal matrix as the optimal plan [YangYGWVS18], which achieves the highest efficiency, whereas the cooperative index of the one-step approximation is much lower. Furthermore, one-step approximation plans are much more sensitive to each agent's estimate of the other agent. For instance, a pragmatic speaker's teaching plan is tailored for a naive listener; in contrast, the optimal plan obtained through full recursion is stable for any listener derived from the same common ground, as in Remark Document.
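The three RSA levels can be sketched numerically. This is a hedged illustration with an assumed upper-triangular literal-meaning matrix `M` (utterances as rows, states as columns), a uniform prior, zero utterance costs, and `alpha = 1`; none of these particular choices come from the paper.

```python
import numpy as np

# Literal meanings: rows = utterances (data), columns = world states (hypotheses).
M = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

# Naive listener: row normalization of M gives P_L0(h_j | d_i).
L0 = M / M.sum(axis=1, keepdims=True)

# Pragmatic speaker: soft-max of utility U = ln L0 - cost (zero cost here).
alpha = 1.0
with np.errstate(divide="ignore"):
    S = np.exp(alpha * np.log(L0))        # exp(alpha * U); zeros stay zero
P_T = S / S.sum(axis=0, keepdims=True)    # column-normalize -> P_T(d_i | h_j)

# Pragmatic listener: one step of Bayes with a uniform prior over states.
prior = np.full(M.shape[1], 1 / M.shape[1])
P_L = (P_T * prior) / (P_T * prior).sum(axis=1, keepdims=True)
```

Hearing the literally ambiguous first utterance, the pragmatic listener assigns probability 2/3 to the first state rather than the naive 1/2: a single recursion step already strengthens the literal meaning, while full SK iteration would sharpen the plan further.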

### Single-step argmax approximation

This idea arises in cooperative inverse reinforcement learning, where instead of selecting actions probabilistically, the maximum-probability action is always selected [hadfield2016cooperative, ho2016showing, ho2018effectively, fisac2017pragmatic]. In particular, [fisac2017pragmatic] introduces Pragmatic-Pedagogic Value Alignment, a framework grounded in empirically validated cognitive models of pedagogical teaching and pragmatic learning. Pragmatic value alignment formalizes the cooperation between a human and a robot who act collaboratively with the goal of achieving the best possible outcome according to an objective. The true objective, however, is known only to the human. The human performs pedagogical actions to teach the true objective to the robot. After observing the human's action, the robot, who is pragmatic, updates its beliefs and performs the action that maximizes expected utility. The human, observing this action, can then update their beliefs about the robot's current beliefs and choose a new pedagogic action. Identifying actions with data and objectives with hypotheses, we can see that when the human performs an action they act as a teacher, and when the robot performs an action the roles are reversed. In particular, the pedagogic human selects an action to teach the objective according to Eq eq:utility, where $U$ is the utility that captures the human's best expected outcome. As described in Section Document, this is equivalent to a single step of recursion in the OT framework. The robot interprets the human's action rationally and, starting from its prior belief distribution over the objectives, updates its beliefs about the true objective using Bayes' rule as in Eq eq:Bayes. Then, acting as a teacher, the robot chooses the action that maximizes the human's expected utility via an argmax over its own actions (which are distinct from the human's actions). Unlike in human teaching, where plans are chosen proportionally to a probability distribution, here the robot chooses a deterministic action using the argmax function.
As described above, pragmatic-pedagogic value alignment is modeled by computing a single step of OT and selecting the action that maximizes the outcome. Unlike cooperative inference, which tends to select the leading diagonal of the common ground (Proposition Document), pragmatic-pedagogic value alignment selects the maximal element in each column of the plan matrix, which is not even guaranteed to form a plan that distinguishes every hypothesis. As a consequence, a drawback of such argmax methods is that for large hypothesis spaces, multiple hypotheses may attain their argmax on the same data, which leads to low communication efficiency. Further, the analysis in Section Document shows that deterministic methods such as argmax are much less robust to perturbations.
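A small sketch of the failure mode just described, using an assumed plan matrix (not from the paper): two hypotheses attain their argmax on the same action, so the deterministic plan cannot distinguish them.

```python
import numpy as np

# Assumed plan P_T(d | h): rows = actions/data, columns = objectives/hypotheses.
P_T = np.array([[0.6, 0.5, 0.2],
                [0.3, 0.4, 0.5],
                [0.1, 0.1, 0.3]])

# Argmax selection: each objective is taught with its single most likely action.
choice = P_T.argmax(axis=0)                        # action index per objective
collisions = len(choice) - len(set(choice.tolist()))
```

Here objectives 0 and 1 both map to action 0 (`choice` is `[0, 0, 1]`), so a learner observing action 0 cannot tell those objectives apart, whereas a probabilistic plan would spread mass across actions to preserve distinguishability.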

## Analyzing models of cooperative communication

With prior models unified as instances of Optimal Transport via Sinkhorn scaling, we analyze the properties of these models. We focus on two of the most important aspects: understanding the models from a statistical perspective, and in the context of realistic assumptions about common ground.

### Full recursive reasoning is statistically and information theoretically optimal

Having demonstrated the equivalence of SK scaling and full recursive reasoning (Proposition Document), strong statistical justifications of fully Bayesian recursive reasoning follow immediately, as SK scaling is optimal in the senses of entropy minimization and likelihood maximization [csiszar1975divergence, darroch1972generalized, brown1993order]. Sinkhorn scaling solves entropy minimization with marginal constraints. Let $M$ be a joint distribution matrix over data and hypotheses. Denote the set of all possible joint distribution matrices with marginals $r$ and $c$ by $\Pi(r, c)$ (all couplings). Consider the question of finding the approximation matrix $P$ of $M$ in $\Pi(r, c)$ that minimizes its relative entropy with $M$, i.e. $\operatorname{argmin}_{P \in \Pi(r,c)} D_{KL}(P \,\|\, M)$, where

$$D_{KL}(P \,\|\, M) = \sum_{i,j} P_{ij} \ln \frac{P_{ij}}{M_{ij}}$$

It is proved in, for example, [csiszar1989geometric, franklin1989scaling] that the $(r, c)$-Sinkhorn scaling of $M$ converges to this minimizer if the limit exists. We may therefore directly interpret the result of fully Bayesian recursive reasoning as the communication plan with minimum discrimination information for pairs of interacting agents. In addition, Sinkhorn scaling also arises naturally in maximum likelihood estimation. Let $\hat{P}$ be the empirical distribution of $N$ i.i.d. samples from a true underlying distribution that belongs to a model family. Then the log likelihood of this sample set over a distribution $Q$ in the model family is given by $N \sum_{i,j} \hat{P}_{ij} \ln Q_{ij}$, where $N$ is the sample size. Comparing with Eq eq:KL, it is clear that maximizing the log likelihood (and so the likelihood) over a given family of $Q$ is equivalent to minimizing $D_{KL}(\hat{P} \,\|\, Q)$. Both [darroch1972generalized] and [csiszar1989geometric] show that when the model is in the exponential family, the maximum likelihood estimate can be obtained through SK scaling with empirical marginals.
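The entropy-minimization property can be checked numerically: among couplings with the same marginals, the SK limit of `M` attains the smallest relative entropy with `M`. This is a sketch under assumed inputs (a random positive joint matrix and uniform marginals); the helpers are illustrative, not the paper's code.

```python
import numpy as np

def sinkhorn(M, r, c, n_iter=1000):
    # Alternate row/column rescaling of M toward marginals r and c.
    P = M.astype(float).copy()
    for _ in range(n_iter):
        P *= (r / P.sum(axis=1))[:, None]
        P *= (c / P.sum(axis=0))[None, :]
    return P

def kl(P, M):
    # D_KL(P || M), summed over the support of P.
    mask = P > 0
    return float((P[mask] * np.log(P[mask] / M[mask])).sum())

rng = np.random.default_rng(0)
M = rng.random((4, 4)) + 0.1              # a positive joint matrix
M /= M.sum()
r = c = np.full(4, 0.25)                  # uniform marginal constraints

P_star = sinkhorn(M, r, c)                # SK limit: the claimed KL minimizer
# Other couplings with the same marginals, built from unrelated matrices.
others = [sinkhorn(rng.random((4, 4)) + 0.1, r, c) for _ in range(20)]
gap = min(kl(Q, M) for Q in others) - kl(P_star, M)
```

A nonnegative `gap` means no sampled coupling in $\Pi(r, c)$ beats the SK limit, consistent with the minimum-discrimination-information interpretation above.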

### Understanding greedy choice

As a preliminary step toward analyzing common ground, we explore the effect on the optimal plans when the parameter $\beta$ varies, showing that as $\beta \to \infty$ the solution converges to the leading diagonals of $M$, that as $\beta \to 0$ the solution goes to a uniform matrix, and more generally analyzing the variations in the distribution over all possible optimal plans caused by the choice of $\beta$. To simplify notation, we assume uniform priors on data and hypotheses for our discussions. For a given joint distribution $M$, consider the OT problem for the teacher (similarly, for the learner). Recall that, as in Eq eq:ot_teaching, the optimal teaching plan is the limit of SK iteration. Since the limit of SK scaling is the same for cross-ratio equivalent matrices (Section Document), to study the dynamics of regularized OT solutions we may focus on $M^{(\beta)}$ (the matrix obtained from $M$ by raising each element to the power $\beta$) and its SK limit $\overline{M^{(\beta)}}$. One extreme is when $\beta$ approaches zero. If $\beta \to 0$, then $M_{ij}^{\beta} \to 1$ for any nonzero element $M_{ij}$ of $M$. Thus $M^{(\beta)}$ converges to a matrix filled with ones on the nonzero entries of $M$, and $\overline{M^{(\beta)}}$ converges to a uniform matrix if $M$ has no vanishing entries. It is shown in [wang2018generalizing] that the cooperative index (Section Document) attains its lower bound on uniform matrices. Hence communicative efficiency is lowest as $\beta$ goes to zero. The other extreme is when $\beta$ approaches infinity. In this case, we show that $\overline{M^{(\beta)}}$ concentrates around the leading diagonals of $M$ as $\beta \to \infty$. This indicates that as $\beta \to \infty$, the number of positive diagonals of $\overline{M^{(\beta)}}$ decreases. Therefore the cooperative index increases with $\beta$, since the cooperative index of a matrix is bounded below by the reciprocal of its number of positive diagonals [wang2018generalizing]. In particular, if $M$ has only one leading diagonal, $\overline{M^{(\beta)}}$ converges to a doubly stochastic matrix with only one positive diagonal, i.e. a permutation matrix. In this case the cooperative index equals one, which corresponds to the highest communication efficiency. As pointed out in Section Document, the OT planning, which picks the best diagonals, is notably different from the argmax selection.
In general, the magnitude of $\beta$ causes variations in the distribution over all possible optimal plans. Let $A$ be either the joint distribution $M$ or an agent's planning matrix derived from $M$. Notice that the product of the elements on a diagonal of $A$ is proportional to the probability of sampling that diagonal from all of $A$'s diagonals. Then the cross-product ratio between two diagonals $\sigma_1$ and $\sigma_2$ is precisely the ratio between the probabilities of sampling $\sigma_1$ and $\sigma_2$. Proposition Document shows that the optimal plan of an agent is concentrated on diagonals of $M$. Thus, up to normalization, each $M^{(\beta)}$ represents a distribution over all possible optimal plans (diagonals), and $M^{(1)} = M$ constitutes the true distribution over optimal plans derived from $M$, since its SK limit is cross-product ratio equivalent to $M$. $M^{(\beta)}$ with $\beta \neq 1$, in contrast, represents a distribution that either exaggerates or suppresses the cross-product ratios of $M$, depending on whether $\beta$ is greater or less than one.
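The two extremes can be illustrated numerically under uniform priors, where the SK limit is doubly stochastic. Writing $\beta$ for the exponent applied elementwise to `M`, the matrix `M` and the particular exponents standing in for $\beta \to \infty$ and $\beta \to 0$ are assumed for illustration only.

```python
import numpy as np

def sk(M, n_iter=3000):
    # Alternating row/column normalization -> doubly stochastic limit
    # (uniform priors on both sides).
    P = M.astype(float).copy()
    for _ in range(n_iter):
        P /= P.sum(axis=1, keepdims=True)
        P /= P.sum(axis=0, keepdims=True)
    return P

# Joint distribution with a single leading diagonal (0.5 * 0.4 * 0.3).
M = np.array([[0.5, 0.2, 0.1],
              [0.1, 0.4, 0.2],
              [0.1, 0.1, 0.3]])

P_hot  = sk(M ** 30.0)    # large beta: concentrates on the leading diagonal
P_cold = sk(M ** 0.01)    # small beta: approaches the uniform matrix
```

`P_hot` is numerically a permutation matrix, the most efficient plan, while every entry of `P_cold` is close to 1/3, the least efficient plan.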